Data Confidentiality

The success of social science research relies on participants’ willingness to engage in the research process. People often participate in research projects under an assumption that their responses will be kept confidential and will not be linked back to them. Thus, it is critically important to protect the identities of research participants. One way to protect participants’ identities is by assessing each study’s disclosure risk, which is the degree of risk that a data record from a study could be linked to a specific person or organization, thereby revealing information that otherwise would not be known, or would not be known with as much certainty. Concerns about disclosure risk have grown as more datasets have become available online and as linking data has become increasingly easy. ICPSR works to ensure that the appropriate level of confidentiality remains intact for all of its data holdings.

ICPSR's Approach to Confidentiality

ICPSR accepts data with identifying information under conditions consistent with the informed consent of the study participants and the relevant Institutional Review Board (IRB) approval. ICPSR staff work with data depositors to address disclosure risks. Once data are deposited with ICPSR, staff employ stringent procedures to protect the confidentiality of individuals and organizations whose personal information may be part of the archived data collection. Steps ICPSR staff take to maintain data confidentiality include:

  • Completing a detailed review of all datasets to assess disclosure risk
  • If necessary, modifying data to reduce disclosure risk
  • Limiting access to datasets when modifying the data would substantially limit their utility or when the risk of disclosure remains high
  • Training staff and consulting with data producers in methods of disclosure risk assessment and mitigation

Disclosure Risk

With the exception of deposits placed in openICPSR, our public data-sharing service, ICPSR reviews all datasets to assess disclosure risk. ICPSR trains data curators to apply specified procedures to protect respondent confidentiality in all of the data ICPSR curates, archives, and distributes. For example, ICPSR checks each study for identifiers present in the data.

Identifiers

Two kinds of variables often found in social science data present problems that could endanger research subjects' confidentiality: direct identifiers and indirect identifiers.

Direct identifiers. These are variables that point explicitly to particular individuals or units. Examples include:

  • Names
  • Addresses, including ZIP and other postal codes
  • Telephone numbers, including area codes
  • Social Security numbers
  • Other linkable numbers such as driver's license numbers, certification numbers, etc.

Indirect identifiers. These are variables that can be problematic because they may be used together, or in conjunction with other information, to identify individual respondents. Examples include:

  • Detailed geographic information (e.g., state, county, province, or census tract of residence)
  • Organizations to which the respondent belongs
  • Educational institutions from which the respondent graduated and year of graduation
  • Detailed occupational titles
  • Place where respondent grew up
  • Exact dates of events (e.g., birth, death, marriage, divorce)
  • Detailed income
  • Offices or posts held by respondent
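
One way to see why indirect identifiers matter is to count how many records share each combination of values: a combination held by very few records is easier to link to a specific person. The sketch below is illustrative only and does not describe ICPSR's review procedure. The function name `small_cells`, the threshold `k`, and the sample records are all hypothetical.

```python
from collections import Counter


def small_cells(records, indirect_identifiers, k=3):
    """Return each combination of indirect-identifier values that is
    shared by fewer than k records, with the number of records in it."""
    counts = Counter(
        tuple(record[name] for name in indirect_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical survey records (not real data).
records = [
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "B", "occupation": "mayor", "birth_year": 1955},
]

# Combinations held by fewer than three records are flagged for review.
risky = small_cells(records, ["county", "occupation", "birth_year"], k=3)
print(risky)
```

Here the single record for the mayor in county B is flagged, while the three teachers in county A are not. No variable in that record is a direct identifier, yet the combination is unique.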

ICPSR may recode data to reduce disclosure risk. Recoding can include converting dates to time intervals, exact dates of birth to age groups, detailed geographic codes to broader levels of geography, and detailed income to income ranges or categories.
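
The recodes described above can be sketched in a few lines of Python. This is a minimal illustration, not ICPSR's actual procedure: the function names, the ten-year age bands, and the income breakpoints are assumptions chosen for the example. The geographic recode assumes an 11-digit U.S. census tract GEOID, whose first five digits are the state and county FIPS codes.

```python
from datetime import date


def days_between(start, end):
    """Replace two exact dates with the interval between them, in days."""
    return (end - start).days


def age_group(birth_date, reference_date, width=10):
    """Replace an exact date of birth with an age band such as '30-39'."""
    # Subtract one year if the birthday has not yet occurred
    # in the reference year.
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    age = reference_date.year - birth_date.year - (0 if had_birthday else 1)
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def tract_to_county(tract_geoid):
    """Coarsen an 11-digit census tract GEOID to its 5-digit county code."""
    return tract_geoid[:5]


def income_range(income, breaks=(25000, 50000, 100000)):
    """Replace a detailed income with a range; breaks are hypothetical."""
    lower = 0
    for upper in breaks:
        if income < upper:
            return f"{lower}-{upper - 1}"
        lower = upper
    return f"{lower}+"


# Hypothetical respondent values.
print(days_between(date(2020, 1, 1), date(2020, 3, 1)))
print(age_group(date(1984, 6, 15), date(2024, 6, 14)))
print(tract_to_county("26161400100"))
print(income_range(30000))
```

Each function discards detail on purpose, so the choice of band width or breakpoints is a trade-off between disclosure risk and analytic utility.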

If the modifications needed to create a public-use dataset would seriously reduce the analytic utility of the data, ICPSR may release a restricted-use dataset, or both public- and restricted-use datasets. Restricted-use datasets retain confidential, identifying information and are accessible under controlled conditions.

Confidentiality, Informed Consent, and Data Sharing

Protection of respondent confidentiality is a core tenet of responsible research practice that begins with obtaining informed consent. Informed consent is a process of communication between a participant and a researcher that enables the participant to decide voluntarily whether to participate in a study. Human subjects involved in a project must participate willingly and be adequately informed about the research. The informed consent must include a statement describing how the confidentiality of subject records will be maintained. However, it is also important that informed consent be written in a way that does not unduly limit an investigator's discretion to share data with the research community.

Recommended informed consent language for data sharing

Confidentiality and IRBs

Institutional Review Boards (IRBs) take different approaches to secondary analysis of research datasets such as those distributed on ICPSR's website. Some institutions require IRB review of proposals to analyze secondary data. Others provide IRB exemption for projects involving secondary data if the data were acquired from preapproved sources such as ICPSR. Still others are establishing their own policies to address these issues.

More about Institutional Review Boards

Levels of Restricted Data Access

Depending on the outcome of the disclosure risk review, ICPSR may suggest modifying the data and/or distributing the data at a higher level of restriction. Sometimes data cannot be modified to protect confidentiality without significantly compromising the research potential of the data. In these cases, access to the data is restricted in order to impose further confidentiality safeguards.

ICPSR has established several mechanisms through which restricted data can be distributed:

  • Secure Download: With this option, users submit an application to access the data, and after approval, download the data using a single-use password. At the end of the approved access period, users must destroy the data.
  • Virtual Data Enclave (VDE): The VDE is a secure, online environment in which approved users analyze restricted data via a remote desktop using several available software options, including SAS, Stata, and SPSS. Researchers do not receive a copy of the data, but rather analyze the data stored on ICPSR’s servers. Final analysis output is vetted and, if approved, released to the researcher.
  • Physical Data Enclave: For highly restricted data, ICPSR has a physical enclave which requires that approved users be on site at ICPSR to use the data. Data use in the physical data enclave is monitored by ICPSR staff. Final analysis output is vetted and, if approved, released to the researcher.
  • Secure Online Analysis: This option lets selected users analyze restricted-use data behind an interface with programmable disclosure protection. With this option, users submit an application to access the data.

Consulting

In addition to the steps ICPSR takes to ensure the confidentiality of data that have already been deposited, we offer the following services related to disclosure risk assessment and mitigation to researchers who have not yet deposited their data or who are in the earlier stages of the data collection process:

  • Informed consent review
  • Consultation regarding issues of disclosure risk (no charge)
  • Basic disclosure risk assessment
  • Full disclosure analysis: risk assessment and options for mitigation and data distribution
  • Training

For further information on ICPSR services, contact us at help@icpsr.umich.edu or 734-647-2200.