Confidentiality

The success of the social science research enterprise relies on the willingness of research participants to take part in surveys, so it is critically important to protect the identities of research subjects. Disclosure risk is the possibility that a data record from a study could be linked to a specific person or organization, thereby revealing information that otherwise would not be known, or would not be known with as much certainty. Concerns about disclosure risk have grown as more datasets have become available online and as it has become easier to link research datasets with publicly available external databases.

ICPSR's Approach to Confidentiality

ICPSR employs stringent procedures to protect the confidentiality of individuals and organizations whose personal information may be part of archived data. These include:

  • Reviewing all datasets in detail to assess disclosure risk
  • Modifying data if necessary to reduce risk and protect respondents' confidentiality
  • Limiting access to datasets for which modifying the data would substantially limit utility or for which the risk of disclosure would remain unacceptably high
  • Training staff and consulting with data producers on methods of disclosure risk assessment and mitigation

Disclosure Risk

With the exception of deposits placed in openICPSR, our public data-sharing service, ICPSR reviews all datasets to assess disclosure risk. It trains data processors to apply specified procedures to protect respondent confidentiality in the data ICPSR processes, archives, and distributes.

Removing Identifiers

Two kinds of variables often found in social science datasets present problems that could endanger research subjects' confidentiality: direct identifiers and indirect identifiers.

Direct identifiers. These are variables that point explicitly to particular individuals or units. Examples include:

  • Names
  • Addresses, including ZIP and other postal codes
  • Telephone numbers, including area codes
  • Social Security numbers
  • Other linkable numbers such as driver's license numbers, certification numbers, etc.

All variables directly identifying research subjects must be removed or masked prior to deposit.
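
As an illustration of this step, the following is a minimal sketch of dropping direct identifiers from a set of records before deposit. The field names and the sample record are hypothetical and are not drawn from any ICPSR tool; a real study would supply its own list of identifying variables.

```python
# Hypothetical field names; each study would list its own direct identifiers.
DIRECT_IDENTIFIERS = {"name", "address", "zip_code", "phone", "ssn", "drivers_license"}


def strip_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with every direct-identifier field removed."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


# Made-up respondent record used only to demonstrate the function.
respondents = [
    {"name": "A. Example", "ssn": "000-00-0000", "zip_code": "00000",
     "age": 34, "employed": True},
]
cleaned = strip_direct_identifiers(respondents)
print(cleaned)
```

The function builds new dictionaries rather than editing the originals, so the identified source file is left untouched and can be stored separately under stricter controls.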

Indirect identifiers. These are variables that do not identify a respondent on their own but may do so when used together or in conjunction with other information. Examples include:

  • Detailed geographic information (e.g., state, county, province, or census tract of residence)
  • Organizations to which the respondent belongs
  • Educational institutions (from which the respondent graduated and year of graduation)
  • Detailed occupational titles
  • Place where respondent grew up
  • Exact dates of events (birth, death, marriage, divorce)
  • Detailed income
  • Offices or posts held by respondent
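
One way to see why combinations of such variables matter is to count how many records share each combination of values. The sketch below is an illustrative check and not ICPSR's documented review procedure; the variable names, the sample records, and the threshold of 3 are all assumptions chosen for the example.

```python
from collections import Counter


def risky_records(records, indirect_identifiers, threshold=3):
    """Return records whose combination of indirect identifiers is shared
    by fewer than `threshold` records (threshold is an illustrative choice)."""
    def key(record):
        return tuple(record[name] for name in indirect_identifiers)

    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] < threshold]


# Made-up sample: three respondents share a profile, one is unique.
sample = [
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "B", "occupation": "coroner", "birth_year": 1962},
]
flagged = risky_records(sample, ["county", "occupation", "birth_year"])
print(flagged)
```

Here only the last record is flagged: no single value in it names a person, but the combination is unique in the file and could be matched against outside information.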

ICPSR may recode data to reduce disclosure risk. Recoding can include converting dates to time intervals, exact dates of birth to age groups, detailed geographic codes to broader levels of geography, and income to income ranges or categories.
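
The recodes listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not ICPSR's actual processing code: the function names, the ten-year age bands, and the income cut points are hypothetical, and the sample tract code is made up. The geographic example relies on the structure of U.S. census tract GEOIDs, whose first two digits give the state and first five give the county.

```python
from datetime import date


def days_between(start, end):
    """Replace two exact dates with the interval between them, in days."""
    return (end - start).days


def age_group(birth_date, reference_date, width=10):
    """Replace an exact date of birth with an age band such as '30-39'."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month, birth_date.day)
    age = reference_date.year - birth_date.year - (0 if had_birthday else 1)
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def income_range(income, cut_points=(25000, 50000, 100000)):
    """Replace a detailed income with a range; cut points are illustrative."""
    lower = 0
    for upper in cut_points:
        if income < upper:
            return f"{lower}-{upper - 1}"
        lower = upper
    return f"{lower}+"


# Prefix lengths within an 11-digit census tract GEOID.
GEOID_PREFIX_LENGTH = {"state": 2, "county": 5}


def coarsen_tract_geoid(geoid, level="county"):
    """Replace a tract-level code with its county- or state-level prefix."""
    return geoid[:GEOID_PREFIX_LENGTH[level]]


interval = days_between(date(2019, 1, 1), date(2020, 1, 1))
band = age_group(date(1985, 7, 20), date(2024, 3, 1))
bracket = income_range(62000)
county = coarsen_tract_geoid("26161000100")  # made-up tract code
print(interval, band, bracket, county)
```

In this example a respondent born on 1985-07-20 is reported only as belonging to the 30-39 age band, and an income of 62000 is reported only as falling in the 50000-99999 range. Wider bands lower disclosure risk further but also reduce analytic detail, which is the trade-off the recoding decision has to weigh.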

ICPSR staff work closely with data depositors to address disclosure risks. ICPSR strongly recommends that data producers remove all direct respondent identifiers before they deposit their data in the archive. For more information, see Phase 5: Preparing Data for Sharing in the ICPSR Guide to Social Science Data Preparation and Archiving, 5th Edition (PDF 2MB).

If removing all direct identifiers will unacceptably reduce the analytic utility of the data, depositors should contact ICPSR about releasing a restricted-use dataset or both public- and restricted-use datasets. Because restricted-use datasets retain confidential, indirectly identifying information, investigators must access them under controlled conditions.

Confidentiality, Informed Consent, and Data Sharing

Protection of respondent confidentiality is a core tenet of responsible research practice, and it begins with obtaining respondents' informed consent.

Informed consent is a process of communication between a subject and researcher to enable the person to decide voluntarily whether to participate in a study. Human subjects involved in a project must participate willingly and be adequately informed about the research. The informed consent must include a statement describing how the confidentiality of subject records will be maintained. However, it also is important that informed consent be written in a way that does not unduly limit an investigator's discretion to share data with the research community.

Recommended informed consent language for data sharing

Confidentiality and IRBs

Institutional Review Boards (IRBs) take different approaches to secondary analysis of research datasets such as those distributed on ICPSR's website. Some institutions require IRB review of proposals to analyze secondary data. Others provide IRB exemption for projects involving secondary data if the data were acquired from preapproved sources such as ICPSR. Still others are establishing their own policies to address these issues.

More about Institutional Review Boards

Levels of Access

Sometimes data cannot be modified to protect confidentiality without significantly compromising the research potential of the data. In these cases, access to the data is restricted and confidentiality safeguards are imposed.

ICPSR has established several mechanisms by which restricted-use data can be distributed:

  • Secure online analysis (publicly available): This option provides immediate access to restricted-use data behind an analytic interface that has programmable disclosure protection.
  • Secure online analysis (password protected): This option provides analysis of restricted-use data behind an interface with programmable disclosure protection for selected users. With this option, users may have to submit an application to access the data, or they may be part of a defined group, such as a research group.
  • Restricted Use Data Agreement: With this option, users submit a request to access the data, and after approval, download the data using a single-use password or receive the data on CD-ROM.
  • Virtual Data Enclave (VDE): The VDE is a secure online environment in which approved users analyze restricted-use data using one of several software options available within the VDE, such as SAS, Stata, and SUDAAN.
  • Physical Data Enclave: For highly restricted data, ICPSR has a physical enclave, which requires that approved users be on site at ICPSR to use the data. Data use in the physical data enclave is monitored by ICPSR staff.
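
The source does not describe how the programmable disclosure protection in the secure online analysis options works. As a general illustration only, one widely used safeguard in tabulation tools is to suppress any table cell based on too few respondents. The sketch below assumes a hypothetical minimum cell size of 5 and made-up variable names; it is not a description of ICPSR's interface.

```python
from collections import Counter


def protected_crosstab(records, row_var, col_var, min_cell=5):
    """Cross-tabulate two variables, replacing any count below `min_cell`
    with None so that small groups are not revealed."""
    counts = Counter((record[row_var], record[col_var]) for record in records)
    return {cell: (n if n >= min_cell else None) for cell, n in counts.items()}


# Made-up data: six urban respondents and two rural respondents.
survey = (
    [{"area": "urban", "insured": "yes"}] * 6
    + [{"area": "rural", "insured": "yes"}] * 2
)
table = protected_crosstab(survey, "area", "insured")
print(table)
```

The rural cell is returned as None because it rests on only two respondents. A production system would also need to guard against suppressed values being recovered from row or column totals, which this sketch does not attempt.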

Consulting

ICPSR offers the following services related to disclosure risk assessment and mitigation:

  • Informed consent review
  • Consultation regarding issues of disclosure risk (no charge)
  • Basic disclosure risk assessment
  • Full disclosure analysis: risk assessment and options for mitigation and data distribution
  • Training

For further information on ICPSR services, contact us at netmail@icpsr.umich.edu or 734-647-2200.
