Confidentiality
The success of the social science research enterprise relies on the willingness of research participants to take part in surveys. Thus, it is critically important to protect the identities of research subjects. Disclosure risk is a term that is often used for the possibility that a data record from a study could be linked to a specific person, thereby revealing information about that person that otherwise would not be known. Concerns about disclosure risk have grown as more datasets have become available online, and it has become easier to link research datasets with publicly available external databases.
ICPSR's Approach to Confidentiality
ICPSR employs stringent procedures to protect the confidentiality of individuals whose personal information may be part of archived data. These include:
- Reviewing all datasets rigorously to assess disclosure risk
- Modifying data if necessary to protect respondents' confidentiality
- Limiting access to datasets where the risk of disclosure remains high
- Training staff and consulting with data producers on methods of disclosure risk management
Disclosure Risk
ICPSR subjects all datasets to a review to assess disclosure risk. It has established a certification program in disclosure risk management that involves training data processors to apply rigorous procedures to protect the confidentiality of data that ICPSR processes, archives, and distributes.
Removing Identifiers
Two kinds of variables often found in social science datasets can endanger research subjects' confidentiality. The first kind, direct identifiers, point explicitly to particular individuals or units. Examples of direct identifiers include:
- Names
- Addresses, including ZIP codes
- Telephone numbers, including area codes
- Social Security numbers
- Other linkable numbers such as driver license numbers, certification numbers, etc.
All variables directly identifying research subjects must be removed or masked prior to deposit.
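As a minimal sketch of the removal step, assume the records are held as Python dictionaries and that the field names below mark the direct identifiers. These names are hypothetical, chosen for illustration, and are not an ICPSR standard.

```python
# Hypothetical field names; a real study would list its own direct identifiers.
DIRECT_IDENTIFIERS = {"name", "address", "zip_code", "phone", "ssn", "driver_license"}


def remove_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with every direct-identifier field dropped."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


respondents = [
    {"name": "Jane Doe", "ssn": "000-00-0000", "age": 34, "employed": True},
]
cleaned = remove_direct_identifiers(respondents)
print(cleaned)
```

The function returns new dictionaries rather than editing the originals, so the unmodified file can be kept under separate, secured storage if the study requires it.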
The second kind, indirect identifiers, can be used in conjunction with other information to identify individual respondents. Examples of indirect identifiers include:
- Detailed geographic information (e.g., state, county, or census tract of residence)
- Organizations (to which the respondent belongs)
- Educational institutions (from which the respondent graduated and year of graduation)
- Exact occupations
- Place where respondent grew up
- Exact dates of events (birth, death, marriage, divorce)
- Detailed income
- Offices or posts held by respondent
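To illustrate why combinations of indirect identifiers matter, the sketch below counts how many records share each combination of values and flags the rare ones. This is one common heuristic, not a description of ICPSR's review procedure; the field names and the threshold of 2 are assumptions made for the example.

```python
from collections import Counter


def count_by_combination(records, indirect_identifiers):
    """Count how many records share each combination of indirect-identifier values."""
    return Counter(
        tuple(record.get(field) for field in indirect_identifiers)
        for record in records
    )


def risky_records(records, indirect_identifiers, threshold=2):
    """Return records whose combination of values appears fewer than `threshold` times."""
    counts = count_by_combination(records, indirect_identifiers)
    return [
        record
        for record in records
        if counts[tuple(record.get(field) for field in indirect_identifiers)] < threshold
    ]


# Hypothetical sample: the third respondent is the only mayor in county B.
sample = [
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "B", "occupation": "mayor", "birth_year": 1975},
]
flagged = risky_records(sample, ["county", "occupation", "birth_year"])
print(flagged)
```

A record flagged this way is not necessarily identifiable, but a unique combination is the kind of case that would merit a closer look.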
ICPSR may recode data to remove the threat of disclosure. Recoding can include converting dates to time intervals, exact dates of birth to age groups, state of residence to regional codes, and income to income ranges or categories.
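The recodes named above can be sketched as follows. This is an illustrative example under stated assumptions, not ICPSR's actual recoding scheme: the function names, the ten-year age bands, the income cutoffs, and the partial state-to-region table are all hypothetical.

```python
from datetime import date

# Illustrative and deliberately partial; a real recode would cover every state.
REGION_BY_STATE = {
    "MI": "Midwest",
    "OH": "Midwest",
    "NY": "Northeast",
    "CA": "West",
    "TX": "South",
}


def days_between(start, end):
    """Convert two exact dates into a time interval measured in days."""
    return (end - start).days


def age_group(birth_date, reference_date, width=10):
    """Recode an exact date of birth into an age band such as '30-39'."""
    age = reference_date.year - birth_date.year
    # Subtract a year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (birth_date.month, birth_date.day):
        age -= 1
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def income_range(income, cutoffs=(25000, 50000, 100000)):
    """Recode an exact income into a range defined by ascending cutoffs."""
    lower = 0
    for upper in cutoffs:
        if income < upper:
            return f"{lower}-{upper - 1}"
        lower = upper
    return f"{lower}+"


def region(state):
    """Recode a state abbreviation into a regional code ('Other' if unmapped)."""
    return REGION_BY_STATE.get(state, "Other")


print(age_group(date(1985, 6, 15), date(2020, 1, 1)))
print(income_range(42000))
print(region("MI"))
```

Wider bands and coarser regions lower disclosure risk but also discard detail, so the choice of cutoffs is a trade-off against analytic utility.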
ICPSR staff work closely with data depositors to resolve confidentiality issues. ICPSR strongly recommends that data producers remove all respondent identifiers before they deposit their data in the archive. For more information, see Phase 5: Preparing Data for Sharing in the ICPSR Guide to Social Science Data Preparation and Archiving, 5th Edition.
If removing all identifiers will unacceptably reduce the analytic utility of the data, depositors should contact ICPSR about releasing a restricted-use dataset. Restricted-use datasets retain confidential information, so investigators must meet stringent requirements to access them.
Confidentiality, Informed Consent, and Data Sharing
Protection of the confidentiality of respondents is a core tenet of responsible research practice that begins with obtaining their informed consent.
Informed consent is a process of communication between a subject and a researcher that enables the person to decide voluntarily whether or not to participate as a research subject. Human subjects involved in a project must participate willingly and be adequately informed about the research. The informed consent must include a statement describing how the confidentiality of subject records will be maintained. However, it is also important that the consent language is written in a way that does not unduly limit an investigator's discretion to share data with the research community.
Confidentiality and IRBs
IRBs take different approaches to secondary analysis of public-use datasets such as those distributed on ICPSR's Web site. Some require IRB review of proposals to analyze these data. Others exempt projects involving secondary analysis of existing datasets from preapproved sources, such as ICPSR. Others are establishing their own policies to address these issues.
Levels of Access
Sometimes data cannot be modified to protect confidentiality without significantly compromising the research potential of the data. In these cases, access to the data is restricted and stringent confidentiality safeguards are imposed.
ICPSR has established two categories of restricted access.
Restricted-Use Data are made available to investigators who agree to stringent conditions for the use of the data and their physical safekeeping.
Enclave Data are datasets that present especially acute disclosure risks. They can be accessed only on-site in ICPSR's physical data enclave in Ann Arbor. Investigators must be approved, and their notes and analytic output are reviewed by ICPSR staff.
