The success of the social science research enterprise relies on the willingness of research participants to take part in surveys. Thus, it is critically important to ensure that the identities of research subjects are not revealed in archived data. Disclosure risk is a term that is often used for the possibility that a data record in a study can be linked to a specific person, thereby revealing information about that person that otherwise would not be known. Concerns about disclosure risk have grown as more datasets have become available online, and as it has become easier to link datasets with publicly available external databases.
ICPSR employs stringent procedures to protect the confidentiality of individuals whose personal information may be part of archived data. These include:
- Rigorous review of all datasets to assess disclosure risk
- Modifying data if necessary to protect respondents' confidentiality
- Limiting access to datasets where risk of disclosure remains high
- Training of staff and consultation with data producers in methods of disclosure risk management
Removing Direct and Indirect Identifiers
Two kinds of variables often found in social science datasets present problems that could endanger the privacy of research subjects: direct identifiers and indirect identifiers. Direct identifiers are variables that point explicitly to particular individuals or units. Examples of direct identifiers include:
- Addresses, including ZIP codes
- Telephone numbers, including area codes
- Social Security numbers
- Other linkable numbers such as driver license numbers, certification numbers, etc.
Data depositors should carefully consider the analytic role that such variables play and should remove any identifiers not necessary for analysis.
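As a minimal sketch of what removing direct identifiers might look like in practice, the Python below drops a set of identifier fields from each record before deposit. The field names and the sample record are hypothetical, chosen only for illustration; a real study would list its own identifying variables, and this is not a description of ICPSR's internal tooling.

```python
# Hypothetical field names; a real study would list its own direct identifiers.
DIRECT_IDENTIFIERS = {"address", "zip_code", "phone", "ssn", "driver_license"}


def strip_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without the listed identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]


# Made-up example record (the SSN and phone number are placeholders).
records = [
    {"case_id": 1, "ssn": "000-00-0000", "phone": "555-0100", "age": 34},
]
cleaned = strip_direct_identifiers(records)
print(cleaned)
```

The function returns new dictionaries rather than editing the originals, so the depositor's master file is left untouched.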
Indirect identifiers can also be problematic. These are variables that may be used in conjunction with other information to identify individual respondents. Examples of indirect identifiers include:
- Detailed geographic information (e.g., state, county, or census tract of residence)
- Organizations (to which the respondent belongs)
- Educational institutions (from which the respondent graduated and year of graduation)
- Exact occupations
- Place where respondent grew up
- Exact dates of events (birth, death, marriage, divorce)
- Detailed income
- Offices or posts held by respondent
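One rough way to see why indirect identifiers matter in combination is to count how many records are unique on a chosen set of variables. The sketch below is an illustrative check only, not ICPSR's review procedure; the function name, field names, and sample values are all assumptions.

```python
from collections import Counter


def count_unique_combinations(records, fields):
    """Count records whose combination of values on `fields` occurs exactly once."""
    combos = Counter(tuple(record[f] for f in fields) for record in records)
    return sum(1 for n in combos.values() if n == 1)


# Made-up sample: two respondents share a profile, one stands alone.
sample = [
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "A", "occupation": "teacher", "birth_year": 1980},
    {"county": "B", "occupation": "mayor", "birth_year": 1975},
]
unique = count_unique_combinations(sample, ["county", "occupation", "birth_year"])
print(unique)
```

A record that is unique on several indirect identifiers at once is easier to match against an external database than one that shares its profile with many others.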
ICPSR may recode data to remove the threat of disclosure. Recoding can include converting dates to time intervals, exact dates of birth to age groups, state of residence to regional codes, and income to income ranges or categories.
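The recodes described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the field names, the ten-year age bands, the $25,000 income bands with a top category at $100,000, and the partial state-to-region table. Actual recoding decisions depend on the study and are not specified in this text.

```python
from datetime import date

# Hypothetical, partial lookup table; a real recode would cover every state.
STATE_TO_REGION = {"MI": "Midwest", "OH": "Midwest", "NY": "Northeast", "CA": "West"}


def whole_years(start, end):
    """Number of completed years between two dates."""
    not_yet_reached = (end.month, end.day) < (start.month, start.day)
    return end.year - start.year - int(not_yet_reached)


def age_group(birth_date, as_of, width=10):
    """Convert an exact date of birth to an age band such as '30-39'."""
    low = (whole_years(birth_date, as_of) // width) * width
    return f"{low}-{low + width - 1}"


def income_range(income, step=25000, top=100000):
    """Convert exact income to a band; incomes at or above `top` share one category."""
    if income >= top:
        return f"{top}+"
    low = (income // step) * step
    return f"{low}-{low + step - 1}"


def recode(record, as_of):
    """Replace exact values with coarser categories and intervals."""
    return {
        "age_group": age_group(record["birth_date"], as_of),
        "region": STATE_TO_REGION.get(record["state"], "Other"),
        "income_range": income_range(record["income"]),
        # Exact event date becomes an interval relative to the reference date.
        "years_married": whole_years(record["marriage_date"], as_of),
    }


# Made-up respondent and reference (interview) date.
respondent = {
    "birth_date": date(1985, 6, 15),
    "state": "MI",
    "income": 62000,
    "marriage_date": date(2010, 3, 1),
}
recoded = recode(respondent, as_of=date(2020, 1, 1))
print(recoded)
```

The recoded record keeps the analytic information (age band, region, income band, duration of marriage) while dropping the exact dates, state, and dollar amount.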
ICPSR staff work closely with data depositors to resolve confidentiality issues. ICPSR strongly recommends that data producers remove all respondent identifiers before they deposit their data in the archive. For more information, see the Guide to Social Science Data Preparation and Archiving, 5th Edition.
Confidentiality and Informed Consent
Protection of the privacy of survey respondents is a core tenet of responsible research practice that begins with obtaining their informed consent.
Informed consent is the formal agreement of an individual to participate in the proposed research project. Human subjects involved in a project must participate willingly and be adequately informed about the research. The informed consent agreement must include a statement describing how the confidentiality of records identifying the subject will be maintained. It is important that an informed consent agreement be written in a way that does not unduly limit an investigator's discretion to share data with the research community.
Confidentiality and IRBs
Institutional Review Boards (IRBs) take different approaches to secondary analysis of public-use datasets such as those distributed on ICPSR's website. Some require IRB review of proposals to analyze these data. Others exempt projects involving secondary analysis of existing datasets from preapproved sources, such as ICPSR. Still others are establishing their own policies to address these issues.
Levels of Access
Sometimes data cannot be modified to protect confidentiality without significantly compromising the research potential of the data. In these cases, access to the data is limited and stringent confidentiality safeguards are imposed.
ICPSR has established two categories of limited access.
Restricted-Use Data are made available for research purposes to investigators who agree to stringent conditions for the use of the data and its physical safekeeping.
Enclave Data are datasets that present especially acute confidentiality risks. They can be accessed only on-site in ICPSR's secure data enclave in Ann Arbor. Investigators must be approved, and their notes and analytic output are reviewed by ICPSR staff.