Guide to Social Science Data Preparation and Archiving
Phase 5: Preparing Data for Sharing

This chapter addresses the critical final steps researchers should undertake in preparing to archive and/or disseminate their data.

Respondent Confidentiality

Much of this guide has focused on data preparation methods that can serve the research needs of both principal investigators and analysts of secondary data. In this chapter, however, we highlight one area of divergence necessitated by the responsibility to protect respondent confidentiality. Researchers must pay special attention to this issue. Once data are released to the public, it is impossible to monitor use to ensure that other researchers respect respondent confidentiality. Thus, it is common practice in preparing public-use datasets to alter the files so that information that could imperil the confidentiality of research subjects is removed or masked before the dataset is made public. At the same time, care must be taken to ensure that the alterations do not unnecessarily reduce the ability of other researchers to reproduce or extend the original study findings.

Below, we suggest steps that principal investigators can take to protect respondent confidentiality before submitting their data for archiving. But first, a quick review of why this is important.

The principles of disclosure risk limitation

Social scientists must demonstrate a deep and genuine commitment to preserve the privacy of the subjects whom they study in the course of their research. Most often applied to individuals who consent to be interviewed in surveys, this commitment extends also to groups, organizations, and entities whose information is recorded in administrative and other kinds of records.

Institutions conducting research using human subjects funded by the federal Department of Health and Human Services are responsible for compliance with the federal regulation on Protection of Human Subjects (45 CFR 46). Every such university and research institution must file an "assurance of compliance" with the HHS Office for Human Research Protections that includes "a statement of ethical principles to be followed in protecting human subjects of research." For more information, see the Office for Human Research Protections website.

College or university Institutional Review Boards (IRBs) approve proposals for research involving human subjects and take actions to ensure that any research is carried out appropriately and without harming research participants.

Archives place a high priority on preserving the confidentiality of respondent data and review all data collections they receive to ensure that confidentiality is protected in the public-use datasets released. Two major concerns govern policy and practice in this area: professional ethics and applicable regulations. The social sciences broadly defined (as well as a number of professional associations) have promulgated codes of ethics that require social scientists to ensure the confidentiality of data collected for research purposes. (See, for example, the American Statistical Association's "Ethical Guidelines for Statistical Practice" (1999), which stresses the appropriate treatment of data to protect respondent confidentiality.) Both the rights of respondents and their continued willingness to voluntarily provide answers to scientific inquiries underlie this professional ethic. The ethic applies to all participants in the research enterprise, from data collectors to archivists to secondary analysts who use such data in their research.

Regulations also bind all participants in the research enterprise to measures intended to protect research subjects as well as data obtained from such subjects. These regulations range from federal and local statutes to rules instituted by universities and colleges.

The practice of protecting confidentiality

Two kinds of variables often found in social science datasets present problems that could endanger the confidentiality of research subjects: direct and indirect identifiers.

Direct identifiers. These are variables that point explicitly to particular individuals or units. They may have been collected in the process of survey administration and are usually easily recognized. For instance, in the United States, Social Security numbers uniquely identify individuals who are registered with the Social Security Administration. Any variable that functions as an explicit name can be a direct identifier -- for example, a license number, phone number, or mailing address. Data depositors should carefully consider the analytic role that such variables fulfill and should remove any identifiers not necessary for analysis.

Indirect identifiers. Data depositors should also carefully consider a second class of problematic variables -- indirect identifiers. Such variables make unique cases visible. For instance, a United States ZIP code field may not be troublesome on its own, but when combined with other attributes like race and annual income, a ZIP code may identify unique individuals (e.g., extremely wealthy or poor) within that ZIP code, which means that answers the respondent thought would be private are no longer private. Some examples of possible indirect identifiers are detailed geography (e.g., state, county, or census tract of residence), organizations to which the respondent belongs, educational institutions from which the respondent graduated (and year of graduation), exact occupations held, places where the respondent grew up, exact dates of events, detailed income, and offices or posts held by the respondent. Indirect identifiers often are items that are useful for statistical analysis. The data depositor must carefully assess their analytic importance. Do analysts need the ZIP code, for example, or will data aggregated to the county or state levels suffice?
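The re-identification risk posed by combinations of indirect identifiers can be checked mechanically by counting how many respondents share each combination of values. The sketch below (in Python, with illustrative field names and records, not drawn from any actual study) flags "sample uniques" -- cases whose combination of quasi-identifiers appears only once:

```python
from collections import Counter

def flag_unique_cases(records, quasi_identifiers):
    """Return the indices of records whose combination of quasi-identifier
    values is unique in the dataset. Unique combinations are the cases
    most at risk of re-identification."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] == 1]

# Hypothetical respondent records; all values are illustrative.
records = [
    {"zip": "48104", "race": "White", "income_band": "50-75k"},
    {"zip": "48104", "race": "White", "income_band": "50-75k"},
    {"zip": "48104", "race": "Asian", "income_band": ">200k"},
]

# The third respondent is unique on the full combination...
print(flag_unique_cases(records, ["zip", "race", "income_band"]))  # [2]
# ...but not on ZIP code alone.
print(flag_unique_cases(records, ["zip"]))  # []
```

A depositor could run such a check with progressively broader variable combinations to decide which fields need aggregation before public release.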

Geographic identifiers. Some projects collect data containing direct geographic identifiers such as coordinates that can be used with a mapping application. These data can be classified and displayed with geographic information system (GIS) software. Direct geographic identifiers are actual addresses (e.g., of an incident, a business, a public agency, etc.). As described above, the analytic role of these variables should be considered, and they should be included only if necessary for analysis. Indirect geographic identifiers include location information such as state, county, census tract, census block, telephone area codes, and place where the respondent grew up.

Treating indirect identifiers. If, in the judgment of the principal investigator, a variable might act as an indirect identifier (and thus could be used to compromise the confidentiality of a research subject), the investigator should treat that variable in a special manner when preparing a public-use dataset. Commonly used types of treatment are as follows:

  • Removal -- eliminating the variable from the dataset entirely.
  • Top-coding -- restricting the upper range of a variable.
  • Collapsing and/or combining variables -- combining values of a single variable or merging data recorded in two or more variables into a new summary variable.
  • Sampling -- rather than providing all of the original data, releasing a random sample of sufficient size to yield reasonable inferences.
  • Swapping -- matching unique cases on the indirect identifier, then exchanging the values of key variables between the cases. This retains the analytic utility and covariate structure of the dataset while protecting subject confidentiality. Swapping is a service that archives may offer to limit disclosure risk. (For more in-depth discussion of this technique, see O’Rourke, 2003 and 2006.)
  • Disturbing -- adding random variation or stochastic error to the variable. This preserves the statistical relationship between the variable and its covariates while preventing someone from using the variable as a means of linking records.
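Two of the simpler treatments above, top-coding and disturbing, can be sketched in a few lines. This example (Python, with illustrative values; the $150,000 ceiling echoes the example below) is a minimal illustration, not production disclosure software:

```python
import random

def top_code(values, ceiling):
    """Top-code a numeric variable: cap values at a ceiling so that
    respondents in the extreme upper range cannot be singled out."""
    return [min(v, ceiling) for v in values]

def disturb(values, scale, seed=None):
    """Disturb a numeric variable: add Gaussian noise so the exact value
    can no longer be used to link records, while roughly preserving the
    variable's distributional properties."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]

incomes = [42_000, 87_500, 150_000, 480_000]  # hypothetical data
print(top_code(incomes, 150_000))  # [42000, 87500, 150000, 150000]
noisy = disturb(incomes, scale=5_000, seed=0)  # reproducible noise for illustration
```

In practice the noise scale, top-code ceiling, and random seed handling are policy decisions made in consultation with the archive, not defaults to copy.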

An example from a national survey of physicians (containing many details of each doctor’s practice patterns, background, and personal characteristics) illustrates some of these categories of treatment of variables to protect confidentiality. Variables identifying the school from which the physician’s medical degree was obtained and the year graduated should probably be removed entirely, due to the ubiquity of publicly available rosters of college and university graduates. The state of residence of the physician could be bracketed into a new “Region” variable (substituting more general geographic categories such as “East,” “South,” “Midwest,” and “West”). The upper end of the range of the “Physician’s Income” variable could be top-coded (e.g., “$150,000 or More”) to avoid identifying the most highly paid individuals. Finally, a series of variables documenting the responding physician’s certification in several medical specialties could be collapsed into a summary indicator (with new categories such as “Surgery,” “Pediatrics,” “Internal Medicine,” and “Two or More Specialties”).
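The physician example can be expressed as a small masking routine. Everything here is hypothetical (the record fields, the state-to-region mapping, and the income ceiling are invented for illustration), but the function applies exactly the treatments just described: removal, bracketing, top-coding, and collapsing.

```python
# Illustrative state-to-region lookup; a real one would cover all states.
REGION = {"NY": "East", "GA": "South", "IL": "Midwest", "CA": "West"}

def mask_physician(rec, income_cap=150_000):
    """Produce a public-use version of one hypothetical physician record."""
    masked = {}
    # Bracket detailed geography into a coarser "Region" variable.
    masked["region"] = REGION[rec["state"]]
    # Top-code income so the most highly paid individuals are not identifiable.
    masked["income"] = ("$150,000 or More" if rec["income"] >= income_cap
                        else rec["income"])
    # Collapse multiple specialty indicators into one summary category.
    specs = rec["specialties"]
    masked["specialty"] = specs[0] if len(specs) == 1 else "Two or More Specialties"
    # Medical school and graduation year are removed entirely:
    # they are simply never copied into the masked record.
    return masked

rec = {"state": "IL", "income": 310_000,
       "specialties": ["Surgery", "Pediatrics"],
       "med_school": "X University", "grad_year": 1988}
print(mask_physician(rec))
# {'region': 'Midwest', 'income': '$150,000 or More', 'specialty': 'Two or More Specialties'}
```

Note that removal is implemented by omission: the safest way to drop a direct identifier is never to write it into the public-use file at all.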

Data producers can consult with a social science data archive to design public-use datasets that maintain the confidentiality of respondents and are of maximum utility for all users. The staff will also perform an independent confidentiality review of datasets submitted to the archive and will work with the investigators to resolve any remaining problems of confidentiality. If the investigator anticipates that significant work will need to be performed before deposit to anonymize the data, this should be noted and funds set aside for this purpose at the beginning of the project.

Restricted-use data collections

Public-use data collections include content that has been carefully screened to reduce the risk of confidentiality breaches, either directly or through deductive analyses. Some original data items -- direct or indirect identifiers -- will be removed or adjusted through the treatment procedures discussed above. These treatments, however, frequently impose limitations on the research uses of such files. The loss of the confidential data can detract from the significance and analytic potential of a dataset.

Creating a restricted dataset provides a viable alternative to removing sensitive variables. In such instances, a public-use dataset that has these variables removed is released, while the dataset preserving the original variables is kept as a restricted-use dataset. The restricted-use dataset is released only to approved clients/users who have agreed in writing to abide by rules assuring that respondent confidentiality is maintained. Designating data for restricted-use can occur at the request of a data depositor, upon determination by the archive staff following review of the data content, or after consultation between the depositor and the archive. Maintenance of, and approval of access to, a restricted-use file is managed by archive staff in accordance with the terms of access.

Access to restricted-use files is offered to approved researchers under a set of highly controlled conditions. The right to use these files requires acceptance of a restricted data use agreement that spells out the conditions a researcher must accept before obtaining access. Most agreements require that a researcher provide a detailed summary of the research question and precisely explain why access to the confidential variables is needed. Each user of restricted data must provide a data protection plan outlining the steps he or she will take to safeguard the data during the project period. Researchers are usually given access to the data for a limited time period, at the end of which they must return the original files or destroy them. The restricted-use dataset approach effectively permits access to sensitive research information while protecting confidentiality, and has proven acceptable to researchers.

However, the advent of virtual data enclaves (see below) may eliminate the need for physical transfer of data files.

Data enclaves

In general, the more identifying information there is in a dataset, the more restrictive are the regulations governing access and location of use. Archives grant access to the most confidential data -- for example, medical records containing identifying information such as respondent name and address -- through a data enclave environment, either physical or virtual.

Virtual data enclaves

These data portals allow users to obtain remote access to restricted data that would not otherwise be available for research. This often includes using a restricted access application system, getting set up with secure remote access to the restricted data (including possible on-site inspection), monitoring research behavior during data access, and having analytic results reviewed for disclosure risk before they are permitted to leave the secure environment. Such systems generally prevent users from emailing, copying, or otherwise moving files outside of the secure environment, either accidentally or intentionally.

Physical data enclaves

A physical data enclave is a secure data analysis laboratory that allows access to the original data in a controlled setting. Secure data enclaves have added security features to ensure the safekeeping of the most confidential data. They typically have appropriate physical security measures (no windows, video monitoring, key card entry) to strictly control access. Their computing environments are not connected to the Internet, but rather have their own network server (connected to a small number of work stations). Researchers using the enclave are monitored by archive staff who see to it that no unauthorized materials are removed. Any analyses produced are scrutinized to determine that they do not include any potential breaches of confidentiality. Other policies and procedures also govern the use of restricted data in enclaves.

Secure Survey Documentation and Analysis (SSDA)

SSDA is an online data analysis program that performs bivariate cross-tabulation, comparison of means, correlation, and regression analyses. SSDA is designed to provide a safe, reliable way to distribute restricted-use data publicly, thereby democratizing access to data that was previously unavailable or that required special procedures to obtain. SSDA automates several disclosure protections that prevent the use of organization-defined high-risk variables, singularly or in combination, and restrict types of output commonly associated with disclosure risk (e.g., small unweighted sample sizes). When an attempted analysis violates the organization-defined rules, the resulting output is partially or completely suppressed.
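One such automated protection, suppressing cells of a cross-tabulation whose unweighted counts fall below a disclosure threshold, can be sketched as follows. The threshold of 5 and the table values are illustrative assumptions, not SSDA's actual rules:

```python
def suppress_small_cells(crosstab, min_n=5):
    """Replace any cell whose unweighted count is below min_n with None,
    as an automated disclosure rule might, so small cells are never shown."""
    return [[(n if n >= min_n else None) for n in row] for row in crosstab]

# A hypothetical 2x2 cross-tabulation of unweighted counts.
table = [[120, 3],
         [48, 17]]
print(suppress_small_cells(table))  # [[120, None], [48, 17]]
```

Real systems apply further rules (e.g., complementary suppression, so a suppressed cell cannot be recovered from row and column totals), which is why output review remains part of the secure-access workflow.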