Deductive Disclosure Risk

"Any data on human subjects inevitably raises privacy issues, and the real risks of abuse of such data are difficult to quantify." -A Matter of Trust. Nature, 2007.

Introduction

Deductive disclosure is the identification of an individual from known characteristics of that individual. Even when direct identifiers (e.g., names, addresses) are removed from survey data, it may still be possible to identify respondents who have unique characteristics. For example, an individual who is known to have participated in a study may be identified from a combination of personal characteristics.

The risk of deductive disclosure is of vital importance and interest in the social science community. There are several methods to mitigate deductive disclosure risk. This page describes a few of the more commonly used methods.
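To make the idea of unique characteristics concrete, the following minimal Python sketch flags records whose combination of indirect identifiers occurs only once in a file. The records, the variable names, and the helper `find_unique_records` are hypothetical and chosen only for illustration; real disclosure review relies on much richer criteria.

```python
from collections import Counter


def find_unique_records(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination occurs exactly once."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


# Hypothetical survey file with direct identifiers already removed.
survey = [
    {"age_group": "30-39", "sex": "F", "county": "A", "occupation": "teacher"},
    {"age_group": "30-39", "sex": "F", "county": "A", "occupation": "teacher"},
    {"age_group": "60-69", "sex": "M", "county": "B", "occupation": "mayor"},
    {"age_group": "30-39", "sex": "M", "county": "A", "occupation": "nurse"},
    {"age_group": "30-39", "sex": "M", "county": "A", "occupation": "nurse"},
]

at_risk = find_unique_records(
    survey, ["age_group", "sex", "county", "occupation"]
)
for record in at_risk:
    print(record)
```

In this made-up file only the record for the county B mayor is unique, so anyone who knows that person took part in the study could plausibly pick out their responses even though no name is present.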

Report on Statistical Disclosure Limitation Methodology

In 2005, the Federal Committee on Statistical Methodology and the Confidentiality and Data Access Committee revised their 1994 report on statistical disclosure limitation methodology. The revised working paper details the current practices of the principal federal statistical agencies, including research and new methodologies developed over the preceding decade.

A synopsis of the key topics with chapter references can be found below. Further elaborations on these topics can be found in chapters 4 and 5. Chapter 3 describes current federal statistical agency practices.

Chapter 1: Introduction

  • Confidentiality: data are disseminated in a manner that does not permit public identification of the respondent and is not in any way harmful to them
  • Disclosure: public identification of individual reporting units and of information about them
    • Identity Disclosure: when a subject is identified from a released file
    • Attribute Disclosure: when sensitive information about a data subject is revealed through the released file
    • Inferential Disclosure: when the released data makes it possible to determine the value of some characteristic of an individual more accurately than would have otherwise been possible
  • Restricted Data: protecting confidentiality by restricting the amount of information provided or by adjusting the data in released tables and microdata files
  • Restricted Access: imposing conditions on access to the data products

Chapter 2: Statistical Disclosure Limitation Methods

  • Most common method of providing data to the public is through statistical tables
  • User-friendly products can be provided via microdata files, where each record contains a set of variables that pertain to a single respondent and are related to that respondent's reported values; direct identifiers, such as names and addresses, are removed
  • Ways to protect confidentiality/disclosure
    • Sampling - conduct a sample survey rather than a census
      • Estimates are calculated by multiplying a respondent's data by a sampling weight and aggregating all the weighted responses
    • Suppression is one of the most common methods of protecting sensitive cells - removing data to prevent the identification of individuals in small groups or those with unique characteristics
      • Primary Suppressions: withholding identified sensitive cells from publication
      • Complementary Suppressions: selecting other cells and suppressing them so that the sensitive cells cannot be derived by addition or subtraction from published marginal totals
    • Random Rounding - cell values are rounded, but instead of using standard rounding conventions, it is randomly decided whether they will be rounded up or down
      • Controlled Rounding: a form of random rounding that is constrained to have the sum of the published entries in each row and column equal the appropriate published marginal totals
    • Tabular data can be protected by applying disclosure protection methods to the underlying microdata files to assure that any tables generated from the microdata files are fully protected
  • Microdata
    • Tips to reduce the potential for disclosure with public-use microdata files:
      • Include data from only a sample of the population
      • Do not include obvious identifiers
      • Limit geographic detail
      • Limit the number and detailed breakdown of categories within variables on the file
  • Methods for disguising high risk variables
    • Truncate extreme codes for certain variables
      • Top-coding: only showing that a particular variable is greater than X amount
      • Bottom-coding: only showing that a particular variable is less than X amount
    • Recoding into intervals or rounding
    • Adding or multiplying by random numbers (noise)
    • Swapping or rank swapping (aka switching) - involves selecting a sample of the records, finding a match in the database on a set of predetermined variables, and swapping all other variables
    • Data Shuffling
      • First, the values of the confidential variables are modified using a general perturbation technique
      • Second, a data shuffling procedure is applied using the perturbed values of the confidential variables on file
    • Selecting records at random, blanking out selected variables and imputing for them (aka blank/impute)
    • Aggregating across small groups of respondents and replacing one individual's reported value with the average (aka blurring)
  • Removing identifiers and limiting geographic detail
    • Include only the data from a sample of the population
    • Remove identifiers that directly identify respondents such as name, address, and identification numbers
    • Consider the geographic detail
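Several of the methods listed above (primary suppression, random rounding, and top- and bottom-coding) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the threshold of 5, the rounding base of 5, and the sample figures are assumptions, not values taken from the report. The rounding function uses an unbiased variant in which the chance of rounding up is proportional to the remainder. Complementary suppression is not handled here.

```python
import random


def suppress_small_cells(table, threshold=5):
    """Primary suppression: withhold (None) any cell count below the threshold."""
    return {cell: (n if n >= threshold else None) for cell, n in table.items()}


def random_round(value, base=5, rng=random):
    """Round to a multiple of `base`, choosing up or down at random.

    Rounding up with probability remainder/base leaves the expected
    value of the published figure equal to the true figure.
    """
    lower = (value // base) * base
    remainder = value - lower
    if remainder == 0:
        return value
    return lower + base if rng.random() < remainder / base else lower


def top_code(value, cap):
    """Top-coding: report any value above `cap` as the cap itself."""
    return min(value, cap)


def bottom_code(value, floor):
    """Bottom-coding: report any value below `floor` as the floor itself."""
    return max(value, floor)


# Hypothetical cell counts and incomes.
rng = random.Random(0)
counts = {"county A": 42, "county B": 3, "county C": 17}
incomes = [28_000, 54_000, 310_000]

suppressed = suppress_small_cells(counts)
rounded = {cell: random_round(n, rng=rng) for cell, n in counts.items()}
capped = [top_code(x, cap=150_000) for x in incomes]

print(suppressed)
print(rounded)
print(capped)
```

In the capped list, the income of 310,000 is released as 150,000, which a data user would read as "150,000 or more".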

Additional Best Practices

Data perturbation is a privacy-preserving method which protects sensitive data from unauthorized use.

  • Some common data perturbation techniques are summarized below; additional information is available from the International Household Survey Network and the Privacy Technical Assistance Center
    • Micro-Aggregation: Replacing an observed value with the average computed on a small group of units
    • Post-Randomization: Induces uncertainty in the values of some variables by exchanging them according to a probabilistic mechanism
    • Resampling: Drawing with replacement t samples of n values from the original data, sorting the sample, and averaging the sampled values
    • Data Masking: Creating a structurally similar, but inauthentic version of the data; protecting the actual data while creating a functional substitute
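As a rough illustration of two of the techniques above, the sketch below implements a simple univariate micro-aggregation and a simplified post-randomization step. The function names, the group size `k`, the change probability `p_change`, and the sample values are assumptions made for this example. The uniform choice among other categories stands in for whatever probabilistic mechanism a real application would specify.

```python
import random
from statistics import mean


def micro_aggregate(values, k=3):
    """Replace each value with the mean of a group of at least k similar values."""
    # Order the record positions by value so neighbours are similar.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a short final group into the previous one so no group has fewer than k units.
    if len(groups) > 1 and len(groups[-1]) < k:
        last = groups.pop()
        groups[-1].extend(last)
    result = list(values)
    for group in groups:
        average = mean(values[i] for i in group)
        for i in group:
            result[i] = average
    return result


def post_randomize(categories, levels, p_change=0.1, rng=random):
    """With probability p_change, swap a category for a different, randomly chosen level."""
    randomized = []
    for category in categories:
        if rng.random() < p_change:
            others = [level for level in levels if level != category]
            randomized.append(rng.choice(others))
        else:
            randomized.append(category)
    return randomized


# Hypothetical ages and marital-status codes.
ages = [21, 58, 23, 60, 22, 95, 59]
aggregated = micro_aggregate(ages, k=3)
print(aggregated)

levels = ["single", "married", "widowed"]
status = ["single", "married", "married", "widowed", "single"]
print(post_randomize(status, levels, p_change=0.2, rng=random.Random(1)))
```

Micro-aggregation in this form preserves the overall total (and therefore the mean) of the variable, while the outlying age of 95 is no longer visible in the released values.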

Synthetic microdata, produced using data simulation algorithms, can also be used to protect participant identities (Sieber 2001; Reiter 2011).

  • Values of confidential data are replaced with simulations from statistical models allowing for higher data quality
  • There are two main kinds of synthetic data: full and partial
    • Fully Synthetic Data
      • Randomly and independently sample units from the sampling frame to form each synthetic data set
      • Impute the unknown data values for units in the synthetic samples using models fit with the original survey data
      • Release multiple versions of the data sets to the public
    • Partially Synthetic Data
      • Consists of the units originally surveyed, with some collected values replaced by multiple imputations
  • Imputations should be drawn from distributions designed to preserve important relationships in the confidential data
  • Synthetic data can produce valid inferences in a variety of settings, including: simple random sampling, probability proportional to size sampling, two-stage cluster sampling, and stratified sampling
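The partially synthetic approach described above might be sketched as follows, assuming a single confidential variable (`income`) modeled by a simple linear regression on one non-confidential variable (`age`). The data, the variable names, and the helpers `fit_line` and `partially_synthesize` are hypothetical. For brevity the sketch holds the fitted coefficients fixed across versions, whereas a full multiple-imputation procedure would also draw the model parameters.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    residual_sd = (sum(r * r for r in residuals) / (len(x) - 2)) ** 0.5
    return a, b, residual_sd


def partially_synthesize(records, predictor, target, m=3, seed=0):
    """Return m copies of the records with `target` replaced by draws from the fitted model."""
    rng = random.Random(seed)
    x = [r[predictor] for r in records]
    y = [r[target] for r in records]
    a, b, residual_sd = fit_line(x, y)
    versions = []
    for _ in range(m):
        versions.append([
            # Keep every collected value except the confidential one.
            {**r, target: a + b * r[predictor] + rng.gauss(0, residual_sd)}
            for r in records
        ])
    return versions


# Hypothetical original survey records.
original = [
    {"age": 25, "income": 31_000},
    {"age": 32, "income": 42_000},
    {"age": 41, "income": 50_000},
    {"age": 47, "income": 61_000},
    {"age": 53, "income": 58_000},
    {"age": 60, "income": 72_000},
]

synthetic = partially_synthesize(original, "age", "income", m=3)
for version in synthetic:
    print([round(r["income"]) for r in version])
```

Each released version keeps the surveyed units and their ages but carries simulated incomes that follow the age-income relationship estimated from the confidential data.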

Protecting Respondent Confidentiality

Informed Consent and Anonymity/Confidentiality - In a research setting, subjects must understand the purpose of the study from the outset and agree to what it requires of them. Once informed consent has been given, there are several considerations regarding participant anonymity and confidentiality (Kaiser 2009; Bickford and Nisker 2015; Saunders, Kitzinger, and Kitzinger 2014; Giordano, O'Reilly, Taylor, and Dogra 2007).

  • It is not always fair to assume that subjects do not wish to be linked to the responses they provide
    • Give participants the opportunity to decide whether they would like their responses to remain anonymous or not
  • Ensure that from the outset of the study, respondents are well informed about both what the study will require of them and how their responses, and the data more broadly, are going to be used and distributed
    • Informed consent should be an ongoing process, and is not something that takes place solely at the start of the study
  • It is important to balance the protection of participant identities against the nuanced, in-depth description that ensures the research will be meaningful

Other ways to protect participant confidentiality (Petrova, Dewing, and Camilleri 2014; Privacy Technical Assistance Center)

  • Release as little demographic information as possible
  • Use pseudonyms or codes
  • Give participants sovereignty over their data and intentionally conceal information that may reveal their identities
  • Before disseminating the data, ask an independent person who is knowledgeable about the participants and the system to read the report to ensure that the identities of the participants are not made explicit in any way
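The suggestion above to use pseudonyms or codes could be implemented along the following lines. This is a minimal sketch: the names, the code format, and the helper `assign_codes` are hypothetical. Codes are assigned in a shuffled order so that they do not reflect alphabetical or enrollment order, and the name-to-code key would need to be stored separately from the released file.

```python
import random


def assign_codes(names, prefix="P"):
    """Map each distinct name to a sequential code assigned in random order."""
    unique_names = sorted(set(names))
    # SystemRandom draws from the operating system's entropy source.
    random.SystemRandom().shuffle(unique_names)
    return {
        name: f"{prefix}{i:03d}"
        for i, name in enumerate(unique_names, start=1)
    }


# Hypothetical interview responses.
responses = [
    {"name": "Ana Diaz", "answer": "I joined the study in the first wave."},
    {"name": "Ben Okoro", "answer": "I was recruited through my clinic."},
    {"name": "Ana Diaz", "answer": "I completed the follow-up by phone."},
]

key = assign_codes(r["name"] for r in responses)

# The released records carry only the code, never the name.
released = [{"id": key[r["name"]], "answer": r["answer"]} for r in responses]
for record in released:
    print(record)
```

Codes alone do not remove identifying details from the answers themselves, so the free-text content still needs the kind of independent read-through described in the last bullet.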

References/Resources

The Add Health study has extensive experience with issues of deductive disclosure. In addition, the following references may be helpful.

Bickford, Julia and Jeff Nisker. 2015. "Tensions between Anonymity and Thick Description when 'Studying Up' in Genetics Research." Qualitative Health Research 25(2):276-282.

Bournazian, Jacob, Nancy Kirkendall, Steve Cohen, Philip Steel, Alvan Zarate, Arnold Reznek, and Paul Massell. 2005. "Report on Statistical Disclosure Limitation Methodology." Federal Committee on Statistical Methodology 1(2).

Giordano, James, Michelle O'Reilly, Helen Taylor and Nisha Dogra. 2007. "Confidentiality and Autonomy: The Challenge(s) of Offering Research Participants a Choice of Disclosing their Identity." Qualitative Health Research 17(2):264-275.

Kaiser, Karen. 2009. "Protecting Respondent Confidentiality in Qualitative Research." Qualitative Health Research 19(11):1632-1641.

Petrova, Elmira, Jan Dewing and Michelle Camilleri. 2014. "Confidentiality in Participatory Research: Challenges from One Study." Nursing Ethics 1(13).

Reiter, Jerome P. 2011. "Data Confidentiality." Wiley Interdisciplinary Reviews: Computational Statistics 3(5):450-456.

Saunders, Benjamin, Jenny Kitzinger and Celia Kitzinger. 2014. "Anonymising Interview Data: Challenges and Compromise in Practice." Qualitative Research 1(17).

Sieber, Joan E. 2001. Summary of Human Subjects Protection Issues Related to Large Sample Surveys. U.S. Department of Justice and Bureau of Justice Statistics.