What are the Guidelines for Disclosure Protection and Output Vetting?

The following guidelines are provided for disclosure protection and output vetting when analyzing data. The purpose of output vetting is to prevent re-identification of respondents and protect sensitive information. Please note that specific standards may apply to certain restricted data studies or series; these standards are set out in the Data Use Agreement. If using restricted data, always check your Data Use Agreement for these standards or email DSDR@icpsr.umich.edu for guidance.

How output is vetted for disclosure protection can depend on your access method:

  • If you are using public data or restricted data via the secure download access method, YOU are responsible for checking your own output.
  • If you are using the Virtual Data Enclave (VDE) for access (with either restricted or restricted/public data), YOU are responsible for the FIRST check prior to requesting the release of output. ICPSR staff will then complete a final check before releasing any output from the VDE.

Disclosure-Protection Rules

Disclosure protection rules define what results from analyses involving restricted-use data may be presented or published. These rules prevent the indirect re-identification of respondents and organizations.

The following table provides general disclosure protection guidelines that can be applied to all studies, regardless of access method. Specific guidelines for disclosure may also be outlined in the Data Use Agreement for each study.

Rule | Description | Typical Values
PII or PHI | Personally Identifiable Information such as names, addresses, and respondent ID cannot be released | Direct identifiers
Suppressed Variables | While these variables can be included in analysis, coefficients and tables for them cannot be reported | Geographic identifiers
Suppressed combinations of variables | While these variables can be reported separately, they may not be used together in tables or interactions | Detailed household structure
Minimum cell sizes | For tables, minimum allowed cell sizes. Cells below this value require rows or columns to be combined. Redaction of the individual cell is insufficient | 10
Minimum sample and sub-sample size | Minimum number of valid observations (excluding missing data) for regression analysis | 50
Disallowed sub-samples | Sub-samples that are not allowed even if the sub-sample meets sample size requirements | Ranking of specific places or organizations
Dummy variables | Dummy variables for which coefficients cannot be reported | Ranking of specific places or organizations
Organizations and Groups | Organizations and Groups for which results cannot be presented separately | Ranking of specific organizations
Nested tables | Tables that can be combined into one table | Tables may not be combined to produce another table, typically by subtracting cell counts across tables.
Saturated or near saturated models | Models that reproduce the data exactly | Maximum R-squared 0.5; minimum df remaining 40
List cases including predicted values | An individual case or roster of cases cannot be reported | List cases and scatterplots are not allowed
Weights | Do results have to be weighted? | Unweighted totals may be presented for tables but not individual cells.
Visualizations | | Maps must obscure exact locations
Linkages | What data may not be linked? | Contextual linkages for geographic areas are typically allowed. Linkages at the individual level must be explicitly approved.
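
As an illustration of the minimum cell size and nested tables rules, the following Python sketch flags cells below a threshold and shows how subtracting one table from another can expose small cells even when both tables pass on their own. The function names, the table layout (a dictionary of rows), and all counts are hypothetical and invented for this example. The threshold of 10 is the typical value listed above, and a Data Use Agreement may set a different one.

```python
MIN_CELL_SIZE = 10  # typical value from the guidelines; a Data Use Agreement may differ


def small_cells(table, minimum=MIN_CELL_SIZE):
    """Return (row, column, count) for every cell whose count is below the minimum."""
    return [
        (row, col, count)
        for row, cells in table.items()
        for col, count in cells.items()
        if count < minimum
    ]


def implied_by_subtraction(full, part):
    """Cell-by-cell difference of two tables that share the same row and column labels."""
    return {
        row: {col: full[row][col] - part[row][col] for col in full[row]}
        for row in full
    }


# Hypothetical unweighted counts, invented for illustration only.
all_respondents = {
    "Employed": {"Urban": 120, "Rural": 45},
    "Not employed": {"Urban": 60, "Rural": 30},
}
under_65 = {
    "Employed": {"Urban": 115, "Rural": 41},
    "Not employed": {"Urban": 38, "Rural": 12},
}

# Each table passes the cell size check on its own.
print(small_cells(all_respondents))
print(small_cells(under_65))

# Subtracting one from the other implies a third table (respondents 65 and
# over in this invented example) that contains cells below the minimum.
implied = implied_by_subtraction(all_respondents, under_65)
print(small_cells(implied))
```

In this sketch the two published tables would jointly disclose counts of 5 and 4, which is why tables that can be combined by subtraction need to be reviewed together rather than one at a time.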

VDE Output Vetting for Compliance with Disclosure Protection Rules

Please be aware that no output may be removed or transcribed from the VDE in any form without approval by DSDR or ICPSR staff. This prohibition covers sending any information by email, including simple statistics or screenshots, even to ICPSR staff or members of your project team. Doing so would constitute a violation of your legal agreement with ICPSR and the University of Michigan. Please refer to your Restricted Data Use Agreement for additional information.

Researchers are responsible for adhering to disclosure protection rules and must review their output before submitting to DSDR for disclosure review and final approval. Submitted output should not exceed 100 pages and can include Word tables, Excel spreadsheets, and graphics (preferred) or raw output from a statistical package (files and logs). SPSS output must be saved and submitted as a PDF. Files submitted for disclosure review should be well-labeled tables of frequencies, crosstabs, and/or regression estimates that could be ready for publication, presentation, etc.

To request output, please email dsdr@icpsr.umich.edu with a subject line that contains your VDE project number. Include the following information in your message:

  • Name of study or data that are the basis of the results
  • PI of the VDE Project
  • Location of output file(s) to be reviewed
  • Name(s) of output file(s) to be reviewed

In the VDE, include the following information within the output file(s) to be reviewed:

  • Sample size for each regression or table; if table contains several regressions, sample size for each regression
  • Sample size for each statistic
  • Description of sample or sub-sample and population or sub-population for each regression or table
  • Description of content of table cells
  • If regression, indicate the y-variate or outcome
  • If regression, indicate dummy variables
  • Indicate whether the results are weighted
  • Counts of observations for dummy variables
  • Counts of observations for table cells
  • Labels for all variables or glossary of variables
  • Labels for all categories of variables
  • Minimum and maximum values must have counts of observations
  • Histograms and other charts must have associated tables

Output Tips

A.) Always cite the data used. ICPSR data citations can be found on the study page for the data, at the top: “Cite this study.”

B.) For clarity, output should include:

  • Name of study or data that are the basis of the results
  • Sample size for each regression or table; if table contains several regressions, sample size for each regression
  • Sample size for each statistic
  • Description of sample or sub-sample and population or sub-population for each regression or table
  • Description of content of table cells
  • If regression, indicate the y-variate
  • If regression, indicate dummy variables
  • Indicate whether the results are weighted
  • Counts for dummy variables
  • Counts for table cells
  • Labels for all variables or glossary of variables

Example 1: Crosstab

Risk of Infant Mortality by Water Source in San Cristobal 2010

Live births between 1 July 2009 and 30 June 2010 (n=1248)

Weighted column percentages (unweighted counts)

                | Municipal Water | Well Water   | Other Water  | Total
Infant Died     | 1.5% (18)       | 4.3% (12)    | 6.9% (10)    | 2.8% (40)
Infant Survived | 98.5% (883)     | 95.7% (233)  | 93.1% (92)   | 97.2% (1208)
Total           | 100.0% (901)    | 100.0% (245) | 100.0% (102) | 100.0% (1248)

Chi-square: 20.81 (df = 2) p-value: 0.00003

Source: San Cristobal Health Survey 2011
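
Because Example 1 reports the unweighted counts in every cell, a reviewer can recompute its totals and test statistic directly. The following Python sketch, which uses only the standard library, recomputes the margins and the Pearson chi-square from those counts and checks the smallest cell against the typical minimum of 10. The variable names are illustrative, and the closed-form p-value used here holds only for 2 degrees of freedom.

```python
import math

# Unweighted counts copied from Example 1.
# Rows: infant died, infant survived. Columns: municipal, well, other water.
observed = [
    [18, 12, 10],
    [883, 233, 92],
]
MIN_CELL_SIZE = 10  # typical value from the guidelines; a Data Use Agreement may differ

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# With df = 2 the chi-square upper-tail probability is exactly exp(-x / 2).
# Other degrees of freedom need a statistics library.
p_value = math.exp(-chi_square / 2)

smallest_cell = min(min(row) for row in observed)

print("n:", n, "column totals:", col_totals, "row totals:", row_totals)
print("chi-square:", round(chi_square, 2), "df:", df, "p-value:", round(p_value, 5))
print("meets minimum cell size:", smallest_cell >= MIN_CELL_SIZE)
```

Run against the counts in Example 1, this reproduces the reported totals, the chi-square of 20.81 on 2 degrees of freedom, and a p-value of about 0.00003. The smallest cell holds exactly 10 observations, so it is not below the typical minimum.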

Example 2: Logistic Regression

Adjusted Risk of Infant Mortality by Water Source in San Cristobal 2010

Live births between 1 July 2009 and 30 June 2010 (n=1248)

Logistic regression adjusting for urban/rural, education, family size

Outcome: Infant died before 12 months

Weighted Odds Ratios (95% confidence interval)

                                             | Unadjusted Odds Ratio | Adjusted Odds Ratio
Well vs. municipal                           | 2.9 (1.5 – 5.8)       | 2.1 (1.1 – 4.0)
Other vs. municipal                          | 4.9 (2.8 – 8.6)       | 2.7 (0.9 – 8.1)
Adjustments                                  | (unadjusted)          | Location, education, family size
Sample size                                  | 1239                  | 1239
Missing information on at least one variable | 9                     | 9
df                                           | 1236                  | 1231

Likelihood Ratio: 17.3 (df = 5) p-value: 0.004

Source: San Cristobal Health Survey 2011
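
The figures reported in Example 2 can be checked against the general regression thresholds before output is submitted. The Python sketch below is one possible pre-submission check, not an official ICPSR tool. The function name and argument layout are hypothetical, and the thresholds are the typical values from the general guidelines (minimum sample size 50, minimum df remaining 40), which a Data Use Agreement may override.

```python
MIN_SAMPLE_SIZE = 50   # typical value; check the Data Use Agreement
MIN_DF_REMAINING = 40  # typical value; check the Data Use Agreement


def check_model(name, n_total, n_missing, n_reported, df_remaining):
    """Return a list of human-readable problems found for one regression."""
    problems = []
    if n_total - n_missing != n_reported:
        problems.append(f"{name}: sample size does not equal total minus missing")
    if n_reported < MIN_SAMPLE_SIZE:
        problems.append(f"{name}: sample size {n_reported} is below {MIN_SAMPLE_SIZE}")
    if df_remaining < MIN_DF_REMAINING:
        problems.append(f"{name}: df remaining {df_remaining} is below {MIN_DF_REMAINING}")
    return problems


# Figures copied from Example 2: 1248 live births, 9 with missing information.
models = [
    ("unadjusted", 1248, 9, 1239, 1236),
    ("adjusted", 1248, 9, 1239, 1231),
]

all_problems = [problem for model in models for problem in check_model(*model)]
print(all_problems if all_problems else "No problems found")
```

Both models in Example 2 pass these checks. The reported sample size of 1239 equals the 1248 live births minus the 9 cases with missing information, and both values of df are well above 40.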