What are the Guidelines for Disclosure Protection and Output Vetting?

The following guidelines are provided for disclosure protection and output vetting when analyzing data. The purpose of output vetting is to prevent re-identification of respondents and protect sensitive information. Please note that specific standards may apply to certain restricted data studies or series; these are found in the Data Use Agreement. If using restricted data, always check your Data Use Agreement for these standards or email dsdr@icpsr.umich.edu for guidance.

How output is vetted for disclosure protection can depend on your access method:

If you are using public data or restricted data via the secure download access method, YOU are responsible for checking your own output.
If you are using the Virtual Data Enclave (VDE) for access (with either restricted or restricted/public data), YOU are responsible for the FIRST check prior to requesting the release of output. ICPSR staff will then complete a final check before releasing any output from the VDE.

Disclosure-Protection Rules

Disclosure protection rules define what results from analyses involving restricted-use data may be presented or published. These rules prevent the indirect re-identification of respondents and organizations.

The following table provides general disclosure protection guidelines that can be applied to all studies, regardless of access method. Specific guidelines for disclosure may also be outlined in the Data Use Agreement for each study.

Rule	Description	Typical Values
PII or PHI	Personally Identifiable Information or Protected Health Information, such as names, addresses, and respondent IDs, cannot be released	Direct identifiers
Suppressed Variables	While these variables can be included in analysis, coefficients and tables for them cannot be reported	Geographic identifiers
Suppressed combinations of variables	While these variables can be reported separately, they may not be used together in tables or interactions	Detailed household structure
Minimum cell sizes	For tables, minimum allowed cell sizes. Cells below this value require rows or columns to be combined. Redaction of the individual cell is insufficient	10
Minimum sample and sub-sample size	Minimum number of valid observations (excluding missing data) for regression analysis	50
Disallowed sub-samples	Sub-samples that are not allowed even if the sub-sample meets sample size requirements	Ranking of specific places or organizations
Dummy variables	Dummy variables for which coefficients cannot be reported	Ranking of specific places or organizations
Organizations and Groups	Organizations and Groups for which results cannot be presented separately	Ranking of specific organizations
Nested tables	Tables that can be combined into one table	Tables may not be combined to produce another table, typically by subtracting cell counts across tables.
Saturated or near-saturated models	Models that reproduce the data exactly or nearly exactly	Maximum R-squared 0.5; minimum df remaining 40
List cases including predicted values	An individual case or roster of cases cannot be reported	List cases and scatterplots are not allowed
Weights	Do results have to be weighted?	Unweighted totals may be presented for tables but not individual cells.
Visualizations		Maps must obscure exact locations
Linkages	What data may not be linked?	Contextual linkages for geographic areas are typically allowed. Linkages at the individual level must be explicitly approved.
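
As a rough illustration of the minimum cell size and nested tables rules, the following Python sketch flags cells in a table of unweighted counts that fall below a threshold, then shows how subtracting one table from another can expose a small cell. The function names, the tables, and the counts are hypothetical; the default threshold of 10 is the typical value listed above, and your Data Use Agreement may set a different one.

```python
# Sketch only: hypothetical names and counts, not an ICPSR tool.

def small_cells(table, minimum=10):
    """Return (row, column, count) for every cell below the minimum size."""
    return [
        (row, col, count)
        for row, cells in table.items()
        for col, count in cells.items()
        if count < minimum
    ]


def subtract_tables(larger, smaller):
    """Cell-by-cell difference, as a reader comparing two tables could compute."""
    return {
        row: {col: larger[row][col] - smaller[row][col] for col in cells}
        for row, cells in larger.items()
    }


# Unweighted counts for a full sample and for a sub-sample nested within it.
all_respondents = {
    "Employed": {"Urban": 120, "Rural": 45},
    "Not employed": {"Urban": 60, "Rural": 22},
}
under_65 = {
    "Employed": {"Urban": 110, "Rural": 41},
    "Not employed": {"Urban": 48, "Rural": 15},
}

# Each table passes the cell-size check on its own.
print(small_cells(all_respondents))
print(small_cells(under_65))

# Their difference (respondents aged 65 and over) contains cells below 10.
aged_65_plus = subtract_tables(all_respondents, under_65)
print(small_cells(aged_65_plus))
```

A check like this only covers counts that appear in the tables you supply. It does not replace a review of suppressed variables, disallowed sub-samples, or the other rules above.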

VDE Output Vetting for Compliance with Disclosure Protection Rules

Please be aware that no output may be removed or transcribed from the VDE in any form without approval by DSDR or ICPSR staff. This includes sending any information by email, including simple statistics or screenshots, even to ICPSR staff or members of your project team. Doing so would constitute a violation of your legal agreement with ICPSR and the University of Michigan. Please refer to your Restricted Data Use Agreement for additional information.

Researchers are responsible for adhering to disclosure protection rules and must review their output before submitting it to DSDR for disclosure review and final approval. Submitted output should not exceed 100 pages and can include Word tables, Excel spreadsheets, and graphics (preferred) or raw output from a statistical package (files and logs). SPSS output must be saved and submitted as a PDF. Files submitted for disclosure review should be well-labeled tables of frequencies, crosstabs, and/or regression estimates that could be ready for publication or presentation.

To request output, please email dsdr@icpsr.umich.edu with a subject line that contains your VDE project number. Include the following information in your message:

Name of study or data that are the basis of the results
PI of the VDE Project
Location of output file(s) to be reviewed
Name(s) of output file(s) to be reviewed

In the VDE, include the following information within the output file(s) to be reviewed:

Sample size for each regression or table; if table contains several regressions, sample size for each regression
Sample size for each statistic
Description of sample or sub-sample and population or sub-population for each regression or table
Description of content of table cells
If regression, indicate the y-variate or outcome
If regression, indicate dummy variables
Indicate whether the results are weighted
Counts of observations for dummy variables
Counts of observations for table cells
Labels for all variables or glossary of variables
Labels for all categories of variables
Minimum and maximum values must have counts of observations
Histograms and other charts must have associated tables.
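
One way to meet the requirement that minimum and maximum values carry counts of observations is to report each extreme together with the number of cases at that value. The Python sketch below is a minimal example; the function name and the family size values are hypothetical, and it assumes missing data are coded as None.

```python
from collections import Counter


def extremes_with_counts(values):
    """Return the minimum and maximum of the non-missing values with their counts."""
    valid = [v for v in values if v is not None]  # drop missing data
    counts = Counter(valid)
    lowest, highest = min(valid), max(valid)
    return {
        "min": (lowest, counts[lowest]),
        "max": (highest, counts[highest]),
    }


# Hypothetical family size variable with one missing value.
family_size = [2, 3, 3, None, 1, 4, 9, 1, 2, 1]
print(extremes_with_counts(family_size))
```

Reporting the count next to each extreme lets a reviewer see at a glance whether a minimum or maximum is held by only a few observations.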

Output Tips

A.) Always cite the data used. ICPSR data citations can be found on the study page for the data, at the top: “Cite this study.”

B.) For clarity, output should include:

Name of study or data that are the basis of the results
Sample size for each regression or table; if table contains several regressions, sample size for each regression
Sample size for each statistic
Description of sample or sub-sample and population or sub-population for each regression or table
Description of content of table cells
If regression, indicate the y-variate
If regression, indicate dummy variables
Indicate whether the results are weighted
Counts for dummy variables
Counts for table cells
Labels for all variables or glossary of variables

Example 1: Crosstab

Risk of Infant Mortality by Water Source in San Cristobal 2010

Live births between 1 July 2009 and 30 June 2010 (n=1248)

Weighted column percentages (unweighted counts)

	Municipal Water	Well Water	Other Water	Total
Infant Died	1.5% (18)	4.3% (12)	6.9% (10)	2.8% (40)
Infant Survived	98.5% (883)	95.7% (233)	93.1% (92)	97.2% (1208)
Total	100.0% (901)	100.0% (245)	100.0% (102)	100.0% (1248)

Chi-square: 20.81 (df = 2) p-value: 0.00003

Source: San Cristobal Health Survey 2011
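
The figures in Example 1 can be checked before submission with a short Python sketch like the one below. It uses the unweighted counts from the table, the typical minimum cell size of 10 from the rules table (your Data Use Agreement may differ), and the fact that the upper-tail probability of a chi-square statistic with 2 degrees of freedom is exp(-x/2). The variable names are illustrative only.

```python
import math

# Unweighted counts copied from the Example 1 table.
counts = {
    "Infant Died": {"Municipal Water": 18, "Well Water": 12, "Other Water": 10},
    "Infant Survived": {"Municipal Water": 883, "Well Water": 233, "Other Water": 92},
}
MIN_CELL = 10  # typical value from the rules table; an assumption for this check

all_cells = [count for row in counts.values() for count in row.values()]
smallest = min(all_cells)
total = sum(all_cells)

# With 2 degrees of freedom, P(chi-square > x) = exp(-x / 2).
p_value = math.exp(-20.81 / 2)

print("Smallest cell meets minimum:", smallest >= MIN_CELL)
print("Cells sum to reported n:", total == 1248)
print("p-value:", round(p_value, 5))
```

The smallest cell holds exactly 10 observations, so it is not below the typical minimum, and the cell counts add up to the reported n of 1248.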

Example 2: Logistic Regression

Adjusted Risk of Infant Mortality by Water Source in San Cristobal 2010

Live births between 1 July 2009 and 30 June 2010 (n=1248)

Logistic regression adjusting for urban/rural, education, family size

Outcome: Infant died before 12 months

Weighted Odds Ratios (95% confidence interval)

	Unadjusted Odds Ratio	Adjusted Odds Ratio
Well vs. municipal	2.9 (1.5 – 5.8)	2.1 (1.1 – 4.0)
Other vs. municipal	4.9 (2.8 – 8.6)	2.7 (0.9 – 8.1)
Adjustments	(unadjusted)	Location, education, family size
Sample size	1239	1239
Missing information on at least one variable	9	9
df	1236	1231

Likelihood Ratio: 17.3 (df = 5) p-value: 0.004

Source: San Cristobal Health Survey 2011
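
The sample size and df rows in Example 2 can be compared with the typical values in the rules table using a Python sketch along these lines. It assumes the df row reports the degrees of freedom remaining after estimation and uses the typical thresholds of 50 valid observations and 40 df remaining. The names are hypothetical, and the thresholds in your Data Use Agreement take precedence.

```python
MIN_SAMPLE = 50        # typical minimum sample size from the rules table
MIN_DF_REMAINING = 40  # typical minimum df remaining from the rules table
LIVE_BIRTHS = 1248     # n reported in the Example 2 header

# Values copied from the Example 2 table.
models = {
    "Unadjusted": {"sample_size": 1239, "missing": 9, "df": 1236},
    "Adjusted": {"sample_size": 1239, "missing": 9, "df": 1231},
}


def check_model(model):
    """Compare one model's reported figures with the assumed thresholds."""
    return {
        "accounts_for_all_cases": model["sample_size"] + model["missing"] == LIVE_BIRTHS,
        "sample_large_enough": model["sample_size"] >= MIN_SAMPLE,
        "enough_df_remaining": model["df"] >= MIN_DF_REMAINING,
    }


results = {name: check_model(model) for name, model in models.items()}
for name, checks in results.items():
    print(name, checks)
```

Both models pass all three checks: the 1239 valid cases plus 9 cases with missing information account for all 1248 live births.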