What are the Guidelines for Disclosure Protection and Output Vetting?
The following guidelines cover disclosure protection and output vetting when analyzing data. The purpose of output vetting is to prevent re-identification of respondents and to protect sensitive information. Some restricted data studies or series have their own standards, which are set out in the Data Use Agreement. If you are using restricted data, always check your Data Use Agreement for these standards or email DSDR@icpsr.umich.edu for guidance.
How output is vetted for disclosure protection can depend on your access method:
- If you are using public data or restricted data via the secure download access method, YOU are responsible for checking your own output.
- If you are using the Virtual Data Enclave (VDE) for access (with either restricted or restricted/public data), YOU are responsible for the FIRST check prior to requesting the release of output. ICPSR staff will then complete a final check before releasing any output from the VDE.
Disclosure-Protection Rules
Disclosure protection rules define what results from analyses involving restricted-use data may be presented or published. These rules prevent the indirect re-identification of respondents and organizations.
The following table provides general disclosure protection guidelines that can be applied to all studies, regardless of access method. Specific guidelines for disclosure may also be outlined in the Data Use Agreement for each study.
| Rule | Description | Typical Values |
|---|---|---|
| PII or PHI | Personally Identifiable Information such as names, addresses, and respondent ID cannot be released | Direct identifiers |
| Suppressed Variables | While these variables can be included in analysis, coefficients and tables for them cannot be reported | Geographic identifiers |
| Suppressed combinations of variables | While these variables can be reported separately, they may not be used together in tables or interactions | Detailed household structure |
| Minimum cell sizes | For tables, minimum allowed cell sizes. Cells below this value require rows or columns to be combined. Redaction of the individual cell is insufficient | 10 |
| Minimum sample and sub-sample size | Minimum number of valid observations (excluding missing data) for regression analysis | 50 |
| Disallowed sub-samples | Sub-samples that are not allowed even if the sub-sample meets sample size requirements | Ranking of specific places or organizations |
| Dummy variables | Dummy variables for which coefficients cannot be reported | Ranking of specific places or organizations |
| Organizations and Groups | Organizations and Groups for which results cannot be presented separately | Ranking of specific organizations |
| Nested tables | Tables that can be combined into one table | Tables may not be combined to produce another table, typically by subtracting cell counts across tables. |
| Saturated or near saturated models | Models that reproduce the data exactly | Maximum R-squared 0.5; minimum df remaining 40 |
| List cases including predicted values | An individual case or roster of cases cannot be reported | List cases and scatterplots are not allowed |
| Weights | Do results have to be weighted? | Unweighted totals may be presented for tables but not individual cells. |
| Visualizations | Maps must obscure exact locations | |
| Linkages | What data may not be linked? | Contextual linkages for geographic areas are typically allowed. Linkages at the individual level must be explicitly approved. |
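As an illustration of the minimum cell size and nested tables rules, the following is a minimal Python sketch, not an official ICPSR tool. The function names (`small_cells`, `combine_columns`, `difference`), the threshold constant, and all counts are hypothetical and chosen only for this example. It assumes the typical minimum of 10 from the table above; your Data Use Agreement may set a different value.

```python
MIN_CELL_SIZE = 10  # typical value from the table above; your Data Use Agreement may differ


def small_cells(table, minimum=MIN_CELL_SIZE):
    """Return (row, column, count) for every cell with fewer than `minimum` observations."""
    return [
        (row, col, count)
        for row, cells in table.items()
        for col, count in cells.items()
        if count < minimum
    ]


def combine_columns(table, columns, new_name):
    """Merge the named columns into one column by summing their unweighted counts."""
    combined = {}
    for row, cells in table.items():
        kept = {col: n for col, n in cells.items() if col not in columns}
        kept[new_name] = sum(cells[col] for col in columns)
        combined[row] = kept
    return combined


def difference(table_a, table_b):
    """Cell-by-cell subtraction: what a reader could derive from two nested tables."""
    return {
        row: {col: table_a[row][col] - table_b[row][col] for col in table_a[row]}
        for row in table_a
    }


# Hypothetical unweighted counts (not from any real study).
counts = {
    "Employed": {"Urban": 120, "Suburban": 45, "Rural": 7},
    "Not employed": {"Urban": 60, "Suburban": 22, "Rural": 14},
}
print(small_cells(counts))
# [('Employed', 'Rural', 7)]

# Combine columns rather than redacting the single small cell.
merged = combine_columns(counts, ["Suburban", "Rural"], "Suburban or rural")
print(small_cells(merged))
# []

# Two tables that each pass on their own can still expose a small cell by subtraction.
all_adults = {
    "Employed": {"Urban": 120, "Other": 52},
    "Not employed": {"Urban": 60, "Other": 36},
}
adults_under_65 = {
    "Employed": {"Urban": 115, "Other": 40},
    "Not employed": {"Urban": 45, "Other": 24},
}
print(small_cells(difference(all_adults, adults_under_65)))
# [('Employed', 'Urban', 5)]
```

The last check shows why nested tables matter: neither table has a cell below 10, but subtracting one from the other reveals that only 5 employed urban adults are aged 65 or over.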
VDE Output Vetting for Compliance with Disclosure Protection Rules
Please be aware that no output may be removed or transcribed from the VDE in any form without approval by DSDR or ICPSR staff. This includes sending any information by email, even simple statistics or screenshots, and even to ICPSR staff or your own project team. Doing so would constitute a violation of your legal agreement with ICPSR and the University of Michigan. Please refer to your Restricted Data Use Agreement for additional information.
Researchers are responsible for adhering to disclosure protection rules and must review their output before submitting to DSDR for disclosure review and final approval. Submitted output should not exceed 100 pages and can include Word tables, Excel spreadsheets, and graphics (preferred) or raw output from a statistical package (files and logs). SPSS output must be saved and submitted as a PDF. Files submitted for disclosure review should be well-labeled tables of frequencies, crosstabs, and/or regression estimates that could be ready for publication, presentation, etc.
To request output, please email dsdr@icpsr.umich.edu with a subject line that contains your VDE project number. Include the following information in your message:
- Name of study or data that are the basis of the results
- PI of the VDE Project
- Location of output file(s) to be reviewed
- Name(s) of output file(s) to be reviewed
In the VDE, include the following information within the output file(s) to be reviewed:
- Sample size for each regression or table; if table contains several regressions, sample size for each regression
- Sample size for each statistic
- Description of sample or sub-sample and population or sub-population for each regression or table
- Description of content of table cells
- If regression, indicate the y-variate or outcome
- If regression, indicate dummy variables
- Indicate whether the results are weighted
- Counts of observations for dummy variables
- Counts of observations for table cells
- Labels for all variables or glossary of variables
- Labels for all categories of variables
- Minimum and maximum values must have counts of observations
- Histograms and other charts must have associated tables.
Output Tips
A.) Always cite the data used. ICPSR data citations can be found on the study page for the data, at the top: “Cite this study.”
B.) For clarity, output should include:
- Name of study or data that are the basis of the results
- Sample size for each regression or table; if table contains several regressions, sample size for each regression
- Sample size for each statistic
- Description of sample or sub-sample and population or sub-population for each regression or table
- Description of content of table cells
- If regression, indicate the y-variate
- If regression, indicate dummy variables
- Indicate whether the results are weighted
- Counts for dummy variables
- Counts for table cells
- Labels for all variables or glossary of variables
Example 1: Crosstab
Risk of Infant Mortality by Water Source in San Cristobal 2010
Live births between 1 July 2009 and 30 June 2010 (n=1248)
Weighted column percentages (unweighted counts)
| | Municipal Water | Well Water | Other Water | Total |
|---|---|---|---|---|
| Infant Died | 1.5% (18) | 4.3% (12) | 6.9% (10) | 2.8% (40) |
| Infant Survived | 98.5% (883) | 95.7% (233) | 93.1% (92) | 97.2% (1208) |
| Total | 100.0% (901) | 100.0% (245) | 100.0% (102) | 100.0% (1248) |
Chi-square: 20.81 (df = 2) p-value: 0.00003
Source: San Cristobal Health Survey 2011
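A quick self-check of Example 1 before submission might look like the sketch below. It copies the unweighted counts from the table, confirms that no interior cell falls below the typical minimum of 10, and confirms that the marginal totals add up to the reported n. The variable names are illustrative, and the threshold is the typical value only; the smallest cell here is exactly 10, so it would fail under any stricter minimum in a Data Use Agreement.

```python
MIN_CELL_SIZE = 10  # typical value; check your Data Use Agreement

# Unweighted counts copied from Example 1 (interior cells only).
counts = {
    "Infant Died": {"Municipal Water": 18, "Well Water": 12, "Other Water": 10},
    "Infant Survived": {"Municipal Water": 883, "Well Water": 233, "Other Water": 92},
}
reported_n = 1248

# Cells with fewer observations than the minimum.
below_minimum = [
    (row, col, n)
    for row, cells in counts.items()
    for col, n in cells.items()
    if n < MIN_CELL_SIZE
]

# Marginal totals recomputed from the interior cells.
columns = list(next(iter(counts.values())))
column_totals = {col: sum(cells[col] for cells in counts.values()) for col in columns}
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
grand_total = sum(row_totals.values())

print(below_minimum)
# []
print(column_totals)
# {'Municipal Water': 901, 'Well Water': 245, 'Other Water': 102}
print(row_totals)
# {'Infant Died': 40, 'Infant Survived': 1208}
print(grand_total == reported_n)
# True
```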
Example 2: Logistic Regression
Adjusted Risk of Infant Mortality by Water Source in San Cristobal 2010
Live births between 1 July 2009 and 30 June 2010 (n=1248)
Logistic regression adjusting for urban/rural, education, family size
Outcome: Infant died before 12 months
Weighted Odds Ratios (95% confidence interval)
| | Unadjusted Odds Ratio | Adjusted Odds Ratio |
|---|---|---|
| Well vs. municipal | 2.9 (1.5 – 5.8) | 2.1 (1.1 – 4.0) |
| Other vs. municipal | 4.9 (2.8 – 8.6) | 2.7 (0.9 – 8.1) |
| Adjustments | (unadjusted) | Location, education, family size |
| Sample size | 1239 | 1239 |
| Missing information on at least one variable | 9 | 9 |
| df | 1236 | 1231 |
Likelihood Ratio: 17.3 (df = 5) p-value: 0.004
Source: San Cristobal Health Survey 2011
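The regression rules can be checked against Example 2 in the same spirit. The sketch below is an assumption-laden illustration, not an ICPSR procedure: `model_problems` is a hypothetical helper, the thresholds are the typical values from the rules table (minimum sample size 50, minimum df remaining 40, maximum R-squared 0.5), and it assumes the "df" row in Example 2 reports the remaining degrees of freedom. Example 2 reports no R-squared, so that check is skipped when no value is supplied.

```python
MIN_SAMPLE_SIZE = 50    # typical values from the rules table;
MIN_DF_REMAINING = 40   # your Data Use Agreement may differ
MAX_R_SQUARED = 0.5


def model_problems(n_valid, df_remaining, r_squared=None):
    """Return a list of rule violations for one regression (empty if none found)."""
    problems = []
    if n_valid < MIN_SAMPLE_SIZE:
        problems.append(f"sample size {n_valid} is below {MIN_SAMPLE_SIZE}")
    if df_remaining < MIN_DF_REMAINING:
        problems.append(f"df remaining {df_remaining} is below {MIN_DF_REMAINING}")
    if r_squared is not None and r_squared > MAX_R_SQUARED:
        problems.append(f"R-squared {r_squared} is above {MAX_R_SQUARED}")
    return problems


# Values copied from Example 2.
models = {
    "Unadjusted": {"n_valid": 1239, "df_remaining": 1236},
    "Adjusted": {"n_valid": 1239, "df_remaining": 1231},
}
for name, model in models.items():
    print(name, model_problems(**model))
# Unadjusted []
# Adjusted []

# Valid cases plus cases with missing information should equal the full sample.
print(1239 + 9 == 1248)
# True

# A hypothetical near-saturated model on a small sub-sample trips all three checks.
print(len(model_problems(n_valid=48, df_remaining=12, r_squared=0.93)))
# 3
```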