Tips for Preparing a Data Deposit

RWJF grantees can take a variety of steps to help move data through the archiving process more quickly. These steps also improve the quality of the data files available to the original research team and enhance collaborative research efforts well in advance of depositing data with HMCA. Many steps necessary to prepare data for secondary use are best done by data producers, who have intimate knowledge of the data.

Preparing Data

To facilitate secondary use, it is important to fully document variables in the context of the data file as well as in the codebook. When preparing the data for archiving, please remember the following:

1. Address confidentiality regarding the data before deposit. We cannot accept any Personally Identifiable Information (PII).
  - Names, telephone numbers, addresses, and any other direct identifiers should be removed or masked.
2. Create new ID or case identification variables for the anonymized dataset. ID variables in character format are recommended, as most statistical software sorts and matches character-format IDs faster than numeric IDs. Also, character IDs are better retained between software packages. If the data collection consists of two or more related datasets, clearly identify all identification variables needed to link data files together, and explain the relationship among the files and the variables in the documentation.
3. Except for ID and date variables, define all variables as numeric whenever possible. A wider range of analyses can be performed with numeric variables.
4. Convert data files to one record per case. Complex record structures (e.g., multiple records per case, hierarchical, mixed record types) are difficult for most users to work with and cause problems both for ICPSR’s automation process and for software interoperability. For example, if individuals or institutions experience more than one “event” such as a hospital visit or doctor visit, create a person file that includes all of the events on the person record. Likewise, if a doctor visit file includes lab tests with individual results attached to those lab tests, structure the data so that the doctor visit is the observation and each lab test and its results are recorded as variables on the doctor visit record.
5. Assign a set of exhaustive, mutually exclusive codes to each variable and use the same codes across variables recording the same type of responses (e.g., 0 No, 1 Yes). Provide each variable and each code with descriptive labels in the data file (e.g., the SPSS, SAS, or Stata file) to aid proper understanding of the data content. Secondary analysts rely on the data file to provide the majority of the information they use to analyze the data. Despite the best attempts to convince users to read documentation carefully, often they do not. Review labels for comprehensibility and to make sure that they clearly describe the information or question recorded in that variable. If labels in the data must be abbreviated due to length limitations, the full information should be provided in a codebook, data collection instrument, or other documentation.
6. Assign separate missing codes for not applicable, non-response, refusals, and other types of missing data. Use numeric codes because special missing characters are often lost when converting between software packages and in preservation formats such as ASCII. If retaining blanks (i.e., “system missing”), the documentation must identify what type of missing data the blanks represent.
7. Deposit transformed variables (i.e., variables constructed or derived from other variables). For example, scale items may be scored using the scale algorithm and stored in one or more summary scale variables. Assign labels to the transformed variables in the recode statements used to create the variables. Clearly identify data files composed only of transformed or analysis variables, and clearly mark transformed variables stored with the original variables. Describe the source of the transformed variable and the method for deriving it by depositing the recode statements and/or a more extensive explanation in the documentation. In some cases, only the recode statements and explanation may be deposited, provided secondary analysts can faithfully reproduce the transformed variable from them.
8. Include in the data file the technical variables needed for valid statistical inference, such as weights, non-response adjustments, survey design variables (stratum and cluster), case disposition indicators, and other related variables. The documentation and labels should clearly describe how the variables were constructed and how they should be used, especially if different analyses require different weights or disposition variables. Imputation flags should also be keyed to the corresponding variable, and the method of imputation should be fully explained in the documentation.
9. Reconcile univariate statistics on each variable. Secondary data users confirm that they are reading the data properly by comparing the documentation with univariate statistics they produce from the data. Data producers are best positioned to reconcile case counts, out-of-range codes, skip patterns, and univariate distributions before deposit. Secondary analysts resort to unsupported assumptions if they are unable to reconcile the data with the documentation.
10. Provide all documentation needed for others to sufficiently understand the data. HMCA often distributes the original project documentation with the data. Relevant documentation includes:
  - Research project documents, such as the Data Management Plan, informed consent, the final study protocol, final amendments, and statistical analysis plan
  - Description of post-processing decisions such as weighting, imputation, and recodes
  - Descriptions of any de-identification changes made to variables to address disclosure risk
  - Measures, assessments, case report forms, or other data collection instruments
  - Original data documentation that is not embedded in the data file or instruments, such as code used for analysis or to create derived variables
  - Any working papers, technical reports, or publications associated with the data collection or substantive aspects of the data
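
As a rough illustration of steps 1, 2, 4, 6, and 9 above, the following Python sketch drops direct identifiers, assigns new character-format case IDs, flattens multiple visit records into one record per case, applies separate numeric missing codes, and tabulates a simple frequency for comparison with the codebook. The field names, the missing-code values, and the "C"-prefixed ID format are hypothetical, chosen only for illustration; they are not HMCA requirements, and most projects would do the equivalent work in SPSS, SAS, or Stata.

```python
from collections import Counter

# Hypothetical numeric missing codes; the real values are a project decision
# and must be documented in the codebook.
MISSING = {"not_applicable": -1, "refused": -2, "no_response": -3}

# Hypothetical source data. "source_id" and "name" stand in for identifiers
# that must not appear in the deposited file.
persons = [
    {"source_id": 1042, "name": "A. Example", "smoker": 1},
    {"source_id": 977, "name": "B. Example", "smoker": None, "why_missing": "refused"},
    {"source_id": 1310, "name": "C. Example", "smoker": 0},
]
visits = [  # multiple records per case in the source data
    {"source_id": 1042, "systolic": 128},
    {"source_id": 1042, "systolic": 131},
    {"source_id": 1310, "systolic": 117},
]
MAX_VISITS = 2  # assumed upper bound on visits per person in this toy example


def build_crosswalk(source_ids):
    """Map original IDs to new zero-padded character IDs.

    Sorted order keeps this sketch deterministic; a real project may prefer
    a random order. The crosswalk itself stays with the research team.
    """
    ordered = sorted(set(source_ids))
    width = len(str(len(ordered)))
    return {sid: f"C{n:0{width}d}" for n, sid in enumerate(ordered, start=1)}


def one_record_per_case(persons, visits, crosswalk):
    """Return one row per person, with each visit stored as its own variable."""
    rows = []
    for person in persons:
        sid = person["source_id"]
        smoker = person["smoker"]
        if smoker is None:
            # Use a distinct numeric code for each type of missing data.
            smoker = MISSING[person.get("why_missing", "no_response")]
        # Only the new ID and substantive variables are carried forward.
        row = {"case_id": crosswalk[sid], "smoker": smoker}
        own = [v["systolic"] for v in visits if v["source_id"] == sid]
        for k in range(MAX_VISITS):
            value = own[k] if k < len(own) else MISSING["not_applicable"]
            row[f"visit{k + 1}_systolic"] = value
        rows.append(row)
    return rows


crosswalk = build_crosswalk(p["source_id"] for p in persons)
rows = one_record_per_case(persons, visits, crosswalk)

# Univariate check: these counts should match the frequencies in the codebook.
smoker_counts = Counter(row["smoker"] for row in rows)

for row in rows:
    print(row)
print(dict(smoker_counts))
```

Keeping the missing codes in a single lookup table makes it easier to apply the same codes across variables and to copy them into the codebook unchanged.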

Online Deposit Form

The Deposit Manager is a customizable deposit workspace that provides a secure upload for the deposit of data and documentation files. Depositors can use a collaboration space, complete the form over multiple sessions, upload many accepted file types and sizes, describe their data using standard metadata fields, and view their deposit status and history of deposits.

Deposit with HMCA

Make sure “HMCA: Data collected under grants from the Robert Wood Johnson Foundation” is selected for Archive.