Preparing Data for Deposit to NAHDAP

Researchers can take a variety of steps to help expedite data through the NAHDAP archiving process. These steps also improve the quality of the data files available to the original research team and enhance collaborative research efforts well in advance of depositing data with NAHDAP. Many steps necessary to prepare data for secondary use are best done by data producers who have intimate knowledge of the data.

Preparing Data

To facilitate secondary use, it is important to fully document variables in the context of the data file as well as in the codebook. When preparing the data for archiving, please remember the following:

  1. Address confidentiality regarding the data before deposit. Phase 5: Preparing Data for Sharing in the ICPSR Guide to Social Science Data Preparation and Archiving provides suggested steps to treat data to protect respondent confidentiality before archiving. If treating the data will unduly impact the analytic utility of the data, please contact NAHDAP staff to discuss releasing the data as a restricted-use dataset.
  2. Retain ID or case identification variables in character format. Most statistical software packages sort and match character-format IDs faster than numeric IDs. Also, character IDs are better retained between software packages. If the data collection consists of two or more related datasets, clearly identify all IDs needed to link data files together, and explain the relationship among the files and the variables in the documentation.
  3. Keep dates in character or simple numeric format (i.e., no dashes or slashes). Statistical software packages use different start dates for their date variables. For example, JUN2010 is not recognized as a date format by all software packages. Alternatively, dates can be replaced by time lapse variables (e.g., days or century months between events or age at time of event).
  4. Except for ID and date variables, define all variables as numeric whenever possible. A wider range of analyses can be performed with numeric variables.
  5. Convert data files to one record per case. Complex record structures (i.e., multiple records per case, hierarchical, mixed record types) are hard for most users to work with and cause problems both for NAHDAP’s automation process and for software interoperability. For example, if individuals or institutions experience more than one “event” such as a hospital visit or doctor’s visit, create a person file that includes all of the events on the person record. Likewise, if a doctor visit file includes lab tests with individual results attached to those lab tests, structure the data so that the doctor visit is the observation and each lab test and its results are recorded as variables on the doctor visit record.
  6. Assign a set of exhaustive, mutually exclusive codes to each variable and use the same codes across variables recording the same type of responses (e.g., 0 No, 1 Yes). Provide each variable and each code with descriptive labels in the data file (e.g., the SPSS, SAS, or Stata file) to aid proper understanding of the data content. Secondary analysts rely on the data file to provide the majority of the information they use to analyze the data. Despite the best attempts to convince users to read documentation carefully, often they do not. Review labels for comprehensibility and to make sure that they clearly describe the information or question recorded in that variable. If labels in the data must be abbreviated due to length limitations, the full information should be provided in a codebook, data collection instrument, or other documentation.
  7. Assign separate missing codes for not applicable, non-response, refusals, and other types of missing data. Use numeric codes because special missing characters are often lost when converting between software packages and in preservation formats such as ASCII. If retaining blanks (i.e., “system missing”), the documentation must identify what type of missing data the blanks represent.
  8. Deposit transformed variables (i.e., variables constructed or derived from variables collected using the questionnaire). For example, scale items may be scored using the scale algorithm and stored in one or more summary scale variables. Assign labels to the transformed variables in the recode statements used to create the variables. Clearly identify data files composed only of transformed or analysis variables and clearly mark transformed variables stored with the original variables. Describe the source of the transformed variable and the method for deriving it by depositing the recode statements and/or a more extensive explanation in the codebook or other documentation. In some cases, only the recode statements and explanation may be deposited, provided secondary analysts can faithfully reproduce the transformed variable.
  9. Include the technical variables in the data file that are needed in order for statistical inference to be valid, such as weights, non-response adjustment, survey design variables, case disposition indicators, and other related variables. The documentation and labels should clearly describe how the variables were constructed and how they should be used, especially if different analyses require different weights or disposition variables. Imputation flags should also be keyed to the corresponding variable and the method of imputation should be fully explained in the documentation.
  10. Reconcile univariate statistics on each variable. Secondary data users confirm that they are reading the data properly by comparing the documentation with univariate statistics they produce from the data. Data producers are best positioned to reconcile case counts, out-of-range codes, skip patterns, and univariate distributions before deposit. Secondary analysts resort to unsupported assumptions if they are unable to reconcile the data with the documentation.
  11. Provide all documentation needed for others to sufficiently understand the data. NAHDAP often distributes the original project documentation with data. Relevant documentation includes:
    1. Data collection design documents which include study rationale, data collection strategies, and a description of post-processing decisions such as weighting, imputation, and recodes,
    2. Questionnaires or data collection instruments,
    3. Original data documentation that is not embedded in the data file or instruments, and
    4. Any working papers, technical reports, or publications associated with the data collection or substantive aspects of the data.
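
Several of the steps above (character IDs, time lapse variables in place of dates, one record per case, and separate numeric missing codes) can be illustrated with a minimal Python sketch. It is not a NAHDAP tool or a required procedure. The record layout, the variable names, the missing codes -8 and -9, and the choice to treat an absent lab result as a refusal are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical numeric missing-data codes (assumptions for illustration only;
# a real study would document its own codes in the codebook).
MISSING_NOT_APPLICABLE = -8
MISSING_REFUSED = -9

# Hypothetical long-format data: one row per doctor visit, several rows per person.
# IDs are kept as character strings so leading zeros survive.
visits = [
    {"id": "0001", "enrolled": date(2010, 5, 1), "visit_date": date(2010, 6, 1), "test_result": 3},
    {"id": "0001", "enrolled": date(2010, 5, 1), "visit_date": date(2010, 7, 15), "test_result": None},
    {"id": "0002", "enrolled": date(2010, 6, 10), "visit_date": date(2010, 6, 20), "test_result": 5},
]


def to_person_file(rows, max_visits):
    """Collapse visit rows into one record per person (wide format)."""
    people = {}
    for row in sorted(rows, key=lambda r: (r["id"], r["visit_date"])):
        person = people.setdefault(row["id"], {"id": row["id"], "n_visits": 0})
        person["n_visits"] += 1
        k = person["n_visits"]
        # Replace the calendar date with a time lapse variable: days since enrollment.
        person[f"visit{k}_days"] = (row["visit_date"] - row["enrolled"]).days
        # This sketch treats an absent result as a refusal (an assumption).
        result = row["test_result"]
        person[f"visit{k}_result"] = MISSING_REFUSED if result is None else result
    # People with fewer visits get the not-applicable code in the unused slots.
    for person in people.values():
        for k in range(person["n_visits"] + 1, max_visits + 1):
            person[f"visit{k}_days"] = MISSING_NOT_APPLICABLE
            person[f"visit{k}_result"] = MISSING_NOT_APPLICABLE
    return list(people.values())


person_file = to_person_file(visits, max_visits=2)
for record in person_file:
    print(record)
```

In this sketch every person ends up with the same set of variables, and each kind of missing value has its own numeric code that would need a label in the data file and an entry in the codebook. The value of max_visits is assumed to be at least the largest number of visits for any person.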

Guide to Best Practices

The ICPSR Guide to Social Science Data Preparation and Archiving is aimed at those engaged in the cycle of research, from applying for a research grant, through the data collection phase, and ultimately to preparation of the data for deposit in a public archive.

The Guide is a compilation of best practices gleaned from the experience of many archivists and investigators.