Guide to Social Science Data Preparation and Archiving
Phase 6: Depositing Data

In addition to adhering to the specific requirements of a data archive, data creators who intend to deposit their data should be aware of what the Open Archival Information System (OAIS) Reference Model specifies for a deposit. See ICPSR’s Digital Preservation site regarding the Submission Information Package (SIP). The SIP includes a deposit form, the original files, and associated study-level and variable-level metadata (e.g., codebooks).

When preparing data for final deposit in a data archive, it is important to consider the factors detailed below. Since the following list is not all-inclusive, data creators should watch for developments relating to the digital preservation of social science data.

File Formats

If a dataset is to be archived, it must be organized in such a way that others can read it. Ideally, the dataset should be accessible using a standard statistical package, such as SAS, SPSS, or Stata. Three common approaches to data file preparation are: (1) provide the data in raw ASCII format, along with setup files to read them into standard statistical programs; (2) provide the data as a system file within a specific analysis program; or (3) provide the data in a portable file produced by a statistical program. Each of these alternatives has its advantages and disadvantages.

Software-specific system files. System files are compact and efficient, and archives increasingly encourage the deposit of system files and use this format for dissemination. Older system files may not always be cross-platform compatible, however. Newer versions of statistical software packages not only incorporate new data management and analytical features, but may also support new operating systems and hardware. In such cases, previous versions of system files may need to be migrated to newer versions. To prepare system files, consult the user manual for the statistical software of your choice.

Portable software-specific files. Some archives prefer to receive data in portable or transport file format. Portable versions of software-specific files have the advantage that they can be accessed on any hardware platform. SPSS calls transportable files “portable,” and SAS calls them “transport” files, while Stata data files need no portable equivalent. However, users should take care to preserve missing data values. When SAS transport files are generated, missing data are blanked out unless SAS alpha missing codes are used. This can be a problem because the distinctions between different types of missing data (such as legitimate skip vs. refused) become irretrievable, and these distinctions may be very important to the secondary analyst. When preparing SAS transport files, it is recommended that the missing data command not be activated but that separate program files be created instead. SPSS maintains the original missing data values when creating portable files. Stata allows the user to assign alpha missing values, and since no separate transportable files are created, alpha missing values are not affected.
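The loss described above can be illustrated with a minimal Python sketch. The numeric codes (-7, -8, -9) and their labels are hypothetical and are not drawn from any particular study or statistical package; the point is only that once every missing code is blanked out, the different reasons for missingness can no longer be told apart.

```python
# Hypothetical missing-data codes; real studies define their own.
MISSING_CODES = {-7: "legitimate skip", -8: "don't know", -9: "refused"}

def blank_out(values, missing_codes):
    """Mimic a lossy export: every missing code becomes None (a blank)."""
    return [None if v in missing_codes else v for v in values]

def missing_counts(values, missing_codes):
    """Count how often each type of missing data occurs."""
    counts = {label: 0 for label in missing_codes.values()}
    for v in values:
        if v in missing_codes:
            counts[missing_codes[v]] += 1
    return counts

responses = [3, -7, 5, -9, -9, 2, -7]

# With the codes preserved, each type of missing data can still be counted.
preserved = missing_counts(responses, MISSING_CODES)

# After blanking, only the total number of missing values is recoverable.
collapsed = blank_out(responses, MISSING_CODES)
total_blank = sum(1 for v in collapsed if v is None)

print(preserved)
print(total_blank)
```

A secondary analyst given only the blanked values would know that four responses are missing, but not how many were skips and how many were refusals.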

A problem also surfaces with respect to SAS proc formats (value labels), which are not stored in SAS transport data files. SAS proc formats can be provided using program files or stored in SAS catalog files, which are operating-system-specific. The best approach is to provide user-defined SAS proc formats and formats in separate program files.

ASCII data plus setup files. For this option, it is necessary to determine which syntax will be used -- SAS, SPSS, Stata, or another statistical program. In the case of large datasets, for which users will want to create subsets, the setup files can be edited to meet specific needs. Many archives view ASCII (raw) data files as the most stable format for preserving data. They are software-independent, and hence are apt to remain readable in the future, regardless of changes in particular statistical software packages. Most archives are capable of producing ASCII data and setup files from data files provided in proprietary formats. If a researcher has maintained the dataset in ASCII and read it into a statistical package for analysis, a raw ASCII data file may be the most cost-efficient way to archive the data.

Writing an ASCII file can be time-consuming and prone to error, even when a software system has been used to store the data. For example, if SAS has been used to manage and analyze a dataset, the following steps are required: writing SAS statements to export the data in ASCII format, checking carefully to make sure the conversion procedure worked properly, and creating documentation telling users where to find variables in the ASCII data file.
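The same three steps (export, check, document) apply whatever software is used. The following is a minimal sketch in Python rather than SAS, assuming a hypothetical three-variable layout; the variable names, widths, and values are illustrative only.

```python
# Hypothetical layout: (variable name, field width). Not from a real study.
LAYOUT = [("CASEID", 5), ("AGE", 3), ("INCOME", 7)]

def column_map(layout):
    """Return (name, start, end) using 1-based, inclusive column positions."""
    positions = []
    start = 1
    for name, width in layout:
        end = start + width - 1
        positions.append((name, start, end))
        start = end + 1
    return positions

def format_record(record, layout):
    """Right-justify each value in its field; fail if a value is too wide."""
    fields = []
    for name, width in layout:
        text = str(record[name])
        if len(text) > width:
            raise ValueError(f"{name}={text!r} exceeds width {width}")
        fields.append(text.rjust(width))
    return "".join(fields)

def parse_record(line, layout):
    """Read a line back using the column map, to check the conversion."""
    return {name: line[start - 1:end].strip()
            for name, start, end in column_map(layout)}

records = [
    {"CASEID": 1, "AGE": 34, "INCOME": 52000},
    {"CASEID": 2, "AGE": -9, "INCOME": -8},  # -9 and -8: illustrative missing codes
]

# Step 1: export each record as one fixed-width ASCII line.
lines = [format_record(r, LAYOUT) for r in records]

# Step 2: check the conversion by reading every line back.
for original, line in zip(records, lines):
    reread = parse_record(line, LAYOUT)
    assert all(reread[k] == str(v) for k, v in original.items())

# Step 3: document where each variable is located in the file.
for name, start, end in column_map(LAYOUT):
    print(f"{name:<8} columns {start}-{end}")
```

Deriving the column documentation from the same layout that drives the export keeps the two from drifting apart, which is one common source of error in hand-written setup files.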

Online analysis-ready files. Online data exploration and analysis packages allow users not only to perform analysis online, but also to select, in the form of subsets, only those variables and cases actually required for an analysis. Increasingly, these systems accept DDI XML as input. Depositing documentation in DDI facilitates online analysis after archival deposit.

Other file formats

Video files. Video file formats are changing rapidly. These changes bring improvements to quality and flexibility while reducing the size of the compressed file. It is essential to deposit the source files for video data, along with the compressed files, to ensure the long-term playability of video files. With the source file, which maintains all the original captured information, an archive is able to migrate video data to better formats over time. The compressed file, in comparison, already has much of the original captured information removed. Without access to the source file, an archive is much more limited in its ability to ensure playability over time, because the technologies of the future may require file information not kept in a compressed file.

Technical information, such as the video file type, compression format, and video source, should be included, as well as a written summary of what the video contains. Ideally, the summary should have time codes indicating where significant events occur in the video. A text file with time-coded subtitles for hearing-impaired viewers would also be valuable to include.
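As one possible way to assemble this information, the sketch below builds a small descriptive record with a time-coded summary. The field names, file details, and events are all hypothetical, and JSON is used only for illustration; an archive may prescribe a different structure.

```python
import json

def format_timecode(total_seconds):
    """Convert elapsed seconds into an HH:MM:SS time code."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# Hypothetical significant events: (elapsed seconds, description).
events = [
    (0, "Interviewer introduces the study"),
    (95, "Respondent describes household composition"),
    (3725, "Closing questions"),
]

# Hypothetical technical details; replace with the actual values.
description = {
    "file_type": "example container format",
    "compression_format": "example codec",
    "video_source": "example camera",
    "summary": [
        {"time_code": format_timecode(sec), "event": text}
        for sec, text in events
    ],
}

print(json.dumps(description, indent=2))
```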

Geospatial data files. When coordinate-based geographic data are used as units of analysis or variables, the researcher must submit to the archive the relevant geometry files (or information on how to access them) to permit others to recreate or extend the original analysis using the same boundaries. This is encouraged even if the boundary file is easily obtained from the U.S. Census Bureau or from a known third party, and is absolutely necessary if the original spatial analysis used specially created zones. Generally, depositors can submit the geometry (boundary) files in one compressed file containing all of the files that produce the geometry (e.g., single geographic layer visualization, map visualization) for any geographic information system (GIS). Corresponding project files, geospatial metadata, and geocoding rates should also be submitted. Finally, depositors should ensure that issues of proprietary visualizations and/or data have been addressed prior to archiving, with the understanding that all archived data will be available for distribution.

Archiving files from analysis of existing or secondary data

For projects that do not involve original data collection or may involve combining data from one or more existing sources, the decision regarding whether or what to archive may be less clear. This decision should be made in conjunction with an archive, as archives can differ in their acquisition policies. Here are some guidelines to consider:

  • Existing data not publicly available. If the existing data used for analysis are not already publicly available, researchers are encouraged to submit the data for archiving, with the permission of the original data producer.
  • Combination of primary and existing data. If the researcher collects some primary data and also appends existing data to it for analyses, the guiding questions on whether to submit the whole dataset or just the primary dataset to the archive are: (a) how easily the existing data can be linked to the primary data, and (b) whether the existing data are publicly available.

Linked data

Linked census data. When the primary data are linked to census data, the linked census data should also be archived, even though the link between the data files is straightforward and the census data are publicly available. Since the original census files are large and contain many variables, determining which census variables to use and at what level to extract the data for the subsets can be time-consuming. Archiving the linked census data makes it unnecessary for other users to repeat these subsetting steps.

Straightforward links. If the linkage is straightforward and the existing data are publicly available, then users can easily obtain the existing data themselves and link them to the primary data submitted by the researcher. In this case, the project report(s) should clearly identify the source of the existing data including version and/or date so that other users know which data to obtain. Information about the variable (or combination of variables) that constitutes the unique identifier used to link the data also should be provided.
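A straightforward link of this kind can be sketched in a few lines of Python. The identifier name (tract_id), the source description, and all values below are hypothetical; the sketch only illustrates recording the source version and linking on a unique identifier.

```python
# Hypothetical description of the existing data, including version and date.
EXISTING_SOURCE = {
    "title": "Example tract-level file (hypothetical)",
    "version": "v2",
    "retrieved": "2020-01-15",
}

def link_by_key(primary, existing, key):
    """Attach existing-source fields to each primary record via a unique key."""
    lookup = {}
    for row in existing:
        if row[key] in lookup:
            raise ValueError(f"{key}={row[key]!r} is not unique in existing data")
        lookup[row[key]] = row
    linked, unmatched = [], []
    for row in primary:
        match = lookup.get(row[key])
        if match is None:
            unmatched.append(row[key])
        else:
            # Primary values take precedence if a field name appears in both.
            linked.append({**match, **row})
    return linked, unmatched

primary = [
    {"tract_id": "001", "respondents": 12},
    {"tract_id": "002", "respondents": 7},
    {"tract_id": "999", "respondents": 3},
]
existing = [
    {"tract_id": "001", "population": 4100},
    {"tract_id": "002", "population": 2950},
]

linked, unmatched = link_by_key(primary, existing, "tract_id")
print(linked)
print("No match for:", unmatched)
```

Reporting unmatched identifiers, rather than silently dropping them, gives other users a way to confirm that they obtained the same version of the existing data.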

Links that are not straightforward. If the linkage between datasets is not straightforward, then the researcher is providing a useful service by archiving the linked data. Examples of this include: (a) linkage requiring judgments about combinations of nonunique variables, such as age, sex, and race of an individual and date of incident; and (b) linkage requiring an understanding of local geographic factors, for example, neighborhoods or block levels, especially over a range of years when boundaries shift. Here, the redundancy of having data stored twice at the archive is outweighed by the usefulness of providing others with the data already linked.
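To show why such links call for judgment, the following minimal sketch matches records on a combination of nonunique variables and flags the cases where more than one candidate is found. The variable names, record identifiers, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical combination of nonunique variables used for matching.
KEYS = ("age", "sex", "race", "incident_date")

def classify_matches(primary, existing, keys):
    """Map each primary id to (status, list of candidate record ids)."""
    index = defaultdict(list)
    for row in existing:
        index[tuple(row[k] for k in keys)].append(row["rec"])
    result = {}
    for row in primary:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:
            status = "unique"
        elif candidates:
            status = "ambiguous"  # needs a human judgment to resolve
        else:
            status = "none"
        result[row["pid"]] = (status, list(candidates))
    return result

primary = [
    {"pid": "A", "age": 34, "sex": "F", "race": "W", "incident_date": "2001-03-02"},
    {"pid": "B", "age": 51, "sex": "M", "race": "B", "incident_date": "2001-07-19"},
    {"pid": "C", "age": 22, "sex": "M", "race": "W", "incident_date": "2001-09-30"},
]
existing = [
    {"rec": 1, "age": 34, "sex": "F", "race": "W", "incident_date": "2001-03-02"},
    {"rec": 2, "age": 51, "sex": "M", "race": "B", "incident_date": "2001-07-19"},
    {"rec": 3, "age": 51, "sex": "M", "race": "B", "incident_date": "2001-07-19"},
]

result = classify_matches(primary, existing, KEYS)
for pid, (status, candidates) in result.items():
    print(pid, status, candidates)
```

Each ambiguous case requires a decision that another user would otherwise have to reconstruct, which is why depositing the already linked file is useful.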

Derived variables. Often, after the data are linked, the researcher may compute new variables based on the linked data (e.g., new categories are created, rates are produced, or scales are developed). All useful derived variables should also be archived, especially if they are used in analyses included in publications. The derived variables may be deposited in a data file that includes both the primary data collected for the project and the existing data from another source, or they may be deposited with the primary data alone. The code or setup file used to link the files and create the derived variables should also be provided.
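The kind of code worth depositing alongside derived variables might look like the sketch below, which produces a rate and a category from linked data. The variable names, the per-1,000 scaling, and the cutoff of 10 are assumptions made for illustration only.

```python
HIGH_RATE_CUTOFF = 10.0  # hypothetical threshold, per 1,000 population

def rate_per_1000(events, population):
    """Return events per 1,000 population, or None if it cannot be computed."""
    if events is None or not population:
        return None
    return events * 1000 / population

def add_derived(row):
    """Return a copy of the row with the derived variables appended."""
    rate = rate_per_1000(row.get("incidents"), row.get("population"))
    if rate is None:
        category = None
    elif rate >= HIGH_RATE_CUTOFF:
        category = "high"
    else:
        category = "low"
    return {**row, "incident_rate": rate, "rate_category": category}

linked_rows = [
    {"tract_id": "001", "incidents": 25, "population": 5000},
    {"tract_id": "002", "incidents": 60, "population": 3000},
    {"tract_id": "003", "incidents": 4, "population": 0},
]

derived_rows = [add_derived(row) for row in linked_rows]
for row in derived_rows:
    print(row)
```

Keeping the cutoff and the handling of zero or missing denominators explicit in the code documents decisions that would otherwise be invisible in the deposited data file.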

Programming code. If the project involves only analysis of data already publicly available and the product of the project is the analysis alone, then data may not need to be submitted for archiving. However, researchers are encouraged to deposit their programming code that created new variables or scales, especially if the derived variables are not deposited within a data file and are cited in publications.