Guide to Social Science Data Preparation and Archiving
Phase 4: Data Analysis

In this chapter, we turn to important issues that should be addressed during the analysis phase when project staff are actively working with data files to investigate their research questions.

Master datasets and work files

As analysis proceeds, there will be various changes, additions, and deletions to the dataset. Despite the most rigorous data cleaning, additional errors will undoubtedly be discovered. The need to construct new variables might arise. Staff members might want to subset the data by cases and/or variables. Thus, there is a good chance that before long multiple versions of the dataset will be in use. It is not uncommon for a research group to discover that when it comes time to prepare a final version of the data for archiving, there are multiple versions that must be merged to include all of the newly created variables. This problem can be avoided to a degree if the research files are stored on a network where a single version of the data is maintained.

It is a good practice to maintain a master version of the dataset that is stored on a read-only basis. Only one or two staff members should be allowed to change this dataset. Ideally, this dataset should be the basis of all analyses, and other staff members should be discouraged from making copies of it. If a particular user of the data wants to create new variables and save them, a choice should be made between creating a work file for that researcher or adding the new variables to the master dataset. If the latter route is chosen, then all of the standard checks for outliers, inconsistencies, and the like need to be made on the new variables, and full documentation should be prepared. The final dataset reflecting published analyses is the version to archive.
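One way to enforce read-only storage is through file permissions. The following is a minimal sketch in Python, assuming the master dataset is an ordinary file on a system that honors permission bits; the function names are hypothetical, and a shared network drive may need access controls set by an administrator instead.

```python
import os
import stat


def make_read_only(path):
    """Clear the write-permission bits (owner, group, other) on a file."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def is_read_only(path):
    """Return True if the owner write bit is not set on the file."""
    return not (os.stat(path).st_mode & stat.S_IWUSR)
```

The one or two staff members responsible for the master file would restore write permission only for the duration of an approved change.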

Data and documentation versioning

One way to keep track of changes is to maintain explicit versions of a dataset. The first version might result from the data collection process, the second version from data cleaning, the third from composite variable construction, and so forth. With explicit version numbers, which are reflected in dataset names, it becomes easier to match documentation to datasets and to keep track of what was done by whom and when.
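As an illustration, version numbers can be built into dataset names in a fixed pattern so that they can also be read back by a program. The naming pattern below (stem, "_v", two-digit version, extension) is only an assumption for the sketch, not a required convention.

```python
import re


def versioned_name(stem, version, ext="csv"):
    """Build a file name such as 'survey_v02.csv' from a stem and version."""
    return f"{stem}_v{version:02d}.{ext}"


def parse_version(filename):
    """Return the version number embedded in a file name, or None."""
    match = re.search(r"_v(\d+)\.[^.]+$", filename)
    return int(match.group(1)) if match else None


print(versioned_name("survey", 2))      # survey_v02.csv
print(parse_version("survey_v02.csv"))  # 2
```

Using the same pattern for the codebook (for example, a hypothetical "survey_codebook_v02.txt") makes it easy to see which documentation belongs to which dataset.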

The documentation process starts at the beginning of the project and is ongoing, reflecting changes, additions, and deletions to the documentation. Here are a few suggestions to keep track of the various versions of the documentation files that will inevitably develop:

  • Establish documentation versions similar to those used for the data. Versions could be established in the following manner: the first version contains results from the data collection phase, the second version results from the data cleaning phase, and the third version adds any constructed variables, if applicable, to the end of the codebook, with appropriate labels and the formulas used to create them recorded when the variables are created.
  • Keep a separate change file that tracks changes to the documentation.
  • Denote changes in working documents with special characters (e.g., use ??? or ***) that facilitate search, review, and replacement during the creation of the final version of the documentation file.
  • Conduct a review of the final files to make sure the data and documentation are harmonized, i.e., that the final version of the documentation accurately corresponds to the final version of the data.
  • Store final electronic versions of instruments and reports on a read-only basis.
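The special characters suggested above are useful because they can be found mechanically. The sketch below lists every line of a working document that still carries a marker; the function name and the sample codebook text are hypothetical.

```python
def find_flagged_lines(text, markers=("???", "***")):
    """Return (line_number, line) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.strip()))
    return hits


draft = (
    "Q1 Age\n"
    "Q2 Income ??? check top-code\n"
    "Q3 Sex\n"
    "*** Q4 added in version 3\n"
)
for number, line in find_flagged_lines(draft):
    print(number, line)
```

An empty result on the final documentation file is one simple check that no unresolved notes remain.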

Raw data versus statistical system files

Data may be maintained for analysis purposes in a number of different formats. From the standpoint of data storage, system files take up less space than raw ASCII data and permit the user to perform analytic tasks much more readily. System files, which are the proprietary formats of the major statistical programs, are extremely efficient because the statistical package reads in the data values and the various data specifications, labels, missing data codes, and so on, only once and then accesses the system file directly afterwards. Because the data are stored on disk directly in the machine’s internal representation, the step of translating the ASCII data each time to the internal binary representation of the specific machine is avoided.

Many research groups use system files for all their data analysis and data storage after the first reading of the ASCII version. Although this is an efficient way to work, it is important to keep in mind that system files created in older versions of statistical packages may be readable only on the specific systems that created them. Recent versions of most software, however, produce files such as export/transport files or portable files that are compatible across platforms and systems. These kinds of files preserve all of the variable labeling and identification information in a format suitable for long-term preservation. Increasingly, these are the formats that archives prefer to receive. However, data producers should consider the implications of software changes during the project to make certain that stored copies of data remain readable and understandable.

A note on non-ASCII characters: Avoid the use of nonstandard character sets when you create archival quality documentation that will be used by a wide range of people over time. Be sure to remove non-ASCII characters from data and documentation files. Often, these characters are generated by proprietary word processing packages. For example, the curly apostrophe in the text string ‘Respondent’s Age’ is not an ASCII character at all, so it is stored with a different binary code than the straight ASCII apostrophe in ‘Respondent's Age’.
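Non-ASCII characters can be located with a short script before files are finalized. The following is a minimal sketch; the function names are hypothetical, and the replacement table covers only curly quotation marks, so other characters would need to be reviewed by hand.

```python
import unicodedata

# Curly quotation marks mapped to their straight ASCII equivalents.
REPLACEMENTS = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}


def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for each non-ASCII character."""
    found = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                found.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return found


def straighten_quotes(text):
    """Replace curly quotation marks with straight ASCII ones."""
    return text.translate(str.maketrans(REPLACEMENTS))


label = "Respondent\u2019s Age"
print(find_non_ascii(label))
print(straighten_quotes(label).isascii())  # True
```

The report gives the line and column of each character, which helps when the offending text is buried in a long codebook.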

File structure

Flat rectangular files. Having collected data, the researcher is faced with the question of what form the computer record should take. For the vast majority of datasets this is a very simple decision: the data for each case are organized in one long record, variable after variable. Typically, an ID number comes first, followed by the set of variables collected on each subject. This is referred to as a rectangular record, or a flat file. The term comes about because each observation has exactly the same amount of information, so every record has the same length. For the vast majority of studies, the length of the record is irrelevant. Data analysis programs can read very long records containing thousands of columns of data. In an ASCII file, each character of information occupies one byte.

Hierarchical files. Although long records are not a problem for most users, large datasets may be difficult to store, even in this age of generous disk storage space. As a result, it is desirable to reduce the amount of blank space on a record. Blank space typically results when a set of variables is not applicable for the respondent. For example, consider a survey in which the interview elicits detailed information on each of the respondent’s children, with the interview protocol allowing up to 13 children. For most respondents, almost all of this information is blank in the sense that no information is collected, although a code to indicate “Inapplicable” may appear on the record. Suppose that the average respondent has two children and that for each child 40 bytes of data are collected. With a sample of 8,000 cases, this means that the file contains about 3.5 megabytes of blanks (8,000 respondents x 11 “missing children” x 40 bytes of data).
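The arithmetic in this example can be checked directly. The sketch below simply restates the figures from the paragraph above; the function and parameter names are hypothetical.

```python
def blank_bytes(respondents, max_slots, avg_used, bytes_per_slot):
    """Bytes of unused space when every record reserves max_slots slots."""
    unused_slots = max_slots - avg_used
    return respondents * unused_slots * bytes_per_slot


# 8,000 respondents, up to 13 children, 2 on average, 40 bytes per child.
print(blank_bytes(8000, 13, 2, 40))  # 3520000
```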

In this case, one should consider other ways of storing the data. One option is to create a hierarchical record. In the ASCII file structure, there is a header record containing information on the number of children and a varying number of secondary records, one for each child. From the standpoint of data storage, this is very efficient, but it increases the complexity of the programming task substantially. Most major statistical packages will allow the user to read such data, but some programming is required to produce the rectangular record required for the analysis phase. Complex files like these can save a great deal of disk space, but analyzing them requires more sophisticated knowledge of data analysis software on the part of the user.
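To give a sense of the programming involved, here is a minimal sketch that reads a hierarchical file and produces rectangular records. The layout is an assumption made for illustration only: comma-separated lines in which "H" marks a header record (family ID, number of children) and "C" marks a child record (family ID, child number, age). Real hierarchical files are often fixed-width and follow the layout given in their own documentation.

```python
def read_hierarchical(lines):
    """Group child records under the header record for the same family."""
    families = {}
    for line in lines:
        fields = line.strip().split(",")
        if fields == [""]:
            continue  # skip blank lines
        if fields[0] == "H":
            families[fields[1]] = {"n_children": int(fields[2]), "ages": []}
        elif fields[0] == "C":
            families[fields[1]]["ages"].append(int(fields[3]))
        else:
            raise ValueError(f"unknown record type: {fields[0]!r}")
    return families


def to_rectangular(families, max_children):
    """Pad each family to max_children slots so all rows have equal length."""
    rows = []
    for family_id, info in families.items():
        padding = [None] * (max_children - len(info["ages"]))
        rows.append([family_id, info["n_children"]] + info["ages"] + padding)
    return rows


sample = ["H,001,2", "C,001,1,7", "C,001,2,4", "H,002,0"]
for row in to_rectangular(read_hierarchical(sample), max_children=3):
    print(row)
```

The padded slots in the output are exactly the blank space that the hierarchical layout avoids storing.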

A second approach to this problem -- the preferred approach -- is to form separate files for the two kinds of records: one file for respondents and another file for children. This approach has the advantage of allowing a user to work with a rectangular respondent record, skipping the child records entirely if they are not of interest. On the other hand, if the children are of interest, then the secondary analyst can write merge routines to match the respondents’ and the children’s data. The approach is therefore flexible: the files can be analyzed separately or merged, as needed.
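A merge routine of the kind mentioned here can be quite short. The sketch below matches each child record to its respondent through a shared key; the variable names (family_id, resp_age, child_age) and the sample values are hypothetical, and the two files are assumed to have no other variable names in common.

```python
def merge_children(respondents, children, key="family_id"):
    """Attach the respondent's variables to each matching child record."""
    by_key = {resp[key]: resp for resp in respondents}
    merged = []
    for child in children:
        parent = by_key[child[key]]  # raises KeyError for an unmatched child
        merged.append({**parent, **child})
    return merged


respondents = [
    {"family_id": 1, "resp_age": 41},
    {"family_id": 2, "resp_age": 35},
]
children = [
    {"family_id": 1, "child_age": 7},
    {"family_id": 1, "child_age": 4},
]
for record in merge_children(respondents, children):
    print(record)
```

Respondents without children (family 2 here) simply do not appear in the merged child-level file, while the respondent file remains rectangular and usable on its own.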

Relational databases. A relational database is a collection of data tables that are linked together through defined associations. For example, a database that includes a ‘respondents’ table and a ‘children’ table, as in the last example, would use a key variable (“Family ID”) to associate children with their parents. Relational databases allow a user to perform queries that select rows with specific attributes or combine data from multiple tables to produce customized tables, views, or reports. To preserve relational databases, users should export the database tables as flat rectangular files and preserve the table relationships using, for instance, SQL schema statements. When databases are used as survey instruments or other data input/output mechanisms, the look and feel of the user interface can be preserved by creating a static PDF image of the interface. Promising software is currently under development to normalize relational databases into non-proprietary formats such as XML.
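As one illustration of this export step, the sketch below uses Python's built-in sqlite3 module to write each table as comma-separated text and to collect the SQL statements that define the tables and their relationships. It assumes a SQLite database; other database systems provide their own export tools, and the function name is hypothetical.

```python
import csv
import io
import sqlite3


def export_database(conn):
    """Return (schema_sql, {table_name: csv_text}) for a SQLite connection."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    # The stored CREATE TABLE statements record keys and table relationships.
    schema = "".join(sql + ";\n" for _, sql in rows)
    tables = {}
    for name, _ in rows:
        result = conn.execute(f'SELECT * FROM "{name}"')
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow([column[0] for column in result.description])
        writer.writerows(result)
        tables[name] = buffer.getvalue()
    return schema, tables
```

Each CSV text would be saved as its own flat file, with the schema text stored alongside as documentation of how the tables link together.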

Longitudinal/multi-wave study files. Many multiple data file studies are longitudinal, that is, they contain data collected from the same individuals over multiple points in time, or waves. Longitudinal studies often consist of hierarchical files. For longitudinal data, it is important to make file information as consistent as possible across waves. Data should include clearly specified linking identifiers, such as respondent IDs, which are included in data from each wave so that users can link data files across time. In addition, identical variables across waves should have the same variable labels and values to make it easier for users to compare the data across files.
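Consistency across waves can be checked mechanically before files are released. The following sketch compares the variable labels of two waves, each represented as a mapping from variable name to label; the function name, the variable names, and the labels are all hypothetical.

```python
def compare_waves(wave1_labels, wave2_labels, id_var="resp_id"):
    """Report a missing linking ID and shared variables whose labels differ."""
    missing_id = [
        wave
        for wave, labels in (("wave 1", wave1_labels), ("wave 2", wave2_labels))
        if id_var not in labels
    ]
    shared = sorted(set(wave1_labels) & set(wave2_labels))
    mismatches = [
        (var, wave1_labels[var], wave2_labels[var])
        for var in shared
        if wave1_labels[var] != wave2_labels[var]
    ]
    return {"missing_id": missing_id, "label_mismatches": mismatches}


wave1 = {"resp_id": "Respondent ID", "age": "Age in years"}
wave2 = {"resp_id": "Respondent ID", "age": "Age at interview"}
print(compare_waves(wave1, wave2))
```

The same comparison could be extended to value labels by passing mappings from variable name to its set of coded values.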

Data backups

All relevant files, particularly datasets under construction, should be backed up frequently -- even more often than once a day -- to prevent having to re-enter data. Master datasets should be backed up every time they are changed in any way. Computing environments in most universities and research centers support devices for data backup and storage. It is also advisable to maintain a backup copy of the data off-site, in case of an emergency or disaster that could destroy years of work.
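A backup is only useful if the copy is intact. The sketch below copies a file to a backup directory and confirms that the copy matches the original by comparing SHA-256 checksums; the function names are hypothetical, and the backup directory stands in for whatever backup device or off-site location the project actually uses.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dir):
    """Copy source into backup_dir and verify that the copy is identical."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

Recording the checksum together with the backup also makes it possible to confirm later that a stored copy has not been altered or corrupted.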