Guide to Social Science Data Preparation and Archiving:
Phase 3: Data Collection and File Creation

Quantitative Data

Dataset creation and integrity

Transcribing data from a questionnaire or interview schedule to an actual data record can introduce several types of errors, including typing errors, codes that do not make sense, and records that do not match. For this reason, employing a data collection strategy that captures data directly during the interview process is recommended. Consistency checks can then be integrated into the data collection process through the use of computer-assisted telephone or personal interviewing (CATI/CAPI) software, so that problems can be corrected during the interview.

However, even if data are being transcribed (either from survey forms or published tables), several steps can be taken in advance to lessen the incidence of errors.

  • Separate the coding and data-entry tasks as much as possible. Coding should be performed in such a way that distractions to coding tasks are minimized.
  • Arrange to have particularly complex tasks such as occupation coding carried out by people specially trained for the task.
  • Use a data-entry program that is designed to catch typing errors, i.e., one that is pre-programmed to detect out-of-range values.
  • Perform double entry of the data, in which each record is keyed in and then re-keyed against the original. Several standard packages offer this feature. In the re-entry process, the program catches discrepancies immediately.
  • Carefully check the first 5 to 10 percent of the data records created, and then choose random records for quality-control checks throughout the process.
  • Let the computer do complex coding and recoding if possible. For example, to create a series of variables describing family structure, write computer code to perform the task. Not only are the computer codes accurate if the instructions are accurate, but they can also be easily changed to correct a logical or programming error.
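
As an illustration of the double-entry step above, the following Python sketch compares two independently keyed passes of the same records and lists every field where they disagree. The record layout, the variable names (AGE, NKIDS), and the function name are assumptions made for this example; standard data-entry packages implement this check in their own way.

```python
def compare_double_entry(first_pass, second_pass):
    """Return (record_id, field, first_value, second_value) for each mismatch.

    Both arguments map a record ID to a dict of field values.
    """
    discrepancies = []
    for record_id in sorted(set(first_pass) | set(second_pass)):
        first = first_pass.get(record_id, {})
        second = second_pass.get(record_id, {})
        for field in sorted(set(first) | set(second)):
            if first.get(field) != second.get(field):
                discrepancies.append(
                    (record_id, field, first.get(field), second.get(field))
                )
    return discrepancies


# Hypothetical keyed data: the second pass mistypes NKIDS for record 1002.
first_pass = {1001: {"AGE": 34, "NKIDS": 2}, 1002: {"AGE": 51, "NKIDS": 9}}
second_pass = {1001: {"AGE": 34, "NKIDS": 2}, 1002: {"AGE": 51, "NKIDS": 99}}
discrepancies = compare_double_entry(first_pass, second_pass)
```

Each discrepancy would then be resolved by consulting the original form rather than by trusting either pass.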

Despite best efforts, errors will undoubtedly occur regardless of the data collection mode. Here is a list of things to check.

Wild codes and out-of-range values. Frequency distributions and data plots will usually reveal this kind of problem, although not every error is as obvious as, for example, a respondent with 99 rather than 9 children. Sometimes a frequency distribution contains values that appear valid but are in fact incorrect. For example, the columns for a given variable might have been defined incorrectly, causing the data to be read from the wrong columns. Data plots often instantly reveal outlying observations that merit checking.
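
A wild-code check can be sketched in a few lines of Python: build a frequency distribution and report any code that is not in the documented set. The valid range (0 to 15 children) and the use of 98 for "Don't Know" are hypothetical choices for this example only.

```python
from collections import Counter


def find_wild_codes(values, valid_codes):
    """Return {code: frequency} for every code not in valid_codes."""
    frequencies = Counter(values)
    return {code: n for code, n in frequencies.items() if code not in valid_codes}


# Hypothetical variable: number of children, 0-15 valid, 98 = Don't Know.
nkids = [0, 2, 1, 9, 99, 3, 2, 0, 98]
valid_codes = set(range(0, 16)) | {98}
wild = find_wild_codes(nkids, valid_codes)
```

A check of this kind catches undocumented codes only; a value that is in range but was read from the wrong columns still requires inspection of the full distribution.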

Consistency checks. Checks for consistency require substantive knowledge of the study. Typically, they involve comparisons across variables. Checks can reveal inconsistencies between responses to gate or filter questions and subsequent responses. For example, a respondent indicates that she did not work within the last week, yet the data show that she reported income for that week.

Other consistency checks involve complex relationships among variables, e.g., unlikely combinations of respondents’ and children’s ages. At a minimum, researchers should ensure that fields that are applicable to a respondent contain valid values, while those that are not applicable contain only missing values.
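
The two kinds of check described above, a filter question compared with its follow-up item and an implausible combination of ages, might be expressed as follows. This is a minimal sketch: the variable names, the code values, and the 12-year age-gap threshold are all assumptions chosen for illustration, and real checks depend on substantive knowledge of the study.

```python
# Hypothetical codes: WORKED 1 = yes, 2 = no; INAP marks "Not Applicable".
INAP = -4
MIN_PARENT_AGE_GAP = 12  # assumed threshold for flagging, not a standard


def consistency_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    worked, income = record["WORKED"], record["WKINCOME"]
    if worked == 2 and income != INAP:
        problems.append("did not work last week but income is reported")
    if worked == 1 and income == INAP:
        problems.append("worked last week but income is coded Not Applicable")
    for child_age in record["CHILD_AGES"]:
        if record["AGE"] - child_age < MIN_PARENT_AGE_GAP:
            problems.append(
                f"respondent aged {record['AGE']} has child aged {child_age}"
            )
    return problems


ok = {"AGE": 40, "WORKED": 1, "WKINCOME": 650, "CHILD_AGES": [10, 7]}
bad = {"AGE": 25, "WORKED": 2, "WKINCOME": 300, "CHILD_AGES": [20]}
```

Flagged records are candidates for review, not automatic corrections; an unusual combination may still be accurate.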

Measures to prevent inconsistencies should be undertaken even before any data are collected. As previously mentioned, implementing a data collection system that captures data during the interview process and that can correct problems during the interview (such as use of CATI/CAPI software) can eliminate transcription errors that can occur during post-survey data entry. The data collection instrument should also be tested before data collection begins, to ensure that data will be captured correctly and that any skip patterns are accurately followed. However, these measures do not eliminate the need for the researcher to examine the relationships among variables to ensure consistency.

Record matches and counts. In some studies, each subject or study participant might have more than one record. This occurs most frequently in longitudinal studies in which each subject has one record for each occasion during which he or she is observed. In other instances, the number of additional records may actually vary from subject to subject. For example, in a study of families one might have a household record, followed by a varying number of person records. This is sometimes known as a hierarchical file.

See the section titled “File structure” in Chapter 4 for more information on best practice in setting up files with different record types.
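
For a hierarchical file such as the household and person example above, a record-count check might look like the following sketch. It assumes, hypothetically, that each record carries a type marker ("H" or "P") and a household ID, and that the household record stores the expected number of persons in a field called NPERSONS; none of these names come from a particular study.

```python
from collections import Counter


def check_record_counts(records):
    """Check that person records match the count declared per household.

    records is a list of (record_type, household_id, fields) tuples.
    Returns a list of (household_id, problem) tuples.
    """
    declared = {}
    person_counts = Counter()
    problems = []
    for record_type, household_id, fields in records:
        if record_type == "H":
            if household_id in declared:
                problems.append((household_id, "duplicate household record"))
            declared[household_id] = fields["NPERSONS"]
        elif record_type == "P":
            person_counts[household_id] += 1
    for household_id in sorted(set(declared) | set(person_counts)):
        if household_id not in declared:
            problems.append(
                (household_id, "person records without household record")
            )
        elif declared[household_id] != person_counts[household_id]:
            problems.append(
                (
                    household_id,
                    f"expected {declared[household_id]} person records, "
                    f"found {person_counts[household_id]}",
                )
            )
    return problems


records = [
    ("H", 1, {"NPERSONS": 2}), ("P", 1, {}), ("P", 1, {}),
    ("H", 2, {"NPERSONS": 3}), ("P", 2, {}),
    ("P", 3, {}),
]
problems = check_record_counts(records)
```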

Variable names

It is important to remember that the variable name is the referent that analysts will use most often when working with the data. At a minimum, it should convey correct information, and ideally it should be unambiguous.

When selecting a variable name, choose a name that is consistent in length with the requirements of the software package being used and consider the long-term utility of the variable name to the widest audience of users. Several systems for constructing variable names are as follows:

One-up numbers. This system numbers variables from 1 through n (the total number of variables). Since most statistical software does not permit variable names starting with a digit, the usual format is V1 (or V0001) ... Vn. This has the advantage of simplicity, but provides no indication of the variable content. Although most software allows extended labels for variables (allowing entry of descriptive information, e.g., V0023 is “Q6b, Mother’s Education”), the one-up system is prone to error.

Question numbers. Variable names also may correspond to question numbers, e.g., Q1, Q2a, Q2b . . . Qn. This approach relates variable names directly to the original questionnaire, but, like one-up numbers, such names are not easily remembered. Further, a single question often yields several distinct variables, which must be distinguished by added letters or numbers (e.g., Q12a, Q12a1) that may not appear on the questionnaire.

Mnemonic names. Short variable names that represent the substantive meaning of variables have some advantages, in that they are recognizable and memorable. They can have drawbacks, however. What might be an “obvious” abbreviation to the person who created it might not be understood by a new user. Also, software sometimes limits the number of characters, so it can be difficult to create immediately recognizable names.

Prefix, root, suffix systems. A more systematic approach involves constructing variable names containing a root, a prefix, and possibly a suffix. For example, all variables having to do with education might have the root ED. Mother’s education might then be MOED, father’s education FAED, and so on. Suffixes often indicate the wave of data in longitudinal studies, the form of a question, or other such information. Implementing a prefix, root, suffix system requires prior planning to establish a list of standard two- or three-letter abbreviations.
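
A prefix, root, suffix system can be enforced by generating names from the agreed abbreviation lists instead of typing them by hand. In the sketch below, MO, FA, and ED follow the example above; the OC root and the "_W" wave suffix format are hypothetical additions for illustration.

```python
# Standard abbreviations, established before any variables are named.
PREFIXES = {"mother": "MO", "father": "FA"}
ROOTS = {"education": "ED", "occupation": "OC"}  # "OC" is hypothetical


def make_name(person, topic, wave=None):
    """Build a variable name as prefix + root, plus an optional wave suffix."""
    name = PREFIXES[person] + ROOTS[topic]
    if wave is not None:
        name += f"_W{wave}"  # assumed suffix convention
    return name


mother_education = make_name("mother", "education")
father_education_wave2 = make_name("father", "education", wave=2)
```

Because an unknown person or topic raises a KeyError, a name cannot be created from an abbreviation that is missing from the standard lists.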

Variable labels

Most statistical programs permit the user to link extended labels for each variable to the variable name. Variable labels are extremely important. They should provide at least three pieces of information: (1) the item or question number in the original data collection instrument (unless the item number is part of the variable name), (2) a clear indication of the variable’s content, and (3) an indication of whether the variable is constructed from other items. If the number of characters available for labels is limited, one should develop a set of standard abbreviations in advance and present it as part of the documentation for the dataset.
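
One way to keep labels consistent is to assemble them from the three pieces of information listed above. The following sketch assumes a hypothetical 80-character limit and a "(constructed)" marker; actual limits and conventions depend on the software package and the study.

```python
def make_label(item, content, constructed=False, max_len=80):
    """Combine item number, content, and constructed status into one label."""
    label = f"{item}: {content}"
    if constructed:
        label += " (constructed)"
    if len(label) > max_len:
        raise ValueError(f"label exceeds {max_len} characters: {label!r}")
    return label


simple_label = make_label("Q6b", "Mother's education")
derived_label = make_label(
    "Q6a-Q6b", "Highest parental education", constructed=True
)
```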

Variable groups

Grouping substantively related variables together and presenting such lists in the codebook for a study can effectively organize a dataset and enable secondary analysts to get an overview of a dataset quickly. Groups are especially recommended if a dataset contains a large number of variables. They are especially useful for data made available through an online analysis system, as they offer a navigational structure for exploring the dataset.

Codes and coding

Before survey data are analyzed, the interview or questionnaire responses must be represented by numeric codes. Common coding conventions (a) ensure that all statistical software packages will be able to handle the data, and (b) promote greater measurement comparability. Computer-assisted interviewing systems assign codes automatically by programming them into the instrument, so that most coding decisions are made before the instrument is fielded. The principles discussed here apply to such situations as well as those in which coding follows data collection.

No attempt is made here to provide standardized coding schemes for all variables. However, the U.S. Census Bureau occupation and industry codes and the National Institute of Standards and Technology’s state, county, and metropolitan area codes (also known as Federal Information Processing Standards [FIPS] codes) are standard schemes used to code these types of information.

Guidelines to keep in mind while coding:

  • Identification variables. Provide fields at the beginning of each record to accommodate all identification variables. Identification variables often include a unique study number and a respondent number to represent each case.
  • Code categories. Code categories should be mutually exclusive, exhaustive, and precisely defined. Each interview response should fit into one and only one category. Ambiguity will cause coding difficulties and problems with the interpretation of the data.
  • Preserving original information. Code as much detail as possible. Recording original data, such as age and income, is more useful than collapsing or bracketing the information. With original or detailed data, secondary analysts can determine other meaningful brackets on their own rather than being restricted to those chosen by others.
  • Closed-ended questions. Responses to survey questions that are precoded in the questionnaire should retain the coding scheme in the machine-readable data to avoid errors and confusion.
  • Open-ended questions. For open-ended items, investigators can either use a predetermined coding scheme or review the initial survey responses to construct a coding scheme based on major categories that emerge. Any coding scheme and its derivation should be reported in study documentation.
  • User-coded responses. Increasingly, investigators submit the full verbatim text of responses to open-ended questions to archives so that users can code these responses themselves. Because such responses may contain sensitive information, they must be reviewed for disclosure risk and, if necessary, treated by archives prior to dissemination.
  • Check-coding. It is a good idea to verify or check-code some cases during the coding process -- that is, repeat the process with an independent coder. For example, if more than one code is assigned to an interview response, this highlights problems or ambiguities in the coding scheme. Such check-coding provides an important means of quality control in the coding process.
  • Series of responses. If a series of responses requires more than one field, organizing the responses into meaningful major classifications is helpful. Responses within each major category are assigned the same first digit. Secondary digits can distinguish specific responses within the major categories. Such a coding scheme permits analysis of the data using broad groupings or more detailed categories.

Figure 2 presents an example of the use of the series of responses coding scheme for coding parental employment status, from the 1990 Census of Population and Housing Public Use Microdata Samples (PUMS) person record. The first digit of the scheme describes the number of parents present in the household; the second indicates the employment status of parents; the third tells whether employed parents work full- or part-time.
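
The benefit of such a hierarchical scheme is that the same variable supports both broad and detailed tabulations. The sketch below splits a three-digit code into its digits and counts cases at both levels. The code values are invented for illustration and are not the actual PUMS categories.

```python
from collections import Counter


def split_code(code):
    """Split a three-digit code into (first, second, third) digits."""
    if not 100 <= code <= 999:
        raise ValueError(f"expected a three-digit code, got {code}")
    return code // 100, (code // 10) % 10, code % 10


# Hypothetical coded responses.
codes = [111, 112, 121, 211, 212, 212]

# Broad grouping: tabulate on the first digit only.
broad = Counter(split_code(code)[0] for code in codes)

# Detailed grouping: tabulate on the full code.
detailed = Counter(codes)
```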

Missing data

Missing data can arise in a number of ways, and it is important to distinguish among them. There are at least six missing data situations, each of which should have a distinct missing data code.

  1. Refusal/No Answer. The subject explicitly refused to answer a question or did not answer it when he or she should have.
  2. Don’t Know. The subject was unable to answer a question, either because he or she had no opinion or because the required information was not available (e.g., a respondent could not provide family income in dollars for the previous year).
  3. Processing Error. For some reason, there is no answer to the question, although the subject provided one. This can result from interviewer error, incorrect coding, machine failure, or other problems.
  4. Not Applicable. The subject was never asked a question for some reason. Sometimes this results from skip patterns following filter questions; for example, subjects who are not working are not asked about job characteristics. Other examples of inapplicability are sets of items asked only of random subsamples and those asked of one member of a household but not another.
  5. No Match. This situation arises when data are drawn from different sources (for example, a survey questionnaire and an administrative database), and information from one source cannot be located.
  6. No Data Available. The question should have been asked of the respondent, but for a reason other than those listed above, no answer was given or recorded.

Effective methods for missing data imputation and missing data analysis rely on accurate identification of missing data. For more information on best practice in handling missing data, see Little (2002) and McKnight (2007).
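
Giving each of the six situations its own code makes it possible to tabulate the reasons for missingness separately and to exclude all of them from substantive statistics. The negative code values below are hypothetical; any scheme works as long as each situation has a distinct code.

```python
from collections import Counter

# Hypothetical codes, one per missing data situation.
MISSING_CODES = {
    -1: "Refusal/No Answer",
    -2: "Don't Know",
    -3: "Processing Error",
    -4: "Not Applicable",
    -5: "No Match",
    -6: "No Data Available",
}


def tabulate_missing(values):
    """Count how often each missing data situation occurs."""
    return Counter(MISSING_CODES[v] for v in values if v in MISSING_CODES)


def valid_values(values):
    """Return only the substantive (non-missing) values."""
    return [v for v in values if v not in MISSING_CODES]


income = [52000, -1, 31000, -2, -4, -4, 47000]
missing_summary = tabulate_missing(income)
valid_income = valid_values(income)
```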


Selecting missing data codes

Missing data codes should match the content of the field. If the field is numeric, the codes should be numeric, and if the field is alphanumeric, the codes may be numeric or alphanumeric. Most researchers use codes for missing data that are above the maximum valid value for the variable (e.g., 97, 98, 99). This occasionally presents problems, most typically when the valid values are single-digit values but two digits are required to accommodate all necessary missing data codes. Similar problems sometimes arise if negative numbers are used for missing data (e.g., -1 or -9), because codes must accommodate the minus sign. Missing data codes should be standardized such that the same code is used for each type of missing data for all variables in a data file, or across the entire collection if the study consists of multiple data files.
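
Before fixing a single set of missing data codes for a whole file, it is worth confirming that no code collides with a valid value of any variable. The sketch below runs that check against a hypothetical codebook of valid ranges; the variable names and ranges are assumptions for this example.

```python
def find_collisions(valid_ranges, missing_codes):
    """Return (variable, code) pairs where a missing code is a valid value.

    valid_ranges maps a variable name to its (minimum, maximum) valid value.
    """
    collisions = []
    for variable, (low, high) in sorted(valid_ranges.items()):
        for code in sorted(missing_codes):
            if low <= code <= high:
                collisions.append((variable, code))
    return collisions


# Hypothetical codebook: AGE can legitimately reach 99.
valid_ranges = {"NKIDS": (0, 15), "AGE": (18, 99)}

two_digit = find_collisions(valid_ranges, {97, 98, 99})
three_digit = find_collisions(valid_ranges, {997, 998, 999})
```

Here the two-digit codes collide with valid ages, so a standardized scheme for this file would need wider codes for every variable.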

In general, blanks should not be used as missing data codes unless there is no need to differentiate types of missing data such as “Don’t Know,” “Refused,” etc. Blanks are acceptable when a case is missing a large number of variables (e.g., when a follow-up interview in a longitudinal study was not conducted), or when an entire sequence of variables is missing due to inapplicability, such as data on nonexistent children. In such instances, an indicator variable should allow analysts to determine unambiguously when cases should have blanks in particular areas of the data record.

A note on “not applicable” and skip patterns

Although we have referred to this issue previously, some reiteration is perhaps in order. Handling skip patterns is a constant source of error in both data management and analysis. On the management side, deciding what to do about codes for respondents who are not asked certain questions is crucial. “Not Applicable” or “Inapplicable” codes, as noted above, should be distinct from other missing data codes. Dataset documentation should clearly show for every item exactly who was or was not asked the question. At the data cleaning stage, all “filter items” should be checked against items that follow to make sure that the coded answers do not contradict one another, and that unanswered items have the correct missing data codes.

Imputed data

If missing data have been imputed in any way, this should be indicated. There are two standard ways of doing so. One approach is to include two versions of any imputed variables: the original variable, including missing data codes, and the imputed version that contains complete data. Another approach is to create an “imputation flag,” or indicator variable, for each variable subject to imputation, set to 1 if the variable is imputed and 0 otherwise. (Not all missing data need to be imputed. In the case of job characteristics, for example, the investigator might want to impute responses for “Don’t Know” and “Refuse” cases, but not impute for “Inapplicable” cases where the data are missing because the respondent is not working.)
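
The imputation-flag approach can be sketched as follows. The original variable is left untouched, an imputed copy and a 0/1 flag are created alongside it, and "Not Applicable" cases are deliberately skipped. The code values are hypothetical, and filling with the median of valid values is only a placeholder for whatever imputation method the investigator actually uses.

```python
from statistics import median

# Hypothetical codes: -1 = Refused, -2 = Don't Know, -4 = Not Applicable.
CODES_TO_IMPUTE = {-1, -2}


def impute_with_flag(values, fill_value, codes_to_impute=CODES_TO_IMPUTE):
    """Return (imputed_values, flags); flag is 1 where a value was imputed."""
    imputed, flags = [], []
    for value in values:
        if value in codes_to_impute:
            imputed.append(fill_value)
            flags.append(1)
        else:
            imputed.append(value)  # valid values and Not Applicable are kept
            flags.append(0)
    return imputed, flags


hours = [40, -1, -2, -4, 35]  # original variable, kept as is
fill = median(v for v in hours if v >= 0)  # placeholder imputation method
hours_imputed, hours_flag = impute_with_flag(hours, fill)
```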

Geographic identifiers and geospatial data

Some projects collect data containing direct and indirect geographic identifiers that can be geocoded and used with a mapping application. Direct geographic identifiers are actual addresses (e.g., of an incident, a business, a public agency, etc.). Indirect geographic identifiers include location information such as state, county, census tract, census block, telephone area codes, and place where the respondent grew up.

Investigators are encouraged to add to the dataset derived variables that aggregate their data to a spatial level that can provide greater subject anonymity (such as state, county, census tract, division, or region). It is desirable for data producers to geocode address data to coordinates themselves, as their knowledge of the geographic area often yields better geocoding rates. When data producers convert addresses to geospatial coordinates, the data can later be aggregated to a higher level that protects respondent anonymity.

In such instances, the original geographic identifiers should be saved to a separate data file that also contains a variable to link to the research data. The file with the direct identifiers should be password protected and both data files should be submitted to the archive in separate submissions. Investigators are encouraged to contact archive staff for assistance when preparing data for submission that contain detailed geographic information.

When data contain geographic information that poses confidentiality concerns, archive staff can produce a restricted-use version of the data file. The restricted-use version maintains the detailed geographic information and the data can be obtained only through a restricted data use agreement with the archive. In these situations, a publicly available (i.e., downloadable) version of the data may also be distributed that retains the aggregated geographic information but with detailed geographic information masked or removed.
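
As a small illustration of deriving aggregated geographic variables, the sketch below assumes that each record carries an 11-digit census tract identifier made up of a 2-digit state code, a 3-digit county code, and a 6-digit tract code. It derives state and county identifiers for a public-use record and leaves the tract out. The variable names and the sample identifier are illustrative only.

```python
def aggregate_tract(tract_id):
    """Derive state and county FIPS codes from an 11-digit tract identifier.

    Identifiers are handled as strings so that leading zeros are preserved.
    """
    if len(tract_id) != 11 or not tract_id.isdigit():
        raise ValueError(f"expected an 11-digit tract identifier: {tract_id!r}")
    return {"STATE": tract_id[:2], "COUNTY": tract_id[:5]}


# Hypothetical restricted-use record with a detailed identifier.
restricted_record = {"CASEID": 1001, "TRACT": "26161400100"}

# Public-use version: keep the case ID and the aggregated geography only.
public_record = {
    "CASEID": restricted_record["CASEID"],
    **aggregate_tract(restricted_record["TRACT"]),
}
```

Whether county-level detail is coarse enough to protect respondents depends on the study; that judgment is part of the disclosure review, not something the code can decide.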

Geospatial data. When coordinate-based geographic data are used as units of analysis or variables, the researcher must submit to the archive the relevant geometry files (or information on how to access them) to permit others to recreate or extend the original analysis using the same boundaries. This is encouraged even if the boundary file is easily obtained from the U.S. Census Bureau or from a known third party, and is absolutely necessary if the original spatial analysis used specially created zones. Generally, depositors can submit the geometry (boundary) files in one compressed file containing all of the files that produce the geometry (e.g., single geographic layer visualization, map visualization) for any geographic information system (GIS). Corresponding project files, geospatial metadata, and geocoding rates should also be submitted. Finally, depositors should ensure that issues of proprietary visualizations and/or data have been addressed prior to archiving, with the understanding that all archived data will be available for distribution.