What is a codebook?
A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the data file(s).
While codebooks vary widely in quality and in the amount of information they provide, a typical codebook includes:
- Column locations and widths for each variable
- Definitions of different record types
- Response codes for each variable
- Codes used to indicate nonresponse and missing data
- Exact questions and skip patterns used in a survey
- Other indications of the content and characteristics of each variable
Codebooks may also contain:
- Frequencies of response
- Survey objectives
- Concept definitions
- A description of the survey design and methodology
- A copy of the survey questionnaire
- Information on data collection, data processing, and data quality
The following example illustrates the main components of a typical SAMHDA codebook using the National Survey on Drug Use and Health (NSDUH), 2012 (ICPSR 34933):
Data Collection Description
The body of a codebook describes the content of the data file. The following elements are generally included for each variable in the data file:
Variable Name: Indicates the variable number or name assigned to each variable in the data collection.
Variable Column Location: Indicates the starting location and width of a variable. If the variable is a multiple-response type, the width referenced is that of a single response.
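As an illustration of how a starting location and width are used, the following minimal Python sketch slices one fixed-width record into named fields. The variable names, column positions, and the sample record are hypothetical and are not taken from the NSDUH codebook. The sketch also assumes that column locations are 1-based, which should be confirmed against the codebook being used.

```python
# Hypothetical layout: (variable name, 1-based start column, width).
# These names and positions are illustrative, not from any real codebook.
LAYOUT = [
    ("CASEID", 1, 5),
    ("AGE", 6, 2),
    ("SEX", 8, 1),
]


def parse_record(line, layout):
    """Slice one fixed-width record into a dict of raw string fields."""
    record = {}
    for name, start, width in layout:
        # Assumption: codebook columns are 1-based; Python slices are 0-based.
        begin = start - 1
        record[name] = line[begin:begin + width]
    return record


# A made-up 8-character record: CASEID=00042, AGE=37, SEX=2.
row = parse_record("00042372", LAYOUT)
print(row)
```

The fields are kept as strings so that leading zeros and blank columns survive until the user decides how each variable should be converted.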
Variable Label: Indicates an abbreviated variable description (maximum of 40 characters) that can be used to identify the variable. In some cases, an expanded version of the variable name can be found in a variable description list.
Missing Data Code: Indicates the values and labels of missing data. If 9 is a missing value, then the codebook could note 9=Missing Data. Other examples of missing data labels include REFUSED, DON’T KNOW, BLANK (NO ANSWER), and LEGITIMATE SKIP. Some analysis software packages require that data the user wishes to exclude from analysis (i.e., inappropriate, unascertained, unascertainable, or ambiguous data categories) be designated as "MISSING DATA." Although these codes are defined as missing data categories, the user may still include them in an analysis if desired.
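The sketch below shows one possible way to apply missing data codes before an analysis, using plain Python. The code values and the sample responses are hypothetical; the actual codes for each variable must be taken from the codebook. The original values are left untouched, so the missing data categories remain available if the user wants to analyze them.

```python
# Hypothetical missing-data codes for a single variable.
# Real codes and labels must come from the study's codebook.
MISSING_CODES = {
    9: "Missing Data",
    97: "REFUSED",
    98: "BLANK (NO ANSWER)",
    99: "LEGITIMATE SKIP",
}


def recode_missing(values, missing_codes):
    """Return a copy of values with missing-data codes replaced by None."""
    return [None if v in missing_codes else v for v in values]


def mean_of_valid(values):
    """Average the non-missing values; return None if there are none."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


responses = [1, 2, 9, 3, 97, 2]          # made-up raw codes
cleaned = recode_missing(responses, MISSING_CODES)
average = mean_of_valid(cleaned)
```

Here `responses` still holds the raw codes, so a user could, for example, count how many respondents refused to answer.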
Code Value: Indicates the code values occurring in the data for a variable.
Value Label: Indicates the textual definitions of the codes. Abbreviations commonly used in the code definitions are "DK" (Do Not Know), "NA" (Not Ascertained), and "INAP" (Inapplicable).
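Code values and value labels can be thought of as a lookup table. The following minimal Python sketch, with hypothetical codes and labels, translates raw code values into their labels and flags any code that the table does not define.

```python
# Hypothetical code-to-label table for one variable.
# Real code values and labels must come from the study's codebook.
VALUE_LABELS = {
    1: "Yes",
    2: "No",
    8: "DK",
    9: "NA",
}


def label_values(codes, labels):
    """Translate code values to labels, flagging codes with no label."""
    return [labels.get(code, f"Unlabeled code {code}") for code in codes]


labeled = label_values([1, 2, 8, 5], VALUE_LABELS)
print(labeled)
```

Flagging undefined codes, rather than dropping them, makes it easier to spot values that the codebook does not document.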