What is a codebook?
A codebook provides information on the structure, contents,
and layout of a data file. Users are strongly encouraged to
look at the codebook of a study before downloading the data
files.
While codebooks vary widely in quality and amount of
information given, a typical codebook includes:
- Column locations and widths for each variable
- Definitions of different record types
- Response codes for each variable
- Codes used to indicate nonresponse and missing data
- Exact questions and skip patterns used in a survey
- Other indications of the content and characteristics of each variable
Codebooks may also contain:
- Frequencies of response
- Survey objectives
- Concept definitions
- A description of the survey design and methodology
- A copy of the survey questionnaire (if applicable)
- Information on data collection, data processing, and data quality
The following example from ICPSR 9721 (Descriptors and Measurements of the Height of Runaway Slaves and Indentured Servants in the United States, 1700-1850) illustrates the main components of a typical ICPSR codebook:
The body of a codebook describes the content of the data
file and generally includes the following elements for each
variable in the data file:
Variable Name: Indicates the variable number or name
assigned to each variable in the data collection.
Variable Column Location: Indicates the starting
location and width of a variable. If the variable is a
multiple-response type, the width referenced is that of a
single response.
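
As a minimal sketch of how a starting location and width translate
into reading a fixed-width record, the Python below slices one line
of data into named fields. The layout, the variable names (V1, V2,
V3), and the sample record are hypothetical and are not taken from
any actual codebook; the sketch also assumes that column locations
are counted from 1.

```python
# Hypothetical layout: (variable name, 1-based starting column, width).
# These entries are illustrative only, not from a real codebook.
LAYOUT = [
    ("V1", 1, 4),  # e.g. a case identifier
    ("V2", 5, 2),  # e.g. age in years
    ("V3", 7, 1),  # e.g. a single-digit coded response
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of raw string fields."""
    record = {}
    for name, start, width in layout:
        # Assumption: codebook columns are 1-based; Python slices are 0-based.
        begin = start - 1
        record[name] = line[begin:begin + width]
    return record


example = parse_record("0042279")
print(example)
```

The fields are kept as strings so that leading zeros and blank
columns survive until the user decides how to convert them.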
Variable Label: Indicates an abbreviated variable
description (maximum of 40 characters) to identify the
variable for the user. In some cases, an expanded version of
the variable name can be found in a variable description
list.
Missing Data Code: Indicates the values and labels of
missing data. If 9 is a missing value, then the codebook
could note (MD=9). Alternative statements for other
variables are "MD=8 OR GE 9" or "NO MISSING DATA CODES."
Some analysis software packages require that data the user
wants excluded from analysis be designated as "MISSING
DATA," i.e., inappropriate, unascertained, unascertainable,
or ambiguous data categories. Although these codes are
defined as missing data categories, users may still include
them in an analysis if they wish.
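
The missing data statements quoted above can be applied in code
roughly as follows. This is a sketch under two assumptions: that
"GE" means "greater than or equal to," and that the sample values
are invented for illustration. The helper names is_missing and
mask_missing are hypothetical.

```python
def is_missing(value, equal_to=(), ge=None):
    """Return True if value falls under a missing data rule.

    equal_to: codes that are missing when matched exactly (e.g. MD=9).
    ge: optional threshold; values at or above it are missing
        (assumed reading of "GE 9").
    """
    if value in equal_to:
        return True
    return ge is not None and value >= ge


def mask_missing(values, equal_to=(), ge=None):
    """Replace missing codes with None, leaving other values unchanged."""
    return [None if is_missing(v, equal_to, ge) else v for v in values]


codes = [1, 2, 8, 9, 3, 10]  # invented sample values

md_9 = mask_missing(codes, equal_to=(9,))               # "MD=9"
md_8_or_ge_9 = mask_missing(codes, equal_to=(8,), ge=9)  # "MD=8 OR GE 9"
no_md = mask_missing(codes)                              # "NO MISSING DATA CODES"

print(md_9)
print(md_8_or_ge_9)
print(no_md)
```

Because mask_missing returns a new list, the original codes stay
available for users who prefer to analyze the missing data
categories rather than exclude them.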
Code Value: Indicates the code values occurring in the
data for this variable.
Value Label: Indicates the textual definitions of the
codes. Abbreviations commonly used in the code definitions
are "DK" (Do Not Know), "NA" (Not Ascertained), and "INAP"
(Inapplicable).
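
Code values and value labels pair naturally as a lookup table. The
sketch below assumes an invented variable whose codes and labels
are hypothetical; only the abbreviations DK, NA, and INAP come from
the text above.

```python
# Hypothetical code-to-label mapping for a single variable.
VALUE_LABELS = {
    0: "INAP",
    1: "Yes",
    2: "No",
    8: "DK",
    9: "NA",
}


def label_for(code, labels=VALUE_LABELS):
    """Return the value label for a code, flagging undocumented codes."""
    return labels.get(code, "UNDOCUMENTED CODE {}".format(code))


observed = [1, 2, 8, 0, 5]  # invented sample values
labeled = [label_for(code) for code in observed]
print(labeled)
```

Flagging codes that have no label, rather than dropping them, makes
it easier to spot data values the codebook does not document.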