Guide to Social Science Data Preparation and Archiving
Phase 3: Data Collection and File Creation

Best Practices in Creating Metadata

Metadata — often called technical documentation or the codebook — are critical to effective data use as they convey information that is necessary to fully exploit the analytic potential of the data. Preparing high-quality metadata can be a time-consuming task, but the cost can be significantly reduced by planning ahead. In this section, we describe the structure and content of optimal metadata for social science data.

ICPSR recommends using XML to create structured documentation compliant with the Data Documentation Initiative (DDI) metadata specification, an international standard for the content and exchange of documentation. XML stands for eXtensible Markup Language and was developed by the W3C, the governing body for all Web standards. Structured, XML-based metadata are ideal for documenting research data because the structure provides machine-actionability and the potential for metadata reuse.

XML defines structured rules for tagging text in a way that allows the author to express semantic meaning in the markup. Thus, question text — for example, <question>Do you own your own home?</question> — can be tagged separately from the answer categories. This type of tagging embeds “intelligence” in the metadata and permits flexibility in rendering the information for display on the Web.
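As a rough illustration of why separate tagging helps, the sketch below parses a small tagged fragment and renders it as plain text. The tag and attribute names (variable, question, category, code) and the variable name OWNHOME are hypothetical, chosen only for this example; they are not DDI element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, non-DDI tag names used only to illustrate semantic markup.
SNIPPET = """
<variable name="OWNHOME">
  <question>Do you own your own home?</question>
  <category code="1">Yes</category>
  <category code="2">No</category>
</variable>
"""


def render_variable(xml_text: str) -> str:
    """Render one tagged variable as plain text for display."""
    var = ET.fromstring(xml_text.strip())
    # The question text is retrieved by its own tag, independent of the answers.
    lines = [f"{var.get('name')}: {var.findtext('question')}"]
    # Answer categories are tagged separately, so they can be listed on their own.
    for cat in var.findall("category"):
        lines.append(f"  {cat.get('code')} = {cat.text}")
    return "\n".join(lines)


print(render_variable(SNIPPET))
```

Because the question and its categories carry their own tags, the same source could just as easily be rendered as HTML or loaded into a search index without re-keying any text.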

Data Documentation Initiative (DDI)

At the outset of a project, we encourage data producers to generate documentation that is tagged according to the Data Documentation Initiative (DDI) metadata specification, an emerging international standard for the content, presentation, transport, and preservation of documentation (Blank and Rasmussen, 2004). The DDI specification is written in XML, which permits the markup, or tagging, of technical documentation content for retrieval and repurposing across the data life cycle. (See “Getting Started with the DDI.”)

The Data Documentation Initiative (DDI) provides a set of XML rules specifically for describing social, behavioral, and economic data. DDI is designed to encourage the use of a comprehensive set of elements to describe social science datasets, thereby providing the potential data analyst with broader knowledge about a given collection. In addition, DDI supports a life cycle orientation to data that is crucial for thorough understanding of a dataset. DDI enables the documentation of a project from its earliest stages through questionnaire development, data collection, archiving and dissemination, and beyond, with no metadata loss.

DDI authoring options

Several XML authoring tools are available to facilitate the creation of DDI metadata. With a generic XML editor, the user imports the DDI rules (i.e., the DDI XML Schema) into the software and is then able to enter text for specific DDI elements and attributes. The resulting document is a valid DDI instance or file.

There are also DDI-specific tools, such as Nesstar Publisher and Colectica, which produce DDI-compliant XML markup automatically. For more information on DDI and a list of tools and other XML resources, please consult the DDI website.

Depositing DDI metadata

ICPSR encourages the deposit of DDI metadata with deposits of research data. There are currently two main versions of the DDI specification — DDI Codebook (Version 2.*) and DDI Lifecycle (Version 3.1). Most archives will prefer or at least readily accept documentation submitted in either of the DDI versions. To be in full compliance, a document should have question text integrated into each variable.
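To make "question text integrated into each variable" more concrete, here is a minimal sketch that builds a single variable description with the question text inside it. The element names (var, labl, qstn, qstnLit, universe, catgry, catValu) follow DDI Codebook naming as we understand it, but the fragment is simplified: it omits the namespace and the enclosing codebook structure and has not been validated against the schema. The helper build_var and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_var(name, label, question, categories, universe=None):
    """Build a simplified, unvalidated DDI Codebook-style <var> element."""
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "labl").text = label
    # The literal question text travels with the variable it produced.
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question
    if universe:
        ET.SubElement(var, "universe").text = universe
    # One <catgry> per code, holding the code value and its label.
    for code, text in categories:
        catgry = ET.SubElement(var, "catgry")
        ET.SubElement(catgry, "catValu").text = str(code)
        ET.SubElement(catgry, "labl").text = text
    return var


own_home = build_var(
    name="OWNHOME",                       # hypothetical variable name
    label="Home ownership",
    question="Do you own your own home?",
    categories=[(1, "Yes"), (2, "No")],
    universe="All respondents",
)
print(ET.tostring(own_home, encoding="unicode"))
```

A real deposit would be produced with a DDI-aware tool or checked against the published schema; the point here is only that the question, universe, and code labels sit inside the variable they describe.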

It may not be possible for a project to produce documentation that is DDI-conformant. In those situations, using a uniform, structured format with integrated question text is the best alternative, as it will enable the archive to convert the files to XML format easily.
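One possible reading of "a uniform, structured format with integrated question text" is a flat table with one row per code, where every row repeats the variable name, questionnaire item number, and question text. The sketch below writes such a table as CSV. The column names and sample rows are assumptions made for illustration, not a layout prescribed by any archive.

```python
import csv
import io

# Hypothetical column layout: one row per variable/code combination.
FIELDS = ["variable", "question_number", "question_text", "code", "code_label"]

ROWS = [
    {"variable": "OWNHOME", "question_number": "Q3a",
     "question_text": "Do you own your own home?", "code": "1", "code_label": "Yes"},
    {"variable": "OWNHOME", "question_number": "Q3a",
     "question_text": "Do you own your own home?", "code": "2", "code_label": "No"},
]


def write_codebook(rows) -> str:
    """Serialize codebook rows to CSV text with a fixed header order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(write_codebook(ROWS))
```

Keeping the same columns for every variable is what makes later conversion to XML a mechanical step rather than a manual one.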

Important metadata elements

Since most standard computer programs will produce frequency distributions that show counts and percentages for each value of numeric variables, it may seem logical to use that information as the basis for documentation, but there are several reasons why this is not recommended. First, the output typically does not show the exact form of the question or item. Second, it does not contain other important information, such as skip patterns and derivations of constructed variables.

A list of the most important items to include in social science metadata is presented below. Note that many of the high-level elements have counterparts in the Dublin Core Metadata Initiative (DCMI) element set. The DCMI is a standard aimed at making it easier to describe and to find resources using the Internet. For more information on the DCMI, please view its website.

Principal investigator(s) [Dublin Core -- Creator]. Principal investigator name(s) and affiliation(s) at the time of data collection.

Title [Dublin Core -- Title]. Official title of the data collection.

Funding sources. Names of funders, including grant numbers and related acknowledgments.

Data collector/producer. Persons or organizations responsible for data collection, and the date and location of data production.

Project description [Dublin Core -- Description]. A description of the project and its intellectual goals, indicating how the data articulate with related datasets. Publications providing essential information about the project should be cited. A brief project history detailing any major difficulties faced or decisions made in the course of the project is useful.

Sample and sampling procedures. This section should describe the target population investigated and the methods used to sample it (assuming the entire population is not studied). The discussion of the sampling procedure should indicate whether standard errors based on simple random sampling are appropriate, or if more complex methods are required. If weights were created, they should be described. If available, a copy of the original sampling plan should be included as an appendix. A clear indication of the response rate should be provided, indicating the proportion of those sampled who actually participated in the study. For longitudinal studies, the retention rate across studies should also be noted.

Weighting. If weights are required, information on weight variables, how they were constructed, and how they should be used.

Substantive, temporal, and geographic coverage of the data collection [Dublin Core -- Coverage]. Descriptions of topics covered, time period, and location.

Data source(s) [Dublin Core -- Source]. If a dataset draws on resources other than surveys, citations to the original sources or documents from which data were obtained.

Unit(s) of analysis/observation. A description of who or what is being studied.

Variables. For each variable, the following information should be provided:

  1. The exact question wording or the exact meaning of the datum. Sources should be cited for questions drawn from previous surveys or published work.
  2. The text of the question integrated into the variable text. If this is not possible, it is useful to have the item or questionnaire number (e.g., Question 3a), so that the archive can make the necessary linkages.
  3. Universe information, i.e., who was actually asked the question. Documentation should indicate exactly who was and was not asked the question. If a filter or skip pattern indicates that data on the variable were not obtained for all respondents, that information should appear together with other documentation for that variable.
  4. Exact meaning of codes. The documentation should show the interpretation of the codes assigned to each variable. For some variables such as occupation or industry, this information might appear in an appendix.
  5. Missing data codes. Codes assigned to represent data that are missing. Such codes typically fall outside of the range of valid values. Different types of missing data should have distinct codes.
  6. Unweighted frequency distribution or summary statistics. These distributions should show both valid and missing cases.
  7. Imputation and editing information. Documentation should identify data that have been estimated or extensively edited.
  8. Details on constructed and weight variables. Datasets often include variables constructed using other variables. Documentation should include “audit trails” for such variables, indicating exactly how they were constructed, what decisions were made about imputations, and the like. Ideally, documentation would include the exact programming statements used to construct such variables. Detailed information on the construction of weights should also be provided.
  9. Location in the data file. For raw data files, documentation should provide the field or column location and the record number (if there is more than one record per case). If a dataset is in a software-specific system format, location is not important, but the order of the variables is. Ordinarily, the order of variables in the documentation will be the same as in the file; if not, the position of the variable within the file must be indicated.
  10. Variable groupings. Particularly for large datasets, it is useful to categorize variables into conceptual groupings.
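Several of the variable-level items above (code meanings, distinct missing data codes, unweighted frequencies, and column location in a raw data file) can be sketched together. The example below assumes a hypothetical codebook entry for a variable named OWNHOME stored in column 5 of each fixed-width record; the records, codes, and labels are invented for illustration.

```python
from collections import Counter

# Hypothetical codebook entry for one variable in a raw, fixed-width file.
CODEBOOK = {
    "OWNHOME": {
        "columns": (5, 5),                            # 1-based start and end column
        "codes": {1: "Yes", 2: "No"},                 # valid codes and their meanings
        "missing": {8: "Don't know", 9: "Refused"},   # distinct missing data codes
    }
}

# Invented records: columns 1-4 hold a case ID, column 5 holds OWNHOME.
RECORDS = ["00011", "00022", "00031", "00049", "00058", "00061"]


def read_value(record: str, columns: tuple) -> int:
    """Extract an integer code from the documented column location."""
    start, end = columns
    return int(record[start - 1:end])


def frequency_table(records, entry):
    """Unweighted counts for every documented code, valid and missing alike."""
    counts = Counter(read_value(r, entry["columns"]) for r in records)
    rows = []
    for code, label in entry["codes"].items():
        rows.append((code, label, "valid", counts.get(code, 0)))
    for code, label in entry["missing"].items():
        rows.append((code, label, "missing", counts.get(code, 0)))
    return rows


for code, label, status, n in frequency_table(RECORDS, CODEBOOK["OWNHOME"]):
    print(f"{code}  {label:<12} {status:<8} {n}")
```

Listing the missing codes alongside the valid ones, rather than dropping them, lets a secondary analyst confirm that the distribution in the documentation matches the file they received.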

Related publications. Citations to publications based on the data, by the principal investigators or others.

Technical information on files. Information on file formats, file linking, and similar information.

Data collection instruments. Copies of the original data collection forms and instruments. Other researchers often want to know the context in which a particular question was asked, and it is helpful to see the survey instrument as a whole. Copyrighted survey questions should be acknowledged with a citation so that users may access and give credit to the original survey and its author.

Flowchart of the data collection instrument. A graphical guide to the data, showing which respondents were asked which questions and how various items link to each other. This is particularly useful for complex questionnaires or when no hardcopy questionnaire is available.

Index or table of contents. A list of variables either in alphabetic order or organized into variable groups with corresponding page numbers or links to the variables in the technical documentation or codebook.

List of abbreviations and other conventions. Variable names and variable labels often contain abbreviations. Ideally, these should be standardized and described.

Interviewer guide. Details on how interviews were administered, including probes, interviewer specifications, use of visual aids such as hand cards, and the like.

Coding instrument. A document that details the rules and definitions used for coding the data. This is particularly useful when open-ended responses are coded into quantitative data and the codes are not provided on the original data collection instrument.