Guide to Social Science Data Preparation and Archiving

Phase 2: Project Start-Up
Importance of Good Data Management
Once funding is received and the research project has started, the researcher will want to continue planning for the final form of the collection, including its metadata, which will ultimately be deposited in an archive. Planning for the management and archiving of a data collection at the outset is critical to the project's success, and careful early planning can significantly reduce the project's cost.
Initial questions to consider
At a minimum, a project plan should involve decisions on the following data and documentation topics, many of which are related to the core data management plan. Documentation should be as much a part of project planning as data-related considerations such as questionnaire construction or analysis plans.
Data and file structure. What is the data file going to look like and how will it be organized? What is the unit of analysis? Will there be one large data record or several shorter ones?
Naming conventions. How will files and variables be named? What naming conventions will be used to achieve consistency?
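As an illustration of enforcing such a convention mechanically, the following sketch checks variable names against one hypothetical rule (lowercase, beginning with a letter, underscores allowed, at most 32 characters). The pattern itself is an assumption for the example, not a prescribed standard:

```python
import re

# Hypothetical convention: lowercase letters, digits, and underscores,
# starting with a letter, at most 32 characters long.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]{0,31}$")

def check_names(names):
    """Return the names that violate the convention."""
    return [n for n in names if not NAME_PATTERN.match(n)]

bad = check_names(["q1_income", "Q2Income", "resp_age", "var name"])
print(bad)  # -> ['Q2Income', 'var name']
```

Running such a check over all file and variable names before data entry begins helps keep the collection consistent as it grows.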
Data integrity. How will data be input or captured? Will the variable formats be numeric or character? What checks will be used to find invalid values, inconsistent responses, incomplete records, and so on? What checks will be used to manage the data versions? For example, archives increasingly use checksums and other techniques for ensuring integrity.
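The checksum technique mentioned above can be sketched as follows. This uses Python's standard hashlib module; reading the file in chunks is simply one common way to handle files too large to fit in memory:

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Compute a checksum for a data file, reading it in chunks so
    that large files need not be loaded into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording a checksum when a data file version is created, and recomputing it later, confirms that the file has not been altered or corrupted in the interim.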
Preparing dataset documentation. What will the dataset documentation or metadata look like and how will it be produced? How much is necessary for future retrieval and archival processing? What documentation standard will be used? (See the section titled “Best Practices in Creating Technical Documentation” in Phase 3, Data Collection and File Creation, for guidance on using a standards-based approach to documentation production.)
Variable construction. What variables will be constructed following the collection of the original data? How will these be documented?
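A hypothetical illustration of constructing a derived variable and documenting it at the same time; the variable name, codes, and categories here are invented for the example:

```python
# Hypothetical construction of a derived variable (broad age categories)
# from an original variable, with the recode documented alongside it.

def construct_age_group(age, missing_code=99):
    """Recode exact age into broad categories; 99 = missing."""
    if age is None or age == missing_code:
        return missing_code
    if age < 18:
        return 1   # under 18
    if age < 65:
        return 2   # 18-64
    return 3       # 65 and over

# Documentation entry recorded at the time the variable is built.
AGE_GROUP_DOC = {
    "name": "age_group",
    "source": "age",
    "values": {1: "Under 18", 2: "18-64", 3: "65 and over", 99: "Missing"},
}

print([construct_age_group(a) for a in [12, 30, 70, 99]])  # -> [1, 2, 3, 99]
```

Writing the documentation entry at the moment the recode is written, rather than after the fact, is the point of the example: the construction and its description cannot drift apart.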
Project documentation. What steps will be taken to document decisions that are made as the project unfolds? How will information be recorded on field procedures, coding decisions, variable construction, and the like? Research project websites and various Intranet options are increasingly used for capturing this kind of information, and archives are prepared to include Web-based information in deposits.
Integrating data and documentation
To what extent can the various tasks mentioned above be integrated into a single process? Using a single computer program or an integrated set of programs to carry out these tasks simplifies data management, reduces costs, and is more reliable. It is advisable to determine which program or programs will handle data management and documentation tasks at the outset of the project.
Computer-assisted interviewing. Computer-assisted interviewing (CATI/CAPI) is increasingly used for both telephone and personal interviews. These programs -- e.g., Blaise, CASES -- typically perform a number of functions simultaneously, including direct data entry, integrity checks, and skips and fills. Somewhat similar software can be used to format mail questionnaires and prepare data entry templates. Be aware that not all CAPI-generated variables are needed in the data file deposited in an archive; variables that are artifacts of the CAPI process do not contribute useful information for analysis. If possible, program the instrument to be fielded according to the specifications of the desired final data files. Keeping a focus on the ultimate form of the data collection makes dataset preparation that much easier.
Using integrated software. Most large-scale data collection efforts now involve computer-assisted interviewing, but there are still situations in which data entry will be required -- e.g., inputting of administrative records, observation data, or open-ended question responses. A number of software tools are available to make the documentation task easier. For projects requiring data entry directly from mail questionnaires or interview instruments, a variety of programs will not only make data entry a good deal easier, but also carry out data integrity checks as the data are entered and create programming statements to read the data into other programs. A good data-entry program will also recognize automatic skips and fills. For example, suppose that a questionnaire contains a series of items on work experience. If the respondent has never worked, then as soon as that code is keyed, the program skips to the next valid entry, filling in missing data codes in intervening fields as appropriate.
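The skip-and-fill behavior described in the work-experience example can be sketched as follows. The variable names and codes are illustrative, not drawn from any real instrument or data-entry package:

```python
# Sketch of skip-and-fill logic: if the gate question ("ever worked?")
# is coded "no", the follow-up work-history items are filled with a
# missing-data code instead of being keyed by the data-entry operator.
# All names and codes below are hypothetical.

NEVER_WORKED = 2      # hypothetical code for "no, never worked"
NOT_APPLICABLE = 97   # hypothetical missing-data code for skipped items

WORK_ITEMS = ["years_worked", "last_occupation", "weekly_hours"]

def apply_skip(record):
    """Fill the skipped follow-up items when the gate question
    indicates the respondent has never worked."""
    if record.get("ever_worked") == NEVER_WORKED:
        for item in WORK_ITEMS:
            record[item] = NOT_APPLICABLE
    return record

print(apply_skip({"ever_worked": 2}))
```

A real data-entry program would apply the same rule as each code is keyed; the sketch only shows the fill step that follows the gate question.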
Spreadsheets and databases. Spreadsheet packages can also be used for data entry. These packages usually can be programmed to perform integrity checks as data are entered. In addition, a variety of database packages, such as Microsoft Access, MySQL, and Oracle, can be used for both data entry and documentation. Note that when such systems are intended to serve as the format for deposit, it is important to provide full documentation for all of the fields and relationships built into the files.
Other kinds of software can be used to perform many documentation tasks. For example, word processing packages like Microsoft Word can be used for data entry, maintenance of dataset documentation, and similar tasks, but they are not suitable tools for data integrity checks. Producing an attractive final document using word processing is also quite simple. In fact, if the basic document has been set up in a word processor, retrieving and merging statistical information such as frequencies and descriptive statistics from computer output stored in an external file is a relatively easy task. See Chapter 3 for a discussion of using the Data Documentation Initiative (DDI) metadata specification to produce documentation in eXtensible Markup Language (XML) format. The DDI standard provides a way to produce comprehensive documentation that is consistent in format and thus easy to integrate into larger systems.
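Retrieving frequencies and merging them into documentation, as described above, can also be automated directly. A minimal sketch, assuming simple coded responses; the value labels are invented for the example:

```python
from collections import Counter

def frequency_table(values, labels):
    """Return formatted frequency lines for one variable, pairing
    each value label with its count."""
    counts = Counter(values)
    return ["%-12s %d" % (labels.get(code, str(code)), counts[code])
            for code in sorted(counts)]

# Hypothetical coded responses for a single yes/no item.
responses = [1, 1, 2, 1, 2, 9]
lines = frequency_table(responses, {1: "Yes", 2: "No", 9: "Missing"})
print("\n".join(lines))
```

Output in this form can be written to an external file and merged into the codebook document, which is the workflow the paragraph above describes.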
Data entry and documentation as part of pretests and pilot studies
Conducting pretests or pilot studies is a good way to uncover potential problems with all aspects of a project. There are two major reasons to include both data entry and documentation in this initial phase. First, the best way to estimate the costs of data entry and documentation is to pretest them. Second, pretest data entry and documentation reveal unanticipated difficulties in record layouts, naming conventions, and the like. The cost of the most expensive task, data entry, remains low at this stage because the pretest covers only a small number of cases. The investigator may not want to prepare a comprehensive codebook on the basis of the pretest, but it is a good idea at least to prepare a mockup, or to work out the codebook layout for a few variables. See “Important Documentation Elements” in Phase 3, Data Collection and File Creation, for essential codebook components.