Guide to Social Science Data Preparation and Archiving Phase 1: Proposal Development and Data Management Plans
In the earliest stages of proposal development, researchers should consider the growing emphasis — and new requirements, in some cases — on data management plans and data sharing generally. As indicated earlier, funding agencies increasingly require that applications for support include data sharing and dissemination plans. Plans for deposit and long-term preservation should be fleshed out while the researcher is at the stage of outlining and writing the grant application. Planning ahead during this early phase of the project permits the researcher to take into account important issues — particularly issues related to disclosure risk — from the very beginning, which can simplify the process and avert problems later on at the data deposit stage.
In 2010, the National Science Foundation’s Social, Behavioral and Economic Sciences Directorate started requiring that all grant applications include a data management plan, which should include the following elements:
- Roles and responsibilities
- Expected data, including types of data to be produced by the research
- Period of data retention
- Data formats and dissemination
- Data storage and preservation of access
Other federal funding agencies such as NIH have long-standing policies recommending similar plans for data management. The information in this section is meant to help researchers meet these requirements, and is taken in large part from the ICPSR website on data management plans.
The following table lists all recommended elements (including the NSF Mapping category of each), with links to more detailed information on each. An example data management plan for depositing data with ICPSR is also included in PDF format and in a Word document for easy editing.
|Data description||A description of the information to be gathered; the nature and scale of the data that will be generated or collected.||Yes||Expected Data|
|Existing data||A survey of existing data relevant to the project and a discussion of whether and how these data will be integrated.||Yes||Expected Data|
|Format||Formats in which the data will be generated, maintained, and made available, including a justification for the procedural and archival appropriateness of those formats.||Yes||Data Format and Dissemination|
|Metadata||A description of the metadata to be provided along with the generated data, and a discussion of the metadata standards used.||Yes||Data Format and Dissemination|
|Storage and backup||Storage methods and backup procedures for the data, including the physical and cyber resources and facilities that will be used for the effective preservation and storage of the research data.||Yes||Data Storage and Preservation of Access|
|Security||A description of technical and procedural protections for information, including confidential information, and how permissions, restrictions, and embargoes will be enforced.||Yes||Data Format and Dissemination|
|Responsibility||Names of the individuals responsible for data management in the research project.||Yes||Roles and Responsibilities|
|Intellectual property rights||Entities or persons who will hold the intellectual property rights to the data, and how IP will be protected if necessary. Any copyright constraints (e.g., copyrighted data collection instruments) should be noted.||Yes||Data Format and Dissemination|
|Access and sharing||A description of how data will be shared, including access procedures, embargo periods, technical mechanisms for dissemination and whether access will be open or granted only to specific user groups. A timeframe for data sharing and publishing should also be provided.||Yes||Data Storage and Preservation of Access|
|Audience||The potential secondary users of the data.||Yes||Data Format and Dissemination|
|Selection and retention periods||A description of how data will be selected for archiving, how long the data will be held, and plans for eventual transition or termination of the data collection in the future.||Yes||Period of Data Retention|
|Archiving and preservation||The procedures in place or envisioned for long-term archiving and preservation of the data, including succession plans for the data should the expected archiving entity go out of existence.||Yes||Data Storage and Preservation of Access|
|Ethics and privacy||A discussion of how informed consent will be handled and how privacy will be protected, including any exceptional arrangements that might be needed to protect participant confidentiality, and other ethical issues that may arise.||Yes||Data Format and Dissemination|
|Budget||The costs of preparing data and documentation for archiving and how these costs will be paid. Requests for funding may be included.|
|Data organization||How the data will be managed during the project, with information about version control, naming conventions, etc.|
|Quality assurance||Procedures for ensuring data quality during the project|
|Legal requirements||A listing of all relevant federal or funder requirements for data management and data sharing.|
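The mapping in the table can also be kept as a simple checklist while drafting. The sketch below is illustrative only: the dictionary transcribes the element-to-category pairs from the table above, while the function name `missing_elements` and the sample set of drafted sections are hypothetical.

```python
# Element -> NSF category, transcribed from the table above.
# Elements without an NSF mapping in the table (budget, data organization,
# quality assurance, legal requirements) are left out.
NSF_MAPPING = {
    "Data description": "Expected Data",
    "Existing data": "Expected Data",
    "Format": "Data Format and Dissemination",
    "Metadata": "Data Format and Dissemination",
    "Storage and backup": "Data Storage and Preservation of Access",
    "Security": "Data Format and Dissemination",
    "Responsibility": "Roles and Responsibilities",
    "Intellectual property rights": "Data Format and Dissemination",
    "Access and sharing": "Data Storage and Preservation of Access",
    "Audience": "Data Format and Dissemination",
    "Selection and retention periods": "Period of Data Retention",
    "Archiving and preservation": "Data Storage and Preservation of Access",
    "Ethics and privacy": "Data Format and Dissemination",
}


def missing_elements(drafted):
    """Return mapped elements not yet drafted, grouped by NSF category."""
    gaps = {}
    for element, category in NSF_MAPPING.items():
        if element not in drafted:
            gaps.setdefault(category, []).append(element)
    return gaps


# Hypothetical state of a draft plan.
drafted_sections = {"Data description", "Format", "Responsibility"}
gaps = missing_elements(drafted_sections)
for category, elements in gaps.items():
    print(f"{category}: {', '.join(elements)}")
```

A category that no longer appears in the result has every one of its mapped elements drafted.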
Provide a brief description of the information to be gathered, including the nature, scope, and scale of the data to be produced. This will help reviewers understand the data, their relationship to existing data, and possible disclosure risks.
A thorough review of existing data in related journals and data archives will make clear the value of the proposed research and why currently available datasets are inadequate to answer your research questions.
Describe the formats of the data in the submission, distribution, and preservation phases (note that these formats may be the same). Choosing formats preferred for archiving can make processing and release of data faster and more efficient. Platform-independent and non-proprietary formats will ensure that data will be usable over the long term.
Note that when writing the grant proposal, it is useful to think of “data” in the widest sense, including numeric data files, interview transcripts, and other qualitative materials such as diaries and field notes. Increasingly, social science research data include audio and video formats, geospatial data, biomedical data, and websites, and many data archives are interested in capturing this broadening array of data.
Archiving and disseminating derived datasets — that is, those resulting from the combination of data from more than one data source, including existing data outside the current research scope — also should be considered. See Phase 6, Depositing Data, for a more in-depth discussion.
Describe the metadata to be provided along with the generated data, and discuss the metadata standards used. As metadata are often the only form of communication between the secondary analyst and the data producer, good descriptive metadata are essential for effective data use. Structured or tagged metadata, such as the XML format of the Data Documentation Initiative (DDI), are optimal because of the flexibility they offer in display. XML is also preservation-ready and machine-actionable. For a more detailed discussion on metadata and documentation, please see the “Best Practices in Creating Technical Documentation” section in Phase 3, Data Collection and File Creation.
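To illustrate what structured, machine-actionable metadata can look like, the following minimal sketch writes a small variable-level codebook as XML and reads it back. The element names (`codebook`, `variable`, `label`, `category`), the function `build_codebook`, and the sample study are hypothetical and are not the DDI schema; a real deposit would follow the DDI specification or the archive's own guidance.

```python
import xml.etree.ElementTree as ET


def build_codebook(title, variables):
    """Build a small XML codebook; element names are illustrative, not DDI."""
    root = ET.Element("codebook")
    ET.SubElement(root, "title").text = title
    for var in variables:
        node = ET.SubElement(root, "variable", name=var["name"])
        ET.SubElement(node, "label").text = var["label"]
        # Value labels are optional; numeric codes are stored as strings.
        for code, meaning in var.get("values", {}).items():
            ET.SubElement(node, "category", code=str(code)).text = meaning
    return root


# Hypothetical variables for a hypothetical survey.
variables = [
    {"name": "AGE", "label": "Age of respondent in years"},
    {
        "name": "EMPLOY",
        "label": "Employment status",
        "values": {1: "Employed", 2: "Unemployed", 9: "Missing"},
    },
]

codebook = build_codebook("Hypothetical Household Survey", variables)
xml_text = ET.tostring(codebook, encoding="unicode")
print(xml_text)

# Because the metadata are tagged, software can read them back directly.
parsed = ET.fromstring(xml_text)
labels = {v.get("name"): v.findtext("label") for v in parsed.iter("variable")}
```

Because each label and value code sits in its own tagged element, the same file can drive a printed codebook, a searchable catalog entry, or statistical-package setup files.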
Indicate how and where you will store copies of your research files to ensure their safety, as well as how many copies you will keep and how you will synchronize them. The best practice for protecting data is to store multiple copies in multiple locations.
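One way to confirm that multiple copies remain synchronized is to compare checksums. The sketch below is a minimal illustration under stated assumptions: the file names, the directory layout, and the helper functions `sha256_of` and `copies_in_sync` are hypothetical, and SHA-256 is simply one commonly used checksum, not a requirement of any funder or archive.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_in_sync(master, copies):
    """Map each copy to True if it exists and matches the master's checksum."""
    expected = sha256_of(master)
    return {
        str(copy): Path(copy).exists() and sha256_of(copy) == expected
        for copy in copies
    }


# Demonstration in a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)
    master = work / "survey_v1.csv"
    master.write_text("id,age\n1,34\n2,51\n", encoding="utf-8")

    good = work / "backup_a" / "survey_v1.csv"
    stale = work / "backup_b" / "survey_v1.csv"
    for copy in (good, stale):
        copy.parent.mkdir()
        shutil.copy2(master, copy)
    # Simulate a copy that has drifted out of date.
    stale.write_text("id,age\n1,34\n", encoding="utf-8")

    status = copies_in_sync(master, [good, stale])
    in_sync = [name for name, ok in status.items() if ok]
    print(status)
```

In practice the copies would sit in separate locations, and a check of this kind could run on whatever schedule the plan commits to.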
Describe measures you will take to ensure your data are secure. This is an important consideration over the entire life cycle of the data. Raw data may include direct identifiers of study participants and should be well protected during collection and processing. Examples of good security practices include access restrictions such as passwords, encryption, power supply backup, and virus and intruder protections.
State who will act as the responsible steward for the data throughout the data life cycle. Researchers should describe any atypical circumstances. For example, if there is more than one principal investigator, describe the division of responsibilities between them.
Indicate who will hold intellectual property rights to the data and other information created by the project, and whether these rights will be transferred to another organization for data distribution and archiving. If any copyrighted material (e.g., instruments or scales) is used, how will the project obtain permission to use or disseminate it?
Data archives need a clear statement from the data producer of who owns the data before they can be disseminated. However, issues of data ownership can be complex. For example, principal investigators on federally funded projects are responsible for collecting research data and publishing their research findings, but the resulting research data are typically owned by the institution where the principal investigator is employed.
Funding organizations expect researchers to share their data. Public archives can help universities meet those expectations without requiring a transfer of copyright along with research data. A copy of the research data can be shared publicly through an archive while ownership rights remain with the copyright holder. Agreements to publicly archive data typically grant a repository permission to preserve and disseminate the data.
Indicate how you intend to archive and share your data, and why you have chosen that particular option. Mechanisms for archiving and sharing include:
- Domain repositories, such as ICPSR (social science)
- Self-dissemination through a dedicated website created by the research team. Options for eventual dissemination should be arranged through an established archive after the self-dissemination period ends. A schedule of when dissemination will be turned over to a third party should be included. The archive may want to make a preservation copy during the period of self-dissemination for a number of reasons: (1) to develop expertise with the data; (2) to process the data while knowledgeable staff are available; and (3) for general safekeeping.
- Preservation with delayed dissemination, in which the data producer arranges with a public data repository for archival preservation with dissemination to occur at a later date, usually within a year. With delayed dissemination, the deposit may be completed when it is easiest for the depositor and the archive to manage the data, as opposed to delaying preservation activities until the time has come to disseminate the data. Issues regarding the schedule for eventual dissemination, embargo periods, and human subject protections specific to these studies will be settled prior to deposit, as will ground rules on the extent of processing by archival staff while the study remains in the “preservation with delayed dissemination” category.
- Institutional repositories at academic institutions, which have the goal of preserving and making available some portion of the academic work of their students, faculty, and staff. Not all such repositories have the capacity to accept and curate data. There are generally two types of institutional repositories: those with a focus on a particular discipline, and those without. Each type provides certain benefits and drawbacks for data producers and users that should be considered when deciding which to use.
- Restricted-use collections. In cases in which masking of sensitive data would lessen the analytic power of a dataset, a restricted-use release may be appropriate. Access to restricted-use data can be limited to approved researchers under controlled conditions. Some archives can provide both restricted-use and public-use releases, where the public files have been altered to prevent disclosure of sensitive information about survey participants. See Phase 5, Final Project Phase, for more on protecting respondent confidentiality.
Sharing data helps advance science and maximize research investment. Recent research has found that when data are shared through an archive, research productivity is enhanced and the number of publications based on the data is dramatically increased (Pienta, 2010). Experience also has shown that the durability of the data improves and the cost of processing and preservation decreases when data deposits are timely. It is important that data be deposited while the producers are still familiar with the dataset and able to transfer their knowledge fully to the archive.
The grant proposal should specify the likely users (academic or nonacademic) of the datasets. Most potential users will be within the higher education research community, but increasingly policymakers and practitioners are using research data. If the dataset has commercial or other uses, this should also be stated in the application for funding. This will potentially influence how the data are managed or shared.
Describe how data will be selected for archiving, how long they will be held, and plans for eventual transition or termination of the data collection in the future.
Describe how the data will be preserved for the long term. Digital data need to be actively managed over time to ensure they are always available and usable. Digital content requires ongoing preservation action to remain readable, understandable, and meaningful. Depositing data resources with a trusted digital archive can ensure they are curated and handled according to good practices in digital preservation.
If applicable, indicate how informed consent will be handled, in particular how respondents will be told that the personal information they provide will remain confidential when data are shared or made available for secondary analysis. This may mean describing:
- Plans to obtain Institutional Review Board approval
- Any legal constraints on sharing data such as HIPAA
- Methods of managing disclosure risk.
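As one illustration of a disclosure-risk check, the sketch below flags records whose combination of indirect identifiers is shared by only a few respondents. This is a common screening heuristic offered here as an assumption, not a method prescribed by this guide; the function name `risky_records`, the variable names, and the threshold `k=3` are all hypothetical.

```python
from collections import Counter


def risky_records(records, quasi_identifiers, k=3):
    """Return records whose quasi-identifier combination occurs fewer than k times."""

    def key(record):
        return tuple(record[name] for name in quasi_identifiers)

    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] < k]


# Hypothetical records; direct identifiers such as names are already removed.
records = [
    {"id": 1, "age_group": "30-39", "county": "A", "occupation": "teacher"},
    {"id": 2, "age_group": "30-39", "county": "A", "occupation": "teacher"},
    {"id": 3, "age_group": "30-39", "county": "A", "occupation": "teacher"},
    {"id": 4, "age_group": "60-69", "county": "B", "occupation": "judge"},
]

flagged = risky_records(records, ["age_group", "county", "occupation"], k=3)
print([record["id"] for record in flagged])
```

Flagged records are candidates for coarser coding, suppression, or release only in a restricted-use file; which remedy is appropriate depends on the study and should be settled with the archive.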
Generally speaking, informed consent agreements and confidentiality should be considered as early as possible in the research process. Protection of individuals’ privacy is a core tenet of responsible research practice, and must be thoroughly addressed.
“Informed consent” refers to the communication process that allows individuals to make informed choices about participating in a research study. An informed consent agreement provides required information about the study and serves as a formal agreement by an individual to willingly participate in the proposed research. A description of how participant confidentiality will be protected must be included in an informed consent agreement.
Language in an informed consent agreement giving the research team exclusive access to the data or promising that the data will only be shared in aggregate form or statistical tables could make archiving and disseminating the data more difficult later. Disclosure protection methods can guard sensitive information while preserving the analytic power of a dataset, rendering such restrictive language in informed consent agreements unnecessary. Two examples of non-restrictive statements on confidentiality are given below:
Sample 1. We will make our best effort to protect your statements and answers, so that no one will be able to connect them with you. These records will remain confidential. Federal or state laws may require us to show information to university or government officials [or sponsors], who are responsible for monitoring the safety of this study. Any personal information that could identify you will be safeguarded and maintained under controlled conditions.
Sample 2. The information in this study will only be used in ways that will not reveal who you are. You will not be identified in any publication from this study. Your participation in this study is confidential. Federal or state laws may require us to show information to university or government officials [or sponsors], who are responsible for monitoring the safety of this study.
The investigator should outline the plans for and cost of preparing the data and documentation for archiving. Ideally, this should be planned in conjunction with an archive. Some potentially costly activities are listed below:
- For quantitative data, investigators should allocate resources to create system-specific files with appropriate variable and value labeling, to supply the syntax for derived variables, etc.
- Grant applications should allocate sufficient time and money for the preparation of high-quality documentation.
- Informed consent and confidentiality issues impact costs for archiving. For clarity, informed consent agreement forms should be drawn up at the start of the project.
- It is strongly recommended that a set period of time be dedicated to preparing and collating materials for deposit. This normally accounts for the majority of the costs for archiving.
Describe how the data will be managed during the project, including information about version control, naming conventions, etc. Indicating how your data may differ from the norm will help other researchers during secondary analysis. For example, if the data are dynamic, version control would be central to how the data will be used and understood by the research community.
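A naming convention is easiest to follow when it can be checked automatically. The following sketch assumes a hypothetical convention of the form `project_content_YYYYMMDD_vNN.ext`; the pattern, the permitted extensions, the function names, and the file names are illustrative only and should be replaced with whatever convention the project adopts.

```python
import re

# Hypothetical convention: project_content_YYYYMMDD_vNN.ext
PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<content>[a-z0-9-]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d{2})\.(?P<ext>csv|txt|xml)$"
)


def parse_name(filename):
    """Return the parts of a conforming filename, or None if it does not conform."""
    match = PATTERN.match(filename)
    return match.groupdict() if match else None


def latest_version(filenames):
    """Return the conforming filename with the highest version number, or None."""
    parsed = [
        (int(parse_name(name)["version"]), name)
        for name in filenames
        if parse_name(name)
    ]
    return max(parsed)[1] if parsed else None


files = [
    "hhs_wave1-interviews_20240115_v01.csv",
    "hhs_wave1-interviews_20240220_v02.csv",
    "final FINAL data.csv",
]

nonconforming = [name for name in files if parse_name(name) is None]
newest = latest_version(files)
print("Nonconforming:", nonconforming)
print("Latest version:", newest)
```

Embedding the version number in the name does not replace a version control system, but it keeps the current file identifiable when copies circulate among project staff.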
Describe procedures for ensuring data quality during the project.
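Such procedures often include automated checks run as data are entered or received. The sketch below shows one possible form, assuming numeric variables with known valid ranges and a unique respondent identifier; the function `check_records`, the field names, and the ranges are hypothetical.

```python
def check_records(records, rules, id_field="id"):
    """Return (row number, field, problem) tuples for basic quality problems."""
    problems = []
    seen_ids = set()
    for row_number, record in enumerate(records, start=1):
        # Each respondent identifier should appear only once.
        record_id = record.get(id_field)
        if record_id in seen_ids:
            problems.append((row_number, id_field, "duplicate identifier"))
        seen_ids.add(record_id)
        # Each checked field should be present and within its valid range.
        for field, (low, high) in rules.items():
            value = record.get(field)
            if value is None:
                problems.append((row_number, field, "missing value"))
            elif not low <= value <= high:
                problems.append((row_number, field, f"out of range {low}-{high}"))
    return problems


# Hypothetical valid ranges and records.
rules = {"age": (18, 99), "satisfaction": (1, 5)}
records = [
    {"id": 1, "age": 34, "satisfaction": 4},
    {"id": 2, "age": 17, "satisfaction": 5},
    {"id": 2, "age": 40, "satisfaction": None},
]

problems = check_records(records, rules)
for row_number, field, problem in problems:
    print(f"row {row_number}: {field}: {problem}")
```

Keeping the rules in one place also documents the valid ranges, which can later be carried into the codebook.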
Provide a listing of all relevant federal or funder requirements for data management and data sharing.