Depositing Files from Analysis of Existing Data
Some projects do not involve original data collection and instead combine data from one or more existing sources. Projects may also enhance the original data file with additional variables created during the analysis. For these types of projects, a deposit might include the original data and derived variables, the derived variables alone, or, where adequate, just the code used during the project to link the data and create additional variables.
In any of these situations, the source of the existing data used for the analysis should be clearly identified, along with the date the data were obtained from the source. If the original data are already available from ICPSR or NACJD, please specify the ICPSR study number and the ICPSR version of the data collection used.
Existing data not already publicly available
The existing data may be the primary data file analyzed for a project. The existing data may also have been enhanced with variables derived from them. The guiding questions on what to deposit are: (a) are the existing data already publicly available, and (b) how easily can the derived variables be recreated?
If the existing data used for the analysis are not already publicly available, researchers are encouraged to deposit the data, with the permission of the original data producer. If existing data were linked to original data collected during the project, the investigator is encouraged to deposit the existing data as well, especially if these data are needed along with the project's original data to replicate the project findings. Sometimes this is not an option, as in instances when the data were obtained informally from individuals who did not intend for the public release of the data via another researcher or when the data were obtained from other government sources and the investigator does not have approval to publicly archive the data. Still, if the data are not yet publicly available and can be archived, researchers are encouraged to do so.
Often, the researcher will compute new variables based on the existing data (new categories are created, rates are produced, scales are developed). All useful derived variables should also be deposited, especially if the variables were used for analyses included in publications. Depending on the factors mentioned above, the derived variables may be deposited with the existing data or alone. The code or setup file used to create the derived variables should also be provided.
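As an illustration of what such a setup file might contain, the following is a minimal sketch in Python. The function names, cut points, and the per-100,000 base are hypothetical choices made for this example, not NACJD requirements; an actual setup file would use the project's own variables and coding rules.

```python
# Hypothetical setup file: each function documents how one derived
# variable was created from variables in the existing data.

def rate_per_100k(count, population):
    """Rate: events per 100,000 population (None when population is missing or zero)."""
    if not population:
        return None
    return count * 100_000 / population


def age_group(age_in_years):
    """New category: collapse age into 1 = under 18, 2 = 18 to 64, 3 = 65 and over."""
    if age_in_years < 18:
        return 1
    if age_in_years < 65:
        return 2
    return 3


def scale_score(items):
    """Scale: sum of the item responses (None if any item is missing)."""
    if any(item is None for item in items):
        return None
    return sum(items)
```

Keeping each rule in its own documented function lets another user recreate the derived variables from the source data even when only the code is deposited.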
Existing data may have been linked to original data collected by the project. The guiding questions on what to deposit are: (a) are the existing data already publicly available, (b) how easily can the original data be linked to the existing data, and (c) how easily can the derived variables be recreated?
If the linkage is straightforward and the existing data are publicly available, then users can easily obtain the existing data themselves from the original source and link them to the original data submitted by the researcher. In this case, the project report(s) should clearly identify the source of the existing data, including version and/or date, so that other users know which data to obtain. The code or setup file used to link the data should also be provided, and it should identify the variable (or combination of variables) that constitutes the unique identifier used to link the data files.
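A straightforward linkage of this kind can be sketched as follows. This is an assumed, simplified example in Python: the function `link_files` and the variable names in the usage note are hypothetical, and records are represented as plain dictionaries rather than any particular statistical package's format.

```python
def link_files(original_rows, existing_rows, key_vars):
    """Link original records to existing records that share the same key.

    key_vars names the variable (or combination of variables) that is
    assumed to uniquely identify a record in the existing data.
    """
    def key_of(row):
        return tuple(row[var] for var in key_vars)

    # Index the existing data by key, refusing to continue if the key
    # turns out not to be unique.
    existing_by_key = {}
    for row in existing_rows:
        key = key_of(row)
        if key in existing_by_key:
            raise ValueError(f"key {key} is not unique in the existing data")
        existing_by_key[key] = row

    linked, unmatched = [], []
    for row in original_rows:
        match = existing_by_key.get(key_of(row))
        if match is None:
            unmatched.append(row)  # report these rather than dropping them silently
        else:
            # Values from the original record take precedence on a name clash.
            linked.append({**match, **row})
    return linked, unmatched
```

For example, `link_files(survey_rows, county_rows, ["STATE", "COUNTY"])` would link on a hypothetical state and county code pair. Checking uniqueness and reporting unmatched records makes the linkage easier for other users to verify.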
If the linkage is not straightforward, then the researcher is providing a useful service by depositing the linked data. Examples that fit this type of situation include: (a) the link requires a judgment call based on a combination of nonunique variables, for example, age, sex, and race of an individual and date of incident; and (b) an understanding of local geographic factors is needed to link correctly, for example, at the neighborhood or block level, especially over a range of years when boundaries shift. Here, the redundancy of having data stored twice at the archive is outweighed by the usefulness of providing others with the data already linked.
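To show why a link on nonunique variables calls for judgment, the sketch below lists, for each original record, every existing record that agrees on the matching variables and flags the cases a researcher would have to resolve by hand. The function `candidate_links` and its status labels are hypothetical and written for illustration in Python; they are not a prescribed matching procedure.

```python
def candidate_links(original_rows, existing_rows, match_vars):
    """List the existing records that agree with each original record on match_vars."""
    # Group positions of existing records by their values on the matching variables.
    positions_by_values = {}
    for position, row in enumerate(existing_rows):
        values = tuple(row[var] for var in match_vars)
        positions_by_values.setdefault(values, []).append(position)

    results = []
    for position, row in enumerate(original_rows):
        values = tuple(row[var] for var in match_vars)
        candidates = positions_by_values.get(values, [])
        if len(candidates) == 1:
            status = "unique"      # can be linked without review
        elif candidates:
            status = "ambiguous"   # several candidates: needs a judgment call
        else:
            status = "none"        # no candidate found
        results.append(
            {"original": position, "candidates": candidates, "status": status}
        )
    return results
```

Records flagged as ambiguous are the ones whose resolution other users could not easily reproduce, which is why depositing the already linked file is useful in this situation.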
When the existing data are Census data, the linked Census data should be deposited as well, even though the link between the data files can be quite straightforward and the Census data are publicly available. Since the original Census files are large and contain many variables, determining which Census variables to use and at what level to extract the data for the subsets can be time-consuming. Depositing the linked Census data makes it unnecessary for other users to repeat these subsetting steps.
Often, after the data are linked, the researcher may compute new variables based on the linked data (new categories are created, rates are produced, scales are developed). All useful derived variables should also be deposited, especially if the variables were used for analyses included in publications. Depending on the factors mentioned above, the derived variables may be deposited in a data file that includes both the original data collected for the project and the existing data from another source, or they may be deposited with the original data alone. The code or setup file used to link the files and create the derived variables should also be provided.
Existing data used solely
If the project only involved analysis of existing data already publicly available and the product of the project is the analysis alone, then data may not need to be deposited. Researchers are encouraged, however, to deposit the code that produced the analysis.
Researchers are always encouraged to deposit original data collected during their project. On the other hand, the decision on whether to archive the existing data, the existing data and the derived variables, or the code alone needs to be made on a case-by-case basis. The principal investigator should contact NACJD staff for assistance at email@example.com if the decision on what to deposit is unclear.