Projects Analyzing Existing Data

Submission Guidelines

Some projects are secondary data analysis projects and do not involve original data collection. These projects obtain data from one or more existing sources. The original data may already be available to the public, or restrictions may preclude their deposit at NACJD. In these situations, the original data may not need to be deposited; however, the existing data source should be clearly identified, and a description of the steps taken to acquire the data should be provided so that another analyst will be able to replicate the process. This description is also known as a "data road map".

If the original data are already available from ICPSR, please specify the ICPSR study number and version of the data. Any new or derived variables created from the original data should be submitted for archiving along with documentation (code, syntax, or setup files) showing how the variables were created. Refer to the Guidelines for Depositing NIJ and OJJDP Data at NACJD document for more information.

When the existing data used for the analysis are not publicly available, researchers are strongly encouraged to deposit the data, with the permission of the original data producer. This is especially important if existing data were linked to data collected under the award to produce project findings, and both the existing and original data are needed to replicate the analysis. If the original data producer does not permit their data (existing or derived) to be archived, the investigator should contact their assigned NIJ or OJJDP Grant Manager before submitting their files to NACJD.

Existing Data Linked to Collected Data

Existing data may have been linked to original data collected by the project. The guiding questions on what to deposit are: (a) whether the existing data are already publicly available, (b) how easily the original data can be linked to the existing data, and (c) how easily the derived variables can be recreated.

If the linkage is straightforward and the existing data are publicly available, then users can easily obtain the existing data themselves from the original source and link them to the original data submitted by the researcher. In this case, the project report(s) should clearly identify the source of the existing data, including version and/or date, so that other users know which data to obtain. The code or setup file used to link the data should also be provided, and it should identify the variable (or combination of variables) that constitutes the unique identifier used to link the data files.
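A linking file of the kind described above might look like the following minimal sketch. It is an illustration only: the records, the `case_id` identifier, and the `link_records` helper are hypothetical and are not prescribed by NACJD.

```python
# Hypothetical original data collected by the project.
original = [
    {"case_id": "A001", "offense": "burglary"},
    {"case_id": "A002", "offense": "assault"},
    {"case_id": "A003", "offense": "theft"},
]

# Hypothetical records from a publicly available existing source.
existing = [
    {"case_id": "A001", "county_pop": 52000},
    {"case_id": "A003", "county_pop": 118000},
]

# Assumed unique identifier shared by both files; document it explicitly.
LINK_KEY = "case_id"


def link_records(left, right, key):
    """Attach rows of `right` to rows of `left`; `key` must be unique in `right`."""
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"duplicate key in existing data: {row[key]!r}")
        lookup[row[key]] = row

    linked = []
    for row in left:
        match = lookup.get(row[key])
        merged = dict(row)
        if match is not None:
            merged.update(match)
        # Flag unlinked records so other users can see the match rate.
        merged["linked"] = match is not None
        linked.append(merged)
    return linked


linked = link_records(original, existing, LINK_KEY)
```

Keeping every original record and flagging whether it was linked lets another analyst confirm that their own link reproduces the same matches.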

If the linkage is not straightforward, then the researcher is providing a useful service by depositing the linked data. Examples that fit this type of situation include: (a) the link requires a judgment call based on a combination of nonunique variables, for example, age, sex, and race of an individual and date of incident; and (b) an understanding of local geographic factors is needed to link correctly, for example, at the neighborhood or block level, especially over a range of years when boundaries shift. Here, the redundancy of having data stored twice at the archive is outweighed by the usefulness of providing others with the data already linked.
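To illustrate why a link on nonunique variables calls for judgment, the sketch below lists every candidate match for each record and flags those with more than one candidate. All records, variable names, and the `candidate_matches` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical incident records collected by the project.
incidents = [
    {"rec": 1, "age": 34, "sex": "M", "race": "W", "incident_date": "2010-05-01"},
    {"rec": 2, "age": 27, "sex": "F", "race": "B", "incident_date": "2010-05-03"},
]

# Hypothetical existing records that share no unique identifier with the above.
court_records = [
    {"court_id": "C10", "age": 34, "sex": "M", "race": "W", "incident_date": "2010-05-01"},
    {"court_id": "C11", "age": 34, "sex": "M", "race": "W", "incident_date": "2010-05-01"},
    {"court_id": "C12", "age": 27, "sex": "F", "race": "B", "incident_date": "2010-05-03"},
]

# Assumed combination of nonunique variables used for matching.
MATCH_VARS = ("age", "sex", "race", "incident_date")


def candidate_matches(left, right, match_vars):
    """Map each left record to all right records that agree on match_vars."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[v] for v in match_vars)].append(row["court_id"])
    return {
        row["rec"]: list(index.get(tuple(row[v] for v in match_vars), []))
        for row in left
    }


candidates = candidate_matches(incidents, court_records, MATCH_VARS)

# Records with more than one candidate need a documented judgment call.
ambiguous = [rec for rec, ids in candidates.items() if len(ids) > 1]
```

Here record 1 agrees with two court records on every matching variable, so whichever one the researcher chose cannot be recovered by another user from the code alone. Depositing the linked file preserves that decision.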

When the existing data are Census data, the linked Census data should be deposited as well, even though the link between the data files can be quite straightforward and the Census data are publicly available. Because the original Census files are large and contain many variables, determining which Census variables to use and at what level to extract the data for the subsets can be time-consuming. Depositing the linked Census data makes it unnecessary for other users to repeat these subsetting steps.
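The subsetting step can be sketched as follows, assuming a tract-level extract in CSV form. The column names, tract identifiers, and the `subset_census` helper are hypothetical and do not correspond to actual Census file layouts.

```python
import csv
import io

# Hypothetical tract-level extract; real Census files have many more columns.
census_csv = io.StringIO(
    "tract_id,total_pop,median_income,pct_vacant,land_area\n"
    "0101,4200,38000,7.5,1.2\n"
    "0102,3100,52000,3.1,0.9\n"
    "0201,5600,41000,9.8,2.4\n"
)

# Assumed choices made by the researcher: which variables and which tracts.
KEEP_VARS = ["tract_id", "total_pop", "median_income"]
STUDY_TRACTS = {"0101", "0201"}


def subset_census(handle, keep_vars, tract_ids):
    """Keep only the chosen variables for tracts in the study area."""
    reader = csv.DictReader(handle)
    return [
        {v: row[v] for v in keep_vars}
        for row in reader
        if row["tract_id"] in tract_ids
    ]


# Values stay as strings, exactly as read from the CSV.
subset = subset_census(census_csv, KEEP_VARS, STUDY_TRACTS)
```

Recording the variable list and geographic selection in the code documents the choices that would otherwise have to be rediscovered.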

Often, after the data are linked, the researcher may compute new variables based on the linked data (new categories are created, rates are produced, scales are developed). All useful derived variables should also be deposited, especially if the variables were used for analyses included in publications. Depending on the factors mentioned above, the derived variables may be deposited in a data file that includes both the original data collected for the project and the existing data from another source, or they may be deposited with the original data alone. The code or setup file used to link the files and create the derived variables should also be provided.
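A setup file that creates derived variables might resemble this minimal sketch, which computes a rate and a category from linked data. The variable names, the sample values, and the cutoff of 10 per 1,000 are assumptions made for illustration.

```python
# Hypothetical linked records: project incident counts plus existing population data.
linked_rows = [
    {"tract_id": "0101", "incidents": 21, "total_pop": 4200},
    {"tract_id": "0201", "incidents": 84, "total_pop": 5600},
]

# Assumed cutoff for the category variable; state the real rule in documentation.
HIGH_RATE_CUTOFF = 10.0


def add_derived(row):
    """Return a copy of the row with a rate and a rate category added."""
    out = dict(row)
    out["rate_per_1000"] = round(1000 * row["incidents"] / row["total_pop"], 2)
    out["rate_cat"] = "high" if out["rate_per_1000"] >= HIGH_RATE_CUTOFF else "low"
    return out


derived = [add_derived(row) for row in linked_rows]
```

Because the derived file keeps the source variables alongside the new ones, another analyst can verify each derived value against its inputs.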