Using Machine Learning to Identify High-Risk Domestic Violence Offenders in New York City, New York, 2006-2017 (ICPSR 38540)
Version Date: Feb 12, 2024
Principal Investigator(s):
Jens Ludwig, University of Chicago. Crime Lab
https://doi.org/10.3886/ICPSR38540.v1
Version V1
Summary
To address the relative difficulty in predicting domestic violence incidents and effectively targeting resources, the University of Chicago Crime Lab and the New York Police Department (NYPD) collaborated to develop and test a machine learning-based statistical model to predict the risk of domestic violence victimization in New York City.
Phase 1 of the project was to develop a statistical model using machine learning techniques. NYPD administrative records dated between January 2006 and January 2017 were used as input data to build and refine the tool. Due to the lack of unique identifiers for victims in the records, the research team also used data from the Chicago Police Department to create a probabilistic record linkage toolkit (Name Match) to identify which records belonged to the same person within and across data sources.
In Phase 2, the researchers aimed to field test the tool's ability to identify individuals at risk of repeated domestic violence through a large-scale randomized controlled trial. The trial was intended to measure the effects of regular home visits to high-priority individuals thought to be at risk of serious domestic assault, and to compare the individuals selected by officers with those selected by the tool.
This collection contains only the machine learning code files (R and Python) created during secondary analysis, which have been released as a zipped package. Please refer to the Data Roadmap for instructions on how to obtain the original NYPD data. To access the Name Match algorithm and documentation, please visit the GitHub repository.
Citation
Funding
Subject Terms
Geographic Coverage
Smallest Geographic Unit
None
Distributor(s)
Time Period(s)
Date of Collection
Study Purpose
The purpose of this project was to develop and test a machine learning-based statistical model to predict the risk of domestic violence victimization to improve targeting of resources in New York City.
Study Design
The machine learning tool developed incorporated New York Police Department (NYPD) administrative records covering all of New York City between January 2006 and January 2017, including domestic incident reports, criminal complaints, arrests, aided reports, shootings, and homicides.
While unique identifiers were present for arrestees, they did not exist for victims. The research team developed a probabilistic record linkage algorithm, Name Match, to identify which records belonged to the same person within and across data sources. The algorithm compares identifying fields (e.g., name, birthdate, address, sex, race) between two records and predicts whether or not they refer to the same person. With Name Match, the researchers were able to create a victim-level dataset linking domestic violence victims and offenders to past and future law enforcement incidents. To generate predictions of violent felony domestic violence victimization over a 12-month follow-up period, the researchers used data from 2006 through 2014, dividing the data into a training set and a test set.
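The field-by-field comparison described above can be illustrated with a minimal Python sketch. This is not the Name Match implementation or its API: the `PersonRecord` layout, the field weights, the 0.85 threshold, and the use of `difflib` string similarity are all assumptions chosen for illustration, whereas the study's algorithm predicts matches rather than applying a fixed rule.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class PersonRecord:
    """Hypothetical record layout; field names are illustrative only."""
    name: str
    birthdate: str  # ISO date string, e.g. "1980-01-31"
    address: str
    sex: str
    race: str


# Illustrative weights (assumed, not taken from the study); they sum to 1.0.
FIELD_WEIGHTS = {
    "name": 0.4,
    "birthdate": 0.3,
    "address": 0.15,
    "sex": 0.075,
    "race": 0.075,
}


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity, ignoring case and surrounding whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0  # a missing value contributes no evidence of a match
    return SequenceMatcher(None, a, b).ratio()


def match_score(r1: PersonRecord, r2: PersonRecord) -> float:
    """Weighted average of per-field similarities between two records."""
    return sum(
        weight * field_similarity(getattr(r1, field), getattr(r2, field))
        for field, weight in FIELD_WEIGHTS.items()
    )


def same_person(r1: PersonRecord, r2: PersonRecord, threshold: float = 0.85) -> bool:
    """Declare a match when the weighted score reaches the assumed threshold."""
    return match_score(r1, r2) >= threshold
```

A fixed threshold keeps the sketch short. A production linkage tool would more plausibly learn the decision rule from labeled record pairs and restrict comparisons to candidate pairs that share a blocking key.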
Phase 2 was designed to test the developed model in the field. The researchers sought to test whether the statistical model or domestic violence officers selected victims at higher risk for regular home visits, as well as to determine the treatment effect of home visits on violent felony domestic violence revictimization. Launched in July 2017, the field intervention involved 60 NYPD commands randomized into either treatment or control groups (30 in each group). The control group operated as usual. The intervention group added two individuals per officer to the high-priority list of those receiving regular home visits (one selected via algorithm, one selected by officers). However, due to external constraints, the study design was modified to add a quasi-experiment comparing individuals who received home visits to those who did not receive home visits based on residence in a particular NYPD command area.
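The assignment of 60 commands to two groups of 30 can be sketched as a simple seeded randomization. This is an illustration only: the study documentation summarized here does not say whether randomization was simple, blocked, or stratified, and the function name, command labels, and seed below are hypothetical.

```python
import random


def randomize_commands(command_ids, seed=0):
    """Randomly split units into equal-sized treatment and control groups."""
    ids = sorted(command_ids)  # fixed starting order so the seed is reproducible
    if len(ids) % 2 != 0:
        raise ValueError("need an even number of units for an equal split")
    rng = random.Random(seed)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"treatment": sorted(ids[:half]), "control": sorted(ids[half:])}


# Placeholder labels; real command identifiers are not part of this collection.
commands = [f"command_{i:02d}" for i in range(1, 61)]
groups = randomize_commands(commands, seed=2017)
```

Sorting the identifiers before shuffling means the same seed always reproduces the same assignment, regardless of the order in which the commands were supplied.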
Sample
Not applicable.
Time Method
Universe
New York Police Department administrative records covering all of New York City between January 2006 and January 2017.
Unit(s) of Observation
Data Source
New York Police Department (NYPD)
Data Type(s)
Mode of Data Collection
Description of Variables
The following variables were used in the secondary analysis and tool development:
- Victim and/or offender personal identifiable information (PII): name, birthdate, reported age, sex, race, home address, unique ID (if available)
- Incident details: date, time, precinct, address, type of incident, penal code, law code, police department code description, narrative description of incident
- Other incident indicators: fatal vs. non-fatal, whether desk appearance ticket was issued, if arrest was victim-driven or proactive
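For readers working with the released code, the variable groups listed above could be represented as a single record type along the following lines. The class name, field names, and types are hypothetical and are not taken from the NYPD data or the deposited code files. Most fields are optional because the source notes that some identifiers, such as a unique ID, are not always available.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IncidentRecord:
    """Hypothetical layout mirroring the three variable groups listed above."""
    # Victim and/or offender identifiers (PII)
    name: str
    birthdate: Optional[str]
    reported_age: Optional[int]
    sex: Optional[str]
    race: Optional[str]
    home_address: Optional[str]
    unique_id: Optional[str]  # None when no unique ID is available
    # Incident details
    occurred_at: datetime  # combines the incident date and time
    precinct: str
    incident_address: Optional[str]
    incident_type: str
    penal_code: Optional[str]
    law_code: Optional[str]
    pd_code_description: Optional[str]
    narrative: Optional[str]
    # Other incident indicators
    fatal: bool = False
    desk_appearance_ticket: bool = False
    victim_driven_arrest: Optional[bool] = None  # None when not recorded or not applicable

    def has_unique_id(self) -> bool:
        """True when the record carries a non-empty unique identifier."""
        return bool(self.unique_id)
```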
Response Rates
Not applicable.
Presence of Common Scales
None
Notes
These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed.
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
