Crime Hot Spot Forecasting with Data from the Pittsburgh [Pennsylvania] Bureau of Police, 1990-1998 (ICPSR 3469)

Version Date: Aug 7, 2015

Principal Investigator(s):
Wilpen L. Gorr, Carnegie Mellon University; Andreas Olligschlaeger, Carnegie Mellon University

https://doi.org/10.3886/ICPSR03469.v1

Version V1

This study used crime count data from the Pittsburgh, Pennsylvania, Bureau of Police offense reports and 911 computer-aided dispatch (CAD) calls to determine the best univariate forecast method for crime and to evaluate the value of leading indicator crime forecast models.

The researchers used the rolling-horizon experimental design, a design that maximizes the number of forecasts for a given time series at different times and under different conditions. Under this design, several forecast models are used to make alternative forecasts in parallel. For each forecast model included in an experiment, the researchers estimated models on training data, forecasted one month ahead to new data not previously seen by the model, and calculated and saved the forecast error. Then they added the observed value of the previously forecasted data point to the next month's training data, dropped the oldest historical data point, and forecasted the following month's data point. This process continued over a number of months.

A total of 15 statistical datasets and 3 geographic information systems (GIS) shapefiles resulted from this study.

The statistical datasets consist of

  • Univariate Forecast Data by Police Precinct (Dataset 1) with 3,240 cases
  • Output Data from the Univariate Forecasting Program: Sectors and Forecast Errors (Dataset 2) with 17,892 cases
  • Multivariate, Leading Indicator Forecast Data by Grid Cell (Dataset 3) with 5,940 cases
  • Output Data from the 911 Drug Calls Forecast Program (Dataset 4) with 5,112 cases
  • Output Data from the Part One Property Crimes Forecast Program (Dataset 5) with 5,112 cases
  • Output Data from the Part One Violent Crimes Forecast Program (Dataset 6) with 5,112 cases
  • Input Data for the Regression Forecast Program for 911 Drug Calls (Dataset 7) with 10,011 cases
  • Input Data for the Regression Forecast Program for Part One Property Crimes (Dataset 8) with 10,011 cases
  • Input Data for the Regression Forecast Program for Part One Violent Crimes (Dataset 9) with 10,011 cases
  • Output Data from Regression Forecast Program for 911 Drug Calls: Estimated Coefficients for Leading Indicator Models (Dataset 10) with 36 cases
  • Output Data from Regression Forecast Program for Part One Property Crimes: Estimated Coefficients for Leading Indicator Models (Dataset 11) with 36 cases
  • Output Data from Regression Forecast Program for Part One Violent Crimes: Estimated Coefficients for Leading Indicator Models (Dataset 12) with 36 cases
  • Output Data from Regression Forecast Program for 911 Drug Calls: Forecast Errors (Dataset 13) with 4,936 cases
  • Output Data from Regression Forecast Program for Part One Property Crimes: Forecast Errors (Dataset 14) with 4,936 cases
  • Output Data from Regression Forecast Program for Part One Violent Crimes: Forecast Errors (Dataset 15) with 4,936 cases.

The GIS shapefiles (Dataset 16) are provided with the study in a single zip file. Included are polygon data for the 4,000-foot, square, uniform grid system used for much of the Pittsburgh crime data (grid4000); polygon data for the 6 police precincts, alternatively called districts or zones, of Pittsburgh (policedist); and polygon data for the 3 major rivers in Pittsburgh: the Allegheny, Monongahela, and Ohio (rivers).

Gorr, Wilpen L., and Olligschlaeger, Andreas. Crime Hot Spot Forecasting with Data from the Pittsburgh [Pennsylvania] Bureau of Police, 1990-1998. Inter-university Consortium for Political and Social Research [distributor], 2015-08-07. https://doi.org/10.3886/ICPSR03469.v1

United States Department of Justice. Office of Justice Programs. National Institute of Justice (1998-IJ-CX-K005)

4,000-foot uniform grid cells

Inter-university Consortium for Political and Social Research

1990 -- 1998
1998

This study had two purposes: 1) to determine the best univariate forecast method for crime, and 2) to evaluate the value of leading indicator crime forecast models, using the best univariate forecast model as the benchmark of comparison.

This study design is based on the rationale that in order to be a candidate for practical use, a leading indicator model must forecast more accurately than the simpler, but best univariate model.

This study used the rolling-horizon experimental design, a design that maximizes the number of forecasts for a given time series at different times and under different conditions. Under this design, several forecast models are used to make alternative forecasts in parallel. For each forecast model included in an experiment, the researchers estimated models on training data, forecasted one month ahead to new data not previously seen by the model, and calculated and saved the forecast error. Then they added the observed value of the previously forecasted data point to the next month's training data, dropped the oldest historical data point, and forecasted the following month's data point. This process continued over a number of months.

For univariate forecast methods, the researchers used a five-year rolling horizon. For multivariate, leading indicator models estimated by least squares regression, the researchers used a three-year moving window. The researchers made forecasts over a 36-month period (January 1996 through December 1998) in order to generate an adequate sample size of forecast errors for statistical testing purposes.
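The rolling-horizon procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the researchers' program: the names `mean_forecast` and `rolling_horizon_errors`, the toy series, the use of a plain window mean as the forecast model, and the sign convention (actual minus forecast) are all hypothetical choices made for the example.

```python
from typing import Callable, List, Sequence


def mean_forecast(window: Sequence[float]) -> float:
    """Placeholder model (an assumption): forecast next month as the window mean."""
    return sum(window) / len(window)


def rolling_horizon_errors(
    series: Sequence[float],
    window: int,
    model: Callable[[Sequence[float]], float] = mean_forecast,
) -> List[float]:
    """Return one-step-ahead signed forecast errors (actual minus forecast).

    At each forecast origin the model sees only the most recent `window`
    observations. The window then slides forward one month: the newly
    observed value is added and the oldest value is dropped.
    """
    errors = []
    for t in range(window, len(series)):
        training = series[t - window:t]   # fixed-length training window
        forecast = model(training)        # one month ahead, unseen by the model
        errors.append(series[t] - forecast)
    return errors


# Toy monthly counts (made up). A 3-month window stands in for the
# 60-month (five-year) or 36-month (three-year) windows in the study.
counts = [10, 12, 14, 13, 15, 18]
errors = rolling_horizon_errors(counts, window=3)
print(errors)  # [1.0, 2.0, 4.0]
```

Passing a different callable as `model` would let several forecast methods be run in parallel over the same series, which is the point of the design.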

The researchers took the following steps:

  • They collected all offense reports and 911 computer-aided dispatch (CAD) calls from the Pittsburgh, Pennsylvania, Bureau of Police for the years 1990 through 1998.
  • They aggregated the crime space data and time series data.
  • They conducted two major sets of forecast experiments with these data: 1) a study based on precincts to determine the best univariate forecast method for crime, and 2) a study based on 4,000-foot uniform grid cells to evaluate the value of leading indicator forecast models, with the best univariate forecast model as the benchmark of comparison.
  • To compare forecast accuracy of competing univariate methods, they used pair-wise (matched comparisons) t-tests of forecasts for significance testing.
  • They used a form of Granger causality testing (Granger 1969) to determine the relative value of leading indicator models.
  • To develop benchmark accuracy measures, they first carefully optimized over univariate methods to get the most accurate forecasts (Gorr, Thompson, and Olligschlaeger 2000).
  • Rather than assess accuracy based on the performance of individual point forecasts for each grid cell, they examined forecast performance within ranges of changes for both decreases and increases.
  • Using contingency tables, they contrasted forecasts and actual outcomes within each range and designated correct forecasts as true positives and true negatives, and incorrect forecasts as false negatives and false positives.
  • They applied pair-wise comparison t-tests within classes to determine if leading indicator forecasts were significantly better than univariate forecasts.
  • Within actual change categories, they identified the corresponding sets of actual and forecasted values. A univariate and a multivariate-leading-indicator forecast resulted for each point.
  • They computed the difference of squared or absolute forecast errors for each matched pair in the same change category.
  • To evaluate the relative performance of the multivariate method within a change category, they asked whether the mean error over all matched pairs in the category was significantly different from zero. If they subtracted the univariate absolute error from the multivariate absolute error, then a mean error that is significantly different from zero in a negative direction would indicate that the multivariate forecast is more accurate (i.e., has smaller forecast errors).
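The matched-pair comparison in the last two steps can be sketched as below. This is an illustrative calculation only: the function name `paired_t_statistic` and the sample error values are made up, and the sketch stops at the t statistic and degrees of freedom rather than looking up a p-value.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    multivariate_abs_errors: Sequence[float],
    univariate_abs_errors: Sequence[float],
) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for matched error pairs.

    Differences are multivariate minus univariate, so a negative t
    points toward smaller errors for the multivariate forecasts.
    """
    if len(multivariate_abs_errors) != len(univariate_abs_errors):
        raise ValueError("error lists must be matched pair for pair")
    diffs = [m - u for m, u in zip(multivariate_abs_errors, univariate_abs_errors)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two matched pairs")
    standard_error = stdev(diffs) / math.sqrt(n)
    if standard_error == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / standard_error, n - 1


# Made-up absolute errors for the matched pairs in one change category.
multi = [1.0, 2.0, 1.0, 3.0]
uni = [2.0, 4.0, 1.0, 5.0]
t_stat, dof = paired_t_statistic(multi, uni)
print(round(t_stat, 3), dof)
```

The resulting statistic would then be compared against a t distribution with `dof` degrees of freedom to judge whether the mean difference is significantly below zero.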

Crime counts from all offense reports and 911 computer-aided dispatch (CAD) calls in electronic form from the Pittsburgh, Pennsylvania, Bureau of Police for the years 1990 through 1998.

Time Series

Crime counts as reported in offense reports and 911 computer-aided dispatch (CAD) call records from the Pittsburgh, Pennsylvania, Bureau of Police

Aggregate crime counts by crime type, area, and time period

Offense reports and 911 computer-aided dispatch (CAD) call records from the Pittsburgh, Pennsylvania, Bureau of Police for the years 1990 through 1998.

The Univariate Forecast Data by Police Precinct (Dataset 1) contain 11 variables comprised of 1 unique identification variable, 2 variables indicating time (month, year), 1 aggregate crime code variable, and 7 crime count variables (1 variable for each of the 6 police precincts in the city of Pittsburgh, plus 1 variable for the city of Pittsburgh as a whole).

The Output Data from the Univariate Forecasting Program: Sectors and Forecast Errors (Dataset 2) contain 17 variables comprised of 1 unique identification variable, 2 time variables (month, year), 1 police precinct variable, 2 crime count variables (actual and forecasted), 1 multiplicative seasonal index variable, 5 forecast error variables (signed, absolute, squared, absolute percentage, and modified absolute percentage), 3 weight variables (exponential smoothing, Holt exponential smoothing level, and Holt exponential smoothing slope), 1 forecast method code variable, and 1 short aggregate crime code variable.

The Multivariate, Leading Indicator Forecast Data by Grid Cell (Dataset 3) contain 213 variables comprised of 1 unique identification variable, 2 time variables (month, year), and 209 variables specifying crime counts for individual cells of the grid overlay of Pittsburgh.

The Output Data from 911 Drug Calls Forecast Program (Dataset 4) contain 17 variables comprised of 1 unique identification variable, 2 time variables (month, year), 1 police precinct variable, 2 crime count variables (actual and forecasted), 1 multiplicative seasonal index variable, 5 forecast error variables (signed, absolute, squared, absolute percentage, and modified absolute percentage), 3 weight variables (exponential smoothing, Holt exponential smoothing level, and Holt exponential smoothing slope), 1 forecast method code variable, and 1 short aggregate crime code variable.

The Output Data from Part One Property Crimes Forecast Program (Dataset 5) contain 17 variables comprised of 1 unique identification variable, 2 time variables (month, year), 1 police precinct variable, 2 crime count variables (actual and forecasted), 1 multiplicative seasonal index variable, 5 forecast error variables (signed, absolute, squared, absolute percentage, and modified absolute percentage), 3 weight variables (exponential smoothing, Holt exponential smoothing level, and Holt exponential smoothing slope), 1 forecast method code variable, and 1 short aggregate crime code variable.

The Output Data from Part One Violent Crimes Forecast Program (Dataset 6) contain 17 variables comprised of 1 unique identification variable, 2 time variables (month, year), 1 police precinct variable, 2 crime count variables (actual and forecasted), 1 multiplicative seasonal index variable, 5 forecast error variables (signed, absolute, squared, absolute percentage, and modified absolute percentage), 3 weight variables (exponential smoothing, Holt exponential smoothing level, and Holt exponential smoothing slope), 1 forecast method code variable, and 1 short aggregate crime code variable.
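To make the weight and error variables in these output files more concrete, the sketch below shows the textbook form of Holt's linear-trend exponential smoothing together with four of the listed error measures. It is a hedged illustration, not the forecasting program itself: the function names, the initialization rule, and the sign convention (actual minus forecast) are assumptions, the multiplicative seasonal index is omitted, and the modified absolute percentage error is left out because its formula is not given in this description.

```python
import math
from typing import Dict, Sequence


def holt_one_step_forecast(
    series: Sequence[float], level_weight: float, slope_weight: float
) -> float:
    """One-step-ahead forecast from Holt's linear-trend exponential smoothing.

    Initialization (an assumption): level = first value,
    slope = second value minus first value.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    slope = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step projection.
        level = level_weight * y + (1 - level_weight) * (previous_level + slope)
        # Blend the latest change in level with the previous slope.
        slope = slope_weight * (level - previous_level) + (1 - slope_weight) * slope
    return level + slope


def forecast_errors(actual: float, forecast: float) -> Dict[str, float]:
    """Signed, absolute, squared, and absolute percentage errors."""
    signed = actual - forecast  # sign convention is an assumption
    ape = 100 * abs(signed) / actual if actual != 0 else math.nan
    return {
        "signed": signed,
        "absolute": abs(signed),
        "squared": signed ** 2,
        "absolute_percentage": ape,
    }


# A perfectly linear toy series is projected one more step along its trend.
forecast = holt_one_step_forecast([2, 4, 6, 8], level_weight=0.5, slope_weight=0.5)
print(forecast)                        # 10.0
print(forecast_errors(8.0, forecast))  # signed -2.0, absolute 2.0, squared 4.0, APE 25.0
```

The absolute percentage error is undefined when the actual count is zero, which the sketch reports as NaN.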

The Input Data for the Regression Forecast Program for 911 Drug Calls (Dataset 7) contain 41 variables comprised of 1 unique identification variable, 1 grid cell number variable, 2 time variables (month, year), 1 dependent variable (911 drug calls), 1 dependent variable that has been lagged by one month, 1 dependent variable that has been lagged by one month and averaged over neighboring grid cells, 17 leading indicator variables (criminal mischief 911 call, shots fired 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, drug offense, drinking offense, field contact report, liquor law violations, prostitution offense, public drunkenness offense, stolen property offense, trespass offense, and weapons offense), and 17 leading indicator variables averaged over neighboring grid cells (criminal mischief 911 call, shots fired 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, drug offense, drinking offense, field contact report, liquor law violations, prostitution offense, public drunkenness offense, stolen property offense, trespass offense, and weapons offense).

The Input Data for the Regression Forecast Program for Part One Property Crimes (Dataset 8) contain 33 variables comprised of 1 unique identification variable, 1 grid cell number variable, 2 time variables (month, year), 1 dependent variable (911 property crime calls), 1 dependent variable that has been lagged by one month, 1 dependent variable that has been lagged by one month and averaged over neighboring grid cells, 13 leading indicator variables (criminal mischief 911 call, drug 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, drug offense, field contact report, liquor law violations, stolen property offense, trespass offense, and weapons offense), and 13 leading indicator variables averaged over neighboring grid cells (criminal mischief 911 call, drug 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, drug offense, field contact, liquor law violation, stolen property offense, trespass offense, and weapons offense).

The Input Data for the Regression Forecast Program for Part One Violent Crimes (Dataset 9) contain 45 variables comprised of 1 unique identification variable, 1 grid cell number variable, 2 time variables (month, year), 1 dependent variable (911 violent crime calls), 1 dependent variable that has been lagged by one month, 1 dependent variable that has been lagged by one month and averaged over neighboring grid cells, 19 leading indicator variables (criminal mischief 911 call, domestic 911 call, drug 911 call, public disorder 911 call, shots fired 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, domestic offense, drug offense, field contact report, liquor law violations, prostitution offense, public drunkenness offense, simple assault offense, trespass offense, and weapons offense), and 19 leading indicator variables averaged over neighboring grid cells (criminal mischief 911 call, domestic 911 call, drug 911 call, public disorder 911 call, shots fired 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, domestic offense, drug offense, field contact, liquor law violation, prostitution offense, public drunkenness offense, simple assault offense, trespass offense, and weapons violation offense).
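The regression input files pair each grid cell's own one-month lag with a one-month lag averaged over neighboring cells. A minimal sketch of how such features could be derived is shown below. The data layout, the name `lagged_features`, and the neighbor map are hypothetical, and this description does not say how neighboring cells were defined, so the adjacency used here is purely illustrative.

```python
from typing import Dict, List, Tuple

# (grid cell id, month index) -> count; this layout is an assumption.
Counts = Dict[Tuple[int, int], float]


def lagged_features(
    counts: Counts,
    neighbors: Dict[int, List[int]],
    cell: int,
    month: int,
) -> Tuple[float, float]:
    """Return (own count lagged one month, neighbor average lagged one month)."""
    own_lag = counts[(cell, month - 1)]
    neighbor_values = [counts[(n, month - 1)] for n in neighbors[cell]]
    neighbor_lag = sum(neighbor_values) / len(neighbor_values)
    return own_lag, neighbor_lag


# Hypothetical three-cell strip: cell 2 sits between cells 1 and 3.
neighbors = {1: [2], 2: [1, 3], 3: [2]}
counts = {
    (1, 0): 4.0, (2, 0): 6.0, (3, 0): 2.0,
    (1, 1): 5.0, (2, 1): 7.0, (3, 1): 1.0,
}
own, spatial = lagged_features(counts, neighbors, cell=2, month=1)
print(own, spatial)  # 6.0 3.0
```

In practice the neighbor map would more likely be built from the grid shapefile geometry than written by hand.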

The Output Data from Regression Forecast Program for 911 Drug Calls: Estimated Coefficients for Leading Indicator Models (Dataset 10) contain 30 variables comprised of 1 unique identification variable, 1 label of model variable, 1 type of statistics variable, 2 dependent variable identifiers (string and numeric), 1 root mean square error variable, 23 estimated regression coefficient variables (for intercept term, criminal mischief 911 call, spatially lagged criminal mischief 911 call, public disorder 911 call, spatially lagged public disorder 911 call, shots fired 911 call, spatially lagged shots fired 911 call, truancy 911 call, spatially lagged truancy 911 call, vice 911 call, spatially lagged vice 911 call, weapons 911 call, spatially lagged weapons 911 call, disorderly conduct offense, spatially lagged disorderly conduct offense, liquor law violation, spatially lagged liquor law violation, prostitution offense, spatially lagged prostitution offense, drinking offense, spatially lagged drinking offense, trespass offense, and spatially lagged trespass offense), and 1 join key variable for linking with other tables.

The Output Data from Regression Forecast Program for Part One Property Crimes: Estimated Coefficients for Leading Indicator Models (Dataset 11) contain 24 variables comprised of 1 unique identification variable, 1 label of model variable, 1 type of statistics variable, 2 dependent variable identifiers (string and numeric), 1 root mean square error variable, 17 estimated regression coefficient variables (for intercept term, drugs 911 call, spatially lagged drugs 911 call, truancy 911 call, spatially lagged truancy 911 call, vice 911 call, spatially lagged vice 911 call, criminal mischief offense, spatially lagged criminal mischief offense, disorderly conduct offense, spatially lagged disorderly conduct offense, liquor law violation, spatially lagged liquor law violation, trespass offense, spatially lagged trespass offense, weapons offense, and spatially lagged weapons offense), and 1 join key variable for linking with other tables.

The Output Data from Regression Forecast Program for Part One Violent Crimes: Estimated Coefficients for Leading Indicator Models (Dataset 12) contain 34 variables comprised of 1 unique identification variable, 1 label of model variable, 1 type of statistics variable, 2 dependent variable identifiers (string and numeric), 1 root mean square error variable, 27 estimated regression coefficient variables (for intercept term, domestic 911 call, spatially lagged domestic 911 call, drugs 911 call, spatially lagged drugs 911 call, public disorder 911 call, spatially lagged public disorder 911 call, shots fired 911 call, spatially lagged shots fired 911 call, vice 911 call, spatially lagged vice 911 call, weapons 911 call, spatially lagged weapons 911 call, criminal mischief offense, spatially lagged criminal mischief offense, disorderly conduct offense, spatially lagged disorderly conduct offense, liquor law violation, spatially lagged liquor law violation, prostitution offense, spatially lagged prostitution offense, public drunkenness offense, spatially lagged public drunkenness offense, simple assault offense, spatially lagged simple assault offense, trespass offense, and spatially lagged trespass offense), and 1 join key variable for linking with other tables.

The Output Data from Regression Forecast Program for 911 Drug Calls: Forecast Errors (Dataset 13) contain 9 variables comprised of 1 unique identification variable, 1 grid cell indicator variable, 2 variables indicating time (month, year), 1 actual drug crime count variable, 1 forecasted drug count variable, 1 signed forecast error variable, 1 absolute forecast error variable, and 1 squared forecast error variable.

The Output Data from Regression Forecast Program for Part One Property Crimes: Forecast Errors (Dataset 14) contain 13 variables comprised of 1 unique identification variable, 1 grid cell indicator variable, 2 variables indicating time (month, year), 1 actual property crime count variable, 1 forecasted property count variable, 1 level variable, 1 change variable, 1 pcheck variable, 1 signed forecast error variable, 1 absolute forecast error variable, 1 squared forecast error variable, and 1 modified absolute percentage error variable.

The Output Data from Regression Forecast Program for Part One Violent Crimes: Forecast Errors (Dataset 15) contain 10 variables comprised of 1 unique identification variable, 1 grid cell indicator variable, 2 variables indicating time (month, year), 1 actual violent crime count variable, 1 forecasted violent count variable, 1 signed forecast error variable, 1 absolute forecast error variable, 1 squared forecast error variable, and 1 modified absolute percentage error variable.

The Geographic Information Systems (GIS) Data consist of 3 shapefiles.

  • grid4000.shp: The 4,000 foot, square, uniform grid system used for much of the Pittsburgh crime data.
  • policedist.shp: The six police precincts (districts or zones) of Pittsburgh.
  • rivers.shp: The three major rivers in Pittsburgh (the Allegheny, Monongahela, and Ohio).

Not applicable.

none.

2015-08-07

2018-02-15 The citation of this study may have changed due to the new version control system that has been implemented. The previous citation was:
  • Gorr, Wilpen L., and Andreas Olligschlaeger. Crime Hot Spot Forecasting with Data from the Pittsburgh [Pennsylvania] Bureau of Police, 1990-1998. ICPSR03469-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2015-08-07. http://doi.org/10.3886/ICPSR03469.v1

2015-08-07 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

  • Created variable labels and/or value labels.

Notes

  • The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.

This dataset is maintained and distributed by the National Archive of Criminal Justice Data (NACJD), the criminal justice archive within ICPSR. NACJD is primarily sponsored by three agencies within the U.S. Department of Justice: the Bureau of Justice Statistics, the National Institute of Justice, and the Office of Juvenile Justice and Delinquency Prevention.