Crime Hot Spot Forecasting with Data from the Pittsburgh [Pennsylvania] Bureau of Police, 1990-1998 (ICPSR 3469)

Principal Investigator(s): Gorr, Wilpen, Carnegie Mellon University; Olligschlaeger, Andreas, Carnegie Mellon University

Summary:

This study used crime count data from the Pittsburgh, Pennsylvania, Bureau of Police offense reports and 911 computer-aided dispatch (CAD) calls to determine the best univariate forecast method for crime and to evaluate the value of leading indicator crime forecast models.

The researchers used the rolling-horizon experimental design, a design that maximizes the number of forecasts for a given time series at different times and under different conditions. Under this design, several forecast models are used to make alternative forecasts in parallel. For each forecast model included in an experiment, the researchers estimated models on training data, forecasted one month ahead to new data not previously seen by the model, and calculated and saved the forecast error. Then they added the observed value of the previously forecasted data point to the next month's training data, dropped the oldest historical data point, and forecasted the following month's data point. This process continued over a number of months.

A total of 15 statistical datasets and 3 geographic information systems (GIS) shapefiles resulted from this study.

The statistical datasets consist of:

  • Univariate Forecast Data by Police Precinct (Dataset 1) with 3,240 cases
  • Output Data from the Univariate Forecasting Program: Sectors and Forecast Errors (Dataset 2) with 17,892 cases
  • Multivariate, Leading Indicator Forecast Data by Grid Cell (Dataset 3) with 5,940 cases
  • Output Data from the 911 Drug Calls Forecast Program (Dataset 4) with 5,112 cases
  • Output Data from the Part One Property Crimes Forecast Program (Dataset 5) with 5,112 cases
  • Output Data from the Part One Violent Crimes Forecast Program (Dataset 6) with 5,112 cases
  • Input Data for the Regression Forecast Program for 911 Drug Calls (Dataset 7) with 10,011 cases
  • Input Data for the Regression Forecast Program for Part One Property Crimes (Dataset 8) with 10,011 cases
  • Input Data for the Regression Forecast Program for Part One Violent Crimes (Dataset 9) with 10,011 cases
  • Output Data from Regression Forecast Program for 911 Drug Calls: Estimated Coefficients for Leading Indicator Models (Dataset 10) with 36 cases
  • Output Data from Regression Forecast Program for Part One Property Crimes: Estimated Coefficients for Leading Indicator Models (Dataset 11) with 36 cases
  • Output Data from Regression Forecast Program for Part One Violent Crimes: Estimated Coefficients for Leading Indicator Models (Dataset 12) with 36 cases
  • Output Data from Regression Forecast Program for 911 Drug Calls: Forecast Errors (Dataset 13) with 4,936 cases
  • Output Data from Regression Forecast Program for Part One Property Crimes: Forecast Errors (Dataset 14) with 4,936 cases
  • Output Data from Regression Forecast Program for Part One Violent Crimes: Forecast Errors (Dataset 15) with 4,936 cases.
  • The GIS Shapefiles (Dataset 16) are provided with the study in a single zip file. Included are polygon data for the 4,000-foot, square, uniform grid system used for much of the Pittsburgh crime data (grid400); polygon data for the 6 police precincts, alternatively called districts or zones, of Pittsburgh (policedist); and polygon data for the 3 major rivers in Pittsburgh, the Allegheny, Monongahela, and Ohio (rivers).

Access Notes

  • The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.

Dataset(s)

DS0:  Study-Level Files
DS1:  Univariate Forecast Data by Police Precinct (3.162 MB)
DS2:  Output Data from the Univariate Forecasting Program: Sectors and Forecast Errors (12.632 MB)
DS3:  Multivariate, Leading Indicator Forecast Data by Grid Cell (22.415 MB)
DS4:  Output Data from the 911 Drug Calls Forecast Program (5.233 MB)
DS5:  Output Data from the Part One Property Crimes Forecast Program (5.365 MB)
DS6:  Output Data from the Part One Violent Crimes Forecast Program (5.267 MB)
DS7:  Input Data for the Regression Forecast Program for 911 Drug Calls (12.936 MB)
DS8:  Input Data for the Regression Forecast Program for Part One Property Crimes (11.05 MB)
DS9:  Input Data for the Regression Forecast Program for Part One Violent Crimes (14.244 MB)
DS10:  Output Data from Regression Forecast Program for 911 Drug Calls: Estimated Coefficients for Leading Indicator Models (2.863 MB)
DS11:  Output Data from Regression Forecast Program for Part One Property Crimes: Estimated Coefficients for Leading Indicator Models (2.73 MB)
DS12:  Output Data from Regression Forecast Program for Part One Violent Crimes: Estimated Coefficients for Leading Indicator Models (2.95 MB)
DS13:  Output Data from Regression Forecast Program for 911 Drug Calls: Forecast Errors (4.196 MB)
DS14:  Output Data from Regression Forecast Program for Part One Property Crimes: Forecast Errors (5.417 MB)
DS15:  Output Data from Regression Forecast Program for Part One Violent Crimes: Forecast Errors (4.452 MB)
DS16:  GIS Shapefiles (0.695 MB)

DS1 through DS15 are each available in SAS, SPSS, Stata, R, ASCII, and Excel/TSV formats, with SAS, SPSS, and Stata setup files for the ASCII version.

Study Description

Citation

Gorr, Wilpen, and Andreas Olligschlaeger. Crime Hot Spot Forecasting with Data from the Pittsburgh [Pennsylvania] Bureau of Police, 1990-1998. ICPSR03469-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2015-08-07. https://doi.org/10.3886/ICPSR03469.v1

Persistent URL: https://doi.org/10.3886/ICPSR03469.v1

Funding

This study was funded by:

  • United States Department of Justice. Office of Justice Programs. National Institute of Justice (1998-IJ-CX-K005)

Scope of Study

Subject Terms:    crime, crime mapping, crime patterns, forecasting models, geographic distribution, geographic information systems, mapping, police effectiveness, police records, prediction, trends

Smallest Geographic Unit:    4,000-foot uniform grid cells

Geographic Coverage:    Pennsylvania, Pittsburgh, United States

Time Period:   

  • 1990-1998

Date of Collection:   

  • 1998

Unit of Observation:    Aggregate crime counts by crime type, area, and time period

Universe:    Crime counts as reported in offense reports and 911 computer-aided dispatch (CAD) call records from the Pittsburgh, Pennsylvania, Bureau of Police

Data Type(s):    administrative records data, aggregate data, experimental data

Methodology

Study Purpose:   

This study had two purposes: 1) To determine the best univariate forecast method for crime, and 2) To evaluate the value of leading indicator crime forecast models using the best univariate forecast model as the benchmark of comparison.

This study design is based on the rationale that in order to be a candidate for practical use, a leading indicator model must forecast more accurately than the simpler, but best univariate model.

Study Design:   

This study used the rolling-horizon experimental design, a design that maximizes the number of forecasts for a given time series at different times and under different conditions. Under this design, several forecast models are used to make alternative forecasts in parallel. For each forecast model included in an experiment, the researchers estimated models on training data, forecasted one month ahead to new data not previously seen by the model, and calculated and saved the forecast error. Then they added the observed value of the previously forecasted data point to the next month's training data, dropped the oldest historical data point, and forecasted the following month's data point. This process continued over a number of months.

For univariate forecast methods, the researchers used a five-year rolling horizon. For multivariate, leading indicator models estimated by least squares regression, the researchers used a three-year moving window. The researchers made forecasts over a 36-month period (January 1996 through December 1998) in order to generate an adequate sample size of forecast errors for statistical testing purposes.
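
The rolling-horizon procedure described above can be sketched in a few lines. This is a minimal illustration, not the researchers' forecasting program: the function names, the choice of simple exponential smoothing as the forecast method, the smoothing weight of 0.3, the sign convention for the error, and the synthetic monthly series are all assumptions made for the example.

```python
def ses_forecast(history, alpha=0.3):
    """One-step-ahead forecast from simple exponential smoothing.

    alpha is an assumed smoothing weight, not a value from the study.
    """
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def rolling_horizon(series, window, forecast_fn):
    """Slide a fixed-length training window through the series.

    Returns one (index, forecast, actual, signed_error) tuple per forecast.
    signed_error is actual - forecast (an assumed convention).
    """
    results = []
    for t in range(window, len(series)):
        train = series[t - window:t]   # drop the oldest point, add the newest
        forecast = forecast_fn(train)  # one month ahead, unseen by the model
        actual = series[t]
        results.append((t, forecast, actual, actual - forecast))
    return results


# Synthetic monthly counts for 96 months; purely illustrative data.
counts = [100 + (m % 12) * 2 + m // 12 for m in range(96)]

# A 60-month (five-year) window leaves 36 one-month-ahead forecasts.
errors = rolling_horizon(counts, window=60, forecast_fn=ses_forecast)
print(len(errors))  # 36
```

Passing the forecast method in as a function mirrors the idea of running several forecast models in parallel over the same windows: each candidate method can be supplied to `rolling_horizon` in turn and its saved errors compared afterwards.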

The researchers took the following steps:

  • They collected all offense reports and 911 computer-aided dispatch (CAD) calls from the Pittsburgh, Pennsylvania, Bureau of Police for the years 1990 through 1998.
  • They aggregated the crime data by crime type, area, and time period to form time series.
  • They conducted two major sets of forecast experiments with these data: 1) a study based on precincts to determine the best univariate forecast method for crime and 2) a study based on 4,000-foot uniform grid cells to evaluate the value of leading indicator forecast models with the best univariate forecast model as the benchmark of comparison.
  • To compare forecast accuracy of competing univariate methods, they used pair-wise (matched comparisons) t-tests of forecasts for significance testing.
  • They used a form of Granger causality testing (Granger 1969) to determine the relative value of leading indicator models.
  • To develop benchmark accuracy measures, they first carefully optimized over univariate methods to get the most accurate forecasts (Gorr, Thompson, and Olligschlaeger 2000).
  • Rather than assess accuracy based on the performance of individual point forecasts for each grid cell, they examined forecast performance within ranges of changes for both decreases and increases.
  • Using contingency tables they contrasted forecasts and actual outcomes within each range and designated correct forecasts as true positives and true negatives, and incorrect forecasts as false negatives and false positives.
  • They applied pair-wise comparison t-tests within classes to determine if leading indicator forecasts were significantly better than univariate forecasts.
  • Within actual change categories, they identified the corresponding sets of actual and forecasted values. A univariate and a multivariate-leading-indicator forecast resulted for each point.
  • They computed the difference of squared or absolute forecast errors for each matched pair in the same change category.
  • To evaluate the relative performance of the multivariate method within a change category, they asked whether the mean difference in errors over all matched pairs in the category was significantly different from zero. If the univariate absolute error is subtracted from the multivariate absolute error, then a mean difference that is significantly below zero indicates that the multivariate forecast is more accurate (i.e., has smaller forecast errors).
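
The last few steps can be illustrated with a short sketch. It is a hedged example rather than the researchers' code: the function names, the half-open range used to define a change category, and all of the numbers are hypothetical. The first function tallies a contingency table for one range of change; the second computes the matched-pair t statistic on the difference of absolute errors (multivariate minus univariate).

```python
import math


def contingency(actual_changes, forecast_changes, low, high):
    """Count outcomes for the change range low <= change < high (assumed form)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, forecast in zip(actual_changes, forecast_changes):
        actual_in = low <= actual < high
        forecast_in = low <= forecast < high
        if actual_in and forecast_in:
            counts["TP"] += 1   # change forecasted and observed
        elif forecast_in:
            counts["FP"] += 1   # change forecasted but not observed
        elif actual_in:
            counts["FN"] += 1   # change observed but not forecasted
        else:
            counts["TN"] += 1   # neither forecasted nor observed
    return counts


def paired_t(multivariate_abs_err, univariate_abs_err):
    """Return (mean_difference, t_statistic, degrees_of_freedom).

    Assumes at least two pairs and differences that are not all identical.
    """
    diffs = [m - u for m, u in zip(multivariate_abs_err, univariate_abs_err)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t_stat = mean / math.sqrt(variance / n)
    return mean, t_stat, n - 1


# Hypothetical month-to-month changes for five grid cells.
table = contingency([5, 12, -3, 20, 8], [11, 15, 0, 4, 9], 10, float("inf"))

# Hypothetical absolute errors for six matched forecasts in one category.
multi = [2.0, 1.5, 3.0, 2.5, 1.0, 2.0]
uni = [3.0, 2.5, 3.5, 4.0, 1.5, 2.0]
mean_diff, t_stat, dof = paired_t(multi, uni)
print(table, mean_diff, dof)
```

A negative mean difference with a large negative t statistic would favor the multivariate forecast; the statistic would still need to be compared against a t distribution with the returned degrees of freedom.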

Sample:    Crime counts from all offense reports and 911 computer-aided dispatch (CAD) calls in electronic form from the Pittsburgh, Pennsylvania, Bureau of Police for the years 1990 through 1998.

Time Method:    Time Series

Mode of Data Collection:    record abstracts

Data Source:

Offense reports and 911 computer-aided dispatch (CAD) call records from the Pittsburgh, Pennsylvania, Bureau of Police for the years 1990 through 1998.

Description of Variables:   

The Univariate Forecast Data by Police Precinct (Dataset 1) contain 11 variables comprised of 1 unique identification variable, 2 variables indicating time (month, year), 1 aggregate crime code variable, and 7 crime count variables (1 variable for each of the 6 police precincts in the city of Pittsburgh, plus 1 variable for the city of Pittsburgh as a whole).

The Output Data from the Univariate Forecasting Program: Sectors and Forecast Errors (Dataset 2) contain 17 variables comprised of 1 unique identification variable, 2 time variables (month, year), 1 police precinct variable, 2 crime count variables (actual and forecasted), 1 multiplicative seasonal index variable, 5 forecast error variables (signed, absolute, squared, absolute percentage, and modified absolute percentage), 3 weight variables (exponential smoothing, Holt exponential smoothing level, and Holt exponential smoothing slope), 1 forecast method code variable, and 1 short aggregate crime code variable.
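
To make the weight and error variables concrete, here is a minimal sketch of Holt's linear exponential smoothing with a level weight and a slope weight, followed by four of the listed error measures. The initialization, the weights, the sign convention (actual minus forecast), and the example counts are assumptions; the study documentation should be consulted for the exact definitions, and the modified absolute percentage error is omitted because its formula is not given here.

```python
def holt_one_step(history, alpha, beta):
    """One-step-ahead forecast from Holt's linear exponential smoothing.

    alpha weights the level and beta weights the slope. The starting level
    and slope (first value, first difference) are an assumed initialization.
    """
    level = history[0]
    slope = history[1] - history[0]
    for value in history[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + slope)
        slope = beta * (level - previous_level) + (1 - beta) * slope
    return level + slope


def error_measures(actual, forecast):
    """Signed, absolute, squared, and absolute percentage error."""
    signed = actual - forecast  # assumed sign convention
    return {
        "signed": signed,
        "absolute": abs(signed),
        "squared": signed ** 2,
        # Undefined when the actual count is zero.
        "absolute_percentage": (
            100 * abs(signed) / abs(actual) if actual != 0 else None
        ),
    }


# A perfectly linear hypothetical series continues on its trend line.
forecast = holt_one_step([10, 12, 14, 16], alpha=0.5, beta=0.5)
measures = error_measures(actual=12, forecast=9)
print(forecast, measures)
```

The undefined percentage error at a zero count is one reason a modified percentage measure is useful for small-area crime counts, where zero months are common.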

The Multivariate, Leading Indicator Forecast Data by Grid Cell (Dataset 3) contain 213 variables comprised of 1 unique identification variable, 2 time variables (month, year), and 209 variables specifying crime counts for individual cells of the grid overlay of Pittsburgh.

The Output Data from the 911 Drug Calls Forecast Program (Dataset 4) contain 17 variables comprised of 1 unique identification variable, 2 time variables (month, year), 1 police precinct variable, 2 crime count variables (actual and forecasted), 1 multiplicative seasonal index variable, 5 forecast error variables (signed, absolute, squared, absolute percentage, and modified absolute percentage), 3 weight variables (exponential smoothing, Holt exponential smoothing level, and Holt exponential smoothing slope), 1 forecast method code variable, and 1 short aggregate crime code variable.

The Output Data from the Part One Property Crimes Forecast Program (Dataset 5) contain 17 variables comprised of 1 unique identification variable, 2 time variables (month, year), 1 police precinct variable, 2 crime count variables (actual and forecasted), 1 multiplicative seasonal index variable, 5 forecast error variables (signed, absolute, squared, absolute percentage, and modified absolute percentage), 3 weight variables (exponential smoothing, Holt exponential smoothing level, and Holt exponential smoothing slope), 1 forecast method code variable, and 1 short aggregate crime code variable.

The Output Data from the Part One Violent Crimes Forecast Program (Dataset 6) contain 17 variables comprised of 1 unique identification variable, 2 time variables (month, year), 1 police precinct variable, 2 crime count variables (actual and forecasted), 1 multiplicative seasonal index variable, 5 forecast error variables (signed, absolute, squared, absolute percentage, and modified absolute percentage), 3 weight variables (exponential smoothing, Holt exponential smoothing level, and Holt exponential smoothing slope), 1 forecast method code variable, and 1 short aggregate crime code variable.

The Input Data for the Regression Forecast Program for 911 Drug Calls (Dataset 7) contain 41 variables comprised of 1 unique identification variable, 1 grid cell number variable, 2 time variables (month, year), 1 dependent variable (911 drug calls), 1 dependent variable that has been lagged by one month, 1 dependent variable that has been lagged by one month and averaged over neighboring grid cells, 17 leading indicator variables (criminal mischief 911 call, shots fired 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, drug offense, drinking offense, field contact report, liquor law violations, prostitution offense, public drunkenness offense, stolen property offense, trespass offense, and weapons offense), and 17 leading indicator variables averaged over neighboring grid cells (criminal mischief 911 call, shots fired 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, drug offense, drinking offense, field contact report, liquor law violations, prostitution offense, public drunkenness offense, stolen property offense, trespass offense, and weapons offense).
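
Variables "averaged over neighboring grid cells" are spatial lags. The sketch below shows one way such an average could be computed on a rectangular grid. It assumes that neighbors are the up to eight adjacent cells (queen contiguity) and that the grid is a full rectangle; neither assumption is confirmed by the study description, and the real Pittsburgh grid follows the city boundary. The function name and the counts are hypothetical.

```python
def neighbor_average(grid, row, col):
    """Mean count over the cells adjacent to (row, col).

    Assumes queen contiguity: up to 8 neighbors, fewer at the grid edge.
    """
    values = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # skip the cell itself
            r, c = row + d_row, col + d_col
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                values.append(grid[r][c])
    return sum(values) / len(values) if values else 0.0


# Hypothetical monthly counts for a 3 x 3 block of grid cells.
grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(neighbor_average(grid, 1, 1))  # 5.0
```

Edge and corner cells average over fewer neighbors, so a corner cell in this example uses only three values.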

The Input Data for the Regression Forecast Program for Part One Property Crimes (Dataset 8) contain 33 variables comprised of 1 unique identification variable, 1 grid cell number variable, 2 time variables (month, year), 1 dependent variable (911 property crime calls), 1 dependent variable that has been lagged by one month, 1 dependent variable that has been lagged by one month and averaged over neighboring grid cells, 13 leading indicator variables (criminal mischief 911 call, drug 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, drug offense, field contact report, liquor law violations, stolen property offense, trespass offense, and weapons offense), and 13 leading indicator variables averaged over neighboring grid cells (criminal mischief 911 call, drug 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, drug offense, field contact report, liquor law violations, stolen property offense, trespass offense, and weapons offense).

The Input Data for the Regression Forecast Program for Part One Violent Crimes (Dataset 9) contain 45 variables comprised of 1 unique identification variable, 1 grid cell number variable, 2 time variables (month, year), 1 dependent variable (911 violent crime calls), 1 dependent variable that has been lagged by one month, 1 dependent variable that has been lagged by one month and averaged over neighboring grid cells, 19 leading indicator variables (criminal mischief 911 call, domestic 911 call, drug 911 call, public disorder 911 call, shots fired 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, domestic offense, drug offense, field contact report, liquor law violations, prostitution offense, public drunkenness offense, simple assault offense, trespass offense, and weapons offense), and 19 leading indicator variables averaged over neighboring grid cells (criminal mischief 911 call, domestic 911 call, drug 911 call, public disorder 911 call, shots fired 911 call, truancy 911 call, vice 911 call, weapons 911 call, criminal mischief offense, disorderly conduct offense, domestic offense, drug offense, field contact, liquor law violation, prostitution offense, public drunkenness offense, simple assault offense, trespass offense, and weapons violation offense).

The Output Data from Regression Forecast Program for 911 Drug Calls: Estimated Coefficients for Leading Indicator Models (Dataset 10) contain 30 variables comprised of 1 unique identification variable, 1 label of model variable, 1 type of statistics variable, 2 dependent variable identifiers (string and numeric), 1 root mean square error variable, 23 estimated regression coefficient variables (for intercept term, criminal mischief 911 call, spatially lagged criminal mischief 911 call, public disorder 911 call, spatially lagged public disorder 911 call, shots fired 911 call, spatially lagged shots fired 911 call, truancy 911 call, spatially lagged truancy 911 call, vice 911 call, spatially lagged vice 911 call, weapons 911 call, spatially lagged weapons 911 call, disorderly conduct offense, spatially lagged disorderly conduct offense, liquor law violation, spatially lagged liquor law violation, prostitution offense, spatially lagged prostitution offense, drinking offense, spatially lagged drinking offense, trespass offense, and spatially lagged trespass offense), and 1 join key variable for linking with other tables.

The Output Data from Regression Forecast Program for Part One Property Crimes: Estimated Coefficients for Leading Indicator Models (Dataset 11) contain 24 variables comprised of 1 unique identification variable, 1 label of model variable, 1 type of statistics variable, 2 dependent variable identifiers (string and numeric), 1 root mean square error variable, 17 estimated regression coefficient variables (for intercept term, drugs 911 call, spatially lagged drugs 911 call, truancy 911 call, spatially lagged truancy 911 call, vice 911 call, spatially lagged vice 911 call, criminal mischief offense, spatially lagged criminal mischief offense, disorderly conduct offense, spatially lagged disorderly conduct offense, liquor law violation, spatially lagged liquor law violation, trespass offense, spatially lagged trespass offense, weapons offense, and spatially lagged weapons offense), and 1 join key variable for linking with other tables.

The Output Data from Regression Forecast Program for Part One Violent Crimes: Estimated Coefficients for Leading Indicator Models (Dataset 12) contain 34 variables comprised of 1 unique identification variable, 1 label of model variable, 1 type of statistics variable, 2 dependent variable identifiers (string and numeric), 1 root mean square error variable, 27 estimated regression coefficient variables (for intercept term, domestic 911 call, spatially lagged domestic 911 call, drugs 911 call, spatially lagged drugs 911 call, public disorder 911 call, spatially lagged public disorder 911 call, shots fired 911 call, spatially lagged shots fired 911 call, vice 911 call, spatially lagged vice 911 call, weapons 911 call, spatially lagged weapons 911 call, criminal mischief offense, spatially lagged criminal mischief offense, disorderly conduct offense, spatially lagged disorderly conduct offense, liquor law violation, spatially lagged liquor law violation, prostitution offense, spatially lagged prostitution offense, public drunkenness offense, spatially lagged public drunkenness offense, simple assault offense, spatially lagged simple assault offense, trespass offense, and spatially lagged trespass offense), and 1 join key variable for linking with other tables.
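
The coefficient datasets hold an intercept, one coefficient per leading indicator and per spatially lagged indicator, and a root mean square error. A minimal least squares sketch that produces the same kinds of quantities is shown below. It is written in plain Python for illustration and is not the researchers' regression program: the function names, the two hypothetical predictors, the data values, and the choice to divide by the number of observations when computing the root mean square error are all assumptions.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def fit_ols(predictors, target):
    """Return (coefficients, rmse); coefficients[0] is the intercept."""
    rows = [[1.0] + list(p) for p in predictors]  # prepend intercept column
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    residuals = [
        y - sum(b * v for b, v in zip(coef, r)) for r, y in zip(rows, target)
    ]
    rmse = math.sqrt(sum(e * e for e in residuals) / len(residuals))
    return coef, rmse


# Hypothetical rows: (indicator last month, its neighbor average last month).
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
# Hypothetical counts generated exactly as 3 + 2 * x1 + 0.5 * x2.
target = [6.0, 7.5, 11.0, 12.5, 16.0]
coefficients, rmse = fit_ols(predictors, target)
print([round(b, 6) for b in coefficients], round(rmse, 6))
```

In a rolling-horizon setting, a fit like this would be repeated for each forecast month on the most recent window of training rows, which is consistent with the coefficient datasets holding 36 cases each.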

The Output Data from Regression Forecast Program for 911 Drug Calls: Forecast Errors (Dataset 13) contain 9 variables comprised of 1 unique identification variable, 1 grid cell indicator variable, 2 variables indicating time (month, year), 1 actual drug crime count variable, 1 forecasted drug count variable, 1 signed forecast error variable, 1 absolute forecast error variable, and 1 squared forecast error variable.

The Output Data from Regression Forecast Program for Part One Property Crimes: Forecast Errors (Dataset 14) contain 13 variables comprised of 1 unique identification variable, 1 grid cell indicator variable, 2 variables indicating time (month, year), 1 actual property crime count variable, 1 forecasted property count variable, 1 level variable, 1 change variable, 1 pcheck variable, 1 signed forecast error variable, 1 absolute forecast error variable, 1 squared forecast error variable, and 1 modified absolute percentage error variable.

The Output Data from Regression Forecast Program for Part One Violent Crimes: Forecast Errors (Dataset 15) contain 10 variables comprised of 1 unique identification variable, 1 grid cell indicator variable, 2 variables indicating time (month, year), 1 actual violent crime count variable, 1 forecasted violent count variable, 1 signed forecast error variable, 1 absolute forecast error variable, 1 squared forecast error variable, and 1 modified absolute percentage error variable.

The Geographic Information Systems (GIS) Data consist of 3 shapefiles.

  • grid4000.shp: The 4,000 foot, square, uniform grid system used for much of the Pittsburgh crime data.
  • policedist.shp: The six police precincts (districts or zones) of Pittsburgh.
  • rivers.shp: The three major rivers in Pittsburgh (the Allegheny, Monongahela, and Ohio).

Response Rates:    Not applicable.

Presence of Common Scales:    none.

Extent of Processing:   ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

  • Created variable labels and/or value labels.

Version(s)

Original ICPSR Release:   2015-08-07
