Using Sentiment Analysis and Topic Modeling in Assessing the Impact of Police "Signaling" on Investigative and Prosecutorial Outcomes in Sexual Assault Reports, Cleveland, Ohio, 1993-2011 (ICPSR 38644)

Version Date: Dec 16, 2025

Principal Investigator(s):
Rachel E. Lovell, Cleveland State University; Daniel J. Flannery, Case Western Reserve University

https://doi.org/10.3886/ICPSR38644.v1

Version V1

The Cuyahoga County (Ohio, USA) Sexual Assault Kit (SAK) Initiative, led by the Cuyahoga County Prosecutor's Office, was launched in 2013 to test and follow up on previously untested sexual assault/rape kits that were collected as evidence from sexual assault victims. Rape reports typically include an incident report taken by the responding officer(s) who is tasked with gathering pertinent facts and evidence and then forwarding the report to an investigator (detective) for follow-up, as well as a summary of the investigative activity on the case as noted by the investigator, which can include the decision of the assigned prosecutor to file or not file charges.

Signaling is defined as information conveyed by responding officers in the narratives of police reports regarding a victim's credibility and rape-myth adherence. The goal was to better understand if and how responding officers' written reports in sexual assault cases impact investigating officers' decision-making and how cases proceed (or fail to proceed) in the criminal justice process. The objective of the study was to explore the first step in the investigative process to elucidate facilitators and barriers to sexual assault cases reaching a successful disposition.

The research team employed text mining and machine learning methods using natural language processing and advanced statistical analyses to evaluate the narratives of 5,638 police reports of sexual assaults where victims had sexual assault kits collected in Cuyahoga County over the span of nearly two decades (primarily 1993 through 2011). These reports were analyzed using topic modeling and sentiment analysis. The team addressed three research questions:

  1. To what extent did "sentiments" in the responding officers' narratives reveal positive or negative signaling of victims' credibility?
  2. To what extent were the "topics" and sentiments in the responding officers' narratives different in cases with increased investigative activity compared to those with less?
  3. To what extent were both the topics and sentiments in the responding officers' narratives different for cases that were successfully investigated and prosecuted compared to those that were not?

This collection includes a quantitative dataset (DS1) and a qualitative dataset (DS2). The report-level quantitative dataset contains calculated sentiment scores, categorical variables describing the incident and outcome, and demographic variables of the victim(s) and suspect(s) for all reports analyzed (n=5,639). The full text for all reports is available in a CSV file that can be merged with the main data file. The qualitative data are a subset of reports from the main dataset (n=18) with high, medium, and low sentiment scores that were manually coded by the research team.
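
The merge of the full-text CSV file with the main data file might look like the following minimal sketch. The column names (REPORT_ID, SENTIMENT, REPORT_TEXT) and the sample rows are hypothetical placeholders, since this description does not list the actual variable names; the shared key should be taken from the codebook.

```python
import csv
import io

# Hypothetical file contents. The real DS1 column names are not given in this
# description, so REPORT_ID, SENTIMENT, and REPORT_TEXT are placeholders.
MAIN_CSV = """REPORT_ID,SENTIMENT
1,0.12
2,-0.30
3,0.05
"""
TEXT_CSV = """REPORT_ID,REPORT_TEXT
1,first narrative
2,second narrative
"""

def merge_on_key(main_file, text_file, key="REPORT_ID"):
    """Left-join the text rows onto the main rows using a shared key column."""
    text_by_key = {row[key]: row for row in csv.DictReader(text_file)}
    merged = []
    for row in csv.DictReader(main_file):
        combined = dict(row)
        # Keep every main-file row; add text columns only when a match exists.
        for col, value in text_by_key.get(row[key], {}).items():
            if col != key:
                combined[col] = value
        merged.append(combined)
    return merged

merged_rows = merge_on_key(io.StringIO(MAIN_CSV), io.StringIO(TEXT_CSV))
unmatched = [r["REPORT_ID"] for r in merged_rows if "REPORT_TEXT" not in r]
print(len(merged_rows), "rows;", len(unmatched), "without text")
```

Listing the unmatched keys after the merge is a quick check that the two files cover the same set of reports.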

Lovell, Rachel E., and Flannery, Daniel J. Using Sentiment Analysis and Topic Modeling in Assessing the Impact of Police “Signaling” on Investigative and Prosecutorial Outcomes in Sexual Assault Reports, Cleveland, Ohio, 1993-2011. Inter-university Consortium for Political and Social Research [distributor], 2025-12-16. https://doi.org/10.3886/ICPSR38644.v1

United States Department of Justice. Office of Justice Programs. National Institute of Justice (2018-VA-CX-0002)

Due to the sensitive nature of the data and to protect respondent confidentiality, the data are restricted from general dissemination. They may only be accessed at the ICPSR Data Enclave in Ann Arbor, MI. Users wishing to view these data must complete an Application for Use of the ICPSR Data Enclave and receive permission to analyze the files before traveling to Ann Arbor. More general information about the Enclave may be found at ICPSR's Data Enclaves website.

Inter-university Consortium for Political and Social Research

1993 -- 2011
  1. Users will note that the full report text in the quantitative data (DS1) has been partially de-identified. The research team performed this work prior to depositing at ICPSR.

  2. The Sexual Assault Kit Initiative's focus was on previously untested kits from 1993 through 2011. However, the data include some reports prior to 1993 and after 2011. If an offender was linked to a sexual assault outside of the 1993-2011 time frame, the report was included with the untested kit's investigation and prosecution.

The purpose of this study was to identify signaling in narratives of police officers' sexual assault reports that might have affected subsequent decision-making, case flow, and attrition.

The research team extracted Cleveland Division of Police rape reports in PDF format from the Cuyahoga County Prosecutor's Office electronic management database, which produced a list of 6,353 potential reports. Of the total reports that were eligible for extraction, 6,071 (95.6 percent) were extracted. Reasons for not extracting included duplication, erroneous labeling, technical issues with extraction, or association with a very small number of victims who reported a large number of rapes determined to be unfounded.

Reports were then converted from PDF to text using PDFMiner and optical character recognition programs. A portion of files (n=320) at this stage required manual conversion via typing or using Dragon NaturallySpeaking. These were files with parts of the investigative narrative missing, handwritten reports, or narratives that did not correctly convert to text. A total of 5,638 reports were eligible to proceed to quality control, where a research team member manually reviewed each text file for inaccuracies.
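
The counts quoted above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures stated in the text; the helper name pct is illustrative, and treating the 320 manually converted files as a subset of the 6,071 extracted reports is an assumption.

```python
# Counts quoted in the text above.
potential = 6353   # potential reports listed by the prosecutor's office database
extracted = 6071   # reports successfully extracted
manual = 320       # files that required manual conversion
to_qc = 5638       # reports eligible to proceed to quality control

def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

extraction_rate = pct(extracted, potential)   # matches the stated 95.6 percent
not_extracted = potential - extracted         # reports that were not extracted
# ASSUMPTION: the 320 manually converted files are a subset of the 6,071.
manual_share = pct(manual, extracted)
qc_share = pct(to_qc, extracted)              # share of extracted reports reaching QC

print(extraction_rate, not_extracted, manual_share, qc_share)
```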

Researchers manually coded case outcomes based on the report's closing language and information from each report's discrete fields (e.g., dates, victim name, suspect name, location of incident). Textual analysis was limited to the incident reports taken by responding officers. The research team trained machine-learning methods to classify the police reports and assign sentiment scores. For context and validation, the team also conducted human-detected sentiment and thematic analyses on portions of the data. For each sentiment measure (subjectivity, polarity, and overall sentiment), two reports were selected at random for hand-coding from each of three lists: the 20 highest-scoring reports, the reports at the median, and the 20 lowest-scoring reports.
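
The selection of reports for hand-coding can be sketched as follows. This is an illustration under stated assumptions, not the research team's code: the scores are synthetic, the function name pick_for_hand_coding is hypothetical, and the "median" list is assumed to be the 20 reports centered on the median rank, which the description does not define precisely.

```python
import random

MEASURES = ("subjectivity", "polarity", "sentiment")

def pick_for_hand_coding(scores, rng, group_size=20, per_group=2):
    """Pick reports from the top, middle, and bottom of one score ranking.

    scores: dict mapping report id -> score for a single measure.
    ASSUMPTION: the "median" group is the group_size reports centered on
    the median rank; the source description does not define it precisely.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    mid_start = (len(ranked) - group_size) // 2
    groups = {
        "high": ranked[:group_size],
        "median": ranked[mid_start:mid_start + group_size],
        "low": ranked[-group_size:],
    }
    return {name: rng.sample(ids, per_group) for name, ids in groups.items()}

# Synthetic scores for illustration only; not study data.
rng = random.Random(0)
fake_scores = {
    m: {report_id: rng.uniform(-1, 1) for report_id in range(1000)}
    for m in MEASURES
}
selection = {m: pick_for_hand_coding(fake_scores[m], rng) for m in MEASURES}
total_selected = sum(
    len(ids) for groups in selection.values() for ids in groups.values()
)
print(total_selected)  # 3 measures x 3 groups x 2 reports = 18
```

Three measures, three lists per measure, and two reports per list give 18 selections, which agrees with the n=18 reported for the qualitative dataset. In this sketch the same report could in principle be drawn for more than one measure.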

Cross-sectional

Sexual assault reports from individuals who had sexual assault kits collected in Cleveland, Ohio primarily from 1993 through 2011.

Event/Process (Police Report)

Cleveland Division of Police (CDP)

Cuyahoga County Prosecutor's Office (CCPO)

Sentiment score: indicates how positive or negative the words comprising the narrative are (higher score = more positive words)

Polarity score: indicates how negative or positive the sentiment is in the text (higher score = more positive)

Subjectivity score: indicates the amount of personal opinion and factual information in the text (higher score = more personal opinion)

Demographics: victim and suspect race/ethnicity (does not include a multi-racial option), gender, age

Case characteristics: whether suspect was fully named, whether suspect/victim criminal history was mentioned, whether victim was not believed, whether victim did not follow up with investigators, year of report, number of victims, case outcome, offense type
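
To make the polarity and subjectivity descriptions above concrete, the following is a minimal sketch of lexicon-based scoring, one common way to produce such values. It is not the study's method, which this description does not specify. The lexicon, its weights, and the function name score_text are all hypothetical.

```python
import re

# Toy lexicon for illustration only; NOT the lexicon or model used in the study.
# Each entry maps a word to (polarity in [-1, 1], subjectivity in [0, 1]).
TOY_LEXICON = {
    "cooperative": (0.6, 0.7),
    "credible": (0.7, 0.8),
    "calm": (0.4, 0.6),
    "inconsistent": (-0.6, 0.8),
    "evasive": (-0.7, 0.9),
}

def score_text(text, lexicon=TOY_LEXICON):
    """Return (polarity, subjectivity) averaged over the lexicon words found."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return 0.0, 0.0  # no lexicon words found: treat as neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity

polarity, subjectivity = score_text("The witness was calm and cooperative.")
print(round(polarity, 2), round(subjectivity, 2))
```

Under this toy scheme a higher polarity means more positive wording and a higher subjectivity means more opinion-bearing wording, which mirrors the direction of the scores as described.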

2025-12-16

2025-12-16 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

  • Checked for undocumented or out-of-range codes.

Notes

  • The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.

  • ICPSR usually offers files in multiple formats for researchers to be able to access data and documentation in formats that work well within their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.

  • One or more files in this data collection have special restrictions. Restricted data files are not available for direct download from the website.