Understanding Online Hate Speech as a Motivator and Predictor of Hate Crime, Los Angeles, California, 2017-2018 (ICPSR 37470)
Version Date: Jul 28, 2021
Principal Investigator(s):
Meagan Cahill, RAND Corporation;
Katya Migacheva, RAND Corporation;
Jirka Taylor, RAND Corporation;
Matthew Williams, Cardiff University;
Pete Burnap, Cardiff University;
Amir Javed, Cardiff University;
Han Liu, Cardiff University;
Hui Lu, RAND Europe;
Alex Sutherland, RAND Europe
https://doi.org/10.3886/ICPSR37470.v1
Version V1
Summary
In the United States, a number of challenges prevent an accurate assessment of the prevalence of hate crimes in different areas of the country. These challenges create huge gaps in knowledge about hate crime--who is targeted, how, and in what areas--which in turn hinder appropriate policy efforts and allocation of resources to the prevention of hate crime. In the absence of high-quality hate crime data, online platforms may provide information that can contribute to a more accurate estimate of the risk of hate crimes in certain places and against certain groups of people. Data on social media posts that use hate speech or internet search terms related to hate against specific groups have the potential to enhance and facilitate a timely understanding of what is happening offline, outside of traditional monitoring (e.g., police crime reports). This study assessed the utility of Twitter data to illuminate the prevalence of hate crimes in the United States with the goals of (i) addressing the lack of reliable knowledge about hate crime prevalence in the U.S. by (ii) identifying and analyzing online hate speech and (iii) examining the links between online hate speech and offline hate crimes.
The project drew on four types of data: recorded hate crime data, social media data, census data, and data on hate crime risk factors. An ecological framework and Poisson regression models were adopted to study the explicit link between hate speech online and hate crimes offline. Risk terrain modeling (RTM) was used to further assess the ability to identify places at higher risk of hate crimes offline.
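As a rough illustration of the kind of Poisson model mentioned above, the sketch below fits log E[y] = b0 + b1*x by Newton-Raphson in plain Python, where y stands for a tract-level hate crime count and x for a single predictor such as a hateful tweet indicator or count. The function name fit_poisson, the single-predictor form, and the toy numbers are assumptions made for illustration only; the study's actual model specification and control variables are not reproduced here.

```python
import math

def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson and return (b0, b1).

    Hypothetical helper: assumes sum(y) > 0 and that x is not constant.
    """
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system information * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Toy data (invented): x flags tracts with any hateful tweets, y is a crime count.
toy_x = [0, 0, 1, 1]
toy_y = [1, 3, 4, 8]
b0, b1 = fit_poisson(toy_x, toy_y)
rate_ratio = math.exp(b1)  # ratio of expected counts, x = 1 versus x = 0
```

With a binary predictor, the fitted rates equal the group means, so exp(b1) is simply the ratio of mean counts between the two groups. A real analysis would include the census controls and would normally use an established statistics package rather than a hand-written solver.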
Citation
Funding
Subject Terms
Geographic Coverage
Smallest Geographic Unit
Census tract
Restrictions
This data collection may not be used for any purpose other than statistical reporting and analysis. Use of these data to learn the identity of any person or establishment is prohibited. To protect respondent privacy, the data files in this collection are restricted from general dissemination. To obtain these restricted files, researchers must agree to the terms and conditions of a Restricted Data Use Agreement.
Distributor(s)
Time Period(s)
Date of Collection
Data Collection Notes
- For additional information on the study, please visit the Understanding Online Hate Speech as a Motivator and Predictor of Hate Crime website.
Study Purpose
The overarching goals of the research were to (i) address the lack of reliable knowledge about hate crime prevalence in the U.S. by (ii) identifying and analyzing online hate speech and (iii) examining the links between the online hate speech and offline hate crimes. To achieve these goals, the project pursued the following three objectives:
- Classify online hate speech in terms of (i) which individuals and groups direct what kinds of speech (type and severity) at (ii) which groups and (iii) where the tweets are generated.
- Estimate the relationship between online hate speech classification and offline hate crime.
- Develop and test an empirical model to identify areas at increased risk of hate crimes.
Study Design
The project drew on four types of data: recorded hate crime data, social media data, census data, and data on hate crime risk factors.
Recorded hate crime data served as a dependent measure in the analyses. Data were obtained on hate crimes recorded in 2017 and 2018 in L.A. County, compiled by the Los Angeles County Commission on Human Relations (LACCHR). These data represent the most comprehensive data set on hate crimes available in the county. The LACCHR receives hate crime incident reports from 46 law enforcement agencies, 5 community organizations, 36 school districts and 13 higher education institutions, as well as directly from victims. LACCHR staff review the data from all sources to determine whether each reported incident meets the definition of a hate crime as defined by applicable statutes. Staff also check for duplicate reports to ensure incidents are not double-counted. For incidents that occurred in public places, the investigators received the actual location of the incident; for those occurring in private locations, the investigators received mid-block location information. Data from LACCHR were coded into three categories for analysis: (i) racially motivated hate crimes; (ii) religion-motivated hate crimes; and (iii) sexual-orientation-motivated hate crimes. For the purposes of the ecological analysis, the data were then aggregated to census tracts, yielding count data for each measure by census tract and year.
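The aggregation step described above can be sketched as a simple tally. The record layout (the keys "tract", "year", and "category"), the tract identifiers, and the function name count_by_tract_year are hypothetical and are not taken from the LACCHR files; the sketch only shows how incident-level records become count data by census tract, year, and motivation category.

```python
from collections import Counter

def count_by_tract_year(incidents):
    """Tally incidents into counts keyed by (tract, year, category).

    Each incident is a dict with hypothetical keys "tract", "year",
    and "category" (for example "race", "religion", "sexual_orientation").
    """
    counts = Counter()
    for incident in incidents:
        key = (incident["tract"], incident["year"], incident["category"])
        counts[key] += 1
    return counts

# Invented example records; tract IDs are placeholders, not real tracts.
sample_incidents = [
    {"tract": "T001", "year": 2017, "category": "race"},
    {"tract": "T001", "year": 2017, "category": "race"},
    {"tract": "T001", "year": 2018, "category": "religion"},
    {"tract": "T002", "year": 2018, "category": "sexual_orientation"},
]
tract_counts = count_by_tract_year(sample_incidents)
```

Because a Counter returns 0 for missing keys, tract-year-category combinations with no recorded incidents read as zero counts, which matters for count models where most tracts have no incidents in a given year.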
Social media data were the main independent measure of interest. Using the Twitter streaming Application Programming Interface (API) via COSMOS software (Burnap et al., 2014), all tweets posted between September 2017 and September 2018 and geotagged to L.A. County were collected. These data were used to derive a count of all geocoded tweets; 1,813,862 tweets were geolocated within L.A. County in 2017 and 2018.
Supervised machine learning classifiers were then built to identify hateful tweets targeting three characteristics: race (anti-African-American), religion (anti-Muslim, anti-Jewish) and sexual orientation (anti-lesbian, gay, and bisexual). Recorded hate crimes in L.A. County are most frequently reported to target one of these three characteristics. Three gold standard datasets of human coded annotations were generated to train the machine classifiers based on samples of tweets (see Appendix B for classifier results). The classifiers were then used to identify all hateful tweets in the dataset, including which characteristics the tweet targeted. Finally, all geolocated tweets were aggregated to census tracts, providing counts of all tweets and hateful tweets by tract. An important caveat is that neither the social media data nor the hate crime data is a representative sample of the true population: online, only tweets from users who opted to have their tweets geotagged were captured, and offline, only data on reported hate crimes were available.
Census data. The latest 5-year estimates from the American Community Survey were also collected for use as controls in analytic models. Relevant variables were selected based on literature that estimated hate crime using ecological factors (e.g., Green, 1998; Espiritu, 2004). These include age, employment status, race and educational attainment.
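One common way to check a classifier of this kind against human-coded annotations is to compute precision, recall, and F1 on held-out labels. The sketch below shows that calculation for binary labels (True meaning a tweet was coded as hateful). The function name and the toy labels are assumptions for illustration; the metrics actually reported for the study's classifiers are those in its Appendix B and are not reproduced here.

```python
def precision_recall_f1(gold, predicted):
    """Compare human-coded labels with classifier output (True = hateful).

    Returns (precision, recall, f1); each is 0.0 when its denominator is 0.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented labels for five tweets: human annotation versus classifier output.
gold_labels = [True, True, False, False, True]
predicted_labels = [True, False, True, False, True]
precision, recall, f1 = precision_recall_f1(gold_labels, predicted_labels)
```

Precision reflects how many flagged tweets the human coders also judged hateful, while recall reflects how many human-coded hateful tweets the classifier found; both matter when classifier output is later aggregated into tract-level counts.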
Hate crime risk factor data. Existing research literature was reviewed to identify particular environmental features that served as risk factors in risk-terrain models (see Table 2 for the full list of 20 variables). Data on these factors were obtained from public sources including the public L.A. County GIS portal.
Universe
Tweets over the course of one year from the general population of Los Angeles, California.
Unit(s) of Observation
Data Type(s)
Description of Variables
Variables include counts of the number of tweets with various types of hate speech, counts of hate crimes broken down by category, and variables on population with breakdowns by race, gender, age, and educational achievement.
Original Release Date
2021-07-28
Version History
2021-07-28 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:
- Performed consistency checks.
- Checked for undocumented or out-of-range codes.
Notes
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
One or more files in this data collection have special restrictions. Restricted data files are not available for direct download from the website; click on the Restricted Data button to learn more.