This study is provided by ICPSR. ICPSR provides leadership and training in data access, curation, and methods of analysis for a diverse and expanding social science research community.

Congressional Record for 104th-110th Congresses: Text and Phrase Counts (ICPSR 33501)

Principal Investigator(s): Gentzkow, Matthew, University of Chicago, and National Bureau of Economic Research; Shapiro, Jesse, University of Chicago, and National Bureau of Economic Research


Please note that inconsistencies related to the variable "speechID" have been identified in some of the data accompanying this collection. The files in DS15-DS20 may be compromised and should be used with caution.
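The notice does not say what form the "speechID" inconsistencies take. As a minimal sketch of one generic sanity check before linking the speech files to the speech metadata, the Python below looks for duplicated identifiers and for identifiers present in one list but not the other. The function name, the toy identifiers, and the idea that IDs can be read into plain lists are all assumptions made for illustration; the actual file layout is documented in the ICPSR User Guide.

```python
from collections import Counter


def check_speech_ids(speech_ids, metadata_ids):
    """Compare two lists of speech identifiers (hypothetical helper).

    Returns identifiers duplicated in the speech list, and identifiers
    that appear in only one of the two lists.
    """
    duplicates = sorted(k for k, n in Counter(speech_ids).items() if n > 1)
    in_speeches = set(speech_ids)
    in_metadata = set(metadata_ids)
    return {
        "duplicates": duplicates,
        "missing_metadata": sorted(in_speeches - in_metadata),
        "missing_speech": sorted(in_metadata - in_speeches),
    }


# Toy identifiers, invented for this example only.
report = check_speech_ids(
    speech_ids=["a1", "a2", "a2", "a3"],
    metadata_ids=["a1", "a2", "a4"],
)
print(report)
```

An empty report on real data would not prove the files are consistent; it only rules out these particular problems.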

This qualitative data collection contains original and processed text from the United States Congressional Record for the 104th-110th Congresses. The Congressional Record includes text from both chambers, the United States House of Representatives and the United States Senate. For each Congress, the archive includes the original tagged text files, parsed files that separate the text into individual speeches, speaker metadata that can be linked to the parsed files, and counts of two-word phrases (bigrams) by speaker, party, and date.
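To illustrate what a count of two-word phrases by speaker looks like, here is a minimal Python sketch. It is not the principal investigators' processing code: the tokenization rule (lowercase runs of letters), the function names, and the toy speaker IDs and sentences are assumptions for this example, and any stemming or stop-word handling used in the actual collection is described in the User Guide rather than reproduced here.

```python
import re
from collections import Counter, defaultdict


def bigrams(text):
    """Return adjacent word pairs from text (lowercased, letters only)."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))


def count_by_speaker(speeches):
    """Tally bigrams per speaker from (speaker_id, text) pairs."""
    counts = defaultdict(Counter)
    for speaker_id, text in speeches:
        counts[speaker_id].update(bigrams(text))
    return counts


# Toy speeches, invented for this example only.
speeches = [
    ("S001", "The death tax hurts family farms."),
    ("S002", "The estate tax affects few estates."),
    ("S001", "Repeal the death tax."),
]
counts = count_by_speaker(speeches)
print(counts["S001"].most_common(2))
```

Counts by party or by date follow the same pattern, with the party label or the speech date used as the grouping key in place of the speaker ID.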

Access Notes

  • Data in this collection are available only to users at ICPSR member institutions.


DS0:  Study-Level Files
DS1:  Original 1995 (111.088 MB)
DS2:  Original 1996 (77.308 MB)
DS3:  Original 1997 (73.989 MB)
DS4:  Original 1998 (77.8 MB)
DS5:  Original 1999 (85.189 MB)
DS6:  Original 2000 (75.264 MB)
DS7:  Original 2001 (74.673 MB)
DS8:  Original 2002 (65.293 MB)
DS9:  Original 2003 (87.775 MB)
DS10:  Original 2004 (70.721 MB)
DS11:  Original 2005 (83.864 MB)
DS12:  Original 2006 (65.232 MB)
DS13:  Original 2007 (96.656 MB)
DS14:  Original 2008 (65.678 MB)
DS15:  Speeches (421.492 MB)
DS16:  Counts by Date (377.123 MB)
DS17:  Counts by Party (128.305 MB)
DS18:  Counts by Speaker (358.637 MB)
DS19:  Metadata: Speaker (0.266 MB)
DS20:  Metadata: Speech (18.863 MB)

Study Description


Gentzkow, Matthew, and Jesse Shapiro. Congressional Record for 104th-110th Congresses: Text and Phrase Counts. ICPSR33501-v4. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2015-12-01. https://doi.org/10.3886/ICPSR33501.v4

Persistent URL: https://doi.org/10.3886/ICPSR33501.v4


This study was funded by:

  • National Science Foundation (SES-0617658 and SES-0922342)

Scope of Study

Subject Terms:    government, legislative bodies, political speeches, public officials, United States Congress

Smallest Geographic Unit:    United States

Geographic Coverage:    United States

Time Period:   

  • 1995-2008

Date of Collection:   

  • 2007-06 to 2011-11

Unit of Observation:    parsed bigrams of congressional records, speaker and speech metadata

Universe:    Full text of the published Congressional Record for both chambers of the 104th-110th Congresses of the United States.

Data Type(s):    administrative records data, aggregate data, text, program source code

Data Collection Notes:

This collection has not been processed by ICPSR and is being released in the original ASCII format for convenience of use; no value labels are present in the data.

Please see the ICPSR User Guide for information about what each part of the data collection contains.

Please note that the files for this data collection are extremely large. Users should exercise discretion when downloading files.


Study Design:    Please refer to the Original P.I. Documentation in the ICPSR User Guide.

Sample:    The data are not a sample, as this collection is an aggregation of data on Congressional speech.

Data Source:

Congressional Records obtained from the Government Printing Office


Original ICPSR Release:   2012-12-14

Version History:

  • 2015-12-01 This collection is being updated to comply with new ICPSR file-naming conventions. No other changes have been made to the collection.
  • 2015-10-23 This collection is being updated to include data for the 110th Congress, spanning the years 2007 and 2008.
  • 2013-07-08 The User Guide was updated.
