The Dataset
Before analyzing these data, it helps to know something about the dataset: how the data were collected and what sorts of error might be present. The section on survey research on this website addresses those questions by discussing survey research methods in general and the methodology behind the 2020 ANES survey in particular.
It is also important to understand the codebook, which describes the dataset. The section on the codebook describes the information in the codebook entries for each variable and provides some additional information for some variables.
Survey Research Methods
The study of voting behavior generally relies on information from sample surveys. Aggregate election statistics from states or counties, another common type of election data, are useful for examining the patterns of election results, such as differences in the presidential vote among the 50 states; however, such data are not suitable for an analysis that focuses on the individual voter. To investigate the factors that affect how people vote, we need information on individuals. Such information commonly includes data on voting behavior, attitudes and beliefs, and personal characteristics. As it is impractical to obtain this information for each member of the electorate, the common procedure is to draw a sample of people from the population and interview these individuals. Once collected, survey data are usually processed and stored in a form allowing for computer-assisted data analysis. This data analysis generally focuses on describing and explaining patterns of political opinion and electoral behavior.
The data for this instructional package are drawn from the 2020 American National Election Study (ANES), sponsored by the University of Michigan and Stanford University. Funding for the 2020 ANES came from the National Science Foundation (NSF). The study interviewed a total of 8,280 respondents before and after the election, but we include only 7,453 respondents who completed both pre-election and post-election interviews. The 2020 ANES survey used a contactless, mixed-mode design that was created in response to challenges related to the COVID-19 pandemic. The pandemic made face-to-face interviewing unfeasible in 2020, and no in-person interviewing was done. Instead, a sequential mixed-mode design was implemented that included self-administered online surveys, live video interviews conducted online, and telephone interviews. Only a portion of all the information collected by the study is contained in this dataset, and the selected data have been prepared for instructional purposes.
In order to use a dataset, a codebook is necessary. The codebook describes the dataset by providing a list of all variables, an explanation of each variable, and a description of the possible values for each variable. A codebook can be thought of as a combination of a map and an index to the dataset.
Many people ask how it is possible to make any generalizations about the American public on the basis of a survey sample of a few thousand individuals. The answer is that it is not possible unless researchers use a methodical sampling scheme. If we simply stood on a street corner and asked questions of the first one thousand people who walked by, we could not draw any conclusions about the attitudes and opinions of the American public. With a proper sampling scheme, however, a sample of similar size can yield accurate generalizations about the American public.
A full explanation of the theory of survey sampling is beyond the scope of this instructional package, but we can introduce some basics. The goal of any social science survey is to reduce the error in generalizing about the population. Error has two origins: systematic and random. The goal of proper sampling is to reduce both, although systematic error is the more serious of the two, so we are especially concerned with it.
We can reduce error in a survey through proper sampling techniques.
The most basic form of sampling is the simple random sample. This involves drawing a sample from a list of all members of the population in such a way that everybody in the population has the same chance of being selected for the sample.
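As a small illustration (not part of the ANES methodology), a simple random sample can be drawn from a population list with Python's standard library; the population names here are hypothetical:

```python
import random

# Hypothetical population list: every member has an equal chance of selection.
population = [f"person_{i}" for i in range(10_000)]

random.seed(42)  # fixed seed so this illustration is reproducible
sample = random.sample(population, k=1_000)  # draws without replacement

print(len(sample))       # 1000
print(len(set(sample)))  # 1000 -- no one is selected twice
```

Because `random.sample` draws without replacement and treats every list element identically, each member of the population has the same probability of ending up in the sample, which is the defining property of a simple random sample.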
Simple random samples are not appropriate for many social science applications. Often we want samples in which key subgroups (e.g., women, those without a college degree, Latinos) appear in roughly the same proportions as in the population. A simple random sample does not guarantee this; stratified probability sampling comes closer to that guarantee.
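A minimal sketch of stratified sampling, again with a made-up population (the group labels and the 10% sampling fraction are illustrative, not drawn from the ANES design): the population is first split into strata, and the same fraction is drawn from each, so subgroup proportions in the sample match the population exactly.

```python
import random
from collections import defaultdict

def stratified_sample(population, strata_of, fraction, seed=0):
    """Draw the same fraction from every stratum, so each subgroup's
    share of the sample equals its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[strata_of(person)].append(person)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60% in group "A", 40% in group "B".
population = [("A", i) for i in range(600)] + [("B", i) for i in range(400)]
sample = stratified_sample(population, strata_of=lambda p: p[0], fraction=0.1)
# The 10% sample contains exactly 60 "A"s and 40 "B"s,
# which a simple random sample would match only on average.
```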
The actual form of sampling depends on whether the interviews will be conducted in person (as the ANES has done in years other than in 2020), by telephone, or in some other way such as on the Internet.
Errors come in two types: random and systematic. Random errors occur any time we measure something; re-measuring the same thing often yields a slightly different result. As the old carpenters’ saying goes, “measure twice, cut once.” Systematic error arises when the device we are using to measure is badly calibrated, like a ruler with the first half-inch missing. Surveys can contain both types of error.
We can typically deal effectively with random error by phrasing our conclusions in probabilistic terms rather than certainties. We can say, for example, that one position on an issue is favored by more people than the other, but that the difference is within the margin of error for a sample of the size taken for this survey.
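The margin of error for a proportion can be computed directly. Under the standard textbook formula for a simple random sample (an assumption; the actual ANES design is more complex), the 95% margin of error is z·sqrt(p(1−p)/n) with z = 1.96:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion under simple random sampling.
    p = 0.5 gives the most conservative (largest) margin."""
    return z * math.sqrt(p * (1 - p) / n)

# For a national sample of about 2,000 respondents:
moe = margin_of_error(2_000)
print(f"{moe:.3f}")  # 0.022, i.e. roughly +/- 2 percentage points
```

Note that the margin shrinks with the square root of the sample size: quadrupling the sample only halves the margin of error, which is why national surveys rarely go far beyond a few thousand respondents.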
We also attempt to minimize systematic error by making sure that our questions are well worded, that our interviewers are well trained, and by adhering to best practices of developing and conducting surveys, such as those developed by the American Association for Public Opinion Research (AAPOR).
Potential sources of error in national surveys include:
- The sampling procedure itself. Since surveys are based on samples and not the entire population, there is a certain amount of random sampling error in any survey. For a properly constructed national sample with about 2,000 respondents, the margin of error is around +/-2 percentage points.
- Certain unavoidable systematic errors in the sample. Some segments of the population are exceptionally difficult to reach. Homeless people and those in penal or mental institutions, for example, are not a part of the ANES sample.
- Survey nonresponse and refusals to cooperate by potential respondents. Some people cannot be reached, and others decline to take the survey. As the number of surveys and polls has increased in recent years, respondents have displayed survey fatigue, and refusal rates have risen. If non-respondents differ systematically from respondents, the results can be seriously biased.
- Lack of candor by respondents. This arises with questions whose answers are socially desirable (e.g., Did you vote in the election?) or socially undesirable (e.g., Did you cheat on your income tax last year?).
- Inability of respondents to remember past behaviors. For many students of politics, it is difficult to believe that some people simply do not remember whom they voted for, or remember it incorrectly. ANES has found, for example, that after an election more people report having voted for the winner than actually did so. Other political behaviors are even harder to recall (e.g., Did you contact a public official about some matter in the past year?).
- Respondents misconstruing survey questions as exams. This can lead them to offer answers to questions they have not really thought about (i.e., what have been termed non-attitudes).
- Badly trained interviewers. They may give respondents cues as to how to answer questions, or mis-record respondents’ answers, or falsify data.
- Errors in data preparation, coding, and processing, which can occur when survey responses are entered into a computer data file.
It is important to be aware of the potential sources of error when working with survey data. Small differences in percentages may be the result of such errors.
The data for this instructional module are weighted. Weighting is a technical procedure that corrects the data for several basic factors, with the goal of producing a dataset that is more demographically representative of the population. When the data are weighted, some respondents count for more than one person (they have a sample weight greater than 1.0) and some count for less than one person (they have a sample weight less than 1.0). You need not be overly concerned about weighting: the dataset is designed to be weighted automatically when you open it, and you will seldom notice that you are working with weighted data. Any small anomalies you occasionally observe will be the result of the weighting.
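To make the idea concrete, here is a sketch of how a weighted estimate differs from an unweighted one. The responses and weights below are invented for illustration and have nothing to do with the actual ANES weights:

```python
def weighted_mean(values, weights):
    """Weighted average: a respondent with weight > 1.0 counts for more
    than one person, one with weight < 1.0 for less than one."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical mini-survey: 1 = approves, 0 = does not approve.
responses = [1, 1, 0, 0, 0]
weights   = [1.5, 1.5, 0.8, 0.8, 0.8]  # illustrative sample weights

print(weighted_mean(responses, weights))
# Unweighted, 40% approve; with the two upweighted respondents,
# the weighted estimate rises to about 55.6%.
```

This is also why weighted data can produce the small anomalies mentioned above, such as category counts that are not whole numbers.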