The Exercises
The core of this instructional module is a set of analysis exercises, which are designed to develop the student’s ability to conduct and understand analyses of survey data. To begin, students are introduced to the basic principles of reading a simple contingency table. Then students are exposed to more sophisticated analyses, in which a simple two-variable relationship is explored by introducing a control variable into the analysis. At the same time, some substantive aspects of voting behavior are explored. After completing these exercises, students should be prepared to conduct their own research into a suitable topic with these data.
Data Analysis
Reports of polls in the media describe behaviors or opinions, reporting information like the proportion of people who favor some proposal or who feel a certain way about an issue. These reports deal with “what” questions of social science research — what does the American public think about marijuana legalization? The survey research described on this website is more concerned with “why” questions — why did some people vote for Joe Biden in 2020 while others voted for Donald Trump? To answer “why” questions we need to examine relationships between variables.
A variable is a factor that can take on different values. Individual characteristics, such as age, income, race, or education, are variables because different people have different values on these characteristics. An element that does not vary is called a constant. For instance, if we asked people born in January what month they were born in, they would all answer January and we would have a constant, not a variable. Variables come in different types. We can first distinguish between categorical variables, interval-level variables, and ratio-level variables. Interval-level and ratio-level variables are continuous, meaning that each value of the variable is one equal increment larger than the previous value and one smaller than the next value. Temperature, if measured in degrees Fahrenheit, is a good example of an interval-level variable: each increment is one degree, and temperature can theoretically take on any value, even negative values. Age, if measured in years, is a good example of a ratio-level variable: each increment is one year, and the increments have full numeric properties with an absolute zero. For instance, no person can be -1 or -2 years old, and someone who is 40 years old is twice as old as someone who is 20.
Typically, with interval-level or continuous variables, there are relatively large numbers of values. Categorical variables, by comparison, have a limited number of values. Gender is an example of a categorical variable. All the variables included in this instructional package dataset are categorical. Variables that could have been measured as interval-level variables, such as age or income, have instead been made into categorical variables by creating categories that define specific ranges for the variables (e.g., for age, the categories are 18-24 years, 25-34 years, and so on).
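The conversion of a continuous variable such as age into categories can be sketched in a few lines of Python. This is only an illustration: the text gives the first two ranges (18-24 and 25-34), while the remaining cut points, the labels, and the function name `age_category` are assumptions made for the example and may not match the dataset.

```python
from bisect import bisect_right

# Lower bound of each age category. Only 18-24 and 25-34 come from the text;
# the other cut points are assumed for illustration.
LOWER_BOUNDS = [18, 25, 35, 45, 55, 65, 75]
LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]

def age_category(age):
    """Return the label of the category that contains the given age in years."""
    if age < LOWER_BOUNDS[0]:
        raise ValueError("age is below the youngest category")
    # bisect_right counts how many lower bounds are <= age.
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]

for age in (18, 24, 25, 80):
    print(age, "->", age_category(age))
```

Once ages are grouped this way, the exact age within a category is no longer available, which is why the resulting variable is treated as categorical.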
Variables can be measured on several different types of scales. Categorical variables can be measured either as nominal or ordinal. Nominal measures are the simplest—the categories are unordered. A good example is area of residence (R18 in the dataset). This variable is coded in the following manner:
- Rural area
- Small town
- Suburb
- City
This coding could, of course, be changed. For example, city could be coded “1,” suburb “2,” etc., without disturbing the underlying meaning of the variable. Other examples of nominal scales are presidential vote (A02), which is coded as (1) Biden; (2) Trump; (3) Jorgensen; (4) Hawkins; (5) Other candidate; (9) NA, and race/ethnicity (R10), which is coded as (1) White; (2) Black; (3) Hispanic; (4) Asian or Native Hawaiian/other Pacific Islander; (5) Native American/Alaska native; (6) Multiple races; (9) NA. In all of these cases, it would not matter if we changed the ordering of the categories because they are nominal measures.
Ordinal-level scales have an underlying order—a scale that ranges from “greatly favor” to “greatly oppose” is a good example. Consider the dataset question on whether an individual favors or opposes the death penalty (K05). This is an ordinal-level variable coded in the following manner:
- Favor strongly
- Favor not strongly
- Oppose not strongly
- Oppose strongly
You can see that there is an order to the values here—they range from strong support to strong opposition. Changing the order of the values for this variable would change the underlying meaning of the variable.
Interval-level scales have an underlying order in which the intervals are the same. For example, if education were measured by the number of years of education, then a score of “15” would mean that the respondent had 15 years of formal education which is one year more than somebody who answered “14” and one year less than somebody who answered “16.” Here, not only does the order of the values matter, but we can assume that each year of education is the same as all the other years—each year adds one increment to the pre-existing education level. There are, however, no variables using interval-level scales in this SETUPS dataset.
Two variables are related to each other when certain values of one variable are likely to be associated with certain values for the other variable. Conversely, two variables are unrelated when the values of one variable are equally likely to be associated with any of the values for the other variable. For example, if we say that education and turnout (i.e., whether one votes) are related, this could mean that more educated people are more likely to vote (which is what we would expect), or that more educated people are less likely to vote. Either way, the two variables would be related, for values on one variable would be linked to values on the other variable. Naturally, it would be far more informative to state how education and turnout are related, rather than just state the simple fact that they are related, and this should be done whenever possible.
When speaking about the relationship between two variables, it is common to use the terms independent variable and dependent variable. The independent variable should be the “cause” and the dependent variable the “effect.” In other words, the independent variable affects or influences the dependent variable. A common research procedure is to start with some dependent variable and then to identify some independent variables that are strongly related to the dependent variable. In this way, we can explain the dependent variable by identifying some of the factors that influence it.
A common procedure is to use a contingency table to examine the relationship between two variables in a survey. A contingency table presents the cross-tabulation between two variables. An example of a contingency table is the cross-tabulation between party identification and presidential vote. This table omits those who did not express a party identification and those who did not vote in the 2020 presidential election.
In interpreting contingency tables:
- Look at the percentage by the independent variable. Party identification is the independent variable. When it is cross-tabulated with presidential vote, the columns under each category of party identification add to 100 percent.
- Compare the distribution of the dependent variable across the categories of the independent variable.
- Look for trends in the percentages. In the cross-tabulation of party identification and presidential vote, the percentage of respondents voting for Trump goes up dramatically as we move from the category of “Strong Democrat” to the category of “Strong Republican.”
- Don’t treat percentages as exact reflections of the population—they are approximations.
- Look for sizeable differences in percentages—small differences can be the result of random error.
- Consider the total number of respondents in each column. Be cautious in interpreting percentages if there are fewer than 50 respondents in a column.
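The percentaging rule in the list above can be sketched in Python. The counts below are invented for illustration and are not ANES results; the function name `column_percentages` is likewise hypothetical. The sketch percentages the dependent variable (vote) within each category of the independent variable (party identification) and reports the number of respondents in each column.

```python
from collections import Counter

# Hypothetical (party identification, presidential vote) pairs; not real ANES data.
respondents = (
    [("Democrat", "Biden")] * 90 + [("Democrat", "Trump")] * 10
    + [("Republican", "Biden")] * 8 + [("Republican", "Trump")] * 72
)

def column_percentages(pairs):
    """Percentage the dependent variable within each independent-variable category."""
    cell_counts = Counter(pairs)
    column_totals = Counter(independent for independent, _ in pairs)
    table = {}
    for (independent, dependent), count in cell_counts.items():
        share = 100.0 * count / column_totals[independent]
        table.setdefault(independent, {})[dependent] = share
    return table, column_totals

percentages, totals = column_percentages(respondents)
for party, distribution in percentages.items():
    # Each column sums to 100 percent; N is the number of respondents in the column.
    print(party, "N =", totals[party], distribution)
```

Comparing the columns shows the pattern to look for: the share voting for Trump is much higher in one column than in the other, and each column has enough respondents to interpret.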
Statisticians make a distinction between measures of significance and measures of association. A measure of significance tells whether the relationship between two variables might be the result of random chance, and thereby not a real relationship. A measure of association tells how strong the relationship is between two variables. Chi-square (χ²) is a common measure of significance. It tests the hypothesis that there is no relationship between the columns and rows in a contingency table. Chi-square is reported as both a number and a probability. If the probability is higher than 0.05, it generally means that we cannot reject the hypothesis that the two variables are unrelated; this is not proof that they are unrelated, only that the data do not provide sufficient evidence of a relationship. If the probability is lower than 0.05, it means that, if the variables were truly unrelated, there would be less than a five percent chance of observing a relationship as strong or stronger by chance alone, which usually allows us to reject the possibility that the variables are truly unrelated. This is a conservative approach to data interpretation in that we do not conclude that two variables are truly related unless there is a very low likelihood that the observed association could be the result of chance alone.
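To make the chi-square calculation concrete, here is a minimal Python sketch. The counts are hypothetical, and the function names `chi_square` and `p_value_df1` are introduced only for this example. The statistic compares each observed cell count with the count expected if rows and columns were unrelated. The probability function shown is valid only for a table with one degree of freedom (a 2 x 2 table); larger tables need a general chi-square distribution function from a statistics library.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    `table` is a list of rows, each row a list of cell counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

def p_value_df1(statistic):
    """Upper-tail probability of a chi-square value; valid only when df = 1."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical counts: rows are vote choices, columns are two education groups.
observed = [[60, 40], [40, 60]]
statistic, df = chi_square(observed)
probability = p_value_df1(statistic)
print("chi-square =", statistic, "df =", df, "p =", probability)
print("reject 'no relationship' at 0.05:", probability < 0.05)
```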
Measures of association share some general properties—they have a value of zero when two variables are unrelated and rise to a value of 1.0 when the two variables have a perfect relationship. The measure of association you use depends in part on the scales on which one measures the variables.
Commonly used statistics for ordinal variables are gamma and Kendall’s tau. Both gamma and tau can have positive or negative signs. A negative sign means the variables are inversely related, meaning that as one goes up the other goes down. A positive sign means the opposite. Thus, for gamma and tau, a value of -1.0 indicates just as perfect a relationship as a value of 1.0. Another statistic appropriate for ordinal data is Somers’ d.
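As an illustration of how the sign of an ordinal measure arises, the following Python sketch computes Goodman and Kruskal's gamma from a table of counts. The tables are hypothetical and the function name is an assumption for the example. Gamma counts pairs of respondents ordered the same way on both variables (concordant) and pairs ordered in opposite ways (discordant), ignoring ties.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D) for a table whose rows and columns are both ordered.

    C is the number of concordant pairs, D the number of discordant pairs.
    Raises ZeroDivisionError if there are no concordant or discordant pairs.
    """
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair each cell with every cell in a later (higher) row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        discordant += table[i][j] * table[k][m]
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical tables with rows and columns ordered from low to high.
positive_table = [[30, 10], [10, 30]]
inverse_table = [[10, 30], [30, 10]]
print(goodman_kruskal_gamma(positive_table))  # positive: variables rise together
print(goodman_kruskal_gamma(inverse_table))   # negative: one rises as the other falls
```

The two tables contain the same counts with the columns reversed, so the relationship is equally strong in both and only the sign differs.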
For tables with nominal variables, either Cramér’s V or the contingency coefficient is an appropriate measure of association, but SDA does not report these statistics for contingency tables. This, however, is not a major problem, as only a few variables in this instructional package dataset are nominal. The vast majority of them are ordinal.
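Because Cramér's V is built from chi-square, it can be computed by hand from a table of counts when needed. The sketch below uses the standard formula V = sqrt(χ² / (N × min(rows − 1, columns − 1))); the counts and the function name `cramers_v` are hypothetical.

```python
import math

def cramers_v(table):
    """Cramér's V for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    # Scale chi-square by sample size and by the smaller table dimension minus one.
    smaller_dimension = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi_square / (n * smaller_dimension))

# Hypothetical counts: two vote choices (rows) by three unordered categories (columns).
counts = [[30, 20, 10], [10, 20, 30]]
v = cramers_v(counts)
print(round(v, 3))
```

Unlike gamma, V has no sign: it ranges from 0 to 1 because nominal categories have no direction.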
The great value of statistics is that they can summarize in a single number an entire table with many cells and percentages. By doing so, however, one loses some detailed information about the relationship between the two variables.
Remember that measures of association do not really explain any more than do the percentages in a table; they just do it more conveniently. Also, it is possible to calculate any statistic on any contingency table you construct. Whether the statistic has any meaning depends on the thought that goes into the table before you construct it.
Variables in this dataset have been coded into what we think are the most logical categories. There may be times, however, when you might want to change the coding. Recoding variables may be desirable when:
- There are so many categories of one of the variables in a table that there are too few cases per cell for meaningful interpretation. With the fairly large number of cases in the 2020 ANES, this is less of a concern than it was in past years.
- There is some meaningful reason to combine categories in a variable. For example, you think the real distinction in education is whether somebody has a college degree or not instead of the five categories the variable has in the dataset.
Some important variables in this dataset have a larger number of categories—party identification (7), the ideology scale (7), presidential primary vote (10), age (7), family income (7), and religion (12).
It is important to remember that recoding variables shifts respondents into different categories. When categories are combined, some or all of the resulting categories will contain more respondents than the original categories did.
Some rules-of-thumb for recoding are:
- Recode variables to emphasize the important distinctions in the data.
- Re-label recoded variables immediately after creating them. It is easy to forget what you wanted to do with a new variable. It is important to remember that any tables you generate with unlabeled variables will not be labeled either.
- Check the marginals for your newly recoded variable—see if they make sense when compared to the marginals for the original, un-recoded variable.
- Remember that the recoding of any variable is only in effect for the table that you generate. The recoding does not remain after the table has been generated, so if you want to use the recoded version of a variable for another table, you will have to recode it again.
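The rules of thumb above can be illustrated outside SDA with a short Python sketch. It collapses a five-category education variable into two categories, labels the result, and checks the marginals. The category codes, labels, and data are hypothetical and may not match the dataset's actual education variable.

```python
from collections import Counter

# Hypothetical codes and labels for a five-category education variable.
EDUCATION_LABELS = {
    1: "Less than high school", 2: "High school", 3: "Some college",
    4: "College degree", 5: "Advanced degree",
}
# Recode to emphasize one distinction: college degree or not.
RECODE = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
# Label the new categories right away so later tables are readable.
RECODE_LABELS = {1: "No college degree", 2: "College degree"}

def recode(values, mapping):
    """Map each original code to its new code."""
    return [mapping[value] for value in values]

education = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # invented responses
recoded = recode(education, RECODE)

# Check the marginals: the totals must match, and each new category must
# equal the sum of the original categories that were folded into it.
original_marginals = Counter(education)
recoded_marginals = Counter(recoded)
for code, label in RECODE_LABELS.items():
    print(label, recoded_marginals[code])
```

In this invented data, the three original categories below a college degree contain 1, 2, and 3 respondents, so the recoded "No college degree" category should contain 6.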
You will learn much more about recoding variables in later sections of this instructional package.