# Data Analysis

Reports of polls in the media describe behavior or opinion, reporting information like the proportion of people who favor some proposal or who feel a certain way about an issue. These reports deal with "what" questions of social science research — what does the American public think about abortion? The survey research described on this website is more concerned with "why" questions — why did some people vote for Barack Obama in 2012 while others voted for Mitt Romney? To answer "why" questions we need to examine relationships between variables.

## Variables

A variable is an element that varies, meaning that it can take different values. Individual characteristics, such as age, income, or education, are variables because different people have different values or scores on these characteristics. We can first distinguish between categorical and interval-level data. Interval-level variables are continuous, meaning that each value is one equal increment larger than the previous value and one equal increment smaller than the next. Age, if measured in years, is a good example: each increment is one year. Interval or continuous variables typically have a relatively large number of values. Categorical variables have a limited number of values. Gender is an example of a categorical variable. All of the variables included in this instructional package dataset are categorical. Variables that could have been measured as interval-level variables, such as age or income, have instead been made into categorical variables by creating categories that define specific ranges for the variables (e.g., for age, the categories are 18-24 years, 25-34 years, and so on).
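As a rough illustration of how an interval-level variable becomes a categorical one, the sketch below assigns ages in years to brackets. Only the first two brackets (18-24 and 25-34) come from the text above; the remaining brackets and the function name `age_category` are assumptions made for the example, not the dataset's actual coding.

```python
from bisect import bisect_right

# Lower bound of each age bracket. Only 18-24 and 25-34 are taken from the
# text; the other brackets are assumed for illustration.
LOWER_BOUNDS = [18, 25, 35, 45, 55, 65, 75]
LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]


def age_category(age):
    """Return the bracket label for an age in years (18 or older)."""
    if age < LOWER_BOUNDS[0]:
        raise ValueError("age is below the youngest bracket")
    # bisect_right counts how many lower bounds are <= age.
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]


ages = [19, 24, 25, 40, 83]
categories = [age_category(a) for a in ages]
print(list(zip(ages, categories)))
```

Once ages are grouped this way, the exact age within a bracket is no longer available, which is why the resulting variable is treated as categorical.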

Variables can be measured on several different types of scales. Categorical variables can be measured on either nominal or ordinal scales. Nominal scales are the simplest: the categories are unordered. A good example is region (R19 in the dataset). This variable is coded in the following manner:

1. Northeast
2. Midwest
3. South
4. West

This coding of region could, of course, be changed. For example, the South could be coded "1," Northeast as "2," etc., without disturbing the underlying meaning of the variable.

Ordinal-level scales have an underlying order — a scale that ranges from "strongly favor" to "strongly oppose" is a good example. Consider the dataset question on privatization of Social Security (J13). This is an ordinal-level variable coded in the following manner:

1. Strongly favor
2. Favor
3. Neither favor nor oppose
4. Oppose
5. Strongly oppose
6. NA

You can see that there is an order to the values here — they range from strong support to strong opposition. Changing the ordering of the values for this variable would change the underlying meaning of the variable.

Interval-level scales have an underlying order in which the intervals are the same. For example, if education were measured by the number of years of education, then a score of "15" would mean that the respondent had 15 years of formal education. Here, not only does the order of the values matter, but we can assume that each year of education is the same — each year adds one increment to the pre-existing education level. There are no variables using interval-level scales in this SETUPS dataset.

## Relationships between Variables

Two variables are related to each other when certain values of one variable are likely to be associated with certain values for the other variable. Conversely, two variables are unrelated when the values of one variable are equally likely to be associated with any of the values for the other variable. For example, if we say that education and turnout (i.e., whether one votes) are related, this could mean that more educated people are more likely to vote (which is what we would expect), or that more educated people are less likely to vote. Either way, the two variables would be related, for values on one variable would be linked to values on the other variable. Naturally, it would be far more informative to state how education and turnout are related, rather than just state the simple fact that they are related, and this should be done whenever possible.

When speaking about the relationship between two variables, the terms independent variable and dependent variable are commonly used. The independent variable can be considered to be the "cause" and the dependent variable the "effect." In other words, the independent variable affects or influences the dependent variable. A common research procedure is to start with some dependent variable and then to identify some independent variables that are strongly related to the dependent variable. In this way the dependent variable is explained, at least in the sense that some of the factors that influence it are identified.

## Contingency Tables

A common procedure to examine the relationship between two variables in a survey is to use a contingency table. A contingency table presents the cross-tabulation between two variables. An example of a contingency table is the cross-tabulation between party identification and presidential vote. This table omits those who did not express a party identification and those who did not vote in the 2012 presidential election.

In interpreting contingency tables:

• Percentage by the independent variable. As party identification is the independent variable when it is cross-tabulated with presidential vote, the columns under each category of party identification add to 100 percent;

• Compare the distribution of the dependent variable across the categories of the independent variable;

• Look for trends in the percentages. In the cross-tabulation of party identification and presidential vote, the percentage of respondents voting for Romney goes up dramatically as we move from the category of "Strong Democrat" to the category of "Strong Republican";

• Don't treat percentages as exact reflections of the population — they are approximations;

• Look for sizeable differences in percentages — small differences can be the result of random error;

• Consider the total number of respondents in each column. Be cautious in interpreting percentages if there are fewer than 50 respondents in a column.
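The guidelines above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure SDA uses: the respondents are made up, party identification is collapsed to two categories for brevity, and the function name `column_percentages` is hypothetical.

```python
from collections import Counter


def column_percentages(pairs):
    """Percentage the dependent variable within each category of the
    independent variable.

    pairs: iterable of (independent_value, dependent_value) tuples.
    Returns (table, totals), where table[ind][dep] is a percentage and
    totals[ind] is the number of respondents in that column.
    """
    counts = Counter(pairs)
    totals = Counter()
    for (ind, _dep), n in counts.items():
        totals[ind] += n
    table = {}
    for (ind, dep), n in counts.items():
        table.setdefault(ind, {})[dep] = 100.0 * n / totals[ind]
    return table, dict(totals)


# Made-up respondents: (party identification, presidential vote).
respondents = (
    [("Democrat", "Obama")] * 45
    + [("Democrat", "Romney")] * 5
    + [("Republican", "Obama")] * 4
    + [("Republican", "Romney")] * 36
)

table, totals = column_percentages(respondents)
for party, votes in table.items():
    # Flag columns with fewer than 50 respondents, per the last guideline.
    note = " (interpret with caution)" if totals[party] < 50 else ""
    print(party, votes, "N =", totals[party], note)
```

Each column sums to 100 percent, so the distribution of the vote can be compared directly across the categories of party identification.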

## Statistics

Statisticians make a distinction between measures of significance and measures of association. A measure of significance tells whether the relationship between two variables might be the result of chance alone. A measure of association tells how strong the relationship is between two variables. Chi Square (χ²) is a common measure of significance. It tests the hypothesis that there is no relationship between the columns and rows in a contingency table. Chi Square is reported as both a number and a probability; if the probability is higher than about 0.05, it generally means that we cannot reject the hypothesis that the two variables are unrelated. If the probability is lower than 0.05, it means that, if the two variables were truly unrelated, there would be less than a 5 percent chance of observing a relationship as strong or stronger by chance alone. This usually allows us to reject the possibility that the variables are truly unrelated. This is a conservative approach to data interpretation in that we do not conclude that two variables are truly related unless there is a very low likelihood that the observed association could be the result of chance alone.
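To make the test concrete, here is a minimal sketch of the standard Pearson Chi Square calculation for a table of counts. The counts are made up, and the function name `chi_square` is hypothetical. Instead of computing an exact probability, the sketch compares the statistic with the usual 0.05 critical values from a Chi Square table, which assumes the table has between 1 and 4 degrees of freedom.

```python
def chi_square(table):
    """Pearson chi-square for a table given as a list of rows of counts.

    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Critical values of chi-square at the 0.05 level, by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

# Made-up counts. Rows: Obama, Romney. Columns: Democrat, Republican.
observed = [[45, 4], [5, 36]]
stat, df = chi_square(observed)
significant = stat > CRITICAL_05[df]
print(f"chi-square = {stat:.2f}, df = {df}, significant at 0.05: {significant}")
```

A statistic above the critical value corresponds to a probability below 0.05, so in that case the hypothesis that the two variables are unrelated would be rejected.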

## Measures of Association

Measures of association share some general properties — they have a value of zero when two variables are unrelated and rise to a value of 1.0 when the two variables have a perfect relationship. Which measure of association you use depends in part on the scales on which the variables are measured.

Commonly used statistics for ordinal variables are gamma and Kendall's tau. Both gamma and tau can have positive or negative signs. A negative sign means the variables are inversely related, meaning that as one goes up the other goes down. A positive sign means that as one goes up the other goes up as well. Thus, for gamma and tau, a value of -1.0 indicates just as perfect a relationship as a value of 1.0. Another statistic appropriate for ordinal data is Somers' d.
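As an illustration of how the sign and size of gamma arise, the sketch below computes Goodman and Kruskal's gamma from a table of counts using its usual definition: the number of concordant pairs minus the number of discordant pairs, divided by their sum. The tables are made up, the function name is hypothetical, and both rows and columns are assumed to be listed in their natural order.

```python
def goodman_kruskal_gamma(table):
    """Gamma for a table whose rows and columns are both in ordinal order.

    table[i][j] is the number of respondents in row category i and
    column category j.
    """
    concordant = 0
    discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair each cell with every cell in a later (higher) row.
            for i2 in range(i + 1, n_rows):
                for j2 in range(n_cols):
                    pairs = table[i][j] * table[i2][j2]
                    if j2 > j:
                        concordant += pairs  # higher on both variables
                    elif j2 < j:
                        discordant += pairs  # higher on one, lower on the other
    # Undefined (division by zero) if there are no concordant or discordant pairs.
    return (concordant - discordant) / (concordant + discordant)


positive = goodman_kruskal_gamma([[30, 10], [10, 30]])
negative = goodman_kruskal_gamma([[10, 30], [30, 10]])
print(positive, negative)
```

Reversing the order of the categories of one variable flips the sign of gamma without changing its magnitude, which is why -1.0 and 1.0 indicate equally strong relationships.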

For tables with nominal variables, either Cramér's V or the contingency coefficient is an appropriate measure of association, but SDA does not report these statistics for contingency tables. This, however, is not a major problem, as only a few variables in this instructional package dataset are nominal. The vast majority of them are ordinal.
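Because SDA does not report it, Cramér's V would have to be computed by hand from the counts in a table. The sketch below uses the standard definition, the square root of Chi Square divided by the product of the sample size and one less than the smaller of the number of rows and columns. The counts and the function names are made up for illustration.

```python
from math import sqrt


def chi_square_statistic(table):
    """Pearson chi-square statistic for a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (observed - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, observed in enumerate(row)
    )


def cramers_v(table):
    """Cramér's V: 0 when the variables are unrelated, 1 when perfectly related."""
    n = sum(sum(row) for row in table)
    smaller_dimension = min(len(table), len(table[0]))
    return sqrt(chi_square_statistic(table) / (n * (smaller_dimension - 1)))


# Made-up counts. Rows: Obama, Romney.
# Columns: Northeast, Midwest, South, West.
vote_by_region = [[60, 50, 45, 55], [40, 50, 65, 45]]
print(round(cramers_v(vote_by_region), 3))
```

Unlike gamma and tau, Cramér's V has no sign, because the categories of a nominal variable have no order for a relationship to run along.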

The great value of statistics is that they can summarize in a single number an entire table with many cells and percentages. By doing so, however, one loses some detailed information about the relationship between the two variables.

Remember that measures of association don't really tell you any more than do the percentages in a table; they just do it more conveniently. Also, you can calculate any statistic you want on any contingency table you construct. Whether the statistic has any meaning depends on the thought that goes into the table before you construct it.

## Recoding Variables

Variables in this dataset have been coded into what we think are the most logical categories. There may be times, however, when you might want to change the coding. Recoding variables may be desirable when:

• There are so many categories of one of the variables in a table that there are too few cases per cell for meaningful interpretation. With the large number of cases in the 2012 ANES, this is less of a concern than it was in past years;

• You think that there is some meaningful reason to combine categories in a variable. For example, you think the real distinction in education is whether somebody has a college degree or not instead of the six categories the variable has in the dataset.

Some important variables in this dataset have a larger number of categories — party identification (7), all the ideology scales (7), the candidate like-dislike scales (11), some other attitudinal scales (7), family income (6), and age (7).

It is important to remember that recoding a variable by combining categories shifts respondents into fewer, broader categories, so that some or all of the new categories contain more respondents than the original ones did.

Some rules-of-thumb for recoding are:

• Recode variables to emphasize the important distinctions in the data.

• Relabel recoded variables immediately after creating them. It is easy to forget what you wanted to do with a new variable, and any tables you generate with unlabeled variables will not be labeled either.

• Check the marginals for your newly recoded variable — see if they make sense when compared to the marginals for the original, unrecoded variable.

• Remember that the recoding of any variable is only in effect for the table that you generate. The recoding does not remain after the table has been generated, so if you want to use the recoded version of a variable for another table, you will have to recode it again.
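The rules of thumb above can be sketched outside SDA as follows. The example collapses a six-category education variable into "college degree or not". The numeric codes, the cutoff between codes 4 and 5, and the respondents are all assumptions made for illustration; they are not the dataset's actual coding.

```python
from collections import Counter

# Hypothetical mapping: original codes 1-4 (no degree) become 1,
# original codes 5-6 (degree) become 2.
RECODE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2}

# Label the new categories right away so later tables are readable.
NEW_LABELS = {1: "No college degree", 2: "College degree"}


def recode(values, mapping):
    """Return a new list with each original code replaced by its new code."""
    return [mapping[v] for v in values]


# Made-up responses on the original six-category variable.
education = [1, 2, 2, 3, 4, 4, 5, 5, 6, 3]
recoded = recode(education, RECODE)

# Check the marginals: each new category should hold exactly the
# respondents from the original categories that were combined into it.
original_marginals = Counter(education)
new_marginals = Counter(recoded)
for new_code, label in NEW_LABELS.items():
    expected = sum(n for old, n in original_marginals.items() if RECODE[old] == new_code)
    print(label, new_marginals[new_code], "expected", expected)
```

Comparing the new marginals with the sums of the original ones is a quick way to catch a category that was left out of the mapping or assigned to the wrong group.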

You will learn much more about recoding variables in later sections of this instructional package.