# Data Analysis

Media reports of polls describe behavior or opinion, such as the proportion of people who favor some proposal or who feel a certain way about an issue. These reports deal with the "what" questions of social science research: what does the American public think about abortion? The survey research described on this website is more concerned with "why" questions: why did some people vote for Barack Obama in 2008 while others voted for John McCain? To answer "why" questions, we need to examine relationships between variables.

## A. Variables

A variable is something that varies, meaning that it can take on a number of values. Individual characteristics, such as age, income, or education, are variables because different people have different values or scores on these characteristics. We can first distinguish between categorical and interval-level data. Interval-level variables are continuous, meaning that each value of the variable is one equal increment larger than the previous and one smaller than the next value. Age, if measured in years, is a good example; each increment is one year. Typically with interval or continuous variables, there are a relatively large number of values. Categorical variables have a limited number of values. Gender is an example of a categorical variable. All of the variables included in this instructional package dataset are categorical. Variables that could have been measured as interval-level variables, such as age or income, have instead been made into categorical variables by creating categories that define specific ranges for the variables (e.g., for age, the categories are 18-24 years, 25-34 years, and so on).
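As a rough illustration of how an interval-level variable can be turned into a categorical one, the sketch below assigns ages in years to ranges. Only the first two ranges (18-24 and 25-34) come from the text above; the remaining cut points, the labels, and the function name `age_category` are assumptions made for the example and may not match the dataset.

```python
import bisect

# Lower bounds of every age category after the first. Only "18-24" and "25-34"
# come from the text; the remaining cut points are assumed for illustration.
LOWER_BOUNDS = [25, 35, 45, 55, 65]
LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65 and over"]

def age_category(age_in_years):
    """Return the label of the age range that contains age_in_years."""
    if age_in_years < 18:
        raise ValueError("survey respondents are assumed to be 18 or older")
    # bisect_right counts how many lower bounds are <= the age,
    # which is exactly the index of the matching label.
    return LABELS[bisect.bisect_right(LOWER_BOUNDS, age_in_years)]

ages = [18, 24, 25, 34, 47, 80]  # made-up ages, not survey data
categories = [age_category(age) for age in ages]
print(categories)
```

Note that the categorical version keeps the order of the ranges but discards the exact age, which is why the recoded variable is treated as ordinal rather than interval-level.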

Variables can be measured on a number of different types of scales. Categorical variables can be measured either on nominal or ordinal scales. Nominal scales are the simplest--the categories are unordered. A good example is religion (V179 in the dataset). This variable is coded in the following manner:

1. Mainline Protestant

2. Evangelical Protestant

3. Catholic

4. Other Christian

5. Jewish

6. Other

7. None

9. N/A

This coding of religion could be changed. For example, Catholics could be coded "1," Mainline Protestants "2," etc., without disturbing the underlying meaning of the variable.

Ordinal-level scales have an underlying order–a scale that ranges from 'strongly agree' to 'strongly disagree' is a good example. Consider the dataset question on privatization of social security (V086). This is an ordinal-level variable coded in the following manner:

1. Favor strongly

2. Favor

3. Neither favor nor oppose

4. Oppose

5. Oppose strongly

9. NA

You can see that there is an order to the values here--they range from strong support to strong opposition. Changing the ordering of the values for this variable would therefore certainly change the underlying meaning of the variable.

Interval level scales have an underlying order in which the intervals are the same. For example, if education were measured by the number of years of education, then a score of "15" would mean that the respondent had 15 years of formal education. Here, not only does the order of the values matter, but we can assume that each year of education is the same--each year adds one increment to the pre-existing education level.

## B. Relationships between variables

Two variables are related to each other when certain values of one variable are likely to be associated with certain values for the other variable. Conversely, two variables are unrelated when the values of one variable are equally likely to be associated with any of the values for the other variable. For example, if we say that education and turnout (i.e., whether one votes) are related, this could mean that more educated people are more likely to vote (which is what we would expect), or that more educated people are less likely to vote. Either way, the two variables would be related, for values on one variable would be linked to values on the other variable. Naturally, it would be far more informative to state how education and turnout are related, rather than just state the simple fact that they are related, and this should be done whenever possible.

When speaking about the relationship between two variables, the terms independent variable and dependent variable are commonly used. The independent variable can be considered to be the "cause" and the dependent variable the "effect." In other words, the independent variable affects or influences the dependent variable. A common research procedure is to start with some dependent variable and then to identify some independent variables that are strongly related to the dependent variable. In this way the dependent variable is explained, at least in the sense that some of the factors that influence it are identified.

## C. Contingency tables

A common procedure to examine the relationship between two variables in a survey is to use a contingency table. A contingency table presents the cross-tabulation between two variables. An example of a contingency table is the cross-tabulation between party identification and presidential vote, omitting those who voted for somebody other than the two major party candidates.

In interpreting contingency tables:

- Percentage by the independent variable. Since party identification is the independent variable when it is cross-tabulated with presidential vote, the columns under each category of party identification add to 100%;
- Compare the distribution of the dependent variable across the categories of the independent variable;
- Look for trends in the percentages. In the cross-tabulation of party identification and presidential vote, the percentage of respondents voting for McCain goes up dramatically as we move from the category of "Strong Democrat" to the category of "Strong Republican;"
- Don't treat percentages as exact reflections of the population–they are approximations;
- Look for sizeable differences in percentages–small differences can be the result of random error;
- Consider the total number of respondents in each column. Be cautious in interpreting percentages if there are fewer than 50 respondents in a column;
- Look for significant differences!
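The first two guidelines above (percentage by the independent variable, then compare across its categories) can be sketched in a few lines of Python. The respondents below are made up for illustration and are not taken from the dataset; the three party categories and the function name `column_percentages` are likewise assumptions.

```python
from collections import Counter

# Made-up (party identification, presidential vote) pairs; not real survey data.
respondents = (
    [("Democrat", "Obama")] * 45 + [("Democrat", "McCain")] * 5
    + [("Independent", "Obama")] * 26 + [("Independent", "McCain")] * 24
    + [("Republican", "Obama")] * 4 + [("Republican", "McCain")] * 46
)

def column_percentages(pairs):
    """Percentage the dependent variable within each category of the independent variable."""
    cell_counts = Counter(pairs)
    column_totals = Counter(independent for independent, _ in pairs)
    table = {}
    for (independent, dependent), count in cell_counts.items():
        column = table.setdefault(independent, {})
        column[dependent] = 100.0 * count / column_totals[independent]
    return table, column_totals

percentages, totals = column_percentages(respondents)
for party in ("Democrat", "Independent", "Republican"):
    column = percentages[party]
    # Each column sums to 100%; the column N is shown so small columns stand out.
    print(f"{party:<12} Obama {column['Obama']:5.1f}%  "
          f"McCain {column['McCain']:5.1f}%  (N = {totals[party]})")
```

Printing the column total next to the percentages makes it easy to apply the caution about columns with few respondents.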

## D. Statistics

Statisticians make a distinction between measures of significance and measures of association. A measure of significance tells whether the relationship between two variables might be the result of chance alone. A measure of association tells how strong the relationship is between two variables. Chi Square (χ²) is a common measure of significance; it tests the hypothesis that there is no relationship between the columns and rows in a contingency table. Chi Square is reported as both a number and a probability; if the probability is higher than about .05, it generally means that we cannot reject the hypothesis that the two variables are unrelated. If the probability is lower than .05, it means that, if the two variables were truly unrelated, there would be less than a 5 percent chance of a relationship that strong or stronger occurring by chance alone, which usually allows us to reject the hypothesis that the variables are unrelated.
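To make the chi-square calculation concrete, here is a minimal sketch using only the Python standard library. The counts are made up, the function names are hypothetical, and the probability is computed only for a 2 x 2 table (one degree of freedom), where it has a simple closed form; larger tables would need a statistics package. The sketch also assumes no row or column of the table is empty.

```python
import math

def chi_square(observed):
    """Return the Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, df

def p_value_df1(statistic):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

table = [[30, 10], [20, 40]]  # made-up counts, not survey data
statistic, df = chi_square(table)
p = p_value_df1(statistic)
print(f"chi-square = {statistic:.3f}, df = {df}, p = {p:.5f}")
print("reject 'unrelated'" if p < 0.05 else "cannot reject 'unrelated'")
```

A table whose rows have identical percentage distributions, such as `[[20, 20], [30, 30]]`, yields a statistic of zero and a probability of 1.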

## E. Measures of association

Measures of association share some general properties–they have a value of zero when two variables are unrelated and rise to a value of 1.0 when the two variables have a perfect relationship. Which measure of association you use depends in part on the scales on which the variables are measured.

Commonly used statistics for ordinal variables are gamma and Kendall's tau. Both gamma and tau can have positive or negative signs. A negative sign means the variables are inversely related, meaning that as one goes up the other goes down; a positive sign means that the two variables rise and fall together. Thus, for gamma and tau, a value of -1.0 is just as perfect a relationship as a value of 1.0. Another statistic appropriate for ordinal data is Somers' d.
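As an illustration of how the sign and size of gamma arise, the sketch below computes Goodman and Kruskal's gamma from a table of counts by comparing concordant pairs (ordered the same way on both variables) with discordant pairs (ordered in opposite ways). It assumes the rows and columns are both listed from lowest to highest category; the counts and the function name are made up for the example.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (concordant - discordant) / (concordant + discordant)."""
    concordant = 0
    discordant = 0
    n_rows = len(table)
    n_cols = len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair every case in cell (i, j) with cases in lower rows.
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:      # higher on both variables
                        concordant += table[i][j] * table[k][l]
                    elif l < j:    # higher on one, lower on the other
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)

table = [[20, 10], [5, 25]]  # made-up counts, not survey data
print(round(goodman_kruskal_gamma(table), 3))
```

Pairs tied on either variable are ignored by gamma, which is one reason it tends to be larger in absolute value than tau for the same table.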

For tables with nominal variables, either Cramér's V or the contingency coefficient is an appropriate measure of association, but SDA/DAS does not report these statistics for contingency tables. This, however, is not a major problem since only a few variables in this instructional package dataset are nominal; the vast majority are ordinal.
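Both nominal measures can be computed by hand from the chi-square statistic and the table size, so a short sketch may be useful when the software does not report them. The version below uses the standard formulas, V = sqrt(χ² / (N × (k − 1))) where k is the smaller of the number of rows and columns, and C = sqrt(χ² / (χ² + N)). The counts and function names are made up for illustration, and the sketch assumes no empty rows or columns.

```python
import math

def chi_square_statistic(observed):
    """Pearson chi-square statistic and total N for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = sum(
        (count - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(observed)
        for j, count in enumerate(row)
    )
    return statistic, n

def cramers_v(observed):
    statistic, n = chi_square_statistic(observed)
    smaller_dimension = min(len(observed), len(observed[0]))
    return math.sqrt(statistic / (n * (smaller_dimension - 1)))

def contingency_coefficient(observed):
    statistic, n = chi_square_statistic(observed)
    return math.sqrt(statistic / (statistic + n))

# Made-up counts: three unordered categories (rows) by two vote choices (columns).
table = [[30, 20], [25, 25], [15, 35]]
print(round(cramers_v(table), 3), round(contingency_coefficient(table), 3))
```

Because nominal categories have no order, neither statistic carries a sign; both run from zero upward.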

The great value of statistics is that they can summarize in a single number an entire table with many cells and percentages. By doing so, however, one loses some detailed information about the relationship between the two variables.

Remember that measures of association don't really tell you any more than do the percentages in a table; they just do it more conveniently. Also, you can calculate any statistic you want on any contingency table you construct. Whether the statistic has any meaning depends on the thought that goes into the table before you construct it.

## F. Recoding Variables

Variables in this dataset have been coded into what we think are the most logical categories. There may be times, however, when you might want to change the coding. Recoding variables may be desirable when:

- there are so many categories of one of the variables in a table that there are too few cases per cell for meaningful interpretation;
- you think that there is some meaningful reason to combine categories in a variable–for example, you think the real distinction in education is whether somebody has a college degree or not instead of the six categories the variable has in the dataset.

Some important variables in this dataset have a large number of categories: party identification (7), all the feeling thermometer scores (5), all the ideology scales (7), religion (7), age (6), and others.

It is important to remember that recoding variables shifts respondents into different categories, resulting in more respondents in some or all of the categories.

Some rules-of-thumb for recoding are:

- What do you think the important distinctions are in the data? Recode variables to emphasize these distinctions;
- Create new variables and recode those rather than recoding existing variables. If you wrongly recode an existing variable, you will need to start the analysis anew; if you wrongly recode a newly created variable, you can delete it and create another;
- Rename and re-label new variables immediately after creating them; it is easy to forget what you wanted to do with a new variable and any tables you generate with unlabeled variables will not be labeled either;
- Check the marginals for your newly created variable; you might also want to run a crosstabulation of the new variable against the unrecoded variable to see which respondents have gone into which categories.
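The rules of thumb above can be sketched outside of any particular statistics package. In the example below, the respondents are made up, the seven-point coding (1 for "Strong Democrat" through 7 for "Strong Republican") is assumed, and the three-category collapsing scheme and the variable name `party_id_3cat` are chosen only for illustration.

```python
from collections import Counter

# Made-up respondents; codes assumed to run from 1 (Strong Democrat)
# to 7 (Strong Republican).
respondents = [{"party_id": code} for code in [1, 1, 2, 3, 4, 4, 5, 6, 7, 7]]

# Assumed collapsing scheme, chosen only for illustration.
RECODE_MAP = {
    1: "Democrat", 2: "Democrat",
    3: "Independent", 4: "Independent", 5: "Independent",
    6: "Republican", 7: "Republican",
}

for person in respondents:
    # Write to a new, clearly named variable; the original stays untouched.
    person["party_id_3cat"] = RECODE_MAP[person["party_id"]]

# Marginals of the new variable.
marginals = Counter(person["party_id_3cat"] for person in respondents)
print(marginals)

# Crosstabulation of old against new, to see where each code went.
check = Counter((person["party_id"], person["party_id_3cat"]) for person in respondents)
for (old_code, new_label), count in sorted(check.items()):
    print(old_code, "->", new_label, count)
```

If the mapping turns out to be wrong, only `party_id_3cat` needs to be dropped and rebuilt, since `party_id` was never modified.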

You will learn much more about recoding variables in later sections of this instructional package.