Learn: A Tutorial on Running Frequencies for a Particular Year and Running Summary Statistics

Because the DDB dataset includes annual surveys from 1975 to 1998, running a frequency returns the results for every year's survey that included that particular variable, for a possible total of 84,989 individuals. Note that it will not always be clear which years are contained in the results, because not all the variables were included in every year's survey. For example, the question about using automatic teller machines was not asked in the earlier DDB surveys because automatic tellers did not exist yet.

Because of this confusion, and because each year is its own separate sample, when you do applied statistical analysis with data collected over many years, such as the DDB and the GSS, you should usually analyze the variables for one particular year. The exception is when you are building timelines and need to look at surveys across a number of years; you will learn more about this in Exercise 2.

Let's look at an example of how to run a frequency on club meeting attendance (clubmeet) for 1987.

Open the DDB dataset, select "Frequencies or crosstabulation (with charts)," and select the "Start" button. An analysis window like the one below will appear. Type "clubmeet" next to "Row," and next to "Selection Filter(s)" type "year(1987)". In addition, under "CHART OPTIONS," select the "Bar Chart" option next to "Type of chart."

screenshot

After you "Run the Table" you will get the following results. Note that the total number of individuals for the 1987 survey was 4,003 and that 1,889 individuals, or 47.2 percent of the sample, did not attend a club meeting in the last 12 months. Also note that there are many more "codes" in this frequency distribution than in the one you ran for serving on a group committee in the last tutorial. Instead of a simple "yes" or "no" answer, this variable asks a person to provide more numerical detail about their activity. This type of ordered variable is an example of the ordinal level of measurement. Be sure to conduct a thorough investigation and consult your statistics/methods book about levels of measurement.

screenshot
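The filter-then-count logic behind this table can be sketched in a few lines of Python. This is only an illustration, not how the online analysis tool works internally: the `records` list and the `frequency_for_year` helper are hypothetical, and the sample rows are invented rather than taken from the DDB. The only figures from the tutorial are the 1,889 and 4,003 used in the last line.

```python
from collections import Counter

# Hypothetical (year, clubmeet code) rows; these are NOT actual DDB data.
records = [
    (1986, 1), (1987, 1), (1987, 2), (1987, 1), (1987, 4),
    (1988, 3), (1987, 2), (1987, 7),
]

def frequency_for_year(rows, year):
    """Return {code: (count, percent)} using only the rows for one survey year."""
    codes = [code for y, code in rows if y == year]  # plays the role of the year(1987) filter
    total = len(codes)
    counts = Counter(codes)
    return {code: (n, round(100 * n / total, 1)) for code, n in sorted(counts.items())}

table = frequency_for_year(records, 1987)
for code, (count, percent) in table.items():
    print(code, count, percent)

# Checking the percentage reported in the tutorial: 1,889 of 4,003 respondents.
none_percent = round(100 * 1889 / 4003, 1)
print(none_percent)
```

Dividing each count by the total for the selected year, rather than by all 84,989 individuals, is what makes the percentages describe the 1987 sample alone.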

So far you have acquired several investigative skills:

  • Searching Codebooks

  • Running and Interpreting Frequencies

To wrap up your investigation in Exercise 1, you still need the following:

  • Running and Interpreting Summary Statistics

Let's look at one more example with club meeting attendance (clubmeet) for 1987.

Open the DDB dataset and run the frequency for "clubmeet" in 1987 as you did above, except that before you select "Run the Table," be sure to check "Statistics" under "TABLE OPTIONS." Do not forget to type "year(1987)" next to "Selection Filter(s)" and to select the "Bar Chart" option. Now you can "Run the Table" to produce the following:

screenshot

Let's interpret these results:

Your frequency distribution can be interpreted much as you did above. In these results, however, you also have some additional information under "Summary Statistics." There are fourteen different statistics here. Not to worry; you do not have to know about all of them at the introductory level. First, let's focus on the mode, median, and mean, or "measures of central tendency." Again, be a thorough investigator and consult your statistics/methods book.

The mean of 2.46 is slightly higher than the median of 2.00 (the mean is pulled up by the very active people who attend a lot of meetings). But the ordinal level of measurement is not very precise, so the median, the whole number "2," makes more sense to interpret here (why?). The median of 2.00 represents the category of "1-4 times." This is the middle, or center, of the frequency distribution. What does this tell us? Basically, not many people in 1987 were going to many club meetings, because nearly half of the people fall below the median, in the category of "None." In fact, the mode is "1," attending no meetings in the past 12 months. The fact that 47.2 percent, or nearly half, of the people surveyed did not attend a club meeting begins to build support for Putnam's claim in Chapter 3: although the number of associations may have been on the rise and large numbers of people claimed membership in these associations, not many people are "active" participants. Keep in mind, however, that your statistics here are only for a particular year. What has happened over time? You will get to explore this with timelines in Exercise 2.
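To see how the three measures of central tendency relate to one another, here is a minimal Python sketch using the standard `statistics` module. The `responses` list is a hypothetical set of coded answers invented for illustration (it is not DDB data), chosen so that, as in the 1987 results, the mean sits above the median and the mode is the lowest code.

```python
import statistics

# Hypothetical coded responses (1 = "None", 2 = "1-4 times"); NOT actual DDB data.
responses = [1, 1, 1, 1, 2, 2, 2, 3, 4, 7]

mode = statistics.mode(responses)      # the most frequent code
median = statistics.median(responses)  # the middle of the sorted codes
mean = statistics.mean(responses)      # the arithmetic average of the codes

print(mode, median, mean)
```

In this made-up sample, the single respondent coded 7 raises the mean without moving the median, which is the same pull described above for the very active meeting-goers.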

Now let's focus on three other introductory summary statistics: the range, variance, and standard deviation, or "measures of variability/dispersion." The range for this variable is 6.00. This is the highest code minus the lowest code, or 7.00 - 1.00 = 6.00, and it tells us that there were a variety of codes for this variable. Therefore, because of how the question was asked, there is more room for variability. You might understand this better if you think of the analogy of answering true/false questions as opposed to multiple choice questions on an exam. An instructor is going to get more variability in test scores on a multiple choice exam.

The variance and standard deviation can be interpreted together, since the standard deviation is simply the square root of the variance. For interpreting the "clubmeet" frequency distribution, the variance and standard deviation are not really all that applicable, because this variable is measured at the ordinal level. Variance and standard deviation are used, in the strict sense, for interval level variables. However, sometimes it is okay to "bend the rules," so let's look at the standard deviation in this situation, because there are 7 categories and more variability is possible, just as with a multiple choice exam.
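The relationship among the range, variance, and standard deviation can be sketched as follows. The `responses` list is again hypothetical, and the code assumes the sample formulas (dividing by n - 1) that Python's `statistics` module uses; the tutorial does not say which formula the analysis tool applies. The final line squares the reported standard deviation of 1.81 to show the variance it implies.

```python
import math
import statistics

# Hypothetical coded responses on a 1-7 scale; NOT actual DDB data.
responses = [1, 1, 1, 1, 2, 2, 2, 3, 4, 7]

value_range = max(responses) - min(responses)  # highest code minus lowest code
variance = statistics.variance(responses)      # sample variance (assumes n - 1 in the denominator)
std_dev = statistics.stdev(responses)          # sample standard deviation

# The standard deviation is the square root of the variance.
print(value_range, variance, std_dev, math.isclose(std_dev, math.sqrt(variance)))

# Variance implied by the standard deviation reported in the tutorial (1.81).
implied_variance = round(1.81 ** 2, 2)
print(implied_variance)
```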

A standard deviation of 1.81 is very close to 2.00. So the best way to interpret it is to say that, on average, most people fall within 2.00 categories above or below the mean of 2.46 (round to 2.00). This means that most people are between the codes of "1" and "4," that is, between not attending meetings and attending 9-11 times. The response of 9-11 times is 2.00 above the mean category of 1-4 times, and "None" is as far below the mean as is possible. Now you can see why it is so important to have a codebook and to know what the numbers represent, especially in the DDB data, where a "1" means "None." This also shows why it is so important to be meticulous when conducting an investigation. A crime scene investigator has to be meticulous as well, since tainted evidence does not hold up well in the courtroom. Now return to Exercise 1 and practice your ability to be meticulous by running summary statistics on club meeting attendance for a year other than 1987.
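For reference, the rounding used in the standard deviation interpretation above can be written out as a short Python sketch. The mean, standard deviation, code limits, and the two labels come from the tutorial; the variable names and the clipping step are illustrative assumptions about how to keep the interval inside the valid codes.

```python
mean, std_dev = 2.46, 1.81     # summary statistics reported for clubmeet in 1987
min_code, max_code = 1, 7      # lowest and highest codes (range of 6.00)

# Only the labels the tutorial states explicitly; see the codebook for the rest.
labels = {1: "None", 4: "9-11 times"}

center = round(mean)     # 2.46 rounds to code 2
spread = round(std_dev)  # 1.81 rounds to 2 categories

# One standard deviation either side of the mean, clipped to codes that exist.
low = max(min_code, center - spread)
high = min(max_code, center + spread)

print(low, labels[low], high, labels[high])
```

Because there is no code below 1, the lower end stops at "None," which is why the tutorial describes "None" as being as far below the mean as is possible.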