Investigating Bias and Measurement Validity

Goal

The goal of this exercise is to use triangulation to investigate bias and measurement validity in parents' and teachers' evaluations of student health and learning. Crosstabulations and correlations will be used.

Concept

Validity refers to how well a question, or a piece of research more generally, reflects the reality it claims to represent. Together with reliability, it is a central concept in social science research and an important concern for researchers: if we are not measuring what we think we are measuring, we cannot draw valid conclusions about the phenomenon we wish to understand.

Validity may be affected by biases such as response bias and instrument bias. Response bias occurs when respondents answer questions on some basis other than the question itself. For example, a respondent might choose the answer that is most socially desirable (social desirability bias) or most extreme (e.g., "strongly agree"; extremity bias). A respondent may also lack an accurate recollection of past events (recall or memory bias) or give the same answer to each question in a series (response set bias). Poorly worded, ambiguous, or unnecessarily complex questions are a form of instrument bias that may affect how respondents understand and answer questions. This, in turn, can affect the validity of the measures.

Problems with measurement validity can be uncovered, and their impact diminished, by using triangulation. Triangulation (or multi-method research design) refers to the use of more than one research method (or more than one measure within a single study) to investigate a social phenomenon. For example, researchers may ask subjects to rate their own health in self-reported questionnaires, as well as collect data using other, perhaps more "objective," measures of health, such as nutritional intake, cholesterol level, or blood pressure.

Triangulation therefore has three main advantages over single-source research: it can provide a fuller account of the phenomenon, offset the weaknesses of any single method, and reduce the impact of bias by allowing a degree of cross-checking.

In this exercise, we use triangulation to explore how different kinds of respondent and instrument biases affect measurement validity of education- and health-related items.

Examples of questions to ask when investigating measurement validity include:

  • What are the goals of measurement?
  • What biases are likely to affect the validity of the measure?
  • Does the measure correlate with other measures of the same purported construct?
  • Does the measure correlate with criteria with which it is supposed to correlate?

Data for this exercise come from the Multistate Study of Pre-Kindergarten, 2001-2003, Main Child Level Public-Use Version, which was collected by the National Center for Early Development and Learning. Using a mixed-method research design, this study attempts to describe variations in the experiences of children enrolled in pre-kindergarten and kindergarten programs and examines whether these experiences are related to children's outcomes in early elementary school. The study was conducted in six states: California, Illinois, New York, Ohio, Kentucky, and Georgia. Within each state, a random sample of 40 centers/schools was selected. One classroom in each center/school was selected at random for observation, and four children in each classroom were selected for individual assessment. Starting in 2001, the children were followed from the beginning of pre-k through the end of kindergarten; they were assessed using a battery of individual instruments measuring language, literacy, mathematics, and related concept development, as well as social competence. Data were also gathered from administrators/principals, teachers, and parents regarding program services (e.g., healthcare, meals, and transportation), program curriculum, teacher training and education, teachers' opinions of child development, and their instructional practices on subjects such as language, literacy, mathematics concepts, and social-emotional competencies. In five of the six states, families were also visited in their homes to obtain information about family life; family educational practices and beliefs about the comparative roles of school and family in educating children; the nature and quality of the home-school relationship; and parents' ratings of their children's psychological development and social competence. Demographic information collected includes race, gender, family income, and mother's education level.

By using a mixed-method research design (computer-assisted self-interviews (CASI), coded on-site observations, coded video observations, cognitive assessment tests, face-to-face interviews, self-enumerated questionnaires) and collecting data from multiple standpoints (child assessments, teachers, administrators, parents), this study offers a rich look into the experiences of children enrolled in pre-kindergarten and kindergarten programs and provides examples for studying measurement validity.

All data for this exercise come from the pre-K Fall subsample. The following variables will be used:

  • Parent report of child health (CHHLTHFP)
  • Teacher report of child health (CHHLTHTP)
  • Teacher report of child's knowledge of alphabet (CSLANGPF3)
  • Naming letters (raw score) (LETTEREPF)
  • Teacher report of child's writing skills (CSLANGPF7)
  • Percent of name written legibly (EWTNMEPF)
  • How often parent speaks with teacher (TCHTALKK)
  • Family involvement: parents called teacher (PARCALLP)

For ease of interpretation, it is sometimes necessary or helpful to recode variables. (Note: the online analysis package used here also requires recoding variables simply to modify or add labels.) The following variables were recoded; a code sketch of these recodes appears after the list:

  • We renamed CHHLTHTP ("Teacher report of child health") and CHHLTHFP ("Parent report of child health") to make the meaning of these variables more readily apparent and to exclude invalid answers. The new variables are KIDHEALTH_TEACH and KIDHEALTH_PAR.
  • CSLANGPF7 ("Teacher report of child's writing skills") and CSLANGPF3 ("Teacher report of child's knowledge of alphabet") were recoded to exclude invalid answers. The new variables are CANWRITE and KNOWSALPHA.
  • The variables LETTEREPF ("Naming letters") and EWTNMEPF ("Percent of name written legibly") each contain too many answer categories for useful analysis. We collapsed answers to LETTEREPF into five categories, coded "0" (0 letters), "1" (1-8 letters), "2" (9-17 letters), "3" (18-25 letters), and "4" (26 letters). The new variable is LETTERS. Answers to EWTNMEPF were collapsed into four categories, coded "0" (0%), "1" (less than 50%), "2" (50-99.9%), and "3" (100%). The new variable is WRITING.
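
These recodes were performed in the guide's online analysis tool, but the same logic is easy to express in code. The following is a minimal pandas sketch, not the guide's own software: the file name is hypothetical, and the assumption that invalid answers are stored as negative codes is ours, so the study codebook should be checked before reusing it.

    import pandas as pd

    # Hypothetical file name; check the actual public-use file and codebook.
    df = pd.read_csv("prek_fall_2001.csv")

    # Rename the health and literacy ratings, masking invalid answers
    # (assumed here to be negative codes) as missing.
    df["KIDHEALTH_PAR"] = df["CHHLTHFP"].where(df["CHHLTHFP"] >= 0)
    df["KIDHEALTH_TEACH"] = df["CHHLTHTP"].where(df["CHHLTHTP"] >= 0)
    df["CANWRITE"] = df["CSLANGPF7"].where(df["CSLANGPF7"] >= 0)
    df["KNOWSALPHA"] = df["CSLANGPF3"].where(df["CSLANGPF3"] >= 0)

    # Collapse the 0-26 letter count into the five codes listed above.
    df["LETTERS"] = pd.cut(df["LETTEREPF"], bins=[0, 1, 9, 18, 26, 27],
                           right=False, labels=[0, 1, 2, 3, 4])

    # Collapse percent of name written (0-100) into the four codes above.
    df["WRITING"] = pd.cut(df["EWTNMEPF"], bins=[0, 0.001, 50, 100, 100.001],
                           right=False, labels=[0, 1, 2, 3])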


Social scientists' ability to make statements about a given social phenomenon is predicated on the validity of the measures they use to study it. In other words, they need to show that they are truly measuring what they intend and claim to measure. In some cases, this is rather straightforward: a thermometer, for instance, is considered a valid instrument for measuring temperature. But how would you determine whether someone is healthy? You might consider "objective" health markers such as that person's weight, cholesterol level, or blood pressure. If data on such objective markers are not available, you might use more subjective measures, such as self-reported well-being, even though well-being is difficult to measure or observe directly.

Comparing Measurements of Subjective Phenomena: Parent and Teacher Ratings of Child's Health

To illustrate the challenge of establishing valid measures of subjective phenomena, let's compare how the health of a child was rated by her teachers (KIDHEALTH_TEACH) and by her parents (KIDHEALTH_PAR). Ideally, the two measures should match--children whose health is rated "poor" by their parents should also be rated "poor" by their teachers. Would you say that this is the case here? What percentage of children rated in "excellent" health by their parents is rated in "good" health by their teachers? What percentage of children rated in "fair" health by their parents is rated in "good" or "very good" health by the teachers? Are any children rated in "poor" health by their parents? What might explain the discrepancies between the parents' and the teachers' assessments?
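
A crosstab like this one can also be produced outside the online analysis tool. Here is a hedged sketch in pandas, reusing the hypothetical DataFrame df and the recoded variable names from the earlier sketch:

    import pandas as pd

    # Column percentages: within each parent rating, how did teachers
    # rate the same children? Perfect agreement between the two measures
    # would put all of the mass on the diagonal.
    health_tab = pd.crosstab(
        df["KIDHEALTH_TEACH"],   # rows: teacher rating
        df["KIDHEALTH_PAR"],     # columns: parent rating
        normalize="columns",     # proportions within each parent category
    ) * 100

    print(health_tab.round(1))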

Comparing an Objective Measure against a Subjective Measure: Child Literacy and Early Writing

Like health, a child's ability to read and write can be measured both objectively (by assessing the number of letters a child can recall or the number of words in a paragraph that he or she can pronounce correctly) and subjectively (by rating the extent of a child's vocabulary or reading fluency). Here we compare objective and subjective measures of the same concept, literacy.

Take a look at the crosstab of the child's knowledge of the alphabet as assessed by his or her teacher (KNOWSALPHA) against the number of letters the child could recall in the fall 2001 semester (LETTERS). In this case, the teacher's assessment of the child's knowledge of the alphabet is a subjective measure, whereas the number of letters the child knows is easily, and objectively, quantifiable (0-26 letters). What percentage of the children who didn't know any letter was assessed as "not yet" knowing the alphabet by their teachers? What percentage of those who knew 9-17 letters was assessed as "beginning"? What percentage of those who knew all 26 letters was assessed as "proficient"? Would you say that children's knowledge of the alphabet was assessed correctly by the teachers? What might explain some of the discrepancies between the number of letters children could read and the teachers' assessment of their knowledge of the alphabet?

Another way to examine the relationship between these two variables is to create a correlation matrix. A correlation matrix is a useful tool for measuring associations between variables because it shows at a glance how a set of variables relate to each other. A correlation coefficient (called Pearson's r) tells us whether, and how strongly, two variables are related. Correlation coefficients can range from +1.00 to -1.00. A value of zero indicates that there is no association between the two variables. A value greater than zero indicates a positive association: as the value of one variable increases (or decreases), so does the value of the other variable. Conversely, a value less than zero indicates a negative association: as the value of one variable increases, the value of the other variable decreases. Although this is not a hard-and-fast rule, correlation coefficients of 0.60 or more typically indicate a strong degree of association, 0.30 a moderate degree of association, and 0.15 a weak association.
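
To make the definition concrete, here is a minimal sketch of Pearson's r computed directly from its formula and checked against the built-in pandas method; the toy data are invented for illustration:

    import numpy as np
    import pandas as pd

    def pearson_r(x: pd.Series, y: pd.Series) -> float:
        # r = sum((x - mean_x)(y - mean_y))
        #     / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
        xd, yd = x - x.mean(), y - y.mean()
        return float((xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum()))

    # Invented toy data with a clear positive association.
    x = pd.Series([0, 5, 9, 17, 22, 26])
    y = pd.Series([1, 1, 2, 3, 4, 4])

    print(round(pearson_r(x, y), 2))              # close to +1
    print(pd.DataFrame({"x": x, "y": y}).corr())  # 2x2 correlation matrix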

Examine the correlation between KNOWSALPHA and LETTEREPF (the unrecoded version of LETTERS). Note that a correlation matrix is perfectly symmetrical around the diagonal, so one need only analyze half of it. How would you characterize the strength of the relationship between these two variables?

Consider another pair of variables, the teacher's assessment of the child's ability to write (CANWRITE) and the percentage of the child's name that he or she could write legibly (WRITING). According to the results of the crosstab of these two variables, what label were teachers most likely to give the children who could not write any of their name legibly? What label were teachers most likely to give the children who could write 100% of their name correctly? Would you say that children's ability to write was assessed correctly by the teachers? What might explain some of the discrepancies between the extent to which the children could write their name correctly and the teachers' assessment of their ability to write?

As we did earlier, let's run a correlation using CANWRITE and EWTNMEPF (the unrecoded version of WRITING) to see how closely related they are. Is the relationship between these variables stronger or weaker than that between KNOWSALPHA and LETTEREPF?

Comparing Objective Measures: Communication between Parents and Teachers

Obtaining valid data on objective phenomena can also be challenging. Many questions in surveys refer to past events or conditions, and we know that it is often difficult to remember the event and to associate it with the correct time period, which leads to what are called recall errors. We also know that people tend to adjust their answer in light of the expectation that some answers are more socially acceptable than others (social desirability bias).

To illustrate these issues, let's look at parent-teacher contact. In this example, we use TCHTALKK, or the frequency with which the parent reported having spoken with his or her child's teacher during the spring 2002 term, and PARCALLP, the frequency with which the teacher reported that the parent called him or her. Parent-teacher contact is an objective phenomenon: either parents and teachers spoke, or they did not, and the number of times this happened should be easily quantifiable.

Examine the results of the crosstab with TCHTALKK and PARCALLP. What kinds of discrepancies do you observe in communication patterns? For example, examine the column of cases where the teacher indicated that the child's parents never called him or her. Among these cases, in how many instances did parents claim daily or at least weekly communication? Why might these reports differ?

Think about your answers to the application questions before you click through to the interpretation guide for help in answering them.

Comparing Measurements of Subjective Phenomena: Parent and Teacher Ratings of Child's Health

What percentage of children rated in "excellent" health by their parents is rated in "good" health by their teachers? What percentage of children rated in "fair" health by their parents is rated in "good" or "very good" health by the teachers? Are any children rated in "poor" health by their parents? What might explain the discrepancies between the parents' and the teachers' assessments?

Comparing an Objective Measure against a Subjective Measure: Child Literacy and Early Writing

What percentage of the children who didn't know any letter was assessed as "not yet" knowing the alphabet by their teachers? What percentage of those who knew 9-17 letters was assessed as "beginning"? What percentage of those who knew all 26 letters was assessed as "proficient"? Would you say that children's knowledge of the alphabet was assessed correctly by the teachers? What might explain some of the discrepancies between the number of letters children could read and the teachers' assessment of their knowledge of the alphabet?

How would you characterize the strength of the relationship between KNOWSALPHA and LETTEREPF?

What label were teachers most likely to give the children who could not write any part of their name legibly? What label were teachers most likely to give the children who could write 100% of their name legibly? Would you say that children's ability to write was assessed correctly by the teachers? What might explain some of the discrepancies between the percentage of their name the children could write correctly and the teachers' assessment of their ability to write?

Is the relationship between CANWRITE and EWTNMEPF stronger or weaker than that between KNOWSALPHA and LETTEREPF?

Comparing Objective Measures: Communication between Parents and Teachers

What kinds of discrepancies do you observe in communication patterns between parents and teachers? For example, examine the column of cases where the teacher indicated that the child's parents never called him or her. Among these cases, in how many instances did parents claim daily or at least weekly communication? Why might these reports differ?

Interpretation

Things to think about in interpreting the results:

  • The numbers in each cell of the crosstabulation tables show the percentage of people who fall into the overlapping categories, followed by the actual number of people that represents in this sample. The coloring in the tables shows how the observed number in each cell compares to the number expected if there were no association between the two variables (see the sketch after this list). Where they are used, the accompanying bar charts display the patterns visually as well.
  • Correlation matrices show the relative strength and direction of the relationships within a set of variables. The values in the cells are called coefficients (Pearson's r) and the range of values that these coefficients can take is between -1 and +1. The closer the coefficient is to either -1 or +1, the stronger the relationship between the two variables. In social science research, it is rare to see correlation coefficients much above .3 or below -.3 unless we are comparing two measures of the same/similar concept or measuring the same concept at two different times. Therefore, we will consider coefficients with an absolute value of .3 or greater to be indicators of a reasonably strong relationship between the variables.
  • Weights (adjustment factors) are often used to adjust the sample proportions, usually by race, sex, or age, so that they more closely match those of the general population. The analyses used in this guide did not use any weights, which may reduce the generalizability of the findings, but the resulting tables are accurate descriptions of the relationships found between these variables among these respondents.
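
The "expected number if there were no association" mentioned in the first point is the usual independence benchmark: expected cell count = (row total × column total) / grand total. A minimal pandas sketch, again assuming the hypothetical recoded DataFrame df from the earlier examples:

    import pandas as pd

    observed = pd.crosstab(df["KIDHEALTH_TEACH"], df["KIDHEALTH_PAR"])

    # Expected count under independence: row total * column total / N.
    row_tot = observed.sum(axis=1)
    col_tot = observed.sum(axis=0)
    n = observed.to_numpy().sum()

    expected = pd.DataFrame(
        row_tot.to_numpy()[:, None] * col_tot.to_numpy()[None, :] / n,
        index=observed.index, columns=observed.columns,
    )

    # Positive differences mark combinations that occur more often than
    # independence would predict; negative differences, less often.
    print((observed - expected).round(1))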

Reading the results:

Comparing Measurements of Subjective Phenomena: Parent and Teacher Ratings of Child's Health

A quarter (24.6%) of the children rated in "excellent" health by their parents were deemed in "good" health by the teachers. Among the children rated in "fair" health by the parents, 85.7% (50% + 35.7%) were considered in "good" or "very good" health by the teachers--a striking discrepancy! Before jumping to conclusions, however, consider the number of cases in each cell of that column: there are too few cases to be suitable for statistical treatment. We simply cannot say anything about these particular children. No parent indicated that their child was in "poor" health (and, as we just saw, very few--14 out of a sample of 889--rated their child in "fair" health). Comparing parents' and teachers' assessments of the same phenomenon (the child's health) demonstrates the challenge of measuring subjective phenomena. Here one would expect the parents to have the more accurate assessment of the child's health, but social desirability bias (the desire to present oneself in the best possible light) could be affecting these reports. Teachers may not be in the best position to provide accurate assessments of their students' health either. In addition, note that the category labels in the answers are ambiguous, which is likely to affect measurement validity negatively.

Comparing an Objective Measure against a Subjective Measure: Child Literacy and Early Writing

Almost two-thirds (63.1%) of the children who didn't know any letters were assessed by their teachers as "not yet" knowing the alphabet. Among those who knew 9-17 letters, 31.8% were considered "beginning," and 32.1% of those who knew all 26 letters were deemed "proficient." It appears that children were most likely to be assessed correctly if they had little or no knowledge of the alphabet, perhaps because there is little ambiguity here. As they learn more letters, the exact extent of their knowledge may become less clear, and it may become more difficult for teachers to place them in a specific category ("in progress" vs. "intermediate," for example). In addition, the category labels leave room for interpretation: what is the difference between "beginning" and "in progress," or between "in progress" and "intermediate"? In the absence of answer categories whose meaning is clearly defined, teachers may substitute their own understanding of the labels. Consequently, one teacher's definition of, say, "proficiency" may not be the same as another teacher's. This may explain why only about a third of the children who knew all the letters were considered "proficient."

The correlation coefficient for LETTEREPF and KNOWSALPHA is .54, which indicates a strong--though imperfect--correlation between the children's knowledge of the alphabet and the teachers' assessments of this knowledge. If KNOWSALPHA captured the children's knowledge of the alphabet perfectly, this correlation coefficient would be 1.

56.8% of the children who could write less than 50% of their name legibly were assessed as "not yet" able to write by their teachers. Children who could write 100% of their name legibly were most likely to be regarded as "in progress," with only 18.1% considered "proficient." As with the crosstab of LETTERS and KNOWSALPHA, children were most likely to be assessed correctly if they had not yet learned to write their name. As we saw in earlier analyses, it is likely that response and instrument biases affect the validity of these measures. In addition, note that relying on a child's ability to write his or her own name is a problematic way to assess writing ability: a child named Paul who can write his name correctly may not be more literate than a child named Christopher who can write only half of his name correctly, and it is possible that teachers take this into consideration in their assessments.

The relationship between the teachers' assessment of the children's ability to write (CANWRITE) and the extent to which children were able to write their name legibly (EWTNMEPF) is weaker (.41) than that between KNOWSALPHA and LETTEREPF (.54). While this is still considered a reasonably strong relationship, it confirms that researchers should be wary of the validity of CANWRITE (the teachers' assessments) and EWTNMEPF (name writing) as measures of the children's actual ability to write.

Comparing Objective Measures: Communication between Parents and Teachers

There are important discrepancies in reported communications between parents and teachers. For example, in some cases teachers indicated that a child's parents had never called them, yet 25.4% of those parents said that they had spoken with the teacher "a few times" during the year, 28.9% indicated that they had spoken with the teacher "at least once a week," and 11.4% said "daily." No parent agreed with the teacher that they had never spoken.

There are several possible explanations for these discrepancies. People usually find it difficult to remember or accurately retrieve incidents that happened in the past because memory is imperfect and therefore unreliable (typically, the longer the interval between the event and the time of assessment, the higher the probability of incorrect recall). This is known as recall or memory bias, and it presents an obvious threat to measurement validity. Because they have many students, teachers may not remember accurately how many times they communicated with which parents over the course of a year. Similarly, parents may not have an accurate recollection of contacts with different teachers, especially if they have more than one child.

Social desirability bias may be a second reason for the discrepancies. Social desirability bias refers to the fact that in self-reports people will often answer inaccurately in order to present themselves in the best possible light. Parents, for instance, may not want to admit that they never talked with their child's teacher, or may inflate the number of parent-teacher contacts in order to appear more involved than they really were.

Finally, it is important to note that to minimize instrument bias, question wording and answer categories should be exactly the same in both questions (TCHTALKK and PARCALLP). The fact that they are not makes it difficult to compare answers directly.

Summary

The goal of this exercise was to use triangulation to investigate bias and measurement validity in parents' and teachers' evaluations of student health and learning. Objective and subjective measures were considered. In a comparison of subjective measures of child health, parents systematically rated their children's health higher than did teachers. For many children, the teacher's rating differed considerably from the parent's. Curiously, no parent rated his or her child's health as "poor."

Two pairs of variables were used to illustrate differences in objective and subjective measures. In one example, letter recall and teacher assessment of literacy were compared as measures of child literacy. The two measures were strongly related, although there were several cases where extent of letter recall and teacher assessment of literacy diverged considerably. In another example, the percentage of the child's name that he or she could write was compared against the teacher's assessment of the child's ability to write. The relationship between these two variables was less strong, as there were many cases where the two measures diverged.

Finally, two measures of an objective phenomenon--the frequency with which the teacher and parent communicated--were compared. It was discovered that, perhaps due to social desirability bias, parents appeared to over-report the frequency with which they communicated with their child's teacher.

Further research might explore how to construct valid measures of health- and education-related items that minimize response and instrument bias.

CITATION: Inter-university Consortium for Political and Social Research. Investigating Bias and Measurement Validity: A Data-Driven Learning Guide. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2013-07-26. DOI: https://doi.org/10.3886/validity

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 United States License.