Interview: Challenging Statistical Significance
Economist Deirdre McCloskey argues overuse of method detrimental to sciences, as well as people
By Mark Thompson-Kolar
Webinar Title: Almost Never Use Tests of Statistical Significance: Use Instead Oomph, the Scrutiny of Substantive Differences
When: 2:00-3:00 p.m. Monday, October 28, 2013
Webinar link: https://connect.umms.med.umich.edu/mccloskey/
Description: Existence, arbitrary statistical significance, philosophical possibilities uncalibrated to the sizes of important effects in the world are useless for science. Yet in medical science, in population biology, in much of sociology, political science, psychology, and economics, in parts of literary study, there reigns the spirit of the Mathematics or Philosophy Departments (appropriate in their own fields of absolutes). The result has been a catastrophe for such sciences, or former sciences. The solution is simple: get back to seeking oomph. It would be wrong, of course, to abandon math or statistics. But they need every time to be put into a context of How Much, as they are in chemistry, in most biology, in history, and in engineering science.
Deirdre McCloskey (PhD, Economics, Harvard) is a co-author of 20 books, including, with Stephen Ziliak, The Cult of Statistical Significance (2008). The book describes how widespread overuse of tests of statistical "significance" has negatively affected economics, medical research, political science, psychology, and sociology by diverting scientists from the key scientific question, "How Much?"
McCloskey, distinguished professor of economics, history, English, and communication at the University of Illinois at Chicago, will be addressing this topic at a webinar October 28 broadcast from the University of Michigan and sponsored by the Michigan Institute for Teaching and Research in Economics (MITRE), the Department of Economics, and the Inter-university Consortium for Political and Social Research.
In the following interview, she describes the key problems with misuse of statistical significance testing and how replacing this technique with a common sense approach that relies on human judgment about what is significant would lead to advancements in science.
ICPSR: In your webinar, you'll be talking about problems resulting from certain sciences' use of statistical significance. What's at issue with this?
Deirdre McCloskey: It's a tiny technique. My argument is not that statistics is a bad thing, or that I hate numbers! My criticism is not what people outside statistics think is wrong with statistics — like that "It’s mechanical whereas human life is nonmechanical" or that "It doesn’t reflect the human spirit." My point is a technical one about the way statistics has been used in certain restricted fields: medicine, economics, sociology, education, psychology. What people try to do when they use a "test of statistical significance" is hand over their judgment about How Big is Big to a machine. But such a judgment is basic to science, and isn’t contained in the numbers themselves, free from human judgment. Here in Chicago the temperature is 58 degrees. Is that a cold day or a warm day? In January it would be a warm day. In July it would be a cold day. In interstellar space it would be a heat wave. On the surface of the sun it would be an extreme cold spell. The context for the judgment determines the scale against which we decide Big or Small.
ICPSR: Where does the problem enter in?
McCloskey: My claim is that a "null hypothesis significance test without a loss function," to speak technically, is useless. It walks away from the question "How Big is Big?" It tries to see in the numbers themselves the human significance. The procedure is clearly mistaken. It's an old point, a minority view in statistical theory right from the beginnings in Edgeworth and Gosset ("Student" in "Student’s t"). But we in the minority have been correct all along, and the majority has been wrong. The wrongness has meant that in medicine, in economics, sociology, psychology, people are coming to conclusions that they think are scientifically justified that are not. They are making big scientific mistakes, and, worse, big clinical mistakes. They are using an erroneous scientific instrument. So it’s a big deal, even though it’s a seemingly minor technique buried deep within accepted statistical procedures. The outcome is not minor. It would be as though an astronomer used a telescope that didn’t work correctly, giving her radically mistaken views of the universe. Missing earth-intersecting orbits of asteroids, for example!
ICPSR: Your book’s subtitle is "How Standard Error Costs Us Jobs, Justice, and Lives." Could you give some examples, particularly fatal ones?
McCloskey: I can give an example that was the basis of a Supreme Court case a few years ago. My co-author Stephen Ziliak and I were asked by a law firm to let them use our 2008 book as the basis for a brief to the Court. It was a drug case, a spray that you would put in your nose to stop the coming of a cold. Alas, occasionally it destroyed your sense of smell forever. The company said, "We consulted statisticians, and they told us that the frequency, the commonness, of the bad effect was statistically insignificant." Except that it’s humanly significant. Losing your sense of smell is a serious matter. We expected the conservatives on the Court to vote with the drug company. They didn't. The decision in the spring of 2011 was 9-0 in "our" favor. Look it up: Matrixx Initiatives, Inc., et al., Petitioners, v. James Siracusano and NECA-IBEW Pension Fund, brief of November 12, 2010.
Most sciences, such as physics and chemistry and geology, do not make the "significance-test" mistake. When they want to know how big something is, they talk about its magnitude, what we call its "oomph." Measuring oomph is emphatically not what statistical significance does. It says, "That would have been rare if it were merely a random result." The question of how big and important the effect of losing your sense of smell is changed into another, mostly irrelevant question of how rare.
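Editor's illustration: the gap between "how rare" and "how big" can be sketched numerically. The snippet below is a minimal sketch, not a method from the interview or the book; the function name `two_sided_p`, the simple two-group z-test with a known standard deviation, and every number in it are assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect, sd, n_per_group):
    """Two-sided p-value for a difference in means between two
    equal-sized groups, using a z-test with a known standard deviation.
    (A deliberately simple, hypothetical setup.)"""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = effect / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical case 1: a trivially small effect, but an enormous sample.
tiny_effect_p = two_sided_p(effect=0.02, sd=1.0, n_per_group=100_000)

# Hypothetical case 2: a large effect, but a very small sample.
large_effect_p = two_sided_p(effect=0.8, sd=1.0, n_per_group=8)

print(f"tiny effect, huge sample:   p = {tiny_effect_p:.6f}")
print(f"large effect, small sample: p = {large_effect_p:.4f}")
```

Under these made-up numbers the trivial effect clears the conventional 0.05 threshold and the large one does not, so the p-value alone says nothing about which effect has more oomph.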
Or an economic example. There’s an ongoing dispute about the effect of the minimum wage. Economists like me believe the minimum wage is a catastrophe for poor people. Economists like Paul Krugman think that the minimum wage helps poor people. But the statistical evidence one way or the other, on both sides, is being assessed with the defective telescope of tests of statistical significance. Economists come to conclusions about whether the minimum wage is a good thing or a bad thing on the basis of a criterion that makes no sense. You might as well use a Ouija board to decide. In fact, it would be cheaper, and no better!
Such broken instruments and arguments, though, happen in science a lot. It’s not as if economists and medical scientists are particularly stupid. For a very long time, for example, geologists — especially American geologists — denied the persuasive evidence that the continents move. It was first proposed in a serious way in 1915 by a German meteorologist. Note the date — it was in the middle of the First World War, was by a German, and by a non-geologist. So most American geologists said, "Oh, that’s stupid." I had an undergraduate course in geology which had literally no account of the formation of mountains. None. Plate tectonics, as it came to be called, gave an account, and a few years after I took the course, 50 years after the theory was first proposed, almost all geologists adopted it. Making mistakes in science and sticking with them is very common. Science is conservative — indeed, it should be: we shouldn’t be changing our minds every five minutes. So in a few fields in love with tests of statistical significance without loss functions we go on and on and on making wrong decisions. A pity.
ICPSR: Where is it most evident?
McCloskey: In medical science, as we detail in the book. The doctors and the medical researchers and epidemiologists make clinical decisions on the basis of misused statistical significance. They have a proud name for it, "evidence-based medicine," and are trying right now to force it on all researchers. It is a growing catastrophe. Medical researchers are killing people by the hundreds of thousands in search of statistical significance. We explain in the book why that’s so. If a sociologist gets something wrong, it may not be the end of the world — although it could easily be important if it’s in criminology, say, and results in massive incarcerations to no good. But in medicine the evil of tests of statistical significance is perfectly obvious. If you are following the silly rules of "evidence-based medicine" you refuse to approve a new drug because you’re seeking statistical significance. By the same procedure, you allow a drug like the one in the Supreme Court case that’s bad for people because its badness is statistically insignificant. In either case, you’re hurting people, or killing them.
ICPSR: What has the response to your argument been from the scientific and academic communities?
McCloskey: I’ve been making this point for a quarter of a century. Many other applied and theoretical statisticians since the beginning of the field over a century ago, as Steve and I show, have made exactly the same point, so we’re not alone. We’ve got a list early in the book — a long list — of distinguished statistical theorists and practitioners from the late 19th century and certainly the early 20th century who said the same thing. Steve and I are perhaps more insistent, and by now inclined to be even a little impolite! I’ve been clinging to the knees of the economics profession for a quarter of a century, saying, "Guys, this is a mistake. It’s a logical error. You are damaging our beloved economic science." They get angry: "You don’t like numbers," they say, or "You’re not at Harvard, so why should I pay any attention to you?" or "I learned all this econometrics and now you’re saying it’s worthless." A few people have changed their intellectual life because of what Stephen and I said, or what William Kruskal said, or Kenneth Rothman said, or Abraham Wald said. Those who change their minds have a chance at doing actual science instead of mumbo jumbo. But most people want to go on with the mumbo jumbo, because that’s what their beloved teachers told them, and there are hundreds of textbooks recommending it. Whether it’s logical or not, they have careers to run.
I’ve gotten all kinds of reactions. People have stormed out of seminars. What’s strange is that I make a simple, common sense argument such that even the justices at the Supreme Court can understand — and on the whole, with one exception in 25 years, no one has any answer. Steve and I have logic and mathematics and reason on our side. They have mumbo jumbo, and indignation. They don’t want to hear it. The one person who has tried to reply is Kevin Hoover, now at Duke. I admire Kevin for it, because he at least had the intellectual integrity to try to defend the practice. We had a dispute with him in print and showed that what he said was not a defense at all. But at least he tried.
ICPSR: How did you become aware of the problem, and what influences do you think contributed to you recognizing that something was wrong?
McCloskey: Around 1983 I was at the Institute for Advanced Study, the Einstein Institute at Princeton, and I was working on a paper on the Gold Standard. I was told by my co-author, Richard Zecher, that statistical significance didn’t matter. What matters is substantive significance — how big the coefficient is, how much "oomph" there is, as I came to call it. At the same time I had met the British sociologist of science visiting Princeton, Harry Collins, and he pointed out to me that 10 years before there had been a "significance test controversy" in sociology and psychology. So I went and I read that, and looked into the history of statistical theory on the point. As we explain in the book, Ronald Fisher, an English geneticist and statistician in the 1920s in London, was the dominant figure creating the modern procedures of applied statistics. Alas, the procedures were wrong. (Steve Ziliak is writing a brilliant book on the way that Fisher distorted statistics.) The correct argument doesn’t always win. The long run can be quite long!
ICPSR: What would be required to effect real change across these sciences?
McCloskey: The easiest thing would be to stop using the technique and to do what the physicists and other scientists do. The technique of deciding whether something’s big or not in physics and astronomy can be called "interocular trauma." You’ve fitted a regression, and you look at the numbers, and if it hits you between the eyes, well, that’s the "interocular trauma"! But it’s no joke. It is actually how they do it in physics. If you turn to a journal by physicists you will see that they don’t use tests of statistical significance. Instead, they have a theoretical expectation of how the curve will look, and they’ve gathered their experimental data, and they make a judgment, "Well, it’s quite close. Good." Or they say, "Hmm: notice those observations up at the top end of the diagram. They’re not so good. There’s something else going on here." Real scientists make a judgment within the context of the conversation of physics or geology or whatever it is. The same needs to be done — and is done, but secretly, behind the curtain — in economics and sociology and medicine.
A more fancy way of putting the point is that when you're deciding how big some number is, how big the effect of the minimum wage, and what direction it is, and so forth, what you need is what we call in the trade a "loss function" — that is, you need to know how many people, how many lives you’ll destroy, or enrich, by going one way as against the other. There are always two mistakes you can make, excessive gullibility or excessive skepticism, Type I and Type II error, as they are called. You make a judgment about how costly, how hurtful, excessive gullibility will be as against excessive skepticism, then you can balance the two. But you can’t do it absent the loss function. Without it, you haven’t faced up to how much it matters that you overestimate or underestimate the number. South Africa has a very high minimum wage and unemployment over 25% of the workforce. The minimum wage is a catastrophe there, and ruins people’s lives. Underestimating its effect statistically has very great social costs. Yet there are South African economists defending the minimum wage with meaningless tests of statistical significance. If you ignore the loss function and say, "I can just look at the numbers and decide whether they’re 'statistically significant'" you are not consulting human hurt or intellectual embarrassment, which are the only ways of deciding what’s big and what's small, what should be taken account of, and what should not. The more elaborately trained someone is in statistical techniques, the harder it is to get across the point to her. If I try to explain it to people who’ve had three graduate courses in econometrics, they resist. Of course. People who make big investments in one way of looking at the world find it emotionally hard to say "whoops."
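Editor's illustration: the balancing of the two kinds of mistake described above can be sketched as a tiny expected-loss comparison. This is a minimal sketch under stated assumptions, not the procedure from the book; the function names, the two-action framing ("act" versus "ignore"), and all probabilities and loss values are hypothetical.

```python
def expected_losses(p_harm, loss_false_alarm, loss_missed_harm):
    """Expected loss of each action, given an estimated probability
    that the harm is real and a (human-chosen) cost for each mistake."""
    return {
        # Acting when there was no harm: the cost of excessive gullibility.
        "act": (1.0 - p_harm) * loss_false_alarm,
        # Ignoring a harm that was real: the cost of excessive skepticism.
        "ignore": p_harm * loss_missed_harm,
    }


def best_action(p_harm, loss_false_alarm, loss_missed_harm):
    """Pick the action with the smaller expected loss."""
    losses = expected_losses(p_harm, loss_false_alarm, loss_missed_harm)
    return min(losses, key=losses.get)


# Same hypothetical 2% chance of harm, two different loss functions.
severe = best_action(p_harm=0.02, loss_false_alarm=1.0, loss_missed_harm=500.0)
mild = best_action(p_harm=0.02, loss_false_alarm=1.0, loss_missed_harm=10.0)

print("severe, irreversible harm ->", severe)
print("mild, recoverable harm    ->", mild)
```

With the probability held fixed, the decision flips when the assumed cost of the missed harm changes, which is the sense in which the numbers alone cannot settle what matters.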
ICPSR: What advice would you give to an economist who is listening, and who isn’t yet persuaded that you’re correct but is open to considering the idea further?
McCloskey: Read the book. Think it through on your own. Be brave. Our claim is that any open-minded person who thinks about it will come to the same conclusion we have. I don’t see how they can come to any other conclusion. If they do, I urge them to send me a wire, collect. It’s not the case that numbers themselves, without human judgment, tell you how big they are. Is 6 feet tall for an American man? In Abraham Lincoln’s day it was quite tall, speaking of probabilities. Nowadays, it’s ordinarily tall. You can collect a million cases of American men in 1860, and do all the statistical tests you want, but you still won’t know whether 6 feet is tall enough to matter — there’s nothing in the numbers that tells you whether 6 feet is for some human purpose unusually tall. It’s a human judgment. You have to say, "Only 10 percent of American men in 1860 were over 6 feet, and Abraham Lincoln was taller than that — considerably. And boy, that’s important to us." The last part is the crux. You have to make that judgment. Human judgment is not avoidable in science. The people who don’t get the point we are making keep saying, "Oh no, you don’t make a judgment. All you need is this machine, and I can stick my numbers in and it will tell me what to conclude." No, it won’t.
ICPSR: How might the sciences, say economics, be different today if scrutiny of substantive differences were widely used instead of the tests of statistical significance?
McCloskey: Well, of course I think they would come to conclusions about the economy that I favor! (Laughs). Seriously, now, I know that's unlikely — statistically insignificant. But, clearly, economic scientists would understand the economy better if they would stop using an erroneous technique. One example is raised by the Nobel Prize in Economics just given. One of the three recipients is an old acquaintance of mine, Eugene Fama. Gene once used to say that if you think you can outguess the stock market, show me your bank account. It is the only legitimate test of whether you can beat the stock market. But later, Gene, I fear, fell into the statistical significance trap, and now he says, "You think you can beat the stock market; show me your test of statistical significance." But such a test is spectacularly irrelevant to the question of whether you can in a sustainable way earn more by knowing the history of prices, as the "technical analysts" and other stock pickers claim you can.
ICPSR: So going forward, if those sciences put emphasis on oomph, what do they look like?
McCloskey: It looks like a science that advances. I’m not scornful of my colleagues. I think they are very intelligent and hard-working. Honestly. But because economics has gotten into the trap of statistical "significance" it goes in circles. No one believes anyone else’s tests of statistical significance. In his heart he knows that tests of statistical significance are phony. He says, "My result is statistically significant; therefore, you should pay attention to it." But when someone on the other side of, say, the minimum-wage controversy comes up with statistical significance, he says, "Oh no, she got it all wrong." The test doesn’t end arguments in economics, as its logic says it should. (In medical science, actually, it does end arguments, unfortunately, and kills hundreds of thousands of people as a result.) To the economists I say: stop doing phony tests of significance, and do something that you think will actually change people’s minds.
Economic science would advance if it used oomph — coefficient size and substantive significance — and stopped using tests of statistical significance. The issue of the minimum wage, for example, would actually get settled. Then we could go on to the next question, and we would be able to stand on the shoulders of giants. I’m talking common sense against a fancy lunacy. I dearly wish my economist colleagues, whom I admire, would stop the lunacy, and get back to common sense.
Mark Thompson-Kolar can be contacted at firstname.lastname@example.org or 734-615-7904.