Stripping the Dread from the Data
"Naked Statistics" by Charlie Wheelan
When newspapers publish a report about a new study linking some activity, like taking frequent short breaks at work, to a terrible disease like lung cancer, do readers understand what the results of that study mean? Most don't, says Dartmouth economics professor Charlie Wheelan, author of "Naked Statistics." Most people's eyes glazed over during their intro statistics course--Wheelan's included--but the basic tools can help decipher complex issues. For example, instead of avoiding 10-minute breaks, consider that those breaks are correlated (associated) with cancer but may not be causing it. A more reasonable explanation, Wheelan says, is that most smokers take lots of short breaks to smoke outside, and the smoking is what's behind any link to cancer.
Wheelan joins us on Chicago Tonight at 7:00 pm to give us a refresher on Stat 101.
We spoke with Wheelan about people's general statistical knowledge, and how he makes a typically dry topic "a rollicking good time," according to Austan Goolsbee, University of Chicago economics professor and former economic adviser to President Obama.
Most people are pretty skeptical of statistical analysis, you write. What do you think the statistical knowledge level is like among the general population?
The skepticism is healthy. Statistics inherently simplify--that’s what they do. If I give you a file with 100 million households, it will overwhelm you. People are correct that statistics leave something out--it’s like internet dating. The profile might be accurate, but what’s left out will jump out and bite you. But the skepticism is pointed in every direction and perhaps people should have a more targeted skepticism.
You have a conversational tone in the book--in one section describing standard deviation, if they’re paying attention, the reader realizes your example made them taller and lighter than average. You write basically, if you’ve made it this far, I’ve rewarded you by making you tall and thin. How important was your style?
It’s essential. It’s a book about statistics for lay readers, so they're in the bookstore and they could go for Stephen King. It’s got to be fun--the tone, the examples, we don’t want you to regret the purchase. And it has to be relevant, so we include real examples. It’s like we’re on this journey together, and the more fun it is, the more you’ll learn. It’s a vivid contrast to most statistics books. There are people out there that don’t think in math, and need it translated into intuition.
A common way many people see statistics day-to-day is reading about scientific studies in popular media--a recent article linked diet soda to diabetes. But as you point out, these studies are misunderstood by reporters--it could be that people with diabetes avoid drinking soda full of sugar.
This is the most important takeaway. You’re besieged by these statistics in the news. You can’t make controlled experiments for humans. You can’t have some people smoke pot, some others not smoke pot, and see how they all turn out. So instead we find a group of folks, and follow them over 20 years and ask them things--like did you smoke pot in junior high? And statistics will help us find connections. But they’re just associations. You may find pot smokers are more likely to drop out--but did the smoking cause them to drop out, or did their upbringing cause both things to be more likely? Is it correlation or a causal relationship?
How responsible is the media for people’s misunderstandings of these studies?
Responsible researchers will do a couple things. They’ll use regression analysis, which helps control for other factors--they take students who smoked pot, and they’ll control for family structure, and may find that, with the same family structure, there is still an association between pot smoking and dropping out. But when you read those studies, the language is always very cautious--we believe it is possible that X is true. They will call for further study. But then it comes out in a journal, the media gets hold of it and says, “People who smoke pot are more likely to drop out.” This can go both ways--up until the ‘60s, we were still debating whether smoking led to cancer. Many, many studies found the same association, and there was biological corroboration. But the tobacco companies kept saying this was just a correlation.
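The idea of "controlling for" another factor can be illustrated with a small simulation. This is only a sketch under invented assumptions: the data are synthetic, the probabilities are made up, and the names `hardship`, `simulate` and `dropout_rate` are hypothetical. It also uses simple stratification (comparing like with like) rather than the regression analysis Wheelan describes. In this toy world, dropping out depends only on a difficult upbringing, yet smokers still drop out more often in the raw numbers.

```python
import random

def simulate(n=200_000, seed=0):
    """Synthetic students: (hardship, smokes, drops_out). All rates are invented."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hardship = rng.random() < 0.3                        # hypothetical confounder
        smokes = rng.random() < (0.5 if hardship else 0.1)   # hardship raises smoking
        drops = rng.random() < (0.4 if hardship else 0.05)   # dropout ignores smoking
        rows.append((hardship, smokes, drops))
    return rows

def dropout_rate(rows, smokes, hardship=None):
    """Dropout rate among smokers or non-smokers, optionally within one stratum."""
    group = [d for h, s, d in rows
             if s == smokes and (hardship is None or h == hardship)]
    return sum(group) / len(group)

rows = simulate()

# Raw comparison: smokers vs. non-smokers, ignoring upbringing.
raw_gap = dropout_rate(rows, True) - dropout_rate(rows, False)

# Controlled comparison: the same gap, computed within each upbringing group.
gap_hardship = dropout_rate(rows, True, True) - dropout_rate(rows, False, True)
gap_no_hardship = dropout_rate(rows, True, False) - dropout_rate(rows, False, False)

print(f"raw gap:                 {raw_gap:.3f}")
print(f"gap with hardship:       {gap_hardship:.3f}")
print(f"gap without hardship:    {gap_no_hardship:.3f}")
```

In this made-up setup the raw gap is large while the within-group gaps are close to zero, which is the pattern you would expect when a third factor drives both behaviors. Real studies may find that an association survives the controls, as Wheelan notes.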
You also write about how important it is to make sure you’re studying the right data, regardless of how well you analyze it. Tell me about the Chicago high school rankings.
This is one of my pet peeves. Every fall, assorted media would come out with the best Chicago high schools. And the list would include places like Whitney Young and Northside College Prep, etc. My question is: have you taken into account that you are measuring the best schools by test scores, and to get into these schools, you need high test scores? Have we learned anything? It’s as ridiculous as saying a basketball team is doing a great job fostering tall people. It’s backwards. The question we should be asking is which schools are adding the most value. That’s much harder to tease out. Because the people coming into Walter Payton are entering in the 99th percentile, for example, and we could hit them with a stick for the year, and they would still be in the 97th percentile.
Another chapter focuses on probability. Tell me how Schlitz used basic laws of probability to create what seemed like a dangerous Super Bowl ad.
You have to remember--most American beer was bad and tasted the same back in the ‘80s. And Schlitz was trying to make a big marketing play. They did a lot of clever things that seemed dangerous--they would do live taste tests during the playoffs and eventually the Super Bowl. One hundred blind testers, with Schlitz and another beer. The hundred people they asked to taste blindly had all said they preferred a competing brand. This seems like the last thing you’d want, but in fact, if you agree they all taste the same, if even 35% say they like Schlitz better, that comes across as a huge victory. If you take Schlitz drinkers, and only half say they like it, that’s a huge loss. You have to assume the tasting is a coin flip, because the beers taste the same. And statistics say with 100 coin flips, the probability of getting fewer than 40 people choosing Schlitz is really, really low--only about 2 percent. The punch line, of course, is do you know what happened at halftime at the Super Bowl? It came out exactly 50/50.
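The coin-flip reasoning above can be checked directly with the binomial distribution. The following is a minimal sketch, assuming each of the 100 tasters independently picks Schlitz with probability 0.5. The function name `prob_fewer_than` is hypothetical, chosen only for this illustration.

```python
from math import comb

def prob_fewer_than(k, n=100, p=0.5):
    """Probability that fewer than k of n independent tasters pick Schlitz."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

# Chance that fewer than 40 of 100 coin-flip tasters choose Schlitz.
p_below_40 = prob_fewer_than(40)
print(f"P(fewer than 40 of 100 pick Schlitz) = {p_below_40:.4f}")
```

Under these assumptions the result lands close to the roughly 2 percent figure quoted in the interview. Changing `k`, `n` or `p` shows how the risk of an embarrassing result would shift with a different threshold or panel size.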
You write, “Our ability to analyze data has grown far more sophisticated than our thinking about what we ought to do with the results.” What types of thorny issues can statistics present in the age of Big Data?
The book cannot answer those questions. It can only raise them. Target, for example, figured out how to predict if women are pregnant based on their buying habits. We can do the same thing as Target in law enforcement. One of the powerful uses of statistics is in fighting crime--CompStat in Chicago and New York. At 3:00 a.m., on this corner, we have a lot of problems, so we can send more police there. Across the country, violent crime has been falling precipitously.
But when you combine that with the Target story, police can say they stopped a crime before it happened. This happened in California--police said they had info that this was a crime hotspot, they blanketed it with police, they found some suspicious women, and they arrested them for other offenses. The supposition was they were lingering because they were about to steal a car. That gets a little sketchy. What if we have really good data and we’re right more often than wrong, but some of those predictive factors are that 35-year-old Hispanic men driving a certain car are more likely to be drug couriers? All of our data says it's true, but the vast, vast majority are not drug couriers. We can become more effective at law enforcement, but we will also make life hell for those caught in that statistical net. That’s something for democracy or philosophers to decide.
This interview has been condensed and edited.