Archive | Methodology

Is coffee a killer? The statistical significance filter strikes again

Thomas Lumley writes:

The Herald has a story about hazards of coffee. The picture caption says
Men who drink more than four cups a day are 56 per cent more likely to die.

which is obviously not true: deaths, as we’ve observed before, are fixed at one per customer.  The story says
It’s not that people are dying at a rapid rate. But men who drink more than four cups a day are 56 per cent more likely to die and women have double the chance compared with moderate drinkers, according to The University of Queensland and the University of South Carolina study.

What the study actually reported was rates of death: over an average of 17 years, men who drink more than four cups a day died at about a 21% higher rate, with little evidence of any difference in women. After they considered only men and women under 55 (which they don’t say was something they had planned to do), and attempted to control for a whole bunch of other factors, the rate increase went to 56% for men, but with a huge amount of uncertainty. Their graphs [not reproduced here] show the estimate and uncertainty for people under 55 (top panel) and over 55 (bottom panel). There’s no suggestion of an increase in people over 55, and a lot of uncertainty in people under 55 about how death rates differed by coffee consumption.

In this sort of situation you should ask what else is already known. This can’t have been the first study to look at death rates for different levels of coffee consumption. A search of the PubMed research database turns up, among the first hits, a recent meta-analysis that puts together all the results its authors could find on this topic. They report
This meta-analysis provides quantitative evidence that coffee intake is inversely related to all cause and, probably, CVD mortality.

That is, averaging across all 23 studies, death rates were lower in people who drank more coffee, both men and women. It’s just possible that there’s an adverse effect only at very high doses, but the new study isn’t very convincing, because even at lower doses it doesn’t show the decrease in risk that the accumulated data show.

So. The new coffee study has lots of uncertainty. We don’t know how many other ways they tried to chop up the data before they split it at age 55 — because they don’t say. Neither their article nor the press release gave any real information about past research, which turns out to disagree fairly strongly.

I agree.  Beyond all this is the ubiquitous “Type M error” problem, also known as the statistical significance filter:  By choosing to look at statistically significant results (i.e., those that are at least 2 standard errors from zero) we’re automatically biasing upward the estimated magnitudes of any comparisons.  So, yeah, I don’t believe that number. I’d also like to pick on this quote from the linked news article:
“It could be the coffee, but it could just as easily be things that heavy coffee drinkers do,” says The University of Queensland’s Dr Carl Lavie. “We have no way of knowing the cause and effect.”

But it’s not just that.  In addition, we have no good reason to believe this correlation exists in the general population. Also this:
Senior investigator Steven Blair of the University of South Carolina says it is significant the results do not show an association between coffee consumption and people older than 55. It is also important that death from cardiovascular disease is not a factor, he says.

Drawing such conclusions from a comparison not being statistically significant is a no-no too. On the plus side, it says “the statistics have been adjusted to remove the impact of smoking.” I hope they did a good job with that adjustment. Smoking is the elephant in the room: if you don’t adjust carefully for smoking and its interactions, you can pollute all the other estimates in your study.

Let me conclude by saying that I’m not trying to pick on this particular study. These are general problems; it’s just helpful to consider them in the context of specific examples. There are really two things going on here. First, due to issues of selection, confounding, etc., the observed pattern might not be real. Second, even if it is real, the two-step process of first checking for statistical significance and then taking the unadjusted point estimate at face value has big problems, because it leads to consistent overestimation of effect sizes.
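
To see how the significance filter inflates estimates, here is a minimal simulation sketch in Python. The numbers (a true effect of 0.1 on some scale, a standard error of 0.2) are invented for illustration and have nothing to do with the coffee data; the point is only the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # hypothetical small true effect
se = 0.2            # standard error of each study's estimate
n_sims = 100_000

# Each simulated study reports the true effect plus sampling noise.
estimates = true_effect + se * rng.normal(size=n_sims)

# The statistical significance filter: keep only estimates that are
# at least 2 standard errors away from zero.
significant = np.abs(estimates) > 2 * se

print("share of studies reaching significance:", round(significant.mean(), 3))
print("mean estimate, all studies:            ", round(estimates.mean(), 3))
print("mean estimate, significant ones only:  ", round(estimates[significant].mean(), 3))
# The filtered average comes out around 0.4, roughly four times the true
# effect of 0.1: conditioning on significance biases the magnitude upward.
```

The same logic applies to the 56% figure: conditional on having cleared the significance bar, a reported estimate is expected to overstate whatever real difference is there.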

I’m posting this here (as well as on our statistics blog) because I think these points are relevant for political science research as well.


PPP’s baffling discard process

B. J. Martino writes:

Earlier this summer I went on a bit of a rant about PPP and their process of discarding interviews, rather than simply weighting data. Mark Blumenthal mentioned your response to the discussion in one of his posts, where you said you were a bit “baffled” by it.

While they claim to engage in the discard process as a kind of retroactive quota to account for having too many older, white women in their sample, it was the discards among the non-“older white women” that made me curious. That is, any respondent who did not meet all the criteria of being age 46+, white and female.

I downloaded the data from all their 2012 surveys for Daily Kos/SEIU, and compared the sample of non-“older white women” within the unweighted released data as well as the discarded data.

At least from the first six surveys I have looked at, there appears to be a consistent difference in the partisan composition of the released data and the discarded data for this group. In every case, the released data for this group was net Democratic in Party ID (Unw D-R), and the discarded data was net Republican (Dis D-R).

Party ID in PPP Polls for Daily Kos/SEIU, non-“older white women”
(raw unweighted data and discarded data)

[Table with columns for the unweighted sample, the discarded sample, the unweighted D-R margin (Unw D-R), the discarded D-R margin (Dis D-R), and the difference (Unw-Disc); the survey-by-survey figures are not reproduced here.]
What this suggests to me is that the discard process is not only a way to apply a retroactive quota to older white women, but also a way to fix the partisanship of another group (presumably primarily younger voters). My thought is that they are getting too Republican a sample in this group because they never dial cell phones.

I found it interesting that despite hanging their hat on being the most accurate of 2012, they announced today that they would be working to find a way to include cell phones in the future.

Martino continues:
As I told Mark, I’m not really interested in getting into a shouting match with PPP. It has always just kind of dumbfounded me how they work. The fact that Daily Kos/SEIU published all the raw data from PPP’s 2012 polling at least gave me some opportunity to figure it out.

I guess the troubling part is how they have repeatedly stood by the statement that they do not weight for Party ID, when this discard process would seem to indicate a de facto weight on Party ID for at least a portion of the sample. What they say is strictly true, but the effect is the same. Seems to be arguing semantics.

I also took a look at the Presidential ballot for this same group of non-“older white women.” Same effect, perhaps even a bit more pronounced.

Presidential Ballot in PPP Polls for Daily Kos/SEIU, non-“older white women”
(raw unweighted data and discarded data)

[Table with columns for the unweighted sample, the discarded sample, the unweighted O-R margin (Unw O-R), the discarded O-R margin (Dis O-R), and the difference (Unw-Disc); the survey-by-survey figures are not reproduced here.]
I don’t really have anything to add here; it’s just an interesting story. I remain amazed that anyone would think it’s a good idea to throw away survey interviews that have already been conducted.
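
To make the weighting-versus-discarding contrast concrete, here is a minimal sketch with made-up numbers (my own illustration, not PPP’s actual procedure): an overrepresented group can be weighted down to its population share, or interviews from it can be randomly thrown away until the shares match. Both hit the target composition, but the discard route wastes completed interviews, and if the discards are not random with respect to something like party ID, it can also shift the estimates, which is the pattern Martino documents above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical raw sample: 600 interviews from group A (overrepresented)
# and 400 from group B, in a population that is actually 50/50.
n_a, n_b = 600, 400
y_a = rng.binomial(1, 0.45, n_a)   # e.g., support for some candidate
y_b = rng.binomial(1, 0.55, n_b)

# Option 1: weight. Each respondent's weight = population share / sample share.
w_a = 0.5 / (n_a / (n_a + n_b))
w_b = 0.5 / (n_b / (n_a + n_b))
weighted_est = (w_a * y_a.sum() + w_b * y_b.sum()) / (w_a * n_a + w_b * n_b)

# Option 2: discard. Randomly drop group-A interviews until the groups are
# the same size, then take the plain mean of the interviews that remain.
keep_a = rng.choice(n_a, size=n_b, replace=False)
discard_est = np.concatenate([y_a[keep_a], y_b]).mean()

print("weighted estimate (all 1000 interviews):", round(weighted_est, 3))
print("discard estimate  (only 800 interviews):", round(discard_est, 3))
# Both target the 50/50 population, but the discard estimate also depends
# on which 200 completed interviews happened to be thrown away.
```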


Needed: peer review for scientific graphics

Under the heading, “Bad graph candidate,” Kevin Wright points to this article, writing:

Some of the figures use the same line type for two different series.

More egregious are the confidence intervals that are constant width instead of increasing in width into the future.

Indeed. What’s even more embarrassing is that these graphs appeared in an article in the magazine Significance, sponsored by the American Statistical Association and the Royal Statistical Society.
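
On Wright’s second point, about interval width: for essentially any time-series forecast, uncertainty accumulates with the horizon, so the interval should fan out rather than stay constant. Here is a minimal sketch for the simplest case, a random walk (my own illustration, not the article’s model):

```python
import numpy as np

sigma = 1.0                  # standard deviation of one-step innovations
horizons = np.arange(1, 11)  # forecasting 1 to 10 steps ahead

# For a random walk, the h-step-ahead forecast error has variance h * sigma^2,
# so a 95% predictive interval has half-width 1.96 * sigma * sqrt(h).
half_width = 1.96 * sigma * np.sqrt(horizons)

for h, hw in zip(horizons, half_width):
    print(f"h = {h:2d}: forecast +/- {hw:.2f}")
# The half-width grows like sqrt(h); drawing a constant-width band
# understates the uncertainty at longer horizons.
```

A more sophisticated model changes the exact rate of growth, but not the qualitative point.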

Perhaps every scientific journal could have a graphics editor whose job is to point out really horrible problems and require authors to make improvements.

The difficulty, as always, is that scientists write these articles for free and as a public service (publishing in Significance doesn’t pay, nor does it count as a publication in an academic record), so it might be difficult to get authors to fix their graphs. On the other hand, if an article is worth writing at all, it’s worth trying to convey conclusions clearly.

I’m not angry at the authors for publishing bad graphs (scientists typically don’t get training in how to construct or evaluate graphical displays; indeed, I’ve seen stuff just as bad in JASA and other top statistics journals), but it would be good to catch this stuff before it gets out for public consumption.


Post-publication peer review: How it (sometimes) really works

In an ideal world, research articles would be open to criticism and discussion in the same place where they are published, in a sort of non-corrupt version of Yelp. What is happening now is that the occasional paper or research area gets lots of press coverage, and this inspires reactions on science-focused blogs. The trouble here is that it’s easier to give off-the-cuff comments than detailed criticisms.

Here’s an example. It starts a couple years ago with this article by Ryota Kanai, Tom Feilden, Colin Firth, and Geraint Rees, on brain size and political orientation:

In a large sample of young adults, we related self-reported political attitudes to gray matter volume using structural MRI. We found that greater liberalism was associated with increased gray matter volume in the anterior cingulate cortex, whereas greater conservatism was associated with increased volume of the right amygdala. These results were replicated in an independent sample of additional participants. Our findings extend previous observations that political attitudes reflect differences in self-regulatory conflict monitoring . . .

My reaction was a vague sense of skepticism, but I didn’t have the energy to look at the paper in detail so I gave a sort of sideways reaction that did not criticize the article but did not take it seriously either:

Here’s my take on this. Conservatives are jerks, liberals are wimps. It’s that simple. So these researchers can test their hypotheses by more directly correlating the brain functions with the asshole/pussy dimension, no?

A commenter replied:

Did you read the paper? Conservatives are more likely to be cowards/pussies as you call it – more likely to jump when they see something scary, so the theory is that they support authoritarian policies to protect themselves from the boogieman.

The next month, my coblogger Erik Voeten reported on a similar paper by Darren Schreiber, Alan Simmons, Christopher Dawes, Taru Flagan, James Fowler, and Martin Paulus. Erik offered no comments at all, I assume because, like me, he did not actually read the paper in question. In our blogging, Erik and I were publicizing these papers and opening the floor for discussion, although not too much discussion actually happened.

A couple years later, the paper by Schreiber et al. came out in a journal and Voeten reblogged it, again with no reactions of his own. This time there was a pretty lively discussion with some commenters objecting to interpretations of the results, but nobody questioning the scientific claims. (The comment thread eventually became occupied by a troll, but that’s another issue.)

More recently, Dan Kahan was pointed to this same research article on “red and blue brains,” blogged it, and slammed it to the wall:

The paper reports the results of an fMRI—“functional magnetic resonance imagining”— study that the authors describe as showing that “liberals and conservatives use different regions of the brain when they think about risk.” . . .

So what do I think? . . . the paper supplies zero reason to adjust any view I have—or anyone else does, in my opinion—on any matter relating to individual differences in cognition & ideology.

Ouch. Kahan writes that Schreiber et al. used a fundamentally flawed statistical approach in which they basically went searching for statistical significance:

There are literally hundreds of thousands of potential “observations” in the brain of each study subject. Because there is constantly varying activation levels going on throughout the brain at all time, one can always find “statistically significant” correlations between stimuli and brain activation by chance. . . .

Schreiber et al. didn’t discipline their evidence-gathering . . . They did initially offer hypotheses based on four precisely defined brain ROIs in “the right amygdala, left insula, right entorhinal cortex, and anterior cingulate.” They picked these, they said, based on a 2011 paper [the one mentioned at the top of the present post] . . .

But contrary to their hypotheses, Schreiber et al. didn’t find any significant differences in the activation levels within the portions of either the amygdala or the anterior cingulate cortex singled out in the 2011 Kanai et al. paper. Nor did Schreiber et al. find any such differences in a host of other precisely defined areas (the “entorhinal cortex,” “left insula,” or “Right Entorhinal”) that Kanai et al. identified as differing structurally among Democrats and Republicans in ways that could suggest the hypothesized differences in cognition.

In response, Schreiber et al. simply widened the lens, as it were, of their observational camera to take in a wider expanse of the brain. “The analysis of the specific spheres [from Kanai et al.] did not appear statistically significant,” they explain, “so larger ROIs based on the anatomy were used next.” . . .

Even after resorting to this device, Schreiber et al. found “no significant differences . . . in the anterior cingulate cortex,” but they did manage to find some “significant” differences among Democrats’ and Republicans’ brain activation levels in portions of the “right amygdala” and “insula.”

And it gets worse. Here’s Kahan again:

They selected observations of activating “voxels” in the amygdala of Republican subjects precisely because those voxels—as opposed to others that Schreiber et al. then ignored in “further analysis”—were “activating” in the manner that they were searching for in a large expanse of the brain. They then reported the resulting high correlation between these observed voxel activations and Republican party self-identification as a test for “predicting” subjects’ party affiliations—one that “significantly out-performs the longstanding parental model, correctly predicting 82.9% of the observed choices of party.”

This is bogus. Unless one “use[s] an independent dataset” to validate the predictive power of “the selected . . . voxels” detected in this way, Kriegeskorte et al. explain in their Nature Neuroscience paper, no valid inferences can be drawn. None.
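
To make the circularity concrete, here is a minimal simulation sketch (my own illustration, not the authors’ analysis): the “voxels” below are pure noise and the group labels are random, yet selecting the most correlated voxels and then scoring them on the same subjects yields impressive-looking “accuracy,” while held-out subjects reveal it to be chance.

```python
import numpy as np

rng = np.random.default_rng(2)

n_subjects, n_voxels = 60, 20_000
party = rng.integers(0, 2, n_subjects)             # random 0/1 "party" labels
voxels = rng.normal(size=(n_subjects, n_voxels))   # pure-noise "activations"

# Split subjects into a discovery half and a held-out half.
half = n_subjects // 2
disc, hold = slice(0, half), slice(half, None)

# Correlate every voxel with party ID in the discovery half only.
xc = voxels[disc] - voxels[disc].mean(axis=0)
yc = party[disc] - party[disc].mean()
corr = xc.T @ yc / (np.sqrt((xc**2).sum(axis=0)) * np.sqrt((yc**2).sum()))

# Keep the 10 voxels that correlate most strongly, purely by chance.
top = np.argsort(-np.abs(corr))[:10]

def accuracy(rows):
    # Classify by the sign of a correlation-weighted sum of the chosen voxels.
    score = voxels[rows][:, top] @ np.sign(corr[top])
    return np.mean((score > 0).astype(int) == party[rows])

print("accuracy on the subjects used for selection:", round(accuracy(disc), 2))
print("accuracy on the held-out subjects:          ", round(accuracy(hold), 2))
# The first number is far above 50% by construction; the second hovers
# around chance, which is the point about needing an independent dataset.
```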

Kahan follows up one of my favorite points, on the way in which multiple comparisons corrections exacerbate the statistical significance filter:

Pushing a button in one’s computer program to ramp up one’s “alpha” (the p-value threshold, essentially, used to avoid “type 1” errors) only means one has to search a bit harder; it still doesn’t make it any more valid to base inferences on “significant correlations” found only after deliberately searching for them within a collection of hundreds of thousands of observations.

Wow. Look what happened. Assuming Kahan is correct here, we all just accepted the claimed results. Nobody actually checked to see if they all made sense.

I thought a bit and left the following comment on Kahan’s blog:

Read between the lines. The paper originally was released in 2009 and was published in 2013 in PLOS-One, which is one step above appearing on Arxiv. PLOS-One publishes some good things (so does Arxiv) but it’s the place people place papers that can’t be placed. We can deduce that the paper was rejected by Science, Nature, various other biology journals, and maybe some political science journals as well.

I’m not saying you shouldn’t criticize the paper in question, but you can’t really demand better from a paper published in a bottom-feeder journal.

Again, just because something’s in a crap journal, doesn’t mean it’s crap; I’ve published lots of papers in unselective, low-prestige outlets. But it’s certainly no surprise if a paper published in a low-grade journal happens to be crap. They publish the things nobody else will touch.

Some of my favorite papers have been rejected many times before finally reaching publication. So I’m certainly not saying that appearance in a low-ranked journal is definitive evidence that a paper is flawed. But, if it’s really been rejected by 3 journals before getting to this point, that could be telling us something.

One of the problems with traditional pre-publication peer review is that it’s secret. What were the reasons that those 3 journals (I’m guessing) rejected the paper? Were they procedural reasons (“We don’t publish political science papers”), or irrelevant reasons (“I just don’t like this paper”), or valid criticisms (such as Kahan’s noted above)? We have no idea.

As we know so well, fatally flawed papers can appear in top journals and get fawning press; the pre-publication peer-review process is far from perfect. Post-publication peer review seems like an excellent idea. But, as the above story indicates, it’s not so easy. You can get lots of “Andy Gelmans” and “Erik Voetens” who just post a paper without reading it, and only the occasional “Dan Kahan” who gives it a detailed examination.

P.S. The above post is unfair in three ways.

1. It’s misleading to call Plos-One a “crap journal.” Yes, it publishes articles that other journals won’t publish. But that doesn’t make it crap. As various commenters have pointed out, Plos-One has a different publication model compared to traditional journals. “Different” doesn’t mean “crap.”

2. I have no particular reason to think that the paper above was rejected by others before being submitted to Plos-One.

3. Just because the methods in this paper have problems, that doesn’t mean its conclusions are wrong. The data analysis provides some support for the conclusions, even if the evidence isn’t quite as strong as claimed.

I recognize that this sort of study is difficult and costly, and I have a great respect for researchers who work in this area. If I can contribute via some statistical scrutiny, it is not to shoot all this down but rather with the goal of helping these resources be used more effectively.


Seeking Director, Educational Program on Data and Computing for Journalists

Mark Hansen, statistician and professor of journalism at Columbia University, writes that they’re looking to hire a director for a new program teaching journalists about data and computing.

Columbia Journalism School is creating a new post-baccalaureate program aimed at preparing college graduates who have little or no quantitative or computational background to be successful applicants to masters and doctoral degree programs that require skills in those areas. This is being done in consultation with a consortium of faculty from across the University, many from newly computational fields such as the digital humanities and the computational social sciences, which face the same disconnect between student preparation and emerging data and computing based research practices. As far as we know this program would be the first of its kind in the country.

We came to this project because, as part of the work of the Tow Center for Digital Journalism, we recently started a dual degree program in journalism and computer science. We have found it challenging to recruit young journalists to the program because they find the prospect of immediately enrolling in graduate-level computer science programs daunting. We also know that colleagues across the university want to increase the computational competency of those pursuing graduate study in their disciplines. We are thus leading a group of colleagues in creating a new program to address these needs, which we call Year Zero (so-named to suggest the portion of a graduate degree program that occurs before its first official year).

In the first semester, students will be introduced to a core series of concepts, taught in the context of the artifacts and practices of journalism, the digital humanities and computational social science, and often with pairs of instructors, one from computer science and one from these other fields. In the second term, students who plan to apply to our dual masters degree will take computer sciences courses, while potential candidates for advanced study in other fields will choose from a variety of other computational courses.

We are now seeking a Program Director to work with the Directors of the Tow Center and the Brown Institute to create course offerings for, lead courses in, and help recruit students for Year Zero. This is a full-time, two-year position, but the position is renewable based on performance and the success of the program.

This sounds a lot like the quantitative methods in social sciences M.A. program we started up a decade and a half ago at Columbia. QMSS was immediately successful and became more so, and I have every expectation that Mark’s program will become a similar success.


Yes, Forecasting Conflict Can Help Make Better Foreign Policy Decisions: A Response to Salehyan

Over at Dart Throwing Chimp, Jay Ulfelder responds to yesterday’s Monkey Cage guest post by Idean Salehyan. He notes that:

Ultimately, we all have a professional and ethical responsibility for the consequences of our work. For statistical forecasters, I think this means, among other things, a responsibility to be honest about the limitations, and to attend to the uses, of the forecasts we produce. The fact that we use mathematical equations to generate our forecasts and we can quantify our uncertainty doesn’t always mean that our forecasts are more accurate or more precise than what pundits offer, and it’s incumbent on us to convey those limitations. It’s easy to model things. It’s hard to model them well, and sometimes hard to spot the difference. We need to try to recognize which of those worlds we’re in and to communicate our conclusions about those aspects of our work along with our forecasts. (N.B. It would be nice if more pundits tried to abide by this rule as well. Alas, as Phil Tetlock points out in Expert Political Judgment, the market for this kind of information rewards other things.)

However, he takes issue with Salehyan’s claim that forecasters somehow get more attention from policy makers:

In my experience and the experience of every policy veteran with whom I’ve ever spoken about the subject, Salehyan’s conjecture that “statistical forecasts are likely to carry greater weight in the policy community” is flat wrong. In many ways, the intellectual culture within the U.S. intelligence and policy communities mirrors the intellectual culture of the larger society from which their members are drawn. If you want to know how those communities react to statistical forecasts of the things they care about, just take a look at the public discussion around Nate Silver’s election forecasts. The fact that statistical forecasts aren’t blithely and blindly accepted doesn’t absolve statistical forecasters of responsibility for their work. Ethically speaking, though, it matters that we’re nowhere close to the world Salehyan imagines in which the layers of deliberation disappear and a single statistical forecast drives a specific foreign policy decision.

He concludes that:

Look, these decisions are going to be made whether or not we produce statistical forecasts, and when they are made, they will be informed by many things, of which forecasts—statistical or otherwise—will be only one. That doesn’t relieve the forecaster of ethical responsibility for the potential consequences of his or her work. It just means that the forecaster doesn’t have a unique obligation in this regard. In fact, if anything, I would think we have an ethical obligation to help make those forecasts as accurate as we can in order to reduce as much as we can the uncertainty about this one small piece of the decision process. It’s a policymaker’s job to confront these kinds of decisions, and their choices are going to be informed by expectations about the probability of various alternative futures. Given that fact, wouldn’t we rather those expectations be as well informed as possible? I sure think so, and I’m not the only one.

The full post is available here.


One Irony of Nate Silver’s Leaving the New York Times

Nate Silver’s imminent departure from the New York Times to ABC and ESPN—see details in the links within this Jack Shafer post—has elicited stories of hostility to Silver within the Times newsroom.  The Times public editor, Margaret Sullivan, writes:

His entire probability-based way of looking at politics ran against the kind of political journalism that The Times specializes in: polling, the horse race, campaign coverage, analysis based on campaign-trail observation, and opinion writing, or “punditry,” as he put it, famously describing it as “fundamentally useless.” Of course, The Times is equally known for its in-depth and investigative reporting on politics.

His approach was to work against the narrative of politics – the “story” – and that made him always interesting to read. For me, both of these approaches have value and can live together just fine.

A number of traditional and well-respected Times journalists disliked his work. The first time I wrote about him I suggested that print readers should have the same access to his writing that online readers were getting. I was surprised to quickly hear by e-mail from three high-profile Times political journalists, criticizing him and his work. They were also tough on me for seeming to endorse what he wrote, since I was suggesting that it get more visibility.

I had a similar experience the night of the Iowa caucus in 2012.  Lynn Vavreck and I were in the lobby of the Des Moines Marriott talking to a senior Times reporter when the subject of Silver came up.  The reporter went on a rant about how Silver did not know things because he hadn’t been in the field as reporters are, about Silver’s “models,” and about how Silver could talk about polls that did not meet the Times polling standards (a fact that the Times polling editor Kate Phillips also referred to as “a problem” in a Twitter exchange with political scientist Daniel Smith).

The problem with this critique of Silver is that, if you follow his work closely or read his book, he’s extremely cautious about what data and modeling can accomplish.  Moreover, if you read his book closely, he is quite clear that the kind of data reporters (or baseball scouts) often gather—which is qualitative, not quantitative—is exceedingly valuable for doing what he does. This is a point about the book that hasn’t received much emphasis in the reviews I’ve read, even though the potential value of qualitative data in quantitative forecasts is well-established (see, for example, Bruce Bueno de Mesquita’s approach).

For example, Silver developed a quantitative system, PECOTA, for evaluating baseball players, a system you might think would be the natural antagonist of qualitative data-gatherers like scouts. Yet his book chapter on this has a section called “PECOTA vs. Scouts: Scouts Win.”  Silver writes:

Although the scouts’ judgment is sometimes flawed, they were adding plenty of value: their forecasts were about 15 percent better than ones that relied on statistics alone.

He has the same view of quantitative election forecasting.  In a section of the book called “Weighing Qualitative Information,” he lauds the value of the in-depth interviewing of candidates that is done by David Wasserman and others at the Cook Political Report.  Silver uses the ratings that Cook and others have developed as inputs to his own House forecasts and finds that they also add value.

So the irony, as I see it, is that Silver faced resentment within the newsroom even though his approach explicitly values the work that reporters do.  Although I suspect that Times reporters wouldn’t like to simply be inputs in one of Silver’s models, I could easily see how the Times could have set up a system by which campaign reporters fed their impressions to Silver based on their reporting and then Silver worked to incorporate their impressions in a systematic fashion.
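
Purely as a hypothetical sketch of what such a system might look like (this is not Silver’s actual model, and all the numbers and variable names are invented): treat each race’s polling average and a reporter’s coded impression as predictors in a simple logistic regression, and check whether the qualitative rating carries weight beyond the polls.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented training data for 200 past races:
#   poll_margin: polling-average margin for one party's candidate, in points
#   reporter_rating: a reporter's coded impression, -2 (sure loss) to +2 (sure win)
#   won: 1 if that candidate won
n = 200
poll_margin = rng.normal(0, 6, n)
reporter_rating = np.clip(np.round(poll_margin / 4 + rng.normal(0, 1, n)), -2, 2)
won = (poll_margin + 2 * reporter_rating + rng.logistic(0, 4, n) > 0).astype(int)

# Fit a plain logistic regression by gradient ascent (no external libraries).
X = np.column_stack([np.ones(n), poll_margin, reporter_rating])
beta = np.zeros(3)
for _ in range(50_000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 1e-4 * X.T @ (won - p)   # gradient of the log-likelihood

print("coefficients (intercept, poll margin, reporter rating):", beta.round(2))
# A clearly positive reporter-rating coefficient would mean the qualitative
# impressions add information beyond the polls; that is the kind of result
# Silver reports for the Cook Political Report race ratings.
```

In practice one would want many more races, out-of-sample checks, and some way to calibrate ratings across reporters, but the basic idea of folding qualitative judgments in as data is no more complicated than this.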

In short, even though it may be impossible to eliminate the tension between Silver’s approach and that of at least some reporters, I think there is an under-appreciated potential for symbiosis.  Perhaps Silver will find that at ESPN and ABC.

[Full disclosure: In 2011-2012, I was a paid contributor to the 538 blog.]
