The following is a guest post from my colleague, NYU political scientist Jonathan Nagler. In the interest of full disclosure, he and I are both Co-Directors of the NYU Social Media and Political Participation (SMaPP) laboratory.
A group of sociologists at Indiana University recently claimed to have shown that “tweets predict elections”. The research takes the tweets from the 3 months preceding the 2010 election that mention either the Democratic or the Republican candidate in a House race, computes the proportion of those tweets that mention the Republican candidate, and uses that ratio to predict the election outcome. In an Op-Ed published in The Washington Post claiming to describe the research, Fabio Rojas, one of the authors, claimed that “In the 2010 data, our Twitter data predicted the winner in 404 out of 406 competitive races.” Really?
Below is Figure 1 from their paper (available on SSRN). I don’t know where Rojas was looking, but I see a lot of points on the right half of the graph — where Republican tweet share was higher than 50% — that are BELOW 0, meaning the Republican candidate LOST the election. Similarly, there are plenty of points in the left half of the graph — where the Republican tweet share was less than 50% — where the Republican candidate won the election.
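To make that reading of the graph concrete, here is a minimal Python sketch of the check I am doing by eye: treat "Republican tweet share above 50%" as a prediction that the Republican wins, and count how often that agrees with the sign of the Republican vote margin. The function names and the four toy races are made up purely for illustration; they are not the paper's data, code, or results.

```python
def tweet_share(rep_mentions, dem_mentions):
    """Share of candidate-mentioning tweets that mention the Republican."""
    total = rep_mentions + dem_mentions
    if total == 0:
        raise ValueError("no tweets mention either candidate")
    return rep_mentions / total


def naive_rule_accuracy(races):
    """races: list of (rep_tweet_share, rep_vote_margin) pairs.

    Returns the fraction of races where 'share > 0.5' agrees with
    'margin > 0', i.e. points in the upper-right or lower-left quadrant.
    """
    correct = sum((share > 0.5) == (margin > 0) for share, margin in races)
    return correct / len(races)


# Hypothetical races, NOT taken from the paper.
toy_races = [
    (tweet_share(80, 20), 12.0),   # right half, above 0: rule is right
    (tweet_share(70, 30), -5.0),   # right half, below 0: rule is wrong
    (tweet_share(30, 70), -20.0),  # left half, below 0: rule is right
    (tweet_share(40, 60), 3.0),    # left half, above 0: rule is wrong
]

print(naive_rule_accuracy(toy_races))  # 0.5 for these made-up races
```

Every point in the lower-right or upper-left quadrant of the figure is a miss for this rule, which is why the number of such points matters for any claim about prediction.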
The nice thing about the authors publishing this graph is that it gives a perfectly accurate description of the relationship between tweet share and margin of victory. They are related, but the relationship seems to be fairly weak. So if we wanted to predict election outcomes, would it make any sense to use the tweet share? We could probably do a lot better by looking at who the incumbent is, the share of the vote won in the district by the party’s last presidential candidate, or any of a host of other variables. So where does the 404 out of 406 number come from? I can only guess that Rojas was making a claim about the in-sample predictions of the full model reported in Table 1: a model that includes such important variables as whether or not there was a Republican incumbent, the proportion of votes John McCain got in the district, and the proportion of the district that is white. But do we really think the tweet share is accounting for many of those 404 correct predictions? And without having the data in hand, it’s hard to believe that even with the full model they got 404 correct predictions.
What we can see from the model they report in Table 1 is that the share of mentions in tweets seems to have some predictive power for the Republican vote margin beyond the other variables in the model. That’s interesting. But it’s very different from saying that the tweets predict the outcome.
And more importantly, does the tweet share actually influence anything? It should come as no surprise that the tweet share is correlated with the winner of the election: that is pretty much what we would expect. The winner will generally have more name recognition, spend more money, and have a more active campaign. All those things should generate more Twitter chatter. What we want to know is: does the chatter on Twitter about a candidate affect what people think of the candidate? Does it make them more, or less, likely to vote for the candidate?
These are interesting questions that my colleagues at the SMaPP lab at NYU and I are trying to answer. The basic data and analysis presented in the Indiana paper are interesting and informative. But it doesn’t help inform anyone to then make overstated claims that are so obviously contradicted by the data.