No, You Can’t Predict US Congressional Election Outcomes with Tweet Shares: But That Doesn’t Mean You Shouldn’t Try

by Joshua Tucker on August 16, 2013 · 5 comments

in Campaigns and elections, Social Media

The following is a guest post from my colleague, NYU political scientist Jonathan Nagler. In the interest of full disclosure, he and I are both Co-Directors of the NYU Social Media and Political Participation (SMaPP) laboratory.

*****

A group of sociologists at Indiana University recently claimed to have shown that “tweets predict elections”. The research takes the tweets from the 3 months preceding the 2010 election that mentioned either the Democratic or the Republican candidate in a House race, computes the proportion of those tweets that mentioned the Republican candidate, and uses that ratio to predict the election outcome. In an Op-Ed published in The Washington Post claiming to describe the research, Fabio Rojas, one of the authors, claimed that “In the 2010 data, our Twitter data predicted the winner in 404 out of 406 competitive races.” Really?

Below is Figure 1 from their paper (available on SSRN). I don’t know where Rojas was looking, but I see a lot of points on the right half of the graph, where Republican tweet share was higher than 50%, that are BELOW 0, meaning the Republican candidate LOST the election. Similarly, there are plenty of points in the left half of the graph, where the Republican tweet share was less than 50%, where the Republican candidate won the election.

[Figure 1 from the paper: Republican vote margin plotted against Republican tweet share]
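The eyeball check described above (counting points in the "wrong" quadrants of Figure 1) can be sketched in a few lines. This is a minimal illustration, not the authors' method: the `races` values are invented for the example and are not taken from the paper, and `tweet_rule_hits` is a hypothetical helper name.

```python
# Hypothetical (tweet_share, margin) pairs, invented for illustration only.
# tweet_share: Republican share of candidate mentions, in percent.
# margin: Republican vote margin in points (negative means the Republican lost).
races = [
    (72.0, 18.5),
    (64.0, -6.0),
    (55.0, 3.2),
    (51.0, -12.4),
    (48.0, 9.1),
    (41.0, -15.0),
    (33.0, 2.5),
    (25.0, -30.0),
]


def tweet_rule_hits(races):
    """Count races where 'tweet share above 50%' matches 'Republican won'."""
    hits = 0
    for tweet_share, margin in races:
        predicted_republican_win = tweet_share > 50.0
        actual_republican_win = margin > 0.0
        if predicted_republican_win == actual_republican_win:
            hits += 1
    return hits


hits = tweet_rule_hits(races)
print(f"{hits} of {len(races)} races called correctly by tweet share alone")
```

With these made-up values the rule calls 4 of 8 races correctly. Every miss is a point in the lower-right or upper-left quadrant of a plot like Figure 1, which is exactly what a claim of 404 out of 406 from tweet share alone would leave almost no room for.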

The nice thing about the authors publishing this graph is that it gives a perfectly accurate description of the relationship between tweet-share and margin of victory. They are related, but the relationship seems to be fairly weak. So if we wanted to predict election outcomes, would it make any sense to use the tweet-share? We could probably do a lot better looking at who the incumbent is, the share of the vote won in the district by the party’s last presidential candidate, or any of a host of other variables. So where does the 404 out of 406 number come from? I can only guess that Rojas was making a claim about the in-sample predictions of the full model reported in Table 1: a model that includes such important variables as whether or not there was a Republican incumbent, the proportion of votes John McCain got in the district, and the proportion of the district that is white. But do we really think the tweet-share is accounting for many of those 404 correct predictions? And without having the data in hand, it’s hard to believe that they got 404 correct predictions even with the full model.

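[Table 1 from the paper: full model of Republican vote margin, including Republican tweet share, Republican incumbency, McCain vote share, and proportion of the district that is white]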

What we can see from the model they report in Table 1 is that the share of mentions in tweets seems to have some predictive power for the Republican vote margin beyond the other variables in the model. That’s interesting. But it’s very different from saying that the tweets predict the outcome.
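To make the distinction concrete, here is a rough sketch of how one might compare in-sample fit with and without tweet share. It assumes an ordinary least squares fit, which this post does not confirm is what the paper used. The data rows are invented for illustration, and `solve` and `r_squared` are hypothetical helpers written for this example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(rows, y):
    """In-sample R^2 of an OLS fit of y on the columns of rows plus an intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * xi for b, xi in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Invented rows: (Republican incumbent, McCain share, Republican tweet share, margin).
data = [
    (1, 0.58, 0.70, 22.0),
    (1, 0.52, 0.45, 9.0),
    (0, 0.49, 0.62, -3.0),
    (0, 0.41, 0.38, -18.0),
    (1, 0.55, 0.57, 15.0),
    (0, 0.46, 0.51, -8.0),
    (0, 0.38, 0.30, -25.0),
    (1, 0.50, 0.66, 11.0),
]
margin = [row[3] for row in data]

r2_base = r_squared([row[:2] for row in data], margin)  # incumbency + McCain share
r2_full = r_squared([row[:3] for row in data], margin)  # ... plus tweet share
print(f"R^2 without tweet share: {r2_base:.3f}")
print(f"R^2 with tweet share:    {r2_full:.3f}")
```

Adding a regressor can never lower in-sample R^2, so some gain from tweet share is guaranteed by construction. The more informative questions are how large the gain is relative to the baseline and whether it survives on races the model was not fit to.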

And more importantly, does the tweet share actually influence anything? It might come as no surprise that the tweet share is correlated with the winner of the election: that is pretty much what we would expect. The winner will generally have more name-recognition, spend more money and have a more active campaign. All those things should generate more Twitter chatter. What we want to know is: does the chatter on Twitter about a candidate affect what people think of the candidate? Does it make them more, or less, likely to vote for the candidate?

These are interesting questions that my colleagues at the SMaPP lab at NYU and I are trying to answer. The basic data and analysis presented in the Indiana paper are interesting and informative. But it doesn’t help inform anyone to then make overstated claims that are so obviously contradicted by the data.

{ 5 comments }

Andrew Gelman August 17, 2013 at 4:04 am

Hi–we discussed this paper on the Monkey Cage a few months ago. As I wrote at the time:

I’m not so convinced that tweets will be so useful in predicting votes: most congressional elections are predictable, but perhaps the prediction tool could be more relevant in low-information or multicandidate elections where prediction is not so easy.

Instead, it might make sense to flip it around and predict Twitter mentions given candidate popularity. That is, rotate the graph 90 degrees, and see how much variation there is in tweet shares for elections of different degrees of closeness. Also, while you’re at it, re-express vote share as vote proportion. And scale the size of each dot to the total number of tweets for the two candidates in the election.

Move away from trying to predict votes and move toward trying to understand tweets. DiGrazia et al. write, “the models show that social media matters . . .” No, not at all. They find a correlation between candidate popularity and social media mentions. No-name and fringe candidates get fewer mentions (on average) than competitive and dominant candidates. That’s fine, you can go with that.

Mihai Martoiu Ticu August 17, 2013 at 4:26 am

==The winner will generally have more name-recognition, spend more money and have a more active campaign.==

Oh. I thought that people were listening to the arguments and decided rationally which arguments were better. Now I understand that the amount of money spent, active campaigning and name-recognition have influence on the outcome. Does that mean that people are irrational and that there is no democracy?

Andrew Gelman August 18, 2013 at 4:53 am

Mihai:

It does not make sense to characterize people as “rational” or “irrational.” Rationality is a way of being (or, more precisely, various ways of being). People make decisions through a mix of rational and irrational processes.

A August 17, 2013 at 9:15 am

Yes, that is precisely — exactly — what it means.

Fr. August 19, 2013 at 5:36 am

The crucial bit is “without having the data in hand”. Without having the data in hand, the whole reading experience of the preliminary findings is clouded by missing bits of information.
