Fabio Rojas points me to this excellently-titled working paper by Joseph DiGrazia, Karissa McKelvey, Johan Bollen, and himself:
Is social media a valid indicator of political behavior? We answer this question using a random sample of 537,231,508 tweets from August 1 to November 1, 2010 and data from 406 competitive U.S. congressional elections provided by the Federal Election Commission. Our results show that the percentage of Republican-candidate name mentions correlates with the Republican vote margin in the subsequent election. This finding persists even when controlling for incumbency, district partisanship, media coverage of the race, time, and demographic variables such as the district’s racial and gender composition. With over 500 million active users in 2012, Twitter now represents a new frontier for the study of human behavior. This research provides a framework for incorporating this emerging medium into the computational social science toolkit.
One charming thing about this paper (and I know this is going to sound patronizing but I don’t mean it to be) is that the authors (or, at least, whichever subset of the authors did the statistical work) are amateurs. They analyze the outcome in terms of total votes rather than vote proportion, even while coding the predictor as a proportion. They present regression coefficients to 7 significant figures. They report that they have data from two different election cycles but present only one in the paper (but they do have the other in their blog post).
But that’s all ok. They pulled out some interesting data. And, as I often say, the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use.
Tweets and votes
As to the result itself, I’m not quite sure what to do with it. Here’s the key graph:
More tweets, more votes, indeed.
Of course most congressional elections are predictable. But the elections that are between 40-60 and 60-40, maybe not so much. So let’s look at the data there . . . Not such a strong pattern (and for the 2012 data in the 40-60% range it looks even worse; any correlation is swamped by the noise). That’s fine, and it’s not unexpected, it’s not a criticism of the paper but it indicates that the real gain in this analysis is not for predicting votes.
I’m not so convinced that tweets will be so useful in predicting votes—most congressional elections are predictable, but perhaps the prediction tool could be more relevant in low-information or multicandidate elections where prediction is not so easy.
Instead, it might make sense to flip it around and predict twitter mentions given candidate popularity. That is, rotate the graph 90 degrees, and see how much variation there is in tweet shares for elections of different degrees of closeness. Also, while you’re at it, re-express vote share as vote proportion. And scale the size of each dot to the total number of tweets for the two candidates in the election.
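Here is a rough sketch of what that flipped-around summary could look like, in Python. Everything in it is an assumption for illustration: the five races are made-up numbers (not the DiGrazia et al. data), the function names are mine, and the cutoffs for "close," "lean," and "safe" are arbitrary. The idea is just to convert vote totals to a vote proportion, group races by how close they were, and then look at the spread of tweet shares within each group, keeping the total tweet count around as the weight you would use to size the dots.

```python
from statistics import mean, pstdev

# Made-up illustrative races, NOT real data:
# (republican_votes, democratic_votes, republican_tweets, democratic_tweets)
RACES = [
    (52000, 48000, 300, 200),
    (49000, 51000, 150, 350),
    (70000, 30000, 900, 100),
    (35000, 65000, 50, 450),
    (57000, 43000, 400, 400),
]


def vote_proportion(rep_votes, dem_votes):
    """Republican share of the two-party vote (a proportion, not a total)."""
    return rep_votes / (rep_votes + dem_votes)


def tweet_share(rep_tweets, dem_tweets):
    """Republican share of the two candidates' name mentions."""
    return rep_tweets / (rep_tweets + dem_tweets)


def closeness_bin(proportion):
    """Label a race by distance from 50-50. The cutoffs are arbitrary."""
    margin = abs(proportion - 0.5)
    if margin <= 0.05:
        return "close"
    if margin <= 0.10:
        return "lean"
    return "safe"


def summarize(races):
    """Group races by closeness of the vote; summarize tweet shares per group."""
    groups = {}
    for rep_v, dem_v, rep_t, dem_t in races:
        label = closeness_bin(vote_proportion(rep_v, dem_v))
        # Keep the total tweet count: it is the natural dot size in a plot.
        groups.setdefault(label, []).append(
            (tweet_share(rep_t, dem_t), rep_t + dem_t)
        )
    summary = {}
    for label, rows in groups.items():
        shares = [share for share, _ in rows]
        summary[label] = {
            "n": len(rows),
            "mean_tweet_share": mean(shares),
            "sd_tweet_share": pstdev(shares),
            "total_tweets": sum(total for _, total in rows),
        }
    return summary


if __name__ == "__main__":
    for label, stats in summarize(RACES).items():
        print(label, stats)
```

With real data, the interesting quantity would be how the spread of tweet shares changes as you move from the close races to the safe ones.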
Move away from trying to predict votes and move toward trying to understand tweets. DiGrazia et al. write, “the models show that social media matters . . .” No, not at all. They find a correlation between candidate popularity and social media mentions. No-name and fringe candidates get fewer mentions (on average) than competitive and dominant candidates. That’s fine, you can go with that.
Again, I fear the above sounds patronizing, but I don’t mean it to be. You gotta start somewhere, and you’re nowhere without the data. As someone who was (originally) an outsider to the field of political science, I do think that researchers coming from other fields can offer valuable perspectives.
Sharing the data
What I want to know is, is this dataset publicly available? What would really make this go viral is if DiGrazia et al. post the data online. Then everyone will take a hack at it, and each of those people will cite them.
There’s been a lot of talk about reproducible research lately. In this case, they have a perfect incentive to make their data public: it will help them out, it will help out the research project, and it will be a great inspiration for others to follow in their footsteps. Releasing data as a publicity intensifier: that’s my new idea.
P.S. In the first version of this post I included a graph showing votes given tweet shares between 40% and 60%. I intended this to illustrate the difficulty of predicting close elections, but my graph really missed the point, because the x-axis represented close elections in tweet shares, not in votes. So I crossed that part out. If nothing else, I’ve demonstrated the difficulty of thinking about this sort of analysis!