The Tweets-Votes Curve

by Andrew Gelman on April 24, 2013

in Data, Media

Fabio Rojas points me to this excellently titled working paper by Joseph DiGrazia, Karissa McKelvey, Johan Bollen, and himself:

Is social media a valid indicator of political behavior? We answer this question using a random sample of 537,231,508 tweets from August 1 to November 1, 2010 and data from 406 competitive U.S. congressional elections provided by the Federal Election Commission. Our results show that the percentage of Republican-candidate name mentions correlates with the Republican vote margin in the subsequent election. This finding persists even when controlling for incumbency, district partisanship, media coverage of the race, time, and demographic variables such as the district’s racial and gender composition. With over 500 million active users in 2012, Twitter now represents a new frontier for the study of human behavior. This research provides a framework for incorporating this emerging medium into the computational social science toolkit.

One charming thing about this paper (and I know this is going to sound patronizing, but I don’t mean it to be) is that the authors, or at least whatever subset of the authors did the statistical work, are amateurs. They analyze the outcome in terms of total votes rather than vote proportion, even while coding the predictor as a proportion. They present regression coefficients to 7 significant figures. They report that they have data from two different election cycles but present only one in the paper (though they do have the other in their blog post).

But that’s all ok. They pulled out some interesting data. And, as I often say, the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use.

Tweets and votes

As to the result itself, I’m not quite sure what to do with it. Here’s the key graph:

[Figure: the paper’s key scatterplot, plotting election results against the Republican share of tweet mentions]

More tweets, more votes, indeed.
Of course, most congressional elections are predictable. But the elections that fall between 40-60 and 60-40, maybe not so much. So let’s look at the data there . . . Not such a strong pattern (and for the 2012 data in the 40-60% range it looks even worse; any correlation is swamped by the noise). That’s fine and not unexpected; it’s not a criticism of the paper, but it does indicate that the real gain from this analysis is not in predicting votes.

I’m not convinced that tweets will be so useful in predicting votes. Most congressional elections are predictable, but perhaps the tool could be more relevant in low-information or multicandidate elections, where prediction is not so easy.

Instead, it might make sense to flip it around and predict Twitter mentions given candidate popularity. That is, rotate the graph 90 degrees and see how much variation there is in tweet shares for elections of different degrees of closeness. Also, while you’re at it, re-express the vote margin as a vote proportion. And scale the size of each dot to the total number of tweets for the two candidates in the election.
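To make that concrete, here is a minimal sketch in R of what the flipped plot and regression might look like. The data are simulated purely to show the shape of the analysis; the data frame and column names are placeholders, not the authors’ dataset.

# Flipped analysis: tweet share as the outcome, vote proportion as the
# predictor, dot size scaled by the total number of tweets in the race.
library(ggplot2)

set.seed(1)
n <- 400
races <- data.frame(
  vote_prop    = rbeta(n, 5, 5),          # Republican share of two-party vote (simulated)
  total_tweets = rpois(n, lambda = 2000)  # total tweets about both candidates (simulated)
)
# Fake tweet shares loosely tied to vote shares, just to draw the picture
races$tweet_share <- plogis(qlogis(races$vote_prop) + rnorm(n, 0, 1))

# Rotated view: vote proportion on the x-axis, tweet share on the y-axis
ggplot(races, aes(x = vote_prop, y = tweet_share)) +
  geom_point(aes(size = total_tweets), alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Republican share of the two-party vote",
       y = "Republican share of candidate-name tweets",
       size = "Total tweets")

# The corresponding regression: predicting tweets from votes, not votes from tweets
summary(lm(tweet_share ~ vote_prop, data = races, weights = total_tweets))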

Move away from trying to predict votes and move toward trying to understand tweets. DiGrazia et al. write, “the models show that social media matters . . .” No, not at all. They find a correlation between candidate popularity and social media mentions. No-name and fringe candidates get fewer mentions (on average) than competitive and dominant candidates. That’s fine, you can go with that.

Again, I fear the above sounds patronizing, but I don’t mean it to be. You gotta start somewhere, and you’re nowhere without the data. As someone who was (originally) an outsider to the field of political science, I do think that researchers coming from other fields can offer valuable perspectives.

Sharing the data

What I want to know is, is this dataset publicly available? What would really make this go viral is if DiGrazia et al. post the data online. Then everyone will take a hack at it, and each of those people will cite them.

There’s been a lot of talk about reproducible research lately. In this case, they have a perfect incentive to make their data public: it will help them out, it will help out the research project, and it will be a great inspiration for others to follow in their footsteps. Releasing data as a publicity intensifier: that’s my new idea.

P.S. In the first version of this post I included a graph showing votes given tweet shares between 40% and 60%. I intended this to illustrate the difficulty of predicting close elections, but my graph really missed the point, because the x-axis represented close elections in tweet shares, not in votes. So I crossed that part out. If nothing else, I’ve demonstrated the difficulty of thinking about this sort of analysis!

Comments

Squarely Rooted April 24, 2013 at 10:57 am

Maybe this is just me being whimsical, but here’s the regression I’d like to see: the difference between the expected and actual vote share (i.e., the difference between the outcome predicted by the national result and the district’s partisan lean, and the actual outcome) regressed on the tweet share.
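A rough sketch of that regression in R, with all numbers and variable names made up purely for illustration (none of them come from the paper):

# Simulated placeholder data for the suggested regression.
set.seed(2)
n <- 400
d <- data.frame(district_lean   = rnorm(n, 0, 0.08),  # district's partisan lean
                rep_tweet_share = runif(n))           # Republican share of tweets
national_rep_share <- 0.52                            # made-up national baseline
d$expected_share <- national_rep_share + d$district_lean
d$actual_share   <- d$expected_share + rnorm(n, 0, 0.04)  # fake election outcome
d$surprise       <- d$actual_share - d$expected_share

# How well does tweet share track over- or under-performance of expectations?
summary(lm(surprise ~ rep_tweet_share, data = d))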

Joe from NY April 24, 2013 at 11:14 am

What is the added value for readers of the blog to know that you think they are amateurs? That’s an unnecessary insult.
If the paper is as half-baked as you make it look, why put it on the blog and not just send them an email with your comments and give them a go a week later? Sorry if that sounds patronizing, it isn’t meant to be.

Andrew Gelman April 24, 2013 at 11:33 am

Joe:

1. “Amateur” is not an insult; it’s a description. I’m an amateur at many things myself. One of the points of my post is that it’s good when amateurs try things out; they can bring in data sources and insights that the pros can miss. Professionals in any field can get in a rut.

2. Please follow the links. The authors already posted the paper on their blog. I’m commenting on a paper that’s already been made public.

3. Your comment to me did not sound patronizing, it just looks like you didn’t read my post carefully. But that’s ok, I know our blog readers are busy. You can take what you want from our posts.

Fabio Rojas April 24, 2013 at 11:38 pm

Andrew: You definitely have the right to critique the paper based on argument or data analysis – and I have definitely critiqued your work in the past! But I am curious: in what sense am I or my co-authors amateurs? As folks who have published in computer science and social science journals, we find the comment odd, and reading the entire post doesn’t give me a better sense of your intended meaning.

Andrew Gelman April 25, 2013 at 5:07 pm

Fabio:

I meant that you are amateurs in two senses. First, I don’t think you’re political scientists. Second, many of the choices you made in your statistical analysis seem to me the sorts of choices that would be made by people without much experience analyzing election data (for example, putting tweet proportions and raw votes in the same model, presenting regression coefficients to 7 significant figures, and having two data sets but just showing the analysis of one of them).

Again, “amateur” is not a bad thing. I recently wrote a philosophy paper. I’m an amateur in philosophy, and I emphasized in that paper that I was providing a statistician’s perspective. In this case, you’re providing a sociologist’s perspective, which could well be a good thing here.

Joseph DiGrazia April 25, 2013 at 6:29 pm

Hi Andrew. You are correct that we are not political scientists (though Fabio and I are political sociologists). But, I thought I could shed light on some of the issues that seem to be concerning you and explain why they appear the way they do.

Regarding the 7 digits: We used a function in R to export the tables directly into LaTeX. Yes, it is true that we uploaded that draft to SSRN back in February without cleaning that up. At the time we uploaded it to make it easier for us to share it with colleagues and get feedback – not with the intent for it to be widely distributed. Perhaps we should have remembered to fix that before linking to it on the blog, but oversights happen. In short, we had 7 digits in the table because we hadn’t gotten around to fixing it, not because we didn’t know any better. We have obviously published quantitative work in the past with properly formatted tables.

Regarding the two datasets: the SSRN paper was written several months ago before we had finished compiling and analyzing the 2012 data set. We included the newer graph on the blog just to give people a sense of what the 2012 data looked like.

Regarding the vote margin vs share issue, I discuss that in my comment below.

Fabio Rojas April 29, 2013 at 12:06 am

Ok, Andrew, I see your drift. On the first count, not one of us is a PhD-holding political scientist, but Joe and I have actually published in political science journals. At least we can fake it among political scientists! That’s gotta count for something.

Second, you are right that we are not election specialists. Joe and I are political movements scholars, while Karissa and Johan are computer scientists/psychologists. But you might actually be interested in knowing that the only reason the paper “works” (= we get an important result) is that we clearly and obviously copied a key insight from the political science literature. We used “two-party tweet share,” which is a direct analog of “two-party vote share.” Earlier efforts had shown that aggregate tweet volume has no significant effect on election outcomes.

That’s the reason that this result is important. Earlier efforts to link social media and election outcomes were hampered by multiple issues. We are able to make some progress by using two political science tricks: use a panel of congressional elections with standard control variables, and use two-party share measures.
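In code terms, the measure is simply the same ratio computed on tweet counts instead of vote counts; the counts below are made up just to illustrate:

# Two-party share: the same construction for tweets as for votes.
two_party_share <- function(rep_count, dem_count) rep_count / (rep_count + dem_count)

two_party_share(rep_count = 1200,   dem_count = 800)      # tweet mentions -> 0.6
two_party_share(rep_count = 150000, dem_count = 100000)   # votes          -> 0.6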

Viva the amateurs!

Andrew Gelman April 29, 2013 at 10:04 am

Fabio:

I agree – and I say this as a non-sociologist who’s published in sociology journals.

RobC April 24, 2013 at 3:53 pm

Rather than “amateurs,” how about “undocumented academics”? They’re doing the work American academics won’t.

Joseph DiGrazia April 24, 2013 at 4:40 pm

@RobC – Actually, we are academics. Andrew is just calling us “amateurs” because of the superficial issues he identifies in the first paragraph. The draft linked to on SSRN (which we uploaded in February) contains a slightly different analysis than the one that produced the figure in the original blog post. This explains the discrepancy between the blog and the PDF.

John Sides April 24, 2013 at 5:24 pm

Joe: I’ve looked at the paper too. Have you guys done models with vote share (i.e., share of two-party vote) rather than vote margin? That’s how it’s typically done in polisci. In the 2010 model, have you tried controlling for 2008 vote share and the balance of campaign spending between the two candidates? And have you tried a sort of placebo test where you see if 2010 tweets predict 2008 vote share, conditional on covariates? Just thinking about ways that you can shore up confidence in this correlation.
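A sketch of those specifications in R, with hypothetical variable names rather than the paper’s actual models:

# Sketch only; the data frame `d` and all column names are hypothetical.

# Two-party vote share as the outcome, adding the lagged congressional vote
# share and the balance of campaign spending to the paper's controls:
f_share <- vote_share_2010 ~ tweet_share_2010 + vote_share_2008 +
  incumbency + spending_balance + district_partisanship + pct_white + pct_female

# Placebo check: 2010 tweet share "predicting" the 2008 outcome. A large
# coefficient here would hint that the tweet measure mostly reflects stable
# district characteristics rather than 2010-specific sentiment.
f_placebo <- vote_share_2008 ~ tweet_share_2010 +
  incumbency + district_partisanship + pct_white + pct_female

# With a district-level data frame `d` in hand:
# summary(lm(f_share, data = d))
# summary(lm(f_placebo, data = d))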

Joseph DiGrazia April 24, 2013 at 5:54 pm

@John Sides – Yes, we have run the models using vote share as the DV – in fact, that is what we used, initially, in earlier versions of the analysis. Substantively, the results were no different. We made the choice to switch to vote margin because we found interpretations relating to changes in actual vote counts to be compelling. We recognized at the time that this was not standard in the political science literature.

We did control for the 2008 presidential vote share, but we did not control for the 2008 congressional vote share (though we did account for incumbency), nor did we account for spending. Your suggestion regarding using the 2008 election as a test case is interesting and we will have to look into it.

Even so, our claim is that Twitter can be a reliable indicator of public sentiment toward candidates in the run-up to the election. We make no claims about whether other processes (e.g., campaign spending) may be driving these attitudes. Nor, as Andrew suggested in his post, do we attempt to predict election outcomes in the paper. We are simply trying to show that it is possible to construct a metric from social media data that is reliably correlated with real-world, offline behavior.

Andrew Gelman April 24, 2013 at 7:58 pm

Joseph:

I was speaking statistically. You fit a regression model in which the dependent variable is the election outcome. In statistical terminology, you are predicting outcomes. I suggested you might get some interesting results if you use candidate popularity (for example, as measured by the election outcome) to predict the tweets.

Michael Frank April 25, 2013 at 4:40 pm

A couple of thoughts on the post and the paper. First, I agree with Andrew that it would be nice to see Twitter mentions as the dependent variable. As a form of political behavior, tweeting about a candidate seems somewhat equivalent to trying to influence how someone votes. Perhaps something like the closeness of the election in that district would be a good reason to expect variation in Twitter mentions. Second, I’d also be interested in a discussion of Twitter mentions coming from fake profiles in each campaign. Does the impact of mentions depend on the proportion of tweets that are “real,” or does “campaign advertising” have a measurable effect on vote share? Just a couple of initial thoughts based on an intriguing new data set.

Lawrence Zigerell April 29, 2013 at 11:51 am

The manuscript claims that the positive correlation between tweet share and electoral outcomes is evidence that “all publicity is good publicity”. But it appears that the evidence warrants only the claim that “the sum of all publicity is good publicity”. The fact that the correlation between overall tweet share and electoral outcomes is positive cannot be used to support a claim that each and every tweet adds to electoral vote share. The claim about all tweets having a positive effect requires at least a separate analysis of negative tweets in isolation.
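One way to probe that claim would be to let positive and negative mentions enter the model separately rather than as one combined share; a purely hypothetical sketch (names are placeholders, not the paper’s variables):

# Split mentions by sentiment instead of pooling them into one tweet share.
f_sentiment <- vote_share_2010 ~ pos_tweet_share + neg_tweet_share +
  incumbency + district_partisanship
# With a district-level data frame `d` in hand:
# summary(lm(f_sentiment, data = d))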

Andrew Gelman April 29, 2013 at 11:55 am

Lawrence:

Indeed, zero evidence is presented that tweets have any effect at all, in isolation or otherwise.

