Archive | Data

OpenData Latinoamerica

Miguel Paz writes:

Poderomedia Foundation and PinLatam are launching OpenData Latinoamerica, a regional data repository to free data and use it in hackathons and other activities by HacksHackers chapters and other organizations.

We are doing this because the road to the future of news has been littered with lost datasets. A day or so after every hackathon and meeting where a group has come together to analyze, compare and understand a particular set of data, someone tries to remember where the successful files were stored. Too often, no one is certain. So Mariano Blejman and I realized that we need a central repository where you can share data that has proved to be reliable: OpenData Latinoamerica, which we are leading as ICFJ Knight International Journalism Fellows.

If you work in Latin America or Central America, your organization can take part. To apply, go to the website and fill out a simple form agreeing to meet the standard criteria for open data. Once the application is approved, you will receive an account to start adding and managing open data, becoming part of the community.

Continue Reading

The Tweets-Votes Curve

Fabio Rojas points me to this excellently-titled working paper by Joseph DiGrazia, Karissa McKelvey, Johan Bollen, and himself:

Is social media a valid indicator of political behavior? We answer this question using a random sample of 537,231,508 tweets from August 1 to November 1, 2010 and data from 406 competitive U.S. congressional elections provided by the Federal Election Commission. Our results show that the percentage of Republican-candidate name mentions correlates with the Republican vote margin in the subsequent election. This finding persists even when controlling for incumbency, district partisanship, media coverage of the race, time, and demographic variables such as the district’s racial and gender composition. With over 500 million active users in 2012, Twitter now represents a new frontier for the study of human behavior. This research provides a framework for incorporating this emerging medium into the computational social science toolkit.

One charming thing about this paper – and I know this is going to sound patronizing but I don’t mean it to be – is that the authors (or, at least, whichever subset of the authors did the statistical work) are amateurs. They analyze the outcome in terms of total votes rather than vote proportion, even while coding the predictor as a proportion. They present regression coefficients to 7 significant figures. They report that they have data from two different election cycles but present only one in the paper (but they do have the other in their blog post).

But that’s all ok. They pulled out some interesting data. And, as I often say, the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use.

Tweets and votes

As to the result itself, I’m not quite sure what to do with it. Here’s the key graph:


More tweets, more votes, indeed.
Of course most congressional elections are predictable. But the elections that are between 40-60 and 60-40, maybe not so much. So let’s look at the data there . . . Not such a strong pattern (and for the 2012 data in the 40-60% range it looks even worse; any correlation is swamped by the noise). That’s fine, and it’s not unexpected; it’s not a criticism of the paper, but it indicates that the real gain from this analysis is not in predicting votes.

I’m not so convinced that tweets will be so useful in predicting votes – most congressional elections are predictable – but perhaps the prediction tool could be more relevant in low-information or multicandidate elections where prediction is not so easy.

Instead, it might make sense to flip it around and predict twitter mentions given candidate popularity. That is, rotate the graph 90 degrees, and see how much variation there is in tweet shares for elections of different degrees of closeness. Also, while you’re at it, re-express vote share as vote proportion. And scale the size of each dot to the total number of tweets for the two candidates in the election.
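The suggested re-expression can be sketched in a few lines. The race counts below are invented for illustration (they are not the paper’s data), but the transformation is the one described above: vote proportion as the predictor, tweet share as the outcome, and a total-tweets column for sizing the dots in the rotated plot.

```python
# Hypothetical race-level counts: (rep_votes, dem_votes, rep_tweets, dem_tweets)
races = [
    (120_000,  80_000, 3_000, 1_000),
    ( 95_000, 105_000, 1_200, 1_800),
    ( 60_000, 140_000,   500, 2_500),
]

def flip(races):
    """Re-express each race as (vote proportion, tweet share, total tweets)."""
    rows = []
    for rv, dv, rt, dt in races:
        vote_prop = rv / (rv + dv)    # predictor: Republican vote proportion
        tweet_share = rt / (rt + dt)  # outcome: Republican tweet share
        total_tweets = rt + dt        # dot size in the rotated plot
        rows.append((vote_prop, tweet_share, total_tweets))
    return rows

for vote_prop, tweet_share, total in flip(races):
    print(f"vote {vote_prop:.3f} -> tweet share {tweet_share:.3f} (n={total})")
```

With the axes flipped this way, the scatter shows how much tweet shares vary at each level of electoral closeness, which is the question of interest here.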

Move away from trying to predict votes and move toward trying to understand tweets. DiGrazia et al. write, “the models show that social media matters . . .” No, not at all. They find a correlation between candidate popularity and social media mentions. No-name and fringe candidates get fewer mentions (on average) than competitive and dominant candidates. That’s fine, you can go with that.

Again, I fear the above sounds patronizing, but I don’t mean it to be. You gotta start somewhere, and you’re nowhere without the data. As someone who was (originally) an outsider to the field of political science, I do think that researchers coming from other fields can offer valuable perspectives.

Sharing the data

What I want to know is, is this dataset publicly available? What would really make this go viral is if DiGrazia et al. post the data online. Then everyone will take a hack at it, and each of those people will cite them.

There’s been a lot of talk about reproducible research lately. In this case, they have a perfect incentive to make their data public: it will help them out, it will help out the research project, and it will be a great inspiration for others to follow in their footsteps. Releasing data as a publicity intensifier: that’s my new idea.

P.S. In the first version of this post I included a graph showing votes given tweet shares between 40% and 60%. I intended this to illustrate the difficulty of predicting close elections, but my graph really missed the point, because the x-axis represented close elections in tweet shares, not in votes. So I crossed that part out. If nothing else, I’ve demonstrated the difficulty of thinking about this sort of analysis!

Continue Reading

Homicide data

Jason Kerwin writes:

You mentioned checking out the Wikipedia article on homicide rates in this recent post, which prompted me to finally put up a blog post about trying to track down the source for those rates, nominally the UN Office on Drugs and Crime. I thought you might be interested, as some of it is right up your alley:

It looks like the homicide data attributed to the WHO (which means many of the poorer countries on the list) is questionable.

I don’t know anything about this, but those of you who are interested can follow up if you’d like.

Continue Reading

A New Conflict Data Source

From an announcement from Jacob Shapiro and Joe Felter:

We are pleased to announce the launch of the Empirical Studies of Conflict Project (ESOC) website, which can be accessed at .

ESOC identifies, compiles, and analyzes micro-level conflict data and information on insurgency, civil war, and other sources of politically motivated violence worldwide. ESOC was established in 2008 by practitioners and scholars concerned by the significant barriers and upfront costs that challenge efforts to conduct careful sub-national research on conflict. The ESOC website is designed to help overcome these obstacles and to empower the quality of research needed to inform better policy and enhance security and good governance around the world.

The ESOC team includes about forty researchers (current and former) and is led by six members: Eli Berman, James D. Fearon, Joseph H. Felter, David D. Laitin, Jacob N. Shapiro, and Jeremy M. Weinstein.

The website contains data on six countries: Iraq, Afghanistan, Pakistan, Vietnam, the Philippines, and Colombia. Take a look at the data and related publications. They are providing quite a public good.

Continue Reading

Data on Cross-Country Civilian Ownership of Small Arms

In the aftermath of yesterday’s tragedy in Newtown, Connecticut, there has been a lot of discussion about the relative prevalence of gun ownership in the United States as opposed to other countries (see here, here, and here for example).

Of course, measuring gun ownership cross-nationally is complicated. One source of data is the Small Arms Survey run by the Graduate School of International Studies in Geneva, Switzerland. Here is their argument for using survey data to measure the prevalence of civilian ownership of small arms:

Ownership laws and practices vary dramatically from country to country and region to region. The most reliable information about civilian ownership comes from official registration reports. But these are incomplete everywhere, missing unregistered weapons. Many countries, moreover, do not require registration, so they have no way of directly measuring civilian ownership. The most comprehensive information on public gun inventories comes from polling and surveys. Unlike official registration data, which only covers legally owned firearms, polling can potentially reveal the approximate total of all guns in civilian hands. Because it relies on voluntary responses to very sensitive questions, however, even polling lacks great reliability. Detailed polls on gun ownership have been conducted in only a few countries, including Canada, Lebanon, and the United States. The largest poll covers some 30 to 50 countries every few years and is undertaken periodically as part of the International Crime Victimization Survey.

With that in mind, here are some figures from the most recent report of the Small Arms Survey project.

The Small Arms Survey estimates that just under three-quarters of the small arms in the world are in the hands of civilians; the other quarter are held by armed forces (23%) and law enforcement (3%).

Of those weapons in civilian hands, here is the breakdown by country, as of what looks to be 2007:

Another potentially useful source of information looks to be run by the University of Sydney. This organization does not seem to produce information so much as aggregate publicly available information. That being said, it does seem to be a very quick way to find out details about gun policy and data regarding gun ownership by country; see this page in particular.

John has been referencing academic literature about the politics of gun control in the United States (here and here). I have been unable to find anything similar in the comparative literature; most articles seem to use gun policy as an independent variable to predict rates of violence. The one exception is a twenty-year-old article in Government and Policy comparing the politics of gun control in the United States and Canada. If anyone knows of more relevant (or recent) literature on the comparative politics of gun control – or even simply interesting studies of other countries – please feel free to leave it in the comments.

Continue Reading

Evolution, Pundits and Pollsters

David Lazer has an interesting piece on the topic.

What is important is how well pollsters did in the face of increased obstacles to doing a good job: response rates to surveys have plummeted, and increasing numbers of individuals rely exclusively on (hard to reach) mobile phones. Despite these challenges, in aggregate surveys are more accurate than ever, almost spot on in 2012. How is this possible? This is worth far more reflection than a blog entry can offer, because not all communities face challenges like these so effectively. … Here I will simply speculate that it reflects three things. The first is that there is real world feedback as to the effectiveness of methods to address these challenges. … Third, there is a collective process of sifting through best practices. While there is certainly some desire to keep the secrets to success private, in fact there is a certain necessary degree of transparency in methods; and this is a small world of professional friendships where knowledge is semi-permeable, allowing a certain degree of local innovation providing short run advantage, while allowing good practices to disseminate. That is, there may be (as I have written about elsewhere) a good balance between exploration (development of new solutions) and exploitation (taking advantage of what is known to work) in this system.
…The system of pollsters might be contrasted with that of pundits. Do you expect a Darwinian culling of the right leaning pundits who missed the outcome? The answer is surely not. Nor will there be an adjustment of practices on the part of pundits who largely served up a mix of anecdotal pablum to their readers. … And how did the right get it so wrong? How could the Romney campaign of successful political professionals, in part embedded in the same epistemic community as the broader set of pollsters, not have seen an Obama victory as a plausible (put aside likely) outcome? This was not a near miss on their part. Consider: at last count, you could have subtracted 4.7 points (!) from Obama’s margin in every state and he would still have won … . Romney’s campaign, and many commentators on the right, were living in a parallel world, one with fewer minority and young voters than in ours. Again, I don’t know the answer to this question. Likely key ingredients: an authentic ambiguity in how to handle the aforementioned challenges; a strong desire to see a Romney victory; an informational ecosystem today that provides the opportunity for producing plausible sounding arguments to rationalize any wishful thoughts one might have; and the relevant subcommunity was small, centralized, and deferential enough so that a few opinion leaders could trigger a bandwagon.

As David is suggesting, this is a specific case of a more general problem – how does one build forms of collective cognition that generate useful information rather than garbage? The only thing that I would add is that it might be useful to think very slightly more explicitly about the incentives that these different communities have. As he notes, there is likely a fair degree of intellectual exchange happening among professional pollsters, producing something that roughly approximates the kinds of exploration-of-different-alternatives-by-actors-communicating-with-each-other that he has studied through simulations, and that Mason and Watts work on through experiments. There may be some tendencies towards isomorphism, but they look to be relatively mild. In contrast, professional pundits are in the business of entertaining, and producing counter-intuitive claims rather than being right. As @jimcramer rather revealingly describes the perceived incentives he faces, “No one will recall who picks Obama by 10 electorals if it turns out to be 150 margin. Believe me.” Such pundits are indifferent between wild guesses that are wrong, and safe guesses that are right – neither is likely to be remembered. Hence, they have strong incentives to make wild guesses rather than sober ones – there’s no downside to being wrong, and much upside to being right. Finally, the problems for pollsters in a campaign don’t only have to do with wishful thinking, and the bandwagoning power of a few leaders. They also likely have to do with commenters’ desire not to be seen as deviating from the collective consensus among their ideological community.
Their problem is precisely the opposite of professional pundits – deviants and iconoclasts from the prevailing wisdom are likely to be cast out if they are wrong, whereas both those who are wrong, and those who are right, are likely to continue to be employed (and to have reasonable employment chances in other campaigns) as long as they do not stray from the herd.

In short, professional pollsters have (most of the time), good incentives to be right. Professional pundits have good incentives to guess wildly, regardless of whether they are wrong or right. Political hacks have good incentives to guess safely, regardless of whether they are wrong or right. And that, arguably, is why we are where we are.

Update: Also this, from Cosma Shalizi way back in 2005:

When political scientists, say, come up with dozens of different models for predicting elections, each backed up by their own data set, the thing to do might not be to try to find the One Right Model, but instead to find good ways to combine these partial, overlapping models. The collective picture could turn out to be highly accurate, even if the component models are bad, and their combination is too complicated for individual social scientists to grasp.
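As a minimal sketch of this combination idea, averaging even crude point forecasts is straightforward. The component “models” below, their coefficients, and the inputs are all invented for illustration; the point is only the mechanics of weighting and averaging.

```python
# Three toy forecasting models, each mapping one input to a predicted
# incumbent vote share. Coefficients are made up for illustration.
def economy_model(gdp_growth):
    return 0.50 + 0.02 * gdp_growth   # vote share rises with growth

def approval_model(approval):
    return 0.30 + 0.4 * approval      # vote share rises with approval

def poll_model(poll_average):
    return poll_average               # take the polls at face value

def ensemble(forecasts, weights=None):
    """Weighted average of point forecasts; equal weights by default."""
    if weights is None:
        weights = [1 / len(forecasts)] * len(forecasts)
    return sum(w * f for w, f in zip(weights, forecasts))

preds = [economy_model(2.0), approval_model(0.50), poll_model(0.52)]
print(f"combined forecast: {ensemble(preds):.3f}")
```

The weights could come from each model’s past accuracy; even with equal weights, the averaging tends to cancel the component models’ individual errors, which is Shalizi’s point.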
Continue Reading

Geographic Data and the 2012 Election

Via both Cosma Shalizi’s Pinboard feed and an email tip, this map of the geographical origins of racist tweets in the US shows a rather striking pattern.

For a very different representation of geographic data, Mark Newman has some very nice new cartograms of voting in 2012 (Gastner, Shalizi and Newman’s original cartograms of the 2004 election, which received widespread circulation, are still available here ).

Continue Reading

Fresh Data on Obama Voter Mobilization

This is a guest post from political scientists Ryan Enos and Eitan Hersh.


The voter mobilization efforts of the Obama campaign – and the micro-targeting databases behind them – are getting a lot of attention in post-election analysis. Like hundreds of other Democratic campaigns, the Presidential campaign implemented its voter contacting strategy with the help of a company, NGP-VAN. NGP-VAN, or as campaign field operatives refer to it – “the VAN” – is a web interface that connects campaign workers to databases of voters. Each campaign provides the VAN with its voter database, and the VAN offers tools that allow staff and volunteers to log on and interact with the data. Campaign workers look up voters and make lists of targets in the VAN when they want to canvass, run phone banks, and send mail and email.

It took us a year to get them all to agree, but we are now excited to reveal initial evidence from the Ground Campaign Project (GCP) – a research endeavor done in conjunction with Obama for America, NGP-VAN, and twenty-five state Democratic parties. The project offers the most comprehensive view to date of strategic mobilization, from behind the scenes of hundreds of political campaigns.

With the help of our partners, we embedded a survey instrument into the NGP-VAN website for all users associated with the Obama campaign across the country as well as for users associated with hundreds of Senate, House, and local races in twenty-five states. From June 11 to November 6, 2012, when workers looked up voters in their campaign database, they were randomly invited to participate in our study. Interviewing campaign activists every day for nearly six months, we have built a database consisting of approximately 4,000 respondents. We will be using this dataset to answer questions about campaign strategy, political activists, and party politics.

Here, we display a few graphs that might be of immediate interest in post-election analysis.

Where was Obama’s Ground Campaign?

Every day, for the last 149 days of the general election campaign, we interviewed staffers, interns, and volunteers working for the Obama campaign. When campaign workers looked up voter contact information in their campaign database, one in 100 workers was invited to take a short questionnaire.

The map on the top shows the frequency of interviews we conducted by state. The number of interviews is a useful measure of how many total field workers were engaged in each state. Darker colors suggest that OFA was dedicating more of its ground campaign to these places. Notice that there was significant campaign activity in heavily Democratic states, like California. In the map on the bottom, we adjust the numbers based on how many electoral votes are up for grabs in each state. This map reflects the Obama campaign’s attention to a state, given the size of the Electoral College prize (colors indicate the ratio of the proportion of campaign activity in the state to the proportion of the total Electoral College votes in the state). When this adjustment is made, the activity in states like California largely disappears.
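The bottom map’s adjustment can be sketched as a ratio computation. The state counts below are invented for illustration (they are not the GCP interview data); the calculation is the one described in the parenthetical: each state’s share of campaign activity divided by its share of Electoral College votes, with values above 1 indicating disproportionate attention.

```python
# Hypothetical interview counts per state and 2012 Electoral College votes.
interviews = {"CA": 300, "FL": 400, "OH": 350}
electoral_votes = {"CA": 55, "FL": 29, "OH": 18}

total_interviews = sum(interviews.values())
total_ev = sum(electoral_votes.values())

# Ratio of activity share to Electoral College share, per state.
adjusted = {
    state: (interviews[state] / total_interviews)
           / (electoral_votes[state] / total_ev)
    for state in interviews
}

for state, ratio in sorted(adjusted.items(), key=lambda kv: -kv[1]):
    print(f"{state}: {ratio:.2f}")
```

Under these invented numbers, a large safe state like California falls below 1 after the adjustment, matching the pattern described for the bottom map.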

Our maps will generally conform to other measures of Obama’s state-by-state strategy, such as measures of money, TV ad buys, campaign offices, and candidate visits. However, the measure here is unique in the sense that it is based on internal campaign data.  This unique data source may prove to be especially revealing because the Obama campaign has relied so heavily on targeted voter mobilization efforts.  These data will allow us to directly observe the mobilization, whereas data on television advertisements, for example, would not.

One lesson we draw from the map is that the campaign took more of an offensive strategy than a defensive one. Even through the end of the campaign season, the campaign was very active in Florida and North Carolina, and never engaged many field staff in Michigan and Wisconsin.

Below you can see an animation of the campaign’s increasing intensity and focus over the last six months.

Can Campaign Workers Predict Electoral Closeness?

We asked campaign workers to estimate how much Obama would win or lose by in the state in which they were working. Below, we see the result for the state of Florida, where we interviewed over 400 workers in the last six months of the campaign. Plotted along with the campaign workers’ predictions is the RealClearPolitics polling average, aggregated by week. The Obama campaign workers overshot the polls by about 8 points in their estimates. For this reason, we include a line in which we shift the poll trend vertically so that we can compare the relative changes across time. We learn both that the campaign workers were always more confident than the polls suggested, and that the campaign workers seemed to sense similar changes in their prospects as indicated by the polls.  This sort of analysis will be useful in understanding how the campaigns’ internal assessments of closeness affect their strategy.
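The vertical-shift comparison can be sketched as follows. The weekly numbers are invented for illustration (they are not the Florida data): the poll trend is shifted up by the mean worker-poll gap, so that the workers’ constant optimism is removed and only the relative week-to-week movement of the two series is compared.

```python
# Hypothetical weekly series: predicted Obama margin in points.
worker_estimates = [6.0, 7.5, 5.0, 8.0, 9.0]   # campaign workers' estimates
poll_average     = [-1.0, 0.5, -2.5, 0.0, 1.5]  # polling average, same weeks

# Average amount by which workers overshot the polls.
mean_gap = sum(w - p for w, p in zip(worker_estimates, poll_average)) \
           / len(poll_average)

# Shift the poll trend vertically by that gap to compare relative changes.
shifted_polls = [p + mean_gap for p in poll_average]

print(f"workers overshot the polls by {mean_gap:.1f} points on average")
for w, s in zip(worker_estimates, shifted_polls):
    print(f"worker {w:+.1f}  shifted poll {s:+.1f}  residual {w - s:+.1f}")
```

After the shift, the residuals average to zero by construction, so what remains visible is whether the workers’ estimates rose and fell in the same weeks as the polls.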

Continue Reading