Archive | Data

Valuable New Dataset of Constituency Level Election Results

Anyone who has ever tried to gather constituency-level election data cross-nationally should be very pleased by the following announcement I received earlier this week:

American University’s School of Public Affairs is pleased to announce the launch of Election Passport, a new online resource providing free access to a rich dataset of constituency election results from over 80 countries around the world.

The goal of Election Passport is to enable researchers and students to engage in high-level analysis of elections in countries for which data are not easily available. From Andorra to Zambia, this site provides unusually complete data sets that include votes won by very small parties, independents, and, frequently, candidate names that are difficult to locate. As an ongoing project, additional elections will be regularly added.

Election Passport was developed by David Lublin, Professor of Government in the School of Public Affairs at American University, with the support of AU’s Center for Latin American and Latino Studies and the German Marshall Fund of the U.S.

We hope that you will find this to be a valuable resource and encourage you to share this announcement with your colleagues. Please contact David Lublin at (202) 885-2913 should you have any questions.

If you have tried the dataset already, please feel free to leave any observations in the comments below. Should be valuable for scholars and policy makers alike!


New Data on Ideology and Money in Politics

Adam Bonica writes:

I am pleased to announce the public release of the Database on Ideology, Money in Politics, and Elections (DIME). The database was initially developed as part of the project on Ideology in the Political Marketplace, which is an ongoing effort to conduct a comprehensive mapping of the ideology of political elites, interest groups, and donors using the common-space CFscore scaling methodology. [JMS: For details, see here.] It includes records for over 100 million political contributions made by individuals and organizations to local, state, and federal elections spanning a period from 1979 to 2012. A corresponding database of candidates and committees provides additional information on state and federal elections. In addition, the database includes common-space ideal points for a comprehensive set of candidates for state and federal office, interest groups, and individual donors.

What this translates into:

The common-space CFscores allow for direct distance comparisons of the ideal points of a wide range of political actors from state and federal politics. In total, the database includes ideal point estimates for 51,572 candidates and 6,408 political committees as recipients and 13.7 million individuals and 1.3 million organizations as donors.

Here, “ideal points” means an estimate of ideology.  In essence, Bonica has developed an innovative way to measure the relative liberalism or conservatism of millions of political actors.  It’s a treasure trove, and I hope people will make use of it.  You can see some of the ways in which he’s used these data at his blog.
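To make the phrase "direct distance comparisons" concrete, here is a minimal sketch in Python. The actors and CFscore values below are invented for illustration; real scores come from the DIME data itself.

```python
# Hypothetical common-space CFscores (more negative = more liberal,
# more positive = more conservative). These values are made up.
cfscores = {
    "Candidate A": -1.2,
    "Donor B": -0.4,
    "Interest Group C": 0.9,
}

def ideological_distance(actor1, actor2, scores=cfscores):
    """Absolute distance between two actors' ideal points on the common scale."""
    return abs(scores[actor1] - scores[actor2])

# Because candidates, donors, and groups are all on one common scale,
# any pair can be compared directly.
print(round(ideological_distance("Candidate A", "Donor B"), 2))       # 0.8
print(round(ideological_distance("Donor B", "Interest Group C"), 2))  # 1.3
```

The point of the common-space scaling is precisely that such comparisons across actor types are meaningful at all.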


One Irony of Nate Silver’s Leaving the New York Times

Nate Silver’s imminent departure from the New York Times to ABC and ESPN—see details in the links within this Jack Shafer post—has elicited stories of hostility to Silver within the Times newsroom.  The Times public editor, Margaret Sullivan, writes:

His entire probability-based way of looking at politics ran against the kind of political journalism that The Times specializes in: polling, the horse race, campaign coverage, analysis based on campaign-trail observation, and opinion writing, or “punditry,” as he put it, famously describing it as “fundamentally useless.” Of course, The Times is equally known for its in-depth and investigative reporting on politics.

His approach was to work against the narrative of politics – the “story” – and that made him always interesting to read. For me, both of these approaches have value and can live together just fine.

A number of traditional and well-respected Times journalists disliked his work. The first time I wrote about him I suggested that print readers should have the same access to his writing that online readers were getting. I was surprised to quickly hear by e-mail from three high-profile Times political journalists, criticizing him and his work. They were also tough on me for seeming to endorse what he wrote, since I was suggesting that it get more visibility.

I had a similar experience the night of the Iowa caucus in 2012.  Lynn Vavreck and I were in the lobby of the Des Moines Marriott talking to a senior Times reporter when the subject of Silver came up.  The reporter went on a rant about how Silver did not know things because he hadn’t been in the field as reporters are, about Silver’s “models,” and about how Silver could talk about polls that did not meet the Times polling standards (a fact that the Times polling editor Kate Phillips also referred to as “a problem” in a Twitter exchange with political scientist Daniel Smith).

The problem with this critique of Silver is that, if you follow his work closely or read his book, he’s extremely cautious about what data and modeling can accomplish.  Moreover, if you read his book closely, he is quite clear that the kind of data reporters (or baseball scouts) often gather—which is qualitative, not quantitative—is exceedingly valuable for doing what he does. This is a point about the book that hasn’t received much emphasis in the reviews I’ve read, even though the potential value of qualitative data in quantitative forecasts is well-established (see, for example, Bruce Bueno de Mesquita’s approach).

For example, Silver developed a quantitative system for evaluating baseball players, PECOTA, that you might think was the natural antagonist of qualitative data-gatherers like scouts. Yet his book chapter on this has a section called “PECOTA vs. Scouts: Scouts Win.”  Silver writes:

Although the scouts’ judgment is sometimes flawed, they were adding plenty of value: their forecasts were about 15 percent better than ones that relied on statistics alone.

He has the same view of quantitative election forecasting.  In a section of the book called “Weighing Qualitative Information,” he lauds the value in the in-depth interviewing of candidates that is done by David Wasserman and others at the Cook Political Report.  Silver uses the ratings that Cook and others have developed in his own House forecasts and finds that they also add value.
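As a hedged sketch of what "incorporating qualitative ratings" can mean mechanically, the toy model below converts a Cook-style categorical rating into an implied margin and blends it with a polling average. The rating-to-margin mapping and the weights are invented for illustration; this is not Silver's actual model.

```python
# Map qualitative ratings (Cook Political Report style) onto a numeric scale.
# The 5-points-per-category scale and the 70/30 blend are illustrative assumptions.
RATING_SCORES = {
    "Solid D": -3, "Likely D": -2, "Lean D": -1,
    "Toss Up": 0,
    "Lean R": 1, "Likely R": 2, "Solid R": 3,
}

def blended_forecast(poll_margin, rating, weight_polls=0.7):
    """Combine a polling margin (R minus D, in points) with an expert rating.

    The rating is converted to an implied margin (5 points per category)
    and averaged with the polls using a fixed weight.
    """
    rating_margin = RATING_SCORES[rating] * 5.0
    return weight_polls * poll_margin + (1 - weight_polls) * rating_margin

# Polls show the Republican up 2, but raters call the race "Lean D":
print(round(blended_forecast(2.0, "Lean D"), 2))  # -0.1, a hair toward the Democrat
```

The design choice worth noticing is that the qualitative judgment enters as just another input whose weight can be tuned against past results, which is exactly how Silver describes testing whether such ratings "add value."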

So the irony, as I see it, is that Silver faced resentment within the newsroom even though his approach explicitly values the work that reporters do.  Although I suspect that Times reporters wouldn’t like to simply be inputs in one of Silver’s models, I could easily see how the Times could have set up a system by which campaign reporters fed their impressions to Silver based on their reporting and then Silver worked to incorporate their impressions in a systematic fashion.

In short, even though it may be impossible to eliminate the tension between Silver’s approach and that of at least some reporters, I think there is an under-appreciated potential for symbiosis.  Perhaps Silver will find that at ESPN and ABC.

[Full disclosure: In 2011-2012, I was a paid contributor to the 538 blog.]


The IRS and a statistics problem

John Mashey writes:

Have you, or do you know if anyone has, looked at this hypothesis:
“The IRS unfairly targeted conservative organizations for scrutiny in getting 501(c)(4) status.”

I’d think one would want to have:
N years of rates of application (because I conjecture there was a big burst after Citizens United)
Some grouping of those into “conservative,” “progressive” or other.
Rates of investigation, time taken.
Rates of rejection.

I haven’t followed this carefully enough, but it actually looks like a reasonable social science question, if only to identify the data actually required to reach any real conclusions.

I have no idea.
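For what it's worth, once the rates Mashey lists were in hand, the statistical core of the hypothesis reduces to a two-sample proportion comparison: were applications in one group flagged for extra scrutiny at a higher rate than in another? A minimal sketch, with invented counts standing in for the data he says would need collecting:

```python
import math

def two_proportion_z(scrutinized1, total1, scrutinized2, total2):
    """Z-statistic for the difference between two scrutiny rates."""
    p1 = scrutinized1 / total1
    p2 = scrutinized2 / total2
    pooled = (scrutinized1 + scrutinized2) / (total1 + total2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total1 + 1 / total2))
    return (p1 - p2) / se

# Invented counts: 60 of 200 "conservative" applications flagged
# for extra scrutiny vs. 30 of 150 "progressive" applications.
z = two_proportion_z(60, 200, 30, 150)
print(round(z, 2))  # about 2.12, nominally significant at the 5% level
```

Of course, the hard part is not the arithmetic but the grouping and the denominators, which is exactly Mashey's point.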


When Can You Trust a Data Scientist?

This is a guest post from Andrew Therriault, a political science Ph.D. who is Director of Research and Business Development for Pivotal Targeting/Lightbox Analytics.


Pete Warden’s post “Why You Should Never Trust a Data Scientist,” which Henry Farrell linked to, illustrates one of the biggest challenges facing both consumers and practitioners of data science: the issue of accountability. And while I suspect that Warden—a confessed data scientist himself—was being hyperbolic when choosing the title for his post, I worry that some readers may well take it at face value. So for those who are worried that they really can’t trust a data scientist, I’d like to offer a few reassurances and suggestions.

Data science (sometimes referred to as “data mining,” “big data,” “machine learning,” or “analytics”) has long been subject to criticism from more traditional researchers. Some of these critiques are justified, others less so, but in reality data science has the same baby/bathwater issues as any other approach to research. Its tools can provide tremendous value, but we also need to accept their limitations. Those limitations are too extensive to get into here, and that’s indicative of the real problem Warden identified: as a data scientist, nobody checks your work, mostly because few of your consumers even understand it.

As a political scientist by training, I found this a strange thing to accept when I left the ivory tower (or its Southern equivalent, anyway) last year to do applied research. The reason for a client to hire someone like me is that I know how to do things they don’t, but that also means they can’t really tell if I’ve done my job correctly. It’s ultimately a leap of faith—the work we do often looks, as one client put it, like “magic.” But that magic can offer big rewards when done properly, because it can provide insights that aren’t available any other way.

So for those who could benefit from such insights, here are a few things to look for when deciding whether to trust a data scientist:

  • Transparency: Beware the “black box” approach to analysis that’s all too common. Good practitioners will share their methodology when they can, explain why when they can’t, and never use the words “it’s proprietary” when they really mean “I don’t know.”

  • Accessibility: The best practitioners are those who help their audience understand what they did and what it means, as much as possible given the audience’s technical sophistication. Not only is it a good sign that they understand what they’re doing, it will also help you make the most of what they provide.

  • Rigor: There are always multiple ways to analyze a “big data” problem, so a good practitioner will try different approaches in the course of a project. This is especially important when using methods that can be opaque, since it’s harder to spot problems along the way.

  • Humility: Find someone who will tell you what they don’t know, not just what they do.

These are, of course, fundamental characteristics of good research in any field, and that’s exactly my point. Data science is to data as political science is to politics, in that the approach to research matters as much as the raw material. Identifying meaningful patterns in large datasets is a science, and so my best advice is to find someone who treats it that way.


Why You Should Never Trust a Data Scientist

Pete Warden

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization of drawing lines between the top ten links for each city had issues, but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. …
I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. … If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same exact form as databases turn over, and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there’s no external checks in the system to keep them that way.

[via Cosma – Cross-posted at CT]


How Computers Can Help Us Track Violent Conflicts — Including Right Now in Syria

This is a guest post by David Masad of Caerus Analytics.  An elaborated version of this analysis is here.


One of the important challenges in studying conflict is simply identifying where it happens.  For more than 40 years, researchers have sought to build systematic data about episodes of conflict. Monitoring events on the ground in hundreds of countries is quite difficult, but now, thanks to the tremendous work of political scientists Philip Schrodt and Patrick Brandt and information scientist Kalev Leetaru, there is a new dataset—the Global Database of Events, Language, and Tone (GDELT)—that facilitates this task:

The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first “realtime social sciences earth observatory.” Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present with daily updates.

But does it work?  Can we remotely observe violent conflicts around the world through computer-coded media reports?  Building on previous analyses by New Scientist magazine and Jay Ulfelder, I will show that the GDELT data can indeed help us do that by examining GDELT data about the ongoing Syrian civil war.  In particular, I will show that the violent events identified in GDELT correlate with death tolls at the national level. I will also show that GDELT events are correlated with the future registration of refugees. This preliminary analysis suggests that GDELT does capture underlying dynamics in the Syrian civil war, although the analysis also suggests where the GDELT data may fall short.
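The shape of that check is simple enough to sketch: count coded violent events per month and correlate the counts with a death-toll series. The rows, column layout, and numbers below are simplified stand-ins invented for illustration, not GDELT's actual file format or the Syrian figures used in the analysis.

```python
import math
from collections import Counter

# Simplified stand-in records: (month, is_violent_event). Real GDELT rows
# carry dates, CAMEO event codes, georeferenced locations, and much more.
events = [
    ("2012-01", True), ("2012-01", True), ("2012-01", False),
    ("2012-02", True), ("2012-02", True), ("2012-02", True), ("2012-02", True),
    ("2012-03", True), ("2012-03", False),
]

# Invented monthly death tolls for the same months.
deaths = {"2012-01": 400, "2012-02": 900, "2012-03": 250}

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

violent_counts = Counter(month for month, violent in events if violent)
months = sorted(deaths)
r = pearson([violent_counts[m] for m in months], [deaths[m] for m in months])
print(round(r, 3))
```

The real analysis is of course harder than this: media coverage, duplicate reports, and coding errors all bias the event counts, which is where the "may fall short" caveat comes in.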

