What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine

by Victoria Stodden on April 19, 2013


Victoria Stodden is an assistant professor of statistics at Columbia University.

There’s been an enormous amount of buzz since a study was released this week questioning the methodology in a published paper. The paper under fire is Reinhart and Rogoff’s “Growth in a Time of Debt” and the firing is being done by Herndon, Ash, and Pollin in their article “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff.” Herndon, Ash, and Pollin claim to have found “spreadsheet errors, omission of available data, weighting, and transcription” in the original research which, when corrected, significantly reduce the magnitude of the original findings. These corrections were possible because of openness in economics, and this openness needs to be extended to make all computational publications reproducible.

How did this come about?

In 1986 a study was published in The American Economic Review (the very same journal that published Reinhart and Rogoff’s piece) by Dewald, Thursby, and Anderson, called “Replication in Empirical Economics: The Journal of Money, Credit and Banking Project,” detailing shocking results: of 152 papers published or to be published in the Journal of Money, Credit and Banking between 1982 and 1984, only 4 could essentially be replicated in their entirety (the authors received sufficient data and code for only 9). The JMCB was selected for study because of its novel (at the time) policy of requiring all authors to relinquish data and code, to be made available to other researchers. These stark results sent shockwaves through the economics community, and many journals, including the AER, subsequently implemented data and code release requirements, typically upon request. Economists also became more aware of the issue of reproducibility, and many do release the data and code associated with their published studies.

Herndon, Ash, and Pollin obtained the spreadsheet Reinhart and Rogoff used by direct request, and RR had made the raw data available for download on a website apparently set up for their collaborations: http://www.reinhartandrogoff.com/data/browse-by-topic/topics/9/. The code used to transform the raw data into the published findings was not made available. Herndon et al. applied the methodological description in the paper to both the raw data and the working spreadsheet RR supplied and, as is now well known, came to different conclusions (although they were largely able to replicate the published results from the working spreadsheet).

What about peer review?

A reasonable question is how these results could have passed peer review if, as Herndon et al. claim, there were errors in the spreadsheet and methodological liberties taken, such as selective data exclusion and unconventional weighting of summary statistics. The RR article wasn’t peer reviewed; it appeared in AER’s Papers and Proceedings issue. But regardless, unlike for a mathematical result, peer review generally doesn’t verify computational results. Reviewers don’t usually check whether the computational findings in the paper can be replicated, or whether their derivation actually matches the one described in the paper. As the JMCB study showed, not having access to the code and data makes it pretty much impossible to replicate findings.

Proposed Change 1: Reviewers need to have a way to check that any computational results were actually generated as described.

This is typically nontrivial, since having the code and data doesn’t guarantee that replication is possible, or achievable without significant effort. I have been working on a not-for-profit project called RunMyCode.org, which could help reviewers by providing a certification that the code and data do regenerate the tables and figures in the paper. The site provides a web interface that permits users to regenerate the published results, and to download the code and data.

I am puzzled as to why Herndon et al. didn’t rely on AER’s stated policy that “[a]s soon as possible after acceptance, authors are expected to send their data, programs, and sufficient details to permit replication, in electronic form, to the AER office.” Nowhere is an exception listed for the Papers and Proceedings issue, so AER should have both the data and the programs needed to replicate the RR paper.

Reproducible Research is a Necessary Standard

I suspect the reason Reinhart and Rogoff’s work could be scrutinized at this level is actually the culture of code and data sharing in economics. Before the Herndon publication the raw data were freely available on RR’s website, and when asked they supplied their working Excel spreadsheet to Herndon et al. RR have come under criticism because they never released the program showing the steps they took from the raw data they posted to their published results, and because they didn’t make their spreadsheet freely available, only the raw data. It’s worth keeping in mind that without their release of the working spreadsheet, their mistakes would likely not have been found. Now imagine how many publications don’t have data and code available and so cannot be checked at the level RR’s was, and how many mistakes in the scholarly record just aren’t being caught.

Proposed Change 2: At the time of publication, researchers should make enough material openly available (data, programs, narrative) so that other researchers in the field can replicate their work.

What it takes to replicate results is a subjective judgement, but the programs, and the data they start from, are a minimum. This doesn’t guarantee the results are correct, but it permits others to understand what was done. RR’s article was highly cited and widely reported in the press, and it appears no one had ever bothered to check the results before the Herndon paper. Perhaps people assumed peer review had done so, but whatever the reason, independent researchers must be able to validate and verify published results, and researchers must routinely facilitate this when they publish findings. There may be confidentiality issues or other reasons data cannot be openly shared, but the default needs to be conveniently available data and code for every publication. I recently co-organized a workshop around this issue, and the organizers released a workshop report, “Setting the Default to Reproducible.”

Proposed Change 3: Only use research tools that can track the steps taken in generating results. Carry out research using R or Python, for example, and not Excel.

I also suspect that if Reinhart and Rogoff had been in the habit of making their data and code available so that their results were reproducible, they would have caught such errors themselves before publication. If sharing had been taken seriously, perhaps they would have been motivated to use tools more conducive to scientific research, such as those that capture all the steps taken, as R or Python can. By using Excel, which was never designed for scientific research, they institutionalized mouse clicks and other untraceable actions in a scientific workflow. This must be avoided: it makes explaining to others (and to oneself) how to replicate the findings next to impossible, and it too easily introduces inadvertent mistakes.
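To make this concrete, here is a minimal sketch, in Python with pandas, of what a traceable, scripted version of this kind of analysis could look like. The file name and column names are hypothetical; this illustrates the workflow, not RR’s actual computation.

    import pandas as pd

    # Hypothetical input: one row per country-year, with real GDP growth
    # and the debt-to-GDP ratio. File and column names are illustrative.
    df = pd.read_csv("public_debt_growth.csv")  # country, year, growth, debt_to_gdp

    # The category boundaries are stated explicitly in code, not hidden in cells.
    bins = [0, 30, 60, 90, float("inf")]
    labels = ["0-30%", "30-60%", "60-90%", "above 90%"]
    df["debt_category"] = pd.cut(df["debt_to_gdp"], bins=bins, labels=labels)

    # Mean and median growth by debt category, weighting each country-year
    # equally; the weighting choice is a visible, auditable line of code.
    summary = df.groupby("debt_category", observed=True)["growth"].agg(
        ["mean", "median", "count"]
    )
    print(summary)

Rerunning the script on the raw data regenerates the table exactly, and every analytic choice (which years to include, how categories are defined, how observations are weighted) has to appear as a line of code rather than as an untraceable mouse action.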

Comments

Raypc800 April 19, 2013 at 7:10 pm

The big question here is whether or not R&R knew about this error, and when they found out about it. Only R&R know the answer to that question. But no one can doubt that there was plenty of motivation, among those wanting this article to be true, to tempt a saint.

Frank de Libero April 19, 2013 at 7:38 pm

Victoria, in support of your proposed change 3: in 2003 or earlier the American Statistical Association issued a position paper in which they write, “Efficient computing tools are essential for statistical research, consulting, and teaching. Generic packages such as Excel are not sufficient even for the teaching of statistics, let alone for research and consulting.” My experience as a statistician working with biologists who regularly used Excel confirms this: unreasonable and unacceptable results were pretty common.

On the other hand, the average user finds Excel easy to return to after an absence, which is not the case with statistical software. And then there’s the initial learning curve required to become proficient in, say, Stata or R. The average non-quantitative professional isn’t enthusiastic about investing the time to become proficient with a better computing tool, or about regularly having to relearn it after a few months’ absence.

As I see it, there’s definitely a need for better practice, but we don’t have the right tools yet, or at least none that I’m aware of.

jonathan April 19, 2013 at 9:04 pm

Two points:

1. According to what I’ve read, R&R did not respond to requests for their work product/spreadsheets even though they were informed their results could not be replicated. In other words, people did try. They had access to the underlying data sources, but they did not get the same results. And whatever AER’s policy, it doesn’t seem they supplied AER with anything. The data sources were available through R&R’s website, not through AER.

2. We know the answer to part of the issue raised in question 3 and to the first comment about when R&R knew about the error. They have released a statement saying they did not include years of data because it was, in essence, new to them and they did not have time to check it. They said they didn’t “intentionally” omit years of data. So the answer is that 3 years ago, they knew they had data which on its face significantly contradicted their findings. In the succeeding 3 years, they have given numerous speeches and presentations about this work. They have never revealed anything about data they knew existed which contradicted them. For clarity, remember that data isn’t particularly complicated at surface level; it’s just average growth, which is a number. So since they say they had data that was new to them, we know they absolutely failed either to check that data or, perhaps, to disclose what that data said. In their presentations, they’ve never given the impression they left out years. In other words, more disclosure is great but the real problem here is dishonesty. If this work were done by an Andrew Wakefield, it would be truly scandalous.

Tracy Lightcap April 19, 2013 at 10:38 pm

I agree with pretty much all of this, but I have one comment.

There’s a lot of spreadsheet bashing in this post. I agree with all that is said about that. OTOH, I always prepare my main indicators in a spreadsheet before I put the final dataset into R. I know that makes it harder to record what you did, but, let’s face it, if you have some serious work to do on a time limit, R is definitely not the application of choice for getting your data into shape to analyze on schedule. Indeed, it is as easy (for me, at least) to make a big error in R code as in a spreadsheet. R has many, many virtues. In my experience, ease in preparing datasets is not one of them.

Here endeth the lesson.

Patrick April 21, 2013 at 5:34 pm

I don’t agree with this at all. Merging, creating new variables, recoding, etc. are all significantly easier in R than in Excel, and reproducible. I can just have a code file that documents exactly what was changed from the raw dataset. I don’t see how there is any advantage to changing things on a spreadsheet at all. Anybody who is changing values on a spreadsheet and saving it is just asking to be accused of fraud…
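As a minimal sketch of what such a code file might look like (here in Python with pandas rather than R; the file and column names are made up):

    import numpy as np
    import pandas as pd

    # Hypothetical raw inputs; every step from raw data to analysis dataset
    # lives in this one script, so the transformations are fully documented.
    countries = pd.read_csv("raw_debt_data.csv")  # country, year, debt_to_gdp
    growth = pd.read_csv("raw_growth_data.csv")   # country, year, growth

    # Merging, recoding, and creating new variables: explicit and replayable.
    df = countries.merge(growth, on=["country", "year"], how="inner")
    df["high_debt"] = np.where(df["debt_to_gdp"] > 90, 1, 0)

    # The derived dataset is written out, never edited by hand.
    df.to_csv("analysis_dataset.csv", index=False)

Anyone with the raw files can rerun the script and get identical output, which is exactly the audit trail a hand-edited spreadsheet cannot provide.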

Erich April 22, 2013 at 2:41 am

As far as I’m concerned, spreadsheets have no place in serious scholarship (certainly not in papers with real-world ramifications such as R&R’s). A “time limit” is no excuse for not taking “serious work” seriously and for using, as Victoria notes, a workflow with “untraceable” elements. And I concur with Patrick: simple matrix manipulations in R or Python can be accomplished with a few lines of code, can be reproduced ad nauseam, and are auditable, as opposed to an evanescent sequence of mouse click and drag operations.

I can’t imagine submission timelines for Econ papers are such that resorting to Excel would be advantageous.

Indeed, anyone can make mistakes when coding; God knows I do. But the discipline of coding means my error is reproducible, so even if I were hasty enough to let it pass, a peer could at least identify the problem rapidly without having to forensically “back in” to my error.

CD April 20, 2013 at 12:22 am

The proposed changes are all good. But you’re significantly upping the labor of review if you expect (unpaid) reviewers to check computations: that could be several days of work for some papers. A journal might hire a grad student to do this.

Andrew Gelman April 20, 2013 at 1:03 am

Victoria:

I agree with your assessment and your recommendations, but I wonder if you are being too optimistic. I say this based on some details of this particular case.

You write, “RR’s article was highly cited and widely reported in the press, and it appears no one had ever bothered to check the results before the Herndon paper.” But people have reported that they were trying for years to replicate that analysis but Reinhart and Rogoff were not sharing their information.

You also write, “These corrections were possible because of openness in economics.” Releasing data after several years is better than never releasing data at all. But I think that under a regime of true openness, Reinhart and Rogoff would have released all their information right away, to the first people who requested it.

Jim Rose April 20, 2013 at 5:16 am

When you toss in data mining and publication bias, the credibility revolution that some talked of in empirical economics is the revolution that never happened.

Who gets a PhD or tenure by replicating the results of others?

genauer April 20, 2013 at 6:00 am

“But people have reported that they were trying for years to replicate that analysis but Reinhart and Rogoff were not sharing their information.”

May I ask who, and where, was “reporting” this?

EC April 20, 2013 at 7:51 am

Mike Konczal, for one, whose early post about the HAP paper was widely cited in the blogosphere, wrote: “From the beginning there have been complaints that Reinhart and Rogoff weren’t releasing the data for their results (e.g. Dean Baker). I knew of several people trying to replicate the results who were bumping into walls left and right – it couldn’t be done.”

http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems#.UW14rDQo2L4.twitter

genauer April 20, 2013 at 8:37 am

I looked at your link; only Dean Baker is specifically named, with a CEPR link.

Maybe people could get a little more specific about who asked for what, and when (maybe putting some email or so up), and what answer they got, before complaining that RR “declined to adhere to standard ethics.”

It is by now obvious that Reinhart/Rogoff made their data freely available to everybody, and their Excel sheet to the Amherst three.

When I ask people for references or data, I get a response from some, incl. Krugman; from some (also nominally Princeton) I don’t, but that does not make the latter a “refusal.” It could be a spam filter, or some obnoxious secretary, and then I decide whether to follow up or not.

Andrew Gelman April 20, 2013 at 8:26 am

Genauer:

A commenter on the sister blog pointed out this from Bill Mitchell:

“I should note that when the paper came out in 2010, I immediately tried to replicate the results and failed. I wrote to Carmen Reinhart because I had met her a few years earlier at a function in the US. I requested the data. It appears I was in a queue of researchers asking for the data. I received no reply.

As a long-standing researcher you learn that if an author will not send you their data then something is wrong. Perhaps they were too busy. Perhaps they didn’t want anyone getting their exact dataset because they knew what might be found.

Whatever! But the upshot was that I couldn’t be sure that something was empirically at fault because there are several data sets that one could reasonably assemble in the context of their enquiry which would generate slightly different and perhaps even substantially different results. It wasn’t clear to me how they generated their results despite attempts to reverse engineer them.”

And I don’t think Reinhart and Rogoff have denied this. That is, I don’t think they are claiming that they did share their data with people back in 2010. I can believe they were too busy and felt they didn’t have time to clean their data. This happens to me all the time: people ask me for the data from one of my articles or books, and either I send them something so messy as to be barely usable, or I can’t send them anything at all.

Mark April 20, 2013 at 8:21 am

The failure to make underlying data and code available, and the inability to replicate results once that happened, sounds very similar to the debacle that occurred with climate science and the unraveling of the hockey stick; see http://climateaudit.org/2013/04/18/the-hockey-team-and-reinhart-rogoff/

Barry April 23, 2013 at 10:36 am

With, of course, opposite results – the hockey stick was upheld (unless you mean the critics who made math errors).

Which raises an issue – climate science has been correct, despite working under handicaps similar to those macroeconomics faces (including the corrupting effect of money).

Why did they get it right and freshwater macro get it wrong?

genauer April 20, 2013 at 9:18 am

Andrew,

nobody is perfect. More than a dozen years ago (pre-spam times, at least for me) I got an email in which somebody was asking for the correct reference for one data point in one of 117 graphs and tables in my PhD thesis. And I couldn’t find it. He called about 2 weeks later, and I searched again, for hours, armed with a structured database of 2000 papers, and called back a week later, deeply ashamed that I couldn’t find it.

This Bill Mitchell sent one email, after 2010, in deep spam times, to which he got no response, for whatever reason – again, a spam filter trained on curse words, maybe?

And given his extraordinarily rude tone, I can also easily imagine that I would NOT answer any email from him.

Sooo, so far just very wild, unsubstantiated allegations against R/R.

Andrew Gelman April 20, 2013 at 10:55 am

Genauer:

Victoria wrote, “RR’s article was highly cited and widely reported in the press, and it appears no one had ever bothered to check the results before the Herndon paper.”

My point is that it’s not correct that “no one had ever bothered to check the results before the Herndon paper.” Various people had tried to check the results and were not able to replicate the findings. But they were not able to proceed further because the data were not available. This does not mean that Reinhart and Rogoff were bad people, it just means that, until recently, they were not making their data readily available. This supports Victoria’s larger point that it would be good to have norms of data sharing.

Sebastian April 20, 2013 at 11:59 am

I’m not really sure what your point is here. The professional norm is to reply to e-mails and to share data. I don’t know what your purpose is in making this so complicated and requiring “proof.” No one (at least no one serious) is calling for R&R to face any professional repercussions over this. (And it’s not comparable to plagiarizing your dissertation.)

Several people have reported now contacting R&R for the data and not getting it. One, max two e-mails should be all that is required for that – what do you expect, certified letters?
If you don’t have time to answer e-mails, put your data and code on your webpage. The spam argument is bogus – e-mails sent from .edu domains don’t land in the spam filter (and this was 2010, not 2006 – peak spam is long over).

R&R’s behavior is also fairly uncommon, by the way – I’ve co-authored a paper relying extensively on replication (ungated: http://blogs.uoregon.edu/davidsteinberg/files/2012/09/karchersteinberg-sept23-2011-1ca3hhe.pdf ) involving similar types of data (i.e. country/year macroeconomic data), using papers from econ and polisci. We were two PhD students at the time and received data from everyone we contacted, including the economists. All the political scientists we contacted either had their data and code online or supplied us with both the data _and_ the R/Stata code to replicate their results, which we were also able to do without fail. That’s how it should be, and my understanding is that that’s how it commonly works (Gary King – and I assume other profs – make replication of a published paper a requirement in their grad methods classes, so these requests and replications happen a fair amount).

David Landy April 20, 2013 at 10:03 am

Just a small correction: the link to http://www.runmycode.org actually links to http://www.runmycode/org. Thanks for the work you’re doing!

John Sides April 20, 2013 at 2:36 pm

This should be fixed. Thanks!

genauer April 20, 2013 at 10:43 am

And I would also like to point out that this Mister Dean Baker was asked, after he made his allegation on his blog:

“written by AndrewDover, July 04, 2010 8:53
http://www.scribd.com/doc/2573…me-of-Debt

Who asked for what data? ”

but Dean Baker did not come forward there with any evidence or comment on his claims.

We watch this in Europe, carefully.

We do fire ministers for unethical scientific conduct, even when it was 30 years ago.

Mike Rappeport April 20, 2013 at 1:00 pm

Perhaps it is because I am a non-economist (my undergraduate degree is in Physics and my PhD is in mathematics with a specialty in statistics) that I see this problem a little differently. Each of the half dozen or so papers that I published before I left Bell Labs and science for “social science” (specifically doing survey research of all kinds) was essentially automatically replicated. It had to be because, as I was taught from the first day in undergraduate school, the “scientific method is essentially cumulative,” with each new piece of work representing one step in a never-ending process of hypothesis/test/counter-example (limiting the applicability)/new hypothesis. So whoever was then working on a next step really had no choice but to start by effectively replicating my work.

That R&R’s work was not many times replicated in this sense may be best understood as demonstrating how far the current modus vivendi of social science/economics/policy sciences still is from real science. If there were a similarly important “breakthrough” in physics or chemistry or biology or earth science, by a week later there would be dozens of graduate students (and their mentors) working on such topics as under what conditions the breakthrough applies. Think, for instance, of what happened when a claim was made for cold fusion. In so doing, of course, those “next steppers” replicate/check the original work. Even in the sciences doing such extensions isn’t as sexy as making a breakthrough, but it does have the advantage that it is ordinarily sufficient for a PhD thesis from a “quality” school. So to my mind, a, perhaps the, real issue is not whether anyone took as their basic task “replicating” the R&R results, but why economics is so structured that three months later there weren’t 100 students testing whether the R&R results still held if the universe/sample/definitions were limited or changed in any of a dozen different obviously applicable ways (e.g. size of country, Gini level, time period, importance of financial sector).

Sebastian April 20, 2013 at 2:05 pm

I think there are two misunderstandings here
1. While it was widely discussed in policy circles, the R&R paper was not a big deal in academic economics. One can argue whether that disconnect is good or bad, but the type of work rewarded in academic economics (i.e. the type of stuff that gets grad students a job) almost across the board involves much more sophisticated data and statistical methods. Replicating the R&R results is at best a term paper for an intro to econometrics class (which is what the UMass replication started out as).
Contrast this with a study that actually was hugely influential academically, Acemoglu, Johnson, and Robinson’s
http://www.nyu.edu/econ/user/debraj/Courses/Readings/AcemogluJohnsonRobinsonAER.pdf
The Colonial Origins of Comparative Development: An Empirical Investigation
Those results were replicated multiple times, both by papers building on them and by those critiquing them and the most thorough critique did come from a grad student who got an AER paper out of it: http://ideas.repec.org/a/aea/aecrev/v102y2012i6p3059-76.html

2. There are two types of replication: people working with the same data, or people working on the same question. On the latter, which is what you describe in physics, there were multiple people working on the same question as R&R – including not just graduate students, but also the research departments of organizations such as the IMF and the OECD. The fact that these studies came to all sorts of different results certainly meant that R&R’s 90% figure wasn’t considered an established fact by academic economists by any means. What people mean by “replication” in this context, though, is to start out by getting the exact same results using the exact same data – and that’s of course only possible if you have access to the exact same data, which wasn’t the case until the UMass folks got the original data.

I don’t want to claim that economics or any other social science functions the same way as physics – there are both good and bad reasons for that – but it does so far more than you seem to realize.

fred April 20, 2013 at 7:57 pm

Tyler Cowen directed us to Andreoli Versbach, Patrick and Mueller-Langer, Frank, Open Access to Data: An Ideal Professed but Not Practised (February 21, 2013). RatSWD Working Paper Series No. 215; Max Planck Institute for Intellectual Property & Competition Law Research Paper No. 13-07. Available at SSRN: http://ssrn.com/abstract=2224146 or http://dx.doi.org/10.2139/ssrn.2224146

“We provide evidence for the status quo in economics with respect to data sharing using a unique data set with 488 hand-collected observations randomly taken from researchers’ academic webpages. Out of the sample, 435 researchers (89.14%) neither have a data&code section nor indicate whether and where their data is available. We find that 8.81% of researchers share some of their data whereas only 2.05% fully share. We run an ordered probit regression to relate the decision of researchers to share to their observable characteristics. We find that three predictors are positive and significant across specifications: being full professor, working at a higher-ranked institution and personal attitudes towards sharing as indicated by sharing other material such as lecture slides.”

What’s really shocking about this is that many if not all leading journals in economics stipulate something like the following as a condition of publication: “Authors must also acknowledge they have read and agree to the policy on data access and estimation procedures. Authors are expected to document their data sources, data transformations, models, and estimation procedures as thoroughly as possible. Authors are also expected to make data available at cost for replication purposes for up to five years from publication.” Journals should simply refuse to publish articles based upon proprietary data. If it cannot be replicated, it’s just not credible.

Jim Rose April 20, 2013 at 10:10 pm

“I invite the reader to try and identify a single instance in which a “deep structural parameter” has been estimated in a way that has affected the profession’s beliefs about the nature of preferences or production technologies or to identify a meaningful hypothesis about economic behavior that has fallen into disrepute because of a formal statistical test.”

‘The Scientific Illusion in Empirical Macroeconomics’, Lawrence H. Summers, The Scandinavian Journal of Economics, Vol. 93, No. 2, Proceedings of a Conference on New Approaches to Empirical Macroeconomics (Jun., 1991), pp. 129-148

CLF09 April 25, 2013 at 6:05 pm

I just wonder why no one has addressed the fact that their model is not a model. They did not test a hypothesis, and they had a (flawed, but still usable) dataset that was ripe for a panel model. Sure, the selection on the dependent variable was an egregious flaw, but in what world is it acceptable to not actually test anything? When I read the paper I was taken aback that it consisted of descriptive statistics. Why was a simple linear panel model not used to test (and potentially reject) their hypothesis? In my own experience in finance, one would be stupid to bring such data to a superior to make a trading decision – and this in light of the fact that we are not known for good math or even for anything approaching a fully specified model.
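For what it’s worth, a minimal sketch of such a specification (in Python with statsmodels; the data file and variable names are hypothetical):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical country-year panel with real GDP growth and debt/GDP.
    df = pd.read_csv("debt_growth_panel.csv")  # country, year, growth, debt_to_gdp

    # A simple linear panel specification with country and year fixed effects:
    # an actual testable model, as opposed to a table of group means.
    model = smf.ols("growth ~ debt_to_gdp + C(country) + C(year)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["country"]}
    )
    print(model.summary())

The coefficient on debt_to_gdp and its clustered standard error are then something one can actually test and potentially reject, rather than a comparison of category means.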

Alan T April 27, 2013 at 2:31 am

Politics Debunked April 27, 2013 at 12:50 pm

Obviously this is focused on academic work, but I thought I’d note the same is true of government entities doing analyses that affect the public. For instance, the Social Security Administration makes long-range forecasts (including for Medicare) whose results have a major impact on government finances and policy choices, yet which appear to be critiqued far less than even academic econ papers like R&R’s. All through their documentation you will see references to their use of “unpublished data” which is nowhere on their site. They publish a *subset* of their results, which appears cherry-picked to avoid certain results that would show off their problems. Someone from the tech business world looked at their forecast, critiquing it like any other “business plan,” and found major problems in its workings:

http://www.politicsdebunked.com/article-list/ssaestimates

Since the page was done by an outsider to the policy world, it hasn’t yet gotten attention. The same problem occurs in, for instance, Congressional Budget Office long-term projections, where they give some of the results, but the models they use are nowhere to be found.

