Victoria Stodden is an assistant professor of statistics at Columbia University.
There’s been an enormous amount of buzz since a study was released this week questioning the methodology in a published paper. The paper under fire is Reinhart and Rogoff’s “Growth in a Time of Debt” and the firing is being done by Herndon, Ash, and Pollin in their article “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff.” Herndon, Ash, and Pollin claim to have found “spreadsheet errors, omission of available data, weighting, and transcription” in the original research which, when corrected, significantly reduce the magnitude of the original findings. These corrections were possible because of openness in economics, and this openness needs to be extended to make all computational publications reproducible.
How did this come about?
In 1986 a study was published in The American Economic Review (the very same journal that published Reinhart and Rogoff’s piece) by Dewald, Thursby, and Anderson called “Replication in Empirical Economics: The Journal of Money, Credit and Banking Project,” detailing shocking results: of 152 papers published or to be published in the Journal of Money, Credit and Banking between 1982 and 1984, only 4 could essentially be replicated in their entirety (the authors received sufficient data and code for only 9). The JMCB was selected for study because of its novel (at the time) policy of requiring all authors to relinquish data and code, to be made available to other researchers. These stark results sent shockwaves through the economics community, and many journals, including the AER, subsequently implemented data and code release requirements, typically upon request. Economists also became more aware of the issue of reproducibility, and many now release the data and code associated with their published studies.
Herndon, Ash, and Pollin obtained the spreadsheet Reinhart and Rogoff used by direct request, and RR had already made the raw data available for download on a website apparently set up for their collaborations: http://www.reinhartandrogoff.com/data/browse-by-topic/topics/9/. The code used to transform the raw data into the published findings was not made available. Herndon et al. applied the methodological description in the paper to both the raw data and the working spreadsheet RR supplied and, as is now well known, came to different conclusions (although they were largely able to replicate the published results from the working spreadsheet).
What about peer review?
A reasonable question is how these results could have passed peer review if, as Herndon et al. claim, there were errors in the spreadsheet and methodological liberties taken, such as selective data exclusion and unconventional weighting of summary statistics. In fact, the RR article wasn’t peer reviewed, having appeared in AER’s Papers and Proceedings issue. But regardless, unlike for a mathematical result, peer review generally doesn’t verify computational results. Reviewers don’t usually check whether the computational findings in the paper can be replicated, or whether their derivation actually matches that described in the paper. As the JMCB study showed, not having access to the code and data makes it all but impossible to replicate findings.
Proposed Change 1: Reviewers need to have a way to check that any computational results were actually generated as described.
This is typically nontrivial, since having the code and data doesn’t guarantee that replication is possible without significant effort. I have been working on a not-for-profit project called RunMyCode.org that could help reviewers by providing a certification that the code and data do regenerate the tables and figures in the paper. The site provides a web interface that permits users to regenerate the published results and to download the code and data.
I am puzzled as to why Herndon et al didn’t rely on AER’s stated policy that “[a]s soon as possible after acceptance, authors are expected to send their data, programs, and sufficient details to permit replication, in electronic form, to the AER office.” Nowhere is an exception for their Papers and Proceedings issue listed, and so AER should have both the data and programs needed to replicate the RR paper.
Reproducible Research is a Necessary Standard
I suspect the reason Reinhart and Rogoff’s work could be scrutinized at this level is precisely the culture of code and data sharing in economics. Before the Herndon publication the raw data were freely available on RR’s website, and when asked they supplied their working Excel spreadsheet to Herndon et al. RR have come under criticism because they never released the program showing the steps they took from the raw data they posted to their published results, and because they made only the raw data, not the spreadsheet, freely available. It’s worth keeping in mind that without their release of the working spreadsheet, their mistakes would likely never have been found. Now imagine how many publications don’t have data and code available, cannot be checked at the level RR’s was, and how many mistakes in the scholarly record simply aren’t being caught.
Proposed Change 2: At the time of publication, researchers make enough material openly available (data, programs, narrative) so that other researchers in the field can replicate their work.
What it takes to replicate results is a subjective judgment, but the programs, and the data they start from, are a minimum. This doesn’t guarantee the results are correct, but it permits others to understand what was done. RR’s article was highly cited and widely reported in the press, and it appears no one had bothered to check the results before the Herndon paper. Perhaps they assumed peer review had done so, but whatever the reason, independent researchers must be able to validate and verify published results, and researchers must facilitate this routinely when they publish findings. There may be confidentiality issues or other reasons data cannot be openly shared, but the default needs to be conveniently available data and code for every publication. I recently co-organized a workshop around this issue, and the organizers released a workshop report, “Setting the Default to Reproducible.”
Proposed Change 3: Only use research tools that can track the steps taken in generating results. Carry out research using R or Python, for example, and not Excel.
I also suspect that if Reinhart and Rogoff had been in the habit of making their data and code available so that their results were reproducible, they would have caught such errors themselves before publication. If sharing had been taken seriously, perhaps they would have been motivated to use tools more conducive to scientific research, such as R or Python, that capture all the steps taken. By using Excel, which was never designed for scientific research, they institutionalized mouse clicks and other untraceable actions into a scientific workflow. This must be avoided: it makes explaining to others (and to oneself) how to replicate the findings next to impossible, and too easily introduces inadvertent mistakes.
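To make the contrast concrete, here is a minimal sketch in Python of what a traceable workflow looks like. The data, column names, and threshold are entirely hypothetical (this is not RR’s actual analysis); the point is that every step from raw data to summary statistic is written down in code, so a spreadsheet-style slip, such as silently excluding rows or averaging over the wrong range, would be visible to anyone rerunning the script.

```python
# Illustrative sketch only: hypothetical data and column names,
# not Reinhart and Rogoff's actual dataset or method.
import csv
import io
from statistics import mean

# Stand-in for a raw data file of country-year observations:
# debt/GDP ratio (percent) and real GDP growth (percent).
RAW_CSV = """country,year,debt_to_gdp,growth
A,2000,35.0,3.1
A,2001,95.2,1.0
B,2000,60.4,2.2
B,2001,91.7,-0.3
"""

def growth_by_debt_category(csv_text, threshold=90.0):
    """Average growth for observations above vs. below a debt/GDP threshold.

    Every inclusion/exclusion decision is explicit here, unlike a
    spreadsheet where a mis-dragged cell range leaves no trace.
    """
    high, low = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        g = float(row["growth"])
        if float(row["debt_to_gdp"]) >= threshold:
            high.append(g)
        else:
            low.append(g)
    return {"high_debt_mean": mean(high), "low_debt_mean": mean(low)}

result = growth_by_debt_category(RAW_CSV)
print(result)
```

Because the threshold, the rows included, and the averaging method all appear in the source, a reader handed the posted raw data can rerun the script and see exactly where any discrepancy with the published numbers arises.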