Sociologist Brandy Aven writes:
I have been thinking a lot about reproducible science, particularly for the social sciences. Creating norms or policies that enforce reproducible science may not only be cheap insurance to mitigate academic fraud but also improve our field. . . . There are quite a few wonderful archives out there that provide the infrastructure, but they are frightfully empty.
So far, I’m with her. It seems like a good idea to make all research materials available, but not many people do so.
But then Aven continues:
What we have is a good old fashion social dilemma. Why should I make all my data and syntax available if you don’t?
The answer is clear to me: by making your data available, you are making it more likely that others will replicate your results, continue the directions of your research, cite you, etc. Fame and fortune await.
In that case, why not make data and code publicly accessible? If it’s so good to do, why isn’t everybody doing it? Let’s set aside the cheaters and the insecure people, those scholars who are worried that if someone else gets their hands on their data, they will come to different conclusions. And let’s set aside those researchers who are so clueless that they honestly seem to think that their particular analysis is the last word on the subject.
What about the rest of us, the vast majority (I assume) of researchers who are doing science to learn about reality and who would be thrilled if others pick up our torch and continue our research directions where we left off? Why don’t we always share our data? I know I don’t, and it’s not because I don’t want other people to take a look at what I’m doing.
I can think of two reasons we (those of us who would actually like our research to be reproduced) don’t routinely share data and code:
1. Effort. This for me is the biggie. As Aven notes, there are social benefits to making data and code available, and as I note above, there are direct personal benefits as well. But these benefits are all medium-to-long-term and they pale beside the short-term costs of getting my act together to put the data in a convenient place. In fact, when I do actually organize my data, it’s often motivated by a desire to make my life easier when handling repeated requests.
2. Rules. The default is for data and code not to be released. Often there are silly IRB rules or commercial restrictions on data. In other cases it seems like too much effort to find out whether release is even allowed. Again, though, it can be in your own self-interest to make data available. For example, in our wildly popular (not yet but eventually, I hope) mrp package in R, we use CCES data, not Annenberg or Pew, for our examples. Why? Because the people at CCES were cool about it. Not only do they release their (old) data for free, they don’t mind us reposting it. Ultimately CCES benefits from this. The freer the data, the easier it will be for people to do analyses, cite CCES, suggest improvements to CCES, etc.
In short, I agree with most of what Aven wrote in her post, and I agree that it would be good to change the incentives to increase data sharing. But I don’t think it’s fundamentally a coordination problem, or a prisoner’s dilemma, or a tragedy of the commons, or a first-mover problem, or whatever you want to call it. I think it’s mostly about defaults and laziness (I guess that would be something like “intertemporal preference conflicts” in decision-analytic jargon).
The big picture
The concept of a social dilemma or cooperation problem or prisoner’s dilemma is appealing, I think, because it is a crisp way of understanding why something that is evidently a good idea is not actually being done. But sometimes there is a more direct explanation. I think we should be careful about reaching for the “social dilemma” model whenever we see a frustrating or mysterious outcome.
For another example, see this article on something that is not a prisoner’s dilemma but was labeled as such.