Things that aren’t prisoner’s dilemmas, part 2

Sociologist Brandy Aven writes:

I have been thinking a lot about reproducible science, particularly for the social sciences. Creating norms or policies that enforce reproducible science may not only be cheap insurance to mitigate academic fraud but also improve our field. . . . There are quite a few wonderful archives out there that provide the infrastructure, but they are frightfully empty.

So far, I’m with her. It seems like a good idea to make all research materials available, but not many people do so.

But then Aven continues:

What we have is a good old-fashioned social dilemma. Why should I make all my data and syntax available if you don’t?

The answer is clear to me: by making your data available, you are making it more likely that others will replicate your results, continue the directions of your research, cite you, etc. Fame and fortune await.

In that case, why not make data and code publicly accessible? If it’s so good to do, why isn’t everybody doing it? Let’s set aside the cheaters and the insecure people, those scholars who are worried that if someone else gets their hands on their data, they will come to different conclusions. And let’s set aside those researchers who are so clueless that they honestly seem to think that their particular analysis is the last word on the subject.

What about the rest of us, the vast majority (I assume) of researchers who are doing science to learn about reality and who would be thrilled if others pick up our torch and continue our research directions where we left off? Why don’t we always share our data? I know I don’t, and it’s not because I don’t want other people to take a look at what I’m doing.

I can think of two reasons we (those of us who would actually like our research to be reproduced) don’t routinely share data and code:

1. Effort. This for me is the biggie. As Aven notes, there are social benefits to making data and code available, and as I note above, there are direct personal benefits as well. But these benefits are all medium-to-long-term and they pale beside the short-term costs of getting my act together to put the data in a convenient place. In fact, when I do actually organize my data, it’s often motivated by a desire to make my life easier when handling repeated requests.

2. Rules. The default is for data and code not to be released. Often there are silly IRB rules or commercial restrictions on data. In other cases it seems like too much effort to find out. Again, though, it can be in your self-interest to make data available. For example, in our wildly popular (not yet but eventually, I hope) mrp package in R, we use CCES data, not Annenberg or Pew, for our examples. Why? Because the people at CCES were cool about it. Not only do they release their (old) data for free, they don’t mind us reposting it. Ultimately CCES benefits from this. The freer the data, the easier it will be for people to do analyses, cite CCES, suggest improvements to CCES, etc.

In short, I agree with most of what Aven wrote in her post, and I agree that it would be good to change the incentives to increase data sharing. But I don’t think it’s fundamentally a coordination problem, or a prisoner’s dilemma, or a tragedy of the commons, or a first-mover problem, or whatever you want to call it. I think it’s mostly about defaults and laziness (I guess that would be something like “intertemporal preference conflicts” in decision-analytic jargon).

The big picture

The concept of a social dilemma or cooperation problem or prisoner’s dilemma is appealing, I think because it is a crisp way of understanding why something that is evidently a good idea is not actually being done. But sometimes there is a more direct explanation. I think we should be careful about reaching for the “social dilemma” model whenever we see a frustrating or mysterious outcome.

For another example, see this article on something that is not a prisoner’s dilemma but was labeled as such.

3 Responses to Things that aren’t prisoner’s dilemmas, part 2

  1. J August 3, 2012 at 9:33 am #

    There could actually be an incentive for researchers to publish their data and code: it would signal the quality of the work. Those who hoarded their data and code would come to be seen as hacks.

  2. Jonathan Baron August 3, 2012 at 10:01 am #

    As Uri Simonsohn points out, the journal I edit is one that requires (with some exceptions) posting of data. As you can see from the current issue, compliance is now very good. And, what is more, nobody complains. Some even send their data with their initial submission.

    Possibly, as you point out, the basic motivations here are somewhat conflicting, to the point where they roughly balance out. In a case like this, people are very sensitive to social norms, and social norms, in turn, are affected by rules of journals, even if the rules are not strictly enforced.

    Of course, it is possible that I am now getting a biased sample of authors. If so, perhaps that is a good thing. Those who engage in “p hacking”, for example, may not want others to see the results of analyses that they did not report.

  3. ricketson August 3, 2012 at 2:19 pm #

    A note from biology: We have some well-established practices for sharing data (particularly DNA sequence data). Most journals now insist that sequence data be submitted to an online NIH-run database.

    The easy thing for us is that these data fall into pretty standard categories, and we’ve established a lot of standardized formats for these types of data. I suspect that data formats in social sciences cannot be standardized so readily, which in turn would make it hard to establish a central database (even if you had the same funding as the NIH).

    One problem that biologists have had is in defining the “completeness” of documentation for the datasets. Under-documentation can result in near-worthless data submissions. Over-documentation can fill up the database with BS and waste the submitter’s time. For instance, some sequence databases insist that all sequences have functional annotation — even if the sequence was generated by a population survey and not a functional study. I suspect that this is just a legacy problem: all sequences used to arise from functional studies, and it’s only recently that they’ve been used as markers in population surveys. But the result is that a researcher wastes a lot of time adding information that is nothing more than the output of an automated analysis (and the basis of the annotation is not a required part of the documentation). It would be better to omit the annotation and allow the end user to run their own automated analysis on the raw data.