We’ve known for a while that most research data eventually gets lost. To plug this slow leak (and stem the waste of funding and effort it represents), some journals now ask that the data associated with a paper be archived at publication. There’s a whole host of different data archiving policies out there, but which are most effective?
I was lead author on a recent study that looked at how often data got archived in journals with three different flavors of archiving policy. Our paper looked at four journals with no policy, four that recommended archiving the data in a public database but stopped short of requiring it, and four that had recently adopted a mandatory archiving policy. These latter four split into two groups, with two journals that required authors to have a “data archiving statement” in every manuscript, and two that did not.
Since archiving standards vary widely depending on the kind of data, the study focused on datasets used in a particular population genetics analysis called “Structure.” The input file for this is relatively simple, and the technique is widely used in evolutionary biology. This approach held the author community roughly constant across journals, and at the same time eliminated variability due to the inclusion of different types of data.
The graph more or less speaks for itself:*
When there’s no policy, almost no data sets are archived (3 of 89 studies). Recommending archiving is barely more effective, with data from only 10 of 89 studies available online. The double whammy that really gets the job done is having a mandatory policy and requiring that all authors have a “data archiving statement” in their paper.
The study didn’t include any journals that only recommended archiving but absolutely required a data archiving statement, but it may be that making authors explicitly state “I’m not going to archive my data” in their manuscript is so unappealing that they put their data online. This approach may work well for journals that span a number of fields or in disciplines where data archiving is not yet on the radar — the journal doesn’t need to persuade the editorial board to adopt a mandatory policy straight away, but peer pressure should act to ensure that a significant proportion of data gets archived.
Lastly, the study also tried to get data from two of these journals by asking for it directly from the authors. About half of the requested datasets did arrive, which is a massive improvement over previous surveys (see here, here, and here). However, the data were requested only a year after publication, so 50% is probably the upper limit — authors leaving science and erasing their hard drives will steadily erode this proportion until, years later, only a few datasets are still available from authors. By contrast, all the data online in public archives will still be there, good as new.
Not archiving data alongside the paper makes about as much sense as not publishing the figures, so let’s hope that with the right sort of prodding we can get all these datasets out into the public domain.
(* I am also the Managing Editor for Publication #12.)