We’ve known for a while that most research data eventually gets lost. To plug this slow leak (and stem the waste of funding and effort it represents), some journals now ask that the data associated with a paper be archived at publication. There’s a whole host of different data archiving policies out there, but which are most effective?
I was lead author on a recent study that looked at how often data got archived in journals with three different flavors of archiving policy. Our paper looked at four journals with no policy, four that recommended archiving the data in a public database but stopped short of requiring it, and four that had recently adopted a mandatory archiving policy. These latter four split into two groups, with two journals that required authors to have a “data archiving statement” in every manuscript, and two that did not.
Since archiving standards vary widely depending on the kind of data, the study focused on datasets used in a particular population genetics analysis called “Structure.” The input file for this is relatively simple, and the technique is widely used in evolutionary biology. This approach held the author community roughly constant across journals, and at the same time eliminated variability due to the inclusion of different types of data.
The graph more or less speaks for itself:*
When there’s no policy, almost no datasets are archived (3 of 89 studies). Recommending archiving is barely more effective, with data from only 10 of 89 studies available online. The double whammy that really gets the job done is having a mandatory policy and requiring that all authors have a “data archiving statement” in their paper.
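To put those two counts on a common scale, here is a quick back-of-the-envelope sketch in Python. It uses only the counts quoted above; the policy labels and the helper name `archiving_rate` are illustrative choices, not anything taken from the paper.

```python
# Counts quoted in the post: (studies with archived data, studies examined).
# The dictionary keys are illustrative labels, not terms from the paper.
archived = {"no policy": (3, 89), "recommended": (10, 89)}


def archiving_rate(archived_count, total):
    """Return the share of studies with archived data, as a percentage."""
    return 100.0 * archived_count / total


rates = {policy: archiving_rate(k, n) for policy, (k, n) in archived.items()}

for policy, rate in rates.items():
    print(f"{policy}: {rate:.1f}% of studies archived")
```

On these numbers, a recommendation lifts the archiving rate from about 3.4% to about 11.2%, which still leaves data from roughly nine in ten studies unavailable online.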
The study didn’t include any journals that only recommended archiving but absolutely required a data archiving statement, but it may be that making authors explicitly state “I’m not going to archive my data” in their manuscript is so unappealing that they put their data online. This approach may work well for journals that span a number of fields or in disciplines where data archiving is not yet on the radar — the journal doesn’t need to persuade the editorial board to adopt a mandatory policy straight away, but peer pressure should act to ensure that a significant proportion of data gets archived.
Lastly, the study also tried to get data from two of these journals by asking for it directly from the authors. About half of the requested datasets did arrive, which is a massive improvement over previous surveys (see here, here, and here). However, the data were requested only a year after publication, so 50% is probably the upper limit — authors leaving science and erasing their hard drives will steadily erode this proportion until, years later, only a few datasets are still available from authors. By contrast, all the data online in public archives will still be there, good as new.
Not archiving data alongside the paper makes about as much sense as not publishing the figures, so let’s hope that with the right sort of prodding we can get all these datasets out into the public domain.
(* I am also the Managing Editor for Publication #12.)




Could not find treasuremytext on the Apple App Store. Does it exist under a different name?
Posted by Tim Hullquist | Jan 15, 2013, 8:37 am
The image was inserted as a rhetorical device. It looks like it’s only available for Android, by the way.
Posted by Kent Anderson | Jan 15, 2013, 8:45 am
The link to the paper doesn’t work…
Posted by Tom Arrison | Jan 15, 2013, 8:37 am
Thanks. There was a missing “http://” in the link that had been inserted. I’ve fixed it.
Posted by Kent Anderson | Jan 15, 2013, 8:44 am
Thanks!
Posted by Tom Arrison | Jan 16, 2013, 8:50 am
Tim, any thoughts on the difference between fields and types of data where there are clear standards and fields and studies that are less structured or perhaps use innovative techniques? Is it enough for a journal to just require some form of archiving in general, or do we need to continue to work toward policies whereby the data will be in a consistent and readily useable form?
Also, given your 50% success rate in actually getting hold of these datasets, it would seem that fields where there are formal and independent data repositories are better off. Journals that require deposit of DNA sequence in something like GenBank don’t have to worry about recalcitrant authors or link rot as researchers move from job to job.
Posted by David Crotty | Jan 15, 2013, 10:25 am
I think it should be up to the funders what data gets archived, not to the publishers. Archiving is laborious, hence detracts from research, and it creates an ongoing expense. Both the burden and the cost come out of the funder’s budget. By the same token, are the publishers expected to bear the ever-increasing cost of enforcing these mandates?
There may well be cases where the data is simple, well defined and easily archived so a universal archiving rule is workable, but surely these are the exception. In some fields data preparation and archiving costs are potentially huge, up to a third of the project cost. Only the most important data should be archived in such cases.
Even in simple cases, imposing cumulative burden and cost on the research community should be done by the funders who pay for it, not by the publishers. Publishers should not make research policies that have nothing to do with publishing. The publishers should not try to manage the scientific enterprise.
Posted by David Wojick | Jan 16, 2013, 8:21 am
There’s more to it than that, if you’re asking for that data to be reusable by others. I think back on the datasets I collected, and I know that I had my own idiosyncratic system for organizing them, and if I dug carefully back into my notes, I could figure out the exact details of any particular time lapse movie I’d made, and decipher the cryptic coding used to annotate it onscreen and in the file name. Translating that into an organized system with a set of instructions for the next user would have been a time-consuming task. Given the very specific nature of the experiments I was doing, looking at one particular cell type under one very specific set of conditions, it’s unlikely anyone else was ever going to reuse my data. The question then is whether it would have been worth several weeks of my time to get the data into a reusable state, as compared with doing several weeks of new experiments.
Posted by David Crotty | Jan 16, 2013, 10:11 pm