"Data Dump" by swanksalot via Flickr

“Enough is enough,” writes Christine Borowski, in a July 4th editorial published in the Journal of Experimental Medicine (JEM).

Borowski, the Executive Editor of JEM, states that, effective immediately, only “essential” supplementary tables and figures will be accepted for publication.

Her rationale for a radical change in editorial policy was based on something many editors have experienced first-hand: online articles have quickly become “data dumps” for supplemental items.

The sheer volume of these supplements burdens reviewers and costs authors precious time and resources, yet there is little evidence that readers use them much, writes Borowski. She explains how, without the size restrictions enforced in print publications, the asymmetric power relationship between reviewer and author has led to the growth of supplemental files:

Why the increase in the prevalence of supplementary data? Reviewers frequently ask for it. Editors generally allow it. So authors are compelled to provide it, although some do so grudgingly. Of course, authors are also referees. Why the same individual would demand experiments as a referee that they might balk at as an author seems paradoxical.

In 2005, only four in 10 JEM articles were published with supplemental files. Now, all of them are, and the average number of supplemental items per paper has increased from 2.4 in 2005 to 5.9 in 2011. The first full article I looked at in the July 4th issue included 15 supplemental items (6 figures and 9 tables).
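To put the growth in perspective, here is a quick back-of-the-envelope calculation using only the two averages quoted above. The variable names are mine, and this is a rough sketch rather than anything taken from the JEM editorial.

```python
# Averages quoted in the post (JEM, 2005 vs. 2011).
items_2005 = 2.4  # average supplemental items per paper in 2005
items_2011 = 5.9  # average supplemental items per paper in 2011

# Relative growth between the two years.
fold_change = items_2011 / items_2005
percent_increase = (items_2011 - items_2005) / items_2005 * 100

print(f"Fold change: {fold_change:.2f}x")
print(f"Percent increase: {percent_increase:.0f}%")
```

In other words, the average number of supplemental items per paper grew roughly two and a half times over six years.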

Supplemental files will now be limited to “essential” information, and limited to file formats not currently permissible for inclusion in the primary paper (e.g., videos and large datasets).

While authors are to abide by these policies, reviewers must understand them as well and curtail their demands for additional data and experiments. The intended purpose of the policy is to limit the growing demand on authors, reviewers and editors and speed up the publication of new findings.

Last summer, the Editor-in-Chief of the Journal of Neuroscience announced that the journal would no longer publish supplemental files, citing a similar rationale. However, drawing the line on what kind of supplements were important enough for inclusion made enforcement impractical.

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist.


26 Thoughts on "A Journal Is Not a Data Dump"

In a recent presentation for a digital curation meeting, I compared the scale of selected digital resources (see e.g. slide 8). The slide shows the scale of data generated by one research data case study (MIT). On the same scale, some current archival services are noticeably smaller, and the aggregate of European IRs (via DRIVER) is vanishingly small.

What this suggests is that we have found ways to handle and present digital research data, in terms of structuring and metadata, etc. What we haven’t yet evaluated is the value proposition of all these data given the potential costs of exploding volumes. It looks like the Journal of Experimental Medicine, and Journal of Neuroscience before it, have reached similar conclusions. In its response, JEM is limiting the scope of publishable research data while also attempting to make some allowances, thereby exploring the value that might be unlocked.

It is clearly important that research data are linked to the published papers, but not necessary that they are published by the same source. That said, the curation of research data is currently an under-researched area, and whoever provides such services for research data will want to know much more about usage and value before committing to the rapidly expanding costs.

This problem was bound to arise as soon as the page limitations were lifted. There is always more that can be said, more good questions to be asked, so the journal article is tending toward the research report, which can easily be ten times longer. Information content increases exponentially with level of detail, because of its underlying tree structure. But information, as always, is difficult and expensive to produce, so new lines must be drawn. Supplemental data in the sense of just more content may be yet another i-science experiment that has failed.
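As a rough illustration of the tree argument, here is a minimal sketch that assumes a uniform tree in which every point raises the same number of follow-up questions. The branching factor of 3 and the function names are hypothetical, chosen only for illustration. Real research questions will not branch this evenly.

```python
def items_at_depth(branching: int, depth: int) -> int:
    """Number of nodes at a given depth of a uniform tree (root is depth 0)."""
    return branching ** depth


def total_items(branching: int, depth: int) -> int:
    """Total nodes from the root down to and including `depth`."""
    return sum(items_at_depth(branching, d) for d in range(depth + 1))


# Hypothetical assumption: each point raises 3 follow-up questions.
BRANCHING = 3
for depth in range(5):
    print(depth, items_at_depth(BRANCHING, depth), total_items(BRANCHING, depth))
```

Under that assumption, four levels of added detail already mean 81 items at the deepest level and 121 in total, which is why each further level of detail costs so much more than the one before it.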

So, just as funding agencies (e.g. NSF) require authors to store their supporting data, journals are signaling that they absolutely don’t see a way to help authors meet this new mandate! Granted, storage of supporting data is not a historical journal activity, but serving authors is a key journal activity (and a potential new business opportunity). Surely there is a creative way for journals to help authors with this problem without assuming the burden of reviewing all supplemental data?


First off, I think the usefulness of making data available is highly field- and experiment-dependent. If you’ve sequenced a genome, then yes, it’s of great use to let others have access to that sequence. But it’s unlikely that anyone really wants to look at, much less re-use, the 600 overnight time lapse movies you took of your particular cell type under highly specific experimental conditions. And for some humanities journals, the question gets even more stretched.

That said, there are many journals experimenting with this sort of activity in areas where it is appropriate. Here’s a recent one:
In some areas, there’s great value in being able to access someone’s dataset for confirmation and for re-use, and this should be encouraged.

There is a fundamental question though, of whether journals are the place where those data must live, or if they should instead be stored in repositories at the author’s institution (and linked to the journal article). There’s a great Timo Hannay quote floating around about how companies shouldn’t be relied upon for the long term storage of data as their half-lives are too short (as compared with the much longer-lived universities).

There’s also the question of what, exactly, is the purpose of a scientific article. Is it meant to be a quick, easily read, short summary of a set of experiments and the conclusions learned, or is it meant to be an exhaustive record of everything that was done? The latter is verging on the “research reports” that David Wojick mentions above. Both have value, and perhaps it’s unwise to try to force one to be the other.

I think your last para hits the nail on the head.

What is the point of a scientific article?
To which we might add – “right now” and then “in the future”.

One is supposed to be able to replicate an experiment from a paper… Sources should be correctly utilised. Hmmm.

Could we not conceive of a world where a published article represents a ‘view’ of the work that preceded its writing? With access points to allow the interested party to drill down deeper into the underlying data objects that enabled its creation? Including those 600 timelapse movies. Incidental data for one party might well be vital data for another, and you just can’t tell at the time.

Right now, I’m looking at an infographic from Science with this stat: 96.3% of researchers have asked their colleagues for the data that underpins a paper. Another stat indicates that >50% store data “in the lab”.

And finally, from the same infographic: “There are many tales of Archeologists burning wood from the ruins to make coffee.”

This all smacks to me of traditional thinking – it doesn’t fit in the box so throw it out. Surely the answer here is to build a better box.

The question is whether one needs everything in one “box” at all. The scientific article is a highly evolved form that serves a particular purpose. But I really like the term “access points”, and would think that instead of trying to build one “box” that serves all needs, why not lots of boxes that can be interconnected? Lots of small pieces that do what they’re supposed to do very well that can all be connected into a larger whole?

I tend to think about university repositories as being the ideal home for data. Most institutions have librarians with great training in the organization, storage and recovery of information, so why not put that expertise to use?

What do you do with data that’s not yet published? Researchers need to store and organize their own data for their own future experiments. This is becoming an increasingly difficult problem. Why not kill two birds with one stone and create a university-wide data repository (with the option of making data public or private)? Those data could then be connected to the published paper for those looking to dig deeper.

A counterpoint on sharing data, and why it’s a timesink and often without great value, can be found here:
Though as I said above, the value will vary quite a bit depending on the data and the field. The other really interesting concept is the idea that soon, for things like DNA sequence, we’ll reach a technological point where it’s cheaper to recreate the data from scratch each time it’s needed than it is to store it long term.

Sadly, I can’t think of a single university repository that has proven nearly as useful for disseminating and discovering data — attached to a given publication or not — as a larger project like Dryad. In theory, they’re wonderful, because they involve talented librarians in the OA value proposition, but in practice, they haven’t quite been up to the task.

As for whether “conclusions [should be] more valued than sheer bulk of data generated,” I think this is a bit of a false dichotomy. The journal article, along with the “conclusions” that it necessitates, has always been a pretty good evaluative totem for the data and code snippets that it helps to locate elsewhere. Yet the laborious pre-publication peer review process is hardly ideal for allowing researchers across the globe who may have better information available to draw conclusions about the data collected by others who may be better-equipped to produce this data. While it may seem logical to privilege academic authors’ analytic abilities over the simple “production” of data, it must be said that both are enormously important, and it does not seem to me a bad thing that this dualism is becoming more flexible.

With the advent of better data indexing (David Smith’s “better box”), it may become increasingly onerous for the few real tool-builders in the scientific community, invaluable though they are, to bother about putting together a paper to correspond to some otherwise “unpublishable” development that they deserve a great deal of credit for. This is similarly true of publishing data associated with negative results, which is badly stigmatized even at the level of individual labs thanks to years of poor editorial policies. The new FigShare repository is attempting to do something about this, though I will watch with the same bated breath as you to see whether they are successful in sorting out the mountain which they hope to inherit.

Alex–I agree it’s not as neat as I may have presented it–if you run a sequencing center, the quantity of sequence you produce is an important number. But the quality of that data is also important, as is the usability and its meaningfulness. You can’t assume a researcher is doing good work just because they crank out a lot of it. But yes, you’re essentially right, you need data in order to learn something new.

I do agree that there aren’t any university data repositories that can be used as models here, but we are talking about innovation, building that better box as David Smith mentions above.

I think the other great conflict here is the notion that people want less to read, not more. Are there a lot of researchers who want to spend more of their time dumpster diving through the raw data of graduate students in other people’s labs? One thing the scientific article does provide is a great time savings, a brief summary of what was done and what was learned. For any given paper, that’s all that the vast majority of readers want.

The NSF requirement is regarding projects, not publications. There may well be a business opportunity here but it is not intrinsically suited to the journals. On the other hand there is also an effort at NSF (and elsewhere) to link articles to projects, so there might be an evolving tie. But data archiving is an expensive and specialized enterprise.

The point is that the author is the same person as the researcher who has the data storage obligation to the funding agency. If you’re in the “business” of serving authors (as are journals), this is something you might do to help ensure your long-term relevance.

I’d modify the assertion that “serving authors is a key journal activity” to read “helping authors reach readers with quality material is a key journal activity.” Journals don’t exist just to service authors (well, most don’t). They exist to filter information for readers. JEM is saying that the journal filter eliminates raw data, in essence. I think that’s a fair position to take. I agree there is a potential new business opportunity in supporting authors who wish to deposit their data somewhere, but I’m not sure whose opportunity it is.

Authors could claim that they bring readers to the journal rather than the other way round!

At best it’s “chicken and egg” – if journals forget that they need to attract/serve authors their “supply chain” could disappear. That will be especially true if the marketplace develops other types of “quality” filters.

A “stick to our knitting” survival strategy might be viable for journals with mega brands and broad audiences (Nature, NEJM etc.), but that may not be enough for journals that focus on peer-to-peer communication within narrow research communities.

I’d be curious to see whether any of the journals that are now tightening their data belts are actively recommending any alternatives for data publication. Although David makes a good point about journal articles being co-opted by exhaustive and indigestible “research reports,” it’s very difficult to argue that these don’t have real value for the replicability of science, and supplemental data is part and parcel of that. Yet the journal article is still, narrowly, the only sustainable publication model from the perspective of academic evaluation. That argument won’t go away, to be sure, but it also makes it seem very much like JEM is part of the problem without contributing any obvious solution. The comments on this recent iPhylo post are a good reminder that what is true of supplemental data is true of code (in bioinformatics and elsewhere), too:

I don’t think there’s much question over the idea that making data available can be valuable, and greatly helpful for the replicability of science, but there are several issues in your post that could be separated out–making data available, publication, and academic evaluation.

The research paper serves a particular purpose. Does a data repository serve that same purpose? If not, then must the two be combined? Should making data available be seen as something that must be “sustainable” or is it instead a cost center for an institution (as opposed to a revenue generator)? Is the generation of data something that should factor into a researcher’s evaluation at their institution? Shouldn’t conclusions be more valued than sheer bulk of data generated?

It’s problematic that in the linked article, many of the suggested metrics are based more on popularity than on quality. It’s much easier to garner attention than it is to make a meaningful contribution to progress, so I worry that they’d be measuring the wrong things, and researchers would end up spending more time on publicity than on experiments.

The matter of supplemental material may be more acute for authors of books, especially in the humanities, who routinely cut information of value to scholars because of the limitations of the printed format–because books, unlike journals nowadays, remain “analog first.” I have heard many authors complain about not having a place for this material. That’s authors, not reviewers. I doubt that the availability of this material would be a burden to readers, as so many scholarly monographs are not read in their entirety in the first place.

My original vision for the Penn State Office of Digital Scholarly Publishing was to include the publication of hybrid books, where the basic monograph would be made available in print form as usual but the ODSP web site would provide a place for supplemental material like data sets and interviews to be stored and linked to the printed text (which would also be available online).

The ACRL confronted the question of data curation in a report titled “Establishing a Research Agenda for Scholarly Communication” (Nov. 2007). In that same year Christine Borgman published a book (MIT Press) titled “Scholarship in the Digital Age” that took the “data deluge” as a major focus of concern. I wrote about both publications in essays that are available here:

I can’t say that I’m upset by this decision, and I think the amount of supplemental information has gotten out of hand. I understand that there are many reasons for including supplemental data, as others have mentioned. This might be cynical, but I thought that one reason for including supplemental information, especially in the sciences, is that the authors can claim first publication rights on the information.

My experience in biology has been that research “stories” are constantly evolving and often blend into each other as parts of an overarching, multi-year project. Therefore, a group might generate experimental results that could “fit” into a number of articles. A group may only have some of the information required to publish article X, but instead of waiting to complete that story, they stick those few pieces of data into the supplemental information for article Y, which is ready to submit and consists of related research. If someone is in a competitive field, and the information is likely to be heavily cited, then adding it to a supplemental section early is advantageous. I wouldn’t be surprised if some groups use this as a deliberate strategy, and they might be especially unhappy about the restrictions on supplemental information.

I’ve been “scooped” by only peripherally related supplemental information in the past, and it’s painful. Once something has been published as a supplemental figure, it’s impossible to use that piece of information as a novel research result in a later article, even if the result was obtained independently.

Web-hosting and printing are practically free, especially in the context of the usually obscene amount of money that journals charge per issue, for what is almost always voluntary work by the reviewers, editors and authors. The journals should offset the reviewers’ efforts financially and spend money on services to host the data.

The complaint seems to be that looking at the data is too much work. Isn’t it the reviewer’s and editor’s job to properly vet the paper? Otherwise there is not much difference between a detailed blog post and a journal paper.

Compared to computer science and statistics, which is where I come from, the medical sciences community seems almost paranoid about sharing data. Any venue that encourages transparency should have been seen as a welcome change, in my opinion. I can understand why the science community is so cagey about their data; nonetheless, it is not a good thing.

Apologies in advance if this comes off as rude; I am just writing down my first reactions. In our area, high-quality but free journals and freely shared data are usually the norm.

If web hosting is practically free, why is arXiv always looking for money? As to whether printing is practically free, you obviously have never paid a print bill. Setup costs alone keep many players out of the printing game.

As for “voluntary work,” who coordinates that? Who monitors it? Who grades reviewers, manages their conflicts of interest, vets their input, and makes decisions? A lot of staff, all of whom earn salaries, have health benefits, and so forth. You’re living in a fantasy world if you think coordinating volunteers is cheap or easy. Also, these “volunteers” are happy to do the work for the many non-financial perks it provides.

The medical community is paranoid about sharing data for many reasons — regulatory, humanitarian (misinterpreted data have led to massive public health scares and overinterpretations that have been counterproductive), privacy-related, and competitive (there’s a lot at stake). Data is often shared informally by colleagues in ways we don’t see.

No need to apologize for tone. Good to get to the bottom of things. But data curation, presentation, and management is neither simple nor cheap. If we go into it believing it’s either, we’re likely to make major mistakes.

Of course I meant “free” in comparison to what the journals charge. Compared to the costs of printing 40 years ago (when journal prices entirely made sense), the marginal distribution costs are near negligible now. Journal prices have not fallen at the same rate; in fact, they have only increased. Yes, as a publisher you have to maintain a server, pay for electricity, run a press, maintain staff… but the cost per article is in fractions of cents. Surely you agree that the price of a volume is far higher than the cost of printing that volume or the marginal cost of hosting it on the web.

My main point was that editors and reviewers generally do it voluntarily as a community service. Yes, they do get an odd book and other perks free, but they are not quite remunerated at the market rates. They do it not just for the value they get from these material perks. This is a huge cost saving for the journals, and they could, if they so choose, pass on those benefits in the form of a curated repository at very little extra cost to them. Not that they have to make that free either.

Not everybody may want to go through that data, but it will be there if one wants to verify or compare. Science benefits, I would think. However one valid criticism that came out via the comments is that subsidiary data is often used to bypass the review process. That does not arise in our field.

It seemed to me from reading the post that turning in subsidiary information had almost become mandatory. That seems to me to be the problem that needs fixing, rather than stopping the acceptance of subsidiary data altogether.

The typical cost of a hosting service is around $40 per article per year. YMMV, of course, but while bits may be almost free, the software behind the bits costs millions.

As you said it depends from case to case. Those costs can be controlled quite a bit. I think as Kent Anderson mentioned another major cost is the human resource cost of keeping things organized.

Look at the Journal of Machine Learning Research, the frontline journal in machine learning. It is hosted with open source tools, so software costs are minimal. There are bandwidth costs of course, but those are not terribly expensive. They do sell printed versions that are non-free to offset the expenses. Another good thing is that the authors get to retain the copyrights over their submission, which many established journals force you to waive. They do not host the data for experiments, but usually it is up to the authors to provide it.

Open source does not make things cheaper. It transfers the expense from the cost of software to the cost of time spent by people who build and maintain systems. In many organizations this work is done on a voluntary basis, which skews any kind of economic analysis. Such volunteerism is admirable. It is not a model, however, for the vast majority of the 24,000 peer-reviewed journals. There are multiple models for publishing, each of which has certain virtues and certain limitations. Talk of open source for the people operating a humanities journal with 600 subscribers does not move the ball down the field. Talk of open publishing in clinical medicine could result in criminal negligence. What appears crazy from a distance sometimes makes perfect sense when viewed close up.

Yes, and Colin Day long ago pointed out the high hidden cost of having faculty do work that lower paid but more skilled publishing professionals could do better.
