Data Dump

In science, data is good. So shouldn’t more data be better?

Not according to the Society for Neuroscience, which has decided to stop accepting and hosting supplemental data with journal articles.

Announcing their decision in the August 11th issue of the Journal of Neuroscience, Editor-in-Chief John Maunsell explains that the rationale was not about space and cost, but scientific integrity.  Supplemental materials have begun to undermine the peer review process.

Validating supplementary data adds to the already overburdened job of the reviewer, Maunsell writes.  Consequently, these materials do not receive the same degree of rigorous review, if any at all.  At the same time, the journal certifies that they have been peer-reviewed.

Since 2003, when the Journal of Neuroscience began accepting supplemental data, the average size of these files has grown exponentially and is rapidly approaching the size of the articles themselves.

With few restrictions on space, reviewers may place additional demands on authors, requiring them to perform new analyses and experiments and add them to the supplemental data. These additions are “invariably subordinate or tangential,” Maunsell maintains, yet they represent significant work for the author and thus delay publication. Supplemental data thus changes the expectations of both author and reviewer, leading to what he describes as an “arms race”:

Reviewer demands in turn have encouraged authors to respond in a supplemental material arms race. Many authors feel that reviewers have become so demanding they cannot afford to pass up the opportunity to insert any supplemental material that might help immunize them against reviewers’ concerns.

In the August 11th issue of the Journal of Neuroscience, 19 of the 28 original articles contained supplemental materials, suggesting that they have become normal parts of the publication process.

Yet having two places to report methods, analyses, and results compromises the narrative of the article as a “self-contained research report,” Maunsell argues.  Instead of a clear scientific story (e.g., this is the problem, this is how we approached it, this is what we found and why it is important), the article becomes a kitchen sink: a clear story with a collection of related but often unimportant details attached.

As Maunsell explains, there is no viable alternative to ending the practice of accepting supplemental materials entirely.  Limits on the number of additional tables and figures would be arbitrary, and requiring that only “important” additions be included would make enforcement impractical.  This doesn’t mean that authors cannot host their own supplementary data and link to it from the article; the journal simply will not vouch for its validity or persistence. Rationalizing the decision, Maunsell writes:

We should remember that neuroscience thrived for generations without any online supplemental material. Appendices were rare, and used only in situations that seemed well justified, such as presentation of a long derivation.

The decision to end a seven-year practice of accepting supplemental scientific data is fascinating when viewed within the larger context of science policy, which has lately been dominated by ideals of openness and transparency.

Recent developments such as Open Notebook Science, Science Commons, Open Source, Open Access, and Open Government rest on the notion that greater access to data allows for greater efficiency and greater accountability.  What this view ignores, however, is that access is not enough.

Readers seek filters that attest to a source’s validity (the backlash over WikiLeaks’s publication of thousands of unverified documents may signal that established news organizations are still valued for their ability to verify the authenticity of facts and stories). A journal’s decision to cease publishing supplementary materials affirms a similar position: all the facts of science should pass through the same filter.

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist.


21 Thoughts on "Ending the Supplemental Data “Arms Race”"

At the most recent SSP meeting, a representative from Cell Press noted that they have strictly reduced the amount of supplemental data they allow authors to publish. This has apparently led to a strongly positive response from readers, who appreciate the new brevity. J. Neurosci. looks to be taking this to the next level. Given the poor numbers most journals see for access to supplemental data, I’m betting readers will not complain.

Interesting to read your conclusion — that “access is not enough” — in light of the NY Times piece last week about how sharing data led to advances in Alzheimer’s research (which is basically along the lines of your examples of Open Notebook Science, et al.).

If journals (and their authors and reviewers) are finding that vetting data at the publication stage is too onerous and leads to an “arms race,” does that not suggest that the move toward transparency needs to be accompanied by practices for filtering at another stage of the process?

Perhaps I am missing your point?

I think we’re talking about different, although related, ideas.

J. Neurosci.’s position on ending the practice of hosting supplemental data is based not on what data scientists can share or collaborate on (the topic of the NY Times article), but on what the journal is willing to certify as credible science.

I view scientific articles as short stories with linear narratives and clear conclusions, not as data dumps that include everything a scientist has worked on in the process of discovery. While there is a case for making everything a scientist has worked on public and transparent, there are real costs incurred to this practice. I think the decision by J. Neurosci. reflects the tension between full disclosure and timely dissemination of scientific results.

At what other stage — other than peer-review — do you suggest data be vetted?

We agree on the issue of articles not being data dumps, but I’m interested in the implications of that. I suggest that access to data is in fact critical, but it needs to be managed effectively. Peer review of the data need not happen at the point of publication. Rather, we need to develop (are developing?) protocols for collecting and sharing data, so it can be used and evaluated by as many scientists and interested others as possible.

So where would the vetting of data occur? At several points in the research cycle. For example, the data that led to advances in Alzheimer’s research are/were presumably being vetted long before articles are written — scientists are able to assess far more data than they could amass on their own, develop new research protocols, and make new discoveries accordingly. They are also, I assume, able to discount data that appear questionable. When new discoveries are written up and published, their supporting data can be linked to, and questionable data can be so described and also linked to. Other scholars who wish to continue the work will be able to evaluate the data as needed.

Other possibilities: perhaps data banks will have to initiate some form of peer review or assessment when data sets are deposited. And if data are widely available, closer evaluation of the data can and will happen after publication.

All of this assumes effective support for donating, storing, and providing access to data; but publishers won’t be doing all that.

No doubt there will be serious costs associated with this level of data storage and sharing (and structuring and metadata-creating). But if scientific advances rely on data that is accessible, not siloed – shared, not owned – then we need to acknowledge this and plan accordingly.

As more funders are requiring data deposits, and more groups are developing practices to support it (providing DOIs and other metadata for data sets, enabling long-term storage – e.g., a collaboration between the National Evolutionary Synthesis Center and the Dryad repository, funded by the NSF), I hope we can move away from the notion that publishers should provide access to data, and be responsible for vetting it.

In the long run, I’m seeing this as part of the move to publish, and then filter – publication would thus not be the only signal that the conclusions (and the data on which they are based) are valid. Peer review of data, in this model, would not be the responsibility primarily of publishers and their reviewers, but of other participants in the research cycle. I recognize that this has implications for tenure and promotion and a host of other procedures in the scientific academy. But it seems to me that if scientific progress will thrive on widespread data sharing, we cannot (as J. Neurosci. seems to acknowledge) expect publishers to bear the burden of evaluating data one article at a time. We need to build procedures that acknowledge the importance of sharing data, and reward those whose data is valuable.

The value of data sharing varies greatly from field to field and even from experiment to experiment within a field. For some areas of research, making data available for re-use by others is a no-brainer (things like DNA sequence data, protein structures, computational algorithms, epidemiological data, economic data). There are obvious ways to re-use that data, to plug it into new experiments and gain new conclusions. For other types of research though, the benefits are less clear. Steven Wiley makes a good case here why it’s not helpful in his field, and even worse, it’s a massive timesink:

Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes. Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results.

The wide variability in types of data, and types of experiments makes it virtually impossible to set up one set of standards that will cover everyone. I think you’re more likely to see pockets of activity, standards for annotation, organization and redistribution of data in subfields where it is appropriate. Much of this is already being done through centralized databases like GenBank and we could certainly use more resources like that.

But DNA sequence is easy to store and annotate. For other types of data, the answers are not so clear. I know imaging labs where each student generates terabytes of data each week. Long term storage and annotation of their time lapse movies of cellular behaviors is a major problem without an obvious solution. I’d like to see institutional libraries taking a larger role in creating data archives. Librarians are experts in the organization, archiving and recall of information, so it strikes me that this might fit well with their skill-set. It would also greatly increase their interactions with researchers on campus, and hopefully their budgets as well. Much of what’s stored will likely be of little use to other researchers, but may prove to be of great value within the laboratory that created it.

As for data verification, again, that varies widely. For some mathematical data, you can simply add up the numbers and see if it’s correct. For other types of data, the only way to verify it is to repeat the experiment, which can be time-consuming and involve costly reagents and equipment. I can’t really see journals setting up their own test labs to verify data that comes in with papers. For researchers looking to re-use data, the question of trust is going to be an interesting one. Are you willing to stake your reputation on data collected by someone else? There’s a big difference between citing someone’s experiments with new experiments based on their conclusions, and drawing new conclusions based solely on someone else’s data. For some types of research, it’s not a big deal, but for others, some redundancy is going to be necessary. If I was putting my name on a paper, I’d probably want to repeat some of the basic assays to satisfy myself of their accuracy.

I take data to include additional explanation and results, not just raw numbers. This can be very useful. In fact, I consider journal articles to be merely large abstracts, compared, for example, with the 50-100 page research reports that we publish free online. Since virtual space is virtually free, what is the problem? If just one person finds just one out of a hundred SDs valuable, it is worthwhile. Value is not the issue here.

If “data” is going to include further text and explanations, then it’s no longer just a question of storing raw data. If you’re now telling a new story, shouldn’t that story also be peer reviewed and part of the article? Isn’t that the dichotomy Phil’s article protested, that part of the article is scrutinized and then the authors get to add on whatever they want in the supplemental material? To use your reasoning, then why bother peer reviewing anything at all? Why not just publish everything in the journal with no editorial oversight?

What is the purpose of a journal article? I don’t think most readers want to have to dig through a massive pile of unreviewed material. They want a quick story explaining what they need to know, and they want as verified a story as possible. I understand the value in what you’re proposing, I just think it’s a different thing, not a journal article.

Virtual space is not free, as Kent recently pointed out. Is it reasonable to expect a journal to archive 500 terabytes of high resolution imaging data that backs up one paper? What about the 20 other papers published that month? And the next 20 the month after that? Is it reasonable to expect the journal to pay for the bandwidth to distribute that data? Amazon charges for those sorts of services. I’m not sure most publishers want to go into the server farm business.

Will researchers accept the necessary increases in subscription prices or in author fees for such services? Are you willing to pay an ongoing monthly fee for storage and distribution of your data? What about the organization and annotation of data sets? Is that the responsibility of the author or the editor? If the author does it, then the editor still has to review it, again massively increasing time, effort and costs. Journals are businesses, and if only one person is going to find value in that data set, then it doesn’t justify the costs as our readers are unlikely to want to pay for it.
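Rough arithmetic makes the scale of those costs concrete. The sketch below uses illustrative, assumed rates (not any provider’s actual pricing) and a hypothetical `monthly_hosting_cost` helper to estimate what archiving and serving one large supplemental dataset might run per month:

```python
# Back-of-envelope estimate of what hosting one paper's supplemental
# data might cost a publisher. Both rates are illustrative assumptions,
# not actual prices from any provider.

STORAGE_USD_PER_GB_MONTH = 0.02   # assumed cloud storage rate
TRANSFER_USD_PER_GB = 0.09        # assumed outbound bandwidth rate

def monthly_hosting_cost(dataset_tb, downloads_per_month=2):
    """Estimate monthly storage + bandwidth cost for one dataset."""
    gb = dataset_tb * 1024
    storage = gb * STORAGE_USD_PER_GB_MONTH
    bandwidth = gb * downloads_per_month * TRANSFER_USD_PER_GB
    return storage + bandwidth

# A 500 TB imaging dataset, downloaded in full twice a month,
# comes out to roughly $100,000 per month under these assumptions.
print(f"${monthly_hosting_cost(500):,.2f} per month")
```

Even if the assumed rates are off by an order of magnitude, the point stands: at these volumes, storage and bandwidth are a recurring line item, not a rounding error.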

What I don’t understand – why not accept/host the supplementary material and mark it clearly as “not peer reviewed” or similar?
This way the reader knows what he/she is getting into and can choose to ignore the supplements altogether.
You would need to do a bit of fine-tuning regarding usability and web design, but that is hardly rocket science.
Only disadvantage I see is potential increase in hosting cost for publisher.

Indeed, the simple solution here is to not review the supplemental data (SD). The article stands or falls on its own. In fact, much of the problem seems to be confusion over what goes into the article and what goes into the SD.

I am reminded of one of the first cases of large-scale organizational confusion I ever diagnosed, back in the ’70s. Congress changed the 200-year-old authorization process for federal water projects (dams, harbors, irrigation systems, levees, etc.) from a one-step to a two-step process. This raised the deep question of what sorts of engineering analysis went into the first stage versus the second. Sorting this out paralyzed the water resources agencies for over a decade.

I think that gets to the fundamental question of the role of journals and the role of the journal article. If it’s not reviewed and the reader can easily ignore it, why should it be included with the peer-reviewed journal article? Is the purpose of the journal article to contain everything that might pertain to the story being told, or just the essential material to tell that particular story?

Couldn’t the journal article simply provide a link to the authors’ website where they could put up as much unreviewed data as they’d like?

Including such a link, perhaps as a citation, is indeed an alternative. However, the journal then could not include the SD in its content searches, nor would it get credit for citations to it or hits on it. For that matter, why post the article? Why not just publish URLs? You could have a one-page journal!

Again, this gets to the heart of what one considers the purpose of the journal article. If you want to dig through mountains of data, and if you want to do away with editorial oversight and peer review, then each researcher could indeed simply maintain their own website and post their own data. If, however, you are looking for an efficient and reviewed means of understanding a research project, the journal article provides that function. Once you allow unfettered data dumping, you lose that efficiency and that review, and you’ve no longer got a journal article, you’ve got a lab notebook. I don’t have the time to read through most researchers’ lab notebooks. I want a quick synopsis of the experiments done and the conclusions drawn. Given the low numbers journals see for access to supplemental materials, it would seem that I’m not alone.

I am no longer clear which alternative you are talking about. My view is that the article is selected because the results are important; that is what peer review helps to do. But SD can also be important, so it too should exist, though there is no need to review it. The article stands on its own. You might want to have a size limit on SD.

As for low usage, one should expect that the number of people who want details is smaller than the number who just want the article. The same is true for those who read only the title, or go on to the abstract but don’t read the article. There is a scaling process here: the number of users is inversely proportional to the level of detail.

So I don’t think it does involve the purpose of journals and all that. It is just a question of providing a supplemental service. The problems pointed out in Phil’s article are not deep. They stem from letting the reviewers dig into the SD, which they need not do.

I agree that supplemental data has value and I’m not arguing for the elimination of it. I do think that the value varies greatly depending on the type of data and the field of research. I also think that journals need to have strict policies concerning SD, placing reasonable limits on what can be included. Those limits have been abused as of late, and many journals are starting to clamp down on what they’ll allow, which is a good thing.

How one sets those limits is directly relevant to the “purpose” of the journal article. Is the journal article meant to be an efficient means of communicating the story of a research project, or is it meant to be the complete archive of all information related to that research project? If the former, a limited set of directly related SD seems reasonable. It’s when you start getting into the latter that I think the purpose of the journal article is lost. There’s value in making complete data sets and lab notebooks publicly available, I’m just not convinced that it’s a necessary part of the journal article. I’d rather see those types of archives kept separately.

Yes, that would be another option. The downside is probably that not all authors are able to maintain stable websites/persistent URIs. So (as a publisher) I would see it as an extension of service to the author to allow him to publish (clearly marked as such) research data directly relevant to the peer-reviewed material somewhere on the publisher’s website.

As a reader, you wouldn’t need to “dig through mountains of data” – if you’re only interested in the article, simply ignore the “lab notebook link” below it.

I think it’s a concept that’s going to have a variable level of utility, depending on the data and the field involved. For some fields (Genomics, as an example), there’s great value in including a fairly straightforward dataset with the paper. That data can immediately be re-used by the community.

For other types of data though, you’d need enormous amounts of storage space and enormous amounts of bandwidth if anyone decided to download them. Getting things organized and standardized is a huge timesink, either for a researcher or an editor. I’m not convinced that the benefits will outweigh the costs for a lot of the data generated in most fields. Researcher time can be better spent doing new research, and editorial time is too expensive to devote to something that’s likely of such low interest.

As I suggested above, institutional repositories might be a better way to handle the problem. Run things through your institution’s library, have an information science specialist working directly with the author to get the material into shape for archiving, then provide a stable and persistent resource where it can be found. These archives are likely to be of more value for the lab that generated the data than for outside laboratories, as they’d prevent the usual loss of knowledge that occurs when a student graduates or a postdoc moves on.

“At the same time, the journal certifies that they have been peer-reviewed.”

Certifies? What does that mean? There are no standards for peer review; everyone does it differently, and probably no one even does it the same way from one article to the next. How can they state that they are certifying anything except that they sent the article out to reviewers and got reviews back, good, bad, or indifferent? We don’t even know what the reviews were or whether they influenced the editors at all.

Certification? Bah!

“Certification” is probably the wrong word. Journals do, however, stake their reputations on the quality and accuracy of the material they publish. If editors let articles into the journal that aren’t accurate, then the reputation of the journal falls, which hurts both future submissions and subscriptions. It’s in their best interest to control for quality as much as possible.

Is it possible to verify the accuracy of a data set without actually repeating the experiments themselves? Is this a reasonable activity for a journal? Isn’t there a level of trust that has to come into the process at some point?

In the past 7 or 8 years, there has been a proliferation of supplemental material submitted to scholarly journals. With no standards or best practices, publishers have varied enormously in how they are handling it. As noted above, the Journal of Neuroscience announced that the journal would accept no more supplemental material.

Peer review or not is one question. How essential it is to understanding the science in the article is another. The question of what happens to datasets looms large for many.

A NISO-NFAIS Working Group is developing a set of Recommended Practices for Supplemental Material. Two coordinated groups are deeply engaged in considering the problems and possible solutions: a Technical Working Group co-chaired by Dave Martinsen from ACS and Sasha Schwartzman from AGU, and a Business Working Group co-chaired by Linda Beebe from APA and Marie McVeigh from Thomson-Reuters.

One of their tasks is to review and compare current guidelines and examples of materials. The Working Groups invite you to send your examples and particularly your guidelines for authors to You can track the work of the groups on
