"Broken type" by vial3tt3r via Flickr

The report, “Peer Review in Scientific Publications,” released last week by the U.K. House of Commons Science and Technology Committee, reiterates much of what we already know:

  • That peer and editorial review is important for maintaining the integrity of the scientific literature
  • That the process of peer review is not consistent across all journals
  • That pre-publication review may be supplemented — although not replaced — by post-publication review
  • That publishers need to continue experimenting with other models for review and dissemination
  • That editors and senior academics need to educate new scholars on how to provide quality reviews
  • That granting and promotion committees should not rely upon a single metric (e.g., the impact factor) in order to evaluate the merits of a paper

The report also repeats a recommendation that the United Kingdom set up an independent office dealing with cases of suspected scientific misconduct, similar to the United States’ Office of Research Integrity.

What is new in this report is a recommendation that researchers provide access to their raw data to editors and reviewers at the submission stage, and then to the public after publication:

The presumption must be that, unless there is a strong reason otherwise, data should be fully disclosed and made publicly available. In line with this principle, where possible, data associated with all publicly funded research should be made widely and freely available.

While the premise of open access to data for the purpose of verification is deeply grounded within the ethos of science, such free sharing of data is rarely followed unconditionally in practice.

A 2009 study of authors who published in PLoS journals illustrated that even an explicit policy requiring the sharing of data as a precondition of publication goes largely ignored: only 10% of authors were willing to share their dataset with the inquiring researcher.

Scientists do share data, although they do so carefully, sparingly, and on their own terms. This doesn’t mean that scientists are hiding evidence of misconduct; it means only that making all of one’s data public is onerous, comes with few rewards, and distracts from producing new science.

There are other reasons why primary data do not flow freely from researchers. For one, editors and reviewers often don’t want to see them. Last year, the Journal of Neuroscience suspended the practice of publishing supplemental data because it was burdening reviewers, overwhelming authors, and slowing the publication process. For similar reasons this year, the Journal of Experimental Medicine followed suit, permitting only “essential” supplementary data to be published with articles.

Reacting to the recommendation that all scientists provide public access to their data, Tracey Brown from the non-profit Sense About Science commented for Science:

It is not clear from the Committee’s report what the problem is that would be addressed from raw data publication nor the other costs and effects of demanding it.

While the House of Commons’ report is filled with rich details for anyone interested in the practice of peer review — its strengths and limitations — I find the recommendations that stem from these details somewhat contradictory.

On one hand, the House committee understands that the practice of peer review is highly variable, that it reflects the different values and needs of individual communities, and that experimentation should be encouraged at the grassroots level.

On the other hand, the committee is also willing to make grand claims about the system as a whole and propose system-wide changes that appear to ignore how peer review is actually practiced.

When the integrity of a process is challenged, transparency is often cited as a solution. It is not evident, however, that open data mandates would improve the integrity of the peer review process, although they may provide the public appearance of it. More likely, they will greatly increase the costs of conducting good science, slow the process of discovery, and put UK scientists at a disadvantage relative to their international peers.

Before public officials push to legislate such requirements, it would be useful to understand what they’ll be sacrificing in return.

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist.


33 Thoughts on "Will Open Data Solve Peer Review Concerns?"

That having others’ (publicly funded) research outputs available provides opportunities for further discovery is (more or less) unarguable. And credit will soon come to those who share, with a component added to acknowledge reuse. So there’s a pull and a push for sharing (papers with shared data also get cited more), even ignoring various policy pronouncements.

Community oversight is a ‘nice to have’, but realistically the idea that anyone is going to rerun analyses as part of the review process is misguided. The best we can hope for is that major repositories curate data on the way past to check for completeness according to best practice (as is done for some microarray data already). Also, just putting it out there (with the implied threat of eventual discovery if the data turn out to be nonsense) will have some effect — avoiding hostages to fortune, essentially.

The real problem though is that the infrastructure that should make it straightforward to collect, annotate and share data efficiently as part of a workaday workflow — software, standards, databases, etc. — is being generated in a research context, competing with real research projects for money to engineer infrastructure. That sucks. Multiple screeds out there on why it sucks and plenty of half-finished poorly-conceived stuff out there as evidence. Until the provision of infrastructure is taken seriously by funders (as a group), the classic Heath Robinson / Rube Goldberg approach to research support will continue, to all our detriment.

I think that the raw data should be put into some sort of repository (maybe you load the data into the database straight from the bench). The repository would be searchable, so that when you embark on your next experiment you can search through and see what has been done before, even get in touch with those scientists. To be honest, if you don’t want to share your major discovery, surely sharing the things that didn’t work would help to speed the process of discovery? Increase efficiencies? Pie in the sky, I’m sure, but worth a thought.

The issue of data cries out for some quantitative analysis, because otherwise it is hopelessly vague. The very concept of “the data” involved in any given research project can range over many orders of magnitude, and so then does the potential cost of preparing, storing and sharing it. I did staff work for the US Interagency Working Group on Digital Data and we struggled with this problem for two years. The problem is that there is no hard data on data, so our report was just as inconclusive as everyone else’s.

Data preparation, storage and sharing can be a significant fraction of project cost, sometimes a large fraction, 30% or more. Moreover, these costs can continue to mount long after the project is completed. Assuming a zero sum with actual research, do we really want to make significant cuts in research just to make all this data potentially sharable? There is little evidence of big benefits, to support this goal as a general proposition. There may well be specific cases where sharing is worth paying for, but they need to be identified and funded in competition with research. No one wants to cut research just to build data repositories on principle.

I would add that there is no way that the issue of integrity can justify the huge sums potentially involved in a general program of data sharing. But there are communities who are finding these costs worthwhile on a discovery basis. Communities like astronomy, ecology and particle physics for example. On the integrity side we already have the drug development case, where the costs are painfully obvious. The debate should always begin with the fact that data sharing is difficult and expensive.

Thanks for seeing the nuance in my argument: I’m not arguing against requirements for data sharing, only that it doesn’t make sense to mandate this for all funded research, in all disciplines, in all circumstances. This is what I meant by the House report acknowledging that peer-review standards were based on community standards but then recommending global requirements.

The UK’s ESRC already has a data repository and you have to promise to deposit your data if you are funded by them. However, they don’t usually accept data – especially not from small scale experimental studies – even though you have to send them a good deal of metadata to help them to decide if they want your data.

The ESRC data repository, the Economic and Social Data Service, takes all data it is offered, unless there is an alternative, more suitable ‘home’ for it or there are overriding rights or access problems. UKDA-store holds many small-scale experimental studies. The main catalogue holds over 5,000 larger studies and survey series.

The ecology and evolution community has taken quite a big step on this recently: many of the big journals in the field implemented a ‘Joint Data Archiving Policy’ in early 2011 that mandates putting the data underlying a manuscript in a public archive. It’s been quite well received by authors, mainly because there’s an easily identifiable time when the data should be archived (publication) and a reasonably discrete dataset. Accompanying this policy is the development of a database (called Dryad) where generic research data can be stored, and there are close to 1,000 data packages on there now. Whether or not the data being stored are sufficient is anyone’s guess; I’m involved with a grad student class that might take a number of papers that have archived their data to try recreating the results…

I am genuinely curious how those who point to Piwowar et al. as “proof” that open data leads to more citations explain the following excerpt from the Discussion section of their PLoS paper:

We note an important limitation of this study: the demonstrated association does not imply causation. Receiving many citations and sharing data may stem from a common cause rather than being directly causally related. For example, a large, high-quality, clinically important trial would naturally receive many citations due to its medical relevance; meanwhile, its investigators may be more inclined to share its data than they would be for a smaller trial, perhaps due to greater resources or confidence in the results.

I’m going to go out on a limb and say that I think data sharing is really important–this probably stems from doing research into hybrid zones between diverged populations (e.g. fire bellied and yellow bellied toads). The key question for these is ‘is this zone stable through time?’, and whilst there is often a study from decades ago in the same place, you’re never quite sure as you can’t see the original data. For me, making data publicly available is thus about more than fulfilling a funder’s requirement or a journal policy–it’s about giving scientists ten, fifty or a hundred years from now access to a really valuable resource.

The cost of curating all the data, notes, etc., from every field study, for 100 years is simply prohibitive. Somebody has to specify what is worth saving.

The raw data for most studies in ecology and evolution are typically less than 1 GB, and generally consist of text files. To preserve the data associated with 10,000 publications per year for the next 100 years could be accomplished with a one-time endowment of $10 million, or about $500K per year (I’ve seen a detailed study on this). Considering that these amounts are equivalent to a handful of standard research grants, I’d say that was an absolute bargain if it meant that the data for an entire field could be archived and curated for a century.
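A back-of-the-envelope sketch of this estimate (the 5% annual payout rate is my assumption; the endowment size, publication count, and per-year figure come from the comment):

```python
# Rough check of the archiving-cost claim: a $10M one-time endowment
# paying out ~5%/year funds archiving for 10,000 publications/year.

ENDOWMENT = 10_000_000     # one-time endowment, USD (from the comment)
PAYOUT_RATE = 0.05         # assumed sustainable annual payout rate
PUBS_PER_YEAR = 10_000     # publications archived per year (from the comment)

annual_budget = ENDOWMENT * PAYOUT_RATE        # matches the ~$500K/year figure
cost_per_pub = annual_budget / PUBS_PER_YEAR   # storage/curation cost per paper

print(f"annual budget: ${annual_budget:,.0f}")   # annual budget: $500,000
print(f"per publication: ${cost_per_pub:.2f}")   # per publication: $50.00
```

On these assumptions the implied curation budget is about $50 per published paper, which is why the commenter calls it a bargain.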

This is a good example of the problem: people think that the cost of storage is the primary cost, when it is really the cost of people. To begin with, let’s suppose the labor to prepare and submit the data is 100 hours (about two and a half weeks) per publication. That alone is 1 million hours per year. If the fully loaded labor cost is $100/hr, that is $100 million/year taken away from research.
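The labor arithmetic works out as claimed; a minimal sketch using the comment’s own hypothetical figures (100 hours per publication, $100/hr fully loaded, and the thread’s 10,000 publications/year):

```python
# Labor side of the estimate: preparation time, not storage, dominates cost.

HOURS_PER_PUB = 100       # hypothetical prep/submission time per publication
PUBS_PER_YEAR = 10_000    # publications per year (figure used in the thread)
RATE = 100                # fully loaded labor cost, USD/hour

total_hours = HOURS_PER_PUB * PUBS_PER_YEAR   # total researcher hours per year
labor_cost = total_hours * RATE               # annual labor cost, USD

print(f"hours/year: {total_hours:,}")        # hours/year: 1,000,000
print(f"labor cost/year: ${labor_cost:,}")   # labor cost/year: $100,000,000
```

Even at a tenth of these hypothetical figures, the labor cost would dwarf the roughly $500K/year storage budget discussed earlier in the thread.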

Researcher burden is by far the biggest cost in this game. And if the data is not used then the labor is wasted. Even if it is used, but not for a long time, then its present value is near zero on a discounted basis. Curation has to compete with research.

Of course a lot of this depends on what you call the data, as I mention above. Are there field notebooks, maps, photos and videos, etc.? Turning truly raw data into something others can understand is simply a huge task. It is not the number of GB that matters, but the number and complexity of the items that have to be explained.

At the curation end one can’t run a scientific repository and data center for $500k. There is a lot of labor involved, especially customer service, plus forward porting all the content to new technologies every 5 years or so. I suspect $5 million per year is more like it, maybe more.

I disagree that researcher burden is the main cost here: careful curation of one’s own research data is just good practice, and hence it should be ready for public dissemination with minimal effort. One could make an equivalent (but clearly silly) argument that analysing the data thoroughly imposes an even greater cost on the community, running to billions of dollars per year ‘wasted’ by checking your assumptions and using sophisticated techniques rather than just doing basic statistics.

I do, however, agree that identifying the kind of data that should be archived is challenging. We normally request that the data that went directly into drawing the figures and doing the stats for the paper are made available, and these can be several steps removed from the ‘raw’ data itself. As this is a fairly new movement (at least in our field) I think it will take an iterative process of data sharing and reuse to establish the most useful and practical types of data to archive.

Many years ago I helped design the “burden budget” system of the US Gov’t, which regulates the burden that Federal regulations can impose on the regulated community. Arguing that there is no burden because people should do it anyway is a common fallacy.

But here the vagueness of “data” becomes apparent. What counts as data can vary over many orders of magnitude in the same study. What people have been posting as “supplemental data” in journals is actually highly refined, and probably tiny, compared to the raw data, and even that supplemental data is now being seen as too burdensome.

The movement has to come to grips with the issue of burden, or it is just wasting everyone’s time.

I am currently working on my Masters research in social sciences. I have already petitioned my department for space to archive my primary data for public access through the university web system. In fact, I intend to provide a link to the archive in the final version of my thesis and any subsequent publications that may arise from the research. With the advances in internet bandwidth, network/cloud storage, and access to powerful personal electronics, there is no legitimate reason that primary data could not be made available. Budgetary concerns are the only possible legitimate roadblock. However, all future research projects “should” include data storage and distribution in the research proposals. Lack of forethought or preparation is no excuse for intelligent academics to avoid making their primary data open to the public. Transparency is a critical complement to peer review in validating science and maintaining its integrity in an increasingly science-skeptical world.

Not sure how many peer reviewers are going to be willing to dig through raw data in order to evaluate a paper for publication. It’s hard enough to get reviewers signed on these days; in most cases, this may be too much to ask. Also factor in that delays are one of the major flaws in the current system of publication, and this requirement seems likely to further slow the spread of information. If a researcher must carefully annotate and organize his raw data before submitting a paper so that a stranger can readily work with it, and if the reviewers are expected to do a thorough analysis of that data, then one would expect the publication process to be drawn out over multiple years, well beyond the current (already too slow) pace.

As many have pointed out above, there’s great potential in sharing data, but as others have noted, not all data is created equal, and there’s also an enormous cost in both infrastructure and time/effort required. For some areas, the data is obvious in its reusability: easy to organize, annotate, and store, and readily incorporated into new studies. DNA sequence, epidemiological data, and economic data immediately come to mind.

But for other types of data, it’s not that easy. I know imaging labs that are generating terabytes of data per week, taking long time lapse movies of the behaviors of very specific cell types under very specific conditions. Are these movies worth saving for future studies? They take up an enormous amount of space, are difficult to organize and annotate in a way that a future scientist could make heads or tails of, and are designed to answer a very specific question under very specific conditions. How much of a graduate student’s time should be spent collating and organizing this data for storage on the odd chance that someone might find some use for it someday? Is that time better spent doing further experiments?

A good opinion piece on the subject can be found here:

There’s no easy answer, and each field, and really each lab and each experimentalist will need to answer these sorts of questions for themselves. I do strongly agree with Sandy Thatcher above that there’s a role to be played here by librarians. So many institutions have information professionals already on the payroll, people who are trained at the organization, storage and retrieval of information. Why not put their talents to good use?

I have never had any involvement in scientific research although I received a decent scientific education, and to me, rather naively perhaps, it beggars belief that any research can be considered legitimate without the data on which it was based.

New Scientist just ran a headline about the recently released Climate Research Unit data “OK, climate sceptics: here’s the raw data you wanted”. Leaving aside the needs of headline writers to think of something catchy, this is strikingly weird … as if only climate sceptics would want to see the data whereas the rest of us would take it on faith.

I suspect practical considerations of metadata and cost are a red herring.

Sure, it would be nice to get the data all tidy and documented in standard formats. But it had to be processed, and therefore processable, at some point during the research, so just dump it as it is, code and all.

If someone later wants to get more value out of it or validate the results, let them do the documentation and transforms.

Bandwidth and storage are pretty cheap these days, so infrastructure should hardly be a problem, just as long as HMG doesn’t put it to tender with the usual suspects. Of course, there are some datasets which are so huge that special measures would be required, but why legislate for the exceptions? Hard cases make bad law.

If researchers don’t want to share their data straight away then maybe they should have an official head start period, as seems to have happened with NASA’s Kepler project. Quite understandable that the originators did not want the rest of the world beating them to the punch.

If researchers don’t want to share it at all, then so be it. Others can draw their own conclusions.

As I see it there is every reason to do this and no intractable problems. Why wait for anyone? Google might even fund it.

The fact that the way science works beggars your belief suggests that you do not understand the situation. In particular the climate data flap is a political fight, not a reason to restructure all of science.

The point of the comment was specifically to offer an educated outsider view.

People will always have reasons not to change but on the face of it this is a total no-brainer.

It’s only difficult if you think you have to enforce it, legislate, ‘restructure all of science’, cater for every case.

Just start doing it. If it dies, it dies. If it snowballs, you’ve won. Check this out:

Your article has nothing to do with the issue at hand (in fact I think their hypothetical findings are demonstrably false in the real world). As for just starting to do it, a lot of specialized communities are already doing it, as I point out in my second comment.

But they can justify the expense in terms of the immediate contribution that sharing makes to local discovery, not on the basis of some mythical integrity problem, or even on the basis of possible future use.

I am a great proponent of data and research sharing; in fact, that is my field. But the extensive cost and burden of sharing has to be justified, not based on ideology.

Most of this discussion has focused on natural science data, but it is important to remember that data come in a wide range of different types and for the social sciences, for instance, may take the form of interview tapes, field research notes, various kinds of artifacts, survey data, etc., each type of which may require its own special curation methods. The task confronting librarians and authors in preserving all these data is a challenging one indeed.

And I should have added that at least some of these data may be protected by confidentiality agreements as well, further complicating the process of curation.

This is exactly what the Journal of Errology is trying to accomplish. The journal is one of a kind in publishing articles with negative results, stumbles, and raw data obtained in the course of research that otherwise goes unpublished. We are beginning only with research in the life sciences and have left the rest for a later time. We have decided that, unlike published articles, which are peer reviewed, negative data do not need to be reviewed as rigorously; we hold the author responsible for the content. There is a lot of valuable data out there that can be tapped.

How much time do you suppose a researcher should spend preparing articles for a journal that is not peer reviewed, nor is likely to be indexed in PubMed, both generally considered necessary for a paper to count toward career advancement? Wouldn’t there be a stigma associated with publishing in such a journal, a public admission of failure and a lack of productivity?

Any researcher who has spent a little time in a laboratory knows the importance of failure (a harsh word indeed). In fact, the only failure here is the failure to communicate these errors. Our initial survey in the field of stem cell research revealed an alarming number of redundancies: experiments that had already been tried and failed, then done some other way that worked. The problem here is that companies do not want their competitors to know what they have been hopelessly trying to accomplish. But this practice can change if researchers step forward. And yes, it does not contribute to one’s career; think of it instead as an open-source effort, where a researcher is working for the benefit of other researchers and not himself.

I understand where you’re coming from ideologically, the problem though, is one of practicality. I know many, many successful scientists and every single one is desperately strapped for time. You’re now asking them to put in great time and effort toward helping their competitors. Noble to be certain, but also unlikely to be high on the priority list of most scientists. Throw in the idea that being an acknowledged failure may tarnish one’s reputation and participation becomes even more doubtful.

Science, as I note in my comment above, requires some level of redundancy. If you have a successful experiment and publish it, and my lab is going to do some new experiments that expand upon it, my lab will likely repeat your initial experiment to verify that it is correct. If your lab does an experiment and it fails, I can’t know if it was performed correctly and failed or if it didn’t work because you’re incompetent. If I believe it’s a potentially fruitful path, then I have to repeat what you’ve done, even though you’ve failed. If we immediately accept your failure as gospel, and it turns out to be due to your technician’s inability to do math, then the world loses out on the potential results. So even with the publication of failures, redundancy is still going to be required.

And given the “macho” culture of science, very few top labs (and no companies as you note) are going to loudly trumpet their flops. If you’re up for tenure, how many failures do you want to publicly admit to? Are you going to recommend tenure for the lab that can’t get anything to work? Are you going to fund that lab when they have so many problems getting results?

I had the same fears as you have now, that failed results would lead other researchers to shy away from repeating the same experiments, until I realized that the above discussion also applies to successfully published experiments. Many experiments published as successful have later turned out to be unsuccessful. I could give you many examples where successful experiments could not be replicated.

And as far as going up for tenure is concerned, these unsuccessful experiments don’t have to be included in the application; let researchers use their successful results for that.

And as far as time is concerned, researchers don’t have to spend a lot of time editing their raw data (they are doing a huge favor already in sharing their negative or raw data). The editors and reviewers working on the raw data can make it as refined as possible.

I hope I have convinced you at least a little.

I do wish you luck, but given the general lack of uptake of “good for the community” activities that don’t benefit one directly (post-publication peer review as one example), I’m not optimistic. The current state of funding and the job market is having a brutal streamlining effect on the activities of most scientists.

I’d be in favor of requiring authors to make data available to other researchers “on request.” Two reasons: (1) many of the data I’ve gathered are in a format that makes perfect sense to me but that would require additional explanation to make them meaningful to other researchers; (2) there’s relatively little chance that any particular data set will be requested by another researcher. I’m all in favor of public access to data, but we have to recognize the costs involved and consider the potential benefits in light of those costs.

And that’s really a key question here. How much of your time should be spent organizing and annotating your data for someone else’s use, particularly since much of that data is likely of low interest to most other researchers? Isn’t your time better spent doing the next set of experiments?

The “on request” nature of your plan makes it a lot more feasible, and is in line with the policy of many journals already, which require authors to make all reagents, strains, sequence, etc. from their experiments freely available to all who ask. Unfortunately, journals have little authority for enforcing these rules, and they are often ignored by researchers.
