If you pay any attention at all to scholarly publishing, you’re likely aware of the current uproar over PLOS’ recent announcement requiring all article authors to make their data publicly available. This is a bold move, and a forward-looking policy from PLOS. It may, for many reasons, have come too early to be effective, but ultimately, that may not be the point.
Make no mistake, data availability is an important new frontier in scholarly research. Last year’s White House Office of Science and Technology Policy (OSTP) memo on public access to research results had two separate objectives: access to papers resulting from funded research and access to data resulting from funded research.
The OSTP is not just talking about the data used in published research papers. It’s talking about the entire dataset from the funded research. To quote the memo, “data” is defined as:
…the digital recorded factual material commonly accepted in the scientific community as necessary to validate research findings including data sets used to support scholarly publications…
Once this policy goes into effect, PLOS’ requirements would seem to be an afterthought for authors funded in this manner. The problem is that the OSTP policy seems nowhere near being implemented, and an enormous number of questions remain about how it will work and whether it will work at all. For example, no one is exactly sure where all of this data will be stored and how to pay for the efforts and services required. Since patient data has strict privacy requirements, it’s unclear how it will be handled. One also must wonder how any such policy can be monitored and enforced–if I don’t show you my data, how do you know it exists?
Journal publishers come into the picture because under US law (the Bayh-Dole Act), the intellectual property (IP) generated as the result of federal research funds belongs to the researcher and their institution. This creates a loophole for researchers required to comply with the OSTP memo. US funding agencies can request that researchers make their data publicly available, but it is unclear if they can require researchers to do so without violating IP law. The OSTP memo specifically requires that any resulting procedures “recognize proprietary interests, business confidential information, and intellectual property rights.” Similarly, the intellectual property policies of groups like RCUK and Wellcome leave everything in the hands of the researchers and their institutions.
Journals, however, are not under these same IP restrictions. Consider data repositories and databases that are already great successes–GenBank for example. GenBank is the NCBI’s genetic sequence database, an annotated collection of all publicly available DNA sequences. It has been an enormous success, not because funding agencies require deposit of DNA sequence data, but because it is the practice of the community which is enforced by journals requiring deposit for publication. This success is something for which journal publishers do not receive appropriate credit (Kent, perhaps this is worth adding to your list of services journals provide).
Take a look at the instructions for authors for any genetics/genomics/bioinformatics journal, or these days, even general biology journals. Most, if not all, contain language like this, from the Nature Journals:
For the following types of data set, submission to a community-endorsed, public repository is mandatory. Accession numbers must be provided in the paper. Examples of appropriate public repositories are listed below.
Another example, from Genome Research:
Genome Research requires that data from a publication be easily available to the broader community in publicly held databases when available…
If funding agencies hope to get traction for data release policies, then these types of requirements represent the future. PLOS has roots in the genomics and computational biology communities. The practices and attitudes of these fields are clearly ingrained in much of what PLOS does. These bioinformatic influences may have made it seem straightforward to extend the policies for sequence data across all of science, but as we’ve seen when the biomedical world has tried to impose its vision of open access on humanities researchers, one size does not fit all. Assuming that the way your field works is universal usually leads to flawed approaches and unhappy researchers.
Not all research is the same, and not all research data are created equal. There are clear cases where data are easily archived and reused, and that reuse has been successful in driving new experiments. But there are also data that are not quite so easy to handle, and that were generated to ask a very specific question under very specific circumstances. Those data may not be so re-usable.
Similarly, some types of data easily lend themselves to standard practices and file forms, others not so much. Without standards, every piece of data made available will be different and the resulting chaos may make generating new data an easier process than sorting out archived data. Many other issues have been raised across the blogosphere, including whether this puts researchers in low-to-middle income nations at risk, what peer reviewers are supposed to do with raw data (when most don’t even look at supplemental material) and whether it’s fair to ask researchers to give up data that they intend to continue to exploit for further experiments, risking being “scooped” by others.
Perhaps the biggest practical problem with PLOS’ policy is that it puts an additional time and effort burden on already time-short, over-burdened researchers. I think I say this in nearly every post I write for the Scholarly Kitchen, but will repeat it again here: Time is a researcher’s most precious commodity. Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.
When depositing NIH-funded papers in PubMed Central was voluntary, only 3.8% of eligible papers were deposited, not because people didn’t want to improve access to their results, but because it wasn’t required and took time and effort away from experiments. Even now, with PubMed Central deposit mandatory, only 20% of what’s deposited comes from authors. The majority of papers come from journals depositing on behalf of authors (something else for which no one seems to give publishers any credit, Kent, one more for your list). Without publishers automating the process on the author’s behalf, compliance would likely be vastly lower. Lightening the burden of the researcher in this manner has become a competitive advantage for the journals that offer this service.
But with PLOS’ new policy, they’re doing just the opposite and putting their own journals at a disadvantage. If publishing in a PLOS journal requires you to do weeks of additional work to organize your data into a reusable (or at least recognizable) form, adds the potential expense of hosting and serving that data, or requires time and effort to find a suitable repository and upload the data to it, then why not publish the same paper in a different journal and eliminate those costs and time sinks?
Because data requirements are not uniform across all journals, PLOS has put itself at a disadvantage as far as attracting authors because other journals offer an easier path. If strictly enforced, this new policy is likely to result in a drop in submissions to PLOS journals. While no other mega-journal has been able to shake PLOS ONE’s hold on the market, this policy may provide an opening for competitors to gain on PLOS ONE and even overtake it.
So why take that risk? Why create this policy now? Only those at PLOS know for sure, but from the outside, this can’t be seen as anything other than a not-for-profit publisher putting mission above business concerns. PLOS has never been a risk averse organization, and this policy would seem to fit well with their ethos of championing access and openness as keys to scientific progress. Even if one suspects this policy is premature and too blunt an instrument, one still has to respect PLOS for remaining true to their stated goals.
The policy seems deliberately provocative, a strategy that has worked well in the past for driving change in scholarly publishing. Remember that a key moment in the modern open access movement was a controversial boycott threat. While that threat never materialized, it did start the ball rolling and led to things like the founding of PLOS. This policy may be meant as a similar opening salvo, not necessarily as the final step in the process but one to serve notice that change is on the horizon, to drive the conversation and eventually, progress.
As noted above, the culture of computational biology and bioinformatics remains a strong influence at PLOS, and most scientific communities seem to picture their own practices as the norm. It’s also worth remembering that PLOS is based in San Francisco, close to the heart of a culture that is building entire industries around gathering and analyzing data, preferably data created by others and made freely available for reuse and economic gain. Both of these strong cultural influences may be in play here helping to drive the policy.
Time will tell if PLOS has acted prematurely with this policy or if they’re ahead of the curve (as has often been the case). This is a particularly bold risk for PLOS — in the past, their experiments have offered some new benefit to authors, broadening access to their papers or streamlining the peer review process for more rapid publication. Here they’re putting a burden on authors for the potential benefit of others. It’s unclear whether researchers will respond in the same way they have in the past when they’re the ones being asked to make sacrifices.
Regardless, PLOS’ willingness to take such bold risks and court controversy continues to make them a tremendously valuable part of the scholarly publishing landscape and points to the crucial role played by university presses and not-for-profit publishers who can put mission ahead of margin. Even if this policy falls short, it is certainly bringing a lot of attention and thought to the questions that need to be answered if data availability is to happen. This particular policy itself may be a failure, but it is likely to open the door to better policies in the future.
44 Thoughts on "PLOS' Bold Data Policy"
David, this is an interesting issue. Just some random related points.
Nature discussed it at length in a special issue in Sept. 2009. At least some of the articles are OA.
I think BMC has struggled with this issue as well. Iain Hrynaszkiewicz (don’t know if he is still with BMC) wrote some articles and seemed to be their point person looking at open data.
BTW, I believe NHLBI has had an open data policy for years but takes special care to ensure patients remain anonymous.
Latanya Sweeney wrote a fascinating article showing how she could merge de-identified health payment records, which the Massachusetts government had made available to researchers, with publicly available voter registration data, using birth date, gender and county of residence, and recover the health history of William Weld, then governor. The point is you have to be very careful about making sensitive data like health records available even in a de-identified form and NEVER include birth date.
L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (5), 2002; 557-570.
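To make the linkage risk concrete, here is a minimal Python sketch of the general idea, not Sweeney's actual method or data. Every record, name, and field below is invented for illustration, and the choice of birth date, gender and county as the linking fields simply follows the description above. It counts how many records share each combination of those fields (the smallest such group is the "k" in k-anonymity) and then links any combination that is unique in both datasets.

```python
from collections import Counter

# Hypothetical toy data: all names, dates, and diagnoses are invented.
QUASI_IDENTIFIERS = ("birth_date", "gender", "county")

health_records = [
    {"birth_date": "1950-01-15", "gender": "M", "county": "Alpha", "diagnosis": "condition A"},
    {"birth_date": "1962-03-02", "gender": "F", "county": "Alpha", "diagnosis": "condition B"},
    {"birth_date": "1962-03-02", "gender": "F", "county": "Alpha", "diagnosis": "condition C"},
]

voter_roll = [
    {"name": "Voter One", "birth_date": "1950-01-15", "gender": "M", "county": "Alpha"},
    {"name": "Voter Two", "birth_date": "1962-03-02", "gender": "F", "county": "Alpha"},
    {"name": "Voter Three", "birth_date": "1962-03-02", "gender": "F", "county": "Alpha"},
]


def quasi_key(record, fields=QUASI_IDENTIFIERS):
    """Build the tuple of quasi-identifier values used to match records."""
    return tuple(record[field] for field in fields)


def k_anonymity(records, fields=QUASI_IDENTIFIERS):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    counts = Counter(quasi_key(r, fields) for r in records)
    return min(counts.values())


def link_unique(health, voters, fields=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs whose key is unique in both datasets."""
    health_counts = Counter(quasi_key(r, fields) for r in health)
    voter_counts = Counter(quasi_key(v, fields) for v in voters)
    names = {quasi_key(v, fields): v["name"] for v in voters}
    return [
        (names[quasi_key(r, fields)], r["diagnosis"])
        for r in health
        if health_counts[quasi_key(r, fields)] == 1
        and voter_counts.get(quasi_key(r, fields), 0) == 1
    ]


print("k =", k_anonymity(health_records))
print("re-identified:", link_unique(health_records, voter_roll))
```

In this toy example the one person whose combination of fields is unique can be named, while the two who share a combination cannot be told apart. Coarsening or dropping fields raises k, at the cost of making the data less useful.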
I believe PeerJ also has an open data policy.
Figshare offers a good way to archive such data and currently is free. I think I read somewhere one of the major publishers cut a deal for Figshare to archive all their authors’ data.
Thanks, David, for this astute and generous analysis. I concur with almost everything you say here. My only quibble is with this part:
Perhaps the biggest practical problem with PLOS’ policy is that it puts an additional time and effort burden on already time-short, over-burdened researchers. […] Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.
This is true as far as it goes; but data curation is research. I’d argue that a researcher who doesn’t make available the data necessary to reproduce his conclusions isn’t getting his job done. Complaining about having to spend time on preparing the data for others to use is like complaining about having to spend time writing the paper, or indeed running experiments.
I realise that not everyone will agree with this, just as in the old days of the Royal Society not everyone would have agreed that time spent writing up results for publication was time well spent. But I think the world is increasingly moving in that direction, and I applaud PLOS’s move to hasten that shift.
Thanks, Mike, I agree. Data sharing is a good thing, but as Dave C. points out, it creates a whole range of challenges that vary considerably among disciplines: human subject issues in biomedicine, perhaps size or data complexity in others. The Nature series did a good job of highlighting some of them.
Another is adequate and appropriate documentation. Without it data are not very useful and it can actually be harmful if misinterpreted. But what are the standards for documentation? I haven’t seen any. I think these are needed and would vary across disciplines.
What about credit? If I spend a year developing a very useful data set, write an article and publish the data and someone else publishes an article off it, what credit do I get? Am I an author? Do I get a footnote? If I am in the US how does the promotion and tenure committee look at it? If you look at the NIH criteria for authorship, I think you could almost argue I should be an author but does that make sense?
I am just bringing up some of the issues this raises.
What if someone reuses your data and comes to a conclusion with which you disagree? How does that affect authorship decisions?
Exactly!! I think that is one of the reasons authors hesitate to publish their data. Do you want someone else looking over your shoulder? At the same time this is a very valuable check on errors and a way to avoid fraud.
“What if someone reuses your data and comes to a conclusion with which you disagree?”
We call that “science”.
I didn’t ask what you called it. I asked how that would affect the debate over authorship, and how a researcher might react to having one’s name and data associated with a conclusion one thinks is specious. Let’s say someone used your data and name in a paper denying climate change or evolution. If the default custom is to provide authorship or as you suggest, some sort of “data provision”, do we also need a means of opting out?
Well, since I don’t advocate courtesy authorships for the person whose data is re-used, that’s not a problem for me. They should cite the data source (naturally) but that hardly constitutes an assertion of agreement over its interpretation. The same would apply for an explicit data-provision credit.
And, really, if someone analyses my data and comes to a different conclusion from me, then either (A) I made a mistake, and I’m grateful to have had it pointed out, or (B) that other person made a mistake which I or someone else can point out. This stuff doesn’t bother me. No, I’ll go further than that: it’s not a bug in academic procedure, it’s a feature.
Yes, I think that is good. My point is many researchers would prefer not to have that level of review.
It just makes it hard to implement this policy and opens up a whole range of issues.
I am glad PLoS, PeerJ and other journals are starting to grapple with open data. It is just going to take time and effort to sort everything out. It raises a lot of issues.
Also, many researchers are so busy with their research and set in their ways that I think it will be really hard to get the ball rolling.
Here come the data trolls.
Yes, there are all sorts of edge-cases and complexities which will need to be dealt with. As others have said, different communities will likely arrive at different norms, and that may be OK. At this stage, the important point is to shift the default to open.
As for credit: citation seems appropriate to me. Authorship on a paper is for doing work on that paper. If you write a new paper using data that I published, then I’ve done no new work on your paper and I shouldn’t get an authorship. But I’m not familiar with the NIH criteria.
Down the line, perhaps bean-counters will recognise data provision as a new kind of contribution, more significant than a citation.
Agree with David’s point re courage of PLOS to explore the boundaries. One previous interesting example was the “If a paper’s major conclusions are shown to be wrong we will retract the paper” policy of 2012, which was quickly dropped following an outcry. One wonders if their resolve will be stronger this time.
I didn’t know that (obviously wrong-headed) policy had been dropped. That’s good news. Do you have a link?
You posted in the blog comments.
[edited version of a reply to http://rxnm.wordpress.com/2014/03/03/plos-clarification-confuses-me-more/#comment-937 and http://drugmonkey.wordpress.com/2014/02/25/plos-is-letting-the-inmates-run-the-asylum-and-this-will-kill-them/]
If PLoS’ form of words fails it is because they are addressing a big chunk of science in one go. That makes it easy to pick holes. A more charitable unpacking might go as follows:
1. Sharing is the new black, but PLoS almost never specifies what should be reported; it characterises it generically as the pyramid of data and work below the paper (up to some sensible cut-off), but what that means in each community is going to be different and can only be agreed through peer consensus. This means another round of ‘MI’ projects, but carried out piecemeal through referee conversations like case law.
2. What PLoS does dine out on is _where_ to put stuff. This is brilliant and does a great job of promoting Dryad and FigShare — Godsends both — the important point being that if you want to share you now can, whatever your domain. Size might be an issue at the upper extreme, but you can’t argue with this stuff. Basically, if you want to, or need to, then you can. Ace.
Everyone seems to think that those writing policy are Stalinist-Sadist lizards on a mission. In fact those ‘wielding’ policy are almost never willing to enforce it at present, largely because all those policies in some sense refer to community consensuses that have never been established. All it needs is the word ‘enough’. Enough according to whom? According to those that know. Which is the peer group of the author. But because everyone kicks off without any attempt to really unpack things and understand each other (blech, yeah, whatever) all this goes by the wayside.
You quite rightly speak of the differences and inherent difficulties with clinical trial data and patient data.
The Institute of Medicine of the United States has issued a draft report on data sharing of clinical trial data and is soliciting comments until March 24th.
The report suggests a variety of possible models of data sharing (including WHICH data could/should be shared). I’d urge people to send in comments. The IOM will issue a final report at the end of the year.
I’d just like to point out that F1000Research has had a mandatory open data policy since launch (with the obvious exception for areas where there is a genuine privacy issue such as patient data). We have found surprisingly little push-back from most authors, and many others who initially had concerns changed their minds and submitted their data once they realised they get priority on that data by publishing it. Some of our authors have used it (successfully) as a way to find collaborators. All our data is citable (usually using DOIs), and for data that has no subject-specific repository, we were the first publisher to team up with figshare to host that data. They provide embed-widgets so that the data is previewable within the article itself. The widgets also include stats on views, shares, downloads etc. There is a huge amount of work being done by the Research Data Alliance (RDA), World Data System (WDS), FORCE11, CODATA and other major international efforts that we are involved in, to tackle the many issues around data such as incentives for data sharing, data citation, data metrics, data archiving, and linking between journal articles and datasets. It is great to see other publishers as significant as PLOS now moving in this direction too.
One datum; two or more data. Please, can we use plural verbs when data are the subject? The same request applies to agendum and agenda; addendum and addenda; maximum and maxima; minimum and minima; extremum and extrema.
Thanks Etan, I kept struggling with this as I wrote the piece and catching instances where I needed to correct things, but probably didn’t catch them all. Any errors above are purely my own, and if this requires that I return my membership card in the Grammar Pedants Society, I will regretfully do so.
Data is a mass noun. Like “water” or “family”.
I also believe in splitting infinitives and ending sentences in prepositions, since we’re writing in English and not Latin.
For what it’s worth, a lot of people have abandoned the archaic origin of “data” as the plural of “datum” — for example, The Guardian‘s style-guide and professional editor Anna Sharman. I find it pretty hard to get excited about the idea that “data is” is wrong.
As with any rule system, and as David C. notes, the central issues are cost and burden versus benefits. The most striking feature of the PLoS regulatory design may be what we can call the “repository requirement.” That is, simply requiring that the researchers share their data on request is explicitly ruled out as a compliance option, even though it is standard practice. This is great news for the repository industry (which publishers might well venture into) but one wonders about the rationale?
The problem with the repository requirement is that if most data is never wanted, which seems likely, then the cost and burden may be much higher than it need be. All that unwanted data has to be prepared, submitted, curated, etc. Mind you it is much easier to enforce compliance under a repository requirement, just check the repository, and one wonders if this is the rationale? Perhaps researchers are simply not trusted to meet a sharing requirement on their own.
In any case once the required Data Availability Statements start appearing we will have interesting new data to analyze, data on sharing data.
“Simply requiring that the researchers share their data on request is explicitly ruled out as a compliance option […] Perhaps researchers are simply not trusted to meet a sharing requirement on their own.”
Of course they’re not. With the best will in the world, researchers move on. They change jobs. They drop out of the field. They suffer disk-crashes and lose data that they’ve not backed up. They lose interest. They retire. They die.
Leaving the provision of data to individuals is a sure-fire recipe for failure, even if we charitably assume that all the individuals intend to do their best.
So some supposed high value in perpetual access justifies the cost and burden of requiring repository deposit? That assumes a very long half life of data value, which I doubt, so high that it justifies saving vast amounts of useless data. I did a lot of staff work for the US Inter-agency Working Group on Digital Data (IWGDD) and their conclusion was that only the most valuable data should be institutionalized. I agree with their assessment.
Perhaps if PLoS had to pay the repository charges their policy would be different. There is a common problem with regulators that if something is free (to them) and beneficial they naturally want all they can get. That seems to be what is going on here.
Mike, this also raises the issue of how long data should be retained. The PLoS repository requirement seems to be perpetual, which is absurd. Here is what the US IWGDD concluded: “not all digital scientific data need to be preserved and not all preserved data need to be preserved indefinitely.” (Executive Summary, page 1) I agree.
In that regard I notice that PLoS’s default repository — http://datadryad.org/ — seems to offer perpetual custody for a flat fee. That is not economically possible.
The same can be said for published papers. A fairly substantial fraction of published papers are never cited (or cited very infrequently). Are these “unwanted papers”? I do not think so. The data (like the published papers) are part of the scientific ecosystem. It is hard to predict what will be useful in the short or long term. I am pretty sure few who made museum collections over the past few centuries had any idea how valuable they would be to research on the effects of anthropogenically induced changes (climate, habitat loss etc.).
David, sometimes when I read SK entries I feel like I am reading comments from people living in a bubble. I don’t mean to single you out, this is actually a very interesting and informative article, but you write as if the US is the only country on the planet. You talk about NIH-funded research and NIH policies as if the NIH were central to global medical research. A quick look at the human clinical trials indexed on PubMed in 2012 shows that the US accounted for only 26.7% of all indexed human clinical trials in that year and that percentage has been slowly declining for nearly two decades. This number also includes both government and privately funded research. NIH-funded research would therefore account for a very small percentage (10% – 12%?) of total indexed global medical research in any given year.
East Asia (Japan, South Korea, Mainland China, Taiwan and Hong Kong) accounted for 14% of all human clinical trials indexed on PubMed in 2012 (that is up from 8% in 2000). I have had many discussions with a number of researchers in East Asia over the years. This policy presents serious obstacles for East Asians. For one thing, all their data would not be in English and therefore would require huge expense in time and money for translation. Asian researchers would also have great concern about sharing their data. In my opinion, the research community is very competitive in East Asia and I believe that Asian researchers would indeed react very negatively to someone else using their data to write a new paper. I am not sure that researchers would even have the legal right to share this information in some countries. In the case of China, the data would be considered the property of the state institution (and ultimately the State); would the researcher have the legal authority to provide the material, even if they wished to do so? I suspect the answer to that question would be no. I would love for a legal expert to comment on that point.
What of the Helsinki Declaration? The Helsinki Declaration commits researchers to protecting the interests of the patients. As David Solomon pointed out, how do you ensure the anonymity of the patient when the datasets come from a variety of sources with a variety of standards? What standard do you choose? How do you enforce compliance? Does PLoS even have the capability of looking at a Chinese dataset and determining that the patient’s anonymity is preserved? What if it is possible, using the PLoS dataset and another publicly available dataset, to determine patient identity (as in the Massachusetts case listed above)? How would PLoS know? Are there experts at PLoS who have a thorough understanding of Asian names, places and address formats to determine that it would be impossible to determine the identity of the patient in Asian datasets? If they require the public publication of the data, wouldn’t they then be as obligated as the researcher, under the Helsinki Declaration, to protect the anonymity of the patient?
The reason PubMed does not include all the papers is that they made it particularly difficult to submit the paper. You are not allowed to just take the final PDF and put it in. It has to be the “final accepted version” from before proofreading. If you want your papers online to match, then you have to do a significant amount of extra work. The problem with the PLoS manifesto is that they make no judgement of how to make this “data uploading” easy. There seems to be no appreciation of how different labs store and process data, nor any appreciation that the reason other groups have not created central databases is not because they don’t want to, but because it’s hard and expensive, and processing that data often takes a lot of human effort that is not easily written to disk.
> What of the Helsinki Declaration
What of the Bermuda Principles? Chinese centres contributed to HUGO. Chinese researchers will increase their citations by sharing. Negotiation with a central authority could actually be simpler.
This whole discussion is founded on assumptions that cannot be made. People are (I’m sure with the purest of hearts) imagining some Stalinist regime handing out insane dictats. To interpret things in that way is unhelpful and polarises discussion of something that is simply not a binary issue.
Bottom line: communities set their own best practice, for reporting as for everything else. If you can show PLoS or whoever that your community (however defined) regards the sharing of any data beyond that in the paper as expensive and/or worthless then you will _not_ be asked to do it.
And on language, why does it have to be translated there and then? Are there no reviewers that could be asked to review in the native language (answer carefully)? Could it not be translated at literally any point in the future (and wrt info loss staying native is better anyway)? Would not the tagging of the data on submission suffice (and in fact enable simple translations into many more languages through internationalisation)? I tell you most database types would probably agree that any metadata is better than none 🙂
I just wish people would stop imagining the worst then running with it. It’s all about the detail, which is unavoidably absent (where do I find the consensus view of a community?). Even so PLoS has done a good thing (as, to be fair, have others before them, though less controversially it seems). Being harsh, rather than working to fill the gaps, is unfair.
Ichthyostega wasn’t up to much either, but what an advance.
any _valid_ metadata is better than none…
Only in those (possibly rare) cases where the data is valuable. Where it is not the cost and burden of creating the metadata is a dead loss.
So the people you represent (by definition) share your view that certain kinds of data aren’t worth much.
Your peers are your judges. Editors are no more than facilitators. Because the PLoS policy is written as it is, there is sufficient flexibility to ensure that you will neither be asked to share worthless data, nor to annotate it. An editor that deems otherwise would be swiftly and roundly defeated by your pointing to the evidence that your peers back you up. These things will likely only need to be settled once.
I am an independent analyst so I do not represent anyone. Nor did I say the data is not worth much, rather that it is not going to be used. Look at it this way. I collect and analyze a bunch of data to get my results. Others can build on these results and that is how science progresses. But they do not need my data precisely because I have already done the work of analyzing it.
This is a perfectly reasonable attitude …
… so long as you believe that you have done all there is to be done with the data, that there are no other analyses to be done with it, that your analysis is so certainly correct that no-one would even need to check the working, and that it will never be possible to combine your data with other sets.
In other words, it’s not even remotely reasonable. I could think of a few other adjectives that would describe it better, but since this is a public blog I will exercise discretion on this occasion.
You have missed the point Mike. You are citing the two standard arguments for data archiving, namely reuse and verification. My point is that these are both highly exceptional cases so it must be demonstrated that one or both will occur with sufficient frequency and value that they justify the cost and burden of a universal archiving mandate. Doing staff work for the IWGDD I studied these issues for almost two years and concluded that no such justification exists. The IWGDD came to a similar conclusion.
I don’t know anything about Stalinism but I know a good bit about regulations and how to assess them. The PLoS mandate is a burdensome regulation that appears to be ill conceived. How to analyze regulations like this is well known, but PLoS has chosen to act first rather than going through the usual proposal stages.
For what should we wait to act? As we’ve seen here and elsewhere people excel at finding problems with this sort of thing; reasons to wait. Well I’ve seen far too many such initiatives die on a trolley and so I’m glad PLoS has forced people to take this seriously. How long have funder mandates exactly like this (except they often DO call for ‘all’ [incl. crap] data) been in place yet roundly ignored? In most cases three to five years. No effect (except amongst the eager). Why? Because no one gets their arse kicked when they moan and dissemble. If you’ve nothing worth sharing (1) say why, (2) get some mates to agree, (3) publish that somewhere visible (just in case others differ). Job done.