Editor’s Note: Today’s post is by Rebecca Grant, Iain Hrynaszkiewicz and Amy Bourke-Waite, and is based on a preprint of trial results and the subsequent peer-reviewed article published in the International Journal of Digital Curation. Rebecca is Research Data Manager at Springer Nature; Iain is Head of Data Publishing at Springer Nature; Amy is Director of Communications, Open Research at Springer Nature.

Data sharing is like maths at school.*

Bear with us.

It might seem harder than the other subjects. You might feel your teachers are not very good at explaining it. But if you do not pay attention, you will very quickly find that many real-world skills rely on maths; and you would have benefited from learning the basics as it provides a solid foundation for the rest of your adult life (whether your ambitions are to become an astronaut, a Grandmaster of chess, or simply to balance your personal expenses).

Likewise, data sharing and data management form the foundation of global academic collaboration, discovery and scientific advancement. Sadly, surveys show that academics rarely get formal training in good data management (let alone best practice), and data management is rarely incentivized by institutions. All too often even the basics are ignored, with data ending up languishing on a USB stick or on a paper notepad.

If we want research to be discovered, shared, and reused, then the same must be said of the underlying data. 


This is increasingly relevant as there is growing attention on mandatory Data Availability Statements (DASs) from publishers, institutions and funding agencies; this is particularly true in the UK where DASs are a requirement of UK Research and Innovation’s (UKRI) Common Principles on Data Policy. While a DAS does not always equal data sharing, these statements are a means to determine if and how research data are available – which can also assist funding agencies and research communities in assessing compliance with their policies. Increased prevalence of DASs will also enable further research, using machine-driven approaches (such as with natural language processing, text and data mining), across multiple journals and publishers, to analyze the types of DASs provided – and types of data sharing practiced – by researchers in different disciplines and journals.

Researchers report that they do not have enough time to share data, and additional checks on submitted papers that push for data sharing will also be costly for editors and publishers.

Springer Nature is nonetheless keen to encourage data sharing. More than two years ago we announced that all original research papers accepted for publication in Nature and the other Nature titles would be required to include information on whether and how others could access the underlying data by including a DAS. While there is some evidence of the benefits of data sharing in research, we also wanted to understand the costs of introducing DASs. We therefore examined the impact that introducing DASs had on authors and editors, and how the availability of datasets was reported.

Introducing Data Availability Statements across the Nature journals

Our specific aims were to 1) assess the ways by which researchers chose to make their data available, and 2) measure the additional time required by editors and production staff to ensure a data availability statement is a) included in the manuscript, b) accurate, and c) correctly copy-edited. (All of the staff involved in the study were in-house, giving a reasonable basis on which to indicate potential cost to the publisher.)

Nature Data Availability Statements require authors to provide information on where the data supporting the results reported in their article can be found, and if and how they can be obtained. An example might be:

  • The datasets generated during and/or analyzed during the current study are available in the [NAME] repository, [PERSISTENT WEB LINK TO DATASETS].

As a pilot, editors of five participating Nature journals were asked to self-report the number of additional minutes it took to ensure an appropriate DAS was provided for each manuscript they processed, for 2 months. Copyeditors and production staff were also asked to provide an estimated average additional manuscript processing time for all papers they handled in this initial period. We also invited comments on the process of incorporating the DAS into the journal workflow.

To ensure all published papers included a DAS after the policy was implemented, a DAS was requested by the journal Editors or an Editorial Assistant for all papers at the “accept in principle” stage. The request was included in the decision letter. Editors were required to update their correspondence templates and checklists, including a link to new author guidance. Production needed to update copyediting and style guides to familiarize themselves with the new section as well as having additional content – the DAS – to check and process in each manuscript. Now that the policy is established, the requirement to provide a DAS has moved to earlier stages of the peer-review process.

Once the papers were accepted for publication, the text of each DAS was read and categorized into one of four different types:

  • Type 1 states that the data are available from the author on request.
  • Type 2 states that the data are included in the manuscript or its supplementary material.
  • Type 3 states that some or all of the data are publicly available, for example in a repository.
  • Type 4 states that figure source data are included with the manuscript (this is a method of data sharing used by some authors in a subset of Nature journals that publish life sciences research.)
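As a rough illustration of how these four categories might be assigned by machine rather than by hand, here is a minimal Python sketch. The keyword rules and the function name `classify_das` are assumptions made purely for illustration; in the study itself, each statement was read and categorized by a person. Type 3 is checked first because a statement counts as Type 3 when some or all of the data are publicly available.

```python
import re

# Hypothetical keyword rules, checked in order. These patterns are
# illustrative assumptions, not the coding scheme used in the study.
RULES = [
    (3, re.compile(r"repository|doi\.org|accession|publicly available", re.I)),
    (4, re.compile(r"source data", re.I)),
    (2, re.compile(r"supplementary|within the (paper|article|manuscript)", re.I)),
    (1, re.compile(r"(on|upon) (reasonable )?request", re.I)),
]


def classify_das(text):
    """Return the first matching DAS type (1-4), or None if no rule matches."""
    for das_type, pattern in RULES:
        if pattern.search(text):
            return das_type
    return None


if __name__ == "__main__":
    example = (
        "The datasets generated during and/or analyzed during the current "
        "study are available in the [NAME] repository, "
        "[PERSISTENT WEB LINK TO DATASETS]."
    )
    print(classify_das(example))
```

A rule-based pass like this would only be a first filter: statements that match no rule return `None` and would still need a human reader, as would any statement whose wording does not fit the assumed keywords.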

The second phase of this project gathered data using the same process from an additional 20 journals. These were from the biological and physical sciences, which introduced the same policy, and provided the same information, as the previous journals. Data were gathered by each journal for two months after implementation of the policy, then analyzed.

In total, once the first and second phases of the project were added together, we analyzed 557 manuscripts. The journals which contributed to the project all fall under a Type 3 data policy which requires the inclusion of a DAS, meaning that every manuscript submitted to these journals was subject to the checks, self-reporting and coding.

Reporting data sharing takes time

We found that adding mandatory DASs to all accepted articles in journals operated by professional editors did increase manuscript processing time. For the first phase of the pilot, the addition of the DAS had an impact of approximately 15-20 minutes editorial and production (copyediting) time per accepted paper across all five journals.

  • Once the authors had responded to our initial request for a DAS, it took ten minutes extra editorial time on average, or a median time of eight minutes per paper, to add the DAS to the manuscript.
  • Five minutes extra copyediting time was required to ensure that the DAS matched journal style guidelines.
  • For Nature Communications, which used a slightly different methodology, 90% of editors reported 15 minutes or less to ensure the DAS was present for most manuscripts.
  • The Type 1 statement, where data are available on request, took least time (5.9 minutes on average) to add to a paper, likely because these are a single formulaic sentence and there are no links to check.
  • The Type 3 statement, where some or all data are publicly available, took the longest (18.2 minutes) as the editor needed to undertake additional checks.

The second, larger, group of journals that introduced mandatory DASs reported that fewer additional minutes were needed to incorporate a DAS into a manuscript. Possible reasons for this include greater editor and author awareness of the policy and supporting documents; improved internal communication and editor training after the first phase of the pilot; and/or greater attention being needed on the pilot journals, which informed, and made more rapid, editor training on handling future DAS for manuscripts in their discipline. (Further analysis, for example on discipline differences, is available in the original article, and the dataset supporting our analyses is on figshare.)

Investing in data sharing for the future of research

Submission-to-publication-time is an important metric, and anything that slows down publication could be seen as a negative for authors and readers – and the publisher. However, given the importance of data sharing and the value added by DASs, we believe the extra editorial time is well-invested, even (or especially) in the more complex case where the data are already publicly available. We also anticipate efficiency of incorporating DASs will improve as they become a more common editorial requirement. As editors and authors are more familiar with including them, and publishers continue to improve their guidance and procedures on providing them, we should benefit from increased experience and economies of scale.

We have already used information from this pilot to inform the implementation of data policies by other Springer Nature journals. For example, we have developed in-house administrative support for academic editors, so that journals without professional editors can also introduce DASs consistently. Simple, practical information, such as the additional time needed to process manuscripts, is valuable for editors and support staff in understanding the impacts of editorial policy changes.

In the two years since we started this work, the landscape has changed, and no doubt it will continue to evolve. Since Springer Nature began introducing standardized data policies, similar initiatives have been introduced by other large publishers such as Elsevier, Wiley, Taylor & Francis, Hindawi, and BMJ, and the standardization of research data policies across the industry is underway. As well as understanding the benefits of increasing accessibility to research data to advance discovery, it will be increasingly important to understand costs – particularly for publishers, funding agencies and policy makers.

We strongly recommend that other journal publishers looking to introduce DASs prepare by ensuring that necessary support and training is available for researchers, editors and production staff, building in extra time, or tools, and enabling them to share and cite data wherever possible. We encourage other publishers to be similarly data-driven and transparent in how they implement research data policies, and collaborate in our industry via groups such as the Data policy standardization and implementation Interest Group of the Research Data Alliance (RDA). We also welcome further research in this area, particularly on associations, if any, between the provision of particular types of data availability statement and research visibility and impact as studies have tended to be limited to specific disciplines and journals.

* ‘Math’ for American readers.

Discussion

7 Thoughts on "Guest Post: Encouraging Data Sharing: A Small Investment for Large Potential Gain"

I choked on my coffee reading this post.  Sure, data availability statements are a simple matter to write, taking only minutes of author and publishing staff time. However, the curating, metadata preparation, checking, reviewing, and publishing of datasets can take weeks of dedicated effort. Then there’s the dance trying to unveil data enough for peer reviewers, but not to lock it, then getting the repository links to work in the article, and then getting the article URL linking into the data repository.  It is non-trivial and clunky in my recent papers. Yes, I would argue data publishing is a good thing, both for science in general and for the publishing author. But it’s disingenuous or at least naive to imply that it’s just a few minutes of compliance statement typing. Unless, of course, the author just makes some feeble, throwaway “Type 1” data availability lip-service statement “to contact the author regarding data availability”, which is passable at most journals.

Have any journals in the Springer Nature stable ever enforced a Type 1 “contact the author” statement?  Or even just said they would?  I have yet to see a statement in the author guidelines along these lines:  “While not best practice, authors are allowed to publish with a Type 1 ‘contact the author for data’ statement. However, if, after the article is published, the authors are unable or unwilling to provide the promised supporting data, then depending on the scope of the unavailable data they should expect an adverse editorial note to be added to their article, or retraction.”

With exceptions (e.g. here), publishers’ data availability statement policies mostly seem like empty exhortations.

The Journal of the Medical Library Association has developed a Type 3 data sharing policy that goes into effect on October 1 of this year. I was a member of the working group that drafted the policy and we found the excellent work of the RDA group mentioned to be extremely helpful (Iain H. is one of the co-chairs of that group). As a small society journal with very limited resources, we did not take this on lightly, recognizing that it imposes an additional burden on the volunteers who keep the JMLA running as well as on the authors (who are no better than researchers in any other discipline at maintaining their data in easily shareable formats). It’ll be interesting to see how closely our experience tracks the results reported here. In any case, we strongly agree that the potential benefits are substantial and well worth the burdens. We hope the JMLA example will encourage other LIS journals to implement similarly robust policies.
JMLA Data Sharing Policy: http://jmla.mlanet.org/ojs/jmla/about/editorialPolicies#custom-0
JMLA editorial on developing the policy: http://jmla.mlanet.org/ojs/jmla/article/view/431/623

This was an interesting read; it’s great that Springer Nature have implemented DASs across the Nature journals in spite of the costs associated. For those who may not know, since March 2014 at PLOS we have included DASs and implemented a data access policy (the equivalent to a Springer Nature “Type 1” policy) for all articles published in all seven of our journals [1]. In short, we do expect relevant data needed to replicate a study to be available via a stated mechanism at the time of publication (with some understood exceptions). Although tweaked a little bit since then, this policy has been implemented and we’ve largely been very pleased with the community support and uptake:

– Since March 2014, we’ve published ~110,000 papers with Data Availability Statements across the seven PLOS journals.
– We’ve seen a year on year increase in deposition in appropriate repositories (from 17% in 2014 to 25% of articles published in 2018).
– There were 26 million views and downloads of PLOS supplementary data files in Figshare in 2018. In fact, as all of our Supporting Information files, including some data files, are automatically deposited to Figshare, the percent of manuscripts with data in a repository is much higher than 25%.

In May 2017, my colleague Meg Byrne wrote a great blog providing reflections on data sharing at PLOS ONE [2], which is well worth a read. And we’ve had authors who’ve been very grateful at catching errors in their data before the paper is published simply because they’ve gone through one extra check step in making the data available at the time of publication.

It’s true that overseeing the policy does not come without cost to the publisher, but this is something that we’ve always seen the value of at PLOS and consider it to be a very important aspect of the value add services provided by the publisher. We don’t get 100% compliance, and overseeing this policy at such high volume is also not without challenges, but we’ve seen a huge increase in data being willingly shared and in reviewers and Editors who understand the importance of also checking the data with the manuscript. We’ll happily join the call for other publishers to also implement data sharing policies, and would urge them to adopt the ‘Type 1’ level requiring that relevant data be made available alongside the publication.

References:
[1] Data Access for the Open Access Literature: PLOS’s Data Policy
Bloom T, Ganley E, Winker M (2014) Data Access for the Open Access Literature: PLOS’s Data Policy. PLOS Biology 12(2): e1001797. https://doi.org/10.1371/journal.pbio.1001797

[2] https://blogs.plos.org/everyone/2017/05/08/making-progress-toward-open-data/

Here we go again. Publishers developing different policies, statements, and workflows to the detriment of authors. Different and discordant conflicts-of-interest policies and platforms are just one example. Publishing is littered with others (formatting, style) that collectively waste millions of author, funder, and publisher time and money.

Instead of calling on publishers to develop their own DASs, we (publishers) should be working with stakeholders – data repositories, funders, researchers, NISO(?) – to develop ONE platform for researchers to enter relevant data sharing information and metadata. Submission systems, via an API, would pull the information from this ONE platform to fill each publisher’s unique DAS (really, the DAS should be standardized across journals).

The authors’ recommendations, while laudable, miss the mark in that they are looking too far downstream. Sometimes it is better to swim together in the same lane, beginning further upstream.

To be fair, there are some widely accepted principles around data publication and citation that are consistently being adopted by publishers:
Joint Declaration of Data Citation Principles:
https://www.force11.org/datacitationprinciples
Transparency and Openness Guidelines:
https://cos.io/our-services/top-guidelines/
I’m not sure about the idea of creating a new platform to carry data sharing information and metadata. Data publication is, by design, a distributed thing, with a wide variety of subject-specific and general data repositories, each offering different advantages and each working with new models for long term sustainability (some commercial, some not-for-profit). It is already burdensome for researchers to prepare and archive their data in one of these repositories, so adding an additional step of then going to a different location and entering the information about the data that’s been deposited seems unnecessary. Why not just follow the standards, get a DOI for your dataset, and cite it in your paper appropriately? Why would it be easier to enter that information on a third party site and then pull it into a submission system than just entering it on the submission system directly?

Because I am skeptical that publishers will be able to agree on one harmonized data sharing policy and statement for all journals. So what you propose forces authors to answer questions about their data for Journal A and then again for Journal B, C, or D if their paper is rejected multiple times. If you take a look at the long list of questions some of the ICMJE journals ask for clinical trial data sharing you may appreciate my suggestion. We publishers need to work with our authors to develop the solution instead of assuming we know what is best. Again, our authors are suffering with COI disclosures because stakeholders went their own way leading to multiple databases, policies, statements, etc. Publishers have an opportunity to not make the same mess for DASs.
