Editor’s Note: With Peer Review Week on the horizon, today we turn to the question of preprints, and how they can best be integrated into the permanent research record. Today’s post is by Sylvia Izzo Hunter (Marketing Manager), Igor Kleshchevich (Senior Software Engineer), and Bruce Rosenblum (VP of Content and Workflow Solutions, and the 2020 NISO fellow), all with Inera, an Atypon company. For more on preprints, the Society for Scholarly Publishing is hosting a webinar, The Future of Preprints: Coronavirus as a Case Study on September 22.
The COVID-19 pandemic has produced an explosion of postings on preprint servers to meet the critical need for rapid dissemination of new biomedical and clinical research findings. Citations to these preprints, both in other preprints and in peer-reviewed articles, have also exploded, as the research–cite–publish cycle shortens to weeks or even days, and citing the most up-to-date information becomes vital.
The preprint explosion created an urgent need for development work at Inera: the ability to process preprint citations went quickly from a nice-to-have feature to an imperative. This post was born out of our discoveries about (and frustrations with) the current preprint citation landscape as we set out to update our software solutions to support citations to preprints. As COVID-19 increased the prevalence of preprint citations during the spring of 2020, and we worked to adapt our software, we uncovered one technical challenge after another, illustrated below with real-world examples.
As we were working on our software development and this post, a working group assembled by ASAPbio in collaboration with EMBL-EBI and Ithaka S+R was working on a set of recommendations for “building trust in preprints” (posted by Beck et al. as a preprint to OSF Preprints on July 21, 2020). These separate but parallel activities show growing awareness of the issues; to our knowledge, however, this workshop report has not previously been mentioned in The Scholarly Kitchen. Many of our recommendations overlap with those of Beck et al.; we advise that their report be read and considered alongside the data and recommendations we present here.
We will not discuss the pros or cons of preprints, because we believe that irrespective of anyone’s opinions of them, preprints are now an integral part of the scholarly publishing landscape. What we will discuss are the challenges of recognizing, linking, and retrieving information that has not yet been peer reviewed, and that may be cited in ways that obscure its unreviewed status from readers. Unless managed well, preprint citations may weaken the reliability of the scholarly record, and we make the case that current management needs to be improved.
Some preliminary notes
Preprint servers, early publication articles, and metadata are constantly in flux. All of the examples below were accessed on September 8, 2020. We cannot promise that future readers will find them in exactly the same state.
We use the term “preprint” to mean an item that has been posted on an open site for purposes of viewing and commenting, either prior to or in parallel with the peer-review process, but is not formally undergoing peer review on the site where it has been posted. “Preprint server” means a site that hosts preprints. “Article” or “journal article” means an item that has been accepted by and published in a peer-reviewed journal.
Our data set was collected by culling sample citations from randomly selected bibliographies in preprints on medRxiv, bioRxiv, OSF-hosted preprint servers, and the WHO COVID-19 preprint server from April to September 2020. It was supplemented by example citations to preprints provided by our customers.
Preprint servers do not always identify their content as not peer-reviewed
Inconsistencies and ambiguities exist across preprint servers with respect to how (and, indeed, whether) they identify their content as posted without peer review. For example, this note appeared at the top of most pages we viewed on bioRxiv and medRxiv:
bioRxiv is receiving many new papers on coronavirus SARS-CoV-2. A reminder: these are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information.
The original preprint server, arXiv, is similarly clear about the status of manuscripts. Other sites are less so. MindRxiv, for example, gives no indication on its homepage that the content it hosts has not been peer reviewed, offering only a note stating: “MindRxiv is a service provided by the Mind & Life Institute, and is not affiliated with other preprint servers.” Individual preprint pages on this server are no clearer (see, for example, this preprint).
It is imperative that all preprints be clearly labeled as such: at a time when many people consider any information they find online to be true, being anything less than explicit is misleading.
Recommended citations on preprint servers may not indicate that the item is a preprint
Many preprint servers provide recommended citations in a variety of editorial styles, many of which don’t make it clear that the work in question is a preprint. For the Beck et al. manuscript, OSF Preprints conveniently provides these suggestions, none of which explicitly states “Preprint”:
Beck, J., Ferguson, C. A., Funk, K., Hanson, B., Harrison, M., Ide-Smith, M. B., … Swaminathan, S. (2020, July 21). Building trust in preprints: recommendations for servers and other stakeholders. https://doi.org/10.31219/osf.io/8dn4w
Beck, Jeffrey, et al. “Building Trust in Preprints: Recommendations for Servers and Other Stakeholders.” OSF Preprints, 21 July 2020. Web.
Beck, Jeffrey, Christine A. Ferguson, Kathryn Funk, Brooks Hanson, Melissa Harrison, Michele B. Ide-Smith, Rachael Lammey, et al. 2020. “Building Trust in Preprints: Recommendations for Servers and Other Stakeholders.” OSF Preprints. July 21. doi:10.31219/osf.io/8dn4w.
The APA citation even lacks the name of the preprint server. When we searched for this preprint at https://search.crossref.org/, the Vancouver format citation provided by Crossref was:
Beck J, Ferguson CA, Funk K, Hanson B, Harrison M, Ide-Smith M, et al. Building trust in preprints: recommendations for servers and other stakeholders. Center for Open Science; 2020 Jul 21; Available from: http://dx.doi.org/10.31219/osf.io/8dn4w
This citation gives the Center for Open Science as the preprint server with no indication the item is a preprint. So, when an author copies one of these citations into a new article, future readers of that article may have no idea that the cited content has not been peer reviewed.
Recommended citations on preprint servers may fail to include a DOI
The MLA example citation, above, does not include a DOI, which can make it difficult or impossible for a reader to follow the citation back to the authoritative copy. Here’s a suggested MLA citation from MindRxiv:
Weng, Helen, et al. “Focus on the Breath: Brain Decoding Reveals Internal States of Attention During Meditation.” MindRxiv, 7 Nov. 2018. Web.
When we Googled the title of this paper to find it from the DOI-less citation, MindRxiv did not even appear on our first page of results. Interestingly, the first two matches link to a preprint of this paper posted to bioRxiv on November 4, 2018, at https://doi.org/10.1101/461590. Only the third link points to the published article.
bioRxiv’s version of this preprint directs us to “View current version of this article” in a big red banner at the top of the page, with a link to the final publication in Frontiers in Human Neuroscience. The MindRxiv page gives no indication that the paper has been published in a peer-reviewed journal.
Preprint citations provided by authors often do not include a DOI
Today, the majority of journal articles are assigned DOIs that are deposited with Crossref, and the majority of preprint archives also assign DOIs to their content (arXiv, which assigns its own persistent identifiers, is the most notable exception). But assigning a DOI does not guarantee that it will always be used in subsequent citations.
In a random (though unscientific) sampling of preprint citations we found in preprints posted to a variety of servers over the past six months (i.e., new preprints citing older preprints), fewer than half included a DOI.
Sometimes the DOI is missing because the recommended citation on the preprint server did not include one. In other cases, reference management software doesn’t distinguish between a journal article and a preprint, and may not include the DOI when formatting a reference to a specific editorial style. Sometimes authors don’t understand why it’s important to include a DOI when citing content that is not part of a traditional issue-based journal.
In fact, it’s even more essential to include a DOI when citing content that is not part of a traditional paginated publication to enable readers to locate the authoritative copy of the preprint.
Preprint servers are not always updated to indicate publication in a peer-reviewed journal
In the preprint servers we reviewed, failure to indicate that a preprint has since been published in a peer-reviewed journal is not uncommon.
Publishers have told us that it’s essential to know when a preprint has been peer reviewed and published. When evaluating new research, peer reviewers and editors need to know whether a cited preprint has subsequently been published. Often, publishers will ask the author of an accepted article to review preprint citations and update them to cite the final article, if the content is substantially the same in both versions. It’s also important for future readers to know whether a preprint was peer reviewed and published in a journal and, if so, which one.
Whose responsibility is it to notify the preprint server that an article has been published? Authors may not see it as their responsibility and, in any case, are notoriously overburdened with administrative tasks. So notification and updating must be handled via scholarly publishing workflows and infrastructure.
Consider the paper “Serology characteristics of SARS-CoV-2 infection since the exposure and post symptoms onset.” It was posted to medRxiv on March 27, 2020, and published online in the European Respiratory Journal on May 19, 2020. As of September 8, medRxiv had not been updated, nor does Crossref metadata indicate the relationship between these items.
The lines of responsibility for notification, website updates, and Crossref metadata updates are not entirely clear, and this causes many metadata disconnects. As Beck et al. note in the ASAPbio workshop report, the problem can be mitigated if publishers, server hosts, and Crossref work together to identify a clear set of workflows that minimize the burden on authors and, to the greatest extent possible, automate metadata and site updates.
Automated link formation between preprints and articles may not always work
Given the importance of including a DOI in a preprint citation, editors should locate and add any that are missing when editing a peer-reviewed article. Ideally, this can be largely automated by looking up each citation on Crossref.
But the cross-references between items are not always updated. In theory, Crossref could watch for items with similar first-author surnames and titles and then automatically create linkages based on a set of matching criteria. If this is beyond Crossref’s purview, it could, at a minimum, automatically notify preprint servers of matches, prompting the servers to update their sites and redeposit their metadata.
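To make the idea concrete, here is a minimal sketch of such an author-and-title matching heuristic. The similarity threshold and record fields are illustrative assumptions of ours, not Crossref’s actual matching criteria:

```python
# Sketch of a matching heuristic: treat a preprint and a journal article
# as candidates for the same work when the first author's surname matches
# and the titles are sufficiently similar. The 0.85 threshold and the
# record fields are illustrative assumptions.
from difflib import SequenceMatcher

def titles_similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Case- and whitespace-insensitive similarity ratio on titles."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

def likely_same_work(preprint: dict, article: dict) -> bool:
    """Each record is a dict with 'first_author_surname' and 'title'."""
    return (preprint["first_author_surname"].casefold()
            == article["first_author_surname"].casefold()
            and titles_similar(preprint["title"], article["title"]))
```

Any fixed threshold necessarily trades false links against missed ones.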
But such matching logic is not infallible. In particular, it may not work if an article’s title changes significantly between the preprint and the final publication. For example, a preprint posted on the World Health Organization COVID-19 preprint server entitled “A simple model to assess Wuhan lock-down effect and region efforts during COVID-19 epidemic in China Mainland” was published in the Bulletin of the World Health Organization as “Modeling the effects of Wuhan’s lockdown during COVID-19, China.”
Crossref returns inconsistent results for the same query
Part of our software development process includes automatic nightly regression testing. We have found that Crossref structured queries return inconsistent results for preprints. Here are two references to preprints that are part of our regression test set:
E. Hassanien, L. N. Mahdy, K. A. Ezzat, H. H. Elmousalami, H. A. Ella, Automatic X-ray COVID-19 Lung Image Classification System based on Multi-Level Thresholding and Support Vector Machine, medRxiv.
[citation copied from bibliography of https://doi.org/10.1101/2020.05.01.20087254]
Lachmann, A., Jagodnik, K. M., Giorgi, F. M., and Ray, F. (2020). Correcting under-reported covid-19 case numbers: estimating the true scale of the pandemic. medRxiv
[citation copied from bibliography of https://doi.org/10.1101/2020.07.01.20144279]
Although our test system makes identical queries every night, our daily data review has found that Crossref doesn’t consistently return a DOI for each of these references, and it’s unclear why.
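For readers who want to reproduce this kind of lookup, the sketch below builds a bibliographic query against the public Crossref REST API (api.crossref.org). The mailto address is a placeholder, and response contents may of course change over time:

```python
# Sketch: build a Crossref REST API bibliographic query for a free-text
# reference string. Supplying a real mailto address routes requests to
# Crossref's "polite" pool; the one below is a placeholder.
import urllib.parse

CROSSREF_WORKS = "https://api.crossref.org/works"

def build_query_url(reference_text: str, rows: int = 1,
                    mailto: str = "user@example.org") -> str:
    """URL for a 'query.bibliographic' search returning the top matches."""
    params = urllib.parse.urlencode({
        "query.bibliographic": reference_text,
        "rows": rows,
        "mailto": mailto,
    })
    return f"{CROSSREF_WORKS}?{params}"

# Fetching and inspecting the top hit (requires network access):
#   import json, urllib.request
#   with urllib.request.urlopen(build_query_url(ref)) as resp:
#       top = json.load(resp)["message"]["items"][0]
# top["DOI"] is the candidate DOI; top["type"] is "posted-content" for
# preprints and "journal-article" for peer-reviewed articles.
```

One plausible, though unconfirmed, factor in the inconsistency is that relevance scoring shifts as the underlying index grows, so identical queries run on different nights may rank candidates differently.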
Authors may post preprints to two or more preprint servers
Sometimes a preprint appears on multiple servers. (Note: our discussion here focuses on the technical challenges associated with such postings; we leave the principle of multiple postings for commenters on this post to discuss.)
We found the following paper on three different preprint servers. The first two citations are the recommended ones from those servers:
Rodriguez Llanes, Jose Manuel and Castro Delgado, Rafael and Pedersen, Morten Gram and Arcos Gonzalez, Pedro and Meneghini, Matteo, Confronting COVID-19: Surging Critical Care Capacity in Italy (3/25/2020). Available at SSRN: https://ssrn.com/abstract=3564386 or http://dx.doi.org/10.2139/ssrn.3564386
Rodriguez-Llanes JM, Castro Delgado R, Pedersen MG, Arcos González P & Meneghini M. Confronting COVID-19: Surging critical care capacity in Italy. [Preprint]. Bull World Health Organ. E-pub: 6 April 2020. doi: http://dx.doi.org/10.2471/BLT.20.257766
Rodriguez-Llanes JM, Castro Delgado R, Pedersen MG, Arcos González P & Meneghini M. Confronting COVID-19: Surging Critical Care Capacity in Italy. medRxiv. Posted 6 April 2020: 2020.04.01.20050237; doi: https://doi.org/10.1101/2020.04.01.20050237
According to the statement that accompanies the SSRN version, “Authors have either opted in at submission to The Lancet family of journals to post their preprints on Preprints with The Lancet [the Lancet-branded preprint service on SSRN], or submitted directly via SSRN.” We can therefore presume that the authors either submitted this paper to The Lancet or posted it directly on SSRN on March 25, although there is nothing in the recommended citation to indicate that it was submitted to The Lancet. Two weeks later, the authors submitted the same manuscript to the Bulletin of the World Health Organization, which posted it on their COVID-19 preprint site. The authors also posted the preprint to medRxiv.
This situation creates several problems. First, if and when this paper is published, multiple preprint servers will need to update their site and their metadata. Second, despite the (laudable) inclusion of DOIs in the recommended citations, if — as is frequently the case — future citations to any of these versions do not include a DOI, looking up the DOI at Crossref presents challenges.
As a test, we took the three citations above, removed the DOIs, and then tried to find the DOIs via Crossref services. With structured queries, a DOI was returned only for the first citation, which had been incorrectly deposited at Crossref as a journal article; the other two were correctly deposited as “posted-content”. Querying the same three references using the Crossref Simple Text Query service returned the medRxiv DOI for all three citations: incorrectly for the first two, and correctly only for the third.
Preprint sites replace preprints with the published journal article
We have seen multiple cases in which preprint servers replace the preprint with the final journal article after publication. For example, if we look at the article “Habitat risk assessment for regional ocean planning in the U.S. Northeast and Mid-Atlantic” on marXiv, we find that the original preprint has vanished without a trace. In its place is a PDF of the final publication in PLOS ONE, and the following recommended citation:
Wyatt, K., Griffin, R., Guerry, A., Ruckelshaus, M., Fogarty, M., & Arkema, K. K. (2018, January 26). Habitat risk assessment for regional ocean planning in the U.S. Northeast and Mid-Atlantic. https://doi.org/10.1371/journal.pone.0188776
This means that, if any other paper has cited the preprint, and if the final published journal article differs from the preprint in any significant way, future readers will be unable to view the research as it was expressed in the preprint.
A preprint later published in a peer-reviewed journal is like an article for which a correction has been published — you don’t make the original article go away; instead, you publish a correction and leave the original as it was.
Through the examples above, we’ve illustrated a number of challenges in the current preprint environment. We’ve come to think of preprints as rather like an unruly teenager: we see tremendous promise, but they need more adult supervision to achieve their potential.
In addition to endorsing the recommendations of Beck et al., we recommend the following steps to make preprints more trusted and sustainable.
- NISO, in conjunction with ASAPbio:
- Consider expanding the recommendations of Beck et al. and developing recommended practices for preprints, similar to PIE-J and the NFAIS Best Practices for Publishing Journal Articles.
- Preprint servers:
- Clearly identify when content has not been peer-reviewed.
- Update recommended citation formats, when provided, to always include a DOI, the preprint server name, and a “preprint” indicator.
- Work with vendors of reference management software to improve integration between preprint metadata and software that consumes the metadata, so that preprints are handled as a unique citation type and not shoehorned into data structures built for journal articles.
- Define and implement workflows to ensure that preprint webpages are promptly updated with final publication information.
- Refrain from replacing preprints with published articles.
- Journal publishers:
- Update editorial style guides/instructions to authors to indicate how preprints should be cited (we recommend, at a minimum, first author, preprint title, preprint server name, date of posting, “preprint” indicator, version, and DOI).
- Educate authors about the correct use of DOIs in preprint citations.
- Take steps to ensure that preprint metadata are not deposited to Crossref as journal article metadata.
- Reference management software vendors:
- Design and implement structures for new reference types for preprints, so that they will be formatted correctly according to journal styles and always indicate that the item is a preprint.
- Crossref:
- Update query logic (structured query, Simple Text Query) to better distinguish, when multiple items may have the same authors, title, and year of publication, between a journal article and an earlier preprint, or between preprints of the same manuscript on multiple preprint servers.
- Consider implementing systems to automatically notify preprint services when journal articles have been published with the same authors and title.
- Consider creating automatic Crossref metadata links between preprints and journal articles that have identical authors, title, and year of publication.
Many of these recommendations cannot be implemented in isolation. We hope that the interested parties listed above will work effectively together on solutions.
Sir Isaac Newton famously commented that if, in his work, he saw further than others, it was “by standing on the shoulders of giants.” The knowledge we are collectively building is only stable if the citations that underpin new research are sound. When our citations do not clearly indicate whether a source has been peer-reviewed, and when preprint metadata is handled incorrectly, we risk undermining the stability of future research chains. We have an opportunity to bring consistency and stability to the preprint environment; if we choose not to take it, we risk harming the integrity of research.