Reproducibility and research integrity are two strongly related, and increasingly important, concepts. On the one hand, corner cutting and sloppy research practices lead to results that aren’t robustly repeatable or that translate poorly into clinical trials. On the other hand, more serious individual malpractice and industrial-scale fraud threaten the fundamental trustworthiness of research itself. What unites the two is the need to ensure that research has been conducted to an adequate standard, with transparency, traceability, and accountability.
The fight for research integrity
Last week, I attended the first in-person STM Week in London since the beginning of the pandemic. In a break from the old format, the new three-day program consisted of a startup fair for early-stage companies and a series of lightning talks; a research integrity masterclass, comprising a seminar and workshop; and an early career event. Two themes relating to fraud and research integrity ran strongly through the first two days: image manipulation, and fake papers generated by paper mills (large-scale producers of fake research papers, often employing AI to generate text and images).
Of the 22 startups represented at the fair, four (FigCheck, ImaCheck, Proofig, and ImageTwin, which also won the Karger Vesalius Innovation Award presented as part of the fair program) are attempts to address image duplication and manipulation. The STM STEC Working Group on Image Alterations and Duplications has developed a draft list of requirements for image manipulation tools. The idea is to include tools that meet all those requirements in STM’s new Research Integrity Hub, which was discussed during the research integrity masterclass. Fellow Chef Lisa Janicke Hinchliffe interviewed Joris van Rossum about the hub earlier in the year.
Paper mills and the resulting mass retractions were also an important topic for the masterclass. STM has recently collaborated with the Committee on Publication Ethics (COPE) to investigate the prevalence of paper mills, and some of their findings are sobering. As part of the investigation, data aggregated anonymously by publishers showed that most journals find around 2% of their papers are likely to have been generated by paper mills, a figure that rises to over 40% for some journals. The mass retraction discussion was held under the Chatham House Rule, so I can’t go into details, but I can say that research into the scale of the problem is still ongoing. As worrying as current coverage of the problem appears (for instance, this article in the Guardian and this piece in Nature), there may be further uncomfortable news to come. The STM Integrity Hub also has a module aimed at helping publishers identify AI-generated text using a series of techniques that STM Solutions are understandably tight-lipped about.
What unites these two themes, along with some of the other topics discussed, such as identity assurance at the point of submission and the problem of the same content being submitted to multiple journals, is that they concentrate the responsibility for research integrity at a single stage of the research knowledge lifecycle: publication.
Where research integrity meets reproducibility and research infrastructure
The last part of the masterclass was a brainstorming session where attendees were split into groups and asked to come up with ideas on how to tackle challenges around research integrity. Understandably, many of the groups looked to industries that face similar challenges around authentication, like the financial and banking industry. Ideas included identity checks, such as requiring traditional forms of ID (passports, driver’s licenses, etc.) when authors submit, or making institutional research offices responsible for verifying the identities of authors.
Building on the idea of identity assurance naturally led to ORCID, and to how research infrastructure can provide a network of trusted assertions. Earlier in the day, Matt Hodgkinson, speaking on behalf of COPE, pointed out that ORCID does not necessarily assure identity: people can create ORCID iDs at will, simply claim to be affiliated with an institution, and list fake publications. While this is true, institutions can integrate ORCID so that the affiliation is asserted by the institution rather than by the individual owner of the ORCID iD. According to ORCID product director Tom Demeranville, hundreds of institutions have already adopted this integration. While that is still only a small proportion of institutions, when publishers enable, or even require, researchers to sign in to submission systems using their ORCID iD, the combination of these two authentication steps makes it possible to assert securely that an article comes from a specific author who is affiliated with a specific institution. In other words, we don’t need an authenticated version of ORCID; it would already exist if both institutions and publishers were to adopt and fully implement what is there. As ORCID Executive Director Chris Shillum pointed out on Twitter, ORCID has a whole series of trust markers to validate and verify assertions on ORCID records, and is looking to partner with publishers to pilot their use in editorial workflows.
This type of assertion network approach can reduce the burden on publishers to investigate the individual authenticity of submitted works. After all, a network of authenticated activity, including funding awards, affiliations, datasets, collaborations, projects, and publications is far harder to fake than a single piece of text, a figure, or a dataset.
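To make the trust-marker idea concrete: ORCID’s public API exposes, for each affiliation on a record, the source that asserted it. The sketch below (a hypothetical illustration, not part of any tool discussed above, and assuming the standard v3.0 JSON response shape) separates fetching from checking, and treats an employment entry as institution-asserted when its source is not the record holder’s own iD.

```python
import json
import urllib.request

ORCID_PUBLIC_API = "https://pub.orcid.org/v3.0"  # ORCID's public read API

def fetch_employments(orcid_id: str) -> dict:
    """Fetch the public employments section of an ORCID record (network call)."""
    req = urllib.request.Request(
        f"{ORCID_PUBLIC_API}/{orcid_id}/employments",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def institution_asserted(employments: dict, orcid_id: str) -> list[dict]:
    """Keep only employment affiliations asserted by a source other than
    the record holder, i.e. added by a member organization's integration."""
    verified = []
    for group in employments.get("affiliation-group", []):
        for summary in group.get("summaries", []):
            emp = summary["employment-summary"]
            source = emp.get("source") or {}
            source_path = (source.get("source-orcid") or {}).get("path")
            # If the assertion's source is the record holder's own iD,
            # it is self-asserted and carries no institutional weight.
            if source_path != orcid_id:
                verified.append({
                    "organization": emp["organization"]["name"],
                    "asserted_by": (source.get("source-name") or {}).get("value"),
                })
    return verified
```

A submission system could run a check like this at sign-in, flagging records where the claimed affiliation is only self-asserted; the heavy lifting (the actual assertion) is done upstream by the institution’s integration, not by the publisher.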
Blue sky thinking
In my own breakout group, we took inspiration from an industry that isn’t just concerned with identity, but also with research integrity: health and pharmaceuticals. I apologize for not remembering who first suggested the idea of enabling researchers to upload their data so that it can be analyzed and graphs plotted within a publisher-controlled environment. This is similar to the Trusted Research Environment (TRE) concept described in this green paper from the UK Health Data Research Alliance. Environments like TREs and electronic lab notebooks are used by pharmaceutical companies and, increasingly, by universities to establish an electronic paper trail that links experiments to data, to analyses, and ultimately to the researchers responsible.
Imagine if the integrity of the publishing process didn’t rely purely on publishers’ ability to detect fraud, malpractice, or mistakes from the limited information available in a submitted manuscript. What if, instead, that responsibility were spread throughout the ecosystem, from funder grant management systems, to data management plans, to data centers, to lab notebooks, to preprints, to the published version of record? Trusted assertions could build an open, verifiable research environment whose transparency lets publishers, funders, institutions, and other researchers trace findings and claims back through the whole research process.
A whole-sector approach
The vision I laid out above may sound utopian, but much of the technology and tools required already exist. As well as the TREs, which can be seen as a model for traceability, and ORCID trust markers, which illustrate how the same thing can be done securely in the open, initiatives like the Center for Open Science and Octopus show how a range of outputs and activities can be used to document the entire research process.
The problem is not technology; it’s a wicked mix of perverse incentives, network effects, business model inertia, and sustainability challenges that lock us all into the same restrictive ideas about what constitutes a research publication, and what counts toward prestige and career advancement. To address the range of challenges from poor research practice to industrial-scale fraud by paper mills, we need a whole-sector approach that involves funders, institutional research management and libraries, researchers, and publishers. As fellow Chef Alice Meadows and I wrote in a previous post, it really does take a village, and cross-sector collaboration is vital to building the interoperable research information infrastructure needed to connect the people, places, and things of the scholarly ecosystem in a way that is verifiable and trusted.