How can we tell when an article presents high quality research? Peer review tries hard to determine whether the authors have conducted robust and relevant research, but the pressures of time and limited resources (including limited human expertise) mean the determination can be off the mark.
Assessment clearly takes place after publication – everyone who reads the article will form some sort of opinion – but capturing these opinions and making them broadly visible has been a decades-long challenge. We have tried encouraging Commentary articles, allowing comments on the article home page or on sites like PubPeer, capturing activity on Twitter (R.I.P.), and online journal clubs.
All of these initiatives run into the same problem: attention gloms onto articles with serious integrity issues, controversial results, or AI-generated images of anatomically implausible rats, leaving the vast majority of published work with no assessment signal. Post-publication comments essentially capture ‘excitingness’ rather than more pedestrian features like the robustness of the methods or the validity of the results and conclusions.

Article citations are academic publishing’s go-to measure of post-publication interest in an article. Since journal reputation is built on article to article citations, publishers have built extensive infrastructure around capturing and quantifying these citations. While articles presenting robust results do get lots of citations, the signal-to-noise ratio is low. As scite.ai’s work on classifying the sentiment around citations has shown, about 80% of citations are just ‘mentioning’ – a way for authors to name-check relevant previous work and to demonstrate that they’re somewhat knowledgeable about their field of research.
Including a citation to a retracted article is bad form, but citing an article that’s relevant but not robust is common practice, and counts as +1 citation just as much as citing a brilliant and compelling piece of research. These mentions cloud the citation signal for robust research.
The major underlying flaw of article to article citations is that they are not ‘trust dense’: they require almost no effort from authors to include, and there are negligible consequences for citing a weak article. Find a relevant paper, add it into your reference manager, and in a few clicks the article to article citation is created. If a cited article is later found to be flawed, then nothing happens – the article to article citation persists and opinion around the citing article is usually unaffected.
The good news is that there are other forms of citation that are much more ‘trust dense’. As presaged by the title, the most ‘trust dense’ citation is a data citation*. Or, more fully, an ‘article to dataset to article’ citation. This specific kind of citation arises when an article generates new data and a subsequent article bases new analyses on that dataset. NB I’m using ‘citation’ quite broadly here – the reanalysis implies a connection between the subsequent article and the earlier article/dataset, even when a formal citation for the reanalyzed data does not appear in the subsequent paper.
First, for a data citation to happen, the original article must have somehow made their newly generated data available, which by itself is a powerful signal that the original authors were sufficiently confident in their findings to give others access to the underlying evidence. Data sharing is rare among articles from paper mills, and (perhaps obviously) only articles that share their data can be judged ‘reproducible’, which is perhaps the highest badge of analytic robustness.
Second, unlike skimming an abstract and casually throwing in an article to article citation, a data citation implies a much deeper relationship between the original article and the subsequent research. To make a data citation, the author has to read and re-read the original article to understand how and why the dataset was generated, and then inspect the dataset itself before any re-analysis or reuse can begin. If at any point the original article or the dataset seems fishy, the data are discarded and the data citation never comes into being.
Finding an ‘article to dataset to article’ citation thus implies that other researchers have extensively validated the original work and have built sufficient trust that they’re willing to base their own work (and hence reputation) on the reused data.
In past years, this think-piece would come to a screeching halt right here. Even though researchers describe reusing data in about 50% of their research articles, most methods for identifying data citations have hopelessly high false negative rates: they only find evidence for data reuse in <2% of articles. So, even if data reuse is powerful evidence for high quality research, we can’t reliably detect when data citations happen, and therefore the idea is dead in the water.
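To put a rough number on ‘hopelessly high’, here is a back-of-envelope sketch using only the two figures quoted above. It assumes that both percentages describe the same population of articles and that every detected case is a genuine instance of reuse; neither assumption comes from a specific study, so treat the result as an illustrative bound rather than a measurement.

```python
# Back-of-envelope bound built from the two rough figures quoted in the text.
reuse_rate = 0.50     # share of articles whose authors describe reusing data
detected_rate = 0.02  # upper bound on the share where reuse is actually detected

# Assumptions (not from the source): every detection is a true positive,
# and both shares are measured over the same set of articles.
max_recall = detected_rate / reuse_rate
min_false_negative_rate = 1 - max_recall

print(f"recall is at most {max_recall:.0%}")
print(f"false negative rate is at least {min_false_negative_rate:.0%}")
```

Under those assumptions, detection methods miss at least 96 of every 100 articles that reuse data.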
Fortunately, times have changed. Authors have always been clear when they reuse data, so that sentences like the one below can be found in the Methods section whenever data are reused:
Our analysis was conducted on the Yoruba (YRI), European (CEU), and Han Chinese (CHB) reference sample individuals from the high-coverage 1000 Human Genomes data aligned to GRCh38/hg38 [88]. (sentence taken from here)
Authors have also always been terrible at providing identifiers for the datasets they reuse.
The above sentence does contain an accession number, and with the help of the surrounding text we can infer that it is from the 1000 Human Genomes database. But GRCh38/hg38 isn’t the reused data – the reused data are actually the three populations of reference samples. The URL for the Yoruba sample is https://www.internationalgenome.org/data-portal/sample/NA18500, which the authors should have included, along with the URLs for the other two.
The above sentence also illustrates another classic – but profoundly unhelpful – form of a data citation. The bibliographic citation [88, Byrska-Bishop et al.] points to a published article, but only from the surrounding context is it clear that the authors re-analyzed the datasets generated by Byrska-Bishop et al. This is +1 ‘article to article’ citation, but given the context it should clearly also be +1 ‘article to dataset to article’ citation and +100 to the credibility of the work conducted by Byrska-Bishop et al.
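The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not an existing metadata schema: the `Citation` class, its field names, and the `tally` helper are all hypothetical, and the second citation in the example is invented purely to show the contrast with a plain ‘mention’.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical record of one link between two articles."""
    citing: str                           # the subsequent article
    cited: str                            # the earlier article
    reused_dataset: Optional[str] = None  # set only when the data were reanalyzed


def tally(citations):
    """Count, per cited article, ordinary citations and data citations."""
    counts = {}
    for c in citations:
        per_article = counts.setdefault(c.cited, Counter())
        # Every link is +1 'article to article' ...
        per_article["article to article"] += 1
        # ... and a reanalysis is also +1 'article to dataset to article'.
        if c.reused_dataset is not None:
            per_article["article to dataset to article"] += 1
    return counts


citations = [
    # The reuse described in the quoted Methods sentence.
    Citation(
        citing="article quoted above",
        cited="Byrska-Bishop et al.",
        reused_dataset="1000 Genomes high-coverage reference samples (YRI, CEU, CHB)",
    ),
    # Invented example of a plain 'mentioning' citation, for contrast.
    Citation(citing="hypothetical review article", cited="Byrska-Bishop et al."),
]

counts = tally(citations)
print(dict(counts["Byrska-Bishop et al."]))
```

Keeping the two counts separate is the point: a conventional citation index would report two citations here and stop, while the data citation count isolates the one link that carries real trust.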
The next ‘fortunately’ is that the new generation of reasoning LLMs can help us reconstruct usable data citations from the dog’s dinner of how authors describe the source of their reused data. An example is the LLM version of the PLOS Open Science Indicators, as described in this preprint.
Of course, one can reasonably object that both data sharing and data reuse are unevenly distributed across research, and hence the trust signals that can be derived from reconstructing data citations will be uneven too. Aside from making their datasets available and discoverable, there’s also little that researchers can do to induce others to reanalyze their data**. At least they’ll get the trust bump from having enabled reuse by making their data public.
Moreover, is this unevenness between fields a feature or a bug? Fields that generate shareable datasets but have no culture of data sharing are – by this measure – producing research that’s less robust and less reliable than fields that share lots of data. It’s possible that increasing the visibility of data reuse, and finally giving credit to authors who generate robust, reusable data, will motivate all fields into doing a better job of sharing data.
*Other ‘trust dense’ citations include ‘article to reagent to article’ and ‘article to protocol to article’, as both imply that the subsequent authors extensively validate the outputs they reuse in their experiments.
**I predict that if data reuse becomes a trust metric, papermills will inevitably branch out into data re-analysis rings.
Discussion
6 Thoughts on "Data Reuse is the Sincerest Form of Flattery"
Great post – thanks Tim. The challenge for publishers now (in an AI world) is to make these trust signals/markers machine readable. Data citations do that when incorporated into the metadata of the published article. To realise your vision, that’s what we need to concentrate on. Many thanks.
Thanks Catriona! There’s not much hope that authors will start including actual data citations in their articles, and adding data citations to the text is of course impossible for the millions of articles that have already been published. But you’re right that AI-generated metadata about data reuse could and should be added to the article metadata through something like COMET (https://www.cometadata.org/).
Would the Research element journal help with this? Encouraging researchers to publish the robust data and open-source scientific equipment used in the research may add a bit of value.
One of the big challenges with tracking output reuse is that while authors can control where they put their outputs and hence how they can be referred to (e.g. via a DOI vs an accession), they’ve no control over how people reusing those outputs actually refer to them. That reuse takes place across all sorts of journals, and there’s little concerted effort to prod authors to provide formal identifiers for the outputs they reuse.
“First, for a data citation to happen, the original article must have somehow made their newly generated data available, which by itself is a powerful signal that the original authors were sufficiently confident in their findings to give others access to the underlying evidence.”
I feel like this ties in with the concept of “checkability” in Alison Mudditt’s May 28th piece, which has serious legs for research integrity efforts.
Absolutely – just sharing the data (and preferably the code) is a key component of checkability. This quote from Alison’s article hit home too: “researchers engage far more readily with non-article outputs when they are presented in relation to a familiar article structure.”
In some ways the article is the ultimate ‘ReadMe’ for navigating the dataset, so it makes sense that it’s only when both are presented together that reuse of the data is maximized.