The scholarly ecosystem depends on Crossref to curate the linkages created when one published article cites another. Without a stable record of these linkages, we would be blind to the structure of the literature. For example, which past studies underpin current research? Which new fields draw on many other areas to progress, and which are more isolated? Citations are also both an important usage metric and the standard credit system for authors; access to these data likely explains why almost every publisher is a Crossref member.
For all its glorious complexity, this citation network is superficial – published articles are no more than an account of the research, constrained by word counts and the need to be succinct to keep the reader engaged. The real meat of the research is the raw and processed data and any accompanying code. A deeper, more complete network would connect articles to their datasets, and then connect those datasets to other articles that re-use them.
More and more data are being shared alongside published articles, so these relationships are out there and ready to be recorded. But they’re not making it to Crossref, and hence researchers a) don’t have a public record of the connection between their article and their data, and b) don’t get credit for others re-using their data.
For example, the Dryad data repository alone received 4538 data packages in 2017. Because Dryad only hosts datasets associated with published articles, this should have led to 4538 data citations being passed to Crossref. In particular, these should have carried the ‘isSupplementedBy’ relationship, which indicates that the dataset was generated by the citing article. Instead, there is a lifetime total of 4752 data citations* in Crossref (not just 2017!), of which PeerJ accounts for 3804 and eLife another 678. PeerJ has only 69 articles with data in Dryad, and eLife has 210, so there are another 4473 linkages between articles and datasets that didn’t make it.
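For illustration, here is a minimal Python sketch of how an ‘isSupplementedBy’ link surfaces in the article metadata Crossref returns, and how the dataset DOI can be pulled back out. The kebab-case JSON shape follows the REST API’s conventions, but the record and both DOIs below are invented for the example:

```python
# Hypothetical REST API response fragment for one article (assumed shape;
# the DOIs are invented, not real records).
article_record = {
    "DOI": "10.1234/example.article",           # invented article DOI
    "relation": {
        "is-supplemented-by": [
            {
                "id-type": "doi",
                "id": "10.5061/dryad.example",  # invented Dryad-style DOI
                "asserted-by": "subject",
            }
        ]
    },
}

def supplementing_datasets(record):
    """Return the DOIs of datasets linked via 'is-supplemented-by'."""
    links = record.get("relation", {}).get("is-supplemented-by", [])
    return [link["id"] for link in links if link.get("id-type") == "doi"]

print(supplementing_datasets(article_record))  # ['10.5061/dryad.example']
```

If this relationship were deposited for every article with an associated dataset, counting and crediting data re-use would reduce to a metadata query like the one above.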
What’s the obstacle here? There’s certainly goodwill on the publisher side, as evidenced by the endorsement of the Joint Declaration of Data Citation Principles by both Wiley and Elsevier (among many others).
One problem is semantics. Because reference metadata are always passed to Crossref, researchers citing their own data is the simplest way for Crossref to link articles and their associated datasets. However, authors (and journals) are confused by being asked to cite their own data in their references. You don’t cite your figures or tables, so why would you cite your data? Unless publishers and journals can re-educate the research community into always citing their own datasets, this approach seems unlikely to succeed.
Aside from re-educating authors, publishers could ask their typesetters to ensure that data citation metadata are always passed to Crossref. Production systems are becoming increasingly sophisticated, so automated identification and curation of links between articles and datasets does seem eminently feasible.
Of course, extra resources are needed to support this extra typesetting work, which is why it isn’t being done already. Publishers are well aware of data citation protocols (see the endorsements above), so ‘lack of resources’ is really just shorthand for ‘this is not a priority’.
Why aren’t data citations a priority? Citations to a publisher’s journals boost Impact Factors, and hence eventual revenue, so having typesetters carefully curate article citations has a commercial incentive. As noted previously, no such incentive exists for open data – having excellent connections between datasets and articles doesn’t have a clear path to future revenue. Devoting extra resources at the typesetting phase to getting the data citations right is therefore a hard sell.
Neglecting data citations is probably short-sighted. Momentum towards open science is building, particularly in response to powerful funder initiatives. Someday soon re-using published data will become commonplace. The extra citations will accrue to journals or publishers with a) lots of datasets to reuse, and b) well established linkages between their articles and data. Moreover, journal performance metrics may one day include data citations (as papers with open data are more robust and more useful to the community), and publishers with weaker data standards will lose out.
The success of Crossref is a testament to the scholarly publishing community’s ability to put aside commercial differences and create something that benefits all. The next step everyone needs to take is extending the citation network to datasets. That begins with:
- Publishers pushing for the inclusion of data citations in the references, and tagging them appropriately at the typesetting stage.
- Tagging in-text references and data availability statement links to dataset DOIs as well, so that linkages between articles and their datasets are visible to Crossref and authors can receive credit for depositing their data.
Both steps just involve changes to production protocols – one small step for publishers, one giant leap for open science.
*These arrived via the REST API, when a publisher sends Crossref a marked-up version of the references section. There are another 6196 citations with the ‘isSupplementedBy’ relationship, but only a tiny fraction of these are datasets, and most come from a single publisher. There’s another type of citation too – event data – which also has some data citations. See here and here for more information.
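Anyone wanting to reproduce these counts could query the REST API directly. The sketch below only builds the query URL; the `relation.type` filter name matches Crossref’s documented filters at the time of writing, but verify it against the current API documentation before relying on it:

```python
from urllib.parse import urlencode

# Build (but do not send) a Crossref REST API query for works that carry
# the 'isSupplementedBy' relationship. The filter name is taken from
# Crossref's documented filters; treat it as an assumption to verify.
BASE = "https://api.crossref.org/works"

def supplemented_by_query(rows=0):
    """Build a query URL; rows=0 asks only for the total-results count."""
    params = {"filter": "relation.type:is-supplemented-by", "rows": rows}
    return BASE + "?" + urlencode(params)

print(supplemented_by_query())
```

The `total-results` field of the JSON response would then give the headline count discussed above.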