The scholarly ecosystem depends on Crossref to curate the linkages created when one published article cites another. Without a stable record of these linkages, we would be blind to the structure of the literature. For example, which past studies underpin current research? Which new fields draw on many other areas to progress, and which are more isolated? Citations are also both an important usage metric and the standard credit system for authors; access to these data likely explains why almost every publisher is a Crossref member.

[Image: dropped baton. CC BY licensed image courtesy of tableatny.]

For all its glorious complexity, this citation network is superficial – published articles are no more than an account of the research, constrained by word counts and the need to be succinct to keep the reader engaged. The real meat of the research is the raw and processed data and any accompanying code. A deeper, more complete network would connect articles to their datasets, and then connect those datasets to other articles that re-use them.

More and more data are being shared alongside published articles, so these relationships are out there and ready to be recorded. But they’re not making it to Crossref, and hence researchers a) don’t have a public record of the connection between their article and their data, and b) don’t get credit for others re-using their data.

For example, the Dryad data repository alone received 4538 data packages in 2017. Because Dryad only hosts datasets associated with published articles, this should have led to 4538 data citations being passed to Crossref. In particular, these should have carried the ‘isSupplementedBy’ relationship, which indicates that a dataset was generated by the citing article. Instead, there is a lifetime total of 4752 data citations* in Crossref (not just 2017!), of which PeerJ accounts for 3804 and eLife another 678. PeerJ has only 69 articles with data in Dryad, and eLife has 210, so there are another 4473 linkages between articles and datasets that didn’t make it.
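(For anyone who wants to poke at these numbers themselves, here is a minimal sketch in Python, assuming the Crossref REST API’s relation filters behave as documented; the total returned today will naturally differ from the figures above.)

```python
import requests

# Count works whose deposited metadata declares an 'is-supplemented-by'
# relation. rows=0 asks Crossref for the total only, not the records.
resp = requests.get(
    "https://api.crossref.org/works",
    params={"filter": "relation.type:is-supplemented-by", "rows": 0},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["message"]["total-results"])
```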

What’s the obstacle here? There’s certainly goodwill on the publisher side, as evidenced by the endorsement of the Joint Declaration of Data Citation Principles by both Wiley and Elsevier (among many others).

One problem is semantics. Because reference metadata are always passed to Crossref, researchers citing their own data is the simplest way for Crossref to link articles and their associated datasets. However, authors (and journals) are confused by being asked to cite their own data in their references. You don’t cite your figures or tables, so why would you cite your data? Unless publishers and journals can re-educate the research community into always citing their own datasets, this approach seems unlikely to succeed.

Aside from re-educating authors, publishers could ask their typesetters to ensure that data citation metadata are always passed to Crossref. Production systems are becoming increasingly sophisticated, so automated identification and curation of links between articles and datasets does seem eminently feasible.
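As a rough illustration of how simple a first pass could be, here is a Python sketch of scanning manuscript text for dataset DOIs. The two repository DOI prefixes (Dryad, figshare) are picked purely as examples; a real production pipeline would use a proper registry of repository prefixes.

```python
import re

# Example DOI prefixes for two data repositories (Dryad, figshare);
# a production system would consult a fuller registry.
DATA_DOI_PREFIXES = ("10.5061", "10.6084")

# Match a DOI and require it to end on an alphanumeric character,
# so trailing sentence punctuation isn't swallowed.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9}/[-._;()/:a-zA-Z0-9]*[a-zA-Z0-9])")

def find_dataset_dois(text: str) -> list[str]:
    """Return DOIs in `text` whose prefix matches a known data repository."""
    return [doi for doi in DOI_PATTERN.findall(text)
            if doi.startswith(DATA_DOI_PREFIXES)]

statement = ("The datasets analyzed during the current study are available "
             "at Dryad, doi:10.5061/dryad.2jr8p.")
print(find_dataset_dois(statement))  # ['10.5061/dryad.2jr8p']
```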

Of course, extra resources are needed to support this extra typesetting work, which is why it isn’t being done already. Publishers are well aware of data citation protocols (cf. the endorsements above), so ‘lack of resources’ is really just shorthand for ‘this is not a priority’.

Why aren’t data citations a priority? Citations to a publisher’s journals boost Impact Factors, and hence eventual revenue, so having typesetters carefully curate article citations has a commercial incentive. As noted previously, no such incentive exists for open data – having excellent connections between datasets and articles doesn’t have a clear path to future revenue. Devoting extra resources at the typesetting phase to getting the data citations right is therefore a hard sell.

Neglecting data citations is probably short-sighted. Momentum towards open science is building, particularly in response to powerful funder initiatives. Someday soon re-using published data will become commonplace. The extra citations will accrue to journals or publishers with a) lots of datasets to reuse, and b) well-established linkages between their articles and data. Moreover, journal performance metrics may one day include data citations (as papers with open data are more robust and more useful to the community), and publishers with weaker data standards will lose out.

The success of Crossref is a testament to the scholarly publishing community’s ability to put aside commercial differences and create something that benefits all. The next step everyone needs to take is extending the citation network to datasets. That begins with:

  • Publishers pushing for the inclusion of data citations in the references, and tagging them appropriately at the typesetting stage.
  • Tagging references to dataset DOIs that appear in the text and in data availability statements, so that linkages between articles and their datasets are visible to Crossref and authors can receive credit for depositing their data (a sketch of what such tagging might look like follows this list).
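To make the tagging concrete, here is a hedged sketch of the kind of JATS-style reference markup a typesetter might emit, built with Python’s ElementTree. The publication-type="data" and pub-id-type="doi" attributes follow common JATS usage, but consult the JATS4R data citation recommendation for the authoritative tag set.

```python
import xml.etree.ElementTree as ET

# Sketch of a JATS-style data citation, as a typesetter might tag it.
# Element and attribute names follow common JATS usage; treat the exact
# tag set as an assumption and check JATS4R for authoritative guidance.
ref = ET.Element("ref", id="ref17")
cit = ET.SubElement(ref, "element-citation", {"publication-type": "data"})
ET.SubElement(cit, "source").text = "Dryad Digital Repository"
ET.SubElement(cit, "data-title").text = (
    "Data from: Imbalance in individual researcher's peer review activities")
ET.SubElement(cit, "pub-id", {"pub-id-type": "doi"}).text = (
    "10.5061/dryad.36r69")

print(ET.tostring(ref, encoding="unicode"))
```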

Both steps just involve changes to production protocols – one small step for publishers, one giant leap for open science.

*These arrived via the REST API, when a publisher sends Crossref a marked-up version of the references section. There are another 6196 citations with the ‘isSupplementedBy’ relationship, but only a tiny fraction of these are datasets, and most come from a single publisher. There’s another type of citation too – event data – which also includes some data citations. See here and here for more information.

 

Tim Vines

Tim Vines is the Founder and Project Lead on DataSeer, an AI-based tool that helps authors, journals and other stakeholders with sharing research data. He's also a consultant with Origin Editorial, where he advises journals and publishers on peer review. Prior to that he founded Axios Review, an independent peer review company that helped authors find journals that wanted their paper. He was the Managing Editor for the journal Molecular Ecology for eight years, where he led their adoption of data sharing and numerous other initiatives. He has also published research papers on peer review, data sharing, and reproducibility (including one that was covered by Vanity Fair). He has a PhD in evolutionary ecology from the University of Edinburgh and now lives in Vancouver, Canada.

Discussion

20 Thoughts on "What’s Up with Data Citations?"

The challenge of having datasets cited properly is an issue that’s been a frustration of mine for a long time, so thank you for raising the issue.

At OECD, we publish books and papers (which we refer to as analytical content) and datasets. We give everything we publish a DOI and provide a ‘cite as’ tool in our platform for all our content, whether analytical or data, in an attempt to make it easy to cite stuff. (And, because we get all our DOIs from Crossref, they have a complete set of all our dataset metadata.) Yet, while authors happily go about their day citing our analytical content, they absolutely won’t cite the datasets. We can’t even get our own authors to cite their own datasets! When we suggest they might start doing so, they look at us as if we were something horrid the cat had just dragged in – changing a lifetime’s habit on how to cite stuff is clearly beyond the pale.

As you suggest, we are now considering adding data citations to reference lists during our production process for books and papers in an attempt to ‘prime the pump’, but can we add yet another time-consuming task to the ever-growing list of don’t-worry,-author,-we’ll-do-it-for-you jobs we ask of our production team? Automation might be a way, but correctly identifying which of our 350 datasets was used in a chapter or paper is going to be the challenge.

For me the central challenge is how to change author behaviour. How can we get them to realise that citing all the items they use in a work is going to help their readers by building better pathways through the ‘literature’?

Your piece raises another favourite niggle of mine: why are data sets considered ‘supplements’ to analytical content? Which came first, the data or the article? Our 350 datasets are openly available and are drawn on to provide evidence for many articles, chapters, papers – written by our authors and those around the world – they stand in their own right and need to be cited in their own right.

Tim, I’m having a hard time understanding what a data citation would look like, and specifically how authors would cite their own data from within their own paper. Can you provide an example?

Secondly, I’m confused about your argument regarding citation counts. If an author publishes a paper (A) and cites paper (B), then we can count a total of one citation from A to B. But if this author is encouraged to cite paper B and dataset B, then do we count a total of 2 cites? And what if author B splits his/her dataset into 100 separate tables, all of which are registered with Crossref. Could there be a total of 101 possible citations to that paper?

If your argument that “Citations to a publisher’s journals boost Impact Factors, and hence eventual revenue” is sound, then we have a possible problem on our hands.

Phil,
This is what an OECD data citation looks like:
OECD/FAO (2018), “OECD-FAO Agricultural Outlook (Edition 2017)”, OECD Agriculture Statistics (database), http://dx.doi.org/10.1787/d9e81f72-en (accessed on 28 May 2018).

You’ll note that we include an automatically generated “(accessed on )” element, which is important when citing a dataset that is constantly updated (and can be revised back in time, too!).

If you’re really bored, we published our ideas on data citations in a 2009 white paper (thanks to UNECE for posting it!) https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.40/2010/wp.8.e.pdf

Hi Phil – you’re not alone in struggling to picture data citations, which is one of the problems I raise above. Here’s an in-text citation for a dataset collected by someone else:

These datasets were collected using ScholarOne Manuscripts reporting system and are available on Dryad (http://datadryad.org/resource/doi:10.5061/dryad.36r69, Petchey et al. 2014b).

Here’s the same citation in the references:

17. Petchey OL, Fox JW, Haddon L. Data from: Imbalance in individual researcher’s peer review activities quantified for four British Ecological Society Journals, 2003-2010. Dryad Digital Repository. 2014b; http://dx.doi.org/10.5061/dryad.36r69. Accessed 13th April 2016.

(It would be interesting to see whether this reference made it to Crossref, but I’m unsure how to query that.)
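(One way to query it, sketched in Python: fetch the citing article’s Crossref record and scan whatever references the publisher deposited for the dataset DOI. ARTICLE_DOI below is a placeholder, not the actual citing paper’s DOI, so substitute a real one before running this.)

```python
import requests

ARTICLE_DOI = "10.xxxx/example"  # placeholder for the citing paper's DOI
DATASET_DOI = "10.5061/dryad.36r69"

# Crossref exposes publisher-deposited references under message.reference;
# if the dataset DOI is absent there, the citation never reached Crossref.
resp = requests.get(f"https://api.crossref.org/works/{ARTICLE_DOI}",
                    timeout=30)
resp.raise_for_status()
refs = resp.json()["message"].get("reference", [])
found = any(r.get("DOI", "").lower() == DATASET_DOI.lower() for r in refs)
print("Dataset DOI deposited with Crossref:", found)
```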

When it comes to authors citing their own data, it doesn’t seem realistic to ask them to have a formal citation in the references. This may be heretical, but I was thinking that the DOI in the data availability section should be sufficient to link the article with its own dataset:

Availability of data and materials
The datasets analyzed during the current study are available at Dryad (DataDryad.org), doi:10.5061/dryad.2jr8p.

I think the initial idea is that researchers just gather citations for their datasets, and their highly cited datasets will contribute towards e.g. a higher h-index. Including data citations into journal metrics would be a real motivator for everyone to take data citations seriously, although there will obviously have to be efforts to catch citation manipulation (as there are with regular citations).

Hi Toby

There seem to be several issues that your citation for Phil opens up for discussion:
a) OECD “data” come from materials that are not published in academic journals. This information, I suspect, might be considered part of the “grey” literature, as would many datasets published as part of independent research in economics or the social sciences, and those in the STEM area that appear in reports from government agencies, consultants and many others.

This, of course, is increasingly problematic as more research happens outside of what was once considered scholarly work. It falls outside of traditional “peer” review but is often accepted in the same manner.

b) Citation of the above material is often found in the secondary literature, which is often aimed not at the academic journal community but at government agencies, foundations, non-academic policy institutes and others. It often appears as reports, policy papers, and many other forms.

This flows back into the other thread on the use of preprints and other vehicles to publish research by academics or scholarly researchers who want this information accessible to others than their peer scholars, because increasingly research funding and recognition come from many other sources that use open access materials. Shaping Tomorrow is one such intelligent search site, https://shapingtomorrow.com, which searches over 100 organizations and publications across multiple subjects and whose citations are used internationally by decision makers in governments, academia, corporations, etc. In these cases the “impact factor” of academic journals loses much of its cachet. As I and others have said in the parallel thread, authors and end users want access to the articles and not the journals. As has also been mentioned, particularly for the biomed area, much of this data never gets published in the academic literature, for many reasons that we can all list. Academic journals are but one gateway to get research material to those who need such datasets and research analysis.

As several articles have pointed out, major library systems are questioning and even rejecting publishers’ “big bundles”, impact factors notwithstanding. This is happening at a time when many developing-country universities, often with weak scholarly research traditions, are fighting to be published, and those looking to publish critical datasets and research are seeking audiences that don’t necessarily limit their needs to the scholarly literature. I believe, if one looks at Elsevier’s parent RELX, it is clear that they are responding to that need, both in providing access to the materials and in seeking those who can bring relevance that is acceptable outside of the “academic” or “scholarly” journals.

I am often asked if OECD’s review process equates to the peer review process. This is what we do:

OECD data is collected from quality-assured sources (e.g. government agencies) or from fieldwork and surveys using processes and techniques common to academics and goes through a stringent quality assurance process that is overseen by a Committee comprising >100 of the world’s most significant statisticians. As for our analytical publications, our review process uses anything from two reviewers to as many as 35 (one from each of our member countries).

Having worked in scholarly publishing before coming to the OECD, I can say that I can see no qualitative difference between what is published by the OECD and what is published by, say, authors at LSE or Harvard or Paris Dauphine. Therefore I see no reason why OECD content should not be as accepted as that from any academic institution (and, judging by the demand for our content from academia, academe happily accepts our content on a daily basis).

I would also say that, for institutions like OECD, maintaining the quality of what we publish is vital because our reputation and future funding depends on it. I’m sure this is true for many institutions like ours, both ‘official’ (i.e. government owned) and not (such as World Economic Forum or Brookings or Chatham House).

As with any human endeavor, no review process is going to be perfect, whether scholarly peer review or systems like ours – but I would not agree that the scholarly peer review process is somehow a guarantor of work that is more ‘scholarly’ nor that work from non-academic sources is somehow ‘secondary’ or ‘grey’ (see below for more on that).

Yes, content from places like the OECD is aimed at a broader audience, not just scholars. But why isn’t content from academia aimed also at audiences outside academe? Many non-academics would love to access scholarly literature, if only they could (and when they can, thanks to OA, they do). After all, in OECD countries, roughly 50% of the population has been educated to first degree level – and therefore are more than able to understand much of what is published by scholars. The challenge is to ensure that scholarly content – from wherever it comes – is discoverable and accessible by anyone who needs it.

I think it’s time to retire the notion of ‘grey’ literature – a term that, as I understand it, arose to describe content that was informally published in the pre-internet era and was therefore hard to source. Informally published content, such as preprints, is now easy to source and its volume is growing rapidly – see https://osf.io/preprints/bitss/796tu/ for a recent survey. According to Google Scholar, the most highly cited source in Economics is not an academic journal but the NBER* Working Papers Series (https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=bus_economics), and four out of five of the top cited physics and maths sources are hosted by arXiv – proof that if the content is good, relevant and worth citing, it will be cited, regardless of whether it has been peer-reviewed or its host has an impact factor.

*NBER (National Bureau of Economic Research) is a private, non-profit, organization dedicated to conducting economic research and to disseminating research findings among academics, public policy makers, and business professionals.

Hi Toby

Thanks for the detailed description of the OECD review process. Both OECD and NBER are well recognized. The important point in your comment is the “wish” that more of the “data”, and even the research itself, that is locked behind paywalls be out in the open, whether via the current concession to the idea of Open Access or some emergent or yet-to-appear model, rather than available only through a pirate such as Sci-Hub or oblique routes like posting on the various versions of “arXiv” or similar.

I am just one small voice noting that, whether within the academic community such as the R1s and their clones or, as you note, at sites such as OECD or NBER, those who produce and use the knowledge want to be able to exchange it without having to burrow through or around paywalls that in many ways frustrate not just academics but those who support their work and those who are now choosing to find such alternative paths.

It would be interesting to know how those paths have grown over time and the size of that body of work. I suspect that, much as drugs find their way across US borders, the equivalent of the cartels’ ingenuity will soon be paralleled here. And, like the drive toward legalization as a way of dealing with the drug situation, some clever entrepreneur will transcend the world created by Maxwell.

Tom,
I agree that the current model is under pressure and that Sci Hub and other recent developments will accelerate change; however, I don’t think it’s right to attribute the current situation to a world created by Maxwell. Whilst he certainly rode the post-WW2 wave that created new universities, expanded the number of researchers and thus the market for scholarly publications, the roots of today’s publishing model pre-date his entrance into the market. Maxwell learned the trade from his encounter with Springer in the late 1940s, creating Pergamon in 1951 out of a failing Butterworths-Springer joint-venture that had been founded in 1948. He was forced to sell Pergamon in 1991, just before the Internet came along. So he was gone before the market went digital, before the Big Deal was invented by Academic Press, then an independent company, in 1996, and all the consolidation and changes that have followed since. He was a player, but he didn’t write the rules of the game.

(Disclosure: I worked for Pergamon Press from 1987 to 1993, when the brand was discontinued by Elsevier, leaving Elsevier in 1997 before joining OECD in 1998).

Hi Toby

Thanks for the clear description of the evolution pre/post Maxwell. The issue at hand, it seems to me, is to understand not just the conflicts being argued over paywalls and their monetary costs, but also the perceived intellectual costs to the larger community beyond academics wanting to obtain access, and the present and future size of that literature (such as the OECD and NBER output you cite) that is accepted both by academics and by researchers outside the academy.

Angela’s comment below yours is telling: “The point of sharing data is so people can use it”, which, here, I believe, argues for inclusion in the scholarly literature. But the reciprocal also needs increased consideration, so that all sectors that depend on sharing and scholarly exchange can readily communicate without having to pass through a toll booth or find a bypass or alternative to a toll road.

OA existed prior to the digital world via journal page charges. The various versions of digital copyright are another path. As I have suggested, researchers and end users seem more concerned about the ability to exchange knowledge. In many ways, the argument of journal publishers rings hollow.

As you note, the literature outside of the “toll booth” is a rutted but increasingly traveled path. In fact, Angela’s comment points to the fact that this literature should be cited in journals. That, of course, is a precarious path, since it sends readers onto that alternative route, which may increase such traffic.

Data citation is quickly becoming the standard. We just adopted it for our journals as well. Some authors were demanding we allow them to cite data, but more importantly, they want others to cite their data. The point of sharing data is so people can use it. Why would we not incorporate a way to cite it?

Having a data citation policy also encourages authors to post data with a DOI or URL and make it discoverable. Many journals allow citations to non-peer reviewed content—conference papers, industry reports, government reports, software manuals, etc.

There is another possible reason why there are so few data citations. Dryad takes in datasets associated with published articles. If the article was published FIRST (!) and the dataset deposited afterwards, there’s no citation in the paper! The author doesn’t have a DOI before the paper is published.

Yes, in theory an author should coordinate, but at the current time getting the paper published is more important than getting the data deposited. We can discuss (argue?) for days how to change the ecosystem.

Arguing about how best to connect articles with their datasets is probably an argument worth having – it’s one of the big issues with data citation.

My thought is that Data Availability sections should move out of the article and have a more independent life, so that they can be updated to link data that’s shared after the article is published back to the article.

The challenge of connecting articles and datasets is something the Scholix Framework sets out to address. Scholix is a product of WDS/RDA Working Groups that have been focused on data-article linking. See http://www.scholix.org/ for details.

For a service that implements the Scholix Framework, see ScholeXplorer from OpenAIRE at https://dliservice.research-infrastructures.eu/#/. This advertises 18 million bidirectional data-article links from publishers, data centres, DataCite, CrossRef and OpenAIRE.

For an example of an article that connects to supplementary datasets by taking advantage of ScholeXplorer (and thus indirectly the Scholix Framework), see e.g. https://doi.org/10.1016/j.poly.2018.03.024 and specifically the Research Data section at https://www.sciencedirect.com/science/article/pii/S0277538718301499#ec-research-data. The links to crystallographic data here arise because an author deposits datasets with a data repository, which ultimately includes the article DOI in the metadata associated with a DataCite DOI; the dataset-article link is harvested from the DataCite metadata store by ScholeXplorer, which the publisher can then query to expose links from the article to the datasets.
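For a sense of what that harvested metadata looks like, here is a small Python sketch against the DataCite REST API. The /dois endpoint and the relatedIdentifiers field follow DataCite’s public API; which relation types come back for any given DOI depends entirely on what the repository deposited.

```python
import requests

# A dataset's link back to its article lives in the DataCite record's
# relatedIdentifiers (e.g. relationType "IsSupplementTo"); services such
# as ScholeXplorer harvest these to build bidirectional links.
doi = "10.5061/dryad.36r69"  # example dataset DOI from earlier in the thread
rec = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
rec.raise_for_status()
attrs = rec.json()["data"]["attributes"]
for rel in attrs.get("relatedIdentifiers", []):
    print(rel.get("relationType"), "->", rel.get("relatedIdentifier"))
```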

What Scholix and similar initiatives involving publishers and data repositories can enable is bi-directional referencing (or citation?) between datasets and articles with minimal imposition on authors and no special typesetting required.

Hi Ian – many thanks for this additional information. I did agonize for a while over whether to include other parts of the data citation infrastructure (e.g. DataCite and Scholix), but ultimately decided that short and simple worked best for this post.

I started looking at data citation location in 2014 (https://doi.org/10.6084/m9.figshare.923523.v1). The problem I was trying to solve was how to give credit for data in the form of citation counts. The majority of authors do not cite data in the reference list; it’s often in the methods or even a ‘Data Availability’ section. We solved this problem using Dimensions and now have citation counts on figshare, as we near 20,000.

Hi Mark – thanks for this comment. Do you mean that there are 20K datasets now on Figshare, or are there 20K of something else? Also, could you elaborate on how Dimensions addresses the problem of data citations not reaching Crossref?

Hi Tim. Citations to files with figshare DOIs: https://scholar.google.com/scholar?as_ylo=2018&q=10.6084/m9.figshare&hl=en&as_sdt=1,5&as_vis=1 – We also provide branded DataCite DOIs for institutional clients too, so there are citations to their content as well, e.g. in this abstract: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-25-3-2818 – Now we have the links, we can help mark them up in the metadata we send DataCite, which should in theory end up at Crossref due to their collaborations. If we’re just asking Crossref to look in the reference lists for citations to data, they’ll find very little. I’m at SSP this week if anyone wants to chat more on the subject 🙂

Hi Tim — Thanks for raising the issue of data citation. I would add that there is another dependency in the scholarly ecosystem, and that is DataCite DOIs. DataCite, a non-profit organization made up of academic and research institutions, national libraries and more, holds nearly 12 million scholarly DOIs — including DOIs for data, images, software, workflows and more. All of Dryad’s nearly 100K DOIs come from DataCite.

As is pointed out in the post, the trick is to create linkages between the content. DataCite and Crossref work closely together on a service called EventData. EventData collects and exposes events that happen on the web around DOIs. Scholix, an initiative to establish an interoperability framework for exchanging information about the links between scholarly literature and data, uses EventData to aggregate data-literature link information. In the case of DataCite resources, these events regularly are links to/from DataCite to/from other scholarly resources. These links can be provided by DataCite or Crossref (a query sketch follows the list below).

— DataCite: Links from DataCite DOIs to Crossref DOIs. These are recorded in the metadata for DataCite’s Registered Content. The data is ultimately supplied by DataCite members who are the publishers and ‘owners’ of the Registered Content.
— Crossref: Links from Crossref DOIs to DataCite DOIs. These are recorded in the metadata for Crossref’s Registered Content. The data is ultimately supplied by Crossref members who are the publishers and ‘owners’ of the Registered Content.
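A sketch of what querying EventData might look like in Python; the endpoint, the ‘source’ value and the response shape are assumptions drawn from the Crossref Event Data documentation as best I can recall it, so verify them before relying on this.

```python
import requests

# Fetch a handful of events whose links originate in DataCite metadata.
# Endpoint, 'source' value, and response shape are assumptions based on
# the Crossref Event Data docs; a mailto identifying the caller is
# standard API etiquette.
resp = requests.get(
    "https://api.eventdata.crossref.org/v1/events",
    params={"mailto": "you@example.org", "source": "datacite", "rows": 5},
    timeout=60,
)
resp.raise_for_status()
for ev in resp.json()["message"]["events"]:
    print(ev["subj_id"], ev["relation_type_id"], ev["obj_id"])
```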

These activities together can move data citation forward.

Tim – an addendum to my earlier message — DataCite just published a blog post (https://doi.org/10.5438/h16y-3d72) analyzing the links between Crossref DOIs and DataCite DOIs to obtain a snapshot of the current state of data citation. We referenced your post and touched on some of your earlier work on journal data policies. Would be great to get community feedback on this.
thanks,
trisha

Hi Trisha – that’s a really great blog post, and does a much better job of explaining the issue than I could. One question that sprang to mind concerns the 850K links between Crossref and DataCite that originate with a DataCite DOI. Do these not also link the dataset to a published article?
