Editor’s note: Today’s guest post is co-authored by Brooks Hanson (American Geophysical Union), Daniella Lowenberg (California Digital Library), Patricia Cruse and Helena Cousijn (both DataCite).
Not so long ago, publishers put up with incomplete and erroneous references to the literature, to the detriment of readers and scholars. Even after papers moved online, tracking citations forward or backward involved significant hand work. References were entered by authors, re-keyed by typesetters, and checked manually (as much as time allowed) by reviewers, editors, copy editors, and proofreaders. This often involved poring over indices or, in some cases, hunting for the exact paper in volume after volume in the library or basement. The fallout was that priority was usually given to checking a journal’s own citations. Errors in author names, volumes, years, and page numbers were rampant, and they persisted as authors copied erroneous references into later publications.
It took a concerted effort to achieve accuracy, reduce the manual effort required, and improve efficiency for authors and publishers alike. In turn, linked and tagged references emerged, along with identifiers for authors (ORCID iDs for researchers) and funders (Crossref’s Funder Registry). These helped secure additional links and generated new studies about how science is done, collaboration networks, and research outcomes. The formation of Crossref for registering article DOIs, and DataCite for non-journal content, were critical steps. So was the adoption of standards around DOIs and citation download formats (e.g., the RIS [Research Information Systems] file format), which allowed authors to ingest citations into reference managers (no more typing errors and 404 errors). Open APIs to Crossref, DataCite, and PubMed, to name a few, as well as digitized backfiles, allowed publishers to quickly check the accuracy of references and, better still, rebuild them from scratch accurately, to a particular style, and to add links to the references, as well as forward and backward citation information, and more. Standard ways of tagging references were developed and adopted (thank you, JATS [Journal Article Tag Suite]), allowing easy machine recognition of reference sources (journal vs. book chapter vs. some other type).
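As an illustration of the download formats mentioned above, a journal reference exported in RIS looks roughly like this (the article, journal, and DOI below are invented for illustration):

```
TY  - JOUR
AU  - Smith, Jane
TI  - An example article title
JO  - Hypothetical Journal of Examples
VL  - 12
SP  - 345
EP  - 360
PY  - 2018
DO  - 10.xxxx/example
ER  -
```

Because each field is machine-readable, a reference manager can import the record directly, eliminating the re-keying errors described above.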
Recent efforts to open references (the Initiative for Open Citations, I4OC) aim to take this even further, and artificial intelligence and semantic tools may soon allow understanding not just of whether a reference is cited, but of how often and in which contexts, within a paper, across a discipline, and over time.
The community now has an opportunity to continue this forward progress by extending this process to link datasets functionally between repositories and journals, thereby aligning scholarship with FAIR (Findable, Accessible, Interoperable, and Reusable) principles. This would improve data discoverability and research transparency, provide much more accurate research on reuse and outcomes, and allow for data to be valued like articles and to become part of the academic reward system. Indeed, DataCite was created in part to enable research data to follow the same process as Crossref enables for the scholarly literature. This process is more challenging, however, as research data are distributed across numerous international repositories with even less standardization than existed for journal references; in fact, data have not been regularly cited in references even when they are in a repository.
The good news is that all the pieces are in place now to enable the community to move forward. Many publishers have taken the bold step of signing onto the Data Citation Principles, aimed at ensuring citations to data are included in both papers and reference lists. Repositories, too, are taking up the mantle, housing data and providing permanent identifiers and landing pages for datasets, in compliance with the FAIR principles. A new level of certification of repositories, CoreTrustSeal, supports these efforts. Many repositories are now registering DOIs through DataCite, which then enables connections between Crossref and DataCite DOIs and metadata. The Enabling FAIR Data Initiative in the Earth, space, and environmental sciences is a good example of implementing these connections across a discipline. These activities, together with citation links, are the focus of an interoperability initiative, Scholix.
A key missing piece in implementing these connections is that, while many publishers have signed onto the principle of citing data, for the most part the community is not yet benefiting from this. To help with this, a group of publishers, under the auspices of the FORCE11 Data Citation Implementation Pilot, have just published a complete “roadmap for scholarly publishers” in Scientific Data (Open Access).
This article walks through all stages in the life cycle of a paper to demonstrate which actions publishers should take at each stage. We highlight here three of the most important steps for making data count — steps that publishers can take immediately. The first is in their guidelines, the second is in their production workflow, and the third is in the metadata deposit to Crossref.
The first step, as in the Enabling FAIR data effort above, is to direct research data into appropriate repositories and include data citations, together with their persistent identifier, in publications. This can be a DOI assigned by a repository through DataCite, or another recognized identifier (such as a GenBank ID). There are now many repositories, domain-specific or general, including institutional repositories, that can house most datasets. The FAQs that publishers and repositories developed for the FAIR data effort include additional information on this. It is a step that many publishers have already taken.
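As an illustration, a data citation in a reference list typically carries the creator, year, title, repository, and persistent identifier (the authors, dataset, repository, and DOI below are hypothetical):

```
Smith, J., and R. Jones (2018), Example ocean temperature dataset,
Hypothetical Data Repository, https://doi.org/10.xxxx/example
```

The persistent identifier is the critical element: it is what allows the citation to be resolved, linked, and counted.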
However, even when datasets are cited, they are often not tagged in a way that identifies them as machine-readable data citations. This means that the effort researchers put into following best practices, and the effort publishers put into advocating or requiring that researchers cite data in references, is moot. For these citations to have value, they need to be tagged as what they are, namely data citations, so that they can be appropriately recognized and linked. The requirements for taking this second step are described in the publishers’ roadmap. As a third and final step, tagged data citations need to be included in the publishers’ metadata feed to Crossref.
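In JATS, the tagging step amounts to marking the reference as a data citation. The sketch below follows the general shape of the JATS4R data citation recommendation; the exact elements and attributes available (e.g., `<data-title>`, `assigning-authority`) depend on your JATS version, and the reference itself is invented:

```xml
<ref id="dataref1">
  <element-citation publication-type="data">
    <person-group person-group-type="author">
      <name><surname>Smith</surname> <given-names>J.</given-names></name>
    </person-group>
    <year>2018</year>
    <data-title>Example ocean temperature dataset</data-title>
    <source>Hypothetical Data Repository</source>
    <pub-id pub-id-type="doi">10.xxxx/example</pub-id>
  </element-citation>
</ref>
```

The `publication-type="data"` attribute is what lets downstream systems distinguish this reference from a journal article or book chapter citation.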
The paper describes the two options for tagging and deposit to Crossref — see the “Production” and “Downstream delivery to Crossref” sections. For more detailed information, you can also take a look at this recent Crossref blog post on the topic.
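Very roughly, the first option is to include the data citation alongside article references in the deposit’s citation list, while the second supplies it as relations metadata. The fragment below is a sketch of the relations approach only; the namespace, element names, and relationship types should be checked against Crossref’s current deposit schema documentation, and the dataset DOI is hypothetical:

```xml
<rel:program xmlns:rel="http://www.crossref.org/relations.xsd">
  <rel:related_item>
    <rel:description>Dataset underlying this article (hypothetical)</rel:description>
    <rel:inter_work_relation relationship-type="references"
                             identifier-type="doi">10.xxxx/example-dataset</rel:inter_work_relation>
  </rel:related_item>
</rel:program>
```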
When data citations are included, tagged correctly, and provided to Crossref, the links become available to the community, thereby providing a more complete view of, and increased access to, research. These data citations are included in the Crossref/DataCite Event Data service, and can then be picked up by other services via an open API made available by Crossref, as well as the DataCite API. In addition, the ScholExplorer service collects data citations for datasets that use identifiers other than a DOI. This allows any organization to discover and display the relations between articles and datasets. The connection is complete!
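To make the payoff concrete, here is a minimal sketch of how a service might extract article-to-dataset links from an Event Data-style response. The field names (`subj_id`, `obj_id`, `relation_type_id`) follow the event shape Event Data documents; the sample events themselves are invented for illustration:

```python
# Sample events shaped like Crossref Event Data records (contents invented).
SAMPLE_EVENTS = [
    {"subj_id": "https://doi.org/10.xxxx/article-1",
     "obj_id": "https://doi.org/10.yyyy/dataset-1",
     "relation_type_id": "references",
     "source_id": "crossref"},
    {"subj_id": "https://doi.org/10.xxxx/article-1",
     "obj_id": "https://doi.org/10.zzzz/other-article",
     "relation_type_id": "cites",
     "source_id": "crossref"},
]

def dataset_links(events, relation="references"):
    """Return (citing DOI, cited DOI) pairs for the given relation type."""
    return [(e["subj_id"], e["obj_id"])
            for e in events
            if e.get("relation_type_id") == relation]

print(dataset_links(SAMPLE_EVENTS))
```

In a real service the events would come from the live API rather than a hard-coded list, but the filtering logic is the same: once citations are tagged and deposited, links like these can be harvested by anyone.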
Researchers, repositories, scholars and others benefit by being able to follow links, whether they start with the data or literature. Properly tagged and deposited data references can also be displayed (and counted) at the repository level (courtesy of Make Data Count) as shown here:
(Photos courtesy of DataONE)
Adopting these standards and practices will go a long way to empowering data publishing, and to providing researchers with access to the data they need to validate conclusions and answer new research questions. If we believe data should be valued like other research outputs, we must take action to achieve this. Supporting the open data movement means providing proper support for data citations.
Help us make the magic of scholarship happen — prioritize implementing data citations for your journals!
Note, this post was updated on December 18, 2018 to correct the title of the journal Scientific Data