[Image: “Citation Needed” stickers, via Tfinc]

Making the data behind research papers publicly available remains something of a new frontier, both for publishers and for authors. As the research culture shifts more toward transparency, and as more journals and funding bodies require release of data, it is vital that the data be discoverable, to facilitate reuse, and citable, to provide credit where it is due. A recent study looking at data citation practices from 2011 to 2014 does indeed show progress, but also that we have a long way to go.

Crossref and the Digital Curation Centre (DCC) recently issued best-practice guidelines for data citation, namely that citations to datasets should appear in the References section of any paper that uses them. This contrasts with the ways journals usually cite data: “intratextually” (e.g., including a GenBank Accession Number in the text of an article) or in a separate, dedicated “data availability” section of the paper. Neither of these satisfies the new standards, which are aimed at better fulfilling the Joint Declaration of Data Citation Principles (JDDCP): “data citation should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.”

The goal of the JDDCP is to help drive data availability by raising its importance and measurability as a means of understanding researcher productivity. If you create a dataset that drives significant research forward, you should be acknowledged for that contribution. Proper citation is key to that reward, not to mention the discoverability offered by data citation metrics such as Thomson Reuters’ Data Citation Index.

The recent study, described in a blog post by author Elizabeth Hull of Dryad, showed that we are far from that goal. Only 6% of articles cited the data’s DOI in the reference list; 75% merely listed the DOI somewhere in the body of the article; and 20% had no citation of the data DOI anywhere in the article. On a positive note, things are improving: articles citing data properly in the references rose from 5% to 8% over the four years studied, and articles with no data citation at all declined from 31% to 15%. These findings indicate progress, albeit very slow progress. I suspect that as data availability becomes more common, things will improve.

Several of the journals I work with are just now implementing data policies and means of making data available, and to be honest, data citation was not on the radar of their editorial offices: writing clear policies and instructions to authors, arranging partnerships with data repositories, and working through the technologies required to make this happen took priority. But having been involved with a recent Alan Turing Institute Symposium on Reproducibility for Data Intensive Research, I came away with the need for best citation practices made abundantly clear. We have now implemented a policy to meet best practices: authors will be required to cite data as they would any other source, and in particular to include it in their article’s References.

I thought that sharing this learning experience might be helpful for others who are in the same position, just beginning to wade into the waters of connecting data to research papers. Citing all one’s sources is a useful practice, and given that data deposits receive DOIs, should present no major hurdle for journals to follow. If we’re going to the bother of making data available, we want it to be found, and we want authors to be rewarded for their efforts. Good citation practices can help make this happen.


David Crotty

David Crotty is the Editorial Director, Journals Policy for Oxford University Press. He serves on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.

Discussion

5 Thoughts on "Data Citation Standards: Progress, But Slow Progress"

Elizabeth Hull’s study was based on journal articles with associated data in the Dryad data repository. Had it been drawn from the science literature at large, I suspect the results would have been paltry.

I probably had an article among the “correct” 6% that cited the data’s DOI in the reference list, but that was just because the journal’s instructions said that was what I was supposed to do. I recall thinking, “well, that’s different.” It was different because even a few years later, many journals still call for supplemental materials to be cited in the text, and what authors actually do is a hodgepodge.

I doubt publishers, institutions, and researchers fully appreciate this issue. Certainly digital preservation and data discoverability are important, but they come at a cost: to researchers, who have to figure out how to do it; to institutions, which have to serve the data indefinitely; and to publishers, who have to facilitate it all. And then come institutional policies and the policing of data management plans for recipients of funding that carries these strings, along with dual institutional approval requirements for datasets and manuscripts. All noble and necessary, but the future beneficiaries of these data preservation and discoverability steps are not the ones paying to generate the data in the first place. In the applied sciences, every dataset is generated for a reason, and paying upfront for some intangible future benefit to someone else may be a hard pill to swallow.

I expect there will be blowback and subsequent refinement to sweeping institutional data decrees. To me, the decrees seem to envision large datasets, such as those from clinical trials, epidemiology, or the climate/earth/ocean/atmospheric sciences, that may run to millions of entries or more and truly are not interpretable by humans. In contrast, I question whether the small, heterogeneous datasets often generated in biology or ecology, for example, would be of much value in isolation from their published articles. In that case, how important are separate citations and data curation? Maybe simple supplemental information tacked onto the back of a discoverable manuscript is perfectly functional.

Good post. I hope to see SK venture into the data side more often.

Agreed, Chris. Ten years ago I did staff work for the US Interagency Working Group on Digital Data (http://itlaw.wikia.com/wiki/Interagency_Working_Group_on_Digital_Data), to begin to formulate Federal data policy. We developed the minimalist Data Management Plan concept, which was then implemented by NSF and now by most US funding agencies under the Public Access Program. There was an ongoing struggle with what I called the Utopians, who wanted data availability on a massive, expensive and burdensome scale. The Utopians are still at it.

I understand that the train is moving forward to recognize, through citation, the fact that data are being reused. I wonder whether this is fair to all the authors of the original study. I am thinking in particular of the individuals whose primary role was the design of the data-gathering instruments (for example, in the recent multi-author gravitational waves paper). Their contribution is just as strong to the paper reusing their data as to the original paper. Citation may be a fair approach for the PI of the original paper, but is it fair to all? Perhaps this was closely considered in formulating the joint declaration. Any insight you or others can share would be appreciated.

Data citation (and in fact all things data publishing) remains a challenge. At OECD, where we publish datasets by the score, we have long (since 2006) attached DOIs to datasets and even provide ready-to-download citations that are compatible with the usual reference manager tools – yet I have hardly ever seen anyone actually cite a dataset in a reference listing – not even our own authors! Thanks to the efforts of DataCite, among others, the tools are in place to cite datasets in many areas so the effort now must be to change the culture among authors so they cite a dataset as they would a book or a journal article. Since our authors won’t cite data ‘properly’ themselves, we’re aiming to prime the pump by asking our copyeditors to add in the correct data citation to a reference listing (at least to our own data) in the hope it may encourage authors to do it themselves the next time.

Data is slippery stuff to publish (when is a dataset a dataset and not a piece of supplementary information?). Where it is a supplement to a piece (as opposed to a source on which the piece is written – not quite the same thing) we do not propose to cite it. Rather, we will continue to do as we do now: stick a link to the data file under the table/chart/graphic so readers can download the data and play with it themselves. I have to say, the ability to download an Excel file of the data behind a table/chart/graphic in a publication is about the most popular feature we’ve ever created for our readers. Of course, the Excel file itself carries a link to the mother dataset (if there is one) so that readers can click through and explore more if they need to.

I’d be interested to hear from others their experience or efforts to get authors to cite datasets as I think this is the biggest challenge now.

Yes, I agree with you. Data citation is a process, and most of the time it takes a long time to complete fully. I also agree with your account of data citation practice from 2011 to 2014; it squares with what I had heard before. Overall, your article is very important.
