Citations are foundational elements of scholarly publishing. They provide evidence that current work relies on the existing literature, background on how research strategies were developed, an indication of the thoroughness of the work, and a summary of significant prior related art, as well as facilitating plagiarism detection. As an ancillary benefit, they have also created one metric against which past research is judged. Ensuring that references are accurate and complete, and that any link to the referenced object is functioning, is certainly a core publishing function.
However, none of these things has much to do with how references are presented in citations, nor with the importance of one style over another. Much of the focus on citation style is driven by domain tradition and adherence to one of the many dozens of style guides in the scholarly publishing world. I certainly don’t intend to disparage any of the different manuals, nor their adherents in any field. What concerns me is how much time is wasted in the production process editing references to adhere to the house style, whatever that style may be.
Anyone who has written a scholarly article, a book chapter, an entire book, or a bibliography understands the challenges of references. Recently, I submitted a book manuscript for a project that I am co-editing. One of the most tedious and painful aspects of pulling together the book was going through and editing the references. While I appreciate and acknowledge the importance of the references, the entirely manual process of checking and formatting the references was an incredible waste of time.
A scholarly reference is, at its most basic, a string that provides readers with sufficient information to track back to the original source of the content being referenced. That information, of course, varies with the type of publication, the source document, and the complexity of the resource being referenced. There will always be a variety of citation styles based on the type of resource, but why do we need so many different styles for the same type of resource?
The consistent formatting of names, the placement of commas and periods, and the representation of all manner of data are all style issues that should be handled by automated style sheets drawing on linked metadata. Because we are focusing on a string of text, we are inefficiently producing citation content and wasting effort on post-distribution processing. We are also not taking advantage of the potential of machine-linking this information, a critical requirement as we move further into a linked data world.
There have been many studies over the years of citations, their accuracy, and the types of errors researchers are making. These errors can be serious or minor, with minor errors being the most common. Significant errors can be so bad that discovery of the cited work is impossible. Although the percentage of such errors (in the 5-20% range across several studies) is modest, it is still troubling.
Some of the research on citation errors found inappropriate use or misrepresentation of a cited work’s conclusions. If the other identification and metadata errors in a citation are eliminated or reduced significantly in the ways discussed below, editors or reviewers might redirect their time to confirming the appropriateness of the reference. This might be a better investment of editorial resources.
Many publishers and vendors who provide copyediting or publishing services have for some time performed production transformations and validation of references. Unfortunately, the match rate of these post-production references against CrossRef and PubMed metadata is far short of perfect. This editorial process has been the focus of a great deal of process development and programming over the years. There has also been research into how to use pattern matching for names and topics, as well as how to algorithmically parse citations.
Rather than using text to identify digital resources, we should be taking advantage of the system of identifiers and metadata that exists to streamline the process of reference creation and validation. This would solve the problem at the root rather than after the fact. Instead of identifying people with just their names, we should be using ORCIDs or ISNIs to ensure disambiguation and machine-processability. Identifiers can also be used for institutions and publications. Nearly every article has (or at least should have) a persistent identifier such as a DOI. Publications have ISBNs or ISSNs. Sound and A/V formats use ISRC, ISAN, and ISMN identifiers. Special collections and archive materials have collection identifiers and accession numbers. Data sets are being tagged with DataCite DOIs or ARKs. Even concepts and other digital materials can be identified with URIs. All of these identifiers should have associated metadata that describe what that object is, be it name, title, or description.
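One practical advantage of these identifier schemes is that most carry built-in check digits, so a submission system could validate them instantly, before any metadata lookup. As a sketch (using ORCID's published ISO 7064 MOD 11-2 rule and the standard ISBN-13 weighted sum; the test values are ORCID's documented example identifier and a commonly cited ISBN-13 example):

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ORCID check digit (ISO 7064 MOD 11-2) from the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Validate a hyphenated ORCID iD such as 0000-0002-1825-0097."""
    digits = orcid.replace("-", "")
    return len(digits) == 16 and orcid_check_digit(digits[:15]) == digits[15]

def is_valid_isbn13(isbn: str) -> bool:
    """Validate an ISBN-13 using its alternating 1/3 weighted checksum."""
    digits = isbn.replace("-", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])

print(is_valid_orcid("0000-0002-1825-0097"))  # True (ORCID's example identifier)
print(is_valid_isbn13("978-0-306-40615-7"))   # True
```

A submission form that runs checks like these would catch transcription errors at the moment of entry, rather than leaving them for copyeditors to find.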
When authors are submitting references, why doesn’t the community simply accept a reference submitted like this:
<Author ID>, <Publication ID>, <Object Identifier>, <Publisher identifier>, <Date (of publication/access)>, and specific location (such as page number, etc), if necessary.
Each of these elements, aside from date and specific item location, should be represented by a unique persistent identifier. References should not be strings of text that describe the referenced content. In our increasingly digital and machine-intermediated world, production departments can pull in the related metadata through an automated process and format it in any way the publication’s editors prefer. Any outliers can be dealt with in traditional ways.
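The formatting step described above is mechanical once the metadata is structured. A minimal sketch, using a hypothetical record (the author and DOI here are CrossRef's fictional test persona, Josiah Carberry) and two invented "house styles" standing in for whatever a publication's editors actually prescribe:

```python
def format_reference(ref: dict, style: str) -> str:
    """Render resolved reference metadata in a house style.

    The two templates below are illustrative only, not real style-guide rules.
    """
    if style == "house_a":
        return f'{ref["author"]}, "{ref["title"]}," {ref["journal"]} ({ref["year"]}). {ref["doi"]}'
    if style == "house_b":
        return f'{ref["author"]} ({ref["year"]}). {ref["title"]}. {ref["journal"]}. {ref["doi"]}'
    raise ValueError(f"unknown style: {style}")

# Hypothetical metadata as it might be resolved from submitted identifiers
# (CrossRef's test DOI and fictional test author).
ref = {
    "author": "Carberry, J.",
    "title": "Toward a Unified Theory of High-Energy Metaphysics",
    "journal": "Journal of Psychoceramics",
    "year": 2008,
    "doi": "https://doi.org/10.5555/12345678",
}
print(format_reference(ref, "house_a"))
print(format_reference(ref, "house_b"))
```

The point is that once the data is structured, switching a journal's entire reference list from one style to another is a one-line change, not a copyediting pass.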
While such a change requires buy-in from the community for broad adoption, providing citation data in these structured ways would actually be easier for authors than correctly gathering and formatting all of the information in references as they do now. A large number of authors are already doing something similar by using reference managers such as Zotero, EndNote, Mendeley, or RefWorks. All of these services use structured metadata to transform easily from one style to another. Rather than taking that structured data, converting it to text streams, and then having editorial services return it back to structured data, publishers should ask for the data itself and plug it into an automated formatting process.
Making all (or at least most) of the elements of references machine-readable through persistent identifiers will, through unambiguous reference, remove the errors and confusion about the identity of the author, publication, or whatever else is being identified. A greater range of metrics and analytics becomes possible with machine-processable references. Some of this is already being done by organizations that process references for analytics, such as Thomson Reuters. Who can say what metrics or services might be possible if all the material were generally available? Additionally, citations could be scanned for appropriate or correct quote attribution. Presently, CrossRef partners to provide a plagiarism-detection service mediated through CrossCheck metadata, but there are others. A more robust system might follow, with new and different types of providers.
One might respond that the scholarly community has successfully implemented the DOI system and DOIs are regularly included in references, so isn’t that sufficient? In some ways, yes: the inclusion of DOIs does address part of the challenge. DOIs do create the functional tie to the metadata about the referenced item. However, they do not form the basis for how most references are constructed or formatted, nor do they allow the parsing of the reference. Also, DOIs provide an indirect link to information rather than a direct link to, for example, the author’s ORCID profile. This lack of direct connections creates a barrier to simple analytics and information mapping. It is certainly not an insurmountable issue, given the robust infrastructure that CrossRef provides, but traversing this linked data web directly, rather than via the CrossRef metadata, would simplify the collection of metrics for assessment purposes. Relying on DOI metadata also lacks some of the real-time validation services and potential linked data extraction from a master identity record associated with a persistent identifier for that entity, such as a person using ORCID, a book using an ISBN, or an institution using an ISNI.
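The indirect link that DOIs do provide is already machine-traversable today: the doi.org resolvers support content negotiation, returning structured citation metadata when a request asks for it via the Accept header. A sketch of how such a request is built (the DOI shown is CrossRef's test DOI; actually sending the request requires network access, so that step is left commented out):

```python
import urllib.request

def doi_metadata_request(doi: str) -> urllib.request.Request:
    """Build a content-negotiation request for a DOI's citation metadata.

    CrossRef and DataCite serve CSL JSON when a request to https://doi.org/
    asks for it in the Accept header.
    """
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )

req = doi_metadata_request("10.5555/12345678")
print(req.full_url)
print(req.get_header("Accept"))

# Sending the request (network access required) would return JSON metadata:
# import json
# with urllib.request.urlopen(req) as resp:
#     metadata = json.load(resp)
```

This is exactly the kind of lookup a direct identifier-based reference would make unnecessary at read time: the metadata would already be attached to the citation.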
PLOS hosted a hack day at its San Francisco offices earlier this month aimed at finding ways to improve the machine interoperability of citations. One idea is what PLOS is calling rich citations: essentially, using linked metadata to provide richer citation information. PLOS Labs has developed an open source bot that can automatically collect rich citation information. While these might be interesting services, would they be necessary if the power of identifiers, instead of text, were already built into references?
To implement these suggestions would require a cultural shift in the current publication process. Authors will need to view much of their content not as narrative, but rather as structured information and data. Citations are a good place to start, since much of this content has existing identifiers that can be assigned and used to replace the textual format with a structured one. Let’s leverage the existing systems and tools to redesign how we manage citations, stop wasting our time on reformatting, and add the increased functionality that machine-processable structures can provide.