Guest Post - JATS4R: Optimizing the Reusability of Scholarly Content

Editor’s Note: Today’s post comes from Mary Seligy, who is a Business Analyst with Canadian Science Publishing, an independent, not-for-profit scholarly publisher.

Last year at the 2016 SSP conference, I attended a pre-conference day intriguingly titled “Machines as the new readers”. As someone with an unabashed fondness for the Terminator series of movies, I instantly pictured Arnold Schwarzenegger’s mercury-red eye lasering through reams of scholarly articles with cold and calculated efficiency. And, while the pre-conference wasn’t quite as dramatic (or, thankfully, as terminal), it proved to be one of the more interesting sessions, highlighting themes that reverberated throughout the following days of the conference.

Image caption: I’ll be back…for your properly coded XML tags. Image courtesy of Garry Knight.

There was a lot of talk about standards, specifically the idea that a standard is a kind of shared trust. Standards require confidence in a common understanding about the meaning and implementation of collectively defined norms. When you can’t trust something, you must stop and check it, sometimes many times over. For us humans, this lack of trust saps our efficiency and wears on our patience (and nerves).

While humans have some capacity for grey areas, machines — at least those employed in the shared systems that make up the scholarly publishing infrastructure today — have little tolerance for the ambiguous; they require consistent and predictable inputs to handle content efficiently, never mind effectively. For example, it is impossible for machines to consistently and accurately retrieve an author’s name, when they have published under different name variations (initials or none, all given names or not, etc.). This is precisely the scenario that ORCID is addressing with standard author identifiers. If the author has verified his or her ORCID iD, then everyone (and everything) else can trust in that number to represent that author. We can trust it and can move on.
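To make the point concrete, here is a minimal Python sketch of why a machine struggles with name strings but not with identifiers. Everything in it is invented for illustration: the author, the paper titles, the record layout, and the ORCID iD (a placeholder, not a real person's identifier).

```python
from collections import defaultdict

# Hypothetical records: one invented author published under three name variants.
# The ORCID iD is a made-up placeholder used only for this illustration.
records = [
    {"name": "Jane Q. Doe", "orcid": "0000-0000-0000-0001", "title": "Paper A"},
    {"name": "J. Doe", "orcid": "0000-0000-0000-0001", "title": "Paper B"},
    {"name": "Doe, Jane", "orcid": "0000-0000-0000-0001", "title": "Paper C"},
]


def group_by(items, key):
    """Group paper titles by the value found under `key` in each record."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item["title"])
    return dict(groups)


by_name = group_by(records, "name")    # three separate "authors"
by_orcid = group_by(records, "orcid")  # one author, three papers

print(len(by_name), len(by_orcid))
```

Grouping by name splits one person into three; grouping by the identifier keeps the publication list together without any guessing.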

Consider also that publishers are not in the content-creation business, but are instead in the business of providing trustworthy assertions about the content in their care, assertions of provenance, authority, ownership, funding, and so on. These are not only appreciated and expected by humans, but are increasingly required by the computerized gatekeepers throughout the scholarly publishing infrastructure. Therefore, standards — and their less-formal sibling, best practices — can be considered to underwrite the assertions that we, as publishers, hope will instill trust in our users and enable content to pass seamlessly and effectively from author to audience.

While I listened intently to all of these interesting and worthy ideas about machines and standards and trust, I wasn’t only listening as an employee of a scholarly publisher (which I am), but also as an active participant in JATS4R. The principal mission of JATS4R (JATS for Reuse) is to optimize the reusability of scholarly content tagged in JATS XML, by developing XML tagging best practices. JATS4R is an actively growing international and inclusive working group comprising publishers, typesetters/compositors, delivery-platform folk, archivists, persistent-identifier people, and other assorted stakeholders — all united in JATS4R’s vision of a world in which scholarly content travels seamlessly to its users, facilitated by its reusability.


Now, “reusability” is a term that sounds like something good, like something solid and serviceable, but perhaps a bit abstract. In the context of JATS4R’s work, we are concerned with the ability of machines to “reuse” content for exchange, storage, retrieval, and sharing throughout the scholarly publishing infrastructure after it has first been “used” in the act of publishing, typically on a website.

Dear reader: If you have started to think about clicking away from this page at the mention of ‘XML’ and ‘JATS’, technical-sounding terms which are the province of a vendor or IT folk, I urge you to bear with me — unless you do not care about cost savings, improved efficiency, enhanced discovery, better tools, and many other needs that our businesses share.

Plainly put, XML (eXtensible Markup Language) is a structured system of labeled containers for content. Virtually all machines can be made to understand XML and, in essence, our content. The people who know the content best in most organizations typically aren’t technical people or the vendors, but the editorial folk who shepherd that content and make decisions about it every day. That’s why the secondary mission of JATS4R is to develop resources to help everyone in scholarly publishing have a better understanding of what XML is, so that we can all make good business decisions where XML is concerned. (You can read more about what XML is and how it works in scholarly publishing here.)
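For readers who would like to see the "labeled containers" idea in action, here is a small sketch using Python's standard xml.etree.ElementTree module. The fragment is invented and deliberately simplified; it is not a valid JATS article, only an illustration of how a program asks for content by its label.

```python
import xml.etree.ElementTree as ET

# An invented, simplified fragment (not a complete or valid JATS document).
xml_text = """
<article>
  <title>Machines as the new readers</title>
  <author>
    <surname>Doe</surname>
    <given-names>Jane</given-names>
  </author>
</article>
""".strip()

root = ET.fromstring(xml_text)

# Each piece of content is retrieved by the name of its container.
title = root.findtext("title")
surname = root.findtext("author/surname")

print(title, "/", surname)
```

The program never has to guess which words are the title and which are the author's surname, because the containers say so.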

As for JATS, it is an actual standard published by the National Information Standards Organization (NISO). Formerly known as the NLM (National Library of Medicine) DTD and maintained by the NCBI (aka the group behind PubMed), it was renamed the Journal Article Tag Suite (JATS) in 2012, when it officially became NISO standard Z39.96-2012. Version 1.1 was released in 2015 as NISO Z39.96-2015.

Alert readers may be wondering at this point why JATS4R, an organization devoted to developing best-practices for tagging content with JATS XML, even needs to exist, given that JATS is already a standard. Good question.

JATS started life back in the early 2000s, a heady time when scholarly publishing organizations everywhere were excited about XML, a new specification that promised to blow the lid off dissemination and discovery. Among other things, XML promised to enable full-text journal content to be described, indexed, and searched online. Back then, everyone was getting into the game by making up their own flavor of XML, which is easy to do, as the standard allows you to create custom containers and rules for structuring the containers. The problem, of course, was that the resulting flavors of XML were all too different to allow any practical or cost-effective way to exchange that content among systems outside a publishing house. In turn, every provider publishing unique XML made archiving and other key processes in the scholarly journal lifecycle expensive or difficult, if not completely inaccessible.

Enter JATS, which effectively gave the XML Tower of Babel a common language. Through its flexible design, JATS made it relatively easy for most organizations to adopt and produce XML designed for journal article content, with the result that we now had a solid foundation for content exchange.

A lot has happened since the early 2000s; the internet has exploded in usage and the scholarly publishing universe has been expanding in kind. Whereas hosting digitized content on a publisher’s own website was once a significant victory for its content’s dissemination and discovery, today this is merely table stakes. Today, a journal article’s machine-mediated path to its users is multifarious, and much of its potential for discovery and accessibility hinges upon the XML in which it is encoded.

And, if the number of places an article might travel weren’t enough to contend with, consider the forms it might take! We publishers often myopically think only of the PDF or web page instances of an article. But an article might be made available in JSON for text and data mining, in RDF for semantic applications, in EPUB for e-readers, and in many other possible formats and applications. No matter the eventual format, an article typically starts its journey as full-text XML, often in JATS.

What all of this means in practical terms is that while JATS’ flexibility has been extremely successful in terms of enabling widespread adoption, it is often too flexible for the huge variety of systems and applications available for journal content. There are many completely valid ways to capture the same article object (such as authors or keywords) in JATS, which makes it hard to trust that XML inputs will be consistent, and therefore harder to program computers to handle our content.
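As a rough illustration of the cost of that flexibility, the sketch below shows the same invented author tagged two ways, once with a structured name and once with a single string-name, and the branching a program needs just to read a name from either. The helper function is hypothetical, written for this post, and covers only these two variants; real-world content has more.

```python
import xml.etree.ElementTree as ET

# Variant A: the name split into labeled parts.
variant_a = """<contrib contrib-type="author">
  <name><surname>Doe</surname><given-names>Jane</given-names></name>
</contrib>"""

# Variant B: the name as one undivided string.
variant_b = """<contrib contrib-type="author">
  <string-name>Jane Doe</string-name>
</contrib>"""


def author_display_name(contrib_xml):
    """Hypothetical helper: return 'Given Surname' for either variant above."""
    contrib = ET.fromstring(contrib_xml)

    name = contrib.find("name")
    if name is not None:
        given = name.findtext("given-names", default="")
        surname = name.findtext("surname", default="")
        return f"{given} {surname}".strip()

    string_name = contrib.find("string-name")
    if string_name is not None:
        # Collect all text, including any nested child elements.
        return "".join(string_name.itertext()).strip()

    return None  # some other tagging choice this sketch does not handle


print(author_display_name(variant_a))
print(author_display_name(variant_b))
```

Every additional tagging variant means another branch like these, in every system that touches the content. An agreed best practice lets tool builders write, test, and trust one code path.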

For example, the Open Access Media Importer (OAMI) was the original use case for JATS4R. The OAMI is an automated bot developed to harvest OA content and upload it to Wikimedia Commons, the media repository used by Wikipedia. But even though all of the articles came from the PubMed Central Open Access subset (and were therefore all tagged in JATS XML), inconsistencies in the article XML threw a wrench in the works. Variations in tagging for keywords and other article parts complicated large-scale reuse by the OAMI, and unclear licensing statements forced the developers to implement text-mining-like algorithms to determine whether specific content was compatible with reuse on Wikimedia Commons.
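The licensing problem can be sketched in a few lines of Python. This is not the OAMI's actual code; it is a simplified illustration with invented sample XML and a hypothetical function, showing the difference between reading a license URL from an attribute and guessing a license from prose.

```python
import re
import xml.etree.ElementTree as ET

# Fully qualified name of the xlink:href attribute as ElementTree reports it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Invented sample: the license is stated as a machine-readable URL.
explicit = """<permissions xmlns:xlink="http://www.w3.org/1999/xlink">
  <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
    <license-p>This is an open access article.</license-p>
  </license>
</permissions>"""

# Invented sample: the license exists only as a sentence for humans.
prose_only = """<permissions>
  <license>
    <license-p>Distributed under the terms of the Creative Commons Attribution License.</license-p>
  </license>
</permissions>"""


def find_license(permissions_xml):
    """Hypothetical helper: return (value, how_it_was_found)."""
    license_el = ET.fromstring(permissions_xml).find("license")
    if license_el is None:
        return (None, "missing")

    href = license_el.get(XLINK_HREF)
    if href:
        return (href, "explicit")  # trustworthy: no interpretation needed

    # Fallback: crude pattern matching on the human-readable statement.
    text = " ".join("".join(license_el.itertext()).split())
    if re.search(r"creative commons attribution", text, re.IGNORECASE):
        return (text, "guessed-from-prose")
    return (None, "unknown")


print(find_license(explicit))
print(find_license(prose_only)[1])
```

The second case is where the trouble lies: the prose does not say which version of the license applies, or whether additional restrictions follow, so any answer the program gives is a guess that someone must check.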

Now, JATS itself remains purposefully non-prescriptive when it comes to how to tag things in XML, and for good reason. Its mandate is to maintain a specification that is flexible enough for continued adoption. Nevertheless, every year, participants in the annual JATS-Con users’ conference ask the JATS authors for tagging guidance, because it can be time-consuming, frustrating, and sometimes expensive to make XML tagging decisions in JATS XML — especially if they turn out to be impractical for some reason and must be reconsidered.

So, shortly after a paper was delivered by the OAMI developers at JATS-Con in 2013, JATS4R answered a call to action by the community. Since content reusability clearly affects all content, a collaborative effort was needed to develop best practices for tagging XML in JATS. JATS4R was formed and began recruiting those who shared the vision of a world where everyone gains efficiency when judicious, common tagging practices have been determined and cost-effective, efficient systems and tools can be built around expected inputs.

I myself found my way to JATS4R on a subway train. I had read about JATS4R on the internet, after searching for the best way to capture a certain structure in the XML for our new journal FACETS. But it wasn’t until I was walking to catch the subway on the first day of JATS-Con back in 2015 that I bumped into Melissa Harrison, Head of Production Operations at eLife and the Chair of the newly formed JATS4R working group. Through that discussion, I began to truly appreciate the larger vision of work that could benefit everyone in the scholarly publishing ecosystem by removing barriers to scholarly content interoperability through JATS XML tagging best practices. By the time we’d arrived at our stop for JATS-Con, I was hooked.

Since that time, JATS4R has developed best-practice XML tagging recommendations for various article objects, beyond the original use case of licensing and other aspects of permissions. The rebuilt website houses the recommendations and also the fledgling XML Learning Center, which is intended to be a repository of resources on various XML topics so that everyone can learn more, regardless of their level of technical understanding. The site also contains information on how organizations can participate (and why they should), because best and common practices only work if everyone (or at least a majority) follows them.

While developing recommendations for using the current JATS specification is the principal job of JATS4R, we increasingly find ourselves in an advocacy-like position, presenting use cases to the JATS standing committee for article objects that currently have no specific or clear way to be tagged in JATS without committing “tag abuse” (a term that applies to using XML tags in a way that is counter to their intended meaning, such as using a tag meant for an author to capture a copyright statement). These are things that have arisen largely out of the possibilities that the digital age affords, such as a publisher’s ability to continually upload new versions of an article, giving rise to the need to capture an article’s version in the XML. In such situations, the discussions around recommendations can be nothing short of Talmudic, as established concepts, such as the version of record, become challenged by new concepts, such as the Best Available Version.

JATS4R’s call to the scholarly community at large is to participate however people can, because the work it undertakes affects everyone. For that reason, it recently reorganized how it develops its recommendations, creating a structure that allows several subgroups to work on recommendations simultaneously and thereby includes as many people as possible who are interested in a given article part, such as authors and affiliations. If you would like to participate in JATS4R, please do; JATS4R is entirely run by volunteers, and we welcome all comers interested in content reusability.
