The title of this post may seem like a far-fetched claim, but no one can deny that we are currently faced with increasingly critical challenges — climate crisis, shrinking biodiversity, hunger, poverty, disease, and more. I think most of us would agree this means it’s essential for the research findings that could help address these challenges to be shared as quickly and widely as possible — and for the data behind those findings to be FAIR (findable, accessible, interoperable, and reusable). And that means…metadata!
As a community, we have a collective responsibility for sharing research outputs, including their metadata. That’s why Metadata 2020 is so timely and important (disclaimer: I am co-chair of their Researcher Communications project group). This community-led initiative aims to improve metadata in order to enhance discoverability, encourage new services, create efficiencies, and — ultimately — accelerate scholarly research. Lofty goals, to be sure! Which means that to succeed in achieving them we need the support of everyone who is involved in creating, curating, and consuming metadata.
Per the FAIR principles, “Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.” Building on this, the Metadata 2020 project group on Best Practices and Principles has developed a set of draft principles, which were recently released for community comment. They state that for metadata to support the community, they should be:
- COMPATIBLE. Provide a guide to content for machines and people. So, metadata must be as open, interoperable, parsable, machine-actionable, and human-readable as possible
- COMPLETE. Reflect the content, components, and relationships as published. So, metadata must be as complete and comprehensive as possible
- CREDIBLE. Enable content discoverability and longevity. So, metadata must be of clear provenance, trustworthy, and accurate
- CURATED. Reflect updates and new elements. So, metadata must be maintained over time
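To make the machine-readability point concrete, here is a minimal sketch of what compatible, machine-actionable metadata can look like: a dataset description expressed as JSON-LD using schema.org's Dataset vocabulary (one common convention for FAIR data discovery). The title, DOI, and ORCID iD below are placeholders, not real records.

```python
import json

# A minimal, hypothetical dataset record expressed as JSON-LD
# using schema.org's Dataset vocabulary. All identifiers are placeholders.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example field measurements",               # hypothetical title
    "identifier": "https://doi.org/10.1234/example",    # hypothetical DOI
    "creator": {
        "@type": "Person",
        "name": "A. Researcher",
        "@id": "https://orcid.org/0000-0000-0000-0000", # placeholder iD
    },
    "keywords": ["biodiversity", "field study"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Serialized as JSON, the same record is readable by humans and
# parsable by harvesters and search engines alike.
print(json.dumps(record, indent=2))
```

Because the record is plain structured data, it satisfies the "compatible" principle almost for free: the same file serves a human skimming it and a crawler indexing it.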
Those who provide descriptive information (metadata) about research and scholarly outputs
Some of the best creators of metadata are, of course, the people who created the content to which it refers, i.e., the researchers. They know their work better than anyone else, so are best placed to decide on keywords, something which many (perhaps most?) journals now routinely request or require (although these are often limited in number and provide no context for the article). They also know most about how, when, and where their work was carried out; who else was involved; what resources were used (equipment, artifacts, etc); and more — all information that helps ensure research is reproducible. However, as Daniel Hook pointed out in his presentation at Academic Publishing in Europe 2019, this information is much less likely to be collected — if at all. And yet, as he also noted, if Google, Apple, and other services can tell us when and where we used our phones, surely it should be relatively easy to collect the same data for research work, for example, via electronic lab notebooks.
Researchers aren’t the only stakeholders who create metadata. Publishers also contribute, by adding persistent identifiers (PIDs) for authors, reviewers, funders, and other organizations. In the future, grant identifiers could also be included as universally searchable fields (they are currently available in PubMed), as well as identifiers for research resources, such as laboratory equipment or special collections. In addition, publishers typically provide some structural metadata (page numbers, type of work, etc.), and they work with third-party services such as Crossref and DataCite to register DOIs (Digital Object Identifiers) for their content.
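One payoff of that registration step is that the metadata becomes openly queryable: Crossref exposes registered records through its public REST API at `api.crossref.org/works/{doi}`. As a small sketch (the DOI below is a placeholder, and no network request is made here), this is how a consumer would construct the lookup URL:

```python
from urllib.parse import quote

# Crossref's public REST API serves the metadata record for any
# registered DOI at /works/{doi}. No request is made in this sketch.
CROSSREF_API = "https://api.crossref.org/works/"

def crossref_metadata_url(doi: str) -> str:
    """Return the Crossref REST API URL for a DOI's metadata record."""
    # Percent-encode the DOI but keep its internal slash intact.
    return CROSSREF_API + quote(doi, safe="/")

url = crossref_metadata_url("10.1234/example.doi")  # hypothetical DOI
print(url)  # https://api.crossref.org/works/10.1234/example.doi
```

Fetching that URL returns a JSON record containing the title, contributors, funder IDs, and other descriptive fields the publisher deposited, which is exactly the kind of downstream reuse that makes depositing rich metadata worthwhile.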
Those who classify, normalize, and standardize this descriptive information to increase its value as a resource
Collecting the data that makes up metadata is only part of the solution. Making it meaningful is just as important, and that includes making it consistent. PIDs play a role again here, as do taxonomies and standards, though neither are easy or straightforward. Take taxonomies, for example. The CRediT taxonomy, developed by CASRAI to help assign credit for different researcher roles, has been around for several years, but unfortunately has not yet been widely adopted. This may be at least in part because it isn’t applicable to all disciplines, but also because the role definitions themselves vary by discipline, making them hard to apply consistently.
There are similar challenges around keywords. While some fields have well-established taxonomies, others do not; and even where taxonomies exist, they may not be embedded in research workflow processes. And if a keyword list contains both ‘mouse’ and ‘human’, how is a reader (or a machine) to know which is the subject of the study?
Standards are equally challenging. As the presenters in a recent SSP panel entitled Achieving Standards Nirvana: Why Industry Interoperability Isn’t Easy, but Is Worth the Effort showed, “It’s not all rainbows and unicorns!” Getting all stakeholders to agree on even the most basic elements of a standard can take years; getting that standard widely adopted can be an even longer process. And yet doing so could significantly improve metadata quality — and maybe even speed up the dissemination of knowledge. For example, NISO has recently expanded the JATS data model to include a Non-Monetary Support section, which will enable PIDs for research resources to be included in the metadata for the resulting publications.
Those who store and maintain this descriptive information and make it available for consumers
Creation and curation of metadata are two key steps in the scholarly research pipeline. Once metadata has its home in databases, repositories, and catalogs, it comes under the watch of custodians — libraries, archives, repositories, library service providers, and others. They are tasked with keeping various pieces of information current, accessible, and discoverable, along with metadata collectors (e.g., Crossref, ORCID, CHORUS), who also maintain descriptive information for users. Custodians and curators must, therefore, work in tandem.
My Researcher Communications group co-chair, Michelle Urberg, Metadata Librarian at ProQuest, shared this example:
The 360 KnowledgeBase (360 KB), which is maintained by metadata librarians at ProQuest/ExLibris, is the work of more than a decade of cleaning and ingesting publisher metadata, collecting updates from publishers, and re-ingesting the newly cleaned metadata. This time-intensive work involves checking that pieces of information such as title, author, edition, date ranges, and URLs are correct every time the information gets reloaded into the 360 KB. Without the curation of the 360 KB team, access points in this metadata degrade quickly, which in turn affects the experience of librarians and library users relying on this knowledge base to facilitate scholarly research.
The bottom line for any metadata custodian is that their systems must be set up to ingest metadata correctly, distribute it efficiently, and ensure that any changes they make don’t affect its quality or inhibit its use by consumers.
Those who knowingly or unknowingly use the descriptive information to find, discover, connect, and cite research and scholarly outputs
Whether we are using metadata as individuals or consuming it via artificial intelligence, we have a collective responsibility to ensure that it’s the best it can be. This requires us to develop better ways of adding missing information and correcting incorrect metadata. For example, at my organization, ORCID, between 250 and 300 users contact us each year because an article they didn’t write has been connected to their ORCID record in error. This happens for one or both of the following reasons — first, the journal or publisher in question didn’t require their authors to authenticate their iD during the publishing process; and second, they don’t have an automated process for moving publication metadata from their submission system to the production system and through to indexing and abstracting services. Resolving these misattributions takes substantial effort on the part of the author, ORCID, Crossref, and the journal. It is a time-consuming and often frustrating process for the author, and a strain on our and Crossref’s resources. Using APIs to collect authenticated ORCID iDs would significantly improve the situation, and this is something we are currently working to simplify.
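For the curious, authenticated iD collection rests on ORCID's standard OAuth flow: the submission system sends the author to ORCID to sign in, and receives back an iD the author has actually verified, rather than one typed (or mistyped) into a form. The sketch below builds the first-step authorization URL; the `client_id` and `redirect_uri` values are placeholders, not real credentials.

```python
from urllib.parse import urlencode

# First step of ORCID's OAuth flow: redirect the author to ORCID to
# sign in, so the system receives an *authenticated* iD. The client_id
# and redirect_uri below are hypothetical placeholders.
AUTHORIZE_ENDPOINT = "https://orcid.org/oauth/authorize"

def build_authorize_url(client_id: str, redirect_uri: str) -> str:
    """Construct the URL that sends an author to ORCID to authenticate."""
    params = {
        "client_id": client_id,
        "response_type": "code",
        "scope": "/authenticate",  # minimal scope: just the verified iD
        "redirect_uri": redirect_uri,
    }
    return AUTHORIZE_ENDPOINT + "?" + urlencode(params)

url = build_authorize_url("APP-XXXXXXXX", "https://journal.example/orcid/callback")
print(url)
```

After the author signs in, ORCID redirects back with a one-time code that the system exchanges for the authenticated iD, so the misattribution problem described above never arises for that submission.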
And of course ORCID iDs are just one small component of a publication’s metadata! We need to find ways to encourage our community to improve all metadata, both by making it easier to do so and by increasing their understanding of why this is important.
Metadata 2020’s work includes getting a better understanding of what librarians, publishers, researchers, and repository managers already know about metadata. The Researcher Communications project group is exploring how to increase the impact and consistency of communications with researchers about metadata, while other Metadata 2020 groups are addressing this from the perspective of librarians, publishers, and repository managers. We have therefore recently launched a survey to help test our assumptions and gain a better understanding of metadata knowledge, attitudes, and usage in scholarly communications. If you’re a member of any of those groups, I strongly encourage you to take the survey — it shouldn’t take more than 15 minutes — and to share the link with others in your community. We will publish the results later this year in a peer-reviewed journal (and, most likely, an update here as well!).
Thanks to my Metadata 2020 colleagues, Kathryn Kaiser, Rachael Lammey, Laura Paglione, and Michelle Urberg, for their help with this post.