When Roger Schonfeld codified the “stumbling blocks” experienced by users of online academic resources, those of us managing the channels of scholarly content search and retrieval pinned blame squarely on the underlying metadata — but which data points exactly? Was it something in the DOI record? An error in the full-text markup? A mismatch between either of those metadata assets and the link resolver? It is nearly impossible to determine which metadata was at fault for undermining research experiences, or who owned its remediation, especially without a complete view of the ecosystem.
When we peel back the layers of metadata construction and transmission to examine points of friction in the research workflow, we get a glimpse into the messy supply chain of publishers, libraries, and service providers. Together, this network of stakeholders and experts manages high-volume pipelines of bibliographic data, persistent identifiers, controlled vocabularies, and terabytes of XML, KBART, and MARC files that surround digital scholarly outputs. When you closely examine instances of poor user experience, like a broken link in a paper’s cited references, multiple pieces of information could be to blame. A previously correct URL can be rendered useless if content switches domains, such as during a platform migration or rebranding. In aggregator platforms, like JSTOR, Scopus, or ProQuest, if the publication date or page numbers are incorrect for a journal article, an OpenURL link can fail and send users to the platform homepage instead of the article, or produce a 404 error message.
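To make that last failure mode concrete, here is a minimal Python sketch of how an OpenURL query might be assembled from citation metadata and matched on the target side. The resolver base URL, the `TARGET_INDEX` lookup table, and the `build_openurl` and `resolve` functions are hypothetical stand-ins rather than any vendor's implementation; only the `rft.*` key names follow the OpenURL (Z39.88-2004) key/encoded-value journal format.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical resolver and platform URLs, used for illustration only.
RESOLVER_BASE = "https://resolver.example.org/openurl"
PLATFORM_HOME = "https://platform.example.org/"

# Hypothetical target-side index:
# (ISSN, volume, issue, start page, year) -> article URL.
TARGET_INDEX = {
    ("1234-5678", "10", "2", "101", "2020"): "https://platform.example.org/article/42",
}


def build_openurl(citation: dict) -> str:
    """Encode citation metadata as an OpenURL 1.0 key/encoded-value query."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.issn": citation["issn"],
        "rft.volume": citation["volume"],
        "rft.issue": citation["issue"],
        "rft.spage": citation["spage"],
        "rft.date": citation["date"],
    }
    return f"{RESOLVER_BASE}?{urlencode(params)}"


def resolve(openurl: str) -> str:
    """Return the article URL on an exact match, else fall back to the homepage."""
    query = parse_qs(urlparse(openurl).query)
    key = tuple(query[k][0] for k in ("rft.issn", "rft.volume", "rft.issue", "rft.spage"))
    key += (query["rft.date"][0][:4],)  # compare on the year only
    return TARGET_INDEX.get(key, PLATFORM_HOME)


good = {"issn": "1234-5678", "volume": "10", "issue": "2", "spage": "101", "date": "2020"}
bad = dict(good, spage="110")  # a single wrong page number in the source metadata

print(resolve(build_openurl(good)))  # the article URL
print(resolve(build_openurl(bad)))   # the platform homepage
```

In this sketch the matching is deliberately strict, so one incorrect field is enough to lose the article. Real link resolvers typically apply fallbacks and fuzzy matching, which is part of why the responsible metadata element is so hard to pin down after the fact.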
Cases like this have led to uneven progress toward meaningful changes in metadata that produce measurable benefits to researchers, students, faculty, and librarians. While proof of metadata’s impact on end-users may lurk in proprietary data, such as the 90% increase in discoverability attributed to investment in semantic technology, we lack a shared framework to measure our collective returns on metadata maintenance and enrichment.
In an effort to unearth opportunities to measure metadata ROI, we recently spent time looking for what evidence can be found in available studies and datasets. What can the literature tell us about links between good metadata and good information experiences in scholarly research?
Making the business case
For as long as we can remember, our industry has accepted the fact that good metadata does good things for scholarship, or, at the very least, that metadata management is an accepted cost of doing business in today’s digital publishing ecosystem. A majority of NISO initiatives are focused on metadata. Metadata 2020 was a clever program dedicated to the improvement of scholarly metadata’s quality and impact, believing that enriched, interconnected, and reusable metadata fuels discoverability and innovation in scholarly research.
For book publishers, product and content metadata has long been established as key to success in online discovery and sales. In part, metadata-driven search performance has become a key performance indicator for most marketing departments. Academic librarians have established the ways publisher metadata fuels the engines of resource discovery and access. Users’ ability to find and make good use of content is a key metric of success in both collection development and publisher sales.
Therefore, many scholarly publishers have subsidized both internal and collective measures to improve the metadata pipelines. Metadata strategy is a component of some publisher sales strategies and can be a key step on the road to digital transformation. Discoverability specialists employed by most major publishers (e.g., the content discovery representatives listed by NISO’s Open Discovery Initiative) reflect these investments in content metadata and content providers’ focus on metadata’s impact on research workflows.
But how do we define concrete ROI for these metadata investments? Where exactly is the proof that metadata makes a measurable difference to the lives of researchers and the life cycles of their work?
Metadata impacts on research workflows
The literature reveals a strong connection between metadata and content findability, or the degree to which content is retrievable from a database or search engine. Specifically, studies show a positive correlation between search success and accurate as well as open metadata. When it comes to search, both the architecture of the information and its display factor into positive researcher experiences.
The value of precise, usable metadata goes beyond search engine performance, however. Analysts have demonstrated how accurate and accessible metadata is critical to addressing research data sharing and reuse in some fields of study. The importance of metadata accuracy is underscored by case studies, such as how Covid-19 metadata errors undermined research analysis and efficiency.
Where scholars have endeavored to develop a framework for judging metadata quality, such as a survey of public health and epidemiology researchers, metadata accuracy and accessibility rank higher than other indicators. Open and accurate metadata fuels robust literature analysis and systematic reviews that drive scholarly research. Accessible metadata is also key to serving open science initiatives, for example, where semantically rich data is called for to address today’s pressing research needs. The priority for accurate metadata becomes clear when considering a need for reliable, consistent altmetrics, which do not currently have an industry-standard formula.
OK, we know that metadata must be accurate, accessible, and relevant to be of value to the research workflow. What else can we glean from existing research that links specific metadata elements with tangible benefits to those working with scholarly communications?
Linking metadata elements and user experience
Studies have shown that content discovery and delivery are made possible by industry-standard metadata elements, such as the records associated with DOIs and other persistent identifiers. Within those records, the highest-value attributes are those that enable disambiguation and help contextualize a resource.
By analyzing search keywords and the metadata tags containing those terms, studies have demonstrated how the title and description attributes were the most influential in successfully driving users to full-text articles. This resonates with the evidence linking discoverability with semantic enrichment, such as the use of ontologies in scholarly metadata, where epidemiological research once again offers a valuable use case. Semantic tools also offer humanities scholars efficiency, as well as opening up new lines of inquiry and opportunities to scale expert crowdsourcing of enriched metadata. The folks at Scite see citation metadata as a key to improving reading experiences, in particular, to help researchers draw connections between concepts or studies.
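The keyword-to-field analysis mentioned at the start of the preceding paragraph can be sketched roughly as follows. This is an illustration only: the sample records, the queries, and the `field_hit_counts` function are invented here, and the cited studies used their own data and methods.

```python
from collections import Counter

# Invented sample records; a real analysis would draw on platform
# metadata and search logs.
RECORDS = [
    {"title": "Metadata quality in epidemiology",
     "description": "A survey of accuracy and accessibility indicators",
     "keywords": "public health; data sharing"},
    {"title": "Linking failures in OpenURL",
     "description": "How enumeration metadata breaks link resolvers",
     "keywords": "serials; discovery"},
]


def field_hit_counts(queries: list[str], records: list[dict]) -> Counter:
    """Count, per metadata field, how many (query term, record) pairs match.

    Matching is a crude case-insensitive substring test, chosen for brevity.
    """
    hits = Counter()
    for query in queries:
        for term in query.lower().split():
            for record in records:
                for field, value in record.items():
                    if term in value.lower():
                        hits[field] += 1
    return hits


counts = field_hit_counts(["metadata accuracy", "link resolvers"], RECORDS)
for field, n in counts.most_common():
    print(field, n)
```

With this toy data the description field collects the most matches, followed by the title, and the keywords field collects none. The numbers carry no evidential weight; the sketch only shows the shape of the measurement.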
Several studies have connected users’ ability to retrieve the full text of open-access articles to publishers’ use of accurate licensing metadata (see for example a study on hybrid journal metadata). This suggests a promising future for the new ALI recommendations. The team at More Brains has done some fine work measuring the financial impacts of data reuse enabled by persistent identifiers. They found that when ORCIDs and DOIs are used throughout the research lifecycle, starting with grant funding, the administrative time saved for researchers equates to cost savings for their institutions.
When we pair these evidentiary threads together, metadata elements, standards, and qualities link directly with measurable impacts on the scholarly communications lifecycle. For instance, we can connect these metadata/impact pairings:
- High-quality title & description metadata → improved full-text search and retrieval
- Accurate and accessible semantic metadata → programmatic data analysis
- Persistent identifiers (ORCIDs and DOIs) → savings in administrative time & costs
There are likely many other metadata/impact pairings of value to our industry that would further our collective efforts toward developing standard benchmarks for metadata successes.
Metadata: it takes a village
The fact is, we cannot resolve every broken link in the ecosystem, as much as we would like to do everything possible to reduce user friction and increase research productivity. In our experience, the highest return on metadata investments comes where publishers, libraries, and technologists work together to scale the production and maintenance of quality metadata. This is where information standards and terms of engagement come into play, to establish the trust necessary in a value chain like scholarly communications. Identifiers like the DOI or protocols like KBART provide the basic infrastructure upon which scholarship operates.
Those of us within this network of research information have a collective responsibility to ensure the metadata surrounding scholarly assets are accurate, interoperable, standards-compliant, and widely distributed. We encourage you, dear reader, to reach out to organizations like NISO, Jisc, and others to lend a hand and do your part to improve the positive information-user experiences generated by good metadata.
The authors would like to thank Jennifer Kemp at Crossref for the inspiration to take this dive into the metadata literature and reflect on its impact on research information experiences. Special thanks to Michelle’s former colleagues, who supported the 360 Knowledgebase and Summon, for assistance with the discussion about linking failures.
7 Thoughts on "The Experience of Good Metadata: Linking Metadata to Research Impacts"
At one point in my career I had to audit a mailing list. That should be a simple thing to do. You have a name and an address. Let us take a name: Bill A. Smith, U of Iowa, Or is it BSmith U of I, or BASmith U of Iowa or Smith B or Smith BA or …….
Metadata takes work on behalf of the user. It can take more work than the project!!!!!
I’ve been deeply involved in one library-side aspect of this topic since the very early days of OpenURL 0.1. I was depressed then and now to find that a substantial portion of our broken links were happening because the standards for metadata for journals had ambiguity in them that technology couldn’t solve. An easy example is journals whose enumeration includes two issue numbers for a single actual issue, e.g. Vol 10, Issue 5/6. I have no idea why anyone ever thought this was acceptable, but the linking standards, which require a single number in the issue field, left it entirely up to both the outbound and inbound openurl servers to decide whether to use 5 or 6. One would choose one, and the other would choose the other, and the link would break, and both companies would claim, correctly, that they were compliant with the standard. And don’t get me started on supplemental issues’ metadata! We all hoped that DOI would just make this problem go away, and it has helped a lot, but we need more universal adoption of DOIs throughout the metadata ecosystem. In general and speaking as someone who has been a “serials librarian” (among other hats) for over 2 decades, almost all of the most intractable problems are caused by the journal publishers doing stupid, metadata-unfriendly things with their enumeration, ISSNs, pagination, etc.
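A minimal Python sketch of the combined-issue ambiguity described above. The two normalizer functions are hypothetical stand-ins for an outbound link resolver and an inbound target platform, and neither reflects a specific vendor's behavior; the point is only that two defensible choices about "5/6" can produce a broken link.

```python
def source_issue(enumeration: str) -> str:
    """Hypothetical outbound resolver: keeps the first number of a combined issue."""
    return enumeration.split("/")[0].strip()


def target_issue(enumeration: str) -> str:
    """Hypothetical inbound platform: indexes a combined issue under its last number."""
    return enumeration.split("/")[-1].strip()


def link_resolves(enumeration: str) -> bool:
    """The link works only if both sides reduce the enumeration to the same value."""
    return source_issue(enumeration) == target_issue(enumeration)


def is_combined_issue(enumeration: str) -> bool:
    """Flag enumerations that cannot fit a single-number issue field."""
    return "/" in enumeration or "-" in enumeration


for issue in ["5", "5/6"]:
    print(issue, link_resolves(issue), is_combined_issue(issue))
```

A check like `is_combined_issue` is trivial on its own, but run across a knowledge base it would at least show where the ambiguous enumerations are.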
Hi Melissa–Thanks for your comment! This issue of funky pagination or issue numbering (Fall 2020 has similar issues to 5/6 for link resolvers) is definitely a long-standing one. It would be great to have an automated intervention to fix these issues and in the meantime, continue to (re)invest in human eyeballs on metadata clean up.
“It would be great to have an automated intervention to fix these issues and in the meantime, continue to (re)invest in human eyeballs on metadata clean up.”
Perhaps an earlier step would be “automated detection” as it seems that the onus is on the authors and librarians to find problems with publications that the publisher is unable or unwilling to test.
Would PMC be the best place to pull data for this type of analysis?
Automated detection or some sort of large-scale audit would be a good step for sure.
PMC seems like a great place to start!
The problem is not technical, but conventional in the sense of standards. There is no “problem” as far as the publishers are concerned – they are following the standards and technically there is nothing that authors or librarians can do. The problem is the standards themselves, that allow for unresolvable ambiguity. I suppose that librarians, who are the biggest invoice payers for the publications in question, could attempt to complain to the publishers that they need to stop using enumeration etc. that causes the problem, but I can’t imagine they’d listen to us. Whatever business reason they have for doing that nonsense in the first place would still exist as we’re certainly not going to cancel subscriptions over the matter and they know that.