Editor’s Note: Today’s post is by Peter Webster, Librarian Emeritus and former Information Technology Services Librarian at Saint Mary’s University in Halifax, Canada.
Artificial intelligence tools for searching scholarly literature are being heavily promoted, and they are rapidly gaining popularity. However, like many other researchers, I am finding that AI search tools routinely miss important articles and can misinterpret concepts and subjects. So far, AI search tools are not able to reliably shortcut or replace conventional search methods. While hallucination is known to be part of the problem, it seems that limited, inconsistent, and missing metadata are also key issues.

AI tools for scholarly search are heavily reliant on a few openly accessible scholarly metadata sources, particularly OpenAlex, CrossRef, Semantic Scholar, PubMed, and arXiv. The metadata that is available is often limited to article titles and abstracts, plus some subject or keyword information. It is drawn from many different sources, both open access and paywalled, and it varies in quality and detail.
The major AI search developer, Semantic Scholar, is augmenting metadata using full text under agreements with publishers as well as open access content. However, much of their metadata is made up of title and abstract information and is not based on full text.
Clarivate also derives metadata from the full-text sources it licenses for its very large Central Discovery Index. This index is used for the Primo and Summon 360 Library catalogue discovery tools and is also used for new AI research assistant tools. But the Central Discovery Index still relies heavily on inconsistent title and abstract metadata from many different sources. Many scholarly publishers are limiting or withholding information about paywalled content and not allowing AI access to full text.
Large scholarly article indexes, like Elsevier’s Scopus and Clarivate’s Web of Science, are now featuring artificial intelligence assistant products as well. These indexes also rely primarily on title and abstract metadata, along with internal subject enhancements.
The Limitations of Inconsistent Title and Abstract Metadata
A good deal has been written about the shortcomings of title and abstract-based searching, and the need for more complete and consistent metadata. de Vries, van Smeden & Groenwold found that “literature searches based on title, abstract, and keywords alone may not be sufficiently sensitive” and noted the potential value of full-text literature searches. Guowei Li et al. found that article abstracts often miss important research features and concepts and can misstate research findings. Ong, Wagner & Keane make the case that “the AI tools’ inability to access paywalled literature introduces a potential bias toward open-access sources, which could affect the comprehensiveness and balance of the AI-generated output.”
At present, we seem to be in a situation where limited and inconsistent metadata and limited access to paywalled full-text content are lessening or negating the expected advantages of AI smart searching.
AI Search Improvements with Metadata Derived from Full Text
What is sorely needed are AI search tools that have more detailed and consistent metadata to work with, ideally derived from full text, but collected in a way that meets paywalled content providers’ need to protect valuable, human-readable content.
There is also a good amount of information available about the search improvements that can be achieved when AI searching is based on metadata derived from full text. For example, Jimmy Lin, working with articles in Medline, found that “retrieval based on spans, or paragraphs-sized segments of full-text articles, consistently outperforms abstract-only search.”
In a recent LinkedIn post, Anton Yuryev, Consulting Director with Elsevier, commented on the value of AI searching based on metadata derived from full text. He notes that full-text analysis of scholarly articles yields some four times more information that AI search tools can use for detailed concept determination.
Better Search Metadata While Protecting Full-Text Access
There are several ways this might be done, even with the current messy mix of open access and paywalled, restricted content. In overly simple terms, AI subject determination uses metadata in a couple of ways. Elements of information about each article are analyzed and collected into a knowledge graph, a structured and interconnected collection of information about articles. A knowledge graph allows attributes, common concepts, words and phrases, or connected citations to be identified and linked together. AI methods can be used to collect and develop detailed metadata to be included in a knowledge graph. AI search tools then rely on the knowledge graph for identifying related articles.
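To make the knowledge graph idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular provider builds its graph: the `KnowledgeGraph` class, the `article:` and `concept:` prefixes, the relation names, and the sample articles are all hypothetical. It shows how linking articles to shared concepts and citations lets a tool find related articles without reading the article text at search time.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy graph (hypothetical): nodes are strings, edges carry a relation name."""

    def __init__(self):
        # node -> set of (relation, neighbouring node)
        self.edges = defaultdict(set)

    def add(self, source, relation, target):
        # Store the edge in both directions so a search can start from either end.
        self.edges[source].add((relation, target))
        self.edges[target].add((relation, source))

    def related_articles(self, article):
        """Score other articles by direct links plus shared concept nodes."""
        scores = defaultdict(int)
        for _, neighbour in list(self.edges[article]):
            if neighbour.startswith("article:"):
                scores[neighbour] += 1  # direct link, such as a citation
                continue
            for _, other in list(self.edges[neighbour]):
                if other != article and other.startswith("article:"):
                    scores[other] += 1  # both articles share this concept
        # Highest score first; ties broken alphabetically for stable output.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample data: three articles, two concepts, one citation.
kg = KnowledgeGraph()
kg.add("article:A", "mentions", "concept:metadata quality")
kg.add("article:A", "mentions", "concept:full-text search")
kg.add("article:B", "mentions", "concept:full-text search")
kg.add("article:C", "mentions", "concept:full-text search")
kg.add("article:C", "cites", "article:A")

print(kg.related_articles("article:A"))
```

In this toy data, article C ranks above article B as a match for article A, because it both shares a concept with A and cites it. The richer the metadata behind each node, the more such links there are to follow.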
Another metadata approach that is central to artificial intelligence searching is vector embedding. This is a process of converting scholarly article concept and relationship information into machine-readable numeric code (vectors). Embedded vectors derived from full text are then used to identify and group articles that share similar concepts and other attributes and relationships.
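The following Python sketch illustrates the general idea of comparing articles as vectors. It is a deliberately simplified assumption: real AI search tools use learned embedding models, whereas this example swaps in simple word counts over a tiny fixed vocabulary. The function names, the vocabulary, and the sample texts are all hypothetical. What it does show accurately is the final step, in which articles are ranked by the cosine similarity between their vectors and the vector for a query.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy stand-in for a learned model: count each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, article_vectors):
    """Return article ids ordered from most to least similar to the query."""
    return sorted(
        article_vectors,
        key=lambda article_id: cosine_similarity(query_vector, article_vectors[article_id]),
        reverse=True,
    )


# Made-up vocabulary and article texts, for illustration only.
vocabulary = ["metadata", "abstract", "search", "protein", "cell"]
texts = {
    "A": "metadata search metadata abstract",
    "B": "protein cell protein",
    "C": "search abstract cell",
}
article_vectors = {key: embed(text, vocabulary) for key, text in texts.items()}
query_vector = embed("metadata search", vocabulary)

print(rank_by_similarity(query_vector, article_vectors))
```

Here the query lands closest to article A, then C, then B. The ranking step needs only the vectors, not the original text, which is why the detail captured in the vectors matters so much: a vector built from a title and abstract can only reflect what the title and abstract contain.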
Knowledge graphs and vector embedding methods are commonly used by open metadata collections. They are used internally by most publishers and scholarly content providers. Springer Nature has developed its own Springer Nature SciGraph. Elsevier uses knowledge graph and vector embedding applications from the Neo4j Company. They also use other methods in specialized subject areas.
Semantic Scholar has been at the forefront of developing knowledge graphs and vector-embedding methods. They have developed the Semantic Scholar Academic Graph (S2AG) and the SPECTER and SPECTER2 (Scientific Paper Embeddings using Citation Informed Transformers) vector embedding models. OpenAlex, one of the metadata sources commonly used by AI search tools, is built on a knowledge graph.
To a limited extent, AI search providers like Clarivate and Semantic Scholar are using AI-aided metadata collection and enhancement based on full-text articles. A number of publishers are working in partnership with AI applications on initiatives such as the Model Context Protocol (MCP) standard to better connect AI searches to publisher content. So, there are potential models to build on. Knowledge graphs and vector embedding provide a means for the sharing of AI-assisted full-text metadata enrichment while still keeping human-readable articles protected behind paywalls when that is needed.
Recently, Todd Carpenter, Executive Director of the US National Information Standards Organization (NISO), has posted letters discussing the work that NISO is doing to “affect positive advances for the interoperability of AI systems in scholarly communications.” He outlines a number of AI standardization issues that need addressing, including search and discovery metadata standards.
Existing knowledge graph and vector embedding methods provide an opportunity to develop standardized cross-publisher methods for collecting and sharing more consistent and detailed metadata.
Open Issues for AI Scholarly Search to Reach Its Potential
However, there is a troubling opposing trend. In a blog post last September, Aaron Tay reported that major publishers Elsevier and Springer Nature are substantially reducing the amount of abstract information they provide to open metadata repositories. Rather than providing more detailed metadata, based on full text, some publishers are moving to further restrict the information available to AI search tools.
There are many possible reasons why this is happening. While some AI developers are working closely with content providers, many AI tools are aggressively collecting metadata and accessing content with little control. Subscription publishers obviously seek to maintain revenue, and some are developing their own paid AI search tools. But there are also substantial concerns about how AI access to scholarly content is managed and regulated. Academic publishers, as well as other scholarly content creators, like the university I work for, must be concerned with protecting their content and ensuring that it is properly attributed, that it is uniformly accessible, and that author and publisher rights are upheld. There are important issues to be worked out in the relationship between AI search tools and scholarly content providers. A more consistent approach to providing detailed metadata for AI search might well help to address some of these concerns, in addition to improving AI search functionality.
Scholarly researchers and their academic libraries need to be more aware of the metadata the AI search tools they use are working with and how that metadata can be improved. We need to be aware that search results are being negatively affected by limited and missing information, particularly metadata about paywalled scholarly articles.
The greater availability, standardization, and enrichment of metadata for AI are important topics for discussion. The sharing of consistent article metadata and greater detail derived from full text is critical for AI-based scholarly literature searching to reach its full potential.