The scholarly publishing community talks a LOT about metadata and the need for high-quality, interoperable, and machine-readable descriptors of the content we disseminate. However, as we’ve reflected on previously in the Kitchen, despite well-established information standards (e.g., persistent identifiers), our industry lacks a shared framework to measure the value and impact of the metadata we produce.
In 2021, we embarked on a Crossref-sponsored study designed to measure how metadata impacts end-user experiences and contributes to the successful discovery of academic and research literature via the mainstream web. Specifically, we set out to learn if scholarly books with DOIs (and associated metadata) were more easily found in Google Scholar than those without DOIs.
Initial results indicated that DOIs have an indirect influence on the discoverability of scholarly books in Google Scholar — however, we found no direct linkage between book DOIs and the quality of Google Scholar indexing or users’ ability to access the full text via search-result links. Although Google Scholar claims to not use DOI metadata in its search index, the results of our mixed-methods study of 100+ books (from 20 publishers) demonstrate that books with DOIs are generally more discoverable than those without DOIs.
As we finalize our analysis, we are sharing some initial results and inviting input from our community. What relevant lessons can we glean from this exercise? What changes might book publishers consider based on the outcomes of this study?
Background on the study
This study was designed to evaluate the impacts and benefits of metadata for users. Given Google Scholar’s popularity with a range of stakeholders in our industry, we set out to measure metadata impacts on discoverability in the mainstream web, namely in Google Scholar.
Our test method and analysis rubric were developed based on our own information-user research, in particular how readers search and retrieve scholarly ebooks, as well as published studies about academic information experiences and research practices. We rated the search performance of more than 100 scholarly books using preset test queries (two for each title). The books tested in this study came from publishers of all sorts and sizes, and represent both monographs and edited volumes from a range of fields; some were open access and others were published under traditional licensing models.
We developed and executed known-item test searches that were designed to simulate common researcher practices. Heuristic analysis of the search results was used to rate the search performance on a 5-point scoring rubric, which was designed to measure the degree of friction in locating the book in question. This method allowed us to compare specific book and metadata attributes by their search performance scores and thereby assess the impact of book metadata on content discoverability in Google Scholar.
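For readers who want to try a similar analysis on their own titles, here is a minimal sketch of how rubric scores might be rolled up and compared by attribute. It is not the study's actual tooling: the record layout, the field names (`book`, `has_doi`, `score`), and the assumption that 5 means least friction are all hypothetical, and the values are toy placeholders rather than study data.

```python
from collections import defaultdict
from statistics import mean

# Toy placeholder records, one per test query (NOT data from the study).
# Assumption: scores run 1-5 and 5 means the book was found with least friction.
results = [
    {"book": "A", "has_doi": True, "score": 5},
    {"book": "A", "has_doi": True, "score": 4},
    {"book": "B", "has_doi": False, "score": 3},
    {"book": "B", "has_doi": False, "score": 2},
]


def book_scores(rows):
    """Average the query scores recorded for each book."""
    by_book = defaultdict(list)
    for row in rows:
        by_book[row["book"]].append(row["score"])
    return {book: mean(scores) for book, scores in by_book.items()}


def mean_by_attribute(rows, attribute):
    """Average the per-book scores within each value of one attribute."""
    per_book = book_scores(rows)
    attribute_of = {row["book"]: row[attribute] for row in rows}
    groups = defaultdict(list)
    for book, score in per_book.items():
        groups[attribute_of[book]].append(score)
    return {value: mean(scores) for value, scores in groups.items()}


print(mean_by_attribute(results, "has_doi"))
```

Averaging per book first keeps a title with extra queries from weighing more heavily than the others in its group; a different weighting would be an equally defensible choice.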
Results and findings
In this study, we learned that high-value fields include the primary title paired with subtitles, author/editor surnames and/or field of study. Queries using full book titles performed the best across the board. Those using publication dates and/or author/editor surnames and/or publisher names, but without the book title, were the lowest performers.
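To make the contrast between query types concrete, the sketch below builds a few known-item queries from a book's metadata. The exact query templates used in the study are not reproduced in this post, so the three patterns, the function name `build_queries`, and the field names are assumptions for illustration only.

```python
def build_queries(book):
    """Return labeled known-item test queries built from metadata fields.

    Hypothetical templates: one full-title query, one title-plus-surname
    query, and one query that omits the title entirely.
    """
    surname = book["creator_surname"]
    full_title = book["title"]
    if book.get("subtitle"):
        full_title = f'{book["title"]}: {book["subtitle"]}'
    return {
        "full_title": full_title,
        "title_and_surname": f'{book["title"]} {surname}',
        "surname_year_publisher": f'{surname} {book["year"]} {book["publisher"]}',
    }


# Placeholder metadata, not a real book.
example = {
    "title": "Example Title",
    "subtitle": "An Example Subtitle",
    "creator_surname": "Surname",
    "year": 2020,
    "publisher": "Example Press",
}

for label, query in build_queries(example).items():
    print(f"{label}: {query}")
```

Labeling each query by its template makes it straightforward to group the resulting scores by query type afterwards.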
Surprisingly, our discoverability scores show no significant variation in performance by the type of book, whether edited or authored. Open-access titles performed somewhat better than traditional ones. Books covering humanities and social science fields performed a bit better than STM books, but only by a slim difference (that is not statistically significant).
We primarily tested the discoverability of book titles, drawing on equal numbers of books with and without chapter-level DOIs. We ran similar tests for chapter-title discoverability but found that the majority of test queries for chapters led users to the full book itself. While books without title-level DOIs were found to be less discoverable, we did not find a measurable difference between books with and without chapter-level DOIs. (Note: All books in this study with chapter-level DOIs assigned also carried a title-level DOI, which was found to be fairly common.)
Based on these results, we are developing a theory that books with DOIs perform better in Google Scholar because they benefit from the structured, open metadata associated with those DOIs. That metadata is used by hundreds of platforms and services and is therefore “seeded” throughout the mainstream web, where Scholar may draw on it for indexing, linking, etc. That said, these results also suggest that publishers are best served by a metadata strategy that is well attuned to the protocols expected of each channel for book search and discovery. In a recent conversation about our findings, Anurag Acharya of Google Scholar noted that these results underscore the need for publishers to invest in the robust construction and broad distribution of book metadata.
In this study, we have observed that the metadata protocols surrounding Google Scholar are not fully integrated into our industry’s established scholarly information standards bodies, like NISO, or infrastructure organizations, like Crossref. While some mainstream data standards prevail in the Scholar index, like the use of schema.org and HTTP, some key metadata attributes seem to be lacking. For example, an indicator of the type of scholarly book (monograph, handbook, etc.) would improve Google Scholar’s search index and could be used to filter search results, thereby improving users’ experiences discovering scholarly books. One clear challenge for book publishers today is the fact that Google Scholar operates outside of our community-governed scholarly information infrastructure.
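As an illustration of what a book-type indicator could look like in schema.org markup, here is a hedged sketch that emits a JSON-LD `Book` record. The helper `book_jsonld` is hypothetical, and carrying the book type in the general-purpose `genre` property is our assumption; schema.org does not define a dedicated monograph/handbook field, and nothing in this study establishes that Google Scholar reads any of these properties. The DOI and ISBN are placeholders.

```python
import json


def book_jsonld(title, doi, isbn, book_type):
    """Build a schema.org Book description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "isbn": isbn,
        "identifier": {
            "@type": "PropertyValue",
            "propertyID": "DOI",
            "value": doi,
        },
        "sameAs": f"https://doi.org/{doi}",
        # Assumption: "genre" stands in for a book-type indicator here.
        "genre": book_type,
    }
    return json.dumps(record, indent=2)


# Placeholder identifiers, not a real book.
print(book_jsonld("Example Title", "10.5555/12345678", "978-3-16-148410-0", "monograph"))
```

A publisher could embed output like this in a landing page's `<script type="application/ld+json">` element, though which consumers would act on the `genre` value is an open question.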
What comes next
While this study focused on Google Scholar, the results and lessons learned are applicable to other mainstream channels of information seeking/discovery. Our report, due out spring 2023, will contribute to the literature intended to support user-centric information systems design and content architecture by scholarly publishers and service providers.
As we write up our findings, we intend to develop a framework that can help publishers and others measure the impact of their work to enrich and distribute scholarly metadata. We hope this first systematic review of the impacts of metadata on the discoverability of books in Google Scholar will provide valuable insights for this community. In the meantime, please share your thoughts and questions in the comments below — or reach out to us directly.
The authors would like to thank Jennifer Kemp at Crossref for the inspiration to take this dive into the metadata literature and reflect on its impact on research information experiences. Special thanks to Anurag Acharya at Google Scholar for his consultation during this study.
9 Thoughts on "Measuring Metadata Impacts: Books Discoverability in Google Scholar"
Looking forward to the full report! I worked a lot on book metadata a couple of years ago while coining and working on Bookmetrix. We found then that many publishers, smaller and larger, didn’t assign DOIs to books to start with, let alone chapters. By the looks of it, that is bad not only for the discoverability of books in Google Scholar but also within the entire ecosystem. We also found that many references to books could not be properly parsed by Crossref due to the lack of DOIs. As a result, books and chapters tend to show an underestimation of the number of citations, and authors wouldn’t know their book was being cited. Books have always been treated with lower priority by many publishers, so I hope your study will invigorate publishers to treat them like for like with journals.
Your hypothesis is “DOIs have an indirect influence on the discoverability of scholarly books in Google Scholar”. Perhaps an alternative hypothesis is that those scholarly book publishers who are thoughtful enough to apply DOIs to books and book chapters generally pay more attention to the quality of their metadata, and that’s why books with DOIs happen to be more discoverable.
That’s an interesting perspective, Bruce, and could very well be true. What we can reliably discern from our data is that DOIs facilitate distribution of structured data to myriad channels, which Google Scholar likely incorporates into its ranking. Whatever the reason, I hope publishers will see this as additional ROI evidence for metadata investments!
Very interesting research, and I’m really looking forward to the report!
In a past life, when I sat in on meetings with Google Scholar about making content more discoverable, I remember hearing a lot about how GS was only willing to index a certain kind of book: those with more standalone chapters rather than a continuous non-fiction narrative, since standalone chapters are more comparable to articles. I’m not sure if this is still the case, just my experience from a few years back, but it might be a factor when investigating discoverability of ebooks in GS.
Thank you, Abigail, for your comment and thoughtful read of this post (also, it’s so lovely to hear from you)! You’re absolutely right, GS prefers to index edited volumes over monographs, as chapters can more easily replicate the journal model. Regardless of book type, publishers can influence the discoverability of books in GS by including their titles in Google Books. And, since we found that GS most often points users to the full-book record, this should help monographs be more discoverable than in the past, and we hope that’s encouraging to book publishers!
Interesting to hear that Google Books is still kicking around and influences GS indexing! 🙂
Thank you for a valuable study. Is there any indication, from yours or any other research you’re aware of, that related metadata such as chapter abstracts has a positive impact on scholarly book discovery?
Hi Rich, thanks for your question! I’m not aware of evidence that specifically supports chapter-level abstracts in Google Scholar discovery. However, insofar as abstracts include key terms, we do have evidence of their value. When we first designed the study, we were hoping to be able to measure the impact of book and chapter abstracts (this was eventually tabled due to the structure of DOI metadata in the Crossref database). What we found, however, is that key terms from the book and chapter titles ranked among the highest-value attributes in our research, which aligns with keyword analysis in published studies, e.g., https://doi.org/10.5860/crl.77.1.7. I hope that helps!