With the recent surge in library e-book sales, serials aggregators are racing to add e-books to their platforms. ProQuest’s recent acquisition of ebrary and JSTOR’s expansion into current journals and e-books signal a shift from standalone e-book and e-journal aggregator platforms to mixed content gateways, with e-books and e-journals living cheek by jowl in the same aggregation.
Meanwhile, researchers have become accustomed to the big search engines, and have shifted from reading to skimming. As the authors of an article in the January issue of Learned Publishing, “E-journals, researchers – and the new librarians,” summarize:
Gateway services are the new librarians. . . . Reading should not be associated with the consumption of a full-text article. In fact, almost 40% of researchers said they had not read in full the last important article they consulted. . . . ‘Power browsing’ is in fact the consumption method of choice.
These changes in behavior mean that gateway vendors have to develop more sophisticated tools for organizing and surfacing content. ProQuest, OCLC, EBSCO, and others have responded by creating new tools and systems. But is it enough?
Publishers often discuss distinctions between e-book and e-journal business and access models, but the truly complex differences in e-books and e-journals reside beneath the surface, in the metadata layer. Understanding and compensating for these differences is essential for interoperable content discovery and navigation when mixed e-book and e-journal content is delivered in large-scale databases, which is increasingly the norm.
Until the evolution of semantic technologies reduces our reliance on catalog and bibliographic records for information discovery and contextualization, nothing supports research discovery better than pristine, consistent, and granular metadata.
As discussed in a recent opinion post in The Atlantic Wire, consumers are recognizing the drawbacks of Google-style search. For researchers, this kind of searching is especially inadequate given its:
- Susceptibility to search engine optimization gaming
- Reliance on linear ordering of result sets
- Lack of transparency about resources that are not included in the searched information and/or not prioritized by the search algorithm
- Inability to provide contextualized information — the “shape of the elephant” of what is being sought, based on piecemeal queries*
Networked metadata layers offer new ways of navigating and linking content, ways that avoid these pitfalls. But they have their own challenges.
Lack of consistent record quality: When dealing with such large volumes of information, there’s little incentive for individual publishers to invest in manually overhauling their metadata. Most publishers don’t create their own records. They rely on clearinghouses, predominantly OCLC, to manage record creation for them. Inconsistencies and errors in records create enduring problems, and locating and repairing these can be a daunting task.
E-book records are arguably more problematic than their more consistent e-journal brethren. Few publishers have a detailed appreciation of the challenges involved in creating durable links across e-book subsections like chapters or entries. Even at the book level, I have heard from ARL librarians that they will refrain from populating OpenURL resolvers with e-book MARC records in order to avoid the broken links and version confusion that result from dirty data.
Need for more granularity: As researchers increasingly seek smaller and more specific content units, quality metadata assignment at the entry or chunk level becomes even more important. Ideally, metadata will support durable linking, for example through a digital object identifier (DOI), and provide hierarchically structured subject information of the kind contained in a well-formed e-book MARC record or e-journal bibliographic record.
Networked data nodes can effectively drive dynamic discovery and access for e-books, e-journals, and other content formats, including multimedia. Is there enough clean data at scale to support this? No.
Overhauling old records, investing in more granular record creation, and cross-mapping MARC, DOI, and bibliographic records is a massive endeavor requiring significant investments of money and time. When done right, the process can significantly improve interoperability and navigation in discrete publisher platforms. But right now, it appears this will be a competitive advantage for some and not a universal benefit.
Data layers must be populated with comprehensive, discipline-specific taxonomies and clean metadata. For purposes of discovery and navigation, data nodes should contain MARC and e-journal bibliographic record match points as well as durable linking locations. While some individual publishers may undertake this, as Springer has, it’s untenable in the mid-term for most non-STM publishers.
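To make the idea of a data node concrete, here is a minimal sketch in Python of what one might hold. It is an illustration only, not a description of any vendor's system: the class name, the field names, and the choice of a MARC control number and an ISSN as match points are all assumptions, and the identifiers in the usage note are made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ContentNode:
    """Hypothetical data node for one discoverable unit (book, chapter, article, entry)."""
    title: str
    content_type: str                          # e.g. "book-chapter" or "journal-article"
    doi: Optional[str] = None                  # durable linking location, if one is assigned
    marc_control_number: Optional[str] = None  # assumed match point into an e-book MARC record
    issn: Optional[str] = None                 # assumed match point into an e-journal record
    subject_path: Tuple[str, ...] = ()         # hierarchical subject, broad to narrow


def durable_link(node: ContentNode) -> Optional[str]:
    """Return a DOI resolver URL for the node, or None when no DOI is recorded."""
    if not node.doi:
        return None
    return "https://doi.org/" + node.doi


def shares_subject(a: ContentNode, b: ContentNode, depth: int = 2) -> bool:
    """True when both nodes agree on the first `depth` levels of their subject paths."""
    if len(a.subject_path) < depth or len(b.subject_path) < depth:
        return False
    return a.subject_path[:depth] == b.subject_path[:depth]
```

With nodes like these, a book chapter and a journal article can be linked and grouped in the same way. For example, a chapter with the made-up DOI "10.9999/example.ch3" and the subject path ("Economics", "Labor", "Part-time work") would sit alongside an article filed under ("Economics", "Labor", "Wages") when compared at depth 2. A node with no DOI simply returns no durable link, which is the gap that dirty or incomplete records leave in practice.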
One anticipates that dynamic, automated processes for metadata creation will become the norm. If Narrative Science can generate text from data, surely the reverse is within our grasp.
We can look forward to semantically generating reliable, structured metadata on the fly from text and image content chunks. STM e-content companies are the most likely candidates to blaze trails with metadata creation “engines” that learn to read and interpret content chunks in order to produce descriptive data in real time with increasing accuracy.
However, until technology learns to learn more accurately, there will continue to be problems with scale, interoperability, and consistency of metadata in large content gateways. Discovery of mixed-type research content will be suboptimal and incomplete.
Semantic research gateways are the next wave, but they will work best when informed by the traditions of librarianship. There is critical value to be gleaned from time-tested practices for describing and organizing multidisciplinary subject matter. Metadata librarians have long wrangled with consistency issues. Reference and subject librarians, particularly those working with LibGuides, grasp context better than most.
The most promising solutions will make the most of innovative technologies while also mining specialized institutional knowledge.
* A seminal paper on this topic was prepared by Thomas Mann in 2007 for the Library of Congress Professional Guild.
6 Thoughts on "Smarter Metadata — Aiding Discovery in Next Generation E-book and E-journal Gateways"
One possible way of fixing this is publishing books in XML. If the structure is chosen wisely, it is relatively easy to extract a granular description.
At AUP we are working on an export to TEI XML. This defines (sub)chapters. Combined with chapter titles, this should enable a ‘data mining engine’ to do some interesting things. We hope to show the first XML title in the OAPEN Library at the end of this month.
In theory, this is correct. In practice, we’ve found it incredibly hard to come up with even a small number of XML structures to cover all the different types of books we publish at OECD, so automatic extraction is difficult. We’re working hard to crack this problem because metadata creation is probably the single biggest challenge we’re facing at the moment.
Meanwhile, if anyone wants to see a site that has granular metadata for e-books, e-journals, working papers and datasets in a seamless service – try searching for ‘part-time work’ on OECD iLibrary. In the search results you’ll find chapters, graphs within chapters, datasets, working papers. User feedback is great – but don’t underestimate the work needed to generate the metadata.
It didn’t help that the BISAC codes were constructed to reflect the practices of trade publishing. See my article “Why I Hate the BISAC Codes” available here: http://www.psupress.org/news/SandyThatchersWritings.html.
P.S. Mann’s article is indeed seminal–and superb.
There is a fascinating competition between metadata and full text search. As a semantic search person I used to say that metadata was obsolete. But people are doing some really cool stuff with metadata, especially mapping. Here is the latest example, with lots of mapping of NIH and NSF R&D: http://rd-dashboard.nitrd.gov/