Finding Stuff: Discovery and Data Quality

About a month ago, you may have seen an article published by Nature entitled “How To Tame the Flood of Literature”. The article focused on new tools that help researchers keep current with the literature in their fields. These are personalized recommendation engines, which may be based on a researcher’s publication history (Google Scholar), on users with similar interests (PubChase), or trained by an individual’s approval or rejection of suggestions (Sparrho). The article made some useful points about the time needed to train such systems as well as the ongoing necessity of human evaluative skills, whether one’s own or those of colleagues. These engines serve a purpose for researchers, but they are not yet fully mature as tools for discovery.

A week or two later, Ithaka S&R released a challenging brief, “Does Discovery Still Happen in The Library? Roles and Strategies for a New Reality”, authored by Roger C. Schonfeld. The brief notes that discovery currently happens in a variety of settings. It discusses findings from a recent Ithaka survey indicating that skilled use of discovery tools in the context of library resources differs dramatically between first-year undergraduates and experienced graduate students, between researchers in different communities, and even between institutions. Connecting users with the right content, a long-standing role for academic libraries, is currently a time- and labor-intensive activity.

Whether we blame budget constraints, technological disruptions, or some other aspect of modern life, discovery remains a significant challenge. By the end of the brief, having noted the difficulties of delivering robust discovery services in a period of transition, Schonfeld asks the hard question: “Can libraries step forward to play a greater role in current awareness? Should they do so?”

Should libraries abandon investment in formal discovery services (of whatever ilk) and leave the job to the somewhat mysterious algorithmic mercies of Google, Amazon, or the start-ups referenced above? The community offered feedback, which Roger subsequently shared publicly. One of the most important points made was that discovery does not have to occur solely within the library, but that the library should be considered a key channel of access in supporting discovery. I’m in full agreement with that idea, but there’s some practical work that needs to be done if content and service providers are going to support discovery in the library to the extent needed.

The nature of that practical work is revealed in a recently released, cross-industry white paper, “Success Strategies for Electronic Content Discovery and Access”. To save you some time, let me quickly summarize the problems identified. They have to do with the speedy supply of accurate and complete metadata that connects users with the content to which the library is offering access. Data suppliers, service providers and libraries are not giving sufficient attention to the gaps that occur for the user when:

Data are incomplete or inaccurate;
Bibliographic metadata and holdings data are not synchronized;
Libraries receive data in multiple formats.

We live in a garbage-in-garbage-out world of information services. These are data quality issues that, if not resolved, impede the work of professional scholars as well as students. High-quality, rich bibliographic metadata must be supplied and properly fed into the various discovery and access systems in place in libraries if users are to be supported in efficiently searching the available literature. Identifiers have to match up. Holdings information must be kept current. Most importantly, stakeholders need to work together to ensure consistency and synchronization.
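
To make the first two gaps concrete, here is a minimal Python sketch of the kind of check a library or supplier might run. It is an illustration only and is not taken from the white paper: the record layout, the field names, and the sample identifiers are all invented, and real feeds (MARC, KBART, ONIX and so on) are far richer. The third gap, multiple formats, is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BibRecord:
    """Hypothetical, heavily simplified bibliographic record."""
    identifier: Optional[str]  # e.g. an ISSN, ISBN, or DOI
    title: Optional[str]
    author: Optional[str]


def find_gaps(bib_records, holdings_ids):
    """Report incomplete records and mismatches between metadata and holdings."""
    # Gap 1: records missing any of the fields this sketch treats as required.
    incomplete = [r for r in bib_records
                  if not (r.identifier and r.title and r.author)]
    # Gap 2: identifiers present in one data set but not in the other.
    bib_ids = {r.identifier for r in bib_records if r.identifier}
    holdings = set(holdings_ids)
    return {
        "incomplete": incomplete,
        "described_but_not_held": sorted(bib_ids - holdings),
        "held_but_not_described": sorted(holdings - bib_ids),
    }


# Made-up sample data, for illustration only.
records = [
    BibRecord("1234-5678", "Journal A", "Smith"),
    BibRecord("2345-6789", "Journal B", None),   # missing author
    BibRecord(None, "Book C", "Jones"),          # missing identifier
]
gaps = find_gaps(records, ["1234-5678", "9999-0000"])
for kind, items in gaps.items():
    print(kind, items)
```

Even a check this crude surfaces the pattern described above: a record can be found but not linked to holdings, or held but not findable, depending on which side of the supply chain dropped the data.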

Before libraries can answer the question posed in Schonfeld’s brief about whether it’s worth supporting discovery in the library environment, the directors of those libraries must have an accurate sense of user need, preference, and adoption. Until discovery systems are properly fueled by accurate data that has been appropriately and effectively integrated, the role of the library in support of discovery will remain an open question. No one will be sure whether the profession took the best road or simply the most expedient one. Is it a question of defeat or of best practice? In an age when we claim that decisions must draw on real data, deciding without such data is surely a faulty approach.

The working group behind this paper spans the broadest range of disciplines (HSS as well as STEM) and includes content providers Elsevier, Springer, Wiley, Project Muse, and Ithaka/JSTOR, as well as representation from the library community including OCLC, the University of Maryland and the University of Delaware. Those individuals participating have already taken the first step required by articulating the issues, none of which are insurmountable. They are, however, issues in need of resolution. If the information community acts collaboratively, we can continue to support the library’s role in discovery, even as we recognize that the library need not be the sole channel available to scholars or students.

This cross-industry paper will be in the spotlight over the coming weeks, both in an NFAIS webinar discussion (Oct 23) and during the Charleston Conference (specifically on Thursday, Nov 6). Take these opportunities (and the time) to familiarize your staff and your organization with the issues that have been identified as well as the recommendations offered as a solution. If your role as a content provider is to support awareness of and access to high-quality content, then it’s incumbent upon the information community as a whole to ensure that the hallmarks of quality (that is, full and accurate metadata consistently applied and delivered in readily ingestible formats) are present in the feeds exchanged between stakeholders. To wait until we see which way discovery is headed before making these improvements has implications for how rapidly research and education move forward.

*Post updated 10/15/14 to note the efforts of OCLC in the working group.

Jill O'Neill

Jill O'Neill is the Educational Programs Manager for NISO, the National Information Standards Organization. Over the past twenty-five years, she has held positions with commercial publishing firms Elsevier, Thomson Reuters and John Wiley & Sons, followed by more than a decade of serving as Director of Planning & Communication for the National Federation of Advanced Information Services (NFAIS). Outside of working hours, she manages one spouse and two book discussion groups for her local library.

Discussion

20 Thoughts on "Finding Stuff: Discovery and Data Quality"

I chose to post more than 80 of my articles at the Penn State library’s institutional repository ScholarSphere in part because it structures metadata in such a way as to maximize discoverability. Among other things, in the various categories where it asks authors to supply metadata, there are helpful prompts that lead to standardization. That this system works well was brought home to me recently when a long essay I posted very quickly got indexed by Google and began appearing on the first page of Google results for relevant searches.

By Sandy Thatcher
Oct 14, 2014, 8:28 AM

Search engines like Google and Google Scholar depend heavily on full text search, rather than on metadata. No doubt metadata is important but the full text aspect does not seem to be recognized in the white paper.

By David Wojick
Oct 14, 2014, 9:36 AM

Google and Google Scholar do search the full text of documents, but metadata quality and design have an enormous impact on whether or not Google goes to the trouble of crawling your text. A colleague of mine co-authored a very useful book on this topic; see in particular chapters 6 and 7.

By Rick Anderson
Oct 14, 2014, 1:38 PM

As I understand it, Google changed their indexing policy for scholarly articles a year or so ago. Where they used to index full text of journal articles on both Google and Google Scholar, they now only do full text on Google Scholar, and index only abstracts on regular Google. The reasoning for this was seeing so many bouncebacks where a typical Google user would get to a scholarly article and immediately head back to search results for something more comprehensible, and an assumption that anyone using Google Scholar was actually looking for scholarly articles.

By David Crotty
Oct 14, 2014, 1:42 PM

Interesting, David. Google often provides a small sample of GS search results as part of the hits from a Google search. I wonder how they get them?

By David Wojick
Oct 14, 2014, 2:45 PM

In our experience with implementing SEO, Google changes its indexing practices on a more or less ongoing basis, with no warning and no notice. This poses something of a challenge for those who want to get indexed.

By Rick Anderson
Oct 15, 2014, 9:09 AM

True enough, Rick, my wife is a web designer and she harps on this. But isn’t that the publisher’s metadata, because that is where the full text resides? The white paper seems to be about the library getting good metadata via the supply chain, especially from service providers as opposed to the publishers.

By David Wojick
Oct 14, 2014, 2:49 PM

Where the metadata comes from depends on what the websites/documents are. This is an issue for formal and commercial publishers, of course, but also for institutional repositories. Those of us who manage or oversee IRs tell faculty that one consequence of contributing their papers is increased exposure, so we’re always worrying about how to ensure that our IRs’ content will be crawled by Google and thereby made discoverable.

By Rick Anderson
Oct 15, 2014, 9:11 AM

I credit the good metadata that Penn State’s IR encourages for seeing my new history of Princeton swimming indexed by Google and placed on its first search page under that phrase within a week of its posting.

By Sandy Thatcher
Oct 14, 2014, 6:57 PM

As one of the authors of the white paper, I am delighted to see this conversation. I want to add to the comments made about metadata and its role in the e-content ecosystem.
One major use of bibliographic metadata is to ‘discover’ resources. But the bibliographic metadata supports more than just discovery. It contains important identifiers that connect the scholar to the e-content itself. These identifiers are used in a ‘relay network’ that connects to the full text. So, when a user clicks on the “view” button, a message containing the identifiers is relayed to the knowledge base, where the request is resolved and forwarded on to the publisher or vendor database. A successful relay will result in the full text display.
The metadata we are talking about in this paper, then, is not only descriptive information such as title, author, subject, etc. that helps the user find the content. It also contains identifiers such as ISXN, DOI, etc., that are used by the technology (or network?) to execute the display commands.
I hope this brief explanation helps to describe the role of metadata in the networked environment.

By Carlen Ruschoff
Oct 22, 2014, 9:47 AM
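
The relay described in the comment above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular link resolver or knowledge base product works: the `KNOWLEDGE_BASE` table, the `resolve` function, the DOI, and the URL are all made up.

```python
from typing import Optional

# Hypothetical knowledge base: "scheme:value" identifier -> full-text target
# at the publisher or vendor. All entries are invented for illustration.
KNOWLEDGE_BASE = {
    "doi:10.1000/example.1": "https://publisher.example.org/fulltext/10.1000/example.1",
}


def resolve(identifiers: dict) -> Optional[str]:
    """Try each identifier carried by the request; return the first target found."""
    for scheme, value in identifiers.items():
        target = KNOWLEDGE_BASE.get(f"{scheme}:{value}")
        if target:
            return target
    # No match: the relay fails and the user never reaches the full text.
    return None


# A "view" click relays the identifiers found in the bibliographic record.
print(resolve({"issn": "0000-0000", "doi": "10.1000/example.1"}))
print(resolve({"doi": "10.1000/missing"}))
```

The second call shows why identifier quality matters: a record with a wrong or absent identifier can still be discovered, but the hand-off to the full text breaks.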

It was not mentioned, but are those providing search (discoverability) services asking researchers what they need and how they want to go about finding it? If not, I would start from there. If they are, are the services following up?

By Harvey Kane
Oct 14, 2014, 9:43 AM

Harvey, it’s always a question as to whether the users are currently satisfied with the available tools. Based on the Nature article mentioned here, it would appear that the needs of some segment of the user base, serious scholars in particular, may still be somewhat unsatisfied. Google Scholar is fine as far as it goes, but no one knows how comprehensive it is or what the rate of update might be. That’s true of some of those other tools as well.

Libraries, serving a broader user base of undergrads, grad students, etc., with differing skill sets, really want to ensure that the discovery services and tools that they provide are as up-to-date as possible, are as reliably comprehensive as possible, and are processing materials as rapidly as possible. The white paper from the working group suggests that there are areas that need to be improved and offers recommendations as to how best to make those improvements. What it comes down to in many respects is the requirement that metadata associated with formal publications be supplied more completely, more accurately and more speedily than is currently happening. If content providers are worried about users finding their latest content, this would be a meaningful shift to improve the chances of that happening!

By Jill O’Neill
Oct 15, 2014, 11:55 AM

Keep in mind that Google, etc. do not provide access to most subscription services, including review journals, that are taken by research libraries. In the esoteric reaches of science and scholarship, such advertising companies provide mere shiny surfaces that fail to display riches below. Even Google Books, which has PDF’d many public domain, out-of-print tomes, often fails to catalog them properly and cannot tell you anything about the quality of print, paper, binding.

By Albert Henderson
Oct 14, 2014, 10:21 AM

The white paper makes the important distinction between discovery per se and access, by which they mean determining whether the item in question is actually available via the library collection. Google Scholar searches a great many subscription journals, on the discovery side. (I would be surprised if that does not include review journals.) But GS discovery is not access, except in limited cases where GS links to a repository copy or OA article or some such. Then too there are the local search engines. This is not simple. Access sounds like the modern equivalent of keeping the card catalog accurate and current, which I am sure is not easy in the digital age. This seems to be the focus of the white paper.

By David Wojick
Oct 14, 2014, 10:34 AM

The distinction is academic. Citations without access are birds that can’t fly.

By Albert Henderson
Oct 14, 2014, 11:22 AM

Not really, Albert. Access here means being in a particular library, or library access. There are lots of other ways to get access. In fact I use GS a lot and never use a library. (The last time I used a library was to get a demo of a newfangled thing called the World Wide Web.) The diversity of overall access, beyond library access, is something of a problem for libraries.

By David Wojick
Oct 14, 2014, 11:42 AM

Now, David, you’re making me weep and gnash my teeth in frustration. The fact that the pipes are leaking between the reservoirs of content and the library’s discovery tools and services is not an insurmountable problem. We just need to increase the pressure on suppliers to move the necessary information through the system more rapidly!

By Jill O’Neill
Oct 15, 2014, 12:04 PM

I agree, Jill, so I do not understand your weeping and gnashing. I was merely making the technical point that discovery per se and library access are two very different things in the new e-world. The white paper is about library systems, not the broader topics of discovery or non-library access. For example, my background is with US Federal access, especially DOE OSTI and Science.gov.

However, if the goal is to “move the necessary information through the system more rapidly!” then this probably comes with a significant cost. I would have to see the numbers before I endorsed the goal. The burden of information mandates has been a study of mine for many years. In fact I helped design the “burden budget” system that governs US Federal regulatory activity. Information mandates can be very expensive.

By David Wojick
Oct 15, 2014, 4:29 PM

Not necessarily, Albert. You’re dismissing an awful lot of valuable bibliographies consisting of evaluative/indicative citations to books that are still useful to scholars and students for purposes of discovery. I agree that the current emphasis is on enabling online access as quickly and as completely as possible, but let’s not pretend that will be happening overnight. At the same time, we can avoid the issue of libraries falling behind if content providers would begin moving high-quality metadata for their materials into the pipeline more rapidly going forward.

By Jill O’Neill
Oct 15, 2014, 12:01 PM

I am currently taking a look at how our library’s ebooks get indexed in our discovery service. Some vendors provide ebooks whose chapters are indexed. This provides about 15 times more records to discover. Other vendors don’t provide this. For example, Springer’s reference works get chapters indexed, but e-monographs don’t, even though each chapter apparently has its own DOI. I think indexing at the chapter level has the potential to drive ebook use further, but not being aware of what is going on behind the scenes leaves a lot of question marks.

By Zach
Oct 15, 2014, 11:18 PM

The Scholarly Kitchen
