For human readers, the scholarly communication user experience has long been characterized by fragmentation. In recent years, substantial effort has been devoted to ameliorating it, whether through aggregation, syndication, or infrastructure for entitlement-based linking. I have long dreamed of a single site containing all scholarship, ready for discovery and access. For human readers, this fragmentation is mostly a massive inconvenience. For the machines, though, whether for training or analysis, it is a real impediment. The machines never, ever, want to read the publications of just a single publisher. The machines want to read everything, or at least everything in a given field. And this in turn raises questions about the work of assembling everything, which is no small task, and making it available for machine training and analysis. More than six years ago, I looked at the state of play in assembling substantially all published content. While some of the players have shifted around, the question today is more pressing than ever: Who Has All the Content?

Our contemporary media landscape is characterized by fragmentation. Every publisher seemingly has its own platform, and users must learn to navigate the idiosyncrasies of each. If you know how to read The New York Times in print, you won’t have much trouble reading The Washington Post either, but try their mobile apps and you’ll see that it’s as if they were in different industries. The TV guide and movie listings in your local newspaper once told you what you could watch in the coming week, but if you rely on streaming services today, you’ll find it difficult to determine what is available where across the main platforms: Amazon, Hulu, and Netflix. In our sector, the digital transformation has been no less powerful, and fragmentation is no less problematic. Or is it?

Giovanni di Paolo (Giovanni di Paolo di Grazia) (Italian, Siena 1398–1482 Siena), The Creation of the World and the Expulsion from Paradise, 1445, Robert Lehman Collection, The Metropolitan Museum of Art.

Looking across the scholarly publishing sector, there are many delivery platforms, representing a diversity of models, such as ACS Publications, PLOS One, Project Muse, and ScienceDirect. In conducting their research, scholars and students find that the voyage from discovery to access can frequently be tortuous. The need to use multiple content platforms adds to their cognitive burden.

But in addition to these publisher platforms, fragmented though they are for content delivery to institutional customers, there are a number of services that gather up “all” of the content from across publishers. I will label them “comprehensive fulltext” services in this post. Comprehensive, because each treats comprehensiveness as a serious objective, regardless of its current degree of coverage. Fulltext, because each contains substantial actual content, not just metadata describing that content, regardless of whether it displays or delivers this fulltext.

Most of this post focuses on services that are designed not to interfere with institutional sales but rather to be complementary with site-licensing models. The range of businesses that can be built when one has available “all” publisher content is striking, and it is notable that so many publishers have been willing to re-license their content for a variety of secondary or complementary uses.

In what follows, I seek to offer a basic overview of the landscape of comprehensive fulltext services. If any categories or services are missing, please add them in the comments.

Discovery Services

A decade ago, librarians expressed substantial concern about Google capturing the discovery starting point role from their catalogs. Several vendors responded with services designed to search the library-licensed content beyond just the books in the catalog, providing a “Google-like” search box on the library website. To create these services, the vendors had to index the metadata and, wherever possible, the fulltext of an array of digital resources. As a result, these indices now have the full content of an array of publishers and content providers, including a great deal of the scientific literature, as well as ebooks, digitized books, newspapers, primary sources, images, audio, video, and so forth.  

There are four services that fall into this category, owned by three companies. The not-for-profit OCLC provides WorldCat Discovery. Two other companies, each of which has its own content platform, provide the other three services. EBSCO provides the EBSCO Discovery Service. ProQuest provides both Summon and Primo Central, which are sold as separate products but have been moving towards a common underlying index, leaving it a little unclear whether there are three or four indices at this moment across the four products.

It is not clear why these discovery services have to date focused exclusively on search rather than expanding to the panoply of other types of discovery that are growing in importance, such as feed-based personalization and alerts. They required substantial up-front investment but are now beginning to generate a healthy return given ongoing library payments.

There is an argument for placing the citation databases in this category, but ultimately they do not belong. Scopus and Web of Science alike receive fulltext content from a wide range of publishers in print and electronic form, using it for the sole purpose of generating their metadata-based indices of the scholarly literature. Because their indices do not contain fulltext, no fulltext searching is possible. They have chosen to remain in the citation database category rather than expand into the discovery service terrain. Although they may acquire fulltext as an intermediate step in creating their metadata, it is discarded rather than retained, and so they cannot be considered comprehensive fulltext services.

Consumer Technology Services

Several major consumer internet services, such as Bing and Google in the US, index the fulltext of most publisher websites. There have been a variety of mechanisms for enabling these services to create these indices, but today the “bots” that gather these indices are typically allowed to crawl publisher platforms via a recognized IP address. While they meet the definition of comprehensive fulltext services, their aspirations go far beyond academia and scientific publishing.

Two consumer internet services have also created dedicated scholarly discovery services. Of these services, Google Scholar drives by far the largest amount of traffic, while Microsoft Academic has experimented with a more sophisticated array of features. Like the library discovery services, these scholarly discovery services ingest metadata and full-text from a variety of publishers and other providers, although without quite the same range of format coverage. In addition to search, they also provide personalization including alerts, citation estimates, and certain other types of analytics.

One service stands out for its emphasis on the use of artificial intelligence and is placed in this category somewhat provocatively. Meta gathers fulltext from many publishers and has used this for a number of purposes, including feed-based personalized discovery, predictive modeling to support editorial decision-making, and term extraction regarding experimental methods and substances. Meta’s recent acquisition by the Chan Zuckerberg Initiative, the philanthropic vehicle of the Facebook founder, has been rightly celebrated for its potential contributions to open science. Given Meta’s strong efforts towards personalization, it will be interesting to follow whether any connections are developed with Facebook’s identity graph.

Preservation Services

Libraries have long been responsible for the preservation of print materials. In the transition to digital formats, they worked with publishers to develop shared third-party custody of publications. These services operate with different models, but the objective is to provide substantial coverage of the scholarly literature for preservation purposes. Following a trigger event, access is enabled.

Two services aspire to comprehensiveness and work to provide a global solution. CLOCKSS is operated by a not-for-profit organization with a board composed of librarians and publishers, while Portico is a service of the not-for-profit ITHAKA (which employs me through its Ithaka S+R service). Both ingest scholarly publications from numerous publishers, are supported by a combination of libraries and publishers, and use different technical models to provide geographically distributed preservation that has merited third-party certification. Their preservation databases include the complete fulltext of tens of thousands of journals and books from hundreds of publishers. Like the discovery services, these preservation services have focused on a key strategic issue and have steadily built out their coverage within a single service model.

Another model of a preservation service is that offered by OCUL’s Scholars Portal, a platform for preservation of and access to a variety of scholarly content. Based on the collection development decisions of the OCUL consortium, Scholars Portal loads publisher materials into its own infrastructure, which has received third-party preservation certification. Researchers from OCUL member institutions can seamlessly discover and access publisher and other content on a common platform with a standard user interface. Ultimately, Scholars Portal is curated based on the collection development choices of the consortium, which some might argue somewhat reduces its aspirations towards comprehensiveness, but given the consortium in question it certainly comes close.

Plagiarism Detection

Networked information services have increased concerns about plagiarism among students and academics alike. One prominent solution has been to build a massive database of everything publicly available on the internet and as much published content as possible, against which manuscripts and student research papers can be assessed for possible plagiarism. Marketed as iThenticate for publishers and as TurnItIn for student papers, this service is also offered as CrossCheck to scholarly publishers that regularly contribute their content to the database in exchange for a discount on checking their own submissions.

Article Delivery

Several article delivery services have built substantial coverage of the scientific literature without directly competing with publishers’ institutional licensing approaches. There are several models that fall under this heading. Each of these is probably more fragmented in its coverage than the models discussed in the foregoing categories.

Institutionally oriented services provide document delivery, sometimes with no lag at all, as part of a workflow for articles that fall outside licensed collections. These models carry high per-article fees and are especially widely adopted at smaller academic libraries and at corporations whose relatively modest usage cannot justify building substantial collections on a site-licensing basis. In this category, ReprintsDesk offers Article Galaxy and CCC offers a similar service as part of RightFind. In some cases, these services are being integrated directly into researcher workflows via a link resolver.

Another model is DeepDyve, which is pitched more to researchers than to libraries. For a modest subscription, researchers or research groups get unlimited on-screen reading, with the ability to print or download fulltext for additional fees. It is interesting that printing and downloading remain such powerful differentiators.

Of course, there is also the model of the open-access repository. Although most repositories cover only a small portion of the literature, PubMed Central probably contains the highest share of the medical literature.

Populated By Users (Or Pirates)

All of the foregoing comprehensive fulltext services are populated by publishers and content providers under contractual agreements. There is also a category of services that are assembled using other means. As a result, even if they contain enormous amounts of fulltext, they probably cannot be comprehensive in as systematic a way as some of the categories described above. Some of the models in this section are widely recognized as legitimate, others widely recognized as illegal, and some are in a gray zone of multiple perspectives.

Citation management tools have transitioned from desktop applications to cloud-based services where users can store not only reference information but also notes and PDFs. Elsevier’s Mendeley is probably the most widely used such service, at least among scientific researchers, and at one point it claimed to possess the largest database of scholarly content as a result. ProQuest’s RefWorks and the open-source Zotero are two other significant offerings in this space.

As a result of features beyond its basic citation management functions, Mendeley has also joined a class of services sometimes called scholarly collaboration networks. Some of these services encourage researchers to upload their own research and publications, and are building substantial databases of the scholarly literature as a result. ResearchGate and Academia.edu are the most prominent for having done so with the published scholarly literature.

There are also a set of unambiguously pirate efforts. Of these, Sci-Hub and its LibGen database, which contains the vast majority of scholarly publications, have received the greatest attention over the past year.

Reflections

Looking across the categories mentioned above, there are a solid 20 comprehensive fulltext services. Of course, none of these aggregations is completely comprehensive. Some lack the participation of selected publishers. Some exclude certain content types. Some have been able to attract only metadata, rather than fulltext, from selected publishers. Others lack sufficient user submissions or, in the pirate case, sufficient compromised credentials.

Of the publisher-licensed services, all have limited rights in how they can use the fulltext content. Still, each has successfully earned the trust of publishers. There is a remarkable array of businesses being built atop comprehensive fulltext.

Populating these comprehensive fulltext services requires meaningful ongoing work by publishers to export their content and by the services to normalize content into a common index. Looking across the group here, it is easy to imagine that there could be ways to drive efficiencies and other improvements.

The growth of user-contributed and pirate sites is noteworthy. Against this backdrop, the publisher-licensed comprehensive fulltext services may have a variety of strategic dilemmas.

Finally, it may be worth considering the purposes for which there might be alternatives to the comprehensive fulltext service. CHORUS and SHARE are both pursuing alternative visions that may serve some, but perhaps not all, of the functions outlined above.

I thank Lisa Hinchliffe, Marie McVeigh, Bill Parks, and Tom Reller for their discussion and direct assistance that contributed to this piece.

Roger C. Schonfeld

Roger C. Schonfeld is the vice president of organizational strategy for ITHAKA and of Ithaka S+R’s libraries, scholarly communication, and museums program. Roger leads a team of subject matter and methodological experts and analysts who conduct research and provide advisory services to drive evidence-based innovation and leadership among libraries, publishers, and museums to foster research, learning, and preservation. He serves as a Board Member for the Center for Research Libraries. Previously, Roger was a research associate at The Andrew W. Mellon Foundation.

Discussion

2 Thoughts on "Revisiting: Who Has All The Content?"

I deeply appreciate this thorough and nuanced exploration of the scientific literature availability and indexation landscape. This invaluable breakdown, which thoughtfully considers all key players, helps illuminate the mechanisms underlying the flow of research information.

However, I think there might be some conflation of concepts here:

1) Open Access vs. Full-Text Exports: Could the issue of comprehensive access to full text articles from a central source like PubMed Central be more about the mixture of open access content and proprietary full-text exports provided to certain repositories or indexing entities, rather than inconsistencies in published content?

2) Normalization of Indexed Data: Is the challenge really about the normalization of different data formats? With the widespread use of the JATS XML standard in publishing, consistency seems achievable. Or, is the real issue that many downstream indexers process and organize this content in their preferred ways, leading to a perceived lack of normalization? This could be less of a technical hurdle and more of a procedural disparity due to varied indexing methodologies.

Thank you, Roger, for raising this post for a revisit — improving human experiences with scholarly information is always a favorite topic! But I am confused and a little concerned about your personification of “the machines” as having monolithic needs, wants, and preferences. I would think that there are many machines (e.g., search engines, recommendation tools, composition platforms) that are indeed trained and designed to “read” content from only a single publisher. So, I wonder if you would be able to be more specific about which machines you see as designed to aggregate all scholarly content, and why these are the prevailing priorities when it comes to content search and discovery for today’s researchers?