Our contemporary media landscape is characterized by fragmentation. Every publisher seemingly has its own platform, and users must learn to navigate the idiosyncrasies of each. If you know how to read The New York Times in print, you won’t have much trouble reading The Washington Post either, but if you try to use their mobile apps you’ll see that it’s as if they are in different industries. The TV guide and movie listings in your local newspaper once told you what you could watch during the upcoming week, but if you rely on streaming services today you’ll find it difficult to determine what is available where across the main platforms: Amazon, Hulu, and Netflix. In our sector, the digital transformation has been no less powerful, but fragmentation is no less problematic. Or is it?

Giovanni di Paolo (Giovanni di Paolo di Grazia) (Italian, Siena 1398–1482 Siena), The Creation of the World and the Expulsion from Paradise, 1445, Robert Lehman Collection, The Metropolitan Museum of Art.

Looking across the scholarly publishing sector, there are many delivery platforms, representing a diversity of models, such as ACS Publications, PLOS One, Project Muse, and ScienceDirect. In conducting their research, scholars and students find that the voyage from discovery to access can frequently be tortuous. The need to use multiple content platforms adds to their cognitive burden.

But in addition to these publisher platforms, fragmented though they are for content delivery to institutional customers, there are a number of services that gather up “all” of the content from across publishers. I will label them “comprehensive fulltext” services in this post. Comprehensive, because each of them treats comprehensiveness as a serious objective, regardless of how comprehensive it currently is. Fulltext, because each of them contains substantial actual content, not just metadata describing the content, regardless of whether they display or deliver this fulltext.

Most of this post focuses on services that are designed not to interfere with institutional sales but rather to be complementary with site-licensing models. The range of businesses that can be built when one has available “all” publisher content is striking, and it is notable that so many publishers have been willing to re-license their content for a variety of secondary or complementary uses.

In what follows, I seek to offer a basic overview of the landscape of comprehensive fulltext services. If any categories or services are missing, please add them in the comments.

Discovery Services

A decade ago, librarians expressed substantial concern about Google capturing the discovery starting point role from their catalogs. Several vendors responded with services designed to search the library-licensed content beyond just the books in the catalog, providing a “Google-like” search box on the library website. To create these services, the vendors had to index the metadata and, wherever possible, the fulltext of an array of digital resources. As a result, these indices now have the full content of an array of publishers and content providers, including a great deal of the scientific literature, as well as ebooks, digitized books, newspapers, primary sources, images, audio, video, and so forth.  

There are four services that fall into this category, owned by three companies. The not-for-profit OCLC provides WorldCat Discovery. Two other companies, each of which has its own content platform, provide the other three services. EBSCO provides the EBSCO Discovery Service. ProQuest provides both Summon and Primo Central, which are sold as separate products but have been moving towards a common underlying index, leaving it a little unclear whether there are three or four indices at this moment across the four products.

It is not clear why these discovery services have to date focused exclusively on search rather than expanding to the panoply of other types of discovery that are growing in importance, such as feed-based personalization and alerts. They required substantial up-front investment but are now beginning to generate a healthy return given ongoing library payments.

There is an argument for placing the citation databases in this category but ultimately they do not belong. Scopus and Web of Science alike receive fulltext content from a wide range of publishers in print and electronic form, using this for the sole purpose of generating their metadata-based indices of the scholarly literature. Their indices do not contain fulltext and so no fulltext searching is possible. They have made the choice to stay in the citation database category rather than expand outward into the discovery service terrain. Although they may acquire fulltext as an intermediate step for the purpose of creating their metadata, it is discarded rather than retained, and as such they cannot be considered comprehensive fulltext services.

Consumer Technology Services

Several major consumer internet services, such as Bing and Google in the US, index the fulltext of most publisher websites. There have been a variety of mechanisms for enabling these services to create these indices, but today the “bots” that gather these indices are typically allowed to crawl publisher platforms via a recognized IP address. While they meet the definition of comprehensive fulltext services, their aspirations go far beyond academia and scientific publishing.

Two consumer internet services have also created dedicated scholarly discovery services. Of these services, Google Scholar drives by far the largest amount of traffic, while Microsoft Academic has experimented with a more sophisticated array of features. Like the library discovery services, these scholarly discovery services ingest metadata and full-text from a variety of publishers and other providers, although without quite the same range of format coverage. In addition to search, they also provide personalization including alerts, citation estimates, and certain other types of analytics.

One service stands out for its emphasis on the use of artificial intelligence and is placed in this category somewhat provocatively. Meta gathers fulltext from many publishers and has used this for a number of purposes, including feed-based personalized discovery, predictive modeling to support editorial decision-making, and term extraction regarding experimental methods and substances. Meta’s recent acquisition by the Chan Zuckerberg Initiative, the philanthropic vehicle of the Facebook founder, has been rightly celebrated for its potential contributions to open science. Given Meta’s strong efforts towards personalization, it will be interesting to follow whether any connections are developed with Facebook’s identity graph.

Preservation Services

Libraries have long been responsible for the preservation of print materials. In the transition to digital formats, they worked with publishers to develop shared third-party custody of publications. These services operate with different models, but the objective is to provide substantial coverage of the scholarly literature for preservation purposes. Following a trigger event, access is enabled.

Two services aspire towards comprehensiveness and work to provide a global solution. CLOCKSS is operated by a not-for-profit organization with a board composed of librarians and publishers, while Portico is a service of the not-for-profit organization ITHAKA (which employs me through its Ithaka S+R service). Both ingest scholarly publications from numerous publishers, are supported by a combination of libraries and publishers, and use different technical models to provide geographically distributed preservation meriting third-party certification. Their preservation databases include the complete fulltext of tens of thousands of journals and books from hundreds of publishers. Like discovery services, these preservation services have focused on addressing a key strategic issue and have steadily built out their coverage for a single service model.

Another model of a preservation service is that offered by OCUL’s Scholars Portal, a platform for preservation of and access to a variety of scholarly content. Based on the collection development decisions of the OCUL consortium, Scholars Portal loads publisher materials into its own infrastructure, which has received third-party preservation certification. Researchers from OCUL member institutions can seamlessly discover and access publisher and other content on a common platform with a standard user interface. Ultimately, Scholars Portal is curated based on the collection development choices of the consortium, which some might argue reduces somewhat its aspirations towards comprehensiveness, but given the consortium in question it certainly comes close.

Plagiarism Detection

Networked information services have increased concerns about plagiarism among students and academics alike. One prominent solution has been to build a massive database of everything publicly available on the internet and as much published content as possible, against which manuscripts and student research papers can be assessed for possible plagiarism. This service is marketed to publishers as iThenticate and for student papers as TurnItIn. It is also offered as CrossCheck to scholarly publishers that regularly contribute their content to the database in exchange for a discount on checking their own submissions.

Article Delivery

Several article delivery services have built substantial coverage of the scientific literature without directly competing with publishers’ institutional licensing approaches. There are several models that fall under this heading. Each of these is probably more fragmented in its coverage than the models discussed in the foregoing categories.

Institutionally oriented services provide document delivery, sometimes without any lag, as part of a workflow for articles that are not part of licensed collections. These models involve high per-article fees and are especially widely adopted at smaller academic libraries and corporations that cannot afford to build substantial collections on a site-licensing basis due to relatively small usage. In this category, ReprintsDesk offers Article Galaxy and CCC offers a similar service as part of RightFind. In some cases, these services are being integrated directly into researcher workflows using a link resolver.

Another model is DeepDyve, which is pitched to researchers more so than to libraries. For a modest subscription, researchers or research groups have unlimited on-screen reading, with the ability to print or download fulltext for additional fees. It is interesting that printing and downloading remain such powerful differentiators.

Of course, there is also the model of the open-access repository. Although most repositories cover only a small portion of the literature, PubMed Central probably contains the highest share of the medical literature.

Populated By Users (Or Pirates)

All of the foregoing comprehensive fulltext services are populated by publishers and content providers under contractual agreements. There is also a category of services that are assembled using other means. As a result, even if they contain enormous amounts of fulltext, they probably cannot be comprehensive in as systematic a way as some of the categories described above. Some of the models in this section are widely recognized as legitimate, others widely recognized as illegal, and some are in a gray zone of multiple perspectives.

Citation management tools have transitioned from desktop applications to cloud-based services where users can store not only reference information but also notes and PDFs. Elsevier’s Mendeley is probably the most widely used such service, at least among scientific researchers, and at one point it claimed to possess the largest database of scholarly content as a result. ProQuest’s RefWorks and the open-source Zotero are two other significant offerings in this space.

As a result of other features beyond its basic citation management functions, Mendeley has also joined a class of services sometimes called scholarly collaboration networks. Some of these services encourage researchers to upload their own research and publications and are building substantial databases of the scholarly literature as a result of doing so. ResearchGate and Academia.Edu are most prominent for having done so with respect to the published scholarly literature.

There is also a set of unambiguously pirate efforts. Of these, Sci-Hub and its LibGen database, which contains the vast majority of scholarly publications, have received the greatest attention over the past year.

Reflections

Looking across the categories mentioned above, there are a solid 20 comprehensive fulltext services. Of course, none of these aggregations is completely comprehensive. Some lack selected publisher participation. Some exclude certain content types. Some have only been able to attract metadata rather than full text from selected publishers. Others lack user submissions or sufficient pirated credentials.

Of the publisher-licensed services, all have limited rights in how they can use the fulltext content. Still, each has successfully earned the trust of publishers. There is a remarkable array of businesses being built atop comprehensive fulltext.

Populating these comprehensive fulltext services requires meaningful ongoing work by publishers to export their content and by the services to normalize content into a common index. Looking across the group here, it is easy to imagine that there could be ways to drive efficiencies and other improvements.

The growth of user-contributed and pirate sites is noteworthy. Against this backdrop, the publisher-licensed comprehensive fulltext services may have a variety of strategic dilemmas.

Finally, it may be worth considering the purposes for which there might be alternatives to the comprehensive fulltext service. CHORUS and SHARE are both pursuing alternative visions that may serve some, but perhaps not all, of the functions outlined above.

I thank Lisa Hinchliffe, Marie McVeigh, Bill Parks, and Tom Reller for their discussion and direct assistance that contributed to this piece.

Roger C. Schonfeld

Roger C. Schonfeld is director of Ithaka S+R’s Library and Scholarly Communication program. He leads a team of methodological experts and analysts that provides strategic consulting, surveys, and other research projects, for academic libraries, scholarly publishers and intermediaries, museums, and learned societies. Previously, Roger was a research associate at The Andrew W. Mellon Foundation.

Discussion

18 Thoughts on "Who Has All the Content?"

I would like to point out ScienceOpen in this context. We currently track more than 28 million journal article records (with over 3 million full text open access articles) across all publishers, disciplines and license types. We offer article-level metrics, context building and curation within a search and discovery environment with strong filtering tools such as citation, Altmetric score, and usage. For our publisher customers the “read” button links back to the version of record on their website. Overlayed over the platform is a suite of social networking tools that allow users to interact with the content (recommend, comment, peer-review). We are freely accessible from anywhere, because the main currency of our current system is usage and we can support individual authors, editors, journals and publishers to get more visibility for their work in the digital space.

thanks Roger for this work.

What is not clear is where the “grey” literature appears in these systems. This would include:
-government agencies such as the US Department of Agriculture and others
-think tanks such as consulting firms and research organizations
-policy organizations and associations that do not publish in traditional journals
-industrial publications such as those listed in Standard Rates and Data series
-international agencies such as World Bank, United Nations,
-others

I understood Roger to mean scholarly articles as his listing also doesn’t address the major “comprehensive” fulltext book databases either (Amazon and Google Books come to mind). But, to your question, I believe that iThenticate is the system that has the greatest coverage of the items you mention. Whether iThenticate could ever use the content it has to disrupt other types of fulltext deployment in this helpful taxonomy from Roger is something I’ve wondered about for years but not had any luck soliciting perspectives via Twitter (so I welcome anyone who wants to share their thoughts!). I presume all of these providers have the content they have under some sort of license-not-purchase agreement, just as libraries do, and that making use of it for a new purpose may or may not be so easy.

thanks lisa

the issue is of concern since much research is being done by parties other than academics and it is published in literature that is outside of the standard scholarly journals. Given the recent revelation that many parties mine this literature, as well as information such as personal data on sites such as Facebook, to carry out what would normally be classified as academic research but is published or presented elsewhere, as we have seen during the recent US elections, this literature, grey by label, may be of more than scholarly interest.

Additionally, there are a number of proprietary and public database search engines that work across all literature and publish this, often as meta data linked to the full texts. These are able to also search text and not just titles or abstracts when accessible.

Thus, there are ramifications for the scholarly journal publishers, if the value to academics may lead to other venues in time.

Yewno is certainly one to keep an eye on for a novel method of discovering works, ScienceOpen is growing, too, and Sparrho has a novel Pinboard-like method for dissemination of works, drawing from its large index. I’m wondering where Crossref and SHARE fit in this scheme?

I would also like to point out, by the way ;-), that Mendeley presents some interesting discovery services via their recommender tools: https://www.mendeley.com/suggest/

I think I may have caused some misunderstanding, as reflected within some comments, about the scope that I intended to cover in this piece. CrossRef does not contain any fulltext at all. SHARE links together repositories (as I mention briefly at the end). ScienceOpen contains metadata only for non-open materials. Each of these services is a valuable addition to the landscape, but none is mounting a realistic effort to build a database containing the **fulltext** of **all** scholarly publications. I think Yewno may be trying to do so and probably should appear with Meta (although it remains venture funded). I can also see the argument for including Google Books, although I’m not sure its regular updates include very much of the non-monograph literature. In this piece I was trying to compare underlying databases, roughly, in terms of size and scope. Strategic issues may flow from that, regardless of how content is currently licensed or used.

Thanks for the further clarification. Please note that when you starred the words in your post you did not star “scholarly”. Yes, this is the scholarly kitchen but some of us work across the construct called “disciplines” and also across the grey literature which feeds into the scholarly literature and is growing as many scholars are increasingly seeking to address a wider audience than pub/perish peers. Given the suggested additions to the list, which, agreed, don’t all fit in your narrow cast list, and looking at what is happening within that increasingly shifting community, globally, except for the top ranked academic journals, the number of venues and academic journals is increasing but the criteria for acceptance start to become problematic and, for some who use these articles, the cost/bit of such information starts to look increasingly inflated. Hence the need for big data search engines that can strip out the potential value from the self-serving persiflage.

The contemporary lists of such compilations are important for today, but what these lists say about the direction of the industry coupled with other elements may be significant.

Tom, thanks for this. I didn’t intentionally exclude any services that contain a superset of the group in my definition. In fact, you’ll note that I included Google and Bing (not just Scholar and Academic) in acknowledgment that scholarly services operate within a larger context of services and, as you point out, of needs. So, if you’d like to identify any other “superset” services that meet my definition, I’d love to be informed about them. I’m not sure what that says about “the industry” other than an effort to discuss and learn and grow together.

CrossRef is interesting in this regard though, because they do represent a different approach to accessing full-text across publishers with text- and datamining services based on the CrossRef Metadata API (http://tdmsupport.crossref.org).

And another one: BASE provides (mostly) full-text search across the content of many repositories.

I think that the library discovery tools show some signs of moving beyond the search. In particular, I’m thinking of some Ex Libris tools, like bX, which provides recommendations. The new UI for Primo also allows for saved searches to generate alerts, and there is a reading list functionality called Leganto that I think has the potential to either inform bX in a big way or otherwise provide information about stuff that should be clustered together by topic.

Many libraries that have taken hard stances on privacy are also seeing increased administrative pressure regarding the tracking of usage data at the individual level. If institutional willingness to track individual data increases, it could give rise to more personalization.

Thanks for the overview, Roger. You do touch on this but it’s worth reiterating that delivery and access challenges remain a most important topic beyond who has indexed how much content. The ability to provide accurate rights management and navigable linking (fair, unbiased, legal access) to content users should have access to is a critical component and value of the discovery service. Indexing content is just the starting point. Then this can lead to meaningful data-driven services such as personalization and recommendations. Yes, more to come…
