There are tremendous opportunities for using data to bring greater efficiency and more effective discovery to the research process. Publishing platforms, discovery services, and academic libraries are all in a position to make innovative uses of data that are the by-product of everyday research practices of scholars and students.
The benefits of personalizing discovery are already playing themselves out in the consumer space, from anticipatory discovery services like Google Now to integrations of shared and personal collections through browser plugins. As in the consumer space, the implications of changing discovery processes will cascade through the scholarly ecosystem. At Ithaka S+R, we have been exploring how control of content discovery is shifting, and I served on NISO’s working group to bring more transparency to the discovery black box. I am interested in analyzing the opportunities before us in the scholarly ecosystem to embrace the use of data to personalize the research process.
Some academic librarians express real skepticism about metrics and data. There is understandable concern for protecting privacy, while the benefits have seemed dubious since many of the best examples of exploiting user data involve selling advertising (as Google and Facebook have so effectively demonstrated). In my view, the most promising opportunity to protect private data is for the scholarly community to build strong systems for data usage and its limitation under the ultimate control of academia.
All digital platforms have extensive usage data, the trail left by user actions on a site. Without care, these data about individual actions are difficult to parse, and even creating the basic COUNTER-mandated usage data requires thoughtful processing. These anonymous usage data can also be processed in other ways, allowing for the creation of various types of recommender systems, such as HighWire Press’s popular and related articles features. Publisher platforms are not the only, and perhaps not the best, source of usage data for these purposes. Ex Libris’s bX service has illustrated how recommender systems can be built at a cross-platform level, using data from the library proxy servers that make off-campus access possible at many campuses for licensed electronic resources.
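To make the idea of a usage-based recommender concrete, here is a minimal sketch of one common approach: counting which articles are accessed together in the same anonymous session and recommending the most frequent companions. This is an illustration only, not a description of how HighWire Press or bX actually work. The session data, article identifiers, and function names (`build_co_access_counts`, `related_articles`) are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_access_counts(sessions):
    """Count how often each pair of articles appears in the same session.

    `sessions` is an iterable of lists of article IDs. No user identity
    is needed, only the grouping of accesses into sessions.
    """
    co_counts = defaultdict(Counter)
    for articles in sessions:
        # De-duplicate so repeat views within one session count once.
        unique = sorted(set(articles))
        for a, b in combinations(unique, 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts


def related_articles(co_counts, article_id, top_n=3):
    """Return up to `top_n` articles most often co-accessed with `article_id`."""
    neighbours = co_counts.get(article_id, {})
    # Sort by count (descending), then by ID so ties are deterministic.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other_id for other_id, _ in ranked[:top_n]]


# Hypothetical anonymous sessions, each a list of article IDs viewed together.
sessions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "D"],
    ["A", "C"],
]

co_counts = build_co_access_counts(sessions)
print(related_articles(co_counts, "A"))
print(related_articles(co_counts, "B"))
```

A production system would presumably need to filter robot traffic, weight by recency, and normalize for very popular articles, none of which this sketch attempts.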
These generic recommendations have real value to researchers. But what if they could be personalized to the interests and needs of individuals? If this were to allow us to improve the efficiency of the discovery process without sacrificing serendipity (the latter of which I will examine in a subsequent post), it would have real value for scholarship.
Many content platforms allow or encourage users to create an account, almost always including keyword-based content-alerting and article sales features. ScienceDirect offers a direct linkage with the RefWorks citation management tool. Taylor & Francis offers a pairing feature between the user account and a mobile device. JSTOR offers non-institutional access through metered reading and subscription passes. But these accounts are not generally used to personalize the platform’s search tools, let alone to provide forms of discovery that anticipate a scholar expressing research needs. This is to some degree understandable. The scale of the data required to deliver high-quality discovery personalization is absent at all but the largest content platforms. Additionally, even some of the largest publisher platforms may not have sufficient starting-point activity to merit the investment.
But looking beyond publisher platforms, other services are beginning to position themselves with the data necessary to drive anticipatory discovery. Google Scholar’s MyUpdates feature anticipates publications that might be of interest to a given individual, based on one’s publication history. Mendeley has expressed an interest in developing recommender systems using its extensive data and offering pull (anticipatory) methods of discovery. These products, much loved though they have already become, are at the very beginning of a development pathway for discovery. Eventually, some will incorporate a variety of additional types of signals to deliver better results. They will also learn to deliver results when they are most useful, and not just when they are new. Because these products are targeted to researchers rather than licensed through libraries, they are sidestepping many of the concerns expressed about data security and privacy.
It is interesting to think about who gains and who is threatened by where data are being exploited in the service of users. Academic libraries are relatively weakened if Google (through Scholar) and Elsevier (through Mendeley) know far more about the research habits and needs of their scholars than they do. And smaller publishers and platforms, unable to offer advanced discovery services, will also find that they have only limited data to customize their users’ experience in a variety of other ways.
This might play itself out in several ways, which may not be mutually exclusive. First, the benefits of scale in providing data-driven services could result in another set of pressures for consolidation towards the largest publisher platforms. It is therefore notable that the recent merger of Springer and Macmillan units excluded the Digital Science data-driven services that are the best positioned to take advantage of this scale. This may be just a financial move for the short term, given the investments still being made in Digital Science, but perhaps this split is also strategic.
Perhaps instead of being tied to publisher scale and publisher strategy, the data services portfolios developed by Elsevier, Macmillan, and others will be spun off as cross-publisher services. In that type of scenario, it might make more sense for an academic service already replete with user accounts to repurpose its accounts to allow login and data exchange with other publisher services. Think about the social login buttons allowing the use of Facebook, Google, and Twitter account credentials to sign on to other services. Could a scholar’s credentials from Mendeley or ReadCube, for example, be used to simplify authentication and authorization to a variety of other scholarly services, while also allowing academic users to transport selected data about themselves in a controlled fashion to enable a variety of advanced services? As is the case with the consumer web platforms, a winner-take-all dynamic may be the consequence for those that can establish themselves in this sense as a data platform.
EBSCO, Ex Libris, OCLC, and ProQuest provide discovery services that have been widely adopted by academic libraries as search starting points, alongside curated content platforms and/or sister library services. These entities develop extensive usage data, some of it linked to one or more user accounts. If these services could develop a single sign-on across their offerings, the personalization they could provide for discovery and in other services might be quite impressive.
The least likely, but most intriguing, possibility is that academic libraries and scholarly publishers find a way to build a common community standard or service that allows users to control their own data and carry it with them from platform to platform. Such a vision faces real challenges in terms of privacy, security, and finding an alignment of interests between academic libraries, university IT departments, scholarly publishers, and discovery services. But there would be real benefit in exploring this possibility before it is foreclosed on.