An early discovery tool (pre-Google Maps). 1599 map of Arctic exploration by Willem Barentsz in his third voyage, image via Wikipedia.

There are tremendous opportunities for using data to bring greater efficiency and more effective discovery to the research process. Publishing platforms, discovery services, and academic libraries are all in a position to make innovative uses of data that are the by-product of everyday research practices of scholars and students.

The benefits of personalizing discovery are already playing themselves out in the consumer space, from anticipatory discovery services like Google Now to integrations of shared and personal collections through browser plugins. As in the consumer space, the implications of changing discovery processes will cascade through the scholarly ecosystem. At Ithaka S+R, we have been exploring how control of content discovery is shifting, and I served on NISO’s working group to bring more transparency to the discovery black box. I am interested in analyzing the opportunities before us in the scholarly ecosystem to embrace the use of data to personalize the research process.

Some academic librarians express real skepticism about metrics and data. There is understandable concern for protecting privacy, while the benefits have seemed dubious since many of the best examples of exploiting user data involve selling advertising (as Google and Facebook have so effectively demonstrated). In my view, the most promising opportunity to protect private data is for the scholarly community to build strong systems for data usage and its limitation under the ultimate control of academia.

All digital platforms have extensive usage data, left by the trail of user actions on a site. Without care, these data about individual actions are difficult to parse, and even creating the basic COUNTER-mandated usage data requires thoughtful processing. These anonymous usage data can also be processed in other ways, allowing for the creation of various types of recommender systems, such as HighWire Press’s popular and related articles features. And, publisher platforms are not the only, and perhaps not the best, source of usage data for these purposes. Ex Libris’s bX service has illustrated how recommender systems can be built at a cross-platform level, using data from the library proxy servers that make off-campus access possible at many campuses for licensed electronic resources.
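To make the idea concrete, here is a minimal sketch of how anonymous usage data might feed "popular" and "related" article features. It is only an illustration under stated assumptions, not a description of how HighWire, bX, or any other service works: the session lists, the article identifiers, and the function names `related_articles` and `popular_articles` are all hypothetical, and the technique shown is simple co-access counting within a session.

```python
from collections import Counter, defaultdict
from itertools import combinations


def related_articles(sessions, top_n=3):
    """Map each article to the articles most often used in the same session.

    `sessions` is a hypothetical list of anonymous sessions, each one a list
    of article identifiers. No user identity is needed for this step.
    """
    co_counts = defaultdict(Counter)
    for session in sessions:
        # Count each unordered pair once per session, ignoring repeat views.
        for a, b in combinations(sorted(set(session)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return {
        article: [other for other, _ in counts.most_common(top_n)]
        for article, counts in co_counts.items()
    }


def popular_articles(sessions, top_n=3):
    """Rank articles by the number of sessions in which they were used."""
    views = Counter(
        article for session in sessions for article in set(session)
    )
    return [article for article, _ in views.most_common(top_n)]


# Toy, made-up usage data: four anonymous sessions.
sessions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "D"],
    ["A", "B", "D"],
]

related = related_articles(sessions)
popular = popular_articles(sessions)
```

In this toy data, B is the article most often used alongside A, and B is also the most widely used article overall. A real service would need far more data, and would have to decide what counts as a session and how to filter out robots, which is part of why even basic usage reporting requires thoughtful processing.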

These generic recommendations have real value to researchers. But what if they could be personalized to the interests and needs of individuals? If this were to allow us to improve the efficiency of the discovery process without sacrificing serendipity (the latter of which I will examine in a subsequent post), it would have real value for scholarship.

Many content platforms allow or encourage users to create an account, almost always including keyword-based content-alerting and article sales features. ScienceDirect offers a direct linkage with the RefWorks citation management tool. Taylor and Francis offers a pairing feature between the user account and a mobile device. JSTOR offers non-institutional access through metered reading and subscription passes. But, these accounts are not generally used to personalize the platform’s search tools, let alone to provide forms of discovery that anticipate a scholar expressing research needs. This is to some degree understandable. The scale of the data required to deliver high-quality discovery personalization is absent at all but the largest content platforms. Additionally, even some of the largest publisher platforms may not have sufficient starting point activity to merit the investment.

But looking beyond publisher platforms, other services are beginning to position themselves with the data necessary to drive anticipatory discovery. Google Scholar’s MyUpdates feature anticipates publications that might be of interest to a given individual, based on one’s publications history. Mendeley has expressed an interest in developing recommender systems using its extensive data and offering pull (anticipatory) methods of discovery. These products, much loved though they have already become, are at the very beginning of a development pathway for discovery. Eventually, some will incorporate a variety of additional types of signals to deliver better results. They will also learn to deliver results when they are most useful, and not just when they are new. Because these products are targeted to researchers rather than licensed through libraries, they are sidestepping many of the concerns expressed about data security and privacy.

It is interesting to think about who gains and who is threatened by where data are being exploited in the service of users. Academic libraries are relatively weakened if Google (through Scholar) and Elsevier (through Mendeley) know far more about the research habits and needs of their scholars than they do. And smaller publishers and platforms, unable to offer advanced discovery services, will also find that they have only limited data to customize their users’ experience in a variety of other ways.

This might play itself out in several ways, which may not be mutually exclusive. First, the benefits of scale in providing data-driven services could result in another set of pressures for consolidation towards the largest publisher platforms. It is therefore notable that the recent merger of Springer and Macmillan units excluded the Digital Science data-driven services that are the best positioned to take advantage of this scale. This may be just a financial move for the short term, given the investments still being made in Digital Science, but perhaps this split is also strategic.

Perhaps instead of being tied to publisher scale and publisher strategy, the data services portfolios developed by Elsevier, Macmillan, and others will be spun off as cross-publisher services. In that type of scenario, it might make more sense for an academic service already replete with user accounts to repurpose its accounts to allow login and data exchange with other publisher services. Think about the social login buttons allowing the use of Facebook, Google, and Twitter account credentials to sign on to other services. Could a scholar’s credentials from Mendeley or ReadCube, for example, be used to simplify authentication and authorization to a variety of other scholarly services, while also allowing academic users to transport selected data about themselves in a controlled fashion to enable a variety of advanced services? As is the case with the consumer web platforms, a winner-take-all dynamic may be the consequence for those that can establish themselves in this sense as a data platform.

EBSCO, Ex Libris, OCLC, and ProQuest provide discovery services that have been widely adopted by academic libraries as search starting points, alongside curated content platforms and/or sister library services. These entities develop extensive usage data, some of it linked to one or more user accounts. If these services could develop a single sign-on across their offerings, the personalization they could provide for discovery and in other services might be quite impressive.

The least likely, but most intriguing, possibility is that academic libraries and scholarly publishers find a way to build a common community standard or service that allows users to control their own data and carry it with them from platform to platform. Such a vision faces real challenges in terms of privacy, security, and finding an alignment of interests between academic libraries, university IT departments, scholarly publishers, and discovery services. But there would be real benefit in exploring this possibility before it is foreclosed on.

Roger C. Schonfeld

Roger C. Schonfeld is the vice president of organizational strategy for ITHAKA and of Ithaka S+R’s libraries, scholarly communication, and museums program. Roger leads a team of subject matter and methodological experts and analysts who conduct research and provide advisory services to drive evidence-based innovation and leadership among libraries, publishers, and museums to foster research, learning, and preservation. He serves as a Board Member for the Center for Research Libraries. Previously, Roger was a research associate at The Andrew W. Mellon Foundation.


25 Thoughts on "Data for Discovery"

Academicracy or those in academia who have ultimate control over data. It seems to me that data can tell us what someone has done but not necessarily what someone will do in the future.

This is indeed an interesting problem in discovery services. What one has done is often out of date compared to what one is doing, because scholarship is nomadic, constantly pursuing new questions. But lots of people use alerts and we are basically talking about intelligent alerts of various sorts, among other things.

To my mind the important thing is to realize that there are very different needs when it comes to discovery and these require different services. Keeping up with the field requires new content, but the long half life data shows that problem solving often means looking way back in time. And there are several other task types as well, such as finding a thesis topic for one’s student, or a proposal topic that fits an RFP, etc. As always, understanding the human logic of discovery comes before analyzing the data. Personalization means knowing what I am doing and that is not easy. It is, however, fun to try.

Welcome aboard Roger. I’m looking forward to your promised post on serendipity. It’s a really interesting question–have we lost a certain amount of stumbling upon something important by moving to an article level economy driven by directed search?

I also think it’s worth examining the value of things like predictive search in creative processes such as doing original research. So much of what Google is doing seems to reaffirm your beliefs, rather than to challenge them (the “filter bubble”). But in doing research, one is seeking something new and different from what’s already known.

And if the research world is dominated by one service with one algorithm (as the online world trends toward, like with Google, Facebook, etc), then does this homogenize research in some way? If everyone in an area of research is being fed the same information, does this reduce the diversity of ideas and approaches?

I also wonder how well accepted this sort of tailored information will be by researchers. Their expertise lies in digging out information and making new connections. How many will be willing to turn this over to someone else (or someone else’s algorithm)?

The last paragraph of this post should be printed and taped onto the wall of everybody working in scholarly communications. Yes, what we need is a common community standard.

Agreed on the importance of the question posed in the last paragraph. Given the historically zero-sum approach taken by discovery service providers towards collaboration, and the lack of transparency for publishers on the receiving end of discovery usage, I don’t rate the chances of the industry solving this any time soon. The question is whether a standards-based approach can offer sufficient value to all parties to get out of the mud – or offer a compelling route around it.

There’s a trade-off possible between the supply of useful data from institutions/libraries/academics and the receipt back of more valuable services that benefit learning and research. Rather than being seen as weakening academic libraries, this trade-off could give them a valuable role in the arbitrage of data for valuable services.

I’m interested in the upside you see to this tradeoff for libraries. How can they even make effective decisions about what services to procure while understanding less and less (relatively at least) about their own user community?

The potential upside I see is for libraries to layer identity information over usage information and obtain a much better, and more granular, understanding of the needs of their users. For example, whether a faculty member with a critical academic need is generating the usage that is driving subscription renewals or PDA purchases. This should in turn translate into more effective decision making when investing institutional budgets to meet learning and research needs.

Librarians have been collecting information on user needs for as long as libraries have been around, but it’s been based on years of interaction with patrons rather than through processing big data. Developing and applying standards for the carriage of identity information could lead to the development of tools to turn this data into meaningful insights. And, while the whole idea is riven with security and privacy concerns, I don’t think these are insurmountable if the end result is recognized by patrons as delivering a more effective service to them.

Tim, what zero-sum approaches and mud are you referring to? If it is the discovery algorithms, it may be necessary to keep them proprietary lest they be stolen. If not that, then what?

I’m referring to content neutrality. Ideally (at least for users/purchasers), discovery services would be independent of content and would compete on the basis of service e.g. ease of use, effectiveness of their algorithms etc. That’s obviously not the way it’s played out, with some discovery service providers viewing content concessions as a gain to a competitor and a loss to themselves. It feels like mud because everyone’s stuck – no single provider can go ‘neutral’ without the others agreeing to it, so either they all move or no-one moves.

Regarding the paragraph about the need for cross-platform authentication and data exchange, this is part of what becomes possible through use of ORCID: researchers can log in to different systems using ORCID authentication and then choose what pieces of information to transfer between systems.

I agree that ORCID might be one basis for what I was proposing. But it would involve transferring and selectively controlling data that goes well beyond the information about research objects currently managed via ORCID. It would be fascinating to see ORCID step forward into such a role, taking on more characteristics of a fully realized cross-platform researcher account.

Roger, the more I think about this statement the less sensible it seems — “the most promising opportunity to protect private data is for the scholarly community to build strong systems for data usage and its limitation under the ultimate control of academia.” My problem is that neither the “scholarly community” nor “academia” are regulatory bodies. In fact neither is even definable.

How then might these vaguely defined groups create limits and controls on data flows? Can you give an example of what such a control mechanism might look like and how it would be enforced on the discovery industry? As a policy analyst I am mystified and as a discovery system builder I am concerned. Are there precedents for this kind of global information control?

David, by that statement, I meant in the first place to make a basic distinction: between those services that gather user data without any direct accountability to academia – here Google is a fine but by no means the only example – and those that do so with at least some accountability, such as that provided through a vendor/customer relationship.

I understand the distinction, although it is one of degree. But how do you propose to create and enforce the control? Academia per se is neither a customer nor a vendor. My concern is that it is one thing to have standards that facilitate information flow, like Dublin Core. It is quite another to propose standards that restrict information flows. How would that work?

Here is a possible example for discussion, a personalized discovery service that I have been thinking about. A researcher subscribes and thereafter gets an alert every time someone downloads an article of theirs (to the extent that the service has that information). The alert will be as specific as possible about who did the download. If it is an identifiable individual then contact information will be included in the alert. For institutions or organizations there would at least be geographical information, perhaps more. In this way people can know where their ideas are going, as they go.

Would the standards you envision cover something like this? Might they prohibit it, as an invasion of the privacy of the downloaders? Or would there be some sort of approval process?

Back when we were doing online versions of laboratory manuals, we regularly heard from pharma companies that in order to use it, they would need a private version that they could host on their own intranet. The idea was that they didn’t want anyone collecting any information on what techniques they were using, as that information might tip off their competitors. I suspect this level of paranoia has not decreased since then. Enforcement then, would take place at the level of the purchase decision. If you can’t ensure privacy, then you lose a sale.

Academia is a different beast, though there still remains a competitive advantage in secrecy (http://scholarlykitchen.sspnet.org/2010/10/25/openness-and-secrecy-in-science-a-careful-balance/). If I know what my competitor is reading and my competitor does not know what I’m reading, I may gain some advantage. At the same time, many investigators are making their reading lists public through services like Mendeley (and if you want the data you’ve mentioned above, that’s the place to get it). So likely there’s a mix of approaches in the academy and needs may vary.

I was thinking of buying the download data from the publishers and libraries, hence the privacy issue.

Exactly, and having that as a practice would result in some lost sales. One then must project whether the loss of subscriptions and any hit on reputation would be outweighed by the profits made from selling off customer data to third parties.

How would my knowing that, say, people at Yale were reading my articles lead to lost sales? It might do the opposite. My assumption is that researchers would value having this information. Plus seeing how ideas were spreading would be valuable in its own right. So it is also not clear that there would be a reputation hit if it were properly framed. For example the service could be non profit.

The Mendeley idea of searching reading lists by author or article is good but very simple. I am surprised they do not offer that feature.

Think about the reverse. If I’m at Yale, do I want my direct competitor at Harvard knowing exactly what I’m reading, knowing the articles about which techniques and relevant work are feeding into my experiments so that they can potentially duplicate what I’m doing and beat me to the punch? I might not subscribe to your journal if you’re going to sell me out.

Then one can get into governments tracking social science or political science researchers, potentially censoring people who read the “wrong” things.

Yes, I am sure it would be controversial, which is why I used it as an example, as I wanted to hear Roger’s thoughts on control. I am reminded that when I was on a University faculty it seemed that the prevalent emotion among my fellows was jealousy. After all, most researchers are really small businesses. Perhaps you are suggesting that public opinion itself is a control mechanism.

See Sayre’s Law:

Sayre’s law states, in a formulation quoted by Charles Philip Issawi: “In any dispute the intensity of feeling is inversely proportional to the value of the issues at stake.” By way of corollary, it adds: “That is why academic politics are so bitter.”
