Last week, our newest Kitchen contributor Lisa Janicke Hinchliffe raised questions about the RA21 initiative, analyzing how its effort to improve the security of content providers in the face of rampant piracy may have consequences for, among other things, privacy and the future of IP authentication. I strongly agree with her perspective about RA21, and I want to take some column inches today to explain why. It is not so much that I am specifically concerned about the privacy issues of RA21, although I think they deserve scrutiny. My concern is this: In establishing a mission “to align and simplify pathways to subscribed content across participating scientific platforms,” RA21 has scoped its problem the wrong way. Simply put: It’s not about security. It’s about identity. Every individual should be in control of their own identity.

Five men gathered around a table in a private club, in discussion while smoking and drinking, perhaps about issues of authentication and identity
F.C. Yohn, I don’t see why fair play isn’t the thing after he leaves college as much as before, 1907, Library of Congress Prints and Photographs Division.

No one can deny that there are major problems with our current mechanisms for authorizing access to licensed collections of scholarly content. While on-campus access can be made entirely seamless via IP-based authentication (for the declining case of a user who stays entirely within the campus network), off-campus access is a mess. I first called attention to these issues in an issue brief and a talk at STM about my hapless efforts to access an article citing my work. As I said emphatically, the proxy server, which is in wide use across US higher education and beyond, is not the answer. Proxies fail to support the actual patterns of discovery and access that are natural to researchers, who aim to move seamlessly among resources, regardless of starting point, without having to get authorized for every platform every time. Proxy servers for off-site access to licensed e-resources are a sorely outdated technology that sets out a stumbling block in the researcher’s path, driving users away from publishers and libraries alike towards the open web.

But something else happened on the way to improving the user experience. Sci-Hub exploded on scene as a massive driver for piracy. And, when the publishing community started to look closely at Sci-Hub and how it works, it became apparent that proxy servers were a fundamentally weak link in the security chain, for at least two reasons. First, they aren’t uniformly well maintained by libraries, allowing vulnerabilities to fester. And, second, when unauthorized use is detected, they often force content providers to choose between permitting downloads anonymously or taking the “nuclear option” of turning off access for an entire university. And so it was no longer the case that the priority was improving the options that could coexist alongside proxy servers to improve the user experience. Now, from a content provider perspective, proxy servers have got to go.

As I have been reflecting on RA21, the umbrella initiative for alternative authorization mechanisms beyond the proxy server, I fear that the framings around security and user stumbling blocks are both wrong. The underlying question for modern authorization is about authentication of individual users, and so authentication is increasingly about identity. As a result, RA21 is necessarily mucking around with issues of identity. And if poor choices are made about managing identity, the academic community risks making a major mistake with wide-ranging implications.

Identity online is a hot topic today. There is widespread recognition of the substantial value of user data. Many readers will be familiar with my work to chronicle the impressive and valuable businesses being built in our sector that rely on identity as their foundation. These include the array of bibliometric and research evaluation systems that were disrupted last week with the launch of Digital Science’s new Dimensions product. Elsevier’s and Digital Science’s integrated research workflow suites reaching back into the science laboratory are fundamentally about data and analytics, as Elsevier itself shouts loudly to anyone who will listen. Some might conclude that identity is everything.

Major publishers, discovery services, and research providers are pivoting to take on more of the researcher workflow, including most of the leading organizational participants in the RA21 effort. These publishers and vendors are already using the advantages of their positions today (just as they should be!) to build impressive new offerings in discovery, library systems, research workflow, and analytics. The way this game is played, the more data a corporation has, tied to an individual identity, the more value it can generate. The most important consideration for managing identity is that the user data must be controlled by, or at least fully accessible to, the platform provider. It has been several years since David Smith on these pages called for “a set of principles about how such data is to be used.”

There is no observer more enthusiastic than me about many of the opportunities presented by data-enabled services, from personalized scholarly discovery to researcher workflow tools, but data empires are not the only way to build most of these services. Indeed, there are alternative frameworks that would leave the user in control of their own identity and their own data. Brigham Young University developed the notion of a “personal API” that would empower the individual to control their own data and choose where it was used. I proposed, nearly three years ago, an approach that would put individuals at the center, rather than universities, publishers, or other vendors, in terms of both authentication/authorization and their user data. Just last month, Herbert Van de Sompel gave a compelling overview of how a decentralized model along these lines could be conceived. On a more centralized basis, it is possible to imagine ORCID growing into such a service, or others doing so, even if there are few indications of such development. Such alternative approaches operationalize privacy by giving the user control of their identity and data. And, as a happy byproduct, they force providers to compete based solely on the services that they can provide on top of the data — rather than competing based on control of the data, with all the disadvantages that we have watched take hold in the consumer sector.

But, RA21 is not pursuing broader solutions. It is not centered around individuals and their own control of their identity and data. Users are essentially pawns. Instead, RA21 is scoped narrowly, which just happens to avoid disrupting providers or interfering with the dominance of leading players. It gives major advantages to those market incumbents that already have access to large amounts of usage and user data. Solutions that would create a level playing field around user data are certainly not in the interests of market incumbents. It is unsurprising that RA21 seems to be taking a set of approaches that reifies the interests of the companies that led its initial development. While there are efforts being made to add a “library perspective” to the RA21 table today, it is around policy considerations such as privacy rather than fundamental architecture.

Indeed, we are starting to hear rumors, as Lisa reported last week, that leading RA21 players are ultimately interested in killing off IP authentication, not just the proxy server. IP authentication has offered a truly seamless user experience, enabling researchers to choose their own pathways from discovery to access, to the extent that they are working entirely on the campus network. And, it has been seen to offer more anonymity for users than individually authenticated alternatives. But, given that users on handheld devices are likely to be moving regularly across networks, IP authentication is not the panacea of seamlessness it once was. And, to be sure, any network-based authentication does permit some content security vulnerabilities. But, notably, requiring that researchers individually authenticate for every session strengthens the data and analytics businesses discussed above. All in all, it makes perfect sense that leading providers would see removing IP authentication as a strategic objective, positioning them to build further on their data empires.

We are being asked to trust that in a second potential stage of work RA21 will take on broader questions yet to be determined. This may yet happen, and I hope that it does. But, once the limited shared security and off-site access interests of the current group of industry leaders are addressed, will the sector representatives participating in RA21 take this next step? Why should observers trust that RA21 will support efforts that will put into place user-centric identity and personalization choices that could reduce the strength of incumbent advantages in developing new businesses? All indications point in the opposite direction.

I challenge RA21 to stop asking for trust and rather to earn it. Doing so would involve a commitment to develop a user-centric level playing field empowering users to manage their own identity and data. I hope that in doing so RA21 can realize its potential to serve the broader interests of scientists and academia, not just the understandable objectives of publishers and platforms.

 

I had the great benefit of comments about drafts of this post, on an extremely short timeline, from: Cody Hanson, Bruce Heterick, Lisa Janicke Hinchliffe, David Smith, Aaron Tay, and one anonymous contributor. They helped to improve this post tremendously but do not uniformly share my perspective nor should they be held responsible for its shortcomings.

Roger C. Schonfeld

Roger C. Schonfeld is the vice president of organizational strategy for ITHAKA and of Ithaka S+R’s libraries, scholarly communication, and museums program. Roger leads a team of subject matter and methodological experts and analysts who conduct research and provide advisory services to drive evidence-based innovation and leadership among libraries, publishers, and museums to foster research, learning, and preservation. He serves as a Board Member for the Center for Research Libraries. Previously, Roger was a research associate at The Andrew W. Mellon Foundation.

Discussion

30 Thoughts on "Identity Is Everything"

Meanwhile, Facebook, Google, and Twitter are getting data on nearly every user visiting this site via the “Like” and “Share” widgets installed at the top and bottom of our posts, including IP address and so forth, and adding that to their vast storehouse of data. Facebook has been doing this since 2010. So, if you have a Google, Facebook, or Twitter cookie in your browser and you’re reading this, the companies just gathered more data from your visit to this site. This is why I was told that an advertisement on the Kitchen followed a user to the New York Times mobile site recently. Both sites have cross-talk from these large online advertising purveyors, and they track users moving among sites. Just so we’re aware of the current baseline.

GDPR is something I would have thought you’d have mentioned around this, as it seems poised to do a lot of what you think needs to be done. It will make data more portable, give users more control, provide the right to be forgotten, etc. Can you comment on this?

Also, the loose talk about “the end of IP addresses” is just that — loose talk. There is very little chance that a core part of the Internet’s infrastructure will be sidelined by machine-level access privileging. IPs will remain part of the mix. These upgrades of the authorization space are more additive, as I see it.

Kent, Just a quick response as I am getting on a flight. There is no loose talk about the end of IP addresses. Rather, some content providers are interested in abandoning the use of IP address ranges as the basis for authorization to access content. Are you denying that is taking place?

I’m disputing that it’s significant talk. My exposure to this kind of talk is that it’s more razzle-dazzle than realistic. That’s my point, and why I characterized it as “loose talk.”

Kent, we must be talking to very different people. I saw a demo of an in-development ID system from a major platform that made it very clear to me this isn’t some sort of pie-in-the-sky “razzle-dazzle” (as you say) idea but something that realistically could be deployed within a year or two at the latest.

I’m saying the talk about IPs being abandoned is razzle-dazzle. The technologies behind WAYF, CASA, and similar systems are realistic, but they still utilize IPs.

Let me echo Lisa in saying there is no razzle dazzle here. For me, I spoke with a number of individuals from content providers and associated with RA21 in the process of putting together this post. What I find most disturbing is that some of our colleagues actually fear speaking up in public on these issues, even if they are troubled by some of the directions this initiative and its leaders are taking.

My apologies, Kent. I left it too ambiguous when I wrote: “I saw a demo of an in-development ID system from a major platform that made it very clear to me this isn’t some sort of pie-in-the-sky “razzle-dazzle” (as you say) idea but something that realistically could be deployed within a year or two at the latest.”

Let me be more clear: “I saw a demo of an in-development ID system from a major platform that INCLUDED THE STATEMENT BY THE LEAD DEVELOPER THAT THE PLATFORM WANTS TO ELIMINATE IP ADDRESS AUTHENTICATION AND THAT made it very clear to me this isn’t some sort of pie-in-the-sky “razzle-dazzle” (as you say) idea but something that realistically could be deployed within a year or two at the latest.”

Yes, people choose to give up data regularly. We should not confuse that choice and take it as evidence that they want it taken from them by publishers or the institutions at which they work/study.

IP addresses will continue. No one is claiming that RA21 seeks to eliminate IP addresses. It’s IP-based authentication to library-subscribed resources that is going to be eliminated, not immediately of course but eventually. RA21 repeatedly states the goal is to “move beyond” IP authentication. We should believe the platforms/publishers when they tell us what they are doing.
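To make that distinction concrete, here is a minimal sketch of what IP-based authorization amounts to on a content platform. It is an illustration only, not any publisher's actual implementation: the institution name, the registered ranges (reserved documentation addresses), and the function name `institution_for` are all assumptions made for the example.

```python
import ipaddress

# Hypothetical table of ranges a library has registered with a platform.
# The addresses are reserved documentation ranges, not real campus networks.
LICENSED_RANGES = {
    "Example University": [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("198.51.100.0/25"),
    ],
}


def institution_for(ip: str):
    """Return the subscribing institution whose range contains ip, else None."""
    addr = ipaddress.ip_address(ip)
    for name, networks in LICENSED_RANGES.items():
        if any(addr in net for net in networks):
            return name
    return None


print(institution_for("192.0.2.17"))   # inside a registered range
print(institution_for("203.0.113.5"))  # outside every registered range
```

Note what the check returns: an institution, never a person. Moving "beyond" this model means replacing the range lookup with some form of individual sign-in, which is where the identity questions in the post arise.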

Yes, people choose to give up data regularly.

“Choose” being the operative word. I’m not sure most users (or publishers or librarians for that matter) are clearly aware of how much data they’re giving up already. Are you signed in to Google/Facebook? Is the journal site running Google/Facebook widgets, or Google Ads or Google Analytics? If so, then how much tracking is going on? I honestly don’t know and I doubt many users of journals do either. There’s no transparency, so an informed choice cannot be made.

Which leads me to the question: if we’re already giving up all of this information to the hugest of corporations, does it matter if we also give up the same information to a set of smaller corporations (e.g., publishers)? Is one okay but not the other and if so, why? Should librarians be arguing against the integration of social media into journal platforms as well?

David, I fully acknowledge that there are bigger concerns here. I scoped my discussion to the massive consolidation in the scholarly publishing and research platforms industries. Just because I might be willing to allow Facebook to track me — to some extent, a choice I make personally by electing to maintain an FB account — doesn’t mean that I want a scholarly publisher to be able to track me professionally and use that tracking to lock my university further into its ecosystem. The two cases can be distinguishable for any individual, which is why individuals should be in control of their own identity. I’m surprised, Kent, if you wouldn’t agree with that.

I agree, which is why I stopped using Facebook more than a year ago. That, and other reasons. I just wanted to note where the baseline is, and that everyone reading this is regularly giving up data in ways that they may not realize.

Still, I don’t think you accepted my invitation to talk about how GDPR intersects with this. So, I’ll invite again.

I’m perfectly fine with users choosing to give their identity to a publisher site. User control. User agency. User choice. Not sure how many different ways I can articulate that.

The key then is transparency, which I’m pretty certain is not being provided by those currently tracking and collecting user data. Would you be okay with a publisher offering the same level of transparency around privacy as Google and Facebook?

David, I fear our conversation straying into a different area than addressed by RA21. So, let me be clear that I see a difference between publisher/end user privacy/transparency/user control practices (e.g., when I create a Mendeley account that is also my ScienceDirect account) and how these issues are being framed in RA21, in which the RA21 position is that the university is the owner of the data not the individual.

And speaking of which, should we be extra dubious about Google’s CASA (Campus Activated Subscriber Access) for Scholar, which has been pushed as part of RA21?

A question here: Who is the “user”? Everyone seems to assume that the user is the individual researcher, but is it really? As an academic researcher, it is quite clearly my university that is the publisher’s customer, so I would suggest that a more logical way to view things is to also assign the label of “user” to the institution. Then IP authentication makes perfect sense: the institution is also the “user” and the publisher already has access to the data of said users. To my eye, this way of seeing things grants an appropriate amount of both privacy for individuals and access to user (=institution) access data.

But I wonder if there may not be legal stumbling blocks here, at least for European institutions. The form of access to content is presumably part of the deal between institution and publisher. If access to content involves individual employees having to give up data considered personal (everyone on this thread seems to think this to be the case, irrespective of other opinions, so this seems clear), then this is a form of payment from the institution to the publisher. The European data protection directive is pretty tough stuff… academic institutions effectively using individual data on their employees as something to be traded with publishers… I have to report and keep track of every spreadsheet that I use to keep my students’ grades, so I really don’t know about that one.

David, I agree that a major key to long-term success is transparency, which can deliver a balance between data use/collection and the utility of the service offerings. Last week, in response to Lisa’s post, I wrote a post on the need to build trust through transparency and choice. It can be found here: https://danielayala.com/2018/01/16/it-all-comes-down-to-trust/

To the GDPR query, there is an intersection, but recall that the EU laws don’t dictate the boundaries of use, except for the most sensitive of personal information. The underlying premises are that of explicit consent, breach notification in a timely manner, right to know what a company/entity knows about you, right to be forgotten, portability of data and demonstration that privacy principles have become part of the development lifecycle of products and services offered. Therefore, the use cases of concern are not scoped down by GDPR, but the transparency about what is being done may be clearer to users to help them make informed decisions.

David, CASA (which is something HighWire has worked on closely with Google) is not “part of” RA21 in any formal (technical, architectural, process) sense. What we have agreed with RA21 folks on is that we are keeping in touch on measurement and such. The two approaches are pretty different, yet they address somewhat the same problem: it is fragile to get off-campus access to subscribed resources (sort of like expecting the wifi on an international United flight to work: you can’t count on it). CASA and RA21 (and to some extent Unpaywall) all address this problem in different ways by picking a different part of the researcher workflow to act on. (I even have a slide on this from a presentation last year. 🙂) The nice thing is these are complementary, and not in competition. With CASA, RA21 and Unpaywall, the user effectively has three different places where their access can be enabled, any one of which will open the door. The expectation is that in combination these will give a lot more access right when people need it, without them having to go “out of band” (to Sci-Hub) to get what they need. And because they all are inside the user’s workflow, the user doesn’t have to learn something new or go somewhere else. That’s the theory.

CASA is up and running at HighWire and in (I think) three other platforms (and Unpaywall has been out for a while also — though I know not every publisher is sanguine about Unpaywall). So we’re seeing how well theory fits practice.

John, I’m surprised to see Unpaywall listed here. What’s the mechanism Unpaywall uses to log users in? I know Kopernio works off the library’s proxy server. I wasn’t aware Unpaywall had added functionality to get through a paywall. I thought it went to alternative sources only and I’m not seeing anything on their site about this.

Lisa, Unpaywall does not log people in. It finds legal open copies (based on a couple of databases, as I understand it — I think I’ve seen it documented, but can’t recall where if you didn’t see it on their site) and will point people to those copies by putting a symbol on the abstract page. You click on the symbol and get to the legal open copy. In many cases (e.g., with a lot of HighWire-hosted journals) that legal open copy is ON the journal’s website, because the journal makes the content free by policy after the “embargo” period, or the article is hybrid or gold OA. Unpaywall seems to get that right most all the time (I’ve seen some exceptions in my own use, but not too many).

My point about Unpaywall and CASA and RA21 is that they are all “in the workflow”: the user doesn’t have to go somewhere else (Sci-Hub, ResearchGate) to get their work done. But each is at a different part of the workflow: CASA is at the search/discover part; RA21 is at the login-page part; Unpaywall is at the abstract-page part.

Hope that makes sense. (The Unpaywall folks are in my experience open about what they do and how; fortunately they don’t make me read the code to figure that out :)).

Oh, ok – if what is common to the three is being in the workflow, I get what you are saying. Of course, the proxy server could be put into the workflow just like the RA21 solution of pointing to a SAML instance if platforms wanted to do so. Some platforms indeed do that now, which is why I’m not convinced that RA21 is a better workflow solution than the proxy server (though SAML obviously offers greater security).

John can you tell us more about Google’s involvement in CASA? Are they just supplying the infrastructure, or do they play an active role in authentication, and are they collecting data and tracking users in any way?

FWIW, I found the explanation of CASA in https://insights.uksg.org/articles/10.1629/uksg.360/ helpful (though it doesn’t answer all questions e.g. tracking) – key paragraph: “Under CASA, when on-campus users arrive on a Google Scholar search results page, as well as their entitlements being highlighted, a cookie will also be placed on their device which records their institutional identifier. When that user returns to Google Scholar off campus at a later date, their institutional entitlements will again be highlighted in the results list. On clicking through to read the article, Google Scholar will pass over an encrypted CASA token to the journal website, which the journal website will then decrypt in order to passively authenticate and authorize the user.”

However, to what extent the creation of the token is attached to the Google ID of the user is still something I have not been able to get firm clarity on, nor what Google is doing with the data they collect from such retrieves. I also do not have a clear understanding if the data is shared back to the regular Google (non-Scholar) and used in ad sales there.
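The hand-off described in the UKSG passage quoted above (the search service issues a token carrying an institutional identifier, and the journal site checks it to authorize the reader) can be sketched roughly as follows. This is a sketch under stated assumptions, not Google's or any publisher's real protocol: the actual CASA token format is not described in this thread, the names `SHARED_KEY`, `issue_token`, and `verify_token` are hypothetical, and the token here is signed with an HMAC rather than encrypted, because Python's standard library has no symmetric cipher.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical secret shared by the search service and the journal platform.
SHARED_KEY = b"demo-key-known-to-both-parties"


def issue_token(institution_id: str, now: float, ttl_seconds: int = 300) -> str:
    """Search-service side: pack the institution and an expiry into a signed token."""
    payload = json.dumps(
        {"inst": institution_id, "exp": now + ttl_seconds}, sort_keys=True
    ).encode()
    signature = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_token(token: str, now: float) -> Optional[str]:
    """Journal side: return the institution if the token is genuine and unexpired."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or malformed base64
        return None
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # payload was altered or signed with a different key
    claims = json.loads(payload)
    if claims["exp"] < now:
        return None  # token has expired
    return claims["inst"]
```

In this sketch the token names only an institution. Whether a real token is also tied to an individual account is exactly the open question raised above, and nothing in the sketch settles it.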

Dan, these seem like key issues that every publisher would want to clarify before enabling such a solution, and they point to a set of issues I would expect libraries to monitor as part of their license agreements. I have to hope none are just trusting the most important data and advertising network in the world on these matters.

Why do you guys and academics all want to establish simple off-site access? Go home to your families and hobbies and forget about the rigours of work for a few hours. You are all very privileged intellectuals, so reap some of the benefits of such positions and establish a life outside of work and achieve a better work/life balance! Use the fact that off-site working is difficult to your benefit and stop creating unnecessary additional problems and work that will only benefit a small minority of researchers. If Open Access really takes hold (I appreciate that there is slow but increasing momentum) then all these security systems will be defunct as, like Unpaywall, there will be no requirement to log in to access free and open content/data.

In an age where so many willingly give up so much personal data for membership/apps and social media it is inevitable we can be tracked and targeted at any time by those sufficiently determined to do this. However, I am with Kent, the tide is changing and GDPR is a step in the right direction, together with the right to be forgotten. We now just need a way of removing the ability of advertisers to track our activity and hound us across all searches, or better still have the ability/right to remove all advertising from our web activity if we so choose.

One aspect of this discussion that intrigues me is how much of the publishing landscape is now Open Access and how much of that change has pushed publishers and content providers into the analytics and other researcher-level metrics arena? I’m sure there is a high correlation.

The other phenomena I see are people who have two or more personas within a social media app or between them. Younger patrons use different apps than older populations, and with places like Reddit you can have throwaway accounts. I agree with you – I’m not sure the paradigms we are familiar with as academics are as true for others, some of whom will be in universities soon!

Elizabeth, I agree with you on this, and I think it is not just correlation but causation. It is **precisely because of open access** that Elsevier, for example, is working to transform into a data and analytics company. And in order to do this, it is essential for these companies to gather up heaps of individual-level data. To the extent that libraries take the approach of “not my problem” because it is not about authorization to licensed e-resources, more the shame.

If Open Access really takes hold (I appreciate that there is slow but increasing momentum) then all these security systems will be defunct as, like Unpaywall, there will be no requirement to log in to access free and open content/data.

This came up in the comments on another article, and it’s probably worth noting again that open access doesn’t automatically solve all the problems of privacy and user tracking. As we know from all the free services offered by Google, Facebook and the like, just because something is given to you for free doesn’t mean that you’re not being spied upon. Say what you will about the subscription business model, but at least it offers librarians a chance to have some oversight on the practices of the businesses where researchers are gathering their information.

Comments are closed.