A distributed or decentralized technical architecture could impact privacy and data security more than any policy. Image via opensource.com.

More than 15 years ago, Lawrence Lessig famously argued that the internet’s technical architecture has policy implications just as powerful as the regulatory apparatus that a government may impose. We have seen in copyright the consequences both of building law and policy atop a set of technological assumptions that have now shifted and of developing policy workarounds through technical efforts such as DRM. In today’s post-Snowden environment, many of us are grappling with privacy issues as consumers, as citizens, and as voters. Recent weeks have brought renewed discussion of “backdoors” as an engineered solution to enable government access to our private communications. Just as technical considerations are vital for protecting privacy in the broadest societal context, so too are they essential to address the professional privacy of researchers in the digital library and scholarly communications space. Architectural alternatives will arise that have not only policy consequences but also strategic implications for publishers, libraries, and intermediaries alike. While competitive concerns cannot be avoided entirely, as a community we should be thinking about how to draw not only on policy but also on technical architecture to balance privacy and innovative features for researchers.

In the current research information architecture, a researcher will typically navigate to a content platform, whose content has been licensed by the library, to gain access to an article once it is discovered. There, the researcher’s activities are recorded centrally with great granularity: what is downloaded, what search terms are used, what source of authorization is utilized, and more. In aggregated form, these usage data are sent back to the library (as COUNTER-compliant reports indicating the high value of the resource in question) and used for other purposes (such as calculating royalties for content licensed from third parties). The content platform typically retains the raw usage data, allowing for the detailed analysis of specific content usage and downloading behaviors. These specific activities can be associated with an individual, especially where that individual has signed up for an account on the content platform, for example to receive new issue alerts. Content platforms are developing increasingly sophisticated uses for these data, for example for the creation of recommendation systems and for individual personalization.
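To make the distinction concrete, here is a minimal sketch of the aggregation step described above: a platform’s raw event log ties each download to an identifiable account, while the report sent to the library collapses those events into per-title monthly totals. All of the field names, DOIs, and the report shape are my own illustration, not the actual COUNTER format.

```python
from collections import Counter
from datetime import date

# Hypothetical raw usage events as a platform might log them. Each
# event links a download to an identifiable account; the aggregation
# below discards that linkage, but the platform typically retains
# the raw log.
events = [
    {"user": "jdoe@example.edu", "doi": "10.1000/xyz123", "date": date(2016, 1, 4)},
    {"user": "jdoe@example.edu", "doi": "10.1000/xyz123", "date": date(2016, 1, 5)},
    {"user": "asmith@example.edu", "doi": "10.1000/abc456", "date": date(2016, 1, 9)},
]

def counter_style_report(events):
    """Aggregate raw events into per-title monthly totals — the
    general shape of a COUNTER-style usage report, with no user
    identifiers remaining."""
    totals = Counter((e["doi"], e["date"].strftime("%Y-%m")) for e in events)
    return dict(totals)

print(counter_style_report(events))
# {('10.1000/xyz123', '2016-01'): 2, ('10.1000/abc456', '2016-01'): 1}
```

The privacy question turns on the fact that both representations exist at once: the library sees only the second, while the platform keeps the first.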

Notwithstanding the value of these usage data in understanding and serving user needs, librarians have generally avoided collecting granular usage data. Recognizing the vital importance of being able to read anonymously, libraries have assiduously deleted individually identifiable general circulation data to avoid any possible privacy violations (see for example this toolkit). At the same time, many libraries retain a variety of personally identifiable information associated with other reading and research behaviors, but these data have been so fragmented that it is unusual to see personalized services developed around them.

By keeping comparatively little user activity data in their own hands, and fragmenting those data they do keep, libraries may feel they have sidestepped an ethical dilemma. But they have surely missed the opportunity to build the personalized discovery and research information management tools that scientists and other academics need. In doing so, libraries have underserved researchers and other users, putting themselves at a competitive disadvantage relative to other providers. At the same time, they have not imposed limits on the data gathering and use of content providers and other vendors that are nearly as strict as those they claim for themselves.

But today, librarians and other policy-makers are starting to wake up to the problematic aspects of this dynamic. The problems are twofold: first, the already-mentioned failure to serve users; and second, the perception that many licensed e-resources, including content platforms, do not meet the privacy standards that libraries have established for themselves.

NISO’s recently released Consensus Principles on Users’ Digital Privacy, led by my fellow Scholarly Kitchen Chef Todd Carpenter, is an effort to grapple with this challenging space. I spoke at one of the group’s virtual meetings and provided feedback on a draft of these principles, which are designed to strike a balance: enabling greater user transparency, and in some cases informed consent, without reducing the quality of services that can be offered. Seen as a compromise between privacy hawks and service innovation hawks, these principles represent something of a way forward. To the extent that they are incorporated into license agreements between libraries, publishers, and other providers, they will represent a substantial improvement over the status quo.

Still, the NISO principles seem to assume the current architecture of content platforms, each of which centrally controls various user and usage data. Other technical architectures with direct bearing on the control of user data already exist today, such as the consortial or institutional loading of licensed content resources, as practiced by OCUL and others. And new models are emerging. One example that recently caught my eye is the “personal API” initiative that is growing up at Brigham Young University (BYU). Let me explain.

First, a little background on one of the project partners. Jim Groom is an instructional technologist who has worked valiantly at the University of Mary Washington and through Reclaim Hosting with a variety of partners to help students find ways to assert and manage their own digital identities. His work on the politics and practicalities of managing one’s identity through individual tools like WordPress, and not just through centralized systems like Facebook, deserves attention on its own.

The context is that BYU has been reengineering its systems — which will include such things as admissions and financial aid, human resources and financial systems, course scheduling and grant management, alumni relations and development — to interact in a more modern way using APIs. This is an increasingly common architectural choice that is driven by a need to improve interoperability while maintaining security.

The novel development is this: In partnership with BYU, Groom and his colleagues are working to build what is being called a “personal API.” The idea is for each member of the BYU community to be able to control the data they contribute to university systems. They are starting with a pilot focused on allowing students to publish materials into the learning management system or a variety of other places, but the vision is far broader. Ultimately, they envision giving each member of the university community the ability to retain and control their own data, making it available to university systems only on a time-limited, as-needed basis. This architecture might have substantial benefits for privacy and security, but it might also complicate the efforts some universities are making to construct large data warehouses of student activity data that allow them to target interventions to improve student progression and graduation.
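The grant model described above — the individual holds the data and issues short-lived, scoped permissions that university systems must present to read it — can be sketched in a few lines. Every class, method, and field name here is my own illustration of the general idea, not the project’s actual API, and a real system would use signed tokens and authenticated transport rather than an in-memory store.

```python
import secrets
import time

class PersonalDataStore:
    """A minimal sketch of a 'personal API' data store: the individual
    keeps the data and issues time-limited tokens granting access to
    one field at a time."""

    def __init__(self):
        self._data = {}    # field -> value, held by the individual
        self._grants = {}  # token -> (field, expiry timestamp)

    def put(self, field, value):
        self._data[field] = value

    def grant(self, field, ttl_seconds):
        """Issue a short-lived token for a single field."""
        token = secrets.token_hex(8)
        self._grants[token] = (field, time.time() + ttl_seconds)
        return token

    def read(self, token):
        """A university system presents the token; access works only
        while the grant is live, then lapses automatically."""
        field, expiry = self._grants.get(token, (None, 0))
        if time.time() > expiry:
            raise PermissionError("grant expired or never issued")
        return self._data[field]

store = PersonalDataStore()
store.put("coursework", "essay-draft-3.pdf")
token = store.grant("coursework", ttl_seconds=60)
print(store.read(token))  # prints essay-draft-3.pdf while the grant is live
```

The design choice that matters is that access lapses by default: the university system never holds a copy of the data, only a token whose usefulness expires.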

If successful, it is possible to imagine such a decentralized architecture being extended to serve the needs of a researcher account spanning library resources and content platforms. Indeed, I have argued that a cross-platform user account controlled by researchers themselves would bring vast improvements to the process of authorizing access to an appropriate copy of an information resource, while dramatically improving one’s control of one’s own data. Such an architecture represents a sea change from today’s systems, and it is no small thing to imagine the costs and logistical complexities of a transition.

Even so, the work that Reclaim Hosting and BYU are undertaking is more than just experimental. It suggests that, when the political and organizational stars align, an entirely different architecture for the control of user data is possible. As Ithaka S+R prepares to launch a program of work in the area of privacy in 2016, I wonder: What would such a shift mean for scholarly publishing and academic libraries? 

Roger C. Schonfeld

Roger C. Schonfeld is the vice president of organizational strategy for ITHAKA and of Ithaka S+R’s libraries, scholarly communication, and museums program. Roger leads a team of subject matter and methodological experts and analysts who conduct research and provide advisory services to drive evidence-based innovation and leadership among libraries, publishers, and museums to foster research, learning, and preservation. He serves as a Board Member for the Center for Research Libraries. Previously, Roger was a research associate at The Andrew W. Mellon Foundation.


4 Thoughts on "Improving Privacy by Rethinking Architecture"

The concept of controlling one’s data works well in the abstract but requires a case-by-case approach when looking at exactly what kind of data one can (or should) control. For example, should a student be able to control which class grades are sent to the registrar? Not if s/he wants to graduate. Should a student be able to control records of the books s/he checks out from the library? Not if s/he wants to have library privileges; and the purging of individual circulation records when books are returned is controlled not by the student but by the library itself. Second, many such personal data controls are already built into online services, like Facebook, which give students options to restrict levels of access to friends and the public. A personal API would somehow have to communicate and interact with all of the online sites and services we use, and there is no precedent that requires a private service, say, Facebook, to interact with a personal API. A user is required to click and accept the terms of each service, not to negotiate with each of them. In sum, the idea of personal control of data works well in the abstract, but may fail upon implementation.

I’m not clear that these examples mean “failure” unless privacy is defined as no data sharing or record-keeping at all. But if privacy means making a choice about whether to share one’s data and under what circumstances (including what one might be trading access to one’s data for, which is what we do with Facebook), then the examples here all meet the criteria of transparency, user control, etc. Why not let a library user opt in to record-keeping of what they have checked out? As a doctoral student, that would have had huge value for me. And, while Facebook isn’t required to interact with a personal API, it might find it valuable to do so if personal APIs became common?

Where would SCNs (Scholarly Collaboration Networks, essentially social networks such as ResearchGate, Academia.edu, and Mendeley) fit into this picture? These networks are built to surveil researchers, and their business models are based on capturing the activity of researchers and selling that data to the highest bidder, or in the case of Mendeley, using that data to serve Elsevier’s needs. These sites often induce researchers to post copies of their papers online, frequently doing so illegally, and thus route around any privacy controls a university may have in place to safeguard a researcher’s identity and activity. If these sites continue to grow, does that mean that the privacy you’re proposing here is doomed, or perhaps that, as we see with Facebook and Google, most people fail to value it?

Comments are closed.