More than 15 years ago, Lawrence Lessig famously argued that the internet’s technical architecture has policy implications just as powerful as the regulatory apparatus that a government may impose. We have seen in copyright the consequences both of building law and policy atop a set of technological assumptions that have now shifted and of developing policy workarounds through technical efforts such as DRM. In today’s post-Snowden environment, many of us are grappling with privacy issues as consumers, as citizens, and as voters. Recent weeks have brought renewed discussion of “backdoors” as an engineered solution to enable government access to our private communications. Just as technical considerations are vital for protecting privacy in the broadest societal context, so too are they essential to address the professional privacy of researchers in the digital library and scholarly communications space. Architectural alternatives will arise that have not only policy consequences but also strategic implications for publishers, libraries, and intermediaries alike. While competitive concerns cannot be avoided entirely, as a community we should be thinking about how to draw not only on policy but also on technical architecture to balance privacy and innovative features for researchers.
In the current research information architecture, a researcher will typically navigate to a content platform, whose content has been licensed by the library, to gain access to an article once it is discovered. There, the researcher’s activities are recorded centrally with great granularity: what is downloaded, what search terms are used, what source of authorization is utilized, and more. In aggregated form, these usage data are sent back to the library (as COUNTER-compliant reports indicating the high value of the resource in question) and used for other purposes (such as calculating royalties for content licensed from third parties). The content platform typically retains the raw usage data, allowing for the detailed analysis of specific content usage and downloading behaviors. These specific activities can be associated with an individual, especially where that individual has signed up for an account on the content platform, for example to receive new issue alerts. Content platforms are developing increasingly sophisticated uses for these data, for example for the creation of recommendation systems and for individual personalization.
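The data flow described above can be sketched in a few lines. This is a minimal illustration, with hypothetical field names and titles, of the distinction the paragraph draws: the platform retains a raw, individually identifiable event log, while the report sent back to the library contains only aggregate, COUNTER-style totals per title.

```python
from collections import Counter

# Raw event log as a platform might retain it: individually identifiable.
# (Field names and values here are invented for illustration.)
raw_events = [
    {"user": "u1001", "title": "Journal of X", "action": "download"},
    {"user": "u1001", "title": "Journal of X", "action": "download"},
    {"user": "u2002", "title": "Journal of Y", "action": "download"},
]

def counter_style_report(events):
    """Aggregate away the user identifiers, keeping only per-title totals."""
    return Counter(e["title"] for e in events if e["action"] == "download")

report = counter_style_report(raw_events)
# The library sees totals; the user-level detail stays on the platform.
assert report == {"Journal of X": 2, "Journal of Y": 1}
```

The asymmetry is the point: nothing in the aggregated report identifies an individual, but nothing in this flow requires the platform to discard the raw log either.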
Notwithstanding the value of these usage data in understanding and serving user needs, librarians have generally avoided granular usage data. Recognizing the vital importance of being able to read anonymously, libraries have assiduously deleted individually identifiable general circulation data to avoid any possible privacy violations (see for example this toolkit). At the same time, many libraries retain a variety of personally identifiable information associated with other reading and research behaviors, but these data have been so fragmented that it is unusual to see personalized services developed around them.
With comparatively little user activity data in their own hands, and having fragmented those data they do keep, libraries may feel they have sidestepped an ethical dilemma. But they have surely missed the opportunity to build the personalized discovery and research information management tools that scientists and other academics need. In doing so, libraries have underserved researchers and other users, putting themselves at a competitive disadvantage relative to other providers. At the same time, they have not placed limits on the data gathering and use of content providers and other vendors that are nearly as strict as those they claim for themselves.
But today, librarians and other policy-makers are starting to wake up to the problematic aspects of this dynamic. The problems are twofold: first, the already-mentioned failure to serve users; and second, the perception that many licensed e-resources, including content platforms, do not meet the privacy standards that libraries have established for themselves.
NISO’s recently released Consensus Principles on Users’ Digital Privacy, led by my fellow Scholarly Kitchen Chef Todd Carpenter, is an effort to grapple with this challenging space. I spoke at one of the group’s virtual meetings and provided feedback on a draft of these principles, which are designed to strike a balance: enabling greater user transparency and, in some cases, informed consent without reducing the quality of services that can be offered. Seen as a compromise between privacy hawks and service innovation hawks, these principles represent something of a way forward. To the extent that they are incorporated into license agreements between libraries, publishers, and other providers, they will represent a substantial improvement over the status quo.
Still, the NISO principles seem to assume the current architecture of content platforms, each of which centrally controls various user and usage data. Other technical architectures with direct bearing on the control of user data already exist today, such as the consortial or institutional loading of licensed content resources, as practiced by OCUL and others. And new models are emerging. One example that recently caught my eye is the “personal API” initiative that is growing up at Brigham Young University (BYU). Let me explain.
First, a little background on one of the project partners. Jim Groom is an instructional technologist who has worked valiantly at the University of Mary Washington and through Reclaim Hosting with a variety of partners to help students find ways to assert and manage their own digital identities. His work on the politics and practicalities of managing one’s identity through individual tools like WordPress, and not just through centralized systems like Facebook, deserves attention on its own.
The context is that BYU has been reengineering its systems — which will include such things as admissions and financial aid, human resources and financial systems, course scheduling and grant management, alumni relations and development — to interact in a more modern way using APIs. This is an increasingly common architectural choice that is driven by a need to improve interoperability while maintaining security.
The novel development is this: In partnership with BYU, Groom and his colleagues are working to build what is being called a “personal API.” The idea is for each member of the BYU community to be able to control the data they contribute into the university systems. They are starting with a pilot focused on allowing students to publish materials into the learning management system or a variety of other places, but the vision is far broader. Ultimately, they envision giving each member of the university community the ability to retain and control their own data, only making it available into university systems on a time-limited, as-needed basis. This architecture might have substantial benefits for privacy and security, but it might also complicate the efforts that some universities are making to construct large data warehouses of student activity data to allow them to target interventions that will improve student progression and graduation.
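The core mechanism of such a “personal API” can be sketched abstractly. The following is a hypothetical illustration, not BYU’s or Reclaim Hosting’s actual design: every name in it is invented. The idea it captures is the one described above, where the individual holds the data store and issues time-limited, scoped grants that university systems must present in order to read it.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Grant:
    scope: str          # which data field the grant covers, e.g. "coursework"
    audience: str       # which university system may use it, e.g. "lms"
    expires_at: float   # unix timestamp; access stops when this passes

@dataclass
class PersonalDataStore:
    """Data stays with the individual; systems read it only via a valid grant."""
    data: dict = field(default_factory=dict)
    grants: list = field(default_factory=list)

    def issue_grant(self, scope: str, audience: str, ttl_seconds: float) -> Grant:
        grant = Grant(scope, audience, time.time() + ttl_seconds)
        self.grants.append(grant)
        return grant

    def revoke(self, grant: Grant) -> None:
        self.grants.remove(grant)

    def read(self, grant: Grant, requester: str):
        """A system presents its grant; data is returned only while it is valid."""
        if (grant in self.grants
                and grant.audience == requester
                and time.time() < grant.expires_at):
            return self.data.get(grant.scope)
        raise PermissionError("no valid grant for this scope/requester")

# Usage: a student shares coursework with the LMS on a time-limited basis.
student = PersonalDataStore(data={"coursework": ["essay1.pdf"]})
grant = student.issue_grant(scope="coursework", audience="lms", ttl_seconds=3600)
assert student.read(grant, requester="lms") == ["essay1.pdf"]

student.revoke(grant)  # the student withdraws access at any time
try:
    student.read(grant, requester="lms")
except PermissionError:
    pass  # access is gone once the grant is revoked
```

The design choice worth noticing is where the default lies: in the conventional architecture the institution holds the data and the individual requests limits, while here the individual holds the data and the institution requests access, which is exactly what complicates any central data warehouse.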
If successful, such a decentralized architecture could be extended to serve the needs of a researcher account in terms of library resources and content platforms. Indeed, I have argued that a cross-platform user account controlled by the researchers themselves would bring vast improvements in the processes of authorizing access to an appropriate copy of an information resource, while dramatically improving one’s control of one’s own data. Such an architecture represents a sea change compared with today’s systems, and it is no small thing to imagine the costs and logistical complexities of a transition.
Even so, the work that Reclaim Hosting and BYU are undertaking is more than just experimental. It suggests that, when the political and organizational stars align, an entirely different architecture for the control of user data is possible. As Ithaka S+R prepares to launch a program of work in the area of privacy in 2016, I wonder: What would such a shift mean for scholarly publishing and academic libraries?