Most (hopefully all) Scholarly Kitchen readers will be familiar with HathiTrust, a huge and carefully curated collection of digitized print books that is managed by a cooperative of research libraries and allied institutions, and that makes those digitized texts available both for reading (where possible) and for research interrogation. Recently they announced an expansion of the services available to researchers, and I reached out to HathiTrust’s executive director, Mike Furlough, with some questions. His responses, along with those from some other members of his team, are below.
To provide some context for our readers, give us a quick summary of the history of the HathiTrust Research Center (HTRC).
MIKE FURLOUGH (Executive Director, HathiTrust)
The Research Center originally grew out of discussions among Google and its library partners over a decade ago about ways that we could make the collection of texts as broadly useful as possible. The proposed three-way settlement agreement among Google, the Authors Guild, and the Association of American Publishers included a provision for the development of one or more research centers, which would have been set up to provide researchers with the ability to perform “non-consumptive” research — essentially computationally based investigations of large portions of the corpus, rather than “reading the books” one by one. Examples of this type of work might be somewhat simple (search, word frequencies, the building of concordances) or more complex, such as image analysis to identify specific types of features or graphics, or semantic analysis to identify the subjects within the collection of texts.
HathiTrust developed an RFP for one or more of its member libraries to host a research center, which was met with a strong joint response from Indiana University and the University of Illinois at Urbana-Champaign. These two institutions had complementary expertise and resources in digital humanities, computer science, and what we used to call “cyberinfrastructure” development. Ultimately, that settlement was rejected in federal court, but we continued on with our plans and the Research Center launched in 2011. The core operations of the Research Center are funded by co-investments from Indiana University, the University of Illinois, and HathiTrust.
The Research Center has over time been the recipient of research grants from the National Endowment for the Humanities, the Andrew W. Mellon Foundation, the Institute of Museum and Library Services, and the Alfred P. Sloan Foundation. We have been gradually developing the offerings of the Center, at a greater pace in the last year or two. For the first six years we offered these services only on the portion of the collection that we knew to be in the public domain, or for which rightsholders had given us permission to make available for reading.
There are different ways that researchers can undertake data mining using the HathiTrust collection. All of these are available at the HathiTrust Research Center’s main site.
First, users of that site have the ability to define a “workset” to perform analysis upon — basically, a smaller set of texts that is relevant to your research questions. There are a handful of tools there that you can select to run on that workset, and they allow you to do some basic, but still powerful, types of analysis: extracting the names of people, places, and organizations; finding the words that occur most frequently and creating a tag cloud visualization; or identifying prevalent themes in the text and producing visualizations based on these.
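The word-frequency analysis behind a tag cloud can be sketched in a few lines of Python. This is an illustrative sketch only — the sample text and the simple tokenization rule are assumptions for the example, not HTRC’s actual workset tooling.

```python
from collections import Counter
import re

# A tiny stand-in for a workset: in practice this would be the OCR text
# of the volumes a researcher selected (hypothetical sample text).
workset_text = """
The library collection contains books. Books in the library
are digitized, and the digitized books can be analyzed.
"""

# Tokenize, lowercase, and count word frequencies -- the same basic
# operation that drives a tag cloud visualization.
tokens = re.findall(r"[a-z']+", workset_text.lower())
frequencies = Counter(tokens)

# The most frequent words would be rendered largest in a tag cloud.
for word, count in frequencies.most_common(5):
    print(word, count)
```

Real worksets run this kind of counting across thousands of volumes, but the underlying operation is the same.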
Second, we have created a huge secondary dataset that was derived from the full corpus, known as the Extracted Features Dataset. This provides researchers with a “bag of words” from the collection. You don’t get sentences, and you don’t get enough information to reconstruct the sentences, but you do get some powerful metadata. In addition to the standard descriptive information for each volume, it includes metadata about each page in the volume: how many lines, what parts of speech appear, what languages are found there, the first letters of lines, etc. For a lot of researchers, this is sufficient for them to do the work they need to do. We also use this dataset to drive some keen visualization features through the Bookworm interface.
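To give a concrete sense of what page-level “bag of words” data looks like, here is a minimal Python sketch. The JSON snippet is a simplified, hypothetical stand-in for one page’s entry in an Extracted Features-style file — the real dataset’s schema is richer and should be consulted directly.

```python
import json

# Hypothetical, simplified page entry: token counts broken down by
# part-of-speech tag, plus some page-level metadata.
page_json = """
{
  "seq": "00000017",
  "languages": ["en"],
  "lineCount": 28,
  "tokenPosCount": {
    "library": {"NN": 4},
    "digital": {"JJ": 2},
    "collections": {"NNS": 3}
  }
}
"""

page = json.loads(page_json)

# Even without sentence order, per-page counts support bag-of-words
# analyses such as term frequency across a volume.
token_counts = {
    token: sum(pos_counts.values())
    for token, pos_counts in page["tokenPosCount"].items()
}
print(token_counts)  # {'library': 4, 'digital': 2, 'collections': 3}
```

Because the counts cannot be reassembled into sentences, data in this shape stays “non-consumptive” while still supporting frequency, part-of-speech, and language analysis.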
Third, if you are a really advanced user and you need to have greater control over what tools you use and how, then you can work within a Data Capsule. This is a secure virtualized computing environment that gives you command line access to a Linux environment. You can import your own software and additional data to use in your work, but once you switch to the “secure” mode, you are walled off from the rest of the web. This helps us guard against malicious code.
What has changed recently about HTRC’s service offerings?
We recently announced some upgrades and extensions of our analytics services. Some of these improve performance, others improve functionality and data availability. The biggest change is that we’re now expanding the ways in which we make data from the in-copyright collection available for “non-consumptive” use through the Research Center.
The Extracted Features dataset, which I mentioned earlier, already includes data derived from the full collection — copyrighted and public domain. Now we also offer users the ability to create their own worksets of in-copyright data and use the standard tools we provide. That’s new. And we are really excited about now providing secure access to the in-copyright collection in the Data Capsule service. Not everyone will want this, but for those who do, it more than doubles the amount of data available for analysis, and it introduces the ability to analyze materials all the way into the late 20th century. We’ve been worrying for a long time about “the lost 20th century” for research, brought about by continued extensions of copyright terms. This doesn’t solve that problem, but it opens the door for some types of work that were not previously possible.
Does this expansion of HTRC’s data-mining services reflect some kind of new agreement with copyright holders and publishers, or simply new thinking and risk analysis with regard to fair use?
It’s a case of catching up with what the law and courts already allowed. I mentioned earlier the rejected Book Settlement among publishers, the Authors Guild, and Google. When that failed, the Authors Guild continued in their suit against Google, and they sued HathiTrust too. But the Authors Guild lost both of these cases in federal district court and then again on appeal. In the rulings for both cases the courts made very clear that large-scale search and analysis of digitized materials were transformative uses of in-copyright material, and thus covered under the fair use provisions of the US Copyright Act. We and Google always understood this to be the case, but the litigation removed any doubt.
Those rulings were made in 2014 and 2015. But we didn’t have an on/off switch on the wall labeled “TRANSFORMATIVE FAIR USES.” We had to take our time to talk through the rulings and make sure we understood their implications. We had to complete development of the infrastructure to support secure access, test it, and satisfy our experts. And we had to figure out not just how to offer the services technically, but how to support them and monitor them to ensure that we were at all times supporting lawful uses. We convened a small working group to develop the HathiTrust Non-Consumptive Use Policy, which describes the boundaries and some of the steps we take to protect against uses that we don’t intend and which aren’t supported under the law.
It all took longer than we anticipated, but we wanted to get it right. We think it’s one of the most significant applications of fair use to support research in a long while, and we want this to provide a model for other programs and projects.
Which of these new services are available to the general public, and which are limited to HathiTrust member institutions — and why the limitation?
The Extracted Features data are available to anyone and have been for a while. To create an account to access most other Research Center services, you need an affiliation with an educational institution. Specialized access to the in-copyright collection via the Data Capsule service is available only to researchers affiliated with HathiTrust member libraries. We have a robust infrastructure to support this work, but resources are not unlimited, and so we have to put some limits on this service to be sure that we can support it as best we can.
JOHN WALSH (Director, HathiTrust Research Center, Indiana University)
The number of data capsules we can provide is limited by our available computing resources. We anticipate increasing demand for data capsules now that the in-copyright data are available, and that increased demand may exceed our capacity. So the limitation on data capsules with in-copyright access was an obvious area where we could and should prioritize affiliates of member institutions. And data capsules with access only to public domain data remain available to everyone, regardless of affiliation.
What are some of the use cases that you expect to see for this newly-expanded research functionality?
The most obvious use case is the researcher who needs to incorporate works published in the 20th and 21st centuries into her analysis. Historical research on literary or publishing history, for instance, can now extend to the present moment. Data-mining for references to 20th and 21st century people, events, and institutions is now possible.
ELEANOR DICKSON KOEHL (Digital Humanities Specialist, HathiTrust Research Center, University of Illinois)
This is also part of HTRC’s effort to extend research support to advanced scholars whose text analysis methods require data in a different format than what is offered in the Extracted Features dataset, and who, as our user-needs research has shown, are not satisfied with off-the-shelf tools. The expanded access allows researchers to apply their own self-driven research methods, including specialized algorithms and machine learning techniques, to subsets of the HathiTrust corpus in ways that were previously impossible.
What would you say is the coolest or most exciting application of HathiTrust-based data mining you’ve seen so far?
Several good examples come to mind:
- Ted Underwood: Pilot example of Data Capsule research using in-copyright data: “The Transformation of Gender in English-Language Fiction.” This project was written up in Smithsonian and The Economist, and referenced in The Washington Post.
- Cathie Jo Martin: She worked with both HTRC and HathiTrust, and was able to make use of Ted Underwood’s genre work; see the HathiTrust Newsletter for a description.
- Samuel Franklin: “Inside the Creativity Boom,” an Advanced Collaborative Support project that turned into a chapter in his dissertation. Franklin was able to make an argument about how the meaning of the word “creativity” has shifted over time, using Extracted Features to create topic models.
- Dan Baciu: “The Chicago School: Evolving Systems of Value.” This was part of his dissertation; he used machine learning and “wikification” techniques to demonstrate that the Chicago School concept appeared in the literature earlier than previously thought.
I would also call your attention to Ben Schmidt’s “A guided tour of the digital library,” which he released last week. Ben used the Extracted Features to explore alternative ways of clustering and classifying the HathiTrust collection and discusses their relationship to Library of Congress Subject Headings and other systems. It’s a fascinating read.