A week or so ago, a monumental thing happened: the number of public-domain books in the HathiTrust digital repository topped 5 million. And since no one (including HathiTrust, so far) seems to be making a very big deal about this, it seems like a good moment both to recap the achievements of HathiTrust and to consider a few of its implications for the future of reading and scholarship.
For those unfamiliar with the outlines of its history, HathiTrust emerged in the wake of the Google Books Library Project, a massive and still ongoing program of book digitization that Google undertook in 2004 in cooperation with some of the most comprehensive research libraries in North America and the UK. The basic outlines of the agreement between Google and each library were simple: in return for allowing Google’s employees to come in and (non-destructively) scan most or all of the books in the library’s collection, the library would receive a digital copy of the resulting set of text images. This meant that when Google had finished its scanning project in a library and departed, the library was left with a full digital copy of its collection (or of the subset scanned by Google, anyway)—a copy which had cost the library nothing except the inconvenience of having Google underfoot for a year or so.
Copyright law being what it is, for better or for worse, the project brought a few legal issues to the surface, among them the question of what a library could do with the digital copies that resulted.
These various legal questions have been (and to some degree continue to be) addressed in the course of several lawsuits, and some of those have been discussed in the Scholarly Kitchen already and in many other places as well. I’m not going to replow that ground here. But one issue does seem now to be settled: one thing the Google partner libraries could do was get together and use the images resulting from that project to create an enormous, centrally managed, and robustly archived library of digitized books.
That library now numbers over 13 million titles, most of which are in copyright and therefore not freely available for online reading. Instead, these can be used for research: if you need to figure out what terms appear in which books (and how often), you can use HathiTrust to do so; having identified the books that are of interest to you, you can then pursue full access to them by some other means. Other kinds of research are possible as well, within constraints designed to maximize access without crossing legal lines.
Whether those lines are placed where they should be is an important and vexed question, which I won’t pursue here. Instead, I want to address some of the implications of Google’s project for access to an incredibly rich and varied treasure trove of books and other documents that are in the public domain. Until quite recently, these books were largely hidden from the public by their imprisonment in physical objects. The law has allowed free, unfettered use and reuse of these books for years (for centuries, in many cases), but such use was never possible as long as the books were available only in print format. Print is a marvelous format for some kinds of use, but it is a terrible mechanism for distributing information efficiently to large numbers of people; furthermore, the older the book, the fewer physical copies are likely to be in existence. Until the intellectual content of these books was freed from the bonds of physical formats, there was never any hope of making them available on anything like a worldwide and comprehensive basis. And the only way that liberation was going to happen was if an organization with the resources of Google did the initial work; subsequently, the only way the resulting digital files would be turned into a true public resource was if a significant number of richly provisioned libraries undertook that project.
What are the contours of the resulting collection? Consider this: according to data provided (to its members) by the Association of Research Libraries (ARL), the average large research library in North America holds roughly 4 million books, and that collection of books might be available to, say, 25,000 local students and 2,000 faculty. The average ARL library has a materials budget of around $12 million, 25% of which (or $3 million) is spent annually on books as opposed to journals and databases.
Now consider this: for this average ARL library, the cost of a membership in HathiTrust, and with it the right to offer patrons the ability to download each of the 5 million public domain books in HathiTrust as a single .pdf file, will probably be under $20,000. (That’s the case for my library, which is right around the median on most measures of ARL library collections.)
To be clear, no one has to be a HathiTrust member or affiliated with any library at all in order to get access to the full content of these books; anyone may read them online, and can print or download them one page at a time. But in return for a modest membership fee that goes towards the still-ongoing creation of HathiTrust and its management and upkeep, member institutions are able to make access to those books incredibly easy and efficient, thus making possible not only enhanced real-time access, but also the building of rich personal libraries that could never have been created before.
Now, it’s worth acknowledging that HathiTrust’s collection of public domain books isn’t the same thing as a research collection that is made up of more contemporary and locally-relevant content and that has been shaped with institutional and curricular needs in mind. No one would argue that any library should chuck its book collection and replace it with the HathiTrust corpus.
However, what access to HathiTrust does accomplish shouldn’t be underestimated either. Its capacity to transform research is tremendous, particularly in the humanities (which are treated as second-class disciplines on many research campuses), and when coupled with print-on-demand technologies and the proliferation of affordable e-readers, it has the potential to hugely enrich all of our lives as readers as well. At the institution where I work, access to these books is changing the shape of some curricula already.
One last point: by undertaking the work of robustly archiving, preserving, and creating public access to these books, HathiTrust has (quietly and implicitly) thrown down the gauntlet in front of every academic and research library that has its own collection of public domain books—especially of books that are rare or unique. HathiTrust provides a mechanism for storing and making publicly available copies of those books; all you have to do is scan them and submit the images. This can’t be done at no cost, of course, but it seems like something that all of us ought to be willing to do to some degree, and that most of us could do more efficiently in partnership with HathiTrust than on our own. The challenge is particularly acute for those libraries that consider themselves to be on the front lines of providing open access to scholarly information. In most of our libraries there are troves of documents that can be shared without any concern for copyright restrictions, without the need for licensing of any kind, without fighting publishers over the structure of scholarly communication, and without having even to worry about rightsholders—because there are none. These documents really and truly are owned by the public, and all of us may do whatever we want with them: copy, redistribute, republish, create derivative works, commercially exploit, whatever. In order to make this possible with the public domain books that we own in research libraries, all that’s required is for us to put our money where our mouths are. HathiTrust provides a platform and a suite of services that make it easier than ever for us to do so.
[UPDATE: HathiTrust’s latest newsletter features a great piece by Executive Director Mike Furlough on the background and significance of this milestone achievement.]