Source: www.actualitté.com

A week or so ago, a monumental thing happened: the number of public-domain books in the HathiTrust digital repository topped 5 million. And since no one (including HathiTrust, so far) seems to be making a very big deal about this, it seems like a good moment both to recap the achievements of HathiTrust and to consider a few of its implications for the future of reading and scholarship.

For those unfamiliar with the outlines of its history, HathiTrust emerged in the wake of the Google Books Library Project, a massive and still ongoing program of book digitization that Google undertook in 2004 in cooperation with some of the most comprehensive research libraries in North America and the UK. The basic outlines of the agreement between Google and each library were simple: in return for allowing Google’s employees to come in and (non-destructively) scan most or all of the books in the library’s collection, the library would receive a digital copy of the resulting set of text images. This meant that when Google had finished its scanning project in a library and departed, the library was left with a full digital copy of its collection (or of the subset scanned by Google, anyway)—a copy which had cost the library nothing except the inconvenience of having Google underfoot for a year or so.

Copyright law being what it is—for better or for worse—this was a project that brought a few legal issues to the surface, among them the question of what the library could do with the digital copies that resulted from this project.

These various legal questions have been (and to some degree continue to be) addressed in the course of several lawsuits, and some of those have been discussed in the Scholarly Kitchen already and in many other places as well. I’m not going to replow that ground here. But one issue does seem now to be settled: one thing the Google partner libraries could do was get together and use the images resulting from that project to create an enormous, centrally managed, and robustly archived library of digitized books.

That library now numbers over 13 million titles, most of which are in copyright and therefore not freely available for online reading. Instead, these can be used for research: if you need to figure out what terms appear in which books (and how often), you can use HathiTrust to do so; having identified the books that are of interest to you, you can then pursue full access to them by some other means. Other kinds of research are possible as well, within constraints designed to maximize access without crossing legal lines.
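To make the idea of term-level research concrete, here is a minimal sketch of the kind of question described above: which volumes contain a term, and how often. It is purely illustrative. The tiny `corpus` dictionary, the volume identifiers, and the helper names `term_counts` and `find_term` are all hypothetical, and nothing here reflects HathiTrust's actual search interface or data.

```python
from collections import Counter
import re

# Hypothetical miniature corpus: volume id -> full text (not real HathiTrust data).
corpus = {
    "vol-a": "The whale surfaced. The whale dived again.",
    "vol-b": "A ship left the harbor at dawn.",
    "vol-c": "No whale was seen, only the ship.",
}

def term_counts(text):
    """Count occurrences of each lowercase word in one volume's text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def find_term(term, corpus):
    """Return {volume id: occurrences} for the volumes that contain the term."""
    term = term.lower()
    hits = {}
    for vol_id, text in corpus.items():
        n = term_counts(text)[term]  # Counter returns 0 for absent words
        if n:
            hits[vol_id] = n
    return hits

whale_hits = find_term("whale", corpus)
ship_hits = find_term("ship", corpus)
```

The result only identifies where a term appears and how often; reading the in-copyright volumes themselves would still require access by some other route.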

Whether those lines are placed where they should be is an important and vexed question, which I won’t pursue here. Instead, I want to address some of the implications of Google’s project for access to an incredibly rich and varied treasure trove of books and other documents that are in the public domain. Until quite recently, these books were largely hidden from the public by their imprisonment in physical objects. The law has allowed free, unfettered use and reuse of these books for years (for centuries, in many cases), but such use was never possible as long as the books were available only in print format. Print is a marvelous format for some kinds of use, but it is a terrible mechanism for distributing information efficiently to large numbers of people; furthermore, the older the book, the fewer physical copies are likely to be in existence. Until the intellectual content of these books was freed from the bonds of physical formats, there was never any hope of making them available on anything like a worldwide and comprehensive basis. And the only way that liberation was going to happen was if an organization with the resources of Google did the initial work; subsequently, the only way the resulting digital files would be turned into a true public resource was if a significant number of richly provisioned libraries undertook that project.

What are the contours of the resulting collection? Consider this: according to data provided (to its members) by the Association of Research Libraries (ARL), the average large research library in North America holds roughly 4 million books, and that collection of books might be available to, say, 25,000 local students and 2,000 faculty. The average ARL library has a materials budget of around $12 million, 25% of which (or $3 million) is spent annually on books as opposed to journals and databases.

Now consider this: for this average ARL library, the cost of a membership in HathiTrust—and thus the right to offer its patrons the capability of downloading all 5 million public domain books in HathiTrust as single .pdf files—will probably be under $20,000. (That’s the case for my library, which is right around the median on most measures of ARL library collections.)
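The scale of that comparison is easy to check with a little arithmetic. The sketch below uses only the round figures quoted in this post (a $12 million materials budget, a 25% book share, a membership fee treated as its $20,000 upper bound, and the 5 million public-domain figure). The variable names are mine, and the results are rough orders of magnitude rather than actual pricing.

```python
# Round figures quoted in the post; the fee is an upper bound, not an actual price.
materials_budget = 12_000_000      # average ARL materials budget, in dollars
book_share = 0.25                  # fraction of that budget spent on books
membership_fee = 20_000            # "probably under $20,000"
public_domain_volumes = 5_000_000  # public-domain figure cited in the post

# Annual book spending implied by the budget and the book share.
book_budget = materials_budget * book_share

# Upper bound on the membership cost per public-domain volume.
fee_per_volume = membership_fee / public_domain_volumes

# Membership fee as a fraction of the annual book budget.
fee_share_of_book_budget = membership_fee / book_budget

print(f"Annual book budget: ${book_budget:,.0f}")
print(f"Fee per volume: ${fee_per_volume:.3f}")
print(f"Fee as share of book budget: {fee_share_of_book_budget:.2%}")
```

On these assumptions the fee works out to less than half a cent per volume and well under one percent of what the average library spends on books in a year.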

To be clear, no one has to be a HathiTrust member or affiliated with any library at all in order to get access to the full content of these books; anyone may read them online, and can print or download them one page at a time. But in return for a modest membership fee that goes towards the still-ongoing creation of HathiTrust and its management and upkeep, member institutions are able to make access to those books incredibly easy and efficient, thus making possible not only enhanced real-time access, but also the building of rich personal libraries that could never have been created before.

Now, it’s worth acknowledging that HathiTrust’s collection of public domain books isn’t the same thing as a research collection that is made up of more contemporary and locally-relevant content and that has been shaped with institutional and curricular needs in mind. No one would argue that any library should chuck its book collection and replace it with the HathiTrust corpus.

However, what access to HathiTrust does do shouldn’t be underestimated either. Its capacity to transform research in the humanities (which are treated as second-class disciplines on many research campuses) in particular is tremendous, and when coupled with print-on-demand technologies and with the proliferation of affordable e-readers, it has the potential to hugely enrich all of our lives as readers as well. At the institution where I work, access to these books is changing the shape of some curricula already.

One last point: by undertaking the work of robustly archiving, preserving, and creating public access to these books, HathiTrust has (quietly and implicitly) thrown down the gauntlet in front of every academic and research library that has its own collection of public domain books—especially of books that are rare or unique. HathiTrust provides a mechanism for storing and making publicly available copies of those books; all you have to do is scan them and submit the images. This can’t be done at no cost, of course, but it seems like something that all of us ought to be willing to do to some degree, and that most of us could do more efficiently in partnership with HathiTrust than on our own. The challenge is particularly acute for those libraries that consider themselves to be on the front lines of providing open access to scholarly information. In most of our libraries there are troves of documents that can be shared without any concern for copyright restrictions, without the need for licensing of any kind, without fighting publishers over the structure of scholarly communication, and without having even to worry about rightsholders—because there are none. These documents really and truly are owned by the public, and all of us may do whatever we want with them: copy, redistribute, republish, create derivative works, commercially exploit, whatever. In order to make this possible with the public domain books that we own in research libraries, all that’s required is for us to put our money where our mouths are. HathiTrust provides a platform and a suite of services that make it easier than ever for us to do so.

[UPDATE: HathiTrust’s latest newsletter features a great piece by Executive Director Mike Furlough on the background and significance of this milestone achievement.]

Rick Anderson

Rick Anderson is University Librarian at Brigham Young University. He has worked previously as a bibliographer for YBP, Inc., as Head Acquisitions Librarian for the University of North Carolina, Greensboro, as Director of Resource Acquisition at the University of Nevada, Reno, and as Associate Dean for Collections & Scholarly Communication at the University of Utah.


20 Thoughts on "5 Million Public Domain Ebooks in HathiTrust: What Does This Mean?"

Completely agree, Rick. The HathiTrust and other digital libraries, like the DPLA, are incredible resources. It is extraordinary what technology enables today, and what these libraries empower now and will empower into the future as more collections are included.

The HathiTrust is indeed a noble enterprise, but it could have been more if Google had been willing. When the Library Project started, after many publishers had already been engaged with the Google Partner program, we at Penn State Press (the first university press to join the Partner program) approached the University of Michigan library with a proposal: give us a copy of the scan of all of our books in your collection and we’ll grant you greater use rights than you have now. Michigan was willing, Google was not. (I believe Google has changed its mind about this in more recent years.) We had launched the Office of Digital Scholarly Publishing at Penn State and wanted to avoid digitizing books for this open-access publishing project if we could. (By the way, the executive director of HathiTrust, Mike Furlough, used to be co-director of the ODSP.) HathiTrust at one point started an effort to identify which scanned books were truly free of copyright restrictions, but it famously erred on some assessments that were challenged by the Authors Guild and has since suspended the program. So, do we really know how many public-domain books there are in the collection overall? What is of special value in HathiTrust are those “special collections” that various of the libraries came forward to contribute, Penn State among them.

Slight correction. HathiTrust erred in identifying potentially orphaned works and suspended that program. They continue to research the copyright status of works in the collection to determine whether or not a work is still protected. That’s how Rick could determine the number of works in the public domain.

The restrictions they place on accessing public domain works are irritating as compared with Gutenberg. Compare “Shakespear jest-books”, by W. Carew Hazlitt, as found in HathiTrust with the same title in Gutenberg. HathiTrust’s scans as PDF don’t allow text selection, so scholars must do as they did in the paper era and transcribe text letter by letter. This is unconscionable in this digital age.

Actually, that’s only partly true — here is some copied and pasted text from a 1916 Wodehouse novel I downloaded from Hathi:

“He does not stop to lament, nor does he hang about analyzing his emotions. He runs and runs and runs, and keeps on running until he has worked the poison out of his system. Not until then does he attempt introspection.”

It’s true that you can’t select and copy text from the online scan, but even if you’re not a HathiTrust member you can quickly download the page on which the desired text is found and then copy and paste from the resulting .pdf document. (To do so is, as Bertie Wooster would say, the work of but a moment.)

I think the difference you’re seeing between Hathi’s online functionality and that of Project Gutenberg is largely a function of scale: HathiTrust has digitized 13 million volumes, while Project Gutenberg has only done 47,000. When you’re operating on a much smaller scale you can do things like create HTML versions of every document.

It’s also possible to copy/paste text directly from the HathiTrust interface, without having to download pages or whole books: using the toolbar at the right of the screen, you can toggle from page image view to text view, which shows you the highlightable electronic text underlying the page. It’s important to realize, though, that because the text is generated on a massive scale by an uncorrected OCR process, the quality may vary depending on the age and quality of the digitized book.

Rick, thanks so much for this post. As I already mentioned to you I have a blog post coming on this very topic, where I will go into a little more detail about the 5 million volumes and how we got there. It’s truly a great thing and it has been thanks to the work of hundreds of people in the HathiTrust membership and beyond.

Thanks, Kevin. I actually added that link to the end of the posting itself, above, but it’s helpful to have it here in the comments as well.

Yes, unfortunately different countries have different copyright laws, and this means HathiTrust can only provide free access to just over 2 million books to those outside the US. I’m sure they’ll offer a personal apology if you contact them directly. 😉

The problem of access from Europe is greater than that. Often the authors died more than a century ago, which makes their books public domain worldwide, yet it is long and painful to obtain access to them. At least HT is willing to do research, while Google prevents French people from accessing 19th-century editions of Jules Verne :((

Rick, I need to suggest a correction. You start out by stating “… the number of public-domain books in the HathiTrust digital repository topped 5 million”. The key problem is the word “books”. When HathiTrust reports numbers they use “volumes” — which means a part (a fairly large part as it turns out) are individual serial volumes. When we at OCLC Research roll up the data under “books” and “serial titles” we get more like 2 million. But it is a large number nonetheless and you are right to celebrate the achievement.

Good point, Roy, thanks — it would be more accurate to say “5 million volumes.” For the purposes of most readers and researchers there may not be a consistently meaningful distinction between “book” and “volume,” but it’s a very meaningful distinction for purposes of bibliographic analysis.

To be precise, it’s 2,365,771 unique titles. It’s an important distinction for sure. Since for serials, access status will depend in part on the publication date of the issue or volume, we have tended to emphasize volumes over titles in the counts to convey the quantity of content available. Thanks, Roy.
