When my company added six million Google Book Search and Google News Archive links to two of its databases last month, I learned a few things about some puzzling disparities in Google’s treatment of scanned public-domain works.
The two databases, 19th Century Masterfile (NCM) and Public Documents Masterfile (PDM), are discovery aids that link to many millions of documents, nearly all of which are in the public domain. By adding links to the locations of several million of these on Google sites, we were able to discern, very clearly, differences in Google’s treatment of 19th century historical and literary materials versus scanned government documents.
Click through a bibliographic record to a 19th century item, and you will be able to download the complete document in PDF or Mobipocket format about 90% of the time. By contrast, click on a bibliographic record linking to a government document, and you will be met in most cases with no preview, a snippet view, or a limited view. The full text seems to be available only 10-20% of the time.
I have been asking around about this and have been the beneficiary of any number of theories, but have not been able to get any real clarity as to why this discrepancy exists.
The same thing happens when the documents are viewed through a university’s system. When I downloaded a 19th century document, a scanned version of ‘An Introduction to Greek and Latin Etymology’ from the University of Michigan library, I was struck by the wording of the introduction that Google includes at the front of each of its scans, as follows:
Google is proud to partner with libraries to digitize public domain materials and make them widely accessible. Public domain books belong to the public and we are merely their custodians. Nevertheless, this work is expensive, so in order to keep providing this resource, we have taken steps to prevent abuse by commercial parties, including placing technical restrictions on automated querying.
We also ask that you:
- Make non-commercial use of the files. We designed Google Book Search for use by individuals, and we request that you use these files for personal, non-commercial purposes.
- Refrain from automated querying. Do not send automated queries of any sort to Google’s system: If you are conducting research on machine translation, optical character recognition or other areas where access to a large amount of text is helpful, please contact us. We encourage the use of public domain materials for these purposes and may be able to help.
- Maintain attribution. The Google “watermark” you see on each file is essential for informing people about this project and helping them find additional materials through Google Book Search. Please do not remove it.
- Keep it legal. Whatever your use, remember that you are responsible for ensuring that what you are doing is legal. Do not assume that just because we believe a book is in the public domain for users in the United States, that the work is also in the public domain for users in other countries. Whether a book is still in copyright varies from country to country, and we can’t offer guidance on whether any specific use of any specific book is allowed. Please do not assume that a book’s appearance in Google Book Search means it can be used in any manner anywhere in the world. Copyright infringement liability can be quite severe.
Google (or its lawyers) seems uncertain about the rights it has to public domain scans. However, my interpretation is that it is clear about its desire to restrict use by other commercial entities or non-commercial organizations.
Without further information from the source, it’s difficult to know why access to government documents is so noticeably restricted, in contrast to the broad availability of 19th century materials. One is also left to wonder whether Google believes it has any legal grounds for its “Usage Guidelines,” which seem to defy the very nature of public access.
Meanwhile, it seems Microsoft has plans of its own involving public domain materials. An article in Sunday’s Times Online announced that 65,000 19th century works of fiction from the British Library’s collection will be made available for free public download this spring. The article calls this “the latest move in the mounting online battle over the future of books,” and one wonders whether the British Library’s Microsoft-backed project may be the first in a series of initiatives aimed at reducing Google’s stranglehold on public-access materials.
Given how confused Google appears to be about what “public access” means, some competition in this area might be a good idea.