
It is a truth universally acknowledged that when given the choice between something that is free and something that costs even a small amount of money, other things being equal, most people recall that they are rational actors in an economic morality play and choose the free item.

I suppose it’s another universally acknowledged truth that when you begin to write anything in which Jane Austen will be cited, it is irresistible to quote the famous opening line of “Pride and Prejudice.” It’s a great line from a great book, one not only studied by literary scholars but taught in classrooms everywhere and even available in airport bookstores, where it sits on the shelf along with the small number of other public-domain classics that have established an enduring place in the popular imagination:  “Great Expectations,” “Moby Dick,” “Vanity Fair,” “Portrait of a Lady,” etc.  In our digital age, those great incipits can be cited with abandon (“Happy families are all alike”; “Call me Ishmael”; even “This is the saddest story I ever heard”), as someone need only copy them into the Google search bar to discover their provenance (“Anna Karenina,” “Moby Dick,” and Ford Madox Ford’s “The Good Soldier”).  The pyrotechnology of Google married to the classics:  What’s not to like?  Google can make even the elliptical poetry of Ezra Pound accessible to many readers.

Other things are not always equal, however. Step inside your local bookstore, assuming your town is fortunate enough to still have one, and pick up a copy of, say, the Penguin Classics edition of “Martin Chuzzlewit” or the Oxford Classics version of “Tom Jones.”  These public-domain classics have been beautifully edited:  authoritative texts, useful introductions and suggestions for further reading, and notes for obscure references in the text. These are perfect, attractively priced student editions.

For some writers the amount of editorial work is much greater; consider, for example, what it means to edit one of Shakespeare’s plays for students or to translate the works of Aeschylus, Horace, or Flaubert.  All of these works are in the public domain, but the work of translators and editors lifts them above what the public has title to.  Once you throw an editor into the mix, other things simply are not equal.

These thoughts were prompted by my introduction to Google Ebooks, the much-awaited program that promises to revolutionize access to books.

When the program was launched in December, I proceeded to install the Google Books app on my iPad to test the service.  With installation, three titles automatically appeared in my personal library:  “Pride and Prejudice,” “Great Expectations,” and “Alice in Wonderland” — all in the public domain.

I decided to read a few pages of “Pride and Prejudice” to test Google’s e-reader, but Austen being Austen, I am now almost finished with all of her major work:  “Emma,” “Pride and Prejudice,” “Mansfield Park,” “Sense and Sensibility,” “Persuasion,” and “Northanger Abbey.”  I have done this several times before, though in print, and I am not alone, as Austen is one of those writers people keep coming back to.  Hence her availability even in airport bookstores.

Incidentally, one of the advantages of the iPad and other e-reading devices is that you can load the complete Austen onto them, step inside your Man Cave, turn to NASCAR on the TV, and curl up with the tale of Elizabeth Bennet or Emma Woodhouse.  Heck, for all anyone knows, you could be reading “Typee,” Hemingway, or even Norman Mailer.

Rather than pay for the Penguin or any other edited version of Austen, I decided to be a cheapskate and searched for free Google versions.  And that’s when things began to go wrong.  The Google editions were packed with errors. If I were not studying Google Ebooks for professional reasons, if I were not already familiar with the works of Austen, would I have gone on? Would I have thought that Austen does not know how to place quotation marks, that she made grammatical mistakes that would embarrass even a high school freshman, or that her dialogue sometimes breaks off without explanation?  I began to wonder what service or disservice Google had performed, rendering one of the world’s most popular writers in a form as bizarre as the Zemblan translation of Shakespeare in Nabokov’s “Pale Fire.”

The problems with the Google versions of Austen potentially stem from four sources, though it is the third of these that is the principal culprit:

  1. The original print edition. Except for those books sent to Google directly by publishers, the books found in Google Ebooks derive from Google’s mass digitization project of library collections. The assumption is that if a university library puts a book on its shelf, that book must be okay.  This is a bad assumption, however.  Publishers make mistakes, libraries make mistakes; over time the seriousness of these mistakes becomes more apparent.  Texts have to be reviewed; sometimes explanatory notes are necessary to provide context.  Take a look at the public-domain 1911 edition of Encyclopaedia Britannica (not part of Google Ebooks) and ask yourself if some people will confuse the text for a modern one.
  2. Digital scans of the print edition. Digital scanning has gotten very good; as far as I know, Google does as good a job as anyone at this.  But scanning can nonetheless introduce errors and odd artifacts and in any event does not provide a text in a modern typeface that can fit any size screen.  This may not be a problem for a scholar working with obscure works, but for popular works like “Pride and Prejudice,” this raises a real barrier to readership.
  3. Optical character recognition (OCR). OCR has come a long way. Each Google Ebook of a scanned text is accompanied by an OCR’d version, which allows the user to change fonts and type size and reflow the text. Unfortunately, some of the OCR for the Austen volumes I read was simply terrible. Words got mashed together, spacing was bizarre, punctuation was simply not picked up, etc., etc. The problem for Austen is that it is the OCR version, not the scanned pages, that most readers are likely to use, as they will be reading from mobile devices with tiny screens, which require reflowed text.
  4. Metadata. The metadata for the public domain works in Google Ebooks is atrocious.  Geoffrey Nunberg has written forcefully about this.  To his comments I would add that the crime is worse for popular works such as Austen’s.  Scholars can struggle with poor metadata, but someone who might pick up a classic but once in his or her life is not likely to make much of an effort.  My experience with “Mansfield Park” is not atypical.  I began to read the work (in the corrupt OCR version) and then came to what appeared to be the end.  But it was marked the “end of Volume One.” There was no Volume Two.  I had a similar experience when researching books about the Beatles and encountered a title and cover for a Beatles book, but inside was a musical score by Mozart.

I wish to be clear that I am restricting my criticism to the small number of literary classics that continue to have a popular readership.  Google has done a disservice to these works and their readers.  Free is a terrible price, as many readers will flock to these free editions — not knowing that other things are not equal — bypassing the edited volumes prepared by scrupulous publishers.

The horse is out of the barn, alas:  free rules, and damn the consequences. What is needed is a branded collection of reader’s editions for popular public-domain works. Such editions would not have all the apparatus provided by a Penguin or Oxford, but they would have reliable texts and be in technical formats such that schoolkids can read them on their phones. They would have to be free.

I would like to see an authoritative organization — the Modern Language Association, perhaps? — take this on, perhaps by beginning with a review of the texts in Project Gutenberg.  There would be a one-time cost to get this going and then a small cost for maintenance. There is no self-evident way to recoup these costs.  The number of titles would be in the range of 200-300.  Titles that pose challenges (Shakespeare, translations) could be put off for a later time.  Perhaps the founders of Google, whose personal wealth will someday be the subject of literary epics, would underwrite this.  This is what is meant by “giving back” after you have taken away.

The branding (MLA Editions or something of that kind) is key as teachers could then recommend these free editions to their students.  It’s a shame that the good work of Penguin, Oxford, and other reputable publishers would be diminished by this, but the onslaught of uncurated free culture creates new costs, some hidden, everywhere.

Joseph Esposito

Joe Esposito is a management consultant for the publishing and digital services industries. Joe focuses on organizational strategy and new business development. He is active in both the for-profit and not-for-profit areas.


33 Thoughts on "The Terrible Price of Free: On E-reading Jane Austen via Google's Ebooks"

I read the free Kindle version of _The Woman in White_ earlier this month, and I noticed no errors whatsoever.

As a (very) recent owner of a Kindle I have come to a rather startling discovery which your description of reading Austen on Google Books confirms: the public domain isn’t really as free as we think it is. The metaphor of texts “in the public domain” suggests that, like some enchanted garden grove, we can simply enter “the public domain” and pluck these texts like perfectly ripened fruit. But as you note, just because a text is unencumbered by copyright does not mean, as a practical matter, that it is available in a readable, authoritative, trustworthy edition.

All the reservations you point out about GBooks, I think, are true of the wider “market” (if we can call it that) in free public domain etexts to a greater or lesser extent. Even when the OCR errors are not as egregious as the ones you describe, small hiccups in the reading text combine with a persistent worry that I may not really know exactly what I’m reading. (I mean, are these the poems of John Keats I want?)

The four problems you note are all quite right; I wonder if we might carve the turkey slightly differently. I see two chief problems with many of these texts: one is the problem of metadata/OCR which you (and Geoffrey Nunberg) have pointed to. The other is the challenge of identifying the right text. All of these mass digitization projects are scanning old books; but the “best” reading text may be an eclectic edition, combining various texts from various books, created through some labor of scholarly editing. Indeed, in cases with complex textual histories (which is to say, most cases), the best reading text of a public domain work may be in copyright. I was recently looking at the print history of Joyce’s Portrait of the Artist as a Young Man and concluded that a pretty serious gap exists between any public domain text and the text one would ideally want as a reading text (the most recent such text is that established by Hans Walter Gabler; of the public domain editions the 1918 British Egoist edition is likely the best, but it is the 1916 Huebsch edition that one finds most frequently at and Google Books).

Your suggestion that someone could do an enormous service to the world by helping to make the public domain actually public seems exactly right. This seems like a wonderful opportunity for scholars to do a service to a wider reading public, one that includes not only fugitive Austenites in ‘Man Caves’ but (as you suggest in closing) high school and college students in introductory lit surveys. Take one more step, though, and imagine the ultimate results if such labor were undertaken: the benefits of establishing reliable, well-edited, and freely available reading texts of “major works” would be enormous. Not only could one find a free, legible, reliable text; these texts could provide the basis for further (open/free) scholarly work: roll your own anthologies with additional notes keyed to the needs of a specific class; more complex, variorum editions mixing available public domain texts; etc.

You note the costs; to those I’d add the challenge of incentivizing scholars to actually do this work. And who would do this? Might such a project have a place under the umbrella of the emerging National Digital Public Library, for example? It seems an incontestably valuable, even necessary, project at this point. If anyone’s doing it, sign me up.


Thanks for raising these issues. You could re-post this article nearly verbatim (only switch out “Google” with the name of a few major online book retailers), and re-title it “The Terrible Price of Nearly Free.”

About your suggestion in the next-to-final paragraph: “. . . perhaps by beginning with a review of the texts in Project Gutenberg.” Is this because you think that Project Gutenberg is a major offender or, on the other hand, because the texts in Project Gutenberg might offer the best starting points for arriving at authoritative texts?

I don’t know how authoritative the Gutenberg texts are. I hope they are good, but I don’t know. I thought that an investigation into solving the problem I described could begin with Gutenberg as a means of keeping down costs. But the Gutenberg licenses are terribly restrictive, so Gutenberg might be eliminated from this project because of the business rules. Note that I am talking about creating “reader’s editions,” not student’s or scholar’s editions. Penguin and Oxford among others do a great job for students. The need (which I regret exists at all) is for something free to push aside the sloppy free editions that are already out there. It pains me to write this, as I admire Penguin and Oxford so much. But they have been Googled: the triumph of IT over culture.

Gutenberg’s licenses allow you to do pretty much what you want with the texts as long as you remove references to Gutenberg from them – check out all the classics Kindle Editions on Amazon – they’re created using Gutenberg – you’ll soon notice they are pretty much identical. However, Amazon are now changing their policy on this and requiring that each one is not just a re-hash of the Gutenberg text. I think they’re going to distribute their own free Gutenberg version. Gutenberg texts are already available in Kindle version and can be downloaded quite nicely to a Kindle App no matter what device it’s on.

“the triumph of IT over culture” seems a bit harsh.

Didn’t Plato warn us that this newfangled “writing” thing would ruin memory? From writing to e-books, technology has been an integral part of human culture. To set technology and culture against each other is to reinforce the idea of “two cultures” at odds.

Looks like Google have scored a pretty big own goal and gifted the market for free classics to Amazon if they’re really just using OCR versions of PDF. As an occasional user of I have experienced the disappointment of trying to read the text versions of Google scanned PDFs only to realise that they are completely unreadable.

Of course it would take a massive budget, or the hard work of volunteers such as the Gutenberg folk, to correct this mistake.

I suspect that Google will actually have to go the Amazon route and take properly formatted text from Gutenberg for these popular free classics.

What is also worrying is that Google obviously haven’t done any quality testing before unleashing free book content to the world. Silly, very silly.

You may also want to compare the version at Wikisource, where readers can correct OCR errors themselves immediately. (It’s a set of HTML pages, but it is not too hard to collect them into a PDF book using the “Print/export” link on the left hand side.)

I recall that I made a very similar point almost two years ago in the Chronicle on Google’s scanning of historical works in the public domain. My comments back then were dismissed with a good deal of sniffle (on the Read-20 list) or generally passed over. Glad to see that the issue is still being addressed and by people like JE.

I recall Ron’s piece. It was very good. There has been a great deal of commentary on this topic. What finally motivated me to write about this was the realization that the sloppiness of the Google program was affecting that small part of our culture with constituencies in both research universities and the popular marketplace. This did not have to happen.

Interesting and odd that they chose these versions with errors rather than the very high quality versions available at Feedbooks or Project Gutenberg (especially for popular works).

Google Books has been like this for years. It’s totally unusable. You should have gone directly to Project Gutenberg or Feedbooks (which mostly comes from PG), like Keith said above.

The real problem is that no one can edit and fix these errors easily. I’d be glad to fix obvious errors while I read if it was possible, but it’s not. The technology exists today to make this a reality, but I’m not going to hold my breath.

The OCR behind Google Books (or materials is indeed unusable in many instances. However, unlike PG, it has the advantage (in most cases) of clear (or at least clearer) provenance: that is, where the text is coming from (the specific edition, and physical book) can be inferred from the page images, even if the metadata (as it stands) is in rough shape.

I’ve used PG for many things; but with a PG text exactly what you’re reading is not always obvious. If I download Great Expectations (to take an example at random), which ending am I getting? By giving page images, Google Books at least provides enough information for anyone willing to track down what she’s actually looking at.

The ability for readers to fix errors in GBooks would, indeed, be a boon.

If you need to know this, you can usually find it. First, the front pages are always kept, so any information there should be available, though on older books it often doesn’t help much.

Second, most of the books at PG have come through Distributed Proofreaders, which keeps the original page scans available, as a downloadable package. You can go there, see if the book is listed, and look at (or download) the pages/text. I understand the long term plan is to have those page scans available directly from PG, though it hasn’t happened yet.

Thanks for this info. Detailed bibliographical information, though, is not available for most PG books (at least via PG). It is unclear to me which version of Joyce’s Portrait the PG text reproduces. The version of Wells’s The Time Machine offers only a date; but the novel was originally published a few years earlier.

The larger issue is that establishing the best texts requires quite a bit of work. I admire PG enormously; they were working on improving access to the public domain long before Google was even a glimmer. But their model is bibliographically naive.

These issues may seem irrelevant for many readers–only of concern to scholars. And indeed, this is often the case. However, a sort of guaranteed “bibliographical nutrition content” list (this text reproduces this bibliographical item) is of value to all readers; and there is no reason for widespread access to public domain materials not to be bibliographically sound.

Whether GBooks or PG offers better raw materials for creating a good, clean, and bibliographically clear reading text is a question I can’t really answer. I do think, however, that greater attention to bibliographical specificity would be a boon to readers. The texts being generated now (and which have been generated over the last decade and more by PG) will be recycled and repurposed in years ahead, in new formats. It makes sense to try to get it as right as possible now.

Moreover, the user interface of Google Books now suggests that the image files of the original book are a downgraded ‘preview version’. They lure you into using the new interface and the OCR version by suggesting that you should press an ominous “blue button” asking for your Google account details. Since I am mainly interested in early modern prints, the OCR version is worthless for me (screenshots on the blog that is linked). Still, Google wants to trick users into submitting personal details for viewing a file that could without any loss in user experience be viewed by pressing on the ‘preview’ link. A preview giving you the whole thing is more than just a preview, right?

Beware of blaming Google for being alone in this: I have bought ebooks to find exactly the same problems (dodgy spacing, missing quotation marks making speech difficult to read, errors in punctuation/spelling, etc., which were – I am sure – not in the original print version). (However I suspect the problem was – as you rightly point out – in the scanning process: if an epub version is created from the original e-file then the problems are (hopefully) few.)

I hate to say this, but you are going to find all of these glitches in some books as originally published, too! I recently reviewed a book issued by the (UK) Publishers Association titled “Going Digital” that contained over 100 editorial errors.

I read tons of classics from Project Gutenberg on my Kindle for Android app. Sometimes OCR errors bug me, but I honestly don’t think it “ruins” a book for me. It never has yet, at least. I don’t have a problem with skimming over typos. Typos show up in print books too.

I read Wharton’s Age of Innocence on my iPad (free through Google Books), and ran into the same difficulties. It was completely frustrating, and I kept switching back and forth from “flowing text” to the “scanned pages” to decipher some oddities. E-books are great, but they still have some tweaks to work out.

I too have had the same frustrations with downloading Google Books classics to my e-reader. As a new e-reader owner I thought I was doing something wrong in the way I was adding them to my Library. Clearly the “garbage in, garbage out” rule still applies. What a disappointment!

Try the Project Gutenberg version. I typed it and it was proof-read by someone in England.

Everyone was outraged that Amazon and B&N are selling classics for $1 apiece when they were free elsewhere. I think they are worth the dollar.

Sad, Joe, that as sincere as Google may be in “doing no harm,” it does the harm you describe partly by overreaching. The same might be said of Google’s bibliographic detail for contemporary books; its listing of my own titles is a mess. I suppose there will be some sorting out of these shortcomings eventually, but not without observations such as yours.

“Heck, for all anyone knows, you could be reading ‘Typee,’ Hemingway, or even Norman Mailer.”

Or you could be playing Angry Birds.

I hate to say it, but I’ve found all kinds of errors in “real” ebooks put out by publishers. Fortunately, I didn’t pay for any of these, I got them through my library’s subscription, but I’ve been flabbergasted at the poor workmanship/editing shown in these books. Sometimes it was special characters that were handled improperly (imagine a book with French names and expressions where every single accent was a question mark), sometimes it was page layout.

And free isn’t always bad. I’ve gotten some good stuff from Project Gutenberg (and some bad) and I put together a free ebook recently of some public domain works. My own is still not perfect, and making it has given me a terrific respect for the amount of work it takes, but it’s still pretty good and much better than some of the ebooks that my library has to actually pay to offer us.

Perhaps as eReaders become more popular, we’ll see more quality in both paid and free ebooks. Can hope anyway.
