On the Internet, no one knows you’re a dog. But do they know you’re a human? With the rise of automated robots (or bots), many websites attempt to protect themselves from abuse and spam by requiring that a piece of distorted text be retyped into a box and submitted. Humans are great at pattern recognition; computers are not, at least not yet.

While it takes a human only a few seconds to decipher and submit a piece of distorted text, across millions of users that effort adds up to hundreds of thousands of hours per day that could be harnessed to do something useful.

In their article, “reCAPTCHA: Human-Based Character Recognition via Web Security Measures,” appearing online in the journal Science, Luis von Ahn and colleagues from the computer science department at Carnegie Mellon University surmised that this work could be used for digitizing old printed text, working on the words that optical character recognition (OCR) programs fail to recognize.

CAPTCHA is an acronym that stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It is a test that humans can solve easily but computers cannot. Instead of using randomly distorted words, the web tool reCAPTCHA displays words taken from scanned texts.

Mass automated scanning projects like the Internet Archive or the Google Books Project are digitizing huge numbers of older books, but the OCR is anything but perfect, failing to recognize 20% of all words in some cases. For example, a computer does a poor job converting the following line of text:

OCR of scanned text (from ReCAPTCHA.net)

To first detect a human, websites using reCAPTCHA send the user two words: a known word and an unreadable word. Getting the known word right identifies the user as a human. To account for human error, reCAPTCHA sends the same unreadable word to multiple users and uses consensus among their answers as a means of verification.
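The two-word check and the consensus step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the function names, the case-insensitive matching, and the thresholds (`MIN_VOTES`, `MIN_SHARE`) are all assumptions made for the example.

```python
from collections import Counter

# Both thresholds are assumptions for this sketch, not values from the article.
MIN_VOTES = 3      # guesses needed before an unreadable word can be settled
MIN_SHARE = 0.6    # fraction of guesses that must agree


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def passes_control(typed, known_word):
    """Getting the known word right identifies the user as a human."""
    return normalize(typed) == normalize(known_word)


def record_answer(votes, known_word, typed_known, typed_unknown):
    """Keep the guess for the unreadable word only if the known word passes.

    Returns True when the user is treated as human.
    """
    if not passes_control(typed_known, known_word):
        return False
    votes.append(normalize(typed_unknown))
    return True


def consensus(votes):
    """Return the agreed transcription, or None if users do not yet agree."""
    if len(votes) < MIN_VOTES:
        return None
    word, count = Counter(votes).most_common(1)[0]
    return word if count / len(votes) >= MIN_SHARE else None


# Four users see the same pair: known word "morning", unreadable word.
votes = []
record_answer(votes, "morning", "morning", "upon")
record_answer(votes, "morning", "Morning", "upon")
record_answer(votes, "morning", "mourning", "open")  # fails the known word, ignored
record_answer(votes, "morning", "morning", "upan")
print(consensus(votes))  # prints "upon" (2 of 3 accepted guesses agree)
```

Only answers from users who pass the known word are counted, so a bot that guesses at random cannot pollute the transcription, and a single human typo is outvoted by the other users.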

The rate of transcription now exceeds 4 million suspicious words per day, equivalent to a human workforce of over 1,500 people deciphering 60 words per minute, 40 hours per week. The authors write:

We believe the results presented here are part of a proof of concept of a more general idea: “wasted” human processing power can be harnessed to solve problems that computers cannot yet solve… We hope that reCAPTCHA continues to have a positive impact on modern society by helping to digitize human knowledge.

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist. https://phil-davis.com/

