reCAPTCHA: Workforce of Accuracy

On the Internet, no one knows you’re a dog. But do they know you’re a human? With the rise of automated robots (or bots), many websites attempt to protect themselves from abuse and spam by requiring that a piece of distorted text be retyped into a box and submitted. Humans are great at pattern recognition; computers are not, at least not yet.

It takes a human only a few seconds to decipher and submit a piece of distorted text, but across millions of users those seconds collectively add up to hundreds of thousands of hours per day that could be harnessed to do something useful.

In their article, “reCAPTCHA: Human-Based Character Recognition via Web Security Measures,” appearing online in the journal Science, Luis von Ahn and colleagues from the computer science department at Carnegie Mellon University surmised that this work could be used to digitize old printed text by working on the words that optical character recognition (OCR) programs fail to recognize.

CAPTCHA is an acronym that stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It is a test that humans, but not computers, can solve easily. Instead of using randomly distorted words, the web tool reCAPTCHA displays words taken from scanned texts.

Mass automated scanning projects like the Internet Archive or the Google Books Project are digitizing huge numbers of older books, but the OCR is anything but perfect, failing to recognize 20% of all words in some cases. For example, a computer does a poor job converting the following line of text:

[Image: OCR of scanned text (from ReCAPTCHA.net)]

To first detect a human, websites using reCAPTCHA send the user two words: a known word and an unreadable word (one that OCR could not read). Getting the known word right identifies the user as a human. To account for human error, reCAPTCHA sends the same unreadable word to multiple users and uses consensus among their answers as a means of verification.
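The two-word check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not reCAPTCHA's actual code: the function names, the `scan-17` identifier, the choice to count a guess only when the known word was typed correctly, and the thresholds `min_votes` and `min_share` are all assumptions made for the example.

```python
from collections import Counter


def check_submission(control_word, control_answer, unknown_id, unknown_answer, votes):
    """Return True if the known (control) word was typed correctly.

    Assumption for this sketch: the guess for the unreadable word is
    recorded only when the control word matches, i.e. only for users
    identified as human.
    """
    is_human = control_answer.strip().lower() == control_word.lower()
    if is_human:
        guess = unknown_answer.strip().lower()
        votes.setdefault(unknown_id, Counter())[guess] += 1
    return is_human


def consensus(votes, unknown_id, min_votes=3, min_share=0.6):
    """Return the agreed reading of an unreadable word, or None if there is none yet.

    min_votes and min_share are hypothetical thresholds chosen for illustration.
    """
    counts = votes.get(unknown_id)
    if not counts:
        return None
    total = sum(counts.values())
    word, top = counts.most_common(1)[0]
    if total >= min_votes and top / total >= min_share:
        return word
    return None


# Four users see the same unreadable word; one fails the control word.
votes = {}
check_submission("morning", "morning", "scan-17", "upon", votes)
check_submission("morning", "Morning ", "scan-17", "upon", votes)
check_submission("morning", "mourning", "scan-17", "open", votes)  # control failed, guess discarded
check_submission("morning", "morning", "scan-17", "upon", votes)
print(consensus(votes, "scan-17"))  # prints "upon"
```

Keeping the human check (the known word) separate from the transcription vote (the unreadable word) means a single mistyped guess cannot decide a word on its own; it takes agreement among several users.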

The rate of transcription now exceeds 4 million suspicious words per day, equal to a human workforce of over 1,500 people deciphering 60 words per minute, 40 hours per week. The authors write:

We believe the results presented here are part of a proof of concept of a more general idea: “wasted” human processing power can be harnessed to solve problems that computers cannot yet solve…We hope that reCAPTCHA continues to have a positive impact on modern society by helping to digitize human knowledge.

About Phil Davis

I am an independent researcher and consultant, a former postdoc in science communication and science librarian.

The mission of the Society for Scholarly Publishing (SSP) is "[t]o advance scholarly publishing and communication, and the professional development of its members through education, collaboration, and networking." SSP established The Scholarly Kitchen blog in February 2008 to keep SSP members and interested parties aware of new developments in publishing.
The Scholarly Kitchen is a moderated and independent blog. Opinions on The Scholarly Kitchen are those of the authors. They are not necessarily those held by the Society for Scholarly Publishing nor by their respective employers.