Google acquires reCAPTCHA for book scans
By teaching computers to read, Google hopes to bolster Google Books and the Google News Archive
Google has acquired reCAPTCHA, the company best known to users as the provider of those (slightly annoying) tests that make you type out squiggly, distorted words before you can sign in to a site. The idea is to prevent bots from, say, buying all the tickets for a show in the first 10 seconds of a sale or signing up for every available email address.
Google says reCAPTCHA currently guards over 100,000 Web sites from such spam attacks.
The service has much broader applications, though.
reCAPTCHA is aiding the massive task of digitizing books, newspapers and old-time radio shows. For physical books, it’s a two-step process: scan a page, then transform the image into text using "Optical Character Recognition" (OCR).
Unfortunately, even the most sophisticated OCR program cannot reliably transcribe every scanned page. In some older books, for example, time has taken its toll on the paper and ink, or the font is just plain weird. A human, though, can usually figure out what the word is.
According to reCAPTCHA’s Web site:
About 200 million CAPTCHAs are solved by humans around the world every day. In each case, roughly ten seconds of human time are being spent. Individually, that's not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day.
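As a rough check on the scale in that quote, the arithmetic can be worked through directly. This is only a back-of-the-envelope sketch that takes the two quoted figures at face value (200 million CAPTCHAs a day, ten seconds each); the variable names are illustrative.

```python
# Back-of-the-envelope check using the two figures quoted above.
captchas_per_day = 200_000_000   # "about 200 million" (quoted figure)
seconds_per_captcha = 10         # "roughly ten seconds" (quoted figure)

total_seconds = captchas_per_day * seconds_per_captcha
total_hours = total_seconds / 3600  # 3600 seconds in an hour

print(f"{total_hours:,.0f} hours per day")
```

Taken literally, the inputs come to roughly 555,000 hours a day, which is consistent with the site's "more than 150,000 hours."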
reCAPTCHA gives users two words. The first is a word reCAPTCHA knows. The second is the word from that ancient or damaged text that the computer is trying to transcribe. If a user gets the first word right, then reCAPTCHA assumes it’s dealing with a human, and accepts the user’s input for the second word. After many run-throughs with many different users, reCAPTCHA pools all the inputs for the second word and assumes the majority answer is probably what the word actually is.
In this way, reCAPTCHA can continually utilize the crowd to correct and improve its OCR.
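The two-word scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not reCAPTCHA's actual code: the class name `UnknownWordPool`, the `min_votes` threshold, the case-insensitive matching and the strict-majority rule are all assumptions made for the example.

```python
from collections import Counter


def normalize(text):
    """Compare answers case-insensitively, ignoring surrounding spaces (an assumption)."""
    return text.strip().lower()


class UnknownWordPool:
    """Collects guesses for one word the OCR could not read (hypothetical helper)."""

    def __init__(self, min_votes=3):
        self.min_votes = min_votes  # assumed minimum number of guesses before deciding
        self.votes = Counter()

    def submit(self, control_answer, control_truth, unknown_answer):
        """Return True if the user got the known word right; only then count the guess."""
        if normalize(control_answer) != normalize(control_truth):
            return False  # failed the known word: treat as a bot, discard the guess
        self.votes[normalize(unknown_answer)] += 1
        return True

    def consensus(self):
        """Return the majority guess, or None if there is no clear majority yet."""
        total = sum(self.votes.values())
        if total < self.min_votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count * 2 > total else None


pool = UnknownWordPool(min_votes=3)
pool.submit("morning", "morning", "upon")   # passes, guess counted
pool.submit("mornin", "morning", "xxxx")    # fails the known word, guess ignored
pool.submit("Morning", "morning", "upon")   # passes, guess counted
pool.submit("morning", "morning", "upan")   # passes, dissenting guess counted
print(pool.consensus())  # prints "upon"
```

The known word acts as the gatekeeper: a guess for the unknown word is only pooled when it comes from someone who has already passed as human, and a single wrong guess is outvoted once enough people have answered.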
Google’s acquisition of the company makes a lot of sense, considering that Google is currently invested in two large-scale digitization projects: Google Books and the Google News Archive.