
Google acquires reCAPTCHA for book scans

By teaching computers to read, Google hopes to bolster Google Books and the Google News Archive

Technology trends and news by Ronny Kerr
September 16, 2009 | Comments
Short URL: http://vator.tv/n/aa8

Google has acquired reCAPTCHA, the company known to most users as the provider of those (slightly annoying) tests where you have to type out the squiggly, morphed words on screen before you can sign in to a site. The idea is to stop bots from, say, buying all the tickets for a show in the first 10 seconds of a sale or signing up for every available email address.

Google says reCAPTCHA currently guards over 100,000 Web sites from such spam attacks.

The service has much broader applications, though.

reCAPTCHA is aiding the massive task of digitizing books, newspapers and old-time radio shows. For physical books, it’s a two-step process: scan a page, then transform the scanned image into text using "Optical Character Recognition" (OCR).

Unfortunately, even the most sophisticated OCR program cannot reliably transcribe every scanned page of text. In some older books, for example, either time has taken its toll on the paper and ink or the font is just plain weird. But humans can usually figure out what a word says.

According to reCAPTCHA’s Web site:

About 200 million CAPTCHAs are solved by humans around the world every day. In each case, roughly ten seconds of human time are being spent. Individually, that's not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day.
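The arithmetic behind that quote is easy to check. Here is a minimal sketch that takes the two quoted figures at face value; the variable names are illustrative only.

```python
# Figures as quoted from reCAPTCHA's site; names are illustrative.
captchas_per_day = 200_000_000
seconds_per_captcha = 10

# Convert the total seconds spent per day into hours.
hours_per_day = captchas_per_day * seconds_per_captcha / 3600
print(f"{hours_per_day:,.0f} hours per day")
```

Taken literally, the quoted inputs work out to roughly 555,000 hours a day, which is consistent with the site's "more than 150,000 hours".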

reCAPTCHA gives users two words. The first is a word reCAPTCHA knows. The second is the word from that ancient or damaged text that the computer is trying to transcribe. If a user gets the first word right, then reCAPTCHA assumes it’s dealing with a human, and accepts the user’s input for the second word. After many run-throughs with many different users, reCAPTCHA pools all the inputs for the second word and assumes the majority answer is probably what the word actually is.
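The two-word scheme described above can be sketched in a few lines. This is only an illustration of the idea as the article describes it, not reCAPTCHA's actual code: the function names, the case-insensitive comparison and the simple "most votes wins" rule are all assumptions.

```python
from collections import Counter

def accept_submission(control_answer, known_word, unknown_answer, votes):
    """Count the guess for the unknown word only if the control word is right."""
    if control_answer.strip().lower() != known_word.lower():
        return False  # control word failed: treat as a bot and discard the guess
    votes[unknown_answer.strip().lower()] += 1
    return True

def majority_answer(votes):
    """Return the most common transcription so far, or None if there are no votes."""
    if not votes:
        return None
    word, _count = votes.most_common(1)[0]
    return word

# Hypothetical submissions: (answer for the known word, guess for the unknown word).
votes = Counter()
submissions = [
    ("morning", "upon"),
    ("morning", "upon"),
    ("mourning", "open"),  # gets the known word wrong, so the guess is ignored
    ("morning", "upan"),
]
for control, guess in submissions:
    accept_submission(control, "morning", guess, votes)

print(majority_answer(votes))
```

A real system would presumably need more than this, such as a minimum number of agreeing answers before a word is accepted, but the article does not go into those details.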

In this way, reCAPTCHA can continually utilize the crowd to correct and improve its OCR.

Google’s acquisition of the company makes a lot of sense, considering that Google is currently invested in two large-scale digitization projects: Google Books and the Google News Archive.

