Carnegie Mellon ‘reCAPTCHA’ Tool to Boost Book Digitizing Efforts

A computer scientist at Carnegie Mellon University is enlisting the help of multitudes of Web users every day to eliminate a technical bottleneck that has slowed efforts to transform books, newspapers and other printed material into digitized text that is searchable by computer.

Luis von Ahn, an assistant professor of computer science, says the project will also improve the Web security systems used to reduce spam and make it possible for individuals to safeguard their own email addresses from spammers.

The Carnegie Mellon project finds a second use for an existing technology called CAPTCHAs, the distorted-letter tests found at the bottom of registration forms on Web sites such as Yahoo, Hotmail, PayPal and Wikipedia. The tests require users to type the distorted letters they see into a box.

CAPTCHA is an acronym for Completely Automated Public Turing Test to Tell Computers and Humans Apart. As the name suggests, the tests distinguish between legitimate human users and malevolent computer programs designed by spammers: reading the distorted letters is difficult for computers but easy for humans.

Von Ahn, working with computer science professor Manuel Blum, undergraduate student Ben Maurer and research programmer Mike Crawford, invented a new version of the test, called reCAPTCHAs, that will help convert printed text into computer-readable letters on behalf of the Internet Archive.

The San Francisco-based non-profit group administers the Open Content Alliance and is one of several large initiatives working to digitize books and other printed materials under open principles, making the text searchable by computer and capable of being reformatted for new uses.

Optical character recognition (OCR) systems that automatically perform this conversion are often stumped by underlined text, scribbles and fuzzy or otherwise poorly printed letters. ReCAPTCHAs will use words from these troublesome passages to replace the artificially distorted letters and numbers typically used in CAPTCHAs.

The new tests continue to distinguish between humans and machines because they use text that OCR systems have already failed to read. And because people must decipher these words to pass the reCAPTCHA test, they will help complete the expensive digitization process.

With support from Intel, von Ahn’s team has devised a free, Web-based service that allows individual webmasters to install reCAPTCHAs to protect their sites. Individuals can also use the service to protect their own email addresses, or lists of addresses they post on personal Web pages. In the case of some commercial Web sites with heavy traffic, reCAPTCHA may charge a fee to pay for additional bandwidth.

To make sure that people are correctly deciphering the printed text, the reCAPTCHA system will require Web site visitors to type two words, one of which the system already knows. Each unknown word will be submitted to multiple visitors. If a visitor types the known word correctly, the system has ‘greater confidence’ that the same visitor has typed the unknown word correctly. If several visitors type the same answer for the unknown word, that answer will be assumed to be correct.
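
The two-word check described above can be illustrated with a short Python sketch. It is not the reCAPTCHA team's implementation. The class name, the method name, the agreement threshold of three (the article says only "several") and the case-insensitive comparison are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Assumed value: the article says only that "several" visitors must agree.
AGREEMENT_THRESHOLD = 3


class WordVerifier:
    """Hypothetical sketch of the known-word / unknown-word check."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        # unknown word id -> tally of the answers visitors have typed for it
        self.votes = defaultdict(Counter)
        # unknown word id -> answer accepted once enough visitors agree
        self.resolved = {}

    @staticmethod
    def _normalize(text):
        # Assumption: answers are compared ignoring case and outer spaces.
        return text.strip().lower()

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the visitor passes the test.

        The visitor's reading of the unknown word is counted only when
        the known word was typed correctly.
        """
        if self._normalize(typed_known) != self._normalize(known_answer):
            return False  # failed the control word, so discard the guess

        if unknown_id not in self.resolved:
            guess = self._normalize(typed_unknown)
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.threshold:
                # Enough visitors agree: treat this reading as correct.
                self.resolved[unknown_id] = guess
        return True
```

In this sketch, a visitor who mistypes the known word fails the test and contributes nothing to the tally, while an unknown word moves to `resolved` once the same reading has been submitted the assumed number of times.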