Thursday, 1 November 2012

Clever reCAPTCHA




Thank you! Thank you so much for helping to correct thousands, if not millions, of pages of scanned text!

What? You aren't doing anything of the sort? Oh, yes you are ... read on!

Optical character recognition (OCR) is the method used to digitise printed material. Old newspapers and books, for example, can be scanned using OCR technology, and then we can access them online. The National Library of Australia uses this technology for its massive Trove database.

The problem with OCR is that it's not always that accurate, especially with older printed material with yellowing paper and faded or smudged ink. It is a lot better than it used to be, but is still far from 100% accurate; 80% accuracy is more typical. So the resulting scanned texts have a lot of errors!

Here's an example:



We can read the top sentence of scanned type (This aged portion of society were distinguished from ...), because we humans are fucking brilliant, but a computer has a lot of trouble distinguishing the letter patterns. The human eye is just better at seeing those letters!

So, there's a problem. We have millions of pages of inaccurate scanned text, with 20% errors. Which makes it damn difficult to read and use, needless to say.

Now, you'll be familiar with CAPTCHA technology already. CAPTCHA stands for 'Completely Automated Public Turing test to tell Computers and Humans Apart'. It's that range of 'verification' boxes that pop up when posting on a blog or registering for something online, for example. You might need to type in some letters by deciphering a mangled or warped picture, or do some sums, or something else that proves you're a real person using the site, and not a computer being used by spammers to abuse online services. You're doing something that a computer can't do.

Some bright lads over at the Computer Science Department at Carnegie Mellon University (PA, USA) came up with a clever tool, called reCAPTCHA, to help solve this problem of old OCR-ed texts.


This is a variety of CAPTCHA entry box, where one of the two 'distorted' words presented for you to retype is a word from actual scanned texts, which was unrecognisable by OCR. The other word is a 'control' word, to assess how accurate you are when typing in entries. Readings for these distorted words need to be agreed on by several users before they are cleared as being accurate.
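For the technically curious, here is a rough sketch in Python of how that 'control word plus agreement' idea might work. It is only an illustration, not the real reCAPTCHA code: the class name `WordVoting`, the word ID `scan-042`, and the rule that three matching answers are enough are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Assumed for illustration: how many matching answers clear a word.
AGREEMENT_THRESHOLD = 3


class WordVoting:
    """Hypothetical tally of human readings for words OCR could not read."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # unknown word ID -> reading -> count
        self.accepted = {}                 # unknown word ID -> agreed reading

    def submit(self, control_answer, control_typed, unknown_id, unknown_typed):
        """Record one user's entry; return True if they passed the check."""
        # The control word has a known answer. Getting it wrong means the
        # user fails, and their reading of the unknown word is ignored.
        if control_typed.strip().lower() != control_answer.strip().lower():
            return False

        # Once a word has been cleared, further votes are not needed.
        if unknown_id not in self.accepted:
            reading = unknown_typed.strip().lower()
            self.votes[unknown_id][reading] += 1
            # Clear the word once enough users agree on the same reading.
            if self.votes[unknown_id][reading] >= self.threshold:
                self.accepted[unknown_id] = reading
        return True


if __name__ == "__main__":
    tally = WordVoting()
    for typed in ["distinguished", "distingushed", "distinguished", "distinguished"]:
        tally.submit("society", "society", "scan-042", typed)
    print(tally.accepted)
```

The point of the control word is that the site only trusts your reading of the mystery word if you got the known word right, and a single typo never wins because several people have to type the same thing.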

In their Science paper, reCAPTCHA: Human-Based Character Recognition via Web Security Measures (which you can download from this page), the scientists say that reCAPTCHA was used by more than 40,000 websites in 2008 (and presumably many more by now), and is proving to be highly effective and accurate.

You can read more about reCAPTCHA on Google and Wikipedia.


So whenever you enter text into a word verification box with this logo on it, you're helping to correct the vast quantities of old scanned texts, for all! I think it's so bloody clever; we're all helping to improve information and assisting in a massive, complex process. And every one of us is contributing, whether we know it or not!

Images from http://www.google.com/recaptcha/
