
Clever reCAPTCHA




Thank you! Thank you so much for helping to correct thousands, if not millions, of pages of scanned text!

What? You aren't doing anything of the sort? Oh, yes you are ... read on!

Optical character recognition (OCR) is the method used to digitise printed material. Old newspapers and books, for example, can be scanned using OCR technology, and then we can access them online. The National Library of Australia uses this technology for its massive Trove database.

The problem with OCR is that it's not always that accurate, especially from older printed material with yellowing paper and faded or smudged ink. It is a lot better than it used to be, but is still far from 100% accurate — 80% accuracy is more typical. So the resulting scanned texts have a lot of errors!

Here's an example:



We can read the top sentence of scanned type (This aged portion of society were distinguished from ...), because us humans are fucking brilliant, but a computer has a lot of trouble distinguishing the letter patterns. The human eye is just better at seeing those letters!

So, there's a problem. We have millions of pages of inaccurate scanned text, with 20% errors. Which makes it damn difficult to read and use, needless to say.

Now, you'll be familiar with CAPTCHA technology already. CAPTCHA stands for 'Completely Automated Public Turing test to tell Computers and Humans Apart'. It's that range of 'verification' boxes that pop up when you're posting on a blog or registering for something online, for example. You might need to type in some letters after deciphering a mangled or warped picture, or do some sums, or do something else which proves that you're a real person using the site, and not a computer being used by spammers to abuse online services. You're doing something that a computer can't do.

Some bright lads over at the Computer Science Department at Carnegie Mellon University (PA, USA) came up with a clever tool to help solve this problem of old OCR-ed texts, called reCAPTCHA.


This is a variety of CAPTCHA entry box, where one of the two 'distorted' words presented for you to retype is a word from actual scanned texts, which was unrecognisable by OCR. The other word is a 'control' word, to assess how accurate you are when typing in entries. Readings for these distorted words need to be agreed on by several users before they are cleared as being accurate.
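
To make that idea a bit more concrete, here's a rough sketch in Python of the 'control word plus agreement' logic described above. It's only an illustration, not the real reCAPTCHA code: the function names, the example words and the threshold of three matching readings are all my own assumptions.

```python
from collections import Counter

# Hypothetical values for illustration only; not taken from reCAPTCHA itself.
CONTROL_ANSWER = "morning"  # the word whose correct reading is already known
MIN_VOTES = 3               # assumed number of matching readings needed


def record_submission(votes, control_typed, unknown_typed):
    """Count a reading of the unknown word only if the control word was right."""
    if control_typed.strip().lower() != CONTROL_ANSWER:
        return False  # control word failed, so this reading is not trusted
    votes[unknown_typed.strip().lower()] += 1
    return True


def accepted_reading(votes):
    """Return the most common reading once enough users agree, else None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= MIN_VOTES else None


# Four users each type the control word and the unknown scanned word.
votes = Counter()
submissions = [
    ("morning", "upon"),
    ("mornin", "upom"),   # control word mistyped, so this one is ignored
    ("morning", "upon"),
    ("morning", "upon"),
]
for control_typed, unknown_typed in submissions:
    record_submission(votes, control_typed, unknown_typed)

print(accepted_reading(votes))
```

The control word acts as the gatekeeper here: a reading of the unknown word only counts if the same person got the known word right, and it is only accepted once enough people have typed the same thing.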

In their Science paper reCAPTCHA: Human-Based Character Recognition via Web Security Measures (which you can download from this page), the scientists say that reCAPTCHA was used by more than 40,000 websites in 2008 (and presumably many more by now), and was proving to be highly effective and accurate.

You can read more about reCAPTCHA on Google and Wikipedia.


So whenever you enter text into a word verification box with this logo on it, you're helping to correct the vast quantities of old scanned texts, for all! I think it's so bloody clever; we're all helping to improve information and assisting in a massive complex process. And every one of us is contributing, whether we know it or not!

Images from http://www.google.com/recaptcha/

