11-05-2011, 07:05 AM | #1 |
how YOU doin?
Posts: 1,100
Karma: 7371047
Join Date: Feb 2009
Location: India
Device: Kindle Keyboard, iPad Pro 10.5”, Kobo Aura H2O, Kobo Libra 2
|
reCAPTCHA - Did You Know?
How many times have you been waiting to download a file, and come across a 'toll booth' in the form of a reCAPTCHA window?
What you probably didn't realise, as you keyed in the letters with one hand, was that you were actually a tiny component in the grand design to OCR old texts for archival purposes. Every now and then, automated OCR scripts come across words on pages that are just indecipherable for various reasons. These words end up in those tiny reCAPTCHA windows, waiting for the ultimate OCR machine (you!) to transcribe them! File hosting sites and sign-up pages are confident that spambots can't get through that barrier because these are words that stumped automated text recognition programs to begin with!

From Wikipedia:

reCAPTCHA is a system originally developed at Carnegie Mellon University that uses CAPTCHA to help digitize the text of books while protecting websites from bots attempting to access restricted areas.[1] On September 16, 2009, Google acquired reCAPTCHA.[2] reCAPTCHA is currently digitizing the archives of The New York Times and books from Google Books.[3] Twenty years of The New York Times have been digitized and the project planned to have completed the remaining years by the end of 2010.[4]

reCAPTCHA supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects.

The system is reported to display over 200 million CAPTCHAs every day,[5] and among its subscribers are such popular sites as Facebook, TicketMaster, Twitter, 4chan, CNN.com, and StumbleUpon.[6] Craigslist began using reCAPTCHA in June 2008.[7] The U.S. National Telecommunications and Information Administration also used reCAPTCHA for its digital TV converter box coupon program website as part of the US DTV transition.[8]

NOW YOU KNOW! Last edited by howyoudoin; 11-12-2011 at 01:53 PM. |
11-05-2011, 07:35 AM | #2 |
Gadgetoholic
Posts: 1,467
Karma: 3865860
Join Date: Feb 2011
Location: Sweden
Device: Kobo Libra2, Tolino Vision 6
|
I'm not surprised. Some of them are ridiculously hard to decipher!
I have to do it every time I check out an eBook from the Swedish eLibrary (and I do that a lot!). I haven't had any problems lately, but earlier this year I had to refresh multiple times to be able to make out the letters (or strange symbols that I had no idea how to recreate on my keyboard)! |
11-05-2011, 07:49 AM | #3 |
Addict
Posts: 280
Karma: 2064388
Join Date: Aug 2011
Location: MN, US
Device: Kobo Touch, Asus Eee Pad Slider
|
I had heard this before, but didn't know whether it was true. Neat!
@Asawi - It's certainly possible that some of the characters aren't represented anymore in modern versions of various languages. The English language lost quite a few characters during the periods of middle and early modern English, because they apparently didn't care enough to make their own typefaces for print, and imported them from other European countries instead. Problem is, a lot of the languages in those countries didn't use all of the letters that English did, and therefore their typefaces didn't include them. In some cases this culling of the English alphabet was for the best - wynn, for example, was a completely redundant letter. But in other cases, like with thorn, useful and logical letters were lost. Some of these letters were used up until just a couple hundred years ago (often substituted with other Latin letters in print). So, it is certainly possible that you've run across letters that simply don't exist anymore. And there is indeed no way for you to represent them on your keyboard - at least not without special characters. |
11-05-2011, 08:59 AM | #4 |
mrkrgnao
Posts: 241
Karma: 237248
Join Date: May 2010
Device: PRS650, K3 Wireless, Galaxy S3, iPad 3.
|
Now that is interesting.
4Chan users helping to OCR venerable archived texts as they upload vile images... a pretty amusing irony. I do seem to get truly undecipherable stuff pretty often: upside down Hebrew as an example. I wonder what processes they use to work out which responses are valid and can be relied on. |
11-05-2011, 09:00 AM | #5 |
Nameless Being
|
That is interesting. So all guesses are logged, even those guesses that at the time do not match the answer? Continuing update of the answer to match the most frequent guess?
Well in any case this information does not alter the fact that I find those reCAPTCHA barriers one of the most annoying things to encounter on the web. |
11-05-2011, 09:41 AM | #6 |
Grand Sorcerer
Posts: 12,936
Karma: 76440364
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
Maybe I am confused... How on earth can an answer to a reCAPTCHA help OCR? In order for you to get through the check, you have to type the letters that correspond to the image.
In order for reCAPTCHA to know you are correct, it has to compare your result to its known result. So to my eyes that says that the image has already been decoded at least once. |
11-05-2011, 09:46 AM | #7 | |
how YOU doin?
Posts: 1,100
Karma: 7371047
Join Date: Feb 2009
Location: India
Device: Kindle Keyboard, iPad Pro 10.5”, Kobo Aura H2O, Kobo Libra 2
|
Quote:
|
|
11-05-2011, 10:52 AM | #8 |
Guru
Posts: 895
Karma: 4383958
Join Date: Nov 2007
Device: na
|
You have to enter two words: one a known word and the other an unknown, OCR-based one. You can type anything you want for the unknown one, so long as the known word is entered correctly.
Captchas are annoying, yes, but they do limit spam at least a little, and the spam would be even more annoying. |
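A rough Python sketch of the two-word scheme described above, for anyone wondering how the unknown word could ever get "decoded". Everything in it is an assumption made for illustration: the names `submit` and `consensus`, the `votes` store, the `img-42` id, and the thresholds (at least 3 guesses, 60% agreement) are hypothetical and not taken from the real reCAPTCHA service. The idea is only that the known word decides whether you pass, while guesses for the unknown word are tallied until enough people agree.

```python
from collections import Counter, defaultdict

# Hypothetical store: unknown-word image id -> tally of human guesses.
votes = defaultdict(Counter)

def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()

def submit(control_answer, typed_control, unknown_id, typed_unknown):
    """Return True if the user passes; log the guess for the unknown word."""
    if normalize(typed_control) != normalize(control_answer):
        return False  # failed the known word, so the other guess is discarded
    votes[unknown_id][normalize(typed_unknown)] += 1
    return True

def consensus(unknown_id, min_votes=3, min_share=0.6):
    """Return the agreed transcription, or None if guesses still disagree."""
    tally = votes[unknown_id]
    total = sum(tally.values())
    if total < min_votes:
        return None
    guess, count = tally.most_common(1)[0]
    return guess if count / total >= min_share else None

submit("morning", "Morning", "img-42", "overlooks")
submit("morning", "morning", "img-42", "wot")         # gibberish still passes
submit("morning", "mourning", "img-42", "overlocks")  # rejected: known word wrong
submit("morning", "morning", "img-42", "overlooks")
submit("morning", "morning", "img-42", "overlooks")
print(consensus("img-42"))  # prints "overlooks"
```

Under these assumed thresholds a lone junk answer is simply outvoted, and a word that attracts mostly junk never reaches agreement and would stay unresolved.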
11-05-2011, 03:21 PM | #9 |
Wizard
Posts: 3,144
Karma: 8426142
Join Date: Jun 2008
Location: Chicago, IL
Device: Kindle PW2, Kindle Voyage, Kindle DXG, Boox M90, Kobo Aura HD
|
I read this years ago. I think about it every time I do one of them.
|
11-05-2011, 03:47 PM | #10 |
Spork Connoisseur
Posts: 2,355
Karma: 16780603
Join Date: Mar 2011
Device: Nook Color
|
I think it's pretty cool. It'd be nice to see the fruits of the "labor" though. Maybe some kind of report that shows what works have been completed with the use of this method.
|
11-05-2011, 04:38 PM | #11 |
Fanatic
Posts: 532
Karma: 3293888
Join Date: Oct 2011
Location: Virginia
Device: Nook Simple Touch
|
Captcha is what finally convinced me that I'm probably not human.
|
11-05-2011, 08:17 PM | #12 |
Wizard
Posts: 3,117
Karma: 9269999
Join Date: Feb 2011
Location: UK
Device: Sony- T3, PRS650, 350, T1/2/3, Paperwhite, Fire 8.9,Samsung Tab S 10.5
|
I kept trying new glasses 'til I worked out it wasn't necessarily my fault I didn't know what I was looking at...
p'raps it's Esperanto ? |
11-06-2011, 12:59 AM | #13 | |
Wizard
Posts: 1,358
Karma: 5766642
Join Date: Aug 2010
Device: Nook
|
Quote:
What's less well known is that some spam operations operate "free" porn sites for the specific purpose of getting people to decipher the OCR stuff, which they then use to sign up for free email accounts (having gotten the image from the sign up page for the email). This allows them to completely automate the process of getting free email accounts to spam from. The sad thing is, these are the less slimy spammers. |
|
11-06-2011, 12:07 PM | #14 |
The Forgotten
Posts: 1,136
Karma: 4689999
Join Date: Apr 2010
Location: Dubai
Device: Kindle Paperwhite; Nook HD; Sony Xperia Z3 Compact
|
Very interesting.
Like many others, I find the reCAPTCHA extremely annoying. Quite fascinating to find out what its real purpose is. |
11-07-2011, 07:50 AM | #15 | |
mrkrgnao
Posts: 241
Karma: 237248
Join Date: May 2010
Device: PRS650, K3 Wireless, Galaxy S3, iPad 3.
|
Quote:
Only problem is, a lot of people know this, so just type 'wot' or some gibberish for every one of the non-recognised answers to speed things up. |
|
|