Many of you may already know this, but I thought it was really cool. Did you know that Captchas are being used to digitize books?
Here are some excerpts:
Today it has become the principal method used by Google to authenticate text in Google Books, its vast project to digitize and disseminate rare and out-of-print texts on the Internet.
Digitization is normally a three-stage process: create a photographic image of the text, also known as a bitmap; encode the text in a compact, easily handled and searchable form using optical character recognition software, commonly called O.C.R.; and, finally, correct the mistakes.
Today’s technology makes the first two steps relatively straightforward. The third, however, can be extremely difficult. For vintage 19th-century texts in English, O.C.R. programs mess up or miss 10 percent to 30 percent of the words. Only humans can fix the errors.
Dr. von Ahn’s group estimated that humans around the world decode at least 200 million Captchas per day, at 10 seconds per Captcha. This works out to about 500,000 hours per day — a lot of applied brainpower being spent on what Dr. von Ahn regards as a fundamentally mindless exercise.
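For anyone who wants to check the arithmetic, here is a quick sketch in Python. The constant and function names are my own, and I am assuming a 365-day year. Note that 200 million Captchas at 10 seconds each comes out a little above the rounded "about 500,000 hours" in the excerpt, and the 57-year figure mentioned elsewhere in this post follows from the rounded number.

```python
# Figures taken from the excerpt above.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10


def hours_per_day(captchas=CAPTCHAS_PER_DAY, seconds_each=SECONDS_PER_CAPTCHA):
    """Total human time spent on Captchas each day, in hours."""
    return captchas * seconds_each / 3600


def years_from_hours(hours):
    """Convert hours to years, assuming 365-day years (my assumption)."""
    return hours / (24 * 365)


exact_hours = hours_per_day()
print(f"Exact: {exact_hours:,.0f} hours per day")
print(f"Exact: {years_from_hours(exact_hours):.1f} years of effort per day")
print(f"Rounded to 500,000 hours: {years_from_hours(500_000):.1f} years per day")
```

So the quoted 500,000 hours is a conservative rounding, and 57 years per day is what you get from that rounded figure.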
Each suspicious word is turned into a Captcha. It is crucial to understand that the Captcha is a distorted version of the word as printed in the original photographic image. It is not made from the O.C.R.’s imagined translation, which is often unintelligible. The unknown word is then paired with a second Captcha word whose correct translation is already known. This is the “control.”
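To make the control-word idea concrete, here is a minimal sketch in Python. This is not reCaptcha's actual code: the function names are hypothetical, and the vote tally with a `min_votes` threshold is my own assumption about how guesses for the unknown word might be combined, since the excerpt only describes the pairing with a known control.

```python
from collections import Counter


def normalize(word):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return word.strip().lower()


def record_response(control_truth, control_answer, unknown_answer, votes):
    """Check the control word; if it is right, count the guess for the unknown word.

    Returns True when the user passes the Captcha.
    """
    if normalize(control_answer) != normalize(control_truth):
        return False  # failed the control, so the other guess is not trusted
    votes[normalize(unknown_answer)] += 1
    return True


def accepted_reading(votes, min_votes=3):
    """Return the leading guess once it has enough votes (threshold is assumed)."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


votes = Counter()
record_response("morning", "Morning", "upon", votes)   # passes, votes for "upon"
record_response("morning", "mourning", "apon", votes)  # fails the control, ignored
record_response("morning", "morning", "upon", votes)
record_response("morning", "morning ", "upon", votes)
print(accepted_reading(votes))
```

The point of the control is that the system never needs to know the unknown word in advance: it only trusts a guess from someone who got the known word right.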
With all these constraints, reCaptcha nevertheless achieves an accuracy rate above 99 percent, which compares favorably with professional human transcribers.
So... props to each of you who has been downloading music from file-sharing sites without a premium account. You're digitizing books at a rate of 57 years of human effort per day.