Google Does OCR

The Google juggernaut rolls on. Its latest trick is reading and understanding scanned PDF documents. Google could always read and index PDF documents created with a text layer, but this new trick uses OCR to read, parse and index scanned text in a PDF too. Impressive.

Via Ars Technica:

As announced on the Official Google Blog, the company is now performing optical character recognition (OCR) on documents that it indexes and identifies as scanned as PDFs. Google has indexed documents that were saved as text-based PDFs for quite some time. But many documents wind up being made into PDFs through scans, which store the text as images. Google has now decided that its open-source OCRopus technology, based on software called “Tesseract” that HP developed, is up to the task of indexing scanned documents that can contain any mixture of text, images, and coffee stains.

DISCLAIMER: Everything here is the personal opinion of the authors and is not read or approved by Pratham Books before it is posted. No warranties or other guarantees are offered as to the quality of the opinions or anything else offered here.