Pratham Books

Google Does OCR


The Google juggernaut rolls on. Its latest trick is being able to read and understand scanned PDF documents. While Google could always read and index PDF documents created with a text layer, this new trick adds OCR so that it can read, parse, and index scanned text in a PDF too. Impressive.

Via Ars Technica:

As announced on the Official Google Blog, the company is now performing optical character recognition (OCR) on documents that it indexes and identifies as scanned PDFs. Google has indexed documents that were saved as text-based PDFs for quite some time. But many documents wind up being made into PDFs through scans, which store the text as images. Google has now decided that its open-source OCRopus technology, based on software called “Tesseract” that HP developed, is up to the task of indexing scanned documents that can contain any mixture of text, images, and coffee stains.

