The Google juggernaut rolls on. Its latest trick is reading and understanding scanned PDF documents. While Google has long been able to read and index PDF documents created with a text layer, this new trick adds OCR so that it can read, parse, and index scanned text in a PDF too. Impressive.
As announced on the Official Google Blog, the company is now performing optical character recognition (OCR) on documents that it indexes and identifies as scanned PDFs. Google has indexed documents that were saved as text-based PDFs for quite some time. But many documents wind up being made into PDFs through scans, which store the text as images. Google has now decided that its open-source OCRopus technology, based on software called “Tesseract” that HP developed, is up to the task of indexing scanned documents that can contain any mixture of text, images, and coffee stains.