Metadata-Version: 1.0
Name: Products.PDFtoOCR
Version: 1.0dev
Summary: PDFtoOCR does OCR processing on PDF documents. The text from OCR is used in the search results.
Home-page: http://svn.plone.org/svn/collective/
Author: Plone Collective
Author-email: product-developers@lists.plone.org
License: GPL
Description: Introduction
        ============
        The PDFtoOCR product processes text in PDF documents using OCR. This is needed
        when text cannot be extracted from a (scanned) PDF. PDFtoOCR uses content rules to
        schedule the OCR processing. The processing cannot be done on the fly, for
        example with a custom TextIndexNG plugin, because processing large PDF documents
        using OCR is a time-consuming task.
        
        
        Configuration
        =============
        
        On the operating system
        -----------------------
        PDFtoOCR uses three tools that are available under Linux. Working with these
        tools has only been tested on Debian, but it will probably work in other
        Unix-like environments.
        
        
        Install the requirements. PDFtoOCR uses the following programs:
        
        - pdftotext, checks whether OCR processing is necessary
        - ghostscript, converts the PDF documents to TIFF images
        - tesseract, does the OCR processing (make sure you've got all language packs!*)
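        
        Before configuring the Plone site it can help to confirm that the three programs
        are installed. The following is a minimal sketch, not part of the product. It
        assumes a modern Python and assumes the executables are named ``pdftotext``,
        ``gs`` and ``tesseract``; the helper name ``missing_tools`` is hypothetical.
        
        ```python
        import shutil
        
        # Executable names are assumptions: Ghostscript is usually installed as "gs".
        REQUIRED_TOOLS = ("pdftotext", "gs", "tesseract")
        
        
        def missing_tools(names=REQUIRED_TOOLS):
            """Return the names that cannot be found on the current PATH."""
            return [name for name in names if shutil.which(name) is None]
        
        
        if __name__ == "__main__":
            missing = missing_tools()
            if missing:
                print("Missing tools: " + ", ".join(missing))
            else:
                print("All OCR tools were found.")
        ```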
        
        
        On the Plone site
        -----------------
        
        Add a content rule:
        
        - Event trigger: Object modified
        - Condition: Content type is file
        - Actions: Store OCR output from a PDF in searchable text
        
        Assign the content rule to a Plone site or a folder.
        
        Install cron4plone and add the following cron job: portal/@@do_pdf_ocr_index
        
        
        PDF Processing
        ==============
        
        Each time a file is added or modified, the unique id (UID) of the file is added
        to a queue. This queue is persistent and serves two functions: indexing and
        reindexing. The indexing function processes the documents in the queue. When
        reindexing is used, all files in the queue history are processed.
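        
        The queue behaviour described above can be sketched as follows. This is an
        in-memory illustration only and not the product's code: the class and method
        names are hypothetical, persistence is left out, and skipping duplicate UIDs
        is an assumption.
        
        ```python
        class OcrQueue:
            """In-memory stand-in for the persistent queue (names are hypothetical)."""
        
            def __init__(self):
                self.pending = []  # UIDs waiting to be processed
                self.history = []  # UIDs that have been processed before
        
            def add(self, uid):
                # Called when a file is added or modified.
                # Assumption: a UID that is already waiting is not queued twice.
                if uid not in self.pending:
                    self.pending.append(uid)
        
            def index(self, process):
                # Process every waiting UID and remember it in the history.
                while self.pending:
                    uid = self.pending.pop(0)
                    process(uid)
                    if uid not in self.history:
                        self.history.append(uid)
        
            def reindex(self, process):
                # Process every UID recorded in the history again.
                for uid in list(self.history):
                    process(uid)
        ```
        
        Here ``process`` is any callable that takes a UID, for example a function that
        looks up the file and runs the text extraction on it.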
        
        If pdftotext can extract the text from a PDF document, no OCR is done. Otherwise
        OCR extracts the text and stores it in the File content type. The ATFile is patched
        with an extra field to accommodate the extracted text and the language of the PDF.
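        
        The decision between plain extraction and OCR can be sketched as below. This is
        a hedged illustration, not the product's implementation: all function names are
        hypothetical, and the emptiness check, the Ghostscript device ``tiffgray``, the
        300 dpi resolution and the default language ``eng`` are assumptions.
        
        ```python
        import subprocess
        import tempfile
        from pathlib import Path
        
        
        def needs_ocr(extracted_text):
            """Return True when pdftotext found no usable text (assumed criterion)."""
            return not extracted_text.strip()
        
        
        def pdftotext_command(pdf_path):
            # "-" asks pdftotext to write the text to standard output.
            return ["pdftotext", str(pdf_path), "-"]
        
        
        def ghostscript_command(pdf_path, out_pattern, dpi=300):
            # Renders each page to a TIFF image; device and resolution are assumptions.
            return [
                "gs", "-q", "-dNOPAUSE", "-dBATCH",
                "-sDEVICE=tiffgray", "-r%d" % dpi,
                "-sOutputFile=%s" % out_pattern, str(pdf_path),
            ]
        
        
        def tesseract_command(image_path, output_base, language="eng"):
            # tesseract writes its result to <output_base>.txt.
            return ["tesseract", str(image_path), str(output_base), "-l", language]
        
        
        def extract_text(pdf_path, language="eng"):
            """Try pdftotext first and fall back to OCR only when it finds nothing."""
            result = subprocess.run(
                pdftotext_command(pdf_path), capture_output=True, text=True, check=True
            )
            if not needs_ocr(result.stdout):
                return result.stdout
            with tempfile.TemporaryDirectory() as workdir:
                work = Path(workdir)
                subprocess.run(
                    ghostscript_command(pdf_path, work / "page-%03d.tif"), check=True
                )
                pages = []
                for image in sorted(work.glob("page-*.tif")):
                    base = image.with_suffix("")
                    subprocess.run(tesseract_command(image, base, language), check=True)
                    pages.append(Path(str(base) + ".txt").read_text(encoding="utf-8"))
                return "\n".join(pages)
        ```
        
        Building the command lines in separate functions keeps them easy to inspect
        without having the tools installed; only ``extract_text`` runs them.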
        
        Page views:
        
        - @@do_pdf_ocr_index - indexes documents in the queue
        - @@do_pdf_ocr_reindex - reindexes all PDF documents in the Plone site
        - @@pdf_ocr_status - shows the queue and a history of 10 documents
        
        
        Further reading
        ===============
        
        http://plone.org/documentation/how-to/ocr-in-plone-using-tesseract-ocr/
        http://code.google.com/p/tesseract-ocr/
        
        * Make sure you don't have empty language files in /usr/local/share/tessdata/
        
        OCRopus may be a good alternative in the future. It uses tesseract, but it is
        hard to set up and still too much in beta:
        http://sites.google.com/site/ocropus/
        
        
        
        Changelog
        =========
        
        1.0 - Unreleased
        ----------------
        
        * Initial release
        
        
Keywords: web zope plone theme
Platform: UNKNOWN
Classifier: Framework :: Plone
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development :: Libraries :: Python Modules
