Metadata-Version: 2.1
Name: TExtractor
Version: 0.1.2
Summary: Extract text content from many filetypes.
Home-page: http://bitbucket.org/whitie/textractor-py3/
Author: Thorsten Weimann
Author-email: weimann.th@yahoo.com
License: MIT
Keywords: text extract pdf docx
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Description-Content-Type: text/x-rst
Requires-Dist: pdfminer.six
Requires-Dist: pluginbase
Requires-Dist: chardet

TExtractor
==========

TExtractor extracts plain text from many office file types (for example PDF
and DOCX) in pure Python. It needs only three external libraries, all pure
Python: ``pdfminer.six``, ``pluginbase`` and ``chardet``. Extraction returns a
list of words with the most common stop words removed. Stop-word lists are
available for English (``en``) and German (``de``) only.
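
The stop-word step can be pictured roughly as follows. This is only a minimal
sketch and not TExtractor's actual implementation: the ``STOP_WORDS`` table,
its tiny word sets and the ``strip_stop_words`` helper are hypothetical names
made up for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word sets for illustration only;
# these are NOT the lists that TExtractor itself uses.
STOP_WORDS = {
    "en": {"the", "a", "an", "and", "of", "to", "in", "is"},
    "de": {"der", "die", "das", "und", "ein", "eine", "ist", "in"},
}


def strip_stop_words(text, lang="en"):
    """Split ``text`` into words and drop the stop words for ``lang``."""
    stop = STOP_WORDS[lang]
    # \w+ matches runs of letters, digits and underscores, so a token such
    # as "workflow_history" stays in one piece.
    return [word for word in re.findall(r"\w+", text)
            if word.lower() not in stop]
```

Calling ``strip_stop_words("The history of the workflow")`` with these sets
keeps only ``history`` and ``workflow``.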

Install with: ``pip install TExtractor``

Usage::

    >>> from textractor import TExtractor
    >>> extractor = TExtractor()
    >>> extractor.index('test.docx', lang='en')
    ['workflow_history', 'portal_workflow', 'review_history',
     'implementation', 'organizations', 'Illustrations', ...]
    >>> extractor.index('test.pdf', lang='en')
    ['workflow_history', 'portal_workflow', 'review_history',
     'implementation', 'organizations', 'Illustrations', ...]
    >>>
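
To index several files in one go, a small wrapper along these lines may be
handy. It is a sketch under assumptions: ``index_files`` is a hypothetical
helper, not part of the package, and it relies only on the ``index(path,
lang=...)`` call shown in the usage example above.

```python
def index_files(extractor, paths, lang="en"):
    """Map each path to the word list returned by ``extractor.index``.

    ``extractor`` is expected to behave like the ``TExtractor`` instance
    from the usage example; this helper is hypothetical and not shipped
    with the package.
    """
    results = {}
    for path in paths:
        # Assumption: ``index`` accepts a file path and a ``lang`` keyword,
        # exactly as in the usage example.
        results[path] = extractor.index(path, lang=lang)
    return results


# Possible usage, assuming TExtractor is installed and the files exist:
#     from textractor import TExtractor
#     words = index_files(TExtractor(), ['test.docx', 'test.pdf'])
```

Passing the extractor in as an argument keeps the helper independent of how
the ``TExtractor`` object is constructed.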



