Metadata-Version: 2.1
Name: textdirectory
Version: 0.3.1.1
Summary: TextDirectory allows you to combine multiple text files into one.
Home-page: https://github.com/IngoKl/textdirectory
Author: Ingo Kleiber
Author-email: ingo@kleiber.me
License: MIT license
Description: =============
        TextDirectory
        =============
        
        .. image:: https://img.shields.io/pypi/v/textdirectory.svg
                :target: https://pypi.python.org/pypi/textdirectory
        
        .. image:: https://img.shields.io/travis/IngoKl/textdirectory.svg
                :target: https://travis-ci.org/IngoKl/textdirectory
        
        .. image:: https://readthedocs.org/projects/textdirectory/badge/?version=latest
                :target: https://textdirectory.readthedocs.io/en/latest/?badge=latest
                :alt: Documentation Status
        
        |
        |
        
        .. image:: https://user-images.githubusercontent.com/16179317/39367680-cd409a00-4a37-11e8-8d42-0bed5a4e814b.png
                :alt: TextDirectory
        
        *TextDirectory* allows you to combine multiple text files into one aggregated file. TextDirectory also supports matching
        files for certain criteria and applying transformations to the aggregated text.
        
        *TextDirectory* can be used as a mere tool (via the CLI) and as a Python library.
        
        Of course, everything *TextDirectory* does could be achieved in bash or PowerShell. However, there are certain
        use-cases (e.g. when used as a library) in which it might be useful.
        
        
        * Free software: MIT license
        * Documentation: https://textdirectory.readthedocs.io.
        
        Features
        ========
        * Aggregating multiple text files
        * Filtering documents/texts based on various parameters such as length, content, and random sampling
        * Transforming the aggregated text (e.g. transforming the text to lowercase)
        
        .. csv-table::
           :header: "Version", "Filters", "Transformations"
           :widths: 10, 30, 30
        
           0.1.0, filter_by_max_chars(n int); filter_by_min_chars(n int); filter_by_max_tokens(n int); filter_by_min_tokens(n int); filter_by_contains(str); filter_by_not_contains(str); filter_by_random_sampling(n int; replace=False), transformation_lowercase
           0.1.1, filter_by_chars_outliers(n sigmas int), transformation_remove_nl
           0.1.2, filter_by_filename_contains(str), transformation_usas_en_semtag; transformation_uppercase; transformation_postag(spacy_model str)
           0.1.3, filter_by_similar_documents(reference_file str; threshold float), transformation_remove_non_ascii; transformation_remove_non_alphanumerical
           0.2.0, filter_by_max_filesize(max_kb int); filter_by_min_filesize(min_kb int), transformation_to_leetspeak; transformation_crude_spellchecker(language model str)
           0.2.1, None, transformation_remove_stopwords(stopwords_source str; stopwords str [en]; spacy_model str; custom_stopwords str); transformation_remove_htmltags
           0.3.0, None, transformation_remove_weird_tokens(spaCy model; remove_double_space=False); transformation_lemmatizer(spaCy model)
        
        Quickstart
        ==========
        Install *TextDirectory* via pip: ``pip install textdirectory``
        
        *TextDirectory*, as exemplified below, works with a two-stage model. After loading in your data (directory) you can iteratively select the files you want to process. In a second step you can perform transformations on the text before finally aggregating it.
        
        .. image:: https://user-images.githubusercontent.com/16179317/39367589-7f774116-4a37-11e8-9a09-5cbdf5f3311b.png
                :alt: TextDirectory
        
        As a Command-Line Tool
        ~~~~~~~~~~~~~~~~~~~~~~
        *TextDirectory* comes equipped with a CLI.
        
        The syntax for both the *filters* and *transformations* works similarly. They are chained by adding slashes (/) and
        parameters are passed via commas (,): ``filter_by_min_tokens,5/filter_by_random_sampling,2``.
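        
        To illustrate the syntax only, a chain string could be split into steps as sketched here. This is not the parser *TextDirectory* itself uses; the helper name ``parse_chain`` is hypothetical, and parameters are left as plain strings:
        
        .. code:: python
        
            def parse_chain(chain):
                """Split 'name,arg/name,arg' into a list of (name, [args]) tuples."""
                steps = []
                for step in chain.split('/'):
                    if not step:
                        continue  # ignore empty segments such as a trailing slash
                    name, *args = step.split(',')
                    steps.append((name, args))
                return steps
        
            steps = parse_chain('filter_by_min_tokens,5/filter_by_random_sampling,2')
            for name, args in steps:
                print(name, args)
        
        Each step consists of a filter or transformation name followed by its parameters in order.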
        
        **Example 1: A Very Simple Aggregation**
        
        ``textdirectory --directory testdata --output_file aggregated.txt``
        
        This takes all ``.txt`` files in *testdata* and aggregates them into a file called *aggregated.txt*.
        
        **Example 2: Applying Filters and Transformations**
        
        In this example we want to filter the files based on their token count, perform a random sampling and finally transform all text to lowercase.
        
        ``textdirectory --directory testdata --output_file aggregated.txt --filters filter_by_min_tokens,5/filter_by_random_sampling,2 --transformations transformation_lowercase``
        
        After passing two filters (*filter_by_min_tokens* and *filter_by_random_sampling*) we've applied the *transformation_lowercase* transformation.
        
        The resulting file will contain the content of two files that each have at least five tokens.
        
        As a Python Library
        ~~~~~~~~~~~~~~~~~~~
        In order to demonstrate *TextDirectory* as a Python library, we'll recreate the second example from above:
        
        .. code:: python
        
            import textdirectory
            td = textdirectory.TextDirectory(directory='testdata')
            td.load_files(recursive=False, filetype='txt', sort=True)
            td.filter_by_min_tokens(5)
            td.filter_by_random_sampling(2)
            td.stage_transformation(['transformation_lowercase'])
            td.aggregate_to_file('aggregated.txt')
        
        If we wanted to keep working with the actual aggregated text, we could have called ``text = td.aggregate_to_memory()``.
        
        Every applied filter will create a *state* (i.e. a checkpoint). If we want to go back to a previous state, we can print
        all states by calling ``td.print_saved_states()``. Previous states can then be loaded by
        calling ``td.load_aggregation_state(state=0)``.
        
        
        It's also possible to pass arguments to the individual transformations. At the moment, arguments are passed by position, so you have to supply them in the order given by the transformation's signature.
        
        .. code:: python
        
            # def transformation_remove_stopwords(text, stopwords_source='internal', stopwords='en', spacy_model='en_core_web_sm', custom_stopwords=None, *args)
            td.stage_transformation(['transformation_remove_stopwords', 'internal', 'en', 'en_core_web_sm', 'dolor'])
        
        In the above example, we are passing ``dolor`` as an additional custom stopword to the transformation.
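        
        Why the order matters can be sketched with a stand-in function. The function ``stopwords_stub`` and the helper ``bind_staged`` are hypothetical: the stub only copies the parameter order from the signature quoted above, and we assume that the staged values are handed over by position after the text. This is not the library's implementation.
        
        .. code:: python
        
            import inspect
        
            def stopwords_stub(text, stopwords_source='internal', stopwords='en',
                               spacy_model='en_core_web_sm', custom_stopwords=None, *args):
                """Hypothetical stand-in that only mirrors the parameter order."""
                return text
        
            def bind_staged(staged):
                """Bind the values after the name by position, following the text argument."""
                _name, *values = staged
                bound = inspect.signature(stopwords_stub).bind('some text', *values)
                bound.apply_defaults()
                return dict(bound.arguments)
        
            full = bind_staged(['transformation_remove_stopwords', 'internal', 'en', 'en_core_web_sm', 'dolor'])
            short = bind_staged(['transformation_remove_stopwords', 'dolor'])
        
            print(full['custom_stopwords'])   # dolor
            print(short['stopwords_source'])  # dolor
        
        In the second call the earlier arguments are left out, so ``dolor`` ends up as ``stopwords_source`` instead of ``custom_stopwords``.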
        
        Notes for Developers
        ====================
        If you want to run tests, please use ``python setup.py test``.
        
        To-Do
        =======
        * Increasing test coverage
        * Writing better documentation
        * Adding better error handling (raw exceptions are, well ...)
        * Adding logging
        * Better handling of non-unicode files (e.g. by detecting and reporting the encoding)
        * Contemplating whether it makes sense to stage filters similarly to transformations
        * Allowing users to pass keyword arguments to transformers
        * Implementing autodoc (via Sphinx)
        
        Behavior
        =========
        *TextDirectory* does not hold the actual texts in memory; files are read from disk when they are needed. This
        causes more disk reads and makes processing slower, but it saves memory.
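        
        The trade-off can be sketched as follows. This is a minimal illustration under our own assumptions, not the library's code: only the file paths are kept, and each file is read at the moment it is aggregated. The function name ``aggregate`` and the sample files are hypothetical.
        
        .. code:: python
        
            import tempfile
            from pathlib import Path
        
            def aggregate(paths, transform=str.lower):
                """Read each file only when it is needed and yield its transformed text."""
                for path in paths:
                    yield transform(Path(path).read_text(encoding='utf-8'))
        
            with tempfile.TemporaryDirectory() as tmp:
                # Create two small sample files to aggregate.
                for name, content in [('a.txt', 'Lorem IPSUM'), ('b.txt', 'Dolor SIT')]:
                    Path(tmp, name).write_text(content, encoding='utf-8')
        
                # Only the paths are stored; the texts are read during the join.
                paths = sorted(Path(tmp).glob('*.txt'))
                aggregated = '\n'.join(aggregate(paths))
        
            print(aggregated)
        
        Running the aggregation again would read every file again, which is the additional disk activity mentioned above.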
        
        ``transformation_usas_en_semtag`` relies on the web version of `Paul Rayson's USAS Tagger
        <http://ucrel.lancs.ac.uk/usas/>`_. Don't use this transformation for large amounts of text, give credit, and
        consider using their commercial product `Wmatrix <http://ucrel.lancs.ac.uk/wmatrix/>`_.
        
        Credits
        =======
        This package is based on the `audreyr/cookiecutter-pypackage`_ cookiecutter template. The *crude spellchecker*
        (transformation) is implemented following Peter Norvig's excellent `tutorial`_.
        
        .. _Cookiecutter: https://github.com/audreyr/cookiecutter
        .. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage
        .. _`tutorial`: http://norvig.com/spell-correct.html
        
        
        =======
        History
        =======
        
        
        0.1.0 (2018-04-26)
        ==================
        
        * Initial release
        * First release on PyPI.
        
        0.1.1 (2018-04-27)
        ==================
        
        * added filter_by_chars_outliers
        * added transformation_remove_nl
        
        0.1.2 (2018-04-29)
        ==================
        * added transformation_postag
        * added transformation_usas_en_semtag
        * added transformation_uppercase
        * added filter_by_filename_contains
        * added parameter support for transformations
        
        0.1.3 (2018-04-30)
        ==================
        * filter_by_random_sampling now has a "replacement" option
        * changed from tabulate to an embedded function
        * added transformation_remove_non_ascii
        * added transformation_remove_non_alphanumerical
        * added filter_by_similar_documents
        
        0.1.4 (2018-04-02)
        ==================
        * fixed an object mutation problem in the tabulate function
        
        0.2.0 (2018-05-13)
        ==================
        * added transform_to_memory() function
        * added transformation_to_leetspeak() function
        * added transformation_crude_spellchecker
        * added filter_by_max_filesize
        * added filter_by_min_filesize
        * fixed a bug where load_files() would fail if there were no files
        
        0.2.1 (2019-06-13)
        ==================
        * added transformation_remove_stopwords
        * added transformation_remove_htmltags
        * fixed some minor bugs
        
        0.2.2 (2019-06-13)
        ==================
        * changed the data packaging
        
        0.3.0 (2020-01-19)
        ==================
        * added transformation_remove_weird_tokens
        * added transformation_lemmatizer
        * fixed some minor bugs
        * added a function to revert applied filters
        * added a function that prints the current pipeline
        * added a function that clears all transformations
        * added helper functions to list available filters and transformations
        * fixed a bug in which ``tabulate_flat_list_of_dicts`` would fail if the dictionary was empty
        * ``self.aggregation`` does not hold a copy of the files anymore but references to ``self.files``
        * transformations relying on spaCy are now estimating a max_length based on available memory
        * TextDirectory objects are now iterable
        
        0.3.1 (2020-01-20)
        ==================
        * added long_description_content_type to setup.py
Keywords: textdirectory
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.6
Description-Content-Type: text/x-rst
