Metadata-Version: 2.1
Name: Lunas
Version: 0.1.6
Summary: A data processing pipeline and iterator with minimal dependencies for machine learning.
Home-page: https://github.com/pluiez/lunas
Author: Seann Zhang
Author-email: pluiefox@live.com
License: LICENSE.txt
Description: # Lunas
        
        [![PyPI version](https://img.shields.io/badge/pypi-v0.1.6-limegreen.svg)](https://github.com/pluiez/lunas)
        
        **Lunas** is a Python 3-based library that provides a set of simple interfaces for data processing pipelines and an iterator for looping through data.
        
        Lunas borrows its data-handling style from *TensorFlow* and *PyTorch*, and some implementation details from *AllenNLP*.
        
        ## Features
        
        `Reader` A reader defines a dataset and corresponding preprocessing and filtering rules. Currently the following features are supported:
        
        1. Buffered reading.
        2. Buffered shuffling.
        3. Chained processing and filtering interface.
        4. Parallel preprocessing and filtering of the data buffer.
        5. Support for multiple input sources.
        6. Persistable.
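        
        The buffered shuffling listed above can be pictured with a short, library-independent sketch. It is not Lunas's actual implementation; the function `buffered_shuffle` and its parameters `buffer_size` and `seed` are names assumed here for illustration. It fills a fixed-size buffer, shuffles it, and emits its contents, so only `buffer_size` samples are held in memory at once:
        
        ```python
        import random
        from typing import Iterable, Iterator, List, Optional, TypeVar
        
        T = TypeVar("T")
        
        
        def buffered_shuffle(source: Iterable[T], buffer_size: int,
                             seed: Optional[int] = None) -> Iterator[T]:
            """Yield items from `source`, shuffled within chunks of `buffer_size`.
        
            Hypothetical helper for illustration only, not part of Lunas.
            """
            rng = random.Random(seed)
            buffer: List[T] = []
            for item in source:
                buffer.append(item)
                if len(buffer) >= buffer_size:
                    # The buffer is full: shuffle it and emit everything it holds.
                    rng.shuffle(buffer)
                    yield from buffer
                    buffer.clear()
            # Emit whatever is left once the source is exhausted.
            rng.shuffle(buffer)
            yield from buffer
        
        
        shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
        ```
        
        Samples are only shuffled within each buffer, so a larger buffer approximates a full shuffle more closely at the cost of more memory.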
        
        `DataIterator` An iterator performs multi-pass iterations over the dataset and maintains the iteration state:
        
        1. Dynamic batch size at runtime.
        2. Custom stopping criteria.
        3. Sorting the samples in a batch, which is useful for learning text representations with RNNs in *PyTorch*.
        4. Persistable.
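        
        One plausible reading of a dynamic batch size is that a batch is closed when the accumulated size of its samples would exceed a budget, rather than after a fixed number of samples. The sketch below illustrates that idea in plain Python. It assumes this interpretation and does not reproduce Lunas's own batching logic; `batch_by_budget`, `budget` and `size_fn` are hypothetical names:
        
        ```python
        from typing import Callable, Iterable, Iterator, List, TypeVar
        
        T = TypeVar("T")
        
        
        def batch_by_budget(samples: Iterable[T], budget: int,
                            size_fn: Callable[[T], int]) -> Iterator[List[T]]:
            """Group samples so the summed size of each batch stays within `budget`.
        
            Hypothetical helper for illustration only, not part of Lunas.
            A single sample larger than `budget` still forms its own batch.
            """
            batch: List[T] = []
            total = 0
            for sample in samples:
                size = size_fn(sample)
                # Close the current batch if adding this sample would exceed the budget.
                if batch and total + size > budget:
                    yield batch
                    batch, total = [], 0
                batch.append(sample)
                total += size
            if batch:
                yield batch
        
        
        # Toy "sentences" of 3, 5, 2, 6 and 1 tokens, batched with a budget of 8 tokens.
        sentences = [["tok"] * n for n in (3, 5, 2, 6, 1)]
        batches = list(batch_by_budget(sentences, budget=8, size_fn=len))
        ```
        
        With this scheme the number of samples per batch varies with their sizes, which keeps the amount of work per batch roughly constant for variable-length inputs such as sentences.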
        
        *Persistable* gives a class a *PyTorch*-like interface for dumping and loading instance state, which is useful when the training process is interrupted unexpectedly.
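        
        The dump-and-load pattern can be shown with a toy class. This is only a sketch of the general `state_dict()` / `load_state_dict()` idea; the `Counter` class, its `advance` method and the state fields are assumptions made for this example and do not mirror Lunas's internals:
        
        ```python
        import pickle
        
        
        class Counter:
            """Hypothetical stand-in for an object with resumable iteration state."""
        
            def __init__(self):
                self.epoch = 0
                self.step = 0
        
            def advance(self, steps_per_epoch=3):
                # Count one step and start a new epoch every `steps_per_epoch` steps.
                self.step += 1
                if self.step % steps_per_epoch == 0:
                    self.epoch += 1
        
            def state_dict(self):
                # Return only plain, picklable values.
                return {"epoch": self.epoch, "step": self.step}
        
            def load_state_dict(self, state):
                self.epoch = state["epoch"]
                self.step = state["step"]
        
        
        original = Counter()
        for _ in range(7):
            original.advance()
        
        # Serialize the state, then restore it into a fresh instance.
        blob = pickle.dumps(original.state_dict())
        restored = Counter()
        restored.load_state_dict(pickle.loads(blob))
        ```
        
        Keeping the state as a plain dictionary means it can be stored with any serializer and inspected without constructing the object first.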
        
        ## Requirements
        
        - Numpy
        - overrides
        - typings
        - Python = 3.x
        
        Lunas relies on very few third-party libraries; the required ones are there just
        to take advantage of the type hints provided by Python 3.
        
        ## Installation
        
        You can simply install Lunas by running pip:
        
        ```
        pip install lunas
        ```
        
        ## Example
        
        *Lunas* exposes minimal interfaces to the user so as to make it as simple as possible. We try to avoid adding any unnecessary features to keep it light-weight.
        
        However, you can extend the library at any time to handle arbitrary data types such as text, images, and audio.
        
        1. Create a dataset reader and iterate through it.
        
           ```python
           from lunas.reader import RangeReader
        
           ds = RangeReader(10)
           for sample in ds:
               print(sample)
           ```
        
           - We create a dataset similar to `range(10)` and iterate through it for one epoch.
        
        2. Build a data processing pipeline.
        
           ```python
           ds = RangeReader(10).select(lambda x: x + 1).select(lambda x: x * 2).where(lambda x: x % 2 == 0)
           ```
        
           - We call `Reader.select(fn)` to define a processing step for the dataset.
           - `select()` returns the dataset itself to allow chained calls. You can apply any transformation to the dataset and return a sample of any type, such as a `Dict`, a `List`, or a custom `Sample`.
           - `where()` accepts a predicate that returns a `bool` and uses it to filter input samples: if it returns `True`, the sample is kept; otherwise it is discarded.
           - Note that the processing is not executed immediately; it is performed when iterating through `ds`.
        
        3. Deal with multiple input sources.
        
           ```python
           from lunas.reader import RangeReader, ZipReader, ShuffleReader
        
           ds1 = RangeReader(10)
           ds2 = RangeReader(10)
           ds = ZipReader(ds1, ds2).select(lambda x: x[0] + x[1])
           ds = ds.shuffle()
           ```
        
           - In the code above, we create two datasets and *zip* them into a `ZipReader`. A `ZipReader` returns a tuple of samples from its internal `readers`.
           - `ds.shuffle()` returns a `ShuffleReader` of the dataset.
        
        4. A practical use case: machine translation.
        
           ```python
           from lunas.reader import ZipReader
           from lunas.readers.text import TextReader
           from lunas.iterator import DataIterator
        
           # Tokenize the input into a list of tokens.
           tokenize = lambda line: line.split()
           # Keep only pairs whose source and target are no longer than 50 tokens.
           limit = lambda src_tgt: max(map(len, src_tgt)) <= 50
           # Map word to id.
           word2id = lambda src_tgt: ...
        
           source = TextReader('train.fr').select(tokenize)
           target = TextReader('train.en').select(tokenize)
           ds = ZipReader(source, target).where(limit)
           ds = ds.shuffle().select(word2id)
        
           # Take the maximum length of the sentence pair as sample_size
           sample_size = lambda x: max(map(len, x))
           # Convert a list of samples to model inputs
           collate_fn = lambda x: ...
           # Sort samples in batch by source text length
           sort_key = lambda x: len(x[0])
        
           it = DataIterator(ds, batch_size=4096, cache_size=40960, sample_size_fn=sample_size, collate_fn=collate_fn, sort_desc_by=sort_key)
        
           # Iterate for at most 100 epochs and 1000000 steps.
           for batch in it.while_true(lambda: it.epoch < 100 and it.step < 1e6):
               print(it.epoch, it.step, it.step_in_epoch, batch)
        
           ```
        
           - This code should be simple enough to understand, even if you are not familiar with machine translation.
        
        5. Save and reload iteration state.
        
           ```python
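           import pickle
           # Save the iteration state (pickle requires binary mode).
           with open('state.pkl', 'wb') as f:
               pickle.dump(it.state_dict(), f)
           # ...
           # Restore the state later to resume iteration.
           with open('state.pkl', 'rb') as f:
               state = pickle.load(f)
           it.load_state_dict(state)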
        
           - `state_dict()` returns a picklable dictionary, which can be loaded by `it.load_state_dict()` to resume the iteration process later.
        
        6. Extend the reader.
        
           - You can refer to the implementation of `TextReader` to customize your own data reader.
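        
        The deferred execution described in step 2 can be mimicked in a few lines of plain Python. This is a conceptual sketch and not how Lunas is implemented; the `LazyPipeline` class is a hypothetical stand-in whose `select()` and `where()` only record operations, which are applied when the object is iterated:
        
        ```python
        class LazyPipeline:
            """Hypothetical illustration of a chained, lazily evaluated pipeline."""
        
            def __init__(self, source):
                self._source = source
                self._ops = []  # Recorded (kind, function) pairs, applied in order.
        
            def select(self, fn):
                self._ops.append(("select", fn))
                return self  # Returning self enables chaining.
        
            def where(self, predicate):
                self._ops.append(("where", predicate))
                return self
        
            def __iter__(self):
                for sample in self._source:
                    keep = True
                    for kind, fn in self._ops:
                        if kind == "select":
                            sample = fn(sample)
                        elif not fn(sample):
                            # A failed predicate discards the sample.
                            keep = False
                            break
                    if keep:
                        yield sample
        
        
        calls = []
        
        
        def add_one(x):
            calls.append(x)  # Record each call to show when the work happens.
            return x + 1
        
        
        ds = LazyPipeline(range(5)).select(add_one).where(lambda x: x % 2 == 0)
        calls_before_iteration = list(calls)  # Still empty: nothing has run yet.
        result = list(ds)  # Iterating triggers the recorded operations.
        ```
        
        Because the operations are stored rather than executed, building a long chain is cheap, and the same pipeline can be iterated again for another pass over the data.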
        
        ## Conclusions
        
        Please feel free to contact me if you have any questions or find any bugs in Lunas.
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3
Description-Content-Type: text/markdown
