Metadata-Version: 1.1
Name: audiodatasets
Version: 1.0.0
Summary: Pulls and pre-processes major Open Source (non-commercial mostly) datasets for spoken audio
Home-page: https://github.com/mcfletch/audiodatasets
Author: Mike C. Fletcher
Author-email: mcfletch@vrplumber.com
License: MIT license
Description: ==============
        Audio Datasets
        ==============
        
        
        .. image:: https://img.shields.io/pypi/v/audiodatasets.svg
                :target: https://pypi.python.org/pypi/audiodatasets
        
        .. image:: https://img.shields.io/travis/mcfletch/audiodatasets.svg
                :target: https://travis-ci.org/mcfletch/audiodatasets
        
        .. image:: https://readthedocs.org/projects/audiodatasets/badge/?version=latest
                :target: https://audiodatasets.readthedocs.io/en/latest/?badge=latest
                :alt: Documentation Status
        
        .. image:: https://pyup.io/repos/github/mcfletch/audiodatasets/shield.svg
             :target: https://pyup.io/repos/github/mcfletch/audiodatasets/
             :alt: Updates
        
        
        Pulls and pre-processes major Open Source datasets for spoken audio
        
        * Supported Datasets:
        
          * `Librispeech <http://www.openslr.org/resources/12/>`_ (60GB)
          * `TEDLIUM_release2 <http://www.openslr.org/resources/19/>`_ (35GB)
          * `VCTK-Corpus <http://homepages.inf.ed.ac.uk/jyamagis/release/>`_ (11GB)
        
        * This is intended for use on Linux servers and it is expected that you will be using the 
          library to feed a machine learning system (not necessary, but that's sort of the point of 
          collecting these datasets)
        * MIT license for the software, but please note that the datasets themselves are 
          generally for non-commercial use only
        
        Features
        --------
        
        * Downloads common Open Source datasets and performs basic preprocessing on them
        * Provides iterables that produce Numpy arrays from the audio data in common formats
        * Uses `sphfile` to directly accesses sph files instead of needing to convert to `wav` first
        * Uses a single shared location for the datasets intended to be used by multiple projects
        
        Installation/Setup
        ------------------
        
        You need to create the download directory and make it writable by the running user. Preferably
        you will do that via group-based permissions to allow sharing, but we will here show creation
        of a user-specific ownership::
        
            $ mkdir -p /var/datasets
            $ chown user:group /var/datasets
            $ chmod g+rw /var/datasets
        
        if `/var/datasets` doesn't exist, or isn't writable, the downloader will instead populate
        `~/.config/datasets` with the data. You may wish to link that directory to `/var/datasets`
        so that you can use default instantiations of the corpora::
        
            $ ln -s /var/datasets ~/.config/datasets
        
        Note that the downloader expects that you have the following available, this may not
        yet be the case in a docker or minimal OS installation:
        
            * `tar`
            * `wget`
        
        Now you can download the datasets.
        
        .. note::
        
            The datasets are big (100+GB)!
            
            If you are paying for data or are working on a slow connection you will
            likely want to arrange to do this step during a low-rated period or on a 
            separate data connection.
        
        From a command prompt::
        
            $ pip install audiodatasets
            # this will download 100+GB and then unpack it on disk, it will take a while...
            $ audiodatasets-download 
        
        Creating MFCC data-files::
        
            # this will generate Multi-frequency Cepestral Coefficient (MFCC) summaries for the 
            # audio datasets (and download them if that hasn't been done). This isn't necessary
            # if you are doing only raw-audio processing
            $ audiodatasets-preprocess 
        
        Playing some audio::
        
            # this will iterate through playing every utterance that includes 'moon' in the transcript
            $ audiodatasets-search 'moon'
        
        Usage
        -------
        
        Once setup, you likely want to iterate over the data-sets using, for instance, a partition to 
        separate out test/train/validate data. To iterate over the raw audio:
        
        .. code:: python
        
            from audiodatasets.corpora import build_corpora, partition
            import random
        
            def train_valid_test():
                """Create training, validation and tests datasets
                
                returns three iterators yielding (array[10:512],transcript) batches
                """
                utterances = []
                for corpus in build_corpora():
                    utterances.extend( corpus.iter_utterances())
                random.shuffle(utterances)
                train, test,valid = partition( utterances, (3,1,1) )
                def generation( utterances ):
                    while True:
                        offset = random.randint(0,511)
                        for name,transcript,audio_file in utterances:
                            for batch in t.iter_batches( audio_file, batch_size=10, input=512, offset=offset ):
                                yield batch,transcript
                return generation(train),generation(test),generation(valid)
        
        To iterate over the 10ms MFCC preprocessed data, which yields 20 frequency batches per 
        processing window (10ms):
        
        .. code:: python
        
            from audiodatasets.corpora import build_corpora, partition
            import random
        
            def train_valid_test():
                """Create training, validation and tests datasets
        
                Note: the batches vary in *time* at highest frequency, while
                the frequency bins are the second-highest frequency.
        
                See: `LibRosa MFCC <https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html>`_
                
                returns three iterators yielding (array[10:20:63],transcript) batches
                """
                utterances = []
                for corpus in build_corpora():
                    utterances.extend( corpus.mfcc_utterances())
                random.shuffle(utterances)
                train, test,valid = partition( utterances, (3,1,1) )
                def generation( utterances ):
                    while True:
                        offset = random.randint(0,62)
                        for name,transcript,audio_file in utterances:
                            for batch in t.iter_batches( audio_file, batch_size=10, input=63, offset=offset ):
                                yield batch,transcript
                return generation(train),generation(test),generation(valid)
        
Keywords: audiodatasets
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
