Metadata-Version: 1.1
Name: Concurrent_AP
Version: 1.4
Summary: Scalable and parallel programming implementation of Affinity Propagation clustering
Home-page: https://github.com/GGiecold/Concurrent_AP
Author: Gregory Giecold
Author-email: ggiecold@jimmy.harvard.edu
License: MIT License
Download-URL: https://github.com/GGiecold/Concurrent_AP
Description: Concurrent\_AP
        ==============
        
        Overview
        --------
        
        A scalable and concurrent programming implementation of Affinity
        Propagation clustering.
        
        Affinity Propagation is a clustering algorithm based on passing messages
        between data-points.
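        
        As a rough illustration of this message-passing scheme, the sketch below
        runs the damped responsibility and availability updates described by
        Frey and Dueck on a small, dense similarity matrix held in memory. It is
        not the code of Concurrent\_AP: the function name
        ``affinity_propagation``, its arguments and the convergence test (an
        unchanged set of exemplars) are assumptions made for the example.
        
        ::
        
            import numpy as np
        
        
            def affinity_propagation(S, preference=None, damping=0.5,
                                     max_iter=200, convergence_iter=15):
                """Dense, single-process sketch; S is an (n, n) similarity matrix."""
                S = np.array(S, dtype=float)        # work on a copy
                n = S.shape[0]
                if preference is None:
                    preference = np.median(S)       # assumption: median similarity
                np.fill_diagonal(S, preference)     # s(k, k) holds the preference
                R = np.zeros((n, n))                # responsibilities r(i, k)
                A = np.zeros((n, n))                # availabilities a(i, k)
                rows = np.arange(n)
                exemplars = np.array([], dtype=int)
                stable = 0
                for _ in range(max_iter):
                    # r(i, k) <- s(i, k) - max over k' != k of a(i, k') + s(i, k')
                    AS = A + S
                    best = np.argmax(AS, axis=1)
                    first = AS[rows, best]
                    AS[rows, best] = -np.inf
                    second = AS.max(axis=1)
                    R_new = S - first[:, None]
                    R_new[rows, best] = S[rows, best] - second
                    R = damping * R + (1.0 - damping) * R_new
                    # a(i, k) <- min(0, r(k, k) + sum of max(0, r(i', k)),
                    #                i' not in {i, k});
                    # a(k, k) <- sum of max(0, r(i', k)), i' != k
                    Rp = np.maximum(R, 0.0)
                    Rp[rows, rows] = R[rows, rows]
                    A_new = Rp.sum(axis=0)[None, :] - Rp
                    diag = A_new[rows, rows].copy()
                    A_new = np.minimum(A_new, 0.0)
                    A_new[rows, rows] = diag
                    A = damping * A + (1.0 - damping) * A_new
                    # Samples with a(k, k) + r(k, k) > 0 are the current exemplars
                    current = np.flatnonzero(np.diag(A) + np.diag(R) > 0)
                    if current.size > 0 and np.array_equal(current, exemplars):
                        stable += 1
                    else:
                        stable = 0
                    exemplars = current
                    if stable >= convergence_iter:
                        break
                if exemplars.size == 0:
                    return exemplars, -np.ones(n, dtype=int)
                # Assign each sample to its most similar exemplar
                labels = np.argmax(S[:, exemplars], axis=1)
                labels[exemplars] = np.arange(exemplars.size)
                return exemplars, labels
        
        
            # Toy data: two tight groups of one-dimensional points
            points = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
            S = -(points - points.T) ** 2   # negative squared distances
            exemplars, labels = affinity_propagation(S)
        
        The three ``(n, n)`` arrays ``S``, ``R`` and ``A`` are what makes the
        dense formulation memory-hungry for large numbers of samples.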
        
        Storing and updating the matrices of 'availabilities', 'responsibilities'
        and 'similarities' between samples can be memory-intensive. We address
        this issue by keeping them in an HDF5 data structure, which allows
        Affinity Propagation clustering of arbitrarily large data-sets, where
        other Python implementations would raise a MemoryError on most machines.
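        
        To give an idea of what on-disk storage of these matrices can look like,
        here is a minimal PyTables sketch. Only the group name
        ``aff_prop_group`` comes from this document; the node names, the helper
        functions ``create_matrix_store`` and ``write_similarity_rows``, the
        compression settings and the choice of negative squared Euclidean
        distances as similarities are assumptions made for the example.
        
        ::
        
            import os
            import tempfile
        
            import numpy as np
            import tables
        
        
            def create_matrix_store(hdf5_path, n_samples):
                """Create empty, compressed (n, n) arrays on disk (hypothetical layout)."""
                filters = tables.Filters(complevel=5, complib='zlib')
                with tables.open_file(hdf5_path, mode='w') as fileh:
                    group = fileh.create_group(fileh.root, 'aff_prop_group')
                    for name in ('similarities', 'responsibilities', 'availabilities'):
                        fileh.create_carray(group, name, tables.Float32Atom(),
                                            shape=(n_samples, n_samples),
                                            filters=filters)
        
        
            def write_similarity_rows(hdf5_path, data, start, stop):
                """Compute one block of rows in memory and write it to the file."""
                diff = data[start:stop, None, :] - data[None, :, :]
                block = -np.sum(diff ** 2, axis=2)
                with tables.open_file(hdf5_path, mode='r+') as fileh:
                    similarities = fileh.root.aff_prop_group.similarities
                    similarities[start:stop, :] = block.astype(np.float32)
        
        
            data = np.random.RandomState(0).rand(6, 3)
            hdf5_path = os.path.join(tempfile.mkdtemp(), 'aff_prop_store.h5')
            create_matrix_store(hdf5_path, len(data))
            for start in range(0, len(data), 2):
                write_similarity_rows(hdf5_path, data, start, start + 2)
        
        Only one block of rows is held in memory at any time, so the full
        matrix never has to fit in RAM.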
        
        We also significantly speed up the computations by splitting them up
        across subprocesses, thereby taking full advantage of the resources of
        multi-core processors and bypassing the Global Interpreter Lock of the
        standard Python interpreter, CPython.
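        
        The general idea of sharing row blocks of a matrix among subprocesses
        can be sketched with the ``multiprocessing`` module of the standard
        library. The helpers ``row_slices``, ``similarity_rows`` and
        ``parallel_similarities`` are hypothetical names introduced for this
        example and do not describe how Concurrent\_AP organizes its own
        workers.
        
        ::
        
            import multiprocessing
        
            import numpy as np
        
        
            def row_slices(n_rows, n_chunks):
                """Split range(n_rows) into contiguous (start, stop) pairs."""
                bounds = np.linspace(0, n_rows, n_chunks + 1).astype(int)
                return [(int(a), int(b))
                        for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
        
        
            def similarity_rows(task):
                """Negative squared Euclidean distances for one block of rows."""
                data, start, stop = task
                diff = data[start:stop, None, :] - data[None, :, :]
                return -np.sum(diff ** 2, axis=2)
        
        
            def parallel_similarities(data, n_processes=2):
                """Compute the full similarity matrix, one row block per task."""
                tasks = [(data, start, stop)
                         for start, stop in row_slices(len(data), n_processes)]
                pool = multiprocessing.Pool(processes=n_processes)
                try:
                    blocks = pool.map(similarity_rows, tasks)
                finally:
                    pool.close()
                    pool.join()
                return np.vstack(blocks)
        
        
            if __name__ == '__main__':
                data = np.random.RandomState(0).rand(8, 3)
                S = parallel_similarities(data, n_processes=2)
                print(S.shape)
        
        Each worker is a separate process with its own interpreter, so the
        blocks are computed without contending for a single Global Interpreter
        Lock. This simple version pickles the whole data array for every task,
        which is acceptable for a sketch but wasteful for large inputs.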
        
        Installation and Requirements
        -----------------------------
        
        Concurrent\_AP requires Python 2.7 along with the following packages and
        a few modules from the Python Standard Library:
        
        -  NumPy >= 1.9
        -  psutil
        -  PyTables
        -  scikit-learn
        -  setuptools
        
        Once the required dependencies are installed, you can install
        Concurrent\_AP from the official Python Package Index (PyPI) as follows:
        
        -  open a terminal window;
        -  type in the command: ``pip install Concurrent_AP``
        
        The code has been tested on Fedora, OS X and Ubuntu and should work on
        any other Unix-like operating system.
        
        Usage and Command Line Options
        ------------------------------
        
        See the docstrings associated with each function of the Concurrent\_AP
        module for more information and an understanding of how different tasks
        are organized and shared among subprocesses.
        
        Usage: ``Concurrent_AP [options] file_name``, where ``file_name``
        denotes the path to the file holding the data to be processed by
        Affinity Propagation clustering. Each row corresponds to a sample and
        the features must be tab-separated (``*.tsv`` file). The file is also
        assumed to have no header row.
        
        -  ``-c`` or ``--convergence``: specify the number of iterations without
           change in the number of clusters that signals convergence (defaults
           to 15);
        -  ``-d`` or ``--damping``: the damping parameter of Affinity
           Propagation (defaults to 0.5);
        -  ``-f`` or ``--file``: the file name or file handle of the HDF5
           (hierarchical data format) file where the matrices involved in
           Affinity Propagation clustering will be stored, as part of a group
           named 'aff_prop_group' at the root of the HDF5 file (defaults to a
           temporary file); the data-set itself won't be accessed there and
           needn't be stored there either;
        -  ``-i`` or ``--iterations``: maximum number of message-passing
           iterations (defaults to 200);
        -  ``-m`` or ``--multiprocessing``: the number of processes to use;
        -  ``-p`` or ``--preference``: the preference parameter of Affinity
           Propagation (if not specified, will be determined as the median of
           the distribution of pairwise L2 Euclidean distances between samples);
        -  ``-s`` or ``--similarities``: specify whether a similarity matrix has
           been pre-computed and stored in the HDF5 data structure accessible at
           the location specified through the command line option ``-f`` or
           ``--file`` (see above);
        -  ``-v`` or ``--verbose``: whether to be verbose.
        
        Demo of Concurrent\_AP
        ----------------------
        
        The following few lines illustrate the use of Concurrent\_AP on the
        'Iris data-set' from the UCI Machine Learning Repository. While the
        number of samples is far too small here for the benefits of the present
        multi-tasking implementation and the use of an HDF5 data structure to
        come fully into play, this data-set has the advantage of being
        well-known and lends itself to a quick comparison with scikit-learn's
        version of Affinity Propagation clustering.
        
        -  In a Python interpreter console, enter the following few lines, whose
           purpose is to create a file containing the Iris data-set that will be
           later subjected to Affinity Propagation clustering via
           Concurrent\_AP:
        
        ::
        
            >>> import numpy as np
            >>> from sklearn import datasets
        
            >>> iris = datasets.load_iris()
            >>> data = iris.data
            >>> with open('./iris_data.tsv', 'w') as f:
            ...     np.savetxt(f, data, fmt='%.4f', delimiter='\t')
        
        -  Open a terminal window.
        -  Type in: ``Concurrent_AP --preference 5.47 -v iris_data.tsv`` or
           simply ``Concurrent_AP iris_data.tsv``
        
        The latter will automatically compute a preference parameter from the
        data-set.
        
        When the rounds of message-passing among data-points have completed, a
        folder is created in your current working directory. It contains a file
        of cluster labels and a file of cluster center indices, both in
        tab-separated format.
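        
        For the comparison with scikit-learn mentioned earlier, a minimal sketch
        could look as follows. The parameter values mirror the defaults listed
        for the ``-d``, ``-i`` and ``-c`` options; the preference is left to
        scikit-learn's own default, which is not assumed to match the one
        computed by Concurrent\_AP.
        
        ::
        
            from sklearn import datasets
            from sklearn.cluster import AffinityPropagation
        
            iris = datasets.load_iris()
        
            # Damping, maximum number of iterations and convergence window set
            # to the default values documented for Concurrent_AP.
            model = AffinityPropagation(damping=0.5, max_iter=200,
                                        convergence_iter=15)
            model.fit(iris.data)
        
            labels = model.labels_                     # one label per sample
            centers = model.cluster_centers_indices_   # indices of the exemplars
            print('%d clusters found by scikit-learn' % len(centers))
        
        The resulting ``labels`` and ``centers`` can then be compared with the
        contents of the two files written by Concurrent\_AP.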
        
        References
        ----------
        
        Brendan J. Frey and Delbert Dueck. "Clustering by Passing Messages
        Between Data Points", Science 315 (5814), pp. 972-976, Feb. 2007.
        
Keywords: parallel multiprocessing machine-learning concurrent clustering
Platform: Any
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python :: 2.7
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Visualization
Classifier: Topic :: Scientific/Engineering :: Mathematics
