Metadata-Version: 2.1
Name: AdvancedAnalytics
Version: 0.9.0
Summary: Python support for 'The Art and Science of Data Analytics'
Home-page: https://github.com/tandonneur/AdvancedAnalytics
Author: Edward R Jones
Author-email: ejones@tamu.edu
License: UNKNOWN
Keywords: Analytics,data map,preprocessing,pre-processing,postprocessing,post-processing,NLTK,Sci-Learn,sklearn,StatsModels,web scraping,word cloud,regression,decision trees,random forest,neural network,cross validation,topic analysis,sentiment analytic,natural language processing,NLP
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent

AdvancedAnalytics
===================

A collection of Python modules, classes, and methods that simplify the use of machine learning solutions.  **AdvancedAnalytics** provides easy access to advanced tools in **Sci-Learn**, **NLTK** and other machine learning packages.  **AdvancedAnalytics** was developed to simplify learning Python from the book *The Art and Science of Data Analytics*.

Description
===========

From a high-level view, building machine learning applications typically proceeds through three stages:

    1. Data Preprocessing
    2. Modeling or Analytics
    3. Postprocessing

The classes and methods in **AdvancedAnalytics** primarily support the first and last stages of machine learning applications. 

Data scientists report they spend 80% of their total effort in the first and last stages. The first stage, *data preprocessing*, is concerned with preparing the data for analysis.  This includes:

    1. identifying and correcting outliers, 
    2. imputing missing values, and 
    3. encoding data. 
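
As a rough illustration of these three steps, the sketch below uses plain pandas rather than the package itself. The sample data, the allowed ranges, and the choice of median and mode imputation are assumptions made for this example; they are not taken from the **AdvancedAnalytics** implementation.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical raw data: one interval column and one nominal column.
    raw = pd.DataFrame({
        "Years": [25.0, 40.0, np.nan, 130.0, 33.0],
        "Department": ["HR", "Sales", "Sales", None, "Finance"],
    })

    # 1. Outliers: values outside the allowed range or category set
    #    are treated as missing.
    years = raw["Years"].where(raw["Years"].between(18, 60))
    dept = raw["Department"].where(
        raw["Department"].isin(["HR", "Sales", "Marketing"]))

    # 2. Imputation: median for interval data, most frequent value
    #    for nominal data.
    years = years.fillna(years.median())
    dept = dept.fillna(dept.mode().iloc[0])

    # 3. Encoding: one indicator column per observed nominal category.
    clean = pd.concat(
        [years, pd.get_dummies(dept, prefix="Department")], axis=1)
    print(clean)
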

The last stage, *solution postprocessing*, involves developing graphic summaries of the solution, and metrics for evaluating the quality of the solution.
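
To make the idea of quality metrics concrete, here is a minimal sketch that computes a few common regression metrics with NumPy. The function name ``regression_metrics`` and the sample values are hypothetical and are not part of the **AdvancedAnalytics** API.

.. code-block:: python

    import numpy as np

    def regression_metrics(y_true, y_pred):
        """Return common fit metrics for a regression solution."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        residuals = y_true - y_pred
        ss_error = float(np.sum(residuals ** 2))
        ss_total = float(np.sum((y_true - y_true.mean()) ** 2))
        mse = ss_error / len(y_true)
        return {
            "MSE": mse,
            "RMSE": mse ** 0.5,
            "MAE": float(np.mean(np.abs(residuals))),
            "R2": 1.0 - ss_error / ss_total,
        }

    # Invented observed and predicted values, for illustration only.
    metrics = regression_metrics([3.0, 5.0, 7.0, 9.0],
                                 [2.5, 5.0, 7.5, 9.0])
    print(metrics)
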

Documentation and Examples
============================

The API and documentation for all classes and examples are available at https://github.com/tandonneur/AdvancedAnalytics . 

Usage
=====

Currently the most popular usage is for supporting solutions developed using these advanced machine learning packages:

    * Sci-Learn
    * StatsModels
    * NLTK

The intention is to expand this list to other packages.  This is a simple example for decision tree regression that uses the data map structure to preprocess data:

.. code-block:: python

    from AdvancedAnalytics.ReplaceImputeEncode import DT
    from AdvancedAnalytics.ReplaceImputeEncode import ReplaceImputeEncode
    from AdvancedAnalytics.Tree import tree_regressor
    from sklearn.tree import DecisionTreeRegressor

    # Data map using DT (data types): attribute -> [type, allowed values]
    data_map = {
        "Salary":         [DT.Interval, (20000.0, 2000000.0)],
        "Department":     [DT.Nominal, ("HR", "Sales", "Marketing")],
        "Classification": [DT.Nominal, (1, 2, 3, 4, 5)],
        "Years":          [DT.Interval, (18, 60)]}

    # Preprocess data from an existing data frame df
    rie = ReplaceImputeEncode(data_map=data_map, interval_scaling=None,
                              nominal_encoding="SAS", drop=True)
    encoded_df = rie.fit_transform(df)
    y = encoded_df["Salary"]
    X = encoded_df.drop("Salary", axis=1)

    # "gini" is a classification criterion, so the regressor default is used
    dt = DecisionTreeRegressor(max_depth=4, min_samples_split=5,
                               min_samples_leaf=5)
    dt = dt.fit(X, y)
    # Label importances with the feature columns only (target excluded)
    tree_regressor.display_importance(dt, X.columns)
    tree_regressor.display_metrics(dt, X, y)
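
The example assumes a data frame ``df`` already exists. As a sketch only, a small data frame with the four attributes named in the data map could be built as follows; every value here is invented for illustration, and a real analysis would use far more rows than the tree settings above require.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical sample data matching the attribute names in data_map.
    df = pd.DataFrame({
        "Salary": [52000.0, 61000.0, 88000.0, 75000.0, 48000.0, 98000.0],
        "Department": ["HR", "Sales", "Marketing", None, "Sales", "HR"],
        "Classification": [1, 2, 3, 4, 5, 2],
        "Years": [25, 31, 44, 59, 22, np.nan],
    })

    # Count missing values per column before preprocessing.
    missing = df.isna().sum()
    print(missing)
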

Current Modules and Classes
=============================

ReplaceImputeEncode
    Classes for Data Preprocessing
        * DT defines new data types used in the data dictionary
        * ReplaceImputeEncode a class for data preprocessing

Regression
    Classes for Linear and Logistic Regression
        * linreg support for linear regression
        * logreg support for logistic regression
        * stepwise a variable selection class

Tree
    Classes for Decision Tree Solutions
        * tree_regressor support for regressor decision trees
        * tree_classifier support for classification decision trees

Forest
    Classes for Random Forests
        * forest_regressor support for regressor random forests
        * forest_classifier support for classification random forests

NeuralNetwork
    Classes for Neural Networks
        * nn_regressor support for regressor neural networks
        * nn_classifier support for classification neural networks

TextAnalytics
    Classes for Text Analytics
        * text_analysis support for topic analysis
        * sentiment_analysis support for sentiment analysis

Internet
    Classes for Internet Applications
        * scrape support for web scraping
        * metrics a class for solution metrics

Installation and Dependencies
=============================

**AdvancedAnalytics** is designed to work on any operating system running Python 3.  It can be installed using **pip** or **conda**.

.. code-block:: bash

    pip install AdvancedAnalytics
    # or
    conda install -c conda-forge AdvancedAnalytics

General Dependencies
    Most classes import one or more modules from **Sci-Learn**,
    referenced as *sklearn* in module imports, and **StatsModels**.
    Both are installed with current versions of **anaconda**, a
    popular application for coding Python solutions.

Decision Tree and Random Forest Dependencies
    The *Tree* and *Forest* modules plot decision trees and importance
    metrics using **pydotplus** and the **graphviz** packages.  If these
    are not installed and you are planning to use the *Tree* or *Forest*
    modules, they can be installed using the following code.

    .. code-block:: bash

        conda install -c conda-forge pydotplus
        conda install -c conda-forge graphviz
        pip install graphviz

    Note that the second conda install does not complete the
    installation of the graphviz package.  To complete it, run the
    pip install after the conda graphviz install.
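
    One way to check whether these plotting dependencies are present
    before using the *Tree* or *Forest* modules is sketched below.  The
    helper name ``missing_tree_dependencies`` is hypothetical, and the
    check for a ``dot`` executable on the PATH is an assumption about
    how Graphviz is located on a typical system.

    .. code-block:: python

        import importlib.util
        import shutil

        def missing_tree_dependencies():
            """Return the plotting dependencies that cannot be found."""
            # Python packages named in this section.
            missing = [name for name in ("pydotplus", "graphviz")
                       if importlib.util.find_spec(name) is None]
            # Assumption: rendering also needs the Graphviz "dot" binary.
            if shutil.which("dot") is None:
                missing.append("dot executable")
            return missing

        # An empty list means everything checked here was found.
        print(missing_tree_dependencies())
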

Text Analytics Dependencies
    The *TextAnalytics* module is based on the **NLTK** and **Sci-Learn**
    text analytics packages.  They are both installed with the current 
    version of anaconda. 

    However, *TextAnalytics* includes options to produce word clouds, 
    which are graphic displays of the word collections associated with 
    topic or data clusters.  The **wordcloud** package is used to produce
    these graphs.  If you are using the *TextAnalytics* module you can
    install the **wordcloud** package with the following code.

    .. code-block:: bash

        conda install -c conda-forge wordcloud

    In addition, data used by the **NLTK** package is not automatically 
    installed with this package.  These data include the text 
    dictionary and other data tables.

    The following nltk.download commands should be run before using 
    **TextAnalytics**. However, it is only necessary to run these once to 
    download and install the data NLTK uses for text analytics.

    .. code-block:: python

        # The following NLTK commands should be run once to
        # download and install NLTK data.
        import nltk

        nltk.download("punkt")
        nltk.download("averaged_perceptron_tagger")
        nltk.download("stopwords")
        nltk.download("wordnet")

Internet Dependencies
    The *Internet* module contains a class *scrape* which has some
    functions for scraping newsfeeds.  Some of these are based on the
    **newspaper3k** package.  It can be installed using:

    .. code-block:: bash

        conda install -c conda-forge newspaper3k
        # or
        pip install newspaper3k

Code of Conduct
---------------

Everyone interacting in the AdvancedAnalytics project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct: https://www.pypa.io/en/latest/code-of-conduct/ .




