Metadata-Version: 2.0
Name: stickbugml
Version: 1.0.4
Summary: A framework to organize the process of designing supervised machine learning systems
Home-page: https://github.com/aaronduino/stick-bug-ml
Author: Aaron Janse
Author-email: gitduino@gmail.com
License: Apache 2.0
Keywords: stick bug,ml,machine learning,ai,artificial intelligence,framework,organization,organize
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: sklearn

stick-bug-ml
============

A framework to ease the burden of organizing the code of a supervised
machine learning system.

It provides decorators that manage data and pass it between the common
steps of building a machine learning system, such as:

- loading the dataset
- preprocessing
- feature generation
- model definition

While doing this, it keeps the global namespace free of the clutter that
an endless chain of features and models would otherwise create.

In addition, it makes it easy to put new, real-life data through the
exact same process that the training data goes through.
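
As a rough illustration of the idea (not this library's actual
implementation), a decorator can store each function in a registry so
that the same steps can later be replayed on new data. Every name in the
sketch below (``_FEATURES``, ``feature``, ``process``, ``row_sum``,
``row_max``) is hypothetical and chosen only for the example:

.. code:: python

    import pandas as pd

    # Hypothetical registry; illustrative only, not stickbugml's internals.
    _FEATURES = {}

    def feature(name):
        """Register a feature-generating function under ``name``."""
        def decorator(func):
            _FEATURES[name] = func
            return func
        return decorator

    @feature('row_sum')
    def row_sum(X):
        return pd.DataFrame({'row_sum': X.sum(axis=1)})

    @feature('row_max')
    def row_max(X):
        return pd.DataFrame({'row_max': X.max(axis=1)})

    def process(X):
        """Run every registered feature on X and join the results column-wise."""
        generated = [func(X) for func in _FEATURES.values()]
        return pd.concat([X] + generated, axis=1)

    train_X = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    new_X = pd.DataFrame({'a': [10], 'b': [20]})

    # Training data and new data go through the identical registered steps.
    train_out = process(train_X)
    new_out = process(new_X)

Because the functions live in the registry, calling code only needs the
single ``process`` entry point rather than a growing list of
intermediate variables.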

Installation
------------

Install via ``pip`` (Python 3):

.. code:: bash

    $ pip install stickbugml

Dependencies:

- Python 3
- sklearn
- pandas
- numpy

Example
-------

Note: there is also a great `example for use in Jupyter
Notebooks <demo.ipynb>`__

First, import this library:

.. code:: python

    import stickbugml
    from stickbugml.decorators import dataset, preprocess, feature, model

Load your dataset:

.. code:: python

    import seaborn as sns  # seaborn.apionly was removed in seaborn 0.9
    import pandas as pd

    @dataset(train_valid_test=(0.6, 0.2, 0.2)) # define your train/validation/test data splits
    def raw_dataset():
        titanic_dataset = sns.load_dataset('titanic')

        # Drop NaN rows for simplicity
        titanic_dataset.dropna(inplace=True)

        # Extract X and y
        X = titanic_dataset.drop('survived', axis=1)
        y = titanic_dataset['survived']
        return X, y

    print(raw_dataset.head()) # yes, this does work! raw_dataset is now a pandas DataFrame
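
For intuition about what a ``(0.6, 0.2, 0.2)`` split means, here is a
minimal sketch that produces the same proportions by hand with
scikit-learn's ``train_test_split``. It only illustrates the ratios and
is an assumption about the general technique, not a description of how
``@dataset`` splits the data internally; the toy ``X`` and ``y`` are
made up for the example:

.. code:: python

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy data standing in for a real dataset (illustrative only).
    X = pd.DataFrame({'x': np.arange(100)})
    y = pd.Series(np.arange(100) % 2)

    # First carve off the 60% training portion...
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.6, random_state=0)

    # ...then split the remaining 40% evenly: 20% validation, 20% test.
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=0)

    print(len(X_train), len(X_valid), len(X_test))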

(Optionally) do some pre-processing:

.. code:: python

    @preprocess
    def preprocessed_dataset(X):
        # Encode categorical columns
        categorical_column_names = [
                'sex', 'embarked', 'class',
                'who', 'adult_male', 'deck',
                'embark_town', 'alive', 'alone']

        X = pd.get_dummies(X,
                           columns=categorical_column_names,
                           prefix=categorical_column_names)

        return X

    print(preprocessed_dataset.head()) # As with raw_dataset, this is now a pandas DataFrame

Generate some features:

.. code:: python

    from sklearn import decomposition
    import numpy as np

    @feature('pca')
    def pca_feature(X):
        pca = decomposition.PCA(n_components=3)
        pca.fit(X)
        pca_out = pca.transform(X)

        pca_out = np.transpose(pca_out, (1, 0))
        return pd.DataFrame(pca_out)

    # let's preview
    print(pca_feature.head()) # As with raw_dataset, this is now a pandas DataFrame

    # you can add more features, btw

And define your (machine learning) model(s):

.. code:: python

    import xgboost as xgb

    @model('xgboost')
    def xgboost_model():
        def define(num_columns):
            return None # xgboost models aren't pre-defined


        def train(model, params, train, validation):
            params['objective'] = 'binary:logistic' # Static parameters can be defined here
            params['eval_metric'] = 'logloss'

            d_train = xgb.DMatrix(train['X'], label=train['y'])
            d_valid = xgb.DMatrix(validation['X'], label=validation['y'])

            watchlist = [(d_train, 'train'), (d_valid, 'valid')]

            trained_model = xgb.train(params, d_train, 2000, watchlist, early_stopping_rounds=50, verbose_eval=10)

            return trained_model

        def predict(model, X):
            return model.predict(xgb.DMatrix(X))

        return define, train, predict

Now you can train your model, trying out different parameters if you
want:

.. code:: python

    stickbugml.train('xgboost', {
        'max_depth': 7,
        'eta': 0.01
    })

The library keeps the test data's ground truth values locked away so
your models can't train on them. After you train your model, have the
framework evaluate it for you:

.. code:: python

    logloss_score = stickbugml.evaluate('xgboost')
    print(logloss_score)
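
The variable name above suggests that the score is a log loss. As a
reference for how to read such a number, the sketch below computes the
binary log loss by hand with NumPy. The helper ``binary_log_loss`` and
the sample values are hypothetical, and whether ``stickbugml.evaluate``
computes its score in exactly this way is an assumption:

.. code:: python

    import numpy as np

    def binary_log_loss(y_true, y_prob, eps=1e-15):
        """Mean negative log-likelihood of binary labels under y_prob."""
        y_true = np.asarray(y_true, dtype=float)
        # Clip probabilities away from 0 and 1 so the logarithm stays finite.
        y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
        losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
        return float(-np.mean(losses))

    y_true = [1, 0, 1, 1]          # made-up ground truth labels
    y_prob = [0.9, 0.2, 0.7, 0.6]  # made-up predicted probabilities of class 1

    score = binary_log_loss(y_true, y_prob)
    print(score)

Lower is better: confident, correct predictions push the score toward
zero, while confident mistakes are penalized heavily.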

You can add more models and features if so desired.

Since this library is built with reality in mind, you can easily get
predictions for new/real-life data:

.. code:: python

    raw_X = pd.read_csv('2018_titanic_manifesto.csv') # It will probably sink, but we don't know who will survive
    processed_X = stickbugml.process(raw_X) # Process the data
    del raw_X # Gotta keep that namespace clean, right?

    y = stickbugml.predict('xgboost', processed_X) # Make predictions

    print(y)

Contributing & Feedback
-----------------------

If you have any problems, or would like a new feature, submit an Issue.

If you want to help out, feel free to submit a Pull Request.

License
-------

This project uses the Apache 2.0 License.


