Metadata-Version: 2.1
Name: autobazaar
Version: 0.1.0
Summary: The Machine Learning Bazaar
Home-page: https://github.com/HDI-project/AutoBazaar
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Keywords: automl machine learning hyperparameters tuning classification regression autobazaar
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.4
Description-Content-Type: text/markdown
Requires-Dist: absl-py (==0.4.0)
Requires-Dist: astor (==0.7.1)
Requires-Dist: baytune (==0.2.1)
Requires-Dist: boto (==2.48.0)
Requires-Dist: boto3 (==1.9.27)
Requires-Dist: botocore (==1.12.28)
Requires-Dist: certifi (==2018.8.13)
Requires-Dist: chardet (==3.0.4)
Requires-Dist: click (==6.7)
Requires-Dist: cloudpickle (==0.4.0)
Requires-Dist: cycler (==0.10.0)
Requires-Dist: dask (==0.18.2)
Requires-Dist: decorator (==4.3.0)
Requires-Dist: distributed (==1.22.1)
Requires-Dist: docutils (==0.14)
Requires-Dist: featuretools (==0.3.1)
Requires-Dist: future (==0.16.0)
Requires-Dist: gast (==0.2.0)
Requires-Dist: grpcio (==1.12.1)
Requires-Dist: h5py (==2.8.0)
Requires-Dist: HeapDict (==1.0.0)
Requires-Dist: idna (==2.6)
Requires-Dist: iso639 (==0.1.4)
Requires-Dist: jmespath (==0.9.3)
Requires-Dist: Keras (==2.1.6)
Requires-Dist: Keras-Applications (==1.0.6)
Requires-Dist: Keras-Preprocessing (==1.0.5)
Requires-Dist: kiwisolver (==1.0.1)
Requires-Dist: langdetect (==1.0.7)
Requires-Dist: lightfm (==1.15)
Requires-Dist: matplotlib (==2.2.3)
Requires-Dist: mit-d3m (==0.1.1)
Requires-Dist: mlblocks (==0.2.3)
Requires-Dist: mlprimitives (==0.1.3)
Requires-Dist: msgpack (==0.5.6)
Requires-Dist: networkx (==2.1)
Requires-Dist: nltk (==3.3)
Requires-Dist: numpy (==1.15.2)
Requires-Dist: opencv-python (==3.4.2.17)
Requires-Dist: pandas (==0.23.4)
Requires-Dist: Pillow (==5.1.0)
Requires-Dist: protobuf (==3.6.1)
Requires-Dist: psutil (==5.4.7)
Requires-Dist: pymongo (==3.7.2)
Requires-Dist: pyparsing (==2.2.0)
Requires-Dist: python-dateutil (==2.7.3)
Requires-Dist: python-louvain (==0.10)
Requires-Dist: pytz (==2018.5)
Requires-Dist: PyWavelets (==0.5.2)
Requires-Dist: PyYAML (==3.12)
Requires-Dist: requests (==2.20.0)
Requires-Dist: s3fs (==0.1.5)
Requires-Dist: s3transfer (==0.1.13)
Requires-Dist: scikit-image (==0.14.0)
Requires-Dist: scikit-learn (==0.20.0)
Requires-Dist: scipy (==1.1.0)
Requires-Dist: six (==1.11.0)
Requires-Dist: sortedcontainers (==2.0.4)
Requires-Dist: setuptools (==39.1.0)
Requires-Dist: tblib (==1.3.2)
Requires-Dist: tensorboard (==1.11.0)
Requires-Dist: tensorflow (==1.11.0)
Requires-Dist: termcolor (==1.1.0)
Requires-Dist: toolz (==0.9.0)
Requires-Dist: tornado (==5.1)
Requires-Dist: tqdm (==4.24.0)
Requires-Dist: urllib3 (==1.23)
Requires-Dist: Werkzeug (==0.14.1)
Requires-Dist: xgboost (==0.72.1)
Requires-Dist: zict (==0.1.3)
Provides-Extra: dev
Requires-Dist: bumpversion (>=0.5.3) ; extra == 'dev'
Requires-Dist: pip (>=10.0.0) ; extra == 'dev'
Requires-Dist: watchdog (>=0.8.3) ; extra == 'dev'
Requires-Dist: m2r (>=0.2.0) ; extra == 'dev'
Requires-Dist: Sphinx (>=1.7.1) ; extra == 'dev'
Requires-Dist: sphinx-rtd-theme (>=0.2.4) ; extra == 'dev'
Requires-Dist: autodocsumm (>=0.1.10) ; extra == 'dev'
Requires-Dist: flake8 (>=3.5.0) ; extra == 'dev'
Requires-Dist: isort (>=4.3.4) ; extra == 'dev'
Requires-Dist: autoflake (>=1.1) ; extra == 'dev'
Requires-Dist: autopep8 (>=1.3.5) ; extra == 'dev'
Requires-Dist: twine (>=1.10.0) ; extra == 'dev'
Requires-Dist: wheel (>=0.30.0) ; extra == 'dev'
Requires-Dist: tox (>=2.9.1) ; extra == 'dev'
Requires-Dist: coverage (>=4.5.1) ; extra == 'dev'
Requires-Dist: pytest (>=3.4.2) ; extra == 'dev'
Requires-Dist: pytest-cov (>=2.6.0) ; extra == 'dev'
Provides-Extra: tests
Requires-Dist: pytest (>=3.4.2) ; extra == 'tests'
Requires-Dist: pytest-cov (>=2.6.0) ; extra == 'tests'

<p align="left">
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="AutoBazaar" />
<i>An open source project from Data to AI Lab at MIT.</i>
</p>


[![Travis](https://travis-ci.org/HDI-Project/AutoBazaar.svg?branch=master)](https://travis-ci.org/HDI-Project/AutoBazaar)
[![PyPi Shield](https://img.shields.io/pypi/v/autobazaar.svg)](https://pypi.python.org/pypi/autobazaar)


# AutoBazaar

- License: MIT
- Documentation: https://HDI-Project.github.io/AutoBazaar/
- Homepage: https://github.com/HDI-Project/AutoBazaar

# Overview

AutoBazaar is an AutoML system created to execute the experiments associated with the paper
[The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System
Development](https://arxiv.org/pdf/1905.08942.pdf)
by the [Human-Data Interaction (HDI) Project](https://hdi-dai.lids.mit.edu/) at LIDS, MIT.

It comes in the form of a Python library which can be used directly inside any other Python
project, as well as a CLI which allows searching for pipelines to solve a problem directly
from the command line.

# Install

## Requirements

**AutoBazaar** has been developed and tested on [Python 3.5, 3.6 and 3.7](https://www.python.org/downloads/).

Also, although it is not strictly required, the usage of a
[virtualenv](https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid
interfering with other software installed in the system where **AutoBazaar** is run.

These are the minimum commands needed to create a virtualenv using python3.6 for **AutoBazaar**:

```bash
pip install virtualenv
virtualenv -p $(which python3.6) autobazaar-venv
```

Afterwards, run this command to activate the virtualenv:

```bash
source autobazaar-venv/bin/activate
```

Remember to execute it every time you start a new console to work on **AutoBazaar**!

## Install with pip

After creating the virtualenv and activating it, we recommend using
[pip](https://pip.pypa.io/en/stable/) in order to install **AutoBazaar**:

```bash
pip install autobazaar
```

This will pull and install the latest stable release from [PyPI](https://pypi.org/).

## Install from source

Alternatively, with your virtualenv activated, you can clone the repository and install it from
source by running `make install` on the `stable` branch:

```bash
git clone git@github.com:HDI-Project/AutoBazaar.git
cd AutoBazaar
git checkout stable
make install
```

For development, you can use `make install-develop` instead in order to install all
the required dependencies for testing and code linting.

# Data Format

AutoBazaar works with datasets in the [D3M Schema Format](https://github.com/mitll/d3m-schema)
as input.

This dataset schema, developed by MIT Lincoln Laboratory for DARPA's Data-Driven Discovery
of Models (D3M) program, requires the data to be in plainly readable formats such as CSV files
or JPG images, and to be set within a folder hierarchy alongside some metadata specifications
in JSON format, which describe all the data contained as well as the problem that we are
trying to solve.
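
As a quick sanity check before running AutoBazaar, the JSON metadata files inside a dataset
folder can be listed with a few lines of Python. This is only a sketch: `find_metadata_files`
is a hypothetical helper, not part of AutoBazaar, and it does not validate the files against
the schema.

```python
from pathlib import Path


def find_metadata_files(dataset_dir):
    """Return the relative paths of all JSON files found under `dataset_dir`.

    Hypothetical helper: it only lists the files and does not check
    that their contents comply with the D3M schema.
    """
    root = Path(dataset_dir)
    # rglob walks the whole folder hierarchy, not just the top level.
    return sorted(path.relative_to(root).as_posix() for path in root.rglob("*.json"))
```

An empty list for a given dataset folder suggests that the metadata is missing or that the
path points to the wrong place.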

For more details about the schema and about how to format your data to be compliant with it,
please have a look at the [Schema Documentation](https://github.com/mitll/d3m-schema/tree/master/documentation).

As an example, you can browse some datasets which have been included in this repository for
demonstration purposes:
- [185_baseball](https://github.com/HDI-Project/AutoBazaar/tree/master/data/185_baseball): Single Table Classification
- [196_autoMpg](https://github.com/HDI-Project/AutoBazaar/tree/master/data/196_autoMpg): Single Table Regression

Additionally, you can find a collection with ~500 datasets already formatted in the
[d3m-data-dai S3 Bucket in AWS](https://d3m-data-dai.s3.amazonaws.com/index.html).

# Quickstart

In this short tutorial we will guide you through a series of steps that will help you get
started with **AutoBazaar** using its CLI command `abz`.

For more details about its usage and the available options, please execute `abz --help`
on your command line.

## 1. Prepare your Data

Make sure to have your data prepared in the [Data Format](#data-format) explained above inside
an uncompressed folder in a filesystem directly accessible by **AutoBazaar**.

In order to check whether your dataset is available and ready to use, you can execute
the `abz` command in your command line with its `list` subcommand.
If your dataset is in a different place than inside a folder called `data` within your
current working directory, do not forget to add the `-i` argument to your command indicating
the path to the folder that contains your dataset.

```bash
abz list -i /path/to/your/datasets/folder
```

The output should be a table which includes the details of all the datasets found inside
the indicated directory:

```
             data_modality                task_type task_subtype             metric size_human  train_samples
dataset
185_baseball  single_table           classification  multi_class            f1Macro       148K           1073
196_autoMpg   single_table               regression   univariate   meanSquaredError        32K            298
30_personae           text           classification       binary                 f1       1,4M            116
32_wikiqa      multi_table           classification       binary                 f1       4,9M          23406
60_jester     single_table  collaborative_filtering               meanAbsoluteError        44M         880719
```

**Note:** If you see an error saying that `No matching datasets found`, please review your
dataset format and make sure to have indicated the right path.

For the rest of this quickstart, we will be using the `185_baseball` dataset that you can
find inside the [data folder](https://github.com/HDI-Project/AutoBazaar/tree/master/data)
contained in this repository.

## 2. Start the search process

Once your data is ready, you can start the AutoBazaar search process using the `abz search`
command.
To do this, you will again need to provide the path to the folder that contains your datasets,
as well as the names of the datasets that you want to process.

```bash
abz search -i /path/to/your/datasets/folder name_of_your_dataset
```

This will evaluate the default pipeline without performing any additional tuning iterations on it.

In order to start an actual tuning process, you will need to provide at least one of the
following additional options:

* `-b, --budget`: Maximum number of tuning iterations to perform.
* `-t, --timeout`: Maximum time that the system is allowed to run, in seconds.
* `-c, --checkpoints`: Comma separated string containing the different checkpoints where
  the best pipeline so far must be stored and evaluated against the test dataset. There must be
  no spaces between the checkpoint times. For example, to store the best pipeline every 10 minutes
  until 30 minutes have passed, you would use the option `-c 600,1200,1800`.
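
For evenly spaced checkpoints, the value for `-c` can be generated instead of typed by hand.
The following is a minimal sketch in Python; `checkpoint_arg` is a hypothetical helper name
and is not part of AutoBazaar.

```python
def checkpoint_arg(interval, total):
    """Build a comma-separated checkpoint string with no spaces.

    Both arguments are in seconds. A checkpoint is emitted every
    `interval` seconds, up to and including `total`.
    """
    if interval <= 0 or total < interval:
        raise ValueError("interval must be positive and no larger than total")
    return ",".join(str(seconds) for seconds in range(interval, total + 1, interval))


# Every 10 minutes until 30 minutes have passed.
print(checkpoint_arg(600, 1800))  # prints 600,1200,1800
```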

For example, to process the `185_baseball` dataset for 30 seconds, evaluating the
best pipeline so far every 10 seconds but with a maximum of 10 tuning iterations, we would
use the following command:

```bash
abz search 185_baseball -c10,20,30 -b10
```

For further details about the available options, please execute `abz search --help` in your
terminal.
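
If you prefer to launch the search from a Python script, one option is to call the CLI through
`subprocess`. The sketch below only uses the options described above; `build_search_command`
and `run_search` are hypothetical helpers, not part of the AutoBazaar library, and the sketch
assumes that `abz` is available on your `PATH`.

```python
import subprocess


def build_search_command(dataset, input_dir=None, budget=None, timeout=None, checkpoints=None):
    """Assemble the argument list for `abz search` from the options above."""
    command = ["abz", "search"]
    if input_dir is not None:
        command += ["-i", str(input_dir)]
    if budget is not None:
        command += ["-b", str(budget)]
    if timeout is not None:
        command += ["-t", str(timeout)]
    if checkpoints:
        # Checkpoints must be joined with commas and no spaces.
        command += ["-c", ",".join(str(seconds) for seconds in checkpoints)]
    command.append(dataset)
    return command


def run_search(dataset, **options):
    """Run `abz search` and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_search_command(dataset, **options), check=True)
```

For instance, `build_search_command("185_baseball", budget=10, checkpoints=[10, 20, 30])`
returns `['abz', 'search', '-b', '10', '-c', '10,20,30', '185_baseball']`.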

## 3. Explore the results

Once **AutoBazaar** has finished searching for the best pipeline, a table will be printed
to stdout with a summary of the best pipeline found for each dataset.
If multiple checkpoints were provided, details about the best pipeline in each checkpoint
will also be included.

The output will be a table similar to this one:

```
                                          pipeline     score      rank  cv_score   metric data_modality       task_type task_subtype    elapsed  iterations  load_time  trivial_time  fit_time    cv_time error  step
dataset
185_baseball  fce28425-e45c-4620-9d3c-d329b8684bea  0.316961  0.682957  0.317043  f1Macro  single_table  classification  multi_class  10.024457         0.0   0.011041      0.026212       NaN        NaN  None  None
185_baseball  f7428924-79ee-439d-bc32-998a9efea619  0.675132  0.390927  0.609073  f1Macro  single_table  classification  multi_class  21.412262         1.0   0.011041      0.026212   9.99484        NaN  None  None
185_baseball  397780a5-6bf6-48c9-9a85-06b0d08c5a9d  0.675132  0.357361  0.642639  f1Macro  single_table  classification  multi_class  31.712946         2.0   0.011041      0.026212   9.99484  12.618179  None  None
```

Alternatively, a `-r` option can be passed with the name of a CSV file, and the results will
be stored there:

```bash
abz search 185_baseball -c10,20,30 -b10 -r results.csv
```
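
The stored results can then be inspected with Python's standard library. The sketch below
assumes that the CSV file has a header row with `dataset` and `score` columns mirroring the
printed table; this layout is an assumption, so adjust the column names if your file differs.
`best_rows` is a hypothetical helper, not part of AutoBazaar.

```python
import csv


def best_rows(csv_path, higher_is_better=True):
    """Return, for each dataset, the row with the best `score`.

    Assumes `dataset` and `score` columns (an assumption about the file layout).
    Pass `higher_is_better=False` for error metrics such as meanSquaredError.
    """
    best = {}
    with open(csv_path, newline="") as csv_file:
        for row in csv.DictReader(csv_file):
            name, score = row["dataset"], float(row["score"])
            current = best.get(name)
            if current is None:
                best[name] = row
                continue
            current_score = float(current["score"])
            improved = score > current_score if higher_is_better else score < current_score
            if improved:
                best[name] = row
    return best
```

Calling `best_rows("results.csv")` returns a dictionary keyed by dataset name, from which
fields such as `row["pipeline"]` can be read.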

## What's next?

For more details about **AutoBazaar** and all its possibilities and features, please check the
[project documentation site](https://HDI-Project.github.io/AutoBazaar/)!

# Credits

AutoBazaar is an Open Source project from the Data to AI Lab at MIT built by the following team:

* Carles Sala <csala@csail.mit.edu>
* Micah Smith <micahs@mit.edu>
* Max Kanter <max.kanter@gmail.com>
* Kalyan Veeramachaneni <kalyanv@mit.edu>

## Citing AutoBazaar

If you use AutoBazaar for your research, please consider citing the following paper (https://arxiv.org/pdf/1905.08942.pdf):

```
@article{smith2019mlbazaar,
  author = {Smith, Micah J. and Sala, Carles and Kanter, James Max and Veeramachaneni, Kalyan},
  title = {The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development},
  journal = {arXiv e-prints},
  year = {2019},
  eid = {arXiv:1905.08942},
  pages = {arXiv:1905.08942},
  archivePrefix = {arXiv},
  eprint = {1905.08942},
}
```


# History

## 0.1.0 - 2019-06-24

First Release.

This is a slightly cleaned up version of the software used to generate the results
explained in [The Machine Learning Bazaar Paper](https://arxiv.org/pdf/1905.08942.pdf).


