Metadata-Version: 2.1
Name: PyImpuyte
Version: 1.3.5
Summary: Intelligent imputation using tree-based and machine learning algorithms
Home-page: https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte
Author: Marcus Suresh, Ronnie Taib
Author-email: marcus.suresh@industry.gov.au, marcus.suresh@data61.csiro.au, ronnie.taib@data61.csiro.au
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.6
Description-Content-Type: text/markdown

# PyImpuyte
[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)
[![Generic badge](https://img.shields.io/badge/PyPi-passing-<COLOR>.svg)](https://test.pypi.org/project/PyImpuyte/1.1.5/)
[![Documentation Status](https://readthedocs.org/projects/pyimpuyte/badge/?version=latest)](https://pyimpuyte.readthedocs.io/en/latest/?badge=latest)
[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte/browse/LICENSE)
[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)]()
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](CODE_OF_CONDUCT.md)

<span style="font-size:1.5em;">`PyImpuyte` is a Python 3.7+ package that simplifies the task of imputing missing values in datasets.</span>

<p align="center">
  <img width="530" height="600" src="https://s3-marcus-public.s3-eu-west-1.amazonaws.com/PyImpuyte_1.PNG">
</p>

<span style="font-size:1.5em;">`PyImpuyte` was designed with a strong customer-centric focus and builds on `scikit-learn`. It brings together various imputation strategies and harnesses <b>machine learning algorithms</b> to improve data coverage.</span>

<span style="font-size:1.5em;">`PyImpuyte` gives the user exactly what they want: hassle-free deployment of machine learning algorithms. Simply ingest your data, set your target, pass in a feature matrix and select an imputation strategy. The machine-generated imputed values are then appended to your dataframe.</span>

<span style="font-size:1.5em;">To learn more about how to use `PyImpuyte`, check out our <b>[docs](https://pyimpuyte.readthedocs.io/en/latest/)</b> for a step-by-step guide.</span>


## Contents
- [Motivation](#motivation)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Contribute](#contribute)
- [Conferences and Meet-ups](#conferences-and-meet-ups)
- [Citation](#citation)
- [Developers and Maintainers](#developers-and-maintainers)
- [Acknowledgements](#acknowledgements)
- [Copyright](#copyright)


## Motivation
Incomplete data are common and can degrade statistical inference. The `PyImpuyte` team therefore set out to develop a Python package that simplifies the task of imputing missing values in Australian Government national statistical assets and other micro-data sets.

The development of `PyImpuyte` is motivated by helping micro-data practitioners select and implement advanced imputation methods. `PyImpuyte` adds a tool to the toolkit of practitioners seeking to preserve their data and fight the information loss that arises from dropping observations with missing values.

  #### Main Features
  * Interfaces with `scikit-learn` to provide a customer-centric and efficient way to perform imputation using machine learning algorithms.
  * Support for numerous imputation strategies and performance metrics, as specified below:


  #### Imputation Strategies

  | Univariate | Generalised Linear Models | Bagging and Boosted Trees   | Neural Nets            |
  | :--------- | :------------------------ | :-------------------------- | :--------------------- |
  | Mean       | Linear Regression         | Bagging Regressor           | Multi-layer Perceptron |
  | Median     | Lasso                     | Extra Trees Regressor       |                        |
  | Mode       | Ridge                     | Extreme Gradient Boosting   |                        |
  |            |                           | Random Forest Regressor     |                        |
  |            |                           | XGBoost, LightGBM, CatBoost |                        |
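
The model-based strategies in the table share one idea: fit a regressor on the rows where the target is observed, then predict the target for the rows where it is missing. The sketch below illustrates that idea with plain `scikit-learn` and `pandas`. It is not the `PyImpuyte` API: the function name `impute_with_model`, the `_imputed` column suffix, and the synthetic data are assumptions made for illustration, and it assumes the feature columns themselves contain no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_model(df, features, target, model=None):
    """Return a copy of df with a '<target>_imputed' column (hypothetical helper)."""
    if model is None:
        model = RandomForestRegressor(n_estimators=50, random_state=0)
    observed = df[target].notna()
    # Fit only on rows where the target is known.
    model.fit(df.loc[observed, features], df.loc[observed, target])
    out = df.copy()
    column = f"{target}_imputed"
    out[column] = out[target]
    if (~observed).any():
        # Fill only the gaps; observed values are left untouched.
        out.loc[~observed, column] = model.predict(df.loc[~observed, features])
    return out


# Synthetic data; the column names are borrowed from the Quick Start config.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "TURNOVER": rng.uniform(100, 1000, 200),
    "WAGES": rng.uniform(50, 500, 200),
    "SALES": rng.uniform(100, 900, 200),
})
df["FTE"] = 0.01 * df["TURNOVER"] + 0.02 * df["WAGES"] + rng.normal(0, 0.5, 200)
# Blank out 20% of the target to simulate missing values.
df.loc[df.sample(frac=0.2, random_state=0).index, "FTE"] = np.nan

result = impute_with_model(df, ["TURNOVER", "WAGES", "SALES"], "FTE")
print(result[["FTE", "FTE_imputed"]].head())
```

Any regressor with `fit` and `predict` methods (for example `Lasso` or `Ridge` from `sklearn.linear_model`) can be passed as `model` to try a different strategy.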


  #### Performance Metrics
  | Metric                                        |
  | :-------------------------------------------- |
  | Simple Error                                  |
  | Percentage Error                              |
  | Naive Forecasting                             |
  | Relative Error                                |
  | Bounded Relative Error                        |
  | Geometric Mean                                |
  | Mean Squared Error                            |
  | Normalized Root Mean Squared Error            |
  | Mean Error                                    |
  | Mean Absolute Error                           |
  | Geometric Mean Absolute Error                 |
  | Median Absolute Error                         |
  | Mean Percentage Error                         |
  | Mean Absolute Percentage Error                |
  | Median Absolute Percentage Error              |
  | Symmetric Mean Absolute Percentage Error      |
  | Symmetric Median Absolute Percentage Error    |
  | Mean Arctangent Absolute Percentage Error     |
  | Mean Absolute Scaled Error                    |
  | Normalized Absolute Error                     |
  | Normalized Absolute Percentage Error          |
  | Root Mean Squared Percentage Error            |
  | Root Median Squared Percentage Error          |
  | Root Mean Squared Scaled Error                |
  | Integral Normalized Root Squared Error        |
  | Root Relative Squared Error                   |
  | Mean Relative Error                           |
  | Median Relative Absolute Error                |
  | Geometric Mean Relative Absolute Error        |
  | Mean Bounded Relative Absolute Error          |
  | Unscaled Mean Bounded Relative Absolute Error |
  | Mean Directional Accuracy                     |
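
To make a few of these metrics concrete, the sketch below computes four of them with `numpy` using their common textbook definitions. The function names are hypothetical, and `PyImpuyte` may define or scale the metrics differently (for example, reporting percentage errors multiplied by 100).

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))


def mean_squared_error(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean((actual - predicted) ** 2))


def mean_absolute_percentage_error(actual, predicted):
    # Expressed as a fraction; undefined when any actual value is zero.
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)))


def symmetric_mape(actual, predicted):
    # One common variant: 2|a - p| / (|a| + |p|), averaged over observations.
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(2 * np.abs(actual - predicted)
                         / (np.abs(actual) + np.abs(predicted))))


actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 40]

print(mean_absolute_error(actual, predicted))   # 1.75
print(mean_squared_error(actual, predicted))    # 4.25
print(mean_absolute_percentage_error(actual, predicted))
print(symmetric_mape(actual, predicted))
```

In an imputation setting, `actual` would typically be known values that were deliberately held out, and `predicted` the values imputed in their place.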


  #### Versions and Dependencies
  * Python 3.7+
  * Dependencies:
      - `missingno` >= 0.4.1
      - `numpy` >= 1.15.4
      - `pandas` >= 0.20.3
      - `scikit-learn` >= 0.20.2
      - `xgboost` >= 0.83


## Installation
There are two ways to install the `PyImpuyte` package:

- Install `PyImpuyte` from PyPI (recommended):
```
pip install PyImpuyte==1.3.5
```
- Install `PyImpuyte` from the Bitbucket source:
```
git clone https://bitbucket.csiro.au/scm/dde/pyimpuyte.git
cd pyimpuyte
python setup.py install
```


## Quick Start
To start imputing missing values with `PyImpuyte`, you must supply a `config.json` file. The following configuration can be used as a starting point:

```json
{
    "pyimpuyte": {
        "input": [
            "data/synth_data_test.csv"
        ],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}
```
For more information about how to configure `PyImpuyte`, see our suggested **[template](https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte/browse/config.md)**.
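
As a rough illustration of how the keys in such a configuration might be consumed, the sketch below reads a config file and loads the input data with `pandas`. This is not `PyImpuyte`'s own loader: the function `load_inputs` is hypothetical, and the interpretation of each key is an assumption (`nrows` as a per-file row limit, `skip_columns` as a list of columns to drop, `drop_duplicates` as a flag to remove duplicate rows).

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_inputs(config_path):
    """Read the config and return (dataframe, feature_list, target). Hypothetical helper."""
    cfg = json.loads(Path(config_path).read_text())["pyimpuyte"]
    # Assumed meaning: read at most `nrows` rows from each input file.
    frames = [pd.read_csv(path, nrows=cfg.get("nrows")) for path in cfg["input"]]
    df = pd.concat(frames, ignore_index=True)
    if cfg.get("skip_columns"):
        df = df.drop(columns=cfg["skip_columns"])
    if cfg.get("drop_duplicates"):
        df = df.drop_duplicates()
    return df, cfg["feature_list"], cfg["target"]


# Self-contained demo using a temporary CSV and config file.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "synth_data_test.csv"
    csv_path.write_text(
        "TURNOVER,WAGES,SALES,FTE\n"
        "100,50,90,2\n"
        "100,50,90,2\n"   # duplicate row, removed by drop_duplicates
        "200,80,150,\n"   # missing FTE, a candidate for imputation
    )
    config = {
        "pyimpuyte": {
            "input": [str(csv_path)],
            "feature_list": ["TURNOVER", "WAGES", "SALES"],
            "target": "FTE",
            "skip_columns": None,
            "nrows": 1000,
            "drop_duplicates": True,
        }
    }
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(config))
    df, features, target = load_inputs(config_path)

print(df.shape)                       # (2, 4)
print(int(df[target].isna().sum()))   # 1
```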


## Contribute
We welcome all kinds of contributions that improve the performance of the currently published package. See the [Contribution Guide](https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte/browse/CONTRIBUTING.md) for more details.


## Conferences and Meet-ups
* We presented the research that led to the development of `PyImpuyte` at the **[2019 Australasian Joint Conference on Artificial Intelligence](http://nugget.unisa.edu.au/AI2019/index.php)**.

* We will be presenting at the next Canberra Data Scientists Meet-up on 28 July 2020.


## Citation
Please cite our work in your publications if it helps your research.

* Conference Paper - Chapter 18 of **[AI2019: Advances in Artificial Intelligence](https://link.springer.com/chapter/10.1007/978-3-030-35288-2_18)**.

```BibTeX
@inbook{inbook,
  author = {Suresh, Marcus and Taib, Ronnie and Zhao, Yanchang and Jin, Warren},
  year = {2019},
  month = {11},
  pages = {215-227},
  title = {Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning},
  isbn = {978-3-030-35287-5},
  doi = {10.1007/978-3-030-35288-2_18}
}
```

* Python Package - **[PyImpuyte](https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte)**.

```BibTeX
@misc{Suresh2020_PyImpuyte,
  title={PyImpuyte},
  author={Suresh, Marcus and others},
  year={2020},
  howpublished={\url{https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte}},
}
```


## Developers and Maintainers
* The developers began work to bring `PyImpuyte` into production in October 2019. `PyImpuyte` is actively maintained, with incremental improvements scheduled on a regular basis. The lead developers and maintainers are:

  * <b>Marcus Suresh</b>, Bitbucket: [sur033](https://bitbucket.csiro.au/users/sur033) and GitHub: [marcus-suresh](https://github.com/marcus-suresh)

  * <b>Ronnie Taib</b>, GitHub: [rtaib](https://github.com/rtaib)

* See the [Developers](https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte/browse/DEVELOPERS.rst) page to get in touch with the `PyImpuyte` team.


## Acknowledgements
* This research was funded by the Australian Government through the [Department of Industry, Science, Energy and Resources (DISER)](https://www.industry.gov.au/) and the [Data Integration Partnership for Australia (DIPA)](https://www.pmc.gov.au/public-data/data-integration-partnership-australia).

* The developers would like to extend their gratitude to Dr. Abrie Swanepoel (Branch Manager) and Dr. Tala Talgasawatta (Director) from DISER for their ongoing support of `PyImpuyte`.


## Copyright
`PyImpuyte` is distributed under the MIT license. See [LICENSE](https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte/browse/LICENSE) for details.

