Metadata-Version: 2.1
Name: EtaML
Version: 0.0.9
Summary: An automated machine learning platform with a focus on explainability
Author: Clemens Spielvogel
Author-email: Clemens Spielvogel <clemens.spielvogel@gmail.com>
Project-URL: Homepage, https://github.com/cspielvogel/ExplainableTabularAutoML
Project-URL: Bug Tracker, https://github.com/cspielvogel/ExplainableTabularAutoML/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: Microsoft :: Windows :: Windows 11
Classifier: Operating System :: Unix
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: asttokens (==2.0.5)
Requires-Dist: atomicwrites (==1.4.0)
Requires-Dist: attrs (==21.4.0)
Requires-Dist: backcall (==0.2.0)
Requires-Dist: Brotli (==1.0.9)
Requires-Dist: category-encoders (==2.4.1)
Requires-Dist: certifi (==2022.6.15)
Requires-Dist: cffi (==1.15.0)
Requires-Dist: charset-normalizer (==2.0.12)
Requires-Dist: click (==8.1.3)
Requires-Dist: cloudpickle (==2.1.0)
Requires-Dist: colorama (==0.4.4)
Requires-Dist: colour (==0.1.5)
Requires-Dist: cycler (==0.11.0)
Requires-Dist: dash (==2.4.1)
Requires-Dist: dash-core-components (==2.0.0)
Requires-Dist: dash-cytoscape (==0.3.0)
Requires-Dist: dash-html-components (==2.0.0)
Requires-Dist: dash-table (==5.0.0)
Requires-Dist: debugpy (==1.6.0)
Requires-Dist: decorator (==5.1.1)
Requires-Dist: dill (==0.3.5.1)
Requires-Dist: dtreeviz (==1.3.6)
Requires-Dist: entrypoints (==0.4)
Requires-Dist: executing (==0.8.3)
Requires-Dist: Flask (==2.1.2)
Requires-Dist: Flask-Compress (==1.12)
Requires-Dist: fonttools (==4.33.3)
Requires-Dist: gevent (==21.12.0)
Requires-Dist: graphviz (==0.20)
Requires-Dist: greenlet (==1.1.2)
Requires-Dist: htmlmin (==0.1.12)
Requires-Dist: idna (==3.3)
Requires-Dist: ImageHash (==4.2.1)
Requires-Dist: imageio (==2.19.2)
Requires-Dist: imbalanced-learn (==0.8.0)
Requires-Dist: imblearn (==0.0)
Requires-Dist: importlib-metadata (==4.11.4)
Requires-Dist: iniconfig (==1.1.1)
Requires-Dist: install (==1.3.5)
Requires-Dist: interpret (==0.2.7)
Requires-Dist: interpret-core (==0.2.7)
Requires-Dist: ipykernel (==6.13.0)
Requires-Dist: ipython (==8.3.0)
Requires-Dist: itsdangerous (==2.1.2)
Requires-Dist: jedi (==0.18.1)
Requires-Dist: Jinja2 (==3.1.2)
Requires-Dist: joblib (==1.1.0)
Requires-Dist: jupyter-client (==7.3.1)
Requires-Dist: jupyter-core (==4.10.0)
Requires-Dist: kiwisolver (==1.4.2)
Requires-Dist: lime (==0.2.0.1)
Requires-Dist: llvmlite (==0.38.1)
Requires-Dist: MarkupSafe (==2.1.1)
Requires-Dist: matplotlib (==3.4.3)
Requires-Dist: matplotlib-inline (==0.1.3)
Requires-Dist: miceforest (==5.6.3)
Requires-Dist: missingno (==0.5.1)
Requires-Dist: mrmr-selection (==0.2.5)
Requires-Dist: multimethod (==1.8)
Requires-Dist: multiprocess (==0.70.13)
Requires-Dist: nest-asyncio (==1.5.5)
Requires-Dist: networkx (==2.8.2)
Requires-Dist: numba (==0.55.1)
Requires-Dist: numpy (==1.21.6)
Requires-Dist: packaging (==21.3)
Requires-Dist: pandas (==1.4.2)
Requires-Dist: pandas-profiling (==3.2.0)
Requires-Dist: parso (==0.8.3)
Requires-Dist: pathos (==0.2.9)
Requires-Dist: patsy (==0.5.2)
Requires-Dist: pexpect (==4.8.0)
Requires-Dist: phik (==0.12.2)
Requires-Dist: pickleshare (==0.7.5)
Requires-Dist: Pillow (==9.1.1)
Requires-Dist: plotly (==5.8.0)
Requires-Dist: pluggy (==1.0.0)
Requires-Dist: pox (==0.3.1)
Requires-Dist: ppft (==1.7.6.5)
Requires-Dist: prompt-toolkit (==3.0.29)
Requires-Dist: psutil (==5.9.1)
Requires-Dist: ptyprocess (==0.7.0)
Requires-Dist: pure-eval (==0.2.2)
Requires-Dist: py (==1.11.0)
Requires-Dist: pycparser (==2.21)
Requires-Dist: pydantic (==1.9.1)
Requires-Dist: Pygments (==2.12.0)
Requires-Dist: pynndescent (==0.5.7)
Requires-Dist: pyparsing (==3.0.9)
Requires-Dist: pytest (==7.1.2)
Requires-Dist: python-dateutil (==2.8.2)
Requires-Dist: pytz (==2022.1)
Requires-Dist: PyWavelets (==1.3.0)
Requires-Dist: PyYAML (==6.0)
Requires-Dist: pyzmq (==23.0.0)
Requires-Dist: requests (==2.27.1)
Requires-Dist: SALib (==1.4.5)
Requires-Dist: scikit-image (==0.19.2)
Requires-Dist: scikit-learn (==1.1.0)
Requires-Dist: scipy (==1.9.1)
Requires-Dist: seaborn (==0.11.2)
Requires-Dist: shap (==0.40.0)
Requires-Dist: six (==1.16.0)
Requires-Dist: sklearn (==0.0)
Requires-Dist: skope-rules (==1.0.1)
Requires-Dist: slicer (==0.0.7)
Requires-Dist: stack-data (==0.2.0)
Requires-Dist: statsmodels (==0.13.2)
Requires-Dist: tangled-up-in-unicode (==0.2.0)
Requires-Dist: tenacity (==8.0.1)
Requires-Dist: threadpoolctl (==3.1.0)
Requires-Dist: tifffile (==2022.5.4)
Requires-Dist: tomli (==2.0.1)
Requires-Dist: tornado (==6.1)
Requires-Dist: tqdm (==4.64.0)
Requires-Dist: traitlets (==5.2.1.post0)
Requires-Dist: treeinterpreter (==0.2.3)
Requires-Dist: typing-extensions (==4.2.0)
Requires-Dist: umap-learn (==0.5.3)
Requires-Dist: urllib3 (==1.26.9)
Requires-Dist: visions (==0.7.4)
Requires-Dist: wcwidth (==0.2.5)
Requires-Dist: Werkzeug (==2.1.2)
Requires-Dist: xgboost (==1.6.1)
Requires-Dist: zipp (==3.8.0)
Requires-Dist: zope.event (==4.5.0)
Requires-Dist: zope.interface (==5.4.0)

![img](./Assets/etaml_logo-transpartent.svg#center)

## Summary
This project provides a template for solving classification problems on tabular data.
The template handles *binary and multi-class* problems. Among other things, it includes an *exploratory data analysis*, a *preprocessing* pipeline applied before train/test splitting, a fold-wise preprocessing pipeline applied after train/test splitting, a scalable and robust *Monte Carlo cross-validation* scheme, *various classification algorithms* evaluated on *multiple performance metrics*, and a set of *explainable artificial intelligence* capabilities, including visualizations.
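
To illustrate how Monte Carlo cross-validation and fold-wise preprocessing fit together, here is a minimal standard-library sketch. It is not EtaML's implementation: the function names `monte_carlo_splits` and `standardize_fold`, the toy data and all parameter values are assumptions made for this example. The point it shows is that scaling statistics are computed from the training part of each random split only, so no information from the test part leaks into preprocessing.

```python
import random
import statistics


def monte_carlo_splits(n_samples, n_iterations, test_fraction, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_iterations):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # First n_test shuffled indices form the test set, the rest the training set
        yield shuffled[n_test:], shuffled[:n_test]


def standardize_fold(train_values, test_values):
    """Standardize one feature using statistics from the training part only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against constant features
    train_scaled = [(v - mean) / std for v in train_values]
    test_scaled = [(v - mean) / std for v in test_values]
    return train_scaled, test_scaled


# Toy single-feature data set (hypothetical values, for illustration only)
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

splits = list(monte_carlo_splits(len(feature), n_iterations=5, test_fraction=0.2))
for train_idx, test_idx in splits:
    train_scaled, test_scaled = standardize_fold(
        [feature[i] for i in train_idx],
        [feature[i] for i in test_idx],
    )
    # A model would be fitted on train_scaled and evaluated on test_scaled here
```

Unlike k-fold cross-validation, the test sets of different iterations may overlap, and the number of iterations can be chosen independently of the test set size.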

<img src="./Assets/tct_flow_simple.png" alt="Workflow diagram" width="600"/>

Content:

- Exploratory data analysis 
    - Report via Pandas Profiling
    - Visualization by dimensionality reduction (via PCA, t-SNE and UMAP)
- Preprocessing
    - Removing all-NA instances
    - Removing features with constant value over all instances (ignoring NaNs)
    - Removing features based on a user-provided threshold for the ratio of missing values
    - One-hot encoding of non-numeric features
- Fold-wise preprocessing
    - Normalization / Standardization
    - Filling missing values using kNN or MICE imputation
    - Resampling for handling label imbalances via SMOTE
- Performance estimation using Monte Carlo cross-validation with multiple metrics
    - Accuracy
    - Area under the receiver operating characteristic curve (AUC)
    - Balanced accuracy
    - Sensitivity / Recall / True positive rate
    - Specificity / True negative rate
    - Positive predictive value (Precision)
    - Negative predictive value
- Receiver operating characteristic curve
- Feature selection using mRMR (or univariate filter methods)
- Hyperparameter optimization (using cross-validated randomized search)
- Training and evaluation of multiple classification algorithms
    - Explainable boosting machine (EBM)
    - Extreme gradient boosting (XGBoost)
    - k-nearest neighbors (kNN)
    - Decision tree (DT)
    - Random forest (RF)
    - Neural network (NN)
    - Support vector machine (SVM)
    - Logistic regression (LGR)
- Probability calibration (not supported for EBM)
    - Calibration plots with Brier score
- Explainable Artificial Intelligence (XAI)
    - Permutation feature importance (+ visualizations)
    - Individual conditional expectation (ICE) and partial dependence plots (PDPs)
    - EBM-specific global feature-wise PDPs
    - SHAP values (+ summary visualization)
    - Surrogate models (approximation via DT and EBM)
- Visualization of performance evaluation
    - Performance metrics for each classification model
    - Confusion matrices
    - Detailed list of predictions for each cross-validation sample
    - Receiver operating characteristic (ROC) curve for each model
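
Most of the threshold-based metrics listed above follow directly from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are hypothetical, and how EtaML aggregates these metrics for multi-class problems is not covered here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion matrix counts."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a metric is undefined
        return numerator / denominator if denominator else float("nan")

    sensitivity = ratio(tp, tp + fn)  # recall / true positive rate
    specificity = ratio(tn, tn + fp)  # true negative rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),  # positive predictive value / precision
        "npv": ratio(tn, tn + fn),  # negative predictive value
    }


# Hypothetical counts, for illustration only
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

AUC is not included because it depends on the ranking of predicted probabilities rather than on a single confusion matrix.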

## Output
Output content:

- EDA: results as `html` report
- Intermediate data: preprocessed data for final models as `csv`
- Input data: input table and settings
- Models: `joblib` objects and tuned hyperparameters as `json`
- Performance: confusion matrices and overall performance metrics for each model as `csv` and visualization as `svg`
- XAI: partial dependence plots, permutation feature importances and SHAP summary plots as `csv` and `svg`
    
Output structure:
    
```
Results/
    ├── EDA
    |   ├── exploratory_data_analysis.html
    |   └── umap.html
    ├── Intermediate_data
    |   ├── preprocessed_features.csv
    |   └── preprocessed_labels.csv
    ├── Models
    |   ├── ebm_model.pickle
    |   ├── ebm_model_hyperparameters.json
    |   └── ... (other pickled models and hyperparameters)
    ├── Performance
    |   ├── confusion_matrix-ebm.csv
    |   ├── confusion_matrix-ebm.png
    |   ├── ... (other models confusion matrices)
    |   ├── performance.png
    |   └── performances.csv
    └── XAI
        ├── Partial_dependence_plots
        |   ├── partial_dependence-ebm_feature-1_class-A.png
        |   └── ... (PDPs of other features, models and classes)
        ├── Permutation_importances
        |   ├── permutation_importance_ebm-test.png
        |   ├── permutation_importance_ebm-train.png
        |   └── ... (other models permutation importances for train and test set)
        ├── Surrogate_models
        |   ├── dt_surrogate_model_for_opaque_model.pickle
        |   ├── ebm_surrogate_model_for_opaque_model.pickle
        |   └── dt_surrogate_model_for_opaque_model.svg
        └── SHAP
            ├── label-0_shap-values.csv
            ├── ... (other labels shap values if multiclass)
            ├── shap_summary-ebm.png
            └── ... (other models shap summary plots)
```
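
As a rough sketch of how the `Models` directory could be consumed after a run, the helper below pairs each pickled model with its hyperparameter file, assuming the `<name>_model.pickle` / `<name>_model_hyperparameters.json` naming shown in the tree above. The function `list_model_artifacts` is hypothetical and not part of EtaML; the temporary directory only stands in for a real `Results/` folder so the example runs on its own.

```python
import tempfile
from pathlib import Path


def list_model_artifacts(results_dir):
    """Map each model name to its pickle and hyperparameter file paths."""
    models_dir = Path(results_dir) / "Models"
    artifacts = {}
    for model_path in sorted(models_dir.glob("*_model.pickle")):
        name = model_path.name[: -len("_model.pickle")]
        params_path = models_dir / f"{name}_model_hyperparameters.json"
        artifacts[name] = {
            "model": model_path,
            # None if no matching hyperparameter file was written
            "hyperparameters": params_path if params_path.exists() else None,
        }
    return artifacts


# Stand-in for a real results folder (assumption, for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "Models"
    models.mkdir()
    (models / "ebm_model.pickle").touch()
    (models / "ebm_model_hyperparameters.json").write_text("{}")
    found = list_model_artifacts(tmp)
```

Since the models are described as `joblib` objects, the returned model paths can presumably be passed to `joblib.load`, in an environment that has the same package versions installed as the one used for training.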

## Installation
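Recommended: Create and activate a virtual environment. The commands below are for Unix shells; on Windows, run `Scripts\activate` instead of `source bin/activate`.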
```
python3 -m venv /path/to/new/virtual/environment
cd /path/to/new/virtual/environment
source bin/activate
```

Clone this repository, navigate to its directory and install the packages listed in the supplied `requirements.txt`. The project was built using Python 3.9.5.
```
pip install -r requirements.txt
```
Alternatively, the individual packages contained in the `requirements.txt` file can be installed manually.

Afterwards, run the software using:
```
python etaml.py --config ../Example_data/settings.ini
```


## System specifications
The software was tested with the following specifications:

- Ubuntu 18.04 LTS (64-bit)
- Ubuntu 20.04 LTS (64-bit)
- Windows 11 (64-bit)
- Python 3.8
- Python 3.9
