Metadata-Version: 2.1
Name: adversarial-labeller
Version: 0.1.8
Summary: Sklearn compatiable model instance labelling tool to help validate models in situations involving data drift.
Home-page: https://github.com/robinsonkwame/adversarial_labeller
Author: Kwame Porter Robinson
Author-email: kwamepr@umich.edu
License: MIT license
Keywords: model selection,validation,data drift
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
Requires-Dist: attrs (==19.3.0)
Requires-Dist: bleach (==3.1.0)
Requires-Dist: certifi (==2019.11.28)
Requires-Dist: cffi (==1.13.2)
Requires-Dist: chardet (==3.0.4)
Requires-Dist: cryptography (==2.8)
Requires-Dist: docutils (==0.15.2)
Requires-Dist: idna (==2.8)
Requires-Dist: imbalanced-learn (==0.6.1)
Requires-Dist: jeepney (==0.4.1)
Requires-Dist: joblib (==0.14.1)
Requires-Dist: keyring (==21.0.0)
Requires-Dist: more-itertools (==8.0.2)
Requires-Dist: numpy (==1.18.0)
Requires-Dist: packaging (==19.2)
Requires-Dist: pandas (==0.25.3)
Requires-Dist: pkginfo (==1.5.0.1)
Requires-Dist: pluggy (==0.13.1)
Requires-Dist: py (==1.8.1)
Requires-Dist: pycparser (==2.19)
Requires-Dist: pygments (==2.5.2)
Requires-Dist: pyparsing (==2.4.6)
Requires-Dist: pytest (==5.3.2)
Requires-Dist: python-dateutil (==2.8.1)
Requires-Dist: pytz (==2019.3)
Requires-Dist: readme-renderer (==24.0)
Requires-Dist: requests-toolbelt (==0.9.1)
Requires-Dist: requests (==2.22.0)
Requires-Dist: scikit-learn (==0.22)
Requires-Dist: scipy (==1.4.1)
Requires-Dist: six (==1.13.0)
Requires-Dist: sklearn (==0.0)
Requires-Dist: tqdm (==4.41.0)
Requires-Dist: twine (==3.1.1)
Requires-Dist: urllib3 (==1.25.7)
Requires-Dist: wcwidth (==0.1.7)
Requires-Dist: webencodings (==0.5.1)
Requires-Dist: zipp (==0.6.0)
Requires-Dist: importlib-metadata (==1.3.0) ; python_version < "3.8"
Requires-Dist: secretstorage (==3.1.1) ; sys_platform == "linux"

# adversarial_labeller
---

Adversarial labeller is a sklearn compatible labeller that scores instances as belonging to the test dataset or not to help model selection under data drift. Adversarial labeller is distributed under the MIT license.

## Installation

*Dependencies*

Adversarial validator requires:
* Python (>= 3.7)
* scikit-learn (>= 0.21.0)
* [imbalanced learn](>= 0.5.0)
* [pandas](>= 0.25.0)

*User installation*

The easiest way to install adversarial validator is using 
```pip
pip install adversarial_labeller
```

*Example Usage*
```python
import numpy as np
import pandas as pd
from sklearn.datasets.samples_generator import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from adversarial_labeller import AdversarialLabelerFactory, Scorer

scoring_metric = accuracy_score

# Our blob data generation parameters for this example
number_of_samples = 1000
number_of_test_samples = 300

# Generate 1d blob data and label a portion as test data
# ... 1d blob data can be visualized as a rug plot
variables, labels = \
make_blobs(
    n_samples=number_of_samples,
    centers=2,
    n_features=1,
    random_state=0
)

df = pd.DataFrame(
{
    'independent_variable':variables.flatten(),
    'dependent_variable': labels,
    'label': 0  #  default to train data
}
)
test_indices = df.index[-number_of_test_samples:]
train_indices = df.index[:-number_of_test_samples]

df.loc[test_indices,'label'] = 1  # ... now we mark instances that are test data

# Now perturb the test samples to simulate data drift/different test distribution
df.loc[test_indices, "independent_variable"] +=\
np.std(df.independent_variable)

# ... now we have an example of data drift where adversarial labeling can be used to better estimate the actual test accuracy

features_for_labeller = df.independent_variable
labels_for_labeller = df.label

pipeline, flip_binary_predictions =\
    AdversarialLabelerFactory(
        features = features_for_labeller,
        labels = labels_for_labeller,
        run_pipeline = False
    ).fit_with_best_params()

scorer = Scorer(the_scorer=pipeline,
                flip_binary_predictions=flip_binary_predictions)

# Now we evaluate a classifer on training data only, but using
# our fancy adversarial labeller
_X = df.loc[train_indices]\
    .independent_variable\
    .values\
    .reshape(-1,1)

_X_test = df.loc[test_indices]\
            .independent_variable\
            .values\
            .reshape(-1,1)

# ... sklearn wants firmly defined shapes
clf_adver = RandomForestClassifier(n_estimators=100, random_state=1)
adversarial_scores =\
    cross_val_score(
        X=_X,
        y=df.loc[train_indices].dependent_variable,
        estimator=clf_adver,
        scoring=scorer.grade,
        cv=10,
        n_jobs=-1,
        verbose=1)
# ... and we get ~ 0.70 - 0.68
average_adversarial_score =\
    np.array(adversarial_scores).mean()

# ... let's see how this compares with normal cross validation
clf = RandomForestClassifier(n_estimators=100, random_state=1)
scores =\
    cross_val_score(
        X=_X,
        y=df.loc[train_indices].dependent_variable,
        estimator=clf,
        cv=10,
        n_jobs=-1,
        verbose=1)

# ... and we get ~ 0.92
average_score =\
    np.array(scores).mean()

# now let's see how this compares with the actual test score
clf_all = RandomForestClassifier(n_estimators=100, random_state=1)
clf_all.fit(_X,
            df.loc[train_indices].dependent_variable)

# ... actual test score is 0.70
actual_score =\
accuracy_score(
    clf_all.predict(_X_test),
    df.loc[test_indices].dependent_variable
)

adversarial_result = abs(average_adversarial_score - actual_score)
print(f"... adversarial labelled cross validation was {adversarial_result:.2f} points different than actual.")  # ... 0.00 - 0.02 points

cross_val_result = abs(average_score - actual_score)
print(f"... regular validation was {cross_val_result:.2f} points different than actual.")  # ... 0.23 points

#  See tests/ for additional examples, including against the Titanic and stock market trading
```


