Metadata-Version: 2.1
Name: DutchDraw
Version: 0.0.1
Summary: Determine (optimal) baselines for binary classification
Home-page: https://github.com/joris-pries/DutchDraw
Author: Etienne van de Bijl, Jan Klein, Joris Pries
Author-email: joris.pries@cwi.nl
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown

# DutchDraw

DutchDraw is a Python package for binary classification.

## Paper

This package is an implementation of the ideas from INSERTONZEPAPER, where VERHAALWATWEINDEPAPERDOEN.

### Citation
If you have used the DutchDraw package, please also cite: INSERTONZEBIBTEX

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install the package:

```bash
pip install DutchDraw
```

----

### Windows users

```bash
python -m pip install --upgrade --index-url https://test.pypi.org/simple/ DutchDraw
```

<!-- ```bash
python -m pip install DutchDraw
``` -->

or

```bash
py -m pip install --upgrade --index-url https://test.pypi.org/simple/ DutchDraw
```

<!-- ```bash
py -m pip install DutchDraw
``` -->

## Method

To properly assess the performance of a binary classification model, the score of a chosen measure should be compared with the score of a 'simple' baseline. For example, an accuracy of 0.9 is not that impressive if a model without any knowledge of the data attains an accuracy of 0.88.

### Basic baseline

Let `M` be the total number of samples, where `P` are positive and `N` are negative. Let `θ_star = round(θ * M) / M`. Randomly shuffle the samples and label the first `θ_star * M` samples as `1` and the rest as `0`. This gives a baseline for each `θ` in `[0,1]`. Our package can optimize (maximize and minimize) the baseline.
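The shuffle procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's implementation: the function name `shuffle_baseline_predictions` and its parameters are hypothetical. It places `round(theta * M)` ones at random positions, which is equivalent to shuffling the samples and labelling the first `round(theta * M)` of them as `1`.

```python
import random

def shuffle_baseline_predictions(m, theta, rng=random):
    """Return m baseline labels with round(theta * m) ones at random positions.

    Hypothetical helper for illustration only; not part of DutchDraw.
    """
    n_positive = round(theta * m)  # equals theta_star * M
    labels = [1] * n_positive + [0] * (m - n_positive)
    rng.shuffle(labels)  # random placement of the ones
    return labels

rng = random.Random(0)  # fixed seed so the example is reproducible
y_baseline = shuffle_baseline_predictions(m=10, theta=0.3, rng=rng)
print(sum(y_baseline))  # number of samples labelled 1
```

Whatever the seed, the number of ones is fixed by `theta`; only their positions are random.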

## Reasons to use

This package contains multiple functions. Let `y_true` be the actual labels and `y_pred` be the labels predicted by a model.

If:

* You want to determine an included measure --> `measure_score(y_true, y_pred, measure)`
* You want to get statistics of a baseline given `theta` --> `baseline_functions_given_theta(theta, y_true, measure)`
* You want to get statistics of the optimal baseline --> `optimized_baseline_statistics(y_true, measure)`
* You want the baseline without specifying `theta` --> `baseline_functions(y_true, measure)`

### List of all included measures

| Measure                                                                  |                                                                               Definition                                                                                |
| ------------------------------------------------------------------------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| TP                                                                       |                                                                                   TP                                                                                    |
| TN                                                                       |                                                                                   TN                                                                                    |
| FP                                                                       |                                                                                   FP                                                                                    |
| FN                                                                       |                                                                                   FN                                                                                    |
| TPR                                                                      |                                                                                 TP / P                                                                                  |
| TNR                                                                      |                                                                                 TN / N                                                                                  |
| FPR                                                                      |                                                                                 FP / N                                                                                  |
| FNR                                                                      |                                                                                 FN / P                                                                                  |
| PPV                                                                      |                                                                             TP / (TP + FP)                                                                              |
| NPV                                                                      |                                                                             TN / (TN + FN)                                                                              |
| FDR                                                                      |                                                                             FP / (TP + FP)                                                                              |
| FOR                                                                      |                                                                             FN / (TN + FN)                                                                              |
| ACC, ACCURACY                                                            |                                                                              (TP + TN) / M                                                                              |
| BACC, BALANCED ACCURACY                                                  |                                                                             (TPR + TNR) / 2                                                                             |
| FBETA, FSCORE, F, F BETA, F BETA SCORE, FBETA SCORE                      |                                            ((1 + β<sup>2</sup>) * TP) / ((1 + β<sup>2</sup>) * TP + β<sup>2</sup> * FN + FP)                                            |
| MCC, MATTHEW, MATTHEWS CORRELATION COEFFICIENT                           |                                                       (TP * TN - FP * FN) / (sqrt((TP + FP) * (TN + FN) * P * N))                                                       |
| BM, BOOKMAKER INFORMEDNESS, INFORMEDNESS                                 |                                                                              TPR + TNR - 1                                                                              |
| MK                                                                       |                                                                              PPV + NPV - 1                                                                              |
| COHEN, COHENS KAPPA, KAPPA                                               | (P<sub>o</sub> - P<sub>e</sub>) / (1 - P<sub>e</sub>) with P<sub>o</sub> = (TP + TN) / M and <br> P<sub>e</sub> = ((TP + FP) / M) * (P / M) + ((TN + FN) / M) * (N / M) |
| G1, GMEAN1, G MEAN 1, FOWLKES-MALLOWS, FOWLKES MALLOWS, FOWLKES, MALLOWS |                                                                             sqrt(TPR * PPV)                                                                             |
| G2, GMEAN2, G MEAN 2                                                     |                                                                             sqrt(TPR * TNR)                                                                             |
| TS, THREAT SCORE, CRITICAL SUCCES INDEX, CSI                             |                                                                           TP / (TP + FN + FP)                                                                           |
| PT, PREVALENCE THRESHOLD                                                 |                                                                  (sqrt(TPR * FPR) - FPR) / (TPR - FPR)                                                                  |
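
To make the definitions in the table concrete, the following sketch computes a handful of them directly from the confusion-matrix counts. It is a plain-Python illustration of the formulas, not the code behind `measure_score`; the helper names `confusion_counts` and `selected_measures` are hypothetical, and the sketch does not guard against division by zero (for example when no sample is predicted positive).

```python
from math import sqrt

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for 0/1 labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def selected_measures(y_true, y_pred, beta=1.0):
    """Evaluate a few table entries; assumes no denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    p, n = tp + fn, tn + fp            # actual positives and negatives
    tpr, tnr = tp / p, tn / n
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    b2 = beta ** 2
    return {
        "ACC": (tp + tn) / (p + n),
        "BACC": (tpr + tnr) / 2,
        "FBETA": ((1 + b2) * tp) / ((1 + b2) * tp + b2 * fn + fp),
        "MK": ppv + npv - 1,
        "G2": sqrt(tpr * tnr),
    }

# Small example: TP = 2, FN = 1, FP = 1, TN = 4
example_true = [1, 1, 1, 0, 0, 0, 0, 0]
example_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = selected_measures(example_true, example_pred)
print(scores["ACC"])  # 0.75
```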

## Usage

As an example, we first generate the true and predicted labels.

```python
import random
random.seed(123) # To ensure similar outputs

y_pred = random.choices((0,1), k = 10000, weights = (0.9, 0.1))
y_true = random.choices((0,1), k = 10000, weights = (0.9, 0.1))
```

----

### Measure performance

In general, to determine the score of a measure, use `measure_score(y_true, y_pred, measure, beta = 1)`.

#### Input

* `y_true` (list or numpy.ndarray): 1-dimensional boolean list/numpy.ndarray containing the true labels.

* `y_pred` (list or numpy.ndarray): 1-dimensional boolean list/numpy.ndarray containing the predicted labels.

* `measure` (string): Measure name, see `all_names_except([''])` for possible measure names.

* `beta` (float): Default is 1. Parameter for the F-beta score.

#### Output

* `float`: The score of the given measure evaluated with the predicted and true labels.

#### Example

To examine the performance of the predicted labels, we measure the markedness (MK) and F<sub>2</sub> score (FBETA).

```python
import DutchDraw as bbl

# Measuring markedness (MK):
print('Markedness: {:06.4f}'.format(bbl.measure_score(y_true, y_pred, measure = 'MK')))

# Measuring FBETA for beta = 2:
print('F2 Score: {:06.4f}'.format(bbl.measure_score(y_true, y_pred, measure = 'FBETA', beta = 2)))
```

This gives the following output:

```python
Markedness: 0.0061
F2 Score: 0.1053
```

Note that `FBETA` is the only measure that requires an additional parameter value.

----

### Get basic baseline given `theta`

To obtain the basic baseline given `theta` use `baseline_functions_given_theta(theta, y_true, measure, beta = 1)`.

#### Input

* `theta` (float): Parameter for the shuffle baseline.

* `y_true` (list or numpy.ndarray): 1-dimensional boolean list/numpy.ndarray containing the true labels.

* `measure` (string): Measure name, see `all_names_except([''])` for possible measure names.

* `beta` (float): Default is 1. Parameter for the F-beta score.

#### Output

The function `baseline_functions_given_theta` gives the following output:

* `dict`: Containing `Mean` and `Variance`
  * `Mean` (float): Expected baseline given `theta`.
  * `Variance` (float): Variance of the baseline given `theta`.

#### Example

To evaluate the performance of a model, we want to obtain a baseline for the F<sub>2</sub> score (FBETA).

```python
results_baseline = bbl.baseline_functions_given_theta(theta = 0.5, y_true = y_true, measure = 'FBETA', beta = 2)
```

This gives us the mean and variance of the baseline.

```python
print('Mean: {:06.4f}'.format(results_baseline['Mean']))
print('Variance: {:06.4f}'.format(results_baseline['Variance']))
```

with output

```python
Mean: 0.2829
Variance: 0.0001
```
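
For intuition about where such a mean and variance can come from, here is a sketch derived only from the baseline description and the FBETA definition in the table, not from the package source. Under the shuffle baseline, `round(theta * M)` samples are labelled `1` at random, so TP follows a hypergeometric distribution. Because TP + FN = P and TP + FP = `round(theta * M)`, the FBETA denominator is constant and the score is a linear function of TP. The function name `fbeta_baseline_moments` is hypothetical, and the sketch assumes `M > 1` and a nonzero denominator.

```python
def fbeta_baseline_moments(m, p, theta, beta=1.0):
    """Mean and variance of FBETA under the shuffle baseline (illustrative).

    m: total number of samples, p: number of actual positives.
    """
    n_pos = round(theta * m)  # samples labelled 1, i.e. theta_star * M
    # Hypergeometric moments of TP: n_pos draws from m items, p of them positive
    mean_tp = n_pos * p / m
    var_tp = n_pos * (p / m) * ((m - p) / m) * (m - n_pos) / (m - 1)
    # FBETA = (1 + beta^2) * TP / (beta^2 * P + n_pos), linear in TP
    scale = (1 + beta ** 2) / (beta ** 2 * p + n_pos)
    return {"Mean": scale * mean_tp, "Variance": scale ** 2 * var_tp}

moments = fbeta_baseline_moments(m=10, p=4, theta=0.5, beta=1)
print(moments)
```

For this toy input the mean is 4/9 and the variance is 8/243, which can be checked by hand from the hypergeometric formulas.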

----

### Get basic baseline

To obtain the basic baseline without specifying `theta` use `baseline_functions(y_true, measure, beta = 1)`.

#### Input

* `y_true` (list or numpy.ndarray): 1-dimensional boolean list/numpy.ndarray containing the true labels.

* `measure` (string): Measure name, see `all_names_except([''])` for possible measure names.

* `beta` (float): Default is 1. Parameter for the F-beta score.

#### Output

The function `baseline_functions` gives the following output:

* `dict`: Containing `Distribution`, `Domain`, `(Fast) Expectation Function` and `Variance Function`.

  * `Distribution` (function): Pmf of the measure, given by: `pmf_Y(y, theta)`, where `y` is a measure score and `theta` is the parameter of the shuffle baseline.

  * `Domain` (function): Function that returns attainable measure scores with argument `theta`.

  * `(Fast) Expectation Function` (function): Expectation function of the baseline with `theta` as argument. If `Fast Expectation Function` is returned, there exists a theoretical expectation that can be used for fast computation.

  * `Variance Function` (function): Variance function for all values of `theta`.

#### Example

Next, we determine the baseline without specifying `theta`. This returns a number of functions that can be used for different values of `theta`.

```python
baseline = bbl.baseline_functions(y_true = y_true, measure = 'G2')
print(baseline.keys())
```

with output

```python
dict_keys(['Distribution', 'Domain', 'Fast Expectation Function', 'Variance Function', 'Expectation Function'])
```

To inspect the expected value of `G2` for different `theta` values, we do:

```python
import matplotlib.pyplot as plt
import numpy as np
theta_values = np.arange(0, 1 + 0.01, 0.01)
expected_value_plot = [baseline['Expectation Function'](theta) for theta in theta_values]
plt.plot(theta_values, expected_value_plot)
plt.xlabel('Theta')
plt.ylabel('Expected value')
plt.show()
```

with output:

![expectation example](DutchDraw/expected_value_function_example.png)

The variance can be determined similarly:

```python
theta_values = np.arange(0, 1 + 0.01, 0.01)
variance_plot = [baseline['Variance Function'](theta) for theta in theta_values]
plt.plot(theta_values, variance_plot)
plt.xlabel('Theta')
plt.ylabel('Variance')
plt.show()
```

with output:

![variance example](DutchDraw/variance_function_example.png)

`Distribution` is a function with two arguments: `y` and `theta`. Let's investigate the distribution for `theta = 0.5` using `Domain`.

```python
theta = 0.5
pmf_values = [baseline['Distribution'](y, theta) for y in baseline['Domain'](theta)]
plt.plot(baseline['Domain'](theta), pmf_values)
plt.xlabel('Measure score')
plt.ylabel('Probability mass')
plt.show()
```

with output:

![pmf example](DutchDraw/pmf_example.png)
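
As a rough illustration of how a probability mass function like this can arise, the sketch below enumerates every attainable TP count under the shuffle baseline, assigns it its hypergeometric probability, and maps it to the corresponding G2 score from the table. It is an independent sketch, not the package's `Distribution` or `Domain` functions; the name `g2_baseline_pmf` is hypothetical, and it assumes at least one positive and one negative sample. `math.comb` requires Python 3.8 or newer.

```python
from math import comb, sqrt

def g2_baseline_pmf(m, p, theta):
    """Map each attainable G2 score to its probability (illustrative)."""
    n_neg = m - p                  # actual negatives
    n_pos_pred = round(theta * m)  # samples labelled 1 by the baseline
    pmf = {}
    # TP can only take values consistent with the fixed margins
    for tp in range(max(0, n_pos_pred - n_neg), min(p, n_pos_pred) + 1):
        fp = n_pos_pred - tp
        prob = comb(p, tp) * comb(n_neg, fp) / comb(m, n_pos_pred)
        tn = n_neg - fp
        score = sqrt((tp / p) * (tn / n_neg))  # G2 = sqrt(TPR * TNR)
        pmf[score] = pmf.get(score, 0.0) + prob  # merge equal scores
    return pmf

pmf = g2_baseline_pmf(m=10, p=4, theta=0.5)
for score in sorted(pmf):
    print('{:06.4f}: {:06.4f}'.format(score, pmf[score]))
```

The keys of the returned dictionary play the role of the domain, and the probabilities sum to one.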

----

### Get optimal baseline

To obtain the optimal baseline use `optimized_baseline_statistics(y_true, measure = possible_names, beta = 1)`.

#### Input

* `y_true` (list or numpy.ndarray): 1-dimensional boolean list/numpy.ndarray containing the true labels.

* `measure` (string): Measure name, see `all_names_except([''])` for possible measure names.

* `beta` (float): Default is 1. Parameter for the F-beta score.

#### Output

The function `optimized_baseline_statistics` gives the following output:

* dict: Containing `Max Expected Value`, `Argmax Expected Value`, `Min Expected Value` and `Argmin Expected Value`.
  * `Max Expected Value` (float): Maximum of the expected values for all `theta`.
  * `Argmax Expected Value` (list): List of all `theta_star` values that maximize the expected value.
  * `Min Expected Value` (float): Minimum of the expected values for all `theta`.
  * `Argmin Expected Value` (list): List of all `theta_star` values that minimize the expected value.

Note that `theta_star = round(theta * M) / M`.

#### Example

To evaluate the performance of a model, we want to obtain the optimal baseline for the F<sub>1</sub> score (FBETA with `beta = 1`).

```python
optimal_baseline = bbl.optimized_baseline_statistics(y_true, measure = 'FBETA', beta = 1)

print('Max Expected Value: {:06.4f}'.format(optimal_baseline['Max Expected Value']))
print('Argmax Expected Value: {:06.4f}'.format(*optimal_baseline['Argmax Expected Value']))
print('Min Expected Value: {:06.4f}'.format(optimal_baseline['Min Expected Value']))
print('Argmin Expected Value: {:06.4f}'.format(*optimal_baseline['Argmin Expected Value']))
```

with output

```python
Max Expected Value: 0.1874
Argmax Expected Value: 1.0000
Min Expected Value: 0.0000
Argmin Expected Value: 0.0000
```
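
One way to picture this optimization is a brute-force search over all `M + 1` attainable values of `theta_star`. The sketch below does that for FBETA, using the expected value implied by the table's definition when TP is hypergeometric under the shuffle baseline. It is an illustration under those assumptions, not the package's algorithm; the name `optimal_fbeta_baseline` is hypothetical, and it assumes at least one positive sample.

```python
def optimal_fbeta_baseline(m, p, beta=1.0):
    """Search all theta_star = k / m for the extreme expected FBETA values."""
    def expected_fbeta(n_pos):
        # E[TP] = n_pos * p / m and the denominator beta^2 * P + n_pos is fixed
        return (1 + beta ** 2) * (n_pos * p / m) / (beta ** 2 * p + n_pos)

    values = {n_pos / m: expected_fbeta(n_pos) for n_pos in range(m + 1)}
    max_value = max(values.values())
    min_value = min(values.values())
    return {
        "Max Expected Value": max_value,
        "Argmax Expected Value": [t for t, v in values.items() if v == max_value],
        "Min Expected Value": min_value,
        "Argmin Expected Value": [t for t, v in values.items() if v == min_value],
    }

result = optimal_fbeta_baseline(m=10, p=4, beta=1)
print(result["Argmax Expected Value"])  # [1.0]
```

For FBETA the expected value increases with the number of samples labelled `1`, so the maximum is reached at `theta_star = 1` and the minimum at `theta_star = 0`, which matches the pattern in the output above.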

----

### All example code

```python
import DutchDraw as bbl
import random
import numpy as np

random.seed(123) # To ensure similar outputs

# Generate true and predicted labels
y_pred = random.choices((0,1), k = 10000, weights = (0.9, 0.1))
y_true = random.choices((0,1), k = 10000, weights = (0.9, 0.1))

######################################################
# Example function: measure_score
print('\033[94mExample function: `measure_score`\033[0m')
# Measuring markedness (MK):
print('Markedness: {:06.4f}'.format(bbl.measure_score(y_true, y_pred, measure = 'MK')))

# Measuring FBETA for beta = 2:
print('F2 Score: {:06.4f}'.format(bbl.measure_score(y_true, y_pred, measure= 'FBETA', beta = 2)))

print('')
######################################################
# Example function: baseline_functions_given_theta
print('\033[94mExample function: `baseline_functions_given_theta`\033[0m')
results_baseline = bbl.baseline_functions_given_theta(theta = 0.5, y_true = y_true, measure = 'FBETA', beta = 2)

print('Mean: {:06.4f}'.format(results_baseline['Mean']))
print('Variance: {:06.4f}'.format(results_baseline['Variance']))

print('')
######################################################
# Example function: baseline_functions
print('\033[94mExample function: `baseline_functions`\033[0m')
baseline = bbl.baseline_functions(y_true = y_true, measure = 'G2')
print(baseline.keys())


# Expected Value
import matplotlib.pyplot as plt
theta_values = np.arange(0, 1 + 0.01, 0.01)
expected_value_plot = [baseline['Expectation Function'](theta) for theta in theta_values]
plt.plot(theta_values, expected_value_plot)
plt.xlabel('Theta')
plt.ylabel('Expected value')
#plt.savefig('expected_value_function_example.png', dpi= 600)
plt.show()

# Variance
theta_values = np.arange(0, 1 + 0.01, 0.01)
variance_plot = [baseline['Variance Function'](theta) for theta in theta_values]
plt.plot(theta_values, variance_plot)
plt.xlabel('Theta')
plt.ylabel('Variance')
#plt.savefig('variance_function_example.png', dpi= 600)
plt.show()

# Distribution and Domain
theta = 0.5
pmf_values = [baseline['Distribution'](y, theta) for y in baseline['Domain'](theta)]
plt.plot(baseline['Domain'](theta), pmf_values)
plt.xlabel('Measure score')
plt.ylabel('Probability mass')
#plt.savefig('pmf_example.png', dpi= 600)
plt.show()

print('')
######################################################
# Example function: optimized_baseline_statistics
print('\033[94mExample function: `optimized_baseline_statistics`\033[0m')
optimal_baseline = bbl.optimized_baseline_statistics(y_true, measure = 'FBETA', beta = 1)

print('Max Expected Value: {:06.4f}'.format(optimal_baseline['Max Expected Value']))
print('Argmax Expected Value: {:06.4f}'.format(*optimal_baseline['Argmax Expected Value']))
print('Min Expected Value: {:06.4f}'.format(optimal_baseline['Min Expected Value']))
print('Argmin Expected Value: {:06.4f}'.format(*optimal_baseline['Argmin Expected Value']))
```

## License

[MIT](https://choosealicense.com/licenses/mit/)


