Metadata-Version: 2.4
Name: model-preprocessor
Version: 0.1.0
Summary: Automated feature engineering: target-encodes strings, text-mines high-cardinality columns, and imputes missing values with cross-variable predictions.
Author: Jason Karpeles
License: MIT
Project-URL: Homepage, https://github.com/jasonkarpeles/model-preprocessor
Project-URL: Repository, https://github.com/jasonkarpeles/model-preprocessor
Keywords: preprocessing,feature-engineering,target-encoding,imputation,machine-learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5
Requires-Dist: numpy>=1.23
Requires-Dist: scikit-learn>=1.2
Dynamic: license-file

# model-preprocessor

Automated feature engineering for tabular data. Converts a raw DataFrame with mixed string/numeric columns and missing values into a fully numeric, fully imputed dataset ready for modeling.

## What it does

1. **Target-encodes low-cardinality strings** (≤250 unique values by default):
   replaces each category with the smoothed mean of the dependent variable,
   shrunk toward the overall mean for small groups using weight
   `sqrt(n)/sqrt(100)`, where `n` is the number of rows in the category.

2. **Text-mines high-cardinality strings** (>250 unique values): tokenises
   values and one-hot encodes tokens that appear in at least
   max(30, 0.1% of rows) rows.

3. **Imputes missing values** by predicting each variable from all other
   variables using a fast model (Ridge regression by default).
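
The shrinkage in step 1 can be illustrated with a minimal, standard-library sketch. It is not the package's implementation: the function name `smoothed_target_encode`, the linear blend of category mean and overall mean, and the cap of the weight at 1 for groups of 100 rows or more are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def smoothed_target_encode(categories, targets, full_weight_n=100):
    """Map each category to a mean of the target shrunk toward the overall mean.

    Hypothetical helper for illustration only.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    encoding = {}
    for cat, n in counts.items():
        # Weight from the description above: sqrt(n) / sqrt(100).
        # Assumption: capped at 1 so large groups keep their own mean.
        w = min(1.0, math.sqrt(n) / math.sqrt(full_weight_n))
        # Assumption: linear blend between category mean and overall mean.
        encoding[cat] = w * (sums[cat] / n) + (1.0 - w) * global_mean
    return encoding, global_mean


# A 4-row category gets weight sqrt(4)/sqrt(100) = 0.2; a 100-row one gets 1.0.
cats = ["a"] * 4 + ["b"] * 100
ys = [10.0] * 4 + [0.0] * 100
enc, overall = smoothed_target_encode(cats, ys)
```

Under these assumptions the small category `"a"` lands much closer to the overall mean than to its own mean of 10, while `"b"` keeps its own mean.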

## Installation

```bash
pip install model-preprocessor
```

## Quick start

```python
import pandas as pd
from model_preprocessor import ModelPreprocessor

df = pd.read_csv("data.csv")

pp = ModelPreprocessor(target="revenue")
clean = pp.fit_transform(df)
# clean is fully numeric with no missing values
```

## Parameters

| Parameter | Default | Description |
|---|---|---|
| `target` | *(required)* | Name of the dependent variable column |
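| `max_unique` | `250` | Maximum number of unique values for a string column to be target-encoded; columns with more unique values are text-mined |
| `min_token_count` | `None` | Minimum number of rows a token must appear in to be one-hot encoded; `None` uses max(30, 0.1% of rows) |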
| `impute_model` | `None` | sklearn estimator for imputation (default: `Ridge(alpha=1.0)`) |
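
To make the `min_token_count` default concrete, here is a rough standard-library sketch of how the threshold might be resolved and applied. The helper names (`tokenize`, `frequent_tokens`, `one_hot_tokens`) and the tokenisation rule (lowercased alphanumeric runs, each token counted once per row) are assumptions; the package may tokenise differently.

```python
import re
from collections import Counter


def tokenize(value):
    # Assumption: tokens are lowercased runs of letters and digits.
    return set(re.findall(r"[a-z0-9]+", str(value).lower()))


def frequent_tokens(values, min_token_count=None):
    """Return tokens that appear in at least `min_token_count` rows."""
    if min_token_count is None:
        # Default from the table above: max(30, 0.1% of rows).
        min_token_count = max(30, 0.001 * len(values))
    row_counts = Counter()
    for value in values:
        if value is None:
            continue  # missing values contribute no tokens
        row_counts.update(tokenize(value))  # a set, so once per row
    return sorted(t for t, c in row_counts.items() if c >= min_token_count)


def one_hot_tokens(values, vocabulary):
    """One 0/1 indicator per vocabulary token, per row."""
    rows = []
    for value in values:
        tokens = tokenize(value) if value is not None else set()
        rows.append([1 if t in tokens else 0 for t in vocabulary])
    return rows


values = [f"Acme Corp order {i}" for i in range(40)]
values += [f"Globex order {i}" for i in range(10)]
vocab = frequent_tokens(values)           # threshold is max(30, 0.05) = 30
encoded = one_hot_tokens(values, vocab)
```

With 50 rows the percentage term is negligible, so the floor of 30 rows decides which tokens survive; on datasets above 30,000 rows the 0.1% term takes over.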

## API

```python
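from model_preprocessor import ModelPreprocessor

pp = ModelPreprocessor(target="y")
pp.fit(train_df)               # learn encodings and imputation models from training data
result = pp.transform(new_df)  # apply the learned encodings and models to new data
# or
result = pp.fit_transform(train_df)  # fit + transform in one step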
```
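
The imputation models that `fit` learns and `transform` applies could look roughly like the sketch below, which predicts each column from all the others with scikit-learn's `Ridge`. This is an illustration under stated assumptions, not the package's code: the helpers `fit_imputers` and `impute` are hypothetical, and filling missing predictors with training column means before fitting is one simple choice among several.

```python
import numpy as np
from sklearn.linear_model import Ridge


def fit_imputers(X, alpha=1.0):
    """Fit one Ridge model per column, using all other columns as predictors."""
    col_means = np.nanmean(X, axis=0)
    # Assumption: predictors that are themselves missing are mean-filled.
    X_filled = np.where(np.isnan(X), col_means, X)
    models = {}
    for j in range(X.shape[1]):
        observed = ~np.isnan(X[:, j])
        others = np.delete(X_filled, j, axis=1)
        models[j] = Ridge(alpha=alpha).fit(others[observed], X[observed, j])
    return col_means, models


def impute(X, col_means, models):
    """Replace NaNs with predictions from the fitted per-column models."""
    X_filled = np.where(np.isnan(X), col_means, X)
    out = X.copy()
    for j, model in models.items():
        missing = np.isnan(X[:, j])
        if missing.any():
            others = np.delete(X_filled, j, axis=1)
            out[missing, j] = model.predict(others[missing])
    return out


# Synthetic data: column b is roughly 2 * a, with 20 values removed.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.1, size=200)
X = np.column_stack([a, b])
X[:20, 1] = np.nan

col_means, models = fit_imputers(X)
imputed = impute(X, col_means, models)
```

Keeping the fitted means and models separate from the data mirrors the `fit` / `transform` split above: statistics come from the training frame only and are then reused on new data.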
