Metadata-Version: 2.4
Name: dataforge_prep
Version: 0.1.1
Summary: Intelligent data preprocessing for data scientists. Stop writing boilerplate. Start at model training.
Author: Tufail Ahmad Dar
License: MIT
Keywords: automation,data cleaning,data science,feature engineering,machine learning,pandas,preprocessing,sklearn
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pyarrow>=12.0.0
Requires-Dist: regex>=2023.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.10.0
Description-Content-Type: text/markdown

# dataforge-prep

> Intelligent data preprocessing for data scientists.
> Stop writing boilerplate. Start at model training.

[![PyPI version](https://badge.fury.io/py/dataforge-prep.svg)](https://badge.fury.io/py/dataforge-prep)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

## The problem every data scientist knows

You receive a dataset. Before you can train a single model you spend
hours doing the same things you did on the last project:

- Figuring out what each column actually is
- Handling missing values column by column
- Deciding how to encode categoricals
- Scaling numeric features
- Hoping you did not accidentally leak test data into training

This is not data science. This is plumbing. And it happens on every single project.

---

## The solution
```python
from dataforge import AutoPrep

prep        = AutoPrep(target='churn')
clean_train = prep.fit_transform(df_train)
clean_test  = prep.transform(df_test)

prep.report()
prep.save('pipeline.pkl')
```

Five lines. Your dataset is profiled, cleaned, encoded, and scaled.
Ready for sklearn, xgboost, lightgbm, or any other model.

---

## Installation
```bash
pip install dataforge-prep
```

Requires Python 3.10 or above.

---

## What gets handled automatically

### Column understanding
dataforge does not rely on dtypes alone. It infers what each column
represents and assigns it one of 16 semantic types:

| Type | Example columns |
|---|---|
| continuous_numeric | age, temperature, score |
| categorical | status, region, grade |
| binary | is_active, has_churned |
| identifier | customer_id, order_id |
| datetime | signup_date, created_at |
| email | email, contact_email |
| phone | phone, mobile_number |
| postal_code | zip_code, pin_code |
| geographic | latitude, longitude |
| currency | price, revenue, salary |
| percentage | churn_rate, conversion_rate |
| duration | session_length, response_time |
| free_text | comments, description, notes |
| structured_string | product_code, reference_number |
| constant | same value in every row |
| unknown | cannot determine with confidence |
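
As a rough illustration of what semantic typing involves, the sketch below guesses a handful of the types from the table using simple heuristics. The function name `guess_semantic_type`, the regex, and the rule order are assumptions made for this example; they are not dataforge's internals, and the heuristics are deliberately naive (a numeric ID column, for instance, would be reported as continuous_numeric).

```python
import re

import pandas as pd

# Deliberately loose pattern, good enough for a demonstration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def guess_semantic_type(series: pd.Series) -> str:
    """Hypothetical helper: guess a semantic type from values, not just dtype."""
    values = series.dropna()
    if values.empty:
        return "unknown"
    n_unique = values.nunique()
    if n_unique == 1:
        return "constant"
    if n_unique == 2:
        return "binary"
    if pd.api.types.is_datetime64_any_dtype(values):
        return "datetime"
    if pd.api.types.is_numeric_dtype(values):
        return "continuous_numeric"
    as_text = values.astype(str)
    if as_text.map(lambda s: bool(EMAIL_RE.match(s))).all():
        return "email"
    if n_unique == len(values):
        return "identifier"
    return "categorical"


df = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "is_active": [True, False, True, True],
    "contact_email": ["a@x.com", "b@y.org", "c@z.net", "d@w.io"],
    "country": ["IN", "IN", "IN", "IN"],
    "status": ["gold", "silver", "gold", "bronze"],
})
types = {col: guess_semantic_type(df[col]) for col in df.columns}
print(types)
```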

### Missing values
A strategy is chosen per column, based on the column's observed distribution:
- Median for skewed numeric columns
- Mean for approximately normally distributed numeric columns
- Mode for low cardinality categoricals
- Dedicated missing category for high cardinality columns
- Forward fill for datetime columns
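
The rules above can be sketched with plain pandas. The skewness threshold, the cardinality cut-off, and the helper names below are assumptions chosen for illustration; the README does not state the values dataforge uses.

```python
import pandas as pd

# Hypothetical thresholds, not taken from dataforge.
SKEW_THRESHOLD = 1.0
LOW_CARDINALITY = 15


def choose_imputation(series: pd.Series) -> str:
    """Return the name of an imputation strategy for one column."""
    if pd.api.types.is_datetime64_any_dtype(series):
        return "ffill"
    if pd.api.types.is_numeric_dtype(series):
        # Series.skew() ignores NaN by default.
        return "median" if abs(series.skew()) > SKEW_THRESHOLD else "mean"
    return "mode" if series.nunique() <= LOW_CARDINALITY else "missing_category"


def impute(series: pd.Series) -> pd.Series:
    """Fill missing values using the strategy chosen for this column."""
    strategy = choose_imputation(series)
    if strategy == "ffill":
        return series.ffill()
    if strategy == "median":
        return series.fillna(series.median())
    if strategy == "mean":
        return series.fillna(series.mean())
    if strategy == "mode":
        mode = series.mode()
        return series.fillna(mode.iloc[0]) if not mode.empty else series
    return series.fillna("__missing__")


income = pd.Series([30, 32, 35, 31, 33, 400, None])  # heavily right-skewed
height = pd.Series([160, 170, 180, None])            # symmetric

print(choose_imputation(income), choose_imputation(height))
```

In a real pipeline the fill values would be computed on the training split and stored, so the same values can be reused on the test split.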

### Outlier detection
Combines the IQR rule with the modified Z-score. A value is flagged
only when both methods agree, which reduces false positives.
- Clips outliers when the outlier rate is below 2% (execution of clipping is listed for v0.2.0)
- Flags outliers with an indicator column when the rate is above 2%
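
A minimal sketch of the "both methods must agree" idea, using NumPy only. The 1.5 IQR multiplier and the 3.5 modified Z-score cut-off are the conventional defaults and are assumptions here, as is the function name. Treating a rate of exactly 2% as "flag" is also an assumption, since the text above does not cover that case. The sketch expects input without NaN values.

```python
import numpy as np


def consensus_outliers(values, iqr_k=1.5, z_cutoff=3.5):
    """Flag values that both the IQR rule and the modified Z-score flag."""
    x = np.asarray(values, dtype=float)

    # IQR rule: outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    iqr_flag = (x < q1 - iqr_k * iqr) | (x > q3 + iqr_k * iqr)

    # Modified Z-score: 0.6745 * (x - median) / MAD.
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        z_flag = np.zeros_like(x, dtype=bool)  # no spread, nothing to flag
    else:
        z_flag = np.abs(0.6745 * (x - median) / mad) > z_cutoff

    return iqr_flag & z_flag


data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
mask = consensus_outliers(data)

# Pick the action from the outlier rate, following the 2% rule above.
rate = mask.mean()
action = "clip" if rate < 0.02 else "flag"
print(mask, action)
```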

### Encoding
- One-hot encoding for columns with 15 or fewer unique values
- Target encoding for medium cardinality columns (listed for v0.2.0)
- Label encoding for binary columns
- Unseen categories at inference time never cause a crash
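
One common way to get crash-free handling of unseen categories is scikit-learn's `OneHotEncoder` with `handle_unknown="ignore"`. Whether dataforge uses this exact mechanism is an assumption; the sketch only shows the behaviour described above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"region": ["north", "south", "east", "south"]})
test = pd.DataFrame({"region": ["north", "west"]})  # "west" never seen in training

# Unknown categories are encoded as an all-zero row instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

encoded = encoder.transform(test).toarray()
print(encoder.categories_[0])
print(encoded)
```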

### Scaling
- RobustScaler when outliers are present
- StandardScaler for normally distributed columns
- Never applies StandardScaler blindly to everything
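
The choice between scalers can be sketched as below. The helper `pick_scaler` and its IQR-based outlier check are assumptions for illustration; dataforge's own criterion may differ.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler


def pick_scaler(column: np.ndarray, iqr_k: float = 1.5):
    """Hypothetical helper: RobustScaler if the IQR rule finds outliers."""
    q1, q3 = np.percentile(column, [25, 75])
    iqr = q3 - q1
    has_outliers = np.any((column < q1 - iqr_k * iqr) | (column > q3 + iqr_k * iqr))
    return RobustScaler() if has_outliers else StandardScaler()


clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
spiky = np.array([1.0, 2.0, 3.0, 4.0, 500.0])

clean_scaler = pick_scaler(clean)
spiky_scaler = pick_scaler(spiky)
print(type(clean_scaler).__name__, type(spiky_scaler).__name__)
```

RobustScaler centres on the median and scales by the IQR, so a single extreme value does not dominate the scaling the way it would with a mean and standard deviation.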

### Leakage detection
Computes the mutual information of every feature against the target
before training. Flags columns whose mutual information with the target
is suspiciously high. Also detects post-event column names like result,
outcome, approved, final_status, and decision.
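
A minimal sketch of both checks, using scikit-learn's `mutual_info_classif`. The 0.5 threshold, the substring matching on column names, and the function name `leakage_flags` are assumptions for this example, not dataforge's actual rules. scikit-learn reports mutual information in nats, so a binary target tops out near 0.69.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

SUSPICIOUS_NAME_PARTS = ("result", "outcome", "approved", "final_status", "decision")
MI_THRESHOLD = 0.5  # hypothetical cut-off, in nats


def leakage_flags(X: pd.DataFrame, y: pd.Series) -> dict:
    """Return {column: [reasons]} for columns that look like target leakage."""
    mi = mutual_info_classif(X, y, random_state=0)
    flags = {}
    for name, score in zip(X.columns, mi):
        reasons = []
        if score > MI_THRESHOLD:
            reasons.append("high mutual information")
        if any(part in name.lower() for part in SUSPICIOUS_NAME_PARTS):
            reasons.append("post-event name")
        if reasons:
            flags[name] = reasons
    return flags


# Synthetic data: "churn_result" is the target plus tiny noise, "age" is unrelated.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=500), name="churn")
X = pd.DataFrame({
    "age": rng.normal(40, 10, size=500),
    "churn_result": y + rng.normal(0, 0.01, size=500),
})
flags = leakage_flags(X, y)
print(flags)
```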

### Test set safety
fit() and transform() are always separate operations.
transform() only applies parameters learned during fit(), so as long as
you fit on training data alone, the dataforge pipeline does not leak
test-set statistics into training.
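
The same fit/transform separation can be seen in a plain scikit-learn scaler, used here as a stand-in because dataforge's internals are not shown in this README. Statistics are learned from the training split only, and the test split is transformed with them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[100.0]])

scaler = StandardScaler().fit(train)   # mean and scale come from train only
scaled_test = scaler.transform(test)   # test values are transformed, never learned from

print(scaler.mean_, scaled_test)
```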

### Full audit trail
Every decision is logged: what was detected, why a strategy
was chosen, and what changed. A human-readable report is available in one call.

---

## Complete example
```python
import pandas as pd
from dataforge import AutoPrep
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# load data
df = pd.read_csv('customer_data.csv')

# split before fitting — never fit on full data
train, test = train_test_split(df, test_size=0.2, random_state=42)

# fit pipeline on training data only
prep        = AutoPrep(target='churn', verbose=True)
clean_train = prep.fit_transform(train)

# apply same transformations to test data
# uses exact same parameters learned from training data
clean_test  = prep.transform(test)

# read the full profiler report
prep.report()

# save pipeline for production use
prep.save('pipeline.pkl')

# train your model on clean data
X_train = clean_train.drop('churn', axis=1)
y_train = clean_train['churn']

model = RandomForestClassifier()
model.fit(X_train, y_train)
```

---

## The profiler report

After fitting, `prep.report()` prints a complete analysis:
```
============================================================
DATAFORGE PROFILE REPORT
============================================================
Total columns     : 12
Columns to drop   : 3
Need attention    : 5
Clean columns     : 4
HIGH LEAKAGE RISK : 1 column(s) — review immediately

------------------------------------------------------------
COLUMNS TO DROP
------------------------------------------------------------
  customer_id     identifier column — no predictive value
  notes           100% empty — will be dropped
  email           email column — not encodable for ML

------------------------------------------------------------
COLUMNS NEEDING ATTENTION
------------------------------------------------------------

  revenue
    type    : currency (85% confidence)
    nulls   : 2.1% (safe)
    actions : impute(median), transform(yeo-johnson), scale(robust)
    warning : distribution is skewed right (skewness=2.4)

  churn_result
    type    : continuous_numeric (90% confidence)
    leakage : high risk
    warning : HIGH LEAKAGE RISK — mutual info with target = 0.923
              verify this column is available at prediction time

------------------------------------------------------------
CLEAN COLUMNS
------------------------------------------------------------
  age        continuous_numeric    actions: impute(mean), scale(standard)
  status     categorical           actions: impute(mode), encode(OHE)
  tenure     duration              actions: impute(median), scale(robust)
  region     categorical           actions: impute(mode), encode(OHE)
============================================================
END OF REPORT
============================================================
```

---

## Saving and loading the pipeline

The fitted pipeline saves all learned parameters to disk.
Load it later to apply identical transformations to new data.
```python
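# save after fitting
prep.save('pipeline.pkl')

# load in a new session or production environment
from dataforge import AutoPrep

prep        = AutoPrep.load('pipeline.pkl')
# new_data: a DataFrame with the same feature columns used during fitting
clean_data  = prep.transform(new_data)
# model: the estimator you trained earlier, loaded separately
predictions = model.predict(clean_data)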
```

---

## AutoPrep parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| target | str | None | Name of the column you are predicting |
| task | str | auto | classification, regression, or auto |
| verbose | bool | True | Print progress while pipeline runs |

---

## Why not just use sklearn pipelines?

sklearn ColumnTransformer requires you to manually specify
which transformer to apply to which column. You still have to
inspect the data yourself, decide the strategies, and wire
everything together. Every project. Every time.

dataforge makes those decisions automatically based on the
actual characteristics of your data.
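
For comparison, here is roughly what the manual wiring looks like in scikit-learn for four columns. The column names are borrowed from the sample report above, the data is made up, and the strategy picked for each column is exactly the kind of decision you would otherwise make by hand.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

df = pd.DataFrame({
    "revenue": [120.0, np.nan, 95.0, 4000.0],
    "age": [34.0, 41.0, np.nan, 29.0],
    "status": ["gold", "silver", "gold", "silver"],
    "region": ["north", "south", "south", "north"],
})

# Every column group and every strategy is chosen and listed manually.
manual = ColumnTransformer(
    transformers=[
        ("skewed", make_pipeline(SimpleImputer(strategy="median"), RobustScaler()),
         ["revenue"]),
        ("normal", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
         ["age"]),
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                              OneHotEncoder(handle_unknown="ignore")),
         ["status", "region"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
features = manual.fit_transform(df)
print(features.shape)
```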

| Feature | sklearn | pandas-profiling (now ydata-profiling) | dataforge |
|---|---|---|---|
| Auto type detection | No | Partial | Yes — 16 types |
| Auto strategy selection | No | No | Yes |
| Leakage detection | No | No | Yes |
| Test set safety enforced | Manual | No | Yes |
| Audit trail | No | Report only | Yes |
| One line preprocessing | No | No | Yes |

---

## Current status

dataforge-prep v0.1.0 is the first public release.
The profiler and core pipeline are complete and tested
with 31 passing tests.

### What is working in v0.1.0
- Full dataset profiler with 16 semantic types
- Automatic missing value imputation
- Outlier detection
- Basic encoding — OHE and label encoding
- Standard and robust scaling
- Target leakage detection
- Save and load pipeline
- Full audit report

### Coming in v0.2.0
- Datetime feature extraction
- Target encoding for high cardinality columns
- String normalisation
- Outlier clipping execution
- Full documentation website

---

## Contributing

Feedback, issues, and pull requests are very welcome.
```bash
git clone https://github.com/tufailahmaddar/dataforge
cd dataforge
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
python3 -m pytest tests/ -v
```

---

## License

MIT License. Free to use in personal and commercial projects.

---

Built by Tufail Ahmad Dar