Metadata-Version: 2.4
Name: transfory
Version: 1.3.0
Summary: An Object-Oriented Data Transformation Toolkit
Author-email: MCQYP <mandamian.rogelioiii@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Troge-dev/transfory
Project-URL: Bug Tracker, https://github.com/Troge-dev/transfory/issues
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: joblib
Dynamic: license-file

# Transfory

[![PyPI version](https://badge.fury.io/py/transfory.svg)](https://badge.fury.io/py/transfory)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/transfory)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**An Object-Oriented, Explainable Data Transformation Toolkit for Python.**

`Transfory` is a data preprocessing library designed with clarity and modularity in mind. Inspired by `scikit-learn`'s API, it provides a suite of common data transformation tools that can be easily combined into powerful, reusable pipelines.

What makes `Transfory` unique is its built-in **`InsightReporter`**, an explainability engine that provides detailed, human-readable logs of every step in your transformation process. No more black boxes—understand exactly what happens to your data.

## Key Features

*   **Modular Transformers**: A collection of intuitive, single-purpose transformers for common preprocessing tasks.
*   **Powerful Pipelines**: Chain transformers together with `Pipeline` and apply different logic to different columns with `ColumnTransformer`.
*   **Explainability First**: The `InsightReporter` gives you a step-by-step narrative of your data's journey, making debugging and validation effortless.
*   **Pandas-Native**: Built to work seamlessly with pandas DataFrames.

## Installation

You can install `Transfory` directly from PyPI:

```bash
pip install transfory
```

## Quick Start

Let's perform a simple data cleaning operation on the Titanic dataset and see what `Transfory` tells us.

```python
import pandas as pd
import seaborn as sns
from transfory.pipeline import Pipeline
from transfory.missing import MissingValueHandler
from transfory.encoder import Encoder
from transfory.scaler import Scaler
from transfory.insight import InsightReporter

# 1. Load data and initialize the reporter
df = sns.load_dataset('titanic')[['age', 'fare', 'embarked', 'sex']]
reporter = InsightReporter()

# 2. Define a multi-step pipeline
pipeline = Pipeline(
    steps=[
        ("imputer", MissingValueHandler(strategy="mean")),
        ("encoder", Encoder(method="onehot")),
        ("scaler", Scaler(method="zscore"))
    ],
    logging_callback=reporter.get_callback()
)

# 3. Fit and transform the data
transformed_df = pipeline.fit_transform(df)

print("Transformed Data:")
print(transformed_df.head())

# 4. Review the Insight Report for a step-by-step explanation
print("\n--- Insight Report ---")
print(reporter.summary())
```

The `InsightReporter` will output a clear summary of what happened, such as the value used for imputation, the new columns created by the encoder, and the columns that were scaled.

## Core Modules Explained

`Transfory` is built on a set of core components that can be mixed and matched to create any preprocessing workflow.

### Orchestrators

These modules are used to control the flow and application of transformers.

#### `Pipeline`

Sequentially applies a list of transformers. The output of one step becomes the input to the next, making it easy to define an ordered workflow.

```python
from transfory.pipeline import Pipeline

pipeline = Pipeline([
    ("imputer", MissingValueHandler()),
    ("scaler", Scaler()),
])
```

#### `ColumnTransformer`

Applies different transformers to different columns of a DataFrame in parallel. This is highly efficient and essential for workflows where numeric and categorical data need separate handling.

```python
from transfory.column_transformer import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_scaler", Scaler(), ['age', 'fare']),
        ("categorical_encoder", Encoder(), ['embarked', 'sex'])
    ],
    remainder='passthrough' # Keep other columns
)
```

#### `InsightReporter`

The explainability engine of `Transfory`. It captures events from all transformers in a pipeline and provides a detailed, human-readable summary of the entire process. Simply create an instance and pass its callback to your pipeline.

```python
from transfory.insight import InsightReporter

reporter = InsightReporter()
pipeline = Pipeline([...], logging_callback=reporter.get_callback())
pipeline.fit_transform(df)
print(reporter.summary())
```

---

### Data Transformers

These are the building blocks that perform the actual data transformations.

#### `MissingValueHandler`

Handles missing (`NaN`) values in a DataFrame.

*   **Strategies**: `mean`, `median`, `mode`, `constant` (with `fill_value`).

```python
import pandas as pd
import numpy as np
from transfory.missing import MissingValueHandler

data = pd.DataFrame({'score': [88, 92, np.nan, 75], 'grade': ['A', 'A', 'C', np.nan]})

# Impute missing 'score' with the mean and 'grade' with the mode
imputer = MissingValueHandler(strategy="mean")
transformed_data = imputer.fit_transform(data)
print(transformed_data)
```

#### `Encoder`

Converts categorical columns into a numerical format.

*   **Methods**: `onehot` (creates binary columns), `label` (assigns a unique integer to each category).

```python
import pandas as pd
from transfory.encoder import Encoder

data = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

# Convert the 'color' column into one-hot encoded format
encoder = Encoder(method="onehot")
transformed_data = encoder.fit_transform(data)
print(transformed_data)
```

#### `Scaler`

Scales numeric features to a common range.

*   **Methods**: `zscore` (standard scaling), `minmax` (scales to a [0, 1] range).

```python
from transfory.scaler import Scaler

# Scale features to have zero mean and unit variance
scaler = Scaler(method="zscore")
```

#### `OutlierHandler`

Caps extreme values (outliers) to mitigate their effect on downstream models.

*   **Methods**: `iqr` (caps at 1.5 * Interquartile Range), `zscore` (caps at a z-score threshold), `quantile`.

```python
from transfory.outlier import OutlierHandler

# Cap values outside the 1.5 * IQR range
outlier_handler = OutlierHandler(method="iqr")
```

#### `DatetimeFeatureExtractor`

Creates new features from datetime columns, such as year, month, day, or day of the week.

```python
from transfory.datetime import DatetimeFeatureExtractor

# Extract year and month from a 'transaction_date' column
date_extractor = DatetimeFeatureExtractor(features=['year', 'month'])
```

#### `FeatureGenerator`

Automatically creates polynomial and interaction features from numeric columns to help models capture non-linear relationships.

```python
from transfory.featuregen import FeatureGenerator

# Create quadratic features (e.g., a^2, b^2) and interaction features (a*b)
feature_gen = FeatureGenerator(degree=2, include_interactions=True)
```


## Comparison: Scikit-Learn vs Transfory

`Transfory` is inspired by scikit-learn, but it is built specifically to be **more transparent, Pandas-native, and beginner-friendly**.
This table summarizes the key differences.

### Feature Comparison Table

| Aspect                     | **Scikit-Learn**                                                        | **Transfory**                                                              |
| -------------------------- | ----------------------------------------------------------------------- | -------------------------------------------------------------------------- |
| **Design Philosophy**      | Highly optimized ML preprocessing and modeling framework.               | Explainable, modular, beginner-friendly data transformation toolkit.       |
| **Data Structure Support** | Primarily works with NumPy arrays. Pandas support is secondary.         | 100% Pandas-native. Always returns DataFrames with column names preserved. |
| **Explainability**         | Minimal. Transformers do not log what they did unless manually wrapped. | Built-in **InsightReporter** logs every step in human-readable form.       |
| **Pipeline System**        | Powerful, but verbose and produces NumPy arrays by default.             | Simpler, cleaner, DataFrame-preserving pipelines.                          |
| **ColumnTransformer**      | Flexible but difficult to inspect; output column names often tricky.    | Column names preserved, easy to debug, supports explainability.            |
| **Error Messages**         | Often cryptic (NumPy errors, shape mismatch errors).                    | Human-friendly, transformer-specific error messages.                       |
| **Extensibility**          | Requires subclassing and adhering to strict API.                        | Extremely easy—just subclass `BaseTransformer`.                            |
| **Transform Output**       | NumPy array (unless wrapped).                                           | Pandas DataFrame (always).                                                 |
| **Logging**                | None by default.                                                        | Automatic logging through callbacks.                                       |
| **Intended Users**         | Production ML engineers, researchers.                                   | Students, data analysts, instructors, prototyping ML pipelines.            |



## Demos

For more in-depth examples, check out the Jupyter Notebooks in the `/demo` directory. They provide interactive, step-by-step walkthroughs of the library's capabilities on real datasets.

1.  **`demo1_basic_pipeline.ipynb`**: Introduction to `Pipeline`.
2.  **`demo2_column_transformer.ipynb`**: Advanced workflows with `ColumnTransformer`.
3.  **`demo3_feature_engineering.ipynb`**: Using `DatetimeFeatureExtractor` and `FeatureGenerator`.
4.  **`demo4_outlier_handling.ipynb`**: Using the `OutlierHandler`.

## Contributing

Contributions are welcome! If you have a feature request, bug report, or want to improve the documentation, please open an issue or submit a pull request on the GitHub repository.

## Contributors

This project was made possible by the following contributors:

### Project Lead & Core Architecture

*   **Rogelio Q. Mandamian III** ([mandamian.rogelioiii@gmail.com](mailto:mandamian.rogelioiii@gmail.com))
    *   Project lead, architect of the core framework (`BaseTransformer`, `Pipeline`, `ColumnTransformer`), and developer of the `InsightReporter`.

### Module Development

*   **Jose Gabriel P. Yana** ([gabrielpaumaryana@gmail.com](mailto:gabrielpaumaryana@gmail.com))
    *   Responsible for the implementation of `Scaler`, `MissingValueHandler`, `Encoder`, and `FeatureGenerator`.
*   **Renell josh cuerquis** (renelljoshcuerquis@gmail.com)
    *   Responsible for the implementation of `Scaler`, `MissingValueHandler`, `Encoder`, and `FeatureGenerator`.

### Documentation

*   **Ashlanie L. Pandi** (pandiashlanie16@gmail.com)
    *   Responsible for the creation and refinement of all user-facing documentation.
*   **Eleah Jaizel T. Cabili** (cabili.eleahjaizel@gmail.com)
    *   Responsible for the creation and refinement of all user-facing documentation.

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.
