Metadata-Version: 2.1
Name: BambooChute
Version: 1.2.1
Summary: Data cleaning built on top of Pandas
Home-page: https://github.com/itaymev/bamboo
Author: Itay Mevorach
Author-email: Itay Mevorach <itaym@uoregon.edu>
License: MIT
Project-URL: Homepage, https://github.com/itaymev/smart-clean-package
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.4.0
Requires-Dist: numpy>=1.18.0
Requires-Dist: scikit-learn>=0.24.0
Requires-Dist: fancyimpute>=0.7.0
Requires-Dist: fuzzywuzzy>=0.18.0
Requires-Dist: joblib>=1.0.0
Requires-Dist: matplotlib>=3.1.0
Requires-Dist: seaborn>=0.11.0

# BambooChute: Data Cleaning for Pandas

**BambooChute** is a data cleaning toolkit built on top of Pandas. It covers missing-data handling, outlier detection, categorical and date processing, type validation, duplicate management, string formatting, and data profiling, so that data analysts, scientists, and engineers can prepare data with less boilerplate.

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Functionality Overview](#functionality-overview)
  - [Loading Data](#loading-data)
  - [Handling Missing Data](#handling-missing-data)
  - [Outlier Detection and Removal](#outlier-detection-and-removal)
  - [Categorical Data Processing](#categorical-data-processing)
  - [Date Handling and Transformation](#date-handling-and-transformation)
  - [Data Type Validation](#data-type-validation)
  - [Duplicate Management](#duplicate-management)
  - [Data Formatting](#data-formatting)
  - [Data Profiling](#data-profiling)
- [Example Project](#example-project)
- [Testing](#testing)
- [Contributing](#contributing)
- [License](#license)

## Features

- **Versatile Data Loading:** Seamlessly load data from multiple formats (CSV, Excel, JSON, and Pandas DataFrames).
- **Comprehensive Missing Data Handling:** Use a variety of imputation strategies (mean, median, KNN, regression, etc.) and drop methods.
- **Flexible Outlier Detection & Removal:** Detect outliers using various methods (Z-score, IQR, Isolation Forest, etc.) with detailed configuration options.
- **Efficient Categorical Data Processing:** Convert, encode, map, and manage rare categories with ease.
- **Robust Date Handling:** Convert dates, extract date parts, handle invalid dates, and calculate date differences.
- **Data Validation Tools:** Validate data types, check for missing data, and ensure value consistency.
- **Duplicate Management:** Identify, mark, merge, and handle near-duplicates with options for fuzzy matching.
- **Consistent Data Formatting:** Clean string data by trimming whitespace, standardizing cases, and removing special characters.
- **Data Profiling:** Generate summary reports on missing data, outliers, distribution, and correlations for a comprehensive overview.

## Installation

Install BambooChute via pip:

```bash
pip install BambooChute
```

`pip` installs the required dependencies automatically. If you are working from a clone of the repository instead, install them from `requirements.txt`:

```bash
pip install -r requirements.txt
```

## Getting Started

Here’s an example to get you started with loading data, handling missing values, detecting outliers, and exporting the cleaned data:

```python
import pandas as pd
from BambooChute import Bamboo

# Load data
data = pd.read_csv('data.csv')

# Initialize Bamboo
bamboo = Bamboo(data)

# Preview data
print(bamboo.preview_data())

# Handle missing data
bamboo.impute_missing(strategy='mean')

# Detect outliers using the Z-score method
bamboo.detect_outliers_zscore(threshold=3)

# Export cleaned data
bamboo.export_data('cleaned_data.csv')
```

## Functionality Overview

### Loading Data

BambooChute makes it easy to load data from various sources. The `Bamboo` class can accept:
- **CSV**: Read from a CSV file path.
- **Excel**: Read from an Excel file.
- **JSON**: Load data from a JSON file.
- **Pandas DataFrame**: Load data directly from an in-memory DataFrame.

```python
# Load from a file path or from an existing DataFrame
bamboo = Bamboo('data.csv')
bamboo = Bamboo(df)  # df is an existing pandas DataFrame
```

This flexibility allows BambooChute to be integrated with most data pipelines, irrespective of the data source.
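
How `Bamboo` resolves a file path internally is not shown here. As a rough, pandas-only sketch of the idea, a loader might dispatch on the file extension. The `load_table` helper and its `_READERS` table are hypothetical names used for illustration; they are not part of BambooChute.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
}


def load_table(source):
    """Return a DataFrame from an in-memory DataFrame or a file path."""
    if isinstance(source, pd.DataFrame):
        # Copy so later cleaning steps do not mutate the caller's object.
        return source.copy()
    suffix = Path(source).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return _READERS[suffix](source)
```

Copying an incoming DataFrame is a design choice in this sketch, not a documented BambooChute behavior.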

### Handling Missing Data

The package offers multiple imputation methods to fill missing values, including:
- **Mean, Median, and Mode Imputation**: Fill missing values with a column's mean, median, or most frequent value.
- **K-Nearest Neighbors (KNN)**: Impute values based on the values of nearby points, ensuring continuity in numerical patterns.
- **Custom Functions**: Define your own function for filling missing values.

Example Usage:

```python
# Impute missing values using the mean of each column
bamboo.impute_missing(strategy='mean')

# Impute using K-Nearest Neighbors
bamboo.impute_knn(n_neighbors=5)

# Drop rows that contain any missing value (axis=0 targets rows)
bamboo.drop_missing(axis=0, how='any')
```

Each function supports optional parameters to control which columns are affected, allowing for precise control over data imputation.
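
For intuition, mean and mode imputation can be written in plain pandas as below. This is a minimal sketch of the technique on a made-up frame, not BambooChute's implementation; the column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "goals": [1.0, np.nan, 3.0],
    "team": ["A", None, "A"],
})

# Numeric columns: fill gaps with the column mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Non-numeric columns: fill gaps with the most frequent value (mode).
for col in df.columns.difference(numeric_cols):
    df[col] = df[col].fillna(df[col].mode().iloc[0])
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is worth keeping in mind before modeling.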

### Outlier Detection and Removal

Outliers can distort your data analysis, so BambooChute offers several detection methods:
- **Z-score Detection**: Identifies outliers by measuring how many standard deviations a value is from the mean.
- **IQR (Interquartile Range)**: Detects values outside a specific range based on the first and third quartiles.
- **Isolation Forest and DBSCAN**: Use machine learning to detect outliers in complex datasets.

Example Usage:

```python
# Detect outliers using Z-Score
outliers = bamboo.detect_outliers_zscore(threshold=3)

# Remove outliers using IQR
bamboo.remove_outliers(method='iqr', multiplier=1.5)

# Remove outliers using Isolation Forest
bamboo.remove_outliers_isolation_forest(contamination=0.1)
```
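
The two statistical rules above are easy to reproduce by hand. The following pandas sketch applies them to a small made-up series; it illustrates the formulas only and makes no claim about how BambooChute computes them internally.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

# Z-score: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=0)
zscore_mask = z_scores.abs() > 3

# IQR: flag values more than `multiplier` * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
multiplier = 1.5
iqr_mask = (values < q1 - multiplier * iqr) | (values > q3 + multiplier * iqr)

# Keep only the rows that neither rule flags.
filtered = values[~(zscore_mask | iqr_mask)]
```

On small samples the Z-score rule is conservative because a single extreme value also inflates the standard deviation; the IQR rule is less sensitive to that effect.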

### Categorical Data Processing

BambooChute makes it easy to manage categorical data with functions for:
- **Conversion to Categorical**: Change columns to categorical types.
- **Encoding**: Convert categories to one-hot or label encodings.
- **Rare Category Detection**: Identify and replace rare categories to improve model performance.

Example Usage:

```python
# Convert columns to categorical type
bamboo.convert_to_categorical(['column'])

# Encode categorical data with one-hot encoding
bamboo.encode_categorical(method='onehot')

# Detect and replace rare categories
rare_categories = bamboo.detect_rare_categories(column='category_column', threshold=0.05)
bamboo.replace_rare_categories(column='category_column', replacement='Other')
```
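
Rare-category handling boils down to comparing each category's share of rows against a threshold. A minimal pandas sketch on invented data (the threshold and labels are illustrative, and this is not BambooChute's own code):

```python
import pandas as pd

teams = pd.Series(["A"] * 12 + ["B"] * 7 + ["C"], name="category_column")

# Share of rows per category; anything under the threshold counts as rare.
threshold = 0.10
shares = teams.value_counts(normalize=True)
rare = shares[shares < threshold].index

# Collapse rare categories into a single replacement label.
cleaned = teams.where(~teams.isin(rare), "Other")
```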

### Date Handling and Transformation

BambooChute simplifies working with dates, offering tools for:
- **Conversion to Datetime**: Convert columns to a datetime format.
- **Extraction of Date Parts**: Extract parts of a date (year, month, day).
- **Date Range Creation**: Generate sequences of dates.
- **Date Differences**: Calculate differences between dates.

Example Usage:

```python
# Convert columns to datetime format
bamboo.convert_to_datetime(['date_column'])

# Extract specific date parts (e.g., year, month)
bamboo.extract_date_parts('date_column', parts=['year', 'month'])

# Calculate time difference between two date columns
bamboo.calculate_date_differences(start_column='start_date', end_column='end_date')
```
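
The same operations expressed directly in pandas look roughly like this. The frame and the derived column names are made up for illustration; BambooChute's own output columns may be named differently.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": ["2024-01-01", "2024-02-10", "not a date"],
    "end_date": ["2024-01-31", "2024-03-01", "2024-03-05"],
})

# Unparseable strings become NaT instead of raising an error.
for col in ["start_date", "end_date"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d", errors="coerce")

# Extract date parts through the .dt accessor.
df["start_year"] = df["start_date"].dt.year
df["start_month"] = df["start_date"].dt.month

# Row-wise difference in whole days (missing where either date is missing).
df["duration_days"] = (df["end_date"] - df["start_date"]).dt.days
```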

### Data Type Validation

Data integrity is critical, and BambooChute’s data type validation tools allow you to:
- **Check Consistency**: Identify columns with mixed types.
- **Convert Data Types**: Enforce specific types across columns.
- **Identify Invalid Types**: Detect rows with types that don’t match the expected type for a column.

Example Usage:

```python
# Check for data type consistency
consistency = bamboo.check_dtype_consistency()

# Enforce specific data types
bamboo.enforce_column_types({'age': 'int64', 'price': 'float64'})
```
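
One way to picture a consistency check is to count the distinct Python types stored in each column. The sketch below does that in plain pandas on invented data; it is an assumption about the general approach, not a description of `check_dtype_consistency`.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [31, "42", 27],        # mixed int and str
    "price": [9.5, 12.0, 7.25],   # consistently float
})

# Count distinct Python types per column; more than one means mixed types.
type_counts = df.apply(lambda col: col.map(type).nunique())
mixed_columns = type_counts[type_counts > 1].index.tolist()

# Coerce to a numeric dtype; values that cannot be parsed become missing.
df["age"] = pd.to_numeric(df["age"], errors="coerce").astype("Int64")
```

The nullable `Int64` dtype is used here so that any value that fails to parse can be stored as missing without turning the whole column into floats.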

### Duplicate Management

Efficiently manage duplicates with functions for:
- **Identifying Duplicates**: Find duplicate rows.
- **Dropping Duplicates**: Remove duplicate rows, keeping specific occurrences.
- **Near-Duplicate Detection**: Identify nearly identical values using fuzzy matching.

Example Usage:

```python
# Identify duplicates based on specific columns
duplicates = bamboo.identify_duplicates(subset=['name'])

# Drop duplicates, keeping the first occurrence
bamboo.drop_duplicates(keep='first')

# Handle near-duplicates using fuzzy matching
bamboo.handle_near_duplicates(column='name', threshold=0.8)
```
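
To show what a similarity threshold means in practice, here is a small sketch that uses the standard library's `difflib.SequenceMatcher` in place of a fuzzy-matching package. Its scores will not match those of `fuzzywuzzy` exactly, and the `similarity` helper is a hypothetical name.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

names = pd.Series(["Manchester United", "Manchester Utd", "Arsenal", "Arsenal"])

# Exact duplicates: True for every repeat after the first occurrence.
exact_dupes = names.duplicated(keep="first")


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Near-duplicates: distinct values whose similarity meets the threshold.
threshold = 0.8
near_dupes = [
    (a, b)
    for a, b in combinations(names.unique(), 2)
    if similarity(a, b) >= threshold
]
```

Comparing every pair grows quadratically with the number of distinct values, so this approach suits columns with a modest number of unique entries.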

### Data Formatting

BambooChute’s formatting tools allow you to standardize data appearance:
- **Whitespace Management**: Trim excess whitespace from strings.
- **Case Standardization**: Convert text to lowercase, uppercase, or title case.
- **Special Character Removal**: Clean up text fields by removing unwanted symbols.

Example Usage:

```python
# Trim whitespace in string columns
bamboo.trim_whitespace()

# Standardize case to title
bamboo.standardize_case(case='title')

# Remove special characters in specific columns
bamboo.remove_special_characters(columns=['text_column'], chars_to_remove='@#$')
```
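
These three steps map closely onto pandas string methods. A minimal sketch on made-up strings, assuming the characters to remove are given as a plain string as in the call above:

```python
import re

import pandas as pd

text = pd.Series(["  hello WORLD  ", "@data# cleaning$ "])

chars_to_remove = "@#$"
# re.escape keeps characters such as `$` from being read as regex syntax.
pattern = f"[{re.escape(chars_to_remove)}]"

cleaned = (
    text.str.replace(pattern, "", regex=True)  # drop unwanted symbols
        .str.strip()                           # trim surrounding whitespace
        .str.title()                           # standardize to title case
)
```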

### Data Profiling

Gain insights into your data by generating summary reports, allowing for a deeper understanding of data structure and issues.

```python
# Generate a summary report with key insights on data
summary = bamboo.generate_summary_report()
```
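
The structure of the report returned by `generate_summary_report` is not documented here. As an illustration of the kind of information such a report can hold, the following pandas sketch builds a small summary dictionary; the keys and the sample frame are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "shots": [10, 12, np.nan, 8],
    "goals": [1, 2, 0, 1],
    "team": ["A", "B", "A", None],
})

summary = {
    # Fraction of missing values per column.
    "missing_share": df.isna().mean().to_dict(),
    # Count, mean, std, min, quartiles, and max for numeric columns.
    "describe": df.describe(),
    # Pairwise Pearson correlation between numeric columns.
    "correlations": df.select_dtypes(include="number").corr(),
}
```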

## Example Project

This is an example project by Itay Mevorach using Bamboo for some minor data cleaning.

### Data Key

#### Match Information
- **Season**: League Season
- **Div**: League Division
- **Date**: Match Date (dd/mm/yy)
- **Time**: Time of match kick off
- **HomeTeam**: Home Team
- **AwayTeam**: Away Team
- **FTHG**: Full Time Home Team Goals
- **FTAG**: Full Time Away Team Goals
- **FTR**: Full Time Result (H=Home Win, D=Draw, A=Away Win)
- **HTHG**: Half Time Home Team Goals
- **HTAG**: Half Time Away Team Goals
- **HTR**: Half Time Result (H=Home Win, D=Draw, A=Away Win)

#### Match Statistics (where available)
- **Referee**: Match Referee
- **HS**: Home Team Shots
- **AS**: Away Team Shots
- **HST**: Home Team Shots on Target
- **AST**: Away Team Shots on Target
- **HC**: Home Team Corners
- **AC**: Away Team Corners
- **HF**: Home Team Fouls Committed
- **AF**: Away Team Fouls Committed
- **HY**: Home Team Yellow Cards
- **AY**: Away Team Yellow Cards
- **HR**: Home Team Red Cards
- **AR**: Away Team Red Cards

```python
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# This is my own package that I wrote for data cleaning. Check it out: https://pypi.org/project/BambooChute/ :)
import bamboo as Bamboo
```

```python
past_data = pd.read_csv("data/past-data.csv")
past_data.head(5)
```

```python
bamboo = Bamboo.Bamboo(past_data, sys_log=False)
bamboo.save_state()
bamboo.preview_data()
```

```python
# Combine Date and Time into a single datetime column. Missing times are set to 00:00
# so that matches with a known date but no kick-off time are not lost.
bamboo.data["DateTime"] = pd.to_datetime(bamboo.data["Date"] + ", " + bamboo.data["Time"].fillna("00:00"), errors="coerce") 
bamboo.data = bamboo.data.drop(columns=["Date", "Time"])
bamboo.save_state()
bamboo.data.shape
```

```python
# This is one way to get rid of our nulls, but the issue with imputing by mean is that it will skew results of our match statistics.
# We can drop the rows with missing values instead, or use a more complex imputation strategy. However since the ultimate goal is to
# find the factors which dictate the outcome of a match, we should drop the rows with missing values to avoid skewing our results.

# Impute by mean, defaults to mode for categorical columns
bamboo.impute_missing(strategy="mean")
bamboo.data.isnull().any()
```

```python
bamboo.undo()
bamboo.data.isnull().any()
```
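
The `save_state()` and `undo()` calls above behave like a checkpoint and a rollback. How BambooChute stores its history is not shown in this document; one common way to get this behavior is a stack of DataFrame copies, sketched below with a hypothetical `SnapshotHistory` class.

```python
import pandas as pd


class SnapshotHistory:
    """Hypothetical undo helper: keeps copies of a DataFrame on a stack."""

    def __init__(self, data):
        self.data = data
        self._snapshots = []

    def save_state(self):
        # Store a copy so later in-place edits cannot alter the snapshot.
        self._snapshots.append(self.data.copy())

    def undo(self):
        # Restore the most recent snapshot, if there is one.
        if self._snapshots:
            self.data = self._snapshots.pop()


history = SnapshotHistory(pd.DataFrame({"HS": [10.0, None]}))
history.save_state()
history.data = history.data.fillna(0)  # a change we later regret
history.undo()                         # back to the saved frame
```

Full copies are simple and safe but cost memory on large frames, which is the usual trade-off of snapshot-based undo.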

```python
# We will actually drop the rows with missing values, to preserve the integrity of our data
bamboo.drop_missing()
bamboo.save_state()
bamboo.data.shape
```

```python
categorical_headers = ["Season", "Div", "HomeTeam", "AwayTeam", "Referee", "HTR", "FTR"] # Even though season is a number, it is categorical
bamboo.convert_to_categorical(columns=categorical_headers)

for cat in categorical_headers:
    print(bamboo.get_unique_categories(cat))

# We can see the categories for each categorical column now
```

```python
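# Check for outliers with several methods; the elif chain reports only the first method that flags anything.
# Caution: if these methods return a pandas Series or DataFrame, `True in result` tests the
# index or column labels rather than the values, so `result.to_numpy().any()` may be the safer check.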
if True in bamboo.detect_outliers_zscore(threshold=2):
    print("Outliers detected -- Z-Score")
elif True in bamboo.detect_outliers_iqr():
    print("Outliers detected -- IQR")
elif True in bamboo.detect_outliers_modified_zscore(threshold=2):
    print("Outliers detected -- Modified Z-Score")
elif True in bamboo.detect_outliers_isolation_forest(n_estimators=150):
    print("Outliers detected -- Isolation Forest")
elif True in bamboo.detect_outliers_lof():
    print("Outliers detected -- Local Outlier Factor")
elif True in bamboo.detect_outliers_dbscan():
    print("Outliers detected -- DBSCAN")

# There are no outliers because this is a dataset of football matches (the only outlier in football is Ronaldo)
```

```python
bamboo.export_data("data/past-data-clean.csv", format="csv") # Making a backup of our cleaned data
clean_data = bamboo.get_data()
print(clean_data)
print("\nData Types:\n", clean_data.dtypes)
```

```python
# Pass the column by keyword so the bars are horizontal and match the axis labels below
sns.countplot(y='Div', data=clean_data)
plt.title("Matches by Division")
plt.xlabel("Count")
plt.ylabel("Division")
plt.show()
```

```python
ftr = clean_data['FTR'].value_counts().sort_index()
htr = clean_data['HTR'].value_counts().sort_index()
outcomes_df = pd.DataFrame({'FTR': ftr, 'HTR': htr})

outcomes_df.plot(kind='bar', color=['green', 'blue'], width=0.8)
plt.title("Match Outcomes by Half Time and Full Time Results")
plt.xlabel("Outcome")
plt.ylabel("Count")
plt.show()
# Remember H means home team win, A means away team win, D means draw
```

```python
numerical_cols = ['FTHG', 'FTAG', 'HS', 'AS', 'HTHG', 'HTAG', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR']
numerical_data = clean_data[numerical_cols]

scaler = StandardScaler()
scaled_data = scaler.fit_transform(numerical_data)

# We can use PCA to reduce the dimensionality of our data
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)
clean_data['PCA1'] = pca_result[:, 0]
clean_data['PCA2'] = pca_result[:, 1]

sns.scatterplot(x='PCA1', y='PCA2', hue=clean_data['FTR'], palette='cool', data=clean_data)
plt.title("PCA of Match Data")
plt.show()
```

```python
kmeans = KMeans(n_clusters=3, random_state=42)
clean_data['Cluster'] = kmeans.fit_predict(scaled_data)

sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', palette='viridis', data=clean_data)
plt.title("Clustering of Match Data")
plt.show()
```

```python
features = ['HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR'] # FTHG and FTAG would be way too good of features, they are basically the target
target = 'FTR'  # Home Win: 1, Draw: 0, Away Win: -1

# Map FTR
ftr_mapping = {'H': 1, 'D': 0, 'A': -1}
clean_data['FTR_num'] = clean_data['FTR'].map(ftr_mapping)

x = clean_data[features]
y = clean_data['FTR_num']
scaler = StandardScaler()
x = scaler.fit_transform(x)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

print(len(features), len(reg_model.coef_))

feature_importance = pd.DataFrame({
    'Feature': features,
    'Coefficient': reg_model.coef_
})
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)
print(feature_importance)

sns.barplot(x='Coefficient', y='Feature', data=feature_importance, palette='coolwarm', hue='Feature')
plt.title("Feature Importance Based on Linear Regression Coefficients")
plt.show()
```

```python
# It is commonly known that the accuracy of shots is extremely important in determining the outcome of a match. I'll create a new
# column that is the ratio of shots on target to total shots, and see how important this new feature is.

clean_data['HST_%'] = clean_data.apply(lambda row: row['HST'] / row['HS'] if row['HS'] != 0 else 0, axis=1)
clean_data['AST_%'] = clean_data.apply(lambda row: row['AST'] / row['AS'] if row['AS'] != 0 else 0, axis=1)

features = ['HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'HST_%', 'AST_%'] # Added HST_% and AST_%, the rest of the code is copied down
target = 'FTR'  # Home Win: 1, Draw: 0, Away Win: -1

# Map FTR
ftr_mapping = {'H': 1, 'D': 0, 'A': -1}
clean_data['FTR_num'] = clean_data['FTR'].map(ftr_mapping)

x = clean_data[features]
y = clean_data['FTR_num']
scaler = StandardScaler()
x = scaler.fit_transform(x)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

print(len(features), len(reg_model.coef_))

feature_importance = pd.DataFrame({
    'Feature': features,
    'Coefficient': reg_model.coef_
})
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)
print(feature_importance)

sns.barplot(x='Coefficient', y='Feature', data=feature_importance, palette='coolwarm', hue='Feature')
plt.title("Feature Importance Based on Linear Regression Coefficients")
plt.show()
```

```python
sns.boxplot(x='FTR', y='HST_%', data=clean_data, palette='coolwarm', hue='FTR')
plt.title("Correlation between Match Outcome and Home Team Shots on Target Percentage")
plt.xlabel("Match Outcome (FTR)")
plt.ylabel("Home Team Shots on Target Percentage")
plt.show()
```

```python
sns.boxplot(x='FTR', y='AST_%', data=clean_data, palette='coolwarm', hue='FTR')
plt.title("Correlation between Match Outcome and Away Team Shots on Target Percentage")
plt.xlabel("Match Outcome (FTR)")
plt.ylabel("Away Team Shots on Target Percentage")
plt.show()
```

## Testing

The BambooChute package uses **pytest** for testing. To execute all tests in the `tests` directory, run:

```bash
pytest tests/
```

For more detailed output, add the `-v` flag for verbose mode. Add `--maxfail=3` to stop the run after three failures.

## Contributing

To contribute:

1. Go to the homepage and fork the repository.
2. Create a new branch.
3. Make your changes and submit a pull request that includes your email address.
   - You will receive an email asking about your changes; please reply.

Thanks for contributing!

## License

BambooChute is licensed under the MIT License.
