Metadata-Version: 2.1
Name: BambooChute
Version: 1.1.4
Summary: Data cleaning package built on top of Pandas
Home-page: https://github.com/itaymev/bamboo
Author: Itay Mevorach
Author-email: itaym@uoregon.edu
License: UNKNOWN
Description: # BambooChute: Data Cleaning for Pandas - Beta 1.1.4
        
        **BambooChute** is a comprehensive data cleaning toolkit built on top of Pandas, offering a vast array of functions to streamline your data preparation process. From handling missing data to detecting outliers, managing categorical data, and ensuring data integrity, BambooChute empowers data analysts, scientists, and engineers to work more efficiently with their data.
        
        ## Table of Contents
        
        - [Features](#features)
        - [Installation](#installation)
        - [Getting Started](#getting-started)
        - [Functionality Overview](#functionality-overview)
          - [Loading Data](#loading-data)
          - [Handling Missing Data](#handling-missing-data)
          - [Outlier Detection and Removal](#outlier-detection-and-removal)
          - [Categorical Data Processing](#categorical-data-processing)
          - [Date Handling and Transformation](#date-handling-and-transformation)
          - [Data Type Validation](#data-type-validation)
          - [Duplicate Management](#duplicate-management)
          - [Data Formatting](#data-formatting)
          - [Data Profiling](#data-profiling)
        - [Example Project](#example-project)
        - [Testing](#testing)
        - [Contributing](#contributing)
        - [License](#license)
        
        ## Features
        
        - **Versatile Data Loading:** Seamlessly load data from multiple formats (CSV, Excel, JSON, and Pandas DataFrames).
        - **Comprehensive Missing Data Handling:** Use a variety of imputation strategies (mean, median, KNN, regression, etc.) and drop methods.
        - **Flexible Outlier Detection & Removal:** Detect outliers using various methods (Z-score, IQR, Isolation Forest, etc.) with detailed configuration options.
        - **Efficient Categorical Data Processing:** Convert, encode, map, and manage rare categories with ease.
        - **Robust Date Handling:** Convert dates, extract date parts, handle invalid dates, and calculate date differences.
        - **Data Validation Tools:** Validate data types, check for missing data, and ensure value consistency.
        - **Duplicate Management:** Identify, mark, merge, and handle near-duplicates with options for fuzzy matching.
        - **Consistent Data Formatting:** Clean string data by trimming whitespace, standardizing cases, and removing special characters.
        - **Data Profiling:** Generate summary reports on missing data, outliers, distribution, and correlations for a comprehensive overview.
        
        ## Installation
        
        Install BambooChute via pip:
        
        ```bash
        pip install BambooChute
        ```
        
        Ensure dependencies from `requirements.txt` are installed:
        
        ```bash
        pip install -r requirements.txt
        ```
        
        ## Getting Started
        
        Here’s an example to get you started with loading data, handling missing values, detecting outliers, and exporting the cleaned data:
        
        ```python
        import pandas as pd
        from BambooChute import Bamboo
        
        # Load data
        data = pd.read_csv('data.csv')
        
        # Initialize Bamboo
        bamboo = Bamboo(data)
        
        # Preview data
        print(bamboo.preview_data())
        
        # Handle missing data
        bamboo.impute_missing(strategy='mean')
        
        # Detect outliers (values more than 3 standard deviations from the mean)
        bamboo.detect_outliers_zscore(threshold=3)
        
        # Export cleaned data
        bamboo.export_data('cleaned_data.csv')
        ```
        
        ## Functionality Overview
        
        ### Loading Data
        
        BambooChute makes it easy to load data from various sources. The `Bamboo` class can accept:
        - **CSV**: Read from a CSV file path.
        - **Excel**: Read from an Excel file.
        - **JSON**: Load data from a JSON file.
        - **Pandas DataFrame**: Load data directly from an in-memory DataFrame.
        
        ```python
        # Load from a file path or from an existing DataFrame
        bamboo = Bamboo('data.csv')
        bamboo = Bamboo(df)  # df is a pandas DataFrame already in memory
        ```
        
        This flexibility allows BambooChute to be integrated with most data pipelines, irrespective of the data source.
        
        ### Handling Missing Data
        
        The package offers multiple imputation methods to fill missing values, including:
        - **Mean, Median, and Mode Imputation**: Fill missing values in numeric columns using statistical averages.
        - **K-Nearest Neighbors (KNN)**: Impute values based on the values of nearby points, ensuring continuity in numerical patterns.
        - **Custom Functions**: Define your own function for filling missing values.
        
        Example Usage:
        
        ```python
        # Impute missing values using the mean of each column
        bamboo.impute_missing(strategy='mean')
        
        # Impute using K-Nearest Neighbors
        bamboo.impute_knn(n_neighbors=5)
        
        # Drop rows (axis=0) that contain any missing value
        bamboo.drop_missing(axis=0, how='any')
        ```
        
        Each function supports optional parameters to control which columns are affected, allowing for precise control over data imputation.
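        
        To make the mean strategy concrete, here is a minimal plain-Python sketch of what mean imputation does to a single column. It uses only the standard library, and `impute_mean` is a hypothetical helper written for illustration. It is not part of BambooChute's API, and the library's own implementation may differ.
        
        ```python
        from statistics import mean
        
        def impute_mean(values):
            """Replace None entries with the mean of the observed values (hypothetical helper)."""
            observed = [v for v in values if v is not None]
            if not observed:
                return list(values)  # nothing to average, leave the column unchanged
            fill = mean(observed)
            return [fill if v is None else v for v in values]
        
        ages = [20, None, 30, 40, None]
        print(impute_mean(ages))  # both gaps are filled with the mean of 20, 30 and 40
        ```
        
        Every gap receives the same value, which keeps the column mean unchanged but shrinks its variance. That is one reason to consider KNN or dropping rows when the spread of the data matters.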
        
        ### Outlier Detection and Removal
        
        Outliers can distort your data analysis, so BambooChute offers several detection methods:
        - **Z-score Detection**: Identifies outliers by measuring how many standard deviations a value is from the mean.
        - **IQR (Interquartile Range)**: Detects values outside a specific range based on the first and third quartiles.
        - **Isolation Forest and DBSCAN**: Use machine learning to detect outliers in complex datasets.
        
        Example Usage:
        
        ```python
        # Detect outliers using Z-Score
        outliers = bamboo.detect_outliers_zscore(threshold=3)
        
        # Remove outliers using IQR
        bamboo.remove_outliers(method='iqr', multiplier=1.5)
        
        # Remove outliers using Isolation Forest
        bamboo.remove_outliers_isolation_forest(contamination=0.1)
        ```
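        
        The Z-score and IQR rules described above can be sketched in plain Python as follows. This is an illustration of the two calculations under stated assumptions, not BambooChute's implementation: `zscore_outliers` and `iqr_outliers` are hypothetical names, the Z-score uses the population standard deviation, and the quartiles use the standard library's "inclusive" method.
        
        ```python
        from statistics import mean, pstdev, quantiles
        
        def zscore_outliers(values, threshold=3.0):
            """Flag values lying more than `threshold` standard deviations from the mean."""
            mu, sigma = mean(values), pstdev(values)
            if sigma == 0:
                return [False] * len(values)  # constant column, nothing can be an outlier
            return [abs(v - mu) / sigma > threshold for v in values]
        
        def iqr_outliers(values, multiplier=1.5):
            """Flag values outside [Q1 - multiplier * IQR, Q3 + multiplier * IQR]."""
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
            spread = q3 - q1
            low, high = q1 - multiplier * spread, q3 + multiplier * spread
            return [v < low or v > high for v in values]
        
        data = [10, 12, 11, 13, 12, 11, 10, 95]
        print(zscore_outliers(data, threshold=2))  # only the last value is flagged
        print(iqr_outliers(data))                  # only the last value is flagged
        ```
        
        The example uses a threshold of 2 rather than 3 because, with the population standard deviation, no Z-score in a sample of eight points can exceed about 2.65. The IQR rule does not have this limitation, which makes it the safer choice for small datasets.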
        
        ### Categorical Data Processing
        
        BambooChute makes it easy to manage categorical data with functions for:
        - **Conversion to Categorical**: Change columns to categorical types.
        - **Encoding**: Convert categories to one-hot or label encodings.
        - **Rare Category Detection**: Identify and replace rare categories to improve model performance.
        
        Example Usage:
        
        ```python
        # Convert columns to categorical type
        bamboo.convert_to_categorical(['column'])
        
        # Encode categorical data with one-hot encoding
        bamboo.encode_categorical(method='onehot')
        
        # Detect and replace rare categories
        rare_categories = bamboo.detect_rare_categories(column='category_column', threshold=0.05)
        bamboo.replace_rare_categories(column='category_column', replacement='Other')
        ```
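        
        As a rough illustration of what a frequency threshold of 0.05 means for rare categories, the sketch below counts categories with the standard library. `detect_rare` and `replace_rare` are hypothetical helpers, and treating "rare" as a relative frequency strictly below the threshold is an assumption. BambooChute may define the cut-off differently.
        
        ```python
        from collections import Counter
        
        def detect_rare(values, threshold=0.05):
            """Return the categories whose share of all values is below `threshold`."""
            counts = Counter(values)
            total = len(values)
            return {category for category, n in counts.items() if n / total < threshold}
        
        def replace_rare(values, rare, replacement="Other"):
            """Swap every rare category for a single replacement label."""
            return [replacement if v in rare else v for v in values]
        
        colors = ["red"] * 50 + ["blue"] * 47 + ["teal"] * 2 + ["mauve"]
        rare = detect_rare(colors, threshold=0.05)
        print(sorted(rare))                          # teal (2%) and mauve (1%) fall below 5%
        print(Counter(replace_rare(colors, rare)))   # the three rare rows are now "Other"
        ```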
        
        ### Date Handling and Transformation
        
        BambooChute simplifies working with dates, offering tools for:
        - **Conversion to Datetime**: Convert columns to a datetime format.
        - **Extraction of Date Parts**: Extract parts of a date (year, month, day).
        - **Date Range Creation**: Generate sequences of dates.
        - **Date Differences**: Calculate differences between dates.
        
        Example Usage:
        
        ```python
        # Convert columns to datetime format
        bamboo.convert_to_datetime(['date_column'])
        
        # Extract specific date parts (e.g., year, month)
        bamboo.extract_date_parts('date_column', parts=['year', 'month'])
        
        # Calculate time difference between two date columns
        bamboo.calculate_date_differences(start_column='start_date', end_column='end_date')
        ```
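        
        The same three steps (conversion, part extraction, and differences) look like this for individual values in plain Python. This sketch only illustrates the operations. `parse_date` is a hypothetical helper, and the day-first format string is an assumption chosen for the example.
        
        ```python
        from datetime import datetime
        
        def parse_date(text, fmt="%d/%m/%Y"):
            """Parse a date string, returning None for invalid input instead of raising."""
            try:
                return datetime.strptime(text, fmt)
            except (ValueError, TypeError):
                return None
        
        start = parse_date("01/02/2024")
        end = parse_date("15/03/2024")
        bad = parse_date("31/02/2024")   # February has no 31st day
        
        print(start.year, start.month)   # extracted date parts
        print((end - start).days)        # difference between the two dates, in days
        print(bad)                       # invalid dates become None
        ```
        
        Returning a missing value for unparseable dates, rather than raising, lets a whole column be converted in one pass and the failures be inspected afterwards.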
        
        ### Data Type Validation
        
        Data integrity is critical, and BambooChute’s data type validation tools allow you to:
        - **Check Consistency**: Identify columns with mixed types.
        - **Convert Data Types**: Enforce specific types across columns.
        - **Identify Invalid Types**: Detect rows with types that don’t match the expected type for a column.
        
        Example Usage:
        
        ```python
        # Check for data type consistency
        consistency = bamboo.check_dtype_consistency()
        
        # Enforce specific data types
        bamboo.enforce_column_types({'age': 'int64', 'price': 'float64'})
        ```
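        
        To show what "mixed types" means in practice, here is a small standard-library sketch that inspects one column of row dictionaries. `column_types` and `invalid_rows` are hypothetical helpers for illustration and do not reflect how BambooChute performs these checks internally.
        
        ```python
        def column_types(rows, column):
            """Return the set of Python type names found in one column."""
            return {type(row[column]).__name__ for row in rows}
        
        def invalid_rows(rows, column, expected):
            """Return the indices of rows whose value is not an instance of `expected`."""
            return [i for i, row in enumerate(rows) if not isinstance(row[column], expected)]
        
        rows = [
            {"age": 31, "price": 9.5},
            {"age": "42", "price": 12.0},   # age was read as text
            {"age": 27, "price": 3.25},
        ]
        
        print(column_types(rows, "age"))        # the column mixes int and str
        print(invalid_rows(rows, "age", int))   # index of the offending row
        
        # Enforce a single type by converting every value
        coerced = [{**row, "age": int(row["age"])} for row in rows]
        print(column_types(coerced, "age"))
        ```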
        
        ### Duplicate Management
        
        Efficiently manage duplicates with functions for:
        - **Identifying Duplicates**: Find duplicate rows.
        - **Dropping Duplicates**: Remove duplicate rows, keeping specific occurrences.
        - **Near-Duplicate Detection**: Identify nearly identical values using fuzzy matching.
        
        Example Usage:
        
        ```python
        # Identify duplicates based on specific columns
        duplicates = bamboo.identify_duplicates(subset=['name'])
        
        # Drop duplicates, keeping the first occurrence
        bamboo.drop_duplicates(keep='first')
        
        # Handle near-duplicates using fuzzy matching
        bamboo.handle_near_duplicates(column='name', threshold=0.8)
        ```
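        
        Fuzzy matching with a similarity threshold can be sketched with the standard library's `difflib`. This is an assumption-laden illustration: BambooChute may use a different similarity measure, and `similarity` and `near_duplicates` are hypothetical helper names.
        
        ```python
        from difflib import SequenceMatcher
        
        def similarity(a, b):
            """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()
        
        def near_duplicates(values, threshold=0.8):
            """Return every pair of values whose similarity reaches the threshold."""
            pairs = []
            for i in range(len(values)):
                for j in range(i + 1, len(values)):
                    if similarity(values[i], values[j]) >= threshold:
                        pairs.append((values[i], values[j]))
            return pairs
        
        names = ["Jon Smith", "John Smith", "Jane Doe", "jon smith"]
        for pair in near_duplicates(names, threshold=0.8):
            print(pair)
        ```
        
        Comparing every pair costs time quadratic in the number of values, so on large columns it is worth deduplicating exact matches first.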
        
        ### Data Formatting
        
        BambooChute’s formatting tools allow you to standardize data appearance:
        - **Whitespace Management**: Trim excess whitespace from strings.
        - **Case Standardization**: Convert text to lowercase, uppercase, or title case.
        - **Special Character Removal**: Clean up text fields by removing unwanted symbols.
        
        Example Usage:
        
        ```python
        # Trim whitespace in string columns
        bamboo.trim_whitespace()
        
        # Standardize case to title
        bamboo.standardize_case(case='title')
        
        # Remove special characters in specific columns
        bamboo.remove_special_characters(columns=['text_column'], chars_to_remove='@#$')
        ```
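        
        For a single string, the three formatting steps amount to the following plain-Python sketch. `clean_text` is a hypothetical helper, and the order of operations (remove characters, trim, then change case) is an assumption made for the example.
        
        ```python
        CASE_FUNCS = {"lower": str.lower, "upper": str.upper, "title": str.title}
        
        def clean_text(text, chars_to_remove="@#$", case="title"):
            """Remove unwanted characters, trim surrounding whitespace, standardize case."""
            if case not in CASE_FUNCS:
                raise ValueError(f"unknown case: {case!r}")
            stripped = text.translate(str.maketrans("", "", chars_to_remove))
            return CASE_FUNCS[case](stripped.strip())
        
        raw = "  #john SMITH$ "
        print(repr(clean_text(raw)))                # title case
        print(repr(clean_text(raw, case="upper")))  # upper case
        ```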
        
        ### Data Profiling
        
        Gain insights into your data by generating summary reports, allowing for a deeper understanding of data structure and issues.
        
        ```python
        # Generate a summary report with key insights on data
        summary = bamboo.generate_summary_report()
        ```
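        
        The exact contents of the report are not listed here, but the kind of per-column figures a profile typically contains can be sketched with the standard library. `profile_column` and the chosen statistics are assumptions for illustration, not the structure returned by `generate_summary_report()`.
        
        ```python
        from statistics import mean, median
        
        def profile_column(values):
            """Summarize one numeric column, treating None as missing."""
            observed = [v for v in values if v is not None]
            return {
                "count": len(values),
                "missing": len(values) - len(observed),
                "mean": mean(observed) if observed else None,
                "median": median(observed) if observed else None,
                "min": min(observed) if observed else None,
                "max": max(observed) if observed else None,
            }
        
        goals = [2, 0, None, 3, 1]
        print(profile_column(goals))
        ```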
        
        ## Example Project
        
        This is an example project by Itay Mevorach using Bamboo for some minor data cleaning.
        
        ### Data Key
        
        #### Match Information
        - **Season**: League Season
        - **Div**: League Division
        - **Date**: Match Date (dd/mm/yy)
        - **Time**: Time of match kick off
        - **HomeTeam**: Home Team
        - **AwayTeam**: Away Team
        - **FTHG**: Full Time Home Team Goals
        - **FTAG**: Full Time Away Team Goals
        - **FTR**: Full Time Result (H=Home Win, D=Draw, A=Away Win)
        - **HTHG**: Half Time Home Team Goals
        - **HTAG**: Half Time Away Team Goals
        - **HTR**: Half Time Result (H=Home Win, D=Draw, A=Away Win)
        
        #### Match Statistics (where available)
        - **Referee**: Match Referee
        - **HS**: Home Team Shots
        - **AS**: Away Team Shots
        - **HST**: Home Team Shots on Target
        - **AST**: Away Team Shots on Target
        - **HC**: Home Team Corners
        - **AC**: Away Team Corners
        - **HF**: Home Team Fouls Committed
        - **AF**: Away Team Fouls Committed
        - **HY**: Home Team Yellow Cards
        - **AY**: Away Team Yellow Cards
        - **HR**: Home Team Red Cards
        - **AR**: Away Team Red Cards
        
        ```python
        import pandas as pd
        import numpy as np
        
        import matplotlib.pyplot as plt
        import seaborn as sns
        from sklearn.decomposition import PCA
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split
        
        # This is my own package that I wrote for data cleaning. Check it out: https://pypi.org/project/BambooChute/ :)
        import bamboo as Bamboo
        ```
        
        ```python
        past_data = pd.read_csv("data/past-data.csv")
        past_data.head(5)
        ```
        
        ```python
        bamboo = Bamboo.Bamboo(past_data, sys_log=False)
        bamboo.save_state()
        bamboo.preview_data()
        ```
        
        ```python
        # Build a single datetime column. Missing kick-off times are set to 00:00 so that
        # matches with a known date but no recorded time are not lost.
        # Dates are dd/mm/yy (see the data key), so they are parsed day-first.
        bamboo.data["DateTime"] = pd.to_datetime(bamboo.data["Date"] + ", " + bamboo.data["Time"].fillna("00:00"), errors="coerce", dayfirst=True)
        bamboo.data = bamboo.data.drop(columns=["Date", "Time"])
        bamboo.save_state()
        bamboo.data.shape
        ```
        
        ```python
        # This is one way to get rid of our nulls, but the issue with imputing by mean is that it will skew results of our match statistics.
        # We can drop the rows with missing values instead, or use a more complex imputation strategy. However since the ultimate goal is to
        # find the factors which dictate the outcome of a match, we should drop the rows with missing vals to avoid skewing our results.
        
        # Impute by mean, defaults to mode for categorical columns
        bamboo.impute_missing(strategy="mean")
        bamboo.data.isnull().any()
        ```
        
        ```python
        bamboo.undo()
        bamboo.data.isnull().any()
        ```
        
        ```python
        # We will actually drop the rows with missing values, to preserve the integrity of our data
        bamboo.drop_missing()
        bamboo.save_state()
        bamboo.data.shape
        ```
        
        ```python
        categorical_headers = ["Season", "Div", "HomeTeam", "AwayTeam", "Referee", "HTR", "FTR"] # Even though season is a number, it is categorical
        bamboo.convert_to_categorical(columns=categorical_headers)
        
        for cat in categorical_headers:
            print(bamboo.get_unique_categories(cat))
        
        # We can see the categories for each categorical column now
        ```
        
        ```python
        # Check for outliers with several methods; the elif chain stops at the first method that reports any
        if True in bamboo.detect_outliers_zscore(threshold=2):
            print("Outliers detected -- Z-Score")
        elif True in bamboo.detect_outliers_iqr():
            print("Outliers detected -- IQR")
        elif True in bamboo.detect_outliers_modified_zscore(threshold=2):
            print("Outliers detected -- Modified Z-Score")
        elif True in bamboo.detect_outliers_isolation_forest(n_estimators=150):
            print("Outliers detected -- Isolation Forest")
        elif True in bamboo.detect_outliers_lof():
            print("Outliers detected -- Local Outlier Factor")
        elif True in bamboo.detect_outliers_dbscan():
            print("Outliers detected -- DBSCAN")
        
        # There are no outliers because this is a dataset of football matches (the only outlier in football is Ronaldo)
        ```
        
        ```python
        bamboo.export_data("data/past-data-clean.csv", format="csv") # Making a backup of our cleaned data
        clean_data = bamboo.get_data()
        print(clean_data)
        print("\nData Types:\n", clean_data.dtypes)
        ```
        
        ```python
        # Plot divisions on the y-axis so the bars match the axis labels below
        sns.countplot(y='Div', data=clean_data)
        plt.title("Matches by Division")
        plt.xlabel("Count")
        plt.ylabel("Division")
        plt.show()
        ```
        
        ```python
        ftr = clean_data['FTR'].value_counts().sort_index()
        htr = clean_data['HTR'].value_counts().sort_index()
        outcomes_df = pd.DataFrame({'FTR': ftr, 'HTR': htr})
        
        outcomes_df.plot(kind='bar', color=['green', 'blue'], width=0.8)
        plt.title("Match Outcomes by Half Time and Full Time Results")
        plt.xlabel("Outcome")
        plt.ylabel("Count")
        plt.show()
        # Remember H means home team win, A means away team win, D means draw
        ```
        
        ```python
        numerical_cols = ['FTHG', 'FTAG', 'HS', 'AS', 'HTHG', 'HTAG', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR']
        numerical_data = clean_data[numerical_cols]
        
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(numerical_data)
        
        # We can use PCA to reduce the dimensionality of our data
        pca = PCA(n_components=2)
        pca_result = pca.fit_transform(scaled_data)
        clean_data['PCA1'] = pca_result[:, 0]
        clean_data['PCA2'] = pca_result[:, 1]
        
        sns.scatterplot(x='PCA1', y='PCA2', hue=clean_data['FTR'], palette='cool', data=clean_data)
        plt.title("PCA of Match Data")
        plt.show()
        ```
        
        ```python
        kmeans = KMeans(n_clusters=3, random_state=42)
        clean_data['Cluster'] = kmeans.fit_predict(scaled_data)
        
        sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', palette='viridis', data=clean_data)
        plt.title("Clustering of Match Data")
        plt.show()
        ```
        
        ```python
        features = ['HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR'] # FTHG and FTAG would be way too good of features, they are basically the target
        target = 'FTR'  # Home Win: 1, Draw: 0, Away Win: -1
        
        # Map FTR
        ftr_mapping = {'H': 1, 'D': 0, 'A': -1}
        clean_data['FTR_num'] = clean_data['FTR'].map(ftr_mapping)
        
        x = clean_data[features]
        y = clean_data['FTR_num']
        scaler = StandardScaler()
        x = scaler.fit_transform(x)
        
        X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
        
        reg_model = LinearRegression()
        reg_model.fit(X_train, y_train)
        
        print(len(features), len(reg_model.coef_))
        
        feature_importance = pd.DataFrame({
            'Feature': features,
            'Coefficient': reg_model.coef_
        })
        feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)
        print(feature_importance)
        
        sns.barplot(x='Coefficient', y='Feature', data=feature_importance, palette='coolwarm', hue='Feature')
        plt.title("Feature Importance Based on Linear Regression Coefficients")
        plt.show()
        ```
        
        ```python
        # It is commonly known that the accuracy of shots is extremely important in determining the outcome of a match. I'll create a new
        # column that is the ratio of shots on target to total shots, and see how important this new feature is.
        
        clean_data['HST_%'] = clean_data.apply(lambda row: row['HST'] / row['HS'] if row['HS'] != 0 else 0, axis=1)
        clean_data['AST_%'] = clean_data.apply(lambda row: row['AST'] / row['AS'] if row['AS'] != 0 else 0, axis=1)
        
        features = ['HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'HST_%', 'AST_%'] # Added HST_% and AST_%, the rest of the code is copied down
        target = 'FTR'  # Home Win: 1, Draw: 0, Away Win: -1
        
        # Map FTR
        ftr_mapping = {'H': 1, 'D': 0, 'A': -1}
        clean_data['FTR_num'] = clean_data['FTR'].map(ftr_mapping)
        
        x = clean_data[features]
        y = clean_data['FTR_num']
        scaler = StandardScaler()
        x = scaler.fit_transform(x)
        
        X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
        
        reg_model = LinearRegression()
        reg_model.fit(X_train, y_train)
        
        print(len(features), len(reg_model.coef_))
        
        feature_importance = pd.DataFrame({
            'Feature': features,
            'Coefficient': reg_model.coef_
        })
        feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)
        print(feature_importance)
        
        sns.barplot(x='Coefficient', y='Feature', data=feature_importance, palette='coolwarm', hue='Feature')
        plt.title("Feature Importance Based on Linear Regression Coefficients")
        plt.show()
        ```
        
        ```python
        sns.boxplot(x='FTR', y='HST_%', data=clean_data, palette='coolwarm', hue='FTR')
        plt.title("Correlation between Match Outcome and Home Team Shots on Target Percentage")
        plt.xlabel("Match Outcome (FTR)")
        plt.ylabel("Home Team Shots on Target Percentage")
        plt.show()
        ```
        
        ```python
        sns.boxplot(x='FTR', y='AST_%', data=clean_data, palette='coolwarm', hue='FTR')
        plt.title("Correlation between Match Outcome and Away Team Shots on Target Percentage")
        plt.xlabel("Match Outcome (FTR)")
        plt.ylabel("Away Team Shots on Target Percentage")
        plt.show()
        ```
        
        ## Testing
        
        The BambooChute package uses **pytest** for testing. To execute all tests in the `tests` directory, run:
        
        ```bash
        pytest tests/
        ```
        
        Add the `-v` flag for verbose output, or `--maxfail=3` to stop the run after three failures. The test suite is there to check each function for reliability and performance.
        
        ## Contributing
        
        To contribute:
        
        1. Go to the homepage and fork the repository.
        2. Create a new branch.
        3. Make your changes and submit a pull request that includes your email address.
          - You will receive an email asking about your changes. Please reply to it.
        4. Thanks for contributing!
        
        ## License
        
        BambooChute is licensed under the MIT License.
        
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.6
Description-Content-Type: text/markdown
