Metadata-Version: 2.1
Name: VJModels
Version: 3.0.1
Summary: My personal machine learning models
Home-page: https://github.com/Vanderval31bs/VJModels
Download-URL: https://github.com/Vanderval31bs/VJModels/archive/refs/tags/v3.0.1-alpha.tar.gz
Author: Vanderval Borges de Souza Junior
Author-email: vander31bs@gmail.com
License: MIT
Keywords: MachineLearning,Models,Forests
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Description-Content-Type: text/markdown
License-File: LICENSE.txt

# 🧪 VJModels

A collection of my experimental machine learning models. These models are part of my personal exploration in the field, so they might not be fully refined, but they contain some interesting ideas. Feel free to check them out! You can also install the package via [pip](https://pypi.org/project/VJModels/) and incorporate the models into your own projects.


```bash
pip install VJModels
```

**Example usage:**

```python
from VJModels.Forests import IncrementalForestClassifier

# X_train, y_train, X_test should be your datasets

inc_forest = IncrementalForestClassifier()
inc_forest.fit(X_train, y_train)
y_test_pred = inc_forest.predict(X_test)
```

# Summary

1. Forests
   - [1.1 WSagging](#wsagging)
   - [1.2 Incremental](#incremental)

2. Linear Models
   - [2.1 Advanced Linear Regression](#advanced-linear-regression)

# Forests

## WSagging

**WSagging** is a term I coined, standing for **Weighted Score Averaging**. The idea is quite simple. Suppose you have `X` features and `y` targets. You select a number of times the algorithm will run (`n_models` parameter in the class constructor). At each iteration, you randomly split the dataset `X` into two different datasets: `X_train` and `X_validation`. The model is trained on the `X_train` dataset and scored on `X_validation` against `y_validation`. Both the trained model and its score are saved.

Later, during the prediction phase, you average the predictions based on the scores from the validation set to obtain a new prediction for your data. Specifically, the classifier and regressor work as follows:

### Classifier

```python
def predict(self, X):
    n_samples = len(X)
    results = [0] * n_samples

    predictions_list = [tree.predict(X) for tree in self.trees]
    importance_list = [self.get_importance(i) for i in range(len(self.trees))]

    for i in range(n_samples):
        results[i] = sum(importance if prediction[i] == 1 else -importance
                         for prediction, importance in zip(predictions_list, importance_list))

    return [1 if result > 0 else 0 for result in results]
```

- **`predictions_list`**: This contains the predictions made by all trees in the forest for each sample.
- **`importance_list`**: This holds the importance score for each tree, reflecting how much weight each tree's prediction carries.
- **Weighted Sum Calculation**:
  - For each sample, calculate the weighted sum of predictions from all trees.
  - Add the importance score for positive predictions (`1`) and subtract the importance score for negative predictions (`0`).
- **Final Classification**:
  - If the resulting weighted sum is positive, classify the sample as `1`.
  - If the weighted sum is zero or negative, classify the sample as `0`.

  ### Regressor

```python
def predict(self, X):
    predictions_list = [tree.predict(X) for tree in self.trees]
    importance_list = [self.get_importance(i) for i in range(len(self.trees))]

    results = [sum(importance * prediction for importance, prediction in zip(importance_list, preds)) / sum(importance_list) for preds in zip(*predictions_list)]
    
    return results
```

 **`predictions_list`**: This contains the predictions made by all trees in the forest for each sample.
- **`importance_list`**: This holds the importance score for each tree, indicating the weight of each tree's predictions.
- **Weighted Average Calculation**:
  - For each sample, compute the weighted average of predictions from all trees.
  - Multiply each prediction by its corresponding tree's importance score.
  - Sum these weighted predictions and divide by the total sum of importance scores to obtain the final result.
- **Final Prediction**:
  - The result is a weighted average of the predictions, where the importance scores determine the contribution of each tree’s prediction to the final outcome.

## Incremental

The IncrementalForests algorithm builds upon WSagging by incorporating an incremental approach. In the first iteration, the dataset `X` and targets `y` are split into `train_0` and `validation_0`. A model is trained on `train_0`, scored on `validation_0`, and both the model and score are saved.

In the next iteration, `validation_0` is further split into `train_1` and `validation_1`. The new training set `train_0` is combined with `train_1`, and a new model is trained on this merged dataset. This model is evaluated on `validation_1`, and the model and score are saved.

This process continues until one of the stopping criteria is met: the validation set becomes too small, the score drops by a predefined margin, the maximum score is achieved, or the number of trees reaches the maximum limit.

During the prediction phase, like in WSagging, predictions are averaged based on validation scores to obtain final predictions. The prediction algorithm is similar to WSagging but uses a different importance formula: `(n - i) * scores[i] ** exponent`, where `n` is the total number of models trained and `scores[i]` is the score of the ith model. This formula gives more weight to models trained earlier in the process.

# Linear Models

## Advanced Linear Regression

This class is designed for building, fitting, and summarizing a stepwise regression model with additional diagnostic checks to ensure model validity. The main steps include transforming categorical variables, fitting the model, checking the normality of residuals, applying transformations if necessary, and testing for heteroscedasticity. The class also provides a detailed summary of the model, including parameters, R² values, and p-values.

### **Key Steps and Functionality**

1. **Step 1: Categorical Variable Transformation**
    - The class begins by transforming categorical variables into dummy variables, which are suitable for regression modeling. If there is only one categorical variable, it is referred to as a "category." Otherwise, multiple variables are called "categories."

2. **Step 2: Model Fitting**
    - After transforming the variables, the class fits a stepwise regression model. This process involves removing predictors that are non-significant or cause multicollinearity issues.

3. **Step 3: Residual Normality Test**
    - The class checks the normality of residuals to validate model assumptions. The type of normality test used depends on the sample size:
        - **Shapiro-Francia test:** Used when the sample size is 30 or more.
        - **Shapiro-Wilk test:** Used when the sample size is less than 30.
    - The p-value from the test is reported, and a conclusion is drawn regarding the normality of the residuals.

4. **Step 4: Box-Cox Transformation (Optional)**
    - If the model's response variable does not meet the normality assumption, a Box-Cox transformation can be applied to stabilize variance and make the data more normally distributed. The transformed target variable is then used to refit the model via the stepwise method.

5. **Step 5: Heteroscedasticity Test**
    - The Breusch-Pagan test is performed to check for heteroscedasticity (non-constant variance of residuals). The p-value from the test is reported, and a conclusion is drawn about the presence or absence of heteroscedasticity.

### **Usage**

To use this class effectively:
1. Initialize the class with your dataset and specify any categorical columns.
2. Call the fitting method to perform all steps.
3. Use the `summary` method to get a detailed report of the model, including diagnostic tests and final results.
4. Use the object to predict the target value on new observations.

```python
from VJModels.LinearModels import AdvancedLinearRegression

model = AdvancedLinearRegression(data, 'target')
model.fit()

print(model.summary())

predictions = model.predict(new_data)
print(predictions)
```


  
