Metadata-Version: 2.4
Name: fast-seqfunc
Version: 0.2.0
Requires-Python: >=3.10
Requires-Dist: lazy-loader<0.5,>=0.4
Requires-Dist: loguru>=0.7.0
Requires-Dist: numpy>=1.22.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: pycaret<4,>=2.2.3
Requires-Dist: scikit-learn>=1.2.0
Requires-Dist: typer>=0.9.0
Description-Content-Type: text/markdown

# fast-seqfunc

Painless sequence-function models for proteins and nucleotides.

Made with ❤️ by Eric Ma (@ericmjl).

## Overview

Fast-SeqFunc is a Python package designed for efficient sequence-function modeling for proteins and nucleotide machine learning problems. It provides a simple, high-level API that handles sequence embedding methods and automates model selection and training.

The core purpose of Fast-SeqFunc is to quickly determine if there is meaningful "signal" in your sequence-function data. By rapidly building baseline models, you can discover early whether predictive relationships exist in your data and opportunistically use these models for scoring and ranking candidate sequences to test. When signal is detected, you can invest your time more effectively in developing advanced models (such as deep neural networks) as a second iteration.

### Key Features

- **Multiple Embedding Methods**:
  - One-hot encoding (currently implemented)
  - CARP (Microsoft's protein-sequence-models) - planned for future releases
  - ESM2 (Facebook's ESM) - planned for future releases

- **Automated Machine Learning**:
  - Uses PyCaret for model selection and hyperparameter tuning
  - Supports regression and classification tasks
  - Evaluates performance with appropriate metrics

- **Sequence Handling**:
  - Flexible handling of variable-length sequences
  - Configurable padding options for consistent embeddings
  - Custom alphabets support

- **Simple API**:
  - Single function call to train models
  - Handles data loading and preprocessing

- **Command-line Interface**:
  - Train models directly from the command line
  - Make predictions on new sequences
  - Compare different embedding methods

## Installation

### Using pip

```bash
pip install fast-seqfunc
```

### From Source

```bash
git clone git@github.com:ericmjl/fast-seqfunc
cd fast-seqfunc
pixi install
```

## Quick Start

### Python API

```python
from fast_seqfunc import train_model, predict
import pandas as pd

# Load your sequence-function data
train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")

# Train a model
model = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",  # currently the only implemented method
    model_type="regression",     # or "classification"
)

# Make predictions on new sequences
new_data = pd.read_csv("new_sequences.csv")
predictions = predict(model, new_data["sequence"])

# Save the model for later use
model.save("my_model.pkl")
```

### Command-line Interface

Train a model:

```bash
# All outputs (model, metrics, cache) will be saved to the 'outputs' directory
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function --embedding-method one-hot --output-dir outputs
```

Make predictions:

```bash
# All prediction outputs will be saved to the 'prediction_outputs' directory
fast-seqfunc predict-cmd outputs/model.pkl new_sequences.csv --output-dir prediction_outputs
```

Compare embedding methods:

```bash
# All outputs (comparison results, metrics, models, cache) will be saved to the 'comparison_outputs' directory
fast-seqfunc compare-embeddings train_data.csv --test-data test_data.csv --output-dir comparison_outputs
```

## Advanced Usage

### Using Multiple Embedding Methods

Currently, only one-hot encoding is implemented. Support for multiple embedding methods is planned for future releases.

```python
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
)
```

### Detailed Performance Metrics and Visualizations

The output directories from CLI commands contain comprehensive model performance metrics and visualizations:

```
outputs/                          # Main output directory
├── model.pkl                     # Saved model
├── summary.json                  # Summary of output locations and parameters
├── metrics/                      # Performance metrics and visualizations
│   ├── one-hot_metrics.json      # Detailed metrics in JSON format
│   ├── one-hot_predictions.csv   # Raw predictions and true values
│   ├── one-hot_scatter_plot.png  # Visualization plots
│   ├── one-hot_residual_plot.png
│   └── ...
└── cache/                        # Cached embeddings
```

For predictions:

```
prediction_outputs/               # Prediction output directory
├── predictions.csv               # Saved predictions
├── predictions_histogram.png     # Histogram of prediction values (for regression)
└── prediction_summary.json       # Summary of prediction parameters
```

When comparing embedding methods, a similar structure is created:

```
comparison_outputs/
├── embedding_comparison.csv      # Table comparing all methods
├── embedding_comparison_plot.png # Bar chart comparing metrics across methods
├── summary.json                  # Summary of output locations and parameters
├── models/                       # Saved models for each method
│   ├── one-hot_model.pkl
├── metrics/                      # Performance metrics for each method
│   ├── one-hot_metrics.json
└── cache/                        # Cached embeddings
```

You can also generate these outputs programmatically:

```python
from pathlib import Path
from fast_seqfunc import train_model, save_model, save_detailed_metrics

# Create output directories
output_dir = Path("my_model_outputs")
output_dir.mkdir(exist_ok=True)
metrics_dir = output_dir / "metrics"
metrics_dir.mkdir(exist_ok=True)
cache_dir = output_dir / "cache"
cache_dir.mkdir(exist_ok=True)

# Train model
model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    embedding_method="one-hot",
    cache_dir=cache_dir,
)

# Save model
save_model(model_info, output_dir / "model.pkl")

# Save detailed metrics if test data was provided
if model_info.get("test_results"):
    save_detailed_metrics(
        metrics_data=model_info["test_results"],
        output_dir=metrics_dir,
        model_type=model_info["model_type"],
        embedding_method="one-hot"
    )
```

### Custom Metrics for Optimization

Specify metrics to optimize during model selection:

```python
model = train_model(
    train_data=train_data,
    model_type="regression",
    optimization_metric="r2"  # or "rmse", "mae", etc.
)
```

### Handling Variable Length Sequences

Fast-SeqFunc handles variable length sequences with configurable padding:

```python
# Default behavior pads all sequences to the max length with "-"
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": True, "gap_character": "-"}
)

# Disable padding for sequences of different lengths
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": False}
)

# Set a fixed maximum length and custom gap character
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"max_length": 100, "gap_character": "X"}
)
```

For a complete example, see `examples/variable_length_sequences.py`.

## Documentation

For full documentation, visit [https://ericmjl.github.io/fast-seqfunc/](https://ericmjl.github.io/fast-seqfunc/).

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
