Metadata-Version: 2.1
Name: biosets
Version: 1.1.0
Summary: Bioinformatics datasets and tools
Home-page: https://github.com/psmyth94/biosets
Author: Patrick Smyth
Author-email: psmyth1994@gmail.com
License: Apache 2.0
Download-URL: https://github.com/psmyth94/biosets/tags
Keywords: omics machine learning bioinformatics datasets
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8.0,<3.12.0
Description-Content-Type: text/markdown
Requires-Dist: datasets>=2.19.1
Requires-Dist: biocore
Provides-Extra: apache-beam
Requires-Dist: apache-beam<2.44.0,>=2.26.0; extra == "apache-beam"
Provides-Extra: docs
Requires-Dist: s3fs; extra == "docs"
Provides-Extra: jax
Requires-Dist: jax>=0.3.14; extra == "jax"
Requires-Dist: jaxlib>=0.3.14; extra == "jax"
Provides-Extra: polars
Requires-Dist: polars>=0.20.5; extra == "polars"
Requires-Dist: timezones>=0.10.2; extra == "polars"
Provides-Extra: quality
Requires-Dist: ruff>=0.1.5; extra == "quality"
Provides-Extra: s3
Requires-Dist: s3fs; extra == "s3"
Provides-Extra: scipy
Requires-Dist: scipy; extra == "scipy"
Provides-Extra: tensorflow
Requires-Dist: tensorflow!=2.6.0,!=2.6.1,>=2.2.0; (sys_platform != "darwin" or platform_machine != "arm64") and extra == "tensorflow"
Requires-Dist: tensorflow-macos; (sys_platform == "darwin" and platform_machine == "arm64") and extra == "tensorflow"
Provides-Extra: tensorflow-gpu
Requires-Dist: tensorflow-gpu!=2.6.0,!=2.6.1,>=2.2.0; extra == "tensorflow-gpu"
Provides-Extra: test
Requires-Dist: ruff>=0.1.5; extra == "test"
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-timeout; extra == "test"
Requires-Dist: pytest-xdist; extra == "test"
Requires-Dist: scipy; extra == "test"
Requires-Dist: polars>=0.20.5; extra == "test"
Requires-Dist: timezones>=0.10.2; extra == "test"
Requires-Dist: s3fs; extra == "test"
Provides-Extra: torch
Requires-Dist: torch; extra == "torch"
Provides-Extra: vcf
Requires-Dist: cyvcf2>=0.30.0; extra == "vcf"
Requires-Dist: sgkit>=0.0.1; extra == "vcf"

<p align="center">
    $${\Huge{\textbf{\textsf{\color{#2E8B57}Bio\color{#4682B4}sets}}}}$$
    <br/>
    <br/>
</p> 
<p align="center">
    <a href="https://github.com/psmyth94/biosets/actions/workflows/ci_cd_pipeline.yml?query=branch%3Amain"><img alt="Build" src="https://github.com/psmyth94/biosets/actions/workflows/ci_cd_pipeline.yml/badge.svg?branch=main"></a>
    <a href="https://github.com/psmyth94/biosets/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/psmyth94/biosets.svg?color=blue"></a>
    <a href="https://github.com/psmyth94/biosets/tree/main/docs"><img alt="Documentation" src="https://img.shields.io/website/http/github/psmyth94/biosets/tree/main/docs.svg?down_color=red&down_message=offline&up_message=online"></a>
    <a href="https://github.com/psmyth94/biosets/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/psmyth94/biosets.svg"></a>
    <a href="CODE_OF_CONDUCT.md"><img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg"></a>
    <a href="https://zenodo.org/records/14028772"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.14028772.svg" alt="DOI"></a>
</p>

**Biosets** is a specialized library that extends 🤗 [Datasets](https://github.com/huggingface/datasets) for bioinformatics data, providing the following main features:

- **Bioinformatics Specialization**: Streamlines data management specific to bioinformatics, such as handling samples, features, batches, and associated metadata.
- **Automatic Column Detection**: Infers sample, batch, input features, and target columns, simplifying downstream preprocessing.
- **Custom Data Classes**: Leverages specialized data classes (`ValueWithMetadata`, `Sample`, `Batch`, `RegressionTarget`, etc.) to manage metadata-rich bioinformatics data.
- **Polars Integration**: Optional [Polars](https://github.com/pola-rs/polars) integration enables high-performance data manipulation, ideal for large datasets.
- **Flexible Task Support**: Native support for binary classification, multiclass classification, multiclass-to-binary classification, and regression, adapting to diverse bioinformatics tasks.
- **Integration with 🤗 Datasets**: The `load_dataset` function loads common bioinformatics file formats such as CSV, JSON, and NPZ, among others, and can attach sample and feature metadata while loading.
- **Arrow File Caching**: Uses [Apache Arrow](https://github.com/apache/arrow) for efficient on-disk caching, so large datasets can be accessed quickly without being loaded entirely into memory.

Biosets helps bioinformatics researchers focus on analysis rather than data handling, with seamless compatibility with 🤗 Datasets.

## Installation

### With pip

You can install **Biosets** from PyPI:

```bash
pip install biosets
```

### With conda

Install **Biosets** via conda:

```bash
conda install -c patrico49 biosets
```

## Usage

**Biosets** provides a straightforward API for handling bioinformatics datasets with integrated metadata management. Here's a quick example:

```python
from biosets import load_dataset

bio_data = load_dataset(
    data_files="data_with_samples.csv",
    sample_metadata_files="sample_metadata.csv",
    feature_metadata_files="feature_metadata.csv",
    target_column="metadata1",
    experiment_type="metagenomics",
    batch_column="batch",
    sample_column="sample",
    metadata_columns=["metadata1", "metadata2"],
    drop_samples=False
)["train"]
```
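
The call above expects three files that agree with each other. The sketch below is not part of Biosets: it uses only the standard library to write tiny stand-in files under the same names and to check that every sample and feature in the data file has a metadata row. The column layout, the `feature` key column, the made-up values, and the helper names (`write_csv`, `read_csv`, `make_example_files`, `find_mismatches`) are all assumptions for illustration; the actual layout Biosets expects may differ.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    """Write a header row followed by data rows to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def make_example_files(out_dir):
    """Create three small files (hypothetical layout, made-up values)."""
    out_dir = Path(out_dir)
    # Assumed layout: one row per sample, one column per feature.
    write_csv(
        out_dir / "data_with_samples.csv",
        ["sample", "feature_a", "feature_b"],
        [["s1", 10, 0], ["s2", 3, 7]],
    )
    # Assumed layout: one row per sample, keyed by the "sample" column.
    write_csv(
        out_dir / "sample_metadata.csv",
        ["sample", "batch", "metadata1", "metadata2"],
        [["s1", "b1", "case", 34], ["s2", "b1", "control", 51]],
    )
    # Assumed layout: one row per feature, keyed by a "feature" column.
    write_csv(
        out_dir / "feature_metadata.csv",
        ["feature", "annotation"],
        [["feature_a", "group_1"], ["feature_b", "group_2"]],
    )
    return out_dir


def find_mismatches(out_dir, sample_column="sample", feature_key="feature"):
    """Return (samples, features) present in the data but missing metadata."""
    out_dir = Path(out_dir)
    data = read_csv(out_dir / "data_with_samples.csv")
    sample_meta = read_csv(out_dir / "sample_metadata.csv")
    feature_meta = read_csv(out_dir / "feature_metadata.csv")

    known_samples = {row[sample_column] for row in sample_meta}
    known_features = {row[feature_key] for row in feature_meta}
    # Every non-sample column in the data file is treated as a feature.
    header = list(data[0]) if data else []
    feature_columns = [name for name in header if name != sample_column]

    missing_samples = [
        row[sample_column] for row in data if row[sample_column] not in known_samples
    ]
    missing_features = [name for name in feature_columns if name not in known_features]
    return missing_samples, missing_features


with tempfile.TemporaryDirectory() as tmp:
    make_example_files(tmp)
    print(find_mismatches(tmp))  # prints ([], [])
```

Running a check like this before loading can surface mismatched identifiers early, which is usually easier to debug than a failed or silently incomplete metadata join.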

For further details, check the [advanced usage documentation](./docs/DATA_LOADING.md).

## Main Differences Between Biosets and 🤗 Datasets

- **Bioinformatics Focus**: While 🤗 Datasets is a general-purpose library, Biosets is tailored for the bioinformatics domain.
- **Seamless Metadata Integration**: Biosets is built for datasets with metadata dependencies, like sample and feature metadata.
- **Automatic Column Detection**: Reduces preprocessing time with automatic inference of sample, batch, feature, and label columns.
- **Specialized Data Classes**: Biosets introduces custom classes (e.g., `Sample`, `Batch`, `ValueWithMetadata`) to enable richer data representation.
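
To give a feel for what name-based column inference can look like, here is a minimal sketch. It is not Biosets' implementation: the candidate name lists and the `guess_columns` helper are hypothetical, and the real library may rely on feature types or other signals rather than column names alone.

```python
# Hypothetical candidate names for each role (assumptions for illustration).
SAMPLE_NAMES = ("sample", "sample_id", "samples")
BATCH_NAMES = ("batch", "batch_id", "batches")
TARGET_NAMES = ("target", "label", "labels", "class")


def guess_columns(column_names):
    """Assign sample, batch, and target roles by name; the rest are features."""
    # Map lowercase names back to the original spelling for case-insensitive lookup.
    lowered = {name.lower(): name for name in column_names}

    def first_match(candidates):
        for candidate in candidates:
            if candidate in lowered:
                return lowered[candidate]
        return None

    roles = {
        "sample": first_match(SAMPLE_NAMES),
        "batch": first_match(BATCH_NAMES),
        "target": first_match(TARGET_NAMES),
    }
    assigned = {name for name in roles.values() if name is not None}
    # Any column without a recognized role is treated as an input feature.
    roles["features"] = [name for name in column_names if name not in assigned]
    return roles


print(guess_columns(["Sample", "batch", "label", "gene_1", "gene_2"]))
# prints {'sample': 'Sample', 'batch': 'batch', 'target': 'label', 'features': ['gene_1', 'gene_2']}
```

When inference of this kind picks the wrong column, passing the names explicitly (as with `sample_column`, `batch_column`, and `target_column` in the usage example) avoids the ambiguity.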

## Disclaimers

Biosets may run Python code from custom `datasets` scripts to handle specific data formats. For security, users should:

- Inspect dataset scripts prior to execution.
- Use pinned versions for any repository dependencies.

If you manage a dataset and wish to update or remove it, please open a discussion or pull request on the Community tab of 🤗's datasets page.

## BibTeX

If you'd like to cite **Biosets**, please use the following:

```bibtex
@misc{smyth2024biosets,
    title = {psmyth94/biosets: 1.1.0},
    author = {Patrick Smyth},
    year = {2024},
    url = {https://github.com/psmyth94/biosets},
    note = {A library designed to support bioinformatics data with custom features, metadata integration, and compatibility with 🤗 Datasets.}
}
```



