Metadata-Version: 2.1
Name: biosets
Version: 1.0.0
Summary: Bioinformatics datasets and tools
Home-page: https://github.com/psmyth94/biosets
Download-URL: https://github.com/psmyth94/biosets/tags
Author: Patrick Smyth
Author-email: psmyth1994@gmail.com
License: Apache 2.0
Keywords: omics machine learning bioinformatics datasets
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8.0,<3.12.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datasets
Provides-Extra: polars
Requires-Dist: polars>=0.20.5; extra == "polars"
Requires-Dist: timezones>=0.10.2; extra == "polars"
Provides-Extra: apache-beam
Requires-Dist: apache-beam<2.44.0,>=2.26.0; extra == "apache-beam"
Provides-Extra: vcf
Requires-Dist: cyvcf2>=0.30.0; extra == "vcf"
Requires-Dist: sgkit>=0.0.1; extra == "vcf"
Provides-Extra: tensorflow
Requires-Dist: tensorflow!=2.6.0,!=2.6.1,>=2.2.0; (sys_platform != "darwin" or platform_machine != "arm64") and extra == "tensorflow"
Requires-Dist: tensorflow-macos; (sys_platform == "darwin" and platform_machine == "arm64") and extra == "tensorflow"
Provides-Extra: tensorflow-gpu
Requires-Dist: tensorflow-gpu!=2.6.0,!=2.6.1,>=2.2.0; extra == "tensorflow-gpu"
Provides-Extra: torch
Requires-Dist: torch; extra == "torch"
Provides-Extra: jax
Requires-Dist: jax>=0.3.14; extra == "jax"
Requires-Dist: jaxlib>=0.3.14; extra == "jax"
Provides-Extra: s3
Requires-Dist: s3fs; extra == "s3"
Provides-Extra: test
Requires-Dist: ruff>=0.1.5; extra == "test"
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-timeout; extra == "test"
Requires-Dist: pytest-xdist; extra == "test"
Requires-Dist: s3fs; extra == "test"
Provides-Extra: quality
Requires-Dist: ruff>=0.1.5; extra == "quality"
Provides-Extra: docs
Requires-Dist: s3fs; extra == "docs"

# BioSets: Dataset Creation for Biological Research

🧬 BioSets is a specialized library built on top of the `datasets` library, designed to
facilitate the loading, manipulation, and processing of biological datasets for machine
learning purposes. It supports various types of biological data, including omics
datasets such as genomics, transcriptomics, proteomics, and metabolomics, as well as
other types of tabular biological data. This library is intended to provide users
with an efficient way to work with biological data in their machine learning pipelines.

## Overview

This repository contains tools and documentation for creating biological datasets using
BioSets. The library provides capabilities for loading biological data from local
files, creating custom datasets, and handling large volumes of biological information
with ease. BioSets is particularly useful for researchers and data scientists working
in fields such as bioinformatics, systems biology, and biotechnology.

BioSets is geared towards accelerating the loading and processing of high-dimensional
data, a capability that many machine learning libraries lack. It does so by handling
both sample metadata and feature metadata efficiently, which lets users build modular,
high-performance data processing pipelines.

## Features

🧬 **Loading sample metadata and feature metadata**: BioSets provides the unique
capability to load both sample metadata and feature metadata, facilitating modular
downstream analysis pipelines. This ensures that users can easily manage and access
detailed information about each sample and feature, improving the interpretability and
flexibility of their datasets.

🧬 **Support for various biological data types**: BioSets includes predefined classes
for different biological data types, such as genomic variants, gene expression data,
clinical trial data, and OTU tables.

🧬 **Automatic Sample/Batch Detection**: BioSets can automatically detect sample and
batch information from the loaded data, making it easier to handle batch effects and
other confounding factors in downstream analyses.

🧬 **Custom dataset creation**: Create tailored datasets with custom features, metadata,
and labels.

🧬 **Integration with the `datasets` library**: BioSets builds on the functionality
provided by the `datasets` library. For general-purpose dataset operations, users can
refer to the `datasets` library documentation. If you do not use any of the experiment
types listed by `biosets.list_experiments()`, BioSets simply behaves like Hugging Face's
`datasets` library.

## Getting Started

To use the BioSets library, install the package and its dependencies as described
under Installation. After setting up your environment, you can create your own
dataset by following the steps below.

### Installation

You can install BioSets using pip:

```bash
pip install biosets
```

### Creating a Biological Dataset

To create a dataset for biological research using BioSets, follow these steps:

1. **Organize Your Data**: Prepare your biological data in a structured format that
BioSets can process (e.g., a directory of relevant files).

2. **Load Your Data with Metadata**: Use `load_dataset()` to load your data along with
sample metadata and feature metadata. This modular approach allows for more detailed
downstream analyses:

   ```python
   from biosets import load_dataset

   dataset = load_dataset(
       "snp",
       data_files="/path/to/snp_data.csv",
       sample_metadata_files="/path/to/sample_metadata.csv",
       feature_metadata_files="/path/to/feature_metadata.csv",
   )
   ```

3. **Utilize Metadata for Analysis**: The loaded dataset allows you to access and use
metadata easily in downstream analyses. For example, you can handle abundance data
differently based on its type:

   ```python
   from biosets.features import Abundance

   # Report every column whose feature type is Abundance
   for k, v in dataset.features.items():
       if isinstance(v, Abundance):
           print(f"Processing abundance feature: {k}")
   ```

   This feature is particularly useful for modular pipeline development, where certain
   analyses or transformations are applied only to specific types of data, such as
   abundance measurements.
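
As a rough illustration of step 1, the sketch below writes three small CSV files whose
sample and feature identifiers line up with each other. The layout is an assumption
made for illustration (one row per sample in the data and sample metadata files, one
row per feature in the feature metadata file), and the column names `sample`,
`feature`, `batch`, `label`, and `chromosome`, as well as the helper functions, are
hypothetical. Check the dataset loading documentation for the formats BioSets actually
expects.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    """Write a header row followed by data rows to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def write_example_files(directory):
    """Create three small, mutually consistent CSV files (assumed layout)."""
    directory = Path(directory)
    features = ["rs0001", "rs0002", "rs0003"]  # made-up feature identifiers
    paths = {
        "data": directory / "snp_data.csv",
        "sample_metadata": directory / "sample_metadata.csv",
        "feature_metadata": directory / "feature_metadata.csv",
    }
    # Data table: one row per sample, one column per feature.
    write_csv(
        paths["data"],
        ["sample"] + features,
        [["sample_1", 0, 1, 2], ["sample_2", 1, 1, 0]],
    )
    # Sample metadata: one row per sample.
    write_csv(
        paths["sample_metadata"],
        ["sample", "batch", "label"],
        [["sample_1", "batch_a", "case"], ["sample_2", "batch_b", "control"]],
    )
    # Feature metadata: one row per feature.
    write_csv(
        paths["feature_metadata"],
        ["feature", "chromosome"],
        [["rs0001", "1"], ["rs0002", "1"], ["rs0003", "2"]],
    )
    return paths


def check_alignment(paths):
    """Return True if sample and feature IDs agree across the three files."""
    with open(paths["data"], newline="") as handle:
        rows = list(csv.reader(handle))
    data_features = rows[0][1:]
    data_samples = [row[0] for row in rows[1:]]
    with open(paths["sample_metadata"], newline="") as handle:
        meta_samples = [row["sample"] for row in csv.DictReader(handle)]
    with open(paths["feature_metadata"], newline="") as handle:
        meta_features = [row["feature"] for row in csv.DictReader(handle)]
    return data_samples == meta_samples and data_features == meta_features


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        example_paths = write_example_files(tmp)
        print(check_alignment(example_paths))
```

Files prepared this way could then be passed to `load_dataset()` through the
`data_files`, `sample_metadata_files`, and `feature_metadata_files` arguments shown in
step 2. Checking the identifiers before loading is a cheap way to catch mismatched
metadata early.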

### Dataset Examples

#### Loading Specific Experiments

With BioSets, users are encouraged to load data with a specific experiment type, such
as `otu`, `maldi`, `rna`, or `snp`, so that the appropriate configuration is applied:

- **OTU Data**

  ```python
  from biosets import load_dataset
  dataset = load_dataset("otu", data_files="/path/to/otu_data.csv")
  ```

- **RNA Data**

  ```python
  dataset = load_dataset("rna", data_files="/path/to/rna_data.csv")
  ```

- **SNP Data**

  ```python
  dataset = load_dataset("snp", data_files="/path/to/snp_data.csv")
  ```
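
When the experiment type comes from user input or a configuration file, it can help to
reject unknown names before loading anything. The sketch below is a hypothetical
wrapper, not part of BioSets: `load_experiment` and `README_EXPERIMENT_TYPES` are names
invented here, and the tuple only contains the experiment types mentioned in this
README, so an installed version may support others (see `biosets.list_experiments()`).

```python
# Experiment names mentioned in this README; the installed version may support more.
README_EXPERIMENT_TYPES = ("otu", "maldi", "rna", "snp")


def load_experiment(experiment_type, data_files, **kwargs):
    """Hypothetical wrapper: check the experiment name, then call load_dataset."""
    if experiment_type not in README_EXPERIMENT_TYPES:
        raise ValueError(
            f"Unknown experiment type {experiment_type!r}; "
            f"expected one of {README_EXPERIMENT_TYPES}"
        )
    # Imported lazily so the name check works even without BioSets installed.
    from biosets import load_dataset

    return load_dataset(experiment_type, data_files=data_files, **kwargs)
```

For example, `load_experiment("otu", "/path/to/otu_data.csv")` forwards to
`load_dataset("otu", data_files=...)`, while a misspelled name raises a `ValueError`
before any file is read.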

### Next Steps

After creating your biological dataset, you can leverage BioSets for downstream tasks
such as feature extraction, model training, or data visualization.

For more advanced loading and processing of biological datasets, refer to the
[dataset loading documentation](src/biosets/DATASET_LOADING.md). For building custom
datasets, refer to the [custom dataset creation documentation](src/biosets/CUSTOM_DATASETS.md).

For any additional information not covered in the BioSets documentation,
please refer to the [datasets library documentation](https://huggingface.co/docs/datasets/).

## Contributing

We welcome contributions to the BioSets project! If you have suggestions for
improvements or new features, feel free to open an issue or submit a pull request. For
major changes, please open an issue first to discuss what you would like to change.

## License

This project is licensed under the Apache 2.0 License. See the LICENSE file for more details.
