Metadata-Version: 2.1
Name: biosets
Version: 1.0.1
Summary: Bioinformatics datasets and tools
Home-page: https://github.com/psmyth94/biosets
Download-URL: https://github.com/psmyth94/biosets/tags
Author: Patrick Smyth
Author-email: psmyth1994@gmail.com
License: Apache 2.0
Keywords: omics machine learning bioinformatics datasets
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8.0,<3.12.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datasets
Provides-Extra: apache-beam
Requires-Dist: apache-beam <2.44.0,>=2.26.0 ; extra == 'apache-beam'
Provides-Extra: docs
Requires-Dist: s3fs ; extra == 'docs'
Provides-Extra: jax
Requires-Dist: jax >=0.3.14 ; extra == 'jax'
Requires-Dist: jaxlib >=0.3.14 ; extra == 'jax'
Provides-Extra: polars
Requires-Dist: polars >=0.20.5 ; extra == 'polars'
Requires-Dist: timezones >=0.10.2 ; extra == 'polars'
Provides-Extra: quality
Requires-Dist: ruff >=0.1.5 ; extra == 'quality'
Provides-Extra: s3
Requires-Dist: s3fs ; extra == 's3'
Provides-Extra: scipy
Requires-Dist: scipy ; extra == 'scipy'
Provides-Extra: tensorflow
Requires-Dist: tensorflow !=2.6.0,!=2.6.1,>=2.2.0 ; (sys_platform != "darwin" or platform_machine != "arm64") and extra == 'tensorflow'
Requires-Dist: tensorflow-macos ; (sys_platform == "darwin" and platform_machine == "arm64") and extra == 'tensorflow'
Provides-Extra: tensorflow_gpu
Requires-Dist: tensorflow-gpu !=2.6.0,!=2.6.1,>=2.2.0 ; extra == 'tensorflow_gpu'
Provides-Extra: test
Requires-Dist: ruff >=0.1.5 ; extra == 'test'
Requires-Dist: pytest ; extra == 'test'
Requires-Dist: pytest-timeout ; extra == 'test'
Requires-Dist: pytest-xdist ; extra == 'test'
Requires-Dist: scipy ; extra == 'test'
Requires-Dist: polars >=0.20.5 ; extra == 'test'
Requires-Dist: timezones >=0.10.2 ; extra == 'test'
Requires-Dist: s3fs ; extra == 'test'
Provides-Extra: torch
Requires-Dist: torch ; extra == 'torch'
Provides-Extra: vcf
Requires-Dist: cyvcf2 >=0.30.0 ; extra == 'vcf'
Requires-Dist: sgkit >=0.0.1 ; extra == 'vcf'

# BioSets: Dataset Creation for Biological Research

Please note that this project is in the early stages of development. The documentation
and features are subject to change.

## Overview

BioSets is a library built on top of the `datasets` library for loading, manipulating,
and processing biological datasets for machine learning purposes. It supports genomics,
transcriptomics, proteomics, metabolomics, and other types of biological data.

This repository contains tools and documentation for creating biological datasets using
BioSets. The library loads biological data from local files, creates custom datasets,
and handles large volumes of biological information. BioSets is intended for
researchers and data scientists in bioinformatics, systems biology, and biotechnology.

## Features

🧬 **Loading sample metadata and feature metadata**: BioSets loads sample metadata and
feature metadata alongside the data itself.

🧬 **Support for various biological data types**: Includes predefined classes for
genomic variants, gene expression data, clinical trial data, and OTU tables.

🧬 **Automatic sample/batch detection**: Automatically detects sample and batch
information from the loaded data to handle batch effects and confounding factors.

🧬 **Custom dataset creation**: Create custom datasets with specific features,
metadata, and labels.

🧬 **Integration with datasets library**: BioSets builds on the `datasets` library's
functionality. If the `path` argument passed to `load_dataset()` is not one of the
values returned by `biosets.list_experiment_types()`, the call behaves like Hugging
Face's `datasets` library.
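
The fallback rule above can be illustrated with a minimal sketch. This is not the
library's implementation: `resolve_backend` is a hypothetical helper, and
`EXPERIMENT_TYPES` is a stand-in containing only the experiment types named in this
README. With BioSets installed, the real list would come from
`biosets.list_experiment_types()`.

```python
# Stand-in for biosets.list_experiment_types(); ASSUMPTION: only the
# experiment types named in this README are listed here.
EXPERIMENT_TYPES = ["otu", "maldi", "rna", "snp"]


def resolve_backend(path, experiment_types):
    """Return which behavior a `load_dataset(path, ...)` call would follow.

    Hypothetical helper: "biosets" when `path` is a known experiment type,
    otherwise "datasets" (the plain Hugging Face `datasets` behavior).
    """
    return "biosets" if path in experiment_types else "datasets"


# "some_user/some_dataset" is a made-up identifier used only for illustration.
for path in ("snp", "some_user/some_dataset"):
    print(path, "->", resolve_backend(path, EXPERIMENT_TYPES))
```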

## Getting Started

To use BioSets, install the package and its dependencies with pip. After setting up
your environment, create your dataset by following the steps below.

### Installation

Install BioSets using pip:

```bash
pip install biosets
```
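
The package metadata declares support for Python `>=3.8.0,<3.12.0`. As a quick sanity
check after installation, the sketch below tests the interpreter version against that
range and reports the installed BioSets version, if any. The helper names
`is_supported_python` and `installed_version` are illustrative and not part of BioSets.

```python
import sys
from importlib import metadata


def is_supported_python(version_info=None):
    """Return True if (major, minor) falls within the declared >=3.8,<3.12 range."""
    major_minor = tuple((version_info or sys.version_info)[:2])
    return (3, 8) <= major_minor < (3, 12)


def installed_version(package="biosets"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python supported:", is_supported_python())
print("biosets version:", installed_version() or "not installed")
```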

### Creating a Biological Dataset

To create a dataset for biological research using BioSets, follow these steps:

1. **Organize Your Data**: Prepare your biological data in a structured format that
BioSets can process (e.g., a directory of relevant files).

2. **Load Your Data with Metadata**: Use `load_dataset()` to load your data along with
sample metadata and feature metadata:

   ```python
   from biosets import load_dataset

   dataset = load_dataset(
       "snp",
       data_files="/path/to/snp_data.csv",
       sample_metadata_files="/path/to/sample_metadata.csv",
       feature_metadata_files="/path/to/feature_metadata.csv",
   )
   ```

3. **Utilize Metadata for Analysis**: The loaded dataset allows you to access and use
metadata in downstream analyses. For example, you can handle abundance data differently
based on its type:

   ```python
   from biosets.features import Abundance

   # `dataset` is the object returned by load_dataset() in step 2.
   for k, v in dataset.features.items():
       if isinstance(v, Abundance):
           print(f"Processing abundance feature: {k}")
   ```
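
The type check in step 3 can be wrapped in a small reusable helper. The sketch below
runs without BioSets installed by using stand-in classes: the local `Abundance` class
only mimics `biosets.features.Abundance`, and `SampleAnnotation` and the column names
are hypothetical. With a real dataset, you would pass `dataset.features` and the
imported `Abundance` class instead.

```python
class Abundance:
    """Stand-in for biosets.features.Abundance (ASSUMPTION: illustration only)."""


class SampleAnnotation:
    """Hypothetical non-abundance feature type, used only for contrast."""


def select_feature_names(features, feature_type):
    """Return the names of all features that are instances of `feature_type`."""
    return [
        name
        for name, feature in features.items()
        if isinstance(feature, feature_type)
    ]


# Toy mapping shaped like `dataset.features` (column name -> feature object).
features = {
    "otu_1": Abundance(),
    "otu_2": Abundance(),
    "age": SampleAnnotation(),
}

abundance_columns = select_feature_names(features, Abundance)
print(abundance_columns)
```

Collecting the column names first keeps the type check in one place, so later steps
can operate on a plain list of names.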

### Dataset Examples

#### Loading Specific Experiments

Pass an experiment type such as `otu`, `maldi`, `rna`, or `snp` as the first argument
to `load_dataset()` so that the appropriate configuration is applied:

🧬 **OTU Data**

  ```python
  from biosets import load_dataset
  dataset = load_dataset("otu", data_files="/path/to/otu_data.csv")
  ```

🧬 **RNA Data**

  ```python
  dataset = load_dataset("rna", data_files="/path/to/rna_data.csv")
  ```

🧬 **SNP Data**

  ```python
  dataset = load_dataset("snp", data_files="/path/to/snp_data.csv")
  ```
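
When several experiments share one directory layout, the keyword arguments for
`load_dataset()` can be assembled programmatically. The following is a minimal sketch
under stated assumptions: `build_load_kwargs` is a hypothetical helper, and the file
naming convention (`<experiment>_data.csv`, `sample_metadata.csv`,
`feature_metadata.csv`) simply mirrors the example paths in this README rather than
anything BioSets requires. The demo writes placeholder files to a temporary directory
so that it runs without real data.

```python
import tempfile
from pathlib import Path


def build_load_kwargs(experiment, data_dir):
    """Assemble keyword arguments for `load_dataset()` from a data directory.

    ASSUMPTION: file names follow the example paths in this README.
    Metadata files are optional and included only when present.
    """
    data_dir = Path(data_dir)
    data_file = data_dir / f"{experiment}_data.csv"
    if not data_file.is_file():
        raise FileNotFoundError(f"Missing data file: {data_file}")

    kwargs = {"path": experiment, "data_files": str(data_file)}
    optional_files = {
        "sample_metadata_files": "sample_metadata.csv",
        "feature_metadata_files": "feature_metadata.csv",
    }
    for key, filename in optional_files.items():
        candidate = data_dir / filename
        if candidate.is_file():
            kwargs[key] = str(candidate)
    return kwargs


# Demo with placeholder files (contents are toy values, not real data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "otu_data.csv").write_text("sample,otu_1\ns1,3\n")
    Path(tmp, "sample_metadata.csv").write_text("sample,batch\ns1,b1\n")
    print(sorted(build_load_kwargs("otu", tmp)))
```

With BioSets installed and real files in place, the result could be passed on as
`load_dataset(**build_load_kwargs("otu", "/path/to/data_dir"))`.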

### Next Steps

After creating your biological dataset, you can use it in downstream tasks such as
feature extraction, model training, or data visualization.

For more advanced usage, refer to the [dataset loading
documentation](src/biosets/DATASET_LOADING.md). For building custom datasets, refer to
the [custom dataset creation documentation](src/biosets/CUSTOM_DATASETS.md).

For any additional information, refer to the [datasets library
documentation](https://huggingface.co/docs/datasets/).

## Contributing

Contributions are welcome! If you have suggestions for improvements or new features,
open an issue or submit a pull request. For major changes, open an issue first to
discuss them.

## License

This project contains portions derived from various sources under the Apache License,
Version 2.0. For full details, please refer to the [LICENSE](./LICENSE) file.
