Metadata-Version: 2.1
Name: bifidotyper
Version: 0.1.4
Summary: A bioinformatics tool for analyzing Bifidobacteria in sequencing data.
Home-page: https://github.com/Bennibraun/bifidotyper
Author: Ben Braun
Author-email: braun.ben-1@colorado.edu
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: seaborn
Requires-Dist: matplotlib
Requires-Dist: tqdm
Requires-Dist: scikit-learn
Requires-Dist: natsort
Requires-Dist: palettable

# Bifidotyper

Bifidotyper is a fast, lightweight bioinformatics tool designed to take you from raw FASTQ files to a complete and reproducible analysis of the [*Bifidobacterium*](https://en.wikipedia.org/wiki/Bifidobacterium) strains in your samples. It uses [Sylph](https://doi.org/10.1038/s41587-024-02412-y) for rapid, k-mer-based profiling of reads against reference genomes. It also uses [Salmon](https://doi.org/10.1038/nmeth.4197) to detect genes required for the metabolism of human milk oligosaccharides ([HMOs](https://en.wikipedia.org/wiki/Human_milk_oligosaccharide)), based on alignments to [*Bifidobacterium longum*](https://www.ncbi.nlm.nih.gov/nuccore/CP001095.1/) genes annotated by [Henrick et al. (2021)](https://doi.org/10.1016/j.cell.2021.05.030).

Bifidotyper was developed as part of a PhD rotation in the [Olm Lab](https://www.colorado.edu/lab/olm/) and the [IQ Biology](https://www.colorado.edu/certificate/iqbiology/) program.

![Bifidotyper Graphical Abstract](src/bifidotyper/data/reference/bifidotyper_graphical_abstract.png "Bifidotyper")

---

## Installation
```bash
pip install bifidotyper
```

> [!NOTE]
> Bifidotyper can be installed with `pip`, but it depends on [Sylph](https://github.com/bluenote-1577/sylph) and [Salmon](https://github.com/COMBINE-lab/salmon), neither of which is distributed through `pip`. For ease of use, Sylph binaries are bundled with the package and Salmon is downloaded automatically when they aren't found in your `PATH`. If you run into problems with either, you can install both manually with Conda (`conda install -c bioconda sylph salmon`).

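As a quick pre-flight check, the sketch below reports whether `sylph` and `salmon` can be found on your `PATH`. It is not part of Bifidotyper: the helper name `find_tools` is hypothetical, and it relies only on Python's standard `shutil.which`.

```python
import shutil


def find_tools(names=("sylph", "salmon")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report what was found; a missing tool is not necessarily an error,
# since Bifidotyper falls back to bundled or downloaded binaries.
for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```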
---

## Usage

### Command Line Interface

For single-end reads:
```bash
bifidotyper -se <single-end FASTQ files> [-t <threads>] [-g <genome-dir> | -s <genome-sketch>] [-r <rpm_threshold>]
```

Or paired-end reads:
```bash
bifidotyper -pe <paired-end FASTQ files> [--r1-suffix <R1 suffix>] [--r2-suffix <R2 suffix>] [-t <threads>] [-g <genome-dir> | -s <genome-sketch>] [-r <rpm_threshold>]
```

### Options

- `-se, --single-end`: Single-end FASTQ files.
- `-pe, --paired-end`: Paired-end FASTQ files (R1 and R2 files, supports wildcards).
- `-t, --threads`: Number of threads to use for parallel processing (default: 1).
- `--r1-suffix`: Suffix for R1 files (optional, paired-end mode only; default: `_R1`).
- `--r2-suffix`: Suffix for R2 files (optional, paired-end mode only; default: `_R2`).
- `-g, --genome-dir`: Directory containing reference genomes (optional, defaults to provided genomes).
- `-s, --genome-sketch`: Path to a pre-sketched Sylph genome database (optional, defaults to provided database). Use `sylph sketch *.fna` to generate your own database.
- `-r, --rpm_threshold`: Minimum reads per million (RPM) an HMO gene must reach to be considered present (default: 10).

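To make the `-r` option concrete: assuming RPM means the reads assigned to a gene scaled to one million total reads in the sample, a threshold check could look like the sketch below. The function names are hypothetical, and the exact normalization Bifidotyper applies is not documented here, so treat this as an illustration of the arithmetic rather than the tool's implementation.

```python
def reads_per_million(gene_reads, total_reads):
    """Scale a gene's read count to reads per million total reads (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Multiply first so exact integer inputs divide cleanly.
    return gene_reads * 1_000_000 / total_reads


def is_present(gene_reads, total_reads, rpm_threshold=10):
    """Treat a gene as present when its RPM meets the minimum threshold (assumed >=)."""
    return reads_per_million(gene_reads, total_reads) >= rpm_threshold


# 50 reads out of 2,000,000 is 25 RPM, which passes the default threshold of 10.
print(reads_per_million(50, 2_000_000))  # 25.0
print(is_present(50, 2_000_000))         # True
print(is_present(10, 2_000_000))         # False (5 RPM)
```

Raising the threshold (for example `-r 25`, as in the examples that follow) makes presence calls stricter for low-depth or noisy samples.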

### Examples
```bash
# Run bifidotyper in paired-end mode with 4 threads and _R1/_R2 suffixes
bifidotyper -pe data/*.fastq.gz --r1-suffix _R1 --r2-suffix _R2 -t 4
# Run bifidotyper in single-end mode with 8 threads, a custom genome directory, and a custom RPM threshold
bifidotyper -se data/*.fastq.gz -t 8 -g my_genomes/ -r 25
```
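Before a paired-end run, it can help to confirm that every R1 file has a matching R2 file. The sketch below is a hypothetical stand-alone check, not Bifidotyper's own pairing logic: it assumes the suffix occurs once in each file name and that the two mates differ only by that suffix.

```python
from pathlib import Path


def pair_by_suffix(paths, r1_suffix="_R1", r2_suffix="_R2"):
    """Group FASTQ paths into (R1, R2) pairs keyed by the file name without the suffix."""
    grouped = {}
    for path in sorted(paths):
        name = Path(path).name
        if r1_suffix in name:
            key, slot = name.replace(r1_suffix, "", 1), "r1"
        elif r2_suffix in name:
            key, slot = name.replace(r2_suffix, "", 1), "r2"
        else:
            continue  # file carries neither suffix; ignore it
        grouped.setdefault(key, {})[slot] = path

    # Keep only samples with both mates; report the rest separately.
    paired = {k: (v["r1"], v["r2"]) for k, v in grouped.items() if len(v) == 2}
    unpaired = sorted(k for k, v in grouped.items() if len(v) < 2)
    return paired, unpaired


files = ["data/a_R1.fastq.gz", "data/a_R2.fastq.gz", "data/b_R1.fastq.gz"]
paired, unpaired = pair_by_suffix(files)
print("paired:", paired)
print("missing a mate:", unpaired)
```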

---

## Output

The tool generates several output files and directories:

- `plots/`: All plots generated by the program, along with accompanying tables for convenience.
- `sylph_genome_sketches/`: The database of Sylph genome indices.
- `sylph_fastq_sketches/`: K-mer indices of input samples processed with Sylph.
- `sylph_genome_queries/`: Results of running Sylph queries against the genomes.
- `hmo_quantification/`: HMO gene alignments with Salmon.

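After a run, a short script can confirm that the directories listed above exist. This sketch assumes they are created directly under the directory the command was run from, which is not stated here; `missing_outputs` is a hypothetical helper name.

```python
from pathlib import Path

# Directory names taken from the output list above.
EXPECTED_DIRS = (
    "plots",
    "sylph_genome_sketches",
    "sylph_fastq_sketches",
    "sylph_genome_queries",
    "hmo_quantification",
)


def missing_outputs(run_dir="."):
    """Return the expected output directories that are absent under run_dir."""
    root = Path(run_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


missing = missing_outputs(".")
print("all outputs present" if not missing else f"missing: {missing}")
```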
---

## Provided Reference Files
- HMO functional annotations were retrieved from [Henrick et al. 2021](https://data.mendeley.com/datasets/gc4d9h4x67/2). The table is provided in [`src/data/reference/humann2_HMO_annotation.csv`](src/data/reference/humann2_HMO_annotation.csv).
- All *B. longum* annotations are from the NCBI record for [CP001095.1](https://www.ncbi.nlm.nih.gov/nuccore/CP001095.1/).
- A pre-processed Sylph genome database is provided for ease of use. Every genome assigned to the genus *Bifidobacterium* in NCBI and GTDB was included. The genomes were dereplicated with [dRep](https://github.com/MrOlm/drep) (`--S_ani 0.95`) before indexing. All genome accessions are listed in [`src/data/reference/genomes.csv`](src/data/reference/genomes.csv).

---

## License

This project is licensed under the MIT License.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.
