Metadata-Version: 2.1
Name: bifidotyper
Version: 0.1.1
Summary: A bioinformatics tool for analyzing Bifidobacteria in sequencing data.
Home-page: https://github.com/Bennibraun/bifidotyper
Author: Ben Braun
Author-email: braun.ben-1@colorado.edu
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: seaborn
Requires-Dist: matplotlib
Requires-Dist: tqdm
Requires-Dist: scikit-learn
Requires-Dist: natsort
Requires-Dist: palettable

# Bifidotyper

Bifidotyper is a fast, lightweight bioinformatics tool designed to take you from raw FASTQ files to a complete and reproducible analysis of the [*Bifidobacterium*](https://en.wikipedia.org/wiki/Bifidobacterium) strains in your samples. It uses [Sylph](https://doi.org/10.1038/s41587-024-02412-y) for rapid, k-mer-based profiling of reads against reference genomes. It also uses [Salmon](https://doi.org/10.1038/nmeth.4197) to detect genes required for the metabolism of human milk oligosaccharides ([HMOs](https://en.wikipedia.org/wiki/Human_milk_oligosaccharide)), based on alignments to the *Bifidobacterium longum*[^1] genome and the gene annotations from Henrick et al.[^2]

[^1]: [CP001095.1](https://www.ncbi.nlm.nih.gov/nuccore/CP001095.1/)

[^2]: Henrick, B. M. et al. (2021). Bifidobacteria-mediated immune system imprinting early in life. Cell, 184(15). [https://doi.org/10.1016/j.cell.2021.05.030](https://doi.org/10.1016/j.cell.2021.05.030)

Bifidotyper was developed as part of a PhD rotation in the [Olm Lab](https://www.colorado.edu/lab/olm/) and the [IQ Biology](https://www.colorado.edu/certificate/iqbiology/) program.

## Installation

Bifidotyper can be installed with `pip`, but it depends on [Sylph](https://github.com/bluenote-1577/sylph) and [Salmon](https://github.com/COMBINE-lab/salmon), neither of which has a `pip` distribution. You can install both with Conda (`conda install -c bioconda sylph salmon`). For ease of use, if a tool isn't found in your `PATH`, Bifidotyper falls back to the Sylph binaries included with the package and downloads Salmon automatically.
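To see which of the two dependencies are already available before you install, a quick check along these lines may help. This is only a sketch: `have_tool` is a helper name made up for this example and is not part of Bifidotyper.

```bash
# Hypothetical helper (not part of Bifidotyper): succeeds if a command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report the status of both external dependencies.
for tool in sylph salmon; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found in PATH"
    fi
done
```

A tool reported as missing is not necessarily a problem, since the fallback described above should cover it.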

Clone the repository and install the package:
```bash
git clone https://github.com/Bennibraun/bifidotyper.git
cd bifidotyper
pip install -e .
```

A proper distribution via PyPI or Anaconda is planned.

## Usage

### Command Line Interface

For single-end reads:
```bash
bifidotyper -se <single-end FASTQ files> [-t <threads>] [-g <genome-dir> | -s <genome-sketch>]
```

Or paired-end reads:
```bash
bifidotyper -pe <paired-end FASTQ files> [--r1-suffix <R1 suffix>] [--r2-suffix <R2 suffix>] [-t <threads>] [-g <genome-dir> | -s <genome-sketch>]
```

### Options

- `-se, --single-end`: Single-end FASTQ files.
- `-pe, --paired-end`: Paired-end FASTQ files (pass both the R1 and R2 files; wildcards are supported).
- `-t, --threads`: Number of threads to use for parallel processing (default: 1).
- `--r1-suffix`: Suffix for R1 files (optional, paired-end mode only; default: "_R1").
- `--r2-suffix`: Suffix for R2 files (optional, paired-end mode only; default: "_R2").
- `-g, --genome-dir`: Directory containing reference genomes (optional, defaults to provided genomes).
- `-s, --genome-sketch`: Path to a pre-sketched Sylph genome database (optional, defaults to provided database). Use `sylph sketch *.fna` to generate your own database.


### Examples
```bash
# Run bifidotyper in paired-end mode with 4 threads and _R1/_R2 suffixes
bifidotyper -pe data/*.fastq.gz --r1-suffix _R1 --r2-suffix _R2 -t 4
# Run bifidotyper in single-end mode with 8 threads and a custom genome directory
bifidotyper -se data/*.fastq.gz -t 8 -g my_genomes/
```
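
Before a paired-end run, it can be useful to confirm that every R1 file has a matching R2 file. The sketch below is a pre-flight check written for this README, not Bifidotyper's own pairing logic. It assumes file names of the form `<sample><suffix>.fastq.gz`, and `find_unpaired` is a hypothetical helper name.

```bash
# Hypothetical pre-flight check (not part of Bifidotyper):
# print every R1 file whose expected R2 mate does not exist.
# Usage: find_unpaired <R1 suffix> <R2 suffix> <files...>
find_unpaired() {
    r1_suffix=$1
    r2_suffix=$2
    shift 2
    for f in "$@"; do
        case "$f" in
            *"$r1_suffix"*)
                # Split the name around the last occurrence of the R1 suffix,
                # then rebuild it with the R2 suffix to get the expected mate.
                prefix=${f%"$r1_suffix"*}
                rest=${f##*"$r1_suffix"}
                mate=$prefix$r2_suffix$rest
                [ -e "$mate" ] || echo "$f"
                ;;
        esac
    done
}

# Prints nothing when all R1 files are paired.
find_unpaired _R1 _R2 data/*.fastq.gz
```

If the suffix string also appears earlier in a path (for example in a directory name), only its last occurrence is swapped.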

## Output

The tool generates several output files and directories:

- `plots/`: All plots generated by the program.
- `sylph_genome_sketches/`: The database of Sylph genome indices.
- `sylph_fastq_sketches/`: K-mer indices of input samples processed with Sylph.
- `sylph_genome_queries/`: Results of running Sylph queries against the genomes.
- `hmo_quantification/`: HMO gene alignments with Salmon.
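
After a run, a short check like the following can confirm that the directories listed above were created. The directory names come from that list. Where Bifidotyper writes them is not stated here, so the sketch assumes they sit in the directory you pass in (the current directory by default). `check_outputs` is a hypothetical helper name.

```bash
# Hypothetical post-run check (not part of Bifidotyper).
# Assumption: outputs are written under the given directory (default: ".").
check_outputs() {
    out_dir=${1:-.}
    missing=0
    for d in plots sylph_genome_sketches sylph_fastq_sketches \
             sylph_genome_queries hmo_quantification; do
        if [ -d "$out_dir/$d" ]; then
            echo "ok       $d/"
        else
            echo "missing  $d/"
            missing=1
        fi
    done
    # Non-zero exit status if any expected directory is absent.
    return "$missing"
}

check_outputs . || echo "Some expected output directories were not found."
```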

---

## Provided Reference Files
- HMO functional annotations were retrieved from [Henrick et al. 2021](https://data.mendeley.com/datasets/gc4d9h4x67/2). The table is provided in [`src/data/reference/humann2_HMO_annotation.csv`](src/data/reference/humann2_HMO_annotation.csv).
- All *B. longum* annotations are from the NCBI record for [CP001095.1](https://www.ncbi.nlm.nih.gov/nuccore/CP001095.1/).
- A pre-processed Sylph genome database is provided for ease of use. Any genome assigned to the genus *Bifidobacterium* in NCBI and GTDB was included. The genomes were dereplicated with [dRep](https://github.com/MrOlm/drep) using `--S_ani 0.95` (a 95% average nucleotide identity threshold) before indexing. All genome accessions are listed in [`src/data/reference/genomes.csv`](src/data/reference/genomes.csv).

---

## License

This project is licensed under the MIT License.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.
