Metadata-Version: 2.4
Name: pathofetch
Version: 0.1.0
Summary: Download SNP matrices from NCBI Pathogen Detection
Author-email: Erin Young <eriny@utah.gov>
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tqdm
Dynamic: license-file

# Pathofetch

Lightweight CLI and Python tool for downloading SNP cluster data from NCBI Pathogen Detection (PD).

## Installation

### From PyPI
```bash
pip install pathofetch
```

### From source
```bash
git clone https://github.com/erinyoung/pathofetch.git
cd pathofetch
pip install -e .
```

## Usage

Pathofetch requires two arguments: an **organism group** and a **SNP cluster ID** (as defined by NCBI Pathogen Detection).  
It downloads the corresponding SNP tree tarball and generates a pairwise SNP distance matrix from it.

By default:

- The SNP matrix is saved to `{cluster_id}.snp_distance_matrix.csv`.
- The raw tarball is saved to `{cluster_id}.tar.gz`.

### Example

```bash
pathofetch -o Salmonella -c PDS000254123.2
```

### Custom Output Filename

```bash
# Write the SNP distance matrix to a custom CSV path
pathofetch -o Salmonella -c PDS000254123.2 -f results/matrix.csv
# Also write a QC file (auto-named when no path is given)
pathofetch -o Salmonella -c PDS000254123.2 -q
```


## Python API Usage

`pathofetch` can also be imported into other Python scripts, for example as one step in a larger bioinformatics pipeline.

```python
import pathofetch

# -----------------------------
# Download and process a single SNP cluster
# -----------------------------
# Returns True on success, False on failure
success = pathofetch.fetch_snp_matrix(
    organism="Listeria_monocytogenes",    # Organism group on NCBI PD
    cluster_id="PDS000000123.5",          # SNP cluster ID
    out_file="results/matrix.csv",        # Where to save the SNP distance matrix
    tar_file="archives/raw_data.tar.gz",  # Optional: save tarball to custom location
    qc_file_arg="AUTO",                   # Generate QC file alongside output
)

if success:
    print("Matrix generated successfully!")
    print("Matrix CSV: results/matrix.csv")
    print("Tarball: archives/raw_data.tar.gz")
    print("QC stats: PDS000000123.5.qc.csv (auto-named)")
else:
    print("Failed to generate SNP matrix. Check organism and cluster ID.")

```
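
To process several clusters in one run, the call above can be wrapped in a small loop. The sketch below is illustrative only: `fetch_many` and its `fetch` parameter are hypothetical names, not part of pathofetch, and it assumes `fetch_snp_matrix` accepts the keyword arguments shown in the example above. Output names follow the default patterns described under Usage.

```python
from pathlib import Path


def fetch_many(organism, cluster_ids, out_dir, fetch=None):
    """Fetch several SNP clusters and return {cluster_id: True/False}.

    `fetch` is a hypothetical hook for substituting a stand-in function;
    by default pathofetch.fetch_snp_matrix is used.
    """
    if fetch is None:
        # Imported lazily so the helper can be exercised without pathofetch.
        import pathofetch
        fetch = pathofetch.fetch_snp_matrix

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    results = {}
    for cluster_id in cluster_ids:
        # Same keyword arguments as the single-cluster example above.
        results[cluster_id] = bool(fetch(
            organism=organism,
            cluster_id=cluster_id,
            out_file=str(out_dir / f"{cluster_id}.snp_distance_matrix.csv"),
            tar_file=str(out_dir / f"{cluster_id}.tar.gz"),
            qc_file_arg="AUTO",
        ))
    return results
```

A call such as `fetch_many("Salmonella", ["PDS000254123.2"], "results")` would then return a dictionary that maps each cluster ID to its success flag, so failed clusters can be retried or reported.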


## Output Format

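The output is a symmetric matrix of pairwise SNP distances in CSV format. The first row and first column hold the sample identifiers, the top-left cell is a `-` placeholder, and the diagonal is zero.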

```
-,PDT003080107.1,PDT002963418.1,PDT003087591.1
PDT003080107.1,0,12,5
PDT002963418.1,12,0,8
PDT003087591.1,5,8,0
```
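
The matrix can be read back with the standard library alone. The following is a minimal sketch, assuming the layout shown above (a `-` corner cell and integer distances); `read_snp_matrix` is a hypothetical helper, not part of pathofetch, and the sample text stands in for a real output file.

```python
import csv
import io

# Sample data copied from the example above; a real run would instead use
# open("PDS000254123.2.snp_distance_matrix.csv", newline="").
SAMPLE = """\
-,PDT003080107.1,PDT002963418.1,PDT003087591.1
PDT003080107.1,0,12,5
PDT002963418.1,12,0,8
PDT003087591.1,5,8,0
"""


def read_snp_matrix(handle):
    """Return {row_id: {col_id: distance}} from an open CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = header[1:]  # the first cell is the "-" corner placeholder
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {col: int(val) for col, val in zip(columns, row[1:])}
    return matrix


matrix = read_snp_matrix(io.StringIO(SAMPLE))

# Find the closest pair, visiting each unordered pair once (a < b).
closest = min(
    ((a, b, d) for a, row in matrix.items() for b, d in row.items() if a < b),
    key=lambda pair: pair[2],
)
print(closest)  # ('PDT003080107.1', 'PDT003087591.1', 5)
```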

## QC Statistics File

When the `-q` flag is used, pathofetch writes a two-column `key,value` CSV file with the following structure:

```
key,value
pathofetch_version,0.1.0
download_time,2.21s
organism_group,Aeromonas_salmonicida
cluster,PDS000097767.14
cluster_create_date,Feb 4 11:28
num_sample,87
snp_alignment_length,612
min_pairwise_distance,0
max_pairwise_distance,104
avg_pairwise_distance,50.46
tarball_file,PDS000097767.14.tar.gz
tarball_filesize,0.81 MB
snp_matrix_file,PDS000097767.14.snp_distance_matrix.csv
qc_file,PDS000097767.14.qc.csv
```
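
Because the QC file is a plain `key,value` table, it loads directly into a dictionary. This is a minimal sketch, assuming the header row and keys shown above; `read_qc` is a hypothetical helper, and the sample text is a shortened copy of the example rather than real output. All values are read as strings, so numeric fields need an explicit conversion.

```python
import csv
import io

# Shortened copy of the example above; a real run would instead use
# open("PDS000097767.14.qc.csv", newline="").
SAMPLE_QC = """\
key,value
pathofetch_version,0.1.0
cluster,PDS000097767.14
num_sample,87
max_pairwise_distance,104
avg_pairwise_distance,50.46
"""


def read_qc(handle):
    """Return the QC table as a {key: value} dict of strings."""
    return {row["key"]: row["value"] for row in csv.DictReader(handle)}


qc = read_qc(io.StringIO(SAMPLE_QC))

# Convert the numeric fields explicitly.
num_samples = int(qc["num_sample"])
avg_distance = float(qc["avg_pairwise_distance"])
print(qc["cluster"], num_samples, avg_distance)
```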

## Arguments

```
usage: pathofetch [-h] [--version] [--organism ORGANISM] [--cluster CLUSTER] [--out-file OUT_FILE] [--tar-file TAR_FILE]
                  [--qc-file [QC_FILE]]

Download SNP Cluster Distance Matrices

options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit
  --organism ORGANISM, -o ORGANISM
                        Organism Group (e.g. Salmonella)
  --cluster CLUSTER, -c CLUSTER
                        Cluster ID (e.g. PDS000254123.2)
  --out-file OUT_FILE, -f OUT_FILE
                        Path to save output CSV
  --tar-file TAR_FILE, -t TAR_FILE
                        Path to save tar.gz
  --qc-file [QC_FILE], -q [QC_FILE]
                        Save QC metrics to file
```
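
The `[QC_FILE]` notation in the help text is how `argparse` displays an option whose value is optional, so `-q` can be given with or without a path. The sketch below illustrates that pattern in isolation. It is not pathofetch's actual parser: the `"AUTO"` sentinel is an assumption borrowed from the `qc_file_arg="AUTO"` value in the Python API example, and `qc-flag-demo` is a made-up program name.

```python
import argparse

parser = argparse.ArgumentParser(prog="qc-flag-demo")
parser.add_argument(
    "--qc-file", "-q",
    nargs="?",      # value is optional, displayed as [QC_FILE] in the help
    const="AUTO",   # assumed sentinel used when -q has no value
    default=None,   # flag absent: no QC file requested
    help="Save QC metrics to file",
)

print(parser.parse_args([]).qc_file)                      # None
print(parser.parse_args(["-q"]).qc_file)                  # AUTO
print(parser.parse_args(["-q", "stats/qc.csv"]).qc_file)  # stats/qc.csv
```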

## License

GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007
