Metadata-Version: 2.4
Name: biohelpers
Version: 0.1.3
Summary: Useful functions for biologists
Home-page: https://github.com/lixiang117423/biohelpers_python
Author: Xiang Li
Author-email: lixiang117423@gmail.com
License: MIT
Keywords: bioinformatics genome-analysis
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Biopython>=1.81
Dynamic: license-file

GeneFamily

## Installation

```bash
pip install biohelpers
```

## Dependencies

### Install miniforge

Install [miniforge](https://github.com/conda-forge/miniforge) according to the instructions on the website.

### Install dependencies

```bash
mamba install conda-forge::biopython=1.85
mamba install bioconda::gffread=0.12.7
mamba install bioconda::seqkit=2.10.0
```

## Usage

### Example data

The demo data were downloaded from [RiceSuperPIRdb](http://www.ricesuperpir.com/web/download), the database described in the paper [A super pan-genomic landscape of rice](https://www.nature.com/articles/s41422-022-00685-z).

```bash
wget http://www.ricesuperpir.com/uploads/common/gene_annotation/NIP-T2T.gff3.gz
wget http://www.ricesuperpir.com/uploads/common/genome_sequence/NIP-T2T.fa.gz

gunzip NIP-T2T.gff3.gz
gunzip NIP-T2T.fa.gz

mv NIP-T2T.gff3 Nipponbare.gff3
mv NIP-T2T.fa Nipponbare.fa
```

### Get the metadata of a project or run from ENA

```bash
get_fq_meta -h

usage: get_fq_meta [-h] [-id ACCESSION] [-o OUTPUT] [-s SAVE [SAVE ...]]

Download sequencing metadata TSV from ENA API

options:
  -h, --help            show this help message and exit
  -id ACCESSION, --accession ACCESSION
                        ENA accession number (required) (e.g. PRJNA123456)
  -o OUTPUT, --output OUTPUT
                        Output path (supports .tsv/.csv/.txt/.xlsx extensions, default: ./tmp/[accession].meta.tsv)
  -s SAVE [SAVE ...], --save SAVE [SAVE ...]
                        Fields to save (all|field1 field2), available fields:
                        secondary_study_accession,sample_accession,secondary_sample_accession,
                        experiment_accession,study_accession,submission_accession,tax_id,
                        scientific_name,instrument_model,nominal_length,library_layout,
                        library_source,library_selection,base_count,first_public,last_updated,
                        study_title,experiment_alias,run_alias,fastq_bytes,fastq_md5,fastq_ftp,
                        fastq_aspera,fastq_galaxy,submitted_bytes,submitted_md5,submitted_ftp,
                        submitted_galaxy,submitted_format,sra_bytes,sra_md5,sra_ftp,sample_alias,
                        broker_name,sample_title,nominal_sdev,bam_ftp,bam_bytes
```
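The help text says the metadata come from the ENA API as TSV. As a rough illustration of what such a request can look like, the sketch below builds a URL for the public ENA Portal API `filereport` endpoint. That `get_fq_meta` uses this exact endpoint and these query parameters is an assumption, and `build_filereport_url` is a hypothetical helper, not part of biohelpers.

```python
from urllib.parse import urlencode

# Public ENA Portal API endpoint for per-run file reports.
# Assumption: get_fq_meta may query ENA differently.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession, fields):
    """Return a filereport URL requesting `fields` for every run under `accession`."""
    query = {
        "accession": accession,
        "result": "read_run",        # one row per sequencing run
        "fields": ",".join(fields),  # field names as listed under --save
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(query)}"


url = build_filereport_url("PRJNA510920", ["fastq_ftp", "fastq_md5", "library_layout"])
print(url)
```

Opening the printed URL in a browser or with `wget` should return a tab-separated table with one row per run.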

```bash
get_fq_meta -id PRJNA510920 -o PRJNA510920.meta.txt
# Meta information file saved to: PRJNA510920.meta.txt
```
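Once the metadata file is saved, the `fastq_ftp` and `fastq_md5` columns can be paired up for later use. The sketch below assumes the file is tab-separated and that paired-end runs list both mates in one cell separated by `;`, as ENA reports them. `fastq_targets` is a hypothetical helper, and the demo rows are made up.

```python
import csv
import io


def fastq_targets(meta_tsv):
    """Yield (url, md5) pairs from ENA metadata given as an open text stream."""
    reader = csv.DictReader(meta_tsv, delimiter="\t")
    for row in reader:
        # Paired-end runs hold two ';'-separated entries per cell.
        urls = [u for u in (row.get("fastq_ftp") or "").split(";") if u]
        md5s = [m for m in (row.get("fastq_md5") or "").split(";") if m]
        yield from zip(urls, md5s)


# Made-up table standing in for a saved file such as PRJNA510920.meta.txt.
demo = io.StringIO(
    "library_layout\tfastq_ftp\tfastq_md5\n"
    "PAIRED\thost/RUN1_1.fastq.gz;host/RUN1_2.fastq.gz\taaa;bbb\n"
    "SINGLE\thost/RUN2.fastq.gz\tccc\n"
)
targets = list(fastq_targets(demo))
print(targets)
```

For a real file, pass `open("PRJNA510920.meta.txt", newline="")` instead of the `io.StringIO` demo.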

### Download FASTQ data from ENA

```bash
get_fq_file -h

usage: get_fq_file [-h] --accession ACCESSION --type {ftp,aspera} [--key KEY] [--method {run,save}] [--output OUTPUT]

Download FASTQ files from ENA

options:
  -h, --help            show this help message and exit
  --accession ACCESSION, -id ACCESSION
                        Accession number (required) Format example: PRJNA661210/SRP000123 Supports ENA/NCBI standard accession formats
                        (default: None)
  --type {ftp,aspera}, -t {ftp,aspera}
                        Download protocol type ftp: Standard FTP download aspera: High-speed transfer protocol (requires private key) (default:  
                        None)
  --key KEY, -k KEY     Path to aspera private key Required when using aspera protocol Default location:
                        ~/.aspera/connect/etc/asperaweb_id_dsa.openssh (default: None)
  --method {run,save}, -m {run,save}
                        Execution mode run: Execute download commands directly save: Generate download script (default) (default: save)
  --output OUTPUT, -o OUTPUT
                        Output directory Default format: [accession].fastq.download Auto-create missing directories (default: None)
```

`-m run` downloads the FASTQ files directly. **We strongly recommend `-m save` instead, which writes a download script that you then run to fetch the FASTQ data.**

#### wget

```bash
get_fq_file -id PRJNA510920 -m save -t ftp -o ./fastq

# Please run the next command to download the FASTQ data:
# bash download_PRJNA510920_fastq_by_wget.sh
```

#### aspera

```bash
get_fq_file -id PRJNA510920 -m save -t aspera -k ./asperaweb_id_dsa.openssh  -o ./fastq

# Please run the next command to download the FASTQ data:
# bash download_PRJNA510920_fastq_by_aspera.sh
```
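After the download script finishes, it can be worth checking each file against the `fastq_md5` value from the ENA metadata. The following is a minimal sketch of such a check using only the standard library. `md5_of_file` and `is_intact` are hypothetical helpers and are not provided by biohelpers.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_md5):
    """Return True when the file's MD5 matches the checksum reported by ENA."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file that fails the check was most likely truncated or corrupted in transit and should be downloaded again.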

### Download the HMM file

```bash
download_hmm -h
usage: download_hmm [-h] -id HMM_ID -o OUTPUT

Download HMM profile from InterPro

options:
  -h, --help            show this help message and exit
  -id, --hmm_id HMM_ID  Pfam HMM ID (e.g. PF00010)
  -o, --output OUTPUT   Output directory path
```

```bash
download_hmm -id PF00010 -o example/
```

### Get the longest transcript for each gene

```bash
parse_longest_mrna -h
usage: parse_longest_mrna [-h] -g GENOME -f GFF3 -o OUTPUT

Extract longest mRNA transcripts

options:
  -h, --help           show this help message and exit
  -g, --genome GENOME  Input genome FASTA file
  -f, --gff3 GFF3      Input GFF3 annotation file
  -o, --output OUTPUT  Output FASTA file
```

```bash
parse_longest_mrna -g example/Nipponbare.fa -f example/Nipponbare.gff3 -o test/longest.pep.fa
```

```text
################################################################
Total genes: 57359
Total transcripts: 67818
Genes with multiple transcripts: 6510
################################################################
Successfully extracted 57359 longest transcripts
Longest transcript protein sequences saved to: test/longest.pep.fa
Gene and transcript information saved to: example/Nipponbare.gene.info.txt
```
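The idea behind picking one transcript per gene can be sketched in a few lines. The example below assumes that "longest" means the largest summed CDS length and that transcripts are `mRNA` features linked to genes through `Parent` attributes. How `parse_longest_mrna` actually ranks transcripts is not documented here, so treat `longest_transcripts` as a hypothetical illustration rather than the tool's implementation.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    return dict(item.split("=", 1) for item in column9.strip().split(";") if "=" in item)


def longest_transcripts(gff3_lines):
    """Map each gene ID to the ID of its transcript with the largest total CDS length."""
    gene_of = {}                # transcript ID -> gene ID
    cds_len = defaultdict(int)  # transcript ID -> summed CDS length
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        kind, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if kind == "mRNA" and "ID" in attrs and "Parent" in attrs:
            gene_of[attrs["ID"]] = attrs["Parent"]
        elif kind == "CDS":
            # A CDS may list several parents separated by commas.
            for parent in filter(None, attrs.get("Parent", "").split(",")):
                cds_len[parent] += end - start + 1  # GFF3 coordinates are 1-based, inclusive
    best = {}
    for tx, gene in gene_of.items():
        # Ties keep the transcript that appears first in the file.
        if gene not in best or cds_len[tx] > cds_len[best[gene]]:
            best[gene] = tx
    return best


# Made-up gene with two transcripts: g1.t1 has 300 bp of CDS, g1.t2 has 600 bp.
demo = [
    "chr1\tdemo\tgene\t1\t900\t.\t+\t.\tID=g1\n",
    "chr1\tdemo\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n",
    "chr1\tdemo\tCDS\t1\t300\t.\t+\t0\tID=c1;Parent=g1.t1\n",
    "chr1\tdemo\tmRNA\t1\t900\t.\t+\t.\tID=g1.t2;Parent=g1\n",
    "chr1\tdemo\tCDS\t1\t300\t.\t+\t0\tID=c2;Parent=g1.t2\n",
    "chr1\tdemo\tCDS\t601\t900\t.\t+\t0\tID=c3;Parent=g1.t2\n",
]
result = longest_transcripts(demo)
print(result)  # {'g1': 'g1.t2'}
```

Ranking by CDS length rather than by genomic span avoids favoring transcripts with long introns or untranslated regions, which matters when the output is a protein FASTA file.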


## Requirements

- Python 3.7+
- requests>=2.31.0
