Metadata-Version: 2.1
Name: bio-shark
Version: 1.2.2
Summary: SHARK (Similarity/Homology Assessment by Relating K-mers)
Home-page: https://git.mpi-cbg.de/tothpetroczylab/shark
Author: Willis Chow <chow@mpi-cbg.de>, Soumyadeep Ghosh <soumyadeep11194@gmail.com>, Anna Hadarovich <hadarovi@mpi-cbg.de>, Agnes Toth-Petroczy <tothpet@mpi-cbg.de>, Maxim Scheremetjew <schereme@mpi-cbg.de>
Author-email: chow@mpi-cbg.de
Project-URL: Homepage, https://git.mpi-cbg.de/tothpetroczylab/shark
Project-URL: Documentation, https://git.mpi-cbg.de/tothpetroczylab/shark/-/blob/master/README.md
Project-URL: Funding, https://www.mpi-cbg.de/
Project-URL: Repository, https://git.mpi-cbg.de/tothpetroczylab/shark
Project-URL: Issue tracker, https://git.mpi-cbg.de/tothpetroczylab/shark/-/issues
Keywords: intrinsically disordered protein regions,motif detection,IDRs,sequence-to-function,alignment-free,machine learning,homology detection
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Requires-Python: >=3.9,<3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests ~=2.31.0
Requires-Dist: catboost ~=1.2
Requires-Dist: matplotlib ~=3.8.2
Requires-Dist: pandas ~=2.1.3
Requires-Dist: logomaker ~=0.8
Requires-Dist: alfpy ~=1.0.6

<h1 align="center">
<img src="https://git.mpi-cbg.de/tothpetroczylab/shark/-/raw/master/branding/logo/SharkDive_logo.png" width="300">
</h1><br>

# SHARK (Similarity/Homology Assessment by Relating K-mers)

To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers). 

##  SHARK-dive 

We trained SHARK-dive, a machine learning homology classifier, which achieved superior performance to standard alignment in assessing homology in unalignable sequences, and correctly identified dissimilar IDRs capable of functional rescue in IDR-replacement experiments reported in the literature.

### 1. Dive-Score
Scoring the similarity between a pair of sequences

Variants:
   1. Normal (`SHARK-score (T)`)
   2. Sparse (`SHARK-score (best)`)

### 2. Dive-Predict
Find sequences similar to a given query from a target set   


## User Section

### Installation

SHARK officially supports Python versions >=3.9,<3.12.

**Recommended:** Use SHARK within a local Python virtual environment.

```shell
python3 -m venv /path/to/new/virtual/environment
```

#### SHARK is installable from PyPI

```shell
$ pip install bio-shark
```

#### SHARK is also installable from source

* This allows users to import the functionality as a Python package
* This also allows users to run the functionality as a command-line utility

```shell
$ git clone git@git.mpi-cbg.de:tothpetroczylab/shark.git
```
Once you have a copy of the source, you can embed it in your own Python package, or install it into your site-packages easily.

```shell
# Make sure you have the required Python version installed
$ python3 --version
Python 3.11.5

$ cd shark
$ python3 -m venv shark-env
$ source shark-env/bin/activate
(shark-env) $ python -m pip install .
```

#### SHARK is also installable from GitLab source directly

```shell
$ pip install git+https://git.mpi-cbg.de/tothpetroczylab/shark.git
```

### How to use?

#### 1. SHARK-scores: Given two protein sequences and a k-mer length (1 to 20), score the similarity between them

##### Inputs

1. Protein Sequence 1
2. Protein Sequence 2
3. Scoring variant: Normal (`SHARK-score (T)`) or Sparse (`SHARK-score (best)`)
   1. Threshold (for "Normal")
4. k-mer length (must be <= the length of the shorter sequence)
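
The constraints on the k-mer length listed above can be checked before scoring. The following is a minimal sketch and not part of the `bio_shark` API: `validate_kmer_length` and `kmers` are hypothetical helper names, and the 1 to 20 range is taken from the heading of this section.

```python
def validate_kmer_length(sequence1: str, sequence2: str, k: int) -> None:
    """Raise ValueError if k violates the input constraints described above.

    Hypothetical helper for illustration; not part of bio_shark.
    """
    if not 1 <= k <= 20:
        raise ValueError(f"k must be between 1 and 20, got {k}")
    shortest = min(len(sequence1), len(sequence2))
    if k > shortest:
        raise ValueError(f"k={k} exceeds the shorter sequence length ({shortest})")


def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping k-mers of a sequence, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


validate_kmer_length("LASIDPTFKAN", "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP", k=3)
print(kmers("LASIDPTFKAN", 3))
```

A sequence of length n contains n - k + 1 overlapping k-mers, which is why k cannot exceed the length of the shorter sequence.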

##### 1.1. As a command-line utility
* Run the command `shark-score` along with input fasta files and scoring parameters
* Instead of input fasta files (--infile or --dbfile), a pair of query-target sequences can also be provided, e.g.:
```shell
% shark-score QUERYSEQUENCE TARGETSEQUENCE -k 5 -t 0.95 -s threshold -o results.tsv
```
* Note that if a FASTA file is provided, it will be used instead of the sequences given on the command line.
* The overall usage is as follows:

```shell
% shark-score --infile <path/to/query/fasta/file> --dbfile <path/to/target/fasta/file> --outfile <path/to/result/file> --length <k-mer length> --threshold <shark-score threshold> 
usage: shark-score [-h] [--infile INFILE] [--dbfile DBFILE] [--outfile OUTFILE] [--scoretype {best,threshold,NGD}] [--length LENGTH] [--threshold THRESHOLD] [query] [target]

Run SHARK-Scores (best or T=x variants) or Normalised Google Distance Scores. Note that if a FASTA file is provided, it will be used instead.

positional arguments:
  query                 Query sequence
  target                Target sequence

optional arguments:
  -h, --help            show this help message and exit
  --infile INFILE, -i INFILE
                        Query FASTA file
  --dbfile DBFILE, -d DBFILE
                        Target FASTA file
  --outfile OUTFILE, -o OUTFILE
                        Result file
  --scoretype {best,threshold,NGD}, -s {best,threshold,NGD}
                        Score type: best or threshold or NGD. Default is threshold.
  --length LENGTH, -k LENGTH
                        k-mer length
  --threshold THRESHOLD, -t THRESHOLD
                        threshold for SHARK-Score (T=x) variant
```

##### 1.2. As an imported python package

```python
from bio_shark.core import utils
from bio_shark.dive.run import run_normal, run_sparse

dive_t_score = run_normal(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
    threshold=0.8
)   # Compute SHARK-score (T)  

dive_best_score = run_sparse(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
)   # Compute SHARK-score (best)
```
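
To score every query against every target from Python, the single-pair call above can be wrapped in a loop. This is a minimal sketch: `score_all_pairs` is a hypothetical helper, and it takes the scoring function as an argument so that `run_sparse` can be passed in directly. It assumes only that the scorer accepts the keyword arguments `sequence1`, `sequence2` and `k`, as in the example above.

```python
from typing import Any, Callable, Dict, Tuple


def score_all_pairs(
    queries: Dict[str, str],
    targets: Dict[str, str],
    score_fn: Callable[..., Any],
    k: int,
) -> Dict[Tuple[str, str], Any]:
    """Score each (query, target) pair with the supplied scoring function.

    Hypothetical helper for illustration; not part of bio_shark.
    """
    scores = {}
    for q_id, q_seq in queries.items():
        for t_id, t_seq in targets.items():
            # Skip pairs where k is longer than the shorter sequence
            if k > min(len(q_seq), len(t_seq)):
                continue
            scores[(q_id, t_id)] = score_fn(sequence1=q_seq, sequence2=t_seq, k=k)
    return scores
```

With the package installed, a call such as `score_all_pairs(queries, targets, run_sparse, k=3)` should work under the assumption stated above; a scorer that needs a threshold, such as `run_normal`, could be bound first with `functools.partial`.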

#### 2. SHARK-Dive: Homology Assessment between two sequences

##### 2.1. As an imported python package

```python
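from bio_shark.dive.prediction import Prediction

# Each map is a dict of FASTA ID -> sequence (the IDs below are placeholders)
query_sequences = {"query_1": "LASIDPTFKAN"}
target_sequences = {"target_1": "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP"}

predictor = Prediction(q_sequence_id_map=query_sequences, t_sequence_id_map=target_sequences)

expected_out_keys = ['seq_id1', 'sequence1', 'seq_id2', 'sequence2', 'similarity_scores_k', 'pred_label', 'pred_proba']
output = predictor.predict()    # List of output objects; each element is for one pair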
```
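
The prediction output can then be filtered and ranked. The sketch below assumes that each element behaves like a dictionary with the keys listed in `expected_out_keys`, and that a larger `pred_proba` means a more confident prediction; neither assumption is confirmed here. `top_hits` is a hypothetical helper, the `0.5` cutoff is an arbitrary default, and the sample records are made up for illustration only.

```python
from typing import Any, Dict, List


def top_hits(output: List[Dict[str, Any]], min_proba: float = 0.5) -> List[Dict[str, Any]]:
    """Keep pairs with pred_proba >= min_proba, highest probability first.

    Hypothetical helper for illustration; not part of bio_shark.
    """
    kept = [row for row in output if row["pred_proba"] >= min_proba]
    return sorted(kept, key=lambda row: row["pred_proba"], reverse=True)


# Illustrative records only; these are not real SHARK-dive results
sample_output = [
    {"seq_id1": "query_1", "seq_id2": "target_1", "pred_proba": 0.31},
    {"seq_id1": "query_1", "seq_id2": "target_2", "pred_proba": 0.92},
]
for hit in top_hits(sample_output):
    print(hit["seq_id1"], hit["seq_id2"], hit["pred_proba"])
```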

##### 2.2. As a command-line utility
- Run the command `shark-dive` with the absolute paths of the query and target FASTA files as the two positional arguments
- Sequences should be longer than 10 residues, since `prediction` is always based on scores of k = [1..10]
- _You may use the `sample_fasta_file.fasta` from `data` folder (Owncloud link)_
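
The length requirement above can be checked before running the tool. The following is a minimal sketch, not SHARK's own FASTA reader: `read_fasta` and `usable_for_dive` are hypothetical helpers, the sequence ID is assumed to be the first whitespace-delimited token of the header line, and the filter on `X` mirrors the "Skipped ... sequences for having X" message that the tool prints.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {sequence_id: sequence} dict."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first token of the header line
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may span several lines; concatenate them
            records[current_id] += line
    return records


def usable_for_dive(records: Dict[str, str], min_length: int = 11) -> Dict[str, str]:
    """Keep sequences longer than 10 residues that contain no 'X'."""
    return {
        seq_id: seq
        for seq_id, seq in records.items()
        if len(seq) >= min_length and "X" not in seq.upper()
    }


example = """>seq_a example header
LASIDPTFKAN
>seq_b
ERQKNGGKSDSDDDEPAAKK
KVEYPIAAAPPMMMP
>seq_c
SHORT
"""
print(usable_for_dive(read_fasta(example.splitlines())))
```

For a file on disk, an open file handle can be passed to `read_fasta` directly, since it only iterates over lines.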


```shell
usage: shark-dive [-h] [--output_dir OUTPUT_DIR] query target

DIVE-Predict: Given some query sequences, compute their similarity from the list of target sequences;Target is
supposed to be major database of protein sequences

positional arguments:
  query       Absolute path to fasta file for the query set of input sequences
  target      Absolute path to fasta file for the target set of input sequences

options:
  -h, --help  show this help message and exit
  --output_dir OUTPUT_DIR
                        Output folder (default: current working directory)
  
$ shark-dive "<query-fasta-file>.fasta" "<target-fasta-file>.fasta"
Read fasta file from path <query-fasta-file>.fasta; Found 4 sequences; Skipped 0 sequences for having X
Read fasta file from path <target-fasta-file>.fasta; Found 6 sequences; Skipped 0 sequences for having X
Output stored at <OUTPUT_DIR>/<path-to-sequence-fasta-file>.fasta.csv
```

- Output CSV has the following column headers: 
    - (1) "Query": Fasta ID of sequence from Query list
    - (2) "Target": Fasta ID of sequence from Target list
    - (3..12) "SHARK-Score (k=*)": Similarity score between the two sequences for specific k-value
    - (13) "SHARK-Dive": Aggregated similarity score over all lengths of k-mer
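
The result file can be post-processed with the standard library. This is a minimal sketch that assumes the file is comma-separated, that the header names match the ones listed above exactly, and that a higher "SHARK-Dive" value indicates greater similarity; `rank_by_dive` is a hypothetical helper, and the example rows are illustrative values rather than real output.

```python
import csv
import io
from typing import Dict, List, TextIO


def rank_by_dive(handle: TextIO) -> List[Dict[str, str]]:
    """Read a result CSV and sort its rows by the SHARK-Dive column, descending."""
    rows = list(csv.DictReader(handle))
    return sorted(rows, key=lambda row: float(row["SHARK-Dive"]), reverse=True)


# Illustrative rows only (per-k score columns omitted); not real SHARK output
example_csv = io.StringIO(
    "Query,Target,SHARK-Dive\n"
    "q1,t1,0.12\n"
    "q1,t2,0.87\n"
)
for row in rank_by_dive(example_csv):
    print(row["Query"], row["Target"], row["SHARK-Dive"])
```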

##### 2.3. Parallelised Runs of SHARK-Dive
- The score for each k-mer length is computed in parallel; the 10 k-mer scores are then aggregated and SHARK-Dive is run on the result.
- Change the environment variables in `parallel_run_example_environment.env` (or create your own file).
- Navigate to the `parallel_run` folder.
- Run `parallel_run.sh`.

```shell
$ bash parallel_run.sh
...
Read fasta file from path ../data/IDR_Segments.fasta; Found 6 sequences; Skipped 0 sequences for having non-canonical AAs
All sequences are present! Proceeding with SHARK-dive prediction...
Finished in 0.10163092613220215 seconds
121307136
SHARK-dive prediction complete!
Elapsed Time: 3 seconds
```
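
The same pattern, one job per k-mer length followed by an aggregation step, can be sketched in Python. This does not reproduce `parallel_run.sh`: `scores_for_all_k` is a hypothetical helper, the scoring function is passed in as an argument, and it is assumed to accept the keyword arguments `sequence1`, `sequence2` and `k`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable


def scores_for_all_k(
    sequence1: str,
    sequence2: str,
    score_fn: Callable[..., Any],
    k_values: Iterable[int] = range(1, 11),
) -> Dict[int, Any]:
    """Compute one score per k-mer length concurrently, then collect them by k.

    Hypothetical helper for illustration; not part of bio_shark.
    """
    with ThreadPoolExecutor() as pool:
        # Submit one independent job per k-mer length
        futures = {
            k: pool.submit(score_fn, sequence1=sequence1, sequence2=sequence2, k=k)
            for k in k_values
        }
        # Aggregation step: gather the per-k results into a single mapping
        return {k: future.result() for k, future in futures.items()}
```

Threads are used here to keep the sketch simple. For CPU-bound scoring, `concurrent.futures.ProcessPoolExecutor` offers the same `submit` interface, provided the scoring function is picklable.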


## Publication
***SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences.***
Chi Fung Willis Chow, Soumyadeep Ghosh, Anna Hadarovich, and Agnes Toth-Petroczy. 
Proc Natl Acad Sci U S A. 2024 Oct 15;121(42):e2401622121. doi: [10.1073/pnas.2401622121](https://www.doi.org/10.1073/pnas.2401622121). Epub 2024 Oct 9. PMID: 39383002.

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

