Metadata-Version: 2.1
Name: IBSpy
Version: 0.4.0rc0
Summary: A package to detect IBS regions
Home-page: https://github.com/Uauy-Lab/IBSpy
Author: Ricardo H. Ramirez-Gonzalez
Author-email: ricardo.ramirez-gonzalez@jic.ac.uk
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# IBSpy

![Python package](https://github.com/Uauy-Lab/IBSpy/workflows/Python%20package/badge.svg)
[![Maintainability](https://api.codeclimate.com/v1/badges/5a4b1b0e89f7f9f8c34c/maintainability)](https://codeclimate.com/github/Uauy-Lab/IBSpy/maintainability)

A Python library to identify Identical By State (IBS) regions.



To build the KMC kmer database used by the tests, run this command:

```sh
kmc -k31 -r -ci1 -fm data/test4B.jagger.fa data/test4B.jagger.kmc_k31 tmp
```


## Installing IBSpy

The easiest way to install IBSpy is to use pip3.

```sh
pip3 install IBSpy
```


If ```pip3``` fails, you can clone the project and compile it with:

```sh
pip3 install cython biopython pyfaidx
python3 setup.py develop
```

After that, the IBSpy command should be available.


### KMC3 

If you want to use the [KMC](https://github.com/refresh-bio/KMC) bindings, install KMC and compile its Python API (`py_kmc_api`).

Then, run the following commands to set up the path for it.
```sh
cd KMC/py_kmc_api
source set_path.sh 
```


## Preparing the databases

IBSpy requires a kmer database built from the sequencing files. Currently two formats are supported:

  1. Jellyfish: Follow the instructions on its [website](https://github.com/gmarcais/Jellyfish/blob/master/doc/Readme.md).
  2. kmerGWAS: Has an ad hoc file format that contains only the kmers in a binary representation, sorted. This option is faster than the Jellyfish version, but creating the kmer table is less straightforward. The manual is [here](https://github.com/voichek/kmersGWAS/blob/master/manual.pdf).
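
For readers new to kmer databases, the sketch below shows conceptually what such a database stores: each kmer in a sequence and how often it occurs. It is a plain-Python illustration, not how Jellyfish, kmerGWAS, or IBSpy work internally. The function names `canonical` and `count_kmers` are hypothetical. Collapsing each kmer with its reverse complement (the "canonical" form) is an assumption here; whether your database does this depends on how it was built.

```python
from collections import Counter

# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequence: str, k: int) -> Counter:
    """Count canonical kmers of length k in a sequence (illustrative only)."""
    counts = Counter()
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip kmers containing N or any other ambiguity code.
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


if __name__ == "__main__":
    for kmer, n in sorted(count_kmers("ACGTACGT", 4).items()):
        print(kmer, n)
```

Real kmer counters use compact binary encodings and disk-backed structures to handle whole genomes, so use this only to understand the idea.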

## Running unit tests

To make sure that your changes haven't broken the core of IBSpy, run the unit tests:

```sh
python3 setup.py test
```


## Running IBSpy

IBSpy has relatively few options. You can list them with the ```--help``` option.

```sh
IBSPy --help
usage: IBSPy [-h] [-w WINDOW_SIZE] [-k KMER_SIZE] [-d DATABASE] [-r REFERENCE]
             [-z] [-o OUTPUT] [-f {kmerGWAS,jellyfish}]

optional arguments:
  -h, --help            show this help message and exit
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        window size to analyze
  -k KMER_SIZE, --kmer_size KMER_SIZE
                        Kmer size of the database
  -d DATABASE, --database DATABASE
                        Kmer database
  -r REFERENCE, --reference REFERENCE
                        The reference with the position of the kmers
  -z, --compress        When an ouput file is present, it is compressed as .gz
  -o OUTPUT, --output OUTPUT
                        Output file. If missing, the ouptut is sent to stdout
  -f {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}, --database_format {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}
                        Database format 
```

To generate the table with the number of observed kmers and variants, using a kmer database from kmerGWAS, run the following command:


```sh
IBSpy --output "kmer_windows_LineXXX.tsv.gz" --compress --database kmers_with_strand --reference arinaLrFor.fa --window_size 50000 --database_format kmerGWAS
```
For KMC3, the database argument is the output name given to `kmc` when the database was created (for example `data/test4B.jagger.kmc_k31`), not the name of one of the files it wrote.
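
To make the idea of the per-window table more concrete, here is a minimal sketch of counting, for each window of a reference, how many of its kmers are present in a sample. It is a conceptual illustration under simplifying assumptions, not IBSpy's implementation: the function name `observed_kmers_per_window` and the returned columns are hypothetical, the sample database is modelled as a plain Python set, and strand (canonical form) is ignored.

```python
def observed_kmers_per_window(reference, sample_kmers, k, window_size):
    """Yield (start, end, total_kmers, observed_kmers) for consecutive windows.

    reference    -- reference sequence as a string
    sample_kmers -- set of kmers seen in the sample (stand-in for a kmer database)
    """
    last_start = len(reference) - k + 1  # one past the last valid kmer start
    for start in range(0, len(reference), window_size):
        end = min(start + window_size, len(reference))
        total = observed = 0
        # Consider every kmer that starts inside this window and fits in the reference.
        for i in range(start, min(end, last_start)):
            total += 1
            if reference[i:i + k] in sample_kmers:
                observed += 1
        yield start, end, total, observed


if __name__ == "__main__":
    ref = "ACGTACGTAA"
    sample = {"ACGT", "GTAA"}
    for row in observed_kmers_per_window(ref, sample, k=4, window_size=5):
        print(*row, sep="\t")
```

Windows where most reference kmers are observed in the sample are candidates for being identical by state; windows with many missing kmers suggest variation.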


## Running IBSplot

List the IBSplot options using ```--help```.

```sh
IBSplot --help
usage: IBSplot [-h] [-i IBSPY_COUNTS] [-w WINDOW_SIZE] [-f FILTER_COUNTS]
               [-n N_COMPONENTS] [-c COVARIANCE_TYPE] [-s STITCH_NUMBER]
               [-o OUTPUT] [-r REFERENCE] [-q QUERY] [-p PLOT_OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i IBSPY_COUNTS, --IBSpy_counts IBSPY_COUNTS
                        tvs file genetared by IBSpy output
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Windows size to count variations within
  -f FILTER_COUNTS, --filter_counts FILTER_COUNTS
                        Filter number of variaitons above this threshold to
                        compute GMM model, default=None
  -n N_COMPONENTS, --n_components N_COMPONENTS
                        Number of componenets for the GMM model, default=3
  -c COVARIANCE_TYPE, --covariance_type COVARIANCE_TYPE
                        type of covariance used for GMM model, default="full"
  -s STITCH_NUMBER, --stitch_number STITCH_NUMBER
                        Consecutive "outliers" in windows to stitch, default=3
  -o OUTPUT, --output OUTPUT
                        tsv file with variations count by windows and summary
                        statistics
  -r REFERENCE, --reference REFERENCE
                        genome reference name
  -q QUERY, --query QUERY
                        query sample
  -p PLOT_OUTPUT, --plot_output PLOT_OUTPUT
                        histograms and ascatter files in .PDF format
```

IBSplot uses the output table generated by IBSpy as described above (e.g., ```"kmer_windows_LineXXX.tsv.gz"```). It can aggregate the variant counts into larger windows. The example below uses 400,000 bp windows to fit a GMM model and generate the plots.
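
The aggregation into larger windows can be sketched as follows. This is only an illustration of the binning idea, not IBSplot's code: the function name `aggregate_windows` and the row layout `(seqname, start, variations)` are hypothetical, so check the header of your IBSpy output for the actual column names.

```python
from collections import defaultdict


def aggregate_windows(rows, window_size):
    """Sum variation counts from small windows into larger windows.

    rows        -- iterable of (seqname, start, variations) for the small windows
    window_size -- size in bp of the larger windows
    Returns a sorted list of (seqname, start, end, variations).
    """
    totals = defaultdict(int)
    for seqname, start, variations in rows:
        # Each small window is assigned to the large window containing its start.
        bin_start = (start // window_size) * window_size
        totals[(seqname, bin_start)] += variations
    return [
        (seqname, bin_start, bin_start + window_size, count)
        for (seqname, bin_start), count in sorted(totals.items())
    ]


if __name__ == "__main__":
    small_windows = [
        ("chr1A", 0, 3),
        ("chr1A", 50000, 2),
        ("chr1A", 400000, 7),
        ("chr1B", 350000, 1),
    ]
    for row in aggregate_windows(small_windows, 400000):
        print(*row, sep="\t")
```

The binning is exact when the larger window size is a multiple of the size used when running IBSpy (for example, 50,000 bp windows grouped into 400,000 bp windows).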

The GMM model is described in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture).
To generate the table with variant counts categorized by the GMM model as IBS or non-IBS, and to generate the plots, run the following command:

```sh
# minimal arguments
IBSplot --IBSpy_counts "kmeribs-Wheat_Jagger-Flame.tsv.gz" --window_size 400000 --output gmm_ibs.tsv.gz --reference Jagger --query Flame --plot_output gmm_plots.pdf
```

In addition, you can include some or all of the following options to tune the GMM model parameters and best separate IBS from non-IBS regions for the reference and query samples used:

```sh
IBSplot --filter_counts 1000 --n_components 3 --covariance_type 'full' --stitch_number 3
```



