Metadata-Version: 2.1
Name: RKP
Version: 0.1.0
Summary: Relative K-mer Project
Home-page: https://gitlab.com/microbial_genomics/relative-kmer-project
Author: Lennard Epping, Felix Hartkopf
Author-email: EppingL@rki.de, HartkopfF@rki.de
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: numpy (==1.17.3)
Requires-Dist: matplotlib (==3.1.2)
Requires-Dist: pandas (==0.25.3)
Requires-Dist: biopython (==1.76)
Requires-Dist: argparse (==1.4.0)
Requires-Dist: tqdm (==4.41.1)

# Relative K-mer Project

## Abstract

### WGS analysis reveals extended natural transformation in Campylobacter impacting diagnostics and the pathogens adaptive potential. 
### Running title: WGS analysis of Campylobacter hybrid strains

### Julia C. Golz 1a, Lennard Epping 2#, Marie-Theres Knüver 1a, Maria Borowiak 1b, Felix Hartkopf 2, Carlus Deneke 1b, Burkhard Malorny 1b, Torsten Semmler 2, Kerstin Stingl 1a*

1 German Federal Institute for Risk Assessment, Department of Biological Safety, a National Reference Laboratory for *Campylobacter*, b Study Centre for Genome Sequencing and Analysis, Berlin, Germany
2 Robert Koch Institute, Microbial Genomics, Berlin, Germany  

\# sharing first author  
\* corresponding author

Over the past decade, *Campylobacter* infections have become more common worldwide. These infections can cause diarrhea, abdominal pain, fever, headache, nausea, and vomiting, and they pose a serious danger to public health. This has sparked efforts to improve prevention and treatment and to reduce transmission. As stated by Kaakoush et al. [1], the main risk factors are the consumption of animal products and water, contact with animals, and international travel.

As the threat to public health differs among *Campylobacter* species, it is important to identify dangerous species and to investigate their genotypic and phenotypic characteristics. In this work, a k-mer mapping approach is used to identify recombination events and the genes involved, in order to describe hybrid species. Hybrids of *Campylobacter jejuni* and *Campylobacter coli* are analyzed to validate this approach and to develop a workflow that can be applied to emerging hybrids in general. This would allow fast and reliable classification of hybrids.

KMC3 [2] and BEDTools [5] are used to extract k-mers from *Campylobacter* genomes and to calculate the k-mers shared between two species and their hybrids. Subsequently, these k-mers can be used in combination with BLAST [3] and Bowtie 2 [4] to select genes that are shared with the hybrid genomes. These genes can be grouped into batches that were involved in a single recombination event. A visualization of the gene coverage, generated using R, provides further information about the selected genes.
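The idea of shared k-mers can be illustrated with a minimal pure-Python sketch. This is not how RKP computes them (the pipeline relies on KMC3 and BEDTools); the `kmers` helper and the toy sequences are invented for illustration, and reverse complements are ignored for simplicity.

```python
from typing import Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Toy sequences, invented for illustration only.
acceptor = "ATGCGTACGT"
donor = "TTGACCGGAA"
hybrid = "ATGCGTGACCGG"

k = 4
acceptor_kmers = kmers(acceptor, k)
donor_kmers = kmers(donor, k)
hybrid_kmers = kmers(hybrid, k)

# K-mers the hybrid shares with the acceptor (its presumed backbone).
acceptor_like = hybrid_kmers & acceptor_kmers

# K-mers the hybrid shares with the donor but not with the acceptor;
# these hint at regions that may stem from a recombination event.
donor_specific_in_hybrid = (hybrid_kmers & donor_kmers) - acceptor_kmers
```

In this toy case the donor-specific k-mers cluster at the end of the hybrid sequence, which is the kind of signal that mapping the k-mers back to a reference is meant to localize.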

This work will provide a new generic tool for hybrid analysis that could be expanded to other bacteria and enable researchers to classify new species and recombination events in a fast and reliable manner.


[1] Nadeem O. Kaakoush, Natalia Castaño-Rodríguez, Hazel M. Mitchell, Si Ming Man, Global Epidemiology of Campylobacter Infection, Clinical Microbiology Reviews, Volume 28, Issue 3, June 2015, Pages 687–720, https://doi.org/10.1128/CMR.00006-15  
[2] Marek Kokot, Maciej Długosz, Sebastian Deorowicz, KMC 3: counting and manipulating k-mer statistics, Bioinformatics, Volume 33, Issue 17, 01 September 2017, Pages 2759–2761, https://doi.org/10.1093/bioinformatics/btx304  
[3] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman, Basic local alignment search tool, Journal of Molecular Biology, Volume 215, Issue 3, 1990, Pages 403–410, ISSN 0022-2836, https://doi.org/10.1016/S0022-2836(05)80360-2  
[4] Ben Langmead, Steven L. Salzberg, Fast gapped-read alignment with Bowtie 2, Nature Methods, Volume 9, 2012, Pages 357–359  
[5] Aaron R. Quinlan, Ira M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, Volume 26, Issue 6, 15 March 2010, Pages 841–842, https://doi.org/10.1093/bioinformatics/btq033

## Requirements

+ [Conda](https://docs.conda.io/en/latest/)

or 

+ Python >= 3.6
  + numpy = 1.17.3
  + matplotlib = 3.1.2
  + pandas = 0.25.3
  + biopython = 1.76
  + argparse = 1.4.0
  + tqdm = 4.41.1
+ kmc = 3.1.1
+ bowtie2 = 2.3.5
+ bedtools = 2.29.2
+ r = 3.6
  + pheatmap = 1.0.12
  + gplots = 3.0.1.1
+ blast = 2.9.0
+ samtools = 1.10
+ bedops = 2.4.37
+ seqkit = 0.11.0


## Installation



1. Change to the `src` directory of the RKP repository:
   ```bash
   cd path/to/repo/src
   ```

2. Create an environment with all dependencies needed by RKP:
   ```bash
   conda env create -f RKP.yaml
   ```

3. Activate the RKP environment:
   ```bash
   conda activate RKP
   ```

4. Run RKP:
   ```bash
   python RKP.py -A <acceptor genome dir A> -B <hybrid genome dir B> -C <donor genome dir C> -k <kmerlength> -a <acceptor threshold> -c <donor threshold> -g <acceptor reference genome fasta> -f <acceptor reference genome gff> -o <output directory>
   ```


Required parameters: 

| Parameter | Description |
|-----------|-------------|
| -A, -C    | Directories with the genomes (.fna) of the acceptor (-A) and the donor (-C) |
| -B        | Directory with the genomes (.fasta) and fnn files of the hybrids |
| -k        | Length of the k-mers |
| -at       | Fraction (0 to 1) of acceptor isolates that should contain a given k-mer |
| -dt       | Fraction (0 to 1) of donor isolates that should contain a given k-mer |
| -g        | Acceptor reference genome (FASTA) |
| -f        | Acceptor reference GFF file |
| -o        | Output directory |
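The two threshold parameters can be read as a filter over the isolates of one group. The following is a minimal sketch of that reading, not RKP's actual implementation: the function name `kmers_above_threshold` is hypothetical, and the use of `>=` (rather than `>`) for the comparison is an assumption.

```python
from collections import Counter
from typing import Iterable, Set


def kmers_above_threshold(isolates: Iterable[Set[str]], threshold: float) -> Set[str]:
    """Keep k-mers found in at least `threshold` (0 to 1) of the isolates.

    Each element of `isolates` is the k-mer set of one genome.
    Assumption: a k-mer is kept when its fraction is >= threshold.
    """
    isolates = list(isolates)
    if not isolates:
        return set()
    # Count in how many isolates each k-mer occurs (once per isolate).
    counts = Counter(kmer for kmer_set in isolates for kmer in kmer_set)
    return {kmer for kmer, n in counts.items() if n / len(isolates) >= threshold}


# Toy k-mer sets for four isolates, invented for illustration only.
acceptor_isolates = [
    {"AAA", "AAC"},
    {"AAA", "AAC"},
    {"AAA"},
    {"AAA", "AAG"},
]

core = kmers_above_threshold(acceptor_isolates, 0.75)
relaxed = kmers_above_threshold(acceptor_isolates, 0.5)
```

A high threshold keeps only k-mers that are typical of the group, while a lower one tolerates more variation between isolates.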


Optional parameters: 


| Parameter | Description |
|-----------|-------------|
| -d        | Keep all temporary files |
| --version | Show version of RKP |
| -h        | Show help |
| -t        | Number of threads (default: 8) |

## File structure of output
```
output
│
├───Acceptor
│   │   (only temporary files)
│
├───Hybrid
│   │   *_iso_seq_protein.fasta
│   │   *_iso_seq.fasta
│   │   mapping_result_Genes_count.csv
│   │   mapping_result_Genes_cutoff_20.csv
│   │   mapping_result_Genes_raw.csv
│   │   mapping_result.csv
│   │   mapping_result.pdf
│   │   recombination_cov_<kmerLength>_W50.pdf
│   │   recombination_cov_<kmerLength>_W100.pdf
│   │   recombination_cov_<kmerLength>_W200.pdf
│   │   recombination_cov_<kmerLength>_W300.pdf
│   │   recombination_cov_<kmerLength>_W400.pdf
│   │   recombination_cov_<kmerLength>_W500.pdf
│   │   Recombination_result_<kmerLength>_W50.csv
│   │   Recombination_result_<kmerLength>_W100.csv
│   │   Recombination_result_<kmerLength>_W200.csv
│   │   Recombination_result_<kmerLength>_W300.csv
│   │   Recombination_result_<kmerLength>_W400.csv
│   │   Recombination_result_<kmerLength>_W500.csv
│
├───Donor
│   │   (only temporary files)
│
└───RKP.log
``` 
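For downstream scripts, the result tables in the `Hybrid` directory can be collected by file name. The sketch below is a hypothetical helper that is not part of RKP; it relies only on the file names listed above, assumes `<kmerLength>` is written as an integer, and treats the number after `W` simply as a key (its interpretation as a window size is an assumption).

```python
import re
from pathlib import Path
from typing import Dict

# Pattern derived from the tree above:
# Recombination_result_<kmerLength>_W<number>.csv
RESULT_PATTERN = re.compile(r"^Recombination_result_(\d+)_W(\d+)\.csv$")


def recombination_results(output_dir: str) -> Dict[int, Path]:
    """Map the number after 'W' to the matching CSV in <output_dir>/Hybrid."""
    results: Dict[int, Path] = {}
    hybrid_dir = Path(output_dir) / "Hybrid"
    for path in sorted(hybrid_dir.glob("Recombination_result_*_W*.csv")):
        match = RESULT_PATTERN.match(path.name)
        if match:
            results[int(match.group(2))] = path
    return results
```

Calling `recombination_results("path/to/output")` returns a dictionary such as `{50: Path(...), 100: Path(...)}`, which can then be passed to a CSV reader of choice.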

## Call structure

```mermaid
graph TD;
  RKP.py-->create_kmers.sh;
  create_kmers.sh-->map_kmers.sh;
  RKP.py-->heatmap.R;
```

## Workflow

![workflow](workflow.png "Workflow")

