Metadata-Version: 2.1
Name: GENIALbiologists
Version: 0.9.0
Summary: GENIAL: GENes Indentification with Abricate for Lucky biologists
Home-page: https://github.com/p-barbet/GENIAL
Author: Pauline Barbet, Arnaud Felten
Author-email: pauline.barbet@anses.fr, arnaud.felten@anses.fr
License: UNKNOWN
Platform: UNKNOWN
Classifier: License :: OSI Approved :: GNU General Public License v2 (GPLv2)
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: POSIX :: Linux
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: seaborn
Requires-Dist: biopython

GENIAL : GENes Identification with Abricate for Lucky biologists
================================================================

Authors : Barbet Pauline, Felten Arnaud

Affiliation: [Food Safety Laboratory - ANSES Maisons Alfort (France)](https://www.anses.fr/en/content/laboratory-food-safety-maisons-alfort-and-boulogne-sur-mer)

You can find the latest version of the tool at [https://github.com/p-barbet/GENIAL](https://github.com/p-barbet/GENIAL)


GENIAL
======

GENIAL aims to identify antimicrobial resistance and virulence genes from bacterial genomes matching them to a database gathering genes of interest using [ABRicate](https://github.com/tseemann/abricate).

### Databases

Default databases available are ([Resfinder](https://cge.cbs.dtu.dk/services/ResFinder/), [CARD](https://card.mcmaster.ca/), [ARG-ANNOT](http://backup.mediterranee-infection.com/article.php?laref=282&titre=arg-annot), [NCBI](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA313047), [EcOH](https://github.com/katholt/srst2/tree/master/data), [PlasmidFinder](https://cge.cbs.dtu.dk/services/PlasmidFinder/), [Ecoli_VF](https://github.com/phac-nml/ecoli_vf) and [VFDB](http://www.mgc.ac.cn/VFs/))

As well as this databes, it's posible to use your own database.

The tool is divided into two scripts.

### GENIALanalysis

GENIALanalysis aims to run ABricate. It takes in input a tsv file containing genomes fasta files paths and IDs.If you want to use your own database you also need to provide a multifasta whith genes IDs as headers. Then the script run ABricate and produce in output one ABRicate result file per genome, corresponding to a tsv file including genes found in each sample.

### GENIALresults

GENIALresults aims to conditionning ABRicate results in the form of matrixes and heatmaps of presence/absence. It takes in input a temporary file produced by the Abricate analysis containing the genomes Abricate results paths and IDs. In the case of vfdb database a file containing the virulence factors names, their family and species is automticaly included in the script.

The output depending on the database used :

* In any cases a matrix in tsv format and a heatmap in png format with all genes found are created


On top of that:

* If you use one of the default databases [Resfinder](https://cge.cbs.dtu.dk/services/ResFinder/) or [VFDB](http://www.mgc.ac.cn/VFs/) news matrix and heatmap by gene type are produced with a correspondace table between the gene name, its family and its number in all genomes.

* If you don't use one of the two previous databases or if you use your own database, only a corespondance table between the gene name and its number in all genomes is produced in addition.

![](workflow.PNG?raw=true "script workflow")


Dependencies
============

The script has been developed with python 3.6 (tested with 3.6.6)

### External dependencies

* [ABRicate](https://github.com/tseemann/abricate) tested with 0.8.7
* [Pandas](https://pandas.pydata.org/) tested with 0.23.4
* [seaborn](https://seaborn.pydata.org/installing.html) tested with 0.9.0


Parameters
==========

### Command line options


|   Options   |                                                              Description                                                              |        Required        |      Default     |
|:-----------:|:-------------------------------------------------------------------------------------------------------------------------------------:|:----------------------:|:----------------:|
|      -f     |                                            tsv file with FASTA files paths ans strains IDs                                            |           Yes          |                  |
|     -dbp    |                                   Path to ABRicate databases repertory. Implies -dbf and --privatedb                                  |   Yes if --privatedb   |                  |
|     -dbf    |                           Multifasta containing the private database sequences. Implies -dbp and --privatedb                          |   Yes if --privatedb   |                  |
|      -T     |                                                        Number of thread to use                                                        |           No           |         1        |
|      -w     |                                                           Working directory                                                           |           No           |         .        |
|      -r     |                                                         Results directory name                                                        |           No           | ABRicate_results |
| --defaultdb |  default databases available : resfinder, card, argannot, acoh, ecoli_vf, plasmidfinder, vfdb or ncbi. Incompatible with --privatedb  | Yes if not --privatedb |                  |
| --privatedb |                              Private database name. Implies -dbp and -dbf. Incompatible with --defaultdb                              | Yes if not --defaultdb |                  |
|   --mincov  |                                                   Minimum proportion of gene covered                                                  |           No           |        80        |
|   --minid   |                                             Minimum proportion of exact nucleotide matches                                            |           No           |        90        |
|     --R     |                                          Remove genes present in all genomes from the matrix                                          |           No           |       False      |

Test 
====

After installing ABRicate and Pandas and seaborn you can test the script with the command line :

## Default database

	python AntiViruce.py -f input_file.tsv --defaultdb vfdb -r results_directory --minid 90 --mincov 80

## Private database

	python AntiViruce.py -f input_file.tsv  --privatedb private_db_name -T 10 -r results_directory --minid 90 --mincov 80 -dbp path_to_abricate_databases_repertory -dbf private_db_multifasta_path









