Metadata-Version: 2.1
Name: Mikado
Version: 1.2.4
Summary: A Python3 annotation program to select the best gene model in each locus
Home-page: https://github.com/lucventurini/mikado
Author: Luca Venturini
Author-email: luca.venturini@earlham.ac.uk
License: GPL3
Keywords: rna-seq annotation genomics transcriptomics
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Requires-Dist: wheel (>=0.28.0)
Requires-Dist: pyyaml
Requires-Dist: jsonschema
Requires-Dist: cython (>=0.25)
Requires-Dist: numpy
Requires-Dist: networkx (>=1.10)
Requires-Dist: sqlalchemy (>=1)
Requires-Dist: sqlalchemy-utils
Requires-Dist: biopython (>=1.66)
Requires-Dist: intervaltree
Requires-Dist: nose
Requires-Dist: pyfaidx
Requires-Dist: scikit-learn (>=0.17.0)
Requires-Dist: scipy (>=0.15.0)
Requires-Dist: frozendict
Requires-Dist: python-magic
Requires-Dist: drmaa
Requires-Dist: docutils (!=0.13.1)
Requires-Dist: tabulate
Requires-Dist: ujson
Requires-Dist: simplejson
Requires-Dist: snakemake (<4.0.0); python_version < "3.5"
Requires-Dist: typing; python_version < "3.5"
Requires-Dist: snakemake; python_version >= "3.5"
Provides-Extra: bam
Requires-Dist: pysam (>=0.8); extra == 'bam'
Provides-Extra: mysql
Requires-Dist: mysqlclient (>=1.3.6); extra == 'mysql'
Provides-Extra: postgresql
Requires-Dist: psycopg2; extra == 'postgresql'

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification
of expressed loci from RNA-Seq data and to select the best models in each locus.

The logic of the pipeline is as follows:

1. In a first step, the annotation (provided in GTF/GFF3 format) is parsed to locate *superloci* of overlapping features on the **same strand**.
2. The superloci are divided into different *subloci*, each of which is defined as follows:

    * Multiexonic transcripts belong to the same sublocus only if they share at least one splice junction (i.e. an intron)
    * Monoexonic transcripts belong to the same sublocus only if they overlap by at least one base pair
    * All subloci must contain either only multiexonic or only monoexonic transcripts
3. In each sublocus, the pipeline selects the best transcript according to a user-defined prioritization scheme.
4. The resulting *monosubloci* are merged together, if applicable, into *monosubloci_holders*.
5. The best non-overlapping transcripts are selected, in order to define the *loci* contained inside the superlocus.

    * At this stage, monoexonic and multiexonic transcripts are checked for overlaps
    * Moreover, two multiexonic transcripts are considered to belong to the same locus if they share a splice *site* (not junction)

6. Once the loci have been defined, the program backtracks and looks for transcripts that can be assigned unambiguously to a single locus and that constitute valid alternative splicing isoforms of the main transcripts.
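
The difference between the sublocus criterion (a shared splice *junction*) and the locus criterion (a shared splice *site*) can be illustrated with a minimal sketch. It is not Mikado's implementation: transcripts are modelled as plain lists of exon intervals on the same strand, and the function names (`introns`, `splice_sites`, `same_sublocus`, `share_splice_site`) are hypothetical helpers introduced only for this example.

```python
# Illustrative only: a transcript is a sorted list of (start, end) exon
# intervals, 1-based and inclusive, all on the same strand.
# These helpers are hypothetical and are not part of Mikado's API.

def introns(exons):
    """Return the introns (splice junctions) as a set of (start, end) pairs."""
    return {(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])}

def splice_sites(exons):
    """Return the individual intron boundary positions."""
    return {position for intron in introns(exons) for position in intron}

def overlaps(exons_a, exons_b):
    """True if the two transcript spans overlap by at least one base pair."""
    return exons_a[0][0] <= exons_b[-1][1] and exons_b[0][0] <= exons_a[-1][1]

def same_sublocus(exons_a, exons_b):
    """Apply the sublocus rules described in step 2."""
    mono_a, mono_b = len(exons_a) == 1, len(exons_b) == 1
    if mono_a != mono_b:
        # Subloci never mix monoexonic and multiexonic transcripts.
        return False
    if mono_a:
        return overlaps(exons_a, exons_b)
    # Multiexonic transcripts must share a whole intron.
    return bool(introns(exons_a) & introns(exons_b))

def share_splice_site(exons_a, exons_b):
    """Looser rule from step 5: one shared intron boundary is enough."""
    return bool(splice_sites(exons_a) & splice_sites(exons_b))

t1 = [(100, 200), (300, 400)]   # intron 201-299
t2 = [(100, 200), (350, 400)]   # intron 201-349: same start, different end
t3 = [(150, 200), (300, 450)]   # intron 201-299: same junction as t1

print(same_sublocus(t1, t3))      # shared junction
print(same_sublocus(t1, t2))      # no shared junction
print(share_splice_site(t1, t2))  # shared splice site at position 201
```

In this toy data, `t1` and `t2` would fall into different subloci but could still be assigned to the same locus, because the locus stage only requires a common splice site.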

The criteria used to select the "*best*" transcript are left to the user's discretion, using specific configuration files.
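
As a rough sketch of what a user-defined prioritization scheme amounts to, the example below picks the highest-scoring transcript in a sublocus from a weighted sum of metrics. Everything in it is an assumption made for illustration: the metric names (`cdna_length`, `exon_num`), the weights, the dictionary layout, and the `best_transcript` function do not reproduce Mikado's configuration file format or its scoring code.

```python
# Hypothetical scoring sketch; not Mikado's configuration format or API.

def best_transcript(transcripts, weights):
    """Return the transcript with the highest weighted sum of its metrics."""
    def score(transcript):
        metrics = transcript["metrics"]
        # Metrics missing from a transcript contribute nothing to its score.
        return sum(weight * metrics.get(name, 0.0)
                   for name, weight in weights.items())
    return max(transcripts, key=score)

sublocus = [
    {"id": "t1", "metrics": {"cdna_length": 1500, "exon_num": 4}},
    {"id": "t2", "metrics": {"cdna_length": 1200, "exon_num": 6}},
]

# Two different user choices lead to two different "best" transcripts.
favour_exons = {"cdna_length": 0.001, "exon_num": 1.0}
favour_length = {"cdna_length": 1.0}

print(best_transcript(sublocus, favour_exons)["id"])
print(best_transcript(sublocus, favour_length)["id"])
```

The point of the sketch is only that the same sublocus can yield a different winner depending on the weights the user supplies.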

