Metadata-Version: 2.0
Name: RiboCode
Version: 1.2.2
Summary: A package for identifying the translated ORFs using ribosome-profiling data
Home-page: https://github.com/xzt41/RiboCode
Author: Zhengtao Xiao
Author-email: xzt13@mails.tsinghua.edu.cn
License: MIT
Keywords: ribo-seq ribosome-profiling ORF
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Environment :: Console
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Requires-Dist: pysam (>0.8.4)
Requires-Dist: matplotlib
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: pyfasta
Requires-Dist: biopython
Requires-Dist: h5py

Detect translated ORFs using ribosome-profiling data
====================================================

*RiboCode* is a simple but high-quality computational algorithm for
identifying translated ORFs genome-wide from ribosome-profiling data.

Dependencies:
-------------

- pysam

- pyfasta

- h5py

- Biopython

- NumPy

- SciPy

- matplotlib

- setuptools

Installation
------------

*RiboCode* can be installed like any other Python package. Here are two
common ways:

* Install from PyPI:

.. code-block:: bash

   pip install RiboCode

* Install from a local source archive:

.. code-block:: bash

   pip install RiboCode-*.tar.gz

If you do not have administrator permission, install *RiboCode* locally in your own directory by adding the
option ``--user`` to the installation command (for example, ``pip install --user RiboCode``). Then add
``~/.local/bin/`` to the ``PATH`` variable and the Python directory under ``~/.local/lib/`` to the ``PYTHONPATH``
variable. For example, if you are using the bash shell, add the following lines to your ``~/.bashrc`` file:

.. code-block:: bash

   export PATH=$PATH:$HOME/.local/bin/
   export PYTHONPATH=$HOME/.local/lib/python2.7

Then reload your ``~/.bashrc`` file with this command:

.. code-block:: bash

   source ~/.bashrc

Tutorial to analyze ribosome-profiling data and run *RiboCode*
--------------------------------------------------------------

Here, we use the `HEK293 dataset`_ as an example to illustrate the use of *RiboCode*.
Please make sure that all file paths are correct.

1. **Required files**

   The genome FASTA file and the GTF annotation file can be downloaded from:


   http://www.gencodegenes.org

   or from:

   http://asia.ensembl.org/info/data/ftp/index.html

   http://useast.ensembl.org/info/data/ftp/index.html

   For example, the files required in this tutorial can be downloaded from the following URLs:

   GTF: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

   FASTA: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz

   The raw Ribo-seq FASTQ file can be downloaded using the fastq-dump tool from `SRA_Toolkit`_:

   .. code-block:: bash

      fastq-dump -A <SRR1630831>

2. **Trimming adapter sequence for ribo-seq data**

   Use the cutadapt program: https://cutadapt.readthedocs.io/en/stable/installation.html

   Example:

   .. code-block:: bash

      cutadapt -m 20 --match-read-wildcards -a (Adapter sequence) -o <Trimmed fastq file> <Input fastq file>


   For this dataset, the adapter sequences have already been trimmed off, so this step can be skipped.

3. **Removing ribosomal RNA (rRNA) derived reads**

   Align the trimmed reads to rRNA sequences using Bowtie, then select the unaligned reads for the next step.

   Bowtie program: http://bowtie-bio.sourceforge.net/index.shtml

   rRNA sequences: an `rRNA.fa`_ file is provided in the data folder of this package.

   Example:

   .. code-block:: bash

      bowtie-build <rRNA.fa> rRNA
      bowtie -p 8 --norc --un un_aligned.fastq rRNA -q <SRR1630831.fastq> <HEK293_rRNA.align>
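
   To check how many reads the rRNA filter removed, one can compare read counts before and after
   this step. The following is a minimal Python sketch and not part of *RiboCode*: the helper name
   ``count_fastq_reads`` is hypothetical, the reads are made up, and it assumes uncompressed FASTQ
   files with exactly four lines per record:

   .. code-block:: python

      def count_fastq_reads(lines):
          """Count FASTQ records in an iterable of lines (four lines per record)."""
          # Blank lines are skipped, so zero-length reads are not supported here.
          n_lines = sum(1 for line in lines if line.strip())
          if n_lines % 4 != 0:
              raise ValueError("line count is not a multiple of 4; truncated FASTQ?")
          return n_lines // 4


      # Tiny in-memory stand-ins for the input FASTQ and un_aligned.fastq
      # (invented reads, for illustration only).
      raw = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n".splitlines()
      kept = "@r1\nACGT\n+\nIIII\n@r3\nGGCC\n+\nIIII\n".splitlines()

      n_raw = count_fastq_reads(raw)
      n_kept = count_fastq_reads(kept)
      removed_fraction = 1.0 - float(n_kept) / n_raw
      print("%d of %d reads kept (%.1f%% removed)"
            % (n_kept, n_raw, 100 * removed_fraction))

   With real data, pass an open file handle instead of the in-memory list of lines.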

4. **Aligning the clean reads to reference genome**

   Use the STAR program: https://github.com/alexdobin/STAR

   Example:

   (1). Build index:

   .. code-block:: bash

      STAR --runThreadN 8 --runMode genomeGenerate --genomeDir <hg19_STARindex> \
      --genomeFastaFiles <hg19_genome.fa> --sjdbGTFfile <gencode.v19.annotation.gtf>

   (2). Alignment:

   .. code-block:: bash

      STAR --outFilterType BySJout --runThreadN 8 --outFilterMismatchNmax 2 --genomeDir <hg19_STARindex> \
      --readFilesIn <un_aligned.fastq> --outFileNamePrefix (HEK293) \
      --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts \
      --outFilterMultimapNmax 1 --outFilterMatchNmin 16

5. **Running RiboCode to identify translated ORFs**

   (1). Preparing the transcripts annotation files:

   .. code-block:: bash

      prepare_transcripts -g <gencode.v19.annotation.gtf> -f <hg19_genome.fa> -o <RiboCode_annot>

   (2). Selecting the length range of the RPF reads and identifying the P-site locations:

   .. code-block:: bash

      metaplots -a <RiboCode_annot> -r <HEK293Aligned.toTranscriptome.out.bam>


   This step generates a PDF file that plots the aggregate profiles of the distance between the 5' end of reads
   and the annotated start codons or stop codons.

   Users can select the read lengths which show strong 3-nt periodicity and identify the P-site locations for each length.
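
   As a rough illustration of what "strong 3-nt periodicity" means, the sketch below computes, for each
   read length, the fraction of 5' ends that fall into each of the three frames and reads a candidate
   P-site offset from the highest peak upstream of the start codon. This is not the procedure used by
   ``metaplots``: the function names, the 0.5 cutoff, and all counts are assumptions made up for
   illustration only:

   .. code-block:: python

      def frame_fractions(counts):
          """counts maps distance (5' end relative to the start codon, nt) to read count."""
          totals = [0, 0, 0]
          for distance, n in counts.items():
              totals[distance % 3] += n  # Python's % is non-negative for negative distances
          total = sum(totals)
          return [float(t) / total for t in totals] if total else [0.0, 0.0, 0.0]


      def candidate_psite_offset(counts):
          """Offset from the 5' end to the P-site, taken from the highest upstream peak."""
          upstream = {d: n for d, n in counts.items() if d < 0}
          peak = max(upstream, key=upstream.get)
          return -peak


      # Invented aggregate profiles around the start codon, keyed by read length.
      profiles = {
          29: {-12: 900, -11: 60, -10: 40, 0: 300, 1: 30, 2: 20, 3: 280, 4: 25, 5: 15},
          33: {-12: 50, -11: 45, -10: 55, 0: 40, 1: 50, 2: 45, 3: 48, 4: 52, 5: 47},
      }

      MIN_DOMINANT_FRACTION = 0.5  # arbitrary cutoff for this sketch

      selected = {}
      for length, counts in sorted(profiles.items()):
          if max(frame_fractions(counts)) >= MIN_DOMINANT_FRACTION:
              selected[length] = candidate_psite_offset(counts)

      print(selected)

   In this toy data only the 29-nt reads pass the cutoff, with a candidate offset of 12 nt; the 33-nt
   reads are spread almost evenly over the three frames and are dropped.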

   (3). Detecting translated ORFs using the ribosome-profiling data:

   .. code-block:: bash

      RiboCode -a <RiboCode_annot> -c <config.txt> -l no -o <RiboCode_ORFs_result>


   Specify the BAM file information and the P-site parameters in `config.txt`_; please refer to the example file in the data folder.

   **Explanation of final result files**

   *RiboCode* generates two text files. The "(output file name).txt" file contains the
   information of the predicted ORFs in each transcript. The "(output file name)_collapsed.txt"
   file combines the ORFs that have the same stop codon in different transcript isoforms:
   the one harboring the most upstream in-frame ATG is chosen.

   Some column names of the result file::

    - ORF_ID: The identifier of the predicted ORF.
    - ORF_type: The type of ORF. The following ORF categories are reported:

     "annotated" (overlapping annotated CDS, having the same stop codon as the annotated CDS)

     "uORF" (upstream of annotated CDS, not overlapping annotated CDS)

     "dORF" (downstream of annotated CDS, not overlapping annotated CDS)

     "Overlap_uORF" (upstream of annotated CDS, overlapping annotated CDS)

     "Overlap_dORF" (downstream of annotated CDS, overlapping annotated CDS)

     "Internal" (inside annotated CDS, but in a different frame relative to the annotated CDS)

     "novel" (in non-coding genes or non-coding transcripts of coding genes).

    - ORF_tstart, ORF_tstop: the beginning and end of the ORF in the RNA transcript (1-based coordinate)
    - ORF_gstart, ORF_gstop: the beginning and end of the ORF in the genome (1-based coordinate)
    - pval_frame0_vs_frame1: significance level of the P-site density of frame0 being greater than that of frame1
    - pval_frame0_vs_frame2: significance level of the P-site density of frame0 being greater than that of frame2
    - pval_combined: integrated P-value
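
   The result file can be post-processed with a few lines of Python. The sketch below is only an
   illustration: it assumes that the file is tab-delimited with a header row containing the column
   names listed above, and the rows, the ``significant_orfs`` helper, and the 0.05 cutoff are made up
   for the example:

   .. code-block:: python

      import csv

      # Stand-in for the lines of "(output file name).txt"; all values are invented.
      sample_lines = [
          "ORF_ID\tORF_type\tpval_combined",
          "orf_a\tannotated\t1e-12",
          "orf_b\tuORF\t0.003",
          "orf_c\tnovel\t0.4",
      ]


      def significant_orfs(lines, max_pval=0.05):
          """Return the rows whose pval_combined is at most max_pval."""
          reader = csv.DictReader(lines, delimiter="\t")
          return [row for row in reader if float(row["pval_combined"]) <= max_pval]


      hits = significant_orfs(sample_lines)
      for row in hits:
          print("%s\t%s\t%s" % (row["ORF_ID"], row["ORF_type"], row["pval_combined"]))

   With a real result file, pass an open file handle instead of ``sample_lines``.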

   (4). (Optional) Plotting the P-site densities of predicted ORFs:

   Users can plot the P-site density of predicted ORFs using the "plot_orf_density" command, as in the example below:

   .. code-block:: bash

      plot_orf_density -a <RiboCode_annot> -c <config.txt> -t (transcript_id) \
      -s (ORF_gstart) -e (ORF_gstop)


For any questions, please contact:
----------------------------------

   Zhengtao Xiao (xzt13@mails.tsinghua.edu.cn)

   Rongyao Huang (THUhry12@163.com)

   Xudong Xing (xudonxing_bioinf@sina.com)

.. _SRA_Toolkit: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
.. _HEK293 dataset: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1630831
.. _config.txt: https://github.com/xzt41/RiboCode/blob/master/data/config.txt
.. _rRNA.fa: https://github.com/xzt41/RiboCode/blob/master/data/rRNA.fa


