Metadata-Version: 1.1
Name: baseqRNA
Version: 1.5
Summary: Pipeline for Processing RNA-Seq datasets
Home-page: https://www.beiseq.com
Author: Xiannian Zhang
Author-email: friedpine@gmail.com
License: UNKNOWN
Description: # DropRNA
        
        ## Install baseq_drops
        This pipeline requires Python 3 and the baseq_drops package, which can be installed with:
        
            pip install baseqdrops
        
        After installation, you will have a runnable command, `baseq-Drop`.
        
        ## Config file
        
        The pipeline needs the following software and resources:
        
        + `star`: the STAR aligner, for fast alignment of RNA-Seq data;
        + `samtools`: used for sorting BAM files;
        + `whitelistDir`: the directory that holds the barcode whitelist files for indrop and 10X.
        These files can be downloaded from XXX.
        + `cellranger_<genome>`: the directory of cellranger references for the given genome (for example, `cellranger_hg38`).
         The key steps of read alignment and of tagging reads to genes
         are inspired by and borrowed from the open source cellranger pipeline
         (https://github.com/10XGenomics/cellranger).
         The genome index and transcriptome references can be downloaded
         from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest.
        
        When running a command, these settings are read from a config file such as `config_drops.ini`:
        
            [Drops]
            samtools = /path/to/samtools
            star = /path/to/STAR
            whitelistDir = /path/to/whitelist_file_directory
            cellranger_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/
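        
        Before starting a long run it can help to check that the config file has every key the pipeline expects. The following is a minimal sketch using Python's standard `configparser`; the helper name `load_drops_config` and the list of required keys are assumptions made for illustration and are not part of the package.
        
        ```python
        import configparser
        
        # Keys taken from the sample config above. The genome-specific key is
        # assumed to follow the cellranger_<genome> naming pattern.
        REQUIRED_KEYS = ("samtools", "star", "whitelistDir")
        
        
        def load_drops_config(text, genome):
            """Parse config text and return the [Drops] settings as a dict (hypothetical helper)."""
            parser = configparser.ConfigParser()
            parser.optionxform = str  # keep key case, e.g. whitelistDir
            parser.read_string(text)
            if not parser.has_section("Drops"):
                raise KeyError("missing [Drops] section")
            section = dict(parser["Drops"])
            wanted = REQUIRED_KEYS + ("cellranger_" + genome,)
            missing = [key for key in wanted if key not in section]
            if missing:
                raise KeyError("missing config keys: " + ", ".join(missing))
            return section
        
        
        EXAMPLE = "\n".join([
            "[Drops]",
            "samtools = /path/to/samtools",
            "star = /path/to/STAR",
            "whitelistDir = /path/to/whitelist_file_directory",
            "cellranger_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/",
        ])
        
        settings = load_drops_config(EXAMPLE, "hg38")
        print(settings["star"])
        ```
        
        To check a file on disk, read its contents first and pass the text to the helper.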
        
        ## Process Steps
        1. `Extract the Cell Barcode` Count the occurrences of each barcode; this generates `barcode_count.<sample>.csv`;
        2. `Cell Barcode correction and filtering` Correct cell barcodes that have a 1bp mismatch, and filter out barcodes with fewer than the minimum number of reads;
        3. `Split the reads of valid Cell Barcodes` The raw paired-end reads are split into 16 single-end files for multiprocessing, according to the 2bp prefix of the barcode. For example, we will get: `split.<sample>.<AA|AT|AC|AG...|GG>.fq`;
        4. `Star Alignment` The FASTQ files are aligned at the same time; a BAM file sorted by sequence header is generated;
        5. `Reads tagging` Tag each read's alignment position with the corresponding gene name;
        6. `Generating UMI table`
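        
        Steps 2 and 3 can be illustrated with a small sketch. It assumes that correction means merging a barcode into the single whitelisted barcode one substitution away, and that ambiguous barcodes are dropped; the pipeline's actual rules may differ. All function names here are hypothetical.
        
        ```python
        from collections import Counter
        from itertools import product
        
        
        def one_mismatch_variants(barcode):
            """Yield every sequence that differs from barcode at exactly one base."""
            for i, base in enumerate(barcode):
                for alt in "ACGT":
                    if alt != base:
                        yield barcode[:i] + alt + barcode[i + 1:]
        
        
        def correct_and_filter(counts, whitelist, min_reads):
            """Merge 1bp-mismatch barcodes into whitelisted ones, then filter by read count."""
            merged = Counter()
            for barcode, n in counts.items():
                if barcode in whitelist:
                    merged[barcode] += n
                    continue
                hits = [v for v in one_mismatch_variants(barcode) if v in whitelist]
                if len(hits) == 1:  # assumption: ambiguous or unmatched barcodes are dropped
                    merged[hits[0]] += n
            return {bc: n for bc, n in merged.items() if n >= min_reads}
        
        
        # The 16 two-base prefixes (4 bases x 4 bases) used to name the split files.
        PREFIXES = ["".join(p) for p in product("ACGT", repeat=2)]
        
        
        def split_file_name(sample, barcode):
            """Return the split file a read belongs to, based on the barcode's 2bp prefix."""
            return "split.{}.{}.fq".format(sample, barcode[:2])
        
        
        # Toy data: short barcodes keep the example readable.
        observed = Counter({"AACGT": 9, "AACGA": 2, "GGTTT": 1, "TTTTT": 5})
        whitelist = {"AACGT", "GGTTA"}
        valid = correct_and_filter(observed, whitelist, min_reads=10)
        print(valid)
        print(len(PREFIXES), split_file_name("demo", "AACGT"))
        ```
        
        In the toy data, `AACGA` is merged into `AACGT`, which then passes the read threshold, while `GGTTA` and `TTTTT` are filtered out.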
        
        ## Run Command
        
        The main options are:
        
        + `--config`: config file;
        + `--genome/-g`: genome version;
        + `--protocol/-p`: protocol, one of [10X|indrop|dropseq]
        + `--minreads`: minimum number of reads for a barcode
        + `--name/-n`: sample name
        + `--fq1/-1`: Read 1
        + `--fq2/-2`: Read 2
        + `--top_million_reads`: number of reads to use, in millions; mainly for testing the pipeline on a fraction of the reads (default 1000)
        + `--dir/-d`: output path
        
        If you have configured `cellranger_hg38`, you can run the following:
        
            baseqdrops run_pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 10000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./
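        
        When the pipeline is launched from a script, for example once per sample, the same command can be assembled in Python. This sketch uses only the flags shown in the command above; the helper name `build_run_pipe_command` is hypothetical, and the command is printed rather than executed.
        
        ```python
        import shlex
        
        
        def build_run_pipe_command(config, genome, protocol, name, fq1, fq2, outdir, minreads=10000):
            """Return the argument list for the run_pipe command shown above (hypothetical helper)."""
            return [
                "baseqdrops", "run_pipe",
                "--config", config,
                "-g", genome,
                "-p", protocol,
                "--minreads", str(minreads),
                "-n", name,
                "-1", fq1,
                "-2", fq2,
                "-d", outdir,
            ]
        
        
        cmd = build_run_pipe_command(
            "./config_drops.ini", "hg38", "10X", "10X_test",
            "10x_1.1.fq.gz", "10x.2.fq.gz", "./",
        )
        # Quote each argument so the printed line is safe to paste into a shell.
        print(" ".join(shlex.quote(arg) for arg in cmd))
        ```
        
        Passing the list to `subprocess.run(cmd, check=True)` would start the run and raise an error if the command exits with a non-zero status.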
        
        ### For older versions of 10X results
        The cell barcode length is 15 and the UMI length is 5.
        
            baseqdrops run_pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 10000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./
        
        
Keywords: sample setuptools development
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.6
