Metadata-Version: 2.1
Name: APEC
Version: 1.1.0.1
Summary: Single cell epigenomic clustering based on accessibility pattern
Home-page: https://github.com/QuKunLab/APEC
Author: Bin Li
Author-email: libinsnet@gmail.com
License: BSD License
Description: # User Guide for APEC (v1.1.0)
        
        (Accessibility Pattern based Epigenomic Clustering)
        
        APEC performs fine cell type clustering on single-cell chromatin accessibility data from scATAC-seq, snATAC-seq, sciATAC-seq or any other relevant experiment. It can also evaluate gene scores from the relevant accessons, search for differential motifs/genes for each cell cluster, find super enhancers, and construct pseudo-time trajectories (by calling Monocle). **If users have already obtained the fragment-count-per-peak matrix from another mapping pipeline (such as CellRanger), please run APEC from the first section, "Run APEC from fragment count matrix". If users have only the raw fastq files, please jump to the second section, "Get fragment count matrix from raw data".**
        
        ## Run APEC from fragment count matrix
        
        ### 1. Requirements and installation
        
        #### 1.1 Requirements
        
        APEC requires a Linux system (CentOS 7.3+ or Ubuntu 16.04+) and Python (2.7.15+ or 3.6.8+). To build pseudotime trajectories with APEC, please also install R (3.4.0+) and monocle (2.4.0). The following software is also required:
        
            Bedtools: http://bedtools.readthedocs.io/en/latest/content/installation.html
            Meme 4.11.2: http://meme-suite.org/doc/download.html?man_type=web
            Homer: http://homer.ucsd.edu/homer/
        
        **Notes: Users need to download the genome references for Homer with "perl /path-to-homer/configureHomer.pl -install hg19" and "perl /path-to-homer/configureHomer.pl -install mm10".**
        
        The files in the **reference** folder are required for APEC. **The reference files are too large to be hosted on GitHub; users can download all of them from http://galaxy.ustc.edu.cn:30803/APEC/**. The **reference** folder should contain the following files:
        
            hg19_RefSeq_genes.gtf, hg19_chr.fa, hg19_chr.fa.fai,
            mm10_RefSeq_genes.gtf, mm10_chr.fa, mm10_chr.fa.fai,
            JASPAR2018_CORE_vertebrates_non-redundant_pfms_meme.txt, tier1_markov1.norc.txt
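        
        A quick way to confirm that the **reference** folder is complete is sketched below. This helper is not part of APEC: the function name `missing_reference_files` and the example path are hypothetical, and only the file names come from the list above.
        
        ```python
        import os
        
        # File names taken from the list above.
        REQUIRED_FILES = [
            "hg19_RefSeq_genes.gtf", "hg19_chr.fa", "hg19_chr.fa.fai",
            "mm10_RefSeq_genes.gtf", "mm10_chr.fa", "mm10_chr.fa.fai",
            "JASPAR2018_CORE_vertebrates_non-redundant_pfms_meme.txt",
            "tier1_markov1.norc.txt",
        ]
        
        
        def missing_reference_files(reference_dir, required=REQUIRED_FILES):
            """Return the required file names that are absent from reference_dir."""
            return [name for name in required
                    if not os.path.isfile(os.path.join(reference_dir, name))]
        
        
        if __name__ == "__main__":
            # "/path-to-reference" is a placeholder; replace it with the real folder.
            missing = missing_reference_files("/path-to-reference")
            if missing:
                print("Missing reference files: " + ", ".join(missing))
            else:
                print("All reference files are present.")
        ```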
        
        #### 1.2 Install and import APEC
        
        Users can install APEC by:
        
            pip install APEC
        
        In IPython, Jupyter Notebook or a Python script, users can import the APEC modules by:
        
            from APEC import clustering,plot,generate
        
        Users can view the documentation of each APEC function by calling "help()" in IPython or Jupyter, for example:
        
            help(clustering.cluster_by_APEC)
        
        ### 2. Input data
        
        Users need to prepare a project folder (termed '$project') that contains **matrix**, **peak**, **result** and **figure** folders. Please place "filtered_cells.csv" and "filtered_reads.mtx" in the **matrix** folder, and "top_filtered_peaks.bed" in the **peak** folder. The three input files are described below:
        
            filtered_cells.csv: Two-column (separated by tabs) list of cell information ('name' and 'notes'), such as:
                                	name    notes
                                	CD4-001 CD4
                                	CD4-002 CD4
                                	CD8-001 CD8
                                	CD8-002 CD8
            top_filtered_peaks.bed: Three-column list of peaks, which is a standard bed format file.
                                    It is similar to the "peaks.bed" file in the CellRanger output of a 10X scATAC-seq dataset.
            filtered_reads.mtx: Fragment count matrix in mtx format, where each row is a peak and each column represents a cell.
                                It is similar to the "matrix.mtx" file in the CellRanger output of a 10X scATAC-seq dataset.
                                The order of cells should be the same as in "filtered_cells.csv", and the order of peaks should be
                                the same as in "top_filtered_peaks.bed".
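        
        The layout described above can be sketched with the Python 3 standard library alone. This is an illustrative toy example, not part of APEC: the function name `write_toy_project` and all cell names, peaks and counts are made up. The matrix is written in the Matrix Market coordinate format, with 1-based (peak, cell, count) entries.
        
        ```python
        import os
        import tempfile
        
        
        def write_toy_project(project):
            """Create the four project folders and three tiny input files."""
            for sub in ("matrix", "peak", "result", "figure"):
                os.makedirs(os.path.join(project, sub), exist_ok=True)
        
            # Made-up cells: (name, notes), one per matrix column.
            cells = [("CD4-001", "CD4"), ("CD4-002", "CD4"), ("CD8-001", "CD8")]
            # Made-up peaks: (chromosome, start, end), one per matrix row.
            peaks = [("chr1", 1000, 1500), ("chr1", 8000, 8500)]
            # Nonzero counts as (peak index, cell index, count), 0-based here.
            counts = [(0, 0, 2), (0, 2, 1), (1, 1, 4)]
        
            with open(os.path.join(project, "matrix", "filtered_cells.csv"), "w") as f:
                f.write("name\tnotes\n")
                for name, notes in cells:
                    f.write("%s\t%s\n" % (name, notes))
        
            with open(os.path.join(project, "peak", "top_filtered_peaks.bed"), "w") as f:
                for chrom, start, end in peaks:
                    f.write("%s\t%d\t%d\n" % (chrom, start, end))
        
            with open(os.path.join(project, "matrix", "filtered_reads.mtx"), "w") as f:
                # No % operator is applied to this string, so "%%" is written literally.
                f.write("%%MatrixMarket matrix coordinate integer general\n")
                # Size line: number of rows (peaks), columns (cells), nonzero entries.
                f.write("%d %d %d\n" % (len(peaks), len(cells), len(counts)))
                for i, j, value in counts:
                    # Matrix Market indices are 1-based.
                    f.write("%d %d %d\n" % (i + 1, j + 1, value))
        
        
        if __name__ == "__main__":
            # A temporary folder stands in for '$project' in this sketch.
            project = tempfile.mkdtemp()
            write_toy_project(project)
            print("Toy project written to " + project)
        ```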
        
        ### 3. Functions of APEC (step by step)
        
        #### 3.1 Clustering by APEC
        
        Use the following code to cluster cells with the APEC algorithm:
        
            clustering.build_accesson('$project', ngroup=600)
            clustering.cluster_by_APEC('$project', nc=0, norm='zscore')
        
        output files:
        
            $project/matrix/Accesson_peaks.csv
            $project/matrix/Accesson_reads.csv
            $project/result/louvain_cluster_by_APEC.csv
        
        Then users can plot the tSNE, UMAP or correlation heatmap for cells:
        
            plot.plot_tsne('$project')
            plot.plot_umap('$project')
            plot.correlation('$project')
        
        output files:
        
            $project/result/TSNE_by_APEC.csv
            $project/figure/TSNE_by_APEC_with_notes_label.pdf
            $project/result/UMAP_by_APEC.csv
            $project/figure/UMAP_by_APEC_with_notes_label.pdf
            $project/figure/cell_cell_correlation_by_APEC_with_louvain_clustering.png
        
        #### 3.2 Clustering by chromVAR
        
        Use the following code to cluster cells with the chromVAR algorithm:
        
            generate.motif_matrix('$project', genome_fa='$reference/hg19_chr.fa',
                                  background='$reference/tier1_markov1.norc.txt',
                                  meme='$reference/JASPAR2018_CORE_vertebrates_non-redundant_pfms_meme.txt')
            clustering.cluster_byMotif('$project')
        
        output files:
        
            $project/matrix/Accesson_peaks.csv
            $project/matrix/Accesson_reads.csv
            $project/result/louvain_cluster_by_APEC.csv
        
        #### 3.3 Evaluate ARI, NMI and AMI for clustering result
        
        If the true cell types are recorded in the 'notes' column of '$project/matrix/filtered_cells.csv', please use the following code to calculate ARI, NMI and AMI and estimate the accuracy of the clustering algorithm.
        
            clustering.cluster_comparison('$project/matrix/filtered_cells.csv',
                                          '$project/result/louvain_cluster_by_APEC.csv')
        
        The ARI, NMI and AMI values are printed directly to the screen.
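        
        For reference, the two simpler scores can be computed from a pair of label lists as sketched below. This is a stdlib-only Python 3 illustration, not the code that `clustering.cluster_comparison` runs: the function names are hypothetical, NMI is normalized here by the arithmetic mean of the two entropies (an assumption; other normalizations exist), and AMI is omitted because it additionally needs the expected mutual information under random labeling.
        
        ```python
        from collections import Counter
        from math import log
        
        
        def pairs(n):
            """Number of unordered pairs among n items."""
            return n * (n - 1) // 2
        
        
        def adjusted_rand_index(labels_true, labels_pred):
            """ARI computed from the contingency table of two labelings."""
            n = len(labels_true)
            if n < 2:
                return 1.0  # fewer than two cells: nothing to compare
            joint = Counter(zip(labels_true, labels_pred))
            sum_joint = sum(pairs(c) for c in joint.values())
            sum_true = sum(pairs(c) for c in Counter(labels_true).values())
            sum_pred = sum(pairs(c) for c in Counter(labels_pred).values())
            expected = sum_true * sum_pred / pairs(n)
            maximum = 0.5 * (sum_true + sum_pred)
            if maximum == expected:
                # Degenerate case, e.g. both labelings put all cells in one cluster.
                return 1.0
            return (sum_joint - expected) / (maximum - expected)
        
        
        def entropy(labels):
            """Shannon entropy (natural log) of a labeling."""
            n = len(labels)
            return -sum(c / n * log(c / n) for c in Counter(labels).values())
        
        
        def normalized_mutual_info(labels_true, labels_pred):
            """Mutual information divided by the mean of the two entropies."""
            n = len(labels_true)
            count_true = Counter(labels_true)
            count_pred = Counter(labels_pred)
            mutual_info = 0.0
            for (t, p), c in Counter(zip(labels_true, labels_pred)).items():
                mutual_info += c / n * log(c * n / (count_true[t] * count_pred[p]))
            mean_entropy = 0.5 * (entropy(labels_true) + entropy(labels_pred))
            return 1.0 if mean_entropy == 0 else mutual_info / mean_entropy
        
        
        if __name__ == "__main__":
            truth = ["CD4", "CD4", "CD8", "CD8"]
            clusters = ["1", "1", "0", "0"]  # same partition, different cluster names
            print(adjusted_rand_index(truth, clusters))
            print(normalized_mutual_info(truth, clusters))
        ```
        
        Both scores depend only on how the cells are grouped, so renaming the clusters does not change them.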
        
        #### 3.4 Generate pseudotime trajectory
        
            generate.monocle_trajectory('$project')
            plot.plot_trajectory('$project')
        
        output files:
        
            $project/result/monocle_trajectory.csv
            $project/result/monocle_reduced_dimension.csv
            $project/figure/pseudotime_trajectory_with_notes_label.pdf
        
        #### 3.5 Generate gene scores
        
            generate.gene_score('$project', genome='hg19')
        
        output files:
        
            $project/matrix/Accesson_annotated.csv
            $project/matrix/gene_annotated.csv
            $project/matrix/gene_score.csv
        
        #### 3.6 Generate differential feature for a cell cluster
        
            generate.differential_feature('$project', feature='motif', target='0')
            generate.differential_feature('$project', feature='gene', target='0')
        
        The differential motifs/genes of cell cluster '0' are printed directly to the screen. **Notes: Differential motif search requires running clustering.cluster_byMotif() beforehand (see 3.2), and differential gene search requires running generate.gene_score() beforehand (see 3.5).**
        
        #### 3.7 Plot motif/gene on tSNE/trajectory diagram
        
            plot.plot_feature('$project', space='tsne', feature='gene', name='FOXO1')
            plot.plot_feature('$project', space='trajectory', feature='motif', name='GATA1')
        
        output files:
        
            $project/figure/gene_FOXO1_on_tsne_by_APEC.pdf
            $project/figure/motif_GATA1_on_trajectory_by_APEC.pdf
        
        **Notes: Plotting a feature on the tSNE diagram requires running plot.plot_tsne() beforehand (see 3.1), and plotting a feature on the trajectory requires running generate.monocle_trajectory() beforehand (see 3.4).**
        
        #### 3.8 Generate potential super enhancers
        
            generate.search_supper_enhancer('$project', supper_range=1000000)
        
        output file:
        
            $project/result/potential_super_enhancer.csv
        
        
        ## Get fragment count matrix from raw data
        
        ### 1. Requirements and installation
        
        All of the following software needs to be on the system PATH of the Linux system so that it can be called from any path/folder. Picard is also required, but it is already included in the $APEC/reference folder, so users do not need to install it. We recommend the latest version of each tool, except Meme (version 4.11.2). If users have their own fragment count matrix and only want to run APEC from section 3 "Clustering", then Bowtie2 and Macs2 are not required.
        
            Bowtie2: https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/
            Samtools: https://github.com/samtools/samtools
            Bedtools: http://bedtools.readthedocs.io/en/latest/content/installation.html
            Macs2: https://github.com/taoliu/MACS.git
            Meme 4.11.2: http://meme-suite.org/doc/download.html?man_type=web
            bedGraphToBigWig: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
        
        #### 1.2 Installation
        
        Users complete the APEC installation simply by copying the APEC folder to any path on the computer (i.e. $APEC). There are two subfolders in APEC: a **code** folder (code_v1.1.0), which contains all APEC programs for data processing, and a **reference** folder, which contains all necessary index and reference files for the hg19 and mm10 genomes. Users **must** run the APEC programs directly in $APEC/code/, since each program invokes the reference files automatically. The **reference** folder is required for APEC and should be placed in the same path ($APEC) as the **code** folder. **The reference files are too large to be hosted on GitHub; users can download all of them from http://galaxy.ustc.edu.cn:30803/APEC/**. Once the **code** and **reference** folders have been downloaded, installation takes less than 1 minute: just move them to the same path (i.e. $APEC). The **reference** folder should contain the following files:
        
            hg19_refseq_genes_TSS.txt, hg19_RefSeq_genes.gtf, hg19_blacklist.JDB.bed,
            hg19_chr.fa, hg19_chr.fa.fai, hg19.chrom.sizes,
            hg19.1.bt2, hg19.2.bt2, hg19.3.bt2, hg19.4.bt2, hg19.rev.1.bt2, hg19.rev.2.bt2,
            mm10_refseq_genes_TSS.txt, mm10_RefSeq_genes.gtf, mm10_blacklist.BIN.bed,
            mm10_chr.fa, mm10_chr.fa.fai, mm10.chrom.sizes,
            mm10.1.bt2, mm10.2.bt2, mm10.3.bt2, mm10.4.bt2, mm10.rev.1.bt2, mm10.rev.2.bt2,
            JASPAR2018_CORE_vertebrates_non-redundant_pfms_meme.txt, tier1_markov1.norc.txt, picard.jar
        
        ### 2. Fragment count matrix
        
        #### 2.1 Arrangement of raw data
        
        Users need to build a project folder (i.e. $project) that contains a **data** folder, then copy all raw sequencing fastq files into the $project/**data**/ folder. All these paired-end fastq files should be named as:
        
            type1-001_1.fastq, type1-001_2.fastq, type1-002_1.fastq, type1-002_2.fastq, ……;
            type2-001_1.fastq, type2-001_2.fastq, type2-002_1.fastq, type2-002_2.fastq, ……;
            ……
        
        where "\_1" and "\_2" indicate the forward and reverse reads of paired-end sequencing. {type1, type2, ...} can be cell types or sample batches, such as {GM, K562, ...} or {batch1, batch2, ...}, or any other word that contains no underscore "\_" or dash "-".
        The **work**, **matrix**, **peak**, **result** and **figure** folders will be created automatically in the $project folder by the subsequent steps.
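        
        The naming rule can be checked before starting the pipeline with a small script like the one below. It is a sketch, not part of APEC: the function name `parse_fastq_name` is hypothetical, and the pattern assumes that the cell index after the dash consists of digits only, as in the examples above.
        
        ```python
        import re
        
        # <type>-<index>_<1|2>.fastq, where <type> contains no "_" or "-".
        FASTQ_PATTERN = re.compile(r"^([^_\-]+)-(\d+)_([12])\.fastq$")
        
        
        def parse_fastq_name(filename):
            """Return (type, index, read) for a valid name, or None otherwise."""
            match = FASTQ_PATTERN.match(filename)
            if match is None:
                return None
            sample_type, index, read = match.groups()
            return sample_type, index, int(read)
        
        
        if __name__ == "__main__":
            # The last name is invalid because its type contains an underscore.
            for name in ["GM-001_1.fastq", "K562-014_2.fastq", "batch_1-001_1.fastq"]:
                print(name, parse_fastq_name(name))
        ```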
        
        #### 2.2 Easy-run of matrix preparation
        
        Users can run the script ***APEC_prepare_steps.sh*** to go from raw data to the fragment count matrix. The script covers the steps "trimming", "mapping", "peak calling", "aligning read counts matrix", "quality control" and "estimating gene score". Running it on our example project (i.e. project01 with 672 cells) takes 10 to 20 hours on an 8-core computer with 32 GB of memory, with sequence mapping being the slowest step.
        
        Example:
        
            bash APEC_prepare_steps.sh -s $project -g hg19 -n 4 -l 3 -p 0.2 -f 2000
        
        Input parameters:
        
            -s: The project path, which should contain data folder before running APEC.
            -g: "hg19" or "mm10".
            -n: Number of CPU cores.
            -l: Threshold for the -log(Q-value) of peaks, used to filter peaks.
            -p: Threshold of the percentage of fragments in peaks, used to filter cells.
            -f: Threshold of the fragment number of each cell, used to filter cells.
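        
        The effect of the two cell filters (-p and -f) can be illustrated as follows. This is a sketch rather than the code inside APEC_prepare_steps.sh: the function name `filter_cells` and the example numbers are made up, and it assumes that a cell is kept when both values are greater than or equal to their thresholds.
        
        ```python
        def filter_cells(cell_stats, min_frac_in_peaks=0.2, min_fragments=2000):
            """Keep cells that pass both quality thresholds.
        
            cell_stats maps a cell name to (fragment number, fragments in peaks).
            """
            kept = []
            for name, (n_fragments, n_in_peaks) in sorted(cell_stats.items()):
                if n_fragments == 0:
                    continue  # no fragments at all, nothing to evaluate
                frac_in_peaks = n_in_peaks / float(n_fragments)
                # Assumed rule: both thresholds are inclusive lower bounds.
                if frac_in_peaks >= min_frac_in_peaks and n_fragments >= min_fragments:
                    kept.append(name)
            return kept
        
        
        if __name__ == "__main__":
            stats = {
                "CD4-001": (5000, 1500),  # 30% in peaks, enough fragments
                "CD4-002": (5000, 500),   # only 10% in peaks
                "CD8-001": (1200, 600),   # 50% in peaks, too few fragments
            }
            print(filter_cells(stats))
        ```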
        
        Output files:
        
        The script ***APEC_prepare_steps.sh*** will generate **work**, **peak**, **matrix**, and **figure** folders with many output files. Here, we only introduce files that are useful to users. For our example projects, all of these results can be reproduced on a general computer system.
        
        (1) In **work** folder:
        
        For each cell, the mapping step can generate a subfolder (with cell name) in the **work** folder. There are several useful files in each subfolder:
        
            cell_name.hist.pdf: A histogram of fragment length distribution of each cell.
            cell_name.RefSeqTSS.pdf: Insert enrichment around TSS regions of each cell.
        
        (2) In **peak** folder:
        
            mergeAll.hist.pdf: A histogram of fragment length distribution of all cells.
            mergeAll.RefSeqTSS.pdf: Insert enrichment around TSS regions of all cells.
            top_filtered_peaks.bed: Filtered top peaks, ranked by Q-value.
            genes_scored_by_peaks.csv: Gene scores evaluated by TSS peaks.
        
        (3) In **matrix** folder:
        
            reads.csv: Fragment count matrix.
            cell_info.merged.csv: Data quality report of each cell.
            filtered_cells.csv: Filtered cells information in csv format.
            filtered_reads.mtx: Filtered fragment count matrix in mtx format.
        
        (4) In **figure** folder:
        
            cell_quality.pdf: A scatter plot of the fragment number and the percentage of fragments in peaks.
        
Platform: all
Classifier: Development Status :: 4 - Beta
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Description-Content-Type: text/markdown
