Metadata-Version: 2.1
Name: DAStk
Version: 0.1.6
Summary: Differential ATAC-seq toolkit
Home-page: https://biof-git.colorado.edu/dowelllab/DAStk
Author: Ignacio Tripodi
Author-email: ignacio.tripodi@colorado.edu
License: BSD
Project-URL: Bug Reports, https://biof-git.colorado.edu/dowelllab/DAStk/issues
Project-URL: Source, https://biof-git.colorado.edu/dowelllab/DAStk
Project-URL: Human Motif Sites, http://dowell.colorado.edu/pubs/DAStk/human_motifs.tar.gz
Project-URL: Mouse Motif Sites, http://dowell.colorado.edu/pubs/DAStk/mouse_motifs.tar.gz
Keywords: bioinformatics genomics chromatin ATAC-seq motif transcription_factor
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 3
Requires-Python: >=2.7
Requires-Dist: argparse
Requires-Dist: datetime
Requires-Dist: numpy
Requires-Dist: matplotlib
Requires-Dist: adjustText
Requires-Dist: scipy
Requires-Dist: pandas

# DAStk

The Differential [ATAC-seq](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374986/) Toolkit (DAStk) is a set of scripts to aid analyzing differential ATAC-Seq data. By leveraging changes in accessible chromatin, we can detect significant changes in transcription factor (TF) activity. This is a simple but powerful tool for cellular perturbation analysis.

You will need the following inputs:

- A pair of files listing peaks of ATAC-seq signal in two biological conditions (e.g. DMSO and drug-treated) in any BedGraph-compatible format (tab-delimited)
- A set of files listing the putative binding sites across the reference genome of choice, one file per transcription factor motif, also in any BedGraph-like format. These are normally generated from position weight matrices (PWMs) available at TF model databases like [HOCOMOCO](http://hocomoco11.autosome.ru). These files are expected to have a `.bed`, `.BedGraph` or `.txt` extension.

**IMPORTANT: All files mentioned above (ATAC-seq peaks and computed motif sites) MUST be sorted by the same criteria. Different bioinformatics tools use different lexical sorting criteria (e.g. chr10 after chr1, or chr2 after chr1) so please ensure the sorting criteria is uniform.**

### Install

You can install DAStk using `pip`:

    $ pip install DAStk

This is the simplest option, and it will also create the executable commands `process_atac` and `differential_md_score`. Alternatively, you can clone this repository by running:

    $ git clone https://biof-git.colorado.edu/dowelllab/DAStk

### Required Python libraries (can be installed thru `pip`):

* numpy
* argparse
* matplotlib
* scipy
* adjustText
* pandas
* multiprocessing

These scripts feature comprehensive help when called with the `--help` argument. Every argument provides a short and long form (i.e. `-t` or `--threads`), and can either be provided as input arguments or via a configuration file, depending on your preference. The are normally two steps in a typical workflow:

1. Process the ATAC-seq peak files to calculate the [MD-score statistic](https://genome.cshlp.org/content/28/3/334.short) for each motif provided.
2. Detect the most statistically significant changes in MD-score between both biological conditions, and generate MA and barcode plots.

### TL;DR;

If you satisfy all the Python library requirements, you can simply clone this repository and run `tf_activity_changes` with the following syntax:

    $ ./tf_activity_changes PREFIX CONDITION_1_NAME CONDITION_2_NAME CONDITION_1_ATAC_PEAKS_FILE \
      CONDITION_2_ATAC_PEAKS_FILE PATH_TO_MOTIF_FILES [NUMBER_OF_THREADS]

... then use `differential_md_score` (instructions below, or via `--help`) to explore which TFs are changing the most in activity for different p-value cutoffs.

### Usage examples

Unpack the motif files (see below for how to create your own, instead):

    $ mkdir /path/to/hg19_motifs
    $ tar xvfz motifs/human_motifs.tar.gz --directory /path/to/hg19_motifs

Calculate the MD-scores for the first biological condition:

    $ process_atac --prefix 'mcf7_DMSO' --threads 8 --atac-peaks /path/to/DMSO/ATAC/peaks/file \
      --motif-path /path/to/directory/containing/motif/files

The above command generates a file called `mcf7_DMSO_md_scores.txt`. The required prefix is a good way to keep track of what these new files represent. It's expected to be some sort of assay identifier and the biological condition, separated by a `_`; it's generally a good idea to use the cell type (or sample number) and a brief condition description (e.g. `k562_DMSO` or `SRR1234123_Metronidazole`). Alternatively, this could have been executed as:

    $ process_atac --prefix 'mcf7_DMSO' --threads 8 --config /path/to/config/file.py

... where the contents of this configuration file `file.py` would look like:

    atac_peaks_filename = '/path/to/DMSO/ATAC/peaks/file'
    tf_motif_path = '/path/to/directory/containing/motif/files'

We would then generate the same file, for the other condition we are comparing against:

    $ process_atac --prefix 'mcf7_Treatment' --threads 8 --atac-peaks /path/to/treatment/ATAC/peaks/file \
      --motif-path /path/to/directory/containing/motif/files

The above generates a file called `mcf7_Treatment_md_scores.txt`. Finally:

    $ differential_md_score --prefix mcf7 --assay-1 DMSO --assay-2 Treatment --p-value 0.0000001 -b

The above generates an MA plot that labels the most significant TF activity changes, at a p-value cutoff of 1e-7. Note that the condition names (DMSO and Treatment) were the same ones used earlier as the second half of the prefix. The plots look like the example below:

![Sample MA plot](./doc_files/sample_MA_plot.png)

The `-b` flag also generates a "barcode plot" of each of these statistically significat motif differences that depicts how close the ATAC-seq peak centers were to the motif centers, within a 1500 base-pair radius of the motif center:

![Sample barcode plot](./doc_files/sample_barcode_plot.png)

This entire process can be executed in this order by calling `tf_activity_changes`. If you can take advantage of multiprocessing, you can calculate MD-scores for both conditions simultaneously, assigning several threads to each, then generate the plots once both `*_md_scores.txt` files are ready.

### Motif Files

Feel free to use the motif files provided, [human_motifs.tar.gz](http://dowell.colorado.edu/pubs/DAStk/human_motifs.tar.gz) and [mouse_motifs.tar.gz](http://dowell.colorado.edu/pubs/DAStk/mouse_motifs.tar.gz) for the `hg19` and `mm10` reference genomes, respectively. They have been generated from HOCOMOCO's v10 mononucleotide model. To generate your own files for each motif, you can use FIMO in combination with the downloaded `.meme` files from your TF database of choice. For example, if using HOCOMOCO, you can create the motif file for TP53 using their mononucleotide model with a p-value threshold of 0.000001 by:

    $ fimo -max-stored-scores 10000000 --thresh 1e-6 -oc /path/to/output/directory -motif /path/to/motif/file \
      /path/to/HOCOMOCOv11_HUMAN_mono_meme_format.meme /path/to/whole_genome.fa


-----

### Citation

Please cite DAStk if you have used it in your research!  
*Tripodi, I.J.; Allen, M.A.; Dowell, R.D.	Detecting Differential Transcription Factor Activity from ATAC-Seq Data. Molecules 2018, 23, 1136.*  


For any questions or bug reports, please use the Issue Tracker or email us:  


*Ignacio Tripodi (ignacio.tripodi at colorado.edu)*  
*Computer Science Department, BioFrontiers Institute*  
*University of Colorado, Boulder, USA*


