Metadata-Version: 2.0
Name: CMash
Version: 0.2.4
Summary: Fast and accurate set similarity estimation via containment min hash (for genomic datasets).
Home-page: https://github.com/dkoslicki/CMash
Author: David Koslicki
Author-email: dmkoslicki@gmail.com
License: UNKNOWN
Keywords: jaccard min hash containment genomics metagenomics
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Programming Language :: Python :: 2.7
Classifier: License :: OSI Approved :: BSD License
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.5
Classifier: Natural Language :: English
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Requires-Dist: khmer (>=2.1.1)
Requires-Dist: screed
Requires-Dist: h5py
Requires-Dist: numpy
Requires-Dist: blist
Requires-Dist: argparse
Requires-Dist: pandas
Requires-Dist: setuptools (>=24.2.0)
Requires-Dist: six
Requires-Dist: scipy

# CMash
CMash is a fast and accurate way to estimate the similarity of two sets. It is a probabilistic data analysis approach that uses containment min hashing. Please see the [associated paper](http://www.biorxiv.org/content/early/2017/09/04/184150) for further details (and please cite it if you use CMash):
>Improving Min Hash via the Containment Index with applications to Metagenomic Analysis
>David Koslicki, Hooman Zabeti
>bioRxiv 184150; doi: https://doi.org/10.1101/184150

## Installation
The easiest way to install this is to use [virtualenv](https://virtualenv.pypa.io/en/stable/):
```bash
virtualenv CMashVE
source CMashVE/bin/activate
pip install -U pip
pip install CMash
```
You can also just use ``pip install CMash`` if you don't want to create a virtual environment.

To get the latest development version of CMash, you can build from the GitHub repository via:
```bash
virtualenv CMashVE
source CMashVE/bin/activate
pip install -U pip
git clone https://github.com/dkoslicki/CMash.git
cd CMash
pip install -r requirements.txt
```

Note that the Python code in this repository is compatible with both Python 2 and Python 3. The dependency ``khmer`` technically requires Python 3 (although ``khmer`` version ``2.1.1`` runs just fine in Python 2).
The external dependency ``pytst`` requires Python 2, so I'm making this a Python 2 repository.

## Usage
The basic paradigm is to create a reference/training database, form a sample bloom filter, and then query the database.
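
To give a feel for what these three steps compute, here is a minimal sketch of a bottom-k flavour of containment min hashing. It is an illustration only, not CMash's actual implementation: the helper names (``kmers``, ``stable_hash``, ``bottom_sketch``, ``estimate_containment``) are hypothetical, a plain Python ``set`` stands in for the sample bloom filter, and the code assumes Python 3.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def stable_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hypothetical choice of hash)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_sketch(kmer_set, num_hashes):
    """Keep the num_hashes k-mers with the smallest hash values."""
    return sorted(kmer_set, key=stable_hash)[:num_hashes]


def estimate_containment(reference_seq, sample_seq, k=4, num_hashes=8):
    """Estimate the fraction of reference k-mers present in the sample."""
    # Step 1: the "training database" is a small sketch of the reference.
    sketch = bottom_sketch(kmers(reference_seq, k), num_hashes)
    if not sketch:
        raise ValueError("reference is shorter than k")
    # Step 2: a plain set stands in for the sample bloom filter here.
    sample_kmers = kmers(sample_seq, k)
    # Step 3: query, i.e. count how many sketched k-mers occur in the sample.
    hits = sum(1 for kmer in sketch if kmer in sample_kmers)
    return hits / len(sketch)
```

Only the small reference sketch is tested against the (potentially very large) sample, which is why the sample gets a membership structure rather than a sketch of its own.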

#### Forming a reference/training database
Say you have three reference fasta/q files: ``ref1.fa``, ``ref2.fa`` and ``ref3.fa``. In a file (here called ``FileNames.txt``), place the absolute paths to the fasta/q files, one per line:
```bash
cat FileNames.txt
# /abs/path/to/ref1.fa
# /abs/path/to/ref2.fa
# /abs/path/to/ref3.fa
```
Then you can create the training database via:
```bash
MakeDNADatabase.py FileNames.txt TrainingDatabase.h5
```
See ``MakeDNADatabase.py -h`` for more options when forming a database.
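
If you have many reference files, the ``FileNames.txt`` list described above can be generated rather than typed by hand. The following is a small sketch using only the standard library; ``write_file_list`` is a hypothetical helper (it is not part of CMash), and it assumes the list format is exactly one absolute path per line.

```python
import os


def write_file_list(fasta_paths, list_path="FileNames.txt"):
    """Write one absolute path per line and return the paths written."""
    # Resolve every path relative to the current working directory.
    absolute_paths = [os.path.abspath(path) for path in fasta_paths]
    with open(list_path, "w") as handle:
        for path in absolute_paths:
            handle.write(path + "\n")
    return absolute_paths
```

For example, calling ``write_file_list(["ref1.fa", "ref2.fa", "ref3.fa"])`` from the directory holding the references would write a ``FileNames.txt`` there.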

#### Creating a sample bloom filter
Given a (large) query fasta/q file ``Metagenome.fa``, you can *optionally* create a bloom filter via ``MakeNodeGraph.py Metagenome.fa .``. 
See ``MakeNodeGraph.py -h`` for more details about this function.

This step is not strictly necessary (as the next step automatically forms a nodegraph/bloom filter if you didn't already create one). 
However, I've provided this script in case you want to pre-process a bunch of metagenomes.

#### Query the database
To get containment and Jaccard index estimates of the reference files in your query file ``Metagenome.fa``, use something like the following: ``QueryDNADatabase.py Metagenome.fa TrainingDatabase.h5 Output.csv``.

There are a bunch of options available: see ``QueryDNADatabase.py -h``. The output file is a CSV file with rows corresponding (in this case) to ``ref1.fa``, ``ref2.fa``, and ``ref3.fa`` and columns corresponding to the containment index estimate, intersection cardinality, and Jaccard index estimate.
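
As a reminder of what those three quantities mean, the sketch below computes them exactly on two toy sequences. This is an assumption-laden illustration, not how CMash works internally (CMash estimates these values rather than enumerating every k-mer): ``kmer_set`` and ``exact_indices`` are hypothetical names, the containment is taken to be the fraction of reference k-mers found in the sample, and Python 3 division is assumed.

```python
def kmer_set(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def exact_indices(reference_seq, sample_seq, k):
    """Return (containment index, intersection cardinality, Jaccard index)."""
    reference = kmer_set(reference_seq, k)
    sample = kmer_set(sample_seq, k)
    shared = len(reference & sample)
    # Containment: fraction of the reference's k-mers present in the sample.
    containment = shared / len(reference)
    # Jaccard: shared k-mers relative to all distinct k-mers in either set.
    jaccard = shared / len(reference | sample)
    return containment, shared, jaccard


containment, shared, jaccard = exact_indices("ACGTAC", "ACGTTT", k=3)
```

Because the Jaccard index divides by the size of the union, it shrinks as the sample grows, whereas the containment index depends only on how much of the reference is covered.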

#### Other functionality
The module ``MinHash`` (imported in python via ``from CMash import MinHash as MH``) has a bunch more functionality, including (but not limited to!):
1. Fast updates to the training databases (see ``help(MH.delete_from_database)``, ``help(MH.insert_to_database)``, ``help(MH.union_databases)``).
2. Ability to form a matrix of Jaccard indexes (for comparison of all pairwise Jaccard indexes of organisms in the training database). This is useful for identifying redundancies/patterns/structure in your training database: ``help(MH.form_jaccard_count_matrix)`` and ``help(MH.form_jaccard_matrix)``.
3. Access to the k-mers that MinHash randomly selected (see the class ``CountEstimator`` and the associated ``_kmers`` data structure).

I'd encourage you to poke through the source code of ``MinHash.py`` and take a look at the scripts as well.

Protein databases (and for that matter, arbitrary K-length strings) coming soon...


