PubChemPy
=========

A simple Python wrapper around the `PubChem PUG REST
API <http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html>`__.

Installation
------------

**Option 1**: Use `pip <http://www.pip-installer.org/en/latest/>`__.

::

    pip install PubChemPy

**Option 2**: `Download the latest
release <https://pypi.python.org/packages/source/P/PubChemPy/PubChemPy-1.0.tar.gz>`__
and install yourself:

::

    tar xzvf PubChemPy-1.0.tar.gz
    cd PubChemPy-1.0
    sudo python setup.py install

**Option 3**: `Download
pubchempy.py <https://github.com/mcs07/PubChemPy/raw/master/pubchempy.py>`__
and manually place it in your project directory or anywhere on your
PYTHONPATH.

**Option 4**: Get the latest development version by cloning the Git
repository.

::

    git clone https://github.com/mcs07/PubChemPy.git

Basic usage
-----------

PubChemPy provides a variety of functions and classes that allow you to
retrieve information from PubChem.

::

    from pubchempy import *

    c = Compound.from_cid(1423)
    cs = get_compounds('Aspirin', 'name')

Substances and compounds
------------------------

The ``get_substances`` and ``get_compounds`` functions allow retrieval
of PubChem Substance and Compound records. The functions take a wide
variety of inputs, and return a list of results, even if only a single
match was found.

For a specific CID or SID:

::

    get_compounds(1234)
    get_substances(4321)

A second ``namespace`` argument allows you to use different types of
input:

::

    get_compounds('Aspirin', 'name')
    get_compounds('C1=CC2=C(C3=C(C=CC=N3)C=C2)N=C1', 'smiles')

Beware that line notation inputs like SMILES and InChI can return
automatically generated records that aren't actually present in PubChem,
and therefore have no CID or SID and are missing many properties.

By default, compounds are returned with 2D coordinates. Use the
``record_type`` keyword argument to specify otherwise:

::

    get_compounds('Aspirin', 'name', record_type='3d')

Advanced search types
~~~~~~~~~~~~~~~~~~~~~

By default, requests look for an exact match with the input.
Alternatively, you can specify substructure, superstructure, similarity
and identity searches using the ``searchtype`` keyword argument:

::

    get_compounds('CC', searchtype='superstructure', listkey_count=3)

The ``listkey_count`` and ``listkey_start`` arguments can be used for
pagination. Each ``searchtype`` has its own options that can be
specified as keyword arguments. For example, similarity searches have a
``Threshold``, and super/substructure searches have ``MatchIsotopes``. A
full list of options is available at the `PUG REST
specification <http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html>`__.

Note: These types of search are *slow*.

The Compound class
------------------

The ``get_compounds`` function returns a list of ``Compound`` objects.
You can also instantiate a ``Compound`` object from a CID:

::

    c = Compound.from_cid(6819)

Each ``Compound`` has a ``record`` property, which is a dictionary that
contains the all the information about the compound. All other
properties are derived from this record.

Compounds with regular 2D coordinates have the following properties:
cid, record, atoms, bonds, elements, synonyms, sids, aids,
coordinate\_type, charge, molecular\_formula, molecular\_weight,
canonical\_smiles, isomeric\_smiles, inchi, inchikey, iupac\_name,
xlogp, exact\_mass, monoisotopic\_mass, tpsa, complexity,
h\_bond\_donor\_count, h\_bond\_acceptor\_count, rotatable\_bond\_count,
fingerprint, heavy\_atom\_count, isotope\_atom\_count,
atom\_stereo\_count, defined\_atom\_stereo\_count,
undefined\_atom\_stereo\_count, bond\_stereo\_count,
defined\_bond\_stereo\_count, undefined\_bond\_stereo\_count,
covalent\_unit\_count.

Many of the above properties are missing from 3D records, however they
do have the following additional properties: volume\_3d, multipoles\_3d,
conformer\_rmsd\_3d, effective\_rotor\_count\_3d,
pharmacophore\_features\_3d, mmff94\_partial\_charges\_3d,
mmff94\_energy\_3d, conformer\_id\_3d, shape\_selfoverlap\_3d,
feature\_selfoverlap\_3d, shape\_fingerprint\_3d.

Properties
----------

The ``get_properties`` function allows the retrieval of specific
properties without having to deal with entire compound records. This is
especially useful for retrieving the properties of a large number of
compounds at once.

::

    p = get_properties('IsomericSMILES', 'CC', 'smiles', searchtype='superstructure')

Multiple properties may be specified in a list, or in a comma-separated
string. The available properties are: MolecularFormula,MolecularWeight,
CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP,
ExactMass, MonoisotopicMass, TPSA, Complexity, Charge, HBondDonorCount,
HBondAcceptorCount, RotatableBondCount, HeavyAtomCount,
IsotopeAtomCount, AtomStereoCount, DefinedAtomStereoCount,
UndefinedAtomStereoCount, BondStereoCount, DefinedBondStereoCount,
UndefinedBondStereoCount, CovalentUnitCount, Volume3D,
XStericQuadrupole3D, YStericQuadrupole3D, ZStericQuadrupole3D,
FeatureCount3D, FeatureAcceptorCount3D, FeatureDonorCount3D,
FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D,
FeatureHydrophobeCount3D, ConformerModelRMSD3D, EffectiveRotorCount3D,
ConformerCount3D.

Synonyms
--------

Get a list of synonyms for a given input using the ``get_synonyms``
function:

::

    get_synonyms('Aspirin', 'name')
    get_synonyms('Aspirin', 'name', 'substance')

Inputs that match more than one SID/CID will have multiple, separate
synonyms lists returned.

Identifier lists
----------------

There are three functions for getting a list of identifiers for a given
input:

-  get\_cids
-  get\_sids
-  get\_aids

For example, passing a CID to get\_sids will return a list of SIDs
corresponding to the Substance records that were standardised and merged
to produce the given Compound.

Download
--------

The download function is for saving a file to disk. The following
formats are available: XML, ASNT/B, JSON, SDF, CSV, PNG, TXT. Beware
that not all formats are available for all types of information. SDF and
PNG are only available for full Compound and Substance records, and CSV
is best suited to tables of properties and identifiers.

Examples:

::

    download('PNG', 'asp.png', 'Aspirin', 'name')
    download('CSV', 's.csv', [1,2,3], operation='property/CanonicalSMILES,IsomericSMILES')

For PNG images, the ``image_size`` argument can be used to specfiy
``large``, ``small`` or ``<width>x<height>``.

The Substance class
-------------------

This class has the following properties: sid, synonyms, source\_name,
source\_id, cids, aids, deposited\_compound and standardized\_compound.

The deposited\_compound is a Compound object that corresponds to the
deposited Substance record. The standardized\_compound is the
corresponding record in the Compound database.

Assays
------

TODO

Custom requests
---------------

If you wish to perform more complicated requests, you can use the
``request`` function. This is an extremely simple wrapper around the
REST API that allows you to construct any sort of request from a few
parameters. The `PUG REST
specification <http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html>`__
has all the information you will need to formulate your requests.

The ``request`` function simply returns the exact response from the
PubChem server as a string. This can be parsed in different ways
depending on the output format you choose. See the Python
`json <http://docs.python.org/2/library/json.html>`__,
`xml <http://docs.python.org/2/library/xml.etree.elementtree.html>`__
and `csv <http://docs.python.org/2/library/csv.html>`__ packages for
more information. Additionally, cheminformatics toolkits such as `Open
Babel <http://openbabel.org/docs/current/UseTheLibrary/Python.html>`__
and `RDKit <http://www.rdkit.org>`__ offer tools for handling SDF files
in Python.

The ``get`` function is very similar to the ``request`` function, except
it handles ``listkey`` type responses automatically for you. This makes
things simpler, however it means you can't take advantage of using the
same ``listkey`` repeatedly to obtain different types of information.
See the `PUG REST
specification <http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html>`__
for more information on how ``listkey`` responses work.

Summary of possible inputs
~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    <identifier> = list of cid, sid, aid, source, inchikey, listkey; string of name, smiles, xref, inchi, sdf;
    <domain> = substance | compound | assay

    compound domain
    <namespace> = cid | name | smiles | inchi | sdf | inchikey | <structure search> | <xref> | listkey | formula
    <operation> = record | property/[comma-separated list of property tags] | synonyms | sids | cids | aids | assaysummary | classification

    substance domain
    <namespace> = sid | sourceid/<source name> | sourceall/<source name> | name | <xref> | listkey
    <operation> = record | synonyms | sids | cids | aids | assaysummary | classification

    assay domain
    <namespace> = aid | listkey | type/<assay type> | sourceall/<source name>
    <assay type> = all | confirmatory | doseresponse | onhold | panel | rnai | screening | summary
    <operation> = record | aids | sids | cids | description | targets/{ProteinGI, ProteinName, GeneID, GeneSymbol} | doseresponse/sid

    <structure search> = {substructure | superstructure | similarity | identity}/{smiles | inchi | sdf | cid}
    <xref> = xref/{RegistryID | RN | PubMedID | MMDBID | ProteinGI | NucleotideGI | TaxonomyID | MIMID | GeneID | ProbeID | PatentID}
    <output> = XML | ASNT | ASNB | JSON | JSONP [ ?callback=<callback name> ] | SDF | CSV | PNG | TXT

Avoiding TimeoutError
---------------------

If there are too many results for a request, you will receive a
TimeoutError. There are different ways to avoid this, depending on what
type of request you are doing.

If retrieving full compound or substance records, instead request a list
cids or sids for your input, and then request the full records for those
identifiers individually or in small groups. For example:

::

    sids = get_sids('Aspirin', 'name')
    for sid in sids:
        s = Substance.from_sid(sid)

When using the ``formula`` namespace or a ``searchtype``, you can also
alternatively use the ``listkey_count`` and ``listkey_start`` keyword
arguments to specify pagination. For example:

::

    get_compounds('CC', 'smiles', searchtype='substructure', listkey_count=5)
    get('C10H21N', 'formula', listkey_count=3, listkey_start=6)

