3.1.5. Calculating path similarity — MDAnalysis.analysis.psa¶
| Author: | Sean Seyler |
|---|---|
| Year: | 2015 |
| Copyright: | GNU Public License v3 |
New in version 0.10.0.
The module contains code to calculate the geometric similarity of trajectories using path metrics such as the Hausdorff or Fréchet distances [Seyler2015]. The path metrics are functions of two paths and return a nonnegative number, i.e., a distance. Two paths are identical if their distance is zero, and large distances indicate dissimilarity. Each path metric is a function of the individual points (e.g., coordinate snapshots) that comprise each path and, loosely speaking, identify the two points, one per path of a pair of paths, where the paths deviate the most. The distance between these points of maximal deviation is measured by the root mean square deviation (RMSD), i.e., to compute structural similarity.
One typically computes the pairwise similarity for an ensemble of paths to produce a symmetric distance matrix, which can be clustered to, at a glance, identify patterns in the trajectory data. To properly analyze a path ensemble, one must select a suitable reference structure to which all paths (each conformer in each path) will be universally aligned using the rotations determined by the best-fit rmsds. Distances between paths and their structures are then computed directly with no further alignment. This pre-processing step is necessary to preserve the metric properties of the Hausdorff and Fréchet metrics; using the best-fit rmsd on a pairwise basis does not generally preserve the triangle inequality.
See also
The PSAnalysisTutorial outlines a typical application of PSA to a set of trajectories, including doing proper alignment, performing distance comparisons, and generating heat map-dendrogram plots from hierarchical clustering.
References
| [Seyler2015] | Sean L. Seyler, Avishek Kumar, Michael F. Thorpe, Oliver Beckstein. Path Similarity Analysis: a Method for Quantifying Macromolecular Pathways. arXiv:1505.04807 (2015). |
3.1.5.1. Helper functions and variables¶
The following global variables are used by the functions and classes in this module.
-
MDAnalysis.analysis.psa.hausdorff_names¶
-
MDAnalysis.analysis.psa.frechet_names¶
The following functions are used by the other functions in this module.
-
MDAnalysis.analysis.psa.sqnorm(v, axis=None)[source]¶ Compute the sum of squares of elements along specified axes.
Arguments: - v
numpy.ndarrayof coordinates- axis
None or int or tuple of ints, optional Axis or axes along which a sum is performed. The default (axis = None) is perform a sum over all the dimensions of the input array. axis may be negative, in which case it counts from the last to the first axis.
Returns: float, the sum of the squares of the elements of v along axis
3.1.5.2. Classes, methods, and functions¶
-
MDAnalysis.analysis.psa.get_path_metric_func(name)[source]¶ Selects a path metric function by name.
Arguments: - name
string, name of path metric
Returns: The path metric function specified by name (if found).
-
MDAnalysis.analysis.psa.hausdorff(P, Q, N=None)[source]¶ Calculate the Hausdorff distance between two paths.
P (Q) is a
numpy.ndarrayof \(N_p\) (\(N_q\)) time steps, \(N\) atoms, and \(3N\) coordinates (e.g.,MDAnalysis.core.AtomGroup.AtomGroup.coordinates()). P (Q) has either shape \(N_p\times N\times 3\) (\(N_q\times N\times 3\)), or \(N_p\times(3N)\) (\(N_q\times(3N)\)) in flattened form.Arguments: - P
numpy.ndarrayrepresenting a path- Q
numpy.ndarrayrepresenting a path- N
int, number of atoms; if
Nonethen P and Q are each assumed to be a 3Dnumpy.ndarray; else, they are 2D (flattened)
Returns: float, the Hausdorff distance between paths P and Q
- Example::
>>> u = Universe(PSF,DCD) >>> mid = len(u.trajectory)/2 >>> ca = u.selectAtoms('name CA') >>> P = numpy.array([ ... ca.positions for _ in u.trajectory[:mid:] ... ]) # first half of trajectory >>> Q = numpy.array([ ... ca.positions for _ in u.trajectory[mid::] ... ]) # second half of trajectory >>> hausdorff(P,Q) 4.7786639840135905 >>> hausdorff(P,Q[::-1]) # hausdorff distance w/ reversed 2nd trajectory 4.7786639840135905
-
MDAnalysis.analysis.psa.discrete_frechet(P, Q, N=None)[source]¶ Calculate the discrete Frechet distance between two paths.
P (Q) is a
numpy.ndarrayof \(N_p\) (\(N_q\)) time steps, \(N\) atoms, and \(3N\) coordinates (e.g.,MDAnalysis.core.AtomGroup.AtomGroup.coordinates()). P (Q) has either shape \(N_p\times N\times 3\) (\(N_q\times N\times 3\)), or :\(N_p\times(3N)\) (\(N_q\times(3N)\)) in flattened form.Arguments: - P
numpy.ndarrayrepresenting a path- Q
numpy.ndarrayrepresenting a path- N
int, number of atoms; if
Nonethen P and Q are each assumed to be a 3Dnumpy.ndarray; else, they are 2D (flattened)
Returns: float, the discrete Frechet distance between paths P and Q
- Example::
>>> u = Universe(PSF,DCD) >>> mid = len(u.trajectory)/2 >>> ca = u.selectAtoms('name CA') >>> P = numpy.array([ ... ca.positions for _ in u.trajectory[:mid:] ... ]) # first half of trajectory >>> Q = numpy.array([ ... ca.positions for _ in u.trajectory[mid::] ... ]) # second half of trajectory >>> discrete_frechet(P,Q) 4.7786639840135905 >>> discrete_frechet(P,Q[::-1]) # frechet distance w/ 2nd trj reversed 2nd 6.8429011177113832
-
class
MDAnalysis.analysis.psa.Path(universe, reference, ref_select='name CA', path_select='all', ref_frame=0)[source]¶ Pre-process a
MDAnalysis.Universeobject: (1) fit the trajectory to a reference structure, (2) convert fitted time series to anumpy.ndarrayrepresentation ofPath.path.The analysis is performed with
PSA.run()and stores the result in thenumpy.ndarraydistance matrixPSA.D.PSA.run()also generates a fitted trajectory and path from alignment of the original trajectories to a reference structure.New in version 0.9.1.
Setting up trajectory alignment and fitted path generation.
Arguments: - universe
MDAnalysis.Universeobject containing a trajectory- reference
reference structure;
MDAnalysis.Universeobject; ifNonethen traj is used (uses the current time step of the object) [None]- ref_select
The selection to operate on for rms fitting; can be one of:
- any valid selection string for
selectAtoms()that produces identical selections in mobile and reference; or - a dictionary
{'mobile':sel1, 'reference':sel2}(theMDAnalysis.analysis.align.fasta2select()function returns such a dictionary based on a ClustalW_ or STAMP_ sequence alignment); or - a tuple
(sel1, sel2)
When using 2. or 3. with sel1 and sel2 then these selections can also each be a list of selection strings (to generate an AtomGroup with defined atom order as described under Ordered selections).
- any valid selection string for
- path_select
atom selection composing coordinates of (fitted) path; if
Nonethen path_select is set to ref_select [None]
-
u_original¶ MDAnalysis.Universeobject with a trajectory
-
u_reference¶ MDAnalysis.Universeobject containing a reference structure
-
ref_select¶ string, selection for
selectAtoms()to select frame fromPath.u_reference
-
path_select¶ string, selection for
selectAtoms()to select atoms to composePath.path
-
ref_frame¶ int, frame index to select frame from
Path.u_reference
-
u_fitted¶ MDAnalysis.Universeobject with the fitted trajectory
-
path¶ numpy.ndarrayobject representation of the fitted trajectory
-
fit_to_reference(filename=None, prefix='', postfix='_fit', rmsdfile=None, targetdir='.', mass_weighted=False, tol_mass=0.1)[source]¶ Align each trajectory frame to the reference structure with
MDAnalysis.analysis.align.rms_fit_trj().Arguments: - filename
file name for the RMS-fitted trajectory or pdb; defaults to the original trajectory filename (from
Path.u_original) with prefix prepended- prefix
prefix for auto-generating the new output filename
- rmsdfile
file name for writing the RMSD time series [
None]- mass_weighted
do a mass-weighted RMSD fit
- tol_mass
Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]
Returns: MDAnalysis.Universeobject containing a fitted trajectory
-
run(align=False, filename=None, postfix='_fit', rmsdfile=None, targetdir='.', mass_weighted=False, tol_mass=0.1, flat=False)[source]¶ Generate a path from a trajectory and reference structure, aligning to a reference structure if specified.
This is a convenience method to generate a fitted trajectory from an inputted universe (
Path.u_original) and reference structure (Path.u_reference).Path.fit_to_reference()andPath.to_path()are used consecutively to generate a new universe (Path.u_fitted) containing the fitted trajectory along with the correspondingPath.pathrepresented as annumpy.ndarray. The method returns a tuple of the topology name and new trajectory name, which can be fed directly into anMDAnalysis.Universeobject after unpacking the tuple using the*operator, as inMDAnalysis.Universe(*(top_name, newtraj_name)).Arguments: - align
Align trajectory to atom selection
Path.ref_selectofPath.u_reference. IfTrue, a universe containing an aligned trajectory is produced withPath.fit_to_reference()[False]- filename
filename for the RMS-fitted trajectory or pdb; defaults to the original trajectory filename (from
Path.u_original) with prefix prepended- prefix
prefix for auto-generating the new output filename
- rmsdfile
file name for writing the RMSD time series [
None]- mass_weighted
do a mass-weighted RMSD fit
- tol_mass
Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]
- flat
represent
Path.pathwith a 2D (\(N_p\times 3N\))numpy.ndarray; ifFalsethenPath.pathis a 3D (\(N_p\times N\times 3\))numpy.ndarray[False]
Returns: A tuple of the topology name and new trajectory name.
-
to_path(fitted=False, select=None, flat=False)[source]¶ Generates a coordinate time series from the fitted universe trajectory.
Given a selection of N atoms from select, the atomic positions for each frame in the fitted universe (
Path.u_fitted) trajectory (with \(N_p\) total frames) are appended sequentially to form a 3D or 2D (if flat isTrue)numpy.ndarrayrepresentation of the fitted trajectory (with dimensions \(N_p\times N\times 3\) or \(N_p\times 3N\), respectively).Arguments: - fitted
construct a
Path.pathfrom thePath.u_fittedtrajectory; ifFalsethenPath.pathis generated with the trajectory fromPath.u_original[False]- select
the selection for constructing the coordinates of each frame in
Path.path; ifNonethenPath.path_selectis used, else it is overridden by select [None]- flat
represent
Path.pathas a 2D (\(N_p\times 3N\))numpy.ndarray; ifFalsethenPath.pathis a 3D (\(N_p\times N\times 3\))numpy.ndarray[False]
Returns: numpy.ndarrayrepresenting a time series of atomic positions of anMDAnalysis.core.AtomGroup.AtomGroupselection fromPath.u_fitted.trajectory
-
class
MDAnalysis.analysis.psa.PSA(universes, reference=None, ref_select='name CA', ref_frame=0, path_select=None, labels=None, targetdir=None)[source]¶ Perform Path Similarity Analysis (PSA) on a set of trajectories.
The analysis is performed with
PSA.run()and stores the result in thenumpy.ndarraydistance matrixPSA.D.PSA.run()also generates a fitted trajectory and path from alignment of the original trajectories to a reference structure.New in version 0.8.
Setting up Path Similarity Analysis.
The mutual similarity between all unique pairs of trajectories are computed using a selected path metric.
Arguments: - universes
a list of universes (
MDAnalysis.Universeobject), each containing a trajectory- reference
reference coordinates;
MDAnalysis.Universeobject; ifNonethe first time step of the first item in trajs is used [None]- ref_select
The selection to operate on; can be one of:
- any valid selection string for
selectAtoms()that produces identical selections in mobile and reference; or - a dictionary
{'mobile':sel1, 'reference':sel2}(theMDAnalysis.analysis.align.fasta2select()function returns such a dictionary based on a ClustalW_ or STAMP_ sequence alignment); or - a tuple
(sel1, sel2)
When using 2. or 3. with sel1 and sel2 then these selections can also each be a list of selection strings (to generate an AtomGroup with defined atom order as described under Ordered selections).
- any valid selection string for
- mass_weighted
do a mass-weighted RMSD fit [
False]- tol_mass
Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]
- ref_frame
frame index to select frame from reference [0]
- path_select
atom selection composing coordinates of (fitted) path; if
Nonethen path_select is set to ref_select [None]- targetdir
output files are saved there [.]
- labels
list of strings, names of trajectories to be analyzed (
MDAnalysis.Universe); ifNone, defaults to trajectory names [None]
-
universes¶ list of
MDAnalysis.Universeobjects containing trajectories
-
u_reference¶ MDAnalysis.Universeobject containing a reference structure
-
ref_select¶ string, selection for
selectAtoms()to select frame fromPSA.u_reference
-
path_select¶ string, selection for
selectAtoms()to select atoms to composePath.path
-
ref_frame¶ int, frame index to select frame from
Path.u_reference
-
paths¶ list of
numpy.ndarrayobjects representing the set/ensemble of fitted trajectories
-
cluster(distArray, method='ward', count_sort=False, distance_sort=False, no_labels=True, color_threshold=4)[source]¶ Cluster trajectories.
Arguments: - method
string, name of linkage criterion for clustering [
'ward']- no_labels
boolean, if
Truethen do not label dendrogram [True]- color_threshold
For brevity, let t be the color_threshold. Colors all the descendent links below a cluster node k the same color if k is the first node below the cut threshold t. All links connecting nodes with distances greater than or equal to the threshold are colored blue. If t is less than or equal to zero, all nodes are colored blue. If color_threshold is None or ‘default’, corresponding with MATLAB(TM) behavior, the threshold is set to 0.7*max(Z[:,2]). [
4]]
Returns: list of indices representing the row-wise order of the objects after clustering
-
generate_paths(**kwargs)[source]¶ Generate paths, aligning each to reference structure if necessary.
Keywords: - align
Align trajectories to atom selection
PSA.ref_selectofPSA.u_reference[False]- filename
strings representing base filename for fitted trajectories and paths [
None]- infix
additional tag string that is inserted into the output filename of the fitted trajectory files [‘’]
- mass_weighted
do a mass-weighted RMSD fit
- tol_mass
Reject match if the atomic masses for matched atoms differ by more than tol_mass
- ref_frame
frame index to select frame from reference
- flat
represent
Path.pathas a 2D (\(N_p\times 3N\))numpy.ndarray; ifFalsethenPath.pathis a 3D (\(N_p\times N\times 3\))numpy.ndarray[False]- save
boolean; if
True, pickle list of names for fitted trajectories [True]- store
boolean; if
Truethen writes each path (numpy.ndarray) inPSA.pathsto compressed npz (numpy) files [False]
The fitted trajectories are written to new files in the “/trj_fit” subdirectory in
PSA.targetdirnamed “filename(trajectory)XXX*infix*_psa”, where “XXX” is a number between 000 and 999; the extension of each file is the same as its original. Optionally, the trajectories can also be saved in numpy compressed npz format in the “/paths” subdirectory inPSA.targetdirfor persistence and can be accessed as the attributePSA.paths.
-
plot(filename=None, linkage='ward', count_sort=False, distance_sort=False, figsize=4.5, labelsize=12)[source]¶ Plot a clustered distance matrix using method linkage along with the corresponding dendrogram. Rows (and columns) are identified using the list of strings specified by
PSA.labels.Arguments: - filename
string, save figure to filename [
None]- linkage
string, name of linkage criterion for clustering [
'ward']- figsize
set the vertical size of plot in inches [
6]- labelsize
set the font size for colorbar labels; font size for path labels on dendrogram default to 3 points smaller [
12]
If filename is supplied then the figure is also written to file (the suffix determines the file type, e.g. pdf, png, eps, ...). All other keyword arguments are passed on to
pylab.imshow().
-
run(**kwargs)[source]¶ Perform path similarity analysis on the trajectories to compute the distance matrix.
A number of parameters can be changed from the defaults. The result is stored as the array
PSA.D.Keywords: - metric
selection string specifying the path metric to measure pairwise distances among
PSA.paths['hausdorff']- start, stop, step
start and stop frame index with step size: analyze
trajectory[start:stop:step][None]- store
boolean; if
Truethen writesPSA.Dto text and compressed npz (numpy) files [True]- filename
string, filename to save
PSA.D
-
save_paths(filename=None)[source]¶ Save fitted
PSA.pathsto numpy compressed npz files.Arguments: - filename
string, specifies filename [
None]
The data are saved with
numpy.savez_compressed()in the directory specified byPSA.targetdir.
-
save_result(filename=None)[source]¶ Save distance matrix
PSA.Dto a numpy compressed npz file and text file.Arguments: - filename
string, specifies filename [
None]
The data are saved with
numpy.savez_compressed()andnumpy.savetxt()in the directory specified byPSA.targetdir.