Example of Combined Cluster Analysis

by Daniel R. Roe

Cluster Analysis with CPPTRAJ

One way of determining structure populations from simulations is cluster analysis. Clustering is a means of partitioning data so that data points inside a cluster are more similar to each other than they are to points outside a cluster. In the context of molecular simulation, this means grouping similar conformations together. Similarity is determined by a distance metric - the smaller the distance, the more similar the structures. One commonly used distance metric is coordinate RMSD.

It is very important to note that there is no one "correct" way to perform cluster analysis. There are many different algorithms and distance metrics available, and different combinations will work better for certain systems. Cluster analysis usually requires some trial and error before satisfactory results are obtained. This tutorial is just one example of cluster analysis. For a more in-depth discussion on cluster analysis of MD trajectories, users are encouraged to read the 2007 paper by Shao et al. on the subject.

In this example we will use CPPTRAJ to perform both cluster analysis and combined clustering analysis (as a means of ascertaining convergence). CPPTRAJ supports a variety of clustering algorithms, distance metrics, clustering metrics, and output options. This example will use trajectories of the tetranucleotide rGACC that were generated using multidimensional replica exchange (MREMD, 24 x Temperature, 8 x aMD). Although the trajectories were originally generated with explicit TIP3P water, the solvent has been removed to decrease the size of the trajectories. This data and analysis was first reported (in part) in Roe et al., J. Phys. Chem. B, 2014, 118(13), pp 2543-3552.


Note that although input is provided in files, users are encouraged to use the interactive mode to become better familiar with CPPTRAJ workflow and command options.

Performing Cluster Analysis with CPPTRAJ

Step 1: Loading topology and trajectory

We will begin by performing cluster analysis on a single trajectory. Start CPPTRAJ by typing 'cpptraj'. Load the topology and trajectory with the following commands:

parm rGACC.nowat.parm7
Since these trajectories contain ions and we are only interested in rGACC itself, we can remove them (ensuring they do not show up in any output structures) with a strip command. We will also use the outprefix keyword so that the corresponding stripped topology will be written as 'noions.rGACC.nowat.parm7'. Note that this step is optional and does not have to be done in order to perform clustering.
strip :Na+ outprefix noions

Step 2: Clustering Command

Now we will issue the clustering command. There are many options for clustering which can be grouped into 4 general categories: Clustering Algorithm, Distance Metric, Output, and Coordinate Output.

cluster C0 \
        dbscan minpoints 25 epsilon 0.9 sievetoframe \
        rms :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \
        sieve 10 random \
        out cnumvtime.dat \
        sil Sil \
        summary summary.dat \
        info info.dat \
        cpopvtime cpopvtime.agr normframe \
        repout rep repfmt pdb \
        singlerepout singlerepfmt netcdf \
        avgout Avg avgfmt restart

We will briefly look at each option with more detail:

Once your input has been read in, type run to begin trajectory processing and analysis. Be patient, as this may take about a minute running on a fairly modern CPU. Once clustering has completed you can type quit to exit CPPTRAJ. You will now have many output files containing clustering results. If you check 'summary.dat' you should see 16 total clusters. The various output options are explained in detail in the Amber manual.

One particularly interesting output are the cluster populations versus time, written to 'cpopvtime.agr' (xmgrace format). In particular one can see that at the beginning of the trajectory the cluster populations are changing rapidly over time. As the run progresses the cluster populations gradually stabilize until they reach their final values somewhere around 70,000 frames. This is an indication that the cluster populations are reaching an equilibrium and the results may be suitable for things like thermodynamic analysis. However, a better way of ascertaining this is to compare the results to an independent run.

Combined Cluster analysis with CPPTRAJ

"Convergence" is an important concept in molecular simulations. It is a way of gauging how effectively the underlying conformational landscape is being sampled. A simulation starting from a single point (i.e. conformation) may become kinetically trapped in a few local energy wells; it is difficult to ascertain whether this has occurred from a single simulation. However, if additional simulations of a system are started with different initial conditions and/or different initial conformations (ideally both), eventually they should sample the same conformations in the same populations; once this has occurred the simulations are said to have "converged". When two or more simulations have converged and structure populations are no longer significantly changing, one can start to calculate things like thermodynamic quantities with some degree of certainty.

Cluster analysis can be used to compare populations of structures in separate runs as a means of determining convergence. However, it can be challenging to compare clusters between different trajectory runs; they may either be present in each trajectory in different populations or a conformation may occur in one trajectory but not the other. In order to facilitate the comparison of clusters between different trajectories, CPPTRAJ supports "combined clustering", where cluster analysis is performed on two (or more!) combined trajectories, then the results are partitioned based on the original trajectories.

Step 1: Loading topology and trajectories

Since we are going to be combining the two trajectories before clustering, we need to know how many frames are in each so we can tell the cluster command where to partition the combined clustering results. This can be done with cpptraj on the command line:

cpptraj -p rGACC.nowat.parm7 -y -tl
Here the '-p' flag loads the topology, the '-y' flag loads the trajectory, and the '-tl' flag is used to report the number of frames. The output should be:
Frames: 83843
So we know we need to split the combined trajectories at frame 83843.

Now start CPPTRAJ by typing 'cpptraj'. This will bring up the interactive prompt. Load the topology and trajectories (and strip ions):

parm rGACC.nowat.parm7
strip :Na+ outprefix noions

Step 2: Clustering command for combined cluster analysis

Now enter the clustering command. Most lines will look similar to the previous run, except we will be adding keywords for doing combined cluster analysis.

cluster C1 \
        dbscan minpoints 25 epsilon 0.9 sievetoframe \
        rms :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \
        sieve 20 \
        out combined.cnumvtime.dat \
        info \
        summarysplit split.dat splitframe 83843

There are two new keywords here:

Once your input has been read in, type run to begin trajectory processing and analysis. Be patient, as this may take about a minute and a half to two minutes running on a fairly modern CPU. Once clustering has completed you can type quit to exit CPPTRAJ.

The main output from the combined clustering analysis is in 'split.dat'. The first two lines of this file describe how the splitting of the combined input frames was done and how many total frames comprise each section:

# 1st <= 83843 < 2nd
# 1st= 83843  2nd= 108887
Next comes several columns: The cluster populations between the first and second trajectories agree quite well (max absolute difference in fraction population about 0.015), indicating that the two trajectories are relatively well-converged.

Copyright Daniel R. Roe, 2015