Analyzing Simulation Trajectories
Let's say you've conducted a simulation. Everything up to that point (parametrization, initialization, actually running the simulation) will be assumed and probably discussed another day. What you have from a simulation is a trajectory (timeseries of coordinates), and now we have to derive some meaningful properties from this trajectory.
Many meaningful properties can be derived from these coordinates, be it how atomic coordinates are related to each other, the sorts of geometries or larger structures we see, or how these coordinates are correlated over time. Whatever it is you're interested in, it all starts with the coordinates
There are many analysis packages:
- [MDTraj] (http://mdtraj.org/1.9.3/)
- [MDAnalysis] (https://www.mdanalysis.org/docs/)
- [Freud] (https://freud.readthedocs.io/en/stable/)
- [Pytraj] (https://amber-md.github.io/pytraj/latest/index.html)
- [Cpptraj] (https://amber-md.github.io/cpptraj/CPPTRAJ.xhtml)
- and many, many others (this is what happens when a open-source software goes rampant with different desired functionality and starts from independent research groups)
While each has a variety of different built-in/common analysis routines, some are more common (like radial distribution functions). What EVERY modeller will use, though, is the coordinates. The most important function in these analysis packages is the ability to turn a large trajectory file, written to disk, and read it into memory as a data structure whose XYZ coordinates we can access.
Every simulation engine has different file formats and data encodings, but many of these analysis packages can support a wide range of file formats and pump out the same Trajectory
data structure core to each package.
For example, we can use MDtraj to read in some simulation files from GROMACS. We obtain information about the XYZ coordinates and molecular topology (atoms, elements, atom names/types, residues, chemical bonding)
In general, there's a sort of hierarchy/classification to groups of atoms. At the base, you have an atom, which is as it sounds, or a coarse-grained particle depending on your simulation. Groups of atoms can form a chain, which is pretty much just a bonded network of atoms. Groups of atoms and chains form a residue. This derives from protein amino acid residues, where each monomer was a residue. In other applications, this can also refer to a closed-loop bonded network of atoms (a singular molecule). All of these different entities/groupings form your topology
import mdtraj
traj = mdtraj.load('trajectory.xtc', top='em.gro')
traj
Most analysis packages have some way to access each atom in your topology
traj.topology.atom(0)
If designed well, you can access residue information from each atom
traj.topology.atom(0).residue
Or, you could acess each residue in your topology
traj.topology.residue(0)
And then access each atom from within that residue
traj.topology.residue(0).atom(2)
Every atom has an index, which is often used for accessing the different arrays
traj.topology.atom(100).index
Some analysis packages also have an atom-selection language, which returns various atom indices
traj.topology.select("element N")
Now we can get to the important numbers, the coordinates
traj.xyz
This is a multi-dimensional array, but off the bat you can start seeing these 3-tuples for XYZ.
This is a numpy array
, though, so we can use some numpy functions
traj.xyz.shape
1501 frames, 18546 atoms, 3 spatial coordinates.
We can also snip out a frame to get all of the coordinates for all the atoms in that one frame
traj.xyz[0].shape
Snip out an atom (or collection of atoms) - based on index - to get all frames and all the coordinates of that collection of atoms
traj.xyz[:, [1,2,3],:].shape
Snip out just one dimension to get all frames and all atoms and just one dimension
traj.xyz[:,:,0].shape
Since a trajectory is just a collection of frames, one after another, you can also snip out frames from a trajectory
traj[0]
This is still a Trajectory
object, just 1 frame. XYZ coordinates are still accessible as earlier
All simulations occur within a unitcell to define the boundaries of the simulation.
traj.unitcell_vectors
traj.unitcell_vectors.shape
For each frame, there is a 3x3 array to describe the simulation box vectors
I won't go into how you should analyze a trajectory, but every molecular modeller should be familiar with what analysis routines exist in which packages, and which analysis routines you should design yourself
Comments
There is a whole zoo of trajectory file formats that simulation engines produce - each analysis package can accommodate a subset of those file formats, each analysis package has different built-in analysis routines. Sometimes it's a mix-and-match game where you need to use package A to read a trajectory, and convert to package B representation because it has some particular analysis routine you need.
You could use an analysis package to read in one file format but write in another file format, or use an analysis package to manipulate coordinates/toplogy. Because these packages are designed intuitively and very similar to other structures in the SciPy ecosystem, there is a lot of room for creativity
Recent developments in the SciPy ecosystem look at out-of-memory or GPU representations of numpy array
or pandas DataFrame
, and this is a growing issue in our field - sometimes loading an entire Trajectory into memory is just not possible, so chunking is necessary to break the whole Trajectory into memory-manageable data