Molecular Modeling Software: MDTraj

7 minute read

Published:

Some anecdotes with analyzing simulations

Let’s say you’ve conducted a simulation. Everything up to that point (parametrization, initialization, actually running the simulation) will be assumed and probably discussed another day. What you have from a simulation is a trajectory (timeseries of coordinates), and now we have to derive some meaningful properties from this trajectory.

Many meaningful properties can be derived from these coordinates, be it how atomic coordinates are related to each other, the sorts of geometries or larger structures we see, or how these coordinates are correlated over time. Whatever it is you’re interested in, it all starts with the coordinates

There are many analysis packages:

  • MDTraj
  • MDAnalysis
  • Freud
  • Pytraj
  • Cpptraj
  • and many, many others (this is what happens when a open-source software goes rampant with different desired functionality and starts from independent research groups)

While each has a variety of different built-in/common analysis routines, some are more common (like radial distribution functions). What EVERY modeler will use, though, is the coordinates. The most important function in these analysis packages is the ability to turn a large trajectory file, written to disk, and read it into memory as a data structure whose XYZ coordinates we can access.

Every simulation engine has different file formats and data encodings, but many of these analysis packages can support a wide range of file formats and pump out the same Trajectory data structure core to each package.

For example, we can use MDtraj to read in some simulation files from GROMACS. We obtain information about the XYZ coordinates and molecular topology (atoms, elements, atom names/types, residues, chemical bonding)

In general, there’s a sort of hierarchy/classification to groups of atoms. At the base, you have an atom, which is as it sounds, or a coarse-grained particle depending on your simulation. Groups of atoms can form a chain, which is pretty much just a bonded network of atoms. Groups of atoms and chains form a residue. This derives from protein amino acid residues, where each monomer was a residue. In other applications, this can also refer to a closed-loop bonded network of atoms (a singular molecule). All of these different entities/groupings form your topology

import mdtraj
traj = mdtraj.load('trajectory.xtc', top='em.gro')
traj
<mdtraj.Trajectory with 1501 frames, 18546 atoms, 2688 residues, and unitcells at 0x10dab9ba8>

Most analysis packages have some way to access each atom in your topology

traj.topology.atom(0)
DSPC1-N

If designed well, you can access residue information from each atom

traj.topology.atom(0).residue
DSPC1

Or, you could acess each residue in your topology

traj.topology.residue(0)
DSPC1

And then access each atom from within that residue

traj.topology.residue(0).atom(2)
DSPC1-H13A

Every atom has an index, which is often used for accessing the different arrays

traj.topology.atom(100).index
100

Some analysis packages also have an atom-selection language, which returns various atom indices

traj.topology.select("element N")
array([    0,   142,   284,   426,   568,   710,   852,   994,  1136,
        1278,  1420,  1562,  1704,  1846,  1988,  2130,  2272,  2414,
        2556,  2698,  2840,  9273,  9415,  9557,  9699,  9841,  9983,
       10125, 10267, 10409, 10551, 10693, 10835, 10977, 11119, 11261,
       11403, 11545, 11687, 11829, 11971, 12113])

Now we can get to the important numbers, the coordinates

traj.xyz
array([[[3.75500011e+00, 2.16800022e+00, 6.84000015e+00],
        [3.64100027e+00, 2.17200017e+00, 6.94400024e+00],
        [3.68000007e+00, 2.21600008e+00, 7.03500032e+00],
        ...,
        [1.38600004e+00, 2.20000014e-01, 8.43700027e+00],
        [1.31000006e+00, 2.01000005e-01, 8.49200058e+00],
        [1.39500010e+00, 1.43000007e-01, 8.38100052e+00]],

       [[3.92900014e+00, 2.18300009e+00, 6.83200026e+00],
        [3.85500026e+00, 2.24800014e+00, 6.94500017e+00],
        [3.92200017e+00, 2.28600001e+00, 7.02000046e+00],
        ...,
        [7.00000003e-02, 3.23000014e-01, 1.30000010e-01],
        [6.00000005e-03, 3.45000029e-01, 6.30000010e-02],
        [8.50000009e-02, 4.05000031e-01, 1.77000001e-01]],

       [[3.79200029e+00, 2.13300014e+00, 6.90600014e+00],
        [3.75500011e+00, 2.17000008e+00, 7.04500055e+00],
        [3.69800019e+00, 2.26100016e+00, 7.03900051e+00],
        ...,
        [3.31000030e-01, 8.42000067e-01, 8.24400043e+00],
        [2.39000008e-01, 8.64000022e-01, 8.26000023e+00],
        [3.51000011e-01, 7.74000049e-01, 8.30900002e+00]],

       ...,

       [[5.35700035e+00, 3.21400023e+00, 6.66500044e+00],
        [9.00000054e-03, 3.30200005e+00, 6.77700043e+00],
        [5.32400036e+00, 3.37800026e+00, 6.80000019e+00],
        ...,
        [5.89000046e-01, 2.46400023e+00, 6.28700018e+00],
        [5.42000055e-01, 2.38100004e+00, 6.27500010e+00],
        [6.04000032e-01, 2.49500012e+00, 6.19800043e+00]],

       [[9.20000076e-02, 3.15200019e+00, 7.07700014e+00],
        [1.93000004e-01, 3.26800013e+00, 7.08700037e+00],
        [1.33000001e-01, 3.35700011e+00, 7.09800053e+00],
        ...,
        [8.20000052e-01, 2.19400001e+00, 6.70000029e+00],
        [8.31000030e-01, 2.10200000e+00, 6.67600012e+00],
        [7.78000057e-01, 2.23400021e+00, 6.62400055e+00]],

       [[1.24000005e-01, 2.99600005e+00, 6.71500015e+00],
        [9.60000008e-02, 3.05700016e+00, 6.84500027e+00],
        [4.00000019e-03, 3.11200023e+00, 6.83600044e+00],
        ...,
        [5.77000022e-01, 2.19900012e+00, 6.82900047e+00],
        [6.62000060e-01, 2.23500013e+00, 6.80300045e+00],
        [5.80000043e-01, 2.10800004e+00, 6.80000019e+00]]], dtype=float32)

This is a multi-dimensional array, but off the bat you can start seeing these 3-tuples for XYZ.

This is a numpy array, though, so we can use some numpy functions

traj.xyz.shape
(1501, 18546, 3)

1501 frames, 18546 atoms, 3 spatial coordinates. In numpy array terms, the frames are the first dimension, atoms the second dimension, and spatial coordinates the third.

We can also snip out a frame to get all of the coordinates for all the atoms in that one frame

traj.xyz[0].shape
(18546, 3)

Snip out an atom (or collection of atoms) - based on index - to get all frames and all the coordinates of that collection of atoms

traj.xyz[:, [1,2,3],:].shape
(1501, 3, 3)

Snip out just one dimension to get all frames and all atoms and just one dimension

traj.xyz[:,:,0].shape
(1501, 18546)

Since a trajectory is just a collection of frames, one after another, you can also snip out frames from a trajectory

traj[0]
<mdtraj.Trajectory with 1 frames, 18546 atoms, 2688 residues, and unitcells at 0x1257af908>

This is still a Trajectory object, just 1 frame. XYZ coordinates are still accessible as earlier

All simulations occur within a unitcell to define the boundaries of the simulation.

traj.unitcell_vectors
array([[[5.11195  , 0.       , 0.       ],
        [0.       , 3.74324  , 0.       ],
        [0.       , 0.       , 8.80772  ]],

       [[5.116633 , 0.       , 0.       ],
        [0.       , 3.7466693, 0.       ],
        [0.       , 0.       , 8.806476 ]],

       [[5.0943184, 0.       , 0.       ],
        [0.       , 3.7303293, 0.       ],
        [0.       , 0.       , 8.894873 ]],

       ...,

       [[5.3887806, 0.       , 0.       ],
        [0.       , 3.9459503, 0.       ],
        [0.       , 0.       , 8.189841 ]],

       [[5.3347273, 0.       , 0.       ],
        [0.       , 3.9063694, 0.       ],
        [0.       , 0.       , 8.352207 ]],

       [[5.3583217, 0.       , 0.       ],
        [0.       , 3.9236465, 0.       ],
        [0.       , 0.       , 8.363432 ]]], dtype=float32)
traj.unitcell_vectors.shape
(1501, 3, 3)

For each frame, there is a 3x3 array to describe the simulation box vectors

I won’t go into how you should analyze a trajectory, but every molecular modeler should be familiar with what analysis routines exist in which packages, and which analysis routines you should design yourself

Comments

There is a whole zoo of trajectory file formats that simulation engines produce - each analysis package can accommodate a subset of those file formats, each analysis package has different built-in analysis routines. Sometimes it’s a mix-and-match game where you need to use package A to read a trajectory, and convert to package B representation because it has some particular analysis routine you need.

You could use an analysis package to read in one file format but write in another file format, or use an analysis package to manipulate coordinates/toplogy. Because these packages are designed intuitively and very similar to other structures in the SciPy ecosystem, there is a lot of room for creativity

Recent developments in the SciPy ecosystem look at out-of-memory or GPU representations of numpy array or pandas DataFrame, and this is a growing issue in our field - sometimes loading an entire Trajectory into memory is just not possible, so chunking is necessary to break the whole Trajectory into memory-manageable data

Summary

There are a variety of analysis packages out there, but they all start out the same way: read a simulation trajectory file and create an in-memory data representation that contains trajectory (coordinates) and topology (atoms, bonds) information.