From Static Protein Analysis to Molecular Dynamics
Modeling proteins using tiny springs
You may think that simulating the movements of proteins with hundreds of amino acids will prove hopeless. After all, predicting the static structure of a protein has occupied biologists for decades! Yet part of what makes structure prediction so challenging is that the search space of potential structures is enormous. In contrast, once we have established the static structure of a protein, it will not deviate greatly from this static structure, and so the space of potential dynamic structures is narrowed down to those that are similar to the static structure.
A protein’s molecular bonds are constantly vibrating, stretching and compressing, much like that of the oscillating massspring system shown in the figure below. Bonded atoms are held at a specific distance apart due to the attraction and repulsion of the negatively charged electrons and positively charged nucleus. If you were to push the atoms closer together or pull them farther apart, then they would “bounce back” to their equilibrium.
A massspring system in which a mass is attached to the end of a spring. The more that we move the mass from its equilibrium, the more that it will be repelled back toward equilibrium. Image courtesy: flippingphysics.com.
In an elastic network model (ENM), we imagine nearby alpha carbons of a protein structure to be connected by springs. Because distant atoms will not influence each other, we will only connect two alpha carbons if they are within some threshold distance of each other. In this lesson, we will describe a Gaussian network model (GNM), an ENM for molecular dynamics.
Representing random movements of alpha carbons
We will introduce GNMs using our old friend human hemoglobin (1A3N.pdb). We first convert hemoglobin into a network of nodes and springs, in which each alpha carbon is given by a node, and two alpha carbons are connected by a string if they are within a threshold distance; the figure below uses a threshold value of 7.3 angstroms.
Conversion of human hemoglobin (left) into a network of nodes and springs (right) in which two nodes are connected by a spring if they are within a threshold distance of 7.3 angstroms.
The alpha carbons in a protein are subject to random fluctuations that cause them to move from their equilibrium positions. These fluctuations are Gaussian, meaning that the alpha carbon deviates randomly from its equilibrium position according to a normal (bellshaped) distribution. In other words, although an alpha carbon’s position is due to random chance, it is more likely to be near the equilibrium than far away.
Although atomic fluctuations are powered by randomness, the movements of protein atoms are heavily correlated. For example, imagine the simple case in which all of a protein’s alpha carbons are connected in a straight line. If we pull the first alpha carbon away from the center of the protein, then the second alpha carbon will be pulled along with it. Our goal is to understand how the movements of every pair of alpha carbons may be related.
Inner products and crosscorrelations
As illustrated in the figure below, the variable vector \(\mathbf{R_{ij}}\) represents the distance between nodes i and j. The equilibrium position of node \textvar{i} is represented by the vector \(\mathbf{R_i^0}\), and its displacement from equilibrium is denoted by the (variable) vector \(\mathbf{\mathbf{\Delta R_i}}\). The distance between node \textvar{i} and node \textvar{j} at equilibrium is denoted by the vector \(\mathbf{R_{ij}^0}\), which is equal to \(\mathbf{R_j^0}  \mathbf{R_i^0}\).
(Left) A small network of nodes connected by springs deriving from a protein structure. The distance between two nodes i and j is denoted by the variable \(\mathbf{R_{ij}}\). (Right) Zooming in on two nodes i and j that are within the threshold distance and therefore connected by a spring. The equilibrium positions of node i and node j are represented by the distance vectors \(\mathbf{R_i^0}\) and \(\mathbf{R_j^0}\), with the distance between them denoted \(\mathbf{R_{ij}^0}\), which is equal to \(\mathbf{R_j^0}  \mathbf{R_i^0}\). The vectors \(\mathbf{\mathbf{\Delta R_i}}\) and \(\mathbf{\mathbf{\Delta R_j}}\) represent the nodes’ respective changes from equilibrium. Image courtesy: Ahmet Bakan.
To determine how the movements of alpha carbons i and j are related, we need to study the fluctuation vectors \(\mathbf{\mathbf{\Delta R_i}}\) and \(\mathbf{\mathbf{\Delta R_j}}\). Do these vectors point in similar or opposing directions?
To answer this question, we compute the inner product, or dot product, of the two vectors, \(\langle \mathbf{\mathbf{\Delta R_i}}, \mathbf{\mathbf{\Delta R_j}} \rangle\). The inner product of two vectors x = (x_{1}, x_{2}, x_{3}) and y = (y_{1}, y_{2}, y_{3}) is given by <x, y> = x_{1} · y_{1} + x_{2} · y_{2} + x_{3} · y_{3}. If x and y are perpendicular, then <x, y> is equal to zero. The more that the two vectors point in the same direction, the larger the value of <x, y>. And the more that the two vectors point in opposite directions, the more negative the value of <x, y>.
STOP: Say that x = (1, 2, 3), y = (2, 3, 5), and z = (1, 3, 4). Compute the inner products <x, y> and <x, z>. Ensure that your answers match the preceding observation about the value of the inner product and the directions of vectors.
The inner product is also useful for representing the meansquare fluctuation of an alpha carbon, or the expected squared distance of node i from equilibrium. If the fluctuation vector \(\mathbf{\Delta R_i}\) is represented by the coordinates (x, y, z), then its square distance from equilibrium is x^{2} + y^{2} + z^{2}, which is the inner product of the fluctuation vector with itself, \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\). Alpha carbons having large values of this meansquare fluctuation may belong to more flexible regions of the protein.
Note: In practice, we will not know the exact values for the \(\mathbf{\Delta R_i}\) and must compute the inner products of fluctuation vectors indirectly using linear algebra, which is beyond the scope of this work. A full treatment of the mathematics of GNMs can be found in the chapter at https://www.csb.pitt.edu/Faculty/bahar/publications/b14.pdf
Long vectors pointing in the same direction will have a larger inner product than short vectors pointing in the same direction. To have a metric for the correlation of two vectors’ movements that is independent of their length, we should therefore normalize the inner product. The crosscorrelation of alpha carbons i and j is given by
\[C_{ij} = \dfrac{\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_j} \rangle}{\sqrt{\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle \cdot \langle \mathbf{\Delta R_j}, \mathbf{\Delta R_j} \rangle}}.\]After normalization, the crosscorrelation ranges from 1 to 1. A crosscorrelation of 1 means that the two alpha carbons’ movements are completely anticorrelated, and a crosscorrelation of 1 means that their movements are completely correlated.
After computing the crosscorrelation of every pair of alpha carbons in a protein structure with n residues, we obtain an n × n crosscorrelation matrix C such that C(i, j) is the crosscorrelation between amino acids i and j. We can visualize this matrix using a heat map, in which we color matrix values along a spectrum from blue (1) to red (1). The heat map for the crosscorrelation matrix of human hemoglobin is shown in the figure below.
The normalized crosscorrelation heat map of human hemoglobin (PDB: 1A3N). Red regions indicate correlated residue pairs which move in the same direction; blue regions indicate anticorrelated residue pairs which move in opposite directions.
The crosscorrelation map of a protein contains complex patterns of correlated and anticorrelated movement of the protein’s atoms. For example, we should not be surprised that the main diagonal of the above crosscorrelation heat map is colored red, since alpha carbons that are near each other in a polypeptide chain will likely have correlated movements. Furthermore, the regions of high correlation near the main diagonal typically provide information regarding the protein’s secondary structures, since amino acids belonging to the same secondary structure will typically move in concert.
High correlation regions that are far away from the matrix’s main diagonal provide information about the tertiary structure of the protein, such as domains that are distant in terms of sequence but work together in the protein structure.
In the crosscorrelation heat map in the figure above, we can see four squares of positive correlation along the main diagonal, representing the four subunits of hemoglobin from bottom left to top right: α_{1}, β_{1}, α_{2}, and β_{2}. Note that the first and third squares exhibit the same patterns; these squares correspond to α_{1} and α_{2}, respectively. Similarly, the second and fourth squares correspond to β_{1} and β_{2}, respectively.
STOP: What other patterns do you notice in the hemoglobin crosscorrelation heat map?
Meansquare fluctuations and Bfactors
Just as we can visualize crosscorrelation, we would also like to visualize the meansquare fluctuations that the GNM model predicts. During Xray crystallography, the displacement of atoms within the protein crystal decreases the intensity of the scattered Xray, creating uncertainty in the positions of atoms. A Bfactor, also known as temperature factor or DebyeWaller factor, is a measure of this uncertainty.
It is beyond the scope of this work, but the theoretical Bfactor of the ith alpha carbon is equal to a constant factor times the estimated meansquare fluctuation,
\[B_i = \frac{8 \pi^2}{3} \langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle.\]Once we have computed theoretical Bfactors, we can compare them against experimental results. The figure below shows a 3D structure of the hemoglobin protein, with amino acids colored along a spectrum according to both theoretical and experimental Bfactors for the sake of comparison. In these structures, red indicates large Bfactors (mobile amino acids), and blue indicates small Bfactors (static amino acids). Note that the mobile amino acids are generally found at the ends of secondary structures in the outer edges of the protein, which is expected because the boundaries between secondary structures typically contain highly fluctuating residues. This figure confirms the general observation that theoretical Bfactors tend to correlate well with experimental Bfactors in practice^{1} and offers one more example of a property of a biological system that we can infer computationally without the need for costly experimentation.
It is beyond the scope of this work, but the theoretical Bfactors are given byIn other words, once we can estimate the meansquare fluctuations \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\), the theoretical Bfactor of the ith alpha carbon is equal to a constant times this inner product. This theoretical Bfactor tends to correlate well with theoretical Bfactors in practice.^{1}
(Top): Human hemoglobin colored according to theoretical Bfactors calculated from GNM (left) and experimental Bfactors (right). Blue indicates low Bfactors, and red indicates high Bfactors. Subunit α_{1} is located at the top left quarter of the protein. (Bottom): A 2D plot comparing the theoretical (blue) and experimental (black) Bfactors of subunit α_{1}. The theoretical and experimental Bfactors are correlated with a coefficient of 0.63.
Normal mode analysis
When listening to your favorite song, you probably do not think of the individual notes that it comprises. Yet a talented musician can dissect the song into the set of notes that each instrument contributes to the whole. Just because the music combines a number of individual sound waves does not mean that we cannot deconvolve the music into its substituent waves.
All objects, from colossal skyscrapers to tiny proteins, vibrate. And, just like in our musical example, these oscillations are the net result of individual waves passing through the object. The paradigm of breaking down a collection of vibrations into the comparatively small number of “modes” that summarize them is called normal mode analysis (NMA) .
The mathematical details are complicated, but by deconvolving a protein’s movement into individual normal modes, we can observe how each mode affects individual amino acids. As we did with Bfactors, for a given mode, we can visualize the results of a mode using a line graph, called the mode shape plot. The xaxis of this plot corresponds to the amino acid sequence of the protein, and the height of the ith position on the xaxis corresponds to the magnitude of the square fluctuation contributed by the mode on the protein’s ith amino acid.
Just as a piece of music can have one instrument that is much louder than another, some of the oscillations contributing to an object’s vibrations may be more significant than others. NMA also determines the degree to which each mode contributes to the overall fluctuations of a protein; the mode contributing the most is called the slowest mode of the protein. The figure below shows a mode shape plot of the slowest mode for each of the four subunits of human hemoglobin and reveals that all four subunits have similar mode shape for the slowest mode.
(Top) Visualization of human hemoglobin colored based on GNM slow mode shape for the slowest mode (left) and the average of the ten slowest modes (right), or the ten modes that contribute the most to the square fluctuation. Regions of high mobility are colored red, corresponding to peaks in the mode shape plot. (Bottom) A mode shape plot of the slowest mode for human hemoglobin, separated over each of the four chains, shows that the four chains have a similar slowest mode.
Similar to crosscorrelation, analyzing a protein’s mode shapes will give insights into the structure of the protein, and comparing mode shapes for two proteins can reveal differences. For example, the mode shape plots in the figure above show that the slowest mode shape for the four subunits of hemoglobin are quite similar.
We should consult more than just a single mode when completing a full analysis of a protein’s molecular dynamics. Below is the slow mode plot averaging the “slowest” ten modes of hemoglobin, meaning the ten modes that have the greatest effect on mean fluctuation. Unlike when we examined only the slowest mode, we can now see a stark difference when comparing α subunits (chains A and C) to β subunits (chains B and D).
The average mode shape of the slowest ten modes for each of the four human hemoglobin subunits using GNM. Note that the plots for α subunits (chains A and C) and β subunits (chains B and D) differ more than when considering only the slowest mode.
We are now ready to apply what we have learned to build a GNM for the SARSCoV and SARSCoV2 spike proteins in a tutorial and analyze the dynamics of these proteins using the plots that we have introduced in this section.
Molecular dynamics analyses of SARSCoV and SARSCoV2 spike proteins using GNM
The figure below shows the crosscorrelation heat maps of SARSCoV and SARSCoV2 spike proteins, indicating that these proteins may have similar dynamics.
The crosscorrelation heat maps of the SARSCoV2 spike protein (topleft), SARSCoV spike protein (topright), single chain of the SARSCoV2 spike protein (bottomleft), and singlechain of the SARSCoV spike protein (bottomright).
The next figure shows the mode shape plot for the slowest mode of the two proteins. The protein region between positions 200 and 500 of the spike protein is the most mobile and overlaps with the RBD region, found between residues 331 to 524.
(Top) A mode shape plot for the slowest mode of the SARSCoV2 spike protein (left) and SARSCoV spike protein (right). (Bottom) A mode shape plot for the slowest mode of a single chain of the SARSCoV2 spike protein (left) and a single chain of the SARSCoV spike protein (right). Note that the plot on the right is inverted compared to the one on the left because of a choice made by the software, but the two plots have the same shape if we consider the absolute value. These plots show that the two viruses have similar dynamics, and that residues 200 – 500 fluctuate the most, a region that overlaps heavily with the RBD.
We can also examine the mode shape plot for the average of the slowest ten modes for the two spike proteins (see figure below). Using this plot, we color flexible parts of the protein red and inflexible parts of the protein blue.
Average mode shape of the slowest ten modes of SARSCoV2 Spike (left) and SARSCoV Spike (right). The first peak corresponds to the NTerminal Domain (NTD) and the second peak corresponds to the RBD. Above the mode shape plots, the viral spike proteins are colored according to the value of mode shape, with high values colored red and indicating greater predicted flexibility; note that the SARSCoV2 NTD is predicted to be more flexible than that of SARSCoV.
The mode shape plots show that the RBD of both spike proteins are highly flexible, which agrees with the biological functions of these regions. When the RBD interacts with the ACE2 enzyme on human cells, the RBD of one of the three chains “opens up”, exposing itself to more easily bind with ACE2.
The figure above highlights another flexible region, the spike protein’s nonterminal domain (NTD), which appears to be more mobile in SARSCoV2. Like the RBD, the NTD also mediates viral infection, but by interacting with DCSIGN and LSIGN receptors rather than ACE2^{2}. These receptors are present on macrophages and dendritic cells, allowing SARSCoV2 to infect different tissues such as the lungs, where ACE2 expression levels are low, which may help explain why SARSCoV2 is more likely than SARSCoV to progress to pneumonia.
The analysis provided by GNM is compelling, but it ultimately relies on the estimation of inner products \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\) and \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_j} \rangle\). Although the \(\mathbf{\Delta R_i}\) are vectors, meaning that they have a direction as well as a magnitude, the inner products have only a value, and so we therefore cannot infer anything about the direction of a protein’s movements. For this reason, GNM is said to be isotropic, meaning that it only considers the magnitude of force exerted on the springs between nearby molecules and ignores any global effect on the directions of these forces. In the next lesson, we will generalize GNMs to take these movements into account.

Yang, L., Song, G., & Jernigan, R. L. 2009. Comparisons of experimental and computed protein anisotropic temperature factors. Proteins, 76(1), 164–175. https://doi.org/10.1002/prot.22328 ↩ ↩^{2}

Soh, W. T., Liu, Y., Nakayama, E. E., Ono, C., Torii, S., Nakagami, H., Matsuura, Y., Shioda, T., Arase, H. The Nterminal domain of spike glycoprotein mediates SARSCoV2 infection by associating with LSIGN and DCSIGN. ↩
Comments