Part 2 Conclusion: From Static Protein Analysis to Molecular Dynamics
Modeling proteins using tiny springs
We now transition from the static study of proteins to the field of molecular dynamics, in which we simulate the movement of proteins’ atoms, along with their interactions as they move.
You may think that simulating the movements of proteins with hundreds of amino acids will be a hopeless task. After all, predicting the static structure of a protein has occupied biologists for decades! Yet part of what makes structure prediction so challenging is that the search space of potential structures is so enormous. Once we have established the static structure of a protein, its dynamic behavior will not allow it to deviate greatly from this static structure, and so the space of potential dynamic structures is narrowed down to those that are similar to the static structure.
A protein’s molecular bonds are constantly vibrating, stretching and compressing, much like that of the oscillating massspring system shown in the figure below. Bonded atoms are held together by sharing electrons and are held at specific bond length due to the attraction and repulsion of the negatively charged electrons and positively charged nucleus. If you push the atoms closer together or pull them farther apart, then they will “bounce back” to their equilibrium.
A massspring system in which a mass is attached to the end of a spring. The more we move the mass from its equilibrium, the greater its resistance and the more it will be repelled back toward equilibrium. Image courtesy: flippingphysics.com.
In an elastic network model (ENM), we imagine nearby alpha carbons of a protein structure to be connected by springs. Because distant atoms will not influence each other, we will only connect two alpha carbons if they are within some threshold distance of each other. In the following section, we will describe a Gaussian network model (GNM), an ENM for molecular dynamics.
An Introduction to Gaussian Network Models
Representing random movements of alpha carbons
We will introduce GNMs using our old friend human hemoglobin protein (1A3N.pdb). We first convert hemoglobin into a network of nodes and springs, in which each alpha carbon is given by a node, and two alpha carbons are connected by a string if they are within a threshold distance; the figure below uses a threshold value of 7.3 angstroms.
Conversion of human hemoglobin (left) to a network of nodes and springs (right) in which two nodes are connected by a spring if they are within a threshold distance of 7.3 angstroms.
The alpha carbons in a protein are subject to random fluctuations that cause them to deviate from their equilibrium positions. These fluctuations are Gaussian, which give the model its name. Gaussian fluctuations mean that the alpha carbon can deviate randomly from its equilibrium position according to a normal (bellshaped) distribution, meaning that it is more likely to be near the equilibrium than far away.
The equilibrium position of node \textvar{i} is represented by the vector \(\mathbf{R_i^0}\), and its position at a given point in time is denoted by the (variable) vector \(\mathbf{\mathbf{\Delta R_i}}\). The distance between node \textvar{i} and node \textvar{j} at equilibrium is denoted by the vector \(\mathbf{R_{ij}^0}\), which is equal to \(\mathbf{R_j^0}  \mathbf{R_i^0}\) (as illustrated in the figure below; at a given point in time, this distance becomes the vector \(\mathbf{R_{ij}}\).
(Left) A small network of nodes connected by springs deriving from a protein structure. The distance between two nodes i and j is denoted by the variable \(\mathbf{R_{ij}}\). (Right) Zooming in on two nodes i and j that are within the threshold distance. The equilibrium positions of node i and node j are represented by the distance vectors \(\mathbf{R_i^0}\) and \(\mathbf{R_j^0}\), with the distance between them denoted \(\mathbf{R_{ij}^0}\), which is equal to \(\mathbf{R_j^0}  \mathbf{R_i^0}\). The vectors \(\mathbf{\mathbf{\Delta R_i}}\) and \(\mathbf{\mathbf{\Delta R_j}}\) represent the nodes’ respective changes from equilibrium. Image courtesy: Ahmet Bakan.
Yet although atomic fluctuations are powered by randomness, the movements of protein atoms are in fact heavily correlated. For example, imagine the simple case in which all of a protein’s alpha carbons are connected in a straight line. If we pull the first alpha carbon away from the second node, then the second alpha carbon will be pulled toward the first alpha carbon, and the movements of these two alpha carbons will be directly correlated. Our goal is to understand how the movements of every pair of alpha carbons, called the crosscorrelation of these atoms, may be related. We will see how using vectors to represent these movements can be helpful.
Inner products and crosscorrelations
To determine how the movements of nodes i and j are related, we need to study the fluctuation vectors \(\mathbf{\mathbf{\Delta R_i}}\) and \(\mathbf{\mathbf{\Delta R_j}}\). Do these vectors point in similar or opposing directions?
To answer this question, we compute the inner product, or dot product, of the two vectors, \(\langle \mathbf{\mathbf{\Delta R_i}}, \mathbf{\mathbf{\Delta R_j}} \rangle\). The inner product of two vectors x = (x_{1}, x_{2}, x_{3}) and y = (y_{1}, y_{2}, y_{3}) is given by <x, y> = x_{1} · y_{1} + x_{2} · y_{2} + x_{3} · y_{3}. If x and y are perpendicular, then <x, y> is equal to zero. The more that the two vectors point in the same direction, the larger the value of <x, y>. And the more that the two vectors point in opposite directions, the smaller the value of <x, y>.
STOP: Say that x = (1, 2, 3), y = (2, 3, 5), and z = (1, 3, 4). Compute the inner products <x, y> and <x, z> and ensure that your answers match the preceding observation.
The inner product is also useful for representing the meansquare fluctuation of an alpha carbon, or the expected squared distance of node i from equilibrium. The fluctuation of node i from equilibrium is just \(\mathbf{\Delta R_i}\). If this vector is represented by the coordinates (x, y, z), then its square distance from equilibrium is x^{2} + y^{2} + z^{2}, which is the inner product of the fluctuation vector with itself, \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\).
Note: In practice, we will not know the exact values for the \(\mathbf{\Delta R_i}\) and must compute the inner products of fluctuation vectors indirectly using linear algebra, which is beyond the scope of this work. A full treatment of the mathematics of GNMs can be found in the chapter at https://www.csb.pitt.edu/Faculty/bahar/publications/b14.pdf
Long vectors pointing in the same direction will have a larger inner product than short vectors pointing in the same direction. As a result, we can normalize the inner product so that we can have a sense of the correlation of these vectors, independent of their length. To be precise, the crosscorrelation of nodes i and j is given by
\[C_{ij} = \dfrac{\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_j} \rangle}{\sqrt{\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle \langle \mathbf{\Delta R_j}, \mathbf{\Delta R_j} \rangle}}.\]After normalization, the crosscorrelation ranges from 1 to 1. A crosscorrelation of 1 means that the two alpha carbons’ movements are completely anticorrelated, and a crosscorrelation of 1 means that their movements are completely correlated.
After computing the crosscorrelation of every pair of alpha carbons in a protein structure with n residues, we obtain an n × n crosscorrelation matrix C such that C(i, j) is the crosscorrelation between amino acids i and j. We can visualize this matrix using a heat map, in which we color matrix values along a spectrum from blue (1) to red (1). The heat map for the crosscorrelation matrix of human hemoglobin is shown in the figure below.
The normalized crosscorrelation heat map of human hemoglobin (PDB: 1A3N). Red regions indicate correlated residue pairs which move in the same direction; blue regions indicate anticorrelated residue pairs which move in opposite directions. Note that the matrix is oriented so that its first element is in the bottom left corner.
The crosscorrelation map of a protein contains complex patterns of correlated and anticorrelated movement of the protein’s atoms. Much like the contact map that we showed in an earlier lesson, it serves as a “fingerprint” of sorts for the protein and can give useful insights into the protein’s structure.
For example, we should not be surprised that the main diagonal of the crosscorrelation map is colored red, since alpha carbons that are near each other in a polypeptide chain will likely have correlated movements.
Furthermore, the regions of high correlation near the main diagonal typically provide information regarding the protein’s secondary structures, since amino acids belonging to the same secondary structure will typically move in concert.
High correlation regions that are far away from the matrix’s main diagonal provide information about the tertiary structure of the protein, such as protein domains and other components of the protein that work together.
Consider again the crosscorrelation map of human hemoglobin shown above. We can see four squares of positive correlation along the main diagonal, representing the four subunits of hemoglobin from bottom left to top right: α_{1}, β_{1}, α_{2}, and β_{2}. Note that the first and third squares have the same patterns; these squares correspond to α_{1} and α_{2}, respectively. Similarly, the second and fourth squares correspond to β_{1} and β_{2}.
STOP: What other patterns do you notice in the hemoglobin crosscorrelation heat map?
Meansquare fluctuations and Bfactors
Just as we can visualize crosscorrelation, we could also plot the meansquare fluctuations of residues using a line graph, where the xaxis represents the order of alpha carbons, and the yaxis represents the meansquare fluctuation \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\) of the ith alpha carbon. Large yvalues in this plot would correspond to mobile alpha carbons, whereas lower values would correspond to static alpha carbons.
We also would like to compare these predicted meansquare fluctuations against what we can infer from experimental results. During Xray crystallography, the displacement of atoms within the protein crystal decreases the intensity of the scattered Xray, creating uncertainty in the positions of atoms. Bfactor, also known as temperature factor or DebyeWaller factor, is a measure of this uncertainty, which includes noise from positional variance of thermal protein motion, model errors, and lattice defects.
We can then compare these experimental Bfactors against a theoretical estimate of their value. It is beyond the scope of this work, but the theoretical Bfactors are given by
\[B_i = \frac{8 \pi^2}{3} \langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle.\]In other words, once we can estimate the meansquare fluctuations \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\), the theoretical Bfactor of the ith alpha carbon is equal to a constant times this inner product. This theoretical Bfactor tends to correlate well with theoretical Bfactors in practice.^{1}
The figure below shows a plot of the Bfactor of the α_{1} subunit of human hemoglobin. We also show a 3D structure of the hemoglobin protein, with amino acids colored along a spectrum according to both theoretical and experimental Bfactors for the sake of comparison. In these structures, red indicates large Bfactors (mobile amino acids), and blue indicates small Bfactors (static amino acids). Note that the mobile amino acids are generally found at the ends of secondary structures in the outer edges of the protein, which is expected because the boundaries between secondary structures typically contain highly fluctuating residues.
(Top): Human hemoglobin colored according to theoretical Bfactors calculated from GNM (left) and experimental Bfactors (right). Subunit α_{1} is located at the top left quarter of the protein figure. (Bottom): A 2D plot comparing the theoretical (blue) and experimental (black) Bfactors of subunit α_{1}. The theoretical and experimental Bfactors are correlated with a coefficient of 0.63.
Normal mode analysis
When listening to your favorite song, you probably do not think of the individual notes that it comprises. Yet a talented musician can dissect the song into the set of notes that each instrument contributes to the whole. Just because the music combines a number of individual sound waves does not mean that we cannot deconvolve the music into its substituent waves.
All objects, from colossal skyscrapers to tiny proteins, vibrate. And, just like in our musical example, these oscillations are the net result of individual waves passing through the object. The paradigm of breaking down a collection of vibrations into the comparatively small number of “modes” that summarize them is called normal mode analysis (NMA) and is at the heart of elastic network models.
The mathematical details are complicated, but by deconvolving a protein’s movement into individual normal modes, we can observe how each mode affects individual amino acids. As we did with Bfactors, for a given mode, we can visualize the results of a mode with a line graph, called the mode shape plot. The xaxis of this plot corresponds to the amino acid sequence of the protein, and the height of the ith position on the xaxis corresponds to the magnitude of the square fluctuation caused by the mode on the protein’s ith residue.
Just as a piece of music can have one instrument that is much louder than another, some of the oscillations contributing to an object’s vibrations may be more significant than others. NMA also determines the degree to which each mode contributes to the overall fluctuations of a protein; the mode contributing the most is called the slowest mode of the protein. The figure below shows a mode shape plot of the slowest mode for each of the four subunits of human hemoglobin.
(Top) Visualization of human hemoglobin colored based on GNM slow mode shape for the slowest mode (left) and the average of the ten slowest modes (right), or the ten modes that contribute the most to the square fluctuation. Regions of high mobility are colored red, corresponding to peaks in the mode shape plot. (Bottom) A mode shape plot of the slowest mode for human hemoglobin, separated over each of the four chains, shows that the four chains have a similar slowest mode.
Similar to crosscorrelation, analyzing a protein’s mode shapes will give insights into the structure of the protein, and comparing mode shapes for two proteins can reveal structural differences. For example, the mode shape plots in the figure above show that the slowest mode shape for the four subunits of hemoglobin are quite similar.
We should consult more than just a single mode when completing a full analysis of a protein’s molecular dynamics. Below is the slow mode plot averaging the “slowest” ten modes of hemoglobin, meaning the modes that have the greatest affect on mean fluctuation. Unlike when we examined only the slowest mode, we can now see a stark difference between the two groups of subunits/chains; the plots for α subunits (chains A and C) are very similar, whereas the plots for β subunits (chains B and D) share a slightly less similar average slow mode shape.
The average mode shape of the slowest ten modes for each of the four human hemoglobin subunits using GNM.
We are now ready to apply what we have learned in this lesson and use ProDy to build a GNM for the SARSCoV and SARSCoV2 spike proteins.
Molecular dynamics analyses of SARSCoV and SARSCoV2 spike proteins using GNM
The figure below shows the crosscorrelation heat maps of SARSCoV and SARSCoV2 spike proteins, indicating that these proteins may have similar dynamics.
The crosscorrelation heat maps of the SARSCoV2 spike protein (topleft), SARSCoV spike protein (topright), single chain of the SARSCoV2 spike protein (bottomleft), and singlechain of the SARSCoV spike protein (bottomright).
The next figure shows the mode shape plot for the slowest mode of the two proteins. The protein region between positions 200 and 500 of the spike protein is the most mobile and overlaps with the RBD region, found between residues 331 to 524.
(Top) A mode shape plot for the slowest mode of the SARSCoV2 spike protein (left) and SARSCoV spike protein (right). (Bottom) A mode shape plot for the slowest mode of a single chain of the SARSCoV2 spike protein (left) and a single chain of the SARSCoV spike protein (right). Note that the plot on the right is inverted compared to the one on the left because of a choice made by the software, but the two plots have the same shape if we consider the absolute value. These plots show that the two viruses have similar dynamics, and that residues 200 – 500 fluctuate the most, a region that overlaps heavily with the RBD.
We can also examine the mode shape plot for the average of the slowest ten modes for the two spike proteins (see figure below). Using this plot, we color flexible parts of the protein red and inflexible parts of the protein blue.
Average mode shape of the slowest ten modes of SARSCoV2 Spike (left) and SARSCoV Spike (right). The first peak corresponds to the NTerminal Domain (NTD) and the second peak corresponds to the Receptor Binding Domain (RBD).
The mode shape plots show that the RBD of both spike proteins are highly flexible, which agrees with the biological functions of these regions. When the RBD interacts with the ACE2 enzyme on human cells, the RBD of one of the three chains “opens up”, exposing itself to more easily bind with ACE2.
The figure above highlights another flexible region, the spike protein’s nonterminal domain (NTD). Similar to the RBD, the NTD of the Spike proteins also mediates viral infection, but by interacting with DCSIGN and LSIGN receptors rather than ACE2^{2}. These receptors are present on macrophages and dendritic cells, allowing SARSCoV2 to infect different tissues such as the lungs, where ACE2 expression levels are low. The ability to infect lung cells attributes to pneumonia, the main symptom of severe COVID19 cases. As with the RBD, the high flexibility in this domain allows the spike protein to more easily interact with these receptors.
We should perhaps not be surprised that these plots point to SARSCoV2 and SARSCoV having similar dynamics, since their structures were similar, and both viral proteins target the ACE2 enzyme. To help provide more evidence for their similarity, we will learn about another aspect of molecular dynamics.
ANM accounts for the direction of protein fluctuations
All of the analysis provided by GNM, from crosscorrelation plots to Bfactors and NMA, depend on the estimation of inner products \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\) or \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_j} \rangle\) as the case may be. Although the \(\mathbf{\Delta R_i}\) are vectors, meaning that they have a direction as well as a magnitude, the inner products have only a value, and we therefore are not able to infer anything about the direction of a protein’s movements. For this reason, GNM is said to be isotropic, meaning that it only considers the magnitude of force exerted on the springs between nearby molecules and ignores any global effect on the directions of these forces.
The generalization of GNM in which we attempt to determine the directions of these forces is called an anisotropic network model (ANM). Although ANM includes directionality, it typically performs worse than GNM when compared with experimental data^{3}. However, this model offers the benefit that it can be used to create animations depicting the range of motions and fluctuations of the protein, as well as to estimate the directions of movement caused by each of a protein’s modes.
We will not delve into the mathematical intricacies of ANM calculations, but we will use ANM to create animations visualizing protein fluctuations. For example, click on the animation below to see a video of estimated hemoglobin fluctuations produced from ANM. We can see that the left and right side of the protein are more flexible than the rest of the protein and twist in opposite directions.
Collective motions of the slowest mode in human hemoglobin from ANM calculations using DynOmics.
After we produce an animation like the one in the figure above, we also should attempt to explain it biologically. Human hemoglobin exists in two states: the tense state (T), in which it is not bound to an oxygen molecule, and the relaxed state (R), in which it is oxygenated. Hemoglobin’s mobility shown in the above animation corresponds to its ability to transition between these two states, in which saltbridges and contacts can shift by up to seven angstroms^{4}. This significant molecular flexibility exemplifies why we need to study protein dynamics as well as structure.
We will now apply ANM to the SARSCoV and SARSCoV2 spike proteins. We will also use NMWiz, which is short for “normal mode wizard”, to perform ANM calculations and create an animation of the SARSCoV2 (chimeric) RBD and the SARSCoV RBD.
ANM analysis of the coronavirus binding domain
To predict spike protein movements based on ANM, we used NMWiz and VMD to create animations of the protein fluctuations over time as calculated via ANM analysis. The following two animations show the complex of each virus’s RBD (purple) bound with ACE2 (green). Important residues from the three sites of conformational differences from the previous lessons are also highlighted.
SARSCoV spike protein RBD (PDB: 2ajf)
SARSCoV RBD  Purple 

Resid 463 to 472 (Loop)  Yellow 
Resid 442 (Hotspot 31)  Orange 
Resid 487 (Hotspot 353)  Red 
ACE2  Green 
Resid 79, 82, 83 (Loop)  Silver 
Resid 31, 35 (Hotspot 31)  Orange 
Resid 38, 353 (Hotspot 353)  Red 
SARSCoV2 spike protein chimeric RBD (PDB: 6vw1)
SARSCoV2 (Chimeric) RBD  Purple 

Resid 476 to 486 (Loop)  Yellow 
Resid 455 (Hotspot 31)  Blue 
Resid 493 (Hotspot 31)  Orange 
Resid 501 (Hotspot 353)  Red 
ACE2  Green 
Resid 79, 82, 83 (Loop)  Silver 
Resid 31, 35 (Hotspot 31)  Orange 
Resid 38, 353 (Hotspot 353)  Red 
Recall from the previous lesson that the greatest contribution of negative energy to the RBD/ACE2 complex in SARSCoV2 was the region called “hotspot 31”. This region is highlighted in blue and orange in the above figures. If you look very closely, as the protein swings in to bind with ACE2, the blue and orange regions appear to line up just a bit more naturally in the SARSCoV2 animation than in the SARSCoV animation. That is, the improved binding that we hypothesized for a static structure appears to be confirmed by dynamics simulations. This animation provides one more piece of evidence that SARSCoV2 is more effective at binding to the ACE2 enzyme.
DynOmics Integrates GNM and ANM Analysis
Although we have separated our discussion of GNM and ANM, the DynOmics 1.0 server is an effort to integrate these approaches on a userfriendly platform. If you are interested, the following tutorial shows how to use DynOmics to replicate some of our above analysis.
Fighting a Virus with Open Science
In this module, we began with a discussion of the fundamental problem of determining a protein’s structure. Because experimental methods for identifying protein structure are costly and time consuming, we transitioned to discuss algorithmic approaches that do a good job of predicting a protein’s structure from its sequence of amino acids.
We then discussed how to compare protein structures, with a lengthy case study comparing the SARSCoV and SARSCoV2 spike proteins. The problem of quantifying how different two structures is can be challenging, and we established both global and local structure comparison metrics. We applied these approaches to isolate three candidate regions of interest of the SARSCoV2 spike protein that differ from the SARSCoV spike protein when complexed with the ACE2 enzyme, and we quantified this binding using a localized energy function.
We concluded with a transition from the study of structure to the structure of molecular dynamics. If we hope to fully understand a protein’s function, then we need to know how it flexes and bends within its environment, sometimes in order to interact with other molecules.
Despite covering a great deal of ground, we have left just as much unstudied. For one example, the surface of viruses and host cells are “fuzzy” because they are covered by structures called glycans, or numerous monosaccharides linked together. SARSCoV and SARSCoV2 have a “glycan shield”, in which glycosylation of surface antigens allows the virus to hide from antibody detection. Researchers have found that the SARSCoV2 spike protein is heavily glycosylated, shielding around 40% of the protein from antibody recognition^{5}; we will save a comparative analysis of the two virus’s glycan shields for another time.
Finally, we would point out that although scientific research has often historically been siloed away from the public, the COVID19 pandemic exemplifies how global citizens can get involved in real research. GISAID published their first publicly available SARSCoV2 genome on December 24, 2019. Within six months, this database had grown to contain over 50,000 entries, and after two years, it would contain nearly 7 million.
At any point in early 2020, you could have grabbed your favorite SARSCoV2 genome sequenced from a patient, excised the sequence of the spike protein, and used one of a variety of different software resources to predict its structure. Or, you could have enlisted your own computer as part of a global race to provide vaccine developers with accurate estimations of the protein’s structure. Despite the COVID19 pandemic causing an international crisis, the progress made in opening scientific research to the public is cause for optimism about the future of biological research.
Thus concludes the third module of this course. In the course’s final module, we will turn our attention to a very different type of problem. To fight a virus like SARSCoV2, your body employs a cavalry of white blood cells. Maintaining healthy levels of these cells is vital to a strong immune system, and blood reports run counts of these cells to ensure they are within normal ranges. Can we teach a computer to run this analysis automatically?
Note: Although we have covered a great deal in this chapter, there is still much more to say about SARSCoV2. What happens after the spike protein binds to ACE2? How does the virus enter the cell and replicate? How does it fight our immune systems, and how should we design a vaccine to fight back? If you are interested in an online course covering some of these questions, then check out the free online course SARS Wars: A New Hope by Christopher Langmead.

Yang, L., Song, G., & Jernigan, R. L. 2009. Comparisons of experimental and computed protein anisotropic temperature factors. Proteins, 76(1), 164–175. https://doi.org/10.1002/prot.22328 ↩

Soh, W. T., Liu, Y., Nakayama, E. E., Ono, C., Torii, S., Nakagami, H., Matsuura, Y., Shioda, T., Arase, H. The Nterminal domain of spike glycoprotein mediates SARSCoV2 infection by associating with LSIGN and DCSIGN. ↩

Yang, L., Song, G., Jernigan, R. 2009. Protein elastic network models and the ranges of cooperativity. PNAS 106(30), 1234712352. https://doi.org/10.1073/pnas.0902159106 ↩

Davis, M., Tobi, D. 2014. Multiple Gaussian network modes alignments reveals dynamically variable regions: The hemoglobin case. Proteins: Structure, Function, and Bioinformatics, 82(9), 20972105. https://doiorg.cmu.idm.oclc.org/10.1002/prot.24565 ↩

Grant, O. C., Montgomery, D., Ito, K., & Woods, R. J. Analysis of the SARSCoV2 spike protein glycan shield: implications for immune recognition. bioRxiv : the preprint server for biology, 2020.04.07.030445. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7217288/ ↩
Comments