Statistical aspects of protein structure prediction

The basics

Amino acids are small molecules of the form

H2N-CH(R)-COOH

where R is a side chain called the R-group. There are 20 different amino acids found in proteins, each characterized by its R-group.

Peptides and proteins are chains of amino acids. Proteins are long chains of this kind, whereas peptides and polypeptides are shorter ones. The amino acids are linked together by peptide bonds:


In 3D, each chain C_\alpha-C-N-C_\alpha lies in its own plane:


Put together, they form the protein backbone.

Protein structure

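The three-dimensional conformation of the protein backbone is determined by the torsion angles of the two backbone bonds on either side of each alpha carbon, the atom that carries the R-group R_i. These angles are denoted by \phi_i (rotation about the N-C_\alpha bond) and \psi_i (rotation about the C_\alpha-C bond).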
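
As a minimal sketch (not taken from any of the cited references), a torsion angle can be computed from the positions of four consecutive backbone atoms. For residue i, \phi_i would use the atoms C_{i-1}, N_i, C_\alpha, C_i and \psi_i the atoms N_i, C_\alpha, C_i, N_{i+1}. The function name `dihedral` and the toy coordinates are illustrative assumptions.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the bond p1-p2.

    The angle is 0 when p0 and p3 are eclipsed (cis) and 180 in absolute
    value when they point in opposite directions (trans).
    """
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal to the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal to the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))

# Toy coordinates (made up for illustration, not from a real protein).
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
print(cis, trans)
```

Using `atan2` rather than `acos` keeps the sign of the angle, which matters because \phi and \psi range over the full circle.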


Proteins tend to stay in relatively stable configurations. Thus every protein has an associated three-dimensional structure, namely the spatial position of each atom in the protein. This structure, including the amino acids’ specific R-groups, determines the protein’s function.

The structure prediction problem is to predict the three-dimensional structure of a protein, given the sequence of amino acids that composes it. A helpful resource is the Protein Data Bank (PDB), an online repository containing the experimentally determined 3D structures of thousands of proteins.

Decomposition of the problem

The structure of a protein is described at four levels.

  • Primary structure. The (ordered) sequence of amino acids that composes the protein.
  • Secondary structure. The local structure of the protein. Many local patterns have been classified, including helices and sheets.
  • Tertiary structure. The overall structure of the protein. When the protein is made of many protein subunits (which are polypeptides), the tertiary structure describes their individual conformations.
  • Quaternary structure. The global structure of the protein (including how subunits are arranged together).

Figure 1. The structure of a protein, the polymerase basic protein 2, from the PDB. On the left is its ball-and-stick representation; on the right, its secondary structure is highlighted in cartoon style: notice the helices (pink) and sheets (yellow) joined by irregular segments (white).

Figure 2. The protein’s backbone.

Characterizing the phi-psi angle distributions

In a short 1963 article entitled Stereochemistry of Polypeptide Chain Configurations [3] and in related works, Ramachandran, Ramakrishnan and Sasisekharan established the convention of using the \phi and \psi angles to describe the conformation of a polypeptide backbone. They also predicted “allowed” and “disallowed” regions for the (\phi, \psi) angles, reflecting physical constraints, and associated notable patterns such as \alpha-helices and \beta-sheets with these regions.

Forty years later, Hovmöller et al. [4] compared these predictions with empirical data from the Protein Data Bank. For each of the 20 amino acids, they plotted the distribution of its (\phi, \psi) angles as observed in 1042 protein subunits. Such plots are called Ramachandran plots, and the empirical results were surprisingly close to Ramachandran’s predictions.
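
An empirical Ramachandran plot is, in essence, a two-dimensional histogram of (\phi, \psi) pairs. The following is a minimal sketch, assuming angles in degrees and a hypothetical bin width of 10 degrees. The function name and the sample angles are illustrative and are not data from [4].

```python
def ramachandran_histogram(angles, bin_width=10):
    """Count (phi, psi) pairs, given in degrees, on a grid over [-180, 180).

    Assumes bin_width divides 360. Rows index psi, columns index phi.
    """
    n_bins = 360 // bin_width
    counts = [[0] * n_bins for _ in range(n_bins)]
    for phi, psi in angles:
        # Wrap into [0, 360) so that -180 and 180 land in the same bin;
        # the min() guards against floating-point round-up at the edge.
        i = min(int(((phi + 180.0) % 360.0) // bin_width), n_bins - 1)
        j = min(int(((psi + 180.0) % 360.0) // bin_width), n_bins - 1)
        counts[j][i] += 1
    return counts

# Made-up angles for illustration only.
sample = [(-60.0, -45.0), (-60.0, -45.0), (-120.0, 130.0), (180.0, 180.0)]
hist = ramachandran_histogram(sample)
print(sum(map(sum, hist)))  # total number of binned pairs
```

Because the angles are periodic, the wrap-around step is the one detail that differs from an ordinary histogram on the plane.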

Figure 3. The alanine (c) and glycine (d) Ramachandran plots reproduced from [4].

The alanine plot is very similar to the Ramachandran plots of most other amino acids. The angles in the upper left cluster of the alanine plot are in the \beta-sheet region. Angles in the lower cluster tend to be highly concentrated and are in the \alpha-helix region. This association of angular regions with secondary structure patterns is not exact, but was found to agree with the DSSP secondary structure classification in 95-99% of cases. (DSSP is a standard tool for assigning secondary structure from a protein’s 3D structure.)
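
To make the idea of angular regions concrete, here is a rough sketch of a region lookup. The rectangular boundaries below are illustrative assumptions chosen around the commonly quoted canonical values (roughly \phi = -60, \psi = -45 for \alpha-helices and \phi = -120, \psi = 130 for \beta-sheets). They are not the region definitions used in [4], and the function name is hypothetical.

```python
def classify_region(phi, psi):
    """Assign a (phi, psi) pair in degrees, each in [-180, 180], to a coarse region.

    The box boundaries are assumed for illustration only.
    """
    if -160.0 <= phi <= -20.0 and -120.0 <= psi <= 50.0:
        return "alpha"
    # The beta region wraps around psi = 180, hence the two-sided test.
    if -180.0 <= phi <= -45.0 and (psi >= 100.0 or psi <= -150.0):
        return "beta"
    return "other"

for pair in [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]:
    print(pair, classify_region(*pair))
```

A lookup of this kind only approximates a secondary structure assignment, which is consistent with the association being close to, but not identical with, the DSSP classification.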

Predicting phi-psi angles

The \phi, \psi angles around a given amino acid residue in a protein can be predicted using empirical data from similar protein structures in the PDB. The predictions can then be passed as suggestions to more sophisticated structure prediction search algorithms.

In [5], it was suggested to predict the “half-angles” (\psi_i, \phi_{i+1}) instead of the “whole-angles” (\phi_i, \psi_i), since the former involve two amino acid residues and are therefore more specific. However, the datasets of (\psi_i, \phi_{i+1}) angles are correspondingly much smaller. The authors therefore used a Dirichlet process mixture (with bivariate von Mises distributions) to model the distribution of the (\psi_i, \phi_{i+1}) angles, and predicted angular values from the posterior predictive distribution.
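
The difference between the two kinds of datasets can be sketched as follows. Whole-angles are grouped by a single amino acid (20 groups), whereas half-angles are grouped by an ordered pair of neighbouring amino acids (20 x 20 = 400 groups), which is why each group holds less data. This is a minimal illustration under assumptions: the function names, the input layout and the sample chain are made up and are not the data preparation used in [5].

```python
from collections import defaultdict

def group_whole_angles(residues):
    """Group (phi_i, psi_i) by the amino acid at position i.

    residues: list of (amino_acid, phi, psi) for consecutive residues of one
    chain, with None where an angle is undefined (chain ends).
    """
    groups = defaultdict(list)
    for aa, phi, psi in residues:
        if phi is not None and psi is not None:
            groups[aa].append((phi, psi))
    return dict(groups)

def group_half_angles(residues):
    """Group (psi_i, phi_{i+1}) by the ordered pair of amino acids (i, i+1)."""
    groups = defaultdict(list)
    for (aa_i, _, psi_i), (aa_next, phi_next, _) in zip(residues, residues[1:]):
        if psi_i is not None and phi_next is not None:
            groups[(aa_i, aa_next)].append((psi_i, phi_next))
    return dict(groups)

# Made-up four-residue chain, for illustration only.
chain = [("ALA", None, 140.0), ("GLY", -80.0, 170.0),
         ("ALA", -60.0, -45.0), ("GLY", -75.0, None)]
print(group_whole_angles(chain))
print(group_half_angles(chain))
```

Note that the pairing via `zip(residues, residues[1:])` assumes the list has no gaps; a chain with missing residues would need to be split first.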

Their statistical model is clear and well suited to the task. Most of the parameters are easily interpretable and allow the necessary prior information to be incorporated (some parameters are harder to interpret and serve to give the model flexibility). They clearly show that using “half-angles” yields better predictions than using “whole-angles”.
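
For reference, one common form of the bivariate von Mises density is the so-called sine model, written here for a generic pair of angles (\phi, \psi). It is assumed, not confirmed by the text above, that this is the variant meant in [5]; the normalizing constant is omitted.

```latex
f(\phi, \psi \mid \mu, \nu, \kappa_1, \kappa_2, \lambda) \propto
  \exp\bigl\{ \kappa_1 \cos(\phi - \mu) + \kappa_2 \cos(\psi - \nu)
            + \lambda \sin(\phi - \mu) \sin(\psi - \nu) \bigr\}
```

In this form \mu and \nu are the mean angles, \kappa_1, \kappa_2 \geq 0 control how concentrated each angle is around its mean, and \lambda controls the dependence between the two angles.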


[1] Watson, Baker, Bell, Gann, Levine and Losick, Molecular Biology of the Gene, sixth edition.

[2] Richardson, J.S., The Anatomy and Taxonomy of Protein Structure,

[3] Ramachandran et al., Stereochemistry of Polypeptide Chain Configurations, J. Mol. Biol., 1963.

[4] Hovmöller, S., Zhou, T. and Ohlson, T., Conformations of amino acids in proteins, Acta Crystallographica Section D: Biological Crystallography, 2002.

[5] Lennox, K., et al., Density Estimation for Protein Conformation Angles Using a Bivariate von Mises Distribution and Bayesian Nonparametrics, Journal of the American Statistical Association, 2009.
