The Science of Protein Folding: Exploring the Basics

MoleculeAI
8 min read · Jul 15, 2023

Proteins are the workhorses of living cells. They perform a vast array of essential functions, but to do so they must first fold into specific shapes, because the shape of a protein is closely related to its function. How, then, do proteins fold into the shapes their jobs require?

The human body contains trillions of protein molecules, each with a specific function critical to our health and wellbeing. Metabolism, muscle movement, storage of calcium in bone, and defense against bacteria and viruses are all carried out by different proteins. Understanding how these molecules fold and interact is key to unlocking the secrets of life itself.

Have you ever stopped to think about the complexity of a single protein molecule? Although proteins exhibit an astounding diversity of functions, they all share one structural feature: they are linear polymers of amino acids linked by peptide bonds.

What is an Amino Acid?

Amino acids are the basic building blocks of a protein. Each amino acid has a carboxyl group (-COOH), a primary amino group (-NH2), and a distinctive side chain (R group) bonded to the α-carbon atom. In proteins, the carboxyl and amino groups are tied up in peptide linkages and are not available for chemical reactions. Hence, the chemical nature of the side chain determines the role an amino acid plays in a protein. Only 20 types of amino acids are commonly found in mammalian proteins [1,2]. The details of these amino acids including names, abbreviations, molecular formulas, and three-dimensional molecular models can be found here.

General structure of an amino acid. [Figure Reference]

Proteins and their structural complexity

The essential information to create a protein molecule with a distinct function is encoded in the ordered arrangement of linked amino acids, which results in a specific three-dimensional configuration. To better comprehend the complexity of protein structure, it is necessary to examine the structure using four hierarchical levels of organization: primary, secondary, tertiary, and quaternary structure [1].

The primary structure is the sequence of amino acids. Historically, the N-terminal sequencing method developed by Edman was used to determine the amino acid sequence of an unknown protein, but it has now largely been replaced by methods based on enzymatic digestion of the protein followed by mass spectrometric analysis. An example of a primary structure is shown in the following figure.

A linear sequence of amino acids in a polypeptide chain. [Figure Reference]
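Because the primary structure is just an ordered list of residues, it is commonly written as a string of one-letter codes. The following is a minimal Python sketch of that idea. The example peptide and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence: str) -> str:
    """Spell out a one-letter sequence, read from N-terminus to C-terminus."""
    return "-".join(THREE_LETTER[residue] for residue in sequence.upper())


def composition(sequence: str) -> Counter:
    """Count how many times each residue type occurs in the sequence."""
    return Counter(sequence.upper())


# Hypothetical example peptide, not taken from a real protein.
peptide = "MKTAYIAK"
print(to_three_letter(peptide))
print(composition(peptide))
```

A residue letter outside the 20 standard codes raises a `KeyError` here, which is a deliberate simplification for a sketch.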

The linkage of more than 50 amino acids through peptide bonds forms a polypeptide [1]. The polypeptide backbone does not adopt an arbitrary three-dimensional structure. Rather, it typically forms regular, local structural patterns among amino acids that lie near each other in the linear sequence. These patterns, known as the polypeptide’s secondary structure, include the helix, sheet, turn, and coil.

α-helix: The α-helix is the most commonly found helix. It has a tightly packed, coiled backbone core, with the residue side chains extending outward from the central axis. The structure is stabilized by intra-chain hydrogen bonds between the C=O oxygen of one peptide bond and the N-H hydrogen of the peptide bond four residues further along the polypeptide. As a result, amino acids that are four residues apart in the primary structure are brought into close proximity, on successive turns, once the chain folds into the α-helix. An example of an α-helix structure is shown in the following figure (reference).

Besides the α-helix, two more helical conformations occur in proteins: the 3_10-helix and the π-helix. They differ mainly in their hydrogen-bonding pattern. The backbone oxygen of the i-th residue bonds to the backbone nitrogen of the (i+n)-th residue, where n is 3, 4, and 5 for the 3_10-helix, α-helix, and π-helix, respectively. The value of n in turn controls the number of residues per turn, the rise of the helix, and the dihedral angles in these helices [5].
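The i to i+n hydrogen-bonding rule can be made concrete with a short Python sketch. It assumes an idealized helix in which every possible backbone hydrogen bond forms, which real helices only approximate, especially at their ends. The function and dictionary names are hypothetical.

```python
from typing import List, Tuple

# Offset n between the residue donating the C=O oxygen (i) and the
# residue donating the N-H group (i + n), as given in the text.
HELIX_OFFSETS = {"3_10-helix": 3, "alpha-helix": 4, "pi-helix": 5}


def backbone_hbonds(helix_type: str, length: int) -> List[Tuple[int, int]]:
    """List (i, i + n) residue pairs, 1-indexed, for an idealized helix.

    `length` is the number of residues in the helical segment.
    """
    n = HELIX_OFFSETS[helix_type]
    return [(i, i + n) for i in range(1, length - n + 1)]


# Compare the three helix types on a hypothetical 8-residue segment.
for name in HELIX_OFFSETS:
    print(name, backbone_hbonds(name, 8))
```

For the same segment length, a larger n leaves fewer residues able to find a partner, so the π-helix has the fewest backbone hydrogen bonds of the three.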

β-sheet: The β-sheet is another form of secondary structure, in which all of the peptide bond components participate in hydrogen bonding to form pleated sheets. It is built from two or more polypeptide segments, which may be far apart in the same chain or belong to different chains (intra- or inter-chain). The segments align side by side, in either parallel or antiparallel fashion, and are stabilized by hydrogen bonds. The side chains of adjacent amino acids point in opposite directions, alternately above and below the sheet. Unlike the α-helix, where the hydrogen bonds run parallel to the helix axis, the hydrogen bonds in a β-sheet are oriented perpendicular to the polypeptide backbone. An example of a β-sheet structure is shown in the following figure (reference).

In reality, sheets are not as simple as flat arrays of parallel or antiparallel strands. They are also twisted, which creates peaks and valleys within the sheet [5].

Besides these two regular forms of secondary structure, there is a non-repetitive structure called the turn, which changes the direction of the chain. A turn is defined by a hydrogen bond between the first and last residues (i and i+n) of the bend. Turns are classified by the value of n as α-turn (n=4), β-turn (n=3), γ-turn (n=2), δ-turn (n=1), and π-turn (n=5). Among these, the β-turn (also called the β-bend) is the most commonly found class. It consists of four amino acids and sharply changes the direction of the polypeptide chain. β-turns can be further classified into nine types based on the torsion angles of the i+1 and i+2 residues.
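The turn classes listed above amount to a lookup on n, the separation between the two hydrogen-bonded residues. A minimal Python sketch follows, with hypothetical function names and ASCII spellings of the Greek letters.

```python
# Turn class keyed by n, the separation between the hydrogen-bonded
# first residue (i) and last residue (i + n), as listed in the text.
TURN_BY_SEPARATION = {
    1: "delta-turn",
    2: "gamma-turn",
    3: "beta-turn",
    4: "alpha-turn",
    5: "pi-turn",
}


def classify_turn(first: int, last: int) -> str:
    """Name the turn whose first and last residues are hydrogen bonded."""
    n = last - first
    if n not in TURN_BY_SEPARATION:
        raise ValueError(f"no turn class defined for n = {n}")
    return TURN_BY_SEPARATION[n]


def residues_in_turn(first: int, last: int) -> int:
    """Count the residues spanned by the turn, including both ends."""
    return last - first + 1


# Hypothetical turn closed by a hydrogen bond between residues 10 and 13.
print(classify_turn(10, 13), residues_in_turn(10, 13))
```

The residue count for n=3 matches the four amino acids that make up a β-turn.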

Proteins also contain non-repetitive regions with loop or coil conformations, as well as super-secondary structure motifs such as α-α, β-α-β, and the β-barrel, which are produced by close packing of side chains from adjacent secondary structural elements.

Domains are the fundamental functional and 3D units of polypeptides. The core of the domain is constructed from various combinations of super-secondary structures, and the folding of the peptide chain within a domain is independent from the folding in other domains. The primary structure of a polypeptide chain determines its tertiary structure at the domain level, as well as at the level of the final arrangements of domains in the polypeptide.

The interactions between the side chains in the sequence of amino acids guide the folding of the polypeptide to form a compact 3D structure. The interactions which stabilize the tertiary structure are disulfide bonds, hydrophobic interactions, hydrogen bonds, and ionic interactions.
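Two of these interactions can be read roughly from the sequence alone: hydrophobic stretches tend to be buried in the folded core, and disulfide bonds can only form between cysteine residues. The sketch below is an illustration under stated assumptions, not a folding predictor. It uses the Kyte-Doolittle hydropathy scale, which is not part of the discussion above and is brought in here only as a common way to score hydrophobicity. The function names, window size, and example sequence are hypothetical.

```python
from typing import List

# Kyte-Doolittle hydropathy values (assumption: one common scale among
# several; more positive means more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 5) -> List[float]:
    """Average hydropathy over each sliding window along the sequence."""
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


def cysteine_positions(sequence: str) -> List[int]:
    """1-indexed positions of cysteines, the only disulfide candidates."""
    return [
        position
        for position, residue in enumerate(sequence.upper(), start=1)
        if residue == "C"
    ]


# Hypothetical sequence, not taken from a real protein.
example = "MKCLLVIAGDEKRCS"
print([round(value, 2) for value in hydropathy_profile(example)])
print(cysteine_positions(example))
```

Whether two cysteines actually form a disulfide bond depends on the folded geometry and the cellular environment, which a sequence scan like this cannot tell.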

The folding of proteins in a cell involves an ordered pathway: peptide folds → secondary structure formation driven by the hydrophobic effect → larger structure formation from small structures → stabilization of secondary structure → initiation of tertiary structure → formation of the final, fully folded protein monomer in its functional form, characterized by a low-energy conformational state.

The quaternary structure of a protein refers to the interaction of multiple protein chains, or subunits, which assemble into a compact configuration. Each subunit has its own primary, secondary, and tertiary structure. The subunits are held together mainly by noncovalent interactions, such as hydrogen bonds, ionic interactions, and the hydrophobic and van der Waals contacts between nonpolar side chains.

Sequence to Structure

From the above discussion, we can conclude that the three-dimensional structures (tertiary structures or domains) play a pivotal role in understanding the functions of proteins. The 1972 Nobel prize winner Christian Anfinsen’s hypothesis that a protein’s amino acid sequence determines its structure has led to a search for a computational method to predict a protein’s 3D structure from its 1D amino acid sequence.

An example of protein sequence to structure using Somatotropin is shown. [Figure Reference]

Generic Methods

The 3D structure of proteins is generally obtained using experimental methods such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. These experiments require time, labor, and expensive specialized equipment. Through an enormous experimental effort, the structures of around 100,000 unique proteins had been determined before the development of the revolutionary AI-enabled prediction method AlphaFold [3]. The annual rate of structure determination using these methods is presented in the following figure.

Rate of Protein Structure Determination by Method and Year [Figure Reference]

Traditional Computational Methods

In addition to experimental methods, computational methods such as homology modeling, ab-initio modeling, and molecular dynamics simulations were also used for protein structure prediction. Homology modeling involves building a protein model based on the known structure of a similar protein. Ab-initio modeling involves predicting the protein structure from scratch, based on the laws of physics and chemistry. Molecular dynamics simulations involve modeling the behavior of a protein over time using complex simulations.
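Homology modeling depends on how similar the target sequence is to a protein of known structure, and a common first measure of that similarity is percent sequence identity over an alignment. The sketch below assumes the two sequences have already been aligned, with "-" marking gaps, and counts identity over gap-free columns only. That denominator is one convention among several, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns where both residues match.

    Assumes the inputs are rows of an existing pairwise alignment, so
    they must have equal length. Columns with a gap in either row are
    skipped (an assumption; other conventions divide by the full length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a.upper(), aligned_b.upper()):
        if residue_a == gap or residue_b == gap:
            continue
        compared += 1
        if residue_a == residue_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments of a target and a template.
target = "MKT-AYIAK"
template = "MRTGAYLAK"
print(round(percent_identity(target, template), 1))
```

In practice the alignment itself would come from a dedicated alignment tool, and the identity value is only a rough guide to how reliable a template-based model is likely to be.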

Despite the many computational methods available, predicting protein structures accurately remained a significant challenge, and the predictions were often inaccurate or incomplete. This is because predicting the three-dimensional structure of a protein is a complex problem that requires understanding both the interactions between amino acid residues and the folding pathways that proteins take to reach their final structure.

The development of AlphaFold2, a deep learning-based protein structure prediction system, has revolutionized the field by significantly improving the accuracy and speed of protein structure prediction. AlphaFold2 predicts protein structures with high accuracy and has the potential to accelerate research in a variety of fields, including drug discovery, protein engineering, and structural biology. We will discuss AlphaFold2 in more detail in the next post.

At Molecule AI, we’re creating cutting-edge approaches that harness the power of deep learning in the realm of protein design. To learn more, feel free to contact us at info@moleculeai.com.

References:

[1] Ferrier, D. R. (2017). Lippincott illustrated reviews: biochemistry. Wolters Kluwer.

[2] Bucholtz, K. M. (2009). Medicinal Chemistry: An Introduction, (Gareth Thomas).

[3] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S., Ballard, A., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.

[4] Ophardt, C.E. (2003) Diagnostic Serum Enzymes. http://www.elmhurst.edu/~chm/vchembook/641serumenzymes.html

[5] https://proteopedia.org/wiki/index.php/
