Protein Folding With AlphaFold2: Chapter Two
In the last blog, we briefly discussed the AlphaFold2 [1] architecture for the protein folding problem. We divided the architecture into three modules: (i) sequence preprocessing, (ii) Evoformer, and (iii) structure generation. In this write-up, we discuss the first module along with the input data for AlphaFold2.
Where to get the protein sequence information?
For any machine learning model, the starting point is the data: the input for which the model should predict a pattern. For the protein folding problem, we need a database in which known amino acid sequences are paired with their structures, which serve as the labels. Experimentally determined 3D structures of biological macromolecules are archived in the open-access Protein Data Bank (PDB, wwPDB Consortium 2019), which has been managed by the Worldwide Protein Data Bank (wwPDB) partnership since 2003 (Berman et al., 2003). These structural data are freely available from the PDB FTP Archive (https://files.wwpdb.org/pub/pdb/data/). The mmCIF (macromolecular Crystallographic Information File) files contain the details of each protein, such as release date, chain information, and atom types and positions, which the model uses at training time.
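To make the mmCIF format concrete, here is a minimal sketch that reads the atom records from a toy mmCIF fragment using only the Python standard library. The fragment and its coordinates are made up for illustration, the helper name `parse_atom_site` is our own, and the parser is deliberately simplified: it splits values on whitespace, so it does not handle quoted values containing spaces, which real mmCIF files may have. For real files, a dedicated mmCIF parsing library is the safer choice.

```python
# Toy mmCIF fragment (hypothetical coordinates, subset of the usual columns).
TOY_MMCIF = """
data_toy
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N  MET A 1 27.340 24.430 2.614
ATOM 2 CA MET A 1 26.266 25.413 2.842
ATOM 3 CA GLY A 2 24.000 23.100 4.500
#
"""


def parse_atom_site(mmcif_text):
    """Return the rows of the `_atom_site` loop as a list of dicts.

    Simplification: values are split on whitespace, so quoted values
    with embedded spaces are not supported.
    """
    columns, rows = [], []
    for raw in mmcif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            # Column header, e.g. "_atom_site.Cartn_x" -> "Cartn_x"
            columns.append(line.split(".", 1)[1])
        elif columns and line and not line.startswith(("#", "_", "loop_", "data_")):
            values = line.split()
            if len(values) == len(columns):
                rows.append(dict(zip(columns, values)))
        elif columns and rows:
            break  # the atom_site loop has ended
    return rows


rows = parse_atom_site(TOY_MMCIF)

# Keep only the alpha-carbon (CA) atoms: residue name, index, and position.
ca_atoms = [
    (r["label_comp_id"], int(r["label_seq_id"]),
     (float(r["Cartn_x"]), float(r["Cartn_y"]), float(r["Cartn_z"])))
    for r in rows
    if r["label_atom_id"] == "CA"
]
print(ca_atoms)
```

The alpha-carbon positions extracted this way are the kind of per-residue coordinate labels a structure prediction model is trained against.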
There is also another input format, FASTA: a text-based format for representing amino acid (protein) sequences in which each amino acid is written as a single-letter code. FASTA files are used at inference time and can be downloaded from UniProt.
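As a small illustration of the FASTA format, the sketch below parses FASTA text into a mapping from header to sequence. The record shown is a made-up example, not a real UniProt entry, and `parse_fasta` is our own helper name.

```python
# Hypothetical FASTA content: a ">" header line followed by sequence lines.
EXAMPLE_FASTA = """\
>toy_protein_1 hypothetical example
MKTAYIAKQR
QISFVKSHFS
>toy_protein_2 hypothetical example
GSHMLEDPVA
"""


def parse_fasta(text):
    """Parse FASTA text into {header: sequence}.

    Sequence lines belonging to one record are concatenated.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


sequences = parse_fasta(EXAMPLE_FASTA)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

At inference time, a single sequence like one of these is the only input the user has to supply; everything else is derived from database searches.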
Is this data sufficient for running the AI model of AlphaFold2?
The answer is definitely “No”. We know from the physics of protein structure that a protein chain tries to lower its free energy to reach a stable conformation: it forms hydrogen bonds and folds from the sequence into a 3D shape. Although applying these geometrical and physical constraints to every atom can generate accurate structures, it is computationally intractable for large macromolecules like proteins [2]. Hence, the solution comes in two forms: (i) template-based and (ii) template-free modeling.
In the first case, the idea is that proteins with similar amino acid sequences (more than ~30% sequence identity) are known to fold into similar structures. The structure of a new protein can therefore be predicted from the known structure of a homologous protein, provided its sequence is sufficiently similar. Since PDB information is publicly available, it can be used to generate the template. The natural question is “how?” The similarity score is computed from the profiles (based on the frequency of each amino acid at each position) of the query and of the database entries, which are then ranked. A state-of-the-art method for finding this similarity is HHSearch, which uses a hidden Markov model based score and returns pairwise query-database alignments [3]. AlphaFold2 uses the same principle to guide structure prediction: it searches the PDB70 database with HHSearch and passes the homologous structure information to the deep learning modules as templates.
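The ~30% rule of thumb refers to sequence identity, which can be sketched for two already-aligned sequences as follows. This is only an illustration of the identity measure, not of the HMM-based scoring that HHSearch performs. The two sequences are made up, and we assume one common convention: gapped positions are excluded from the count (other definitions of the denominator exist).

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical pre-aligned query and database hit.
query = "MKTAYIAK-QR"
hit = "MKSAYLAKAQR"

identity = percent_identity(query, hit)
print(f"{identity:.1f}% identity")
print("template candidate" if identity > 30.0 else "too distant for a simple template")
```

Here 8 of the 10 compared positions match, so the hit would comfortably pass the ~30% threshold.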
Template-free modeling comes into the picture when no significantly similar sequences are found in the existing protein databases. Multiple sequence alignments can be used to identify evolutionarily conserved protein families. Access to vast amounts of sequencing and experimentally determined structural data shows that, in conserved protein families, amino acids that are in close proximity in 3D may be mutated so that their locations are exchanged while the interactions and the physical distance between them are preserved. These correlated mutations, or covariations, can provide distance constraints and guide the prediction of intramolecular contacts. AlphaFold2 also uses the multiple sequence alignment to guide its predictions.
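One classic way to quantify such covariation is the mutual information between two alignment columns. The sketch below computes it on a toy alignment in which a lysine (K) and a glutamate (E) swap places between columns 1 and 3. The alignment is made up, and this is a textbook covariation measure shown for intuition only; AlphaFold2 learns from the MSA directly rather than computing this statistic.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA."""
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 1 and 3 always mutate together (K/E swap).
toy_msa = [
    "AKLE",
    "AELK",
    "AKLE",
    "AELK",
]

print(mutual_information(toy_msa, 1, 3))  # co-varying pair of columns
print(mutual_information(toy_msa, 0, 1))  # column 0 never changes
```

The co-varying pair scores one full bit, while a pair involving the invariant column scores zero, which is why high-scoring column pairs are candidates for residues in contact.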
What is Multiple Sequence Alignment (MSA)? Is it really important in solving the protein folding problem?
MSA plays an important role in predicting the structures of protein sequences. In bioinformatics, sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of amino acid residues from different species are typically represented as rows of a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. This matrix may contain conserved regions (where the amino acids match exactly) and conservative mutations (where an amino acid is replaced by another with the same biochemical properties). The conserved regions show similar patterns, or sequence motifs, across the different sequences. To predict the structure of a target protein, it helps to have the evolutionary relationships and phylogenetic information from the genetic databases. This boosts prediction performance by informing the model about protein domains and structures from the conserved regions of multiple sequences.
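The matrix view of an MSA can be sketched in a few lines: each row is a sequence, each column an aligned position, and each column can be labeled as conserved, conservatively mutated, or variable. The alignment below is made up, and the grouping of amino acids by biochemical property is one coarse, hypothetical choice among several used in practice.

```python
# Coarse biochemical groups (an assumption; groupings vary between sources).
GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def classify_column(column, gap="-"):
    """Label one MSA column, ignoring gap characters."""
    residues = set(column) - {gap}
    if not residues:
        return "gap"
    if len(residues) == 1:
        return "conserved"  # every sequence has the same residue
    if any(residues <= members for members in GROUPS.values()):
        return "conservative"  # all residues share one biochemical group
    return "variable"


# Toy MSA: rows are sequences, "-" marks an inserted gap.
toy_msa = [
    "MKV-LDS",
    "MRVALEW",
    "MKIALDK",
]

# zip(*toy_msa) iterates over the columns of the alignment matrix.
labels = [classify_column(column) for column in zip(*toy_msa)]
for position, (column, label) in enumerate(zip(zip(*toy_msa), labels)):
    print(position, "".join(column), label)
```

In this toy case the K/R, V/I, and D/E columns are conservative mutations, while the last column mixes unrelated residue types and is labeled variable.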
Databases used in AlphaFold2:
(i) Sequence Databases: The following genetic databases are used for constructing MSAs with JackHMMER [7] and HHblits [8]. The databases are -
- UniRef90 [4]
- BFD [5]
- MGnify [6]
(ii) Structure Databases: Both the training samples and the template building require structure information. These are sourced from the following databases -
- PDB (training)
- PDB70 clustering (template search with HHSearch [8])
Summary of the Sequence Preprocessing Module of AlphaFold2:
In AlphaFold2, profile-based searching is used to identify homologous proteins that can serve as templates for homology modeling.
Profile-based searching involves using a profile, which is a matrix that summarizes the frequency of each amino acid at each position in a multiple sequence alignment (MSA) of related protein sequences. This profile is used to search sequence databases to identify related sequences that share similar amino acid profiles.
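A profile of this kind can be sketched directly from its definition: for every column of the MSA, count how often each amino acid appears and normalize. The alignment below is a made-up toy example, and the sketch leaves out the pseudocounts and sequence weighting that real search tools typically apply.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_profile(msa):
    """Return one {amino_acid: frequency} dict per MSA column.

    Gaps and non-standard characters are ignored when counting.
    """
    profile = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment of four related sequences.
toy_msa = ["MKV", "MRV", "MKI", "MK-"]
profile = build_profile(toy_msa)

for position, frequencies in enumerate(profile):
    observed = {aa: round(f, 2) for aa, f in frequencies.items() if f > 0}
    print(position, observed)
```

Each row of the resulting matrix describes what the protein family tolerates at that position, which is what lets a profile find relatives that a single-sequence search would miss.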
AlphaFold2 uses two tools for profile-based searching: JackHMMER and HHblits. JackHMMER iteratively builds a profile HMM from the query and searches sequence databases for related protein sequences, while HHblits iteratively searches a database of profile HMMs using HMM-HMM alignment. Both tools produce MSAs.
The MSAs generated by these methods carry evolutionary information, such as co-evolutionary couplings, and are passed as input to the Evoformer module of AlphaFold2, whose output the structure module then uses to generate 3D structure predictions.
Finally, the input embeddings for both the MSA features and the pairwise features are constructed using linear layers, which are trained end-to-end along with the rest of the structure prediction network. In the next chapter, we will discuss the next modules in detail.
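To give a feel for what such an input embedding looks like, here is a toy sketch in plain Python. It one-hot encodes each residue, projects it with a linear layer to obtain an MSA representation, and builds a pair representation by summing two linear projections of the target sequence, one indexed by row and one by column. This is a simplified illustration under our own assumptions: the channel sizes `C_M` and `C_Z` are tiny hypothetical values, the weights are random rather than trained, and the real input features are much richer than a one-hot encoding.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus the gap symbol


def one_hot(residue):
    """One-hot feature vector for a single residue."""
    return [1.0 if residue == symbol else 0.0 for symbol in ALPHABET]


def init_linear(in_dim, out_dim, rng):
    """Random weights and zero bias for a linear layer (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, weights, bias):
    """Compute y = W x + b for one feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


rng = random.Random(0)
C_M, C_Z = 8, 4  # toy channel sizes (hypothetical values)

toy_msa = ["MKV", "MRV", "MK-"]  # first row is treated as the target sequence

w_msa, b_msa = init_linear(len(ALPHABET), C_M, rng)
w_a, b_a = init_linear(len(ALPHABET), C_Z, rng)
w_b, b_b = init_linear(len(ALPHABET), C_Z, rng)

# MSA representation: shape (num_sequences, num_residues, C_M)
msa_repr = [[linear(one_hot(res), w_msa, b_msa) for res in seq]
            for seq in toy_msa]

# Pair representation: z[i][j] = a[i] + b[j], shape (num_residues, num_residues, C_Z)
target_features = [one_hot(res) for res in toy_msa[0]]
a = [linear(f, w_a, b_a) for f in target_features]
b = [linear(f, w_b, b_b) for f in target_features]
pair_repr = [[[x + y for x, y in zip(a_i, b_j)] for b_j in b] for a_i in a]

print(len(msa_repr), len(msa_repr[0]), len(msa_repr[0][0]))
print(len(pair_repr), len(pair_repr[0]), len(pair_repr[0][0]))
```

The point of the sketch is the shapes: one vector per (sequence, residue) entry for the MSA representation and one vector per residue pair for the pair representation.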
At Molecule AI, we’re creating cutting-edge approaches that harness the power of deep learning in the realm of protein design. To learn more, feel free to contact us at info@moleculeai.com .
References:
[1] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, T., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S., Ballard, A., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.
[2] https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/computed-structure-models
[3] https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
[4] Suzek et al., Bioinformatics (2015) doi:10.1093/bioinformatics/btu739
[5] Steinegger et al., Nature Methods (2019) doi:10.1038/s41592-019-0437-4
[6] Mitchell et al., Nucleic Acids Research (2019) doi:10.1093/nar/gkz1035
[7] Potter et al., Nucleic Acids Research (2018) doi:10.1093/nar/gky448
[8] Steinegger et al., BMC Bioinformatics (2019) doi:10.1186/s12859-019-3019-7