
Improved protein structure prediction using potentials from deep learning

Abstract
Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence1. This problem is of fundamental importance as the structure of a protein largely determines its function2; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures3. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force4 that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction5 (CASP13), a blind assessment of the state of the field, AlphaFold created high-accuracy structures (with template modelling (TM) scores6 of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined7.



AlphaFold: Improved protein structure prediction using potentials from deep learning

Andrew W. Senior1, Richard Evans1, John Jumper1, James Kirkpatrick1, Laurent Sifre1, Tim Green1, Chongli Qin1, Augustin Žídek1, Alexander W. R. Nelson1, Alex Bridgland1, Hugo Penedones1, Stig Petersen1, Karen Simonyan1, Steve Crossan1, Pushmeet Kohli1, David T. Jones2,3, David Silver1, Koray Kavukcuoglu1, Demis Hassabis1

1 DeepMind, London, UK
2 The Francis Crick Institute, London, UK
3 University College London, London, UK

These authors contributed equally to this work.
Protein structure prediction aims to determine the three-dimensional shape of a protein from its amino acid sequence1. This problem is of fundamental importance to biology as the structure of a protein largely determines its function2 but can be hard to determine experimentally. In recent years, considerable progress has been made by leveraging genetic information: analysing the co-variation of homologous sequences can allow one to infer which amino acid residues are in contact, which in turn can aid structure prediction3. In this work, we show that we can train a neural network to accurately predict the distances between pairs of residues in a protein, which convey more about structure than contact predictions. With this information we construct a potential of mean force4 that can accurately describe the shape of a protein. We find that the resulting potential can be optimised by a simple gradient descent algorithm to realise structures without the need for complex sampling procedures. The resulting system, named AlphaFold, has been shown to achieve high accuracy, even for sequences with relatively few homologous sequences. In the most recent Critical Assessment of Protein Structure Prediction5 (CASP13), a blind assessment of the state of the field of protein structure prediction, AlphaFold created high-accuracy structures (with TM-scores of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, using sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a significant advance in protein structure prediction. We expect the increased accuracy of structure predictions for proteins to enable insights in understanding the function and malfunction of these proteins, especially in cases where no homologous proteins have been experimentally determined7.
Proteins are at the core of most biological processes. Since the function of a protein is dependent on its structure, understanding protein structure has been a grand challenge in biology for decades. While several experimental structure determination techniques have been developed and improved in accuracy, they remain difficult and time-consuming2. As a result, decades of theoretical work has attempted to predict protein structure from amino acid sequences.

Footnote: The Template Modelling (TM) score6, between 0 and 1, measures the degree of match of the overall (backbone) shape of a proposed structure to a native structure.
[Figure 1: (a) cumulative count of FM domains against TM-score cutoff, AlphaFold vs other groups; (b) per-target TM-scores for the six new folds T0953s2-D3, T0968s2-D1, T0990-D1, T0990-D2, T0990-D3 and T1017s2-D1; (c) contact-precision table, reproduced below.]

Fig. 1 | AlphaFold's performance in the CASP13 assessment. (a) Number of free modelling (FM + FM/TBM) domains predicted to a given TM-score threshold for AlphaFold and the other 97 groups. (b) For the six new folds identified by the CASP13 assessors, AlphaFold's TM-score compared with the other groups, with native structures. The structure of T1017s2-D1 is unavailable for publication. (c) Precisions for long-range contact prediction in CASP13 for the most probable L, L/2 or L/5 contacts, where L is the length of the domain. The distance distributions used by AlphaFold (AF) in CASP13, thresholded to contact predictions, are compared with submissions by the two best-ranked contact prediction methods in CASP13: 498 (RaptorX-Contact8) and 032 (TripletRes9), on "all groups" targets, excluding T0999.

Contact precisions (%):

                 L long            L/2 long          L/5 long
Set      N    AF    498   032   AF    498   032   AF    498   032
FM       31   45.5  42.9  39.8  58.0  55.1  51.7  70.1  67.3  61.6
FM/TBM   12   59.1  53.0  48.9  74.2  64.5  64.2  85.3  81.0  79.6
TBM      61   68.3  65.5  61.9  82.4  80.3  76.4  90.6  90.5  87.1
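The precision metric of Fig. 1c can be made concrete with a small sketch (the helper below is illustrative, not from the paper): take the k most probable long-range pairs (CASP defines "long" as sequence separation of at least 24 residues) and score the fraction whose true Cβ distance is under 8 Å.

```python
import numpy as np

def long_range_precision(prob, dist, k, min_sep=24, thresh=8.0):
    """Precision of the k most probable long-range contact predictions.

    prob: (L, L) predicted contact probabilities (symmetric).
    dist: (L, L) true C-beta distances in Angstroms.
    """
    L = prob.shape[0]
    i, j = np.triu_indices(L, k=min_sep)        # pairs with j - i >= min_sep
    order = np.argsort(prob[i, j])[::-1][:k]    # k most confident pairs
    return float(np.mean(dist[i[order], j[order]] < thresh))

# Toy example: "oracle" predictions that match the true contacts exactly
rng = np.random.default_rng(0)
L = 64
dist = rng.uniform(4.0, 20.0, size=(L, L))
dist = (dist + dist.T) / 2
prob = (dist < 8.0).astype(float)
print(long_range_precision(prob, dist, k=L // 5))
```

An oracle predictor scores a precision of 1.0; real methods are scored exactly this way against the experimentally determined structure.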
CASP5 is a biennial blind protein structure prediction assessment run by the structure prediction community to benchmark progress in accuracy. In 2018, AlphaFold joined 97 groups from around the world in entering CASP13. Each group submitted up to 5 structure predictions for each of 84 protein sequences whose experimentally determined structures were sequestered. Assessors divided the proteins into 104 domains for scoring and classified each as being amenable to template-based modelling (TBM, where a protein with a similar sequence has a known structure, and that homologous structure is modified in accordance with the sequence differences) or requiring free modelling (FM, when no homologous structure is available), with an intermediate (FM/TBM) category. Figure 1a shows that AlphaFold stands out in performance above the other entrants, predicting more FM domains to high accuracy than any other system, particularly in the 0.6–0.7 TM-score range. The assessors ranked the 98 participating groups by the summed, capped z-scores of the structures, separated according to category. AlphaFold achieved a summed z-score of 52.8 in the FM category (best-of-5) vs 36.6 for the next closest group (322). Combining FM and TBM/FM categories, AlphaFold scored 68.3 vs 48.2. AlphaFold is able to predict previously unknown folds to high accuracy, as shown in Figure 1b. Despite using only free modelling techniques and not using templates, AlphaFold also scored well in the TBM category according to the assessors' formula (0-capped z-score), ranking fourth by the top-1 model or first by the best-of-5 models. Much of the accuracy of AlphaFold is due to the accuracy of the distance predictions, which is evident from the high precision of the contact predictions of Fig. 1c.
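The assessors' ranking formula is simple enough to sketch (a hypothetical reconstruction of the summed 0-capped z-score, not the assessors' actual code): per domain, each group's score is standardised against all groups, clipped below at zero, and summed over domains.

```python
import numpy as np

def summed_capped_zscore(scores):
    """Sum of per-domain z-scores, capped below at zero.

    scores: (n_groups, n_domains) array of per-domain quality scores
    (e.g. GDT_TS or TM-score); higher is better.
    """
    mean = scores.mean(axis=0)
    std = scores.std(axis=0)
    z = (scores - mean) / np.where(std > 0, std, 1.0)  # avoid division by zero
    return np.clip(z, 0.0, None).sum(axis=1)           # cap at 0, sum over domains

# Toy example: group 0 is consistently above the field average
scores = np.array([[0.9, 0.8, 0.7],
                   [0.5, 0.4, 0.3],
                   [0.4, 0.5, 0.3]])
totals = summed_capped_zscore(scores)
print(totals.argmax())  # group 0 ranks first
```

Capping at zero means a group is rewarded for domains it predicts unusually well but not punished further for outright failures, which favours methods with high peak accuracy.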
The most successful free modelling approaches so far10–12 have relied on fragment assembly to determine the shape of the protein of interest. In these approaches a structure is created through a stochastic sampling process, such as simulated annealing13, that minimises a statistical potential derived from summary statistics extracted from structures in the Protein Data Bank (PDB14). In fragment assembly, a structure hypothesis is repeatedly modified, typically by changing the shape of a short section, retaining changes which lower the potential, ultimately leading to low-potential structures. Simulated annealing requires many thousands of such moves and must be repeated many times to have good coverage of low-potential structures.

In recent years, structure prediction accuracy has improved through the use of evolutionary covariation data15 found in sets of related sequences. Sequences similar to the target sequence are found by searching large datasets of protein sequences derived from DNA sequencing and aligned to the target sequence to make a multiple sequence alignment (MSA). Correlated changes in two amino acid residue positions across the sequences of the MSA can be used to infer which residues might be in contact. Contacts are typically defined to occur when the β-carbon atoms of two residues are within 8 Å of one another. Several methods have been used to predict the probability that a pair of residues is in contact based on features computed from MSAs16–19, including neural networks20–23. Contact predictions are incorporated in structure prediction by modifying the statistical potential to guide the folding process to structures that satisfy more of the predicted contacts12,24. Previous work25,26 has made predictions of the distance between residues, particularly for distance geometry approaches8,27–29. Neural network distance predictions without covariation features were used to make the EPAD potential26, which was used for ranking structure hypotheses, and the QUARK pipeline12 used a template-based distance profile restraint for template-based modelling.
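The contact definition above is mechanical enough to sketch (coordinates here are made up for illustration): two residues are in contact when their Cβ atoms lie within 8 Å of one another.

```python
import numpy as np

def contact_map(cb_coords, thresh=8.0):
    """Boolean (L, L) contact map from C-beta coordinates of shape (L, 3)."""
    diff = cb_coords[:, None, :] - cb_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))  # pairwise Euclidean distances
    return dist < thresh

# Three residues on a line, 5 A apart: neighbours touch, the ends do not
coords = np.array([[0.0, 0, 0], [5.0, 0, 0], [10.0, 0, 0]])
cmap = contact_map(coords)
print(cmap[0, 1], cmap[0, 2])  # True False
```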
In this work we present a new, deep-learning approach to protein structure prediction, whose stages are illustrated in Figure 2a. We show that it is possible to construct a learned, protein-specific potential by training a neural network (Fig. 2b) to make accurate predictions about the structure of the protein given its sequence, and to predict the structure itself accurately by minimising the potential by gradient descent (Fig. 2c). The neural network predictions include backbone torsion angles and pairwise distances between residues. Distance predictions provide more specific information about the structure than contact predictions and provide a richer training signal for the neural network. Predicting distances, rather than contacts as in most prior work, models detailed interactions rather than simple binary decisions. By jointly predicting many distances, the network can propagate distance information respecting covariation, local structure and residue identities to nearby residues. The predicted probability distributions can be combined to form a simple, principled protein-specific potential. We show that with gradient descent, it is simple to find a set of torsion angles that minimise this protein-specific potential using only limited sampling. We also show that whole chains can be optimised together, avoiding the need for segmenting long proteins into hypothesised domains which are modelled independently.

Footnote: Ranking results from http://predictioncenter.org/casp13/zscores_final.cgi?formula=assessors

[Figure 2: pipeline from sequence & MSA features (L×L 2D covariation features; tiled L×1 1D sequence & profile features), through a deep neural network of 220 residual convolution blocks, to distance and torsion distribution predictions (64-bin distograms predicted over 64 × 64-residue regions) and gradient descent on a protein-specific potential.]

Fig. 2 | The folding process illustrated for CASP13 target T0986s2 (length L = 155). (a) Steps of structure prediction. (b) The neural network predicts the entire L × L distogram based on MSA features, accumulating separate predictions for 64 × 64-residue regions. (c) One iteration of gradient descent (1,200 steps) is shown, with TM-score and RMSD plotted against step number with five snapshots of the structure. The secondary structure (from SST30) is also shown (helix in blue, strand in red) along with the native secondary structure (SS), the network's secondary structure prediction probabilities and the uncertainty in torsion angle predictions (as κ⁻¹ of the von Mises distributions fitted to the predictions for φ and ψ). While each step of gradient descent greedily lowers the potential, large global conformation changes are effected, resulting in a well-packed chain. (d) The final first submission overlaid on the native structure (in grey). (e) The average (across the test set, n = 377) TM-score of the lowest-potential structure against the number of repeats of gradient descent (log scale).
The central component of AlphaFold is a convolutional neural network which is trained on PDB structures to predict the distances d_ij between the Cβ atoms of pairs, ij, of a protein's residues. Based on a representation of the protein's amino acid sequence, S, and features derived from the sequence's MSA, the network, similar in structure to those used for image recognition tasks31, predicts a discrete probability distribution P(d_ij | S, MSA(S)) for every ij pair in a 64 × 64-residue region, as shown in Fig. 2b. The full set of distance distribution predictions is constructed by averaging predictions for overlapping regions and is termed a distogram (from distance histogram). Figure 3 shows an example distogram prediction for one CASP13 protein, T0955. The modes of the distribution (Fig. 3c) can be seen to closely match the true distances (Fig. 3b). Example distributions for all distances to one residue (residue 29) are shown in Fig. 3c. Further analysis of how the network predicts the distances is shown in Methods Figure 14.
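Assembling a full distogram from overlapping crops can be sketched as follows (a schematic reconstruction; the crop stride and the stand-in "network" are assumptions, not details from the paper): per-crop bin distributions are accumulated into the full L × L grid and divided by the per-cell coverage count.

```python
import numpy as np

def assemble_distogram(crop_predict, L, crop=64, stride=32, bins=64):
    """Average per-crop bin distributions into a full (L, L, bins) distogram.

    crop_predict(i, j) returns a (crop, crop, bins) probability array for the
    window whose top-left corner is (i, j); here it stands in for the network.
    """
    full = np.zeros((L, L, bins))
    count = np.zeros((L, L, 1))
    starts = list(range(0, max(L - crop, 0) + 1, stride)) or [0]
    for i in starts:
        for j in starts:
            full[i:i + crop, j:j + crop] += crop_predict(i, j)
            count[i:i + crop, j:j + crop] += 1
    return full / count  # average over all crops covering each cell

# Stand-in "network": a uniform distribution over the 64 distance bins
uniform = lambda i, j: np.full((64, 64, 64), 1.0 / 64)
d = assemble_distogram(uniform, L=128)
print(d.shape, float(d[0, 0].sum()))  # (128, 128, 64) 1.0
```

Because each cell's value is an average of probability distributions, every cell of the assembled distogram still sums to 1 over the distance bins.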
In order to realise structures that conform to the distance predictions, we construct a smooth potential V_distance by fitting a spline to the negative log probabilities and summing across all residue pairs. We parameterise protein structures by the backbone torsion angles (φ, ψ) of all residues and build a differentiable model of protein geometry x = G(φ, ψ) to compute the Cβ coordinates, x, and thus the inter-residue distances, d_ij = ‖x_i − x_j‖, for each structure, and express V_distance as a function of φ and ψ. For a protein with L residues, this potential accumulates L² terms from marginal distribution predictions. To correct for the over-representation of the prior, we subtract a reference distribution32 from the distance potential in the log domain. The reference distribution models the distance distributions P(d_ij | length) independent of the protein sequence and is computed by training a small version of the distance prediction neural network on the same structures, without sequence or MSA input features. A separate output head of the contact prediction network is trained to predict discrete probability distributions of backbone torsion angles P(φ_i, ψ_i | S, MSA(S)). After fitting a von Mises distribution, this is used to add a smooth torsion modelling term V_torsion = −Σ_i log p_vonMises(φ_i, ψ_i | S, MSA(S)) to the potential. Finally, to prevent steric clashes, we add Rosetta's V_score2_smooth10 to the potential, as this incorporates a van der Waals term. We used multiplicative weights for each of the three terms in the potential, but no weighting noticeably outperformed equal weighting.
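The distance-to-potential step can be sketched numerically for a single residue pair (a simplified illustration: the bin range, the toy distribution and descending on the distance directly are all assumptions; in AlphaFold the descent is over torsion angles, with distances computed through the geometry model G): fit a cubic spline to the negative log bin probabilities so the potential is smooth and differentiable in d, then follow its gradient.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Assumed discretisation: 64 bins spanning 2-22 A (illustrative values only)
bin_centres = np.linspace(2.0, 22.0, 64)

# Toy predicted distribution for one residue pair, peaked near 8 A
logits = -0.5 * ((bin_centres - 8.0) / 1.5) ** 2
probs = np.exp(logits) / np.exp(logits).sum()

# Smooth potential: cubic spline through the negative log probabilities
V = CubicSpline(bin_centres, -np.log(probs))

# Gradient descent on the pair distance itself
d = 15.0
for _ in range(200):
    d -= 0.05 * V(d, 1)  # V(d, 1) evaluates the spline's first derivative
print(round(d, 1))       # converges towards the mode of the distribution, near 8
```

Summing such spline terms over all L² pairs, as a function of (φ, ψ) through G, gives the protein-specific V_distance described above.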
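The torsion modelling term can be made concrete with a one-angle sketch (illustrative only: it uses SciPy's univariate von Mises density and assumed parameter values, whereas the paper fits distributions over (φ, ψ) per residue):

```python
from scipy.stats import vonmises

# Assumed predicted torsion distribution for one residue:
# mean angle mu = -1.0 rad, concentration kappa = 4.0
kappa, mu = 4.0, -1.0

# Negative log-likelihood term, as in V_torsion
V_torsion = lambda phi: -vonmises.logpdf(phi, kappa, loc=mu)

# The term is smallest at the predicted mean and grows as the angle moves away
print(V_torsion(mu) < V_torsion(mu + 1.5))  # True
```

The concentration κ controls how sharply the term penalises deviation from the predicted angle, which is why κ⁻¹ serves as an uncertainty measure in Fig. 2c.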

References (selected)

- Deep Residual Learning for Image Recognition
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
- Optimization by Simulated Annealing
- The Protein Data Bank
- Dropout: a simple way to prevent neural networks from overfitting