AlphaFold: Improved protein structure prediction using potentials from deep learning

Andrew W. Senior^1*, Richard Evans^1*, John Jumper^1*, James Kirkpatrick^1*, Laurent Sifre^1*, Tim Green^1, Chongli Qin^1, Augustin Žídek^1, Alexander W. R. Nelson^1, Alex Bridgland^1, Hugo Penedones^1, Stig Petersen^1, Karen Simonyan^1, Steve Crossan^1, Pushmeet Kohli^1, David T. Jones^2,3, David Silver^1, Koray Kavukcuoglu^1, Demis Hassabis^1

^1 DeepMind, London, UK
^2 The Francis Crick Institute, London, UK
^3 University College London, London, UK
^* These authors contributed equally to this work.
Protein structure prediction aims to determine the three-dimensional shape of a protein from its amino acid sequence^1. This problem is of fundamental importance to biology, as the structure of a protein largely determines its function^2 but can be hard to determine experimentally. In recent years, considerable progress has been made by leveraging genetic information: analysing the co-variation of homologous sequences can allow one to infer which amino acid residues are in contact, which in turn can aid structure prediction^3. In this work, we show that we can train a neural network to accurately predict the distances between pairs of residues in a protein, which convey more information about structure than contact predictions. With this information we construct a potential of mean force^4 that can accurately describe the shape of a protein. We find that the resulting potential can be optimised by a simple gradient descent algorithm to realise structures without the need for complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with relatively few homologous sequences. In the most recent Critical Assessment of Protein Structure Prediction^5 (CASP13), a blind assessment of the state of the field of protein structure prediction, AlphaFold created high-accuracy structures (with TM-scores† of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, using sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a significant advance in protein structure prediction. We expect the increased accuracy of structure predictions to yield insights into the function and malfunction of proteins, especially in cases where no structure of a homologous protein has been experimentally determined^7.
† The Template Modelling score^6, between 0 and 1, measures the degree of match of the overall (backbone) shape of a proposed structure to a native structure.

Proteins are at the core of most biological processes. Since the function of a protein is dependent on its structure, understanding protein structure has been a grand challenge in biology for decades. While several experimental structure determination techniques have been developed and improved in accuracy, they remain difficult and time-consuming^2. As a result, decades of theoretical work has attempted to predict protein structure from amino acid sequences.
[Figure 1 panels: (a) plot of FM domain count against TM-score cutoff for AlphaFold versus the other groups; (b) bar chart of TM-scores for the six new-fold targets T0953s2-D3, T0968s2-D1, T0990-D1, T0990-D2, T0990-D3 and T1017s2-D1; (c) the table below.]

Contact precisions      L long             L/2 long           L/5 long
Set      N        AF    498   032    AF    498   032    AF    498   032
FM       31       45.5  42.9  39.8   58.0  55.1  51.7   70.1  67.3  61.6
FM/TBM   12       59.1  53.0  48.9   74.2  64.5  64.2   85.3  81.0  79.6
TBM      61       68.3  65.5  61.9   82.4  80.3  76.4   90.6  90.5  87.1

Fig. 1 | AlphaFold's performance in the CASP13 assessment. (a) Number of free modelling (FM + FM/TBM) domains predicted to a given TM-score threshold for AlphaFold and the other 97 groups. (b) For the six new folds identified by the CASP13 assessors, AlphaFold's TM-score compared with the other groups, with native structures. The structure of T1017s2-D1 is unavailable for publication. (c) Precisions for long-range contact prediction in CASP13 for the most probable L, L/2 or L/5 contacts, where L is the length of the domain. The distance distributions used by AlphaFold (AF) in CASP13, thresholded to contact predictions, are compared with the submissions of the two best-ranked contact prediction methods in CASP13: 498 (RaptorX-Contact^8) and 032 (TripletRes^9), on "all groups" targets, excluding T0999.
CASP^5 is a biennial blind protein structure prediction assessment run by the structure prediction community to benchmark progress in accuracy. In 2018, AlphaFold joined 97 groups from around the world in entering CASP13. Each group submitted up to 5 structure predictions for each of 84 protein sequences whose experimentally determined structures were sequestered. Assessors divided the proteins into 104 domains for scoring and classified each as being amenable to template-based modelling (TBM, where a protein with a similar sequence has a known structure, and that homologous structure is modified in accordance with the sequence differences) or as requiring free modelling (FM, when no homologous structure is available), with an intermediate (FM/TBM) category. Figure 1a shows that AlphaFold stands out in performance above the other entrants, predicting more FM domains to high accuracy than any other system, particularly in the 0.6–0.7 TM-score range. The assessors ranked the 98 participating groups by the summed, capped z-scores of the structures, separated according to category. AlphaFold achieved a summed z-score of 52.8 in the FM category (best-of-5) vs 36.6 for the next closest group (322)‡. Combining the FM and FM/TBM categories, AlphaFold scored 68.3 vs 48.2. AlphaFold is able to predict previously unknown folds to high accuracy, as shown in Figure 1b. Despite using only free modelling techniques and not using templates, AlphaFold also scored well in the TBM category according to the assessors' formula (0-capped z-score), ranking fourth by the top-1 model or first by the best-of-5 models. Much of the accuracy of AlphaFold is due to the accuracy of the distance predictions, which is evident from the high precision of the contact predictions shown in Fig. 1c.
The most successful free modelling approaches so far^10–12 have relied on fragment assembly to determine the shape of the protein of interest. In these approaches a structure is created through a stochastic sampling process, such as simulated annealing^13, that minimises a statistical potential derived from summary statistics extracted from structures in the Protein Data Bank (PDB^14). In fragment assembly, a structure hypothesis is repeatedly modified, typically by changing the shape of a short section, retaining changes that lower the potential and ultimately leading to low-potential structures. Simulated annealing requires many thousands of such moves and must be repeated many times to achieve good coverage of low-potential structures.
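The sampling loop described above can be sketched as follows. This is a generic simulated-annealing skeleton, not Rosetta or QUARK internals: `energy` and `propose` are placeholders for a statistical potential and a fragment-substitution move.

```python
import math
import random

def simulated_annealing(energy, propose, x0, n_steps=10000, t0=1.0, t_min=1e-3):
    """Minimise an energy function by stochastic sampling.

    `energy` maps a structure hypothesis to a scalar potential;
    `propose` returns a locally modified copy of a hypothesis
    (e.g. a fragment swap). Both are illustrative placeholders.
    """
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for step in range(n_steps):
        # Geometric cooling schedule from t0 down to t_min.
        t = max(t_min, t0 * (t_min / t0) ** (step / n_steps))
        cand = propose(x)
        e_cand = energy(cand)
        # Metropolis criterion: always accept improvements, and
        # sometimes accept uphill moves to escape local minima.
        if e_cand < e or random.random() < math.exp((e - e_cand) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

The need to repeat this loop many times from different starting points, noted above, is what the gradient-descent approach introduced later avoids.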
In recent years, structure prediction accuracy has improved through the use of evolutionary covariation data^15 found in sets of related sequences. Sequences similar to the target sequence are found by searching large datasets of protein sequences derived from DNA sequencing and are aligned to the target sequence to make a multiple sequence alignment (MSA). Correlated changes in two amino acid residue positions across the sequences of the MSA can be used to infer which residues might be in contact. Contacts are typically defined to occur when the β-carbon atoms of two residues are within 8 Å of one another. Several methods, including neural networks^20–23, have been used to predict the probability that a pair of residues is in contact based on features computed from MSAs^16–19. Contact predictions are incorporated into structure prediction by modifying the statistical potential to guide the folding process towards structures that satisfy more of the predicted contacts^12,24. Previous work^25,26 has made predictions of the distance between residues, particularly for distance geometry approaches^8,27–29. Neural network distance predictions without covariation features were used to make the EPAD potential^26, which was used for ranking structure hypotheses, and the QUARK pipeline^12 used a template-based distance profile restraint for template-based modelling.
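The 8 Å Cβ contact definition above translates directly into code; `contact_map` is an illustrative helper, not part of any of the cited pipelines:

```python
import numpy as np

def contact_map(cbeta, threshold=8.0):
    """Binary contact map from an (L, 3) array of C-beta coordinates.

    Residues i and j are 'in contact' when their C-beta atoms lie
    within `threshold` angstroms (8 A is the conventional cutoff).
    Pipelines usually also exclude short sequence separations |i - j|,
    which is omitted here for brevity.
    """
    diff = cbeta[:, None, :] - cbeta[None, :, :]   # (L, L, 3) pairwise offsets
    dist = np.sqrt((diff ** 2).sum(-1))            # (L, L) Euclidean distances
    return dist < threshold
```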
In this work we present a new deep-learning approach to protein structure prediction, whose stages are illustrated in Figure 2a. We show that it is possible to construct a learned, protein-specific potential by training a neural network (Fig. 2b) to make accurate predictions about the structure of the protein given its sequence, and to predict the structure itself accurately by minimising the

‡ Results from http://predictioncenter.org/casp13/zscores_final.cgi?formula=assessors
[Figure 2 panels: (a) pipeline from sequence & MSA features, through the deep neural network's distance and torsion distribution predictions, to gradient descent on a protein-specific potential; (b) network schematic: tiled L×1 1-D sequence & profile features and L×L 2-D covariation features feed 220 residual convolution blocks that predict 64×64 distogram regions, 64 bins deep; (c–e) folding trajectory over 1,200 gradient descent steps.]
Fig. 2 | The folding process illustrated for CASP13 target T0986s2 (length L = 155). (a) Steps of structure prediction. (b) The neural network predicts the entire L × L distogram based on MSA features, accumulating separate predictions for 64 × 64-residue regions. (c) One iteration of gradient descent (1,200 steps) is shown, with TM-score and RMSD plotted against step number, with five snapshots of the structure. The secondary structure (from SST^30) is also shown (helix in blue, strand in red), along with the native secondary structure (SS), the network's secondary structure prediction probabilities, and the uncertainty in torsion angle predictions (as κ^−1 of the von Mises distributions fitted to the predictions for φ and ψ). While each step of gradient descent greedily lowers the potential, large global conformation changes are effected, resulting in a well-packed chain. (d) The final first submission overlaid on the native structure (in grey). (e) The average (across the test set, n = 377) TM-score of the lowest-potential structure against the number of repeats of gradient descent (log scale).
potential by gradient descent (Fig. 2c). The neural network predictions include backbone torsion angles and pairwise distances between residues. Distance predictions provide more specific information about the structure than contact predictions and provide a richer training signal for the neural network. Predicting distances, rather than contacts as in most prior work, models detailed interactions rather than simple binary decisions. By jointly predicting many distances, the network can propagate distance information respecting covariation, local structure and residue identities to nearby residues. The predicted probability distributions can be combined to form a simple, principled protein-specific potential. We show that with gradient descent, it is simple to find a set of torsion angles that minimises this protein-specific potential using only limited sampling. We also show that whole chains can be optimised together, avoiding the need to segment long proteins into hypothesised domains that are modelled independently.
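As a toy illustration of this idea (not AlphaFold's actual geometry, potential, or optimiser), gradient descent on the joint angles of a 2-D chain can recover a conformation whose pairwise distances match a target; every function here is hypothetical:

```python
import numpy as np

def chain_coords(angles):
    """Toy 2-D analogue of the differentiable geometry x = G(phi, psi):
    unit-length links whose cumulative joint angles fix each point."""
    headings = np.cumsum(angles)
    steps = np.stack([np.cos(headings), np.sin(headings)], axis=1)
    return np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])

def potential(angles, target_d):
    """Sum of squared deviations from target pairwise distances;
    a crude stand-in for the spline-based distance potential."""
    x = chain_coords(angles)
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    return float(((d - target_d) ** 2).sum())

def grad_descent(angles, target_d, lr=1e-3, n=500, h=1e-5):
    """Plain gradient descent. A central-difference gradient stands in
    for the analytic derivatives obtained by backpropagating through G."""
    angles = angles.copy()
    for _ in range(n):
        g = np.zeros_like(angles)
        for i in range(angles.size):
            da = np.zeros_like(angles)
            da[i] = h
            g[i] = (potential(angles + da, target_d)
                    - potential(angles - da, target_d)) / (2 * h)
        angles -= lr * g
    return angles
```

Because the whole angle vector is updated at once, the entire chain is optimised together, mirroring the whole-chain optimisation described above.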
The central component of AlphaFold is a convolutional neural network, trained on PDB structures, that predicts the distances d_ij between the C_β atoms of pairs ij of a protein's residues. Based on a representation of the protein's amino acid sequence, S, and features derived from the sequence's MSA, the network, similar in structure to those used for image recognition tasks^31, predicts a discrete probability distribution P(d_ij | S, MSA(S)) for every ij pair in a 64 × 64 residue region, as shown in Fig. 2b. The full set of distance distribution predictions is constructed by averaging predictions for overlapping regions and is termed a distogram (from distance histogram). Figure 3 shows an example distogram prediction for one CASP protein, T0955. The modes of the distribution (Fig. 3c) can be seen to closely match the true distances (Fig. 3b). Example distributions for all distances to one residue (29) are shown in Fig. 3c. Further analysis of how the network predicts the distances is shown in Methods Figure 14.
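The averaging of overlapping 64 × 64 crop predictions into a full distogram might be sketched as follows; `crop_predict`, the crop size, and the stride are illustrative stand-ins for the trained network and its actual inference schedule:

```python
import numpy as np

def assemble_distogram(L, crop_predict, crop=64, stride=32):
    """Average overlapping crop predictions into one L x L distogram.

    `crop_predict(i, j)` stands in for the network: it returns a
    (crop, crop, n_bins) array of distance-bin probabilities for
    the region with top-left corner (i, j).
    """
    n_bins = crop_predict(0, 0).shape[-1]
    total = np.zeros((L, L, n_bins))
    count = np.zeros((L, L, 1))
    last = max(L - crop, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)  # ensure the final rows/columns are covered
    for i in starts:
        for j in starts:
            h, w = min(crop, L - i), min(crop, L - j)
            total[i:i + h, j:j + w] += crop_predict(i, j)[:h, :w]
            count[i:i + h, j:j + w] += 1
    return total / count  # per-pair average over overlapping crops
```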
In order to realise structures that conform to the distance predictions, we construct a smooth potential V_distance by fitting a spline to the negative log probabilities and summing across all residue pairs. We parameterise protein structures by the backbone torsion angles (φ, ψ) of all residues and build a differentiable model of protein geometry x = G(φ, ψ) to compute the C_β coordinates, x, and thus the inter-residue distances, d_ij = ‖x_i − x_j‖, for each structure, and so express V_distance as a function of φ and ψ. For a protein with L residues, this potential accumulates L² terms from marginal distribution predictions. To correct for the over-representation of the prior, we subtract a reference distribution^32 from the distance potential in the log domain. The reference distribution models the distance distributions P(d_ij | length) independent of the protein sequence and is computed by training a small version of the distance prediction neural network on the same structures, without sequence or MSA input features. A separate output head of the contact prediction network is trained to predict discrete probability distributions of backbone torsion angles P(φ_i, ψ_i | S, MSA(S)). After fitting a von Mises distribution, this is used to add a smooth torsion modelling term V_torsion = −Σ_i log p_vonMises(φ_i, ψ_i | S, MSA(S)) to the potential. Finally, to prevent steric clashes, we add Rosetta's V_score2_smooth^10 to the potential, as this incorporates a van der Waals term. We used multiplicative weights for each of the three terms in the potential, but no weighting noticeably outperformed equal weighting.
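A minimal sketch of the first ingredient, fitting a smooth potential to one pair's predicted distance distribution, is shown below (here with a SciPy cubic spline as a stand-in for the paper's spline; the bin layout, the summation over pairs, the reference-distribution subtraction, and the torsion and steric terms are all omitted):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def pair_distance_potential(probs, bin_centres, eps=1e-8):
    """Fit a smooth potential for a single residue pair: a cubic
    spline through the negative log probabilities of the predicted
    distance bins. Returns the potential V(d) and its derivative
    dV/dd as callables; the derivative is what gradient descent
    on the torsion angles would consume via the chain rule.
    `eps` guards against log(0) in near-empty bins.
    """
    neg_log_p = -np.log(probs + eps)
    spline = CubicSpline(bin_centres, neg_log_p)
    return spline, spline.derivative()
```

Because the spline is smooth and differentiable, the summed potential can be minimised by gradient descent on (φ, ψ) as described above, rather than by stochastic sampling.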