Accurate prediction of protein structures and interactions using a 3-track neural network

Minkyung Baek1,2, Frank DiMaio1,2, Ivan Anishchenko1,2, Justas Dauparas1,2, Sergey Ovchinnikov3,4, Gyu Rie Lee1,2, Jue Wang1,2, Qian Cong5,6, Lisa N. Kinch8, R. Dustin Schaeffer6, Claudia Millán9, Hahnbeom Park1,2, Carson Adams1,2, Caleb R. Glassman10,11, Andy DeGiovanni12, Jose H. Pereira12, Andria V. Rodrigues12, Alberdina A. van Dijk13, Ana C. Ebrecht13, Diederik J. Opperman14, Theo Sagmeister15, Christoph Buhlheller15,16, Tea Pavkov-Keller15,17, Manoj K. Rathinaswamy18, Udit Dalwadi19, Calvin K. Yip19, John E. Burke18, K. Christopher Garcia20, Nick V. Grishin6,7,8, Paul D. Adams12,21, Randy J. Read9, David Baker1,2,22*

Affiliations:
1 Department of Biochemistry, University of Washington; Seattle, WA 98195, USA
2 Institute for Protein Design, University of Washington; Seattle, WA 98195, USA
3 Faculty of Arts and Sciences, Division of Science, Harvard University; Cambridge, MA 02138, USA
4 John Harvard Distinguished Science Fellowship Program, Harvard University; Cambridge, MA 02138, USA
5 Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center; Dallas, TX, USA
6 Department of Biophysics, University of Texas Southwestern Medical Center; Dallas, TX, USA
7 Department of Biochemistry, University of Texas Southwestern Medical Center; Dallas, TX, USA
8 Howard Hughes Medical Institute, University of Texas Southwestern Medical Center; Dallas, TX, USA
9 Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge; Cambridge, U.K.
10 Program in Immunology, Stanford University School of Medicine; Stanford, CA 94305, USA
11 Departments of Molecular and Cellular Physiology and Structural Biology, Stanford University School of Medicine; Stanford, CA 94305, USA
12 Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory; Berkeley, CA, USA
13 Department of Biochemistry, Focus Area Human Metabolomics, North-West University; 2531 Potchefstroom, South Africa
14 Department of Biotechnology, University of the Free State; 205 Nelson Mandela Drive, Bloemfontein, 9300, South Africa
15 Institute of Molecular Biosciences, University of Graz; Humboldtstrasse 50, 8010 Graz, Austria
16 Medical University of Graz; Graz, Austria
17 BioTechMed-Graz; Graz, Austria
18 Department of Biochemistry and Microbiology, University of Victoria; Victoria, British Columbia, Canada
19 Life Sciences Institute, Department of Biochemistry and Molecular Biology, The University of British Columbia; Vancouver, British Columbia, Canada
20 Howard Hughes Medical Institute, Stanford University School of Medicine; Stanford, CA 94305, USA
21 Department of Bioengineering, University of California Berkeley; Berkeley, CA 94720, USA
22 Howard Hughes Medical Institute, University of Washington; Seattle, WA 98195, USA
*Corresponding author. Email: dabaker@uw.edu
Abstract: DeepMind presented remarkably accurate predictions at the recent CASP14 protein
structure prediction assessment conference. We explored network architectures incorporating
related ideas and obtained the best performance with a 3-track network in which information at
the 1D sequence level, the 2D distance map level, and the 3D coordinate level is successively
transformed and integrated. The 3-track network produces structure predictions with accuracies
approaching those of DeepMind in CASP14, enables the rapid solution of challenging X-ray
crystallography and cryo-EM structure modeling problems, and provides insights into the
functions of proteins of currently unknown structure. The network also enables rapid generation
of accurate protein-protein complex models from sequence information alone, short-circuiting
traditional approaches that require modeling of individual subunits followed by docking. We
make the method available to the scientific community to speed biological research.
One-Sentence Summary: Accurate protein structure modeling enables the rapid solution of
protein structures and provides insights into function.
The prediction of protein structure from amino acid sequence information alone has been a
longstanding challenge. The biennial Critical Assessment of protein Structure Prediction (CASP) meetings have demonstrated that deep learning methods such as AlphaFold (1, 2) and trRosetta (3), which extract information from the large database of known protein structures in the PDB, outperform more traditional approaches that explicitly model the folding process. The outstanding performance of
DeepMind’s AlphaFold2 in the recent CASP14 meeting
(https://predictioncenter.org/casp14/zscores_final.cgi) left the scientific community eager to
learn details beyond the overall framework presented and raised the question of whether such
accuracy could be achieved outside of a world-leading deep learning company. As described at
the CASP14 conference, the AlphaFold2 methodological advances included 1) starting from
multiple sequence alignments (MSAs) rather than from more processed features such as inverse
covariance matrices derived from MSAs, 2) replacement of 2D convolution with an attention
mechanism that better represents interactions between residues distant along the sequence, 3) use
of a two-track network architecture in which information at the 1D sequence level and the 2D
distance map level is iteratively transformed and passed back and forth, 4) use of an SE(3)-
equivariant Transformer network to directly refine atomic coordinates (rather than 2D distance
maps as in previous approaches) generated from the two-track network, and 5) end-to-end
learning in which all network parameters are optimized by backpropagation from the final
generated 3D coordinates through all network layers back to the input sequence.
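One of these ingredients, SE(3) equivariance, can be illustrated with a toy numeric check (an illustrative sketch only, not the AlphaFold2 or RoseTTAFold layer): a coordinate update built purely from relative geometry commutes with any global rotation and translation of the input structure.

```python
import numpy as np

def equivariant_update(coords, alpha=0.1):
    # Toy coordinate refinement: pull each point toward the centroid.
    # It uses only relative geometry, so it is SE(3)-equivariant.
    return coords + alpha * (coords.mean(axis=0) - coords)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                  # toy C-alpha coordinates

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]                        # force a proper rotation (det = +1)
t = rng.normal(size=3)                        # random translation

# Equivariance: transforming then updating equals updating then transforming.
lhs = equivariant_update(X @ Q.T + t)
rhs = equivariant_update(X) @ Q.T + t
assert np.allclose(lhs, rhs)
```

A non-equivariant update (e.g., one that adds a fixed world-frame offset) would fail this check, which is why equivariant layers are attractive for refining atomic coordinates directly.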
Network architecture development
Intrigued by the DeepMind results, and with the goal of increasing protein structure
prediction accuracy for structural biology research and advancing protein design (4), we
explored network architectures incorporating different combinations of these five properties. In
the absence of a published method, we experimented with a wide variety of approaches for
passing information between different parts of the networks, as summarized in the Methods and
table S1. We succeeded in producing a “two-track” network with information flowing in parallel
along a 1D sequence alignment track and a 2D distance matrix track with considerably better
performance than trRosetta (BAKER-ROSETTASERVER and BAKER in Fig. 1B), the next
best method after AlphaFold2 in CASP14 (https://predictioncenter.org/casp14/zscores_final.cgi).
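The information flow in such a two-track design can be sketched in NumPy (a minimal illustration with made-up feature sizes and update rules, not the actual RoseTTAFold block): the 1D track broadcasts per-residue features into the pair map via an outer-sum update, and the 2D track pools each residue's row of pair features back into the 1D track.

```python
import numpy as np

L, d1, d2 = 8, 16, 16  # toy sizes: sequence length, 1D and 2D feature dims
rng = np.random.default_rng(1)
seq_feats = rng.normal(size=(L, d1))       # 1D track: per-residue features
pair_feats = rng.normal(size=(L, L, d2))   # 2D track: residue-pair features
W_out = rng.normal(size=(d1, d2)) / np.sqrt(d1)
W_in = rng.normal(size=(d2, d1)) / np.sqrt(d2)

def two_track_block(seq, pair):
    # 1D -> 2D: outer-sum update broadcasts sequence info to every pair (i, j).
    proj = seq @ W_out
    pair = pair + proj[:, None, :] + proj[None, :, :]
    # 2D -> 1D: each residue pools its row of pair features back into the 1D track.
    seq = seq + np.tanh(pair.mean(axis=1) @ W_in)
    return seq, pair

seq_feats, pair_feats = two_track_block(seq_feats, pair_feats)
```

Iterating such blocks lets sequence-level and distance-map-level representations refine each other, which is the "passed back and forth" behavior described above.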
We reasoned that better performance could be achieved by extending to a third track
operating in 3D coordinate space to provide a tighter connection between sequence, residue-
residue distances and orientations, and atomic coordinates. We constructed architectures with the
two levels of the two-track model augmented with a third parallel structure track operating on 3D
backbone coordinates as depicted in Fig. 1A (see Methods and fig. S1 for details). In this
architecture, information flows back and forth between the 1D amino acid sequence information,
the 2D distance map, and the 3D coordinates, allowing the network to collectively reason about
relationships within and between sequences, distances, and coordinates. In contrast, reasoning
about 3D atomic coordinates in the two-track AlphaFold2 architecture happens after processing
of the 1D and 2D information is complete (although end-to-end training does link parameters to
some extent). Because of computer hardware memory limitations, we could not train models on
large proteins directly, as the 3-track models have many millions of parameters; instead, we presented the network with many discontinuous crops of the input sequence, each consisting of two sequence segments spanning a total of 260 residues. To generate final models, we
combined and averaged the 1D features and 2D distance and orientation predictions produced for
each of the crops and then used two approaches to generate final 3D structures. In the first, the
predicted residue-residue distance and orientation distributions are fed into pyRosetta (5) to
generate all-atom models. In the second, the averaged 1D and 2D features are fed into a final
SE(3)-equivariant layer (6), and following end-to-end training from amino acid sequence to 3D
coordinates, backbone coordinates are generated directly by the network (see Methods). We refer
to these networks, which also generate per-residue accuracy predictions, as RoseTTAFold. The first approach has the advantage of requiring lower-memory GPUs at inference time (for proteins over 400 residues, 8 GB rather than 24 GB) and of producing full side-chain models, but it requires CPU time for the pyRosetta structure modeling step.
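The crop-and-average step can be sketched as follows (a schematic with a stand-in predictor and toy sizes; the real crops span 260 residues and the averaged features include orientations as well as distances): per-crop 2D predictions are accumulated into a full-length map and divided by per-pair visit counts.

```python
import numpy as np

L = 60  # toy protein length
rng = np.random.default_rng(2)

def predict_crop(residue_idx):
    # Stand-in for the network: a 2D "distance feature" map for this crop.
    n = len(residue_idx)
    return rng.normal(size=(n, n))

# Discontinuous crops: two sequence segments per crop (toy version of the
# paper's two-segment crops).
crops = [np.r_[0:30, 40:60], np.r_[10:40, 50:60], np.r_[0:20, 20:60]]

acc = np.zeros((L, L))
cnt = np.zeros((L, L))
for idx in crops:
    pred = predict_crop(idx)
    acc[np.ix_(idx, idx)] += pred   # scatter crop prediction into the full map
    cnt[np.ix_(idx, idx)] += 1

# Average over all crops that covered each residue pair.
avg = np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)
```

Averaging over overlapping crops means each residue pair is predicted in several sequence contexts, which is one plausible reason cropping helped accuracy.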
The 3-track models with attention operating at the 1D, 2D, and 3D levels and information
flowing between the three levels were the best models we tested (Fig. 1B), clearly outperforming
the top two server groups (Zhang-server and BAKER-ROSETTASERVER), the BAKER human group
(ranked second among all groups), and our 2-track attention models on CASP14 targets. As in
the case of AlphaFold2, the correlation between multiple sequence alignment depth and model
accuracy is lower for RoseTTAFold than for trRosetta and other methods tested at CASP14 (fig.
S2). The performance of the 3-track model on the CASP14 targets was still not as good as
AlphaFold2 (Fig. 1B). This gap could reflect hardware constraints that limited the size of the models we could explore, differences in architecture or loss formulation, or more intensive use of the network at inference time. DeepMind reported using several GPUs for days to make individual
predictions, whereas our predictions are made in a single pass through the network in the same
manner that would be used for a server; following sequence and template search (~1.5 hours), the
end-to-end version of RoseTTAFold requires ~10 minutes on an RTX2080 GPU to generate
backbone coordinates for proteins with less than 400 residues, and the pyRosetta version requires
5 minutes for network calculations on a single RTX2080 GPU and an hour for all-atom structure
generation with 15 CPU cores. Incomplete optimization due to computer memory limitations and
neglect of side chain information likely explain the poorer performance of the end-to-end version
compared to the pyRosetta version (Fig. 1B; the latter incorporates side chain information at the
all-atom relaxation stage); since SE(3)-equivariant layers are used in the main body of the 3-
track model, the added gain from the final SE(3) layer is likely less than in the AlphaFold2 case.
We expect the end-to-end approach to ultimately be at least as accurate once the computer
hardware limitations are overcome, and side chains are incorporated.
The improved performance of the 3-track models over the 2-track model with identical
training sets, similar attention-based architectures for the 1D and 2D tracks, and similar
operations in inference (prediction) mode suggests that simultaneously reasoning at the multiple
sequence alignment, distance map, and three-dimensional coordinate representations can more
effectively extract sequence-structure relationships than reasoning over only MSA and distance
map information. The relatively low compute cost makes it straightforward to incorporate the
methods in a public server and predict structures for large sets of proteins, for example, all
human GPCRs, as described below.
Blind structure prediction tests are needed to assess any new protein structure prediction
method, but CASP is held only once every two years. Fortunately, the Continuous Automated
Model Evaluation (CAMEO) experiment (7) tests structure prediction servers blindly on protein
structures as they are submitted to the PDB. RoseTTAFold has been evaluated on CAMEO since May 15, 2021; over the 69 medium and hard targets released during this period (May 15 to June 19, 2021), it outperformed all other servers evaluated in the experiment, including
Robetta (3), IntFold6-TS (8), BestSingleTemplate (9), and SWISS-MODEL (10) (Fig. 1C).
We explored approaches for further improving accuracy through more intensive use of the network during sampling. Since the network can take templates of known structures as input, we tested a further coupling of 3D structural information and 1D sequence information by iteratively feeding predicted structures back into the network as templates, and by randomly subsampling the multiple sequence alignments to sample a broader range of models. These approaches generated ensembles containing higher-accuracy models, but
the accuracy predictor was not able to consistently identify models better than those generated by
the rapid single pass method (fig. S3). Nevertheless, we suspect that these approaches can
improve model performance and are carrying out further investigations along these lines.
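The sampling scheme just described can be sketched as a simple loop (a schematic with a stand-in predictor; the real network, its template input format, and the subsampling parameters are more involved):

```python
import numpy as np

rng = np.random.default_rng(3)
msa = rng.integers(0, 21, size=(5000, 80))  # toy MSA: 5000 sequences x 80 residues

def predict(msa_subset, template=None):
    # Stand-in for the structure network: returns toy backbone coordinates.
    coords = rng.normal(size=(msa_subset.shape[1], 3))
    if template is not None:
        coords = 0.5 * (coords + template)  # template biases the new prediction
    return coords

ensemble = []
template = None
for _ in range(4):
    # Random MSA subsample, always keeping the query sequence at row 0.
    rows = np.r_[0, rng.choice(np.arange(1, msa.shape[0]), size=255, replace=False)]
    template = predict(msa[rows], template=template)  # recycle last model as template
    ensemble.append(template)
```

The remaining difficulty, as noted above, is ranking: the ensemble contains better models, but the accuracy predictor must reliably pick them out.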
In developing RoseTTAFold, we found that combining predictions from multiple
discontinuous crops generated more accurate structures than predicting the entire structure at
once (fig. S4A). We hypothesized that this arises from selecting the most relevant sequences for
each region from the very large number of aligned sequences often available (fig. S4B). To
enable the network to focus on the most relevant sequence information for each region while
keeping access to the full multiple sequence alignment in a more memory efficient way, we
experimented with the Perceiver architecture (11), updating smaller seed MSAs (up to 100
sequences) with extra sequences (thousands of sequences) through cross-attention (fig. S4C).
The current RoseTTAFold uses only the top 1,000 sequences due to memory limitations; with this
addition, all available sequence information can be used (often over 10,000 sequences). Initial
results are promising (fig. S4D), but more training will be required for rigorous comparison.
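The Perceiver-style update can be sketched as single-head cross-attention in NumPy (toy dimensions and random weights; the actual architecture in (11) and in RoseTTAFold differs): a small set of seed-MSA embeddings queries a much larger set of extra-sequence embeddings, so the attention cost scales with seeds x extras rather than with the square of the full alignment.

```python
import numpy as np

d = 32
rng = np.random.default_rng(4)
seed = rng.normal(size=(100, d))     # seed-MSA embeddings (queries)
extra = rng.normal(size=(10000, d))  # extra-sequence embeddings (keys/values)
Wq = rng.normal(size=(d, d)) / np.sqrt(d)
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Wv = rng.normal(size=(d, d)) / np.sqrt(d)

def cross_attend(queries, context):
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    logits = q @ k.T / np.sqrt(d)                   # (100, 10000) attention scores
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the extra set
    return queries + weights @ v                    # residual update of the seeds

updated = cross_attend(seed, extra)
```

Only the small seed set is carried through the rest of the network, which is what keeps memory bounded while still exposing the model to the full alignment.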
Enabling experimental protein structure determination
With the recent considerable progress in protein structure prediction, a key question is
what accurate protein structure models can be used for. We investigated the utility of RoseTTAFold for facilitating experimental structure determination by X-ray crystallography and cryo-electron microscopy and for building models that provide biological insights into key proteins of currently unknown structure.
Solution of X-ray structures by molecular replacement (MR) often requires quite accurate models. The considerably higher accuracy of RoseTTAFold relative to currently available methods prompted us to test whether it could help solve challenging, previously unsolved MR problems and improve the solution of borderline cases.
(summarized, including resolution limits, in table S2), which had eluded solution by MR using
models available in the PDB, were reanalyzed using RoseTTAFold models: glycine N-
acyltransferase (GLYAT) from Bos taurus (fig. S5A), a bacterial oxidoreductase (fig. S5B), a
bacterial surface layer protein (SLP) (Fig. 2A) and the secreted protein Lrbp from the fungus
Phanerochaete chrysosporium (Fig. 2B and fig. S5C). In all four cases, the predicted models had sufficient structural similarity to the true structures to yield successful MR solutions (see Methods for details; the per-residue error estimates from DeepAccNet (12) allowed the more
accurate parts to be weighted more heavily). The increased prediction accuracy was critical for
success in all cases, as models made with trRosetta did not yield MR solutions.
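The weighting step can be made concrete with the standard crystallographic conversion from an estimated coordinate error to a B-factor, B = (8*pi^2/3) * rmsd^2, which down-weights less reliable residues during MR. The error values below are hypothetical, and mapping a DeepAccNet accuracy estimate to an rmsd in angstroms is itself an empirical step not shown here.

```python
import math

# Hypothetical per-residue coordinate-error estimates (angstroms); in practice
# these would be derived from an accuracy predictor such as DeepAccNet.
est_rmsd = [0.5, 0.7, 1.2, 2.5, 0.6]

# Standard relation between isotropic mean-square displacement and B-factor:
# B = (8 * pi^2 / 3) * rmsd^2; larger predicted error -> larger B -> lower weight.
b_factors = [(8 * math.pi ** 2 / 3) * r ** 2 for r in est_rmsd]
```

Writing such B-factors into the search model lets the MR software trust the well-predicted core while effectively ignoring poorly predicted loops.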
To determine why the RoseTTAFold models succeeded where PDB structures had previously failed, we compared the models to the crystal structures we obtained. The images in
Fig. 2A and fig. S5 show that in each case, the closest homolog of the known structure was a
much poorer model than the RoseTTAFold model; in the case of SLP, only a distant model
covering part of the N-terminal domain (38% of the sequence) was available in the PDB, while