Structures of core eukaryotic protein complexes

Ian R. Humphreys (1,2,†), Jimin Pei (3,4,†), Minkyung Baek (1,2,†), Aditya Krishnakumar (1,2,†),
Ivan Anishchenko (1,2), Sergey Ovchinnikov (5,6), Jing Zhang (3,4), Travis J. Ness (7,‡), Sudeep Banjade (8),
Saket Bagde (8), Viktoriya G. Stancheva (9), Xiao-Han Li (9), Kaixian Liu (10), Zhi Zheng (10,11),
Daniel J. Barrero (12), Upasana Roy (13), Israel S. Fernández (14), Barnabas Szakal (15),
Dana Branzei (15,16), Eric C. Greene (13), Sue Biggins (12), Scott Keeney (10,11,17), Elizabeth A. Miller (9),
J. Christopher Fromme (8), Tamara L. Hendrickson (5), Qian Cong (3,4,*), David Baker (1,2,18,*)
1 Department of Biochemistry, University of Washington, Seattle, WA, USA.
2 Institute for Protein Design, University of Washington, Seattle, WA, USA.
3 Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA.
4 Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA.
5 Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA 02138, USA.
6 John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA.
7 Department of Chemistry, Wayne State University, Detroit, MI, USA.
8 Department of Molecular Biology & Genetics, Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY, USA.
9 MRC Laboratory of Molecular Biology, Cambridge, CB2 0QH, UK.
10 Molecular Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY.
11 Gerstner Sloan Kettering Graduate School of Biomedical Sciences, New York, NY.
12 Howard Hughes Medical Institute, Division of Basic Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, Seattle, WA, USA.
13 Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA.
14 Department of Structural Biology, St Jude Children's Research Hospital, Memphis, TN, USA.
15 IFOM, the FIRC Institute of Molecular Oncology, Via Adamello 16, 20139, Milan, Italy.
16 Istituto di Genetica Molecolare, Consiglio Nazionale delle Ricerche (IGM-CNR), Via Abbiategrasso 207, 27100, Pavia, Italy.
17 Howard Hughes Medical Institute, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
18 Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
† Contributed equally
* Contributed equally and correspondence: qian.cong@utsouthwestern.edu, dabaker@uw.edu
‡ Current address: Sanofi, Cambridge, MA, USA
bioRxiv preprint doi: https://doi.org/10.1101/2021.09.30.462231; this version posted September 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract: Protein-protein interactions play critical roles in biology, but despite decades of
effort, the structures of many eukaryotic protein complexes are unknown, and there are likely
many interactions that have not yet been identified. Here, we take advantage of recent
advances in proteome-wide amino acid coevolution analysis and deep-learning-based structure
modeling to systematically identify and build accurate models of core eukaryotic protein
complexes, as represented within the Saccharomyces cerevisiae proteome. We use a
combination of RoseTTAFold and AlphaFold to screen through paired multiple sequence
alignments for 8.3 million pairs of S. cerevisiae proteins and build models for strongly predicted
protein assemblies with two to five components. Comparison to existing interaction and
structural data suggests that these predictions are likely to be quite accurate. We provide
structure models spanning almost all key processes in eukaryotic cells for 104 protein
assemblies which have not been previously identified, and 608 which have not been structurally
characterized.
One-sentence summary: We take advantage of recent advances in proteome-wide amino acid
coevolution analysis and deep-learning-based structure modeling to systematically identify and
build accurate models of core eukaryotic protein complexes.

Yeast two-hybrid (Y2H), affinity-purification mass spectrometry (APMS), and other
high-throughput experimental approaches have identified many pairs of interacting proteins in
yeast and other organisms (1)(2)(3)(4)(5), but there are often extensive discrepancies between
sets generated using the different methods and considerable false positive and false negative
rates (6). Since residues at protein-protein interfaces are expected to coevolve, given two
proteins, the likelihood that they interact can be assessed by identifying and aligning the
sequences of orthologs of the two proteins in many different species, joining them to create
paired multiple sequence alignments (pMSA), and then determining the extent to which changes
in the sequences of orthologs for the first partner covary with ortholog sequence changes for the
second partner (7)(8). Such amino acid coevolution has been used to guide modeling of
complexes for cases in which the structures of the partners are known (9)(10), and to
systematically identify pairs of interacting proteins in prokaryotes with accuracy higher than
experimental screens (7). Recent deep-learning-based advances in protein structure prediction
(11)(12) have the potential to increase the power of such approaches, as they now enable
accurate modeling not only of protein monomer structures but also of protein complexes (11).
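
The pairing step described in this paragraph can be illustrated with a minimal sketch. It assumes that each protein's ortholog alignment is already available as a dictionary mapping a species identifier to a single aligned sequence; the function name `build_paired_msa` and the toy sequences are hypothetical and are not taken from the original work.

```python
def build_paired_msa(msa_a, msa_b):
    """Join two ortholog alignments into a paired MSA (pMSA).

    msa_a, msa_b: dict mapping species id -> aligned sequence, with one
    ortholog per species. Only species present in both alignments are kept,
    and the two aligned sequences are concatenated into one row.
    """
    shared = sorted(set(msa_a) & set(msa_b))
    return {species: msa_a[species] + msa_b[species] for species in shared}


# Toy, made-up alignments for two proteins (not real data).
msa_a = {"sp1": "MKV-", "sp2": "MRV-", "sp3": "MKIA", "sp4": "MKLA"}
msa_b = {"sp1": "GDE", "sp3": "GNE", "sp4": "GNQ", "sp5": "GDE"}

pmsa = build_paired_msa(msa_a, msa_b)
for species, row in pmsa.items():
    print(species, row)
```

Covariation between columns from the first protein and columns from the second protein would then be assessed on the rows of `pmsa`.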
We set out to combine proteome-wide coevolution-guided protein interaction identification with
deep-learning-based protein structure modeling to systematically identify and determine the
structures of eukaryotic protein assemblies. We faced several challenges in directly applying to
eukaryotes the statistical methods that are effective in identifying coevolving pairs in prokaryotes. First,
far more genome sequences are available for prokaryotes than eukaryotes, and the average
number of homologous amino acid sequences (excluding nearly identical copies with > 95%
sequence identity) is on the order of 10,000 for bacterial proteins, but 1,000 for eukaryotic
proteins (fig. S1). Thus, multiple sequence alignments for pairs of eukaryotic proteins contain far
fewer diverse sequences, making it more difficult for statistical methods to distinguish true
coevolutionary signal from the noise. Second, eukaryotes in general have a larger number of
genes, making comprehensive pairwise analysis more computationally intensive, and increasing
the background noise resulting from calculation errors. Third, mRNA splicing in eukaryotes
further increases the number of protein species, resulting in errors in gene predictions and
complicating sequence alignments. Fourth, eukaryotes underwent several rounds of genome
duplications in multiple lineages (13), and it can be difficult to distinguish orthologs from
paralogs, which is important for detecting coevolutionary signal because the protein interactions
of interest are likely to be conserved in orthologs in other species but less so in paralogs.
We sought to overcome these challenges as follows. To help with the first three challenges, we
chose the yeast S. cerevisiae as the starting point because there are a large number (~1,700) of
fungal genomes (14), the genome is relatively small (6,000 genes total), and there is relatively
little mRNA splicing (15). Furthermore, because the interactome of yeast has been extensively
studied, there is a “gold standard” set of known interactions to evaluate the reliability of
predicted interactions and structures.
To distinguish orthologs from paralogs, we started from OrthoDB (16), a hierarchical catalog of
orthologs across 1,271 eukaryotic genomes, and supplemented each orthologous group with
sequences from 4,325 eukaryotic proteomes that we assembled from NCBI
(https://www.ncbi.nlm.nih.gov/genome) and JGI (17). We compared the protein sequences for
each of the additional 4,325 proteomes against those of the most closely related species in the
OrthoDB database, and used the reciprocal best hit criterion (18) to identify orthologs; these
were then added to the corresponding orthologous group. A complication is that each species
frequently contains multiple proteins belonging to the same orthologous group, leading to
ambiguity in determining which protein should be included in pMSAs crucial for coevolutionary
analysis. These multiple copies may represent alternatively spliced forms of the same gene,
parts of the same gene that were split into multiple pieces due to errors in gene prediction, or
recent gene expansions specific to certain lineages. We dealt with these possibilities by keeping
only the longest isoform of each gene, merging pieces of the same gene, and selecting the copy
with the highest sequence identity to single-copy orthologs in other species. For 4,090 out of
~6,000 yeast proteins, we were able to identify clear orthologs across large numbers of species,
and we generated pMSAs for all 4,090 * 4,089 / 2 = 8,362,005 pairwise combinations of these
proteins. We focused on the 4,286,433 pairs whose alignments contained over 200 sequences, to
increase prediction accuracy, and fewer than 1,300 amino acids, to allow fast computation.
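
The ortholog assignment, pair counting, and filtering in this paragraph can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the helper names, the toy best-hit tables, and the toy protein statistics are hypothetical, the number of pMSA sequences is approximated by the number of species with an ortholog of both proteins, and the pMSA length by the sum of the two protein lengths.

```python
from itertools import combinations


def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where b is the best hit of a and a is the best hit of b."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


def count_unordered_pairs(n):
    """Number of distinct unordered pairs among n proteins: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def passes_filter(num_sequences, paired_length, min_sequences=200, max_length=1300):
    """Keep pMSAs with over min_sequences rows and fewer than max_length residues."""
    return num_sequences > min_sequences and paired_length < max_length


# Toy best-hit tables between a new proteome (a*) and a reference species (b*).
orthologs = reciprocal_best_hits(
    {"a1": "b1", "a2": "b2", "a3": "b2"},
    {"b1": "a1", "b2": "a3"},
)

# Pair count quoted in the text for 4,090 proteins.
total_pairs = count_unordered_pairs(4090)

# Hypothetical statistics: name -> (protein length, species with an ortholog).
proteins = {
    "A": (400, {"s%d" % i for i in range(300)}),
    "B": (500, {"s%d" % i for i in range(50, 350)}),
    "C": (1000, {"s%d" % i for i in range(300)}),
}
kept = []
for p, q in combinations(sorted(proteins), 2):
    (len_p, species_p), (len_q, species_q) = proteins[p], proteins[q]
    if passes_filter(len(species_p & species_q), len_p + len_q):
        kept.append((p, q))
```

In this toy case only the pair A and B survives: the pairs involving C exceed the length limit.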
In a first set of calculations, we found that even with the advantages of S. cerevisiae and
improved ortholog identification, the statistical method we had used in our previous
coevolution-guided PPI screen in prokaryotes (7), Direct Coupling Analysis (DCA), could not
effectively distinguish a “gold standard” set of 768 yeast protein pairs known to interact (5)
(http://interactome.dfci.harvard.edu/S_cerevisiae/) from the much larger set (768,000 pairs) of
primarily non-interacting pairs (Fig. 1A, grey curve, area under the curve: 0.016); the more
accurate GREMLIN (9) method is too slow for a screen of this size. Progress
clearly required a more accurate and sensitive, but still rapidly computable, method for
evaluating protein interactions based on pMSAs.
We explored the application of the recently developed deep-learning-based structure prediction
methods, RoseTTAFold (RF) and AlphaFold (AF), to this problem. Even though RF was
originally trained on monomeric protein sequences and structures, it can accurately predict the
structures of protein complexes given pMSAs with a sufficient number of sequences (11). We
evaluated the compute time required and the PPI prediction accuracy for a variety of model
architectures, and found that a lighter-weight (10.7 million parameters) RF two-track model
provided an optimal tradeoff: the model requires 11 seconds (about 100 times faster than AF) to
process a pMSA of 1,000 amino acids on an NVIDIA TITAN RTX graphics processing unit, and it
can effectively distinguish gold standard PPIs amongst much larger sets of randomly paired
proteins. The very short time required to analyze an individual pMSA made it possible to
process all 4.3 million pMSAs. This method considerably outperformed DCA in distinguishing
gold standard interactions from random pairs (Fig. 1A, blue curve, area under the curve: 0.219),
using the highest predicted contact probability over all pairs of residues in the two proteins as a
measure of the propensity for two proteins to interact. Performance was further improved (Fig.
1A, green curve, area under the curve: 0.248) by correcting overestimations of predicted contact
probabilities between the C-terminal residues of the first protein and the N-terminal residues of
the second protein, and of predicted interactions for a subset of proteins showing hub-like
interactions with many other proteins. The much better performance of RF than DCA likely
stems from the extensive information on protein sequence-structure relationships embedded in
the RF deep neural network; DCA by contrast operates solely on protein sequences with no
underlying protein structure model.
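
The interaction measure used above, the highest predicted contact probability over all inter-protein residue pairs, can be sketched in a few lines. The function name `interaction_score` and the toy probability map are hypothetical; the corrections for chain termini and hub-like proteins are not reproduced here. The last lines give a rough estimate of the cost of the RF screen, assuming (as a simplification) that every one of the 4,286,433 pMSAs takes the 11 seconds quoted for a 1,000 amino acid pMSA.

```python
def interaction_score(contact_probs):
    """Highest predicted contact probability over all inter-protein residue pairs.

    contact_probs[i][j] is the predicted probability that residue i of the
    first protein contacts residue j of the second protein.
    """
    return max(max(row) for row in contact_probs)


# Made-up 3 x 4 inter-protein contact probability map (not real predictions).
probs = [
    [0.01, 0.02, 0.05, 0.01],
    [0.03, 0.81, 0.40, 0.02],
    [0.02, 0.10, 0.07, 0.01],
]
score = interaction_score(probs)

# Rough cost of the full RF screen under the 11 s per pMSA assumption.
rf_gpu_hours = 4_286_433 * 11 / 3600
print(score, round(rf_gpu_hours))
```

Under this assumption the RF screen costs roughly 13,000 GPU hours.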
We next explored whether AF residue-residue contact predictions could further distinguish
interacting from non-interacting protein pairs. Like RF, AF was trained on monomeric protein
structures, but given the good results with 2-track RF on protein complexes, and the higher
accuracy of AF (also a 2-track network with a final 3D structure module) on monomers, we
reasoned that it might similarly have higher accuracy on complexes. To enable modeling of
protein complexes using AF, we modified the positional encoding. AF was too slow to be applied
to the entire set of 4.3 million pMSAs (this would require 0.1-1 million GPU hours); instead we
applied AF to the 5,495 protein pairs with the highest RF support (corresponding to ~25%
precision and ~29% recall based on our benchmark, indicated by the black vertical line in Fig.
1A). Using the highest AF contact probability over all residue pairs as a measure of interaction
strength, we found that the combination of RF followed by AF provided excellent performance
(Fig. 1B). Almost all the gold-standard pairs were ranked higher than the negative controls by
AF contact probability, allowing selection of a set of 717 candidate PPIs with an expected
precision of 95% at an AF contact probability cutoff of 0.67 (black line in Fig. 1B); we refer to this
RF plus AF procedure as the de novo PPI screen, and the resulting set of predicted interactions,
the de novo PPI set, below.
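
How a contact probability cutoff translates into precision and recall on a labeled benchmark can be sketched as below. The function name `precision_recall` and the benchmark values are made up for illustration, and the sketch does not attempt to reproduce how the expected precision reported above was estimated.

```python
def precision_recall(scored_pairs, cutoff):
    """Precision and recall when pairs scoring above the cutoff are called interacting.

    scored_pairs: list of (score, is_true_interaction) tuples.
    """
    called = [is_true for score, is_true in scored_pairs if score > cutoff]
    true_total = sum(1 for _, is_true in scored_pairs if is_true)
    true_positives = sum(called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / true_total if true_total else 0.0
    return precision, recall


# Made-up benchmark: (AF contact probability, known interaction?).
benchmark = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.30, False), (0.20, False), (0.10, False),
]
precision, recall = precision_recall(benchmark, cutoff=0.67)
print(precision, recall)
```

Raising the cutoff generally trades recall for precision, which is why a stringent threshold is needed when the number of candidate pairs is very large.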
Due to the tradeoff between compute time and accuracy, and the necessity of setting a stringent
threshold to avoid large numbers of false positives given the very large number of total pairs, we
were concerned that some interacting proteins might not coevolve sufficiently to be identified
robustly in our all-vs-all RF screen. Given the excellent performance of AF in distinguishing gold
standard interactions amongst the RF filtered pMSAs, we also applied AF to pMSAs for PPIs
reported in literature, including those identified in experimental high throughput screens.
As in our de novo PPI screen, we considered protein pairs with AF contact
probability larger than 0.67 to be confident interacting partners. We found that 51% of the gold
standard PPIs were supported by high AF contact probability (Fig. 1C), with lower ratios for
candidate PPIs manually curated from multiple literature sources
(http://interactome.dfci.harvard.edu/S_cerevisiae/download/LC_multiple.txt) (3) (34%) or
supported by low-throughput experiments according to BIOGRID (19) (27%). The ratio of
AF-supported PPIs is even lower for protein pairs identified by Y2H (19%) or APMS (16%)
screens, consistent with the known larger fraction of false positives in large-scale experimental
screens (20). The fast RF 2-track model used in the de novo screen has accuracy comparable to or
better than that of the large-scale experimental screens when assessed in this way: with a
high-stringency RF cutoff, the fraction of AF-supported pairs among PPIs identified by RF is 32%,
similar to the accuracy of low-throughput experiments; with a lower-stringency cutoff, this
fraction becomes closer to that of the large-scale experimental screens, but somewhat fewer true
PPIs are missed (Fig. 1C).
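
The per-source comparison in this paragraph amounts to computing, for each set of candidate PPIs, the fraction of pairs whose AF contact probability exceeds 0.67. A minimal sketch, with hypothetical source names and made-up probabilities:

```python
AF_CUTOFF = 0.67  # AF contact probability cutoff quoted in the text


def supported_fraction(af_probs, cutoff=AF_CUTOFF):
    """Fraction of pairs whose AF contact probability exceeds the cutoff."""
    if not af_probs:
        return 0.0
    return sum(p > cutoff for p in af_probs) / len(af_probs)


# Made-up AF contact probabilities for two hypothetical PPI sources.
sources = {
    "gold_standard": [0.9, 0.8, 0.5, 0.7],
    "y2h": [0.9, 0.1, 0.2, 0.3, 0.4],
}
fractions = {name: supported_fraction(probs) for name, probs in sources.items()}
print(fractions)
```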
In total, we identified 717 likely interacting pairs from the “de novo RF → AF” screen, and 1,223
from the “pooled experimental sets → AF” screen, of which 434 overlap, resulting in a total of
1,506 PPIs. Out of these, 718 have been structurally characterized, 684 have some supporting
experimental data from literature and databases, and 104 are not to our knowledge previously