Structures of core eukaryotic protein complexes

doi:10.1101/2021.09.30.462231

Ian R. Humphreys (1,2,†), Jimin Pei (3,4,†), Minkyung Baek (1,2,†), Aditya Krishnakumar (1,2,†), Ivan Anishchenko (1,2), Sergey Ovchinnikov (5,6), Jing Zhang (3,4), Travis J. Ness (7,‡), Sudeep Banjade (8), Saket Bagde (8), Viktoriya G. Stancheva (9), Xiao-Han Li (9), Kaixian Liu (10), Zhi Zheng (10,11), Daniel J. Barrero (12), Upasana Roy (13), Israel S. Fernández (14), Barnabas Szakal (15), Dana Branzei (15,16), Eric C. Greene (13), Sue Biggins (12), Scott Keeney (10,11,17), Elizabeth A. Miller (9), J. Christopher Fromme (8), Tamara L. Hendrickson (5), Qian Cong (3,4,*), David Baker (1,2,18,*)

1. Department of Biochemistry, University of Washington, Seattle, WA, USA.
2. Institute for Protein Design, University of Washington, Seattle, WA, USA.
3. Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA.
4. Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA.
5. Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA 02138, USA.
6. John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA.
7. Department of Chemistry, Wayne State University, Detroit, MI, USA.
8. Department of Molecular Biology & Genetics, Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY, USA.
9. MRC Laboratory of Molecular Biology, Cambridge, CB2 0QH, UK.
10. Molecular Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY.
11. Gerstner Sloan Kettering Graduate School of Biomedical Sciences, New York, NY.
12. Howard Hughes Medical Institute, Division of Basic Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, Seattle, WA, USA.
13. Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA.
14. Department of Structural Biology, St Jude Children's Research Hospital, Memphis, TN, USA.
15. IFOM, the FIRC Institute of Molecular Oncology, Via Adamello 16, 20139, Milan, Italy.
16. Istituto di Genetica Molecolare, Consiglio Nazionale delle Ricerche (IGM-CNR), Via Abbiategrasso 207, 27100, Pavia, Italy.
17. Howard Hughes Medical Institute, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
18. Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.

† Contributed equally
* Contributed equally and correspondence: qian.cong@utsouthwestern.edu, dabaker@uw.edu
‡ Current address: Sanofi, Cambridge, MA, USA

bioRxiv preprint. doi: https://doi.org/10.1101/2021.09.30.462231. This version posted September 30, 2021.

Abstract: Protein-protein interactions play critical roles in biology, but despite decades of effort, the structures of many eukaryotic protein complexes are unknown, and there are likely many interactions that have not yet been identified. Here, we take advantage of recent advances in proteome-wide amino acid coevolution analysis and deep-learning-based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes, as represented within the Saccharomyces cerevisiae proteome. We use a combination of RoseTTAFold and AlphaFold to screen through paired multiple sequence alignments for 8.3 million pairs of S. cerevisiae proteins and build models for strongly predicted protein assemblies with two to five components. Comparison to existing interaction and structural data suggests that these predictions are likely to be quite accurate. We provide structure models spanning almost all key processes in eukaryotic cells for 104 protein assemblies which have not been previously identified, and 608 which have not been structurally characterized.

One-sentence summary: We take advantage of recent advances in proteome-wide amino acid coevolution analysis and deep-learning-based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes.

Yeast two-hybrid (Y2H), affinity-purification mass spectrometry (APMS), and other high-throughput experimental approaches have identified many pairs of interacting proteins in yeast and other organisms (1)(2)(3)(4)(5), but there are often extensive discrepancies between sets generated using the different methods and considerable false positive and false negative rates (6). Because residues at protein-protein interfaces are expected to coevolve, the likelihood that two proteins interact can be assessed by identifying and aligning the sequences of orthologs of the two proteins in many different species, joining them to create paired multiple sequence alignments (pMSA), and then determining the extent to which changes in the sequences of orthologs for the first partner covary with ortholog sequence changes for the second partner (7)(8). Such amino acid coevolution has been used to guide modeling of complexes for cases in which the structures of the partners are known (9)(10), and to systematically identify pairs of interacting proteins in prokaryotes with accuracy higher than experimental screens (7). Recent deep-learning-based advances in protein structure prediction (11)(12) have the potential to increase the power of such approaches, as they now enable accurate modeling not only of protein monomer structures but also of protein complexes (11).
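As a rough illustration of the pairing and covariation idea described above, the following Python sketch joins two ortholog alignments species by species and scores one pair of columns with mutual information. Mutual information is used here only as a simple stand-in for a covariation measure; it is not DCA or GREMLIN, and the dictionary input format and the function names are assumptions made for this example.

```python
from collections import Counter
from math import log2


def pair_alignments(msa_a, msa_b):
    """Join two alignments species by species (a toy pMSA).

    msa_a, msa_b: dicts mapping a species name to the aligned ortholog
    sequence for protein A or B (hypothetical input format).
    Only species present in both alignments are kept.
    """
    shared = sorted(set(msa_a) & set(msa_b))
    return [(msa_a[s], msa_b[s]) for s in shared]


def mutual_information(pmsa, i, j):
    """Mutual information (bits) between column i of A and column j of B."""
    n = len(pmsa)
    col_a = [a[i] for a, _ in pmsa]
    col_b = [b[j] for _, b in pmsa]
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) simplifies to c * n / (count_a[x] * count_b[y])
        mi += p_xy * log2(c * n / (count_a[x] * count_b[y]))
    return mi


# Toy data: column 1 of A changes together with column 0 of B.
toy_a = {"sp1": "AK", "sp2": "AK", "sp3": "AE", "sp4": "AE"}
toy_b = {"sp1": "DG", "sp2": "DG", "sp3": "RG", "sp4": "RG", "sp5": "RG"}
toy_pmsa = pair_alignments(toy_a, toy_b)
```

In this toy case the covarying column pair scores higher than a pair involving a fully conserved column. Real alignments would additionally need sequence reweighting and corrections for phylogenetic background, which this sketch omits.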

We set out to combine proteome-wide coevolution-guided protein interaction identification with deep-learning-based protein structure modeling to systematically identify and determine the structures of eukaryotic protein assemblies. We faced several challenges in directly applying to eukaryotes the statistical methods effective in identifying coevolving pairs in prokaryotes. First, far more genome sequences are available for prokaryotes than eukaryotes, and the average number of homologous amino acid sequences (excluding nearly identical copies with > 95% sequence identity) is on the order of 10,000 for bacterial proteins, but 1,000 for eukaryotic proteins (fig. S1). Thus, multiple sequence alignments for pairs of eukaryotic proteins contain far fewer diverse sequences, making it more difficult for statistical methods to distinguish true coevolutionary signal from the noise. Second, eukaryotes in general have a larger number of genes, making comprehensive pairwise analysis more computationally intensive, and increasing the background noise resulting from calculation errors. Third, mRNA splicing in eukaryotes further increases the number of protein species, resulting in errors in gene predictions and complicating sequence alignments. Fourth, eukaryotes underwent several rounds of genome duplications in multiple lineages (13), and it can be difficult to distinguish orthologs from paralogs, which is important for detecting coevolutionary signal because the protein interactions of interest are likely to be conserved in orthologs in other species but less so in paralogs.

We sought to overcome these challenges as follows. To help with the first three challenges, we chose the yeast S. cerevisiae as the starting point because there are a large number (~1,700) of fungal genomes (14), the genome is relatively small (6,000 genes total), and there is relatively little mRNA splicing (15). Furthermore, because the interactome of yeast has been extensively studied, there is a “gold standard” set of known interactions to evaluate the reliability of predicted interactions and structures.

To distinguish orthologs from paralogs, we started from OrthoDB (16), a hierarchical catalog of orthologs over 1,271 eukaryote genomes, and supplemented each orthologous group with sequences from 4,325 eukaryote proteomes we assembled from NCBI (https://www.ncbi.nlm.nih.gov/genome) and JGI (17). We compared the protein sequences for each of the additional 4,325 proteomes against those of the most closely related species in the OrthoDB database, and used the reciprocal best hit criterion (18) to identify orthologs; these were then added to the corresponding orthologous group. A complication is that each species frequently contains multiple proteins belonging to the same orthologous group, leading to ambiguity in determining which protein should be included in the pMSAs that are crucial for coevolutionary analysis. These multiple copies may represent alternatively spliced forms of the same gene, parts of the same gene that were split into multiple pieces due to errors in gene prediction, or recent gene expansions specific to certain lineages. We dealt with these possibilities by keeping only the longest isoform of each gene, merging pieces of the same gene, and selecting the copy with the highest sequence identity to single-copy orthologs in other species. For 4,090 out of ~6,000 yeast proteins, we were able to identify clear orthologs across large numbers of species, and we generated pMSAs for all 4,090 * 4,089 / 2 = 8,362,005 pairwise combinations of these proteins. We focused on 4,286,433 pairs with alignments containing over 200 sequences to increase prediction accuracy and fewer than 1,300 amino acids to allow fast computation.
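The reciprocal best hit criterion and the pair count quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the paper: the score dictionaries stand in for the output of a sequence search tool, and all function and variable names are hypothetical.

```python
from math import comb


def best_hits(scores):
    """Map each query to its top-scoring target.

    scores: dict of (query, target) -> similarity score, for example a
    bit score from a sequence search (assumed input format).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy search results between proteome A and proteome B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
          ("a2", "b1"): 120, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 210, ("b1", "a2"): 115,
          ("b2", "a2"): 95, ("b2", "a1"): 40}
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)

# Number of unordered pairs among the 4,090 proteins with clear orthologs.
n_pairs = comb(4090, 2)
```

Here only a1 and b1 qualify, because a2's best hit is b1 but b1's best hit is a1. The value of n_pairs reproduces the 8,362,005 pairwise combinations stated in the text.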

In a first set of calculations, we found that even with the advantages of S. cerevisiae and improved ortholog identification, the statistical method (Direct Coupling Analysis, DCA) we had used in our previous coevolution-guided PPI screen in prokaryotes (7) (the more accurate GREMLIN (9) method is too slow for a screen of this scale) could not effectively distinguish a “gold standard” set of 768 yeast protein pairs known to interact (5) (http://interactome.dfci.harvard.edu/S_cerevisiae/) from the much larger set (768,000 pairs) of primarily non-interacting pairs (Fig. 1A, grey curve, area under the curve: 0.016). Progress clearly required a more accurate and sensitive, but still rapidly computable, method for evaluating protein interactions based on pMSAs.

We explored the application of the recently developed deep-learning-based structure prediction methods, RoseTTAFold (RF) and AlphaFold (AF), to this problem. Even though RF was originally trained on monomeric protein sequences and structures, it can accurately predict the structures of protein complexes given pMSAs with a sufficient number of sequences (11). We evaluated the compute time required and the PPI prediction accuracy for a variety of model architectures, and found that a lighter-weight (10.7 million parameters) RF two-track model provided an optimal tradeoff: the model requires 11 seconds (about 100 times faster than AF) to process a pMSA of 1,000 amino acids on an NVIDIA TITAN RTX graphics processing unit, and it can effectively distinguish gold standard PPIs amongst much larger sets of randomly paired proteins. The very short time required to analyze an individual pMSA made it possible to process all 4.3 million pMSAs. This method considerably outperformed DCA in distinguishing gold standard interactions from random pairs (Fig. 1A, blue curve, area under the curve: 0.219), using the highest predicted contact probability over all pairs of residues in the two proteins as a measure of the propensity for two proteins to interact. Performance was further improved (Fig. 1A, green curve, area under the curve: 0.248) by correcting overestimations of predicted contact probabilities between the C-terminal residues of the first protein and the N-terminal residues of the second protein, and of predicted interactions for a subset of proteins showing hub-like interactions with many other proteins. The much better performance of RF than DCA likely stems from the extensive information on protein sequence-structure relationships embedded in the RF deep neural network; DCA by contrast operates solely on protein sequences with no underlying protein structure model.
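The interaction measure described above, the highest predicted contact probability over all inter-protein residue pairs, can be sketched as follows. The sketch assumes the network output is available as a square matrix of contact probabilities over the concatenated sequence (protein A followed by protein B); that layout and the function name are assumptions for illustration. The corrections for chain-junction residues and hub-like proteins are not reproduced, since the text does not specify them.

```python
def interaction_score(contact_prob, len_a):
    """Highest predicted contact probability between proteins A and B.

    contact_prob: square nested list of probabilities indexed over the
        concatenated sequence A + B (assumed layout).
    len_a: number of residues in protein A.
    Only the inter-protein block (rows in A, columns in B) is scanned,
    so intra-protein contacts do not influence the score.
    """
    total = len(contact_prob)
    if not 0 < len_a < total:
        raise ValueError("len_a must split the matrix into two proteins")
    return max(
        contact_prob[i][j]
        for i in range(len_a)
        for j in range(len_a, total)
    )


# Toy 4-residue example: A is residues 0-1, B is residues 2-3.
toy_probs = [
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.70, 0.05],
    [0.10, 0.70, 1.00, 0.95],
    [0.20, 0.05, 0.95, 1.00],
]
toy_score = interaction_score(toy_probs, len_a=2)
```

Taking the maximum means a single confidently predicted inter-protein contact is enough to rank a pair highly, which is one plausible reason a junction or hub correction would be needed to suppress systematic overestimates.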

We next explored whether AF residue-residue contact predictions could further distinguish interacting from non-interacting protein pairs. Like RF, AF was trained on monomeric protein structures, but given the good results with 2-track RF on protein complexes, and the higher accuracy of AF (also a 2-track network with a final 3D structure module) on monomers, we reasoned that it might similarly have higher accuracy on complexes. To enable modeling of protein complexes using AF, we modified the positional encoding. AF was too slow to be applied to the entire set of 4.3 million pMSAs (this would require 0.1-1 million GPU hours); instead we applied AF to the 5,495 protein pairs with the highest RF support (corresponding to ~25% precision and ~29% recall based on our benchmark, indicated by the black vertical line in Fig. 1A). Using the highest AF contact probability over all residue pairs as a measure of interaction strength, we found that the combination of RF followed by AF provided excellent performance (Fig. 1B). Almost all the gold-standard pairs were ranked higher than the negative controls by AF contact probability, allowing selection of a set of 717 candidate PPIs with an expected precision of 95% at an AF contact probability cutoff of 0.67 (black line in Fig. 1B). We refer below to this RF plus AF procedure as the de novo PPI screen, and to the resulting set of predicted interactions as the de novo PPI set.
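For readers less familiar with the benchmark terms used above, the following Python sketch shows how precision and recall at a score cutoff are conventionally computed from a scored set of pairs and a set of known positives. It uses the textbook definitions only; how the expected precision in the paper was estimated from the benchmark is not stated in this section, and the names and toy numbers are invented for illustration.

```python
def precision_recall(scores, positives, cutoff):
    """Precision and recall when pairs scoring above cutoff are called PPIs.

    scores: dict mapping a protein pair to its contact-probability score.
    positives: set of pairs known to interact (the benchmark positives).
    """
    predicted = {pair for pair, score in scores.items() if score > cutoff}
    true_positives = len(predicted & positives)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(positives) if positives else 0.0
    return precision, recall


# Toy benchmark with invented scores.
toy_scores = {
    ("A", "B"): 0.90,
    ("A", "C"): 0.80,
    ("B", "C"): 0.30,
    ("C", "D"): 0.70,
    ("B", "D"): 0.10,
}
toy_positives = {("A", "B"), ("C", "D"), ("B", "C")}
toy_precision, toy_recall = precision_recall(toy_scores, toy_positives, 0.67)
```

Raising the cutoff generally trades recall for precision, which is the tradeoff behind choosing a stringent threshold when the number of candidate pairs is very large.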

Due to the tradeoff between compute time and accuracy, and the necessity of setting a stringent threshold to avoid large numbers of false positives given the very large number of total pairs, we were concerned that some interacting proteins might not coevolve sufficiently to be identified robustly in our all-vs-all RF screen. Given the excellent performance of AF in distinguishing gold standard interactions amongst the RF-filtered pMSAs, we also applied AF to pMSAs for PPIs reported in the literature, including those identified in experimental high-throughput screens. Similarly to our de novo PPI screen procedure, we considered protein pairs with AF contact probability larger than 0.67 to be confident interacting partners. We found that 51% of the gold standard PPIs were supported by high AF contact probability (Fig. 1C), with lower ratios for candidate PPIs manually curated from multiple literature sources (http://interactome.dfci.harvard.edu/S_cerevisiae/download/LC_multiple.txt) (3) (34%) or supported by low-throughput experiments according to BIOGRID (19) (27%). The ratio of AF-supported PPIs is even lower for protein pairs identified by Y2H (19%) or APMS (16%) screens, consistent with the known larger fraction of false positives in large-scale experimental screens (20). The fast RF 2-track model used in the de novo screen has accuracy comparable to or better than that of the large-scale experimental screens when assessed in this way: with a high-stringency RF cutoff, the fraction of AF-supported pairs among PPIs identified by RF is 32%, similar to the accuracy of low-throughput experiments; with a lower-stringency cutoff, this fraction becomes closer to that of the large-scale experimental screens but somewhat fewer true PPIs are missed (Fig. 1C).

In total, we identified 717 likely interacting pairs from the “de novo RF → AF” screen, and 1,223 from the “pooled experimental sets → AF” screen, of which 434 overlap, resulting in a total of 1,506 PPIs. Out of these, 718 have been structurally characterized, 684 have some supporting experimental data from literature and databases, and 104 are not to our knowledge previously
