Open access · Journal Article · DOI: 10.1038/S41467-021-21636-Z

Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences.

02 Mar 2021-Nature Communications (Springer Science and Business Media LLC)-Vol. 12, Iss: 1, pp 1396-1396
Abstract: Increasing numbers of protein interactions have been identified in high-throughput experiments, but only a small proportion have solved structures. Recently, sequence coevolution-based approaches have led to a breakthrough in predicting monomer protein structures and protein interaction interfaces. Here, we address the challenges of large-scale interaction prediction at residue resolution with a fast alignment concatenation method and a probabilistic score for the interaction of residues. Importantly, this method (EVcomplex2) is able to assess the likelihood of a protein interaction, as we show here applied to large-scale experimental datasets where the pairwise interactions are unknown. We predict 504 interactions de novo in the E. coli membrane proteome, including 243 that are newly discovered. While EVcomplex2 does not require available structures, coevolving residue pairs can be used to produce structural models of protein interactions, as done here for membrane complexes including the Flagellar Hook-Filament Junction and the Tol/Pal complex. Our understanding of the residue-level details of protein interactions remains incomplete. Here, the authors show sequence coevolution can be used to infer interacting proteins with residue-level details, including predicting 467 interactions de novo in the Escherichia coli cell envelope proteome.
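The abstract mentions an alignment concatenation step without describing it. As a rough illustration only, the sketch below pairs sequences from two alignments by a shared species key and joins them into one row per species. The pairing rule (one sequence per species), the function name `concatenate_alignments`, and the toy species keys and sequences are all assumptions made for this example; they are not taken from EVcomplex2.

```python
from typing import Dict, List, Tuple


def concatenate_alignments(
    msa_a: Dict[str, str],
    msa_b: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Pair aligned sequences by shared species key and join them end to end.

    Assumes each alignment holds at most one sequence per species, which is a
    simplification made for this illustration.
    """
    shared = sorted(set(msa_a) & set(msa_b))
    return [(species, msa_a[species] + msa_b[species]) for species in shared]


# Toy input: hypothetical species keys and aligned sequences ('-' marks a gap).
msa_a = {"ecoli": "MKT-A", "salty": "MRT-A", "vibch": "MKSGA"}
msa_b = {"ecoli": "LLV", "vibch": "LIV", "bacsu": "MLV"}

paired = concatenate_alignments(msa_a, msa_b)
for species, joined in paired:
    print(species, joined)
```

Only species present in both alignments survive the pairing, so the joint alignment can be much shallower than either input; that trade-off is one reason pairing strategies matter for coevolution analysis.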

20 results found

Open access · Journal Article · DOI: 10.1038/S41467-021-22732-W
Abstract: The ability to design functional sequences and predict effects of variation is central to protein engineering and biotherapeutics. State-of-art computational methods rely on models that leverage evolutionary information but are inadequate for important applications where multiple sequence alignments are not robust. Such applications include the prediction of variant effects of indels, disordered proteins, and the design of proteins such as antibodies due to the highly variable complementarity determining regions. We introduce a deep generative model adapted from natural language processing for prediction and design of diverse functional sequences without the need for alignments. The model performs state-of-art prediction of missense and indel effects and we successfully design and test a diverse 10^5-nanobody library that shows better expression than a 1000-fold larger synthetic library. Our results demonstrate the power of the alignment-free autoregressive model in generalizing to regions of sequence space traditionally considered beyond the reach of prediction and design.

Topics: Generative model (54%)

27 Citations

Open access · Posted Content · DOI: 10.1101/2021.09.15.460468
15 Sep 2021-bioRxiv
Abstract: Predicting the structure of single-chain proteins is now close to being a solved problem due to the recent achievement of AlphaFold2 (AF2). However, predicting the structure of interacting protein chains is still a challenge. Here, we utilise AF2 to optimise a protocol for predicting the structure of heterodimeric protein complexes using only sequence information. We find that using the default AF2 protocol, 32% of the models in the Dockground test set can be modelled accurately. By tuning the input alignment and identifying the best model, we adjusted the performance to 43%. Our protocol uses MSAs generated by AF2 and MSAs paired on the organism level generated with HHblits. In a more extensive, more realistic, independent test set, the accuracy is 59%. In comparison, the alternative fold-and-dock method RoseTTAFold is only successful in 10% of the cases on this set and traditional docking methods 22%. However, for the traditional method, the performance would be lower if the bound form of both monomers was not known. The success is higher for bacterial protein pairs, pairs with large interaction areas consisting of helices or sheets, and many homologous sequences. We can distinguish acceptable (DockQ>0.23) from incorrect models with an AUC of 0.84 on the test set by analysing the predicted interfaces. At an error rate of 1%, 13% are acceptable (at a 10% error rate, 40% of the models are acceptable). All scripts and tools to run our protocol are freely available at:

Topics: Test set (51%)

10 Citations

Open access · Posted Content · DOI: 10.1101/2021.09.30.462231
Ian R. Humphreys, Jimin Pei, Minkyung Baek, Krishnakumar A, +26 more · Institutions (12)
30 Sep 2021-bioRxiv
Abstract: Protein-protein interactions play critical roles in biology, but despite decades of effort, the structures of many eukaryotic protein complexes are unknown, and there are likely many interactions that have not yet been identified. Here, we take advantage of recent advances in proteome-wide amino acid coevolution analysis and deep-learning-based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes, as represented within the Saccharomyces cerevisiae proteome. We use a combination of RoseTTAFold and AlphaFold to screen through paired multiple sequence alignments for 8.3 million pairs of S. cerevisiae proteins and build models for strongly predicted protein assemblies with two to five components. Comparison to existing interaction and structural data suggests that these predictions are likely to be quite accurate. We provide structure models spanning almost all key processes in Eukaryotic cells for 104 protein assemblies which have not been previously identified, and 608 which have not been structurally characterized. One-sentence summary We take advantage of recent advances in proteome-wide amino acid coevolution analysis and deep-learning-based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes.

7 Citations

Open access · Posted Content · DOI: 10.1101/2021.06.04.446442
Gabriele Pozzati, Wensi Zhu, Claudio Bassot, John Lamb, +3 more · Institutions (2)
07 Jun 2021-bioRxiv
Abstract: In the last decade, de novo protein structure prediction accuracy for individual proteins has improved significantly by utilizing deep learning (DL) methods for harvesting the co-evolution information from large multiple sequence alignments (MSA). In CASP14, the best method could predict the structure of most proteins with impressive accuracy. The same approach can, in principle, also be used to extract information about evolutionary-based contacts across protein-protein interfaces. However, most of the earlier studies have not used the latest DL methods for inter-chain contact distance predictions. In this paper, we showed for the first time that using one of the best DL-based residue-residue contact prediction methods (trRosetta), it is possible to simultaneously predict both the tertiary and quaternary structures of some protein pairs, even when the structures of the monomers are not known. Straightforward application of this method to a standard dataset for protein-protein docking yielded limited success, however, using alternative methods for MSA generating allowed us to dock accurately significantly more proteins. We also introduced a novel scoring function, PconsDock, that accurately separates 98% of correctly and incorrectly folded and docked proteins and thus this function can be used to evaluate the quality of the resulting docking models. The average performance of the method is comparable to the use of traditional, template-based or ab initio shape-complementarity-only docking methods, however, no a priori structural information for the individual proteins is needed. Moreover, the results of traditional and fold-and-dock approaches are complementary and thus a combined docking pipeline should increase overall docking success significantly. The dock-and-fold pipeline helped us to generate the best model for one of the CASP14 oligomeric targets, H1065.

6 Citations

Open access · Journal Article · DOI: 10.1002/PROT.26235
22 Sep 2021-Proteins
Abstract: The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.

6 Citations


56 results found

Open access · Journal Article
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from

33,540 Citations

Journal Article · DOI: 10.1002/BIP.360221211
Wolfgang Kabsch, Chris Sander · Institutions (1)
01 Dec 1983-Biopolymers
Abstract: For a successful analysis of the relation between amino acid sequence and protein structure, an unambiguous and physically meaningful definition of secondary structure is essential. We have developed a set of simple and physically motivated criteria for secondary structure, programmed as a pattern-recognition process of hydrogen-bonded and geometrical features extracted from x-ray coordinates. Cooperative secondary structure is recognized as repeats of the elementary hydrogen-bonding patterns “turn” and “bridge.” Repeating turns are “helices,” repeating bridges are “ladders,” connected ladders are “sheets.” Geometric structure is defined in terms of the concepts torsion and curvature of differential geometry. Local chain “chirality” is the torsional handedness of four consecutive Cα positions and is positive for right-handed helices and negative for ideal twisted β-sheets. Curved pieces are defined as “bends.” Solvent “exposure” is given as the number of water molecules in possible contact with a residue. The end result is a compilation of the primary structure, including SS bonds, secondary structure, and solvent exposure of 62 different globular proteins. The presentation is in linear form: strip graphs for an overall view and strip tables for the details of each of 10.925 residues. The dictionary is also available in computer-readable form for protein structure prediction work.
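The chirality definition in this abstract (the torsional handedness of four consecutive Cα positions, positive for right-handed helices) can be checked with a short calculation. The sketch below computes a signed dihedral angle from four points and applies it to an idealized helix. It is an illustration of the stated definition, not code from the DSSP program; the helper names and the helix geometry (2.3 Å radius, 1.5 Å rise and 100° turn per residue, approximate textbook values for an α-helix) are assumptions made for this example.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


def helix_ca(n: int, handed: int = 1) -> List[Vec]:
    """Idealized C-alpha trace; handed=+1 is right-handed, -1 is its mirror image."""
    points = []
    for k in range(n):
        t = math.radians(100.0 * k)  # assumed turn per residue
        points.append((2.3 * math.cos(t), 2.3 * math.sin(t), handed * 1.5 * k))
    return points


right = dihedral(*helix_ca(4))
left = dihedral(*helix_ca(4, handed=-1))
print("right-handed:", "positive" if right > 0 else "negative")
print("mirror image:", "positive" if left > 0 else "negative")
```

Mirroring the coordinates flips the sign of the angle but not its magnitude, which is what makes the quantity usable as a handedness indicator.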

12,998 Citations

Journal Article · DOI: 10.1016/J.JMB.2007.05.022
E. Krissinel, Kim Henrick · Institutions (1)
Abstract: We discuss basic physical-chemical principles underlying the formation of stable macromolecular complexes, which in many cases are likely to be the biological units performing a certain physiological function. We also consider available theoretical approaches to the calculation of macromolecular affinity and entropy of complexation. The latter is shown to play an important role and make a major effect on complex size and symmetry. We develop a new method, based on chemical thermodynamics, for automatic detection of macromolecular assemblies in the Protein Data Bank (PDB) entries that are the results of X-ray diffraction experiments. As found, biological units may be recovered at 80-90% success rate, which makes X-ray crystallography an important source of experimental data on macromolecular complexes and protein-protein interactions. The method is implemented as a public WWW service.

7,202 Citations

Open access · Journal Article · DOI: 10.1093/NAR/GKU1003
Abstract: The many functional partnerships and interactions that occur between proteins are at the core of cellular processing and their systematic characterization helps to provide context in molecular systems biology. However, known and predicted interactions are scattered over multiple resources, and the available data exhibit notable differences in terms of quality and completeness. The STRING database aims to provide a critical assessment and integration of protein-protein interactions, including direct (physical) as well as indirect (functional) associations. The new version 10.0 of STRING covers more than 2000 organisms, which has necessitated novel, scalable algorithms for transferring interaction information between organisms. For this purpose, we have introduced hierarchical and self-consistent orthology annotations for all interacting proteins, grouping the proteins into families at various levels of phylogenetic resolution. Further improvements in version 10.0 include a completely redesigned prediction pipeline for inferring protein-protein associations from co-expression data, an API interface for the R computing environment and improved statistical analysis for enrichment tests in user-provided networks.

6,834 Citations

Open access · Journal Article · DOI: 10.1093/NAR/GKH131
Abstract: To provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information, the Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt) consortium. Our mission is to provide a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces. The central database will have two sections, corresponding to the familiar Swiss-Prot (fully manually curated entries) and TrEMBL (enriched with automated classification, annotation and extensive cross-references). For convenient sequence searches, UniProt also provides several non-redundant sequence databases. The UniProt NREF (UniRef) databases provide representative subsets of the knowledgebase suitable for efficient searching. The comprehensive UniProt Archive (UniParc) is updated daily from many public source databases. The UniProt databases can be accessed online or downloaded in several formats. The scientific community is encouraged to submit data for inclusion in UniProt.

Topics: UniProt (68%)

6,522 Citations

Related Papers (5)
Sequence co-evolution gives 3D contacts and structures of protein complexes. 25 Sep 2014, eLife
Thomas A. Hopf, Charlotta P I Schärfe +7 more
99% related

Protein interaction networks revealed by proteome coevolution. 12 Jul 2019, Science
Qian Cong, Ivan Anishchenko +2 more
99% related

Protein 3D structure computed from evolutionary sequence variation. 07 Dec 2011, PLOS ONE
Debora S. Marks, Lucy J. Colwell +5 more
94% related

Highly accurate protein structure prediction with AlphaFold. 15 Jul 2021, Nature
John M. Jumper, Richard O. Evans +32 more
88% related