
Limits and potential of combined folding and docking using PconsDock

07 Jun 2021 - bioRxiv (Cold Spring Harbor Laboratory)


Limits and potential of combined folding
and docking using PconsDock.
Gabriele Pozzati 1,*, Wensi Zhu 1,*, Claudio Bassot 1,*, John Lamb 1, Petras Kundrotas 1,2, Arne Elofsson 1
1 Science for Life Laboratory and Dep of Biochemistry and Biophysics, Stockholm University, Box 1031, 171 21 Solna, Sweden
2 Center for Computational Biology, The University of Kansas, Lawrence, KS 66047, USA
* = contributed equally.
Abstract
In the last decade, de novo protein structure prediction accuracy for individual proteins has
improved significantly by utilising deep learning (DL) methods for harvesting the co-evolution
information from large multiple sequence alignments (MSA). In CASP14, the best groups
predicted the structure of most proteins with impressive accuracy. The same approach can, in
principle, also be used to extract information about evolutionary-based contacts across
protein-protein interfaces. However, most of the earlier studies have not used the latest DL
methods for inter-chain contact distance prediction. This paper introduces a fold-and-dock
method, PconsDock, based on predicted residue-residue distances with trRosetta. PconsDock
can simultaneously predict the tertiary and quaternary structure of a protein pair, even when the
structures of the monomers are not known. The straightforward application of this method to a
standard dataset for protein-protein docking yielded limited success. However, using alternative
methods for generating MSAs allowed us to accurately dock significantly more proteins. We also
introduced a novel scoring function, PconsDock, that accurately separates 98% of correctly and
incorrectly folded and docked proteins. The average performance of the method is comparable
to the use of traditional, template-based or ab initio shape-complementarity-only docking
methods. However, no a priori structural information for the individual proteins is needed.
Moreover, the results of conventional and fold-and-dock approaches are complementary, and
thus a combined docking pipeline could increase overall docking success significantly.
PconsDock contributed to the best model for one of the CASP14 oligomeric targets, H1065.
bioRxiv preprint doi: https://doi.org/10.1101/2021.06.04.446442; this version posted June 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Introduction
Protein structure is crucial for our understanding of biological function. However, experimentally
determining the structure of a protein is still time-consuming and expensive. Therefore,
computational methods will be the only way to determine the structure of most proteins in
the foreseeable future. Until recently, the only method to reliably predict the structure of a
protein was to model it using a homologous template. However, reliable templates are not
available for close to half the residues in the human proteome [1].
For several decades the prediction of protein structure directly from sequence information has
been an unachievable dream. However, that changed about a decade ago when improved
methods using co-evolution achieved sufficient residue contact information to predict the
structure of many proteins [2,3]. Later, deep learning [4,5] and prediction of residue-residue
distances provided further improvements [6,7]. Today this means that for many, if not most,
individual proteins, it is possible to accurately predict the structure of their folded domains [8].
Recently, DeepMind demonstrated at CASP14 that using an end-to-end learnable approach,
high-quality prediction of almost all protein domains is already feasible today (although not
generally available).
In principle, the same type of methods used for predicting the structure of a single protein can
predict the interaction between two proteins [9,10]. However, there is one fundamental
difference: it is necessary to create paired alignments to identify the interaction between two
proteins, i.e. identifying what pairs of proteins interact in the same manner. The identification of
interacting pairs is assumed to be relatively easy for pairs of proteins that both only contain a
single homolog in a set of genomes, but when multiple paralogs exist - the exact pairing is
difficult [11].
Proteins do, however, not act alone. They function by interacting with other proteins and other
molecules. Protein interaction can vary in nature from stable interaction present in small and
large protein complexes to transient interactions often used for regulation. Experimentally the
study of stable protein interactions can be done using various techniques. Structural
determination methods, including crystallography and cryo-electron microscopy (cryo-EM), can solve
the structure of protein complexes, while other methods can be used to identify that two proteins
interact without obtaining detailed structural information.
Prediction of protein interactions has been an even more significant challenge than predicting
the structure of individual proteins. Many different techniques have been developed, but in short,
they can be divided into three categories: (i) docking primarily based on shape complementarity
[12], (ii) template-based modelling [13], and (iii) flexible docking [14,15]. Various energy
functions have also been used to improve the identification of correct docking poses [16]. In
addition, co-evolution-based methods have also been used to predict the structure of complexes
[9,17].

Benchmarks have been developed to elucidate the advantages and disadvantages of different
docking methods [18]. Shape complementarity works excellently on native complexes, but the
accuracy drops fast when using the unbound structures of the proteins and even further if models
of the proteins are used [19,20]. Template-based modelling works excellently if a complex with
significant sequence identity exists in PDB but does not work for novel complexes [21,22].
Successful DCA-based methods to predict protein-protein interactions preceded the large-scale
prediction of single proteins, starting with bacterial two-component signalling in 2009 [17].
These methods were then extended to a handful of other complexes by several groups [9,10].
It is still unclear how generally applicable these methods are, but the potential to
vastly increase the space of known protein-protein interactions should lie in using some type of
co-evolution-based method. Computational cost limits flexible docking, but a fold-and-dock
protocol [23] based on coevolution does not require an exact structure of the two individual
proteins.
In addition to determining the structure of a protein complex, it is also crucial to determine which
proteins interact. However, protein-protein interaction is not an easily defined entity. It might
include anything from proteins regulating the expression of genes to proteins strongly bound to
each other in a large molecular machine. Several interaction databases exist [24,25], and
methods, including co-evolution based methods [26], to predict interactions have been
developed.
Here, we examine whether it is possible to simultaneously fold and dock [23] two proteins using
coevolutionary information, rather than only docking them. In addition, we use one of the best methods
(trRosetta) instead of DCA-based [2] methods to predict intra- and inter-chain distances. One
advantage of a fold-and-dock methodology is that it is not dependent on the availability of
individual structures and should therefore be less sensitive to structural rearrangements upon
binding. The disadvantage is that there are many more degrees of freedom in the
system. We find that in several cases, it is possible to simultaneously fold and dock the dimer
accurately. Although the success rate is low (<10%), this is comparable to the accuracy of other
docking methods, which utilise the structures of both individual proteins. In addition, the
methods are complementary.
Results
The protocol used here starts from two multiple sequence alignments, created by searching with
jackhmmer [27] against all complete proteomes from UniProt [28]. After that, a combined
multiple sequence alignment is created by including the top paired hit from each proteome. It
should be noted that the depth of the combined multiple sequence alignment is often
significantly smaller than for the individual proteins. In addition, a few alternative methods both
for generating the alignments and selecting the sequences were tried. These are discussed
below. Next, twenty Glycine residues were added to separate the two sequences in the
combined multiple sequence alignment. The combined alignment can be created in two different
orientations, A-B vs B-A, and we have tried to use both combinations.
Next, the combined multiple sequence alignment is used to predict distances and
angles with trRosetta [29]. These are then used as input to Rosetta or CNS [30] to fold and dock
the two proteins.
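The alignment pairing and linker insertion described above can be illustrated with a minimal sketch. It is not the PconsDock implementation. The function names, the input layout (hits grouped by proteome identifier, each with a score where higher is better) and the choice to write the glycine linker into every row of the alignment are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# (score, aligned sequence); a higher score is assumed to mean a better hit.
Hit = Tuple[float, str]

# Twenty glycine residues separate the two chains, as stated in the text.
LINKER = "G" * 20


def top_hit(hits: List[Hit]) -> str:
    """Return the aligned sequence of the best-scoring hit."""
    return max(hits, key=lambda hit: hit[0])[1]


def build_joint_msa(
    hits_a: Dict[str, List[Hit]],
    hits_b: Dict[str, List[Hit]],
    order: str = "AB",
) -> List[str]:
    """Pair the top hit of each chain per proteome and join them with the linker.

    Only proteomes with at least one hit for both chains contribute a row.
    `order` selects the A-B or B-A orientation of the combined alignment.
    """
    if order not in ("AB", "BA"):
        raise ValueError("order must be 'AB' or 'BA'")
    rows = []
    for proteome in sorted(set(hits_a) & set(hits_b)):
        seq_a = top_hit(hits_a[proteome])
        seq_b = top_hit(hits_b[proteome])
        first, second = (seq_a, seq_b) if order == "AB" else (seq_b, seq_a)
        rows.append(first + LINKER + second)
    return rows


if __name__ == "__main__":
    # Toy hits: proteome "p1" has hits for both chains, "p2" and "p3" do not.
    toy_a = {"p1": [(50.0, "MKV-"), (80.0, "MKVL")], "p2": [(30.0, "MRVL")]}
    toy_b = {"p1": [(10.0, "AC")], "p3": [(5.0, "AD")]}
    for row in build_joint_msa(toy_a, toy_b, order="AB"):
        print(row)
```

Because only proteomes with hits for both chains yield a row, the joint alignment is never deeper than the shallower of the two individual alignments, in line with the observation above that the combined alignment is often significantly smaller.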
Below, we will discuss when this methodology works and when it fails, compare the performance of
different alignments, compare the performance with other docking techniques, and finally
introduce a score, PconsDock, which can be used to accurately distinguish successful and
unsuccessful docking attempts.
Example of successful fold and dock.
Figure 1: A) Predicted (lower triangle) and actual (upper triangle) distance map of the
protein 4gmj. The two blue stripes represent the poly-G linker between the two chains.
The title shows that 287 interchain contacts are predicted and that 48.4% of these are
correct. B) Real (dark colours) and modelled (light colours) structure of the protein
1vrs. The accuracy of the models is good, dockQ score 0.42, and the TM-scores for
the two chains are 0.82 and 0.85, respectively.
First, we demonstrate that the algorithm can accurately fold and dock a pair of proteins in at
least one case. Figure 1 presents one successful example of the fold-and-dock protocol for the
human protein complex between NOT1 MIF4G and CAF1 (PDB: 4gmj)[31]. The prediction is
built on an alignment containing 1189 sequences (Meff=523) created by three iterations of
jackhmmer [27] and an E-value cutoff of 10^-3 against all reference proteomes in UniProt [28].
Visually, it can be seen that the intra-chain distance maps are similar and most intra-chain
contacts are predicted accurately (PPV>0.90 for both chains), resulting in well-folded models of
both chains (TM-score >0.8 for both). In total, 139 of the 287 predicted inter-chain contacts are
correct (PPV of 48.4%). The final docked model is also accurate
(dockQ score 0.42). However, as we will show below, many complexes are unfortunately not as
easy to model as 4gmj. To test the performance of the algorithm, we have therefore used 222
heterodimeric protein pairs from Dockground 4.3 [18,28].
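The positive predictive value (PPV) quoted for the inter-chain contacts is the fraction of predicted contacts that are also contacts in the native structure. The sketch below shows one way to compute it from a predicted and a native inter-chain distance block. The 8 Å contact cutoff and the function name are assumptions for illustration, since the text does not state the contact definition used.

```python
from typing import Sequence, Tuple


def interchain_ppv(
    pred_dist: Sequence[Sequence[float]],
    true_dist: Sequence[Sequence[float]],
    cutoff: float = 8.0,  # assumed contact cutoff in Angstrom, not given in the text
) -> Tuple[int, int, float]:
    """Return (correct, predicted, PPV) for inter-chain contacts.

    Both inputs are len(chain A) x len(chain B) matrices of residue-residue
    distances between the two chains. A pair counts as a predicted contact
    when its predicted distance is below the cutoff, and as correct when the
    native distance is below the cutoff as well.
    """
    predicted = 0
    correct = 0
    for pred_row, true_row in zip(pred_dist, true_dist):
        for pred, true in zip(pred_row, true_row):
            if pred < cutoff:
                predicted += 1
                if true < cutoff:
                    correct += 1
    ppv = correct / predicted if predicted else 0.0
    return correct, predicted, ppv


if __name__ == "__main__":
    # Toy 2 x 2 inter-chain block: three predicted contacts, two of them native.
    toy_pred = [[5.0, 12.0], [7.0, 6.0]]
    toy_true = [[6.0, 4.0], [10.0, 7.5]]
    print(interchain_ppv(toy_pred, toy_true))
    # The 4gmj numbers from the text: 139 correct out of 287 predicted.
    print(round(139 / 287, 3))
```

For the 4gmj example, 139 correct contacts out of 287 predicted gives 139/287 ≈ 0.484, which matches the 48.4% stated in the Figure 1 caption.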
Modelling accuracy depends on the size of the MSA; docking
performance does not.
Figure 2: Performance of the fold-and-dock methodology versus the size of the joint
alignments. Average TM-score of the two chains (A) and dockQ scores (B) plotted
against the size of the multiple sequence alignment used to predict the contacts.
The Dockground heterodimeric dataset was used to test the performance of the fold-and-dock
methodology. First, we examined the dependence of the performance on the size of the multiple
sequence alignment. It can be seen that the average TM-score for both chains
increases with the size of the combined alignment, Figure 2A. At a depth of 100 sequences, the
average TM-score is over 0.6, indicating that about 100 effective sequences are in most cases
sufficient to obtain the fold of a protein.
Next, we examined the quality of the predicted dimers, Figure 2B. A few models are docked
correctly (dockQ score >0.23). However, most protein pairs are not accurately docked (dockQ


