
Limits and potential of combined folding and docking using PconsDock

Gabriele Pozzati¹, Wensi Zhu¹, Claudio Bassot¹, John Lamb¹, Petras J. Kundrotas¹,², Arne Elofsson¹

Science for Life Laboratory¹, University of Kansas²

07 Jun 2021 - bioRxiv (Cold Spring Harbor Laboratory)

TL;DR: A fold-and-dock method, PconsDock, based on predicted residue-residue distances with trRosetta, that can simultaneously predict the tertiary and quaternary structure of a protein pair, even when the structures of the monomers are not known.


Abstract: In the last decade, de novo protein structure prediction accuracy for individual proteins has improved significantly by utilizing deep learning (DL) methods for harvesting the co-evolution information from large multiple sequence alignments (MSA). In CASP14, the best method could predict the structure of most proteins with impressive accuracy. The same approach can, in principle, also be used to extract information about evolutionary-based contacts across protein-protein interfaces. However, most of the earlier studies have not used the latest DL methods for inter-chain contact distance predictions. In this paper, we showed for the first time that using one of the best DL-based residue-residue contact prediction methods (trRosetta), it is possible to simultaneously predict both the tertiary and quaternary structures of some protein pairs, even when the structures of the monomers are not known. Straightforward application of this method to a standard dataset for protein-protein docking yielded limited success; however, using alternative methods for MSA generation allowed us to accurately dock significantly more proteins. We also introduced a novel scoring function, PconsDock, that accurately separates 98% of correctly and incorrectly folded and docked proteins, and thus this function can be used to evaluate the quality of the resulting docking models. The average performance of the method is comparable to the use of traditional, template-based or ab initio shape-complementarity-only docking methods; however, no a priori structural information for the individual proteins is needed. Moreover, the results of traditional and fold-and-dock approaches are complementary, and thus a combined docking pipeline should increase overall docking success significantly. The fold-and-dock pipeline helped us to generate the best model for one of the CASP14 oligomeric targets, H1065.

Summary

Protein structure is crucial for our understanding of biological function.
At a depth of 100 sequences, the average TM-score is over 0.6, indicating that about 100 effective sequences are in most cases sufficient to obtain the fold of a protein.
The default (N3) performance is compared with pyconsFold (uses the pyconsFold program instead of Rosetta), RaptorX (uses inter-chain contacts predicted by RaptorX instead of distances from trRosetta), RaptorX and N3-pdb use the intra-chain distances from the native structures, and N3-merged uses intra-chain distances predicted by the full alignments for each chain independently.
First, it can be seen that the successful dockings tend to have a multiple sequence alignment of one hundred or more residues, see Figure 5A.
There are a few targets whose performance increases significantly.
First, the authors compared it to one shape complementarity method, Gramm, and one template-based docking method, TMdock (see Figure 9).
In some cases, only specific alignment gives correct folding and docking based on the intrinsic evolutionary characteristic of the proteins and their interaction.
Here, it should be noted that a dockQ score over 0.23 roughly corresponds to an “acceptable” model in CAPRI [45], and the authors will therefore refer to all models with dockQ >0.23 as correct and all others as incorrect.
The distances were then used in Rosetta as described in the original trRosetta protocol.
Morcos F, Pagnini A, Lunt B, Bertolino A, Marks D, Sander C, et al. Estimation of Residue-Residue Coevolution using Direct Coupling Analysis Identifies Many Native Contacts Across a Large Number of Domain Families.


Figures (5)

Figure 3: Results of PconsDock using different alignments. A) dockQ scores for all models using six different alignments (see Table 1). B) The dockQ scores for the 15 proteins where at

Figure 8. A) Prediction qualities for all models (using the methods to produce an alignment marked with a star in Table 1). B) Predictions qualities for all models using different alignments and sequence databases (see Table 1 for details). C) Comparison of dockQ scores for individual models compared to the N3 models.

Table 1: Overview of alignment methods used in this study (top rows) as input to the fold and dock protocol or alternative docking methods and their performance (number of correctly docked proteins, average dockQ score, and average TM-score).

Table 2: Overview of consensus methods used to rank models.

Figure 6. Distance maps for predicted (lower left) and native (upper right) distances for three proteins with artefacts 3qlu (A,B), 3pv6 (C,D) and 4yoc(E,F). Left columns (A,C,E) with N3 alignments, right (B,D,F) with reciprocal best hits.


Limits and potential of combined folding and docking using PconsDock.

Gabriele Pozzati¹, Wensi Zhu¹, Claudio Bassot¹, John Lamb¹, Petras Kundrotas¹,², Arne Elofsson¹

¹ Science for Life Laboratory and Dep of Biochemistry and Biophysics, Stockholm University, Box 1031, 171 21 Solna, Sweden

² Center for Computational Biology, The University of Kansas, Lawrence, KS 66047, USA

=contributed equally.

Abstract

In the last decade, de novo protein structure prediction accuracy for individual proteins has

improved significantly by utilising deep learning (DL) methods for harvesting the co-evolution

information from large multiple sequence alignments (MSA). In CASP14, the best groups

predicted the structure of most proteins with impressive accuracy. The same approach can, in

principle, also be used to extract information about evolutionary-based contacts across

protein-protein interfaces. However, most of the earlier studies have not used the latest DL

methods for inter-chain contact distance prediction. This paper introduces a fold-and-dock

method, PconsDock, based on predicted residue-residue distances with trRosetta. PconsDock

can simultaneously predict the tertiary and quaternary structure of a protein pair, even when the

structures of the monomers are not known. The straightforward application of this method to a

standard dataset for protein-protein docking yielded limited success. However, using alternative

methods for MSA generation allowed us to accurately dock significantly more proteins. We also

introduced a novel scoring function, PconsDock, that accurately separates 98% of correctly and

incorrectly folded and docked proteins. The average performance of the method is comparable

to the use of traditional, template-based or ab initio shape-complementarity-only docking

methods. However, no a priori structural information for the individual proteins is needed.

Moreover, the results of conventional and fold-and-dock approaches are complementary, and

thus a combined docking pipeline could increase overall docking success significantly.

PconsDock contributed to the best model for one of the CASP14 oligomeric targets, H1065.

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.04.446442; this version posted June 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Introduction

Protein structure is crucial for our understanding of biological function. However, experimentally

determining the structure of a protein is still time-consuming and expensive. Therefore,

computational methods will be the only method to determine the structure of most proteins in

the foreseeable future. Until recently, the only method to reliably predict the structure of a

protein was to model it using a homologous template. However, reliable templates are not

available for close to half the residues in the human proteome [1].

For several decades the prediction of protein structure directly from sequence information has

been an unachievable dream. However, that changed about a decade ago when improved

methods using co-evolution achieved sufficient residue contact information to predict the

structure of many proteins [2,3]. Later, deep learning [4,5] and prediction of residue-residue

distances provided further improvements [6,7]. Today this means that for many, if not most,

individual proteins, it is possible to accurately predict the structure of their folded domains [8].

Recently, DeepMind demonstrated at CASP14 that using an end-to-end learnable approach,

high-quality prediction of almost all protein domains is already feasible today (although not

generally available).

In principle, the same type of methods used for predicting the structure of a single protein can

predict the interaction between two proteins [9,10]. However, there is one fundamental

difference: it is necessary to create paired alignments to identify the interaction between two

proteins, i.e. identifying what pairs of proteins interact in the same manner. The identification of

interacting pairs is assumed to be relatively easy for pairs of proteins that both only contain a

single homolog in a set of genomes, but when multiple paralogs exist - the exact pairing is

difficult [11].

Proteins, however, do not act alone. They function by interacting with other proteins and other

molecules. Protein interaction can vary in nature from stable interaction present in small and

large protein complexes to transient interactions often used for regulation. Experimentally the

study of stable protein interactions can be done using various techniques. Structural

determination methods, including crystallography and cryo-electron microscopy (cryo-EM), can solve

the structure of protein complexes, while other methods can be used to identify that two proteins

interact without obtaining detailed structural information.

Prediction of protein interactions has been an even more significant challenge than predicting

the structure of individual proteins. Many different techniques have been developed, but in short,

they can be divided into three categories: (i) docking primarily based on shape complementarity

[12], (ii) template-based modelling [13], and (iii) flexible docking [14,15]. Various energy

functions have also been used to improve the identification of correct docking poses [16]. In

addition, co-evolution-based methods have also been used to predict the structure of complexes

[9,17].

Benchmarks have been developed to elucidate the advantages and disadvantages of different

docking methods [18]. Shape complementarity works excellently on native complexes, but the

accuracy drops fast when using the structures of unbound complexes and even further if models

of the proteins are used [19,20]. Template-based modelling works excellently if a complex with

significant sequence identity exists in the PDB but does not work for novel complexes [21,22].

Successful direct coupling analysis (DCA) based methods to predict protein-protein interactions preceded the large-scale

prediction of single proteins by predicting the bacterial two-component signalling in 2009 [17].

These methods were then extended to a handful of other complexes by several groups [9,10].

However, it is still unclear how generally applicable these methods are, but the potential to

vastly increase the space of known protein-protein interactions should lie in using some type of

co-evolution based methods. The computational cost limits flexible docking, but a fold-and-dock

protocol [23] based on coevolution does not require an exact structure of the two individual

proteins.

In addition to determining the structure of a protein complex, it is also crucial to determine which

proteins interact. However, protein-protein interaction is not an easily defined entity. It might

include anything from proteins regulating the expression of genes to proteins strongly bound to

each other in a large molecular machine. Several interaction databases exist [24,25], and

methods, including co-evolution based methods [26], to predict interactions have been

developed.

Here, we examine if it is possible to simultaneously fold and dock [23] two proteins by using

coevolutionary information and not only dock them. In addition, we use one of the best methods

(trRosetta) instead of DCA-based methods [2] to predict intra- and inter-chain distances. One

advantage of a fold-and-dock methodology is that it is not dependent on the availability of

individual structures and should therefore be less sensitive to structural rearrangements upon

binding. The obvious disadvantage is that there are many more degrees of freedom in the

system. We find that in several cases, it is possible to fold and dock the dimer simultaneously

and accurately. Although the success rate is low (<10%), this is comparable to the accuracy of other

docking methods, which utilise the structures of both individual proteins. In addition, the

methods are complementary.

Results

The protocol used here starts from two multiple sequence alignments, created by searching with

jackhmmer [27] against all complete proteomes from UniProt [28]. After that, a combined

multiple sequence alignment is created by including the top paired hit from each proteome. It

should be noted that the depth of the combined multiple sequence alignment is often

significantly smaller than for the individual proteins. In addition, a few alternative methods both

for generating the alignments and selecting the sequences were tried. These are discussed

below. Next, twenty Glycine residues were added to separate the two sequences in the

combined multiple sequence alignment. The combined alignment can be created in two different

orientations, A-B vs B-A, and we tried both.
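The construction of the combined alignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `join_paired_alignment`, the dictionary layout (proteome identifier mapped to the aligned top hit), and the toy sequences are all assumptions made for the example. Only the twenty-glycine linker and the two orientations come from the text.

```python
LINKER = "G" * 20  # twenty glycine residues, as described in the text


def join_paired_alignment(msa_a, msa_b, order="AB"):
    """Concatenate two alignments row by row with a poly-G linker.

    msa_a and msa_b are hypothetical dicts mapping a proteome identifier
    to the aligned sequence of the top hit in that proteome. Only
    proteomes present in both alignments are kept, which is why the
    combined alignment can be shallower than either input.
    """
    if order not in ("AB", "BA"):
        raise ValueError("order must be 'AB' or 'BA'")
    first, second = (msa_a, msa_b) if order == "AB" else (msa_b, msa_a)
    shared = [key for key in first if key in second]
    return {key: first[key] + LINKER + second[key] for key in shared}


# Toy alignments (invented sequences, for illustration only).
msa_a = {"query": "MKV-LT", "proteome1": "MRVALT", "proteome2": "MKI-LS"}
msa_b = {"query": "GSHW", "proteome1": "GTHW", "proteome3": "ASHW"}

joined_ab = join_paired_alignment(msa_a, msa_b, "AB")
joined_ba = join_paired_alignment(msa_a, msa_b, "BA")
print(len(joined_ab), joined_ab["query"])
```

In this toy case only two of the three rows in each input survive the pairing, which mirrors the observation that the combined alignment is often smaller than the individual ones.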

Next, the combined multiple sequence alignment is used to predict distances and

angles with trRosetta [29]. These are then used as input to Rosetta or CNS [30] to fold and dock

the two proteins.

Below, we discuss when this methodology works and when it fails, compare the performance of

different alignments, compare the performance with other docking techniques, and finally

introduce a score, PconsDock, which can be used to accurately distinguish successful from

unsuccessful docking attempts.

Example of successful fold and dock.

Figure 1: A) Predicted (lower triangle) and actual (upper triangle) distance map of the

protein 4gmj. The two blue stripes represent the poly-G linker between the two chains.

The title shows that 287 interchain contacts are predicted and that 48.4% of these are

correct. B) Real (dark colours) and modelled (light colours) structure of the protein

4gmj. The accuracy of the model is good (dockQ score 0.42), and the TM-scores for

the two chains are 0.82 and 0.85, respectively.

First, we demonstrate that the algorithm can accurately fold and dock a pair of proteins in at

least one case. Figure 1 presents one successful example of the fold-and-dock protocol for the

human protein complex between NOT1 MIF4G and CAF1 (PDB: 4gmj)[31]. The prediction is

built on an alignment containing 1189 sequences (Meff=523) created by three iterations of

jackhmmer [27] with an E-value cutoff of 10⁻³ against all reference proteomes in UniProt [28].

Visually, it can be seen that the intra-chain distance maps are similar and most intra-chain

contacts are predicted accurately (PPV>0.90 for both chains), resulting in well-folded models of

both chains (TM-score >0.8 for both). In total, 139 of the 287 predicted inter-chain contacts

are correct (PPV of 48.4%). The final docked model is also accurate

(dockQ score 0.42). However, as we will show below, many complexes are unfortunately not as

easy to model as 4gmj. To test the performance of the algorithm, we have therefore used 222

heterodimeric protein pairs from Dockground 4.3 [18,28].
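The positive predictive value (PPV) quoted for the inter-chain contacts is the fraction of predicted contacts that are also present in the native structure. A minimal sketch of that calculation follows, under stated assumptions: the function name `interchain_ppv` is hypothetical, the distance maps are taken to cover chain A, the linker and chain B in that order, and the 8 Å contact cutoff is a common convention assumed here, not a value stated in this section.

```python
import numpy as np


def interchain_ppv(pred_dist, true_dist, len_a, linker_len=20, cutoff=8.0):
    """Return (n_predicted, n_correct, ppv) for inter-chain contacts.

    pred_dist and true_dist are square distance maps over the joined
    sequence (chain A, linker, chain B). The cutoff of 8.0 is an assumed
    contact definition, not taken from the paper.
    """
    start_b = len_a + linker_len
    # Keep only the chain A x chain B block; linker rows/columns are skipped.
    pred = pred_dist[:len_a, start_b:] < cutoff
    true = true_dist[:len_a, start_b:] < cutoff
    n_pred = int(pred.sum())
    n_correct = int((pred & true).sum())
    ppv = n_correct / n_pred if n_pred else float("nan")
    return n_pred, n_correct, ppv


# Toy example: chain A has 3 residues, the linker 2, chain B 2.
len_a, linker_len, len_b = 3, 2, 2
size = len_a + linker_len + len_b
pred_dist = np.full((size, size), 20.0)
true_dist = np.full((size, size), 20.0)
pred_dist[0, 5] = 5.0   # predicted contact that is also native
pred_dist[1, 6] = 6.0   # predicted contact that is not native
true_dist[0, 5] = 4.0

result = interchain_ppv(pred_dist, true_dist, len_a, linker_len)
print(result)
```

Applied to the numbers reported for 4gmj, 139 correct out of 287 predicted contacts gives 139/287, or about 48.4%.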

Modelling accuracy depends on the size of the MSA; docking performance does not.

Figure 2: Performance of the fold-and-dock methodology versus the size of the joint

alignments. Average TM-score of the two chains (A) and dockQ scores (B) plotted

against the size of the multiple sequence alignment used to predict the contacts.

The Dockground heterodimeric dataset was used to test the performance of the fold-and-dock

methodology. First, we examined how the performance depends on the size of the multiple sequence

alignment. The average TM-score for both chains increases

with the size of the combined alignment (Figure 2A). At a depth of 100 sequences, the

average TM-score is over 0.6, indicating that about 100 effective sequences are in most cases

sufficient to obtain the fold of a protein.
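The number of effective sequences (Meff) is usually smaller than the raw alignment depth because near-identical sequences are down-weighted. This section does not define how Meff was computed, so the sketch below uses one widely used convention as an assumption: each sequence is weighted by the inverse of the number of sequences that are at least 80% identical to it, and Meff is the sum of the weights. The function name `effective_sequences` and the toy alignment are illustrative only.

```python
import numpy as np


def effective_sequences(msa, identity_cutoff=0.8):
    """Estimate Meff for a list of equal-length aligned sequences.

    Assumed convention (not taken from the paper): weight each sequence
    by 1 / (number of sequences with identity >= identity_cutoff to it,
    itself included). Gap characters are compared like any other symbol.
    """
    arr = np.array([list(seq) for seq in msa])
    # Pairwise fraction of identical alignment columns, shape (N, N).
    identity = (arr[:, None, :] == arr[None, :, :]).mean(axis=2)
    neighbours = (identity >= identity_cutoff).sum(axis=1)
    return float((1.0 / neighbours).sum())


# Toy alignment: three near-identical sequences and one divergent sequence.
toy_msa = ["AAAAA", "AAAAA", "AAAAC", "CCCCC"]
meff = effective_sequences(toy_msa)
print(len(toy_msa), meff)
```

The pairwise comparison builds an N x N x L array, so for alignments with many thousands of sequences a blocked or clustered approach would be needed; the sketch favours clarity over memory use.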

Next, we examined the quality of the predicted dimers (Figure 2B). A few models are docked

correctly (dockQ score >0.23). However, most protein pairs are not accurately docked (dockQ
