Showing papers in "Proteins in 2020"


Journal ArticleDOI
01 Mar 2020-Proteins
TL;DR: The machine learning techniques used in the literature are reviewed, following their evolution from simple algorithms such as logistic regression to more advanced methods like support vector machines and modern deep neural networks.
Abstract: Proteins play important roles in living organisms, and their function is directly linked with their structure. Due to the growing gap between the number of proteins being discovered and their functional characterization (in particular as a result of experimental limitations), reliable prediction of protein function through computational means has become crucial. This paper reviews the machine learning techniques used in the literature, following their evolution from simple algorithms such as logistic regression to more advanced methods like support vector machines and modern deep neural networks. Hyperparameter optimization methods adopted to boost prediction performance are presented. In parallel, the metamorphosis in the features used by these algorithms from classical physicochemical properties and amino acid composition, up to text-derived features from biomedical literature and learned feature representations using autoencoders, together with feature selection and dimensionality reduction techniques, are also reviewed. The success stories in the application of these techniques to both general and specific protein function prediction are discussed.

86 citations


Journal ArticleDOI
01 Aug 2020-Proteins
TL;DR: Analysis indicates that progress in predicting increasingly challenging and diverse types of targets is due to closer integration of template‐based modeling techniques with docking, scoring, and model refinement procedures, and to significant incremental improvements in the underlying methodologies.
Abstract: We present the seventh report on the performance of methods for predicting the atomic resolution structures of protein complexes offered as targets to the community-wide initiative on the Critical Assessment of Predicted Interactions. Performance was evaluated on the basis of 36 114 models of protein complexes submitted by 57 groups (including 13 automatic servers) in prediction rounds held during the years 2016 to 2019 for eight protein-protein, three protein-peptide, and five protein-oligosaccharide targets with ligands of different lengths. Six of the protein-protein targets represented challenging hetero-complexes, due to factors such as availability of distantly related templates for the individual subunits, or for the full complex, inter-domain flexibility, conformational adjustments at the binding region, or the multi-component nature of the complex. The main challenge for the protein-peptide and protein-oligosaccharide complexes was to accurately model the ligand conformation and its interactions at the interface. Encouragingly, models of acceptable quality, or better, were obtained for a total of six protein-protein complexes, which included four of the challenging hetero-complexes and a homo-decamer. But fewer of these targets were predicted with medium or higher accuracy. High accuracy models were obtained for two of the three protein-peptide targets, and for one of the protein-oligosaccharide targets. The remaining protein-sugar targets were predicted with medium accuracy. Our analysis indicates that progress in predicting increasingly challenging and diverse types of targets is due to closer integration of template-based modeling techniques with docking, scoring, and model refinement procedures, and to significant incremental improvements in the underlying methodologies.

76 citations


Journal ArticleDOI
06 Jan 2020-Proteins
TL;DR: A nine‐layer 3D deep convolutional neural network (CNN) that takes as input a gridded box with the atomic coordinates and types around a residue that achieved state‐of‐the‐art performance when tested on large numbers of test proteins and benchmark datasets.
Abstract: Designing protein sequences that fold to a given three-dimensional (3D) structure has long been a challenging problem in computational structural biology with significant theoretical and practical implications. In this study, we first formulated this problem as predicting the residue type given the 3D structural environment around the Cα atom of a residue, which is repeated for each residue of a protein. We designed a nine-layer 3D deep convolutional neural network (CNN) that takes as input a gridded box with the atomic coordinates and types around a residue. Several CNN layers were designed to capture structure information at different scales, such as bond lengths, bond angles, torsion angles, and secondary structures. Trained on a very large number of protein structures, the method, called ProDCoNN (protein design with CNN), achieved state-of-the-art performance when tested on large numbers of test proteins and benchmark datasets.

55 citations
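
The gridded-box input described in the ProDCoNN abstract above can be illustrated with a minimal sketch. It bins atoms into a cubic grid centred on a residue's Cα atom and counts atoms per type and voxel. The function name, the 18 Å box, and the 1 Å voxel size are assumptions made for illustration, not values taken from the paper, and the sketch skips any reorientation of the box into a residue-local frame.

```python
from collections import defaultdict


def voxelize_environment(atoms, center, box_size=18.0, voxel_size=1.0):
    """Count atoms per (atom_type, ix, iy, iz) voxel in a cube centred on `center`.

    atoms:  iterable of (atom_type, x, y, z) tuples
    center: (x, y, z) of the central atom, e.g. the C-alpha of the residue
    box_size and voxel_size are illustrative defaults (assumed, in angstroms).
    """
    n = int(round(box_size / voxel_size))  # voxels per box edge
    half = box_size / 2.0
    grid = defaultdict(int)
    for atom_type, x, y, z in atoms:
        # Shift so the box spans [0, box_size) on each axis, then bin.
        idx = [int((coord - c + half) // voxel_size)
               for coord, c in zip((x, y, z), center)]
        # Atoms outside the box are ignored.
        if all(0 <= i < n for i in idx):
            grid[(atom_type, idx[0], idx[1], idx[2])] += 1
    return dict(grid), n


# Toy coordinates, invented for the example.
toy_atoms = [("C", 0.0, 0.0, 0.0), ("N", 1.5, 0.0, 0.0), ("O", 100.0, 0.0, 0.0)]
toy_grid, toy_n = voxelize_environment(toy_atoms, (0.0, 0.0, 0.0),
                                       box_size=4.0, voxel_size=1.0)
```

A sparse dictionary keeps the sketch dependency-free; a real 3D CNN would need a dense tensor with one channel per atom type.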


Journal ArticleDOI
01 May 2020-Proteins
TL;DR: It is shown that combining machine‐learning based models from AlphaFold with state‐of‐the‐art physics‐based refinement via molecular dynamics simulations further improves predictions to outperform any other prediction method tested during the latest round of CASP.
Abstract: Protein structure prediction has long been available as an alternative to experimental structure determination, especially via homology modeling based on templates from related sequences. Recently, models based on distance restraints from coevolutionary analysis via machine learning have significantly expanded the ability to predict structures for sequences without templates. One such method, AlphaFold, also performs well on sequences where templates are available but without using such information directly. Here we show that combining machine-learning based models from AlphaFold with state-of-the-art physics-based refinement via molecular dynamics simulations further improves predictions to outperform any other prediction method tested during the latest round of CASP. The resulting models have highly accurate global and local structures, including high accuracy at functionally important interface residues, and they are highly suitable as initial models for crystal structure determination via molecular replacement.

40 citations


Journal ArticleDOI
01 Jan 2020-Proteins
TL;DR: The CoupledMoves method, which combines backbone flexibility and sequence exploration into a single acceptance step during the sampling trajectory, better recapitulates observed sequence profiles than the BackrubEnsemble and FastDesign methods.
Abstract: Computational design of binding sites in proteins remains difficult, in part due to limitations in our current ability to sample backbone conformations that enable precise and accurate geometric positioning of side chains during sequence design. Here we present a benchmark framework for comparison between flexible-backbone design methods applied to binding interactions. We quantify the ability of different flexible backbone design methods in the widely used protein design software Rosetta to recapitulate observed protein sequence profiles assumed to represent functional protein/protein and protein/small molecule binding interactions. The CoupledMoves method, which combines backbone flexibility and sequence exploration into a single acceptance step during the sampling trajectory, better recapitulates observed sequence profiles than the BackrubEnsemble and FastDesign methods, which separate backbone flexibility and sequence design into separate acceptance steps during the sampling trajectory. Flexible-backbone design with the CoupledMoves method is a powerful strategy for reducing sequence space to generate targeted libraries for experimental screening and selection.

31 citations


Journal ArticleDOI
01 Feb 2020-Proteins
TL;DR: ClustENM‐HADDOCK performs better than two‐body docking in protein‐protein cases but worse than a flexible multidomain docking approach, however, it does show a better or similar performance compared to previous protein‐DNA docking approaches, which makes it a suitable alternative.
Abstract: Incorporating the dynamic nature of biomolecules in the modeling of their complexes is a challenge, especially when the extent and direction of the conformational changes taking place upon binding are unknown. Estimating whether the binding of a biomolecule to its partner(s) occurs in a conformational state accessible to its unbound form ("conformational selection") and/or the binding process induces conformational changes ("induced-fit") is another challenge. We propose here a method combining conformational sampling using ClustENM (an elastic network-based modeling procedure) with docking using HADDOCK, in a framework that incorporates conformational selection and induced-fit effects upon binding. The extent of the applied deformation is estimated from its energetic cost, inspired by mechanical tensile testing on materials. We applied our pre- and post-docking sampling of conformational changes to the flexible multidomain protein-protein docking benchmark and a subset of the protein-DNA docking benchmark. Our ClustENM-HADDOCK approach produced acceptable to medium quality models in 7/11 and 5/6 cases for the protein-protein and protein-DNA complexes, respectively. The conformational selection (sampling prior to docking) has the highest impact on the quality of the docked models for the protein-protein complexes. The induced-fit stage of the pipeline (post-sampling), however, improved the quality of the final models for the protein-DNA complexes. Compared to previously described strategies to handle conformational changes, ClustENM-HADDOCK performs better than two-body docking in protein-protein cases but worse than a flexible multidomain docking approach. However, it does show a better or similar performance compared to previous protein-DNA docking approaches, which makes it a suitable alternative.

30 citations


Journal ArticleDOI
04 Jul 2020-Proteins
TL;DR: The results show that among the genomes analyzed, two sequence regions in the N‐terminal domain “MESEFR” and “SYLTPG” are specific to human SARS CoV‐2, and a disulfide bridge connecting 480C and 488C in the extended loop are structural determinants for the recognition of human ACE‐2 receptor.
Abstract: Coronavirus disease 2019 (COVID-19) is a pandemic infectious disease caused by novel severe acute respiratory syndrome coronavirus-2 (SARS CoV-2). SARS CoV-2 is transmitted more rapidly and readily than SARS CoV. Both SARS CoV and SARS CoV-2 recognize the human angiotensin converting enzyme-2 (ACE-2) receptor via their glycosylated spike proteins. We generated multiple sequence alignments and phylogenetic trees for representative spike proteins of SARS CoV and SARS CoV-2 from various host sources in order to analyze the specificity in SARS CoV-2 spike proteins required for causing infection in humans. Our results show that among the genomes analyzed, two sequence regions in the N-terminal domain "MESEFR" and "SYLTPG" are specific to human SARS CoV-2. In the receptor-binding domain, two sequence regions "VGGNY" and "EIYQAGSTPCNGV" and a disulfide bridge connecting 480C and 488C in the extended loop are structural determinants for the recognition of human ACE-2 receptor. The complete genome analysis of representative SARS CoVs from bat, civet, human host sources, and human SARS CoV-2 identified the bat genome (GenBank code: MN996532.1) as closest to the recent novel human SARS CoV-2 genomes. The bat SARS CoV genomes (GenBank codes: MG772933 and MG772934) are evolutionary intermediates in the mutagenesis progression toward becoming human SARS CoV-2.

27 citations


Journal ArticleDOI
01 Jun 2020-Proteins
TL;DR: A comprehensive analysis of 4741 high‐resolution, non‐redundant X‐ray crystallographic structures collected from 11 hyperthermophilic, 32 thermophilic and 53 mesophilic prokaryotes unravels at least five “nearly universal” signatures of thermal adaptation, irrespective of the enormous sequence, structure, and functional diversity of the proteins compared.
Abstract: Are there any generalized molecular principles of thermal adaptation? Here, integrating the concepts of structural bioinformatics, sequence analysis, and classical knot theory, we develop a robust computational framework that seeks for mechanisms of thermal adaptation by comparing orthologous mesophilic-thermophilic and mesophilic-hyperthermophilic proteins of remarkable structural and topological similarities, and still leads us to context-independent results. A comprehensive analysis of 4741 high-resolution, non-redundant X-ray crystallographic structures collected from 11 hyperthermophilic, 32 thermophilic and 53 mesophilic prokaryotes unravels at least five "nearly universal" signatures of thermal adaptation, irrespective of the enormous sequence, structure, and functional diversity of the proteins compared. A careful investigation further extracts a set of amino acid changes that can potentially enhance protein thermal stability, and remarkably, these mutations are overrepresented in protein crystallization experiments, in disorder-to-order transitions and in engineered thermostable variants of existing mesophilic proteins. These results could be helpful to find a precise, global picture of thermal adaptation.

27 citations


Journal ArticleDOI
12 Jun 2020-Proteins
TL;DR: Protein sequence networks showed relationships between two‐ and three‐domain MCOs, allowing for family‐specific annotation and inference of evolutionary relationships, and standard numbering schemes facilitated a comparison to previously reported results from mutagenesis studies.
Abstract: Multicopper oxidases (MCOs) use copper ions as cofactors to oxidize a variety of substrates while reducing oxygen to water. MCOs have been identified in various taxa, with notable occurrences in fungi. The role of these fungal MCOs in lignin degradation sparked an interest due to their potential for application in biofuel production and various other industries. MCOs consist of different protein domains, which led to their classification into two-, three-, and six-domain MCOs. The previously established Laccase and Multicopper Oxidase Engineering Database (https://lcced.biocatnet.de) was updated and now includes 51 058 sequences and 229 structures of MCOs. Sequences and structures of all MCOs were systematically compared. All MCOs consist of cupredoxin-like domains. Two-domain MCOs are formed by the N- and C-terminal domain (domain N and C), while three-domain MCOs have an additional domain (M) in between, homologous to domain C. The six-domain MCOs consist of alternating domains N and C, each three times. Two standard numbering schemes were developed for the copper-binding domains N and C, which facilitated the identification of conserved positions and a comparison to previously reported results from mutagenesis studies. Two sequence motifs for the copper binding sites were identified per domain. Their modularity, depending on the placement of the T1-copper binding site, was demonstrated. Protein sequence networks showed relationships between two- and three-domain MCOs, allowing for family-specific annotation and inference of evolutionary relationships.

23 citations


Journal ArticleDOI
06 Jan 2020-Proteins
TL;DR: The results clearly show the necessity of dynamics to understand and characterize the favorable orientations of the VH and VL domains implying a considerable binding interface flexibility and reveal in all antibody fragments very similar VH‐VL interdomain variations comparable to the distributions observed for known X‐ray structures of antibodies.
Abstract: The relative orientation of the two variable domains, VH and VL, influences the shape of the antigen binding site, that is, the paratope, and is essential to understand antigen specificity. ABangle characterizes the VH-VL orientation by using five angles and a distance and compares it to other known structures. Molecular dynamics simulations of antibody variable domains (Fvs) reveal fluctuations in the relative domain orientations. The observed dynamics between these domains are confirmed by NMR experiments on a single-chain variable fragment antibody (scFv) in complex with IL-1β and an antigen-binding fragment (Fab). The variability of these relative domain orientations can be interpreted as a structural feature of antibodies, which increases the antibody repertoire significantly and can enlarge the number of possible binding partners substantially. The movements of the VH and VL domains are well sampled with molecular dynamics simulations and are in agreement with the NMR ensemble. Fast Fourier transformation of the ABangle metrics allows timescales of 0.1-10 GHz to be assigned to the fastest collective interdomain movements. The results clearly show the necessity of dynamics to understand and characterize the favorable orientations of the VH and VL domains implying a considerable binding interface flexibility and reveal in all antibody fragments (Fab, scFv, and Fv) very similar VH-VL interdomain variations comparable to the distributions observed for known X-ray structures of antibodies. SIGNIFICANCE STATEMENT: Antibodies have become key players as therapeutic agents. The binding ability of antibodies is determined by the antigen-binding fragment (Fab), in particular the variable fragment region (Fv). Antigen-binding is mediated by the complementarity-determining regions consisting of six loops, three each from the heavy and light chain variable domains, VH and VL. The relative orientation of the VH and VL domains influences the shape of the antigen-binding site and is a major objective in antibody design. In agreement with NMR experiments and molecular dynamics simulations, we show a considerable binding site flexibility in the low nanosecond timescale. Thus we suggest that this flexibility and its implications for binding and specificity should be considered when designing and optimizing therapeutic antibodies.

22 citations
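
The Fourier analysis of ABangle metrics mentioned in the abstract above can be sketched as follows. This is an illustration only: it uses a plain discrete Fourier transform rather than a fast Fourier transform (same spectrum, slower to compute), and the function name, the 0.1 ns sampling interval, and the synthetic 1 GHz test signal are all assumptions rather than details from the paper.

```python
import cmath
import math


def dominant_frequency_ghz(series, dt_ns):
    """Return the frequency (GHz) with the largest spectral power.

    series: equally spaced samples of one orientation metric (e.g. an angle)
    dt_ns:  sampling interval in nanoseconds (1/ns equals GHz)
    Uses a naive O(n^2) DFT, which is adequate for a short illustrative trace.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]  # remove the zero-frequency offset
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k / (n * dt_ns)


# Synthetic trace (invented): a 1 GHz oscillation sampled every 0.1 ns for 10 ns.
dt = 0.1
trace = [60.0 + 2.0 * math.sin(2 * math.pi * 1.0 * (i * dt)) for i in range(100)]
peak = dominant_frequency_ghz(trace, dt)
```

A period of 1 ns corresponds to 1 GHz, so the quoted 0.1-10 GHz range maps to motions with periods between 10 ns and 0.1 ns.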


Journal ArticleDOI
01 Jan 2020-Proteins
TL;DR: It is found that summary features work well for single‐genome (human‐only) data but are outperformed by pPSSM for diverse PDB‐derived data sets, suggesting greater summary‐level redundancy in the former, and CNN models comprehensively outperform their corresponding MLP models.
Abstract: Sequence based DNA-binding protein (DBP) prediction is a widely studied biological problem. Sliding windows on position specific substitution matrices (PSSMs) rows predict DNA-binding residues well on known DBPs but the same models cannot be applied to unequally sized protein sequences. PSSM summaries representing column averages and their amino-acid wise versions have been effectively used for the task, but it remains unclear if these features carry all the PSSM's predictive power, traditionally harnessed for binding site predictions. Here we evaluate if PSSMs scaled up to a fixed size by zero-vector padding (pPSSM) could perform better than the summary based features on similar models. Using multilayer perceptron (MLP) and deep convolutional neural network (CNN) models, we found that (a) summary features work well for single-genome (human-only) data but are outperformed by pPSSM for diverse PDB-derived data sets, suggesting greater summary-level redundancy in the former, (b) even when summary features work comparably well with pPSSM, a consensus on the two outperforms both of them, (c) CNN models comprehensively outperform their corresponding MLP models, and (d) actual predicted scores from different models depend on the choice of input feature sets used, whereas overall performance levels are model-dependent, with CNN leading in accuracy.
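
The two feature types compared in the abstract above, zero-padded PSSMs (pPSSM) and column-average summaries, can be sketched in a few lines. The function names, the 20-column width, and the choice to append the zero rows at the end of the sequence are assumptions for illustration; the abstract does not specify where the padding goes.

```python
def pad_pssm(pssm, target_len, n_cols=20):
    """Pad a PSSM (one row per residue) with zero vectors up to target_len rows.

    Assumption: padding rows are appended after the last residue.
    """
    if len(pssm) > target_len:
        raise ValueError("sequence is longer than the fixed target length")
    for row in pssm:
        if len(row) != n_cols:
            raise ValueError("every PSSM row must have n_cols entries")
    padded = [list(row) for row in pssm]  # copy so the input is not modified
    padded.extend([0.0] * n_cols for _ in range(target_len - len(pssm)))
    return padded


def column_average_summary(pssm):
    """Summary feature: the mean of each PSSM column over all residues."""
    n_rows = len(pssm)
    return [sum(col) / n_rows for col in zip(*pssm)]


# Toy two-residue PSSM with invented values.
toy_pssm = [[1.0] * 20, [3.0] * 20]
toy_padded = pad_pssm(toy_pssm, target_len=4)
toy_summary = column_average_summary(toy_pssm)
```

The padded matrix keeps per-position information at the cost of a fixed, larger input, while the summary collapses any sequence length to a single 20-value vector.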

Journal ArticleDOI
01 Apr 2020-Proteins
TL;DR: Using all‐atom molecular dynamics simulations, results reveal how the kinesin dimer retains 1HB state before ATP binding and how the dimer transits from 1HB to 2HB state after ATP binding.
Abstract: Kinesin dimer walks processively along a microtubule (MT) protofilament in a hand-over-hand manner, transiting alternately between one-head-bound (1HB) and two-heads-bound (2HB) states. In 1HB state, one head bound by adenosine diphosphate (ADP) is detached from MT and the other head is bound to MT. Here, using all-atom molecular dynamics simulations we determined the position and orientation of the detached ADP-head relative to the MT-bound head in 1HB state. We showed that in 1HB state when the MT-bound head is in ADP or nucleotide-free state, with its neck linker being undocked, the detached ADP-head and the MT-bound head have the high binding energy, and after adenosine triphosphate (ATP) binds to the MT-bound head, with its neck linker being docked, the binding energy between the two heads is reduced greatly. These results reveal how the kinesin dimer retains 1HB state before ATP binding and how the dimer transits from 1HB to 2HB state after ATP binding. Key residues involved in the head-head interaction in 1HB state were identified.

Journal ArticleDOI
01 Mar 2020-Proteins
TL;DR: A comprehensive system for quantifying the geometries of how TCRs bind peptide/MHC complexes is developed and it is shown that the system can discern differences not clearly revealed by more common methods.
Abstract: Recognition of antigenic peptides bound to major histocompatibility complex (MHC) proteins by αβ T cell receptors (TCRs) is a hallmark of T cell mediated immunity. Recent data suggest that variations in TCR binding geometry may influence T cell signaling, which could help explain outliers in relationships between physical parameters such as TCR-pMHC binding affinity and T cell function. Traditionally, TCR binding geometry has been described with simple descriptors such as the crossing angle, which quantifies what has become known as the TCR's diagonal binding mode. However, these descriptors often fail to reveal distinctions in binding geometry that are apparent through visual inspection. To provide a better framework for relating TCR structure to T cell function, we developed a comprehensive system for quantifying the geometries of how TCRs bind peptide/MHC complexes. We show that our system can discern differences not clearly revealed by more common methods. As an example of its potential to impact biology, we used it to reveal differences in how TCRs bind class I and class II peptide/MHC complexes, which we show allow the TCR to maximize access to and "read out" the peptide antigen. We anticipate our system will be of use in not only exploring these and other details of TCR-peptide/MHC binding interactions, but also addressing questions about how TCR binding geometry relates to T cell function, as well as modeling structural properties of class I and class II TCR-peptide/MHC complexes from sequence information. The system is available at https://tcr3d.ibbr.umd.edu/tcr_com or for download as a script.

Journal ArticleDOI
13 Mar 2020-Proteins
TL;DR: New comprehensive benchmark sets of protein models for the development and validation of protein docking, as well as a systematic assessment of free and template-based docking techniques on these sets are presented.
Abstract: Protein docking is essential for structural characterization of protein interactions. Besides providing the structure of protein complexes, modeling of proteins and their complexes is important for understanding the fundamental principles and specific aspects of protein interactions. The accuracy of protein modeling, in general, is still less than that of the experimental approaches. Thus, it is important to investigate the applicability of docking techniques to modeled proteins. We present new comprehensive benchmark sets of protein models for the development and validation of protein docking, as well as a systematic assessment of free and template-based docking techniques on these sets. As opposed to previous studies, the benchmark sets reflect the real case modeling/docking scenario where the accuracy of the models is assessed by the modeling procedure, without reference to the native structure (which would be unknown in practical applications). We also expanded the analysis to include docking of protein pairs where proteins have different structural accuracy. The results show that, in general, the template-based docking is less sensitive to the structural inaccuracies of the models than the free docking. The near-native docking poses generated by the template-based approach, typically, also have higher ranks than those produced by the free docking (although the free docking is indispensable in modeling the multiplicity of protein interactions in a crowded cellular environment). The results show that docking techniques are applicable to protein models in a broad range of modeling accuracy. The study provides clear guidelines for practical applications of docking to protein models.

Journal ArticleDOI
Yue Cao, Yang Shen
06 Mar 2020-Proteins
TL;DR: Directly learning from 3D structure data in graph representation, EGCN represents the first successful development of graph convolutional networks for protein docking and significantly improves ranking for a critical assessment of predicted interactions.
Abstract: Structural information about protein-protein interactions, often missing at the interactome scale, is important for mechanistic understanding of cells and rational discovery of therapeutics. Protein docking provides a computational alternative for such information. However, ranking near-native docked models high among a large number of candidates, often known as the scoring problem, remains a critical challenge. Moreover, estimating model quality, also known as the quality assessment problem, is rarely addressed in protein docking. In this study, the two challenging problems in protein docking are regarded as relative and absolute scoring, respectively, and addressed in one physics-inspired deep learning framework. We represent protein and complex structures as intra- and inter-molecular residue contact graphs with atom-resolution node and edge features. And we propose a novel graph convolutional kernel that aggregates interacting nodes' features through edges so that generalized interaction energies can be learned directly from 3D data. The resulting energy-based graph convolutional networks (EGCN) with multihead attention are trained to predict intra- and inter-molecular energies, binding affinities, and quality measures (interface RMSD) for encounter complexes. Compared to a state-of-the-art scoring function for model ranking, EGCN significantly improves ranking for a critical assessment of predicted interactions (CAPRI) test set involving homology docking; and is comparable or slightly better for Score_set, a CAPRI benchmark set generated by diverse community-wide docking protocols not known to training data. For Score_set quality assessment, EGCN shows about 27% improvement to our previous efforts. Directly learning from 3D structure data in graph representation, EGCN represents the first successful development of graph convolutional networks for protein docking.

Journal ArticleDOI
21 Jun 2020-Proteins
TL;DR: Streptococcus pyogenes sortase A's flexible substrate specificity is examined by investigating the role of the β7/β8 loop in determining substrate specificity; the SaSrtA‐derived mutant had an improved activity toward LPETG, the preferred substrate of SaSrtA WT.
Abstract: Sortases are a group of enzymes displayed on the cell-wall of Gram-positive bacteria. They are responsible for the attachment of virulence factors onto the peptidoglycan in a transpeptidation reaction through recognition of a pentapeptide substrate. Most housekeeping sortases recognize one specific pentapeptide motif; however, Streptococcus pyogenes sortase A (SpSrtA WT) recognizes LPETG, LPETA and LPKLG motifs. Here, we examined SpSrtA's flexible substrate specificity by investigating the role of the β7/β8 loop in determining substrate specificity. We exchanged the β7/β8 loop in SpSrtA with corresponding β7/β8 loops from Staphylococcus aureus (SaSrtA WT) and Bacillus anthracis (BaSrtA WT). While the BaSrtA-derived variant showed no enzymatic activity toward either LPETG or LPETA substrates, the activity of the SaSrtA-derived mutant toward the LPETA substrate was completely abolished. Instead, the mutant had an improved activity toward LPETG, the preferred substrate of SaSrtA WT.

Journal ArticleDOI
01 Jan 2020-Proteins
TL;DR: Using molecular dynamics simulations of the human CFTR NBD dimer, it is shown that F508del increases, in the prehydrolysis state, the inter‐motif distance in both ATP binding sites (ABP) when ATP is bound, and that the number of catalytically competent conformations decreases in the presence of F508del.
Abstract: The cystic fibrosis transmembrane conductance regulator (CFTR) channel is an ion channel responsible for chloride transport in epithelia and it belongs to the class of ABC transporters. The deletion of phenylalanine 508 (F508del) in CFTR is the most common mutation responsible for cystic fibrosis. Little is known about the effect of the mutation in the isolated nucleotide binding domains (NBDs), on dimer dynamics, ATP hydrolysis and even on nucleotide binding. Using molecular dynamics simulations of the human CFTR NBD dimer, we showed that F508del increases, in the prehydrolysis state, the inter-motif distance in both ATP binding sites (ABP) when ATP is bound. Additionally, a decrease in the number of catalytically competent conformations was observed in the presence of F508del. We used the subtraction technique to study the first 300 ps after ATP hydrolysis in the catalytically competent site and found that the F508del dimer shows smaller conformational changes than the wild type. Using longer simulation times, the magnitude of the conformational changes in both forms increases. Nonetheless, the F508del dimer shows lower C-α RMS values in comparison to the wild type, on the F508del loop, on the residues surrounding the catalytic site and the portion of NBD2 adjacent to ABP1. These results provide evidence that F508del interferes with the NBD dynamics before and after ATP hydrolysis. These findings shed new light on the effect of F508del on NBD dynamics and reveal a novel mechanism for the influence of F508del on CFTR.

Journal ArticleDOI
22 Apr 2020-Proteins
TL;DR: A new approach to compute the Energetic CONTributions of Amino acid residues and its possible Cross-Talk (ECONTACT) to study ligand binding using per-residue energy decomposition, molecular dynamics simulations and rescoring method without the need for experimental binding affinity is presented.
Abstract: Receptor-based QSAR approaches can enumerate the energetic contributions of amino acid residues toward ligand binding only when experimental binding affinity is associated. The structural data of protein-ligand complexes are witnessing a tremendous growth in the Protein Data Bank deposited with a few entries on binding affinity. We present here a new approach to compute the Energetic CONTributions of Amino acid residues and its possible Cross-Talk (ECONTACT) to study ligand binding using per-residue energy decomposition, molecular dynamics simulations and rescoring method without the need for experimental binding affinity. This approach recognizes potential cross-talks among amino acid residues imparting a nonadditive effect to the binding affinity with evidence of correlative motions in the dynamics simulations. The protein-ligand interaction energies deduced from multiple structures are decomposed into per-residue energy terms, which are employed as variables to principal component analysis and generated cross-terms. Out of 16 cross-talks derived from eight datasets of protein-ligand systems, the ECONTACT approach is able to associate 10 potential cross-talks with site-directed mutagenesis, free energy, and dynamics simulations data strongly. We modeled these key determinants of ligand binding using joint probability density function (jPDF) to identify cross-talks in protein structures. The top two cross-talks identified by ECONTACT approach corroborated with the experimental findings. Furthermore, virtual screening exercise using ECONTACT models better discriminated known inhibitors from decoy molecules. This approach proposes the jPDF metric to estimate the probability of observing cross-talks in any protein-ligand complex. The source code and related resources to perform ECONTACT modeling are available freely at https://www.gujaratuniversity.ac.in/econtact/.

Journal ArticleDOI
13 May 2020-Proteins
TL;DR: This work relaxes the assumption that the probabilities of observing the docking scores to different structures are independent, using several other machine learning methods (k nearest neighbor, logistic regression, support vector machine, and random forest) to improve ensemble docking.
Abstract: Ensemble docking has provided an inexpensive method to account for receptor flexibility in molecular docking for virtual screening. Unfortunately, as there is no rigorous theory to connect the docking scores from multiple structures to measured activity, researchers have not yet come up with effective ways to use these scores to classify compounds into actives and inactives. This shortcoming has led to a decrease, rather than an increase, in classification performance when more structures are added to the ensemble. Previously, we suggested that machine learning, implemented in the form of a naive Bayesian model, could alleviate this problem. However, the naive Bayesian model assumed that the probabilities of observing the docking scores to different structures are independent. This approximation might prevent it from achieving even higher performance. In the work presented in this paper, we relaxed this approximation by using several other machine learning methods (k nearest neighbor, logistic regression, support vector machine, and random forest) to improve ensemble docking. We found significant improvement.
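
As a rough illustration of the idea in this abstract, the sketch below treats each compound as a vector of docking scores (one score per receptor structure) and fits a logistic regression classifier to it, so the scores are weighted jointly rather than combined under an independence assumption. It is a from-scratch toy under stated assumptions, not the authors' implementation: the function names, the training settings, and the score values are all hypothetical.

```python
import math


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    X: list of score vectors, one docking score per receptor structure.
    y: list of 0/1 labels (1 = active, 0 = inactive).
    Returns (weights, bias, column_means); scores are mean-centered
    per structure so the gradient steps stay well conditioned.
    """
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(Xc, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus true
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b, means


def predict_active_probability(model, scores):
    """Probability that a compound is active, given its score vector."""
    w, b, means = model
    z = b + sum(wj * (s - m) for wj, s, m in zip(w, scores, means))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy data: docking scores (kcal/mol) of six compounds
# against three receptor structures; more negative means better.
X_train = [
    [-9.1, -8.7, -9.5],
    [-8.8, -9.2, -8.9],
    [-9.4, -8.5, -9.0],
    [-6.2, -7.0, -6.5],
    [-6.8, -6.1, -7.2],
    [-7.1, -6.6, -6.0],
]
y_train = [1, 1, 1, 0, 0, 0]

model = fit_logistic(X_train, y_train)
p_strong = predict_active_probability(model, [-9.0, -9.0, -9.0])
p_weak = predict_active_probability(model, [-6.5, -6.5, -6.5])
```

In practice one would more likely use an established library and a held-out test set; the point here is only that the classifier sees all structures' scores at once.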

Journal ArticleDOI
01 Feb 2020-Proteins
TL;DR: A second reaction mechanism is explored where the catalytic water is in the second shell of the Mg2+ and it is assumed that the cryo‐EM structure by itself is a suitable representation of a catalytic‐ready structure.
Abstract: Understanding the reaction mechanism of CRISPR-associated protein 9 (Cas9) is crucial for the application of programmable gene editing. Despite the availability of the structures of Cas9 in apo- and substrate-bound forms, the catalytically active structure is still unclear. Our first attempt to explore the catalytic mechanism of the Cas9 HNH domain was based on the reasonable assumption that we are dealing with the same mechanism as endonuclease VII, including the assumption that the catalytic water is in the first shell of the Mg2+. Trying this mechanism with the cryo-EM structure forced us to induce significant structural change driven by the movement of K848 (or another positively charged residue) close to the active site to facilitate the proton transfer step. In the present study, we explore a second reaction mechanism where the catalytic water is in the second shell of the Mg2+ and assume that the cryo-EM structure by itself is a suitable representation of a catalytic-ready structure. The alternative mechanism indicates that if the active water is from the second shell, then the calculated reaction barrier is lower compared with the corresponding barrier when the water comes from the first shell.

Journal ArticleDOI
01 Oct 2020-Proteins
TL;DR: The results show that the zinc and copper coordination results in a significant decrease of the solvation free energy in the C‐terminal region (Met35‐Val40), which in turn leads to a higher structural disorder.
Abstract: The aggregation of Aβ42 peptides is considered one of the main causes for the development of Alzheimer's disease. In this context, Zn2+ and Cu2+ play a significant role in regulating the aggregation mechanism, due to changes in the structural and the solvation free energy of Aβ42. In practice, experimental studies are not able to determine the latter properties, since the Aβ42-Zn2+ and Aβ42-Cu2+ peptide complexes are intrinsically disordered, exhibiting rapid conformational changes in the aqueous environment. Here, we investigate atomic structural variations and the solvation thermodynamics of Aβ42, Aβ42-Cu2+, and Aβ42-Zn2+ systems in explicit solvent (water) by using quantum chemical structures as templates for a metal binding site and combining extensive all-atom molecular dynamics (MD) simulations with a thorough solvation thermodynamic analysis. Our results show that the zinc and copper coordination results in a significant decrease of the solvation free energy in the C-terminal region (Met35-Val40), which in turn leads to a higher structural disorder. In contrast, the β-sheet formation at the same C-terminal region indicates a higher solvation free energy in the case of Aβ42. The solvation free energy of Aβ42 increases upon Zn2+ binding, due to the higher tendency of forming the β-sheet structure at the Leu17-Ala42 residues, in contrast to the case of binding with Cu2+. Finally, we find the hydrophobicity of Aβ42-Zn2+ in water is greater than in the case of Aβ42-Cu2+.

Journal ArticleDOI
01 May 2020-Proteins
TL;DR: Conversion of the free energy of NTP hydrolysis efficiently into mechanical work and/or information by transducing enzymes sustains living systems far from equilibrium, and so has been of interest for many decades.
Abstract: Conversion of the free energy of NTP hydrolysis efficiently into mechanical work and/or information by transducing enzymes sustains living systems far from equilibrium, and so has been of interest for many decades. Detailed molecular mechanisms, however, remain puzzling and incomplete. We previously reported that catalysis of tryptophan activation by tryptophanyl-tRNA synthetase, TrpRS, requires relative domain motion to re-position the catalytic Mg2+ ion, noting the analogy between that conditional hydrolysis of ATP and the escapement mechanism of a mechanical clock. The escapement allows the time-keeping mechanism to advance discretely, one gear at a time, if and only if the pendulum swings, thereby converting energy from the weight driving the pendulum into rotation of the hands. Coupling of catalysis to domain motion, however, mimics only half of the escapement mechanism, suggesting that domain motion may also be reciprocally coupled to catalysis, completing the escapement metaphor. Computational studies of the free energy surface restraining the domain motion later confirmed that reciprocal coupling: the catalytic domain motion is thermodynamically unfavorable unless the PPi product is released from the active site. These two conditional phenomena, demonstrated together only for the TrpRS mechanism, function as reciprocally coupled gates. As we and others have noted, such an escapement mechanism is essential to the efficient transduction of NTP hydrolysis free energy into other useful forms of mechanical or chemical work and/or information. Some implementation of both gating mechanisms (catalysis by domain motion and domain motion by catalysis) will thus likely be found in many other systems.

Journal ArticleDOI
18 Jul 2020-Proteins
TL;DR: It is shown here that at acidic pH, the aggregation of insulin is likely initiated by a partially folded monomeric intermediate, and knowledge of this transition may aid in the engineering of insulin variants that retain the favorable pharmacokinetic properties of monomeric insulin but are more resistant to aggregation.
Abstract: Insulin has long served as a model for protein aggregation, both due to the importance of aggregation in the manufacture of insulin and because the structural biology of insulin has been extensively characterized. Despite intensive study, details about the initial triggers for aggregation have remained elusive at the molecular level. We show here that at acidic pH, the aggregation of insulin is likely initiated by a partially folded monomeric intermediate. High-resolution structures of the partially folded intermediate show that it is coarsely similar to the initial monomeric structure but differs in subtle details: the A chain helices on the receptor interface are more disordered and the B chain helix is displaced from the C-terminal A chain helix when compared to the stable monomer. The result of these movements is the creation of a hydrophobic cavity in the center of the protein that may serve as a nucleation site for oligomer formation. Knowledge of this transition may aid in the engineering of insulin variants that retain the favorable pharmacokinetic properties of monomeric insulin but are more resistant to aggregation.

Journal ArticleDOI
01 Mar 2020-Proteins
TL;DR: Computer‐aided molecular design techniques can effectively guide the development of small‐molecule BRD4 BrD1 inhibitors, explain their selectivity origin, and further open doors to the design of new therapeutically improved derivatives.
Abstract: Bromodomains (BrDs), a conserved structural module in chromatin-associated proteins, are well known for recognizing ε-N-acetyl lysine residues on histones. One of the most relevant BrDs is BRD4, a tandem BrD-containing protein (BrD1 and BrD2) that plays a critical role in numerous diseases including cancer. Growing evidence shows that the two BrDs of BRD4 have different biological functions; hence selective ligands that can be used to study their functions are of great interest. Here, as a follow-up of our previous work, we first provide a detailed characterization study of the in silico rational design of Olinone as part of a series of five tetrahydropyrido indole-based compounds as BRD4 BrD1 inhibitors. Additionally, we investigated the molecular basis for Olinone's selective recognition by BrD1 over BrD2. Molecular dynamics simulations, free energy calculations, and conformational analyses of the apo-BRD4-BrD1|2 and BRD4-BrD1|2/Olinone complexes showed that Olinone's selectivity is facilitated by five key residues: Leu92 in BrD1|385 in BrD2 of ZA loop, Asn140|433, Asp144|His437 and Asp145|Glu438 of BC loop, and Ile146|Val49 of helix C. Furthermore, differences in the number of hydrogen bonds and in the mobility of the ZA and BC loops of the acetyl-lysine binding site between the BRD4 BrD1/Olinone and BrD2/Olinone complexes also contribute to the difference in Olinone's binding affinity and selectivity toward BrD1 over BrD2. Altogether, our computer-aided molecular design techniques can effectively guide the development of small-molecule BRD4 BrD1 inhibitors, explain their selectivity origin, and further open doors to the design of new therapeutically improved derivatives.

Journal ArticleDOI
01 Apr 2020-Proteins
TL;DR: It is demonstrated that protein domains may have a learnable implicit semantic “meaning” in the context of their functional contributions to the multi‐domain proteins in which they are found using Word2vec.
Abstract: In this paper, using Word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models which can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" where domain identifiers are tokens which may be considered as "words." Using all InterPro (Finn et al. 2017) Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam (Finn et al. 2016) domains of unknown function.
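
The "proteins as sentences, domains as words" framing can be made concrete with a small sketch. The snippet below only prepares skip-gram style (center, context) pairs from ordered domain architectures, which is the kind of input a Word2vec-type model is trained on; it does not train an embedding. The domain tokens (`DOM_A` and so on), the function name, and the window size are hypothetical placeholders, not values taken from the paper.

```python
def skipgram_pairs(sentences, window=2):
    """Return (center, context) token pairs for each sentence.

    sentences: list of token lists; here each list is the ordered
    domain architecture of one multi-domain protein.
    window: how many neighbours on each side count as context.
    """
    pairs = []
    for sentence in sentences:
        for i, center in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, sentence[j]))
    return pairs


# Hypothetical domain architectures (placeholder identifiers).
proteins = [
    ["DOM_A", "DOM_B", "DOM_C"],
    ["DOM_B", "DOM_D"],
]

pairs = skipgram_pairs(proteins, window=1)
```

Domains that repeatedly share neighbours end up with similar training contexts, which is what would let an embedding place a domain of unknown function near annotated ones.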

Journal ArticleDOI
01 Feb 2020-Proteins
TL;DR: Molecular mechanics/generalized Born surface area (MM/GBSA) calculations showed that van der Waals and nonpolar solvation energy terms are crucial components for thermodynamically stable binding of the inhibitors.
Abstract: G-protein coupled glucagon receptors (GCGRs) play an important role in glucose homeostasis and pathophysiology of Type-II Diabetes Mellitus (T2DM). The allosteric pocket located at the trans-membrane domain of GCGR consists of hydrophobic (TM5) and hydrophilic (TM7) units. Hydrophobic interactions with the amino acid residues of TM5 were found to facilitate the favorable orientation of the antagonist in the GCGR allosteric pocket. A statistically robust and highly predictive 3D-QSAR model was developed using 58 β-alanine based GCGR antagonists with significant variation in structure and potency profile. The correlation coefficient (R2) and cross-validation coefficient (Q2) of the developed model were found to be 0.9981 and 0.8253, respectively, at a PLS factor of 8. The favorable and unfavorable contributions of different structural features of the glucagon receptor antagonists were analyzed using 3D-QSAR contour plots. Hydrophobic and hydrogen bonding interactions were found to be the dominant non-bonding interactions in docking studies. The presence of the highest occupied molecular orbital (HOMO) in the polar part and the lowest unoccupied molecular orbital (LUMO) in the hydrophobic part of the antagonists leads to favorable protein-ligand interactions. Molecular mechanics/generalized Born surface area (MM/GBSA) calculations showed that van der Waals and nonpolar solvation energy terms are crucial components for thermodynamically stable binding of the inhibitors. The binding free energy of the most potent compound was found to be -63.475 kcal/mol, whereas the least active compound exhibited a binding energy of -41.097 kcal/mol. Further, five 100 ns molecular dynamics (MD) simulations were run to confirm the stability of the inhibitor-receptor complex. Outcomes of the present study can serve as the basis for designing improved GCGR antagonists.
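
For readers unfamiliar with the two statistics quoted above, the usual definitions are sketched below. The abstract does not state which cross-validation scheme was used, so the leave-one-out form of Q2 shown here is an assumption; the symbols are introduced only for this illustration, with y_i the observed activity, ŷ_i the fitted value, ŷ_(i) the value predicted for compound i by a model built without it, and ȳ the mean observed activity.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2},
\qquad
Q^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

Because each ŷ_(i) comes from a model that never saw compound i, Q2 is normally lower than R2, which is consistent with the reported 0.8253 versus 0.9981.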

Journal ArticleDOI
01 Aug 2020-Proteins
TL;DR: The results were broadly encouraging, and highlighted the pressing need to invest in (a) flexible docking algorithms with the ability to model loop and linker motions and in (b) new sampling and scoring methods for oligosaccharide‐protein interactions.
Abstract: Critical Assessment of PRediction of Interactions (CAPRI) rounds 37 through 45 introduced larger complexes, new macromolecules, and multistage assemblies. For these rounds, we used and expanded docking methods in Rosetta to model 23 target complexes. We successfully predicted 14 target complexes and recognized and refined near-native models generated by other groups for two further targets. Notably, for targets T110 and T136, we achieved the closest prediction of any CAPRI participant. We created several innovative approaches during these rounds. Since round 39 (target 122), we have used the new RosettaDock 4.0, which has a revamped coarse-grained energy function and the ability to perform conformer selection during docking with hundreds of pregenerated protein backbones. Ten of the complexes had some degree of symmetry in their interactions, so we tested Rosetta SymDock, realized its shortcomings, and developed the next-generation symmetric docking protocol, SymDock2, which includes docking of multiple backbones and induced-fit refinement. Since the last CAPRI assessment, we also developed methods for modeling and designing carbohydrates in Rosetta, and we used them to successfully model oligosaccharide-protein complexes in round 41. Although the results were broadly encouraging, they also highlighted the pressing need to invest in (a) flexible docking algorithms with the ability to model loop and linker motions and in (b) new sampling and scoring methods for oligosaccharide-protein interactions.

Journal ArticleDOI
27 Feb 2020-Proteins
TL;DR: In this article, the authors identify the physical basis for protein structural differences by modeling protein cores as jammed packings of amino acid-shaped particles, and find that the average jammed packing fraction is identical to that observed in the cores of protein structures solved by X-ray crystallography.
Abstract: There have been several studies suggesting that protein structures solved by NMR spectroscopy and X-ray crystallography show significant differences. To understand the origin of these differences, we assembled a database of high-quality protein structures solved by both methods. We also find significant differences between NMR and crystal structures: in the root-mean-square deviations of the Cα atomic positions, identities of core amino acids, backbone and side-chain dihedral angles, and packing fraction of core residues. In contrast to prior studies, we identify the physical basis for these differences by modeling protein cores as jammed packings of amino acid-shaped particles. We find that we can tune the jammed packing fraction by varying the degree of thermalization used to generate the packings. For an athermal protocol, we find that the average jammed packing fraction is identical to that observed in the cores of protein structures solved by X-ray crystallography. In contrast, highly thermalized packing-generation protocols yield jammed packing fractions that are even higher than those observed in NMR structures. These results indicate that thermalized systems can pack more densely than athermal systems, which suggests a physical basis for the structural differences between protein structures solved by NMR and X-ray crystallography.
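
One of the comparison measures named in this abstract, the root-mean-square deviation of Cα positions, is simple enough to sketch. The function below assumes the two structures have the same residues in the same order and have already been superimposed; it does not perform the optimal superposition that a real NMR-versus-crystal comparison would need first. The function name and the coordinates are hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) C-alpha positions.

    Assumes a one-to-one residue correspondence and that both
    coordinate sets are already in the same frame of reference.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates in angstroms: the second set is the first
# shifted by (3, 4, 0), so every atom is displaced by exactly 5.
structure_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_2 = [(x + 3.0, y + 4.0, z) for x, y, z in structure_1]

rmsd = ca_rmsd(structure_1, structure_2)
```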

Journal ArticleDOI
28 May 2020-Proteins
TL;DR: A protein sequence fitness scoring function that implements sequence and corresponding secondary structural information at tripeptide levels to differentiate natural and nonnatural proteins is developed and could facilitate the exploration of new perspectives in the design of novel functional proteins.
Abstract: The infinitesimally small sequence space naturally scouted in the millions of years of evolution suggests that the natural proteins are constrained by some functional prerequisites and should differ from randomly generated sequences. We have developed a protein sequence fitness scoring function that implements sequence and corresponding secondary structural information at tripeptide levels to differentiate natural and nonnatural proteins. The proposed fitness function is extensively validated on a dataset of about 210 000 natural and nonnatural protein sequences and benchmarked with existing methods for differentiating natural and nonnatural proteins. The high sensitivity, specificity, and percentage accuracy (0.81, 0.95, and 91%, respectively) of the fitness function demonstrate its potential application for sampling protein sequences with a higher probability of mimicking natural proteins. Moreover, the four major classes of proteins (α proteins, β proteins, α/β proteins, and α + β proteins) are separately analyzed and β proteins are found to score slightly lower as compared to other classes. Further, an analysis of about 250 designed proteins (adopted from previously reported cases) helped to define the boundaries for sampling the ideal protein sequences. The protein sequence characterization aided by the proposed fitness function could facilitate the exploration of new perspectives in the design of novel functional proteins.
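
The three performance measures quoted in this abstract follow directly from the counts of a binary confusion matrix, with "natural" taken here as the positive class. The sketch below shows the standard definitions only; the function name and the counts are hypothetical and are not meant to reproduce the paper's figures.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts.

    tp/fn: natural sequences classified correctly/incorrectly.
    tn/fp: nonnatural sequences classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)   # fraction of positives recovered
    specificity = tn / (tn + fp)   # fraction of negatives rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for 50 natural and 50 nonnatural sequences.
sens, spec, acc = binary_metrics(tp=40, fn=10, tn=45, fp=5)
```

Accuracy depends on the class balance of the dataset as well as on sensitivity and specificity, so the three numbers should be read together.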

Journal ArticleDOI
01 May 2020-Proteins
TL;DR: From the results, the most likely mechanism for inside‐out and outside‐in signaling is the switchblade model with further separation of the transmembrane helices.
Abstract: The bidirectional force transmission process of integrin through the cell membrane is still not well understood. Several possible mechanisms have been discussed in the literature on the basis of experimental data, and in this study, we investigate these mechanisms by free and steered molecular dynamics simulations. For the first time, constant velocity pulling on the complete integrin molecule inside a dipalmitoyl-phosphatidylcholine membrane is conducted. From the results, the most likely mechanism for inside-out and outside-in signaling is the switchblade model with further separation of the transmembrane helices.