
Showing papers in "Proteins in 2006"


Journal ArticleDOI
15 Nov 2006-Proteins
TL;DR: An effort to improve the φ/ψ dihedral terms in the ff99 energy function achieves a better balance of secondary structure elements as judged by improved distribution of backbone dihedrals for glycine and alanine with respect to PDB survey data.
Abstract: The ff94 force field that is commonly associated with the Amber simulation package is one of the most widely used parameter sets for biomolecular simulation. After a decade of extensive use and testing, limitations in this force field, such as over-stabilization of alpha-helices, were reported by us and other researchers. This led to a number of attempts to improve these parameters, resulting in a variety of "Amber" force fields and significant difficulty in determining which should be used for a particular application. We show that several of these continue to suffer from inadequate balance between different secondary structure elements. In addition, the approach used in most of these studies neglected to account for the existence in Amber of two sets of backbone phi/psi dihedral terms. This led to parameter sets that provide unreasonable conformational preferences for glycine. We report here an effort to improve the phi/psi dihedral terms in the ff99 energy function. Dihedral term parameters are based on fitting the energies of multiple conformations of glycine and alanine tetrapeptides from high level ab initio quantum mechanical calculations. The new parameters for backbone dihedrals replace those in the existing ff99 force field. This parameter set, which we denote ff99SB, achieves a better balance of secondary structure elements as judged by improved distribution of backbone dihedrals for glycine and alanine with respect to PDB survey data. It also accomplishes improved agreement with published experimental data for conformational preferences of short alanine peptides and better accord with experimental NMR relaxation data of test protein systems.

6,146 citations
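
For orientation, Amber-style energy functions express each backbone dihedral as a short cosine series, E(φ) = Σ_n (V_n/2)[1 + cos(nφ − γ_n)]. The sketch below evaluates that functional form. The amplitudes and phases are placeholder values chosen purely for illustration; they are not ff99 or ff99SB parameters.

```python
import math

def torsion_energy(phi, terms):
    """Sum of (V_n / 2) * (1 + cos(n * phi - gamma)) over the series terms.

    phi and gamma are in radians; each term is a (V_n, n, gamma) tuple.
    """
    return sum(0.5 * v * (1.0 + math.cos(n * phi - gamma)) for v, n, gamma in terms)

# Hypothetical two-term series (NOT published force-field parameters).
example_terms = [(2.0, 1, 0.0), (1.0, 2, math.pi)]

# Scan a few dihedral angles to see how the terms combine.
for degrees in (-180, -60, 0, 60, 180):
    print(degrees, round(torsion_energy(math.radians(degrees), example_terms), 3))
```

Refitting a force field in this picture means adjusting the (V_n, γ_n) values so that the total energy profile matches reference quantum mechanical energies across many conformations.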


Journal ArticleDOI
15 Aug 2006-Proteins
TL;DR: A two-level support vector machine (SVM) system for predicting subcellular localization that does not rely on homology search, so its performance is relatively unaffected by sequence homology; by contrast, homology search performs well down to 30% sequence identity but deteriorates considerably below that.
Abstract: Because a protein's function is usually related to its subcellular localization, the ability to predict subcellular localization directly from protein sequences will be useful for inferring protein functions. Recent years have seen a surging interest in the development of novel computational tools to predict subcellular localization. At present, these approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. A number of authors have noticed that sequence similarity is useful in predicting subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried out extensive analysis of the relation between sequence similarity and identity in subcellular localization, and have found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used for the prediction accuracy assessment contain highly homologous sequences, with some data sets comprising sequences of up to 80-90% sequence identity. Using these benchmark test data will surely lead to overestimation of the performance of the methods considered. Here, we develop an approach based on a two-level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vectors derived from sequences; the second level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches for two benchmark data sets, one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence alignment for several data sets to investigate the relationship between sequence homology and subcellular localization.
Our results, which are consistent with previous studies, indicate that the homology search approach performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity. A data set of high homology levels will undoubtedly lead to biased assessment of the performances of the predictive approaches, especially those relying on homology search or sequence annotations. Our two-level classification system based on SVM does not rely on homology search; therefore, its performance remains relatively unaffected by sequence homology. When compared with other approaches, our approach performed significantly better. Furthermore, we also develop a practical hybrid method, which combines the two-level SVM classifier and the homology search method, as a general tool for the sequence annotation of subcellular localization.

1,303 citations


Journal ArticleDOI
01 Oct 2006-Proteins
TL;DR: Standard feed-forward (FNN) and recurrent (RNN) neural networks were used to predict B-cell epitopes in an antigenic sequence, and the RNN (JE) was more successful than the FNN.
Abstract: B-cell epitopes play a vital role in the development of peptide vaccines, in diagnosis of diseases, and also for allergy research. Experimental methods used for characterizing epitopes are time consuming and demand large resources. The availability of epitope prediction method(s) can rapidly aid experimenters in simplifying this problem. The standard feed-forward (FNN) and recurrent neural network (RNN) have been used in this study for predicting B-cell epitopes in an antigenic sequence. The networks have been trained and tested on a clean data set, which consists of 700 non-redundant B-cell epitopes obtained from Bcipep database and equal number of non-epitopes obtained randomly from Swiss-Prot database. The networks have been trained and tested at different input window length and hidden units. Maximum accuracy has been obtained using recurrent neural network (Jordan network) with a single hidden layer of 35 hidden units for window length of 16. The final network yields an overall prediction accuracy of 65.93% when tested by fivefold cross-validation. The corresponding sensitivity, specificity, and positive prediction values are 67.14, 64.71, and 65.61%, respectively. It has been observed that RNN (JE) was more successful than FNN in the prediction of B-cell epitopes. The length of the peptide is also important in the prediction of B-cell epitopes from antigenic sequences. The webserver ABCpred is freely available at www.imtech.res.in/raghava/abcpred/.

1,112 citations
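
The accuracy, sensitivity, specificity, and positive prediction values quoted in the abstract above are all derived from the four cells of a binary confusion matrix. The sketch below assumes the standard textbook definitions; the counts are hypothetical and are not taken from the ABCpred evaluation.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return standard binary-classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # fraction of true epitopes recovered
        "specificity": tn / (tn + fp),          # fraction of non-epitopes rejected
        "ppv": tp / (tp + fp),                  # fraction of predicted epitopes that are real
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Hypothetical counts for a balanced test set of 100 epitopes and 100 non-epitopes.
metrics = classification_metrics(tp=80, fn=20, tn=70, fp=30)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

On a balanced set such as the one described (equal numbers of epitopes and non-epitopes), accuracy is simply the mean of sensitivity and specificity.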


Journal ArticleDOI
01 Oct 2006-Proteins
TL;DR: In this review the key concepts of protein–ligand docking methods are outlined, with major emphasis being given to the general strengths and weaknesses that presently characterize this methodology.
Abstract: Understanding the ruling principles whereby protein receptors recognize, interact, and associate with molecular substrates and inhibitors is of paramount importance in drug discovery efforts. Protein-ligand docking aims to predict and rank the structure(s) arising from the association between a given ligand and a target protein of known 3D structure. Despite the breathtaking advances in the field over the last decades and the widespread application of docking methods, several downsides still exist. In particular, protein flexibility, a critical aspect for a thorough understanding of the principles that guide ligand binding in proteins, is a major hurdle in current protein-ligand docking efforts that needs to be more efficiently accounted for. In this review the key concepts of protein-ligand docking methods are outlined, with major emphasis being given to the general strengths and weaknesses that presently characterize this methodology. Despite the size of the field, the principal types of search algorithms and scoring functions are reviewed and the most popular docking tools are briefly depicted. Recent advances that aim to address some of the traditional limitations associated with molecular docking are also described. A selection of hand-picked examples is used to illustrate these features.

840 citations


Journal ArticleDOI
01 Nov 2006-Proteins
TL;DR: The Fast Fourier Transform correlation approach to protein–protein docking is efficiently used with pairwise interaction potentials that substantially improve the docking results, and a novel class of structure‐based pairwise intermolecular potentials are presented.
Abstract: The Fast Fourier Transform (FFT) correlation approach to protein-protein docking can evaluate the energies of billions of docked conformations on a grid if the energy is described in the form of a correlation function. Here, this restriction is removed, and the approach is efficiently used with pairwise interaction potentials that substantially improve the docking results. The basic idea is approximating the interaction matrix by its eigenvectors corresponding to the few dominant eigenvalues, resulting in an energy expression written as the sum of a few correlation functions, and solving the problem by repeated FFT calculations. In addition to describing how the method is implemented, we present a novel class of structure-based pairwise intermolecular potentials. The DARS (Decoys As the Reference State) potentials are extracted from structures of protein-protein complexes and use large sets of docked conformations as decoys to derive atom pair distributions in the reference state. The current version of the DARS potential works well for enzyme-inhibitor complexes. With the new FFT-based program, DARS provides much better docking results than the earlier approaches, in many cases generating 50% more near-native docked conformations. Although the potential is far from optimal for antibody-antigen pairs, the results are still slightly better than those given by an earlier FFT method. The docking program PIPER is freely available for noncommercial applications.

746 citations


Journal ArticleDOI
15 Aug 2006-Proteins
TL;DR: A reliable and robust algorithm, MUSTANG (MUltiple STructural AligNment AlGorithm), for the alignment of multiple protein structures, based on the progressive pairwise heuristic, which performs comparably to popular pairwise and multiple structural alignment tools for closely related proteins.
Abstract: Multiple structural alignment is a fundamental problem in structural genomics. In this article, we define a reliable and robust algorithm, MUSTANG (MUltiple STructural AligNment AlGorithm), for the alignment of multiple protein structures. Given a set of protein structures, the program constructs a multiple alignment using the spatial information of the Cα atoms in the set. Broadly based on the progressive pairwise heuristic, this algorithm gains accuracy through novel and effective refinement phases. MUSTANG reports the multiple sequence alignment and the corresponding superposition of structures. Alignments generated by MUSTANG are compared with several hand-curated alignments in the literature as well as with the benchmark alignments of 1033 alignment families from the HOMSTRAD database. The performance of MUSTANG was compared with DALI at a pairwise level, and with other multiple structural alignment tools such as POSA, CE-MC, MALECON, and MultiProt. MUSTANG performs comparably to popular pairwise and multiple structural alignment tools for closely related proteins, and performs more reliably than other multiple structural alignment methods on hard data sets containing distantly related proteins or proteins that show conformational changes.

665 citations


Journal ArticleDOI
21 Dec 2006-Proteins
TL;DR: The protein structure validation software suite (PSVS) is developed for assessment of protein structures generated by NMR or X-ray crystallographic methods; it is particularly useful in assessing protein structures determined by NMR methods, but is also valuable for assessing X-ray crystal structures or homology models.
Abstract: Structural genomics projects are providing large quantities of new 3D structural data for proteins. To monitor the quality of these data, we have developed the protein structure validation software suite (PSVS), for assessment of protein structures generated by NMR or X-ray crystallographic methods. PSVS is broadly applicable for structure quality assessment in structural biology projects. The software integrates under a single interface analyses from several widely-used structure quality evaluation tools, including PROCHECK (Laskowski et al., J Appl Crystallog 1993;26:283-291), MolProbity (Lovell et al., Proteins 2003;50:437-450), Verify3D (Luthy et al., Nature 1992;356:83-85), ProsaII (Sippl, Proteins 1993;17: 355-362), the PDB validation software, and various structure-validation tools developed in our own laboratory. PSVS provides standard constraint analyses, statistics on goodness-of-fit between structures and experimental data, and knowledge-based structure quality scores in standardized format suitable for database integration. The analysis provides both global and site-specific measures of protein structure quality. Global quality measures are reported as Z scores, based on calibration with a set of high-resolution X-ray crystal structures. PSVS is particularly useful in assessing protein structures determined by NMR methods, but is also valuable for assessing X-ray crystal structures or homology models. Using these tools, we assessed protein structures generated by the Northeast Structural Genomics Consortium and other international structural genomics projects, over a 5-year period. Protein structures produced from structural genomics projects exhibit quality score distributions similar to those of structures produced in traditional structural biology projects during the same time period. 
However, while some NMR structures have structure quality scores similar to those seen in higher-resolution X-ray crystal structures, the majority of NMR structures have lower scores. Potential reasons for this "structure quality score gap" between NMR and X-ray crystal structures are discussed.

648 citations


Journal ArticleDOI
15 Nov 2006-Proteins
TL;DR: A new method for docking small molecules into protein binding sites employing a Monte Carlo minimization procedure in which the rigid body position and orientation of the small molecule and the protein side‐chain conformations are optimized simultaneously is described.
Abstract: Protein-small molecule docking algorithms provide a means to model the structure of protein-small molecule complexes in structural detail and play an important role in drug development. In recent years the necessity of simulating protein side-chain flexibility for an accurate prediction of the protein-small molecule interfaces has become apparent, and an increasing number of docking algorithms probe different approaches to include protein flexibility. Here we describe a new method for docking small molecules into protein binding sites employing a Monte Carlo minimization procedure in which the rigid body position and orientation of the small molecule and the protein side-chain conformations are optimized simultaneously. The energy function comprises van der Waals (VDW) interactions, an implicit solvation model, an explicit orientation hydrogen bonding potential, and an electrostatics model. In an evaluation of the scoring function the computed energy correlated with experimental small molecule binding energy with a correlation coefficient of 0.63 across a diverse set of 229 protein-small molecule complexes. The docking method produced lowest energy models with a root mean square deviation (RMSD) smaller than 2 Å in 71 out of 100 protein-small molecule crystal structure complexes (self-docking). In cross-docking calculations in which both protein side-chain and small molecule internal degrees of freedom were varied the lowest energy predictions had RMSDs less than 2 Å in 14 of 20 test cases.

418 citations
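
The 2 Å success criterion in the abstract above is a root mean square deviation between predicted and crystallographic ligand coordinates. A minimal sketch follows, assuming both poses list the same atoms in the same order and already share the receptor's coordinate frame (no superposition, no handling of symmetric atoms). The coordinates and the helper name `rmsd` are illustrative, not from the paper.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))

# Hypothetical two-atom ligand: the model is the reference shifted by 1.5 Å along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.5), (1.0, 0.0, 1.5)]

value = rmsd(reference, model)
print(f"RMSD = {value:.2f} Å, within 2 Å: {value < 2.0}")
```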


Journal ArticleDOI
15 May 2006-Proteins
TL;DR: In this article, the authors investigated the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, and assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks.
Abstract: Protein-protein interactions play a key role in many biological systems. High-throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false-positive and false-negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co-complex relationship, and (3) pathway co-membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions, Random Forest (RF), RF similarity-based k-Nearest-Neighbor, Naive Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co-complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top-ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. 
Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast-2-hybrid system were not among the top-ranking features under any condition.

405 citations
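
The abstract above mentions using the Gini index, the Random Forest splitting criterion, to estimate feature importance. The sketch below shows the quantity involved: the impurity of a set of labels and the impurity decrease produced by one split. Summing such decreases over every split that uses a feature is one common way to rank features; whether the authors used exactly this aggregation is an assumption here, and the labels are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Hypothetical node: 1 = interacting pair, 0 = non-interacting pair.
parent = [1, 1, 0, 0]
perfect_split = gini_decrease(parent, [1, 1], [0, 0])
useless_split = gini_decrease(parent, [1, 0], [1, 0])
print(perfect_split, useless_split)
```

A feature that separates the classes cleanly yields a large decrease, while a feature that leaves both children as mixed as the parent yields none.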


Journal ArticleDOI
06 Dec 2006-Proteins
TL;DR: The authors demonstrate that RSA prediction‐based fingerprints of protein interactions significantly improve the discrimination between interacting and noninteracting sites, compared with evolutionary conservation, physicochemical characteristics, structure‐derived and other features considered before.
Abstract: The recognition of protein interaction sites is an important intermediate step toward identification of functionally relevant residues and understanding protein function, facilitating experimental efforts in that regard. Toward that goal, the authors propose a novel representation for the recognition of protein-protein interaction sites that integrates enhanced relative solvent accessibility (RSA) predictions with high resolution structural data. An observation that RSA predictions are biased toward the level of surface exposure consistent with protein complexes led the authors to investigate the difference between the predicted and actual (i.e., observed in an unbound structure) RSA of an amino acid residue as a fingerprint of interaction sites. The authors demonstrate that RSA prediction-based fingerprints of protein interactions significantly improve the discrimination between interacting and noninteracting sites, compared with evolutionary conservation, physicochemical characteristics, structure-derived and other features considered before. On the basis of these observations, the authors developed a new method for the prediction of protein-protein interaction sites, using machine learning approaches to combine the most informative features into the final predictor. For training and validation, the authors used several large sets of protein complexes and derived from them nonredundant representative chains, with interaction sites mapped from multiple complexes. Alternative machine learning techniques are used, including Support Vector Machines and Neural Networks, so as to evaluate the relative effects of the choice of a representation and a specific learning algorithm. The effects of induced fit and uncertainty of the negative (noninteracting) class assignment are also evaluated. Several representative methods from the literature are reimplemented to enable direct comparison of the results. 
Using rigorous validation protocols, the authors estimated that the new method yields the overall classification accuracy of about 74% and Matthews correlation coefficients of 0.42, as opposed to up to 70% classification accuracy and up to 0.3 Matthews correlation coefficient for methods that do not utilize RSA prediction-based fingerprints. The new method is available at http://sppider.cchmc.org.

358 citations
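
The abstract above reports both classification accuracy and the Matthews correlation coefficient (MCC). The sketch below uses the standard MCC formula on hypothetical counts to show why the two are reported together: when non-interacting sites dominate, accuracy can look high while the MCC stays modest.

```python
import math

def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews correlation coefficient from confusion-matrix counts (0.0 if undefined)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical imbalanced test set: 10 interface residues, 90 non-interface residues.
tp, fn, tn, fp = 5, 5, 85, 5
accuracy = (tp + tn) / (tp + fn + tn + fp)
mcc = matthews_corrcoef(tp, fn, tn, fp)
print(f"accuracy = {accuracy:.2f}, MCC = {mcc:.2f}")
```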


Journal ArticleDOI
09 Nov 2006-Proteins
TL;DR: The results show that ensemble docking successfully predicts the binding modes of the inhibitors, and discriminates the inhibitors from a set of noninhibitors with similar chemical properties.
Abstract: One approach to incorporate protein flexibility in molecular docking is the use of an ensemble consisting of multiple protein structures. Sequentially docking each ligand into a large number of protein structures is computationally too expensive to allow large-scale database screening. It is challenging to achieve a good balance between docking accuracy and computational efficiency. In this work, we have developed a fast, novel docking algorithm utilizing multiple protein structures, referred to as ensemble docking, to account for protein structural variations. The algorithm can simultaneously dock a ligand into an ensemble of protein structures and automatically select an optimal protein structure that best fits the ligand by optimizing both ligand coordinates and the conformational variable m, where m represents the m-th structure in the protein ensemble. The docking algorithm was validated on 10 protein ensembles containing 105 crystal structures and 87 ligands in terms of binding mode and energy score predictions. A success rate of 93% was obtained with the criterion of root-mean-square deviation <2.5 Å if the top five orientations for each ligand were considered, comparable to that of sequential docking in which scores for individual docking are merged into one list by re-ranking, and significantly better than that of single rigid-receptor docking (75% on average). Similar trends were also observed in binding score predictions and enrichment tests of virtual database screening. The ensemble docking algorithm is computationally efficient, with a computational time comparable to that for docking a ligand into a single protein structure. In contrast, the computational time for the sequential docking method increases linearly with the number of protein structures in the ensemble. The algorithm was further evaluated using a more realistic ensemble in which the corresponding bound protein structures of inhibitors were excluded.
The results show that ensemble docking successfully predicts the binding modes of the inhibitors, and discriminates the inhibitors from a set of noninhibitors with similar chemical properties. Although multiple experimental structures were used in the present work, our algorithm can be easily applied to multiple protein conformations generated by computational methods, and helps improve the efficiency of other existing multiple protein structure (MPS)-based methods to accommodate protein flexibility.

Journal ArticleDOI
01 Jun 2006-Proteins
TL;DR: A graphical representation is introduced that visualizes the clusters of strong site–site interactions in the context of the three‐dimensional (3D) structure of the macromolecule, facilitating identification of functionally important clusters of ionizable groups.
Abstract: Structure and function of macromolecules depend critically on the ionization states of their acidic and basic groups. Most current structure-based theoretical methods that predict pK of ionizable groups in macromolecules include, as one of the key steps, a computation of the partition sum (Boltzmann average) over all possible protonation microstates. As the number of these microstates depends exponentially on the number of ionizable groups present in the molecule, direct computation of the sum is not realistically feasible for many typical proteins that may have tens or even hundreds of ionizable groups. We have tested a simple and robust approximate algorithm for computing these partition sums for macromolecules. The method subdivides the interacting sites into independent clusters, based upon the strength of site-site electrostatic interaction. The resulting partition function is factorizable into computationally manageable components. Two variants of the approach are presented and validated on a representative test set of 602 proteins, by comparing the pK(1/2) values computed by the proposed method with those obtained by the standard Monte Carlo approach used as a reference. With 95% confidence, the relative error introduced by the more accurate of the two methods is less than 0.25 pK units. The algorithms are one to two orders of magnitude faster than the Monte Carlo method, with the typical settings. A graphical representation is introduced that visualizes the clusters of strong site-site interactions in the context of the three-dimensional (3D) structure of the macromolecule, facilitating identification of functionally important clusters of ionizable groups; the approach is exemplified on two proteins, bacteriorhodopsin and myoglobin.
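
The factorization idea in the abstract above can be sketched on a toy system. The code below assumes a simple energy model that is not taken from the paper: each site is either protonated or not, a protonated site i contributes g[i], and two protonated sites i and j contribute an extra w[(i, j)], all in units of kT. Clusters are taken as connected components of the graph of interactions at or above a cutoff, which is one plausible reading of "independent clusters" and may differ from the authors' two variants.

```python
import itertools
import math

def energy(state, g, w):
    """Microstate energy in kT: site terms plus pair terms between protonated sites."""
    total = sum(g[i] for i, occupied in enumerate(state) if occupied)
    for (i, j), w_ij in w.items():
        if state[i] and state[j]:
            total += w_ij
    return total

def partition_sum(sites, g, w):
    """Brute-force Boltzmann sum over all 2^n microstates of the given sites."""
    index = {site: k for k, site in enumerate(sites)}
    local_g = [g[site] for site in sites]
    # Keep only interactions with both partners inside this group of sites.
    local_w = {(index[i], index[j]): v for (i, j), v in w.items() if i in index and j in index}
    return sum(math.exp(-energy(state, local_g, local_w))
               for state in itertools.product((0, 1), repeat=len(sites)))

def clusters(n, w, cutoff):
    """Connected components of the graph whose edges are pairs with |w_ij| >= cutoff."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for (i, j), v in w.items():
        if abs(v) >= cutoff:
            parent[find(i)] = find(j)
    groups = {}
    for site in range(n):
        groups.setdefault(find(site), []).append(site)
    return sorted(groups.values())

def factorized_partition_sum(n, g, w, cutoff):
    """Product of per-cluster sums; interactions between clusters are neglected."""
    z = 1.0
    for group in clusters(n, w, cutoff):
        z *= partition_sum(group, g, w)
    return z

# Hypothetical four-site system with two strongly coupled pairs and no cross terms.
g = [0.5, -1.0, 0.3, 2.0]
w = {(0, 1): 2.0, (2, 3): 1.5}
exact = partition_sum([0, 1, 2, 3], g, w)
approx = factorized_partition_sum(4, g, w, cutoff=1.0)
print(clusters(4, w, cutoff=1.0), exact, approx)
```

In this toy case the two clusters do not interact at all, so the factorized product reproduces the full sum. With weak but non-zero interactions between clusters the product becomes an approximation, while the cost falls from 2^N terms to a sum of much smaller per-cluster enumerations.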

Journal ArticleDOI
01 Oct 2006-Proteins
TL;DR: An overview of the methods currently employed for disorder prediction highlighting their advantages and drawbacks is presented and a few practical examples of how they can be combined to avoid pitfalls and to achieve more reliable predictions are shown.
Abstract: In the past few years there has been a growing awareness that a large number of proteins contain long disordered (unstructured) regions that often play a functional role. However, these disordered regions are still poorly detected. Recognition of disordered regions in a protein is important for two main reasons: reducing bias in sequence similarity analysis by avoiding alignment of disordered regions against ordered ones, and helping to delineate boundaries of protein domains to guide structural and functional studies. As none of the available methods for disorder prediction can be taken as fully reliable on its own, we present an overview of the methods currently employed, highlighting their advantages and drawbacks. We show a few practical examples of how they can be combined to avoid pitfalls and to achieve more reliable predictions.

Journal ArticleDOI
01 Jun 2006-Proteins
TL;DR: This article introduces a new method for the identification and the accurate characterization of protein surface cavities encoded in the program SCREEN (Surface Cavity REcognition and EvaluatioN), and develops a classifier that can identify drug‐binding cavities with a balanced error rate of 7.2% and coverage of 88.9%.
Abstract: In this article we introduce a new method for the identification and the accurate characterization of protein surface cavities. The method is encoded in the program SCREEN (Surface Cavity REcognition and EvaluatioN). As a first test of the utility of our approach we used SCREEN to locate and analyze the surface cavities of a nonredundant set of 99 proteins cocrystallized with drugs. We find that this set of proteins has on average about 14 distinct cavities per protein. In all cases, a drug is bound at one (and sometimes more than one) of these cavities. Using cavity size alone as a criterion for predicting drug-binding sites yields a high balanced error rate of 15.7%, with only 71.7% coverage. Here we characterize each surface cavity by computing a comprehensive set of 408 physicochemical, structural, and geometric attributes. By applying modern machine learning techniques (Random Forests) we were able to develop a classifier that can identify drug-binding cavities with a balanced error rate of 7.2% and coverage of 88.9%. Only 18 of the 408 cavity attributes had a statistically significant role in the prediction. Of these 18 important attributes, almost all involved size and shape rather than physicochemical properties of the surface cavity. The implications of these results are discussed. A SCREEN Web server is available at http://interface.bioc.columbia.edu/screen.

Journal ArticleDOI
11 Dec 2006-Proteins
TL;DR: The results are consistent with the view that the antibodies undergo limited conformational change, and that incubation at 4°C at low pH results in no time‐dependent conformational changes.
Abstract: Exposure of antibodies to low pH is often unavoidable for purification and viral clearance. The conformation and stability of two humanized monoclonal antibodies (hIgG4-A and -B) directed against different antigens and a mouse monoclonal antibody (mIgG1) in 0.1 M citrate at acidic pH were studied using circular dichroism (CD), differential scanning calorimetry (DSC), and sedimentation velocity. Near- and far-UV CD spectra showed that exposure of these antibodies to pH 2.7-3.9 induced only limited conformational changes, although the changes were greater at the lower pH. However, the acid conformation is far from unfolded or so-called molten globule structure. Incubation of hIgG4-A at pH 2.7 and 3.5 at 4 °C over the course of 24 h caused little change in the near-UV CD spectra, indicating that the acid conformation is stable. Sedimentation velocity showed that the hIgG4-A is largely monomeric at pH 2.7 and 3.5 as well as at pH 6.0. No time-dependent changes in sedimentation profile occurred upon incubation at these low pHs, consistent with the conformational stability observed by CD. The sedimentation coefficient of the monomer at pH 2.7 or 3.5 again suggested that no gross conformational changes occur at these pHs. DSC analysis of the antibodies showed thermal unfolding at pH 2.7-3.9 as well as at pH 6.0, but with decreased melting temperatures at the lower pH. These results are consistent with the view that the antibodies undergo limited conformational change, and that incubation at 4 °C at low pH results in no time-dependent conformational changes. Titration of hIgG4-A from pH 3.5 to 6.0 resulted in recovery of native monomeric proteins whose CD and DSC profiles resembled those of the original sample. However, titration from pH 2.7 resulted in lower recovery of monomeric antibody, indicating that the greater conformational changes observed at this pH cannot be fully reversed to the native structure by a simple pH titration.

Journal ArticleDOI
21 Dec 2006-Proteins
TL;DR: The analysis of hydrogen bond and van der Waals contacts showed that in general proteins complexed with messenger RNA, transfer RNA and viral RNA have more base-specific contacts and fewer backbone contacts than expected, while proteins complexed with ribosomal RNA have fewer base-specific contacts than expected.
Abstract: A data set of 89 protein-RNA complexes has been extracted from the Protein Data Bank, and the nucleic acid recognition sites characterized through direct contacts, accessible surface area, and secondary structure motifs. The differences between RNA recognition sites that bind to RNAs in different functional classes have also been analyzed. Analysis of the complete data set revealed that van der Waals interactions are more numerous than hydrogen bonds and the contacts made to the nucleic acid backbone occur more frequently than specific contacts to nucleotide bases. Of the base-specific contacts that were observed, contacts to guanine and adenine occurred most frequently. The most favored amino acid-nucleotide pairings observed were lysine-phosphate, tyrosine-uracil, arginine-phosphate, phenylalanine-adenine and tryptophan-guanine. The amino acid propensities showed that positively charged and polar residues were favored as expected, but so were tryptophan and glycine. The propensities calculated for the functional classes showed trends similar to those observed for the complete data set. However, the analysis of hydrogen bond and van der Waals contacts showed that in general proteins complexed with messenger RNA, transfer RNA and viral RNA have more base-specific contacts and fewer backbone contacts than expected, while proteins complexed with ribosomal RNA have fewer base-specific contacts than expected. Hence, whilst the types of amino acids involved in the interfaces are similar, the distribution of specific contacts is dependent upon the functional class of the RNA bound.

Journal ArticleDOI
15 Nov 2006-Proteins
TL;DR: The usefulness of the automated AutoDock as a new promising tool in structure‐based virtual screening is demonstrated by identifying the actual inhibitors of various target enzymes in chemical databases with accuracy higher than the other docking tools including DOCK and FlexX.
Abstract: A major problem in virtual screening concerns the accuracy of the binding free energy between a target protein and a putative ligand. Here we report an example in which the AutoDock scoring function outperforms the other popular docking programs in virtual screening. The original AutoDock program is in itself too inefficient for virtual screening because the grids of interaction energy have to be calculated for each putative ligand in a chemical database. However, the automation of the AutoDock program with the potential grids defined in common for all putative ligands leads to a more than twofold increase in the speed of virtual database screening. The utility of the automated AutoDock in virtual screening is further demonstrated by identifying the actual inhibitors of various target enzymes in chemical databases with accuracy higher than that of the other docking tools, including DOCK and FlexX. These results exemplify the usefulness of the automated AutoDock as a promising new tool in structure-based virtual screening.

Journal ArticleDOI
01 Jul 2006-Proteins
TL;DR: The interaction between β‐catenin and Tcf family members is crucial for the Wnt signal transduction pathway, which is commonly mutated in cancer, and inhibiting such interactions using low molecular weight inhibitors is a challenge.
Abstract: The interaction between beta-catenin and Tcf family members is crucial for the Wnt signal transduction pathway, which is commonly mutated in cancer. This interaction extends over a very large surface area (4800 Å²), and inhibiting such interactions using low molecular weight inhibitors is a challenge. However, protein surfaces frequently contain "hot spots," small patches that are the main mediators of binding affinity. By making tight interactions with a hot spot, a small molecule can compete with a protein. The Tcf3/Tcf4-binding surface on beta-catenin contains a well-defined hot spot around residues K435 and R469. A 17,700-compound subset of the Pharmacia corporate collection was docked to this hot spot with the QXP program; 22 of the best scoring compounds were put into a biophysical (NMR and ITC) screening funnel, where specific binding to beta-catenin, competition with Tcf4 and finally binding constants were determined. This process led to the discovery of three druglike, low molecular weight Tcf4-competitive compounds with the tightest binder having a K_D of 450 nM. Our approach can be used in several situations (e.g., when selecting compounds from external collections, when no biochemical functional assay is available, or when no HTS is envisioned), and it may be generally applicable to the identification of inhibitors of protein-protein interactions.

Journal ArticleDOI
11 Dec 2006-Proteins
TL;DR: The data reveals that intrinsic disorder is significantly enriched in date hub proteins when compared with party hub proteins, suggesting that enrichment of intrinsic disorder in date hubs may facilitate transient interactions, which might be required for date hubs to interact with different partners at different times.
Abstract: Hubs in the protein-protein interaction network have been classified as "party" hubs, which are highly correlated in their mRNA expression with their partners while "date" hubs show lesser correlation. In this study, we explored the role of intrinsic disorder in date and party hub interactions. The data reveals that intrinsic disorder is significantly enriched in date hub proteins when compared with party hub proteins. Intrinsic disorder has been largely implicated in transient binding interactions. The disorder to order transition, which occurs during binding interactions in disordered regions, renders the interaction highly reversible while maintaining the high specificity. The enrichment of intrinsic disorder in date hubs may facilitate transient interactions, which might be required for date hubs to interact with different partners at different times.

Journal ArticleDOI
01 Jul 2006-Proteins
TL;DR: Support Vector Machine (SVM), a supervised pattern recognition method, is applied to predict DNA‐binding sites in DNA‐binding proteins using the following features: amino acid sequence, profile of evolutionary conservation of sequence positions, and low‐resolution structural information.
Abstract: Proteins that interact with DNA are involved in a number of fundamental biological activities such as DNA replication, transcription, and repair. A reliable identification of DNA-binding sites in DNA-binding proteins is important for functional annotation, site-directed mutagenesis, and modeling protein-DNA interactions. We apply Support Vector Machine (SVM), a supervised pattern recognition method, to predict DNA-binding sites in DNA-binding proteins using the following features: amino acid sequence, profile of evolutionary conservation of sequence positions, and low-resolution structural information. We use a rigorous statistical approach to study the performance of predictors that utilize different combinations of features and how this performance is affected by structural and sequence properties of proteins. Our results indicate that an SVM predictor based on a properly scaled profile of evolutionary conservation in the form of a position specific scoring matrix (PSSM) significantly outperforms a PSSM-based neural network predictor. The highest accuracy is achieved by SVM predictor that combines the profile of evolutionary conservation with low-resolution structural information. Our results also show that knowledge-based predictors of DNA-binding sites perform significantly better on proteins from mainly-alpha structural class and that the performance of these predictors is significantly correlated with certain structural and sequence properties of proteins. These observations suggest that it may be possible to assign a reliability index to the overall accuracy of the prediction of DNA-binding sites in any given protein using its sequence and structural properties. A web-server implementation of the predictors is freely available online at http://lcg.rit.albany.edu/dp-bind/.

Journal ArticleDOI
15 May 2006-Proteins
TL;DR: The interface prediction program WHISCY is presented, which combines surface conservation and structural information to predict protein–protein interfaces and demonstrates the potential of using interface predictions to drive protein–protein docking.
Abstract: Protein-protein interactions play a key role in biological processes. Identifying the interacting residues is a first step toward understanding these interactions at a structural level. In this study, the interface prediction program WHISCY is presented. It combines surface conservation and structural information to predict protein-protein interfaces. The accuracy of the predictions is more than three times higher than a random prediction. These predictions have been combined with another interface prediction program, ProMate [Neuvirth et al. J Mol Biol 2004;338:181-199], resulting in an even more accurate predictor. The usefulness of the predictions was tested using the data-driven docking program HADDOCK [Dominguez et al. J Am Chem Soc 2003;125:1731-1737] in an unbound docking experiment, with the goal of generating as many near-native structures as possible. Unrefined rigid body docking solutions within 10 Å ligand RMSD from the true structure were generated for 22 out of 25 docked complexes. For 18 complexes, more than 100 of the 8000 generated models were correct. Our results demonstrate the potential of using interface predictions to drive protein-protein docking.

Journal ArticleDOI
18 Dec 2006-Proteins
TL;DR: An integrated system of neural networks, called SPINE, is established and optimized for predicting structural properties of proteins, and approaches the theoretical upper limit of 88–90% accuracy in assigning secondary structures.
Abstract: An integrated system of neural networks, called SPINE, is established and optimized for predicting structural properties of proteins. SPINE is applied to three-state secondary-structure and residue-solvent-accessibility (RSA) prediction in this paper. The integrated neural networks are carefully trained with a large dataset of 2640 chains, sequence profiles generated from multiple sequence alignment, representative amino acid properties, a slow learning rate, overfitting protection, and an optimized sliding-window size. More than 200,000 weights in SPINE are optimized by maximizing the accuracy measured by Q3 (the percentage of correctly classified residues). SPINE yields a 10-fold cross-validated accuracy of 79.5% (80.0% for chains of length between 50 and 300) in secondary-structure prediction after one-month (CPU time) training on 22 processors. An accuracy of 87.5% is achieved for exposed residues (RSA >95%). The latter approaches the theoretical upper limit of 88-90% accuracy in assigning secondary structures. An accuracy of 73% for three-state solvent-accessibility prediction (25%/75% cutoff) and 79.3% for two-state prediction (25% cutoff) is also obtained.
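
As an illustration of the accuracy measure defined in the abstract above (the percentage of correctly classified residues), a minimal sketch follows. The function name `q3_accuracy` and the example label strings are hypothetical and are not taken from SPINE.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label matches.

    Both arguments are strings over a three-letter alphabet; here we
    assume H = helix, E = strand, C = coil (an illustrative convention).
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical labels for a 10-residue chain (not real data).
predicted = "HHHHCCEEEC"
observed = "HHHCCCEEEE"
print(q3_accuracy(predicted, observed))
```

In this toy case 8 of the 10 positions agree, so the function reports 80.0.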

Journal ArticleDOI
15 Nov 2006-Proteins
TL;DR: An improvement on the pattern search method for the identification of EF‐hand and EF‐like Ca2+‐binding proteins and a new signature profile has been established to allow for the identification of pseudo EF‐hand and S100 proteins from genomic information.
Abstract: The EF-hand protein with a helix-loop-helix Ca²⁺ binding motif constitutes one of the largest protein families and is involved in numerous biological processes. To facilitate the understanding of the role of Ca²⁺ in biological systems using genomic information, we report, herein, our improvement on the pattern search method for the identification of EF-hand and EF-like Ca²⁺-binding proteins. The canonical EF-hand patterns are modified to cater to different flanking structural elements. In addition, on the basis of the conserved sequence of both the N- and C-terminal EF-hands within S100 and S100-like proteins, a new signature profile has been established to allow for the identification of pseudo EF-hand and S100 proteins from genomic information. The new patterns have a positive predictive value of 99% and a sensitivity of 96% for pseudo EF-hands. Furthermore, using the developed patterns, we have identified zero pseudo EF-hand motifs and 467 canonical EF-hand Ca²⁺ binding motifs with diverse cellular functions in bacterial genomes. The prediction results imply that pseudo EF-hand motifs are phylogenetically younger than canonical EF-hand motifs. Our prediction of Ca²⁺ binding motifs provides not only an insight into the role of Ca²⁺ and Ca²⁺-binding proteins in bacterial systems, but also a way to explore and define the role of Ca²⁺ in other biological systems (calciomics).
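
The two figures of merit quoted in the abstract above, positive predictive value and sensitivity, follow from the counts of true positives, false positives, and false negatives. The sketch below uses their standard definitions; the counts are hypothetical, chosen only to give values close to those quoted, and are not data from the paper.

```python
def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of reported hits that are real: TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real motifs that are found: TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical counts for illustration only (not from the paper).
tp, fp, fn = 96, 1, 4
print(f"PPV = {positive_predictive_value(tp, fp):.2f}")
print(f"sensitivity = {sensitivity(tp, fn):.2f}")
```

A pattern can score high on one measure and low on the other, which is why the abstract reports both.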

Journal ArticleDOI
01 Aug 2006-Proteins
TL;DR: Data obtained by fluorescence spectroscopy, CD, and FTIR experiments along with the docking studies suggest that EGCG binds to residues located in subdomains IIa and IIIa of HSA, suggesting that apart from an initial hydrophobic association, the complex is held together by van der Waals interactions and hydrogen bonding.
Abstract: (-)-Epigallocatechin-3-gallate (EGCG), the major constituent of green tea, has been reported to prevent many diseases by virtue of its antioxidant properties. The binding of EGCG with human serum albumin (HSA) has been investigated for the first time by using fluorescence, circular dichroism (CD), Fourier transform infrared (FTIR) spectroscopy, and protein-ligand docking. We observed a quenching of fluorescence of HSA in the presence of EGCG. The binding parameters were determined by a Scatchard plot and the results were found to be consistent with those obtained from a modified Stern-Volmer equation. From the thermodynamic parameters calculated according to the van't Hoff equation, the enthalpy change ΔH° and entropy change ΔS° were found to be -22.59 and 16.23 J/mol K, respectively. These values suggest that apart from an initial hydrophobic association, the complex is held together by van der Waals interactions and hydrogen bonding. Data obtained by fluorescence spectroscopy, CD, and FTIR experiments along with the docking studies suggest that EGCG binds to residues located in subdomains IIa and IIIa of HSA. Specific interactions are observed with residues Trp 214, Arg 218, Gln 221, Asn 295 and Asp 451. We have also looked at changes in the accessible surface area of the interacting residues on binding EGCG for a better understanding of the interaction.
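
For reference, the van't Hoff analysis mentioned in the abstract above is commonly written in the linear form below. The abstract does not state which form the authors used, so this is the standard textbook relation, assuming the enthalpy and entropy changes are constant over the temperature range studied; K denotes the binding constant, R the gas constant, and T the absolute temperature.

```latex
\ln K = -\frac{\Delta H^\circ}{R T} + \frac{\Delta S^\circ}{R},
\qquad
\Delta G^\circ = \Delta H^\circ - T\,\Delta S^\circ = -R T \ln K
```

Under that assumption, plotting ln K against 1/T gives a line with slope −ΔH°/R and intercept ΔS°/R.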

Journal ArticleDOI
01 Sep 2006-Proteins
TL;DR: The microscopic and semimacroscopic studies clarify the problems with incomplete alternative calculations, illustrating that the effects of various electrostatic elements are drastically overestimated by macroscopic calculations that use a low dielectric constant and do not consider the protein reorganization.
Abstract: The origin of the barrier for proton transport through the aquaporin channel is a problem of general interest. It is becoming increasingly clear that this barrier is not attributable to the orientation of the water molecules across the channel but rather to the electrostatic penalty for moving the proton charge to the center of the channel. However, the reason for the high electrostatic barrier is still rather controversial. It has been argued by some workers that the barrier is due to the so-called NPA motif and/or to the helix macrodipole or to other specific elements. However, our works indicated that the main reason for the high barrier is the loss of the generalized solvation upon moving the proton charge from the bulk to the center of the channel and that this does not reflect a specific repulsive electrostatic interaction but the absence of sufficient electrostatic stabilization. At this stage it seems that the elucidation and clarification of the origin of the electrostatic barrier can serve as an instructive test case for electrostatic models. Thus, we reexamine the free-energy surface for proton transport in aquaporins using the microscopic free-energy perturbation/umbrella sampling (FEP/US) and the empirical valence bond/umbrella sampling (EVB/US) methods as well as the semimacroscopic protein dipole Langevin dipole model in its linear response approximation version (the PDLD/S-LRA). These extensive studies help to clarify the nature of the barrier and to establish the “reduced solvation effect” as the primary source of this barrier. That is, it is found that the barrier is associated with the loss of the generalized solvation energy (which includes of course all electrostatic effects) upon moving the proton charge from the bulk solvent to the center of the channel. It is also demonstrated that the residues in the NPA region and the helix dipole cannot be considered as the main reasons for the electrostatic barrier. 
Furthermore, our microscopic and semimacroscopic studies clarify the problems with incomplete alternative calculations, illustrating that the effects of various electrostatic elements are drastically overestimated by macroscopic calculations that use a low dielectric constant and do not consider the protein reorganization. Similarly, it is pointed out that microscopic potential of mean force calculations that do not evaluate the electrostatic barrier relative to the bulk water cannot be used to establish the origin of the electrostatic barrier. The relationship between the present study and calculations of pKas in protein interiors is clarified, pointing out that approaches that are applied to study the aquaporin barrier should be validated by pKas calculations. Such calculations also help to clarify the crucial role of solvation energies in establishing the barrier in aquaporins. Proteins 2006. © 2006 Wiley-Liss, Inc.

Journal ArticleDOI
01 Jun 2006-Proteins
TL;DR: Overall, the results of the study indicate that current methodologies of correlated mutations analysis are not suitable for large‐scale intermolecular contact prediction, and thus cannot assist in docking.
Abstract: Correlated mutations have been repeatedly exploited for intramolecular contact map prediction. Over the last decade these efforts yielded several methods for measuring correlated mutations. Nevertheless, the application of correlated mutations for the prediction of intermolecular interactions has not yet been explored. This gap is due to several obstacles, such as the availability of 3D complexes, paralog discrimination, and the availability of sequence pairs that are required for inter- but not intramolecular analyses. Here we selected for analysis fusion protein families that bypass some of these obstacles. We find that several correlated mutation measurements yield reasonable accuracy for intramolecular contact map prediction on the fusion dataset. However, the accuracy level drops sharply in intermolecular contact prediction. This drop in accuracy does not always occur. In the Cohesin-Dockerin family, reasonable accuracy is achieved in the prediction of both intra- and intermolecular contacts. The Cohesin-Dockerin family is well suited for correlated mutation analysis. However, because this family constitutes a special case (it has radical mutations, has domain repeats, within each species each Dockerin domain interacts with each Cohesin domain, see below), the successful prediction in this family does not point to a general potential in using correlated mutations for predicting intermolecular contacts. Overall, the results of our study indicate that current methodologies of correlated mutations analysis are not suitable for large-scale intermolecular contact prediction, and thus cannot assist in docking. With current measurements, sequence availability, sequence annotations, and underdeveloped sequence pairing methods, correlated mutations can yield reasonable accuracy only for a handful of families.

Journal ArticleDOI
01 Nov 2006-Proteins
TL;DR: An improved sampling algorithm and energy model for protein loop prediction has yielded the first methodology capable of achieving good results for the prediction of loop backbone conformations of 11 residue length or greater, and the inclusion of a hydrophobic term appears to approximately fix a major flaw in the SGB solvation model.
Abstract: We have developed an improved sampling algorithm and energy model for protein loop prediction, the combination of which has yielded the first methodology capable of achieving good results for the prediction of loop backbone conformations of 11 residue length or greater. Applied to our newly constructed test suite of 104 loops ranging from 11 to 13 residues, our method obtains average/median global backbone root-mean-square deviations (RMSDs) to the native structure (superimposing the body of the protein, not the loop itself) of 1.00/0.62 Å for 11 residue loops, 1.15/0.60 Å for 12 residue loops, and 1.25/0.76 Å for 13 residue loops. Sampling errors are virtually eliminated, while energy errors leading to large backbone RMSDs are very infrequent compared to any previously reported efforts, including our own previous study. We attribute this success to both an improved sampling algorithm and, more critically, the inclusion of a hydrophobic term, which appears to approximately fix a major flaw in the SGB solvation model that we have been employing. A discussion of these results in the context of the general question of the accuracy of continuum solvation models is presented.
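
The "global" RMSD described in the abstract above is computed on the loop atoms after the rest of the protein has been superimposed, so the loop coordinates themselves are compared without any further fitting. A minimal sketch of that final step follows; it assumes the two coordinate lists are already in the same frame, and the function name `rmsd_no_fit` and the coordinates are hypothetical.

```python
import math


def rmsd_no_fit(coords_a, coords_b):
    """Root-mean-square deviation between matched atoms, with no superposition.

    Each argument is a sequence of (x, y, z) tuples in the same reference
    frame (for example, after the protein body has been superimposed).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    if not coords_a:
        raise ValueError("coordinate lists must be non-empty")
    # Sum of squared distances between corresponding atoms.
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Hypothetical two-atom example: every atom is displaced by 1 unit along z.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd_no_fit(native, model))
```

Because the loop is not refitted, this measure also penalizes a loop that has the right shape but the wrong placement relative to the protein body.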

Journal ArticleDOI
01 Apr 2006-Proteins
TL;DR: It is proposed that a disorder‐to‐order transition occurs in the binding of 14‐3‐3 proteins with their partners, and the consequences for consensus binding sequences, specificity, affinity, and thermodynamic control are discussed.
Abstract: Proteins named 14-3-3 can bind more than 200 different proteins, mostly (but not exclusively) when they are in a phosphorylated state. These partner proteins are involved in different cellular processes, such as cell signaling, transcription factors, cellular morphology, and metabolism; this suggests pleiotropic functionality for 14-3-3 proteins. Recent efforts to establish a rational classification of 14-3-3 binding partners showed neither structural nor functional relatedness in this group of proteins. Using three natural predictors of disorder in proteins, and the available structural information, we show that >90% of 14-3-3 protein partners contain disordered regions. This percentage is significantly high when compared with recent studies on cell signaling and cancer-related proteins or RNA chaperones. More important, almost all 14-3-3-binding sites are inside disordered regions, reinforcing the importance of structural disorder in this class of proteins. We also propose that a disorder-to-order transition occurs in the binding of 14-3-3 proteins with their partners. We discuss the consequences of the latter for consensus binding sequences, specificity, affinity, and thermodynamic control.

Journal ArticleDOI
01 Apr 2006-Proteins
TL;DR: This work discusses how to design protein molecules that would serve as the basic computational element by functioning as a NAND logical gate, utilizing DNA tags for recognition, and phosphorylation and exonuclease reactions for information processing.
Abstract: Can proteins be used as computational devices to address difficult computational problems? In recent years there has been much interest in biological computing, that is, building a general purpose computer from biological molecules. Most of the current efforts are based on DNA because of its ability to self-hybridize. The exquisite selectivity and specificity of complex protein-based networks motivated us to suggest that similar principles can be used to devise biological systems that will be able to directly implement any logical circuit as a parallel asynchronous computation. Such devices, powered by ATP molecules, would be able to perform, for medical applications, digital computation with natural interface to biological input conditions. We discuss how to design protein molecules that would serve as the basic computational element by functioning as a NAND logical gate, utilizing DNA tags for recognition, and phosphorylation and exonuclease reactions for information processing. A solution of these elements could carry out effective computation. Finally, the model and its robustness to errors were tested in a computer simulation.
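
The choice of NAND as the basic element in the abstract above reflects a standard result: NAND is functionally complete, so any logical circuit can be composed from it. The sketch below shows this in software only; it says nothing about the protein chemistry, and the helper names are hypothetical.

```python
from itertools import product


def nand(a: bool, b: bool) -> bool:
    """The single primitive gate: false only when both inputs are true."""
    return not (a and b)


# Every gate below is built from nand() alone.
def not_(a: bool) -> bool:
    return nand(a, a)


def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))


def or_(a: bool, b: bool) -> bool:
    return nand(not_(a), not_(b))


def xor_(a: bool, b: bool) -> bool:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


# Print the truth table of the derived gates.
for a, b in product([False, True], repeat=2):
    print(a, b, and_(a, b), or_(a, b), xor_(a, b))
```

Each derived gate reproduces the truth table of the corresponding built-in Boolean operation.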

Journal ArticleDOI
01 Jun 2006-Proteins
TL;DR: Results from these simulations support an activation mechanism in which the β4–α4 loop, at least partially, gates the isomerization of Tyr106, and the roles of phosphorylation and the conserved Thr87 are deemed indirect in that they stabilize the active configuration of the β4–α4 loop.
Abstract: A combination of thirty-two 10-ns-scale molecular dynamics simulations were used to explore the coupling between conformational transition and phosphorylation in the bacteria chemotaxis Y protein (CheY), as a simple but representative example of protein allostery. Results from these simulations support an activation mechanism in which the β4–α4 loop, at least partially, gates the isomerization of Tyr106. The roles of phosphorylation and the conserved Thr87 are deemed indirect in that they stabilize the active configuration of the β4–α4 loop. The indirect role of the activation event (phosphorylation) and/or conserved residues in stabilizing, rather than causing, specific conformational transition is likely a feature in many signaling systems. The current analysis of CheY also helps to make clear that neither the "old" (induced fit) nor the "new" (population shift) views for protein allostery are complete, because they emphasize the kinetic (mechanistic) and thermodynamic aspects of allosteric transitions, respectively. In this regard, an issue that warrants further analysis concerns the interplay of concerted collective motion and sequential local structural changes in modulating cooperativity between distant sites in biomolecules.