Showing papers by "Chris Sander published in 1994"


Journal ArticleDOI
01 May 1994-Proteins
TL;DR: This work extends the previous three-level system of neural networks with additional input information derived from multiple alignments; a position-specific conservation weight increases performance, and the placement of secondary structure segments is predicted with greatly increased accuracy.
Abstract: Using evolutionary information contained in multiple sequence alignments as input to neural networks, secondary structure can be predicted at significantly increased accuracy. Here, we extend our previous three-level system of neural networks by using additional input information derived from multiple alignments. Using a position-specific conservation weight as part of the input increases performance. Using the number of insertions and deletions reduces the tendency for overprediction and increases overall accuracy. Addition of the global amino acid content yields a further improvement, mainly in predicting structural class. The final network system has sustained overall accuracy of 71.6% in a multiple cross-validation test on 126 unique protein chains. A test on a new set of 124 recently solved protein structures that have no significant sequence similarity to the learning set confirms the high level of accuracy. The average cross-validated accuracy for all 250 sequence-unique chains is above 72%. Using various data sets, the method is compared to alternative prediction methods, some of which also use multiple alignments: the performance advantage of the network system is at least 6 percentage points in three-state accuracy. In addition, the network estimates secondary structure content from multiple sequence alignments about as well as circular dichroism spectroscopy on a single protein and classifies 75% of the 250 proteins correctly into one of four protein structural classes. Of particular practical importance is the definition of a position-specific reliability index. For 40% of all residues the method has a sustained three-state accuracy of 88%, as high as the overall average for homology modelling. A further strength of the method is greatly increased accuracy in predicting the placement of secondary structure segments.
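The accuracy figures quoted in this abstract are per-residue three-state scores, and the reliability index is used to pick out the subset of residues that is predicted best. As a minimal sketch of how such numbers could be computed, assume that predictions and observations are equal-length strings over the states H (helix), E (strand) and L (other) and that the reliability index is one integer per residue. The function names, the state letters and the 0-9 scale in the toy data are illustrative assumptions, not details taken from the abstract.

```python
def q3(predicted, observed):
    """Fraction of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


def accuracy_at_reliability(predicted, observed, reliability, cutoff):
    """Return (coverage, accuracy) for residues with reliability >= cutoff."""
    if not (len(predicted) == len(observed) == len(reliability)):
        raise ValueError("inputs must have equal length")
    kept = [(p, o) for p, o, r in zip(predicted, observed, reliability)
            if r >= cutoff]
    if not kept:
        return 0.0, 0.0
    coverage = len(kept) / len(observed)          # share of residues retained
    accuracy = sum(p == o for p, o in kept) / len(kept)
    return coverage, accuracy


# Toy data, invented for illustration only.
pred = "HHHHLLEEEE"
obs = "HHHLLLEEEL"
rel = [9, 8, 7, 2, 5, 6, 9, 9, 8, 1]
print(q3(pred, obs))
print(accuracy_at_reliability(pred, obs, rel, cutoff=7))
```

Raising the cutoff trades coverage for accuracy, which is the kind of trade-off the abstract summarizes as 88% accuracy for 40% of all residues.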

1,470 citations


Journal ArticleDOI
01 Apr 1994-Proteins
TL;DR: A simple and general method is presented to analyze correlations in mutational behavior between different positions in a multiple sequence alignment to predict contact maps for each of 11 protein families and compare the result with the contacts determined by crystallography.
Abstract: The maintenance of protein function and structure constrains the evolution of amino acid sequences. This fact can be exploited to interpret correlated mutations observed in a sequence family as an indication of probable physical contact in three dimensions. Here we present a simple and general method to analyze correlations in mutational behavior between different positions in a multiple sequence alignment. We then use these correlations to predict contact maps for each of 11 protein families and compare the result with the contacts determined by crystallography. For the most strongly correlated residue pairs predicted to be in contact, the prediction accuracy ranges from 37 to 68% and the improvement ratio relative to a random prediction from 1.4 to 5.1. Predicted contact maps can be used as input for the calculation of protein tertiary structure, either from sequence information alone or in combination with experimental information.
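The idea of correlated mutational behavior between alignment positions can be illustrated with a short sketch. It assumes that the alignment is a list of equal-length strings, that the similarity of two residues is scored by simple identity (a stand-in for a real amino acid substitution matrix), and that two columns are compared by the Pearson correlation of their pairwise similarity values. These choices and all function names are assumptions made for illustration and may differ from the scoring used in the paper.

```python
from itertools import combinations
from math import sqrt


def pair_similarities(alignment, column):
    """Similarity of every sequence pair at one alignment column.

    Assumption: identity scoring (1.0 if residues match, else 0.0) stands in
    for a real amino acid substitution matrix.
    """
    residues = [seq[column] for seq in alignment]
    return [1.0 if a == b else 0.0 for a, b in combinations(residues, 2)]


def column_correlation(alignment, i, j):
    """Pearson correlation of pairwise similarities at columns i and j."""
    if len(alignment) < 2:
        raise ValueError("need at least two aligned sequences")
    s_i = pair_similarities(alignment, i)
    s_j = pair_similarities(alignment, j)
    n = len(s_i)
    mean_i = sum(s_i) / n
    mean_j = sum(s_j) / n
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(s_i, s_j))
    var_i = sum((a - mean_i) ** 2 for a in s_i)
    var_j = sum((b - mean_j) ** 2 for b in s_j)
    if var_i == 0 or var_j == 0:
        # All pair similarities equal: no covariation signal in this column.
        return 0.0
    return cov / sqrt(var_i * var_j)


def rank_pairs(alignment):
    """All column pairs as (score, i, j), most strongly correlated first."""
    length = len(alignment[0])
    scored = [(column_correlation(alignment, i, j), i, j)
              for i, j in combinations(range(length), 2)]
    return sorted(scored, reverse=True)


# Toy alignment, invented for illustration: columns 0 and 1 change together.
toy = ["ACK", "ACK", "GTK", "GTR"]
for score, i, j in rank_pairs(toy):
    print(i, j, round(score, 3))
```

In a contact prediction setting, the top-ranked pairs would be taken as candidate contacts and compared with the contacts observed in the crystal structure.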

876 citations


Journal ArticleDOI
TL;DR: To reduce redundancy in the Protein Data Bank of 3D protein structures, which is caused by many homologous proteins in the data bank, a representative set of structures is selected to reduce time and effort in statistical analyses.
Abstract: To reduce redundancy in the Protein Data Bank of 3D protein structures, which is caused by many homologous proteins in the data bank, we have selected a representative set of structures. The selection algorithm was designed to (1) select as many nonhomologous structures as possible, and (2) to select structures of good quality. The representative set may reduce time and effort in statistical analyses.
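The two selection goals stated here (few homologous pairs, good structure quality) can be sketched with a simple greedy pass. This is only an illustration under stated assumptions: quality is approximated by crystallographic resolution, homology by a percent-identity cutoff, and the greedy strategy, the cutoff value and all names are hypothetical rather than taken from the paper. A greedy pass is also not guaranteed to yield the largest possible non-homologous set.

```python
def select_representatives(chains, identity, max_identity=25.0):
    """Greedy selection of a non-redundant set of chains.

    chains: list of (name, resolution_in_angstrom); a lower resolution value
        is treated as better quality (an assumption of this sketch).
    identity: dict mapping frozenset({name_a, name_b}) to percent sequence
        identity; missing pairs are treated as unrelated.
    max_identity: hypothetical homology cutoff, not taken from the paper.
    """
    selected = []
    # Visit the best-quality structures first so they are kept preferentially.
    for name, _resolution in sorted(chains, key=lambda c: c[1]):
        homologous = any(
            identity.get(frozenset((name, kept)), 0.0) > max_identity
            for kept in selected
        )
        if not homologous:
            selected.append(name)
    return selected


# Toy data, invented for illustration only.
chains = [("a", 2.5), ("b", 1.8), ("c", 2.0), ("d", 3.0)]
identity = {
    frozenset(("a", "b")): 90.0,
    frozenset(("c", "d")): 40.0,
    frozenset(("a", "c")): 10.0,
}
print(select_representatives(chains, identity))
```

With the toy data, chain "a" is dropped in favor of its better-resolved homolog "b", and "d" is dropped in favor of "c".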

800 citations


Journal ArticleDOI
TL;DR: A systematic comparison of 23 Ig domain structures with less than 25% pairwise residue identity was performed using automatic structural alignment and analysis of beta-sheet and loop topology, revealing a common structural core of only four beta-strands and four different topological subtypes that correlate with the length of the intervening sequence between strands c and e, the most variable region in sequence.

780 citations


Journal ArticleDOI
01 Nov 1994-Proteins
TL;DR: A neural network system that predicts relative solvent accessibility of each residue using evolutionary profiles of amino acid substitutions derived from multiple sequence alignments is introduced, and the most reliably predicted fraction of the residues (50%) is predicted as accurately as by automatic homology modeling.
Abstract: Currently, the prediction of three-dimensional (3D) protein structure from sequence alone is an exceedingly difficult task. As an intermediate step, a much simpler task has been pursued extensively: predicting 1D strings of secondary structure. Here, we present an analysis of another 1D projection from 3D structure: the relative solvent accessibility of each residue. We show that solvent accessibility is less conserved in 3D homologues than is secondary structure, and hence is predicted less accurately from automatic homology modeling; the correlation coefficient of relative solvent accessibility between 3D homologues is only 0.77, and the average accuracy of predictions based on sequence alignments is only 0.68. The latter number provides an effective upper limit on the accuracy of predicting accessibility from sequence when homology modeling is not possible. We introduce a neural network system that predicts relative solvent accessibility (projected onto ten discrete states) using evolutionary profiles of amino acid substitutions derived from multiple sequence alignments. Evaluated in a cross-validation test on 238 unique proteins, the correlation between predicted and observed relative accessibility is 0.54. Interpreted in terms of a three-state (buried, intermediate, exposed) description of relative accessibility, the fraction of correctly predicted residue states is about 58%. In absolute terms this accuracy appears poor, but given the relatively low conservation of accessibility in 3D families, the network system is not far from its likely optimal performance. The most reliably predicted fraction of the residues (50%) is predicted as accurately as by automatic homology modeling. Prediction is best for buried residues, e.g., 86% of the completely buried sites are correctly predicted as having 0% relative accessibility. © 1994 Wiley-Liss, Inc.

623 citations


Journal ArticleDOI
TL;DR: Using information about evolutionary conservation as contained in multiple sequence alignments, the secondary structure of 4700 protein sequences was predicted by the automatic e-mail server PHD, with an expected overall three-state accuracy of 71.4%.
Abstract: By the middle of 1993, more than 30,000 protein sequences had been listed. For 1000 of these, the three-dimensional (tertiary) structure has been experimentally solved. Another 7000 can be modelled by homology. For the remaining 21,000 sequences, secondary structure prediction provides a rough estimate of structural features. Predictions in three states range between 35% (random) and 88% (homology modelling) overall accuracy. Using information about evolutionary conservation as contained in multiple sequence alignments, the secondary structure of 4700 protein sequences was predicted by the automatic e-mail server PHD. For proteins with at least one known homologue, the method has an expected overall three-state accuracy of 71.4% (evaluated on 126 unique protein chains).

596 citations


Journal ArticleDOI
TL;DR: It is concluded that the highest scores one can reasonably expect for secondary structure prediction are a single residue accuracy of Q3 > 85% and a fractional segment overlap of Sov > 90%.

345 citations


Journal Article
TL;DR: The FSSP database currently contains an extended structural family for each of 330 representative protein chains, and all such comparisons are based purely on the 3D co-ordinates of the proteins and are derived by automatic structure comparison programs.
Abstract: FSSP (families of structurally similar proteins) is a database of structural alignments of proteins in the Protein Data Bank (PDB). The database currently contains an extended structural family for each of 330 representative protein chains. Each data set contains structural alignments of one search structure with all other structurally significantly similar proteins in the representative set (remote homologs, < 30% sequence identity), as well as all structures in the Protein Data Bank with 70-30% sequence identity relative to the search structure (medium homologs). Very close homologs (above 70% sequence identity) are excluded as they rarely have marked structural differences. The alignments of remote homologs are the result of pairwise all-against-all structural comparisons in the set of 330 representative protein chains. All such comparisons are based purely on the 3D co-ordinates of the proteins and are derived by automatic (objective) structure comparison programs. The significance of structural similarity is estimated based on statistical criteria. The FSSP database is available electronically from the EMBL file server and by anonymous ftp (file transfer protocol).

344 citations


Journal ArticleDOI
TL;DR: A new experimental approach to protein structure determination is suggested in which selection of functional mutants after random mutagenesis and analysis of correlated mutations provide sufficient proximity constraints for calculation of the protein fold.
Abstract: A method has been developed to detect pairs of positions with correlated mutations in protein multiple sequence alignments. The method is based on reconstruction of the phylogenetic tree for a set of sequences and statistical analysis of the distribution of mutations in the branches of the tree. The database of homology-derived protein structures (HSSP) is used as the source of multiple sequence alignments for proteins of known three-dimensional structure. We analyse pairs of positions with correlated mutations in 67 protein families and show quantitatively that the presence of such positions is a typical feature of protein families. A significant but weak tendency is observed for correlated residue pairs to be close in the three-dimensional structure. With further improvements, methods of this type may be useful for the prediction of residue--residue contacts and subsequent prediction of protein structure using distance geometry algorithms. In conclusion, we suggest a new experimental approach to protein structure determination in which selection of functional mutants after random mutagenesis and analysis of correlated mutations provide sufficient proximity constraints for calculation of the protein fold.

272 citations


Journal ArticleDOI
01 Jul 1994-Proteins
TL;DR: A new generation of computer algorithms has now been developed that allows routine comparison of a protein structure with the database of all known structures, and such structure database searches are beginning to rival sequence database searches as a tool for discovering biologically interesting relationships.
Abstract: The number of protein structures known in atomic detail has increased from one in 1960 [1] to more than 1000 in 1994. The rate at which new structures are being published exceeds one a day as a result of recent advances in protein engineering, crystallography, and spectroscopy. More and more frequently, a newly determined structure is similar in fold to a known one, even when no sequence similarity is detectable. A new generation of computer algorithms has now been developed that allows routine comparison of a protein structure with the database of all known structures. Such structure database searches are already used daily and they are beginning to rival sequence database searches as a tool for discovering biologically interesting relationships.

241 citations


Journal ArticleDOI
01 Jul 1994-Proteins
TL;DR: An algorithm for identification of structural units by objective, quantitative criteria based on atomic interactions is proposed, which is useful for the analysis of folding principles, for modular protein design and for protein engineering.
Abstract: General patterns of protein structural organization have emerged from studies of hundreds of structures elucidated by X-ray crystallography and nuclear magnetic resonance. Structural units are commonly identified by visual inspection of molecular models using qualitative criteria. Here, we propose an algorithm for identification of structural units by objective, quantitative criteria based on atomic interactions. The underlying physical concept is maximal interactions within each unit and minimal interaction between units (domains). In a simple harmonic approximation, interdomain dynamics is determined by the strength of the interface and the distribution of masses. The most likely domain decomposition involves units with the most correlated motion, or largest interdomain fluctuation time. The decomposition of a convoluted 3-D structure is complicated by the possibility that the chain can cross over several times between units. Grouping the residues by solving an eigenvalue problem for the contact matrix reduces the problem to a one-dimensional search for all reasonable trial bisections. Recursive bisection yields a tree of putative folding units. Simple physical criteria are used to identify units that could exist by themselves. The units so defined closely correspond to crystallographers' notion of structural domains. The results are useful for the analysis of folding principles, for modular protein design and for protein engineering. © 1994 Wiley-Liss, Inc.

01 Jan 1994
TL;DR: A simple and general method is presented to analyze correlations in mutational behavior between different positions in a multiple sequence alignment, to predict contact maps for each of 11 protein families, and to compare the result with the contacts determined by crystallography.
Abstract: The maintenance of protein function and structure constrains the evolution of amino acid sequences. This fact can be exploited to interpret correlated mutations observed in a sequence family as an indication of probable physical contact in three dimensions. Here we present a simple and general method to analyze correlations in mutational behavior between different positions in a multiple sequence alignment. We then use these correlations to predict contact maps for each of 11 protein families and compare the result with the contacts determined by crystallography. For the most strongly correlated residue pairs predicted to be in contact, the prediction accuracy ranges from 37 to 68% and the improvement ratio relative to a random prediction from 1.4 to 5.1. Predicted contact maps can be used as input for the calculation of protein tertiary structure, either from sequence information alone or in combination with experimental information. © 1994 Wiley-Liss, Inc.


Journal ArticleDOI
TL;DR: A variant of the protein, FtsA102, in which the nucleotide binding site was destroyed by mutagenesis of a highly conserved residue predicted to be needed for the binding, does not bind ATP, and phosphorylation and ATP binding may not be essential for the function of FtsA.
Abstract: Cell division protein FtsA, predicted to belong to the actin family, is present in different cell compartments depending on its phosphorylation state. The FtsA fraction isolated from the cytoplasm is phosphorylated and capable of binding ATP, while the membrane-bound form is unphosphorylated and does not bind ATP. A variant of the protein, FtsA102, in which the nucleotide binding site was destroyed by mutagenesis of a highly conserved residue predicted to be needed for the binding, does not bind ATP. Another variant, FtsA104, cannot be phosphorylated because the predicted phosphorylatable residue has been replaced by a non-phosphorylatable one. This protein, although unable to bind ATP in vitro, is able to rescue the reversible ftsA2, the irreversible ftsA3 and, almost with the same efficiency, the ftsA16 amber alleles. Consequently, phosphorylation and ATP binding may not be essential for the function of FtsA. Alternatively they may have a regulatory role on the action of FtsA in the septator.

Journal ArticleDOI
TL;DR: The effects of dnaK mutations which alter glutamate-171 of DnaK to alanine, leucine or lysine are analyzed, and it is proposed that the coupling of ATPase activity with substrate binding is essential for the chaperone function of DnaK.
Abstract: Central to the chaperone function of Hsp70 stress proteins including Escherichia coli DnaK is the ability of Hsp70 to bind unfolded protein substrates in an ATP-dependent manner. Mg2+/ATP dissociates bound substrates and, furthermore, substrate binding stimulates the ATPase of Hsp70. This coupling is proposed to require a glutamate residue, E175 of bovine Hsc70, that is entirely conserved within the Hsp70 family, as it contacts bound Mg2+/ATP and is part of a hinge required for a postulated ATP-dependent opening/closing movement of the nucleotide binding cleft which then triggers substrate release. We analyzed the effects of dnaK mutations which alter the corresponding glutamate-171 of DnaK to alanine, leucine or lysine. In vivo, the mutated dnaK alleles failed to complement the delta dnaK52 mutation and were dominant negative in dnaK+ cells. In vitro, all three mutant DnaK proteins were inactive in known DnaK-dependent reactions, including refolding of denatured luciferase and initiation of lambda DNA replication. The mutant proteins retained ATPase activity, as well as the capacity to bind peptide substrates. The intrinsic ATPase activities of the mutant proteins, however, did exhibit increased Km and Vmax values. More importantly, these mutant proteins showed no stimulation of ATPase activity by substrates and no substrate dissociation by Mg2+/ATP. Thus, glutamate-171 is required for coupling of ATPase activity with substrate binding, and this coupling is essential for the chaperone function of DnaK.

Journal ArticleDOI
TL;DR: It is hypothesized that a new type of RNA-binding domain may be utilized to deliver additional activities to the ribosome.
Abstract: Using computer methods for database search, multiple alignment, protein sequence motif analysis and secondary structure prediction, a putative new RNA-binding motif was identified. The novel motif is conserved in yeast omnipotent translation termination suppressor SUP1, the related DOM34 protein and its pseudogene homologue; three groups of eukaryotic and archaeal ribosomal proteins, namely L30e, L7Ae/S6e and S12e; an uncharacterized Bacillus subtilis protein related to the L7A/S6e group; and Escherichia coli ribosomal protein modification enzyme RimK. We hypothesize that a new type of RNA-binding domain may be utilized to deliver additional activities to the ribosome.

Journal ArticleDOI
TL;DR: One year after the release of the sequence of yeast chromosome III, its open reading frames (ORFs) are re‐examined by computer methods, finding several ORFs have similarities to uncharacterized proteins, resulting in new families in search of a function.
Abstract: One year after the release of the sequence of yeast chromosome III, we have re-examined its open reading frames (ORFs) by computer methods. More than 61% of the 171 probable gene products have significant sequence similarities in the current databases; as many as 54% have already known functions or are related to functionally characterized proteins, allowing partial prediction of protein function, 11 percentage points more than reported a year ago; 19% are similar to proteins of known three-dimensional structure, allowing model building by homology. The most interesting new identifications include a sugar kinase distantly related to ribokinases, a phosphatidyl serine synthetase, a putative transcription regulator, a flavodoxin-like protein, and a zinc finger protein belonging to a distinct subfamily. Several ORFs have similarities to uncharacterized proteins, resulting in new families in search of a function. About 54% of ORFs match sequences from other phyla, including numerous fragments in the database of expressed sequence tags (ESTs). Most significant similarities to ESTs are with proteins in conserved families widely represented in the databases. About 30% of ORFs contain one or more predicted transmembrane segments. The increase in the power of functional and structural prediction comes from improvements in sequence analysis and from richer databases and is expected to facilitate substantially the experimental effort in characterizing the function of new gene products.

Proceedings Article
01 Jan 1994
TL;DR: The prototype of a software system, called GeneQuiz, for large-scale biological sequence analysis, is presented and an overview of the architecture of the system prototype and the experiences on its applicability in sequence analysis are covered.
Abstract: We present the prototype of a software system, called GeneQuiz, for large-scale biological sequence analysis. The system was designed to meet the needs that arise in computational sequence analysis and our past experience with the analysis of 171 protein sequences of yeast chromosome III. We explain the cognitive challenges associated with this particular research activity and present our model of the sequence analysis process. The prototype system consists of two parts: (i) the database update and search system (driven by perl programs and rdb, a simple relational database engine also written in perl) and (ii) the visualization and browsing system (developed under C++/ET++). The principal design requirement for the first part was the complete automation of all repetitive actions: database updates, efficient sequence similarity searches and sampling of results in a uniform fashion. The user is then presented with "hit-lists" that summarize the results from heterogeneous database searches. The expert's primary task now simply becomes the further analysis of the candidate entries, where the problem is to extract adequate information about functional characteristics of the query protein rapidly. This second task is tremendously accelerated by a simple combination of the heterogeneous output into uniform relational tables and the provision of browsing mechanisms that give access to database records, sequence entries and alignment views. Indexing of molecular sequence databases provides fast retrieval of individual entries with the use of unique identifiers as well as browsing through databases using pre-existing cross-references. The presentation here covers an overview of the architecture of the system prototype and our experiences on its applicability in sequence analysis.(ABSTRACT TRUNCATED AT 250 WORDS)

Journal ArticleDOI
TL;DR: An evolutionary connection between plant endochitinases and lysozymes is supported by similar overall topology of fold, overlapping substrate specificities and remarkable conservation of some sequence and architectural detail around the active site.

Journal ArticleDOI
TL;DR: Computer-assisted amino acid analysis can identify protein samples obtained in minute quantities from two-dimensional polyacrylamide gel electrophoresis, and may replace protein sequencing as a first attempt at identification, provided a homolog can be found in the database.

Journal ArticleDOI
TL;DR: This work has shown that evolutionary relationships can be exploited to predict the function of many other proteins from their amino acid sequence, and the techniques for such predictions are becoming increasingly sophisticated and are now an essential part of genome analysis.

Journal ArticleDOI
TL;DR: A set of detailed predictive rules based on the comparison of crystal structures of point mutants and wild types in 83 cases are derived, which describe well the conformational changes in 85% of all point mutant structures available at present.
Abstract: Point mutations are frequently used to explore the structure and/or function of proteins. The ability to predict the structural effects of point mutations would make the planning of such experiments more reliable. We have now derived a set of detailed predictive rules based on the comparison of crystal structures of point mutants and wild types in 83 cases. Despite the surprising simplicity of these rules, they describe well the conformational changes in 85% of all point mutant structures available at present.

Journal ArticleDOI
TL;DR: The solution structure of LexA repressor from Escherichia coli reveals an unexpected structural similarity to a widespread class of prokaryotic and eukaryotic regulatory proteins, which is typified by catabolite gene activator protein (CAP).
Abstract: Comparison of structures can reveal surprising connections between protein families and provide new insights into the relationship between sequence, structure and function. The solution structure of LexA repressor from Escherichia coli reveals an unexpected structural similarity to a widespread class of prokaryotic and eukaryotic regulatory proteins, which is typified by catabolite gene activator protein (CAP). The use of combined sequence profiles allows the identification of two new prokaryotic members of the superfamily: listeriolysin regulatory protein (PrfA) and ferric uptake regulatory protein (Fur). LexA, PrfA and Fur are the first examples of prokaryotic regulatory proteins in which DNA recognition is mediated by a variant of the classical helix-turn-helix motif, with an insertion in the turn region.


Journal ArticleDOI
TL;DR: Three-dimensional methods, in which pseudopotentials or information values are derived from the databases, are proving their value for distinguishing between correct and incorrect models.

Journal ArticleDOI
TL;DR: SCAN3D is a new database system for integrated sequence and structure analysis of proteins that uses the relational paradigm wherever possible and its main power stems from the ability to retrieve stretches of consecutive residues with certain properties by comparing a property profile with all stretches of residues in the database, exploiting the ordered character of proteins.
Abstract: In protein engineering and design it is very important that residues can be inspected in their specific environment. A standard relational database system cannot serve this purpose adequately because it cannot handle relations between individual residues. With SCAN3D we introduce a new database system for integrated sequence and structure analysis of proteins. It uses the relational paradigm wherever possible. Its main power, however, stems from the ability to retrieve stretches of consecutive residues with certain properties by comparing a property profile with all stretches of residues in the database, exploiting the ordered character of proteins. In doing so, it bypasses the large number of join operations that would be required by relational database systems. An additional advantage of using property profile matching is that searches can be carried out allowing a pre-set number of mismatches. Also, as the database is read-only, SCAN3D does not need interactive data update mechanisms. Queries typical of a molecular engineering environment are demonstrated with specific examples: analysis of peptides that induce local structure, analysis of site-dependent rotamers and residue--residue contact analysis.
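The profile search described in this abstract, which compares a property profile with every stretch of consecutive residues while tolerating a pre-set number of mismatches, can be illustrated with a sliding-window sketch. It assumes one property character per residue (for example a secondary structure state) and a '?' wildcard in the query. The function name, the wildcard and the data layout are assumptions made for illustration, not SCAN3D's actual interface or storage format.

```python
def find_stretches(properties, profile, max_mismatches=0):
    """Return start indices where `profile` matches a stretch of `properties`.

    properties: per-residue property string for one protein (one character
        per residue, e.g. a secondary structure state).
    profile: query over the same alphabet; '?' matches any property
        (the wildcard is an assumption of this sketch).
    max_mismatches: number of positions allowed to disagree.
    """
    if not profile:
        raise ValueError("profile must not be empty")
    hits = []
    width = len(profile)
    # Slide a window of the profile's width along the ordered residue chain.
    for start in range(len(properties) - width + 1):
        window = properties[start:start + width]
        mismatches = sum(
            1 for q, p in zip(profile, window) if q != "?" and q != p
        )
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


# Toy property string, invented for illustration only.
states = "LLHHHHLEEEELHHEHL"
print(find_stretches(states, "HHHH"))                    # exact matches
print(find_stretches(states, "HHHH", max_mismatches=1))  # one mismatch allowed
```

Because the scan runs directly over the ordered sequence, no join between per-residue rows is needed, which is the advantage over a purely relational layout that the abstract points to.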


Journal ArticleDOI
TL;DR: The most promising practical strategies for developing proteins with useful biological or chemical function combine theoretical design with experimental screening or selection systems.

Proceedings Article
01 Jan 1994
TL;DR: An improvement of a neural network system using information about evolutionary conservation achieves a sustained overall accuracy of 71.4%, and a test on 45 new proteins confirms the estimated accuracy.
Abstract: Some 30,000 protein sequences are known. For 1,000 the structure is experimentally solved. Another 4,000 can be modeled by homology. For the remaining 25,000 sequences, the tertiary structure (3D) cannot be predicted generally from the sequence. A reduction of the problem is the projection of 3D structure onto a one-dimensional string of secondary structure assignments. Predictions in three states rate between 36% (random) and 88% (homology modelling) accuracy. Here, we present an improvement of a neural network system using information about evolutionary conservation. The method achieves a sustained overall accuracy of 71.4%. A test on 45 new proteins confirms the estimated accuracy. Of practical importance is the definition of a reliability index at each residue position: e.g. about 40% of the predicted residues have an expected accuracy of 88%. The method has been made publicly available by an automatic e-mail server.

Journal ArticleDOI
TL;DR: In this paper, a set of methods is presented that are designed to cope with inconsistencies in symmetry information, which can be used for validation of Protein Data Bank entries and automatic generation of symmetry contacts for inspection and analysis.
Abstract: Many natural proteins are active as multimers. Crystallographic protein databases, however, generally store only part of the native multimer, the asymmetric unit, along with symmetry information. As a result of inaccuracies in the data, it is not always possible to reconstruct the native multimer. Here, a set of methods is presented that are designed to cope with inconsistencies in symmetry information. Applications include the validation of Protein Data Bank entries and the automatic generation of symmetry contacts for inspection and analysis.