Showing papers by "Mark Gerstein" published in 2000


Journal ArticleDOI
TL;DR: A novel protein chip technology is developed that allows the high-throughput analysis of biochemical activities, and this approach is used to analyse nearly all of the protein kinases from Saccharomyces cerevisiae, finding many novel activities and that a large number of protein kinases are capable of phosphorylating tyrosine.
Abstract: We have developed a novel protein chip technology that allows the high-throughput analysis of biochemical activities, and used this approach to analyse nearly all of the protein kinases from Saccharomyces cerevisiae. Protein chips are disposable arrays of microwells in silicone elastomer sheets placed on top of microscope slides. The high density and small size of the wells allow for high-throughput batch processing and simultaneous analysis of many individual samples. Only small amounts of protein are required. Of 122 known and predicted yeast protein kinases, 119 were overexpressed and analysed using 17 different substrates and protein chips. We found many novel activities and that a large number of protein kinases are capable of phosphorylating tyrosine. The tyrosine phosphorylating enzymes often share common amino acid residues that lie near the catalytic region. Thus, our study identified a number of novel features of protein kinases and demonstrates that protein chip technology is useful for high-throughput screening of protein biochemical activity.

814 citations


Journal ArticleDOI
TL;DR: The special role for the beta-branched residues Ile and Val suggested here is consistent with the hypothesis that residues with constrained rotameric freedom in helical conformation might reduce the entropic cost of folding in transmembrane proteins.

610 citations


Journal ArticleDOI
TL;DR: The results show that traditional scores for sequence and structure similarity have the same basic exponential relationship as observed previously, with structural divergence, measured in RMS, being exponentially related to sequence divergence; the measurements have greater statistical weight and precision.

395 citations


Journal ArticleDOI
TL;DR: Of the first 10 structures determined, several provided clues to biochemical functions that were not detectable from sequence analysis, and in many cases these putative functions could be readily confirmed by biochemical methods, demonstrating that structural proteomics is feasible and can play a central role in functional genomics.
Abstract: A set of 424 nonmembrane proteins from Methanobacterium thermoautotrophicum were cloned, expressed and purified for structural studies. Of these, ∼20% were found to be suitable candidates for X-ray crystallographic or NMR spectroscopic analysis without further optimization of conditions, providing an estimate of the number of the most accessible structural targets in the proteome. A retrospective analysis of the experimental behavior of these proteins suggested some simple relations between sequence and solubility, implying that data bases of protein properties will be useful in optimizing high throughput strategies. Of the first 10 structures determined, several provided clues to biochemical functions that were not detectable from sequence analysis, and in many cases these putative functions could be readily confirmed by biochemical methods. This demonstrates that structural proteomics is feasible and can play a central role in functional genomics.

297 citations


Journal ArticleDOI
TL;DR: This system attempts to describe a protein motion as a rigid-body rotation of a small 'core' relative to a larger one, using a set of hinges, and finds that while this model can accommodate most protein motions, it cannot accommodate all; the degree to which a motion can be accommodated provides an aid in classifying it.
Abstract: The number of solved structures of macromolecules that have the same fold and thus exhibit some degree of conformational variability is rapidly increasing. It is consequently advantageous to develop a standardized terminology for describing this variability and automated systems for processing protein structures in different conformations. We have developed such a system as a ‘front-end’ server to our database of macromolecular motions. Our system attempts to describe a protein motion as a rigid-body rotation of a small ‘core’ relative to a larger one, using a set of hinges. The motion is placed in a standardized coordinate system so that all statistics between any two motions are directly comparable. We find that while this model can accommodate most protein motions, it cannot accommodate all; the degree to which a motion can be accommodated provides an aid in classifying it. Furthermore, we perform an adiabatic mapping (a restrained interpolation) between every two conformations. This gives some indication of the extent of the energetic barriers that need to be surmounted in the motion, and as a by-product results in a ‘morph movie’. We make these movies available over the Web to aid in visualization. Many instances of conformational variability occur between proteins with somewhat different sequences. We can accommodate these differences in a rough fashion, generating an ‘evolutionary morph’. Users have already submitted hundreds of examples of protein motions to our server, producing a comprehensive set of statistics. So far the statistics show that the median submitted motion has a rotation of ~10° and a maximum Cα displacement of 17 Å. Almost all involve at least one large torsion angle change of >140°. The server is accessible at http://bioinfo.mbb.yale.edu/MolMovDB
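
The interpolation and displacement statistics described in this abstract can be illustrated with a minimal sketch. It assumes the two conformations are already superimposed and swaps in plain linear interpolation for the adiabatic mapping (the restrained, energy-aware interpolation the server performs), so it shows only the geometric idea. The function names and the three-atom coordinates are hypothetical.

```python
import math

def linear_morph(start, end, n_frames):
    """Return n_frames coordinate sets linearly interpolated from start to end.

    start, end: lists of (x, y, z) tuples for the same atoms, pre-superimposed.
    This is NOT adiabatic mapping: no energy restraints or minimization.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([
            tuple(a + t * (b - a) for a, b in zip(p, q))
            for p, q in zip(start, end)
        ])
    return frames

def max_displacement(start, end):
    """Largest distance moved by any single atom between two conformations."""
    return max(math.dist(p, q) for p, q in zip(start, end))

# Hypothetical three-atom C-alpha trace in two conformations (angstroms).
open_form = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
closed_form = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

frames = linear_morph(open_form, closed_form, 5)
print(len(frames), round(max_displacement(open_form, closed_form), 2))
```

Each intermediate frame could be written out as one step of a 'morph movie'; a faithful version would restrain and minimize every frame rather than interpolate freely.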

259 citations


Journal ArticleDOI
TL;DR: This work built whole-genome trees based on the presence or absence of particular molecular features, either orthologs or folds, in the genomes of a number of recently sequenced microorganisms and compared them to the traditional ribosomal phylogeny and also to trees based on the sequence similarity of individual orthologous proteins.
Abstract: We built whole-genome trees based on the presence or absence of particular molecular features, either orthologs or folds, in the genomes of a number of recently sequenced microorganisms. To put these genomic trees into perspective, we compared them to the traditional ribosomal phylogeny and also to trees based on the sequence similarity of individual orthologous proteins. We found that our genomic trees based on the overall occurrence of orthologs did not agree well with the traditional tree. This discrepancy, however, vanished when one restricted the tree to proteins involved in transcription and translation, not including problematic proteins involved in metabolism. Protein folds unite superficially unrelated sequence families and represent a most fundamental molecular unit described by genomes. We found that our genomic occurrence tree based on folds agreed fairly well with the traditional ribosomal phylogeny. Surprisingly, despite this overall agreement, certain classes of folds, particularly all-beta ones, had a somewhat different phylogenetic distribution. We also compared our occurrence trees to whole-genome clusters based on the composition of amino acids and di-nucleotides. Finally, we analyzed some technical aspects of genomic trees, e.g., comparing parsimony versus distance-based approaches and examining the effects of increasing numbers of organisms. Additional information (e.g. clickable trees) is available from http://bioinfo.mbb.yale.edu/genome/trees.
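
A distance-based occurrence tree starts from pairwise distances between presence/absence profiles. The sketch below shows that first step only. The Jaccard distance is an assumption, since the abstract does not name the measure used, and the genome names and fold inventories are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets of features present in a genome."""
    union = a | b
    if not union:
        return 0.0  # two empty inventories are treated as identical
    return 1.0 - len(a & b) / len(union)

def distance_matrix(profiles):
    """Map each unordered pair of genome names to its occurrence distance."""
    names = sorted(profiles)
    return {
        (x, y): jaccard_distance(profiles[x], profiles[y])
        for x, y in combinations(names, 2)
    }

# Hypothetical fold inventories (presence only; absent folds are omitted).
profiles = {
    "genome_A": {"TIM barrel", "P-loop NTP hydrolase", "ferredoxin-like"},
    "genome_B": {"TIM barrel", "P-loop NTP hydrolase", "immunoglobulin"},
    "genome_C": {"immunoglobulin", "C-type lectin"},
}

dist = distance_matrix(profiles)
closest = min(dist, key=dist.get)  # the pair a clustering method would join first
print(closest, dist[closest])
```

The resulting matrix could be passed to any standard distance-based tree method; the parsimony alternative mentioned in the abstract would work on the presence/absence characters directly.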

162 citations


Journal ArticleDOI
TL;DR: A probabilistic system for predicting the subcellular localization of proteins and estimating the relative population of the various compartments in yeast, based on a Bayesian approach, which gives better accuracy in determining relative compartment populations than that obtained by simply tallying the localization predictions for individual proteins.
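
The contrast between tallying individual predictions and a probabilistic population estimate can be sketched as follows. This is one plausible reading of the summary, not the published system: the compartments, priors, and likelihood values are hypothetical, and summing posterior vectors is an assumption about how the population estimate is formed.

```python
def posterior(prior, likelihoods):
    """Bayes' rule: P(compartment | features) from P(c) and P(features | c)."""
    unnormalized = {c: prior[c] * likelihoods[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

def population_by_tally(posteriors):
    """Assign each protein to its single best compartment, then count."""
    counts = {}
    for post in posteriors:
        best = max(post, key=post.get)
        counts[best] = counts.get(best, 0) + 1
    return {c: k / len(posteriors) for c, k in counts.items()}

def population_by_summing(posteriors):
    """Average the full posterior vectors so uncertain proteins count fractionally."""
    totals = {}
    for post in posteriors:
        for c, p in post.items():
            totals[c] = totals.get(c, 0.0) + p
    return {c: t / len(posteriors) for c, t in totals.items()}

# Hypothetical prior and per-protein feature likelihoods.
prior = {"cytoplasm": 0.5, "nucleus": 0.3, "membrane": 0.2}
protein_likelihoods = [
    {"cytoplasm": 0.6, "nucleus": 0.4, "membrane": 0.1},
    {"cytoplasm": 0.3, "nucleus": 0.4, "membrane": 0.1},
]

posteriors = [posterior(prior, lk) for lk in protein_likelihoods]
print(population_by_tally(posteriors))
print(population_by_summing(posteriors))
```

In this toy case the tally assigns every protein to the cytoplasm, while the summed posteriors retain a substantial nuclear share.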

153 citations


Journal ArticleDOI
TL;DR: The results showed that thermophiles have a greater content of charged residues than mesophiles, both at the overall genomic level and in alpha helices, and suggest that intra-helical salt bridges are an important factor stabilizing thermophilic proteins.
Abstract: We address the question of the thermal stability of proteins in thermophiles through comprehensive genome comparison, focussing on the occurrence of salt bridges. We compared a set of 12 genomes (from four thermophilic archaeons, one eukaryote, six mesophilic eubacteria, and one thermophilic eubacterium). Our results showed that thermophiles have a greater content of charged residues than mesophiles, both at the overall genomic level and in alpha helices. Furthermore, we found that in thermophiles the charged residues in helices tend to be preferentially arranged with a 1-4 helical spacing and oriented so that intra-helical charge pairs agree with the helix dipole. Collectively, these results imply that intra-helical salt bridges are more prevalent in thermophiles than mesophiles and thus suggest that they are an important factor stabilizing thermophilic proteins. We also found that the proteins in thermophiles appear to be somewhat shorter than those in mesophiles. However, this latter observation may have more to do with evolutionary relationships than with physically stabilizing factors. In all our statistics we were careful to control for various biases. These could have, for instance, arisen due to repetitive or duplicated sequences. In particular, we repeated our calculation using a variety of random and directed sampling schemes. One of these involved making a "stratified sample," a representative cross-section of the genomes derived from a set of 52 orthologous proteins present roughly once in each genome. For another sample, we focused on the subset of the 52 orthologs that had a known 3D structure. This allowed us to determine the frequency of tertiary as well as main-chain salt bridges. Our statistical controls supported our overall conclusion about the prevalence of salt bridges in thermophiles in comparison to mesophiles.
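
The two sequence statistics in this abstract, charged-residue content and oppositely charged pairs at a fixed helical spacing, can be sketched as below. Several choices here are assumptions: the input is taken to be an already identified helical segment, only Lys/Arg and Asp/Glu are counted as charged, and the spacing is left as a parameter because "1-4 spacing" could be read as residues i and i+3 or i and i+4. The example sequence is hypothetical.

```python
POSITIVE = frozenset("KR")  # Lys, Arg (His is left out; an assumption)
NEGATIVE = frozenset("DE")  # Asp, Glu

def charged_fraction(seq):
    """Fraction of residues in seq that carry a formal charge."""
    seq = seq.upper()
    if not seq:
        return 0.0
    charged = sum(1 for r in seq if r in POSITIVE or r in NEGATIVE)
    return charged / len(seq)

def charge_pairs(seq, offset):
    """Count opposite-charge pairs at positions i and i+offset, by orientation."""
    seq = seq.upper()
    counts = {"acid_then_base": 0, "base_then_acid": 0}
    for i in range(len(seq) - offset):
        first, second = seq[i], seq[i + offset]
        if first in NEGATIVE and second in POSITIVE:
            counts["acid_then_base"] += 1
        elif first in POSITIVE and second in NEGATIVE:
            counts["base_then_acid"] += 1
    return counts

helix = "EAAAKEAAAK"  # hypothetical helical segment
print(charged_fraction(helix))
print(charge_pairs(helix, 4))  # reading the spacing as i, i+4 (an assumption)
print(charge_pairs(helix, 3))  # alternative reading: i, i+3
```

Keeping the two orientations separate allows a comparison against the helix dipole, for which the acid-first arrangement is usually taken to be the compatible one.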

138 citations


Journal ArticleDOI
TL;DR: The most highly enriched functional categories in the transcriptome (based on the MIPS system) are energy production and protein synthesis, while categories such as transcription, transport and signaling are depleted.
Abstract: We analyzed 10 genome expression data sets by large-scale cross-referencing against broad structural and functional categories. The data sets, generated by different techniques (e.g. SAGE and gene chips), provide various representations of the yeast transcriptome (the set of all yeast genes, weighted by transcript abundance). Our analysis enabled us to determine features more prevalent in the transcriptome than the genome: i.e. those that are common to highly expressed proteins. Starting with the simplest categories, we find that, relative to the genome, the transcriptome is enriched in Ala and Gly and depleted in Asn and very long proteins. We find, furthermore, that protein length and maximum expression level have a roughly inverse relationship. To relate expression level and protein structure, we assigned transmembrane helices and known folds (using PSI-blast) to each protein in the genome; this allowed us to determine that the transcriptome is enriched in mixed α–β structures and depleted in membrane proteins relative to the genome. In particular, some enzymatic folds, such as the TIM barrel and the G3P dehydrogenase fold, are much more prevalent in the transcriptome than the genome, whereas others, such as the protein-kinase and leucine-zipper folds, are depleted. The TIM barrel, in fact, is overwhelmingly the ‘top fold’ in the transcriptome, while it only ranks fifth in the genome. The most highly enriched functional categories in the transcriptome (based on the MIPS system) are energy production and protein synthesis, while categories such as transcription, transport and signaling are depleted. Furthermore, for a given functional category, transcriptome enrichment varies quite substantially between the different expression data sets, with a variation an order of magnitude larger than for the other categories cross-referenced (e.g. amino acids). One can readily see how the enrichment and depletion of the various functional categories relates directly to that of particular folds. Further information can be found at http://bioinfo.mbb.yale.edu/genome/expression
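
The comparison of transcriptome and genome composition can be sketched for the simplest category, amino acids. Following the abstract's description of the transcriptome as the set of genes weighted by transcript abundance, each protein is counted once for the genome and once per transcript for the transcriptome. The relative-difference formula for enrichment is an assumption, and the gene names, sequences, and abundances are hypothetical.

```python
from collections import Counter

def composition(proteins, weights=None):
    """Amino-acid frequencies over proteins, optionally weighted per gene.

    proteins: {gene: sequence}; weights: {gene: transcript abundance} or None.
    """
    totals = Counter()
    for gene, seq in proteins.items():
        w = 1.0 if weights is None else weights[gene]
        for aa, n in Counter(seq).items():
            totals[aa] += w * n
    grand_total = sum(totals.values())
    return {aa: v / grand_total for aa, v in totals.items()}

def enrichment(proteins, weights):
    """Relative change of each frequency, transcriptome versus genome.

    Positive values mean enriched in the transcriptome, negative mean depleted.
    """
    genome = composition(proteins)
    transcriptome = composition(proteins, weights)
    return {
        aa: (transcriptome.get(aa, 0.0) - genome[aa]) / genome[aa]
        for aa in genome
    }

# Hypothetical genes: a short, highly expressed protein and a long, rare one.
proteins = {"geneA": "AAGG", "geneB": "NNNNNNAG"}
abundance = {"geneA": 9.0, "geneB": 1.0}

for aa, value in sorted(enrichment(proteins, abundance).items()):
    print(aa, round(value, 3))
```

The same weighting applies to any other category (fold, function, membrane class) by replacing the per-residue counts with per-protein category labels.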

124 citations


Journal ArticleDOI
TL;DR: Whole-genome expression profiles provide a rich new data-trove for bioinformatics and initial analyses of the profiles have included clustering and cross-referencing to 'external' information on protein structure and function.

89 citations


Journal ArticleDOI
TL;DR: The major challenge facing biologists in the next decade will be to “finish the job”, that is, to ascribe a function to each of the proteins that have been discovered.
Abstract: 1. Background. With the near completion of many genome sequencing projects has come the sobering realisation that our understanding of biology is nowhere near complete. For example, in the worm, C. elegans, less than half of the predicted proteins have a known function (Consortium, 1998). The major challenge facing biologists in the next decade will be to “finish the job”, that is, to ascribe a function to each of the proteins that have been discovered.

Journal ArticleDOI
TL;DR: The relationship between protein subcellular localization and gene expression for a variety of whole-genome expression datasets is investigated, finding high expression levels for cytoplasmic proteins and low ones for nuclear and membrane proteins.

Journal ArticleDOI
Mark Gerstein
TL;DR: An important aspect of structural genomics is connecting coordinate data with whole-genome information related to phylogenetic occurrence, protein function, gene expression, and protein–protein interactions.
Abstract: An important aspect of structural genomics is connecting coordinate data with whole-genome information related to phylogenetic occurrence, protein function, gene expression, and protein–protein interactions. Integrative database analysis allows one to survey the 'finite parts list' of protein folds from many perspectives, highlighting certain folds and structural features that stand out in particular ways.

Journal ArticleDOI
TL;DR: The relationship between sequence variability and "evolutionary opportunity" is discussed, and the utility of Maynard Smith's multidimensional evolutionary opportunity space metaphor is examined for exploring functional constraints, genetic redundancy, and the context dependency of the genotype-phenotype map.
Abstract: Variability profiles measured over a set of aligned sequences can be used to estimate evolutionary freedom to vary. Differences in variability profiles between clades can be used to identify shifts in function at the molecular level. We demonstrate such a shift between the alpha and beta subunits of hemoglobin. We also show that the variability profiles for myoglobin are different between whales and primates and speculate that the differences between the two clades may reflect a shift associated with the novel oxygen storage demands in the lineage leading to whales. We discuss the relationship between sequence variability and "evolutionary opportunity" and explore the utility of Maynard Smith's multidimensional evolutionary opportunity space metaphor for exploring functional constraints, genetic redundancy, and the context dependency of the genotype-phenotype map. This work has implications for quantitatively defining and comparing protein function. Supplementary data is available from bioinfo.mbb.yale.edu/align.
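
A variability profile and its difference between two clades can be sketched as follows. Per-column Shannon entropy is used here as the variability measure, which is an assumption because the abstract does not name one. Gaps are treated as an ordinary symbol, and the two toy alignments are hypothetical.

```python
import math
from collections import Counter

def variability_profile(alignment):
    """Shannon entropy (bits) of each column of equal-length aligned sequences."""
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    profile = []
    for column in zip(*alignment):
        n = len(column)
        counts = Counter(column)
        profile.append(sum(-(c / n) * math.log2(c / n) for c in counts.values()))
    return profile

def profile_shift(clade_a, clade_b):
    """Per-column difference in variability (clade_b minus clade_a).

    Both alignments must share the same column numbering.
    """
    return [
        b - a
        for a, b in zip(variability_profile(clade_a), variability_profile(clade_b))
    ]

# Hypothetical alignments of the same four positions in two clades.
clade_a = ["ACDE", "ACDE", "ACDE", "ACDE"]
clade_b = ["ACDE", "ACDK", "ACEE", "ACDK"]

print([round(x, 3) for x in profile_shift(clade_a, clade_b)])
```

Columns with a large positive or negative shift are candidates for positions where the constraint on variation differs between the clades.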

Journal ArticleDOI
TL;DR: The aim was to identify and characterize all soluble proteins in MG that are structurally and functionally uncharacterized, with particular interest in proteins that differ significantly from typical globular proteins, for example, proteins which are unstructured in the absence of a 'partner' molecule or those that exhibit unusual thermodynamic properties.
Abstract: We present the results of a comprehensive analysis of the proteome of Mycoplasma genitalium (MG), the smallest autonomously replicating organism that has been completely sequenced. Our aim was to identify and characterize all soluble proteins in MG that are structurally and functionally uncharacterized. We were particularly interested in identifying proteins that differed significantly from typical globular proteins, for example, proteins which are unstructured in the absence of a ‘partner’ molecule or those that exhibit unusual thermodynamic properties. This work is complementary to other structural genomics projects whose primary aim is to determine the three-dimensional structures of proteins with unknown folds. We have identified all the full-length open reading frames (ORFs) in MG that have no homologs of known structure and are of unknown function. Twenty-five of the total 483 ORFs fall into this category and we have expressed, purified and characterized 11 of them. We have used circular dichroism (CD) to rapidly investigate their biophysical properties. Our studies reveal that these proteins have a wide range of structures varying from highly helical to partially structured to unfolded or random coil. They also display a variety of thermodynamic properties ranging from cooperative unfolding to no detectable unfolding upon thermal denaturation. Several of these proteins are highly conserved from mycoplasma to man. Further information about target selection and CD results is available at http://bioinfo.mbb.yale.edu/genome

Journal Article
TL;DR: A comprehensive genome analysis on two spirochetes, T. pallidum and B. burgdorferi, finds that the lipid biosynthesis pathway is absent from the spirochetes, and that the spirochetes distribute flux disproportionately through the glycolytic pathway instead of the NADPH-providing pentose phosphate pathway.
Abstract: We perform a comprehensive genome analysis on two spirochetes, T. pallidum and B. burgdorferi. First, we focus on the occurrence of protein structures in these organisms. We find that there are only a few spirochete-specific folds, relative to those in other types of bacteria. The most common fold, by far, in the spirochetes is the P-loop NTP hydrolase, followed by the TIM barrel. These folds also happen to be amongst the most multifunctional of the known folds. We also survey the membrane-protein structures in T. pallidum and find a notable large family with twelve transmembrane (TM) helices, reflecting the prevalence of 12-TM transporters in bacteria. Then we move to analysis of the metabolic pathways and overall metabolism in the spirochetes, using the metabolic flux-balancing method. We find that the lipid biosynthesis pathway is absent from the spirochetes. This strongly limits the degree to which these organisms can metabolize NADPH. In turn, we find that the spirochetes distribute flux disproportionately through the glycolytic pathway instead of the NADPH-providing pentose phosphate pathway. Further information is available at http://bioinfo.mbb.yale.edu

01 Jan 2000
TL;DR: An initial estimate of the size, distribution and characteristics of the pseudogene population in the Caenorhabditis elegans genome is derived by performing a survey in ‘molecular archaeology’, indicating a highly dynamic genome.
Abstract: Pseudogenes are non-functioning copies of genes in genomic DNA, which may either result from reverse transcription from a mRNA transcript (processed pseudogenes) or from gene duplication and subsequent disablement (non-processed pseudogenes). As pseudogenes are apparently ‘dead’, they usually have a variety of obvious disablements (e.g. insertions, deletions, frameshifts and truncations) relative to their functioning homologues. We have derived an initial estimate of the size, distribution and characteristics of the pseudogene population in the Caenorhabditis elegans genome, performing a survey in ‘molecular archaeology’. Corresponding to the 18,576 annotated proteins in the worm (i.e., in Wormpep18), we have found an estimated total of 2,168 pseudogenes, about one for every eight genes. Few of these appear to be processed. Details of our pseudogene assignments are available from http://bioinfo.mbb.yale.edu/genome/worm/pseudogene. The population of pseudogenes differs significantly from that of genes in a number of respects: (i) pseudogenes are distributed unevenly across the genome relative to genes, with a disproportionate number on chromosome IV; (ii) the density of pseudogenes is higher on the arms of the chromosomes; (iii) the amino-acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, with enrichment of Phe, Ile, Leu and Lys, and depletion of Asp, Ala, Glu and Gly relative to the worm proteome; and (iv) the most common protein folds and families differ somewhat between genes and pseudogenes: whereas the most common fold found in the worm proteome is the immunoglobulin fold, the most common ‘pseudofold’ is the C-type lectin. In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are in fact a number of families associated with large

Journal ArticleDOI
02 Jun 2000-Science
TL;DR: There is, however, a third approach for annotating the human genome that is, in a sense, already extant: extend the capabilities of the biological science literature.
Abstract: The News article “Are sequencers ready to ‘annotate’ the human genome?” by Elizabeth Pennisi (special issue on the Drosophila Genome, 24 Mar., p. 2183) is especially timely and provocative. Pennisi mentions two ideas: a small group gathering at a centralized annotation jamboree, or a distributed, Web-based system that would allow anyone to contribute annotations with a “smart browser” that would merge all efforts. I favor the essence of the second proposal because it provides a more democratic and more “biological” approach to an all-important problem. There is, however, a third approach for annotating the human genome (providing at least the putative start, stop, and structure of each gene) that is, in a sense, already extant: extend the capabilities of the biological science literature. The current journal system is decentralized, yet most research articles adhere to common standards that make them ideal for annotation: (i) Each article associates a bit of annotation with a distinct time and place and with specific, responsible parties. (ii) Attentive scholarly referencing and footnoting provide a way to connect bits of annotation and allow for continuous “updates.” (iii) Peer review and editing provide a proven quality-control mechanism. (iv) Publication is an established indicator of scientific productivity; consequently, scientists already have an incentive to provide the information, whereas database submissions are often regarded as a chore. The main drawback of current journal article formats is that they are not very “computer-parseable,” or suitable for bulk annotation of thousands of genes. However, by adding sections of highly structured text to each article (that is, extended keywords and using a controlled vocabulary) and linking subparts of an article to relevant database identifiers, one can envision how a “literature annotation standard” could readily be interpreted by computers. Furthermore, if an article could be linked to a large “supplementary materials” data file with simple annotations for many genes (for example, lists of all the membrane proteins in the Caenorhabditis elegans genome), one would have a mechanism for bulk annotation. Further standardization could be achieved if the article described defined ways in which the data file might be updated over time and if the supplementary materials were refereed and evaluated with the text of the article.