Showing papers by "Mark Gerstein" published in 2002


Journal ArticleDOI
25 Jul 2002-Nature
TL;DR: It is shown that previously known and new genes are necessary for optimal growth under six well-studied conditions: high salt, sorbitol, galactose, pH 8, minimal medium and nystatin treatment. Less than 7% of genes that exhibit a significant increase in messenger RNA expression are also required for optimal growth in four of the tested conditions.
Abstract: Determining the effect of gene deletion is a fundamental approach to understanding gene function. Conventional genetic screens exhibit biases, and genes contributing to a phenotype are often missed. We systematically constructed a nearly complete collection of gene-deletion mutants (96% of annotated open reading frames, or ORFs) of the yeast Saccharomyces cerevisiae. DNA sequences dubbed 'molecular bar codes' uniquely identify each strain, enabling their growth to be analysed in parallel and the fitness contribution of each gene to be quantitatively assessed by hybridization to high-density oligonucleotide arrays. We show that previously known and new genes are necessary for optimal growth under six well-studied conditions: high salt, sorbitol, galactose, pH 8, minimal medium and nystatin treatment. Less than 7% of genes that exhibit a significant increase in messenger RNA expression are also required for optimal growth in four of the tested conditions. Our results validate the yeast gene-deletion collection as a valuable resource for functional genomics.

4,328 citations


PatentDOI
13 May 2002-Science
TL;DR: The invention relates to methods for using proteome chips to systematically assay all protein interactions in a species in a high-throughput manner, and to methods for making protein arrays by attaching double-tagged fusion proteins to a solid support.
Abstract: The present invention relates to proteome chips comprising arrays having a large proportion of all proteins expressed in a single species. The invention also relates to methods for making proteome chips. The invention also relates to methods for using proteome chips to systematically assay all protein interactions in a species in a high-throughput manner. The present invention also relates to methods for making and purifying eukaryotic proteins in a high-density array format. The invention also relates to methods for making protein arrays by attaching double-tagged fusion proteins to a solid support. The invention also relates to a method for identifying whether a signal is positive.

1,967 citations


Journal ArticleDOI
TL;DR: This study reports the first proteome-scale analysis of protein localization within any eukaryote, and presents experimentally derived localization data for 955 proteins of previously unknown function: nearly half of all functionally uncharacterized proteins in yeast.
Abstract: A global understanding of the molecular mechanisms underpinning cell biology necessitates an understanding not only of an organism's genome but also of the protein complement encoded within this genome (the proteome). In the past, data regarding an organism's proteome have typically been accumulated piecemeal through studies of a single protein or cell pathway. Genomic methodologies have altered this paradigm: a variety of approaches are now in place by which proteins may be directly analyzed on a proteome-wide scale. Chromatography-coupled mass spectrometry (Gygi et al. 1999; Washburn et al. 2001), large-scale two-hybrid screens (Uetz et al. 2000; Ito et al. 2001; Tong et al. 2002), immunoprecipitation/mass spectrometric analysis of protein complexes (Gavin et al. 2002; Ho et al. 2002), and protein microarray technologies (MacBeath and Schreiber 2000; Zhu et al. 2000, 2001) are yielding unprecedented quantities of protein data. Recent genomic techniques combining microarray technologies with either chromatin immunoprecipitation (Ren et al. 2000; Iyer et al. 2001) or targeted DNA methylation (van Steensel et al. 2001) have been used to globally map binding sites of chromosomal proteins in vivo. Initiatives are even underway to automate and industrialize processes by which protein structures may be solved, potentially providing a library of structural data from which homologous proteins may be modeled (Burley 2000; Montelione 2001). Although these approaches promise a wealth of information, many fundamental proteomic data sets remain uncataloged. Notably, the subcellular distribution of proteins within any single eukaryotic proteome has never been extensively examined, despite the usefulness and importance of these data. Protein localization is assumed to be a strong indicator of gene function. 
Localization data are also useful as a means of evaluating protein information inferred from genetic data (e.g., supporting or refuting putative protein interactions suggested from two-hybrid analysis; Ito et al. 2001). Furthermore, the subcellular localization of a protein can often reveal its mechanism of action. To determine the subcellular localization of a protein, its corresponding gene is typically either fused to a reporter or tagged with an epitope. Reporters and epitope tags are fused routinely to either the N or C termini of target genes, a choice that can be critical in obtaining accurate localization data. Organelle-specific targeting signals (e.g., mitochondrial targeting peptides and nuclear localization signals) are often located at the N terminus (Silver 1991); N-terminal reporter fusions may disrupt these sequences, resulting in anomalous protein localizations. In other cases, C-terminal sequences may be important for proper function and regulation, as recently shown from analysis of the yeast γ-tubulin-like protein Tub4p (Vogel et al. 2001). Gene copy number can also have an impact on the accuracy with which a protein is localized; overexpressed protein products may saturate intracellular transport mechanisms, potentially producing an aberrant subcellular protein distribution. In other cases, weakly expressed single-copy genes may not yield sufficient protein to be visualized, particularly by fluorescence microscopy. The effects of copy number and reporter/tag orientation on protein localization, however, have never been studied in a large data set. To date, few studies have characterized protein localization on a large scale, primarily because few high-throughput methods exist by which reporter fusions or epitope-tagged proteins can be generated and subsequently localized. Typically, systematic approaches have been used to construct a limited number of chimeric reporter fusions applicable to pilot localization studies. 
For example, >100 human cDNAs have been cloned as N- and C-terminal gene fusions to spectral variants of green fluorescent protein (GFP) as a means of examining the subcellular localization of these proteins in living cells (Simpson et al. 2000). Thus far, the majority of localization studies have been undertaken in yeast, owing primarily to the fidelity of homologous recombination in Saccharomyces cerevisiae and the concomitant ease with which integrated reporter gene fusions can be generated. As part of a pilot study in S. cerevisiae, Niedenthal et al. (1996) constructed GFP reporter fusions to three unknown open reading frames (ORFs) from yeast Chromosome XIV and subsequently localized these chimeric GFP-fusion proteins by fluorescence microscopy. In addition to directed cloning methods, strains suitable for localization analysis may be generated through random approaches. Recently, a plasmid-based GFP-fusion library of Schizosaccharomyces pombe DNA was constructed by fusing random fragments of genomic DNA upstream of GFP-coding sequence. Fission yeast cells transformed with this library were subsequently screened for GFP fluorescence, and 250 independent gene products were localized (Ding et al. 2000). In S. cerevisiae, transposon-based methods have been used to generate random lacZ gene fusions (Burns et al. 1994) and epitope-tagged alleles (Ross-MacDonald et al. 1999) for subsequent immunolocalization. Although these transposon-based studies have resulted in the localization of ∼300 yeast proteins, the majority of the S. cerevisiae proteome has remained uncharacterized in regards to its subcellular distribution. To address this deficiency, we have undertaken the largest analysis to date of protein localization in yeast. Employing high-throughput methods of epitope-tagging and immunofluorescence analysis, our study defines the subcellular localization of 2744 proteins. 
By integrating these localization data with those previously published, we identify the subcellular localization of >3300 yeast proteins, 55% of the proteome. Building on these data, we have applied a Bayesian system to estimate the intracellular distribution of all 6100 yeast proteins and have further characterized a subset of nuclear proteins both by immunolocalization on surface spread chromosomal preparations and by phenotypic analysis. In total, our findings provide a wealth of insight into protein function, while formally corroborating an expected link between protein function and localization. Furthermore, this study provides experimentally derived localization data for nearly 1000 proteins of previously unknown function, thereby providing, at minimum, a starting point for informed analysis of this previously uncharacterized segment of the proteome.

722 citations


Journal ArticleDOI
TL;DR: The relationship of protein-protein interactions with mRNA expression levels, by integrating a variety of data sources for yeast, is investigated, finding that permanent complexes, such as the ribosome and proteasome, have a particularly strong relationship with expression, while transient ones do not.
Abstract: We investigate the relationship of protein-protein interactions with mRNA expression levels, by integrating a variety of data sources for yeast. We focus on known protein complexes that have clearly defined interactions between their subunits. We find that subunits of the same protein complex show significant coexpression, both in terms of similarities of absolute mRNA levels and expression profiles, e.g., we can often see subunits of a complex having correlated patterns of expression over a time course. We classify the yeast protein complexes as either permanent or transient, with permanent ones being maintained through most cellular conditions. We find that, generally, permanent complexes, such as the ribosome and proteasome, have a particularly strong relationship with expression, while transient ones do not. However, we note that several transient complexes, such as the RNA polymerase II holoenzyme and the replication complex, can be subdivided into smaller permanent ones, which do have a strong relationship to gene expression. We also investigated the interactions in aggregated, genome-wide data sets, such as the comprehensive yeast two-hybrid experiments, and found them to have only a weak relationship with gene expression, similar to that of transient complexes. (Further details on genecensus.org/expression/interactions and bioinfo.mbb.yale.edu/expression/interactions.)
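The coexpression of subunits described above can be illustrated with a minimal sketch. It assumes that coexpression is scored as the mean pairwise Pearson correlation of expression profiles within one complex; that scoring choice, the function names, and the toy numbers are all illustrative assumptions, not the authors' actual procedure or data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_coexpression(profiles):
    """Average Pearson correlation over all subunit pairs of one complex."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical time-course profiles (made-up numbers, not data from the paper).
toy_complex = {
    "subunit_a": [1.0, 2.0, 3.0, 4.0],
    "subunit_b": [2.0, 4.0, 6.0, 8.0],
    "subunit_c": [1.5, 2.5, 3.5, 4.5],
}
score = mean_pairwise_coexpression(toy_complex)
```

Under this assumed scoring, a permanent complex would be expected to give a score close to 1 and a transient one a score close to 0.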

667 citations


Journal ArticleDOI
TL;DR: To elucidate the transcriptional circuitry that regulates the G1-to-S-phase progression, these factors were epitope-tagged and their binding targets were identified by chIp-chip analysis. The results indicate that a complex network of transcription factors coordinates the diverse activities that initiate a new cell cycle.
Abstract: In the yeast Saccharomyces cerevisiae, SBF (Swi4–Swi6 cell cycle box binding factor) and MBF (MluI binding factor) are the major transcription factors regulating the START of the cell cycle, a time just before DNA replication, bud growth initiation, and spindle pole body (SPB) duplication. These two factors bind to the promoters of 235 genes, but bind less than a quarter of the promoters upstream of genes with peak transcript levels at the G1 phase of the cell cycle. Several functional categories, which are known to be crucial for G1/S events, such as SPB duplication/migration and DNA synthesis, are under-represented in the list of SBF and MBF gene targets. SBF binds the promoters of several other transcription factors, including HCM1, PLM2, POG1, TOS4, TOS8, TYE7, YAP5, YHP1, and YOX1. Here, we demonstrate that these factors are targets of SBF using an independent assay. To further elucidate the transcriptional circuitry that regulates the G1-to-S-phase progression, these factors were epitope-tagged and their binding targets were identified by chIp–chip analysis. These factors bind the promoters of genes with roles in G1/S events including DNA replication, bud growth, and spindle pole complex formation, as well as the general activities of mitochondrial function, transcription, and protein synthesis. Although functional overlap exists between these factors and MBF and SBF, each of these factors has distinct functional roles. Most of these factors bind the promoters of other transcription factors known to be cell cycle regulated or known to be important for cell cycle progression and differentiation processes indicating that a complex network of transcription factors coordinates the diverse activities that initiate a new cell cycle.

282 citations


Journal ArticleDOI
01 Sep 2002-Proteins
TL;DR: This work investigated protein motions using normal modes within a database framework, determining on a large sample the degree to which normal modes anticipate the direction of the observed motion and were useful for motions classification, and identified a new statistic, mode concentration, related to the mathematical concept of information content.
Abstract: We investigated protein motions using normal modes within a database framework, determining on a large sample the degree to which normal modes anticipate the direction of the observed motion and were useful for motions classification. As a starting point for our analysis, we identified a large number of examples of protein flexibility from a comprehensive set of structural alignments of the proteins in the PDB. Each example consisted of a pair of proteins that were considerably different in structure given their sequence similarity. On each pair, we performed geometric comparisons and adiabatic-mapping interpolations in a high-throughput pipeline, arriving at a final list of 3,814 putative motions and standardized statistics for each. We then computed the normal modes of each motion in this list, determining the linear combination of modes that best approximated the direction of the observed motion. We integrated our new motions and normal mode calculations in the Macromolecular Motions Database, through a new ranking interface at http://molmovdb.org. Based on the normal mode calculations and the interpolations, we identified a new statistic, mode concentration, related to the mathematical concept of information content, which describes the degree to which the direction of the observed motion can be summarized by a few modes. Using this statistic, we were able to determine the fraction of the 3,814 motions where one could anticipate the direction of the actual motion from only a few modes. We also investigated mode concentration in comparison to related statistics on combinations of normal modes and correlated it with quantities characterizing protein flexibility (e.g., maximum backbone displacement or number of mobile atoms). Finally, we evaluated the ability of mode concentration to automatically classify motions into a variety of simple categories (e.g., whether or not they are "fragment-like"), in comparison to motion statistics. 
This involved the application of decision trees and feature selection (particular machine-learning techniques) to training and testing sets derived from merging the "list" of motions with manually classified ones.
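The idea of summarizing an observed motion by a few modes can be sketched as follows. This is only an illustration under stated assumptions: the modes are taken to be orthonormal vectors, and "concentration" is computed here as one minus the normalized Shannon entropy of the squared projection weights. The abstract does not give the exact definition of mode concentration, so the formula and the names `mode_weights` and `concentration` are hypothetical.

```python
from math import log


def mode_weights(displacement, modes):
    """Fraction of the squared displacement captured by each (orthonormal) mode."""
    coeffs = [sum(d * m for d, m in zip(displacement, mode)) for mode in modes]
    total = sum(c * c for c in coeffs)
    return [c * c / total for c in coeffs]


def concentration(weights):
    """Assumed statistic: 1 minus normalized Shannon entropy of the weights.

    Returns 1.0 when a single mode carries the whole motion and 0.0 when
    the motion is spread evenly over all modes.
    """
    entropy = -sum(w * log(w) for w in weights if w > 0)
    return 1.0 - entropy / log(len(weights))


# Toy orthonormal "modes" in three dimensions (illustrative only).
toy_modes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

single_mode = concentration(mode_weights([2.0, 0.0, 0.0], toy_modes))
spread_out = concentration(mode_weights([1.0, 1.0, 1.0], toy_modes))
```

A value near 1 would correspond to a motion whose direction can be anticipated from very few modes, which is the case the abstract singles out.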

278 citations


Journal ArticleDOI
TL;DR: This article showed that a significant fraction of the protein-protein interactions in genome-wide datasets, as well as many of the individual interactions reported in the literature, are inconsistent with the known 3D structures of three recent complexes (RNA polymerase II, Arp2/3 and the proteasome).

270 citations


Journal ArticleDOI
Paul M. Harrison, Anuj Kumar, Ning Lang, Michael Snyder, Mark Gerstein
TL;DR: The problems in defining the extent of the proteomes for completely sequenced eukaryotic organisms, focusing on yeast, worm, fly and human, are discussed, and the current estimates for the numbers of human genes are surveyed and a range for the size of the human proteome is estimated.
Abstract: We discuss the problems in defining the extent of the proteomes for completely sequenced eukaryotic organisms (i.e. the total number of protein-coding sequences), focusing on yeast, worm, fly and human. (i) Six years after completion of its genome sequence, the true size of the yeast proteome is still not defined. New small genes are still being discovered, and a large number of existing annotations are being called into question, with these questionable ORFs (qORFs) comprising up to one-fifth of the 'current' proteome. We discuss these in the context of an ideal genome-annotation strategy that considers the proteome as a rigorously defined subset of all possible coding sequences ('the orfome'). (ii) Despite the greater apparent complexity of the fly (more cells, more complex physiology, longer lifespan), the nematode worm appears to have more genes. To explain this, we compare the annotated proteomes of worm and fly, relating to both genome-annotation and genome evolution issues. (iii) The unexpectedly small size of the gene complement estimated for the complete human genome provoked much public debate about the nature of biological complexity. However, in the first instance, for the human genome, the relationship between gene number and proteome size is far from simple. We survey the current estimates for the numbers of human genes and, from this, we estimate a range for the size of the human proteome. The determination of this is substantially hampered by the unknown extent of the cohort of pseudogenes ('dead' genes), in combination with the prevalence of alternative splicing. (Further information relating to yeast is available at http://genecensus.org/yeast/orfome)

217 citations


Journal ArticleDOI
TL;DR: The large-scale distribution of RP pseudogenes throughout the genome appears to result, chiefly, from random insertions with the numbers on each chromosome, consequently, proportional to its size, with the highest density in GC-intermediate regions of the genome.
Abstract: Mammals have 79 ribosomal proteins (RP). Using a systematic procedure based on sequence-homology, we have comprehensively identified pseudogenes of these proteins in the human genome. Our assignments are available at http://www.pseudogene.org or http://bioinfo.mbb.yale.edu/genome/pseudogene. In total, we found 2090 processed pseudogenes and 16 duplications of RP genes. In relation to the matching parent protein, each of the processed pseudogenes has an average relative sequence length of 97% and an average sequence identity of 76%. A small number (258) of them do not contain obvious disablements (stop codons or frameshifts) and, therefore, could be mistaken as functional genes, and 178 are disrupted by one or more repetitive elements. On average, processed pseudogenes have a longer truncation at the 5' end than the 3' end, consistent with the target-primed-reverse-transcription (TPRT) mechanism. Interestingly, on chromosome 16, an RPL26 processed pseudogene was found in the intron region of a functional RPS2 gene. The large-scale distribution of RP pseudogenes throughout the genome appears to result, chiefly, from random insertions with the numbers on each chromosome, consequently, proportional to its size. In contrast to RP genes, the RP pseudogenes have the highest density in GC-intermediate regions (41%-46%) of the genome, with the density pattern being between that of LINEs and Alus. This can be explained by a negative selection theory as we observed that GC-rich RP pseudogenes decay faster in GC-poor regions. Also, we observed a correlation between the number of processed pseudogenes and the GC content of the associated functional gene, i.e., relatively GC-poor RPs have more processed pseudogenes. This ranges from 145 pseudogenes for RPL21 down to 3 pseudogenes for RPL14. We were able to date the RP pseudogenes based on their sequence divergence from present-day RP genes, finding an age distribution similar to that for Alus. 
The distribution is consistent with a decline in retrotransposition activity in the hominid lineage during the last 40 Myr. We discuss the implications for retrotransposon stability and genome dynamics based on these new findings.

212 citations


Journal ArticleDOI
TL;DR: Substantial agreement was found between gene expression and protein abundance, in terms of the enrichment of structural and functional categories, which reflects the way broad categories collect many individual measurements into simple, robust averages.
Abstract: Motivation: Protein abundance is related to mRNA expression through many different cellular processes. Up to now, there have been conflicting results on how correlated the levels of these two quantities are. Given that expression and abundance data are significantly more complex and noisy than the underlying genomic sequence information, it is reasonable to simplify and average them in terms of broad proteomic categories and features (e.g. functions or secondary structures), for understanding their relationship. Furthermore, it will be essential to integrate, within a common framework, the results of many varied experiments by different investigators. This will allow one to survey the characteristics of highly expressed genes and proteins. Results: To this end, we outline a formalism for merging and scaling many different gene expression and protein abundance data sets into a comprehensive reference set, and we develop an approach for analyzing this in terms of broad categories, such as composition, function, structure and localization. As the various experiments are not always done using the same set of genes, sampling bias becomes a central issue, and our formalism is designed to explicitly show this and correct for it. We apply our formalism to the currently available gene expression and protein abundance data for yeast. Overall, we found substantial agreement between gene expression and protein abundance, in terms of the enrichment of structural and functional categories. This agreement, which was considerably greater than the simple correlation between these quantities for individual genes, reflects the way broad categories collect many individual measurements into simple, robust averages. In particular, we found that in comparison to the population of genes in the yeast genome, the cellular populations of transcripts and proteins (weighted by their respective abundances, the transcriptome and what we dub the translatome) were both enriched in: (i) the small amino acids Val, Gly, and Ala; (ii) low molecular weight proteins; (iii) helices and sheets relative to coils; (iv) cytoplasmic proteins relative to nuclear ones; and (v) proteins involved in ‘protein synthesis,’ ‘cell structure,’ and ‘energy production.’

196 citations


Journal ArticleDOI
TL;DR: The main populations and clusters of pseudogenes on chromosomes 21 and 22 are determined, and it is found that the chromosome 22 pseudogene population is dominated by immunoglobulin segments, which have a greater rate of disablement per amino acid than the other pseudogene populations and are also substantially more diverged.
Abstract: Pseudogenes are disabled copies of genes that do not produce a functional, full-length copy of a protein (Mighell et al. 2000; Vanin 1985). They are of two types: First, processed pseudogenes result from reverse transcription of messenger RNA transcripts followed by reintegration into genomic DNA (presumably in germ-line cells) and subsequent degradation with disablements (premature stop codons and frameshifts) (Vanin 1985). Second, nonprocessed pseudogenes result from duplication of a gene, followed by an initial disablement if the duplicated copy is not “useful” (Mighell et al. 2000). These then also accumulate further coding disablements. The extent of the pseudogene population in the human genome is unclear. Estimates for the number of human genes range from ∼22,000 to ∼75,000 (Crollius et al. 2000; Ewing and Green 2000; Lander et al. 2001; Venter et al. 2001; Wright et al. 2001). From previous reports, it is thought that up to 22% of these gene predictions may be pseudogenic (Lander et al. 2001; Yeh et al. 2001). It is important to characterize the human processed and nonprocessed pseudogene populations as their existence interferes with gene identification and prediction (particularly nonprocessed pseudogenes or individual pseudogenic exons). They are also an important resource for the study of the evolution of protein families (see, e.g., studies on the human olfactory receptor subgenome [e.g. Glusman et al. 2001]). Here, we have performed a detailed analysis of the pseudogene populations of human chromosomes 21 and 22, which have been sequenced contiguously to high quality. This is similar in spirit to previous surveys we have performed on pseudogenes and other genomic features in other organisms (Harrison et al. 2001; Gerstein 1997, 1998; Hegyi and Gerstein 1999). We have examined the main populations and clusters of pseudogenes for the two chromosomes. 
Patterns of distribution of both nonprocessed and processed pseudogenes indicate the existence of pseudogenic hot-spots in the human genome. In addition, we have estimated the total numbers and proportions of processed and nonprocessed pseudogenes in the whole human genome.

Journal ArticleDOI
TL;DR: Protein families can be used to understand many aspects of genomes, both their "live" and their "dead" parts, and there is great redundancy in proteomes, a fact linked to the large number of dispensable genes for each organism and the small size of the minimal, indispensable sub-proteome.

Journal ArticleDOI
TL;DR: The hematopoietic lineage-specific transcription factor GATA-1 is implicated in regulating the expression of the erythroid-specific genes including the genes of the β-globin locus and binds in a region encompassing the HS2 core element, as was previously identified.
Abstract: The expression of the β-like globin genes is intricately regulated by a series of both general and tissue-restricted transcription factors. The hematopoietic lineage-specific transcription factor GATA-1 is important for erythroid differentiation and has been implicated in regulating the expression of the erythroid-specific genes including the genes of the β-globin locus. In the human erythroleukemic K562 cell line, only one DNA region has been identified previously as a putative site of GATA-1 interaction by in vivo footprinting studies. We mapped GATA-1 binding throughout the β-globin locus by using chIp-chip analysis of K562 cells. We found that GATA-1 binds in a region encompassing the HS2 core element, as was previously identified, and an additional region of GATA-1 binding upstream of the γG gene. This approach will be of general utility for mapping transcription factor binding sites within the β-globin locus and throughout the genome.

Journal ArticleDOI
Nicholas M. Luscombe, Jiang Qian, Zhaolei Zhang, Ted Johnson, Mark Gerstein
TL;DR: Power-law behavior provides a concise mathematical description of an important biological feature: the sheer dominance of a few members over the overall population as genomes evolved to their current state.
Abstract: Background: The sequencing of genomes provides us with an inventory of the 'molecular parts' in nature, such as protein families and folds, and their functions in living organisms. Through the analysis of such inventories, it has been shown that different genomes have very different usage of parts; for example, the common folds in the worm are very different from those in Escherichia coli.
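As a rough illustration of the power-law behavior mentioned in the summary, the sketch below estimates an exponent by ordinary least squares on log-transformed values. It assumes the common form count ∝ size^(-b); the truncated abstract does not state the functional form or the fitting method, and the function name and toy numbers are hypothetical.

```python
from math import log


def fit_power_law_exponent(sizes, counts):
    """Least-squares slope of log(count) against log(size)."""
    log_x = [log(s) for s in sizes]
    log_y = [log(c) for c in counts]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    denominator = sum((x - mean_x) ** 2 for x in log_x)
    return numerator / denominator


# Made-up occurrence counts generated from count = 100 * size**-2.
sizes = [1, 2, 4, 8]
counts = [100.0, 25.0, 6.25, 1.5625]
slope = fit_power_law_exponent(sizes, counts)
```

A steeply negative slope corresponds to the dominance of a few members over the overall population: most parts occur rarely, while a handful occur very often.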

Journal ArticleDOI
TL;DR: This study presents, for the first time, the comprehensive application of supervised neural networks (SNNs) to functional annotation, as an alternative to support vector machines (SVMs), and indicates that classification systems with a lower Borges effect are better suited to machine learning.
Abstract: DNA array technology (Schena et al. 1995; Shalon et al. 1996) allows for the simultaneous recording of thousands of gene expression levels and has opened new ways of looking at organisms on a genome-wide scale. It is now possible to study genomic patterns of gene expression in prokaryotes (Arfin et al. 2000) or in simple eukaryotes like yeast (Eisen et al. 1998) and Caenorhabditis elegans (Hill et al. 2000), whereas in higher organisms, like humans, tens of thousands of genes can be monitored (Zhang et al. 1997). DNA array experiments primarily involve the measurement of thousands of gene expression levels under different conditions. The data can be clustered along these two dimensions for two purposes: either (1) the classification of conditions (tissues, phenotypes, etc.) in terms of expression values, regarded as their molecular signatures, or (2) conversely, the classification of genes with correlated expression patterns, to explore shared functions or regulation. The classification of conditions has been investigated in several studies. For instance, aggregative hierarchical clustering has been used extensively for the molecular classification of leukemia (Golub et al. 1999), colon cancer (Alon et al. 1999), breast cancer (Perou et al. 1999), and lymphoma (Alizadeh et al. 2000), to cite just a few cases. Supervised methods can be used if there is some prior knowledge about the classes to be analysed. Thus, support vector machines (Furey et al. 2000), neural networks (Khan et al. 2001), and pattern discovery methods (Califano et al. 2000) have been applied to the molecular classification of different cancer tissues. Note that a typical expression data set usually contains several thousands of genes but perhaps <100 conditions. Thus, the classification of conditions involves only a few items (conditions) to be classified, but a high number of variables (genes).
We face an opposite data structure for the second type of clustering, the classification of genes: Many items (genes) need to be classified using only a few variables (conditions). For the classification of genes, an arsenal of methods has been used, including aggregative hierarchical clustering (Sneath and Sokal 1973; Eisen et al. 1998), K-means (Tavazoie et al. 1999), singular value decomposition (Alter et al. 2000), and other methods (Ben-Dor et al. 1999; Heyer et al. 1999). Unsupervised neural networks like self-organizing maps (Kohonen 1997; Tamayo et al. 1999; Toronen et al. 1999), or their hierarchical version, the self-organizing tree algorithm (Herrero et al. 2001), have also been used to obtain clusters of co-expressing genes. Despite the wealth of data on gene properties (such as function, subcellular localization, protein interactions, presence in pathways, or cellular complexes), supervised approaches that take this prior information into account have been applied only scarcely. Brown et al. (2000), using expression data from yeast (Eisen et al. 1998), concluded that support vector machines (SVMs; Vapnik 1998) were the most efficient method for identifying sets of genes with common functions, among several machine-learning techniques they compared. However, supervised neural networks (SNNs; Bishop 1995), the method we use here, were not included in this comparison. SNNs are computer-based algorithms inspired by the structure and behavior of neurons in the human brain. Similar to SVMs, SNNs are capable of extracting features of classes in a training process in order to learn how to identify them. In particular, this pattern-recognition process is achieved in perceptrons (Rosenblatt 1958) by adjusting parameters of the SNN in a process of error back-propagation and minimization through learning from experience.
They can be calibrated (trained) using any type of input data, such as gene expression levels from DNA arrays, and the output can be grouped into any given number of categories. Compared to SVMs, they have some potential advantages. SNNs allow for multiple classifications in a single query, whereas SVMs are only designed to bisect the data into two classes (the class to be learned and its complement) and can thus achieve multiple classifications only indirectly and iteratively. (If a gene being queried is assigned to the complement of the original class, the complement needs to be bisected again with respect to the next class, and so on, until the gene is assigned or remains unclassified in the final complement. This procedure is dependent on the chosen order of the classes.) In contrast, multilayer perceptron-based SNN schemes provide a more direct method in that they can be tailored to perform multiset classification in one run, with the output consisting of as many units as classes of interest. A further advantage is that the parameters of the SNN (weights) can give relevant information on the relative importance of each condition in the learning of the classes. Our goals in this paper are twofold. First, we explore the ability of supervised neural networks to learn the gene expression signatures of classes both in the binary case (i.e., for a class and its complement) and in the multiple-class case. Second, we systematically explore how well these classes can be learned. Similar to Brown et al., we used the classes from the Munich Information Center for Protein Sequences (MIPS) functional catalog, but unlike Brown et al., who analyzed only five of them, we investigated 96 classes of the MIPS functional catalog (Mewes et al. 2000). Our results show that even though some classes can be learned with a low rate of false positives and false negatives, other classes can hardly be learned at all. In fact, we get >60% false negatives for 92% of the functional classes. 
A priori, one could suspect that this is caused by a poor performance of the neural network learning method. There are, however, a number of reasons that can affect learning performance. First, the output of DNA array technology can have a poor signal-to-noise ratio, and poor learning can be a result of the noise eclipsing the signal. In addition to this, we identified three reasons for the poor learning performance that are purely related to the biology underlying the data rather than to the technical aspects of machine learning. They are (1) class size, (2) heterogeneity of the classes, and (3) the high degree of intersection among functional classes. The MIPS catalog, which has been compiled based on the extensive biological knowledge in the literature, is indeed highly interconnected. This of course reflects the fact that cellular processes do not represent isolated modules. Thus, if the neural network falsely classifies a gene as a positive, this is not necessarily a failure of the learning scheme, but often represents a gene participating in biological processes closely associated with the original class to be learned. We substantiate these claims by studying the intersection structure of functional classes. For a given classification scheme, such as the MIPS catalog, we define what we call the “Borges effect” and introduce two numerical indices that give a rough measure of the overlapping structure of a given class. We conjecture that these indices determine the learning performance of this class. Finally, in order to test the proposition that false classifications are not necessarily errors of the learning process, we introduce an iterative procedure in which, starting from a single MIPS class, the false positives of iteration i are added as true positives for iteration i + 1. 
If the false positives were really caused by the learning process, one would expect the rate of false positives to remain approximately unchanged at each iteration, eventually producing a class that comprises almost all genes. If this were not the case, one would expect the false-positive rate to decrease and the procedure to converge to a small set of genes. Indeed, we observe the latter scenario. We have let the iterations run until the rate of false positives reached a low preassigned threshold. We show that the new set of genes produced after a few iteration steps can be learned with considerably low rates of false positives and false negatives. This finding is biologically meaningful in that the new gene set contains genes with functional classes that are related to the original class through interacting cellular processes. We shall argue that this method of iteratively picking up signatures related to an original class might allow for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle (TCA).
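
The iterative procedure described in this abstract (false positives of iteration i become true positives for iteration i + 1, until the false-positive rate drops below a preassigned threshold) can be illustrated with a minimal sketch. It is not the authors' implementation: the supervised neural network is replaced here by a hypothetical nearest-centroid rule, and the names `expand_class`, `radius` and `fp_threshold`, the way the false-positive rate is computed, and the toy expression profiles are all assumptions made for illustration.

```python
from math import dist


def centroid(vectors):
    """Mean profile of a list of equal-length expression vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def expand_class(profiles, seed, radius=1.0, fp_threshold=0.05, max_iter=20):
    """Iteratively grow a gene class from a seed set.

    profiles: dict mapping gene name -> expression vector
    seed:     set of gene names forming the starting class
    A gene counts as 'predicted positive' when its profile lies within
    `radius` of the class centroid. This is a stand-in (assumption) for
    the trained network. False positives found in one iteration are
    added to the class before the next iteration.
    """
    members = set(seed)
    history = []  # false-positive rate observed at each iteration
    for _ in range(max_iter):
        center = centroid([profiles[g] for g in members])
        predicted = {g for g, v in profiles.items() if dist(v, center) <= radius}
        false_pos = predicted - members
        negatives = max(1, len(profiles) - len(members))
        fp_rate = len(false_pos) / negatives
        history.append(fp_rate)
        if fp_rate <= fp_threshold:
            break
        members |= false_pos
    return members, history


# Invented two-condition profiles: a tight group (a, b, c), a nearby gene d,
# and three unrelated genes (x, y, z).
toy_profiles = {
    "a": [0.0, 0.0], "b": [0.2, 0.0], "c": [0.4, 0.0], "d": [0.9, 0.0],
    "x": [5.0, 5.0], "y": [6.0, 5.0], "z": [5.0, 6.0],
}
members, history = expand_class(toy_profiles, {"a", "b"}, radius=0.5)
```

On this toy input the class grows by one gene and then stops, which mirrors the convergence scenario the abstract reports; with a classifier that produced false positives at a constant rate, `members` would instead keep growing.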

Journal ArticleDOI
TL;DR: The results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards non-coding DNA, and it is proposed that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as in assessing the rate at which intergenic DNA accumulates mutations.
Abstract: Based on searches for disabled homologs to known proteins, we have identified a large population of pseudogenes in four sequenced eukaryotic genomes: the worm, yeast, fly and human (chromosomes 21 and 22 only). Each of our nearly 2500 pseudogenes is characterized by one or more disablements mid-domain, such as premature stops and frameshifts. Here, we perform a comprehensive survey of the amino acid and nucleotide composition of these pseudogenes in comparison to that of functional genes and intergenic DNA. We show that pseudogenes invariably have an amino acid composition intermediate between genes and translated intergenic DNA. Although the degree of intermediacy varies among the four organisms, in all cases, it is most evident for amino acid types that differ most in occurrence between genes and intergenic regions. The same intermediacy also applies to codon frequencies, especially in the worm and human. Moreover, the intermediate composition of pseudogenes applies even though the composition of the genes in the four organisms is markedly different, showing a strong correlation with the overall A/T content of the genomic sequence. Pseudogenes can be divided into 'ancient' and 'modern' subsets, based on the level of sequence identity with their closest matching homolog (within the same genome). Modern pseudogenes usually have a much closer sequence composition to genes than ancient pseudogenes. Collectively, our results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards non-coding DNA. Therefore, we propose that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as in assessing the rate at which intergenic DNA accumulates mutations. Our compositional analyses with the interactive viewer are available over the web at http://genecensus.org/pseudogene.
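
One way to make the notion of an "intermediate" composition concrete is sketched below. The abstract does not say how intermediacy was quantified, so the projection-based score, the function names `composition` and `intermediacy`, and the toy sequences are assumptions: the score is 0 when the pseudogene composition equals that of genes and 1 when it equals that of translated intergenic DNA.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq, alphabet=AMINO_ACIDS):
    """Fractional occurrence of each residue type in a sequence."""
    counts = Counter(c for c in seq.upper() if c in alphabet)
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in alphabet}


def intermediacy(gene, pseudo, intergenic):
    """Position of `pseudo` along the line from `gene` (0) to `intergenic` (1).

    Computed by projecting the pseudogene composition vector onto the
    gene-to-intergenic difference vector (an illustrative choice).
    """
    num = sum((pseudo[a] - gene[a]) * (intergenic[a] - gene[a]) for a in gene)
    den = sum((intergenic[a] - gene[a]) ** 2 for a in gene)
    return num / den if den else float("nan")


# Invented toy sequences, chosen so the pseudogene sits exactly halfway.
gene_comp = composition("AAAALLLL")
intergenic_comp = composition("AALLLLLL")
pseudo_comp = composition("AAALLLLL")
score = intermediacy(gene_comp, pseudo_comp, intergenic_comp)
```

Under this definition, a set of 'ancient' pseudogenes that has drifted further would be expected to score closer to 1 than a 'modern' set.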

Journal ArticleDOI
TL;DR: A genome-wide analysis on patterns of the classified polytopic membrane protein families was carried out and the distribution of conserved amino acids and motifs in the transmembrane helix regions in these families were analyzed.
Abstract: Background: Polytopic membrane proteins can be related to each other on the basis of the number of transmembrane helices and sequence similarities. Building on the Pfam classification of protein domain families, and using transmembrane-helix prediction and sequence-similarity searching, we identified a total of 526 well-characterized membrane protein families in 26 recently sequenced genomes. To this we added a clustering of a number of predicted but unclassified membrane proteins, resulting in a total of 637 membrane protein families. Results: Analysis of the occurrence and composition of these families revealed several interesting trends. The number of assigned membrane protein domains has an approximately linear relationship to the total number of open reading frames (ORFs) in the 26 genomes studied. Caenorhabditis elegans is an apparent outlier, because of its high representation of seven-span transmembrane (7-TM) chemoreceptor families. In all genomes, including that of C. elegans, the number of distinct membrane protein families has a logarithmic relation to the number of ORFs. Glycine, proline, and tyrosine locations tend to be conserved in transmembrane regions within families, whereas isoleucine, valine, and methionine locations are relatively mutable. Analysis of motifs in putative transmembrane helices reveals that GxxxG and GxxxxxxG (which can be written GG4 and GG7, respectively; see Materials and methods) are among the most prevalent. Although this was noted in earlier studies, we now find that these motifs are particularly well conserved within families, especially those corresponding to transporters, symporters, and channels. Conclusions: We carried out a genome-wide analysis on patterns of the classified polytopic membrane protein families and analyzed the distribution of conserved amino acids and motifs in the transmembrane helix regions in these families.
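
The GG4 and GG7 shorthand can be read as "two glycines whose positions differ by 4 (or 7)", which matches GxxxG and GxxxxxxG. A minimal sketch of counting such pairs in a helix sequence follows; the function name `count_gg` and the example sequences are invented for illustration, and the paper's own motif-scanning procedure may differ.

```python
import re


def count_gg(helix, spacing):
    """Count glycine pairs whose positions differ by `spacing`.

    spacing=4 corresponds to GxxxG (GG4), spacing=7 to GxxxxxxG (GG7).
    A lookahead is used so that overlapping occurrences are all counted.
    """
    pattern = re.compile(r"G(?=.{%d}G)" % (spacing - 1))
    return len(pattern.findall(helix.upper()))


# Invented helix-like strings: glycines at positions 2, 6 and 10,
# and a second string with glycines exactly seven positions apart.
helix_a = "LIGAVLGAVIGLLV"
helix_b = "GAAAAAAG"
gg4_a = count_gg(helix_a, 4)
gg7_b = count_gg(helix_b, 7)
```

Counting the same motif across aligned family members, rather than in single sequences, would be the natural next step for assessing conservation within a family.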

Journal ArticleDOI
TL;DR: The discovery of 137 previously unappreciated genes in yeast through a widely applicable and highly scalable approach integrating methods of gene-trapping, microarray-based expression analysis, and genome-wide homology searching, which provides an effective supplement to current gene-finding schemes.
Abstract: We report here the discovery of 137 previously unappreciated genes in yeast through a widely applicable and highly scalable approach integrating methods of gene-trapping, microarray-based expression analysis, and genome-wide homology searching. Our approach is a multistep process in which expressed sequences are first trapped using a modified transposon that produces protein fusions to β-galactosidase (β-gal); nonannotated open reading frames (ORFs) translated as β-gal chimeras are selected as a candidate pool of potential genes. To verify expression of these sequences, labeled RNA is hybridized against a microarray of oligonucleotides designed to detect gene transcripts in a strand-specific manner. In complement to this experimental method, novel genes are also identified in silico by homology to previously annotated proteins. As these methods are capable of identifying both short ORFs and antisense ORFs, our approach provides an effective supplement to current gene-finding schemes. In total, the genes discovered using this approach constitute 2% of the yeast genome and represent a wealth of overlooked biology.

Journal ArticleDOI
TL;DR: This paper focuses on the prediction of membership in protein complexes for individual genes, and recruits six different data sources that include expression profiles, interaction data, essentiality and localization information; each is only weakly predictive on its own, but the prediction can be improved by combining all of them.
Abstract: The ultimate goal of functional genomics is to define the function of all the genes in the genome of an organism. A large body of information on the biological roles of genes has been accumulated and aggregated in the past decades of research, both from traditional experiments detailing the role of individual genes and proteins, and from newer experimental strategies that aim to characterize gene function on a genomic scale. It is clear that the goal of functional genomics can only be achieved by integrating information and data sources from the variety of these different experiments. Integration of different data is thus an important challenge for bioinformatics. The integration of different data sources often helps to uncover non-obvious relationships between genes, but there are also two further benefits. First, it is likely that whenever information from multiple independent sources agrees, it should be more valid and reliable. Secondly, by looking at the union of multiple sources, one can cover larger parts of the genome. This is obvious for integrating results from multiple single gene or protein experiments, but also necessary for many of the results from genome-wide experiments since they are often confined to certain (although sizable) subsets of the genome. In this paper, we explore an example of such a data integration procedure. We focus on the prediction of membership in protein complexes for individual genes. For this, we recruit six different data sources that include expression profiles, interaction data, essentiality and localization information. Each of these data sources individually contains some weakly predictive information with respect to protein complexes, but we show how this prediction can be improved by combining all of them. Supplementary information is available at http://bioinfo.mbb.yale.edu/integrate/interactions/.

Journal ArticleDOI
TL;DR: The characteristics of the dORF population suggest the sorts of genes that are likely to fall in and out of usage (and vary in copy number) in a strain-specific way and highlight the role of subtelomeric regions in engendering this diversity.

Journal ArticleDOI
01 Nov 2002-Blood
TL;DR: This is the first study of the relationship between mRNA and protein in terms of simultaneous changes in their levels over multiple time points; it finds a much stronger correlation, consistent with the hypothesis that a substantial proportion of protein change is a consequence of changed mRNA levels, rather than posttranscriptional effects.


Journal ArticleDOI
TL;DR: This work investigates the sensitivity of the volume calculations to a number of factors, and shows how the variation in volumes appears to be clearly related to the quality of the structures analyzed, with higher quality structures giving consistently smaller average volumes with less variance.
Abstract: Motivation: The precise sizes of protein atoms in terms of occupied packing volume are of great importance. We have previously presented standard volumes for protein residues based on calculations with Voronoi-like polyhedra. To understand the applicability and limitations of our set, we investigated, in detail, the sensitivity of the volume calculations to a number of factors: (i) the van der Waals radii set, (ii) the criteria for including buried atoms in the calculations or atom selection, (iii) the method of positioning the dividing plane in polyhedra construction, and (iv) the set of structures used in the averaging. Results: We find that different radii sets have only moderate effects on the distribution and mean of volumes. Atom selection and dividing plane methods cause larger changes in protein atom volumes. More significantly, we show how the variation in volumes appears to be clearly related to the quality of the structures analyzed, with higher quality structures giving consistently smaller average volumes with less variance.

Journal ArticleDOI
01 May 2002-Proteins
TL;DR: A structural genomics analysis of the folds and structural superfamilies in the first 20 completely sequenced genomes by focusing on the patterns of fold usage and trying to identify structural characteristics of typical and atypical folds finds that common folds tend to be more multifunctional and associated with more regular, “symmetrical” structures than the unique ones.
Abstract: We conducted a structural genomics analysis of the folds and structural superfamilies in the first 20 completely sequenced genomes by focusing on the patterns of fold usage and trying to identify structural characteristics of typical and atypical folds. We assigned folds to sequences using PSI-BLAST, run with a systematic protocol to reduce the amount of computational overhead. On average, folds could be assigned to about a fourth of the ORFs in the genomes and about a fifth of the amino acids in the proteomes. More than 80% of all the folds in the SCOP structural classification were identified in one of the 20 organisms, with worm and E. coli having the largest number of distinct folds. Folds are particularly effective at comprehensively measuring levels of gene duplication, because they group together even very remote homologues. Using folds, we find the average level of duplication varies depending on the complexity of the organism, ranging from 2.4 in M. genitalium to 32 for the worm, values significantly higher than those observed based purely on sequence similarity. We rank the common folds in the 20 organisms, finding that the top three are the P-loop NTP hydrolase, the ferredoxin fold, and the TIM-barrel, and discuss in detail the many factors that affect and bias these rankings. We also identify atypical folds that are "unique" to one of the organisms in our study and compare the characteristics of these folds with the most common ones. We find that common folds tend to be more multifunctional and associated with more regular, "symmetrical" structures than the unique ones. In addition, many of the unique folds are associated with proteins involved in cell defense (e.g., toxins). We analyze specific patterns of fold occurrence in the genomes by associating some of them with instances of horizontal transfer and others with gene loss. 
In particular, we find three possible examples of transfer between archaea and bacteria and six between eukarya and bacteria. We make available our detailed results at http://genecensus.org/20.

Journal ArticleDOI
TL;DR: It is found that the small residues glycine and serine contribute more to transmembrane helix–helix interactions in thermophilic organisms, which may result in a tighter packing of the helices allowing more hydrogen bond formation.

Book ChapterDOI
01 Jan 2002
TL;DR: A database of macromolecular motions, which is accessible on the World Wide Web with an entry point at http://bioinfo.yale.edu/MolMovDB, and quantitatively systematize the description of packing through the use of Voronoi polyhedra and Delaunay triangulation.
Abstract: We describe database approaches taken in our lab to the study of protein and nucleic acid motions. We have developed a database of macromolecular motions, which is accessible on the World Wide Web with an entry point at http://bioinfo.mbb.yale.edu/MolMovDB. This attempts to systematize all instances of macromolecular movement for which there is at least some structural information. At present it contains detailed descriptions of more than 100 motions, most of which are of proteins. Protein motions are further classified hierarchically into a limited number of categories, first on the basis of size (distinguishing between fragment, domain, and subunit motions) and then on the basis of packing. Our packing classification divides motions into various categories (shear, hinge, other) depending on whether or not they involve sliding over a continuously maintained and tightly packed interface. We quantitatively systematize the description of packing through the use of Voronoi polyhedra and Delaunay triangulation. In addition to the packing classification, the database provides some indication about the evidence behind each motion (i.e. the type of experimental information or whether the motion is inferred based on structural similarity) and attempts to describe many aspects of a motion in terms of a standardized nomenclature (e.g. the maximum rotation, the residue selection of a fixed core, etc). Currently, we use a standard relational design to implement the database. However, the complexity and heterogeneity of the information kept in the database make it an ideal application for an object-relational approach, and we are moving it in this direction. The database, moreover, incorporates innovative Internet cooperativity features that allow authorized remote experts to serve as database editors. The database also contains plausible representations for motion pathways, derived from restrained 3D interpolation between known endpoint conformations. 
These pathways can be viewed in a variety of movie formats, and the database is associated with a server that can automatically generate these movies from submitted coordinates. Based on the structures in the database we have developed sequence patterns for linkers and flexible hinges and are currently using these for the annotation of genome sequence data.
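
The idea of generating a pathway between two known endpoint conformations can be illustrated with the simplest possible scheme. The sketch below performs plain linear interpolation of Cartesian coordinates and deliberately omits the restraints mentioned in the abstract, so it is an assumption-laden stand-in rather than the database server's method; the function name `interpolate_frames` and the two-atom coordinates are invented.

```python
def interpolate_frames(start, end, n_frames):
    """Return `n_frames` coordinate sets running from `start` to `end`.

    start, end: lists of (x, y, z) tuples with one entry per atom, in the
    same atom order. Endpoints are included, so n_frames must be >= 2.
    Plain linear interpolation: no restraints or energy terms are applied.
    """
    if len(start) != len(end):
        raise ValueError("endpoint conformations must have the same atoms")
    if n_frames < 2:
        raise ValueError("need at least the two endpoint frames")
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frame = [
            tuple(s + t * (e - s) for s, e in zip(atom_start, atom_end))
            for atom_start, atom_end in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Invented two-atom 'conformations'.
open_form = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
closed_form = [(0.0, 2.0, 0.0), (1.0, 0.0, 4.0)]
movie = interpolate_frames(open_form, closed_form, 3)
```

Unrestrained linear interpolation can distort bond lengths and produce atomic clashes in intermediate frames, which is presumably why a restrained scheme is used for the pathways described above.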

Proceedings Article
01 Jan 2002
TL;DR: The Yale Microarray Database (YMD) is a robust database system that allows efficient data storage, retrieval, secure access, data dissemination, and integrated data analyses for microarray researchers at Yale and their collaborators.
Abstract: The use of microarray technology to perform parallel analysis of the expression pattern of a large number of genes in a single experiment has created a new frontier of medical research. The vast amount of gene expression data generated from multiple microarray experiments requires a robust database system that allows efficient data storage, retrieval, secure access, data dissemination, and integrated data analyses. To address the growing needs of microarray researchers at Yale and their collaborators, we have built the Yale Microarray Database (YMD). YMD is Web-accessible with the following features: (i) a Web program that tracks DNA samples between source plates and arrays, (ii) the capability of finding common genes/clones across different array platforms, (iii) an image file server, (iv) laboratory-based user management and access privileges, (v) project management, (vi) template data entry, (vii) linking gene expression data to annotation databases for functional analysis. YMD is currently being used on a pilot basis by several laboratories for different organisms and array platforms.

Journal ArticleDOI
TL;DR: A prototype of a new database tool, GeneCensus, which focuses on comparing genomes globally, in terms of the collective properties of many genes, rather than in terms of the attributes of a single gene, is presented.
Abstract: We present a prototype of a new database tool, GeneCensus, which focuses on comparing genomes globally, in terms of the collective properties of many genes, rather than in terms of the attributes of a single gene (e.g. sequence similarity for a particular ortholog). The comparisons are presented in a visual fashion over the web at GeneCensus.org. The system concentrates on two types of comparisons: (i) trees based on the sharing of generalized protein families between genomes, and (ii) whole pathway analysis in terms of activity levels. For the trees, we have developed a module (TreeViewer) that clusters genomes in terms of the folds, superfamilies or orthologs they share (all can be considered as generalized ‘families’ or ‘protein parts’), and compares the resulting trees side-by-side with those built from sequence similarity of individual genes (e.g. a traditional tree built on ribosomal similarity). We also include comparisons to trees built on whole-genome dinucleotide or codon composition. For pathway comparisons, we have implemented a module (PathwayPainter) that graphically depicts, in selected metabolic pathways, the fluxes or expression levels of the associated enzymes (i.e. generalized ‘activities’). One can, consequently, compare organisms (and organism states) in terms of representations of these systemic quantities. Development of this module involved compiling, calculating and standardizing flux and expression information from many different sources. We illustrate pathway analysis for enzymes involved in central metabolism. We are able to show that, to some degree, flux and expression fluctuations have characteristic values in different sections of the central metabolism and that control points in this system (e.g. hexokinase, pyruvate kinase, phosphofructokinase, isocitrate dehydrogenase and citrate synthase) tend to be especially variable in flux and expression. 
Both the TreeViewer and PathwayPainter modules connect to other information sources related to individual-gene or organism properties (e.g. a single-gene structural annotation viewer).
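
Clustering genomes by the protein 'parts' they share needs a pairwise distance between genomes. The abstract does not specify the measure used by TreeViewer, so the sketch below assumes a Jaccard distance over sets of fold identifiers; the function names and the genome and fold labels are invented placeholders.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the fraction of shared items among all items in either set."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def distance_matrix(fold_sets):
    """Pairwise distances keyed by alphabetically ordered genome-name pairs."""
    return {
        (g1, g2): jaccard_distance(fold_sets[g1], fold_sets[g2])
        for g1, g2 in combinations(sorted(fold_sets), 2)
    }


# Hypothetical fold assignments for three made-up genomes.
folds = {
    "genome_A": {"f1", "f2", "f3"},
    "genome_B": {"f2", "f3", "f4"},
    "genome_C": {"f5"},
}
distances = distance_matrix(folds)
```

A matrix like `distances` could then be passed to any standard tree-building method (for example, hierarchical clustering) and the resulting tree compared with one built from sequence similarity.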

Journal ArticleDOI
TL;DR: 67 pseudomotif patterns over-represented in fly intergenic regions, 34 in worm, 21 in human and six in yeast, including the zinc finger, leucine zipper, nucleotide-binding motif and EGF domain are found, which implies that a fraction of the intergenic areas consist of ancient protein fragments that have become unrecognizable by conventional techniques for gene and pseudogene identification.

Journal ArticleDOI
TL;DR: A comprehensive analysis of 39,408 SNPs on human chromosomes 21 and 22 from The SNP Consortium (TSC) database, where SNPs are obtained by random sequencing using consistent and uniform methods, indicates that the occurrence of SNPs is lowest in exons and higher in repeats, introns and pseudogenes.
Abstract: SNPs are useful for genome-wide mapping and the study of disease genes. Previous studies have focused on SNPs in specific genes or SNPs pooled from a variety of different sources. Here, a systematic approach to the analysis of SNPs in relation to various features on a genome-wide scale, with emphasis on protein features and pseudogenes, is presented. We have performed a comprehensive analysis of 39,408 SNPs on human chromosomes 21 and 22 from The SNP Consortium (TSC) database, where SNPs are obtained by random sequencing using consistent and uniform methods. Our study indicates that the occurrence of SNPs is lowest in exons and higher in repeats, introns and pseudogenes. Moreover, in comparing genes and pseudogenes, we find that the SNP density is higher in pseudogenes and the ratio of nonsynonymous to synonymous changes is also much higher. These observations may be explained by the increased rate of SNP accumulation in pseudogenes, which presumably are not under selective pressure. We have also performed secondary structure prediction on all coding regions and found that there is no preferential distribution of SNPs in α-helices, β-sheets or coils. This could imply that protein structures, in general, can tolerate a wide range of substitutions. Tables relating to our results are available from http://genecensus.org/pseudogene.
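
The comparison of SNP occurrence across region types comes down to normalizing counts by the amount of sequence in each region. A minimal sketch follows; the per-kilobase normalization, the function name `density_by_region`, and all counts and lengths are invented for illustration and are not the study's data.

```python
from collections import Counter


def density_by_region(snp_regions, region_lengths_bp):
    """SNPs per kilobase for each region type.

    snp_regions:       iterable with one region label per observed SNP
    region_lengths_bp: dict mapping region label -> total length in base pairs
    """
    counts = Counter(snp_regions)
    return {
        region: 1000.0 * counts[region] / length
        for region, length in region_lengths_bp.items()
    }


# Made-up annotations: one label per SNP, plus total sequence per region type.
snp_labels = ["exon", "intron", "intron", "pseudogene", "intron", "pseudogene"]
lengths = {"exon": 2000, "intron": 3000, "pseudogene": 1000}
density = density_by_region(snp_labels, lengths)
```

The nonsynonymous-to-synonymous comparison mentioned in the abstract would follow the same pattern, dividing the two SNP counts within genes and within pseudogenes.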