Showing papers by "Mark Gerstein" published in 2004


Journal ArticleDOI
Elise A. Feingold, Peter J. Good, Mark S. Guyer, S. Kamholz, and 193 more (19 institutions)
22 Oct 2004-Science
TL;DR: The ENCyclopedia Of DNA Elements (ENCODE) Project is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function.
Abstract: The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (∼1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.

2,248 citations


Journal ArticleDOI
23 Jan 2004-Science
TL;DR: A large fraction of the Caenorhabditis elegans interactome network is mapped, starting with a subset of metazoan-specific proteins, and more than 4000 interactions were identified from high-throughput, yeast two-hybrid screens.
Abstract: To initiate studies on how protein-protein interaction (or "interactome") networks relate to multicellular functions, we have mapped a large fraction of the Caenorhabditis elegans interactome network. Starting with a subset of metazoan-specific proteins, more than 4000 interactions were identified from high-throughput, yeast two-hybrid (HT-Y2H) screens. Independent coaffinity purification assays experimentally validated the overall quality of this Y2H data set. Together with already described Y2H interactions and interologs predicted in silico, the current version of the Worm Interactome (WI5) map contains approximately 5500 interactions. Topological and biological features of this interactome network, as well as its integration with phenome and transcriptome data sets, lead to numerous biological hypotheses.

1,733 citations


Journal ArticleDOI
24 Dec 2004-Science
TL;DR: This work constructed a series of high-density oligonucleotide tiling arrays representing sense and antisense strands of the entire nonrepetitive sequence of the human genome and found 10,595 transcribed sequences not detected by other methods.
Abstract: Elucidating the transcribed regions of the genome constitutes a fundamental aspect of human biology, yet this remains an outstanding problem. To comprehensively identify coding sequences, we constructed a series of high-density oligonucleotide tiling arrays representing sense and antisense strands of the entire nonrepetitive sequence of the human genome. Transcribed sequences were located across the genome via hybridization to complementary DNA samples, reverse-transcribed from polyadenylated RNA obtained from human liver tissue. In addition to identifying many known and predicted genes, we found 10,595 transcribed sequences not detected by other methods. A large fraction of these are located in intergenic regions distal from previously annotated genes and exhibit significant homology to other mammalian proteins.

1,073 citations


Journal ArticleDOI
16 Sep 2004-Nature
TL;DR: The dynamics of a biological network on a genomic scale is presented, by integrating transcriptional regulatory information and gene-expression data for multiple conditions in Saccharomyces cerevisiae, using an approach for the statistical analysis of network dynamics, called SANDY, combining well-known global topological measures, local motifs and newly derived statistics.
Abstract: Network analysis has been applied widely, providing a unifying language to describe disparate systems ranging from social interactions to power grids. It has recently been used in molecular biology, but so far the resulting networks have only been analysed statically. Here we present the dynamics of a biological network on a genomic scale, by integrating transcriptional regulatory information and gene-expression data for multiple conditions in Saccharomyces cerevisiae. We develop an approach for the statistical analysis of network dynamics, called SANDY, combining well-known global topological measures, local motifs and newly derived statistics. We uncover large changes in underlying network architecture that are unexpected given current viewpoints and random simulations. In response to diverse stimuli, transcription factors alter their interactions to varying degrees, thereby rewiring the network. A few transcription factors serve as permanent hubs, but most act transiently only during certain conditions. By studying sub-network structures, we show that environmental responses facilitate fast signal propagation (for example, with short regulatory cascades), whereas the cell cycle and sporulation direct temporal progression through multiple stages (for example, with highly inter-connected transcription factors). Indeed, to drive the latter processes forward, phase-specific transcription factors inter-regulate serially, and ubiquitously active transcription factors layer above them in a two-tiered hierarchy. We anticipate that many of the concepts presented here--particularly the large-scale topological changes and hub transience--will apply to other biological networks, including complex sub-systems in higher eukaryotes.

1,007 citations


Journal ArticleDOI
TL;DR: Despite the general organisational similarity of networks across the phylogenetic spectrum, there are interesting qualitative differences among the network components, such as the transcription factors.

811 citations


Journal ArticleDOI
TL;DR: This work quantitatively assesses the degree to which interologs can be reliably transferred between species as a function of the sequence similarity of the corresponding interacting proteins and introduces the concept of a "regulog"--a conserved regulatory relationship between proteins across different species.
Abstract: Proteins function mainly through interactions, especially with DNA and other proteins. While some large-scale interaction networks are now available for a number of model organisms, their experimental generation remains difficult. Consequently, interolog mapping--the transfer of interaction annotation from one organism to another using comparative genomics--is of significant value. Here we quantitatively assess the degree to which interologs can be reliably transferred between species as a function of the sequence similarity of the corresponding interacting proteins. Using interaction information from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Helicobacter pylori, we find that protein-protein interactions can be transferred when a pair of proteins has a joint sequence identity >80% or a joint E-value <10^-70. (These "joint" quantities are the geometric means of the identities or E-values for the two pairs of interacting proteins.) We generalize our interolog analysis to protein-DNA binding, finding such interactions are conserved at specific thresholds between 30% and 60% sequence identity depending on the protein family. Furthermore, we introduce the concept of a "regulog"--a conserved regulatory relationship between proteins across different species. We map interologs and regulogs from yeast to a number of genomes with limited experimental annotation (e.g., Arabidopsis thaliana) and make these available through an online database at http://interolog.gersteinlab.org. Specifically, we are able to transfer approximately 90,000 potential protein-protein interactions to the worm. We test a number of these in two-hybrid experiments and are able to verify 45 overlaps, which we show to be statistically significant.

572 citations
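
The "joint" quantities in the interolog abstract above are geometric means, so the transfer rule can be sketched in a few lines. This is only an illustration under assumptions: the function names, the argument layout, and the idea of passing one identity and one E-value per ortholog pair are hypothetical and not taken from the paper or its database. Only the thresholds (joint identity >80%, joint E-value <10^-70) come from the abstract.

```python
import math


def joint_identity(identity_a: float, identity_b: float) -> float:
    """Geometric mean of two percent identities (each in 0-100)."""
    return math.sqrt(identity_a * identity_b)


def joint_evalue(evalue_a: float, evalue_b: float) -> float:
    """Geometric mean of two alignment E-values."""
    return math.sqrt(evalue_a * evalue_b)


def is_transferable(identity_a: float, identity_b: float,
                    evalue_a: float, evalue_b: float,
                    min_identity: float = 80.0,
                    max_evalue: float = 1e-70) -> bool:
    """Apply the thresholds quoted in the abstract (hypothetical helper).

    identity_a / evalue_a describe the match between protein A and its
    counterpart in the target species; identity_b / evalue_b do the same
    for protein B, the interaction partner of A.
    """
    return (joint_identity(identity_a, identity_b) > min_identity
            or joint_evalue(evalue_a, evalue_b) < max_evalue)


if __name__ == "__main__":
    # 90% and 80% identity give a joint identity of about 84.9%.
    print(round(joint_identity(90.0, 80.0), 1))
    print(is_transferable(90.0, 80.0, 1.0, 1.0))
```

Because the geometric mean is pulled toward the smaller value, one weakly conserved partner lowers the joint identity more than an arithmetic mean would.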


Journal ArticleDOI
Haiyuan Yu, Dov Greenbaum, Hao Xin Lu, Xiaowei Zhu, Mark Gerstein
TL;DR: This article introduces the notion of 'marginal essentiality' through combining quantitatively the results from large-scale phenotypic experiments and finds that this quantity relates to many of the topological characteristics of protein-protein interaction networks.

330 citations


Journal ArticleDOI
TL;DR: Motions related to protein-protein binding events can be surveyed from the perspective of the Database of Macromolecular Movements, whereby proteins are found to simultaneously exist in populations of diverse conformations.

294 citations


Journal ArticleDOI
David A. Hall, Heng Zhu, Xiaowei Zhu, Thomas Royce, Mark Gerstein, Michael Snyder
15 Oct 2004-Science
TL;DR: Results indicate that metabolic enzymes can directly regulate eukaryotic gene expression.
Abstract: Gene expression in eukaryotes is normally believed to be controlled by transcriptional regulators that activate genes encoding structural proteins and enzymes. To identify previously unrecognized DNA binding activities, a yeast proteome microarray was screened with DNA probes; Arg5,6, a well-characterized mitochondrial enzyme involved in arginine biosynthesis, was identified. Chromatin immunoprecipitation experiments revealed that Arg5,6 is associated with specific nuclear and mitochondrial loci in vivo, and Arg5,6 binds to specific fragments in vitro. Deletion of Arg5,6 causes altered transcript levels of both nuclear and mitochondrial target genes. These results indicate that metabolic enzymes can directly regulate eukaryotic gene expression.

244 citations


Journal ArticleDOI
TL;DR: This work has systematically identified approximately 5000 processed pseudogenes in the mouse genome, and estimated that approximately 60% are lineage specific, created after the mouse and human diverged.

208 citations


Journal ArticleDOI
TL;DR: The notion of networks between biological entities (including molecular and genetic interaction networks as well as transcriptional regulatory relationships) potentially provides a unifying language suitable for the systematic description of protein function.

Journal ArticleDOI
TL;DR: This analysis identifies the MIPS and Gene Ontology functional similarity datasets as the dominating information contributors for predicting protein-protein interactions under the framework proposed by Jansen et al., and finds that random forests based on this information alone can give highly accurate classifications.
Abstract: Background: Identifying protein-protein interactions is fundamental for understanding the molecular machinery of the cell. Proteome-wide studies of protein-protein interactions are of significant value, but the high-throughput experimental technologies suffer from high rates of both false positive and false negative predictions. In addition to high-throughput experimental data, many diverse types of genomic data can help predict protein-protein interactions, such as mRNA expression, localization, essentiality, and functional annotation. Evaluations of the information contributions from different evidences help to establish more parsimonious models with comparable or better prediction accuracy, and to obtain biological insights into the relationships between protein-protein interactions and other genomic information. Results: Our assessment is based on the genomic features used in a Bayesian network approach to predict protein-protein interactions genome-wide in yeast. In the special case, when one does not have any missing information about any of the features, our analysis shows that there is a larger information contribution from the functional classification than from expression correlations or essentiality. We also show that in this case alternative models, such as logistic regression and random forest, may be more effective than Bayesian networks for predicting interactions. Conclusions: In the restricted problem posed by the complete-information subset, we identified the MIPS and Gene Ontology (GO) functional similarity datasets as the dominating information contributors for predicting the protein-protein interactions under the framework proposed by Jansen et al. Random forests based on the MIPS and GO information alone can give highly accurate classifications. In this particular subset of complete information, adding other genomic data does little for improving predictions. We also found that the data discretizations used in the Bayesian methods decreased classification performance.

Journal ArticleDOI
TL;DR: Mapping for the first time the binding distribution of CREB along an entire human chromosome revealed 215 binding sites corresponding to 192 different loci and 100 annotated potential gene targets, providing novel molecular insights into how CREB mediates its functions in humans.
Abstract: The cyclic AMP-responsive element-binding protein (CREB) is an important transcription factor that can be activated by hormonal stimulation and regulates neuronal function and development. An unbiased, global analysis of where CREB binds has not been performed. We have mapped for the first time the binding distribution of CREB along an entire human chromosome. Chromatin immunoprecipitation of CREB-associated DNA and subsequent hybridization of the associated DNA to a genomic DNA microarray containing all of the nonrepetitive DNA of human chromosome 22 revealed 215 binding sites corresponding to 192 different loci and 100 annotated potential gene targets. We found binding near or within many genes involved in signal transduction and neuronal function. We also found that only a small fraction of CREB binding sites lay near well-defined 5' ends of genes; the majority of sites were found elsewhere, including introns and unannotated regions. Several of the latter lay near novel unannotated transcriptionally active regions. Few CREB targets were found near full-length cyclic AMP response element sites; the majority contained shorter versions or close matches to this sequence. Several of the CREB targets were altered in their expression by treatment with forskolin; interestingly, both induced and repressed genes were found. Our results provide novel molecular insights into how CREB mediates its functions in humans.

Journal ArticleDOI
TL;DR: It is found that pseudogenes are more than twice as likely as genes to have anomalous codon usage associated with horizontal transfer, and that the number of horizontally transferred pseudogenes differs significantly between pathogenic and non-pathogenic strains of Escherichia coli.
Abstract: Background: Pseudogenes often manifest themselves as disabled copies of known genes. In prokaryotes, it was generally believed (with a few well-known exceptions) that they were rare. Results: We have carried out a comprehensive analysis of the occurrence of pseudogenes in a diverse selection of 64 prokaryote genomes. Overall, we find a total of around 7,000 candidate pseudogenes. Moreover, in all the genomes surveyed, pseudogenes occur in at least 1 to 5% of all gene-like sequences, with some genomes having considerably higher occurrence. Although many large populations of pseudogenes arise from large, diverse protein families (for example, the ABC transporters), notable numbers of pseudogenes are associated with specific families that do not occur that widely. These include the cytochrome P450 and PPE families (PF00067 and PF00823) and others that have a direct role in DNA transposition. Conclusions: We find suggestive evidence that a large fraction of prokaryote pseudogenes arose from failed horizontal transfer events. In particular, we find that pseudogenes are more than twice as likely as genes to have anomalous codon usage associated with horizontal transfer. Moreover, we found a significant difference in the number of horizontally transferred pseudogenes in pathogenic and non-pathogenic strains of Escherichia coli.

Journal ArticleDOI
TL;DR: There are persistent differences in gene expression between adult males and females, and these molecular differences have important implications for the physiological differences between males and females.

Journal ArticleDOI
TL;DR: Pseudogenes are considered genomic fossils: disabled copies of functional genes that were once active in the ancient genome. They can be used to improve the accuracy of gene annotation.

Journal ArticleDOI
TL;DR: This work uses tree-based analyses and random forest algorithms to discover the most significant protein features that influence a protein's amenability to high-throughput experimentation and identifies combinations of features that best differentiate the small group of proteins for which a structure has been determined from all the currently selected targets.

Journal ArticleDOI
TL;DR: The results suggest that novel transcribed regions with low coding potential exhibit a strong propensity for early DNA replication, and their activity is linked to the replication-timing program.
Abstract: Duplication of the genome during the S phase of the cell cycle does not occur simultaneously; rather, different sequences are replicated at different times. The replication timing of specific sequences can change during development; however, the determinants of this dynamic process are poorly understood. To gain insights into the contribution of developmental state, genomic sequence, and transcriptional activity to replication timing, we investigated the timing of DNA replication at high resolution along an entire human chromosome (chromosome 22) in two different cell types. The pattern of replication timing was correlated with respect to annotated genes, gene expression, novel transcribed regions of unknown function, sequence composition, and cytological features. We observed that chromosome 22 contains regions of early- and late-replicating domains of 100 kb to 2 Mb, many (but not all) of which are associated with previously described chromosomal bands. In both cell types, expressed sequences are replicated earlier than nontranscribed regions. However, several highly transcribed regions replicate late. Overall, the DNA replication-timing profiles of the two different cell types are remarkably similar, with only nine regions of difference observed. In one case, this difference reflects the differential expression of an annotated gene that resides in this region. Novel transcribed regions with low coding potential exhibit a strong propensity for early DNA replication. Although the cellular function of such transcripts is poorly understood, our results suggest that their activity is linked to the replication-timing program.

Journal ArticleDOI
Haiyuan Yu, Xiaowei Zhu, Dov Greenbaum, John E. Karro, Mark Gerstein
TL;DR: TopNet, an automated web tool designed to address the challenge of comparing the topologies of sub-networks, found that soluble proteins had more interactions than membrane proteins and that, amongst soluble proteins, those that were highly expressed, had many polar amino acids, and had many alpha helices tended to have the most interaction partners.
Abstract: Biological networks are a topic of great current interest, particularly with the publication of a number of large genome-wide interaction datasets. They are globally characterized by a variety of graph-theoretic statistics, such as the degree distribution, clustering coefficient, characteristic path length and diameter. Moreover, real protein networks are quite complex and can often be divided into many sub-networks through systematic selection of different nodes and edges. For instance, proteins can be sub-divided by expression level, length, amino-acid composition, solubility, secondary structure and function. A challenging research question is to compare the topologies of sub-networks, looking for global differences associated with different types of proteins. TopNet is an automated web tool designed to address this question, calculating and comparing topological characteristics for different sub-networks derived from any given protein network. It provides reasonable solutions to the calculation of network statistics for sub-networks embedded within a larger network and gives simplified views of a sub-network of interest, allowing one to navigate through it. After constructing TopNet, we applied it to the interaction networks and protein classes currently available for yeast. We were able to find a number of potential biological correlations. In particular, we found that soluble proteins had more interactions than membrane proteins. Moreover, amongst soluble proteins, those that were highly expressed, had many polar amino acids, and had many alpha helices, tended to have the most interaction partners. Interestingly, TopNet also turned up some systematic biases in the current yeast interaction network: on average, proteins with a known functional classification had many more interaction partners than those without. This phenomenon may reflect the incompleteness of the experimentally determined yeast interaction network.
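
Two of the graph statistics named in the TopNet abstract, average degree and clustering coefficient, can be computed for a sub-network with a short sketch. The abstract does not say how TopNet treats edges that leave the sub-network, so this sketch assumes the simplest choice, the induced subgraph, in which an edge is kept only when both endpoints are selected. All function names are hypothetical and this is not TopNet's code.

```python
from itertools import combinations


def induced_subnetwork(edges, nodes):
    """Adjacency sets for the subgraph induced by `nodes` (assumed convention)."""
    keep = set(nodes)
    if not keep:
        raise ValueError("at least one node must be selected")
    adjacency = {node: set() for node in keep}
    for u, v in edges:
        if u in keep and v in keep and u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def average_degree(adjacency):
    """Mean number of interaction partners per node."""
    return sum(len(nbrs) for nbrs in adjacency.values()) / len(adjacency)


def average_clustering(adjacency):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for nbrs in adjacency.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adjacency[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adjacency)


if __name__ == "__main__":
    # Toy network; the node labels are placeholders, not real proteins.
    edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
    sub = induced_subnetwork(edges, {"A", "B", "C", "D"})
    print(average_degree(sub), round(average_clustering(sub), 3))
```

Comparing these numbers across sub-networks chosen by protein class (for example soluble versus membrane proteins) is the kind of comparison the abstract describes.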

Journal ArticleDOI
TL;DR: New data emphasize a breadth of possible structural mechanisms, particularly the ability to drastically alter protein architecture and the native flexibility of many structures, as well as high-resolution studies of increasingly complex assemblies and conformational changes.

Journal ArticleDOI
TL;DR: The collection reported here constitutes the largest plasmid-based set of sequenced yeast mutant alleles to date and, as such, should be singularly useful for gene and genome-wide functional analysis.
Abstract: We present here an unbiased and extremely versatile insertional library of yeast genomic DNA generated by in vitro mutagenesis with a multipurpose element derived from the bacterial transposon Tn7. This mini-Tn7 element has been engineered such that a single insertion can be used to generate a lacZ fusion, gene disruption, and epitope-tagged gene product. Using this transposon, we generated a plasmid-based library of ∼300,000 mutant alleles; by high-throughput screening in yeast, we identified and sequenced 9032 insertions affecting 2613 genes (45% of the genome). From analysis of 7176 insertions, we found little bias in Tn7 target-site selection in vitro. In contrast, we also sequenced 10,174 Tn3 insertions and found a markedly stronger preference for an AT-rich 5-base pair target sequence. We further screened 1327 insertion alleles in yeast for hypersensitivity to the chemotherapeutic cisplatin. Fifty-one genes were identified, including four functionally uncharacterized genes and 25 genes involved in DNA repair, replication, transcription, and chromatin structure. In total, the collection reported here constitutes the largest plasmid-based set of sequenced yeast mutant alleles to date and, as such, should be singularly useful for gene and genome-wide functional analysis.

Journal ArticleDOI
TL;DR: The total result of this work allows us, for the first time, to begin to think about the membrane protein interactome, the set of all interactions between distinct transmembrane helices in the lipid bilayer.
Abstract: We review recent computational advances in the study of membrane proteins, focusing on those that have at least one transmembrane helix. Transmembrane protein regions are, in many respects, easier to investigate computationally than experimentally, due to the uniformity of their structure and interactions (e.g. consisting predominately of nearly parallel helices packed together) on one hand and presenting the challenges of solubility on the other. We present the progress made on identifying and classifying membrane proteins into families, predicting their structure from amino-acid sequence patterns (using many different methods), and analyzing their interactions and packing. The total result of this work allows us for the first time to begin to think about the membrane protein interactome, the set of all interactions between distinct transmembrane helices in the lipid bilayer.

Journal ArticleDOI
TL;DR: Novel proteins with transmembrane sequences distinct from the E5 protein that can activate the PDGF β receptor and transform cells are identified, and this approach may allow the creation and identification of small proteins that modulate the activity of a variety of cellular transmembrane proteins.

Journal ArticleDOI
TL;DR: It is suggested that noncovalent oligomeric associations, which are common in membrane proteins, may provide an alternative source of evolutionary diversity.
Abstract: Recombination of evolutionarily unrelated domains is a mechanism often used by evolution to produce variety in soluble proteins. By using a classification of polytopic transmembrane domains into families, we examined integral membrane proteins for evidence of this mechanism. Surprisingly, we found that domain recombination is not common for the transmembrane regions of membrane proteins, a majority of integral membrane proteins containing only a single transmembrane domain. We suggest that noncovalent oligomeric associations, which are common in membrane proteins, may provide an alternative source of evolutionary diversity.

Journal ArticleDOI
TL;DR: An HMM formalism that explicitly uses 3D coordinates in its match states for protein structures is developed and implemented, which suggests that the described construct is quite useful for protein structure analysis.
Abstract: Hidden Markov Models (HMMs) have proven very useful in computational biology for such applications as sequence pattern matching, gene-finding, and structure prediction. Thus far, however, they have been confined to representing 1D sequence (or the aspects of structure that could be represented by character strings). We develop an HMM formalism that explicitly uses 3D coordinates in its match states. The match states are modeled by 3D Gaussian distributions centered on the mean coordinate position of each alpha carbon in a large structural alignment. The transition probabilities depend on the spread of the neighboring match states and on the number of gaps found in the structural alignment. We also develop methods for aligning query structures against 3D HMMs and scoring the result probabilistically. For 1D HMMs these tasks are accomplished by the Viterbi and forward algorithms. However, these will not work in unmodified form for the 3D problem, due to non-local quality of structural alignment, so we develop extensions of these algorithms for the 3D case. Several applications of 3D HMMs for protein structure classification are reported. A good separation of scores for different fold families suggests that the described construct is quite useful for protein structure analysis. We have created a rigorous 3D HMM representation for protein structures and implemented a complete set of routines for building 3D HMMs in C and Perl. The code is freely available from http://www.molmovdb.org/geometry/3dHMM , and at this site we also have a simple prototype server to demonstrate the features of the described approach.
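
The 3D HMM abstract above models each match state as a 3D Gaussian centered on the mean alpha-carbon position. A minimal sketch of the corresponding emission log-density follows. It assumes an isotropic Gaussian with a single spread parameter `sigma` per state, which the abstract does not specify. The function name is hypothetical and this is not the authors' C or Perl implementation.

```python
import math


def log_match_emission(point, mean, sigma):
    """Log-density of an isotropic 3D Gaussian (assumed form of a match state).

    point, mean: (x, y, z) coordinates, e.g. of an alpha carbon.
    sigma: standard deviation shared by all three axes (an assumption).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((p - m) ** 2 for p, m in zip(point, mean))
    normaliser = -1.5 * math.log(2.0 * math.pi * sigma ** 2)
    return normaliser - squared_distance / (2.0 * sigma ** 2)


if __name__ == "__main__":
    state_mean = (1.0, 2.0, 3.0)
    # A query atom sitting exactly on the state mean scores highest.
    print(log_match_emission((1.0, 2.0, 3.0), state_mean, 1.0))
    print(log_match_emission((2.0, 2.0, 3.0), state_mean, 1.0))
```

Working in log space keeps the scores additive along an alignment path, which is how 1D Viterbi-style scoring is normally implemented.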

Journal ArticleDOI
01 Aug 2004-Proteins
TL;DR: The article by O’Toole et al. in this issue of Proteins describes some of the features of these protein target lists, the overlap between these worldwide efforts, and a first pass at the data mining that becomes possible by analyzing success and failure at various points along the structure production pipeline across thousands of protein targets.
Abstract: The U.S. NIH Protein Structure Initiative (PSI) is a joint government, university, and industry effort, organized and supported by the National Institute of General Medical Sciences, and aimed at reducing the costs and increasing the speed of protein structure determination. Its long-range goal is to make the 3D atomic-level structures of most proteins in nature easily obtainable from knowledge of their corresponding DNA sequences (http://www.nigms.gov/psi). It is the primary U.S. component of a broad international effort in structural genomics, involving at least 20 projects throughout the world. In order to minimize overlap of their efforts, most of these structural genomics pilot projects make their protein target lists and progress reports publicly available. These protein target lists provide dynamic summaries of progress on the production and structure determination of each target protein. These Web-accessible data represent a tremendously valuable new resource to the biological science community, which is only beginning to be widely recognized. As illustrated in the article by Liu et al. in this issue of Proteins, much thought and effort, often involving advanced bioinformatics analysis, has gone into developing these protein target lists. The article by O’Toole et al. in this issue of Proteins describes some of the features of these protein target lists, the overlap between these worldwide efforts, and a first pass at the data mining that becomes possible by analyzing success and failure at various points along the structure production pipeline across thousands of protein targets. Such retrospective analysis of structural genomics data has the potential to greatly improve methods for protein expression, sample preparation, functional characterization, and structure determination. In addition, the target lists themselves provide inventories of protein expression vectors, protein samples, and many other biochemical reagents that are generally freely available to the broader biological community. The Northeast Structural Genomics Consortium (NESG) is one of the several pilot projects of the PSI. Its primary goals are to develop and refine new technologies for high-throughput protein production and structure determination by both NMR and X-ray crystallography, and to apply these technologies in determining representative structures of the domain sequence families that constitute eukaryotic proteomes. The project (http://www.nesg.org) is developing technology aimed at optimizing each stage of the structure determination pipeline, including intelligent protein target selection, high-throughput and cost-effective protein sample production, robotics-aided protein crystallization screening, rapid NMR data collection, automated NMR and X-ray diffraction data analysis, and integrated databases for laboratory information management and structure–function annotations. The key short-term goal of the project is to construct a technology platform capable of experimentally determining 100–200 sequence-unique NMR or X-ray crystal structures of proteins per year. Most structural genomics projects involve collaborative interactions between multiple research groups, coordinated through laboratory information management systems (LIMS). The development and integration of these LIMS are significant challenges that are being addressed both individually and collectively by the structural genomics research community. SPiNE (http://spine.nesg.org) is a data warehouse and integrated data tracking tool that holds detailed records about the cloning, expression, purification, biophysical characterization, crystallization, and structure determination by NMR and/or X-ray crystallography of each target under study by the NESG Consortium. The NESG also aims at correlating the structural data produced by the project with the extensive biological data emerging from large-scale functional genomics efforts (e.g., see Goh et al. and Carter et al.).

Journal ArticleDOI
TL;DR: The solution to a basic online interval maximum problem via a sliding-window approach is discussed and how to use this solution in a nontrivial manner for many of the tiling problems introduced.
Abstract: In this paper, we consider several variations of the following basic tiling problem: given a sequence of real numbers with two size-bound parameters, we want to find a set of tiles of maximum total weight such that each tile satisfies the size bounds. A solution to this problem is important to a number of computational biology applications such as selecting genomic DNA fragments for PCR-based amplicon microarrays and performing homology searches with long sequence queries. Our goal is to design efficient algorithms with linear or near-linear time and space in the normal range of parameter values for these problems. For this purpose, we first discuss the solution to a basic online interval maximum problem via a sliding-window approach and show how to use this solution in a nontrivial manner for many of the tiling problems introduced. We also discuss NP-hardness results and approximation algorithms for generalizing our basic tiling problem to higher dimensions. Finally, computational results from applying our tiling algorithms to genomic sequences of five model eukaryotes are reported.
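
The tiling abstract above builds on an online interval maximum solved with a sliding window. The paper's exact formulation is not given in the abstract, so the sketch below shows the textbook version of the idea instead: a monotonic deque that reports the maximum of the last `width` values as each new value arrives, in amortised constant time per element. The function name and interface are hypothetical.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_window_max(values: Iterable[float], width: int) -> Iterator[float]:
    """Yield the maximum of each full window of `width` consecutive values."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # Holds (index, value) pairs with strictly decreasing values, so the
    # front of the deque is always the maximum of the current window.
    window = deque()
    for index, value in enumerate(values):
        # Drop entries that can never be a maximum again.
        while window and window[-1][1] <= value:
            window.pop()
        window.append((index, value))
        # Drop the front entry once it slides out of the window.
        if window[0][0] <= index - width:
            window.popleft()
        if index >= width - 1:
            yield window[0][1]


if __name__ == "__main__":
    print(list(sliding_window_max([1, 3, -1, -3, 5, 3, 6, 7], 3)))
```

Each element is pushed and popped at most once, which is what gives the linear total running time the abstract aims for in its basic building block.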

Journal ArticleDOI
01 May 2004-Proteins
TL;DR: Detailed sequence–structure analysis indicates that while the active‐site structure of isocitrate dehydrogenase is most likely similar between pathogens and nonpathogens, the unusual sequence divergence could result from an extra domain added at the N‐terminus and may therefore confer additional pathogenic functions.
Abstract: We have introduced a method to identify functional shifts in protein families. Our method is based on the calculation of an active-site conservation ratio, which we call the "ASC ratio". For a structurally based alignment of a protein family, this ratio is the average sequence similarity of the active-site region compared to the full-length protein. The active-site region is defined as all the residues within a certain radius of the known functionally important groups. Using our method, we have analyzed enzymes of central metabolism from a large number of genomes (35). We found that for most of the enzymes, the active-site region is more highly conserved than the full-length sequence. However, for three tricarboxylic acid (TCA)-cycle enzymes, active-site sequences are considerably more diverged (than full-length ones). In particular, we were able to identify in six pathogens a novel isocitrate dehydrogenase that has very low sequence similarity around the active site. Detailed sequence-structure analysis indicates that while the active-site structure of isocitrate dehydrogenase is most likely similar between pathogens and nonpathogens, the unusual sequence divergence could result from an extra domain added at the N-terminus. This domain has a leucine-rich motif similar to one in the Yersinia pestis cytotoxin and may therefore confer additional pathogenic functions.
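
As a rough illustration of the ratio described above, the sketch below divides the mean pairwise similarity over the active-site columns of an alignment by the mean pairwise similarity over all columns. Several choices here are assumptions rather than details from the abstract: similarity is taken to be simple fractional identity, gapped positions are skipped, the active-site columns are supplied directly instead of being derived from a structural radius, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Sequence


def mean_pairwise_identity(alignment: Sequence[str], columns: Iterable[int]) -> float:
    """Average fractional identity over all sequence pairs, restricted to `columns`.

    Assumption: identity stands in for "sequence similarity"; positions where
    either sequence has a gap ('-') are ignored.
    """
    cols = list(columns)
    scores = []
    for a, b in combinations(alignment, 2):
        compared = [(a[c], b[c]) for c in cols if a[c] != "-" and b[c] != "-"]
        if compared:
            scores.append(sum(x == y for x, y in compared) / len(compared))
    if not scores:
        raise ValueError("no comparable positions in the selected columns")
    return sum(scores) / len(scores)


def asc_ratio(alignment: Sequence[str], active_site_columns: Iterable[int]) -> float:
    """Active-site similarity divided by full-length similarity (illustrative)."""
    full = mean_pairwise_identity(alignment, range(len(alignment[0])))
    if full == 0:
        raise ValueError("full-length identity is zero; ratio is undefined")
    site = mean_pairwise_identity(alignment, active_site_columns)
    return site / full


# Toy alignment with made-up sequences; columns 1-3 play the active-site role.
toy_alignment = ["ACDEFGHK", "ACDEFGHR", "TCDEYGHK"]
toy_ratio = asc_ratio(toy_alignment, [1, 2, 3])
```

Under this definition a ratio above 1 means the active-site columns are more conserved than the sequence as a whole, and a ratio below 1 means they are more diverged.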

Journal ArticleDOI
TL;DR: A high-productivity/low-maintenance (HP/LM) approach to HPC that is based on establishing a collaborative relationship between the bioinformaticist and HPC expert that respects the former's codes and minimizes the latter's efforts is described.

01 Jan 2004
TL;DR: A prototype yeast hub server is implemented that allows sharing, querying, and integration of different types and formats of yeast genome data that are located in disparate sources and a standard XML format is proposed called “Yeast Hub XML” (YHX).
Abstract: While there are an increasing number of genomes (including the human genome) whose sequences have been fully or nearly completed, the budding yeast Saccharomyces cerevisiae was the first fully sequenced eukaryotic genome. Given its ease of genetic manipulation and the fact that many of its genes are strikingly similar to human genes, the yeast genome has been studied extensively through a wide range of biological experiments (e.g., microarray experiments). As a result, a large variety of types of yeast genome data have been generated and made accessible through many resources (e.g., SGD, MIPS, and YPD). While these resources serve many specific needs of individual researchers, we can reap more benefits by integrating these disparate datasets to facilitate larger-context data mining. However, such integrated analysis is hampered by the heterogeneous formats that are used for data distribution. With the increasing use of eXtensible Markup Language (XML) in the bioinformatics domain, we demonstrate how to use XML to standardize the exchange of a variety of types of yeast data between different resources. In particular, we propose a standard XML format called “Yeast Hub XML” (YHX). This format consists of: i) metadata and ii) data. While the former describes the resource and data structure, the latter is used to represent the data. In addition, we apply various XML-related technologies including XPath and XSLT to query, integrate, and transform multiple XML datasets. We have implemented a prototype yeast hub server that allows sharing, querying, and integration of different types and formats of yeast genome data that are located in disparate sources.
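
To make the metadata/data split and the XPath-style querying more concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The abstract does not give the actual YHX schema, so every element and attribute name below (`yhx`, `metadata`, `field`, `data`, `record`, `orf`, `expression`) is a hypothetical stand-in, and the values are made-up placeholders rather than real measurements.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Hypothetical document: NOT the real YHX schema, only the two-part
# "metadata describes, data represents" layout from the abstract.
YHX_EXAMPLE = """\
<yhx>
  <metadata>
    <resource name="example-resource"/>
    <field name="orf" type="string"/>
    <field name="expression" type="float"/>
  </metadata>
  <data>
    <record orf="YAL001C" expression="1.8"/>
    <record orf="YAL002W" expression="0.4"/>
  </data>
</yhx>
"""


def field_names(xml_text: str) -> List[str]:
    """Read the field names declared in the metadata section."""
    root = ET.fromstring(xml_text)
    return [f.get("name") for f in root.findall("./metadata/field")]


def lookup_record(xml_text: str, orf: str) -> Optional[Dict[str, str]]:
    """Find one data record by ORF name using an XPath attribute predicate.

    Assumes `orf` contains no quote characters, since it is interpolated
    directly into the path expression.
    """
    root = ET.fromstring(xml_text)
    hit = root.find(f"./data/record[@orf='{orf}']")
    return dict(hit.attrib) if hit is not None else None
```

`ElementTree` supports only a limited subset of XPath, which is enough for attribute lookups like this one; the fuller XPath and XSLT processing mentioned in the abstract would need a dedicated XML library.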