scispace - formally typeset

Showing papers by "Mark Gerstein published in 2007"


Journal ArticleDOI
14 Jun 2007-Nature
TL;DR: Functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project are reported, providing convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts.
Abstract: We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

5,091 citations


Journal ArticleDOI
19 Oct 2007-Science
TL;DR: High-throughput and massive paired-end mapping (PEM) was used to map SVs in an African and in a putatively European individual and identified shared and divergent SVs relative to the reference genome, documenting that the number of SVs among humans is much larger than initially hypothesized; many of the SVs potentially affect gene function.
Abstract: Structural variation of the genome involves kilobase- to megabase-sized deletions, duplications, insertions, inversions, and complex combinations of rearrangements. We introduce high-throughput and massive paired-end mapping (PEM), a large-scale genome-sequencing method to identify structural variants (SVs) ∼3 kilobases (kb) or larger that combines the rescue and capture of paired ends of 3-kb fragments, massive 454 sequencing, and a computational approach to map DNA reads onto a reference genome. PEM was used to map SVs in an African and in a putatively European individual and identified shared and divergent SVs relative to the reference genome. Overall, we fine-mapped more than 1000 SVs and documented that the number of SVs among humans is much larger than initially hypothesized; many of the SVs potentially affect gene function. The breakpoint junction sequences of more than 200 SVs were determined with a novel pooling strategy and computational analysis. Our analysis provided insights into the mechanisms of SV formation in humans.
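The span-and-orientation logic at the heart of paired-end SV calling can be sketched as follows. This is a simplified illustration under invented thresholds, not the authors' published pipeline:

```python
# Simplified sketch of SV classification from paired-end mapping (PEM).
# Each read pair comes from a ~3 kb fragment; if the two ends map farther
# apart (or closer together) than expected, or in the wrong relative
# orientation, a structural variant is inferred at that locus.
# EXPECTED_INSERT and TOLERANCE are illustrative, not the published cutoffs.

EXPECTED_INSERT = 3000   # nominal fragment length (bp)
TOLERANCE = 1000         # allowed deviation before calling an SV

def classify_pair(left_pos, right_pos, same_orientation):
    """Classify one read pair mapped onto the reference genome.

    left_pos/right_pos: mapped positions of the two ends (left < right).
    same_orientation: True if both ends map to the same strand (ends of an
    intact fragment should map to opposite strands).
    """
    span = right_pos - left_pos
    if same_orientation:
        return "inversion"          # one end falls inside an inverted segment
    if span > EXPECTED_INSERT + TOLERANCE:
        return "deletion"           # reference holds sequence the donor lacks
    if span < EXPECTED_INSERT - TOLERANCE:
        return "insertion"          # donor holds sequence the reference lacks
    return "concordant"

print(classify_pair(10_000, 13_100, False))  # span ~3.1 kb -> concordant
print(classify_pair(10_000, 20_000, False))  # span 10 kb -> deletion
print(classify_pair(10_000, 11_000, False))  # span 1 kb -> insertion
print(classify_pair(10_000, 13_000, True))   # wrong orientation -> inversion
```

In practice a call would be supported by multiple independent pairs spanning the same locus rather than a single discordant pair.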

1,211 citations


Journal ArticleDOI
TL;DR: This definition sidesteps the complexities of regulation and transcription by removing the former altogether from the definition, arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene.
Abstract: While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century—from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition sidesteps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also manifests how integral the concept of biological function is in defining genes.

678 citations


Journal ArticleDOI
TL;DR: Systematic approaches to study large numbers of proteins, metabolites, and their modification have revealed complex molecular networks which provide novel insights in understanding basic mechanisms controlling normal cellular processes and disease pathologies.
Abstract: The execution of complex biological processes requires the precise interaction and regulation of thousands of molecules. Systematic approaches to study large numbers of proteins, metabolites, and their modification have revealed complex molecular networks. These biological networks are significantly different from random networks and often exhibit ubiquitous properties in terms of their structure and organization. Analyzing these networks provides novel insights in understanding basic mechanisms controlling normal cellular processes and disease pathologies.

555 citations


Journal ArticleDOI
TL;DR: The pathogenic content of Acinetobacter baumannii is explored using a combination of DNA sequencing and insertional mutagenesis, verifying that six of the genomic islands contain virulence genes, including two novel islands whose genes lacked homology with others in the databases.
Abstract: Acinetobacter baumannii has emerged as an important and problematic human pathogen as it is the causative agent of several types of infections including pneumonia, meningitis, septicemia, and urinary tract infections. We explored the pathogenic content of this harmful pathogen using a combination of DNA sequencing and insertional mutagenesis. The genome of this organism was sequenced using a strategy involving high-density pyrosequencing, a novel, rapid method of high-throughput sequencing. Excluding the rDNA repeats, the assembled genome is 3,976,746 base pairs (bp) and has 3830 ORFs. A significant fraction of ORFs (17.2%) are located in 28 putative alien islands, indicating that the genome has acquired a large amount of foreign DNA. Consistent with its role in pathogenesis, a remarkable number of the islands (16) contain genes implicated in virulence, indicating the organism devotes a considerable portion of its genes to pathogenesis. The largest island contains elements homologous to the Legionella/Coxiella Type IV secretion apparatus. Type IV secretion systems have been demonstrated to be important for virulence in other organisms and thus are likely to help mediate pathogenesis of A. baumannii. Insertional mutagenesis generated avirulent isolates of A. baumannii and verified that six of the islands contain virulence genes, including two novel islands containing genes that lacked homology with others in the databases. The DNA sequencing approach described in this study allows the rapid elucidation of the DNA sequence of any microbe and, when combined with genetic screens, can identify many novel genes important for microbial pathogenesis.

490 citations


Journal ArticleDOI
05 Oct 2007-Cell
TL;DR: Several unanticipated functions of Hsp90 under normal conditions and in response to stress are identified, highlighting the potential of the integrated global approach to uncover chaperone functions in the cell.

471 citations


Journal ArticleDOI
10 Aug 2007-Science
TL;DR: It is shown that most of the binding sites of the pseudohyphal regulators Ste12 and Tec1 have diverged across these species, far exceeding the interspecies variation in orthologous genes.
Abstract: Characterization of interspecies differences in gene regulation is crucial for understanding the molecular basis of both phenotypic diversity and evolution. By means of chromatin immunoprecipitation and DNA microarray analysis, the divergence in the binding sites of the pseudohyphal regulators Ste12 and Tec1 was determined in the yeasts Saccharomyces cerevisiae, S. mikatae, and S. bayanus under pseudohyphal conditions. We have shown that most of these sites have diverged across these species, far exceeding the interspecies variation in orthologous genes. A group of Ste12 targets was shown to be bound only in S. mikatae and S. bayanus under pseudohyphal conditions. Many of these genes are targets of Ste12 during mating in S. cerevisiae, indicating that specialization between the two pathways has occurred in this species. Transcription factor binding sites have therefore diverged substantially faster than ortholog content. Thus, gene regulation resulting from transcription factor binding is likely to be a major cause of divergence between related species.

374 citations


Journal ArticleDOI
TL;DR: It is suggested that calcium functions through distinct CaM/CML proteins to regulate a wide range of targets and cellular activities.
Abstract: Calmodulins (CaMs) are the most ubiquitous calcium sensors in eukaryotes. A number of CaM-binding proteins have been identified through classical methods, and many proteins have been predicted to bind CaMs based on their structural homology with known targets. However, multicellular organisms typically contain many CaM-like (CML) proteins, and a global identification of their targets and specificity of interaction is lacking. In an effort to develop a platform for large-scale analysis of proteins in plants we have developed a protein microarray and used it to study the global analysis of CaM/CML interactions. An Arabidopsis thaliana expression collection containing 1,133 ORFs was generated and used to produce proteins with an optimized medium-throughput plant-based expression system. Protein microarrays were prepared and screened with several CaMs/CMLs. A large number of previously known and novel CaM/CML targets were identified, including transcription factors, receptor and intracellular protein kinases, F-box proteins, RNA-binding proteins, and proteins of unknown function. Multiple CaM/CML proteins bound many binding partners, but the majority of targets were specific to one or a few CaMs/CMLs indicating that different CaM family members function through different targets. Based on our analyses, the emergent CaM/CML interactome is more extensive than previously predicted. Our results suggest that calcium functions through distinct CaM/CML proteins to regulate a wide range of targets and cellular activities.

357 citations


Journal ArticleDOI
TL;DR: MIMIx, the minimum information required for reporting a molecular interaction experiment, is proposed, which will support the rapid, systematic capture of molecular interaction data in public databases, thereby improving access to valuable interaction data.
Abstract: A wealth of molecular interaction data is available in the literature, ranging from large-scale datasets to a single interaction confirmed by several different techniques. These data are all too often reported either as free text or in tables of variable format, and are often missing key pieces of information essential for a full understanding of the experiment. Here we propose MIMIx, the minimum information required for reporting a molecular interaction experiment. Adherence to these reporting guidelines will result in publications of increased clarity and usefulness to the scientific community and will support the rapid, systematic capture of molecular interaction data in public databases, thereby improving access to valuable interaction data.

270 citations


Journal ArticleDOI
TL;DR: The transcriptional activity of the ENCODE pseudogenes was extensively examined through a systematic series of pseudogene-specific RACE analyses, demonstrating that at least a fifth of the 201 pseudogenes are transcribed in one or more cell lines or tissues.
Abstract: Arising from either retrotransposition or genomic duplication of functional genes, pseudogenes are “genomic fossils” valuable for exploring the dynamics and evolution of genes and genomes. Pseudogene identification is an important problem in computational genomics, and is also critical for obtaining an accurate picture of a genome’s structure and function. However, no consensus computational scheme for defining and detecting pseudogenes has been developed thus far. As part of the ENCyclopedia Of DNA Elements (ENCODE) project, we have compared several distinct pseudogene annotation strategies and found that different approaches and parameters often resulted in rather distinct sets of pseudogenes. We subsequently developed a consensus approach for annotating pseudogenes (derived from protein coding genes) in the ENCODE regions, resulting in 201 pseudogenes, two-thirds of which originated from retrotransposition. A survey of orthologs for these pseudogenes in 28 vertebrate genomes showed that a significant fraction (∼80%) of the processed pseudogenes are primate-specific sequences, highlighting the increasing retrotransposition activity in primates. Analysis of sequence conservation and variation also demonstrated that most pseudogenes evolve neutrally, and processed pseudogenes appear to have lost their coding potential immediately or soon after their emergence. In order to explore the functional implication of pseudogene prevalence, we have extensively examined the transcriptional activity of the ENCODE pseudogenes. We performed systematic series of pseudogene-specific RACE analyses. These, together with complementary evidence derived from tiling microarrays and high throughput sequencing, demonstrated that at least a fifth of the 201 pseudogenes are transcribed in one or more cell lines or tissues.

214 citations


Journal ArticleDOI
TL;DR: ChIP-chip and ChIP-PET are found to be complementary in their abilities to detect lower-ranked STAT1 targets; each method detected validated targets that were missed by the other.
Abstract: Recent progress in mapping transcription factor (TF) binding regions can largely be credited to chromatin immunoprecipitation (ChIP) technologies. We compared strategies for mapping TF binding regions in mammalian cells using two different ChIP schemes: ChIP with DNA microarray analysis (ChIP-chip) and ChIP with DNA sequencing (ChIP-PET). We first investigated parameters central to obtaining robust ChIP-chip data sets by analyzing STAT1 targets in the ENCODE regions of the human genome, and then compared ChIP-chip to ChIP-PET. We devised methods for scoring and comparing results among various tiling arrays and examined parameters such as DNA microarray format, oligonucleotide length, hybridization conditions, and the use of competitor Cot-1 DNA. The best performance was achieved with high-density oligonucleotide arrays, oligonucleotides ≥50 bases (b), the presence of competitor Cot-1 DNA and hybridizations conducted in microfluidics stations. When target identification was evaluated as a function of array number, 80%-86% of targets were identified with three or more arrays. Comparison of ChIP-chip with ChIP-PET revealed strong agreement for the highest ranked targets with less overlap for the low ranked targets. With advantages and disadvantages unique to each approach, we found that ChIP-chip and ChIP-PET are frequently complementary in their relative abilities to detect STAT1 targets for the lower ranked targets; each method detected validated targets that were missed by the other method. The most comprehensive list of STAT1 binding regions is obtained by merging results from ChIP-chip and ChIP-sequencing. Overall, this study provides information for robust identification, scoring, and validation of TF targets using ChIP-based technologies.

Journal ArticleDOI
TL;DR: The Pseudogene.org knowledgebase serves as a comprehensive repository for pseudogene annotation, including a collection of human annotations compiled from 16 sources, and supports a subset structure that highlights specific groups of pseudogenes that are of interest to the research community.
Abstract: The Pseudogene.org knowledgebase serves as a comprehensive repository for pseudogene annotation. The definition of a pseudogene varies within the literature, resulting in significantly different approaches to the problem of identification. Consequently, it is difficult to maintain a consistent collection of pseudogenes in detail necessary for their effective use. Our database is designed to address this issue. It integrates a variety of heterogeneous resources and supports a subset structure that highlights specific groups of pseudogenes that are of interest to the research community. Tools are provided for the comparison of sets and the creation of layered set unions, enabling researchers to derive a current 'consensus' set of pseudogenes. Additional features include versatile search, the capacity for robust interaction with other databases, the ability to reconstruct older versions of the database (accounting for changing genome builds) and an underlying object-oriented interface designed for researchers with a minimal knowledge of programming. At the present time, the database contains more than 100,000 pseudogenes spanning 64 prokaryote and 11 eukaryote genomes, including a collection of human annotations compiled from 16 sources.

Journal ArticleDOI
TL;DR: In this paper, the authors present a computational study to detect functional RNA structures within the ENCODE regions of the human genome using three recently introduced programs based on either phylogenetic-stochastic context-free grammars (EvoFold) or energy-directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures.
Abstract: Functional RNA structures play an important role both in the context of noncoding RNA transcripts as well as regulatory elements in mRNAs. Here we present a computational study to detect functional RNA structures within the ENCODE regions of the human genome. Since structural RNAs in general lack characteristic signals in primary sequence, comparative approaches evaluating evolutionary conservation of structures are most promising. We have used three recently introduced programs based on either phylogenetic-stochastic context-free grammar (EvoFold) or energy directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures (corresponding to approximately 2.7% of the ENCODE regions). EvoFold has its highest sensitivity in highly conserved and relatively AU-rich regions, while RNAz favors slightly GC-rich regions, resulting in a relatively small overlap between methods. Comparison with the GENCODE annotation points to functional RNAs in all genomic contexts, with a slightly increased density in 3'-UTRs. While we estimate a significant false discovery rate of approximately 50%-70% many of the predictions can be further substantiated by additional criteria: 248 loci are predicted by both RNAz and EvoFold, and an additional 239 RNAz or EvoFold predictions are supported by the (more stringent) AlifoldZ algorithm. Five hundred seventy RNAz structure predictions fall into regions that show signs of selection pressure also on the sequence level (i.e., conserved elements). More than 700 predictions overlap with noncoding transcripts detected by oligonucleotide tiling arrays. One hundred seventy-five selected candidates were tested by RT-PCR in six tissues, and expression could be verified in 43 cases (24.6%).

Journal ArticleDOI
TL;DR: It is shown that the interaction network roughly maps to cellular organization, with the periphery of the network corresponding to the cellular periphery (i.e., extracellular space or cell membrane).
Abstract: Because of recent advances in genotyping and sequencing, human genetic variation and adaptive evolution in the primate lineage have become major research foci. Here, we examine the relationship between genetic signatures of adaptive evolution and network topology. We find a striking tendency of proteins that have been under positive selection (as compared with the chimpanzee) to be located at the periphery of the interaction network. Our results are based on the analysis of two types of genome evolution, both in terms of intra- and interspecies variation. First, we looked at single-nucleotide polymorphisms and their fixed variants, single-nucleotide differences in the human genome relative to the chimpanzee. Second, we examine fixed structural variants, specifically large segmental duplications and their polymorphic precursors known as copy number variants. We propose two complementary mechanisms that lead to the observed trends. First, we can rationalize them in terms of constraints imposed by protein structure: We find that positively selected sites are preferentially located on the exposed surface of proteins. Because central network proteins (hubs) are likely to have a larger fraction of their surface involved in interactions, they tend to be constrained and under negative selection. Conversely, we show that the interaction network roughly maps to cellular organization, with the periphery of the network corresponding to the cellular periphery (i.e., extracellular space or cell membrane). This suggests that the observed positive selection at the network periphery may be due to an increase of adaptive events on the cellular periphery responding to changing environments.

Journal ArticleDOI
TL;DR: The evidence for and against pseudogene functionality are examined, it is argued that the time is ripe for revising the definition of a pseudogene, and a classification system is suggested to accommodate pseudogenes with various levels of functionality.

Journal ArticleDOI
TL;DR: An iterative, “active” approach (initially scoring with a preliminary model, performing targeted validations, retraining the model, and then rescoring), together with a flexible parameterization system that intuitively collapses a full model of 2,503 parameters to a core one of only 10, enables the study of CNV population frequencies.
Abstract: Copy-number variants (CNVs) are an abundant form of genetic variation in humans. However, approaches for determining exact CNV breakpoint sequences (physical deletion or duplication boundaries) across individuals, crucial for associating genotype to phenotype, have been lacking so far, and the vast majority of CNVs have been reported with approximate genomic coordinates only. Here, we report an approach, called BreakPtr, for fine-mapping CNVs (available from http://breakptr.gersteinlab.org). We statistically integrate both sequence characteristics and data from high-resolution comparative genome hybridization experiments in a discrete-valued, bivariate hidden Markov model. Incorporation of nucleotide-sequence information allows us to take into account the fact that recently duplicated sequences (e.g., segmental duplications) often coincide with breakpoints. In anticipation of an upcoming increase in CNV data, we developed an iterative, “active” approach to initially scoring with a preliminary model, performing targeted validations, retraining the model, and then rescoring, and a flexible parameterization system that intuitively collapses from a full model of 2,503 parameters to a core one of only 10. Using our approach, we accurately mapped >400 breakpoints on chromosome 22 and a region of chromosome 11, refining the boundaries of many previously approximately mapped CNVs. Four predicted breakpoints flanked known disease-associated deletions. We validated an additional four predicted CNV breakpoints by sequencing. Overall, our results suggest a predictive resolution of ≈300bp. This level of resolution enables more precise correlations between CNVs and across individuals than previously possible, allowing the study of CNV population frequencies. Further, it enabled us to demonstrate a clear Mendelian pattern of inheritance for one of the CNVs.
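The segmentation step behind this kind of breakpoint fine-mapping can be illustrated with a toy Viterbi decoder over discretized array-CGH log-ratios: the position where the decoded state switches marks the inferred breakpoint. The real BreakPtr model is bivariate (it also folds in sequence features such as segmental duplications); the states, emission bins, and probabilities below are invented for illustration:

```python
# Toy two-state HMM in the spirit of BreakPtr's segmentation: probes along a
# chromosome emit discretized log-ratio bins ("low"/"mid"/"high"), and Viterbi
# decoding assigns each probe a copy-number state. All probabilities invented.
import math

STATES = ("normal", "duplicated")
START = {"normal": 0.99, "duplicated": 0.01}
TRANS = {"normal":     {"normal": 0.99, "duplicated": 0.01},
         "duplicated": {"normal": 0.01, "duplicated": 0.99}}
EMIT = {"normal":     {"low": 0.20, "mid": 0.70, "high": 0.10},
        "duplicated": {"low": 0.05, "mid": 0.25, "high": 0.70}}

def viterbi(obs):
    """Return the most probable state path for a sequence of emission bins."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            row[s] = v[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    # Trace back the best-scoring path
    path = [max(STATES, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

obs = ["mid", "mid", "mid", "high", "high", "high"]
path = viterbi(obs)
breakpoint_idx = path.index("duplicated")  # first probe in the duplicated state
print(path, breakpoint_idx)
```

BreakPtr's bivariate formulation would additionally condition the transition and emission terms on local sequence features, which is what pulls predicted breakpoints toward features like segmental duplications.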

Journal ArticleDOI
TL;DR: The feasibility of a generic DNA microarray design applicable to any species is addressed by leveraging the high feature densities and relatively unbiased nature of genomic tiling microarrays, providing proof-of-principle for probing nucleic acid targets with off-target, nearest-neighbor features.
Abstract: A generic DNA microarray design applicable to any species would greatly benefit comparative genomics. We have addressed the feasibility of such a design by leveraging the great feature densities and relatively unbiased nature of genomic tiling microarrays. Specifically, we first divided each Homo sapiens Refseq-derived gene's spliced nucleotide sequence into all of its possible contiguous 25 nt subsequences. For each of these 25 nt subsequences, we searched a recent human transcript mapping experiment's probe design for the 25 nt probe sequence having the fewest mismatches with the subsequence, but that did not match the subsequence exactly. Signal intensities measured with each gene's nearest-neighbor features were subsequently averaged to predict their gene expression levels in each of the experiment's thirty-three hybridizations. We examined the fidelity of this approach in terms of both sensitivity and specificity for detecting actively transcribed genes, for transcriptional consistency between exons of the same gene, and for reproducibility between tiling array designs. Taken together, our results provide proof-of-principle for probing nucleic acid targets with off-target, nearest-neighbor features.
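The off-target nearest-neighbor idea can be sketched as follows, with toy 4-mers standing in for 25-mers; the probe set and signal values are invented:

```python
# Sketch of the "off-target nearest-neighbor" strategy: for each k-mer of a
# gene's spliced sequence, find the probe in an existing tiling-array design
# with the fewest mismatches that is NOT an exact match, then average those
# probes' signals to estimate the gene's expression level.
# The probe list and signal dictionary below are invented for illustration.

def mismatches(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor_probe(subseq, probes):
    """Probe with the fewest mismatches to subseq, excluding exact matches."""
    candidates = [p for p in probes if p != subseq]
    return min(candidates, key=lambda p: mismatches(subseq, p))

def predict_expression(gene_seq, probes, signal, k=25):
    """Average the signal of each k-mer's nearest-neighbor probe."""
    subseqs = [gene_seq[i:i + k] for i in range(len(gene_seq) - k + 1)]
    nn = [nearest_neighbor_probe(s, probes) for s in subseqs]
    return sum(signal[p] for p in nn) / len(nn)

probes = ["ACGT", "AGGT", "CGTT", "GTCC"]
signal = {"ACGT": 10.0, "AGGT": 8.0, "CGTT": 6.0, "GTCC": 4.0}
print(predict_expression("ACGTAC", probes, signal, k=4))  # (8+6+4)/3 = 6.0
```

At genome scale the mismatch search would of course use an index rather than a linear scan, but the estimator is the same averaging over near-match features.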

Journal ArticleDOI
TL;DR: The changing roles of scholarly journals and databases are examined, a vision of the optimal information architecture for the biosciences is presented, and the piece closes with tangible steps to improve the handling of scientific information today while paving the way for an expansive central index in the future.
Abstract: Scientific articles are tailored to present information in human-readable aliquots. Although the Internet has revolutionized the way our society thinks about information, the traditional text-based framework of the scientific article remains largely unchanged. This format imposes sharp constraints upon the type and quantity of biological information published today. Academic journals alone cannot capture the findings of modern genome-scale inquiry.

Journal ArticleDOI
TL;DR: The pathway that showed the most significant dysregulation, HIV-I NEF, was validated at the transcript and protein levels by quantitative PCR and immunohistochemical analysis, respectively, indicating that this pathway is especially dysregulated in hormone-refractory prostate cancer.
Abstract: Microarrays have been used to identify genes involved in cancer progression. We have now developed an algorithm that identifies dysregulated pathways from multiple expression array data sets without a priori definition of gene expression thresholds. Integrative microarray analysis of pathways (IMAP) was done using existing expression array data from localized and metastatic prostate cancer. Comparison of metastatic cancer and localized disease in multiple expression array profiling studies using the IMAP approach yielded a list of about 100 pathways that were significantly dysregulated (P < 0.05) in prostate cancer metastasis. The pathway that showed the most significant dysregulation, HIV-I NEF, was validated at both the transcript level and the protein level by quantitative PCR and immunohistochemical analysis, respectively. Validation by unsupervised analysis on an independent data set using the gene expression signature from the HIV-I NEF pathway verified the accuracy of our method. Our results indicate that this pathway is especially dysregulated in hormone-refractory prostate cancer.

Journal ArticleDOI
14 Mar 2007-PLOS ONE
TL;DR: The identification of 25,352 and 27,744 TARs not encoded by annotated exons in the rice subspecies japonica and indica, respectively, is reported, providing a systematic characterization of non-exonic transcripts in rice and expanding the current view of the complexity and dynamics of the rice transcriptome.
Abstract: Genome tiling microarray studies have consistently documented rich transcriptional activity beyond the annotated genes. However, systematic characterization and transcriptional profiling of the putative novel transcripts on the genome scale are still lacking. We report here the identification of 25,352 and 27,744 transcriptionally active regions (TARs) not encoded by annotated exons in the rice (Oryza sativa) subspecies japonica and indica, respectively. The non-exonic TARs account for approximately two thirds of the total TARs detected by tiling arrays and represent transcripts likely conserved between japonica and indica. Transcription of 21,018 (83%) japonica non-exonic TARs was verified through expression profiling in 10 tissue types using a re-array in which annotated genes and TARs were each represented by five independent probes. Subsequent analyses indicate that about 80% of the japonica TARs that were not assigned to annotated exons can be assigned to various putatively functional or structural elements of the rice genome, including splice variants, uncharacterized portions of incompletely annotated genes, antisense transcripts, duplicated gene fragments, and potential non-coding RNAs. These results provide a systematic characterization of non-exonic transcripts in rice and thus expand the current view of the complexity and dynamics of the rice transcriptome.

Journal ArticleDOI
TL;DR: This study developed an intuitive yet powerful approach to analyze the distribution of regulatory elements found in many different ChIP-chip experiments on a 10∼100-kb scale and shows that regulatory elements are associated with the location of known genes.
Abstract: The comprehensive inventory of functional elements in 44 human genomic regions carried out by the ENCODE Project Consortium enables for the first time a global analysis of the genomic distribution of transcriptional regulatory elements. In this study we developed an intuitive and yet powerful approach to analyze the distribution of regulatory elements found in many different ChIP–chip experiments on a 10∼100-kb scale. First, we focus on the overall chromosomal distribution of regulatory elements in the ENCODE regions and show that it is highly nonuniform. We demonstrate, in fact, that regulatory elements are associated with the location of known genes. Further examination on a local, single-gene scale shows an enrichment of regulatory elements near both transcription start and end sites. Our results indicate that overall these elements are clustered into regulatory rich “islands” and poor “deserts.” Next, we examine how consistent the nonuniform distribution is between different transcription factors. We perform on all the factors a multivariate analysis in the framework of a biplot, which enhances biological signals in the experiments. This groups transcription factors into sequence-specific and sequence-nonspecific clusters. Moreover, with experimental variation carefully controlled, detailed correlations show that the distribution of sites was generally reproducible for a specific factor between different laboratories and microarray platforms. Data sets associated with histone modifications have particularly strong correlations. Finally, we show how the correlations between factors change when only regulatory elements far from the transcription start sites are considered.
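The window-based view of regulatory "islands" and "deserts" can be illustrated with a short sketch; the window size, threshold, and site positions are arbitrary choices for illustration, not the study's parameters:

```python
# Bin regulatory-site positions into fixed-size genomic windows and flag
# site-rich windows as "islands" and empty stretches as "deserts".
# Window size and the island threshold are invented for this sketch.
from collections import Counter

def window_density(positions, window=50_000):
    """Count regulatory sites per fixed-size genomic window."""
    return Counter(p // window for p in positions)

def classify_windows(positions, n_windows, window=50_000, island_min=3):
    """Label each window as 'island', 'desert', or 'sparse'."""
    counts = window_density(positions, window)
    labels = {}
    for w in range(n_windows):
        c = counts.get(w, 0)
        labels[w] = "island" if c >= island_min else ("desert" if c == 0 else "sparse")
    return labels

# Hypothetical site positions (bp) along one region:
sites = [1_000, 2_500, 40_000, 120_000, 300_000, 301_000, 302_000, 330_000]
print(classify_windows(sites, n_windows=8))  # windows 0 and 6 come out as islands
```

The paper's analysis goes further (biplots across factors, correlations between laboratories and platforms), but the nonuniformity it reports is exactly this kind of clustering of sites into dense and empty windows.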

Journal ArticleDOI
TL;DR: In this paper, a standardized and well-defined edge ontology is proposed to represent pathways in large-scale networks, and a prototype is proposed as a starting point for reaching this goal.

Journal ArticleDOI
TL;DR: The total ancestry measure is based on counting the number of leaf nodes that share exactly the same set of 'higher up' category nodes in comparison to the total number of classified pairs and is associated with a power-law distribution, allowing for the quick assessment of the statistical significance of shared functional annotations.
Abstract: Motivation: Many classifications of protein function such as Gene Ontology (GO) are organized in directed acyclic graph (DAG) structures. In these classifications, the proteins are terminal leaf nodes; the categories ‘above’ them are functional annotations at various levels of specialization and the computation of a numerical measure of relatedness between two arbitrary proteins is an important proteomics problem. Moreover, analogous problems are important in other contexts in large-scale information organization—e.g. the Wikipedia online encyclopedia and the Yahoo and DMOZ web page classification schemes. Results: Here we develop a simple probabilistic approach for computing this relatedness quantity, which we call the total ancestry method. Our measure is based on counting the number of leaf nodes that share exactly the same set of ‘higher up’ category nodes in comparison to the total number of classified pairs (i.e. the chance for the same total ancestry). We show such a measure is associated with a power-law distribution, allowing for the quick assessment of the statistical significance of shared functional annotations. We formally compare it with other quantitative functional similarity measures (such as, shortest path within a DAG, lowest common ancestor shared and Azuaje's information-theoretic similarity) and provide concrete metrics to assess differences. Finally, we provide a practical implementation for our total ancestry measure for GO and the MIPS functional catalog and give two applications of it in specific functional genomics contexts. Availability: The implementations and results are available through our supplementary website at: http://gersteinlab.org/proj/funcsim Contact: mark.gerstein@yale.edu Supplementary information: Supplementary data are available at Bioinformatics online.
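
The total ancestry measure can be illustrated on a toy DAG: two leaves share total ancestry when their full sets of 'higher up' category nodes coincide, and the fraction of all leaf pairs with identical ancestor sets gives the chance of such agreement. The Python sketch below uses invented node names and is a simplified reading of the measure, not the authors' implementation.

```python
from itertools import combinations

def ancestors(dag, node):
    """All 'higher up' category nodes reachable from a leaf.
    The DAG is given as a child -> list-of-parents mapping."""
    seen, stack = set(), [node]
    while stack:
        for parent in dag.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return frozenset(seen)

def total_ancestry_score(dag, leaves, a, b):
    """Chance that a random leaf pair shares an identical ancestor set,
    reported when (a, b) themselves do; 1.0 otherwise (no significance)."""
    sets = {leaf: ancestors(dag, leaf) for leaf in leaves}
    same = sum(1 for x, y in combinations(leaves, 2) if sets[x] == sets[y])
    total = len(leaves) * (len(leaves) - 1) // 2
    return same / total if sets[a] == sets[b] else 1.0

# toy GO-like DAG (child -> parents); all names are hypothetical
dag = {"p1": ["A"], "p2": ["A"], "p3": ["B"], "A": ["root"], "B": ["root"]}
score = total_ancestry_score(dag, ["p1", "p2", "p3"], "p1", "p2")
print(score)  # 1 of 3 leaf pairs shares the full ancestor set
```

On a real ontology the same counting runs over all annotated gene pairs, and the power-law shape of the resulting distribution is what permits the quick significance assessment described above.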

Journal ArticleDOI
TL;DR: LinkHub leverages Semantic Web standards-based integrated data to provide novel information retrieval to identifier-related documents through relational graph queries, simplifies and manages connections to major hubs such as UniProt, and provides useful interactive and query interfaces for exploring the integrated data.
Abstract: A key abstraction in representing proteomics knowledge is the notion of unique identifiers for individual entities (e.g. proteins) and the massive graph of relationships among them. These relationships are sometimes simple (e.g. synonyms) but are often more complex (e.g. one-to-many relationships in protein family membership). We have built a software system called LinkHub using Semantic Web RDF that manages the graph of identifier relationships and allows exploration with a variety of interfaces. For efficiency, we also provide relational-database access and translation between the relational and RDF versions. LinkHub is practically useful in creating small, local hubs on common topics and then connecting these to major portals in a federated architecture; we have used LinkHub to establish such a relationship between UniProt and the North East Structural Genomics Consortium. LinkHub also facilitates queries and access to information and documents related to identifiers spread across multiple databases, acting as "connecting glue" between different identifier spaces. We demonstrate this with example queries discovering "interologs" of yeast protein interactions in the worm and exploring the relationship between gene essentiality and pseudogene content. We also show how "protein family based" retrieval of documents can be achieved. LinkHub is available at hub.gersteinlab.org and hub.nesg.org with supplement, database models and full source code. LinkHub leverages Semantic Web standards-based integrated data to provide novel information retrieval to identifier-related documents through relational graph queries, simplifies and manages connections to major hubs such as UniProt, and provides useful interactive and query interfaces for exploring the integrated data.
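
The "connecting glue" role can be pictured as reachability queries over a store of identifier-relationship triples. Below is a deliberately minimal pure-Python stand-in; the identifiers and predicate names are invented, and the real system uses Semantic Web RDF with relational back-ends.

```python
def related_ids(triples, start):
    """Breadth-first search over (subject, predicate, object) identifier
    triples, returning every identifier reachable from `start`."""
    adj = {}
    for s, _pred, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)   # follow links in both directions
    seen, queue = {start}, [start]
    while queue:
        for nbr in adj.get(queue.pop(0), ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen - {start}

# hypothetical identifier graph spanning three namespaces
triples = [
    ("UniProt:P12345", "synonymOf", "SGD:YAL001C"),
    ("SGD:YAL001C", "memberOf", "Pfam:PF00004"),
]
linked = sorted(related_ids(triples, "UniProt:P12345"))
print(linked)  # ['Pfam:PF00004', 'SGD:YAL001C']
```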

Posted Content
TL;DR: It is suggested that a standardized and well-defined edge ontology is necessary and a prototype is proposed as a starting point for reaching this goal, and the current edge representation is inadequate to accurately convey all the information in pathways.
Abstract: Pathways are integral to systems biology. Their classical representation has proven useful but is inconsistent in the meaning assigned to each arrow (or edge) and inadvertently implies the isolation of one pathway from another. Conversely, modern high-throughput experiments give rise to standardized networks facilitating topological calculations. Combining these perspectives, we can embed classical pathways within large-scale networks and thus demonstrate the crosstalk between them. As more diverse types of high-throughput data become available, we can effectively merge both perspectives, embedding pathways simultaneously in multiple networks. However, the original problem still remains - the current edge representation is inadequate to accurately convey all the information in pathways. Therefore, we suggest that a standardized, well-defined, edge ontology is necessary and propose a prototype here, as a starting point for reaching this goal.


Journal ArticleDOI
TL;DR: Tilescope is a fully integrated data processing pipeline for analyzing high-density tiling-array data, designed with a modular, three-tiered architecture, which facilitates parallelism, and a user-friendly graphical interface.
Abstract: We developed Tilescope, a fully integrated data processing pipeline for analyzing high-density tiling-array data (http://tilescope.gersteinlab.org). In a completely automated fashion, Tilescope normalizes signals between channels and across arrays, combines replicate experiments, scores each array element, and identifies genomic features. The program is designed with a modular, three-tiered architecture that facilitates parallelism, and a user-friendly graphical interface presents results in an organized web page, downloadable for further analysis.
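
One standard between-array step in such a pipeline is quantile normalization, which forces every array's signal distribution onto a common reference distribution. A minimal numpy sketch (ties are broken arbitrarily here, where production code would average them; this is not Tilescope's actual implementation):

```python
import numpy as np

def quantile_normalize(signals):
    """Map each column (array) of a probes-by-arrays matrix onto the
    mean distribution across arrays, preserving within-array ranks."""
    order = np.argsort(signals, axis=0)
    ranks = np.argsort(order, axis=0)             # rank of each probe per array
    mean_dist = np.sort(signals, axis=0).mean(axis=1)
    return mean_dist[ranks]

# hypothetical signals: 4 probes measured on 2 arrays
arrays = np.array([[5.0, 4.0],
                   [2.0, 1.0],
                   [3.0, 4.0],
                   [4.0, 2.0]])
norm = quantile_normalize(arrays)
print(norm)
```

After normalization every column contains the same set of values, only permuted, so between-array intensity comparisons at each probe become meaningful.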

Journal ArticleDOI
Samuel C. Flores1, Long J. Lu1, Julie Yang1, Nicholas Carriero1, Mark Gerstein1 
TL;DR: A Hinge Atlas of manually annotated hinges and a statistical formalism for calculating the enrichment of various types of residues in these hinges found that hinges tend to coincide with active sites, but unlike the latter they are not at all conserved in evolution.
Abstract: Relating features of protein sequences to structural hinges is important for identifying domain boundaries, understanding structure-function relationships, and designing flexibility into proteins. Efforts in this field have been hampered by the lack of a proper dataset for studying characteristics of hinges. Using the Molecular Motions Database we have created a Hinge Atlas of manually annotated hinges and a statistical formalism for calculating the enrichment of various types of residues in these hinges. We found various correlations between hinges and sequence features. Some of these are expected; for instance, we found that hinges tend to occur on the surface and in coils and turns and to be enriched with small and hydrophilic residues. Others are less obvious and intuitive. In particular, we found that hinges tend to coincide with active sites, but unlike the latter they are not at all conserved in evolution. We evaluate the potential for hinge prediction based on sequence. Motions play an important role in catalysis and protein-ligand interactions. Hinge bending motions comprise the largest class of known motions. Therefore it is important to relate the hinge location to sequence features such as residue type, physicochemical class, secondary structure, solvent exposure, evolutionary conservation, and proximity to active sites. To do this, we first generated the Hinge Atlas, a set of protein motions with the hinge locations manually annotated, and then studied the coincidence of these features with the hinge location. We found that all of the features have bearing on the hinge location. Most interestingly, we found that hinges tend to occur at or near active sites and yet unlike the latter are not conserved. Less surprisingly, we found that hinge residues tend to be small, not hydrophobic or aliphatic, and occur in turns and random coils on the surface. A functional sequence-based hinge predictor was built using some of the data generated in this study. The Hinge Atlas is made available to the community for further flexibility studies.
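
One common way to formalize such residue-type enrichment is a hypergeometric tail probability: the chance of seeing at least the observed number of hinge residues of a given type if hinge positions were drawn at random. The paper's exact statistical formalism may differ; the counts below are purely illustrative.

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Hypergeometric upper tail: probability of >= k residues of a given
    type among n hinge residues, when K of the protein's N residues are
    of that type."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# toy numbers: 200 residues, 40 glycines, 20 hinge residues, 9 hinge glycines
p = enrichment_pvalue(200, 40, 20, 9)
print(p)  # well below 0.05: glycine is enriched in this toy hinge set
```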

Journal ArticleDOI
TL;DR: TYNA is a Web system for managing, comparing and mining multiple networks, both directed and undirected, that efficiently implements methods that have proven useful in network analysis, including identifying defective cliques, finding small network motifs and calculating global statistics.
Abstract: Biological processes involve complex networks of interactions between molecules. Various large-scale experiments and curation efforts have led to preliminary versions of complete cellular networks for a number of organisms. To grapple with these networks, we developed TopNet-like Yale Network Analyzer (tYNA), a Web system for managing, comparing and mining multiple networks, both directed and undirected. tYNA efficiently implements methods that have proven useful in network analysis, including identifying defective cliques, finding small network motifs (such as feed-forward loops), calculating global statistics (such as the clustering coefficient and eccentricity), and identifying hubs and bottlenecks. It also allows one to manage a large number of private and public networks using a flexible tagging system, to filter them based on a variety of criteria, and to visualize them through an interactive graphical interface. A number of commonly used biological datasets have been pre-loaded into tYNA, standardized and grouped into different categories. Availability: The tYNA system can be accessed at http://networks.gersteinlab.org/tyna. The source code, JavaDoc API and WSDL can also be downloaded from the website. tYNA can also be accessed from the Cytoscape software using a plugin.
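
As an example of the motif statistics such a system reports, feed-forward loops (A→B, B→C, plus the shortcut A→C) can be counted directly from a directed edge list. A toy sketch, not tYNA's implementation:

```python
def feed_forward_loops(edges):
    """Count feed-forward loops in a directed network given as (source,
    target) pairs: triples where A->B, B->C and the shortcut A->C all exist."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    count = 0
    for a in succ:
        for b in succ[a]:
            for c in succ.get(b, ()):
                if c != a and c in succ[a]:
                    count += 1
    return count

# hypothetical regulatory edges: one FFL (A->B->C with shortcut A->C)
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
n_ffl = feed_forward_loops(edges)
print(n_ffl)  # 1
```

Comparing such counts against randomized networks with the same degree sequence is what turns a raw count into a motif-enrichment statistic.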

Journal ArticleDOI
TL;DR: This work investigated the influence of probe sequence composition on the efficacy of tiling microarrays for identifying novel transcription and transcription factor binding sites, developed three metrics for assessing this sequence dependence, and used them to evaluate existing sequence-based normalizations from the tiling-microarray literature.
Abstract: Motivation: Increases in microarray feature density allow the construction of so-called tiling microarrays. These arrays, or sets of arrays, contain probes targeting regions of sequenced genomes at regular genomic intervals. The unbiased nature of this approach allows for the identification of novel transcribed sequences, the localization of transcription factor binding sites (ChIP-chip), and high resolution comparative genomic hybridization, among other uses. These applications are quickly growing in popularity as tiling microarrays become more affordable. To reach maximum utility, the tiling microarray platform needs to be developed to the point that 1-nt resolution is achieved and that we have confidence in individual measurements taken at so fine a resolution. Any biases in tiling array signals must be systematically removed to achieve this goal. Results: Towards this end, we investigated the importance of probe sequence composition on the efficacy of tiling microarrays for identifying novel transcription and transcription factor binding sites. We found that intensities are highly sequence dependent and can greatly influence results. We developed three metrics for assessing this sequence dependence and used them to evaluate existing sequence-based normalizations from the tiling microarray literature. In addition, we applied three new techniques for addressing this problem; one method, adapted from similar work on GeneChip brand microarrays, is based on modeling array signal as a linear function of probe sequence, the second method extends this approach by iterative weighting and re-fitting of the model, and the third technique extrapolates the popular quantile normalization algorithm for between-array normalization to probe sequence space. These three methods compare favorably to existing strategies, based on the metrics defined here.
Availability: http://tiling.gersteinlab.org/sequence_effects/ Contact: mark.gerstein@yale.edu Supplementary information: Supplementary data are available at Bioinformatics online.
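
The first modelling idea, array signal as a linear function of probe sequence, can be sketched by regressing log-intensity on base composition and keeping the residuals as bias-corrected signal. Real models use per-position base indicators rather than whole-probe base counts, and the probe sequences and intensities below are invented:

```python
import numpy as np

def sequence_correct(probes, intensities):
    """Fit log2 intensity as a linear function of each probe's base
    composition (counts of A, C, G, T plus an intercept) and return the
    residuals as sequence-bias-corrected signal."""
    X = np.array([[p.count(b) for b in "ACGT"] + [1.0] for p in probes])
    y = np.log2(np.asarray(intensities, dtype=float))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# hypothetical 8-mer probes and raw intensities
probes = ["ACGTACGT", "GGGGCCCC", "ATATATAT", "CGCGCGCG", "AACCGGTT", "AAAATTTT"]
corrected = sequence_correct(probes, [100.0, 400.0, 60.0, 350.0, 150.0, 70.0])
print(np.round(corrected, 3))
```

Because the model includes an intercept, the residuals are centred; whatever intensity structure survives the correction is, under this simple model, not attributable to base composition.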