Showing papers by "Mark Gerstein" published in 2014
••
Harvard University1, National Institutes of Health2, Medical College of Wisconsin3, University of Washington4, University of Michigan5, Stanford University6, University of Geneva7, Wellcome Trust Sanger Institute8, Washington University in St. Louis9, University of Chicago10, Yale University11, Duke University12, Boston Children's Hospital13, Baylor College of Medicine14, Lawrence Berkeley National Laboratory15, Johns Hopkins University16, University of Pennsylvania17, Broad Institute18
TL;DR: The key challenges of assessing sequence variants in human disease are discussed, integrating both gene-level and variant-level support for causality, and guidelines for summarizing confidence in variant pathogenicity are proposed.
Abstract: The discovery of rare genetic variants is accelerating, and clear guidelines for distinguishing disease-causing sequence variants from the many potentially functional variants present in any human genome are urgently needed. Without rigorous standards we risk an acceleration of false-positive reports of causality, which would impede the translation of genomic research findings into the clinical diagnostic setting and hinder biological understanding of disease. Here we discuss the key challenges of assessing sequence variants in human disease, integrating both gene-level and variant-level support for causality. We propose guidelines for summarizing confidence in variant pathogenicity and highlight several areas that require further resource development.
1,165 citations
••
TL;DR: An anatomically comprehensive atlas of the mid-gestational human brain is described, including de novo reference atlases, in situ hybridization, ultra-high-resolution magnetic resonance imaging (MRI) and microarray analysis on highly discrete laser-microdissected brain regions.
Abstract: The anatomical and functional architecture of the human brain is mainly determined by prenatal transcriptional processes. We describe an anatomically comprehensive atlas of the mid-gestational human brain, including de novo reference atlases, in situ hybridization, ultra-high-resolution magnetic resonance imaging (MRI) and microarray analysis on highly discrete laser-microdissected brain regions. In developing cerebral cortex, transcriptional differences are found between different proliferative and post-mitotic layers, wherein laminar signatures reflect cellular composition and developmental processes. Cytoarchitectural differences between human and mouse have molecular correlates, including species differences in gene expression in subplate, although surprisingly we find minimal differences between the inner and outer subventricular zones even though the outer zone is expanded in humans. Both germinal and post-mitotic cortical layers exhibit fronto-temporal gradients, with particular enrichment in the frontal lobe. Finally, many neurodevelopmental disorder and human-evolution-related genes show patterned expression, potentially underlying unique features of human cortical formation. These data provide a rich, freely-accessible resource for understanding human brain development.
1,114 citations
••
Massachusetts Institute of Technology1, California Institute of Technology2, Stanford University3, Harvard University4, Broad Institute5, Duke University6, University of Massachusetts Medical School7, National Institutes of Health8, University of Southern California9, Yale University10, Florida State University11, Cold Spring Harbor Laboratory12, Wellcome Trust Sanger Institute13, University of California, Santa Cruz14, Princeton University15, University of California, San Diego16, University of Washington17, University of Chicago18, Pennsylvania State University19
TL;DR: The strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies are reviewed.
Abstract: With the completion of the human genome sequence, attention turned to identifying and annotating its functional DNA elements. As a complement to genetic and comparative genomics approaches, the Encyclopedia of DNA Elements Project was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types. The resulting genome-wide data reveal sites of biochemical activity with high positional resolution and cell type specificity that facilitate studies of gene regulation and interpretation of noncoding variants associated with human disease. However, the biochemically active regions cover a much larger fraction of the genome than do evolutionarily conserved regions, raising the question of whether nonconserved but biochemically active regions are truly functional. Here, we review the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies. We also analyze the relationship between signal intensity, genomic coverage, and evolutionary conservation. Our results reinforce the principle that each approach provides complementary information and that we need to use combinations of all three to elucidate genome function in human biology and disease.
691 citations
••
TL;DR: FunSeq2 is a computational framework that annotates and prioritizes noncoding drivers from thousands of somatic alterations in a typical tumor; it combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline.
Abstract: Identification of noncoding drivers from thousands of somatic alterations in a typical tumor is a difficult and unsolved problem. We report a computational framework, FunSeq2, to annotate and prioritize these mutations. The framework combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline. The pipeline has a weighted scoring system combining: inter- and intra-species conservation; loss- and gain-of-function events for transcription-factor binding; enhancer-gene linkages and network centrality; and per-element recurrence across samples. We further highlight putative drivers with information specific to a particular sample, such as differential expression. FunSeq2 is available from funseq2.gersteinlab.org.
314 citations
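The weighted scoring idea in the FunSeq2 abstract can be illustrated with a minimal sketch. The feature names and weight values below are hypothetical placeholders chosen for illustration; the abstract does not state the actual weights or how they are derived, and this is not the FunSeq2 implementation.

```python
# Hypothetical feature weights; the abstract does not give actual values.
FEATURE_WEIGHTS = {
    "conserved": 1.0,            # inter- or intra-species conservation
    "tf_motif_break": 0.8,       # loss-of-function event for TF binding
    "tf_motif_gain": 0.6,        # gain-of-function event for TF binding
    "enhancer_gene_link": 0.5,   # variant falls in an enhancer linked to a gene
    "network_hub_target": 0.7,   # linked gene is central in a network
    "recurrent_element": 0.9,    # element mutated across several samples
}


def score_variant(features):
    """Sum the weights of the features annotated on one variant.

    Features without a known weight contribute nothing.
    """
    return sum(FEATURE_WEIGHTS.get(name, 0.0) for name in features)


def prioritize(variants):
    """Return (variant_id, score) pairs sorted by descending score."""
    scored = [(vid, score_variant(feats)) for vid, feats in variants.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    example = {
        "v1": {"conserved"},
        "v2": {"conserved", "tf_motif_break"},
        "v3": set(),
    }
    for variant_id, score in prioritize(example):
        print(variant_id, round(score, 2))
```

A simple additive score keeps each feature's contribution easy to inspect; sample-specific evidence such as differential expression could be layered on as a further term.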
••
Yale University1, Dartmouth College2, Lawrence Berkeley National Laboratory3, University of California, Berkeley4, Cold Spring Harbor Laboratory5, University of Washington6, University of California, Los Angeles7, University of Connecticut Health Center8, Pompeu Fabra University9, Harvard University10, Indiana University11, Tsinghua University12, National Institutes of Health13, Wellcome Trust Sanger Institute14, Swiss Institute of Bioinformatics15, University of Lausanne16, King's College London17, Kettering University18, Carnegie Mellon University19, Vanderbilt University20, University of California, Irvine21, Howard Hughes Medical Institute22, European Bioinformatics Institute23, University of Vienna24, CAS-MPG Partner Institute for Computational Biology25, The Chinese University of Hong Kong26
TL;DR: It is found in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a ‘universal model’ based on a single set of organism-independent parameters.
Abstract: The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
284 citations
••
TL;DR: It is found that poly(A)- lncRNAs tend to have shorter transcripts and lower expression levels, and they show significant expression specificity in response to stresses, and their differential expression is significantly enriched in drought condition and depleted in heat condition.
Abstract: Recently, in addition to poly(A)+ long non-coding RNAs (lncRNAs), many lncRNAs without poly(A) tails have been characterized in mammals. However, the non-polyA lncRNAs and their conserved motifs, especially those associated with environmental stresses, have not been fully investigated in plant genomes. We performed poly(A)− RNA-seq for seedlings of Arabidopsis thaliana under four stress conditions, and predicted lncRNA transcripts. We classified the lncRNAs into three confidence levels according to their expression patterns, epigenetic signatures and RNA secondary structures. Then, we further classified the lncRNAs into poly(A)+ and poly(A)− transcripts. Compared with poly(A)+ lncRNAs and coding genes, we found that poly(A)− lncRNAs tend to have shorter transcripts and lower expression levels, and they show significant expression specificity in response to stresses. In addition, their differential expression is significantly enriched in drought condition and depleted in heat condition. Overall, we identified 245 poly(A)+ and 58 poly(A)− lncRNAs that are differentially expressed under various stress stimuli. The differential expression was validated by qRT-PCR, and the signaling pathways involved were supported by specific binding of transcription factors (TFs), phytochrome-interacting factor 4 (PIF4) and PIF5. Moreover, we found many conserved sequence and structural motifs of lncRNAs from different functional groups (e.g. a UUC motif responding to salt and an AU-rich stem-loop responding to cold), indicating that the conserved elements might be responsible for the stress-responsive functions of lncRNAs.
243 citations
••
TL;DR: The results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections.
Abstract: Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.
184 citations
01 Aug 2014
TL;DR: In this article, the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors were mapped for a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time.
Abstract: Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.
167 citations
••
TL;DR: It is shown that a residue located in the kinase activation segment, which is termed the “DFG+1” residue, acts as a major determinant for serine-threonine phosphorylation site specificity.
96 citations
••
TL;DR: Overall, a broad spectrum of biochemical activity for pseudogenes is identified, with the majority in each organism exhibiting varying degrees of partial activity, suggesting a uniform degradation process.
Abstract: Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism’s genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (∼15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles.
66 citations
••
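TL;DR: MUSIC is a signal processing approach that identifies enriched regions in ChIP-Seq data at multiple length scales, which is useful given the wide range of scales probed in ChIP-Seq assays; applied to RNA polymerase II data, it reveals a clear distinction between the stalled and elongating forms of the polymerase.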
Abstract: We present MUSIC, a signal processing approach for identification of enriched regions in ChIP-Seq data, available at http://www.music.gersteinlab.org. MUSIC first filters the ChIP-Seq read-depth signal for systematic noise from non-uniform mappability, which fragments enriched regions. Then it performs a multiscale decomposition, using median filtering, identifying enriched regions at multiple length scales. This is useful given the wide range of scales probed in ChIP-Seq assays. MUSIC performs favorably in terms of accuracy and reproducibility compared with other methods. In particular, analysis of RNA polymerase II data reveals a clear distinction between the stalled and elongating forms of the polymerase.
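The multiscale median-filtering step described in the MUSIC abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the MUSIC algorithm itself: the function names, the fixed threshold, and the window sizes are all hypothetical, and the mappability correction is omitted.

```python
from statistics import median


def median_filter(signal, window):
    """Sliding-window median, with the window clipped at the array edges."""
    half = window // 2
    return [
        median(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]


def enriched_regions(signal, threshold):
    """Return half-open (start, end) intervals where signal > threshold."""
    regions, start = [], None
    for i, value in enumerate(signal):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(signal)))
    return regions


def multiscale_regions(signal, windows, threshold):
    """Call enriched regions on the signal smoothed at each window size."""
    return {
        w: enriched_regions(median_filter(signal, w), threshold)
        for w in windows
    }


if __name__ == "__main__":
    depth = [0, 0, 5, 0, 5, 5, 5, 0, 0, 0]  # toy read-depth signal
    print(multiscale_regions(depth, windows=[1, 3], threshold=1))
```

On the toy signal, the window of 1 leaves the fragmented calls intact, while the window of 3 suppresses the isolated spike and fills the one-position gap, giving a single broader region. That is the intuition behind calling regions at several length scales.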
••
TL;DR: This article discusses how proteomics should be exploited to enhance high-throughput functional genomic analysis by tighter integration of data analyses, as well as experimental strategies to achieve finer cellular and subcellular resolution in transcriptomic and proteomic studies of neural tissues.
Abstract: The immense intercellular and intracellular heterogeneity of the CNS presents major challenges for high-throughput omic analyses. Transcriptional, translational and post-translational regulatory events are localized to specific neuronal cell types or subcellular compartments, resulting in discrete patterns of protein expression and activity. A spatial and quantitative knowledge of the neuroproteome is therefore critical to understanding both normal and pathological aspects of the functional genomics and anatomy of the CNS. Improvements in mass spectrometry allow the profiling of proteins at a sufficient depth to complement results from high-throughput genomic and transcriptomic assays. However, there are challenges in integrating proteomic data with other data modalities and even greater challenges in obtaining comprehensive neuroproteomic data with cell-type specificity. Here we discuss how proteomics should be exploited to enhance high-throughput functional genomic analysis by tighter integration of data analyses. We also discuss experimental strategies to achieve finer cellular and subcellular resolution in transcriptomic and proteomic studies of neural tissues.
••
TL;DR: OrthoClust is a computational framework that integrates the co-association networks of individual species by utilizing the orthology relationships of genes between species and outputs optimized modules that are fundamentally cross-species, which can either be conserved or species-specific.
Abstract: Increasingly, high-dimensional genomics data are becoming available for many organisms. Here, we develop OrthoClust for simultaneously clustering data across multiple species. OrthoClust is a computational framework that integrates the co-association networks of individual species by utilizing the orthology relationships of genes between species. It outputs optimized modules that are fundamentally cross-species, which can either be conserved or species-specific. We demonstrate the application of OrthoClust using the RNA-Seq expression profiles of Caenorhabditis elegans and Drosophila melanogaster from the modENCODE consortium. A potential application of cross-species modules is to infer putative analogous functions of uncharacterized elements like non-coding RNAs based on guilt-by-association.
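One way to picture joint clustering across species is an objective that rewards good modules within each species' co-association network and adds a bonus when orthologous genes share a module label. The sketch below scores a given labeling only; the objective form (Newman modularity plus a coupling term), the `kappa` parameter, and the gene names are assumptions made for illustration, and the optimization step is left out. It should not be read as the published OrthoClust formulation.

```python
def modularity(edges, labels):
    """Newman modularity of one undirected network under a node labeling."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Fraction of edges that fall inside a module.
    within = sum(1 for u, v in edges if labels[u] == labels[v]) / m
    # Expected fraction under a degree-preserving random model.
    degree_by_label = {}
    for node, k in degree.items():
        degree_by_label[labels[node]] = degree_by_label.get(labels[node], 0) + k
    expected = sum(d * d for d in degree_by_label.values()) / (4 * m * m)
    return within - expected


def joint_score(networks, orthologs, labels, kappa):
    """Per-species modularity plus kappa per ortholog pair sharing a label."""
    within_species = sum(modularity(edges, labels) for edges in networks.values())
    agreeing = sum(1 for a, b in orthologs if labels[a] == labels[b])
    return within_species + kappa * agreeing


if __name__ == "__main__":
    # Hypothetical toy data: two worm genes pairs and two fly gene pairs.
    networks = {
        "worm": [("w1", "w2"), ("w3", "w4")],
        "fly": [("f1", "f2"), ("f3", "f4")],
    }
    orthologs = [("w1", "f1"), ("w2", "f2"), ("w3", "f3"), ("w4", "f4")]
    aligned = {"w1": 0, "w2": 0, "f1": 0, "f2": 0,
               "w3": 1, "w4": 1, "f3": 1, "f4": 1}
    print(joint_score(networks, orthologs, aligned, kappa=0.5))
```

With `kappa` set to zero the score reduces to clustering each species independently; larger values pull orthologs into shared modules, which is how a module can end up conserved rather than species-specific.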
••
TL;DR: Using an epidemiological framework, it is suggested that heterogeneity among CD4+ T cells in the genital mucosa could help explain the low infection-to-exposure ratio and selection of the founder strain after sexual exposure to HIV.
Abstract: Worldwide, more than 250 people become infected with HIV every hour [1], yet an individual's chance of becoming infected after a single sexual exposure, the predominant mode of HIV transmission, is often lower than one in 100 [2]. When sexually transmitted HIV-1 infection does occur, it is usually initiated by a single virus, called the founder strain, despite the presence of thousands of genetically diverse viral strains in the transmitting partner [3]. Here we review evidence from molecular biology and virology suggesting that heterogeneity among CD4+ T cells could yield wide variation in the capability of individual cells to become infected and transmit HIV to other cells. Using an epidemiological framework, we suggest that such heterogeneity among CD4+ T cells in the genital mucosa could help explain the low infection-to-exposure ratio and selection of the founder strain after sexual exposure to HIV.
During sexual transmission, founder viral strains preferentially infect CD4+ T cells using the CCR5 coreceptor [4], [5]. At the time of initial exposure to HIV, these CD4+ T cells exhibit baseline heterogeneity due to stochasticity in cellular gene expression [6] and dynamic variation in immunological status (activated, resting, etc.) [7]. In addition, because CD4+ T cells are mobile, they are heterogeneously distributed in the genital mucosa, with varying degrees of clustering and contact [8]–[11]. In other contexts, it is well-known that heterogeneity among isogeneic cells inside the body can affect many cellular behaviors and outcomes, including infection dynamics [12], [13].
Epidemiological analyses of disease outbreaks among people indicate that heterogeneity in the ability of individuals in a population to spread disease can have a significant impact on whether a local outbreak becomes an epidemic [14]. Heterogeneity among a population of CD4+ T cells may play a similarly critical role in the establishment and spread of HIV in the genital mucosa after sexual exposure.
••
Massachusetts Institute of Technology1, Broad Institute2, California Institute of Technology3, Stanford University4, Harvard University5, Duke University6, University of Massachusetts Medical School7, National Institutes of Health8, University of Southern California9, Yale University10, Florida State University11, Cold Spring Harbor Laboratory12, Wellcome Trust Sanger Institute13, University of California, Santa Cruz14, University of Chicago15, University of California, San Diego16, University of Washington17, Pennsylvania State University18
TL;DR: The Encyclopedia of DNA Elements (ENCODE) catalog and similar data resources are viewed as important foundations for understanding the DNA elements and molecular mechanisms underlying human biology and disease.
Abstract: We agree with Brunet and Doolittle (1) on the utility of distinguishing the evolutionarily selected effects (SE) of some genomic elements from the causal roles (CR) of other elements that lack signatures of selection (1–4). DNA sequences identified by biochemical approaches include both SE and CR elements, and genetic variation in both has been implicated in human traits and disease susceptibility. We thus view the Encyclopedia of DNA Elements (ENCODE) catalog and similar data resources as important foundations for understanding the DNA elements and molecular mechanisms underlying human biology and disease.
••
TL;DR: This article characterized the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates, and revealed a distinct class of genes with levels of expression across cell types and species, that have been constrained early in vertebrate evolution.
Abstract: We characterized by RNA-seq the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles obtained in human cell lines reveals substantial conservation of transcriptional programs, and uncovers a distinct class of genes with levels of expression across cell types and species that have been constrained early in vertebrate evolution. This core set of genes captures a substantial and constant fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but it is associated with strong and conserved epigenetic marking, as well as with a characteristic post-transcriptional regulatory program in which sub-cellular localization and alternative splicing play comparatively large roles.
••
23 May 2014
TL;DR: This paper proposes and outlines a licensing scheme, similar to those used by professional organizations, that not only enforces a code of conduct and punishes those who fail to live up to that code, but also mandates required continuing education to limit the possibility that the code will be violated inadvertently.
Abstract: The issues of privacy and disclosure are two sides of a weighty coin. Computational biologists and other scientists involved in genomic research need to be constantly cognizant of the push and pull of these two important concepts. Clinical genomics research in particular raises a number of particularly poignant concerns as society struggles between invasions of privacy such as recent efforts by the FBI and the NSA, and our own (surprisingly) personal disclosures on social media sites or via apathetic acquiescence to large data collection efforts. With regard to privacy there are numerous computational efforts that have heretofore offered to provide both the robustness of protection and the ease of use to be effective in manipulating the terabytes of data before the genomics researcher. Unfortunately, algorithms alone have thus far failed to provide either the necessary strength to foil those intent on obtaining information or the promised agility to manipulate the vast datasets. While technical solutions advance, they cannot stand on their own, and this paper proposes and outlines a licensing scheme, similar to those used by professional organizations, that not only enforces a code of conduct and punishes those who fail to live up to that code, but also mandates required continuing education to limit the possibility that the code will be violated inadvertently. It is the use of the social and the technological advances together that will likely create not only an environment that fosters research and innovation, but also one that is responsive to privacy needs and norms.