
Showing papers by "Mark Gerstein" published in 2014


Journal Article
24 Apr 2014-Nature
TL;DR: The key challenges of assessing sequence variants in human disease are discussed, integrating both gene-level and variant-level support for causality and guidelines for summarizing confidence in variant pathogenicity are proposed.
Abstract: The discovery of rare genetic variants is accelerating, and clear guidelines for distinguishing disease-causing sequence variants from the many potentially functional variants present in any human genome are urgently needed. Without rigorous standards we risk an acceleration of false-positive reports of causality, which would impede the translation of genomic research findings into the clinical diagnostic setting and hinder biological understanding of disease. Here we discuss the key challenges of assessing sequence variants in human disease, integrating both gene-level and variant-level support for causality. We propose guidelines for summarizing confidence in variant pathogenicity and highlight several areas that require further resource development.

1,165 citations


Journal Article
10 Apr 2014-Nature
TL;DR: An anatomically comprehensive atlas of the mid-gestational human brain is described, including de novo reference atlases, in situ hybridization, ultra-high-resolution magnetic resonance imaging (MRI) and microarray analysis on highly discrete laser-microdissected brain regions.
Abstract: The anatomical and functional architecture of the human brain is mainly determined by prenatal transcriptional processes. We describe an anatomically comprehensive atlas of the mid-gestational human brain, including de novo reference atlases, in situ hybridization, ultra-high-resolution magnetic resonance imaging (MRI) and microarray analysis on highly discrete laser-microdissected brain regions. In developing cerebral cortex, transcriptional differences are found between different proliferative and post-mitotic layers, wherein laminar signatures reflect cellular composition and developmental processes. Cytoarchitectural differences between human and mouse have molecular correlates, including species differences in gene expression in subplate, although surprisingly we find minimal differences between the inner and outer subventricular zones even though the outer zone is expanded in humans. Both germinal and post-mitotic cortical layers exhibit fronto-temporal gradients, with particular enrichment in the frontal lobe. Finally, many neurodevelopmental disorder and human-evolution-related genes show patterned expression, potentially underlying unique features of human cortical formation. These data provide a rich, freely-accessible resource for understanding human brain development.

1,114 citations


Journal Article
TL;DR: The strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies are reviewed.
Abstract: With the completion of the human genome sequence, attention turned to identifying and annotating its functional DNA elements. As a complement to genetic and comparative genomics approaches, the Encyclopedia of DNA Elements Project was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types. The resulting genome-wide data reveal sites of biochemical activity with high positional resolution and cell type specificity that facilitate studies of gene regulation and interpretation of noncoding variants associated with human disease. However, the biochemically active regions cover a much larger fraction of the genome than do evolutionarily conserved regions, raising the question of whether nonconserved but biochemically active regions are truly functional. Here, we review the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies. We also analyze the relationship between signal intensity, genomic coverage, and evolutionary conservation. Our results reinforce the principle that each approach provides complementary information and that we need to use combinations of all three to elucidate genome function in human biology and disease.

691 citations


Journal Article
TL;DR: FunSeq2 is a computational framework for annotating and prioritizing noncoding drivers among the thousands of somatic alterations in a typical tumor; it combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline.
Abstract: Identification of noncoding drivers from thousands of somatic alterations in a typical tumor is a difficult and unsolved problem. We report a computational framework, FunSeq2, to annotate and prioritize these mutations. The framework combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline. The pipeline has a weighted scoring system combining: inter- and intra-species conservation; loss- and gain-of-function events for transcription-factor binding; enhancer-gene linkages and network centrality; and per-element recurrence across samples. We further highlight putative drivers with information specific to a particular sample, such as differential expression. FunSeq2 is available from funseq2.gersteinlab.org.

314 citations
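
The weighted scoring idea described in the FunSeq2 abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the feature names, the weight values, the variant identifiers, and the simple additive rule are hypothetical and are not taken from FunSeq2 itself, whose actual weighting scheme is not specified in the abstract.

```python
# Hypothetical feature weights (assumed values, NOT those used by FunSeq2).
WEIGHTS = {
    "conserved": 1.0,          # inter- or intra-species conservation
    "tf_binding_change": 1.0,  # loss or gain of a transcription-factor binding event
    "enhancer_linked": 0.5,    # variant lies in an enhancer linked to a gene
    "network_hub": 0.5,        # linked gene is central in a network
    "recurrent": 1.0,          # element is mutated in several samples
}


def score_variant(features, weights=WEIGHTS):
    """Sum the weights of every feature flagged True for this variant."""
    return sum(w for name, w in weights.items() if features.get(name, False))


def prioritize(variants, weights=WEIGHTS):
    """Return the variants sorted from highest to lowest score."""
    return sorted(
        variants,
        key=lambda v: score_variant(v["features"], weights),
        reverse=True,
    )


# Toy input with made-up variant identifiers.
variants = [
    {"id": "var_a", "features": {"conserved": True, "recurrent": True}},
    {"id": "var_b", "features": {"enhancer_linked": True}},
    {"id": "var_c", "features": {name: True for name in WEIGHTS}},
]
ranked = prioritize(variants)
ranked_ids = [v["id"] for v in ranked]
```

In this toy input, `var_c` carries every feature and ranks first, followed by `var_a` and then `var_b`. A real pipeline would derive the feature flags from annotation resources rather than hard-coding them.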


Journal Article
Mark Gerstein, Joel Rozowsky, Koon-Kiu Yan, Daifeng Wang, Chao Cheng, James B. Brown, Carrie A. Davis, LaDeana W. Hillier, Cristina Sisu, Jingyi Jessica Li, Baikang Pei, Arif Harmanci, Michael O. Duff, Sarah Djebali, Roger P. Alexander, Burak H. Alver, Raymond K. Auerbach, Kimberly Bell, Peter J. Bickel, Max E. Boeck, Nathan Boley, Benjamin W. Booth, Lucy Cherbas, Peter Cherbas, Chao Di, Alexander Dobin, Jorg Drenkow, Brent Ewing, Gang Fang, Megan Fastuca, Elise A. Feingold, Adam Frankish, Guanjun Gao, Peter J. Good, Roderic Guigó, Ann S. Hammonds, Jen Harrow, Roger A. Hoskins, Cédric Howald, Long Hu, Haiyan Huang, Tim Hubbard, Chau Huynh, Sonali Jha, Dionna M. Kasper, Masaomi Kato, Thomas C. Kaufman, Robert R. Kitchen, Erik Ladewig, Julien Lagarde, Eric C. Lai, Jing Leng, Zhi Lu, Michael J. MacCoss, Gemma E. May, Rebecca McWhirter, Gennifer E. Merrihew, David M. Miller, Ali Mortazavi, Rabi Murad, Brian Oliver, Sara Olson, Peter J. Park, Michael J. Pazin, Norbert Perrimon, Dmitri D. Pervouchine, Valerie Reinke, Alexandre Reymond, Garrett Robinson, Anastasia Samsonova, Gary Saunders, Felix Schlesinger, Anurag Sethi, Frank J. Slack, William C. Spencer, Marcus H. Stoiber, Pnina Strasbourger, Andrea Tanzer, Owen Thompson, Kenneth H. Wan, Guilin Wang, Huaien Wang, Kathie L. Watkins, Jiayu Wen, Kejia Wen, Chenghai Xue, Li Yang, Kevin Y. Yip, Chris Zaleski, Yan Zhang, Henry Zheng, Steven E. Brenner, Brenton R. Graveley, Susan E. Celniker, Thomas R. Gingeras, Robert H. Waterston
28 Aug 2014-Nature
TL;DR: It is found in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a ‘universal model’ based on a single set of organism-independent parameters.
Abstract: The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.

284 citations


Journal Article
TL;DR: Compared with poly(A)+ lncRNAs and coding genes, poly(A)− lncRNAs tend to have shorter transcripts and lower expression levels and show significant expression specificity in response to stresses; their differential expression is significantly enriched under drought and depleted under heat.
Abstract: Recently, in addition to poly(A)+ long non-coding RNAs (lncRNAs), many lncRNAs without poly(A) tails have been characterized in mammals. However, the non-poly(A) lncRNAs and their conserved motifs, especially those associated with environmental stresses, have not been fully investigated in plant genomes. We performed poly(A)− RNA-seq for seedlings of Arabidopsis thaliana under four stress conditions, and predicted lncRNA transcripts. We classified the lncRNAs into three confidence levels according to their expression patterns, epigenetic signatures and RNA secondary structures. Then, we further classified the lncRNAs into poly(A)+ and poly(A)− transcripts. Compared with poly(A)+ lncRNAs and coding genes, we found that poly(A)− lncRNAs tend to have shorter transcripts and lower expression levels, and they show significant expression specificity in response to stresses. In addition, their differential expression is significantly enriched under drought conditions and depleted under heat conditions. Overall, we identified 245 poly(A)+ and 58 poly(A)− lncRNAs that are differentially expressed under various stress stimuli. The differential expression was validated by qRT-PCR, and the signaling pathways involved were supported by specific binding of transcription factors (TFs), phytochrome-interacting factor 4 (PIF4) and PIF5. Moreover, we found many conserved sequence and structural motifs of lncRNAs from different functional groups (e.g. a UUC motif responding to salt and an AU-rich stem-loop responding to cold), indicating that the conserved elements might be responsible for the stress-responsive functions of lncRNAs.

243 citations


Journal Article
28 Aug 2014-Nature
TL;DR: The results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections.
Abstract: Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.

184 citations


01 Aug 2014
TL;DR: In this article, the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors were mapped for a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time.
Abstract: Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.

167 citations


Journal Article
TL;DR: It is shown that a residue located in the kinase activation segment, which is termed the “DFG+1” residue, acts as a major determinant for serine-threonine phosphorylation site specificity.

96 citations


Journal Article
TL;DR: Overall, a broad spectrum of biochemical activity for pseudogenes is identified, with the majority in each organism exhibiting varying degrees of partial activity, suggesting a uniform degradation process.
Abstract: Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism’s genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (∼15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles.

66 citations


Journal Article
TL;DR: MUSIC is a signal processing approach that identifies enriched regions in ChIP-Seq data at multiple length scales; applied to RNA polymerase II data, it reveals a clear distinction between the stalled and elongating forms of the polymerase.
Abstract: We present MUSIC, a signal processing approach for identification of enriched regions in ChIP-Seq data, available at http://www.music.gersteinlab.org. MUSIC first filters the ChIP-Seq read-depth signal for systematic noise from non-uniform mappability, which fragments enriched regions. Then it performs a multiscale decomposition, using median filtering, identifying enriched regions at multiple length scales. This is useful given the wide range of scales probed in ChIP-Seq assays. MUSIC performs favorably in terms of accuracy and reproducibility compared with other methods. In particular, analysis of RNA polymerase II data reveals a clear distinction between the stalled and elongating forms of the polymerase.
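
The idea of median filtering at several length scales, mentioned in the abstract above, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the MUSIC implementation: the function names, the window sizes, the fixed threshold, and the toy read-depth values are all hypothetical, and the mappability correction step is omitted.

```python
from statistics import median


def median_filter(signal, window):
    """Sliding-window median; the window is clipped at the signal edges."""
    half = window // 2
    return [
        median(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]


def enriched_regions(signal, threshold):
    """Return half-open (start, end) runs where the signal exceeds threshold."""
    regions, start = [], None
    for i, value in enumerate(signal):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(signal)))
    return regions


def multiscale_regions(signal, windows, threshold):
    """Smooth at each window size (scale) and call regions on each result."""
    return {
        w: enriched_regions(median_filter(signal, w), threshold)
        for w in windows
    }


# Toy read-depth signal (made up): a one-bin spike and a broader peak.
depth = [0, 0, 9, 0, 0, 0, 5, 6, 7, 6, 5, 0, 0, 0]
result = multiscale_regions(depth, windows=[1, 5], threshold=3)
```

With a window of 1 the signal is unchanged, so both the one-bin spike and the broad peak are reported; with a window of 5 the spike is smoothed away and only the broad peak remains. This shows why examining several scales is useful when enriched regions vary widely in width.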

Journal Article
TL;DR: This review discusses how proteomics should be exploited to enhance high-throughput functional genomic analysis through tighter integration of data analyses, along with experimental strategies to achieve finer cellular and subcellular resolution in transcriptomic and proteomic studies of neural tissues.
Abstract: The immense intercellular and intracellular heterogeneity of the CNS presents major challenges for high-throughput omic analyses. Transcriptional, translational and post-translational regulatory events are localized to specific neuronal cell types or subcellular compartments, resulting in discrete patterns of protein expression and activity. A spatial and quantitative knowledge of the neuroproteome is therefore critical to understanding both normal and pathological aspects of the functional genomics and anatomy of the CNS. Improvements in mass spectrometry allow the profiling of proteins at a sufficient depth to complement results from high-throughput genomic and transcriptomic assays. However, there are challenges in integrating proteomic data with other data modalities and even greater challenges in obtaining comprehensive neuroproteomic data with cell-type specificity. Here we discuss how proteomics should be exploited to enhance high-throughput functional genomic analysis by tighter integration of data analyses. We also discuss experimental strategies to achieve finer cellular and subcellular resolution in transcriptomic and proteomic studies of neural tissues.

Journal Article
TL;DR: OrthoClust is a computational framework that integrates the co-association networks of individual species by utilizing the orthology relationships of genes between species and outputs optimized modules that are fundamentally cross-species, which can either be conserved or species-specific.
Abstract: Increasingly, high-dimensional genomics data are becoming available for many organisms. Here, we develop OrthoClust for simultaneously clustering data across multiple species. OrthoClust is a computational framework that integrates the co-association networks of individual species by utilizing the orthology relationships of genes between species. It outputs optimized modules that are fundamentally cross-species, which can either be conserved or species-specific. We demonstrate the application of OrthoClust using the RNA-Seq expression profiles of Caenorhabditis elegans and Drosophila melanogaster from the modENCODE consortium. A potential application of cross-species modules is to infer putative analogous functions of uncharacterized elements like non-coding RNAs based on guilt-by-association.

Journal Article
TL;DR: Using an epidemiological framework, it is suggested that heterogeneity among CD4+ T cells in the genital mucosa could help explain the low infection-to-exposure ratio and selection of the founder strain after sexual exposure to HIV.
Abstract: Worldwide, more than 250 people become infected with HIV every hour [1], yet an individual's chance of becoming infected after a single sexual exposure, the predominant mode of HIV transmission, is often lower than one in 100 [2]. When sexually transmitted HIV-1 infection does occur, it is usually initiated by a single virus, called the founder strain, despite the presence of thousands of genetically diverse viral strains in the transmitting partner [3]. Here we review evidence from molecular biology and virology suggesting that heterogeneity among CD4+ T cells could yield wide variation in the capability of individual cells to become infected and transmit HIV to other cells. Using an epidemiological framework, we suggest that such heterogeneity among CD4+ T cells in the genital mucosa could help explain the low infection-to-exposure ratio and selection of the founder strain after sexual exposure to HIV. During sexual transmission, founder viral strains preferentially infect CD4+ T cells using the CCR5 coreceptor [4], [5]. At the time of initial exposure to HIV, these CD4+ T cells exhibit baseline heterogeneity due to stochasticity in cellular gene expression [6] and dynamic variation in immunological status (activated, resting, etc.) [7]. In addition, because CD4+ T cells are mobile, they are heterogeneously distributed in the genital mucosa, with varying degrees of clustering and contact [8]–[11]. In other contexts, it is well-known that heterogeneity among isogeneic cells inside the body can affect many cellular behaviors and outcomes, including infection dynamics [12], [13]. Epidemiological analyses of disease outbreaks among people indicate that heterogeneity in the ability of individuals in a population to spread disease can have a significant impact on whether a local outbreak becomes an epidemic [14]. Heterogeneity among a population of CD4+ T cells may play a similarly critical role in the establishment and spread of HIV in the genital mucosa after sexual exposure.

Journal Article
TL;DR: The Encyclopedia of DNA Elements (ENCODE) catalog and similar data resources are viewed as important foundations for understanding the DNA elements and molecular mechanisms underlying human biology and disease.
Abstract: We agree with Brunet and Doolittle (1) on the utility of distinguishing the evolutionarily selected effects (SE) of some genomic elements from the causal roles (CR) of other elements that lack signatures of selection (1–4). DNA sequences identified by biochemical approaches include both SE and CR elements, and genetic variation in both has been implicated in human traits and disease susceptibility. We thus view the Encyclopedia of DNA Elements (ENCODE) catalog and similar data resources as important foundations for understanding the DNA elements and molecular mechanisms underlying human biology and disease.

Posted Content
30 Oct 2014-bioRxiv
TL;DR: This article characterizes the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates, and reveals a distinct class of genes whose expression levels across cell types and species have been constrained early in vertebrate evolution.
Abstract: We characterized by RNA-seq the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles obtained in human cell lines reveals substantial conservation of transcriptional programs, and uncovers a distinct class of genes whose expression levels across cell types and species have been constrained early in vertebrate evolution. This core set of genes captures a substantial and constant fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but it is associated with strong and conserved epigenetic marking, as well as with a characteristic post-transcriptional regulatory program in which sub-cellular localization and alternative splicing play comparatively large roles.

Proceedings Article
23 May 2014
TL;DR: This paper proposes and outlines a licensing scheme, similar to those used by professional organizations, that not only enforce a code of conduct and punish those who fail to live up to that code, but also mandate required continuing education to limit the possibility that the code will be violated inadvertently.
Abstract: The issues of privacy and disclosure are two sides of a weighty coin. Computational biologists and other scientists involved in genomic research need to be constantly cognizant of the push and pull of these two important concepts. Clinical genomics research in particular raises a number of particularly poignant concerns as society struggles between invasions of privacy such as recent efforts by the FBI and the NSA, and our own (surprisingly) personal disclosures on social media sites or via apathetic acquiescence to large data collection efforts. With regard to privacy there are numerous computational efforts that have heretofore offered to provide both the robustness of protection and the ease of use to be effective in manipulating the terabytes of data before the genomics researcher. Unfortunately algorithms alone have thus far failed to provide either the necessary strength to foil those intent on obtaining information or the promised agility to manipulate the vast datasets. While technical solutions advance, they cannot stand on their own and this paper proposes and outlines a licensing scheme, similar to those used by professional organizations, that not only enforce a code of conduct and punish those who fail to live up to that code, but also mandate required continuing education to limit the possibility that the code will be violated inadvertently. It is the use of the social and the technological advances together that will likely create not only an environment that fosters research and innovation, but also one that is responsive to privacy needs and norms.