Showing papers by "Mark Gerstein" published in 2018


Journal ArticleDOI
14 Dec 2018-Science
TL;DR: This work integrated genotypes and RNA sequencing in brain samples from 1695 individuals with autism spectrum disorder, schizophrenia, and bipolar disorder, as well as controls to identify causal drivers and define a mechanistic basis for the composite activity of genetic risk variants.
Abstract: Most genetic risk for psychiatric disease lies in regulatory regions, implicating pathogenic dysregulation of gene expression and splicing. However, comprehensive assessments of transcriptomic organization in diseased brains are limited. In this work, we integrated genotypes and RNA sequencing in brain samples from 1695 individuals with autism spectrum disorder (ASD), schizophrenia, and bipolar disorder, as well as controls. More than 25% of the transcriptome exhibits differential splicing or expression, with isoform-level changes capturing the largest disease effects and genetic enrichments. Coexpression networks isolate disease-specific neuronal alterations, as well as microglial, astrocyte, and interferon-response modules defining previously unidentified neural-immune mechanisms. We integrated genetic and genomic data to perform a transcriptome-wide association study, prioritizing disease loci likely mediated by cis effects on brain expression. This transcriptome-wide characterization of the molecular pathology across three major psychiatric disorders provides a comprehensive resource for mechanistic insight and therapeutic development.

791 citations


Journal ArticleDOI
Adam P. Arkin1, Adam P. Arkin2, Robert W. Cottingham3, Christopher S. Henry4, Nomi L. Harris1, Rick Stevens5, Sergei Maslov6, Paramvir S. Dehal1, Doreen Ware7, Fernando Perez, Shane Canon1, Michael W. Sneddon1, Matthew L. Henderson1, William J. Riehl1, Dan Murphy-Olson4, Stephen Y. Chan1, Roy T. Kamimura1, Sunita Kumari7, Meghan M Drake3, Thomas Brettin4, Elizabeth M. Glass4, Dylan Chivian1, Dan Gunter1, David J. Weston3, Benjamin H. Allen3, Jason K. Baumohl1, Aaron A. Best8, Benjamin P. Bowen1, Steven E. Brenner2, Christopher Bun4, John-Marc Chandonia1, Jer Ming Chia7, R. L. Colasanti4, Neal Conrad4, James J. Davis4, Brian H. Davison3, Matthew DeJongh8, Scott Devoid4, Emily M. Dietrich4, Inna Dubchak1, Janaka N. Edirisinghe5, Janaka N. Edirisinghe4, Gang Fang9, José P. Faria4, Paul M. Frybarger4, Wolfgang Gerlach4, Mark Gerstein9, Annette Greiner1, James Gurtowski7, Holly L. Haun3, Fei He6, Rashmi Jain10, Rashmi Jain1, Marcin P. Joachimiak1, Kevin P. Keegan4, Shinnosuke Kondo8, Vivek Kumar7, Miriam Land3, Folker Meyer4, Mark Mills3, Pavel S. Novichkov1, Taeyun Oh10, Taeyun Oh1, Gary J. Olsen11, Robert Olson4, Bruce Parrello4, Shiran Pasternak7, Erik Pearson1, Sarah S. Poon1, Gavin Price1, Srividya Ramakrishnan7, Priya Ranjan3, Priya Ranjan12, Pamela C. Ronald1, Pamela C. Ronald10, Michael C. Schatz7, Samuel M. D. Seaver4, Maulik Shukla4, Roman A. Sutormin1, Mustafa H Syed3, James Thomason7, Nathan L. Tintle8, Daifeng Wang9, Fangfang Xia4, Hyunseung Yoo4, Shinjae Yoo6, Dantong Yu6 
Abstract: Author(s): Arkin, Adam P; Cottingham, Robert W; Henry, Christopher S; Harris, Nomi L; Stevens, Rick L; Maslov, Sergei; Dehal, Paramvir; Ware, Doreen; Perez, Fernando; Canon, Shane; Sneddon, Michael W; Henderson, Matthew L; Riehl, William J; Murphy-Olson, Dan; Chan, Stephen Y; Kamimura, Roy T; Kumari, Sunita; Drake, Meghan M; Brettin, Thomas S; Glass, Elizabeth M; Chivian, Dylan; Gunter, Dan; Weston, David J; Allen, Benjamin H; Baumohl, Jason; Best, Aaron A; Bowen, Ben; Brenner, Steven E; Bun, Christopher C; Chandonia, John-Marc; Chia, Jer-Ming; Colasanti, Ric; Conrad, Neal; Davis, James J; Davison, Brian H; DeJongh, Matthew; Devoid, Scott; Dietrich, Emily; Dubchak, Inna; Edirisinghe, Janaka N; Fang, Gang; Faria, Jose P; Frybarger, Paul M; Gerlach, Wolfgang; Gerstein, Mark; Greiner, Annette; Gurtowski, James; Haun, Holly L; He, Fei; Jain, Rashmi; Joachimiak, Marcin P; Keegan, Kevin P; Kondo, Shinnosuke; Kumar, Vivek; Land, Miriam L; Meyer, Folker; Mills, Marissa; Novichkov, Pavel S; Oh, Taeyun; Olsen, Gary J; Olson, Robert; Parrello, Bruce; Pasternak, Shiran; Pearson, Erik; Poon, Sarah S; Price, Gavin A; Ramakrishnan, Srividya; Ranjan, Priya; Ronald, Pamela C; Schatz, Michael C; Seaver, Samuel MD; Shukla, Maulik; Sutormin, Roman A; Syed, Mustafa H; Thomason, James; Tintle, Nathan L; Wang, Daifeng; Xia, Fangfang; Yoo, Hyunseung; Yoo, Shinjae; Yu, Dantong

743 citations


Journal ArticleDOI
14 Dec 2018-Science
TL;DR: The resource and integrative analyses have uncovered genomic elements and networks in the brain, which in turn have provided insight into the molecular mechanisms underlying psychiatric disorders.
Abstract: Despite progress in defining genetic risk for psychiatric disorders, their molecular mechanisms remain elusive. Addressing this, the PsychENCODE Consortium has generated a comprehensive online resource for the adult brain across 1866 individuals. The PsychENCODE resource contains ~79,000 brain-active enhancers, sets of Hi-C linkages, and topologically associating domains; single-cell expression profiles for many cell types; expression quantitative-trait loci (QTLs); and further QTLs associated with chromatin, splicing, and cell-type proportions. Integration shows that varying cell-type proportions largely account for the cross-population variation in expression (with >88% reconstruction accuracy). It also allows building of a gene regulatory network, linking genome-wide association study variants to genes (e.g., 321 for schizophrenia). We embed this network into an interpretable deep-learning model, which improves disease prediction by ~6-fold versus polygenic risk scores and identifies key genes and pathways in psychiatric disorders.

684 citations


Journal ArticleDOI
14 Dec 2018-Science
TL;DR: The generation and analysis of a variety of genomic data modalities at the tissue and single-cell levels, including transcriptome, DNA methylation, and histone modifications across multiple brain regions ranging in age from embryonic development through adulthood, reveal insights into neurodevelopment and the genomic basis of neuropsychiatric risks.
Abstract: To broaden our understanding of human neurodevelopment, we profiled transcriptomic and epigenomic landscapes across brain regions and/or cell types for the entire span of prenatal and postnatal development. Integrative analysis revealed temporal, regional, sex, and cell type-specific dynamics. We observed a global transcriptomic cup-shaped pattern, characterized by a late fetal transition associated with sharply decreased regional differences and changes in cellular composition and maturation, followed by a reversal in childhood-adolescence, and accompanied by epigenomic reorganizations. Analysis of gene coexpression modules revealed relationships with epigenomic regulation and neurodevelopmental processes. Genes with genetic associations to brain-based traits and neuropsychiatric disorders (including MEF2C, SATB2, SOX5, TCF4, and TSHZ3) converged in a small number of modules and distinct cell types, revealing insights into neurodevelopment and the genomic basis of neuropsychiatric risks.

532 citations


Journal ArticleDOI
14 Dec 2018-Science
TL;DR: It is demonstrated that organoids from human pluripotent cells model cerebral cortical development on the molecular level before 16 weeks postconception, and validated hiPSC-derived cortical organoids as a suitable model system for studying gene regulation in human embryonic brain development, evolution, and disease.
Abstract: INTRODUCTION The human cerebral cortex has undergone an extraordinary increase in size and complexity during mammalian evolution. Cortical cell lineages are specified in the embryo, and genetic and epidemiological evidence implicates early cortical development in the etiology of neuropsychiatric disorders such as autism spectrum disorder (ASD), intellectual disabilities, and schizophrenia. Most of the disease-implicated genomic variants are located outside of genes, and the interpretation of noncoding mutations is lagging behind owing to limited annotation of functional elements in the noncoding genome. RATIONALE We set out to discover gene-regulatory elements and chart their dynamic activity during prenatal human cortical development, focusing on enhancers, which carry most of the weight upon regulation of gene expression. We longitudinally modeled human brain development using human induced pluripotent stem cell (hiPSC)–derived cortical organoids and compared organoids to isogenic fetal brain tissue. RESULTS Fetal fibroblast–derived hiPSC lines were used to generate cortically patterned organoids and to compare organoids’ epigenome and transcriptome to that of isogenic fetal brains and external datasets. Organoids model cortical development between 5 and 16 postconception weeks, thus enabling us to study transitions from cortical stem cells to progenitors to early neurons. The greatest changes occur at the transition from stem cells to progenitors. The regulatory landscape encompasses a total set of 96,375 enhancers linked to target genes, with 49,640 enhancers being active in organoids but not in mid-fetal brain, suggesting major roles in cortical neuron specification. Enhancers that gained activity in the human lineage are active in the earliest stages of organoid development, when they target genes that regulate the growth of radial glial cells. 
Parallel weighted gene coexpression network analysis (WGCNA) of transcriptome and enhancer activities defined a number of modules of coexpressed genes and coactive enhancers, following just six and four global temporal patterns that we refer to as supermodules, likely reflecting fundamental programs in embryonic and fetal brain. Correlations between gene expression and enhancer activity allowed stratifying enhancers into two categories: activating regulators (A-regs) and repressive regulators (R-regs). Several enhancer modules converged with gene modules, suggesting that coexpressed genes are regulated by enhancers with correlated patterns of activity. Furthermore, enhancers active in organoids and fetal brains were enriched for ASD de novo variants that disrupt binding sites of homeodomain, Hes1, NR4A2, Sox3, and NFIX transcription factors. CONCLUSION We validated hiPSC-derived cortical organoids as a suitable model system for studying gene regulation in human embryonic brain development, evolution, and disease. Our results suggest that organoids may reveal how noncoding mutations contribute to ASD etiology.

207 citations


Journal ArticleDOI
TL;DR: A controlled longitudinal weight perturbation study combining multiple omics strategies during periods of weight gain and loss in humans demonstrated that weight gain is associated with the activation of strong inflammatory and hypertrophic cardiomyopathy signatures in blood.
Abstract: Advances in omics technologies now allow an unprecedented level of phenotyping for human diseases, including obesity, in which individual responses to excess weight are heterogeneous and unpredictable. To aid the development of better understanding of these phenotypes, we performed a controlled longitudinal weight perturbation study combining multiple omics strategies (genomics, transcriptomics, multiple proteomics assays, metabolomics, and microbiomics) during periods of weight gain and loss in humans. Results demonstrated that: (1) weight gain is associated with the activation of strong inflammatory and hypertrophic cardiomyopathy signatures in blood; (2) although weight loss reverses some changes, a number of signatures persist, indicative of long-term physiologic changes; (3) we observed omics signatures associated with insulin resistance that may serve as novel diagnostics; (4) specific biomolecules were highly individualized and stable in response to perturbations, potentially representing stable personalized markers. Most data are available open access and serve as a valuable resource for the community.

159 citations


Journal ArticleDOI
01 Mar 2018-Genetics
TL;DR: These data will facilitate a vast number of scientific inquiries into the function of individual TFs in key developmental, metabolic, and defense and homeostatic regulatory pathways, as well as provide a broader perspective on how individual TFs work together in local networks and globally across the life spans of these two key model organisms.
Abstract: To develop a catalog of regulatory sites in two major model organisms, Drosophila melanogaster and Caenorhabditis elegans, the modERN (model organism Encyclopedia of Regulatory Networks) consortium has systematically assayed the binding sites of transcription factors (TFs). Combined with data produced by our predecessor, modENCODE (Model Organism ENCyclopedia Of DNA Elements), we now have data for 262 TFs identifying 1.23 M sites in the fly genome and 217 TFs identifying 0.67 M sites in the worm genome. Because sites from different TFs are often overlapping and tightly clustered, they fall into 91,011 and 59,150 regions in the fly and worm, respectively, and these binding sites span as little as 8.7 and 5.8 Mb in the two organisms. Clusters with large numbers of sites (so-called high occupancy target, or HOT regions) predominantly associate with broadly expressed genes, whereas clusters containing sites from just a few factors are associated with genes expressed in tissue-specific patterns. All of the strains expressing GFP-tagged TFs are available at the stock centers, and the chromatin immunoprecipitation sequencing data are available through the ENCODE Data Coordinating Center and also through a simple interface (http://epic.gs.washington.edu/modERN/) that facilitates rapid accessibility of processed data sets. These data will facilitate a vast number of scientific inquiries into the function of individual TFs in key developmental, metabolic, and defense and homeostatic regulatory pathways, as well as provide a broader perspective on how individual TFs work together in local networks and globally across the life spans of these two key model organisms.

145 citations


Journal ArticleDOI
TL;DR: These genomes identified the regions with the greatest sequence diversity between strains and a large, previously unannotated gene (Efcab3-like) encoding 5,874 amino acids; anomalies in mutant mice suggest a possible role for this gene in the regulation of brain development.
Abstract: We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.

138 citations


Journal ArticleDOI
TL;DR: In this article, the authors compared the evolutionary dynamics between the Muridae and Hominidae, two sets of genomes with similar divergence times, and found that the Hominidae show between four- and sevenfold lower rates of nucleotide change and feature turnover in both neutral and functional sequences.
Abstract: Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 million yr ago, but that are absent in the Hominidae. Hominidae show between four- and sevenfold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. Recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli, which resulted in thousands of novel, species-specific CTCF binding sites. Our results show that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.

88 citations


Journal ArticleDOI
28 Sep 2018-Science
TL;DR: The exploration of the link between SD-ASM, stochastic variation in DNA methylation, and gene regulation requires deep coverage by WGBS across tissues and individuals and the context of other epigenomic marks and gene transcription.
Abstract: INTRODUCTION A majority of imbalances in DNA methylation between homologous chromosomes in humans are sequence-dependent; the DNA sequence differences between the two chromosomes cause differences in the methylation state of neighboring cytosines on the same chromosome. The analyses of this sequence-dependent allele-specific methylation (SD-ASM) traditionally involved measurement of average methylation levels across many cells. Detailed understanding of SD-ASM at the single-cell and single-chromosome levels is lacking. This gap in understanding may hide the connection between SD-ASM, ubiquitous stochastic cell-to-cell and chromosome-to-chromosome variation in DNA methylation, and the puzzling and evolutionarily conserved patterns of intermediate methylation at gene regulatory loci. RATIONALE Whole-genome bisulfite sequencing (WGBS) provides the ultimate single-chromosome level of resolution and comprehensive whole-genome coverage required to explore SD-ASM. However, the exploration of the link between SD-ASM, stochastic variation in DNA methylation, and gene regulation requires deep coverage by WGBS across tissues and individuals and the context of other epigenomic marks and gene transcription. RESULTS We constructed maps of allelic imbalances in DNA methylation, histone marks, and gene transcription in 71 epigenomes from 36 distinct cell and tissue types from 13 donors. Deep (1691-fold) combined WGBS read coverage across 49 methylomes revealed CpG methylation imbalances exceeding 30% differences at 5% of the loci, which is more conservative than previous estimates in the 8 to 10% range; a similar value (8%) is observed in our dataset when we lowered our threshold for detecting allelic imbalance to 20% methylation difference between the two alleles. Extensive sequence-dependent CpG methylation imbalances were observed at thousands of heterozygous regulatory loci. 
Stochastic switching, defined as random transitions between fully methylated and unmethylated states of DNA, occurred at thousands of regulatory loci bound by transcription factors (TFs). Our results explain the conservation of intermediate methylation states at regulatory loci by showing that the intermediate methylation reflects the relative frequencies of fully methylated and fully unmethylated epialleles. SD-ASM is explainable by different relative frequencies of methylated and unmethylated epialleles for the two alleles. The differences in epiallele frequency spectra of the alleles at thousands of TF-bound regulatory loci correlated with the differences in alleles’ affinities for TF binding, which suggests a mechanistic explanation for SD-ASM. We observed an excess of rare variants among those showing SD-ASM, which suggests that an average human genome harbors at least ~200 detrimental rare variants that also show SD-ASM. The methylome’s sensitivity to genetic variation is unevenly distributed across the genome, which is consistent with buffering of housekeeping genes against the effects of random mutations. By contrast, less essential genes with tissue-specific expression patterns show sensitivity, thus providing opportunity for evolutionary innovation through changes in gene regulation. CONCLUSION Analysis of allelic epigenome maps provides a unifying model that links sequence-dependent allelic imbalances of the epigenome, stochastic switching at gene regulatory loci, selective buffering of the regulatory circuitry against the effects of random mutations, and disease-associated genetic variation.

80 citations


Journal ArticleDOI
TL;DR: Excluding mutations affecting low-mappability regions or occurring in certain mutational contexts was found to reduce artifacts, yet detection of sub-clonal mutations by WES in the absence of orthogonal validation remains unreliable.

Journal ArticleDOI
TL;DR: This work circumscribed the universe of all possible uORFs based on coding gene sequence motifs and identified 1.3 million unique uORFs, yielding a substantially larger catalog of functional uORFs than has previously been reported.
Abstract: Upstream open reading frames (uORFs) latent in mRNA transcripts are thought to modify translation of coding sequences by altering ribosome activity. Not all uORFs are thought to be active in such a process. To estimate the impact of uORFs on the regulation of translation in humans, we first circumscribed the universe of all possible uORFs based on coding gene sequence motifs and identified 1.3 million unique uORFs. To determine which of these are likely to be biologically relevant, we built a simple Bayesian classifier using 89 attributes of uORFs labeled as active in ribosome profiling experiments. This allowed us to extrapolate to a comprehensive catalog of likely functional uORFs. We validated our predictions using in vivo protein levels and ribosome occupancy from 46 individuals. This is a substantially larger catalog of functional uORFs than has previously been reported. Our ranked list of likely active uORFs allows researchers to test their hypotheses regarding the role of uORFs in health and disease. We demonstrate several examples of biological interest through the application of our catalog to somatic mutations in cancer and disease-associated germline variants in humans.

Journal ArticleDOI
TL;DR: FusorSV is developed, which uses a data mining approach to assess performance and merge callsets from an ensemble of SV-calling algorithms, and includes a fusion model built using analysis of 27 deep-coverage human genomes from the 1000 Genomes Project.
Abstract: Comprehensive and accurate identification of structural variations (SVs) from next generation sequencing data remains a major challenge. We develop FusorSV, which uses a data mining approach to assess performance and merge callsets from an ensemble of SV-calling algorithms. It includes a fusion model built using analysis of 27 deep-coverage human genomes from the 1000 Genomes Project. We identify 843 novel SV calls that were not reported by the 1000 Genomes Project for these 27 samples. Experimental validation of a subset of these calls yields a validation rate of 86.7%. FusorSV is available at https://github.com/TheJacksonLaboratory/SVE .


Journal ArticleDOI
TL;DR: The authors show sensitive information leakage is possible by analyzing functional genomics signal profiles, and develop an anonymization procedure for privacy protection.
Abstract: Functional genomics experiments, such as RNA-seq, provide non-individual specific information about gene expression under different conditions such as disease and normal. There is great desire to share these data. However, privacy concerns often preclude sharing of the raw reads. To enable safe sharing, aggregated summaries such as read-depth signal profiles and levels of gene expression are used. Projects such as GTEx and ENCODE share these because they ostensibly do not leak much identifying information. Here, we attempt to quantify the validity of this statement, measuring the leakage of genomic deletions from signal profiles. We present information theoretic measures for the degree to which one can genotype these deletions. We then develop practical genotyping approaches and demonstrate how to use these to identify an individual within a large cohort in the context of linking attacks. Finally, we present an anonymization method removing much of the leakage from signal profiles.

Journal ArticleDOI
TL;DR: An approach to interrogate phosphorylation and its role in protein-protein interactions on a proteome-wide scale was developed and hundreds of known and potentially new phosphoserine-dependent interactors with 14-3-3 proteins and WW domains were found.
Abstract: Post-translational phosphorylation is essential to human cellular processes, but the transient, heterogeneous nature of this modification complicates its study in native systems. We developed an approach to interrogate phosphorylation and its role in protein-protein interactions on a proteome-wide scale. We genetically encoded phosphoserine in recoded E. coli and generated a peptide-based heterologous representation of the human serine phosphoproteome. We designed a single-plasmid library encoding >100,000 human phosphopeptides and confirmed the site-specific incorporation of phosphoserine in >36,000 of these peptides. We then integrated our phosphopeptide library into an approach known as Hi-P to enable proteome-level screens for serine-phosphorylation-dependent human protein interactions. Using Hi-P, we found hundreds of known and potentially new phosphoserine-dependent interactors with 14-3-3 proteins and WW domains. These phosphosites retained important binding characteristics of the native human phosphoproteome, as determined by motif analysis and pull-downs using full-length phosphoproteins. This technology can be used to interrogate user-defined phosphoproteomes in any organism, tissue, or disease of interest.

Journal ArticleDOI
TL;DR: It is shown that molecular biological networks can be interpreted in several straightforward ways, and key aspects of molecular networks are dynamics and evolution, i.e., how they evolve over time and how genetic variants affect them.
Abstract: Biomedical data scientists study many types of networks, ranging from those formed by neurons to those created by molecular interactions. People often criticize these networks as uninterpretable diagrams termed hairballs; however, here we show that molecular biological networks can be interpreted in several straightforward ways. First, we can break down a network into smaller components, focusing on individual pathways and modules. Second, we can compute global statistics describing the network as a whole. Third, we can compare networks. These comparisons can be within the same context (e.g., between two gene regulatory networks) or cross-disciplinary (e.g., between regulatory networks and governmental hierarchies). The latter comparisons can transfer a formalism, such as that for Markov chains, from one context to another or relate our intuitions in a familiar setting (e.g., social networks) to the relatively unfamiliar molecular context. Finally, key aspects of molecular networks are dynamics and evolut...

Journal ArticleDOI
TL;DR: Alignments of salivary small RNA-Seq reads to exogenous genomes, together with their quantification results, were used to analyze small RNAs of exogenous origin and revealed that almost all of the reads map to bacterial genomes.
Abstract: Motivation Analysis of RNA sequencing (RNA-Seq) data in human saliva is challenging. Lack of standardization and unification of the bioinformatic procedures undermines saliva's diagnostic potential. Thus, it motivated us to perform this study. Results We applied principal pipelines for bioinformatic analysis of small RNA-Seq data of saliva of 98 healthy Korean volunteers including either direct or indirect mapping of the reads to the human genome using Bowtie1. Analysis of alignments to exogenous genomes by another pipeline revealed that almost all of the reads map to bacterial genomes. Thus, salivary exRNA has fundamental properties that warrant the design of unique additional steps while performing the bioinformatic analysis. Our pipelines can serve as potential guidelines for processing of RNA-Seq data of human saliva. Availability and implementation Processing and analysis results of the experimental data generated by the exceRpt (v4.6.3) small RNA-seq pipeline (github.gersteinlab.org/exceRpt) are available from exRNA atlas (exrna-atlas.org). Alignment to exogenous genomes and their quantification results were used in this paper for the analyses of small RNAs of exogenous origin. Contact dtww@ucla.edu.

Journal ArticleDOI
TL;DR: This work exploits transcript-level expression from RNA-seq to set prior likelihoods and enable protein isoform abundances to be directly estimated from LC-MS/MS, an approach derived from the principle that most genes appear to be expressed as a single dominant isoform in a given cell type or tissue.
Abstract: Cellular control of gene expression is a complex process that is subject to multiple levels of regulation, but ultimately it is the protein produced that determines the biosynthetic state of the ce...

Posted ContentDOI
02 Jul 2018-bioRxiv
TL;DR: This work reanalyzes the evidence used by CHESS and finds that nearly all of its new protein-coding predictions are false positives, and that 86% overlap transposons marked by RepeatMasker that are known to frequently result in false positive protein-coding predictions.
Abstract: In a 2018 paper posted to bioRxiv, Pertea et al. presented the CHESS database, a new catalog of human gene annotations that includes 1,178 new protein-coding predictions. These are based on evidence of transcription in human tissues and homology to earlier annotations in human and other mammals. Here, we reanalyze the evidence used by CHESS, and find that nearly all protein-coding predictions are false positives. We find that 86% overlap transposons marked by RepeatMasker that are known to frequently result in false positive protein-coding predictions. More than half are homologous to only nine Alu-derived primate sequences corresponding to an erroneous and previously withdrawn Pfam protein domain. The entire set shows poor evolutionary conservation and PhyloCSF protein-coding evolutionary signatures indistinguishable from noncoding RNAs, indicating lack of protein-coding constraint. Only four predictions are supported by mass spectrometry evidence, and even those matches are inconclusive. Overall, the new protein-coding predictions are unsupported by any credible experimental or evolutionary evidence of function, result primarily from homology to genes incorrectly classified as protein-coding, and are unlikely to encode functional proteins.

Journal ArticleDOI
09 Nov 2018-Science
TL;DR: 21 Lessons for the 21st Century is the third in a trilogy of books that examines the annals of humanity. Divided into five parts and 21 chapters, it discusses the near future, as extrapolated from our current trajectories.
Abstract: Historian Yuval Noah Harari's new book is his third in a trilogy that examines the annals of humanity. Divided into five parts and 21 chapters, it discusses the near future, as extrapolated from our current trajectories. Like his other bestsellers, 21 Lessons for the 21st Century is meant for a broad audience, and Harari effortlessly jumps between diverse topics, from biology and information science to history, religion, and philosophy.

Posted ContentDOI
12 Feb 2018-bioRxiv
TL;DR: A high-quality collection of genomes revealed a previously unannotated gene (Efcab3-like) encoding 5,874 amino acids, one of the largest known in the rodent lineage. Efcab3-like-/- mice exhibit severe size anomalies in four regions of the brain, suggesting a mechanism of Efcab3-like regulating brain development.
Abstract: The most commonly employed mammalian model organism is the laboratory mouse. A wide variety of genetically diverse inbred mouse strains, representing distinct physiological states, disease susceptibilities, and biological mechanisms have been developed over the last century. We report full length draft de novo genome assemblies for 16 of the most widely used inbred strains and reveal for the first time extensive strain-specific haplotype variation. We identify and characterise 2,567 regions on the current Genome Reference Consortium mouse reference genome exhibiting the greatest sequence diversity between strains. These regions are enriched for genes involved in defence and immunity, and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. Several immune related loci, some in previously identified QTLs for disease response have novel haplotypes not present in the reference that may explain the phenotype. We used these genomes to improve the mouse reference genome resulting in the completion of 10 new gene structures, and 62 new coding loci were added to the reference genome annotation. Notably this high quality collection of genomes revealed a previously unannotated gene (Efcab3-like) encoding 5,874 amino acids, one of the largest known in the rodent lineage. Interestingly, Efcab3-like-/- mice exhibit severe size anomalies in four regions of the brain suggesting a mechanism of Efcab3-like regulating brain development.

Posted ContentDOI
12 Mar 2018-bioRxiv
TL;DR: An additive effects model derived from complex trait studies is adapted to show that aggregating the impact of putative passenger variants provides significant predictability for cancer phenotypes beyond the PCAWG identified driver mutations (12.5% additive variance).
Abstract: The Pan-cancer Analysis of Whole Genomes (PCAWG) project provides an unprecedented opportunity to comprehensively characterize a vast set of uniformly annotated coding and non-coding mutations present in thousands of cancer genomes. Classical models of cancer progression posit that only a small number of these mutations strongly drive tumor progression and that the remaining ones (termed putative passengers) are inconsequential for tumorigenesis. In this study, we leveraged the comprehensive variant data from PCAWG to ascertain the molecular functional impact of each variant. The impact distribution of PCAWG mutations shows that, in addition to high- and low-impact mutations, there is a group of medium-impact putative passengers predicted to influence gene activity. Moreover, the predicted impact relates to the underlying mutational signature: different signatures confer divergent impact, differentially affecting distinct regulatory subsystems and gene categories. We also find that impact varies based on subclonal architecture (i.e., early vs. late mutations) and can be related to patient survival. Finally, we note that insufficient power due to limited cohort sizes precludes identification of weak drivers using standard recurrence-based approaches. To address this, we adapted an additive effects model derived from complex trait studies to show that aggregating the impact of putative passenger variants (i.e., including yet undetected weak drivers) provides significant predictability for cancer phenotypes beyond the PCAWG identified driver mutations (12.5% additive variance). Furthermore, this framework allowed us to estimate the frequency of potential weak driver mutations in the subset of PCAWG samples lacking well-characterized driver alterations.

Posted ContentDOI
24 Jan 2018-bioRxiv
TL;DR: Excluding mutations affecting low mappability regions or occurring in certain mutational contexts was found to reduce artifacts, yet detection of subclonal mutations by WES in the absence of orthogonal validation remains unreliable.
Abstract: Multi-region sequencing is used to detect intratumor genetic heterogeneity (ITGH) in tumors. To assess whether genuine ITGH can be distinguished from sequencing artifacts, we whole-exome sequenced (WES) three anatomically distinct regions of the same tumor with technical replicates to estimate technical noise. Somatic variants were detected with three different WES pipelines and subsequently validated by high-depth amplicon sequencing. The cancer-only pipeline was unreliable, with about 69% of the identified somatic variants being false positive. Even with matched normal DNA where 82% of the somatic variants were detected reliably, only 36%-78% were found consistently in technical replicate pairs. Overall 34%-80% of the discordant somatic variants, which could be interpreted as ITGH, were found to constitute technical noise. Excluding mutations affecting low mappability regions or occurring in certain mutational contexts was found to reduce artifacts, yet detection of subclonal mutations by WES in the absence of orthogonal validation remains unreliable.

Posted ContentDOI
05 Aug 2018-bioRxiv
TL;DR: This work develops a framework that uses Drosophila STARR-seq data to create shape-matching filters based on enhancer-associated meta-profiles of epigenetic features, and demonstrates that transcription-factor binding patterns at predicted enhancers and promoters enable the construction of a secondary model that effectively discriminates between the two.
Author(s): Sethi, Anurag; Gu, Mengting; Gumusgoz, Emrah; Chan, Landon; Yan, Koon-Kiu; Rozowsky, Joel; Barozzi, Iros; Afzal, Veena; Akiyama, Jennifer; Plajzer-Frick, Ingrid; Yan, Chengfei; Pickle, Catherine; Kato, Momoe; Garvin, Tyler; Pham, Quan; Harrington, Anne; Mannion, Brandon; Lee, Elizabeth; Fukuda-Yuzawa, Yoko; Visel, Axel; Dickel, Diane; Yip, Kevin; Sutton, Richard; Pennacchio, Len; Gerstein, Mark
Abstract: Enhancers are important noncoding elements, but they have been traditionally hard to characterize experimentally. Only a few mammalian enhancers have been validated, making it difficult to train statistical models for their identification properly. Instead, postulated patterns of genomic features have been used heuristically for identification. The development of massively parallel assays allows for the characterization of large numbers of enhancers for the first time. Here, we developed a framework that uses Drosophila STARR-seq data to create shape-matching filters based on enhancer-associated meta-profiles of epigenetic features. We combined these features with supervised machine learning algorithms (e.g., support vector machines) to predict enhancers. We demonstrated that our model could be applied to predict enhancers in mammalian species (i.e., mouse and human). We comprehensively validated the predictions using a combination of in vivo and in vitro approaches, involving transgenic assays in mouse and transduction-based reporter assays in human cell lines. Overall, the validations involved 153 enhancers in 6 mouse tissues and 4 human cell lines. The results confirmed that our model can accurately predict enhancers in different species without re-parameterization. Finally, we examined the transcription-factor binding patterns at predicted enhancers and promoters in human cell lines. We demonstrated that these patterns enable the construction of a secondary model effectively discriminating between enhancers and promoters.


Journal ArticleDOI
TL;DR: This work introduces Mutations Overburdening Annotations Tool (MOAT), a non‐parametric scheme that makes no assumptions about mutation process except requiring that the BMR changes smoothly with genomic features.
Abstract: Summary: Identifying genomic regions with higher than expected mutation count is useful for cancer driver detection. Previous parametric approaches require numerous cell-type-matched covariates for accurate background mutation rate (BMR) estimation, which is not practical for many situations. Non-parametric, permutation-based approaches avoid this issue but usually suffer from considerable compute-time cost. Hence, we introduce Mutations Overburdening Annotations Tool (MOAT), a non-parametric scheme that makes no assumptions about mutation process except requiring that the BMR changes smoothly with genomic features. MOAT randomly permutes single-nucleotide variants, or target regions, on a relatively large scale to provide robust burden analysis. Furthermore, we show how we can do permutations in an efficient manner using graphics processing unit acceleration, speeding up the calculation by a factor of ∼250. Availability and implementation: MOAT is available at moat.gersteinlab.org. Contact: mark@gersteinlab.org. Supplementary information: Supplementary data are available at Bioinformatics online.
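The permutation idea described in this abstract can be illustrated with a minimal sketch. This is not MOAT's implementation: the function name `permutation_burden_pvalue`, the `window` parameter, and the uniform local-shuffling rule are all assumptions chosen for illustration, and the sketch runs on the CPU rather than a GPU. It moves each variant to a random nearby position, recounts how many land in the target region, and reports an empirical p-value for the observed burden.

```python
import random

def permutation_burden_pvalue(variant_positions, region, genome_length,
                              window, n_perm=1000, seed=0):
    """Empirical p-value for the mutation burden of `region` = (start, end).

    Hypothetical helper, not part of MOAT. Each permutation moves every
    variant to a uniformly random position within +/- `window` of its
    original location, so the local mutation density is roughly preserved.
    """
    rng = random.Random(seed)
    start, end = region
    # Observed burden: variants falling in the half-open interval [start, end).
    observed = sum(start <= p < end for p in variant_positions)

    hits = 0
    for _ in range(n_perm):
        count = 0
        for p in variant_positions:
            lo = max(0, p - window)
            hi = min(genome_length - 1, p + window)
            q = rng.randint(lo, hi)  # inclusive on both ends
            if start <= q < end:
                count += 1
        # Count permutations at least as burdened as the observed data.
        if count >= observed:
            hits += 1

    # Add-one correction keeps the empirical p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

A region containing a tight cluster of variants should receive a small p-value, while a region with no variants trivially receives a p-value of 1. The inner loop over variants is independent across permutations, which is the kind of workload that parallel hardware can accelerate.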

Posted ContentDOI
12 Jun 2018-bioRxiv
TL;DR: A data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage is developed, thus enabling principled privacy-utility trade-offs.
Abstract: Functional genomics experiments provide data on aspects of gene function in a variety of conditions and how they relate to organismal phenotype (e.g. "genes upregulated in AIDS"). These experiments do not necessarily concern findings on identifiable individuals, leading to a neglect of their privacy issues; however, for each experiment, it is possible to create "cryptic quasi-identifiers" statistically linking them back to individuals and thereby leaking sensitive phenotypic information (e.g. "HIV status"). Here, we develop metrics for quantifying this leakage and instantiate them in practical linking attacks. As genotyping noise is a crucial quantity for the feasibility of attacks, we perform them both with highly accurate reference genomics datasets as well as by generating RNA and DNA data from more realistic environmental samples. Finally, in order to reduce leakage, we develop a data-sanitization protocol for making principled privacy-utility trade-offs, permitting the sharing of functional genomics data while minimizing risk of leakage.

Posted ContentDOI
12 Jun 2018-bioRxiv
TL;DR: This work provides a proof-of-concept analytic framework in which the amount of leaked information can be estimated from the depth and breadth of the coverage as well as sequencing biases of a given functional genomics experiment, and proposes file formats that maximize the potential sharing of data while protecting individuals' sensitive information.
Abstract: Functional genomics experiments on human subjects present a privacy conundrum. On one hand, many of the conclusions we infer from these experiments are not tied to the identity of individuals but represent universal statements about biology and disease. On the other hand, by virtue of the experimental procedure, the sequencing reads are tagged with small bits of patients' variant information, which presents privacy challenges in terms of data sharing. There is great desire to share data as broadly as possible. Therefore, measuring the amount of variant information leaked in a variety of experiments, particularly in relation to the amount of sequencing, is a key first step in reducing information leakage and determining an appropriate set point for sharing with minimal leakage. To this end, we derived information-theoretic measures for the private information leaked in experiments and developed various file formats to reduce this during sharing. We show that high-depth experiments such as Hi-C provide accurate genotyping that can lead to large privacy leaks. Counterintuitively, low-depth experiments such as ChIP and single-cell RNA sequencing, although not useful for genotyping, can create strong quasi-identifiers for re-identification through linking attacks. We show that partial and incomplete genotypes from many of these experiments can further be combined to construct an individual's complete variant set and identify phenotypes. We provide a proof-of-concept analytic framework, in which the amount of leaked information can be estimated from the depth and breadth of the coverage as well as sequencing biases of a given functional genomics experiment. Finally, as a practical instantiation of our framework, we propose file formats that maximize the potential sharing of data while protecting individuals' sensitive information. Depending on the desired sharing set point, our proposed format can achieve differential trade-offs in the privacy-utility balance. At the highest level of privacy, we mask all the variants leaked from reads, but still can create useable signal profiles that give complete recovery of the original gene expression levels.
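One simple way to put a number on leaked variant information is self-information: a genotype seen in a fraction f of the population contributes -log2(f) bits toward singling out an individual. The sketch below sums this over a set of leaked genotypes. It is an illustrative assumption rather than the measure derived in the paper: the function name `leaked_bits` is hypothetical, and the sum is only valid if the variants are treated as statistically independent.

```python
import math

def leaked_bits(genotype_frequencies):
    """Total self-information, in bits, of a set of observed genotypes.

    Hypothetical helper: `genotype_frequencies` holds, for each leaked
    variant, the population frequency of the genotype that was observed.
    Variants are assumed independent, so their information adds up.
    """
    total = 0.0
    for f in genotype_frequencies:
        if not 0.0 < f <= 1.0:
            raise ValueError("each frequency must lie in (0, 1]")
        total += -math.log2(f)
    return total
```

For example, a genotype carried by half the population contributes 1 bit and one carried by a quarter contributes 2 bits, so rare genotypes dominate the total even when only a few variants are recovered.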

Journal ArticleDOI
TL;DR: This work develops a methodology of ensembling, Multi-Swarm Ensemble (MSWE), by using multiple particle swarm optimizations and demonstrates its ability to further enhance the performance of ensembles.
Abstract: Machine learning is an integral part of computational biology, and has already shown its use in various applications, such as prognostic tests. In the last few years in the non-biological machine learning community, ensembling techniques have shown their power in data mining competitions such as the Netflix challenge; however, such methods have not found wide use in computational biology. In this work, we endeavor to show how ensembling techniques can be applied to practical problems, including problems in the field of bioinformatics, and how they often outperform other machine learning techniques in both predictive power and robustness. Furthermore, we develop a methodology of ensembling, Multi-Swarm Ensemble (MSWE) by using multiple particle swarm optimizations and demonstrate its ability to further enhance the performance of ensembles.
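The basic intuition behind ensembling can be shown with a minimal majority-vote sketch. This is deliberately not MSWE and involves no particle swarm optimization; the function name `majority_vote` and the toy labels are assumptions for illustration only. It combines the label predictions of several base models by taking the most common label for each sample.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions by majority vote.

    `predictions` is a list with one entry per model; each entry is a list
    of predicted labels, one per sample. With an even number of models a
    tie is possible; it is resolved in favor of the label that appears
    first among the models' votes.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    # zip(*predictions) yields, for each sample, the tuple of all model votes.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

If three models each make a mistake on a different sample, the vote recovers the correct label everywhere, which is the robustness benefit that ensembling aims for.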