
Showing papers by "Mark Gerstein published in 2019"


Journal ArticleDOI
TL;DR: This work generates primary data, creates bioinformatics tools and provides analysis to support the work of expert manual gene annotators and automated gene annotation pipelines to identify and characterise gene loci to the highest standard.
Abstract: The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

2,095 citations


Journal ArticleDOI
TL;DR: The authors explore the potential of the 16S gene for discriminating bacterial taxa and show that full-length sequencing combined with appropriate clustering of intragenomic sequence variation can provide accurate representation of bacterial species in microbiome datasets.
Abstract: The 16S rRNA gene has been a mainstay of sequence-based bacterial analysis for decades. However, high-throughput sequencing of the full gene has only recently become a realistic prospect. Here, we use in silico and sequence-based experiments to critically re-evaluate the potential of the 16S gene to provide taxonomic resolution at species and strain level. We demonstrate that targeting of 16S variable regions with short-read sequencing platforms cannot achieve the taxonomic resolution afforded by sequencing the entire (~1500 bp) gene. We further demonstrate that full-length sequencing platforms are sufficiently accurate to resolve subtle nucleotide substitutions (but not insertions/deletions) that exist between intragenomic copies of the 16S gene. In consequence, we argue that modern analysis approaches must necessarily account for intragenomic variation between 16S gene copies. In particular, we demonstrate that appropriate treatment of full-length 16S intragenomic copy variants has the potential to provide taxonomic resolution of bacterial communities at species and strain level.
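The intragenomic copy-variant idea above can be sketched in code. This is an illustrative toy only (the sequences and the ≤2-substitution threshold are invented; real pipelines operate on aligned ~1,500 bp full-length genes): copies of the 16S gene from a single genome that differ by a few substitutions are collapsed into one representative, so that copy variants are not mistaken for distinct taxa.

```python
# Hypothetical sketch: collapse intragenomic 16S copy variants into one
# representative per genome before taxonomic assignment. Copies that differ
# only by a few substitutions (here <= 2 mismatches over the alignment)
# are treated as variants of the same gene rather than distinct taxa.

def hamming(a, b):
    """Substitution count between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def collapse_copies(copies, max_subs=2):
    """Greedy clustering of aligned 16S copies by substitution distance."""
    clusters = []  # each cluster keeps its first sequence as representative
    for seq in copies:
        for cluster in clusters:
            if hamming(cluster[0], seq) <= max_subs:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

# Three intragenomic copies: two near-identical, one divergent.
copies = ["ACGTACGTAC", "ACGTACGTAT", "TTGTACGAAC"]
clusters = collapse_copies(copies)
print(len(clusters))  # 2: the two copy variants merge, the divergent copy stays
```

A real workflow would use a proper aligner and an error model for the sequencing platform; the point is only that clustering happens *within* a genome's copies before species-level comparison.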

859 citations


Journal ArticleDOI
Mark Chaisson1, Mark Chaisson2, Ashley D. Sanders, Xuefang Zhao3, Xuefang Zhao4, Ankit Malhotra, David Porubsky5, David Porubsky6, Tobias Rausch, Eugene J. Gardner7, Oscar L. Rodriguez8, Li Guo9, Ryan L. Collins3, Xian Fan10, Jia Wen11, Robert E. Handsaker3, Robert E. Handsaker12, Susan Fairley13, Zev N. Kronenberg1, Xiangmeng Kong14, Fereydoun Hormozdiari15, Dillon Lee16, Aaron M. Wenger17, Alex Hastie, Danny Antaki18, Thomas Anantharaman, Peter A. Audano1, Harrison Brand3, Stuart Cantsilieris1, Han Cao, Eliza Cerveira, Chong Chen10, Xintong Chen7, Chen-Shan Chin17, Zechen Chong10, Nelson T. Chuang7, Christine C. Lambert17, Deanna M. Church, Laura Clarke13, Andrew Farrell16, Joey Flores19, Timur R. Galeev14, David U. Gorkin18, David U. Gorkin20, Madhusudan Gujral18, Victor Guryev6, William Haynes Heaton, Jonas Korlach17, Sushant Kumar14, Jee Young Kwon21, Ernest T. Lam, Jong Eun Lee, Joyce V. Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li14, Patrick Marks, Karine A. Viaud-Martinez19, Sascha Meiers, Katherine M. Munson1, Fabio C. P. Navarro14, Bradley J. Nelson1, Conor Nodzak11, Amina Noor18, Sofia Kyriazopoulou-Panagiotopoulou, Andy Wing Chun Pang, Yunjiang Qiu18, Yunjiang Qiu20, Gabriel Rosanio18, Mallory Ryan, Adrian M. Stütz, Diana C.J. Spierings6, Alistair Ward16, Anne Marie E. Welch1, Ming Xiao22, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley13, Ernesto Lowy13, Sergei Yakneen, Steven A. McCarroll12, Steven A. McCarroll3, Goo Jun23, Li Ding24, Chong-Lek Koh25, Bing Ren18, Bing Ren20, Paul Flicek13, Ken Chen10, Mark Gerstein, Pui-Yan Kwok26, Peter M. Lansdorp27, Peter M. Lansdorp28, Peter M. Lansdorp6, Gabor T. Marth16, Jonathan Sebat18, Xinghua Shi11, Ali Bashir8, Kai Ye9, Scott E. Devine7, Michael E. Talkowski12, Michael E. Talkowski3, Ryan E. Mills4, Tobias Marschall5, Jan O. Korbel13, Evan E. Eichler1, Charles Lee21 
TL;DR: A suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms is applied to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner.
Abstract: The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.

606 citations


Journal ArticleDOI
04 Sep 2019-Neuron
TL;DR: This work ties cell-cycle progression with early cell fate decisions during neurogenesis, demonstrating that differentiation occurs on a transcriptomic continuum; rather than only expressing a few transcription factors that drive cell fates, differentiating cells express broad, mixed cell-type transcriptomes before telophase.

315 citations


Journal ArticleDOI
04 Apr 2019-Cell
TL;DR: A model with six exRNA cargo types, each detectable in multiple biofluids, is presented and tools for deconvolution and analysis of user-provided case-control studies are provided to enable wide application of this model.

201 citations


Journal ArticleDOI
TL;DR: Phenotypic annotation of all human genes; development of bioinformatic tools and analytic methods; exploration of non-Mendelian modes of inheritance including reduced penetrance, multilocus variation, and oligogenic inheritance; construction of allelic series at a locus; enhanced data sharing worldwide; and integration with clinical genomics are explored.

149 citations



Journal ArticleDOI
TL;DR: This work assesses reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices, to identify low-quality experiments.
Abstract: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments. In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundajelab/3DChromatin_ReplicateQC to facilitate adoption in the community.
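The deficiency of simple correlation mentioned above can be demonstrated with a toy example (synthetic matrices, not the paper's data or software): the strong distance-decay signal shared by all Hi-C matrices dominates a matrix-wide Pearson correlation, masking disagreement in the fine structure, whereas stratifying by genomic distance, the idea behind HiCRep-style measures, exposes it.

```python
# Toy demonstration of why naive correlation overstates Hi-C reproducibility.
# Two synthetic "replicates" share the distance-decay component but have
# perfectly anti-correlated fine structure within each diagonal.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n = 20
decay = lambda d: 100.0 / (1 + d)            # shared distance-decay component
sign = lambda i: 5.0 if i % 2 == 0 else -5.0  # fine structure, flipped between reps

a = {(i, j): decay(j - i) + sign(i) for i in range(n) for j in range(i, n)}
b = {(i, j): decay(j - i) - sign(i) for i in range(n) for j in range(i, n)}

naive = pearson(list(a.values()), list(b.values()))

# Distance-stratified: correlate within each diagonal, then average.
per_diag = []
for d in range(n - 1):                        # skip the length-1 last diagonal
    xs = [a[(i, i + d)] for i in range(n - d)]
    ys = [b[(i, i + d)] for i in range(n - d)]
    per_diag.append(pearson(xs, ys))
stratified = sum(per_diag) / len(per_diag)

print(f"naive={naive:.2f} stratified={stratified:.2f}")
# naive is high (decay-driven) even though the replicates disagree everywhere
```

The dedicated methods benchmarked in the paper (HiCRep, GenomeDISCO, HiC-Spector, QuASAR-Rep) are far more sophisticated, but all address this same confound in some form.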

105 citations


Journal ArticleDOI
TL;DR: ExceRpt, the exRNA-processing toolkit of the NIH Extracellular RNA Communication Consortium (ERCC), is presented, structured as a cascade of filters and quantifications prioritized based on one's confidence in a given set of annotated RNAs.
Abstract: Small RNA sequencing has been widely adopted to study the diversity of extracellular RNAs (exRNAs) in biofluids; however, the analysis of exRNA samples can be challenging: they are vulnerable to contamination and artifacts from different isolation techniques, present in lower concentrations than cellular RNA, and occasionally of exogenous origin. To address these challenges, we present exceRpt, the exRNA-processing toolkit of the NIH Extracellular RNA Communication Consortium (ERCC). exceRpt is structured as a cascade of filters and quantifications prioritized based on one's confidence in a given set of annotated RNAs. It generates quality control reports and abundance estimates for RNA biotypes. It is also capable of characterizing mappings to exogenous genomes, which, in turn, can be used to generate phylogenetic trees. exceRpt has been used to uniformly process all ∼3,500 exRNA-seq datasets in the public exRNA Atlas and is available from genboree.org and github.gersteinlab.org/exceRpt.
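The "cascade of filters" idea can be illustrated with a minimal sketch (the category names and read sequences below are invented, not exceRpt's actual configuration): each read is tested against annotation sets in decreasing order of confidence and counted at the first match, so high-confidence annotations are never claimed by lower-priority ones.

```python
# Minimal sketch of a priority cascade for exRNA read assignment.
# Priority-ordered (annotation, set of read sequences it explains).
cascade = [
    ("contaminant/rRNA", {"AAAA", "CCCC"}),
    ("miRNA",            {"CCCC", "GGGG"}),   # CCCC also matches, but rRNA wins
    ("tRNA",             {"TTTT"}),
    ("exogenous",        {"ACGT"}),
]

def assign(read):
    """Return the first (highest-confidence) annotation matching the read."""
    for label, members in cascade:
        if read in members:
            return label
    return "unmapped"

counts = {}
for read in ["AAAA", "CCCC", "GGGG", "TTTT", "ACGT", "NNNN"]:
    label = assign(read)
    counts[label] = counts.get(label, 0) + 1
print(counts)
# {'contaminant/rRNA': 2, 'miRNA': 1, 'tRNA': 1, 'exogenous': 1, 'unmapped': 1}
```

A real implementation matches reads by alignment rather than exact lookup, but the ordering principle is the same: the cascade makes the quantification deterministic and auditable.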

101 citations


Journal ArticleDOI
TL;DR: The potential for quantum computing to aid in the merging of insights across different areas of biological sciences is discussed.
Abstract: The search for meaningful structure in biological data has relied on cutting-edge advances in computational technology and data science methods. However, challenges arise as we push the limits of scale and complexity in biological problems. Innovation in massively parallel, classical computing hardware and algorithms continues to address many of these challenges, but there is a need to simultaneously consider new paradigms to circumvent current barriers to processing speed. Accordingly, we articulate a view towards quantum computation and quantum information science, where algorithms have demonstrated potential polynomial and exponential computational speedups in certain applications, such as machine learning. The maturation of the field of quantum computing, in hardware and algorithm development, also coincides with the growth of several collaborative efforts to address questions across length and time scales, and scientific disciplines. We use this coincidence to explore the potential for quantum computing to aid in one such endeavor: the merging of insights from genetics, genomics, neuroimaging and behavioral phenotyping. By examining joint opportunities for computational innovation across fields, we highlight the need for a common language between biological data analysis and quantum computing. Ultimately, we consider current and future prospects for the employment of quantum computing algorithms in the biological sciences.

56 citations


Posted ContentDOI
18 Jul 2019-bioRxiv
TL;DR: A custom annotation for cancer-associated cell types is developed by leveraging advanced assays, such as eCLIP, Hi-C, and whole-genome STARR-seq on a number of data-rich ENCODE cell types to prioritize key elements and variants, in addition to regulators.
Abstract: ENCODE comprises thousands of functional genomics datasets, and the encyclopedia covers hundreds of cell types, providing a universal annotation for genome interpretation. However, for particular applications, it may be advantageous to use a customized annotation. Here, we develop such a custom annotation by leveraging advanced assays, such as eCLIP, Hi-C, and whole-genome STARR-seq on a number of data-rich ENCODE cell types. A key aspect of this annotation is comprehensive and experimentally derived networks of both transcription factors and RNA-binding proteins (TFs and RBPs). Cancer, a disease of system-wide dysregulation, is an ideal application for such a network-based annotation. Specifically, for cancer-associated cell types, we put regulators into hierarchies and measure their network change (rewiring) during oncogenesis. We also extensively survey TF-RBP crosstalk, highlighting how SUB1, a previously uncharacterized RBP, drives aberrant tumor expression and amplifies the effect of MYC, a well-known oncogenic TF. Furthermore, we show how our annotation allows us to place oncogenic transformations in the context of a broad cell space; here, many normal-to-tumor transitions move towards a stem-like state, while oncogene knockdowns show an opposing trend. Finally, we organize the resource into a coherent workflow to prioritize key elements and variants, in addition to regulators. We showcase the application of this prioritization to somatic burdening, cancer differential expression and GWAS. Targeted validations of the prioritized regulators, elements and variants using siRNA knockdowns, CRISPR-based editing, and luciferase assays demonstrate the value of the ENCODE resource.

Journal ArticleDOI
TL;DR: This work focuses on how genomics fits as a specific application subdomain, in terms of well-known 3V data and 4M process frameworks (volume-velocity-variety and measurement-mining-modeling-manipulation, respectively).
Abstract: Data science allows the extraction of practical insights from large-scale data. Here, we contextualize it as an umbrella term, encompassing several disparate subdomains. We focus on how genomics fits as a specific application subdomain, in terms of well-known 3V data and 4M process frameworks (volume-velocity-variety and measurement-mining-modeling-manipulation, respectively). We further analyze the technical and cultural “exports” and “imports” between genomics and other data-science subdomains (e.g., astronomy). Finally, we discuss how data value, privacy, and ownership are pressing issues for data science applications, in general, and are especially relevant to genomics, due to the persistent nature of DNA.

Journal ArticleDOI
TL;DR: An Argonaute 2 (Ago2)-dependent miRNA network is discovered that, in response to substrate stiffness, regulates genes involved in tissue mechanics, and it is shown that Ago2 restrains stiffness and contributes to regeneration in the zebrafish fin fold.
Abstract: Vertebrate tissues exhibit mechanical homeostasis, showing stable stiffness and tension over time and recovery after changes in mechanical stress. However, the regulatory pathways that mediate these effects are unknown. A comprehensive identification of Argonaute 2-associated microRNAs and mRNAs in endothelial cells identified a network of 122 microRNA families that target 73 mRNAs encoding cytoskeletal, contractile, adhesive and extracellular matrix (CAM) proteins. The level of these microRNAs increased in cells plated on stiff versus soft substrates, consistent with homeostasis, and suppressed targets via microRNA recognition elements within the 3′ untranslated regions of CAM mRNAs. Inhibition of DROSHA or Argonaute 2, or disruption of microRNA recognition elements within individual target mRNAs, such as connective tissue growth factor, induced hyper-adhesive, hyper-contractile phenotypes in endothelial and fibroblast cells in vitro, and increased tissue stiffness, contractility and extracellular matrix deposition in the zebrafish fin fold in vivo. Thus, a network of microRNAs buffers CAM expression to mediate tissue mechanical homeostasis.

Journal ArticleDOI
TL;DR: A framework to identify cancer driver genes using a dynamics-based search of mutational hotspot communities in 3D structures is presented and a comparison between this approach and existing cancer hotspot detection methods suggests that including protein dynamics significantly increases the sensitivity of driver detection.
Abstract: Large-scale exome sequencing of tumors has enabled the identification of cancer drivers using recurrence-based approaches. Some of these methods also employ 3D protein structures to identify mutational hotspots in cancer-associated genes. In determining such mutational clusters in structures, existing approaches overlook protein dynamics, despite its essential role in protein function. We present a framework to identify cancer driver genes using a dynamics-based search of mutational hotspot communities. Mutations are mapped to protein structures, which are partitioned into distinct residue communities. These communities are identified in a framework where residue–residue contact edges are weighted by correlated motions (as inferred by dynamics-based models). We then search for signals of positive selection among these residue communities to identify putative driver genes, applying our method to the TCGA (The Cancer Genome Atlas) PanCancer Atlas missense mutation catalog. Overall, we predict one or more mutational hotspots within the resolved structures of proteins encoded by 434 genes. These genes were enriched among biological processes associated with tumor progression. Additionally, a comparison between our approach and existing cancer hotspot detection methods using structural data suggests that including protein dynamics significantly increases the sensitivity of driver detection.

Journal ArticleDOI
TL;DR: The results show how accounting for pervasive transcription is critical to accurately quantify the activity of highly repetitive regions of the human genome.
Abstract: Long interspersed nuclear element 1 (LINE-1) is a primary source of genetic variation in humans and other mammals. Despite its importance, LINE-1 activity remains difficult to study because of its highly repetitive nature. Here, we developed and validated a method called TeXP to gauge LINE-1 activity accurately. TeXP builds mappability signatures from LINE-1 subfamilies to deconvolve the effect of pervasive transcription from autonomous LINE-1 activity. In particular, it apportions the multiple reads aligned to the many LINE-1 instances in the genome into these two categories. Using our method, we evaluated well-established cell lines, cell-line compartments and healthy tissues and found that the vast majority (91.7%) of transcriptome reads overlapping LINE-1 derive from pervasive transcription. We validated TeXP by independently estimating the levels of LINE-1 autonomous transcription using ddPCR, finding high concordance. Next, we applied our method to comprehensively measure LINE-1 activity across healthy somatic cells, while backing out the effect of pervasive transcription. Unexpectedly, we found that LINE-1 activity is present in many normal somatic cells. This finding contrasts with earlier studies showing that LINE-1 has limited activity in healthy somatic tissues, except for neuroprogenitor cells. Interestingly, we found that the amount of LINE-1 activity was associated with the amount of cell turnover, with tissues with low cell turnover rates (e.g. the adult central nervous system) showing lower LINE-1 activity. Altogether, our results show how accounting for pervasive transcription is critical to accurately quantify the activity of highly repetitive regions of the human genome.
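The deconvolution step can be illustrated with a two-component toy (the signatures below are made up; TeXP itself derives mappability signatures from read simulations over LINE-1 subfamilies): observed read counts over LINE-1 instances are modeled as a mixture of a roughly uniform "pervasive transcription" signature and an "autonomous" signature concentrated on an active copy, and the mixture weights are recovered by least squares.

```python
# Illustrative two-component deconvolution in the spirit of TeXP.

def solve_2x2(s1, s2, obs):
    """Least-squares weights w1, w2 with obs ~ w1*s1 + w2*s2 (normal equations)."""
    a11 = sum(x * x for x in s1)
    a12 = sum(x * y for x, y in zip(s1, s2))
    a22 = sum(y * y for y in s2)
    b1 = sum(x * o for x, o in zip(s1, obs))
    b2 = sum(y * o for y, o in zip(s2, obs))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

pervasive = [1.0, 1.0, 1.0, 1.0]    # reads spread evenly across instances
autonomous = [0.1, 0.1, 0.1, 3.7]   # reads pile up on one active copy
# Synthetic observation: mostly pervasive transcription, some autonomous activity.
observed = [9.0 * p + 1.0 * a for p, a in zip(pervasive, autonomous)]

w_perv, w_auto = solve_2x2(pervasive, autonomous, observed)
print(round(w_perv, 2), round(w_auto, 2))  # recovers 9.0 and 1.0
```

With noise-free synthetic data the weights are recovered exactly; on real data the signatures are high-dimensional and the fit is regularized, but the apportioning logic is the same.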

Posted ContentDOI
17 Dec 2019-bioRxiv
TL;DR: This study showed that pharmacogenomics data can be stored and queried efficiently on the Ethereum blockchain, and designed a smart contract to store and query gene-drug interaction data in Ethereum using an index-based, multi-mapping approach allowing for time- and space-efficient storage and querying.
Abstract: Background: With the advent of precision medicine, pharmacogenomics data is becoming increasingly critical to patient care. These data describe the relationship between a particular variant in the genome and the response to a drug by the patient. As utilizing this kind of data becomes more integral to medical treatment decisions, appropriate storage and sharing of this data will be critical. A potential way of securely storing and sharing pharmacogenomics data is a smart contract on the Ethereum blockchain, an open-source blockchain platform for decentralized applications. A transaction-based state machine, Ethereum maintains user accounts and storage in a network state. Immutable pieces of code called smart contracts may be deployed to the Ethereum network and run on the Ethereum Virtual Machine when called by a user or other contract. The 2019 iDASH (Integrating Data for Analysis, Anonymization, and Sharing) competition for Secure Genome Analysis challenged participants to develop time- and space-efficient smart contracts to log and query gene-drug relationship data on the Ethereum blockchain. Methods: We designed a smart contract to store and query pharmacogenomics data (gene-drug interaction data) in Ethereum using an index-based, multi-mapping approach allowing for time- and space-efficient storage and querying. Our solution to the iDASH competition ranked in the top three at a workshop held in Bloomington, IN in October 2019. Although our solution performed well in the challenge, we wanted to improve its scalability and query efficiency. To that end, we developed an alternate fastQuery solution that stores pooled rather than raw data, allowing for significantly improved query time for 0-AND queries, and constant query time for 1- and 2-AND queries.
Results: We tested the performance of both of our solutions in Truffle (v5.0.31) using datasets ranging from 100 to 1000 entries, and inserting data at 25, 50, 100, and 200 observations at a time. On a private, proof-of-authority test network, our challenge solution requires approximately 70 seconds, 500 MB of memory, and 80 MB of disk space to insert 1000 entries (200 at a time); and 400 ms and 5 MB of memory to run a 2-AND query over 1000 entries. This solution exhibits constant memory for insertion and querying, and linear query time. Our alternate fastQuery solution requires approximately 60 seconds, 500 MB of memory, and 80 MB of disk space to insert 1000 entries (200 at a time); and 83 ms and 5 MB of memory to run a 2-AND query over 1000 entries. This solution exhibits constant memory for insertion and querying, linear query time for 0-AND queries, and constant query time for 1- and 2-AND queries in a database of up to 1000 entries. Conclusion: In this study, we showed that pharmacogenomics data can be stored and queried efficiently on the Ethereum blockchain. Our approach has the potential to be useful for a wide range of datasets in biomedical research; while we focused on gene-drug interaction data, our solution designs could be used to store a range of clinical trial data. Moreover, our solutions could be adapted to store and query data in any field where high-integrity data storage and efficient access is required.
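The index-based, multi-mapping design can be modeled in a few lines of Python (illustrative only; the actual solution is a Solidity smart contract, and the field names and example records here are invented): each observation gets an integer id, and per-field indexes map each gene/variant/drug value to the set of ids, so an AND query becomes a set intersection instead of a full scan.

```python
# Python model of an index-based multi-mapping store for gene-drug data.

class GeneDrugStore:
    def __init__(self):
        self.records = []                      # id -> observation tuple
        self.index = {"gene": {}, "variant": {}, "drug": {}}

    def insert(self, gene, variant, drug, outcome):
        rid = len(self.records)
        self.records.append((gene, variant, drug, outcome))
        # Multi-mapping: every field value points at the ids that carry it.
        for field, value in (("gene", gene), ("variant", variant), ("drug", drug)):
            self.index[field].setdefault(value, set()).add(rid)
        return rid

    def query(self, **criteria):
        """AND query, e.g. query(gene='CYP2D6', drug='codeine')."""
        id_sets = [self.index[f].get(v, set()) for f, v in criteria.items()]
        ids = set.intersection(*id_sets) if id_sets else set(range(len(self.records)))
        return [self.records[i] for i in sorted(ids)]

db = GeneDrugStore()
db.insert("CYP2D6", "*4", "codeine", "poor metabolizer")
db.insert("CYP2D6", "*1", "codeine", "normal")
db.insert("TPMT", "*3A", "azathioprine", "toxicity risk")
print(len(db.query(gene="CYP2D6", drug="codeine")))  # 2
```

On-chain, the same shape is achieved with Solidity `mapping` types keyed by hashed field values; the intersection trick is what keeps 1- and 2-AND queries from scanning all entries.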

Posted ContentDOI
08 Jul 2019-bioRxiv
TL;DR: A statistical framework for uniformly processing STARR-seq data: STARRPeaker, which identifies highly reproducible and epigenetically active enhancers across replicates and outperforms other peak callers in terms of identifying known enhancers.
Abstract: High-throughput reporter assays, such as self-transcribing active regulatory region sequencing (STARR-seq), allow for unbiased and quantitative assessment of enhancers at the genome-wide level. In order to cover the size of the human genome, recent advancements of STARR-seq technology have employed more complex genomic libraries and increased sequencing depths. These advances necessitate a reliable processing pipeline and peak-calling algorithm. Most studies of STARR-seq have relied on chromatin immunoprecipitation sequencing (ChIP-seq) processing pipelines to identify peak regions. However, here we highlight key differences in the processing of STARR-seq versus ChIP-seq data. STARR-seq uses transcribed RNA to measure enhancer activity, making determining the basal transcription rate important. Further, STARR-seq coverage is non-uniform, overdispersed, and often confounded by sequencing biases such as GC content and mappability. We observed a correlation between RNA thermodynamic stability and STARR-seq RNA readout, suggesting that STARR-seq might be sensitive to RNA secondary structure and stability. Considering these findings, we developed a statistical framework for uniformly processing STARR-seq data: STARRPeaker. We applied our method to two whole human genome STARR-seq experiments: HepG2 and K562. Our method identifies highly reproducible and epigenetically active enhancers across replicates. Moreover, STARRPeaker outperforms other peak callers in terms of identifying known enhancers. Thus, our framework optimized for processing STARR-seq data accurately characterizes cell-type-specific enhancers, while addressing potential confounders.
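The role of the basal transcription rate can be illustrated with a simplified peak-calling sketch (STARRPeaker itself fits a negative binomial model with covariates such as GC content and mappability; here a plain binomial test against a genome-wide basal RNA/DNA rate stands in for that model, purely for illustration, and the counts are invented).

```python
# Simplified STARR-seq-style peak caller: a bin is a "peak" when its
# reporter RNA output significantly exceeds the basal transcription rate
# estimated from total RNA and input-DNA coverage.
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def call_peaks(rna, dna, alpha=1e-3):
    """Return indices of bins whose RNA share exceeds the basal rate."""
    basal = sum(rna) / (sum(rna) + sum(dna))   # basal transcription rate
    peaks = []
    for i, (r, d) in enumerate(zip(rna, dna)):
        n = r + d
        if n and binom_sf(r, n, basal) < alpha:
            peaks.append(i)
    return peaks

rna = [5, 4, 120, 6, 3]        # reporter RNA counts per bin (synthetic)
dna = [100, 90, 80, 110, 95]   # input library coverage per bin (synthetic)
print(call_peaks(rna, dna))    # only the enriched bin is called
```

Conditioning each bin's RNA count on its own input coverage is what separates true enhancer activity from library-abundance artifacts; the overdispersion and GC/mappability confounders described above are why the real framework replaces the binomial with a covariate-adjusted negative binomial.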

Posted ContentDOI
19 Aug 2019-bioRxiv
TL;DR: An agnostic machine-learning-based workflow, called SVFX, is built to assign a “pathogenicity score” to somatic and germline SVs in various diseases; predicted pathogenic SVs in cancer cohorts were enriched among known cancer genes and many cancer-related pathways.
Abstract: A rapid decline in sequencing cost has made large-scale genome sequencing studies feasible. One of the fundamental goals of these studies is to catalog all pathogenic variants. Numerous methods and tools have been developed to interpret point mutations and small insertions and deletions. However, there is a lack of approaches for identifying pathogenic genomic structural variations (SVs). That said, SVs are known to play a crucial role in many diseases by altering the sequence and three-dimensional structure of the genome. Previous studies have suggested a complex interplay of genomic and epigenomic features in the emergence and distribution of SVs. However, the exact mechanism of pathogenesis for SVs in different diseases is not straightforward to decipher. Thus, we built an agnostic machine-learning-based workflow, called SVFX, to assign a “pathogenicity score” to somatic and germline SVs in various diseases. In particular, we generated somatic and germline training models, which included genomic, epigenomic, and conservation-based features for SV call sets in diseased and healthy individuals. We then applied SVFX to SVs in six different cancer cohorts and a cardiovascular disease (CVD) cohort. Overall, SVFX achieved high accuracy in identifying pathogenic SVs. Moreover, we found that predicted pathogenic SVs in cancer cohorts were enriched among known cancer genes and many cancer-related pathways (including Wnt signaling, Ras signaling, DNA repair, and ubiquitin-mediated proteolysis). Finally, we note that SVFX is flexible and can be easily extended to identify pathogenic SVs in additional disease cohorts.
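The core of such a workflow, turning a feature vector per SV into a 0-1 pathogenicity score, can be sketched in miniature (SVFX uses a richer model trained on many genomic, epigenomic, and conservation features; the two features, labels, and logistic-regression stand-in below are all invented for illustration).

```python
# Toy pathogenicity scorer: logistic regression on two made-up SV features,
# trained by plain per-sample gradient descent.
import math

def train_logistic(X, y, lr=0.5, steps=2000):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(steps):
        for xi, yi in zip(X, y):
            p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            g = p - yi                          # gradient of log-loss w.r.t. logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def score(w, b, x):
    """Pathogenicity score in (0, 1) for one SV's feature vector."""
    return 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

# Features per SV: [fraction overlapping coding sequence, mean conservation].
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.0, 0.1]]
y = [1, 1, 0, 0]   # labels from diseased vs healthy cohorts (synthetic)
w, b = train_logistic(X, y)
print(score(w, b, [0.85, 0.9]) > 0.5, score(w, b, [0.05, 0.1]) < 0.5)
```

The value of the approach comes less from the classifier than from the feature engineering: separate somatic and germline models, and features capturing the SV's genomic and epigenomic context.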

Journal ArticleDOI
TL;DR: This work does 3D structure-based docking on ∼10,000 SNVs modifying known protein-drug complexes to construct a pseudo gold standard and uses this augmented set of binding affinities (BAs) to train a statistical model combining structure, ligand and sequence features and illustrates how it can be applied to millions of SNVs.

Journal ArticleDOI
27 Sep 2019-iScience
TL;DR: It is found that transmission of pollen-miRNAs into the circulation occurs via pulmonary transfer, and that this transfer is mediated by platelet–pulmonary vascular cell interactions and platelet pollen-DNA uptake.

Posted ContentDOI
02 Sep 2019-bioRxiv
TL;DR: This work developed a method that quantifies tumor growth and driver effects for individual samples based solely on the variant allele frequency (VAF) spectrum and found that the identified periods of positive growth are associated with drivers previously highlighted via recurrence by the PCAWG consortium.
Abstract: Evolving tumors accumulate thousands of mutations. Technological advances have enabled whole genome sequencing of these mutations in large cohorts, such as those from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium. The resulting data explosion has led to many methods for detecting cancer drivers through mutational recurrence and deviation from background mutation rates. However, these methods require a large cohort and underperform when recurrence is low. An alternate approach involves harnessing the variant allele frequency (VAF) of mutations in the population of tumor cells in a single individual. Moreover, ultra-deep sequencing of tumors, which is now possible, allows for particularly accurate VAF measurements, and recent studies have begun to use these to determine evolutionary trajectories and quantify subclonal selection. Here, we developed a method that quantifies tumor growth and driver effects for individual samples based solely on the VAF spectrum. Drivers introduce a perturbation into this spectrum, and our method uses the frequency of “hitchhiking” mutations preceding a driver to measure this perturbation. Specifically, our method applies various growth models to identify periods of positive/negative growth, the genomic regions associated with them, and the presence and effect of putative drivers. To validate our method, we first used simulation models to successfully approximate the timing and size of a driver’s effect. Then, we tested our method on 993 linear tumors (i.e. those with linear subclonal expansion, where each parent subclone has one child) from the PCAWG Consortium and found that the identified periods of positive growth are associated with drivers previously highlighted via recurrence by the PCAWG Consortium.
In summary, our method presents opportunities for personalized diagnosis using deep sequenced whole genome data from an individual.
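The background for this kind of VAF-spectrum analysis can be sketched numerically (this is not the paper's method; it illustrates the standard neutral-growth null the approach builds on): under neutral exponential growth the expected cumulative count of subclonal mutations with VAF above f grows like 1/f, and a driver-led subclone adds a cluster of hitchhiking mutations at one frequency, perturbing that spectrum. The toy below, on synthetic VAF lists, compares the linearity (R²) of the M(f) vs 1/f relationship for a neutral-like versus a perturbed tumor.

```python
# Synthetic demonstration: a subclonal cluster perturbs the neutral 1/f
# cumulative VAF spectrum.

def cumulative_vs_inv_f(vafs, grid):
    xs = [1.0 / f for f in grid]                      # 1/f axis
    ys = [sum(1 for v in vafs if v >= f) for f in grid]  # M(f): mutations above f
    return xs, ys

def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

vafs_neutral = [1.0 / k for k in range(4, 41)]   # M(f) ~ 1/f by construction
vafs_perturbed = vafs_neutral + [0.122] * 30     # subclone cluster at VAF 0.122
grid = [0.05 + 0.005 * i for i in range(30)]     # frequencies 0.05 .. 0.195

r2 = {name: r_squared(*cumulative_vs_inv_f(v, grid))
      for name, v in [("neutral", vafs_neutral), ("perturbed", vafs_perturbed)]}
print(r2["neutral"] > r2["perturbed"])  # True: the cluster degrades the 1/f fit
```

The actual method goes well beyond this null test, fitting growth models to locate when the perturbation occurred and how large the driver's effect was.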


Posted ContentDOI
18 Dec 2019-bioRxiv
TL;DR: This work demonstrates that using a DNA input control results in a definable set of spurious sites whose abundance is tightly associated with the intrinsic properties of the ChIP-seq sample, and develops a method to use both controls in combination to further improve binding site detection.
Abstract: Chromatin immunoprecipitation (IP) followed by sequencing (ChIP-seq) is the gold standard to detect genome-wide DNA-protein binding. The binding sites of transcription factors facilitate many biological studies. Of emerging concern is the abundance of spurious sites in ChIP-seq, which are mainly caused by uneven genomic sonication and nonspecific interactions between chromatin and antibody. A "mock" IP is designed to correct for both factors, whereas a DNA input control corrects only for uneven sonication. However, a mock IP is more susceptible to technical noise than a DNA input, and empirically, these two controls perform similarly for ChIP-seq. Therefore, DNA input is currently being used almost exclusively. With a large dataset, we demonstrate that using a DNA input control results in a definable set of spurious sites, and their abundance is tightly associated with the intrinsic properties of the ChIP-seq sample. For example, compared to human cell lines, samples such as human tissues and whole worm and fly have more accessible genomes, and thus have more spurious sites. The large and varying abundance of spurious sites may impede comparative studies across multiple samples. In contrast, using a mock IP as control substantially removes these spurious sites, resulting in high-quality binding sites and facilitating their comparability across samples. Although outperformed by mock IP, DNA input is still informative and has unique advantages. Therefore, we have developed a method to use both controls in combination to further improve binding site detection.
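One simple way to combine the two controls can be sketched as follows (this is a hedged illustration, not the manuscript's actual combination method, and the counts and 3-fold threshold are invented): a candidate site must be enriched over *both* the DNA input and the mock IP, so sites explained by uneven sonication and sites explained by nonspecific chromatin-antibody interaction are both removed.

```python
# Toy dual-control site caller: require enrichment over input AND mock IP.

def enriched(ip, control, fold=3.0, pseudo=1.0):
    """Pseudocount-stabilized fold-enrichment test for one site."""
    return (ip + pseudo) / (control + pseudo) >= fold

def call_sites(ip_counts, input_counts, mock_counts):
    sites = []
    for i, ip in enumerate(ip_counts):
        if enriched(ip, input_counts[i]) and enriched(ip, mock_counts[i]):
            sites.append(i)
    return sites

ip_counts    = [90, 80, 5, 70]
input_counts = [10, 10, 10, 10]   # sites 1 and 3 look fine versus input...
mock_counts  = [10, 60, 10, 40]   # ...but they light up in the mock IP
print(call_sites(ip_counts, input_counts, mock_counts))  # [0]
```

This captures the paper's qualitative point: input alone would have accepted the nonspecific sites, while the mock IP flags them; a real method would also model the mock IP's higher technical noise rather than apply a hard threshold.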

Posted ContentDOI
24 May 2019-bioRxiv
TL;DR: A new method called TeXP is developed and validated to gauge LINE-1 activity accurately; the results show how accounting for pervasive transcription is critical to accurately quantifying the activity of highly repetitive regions of the human genome.
Abstract: Long interspersed nuclear element 1 (LINE-1) is a primary source of genetic variation in humans and other mammals. Despite its importance, LINE-1 activity remains difficult to study because of its highly repetitive nature. Here, we developed and validated a method called TeXP to gauge LINE-1 activity accurately. TeXP builds mappability signatures from LINE-1 subfamilies to deconvolve the effect of pervasive transcription from autonomous LINE-1 activity. In particular, it apportions the multiple reads aligned to the many LINE-1 instances in the genome into these two categories. Using our method, we evaluated well-established cell lines, cell-line compartments and healthy tissues and found that the vast majority (91.7%) of transcriptome reads overlapping LINE-1 derive from pervasive transcription. We validated TeXP by independently estimating the levels of LINE-1 autonomous transcription using ddPCR, finding high concordance. Next, we applied our method to comprehensively measure LINE-1 activity across healthy somatic cells, while backing out the effect of pervasive transcription. Unexpectedly, we found that LINE-1 activity is present in many normal somatic cells. This finding contrasts with earlier studies showing that LINE-1 has limited activity in healthy somatic tissues, except for neuroprogenitor cells. Interestingly, we found that the amount of LINE-1 activity was associated with the amount of cell turnover, with tissues with low cell turnover rates (e.g. the adult central nervous system) showing lower LINE-1 activity. Altogether, our results show how accounting for pervasive transcription is critical to accurately quantify the activity of highly repetitive regions of the human genome. Author Summary Repetitive sequences, such as LINEs, comprise more than half of the human genome. Due to their repetitive nature, LINEs are hard to study. In particular, we find that pervasive transcription is a major confounding factor in transcriptome data.
We observe that, on average, more than 90% of LINE signal derives from pervasive transcription. To investigate this issue, we developed and validated a new method called TeXP. TeXP accounts for and removes the effects of pervasive transcription when quantifying LINE activity. Our method uses the broad distribution of LINEs to estimate the effects of pervasive transcription. Using TeXP, we processed thousands of transcriptome datasets to uniformly and unbiasedly measure LINE-1 activity across healthy somatic cells. By removing the pervasive transcription component, we find that (1) LINE-1 is broadly expressed in healthy somatic tissues; (2) the adult brain shows low levels of LINE transcription; and (3) LINE-1 transcription level is correlated with tissue cell turnover. Our method thus offers insights into how repetitive sequences are influenced by pervasive transcription. Moreover, we uncover the activity of LINE-1 in somatic tissues at an unmatched scale.
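The deconvolution idea behind TeXP, apportioning reads between pervasive transcription and autonomous LINE-1 activity using subfamily mappability signatures, can be sketched as a non-negative least-squares problem (toy signature matrix and simulated mixture; not TeXP's actual implementation):

```python
import numpy as np
from scipy.optimize import nnls

# Toy deconvolution in the spirit of TeXP. Columns of the signature matrix
# give the expected fraction of reads that each transcription source
# contributes to four hypothetical LINE-1 subfamily bins (assumed values).
signatures = np.array([
    # pervasive, autonomous
    [0.40, 0.05],
    [0.30, 0.10],
    [0.20, 0.15],
    [0.10, 0.70],
])

# Simulate an observed read distribution that is 90% pervasive transcription.
observed = 0.9 * signatures[:, 0] + 0.1 * signatures[:, 1]

# Non-negative least squares recovers the mixture weights.
weights, residual = nnls(signatures, observed)
weights /= weights.sum()
print(weights)  # recovers ~[0.9, 0.1]: most signal is pervasive transcription
```

This mirrors the paper's central observation: once the pervasive component is backed out, only a small fraction of LINE-1-overlapping reads reflects autonomous activity.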

Journal ArticleDOI
TL;DR: This work coupled multiple genomic predictors to build GRAM, a GeneRAlized Model, to predict a well-defined experimental target: the expression-modulating effect of a non-coding variant on its associated gene, in a transferable, cell-specific manner.
Abstract: There has been much effort to prioritize genomic variants with respect to their impact on "function". However, function is often not precisely defined: sometimes it is the disease association of a variant; on other occasions, it reflects a molecular effect on transcription or epigenetics. Here, we coupled multiple genomic predictors to build GRAM, a GeneRAlized Model, to predict a well-defined experimental target: the expression-modulating effect of a non-coding variant on its associated gene, in a transferable, cell-specific manner. First, we performed feature engineering: using LASSO, a regularized linear model, we found transcription factor (TF) binding most predictive, especially for TFs that are hubs in the regulatory network; in contrast, evolutionary conservation, a popular feature in many other variant-impact predictors, has almost no contribution. Moreover, TF binding inferred from in vitro SELEX is as effective as that from in vivo ChIP-Seq. Second, we implemented GRAM integrating only SELEX features and expression profiles; thus, the program combines a universal regulatory score with an easily obtainable modifier reflecting the particular cell type. We benchmarked GRAM on large-scale MPRA datasets, achieving AUROC scores of 0.72 in GM12878 and 0.66 in a multi-cell line dataset. We then evaluated the performance of GRAM on targeted regions using luciferase assays in the MCF7 and K562 cell lines. We noted that changing the insertion position of the construct relative to the reporter gene gave very different results, highlighting the importance of carefully defining the exact prediction target of the model. Finally, we illustrated the utility of GRAM in fine-mapping causal variants and developed a practical software pipeline to carry this out. 
In particular, we demonstrated in specific examples how the pipeline could pinpoint variants that directly modulate gene expression within a larger linkage-disequilibrium block associated with a phenotype of interest (e.g., for an eQTL).
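The LASSO feature-engineering step can be illustrated with synthetic data (stand-in features and effect sizes, not the paper's data): an L1-penalized linear model retains informative TF-binding features while driving the coefficient of an uninformative conservation feature toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: two TF-binding scores that drive a simulated
# expression-modulating effect, plus a conservation score that is pure
# noise with respect to the target (mimicking the paper's finding).
tf_hub = rng.normal(size=n)
tf_other = rng.normal(size=n)
conservation = rng.normal(size=n)
effect = 2.0 * tf_hub + 0.5 * tf_other + rng.normal(scale=0.1, size=n)

X = np.column_stack([tf_hub, tf_other, conservation])
model = Lasso(alpha=0.1).fit(X, effect)
print(dict(zip(["tf_hub", "tf_other", "conservation"], model.coef_.round(2))))
# The L1 penalty shrinks the uninformative conservation feature to ~0,
# while the hub-TF feature keeps a large coefficient.
```

The same mechanism underlies the paper's observation that hub TFs dominate the model while conservation contributes almost nothing.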

Posted ContentDOI
02 Dec 2019-bioRxiv
TL;DR: This work proposes a method called TopicNet that applies latent Dirichlet allocation (LDA) to extract meaningful functional topics for a collection of genes regulated by a TF and defines a rewiring score to quantify the large-scale changes in the regulatory network in terms of topic change for a TF.
Abstract: Next generation sequencing data highlights comprehensive and dynamic changes in the human gene regulatory network. Moreover, changes in regulatory network connectivity (network "rewiring") manifest different regulatory programs in multiple cellular states. However, due to the dense and noisy nature of the connectivity in regulatory networks, directly comparing the gains and losses of targets of key TFs is not very informative. Thus, here, we seek an abstracted lower-dimensional representation to understand the main features of network change. In particular, we propose a method called TopicNet that applies latent Dirichlet allocation (LDA) to extract meaningful functional topics for a collection of genes regulated by a TF. We then define a rewiring score to quantify the large-scale changes in the regulatory network in terms of topic change for a TF. Using this framework, we can pinpoint particular TFs that change greatly in network connectivity between different cellular states. This is particularly relevant in oncogenesis. Also, incorporating gene-expression data, we define a topic activity score that gives the degree that a topic is active in a particular cellular state. Furthermore, we show how activity differences can highlight differential survival in certain cancers.
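The TopicNet idea, LDA topics over TF target sets plus a rewiring score between cellular states, can be sketched as follows (toy TF-by-gene count matrices and a cosine-distance rewiring score; the paper's actual definitions may differ):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(1)

# Toy TF-target count matrices for two cellular states (rows = TFs,
# columns = genes); values are illustrative, not real network data.
state_a = rng.integers(0, 5, size=(10, 50))
state_b = state_a.copy()
state_b[0, :25] = 0      # TF 0 loses its old target set ...
state_b[0, 25:] += 5     # ... and gains a new one ("rewiring")

# Fit LDA jointly so both states share one topic space.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(np.vstack([state_a, state_b]))
topics_a = lda.transform(state_a)
topics_b = lda.transform(state_b)

def rewiring_score(p, q):
    """Cosine distance between a TF's topic vectors in two states."""
    return 1.0 - p @ q / (np.linalg.norm(p) * np.linalg.norm(q))

scores = [rewiring_score(topics_a[i], topics_b[i]) for i in range(10)]
print(np.argmax(scores))  # the deliberately rewired TF should stand out
```

The low-dimensional topic space is what makes the comparison robust: unchanged TFs map to (nearly) identical topic vectors, so only genuine rewiring produces a large score.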

Posted ContentDOI
04 Nov 2019-bioRxiv
TL;DR: A general theoretical framework for analyzing evolutionary processes drawing on recent approaches to causal modeling developed in the machine-learning literature, which have extended Pearl’s ‘do’-calculus to incorporate cyclic causal interactions and multilevel causation is developed.
Abstract: Many models of evolution are implicitly causal processes. Features such as causal feedback between evolutionary variables and evolutionary processes acting at multiple levels, though, mean that conventional causal models miss important phenomena. We develop here a general theoretical framework for analyzing evolutionary processes drawing on recent approaches to causal modeling developed in the machine-learning literature, which have extended Pearl's 'do'-calculus to incorporate cyclic causal interactions and multilevel causation. We also develop information-theoretic notions necessary to analyze causal information dynamics in our framework, introducing a causal generalization of the Partial Information Decomposition framework. We show how our causal framework helps to clarify conceptual issues in the contexts of complex trait analysis and cancer genetics: assigning variation in an observed trait to genetic, epigenetic and environmental sources in the presence of epigenetic and environmental feedback processes, and assigning variation in fitness to mutation processes in cancer using a multilevel causal model, as well as relating causally-induced to observed variation in these variables via information-theoretic bounds. In the process, we introduce a general class of multilevel causal evolutionary processes which connect evolutionary processes at multiple levels via coarse-graining relationships. Further, we show how a range of 'fitness models' can be formulated in our framework, as well as a causal analog of Price's equation (generalizing the probabilistic 'Rice equation'), clarifying the relationships between realized/probabilistic fitness and direct/indirect selection. Finally, we consider the potential relevance of our framework to foundational issues in biology and evolution, including supervenience, multilevel selection and individuality. 
In particular, we argue that our class of multilevel causal evolutionary processes, in conjunction with a minimum description length principle, provides a framework in which the identification of multiple levels of selection may be addressed as a model-selection problem.
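For orientation, the classical Price equation that the causal analog generalizes can be written in its standard textbook form (this is the conventional equation, not the paper's causal version): it splits the change in mean trait value into a selection (covariance) term and a transmission term,

```latex
\bar{w}\,\Delta\bar{z} \;=\; \operatorname{Cov}\!\left(w_i, z_i\right) \;+\; \operatorname{E}\!\left[w_i\,\Delta z_i\right],
```

where \(w_i\) is the fitness of type \(i\), \(z_i\) its trait value, \(\bar{w}\) the mean fitness, and \(\Delta z_i\) the change in trait value between parent and offspring. The paper's contribution is to replace the observational covariance with causally-defined (interventional) quantities.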

Posted ContentDOI
20 Jul 2019-bioRxiv
TL;DR: In this paper, a negative-binomial regression framework for uniformly processing STARR-seq data, called STARRPeaker, was proposed to detect active enhancers from both captured and whole-genome STARRseq data.
Abstract: High-throughput reporter assays, such as self-transcribing active regulatory region sequencing (STARR-seq), allow for unbiased and quantitative assessment of enhancers at a genome-wide scale. Recent advances in STARR-seq technology have employed progressively more complex genomic libraries and increased sequencing depths, to assay larger sized regions, up to the entire human genome. These advances necessitate a reliable processing pipeline and peak-calling algorithm. Most STARR-seq studies have relied on chromatin immunoprecipitation sequencing (ChIP-seq) processing pipelines. However, there are key differences in STARR-seq versus ChIP-seq. First, STARR-seq uses transcribed RNA to measure the activity of an enhancer, making an accurate determination of the basal transcription rate important. Second, STARR-seq coverage is highly non-uniform, overdispersed, and often confounded by sequencing biases, such as GC content and mappability. Lastly, here, we observed a clear correlation between RNA thermodynamic stability and STARR-seq readout, suggesting that STARR-seq may be sensitive to RNA secondary structure and stability. Considering these findings, we developed a negative-binomial regression framework for uniformly processing STARR-seq data, called STARRPeaker. In support of this, we generated whole-genome STARR-seq data from the HepG2 and K562 human cell lines and applied STARRPeaker to call enhancers. We show that STARRPeaker can unbiasedly detect active enhancers from both captured and whole-genome STARR-seq data. Specifically, we report ~33,000 and ~20,000 candidate enhancers from HepG2 and K562, respectively. Moreover, we show that STARRPeaker outperforms other peak callers in terms of identifying known enhancers with fewer false positives. Overall, we demonstrate that an optimized processing framework for STARR-seq experiments can identify putative enhancers while addressing potential confounders.

Posted ContentDOI
14 Nov 2019-bioRxiv
TL;DR: The Local Event-based analysis of alternative Splicing using RNA-Seq (LESSeq) pipeline is developed, which utilizes information of local splicing events to identify unambiguous alternative splicing and quantifies the abundance of these alternative events using Maximum Likelihood Estimation (MLE) and provides their significance between different cellular conditions.
Abstract: Alternative splicing, which can be observed genome-wide by RNA-Seq, is important in cellular development and evolution. Comparative RNA-Seq experiments between different cellular conditions allow alternative splicing signatures to be detected. However, inferring alternative splicing signatures from short-read technology is unreliable, and many challenges remain before biologically significant signatures can be identified. To enable the robust discovery of differential alternative splicing, we developed the Local Event-based analysis of alternative Splicing using RNA-Seq (LESSeq) pipeline. LESSeq utilizes information of local splicing events (i.e., the partial structures in genes where transcript-splicing patterns diverge) to identify unambiguous alternative splicing. In addition, LESSeq quantifies the abundance of these alternative events using Maximum Likelihood Estimation (MLE) and provides their significance between different cellular conditions. The utility of LESSeq is demonstrated through two case studies relevant to human variation and evolution. Using an RNA-Seq data set of lymphoblastoid cell lines in two human populations, we examined within-species variation and discovered population-differential alternative splicing events. With an RNA-Seq data set of several tissues in human and rhesus macaque, we studied cross-species variation and identified lineage-differential alternative splicing events. LESSeq is implemented in C++ and R, and made publicly available on GitHub at: https://github.com/gersteinlab/LESSeq
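For a simple two-way local event, the MLE of the inclusion ratio reduces to a binomial proportion, and significance between conditions can be approximated with a standard count-based test (toy read counts; LESSeq's own statistics may differ):

```python
from scipy.stats import fisher_exact

# Toy local splicing event (illustrative counts, not LESSeq output):
# reads supporting exon inclusion vs. skipping in two conditions.
inclusion_a, skipping_a = 80, 20
inclusion_b, skipping_b = 45, 55

# The maximum-likelihood estimate of the inclusion ratio (often called
# PSI) is simply the binomial proportion of inclusion-supporting reads.
psi_a = inclusion_a / (inclusion_a + skipping_a)
psi_b = inclusion_b / (inclusion_b + skipping_b)

# A simple significance test for differential splicing between conditions
# (a stand-in for the pipeline's own statistics).
odds, p_value = fisher_exact([[inclusion_a, skipping_a],
                              [inclusion_b, skipping_b]])
print(round(psi_a, 2), round(psi_b, 2), p_value < 0.05)
```

Real local events can involve more than two alternative structures, in which case the binomial proportion generalizes to a multinomial MLE over the event's read classes.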

Posted ContentDOI
11 Sep 2019-bioRxiv
TL;DR: A generative model, Latent Dirichlet allocation (LDA), is applied, to identify patterns of gene expression and microbial abundances and relate them to clinical data, and a method called LDA-link is developed that connects microbes to genes using reduced-dimensionality LDA topics.
Abstract: Sputum induction is a non-invasive method to evaluate the airway environment, particularly for asthma. RNA sequencing (RNAseq) can be used on sputum, but it can be challenging to interpret because sputum contains a complex and heterogeneous mixture of human cells and exogenous (microbial) material. In this study, we developed a methodology that integrates dimensionality reduction and statistical modeling to grapple with the heterogeneity. We use this to relate bulk RNAseq data from 115 asthmatic patients with clinical information, microscope images, and single-cell profiles. First, we mapped sputum RNAseq to human and exogenous sources. Next, we decomposed the human reads into cell-expression signatures and fractions of these in each sample; we validated the decomposition using targeted single-cell RNAseq and microscopy. We observed enrichment of immune-system cells (neutrophils, eosinophils, and mast cells) in severe asthmatics. Second, we inferred microbial abundances from the exogenous reads and then associated these with clinical variables -- e.g., Haemophilus was associated with increased white blood cell count and Candida, with worse lung function. Third, we applied a generative model, Latent Dirichlet allocation (LDA), to identify patterns of gene expression and microbial abundances and relate them to clinical data. Based on this, we developed a method called LDA-link that connects microbes to genes using reduced-dimensionality LDA topics. We found a number of known connections, e.g. between Haemophilus and the gene IL1B, which is highly expressed by mast cells. In addition, we identified novel connections, including Candida and the calcium-signaling gene CACNA1E, which is highly expressed by eosinophils. These results speak to the mechanism by which gene-microbe interactions contribute to asthma and define a strategy for making inferences in heterogeneous and noisy RNAseq datasets.
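The LDA-link idea, fitting topics jointly over gene-expression and microbial-abundance columns and linking a microbe to genes that load on the same topic, can be sketched as follows (toy joint count matrix; the column names and structure are illustrative):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(3)
n_samples = 60

# Toy joint count matrix (illustrative): columns are two genes and two
# microbes; one gene/microbe pair co-varies across samples.
signal = rng.poisson(20, size=n_samples)
data = np.column_stack([
    signal + rng.poisson(2, n_samples),   # gene_1 (tracks the microbe)
    rng.poisson(10, n_samples),           # gene_2 (background)
    signal + rng.poisson(2, n_samples),   # microbe_1
    rng.poisson(10, n_samples),           # microbe_2
])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(data)
# Normalize each topic's feature weights into a probability distribution.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# LDA-link idea: a microbe and a gene that load heavily on the same
# topic are candidate interaction partners.
names = ["gene_1", "gene_2", "microbe_1", "microbe_2"]
for t, weights in enumerate(topic_word):
    top = [names[i] for i in np.argsort(weights)[::-1][:2]]
    print("topic", t, "->", top)
```

Working in the reduced topic space rather than on raw counts is what tames the heterogeneity of sputum RNAseq: each topic pools noisy gene and microbe columns into a single interpretable axis.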