Showing papers by "Mark Gerstein" published in 2017


Journal ArticleDOI
TL;DR: An extensive study analysing a broad spectrum of RNA-seq workflows and proposing a comprehensive analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy, which could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome.
Abstract: RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, and hundreds of analysis tools have been developed since its debut. Although recent efforts have attempted to assess the latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome. RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.

194 citations


Journal ArticleDOI
TL;DR: This work proposes a new method for determining the target genes of transcriptional enhancers in specific cells and tissues, discovers three major co-regulation modes of enhancers, and finds that defense-related genes are often simultaneously regulated by multiple enhancers bound by different transcription factors.
Abstract: We propose a new method for determining the target genes of transcriptional enhancers in specific cells and tissues. It combines global trends across many samples and sample-specific information, and considers the joint effect of multiple enhancers. Our method outperforms existing methods when predicting the target genes of enhancers in unseen samples, as evaluated by independent experimental data. Requiring few types of input data, we are able to apply our method to reconstruct the enhancer-target networks in 935 samples of human primary cells, tissues and cell lines, which constitute by far the largest set of enhancer-target networks. The similarity of these networks from different samples closely follows their cell and tissue lineages. We discover three major co-regulation modes of enhancers and find defense-related genes often simultaneously regulated by multiple enhancers bound by different transcription factors. We also identify differentially methylated enhancers in hepatocellular carcinoma (HCC) and experimentally confirm their altered regulation of HCC-related genes.

189 citations


Journal ArticleDOI
24 Nov 2017-Science
TL;DR: Comparing transcriptome and histology of human and nonhuman primate brains reveals changes that make humans unique, and diverse molecular and cellular features of the phylogenetic reorganization of the human brain across multiple levels, with relevance for brain function and disease.
Abstract: To better understand the molecular and cellular differences in brain organization between human and nonhuman primates, we performed transcriptome sequencing of 16 regions of adult human, chimpanzee, and macaque brains. Integration with human single-cell transcriptomic data revealed global, regional, and cell-type–specific species expression differences in genes representing distinct functional categories. We validated and further characterized the human specificity of genes enriched in distinct cell types through histological and functional analyses, including rare subpallial-derived interneurons expressing dopamine biosynthesis genes enriched in the human striatum and absent in the nonhuman African ape neocortex. Our integrated analysis of the generated data revealed diverse molecular and cellular features of the phylogenetic reorganization of the human brain across multiple levels, with relevance for brain function and disease.

174 citations


Journal ArticleDOI
TL;DR: An in-depth proteomic survey of regions of the postnatal human brain, ranging in age from early infancy to adulthood, revealed varied patterns of protein–RNA relationships, with generally increased magnitudes of protein abundance differences between brain regions compared to RNA.
Abstract: Detailed observations of transcriptional, translational and post-translational events in the human brain are essential to improving our understanding of its development, function and vulnerability to disease. Here, we exploited label-free quantitative tandem mass-spectrometry to create an in-depth proteomic survey of regions of the postnatal human brain, ranging in age from early infancy to adulthood. Integration of protein data with existing matched whole-transcriptome sequencing (RNA-seq) from the BrainSpan project revealed varied patterns of protein-RNA relationships, with generally increased magnitudes of protein abundance differences between brain regions compared to RNA. Many of the differences amplified in protein data were reflective of cytoarchitectural and functional variation between brain regions. Comparing structurally similar cortical regions revealed significant differences in the abundances of receptor-associated and resident plasma membrane proteins that were not readily observed in the RNA expression data.

119 citations


Journal ArticleDOI
TL;DR: A novel reproducibility metric for quantifying the similarity between contact maps based on spectral decomposition is introduced, which successfully separates contact maps mapped from Hi-C data coming from biological replicates, pseudo-replicates and different cell types.
Abstract: Summary: Genome-wide proximity ligation based assays like Hi-C have opened a window to the 3D organization of the genome. In so doing, they present data structures that are different from conventional 1D signal tracks. To exploit the 2D nature of Hi-C contact maps, matrix techniques like spectral analysis are particularly useful. Here, we present HiC-spector, a collection of matrix-related functions for analyzing Hi-C contact maps. In particular, we introduce a novel reproducibility metric for quantifying the similarity between contact maps based on spectral decomposition. The metric successfully separates contact maps mapped from Hi-C data coming from biological replicates, pseudo-replicates and different cell types. Availability and implementation: Source code in Julia and Python, and detailed documentation are available at https://github.com/gersteinlab/HiC-spector. Contact: koonkiu.yan@gmail.com or mark@gersteinlab.org. Supplementary information: Supplementary data are available at Bioinformatics online.

81 citations
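
The spectral reproducibility idea in the HiC-spector abstract above can be illustrated with a small sketch. This is not the HiC-spector implementation: it assumes a symmetric normalized Laplacian, a fixed number of eigenvectors, and an unweighted sum of eigenvector distances, and the function names `normalized_laplacian` and `spectral_distance` are hypothetical. The published tool may normalize and scale the score differently.

```python
import numpy as np

def normalized_laplacian(contact_map):
    """Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}."""
    w = np.asarray(contact_map, dtype=float)
    degree = w.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    # Bins with no contacts keep a zero scaling factor to avoid division by zero.
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degree[nonzero])
    return np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]

def spectral_distance(map_a, map_b, n_vectors=5):
    """Sum of distances between matched low-order eigenvectors of two maps.

    Assumption for this sketch: both maps are symmetric, have the same
    shape, and have at least `n_vectors` bins.
    """
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    _, vec_a = np.linalg.eigh(normalized_laplacian(map_a))
    _, vec_b = np.linalg.eigh(normalized_laplacian(map_b))
    total = 0.0
    for i in range(n_vectors):
        a, b = vec_a[:, i], vec_b[:, i]
        # Eigenvectors are defined only up to sign, so use the closer orientation.
        total += min(np.linalg.norm(a - b), np.linalg.norm(a + b))
    return total
```

A smaller distance suggests more similar maps; in this sketch each eigenvector contributes at most the square root of 2, so the total is bounded by `n_vectors` times that value.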


Journal ArticleDOI
TL;DR: This study reveals a large degree of somatic mosaicism in healthy human tissues, links de novo and cancer mutations to somatic mosaicism, and finds that the allele-frequency distribution of mosaic SNVs has distinct narrow peaks, which could be a characteristic of clonal cell selection, clonal expansion, or both.
Abstract: Few studies have been conducted to understand post-zygotic accumulation of mutations in cells of the healthy human body. We reprogrammed 32 skin fibroblast cells from families of donors into human induced pluripotent stem cell (hiPSC) lines. The clonal nature of hiPSC lines allows a high-resolution analysis of the genomes of the founder fibroblast cells without being confounded by the artifacts of single-cell whole-genome amplification. We estimate that on average a fibroblast cell in children has 1035 mostly benign mosaic SNVs. On average, 235 SNVs could be directly confirmed in the original fibroblast population by ultradeep sequencing, down to an allele frequency (AF) of 0.1%. More sensitive droplet digital PCR experiments confirmed more SNVs as mosaic with AF as low as 0.01%, suggesting that 1035 mosaic SNVs per fibroblast cell is the true average. Similar analyses in adults revealed no significant increase in the number of SNVs per cell, suggesting that a major fraction of mosaic SNVs in fibroblasts arises during development. Mosaic SNVs were distributed uniformly across the genome and were enriched in a mutational signature previously observed in cancers and in de novo variants and which, we hypothesize, is a hallmark of normal cell proliferation. Finally, AF distribution of mosaic SNVs had distinct narrow peaks, which could be a characteristic of clonal cell selection, clonal expansion, or both. These findings reveal a large degree of somatic mosaicism in healthy human tissues, link de novo and cancer mutations to somatic mosaicism, and couple somatic mosaicism with cell proliferation.

61 citations


Posted ContentDOI
14 Sep 2017-bioRxiv
TL;DR: This work assesses reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices.
Abstract: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established (e.g., ratio of intra to interchromosomal interactions) and novel (e.g., QuASAR-QC) measures to identify low quality experiments. In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundaje/3DChromatin_ReplicateQC to facilitate adoption in the community.

57 citations
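
One of the established quality measures named in the abstract above, the ratio of intra- to interchromosomal interactions, is simple enough to sketch. The record layout `(chrom_a, chrom_b, count)` and the function name `cis_trans_ratio` are assumptions made for this example and are not taken from the 3DChromatin_ReplicateQC software.

```python
def cis_trans_ratio(contacts):
    """Return intra-chromosomal contacts divided by inter-chromosomal contacts.

    `contacts` is an iterable of (chrom_a, chrom_b, count) tuples; this
    layout is a hypothetical one chosen for the example.
    """
    intra = 0
    inter = 0
    for chrom_a, chrom_b, count in contacts:
        if chrom_a == chrom_b:
            intra += count
        else:
            inter += count
    if inter == 0:
        # No inter-chromosomal contacts observed: the ratio is unbounded.
        return float("inf")
    return intra / inter
```

An unusually low ratio is commonly read as a sign of a noisy library, although any threshold would have to come from the benchmarking data rather than from this sketch.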


Journal ArticleDOI
TL;DR: It is found that embryos lacking particular miRNA-dependent signaling pathways develop a vascular trait similar to wild-type, but with a profound increase in phenotypic heterogeneity, which marks an important advance in the comprehension of how miRNAs function in the development of higher organisms.

56 citations


Posted ContentDOI
23 Dec 2017-bioRxiv
TL;DR: These analyses redefine the landscape of non-coding driver mutations in cancer genomes, confirming a few previously reported elements and raising doubts about others, while identifying novel candidate elements across 27 cancer types.
Abstract: Discovery of cancer drivers has traditionally focused on the identification of protein-coding genes. Here we present a comprehensive analysis of putative cancer driver mutations in both protein-coding and non-coding genomic regions across >2,500 whole cancer genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium. We developed a statistically rigorous strategy for combining significance levels from multiple driver discovery methods and demonstrate that the integrated results overcome limitations of individual methods. We combined this strategy with careful filtering and applied it to protein-coding genes, promoters, untranslated regions (UTRs), distal enhancers and non-coding RNAs. These analyses redefine the landscape of non-coding driver mutations in cancer genomes, confirming a few previously reported elements and raising doubts about others, while identifying novel candidate elements across 27 cancer types. Novel recurrent events were found in the promoters or 5’UTRs of TP53, RFTN1, RNF34, and MTG2, in the 3’UTRs of NFKBIZ and TOB1, and in the non-coding RNA RMRP. We provide evidence that the previously reported non-coding RNAs NEAT1 and MALAT1 may be subject to a localized mutational process. Perhaps the most striking finding is the relative paucity of point mutations driving cancer in non-coding genes and regulatory elements. Though we have limited power to discover infrequent non-coding drivers in individual cohorts, combined analysis of promoters of known cancer genes shows little excess of mutations beyond TERT.

54 citations
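
The abstract above mentions combining significance levels from multiple driver discovery methods but does not name the combination rule. As a point of reference only, the sketch below shows Fisher's method, the classical way to combine p-values. It assumes the p-values are independent, which is unlikely to hold for methods run on the same mutation data, so it should not be read as the strategy used in that study. The function name `fisher_combined_pvalue` is hypothetical.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Test statistic: -2 * sum(ln p_i), chi-square with 2k degrees of freedom
    # under the null hypothesis.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square survival function has the
    # closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

With a single input the function returns that p-value unchanged, which is a quick sanity check on the closed form.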


Journal ArticleDOI
TL;DR: This study identifies global changes in RNA-protein interactions during vertebrate MZT and shows that Hnrnpa1 RNA-binding activities are spatially and temporally coordinated to regulate RNA metabolism during early development.
Abstract: During the maternal-to-zygotic transition (MZT), transcriptionally silent embryos rely on post-transcriptional regulation of maternal mRNAs until zygotic genome activation (ZGA). RNA-binding proteins (RBPs) are important regulators of post-transcriptional RNA processing events, yet their identities and functions during developmental transitions in vertebrates remain largely unexplored. Using mRNA interactome capture, we identified 227 RBPs in zebrafish embryos before and during ZGA, hereby named the zebrafish MZT mRNA-bound proteome. This protein constellation consists of many conserved RBPs, some of which are potential stage-specific mRNA interactors that likely reflect the dynamics of RNA-protein interactions during MZT. The enrichment of numerous splicing factors like hnRNP proteins before ZGA was surprising, because maternal mRNAs were found to be fully spliced. To address potentially unique roles of these RBPs in embryogenesis, we focused on Hnrnpa1. iCLIP and subsequent mRNA reporter assays revealed a function for Hnrnpa1 in the regulation of poly(A) tail length and translation of maternal mRNAs through sequence-specific association with 3' UTRs before ZGA. Comparison of iCLIP data from two developmental stages revealed that Hnrnpa1 dissociates from maternal mRNAs at ZGA and instead regulates the nuclear processing of pri-mir-430 transcripts, which we validated experimentally. The shift from cytoplasmic to nuclear RNA targets was accompanied by a dramatic translocation of Hnrnpa1 and other pre-mRNA splicing factors to the nucleus in a transcription-dependent manner. Thus, our study identifies global changes in RNA-protein interactions during vertebrate MZT and shows that Hnrnpa1 RNA-binding activities are spatially and temporally coordinated to regulate RNA metabolism during early development.

49 citations


Journal ArticleDOI
TL;DR: MrTADFinder provides a novel computational framework to explore the multi-scale structures in Hi-C contact maps and examines how somatic mutations are distributed across boundaries and finds a clear stepwise pattern.
Abstract: Genome-wide proximity ligation based assays such as Hi-C have revealed that eukaryotic genomes are organized into structural units called topologically associating domains (TADs). From a visual examination of the chromosomal contact map, however, it is clear that the organization of the domains is not simple or obvious. Instead, TADs exhibit various length scales and, in many cases, a nested arrangement. Here, by exploiting the resemblance between TADs in a chromosomal contact map and densely connected modules in a network, we formulate TAD identification as a network optimization problem and propose an algorithm, MrTADFinder, to identify TADs from intra-chromosomal contact maps. MrTADFinder is based on the network-science concept of modularity. A key component of it is deriving an appropriate background model for contacts in a random chain, by numerically solving a set of matrix equations. The background model preserves the observed coverage of each genomic bin as well as the distance dependence of the contact frequency for any pair of bins exhibited by the empirical map. Also, by introducing a tunable resolution parameter, MrTADFinder provides a self-consistent approach for identifying TADs at different length scales, hence the acronym "Mr" standing for Multiple Resolutions. We then apply MrTADFinder to various Hi-C datasets. The identified domain boundaries are marked by characteristic signatures in chromatin marks and transcription factors (TF) that are consistent with earlier work. Moreover, by calling TADs at different length scales, we observe that boundary signatures change with resolution, with different chromatin features having different characteristic length scales. Furthermore, we report an enrichment of HOT (high-occupancy target) regions near TAD boundaries and investigate the role of different TFs in determining boundaries at various resolutions. 
To further explore the interplay between TADs and epigenetic marks, as tumor mutational burden is known to be coupled to chromatin structure, we examine how somatic mutations are distributed across boundaries and find a clear stepwise pattern. Overall, MrTADFinder provides a novel computational framework to explore the multi-scale structures in Hi-C contact maps.
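
The modularity objective with a tunable resolution parameter, as described in the MrTADFinder abstract above, can be sketched for a partition of bins into contiguous domains. This is not the MrTADFinder algorithm: the background model here is the simple configuration-model expectation (product of bin coverages divided by total coverage), swapped in for illustration, whereas the abstract describes a background model that also preserves the distance dependence of contacts and is obtained by numerically solving matrix equations. The function name `modularity` and its arguments are hypothetical.

```python
def modularity(weights, boundaries, resolution=1.0):
    """Modularity of a partition of bins into contiguous domains.

    weights    -- symmetric square matrix (list of lists) of contact counts
    boundaries -- sorted start indices of each domain, beginning with 0
    resolution -- tunable factor scaling the expected contacts
    """
    n = len(weights)
    strength = [sum(row) for row in weights]  # coverage of each bin
    total = sum(strength)                     # twice the total contact weight
    if total == 0:
        return 0.0
    edges = list(boundaries) + [n]
    score = 0.0
    for start, end in zip(edges[:-1], edges[1:]):
        for i in range(start, end):
            for j in range(start, end):
                # Assumed null model: configuration-model expectation.
                expected = strength[i] * strength[j] / total
                score += weights[i][j] - resolution * expected
    return score / total
```

Raising `resolution` penalizes large domains more heavily, so an optimizer over `boundaries` would tend to return smaller domains; the search over partitions itself is left out of this sketch.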

Journal ArticleDOI
TL;DR: ALoFT (annotation of loss-of-function transcripts) annotates and predicts the disease-causing potential of loss-of-function (LoF) variants; using data from Mendelian disease-gene discovery projects, it can distinguish between LoF variants that are deleterious as heterozygotes and those causing disease only in the homozygous state.
Abstract: Variants predicted to result in the loss of function of human genes have attracted interest because of their clinical impact and surprising prevalence in healthy individuals. Here, we present ALoFT (annotation of loss-of-function transcripts), a method to annotate and predict the disease-causing potential of loss-of-function variants. Using data from Mendelian disease-gene discovery projects, we show that ALoFT can distinguish between loss-of-function variants that are deleterious as heterozygotes and those causing disease only in the homozygous state. Investigation of variants discovered in healthy populations suggests that each individual carries at least two heterozygous premature stop alleles that could potentially lead to disease if present as homozygotes. When applied to de novo putative loss-of-function variants in autism-affected families, ALoFT distinguishes between deleterious variants in patients and benign variants in unaffected siblings. Finally, analysis of somatic variants in >6500 cancer exomes shows that putative loss-of-function variants predicted to be deleterious by ALoFT are enriched in known driver genes. Variants causing loss of function (LoF) of human genes have clinical implications. Here, the authors present a method to predict disease-causing potential of LoF variants, ALoFT (annotation of Loss-of-Function Transcripts) and show its application to interpreting LoF variants in different contexts.

Journal ArticleDOI
TL;DR: The first whole-genome analysis of pRCC is performed, finding genome-wide mutational patterns are governed mostly by methylation-associated C-to-T transitions, and significantly more mutations in open chromatin and early-replicating regions in tumors with chromatin-modifier alterations are observed.
Abstract: To date, studies on papillary renal-cell carcinoma (pRCC) have largely focused on coding alterations in traditional drivers, particularly the tyrosine-kinase, Met. However, for a significant fraction of tumors, researchers have been unable to determine a clear molecular etiology. To address this, we perform the first whole-genome analysis of pRCC. Elaborating on previous results on MET, we find a germline SNP (rs11762213) in this gene predicting prognosis. Surprisingly, we detect no enrichment for small structural variants disrupting MET. Next, we scrutinize noncoding mutations, discovering potentially impactful ones associated with MET. Many of these are in an intron connected to a known, oncogenic alternative-splicing event; moreover, we find methylation dysregulation nearby, leading to a cryptic promoter activation. We also notice an elevation of mutations in the long noncoding RNA NEAT1, and these mutations are associated with increased expression and unfavorable outcome. Finally, to address the origin of pRCC heterogeneity, we carry out whole-genome analyses of mutational processes. First, we investigate genome-wide mutational patterns, finding they are governed mostly by methylation-associated C-to-T transitions. We also observe significantly more mutations in open chromatin and early-replicating regions in tumors with chromatin-modifier alterations. Finally, we reconstruct cancer-evolutionary trees, which have markedly different topologies and suggested evolutionary trajectories for the different subtypes of pRCC.

Posted ContentDOI
Sebastian M. Waszak, Grace Tiao, Bin Zhu, Tobias Rausch, Francesc Muyas, Bernardo Rodriguez-Martin, Raquel Rabionet, Sergei Yakneen, Geòrgia Escaramís, Yang Li, Natalie Saini, Steven A. Roberts, German Demidov, Esa Pitkänen, Olivier Delaneau, Jose Maria Heredia-Genestar, Joachim Weischenfeldt, Suyash Shringarpure, Jieming Chen, Hidewaki Nakagawa, Ludmil B. Alexandrov, Oliver Drechsel, L. J. Dursi, Ayellet V. Segrè, Erik Garrison, Serap Erkek, Nina Habermann, Lara Urban, Ekta Khurana, Andy Cafferkey, Shuto Hayashi, Seiya Imoto, Lauri A. Aaltonen, Eva G. Alvarez, Adrian Baez-Ortega, Matthew A. Bailey, Mattia Bosio, Alicia L. Bruzos, Ivo Buchhalter, Carlos Bustamante, Claudia Calabrese, Anthony DiBiase, Mark Gerstein, Aliaksei Holik, Xing Hua, Kuan-lin Huang, Ivica Letunic, Leszek J. Klimczak, Roelof Koster, Sushant Kumar, Michael D. McLellan, R. Jay Mashl, Lisa Mirabello, Steven Newhouse, Aparna Prasad, Gunnar Rätsch, Matthias Schlesner, Roland F. Schwarz, Pramod Sharma, Tal Shmaya, Nikos Sidiropoulos, Lei Song, Hana Susak, Tomas Tanskanen, Marta Tojo, David C. Wedge, Mark H. Wright, Ying Wu, Kai Ye, Venkata Yellapantula, Jorge Zamora, Atul J. Butte, Gad Getz, Jared T. Simpson, Li Ding, Tomas Marques-Bonet, Arcadi Navarro, Alvis Brazma, Peter J. Campbell, Stephen J. Chanock, Nilanjan Chatterjee, Oliver Stegle, Reiner Siebert, Stephan Ossowski, Olivier Harismendy, Dmitry A. Gordenin, Jose M. C. Tubio, Francisco M. De La Vega, Douglas F. Easton, Xavier Estivill, Jan O. Korbel, Icgc
01 Nov 2017-bioRxiv
TL;DR: This study highlights the major impact of rare and common germline variants on mutational landscapes in cancer and infers over a hundred polymorphic L1/LINE elements with somatic retrotransposition activity in cancer.
Abstract: Cancers develop through somatic mutagenesis; however, germline genetic variation can markedly contribute to tumorigenesis via diverse mechanisms. We discovered and phased 88 million germline single nucleotide variants, short insertions/deletions, and large structural variants in whole genomes from 2,642 cancer patients, and employed this genomic resource to study genetic determinants of somatic mutagenesis across 39 cancer types. Our analyses implicate damaging germline variants in a variety of cancer predisposition and DNA damage response genes with specific somatic mutation patterns. Mutations in the MBD4 DNA glycosylase gene showed association with elevated C>T mutagenesis at CpG dinucleotides, a ubiquitous mutational process acting across tissues. Analysis of somatic structural variation exposed complex rearrangement patterns, involving cycles of templated insertions and tandem duplications, in BRCA1-deficient tumours. Genome-wide association analysis implicated common genetic variation at the APOBEC3 gene cluster with reduced basal levels of somatic mutagenesis attributable to APOBEC cytidine deaminases across cancer types. We further inferred over a hundred polymorphic L1/LINE elements with somatic retrotransposition activity in cancer. Our study highlights the major impact of rare and common germline variants on mutational landscapes in cancer.

Journal ArticleDOI
01 Apr 2017-Stroke
TL;DR: This is the largest study of ex-RNAs in relation to stroke using an unbiased approach in an observational cohort and the first large study to examine human small noncoding RNAs beyond miRNAs; it demonstrates that, when studied in a large observational cohort, extracellular miRNAs are associated with stroke risk.
Abstract: Background and Purpose: There is increasing interest in extracellular RNAs (ex-RNAs), with numerous reports of associations between selected microRNAs (miRNAs) and a variety of cardiovascular disease phenotypes. Previous studies of ex-RNAs in relation to risk for cardiovascular disease have investigated small numbers of patients and assayed only candidate miRNAs. No human studies have investigated links between novel ex-RNAs and stroke. Methods: We conducted unbiased next-generation sequencing using plasma from 40 participants of the FHS (Framingham Heart Study; Offspring Cohort Exam 8) followed by high-throughput polymerase chain reaction of 471 ex-RNAs. The reverse transcription quantitative polymerase chain reaction included 331 of the most abundant miRNAs, 43 small nucleolar RNAs, and 97 piwi-interacting RNAs in 2763 additional FHS participants and explored the relations of ex-RNAs and prevalent (n=63) and incident (n=51) stroke and coronary heart disease (prevalent=286, incident=69). Results: After adjustment for multiple cardiovascular disease risk factors, 7 ex-RNAs were associated with stroke prevalence or incidence; no ex-RNAs were associated with prevalent or incident coronary heart disease. Statistically significant ex-RNA associations with stroke were specific, with no overlap between prevalent and incident events. Conclusions: This is the largest study of ex-RNAs in relation to stroke using an unbiased approach in an observational cohort and the first large study to examine human small noncoding RNAs beyond miRNAs. These results demonstrate that when studied in a large observational cohort, extracellular miRNAs are associated with stroke risk.

Journal ArticleDOI
TL;DR: This investigation provides insight into the functional impact of retroduplications and their association with genomic elements; the authors anticipate that the approach and analytical methodology will have application in a more clinical context, where exome sequencing data is abundant and the discovery of retroduplications can potentially improve the accuracy of SNP calling.
Abstract: Retroduplications come from reverse transcription of mRNAs and their insertion back into the genome. Here, we performed comprehensive discovery and analysis of retroduplications in a large cohort of 2,535 individuals from 26 human populations, as part of 1000 Genomes Phase 3. We developed an integrated approach to discover novel retroduplications combining high-coverage exome and low-coverage whole-genome sequencing data, utilizing information from both exon-exon junctions and discordant paired-end reads. We found 503 parent genes having novel retroduplications absent from the reference genome. Based solely on retroduplication variation, we built phylogenetic trees of human populations; these represent superpopulation structure well and indicate that variable retroduplications are effective population markers. We further identified 43 retroduplication parent genes differentiating superpopulations. This group contains several interesting insertion events, including a SLMO2 retroduplication and insertion into CAV3, which has a potential disease association. We also found retroduplications to be associated with a variety of genomic features: (1) Insertion sites were correlated with regular nucleosome positioning. (2) They, predictably, tend to avoid conserved functional regions, such as exons, but, somewhat surprisingly, also avoid introns. (3) Retroduplications tend to be co-inserted with young L1 elements, indicating recent retrotranspositional activity, and (4) they have a weak tendency to originate from highly expressed parent genes. Our investigation provides insight into the functional impact and association with genomic elements of retroduplications. We anticipate our approach and analytical methodology to have application in a more clinical context, where exome sequencing data is abundant and the discovery of retroduplications can potentially improve the accuracy of SNP calling.

Posted ContentDOI
24 Aug 2017-bioRxiv
TL;DR: Observations illuminate a relevant role of L1 retrotransposition in remodeling the cancer genome, with potential implications in the development of human tumours.
Abstract: About half of all cancers have somatic integrations of retrotransposons. To characterize their role in oncogenesis, we analyzed the patterns and mechanisms of somatic retrotransposition in 2,774 cancer genomes from 37 histological cancer subtypes. We identified 20,230 somatically acquired retrotransposition events, affecting 43% of samples, and spanning a range of event types. L1 insertions emerged as the third most frequent type of somatic structural variation in cancer. Aberrant L1 integrations can delete megabase-scale regions of a chromosome, sometimes removing tumour suppressor genes, as well as inducing complex translocations and large-scale duplications. Somatic retrotranspositions can also initiate breakage-fusion-bridge cycles of genomic instability, leading to high-level amplification of oncogenes. These observations illuminate a relevant role of L1 retrotransposition in remodeling the cancer genome, with potential implications in the initiation and/or development of human tumours.

Journal ArticleDOI
TL;DR: A hierarchical organization for supplements is proposed, with some parts paralleling and “shadowing” the main text and other elements branching off from it, and a specific formatting is suggested to make this structure explicit.
Abstract: Supplements are increasingly important to the scientific record, particularly in genomics. However, they are often underutilized. Optimally, supplements should make results findable, accessible, interoperable, and reusable (i.e., “FAIR”). Moreover, properly off-loading to them the data and detail in a paper could make the main text more readable. We propose a hierarchical organization for supplements, with some parts paralleling and “shadowing” the main text and other elements branching off from it, and we suggest a specific formatting to make this structure explicit. Furthermore, sections of the supplement could be presented in multiple scientific “dialects”, including machine-readable and lay-friendly formats.

Posted ContentDOI
02 Jul 2017-bioRxiv
TL;DR: A system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species and demonstrated that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.
Abstract: Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 MYA, but that are absent in the Hominidae. In fact, Hominidae show between four- and seven-fold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. For example, recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli. This process resulted in thousands of novel, species-specific CTCF binding sites. Our results demonstrate that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.

Journal ArticleDOI
28 Jun 2017-Nature
TL;DR: An analysis of 360 breast-cancer genomes has identified cancer-driving mutations in 9 non-coding DNA sequences called promoters, which regulate gene expression; the result hints at the prevalence of non-coding drivers.
Abstract: An analysis of 360 breast-cancer genomes has identified cancer-driving mutations in 9 non-coding DNA sequences called promoters, which regulate gene expression. The result hints at the prevalence of non-coding drivers. See Article p.55

Journal ArticleDOI
TL;DR: This unit provides guidelines for installing and running FunSeq2 to annotate and prioritize variants, incorporate user-defined annotations, and detect differential gene expression.
Abstract: The identification of non-coding drivers remains a challenge and bottleneck for the use of whole-genome sequencing in the clinic. FunSeq2 is a computational tool for annotation and prioritization of somatic mutations in coding and non-coding regions. It integrates a data context made from large-scale genomic datasets and uses a high-throughput variant prioritization pipeline. This unit provides guidelines for installing and running FunSeq2 to (a) annotate and prioritize variants, (b) incorporate user-defined annotations, and (c) detect differential gene expression. © 2017 by John Wiley & Sons, Inc.

Journal ArticleDOI
TL;DR: The Intensification approach is developed, which uses the modular structure of repeat protein domains to amplify signals of selection from population genetics and traditional interspecies conservation, and is able to aggregate variants at the codon level to identify important positions in repeat domains that show strong conservation signals.
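
As a rough illustration of the aggregation idea described in this summary, the sketch below pools variants from several repeat units by their offset within the repeat, so that positions tolerating few variants across all units stand out. It is a toy example under simplifying assumptions, not the authors' implementation: the function name `aggregate_by_repeat_position`, the equal-length and gap-free repeat units, and the residue coordinates are all hypothetical.

```python
from collections import Counter


def aggregate_by_repeat_position(repeats, variant_positions):
    """Pool variant counts across repeat units by offset within the repeat.

    repeats: list of (start, end) residue intervals, inclusive, one per
        repeat unit. Assumed here to be equal length and already aligned
        without gaps (a simplification).
    variant_positions: iterable of residue positions carrying a variant.
    """
    counts = Counter()
    for pos in variant_positions:
        for start, end in repeats:
            if start <= pos <= end:
                # Offset 0 is the first residue of whichever unit contains pos.
                counts[pos - start] += 1
                break
    return counts


# Hypothetical toy data: three 5-residue repeat units and five variants.
repeats = [(1, 5), (6, 10), (11, 15)]
variants = [2, 7, 12, 4, 15]

profile = aggregate_by_repeat_position(repeats, variants)
repeat_length = repeats[0][1] - repeats[0][0] + 1

# Offsets with no pooled variants are candidates for constrained positions.
least_variable = [i for i in range(repeat_length) if profile[i] == 0]
print(dict(profile), least_variable)
```

With real data, each repeat unit on its own usually carries too few variants to rank positions, which is why pooling across units is useful; a realistic version would also need a proper multiple alignment of the repeat units and a statistical test rather than a raw zero-count filter.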

Posted ContentDOI
07 Feb 2017-bioRxiv
TL;DR: ALoFT (Annotation of Loss-of-Function Transcripts), a method to annotate and predict the disease-causing potential of LoF variants, shows that pLoF variants predicted to be deleterious by ALoFT are enriched in known driver genes.
Abstract: Variants predicted to result in the loss of function (LoF) of human genes have attracted interest because of their clinical impact and surprising prevalence in healthy individuals. Here, we present ALoFT (Annotation of Loss-of-Function Transcripts), a method to annotate and predict the disease-causing potential of LoF variants. Using data from Mendelian disease-gene discovery projects, we show that ALoFT can distinguish between LoF variants deleterious as heterozygotes and those causing disease only in the homozygous state. Investigation of variants discovered in healthy populations suggests that each individual carries at least two heterozygous premature stop alleles that could potentially lead to disease if present as homozygotes. When applied to de novo putative LoF (pLoF) variants in autism-affected families, ALoFT distinguishes between deleterious variants in patients and benign variants in unaffected siblings. Finally, analysis of somatic variants in >6,500 cancer exomes shows that pLoF variants predicted to be deleterious by ALoFT are enriched in known driver genes.