
Showing papers by "Manolis Kellis" published in 2017


Journal ArticleDOI
TL;DR: ChromHMM learns chromatin-state signatures using a multivariate hidden Markov model (HMM) that explicitly models the combinatorial presence or absence of each mark.
Abstract: Noncoding DNA regions have central roles in human biology, evolution, and disease. ChromHMM helps to annotate the noncoding genome using epigenomic information across one or multiple cell types. It combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type. ChromHMM learns chromatin-state signatures using a multivariate hidden Markov model (HMM) that explicitly models the combinatorial presence or absence of each mark. ChromHMM uses these signatures to generate a genome-wide annotation for each cell type by calculating the most probable state for each genomic segment. ChromHMM provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state. ChromHMM is distinguished by its modeling emphasis on combinations of marks, its tight integration with downstream functional enrichment analyses, its speed, and its ease of use. Chromatin states are learned, annotations are produced, and enrichments are computed within 1 d.
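The modeling idea in the abstract above can be illustrated with a minimal sketch. It assumes each state emits every mark as an independent Bernoulli variable, and it uses Viterbi decoding as one standard way to obtain a state per genomic bin; the abstract does not specify the decoding rule, and the function names, state labels, and probabilities here are hypothetical, not taken from ChromHMM itself.

```python
import math

def emission_prob(observed, mark_probs):
    """P(observed marks | state), assuming independent Bernoulli marks."""
    p = 1.0
    for present, q in zip(observed, mark_probs):
        p *= q if present else (1.0 - q)
    return p

def viterbi(observations, init, trans, emit):
    """Most probable state path for a sequence of bins, in log space."""
    n_states = len(init)
    # Log-score of the best path ending in each state at the first bin
    score = [math.log(init[s]) + math.log(emission_prob(observations[0], emit[s]))
             for s in range(n_states)]
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            # Best predecessor state for state s at this bin
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + math.log(emission_prob(obs, emit[s])))
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: state 0 is "active-like", state 1 is "quiescent-like"
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.8], [0.05, 0.05]]   # per-state probability that each mark is present
bins = [(1, 1), (1, 0), (0, 0), (0, 0)]
print(viterbi(bins, init, trans, emit))
```

Each bin is a tuple of binary mark calls, so the emission term captures the combinatorial presence or absence of marks, while the transition matrix captures spatial context between neighboring bins.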

478 citations


Journal ArticleDOI
Felix R. Day1, Deborah J. Thompson1, Hannes Helgason2, Hannes Helgason3  +241 moreInstitutions (67)
TL;DR: In this article, the authors used 1000 Genomes Project-imputed genotype data in up to ∼370,000 women to identify 389 independent signals (P < 5 × 10⁻⁸) for age at menarche, a milestone in female pubertal development.
Abstract: The timing of puberty is a highly polygenic childhood trait that is epidemiologically associated with various adult diseases. Using 1000 Genomes Project-imputed genotype data in up to ∼370,000 women, we identify 389 independent signals (P < 5 × 10⁻⁸) for age at menarche, a milestone in female pubertal development. In Icelandic data, these signals explain ∼7.4% of the population variance in age at menarche, corresponding to ∼25% of the estimated heritability. We implicate ∼250 genes via coding variation or associated expression, demonstrating significant enrichment in neural tissues. Rare variants near the imprinted genes MKRN3 and DLK1 were identified, exhibiting large effects when paternally inherited. Mendelian randomization analyses suggest causal inverse associations, independent of body mass index (BMI), between puberty timing and risks for breast and endometrial cancers in women and prostate cancer in men. In aggregate, our findings highlight the complexity of the genetic regulation of puberty timing and support causal links with cancer susceptibility.

392 citations


01 Nov 2017
TL;DR: ChromHMM combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type, and provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state.
Abstract: Noncoding DNA regions have central roles in human biology, evolution, and disease. ChromHMM helps to annotate the noncoding genome using epigenomic information across one or multiple cell types. It combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type. ChromHMM learns chromatin-state signatures using a multivariate hidden Markov model (HMM) that explicitly models the combinatorial presence or absence of each mark. ChromHMM uses these signatures to generate a genome-wide annotation for each cell type by calculating the most probable state for each genomic segment. ChromHMM provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state. ChromHMM is distinguished by its modeling emphasis on combinations of marks, its tight integration with downstream functional enrichment analyses, its speed, and its ease of use. Chromatin states are learned, annotations are produced, and enrichments are computed within 1 d.

364 citations


Journal ArticleDOI
Ashis Saha1, Yungil Kim1, Ariel D. H. Gewirtz2, Brian Jo2  +256 moreInstitutions (49)
TL;DR: Networks are built that additionally capture the regulation of relative isoform abundance and splicing, along with tissue-specific connections unique to each of a diverse set of tissues, providing an improved understanding of the complex relationships of the human transcriptome across tissues.
Abstract: Gene co-expression networks capture biologically important patterns in gene expression data, enabling functional analyses of genes, discovery of biomarkers, and interpretation of genetic variants. Most network analyses to date have been limited to assessing correlation between total gene expression levels in a single tissue or small sets of tissues. Here, we built networks that additionally capture the regulation of relative isoform abundance and splicing, along with tissue-specific connections unique to each of a diverse set of tissues. We used the Genotype-Tissue Expression (GTEx) project v6 RNA sequencing data across 50 tissues and 449 individuals. First, we developed a framework called Transcriptome-Wide Networks (TWNs) for combining total expression and relative isoform levels into a single sparse network, capturing the interplay between the regulation of splicing and transcription. We built TWNs for 16 tissues and found that hubs in these networks were strongly enriched for splicing and RNA binding genes, demonstrating their utility in unraveling regulation of splicing in the human transcriptome. Next, we used a Bayesian biclustering model that identifies network edges unique to a single tissue to reconstruct Tissue-Specific Networks (TSNs) for 26 distinct tissues and 10 groups of related tissues. Finally, we found genetic variants associated with pairs of adjacent nodes in our networks, supporting the estimated network structures and identifying 20 genetic variants with distant regulatory impact on transcription and splicing. Our networks provide an improved understanding of the complex relationships of the human transcriptome across tissues.

146 citations



Posted ContentDOI
23 Dec 2017-bioRxiv
TL;DR: These analyses redefine the landscape of non-coding driver mutations in cancer genomes, confirming a few previously reported elements and raising doubts about others, while identifying novel candidate elements across 27 cancer types.
Abstract: Discovery of cancer drivers has traditionally focused on the identification of protein-coding genes. Here we present a comprehensive analysis of putative cancer driver mutations in both protein-coding and non-coding genomic regions across >2,500 whole cancer genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium. We developed a statistically rigorous strategy for combining significance levels from multiple driver discovery methods and demonstrate that the integrated results overcome limitations of individual methods. We combined this strategy with careful filtering and applied it to protein-coding genes, promoters, untranslated regions (UTRs), distal enhancers and non-coding RNAs. These analyses redefine the landscape of non-coding driver mutations in cancer genomes, confirming a few previously reported elements and raising doubts about others, while identifying novel candidate elements across 27 cancer types. Novel recurrent events were found in the promoters or 5’UTRs of TP53, RFTN1, RNF34, and MTG2, in the 3’UTRs of NFKBIZ and TOB1, and in the non-coding RNA RMRP. We provide evidence that the previously reported non-coding RNAs NEAT1 and MALAT1 may be subject to a localized mutational process. Perhaps the most striking finding is the relative paucity of point mutations driving cancer in non-coding genes and regulatory elements. Though we have limited power to discover infrequent non-coding drivers in individual cohorts, combined analysis of promoters of known cancer genes shows little excess of mutations beyond TERT.
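To make the idea of combining significance levels concrete, here is a minimal sketch using Fisher's method, a textbook baseline. The abstract does not name the combination procedure, so this is an assumption for illustration only. Fisher's method also assumes independent p-values, which is unlikely to hold for several driver discovery methods run on the same mutation data. The function name is hypothetical.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (illustrative baseline)."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    # Statistic: -2 * sum(ln p), chi-square with 2k degrees of freedom under the null.
    half_stat = -sum(math.log(p) for p in p_values)  # statistic divided by 2
    # Closed-form chi-square survival function for an even number (2k) of
    # degrees of freedom: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Two methods each reporting p = 0.05 for the same element
print(fisher_combine([0.05, 0.05]))
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed form.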

54 citations


Journal ArticleDOI
TL;DR: The results suggest the existence of a recombination rate valley at regulatory domains and provide a potential molecular mechanism to interpret the interplay between genetic and epigenetic variations.
Abstract: Recombination rate is non-uniformly distributed across the human genome. The variation of recombination rate at both fine and large scales cannot be fully explained by DNA sequences alone. Epigenetic factors, particularly DNA methylation, have recently been proposed to influence the variation in recombination rate. We study the relationship between recombination rate and gene regulatory domains, defined by a gene and its linked control elements. We define these links using expression quantitative trait loci (eQTLs), methylation quantitative trait loci (meQTLs), chromatin conformation from publicly available datasets (Hi-C and ChIA-PET), and correlated activity links that we infer across cell types. Each link type shows a “recombination rate valley” of significantly reduced recombination rate compared to matched control regions. This recombination rate valley is most pronounced for gene regulatory domains of early embryonic development genes, housekeeping genes, and constitutive regulatory elements, which are known to show increased evolutionary constraint across species. Recombination rate valleys show increased DNA methylation, reduced double-stranded break initiation, and increased repair efficiency, specifically in the lineage leading to the germ line. Moreover, by using only the overlap of functional links and DNA methylation in germ cells, we are able to predict the recombination rate with high accuracy. Our results suggest the existence of a recombination rate valley at regulatory domains and provide a potential molecular mechanism to interpret the interplay between genetic and epigenetic variations.

40 citations


Journal ArticleDOI
TL;DR: By inspecting predictive properties in isolation and conducting meta-analysis over the competing methods, the authors find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results.
Abstract: In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.

37 citations


Journal ArticleDOI
TL;DR: This paper proposes a hierarchical hidden Markov model (diHMM) that identifies domain-level states that vary in nucleosome-level state composition, spatial distribution and functionality.
Abstract: Chromatin-state analysis is widely applied in the studies of development and diseases. However, existing methods operate at a single length scale, and therefore cannot distinguish large domains from isolated elements of the same type. To overcome this limitation, we present a hierarchical hidden Markov model, diHMM, to systematically annotate chromatin states at multiple length scales. We apply diHMM to analyse a public ChIP-seq data set. diHMM not only accurately captures nucleosome-level information, but identifies domain-level states that vary in nucleosome-level state composition, spatial distribution and functionality. The domain-level states recapitulate known patterns such as super-enhancers, bivalent promoters and Polycomb repressed regions, and identify additional patterns whose biological functions are not yet characterized. By integrating chromatin-state information with gene expression and Hi-C data, we identify context-dependent functions of nucleosome-level states. Thus, diHMM provides a powerful tool for investigating the role of higher-order chromatin structure in gene regulation. The impact of chromatin structure on gene expression makes it integral to our understanding of developmental and disease processes. Here, the authors introduce a hierarchical hidden Markov model to systematically annotate chromatin states at multiple length scales, and demonstrate its utility for the elucidation of the role of chromatin structure in gene expression.

36 citations


Posted ContentDOI
27 Sep 2017-bioRxiv
TL;DR: High-Definition Reporter Assay provides a general, scalable, high-throughput, and high-resolution approach for experimental dissection of regulatory regions and driver nucleotides in the context of human biology and disease.
Abstract: Genome-wide epigenomic maps revealed millions of regions showing signatures of enhancers, promoters, and other gene-regulatory elements. However, high-throughput experimental validation of their function and high-resolution dissection of their driver nucleotides remain limited in their scale and length of regions tested. Here, we present a new method, HiDRA (High-Definition Reporter Assay), that overcomes these limitations by combining components of Sharpr-MPRA and STARR-Seq with genome-wide selection of accessible regions from ATAC-Seq. We used HiDRA to test ~7 million DNA fragments preferentially selected from accessible chromatin in the GM12878 lymphoblastoid cell line. By design, accessibility-selected fragments were highly overlapping (up to 370 per region), enabling us to pinpoint driver regulatory nucleotides by exploiting subtle differences in reporter activity between partially-overlapping fragments, using a new machine learning model SHARPR2. Our resulting maps include ~65,000 regions showing significant enhancer function and enriched for endogenous active histone marks (including H3K9ac, H3K27ac), regulatory sequence motifs, and regions bound by immune regulators. Within them, we discover ~13,000 high-resolution driver elements enriched for regulatory motifs and evolutionarily-conserved nucleotides, and help predict causal genetic variants underlying disease from genome-wide association studies. Overall, HiDRA provides a general, scalable, high-throughput, and high-resolution approach for experimental dissection of regulatory regions and driver nucleotides in the context of human biology and disease.

20 citations


Posted ContentDOI
26 Nov 2017-bioRxiv
TL;DR: In this paper, deep-coverage whole-genome sequencing in 8,392 individuals of European and African American ancestries was performed to comprehensively delineate the inherited basis for plasma Lp(a); Lp(a) was found to be 85% heritable among African Americans and 75% among Europeans, with notable inter-ethnic heterogeneity.
Abstract: Lipoprotein(a), Lp(a), is a modified low-density lipoprotein particle where apolipoprotein(a) (protein product of the LPA gene) is covalently attached to apolipoprotein B. Lp(a) is a highly heritable, causal risk factor for cardiovascular diseases and varies in concentrations across ancestries. To comprehensively delineate the inherited basis for plasma Lp(a), we performed deep-coverage whole genome sequencing in 8,392 individuals of European and African American ancestries. Through whole genome variant discovery and direct genotyping of all structural variants overlapping LPA, we quantified the 5.5kb kringle IV-2 copy number (KIV2-CN), a known LPA structural polymorphism, and developed a model for its imputation. Through common variant analysis, we discovered a novel locus (SORT1) associated with Lp(a)-cholesterol, and also genetic modifiers of KIV2-CN. Furthermore, in contrast to previous GWAS studies, we explain most of the heritability of Lp(a), observing Lp(a) to be 85% heritable among African Americans and 75% among Europeans, yet with notable inter-ethnic heterogeneity. Through analyses of aggregates of rare coding and non-coding variants with Lp(a)-cholesterol, we found the only genome-wide significant signal to be at a non-coding SLC22A3 intronic window also previously described to be associated with Lp(a); however, this association was mitigated by adjustment with KIV2-CN. Finally, using an additional imputation dataset (N=27,344), we performed Mendelian randomization of LPA variant classes, finding that genetically regulated Lp(a) is more strongly associated with incident cardiovascular diseases than directly measured Lp(a), and is significantly associated with measures of subclinical atherosclerosis in African Americans.

Posted ContentDOI
10 Feb 2017-bioRxiv
TL;DR: The authors proposed a multi-tissue, multivariate model for mapping expression quantitative trait loci and predicting gene expression, which decomposes eQTL effects into SNP-specific and tissue-specific components.
Abstract: Transcriptome-wide association studies (TWAS) have proven to be a powerful tool to identify genes associated with human diseases by aggregating cis-regulatory effects on gene expression. However, TWAS relies on building predictive models of gene expression, which are sensitive to the sample size and tissue on which they are trained. The Genotype-Tissue Expression Project has produced reference transcriptomes across 53 human tissues and cell types; however, the data is highly sparse, making it difficult to build polygenic models in relevant tissues for TWAS. Here, we propose ‘fQTL’, a multi-tissue, multivariate model for mapping expression quantitative trait loci and predicting gene expression. Our model decomposes eQTL effects into SNP-specific and tissue-specific components, pooling information across relevant tissues to effectively boost sample sizes. In simulation, we demonstrate that our multi-tissue approach outperforms single-tissue approaches in identifying causal eQTLs and tissues of action. Using our method, we fit polygenic models for 13,461 genes, characterized the tissue-specificity of the learned cis-eQTLs, and performed TWAS for Alzheimer's disease and schizophrenia, identifying 107 and 382 associated genes, respectively.
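The decomposition of eQTL effects into SNP-specific and tissue-specific components can be pictured with a rank-one sketch, in which the effect of SNP s in tissue t is the product of a SNP weight and a tissue weight. This is an assumed simplification for illustration: the abstract does not give the model equations, and the function names and numbers below are hypothetical.

```python
def factored_effects(snp_effects, tissue_effects):
    """Effect matrix with entry [s][t] = snp_effects[s] * tissue_effects[t]."""
    return [[s * t for t in tissue_effects] for s in snp_effects]

def predict_expression(dosages, snp_effects, tissue_effects):
    """Predicted expression in each tissue for one individual.

    dosages: allele counts (0, 1, or 2) for each SNP, in the same order
    as snp_effects.
    """
    # One shared genetic score, rescaled by each tissue's weight
    genetic_score = sum(d * s for d, s in zip(dosages, snp_effects))
    return [genetic_score * t for t in tissue_effects]

# Hypothetical values: two SNPs, three tissues (the second tissue is inactive)
snp_effects = [0.5, -1.0]
tissue_effects = [1.0, 0.0, 2.0]
print(factored_effects(snp_effects, tissue_effects))
print(predict_expression([2, 0], snp_effects, tissue_effects))
```

Because the SNP weights are shared across tissues, every tissue with a nonzero tissue weight contributes samples toward estimating them, which is the pooling effect the abstract describes.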

Posted ContentDOI
18 Nov 2017-bioRxiv
TL;DR: A novel Bayesian inference framework, termed Causal Multivariate Mediation within Extended Linkage disequilibrium (CaMMEL), is developed to jointly model multiple mediated and unmediated effects, relying only on summary statistics.
Abstract: Characterizing the intermediate phenotypes such as gene expression that mediate genetic effects on complex diseases is a fundamental problem in human genetics. Existing methods based on imputation of transcriptomic data by utilizing genotypic data and summary statistics to identify putative disease genes cannot distinguish pleiotropy from causal mediation and are limited by overly strong assumptions about the data. To overcome these limitations, we develop a novel Bayesian inference framework, termed Causal Multivariate Mediation within Extended Linkage disequilibrium (CaMMEL), to jointly model multiple mediated and unmediated effects, relying only on summary statistics. We show in simulation that, whereas existing methods fail to distinguish between mediation and independent direct effects, CaMMEL accurately estimates mediation genes, clearly excluding pleiotropic genes. We applied our method to Alzheimer's disease (AD) and found 21 genes in sub-threshold loci (p

Posted ContentDOI
21 Aug 2017-bioRxiv
TL;DR: It is found that the identity and sequence environment of the modified nucleotide greatly affect the odds of introducing a mismatch or causing reverse transcriptase drop-off, and specific mismatch signatures generated by dimethyl sulfate probing that can be used to remove false positives typically produced in RNA structurome analyses are identified.
Abstract: Genome-wide RNA structure maps have recently become available through the coupling of in vivo chemical probing reagents with next-generation sequencing. Initial analyses relied on the identification of truncated reverse transcription reads to identify the chemically modified nucleotides, but recent studies have shown that mutational signatures can also be used. While these two methods have been employed interchangeably, here we show that they actually provide complementary information. Consequently, analyses using exclusively one of the two methodologies may disregard a significant portion of the structural information. We find that the identity and sequence environment of the modified nucleotide greatly affect the odds of introducing a mismatch or causing reverse transcriptase drop-off. Finally, we identify specific mismatch signatures generated by dimethyl sulfate probing that can be used to remove false positives typically produced in RNA structurome analyses, and show how these signatures vary depending on the reverse transcription enzyme used.

Journal Article
01 Apr 2017-Nature
TL;DR: This paper presents a hierarchical hidden Markov model (diHMM) to systematically annotate chromatin states at multiple length scales, and applies diHMM to analyse a public ChIP-seq data set.
Abstract: Chromatin-state analysis is widely applied in the studies of development and diseases. However, existing methods operate at a single length scale, and therefore cannot distinguish large domains from isolated elements of the same type. To overcome this limitation, we present a hierarchical hidden Markov model, diHMM, to systematically annotate chromatin states at multiple length scales. We apply diHMM to analyse a public ChIP-seq data set. diHMM not only accurately captures nucleosome-level information, but identifies domain-level states that vary in nucleosome-level state composition, spatial distribution and functionality. The domain-level states recapitulate known patterns such as super-enhancers, bivalent promoters and Polycomb repressed regions, and identify additional patterns whose biological functions are not yet characterized. By integrating chromatin-state information with gene expression and Hi-C data, we identify context-dependent functions of nucleosome-level states. Thus, diHMM provides a powerful tool for investigating the role of higher-order chromatin structure in gene regulation.

Posted ContentDOI
14 Nov 2017-bioRxiv
TL;DR: Causal Multivariate Mediation within Extended Linkage disequilibrium is developed, a novel Bayesian inference framework to jointly model multiple mediated and unmediated effects relying only on summary statistics, and it is shown in simulation that CaMMEL accurately distinguishes between mediating and pleiotropic genes unlike existing methods.
Abstract: Characterizing the intermediate phenotypes, such as gene expression, that mediate genetic effects on complex diseases is a fundamental problem in human genetics. Existing methods utilize genotypic data and summary statistics to identify putative disease genes, but cannot distinguish pleiotropy from causal mediation and are limited by overly strong assumptions about the data. To overcome these limitations, we develop Causal Multivariate Mediation within Extended Linkage disequilibrium (CaMMEL), a novel Bayesian inference framework to jointly model multiple mediated and unmediated effects relying only on summary statistics. We show in simulation that CaMMEL accurately distinguishes between mediating and pleiotropic genes unlike existing methods. We applied CaMMEL to Alzheimer's disease (AD) and found 206 causal genes in sub-threshold loci (p < 1e-4). We prioritized 21 genes which mediate at least 5% of local genetic variance, disrupting innate immune pathways in AD.

Posted ContentDOI
14 Feb 2017-bioRxiv
TL;DR: A novel two-stage Bayesian regression method is developed which incorporates uncertainty in imputed gene expression and achieves higher power to detect TWAS genes than existing TWAS methods as well as standard methods based on missing value and measurement error theory.
Abstract: Transcriptome-wide association studies (TWAS) test for associations between imputed gene expression levels and phenotypes in GWAS cohorts using models of transcriptional regulation learned from reference transcriptomes. However, current methods for TWAS only use point estimates of imputed expression and ignore uncertainty in the prediction. We develop a novel two-stage Bayesian regression method which incorporates uncertainty in imputed gene expression and achieves higher power to detect TWAS genes than existing TWAS methods as well as standard methods based on missing value and measurement error theory. We apply our method to GTEx whole blood transcriptomes and GWAS cohorts for seven diseases from the Wellcome Trust Case Control Consortium and find 45 TWAS genes, of which 17 do not overlap previously reported case-control GWAS or differential expression associations. Surprisingly, we replicate only 2 of 40 previously reported TWAS genes after accounting for uncertainty in the prediction.

Journal ArticleDOI
TL;DR: Network Maximal Correlation is introduced as a multivariate measure of nonlinear association among random variables and its utility in a data application of learning nonlinear dependencies among genes in a cancer dataset is shown.
Abstract: We introduce Network Maximal Correlation (NMC) as a multivariate measure of nonlinear association among random variables. NMC is defined via an optimization that infers transformations of variables by maximizing aggregate inner products between transformed variables. For finite discrete and jointly Gaussian random variables, we characterize a solution of the NMC optimization using basis expansion of functions over appropriate basis functions. For finite discrete variables, we propose an algorithm based on alternating conditional expectation to determine NMC. Moreover we propose a distributed algorithm to compute an approximation of NMC for large and dense graphs using graph partitioning. For finite discrete variables, we show that the probability of discrepancy greater than any given level between NMC and NMC computed using empirical distributions decays exponentially fast as the sample size grows. For jointly Gaussian variables, we show that under some conditions the NMC optimization is an instance of the Max-Cut problem. We then illustrate an application of NMC in inference of graphical model for bijective functions of jointly Gaussian variables. Finally, we show NMC’s utility in a data application of learning nonlinear dependencies among genes in a cancer dataset.

Posted ContentDOI
20 Feb 2017-bioRxiv
TL;DR: Existing MPRA datasets are used to train a machine learning model that uses DNA sequence information, regulatory motif annotations, evolutionary conservation, and epigenomic information to predict genomic regions that show enhancer activity when tested in MPRA assays, and it is found that genetic variants with stronger predicted regulatory activity show significantly lower minor allele frequency.
Abstract: Massively-parallel reporter assays (MPRA) enable unprecedented opportunities to test for regulatory activity of thousands of regulatory sequences. However, MPRA only assay a subset of the genome thus limiting their applicability for genome-wide functional annotations. To overcome this limitation, we have used existing MPRA datasets to train a machine learning model that uses DNA sequence information, regulatory motif annotations, evolutionary conservation, and epigenomic information to predict genomic regions that show enhancer activity when tested in MPRA assays. We used the resulting model to generate global predictions of regulatory activity at single-nucleotide resolution across 14 million common variants. We find that genetic variants with stronger predicted regulatory activity show significantly lower minor allele frequency, indicative of evolutionary selection within the human population. They also show higher overlap with eQTL annotations across multiple tissues relative to the background SNPs, indicating that their perturbations in vivo more frequently result in changes in gene expression. In addition, they are more frequently associated with trait-associated SNPs from genome-wide association studies (GWAS), enabling us to prioritize genetic variants that are more likely to be causal based on their predicted regulatory activity. Lastly, we use our model to compare MPRA inferences across cell types and platforms and to prioritize the assays most predictive of MPRA assay results, including cell-dependent DNase hypersensitivity sites and transcription factors known to be active in the tested cell types. Our results indicate that high-throughput testing of thousands of putative regions, coupled with regulatory predictions across millions of sites, presents a powerful strategy for systematic annotation of genomic regions and genetic variants.

Posted ContentDOI
10 Nov 2017-bioRxiv
TL;DR: It is shown that the unbiased integration of independent data sources suggestive of regulatory interactions produces meaningful associations supported by existing functional and physical evidence, correlating with expected independent biological features.
Abstract: Despite large experimental and computational efforts aiming to dissect the mechanisms underlying disease risk, mapping cis-regulatory elements to target genes remains a challenge. Here, we introduce a matrix factorization framework to integrate physical and functional interaction data of genomic segments. The framework was used to predict a regulatory network of chromatin interaction edges linking more than 20,000 promoters and 1.8 million enhancers across 127 human reference epigenomes, including edges that are present in any of the input datasets. Our network integrates functional evidence of correlated activity patterns from epigenomic data and physical evidence of chromatin interactions. An important contribution of this work is the representation of heterogeneous data with different qualities as networks. We show that the unbiased integration of independent data sources suggestive of regulatory interactions produces meaningful associations supported by existing functional and physical evidence, correlating with expected independent biological features.

Posted ContentDOI
30 Aug 2017-bioRxiv
TL;DR: This screen for significant mutation patterns followed by correlative mutational analysis identified new individual driver candidates and suggests that some non-coding mutations recurrently affect expression and play a role in cancer development.
Abstract: Cancer develops by accumulation of somatic driver mutations, which impact cellular function. Mutations in non-coding regulatory regions can now be studied genome-wide and further characterized by correlation with gene expression and clinical outcome to identify driver candidates. Using a new two-stage procedure, called ncDriver, we first screened 507 ICGC whole-genomes from ten cancer types for non-coding elements, in which mutations are both recurrent and have elevated conservation or cancer specificity. This identified 160 significant non-coding elements, including the TERT promoter, a well-known non-coding driver element, as well as elements associated with known cancer genes and regulatory genes (e.g., PAX5, TOX3, PCF11, MAPRE3). However, in some significant elements, mutations appear to stem from localized mutational processes rather than recurrent positive selection. To further characterize the driver potential of the identified elements and shortlist candidates, we identified elements where presence of mutations correlated significantly with expression levels (e.g. TERT and CDH10) and survival (e.g. CDH9 and CDH10) in an independent set of 505 TCGA whole-genome samples. In a larger pan-cancer set of 4,128 TCGA exomes with expression profiling, we identified mutational correlation with expression for additional elements (e.g., near GATA3, CDC6, ZNF217 and CTCF transcription factor binding sites). Survival analysis further pointed to MIR122, a known marker of poor prognosis in liver cancer. This screen for significant mutation patterns followed by correlative mutational analysis identified new individual driver candidates and suggests that some non-coding mutations recurrently affect expression and play a role in cancer development.

Patent
04 Aug 2017
TL;DR: This patent describes methods of analyzing DNA methylation in cell-free DNA (cfDNA) and genomic DNA (gDNA) from sequencing data.
Abstract: As described below, disclosed herein are methods of analyzing DNA methylation in cell-free DNA (cfDNA) and genomic DNA (gDNA) from sequencing data.

Posted ContentDOI
24 Nov 2017-bioRxiv
TL;DR: Deep-coverage whole genome sequencing at the population level is now feasible and offers potential advantages for locus discovery, particularly in the analysis of rare mutations in non-coding regions, and a framework to interpret genome sequence for dyslipidemia risk is developed.
Abstract: Deep-coverage whole genome sequencing at the population level is now feasible and offers potential advantages for locus discovery, particularly in the analysis of rare mutations in non-coding regions. Here, we performed whole genome sequencing in 16,324 participants from four ancestries at mean depth >29X and analyzed correlations of genotypes with four quantitative traits: plasma levels of total cholesterol, low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol, and triglycerides. We conducted a discovery analysis including common or rare variants in coding as well as non-coding regions and developed a framework to interpret genome sequence for dyslipidemia risk. Common variant association yielded loci previously described, with the exception of a few variants not captured earlier by arrays or imputation. In coding sequence, rare variant association yielded known Mendelian dyslipidemia genes and, in non-coding sequence, we detected no rare variant association signals after application of four approaches to aggregate variants in non-coding regions. We developed a new, genome-wide polygenic score for LDL-C and observed that a high polygenic score conferred similar effect size to a monogenic mutation (~30 mg/dl higher LDL-C for each); however, among those with extremely high LDL-C, a high polygenic score was considerably more prevalent than a monogenic mutation (23% versus 2% of participants, respectively).

Dataset
16 Oct 2017
TL;DR: Supplementary material for a study of the evolutionary dynamics of abundant stop codon readthrough.
Abstract: Supplementary material for Jungreis, I., Chan, C. S., Waterhouse, R. M., Fields, G., Lin, M. F., & Kellis, M. (2016). Evolutionary dynamics of abundant stop codon readthrough. Molecular Biology and Evolution. doi:10.1093/molbev/msw189

Proceedings ArticleDOI
TL;DR: A machine learning approach is developed to infer the base pair resolution DNA methylation level from fragment size information in whole genome sequencing (WGS), and cfDNA’s tissue-of-origin status is deconvoluted from the DNA methylation level inferred at ULP-WGS in thousands of breast/prostate cancer samples and healthy individuals.
Abstract: Cell-free DNA (cfDNA) has been shown to be an emerging non-invasive biomarker to monitor tumor progression in cancer patients. Elevated cfDNA has been found not only from tumors, but also from normal tissues. Thus, the identification of cfDNA’s tissue-of-origin is critical to understand the mechanism of cfDNA release and tumor progression. Recent efforts to identify cfDNA’s tissue-of-origin have begun to utilize cfDNA’s epigenomic status, such as DNA methylation and nucleosome spacing. However, both of these methods have limitations: (1) For nucleosome positioning, lack of reference nucleosome maps in different tumor and normal tissues has limited its application to tissue-of-origin deconvolution; (2) For DNA methylation, large DNA degradation during whole genome bisulfite sequencing (WGBS) library preparation, even with current low-input DNA technology, is still the major hurdle for its clinical application, although extensive DNA methylation studies by WGBS in tumor and normal tissues during the last decade have provided many reference maps. Very recently, a pioneering study showed significant differences between DNA fragment lengths of methylated and unmethylated cfDNA. Taking advantage of this experimental observation, we developed a machine learning approach to infer the base pair resolution DNA methylation level from fragment size information in whole genome sequencing (WGS). The predicted DNA methylation, from not only high coverage but also dozens of ultra-low-pass WGS (ULP-WGS), showed high concordance with the ground truth DNA methylation level from WGBS in the same cancer patients. Furthermore, by using hundreds of WGBS datasets from different tumor and normal tissues/cells as the reference map, we deconvoluted cfDNA’s tissue-of-origin status by inferred DNA methylation level at ULP-WGS from thousands of breast/prostate cancer samples and healthy individuals.
The cfDNA’s tissue-of-origin status in cancer patients showed high concordance with confirmed metastasis tissues from physicians. Interestingly, some clinical information, such as cancer grades/stages, seemed to be correlated with cfDNA’s tissue-of-origin status. Overall, our methods here pave the road for cfDNA’s application in clinical diagnosis and monitoring. Citation Format: Yaping Liu, Sarah Reed, Atish D. Choudhury, Heather A. Parsons, Daniel G. Stover, Gavin Ha, Gregory Gydush, Justin Rhoades, Denisse Rotem, Samuel Freeman, Viktor Adalsteinsson, Manolis Kellis. Identify tissue-of-origin in cancer cfDNA by whole genome sequencing [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 5689. doi:10.1158/1538-7445.AM2017-5689