Showing papers by "Anshul Kundaje" published in 2018

Open Access

Journal Article•DOI•

Opportunities and obstacles for deep learning in biology and medicine.

Travers Ching¹, Daniel Himmelstein², Brett K. Beaulieu-Jones², Alexandr A. Kalinin³, Brian T. Do⁴, Gregory P. Way², Enrico Ferrero⁵, Paul-Michael Agapow⁶, Michael Zietz², Michael M. Hoffman⁷, Michael M. Hoffman⁸, Wei Xie⁹, Gail L. Rosen¹⁰, Benjamin J. Lengerich¹¹, Johnny Israeli¹², Jack Lanchantin¹³, Stephen Woloszynek¹⁰, Anne E. Carpenter¹⁴, Avanti Shrikumar¹², Jinbo Xu¹⁵, Evan M. Cofer¹⁶, Evan M. Cofer¹⁷, Christopher A. Lavender¹⁸, Srinivas C. Turaga¹⁹, Amr Alexandari¹², Zhiyong Lu¹⁸, David J. Harris²⁰, Dave DeCaprio, Yanjun Qi¹³, Anshul Kundaje¹², Yifan Peng¹⁸, Laura K. Wiley²¹, Marwin H. S. Segler²², Simina M. Boca²³, S. Joshua Swamidass²⁴, Austin Huang²⁵, Anthony Gitter²⁶, Anthony Gitter²⁷, Casey S. Greene²•Institutions (27)

01 Apr 2018-Journal of the Royal Society Interface

TL;DR: It is found that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art.

Abstract: Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

1,491 citations

Journal Article•DOI•

Impact of regulatory variation across human iPSCs and differentiated cells

Nicholas E. Banovich¹, Yang I. Li², Anil Raj², Michelle C Ward¹, Peyton Greenside², Diego Calderon², Po-Yuan Tung¹, Jonathan E. Burnett¹, Marsha Myrthil¹, Samantha M. Thomas¹, Courtney K Burrows¹, Irene Gallego Romero¹, Bryan J Pavlovic¹, Anshul Kundaje², Jonathan K. Pritchard³, Jonathan K. Pritchard², Yoav Gilad¹•Institutions (3)

University of Chicago¹, Stanford University², Howard Hughes Medical Institute³

01 Jan 2018-Genome Research

TL;DR: A deep neural network is developed to predict open chromatin regions from DNA sequence alone and is able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell-type-specific chromatin accessibility.

Abstract: Induced pluripotent stem cells (iPSCs) are an essential tool for studying cellular differentiation and cell types that are otherwise difficult to access. We investigated the use of iPSCs and iPSC-derived cells to study the impact of genetic variation on gene regulation across different cell types and as models for studies of complex disease. To do so, we established a panel of iPSCs from 58 well-studied Yoruba lymphoblastoid cell lines (LCLs); 14 of these lines were further differentiated into cardiomyocytes. We characterized regulatory variation across individuals and cell types by measuring gene expression levels, chromatin accessibility, and DNA methylation. Our analysis focused on a comparison of inter-individual regulatory variation across cell types. While most cell-type-specific regulatory quantitative trait loci (QTLs) lie in chromatin that is open only in the affected cell types, we found that 20% of cell-type-specific regulatory QTLs are in shared open chromatin. This observation motivated us to develop a deep neural network to predict open chromatin regions from DNA sequence alone. Using this approach, we were able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell-type-specific chromatin accessibility.

115 citations

Journal Article•DOI•

Intertumoral Heterogeneity in SCLC Is Influenced by the Cell Type of Origin

Dian Yang¹, Sarah K. Denny¹, Peyton Greenside¹, Andrea C. Chaikovsky¹, Jennifer J. Brady¹, Youcef Ouadah¹, Jeffrey M. Granja¹, Nadine Jahchan¹, Jing Shan Lim¹, Shirley Kwok¹, Christina S. Kong¹, Anna S. Berghoff², Anna Schmitt, H. Christian Reinhardt³, Kwon-Sik Park⁴, Matthias Preusser², Anshul Kundaje¹, William J. Greenleaf¹, Julien Sage¹, Monte M. Winslow¹•Institutions (4)

Stanford University¹, Medical University of Vienna², University of Cologne³, University of Virginia⁴

18 Sep 2018-Cancer Discovery

TL;DR: It is shown that SCLC can arise from different cell types of origin, which profoundly influences the eventual genetic and epigenetic changes that enable metastatic progression, and underscores the importance of the identity of cell type of origin in influencing tumor evolution and metastatic mechanisms.

Abstract: The extent to which early events shape tumor evolution is largely uncharacterized, even though a better understanding of these early events may help identify key vulnerabilities in advanced tumors. Here, using genetically defined mouse models of small cell lung cancer (SCLC), we uncovered distinct metastatic programs attributable to the cell type of origin. In one model, tumors gain metastatic ability through amplification of the transcription factor NFIB and a widespread increase in chromatin accessibility, whereas in the other model, tumors become metastatic in the absence of NFIB-driven chromatin alterations. Gene-expression and chromatin accessibility analyses identify distinct mechanisms as well as markers predictive of metastatic progression in both groups. Underlying the difference between the two programs was the cell type of origin of the tumors, with NFIB-independent metastases arising from mature neuroendocrine cells. Our findings underscore the importance of the identity of cell type of origin in influencing tumor evolution and metastatic mechanisms. Significance: We show that SCLC can arise from different cell types of origin, which profoundly influences the eventual genetic and epigenetic changes that enable metastatic progression. Understanding intertumoral heterogeneity in SCLC, and across cancer types, may illuminate mechanisms of tumor progression and uncover how the cell type of origin affects tumor evolution. Cancer Discov; 8(10); 1316–31. ©2018 AACR. See related commentary by Pozo et al., p. 1216. This article is highlighted in the In This Issue feature, p. 1195

109 citations

Journal Article•DOI•

Umap and Bismap: quantifying genome and methylome mappability.

Mehran Karimzadeh¹, Mehran Karimzadeh², Carl Ernst³, Anshul Kundaje⁴, Michael M. Hoffman•Institutions (4)

Princess Margaret Cancer Centre¹, University of Toronto², McGill University³, Stanford University⁴

16 Nov 2018-Nucleic Acids Research

TL;DR: The Umap software for identifying uniquely mappable regions of any genome is introduced and its Bismap extension identifies mappability of the bisulfite-converted genome.

Abstract: Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called 'mappability', which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at https://bismap.hoffmanlab.org for use with genome browsers.
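The notion of mappability described in this abstract can be illustrated with a toy Python sketch. It is not the Umap software or its API: the function names, the toy genome, and the simplifications (forward strand only, exact matches, no bisulfite conversion) are all assumptions made for illustration. A k-mer start is treated as uniquely mappable when its sequence occurs exactly once, and each base is scored by the fraction of overlapping k-mers that are unique.

```python
from collections import Counter


def unique_kmer_starts(genome: str, k: int) -> list[bool]:
    """For each start position, report whether the k-mer there occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


def per_base_mappability(genome: str, k: int) -> list[float]:
    """Fraction of the k-mers overlapping each base that are unique in the genome."""
    unique = unique_kmer_starts(genome, k)
    scores = []
    for pos in range(len(genome)):
        # k-mers overlapping `pos` start in [pos - k + 1, pos], clipped to valid starts.
        first = max(0, pos - k + 1)
        last = min(pos, len(genome) - k)
        window = unique[first:last + 1]
        scores.append(sum(window) / len(window))
    return scores


# Hypothetical toy genome in which "ACGT" occurs twice (positions 0 and 4).
toy_genome = "ACGTACGTTTGCA"
starts = unique_kmer_starts(toy_genome, 4)
scores = per_base_mappability(toy_genome, 4)
print(starts)
print([round(s, 2) for s in scores])
```

Bases covered only by the repeated k-mer score low, while bases far from the repeat score 1.0, which mirrors the abstract's point that estimates in low-mappability regions are less reliable.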

103 citations

Journal Article•DOI•

GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs.

Oana Ursu¹, Nathan Boley¹, Maryna Taranova¹, Y. X. Rachel Wang¹, Galip Gürkan Yardımcı², William Stafford Noble², Anshul Kundaje¹•Institutions (2)

Stanford University¹, University of Washington²

15 Aug 2018-Bioinformatics

TL;DR: A concordance measure called DIfferences between Smoothed COntact maps (GenomeDISCO) is introduced for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments, which accurately distinguishes biological replicates from samples obtained from different cell types.

Abstract: Motivation: The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of three-dimensional chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts. Results: We introduce a concordance measure called DIfferences between Smoothed COntact maps (GenomeDISCO) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP. Availability and implementation: Software implementing GenomeDISCO is available at https://github.com/kundajelab/genomedisco. Supplementary information: Supplementary data are available at Bioinformatics online.
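The key idea stated in this abstract (smooth each contact map with random walks, then compare) can be sketched in a few lines of Python. This is a simplified illustration, not the GenomeDISCO implementation: the function names, the number of random-walk steps, the normalization by the number of bins, and the toy 3-bin contact maps are assumptions chosen for clarity.

```python
def row_normalize(matrix):
    """Turn a contact-count matrix into a random-walk transition matrix."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def smooth(matrix, steps):
    """Probability of moving between bins in `steps` random-walk steps."""
    transition = row_normalize(matrix)
    result = transition
    for _ in range(steps - 1):
        result = matmul(result, transition)
    return result


def concordance(map_a, map_b, steps=3):
    """1 minus the L1 difference of the smoothed maps, averaged over bins."""
    smooth_a, smooth_b = smooth(map_a, steps), smooth(map_b, steps)
    n = len(smooth_a)
    l1 = sum(abs(smooth_a[i][j] - smooth_b[i][j])
             for i in range(n) for j in range(n))
    return 1.0 - l1 / n


# Hypothetical contact counts between three genomic bins.
sample = [[0, 5, 1], [5, 0, 5], [1, 5, 0]]
replicate = [[0, 6, 1], [6, 0, 4], [1, 4, 0]]      # same structure, noisy counts
other_cell_type = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]  # different contact pattern

score_replicate = concordance(sample, replicate)
score_other = concordance(sample, other_cell_type)
print(score_replicate, score_other)
```

Because every row of a smoothed map sums to 1, this toy score lies between -1 and 1, and the noisy replicate scores higher than the map with a different contact pattern.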

74 citations

Journal Article•DOI•

Discovering epistatic feature interactions from neural network models of regulatory DNA sequences.

Peyton Greenside¹, Tyler C. Shimko¹, Polly M. Fordyce, Anshul Kundaje¹•Institutions (1)

Stanford University¹

01 Sep 2018-Bioinformatics

TL;DR: This work presents a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence and makes significant strides in improving the interpretability of deep learning models for genomics.

Abstract: Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Availability and implementation: Code is available at: https://github.com/kundajelab/dfim. Supplementary information: Supplementary data are available at Bioinformatics online.

65 citations

Posted Content•

Technical Note on Transcription Factor Motif Discovery from Importance Scores (TF-MoDISco) version 0.5.6.5

Avanti Shrikumar¹, Katherine Tian, Žiga Avsec², Anna Shcherbina¹, Abhimanyu Banerjee¹, Mahfuza Sharmin¹, Surag Nair¹, Anshul Kundaje¹•Institutions (2)

Stanford University¹, Technische Universität München²

31 Oct 2018-arXiv: Learning

TL;DR: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data and the implementation is available at this URL.

Abstract: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data. This technical note focuses on version v0.5.6.5. The implementation is available at this https URL

61 citations

Posted Content•DOI•

Kipoi: accelerating the community exchange and reuse of predictive models for genomics

Ziga Avsec¹, Roman Kreuzhuber², Johnny Israeli³, Nancy Xu³, Jun Cheng¹, Avanti Shrikumar³, Abhimanyu Banerjee³, Daniel S Kim³, Lara Urban², Anshul Kundaje³, Oliver Stegle², Julien Gagneur¹•Institutions (3)

Technische Universität München¹, European Bioinformatics Institute², Stanford University³

24 Jul 2018-bioRxiv

TL;DR: Kipoi, a collaborative initiative to define standards and to foster reuse of trained models in genomics, is presented, providing a unified framework to archive, share, access, use, and build on models developed by the community.

Abstract: Advanced machine learning models applied to large-scale genomics datasets hold the promise to be major drivers for genome science. Once trained, such models can serve as a tool to probe the relationships between data modalities, including the effect of genetic variants on phenotype. However, lack of standardization and limited accessibility of trained models have hampered their impact in practice. To address this, we present Kipoi, a collaborative initiative to define standards and to foster reuse of trained models in genomics. Already, the Kipoi repository contains over 2,000 trained models that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. The Kipoi model standard grants automated software installation and provides unified interfaces to apply and interpret models. We illustrate Kipoi through canonical use cases, including model benchmarking, transfer learning, variant effect prediction, and building new models from existing ones. By providing a unified framework to archive, share, access, use, and build on models developed by the community, Kipoi will foster the dissemination and use of machine learning models in genomics.

36 citations

Journal Article•DOI•

Differential analysis of chromatin accessibility and histone modifications for predicting mouse developmental enhancers.

Shaliu Fu¹, Qin Wang¹, Jill Moore², Michael J. Purcaro², Henry Pratt², Kaili Fan¹, Cuihua Gu¹, Cizhong Jiang¹, Ruixin Zhu¹, Anshul Kundaje³, Aiping Lu¹, Zhiping Weng¹, Zhiping Weng²•Institutions (3)

Tongji University¹, University of Massachusetts Medical School², Stanford University³

30 Nov 2018-Nucleic Acids Research

TL;DR: Nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays are compared and a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks is devised.

Abstract: Enhancers are distal cis-regulatory elements that modulate gene expression. They are depleted of nucleosomes and enriched in specific histone modifications; thus, calling DNase-seq and histone mark ChIP-seq peaks can predict enhancers. We evaluated nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays. DNase and H3K27ac peaks were consistently more predictive than H3K4me1/2/3 and H3K9ac peaks. DFilter and Hotspot2 were the best DNase peak callers, while HOMER, MUSIC, MACS2, DFilter and F-seq were the best H3K27ac peak callers. We observed that the differential DNase or H3K27ac signals between two distant tissues increased the area under the precision-recall curve (PR-AUC) of DNase peaks by 17.5-166.7% and that of H3K27ac peaks by 7.1-22.2%. We further improved this differential signal method using multiple contrast tissues. Evaluated using a blind test, the differential H3K27ac signal method substantially improved PR-AUC from 0.48 to 0.75 for predicting heart enhancers. We further validated our approach using postnatal retina and cerebral cortex enhancers identified by massively parallel reporter assays, and observed improvements for both tissues. In summary, we compared nine peak callers and devised a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks.
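The reranking idea in this abstract can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the peak signal values and labels are invented for illustration, a simple difference between target-tissue and contrast-tissue signal stands in for the paper's differential signal, and average precision is used as a common estimate of PR-AUC. None of the names come from the authors' code.

```python
def average_precision(scores, labels):
    """Estimate the area under the precision-recall curve as average precision."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, is_enhancer) in enumerate(ranked, start=1):
        if is_enhancer:
            hits += 1
            precision_sum += hits / rank  # precision at each recovered positive
    return precision_sum / hits if hits else 0.0


# Hypothetical peaks: (signal in target tissue, signal in contrast tissue, validated?)
peaks = [
    (9.0, 8.5, False),  # strong in both tissues, so not tissue-specific
    (7.0, 1.0, True),
    (6.0, 5.5, False),
    (5.0, 0.5, True),
    (3.0, 2.5, False),
]
labels = [validated for _, _, validated in peaks]

# Rank by raw target-tissue signal, then by target-minus-contrast signal.
raw_ap = average_precision([target for target, _, _ in peaks], labels)
diff_ap = average_precision([target - contrast for target, contrast, _ in peaks], labels)
print(raw_ap, diff_ap)
```

In this invented example the differential ranking moves the tissue-specific peaks to the top, which is the kind of effect the abstract reports as a PR-AUC improvement.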

34 citations

Posted Content•DOI•

Transcriptome-wide association studies: opportunities and challenges

Michael Wainberg¹, Nasa Sinnott-Armstrong¹, Nicholas Mancuso², Alvaro N. Barbeira³, David A. Knowles¹, David E. Golan¹, Raili Ermel⁴, Arno Ruusalepp⁴, Thomas Quertermous¹, Ke Hao⁵, Johan L.M. Björkegren⁵, Hae Kyung Im³, Bogdan Pasaniuc², Manuel A. Rivas¹, Anshul Kundaje¹•Institutions (5)

Stanford University¹, University of California, Los Angeles², University of Chicago³, Tartu University Hospital⁴, Icahn School of Medicine at Mount Sinai⁵

14 Oct 2018-bioRxiv

TL;DR: Properties of TWAS are explored as a potential approach to prioritize causal genes, using simulations and case studies of literature-curated candidate causal genes for schizophrenia, LDL cholesterol and Crohn’s disease, and guidelines and best practices are provided.

Abstract: Transcriptome-wide association studies (TWAS) integrate GWAS and gene expression datasets to find gene-trait associations. In this Perspective, we explore properties of TWAS as a potential approach to prioritize causal genes, using simulations and case studies of literature-curated candidate causal genes for schizophrenia, LDL cholesterol and Crohn's disease. We explore risk loci where TWAS accurately prioritizes the likely causal gene, as well as loci where TWAS prioritizes multiple genes, some of which are unlikely to be causal, because they share the same variants as eQTLs. We illustrate that TWAS is especially prone to spurious prioritization when using expression data from tissues or cell types that are less related to the trait, due to substantial variation in both expression levels and eQTL strengths across cell types. Nonetheless, TWAS prioritizes candidate causal genes at GWAS loci more accurately than simple baselines based on proximity to lead GWAS variant and expression in trait-related tissue. We discuss current strategies and future opportunities for improving the performance of TWAS for causal gene prioritization. Our results showcase the strengths and limitations of using expression variation across individuals to determine causal genes at GWAS loci and provide guidelines and best practices when using TWAS to prioritize candidate causal genes.

21 citations

Posted Content•DOI•

Discovering epistatic feature interactions from neural network models of regulatory DNA sequences

Peyton Greenside¹, Tyler C. Shimko¹, Polly M. Fordyce¹, Anshul Kundaje¹•Institutions (1)

Stanford University¹

17 Apr 2018-bioRxiv

Abstract: Transcription factors bind complex regulatory DNA sequence patterns in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn these cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order, epistatic feature interactions encoded by the models. We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics.

Posted Content•

TF-MoDISco v0.4.2.2-alpha: Technical Note

Avanti Shrikumar, Katherine Tian, Anna Shcherbina, Žiga Avsec, Abhimanyu Banerjee, Mahfuza Sharmin, Surag Nair, Anshul Kundaje

31 Oct 2018

TL;DR: The methods behind TF-MoDISco version 0.4.2.2-alpha, an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data, are described.

Posted Content•

Computationally Efficient Measures of Internal Neuron Importance.

Avanti Shrikumar, Jocelin Su, Anshul Kundaje

26 Jul 2018-arXiv: Learning

TL;DR: It is shown that the formula for Total Conductance is mathematically equivalent to Path Integrated Gradients computed on a hidden layer in the network, and the resulting scalable implementation, Neuron Integrated Gradients, is compared to DeepLIFT, a pre-existing computationally efficient approach that is applicable to calculating internal neuron importance.

Abstract: The challenge of assigning importance to individual neurons in a network is of interest when interpreting deep learning models. In recent work, Dhamdhere et al. proposed Total Conductance, a "natural refinement of Integrated Gradients" for attributing importance to internal neurons. Unfortunately, the authors found that calculating conductance in TensorFlow required the addition of several custom gradient operators and did not scale well. In this work, we show that the formula for Total Conductance is mathematically equivalent to Path Integrated Gradients computed on a hidden layer in the network. We provide a scalable implementation of Total Conductance using standard TensorFlow gradient operators that we call Neuron Integrated Gradients. We compare Neuron Integrated Gradients to DeepLIFT, a pre-existing computationally efficient approach that is applicable to calculating internal neuron importance. We find that DeepLIFT produces strong empirical results and is faster to compute, but because it lacks the theoretical properties of Neuron Integrated Gradients, it may not always be preferred in practice. Colab notebook reproducing results: this http URL
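Path integrated gradients computed on a hidden layer, as described in this abstract, can be illustrated numerically with a toy Python sketch. It is not the authors' TensorFlow implementation: the two-neuron network, its weights, the straight-line input path, and the trapezoid-rule integration are all assumptions made for illustration. Each hidden neuron's attribution is the integral, along the path from baseline to input, of the output's gradient with respect to that neuron times the change in the neuron's activation.

```python
import math

# Hypothetical toy network: 2 inputs -> 2 ReLU hidden neurons -> sigmoid output.
W = [[1.0, -2.0], [0.5, 1.5]]  # hidden-layer weights (assumed values)
V = [2.0, -1.0]                # output weights (assumed values)


def hidden(x):
    """Activations of the ReLU hidden layer for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


def output_from_hidden(h):
    """Sigmoid output computed from hidden activations."""
    z = sum(v * hi for v, hi in zip(V, h))
    return 1.0 / (1.0 + math.exp(-z))


def grad_output_wrt_hidden(h):
    """Analytic gradient of the sigmoid output with respect to each hidden neuron."""
    s = output_from_hidden(h)
    return [v * s * (1.0 - s) for v in V]


def neuron_integrated_gradients(x, baseline, steps=200):
    """Trapezoid-rule estimate of path integrated gradients for each hidden neuron."""
    attributions = [0.0] * len(W)
    prev_h = hidden(baseline)
    prev_g = grad_output_wrt_hidden(prev_h)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        h = hidden(point)
        g = grad_output_wrt_hidden(h)
        for i in range(len(W)):
            # Gradient w.r.t. neuron i times the change in its activation on this segment.
            attributions[i] += 0.5 * (prev_g[i] + g[i]) * (h[i] - prev_h[i])
        prev_h, prev_g = h, g
    return attributions


x, baseline = [1.0, 0.2], [0.0, 0.0]
attr = neuron_integrated_gradients(x, baseline)
total_change = output_from_hidden(hidden(x)) - output_from_hidden(hidden(baseline))
print(attr, sum(attr), total_change)
```

The neuron attributions sum (up to numerical error) to the change in output between baseline and input, one of the theoretical properties that distinguishes integrated-gradient style attributions.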

Journal Article•DOI•

A common pattern of DNase I footprinting throughout the human mtDNA unveils clues for a chromatin-like organization.

Amit Blumberg¹, Charles G. Danko², Anshul Kundaje³, Dan Mishmar¹•Institutions (3)

Ben-Gurion University of the Negev¹, Cornell University², Stanford University³

12 Jul 2018-Genome Research

TL;DR: Human mtDNA is found to have a conserved protein-DNA organization, which is likely involved in mtDNA regulation; an altered mt-DGF pattern in interleukin 3-treated CD34+ cells, certain tissue differences, and a significant prevalence change in fetal versus nonfetal samples offer first clues to the physiological importance of these footprinting sites.

Abstract: Human mitochondrial DNA (mtDNA) is believed to lack chromatin and histones. Instead, it is coated solely by the transcription factor TFAM. We asked whether mtDNA packaging is more regulated than once thought. To address this, we analyzed DNase-seq experiments in 324 human cell types and found, for the first time, a pattern of 29 mtDNA Genomic footprinting (mt-DGF) sites shared by ∼90% of the samples. Their syntenic conservation in mouse DNase-seq experiments reflects selective constraints. Colocalization with known mtDNA regulatory elements, with G-quadruplex structures, in TFAM-poor sites (in HeLa cells) and with transcription pausing sites suggests a functional regulatory role for such mt-DGFs. An altered mt-DGF pattern in interleukin 3-treated CD34+ cells, certain tissue differences, and a significant prevalence change in fetal versus nonfetal samples offer first clues to their physiological importance. Taken together, human mtDNA has a conserved protein-DNA organization, which is likely involved in mtDNA regulation.

Journal Article•DOI•

ChIP-ping the branches of the tree: functional genomics and the evolution of eukaryotic gene regulation.

Georgi K. Marinov, Anshul Kundaje

01 Mar 2018-Briefings in Functional Genomics

TL;DR: The most recent major technological transformation happened a decade ago, with the move from using tiling arrays [chromatin immunoprecipitation (ChIP)-on-Chip] to high-throughput sequencing as a readout for ChIP assays.

Abstract: Advances in the methods for detecting protein-DNA interactions have played a key role in determining the directions of research into the mechanisms of transcriptional regulation. The most recent major technological transformation happened a decade ago, with the move from using tiling arrays [chromatin immunoprecipitation (ChIP)-on-Chip] to high-throughput sequencing (ChIP-seq) as a readout for ChIP assays. In addition to the numerous other ways in which it is superior to arrays, by eliminating the need to design and manufacture them, sequencing also opened the door to carrying out comparative analyses of genome-wide transcription factor occupancy across species and studying chromatin biology in previously less accessible model and nonmodel organisms, thus allowing us to understand the evolution and diversity of regulatory mechanisms in unprecedented detail. Here, we review the biological insights obtained from such studies in recent years and discuss anticipated future developments in the field.

Posted Content•DOI•

Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

Rajiv Movva¹, Peyton Greenside¹, Avanti Shrikumar¹, Anshul Kundaje¹•Institutions (1)

Stanford University¹

17 Aug 2018-bioRxiv

TL;DR: In this article, a convolutional neural network (CNN)-based framework was proposed to predict and interpret the regulatory activity of DNA sequences as measured by massively parallel reporter assays (MPRAs), which are a powerful approach to characterize the relationship between noncoding DNA sequence and gene expression.

Abstract: The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ~500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

Proceedings Article•DOI•

Prediction of protein-ligand interactions from paired protein sequence motifs and ligand substructures.

Peyton Greenside¹, Maureen Hillenmeyer, Anshul Kundaje¹•Institutions (1)

Stanford University¹

01 Jan 2018

TL;DR: It is demonstrated that it is possible to predict protein-ligand interactions using only motif-based features and that interpretation of these features can reveal new insights into the molecular mechanics underlying each interaction.

Abstract: Identification of small molecule ligands that bind to proteins is a critical step in drug discovery. Computational methods have been developed to accelerate the prediction of protein-ligand binding, but often depend on 3D protein structures. As only a limited number of protein 3D structures have been resolved, the ability to predict protein-ligand interactions without relying on a 3D representation would be highly valuable. We use an interpretable confidence-rated boosting algorithm to predict protein-ligand interactions with high accuracy from ligand chemical substructures and protein 1D sequence motifs, without relying on 3D protein structures. We compare several protein motif definitions, assess generalization of our model's predictions to unseen proteins and ligands, demonstrate recovery of well established interactions and identify globally predictive protein-ligand motif pairs. By bridging biological and chemical perspectives, we demonstrate that it is possible to predict protein-ligand interactions using only motif-based features and that interpretation of these features can reveal new insights into the molecular mechanics underlying each interaction. Our work also lays a foundation to explore more predictive feature sets and sophisticated machine learning approaches as well as other applications, such as predicting unintended interactions or the effects of mutations.

Posted Content•DOI•

Long-range single-molecule mapping of chromatin accessibility in eukaryotes

Zohar Shipony¹, Georgi K. Marinov¹, Matthew P. Swaffer¹, Nicholas A. Sinnott-Armstrong¹, Jan M. Skotheim¹, Anshul Kundaje¹, William J. Greenleaf¹•Institutions (1)

Stanford University¹

22 Dec 2018-bioRxiv

TL;DR: A method for profiling accessibility of individual chromatin fibers at multi-kilobase length scale (SMAC-seq, or Single-Molecule long-read Accessible Chromatin mapping sequencing assay), enabling the simultaneous, high-resolution, single-molecule assessment of the chromatin state of distal genomic elements.

Abstract: Active regulatory elements in eukaryotes are typically characterized by an open, nucleosome-depleted chromatin structure; mapping areas of open chromatin has accordingly emerged as a widely used tool in the arsenal of modern functional genomics. However, existing approaches for profiling chromatin accessibility are limited by their reliance on DNA fragmentation and short-read sequencing, which leaves them unable to provide information about the state of chromatin on larger scales or reveal coordination between the chromatin state of individual distal regulatory elements. To address these limitations we have developed a method for profiling accessibility of individual chromatin molecules at multi-kilobase length scale (SMAC-seq, or Single-Molecule long-read Accessible Chromatin mapping sequencing assay), enabling the simultaneous, high-resolution, single-molecule assessment of the chromatin state of distal genomic elements. Our strategy is based on combining the preferential methylation of open chromatin regions by DNA methyltransferases (CpG and GpC 5-methylcytosine (5mC) and N6-methyladenosine (m6A) enzymes) and the ability of long-read single-molecule nanopore sequencing to directly read out the methylation state of individual DNA bases. Applying SMAC-seq to the budding yeast Saccharomyces cerevisiae, we demonstrate that aggregate SMAC-seq signals match bulk-level accessibility measurements, observe single-molecule protection footprints of nucleosomes and transcription factors, and quantify the correlation between the chromatin states of distal genomic elements.
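
The two readouts named at the end of this abstract, an aggregate signal and a correlation between distal elements, can be illustrated on a toy molecule-by-position matrix. The sketch below assumes a hypothetical binary matrix in which each row is one sequenced molecule and each column one informative position (1 = methylated, read as accessible). The function names and the element coordinates are illustrative only and are not the SMAC-seq pipeline, which must also handle missing calls and reads that span only one element.

```python
import numpy as np

def aggregate_accessibility(reads):
    """Fraction of molecules called accessible at each position (bulk-like signal)."""
    return np.asarray(reads, dtype=float).mean(axis=0)

def element_state(reads, start, end):
    """Per-molecule mean accessibility over positions [start, end)."""
    return np.asarray(reads, dtype=float)[:, start:end].mean(axis=1)

def coaccessibility(reads, elem_a, elem_b):
    """Pearson correlation, across molecules, between the states of two elements."""
    a = element_state(reads, *elem_a)
    b = element_state(reads, *elem_b)
    return float(np.corrcoef(a, b)[0, 1])
```

Averaging over molecules discards which positions were open together on the same fiber; the per-molecule correlation retains it, and that information is what short-read assays cannot provide.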

Posted Content•DOI•

Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses

Bérénice A. Benayoun¹, Elizabeth A. Pollina², Param Priya Singh³, Salah Mahmoudi³, Itamar Harel⁴, Kerriann M. Casey³, Ben W. Dulken³, Anshul Kundaje³, Anne Brunet³•Institutions (4)

University of Southern California¹, Harvard University², Stanford University³, Hebrew University of Jerusalem⁴

31 May 2018-bioRxiv

TL;DR: This resource identifies chromatin and transcriptional states that are characteristic of youthful tissues, which could be leveraged to restore aspects of youthful functionality to old tissues.

Abstract: Aging is accompanied by the functional decline of tissues. However, a systematic study of epigenomic and transcriptomic changes across tissues during aging is missing. Here we generated chromatin maps and transcriptomes from four tissues and one cell type from young, middle-aged, and old mice, yielding 143 high-quality datasets. We focused specifically on chromatin marks linked to gene expression regulation and cell identity: histone H3 trimethylation at lysine 4 (H3K4me3), a mark enriched at promoters, and histone H3 acetylation at lysine 27 (H3K27ac), a mark enriched at active enhancers. Epigenomic and transcriptomic landscapes could easily distinguish between ages. Machine learning analysis showed that chromatin states could predict transcriptional changes during aging. Analysis of datasets from all tissues identified recurrent age-related chromatin and transcriptional changes in key processes, including the upregulation of immune system response pathways such as the interferon signaling pathway. The upregulation of the interferon response pathway with age was accompanied by increased transcription of various endogenous retroviral sequences. Pathways deregulated during mouse aging across tissues, notably innate immune pathways, were also deregulated with aging in other vertebrate species (African turquoise killifish, rat, and humans), indicating common signatures of age across species. To date, our dataset represents the largest multi-tissue epigenomic and transcriptomic dataset for vertebrate aging. This resource identifies chromatin and transcriptional states that are characteristic of youthful tissues, which could be leveraged to restore aspects of youthful functionality to old tissues.

Posted Content•

A Flexible and Adaptive Framework for Abstention Under Class Imbalance

Avanti Shrikumar, Amr Alexandari, Anshul Kundaje

20 Feb 2018-arXiv: Machine Learning

TL;DR: This framework leverages the insight that calibrated probability estimates can be used as a proxy for the true class labels, thereby allowing us to estimate the change in an arbitrary metric if an example were abstained on.

Abstract: In practical applications of machine learning, it is often desirable to identify and abstain on examples where the model's predictions are likely to be incorrect. Much of the prior work on this topic focused on out-of-distribution detection or performance metrics such as top-k accuracy. Comparatively little attention was given to metrics such as area-under-the-curve or Cohen's Kappa, which are extremely relevant for imbalanced datasets. Abstention strategies aimed at top-k accuracy can produce poor results on these metrics when applied to imbalanced datasets, even when all examples are in-distribution. We propose a framework to address this gap. Our framework leverages the insight that calibrated probability estimates can be used as a proxy for the true class labels, thereby allowing us to estimate the change in an arbitrary metric if an example were abstained on. Using this framework, we derive computationally efficient metric-specific abstention algorithms for optimizing the sensitivity at a target specificity level, the area under the ROC, and the weighted Cohen's Kappa. Because our method relies only on calibrated probability estimates, we further show that by leveraging recent work on domain adaptation under label shift, we can generalize to test-set distributions that may have a different class imbalance compared to the training set distribution. In experiments involving medical imaging, natural language processing, computer vision and genomics, we demonstrate the effectiveness of our approach. Source code and Colab notebooks reproducing the results are available online.
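
The central idea in this abstract, using calibrated probabilities as stand-ins for the unknown labels, can be sketched for the area under the ROC. The code below is a brute-force illustration under stated assumptions and not the authors' algorithm: it treats each calibrated probability as a soft label, uses the same probability as the ranking score, and greedily abstains on whichever example most raises the estimated AUROC. The names `estimated_auroc` and `greedy_abstain` are hypothetical, and the efficient metric-specific derivations mentioned in the abstract are not reproduced.

```python
import numpy as np

def estimated_auroc(probs):
    """Estimate AUROC with calibrated probabilities as soft labels.

    Pair (i, j) is weighted by p_i * (1 - p_j), the probability that i is
    positive and j is negative, and counts as correct when i outranks j.
    """
    p = np.asarray(probs, dtype=float)
    weights = np.outer(p, 1.0 - p)
    np.fill_diagonal(weights, 0.0)  # an example is never paired with itself
    wins = (p[:, None] > p[None, :]).astype(float) + 0.5 * (p[:, None] == p[None, :])
    return float((weights * wins).sum() / weights.sum())

def greedy_abstain(probs, budget):
    """Return indices to abstain on, chosen one at a time (illustrative, O(n^3) per step)."""
    p = np.asarray(probs, dtype=float)
    keep = list(range(len(p)))
    abstained = []
    for _ in range(budget):
        # Estimated metric after dropping each remaining candidate in turn.
        scores = [estimated_auroc(p[[i for i in keep if i != k]]) for k in keep]
        chosen = keep[int(np.argmax(scores))]
        keep.remove(chosen)
        abstained.append(chosen)
    return abstained
```

No true labels appear anywhere in the procedure, which is why it can be applied at test time. The estimate is only as good as the calibration of the probabilities.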

Posted Content•

TF-MoDISco v0.4.4.2-alpha: Technical Note.

Avanti Shrikumar, Katherine Tian, Anna Shcherbina, Žiga Avsec, Abhimanyu Banerjee, Mahfuza Sharmin, Surag Nair, Anshul Kundaje

31 Oct 2018

TL;DR: TF-MoDISco is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data.

Posted Content•DOI•

Gkmexplain: Fast and Accurate Interpretation of Nonlinear Gapped k-mer Support Vector Machines Using Integrated Gradients

Avanti Shrikumar¹, Eva Prakash, Anshul Kundaje¹•Institutions (1)

Stanford University¹

01 Nov 2018-bioRxiv

TL;DR: The authors propose gkmexplain, an approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models, and show on simulated regulatory DNA sequences that it identifies predictive patterns with high accuracy.

Abstract: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Note: Avanti Shrikumar and Eva Prakash are co-first authors. Avanti Shrikumar and Anshul Kundaje are co-corresponding authors.
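
For readers unfamiliar with Integrated Gradients, the method that the abstract cites as the inspiration for gkmexplain, the sketch below applies the generic technique to a toy differentiable model. It is not gkmexplain and involves no gapped k-mer kernel. The model `tanh(w · x)`, its analytic gradient, and all names here are assumptions chosen only to keep the example self-contained.

```python
import numpy as np

def model(x, w):
    """Toy differentiable model (hypothetical stand-in for a trained predictor)."""
    return np.tanh(np.dot(w, x))

def model_grad(x, w):
    """Analytic gradient of the toy model with respect to the input x."""
    return (1.0 - np.tanh(np.dot(w, x)) ** 2) * w

def integrated_gradients(x, baseline, w, steps=200):
    """Attribute model(x) - model(baseline) to input features.

    Averages the gradient along the straight path from baseline to x
    (midpoint Riemann sum) and scales it by (x - baseline).
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([model_grad(baseline + a * (x - baseline), w) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)
```

The attributions sum, up to discretization error, to the difference between the model output at the input and at the baseline, and a feature that equals its baseline value receives zero attribution.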

Posted Content•

Technical Note on Transcription Factor Motif Discovery from Importance Scores (TF-MoDISco) version 0.5.1.1

Avanti Shrikumar, Katherine Tian, Žiga Avsec, Anna Shcherbina, Abhimanyu Banerjee, Mahfuza Sharmin, Surag Nair, Anshul Kundaje

31 Oct 2018-arXiv: Learning

TL;DR: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data.

Abstract: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data. This technical note focuses on version v0.5.1.1. The implementation is available at https://github.com/kundajelab/tfmodisco/tree/v0.5.1.1

Posted Content•

Learning to Abstain via Curve Optimization.

Amr Alexandari, Avanti Shrikumar, Anshul Kundaje

20 Feb 2018

TL;DR: This work develops a novel approach to the problem of selecting a budget-constrained subset of test examples to abstain on, by analytically optimizing the expected marginal improvement in a desired performance metric, such as the area under the ROC curve or Precision-Recall curve.

Abstract: In practical applications of machine learning, it is often desirable to identify and abstain on examples where the model's predictions are likely to be incorrect. We consider the problem of selecting a budget-constrained subset of test examples to abstain on, with the goal of maximizing performance on the remaining examples. We develop a novel approach to this problem by analytically optimizing the expected marginal improvement in a desired performance metric, such as the area under the ROC curve or Precision-Recall curve. We compare our approach to other abstention techniques for deep learning models based on posterior probability and uncertainty estimates obtained using test-time dropout. On various tasks in computer vision, natural language processing, and bioinformatics, we demonstrate the consistent effectiveness of our approach over other techniques. We also introduce novel diagnostics based on influence functions to understand the behavior of abstention methods in the presence of noisy training data, and leverage the insights to propose a new influence-based abstention method.

Posted Content•

Selective Classification via Curve Optimization

Amr Alexandari, Avanti Shrikumar, Anshul Kundaje

20 Feb 2018

TL;DR: In this paper, a metric-specific abstention algorithm was proposed to optimize the sensitivity at a target specificity level, the area under the ROC, and the weighted Cohen's Kappa.
