scispace - formally typeset
Search or ask a question

Showing papers by "Anshul Kundaje published in 2018"


Journal ArticleDOI
TL;DR: It is found that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art.
Abstract: Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems-patient classification, fundamental biological processes and treatment of patients-and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

1,491 citations


Journal ArticleDOI
TL;DR: A deep neural network is developed to predict open chromatin regions from DNA sequence alone and is able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell-type-specific chromatin accessibility.
Abstract: Induced pluripotent stem cells (iPSCs) are an essential tool for studying cellular differentiation and cell types that are otherwise difficult to access. We investigated the use of iPSCs and iPSC-derived cells to study the impact of genetic variation on gene regulation across different cell types and as models for studies of complex disease. To do so, we established a panel of iPSCs from 58 well-studied Yoruba lymphoblastoid cell lines (LCLs); 14 of these lines were further differentiated into cardiomyocytes. We characterized regulatory variation across individuals and cell types by measuring gene expression levels, chromatin accessibility, and DNA methylation. Our analysis focused on a comparison of inter-individual regulatory variation across cell types. While most cell-type-specific regulatory quantitative trait loci (QTLs) lie in chromatin that is open only in the affected cell types, we found that 20% of cell-type-specific regulatory QTLs are in shared open chromatin. This observation motivated us to develop a deep neural network to predict open chromatin regions from DNA sequence alone. Using this approach, we were able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell-type-specific chromatin accessibility.

115 citations


Journal ArticleDOI
TL;DR: It is shown that SCLC can arise from different cell types of origin, which profoundly influences the eventual genetic and epigenetic changes that enable metastatic progression, and underscores the importance of the identity of cell type of origin in influencing tumor evolution and metastatic mechanisms.
Abstract: The extent to which early events shape tumor evolution is largely uncharacterized, even though a better understanding of these early events may help identify key vulnerabilities in advanced tumors. Here, using genetically defined mouse models of small cell lung cancer (SCLC), we uncovered distinct metastatic programs attributable to the cell type of origin. In one model, tumors gain metastatic ability through amplification of the transcription factor NFIB and a widespread increase in chromatin accessibility, whereas in the other model, tumors become metastatic in the absence of NFIB-driven chromatin alterations. Gene-expression and chromatin accessibility analyses identify distinct mechanisms as well as markers predictive of metastatic progression in both groups. Underlying the difference between the two programs was the cell type of origin of the tumors, with NFIB-independent metastases arising from mature neuroendocrine cells. Our findings underscore the importance of the identity of cell type of origin in influencing tumor evolution and metastatic mechanisms. Significance: We show that SCLC can arise from different cell types of origin, which profoundly influences the eventual genetic and epigenetic changes that enable metastatic progression. Understanding intertumoral heterogeneity in SCLC, and across cancer types, may illuminate mechanisms of tumor progression and uncover how the cell type of origin affects tumor evolution. Cancer Discov; 8(10); 1316–31. ©2018 AACR. See related commentary by Pozo et al., p. 1216. This article is highlighted in the In This Issue feature, p. 1195

109 citations


Journal ArticleDOI
TL;DR: The Umap software for identifying uniquely mappable regions of any genome is introduced and its Bismap extension identifies mappability of the bisulfite-converted genome.
Abstract: Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called 'mappability', which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at https://bismap.hoffmanlab.org for use with genome browsers.

103 citations


Journal ArticleDOI
TL;DR: A concordance measure called DIfferences between Smoothed COntact maps (GenomeDISCO) is introduced for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments, which accurately distinguishes biological replicates from samples obtained from different cell types.
Abstract: Motivation The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of three-dimensional chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts. Results We introduce a concordance measure called DIfferences between Smoothed COntact maps (GenomeDISCO) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP. Availability and implementation Software implementing GenomeDISCO is available at https://github.com/kundajelab/genomedisco. Supplementary information Supplementary data are available at Bioinformatics online.

74 citations


Journal ArticleDOI
TL;DR: This work presents a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence and makes significant strides in improving the interpretability of deep learning models for genomics.
Abstract: Motivation Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. Results We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Availability and implementation Code is available at: https://github.com/kundajelab/dfim. Supplementary information Supplementary data are available at Bioinformatics online.

65 citations


Posted Content
TL;DR: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data and the implementation is available at this URL.
Abstract: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data. This technical note focuses on version v0.5.6.5. The implementation is available at this https URL

61 citations


Posted ContentDOI
24 Jul 2018-bioRxiv
TL;DR: Kipoi, a collaborative initiative to define standards and to foster reuse of trained models in genomics, is presented, providing a unified framework to archive, share, access, use, and build on models developed by the community.
Abstract: Advanced machine learning models applied to large-scale genomics datasets hold the promise to be major drivers for genome science. Once trained, such models can serve as a tool to probe the relationships between data modalities, including the effect of genetic variants on phenotype. However, lack of standardization and limited accessibility of trained models have hampered their impact in practice. To address this, we present Kipoi, a collaborative initiative to define standards and to foster reuse of trained models in genomics. Already, the Kipoi repository contains over 2,000 trained models that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. The Kipoi model standard grants automated software installation and provides unified interfaces to apply and interpret models. We illustrate Kipoi through canonical use cases, including model benchmarking, transfer learning, variant effect prediction, and building new models from existing ones. By providing a unified framework to archive, share, access, use, and build on models developed by the community, Kipoi will foster the dissemination and use of machine learning models in genomics.

36 citations


Journal ArticleDOI
TL;DR: Nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays are compared and a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks is devised.
Abstract: Enhancers are distal cis-regulatory elements that modulate gene expression. They are depleted of nucleosomes and enriched in specific histone modifications; thus, calling DNase-seq and histone mark ChIP-seq peaks can predict enhancers. We evaluated nine peak-calling algorithms for predicting enhancers validated by transgenic mouse assays. DNase and H3K27ac peaks were consistently more predictive than H3K4me1/2/3 and H3K9ac peaks. DFilter and Hotspot2 were the best DNase peak callers, while HOMER, MUSIC, MACS2, DFilter and F-seq were the best H3K27ac peak callers. We observed that the differential DNase or H3K27ac signals between two distant tissues increased the area under the precision-recall curve (PR-AUC) of DNase peaks by 17.5-166.7% and that of H3K27ac peaks by 7.1-22.2%. We further improved this differential signal method using multiple contrast tissues. Evaluated using a blind test, the differential H3K27ac signal method substantially improved PR-AUC from 0.48 to 0.75 for predicting heart enhancers. We further validated our approach using postnatal retina and cerebral cortex enhancers identified by massively parallel reporter assays, and observed improvements for both tissues. In summary, we compared nine peak callers and devised a superior method for predicting tissue-specific mouse developmental enhancers by reranking the called peaks.

34 citations


Posted ContentDOI
14 Oct 2018-bioRxiv
TL;DR: Property of TWAS is explored as a potential approach to prioritize causal genes, using simulations and case studies of literature-curated candidate causal genes for schizophrenia, LDL cholesterol and Crohn’s disease, and guidelines and best practices are provided.
Abstract: Transcriptome-wide association studies (TWAS) integrate GWAS and gene expression datasets to find gene-trait associations. In this Perspective, we explore properties of TWAS as a potential approach to prioritize causal genes, using simulations and case studies of literature-curated candidate causal genes for schizophrenia, LDL cholesterol and Crohn9s disease. We explore risk loci where TWAS accurately prioritizes the likely causal gene, as well as loci where TWAS prioritizes multiple genes, some of which are unlikely to be causal, because they share the same variants as eQTLs. We illustrate that TWAS is especially prone to spurious prioritization when using expression data from tissues or cell types that are less related to the trait, due to substantial variation in both expression levels and eQTL strengths across cell types. Nonetheless, TWAS prioritizes candidate causal genes at GWAS loci more accurately than simple baselines based on proximity to lead GWAS variant and expression in trait-related tissue. We discuss current strategies and future opportunities for improving the performance of TWAS for causal gene prioritization. Our results showcase the strengths and limitations of using expression variation across individuals to determine causal genes at GWAS loci and provide guidelines and best practices when using TWAS to prioritize candidate causal genes.

21 citations


Posted ContentDOI
17 Apr 2018-bioRxiv
TL;DR: This work presents a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence and makes significant strides in improving the interpretability of deep learning models for genomics.
Abstract: Transcription factors bind complex regulatory DNA sequence patterns in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn these cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order, epistatic feature interactions encoded by the models. We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics.

Posted Content
31 Oct 2018
TL;DR: The methods behind TF-MoDISco version 0.4.2-alpha, an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data, are described.

Posted Content
TL;DR: It is shown that the formula for Total Conductance is mathematically equivalent to Path Integrated Gradients computed on a hidden layer in the network, which is a pre-existing computationally efficient approach that is applicable to calculating internal neuron importance.
Abstract: The challenge of assigning importance to individual neurons in a network is of interest when interpreting deep learning models In recent work, Dhamdhere et al proposed Total Conductance, a "natural refinement of Integrated Gradients" for attributing importance to internal neurons Unfortunately, the authors found that calculating conductance in tensorflow required the addition of several custom gradient operators and did not scale well In this work, we show that the formula for Total Conductance is mathematically equivalent to Path Integrated Gradients computed on a hidden layer in the network We provide a scalable implementation of Total Conductance using standard tensorflow gradient operators that we call Neuron Integrated Gradients We compare Neuron Integrated Gradients to DeepLIFT, a pre-existing computationally efficient approach that is applicable to calculating internal neuron importance We find that DeepLIFT produces strong empirical results and is faster to compute, but because it lacks the theoretical properties of Neuron Integrated Gradients, it may not always be preferred in practice Colab notebook reproducing results: this http URL

Journal ArticleDOI
TL;DR: Takeaway is that human mtDNA has a conserved protein-DNA organization, which is likely involved in mtDNA regulation, and altered mt-DGF pattern in interleukin 3-treated CD34+ cells, certain tissue differences, and significant prevalence change in fetal versus nonfetal samples offer first clues to their physiological importance.
Abstract: Human mitochondrial DNA (mtDNA) is believed to lack chromatin and histones. Instead, it is coated solely by the transcription factor TFAM. We asked whether mtDNA packaging is more regulated than once thought. To address this, we analyzed DNase-seq experiments in 324 human cell types and found, for the first time, a pattern of 29 mtDNA Genomic footprinting (mt-DGF) sites shared by ∼90% of the samples. Their syntenic conservation in mouse DNase-seq experiments reflect selective constraints. Colocalization with known mtDNA regulatory elements, with G-quadruplex structures, in TFAM-poor sites (in HeLa cells) and with transcription pausing sites, suggest a functional regulatory role for such mt-DGFs. Altered mt-DGF pattern in interleukin 3-treated CD34+ cells, certain tissue differences, and significant prevalence change in fetal versus nonfetal samples, offer first clues to their physiological importance. Taken together, human mtDNA has a conserved protein-DNA organization, which is likely involved in mtDNA regulation.

Journal ArticleDOI
TL;DR: The most recent major technological transformation happened a decade ago, with the move from using tiling arrays [chromatin immunoprecipitation (ChIP)-on-Chip] to high-throughput sequencing as a readout for ChIP assays.
Abstract: Advances in the methods for detecting protein-DNA interactions have played a key role in determining the directions of research into the mechanisms of transcriptional regulation. The most recent major technological transformation happened a decade ago, with the move from using tiling arrays [chromatin immunoprecipitation (ChIP)-on-Chip] to high-throughput sequencing (ChIP-seq) as a readout for ChIP assays. In addition to the numerous other ways in which it is superior to arrays, by eliminating the need to design and manufacture them, sequencing also opened the door to carrying out comparative analyses of genome-wide transcription factor occupancy across species and studying chromatin biology in previously less accessible model and nonmodel organisms, thus allowing us to understand the evolution and diversity of regulatory mechanisms in unprecedented detail. Here, we review the biological insights obtained from such studies in recent years and discuss anticipated future developments in the field.

Posted ContentDOI
17 Aug 2018-bioRxiv
TL;DR: In this article, a convolutional neural network (CNN)-based framework was proposed to predict and interpret the regulatory activity of DNA sequences as measured by massively parallel reporter assays (MPRAs), which are a powerful approach to characterize the relationship between noncoding DNA sequence and gene expression.
Abstract: The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ~500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

Proceedings ArticleDOI
01 Jan 2018
TL;DR: It is demonstrated that it is possible to predict protein-ligand interactions using only motif-based features and that interpretation of these features can reveal new insights into the molecular mechanics underlying each interaction.
Abstract: Identification of small molecule ligands that bind to proteins is a critical step in drug discovery. Computational methods have been developed to accelerate the prediction of protein-ligand binding, but often depend on 3D protein structures. As only a limited number of protein 3D structures have been resolved, the ability to predict protein-ligand interactions without relying on a 3D representation would be highly valuable. We use an interpretable confidence-rated boosting algorithm to predict protein-ligand interactions with high accuracy from ligand chemical substructures and protein 1D sequence motifs, without relying on 3D protein structures. We compare several protein motif definitions, assess generalization of our model's predictions to unseen proteins and ligands, demonstrate recovery of well established interactions and identify globally predictive protein-ligand motif pairs. By bridging biological and chemical perspectives, we demonstrate that it is possible to predict protein-ligand interactions using only motif-based features and that interpretation of these features can reveal new insights into the molecular mechanics underlying each interaction. Our work also lays a foundation to explore more predictive feature sets and sophisticated machine learning approaches as well as other applications, such as predicting unintended interactions or the effects of mutations.

Posted ContentDOI
22 Dec 2018-bioRxiv
TL;DR: A method for profiling accessibility of individual chromatin fibers at multi-kilobase length scale (SMAC-seq, or Single-Molecule long-read Accessible Chromatin mapping sequencing assay), enabling the simultaneous, high-resolution, single-molecule assessment of the chromatin state of distal genomic elements.
Abstract: Active regulatory elements in eukaryotes are typically characterized by an open, nucleosome depleted chromatin structure; mapping areas of open chromatin has accordingly emerged as a widely used tool in the arsenal of modern functional genomics. However, existing approaches for profiling chromatin accessibility are limited by their reliance on DNA fragmentation and short read sequencing, which leaves them unable to provide information about the state of chromatin on larger scales or reveal coordination between the chromatin state of individual distal regulatory elements. To address these limitations we have developed a method for profiling accessibility of individual chromatin molecules at multi-kilobase length scale (SMAC-seq, or Single-Molecule long-read Accessible Chromatin mapping sequencing assay), enabling the simultaneous, high resolution, single-molecule assessment of the chromatin state of distal genomic elements. Our strategy is based on combining the preferential methylation of open chromatin regions by DNA methyltransferases (CpG and GpC 5-methylcytosine (5mC) and N 6 -methyl adenosine (m 6 A) enzymes) and the ability of long-read single-molecule nanopore sequencing to directly read out the methylation state of individual DNA bases. Applying SMAC-seq to the budding yeast Saccharomyces cerevisiae , we demonstrate that aggregate SMAC-seq signals match bulk-level accessibility measurements, observe single-molecule protection footprints of nucleosomes and transcription factors, and quantify the correlation between the chromatin states of distal genomic elements.

Posted ContentDOI
31 May 2018-bioRxiv
TL;DR: This resource identifies chromatin and transcriptional states that are characteristic of youthful tissues, which could be leveraged to restore aspects of youthful functionality to old tissues.
Abstract: Aging is accompanied by the functional decline of tissues. However, a systematic study of epigenomic and transcriptomic changes across tissues during aging is missing. Here we generated chromatin maps and transcriptomes from 4 tissues and one cell type from young, middle-age, and old mice, yielding 143 high-quality datasets. We focused specifically on chromatin marks linked to gene expression regulation and cell identity: histone H3 trimethylation at lysine 4 (H3K4me3), a mark enriched at promoters, and histone H3 acetylation at lysine 27 (H3K27ac), a mark enriched at active enhancers. Epigenomic and transcriptomic landscapes could easily distinguish between ages. Machine learning analysis showed that chromatin states could predict transcriptional changes during aging. Analysis of datasets from all tissues identified recurrent age-related chromatin and transcriptional changes in key processes, including the upregulation of immune system response pathways, including the interferon signaling pathway. The upregulation of interferon response pathway with age was accompanied by increased transcription of various endogenous retroviral sequences. Pathways deregulated during mouse aging across tissues, notably innate immune pathways, were also deregulated with aging in other vertebrate species — African turquoise killifish, rat, and humans — indicating common signatures of age across species. To date, our dataset represents the largest multi-tissue epigenomic and transcriptomic dataset for vertebrate aging. This resource identifies chromatin and transcriptional states that are characteristic of youthful tissues, which could be leveraged to restore aspects of youthful functionality to old tissues.

Posted Content
TL;DR: This framework leverages the insight that calibrated probability estimates can be used as a proxy for the true class labels, thereby allowing us to estimate the change in an arbitrary metric if an example were abstained on.
Abstract: In practical applications of machine learning, it is often desirable to identify and abstain on examples where the model's predictions are likely to be incorrect. Much of the prior work on this topic focused on out-of-distribution detection or performance metrics such as top-k accuracy. Comparatively little attention was given to metrics such as area-under-the-curve or Cohen's Kappa, which are extremely relevant for imbalanced datasets. Abstention strategies aimed at top-k accuracy can produce poor results on these metrics when applied to imbalanced datasets, even when all examples are in-distribution. We propose a framework to address this gap. Our framework leverages the insight that calibrated probability estimates can be used as a proxy for the true class labels, thereby allowing us to estimate the change in an arbitrary metric if an example were abstained on. Using this framework, we derive computationally efficient metric-specific abstention algorithms for optimizing the sensitivity at a target specificity level, the area under the ROC, and the weighted Cohen's Kappa. Because our method relies only on calibrated probability estimates, we further show that by leveraging recent work on domain adaptation under label shift, we can generalize to test-set distributions that may have a different class imbalance compared to the training set distribution. On various experiments involving medical imaging, natural language processing, computer vision and genomics, we demonstrate the effectiveness of our approach. Source code available at this https URL. Colab notebooks reproducing results available at this https URL.

Posted Content
31 Oct 2018
TL;DR: TF-MoDISco as mentioned in this paper is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data, which is based on the TF-MoDSco algorithm.
Abstract: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data. This technical note focuses on version v0.5.6.5. The implementation is available at this https URL

Posted ContentDOI
01 Nov 2018-bioRxiv
TL;DR: Shrikumar et al. as discussed by the authors proposed gkmexplain, a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models, using simulated regulatory DNA sequences.
Abstract: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Note: Avanti Shrikumar and Eva Prakash are co-first authors. Avanti Shrikumar and Anshul Kundaje are co-corresponding authors.

Posted Content
TL;DR: TFMoDISco (Transcription Factor Motif Discovery from Importance Scores) as mentioned in this paper is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data.
Abstract: TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data. This technical note focuses on version v0.5.1.1. The implementation is available at https://github.com/kundajelab/tfmodisco/tree/v0.5.1.1

Posted Content
20 Feb 2018
TL;DR: This work develops a novel approach to the problem of selecting a budget-constrained subset of test examples to abstain on, by analytically optimizing the expected marginal improvement in a desired performance metric, such as the area under the ROC curve or Precision-Recall curve.
Abstract: In practical applications of machine learning, it is often desirable to identify and abstain on examples where the a model's predictions are likely to be incorrect. We consider the problem of selecting a budget-constrained subset of test examples to abstain on, with the goal of maximizing performance on the remaining examples. We develop a novel approach to this problem by analytically optimizing the expected marginal improvement in a desired performance metric, such as the area under the ROC curve or Precision-Recall curve. We compare our approach to other abstention techniques for deep learning models based on posterior probability and uncertainty estimates obtained using test-time dropout. On various tasks in computer vision, natural language processing, and bioinformatics, we demonstrate the consistent effectiveness of our approach over other techniques. We also introduce novel diagnostics based on influence functions to understand the behavior of abstention methods in the presence of noisy training data, and leverage the insights to propose a new influence-based abstention method.

Posted Content
20 Feb 2018
TL;DR: In this paper, a metric-specific abstention algorithm was proposed to optimize the sensitivity at a target specificity level, the area under the ROC, and the weighted Cohen's Kappa.
Abstract: In practical applications of machine learning, it is often desirable to identify and abstain on examples where the model's predictions are likely to be incorrect. Much of the prior work on this topic focused on out-of-distribution detection or performance metrics such as top-k accuracy. Comparatively little attention was given to metrics such as area-under-the-curve or Cohen's Kappa, which are extremely relevant for imbalanced datasets. Abstention strategies aimed at top-k accuracy can produce poor results on these metrics when applied to imbalanced datasets, even when all examples are in-distribution. We propose a framework to address this gap. Our framework leverages the insight that calibrated probability estimates can be used as a proxy for the true class labels, thereby allowing us to estimate the change in an arbitrary metric if an example were abstained on. Using this framework, we derive computationally efficient metric-specific abstention algorithms for optimizing the sensitivity at a target specificity level, the area under the ROC, and the weighted Cohen's Kappa. Because our method relies only on calibrated probability estimates, we further show that by leveraging recent work on domain adaptation under label shift, we can generalize to test-set distributions that may have a different class imbalance compared to the training set distribution. On various experiments involving medical imaging, natural language processing, computer vision and genomics, we demonstrate the effectiveness of our approach. Source code available at this https URL. Colab notebooks reproducing results available at this https URL.