Showing papers by "Anshul Kundaje" published in 2020


Journal ArticleDOI
29 Jul 2020-Nature
TL;DR: The authors summarize phase III of the Encyclopedia of DNA Elements (ENCODE) project, which produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development, and which serves as a resource for better understanding the human and mouse genomes.
Abstract: The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE [1] and Roadmap Epigenomics [2] data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

999 citations


Posted Content
TL;DR: WILDS, a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, is presented in the hope of encouraging the development of general-purpose methods that are anchored to real-world distribution shifts and that work well across different applications and problem settings.
Abstract: Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity, these real-world distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts which naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training results in substantially lower out-of-distribution than in-distribution performance, and that this gap remains even with models trained by existing methods for handling distribution shifts. This underscores the need for new training methods that produce models which are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at this https URL.

579 citations


Journal ArticleDOI
Orit Rozenblatt-Rosen (1), Aviv Regev (1, 2, 3), +370 more; Institutions (19)
16 Apr 2020-Cell
TL;DR: The Human Tumor Atlas Network (HTAN), part of the NCI Cancer Moonshot Initiative, will establish a clinical, experimental, computational, and organizational framework to generate informative and accessible three-dimensional atlases of cancer transitions for a diverse set of tumor types.

279 citations


Journal ArticleDOI
Benjamin Haibe-Kains, George Alexandru Adam, Ahmed Hosny, Farnoosh Khodakarami, Massive Analysis Quality Control (MAQC) Society Board of Directors, Levi Waldron, Bo Wang, Chris McIntosh, Anna Goldenberg, Anshul Kundaje, Casey S. Greene, Tamara Broderick, Michael M. Hoffman, Jeffrey T. Leek, Keegan Korthauer, Wolfgang Huber, Alvis Brazma, Joelle Pineau, Robert Tibshirani, Trevor Hastie, John P. A. Ioannidis, John Quackenbush & Hugo J. W. L. Aerts
14 Oct 2020-Nature

179 citations


Journal ArticleDOI
11 Mar 2020-Nature
TL;DR: A scalable cancer-spheroid model is devised and genome-wide CRISPR screens in 2D monolayers and 3D lung-cancer spheroids are performed to identify carboxypeptidase D, acting via the IGF1R, as a 3D-specific driver of cancer growth.
Abstract: Cancer genomics studies have identified thousands of putative cancer driver genes [1]. Development of high-throughput and accurate models to define the functions of these genes is a major challenge. Here we devised a scalable cancer-spheroid model and performed genome-wide CRISPR screens in 2D monolayers and 3D lung-cancer spheroids. CRISPR phenotypes in 3D more accurately recapitulated those of in vivo tumours, and genes with differential sensitivities between 2D and 3D conditions were highly enriched for genes that are mutated in lung cancers. These analyses also revealed drivers that are essential for cancer growth in 3D and in vivo, but not in 2D. Notably, we found that carboxypeptidase D is responsible for removal of a C-terminal RKRR motif [2] from the α-chain of the insulin-like growth factor 1 receptor that is critical for receptor activity. Carboxypeptidase D expression correlates with patient outcomes in patients with lung cancer, and loss of carboxypeptidase D reduced tumour growth. Our results reveal key differences between 2D and 3D cancer models, and establish a generalizable strategy for performing CRISPR screens in spheroids to reveal cancer vulnerabilities. CRISPR screens in a 3D spheroid cancer model system more accurately recapitulate cancer phenotypes than existing 2D models and were used to identify carboxypeptidase D, acting via the IGF1R, as a 3D-specific driver of cancer growth.

178 citations


Journal ArticleDOI
TL;DR: A machine-learning classifier is developed to integrate this multi-omic framework and predict dozens of functional single-nucleotide polymorphisms for Alzheimer's disease and Parkinson's disease (PD), and the complex inverted haplotype of the MAPT (encoding tau) PD risk locus is dissected.
Abstract: Genome-wide association studies of neurological diseases have identified thousands of variants associated with disease phenotypes. However, most of these variants do not alter coding sequences, making it difficult to assign their function. Here, we present a multi-omic epigenetic atlas of the adult human brain through profiling of single-cell chromatin accessibility landscapes and three-dimensional chromatin interactions of diverse adult brain regions across a cohort of cognitively healthy individuals. We developed a machine-learning classifier to integrate this multi-omic framework and predict dozens of functional SNPs for Alzheimer's and Parkinson's diseases, nominating target genes and cell types for previously orphaned loci from genome-wide association studies. Moreover, we dissected the complex inverted haplotype of the MAPT (encoding tau) Parkinson's disease risk locus, identifying putative ectopic regulatory interactions in neurons that may mediate this disease association. This work expands understanding of inherited variation and provides a roadmap for the epigenomic dissection of causal regulatory variation in disease.

177 citations


Journal ArticleDOI
TL;DR: In this article, the authors identify obstacles hindering transparent and reproducible AI research, as faced by McKinney et al., and provide solutions with implications for the broader field.
Abstract: In their study, McKinney et al. showed the high potential of artificial intelligence for breast cancer screening. However, the lack of detailed methods and computer code undermines its scientific value. We identify obstacles hindering transparent and reproducible AI research as faced by McKinney et al. and provide solutions with implications for the broader field.

166 citations


Posted ContentDOI
19 Jul 2020-bioRxiv
TL;DR: A deep learning model is introduced that uses DNA sequence to predict base-resolution ChIP-nexus binding profiles of pluripotency TFs, and interpretation tools are developed to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions.
Abstract: The arrangement of transcription factor (TF) binding motifs (syntax) is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution ChIP-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using CRISPR-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data. Highlights: The neural network BPNet accurately predicts TF binding data at base-resolution. Model interpretation discovers TF motifs and TF interactions dependent on soft syntax. Motifs for Nanog and partners are preferentially spaced at ∼10.5 bp periodicity. Directional cooperativity is validated: Sox2 enhances Nanog binding, but not vice versa.

160 citations


Journal ArticleDOI
29 Jul 2020-Nature
TL;DR: A map of cohesin-mediated chromatin loops in 24 types of human cells identifies loops that show cell-type-specific variation, indicating that chromatin loops may help to specify cell-specific gene expression programs and functions.
Abstract: Physical interactions between distal regulatory elements have a key role in regulating gene expression, but the extent to which these interactions vary between cell types and contribute to cell-type-specific gene expression remains unclear. Here, to address these questions as part of phase III of the Encyclopedia of DNA Elements (ENCODE), we mapped cohesin-mediated chromatin loops, using chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), and analysed gene expression in 24 diverse human cell types, including core ENCODE cell lines. Twenty-eight per cent of all chromatin loops vary across cell types; these variations modestly correlate with changes in gene expression and are effective at grouping cell types according to their tissue of origin. The connectivity of genes corresponds to different functional classes, with housekeeping genes having few contacts, and dosage-sensitive genes being more connected to enhancer elements. This atlas of chromatin loops complements the diverse maps of regulatory architecture that comprise the ENCODE Encyclopedia, and will help to support emerging analyses of genome structure and function. A map of cohesin-mediated chromatin loops in 24 types of human cells identifies loops that show cell-type-specific variation, indicating that chromatin loops may help to specify cell-specific gene expression programs and functions.

119 citations


Posted ContentDOI
30 Dec 2020-bioRxiv
TL;DR: In this paper, a single-cell atlas of gene expression and chromatin accessibility was generated by mapping the activity of gene-regulatory elements to identify genomic regions crucial to corticogenesis.
Abstract: Genetic perturbations of cerebral cortical development can lead to neurodevelopmental disease, including autism spectrum disorder (ASD). To identify genomic regions crucial to corticogenesis, we mapped the activity of gene-regulatory elements generating a single-cell atlas of gene expression and chromatin accessibility both independently and jointly. This revealed waves of gene regulation by key transcription factors (TFs) across a nearly continuous differentiation trajectory into glutamatergic neurons, distinguished the expression programs of glial lineages, and identified lineage-determining TFs that exhibited strong correlation between linked gene-regulatory elements and expression levels. These highly connected genes adopted an active chromatin state in early differentiating cells, consistent with lineage commitment. Basepair-resolution neural network models identified strong cell-type specific enrichment of noncoding mutations predicted to be disruptive in a cohort of ASD subjects and identified frequently disrupted TF binding sites. This approach illustrates how cell-type specific mapping can provide insights into the programs governing human development and disease.

103 citations


Journal ArticleDOI
TL;DR: This strategy is based on combining the preferential methylation of open chromatin regions by DNA methyltransferases with low sequence specificity, in this case EcoGII, an N6-methyladenosine (m6A) methyltransferase, and the ability of nanopore sequencing to directly read DNA modifications.
Abstract: Mapping open chromatin regions has emerged as a widely used tool for identifying active regulatory elements in eukaryotes. However, existing approaches, limited by reliance on DNA fragmentation and short-read sequencing, cannot provide information about large-scale chromatin states or reveal coordination between the states of distal regulatory elements. We have developed a method for profiling the accessibility of individual chromatin fibers, a single-molecule long-read accessible chromatin mapping sequencing assay (SMAC-seq), enabling the simultaneous, high-resolution, single-molecule assessment of chromatin states at multikilobase length scales. Our strategy is based on combining the preferential methylation of open chromatin regions by DNA methyltransferases with low sequence specificity, in this case EcoGII, an N6-methyladenosine (m6A) methyltransferase, and the ability of nanopore sequencing to directly read DNA modifications. We demonstrate that aggregate SMAC-seq signals match bulk-level accessibility measurements, observe single-molecule nucleosome and transcription factor protection footprints, and quantify the correlation between chromatin states of distal genomic elements.

Journal ArticleDOI
23 Dec 2020-Cell
TL;DR: HT-recruit, a pooled assay in which protein libraries are recruited to a reporter and their transcriptional effects are measured by sequencing, is developed; it reveals a relationship between repressor function and evolutionary age for the KRAB domains and shows that Homeodomain repressor strength is collinear with Hox genetic organization.

Posted ContentDOI
10 Sep 2020-bioRxiv
TL;DR: HT-recruit is a pooled assay in which protein libraries are recruited to a reporter and their transcriptional effects are measured by sequencing. Using this approach, the authors measure gene silencing and activation for thousands of domains.
Abstract: Thousands of proteins localize to the nucleus; however, it remains unclear which contain transcriptional effectors. Here, we develop HT-recruit - a pooled assay where protein libraries are recruited to a reporter, and their transcriptional effects are measured by sequencing. Using this approach, we measure gene silencing and activation for thousands of domains. We find a relationship between repressor function and evolutionary age for the KRAB domains, discover Homeodomain repressor strength is collinear with Hox genetic organization, and identify activities for several Domains of Unknown Function. Deep mutational scanning of the CRISPRi KRAB maps the co-repressor binding surface and identifies substitutions that improve stability/silencing. By tiling 238 proteins, we find repressors as short as 10 amino acids. Finally, we report new activator domains, including a divergent KRAB. Together, these results provide a resource of 600 human proteins containing effectors and demonstrate a scalable strategy for assigning functions to protein domains.

Posted ContentDOI
18 Oct 2020-bioRxiv
TL;DR: This work undertook multi-omic data profiling of chromatin and expression dynamics across epidermal differentiation to identify 40,103 dynamic CREs associated with 3,609 dynamically expressed genes, then applied an interpretable deep learning framework to model the cis-regulatory logic of chromatin accessibility.
Abstract: Transcription factors (TFs) bind DNA sequence motif vocabularies in cis-regulatory elements (CREs) to modulate chromatin state and gene expression during cell state transitions. A quantitative understanding of how motif lexicons influence dynamic regulatory activity has been elusive due to the combinatorial nature of the cis-regulatory code. To address this, we undertook multi-omic data profiling of chromatin and expression dynamics across epidermal differentiation to identify 40,103 dynamic CREs associated with 3,609 dynamically expressed genes, then applied an interpretable deep learning framework to model the cis-regulatory logic of chromatin accessibility. This identified cooperative DNA sequence rules in dynamic CREs regulating synchronous gene modules with diverse roles in skin differentiation. Massively parallel reporter analysis validated temporal dynamics and cooperative cis-regulatory logic. Variants linked to human polygenic skin disease were enriched in these time-dependent combinatorial motif rules. This integrative approach reveals the combinatorial cis-regulatory lexicon of epidermal differentiation and represents a general framework for deciphering the organizational principles of the cis-regulatory code in dynamic gene regulation.

Posted ContentDOI
06 May 2020-medRxiv
TL;DR: Genetic architectures of proximal and distal CRC are partly distinct, and studies of risk factors and mechanisms of carcinogenesis, and precision prevention strategies should take into consideration the anatomical subsite of the tumor.
Abstract: Objective: An understanding of the etiologic heterogeneity of colorectal cancer (CRC) is critical for improving precision prevention, including individualized screening recommendations and the discovery of novel drug targets and repurposable drug candidates for chemoprevention. Known differences in molecular characteristics and environmental risk factors among tumors arising in different locations of the colorectum suggest partly distinct mechanisms of carcinogenesis. The extent to which the contribution of inherited genetic risk factors for sporadic CRC differs by anatomical subsite of the primary tumor has not been examined. Design: To identify new anatomical subsite-specific risk loci, we performed genome-wide association study (GWAS) meta-analyses including data of 48,214 CRC cases and 64,159 controls of European ancestry. We characterized effect heterogeneity at CRC risk loci using multinomial modeling. Results: We identified 13 loci that reached genome-wide significance (P [...] Conclusion: Genetic architectures of proximal and distal CRC are partly distinct. Studies of risk factors and mechanisms of carcinogenesis, and precision prevention strategies should take into consideration the anatomical subsite of the tumor. Significance of this study. What is already known about this subject? Heterogeneity among colorectal cancer (CRC) tumors originating at different locations of the colorectum has been revealed in somatic genomes, epigenomes, and transcriptomes, and in some established environmental risk factors for CRC. Genome-wide association studies (GWAS) have identified over 100 genetic variants for overall CRC risk; however, a comprehensive analysis of the extent to which genetic risk factors differ by the anatomical sublocation of the primary tumor is lacking. What are the new findings? In this large consortium-based study, we analyzed clinical and genome-wide genotype data of 112,373 CRC cases and controls of European ancestry to comprehensively examine whether CRC case subgroups defined by anatomical sublocation have distinct germline genetic etiologies. We discovered 13 new loci at genome-wide significance (P [...] Systematic heterogeneity analysis of genetic risk variants for CRC identified thus far revealed that the genetic architectures of proximal and distal CRC are partly distinct. Taken together, our results further support the idea that tumors arising in different anatomical sublocations of the colorectum may have distinct etiologies. How might it impact on clinical practice in the foreseeable future? Our results provide an informative resource for understanding the differential role that genes and pathways may play in the mechanisms of proximal and distal CRC carcinogenesis. The new insights into the etiologies of proximal and distal CRC may inform the development of new precision prevention strategies, including individualized screening recommendations and the discovery of novel drug targets and repurposable drug candidates for chemoprevention. Our findings suggest that future studies of etiological risk factors for CRC and molecular mechanisms of carcinogenesis should take into consideration the anatomical sublocation of the colorectal tumor.

Posted ContentDOI
06 Jan 2020-bioRxiv
TL;DR: The complex inverted haplotype of the MAPT (encoding tau) PD risk locus is dissected, identifying ectopic enhancer-gene contacts in neurons that increase MAPT expression and may mediate this disease association, greatly expanding the understanding of inherited variation in AD and PD.
Abstract: Genome-wide association studies (GWAS) have identified thousands of variants associated with disease phenotypes. However, the majority of these variants do not alter coding sequences, making it difficult to assign their function. To this end, we present a multi-omic epigenetic atlas of the adult human brain through profiling of the chromatin accessibility landscapes and three-dimensional chromatin interactions of seven brain regions across a cohort of 39 cognitively healthy individuals. Single-cell chromatin accessibility profiling of 70,631 cells from six of these brain regions identifies 24 distinct cell clusters and 359,022 cell type-specific regulatory elements, capturing the regulatory diversity of the adult brain. We develop a machine learning classifier to integrate this multi-omic framework and predict dozens of functional single nucleotide polymorphisms (SNPs), nominating gene and cellular targets for previously orphaned GWAS loci. These predictions both inform well-studied disease-relevant genes, such as BIN1 in microglia for Alzheimer’s disease (AD) and reveal novel gene-disease associations, such as STAB1 in microglia and MAL in oligodendrocytes for Parkinson’s disease (PD). Moreover, we dissect the complex inverted haplotype of the MAPT (encoding tau) PD risk locus, identifying ectopic enhancer-gene contacts in neurons that increase MAPT expression and may mediate this disease association. This work greatly expands our understanding of inherited variation in AD and PD and provides a roadmap for the epigenomic dissection of noncoding regulatory variation in disease.

Proceedings Article
01 Jan 2020
TL;DR: The Fourier transform of input-level attribution scores is computed at training time, and high-frequency components of the Fourier spectrum are penalized, to improve deep learning models’ stability, interpretability, and performance on held-out data, especially when training data is severely limited.
Abstract: Deep learning models can accurately map genomic DNA sequences to associated functional molecular readouts such as protein–DNA binding data. Base-resolution importance (i.e. “attribution”) scores inferred from these models can highlight predictive sequence motifs and syntax. Unfortunately, these models are prone to overfitting and are sensitive to random initializations, often resulting in noisy and irreproducible attributions that obfuscate underlying motifs. To address these shortcomings, we propose a novel attribution prior, where the Fourier transform of input-level attribution scores are computed at training-time, and high-frequency components of the Fourier spectrum are penalized. We evaluate different model architectures with and without attribution priors trained on genome-wide binary or continuous molecular profiles. We show that our attribution prior dramatically improves models’ stability, interpretability, and performance on held-out data, especially when training data is severely limited. Our attribution prior also allows models to identify biologically meaningful sequence motifs more sensitively and precisely within individual regulatory elements. The prior is agnostic to the model architecture or predicted experimental assay, yet provides similar gains across all experiments. This work represents an important advancement in improving the reliability of deep learning models for deciphering the regulatory code of the genome.
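The idea of penalizing high-frequency components of an attribution track can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `fourier_attribution_penalty`, the cutoff `freq_limit`, and the normalization by total spectral magnitude are hypothetical choices, not the paper's exact formulation. In actual training the penalty would be computed on differentiable attributions and added to the loss.

```python
import numpy as np

def fourier_attribution_penalty(attributions, freq_limit=10):
    """Fraction of spectral magnitude above a cutoff (hypothetical helper).

    `freq_limit` is an assumed cutoff: the first `freq_limit` non-DC
    frequency components count as "low", everything above as "high".
    """
    attributions = np.asarray(attributions, dtype=float)
    # Magnitudes of the positive-frequency components, without the DC term.
    mags = np.abs(np.fft.rfft(attributions))[1:]
    total = mags.sum()
    if total == 0:
        return 0.0
    return float(mags[freq_limit:].sum() / total)

rng = np.random.default_rng(0)
positions = np.arange(200)
# A smooth, motif-like bump: most of its energy sits at low frequencies.
smooth = np.exp(-0.5 * ((positions - 100) / 8.0) ** 2)
# Jagged scores standing in for noisy, irreproducible attributions.
noisy = rng.normal(size=200)

smooth_penalty = fourier_attribution_penalty(smooth)
noisy_penalty = fourier_attribution_penalty(noisy)
print(smooth_penalty, noisy_penalty)
```

Under these assumptions the smooth track receives a much smaller penalty than the noisy one, which is the behavior a prior of this kind is meant to encourage.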

Posted ContentDOI
28 Aug 2020-bioRxiv
TL;DR: During asymmetric division, size-independent transcription is insufficient for size-independent protein expression, and chromatin-binding ensures equal amounts of protein are partitioned to unequally sized cells to maintain size-independent protein amounts.
Abstract: Cell size and biosynthesis are inextricably linked. As cells grow, total protein synthesis increases in proportion to cell size so that protein concentrations remain constant. As an exception, the budding yeast cell-cycle inhibitor Whi5 is synthesized in a constant amount per cell cycle, so that it is diluted in large cells to trigger division. Here, we show that this size-independent expression of Whi5 results from size-independent transcription. A screen for similar genes identified histones as the major class of size-independent transcripts during the cell cycle, consistent with histone synthesis being coupled to genome content rather than cell size. However, during asymmetric division, size-independent transcription is insufficient for size-independent protein expression and chromatin-binding ensures equal amounts of protein are partitioned to unequally sized cells to maintain size-independent protein amounts. Thus, specific transcriptional and partitioning mechanisms determine size-independent protein expression to control cell size.

Posted ContentDOI
03 Sep 2020-bioRxiv
TL;DR: Insight is revealed into principles of genome regulation, mechanisms that influence IBD are illuminated, and a generalizable strategy to connect common disease risk variants to their molecular and cellular functions is demonstrated.
Abstract: Genome-wide association studies have now identified tens of thousands of noncoding loci associated with human diseases and complex traits, each of which could reveal insights into biological mechanisms of disease. Many of the underlying causal variants are thought to affect enhancers, but we have lacked genome-wide maps of enhancer-gene regulation to interpret such variants. We previously developed the Activity-by-Contact (ABC) Model to predict enhancer-gene connections and demonstrated that it can accurately predict the results of CRISPR perturbations across several cell types. Here, we apply this ABC Model to create enhancer-gene maps in 131 cell types and tissues, and use these maps to interpret the functions of fine-mapped GWAS variants. For inflammatory bowel disease (IBD), causal variants are >20-fold enriched in enhancers in particular cell types, and ABC outperforms other regulatory methods at connecting noncoding variants to target genes. Across 72 diseases and complex traits, ABC links 5,036 GWAS signals to 2,249 unique genes, including a class of 577 genes that appear to influence multiple phenotypes via variants in enhancers that act in different cell types. Guided by these variant-to-function maps, we show that an enhancer containing an IBD risk variant regulates the expression of PPIF to tune mitochondrial membrane potential. Together, our study reveals insights into principles of genome regulation, illuminates mechanisms that influence IBD, and demonstrates a generalizable strategy to connect common disease risk variants to their molecular and cellular functions.
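To make the Activity-by-Contact idea concrete, here is a minimal sketch. It assumes the commonly described form of the score, in which an element's activity times its contact frequency with the gene is divided by the sum of that product over all candidate elements near the gene. The abstract itself does not state the formula, and the function name `abc_scores` and all numbers are made up for illustration.

```python
import numpy as np

def abc_scores(activity, contact):
    """Activity x contact for each element, normalized across elements.

    `activity` and `contact` hold one value per candidate element for a
    single gene (illustrative inputs, not real measurements).
    """
    activity = np.asarray(activity, dtype=float)
    contact = np.asarray(contact, dtype=float)
    weighted = activity * contact
    total = weighted.sum()
    if total == 0:
        return np.zeros_like(weighted)
    return weighted / total

# Three hypothetical elements: a strong but distant enhancer, a moderate
# one, and a weak element in close contact with the promoter.
activity = [10.0, 5.0, 1.0]
contact = [0.02, 0.10, 0.50]
scores = abc_scores(activity, contact)
print(scores)
```

In this toy case the strongest element does not get the highest score, because its low contact frequency outweighs its activity.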

Posted ContentDOI
08 Jun 2020-bioRxiv
TL;DR: MTSplice (Multi-tissue Splicing), a neural network which quantitatively predicts effects of human genetic variants on splicing of cassette exons in 56 tissues, outperforms MMSplice on predicting effects associated with naturally occurring genetic variants in most tissues of the GTEx dataset.
Abstract: Tissue-specific splicing of exons plays an important role in determining tissue identity. However, computational tools predicting tissue-specific effects of variants on splicing are lacking. To address this issue, we developed MTSplice (Multi-tissue Splicing), a neural network which quantitatively predicts effects of human genetic variants on splicing of cassette exons in 56 tissues. MTSplice combines the state-of-the-art predictor MMSplice, which models constitutive regulatory sequences, with a new neural network which models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting effects associated with naturally occurring genetic variants in most tissues of the GTEx dataset. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. MTSplice is provided free of use and open source at the model repository Kipoi. We foresee MTSplice to be useful for functional prediction and prioritization of variants associated with tissue-specific disorders.

Posted ContentDOI
31 Mar 2020-bioRxiv
TL;DR: A deep learning model that uses DNA sequence to predict base-resolution ChIP-nexus binding profiles of four pluripotency TFs Oct4, Sox2, Nanog, and Klf4 finds that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences TF binding in a directional manner.
Abstract: Genes are regulated by cis-regulatory sequences, which contain transcription factor (TF) binding motifs in specific arrangements (syntax). To understand how motif syntax influences TF binding, we train a deep learning model, BPNet, that uses DNA sequence to predict base-resolution ChIP-nexus binding profiles of four pluripotency TFs Oct4, Sox2, Nanog, and Klf4. We interpret the model to accurately map hundreds of thousands of motifs in the genome, learn predictive motif representations, and identify rules by which specific motifs interact. We find that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences TF binding in a directional manner. Most strikingly, Nanog shows a strong preference for binding with helical periodicity. We validate our model using CRISPR-induced point mutations, demonstrating that interpretable deep learning models are a powerful approach to uncover the motifs and syntax of cis-regulatory sequences.

Posted ContentDOI
13 Oct 2020-bioRxiv
TL;DR: FastISM is an algorithm that speeds up ISM by a factor of over 10x for commonly used convolutional neural network architectures, and far surpasses the runtime of backpropagation-based methods on multi-output architectures, making it feasible to run ISM on a large number of sequences.
Abstract: Deep learning models such as convolutional neural networks are able to accurately map biological sequences to associated functional readouts and properties by learning predictive de novo representations. In-silico saturation mutagenesis (ISM) is a popular feature attribution technique for inferring contributions of all characters in an input sequence to the model’s predicted output. The main drawback of ISM is its runtime, as it involves multiple forward propagations of all possible mutations of each character in the input sequence through the trained model to predict the effects on the output. We present fastISM, an algorithm that speeds up ISM by a factor of over 10x for commonly used convolutional neural network architectures. fastISM is based on the observations that the majority of computation in ISM is spent in convolutional layers, and a single mutation only disrupts a limited region of intermediate layers, rendering most computation redundant. fastISM reduces the gap between backpropagation-based feature attribution methods and ISM. It far surpasses the runtime of backpropagation-based methods on multi-output architectures, making it feasible to run ISM on a large number of sequences. An easy-to-use Keras/TensorFlow 2 implementation of fastISM is available at https://github.com/kundajelab/fastISM, and a hands-on tutorial at https://colab.research.google.com/github/kundajelab/fastISM/blob/master/notebooks/colab/DeepSEA.ipynb.
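For context, the baseline that fastISM accelerates can be sketched as follows. This is plain, naive ISM, not the fastISM algorithm: every single-base substitution is scored with a full forward pass. The `toy_model` function is a hypothetical stand-in for a trained network (it simply scans for the motif GATA), and all names here are illustrative rather than taken from the fastISM package.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array in A, C, G, T order."""
    return np.eye(4)[[BASES.index(b) for b in seq]]

def toy_model(x):
    """Hypothetical stand-in for a trained model: best match to 'GATA'."""
    motif = one_hot("GATA")
    return max(float((x[i:i + 4] * motif).sum()) for i in range(x.shape[0] - 3))

def naive_ism(seq, model):
    """Score every substitution with a separate full forward pass."""
    x = one_hot(seq)
    ref = model(x)
    effects = np.zeros((len(seq), 4))
    for pos in range(len(seq)):
        for b in range(4):
            mutated = x.copy()
            mutated[pos] = 0.0
            mutated[pos, b] = 1.0
            # Effect = change in model output relative to the reference.
            effects[pos, b] = model(mutated) - ref
    return effects

ism = naive_ism("CCGATACC", toy_model)
print(ism)
```

The nested loop costs one forward pass per position and base, which is the runtime problem the abstract describes; the redundancy it mentions arises because each mutation changes only a small region of the intermediate layers.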

Posted ContentDOI
12 Jun 2020-bioRxiv
TL;DR: This paper proposes an attribution prior in which the Fourier transform of input-level attribution scores is computed at training time and high-frequency components of the Fourier spectrum are penalized.
Abstract: Deep learning models can accurately map genomic DNA sequences to associated functional molecular readouts such as protein–DNA binding data. Base-resolution importance (i.e. “attribution”) scores inferred from these models can highlight predictive sequence motifs and syntax. Unfortunately, these models are prone to overfitting and are sensitive to random initializations, often resulting in noisy and irreproducible attributions that obfuscate underlying motifs. To address these shortcomings, we propose a novel attribution prior, where the Fourier transform of input-level attribution scores is computed at training time, and high-frequency components of the Fourier spectrum are penalized. We evaluate different model architectures with and without attribution priors trained on genome-wide binary or continuous molecular profiles. We show that our attribution prior dramatically improves models’ stability, interpretability, and performance on held-out data, especially when training data is severely limited. Our attribution prior also allows models to identify biologically meaningful sequence motifs more sensitively and precisely within individual regulatory elements. The prior is agnostic to the model architecture or predicted experimental assay, yet provides similar gains across all experiments. This work represents an important advancement in improving the reliability of deep learning models for deciphering the regulatory code of the genome.
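
To make the idea of penalizing high-frequency attribution components concrete, here is a minimal sketch that measures the fraction of spectral magnitude above a cutoff frequency for a vector of attribution scores. The function name `high_freq_fraction`, the hard cutoff, and the example signals are assumptions for illustration; the paper's actual loss may weight frequencies differently, and in training this quantity would be computed on differentiable attributions and added to the loss.

```python
import cmath
import math
from typing import List, Sequence


def high_freq_fraction(attributions: Sequence[float], cutoff: int) -> float:
    """Fraction of DFT magnitude at positive frequencies k > cutoff.

    The DC component (k = 0) is ignored. Returns 0.0 for an all-zero input.
    """
    n = len(attributions)
    magnitudes: List[float] = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            a * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, a in enumerate(attributions)
        )
        magnitudes.append(abs(coeff))
    total = sum(magnitudes)
    if total == 0.0:
        return 0.0
    # magnitudes[0] is k = 1, so magnitudes[cutoff:] covers k > cutoff.
    return sum(magnitudes[cutoff:]) / total


# A smooth attribution track versus a rapidly alternating (noisy) one.
smooth = [math.sin(2 * math.pi * t / 16) for t in range(16)]
spiky = [(-1.0) ** t for t in range(16)]

smooth_penalty = high_freq_fraction(smooth, cutoff=2)
spiky_penalty = high_freq_fraction(spiky, cutoff=2)
print(smooth_penalty, spiky_penalty)
```

The smooth track incurs almost no penalty, while the alternating track is penalized almost fully, which is the behavior a prior of this kind is meant to encourage.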

Posted ContentDOI
04 Nov 2020-bioRxiv
TL;DR: This work extends conjoined and RCPS models to signal profile prediction and introduces a strong baseline, the "post-hoc" conjoined model: a standard model that is converted to a conjoined model only after training. The authors also prove that RCPS models can represent the solution learned by conjoined models.
Abstract: Predictive models that map double-stranded regulatory DNA to molecular signals of regulatory activity should, in principle, produce identical predictions regardless of whether the sequence of the forward strand or its reverse complement (RC) is supplied as input. Unfortunately, standard convolutional neural network architectures can produce highly divergent predictions across strands, even when the training set is augmented with RC sequences. Two strategies have emerged in the literature to enforce this symmetry: conjoined (a.k.a. “siamese”) architectures, where the model is run in parallel on both strands and the predictions are combined, and RC parameter sharing (RCPS), where weight sharing ensures that the response of the model is equivariant across strands. However, systematic benchmarks are lacking, and neither architecture has been adapted to base-resolution signal profile prediction tasks. In this work, we extend conjoined and RCPS models to signal profile prediction, and introduce a strong baseline: a standard model (trained on RC-augmented data) that is converted to a conjoined model only after it has been trained, which we call a “post-hoc” conjoined model. We then conduct benchmarks on both binary and signal profile prediction. We find that post-hoc conjoined models consistently perform as well as or better than models that were conjoined during training, and present a mathematical intuition for why. We also find that, despite its theoretical appeal, RCPS performs surprisingly poorly on certain tasks, in particular signal profile prediction. In fact, RCPS can sometimes do worse than even standard models trained with RC data augmentation. We prove that the RCPS models can represent the solution learned by the conjoined models, implying that the poor performance of RCPS may be due to optimization difficulties. We therefore suggest that users interested in RC symmetry should default to post-hoc conjoined models as a reliable baseline before exploring RCPS.
Code: https://github.com/hannahgz/BenchmarkRCStrategies
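
The post-hoc conjoined idea can be sketched in a few lines, assuming a model with a single scalar output and assuming the two strand predictions are combined by averaging (the abstract only says they are "combined"). `toy_model` and `post_hoc_conjoined` are hypothetical names and are not taken from the linked repository.

```python
from typing import Callable

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_model(seq: str) -> float:
    """Hypothetical strand-sensitive model: fraction of 'A' in the input."""
    return seq.count("A") / len(seq)


def post_hoc_conjoined(model: Callable[[str], float]) -> Callable[[str], float]:
    """Wrap an already-trained model so its output is RC-symmetric.

    The wrapped model averages the prediction on the input strand with the
    prediction on its reverse complement (averaging is an assumption here).
    """
    def wrapped(seq: str) -> float:
        return 0.5 * (model(seq) + model(reverse_complement(seq)))
    return wrapped


symmetric_model = post_hoc_conjoined(toy_model)
seq = "AAAC"
print(toy_model(seq), toy_model(reverse_complement(seq)))
print(symmetric_model(seq), symmetric_model(reverse_complement(seq)))
```

For base-resolution profile outputs, the prediction made on the reverse complement would additionally need to be reversed (and strand channels swapped, if present) before combining; that step is omitted from this scalar sketch.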

Posted ContentDOI
02 Jul 2020-bioRxiv
TL;DR: The 3D genome of Breviolum minutum is analyzed, and large topological domains without chromatin loops are found, implicating transcription-induced supercoiling as the primary topological force in dinoflagellates.
Abstract: Dinoflagellate chromosomes represent a unique evolutionary experiment, as they exist in a permanently condensed, liquid crystalline state, are not packaged by histones, and contain genes organized into polycistronic arrays, with minimal transcriptional regulation. We analyze the 3D genome of Breviolum minutum, and find large topological domains without chromatin loops, demarcated by convergent gene array boundaries (“dinoTADs”). Transcriptional inhibition degrades dinoTADs, implicating transcription-induced supercoiling as the primary topological force in dinoflagellates.

Proceedings Article
21 Nov 2020
TL;DR: The authors show that maximum likelihood combined with bias-corrected calibration outperforms BBSL and RLLS for label shift adaptation, and introduce a strategy for estimating source-domain priors that improves robustness to poor calibration.
Abstract: Label shift refers to the phenomenon where the prior class probability p(y) changes between the training and test distributions, while the conditional probability p(x|y) stays fixed. Label shift arises in settings like medical diagnosis, where a classifier trained to predict disease given symptoms must be adapted to scenarios where the baseline prevalence of the disease is different. Given estimates of p(y|x) from a predictive model, Saerens et al. proposed an efficient maximum likelihood algorithm to correct for label shift that does not require model retraining, but a limiting assumption of this algorithm is that p(y|x) is calibrated, which is not true of modern neural networks. Recently, Black Box Shift Learning (BBSL) and Regularized Learning under Label Shifts (RLLS) have emerged as state-of-the-art techniques to cope with label shift when a classifier does not output calibrated probabilities, but both methods require model retraining with importance weights and neither has been benchmarked against maximum likelihood. Here we (1) show that combining maximum likelihood with a type of calibration we call bias-corrected calibration outperforms both BBSL and RLLS across diverse datasets and distribution shifts, (2) prove that the maximum likelihood objective is concave, and (3) introduce a principled strategy for estimating source-domain priors that improves robustness to poor calibration. This work demonstrates that maximum likelihood with appropriate calibration is a formidable and efficient baseline for label shift adaptation; notebooks reproducing the experiments are available at this https URL
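
The maximum likelihood correction attributed to Saerens et al. is usually implemented as an EM loop over the target-domain class priors. The sketch below assumes the classifier posteriors are already calibrated and does not implement bias-corrected calibration; `em_label_shift` and the toy numbers are illustrative assumptions, not the authors' code.

```python
from typing import List, Sequence


def em_label_shift(
    posteriors: Sequence[Sequence[float]],
    source_prior: Sequence[float],
    n_iter: int = 200,
) -> List[float]:
    """Estimate target-domain class priors by EM.

    posteriors[i][k] is the source-trained (assumed calibrated) estimate of
    p(y = k | x_i) for unlabeled target example i.
    """
    n_classes = len(source_prior)
    target_prior = list(source_prior)  # initialize at the source prior
    for _ in range(n_iter):
        totals = [0.0] * n_classes
        for p in posteriors:
            # E-step: reweight each posterior by the current prior ratio.
            weighted = [
                target_prior[k] / source_prior[k] * p[k] for k in range(n_classes)
            ]
            norm = sum(weighted)
            for k in range(n_classes):
                totals[k] += weighted[k] / norm
        # M-step: the new prior is the mean adjusted posterior.
        target_prior = [t / len(posteriors) for t in totals]
    return target_prior


# Toy setting (assumed): binary feature x with p(x=0|y=0)=0.9, p(x=0|y=1)=0.2,
# a balanced source prior, and a target sample drawn with priors (0.2, 0.8),
# which gives 34 examples with x=0 and 66 with x=1 out of 100.
posterior_x0 = [0.9 / 1.1, 0.2 / 1.1]  # source p(y | x=0)
posterior_x1 = [0.1 / 0.9, 0.8 / 0.9]  # source p(y | x=1)
target_posteriors = [posterior_x0] * 34 + [posterior_x1] * 66

estimated_prior = em_label_shift(target_posteriors, source_prior=[0.5, 0.5])
print(estimated_prior)
```

No retraining is involved: only the classifier's outputs on unlabeled target data are reweighted, which is why the abstract describes the approach as efficient.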