
Showing papers by "Anshul Kundaje" published in 2019


Journal ArticleDOI
TL;DR: The ENCODE blacklist is defined: a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments, independent of cell line or experiment.
Abstract: Functional genomics assays based on high-throughput sequencing greatly expand our ability to understand the genome. Here, we define the ENCODE blacklist: a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments, independent of cell line or experiment. The removal of the ENCODE blacklist is an essential quality measure when analyzing functional genomics data.
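To illustrate this quality step, the sketch below drops peaks that overlap any blacklisted region. It is a minimal example assuming plain three-column BED files and placeholder file names; it is not the pipeline used by the authors.

```python
# Minimal sketch: drop peaks that overlap any blacklisted region.
# Assumes plain BED files (chrom, start, end); file names are placeholders.
from collections import defaultdict

def read_bed(path):
    intervals = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track")):
                continue
            chrom, start, end = line.split()[:3]
            intervals[chrom].append((int(start), int(end)))
    for chrom in intervals:
        intervals[chrom].sort()
    return intervals

def overlaps(start, end, sorted_ivs):
    # Linear scan is fine for a sketch; an interval tree would scale better.
    return any(s < end and start < e for s, e in sorted_ivs)

blacklist = read_bed("hg38-blacklist.bed")   # placeholder path
with open("peaks.bed") as fh, open("peaks.filtered.bed", "w") as out:
    for line in fh:
        if not line.strip():
            continue
        chrom, start, end = line.split()[:3]
        if not overlaps(int(start), int(end), blacklist.get(chrom, [])):
            out.write(line)
```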

850 citations


Journal ArticleDOI
TL;DR: Properties of TWAS are explored as a potential approach to prioritize causal genes at GWAS loci, using simulations and case studies of literature-curated candidate causal genes for schizophrenia, low-density-lipoprotein cholesterol, and Crohn's disease.
Abstract: Transcriptome-wide association studies (TWAS) integrate genome-wide association studies (GWAS) and gene expression datasets to identify gene-trait associations. In this Perspective, we explore properties of TWAS as a potential approach to prioritize causal genes at GWAS loci, by using simulations and case studies of literature-curated candidate causal genes for schizophrenia, low-density-lipoprotein cholesterol and Crohn's disease. We explore risk loci where TWAS accurately prioritizes the likely causal gene as well as loci where TWAS prioritizes multiple genes, some likely to be non-causal, owing to sharing of expression quantitative trait loci (eQTL). TWAS is especially prone to spurious prioritization with expression data from non-trait-related tissues or cell types, owing to substantial cross-cell-type variation in expression levels and eQTL strengths. Nonetheless, TWAS prioritizes candidate causal genes more accurately than simple baselines. We suggest best practices for causal-gene prioritization with TWAS and discuss future opportunities for improvement. Our results showcase the strengths and limitations of using eQTL datasets to determine causal genes at GWAS loci.

504 citations


Journal ArticleDOI
TL;DR: Genome-wide association analyses based on whole-genome sequencing and imputation identify 40 new risk variants for colorectal cancer, including a strongly protective low-frequency variant at CHD1 and loci implicating signaling and immune function in disease etiology.
Abstract: To further dissect the genetic architecture of colorectal cancer (CRC), we performed whole-genome sequencing of 1,439 cases and 720 controls, imputed discovered sequence variants and Haplotype Reference Consortium panel variants into genome-wide association study data, and tested for association in 34,869 cases and 29,051 controls. Findings were followed up in an additional 23,262 cases and 38,296 controls. We discovered a strongly protective 0.3% frequency variant signal at CHD1. In a combined meta-analysis of 125,478 individuals, we identified 40 new independent signals at P < 5 × 10−8, bringing the number of known independent signals for CRC to ~100. New signals implicate lower-frequency variants, Krüppel-like factors, Hedgehog signaling, Hippo-YAP signaling, long noncoding RNAs and somatic drivers, and support a role for immune function. Heritability analyses suggest that CRC risk is highly polygenic, and larger, more comprehensive studies enabling rare variant analysis will improve understanding of biology underlying this risk and influence personalized screening strategies and drug development.

324 citations


Journal ArticleDOI
TL;DR: This resource identifies chromatin and transcriptional states that are characteristic of young tissues, which could be leveraged to restore aspects of youthful functionality to old tissues.
Abstract: Aging is accompanied by the functional decline of tissues. However, a systematic study of epigenomic and transcriptomic changes across tissues during aging is missing. Here, we generated chromatin maps and transcriptomes from four tissues and one cell type from young, middle-aged, and old mice, yielding 143 high-quality data sets. We focused on chromatin marks linked to gene expression regulation and cell identity: histone H3 trimethylation at lysine 4 (H3K4me3), a mark enriched at promoters, and histone H3 acetylation at lysine 27 (H3K27ac), a mark enriched at active enhancers. Epigenomic and transcriptomic landscapes could easily distinguish between ages, and machine-learning analysis showed that specific epigenomic states could predict transcriptional changes during aging. Analysis of data sets from all tissues identified recurrent age-related chromatin and transcriptional changes in key processes, including the up-regulation of immune system response pathways such as the interferon response. The up-regulation of the interferon response pathway with age was accompanied by increased transcription and chromatin remodeling at specific endogenous retroviral sequences. Pathways misregulated during mouse aging across tissues, notably innate immune pathways, were also misregulated with aging in other vertebrate species (African turquoise killifish, rat, and humans), indicating common signatures of age across species. To date, our data set represents the largest multitissue epigenomic and transcriptomic data set for vertebrate aging. This resource identifies chromatin and transcriptional states that are characteristic of young tissues, which could be leveraged to restore aspects of youthful functionality to old tissues.

192 citations


Journal ArticleDOI
TL;DR: A 3D model of breast cancer shows that a stiff extracellular matrix promotes a tumorigenic phenotype through broad changes in chromatin accessibility and in the activity of histone deacetylases and the transcription factor Sp1, and reveals that chromatin state is a critical mediator of mechanotransduction.
Abstract: In breast cancer, the increased stiffness of the extracellular matrix is a key driver of malignancy. Yet little is known about the epigenomic changes that underlie the tumorigenic impact of extracellular matrix mechanics. Here, we show in a three-dimensional culture model of breast cancer that stiff extracellular matrix induces a tumorigenic phenotype through changes in chromatin state. We found that increased stiffness yielded cells with more wrinkled nuclei and with increased lamina-associated chromatin, that cells cultured in stiff matrices displayed more accessible chromatin sites, which exhibited footprints of Sp1 binding, and that this transcription factor acts along with the histone deacetylases 3 and 8 to regulate the induction of stiffness-mediated tumorigenicity. Just as cell culture on or in soft environments, rather than on tissue-culture plastic, better recapitulates the acinar morphology observed in mammary epithelium in vivo, mammary epithelial cells cultured on or in soft microenvironments also more closely replicate the in vivo chromatin state. Our results emphasize the importance of culture conditions for epigenomic studies, and reveal that chromatin state is a critical mediator of mechanotransduction. In a 3D model of breast cancer, a stiff extracellular matrix promotes a tumorigenic phenotype through broad changes in chromatin accessibility and in the activity of histone deacetylases and the transcription factor Sp1.

120 citations


Journal ArticleDOI
TL;DR: It is argued here that it is a quirk of history that both MRT and gene editing have come to the forefront of public attention at roughly the same time, and that, to best protect citizens from harm, limited regulatory pathways that can be monitored and carefully delineated are preferable to shadowy practices and a potential regulatory race to the bottom.
Abstract: We argue here that it is a quirk of history that both MRT and gene editing have come to the forefront of public attention at roughly the same time. The early start on MRT in the United Kingdom enabled that country to successfully develop quite different regulatory policy approaches to the two technologies5; in contrast, the fear of germline gene editing in the United States and Canada has frozen the policy conversation on MRT. We should not let fear drive use of a sledgehammer for regulation when a scalpel will better enable us to divide the good from the bad. Although realistic about the barriers to change, we have outlined possible ways forward for both the United States and Canada that would enable progress on MRT, or possibly some limited germline gene editing, without opening the floodgate. We argue that this path, and not outright prohibition, is the best way forward because citizens deserve to benefit from the advancement of science and its applications. Moreover, in our globalized world, national prohibitions cannot fully achieve their goals. As the travel of patients to Mexico for MRT performed by US doctors demonstrates (as do other examples)27,28, patients who desperately wish to access certain interventions will travel abroad to get them. Unless countries such as the United States and Canada are willing to limit the entry of children born through these technologies—were it even possible, and we are skeptical—and extend their criminal jurisdiction extraterritorially to prevent the use of these technologies, the reality is that some citizens of each country will bring germline alterations back into the country. Our view is that, to best protect citizens from harm, limited regulatory pathways that can be monitored and carefully delineated are preferable to shadowy practices and a potential regulatory race to the bottom.

119 citations


Journal ArticleDOI
TL;DR: This work assesses reproducibility and quality measures by varying sequencing depth, resolution, and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as in 176 simulated matrices, to identify low-quality experiments.
Abstract: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments. In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundajelab/3DChromatin_ReplicateQC to facilitate adoption in the community.
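One of the established quality measures mentioned above, the ratio of intra- to interchromosomal interactions, can be computed directly from a contact list. The sketch below is a minimal illustration assuming a whitespace-delimited pairs-like table (chrom1, pos1, chrom2, pos2) with a placeholder file name; it is not part of 3DChromatin_ReplicateQC.

```python
# Minimal sketch: intra- vs interchromosomal contact ratio as a crude Hi-C QC metric.
# Assumes a whitespace-delimited contact list with columns: chrom1 pos1 chrom2 pos2.
def intra_inter_ratio(pairs_path):
    intra = inter = 0
    with open(pairs_path) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            chrom1, _, chrom2, _ = line.split()[:4]
            if chrom1 == chrom2:
                intra += 1
            else:
                inter += 1
    return intra / inter if inter else float("inf")

# Example usage (placeholder file name): an unusually low ratio can flag a noisy library.
# print(intra_inter_ratio("sample_contacts.pairs"))
```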

105 citations


Journal ArticleDOI
TL;DR: The authors find that Cas9 nuclease and CRISPRi/a each have distinct off-target effects, and that these can be accurately identified and removed using the GuideScan sgRNA specificity score.
Abstract: Pooled CRISPR-Cas9 screens are a powerful method for functionally characterizing regulatory elements in the non-coding genome, but off-target effects in these experiments have not been systematically evaluated. Here, we investigate Cas9, dCas9, and CRISPRi/a off-target activity in screens for essential regulatory elements. The single guide RNAs (sgRNAs) with the largest effects in genome-scale screens for essential CTCF loop anchors in K562 cells were not those that disrupted gene expression near the on-target CTCF anchor. Rather, these sgRNAs had high off-target activity that, while only weakly correlated with absolute off-target site number, could be predicted by the recently developed GuideScan specificity score. Screens conducted in parallel with CRISPRi/a, which do not induce double-stranded DNA breaks, revealed that a distinct set of off-targets also cause strong confounding fitness effects with these epigenome-editing tools. Promisingly, filtering of CRISPRi libraries using GuideScan specificity scores removed these confounded sgRNAs and enabled identification of essential regulatory elements.
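Library filtering of this kind amounts to a simple thresholding step. The sketch below assumes a CSV with an illustrative `guidescan_specificity` column and a placeholder cutoff of 0.2; the file name, column name, and threshold are assumptions, not values from the paper.

```python
# Minimal sketch: drop sgRNAs with low GuideScan specificity scores before analysis.
# Column names, file names, and the 0.2 cutoff are illustrative placeholders.
import pandas as pd

MIN_SPECIFICITY = 0.2  # placeholder threshold

guides = pd.read_csv("sgrna_library.csv")          # expects a "guidescan_specificity" column
kept = guides[guides["guidescan_specificity"] >= MIN_SPECIFICITY]
print(f"kept {len(kept)} of {len(guides)} sgRNAs after specificity filtering")
kept.to_csv("sgrna_library.filtered.csv", index=False)
```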

87 citations


Journal ArticleDOI
17 Jun 2019-PLOS ONE
TL;DR: The results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
Abstract: The relationship between noncoding DNA sequence and gene expression is not well-understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
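The model-versus-measurement concordance (Spearman ρ) and the replicate concordance it is compared against can be computed the same way. Below is a minimal sketch on made-up arrays; the variable names and data are illustrative stand-ins for real MPRA activities and model predictions.

```python
# Minimal sketch: Spearman correlation between predictions and MPRA measurements,
# and between two biological replicates as a concordance ceiling.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
measured_rep1 = rng.normal(size=1000)                 # stand-ins for real MPRA activities
measured_rep2 = measured_rep1 + rng.normal(scale=1.0, size=1000)
predictions = measured_rep1 + rng.normal(scale=2.0, size=1000)

rho_model, _ = spearmanr(predictions, measured_rep1)
rho_reps, _ = spearmanr(measured_rep1, measured_rep2)
print(f"model vs. measured: {rho_model:.2f}; replicate concordance: {rho_reps:.2f}")
```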

63 citations


Posted ContentDOI
03 Nov 2019-bioRxiv
TL;DR: This almanac recapitulates diverse pathways and protein complexes and predicts the functions of 102 uncharacterized genes and establishes co-essentiality profiling as a powerful resource for biological pathway identification and discovery of novel gene functions.
Abstract: A central remaining question in the post-genomic era is how genes interact to form biological pathways. Measurements of gene dependency across hundreds of cell lines have been used to cluster genes into ‘co-essential’ pathways, but this approach has been limited by ubiquitous false positives. Here, we develop a statistical method that enables robust identification of gene co-essentiality and yields a genome-wide set of functional modules. This almanac recapitulates diverse pathways and protein complexes and predicts the functions of 102 uncharacterized genes. Validating top predictions, we show that TMEM189 encodes plasmanylethanolamine desaturase, the long-sought key enzyme for plasmalogen synthesis. We also show that C15orf57 binds the AP2 complex, localizes to clathrin-coated pits, and enables efficient transferrin uptake. Finally, we provide an interactive web tool for the community to explore the results (coessentiality.net). Our results establish co-essentiality profiling as a powerful resource for biological pathway identification and discovery of novel gene functions.
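As a rough illustration of the starting point for co-essentiality profiling (not the authors' statistical method, which additionally addresses the false-positive problem described above), one can correlate gene dependency profiles across cell lines. The sketch below uses a random genes-by-cell-lines matrix as a stand-in.

```python
# Minimal sketch: correlate gene dependency profiles across cell lines to find
# candidate "co-essential" gene pairs. The matrix here is random stand-in data.
import numpy as np

rng = np.random.default_rng(1)
genes = [f"gene_{i}" for i in range(200)]
dependency = rng.normal(size=(len(genes), 500))       # genes x cell lines

corr = np.corrcoef(dependency)                        # correlation between gene profiles
i, j = np.triu_indices(len(genes), k=1)
order = np.argsort(-corr[i, j])
for a, b in zip(i[order[:5]], j[order[:5]]):
    print(genes[a], genes[b], round(corr[a, b], 3))   # top-correlated pairs
```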

60 citations


Journal ArticleDOI
TL;DR: It is shown that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor, and novel strategies for training models are employed to enhance genome-wide prediction of shared and context-specific chromatin-accessible sites across cell types.
Abstract: Motivation Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types. Results We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin accessible sites across cell types. We interpret the models to reveal insights into cis- and trans-regulation of chromatin dynamics across 123 diverse cellular contexts. Availability and implementation The code is available at https://github.com/kundajelab/ChromDragoNN. Supplementary information Supplementary data are available at Bioinformatics online.
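The "average accessibility across training contexts" observation corresponds to a very simple baseline predictor. The sketch below illustrates it on a simulated binary regions-by-contexts matrix; it is not the ChromDragoNN model, and all data are made up.

```python
# Minimal sketch: mean accessibility across training contexts as a baseline predictor
# for a held-out context, scored with area under the ROC curve.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n_regions, n_train_contexts = 5000, 100
region_propensity = rng.uniform(size=n_regions)              # stand-in "openness" per region
train = (rng.uniform(size=(n_regions, n_train_contexts)) < region_propensity[:, None]).astype(int)
held_out = (rng.uniform(size=n_regions) < region_propensity).astype(int)

baseline_score = train.mean(axis=1)                          # average accessibility per region
print("baseline AUROC:", round(roc_auc_score(held_out, baseline_score), 3))
```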

Journal ArticleDOI
TL;DR: GkmExplain is proposed: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients and consistently outperforms deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines.
Abstract: Summary Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines. Availability and implementation Code and example notebooks to reproduce results are at https://github.com/kundajelab/gkmexplain. Supplementary information Supplementary data are available at Bioinformatics online.
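One of the baseline interpretation methods discussed here, in-silico mutagenesis (ISM), is simple enough to sketch: mutate each base in turn and record the change in the model's score. The example below assumes an arbitrary scoring function (`score_fn` is a placeholder) and is not GkmExplain itself.

```python
# Minimal sketch of in-silico mutagenesis (ISM) for an arbitrary sequence-scoring model.
# `score_fn` is a placeholder for any model (e.g., a trained gkm-SVM's decision function).
BASES = "ACGT"

def ism_scores(sequence, score_fn):
    ref = score_fn(sequence)
    effects = []
    for i, ref_base in enumerate(sequence):
        row = {}
        for alt in BASES:
            if alt == ref_base:
                row[alt] = 0.0
                continue
            mutant = sequence[:i] + alt + sequence[i + 1:]
            row[alt] = score_fn(mutant) - ref      # change in score caused by the mutation
        effects.append(row)
    return effects

# Toy scoring function: counts occurrences of a "motif" (illustrative only).
toy_score = lambda seq: seq.count("GATA")
print(ism_scores("AAGATAAA", toy_score)[2])
```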

Posted ContentDOI
21 Aug 2019-bioRxiv
TL;DR: A deep learning model is trained that uses DNA sequence to predict base-resolution binding profiles of four pluripotency transcription factors (Oct4, Sox2, Nanog, and Klf4); interpretation of the model shows that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences motif interactions at protein and nucleosome range.
Abstract: Genes are regulated through enhancer sequences, in which transcription factor binding motifs and their specific arrangements (syntax) form a cis-regulatory code. To understand the relationship between motif syntax and transcription factor binding, we train a deep learning model that uses DNA sequence to predict base-resolution binding profiles of four pluripotency transcription factors Oct4, Sox2, Nanog, and Klf4. We interpret the model to accurately map hundreds of thousands of motifs in the genome, learn novel motif representations and identify rules by which motifs and syntax influence transcription factor binding. We find that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences motif interactions at protein and nucleosome range. Most strikingly, Nanog binding is driven by motifs with a strong preference for ∼10.5 bp spacings corresponding to helical periodicity. Interpreting deep learning models applied to high-resolution binding data is a powerful and versatile approach to uncover the motifs and syntax of cis-regulatory sequences.
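The ~10.5 bp spacing preference described above is the kind of pattern a simple histogram of pairwise distances between mapped motif instances would expose. The sketch below assumes motif start coordinates grouped by region (the coordinates shown are placeholders) and only illustrates the spacing analysis, not the authors' model-interpretation pipeline.

```python
# Minimal sketch: histogram of pairwise spacings between motif instances within regions,
# to look for periodic preferences (e.g., ~10.5 bp helical spacing).
from collections import Counter
from itertools import combinations

# Placeholder input: motif start positions grouped by region.
motif_positions = {
    "region_1": [105, 116, 158],
    "region_2": [12, 22, 33, 96],
}

spacing_counts = Counter()
for positions in motif_positions.values():
    for a, b in combinations(sorted(positions), 2):
        spacing_counts[b - a] += 1

for spacing in sorted(spacing_counts):
    print(spacing, spacing_counts[spacing])
```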

Journal ArticleDOI
TL;DR: In a population-scale cohort, lower BMI was consistently associated with reduced diabetes risk across BMI, family history, and genetic risk categories, suggesting all individuals can substantially reduce their diabetes risk through weight loss.
Abstract: Background: Lifestyle interventions to reduce body mass index (BMI) are critical public health strategies for type 2 diabetes prevention. While weight loss interventions have shown demonstrable ben ...

Posted ContentDOI
11 Apr 2019-bioRxiv
TL;DR: It is shown that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor, and multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts are introduced.
Abstract: Motivation Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks (CNNs) have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types. Results We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin accessible sites across cell types. We interpret the models to reveal insights into cis and trans regulation of chromatin dynamics across 123 diverse cellular contexts. Availability The code is available at https://github.com/kundajelab/ChromDragoNN Contact akundaje@stanford.edu

01 Jan 2019
TL;DR: The authors in this article performed whole-genome sequencing of 1,439 cases and 720 controls, imputed discovered sequence variants and Haplotype Reference Consortium panel variants into genome-wide association study data, and tested for association in 34,869 cases and 29,051 controls.
Abstract: To further dissect the genetic architecture of colorectal cancer (CRC), we performed whole-genome sequencing of 1,439 cases and 720 controls, imputed discovered sequence variants and Haplotype Reference Consortium panel variants into genome-wide association study data, and tested for association in 34,869 cases and 29,051 controls. Findings were followed up in an additional 23,262 cases and 38,296 controls. We discovered a strongly protective 0.3% frequency variant signal at CHD1. In a combined meta-analysis of 125,478 individuals, we identified 40 new independent signals at P < 5 × 10−8, bringing the number of known independent signals for CRC to ~100. New signals implicate lower-frequency variants, Krüppel-like factors, Hedgehog signaling, Hippo-YAP signaling, long noncoding RNAs and somatic drivers, and support a role for immune function. Heritability analyses suggest that CRC risk is highly polygenic, and larger, more comprehensive studies enabling rare variant analysis will improve understanding of biology underlying this risk and influence personalized screening strategies and drug development. Genome-wide association analyses based on whole-genome sequencing and imputation identify 40 new risk variants for colorectal cancer, including a strongly protective low-frequency variant at CHD1 and loci implicating signaling and immune function in disease etiology.


Posted Content
TL;DR: This article proposes bias-corrected calibration for label shift, introduces a principled strategy for estimating source-domain priors that improves robustness to poor calibration, and shows that maximum likelihood with appropriate calibration outperforms BBSL and RLLS.
Abstract: Label shift refers to the phenomenon where the prior class probability p(y) changes between the training and test distributions, while the conditional probability p(x|y) stays fixed. Label shift arises in settings like medical diagnosis, where a classifier trained to predict disease given symptoms must be adapted to scenarios where the baseline prevalence of the disease is different. Given estimates of p(y|x) from a predictive model, Saerens et al. proposed an efficient maximum likelihood algorithm to correct for label shift that does not require model retraining, but a limiting assumption of this algorithm is that p(y|x) is calibrated, which is not true of modern neural networks. Recently, Black Box Shift Learning (BBSL) and Regularized Learning under Label Shifts (RLLS) have emerged as state-of-the-art techniques to cope with label shift when a classifier does not output calibrated probabilities, but both methods require model retraining with importance weights and neither has been benchmarked against maximum likelihood. Here we (1) show that combining maximum likelihood with a type of calibration we call bias-corrected calibration outperforms both BBSL and RLLS across diverse datasets and distribution shifts, (2) prove that the maximum likelihood objective is concave, and (3) introduce a principled strategy for estimating source-domain priors that improves robustness to poor calibration. This work demonstrates that the maximum likelihood with appropriate calibration is a formidable and efficient baseline for label shift adaptation; notebooks reproducing experiments available at this https URL
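The maximum-likelihood correction referred to here, the EM procedure of Saerens et al., is compact enough to sketch. The version below assumes calibrated class-probability predictions on the unlabeled target data and known source-domain class priors; it is an illustrative implementation, not the authors' released notebooks.

```python
# Minimal sketch of EM / maximum-likelihood label-shift correction (Saerens et al. style):
# re-estimate target-domain class priors from calibrated predictions, without retraining.
import numpy as np

def em_label_shift(posteriors, source_priors, n_iter=100, tol=1e-8):
    """posteriors: (n_samples, n_classes) calibrated p(y|x) under the source distribution."""
    target_priors = np.asarray(source_priors, dtype=float).copy()
    for _ in range(n_iter):
        # E-step: re-weight source posteriors by the prior ratio and renormalize per sample.
        adapted = posteriors * (target_priors / source_priors)
        adapted /= adapted.sum(axis=1, keepdims=True)
        # M-step: new priors are the mean adapted posterior.
        new_priors = adapted.mean(axis=0)
        if np.max(np.abs(new_priors - target_priors)) < tol:
            target_priors = new_priors
            break
        target_priors = new_priors
    adapted = posteriors * (target_priors / source_priors)
    adapted /= adapted.sum(axis=1, keepdims=True)
    return target_priors, adapted

# Toy usage with made-up numbers: two classes, source prior 50/50.
posteriors = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]])
priors, adapted = em_label_shift(posteriors, np.array([0.5, 0.5]))
print(np.round(priors, 3))
```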

Journal ArticleDOI
22 Feb 2019-iScience
TL;DR: The results reveal a gradual and dynamic emergence of the adult mtDNA footprinting pattern during embryogenesis of both mammals, suggesting that the structured adult chromatin-like mtDNA organization is gradually formed during mammalian embryogenesis.

Posted Content
21 Jan 2019
TL;DR: Temperature Scaling is extended with class-specific bias parameters, which largely eliminates systematic bias in the calibrated probabilities and allows for effective domain adaptation under label shift, and it is found that EM with Bias-Corrected Temperature Scaling significantly outperforms both EM with Temperature Scaling and the recently proposed Black-Box Shift Estimation.
Abstract: Label shift refers to the phenomenon where the prior class probability p(y) changes between the training and test distributions, while the conditional probability p(x|y) stays fixed. Label shift arises in settings like medical diagnosis, where a classifier trained to predict disease given symptoms must be adapted to scenarios where the baseline prevalence of the disease is different. Given estimates of p(y|x) from a predictive model, Saerens et al. proposed an efficient maximum likelihood algorithm to correct for label shift that does not require model retraining, but a limiting assumption of this algorithm is that p(y|x) is calibrated, which is not true of modern neural networks. Recently, Black Box Shift Learning (BBSL) and Regularized Learning under Label Shifts (RLLS) have emerged as state-of-the-art techniques to cope with label shift when a classifier does not output calibrated probabilities, but both methods require model retraining with importance weights and neither has been benchmarked against maximum likelihood. Here we (1) show that combining maximum likelihood with a type of calibration we call bias-corrected calibration outperforms both BBSL and RLLS across diverse datasets and distribution shifts, (2) prove that the maximum likelihood objective is concave, and (3) introduce a principled strategy for estimating source-domain priors that improves robustness to poor calibration. This work demonstrates that the maximum likelihood with appropriate calibration is a formidable and efficient baseline for label shift adaptation; notebooks reproducing experiments available at this https URL

Posted ContentDOI
18 Jan 2019-bioRxiv
TL;DR: Results show off-target activity can severely limit identification of essential functional motifs by active Cas9, while strictly filtered CRISPRi screens can be reliably used for assaying larger regulatory elements.
Abstract: Pooled CRISPR-Cas9 screens have recently emerged as a powerful method for functionally characterizing regulatory elements in the non-coding genome, but off-target effects in these experiments have not been systematically evaluated. Here, we conducted a genome-scale screen for essential CTCF loop anchors in the K562 leukemia cell line. Surprisingly, the primary drivers of signal in this screen were single guide RNAs (sgRNAs) with low specificity scores. After removing these guides, we found that there were no CTCF loop anchors critical for cell growth. We also observed this effect in an independent screen fine-mapping the core motifs in enhancers of the GATA1 gene. We then conducted screens in parallel with CRISPRi and CRISPRa, which do not induce DNA damage, and found that an unexpected and distinct set of off-targets also caused strong confounding growth effects with these epigenome-editing platforms. Promisingly, strict filtering of CRISPRi libraries using GuideScan specificity scores removed these confounded sgRNAs and allowed for the identification of essential enhancers, which we validated extensively. Together, our results show off-target activity can severely limit identification of essential functional motifs by active Cas9, while strictly filtered CRISPRi screens can be reliably used for assaying larger regulatory elements.

Posted ContentDOI
25 Jul 2019-bioRxiv
TL;DR: A novel generative neural network architecture for targeted DNA sequence editing, the EDA architecture, consisting of an encoder, decoder, and analyzer, is proposed; it significantly improves the predicted SPI1 binding of genomic sequences with a minimal set of edits.
Abstract: Targeted optimization of existing DNA sequences for useful properties has the potential to enable several synthetic biology applications, from modifying DNA to treat genetic disorders to designing regulatory elements that fine-tune context-specific gene expression. Current approaches for targeted genome editing are largely based on prior biological knowledge or ad-hoc rules. Few if any machine learning approaches exist for targeted optimization of regulatory DNA sequences. Here, we propose a novel generative neural network architecture for targeted DNA sequence editing, the EDA architecture, consisting of an encoder, decoder, and analyzer. We showcase the use of EDA to optimize regulatory DNA sequences to bind to the transcription factor SPI1. Compared to other state-of-the-art approaches such as a textual variational autoencoder and rule-based editing, EDA significantly improves the predicted SPI1 binding of genomic sequences with a minimal set of edits. We also use EDA to design regulatory elements with optimized grammars of CREB1 binding sites that can tune reporter expression levels as measured by massively parallel reporter assays (MPRA). We analyze the properties of the binding sites in the edited sequences and find patterns that are consistent with previously reported grammatical rules which tie gene expression to CRE binding site density, spacing and affinity.

Journal ArticleDOI
TL;DR: In this article, the authors used the Fluorescence Ubiquitin Cell Cycle Indicator system adapted into human pluripotent stem cells (hPSCs) and performed RNA sequencing on cell cycle-sorted hPSCs primed and unprimed for differentiation.
Abstract: Understanding the molecular properties of the cell cycle of human pluripotent stem cells (hPSCs) is critical for effectively promoting differentiation. Here, we use the Fluorescence Ubiquitin Cell Cycle Indicator system adapted into hPSCs and perform RNA sequencing on cell cycle sorted hPSCs primed and unprimed for differentiation. Gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner in cells primed for differentiation without altering genes associated with pluripotency. Furthermore, we identify an important role for PI3K signaling in regulating the early transitory states of hPSCs toward differentiation. Stem Cells 2019;37:1151-1157.

Posted Content
TL;DR: This work explores the use of multi-modal neural networks to learn predictive models of gene expression that include cis and trans regulatory components; the models achieve high performance and substantially outperform other state-of-the-art methods such as boosting algorithms that use pre-defined cis-regulatory features.
Abstract: Deciphering gene regulatory networks is a central problem in computational biology. Here, we explore the use of multi-modal neural networks to learn predictive models of gene expression that include cis and trans regulatory components. We learn models of stress response in the budding yeast Saccharomyces cerevisiae. Our models achieve high performance and substantially outperform other state-of-the-art methods such as boosting algorithms that use pre-defined cis-regulatory features. Our model learns several cis and trans regulators including well-known master stress response regulators. We use our models to perform in-silico TF knock-out experiments and demonstrate that in-silico predictions of target gene changes correlate with the results of the corresponding TF knockout microarray experiment.
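The in-silico TF knockout experiments mentioned above can be illustrated independently of the specific model: ablate one regulator's expression in the input and compare predicted target expression before and after. The sketch below assumes a generic prediction function over a TF-expression vector (the toy linear model and all numbers are made up); it is not the authors' network.

```python
# Minimal sketch of an in-silico TF knockout: ablate one regulator's expression in the
# model input and measure the change in predicted target-gene expression.
import numpy as np

def in_silico_knockout(predict_fn, tf_expression, tf_index):
    """predict_fn maps a TF-expression vector to predicted target-gene expression values."""
    baseline = predict_fn(tf_expression)
    knocked = tf_expression.copy()
    knocked[tf_index] = 0.0                      # simulate loss of the regulator
    return predict_fn(knocked) - baseline        # predicted change per target gene

# Toy linear "model" standing in for a trained network (weights are made up).
rng = np.random.default_rng(3)
weights = rng.normal(size=(50, 10))              # 50 target genes, 10 TFs
toy_predict = lambda tf_vec: weights @ tf_vec

delta = in_silico_knockout(toy_predict, rng.normal(size=10), tf_index=0)
print("targets most affected by knocking out TF 0:", np.argsort(-np.abs(delta))[:5])
```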

Journal ArticleDOI
TL;DR: The model explicitly makes use of cell-type-specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types; it is also proved that, when suitably initialized, the model finds the underlying TAD structure with high probability.
Abstract: Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted; whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.


Posted Content
21 Jan 2019
TL;DR: By combining EM with a type of calibration the authors call bias-corrected calibration, they outperform both BBSL and RLLS across diverse datasets and distribution shifts, and they introduce a theoretically principled strategy for estimating source-domain priors that improves robustness to poor calibration.
Abstract: Label shift refers to the phenomenon where the prior class probability p(y) changes between the training and test distributions, while the conditional probability p(x|y) stays fixed. Label shift arises in settings like medical diagnosis, where a classifier trained to predict disease given symptoms must be adapted to scenarios where the baseline prevalence of the disease is different. Given estimates of p(y|x) from a predictive model, Saerens et al. proposed an efficient maximum likelihood algorithm to correct for label shift that does not require model retraining, but a limiting assumption of this algorithm is that p(y|x) is calibrated, which is not true of modern neural networks. Recently, Black Box Shift Learning (BBSL) and Regularized Learning under Label Shifts (RLLS) have emerged as state-of-the-art techniques to cope with label shift when a classifier does not output calibrated probabilities, but both methods require model retraining with importance weights and neither has been benchmarked against maximum likelihood. Here we (1) show that combining maximum likelihood with a type of calibration we call bias-corrected calibration outperforms both BBSL and RLLS across diverse datasets and distribution shifts, (2) prove that the maximum likelihood objective is concave, and (3) introduce a principled strategy for estimating source-domain priors that improves robustness to poor calibration. This work demonstrates that the maximum likelihood with appropriate calibration is a formidable and efficient baseline for label shift adaptation; notebooks reproducing experiments available at this https URL

Posted ContentDOI
13 Nov 2019-bioRxiv
TL;DR: A CRISPR/Cas9-mediated saturation mutagenesis approach is developed to generate comprehensive libraries of point mutations near an editing site and its editing complementary sequence (ECS) at the endogenous genomic locus, providing guidance for designing and screening antisense RNA sequences that form a dsRNA duplex with the target transcript for ADAR-mediated transcriptome engineering.
Abstract: Adenosine-to-inosine (A-to-I) RNA editing catalyzed by ADAR enzymes occurs in double-stranded RNAs (dsRNAs). How the RNA sequence and structure (i.e., the cis-regulation) determine the editing efficiency and specificity is poorly understood, despite a compelling need towards functional understanding of known editing events and transcriptome engineering of desired adenosines. We developed a CRISPR/Cas9-mediated saturation mutagenesis approach to generate comprehensive libraries of point mutations near an editing site and its editing complementary sequence (ECS) at the endogenous genomic locus. We used machine learning to integrate diverse RNA sequence features and computationally predicted structures to model editing levels measured by deep sequencing and identified cis-regulatory features of RNA editing. As proof-of-concept, we applied this integrative approach to three editing substrates. Our models explained over 70% of variation in editing levels. The models indicate that RNA sequence and structure features synergistically determine the editing levels. Our integrative approach can be broadly applied to any editing site towards the goal of deciphering the RNA editing code. It also provides guidance for designing and screening of antisense RNA sequences that form dsRNA duplex with the target transcript for ADAR-mediated transcriptome engineering.

Patent
15 Mar 2019
TL;DR: In this paper, a process to reveal biological attributes from nucleic acids is described, where nucleic acid is used to develop frequency sequence signal maps, construct V-plots, and/or to train computational models.
Abstract: Processes to reveal biological attributes from nucleic acids are provided. In some instances, nucleic acids are used to develop frequency sequence signal maps, construct V-plots, and/or to train computational models. In some instances, trained computational models are used to predict features that reveal biological attributes.

Posted ContentDOI
12 Feb 2019-bioRxiv
TL;DR: An important role for PI3K signaling is identified in regulating the early transitory states of hPSCs towards differentiation, and gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner in cells primed for differentiation without altering genes associated with pluripotency.
Abstract: Understanding the molecular properties of the cell cycle of human pluripotent stem cells (hPSCs) is critical for effectively promoting differentiation. Here, we use the Fluorescence Ubiquitin Cell Cycle Indicator (FUCCI) system adapted into hPSCs and perform RNA-sequencing on cell cycle sorted hPSCs primed and unprimed for differentiation. Gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner in cells primed for differentiation without altering genes associated with pluripotency. Furthermore, we identify an important role for PI3K signaling in regulating the early transitory states of hPSCs towards differentiation.