
Showing papers by "Anshul Kundaje" published in 2016


Journal ArticleDOI
TL;DR: The chromatin accessibility and transcriptional landscapes in 13 human primary blood cell types that span the hematopoietic hierarchy are defined and 'enhancer cytometry' is enabled for enumeration of pure cell types from complex populations.
Abstract: We define the chromatin accessibility and transcriptional landscapes in 13 human primary blood cell types that span the hematopoietic hierarchy. Exploiting the finding that the enhancer landscape better reflects cell identity than mRNA levels, we enable 'enhancer cytometry' for enumeration of pure cell types from complex populations. We identify regulators governing hematopoietic differentiation and further show the lineage ontogeny of genetic elements linked to diverse human diseases. In acute myeloid leukemia (AML), chromatin accessibility uncovers unique regulatory evolution in cancer cells with a progressively increasing mutation burden. Single AML cells exhibit distinctive mixed regulome profiles corresponding to disparate developmental stages. A method to account for this regulatory heterogeneity identified cancer-specific deviations and implicated HOX factors as key regulators of preleukemic hematopoietic stem cell characteristics. Thus, regulome dynamics can provide diverse insights into hematopoietic development and disease.

888 citations


Posted Content
TL;DR: DeepLIFT (Learning Important FeaTures) is an efficient and effective method for computing importance scores in a neural network; it compares the activation of each neuron to its 'reference activation' and assigns contribution scores according to the difference.
Abstract: Note: This paper describes an older version of DeepLIFT. See https://arxiv.org/abs/1704.02685 for the newer version. Original abstract follows: The purported "black box" nature of neural networks is a barrier to adoption in applications where interpretability is essential. Here we present DeepLIFT (Learning Important FeaTures), an efficient and effective method for computing importance scores in a neural network. DeepLIFT compares the activation of each neuron to its 'reference activation' and assigns contribution scores according to the difference. We apply DeepLIFT to models trained on natural images and genomic data, and show significant advantages over gradient-based methods.
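
The difference-from-reference idea described in this abstract can be illustrated with a minimal Python sketch for a single ReLU neuron. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `contribution_scores`, the toy weights, and the all-zero reference input are hypothetical choices made here.

```python
def relu(v):
    """Rectified linear unit."""
    return max(0.0, v)


def contribution_scores(weights, bias, x, x_ref):
    """Score each input of y = relu(w . x + b) by its share of (y - y_ref).

    Hypothetical helper: the neuron's change in output relative to the
    reference is distributed over the inputs in proportion to
    w_i * (x_i - x_ref_i).
    """
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    z_ref = sum(w * ri for w, ri in zip(weights, x_ref)) + bias
    delta_z = z - z_ref              # change in pre-activation
    delta_y = relu(z) - relu(z_ref)  # change in activation
    # Multiplier through the nonlinearity: ratio of output change to input
    # change. If the pre-activation did not change, nothing is attributed.
    m = delta_y / delta_z if abs(delta_z) > 1e-12 else 0.0
    return [m * w * (xi - ri) for w, xi, ri in zip(weights, x, x_ref)]


# Toy example with an all-zero reference input (an assumption of this sketch).
scores = contribution_scores([2.0, -1.0], 0.5, [1.0, 1.0], [0.0, 0.0])
delta_y = relu(2.0 - 1.0 + 0.5) - relu(0.5)
# The scores sum to the difference between activation and reference activation.
assert abs(sum(scores) - delta_y) < 1e-9
```

In this sketch the scores always add up to the neuron's activation minus its reference activation, which is the property that distinguishes this style of attribution from reading off a raw gradient.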

532 citations


Journal ArticleDOI
14 Jul 2016-Cell
TL;DR: The extrinsic signals controlling each binary lineage decision were defined, enabling us to logically block differentiation toward unwanted fates and rapidly steer pluripotent stem cells toward 80%-99% pure human mesodermal lineages at most branchpoints.

337 citations


Journal ArticleDOI
TL;DR: A meta‐analysis of direct FOXO targets across tissues and organisms provides insight into the evolution of the FOXO network and highlights downstream genes and cofactors that may be particularly important for FOXO's conserved function in adult homeostasis and longevity.
Abstract: FOXO transcription factors (FOXOs) are central regulators of lifespan across species, yet they also have cell-specific functions, including adult stem cell homeostasis and immune function. Direct targets of FOXOs have been identified genome-wide in several species and cell types. However, whether FOXO targets are specific to cell types and species or conserved across cell types and throughout evolution remains uncharacterized. Here, we perform a meta-analysis of direct FOXO targets across tissues and organisms, using data from mammals as well as Caenorhabditis elegans and Drosophila. We show that FOXOs bind cell type-specific targets, which have functions related to that particular cell. Interestingly, FOXOs also share targets across different tissues in mammals, and the function and even the identity of these shared mammalian targets are conserved in invertebrates. Evolutionarily conserved targets show enrichment for growth factor signaling, metabolism, stress resistance, and proteostasis, suggesting an ancestral, conserved role in the regulation of these processes. We also identify candidate cofactors at conserved FOXO targets that change in expression with age, including CREB and ETS family factors. This meta-analysis provides insight into the evolution of the FOXO network and highlights downstream genes and cofactors that may be particularly important for FOXO's conserved function in adult homeostasis and longevity.

159 citations


Journal ArticleDOI
TL;DR: By analyzing whole blood transcriptomes of 922 individuals, this study conducts the first large-scale, genome-wide analysis of the impact of both sex and genetic variation on patterns of gene expression, including comparison between the X Chromosome and autosomes.
Abstract: The X Chromosome, with its unique mode of inheritance, contributes to differences between the sexes at a molecular level, including sex-specific gene expression and sex-specific impact of genetic variation. Improving our understanding of these differences offers to elucidate the molecular mechanisms underlying sex-specific traits and diseases. However, to date, most studies have either ignored the X Chromosome or had insufficient power to test for the sex-specific impact of genetic variation. By analyzing whole blood transcriptomes of 922 individuals, we have conducted the first large-scale, genome-wide analysis of the impact of both sex and genetic variation on patterns of gene expression, including comparison between the X Chromosome and autosomes. We identified a depletion of expression quantitative trait loci (eQTL) on the X Chromosome, especially among genes under high selective constraint. In contrast, we discovered an enrichment of sex-specific regulatory variants on the X Chromosome. To resolve the molecular mechanisms underlying such effects, we generated chromatin accessibility data through ATAC-sequencing to connect sex-specific chromatin accessibility to sex-specific patterns of expression and regulatory variation. As sex-specific regulatory variants discovered in our study can inform sex differences in heritable disease prevalence, we integrated our data with genome-wide association study data for multiple immune traits identifying several traits with significant sex biases in genetic susceptibilities. Together, our study provides genome-wide insight into how genetic variation, the X Chromosome, and sex shape human gene regulation and disease.

86 citations


Journal ArticleDOI
TL;DR: It is shown that high expression of the transcription factor ARNTL2 predicts poor lung adenocarcinoma patient outcome and these findings shed light on the molecular mechanisms that enable single cancer cells to form allochthonous tumors in foreign tissue environments.

78 citations


Journal ArticleDOI
TL;DR: This molecular atlas will facilitate study of human mesoderm development (which cannot be interrogated in vivo due to restrictions on human embryo studies) and provides a broad resource for the study of gene regulation in development at the single-cell level, knowledge that might one day be exploited for regenerative medicine.
Abstract: Mesoderm is the developmental precursor to myriad human tissues including bone, heart, and skeletal muscle. Unravelling the molecular events through which these lineages become diversified from one another is integral to developmental biology and understanding changes in cellular fate. To this end, we developed an in vitro system to differentiate human pluripotent stem cells through primitive streak intermediates into paraxial mesoderm and its derivatives (somites, sclerotome, dermomyotome) and separately, into lateral mesoderm and its derivatives (cardiac mesoderm). Whole-population and single-cell analyses of these purified populations of human mesoderm lineages through RNA-seq, ATAC-seq, and high-throughput surface marker screens illustrated how transcriptional changes co-occur with changes in open chromatin and surface marker landscapes throughout human mesoderm development. This molecular atlas will facilitate study of human mesoderm development (which cannot be interrogated in vivo due to restrictions on human embryo studies) and provides a broad resource for the study of gene regulation in development at the single-cell level, knowledge that might one day be exploited for regenerative medicine.

51 citations


Posted ContentDOI
20 Dec 2016-bioRxiv
TL;DR: The Umap software for identifying uniquely mappable regions of any genome is introduced and its Bismap extension identifies mappability of the bisulfite-converted genome.
Abstract: MOTIVATION: Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding, and chemical modifications. Every region in a genome assembly has a property called mappability which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. At best, sequencing assays will produce misleadingly low numbers of reads in these regions. At worst, these regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. While many tools consider mappability during the read mapping process, subsequent analysis often loses this information. Both to correct assumptions of uniformity in downstream analysis, and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. RESULTS: We introduce the Umap software for efficiently identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. With a read length of 24 bp, 15.5% of the unmodified genome and 30% of the bisulfite-converted genome is not uniquely mappable. This complicates interpretation of functional genomics experiments using short-read sequencing, especially in regulatory regions. For example, 42% of human CpG islands overlap with regions that are not uniquely mappable. Similarly, in some ENCODE ChIP-seq datasets, up to 30% of peaks overlap with regions that are not uniquely mappable. 
We also explored differentially methylated regions from a case-control study and identified regions that were not uniquely mappable. In the widely used 450K methylation array, 962 probes are not uniquely mappable. Genome mappability is higher with longer sequencing reads, but most publicly available ChIP-seq and reduced representation bisulfite sequencing datasets have shorter reads. Therefore, uneven and low mappability remains a concern in a majority of existing data. AVAILABILITY: A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at http://bismap.hoffmanlab.org for use with the UCSC and Ensembl genome browsers. We have deposited in Zenodo the current version of our software (http://doi.org/10.5281/zenodo.60940) and the mappability data used in this project (http://doi.org/10.5281/zenodo.60943). In addition, the software (https://bitbucket.org/hoffmanlab/umap) is freely available under the GNU General Public License, version 3 (GPLv3).
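
The notion of unique mappability in this abstract can be illustrated with a toy Python sketch. It is not the Umap or Bismap implementation: the function names are hypothetical, only the forward strand is considered, exact k-mer matching stands in for read alignment, and bisulfite conversion is approximated as replacing every C with T.

```python
from collections import Counter


def unique_kmer_flags(genome, k):
    """For each k-mer start position, report whether that k-mer occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


def fraction_not_unique(genome, k):
    """Fraction of k-mer start positions whose k-mer is not unique in the sequence."""
    flags = unique_kmer_flags(genome, k)
    return 1 - sum(flags) / len(flags)


def bisulfite_forward(genome):
    """Crude stand-in for bisulfite conversion: every C becomes T (assumption)."""
    return genome.replace("C", "T")


toy = "ACGTATGTCC"  # made-up sequence, for illustration only
before = fraction_not_unique(toy, 4)
after = fraction_not_unique(bisulfite_forward(toy), 4)
# Conversion collapses ACGT and ATGT into the same 4-mer, so ambiguity rises.
assert before < after
```

Reducing the alphabet makes distinct k-mers identical, which is why the converted sequence in this toy example has more positions that cannot be placed uniquely.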

28 citations


Posted ContentDOI
05 Dec 2016-bioRxiv
TL;DR: A deep neural network is developed to predict open chromatin regions from DNA sequence alone and the sequences of segregating haplotypes are used to predict the effects of common SNPs on cell type-specific chromatin accessibility.
Abstract: Induced pluripotent stem cells (iPSCs) are an essential tool for studying cellular differentiation and cell types that are otherwise difficult to access. We investigated the use of iPSCs and iPSC-derived cells to study the impact of genetic variation across different cell types and as models for studies of complex disease. We established a panel of iPSCs from 58 well-studied Yoruba lymphoblastoid cell lines (LCLs); 14 of these lines were further differentiated into cardiomyocytes. We characterized regulatory variation across individuals and cell types by measuring gene expression, chromatin accessibility and DNA methylation. Regulatory variation between individuals is lower in iPSCs than in the differentiated cell types, consistent with the intuition that developmental processes are generally canalized. While most cell type-specific regulatory quantitative trait loci (QTLs) lie in chromatin that is open only in the affected cell types, we found that 20% of cell type-specific QTLs are in shared open chromatin. Finally, we developed a deep neural network to predict open chromatin regions from DNA sequence alone and were able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell type-specific chromatin accessibility.

25 citations


Posted ContentDOI
07 May 2016-bioRxiv
TL;DR: A convolutional denoising algorithm, Coda, is introduced that uses convolutional neural networks to learn a mapping from suboptimal to high-quality histone ChIP-seq data, and has the potential to improve data quality at reduced costs.
Abstract: Chromatin immunoprecipitation sequencing (ChIP-seq) experiments targeting histone modifications are commonly used to characterize the dynamic epigenomes of diverse cell types and tissues. However, suboptimal experimental parameters such as poor ChIP enrichment, low cell input, low library complexity, and low sequencing depth can significantly affect the quality and sensitivity of histone ChIP-seq experiments. We show that a convolutional neural network trained to learn a mapping between suboptimal and high-quality histone ChIP-seq data in reference cell types can overcome various sources of noise and substantially enhance signal when applied to low-quality samples across individuals, cell types, and species. This approach allows us to reduce cost and increase data quality. More broadly, our approach -- using a high-dimensional discriminative model to encode a generative noise process -- is generally applicable to biological problems where it is easy to generate noisy data but difficult to analytically characterize the noise or underlying data distribution.

15 citations



Proceedings Article
01 Jan 2016
TL;DR: This paper proposes an optimization framework to mine useful structures from noisy networks in an unsupervised manner and uses Capture-C-generated partial labels to further denoise the Hi-C network.
Abstract: Complex networks play an important role in a plethora of disciplines in natural sciences. Cleaning up noisy observed networks poses an important challenge in network analysis. Existing methods utilize labeled data to alleviate the noise effect in the network. However, labeled data is usually expensive to collect while unlabeled data can be gathered cheaply. In this paper, we propose an optimization framework to mine useful structures from noisy networks in an unsupervised manner. The key feature of our optimization framework is its ability to utilize local structures as well as global patterns in the network. We extend our method to incorporate multi-resolution networks in order to add further resistance to high levels of noise. We also generalize our framework to utilize partial labels to enhance the performance. We specifically focus our method on multi-resolution Hi-C data by recovering clusters of genomic regions that co-localize in 3D space. Additionally, we use Capture-C-generated partial labels to further denoise the Hi-C network. We empirically demonstrate the effectiveness of our framework in denoising the network and improving community detection results.

Posted ContentDOI
17 Oct 2016-bioRxiv
TL;DR: This protocol describes how to use Segway to annotate the genome, starting with reads from a ChIP-seq experiment, and visualizing the annotation in a genome browser.
Abstract: Biochemical techniques measure many individual properties of chromatin along the genome. These properties include DNA accessibility (measured by DNase-seq) and the presence of individual transcription factors and histone modifications (measured by ChIP-seq). Segway is software that transforms multiple datasets on chromatin properties into a single annotation of the genome that a biologist can more easily interpret. This protocol describes how to use Segway to annotate the genome, starting with reads from a ChIP-seq experiment. It includes pre-processing of data, training the Segway model, annotating the genome, assigning biological meanings to labels, and visualizing the annotation in a genome browser.


Posted ContentDOI
18 May 2016-bioRxiv
TL;DR: Deep sequencing and analysis of nascent mtDNA-encoded RNA transcripts in diverse human cell lines and metazoan organisms paves the path towards in vivo, quantitative, reference sequence-free analysis of mtDNA transcription in all eukaryotes.
Abstract: Mitochondrial DNA (mtDNA) genes are long known to be co-transcribed in polycistrons, yet it remains impossible to study nascent mtDNA transcripts quantitatively in vivo using existing tools. To this end we used deep sequencing (GRO-seq and PRO-seq) and analyzed nascent mtDNA-encoded RNA transcripts in diverse human cell lines and metazoan organisms. Surprisingly, accurate detection of human mtDNA transcription initiation sites (TIS) in the heavy and light strands revealed a novel conserved transcription pausing site near the light strand TIS, upstream of the transcription-replication transition region. This pausing site correlated with the presence of a bacterial pausing sequence motif, yet the transcription pausing index varied quantitatively among the cell lines. Analysis of non-human organisms enabled de novo mtDNA sequence assembly, as well as detection of previously unknown mtDNA TIS, pausing, and transcription termination sites with unprecedented accuracy. Whereas mammals (chimpanzee, rhesus macaque, rat, and mouse) showed a human-like mtDNA transcription pattern, the invertebrate pattern (Drosophila and C. elegans) profoundly diverged. Our approach paves the path towards in vivo, quantitative, reference sequence-free analysis of mtDNA transcription in all eukaryotes.