scispace - formally typeset
Search or ask a question

Showing papers by "Anshul Kundaje published in 2012"


01 Sep 2012
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the authors' genes and genome, and is an expansive resource of functional annotations for biomedical research.

2,767 citations


Journal ArticleDOI
TL;DR: This work discusses how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data and develops a set of working standards and guidelines for ChIP experiments that are updated routinely.
Abstract: Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.

1,801 citations


Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The combinatorial, co-association of transcription factors is found to be highly context specific: distinct combinations of factors bind at specific genomic locations.
Abstract: Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.

1,449 citations


Journal ArticleDOI
TL;DR: An integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones.
Abstract: Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository Factorbook (http://factorbook.org) and will continually update this repository as more ENCODE data are generated.

834 citations


Journal ArticleDOI
TL;DR: It is concluded that with very few exceptions, ribosomes are able to distinguish coding from noncoding transcripts and, hence, that ectopic translation and cryptic mRNAs are rare in the human lncRNAome.
Abstract: In addition to over 20,000 protein-coding genes and known small-RNA, including microRNA host genes, the human genome includes at least 9640 loci transcribed solely into long, non-protein-coding RNAs (long noncoding RNAs; lncRNAs), often with multiple transcript isoforms (Derrien et al. 2012). Of these, only a minority (under 100) have been functionally characterized at an individual level by forward and reverse genetic approaches in organismal and cell culture models. The remainder are known purely via high-throughput discovery and expression analysis. Well-known examples of lncRNAs that have been functionally characterized in-depth include the imprinted Myc target H19 (Gabory et al. 2009), the epigenetic homeobox gene regulator HOTAIR, which promotes cancer metastasis (Gupta et al. 2010), and Xist, the lncRNA that is responsible for inactivation of the mammalian X-chromosome (Jeon and Lee 2011). While these few examples already attest to the diversity of lncRNA functions in chromatin remodeling and imprinting, the diversity of heretofore-uncharacterized lncRNAs hints at numerous additional lncRNA-dependent regulatory mechanisms in mammalian systems. Miat is another example of a recently discovered lncRNA that takes part in a direct network feedback loop with the Pou5f1 pluripotency factor in stem cells (Pou5f1 is also known as Oct4); Miat is both a direct target of and a direct regulator of Pou5f1 (Lipovich et al. 2010; Sheik Mohamed et al. 2010). Hence, lncRNAs can be both regulated by and regulators of key transcription factors. LncRNA genes are transcribed in a diverse range of human tissues and cell lines, and show highly specific spatial and temporal expression profiles, which, in conjunction with detailed molecular characterization of the lncRNAs, attest to numerous distinct functions. These functions include, but are not limited to, epigenetic and post-transcriptional gene expression regulation, sense-antisense interactions with known protein-coding genes, direct binding and regulation of transcription factor proteins, nuclear pore gatekeeping, and enhancer function by transcriptional initiation of lncRNAs that cause chromatin remodeling (Lipovich et al. 2010). Mammalian lncRNAs have epigenetic signatures comparable to those of protein-coding genes, frequently associate with the polycomb repressor complex PRC2 which renders them capable of regulating numerous target genes through histone modifications suppressing gene expression, and mediate global transcriptional programs of cancer transcription factors (Guttman et al. 2009; Khalil et al. 2009; Huarte et al. 2010; Derrien et al. 2012). A particularly intriguing property of mammalian lncRNAs is their lack of evolutionary conservation, relative to protein-coding genes. Primate-specific lncRNAs in the human genome are increasingly well-documented in the literature (for a review citing multiple pertinent recent reports, see Lipovich et al. 2010). Previously, Tay et al. (2009) screened the human genome for primate-specific single-copy genomic sequences, uncovering 131 primate-specific transcriptional units supported by transcriptome data. The brain-derived neurotrophic factor (BDNF) gene, a key contributor to synaptic plasticity, learning, memory, and multiple neurological diseases, is overlapped by a cis-encoded primate-specific lncRNA (Pruunsild et al. 2007). Most recently, Derrien et al. (2012) found that ∼30% of human lncRNA transcripts in GENCODE, many of which are expressed in the brain, are primate specific. The resulting relevance of lncRNAs to species-specific phenotypes, including primate and human uniqueness, highlights the importance of using empirical methodologies to document whether lncRNAs are actually non-protein-coding. The majority of definitively known lncRNAs have been annotated using empirical evidence such as cDNA and EST alignments to genome assemblies (Carninci et al. 2005; Katayama et al. 2005; Affymetrix/Cold Spring Harbor Laboratory ENCODE Transcriptome Project 2009). Yet, despite the attention that they have received, the noncoding status of most lncRNA genes and transcripts has been established mostly through computational means including: examining the size of open reading frames (ORFs), assessing conservation of ORFs that are shorter than known proteins, and looking for conserved translation initiation and termination codons. However, a recent flurry of literature suggests that there may exist a class of bifunctional RNAs encoding both mRNAs and functional noncoding transcripts: Indeed, there is direct evidence for rare members of this transcript class in human, mouse, and fly (Hube et al. 2006; Kondo et al. 2010; Dinger et al. 2011; Ingolia et al. 2011; Ulveling et al. 2011). Hence, identifying the fraction of ostensibly noncoding RNAs that may encode polypeptides is a compelling and open question. In this report, we utilize empirical evidence to estimate, in two ENCODE cell lines, the fraction of annotated lncRNAs that may encode, and therefore possibly function through, polypeptides. As part of the Encyclopedia of DNA Elements (ENCODE) project, matched-sample long polyA+ and polyA− RNA-seq data were produced, along with tandem mass spectrometry (MS/MS) data for cellular proteins, for the Tier-1 “ENCODE-prioritized” human cell lines K562 and GM12878. The RNA-seq data provides measures of relative gene expression in various cellular compartments (Djebali et al. 2012); for both GM12878 and K562, nucleus, cytosol, and whole-cell samples were used to sequence both polyA+ and polyA− RNA populations. These data have been used to obtain measures of transcript abundance for all genes in GENCODE v7 annotation (the annotation generated for the ENCODE Consortium), based on ENCODE and other data (Harrow et al. 2012). The mass spec data were produced via a “shotgun” approach, wherein cells were cultured, subcellular fractionation performed, followed by protein separations, tryptic digestion, and MS/MS analysis. The resulting spectra were mapped directly to a 6-frame translation of the entire hg19 assembly to produce a “proteogenomic track” within the UCSC Genome Browser (Kent 2002; Karolchik et al. 2009), and were also mapped against the GENCODE gene annotation set (J Khatun, Y Yu, J Wrobel, BA Risk, HP Gunawardena, A Secrest, WJ Spitzer, L Xie, L Wang, X Chen, et al., in prep.). Integrative analysis of RNA and proteomics data has been explored in the literature and is examined in another ENCODE paper, highlighting translation of novel splice variants and expressed pseudogenes (Tian et al. 2004, Djebali et al. 2012). However, these data have not yet been applied to examine the empirical evidence for or against translation of computationally classified human long noncoding RNAs. A recent joint study of RNA and proteomic data in mouse revealed that protein levels and mRNA levels correlate such that RNA concentration is predictive of at least 40% of the variation in protein levels (Schwanhausser et al. 2011). Since lncRNA genes are expressed, on average, at 4% of the level of protein-coding genes in the ENCODE cell lines (Derrien et al. 2012), we expect a similarly low level of expression for any putative protein(s) translated from lncRNAs. Therefore, to interrogate the translational competence of lncRNAs, we must account for the relative expression levels of these transcripts. It has been shown that the quantity of detectable matches between MS/MS spectra and their corresponding peptides in a transcript correlate to protein abundance levels (Lu et al. 2007). This means that the number of detected peptide matches is an approximate surrogate for protein abundance (Liu et al. 2004; Vogel and Marcotte 2008). We used this characteristic to determine a calibration function that links mRNA expression abundance and protein expression abundance for the ENCODE data from K562 and GM12878. In our analysis, 21% of GENCODE v7 protein-coding genes are represented by at least one uniquely mapping peptide in any MS/MS sample, and the majority of those genes detected are expressed above 5 RPKMs in the whole-cell RNA-seq data (Harrow et al. 2012). We used these data, applying state-of-the-art machine-learning models to estimate the translational competence of transcripts as a function of RNA expression levels in various cellular compartments and RNA fractions. Using these models, we “regressed out” the expression-level effects to compare the translation competency of ostensibly noncoding transcripts to that of known mRNAs. We then manually examined each lncRNA for which we obtained empirical evidence of coding capacity. From these data, we determined the proportion of lncRNAs that appear to be truly “noncoding” in ENCODE Tier 1 cell lines, and we examined the exceptional cases where there was strong evidence of protein translation to determine whether these are indeed translated lncRNAs or simply misannotated mRNAs.

383 citations


Journal ArticleDOI
TL;DR: This study builds a novel quantitative model and finds that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy, and that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq.
Abstract: Background: Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. Results: We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. Conclusions: Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.

270 citations


Journal ArticleDOI
TL;DR: Three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity, and the machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.
Abstract: Background Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.

264 citations


Journal ArticleDOI
TL;DR: A new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks is developed, and extensive heterogeneity in how histone modifications are deposited and how nucleosomes are positioned around binding sites is revealed.
Abstract: Gene regulation at functional elements (e.g., enhancers, promoters, insulators) is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding. To enhance our understanding of gene regulation, the ENCODE Consortium has generated a wealth of ChIP-seq data on DNA-binding proteins and histone modifications. We additionally generated nucleosome positioning data on two cell lines, K562 and GM12878, by MNase digestion and high-depth sequencing. Here we relate 14 chromatin signals (12 histone marks, DNase, and nucleosome positioning) to the binding sites of 119 DNA-binding proteins across a large number of cell lines. We developed a new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks. We applied CAGT on a total of 5084 data set pairs to obtain an exhaustive catalog of high-resolution patterns of histone modifications and nucleosome positioning signals around bound transcription factors. Our analyses reveal extensive heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned around binding sites. With the exception of the CTCF/cohesin complex, asymmetry of nucleosome positioning is predominant. Asymmetry of histone modifications is also widespread, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. The fine-resolution signal shapes discovered by CAGT unveiled novel correlation patterns between chromatin marks, nucleosome positioning, and sequence content. Meta-analyses of the signal profiles revealed a common vocabulary of chromatin signals shared across multiple cell lines and binding proteins.

192 citations