Deconvolving sequence features that discriminate between overlapping regulatory annotations

09 May 2017-bioRxiv (Cold Spring Harbor Laboratory)-pp 100511
TL;DR: SeqUnwinder is a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. It is demonstrated on TF binding dynamics during motor neuron programming, on multi-condition and cell-specific TF binding sites, and on cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines.
Abstract: Genomic loci with regulatory potential can be identified and annotated with various properties. For example, genomic sites may be annotated as being bound by a given transcription factor (TF) in one or more cell types. The same sites may be further labeled as being proximal or distal to known promoters. Given such a collection of labeled sites, it is natural to ask what sequence features are associated with each annotation label. However, discovering such label-specific sequence features is often confounded by overlaps between annotation labels; e.g. if regulatory sites specific to a given cell type are also more likely to be promoter-proximal, it is difficult to assess whether motifs identified in that set of sites are associated with the cell type or associated with promoters. In order to meet this challenge, we developed SeqUnwinder, a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly, we show SeqUnwinder's ability to unravel sequence features associated with the dynamic binding behavior of TFs during motor neuron programming from features associated with chromatin state in the initial embryonic stem cells. Secondly, we characterize distinct sequence properties of multi-condition and cell-specific TF binding sites after controlling for uneven associations with promoter proximity. Finally, we demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines. Availability: https://github.com/seqcode/sequnwinder

Summary

Introduction

  • Many regulatory genomics analyses focus on finding DNA sequence features that are characteristic of a biological property.
  • Almost all existing discriminative motif-finders assume that the class labels are mutually exclusive, and therefore cannot appropriately handle scenarios such as that outlined in Fig 1A.
  • No existing discriminative feature discovery method is applicable to multi-label classification scenarios where a set of genomic sequences contains several annotation labels with arbitrary rates of overlap between them.

SeqUnwinder overview

  • The intuition behind SeqUnwinder is that sequence features associated with a particular annotation label should be similarly enriched across all subclasses spanned by the label (regardless of how the subclasses have been defined).
  • In other words, while the k-mer weight parameters for each subclass are learned directly from the data, the weight parameters for the labels are learned exclusively through the regularization constraint.
  • The label- or subclass-specific k-mer model is scanned across the original genomic sites to identify focused regions (which the authors term "hills") that contain discriminative sequence signals (Fig 1C).
  • The labels can come from any source, enabling a high degree of analysis flexibility.
  • SeqUnwinder implements a multi-threaded version of the ADMM [12] framework to train the model and typically runs in less than a few hours for most datasets.
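
The relationship between labels and subclasses described above can be sketched in a few lines of Python. This is an illustrative sketch only: the site names, the labels, and the helpers `group_by_subclass` and `subclasses_spanned` are hypothetical and are not part of the SeqUnwinder software. Each site's exact combination of labels is treated as its subclass, and a label spans every subclass that contains it.

```python
from collections import defaultdict

def group_by_subclass(site_labels):
    """Group sites by their exact combination of labels (the subclass)."""
    subclasses = defaultdict(list)
    for site, labels in site_labels.items():
        subclasses[frozenset(labels)].append(site)
    return dict(subclasses)

def subclasses_spanned(subclasses, label):
    """Return every subclass (label combination) that contains the label."""
    return [combo for combo in subclasses if label in combo]

# Hypothetical annotated sites: cell-type labels (A, B) and
# promoter-proximity labels (Pr, Di).
sites = {
    "site1": {"A", "Pr"},
    "site2": {"A", "Di"},
    "site3": {"B", "Di"},
    "site4": {"A", "Pr"},
}
subclasses = group_by_subclass(sites)
spanned_by_A = subclasses_spanned(subclasses, "A")
```

In this toy input, label A spans two subclasses ({A, Pr} and {A, Di}), so a feature attributed to A would be expected to be enriched in both.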

SeqUnwinder deconvolves sequence features associated with overlapping labels

  • At 70% of the sequences associated with each label, the authors inserted appropriate motif instances by sampling from the distributions defined by the position-specific scoring matrices of label-assigned motifs (Fig 2A).
  • SeqUnwinder and the MCC model correctly identify motifs similar to all inserted motifs (Fig 2B).
  • Since DREME takes only two classes as input (a foreground set and a background set), the authors ran four separate DREME runs, one for each of the four labels.
  • The authors used these measures to calculate the F1 score (harmonic mean of precision and recall) at different overlapping levels (Fig 2D).
  • SeqUnwinder performs better than the other approaches in the intermediate range of label overlaps, and accurately characterizes label-specific sequence features even when the simulated labels overlap at 90% of sites.
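
For reference, the F1 score mentioned above is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name `f1_score` and the example counts are illustrative assumptions, not values from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_pos == 0:
        # No correct calls: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts of correct and incorrect motif-label associations.
example = f1_score(true_pos=8, false_pos=2, false_neg=2)
```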

SeqUnwinder uncovers co-factor driven TF binding dynamics during iMN programming

  • To demonstrate its unique abilities in a real analysis problem, the authors use SeqUnwinder to study TF binding during induced motor neuron (iMN) programming.
  • Annotating Isl1/Lhx3 sites using both sets of labels (Isl1/Lhx3 binding dynamics and ES activity) results in six different subclasses.
  • All methods discover similar sets of motifs.
  • SeqUnwinder, in contrast, makes much cleaner associations; the Oct4 motif is only associated with the "early" label, and the Onecut motif is only associated with the "late" label, suggesting that these motifs are not merely coincidental features due to the ES activity status of the binding sites.

Multi-condition TF binding sites are characterized by stronger cognate motif instances

  • The sequence properties of tissue-specific TF binding sites have been extensively studied [4, 5, 16].
  • Studies of individual TFs suggest that binding affinity to cognate motif instances may play a role in distinguishing multi-condition binding sites from tissue-specific sites [15, 17].
  • The authors applied SeqUnwinder to each labeled sequence collection in order to characterize label-specific sequence features (see S2 Table for cross-validation classification performance values).
  • The authors illustrate the process with SeqUnwinder's results for YY1.
  • From Fig 5C, it is clear that distal K562- and GM12878-specific sites lacking a cognate motif instance have higher collective degrees.

SeqUnwinder identifies sequence features at shared and cell-specific DHS in six different ENCODE cell-lines

  • Finally, the authors aim to demonstrate the utility of SeqUnwinder in identifying sequence features at large numbers of genomic loci annotated with several labels.
  • Indeed, these strict category definitions may introduce sequence composition biases into each category.
  • This result is consistent with previous findings suggesting relatively invariant CTCF binding across cellular contexts [25, 26].
  • ETS factors have been shown to directly convert human fibroblasts to endothelial cells [34].
  • Interestingly, some of the motifs associated with cell-type specific DHS sites were also found in their analyses of cell-type specific TF binding sites above (Fig 5B).

Discussion

  • Classification models have shown great potential in identifying sequence features at defined genomic sites.
  • To systematically address this, the authors developed SeqUnwinder.
  • Therefore, the motifs that the authors previously assigned to early or late TF binding behaviors could have been merely associated with ES-active and ES-inactive sites, respectively.
  • Incorporating graphs defined by label similarities [42, 43] may thus be productive in the context of analyses across cell lineages or developmental time-series.
  • SeqUnwinder may also be easily extended to incorporate different kinds of sequence kernels and DNA shape features [35, 36, 44].

SeqUnwinder model

  • The training features for the classifier are based on k-mer frequencies in a fixed window around input loci.
  • The value or range of k is user-definable in the SeqUnwinder software, but all analyses in this work use models based on all 4-mers and 5-mers.
  • When counting frequencies, the authors map each k-mer to the same entry as its reverse complement.
  • The parameters of SeqUnwinder are k-mer weights for each subclass (combination of annotation labels).
  • Briefly, label-specific k-mer weights are encouraged to be similar to k-mer weights in all subclasses the label spans by regularizing on the differences of k-mer weights.
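
The k-mer featurization described above (all 4-mers and 5-mers, with each k-mer sharing an entry with its reverse complement) can be sketched as follows. This is a minimal illustration, not the SeqUnwinder implementation: the function names are hypothetical, and choosing the lexicographically smaller of a k-mer and its reverse complement as the shared key is an assumption made here for convenience.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Shared key for a k-mer and its reverse complement (assumed convention)."""
    return min(kmer, reverse_complement(kmer))

def kmer_features(seq, k_values=(4, 5)):
    """Count k-mers in seq, collapsing each k-mer with its reverse complement."""
    seq = seq.upper()
    counts = {}
    for k in k_values:
        # Initialise every canonical k-mer so all sequences share one feature space.
        for letters in product("ACGT", repeat=k):
            counts.setdefault(canonical("".join(letters)), 0)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
                counts[canonical(kmer)] += 1
    return counts

feats = kmer_features("ACGTA")
```

With this convention there are 136 distinct 4-mer entries and 512 distinct 5-mer entries, so each sequence maps to a 648-dimensional count vector.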

Training the SeqUnwinder model

  • The $w_n$ and $w_p$ update steps separate out and are iteratively updated until convergence.
  • The above equation is solved using the scaled alternating direction method of multipliers (ADMM) framework [12].
  • Briefly, the ADMM framework splits the above problem into 2 smaller sub-problems, which are much easier to solve.
  • Here, $\epsilon^{abs}$ and $\epsilon^{rel}$ are the absolute and relative tolerances, respectively.
  • Intuitively, the $w_n^{t+1}$ update step is distributed across multiple threads by splitting the M training examples into smaller subsets.
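
The training equations themselves are not reproduced in this summary, so the following is only a sketch of the kind of objective the text describes: a multi-class logistic regression loss over subclasses plus an L1 penalty on the differences between each subclass's k-mer weights and the weights of the labels it belongs to. The function `objective`, its argument names, and the exact form of the penalty are assumptions; the published objective may contain additional terms. It is assumed here that the subclass weights correspond to $w_n$ and the label weights to $w_p$.

```python
import math

def objective(X, y, w_sub, w_lab, parents, lam):
    """Softmax negative log-likelihood over subclasses plus an L1 penalty
    on (subclass weights - parent label weights). Assumed form, for illustration."""
    nll = 0.0
    for x, cls in zip(X, y):
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in w_sub]
        top = max(scores)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        nll -= scores[cls] - log_z
    penalty = 0.0
    for sub, labels in parents.items():
        for label in labels:
            penalty += sum(abs(a - b) for a, b in zip(w_sub[sub], w_lab[label]))
    return nll + lam * penalty

# Toy problem: two subclasses that both carry a hypothetical label "A".
X = [[1.0, 0.0], [0.0, 1.0]]
y = [0, 1]
parents = {0: ["A"], 1: ["A"]}
base = objective(X, y, [[0.0, 0.0], [0.0, 0.0]], {"A": [0.0, 0.0]}, parents, lam=0.1)
```

Because the label weights enter only through the penalty term, they are pulled toward whatever is shared by the subclasses they span, which matches the description that label weights are learned through the regularization constraint.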

Converting weighted k-mer models into interpretable sequence features

  • While SeqUnwinder models label-specific sequence features using high-dimensional k-mer weight vectors, it is often desirable to visualize these sequence features in terms of a collection of interpretable position-specific scoring matrices.
  • Specifically, the authors first scan the k-mer models learned during the training process across fixed-sized sequence windows around the input genomic loci to identify local high-scoring regions called "hills".
  • Next, the authors cluster the hills using K-means clustering with Euclidean distance metric and k-mer counts as features.
  • Note that the heatmaps in each figure which display these label-specific discriminative scores have been generated with a shared color scheme; i.e., the maximum shade of yellow is defined to correspond to a model-specific score of +0.4, while the maximum shade of blue is set to a score of -0.4.
  • Focused motif searches in the hills thus can find motifs that are longer than the longest k-mers in the underlying SeqUnwinder model.

Generation of synthetic datasets

  • To test SeqUnwinder in simulated settings, the authors generated various synthetic datasets.
  • The sizes of simulated datasets (6,000-9,000 sequences) were chosen to roughly reflect the number of peaks in a typical ChIP-seq dataset.
  • The exact choice of order of the background Markov model (i.e. 2nd-order versus a higher order) is arbitrary, but should not be expected to affect the relative performances of the methods in correctly associating embedded motifs with correct labels.
  • Next, the authors randomly assigned labels to the simulated sequences at different frequencies.
  • Each motif instance was sampled from the probability density function defined by the PWM of the motif.
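
Sampling a motif instance from a PWM and embedding it in a background sequence can be sketched as below. The helper names, the dictionary-based PWM representation, and the choice to overwrite background bases (keeping sequence length fixed) are assumptions made for illustration; generation of the Markov background itself is not shown.

```python
import random

def sample_motif_instance(pwm, rng):
    """Draw one base per position from the PWM's column probabilities."""
    return "".join(
        rng.choices("ACGT", weights=[col[b] for b in "ACGT"])[0] for col in pwm
    )

def insert_motif(seq, instance, rng):
    """Overwrite a random position of seq with the motif instance."""
    pos = rng.randrange(len(seq) - len(instance) + 1)
    return seq[:pos] + instance + seq[pos + len(instance):]

def embed_motifs(seqs, pwm, rate, rng):
    """Embed a sampled instance in roughly `rate` of the sequences (e.g. 0.7)."""
    return [
        insert_motif(s, sample_motif_instance(pwm, rng), rng)
        if rng.random() < rate else s
        for s in seqs
    ]

rng = random.Random(0)
# Hypothetical two-column PWM that always emits "AC".
toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
embedded = embed_motifs(["TTTTTT"] * 5, toy_pwm, rate=1.0, rng=rng)
```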

Processing iMN programming data-sets

  • Defining early, shared and late binding labels.
  • Isl1/Lhx3 binding sites called in both the 12h and 48h datasets, with a further filter of not being differentially bound (q-value cutoff of <0.01), were assigned as "shared" sites.
  • A union list of 1 million 500bp regions comprising the enriched domains (see below) of DNase I, H3K4me2, H3K4me1, H3K27ac, and H3K4me3 was used as the positive set for training the classifier.
  • Weka's implementation of Random Forests was used to train the classifier (https://github.

Processing ENCODE datasets

  • The binding sites for the factors were profiled using MultiGPS [15].
  • Binding sites labeled as cell-type specific sites were required to have significantly higher ChIP enrichment compared to other cell-lines.
  • Further, contiguous blocks within 200bp were stitched together to call enriched domains.
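
Stitching nearby blocks into enriched domains amounts to merging intervals whose gap does not exceed a cutoff. The sketch below assumes half-open (start, end) coordinates on a single chromosome and interprets "within 200bp" as a gap of at most 200bp; the function name and these conventions are illustrative assumptions.

```python
def stitch_blocks(blocks, max_gap=200):
    """Merge (start, end) blocks separated by at most max_gap bases."""
    merged = []
    for start, end in sorted(blocks):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous domain: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical blocks: the first two are 150bp apart, the third is 300bp away.
domains = stitch_blocks([(600, 700), (0, 100), (250, 300)])
```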

Annotation of de novo identified motifs

  • All de novo motifs identified using SeqUnwinder were annotated using the cis-bp database.
  • Briefly, de novo motifs were matched against the cis-bp database using STAMP [49].
  • The best matching hit with a p-value of less than 10e-6 was used to name the de novo identified motifs.

Availability and reproducibility

  • SeqUnwinder is freely available under the MIT open source license from: https://github.com/seqcode/sequnwinder.
  • Complete output files produced by the SeqUnwinder runs described in this work, along with scripts and data for reproducing all analysis figures, are available from: https://github.com/ikaka89/sequnwinderPaper.

RESEARCH ARTICLE
Deconvolving sequence features that discriminate between overlapping regulatory annotations
Akshay Kakumanu1, Silvia Velasco2, Esteban Mazzoni2, Shaun Mahony1*
1 Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
2 Department of Biology, New York University, 100 Washington Square East, New York, NY, United States of America
* mahony@psu.edu
Abstract
Genomic loci with regulatory potential can be annotated with various properties. For example, genomic sites bound by a given transcription factor (TF) can be divided according to whether they are proximal or distal to known promoters. Sites can be further labeled according to the cell types and conditions in which they are active. Given such a collection of labeled sites, it is natural to ask what sequence features are associated with each annotation label. However, discovering such label-specific sequence features is often confounded by overlaps between the labels; e.g. if regulatory sites specific to a given cell type are also more likely to be promoter-proximal, it is difficult to assess whether motifs identified in that set of sites are associated with the cell type or associated with promoters. In order to meet this challenge, we developed SeqUnwinder, a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly, SeqUnwinder is able to unravel sequence features associated with the dynamic binding behavior of TFs during motor neuron programming from features associated with chromatin state in the initial embryonic stem cells. Secondly, we characterize distinct sequence properties of multi-condition and cell-specific TF binding sites after controlling for uneven associations with promoter proximity. Finally, we demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines.
Author summary
Transcription factor proteins control gene expression by recognizing and interacting with short DNA sequence patterns in regulatory regions on the genome. Current genomics experiments allow us to find regulatory regions associated with a particular biochemical activity over the entire genome; for example, all regions where a particular transcription factor interacts with the genome in a given cell type. Given a collection of regulatory regions, we often aim to discover short DNA sequence patterns that are more common in the collection than in other regions. Performing such “DNA motif-finding” analysis can give us hints about the patterns that determine gene regulation in the analyzed cell type. Here we describe a new method for DNA motif-finding called SeqUnwinder. Our approach analyzes collections of regulatory regions where each has been labeled according to various biological properties. For example, the labels could correspond to various cell types in which the regulatory region is active. SeqUnwinder then performs machine-learning analysis to unravel DNA sequence features that are characteristic of each label (e.g. features that distinguish regulatory regions in each cell type from other cell types). SeqUnwinder is the first method to enable analysis of regulatory region collections that contain several overlapping labels.
PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005795 | October 19, 2017
OPEN ACCESS
Citation: Kakumanu A, Velasco S, Mazzoni E, Mahony S (2017) Deconvolving sequence features that discriminate between overlapping regulatory annotations. PLoS Comput Biol 13(10): e1005795. https://doi.org/10.1371/journal.pcbi.1005795
Editor: Ilya Ioshikhes, Ottawa University, CANADA
Received: May 9, 2017
Accepted: September 26, 2017
Published: October 19, 2017
Copyright: © 2017 Kakumanu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: Software code available from https://github.com/seqcode/sequnwinder. Complete output files produced by the SeqUnwinder runs described in this manuscript, along with scripts and data for reproducing all analysis figures, are available from: https://github.com/ikaka89/sequnwinderPaper. Experimental data are available from GEO archive under accession GSE80321.
Funding: This work was supported by National Institutes of Health grant R01HD079682 (to EOM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Many regulatory genomics analyses focus on finding DNA sequence features that are characteristic of a biological property. Given a set of sequences that are bound by a particular transcription factor (TF), for example, we typically aim to discover short, degenerate DNA patterns that may represent the DNA binding preferences of the TF itself, the binding preferences of coincident TFs, or general properties of the regions that make them favorable for binding.
The de novo DNA motif-finding problem is typically cast in the context of two mutually exclusive sequence sets. Most popular motif-finding methods use unsupervised machine-learning approaches to discover motifs in ‘foreground’ input sequences that are over-represented with respect to a set of ‘background’ sequences (e.g. “bound” vs. “unbound”, respectively) [1,2]. Several other methods explicitly solve a two-class classification problem, where the goal is to find sequence features that discriminate between two mutually exclusive class labels [3–6].
Current characterizations of regulatory sites move beyond binary labels such as “bound” and “unbound”. For example, in a given cell type, each regulatory element could be labeled as bound or unbound by each of several TFs and enriched or depleted for several chromatin states [7–9]. As we add more regulatory class labels, it becomes difficult to define mutually exclusive sets of sequences that are representative of each label. Relatedly, our analyses may become confounded by uneven degrees of overlap between the class labels, leading to incorrect associations between sequence features and regulatory activities. Therefore, a simple recasting of discriminative motif-finding as a multi-class classification problem (where classes are required to be mutually exclusive) is not always appropriate.
As an example, consider the hypothetical scenario presented in Fig 1A. In this example, a given TF’s binding sites have been profiled in cell types A, B, and C. Thus, each TF binding event can be labeled as specific to a cell type or common to all or a subset. Let’s assume that after further labeling the sites as being proximal or distal to promoters (Pr and Di, respectively), we find that the TF’s binding sites in cell A are more likely to be promoter proximal than sites in other cell types. Promoter regions have sequence features that are distinct from distal regions (e.g. the presence of core promoter elements and distinct GC-content patterns). Therefore, if we search for sequence features that are discriminative of cell A’s sites without accounting for the uneven overlaps with other labels, it is likely that some discovered features will actually be generic properties of proximal regions. Such results could in turn affect our conclusions regarding the biological mechanisms of TF binding in cell A. To resolve DNA features associated with each cell type’s label from those associated with confounding labels (e.g. promoter proximity), we need motif-finders that are able to analyze multiple labels in parallel.
Almost all existing discriminative motif-finders assume that the class labels are mutually exclusive, and therefore cannot appropriately handle scenarios such as that outlined in Fig 1A. For example, the multi-class discriminative sequence feature frameworks proposed by Tavazoie and colleagues [3,10,11] are limited to analysis of mutually exclusive classes. A few existing methods do allow a limited analysis of datasets where annotation labels partially overlap, but these approaches were designed for two-class classification problems where the multi-task framework enables modeling of the “common” task in addition to the two classes. For example, Arvey, et al. [4] used a multi-task SVM classifier to learn sequence features associated with cell type-specific TF binding across two cell types, along with features shared by TF binding sites in both cell types. The group lasso based logistic regression classifier SeqGL [5] also implements a similar multi-task framework to identify features that are discriminative between two classes and features that are common to both. No existing discriminative feature discovery method is applicable to multi-label classification scenarios where a set of genomic sequences contains several annotation labels with arbitrary rates of overlap between them.
In this work, we present SeqUnwinder, a hierarchical classification framework for characterizing interpretable sequence features associated with overlapping sets of genomic annotation labels. We demonstrate the unique analysis abilities of SeqUnwinder using both synthetic sequence datasets and collections of real TF ChIP-seq and DNase-seq experiments. In each demonstration, SeqUnwinder cleanly associates interpretable sequence features with various cell- or condition-specific annotation labels, while simultaneously removing the effects of confounding signals. SeqUnwinder scales effectively to large collections of genomic loci that have been annotated with several overlapping labels, and is thus designed to deal with the complexity of modern data sets.
Fig 1. Overview of SeqUnwinder, which takes an input list of annotated genomic sites and identifies label-specific discriminative motifs. (A) Schematic showing a typical input instance for SeqUnwinder: a list of genomic coordinates and corresponding annotation labels. (B) The underlying classification framework implemented in SeqUnwinder. Subclasses (combination of annotation labels) are treated as different classes in a multi-class classification framework. The label-specific properties are implicitly modeled using L1-regularization. (C) Weighted k-mer models are used to identify 10-15bp focus regions called hills. MEME is used to identify motifs at hills. (D) De novo identified motifs in (C) are scored using the weighted k-mer model to obtain label-specific scores.
https://doi.org/10.1371/journal.pcbi.1005795.g001
Results
SeqUnwinder overview
The intuition behind SeqUnwinder is that sequence features associated with a particular annotation label should be similarly enriched across all subclasses spanned by the label (regardless of how the subclasses have been defined). SeqUnwinder’s analysis begins by defining genomic site subclasses based on the combinations of labels annotated at these sites (Fig 1B). The site subclasses are treated as distinct classes for a multi-class logistic regression model that uses k-mer frequencies as predictors. At the same time, k-mer models are also learned for each label by incorporating them in an L1 regularization term (see Methods). In other words, while the k-mer weight parameters for each subclass are learned directly from the data, the weight parameters for the labels are learned exclusively through the regularization constraint. The regularization encourages each label’s model to take the form of the features that are consistently enriched across the subclasses spanned by that label (Fig 1B). The trained classifier encapsulates weighted k-mer models specific to each label and each subclass (i.e. combination of labels). The label- or subclass-specific k-mer model is scanned across the original genomic sites to identify focused regions (which we term “hills”) that contain discriminative sequence signals (Fig 1C). Finally, to aid interpretability, SeqUnwinder identifies over-represented motifs in the hills and scores them using label- and subclass-specific k-mer models (Fig 1D).
SeqUnwinder is easy to use, taking as input a list of DNA sequences or genomic coordinates that are each annotated with a set of user-defined labels. The labels can come from any source, enabling a high degree of analysis flexibility. SeqUnwinder implements a multi-threaded version of the ADMM [12] framework to train the model and typically runs in less than a few hours for most datasets. Output includes both k-mer models and position-specific scoring matrices and weights associating these motifs with each subclass and label.
SeqUnwinder deconvolves sequence features associated with overlapping labels
To demonstrate the properties of SeqUnwinder, we simulated 9,000 regulatory regions and annotated each of them with labels from two overlapping sets: A, B, C and X, Y (Fig 2A). We assigned a different motif to each label. At 70% of the sequences associated with each label, we inserted appropriate motif instances by sampling from the distributions defined by the position-specific scoring matrices of label-assigned motifs (Fig 2A). We used this collection of sequences and label assignments to compare SeqUnwinder with a simple multi-class classification approach (MCC). In MCC training, each label was treated as a distinct class and therefore each regulatory sequence was included multiple times in accordance with its annotated labels.
SeqUnwinder and the MCC model correctly identify motifs similar to all inserted motifs (Fig 2B). However, the MCC approach makes several incorrect motif-label associations, potentially due to high overlap between labels. In contrast, the label-specific scores of the identified motifs in the SeqUnwinder model are not confounded by overlap between annotation labels. For example, even though labels X and A highly overlap, SeqUnwinder correctly assigns each motif to its respective label.
Next, we assessed the performance of SeqUnwinder at different levels of label overlaps. We simulated 100 datasets with 6000 simulated sequences, varying the degree of overlap between two sets of labels ({A, B} and {X, Y}) from 50% to 99% (Fig 2C). We then compared SeqUnwinder with MCC and DREME [1], a popular discriminative motif discovery tool. Since DREME takes only two classes as input (a foreground set and a background set), we ran four separate DREME runs, one for each of the four labels. We calculated the true positive (discovered motif
Fig 2. Performance of SeqUnwinder on simulated datasets. (A) 9000 simulated genomic sites with corresponding motif associations. (B) Label-specific scores for all de novo motifs identified using MCC (left) and SeqUnwinder (right) models on simulated genomic sites in “A”. For consistency across figures, we fix the color saturation values to -0.4 and 0.4. (C) Schematic showing 100 genomic datasets with 6000 genomic sites and varying degrees of label overlap ranging from 0.5 to 0.99. (D) Performance of MCC (multi-class logistic classifier), DREME, and SeqUnwinder on simulated datasets in “C”, measured using the F1-score, (E) true positive rates, and (F) false positive rates.
https://doi.org/10.1371/journal.pcbi.1005795.g002
Citations
More filters
01 Feb 2015
TL;DR: In this article, the authors describe the integrative analysis of 111 reference human epigenomes generated as part of the NIH Roadmap Epigenomics Consortium, profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression.
Abstract: The reference human genome sequence set the stage for studies of genetic variation and its association with human disease, but epigenomic studies lack a similar reference. To address this need, the NIH Roadmap Epigenomics Consortium generated the largest collection so far of human epigenomes for primary cells and tissues. Here we describe the integrative analysis of 111 reference human epigenomes generated as part of the programme, profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression. We establish global maps of regulatory elements, define regulatory modules of coordinated activity, and their likely activators and repressors. We show that disease- and trait-associated genetic variants are enriched in tissue-specific epigenomic marks, revealing biologically relevant cell types for diverse human traits, and providing a resource for interpreting the molecular basis of human disease. Our results demonstrate the central role of epigenomic information for understanding gene regulation, cellular differentiation and human disease.

4,409 citations

01 Feb 2012
TL;DR: ChromHMM is developed, an automated computational system for learning chromatin states, characterizing their biological functions and correlations with large-scale functional datasets, and visualizing the resulting genome-wide maps of chromatin state annotations.
Abstract: Chromatin state annotation using combinations of chromatin modification patterns has emerged as a powerful approach for discovering regulatory regions and their cell type specific activity patterns, and for interpreting disease-association studies1-5. However, the computational challenge of learning chromatin state models from large numbers of chromatin modification datasets in multiple cell types still requires extensive bioinformatics expertise making it inaccessible to the wider scientific community. To address this challenge, we have developed ChromHMM, an automated computational system for learning chromatin states, characterizing their biological functions and correlations with large-scale functional datasets, and visualizing the resulting genome-wide maps of chromatin state annotations.

365 citations

References
More filters
Journal ArticleDOI
TL;DR: This work presents a general framework for detecting regulatory DNA and RNA motifs that relies on directly assessing the mutual information between sequence and gene expression measurements, and provides a versatile motif discovery framework with exceptional sensitivity and near-zero false-positive rates.

327 citations


"Deconvolving sequence features that..." refers background in this paper

  • ...Multi-class discriminative sequence feature frameworks have been proposed (Beer & Tavazoie, 2004; Elemento et al, 2007), but these approaches have also been limited to analysis of mutually exclusive classes....

    [...]

Journal ArticleDOI
TL;DR: The authors used an integrative approach to study estrogen receptor α (ER) and found that ER exhibits two distinct modes of binding: shared sites, bound in multiple cell types, are characterized by high-affinity estrogen response elements (EREs), inaccessible chromatin, and a lack of DNA methylation, while cell-specific sites are characterized by a lack of EREs, co-occurrence with other transcription factors, and cell-type-specific chromatin accessibility and DNA methylation.

285 citations


"Deconvolving sequence features that..." refers background or result in this paper

  • ...Studies of individual TFs suggest that binding affinity to cognate motif instances may play a role in distinguishing multi-condition binding sites from tissue-specific sites (Gertz et al, 2013; Mahony et al, 2014)....

    [...]

  • ...Our results support the model that high affinity cognate motif instances are a striking feature of multi-conditionally bound sites across a broad range of TFs (Gertz et al, 2013; Mahony et al, 2014)....

    [...]

Journal ArticleDOI
TL;DR: Three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity, and the machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.
Abstract: Background Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.

264 citations


"Deconvolving sequence features that..." refers background or result in this paper

  • ...…a particular TF in multiple cell types (i.e. “shared” or multi-condition sites) are often strongly biased towards being located in gene promoter regions, in contrast to cell-specific binding sites, which are typically distally located (Yip et al, 2012; Wang et al, 2012; Kheradpour & Kellis, 2014)....

    [...]

  • ...Similar findings were previously identified at the “high occupancy of transcription-related factors (HOT)” regions (Yip et al, 2012)....

    [...]

Journal ArticleDOI
03 Feb 2012-Cell
TL;DR: It is demonstrated that the five genetic components essential for cardiac specification in Drosophila, including the effectors of Wg and Dpp signaling, act as a collective unit to cooperatively regulate heart enhancer activity, both in vivo and in vitro.

262 citations


"Deconvolving sequence features that..." refers result in this paper

  • ...These results are consistent with the “TF collective” model proposed by Junion and colleagues (Junion et al, 2012)....

    [...]

Journal ArticleDOI
TL;DR: A support vector machine (SVM) framework which can accurately identify EP300-bound enhancers using only genomic sequence and an unbiased set of general sequence features is developed and indicates that some features operate in a general or tissue-independent manner.
Abstract: Enhancers are gene regulatory sequences that can control transcriptional activities at a distance, independent of their position and orientation with respect to affected genes (Banerji 1981). Enhancer activity is modulated by interactions between sequence specific DNA binding proteins and sequence elements in the enhancer. Since individual transcription factor binding sites (TFBSs) can be relatively short and degenerate, TFBSs tend to be clustered to achieve precise temporal and developmental specificity (Kadonaga 2004). Factors bound to these sequences often interact with common coactivators, which, in turn, recruit the basal transcription machinery (Blackwood and Kadonaga 1998; Carter et al. 2002). Identifying the sequence elements and the combinatorial rules that determine enhancer function is necessary to fully understand how enhancers direct the spatial and temporal regulation of gene expression. Experimentally identified enhancers with similar functions can be a good starting point for in-depth study of the underlying rules encoded in the regulatory DNA sequence. However, the systematic functional identification of such enhancers has been limited due to the fact that they are often distant from the genes they regulate, requiring the interrogation of large amounts of potential regulatory sequence. Most investigations make use of two complementary approaches to detect putative regulatory regions: comparative genomics, which identifies enhancers by their sequence conservation across related species; and functional genomics, which identifies enhancers by the common binding of transcriptionally associated factors or marks (for review, see Noonan and McCallion 2010). Comparative genomics is based on the generally accepted hypothesis that functionally important regulatory sequences are under purifying selection. As a result, conserved noncoding sequences (CNSs) are natural candidates for putative enhancers. 
Early studies used CNSs to detect putative enhancers and test their activity in zebrafish or mouse reporter assays (Woolfe et al. 2004; Pennacchio et al. 2006; Visel et al. 2008). Although these conservation-based approaches achieve some success, limitations also exist. The function and spatio-temporal specificity of CNSs cannot be determined by conservation alone and, therefore, requires additional experimentation. More importantly, several studies have shown that noncoding sequences that apparently lack conservation (as assessed by sequence alignment) may still contain functional regulatory elements (Fisher et al. 2006; ENCODE Project Consortium 2007; McGaughey et al. 2008). Functional genomics is an experimentally driven approach that utilizes recently developed techniques of microarray hybridization or massively parallel sequencing in combination with chromatin immunoprecipitation (ChIP) on specific transcription factors (Johnson et al. 2007; Robertson et al. 2007), chromatin signatures (Heintzman et al. 2007, 2009), or coactivators (Visel et al. 2009; Kim et al. 2010). Specifically, some chromatin signatures or coactivator association (such as monomethylation of lysine 4 of histone H3, acetylation of lysine 27 of histone H3, and binding by coactivators EP300/CREBBP) are predictive markers of enhancer activity (Heintzman et al. 2007, 2009). The transcriptional coactivators EP300 (also known as P300) and CREBBP (also known as CBP) have proven to be useful for enhancer identification because of their general roles as cofactors in mammalian transcription. 
Through highly conserved protein-protein interactions, EP300/CREBBP are hypothesized to operate as coactivators in at least three ways: as a direct bridge between sequence-specific transcription factors (TFs) and RNA Polymerase II, as an indirect bridge between sequence specific TFs and other coactivators which recruit RNA Pol II, or by modifying chromatin structure via intrinsic acetyl-transferase activity (Chan and La Thangue 2001). Several studies have reported genome-wide mapping of EP300/CREBBP-bound enhancers in different contexts, for example, tissue-specific activity in dissected mouse tissue (Visel et al. 2009) and environment-dependent activity in neurons (Kim et al. 2010). Visel et al. validated that 90% of the EP300 enhancers tested recapitulated the expected spatial and temporal activity in vivo in a transgenic mouse enhancer assay. Functionally identified EP300-bound regions thus provide a robust starting point for further investigation of enhancers and their sequence properties. In principle, a complete understanding of enhancer mechanism would include a description of specific internal sequence features and how they contribute to enhancer function. Previous studies that have attempted to predict enhancers from sequence have typically used sequence conservation, colocalization of previously characterized TFBSs [from databases such as TRANSFAC (Matys et al. 2003) or JASPAR (Bryne et al. 2008)], or a combination of the two. Many of these existing approaches were assessed by Su et al. (2010), who found that some were successful in identifying enhancers in Drosophila but that few generalized to mammalian systems. The most successful method in mammalian enhancer prediction used a combination of conservation and low-order Markov models of sequence features (Elnitski et al. 2003; King et al. 2005). 
In more recent work, Leung and Eisen (2009) used word frequency profile similarity between pairs of sequences to detect novel enhancers, but training on small numbers of enhancers can be susceptible to noise. Another notable recent computational approach uses combinations of known TFBSs and de novo position weight matrices (PWMs) to detect enhancers (Narlikar et al. 2010). In this paper, we present a discriminative computational framework to detect enhancers from DNA sequence alone that does not rely on conservation or known TF binding specificities. We use a support vector machine (SVM) to differentiate enhancers from nonfunctional regions, using DNA sequence elements as features. SVMs (Boser et al. 1992; Vapnik 1995) have been successfully applied in many biological contexts (for review, see Scholkopf et al. 2004; Ben-Hur et al. 2008): cancer tissue classification (Furey et al. 2000); protein domain classification (Karchin et al. 2002; Leslie et al. 2002, 2004); splice site prediction (Ratsch et al. 2005; Sonnenburg et al. 2007); and nucleosome positioning (Peckham et al. 2007). In our case, because of the potentially diverse mechanisms which direct EP300 and CREBBP binding, we use a complete set of DNA sequence features to capture combinations of binding sites active in different tissues and times of development. To study these distinct modes of regulation, we investigate EP300/CREBBP binding in mouse embryos (Visel et al. 2009), activated cultured neurons (Kim et al. 2010), and embryonic stem (ES) cells (Chen et al. 2008). Our analysis will initially focus on Visel's data set, where several thousands of EP300-bound DNA elements were collected by ChIP-seq in dissected mouse embryo forebrain, midbrain, and limb. We evaluate our method by predicting enhancers vs. random sequence and between EP300/CREBBP ChIP-seq data sets. These comparisons reveal a diversity of predictive sequence features, both within and across data sets. 
Supplemental Table S1 provides an outline of the analyses performed in this paper. We show that sequence features in the experimentally identified enhancer set are sufficient to accurately discriminate enhancers from random genomic regions. We also show that the most predictive sequence elements are related to biologically relevant transcription factor binding sites. Notably, our method also finds that some sequence elements are significantly absent in the enhancers (those with large negative SVM weights). For example, we find that binding sites for the zinc finger E-box binding homeobox (ZEB) transcription factor family are depleted in the forebrain enhancers, consistent with its biological role as a transcriptional repressor (Vandewalle et al. 2008). In addition, we provide evidence that enriched sequence elements are positionally constrained within the enhancers and that they are more evolutionarily conserved than less predictive elements in the enhancers, reflecting the combinatorial structure of tissue-specific enhancers. We further apply our SVM method to predict putative enhancers in both the mouse genome and the human genome from DNA sequence alone. Many of these novel enhancers overlap with regions enriched in EP300 ChIP-seq reads, exhibit greatly increased hypersensitivity to DNase I in the mouse brain, and are proximal to biologically relevant genes. All of these assessments exclude the original EP300 training set enhancers from the analysis. The successful identification of tissue-specific DNase I hypersensitive sites provides powerful independent evidence for the validity of our approach.

233 citations


"Deconvolving sequence features that..." refers background or methods in this paper

  • ...For example, Lee et al (2011) trained an SVM classifier to discriminate putative enhancers from random sequences using an unbiased set of k-mers as predictors....

    [...]

  • ...…convolutional neural networks and support vector machines (SVMs) with various k-mer based sequence kernels (Arvey et al, 2012; Ghandi et al, 2014; Lee et al, 2011), these applications typically focus on discriminating between two mutually exclusive classes (Bailey, 2011; Alipanahi et al, 2015)....

    [...]

Frequently Asked Questions (16)
Q1. What contributions have the authors mentioned in the paper "Deconvolving sequence features that discriminate between overlapping regulatory annotations"?

The authors demonstrate the novel analysis abilities of SeqUnwinder using three examples. Finally, the authors demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines. 

To calculate collective degree, the authors used a total of 158, 102, and 202 ChIP-seq datasets in the GM12878, H1-hESC, and K562 cell types, respectively. 

A significant depletion of motif instances at sites annotated by a label compared to other labels can very likely result in non-positive scores. 

Most popular motif-finding methods use unsupervised machine-learning approaches to discover motifs in ‘foreground’ input sequences that are over-represented with respect to a set of ‘background’ sequences (e.g. “bound” vs. “unbound”, respectively) [1,2]. 

In other words, while the k-mer weight parameters for each subclass are learned directly from the data, the weight parameters for the labels are learned exclusively through the regularization constraint. 
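As a rough illustration of how label weights can fall out of a regularization term alone, the sketch below assumes a hypothetical squared-distance penalty between each subclass weight vector and the weight vectors of the labels it carries. The penalty actually used by SeqUnwinder may differ; the function name `label_weights` and its data layout are invented for this example. Under that assumption, and with subclass weights held fixed, the penalty is minimized when each label weight is the mean of its subclasses' weights.

```python
def label_weights(subclass_weights, subclass_labels):
    """Minimise the sum of squared distances between subclass and label weights.

    subclass_weights: dict subclass -> list of k-mer weights (held fixed here).
    subclass_labels:  dict subclass -> iterable of labels the subclass carries.
    Assumption: squared L2 penalty, so the minimiser is a per-label mean.
    """
    sums, counts = {}, {}
    for sub, weights in subclass_weights.items():
        for label in subclass_labels[sub]:
            acc = sums.setdefault(label, [0.0] * len(weights))
            for j, value in enumerate(weights):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


# Hypothetical subclasses formed by combining two overlapping annotations.
subclass_weights = {
    ("early", "active"): [1.0, 0.0],
    ("early", "inactive"): [3.0, 2.0],
    ("late", "active"): [5.0, 4.0],
}
subclass_labels = {sub: sub for sub in subclass_weights}
print(label_weights(subclass_weights, subclass_labels))
```

In this toy case the "early" label receives the average of the two early subclasses, so a k-mer weight shared by both survives at the label level while one specific to a single subclass is diluted.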

By implicitly accounting for the effects of overlapping annotation labels, SeqUnwinder can deconvolve sequence features associated with motor neuron programming dynamics and ES chromatin status. 

The authors found IRF and RUNX motifs enriched at GM12878-specific binding sites for 11 and 7 of the 17 examined TFs, respectively. 

One advantage of the “hill-finding” approach is that it implicitly takes into account positional relationships between high-scoring k-mers on the genome; short stretches that contain multiple high-scoring k-mers will form larger “hills”. 
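A minimal sketch of the hill idea follows, under simplifying assumptions that are not taken from the paper: each k-mer window is scored by a single learned weight, and a hill is a maximal run of consecutive windows with positive weight. The function name `find_hills` and the weights are hypothetical.

```python
def find_hills(sequence, kmer_weights, k):
    """Return (start, end, score) for each maximal run of positively weighted k-mers.

    start/end are 0-based, end-exclusive coordinates on `sequence`.
    Assumption: k-mers missing from `kmer_weights` score 0.
    """
    hills = []
    start, total = None, 0.0
    n_windows = len(sequence) - k + 1
    for i in range(n_windows):
        weight = kmer_weights.get(sequence[i:i + k], 0.0)
        if weight > 0:
            if start is None:
                start, total = i, 0.0
            total += weight
        elif start is not None:
            # The last positive window began at i - 1 and spans k bases.
            hills.append((start, i - 1 + k, total))
            start = None
    if start is not None:
        hills.append((start, n_windows - 1 + k, total))
    return hills


# Two adjacent high-scoring 3-mers merge into one 4-bp hill.
print(find_hills("AAACGTAAA", {"ACG": 1.0, "CGT": 2.0}, k=3))
```

Because overlapping positive windows are merged, neighbouring high-scoring k-mers produce one longer, higher-scoring hill rather than several separate hits.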

Several variants of the basic string kernel (e.g. mismatch kernel [35], di-mismatch kernel [4], wild-card kernel [5,35], and gkm-kernel [36]) have been proposed and have been shown to substantially improve the classifier performance. 
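For orientation, the sketch below assumes that the "basic string kernel" refers to the k-spectrum kernel, in which the similarity of two sequences is the dot product of their k-mer count vectors. The variants listed above relax exact matching in different ways and are not implemented here; the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """Dot product of the k-mer count vectors of sequences `a` and `b`."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# "ACGT" and "ACGA" share the 2-mers AC and CG, once each.
print(spectrum_kernel("ACGT", "ACGA", k=2))
```

Exact matching makes this kernel sparse for longer k, which is one motivation for the mismatch and gapped variants.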

Since DREME takes only two classes as input: a foreground set and a background set, the authors ran four different DREME runs for each of the four labels. 
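One way to prepare such per-label runs is sketched below. It assumes that the foreground for a label is every site carrying that label and the background is every remaining site; the background actually used by the authors is not stated here. The helper only builds the site lists and does not invoke DREME, and its name and the example site identifiers are hypothetical.

```python
def one_vs_rest_sets(site_labels):
    """Build a (foreground, background) pair of site IDs for every label.

    site_labels: dict site_id -> set of annotation labels on that site.
    Assumption: background = all sites that lack the label.
    """
    all_labels = sorted(set().union(*site_labels.values()))
    sets = {}
    for label in all_labels:
        foreground = sorted(s for s, labs in site_labels.items() if label in labs)
        background = sorted(s for s, labs in site_labels.items() if label not in labs)
        sets[label] = (foreground, background)
    return sets


sites = {
    "s1": {"early", "active"},
    "s2": {"late", "active"},
    "s3": {"early", "inactive"},
}
for label, (fg, bg) in one_vs_rest_sets(sites).items():
    print(label, fg, bg)
```

Because a site can carry several labels, it appears in the foreground of more than one run, which is the overlap that a two-class tool cannot account for.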

Binding sites showing significantly differential binding in any of the possible 3 pair-wise comparisons were removed from the shared set. 

SeqUnwinder’s characterization of cell-specific motif features in collections of DNase-seq datasets may therefore serve as a source of predictive features for efforts that aim to predict cell-specific TF binding from accessibility experimental data alone [39–41]. 

To speed up the implementation, the authors restrict the unbiased k-mer features to only those k-mers that are present in at least 5% of the hills. 
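The filter can be sketched as below, assuming hills are available as plain sequence strings and that "present" means the k-mer occurs at least once in a hill. The function name `frequent_kmers` is hypothetical, and strand handling (reverse complements) is omitted.

```python
from collections import Counter


def frequent_kmers(hills, k, min_fraction=0.05):
    """Return k-mers that occur in at least `min_fraction` of the hill sequences."""
    presence = Counter()
    for hill in hills:
        # A set ensures each k-mer is counted once per hill.
        presence.update({hill[i:i + k] for i in range(len(hill) - k + 1)})
    cutoff = min_fraction * len(hills)
    return {kmer for kmer, n in presence.items() if n >= cutoff}


hills = ["ACGT", "ACGA", "TTTT"]
print(frequent_kmers(hills, k=3, min_fraction=0.5))
```

With only three toy hills the example uses a 50% cutoff so that the filter visibly removes k-mers; the default mirrors the 5% threshold quoted above.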

All sites with significantly greater Isl1/Lhx3 ChIP enrichment at 12h compared to 48h (q-value cutoff of <0.01) were labeled as “early”. 

The motifs that the authors previously assigned to early or late TF binding behaviors could have been merely associated with ES-active and ES-inactive sites, respectively. 

The cognate motif was not specifically predictive of cell-type-specific labels for the examined TFs, with the exception of H1-hESC-specific sites for CEBPB, NRSF and SRF.