How many datasets were used to calculate collective degree?

To calculate collective degree, the authors used a total of 158, 102, and 202 ChIP-seq datasets in GM12878, H1-hESC, and K562 cell-types, respectively.

What is the likely reason for the non-positive score?

A significant depletion of motif instances at sites annotated by a label compared to other labels can very likely result in non-positive scores.

How can SeqUnwinder deconvolve sequence features associated with motor neuron programming?

By implicitly accounting for the effects of overlapping annotation labels, SeqUnwinder can deconvolve sequence features associated with motor neuron programming dynamics and ES chromatin status.

How many TFs were found to have cognate motifs?

The authors found IRF and RUNX motifs enriched at GM12878-specific binding sites for 11 and 7 of the 17 examined TFs, respectively.

What is the advantage of the “hill-finding” approach?

One advantage of the “hill-finding” approach is that it implicitly takes into account positional relationships between high-scoring k-mers on the genome; short stretches that contain multiple high-scoring k-mers will form larger “hills”.

What variants of the basic string kernel have been proposed?

Several variants of the basic string kernel (e.g. mismatch kernel [35], di-mismatch kernel [4], wild-card kernel [5,35], and gkm-kernel [36]) have been proposed and have been shown to substantially improve the classifier performance.

How many different DREME runs did the authors run for each of the labels?

Since DREME takes only two classes as input: a foreground set and a background set, the authors ran four different DREME runs for each of the four labels.

What was the significance of the binding sites removed from the shared set?

Binding sites showing significantly differential binding in any of the 3 possible pair-wise comparisons were removed from the shared set.

What is the way to predict TF binding?

SeqUnwinder’s characterization of cell-specific motif features in collections of DNase-seq datasets may therefore serve as a source of predictive features for efforts that aim to predict cell-specific TF binding from accessibility experimental data alone [39–41].

How do the authors restrict the k-mer features to the hills?

To speed up implementation, the authors restrict the unbiased k-mer features to only those k-mers that are present in at least 5% of the hills.

What was the q-value cutoff of the labeled sites?

All sites with significantly greater Isl1/Lhx3 ChIP enrichment at 12h compared to 48h (q-value cutoff of <0.01) were labeled as “early”.

What are the motifs that the authors previously assigned to early or late TF binding behaviors?

The motifs that the authors previously assigned to early or late TF binding behaviors could have been merely associated with ES-active and ES-inactive sites, respectively.

What are the TFs that are not correlated with the cognate motif?

The cognate motif was not specifically predictive of cell-type-specific labels for the examined TFs, with the exception of H1-hESC-specific sites for CEBPB, NRSF and SRF.

(Open Access) Deconvolving sequence features that discriminate between overlapping regulatory annotations (2017) | Akshay Kakumanu

Q: What contributions have the authors mentioned in the paper "Deconvolving sequence features that discriminate between overlapping regulatory annotations"?

The authors demonstrate the novel analysis abilities of SeqUnwinder using three examples. Finally, the authors demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines.

Q: What is the common method of finding a motif in a sequence?

Most popular motif-finding methods use unsupervised machine-learning approaches to discover motifs in ‘foreground’ input sequences that are over-represented with respect to a set of ‘background’ sequences (e.g. “bound” vs. “unbound”, respectively) [1,2].

Q: How are the weight parameters learned for the labels?

In other words, while the k-mer weight parameters for each subclass are learned directly from the data, the weight parameters for the labels are learned exclusively through the regularization constraint.

RESEARCH ARTICLE

Deconvolving sequence features that discriminate between overlapping regulatory annotations

Akshay Kakumanu, Silvia Velasco, Esteban Mazzoni, Shaun Mahony

1 Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America

2 Department of Biology, New York University, 100 Washington Square East, New York, NY, United States of America

* mahony@psu.edu

Abstract

Genomic loci with regulatory potential can be annotated with various properties. For example, genomic sites bound by a given transcription factor (TF) can be divided according to whether they are proximal or distal to known promoters. Sites can be further labeled according to the cell types and conditions in which they are active. Given such a collection of labeled sites, it is natural to ask what sequence features are associated with each annotation label. However, discovering such label-specific sequence features is often confounded by overlaps between the labels; e.g. if regulatory sites specific to a given cell type are also more likely to be promoter-proximal, it is difficult to assess whether motifs identified in that set of sites are associated with the cell type or associated with promoters. In order to meet this challenge, we developed SeqUnwinder, a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly, SeqUnwinder is able to unravel sequence features associated with the dynamic binding behavior of TFs during motor neuron programming from features associated with chromatin state in the initial embryonic stem cells. Secondly, we characterize distinct sequence properties of multi-condition and cell-specific TF binding sites after controlling for uneven associations with promoter proximity. Finally, we demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines.

Author summary

Transcription factor proteins control gene expression by recognizing and interacting with short DNA sequence patterns in regulatory regions on the genome. Current genomics experiments allow us to find regulatory regions associated with a particular biochemical activity over the entire genome; for example, all regions where a particular transcription factor interacts with the genome in a given cell type. Given a collection of regulatory regions, we often aim to discover short DNA sequence patterns that are more common in the collection than in other regions. Performing such “DNA motif-finding” analysis can give us hints about the patterns that determine gene regulation in the analyzed cell type. Here we describe a new method for DNA motif-finding called SeqUnwinder. Our approach analyzes collections of regulatory regions where each has been labeled according to various biological properties. For example, the labels could correspond to various cell types in which the regulatory region is active. SeqUnwinder then performs machine-learning analysis to unravel DNA sequence features that are characteristic of each label (e.g. features that distinguish regulatory regions in each cell type from other cell types). SeqUnwinder is the first method to enable analysis of regulatory region collections that contain several overlapping labels.

PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005795 October 19, 2017

OPEN ACCESS

Citation: Kakumanu A, Velasco S, Mazzoni E, Mahony S (2017) Deconvolving sequence features that discriminate between overlapping regulatory annotations. PLoS Comput Biol 13(10): e1005795. https://doi.org/10.1371/journal.pcbi.1005795

Editor: Ilya Ioshikhes, Ottawa University, CANADA

Received: May 9, 2017

Accepted: September 26, 2017

Published: October 19, 2017

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: Software code available from https://github.com/seqcode/sequnwinder. Complete output files produced by the SeqUnwinder runs described in this manuscript, along with scripts and data for reproducing all analysis figures, are available from: https://github.com/ikaka89/sequnwinderPaper. Experimental data are available from GEO archive under accession GSE80321.

Funding: This work was supported by National Institutes of Health grant R01HD079682 (to EOM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Many regulatory genomics analyses focus on finding DNA sequence features that are characteristic of a biological property. Given a set of sequences that are bound by a particular transcription factor (TF), for example, we typically aim to discover short, degenerate DNA patterns that may represent the DNA binding preferences of the TF itself, the binding preferences of coincident TFs, or general properties of the regions that make them favorable for binding.

The de novo DNA motif-finding problem is typically cast in the context of two mutually exclusive sequence sets. Most popular motif-finding methods use unsupervised machine-learning approaches to discover motifs in ‘foreground’ input sequences that are over-represented with respect to a set of ‘background’ sequences (e.g. “bound” vs. “unbound”, respectively) [1,2]. Several other methods explicitly solve a two-class classification problem, where the goal is to find sequence features that discriminate between two mutually exclusive class labels [3–6].

Current characterizations of regulatory sites move beyond binary labels such as “bound” and “unbound”. For example, in a given cell type, each regulatory element could be labeled as bound or unbound by each of several TFs and enriched or depleted for several chromatin states [7–9]. As we add more regulatory class labels, it becomes difficult to define mutually exclusive sets of sequences that are representative of each label. Relatedly, our analyses may become confounded by uneven degrees of overlap between the class labels, leading to incorrect associations between sequence features and regulatory activities. Therefore, a simple recasting of discriminative motif-finding as a multi-class classification problem (where classes are required to be mutually exclusive) is not always appropriate.

As an example, consider the hypothetical scenario presented in Fig 1A. In this example, a given TF’s binding sites have been profiled in cell types A, B, and C. Thus, each TF binding event can be labeled as specific to a cell type or common to all or a subset. Let’s assume that after further labeling the sites as being proximal or distal to promoters (Pr and Di, respectively), we find that the TF’s binding sites in cell A are more likely to be promoter proximal than sites in other cell types. Promoter regions have sequence features that are distinct from distal regions (e.g. the presence of core promoter elements and distinct GC-content patterns). Therefore, if we search for sequence features that are discriminative of cell A’s sites without accounting for the uneven overlaps with other labels, it is likely that some discovered features will actually be generic properties of proximal regions. Such results could in turn affect our conclusions regarding the biological mechanisms of TF binding in cell A. To resolve DNA features associated with each cell type’s label from those associated with confounding labels (e.g. promoter proximity), we need motif-finders that are able to analyze multiple labels in parallel.
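The confounding effect in the Fig 1A scenario can be made concrete with a toy calculation. The sketch below is illustrative only: the site counts and motif frequencies are hypothetical numbers invented for the example (they are not from the paper), and `motif_rate` is a helper name introduced here. A motif that tracks promoter proximity alone looks enriched in cell A when sites are pooled, yet shows no cell A preference once Pr and Di sites are compared separately.

```python
# Hypothetical toy counts, NOT data from the paper:
# (cell, location) -> (number of sites, number of sites containing the motif).
# The motif occurs in 60% of promoter-proximal (Pr) sites and 10% of distal (Di)
# sites in every cell type, so it carries no cell-type information by construction.
counts = {
    ("A", "Pr"): (800, 480),
    ("A", "Di"): (200, 20),
    ("other", "Pr"): (200, 120),
    ("other", "Di"): (800, 80),
}

def motif_rate(keys):
    """Fraction of sites containing the motif across the given (cell, location) groups."""
    total = sum(counts[key][0] for key in keys)
    hits = sum(counts[key][1] for key in keys)
    return hits / total

# Pooled comparison ignores promoter proximity: cell A appears enriched.
pooled_ratio = (motif_rate([("A", "Pr"), ("A", "Di")])
                / motif_rate([("other", "Pr"), ("other", "Di")]))

# Stratified comparison holds promoter proximity fixed: the enrichment vanishes.
stratified_ratio = {
    loc: motif_rate([("A", loc)]) / motif_rate([("other", loc)])
    for loc in ("Pr", "Di")
}

print("pooled A vs other:", pooled_ratio)
print("within Pr / within Di:", stratified_ratio)
```

In this toy setting the pooled ratio is well above 1 while both stratified ratios equal 1, which is the kind of misattribution that analyzing the labels in parallel is meant to avoid.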

Almost all existing discriminative motif-finders assume that the class labels are mutually exclusive, and therefore cannot appropriately handle scenarios such as that outlined in Fig 1A. For example, the multi-class discriminative sequence feature frameworks proposed by Tavazoie and colleagues [3,10,11] are limited to analysis of mutually exclusive classes. A few existing methods do allow a limited analysis of datasets where annotation labels partially overlap, but these approaches were designed for two-class classification problems where the multi-task framework enables modeling of the “common” task in addition to the two classes. For example, Arvey, et al. [4] used a multi-task SVM classifier to learn sequence features associated with cell type-specific TF binding across two cell types, along with features shared by TF binding sites in both cell types. The group lasso based logistic regression classifier SeqGL [5] also implements a similar multi-task framework to identify features that are discriminative between two classes and features that are common to both. No existing discriminative feature discovery method is applicable to multi-label classification scenarios where a set of genomic sequences contains several annotation labels with arbitrary rates of overlap between them.

Fig 1. Overview of SeqUnwinder, which takes an input list of annotated genomic sites and identifies label-specific discriminative motifs. (A) Schematic showing a typical input instance for SeqUnwinder: a list of genomic coordinates and corresponding annotation labels. (B) The underlying classification framework implemented in SeqUnwinder. Subclasses (combination of annotation labels) are treated as different classes in a multi-class classification framework. The label-specific properties are implicitly modeled using L1-regularization. (C) Weighted k-mer models are used to identify 10-15 bp focus regions called hills. MEME is used to identify motifs at hills. (D) De novo identified motifs in (C) are scored using the weighted k-mer model to obtain label-specific scores.

https://doi.org/10.1371/journal.pcbi.1005795.g001

In this work, we present SeqUnwinder, a hierarchical classification framework for characterizing interpretable sequence features associated with overlapping sets of genomic annotation labels. We demonstrate the unique analysis abilities of SeqUnwinder using both synthetic sequence datasets and collections of real TF ChIP-seq and DNase-seq experiments. In each demonstration, SeqUnwinder cleanly associates interpretable sequence features with various cell- or condition-specific annotation labels, while simultaneously removing the effects of confounding signals. SeqUnwinder scales effectively to large collections of genomic loci that have been annotated with several overlapping labels, and is thus designed to deal with the complexity of modern data sets.

Results

SeqUnwinder overview

The intuition behind SeqUnwinder is that sequence features associated with a particular annotation label should be similarly enriched across all subclasses spanned by the label (regardless of how the subclasses have been defined). SeqUnwinder’s analysis begins by defining genomic site subclasses based on the combinations of labels annotated at these sites (Fig 1B). The site subclasses are treated as distinct classes for a multi-class logistic regression model that uses k-mer frequencies as predictors. At the same time, k-mer models are also learned for each label by incorporating them in an L1 regularization term (see Methods). In other words, while the k-mer weight parameters for each subclass are learned directly from the data, the weight parameters for the labels are learned exclusively through the regularization constraint. The regularization encourages each label’s model to take the form of the features that are consistently enriched across the subclasses spanned by that label (Fig 1B). The trained classifier encapsulates weighted k-mer models specific to each label and each subclass (i.e. combination of labels). The label- or subclass-specific k-mer model is scanned across the original genomic sites to identify focused regions (which we term “hills”) that contain discriminative sequence signals (Fig 1C). Finally, to aid interpretability, SeqUnwinder identifies over-represented motifs in the hills and scores them using label- and subclass-specific k-mer models (Fig 1D).
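The first steps of this overview (deriving subclasses from label combinations, counting k-mers as predictors, and tying label models to subclass models through an L1 term) can be sketched in a few lines. This is a minimal illustration, not SeqUnwinder's implementation: the function names are hypothetical, k-mers are counted on the forward strand only, and the penalty shown is one plausible form of an L1 coupling between each subclass model and the models of the labels it carries. The exact objective is defined in the Methods, which are not reproduced in this section.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping k-mers in a DNA sequence (forward strand only; an assumption)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def subclass_of(labels):
    """A subclass is the combination of labels annotated at a site."""
    return tuple(sorted(labels))

def l1_coupling_penalty(subclass_weights, label_weights):
    """Sum of L1 distances between each subclass model and each of its labels' models.

    Both arguments map a name to a {k-mer: weight} dict; missing k-mers count as 0.
    Assumed form of the regularizer, for illustration only.
    """
    penalty = 0.0
    for subclass, w_sub in subclass_weights.items():
        for label in subclass:
            w_lab = label_weights[label]
            for kmer in set(w_sub) | set(w_lab):
                penalty += abs(w_sub.get(kmer, 0.0) - w_lab.get(kmer, 0.0))
    return penalty

# Two toy sites: the first carries labels A and X, the second B and X.
sites = [("ACGTAC", {"A", "X"}), ("TTGACA", {"B", "X"})]
for seq, labels in sites:
    print(subclass_of(labels), dict(kmer_counts(seq, 2)))
```

Under a penalty of this shape, a label's weights are pulled toward whatever is shared by all subclasses containing that label, which matches the intuition stated above that label-associated features should be similarly enriched across the subclasses the label spans.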

SeqUnwinder is easy to use, taking as input a list of DNA sequences or genomic coordinates that are each annotated with a set of user-defined labels. The labels can come from any source, enabling a high degree of analysis flexibility. SeqUnwinder implements a multi-threaded version of the ADMM [12] framework to train the model and typically runs in less than a few hours for most datasets. Output includes both k-mer models and position-specific scoring matrices and weights associating these motifs with each subclass and label.
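The hill scan described in the overview, in which a weighted k-mer model is slid across each site to find short stretches of discriminative signal, might look roughly like the sketch below. It is an assumption-laden illustration rather than SeqUnwinder's algorithm: `find_hills` is a hypothetical name, a hill is taken here to be a maximal run of consecutive k-mer start positions with positive weight, and the hill score is simply the sum of those weights.

```python
def find_hills(seq, weights, k, min_score=0.0):
    """Return (start, end, score) for each maximal run of positively weighted k-mers.

    `weights` maps k-mers to model weights (unlisted k-mers score 0). `start` and
    `end` are 0-based, end-exclusive coordinates covering every k-mer in the run.
    """
    hills = []
    run_start = None
    run_score = 0.0
    n_starts = len(seq) - k + 1
    # One extra iteration with weight 0 acts as a sentinel that closes a trailing run.
    for i in range(n_starts + 1):
        w = weights.get(seq[i:i + k], 0.0) if i < n_starts else 0.0
        if w > 0:
            if run_start is None:
                run_start, run_score = i, 0.0
            run_score += w
        elif run_start is not None:
            if run_score > min_score:
                hills.append((run_start, i - 1 + k, run_score))
            run_start = None
    return hills

toy_weights = {"ACGT": 1.0, "CGTG": 0.5}  # hypothetical label-specific k-mer weights
for start, end, score in find_hills("TTTACGTGTTT", toy_weights, k=4):
    print(start, end, "TTTACGTGTTT"[start:end], score)
```

Because adjacent high-scoring k-mers merge into a single run, stretches containing several of them naturally form larger hills, and the resulting sub-sequences are what a motif-finder such as MEME would then be run on.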

SeqUnwinder deconvolves sequence features associated with overlapping labels

To demonstrate the properties of SeqUnwinder, we simulated 9,000 regulatory regions and annotated each of them with labels from two overlapping sets: A, B, C and X, Y (Fig 2A). We assigned a different motif to each label. At 70% of the sequences associated with each label, we inserted appropriate motif instances by sampling from the distributions defined by the position-specific scoring matrices of label assigned motifs (Fig 2A). We used this collection of sequences and label assignments to compare SeqUnwinder with a simple multi-class classification approach (MCC). In MCC training, each label was treated as a distinct class and therefore each regulatory sequence is included multiple times in accordance with its annotated labels.
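The simulation and the MCC baseline described above can be sketched as follows. This is a hedged illustration under several assumptions that are not stated in the text: background bases are drawn uniformly, each label's motif is inserted independently with probability 0.7 (which only approximates "70% of the sequences"), insertion positions are uniform, and a later insertion may overwrite an earlier one. All function names and the toy matrices are hypothetical.

```python
import random

BASES = "ACGT"

def sample_instance(pwm, rng):
    """Draw one motif instance; pwm is a list of per-position [pA, pC, pG, pT] columns."""
    return "".join(rng.choices(BASES, weights=column)[0] for column in pwm)

def simulate_site(labels, motifs, seq_len, insert_rate, rng):
    """Uniform random background with each label's motif inserted with probability insert_rate."""
    seq = [rng.choice(BASES) for _ in range(seq_len)]
    for label in sorted(labels):
        if rng.random() < insert_rate:
            instance = sample_instance(motifs[label], rng)
            pos = rng.randrange(seq_len - len(instance) + 1)
            seq[pos:pos + len(instance)] = instance  # may overwrite an earlier insertion
    return "".join(seq)

def expand_for_mcc(sites):
    """MCC-style training set: every sequence is repeated once per annotated label."""
    return [(seq, label) for seq, labels in sites for label in sorted(labels)]

rng = random.Random(0)  # fixed seed so the toy run is reproducible
toy_motifs = {  # hypothetical motifs, one per label
    "A": [[0.9, 0.1, 0.0, 0.0], [0.0, 0.9, 0.1, 0.0], [0.0, 0.0, 0.9, 0.1]],
    "X": [[0.0, 0.0, 0.1, 0.9], [0.1, 0.0, 0.0, 0.9], [0.9, 0.0, 0.0, 0.1]],
}
sites = [(simulate_site({"A", "X"}, toy_motifs, 30, 0.7, rng), {"A", "X"})]
print(expand_for_mcc(sites))
```

The `expand_for_mcc` step shows why the baseline is vulnerable to label overlap: a site annotated with both A and X contributes the same sequence to both classes, so features of one label leak into the other.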


SeqUnwinder and the MCC model correctly identify motifs similar to all inserted motifs (Fig 2B). However, the MCC approach makes several incorrect motif-label associations, potentially due to high overlap between labels. In contrast, the label-specific scores of the identified motifs in the SeqUnwinder model are not confounded by overlap between annotation labels. For example, even though labels X and A highly overlap, SeqUnwinder correctly assigns each motif to its respective label.

Next, we assessed the performance of SeqUnwinder at different levels of label overlaps. We simulated 100 datasets with 6000 simulated sequences, varying the degree of overlap between two sets of labels ({A, B} and {X, Y}) from 50% to 99% (Fig 2C). We then compared SeqUnwinder with MCC and DREME [1], a popular discriminative motif discovery tool. Since DREME takes only two classes as input (a foreground set and a background set), we ran four separate DREME runs, one for each of the four labels. We calculated the true positive (discovered motif

Fig 2. Performance of SeqUnwinder on simulated datasets. (A) 9000 simulated genomic sites with corresponding motif associations. (B) Label-specific scores for all de novo motifs identified using MCC (left) and SeqUnwinder (right) models on simulated genomic sites in “A”. For consistency across figures, we fix the color saturation values to -0.4 and 0.4. (C) Schematic showing 100 genomic datasets with 6000 genomic sites and varying degrees of label overlap ranging from 0.5 to 0.99. (D) Performance of MCC (multi-class logistic classifier), DREME, and SeqUnwinder on simulated datasets in “C”, measured using the F1-score, (E) true positive rates, and (F) false positive rates.

https://doi.org/10.1371/journal.pcbi.1005795.g002
