Deconvolving sequence features that discriminate between overlapping regulatory annotations
Summary (4 min read)
Introduction
- Many regulatory genomics analyses focus on finding DNA sequence features that are characteristic of a biological property.
- Almost all existing discriminative motif-finders assume that the class labels are mutually exclusive, and therefore cannot appropriately handle scenarios such as that outlined in Fig 1A .
- No existing discriminative feature discovery method is applicable to multi-label classification scenarios where a set of genomic sequences contains several annotation labels with arbitrary rates of overlap between them.
SeqUnwinder overview
- The intuition behind SeqUnwinder is that sequence features associated with a particular annotation label should be similarly enriched across all subclasses spanned by the label (regardless of how the subclasses have been defined).
- In other words, while the k-mer weight parameters for each subclass are learned directly from the data, the weight parameters for the labels are learned exclusively through the regularization constraint.
- The label-or subclass-specific k-mer model is scanned across the original genomic sites to identify focused regions (which the authors term "hills") that contain discriminative sequence signals (Fig 1C) .
- The labels can come from any source, enabling a high degree of analysis flexibility.
- SeqUnwinder implements a multi-threaded version of the ADMM [12] framework to train the model and typically runs in less than a few hours for most datasets.
SeqUnwinder deconvolves sequence features associated with overlapping labels
- At 70% of the sequences associated with each label, the authors inserted appropriate motif instances by sampling from the distributions defined by the position-specific scoring matrices of label assigned motifs (Fig 2A ).
- SeqUnwinder and the MCC model correctly identify motifs similar to all inserted motifs (Fig 2B ).
- Since DREME takes only two classes as input: a foreground set and a background set, the authors ran four different DREME runs for each of the four labels.
- The authors used these measures to calculate the F1 score (harmonic mean of precision and recall) at different overlapping levels (Fig 2D) .
- SeqUnwinder performs better than the other approaches in the intermediate range of label overlaps, and accurately characterizes label-specific sequence features even when the simulated labels overlap at 90% of sites.
SeqUnwinder uncovers co-factor driven TF binding dynamics during iMN programming
- To demonstrate its unique abilities in a real analysis problem, the authors use SeqUnwinder to study TF binding during induced motor neuron (iMN) programming.
- Annotating Isl1/Lhx3 sites using both sets of labels (Isl1/Lhx3 binding dynamics and ES activity) results in six different subclasses.
- All methods discover similar sets of motifs.
- SeqUnwinder, in contrast, makes much cleaner associations; the Oct4 motif is only associated with the "early" label, and the Onecut motif is only associated with the "late" label, suggesting that these motifs are not merely coincidental features due to the ES activity status of the binding sites.
Multi-condition TF binding sites are characterized by stronger cognate motif instances
- The sequence properties of tissue-specific TF binding sites have been extensively studied [4, 5, 16] .
- Studies of individual TFs suggest that binding affinity to cognate motif instances may play a role in distinguishing multi-condition binding sites from tissue-specific sites [15, 17] .
- The authors applied SeqUnwinder to each labeled sequence collection in order to characterize labelspecific sequence features (see S2 Table for cross-validation classification performance values).
- The authors illustrate the process with SeqUnwinder's results for YY1.
- From Fig 5C, it is clear that distal K562-and GM12878-specific sites lacking a cognate motif instance have higher collective degrees.
Discussion
- Classification models have shown great potential in identifying sequence features at defined genomic sites.
- To systematically address this, the authors developed SeqUnwinder.
- Therefore, the motifs that the authors previously assigned to early or late TF binding behaviors could have been merely associated with ES-active and ES-inactive sites, respectively.
- Incorporating graphs defined by label similarities [42, 43] may thus be productive in the context of analyses across cell lineages or developmental time-series.
- SeqUnwinder may also be easily extended to incorporate different kinds of sequence kernels and DNA shape features [35, 36, 44] .
SeqUnwinder model
- The training features for the classifier are based on k-mer frequencies in a fixed window around input loci.
- The value or range of k is user-definable in the SeqUnwinder software, but all analyses in this work use models based on all 4-mers and 5-mers.
- When counting frequencies, the authors map each k-mer to the same entry as its reverse complement.
- The parameters of SeqUnwinder are k-mer weights for each subclass (combination of annotation labels).
- Briefly, label-specific k-mer weights are encouraged to be similar to k-mer weights in all subclasses the label spans by regularizing on the differences of k-mer weights.
Training the SeqUnwinder model
- The w n and w p update steps separate out and are iteratively updated until convergence.
- The above equation is solved using the scaled alternating direction method of multipliers (ADMM) framework [12] .
- Briefly, the ADMM framework splits the above problem into 2 smaller sub-problems, which are much easier to solve.
- Where abs and rel are the absolute and relative tolerance, respectively.
- Intuitively, the w tþ1 n update step is distributed across multiple threads by splitting the M training examples into smaller subsets.
Converting weighted k-mer models into interpretable sequence features
- While SeqUnwinder models label-specific sequence features using high-dimensional kmer weight vectors, it is often desirable to visualize these sequence features in terms of a collection of interpretable position-specific scoring matrices.
- Specifically, the authors first scan the k-mer models learned during the training process across fixed-sized sequence windows around the input genomic loci to identify local high-scoring regions called "hills".
- Next, the authors cluster the hills using K-means clustering with Euclidean distance metric and k-mer counts as features.
- Note that the heatmaps in each figure which display these label-specific discriminative scores have been generated with a shared color scheme; i.e., the maximum shade of yellow is defined to correspond to a modelspecific score of +0.4, while the maximum shade of blue is set to a score of -0.4.
- Focused motif searches in the hills thus can find motifs that are longer than the longest k-mers in the underlying SeqUnwinder model.
Generation of synthetic datasets
- To test SeqUnwinder in simulated settings, the authors generated various synthetic datasets.
- The sizes of simulated datasets (6,000-9,000 sequences) were chosen to roughly reflect the number of peaks in a typical ChIP-seq dataset.
- The exact choice of order of the background Markov model (i.e. 2 nd -order versus a higher order) is arbitrary, but should not be expected to affect the relative performances of the methods in correctly associating embedded motifs with correct labels.
- Next, the authors randomly assigned labels to the simulated sequences at different frequencies.
- Each motif instance was sampled from the probability density function defined by the PWM of the motif.
Processing iMN programming data-sets
- Defining early, shared and late binding labels.
- Isl1/Lhx3 binding sites called in both 12 and 48h datasets with a further filter of not being differentially bound (q-value cutoff of <0.01), were assigned as "shared" sites.
- A union list of 1million 500bp regions comprising the enriched domains (see below) of DNa-seI, H3K4me2, H3K4me1, H3K27ac, and H3K4me3 was used as the positive set for training the classifier.
- Weka's implementation of Random Forests was used to train the classifier (https://github.
Processing ENCODE datasets
- The binding profiles for the factors were profiled using MultiGPS [15] .
- Binding sites labeled as cell-type specific sites were required to have significantly higher ChIP enrichment compared to other cell-lines.
- Further, contiguous blocks within 200bp were stitched together to call enriched domains.
Annotation of de novo identified motifs
- All de novo motifs identified using SeqUnwinder were annotated using the cis-bp database.
- Briefly, de novo motifs were matched against the cis-bp database using STAMP [49] .
- The best matching hit with a p-value of less than 10e-6 was used to name the de novo identified motifs.
Availability and reproducibility
- SeqUnwinder is freely available under the MIT open source license from: https://github.com/ seqcode/sequnwinder.
- Complete output files produced by the SeqUnwinder runs described in this work, along with scripts and data for reproducing all analysis figures, are available from: https://github.com/ikaka89/sequnwinderPaper.
Did you find this useful? Give us your feedback
Citations
4,409 citations
365 citations
References
327 citations
"Deconvolving sequence features that..." refers background in this paper
...Multi-class discriminative sequence feature frameworks have been proposed (Beer & Tavazoie, 2004; Elemento et al, 2007), but these approaches have also been limited to analysis of mutually exclusive classes....
[...]
285 citations
"Deconvolving sequence features that..." refers background or result in this paper
...Studies of individual TFs suggest that binding affinity to cognate motif instances may play a role in distinguishing multi-condition binding sites from tissue-specific sites (Gertz et al, 2013; Mahony et al, 2014)....
[...]
...Our results support the model that high affinity cognate motif instances are a striking feature of multi-conditionally bound sites across a broad range of TFs (Gertz et al, 2013; Mahony et al, 2014)....
[...]
264 citations
"Deconvolving sequence features that..." refers background or result in this paper
...…a particular TF in multiple cell types (i.e. “shared” or multi-condition sites) are often strongly biased towards being located in gene promoter regions, in contrast to cell-specific binding sites, which are typically distally located (Yip et al, 2012; Wang et al, 2012; Kheradpour & Kellis, 2014)....
[...]
...Similar findings were previously identified at the “high occupancy of transcription-related factors (HOT)” regions (Yip et al, 2012)....
[...]
262 citations
"Deconvolving sequence features that..." refers result in this paper
...These results are consistent with the “TF collective” model proposed by Junion and colleagues (Junion et al, 2012)....
[...]
233 citations
"Deconvolving sequence features that..." refers background or methods in this paper
...For example (Lee et al, 2011), trained an SVM classifier to discriminate putative enhancers from random sequences using an unbiased set of k-mers as predictors....
[...]
...…convolutional neural networks and support vector machines (SVMs) with various k-mer based sequence kernels (Arvey et al, 2012; Ghandi et al, 2014; Lee et al, 2011), these applications typically focus on discriminating between two mutually exclusive classes (Bailey, 2011; Alipanahi et al, 2015)....
[...]
Related Papers (5)
Frequently Asked Questions (16)
Q2. How many datasets were used to calculate collective degree?
To calculate collective degree, the authors used a total of 158, 102, and 202 ChIP-seq datasets in GM12878, H1-hESC, and K562 cell-types, respectively.
Q3. What is the likely reason for the non-positive score?
A significant depletion of motif instances at sites annotated by a label compared to other labels can very likely result in non-positive scores.
Q4. What is the common method of finding a motif in a sequence?
Most popular motif-finding methods use unsupervised machinelearning approaches to discover motifs in ‘foreground’ input sequences that are over-represented with respect to a set of ‘background’ sequences (e.g. “bound” vs. “unbound”, respectively) [1,2].
Q5. How are the weight parameters learned for the labels?
In other words, while the k-mer weight parameters for each subclass are learned directly from the data, the weight parameters for the labels are learned exclusively through the regularization constraint.
Q6. How can SeqUnwinder deconvolve sequence features associated with motor neuron programming?
By implicitly accounting for the effects of overlapping annotation labels, SeqUnwinder can deconvolve sequence features associated with motor neuron programming dynamics and ES chromatin status.
Q7. How many TFs were found to have cognate motifs?
the authors found IRF and RUNX motifs enriched at GM12878-specific binding sites for 11 and 7 of the 17 examined TFs, respectively.
Q8. What is the advantage of the “hill-finding” approach?
One advantage of the “hill-finding” approach is that it implicitly takes into account positional relationships between high-scoring k-mers on the genome; short stretches that contain multiple high-scoring k-mers will form larger “hills”.
Q9. What variants of the basic string kernel have been proposed?
Several variants of the basic string kernel (e.g. mismatch kernel [35], di-mismatch kernel [4], wild-card kernel [5,35], and gkm-kernel [36]) have been proposed and have been shown to substantially improve the classifier performance.
Q10. How many different DREME runs did the authors run for each of the labels?
Since DREME takes only two classes as input: a foreground set and a background set, the authors ran four different DREME runs for each of the four labels.
Q11. What was the significance of the binding sites removed from the shared set?
binding sites showing significantly differential binding in any of the possible 3 pair-wise comparisons were removed from the shared set.
Q12. What is the way to predict TF binding?
SeqUnwinder’s characterization of cell-specific motif features in collections of DNase-seq datasets may therefore serve as a source of predictive features for efforts that aim to predict cell-specific TF binding from accessibility experimental data alone [39–41].
Q13. How do the authors restrict the k-mer features to the hills?
To speed-up implementation, the authors restrict the unbiased k-mer features to only those k-mers that are present in at least 5% of the hills.
Q14. What was the q-value cutoff of the labeled sites?
All sites with significantly greater Isl1/Lhx3 ChIP enrichment at 12h compared to 48h (q-value cutoff of<0.01) were labeled as “early”.
Q15. What are the motifs that the authors previously assigned to early or late TF binding behaviors?
the motifs that the authors previously assigned to early or late TF binding behaviors could have been merely associated with ES-active and ES-inactive sites, respectively.
Q16. What are the TFs that are not correlated with the cognate motif?
the cognate motif was not specifically predictive of cell-type-specific labels for the examined TFs, with the exception of H1-hESC-specific sites for CEBPB, NRSF and SRF.