Deconvolving sequence features that discriminate between overlapping regulatory annotations

09 May 2017-bioRxiv (Cold Spring Harbor Laboratory)-pp 100511

TL;DR: SeqUnwinder is a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. It is demonstrated on TF binding dynamics during motor neuron programming, on multi-condition versus cell-specific TF binding sites, and on over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines.
Abstract: Genomic loci with regulatory potential can be identified and annotated with various properties. For example, genomic sites may be annotated as being bound by a given transcription factor (TF) in one or more cell types. The same sites may be further labeled as being proximal or distal to known promoters. Given such a collection of labeled sites, it is natural to ask what sequence features are associated with each annotation label. However, discovering such label-specific sequence features is often confounded by overlaps between annotation labels; e.g. if regulatory sites specific to a given cell type are also more likely to be promoter-proximal, it is difficult to assess whether motifs identified in that set of sites are associated with the cell type or associated with promoters. In order to meet this challenge, we developed SeqUnwinder, a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly, we show SeqUnwinder's ability to unravel sequence features associated with the dynamic binding behavior of TFs during motor neuron programming from features associated with chromatin state in the initial embryonic stem cells. Secondly, we characterize distinct sequence properties of multi-condition and cell-specific TF binding sites after controlling for uneven associations with promoter proximity. Finally, we demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines. Availability: https://github.com/seqcode/sequnwinder


RESEARCH ARTICLE
Deconvolving sequence features that
discriminate between overlapping regulatory
annotations
Akshay Kakumanu¹, Silvia Velasco², Esteban Mazzoni², Shaun Mahony¹*
1 Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The
Pennsylvania State University, University Park, PA, United States of America, 2 Department of Biology, New
York University, 100 Washington Square East, New York, NY, United States of America
* mahony@psu.edu
Abstract
Genomic loci with regulatory potential can be annotated with various properties. For exam-
ple, genomic sites bound by a given transcription factor (TF) can be divided according to
whether they are proximal or distal to known promoters. Sites can be further labeled accord-
ing to the cell types and conditions in which they are active. Given such a collection of
labeled sites, it is natural to ask what sequence features are associated with each annota-
tion label. However, discovering such label-specific sequence features is often confounded
by overlaps between the labels; e.g. if regulatory sites specific to a given cell type are also
more likely to be promoter-proximal, it is difficult to assess whether motifs identified in that
set of sites are associated with the cell type or associated with promoters. In order to meet
this challenge, we developed SeqUnwinder, a principled approach to deconvolving inter-
pretable discriminative sequence features associated with overlapping annotation labels.
We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly,
SeqUnwinder is able to unravel sequence features associated with the dynamic binding
behavior of TFs during motor neuron programming from features associated with chromatin
state in the initial embryonic stem cells. Secondly, we characterize distinct sequence proper-
ties of multi-condition and cell-specific TF binding sites after controlling for uneven associa-
tions with promoter proximity. Finally, we demonstrate the scalability of SeqUnwinder to
discover cell-specific sequence features from over one hundred thousand genomic loci that
display DNase I hypersensitivity in one or more ENCODE cell lines.
Author summary
Transcription factor proteins control gene expression by recognizing and interacting with
short DNA sequence patterns in regulatory regions on the genome. Current genomics
experiments allow us to find regulatory regions associated with a particular biochemical
activity over the entire genome; for example, all regions where a particular transcription
factor interacts with the genome in a given cell type. Given a collection of regulatory
regions, we often aim to discover short DNA sequence patterns that are more common in
PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005795 October 19, 2017 1 / 22
OPEN ACCESS
Citation: Kakumanu A, Velasco S, Mazzoni E,
Mahony S (2017) Deconvolving sequence features
that discriminate between overlapping regulatory
annotations. PLoS Comput Biol 13(10): e1005795.
https://doi.org/10.1371/journal.pcbi.1005795
Editor: Ilya Ioshikhes, Ottawa University, CANADA
Received: May 9, 2017
Accepted: September 26, 2017
Published: October 19, 2017
Copyright: © 2017 Kakumanu et al. This is an open
access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: Software code
available from https://github.com/seqcode/
sequnwinder Complete output files produced by
the SeqUnwinder runs described in this
manuscript, along with scripts and data for
reproducing all analysis figures, are available from:
https://github.com/ikaka89/sequnwinderPaper.
Experimental data are available from GEO archive
under accession GSE80321.
Funding: This work was supported by National
Institutes of Health grant R01HD079682 (to EOM).
The funders had no role in study design, data

the collection than in other regions. Performing such “DNA motif-finding” analysis can
give us hints about the patterns that determine gene regulation in the analyzed cell type.
Here we describe a new method for DNA motif-finding called SeqUnwinder. Our
approach analyzes collections of regulatory regions where each has been labeled according
to various biological properties. For example, the labels could correspond to various cell
types in which the regulatory region is active. SeqUnwinder then performs machine-
learning analysis to unravel DNA sequence features that are characteristic of each label
(e.g. features that distinguish regulatory regions in each cell type from other cell types).
SeqUnwinder is the first method to enable analysis of regulatory region collections that
contain several overlapping labels.
Introduction
Many regulatory genomics analyses focus on finding DNA sequence features that are charac-
teristic of a biological property. Given a set of sequences that are bound by a particular tran-
scription factor (TF), for example, we typically aim to discover short, degenerate DNA
patterns that may represent the DNA binding preferences of the TF itself, the binding prefer-
ences of coincident TFs, or general properties of the regions that make them favorable for
binding.
The de novo DNA motif-finding problem is typically cast in the context of two mutually
exclusive sequence sets. Most popular motif-finding methods use unsupervised machine-
learning approaches to discover motifs in ‘foreground’ input sequences that are over-repre-
sented with respect to a set of ‘background’ sequences (e.g. “bound” vs. “unbound”, respec-
tively) [1,2]. Several other methods explicitly solve a two-class classification problem, where
the goal is to find sequence features that discriminate between two mutually exclusive class
labels [3–6].
Current characterizations of regulatory sites move beyond binary labels such as “bound”
and “unbound”. For example, in a given cell type, each regulatory element could be labeled as
bound or unbound by each of several TFs and enriched or depleted for several chromatin
states [7–9]. As we add more regulatory class labels, it becomes difficult to define mutually
exclusive sets of sequences that are representative of each label. Relatedly, our analyses may
become confounded by uneven degrees of overlap between the class labels, leading to incorrect
associations between sequence features and regulatory activities. Therefore, a simple recasting
of discriminative motif-finding as a multi-class classification problem (where classes are
required to be mutually exclusive) is not always appropriate.
As an example, consider the hypothetical scenario presented in Fig 1A. In this example, a
given TF’s binding sites have been profiled in cell types A, B, and C. Thus, each TF binding event
can be labeled as specific to a cell type or common to all or a subset. Let’s assume that after fur-
ther labeling the sites as being proximal or distal to promoters (Pr and Di, respectively), we
find that the TF’s binding sites in cell A are more likely to be promoter proximal than sites in
other cell types. Promoter regions have sequence features that are distinct from distal regions
(e.g. the presence of core promoter elements and distinct GC-content patterns). Therefore, if
we search for sequence features that are discriminative of cell A’s sites without accounting for
the uneven overlaps with other labels, it is likely that some discovered features will actually be
generic properties of proximal regions. Such results could in turn affect our conclusions
regarding the biological mechanisms of TF binding in cell A. To resolve DNA features
Discriminative sequence features for overlapping regulatory annotations
collection and analysis, decision to publish, or
preparation of the manuscript.
Competing interests: The authors have declared
that no competing interests exist.

associated with each cell type’s label from those associated with confounding labels (e.g. pro-
moter proximity), we need motif-finders that are able to analyze multiple labels in parallel.
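The kind of uneven overlap described above can be checked before any motif analysis by tabulating how often labels co-occur. The following is a minimal sketch, assuming each site is represented as a set of labels; the toy site list and the helper names `label_cooccurrence` and `conditional_fraction` are hypothetical and are not part of SeqUnwinder.

```python
from collections import Counter
from itertools import combinations

def label_cooccurrence(sites):
    """Count single labels and label pairs over a list of per-site label sets."""
    singles = Counter()
    pairs = Counter()
    for labels in sites:
        singles.update(labels)
        # sorted() gives each unordered pair a single canonical key
        pairs.update(combinations(sorted(labels), 2))
    return singles, pairs

def conditional_fraction(singles, pairs, given, other):
    """Fraction of sites carrying `given` that also carry `other`."""
    key = tuple(sorted((given, other)))
    return pairs[key] / singles[given] if singles[given] else 0.0

# Hypothetical toy data: sites in cell A skew promoter-proximal (Pr),
# sites in cell B skew distal (Di).
sites = (
    [{"A", "Pr"}] * 8 + [{"A", "Di"}] * 2 +
    [{"B", "Pr"}] * 3 + [{"B", "Di"}] * 7
)
singles, pairs = label_cooccurrence(sites)
frac_a = conditional_fraction(singles, pairs, "A", "Pr")
frac_b = conditional_fraction(singles, pairs, "B", "Pr")
print(f"P(Pr | A) = {frac_a:.2f}, P(Pr | B) = {frac_b:.2f}")
```

A large gap between the two fractions signals that features discovered for cell A may partly reflect promoter proximity rather than cell type.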
Almost all existing discriminative motif-finders assume that the class labels are mutually
exclusive, and therefore cannot appropriately handle scenarios such as that outlined in Fig 1A.
For example, the multi-class discriminative sequence feature frameworks proposed by Tava-
zoie and colleagues [3,10,11] are limited to analysis of mutually exclusive classes. A few existing
methods do allow a limited analysis of datasets where annotation labels partially overlap, but
these approaches were designed for two-class classification problems where the multi-task
framework enables modeling of the “common” task in addition to the two classes. For exam-
ple, Arvey, et al. [4] used a multi-task SVM classifier to learn sequence features associated with
cell type-specific TF binding across two cell types, along with features shared by TF binding
sites in both cell types. The group lasso based logistic regression classifier SeqGL [5] also
implements a similar multi-task framework to identify features that are discriminative between
two classes and features that are common to both. No existing discriminative feature discovery
Fig 1. Overview of SeqUnwinder, which takes an input list of annotated genomic sites and identifies label-specific discriminative motifs. (A)
Schematic showing a typical input instance for SeqUnwinder: a list of genomic coordinates and corresponding annotation labels. (B) The underlying
classification framework implemented in SeqUnwinder. Subclasses (combination of annotation labels) are treated as different classes in a multi-class
classification framework. The label-specific properties are implicitly modeled using L1-regularization. (C) Weighted k-mer models are used to identify
10–15 bp focus regions called hills. MEME is used to identify motifs at hills. (D) De novo identified motifs in (C) are scored using the weighted k-mer model to
obtain label-specific scores.
https://doi.org/10.1371/journal.pcbi.1005795.g001

method is applicable to multi-label classification scenarios where a set of genomic sequences
contains several annotation labels with arbitrary rates of overlap between them.
In this work, we present SeqUnwinder, a hierarchical classification framework for charac-
terizing interpretable sequence features associated with overlapping sets of genomic annota-
tion labels. We demonstrate the unique analysis abilities of SeqUnwinder using both synthetic
sequence datasets and collections of real TF ChIP-seq and DNase-seq experiments. In each
demonstration, SeqUnwinder cleanly associates interpretable sequence features with various
cell- or condition-specific annotation labels, while simultaneously removing the effects of con-
founding signals. SeqUnwinder scales effectively to large collections of genomic loci that have
been annotated with several overlapping labels, and is thus designed to deal with the complex-
ity of modern data sets.
Results
SeqUnwinder overview
The intuition behind SeqUnwinder is that sequence features associated with a particular anno-
tation label should be similarly enriched across all subclasses spanned by the label (regardless
of how the subclasses have been defined). SeqUnwinder’s analysis begins by defining genomic
site subclasses based on the combinations of labels annotated at these sites (Fig 1B). The site
subclasses are treated as distinct classes for a multi-class logistic regression model that uses k-
mer frequencies as predictors. At the same time, k-mer models are also learned for each label
by incorporating them in an L1 regularization term (see Methods). In other words, while the
k-mer weight parameters for each subclass are learned directly from the data, the weight
parameters for the labels are learned exclusively through the regularization constraint. The
regularization encourages each label’s model to take the form of the features that are consis-
tently enriched across the subclasses spanned by that label (Fig 1B). The trained classifier
encapsulates weighted k-mer models specific to each label and each subclass (i.e. combination
of labels). The label- or subclass-specific k-mer model is scanned across the original genomic
sites to identify focused regions (which we term “hills”) that contain discriminative sequence
signals (Fig 1C). Finally, to aid interpretability, SeqUnwinder identifies over-represented
motifs in the hills and scores them using label- and subclass-specific k-mer models (Fig 1D).
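One way to write down the model described in this paragraph is sketched below. The notation is an assumption made for illustration and is not taken from the Methods: $x_i$ is the k-mer frequency vector of site $i$, $y_i$ its subclass, $\mathcal{S}$ the set of subclasses, $w_s$ the weight vector of subclass $s$, $w_\ell$ the weight vector of label $\ell$, $\mathcal{L}(s)$ the set of labels that define subclass $s$, and $\lambda$ the regularization strength. The exact objective used by SeqUnwinder may contain additional terms.

```latex
\min_{\{w_s\},\,\{w_\ell\}} \;
  -\sum_{i=1}^{N} \log
    \frac{\exp\!\left(w_{y_i}^{\top} x_i\right)}
         {\sum_{s \in \mathcal{S}} \exp\!\left(w_{s}^{\top} x_i\right)}
  \;+\; \lambda \sum_{s \in \mathcal{S}} \; \sum_{\ell \in \mathcal{L}(s)}
    \left\lVert w_s - w_\ell \right\rVert_1
```

In this form the label weights $w_\ell$ appear only in the penalty, so they are shaped solely by the subclass weights they are tied to. For fixed subclass weights, each coordinate of $w_\ell$ that minimizes the penalty is a median of the corresponding coordinates of the tied $w_s$, which is one way to see why a label's model favors k-mers that are consistently weighted across its subclasses.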
SeqUnwinder is easy to use, taking as input a list of DNA sequences or genomic coordinates
that are each annotated with a set of user-defined labels. The labels can come from any source,
enabling a high degree of analysis flexibility. SeqUnwinder implements a multi-threaded ver-
sion of the ADMM [12] framework to train the model and typically runs in less than a few
hours for most datasets. Output includes both k-mer models and position-specific scoring
matrices and weights associating these motifs with each subclass and label.
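The steps from labeled input to hills can be illustrated with a short sketch. It assumes sites are given as (sequence, label set) pairs, uses raw k-mer counts as predictors, and treats the single highest-scoring fixed-width window as a stand-in for a hill; the function names, the toy sequences, and the k-mer weights are all hypothetical and do not reproduce SeqUnwinder's implementation.

```python
from collections import Counter

def subclass_of(labels):
    """Name a subclass by the combination of labels annotated at a site."""
    return "+".join(sorted(labels))

def kmer_counts(seq, k):
    """Count overlapping k-mers in a DNA sequence (predictors for the classifier)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def window_scores(seq, weights, k, width):
    """Score each `width`-bp window as the sum of weights of the k-mers inside it."""
    per_kmer = [weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1)]
    n_kmers = width - k + 1  # number of k-mers fully contained in one window
    return [sum(per_kmer[i:i + n_kmers])
            for i in range(len(per_kmer) - n_kmers + 1)]

def best_window(seq, weights, k, width):
    """Return (start, score) of the highest-scoring window, a stand-in for a hill."""
    scores = window_scores(seq, weights, k, width)
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, scores[start]

# Hypothetical input: sequences annotated with user-defined labels.
sites = [
    ("TTTTTGGGCGGGTTTTT", {"A", "Pr"}),
    ("ACGTACGTACGTACGTA", {"A", "Di"}),
]
subclasses = [subclass_of(labels) for _, labels in sites]
features = [kmer_counts(seq, 4) for seq, _ in sites]

# Hypothetical learned weights for label "Pr": GC-rich 4-mers score positively.
pr_weights = {"GGGC": 1.0, "GGCG": 1.0, "GCGG": 1.0, "CGGG": 1.0}
start, score = best_window(sites[0][0], pr_weights, k=4, width=10)
print(subclasses, start, score)
```

Summing per-k-mer weights over a window keeps the scan linear in sequence length for a fixed window width, and the same routine can be reused with either a label-specific or a subclass-specific weight table.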
SeqUnwinder deconvolves sequence features associated with
overlapping labels
To demonstrate the properties of SeqUnwinder, we simulated 9,000 regulatory regions and
annotated each of them with labels from two overlapping sets: {A, B, C} and {X, Y} (Fig 2A). We
assigned a different motif to each label. At 70% of the sequences associated with each label, we
inserted appropriate motif instances by sampling from the distributions defined by the position-specific
scoring matrices of the label-assigned motifs (Fig 2A). We used this collection of
sequences and label assignments to compare SeqUnwinder with a simple multi-class classification
approach (MCC). In MCC training, each label was treated as a distinct class and therefore
each regulatory sequence was included multiple times in accordance with its annotated labels.
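The simulation procedure can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the background model (uniform random bases), the sequence length, the 4-bp motifs, and the function names are all assumptions; only the 70% insertion rate and the idea of sampling instances from position-specific matrices come from the text.

```python
import random

BASES = "ACGT"

def sample_from_pwm(pwm, rng):
    """Draw one motif instance; each row of `pwm` holds P(A), P(C), P(G), P(T)."""
    return "".join(rng.choices(BASES, weights=row, k=1)[0] for row in pwm)

def simulate_site(length, motifs, insert_rate, rng):
    """Uniform random background with each motif inserted with prob. `insert_rate`."""
    seq = [rng.choice(BASES) for _ in range(length)]
    for pwm in motifs:
        if rng.random() < insert_rate:
            pos = rng.randrange(length - len(pwm) + 1)
            # Overwrite background bases in place; a later motif may overlap
            # an earlier one in this simplified sketch.
            seq[pos:pos + len(pwm)] = sample_from_pwm(pwm, rng)
    return "".join(seq)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical 4-bp motifs, one per label (rows: positions; columns: A, C, G, T).
label_motifs = {
    "A": [[0.8, 0.1, 0.05, 0.05], [0.05, 0.8, 0.1, 0.05],
          [0.05, 0.1, 0.8, 0.05], [0.05, 0.05, 0.1, 0.8]],
    "X": [[0.05, 0.05, 0.1, 0.8]] * 4,
}

# A site annotated with labels A and X receives each label's motif at a 70% rate.
site_labels = ["A", "X"]
site = simulate_site(100, [label_motifs[l] for l in site_labels], 0.7, rng)
print(len(site))
```

Because every label draws its insertion decision independently, sites that carry two overlapping labels can contain both motifs, one, or neither, which is what makes the motif-label association non-trivial to recover.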

SeqUnwinder and the MCC model correctly identify motifs similar to all inserted motifs
(Fig 2B). However, the MCC approach makes several incorrect motif-label associations, poten-
tially due to high overlap between labels. In contrast, the label-specific scores of the identified
motifs in the SeqUnwinder model are not confounded by overlap between annotation labels.
For example, even though labels X and A highly overlap, SeqUnwinder correctly assigns each
motif to its respective label.
Next, we assessed the performance of SeqUnwinder at different levels of label overlap. We
simulated 100 datasets with 6000 simulated sequences, varying the degree of overlap between
two sets of labels ({A, B} and {X, Y}) from 50% to 99% (Fig 2C). We then compared SeqUnwinder
with MCC and DREME [1], a popular discriminative motif discovery tool. Since DREME
takes only two classes as input (a foreground set and a background set), we ran four separate
DREME runs, one for each of the four labels. We calculated the true positive (discovered motif
Fig 2. Performance of SeqUnwinder on simulated datasets. (A) 9000 simulated genomic sites with corresponding motif associations. (B) Label-
specific scores for all de novo motifs identified using MCC (left) and SeqUnwinder (right) models on simulated genomic sites in “A”. For consistency across
figures, we fix the color saturation values to -0.4 and 0.4. (C) Schematic showing 100 genomic datasets with 6000 genomic sites and varying degrees of
label overlap ranging from 0.5 to 0.99. (D) Performance of MCC (multi-class logistic classifier), DREME, and SeqUnwinder on simulated datasets in “C”,
measured using the F1-score, (E) true positive rates, and (F) false positive rates.
https://doi.org/10.1371/journal.pcbi.1005795.g002

