Showing papers by "Wing-Kin Sung" published in 2011


Journal ArticleDOI
TL;DR: Five distinct chromatin domains are uncovered that suggest potential new models of CTCF function in chromatin organization and transcriptional control, and demarcate chromatin-nuclear membrane attachments and influence proper gene expression through extensive cross-talk between promoters and regulatory elements.
Abstract: Mammalian genomes are viewed as functional organizations that orchestrate spatial and temporal gene regulation. CTCF, the most characterized insulator-binding protein, has been implicated as a key genome organizer. However, little is known about CTCF-associated higher-order chromatin structures at a global scale. Here we applied chromatin interaction analysis by paired-end tag (ChIA-PET) sequencing to elucidate the CTCF-chromatin interactome in pluripotent cells. From this analysis, we identified 1,480 cis- and 336 trans-interacting loci with high reproducibility and precision. Associating these chromatin interaction loci with their underlying epigenetic states, promoter activities, enhancer binding and nuclear lamina occupancy, we uncovered five distinct chromatin domains that suggest potential new models of CTCF function in chromatin organization and transcriptional control. Specifically, CTCF interactions demarcate chromatin-nuclear membrane attachments and influence proper gene expression through extensive cross-talk between promoters and regulatory elements. This highly complex nuclear organization offers insights toward the unifying principles that govern genome plasticity and function.

642 citations


Journal ArticleDOI
TL;DR: The Assemblathon 1 competition is described, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies, and it is established that it is possible to assemble the genome to a high level of coverage and accuracy.
Abstract: Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.

548 citations


Journal ArticleDOI
TL;DR: This work explored the feasibility of an exact solution for scaffolding and presented a first tractable solution, and described a graph contraction procedure that allows the solution to scale to large scaffolding problems and demonstrate this by scaffolding several large real and synthetic datasets.
Abstract: Scaffolding, the problem of ordering and orienting contigs, typically using paired-end reads, is a crucial step in the assembly of high-quality draft genomes. Even as sequencing technologies and mate-pair protocols have improved significantly, scaffolding programs still rely on heuristics, with no guarantees on the quality of the solution. In this work, we explored the feasibility of an exact solution for scaffolding and present a first tractable solution for this problem (Opera). We also describe a graph contraction procedure that allows the solution to scale to large scaffolding problems and demonstrate this by scaffolding several large real and synthetic datasets. In comparisons with existing scaffolders, Opera simultaneously produced longer and more accurate scaffolds demonstrating the utility of an exact approach. Opera also incorporates an exact quadratic programming formulation to precisely compute gap sizes (Availability: http://sourceforge.net/projects/operasf/).

205 citations


Journal ArticleDOI
TL;DR: It is shown that the transfection of all three TFs was necessary to reprogramme the ERα‐negative MDA‐MB‐231 and BT‐549 cells to restore the estrogen‐responsive growth resembling estrogen‐treated ERα‐positive MCF‐7 cells.
Abstract: Despite the role of the estrogen receptor α (ERα) pathway as a key growth driver for breast cells, the phenotypic consequence of exogenous introduction of ERα into ERα-negative cells paradoxically has been growth inhibition. We mapped the binding profiles of ERα and its interacting transcription factors (TFs), FOXA1 and GATA3 in MCF-7 breast carcinoma cells, and observed that these three TFs form a functional enhanceosome that regulates the genes driving core ERα function and cooperatively modulate the transcriptional networks previously ascribed to ERα alone. We demonstrate that these enhanceosome occupied sites are associated with optimal enhancer characteristics with highest p300 co-activator recruitment, RNA Pol II occupancy, and chromatin opening. Most importantly, we show that the transfection of all three TFs was necessary to reprogramme the ERα-negative MDA-MB-231 and BT-549 cells to restore the estrogen-responsive growth resembling estrogen-treated ERα-positive MCF-7 cells. Cumulatively, these results suggest that all the enhanceosome components comprising ERα, FOXA1, and GATA3 are necessary for the full repertoire of cancer-associated effects of the ERα.

176 citations


Journal ArticleDOI
TL;DR: AP-2γ, which has been implicated in breast cancer oncogenesis, binds to ERBS in a ligand-independent manner and is suggested to be a novel collaborative factor in ERα-mediated transcription.
Abstract: Oestrogen receptor α (ERα) is a key player in the progression of breast cancer. Recently, the cistrome and interactome of ERα were mapped in breast cancer cells, revealing the importance of spatial organization in oestrogen-mediated transcription. However, the underlying mechanism of this process is unclear. Here, we show that ERα binding sites (ERBS) identified from the Chromatin Interaction Analysis-Paired End DiTag of ERα are enriched for AP-2 motifs. We demonstrate that the transcription factor AP-2γ, which has been implicated in breast cancer oncogenesis, binds to ERBS in a ligand-independent manner. Furthermore, perturbation of AP-2γ expression impaired ERα DNA binding, long-range chromatin interactions, and gene transcription. In genome-wide analyses, we show that a large number of AP-2γ and ERα binding events converge across the genome. The majority of these shared regions are also occupied by the pioneer factor, FoxA1. Molecular studies indicate there is functional interplay between AP-2γ and FoxA1. Finally, we show that most ERBS associated with long-range chromatin interactions are colocalized with AP-2γ and FoxA1. Together, our results suggest AP-2γ is a novel collaborative factor in ERα-mediated transcription.

152 citations


Book ChapterDOI
28 Mar 2011
TL;DR: This work explored the feasibility of an exact solution for scaffolding and presented a first fixed-parameter tractable solution for assembly (Opera), and described a graph contraction procedure that allows the solution to scale to large scaffolding problems.
Abstract: Scaffolding, the problem of ordering and orienting contigs, typically using paired-end reads, is a crucial step in the assembly of high-quality draft genomes. Even as sequencing technologies and mate-pair protocols have improved significantly, scaffolding programs still rely on heuristics, with no guarantees on the quality of the solution. In this work we explored the feasibility of an exact solution for scaffolding and present a first fixed-parameter tractable solution for assembly (Opera). We also describe a graph contraction procedure that allows the solution to scale to large scaffolding problems and demonstrate this by scaffolding several large real and synthetic datasets. In comparisons with existing scaffolders, Opera simultaneously produced longer and more accurate scaffolds, demonstrating the utility of an exact approach. Opera also incorporates an exact quadratic programming formulation to precisely compute gap sizes.

126 citations


Journal ArticleDOI
TL;DR: The quantitative and connective nature of DNA-PET data is precise in delineating the genealogy of complex rearrangement events, and it is discovered that large duplications are among the initial rearrangements that trigger genome instability for extensive amplification in epithelial cancers.
Abstract: Somatic genome rearrangements are thought to play important roles in cancer development. We optimized a long-span paired-end-tag (PET) sequencing approach using 10-Kb genomic DNA inserts to study human genome structural variations (SVs). The use of a 10-Kb insert size allows the identification of breakpoints within repetitive or homology-containing regions of a few kilobases in size and results in a higher physical coverage compared with small insert libraries with the same sequencing effort. We have applied this approach to comprehensively characterize the SVs of 15 cancer and two noncancer genomes and used a filtering approach to strongly enrich for somatic SVs in the cancer genomes. Our analyses revealed that most inversions, deletions, and insertions are germ-line SVs, whereas tandem duplications, unpaired inversions, interchromosomal translocations, and complex rearrangements are over-represented among somatic rearrangements in cancer genomes. We demonstrate that the quantitative and connective nature of DNA–PET data is precise in delineating the genealogy of complex rearrangement events, we observe signatures that are compatible with breakage-fusion-bridge cycles, and we discover that large duplications are among the initial rearrangements that trigger genome instability for extensive amplification in epithelial cancers. [Supplemental material is available for this article. The sequencing data from this study have been submitted to NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo) under accession no. GSE26954.]

87 citations


Journal ArticleDOI
TL;DR: It is found that single segmental tandem duplication spanning several genes is a major source of the fusion gene transcripts in both cell lines and primary tumors involving adjacent genes placed in the reverse-order position by the duplication event.
Abstract: Using a long-span, paired-end deep sequencing strategy, we have comprehensively identified cancer genome rearrangements in eight breast cancer genomes. Herein, we show that 40%–54% of these structural genomic rearrangements result in different forms of fusion transcripts and that 44% are potentially translated. We find that single segmental tandem duplication spanning several genes is a major source of the fusion gene transcripts in both cell lines and primary tumors involving adjacent genes placed in the reverse-order position by the duplication event. Certain other structural mutations, however, tend to attenuate gene expression. From these candidate gene fusions, we have found a fusion transcript (RPS6KB1–VMP1) recurrently expressed in ∼30% of breast cancers associated with potential clinical consequences. This gene fusion is caused by tandem duplication on 17q23 and appears to be an indicator of local genomic instability altering the expression of oncogenic components such as MIR21 and RPS6KB1.

87 citations


Journal ArticleDOI
TL;DR: A novel web-based co-motif scanning program that exploits the imbalanced nature of co-TF binding is developed; it is a user-friendly, parameter-less and powerful predictive web-based program for understanding the mechanism of transcriptional co-regulation.
Abstract: Transcription factors (TFs) do not function alone but work together with other TFs (called co-TFs) in a combinatorial fashion to precisely control the transcription of target genes. Mining co-TFs is thus important to understand the mechanism of transcriptional regulation. Although existing methods can identify co-TFs, their accuracy depends heavily on the chosen background model and other parameters such as the enrichment window size and the PWM score cut-off. In this study, we have developed a novel web-based co-motif scanning program called CENTDIST (http://compbio.ddns.comp.nus.edu.sg/~chipseq/centdist/). In comparison to current co-motif scanning programs, CENTDIST does not require the input of any user-specific parameters and background information. Instead, CENTDIST automatically determines the best set of parameters and ranks co-TF motifs based on their distribution around ChIP-seq peaks. We tested CENTDIST on 14 ChIP-seq data sets and found CENTDIST is more accurate than existing methods. In particular, we applied CENTDIST on an Androgen Receptor (AR) ChIP-seq data set from a prostate cancer cell line and correctly predicted all known co-TFs (eight TFs) of AR in the top 20 hits as well as discovering AP4 as a novel co-TF of AR (which was missed by existing methods). Taken together, CENTDIST, which exploits the imbalanced nature of co-TF binding, is a user-friendly, parameter-less and powerful predictive web-based program for understanding the mechanism of transcriptional co-regulation.

53 citations


Journal ArticleDOI
TL;DR: A method that eschews the traditional graph-based approach in favor of a simple 3' extension approach that has potential to be massively parallelized and able to obtain assemblies that are more contiguous, complete and less error prone compared with existing methods is presented.
Abstract: Motivation: Many de novo genome assemblers have been proposed recently. Most existing methods rely on the de Bruijn graph: a complex graph structure that attempts to encompass the entire genome. Such graphs can be prohibitively large, may fail to capture subtle information and are difficult to parallelize. Result: We present a method that eschews the traditional graph-based approach in favor of a simple 3′ extension approach that has the potential to be massively parallelized. Our results show that it is able to obtain assemblies that are more contiguous, more complete and less error prone compared with existing methods. Availability: The software package can be found at http://www.comp.nus.edu.sg/~bioinfo/peasm/. Alternatively, it is available from the authors upon request. Contact: [email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

47 citations


Journal ArticleDOI
TL;DR: This paper shows that even for k = O(lg lg n), the authors can index A succinctly such that both query and update operations can be supported using the same time complexities, and the time for update becomes the worst-case time.

Journal ArticleDOI
TL;DR: An evaluation of the algorithms shows that they are useful for identifying ncRNA molecules in other species that are in the same family as a known ncRNA.
Abstract: The secondary structure of an ncRNA molecule is known to play an important role in its biological functions. Aligning a known ncRNA to a target candidate to determine the sequence and structural similarity helps in identifying de novo ncRNA molecules that are in the same family as the known ncRNA. However, existing algorithms cannot handle complex pseudoknot structures which are found in nature. In this article, we propose algorithms to handle two types of complex pseudoknots: simple non-standard pseudoknots and recursive pseudoknots. Although our methods are not designed for general pseudoknots, they already cover all known ncRNAs in both the Rfam and PseudoBase databases. An evaluation of our algorithms shows that they are useful for identifying ncRNA molecules in other species that are in the same family as a known ncRNA.

Journal ArticleDOI
TL;DR: A stepwise topological transformation of the genome is introduced from 1‐dimension (1D, linear) to 2‐dimension (2D, networks) to 3‐dimension (3D, architecture) to discuss how such transformations could advance the understanding of genome biology.
Abstract: The eukaryotic genome is organized, not only linearly but also spatially, into a non-random architecture. Though the linear organization of genes and their epigenetic descriptors are well characterized, the relevance of their spatial organization has only recently begun to unfold. It is increasingly being recognized that physical interactions among distant genomic elements could serve as an important means of eukaryotic genome regulation. With the advent of proximity ligation based techniques coupled with next generation sequencing, it is now possible to explore whole genome chromatin interactions at high resolution. Emerging data on genome-wide chromatin interactions suggest that distantly located genes are not independent entities and instead cross-talk with each other in an extensive manner, supporting the notion of “chromatin interaction networks”. Moreover, the data also advance the field to “3-dimensional (3D) chromatin structure and dynamics”, which would enable molecular biologists to explore the spatiotemporal regulation of the genome. In this article, we introduce a stepwise topological transformation of the genome from 1-dimension (1D, linear) to 2-dimension (2D, networks) to 3-dimension (3D, architecture) and discuss how such transformations could advance our understanding of genome biology. J. Cell. Biochem. 112: 2218–2221, 2011. © 2011 Wiley-Liss, Inc.

Journal ArticleDOI
TL;DR: In this paper, the problem of indexing a text S[1..n] for pattern matching with up to k errors was revisited, and an O(n)-space index that supports k-error matching was presented.
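
For orientation, the query that such an index answers can be sketched with a plain dynamic-programming scan. This is not the index from the paper: it is the classical online approach (often attributed to Sellers), it takes O(nm) time per query, and it assumes that "errors" means edit operations (insertions, deletions, substitutions). The function name `k_error_matches` is hypothetical.

```python
def k_error_matches(text, pattern, k):
    """Return 1-based end positions j such that some substring of text
    ending at j is within edit distance k of pattern.

    Illustrative baseline only; assumes 'errors' are edit operations.
    """
    m = len(pattern)
    # column[i] holds the minimum edit distance between pattern[:i] and
    # any substring of text ending at the current text position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = column[0]  # row 0 of the previous column (always 0)
        column[0] = 0          # a match may start anywhere in the text
        for i in range(1, m + 1):
            prev_row = column[i]  # row i of the previous column
            cost = 0 if pattern[i - 1] == ch else 1
            column[i] = min(prev_diag + cost,   # match or substitution
                            prev_row + 1,       # extra character in text
                            column[i - 1] + 1)  # extra character in pattern
            prev_diag = prev_row
        if column[m] <= k:
            ends.append(j)
    return ends
```

For example, `k_error_matches("abcde", "bxd", 1)` reports position 4, because the substring "bcd" differs from the pattern by one substitution. An index avoids rescanning the whole text for every query, which is the point of the result summarized above.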

Journal ArticleDOI
TL;DR: Although determining whether a coloring is convex on an arbitrary network is hard, it can be found efficiently on galled networks and a fixed parameter tractable algorithm is presented that finds the recoloring distance of such a network whose running time is quadratic in the network size and exponential in that distance.
Abstract: A coloring of a graph is convex if the vertices that pertain to any color induce a connected subgraph; a partial coloring (which assigns colors to a subset of the vertices) is convex if it can be completed to a convex (total) coloring. Convex coloring has applications in fields such as phylogenetics, communication or transportation networks, etc. When a coloring of a graph is not convex, a natural question is how far it is from a convex one. This problem is denoted as convex recoloring (CR). While the initial works on CR defined and studied the problem on trees, recent efforts aim at either generalizing the underlying graphs or specializing the input colorings. In this work, we extend the underlying graph and the input coloring to partially colored galled networks. We show that although determining whether a coloring is convex on an arbitrary network is hard, it can be found efficiently on galled networks. We present a fixed parameter tractable algorithm that finds the recoloring distance of such a network whose running time is quadratic in the network size and exponential in that distance. This complexity is achieved by amortized analysis that uses a novel technique for contracting colored graphs that seems to be of independent interest.
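
The definition of a convex total coloring in the abstract can be checked directly. The sketch below is a minimal illustration, not the paper's algorithm: it handles only total colorings (the partial case, which the abstract describes as hard on arbitrary networks, is not attempted), and the function name `is_convex_coloring` and the adjacency-dictionary representation are assumptions made for the example.

```python
from collections import deque


def is_convex_coloring(adjacency, coloring):
    """Return True if every color class induces a connected subgraph.

    adjacency: dict mapping each vertex to an iterable of its neighbors.
    coloring:  dict mapping every vertex to a color (a total coloring).
    """
    # Group vertices by color.
    classes = {}
    for vertex, color in coloring.items():
        classes.setdefault(color, set()).add(vertex)

    for members in classes.values():
        # Breadth-first search restricted to vertices of this color.
        start = next(iter(members))
        seen = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency.get(u, ()):
                if v in members and v not in seen:
                    seen.add(v)
                    queue.append(v)
        if seen != members:
            return False  # this color class is split into several pieces
    return True


# Path a - b - c - d (hypothetical toy graph).
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
convex = {"a": "red", "b": "red", "c": "blue", "d": "blue"}
not_convex = {"a": "red", "b": "blue", "c": "red", "d": "blue"}
```

On the toy path, `convex` passes the check, while `not_convex` fails because the two red vertices are separated by a blue one. The recoloring distance studied in the paper asks how far such a failing coloring is from a convex one.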

Journal ArticleDOI
08 Nov 2011-PLOS ONE
TL;DR: Next-generation sequencing of DNA fragments generated in Actinomycin D-treated human HL-60 leukemic cells was used to generate a high-throughput, global map of apoptotic DNA breakpoints, which highlighted that DNA breaks are non-random and show a significant association with active genes and open chromatin regions.
Abstract: DNA fragmentation is a well-recognized hallmark of apoptosis. However, the precise DNA sequences cleaved during apoptosis triggered by distinct mechanisms remain unclear. We used next-generation sequencing of DNA fragments generated in Actinomycin D-treated human HL-60 leukemic cells to generate a high-throughput, global map of apoptotic DNA breakpoints. These data highlighted that DNA breaks are non-random and show a significant association with active genes and open chromatin regions. We noted that transcription factor binding sites were also enriched within a fraction of the apoptotic breakpoints. Interestingly, extensive apoptotic cleavage was noted within genes that are frequently translocated in human cancers. We speculate that the non-random fragmentation of DNA during apoptosis may contribute to gene translocations and the development of human cancers.

Journal ArticleDOI
TL;DR: It is proved that even the very restricted case of determining if there exists a MUL tree consistent with the input and having just one leaf duplication is an NP-hard problem, and the general minimization problem is difficult to approximate.
Abstract: We investigate the computational complexity of inferring a smallest possible multilabeled phylogenetic tree (MUL tree) which is consistent with each of the rooted triplets in a given set. This problem has not been studied previously in the literature. We prove that even the very restricted case of determining if there exists a MUL tree consistent with the input and having just one leaf duplication is an NP-hard problem. Furthermore, we show that the general minimization problem is difficult to approximate, although a simple polynomial-time approximation algorithm achieves an approximation ratio close to our derived inapproximability bound. Finally, we provide an exact algorithm for the problem running in exponential time and space. As a by-product, we also obtain new, strong inapproximability results for two partitioning problems on directed graphs called Acyclic Partition and Acyclic Tree-Partition.

Proceedings ArticleDOI
06 Sep 2011
TL;DR: A method of data preprocessing and two different association rule mining approaches for discovering breast cancer regulatory mechanisms of gene module are developed.
Abstract: To gain insight into the regulatory mechanisms underlying the transcription process of gene expression, we need to understand the co-expressed gene sets under common regulatory mechanisms. Though computational methods have been developed to identify expression modules, challenges still remain for cancer-related gene expression profiling. In this paper, we have developed a method of data preprocessing and two different association rule mining approaches for discovering breast cancer regulatory mechanisms of gene modules. Our data preprocessing task involved two independent data sources: (a) a single breast cancer patient profile data file, (b) a candidate enhancer information data file. Using the integrated data, we also conducted four association rule mining experiments.

Journal ArticleDOI
TL;DR: The results imply the first polynomial time algorithms for both MASP and MCSP when both k and the maximum degree D of the input trees are constant.
Abstract: Consider a set of labels $L$ and a set of unordered trees $\mathcal{T}=\{\mathcal{T}^{(1)},\mathcal{T}^{(2)},\ldots,\mathcal{T}^{(k)}\}$ where each tree $\mathcal{T}^{(i)}$ is distinctly leaf-labeled by some subset of $L$. One fundamental problem is to find the biggest tree (denoted as supertree) to represent $\mathcal{T}$ which minimizes the disagreements with the trees in $\mathcal{T}$ under certain criteria. In this paper, we focus on two particular supertree problems, namely, the maximum agreement supertree problem (MASP) and the maximum compatible supertree problem (MCSP). These two problems are known to be NP-hard for $k \geq 3$. This paper gives improved algorithms for both MASP and MCSP. In particular, our results imply the first polynomial time algorithms for both MASP and MCSP when both $k$ and the maximum degree $D$ of the input trees are constant.

Journal ArticleDOI
TL;DR: This work proposes a method called D-SLIMMER to mine for SLiMs in PPI data on the basis of the interaction density between a nonlinear motif in one protein and a SLiM in the other protein, and shows that D-SLIMMER outperformed existing methods notably for discovering domain-SLiM interaction motifs.
Abstract: Many biologically important protein-protein interactions (PPIs) have been found to be mediated by short linear motifs (SLiMs). These interactions are mediated by the binding of a protein domain, often with a nonlinear interaction interface, to a SLiM. We propose a method called D-SLIMMER to mine for SLiMs in PPI data on the basis of the interaction density between a nonlinear motif (i.e., a protein domain) in one protein and a SLiM in the other protein. Our results on a benchmark of 113 experimentally verified reference SLiMs showed that D-SLIMMER outperformed existing methods notably for discovering domain-SLiM interaction motifs. To illustrate the significance of the SLiMs detected, we highlighted two SLiMs discovered from the PPI data by D-SLIMMER that are variants of the known ELM SLiM, as well as a literature-backed SLiM that is yet to be listed in the reference databases. We also presented a novel SLiM predicted by D-SLIMMER that was strongly supported by the existing biological literature. These examples showed that D-SLIMMER is able to find SLiMs that are biologically relevant.

Journal ArticleDOI
TL;DR: The raw sequences and processed data generated from this study can be downloaded with accession number GSE28247.
Abstract: Nat. Genet. 43, 630–638 (2011); published online 19 June; corrected after print 11 July 2011. In the version of this article initially published, the accession codes section contained inaccuracies. The raw sequences and processed data generated from this study can be downloaded with accession number GSE28247.

Book ChapterDOI
05 Dec 2011
TL;DR: This work studies strict and majority rule consensus MUL-trees, and presents the first ever polynomial-time algorithms for building a consensus MUL-tree, and shows that although it is NP-hard to find a majority rule consensus MUL-tree, the variant is unique and can be constructed efficiently.
Abstract: A MUL-tree is a generalization of a phylogenetic tree that allows the same leaf label to be used many times. Lott et al. [9,10] recently introduced the problem of inferring a so-called consensus MUL-tree from a set of conflicting MUL-trees and gave an exponential-time algorithm for a special greedy variant. Here, we study strict and majority rule consensus MUL-trees, and present the first ever polynomial-time algorithms for building a consensus MUL-tree. We give a simple, fast algorithm for building a strict consensus MUL-tree. We also show that although it is NP-hard to find a majority rule consensus MUL-tree, the variant which we call the singular majority rule consensus MUL-tree is unique and can be constructed efficiently.