Author

Wing-Kin Sung

Bio: Wing-Kin Sung is an academic researcher from the National University of Singapore. The author has contributed to research on topics including Gene & Chromatin immunoprecipitation. The author has an h-index of 64 and has co-authored 327 publications receiving 26,116 citations. Previous affiliations of Wing-Kin Sung include the University of Hong Kong & Yale University.


Papers
01 Jan 2016
TL;DR: It is proved that the problem of determining whether a given set of rooted triplets can be merged without conflicts into a galled phylogenetic network (and, if so, constructing such a network) becomes NP-hard when extended to non-dense inputs.

54 citations

Journal ArticleDOI
TL;DR: A novel web-based co-motif scanning program, which exploits the imbalanced nature of co-TF binding, is developed; it is a user-friendly, parameter-less and powerful predictive web-based program for understanding the mechanism of transcriptional co-regulation.
Abstract: Transcription factors (TFs) do not function alone but work together with other TFs (called co-TFs) in a combinatorial fashion to precisely control the transcription of target genes. Mining co-TFs is thus important to understand the mechanism of transcriptional regulation. Although existing methods can identify co-TFs, their accuracy depends heavily on the chosen background model and other parameters such as the enrichment window size and the PWM score cut-off. In this study, we have developed a novel web-based co-motif scanning program called CENTDIST (http://compbio.ddns.comp.nus.edu.sg/~chipseq/centdist/). In comparison to current co-motif scanning programs, CENTDIST does not require the input of any user-specific parameters and background information. Instead, CENTDIST automatically determines the best set of parameters and ranks co-TF motifs based on their distribution around ChIP-seq peaks. We tested CENTDIST on 14 ChIP-seq data sets and found that CENTDIST is more accurate than existing methods. In particular, we applied CENTDIST on an Androgen Receptor (AR) ChIP-seq data set from a prostate cancer cell line and correctly predicted all known co-TFs (eight TFs) of AR in the top 20 hits, and discovered AP4 as a novel co-TF of AR (which was missed by existing methods). Taken together, CENTDIST, which exploits the imbalanced nature of co-TF binding, is a user-friendly, parameter-less and powerful predictive web-based program for understanding the mechanism of transcriptional co-regulation.
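The core idea, ranking a candidate co-TF motif by how tightly its matches cluster around ChIP-seq peak centers, can be sketched with a toy enrichment statistic. This is not CENTDIST's actual scoring function; the window sizes and the function name below are illustrative assumptions.

```python
def centrality_score(offsets, half_width=50, flank=500):
    """Score how concentrated motif hits are near ChIP-seq peak centers.

    offsets: motif match positions relative to peak centers, assumed to
    lie in [-flank, +flank]. Returns the enrichment of hits inside the
    central window over the uniform expectation (a toy stand-in for
    CENTDIST's distribution-based ranking).
    """
    if not offsets:
        return 0.0
    central = sum(1 for o in offsets if abs(o) <= half_width)
    # expected hits in the central window if matches were uniform
    expected = len(offsets) * (2 * half_width + 1) / (2 * flank + 1)
    return central / expected
```

A motif whose matches pile up at peak centers scores well above 1; a uniformly scattered motif scores near 1.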

53 citations

Journal ArticleDOI
15 Apr 2009-PLOS ONE
TL;DR: The results suggest that careful analysis of control libraries can reveal promising biological insights; the observed biases can be attributed to genomic copy number, inherent sequencing bias, plausible mapping ambiguity, and cell-type specific chromatin structure.
Abstract: Background The growth of sequencing-based Chromatin Immuno-Precipitation studies calls for a more in-depth understanding of the nature of the technology and of the resultant data to reduce false positives and false negatives. Control libraries are typically constructed to complement such studies in order to mitigate the effect of systematic biases that might be present in the data. In this study, we explored multiple control libraries to obtain better understanding of what they truly represent. Methodology First, we analyzed the genome-wide profiles of various sequencing-based libraries at a low resolution of 1 Mbp, and compared them with each other as well as against aCGH data. We found that copy number has a major influence on both ChIP-enriched as well as control libraries. Following that, we inspected the repeat regions to assess the extent of mapping bias. Next, significantly tag-rich 5 kbp regions were identified and they were associated with various genomic landmarks. For instance, we discovered that gene boundaries were surprisingly enriched with sequenced tags. Further, profiles between different cell types were noticeably distinct although the cell types were somewhat related and similar. Conclusions We found that control libraries bear traces of systematic biases. The biases can be attributed to genomic copy number, inherent sequencing bias, plausible mapping ambiguity, and cell-type specific chromatin structure. Our results suggest careful analysis of control libraries can reveal promising biological insights.
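The first analysis step, profiling a library at low resolution, amounts to counting mapped tags per fixed-size genomic bin. A minimal sketch, assuming tag positions on a single chromosome and the 1 Mbp bin size from the abstract (the function name is illustrative):

```python
from collections import Counter

def bin_tag_counts(tag_positions, bin_size=1_000_000):
    """Profile a sequencing library at low resolution by counting
    mapped tags per fixed-size genomic bin (e.g. 1 Mbp windows).

    Returns {bin_index: tag_count}; bins with no tags are omitted.
    """
    return dict(Counter(pos // bin_size for pos in tag_positions))
```

Bin-level profiles of a ChIP library and its control (or aCGH data) can then be compared directly, which is how copy-number effects show up at this resolution.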

51 citations

Posted Content
TL;DR: In this article, the authors present an algorithm for comparing trees that are labeled in an arbitrary manner; it is faster than previous algorithms and rests on a new hierarchical bipartite matching problem at the core of their maximum agreement subtree algorithm.
Abstract: A widely used method for determining the similarity of two labeled trees is to compute a maximum agreement subtree of the two trees. Previous work on this similarity measure is only concerned with the comparison of labeled trees of two special kinds, namely, uniformly labeled trees (i.e., trees with all their nodes labeled by the same symbol) and evolutionary trees (i.e., leaf-labeled trees with distinct symbols for distinct leaves). This paper presents an algorithm for comparing trees that are labeled in an arbitrary manner. In addition to this generality, this algorithm is faster than the previous algorithms. Another contribution of this paper is on maximum weight bipartite matchings. We show how to speed up the best known matching algorithms when the input graphs are node-unbalanced or weight-unbalanced. Based on these enhancements, we obtain an efficient algorithm for a new matching problem called the hierarchical bipartite matching problem, which is at the core of our maximum agreement subtree algorithm.
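For intuition, the maximum agreement subtree recursion can be sketched on small rooted trees: at each pair of nodes, either contract an edge on one side, or match the children of the two nodes against each other. The brute-force matching over permutations below stands in for the paper's fast bipartite matching machinery; the tuple encoding and names are illustrative.

```python
from functools import lru_cache
from itertools import permutations

def mast_size(t1, t2):
    """Size of a maximum agreement subtree of two rooted leaf-labeled
    trees, encoded as nested tuples with string leaves. A small
    exponential sketch of the classic recursion, not the paper's fast
    matching-based algorithm."""
    @lru_cache(maxsize=None)
    def go(a, b):
        if isinstance(a, str) and isinstance(b, str):
            return 1 if a == b else 0          # two leaves agree iff same label
        best = 0
        if isinstance(b, tuple):               # contract an edge in b
            best = max(best, max(go(a, c) for c in b))
        if isinstance(a, tuple):               # contract an edge in a
            best = max(best, max(go(c, b) for c in a))
        if isinstance(a, tuple) and isinstance(b, tuple):
            # match children of a against children of b (brute force;
            # this is where the paper plugs in bipartite matching)
            small, big = (a, b) if len(a) <= len(b) else (b, a)
            for chosen in permutations(big, len(small)):
                best = max(best, sum(go(s, c) for s, c in zip(small, chosen)))
        return best
    return go(t1, t2)
```

For example, `(("a","b"),("c","d"))` against `(("a","c"),("b","d"))` yields 2: no three leaves induce the same topology in both trees, but any pair of leaves does.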

51 citations

Book ChapterDOI
15 Dec 2003
TL;DR: This paper generalizes the study and shows that even for k = O(lg lg n), both query and update operations can be supported with the same time complexities; moreover, the update time bound becomes worst-case rather than amortized.
Abstract: The Searchable Partial Sums is a data structure problem that maintains a sequence of n non-negative k-bit integers; in addition, it allows us to modify the entries by the update operation, while supporting two types of queries: sum and search. Recently, researchers have focused on the succinct representation of the data structure in kn+o(kn) bits. They study the tradeoff in time between the query and the update operations, under the word RAM with word size O(lg U) bits. For the special case where k=1 (which is known as the Dynamic Bit Vector problem), Raman et al. showed that both queries can be supported in O(log_b n) time, while update requires O(b) amortized time for any b with lg n/lg lg n ≤ b ≤ n. This paper generalizes the study and shows that even for k = O(lg lg n), both query and update operations can be maintained using the same time complexities. Also, the time for update becomes worst-case time.
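A standard (non-succinct) structure supporting all three operations in O(log n) time is a Fenwick tree; the sketch below illustrates the sum/update/search interface only, not the paper's succinct kn+o(kn)-bit representation or its time bounds. The class name is illustrative.

```python
class PartialSums:
    """Searchable partial sums over n non-negative integers, backed by
    a Fenwick (binary indexed) tree: O(log n) sum, update, and search.
    A plain word-RAM sketch, not the paper's succinct structure."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values, 1):
            self.update(i, v)

    def update(self, i, delta):
        """A[i] += delta (1-indexed)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Return A[1] + ... + A[i]."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def search(self, x):
        """Smallest i with prefix_sum(i) >= x (assumes 1 <= x <= total)."""
        pos, rem = 0, x
        bit = 1 << self.n.bit_length()
        while bit:                      # binary lifting over the tree
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] < rem:
                pos = nxt
                rem -= self.tree[nxt]
            bit >>= 1
        return pos + 1
```

For instance, over the sequence [3, 0, 5, 2], `prefix_sum(3)` is 8 and `search(4)` is 3, the first index whose prefix sum reaches 4.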

49 citations


Cited by
Journal ArticleDOI
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with the Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented the Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with the Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]
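The backward-search idea at the heart of BWA can be illustrated at toy scale: build the BWT of the reference, then narrow a suffix-array interval one pattern character at a time, right to left. A real FM-index uses sampled occurrence tables rather than the linear scans below, and BWA adds inexact matching on top; this sketch shows exact counting only.

```python
def bwt(text):
    """Burrows-Wheeler transform of text (with a '$' terminator appended),
    built naively by sorting all rotations."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def backward_search(bwt_str, pattern):
    """Count exact occurrences of pattern via backward search on the BWT.
    occ() rescans the string each time; an FM-index samples these counts."""
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0                     # C[c] = # characters < c
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def occ(ch, i):                      # occurrences of ch in bwt_str[:i]
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)             # current suffix-array interval
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ(ch, lo)
        hi = C[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `bwt("banana")` is `"annb$aa"`, and backward search for `"ana"` on it returns 2, the number of occurrences in the text.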

43,862 citations

Journal ArticleDOI
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches; multiple processor cores can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

20,335 citations

Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the human genes and genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
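The mail-filtering example in the fourth category can be made concrete with a minimal Naive Bayes classifier that learns per-user reject rules from labeled messages. This is a textbook sketch with hypothetical names and labels, not any particular production filter; it assumes at least one training example per class.

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Tiny Naive Bayes text filter: learns which messages a user
    rejects from labeled examples instead of hand-written rules."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            log_p = math.log(self.doc_counts[label] / total_docs)  # prior
            n = sum(self.word_counts[label].values())
            for w in text.lower().split():
                # Laplace-smoothed per-word likelihood
                log_p += math.log((self.word_counts[label][w] + 1) / (n + vocab))
            scores[label] = log_p
        return max(scores, key=scores.get)
```

After a handful of training messages per class, the filter generalizes to unseen messages built from similar vocabulary, which is exactly the "learn the rules from examples" point of the passage.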

13,246 citations

Journal ArticleDOI
TL;DR: This work presents Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short read sequencers such as Solexa's Genome Analyzer, and uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions.
Abstract: We present Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short read sequencers such as Solexa's Genome Analyzer. MACS empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, and is freely available.
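The dynamic Poisson idea, taking the most conservative (largest) of several background rate estimates before computing a tag-count p-value, can be sketched as follows. This is a simplified illustration of the concept, not MACS itself; the window-rate parameter names are illustrative.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def dynamic_lambda(lam_bg, lam_1k, lam_5k, lam_10k):
    """Dynamic Poisson rate: the largest (most conservative) of the
    genome-wide and local background estimates around a candidate peak."""
    return max(lam_bg, lam_1k, lam_5k, lam_10k)

def peak_pvalue(count, lam_bg, lam_1k, lam_5k, lam_10k):
    """P-value of observing `count` tags under the local Poisson model."""
    return poisson_sf(count, dynamic_lambda(lam_bg, lam_1k, lam_5k, lam_10k))
```

Using the local maximum rather than the genome-wide rate alone penalizes candidate peaks sitting in locally tag-rich regions (e.g. copy-number gains), which is what makes the predictions more robust.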

13,008 citations