Author

Wing-Kin Sung

Bio: Wing-Kin Sung is an academic researcher from the National University of Singapore. The author has contributed to research in topics: Gene & Chromatin immunoprecipitation. The author has an h-index of 64 and has co-authored 327 publications receiving 26,116 citations. Previous affiliations of Wing-Kin Sung include the University of Hong Kong and Yale University.


Papers
Proceedings ArticleDOI
01 Jan 2006
TL;DR: The PEM method outperforms previous versions of SAM and the spline-based EDGE approach in identifying genes of interest that are differentially expressed in various manners, and its statistic is incorporated into the permutation-based SAM framework for significance analysis.
Abstract: Replication of time series in microarray experiments is costly. To analyze time series data with no replicates, many model-specific approaches have been proposed. However, they fail to identify the genes whose expression patterns do not fit the pre-defined models. Besides, modeling the temporal expression patterns is difficult when the dynamics of gene expression in the experiment are poorly understood. We propose a method called PEM (Partial Energy ratio for Microarray) for the analysis of time course cDNA microarray data. In the PEM method, we assume that gene expression varies smoothly in the temporal domain. This assumption is comparatively weak and hence the method is general enough to identify genes expressed in unexpected patterns. To identify the differentially expressed genes, a new statistic is developed by comparing the energies of two convoluted profiles. We further improve the statistic for microarray analysis by introducing the concept of partial energy. The PEM statistic is incorporated into the permutation-based SAM framework for significance analysis. We evaluated the PEM method with an artificial dataset and two published time course cDNA microarray datasets on yeast. The experimental results show the robustness and the generality of the PEM method. It outperforms the previous versions of SAM and the spline-based EDGE approach in identifying genes of interest that are differentially expressed in various manners.
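The published PEM statistic is more involved (convolution-based energies plus the partial-energy refinement), but the underlying smoothness assumption can be illustrated with a small sketch. The snippet below is a hypothetical, simplified score, not the paper's statistic: it rewards profiles that keep most of their energy after low-pass convolution, and the kernel and all names are illustrative.

```python
import numpy as np

def smoothness_score(profile, kernel=(0.25, 0.5, 0.25)):
    """Illustrative score in the spirit of PEM's smoothness assumption:
    a gene whose expression varies smoothly over time keeps most of its
    energy after low-pass convolution, so the ratio below is large for
    smooth, strongly changing profiles and small for flat or noisy ones.
    (A simplified stand-in, not the published PEM statistic.)
    """
    x = np.asarray(profile, dtype=float)
    x = x - x.mean()                       # remove the constant baseline
    smooth = np.convolve(x, kernel, mode="same")
    residual = x - smooth
    # energy = sum of squares; the epsilon guards against division by zero
    return float(np.sum(smooth**2) / (np.sum(residual**2) + 1e-12))

# A smooth ramp scores far higher than white noise of similar magnitude.
print(smoothness_score([0, 1, 2, 3, 4, 5, 6, 7]))
print(smoothness_score(np.random.randn(8)))
```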

6 citations

Book ChapterDOI
04 Sep 2001
TL;DR: This paper models an online catalog organization as a decision tree structure and proposes a metric, based on the popularity of products and the relative importance of product attribute values, to evaluate the quality of a catalog organization; a greedy algorithm built on this metric produces better catalog organizations.
Abstract: The organization of a web site is important to help users get the most out of the site. Designing such an organization, however, is a complicated problem. Traditionally, this design is mainly done by hand; to what extent it can be automated is a challenging problem. Recently, there have been investigations on how to reorganize an existing web site based on some criteria, but none of them has addressed the problem of organizing a web site automatically from scratch. In this paper, we attempt to tackle this problem by restricting the domain to online catalog organization. We model an online catalog organization as a decision tree structure and propose a metric, based on the popularity of products and the relative importance of product attribute values, to evaluate the quality of a catalog organization. The problem is then formulated as a decision tree construction problem. Although traditional decision tree algorithms, such as C4.5, can be used to generate an online catalog organization, the catalogs they construct are generally not good under our metric. An efficient greedy algorithm (GENCAT) is thus developed, and the experimental results show that GENCAT produces better catalog organizations under our metric.
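Neither the paper's metric nor GENCAT is reproduced here, but the flavor of one greedy split step can be sketched under assumed definitions: weigh each branch's scan cost by the popularity mass it carries and pick the attribute that minimizes the total. The cost formula and all names below are hypothetical.

```python
from collections import defaultdict

def greedy_split(products, attributes):
    """Illustrative greedy step in the spirit of GENCAT: choose the
    attribute whose values partition the catalog so that popular
    products land in small branches, minimizing the expected number
    of products a visitor must scan.  `products` is a list of
    (popularity, {attribute: value}) pairs.
    (A simplified stand-in for the paper's metric.)
    """
    def expected_scan_cost(attr):
        branches = defaultdict(list)
        for popularity, attrs in products:
            branches[attrs[attr]].append(popularity)
        # a visitor picks a branch, then scans its products: weight each
        # branch's size by the popularity mass that flows into it
        return sum(sum(pops) * len(pops) for pops in branches.values())
    return min(attributes, key=expected_scan_cost)

catalog = [
    (50, {"type": "laptop", "brand": "A"}),
    (30, {"type": "laptop", "brand": "A"}),
    (15, {"type": "phone",  "brand": "B"}),
    (5,  {"type": "tablet", "brand": "B"}),
]
print(greedy_split(catalog, ["type", "brand"]))  # "type" yields the lower cost
```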

6 citations

Book ChapterDOI
16 Dec 2017
TL;DR: The fastest previously known algorithm for the k-mappability problem with k = 1 requires O(mn log n / log log n) time and O(n) space; this paper presents algorithms with worst-case times O(mn) and O(n log n log log n), both in O(n) space, as well as an average-case linear-time algorithm.
Abstract: In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n / log log n) and space O(n). We present two new algorithms that require worst-case time O(mn) and O(n log n log log n), respectively, and space O(n), thus greatly improving the state of the art. Moreover, we present another algorithm that requires average-case time and space O(n) for integer alphabets of size σ if m = Ω(log_σ n). Notably, we show that this algorithm is generalizable for arbitrary k, requiring average-case time O(kn) and space O(n) if m = Ω(k log_σ n).
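To make the problem statement concrete, here is a naive brute-force reference implementation; it follows the definition directly but runs in O(n²m) time, so it is nowhere near the paper's algorithms and is meant only to pin down what is being counted.

```python
def mappability(x, m, k):
    """Brute-force k-mappability: for each length-m factor y of x, count
    the other length-m factors of x at Hamming distance at most k.
    O(n^2 * m) time; a reference for the problem statement only.
    """
    factors = [x[i:i + m] for i in range(len(x) - m + 1)]
    counts = []
    for i, y in enumerate(factors):
        close = sum(
            1
            for j, z in enumerate(factors)
            if i != j and sum(a != b for a, b in zip(y, z)) <= k
        )
        counts.append(close)
    return counts

print(mappability("abaabab", m=3, k=1))
```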

6 citations

Journal ArticleDOI
TL;DR: This paper considers the problem of inferring the nested arc-annotation P2 for S2 such that (S1, P1) and (S2, P2) have the largest common substructure, and obtains an improved algorithm that runs in min{O(nm² + n²m), O(nm² log n), O(nm³)} time and min{O(m² + mn), O(m² log n + n)} space.
Abstract: The nested arc-annotation of a sequence is an important model used to represent structural information for RNA and protein sequences. Given two sequences S1 and S2 and a nested arc-annotation P1 for S1, this paper considers the problem of inferring the nested arc-annotation P2 for S2 such that (S1, P1) and (S2, P2) have the largest common substructure. The problem has a direct application in predicting the secondary structure of an RNA sequence given a closely related sequence with known secondary structure. The currently most efficient algorithm for this problem requires O(nm³) time and O(nm²) space where n is the length of the sequence with known arc-annotation and m is the length of the sequence whose arc-annotation is to be inferred. By using sparsification on a new recursive dynamic programming algorithm and applying a Hirschberg-like traceback technique with compression, we obtain an improved algorithm that runs in min{O(nm² + n²m), O(nm² log n), O(nm³)} time and min{O(m² + mn), O(m² log n + n)} space.
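As a small aid to the model, the sketch below checks that an arc-annotation is nested, i.e. no position is the endpoint of two arcs and no two arcs cross; this is the structural property the paper relies on, though the checker itself is illustrative and not from the paper.

```python
def is_nested(arcs):
    """Check the nested property of an arc-annotation, given as (i, j)
    pairs with i < j over sequence positions: endpoints are distinct,
    and for any two arcs (i, j) and (k, l) with i < k, either the arcs
    are disjoint (j < k) or one contains the other (l < j).
    """
    endpoints = [p for arc in arcs for p in arc]
    if len(endpoints) != len(set(endpoints)):
        return False                  # a position serves two arcs
    for i, j in arcs:
        for k, l in arcs:
            if i < k < j < l:         # the two arcs cross
                return False
    return True

print(is_nested([(1, 8), (2, 5), (6, 7)]))  # True: nested and disjoint arcs
print(is_nested([(1, 5), (3, 8)]))          # False: the arcs cross
```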

6 citations

Proceedings ArticleDOI
08 Jun 2000
TL;DR: The problem of automatic Web site construction is introduced, and a solution to a major step of the problem, based on decision tree algorithms, is proposed and found to be useful for the automatic construction of product catalogs.
Abstract: Organization of a Web site is important to help users get the most out of the site. A good Web site should help visitors find the information they want easily. Visitors typically find information by searching for selected terms of interest or by following links from one Web page to another. The first approach is more useful if the visitor knows exactly what he is seeking, while the second approach is useful when the visitor has less of a preconceived notion about what he wants. The organization of a Web site is especially important in the latter case. Traditionally, Web site organization is done by hand. In this paper, we introduce the problem of automatic Web site construction and propose a solution for solving a major step of the problem based on decision tree algorithms. The solution is found to be useful in automatic construction of product catalogs.

6 citations


Cited by
Journal ArticleDOI
TL;DR: The Burrows-Wheeler Alignment tool (BWA), a new read alignment package based on backward search with the Burrows-Wheeler Transform (BWT), is implemented to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous number of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented the Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with the Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10-20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]
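The core of BWA's index is exact-match backward search over a Burrows-Wheeler Transform; BWA itself layers bounded backtracking on top of this to admit mismatches and gaps. The toy sketch below shows only that exact-match core, with naive O(n) rank queries; it is an illustration, not BWA's implementation.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations ('$' terminates)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def backward_search(bwt_str, pattern):
    """Count exact occurrences of `pattern` using FM-index backward
    search: extend the match one character at a time from the end of
    the pattern, maintaining the suffix-array interval [lo, hi).
    """
    # C[c] = number of characters in the text strictly smaller than c
    sorted_bwt = sorted(bwt_str)
    C = {c: sorted_bwt.index(c) for c in set(bwt_str)}
    def occ(c, i):                    # occurrences of c in bwt_str[:i]
        return bwt_str[:i].count(c)   # naive O(n) rank; real indexes use tables
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(backward_search(bwt("GATTACAGATTA"), "ATTA"))  # 2 occurrences
```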

43,862 citations

Journal ArticleDOI
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches; multiple processor cores can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

20,335 citations

Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements (ENCODE) project provides new insights into the organization and regulation of human genes and the genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories.

First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules.

Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, handwriting recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs.

Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.

Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically.

Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Journal ArticleDOI
TL;DR: This work presents Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short read sequencers such as Solexa's Genome Analyzer, and uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions.
Abstract: We present Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short read sequencers such as Solexa's Genome Analyzer. MACS empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, and is freely available.
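The dynamic Poisson idea can be sketched compactly: rather than one genome-wide background rate, each candidate region is tested against the most conservative of several locally estimated rates, which absorbs local biases such as copy-number differences. The function below is a hedged illustration with the window rates as assumed inputs, not MACS's code.

```python
from scipy.stats import poisson

def peak_pvalue(tag_count, lambda_bg, lambda_1k, lambda_5k, lambda_10k):
    """Test a candidate peak against a dynamic local rate: take the
    largest of the genome-wide rate and rates estimated from windows
    of several sizes around the peak, then compute a Poisson tail
    probability.  (Illustrative; window estimation is assumed done.)
    """
    lambda_local = max(lambda_bg, lambda_1k, lambda_5k, lambda_10k)
    return poisson.sf(tag_count - 1, lambda_local)  # P(X >= tag_count)

# e.g. 50 tags where local rates hover around 10 is highly significant
print(peak_pvalue(50, lambda_bg=8.0, lambda_1k=10.0, lambda_5k=9.5, lambda_10k=9.0))
```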

13,008 citations