
Showing papers by "Wing-Kin Sung published in 2002"


Book Chapter
15 Aug 2002
TL;DR: This paper initiates the study of constructing compressed suffix arrays directly from text. The main contribution is a new construction algorithm that uses only O(n) bits of working memory and, more importantly, keeps the time complexity the same as before, i.e., O(n log n).
Abstract: With the first human DNA decoded into a sequence of about 2.8 billion base pairs, much biological research has centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing a compressed suffix array is still not an easy task, because we still have to compute the suffix array first, which needs a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from text. The main contribution is a new construction algorithm that uses only O(n) bits of working memory and, more importantly, keeps the time complexity the same as before, i.e., O(n log n).
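As a point of reference for the space figures above, here is a minimal sketch of the classical, uncompressed suffix array built by plain comparison sorting; storing its n integer positions is exactly the O(n log n)-bit cost that compressed suffix arrays avoid. This is a baseline illustration only, not the paper's O(n)-bit construction algorithm.

```python
# Minimal baseline: build a plain (uncompressed) suffix array by sorting
# all suffixes lexicographically. Storing the n integer positions takes
# O(n log n) bits, which is the space bottleneck the paper's O(n)-bit
# construction avoids. Simple, not efficient: sorting string slices like
# this is far slower than dedicated suffix array algorithms.

def build_suffix_array(text: str) -> list[int]:
    """Return the starting positions of the suffixes of `text` in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

sa = build_suffix_array("banana$")
print(sa)  # [6, 5, 3, 1, 0, 4, 2]
```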

77 citations


Journal Article
TL;DR: A new decomposition theorem is presented for maximum weight bipartite matchings and used to design a faster matching algorithm; given a maximum weight matching of G, the weight of a maximum weight matching of G - {u} can be computed for all nodes u in O(W) time.
Abstract: Let G be a bipartite graph with positive integer weights on the edges and without isolated nodes. Let n, N, and W be the node count, the largest edge weight, and the total weight of G. Let $k(x, y) = \log x / \log(x^2/y)$. We present a new decomposition theorem for maximum weight bipartite matchings and use it to design an $O(\sqrt{n}W / k(n, W/N))$-time algorithm for computing a maximum weight matching of G. This algorithm bridges a long-standing gap between the best known time complexity of computing a maximum weight matching and that of computing a maximum cardinality matching. Given G and a maximum weight matching of G, we can further compute the weight of a maximum weight matching of G - {u} for all nodes u in O(W) time.
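For concreteness, here is a baseline sketch of the maximum weight bipartite matching problem solved with SciPy's Hungarian-method routine, an O(n^3) approach rather than the paper's algorithm. The weight matrix is made up, and absent edges are modeled as weight 0, which is harmless here because all real edge weights are positive.

```python
# Baseline illustration of maximum weight bipartite matching via the
# Hungarian method (SciPy's linear_sum_assignment with maximize=True).
# Not the paper's algorithm; just a standard solver for the same problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical weight matrix: rows = left nodes, columns = right nodes;
# a 0 entry stands for a missing edge.
W = np.array([[4, 1, 0],
              [2, 0, 3],
              [0, 5, 1]])

rows, cols = linear_sum_assignment(W, maximize=True)
# Drop zero-weight pairs so only genuine edges remain in the matching.
matched = [(r, c) for r, c in zip(rows, cols) if W[r, c] > 0]
print(matched, "total weight:", W[rows, cols].sum())  # weight 12 here
```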

60 citations


Journal Article
01 Nov 2002
TL;DR: This paper proposes a metric, based on the popularity of products and the relative importance of product attribute values, to evaluate the quality of a catalog organization, and develops an efficient greedy algorithm, GENCAT, that produces better catalog organizations under this metric.
Abstract: A good online catalog is crucial to the success of an e-commerce web site. Traditionally, an online catalog is built mainly by hand; to what extent this can be automated is a challenging problem. Recently, there have been investigations into how to reorganize an existing online catalog based on some criteria, but none of them has addressed the problem of organizing an online catalog automatically from scratch. This paper attempts to tackle this problem. We model an online catalog organization as a decision tree structure and propose a metric, based on the popularity of products and the relative importance of product attribute values, to evaluate the quality of a catalog organization. The problem is then formulated as a decision tree construction problem. Although traditional decision tree algorithms, such as C4.5, can be used to generate an online catalog organization, the resulting catalog is generally not good under our metric. An efficient greedy algorithm (GENCAT) is thus developed, and experimental results show that GENCAT produces better catalog organizations under our metric.
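The abstract does not spell out GENCAT's internals, so the following is a hypothetical sketch of a greedy catalog builder in the same spirit: at each node it picks the attribute whose value partition minimizes an assumed popularity-weighted navigation-cost proxy. The metric, attribute names, and function names here are all illustrative assumptions, not the paper's actual definitions.

```python
# Generic greedy decision-tree catalog builder (illustrative sketch).
# Each product is a dict of attribute -> value plus a 'popularity' weight.
# At every node we pick the attribute whose value partition gives the
# smallest popularity-weighted expected branch size, a stand-in for the
# paper's metric (which also weights attribute values by importance).

def split_cost(products, attr):
    groups = {}
    for p in products:
        groups.setdefault(p[attr], []).append(p)
    total_pop = sum(p['popularity'] for p in products)
    # Expected size of the branch a random (popularity-weighted) shopper enters.
    return sum(sum(q['popularity'] for q in g) / total_pop * len(g)
               for g in groups.values())

def build_catalog(products, attrs):
    if len(products) <= 1 or not attrs:
        return products  # leaf: just list the remaining products
    best = min(attrs, key=lambda a: split_cost(products, a))
    rest = [a for a in attrs if a != best]
    branches = {}
    for p in products:
        branches.setdefault((best, p[best]), []).append(p)
    return {k: build_catalog(v, rest) for k, v in branches.items()}

products = [
    {'type': 'laptop', 'brand': 'X', 'popularity': 9},
    {'type': 'laptop', 'brand': 'Y', 'popularity': 3},
    {'type': 'phone',  'brand': 'X', 'popularity': 5},
]
print(build_catalog(products, ['type', 'brand']))
```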

13 citations


Book Chapter
17 Sep 2002
TL;DR: This paper proves that the reported dramatic drop in performance is attributable to algorithmic artifacts, and presents instead an algorithm for sequence reconstruction under hybridization noise that exhibits graceful degradation of performance as the error rate increases.
Abstract: DNA sequencing-by-hybridization (SBH) is a powerful potential alternative to current sequencing by electrophoresis. Different SBH methods have been compared under the hypothesis of error-free hybridization, but both false negatives and false positives are likely to occur in practice. Under the assumption of random independent hybridization errors, Doi and Imai [3] recently concluded that the algorithms of [15], which are asymptotically optimal in the error-free case, cannot be successfully adapted to noisy conditions. In this paper we prove that the reported dramatic drop in performance is attributable to algorithmic artifacts, and we present instead an algorithm for sequence reconstruction under hybridization noise that exhibits graceful degradation of performance as the error rate increases. As a downside, the computational cost of sequence reconstruction rises noticeably under noisy conditions.
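For orientation, the idealized error-free SBH setting can be sketched as a greedy walk over the probe spectrum; this is the textbook reconstruction that breaks down under hybridization errors, not the paper's noise-tolerant algorithm. The k-mer encoding and function names below are assumptions for illustration.

```python
# Idealized, error-free SBH reconstruction by greedy extension: given the
# multiset of all k-mers of an unknown sequence (the "spectrum") and its
# first k-mer, extend one base at a time whenever exactly one spectrum
# probe matches the current (k-1)-suffix. With false positives/negatives
# in the spectrum, this naive loop fails, which is the regime the paper's
# algorithm is designed to handle.
from collections import Counter

def reconstruct(spectrum, start, length):
    probes = Counter(spectrum)
    seq = start
    probes[start] -= 1
    while len(seq) < length:
        suffix = seq[-(len(start) - 1):]
        candidates = [b for b in "ACGT" if probes[suffix + b] > 0]
        if len(candidates) != 1:
            return None  # ambiguous or missing probe: reconstruction fails
        seq += candidates[0]
        probes[suffix + candidates[0]] -= 1
    return seq

kmers = ["ACG", "CGT", "GTA", "TAC", "ACC"]
print(reconstruct(kmers, "ACG", 7))  # ACGTACC
```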

7 citations


Posted Content
TL;DR: This paper gives the first subquadratic-time algorithm for finding the non-shared edges of two phylogenies, which is then used to speed up the existing approximation algorithm for the NNI distance.
Abstract: The number of non-shared edges of two phylogenies is a basic measure of the dissimilarity between the phylogenies. The non-shared edges are also the building block for approximating a more sophisticated metric called the nearest neighbor interchange (NNI) distance. In this paper, we give the first subquadratic-time algorithm for finding the non-shared edges, which are then used to speed up the existing approximation algorithm for the NNI distance from $O(n^2)$ time to $O(n \log n)$ time. Another popular distance metric for phylogenies is the subtree transfer (STT) distance. Previous work on computing the STT distance considered degree-3 trees only. We give an approximation algorithm for the STT distance for degree-$d$ trees with arbitrary $d$ and with generalized STT operations.
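A naive way to make "non-shared edges" concrete: each internal edge induces a bipartition (split) of the leaf set, and the non-shared edges are those whose splits appear in only one of the two trees. The sketch below counts them by direct set comparison for rooted trees encoded as nested tuples; this is a simple baseline under an assumed tree encoding, not the paper's subquadratic algorithm.

```python
# Naive baseline for counting non-shared edges between two phylogenies:
# collect the leaf set below every internal node (each corresponds to an
# edge's split), then take the symmetric difference of the two split sets.
# Trees are assumed rooted and given as nested tuples with string leaves.

def splits(tree):
    """Collect the leaf set below every internal node as frozensets."""
    out = set()
    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(c) for c in node))
        out.add(below)
        return below
    all_leaves = walk(tree)
    out.discard(all_leaves)  # the root's "split" is trivial
    return out

def non_shared_edges(t1, t2):
    s1, s2 = splits(t1), splits(t2)
    return len(s1 ^ s2)  # splits present in exactly one tree

t1 = (("a", "b"), ("c", ("d", "e")))
t2 = (("a", "c"), ("b", ("d", "e")))
print(non_shared_edges(t1, t2))  # 4
```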

1 citation