
Showing papers by "Wing-Kin Sung published in 2003"


Proceedings ArticleDOI
11 Oct 2003
TL;DR: For the general case where the size of the alphabet A is not constant, these are the first algorithms that achieve o(n log n) time with optimal working space, under the reasonable assumption that log |A| = o(log n).
Abstract: Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. It has been open for a long time whether these indices can be constructed in both o(n log n) time and o(n log n)-bit working space, where n denotes the length of the text. In the literature, the fastest algorithm runs in O(n) time, while it requires O(n log n)-bit working space. On the other hand, the most space-efficient algorithm requires O(n)-bit working space while it runs in O(n log n) time. This paper breaks the long-standing time-and-space barrier under the unit-cost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)-bit working space, for texts with constant-size alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n log^ε n) time and O(n)-bit working space for any 0 < ε < 1. Apart from that, our algorithm can also be adapted to build other existing full-text indices, such as the Compressed Suffix Tree, the Compressed Suffix Array and the FM-index. We also study the general case where the size of the alphabet A is not constant. Our algorithm can construct a suffix array and a suffix tree using optimal O(n log |A|)-bit working space while running in O(n log log |A|) time and O(n log^ε n) time, respectively. These are the first algorithms that achieve o(n log n) time with optimal working space, under the reasonable assumption that log |A| = o(log n).
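As background for readers less familiar with the object being built, the minimal sketch below (Python, not the paper's algorithm) constructs a suffix array by explicitly sorting suffixes. This naive route costs quadratic space and O(n² log n) character comparisons, far from the O(n)-time, O(n)-bit construction claimed above; that gap is exactly what the paper closes.

```python
# Naive suffix array construction, for illustration only: the suffix array
# lists the starting positions of all suffixes of a text in lexicographic
# order.  Sorting explicit suffix strings is simple but memory-hungry and
# slow -- nowhere near the O(n)-time, O(n)-bit bound discussed above.

def naive_suffix_array(text: str) -> list[int]:
    """Return the suffix array of `text` by sorting all suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

if __name__ == "__main__":
    t = "banana$"
    sa = naive_suffix_array(t)
    print(sa)                      # [6, 5, 3, 1, 0, 4, 2]
    for i in sa:
        print(i, t[i:])            # suffixes in lexicographic order
```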

113 citations


Book ChapterDOI
15 Dec 2003
TL;DR: This paper generalizes the study and shows that even for k = O(lg lg n), both query and update operations can be supported with the same time complexities, and the update time becomes worst-case rather than amortized.
Abstract: The Searchable Partial Sums is a data structure problem that maintains a sequence of n non-negative k-bit integers; it allows the entries to be modified by an update operation, while supporting two types of queries: sum and search. Recently, researchers have focused on succinct representations of the data structure in kn + o(kn) bits. They study the tradeoff in time between the query and the update operations, under the word RAM with word size O(lg U) bits. For the special case where k = 1 (known as the Dynamic Bit Vector problem), Raman et al. showed that both queries can be supported in O(log_b n) time, while an update requires O(b) amortized time, for any b with lg n/lg lg n ≤ b ≤ n. This paper generalizes the study and shows that even for k = O(lg lg n), both query and update operations can be maintained with the same time complexities. Moreover, the update time becomes worst-case rather than amortized.
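For concreteness, here is a standard non-succinct baseline for the same three operations, assuming a plain Fenwick (binary indexed) tree rather than the paper's succinct structure: sum, update and search all run in O(log n) time, but the space is Θ(n) words rather than kn + o(kn) bits.

```python
# A non-succinct baseline for Searchable Partial Sums: a Fenwick tree over
# n non-negative integers.  prefix_sum(i) and update(i, delta) take
# O(log n) time; search(t) (smallest i with prefix_sum(i) >= t) also takes
# O(log n) time via a top-down descent.  The paper achieves the same
# operations within kn + o(kn) bits, which this sketch does not.

class FenwickTree:
    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values, start=1):
            self.update(i, v)

    def update(self, i, delta):
        """Add delta to the i-th value (1-indexed)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix_sum(self, i):
        """Return values[1] + ... + values[i]."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def search(self, target):
        """Return the smallest i with prefix_sum(i) >= target."""
        pos, remaining = 0, target
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < remaining:
                pos = nxt
                remaining -= self.tree[nxt]
            step >>= 1
        return pos + 1

ft = FenwickTree([3, 0, 5, 2])
print(ft.prefix_sum(3))   # 8
print(ft.search(4))       # 3  (prefix sums 3, 3, 8, 10 first reach 4 at i = 3)
```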

49 citations


Proceedings ArticleDOI
11 Aug 2003
TL;DR: Smart filtering techniques are used to avoid redundant computation while maintaining accuracy; based on the new algorithm, optimal short (20-base) or long (50- or 70-base) probes can be computed efficiently for large genomes.
Abstract: The oligo microarray (DNA chip) technology has had a significant impact on genomic studies in recent years. Many fields, such as gene discovery, drug discovery, toxicological research and disease diagnosis, will certainly benefit from its use. A microarray is an orderly arrangement of thousands of DNA fragments, where each DNA fragment is a probe (or a fingerprint) of a gene/cDNA. It is important that each probe associates uniquely with a particular gene/cDNA; otherwise, the performance of the microarray will be affected. Existing algorithms usually select probes using the criteria of homogeneity, sensitivity, and specificity, and they improve efficiency by employing heuristics, which reduces accuracy. Instead, we make use of smart filtering techniques to avoid redundant computation while maintaining accuracy. Based on the new algorithm, optimal short (20-base) or long (50- or 70-base) probes can be computed efficiently for large genomes.
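As a rough illustration of the specificity criterion only (not the paper's filtering technique), the hypothetical sketch below keeps a candidate probe for a gene only if its exact sequence occurs nowhere else in the input. Real probe selection additionally scores homogeneity (e.g., melting temperature) and near-matches, which this sketch omits.

```python
# Toy uniqueness (specificity) filter, not the paper's algorithm: slide a
# window of the probe length over each gene and keep only candidates whose
# exact sequence is globally unique across all genes.

from collections import defaultdict

def unique_probes(genes: dict[str, str], probe_len: int = 20) -> dict[str, list[str]]:
    # Count every probe_len-mer over all genes.
    counts = defaultdict(int)
    for seq in genes.values():
        for i in range(len(seq) - probe_len + 1):
            counts[seq[i:i + probe_len]] += 1

    # A candidate is kept only if its k-mer occurs exactly once overall.
    result = {}
    for name, seq in genes.items():
        result[name] = [seq[i:i + probe_len]
                        for i in range(len(seq) - probe_len + 1)
                        if counts[seq[i:i + probe_len]] == 1]
    return result

# Hypothetical toy input: two genes sharing a common prefix.
genes = {"geneA": "ACGTACGTTTGACCAGT", "geneB": "ACGTACGTTTGACTTTT"}
print(unique_probes(genes, probe_len=8))
```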

44 citations


Proceedings ArticleDOI
01 Dec 2003
TL;DR: This paper presents a framework using Tree-Augmented Networks (TAN), which is based on the theory of learning Bayesian networks but with less restrictive assumptions than naive Bayesian networks; pre- and post-processing steps are used to enhance TAN's performance.
Abstract: For determining the structure class and fold class of a protein structure, computer-based techniques have become essential considering the large volume of data. Several techniques based on sequence similarity, neural networks, SVMs, etc. have been applied. This paper presents a framework using Tree-Augmented Networks (TAN), based on the theory of learning Bayesian networks but with less restrictive assumptions than naive Bayesian networks. In order to enhance TAN's performance, the data are pre-processed by feature discretization and post-processed using a Mean Probability Voting (MPV) scheme. The advantage of the Bayesian approach over other learning methods is that the network structure is intuitive. In addition, one can read off the TAN structure probabilities to determine the significance of each feature (say, hydrophobicity) for each class, which helps to further understand the mystery of protein structure. Experimental results and comparisons with other works over two databases show the effectiveness of our TAN-based framework. The idea is implemented as the BAYESPROT web server, available at http://www-appn.comp.nus.edu.sg/-bioinfo/bayesprot/Default.htm.
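As a point of reference only, the sketch below implements the plain naive Bayes classifier that TAN generalizes; TAN additionally lets each feature depend on one other feature through a learned tree, which is not implemented here. The feature names and data are made up for illustration.

```python
# Baseline sketch only: a discrete naive Bayes classifier over already
# discretized features.  TAN extends this model with tree-structured
# dependencies among the features; that step is omitted here.

from collections import Counter, defaultdict

def train_nb(samples, labels):
    """samples: list of tuples of discrete feature values."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)      # (class, feature index) -> value counts
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            feat_counts[(y, j)][v] += 1
    return class_counts, feat_counts

def predict(x, class_counts, feat_counts):
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / total
        for j, v in enumerate(x):
            counts = feat_counts[(c, j)]
            # add-one smoothing so unseen values get non-zero probability
            p *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        if p > best_p:
            best, best_p = c, p
    return best

# Two made-up discretized features (e.g. binned hydrophobicity, binned polarity).
X = [("low", "high"), ("low", "low"), ("high", "high"), ("high", "low")]
y = ["alpha", "alpha", "beta", "beta"]
cc, fc = train_nb(X, y)
print(predict(("high", "high"), cc, fc))   # "beta"
```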

42 citations


Book ChapterDOI
15 Dec 2003
TL;DR: This paper addresses the problem of constructing a full-text index with limited working memory for text data such as protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters.
Abstract: Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, the compressed suffix array (CSA) [7] and the FM-index [5]. Either of them makes it feasible to store a full-text index in main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only the CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Σ, this algorithm requires O(|Σ| n log n) time and (2H_0 + 1 + ε)n bits of working space, where H_0 is the 0-th order empirical entropy of T and ε is any positive constant. This algorithm is good enough when the alphabet size |Σ| is small, but it is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters.
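For background, the sketch below shows the textbook route from a text to the Burrows-Wheeler transform underlying the FM-index, going through a full suffix array. This is precisely the memory-hungry step that space-efficient construction algorithms such as the one discussed here try to avoid.

```python
# Background only: the FM-index is built on the Burrows-Wheeler transform
# (BWT) of the text, and the textbook way to obtain the BWT is via a full
# suffix array -- exactly the route that needs O(n log n) bits of working
# space and that limited-memory construction algorithms sidestep.

def suffix_array(text: str) -> list[int]:
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_sa(text: str) -> str:
    """BWT[i] is the character preceding the i-th smallest suffix."""
    sa = suffix_array(text)
    return "".join(text[i - 1] for i in sa)   # text[-1] wraps to the sentinel

print(bwt_from_sa("banana$"))   # annb$aa
```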

33 citations


Journal ArticleDOI
TL;DR: A new type of seed for Blast-like homology search tools called the "half seed" is proposed, which is better than the "consecutive seed" used by the original Blast tools in both sensitivity and efficiency.
Abstract: In this paper, we propose a new type of seed for Blast-like homology search tools called the "half seed". This new seed is better than the "consecutive seed" used by the original Blast tools in both sensitivity and efficiency. Compared with the "gapped seed", which was proposed together with the Blast-like search tool PatternHunter, the new seed offers a much wider range of choices for trading off sensitivity against efficiency. This property is especially useful when a search application needs more precise results under limited hardware resources, or vice versa.
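The abstract does not spell out the shape of the half seed, so the sketch below only illustrates the general seeding idea it is compared against: a seed is a 0/1 mask, and two equal-length windows "hit" if they agree at every position marked 1. The consecutive seed is Blast's all-ones mask; the gapped seed shown is PatternHunter's weight-11 mask. The example strings are made up.

```python
# Illustration of consecutive vs. gapped seeds (not the "half seed" itself).

CONSECUTIVE_SEED = "11111111111"           # Blast-style seed, weight 11
GAPPED_SEED      = "111010010100110111"    # PatternHunter's weight-11 spaced seed

def seed_hits(query: str, target: str, seed: str):
    """Yield (i, j) window positions where query and target hit under seed."""
    span = len(seed)
    ones = [k for k, bit in enumerate(seed) if bit == "1"]
    for i in range(len(query) - span + 1):
        for j in range(len(target) - span + 1):
            if all(query[i + k] == target[j + k] for k in ones):
                yield (i, j)

q = "GATTACACGTAGGCTTAA"
t = "GATTAGACGTAGGCCTAA"   # two mismatches, at positions 5 and 14

print(list(seed_hits(q, t, GAPPED_SEED)))       # [(0, 0)]: both mismatches land on "don't care" positions
print(list(seed_hits(q, t, CONSECUTIVE_SEED)))  # []: no run of 11 consecutive matches exists
```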

8 citations


Book ChapterDOI
25 Jul 2003
TL;DR: This paper proposes faster algorithms that take O(mn²) time and O(mn) space, and improves on the classical Needleman-Wunsch and Smith-Waterman algorithms by finding a compact way to represent all the alignment scores.
Abstract: Consider two strings A and B of lengths n and m respectively, with n ≪ m. The problem of computing global and local alignments between A and all m² substrings of B can be solved by the classical Needleman-Wunsch and Smith-Waterman algorithms, respectively, which take O(m²n) time and O(m²) space. This paper proposes faster algorithms that take O(mn²) time and O(mn) space. The improvement stems from a compact way to represent all the alignment scores.
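For reference, here is the classical Needleman-Wunsch dynamic program the paper builds on; rerunning it from every starting position of B is what yields the O(m²n) bound quoted above. The paper's contribution, sharing work across substrings via a compact score representation, is not reproduced in this sketch.

```python
# Classical Needleman-Wunsch global alignment between two strings, using
# a simple match/mismatch/gap scoring scheme (scores chosen for
# illustration).  Time and space are O(|a| * |b|) for a single pair.

def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + score,   # match / substitution
                           dp[i - 1][j] + gap,         # gap in b
                           dp[i][j - 1] + gap)         # gap in a
    return dp[n][m]

print(needleman_wunsch("ACGT", "AGT"))   # 2: three matches, one gap
```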

6 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a new formulation called the enhanced double digest (EDD) problem which, although NP-hard in general, can be solved in linear time in certain theoretically interesting cases.
Abstract: The double digest problem is a common NP-hard approach to constructing physical maps of DNA sequences. This paper presents a new approach called the enhanced double digest problem. Although this new problem is also NP-hard, it can be solved in linear time in certain theoretically interesting cases.
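To make the underlying problem concrete, the toy sketch below states the classical (unenhanced) double digest problem and solves a tiny instance by brute force over permutations. This is only workable for toy inputs, since the problem is NP-hard in general; the fragment lengths used are invented for illustration.

```python
# Classical double digest problem, brute force: given fragment-length
# multisets from enzyme A, enzyme B, and the combined digest of the same
# molecule, find orderings of the A and B fragments whose merged cut
# sites reproduce the combined fragment lengths.

from itertools import permutations

def cut_sites(fragments):
    sites, pos = set(), 0
    for f in fragments:
        pos += f
        sites.add(pos)
    return sites

def double_digest(a, b, both):
    target = sorted(both)
    for pa in set(permutations(a)):
        for pb in set(permutations(b)):
            sites = sorted(cut_sites(pa) | cut_sites(pb))
            frags = [j - i for i, j in zip([0] + sites, sites)]
            if sorted(frags) == target:
                return pa, pb
    return None

# Toy instance: a molecule of length 10; prints one consistent pair of
# orderings, e.g. ((3, 7), (4, 6)).
print(double_digest([3, 7], [4, 6], [3, 1, 6]))
```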

3 citations


Book ChapterDOI
25 Aug 2003
TL;DR: It is shown how perceptual features such as time-to-collision (TTC) can lead to several high-level categories; psychological effects such as intimacy, suspense and terror are recovered as a result of the proposed TTC detection algorithm.
Abstract: Video retrieval using high-level indices is more meaningful than querying using low-level features. In this paper, we show how perceptual features such as time-to-collision (TTC) can lead to several high-level categories. Experiments have been conducted to validate our proposed TTC detection algorithm, which computes TTC from the divergence of the image velocity field. A simple and novel method, called the pilot cue, is used to further refine the algorithm. Our initial system works with a rule-based approach in which the extracted TTC shots (a low-level feature) are mapped to their corresponding high-level indices. The information conveyed by neighboring frames or shots (i.e., contextual information) is used to facilitate the mapping process. Several psychological effects (high-level indices), such as intimacy, suspense and terror, are recovered as a result.
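The divergence-based estimate mentioned above follows a textbook relation: for pure translation toward a fronto-parallel surface, the divergence of the image velocity field equals 2/TTC. The sketch below applies that relation to a given dense flow field; it is not the paper's full pipeline (which also uses the pilot cue and contextual rules), and the flow field here is synthetic.

```python
# Estimate time-to-collision from a dense optical-flow field via its
# divergence.  Assumes the flow (u, v) is already computed per pixel by
# some optical-flow method; only the divergence-to-TTC step is shown.

import numpy as np

def time_to_collision(u: np.ndarray, v: np.ndarray) -> float:
    """Estimate TTC (in frames) as 2 / mean divergence of the flow field."""
    du_dx = np.gradient(u, axis=1)      # du/dx (across columns)
    dv_dy = np.gradient(v, axis=0)      # dv/dy (across rows)
    div = np.mean(du_dx + dv_dy)        # average divergence over the frame
    return 2.0 / div if div > 1e-9 else float("inf")

# Synthetic radially expanding flow corresponding to TTC = 50 frames.
h, w, ttc = 120, 160, 50.0
ys, xs = np.mgrid[0:h, 0:w]
u = (xs - w / 2) / ttc
v = (ys - h / 2) / ttc
print(round(time_to_collision(u, v), 1))   # ~50.0
```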