
Showing papers by "Wing-Kin Sung published in 2003"


Proceedings ArticleDOI
11 Oct 2003
TL;DR: For the general case where the size of the alphabet A is not constant, these are the first algorithms that achieve o(n log n) time with optimal working space, under the reasonable assumption that log |A| = o(log n).
Abstract: Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. It has been open for a long time whether these indices can be constructed in both o(n log n) time and o(n log n)-bit working space, where n denotes the length of the text. In the literature, the fastest algorithm runs in O(n) time, while it requires O(n log n)-bit working space. On the other hand, the most space-efficient algorithm requires O(n)-bit working space while it runs in O(n log n) time. This paper breaks the long-standing time-and-space barrier under the unit-cost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)-bit working space, for texts with constant-size alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n log^ε n) time and O(n)-bit working space for any 0 < ε < 1. Apart from that, our algorithm can also be adapted to build other existing full-text indices, such as the Compressed Suffix Tree, the Compressed Suffix Array and the FM-index. We also study the general case where the size of the alphabet A is not constant. Our algorithm can construct a suffix array and a suffix tree using optimal O(n log |A|)-bit working space while running in O(n log log |A|) time and O(n log^ε n) time, respectively. These are the first algorithms that achieve o(n log n) time with optimal working space, under the reasonable assumption that log |A| = o(log n).
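As background for readers less familiar with the object being built, the minimal sketch below (Python, not the paper's algorithm) constructs a suffix array by explicitly sorting suffixes. This naive route costs quadratic space and O(n² log n) character comparisons, far from the O(n)-time, O(n)-bit construction claimed above; that gap is exactly what the paper closes.

```python
# Naive suffix array construction, for illustration only: the suffix array
# lists the starting positions of all suffixes of a text in lexicographic
# order.  Sorting explicit suffix strings is simple but memory-hungry and
# slow -- nowhere near the O(n)-time, O(n)-bit bound discussed above.

def naive_suffix_array(text: str) -> list[int]:
    """Return the suffix array of `text` by sorting all suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

if __name__ == "__main__":
    t = "banana$"
    sa = naive_suffix_array(t)
    print(sa)                      # [6, 5, 3, 1, 0, 4, 2]
    for i in sa:
        print(i, t[i:])            # suffixes in lexicographic order
```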

113 citations


Book ChapterDOI
15 Dec 2003
TL;DR: This paper generalizes the study and shows that even for k = O(lg lg n), both query and update operations can be supported with the same time complexities, and the update time becomes worst-case rather than amortized.
Abstract: The Searchable Partial Sums is a data structure problem that maintains a sequence of n non-negative k-bit integers; it allows the entries to be modified by an update operation, while supporting two types of queries: sum and search. Recently, researchers have focused on succinct representations of the data structure in kn + o(kn) bits. They study the tradeoff in time between the query and the update operations, under the word RAM with word size O(lg U) bits. For the special case where k = 1 (known as the Dynamic Bit Vector problem), Raman et al. showed that both queries can be supported in O(log_b n) time, while an update requires O(b) amortized time, for any b with lg n/lg lg n ≤ b ≤ n. This paper generalizes the study and shows that even for k = O(lg lg n), both query and update operations can be maintained with the same time complexities. Moreover, the update time becomes worst-case rather than amortized.
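For concreteness, here is a standard non-succinct baseline for the same three operations, assuming a plain Fenwick (binary indexed) tree rather than the paper's succinct structure: sum, update and search all run in O(log n) time, but the space is Θ(n) words rather than kn + o(kn) bits.

```python
# A non-succinct baseline for Searchable Partial Sums: a Fenwick tree over
# n non-negative integers.  prefix_sum(i) and update(i, delta) take
# O(log n) time; search(t) (smallest i with prefix_sum(i) >= t) also takes
# O(log n) time via a top-down descent.  The paper achieves the same
# operations within kn + o(kn) bits, which this sketch does not.

class FenwickTree:
    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values, start=1):
            self.update(i, v)

    def update(self, i, delta):
        """Add delta to the i-th value (1-indexed)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix_sum(self, i):
        """Return values[1] + ... + values[i]."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def search(self, target):
        """Return the smallest i with prefix_sum(i) >= target."""
        pos, remaining = 0, target
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < remaining:
                pos = nxt
                remaining -= self.tree[nxt]
            step >>= 1
        return pos + 1

ft = FenwickTree([3, 0, 5, 2])
print(ft.prefix_sum(3))   # 8
print(ft.search(4))       # 3  (prefix sums 3, 3, 8, 10 first reach 4 at i = 3)
```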

49 citations


Proceedings ArticleDOI
11 Aug 2003
TL;DR: Smart filtering techniques are used to avoid redundant computation while maintaining accuracy; based on the new algorithm, optimal short (20-base) or long (50- or 70-base) probes can be computed efficiently for large genomes.
Abstract: The oligo microarray (DNA chip) technology has had a significant impact on genomic studies in recent years. Many fields, such as gene discovery, drug discovery, toxicological research and disease diagnosis, will certainly benefit from its use. A microarray is an orderly arrangement of thousands of DNA fragments, where each DNA fragment is a probe (or a fingerprint) of a gene/cDNA. It is important that each probe associates uniquely with a particular gene/cDNA; otherwise, the performance of the microarray will be affected. Existing algorithms usually select probes using the criteria of homogeneity, sensitivity, and specificity, and they improve efficiency by employing heuristics, which reduces accuracy. Instead, we make use of smart filtering techniques to avoid redundant computation while maintaining accuracy. Based on the new algorithm, optimal short (20-base) or long (50- or 70-base) probes can be computed efficiently for large genomes.
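As a rough illustration of the specificity criterion only (not the paper's filtering technique), the hypothetical sketch below keeps a candidate probe for a gene only if its exact sequence occurs nowhere else in the input. Real probe selection additionally scores homogeneity (e.g., melting temperature) and near-matches, which this sketch omits.

```python
# Toy uniqueness (specificity) filter, not the paper's algorithm: slide a
# window of the probe length over each gene and keep only candidates whose
# exact sequence is globally unique across all genes.

from collections import defaultdict

def unique_probes(genes: dict[str, str], probe_len: int = 20) -> dict[str, list[str]]:
    # Count every probe_len-mer over all genes.
    counts = defaultdict(int)
    for seq in genes.values():
        for i in range(len(seq) - probe_len + 1):
            counts[seq[i:i + probe_len]] += 1

    # A candidate is kept only if its k-mer occurs exactly once overall.
    result = {}
    for name, seq in genes.items():
        result[name] = [seq[i:i + probe_len]
                        for i in range(len(seq) - probe_len + 1)
                        if counts[seq[i:i + probe_len]] == 1]
    return result

# Hypothetical toy input: two genes sharing a common prefix.
genes = {"geneA": "ACGTACGTTTGACCAGT", "geneB": "ACGTACGTTTGACTTTT"}
print(unique_probes(genes, probe_len=8))
```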

44 citations


Proceedings ArticleDOI
01 Dec 2003
TL;DR: This paper presents a framework using Tree-Augmented Networks (TAN), which is based on the theory of learning Bayesian networks but with less restrictive assumptions than naive Bayesian networks; pre- and post-processing steps are used to enhance TAN's performance.
Abstract: For determining the structure class and fold class of a protein structure, computer-based techniques have become essential considering the large volume of data. Several techniques based on sequence similarity, neural networks, SVMs, etc. have been applied. This paper presents a framework using Tree-Augmented Networks (TAN), based on the theory of learning Bayesian networks but with less restrictive assumptions than naive Bayesian networks. In order to enhance TAN's performance, the data are pre-processed by feature discretization and post-processed using a Mean Probability Voting (MPV) scheme. The advantage of the Bayesian approach over other learning methods is that the network structure is intuitive. In addition, one can read off the TAN structure probabilities to determine the significance of each feature (say, hydrophobicity) for each class, which helps to further understand the mystery of protein structure. Experimental results and comparisons with other works over two databases show the effectiveness of our TAN-based framework. The idea is implemented as the BAYESPROT web server, available at http://www-appn.comp.nus.edu.sg/-bioinfo/bayesprot/Default.htm.
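As a point of reference only, the sketch below implements the plain naive Bayes classifier that TAN generalizes; TAN additionally lets each feature depend on one other feature through a learned tree, which is not implemented here. The feature names and data are made up for illustration.

```python
# Baseline sketch only: a discrete naive Bayes classifier over already
# discretized features.  TAN extends this model with tree-structured
# dependencies among the features; that step is omitted here.

from collections import Counter, defaultdict

def train_nb(samples, labels):
    """samples: list of tuples of discrete feature values."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)      # (class, feature index) -> value counts
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            feat_counts[(y, j)][v] += 1
    return class_counts, feat_counts

def predict(x, class_counts, feat_counts):
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / total
        for j, v in enumerate(x):
            counts = feat_counts[(c, j)]
            # add-one smoothing so unseen values get non-zero probability
            p *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        if p > best_p:
            best, best_p = c, p
    return best

# Two made-up discretized features (e.g. binned hydrophobicity, binned polarity).
X = [("low", "high"), ("low", "low"), ("high", "high"), ("high", "low")]
y = ["alpha", "alpha", "beta", "beta"]
cc, fc = train_nb(X, y)
print(predict(("high", "high"), cc, fc))   # "beta"
```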

42 citations


Book ChapterDOI
15 Dec 2003
TL;DR: This paper addresses the problem of constructing a full-text index with limited working memory for text data such as protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters.
Abstract: Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, the compressed suffix array (CSA) [7] and the FM-index [5]. Either of them makes it feasible to store a full-text index in main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only the CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Σ, this algorithm requires O(|Σ| n log n) time and (2H_0 + 1 + ε)n bits of working space, where H_0 is the 0-th order empirical entropy of T and ε is any positive constant. This algorithm is good enough when the alphabet size |Σ| is small, but it is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters.
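For background, the sketch below shows the textbook route from a text to the Burrows-Wheeler transform underlying the FM-index, going through a full suffix array. This is precisely the memory-hungry step that space-efficient construction algorithms such as the one discussed here try to avoid.

```python
# Background only: the FM-index is built on the Burrows-Wheeler transform
# (BWT) of the text, and the textbook way to obtain the BWT is via a full
# suffix array -- exactly the route that needs O(n log n) bits of working
# space and that limited-memory construction algorithms sidestep.

def suffix_array(text: str) -> list[int]:
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_sa(text: str) -> str:
    """BWT[i] is the character preceding the i-th smallest suffix."""
    sa = suffix_array(text)
    return "".join(text[i - 1] for i in sa)   # text[-1] wraps to the sentinel

print(bwt_from_sa("banana$"))   # annb$aa
```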

33 citations


Journal ArticleDOI
TL;DR: A new type of seed for Blast-like homology search tools called the "half seed" is proposed, which is better than the "consecutive seed" used by the original Blast tools in both sensitivity and efficiency.
Abstract: In this paper, we propose a new type of seed for Blast-like homology search tools called the "half seed". This new seed is better than the "consecutive seed" used by the original Blast tools in both sensitivity and efficiency. Compared with the "gapped seed", which was proposed together with the Blast-like search tool PatternHunter, the new seed offers a much wider range of choices for trading off sensitivity against efficiency. This property is especially useful when a search application needs more precise results under limited hardware resources, or vice versa.
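The abstract does not spell out the shape of the half seed, so the sketch below only illustrates the general seeding idea it is compared against: a seed is a 0/1 mask, and two equal-length windows "hit" if they agree at every position marked 1. The consecutive seed is Blast's all-ones mask; the gapped seed shown is PatternHunter's weight-11 mask. The example strings are made up.

```python
# Illustration of consecutive vs. gapped seeds (not the "half seed" itself).

CONSECUTIVE_SEED = "11111111111"           # Blast-style seed, weight 11
GAPPED_SEED      = "111010010100110111"    # PatternHunter's weight-11 spaced seed

def seed_hits(query: str, target: str, seed: str):
    """Yield (i, j) window positions where query and target hit under seed."""
    span = len(seed)
    ones = [k for k, bit in enumerate(seed) if bit == "1"]
    for i in range(len(query) - span + 1):
        for j in range(len(target) - span + 1):
            if all(query[i + k] == target[j + k] for k in ones):
                yield (i, j)

q = "GATTACACGTAGGCTTAA"
t = "GATTAGACGTAGGCCTAA"   # two mismatches, at positions 5 and 14

print(list(seed_hits(q, t, GAPPED_SEED)))       # [(0, 0)]: both mismatches land on "don't care" positions
print(list(seed_hits(q, t, CONSECUTIVE_SEED)))  # []: no run of 11 consecutive matches exists
```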

8 citations


Book ChapterDOI
25 Jul 2003
TL;DR: This paper proposes faster algorithms that take O(mn²) time and O(mn) space, and improves on the classical Needleman-Wunsch and Smith-Waterman algorithms by finding a compact way to represent all the alignment scores.
Abstract: Consider two strings A and B of lengths n and m respectively, with n ≪ m. The problem of computing global and local alignments between A and all m² substrings of B can be solved by the classical Needleman-Wunsch and Smith-Waterman algorithms, respectively, which take O(m²n) time and O(m²) space. This paper proposes faster algorithms that take O(mn²) time and O(mn) space. The improvement stems from a compact way to represent all the alignment scores.
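For reference, here is the classical Needleman-Wunsch dynamic program the paper builds on; rerunning it from every starting position of B is what yields the O(m²n) bound quoted above. The paper's contribution, sharing work across substrings via a compact score representation, is not reproduced in this sketch.

```python
# Classical Needleman-Wunsch global alignment between two strings, using
# a simple match/mismatch/gap scoring scheme (scores chosen for
# illustration).  Time and space are O(|a| * |b|) for a single pair.

def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + score,   # match / substitution
                           dp[i - 1][j] + gap,         # gap in b
                           dp[i][j - 1] + gap)         # gap in a
    return dp[n][m]

print(needleman_wunsch("ACGT", "AGT"))   # 2: three matches, one gap
```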

6 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a new formulation called the enhanced double digest (EDD) problem which, although NP-hard in general, can be solved in linear time in certain theoretically interesting cases.
Abstract: The double digest problem is a common NP-hard approach to constructing physical maps of DNA sequences. This paper presents a new approach called the enhanced double digest problem. Although this new problem is also NP-hard, it can be solved in linear time in certain theoretically interesting cases.
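To make the underlying problem concrete, the toy sketch below states the classical (unenhanced) double digest problem and solves a tiny instance by brute force over permutations. This is only workable for toy inputs, since the problem is NP-hard in general; the fragment lengths used are invented for illustration.

```python
# Classical double digest problem, brute force: given fragment-length
# multisets from enzyme A, enzyme B, and the combined digest of the same
# molecule, find orderings of the A and B fragments whose merged cut
# sites reproduce the combined fragment lengths.

from itertools import permutations

def cut_sites(fragments):
    sites, pos = set(), 0
    for f in fragments:
        pos += f
        sites.add(pos)
    return sites

def double_digest(a, b, both):
    target = sorted(both)
    for pa in set(permutations(a)):
        for pb in set(permutations(b)):
            sites = sorted(cut_sites(pa) | cut_sites(pb))
            frags = [j - i for i, j in zip([0] + sites, sites)]
            if sorted(frags) == target:
                return pa, pb
    return None

# Toy instance: a molecule of length 10; prints one consistent pair of
# orderings, e.g. ((3, 7), (4, 6)).
print(double_digest([3, 7], [4, 6], [3, 1, 6]))
```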

3 citations


Book ChapterDOI
25 Aug 2003
TL;DR: It is shown how perceptual features such as time-to-collision (TTC) can lead to several high-level categories; psychological effects such as intimacy, suspense and terror are recovered as a result of the proposed TTC detection algorithm.
Abstract: Video retrieval using high-level indices is more meaningful than querying using low-level features. In this paper, we show how perceptual features such as time-to-collision (TTC) can lead to several high-level categories. Experiments have been conducted to validate our proposed TTC detection algorithm, which computes TTC from the divergence of the image velocity field. A simple and novel method, called the pilot cue, is used to further refine the algorithm. Our initial system works with a rule-based approach in which the extracted TTC shots (a low-level feature) are mapped to their corresponding high-level indices. The information conveyed by neighboring frames or shots (i.e., contextual information) is used to facilitate the mapping process. Several psychological effects (high-level indices), such as intimacy, suspense and terror, are recovered as a result.
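The divergence-based estimate mentioned above follows a textbook relation: for pure translation toward a fronto-parallel surface, the divergence of the image velocity field equals 2/TTC. The sketch below applies that relation to a given dense flow field; it is not the paper's full pipeline (which also uses the pilot cue and contextual rules), and the flow field here is synthetic.

```python
# Estimate time-to-collision from a dense optical-flow field via its
# divergence.  Assumes the flow (u, v) is already computed per pixel by
# some optical-flow method; only the divergence-to-TTC step is shown.

import numpy as np

def time_to_collision(u: np.ndarray, v: np.ndarray) -> float:
    """Estimate TTC (in frames) as 2 / mean divergence of the flow field."""
    du_dx = np.gradient(u, axis=1)      # du/dx (across columns)
    dv_dy = np.gradient(v, axis=0)      # dv/dy (across rows)
    div = np.mean(du_dx + dv_dy)        # average divergence over the frame
    return 2.0 / div if div > 1e-9 else float("inf")

# Synthetic radially expanding flow corresponding to TTC = 50 frames.
h, w, ttc = 120, 160, 50.0
ys, xs = np.mgrid[0:h, 0:w]
u = (xs - w / 2) / ttc
v = (ys - h / 2) / ttc
print(round(time_to_collision(u, v), 1))   # ~50.0
```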