Showing papers by "Jeffrey Scott Vitter" published in 2009


Proceedings Article
25 Oct 2009
TL;DR: This framework gives a linear-space data structure with optimal query times for arbitrary score functions and improves the space utilization for the problems in [Muthukrishnan, 2002] while maintaining optimal query performance.
Abstract: Given a set ${\cal D}=\{d_1, d_2, \ldots, d_D\}$ of $D$ strings of total length $n$, our task is to report the "most relevant" strings for a given query pattern $P$. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of "most relevant" is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be a lot more than the required relevant documents. The first formal framework to study this kind of retrieval problem was given by [Muthukrishnan, 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking $O(n \log n)$ words of space. We study this problem in a slightly different framework of reporting the top $k$ most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [Muthukrishnan, 2002] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.

109 citations
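
To make the top-$k$ retrieval problem in the abstract above concrete, here is a minimal baseline sketch that uses frequency as the relevance metric. It is not the paper's data structure: it sorts all suffixes explicitly (quadratic space) and scans every occurrence of the pattern, which is exactly the cost the abstract says a plain suffix-tree search incurs. The function names `build_suffix_list` and `top_k_by_frequency` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_suffix_list(docs):
    """Return all (suffix, doc_id) pairs in sorted order.

    Illustration only: storing suffixes explicitly takes quadratic space.
    """
    return sorted(
        (doc[i:], doc_id)
        for doc_id, doc in enumerate(docs)
        for i in range(len(doc))
    )


def top_k_by_frequency(suffixes, pattern, k):
    """Report up to k (doc_id, frequency) pairs, highest frequency first."""
    m = len(pattern)
    # Binary search for the first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffixes)
    while lo < hi:
        mid = (lo + hi) // 2
        if suffixes[mid][0][:m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # Scan the contiguous block of matching suffixes, counting per document.
    counts = Counter()
    i = lo
    while i < len(suffixes) and suffixes[i][0].startswith(pattern):
        counts[suffixes[i][1]] += 1
        i += 1
    # Sort by descending frequency, breaking ties by document id.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


docs = ["banana", "bandana", "cabana"]
suffixes = build_suffix_list(docs)
print(top_k_by_frequency(suffixes, "ana", 2))
```

The scan step touches every occurrence of the pattern, so its cost grows with the total number of matches rather than with $k$; avoiding that dependence is the point of the framework the abstract describes.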


Book Chapter
05 Dec 2009
TL;DR: A dynamic succinct index for dynamic dictionary matching is developed using a different (and simpler) paradigm based on suffix sampling, which improves the space complexity to $(1 + o(1))n \log \sigma + O(k \log n)$ bits.
Abstract: In this paper we revisit the dynamic dictionary matching problem, which asks for an index for a set of patterns $P_1, P_2, \ldots, P_k$ that can support the following query and update operations efficiently. Given a query text $T$, we want to find all the occurrences of these patterns; furthermore, as the set of patterns may change over time, we also want to insert or delete a pattern. The major contribution of this paper is the first succinct index for dynamic dictionary matching. Prior to our work, the most compact index was given by Chan et al. (2007), which is based on the compressed suffix arrays (Grossi and Vitter (2005) and Sadakane (2003)) and the FM-index (Ferragina and Manzini (2005)), and it requires $O(n\sigma)$ bits, where $n$ is the total length of the patterns and $\sigma$ is the alphabet size. We develop a dynamic succinct index using a different (and simpler) paradigm based on suffix sampling. The new index improves not only the space complexity, to $(1 + o(1))n \log \sigma + O(k \log n)$ bits, but also the time complexity of the query and update operations. Specifically, the query and update operations take $O(|T| \log n + occ)$ and $O(|P| \log \sigma + \log n)$ time, respectively, where $occ$ is the number of occurrences.

20 citations
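
The interface that the dynamic dictionary matching abstract above describes (insert a pattern, delete a pattern, report all occurrences in a text) can be sketched with a deliberately naive baseline. This is not the succinct index from the paper and uses no suffix sampling; it stores the patterns in a plain set and rescans the text once per pattern. The class name `NaiveDictionaryMatcher` is hypothetical.

```python
class NaiveDictionaryMatcher:
    """Baseline for dynamic dictionary matching (illustration only)."""

    def __init__(self):
        self.patterns = set()

    def insert(self, pattern):
        """Add a pattern to the dictionary."""
        self.patterns.add(pattern)

    def delete(self, pattern):
        """Remove a pattern; deleting an absent pattern is a no-op."""
        self.patterns.discard(pattern)

    def query(self, text):
        """Return sorted (position, pattern) pairs for every occurrence."""
        occurrences = []
        for pattern in self.patterns:
            if not pattern:
                continue  # skip the empty pattern, which matches everywhere
            start = text.find(pattern)
            while start != -1:
                occurrences.append((start, pattern))
                # Advance by one so overlapping occurrences are also found.
                start = text.find(pattern, start + 1)
        return sorted(occurrences)


matcher = NaiveDictionaryMatcher()
for p in ("he", "she", "hers"):
    matcher.insert(p)
print(matcher.query("ushers"))
matcher.delete("he")
print(matcher.query("ushers"))
```

Updates here are cheap, but a query rescans the text once for each stored pattern, so its cost grows with the number of patterns $k$; the bounds quoted in the abstract avoid that dependence.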


Book Chapter
21 Aug 2009
TL;DR: An alternative text index, based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range query structure, is given; it is shown that this index preserves locality in pattern matching and hence is amenable to use in external-memory settings.
Abstract: A new trend in the field of pattern matching is to design indexing data structures that take space very close to that required by the indexed text (in entropy-compressed form) and simultaneously achieve good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter, 2005], achieve this goal by exploiting the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994]. However, due to the intricate permutation structure of the BWT, no locality of reference can be guaranteed when we perform pattern matching with these indexes. Chien et al. [2008] gave an alternative text index, which is based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range query structure. Given a text $T$ of length $n$ drawn from a $\sigma$-sized alphabet set, they achieved an $O(n \log \sigma)$-bit index for $T$ and showed that this index can preserve locality in pattern matching and hence is amenable to use in external-memory settings. We improve upon this index and show how to apply entropy compression to reduce index space. Our index takes $O(n(H_k + 1)) + o(n \log \sigma)$ bits of space, where $H_k$ is the $k$th-order empirical entropy of the text. This is achieved by creating variable-length blocks of text using arithmetic coding.

19 citations
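
The space bound in the abstract above is stated in terms of the $k$th-order empirical entropy $H_k$. As a small worked illustration, the sketch below computes $H_k$ under the standard definition, which is assumed here rather than quoted from the paper: group each character by the $k$ characters preceding it, take the zeroth-order entropy of each group, and average the results weighted by group size over the text length $n$. The function names `h0` and `empirical_entropy` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def h0(counts):
    """Zeroth-order empirical entropy (bits per symbol) of a symbol Counter."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def empirical_entropy(text, k):
    """k-th order empirical entropy of `text`, in bits per symbol.

    Assumes the standard definition: for each length-k context w, collect
    the characters that follow w in the text, then sum the zeroth-order
    entropies of those collections weighted by their sizes, divided by n.
    """
    if not text:
        return 0.0
    followers = defaultdict(Counter)
    for i in range(k, len(text)):
        followers[text[i - k:i]][text[i]] += 1
    total = sum(sum(c.values()) * h0(c) for c in followers.values())
    return total / len(text)


text = "abababab"
print(empirical_entropy(text, 0))  # two equally frequent symbols
print(empirical_entropy(text, 1))  # each symbol is determined by its predecessor
```

For highly repetitive texts such as this one, $H_k$ drops sharply as $k$ grows, which is why an $O(n(H_k + 1))$ term can be far smaller than $O(n \log \sigma)$.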