
Showing papers by Jeffrey Scott Vitter published in 2008


Book
09 Jun 2008
TL;DR: Surveys the state of the art in the design and analysis of algorithms and data structures for external memory (or EM for short), where the goal is to exploit locality and parallelism in order to reduce I/O costs.
Abstract: Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this manuscript, we survey the state of the art in the design and analysis of algorithms and data structures for external memory (or EM for short), where the goal is to exploit locality and parallelism in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory. For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices, geometric data, and graphs. In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide convenient means in online data structures to make effective use of the data accessed from disk. We also re-examine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length such as character strings, when the data structure is compressed to save space, or when the allocated amount of internal memory can change dynamically. Programming tools and environments are available for simplifying the EM programming task. We report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than other methods used in practice.
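As a hedged illustration of the merging paradigm the survey discusses (a sketch under assumed parameters, not code from the survey), the following Python fragment performs run formation followed by repeated multiway merging in the EM model; the memory size M, block size B, and in-memory lists standing in for disk-resident runs are assumptions made for the example.

```python
import heapq

# Minimal sketch of external merge sort in the EM model: internal memory holds M items,
# blocks hold B items. "Disk" runs are simulated with in-memory lists; the point is the
# run-formation + multiway-merge structure, not real file I/O.

def form_runs(data, M):
    """Run formation: read M items at a time, sort them internally, write a sorted run."""
    return [sorted(data[i:i + M]) for i in range(0, len(data), M)]

def merge_runs(runs, M, B):
    """Repeated (M/B - 1)-way merge passes: one input block per run plus one output block."""
    fan_in = max(2, M // B - 1)
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), fan_in):
            group = runs[i:i + fan_in]
            next_runs.append(list(heapq.merge(*group)))
        runs = next_runs
    return runs[0] if runs else []

def external_merge_sort(data, M=1024, B=64):
    return merge_runs(form_runs(data, M), M, B)

if __name__ == "__main__":
    import random
    xs = [random.randint(0, 10**6) for _ in range(10_000)]
    assert external_merge_sort(xs) == sorted(xs)
```

Each merge pass reads and writes every block once, which is what yields the classical O((N/B) log_{M/B}(N/B)) sorting bound that disk striping alone does not achieve.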

244 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: A novel dictionary encoding scheme is proposed that builds upon edge linearizations of the classic trie data structure, achieves nearly optimal space, offers competitive I/O-search time, and is conscious of the query distribution.
Abstract: Current data structures for searching large string collections either fail to achieve minimum space or cause too many cache misses. In this paper we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and compressed. We provide new insights on front coding [24], introduce other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. Our second contribution is a novel dictionary encoding scheme that builds upon such linearizations and achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries [2, 5] and obtain a succinct variant whose space is close to the information-theoretic minimum.
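As a hedged sketch of one of the linearizations discussed, the snippet below implements plain front coding of a lexicographically sorted dictionary, storing each string as the length of its longest common prefix with the previous string plus the remaining suffix; the example words and function names are illustrative, not taken from the paper.

```python
# Front coding of a sorted string dictionary: each entry becomes
# (lcp with previous string, remaining suffix).

def front_encode(sorted_strings):
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded

def front_decode(encoded):
    out, prev = [], ""
    for lcp, suffix in encoded:
        s = prev[:lcp] + suffix
        out.append(s)
        prev = s
    return out

words = ["trial", "trie", "tried", "tries", "trio"]
enc = front_encode(words)   # [(0, 'trial'), (3, 'e'), (4, 'd'), (4, 's'), (3, 'o')]
assert front_decode(enc) == words
```

The shared prefixes are exactly the trie edges that the paper's linearizations spell out, which is why front coding's space can be related to the information-theoretic minimum.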

52 citations


Proceedings ArticleDOI
25 Mar 2008
TL;DR: This paper shows how to exploit a sampling technique to compress the existing O(n)-word index to an (n Hk (D) + o(n log sigma))-bit index with only a small sacrifice in search time.
Abstract: The past few years have witnessed several exciting results on compressed representation of a string T that supports efficient pattern matching, and the space complexity has been reduced to |T| Hk (T) + o (|T| log sigma) bits, where Hk(T) denotes the kth-order empirical entropy of T, and sigma is the size of the alphabet. In this paper we study compressed representation for another classical problem of string indexing, which is called dictionary matching in the literature. Precisely, a collection D of strings (called patterns) of total length n is to be indexed so that given a text T, the occurrences of the patterns in T can be found efficiently. In this paper we show how to exploit a sampling technique to compress the existing O(n)-word index to an (n Hk (D) + o(n log sigma))-bit index with only a small sacrifice in search time.
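To make the space bound concrete, the following hedged sketch (not the paper's index) computes the kth-order empirical entropy Hk, the quantity in which the n Hk(D) + o(n log sigma) bound is expressed; the sample text and function name are assumptions for the example.

```python
from collections import defaultdict, Counter
from math import log2

# k-th order empirical entropy H_k(T): for each length-k context, take the zeroth-order
# entropy of the symbols that follow it, weighted by how often the context occurs,
# then normalize by the text length.

def empirical_entropy(text, k):
    n = len(text)
    contexts = defaultdict(Counter)
    for i in range(n - k):
        contexts[text[i:i + k]][text[i + k]] += 1
    total_bits = 0.0
    for followers in contexts.values():
        m = sum(followers.values())
        total_bits += sum(-c * log2(c / m) for c in followers.values())
    return total_bits / n   # bits per symbol

print(empirical_entropy("mississippi", 0))  # ~1.82 bits/symbol
print(empirical_entropy("mississippi", 1))  # smaller: order-1 contexts make symbols more predictable
```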

33 citations


Proceedings ArticleDOI
25 Mar 2008
TL;DR: Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, while retaining the optimal search performance achieved by the String B-tree over the uncompressed sequences.
Abstract: Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(logB N + |p| + T/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. The SBC-tree is also dynamic and supports insert and delete operations efficiently. The insertion and deletion of all suffixes of a compressed sequence of length m take O(m logB (N + m)) amortized I/O operations. The SBC-tree index is realized inside PostgreSQL. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, while retaining the optimal search performance achieved by the String B-tree over the uncompressed sequences.
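For context, the snippet below is a hedged sketch of the input representation only: run-length encoding a sequence into (symbol, run-length) pairs, which is the compressed form the SBC-tree indexes without decompression. The SBC-tree itself (String B-tree plus 3-sided range structure) is not reproduced here, and the function names are illustrative.

```python
from itertools import groupby

# Run-length encoding: collapse maximal runs of equal symbols into (symbol, count) pairs.

def rle_encode(seq):
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

def rle_decode(runs):
    return "".join(sym * count for sym, count in runs)

s = "aaaabbbccd"
runs = rle_encode(s)            # [('a', 4), ('b', 3), ('c', 2), ('d', 1)]
assert rle_decode(runs) == s
```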

15 citations


Book ChapterDOI
01 Jan 2008

12 citations


Proceedings Article
19 Jan 2008
TL;DR: A novel approach is presented that reduces the encoding/decoding time to just O(t) operations on small integers (of size O(lg n) bits), without increasing the space required.
Abstract: In this paper, we present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT) that is motivated by the text indexing setting. For a text T of n symbols drawn from an alphabet Σ, our encoding scheme achieves bounds in terms of the hth-order empirical entropy Hh of the text, and takes linear time for encoding and decoding. We also describe a lower bound on the encoding length of the BWT by constructing an infinite (non-trivial) class of texts that are among the hardest to compress using the BWT. We then show that our upper bound on the encoding length is nearly tight with this lower bound for the class of texts we described. In designing our BWT encoding and its lower bound, we also address the t-subset problem; here, the goal is to store a subset of t items drawn from a universe [1..n] using just lg (n choose t) + O(1) bits of space. A number of solutions to this basic problem are known; however, encoding or decoding usually requires either O(t) operations on large integers [Knu05, Rus05] or O(n) operations. We provide a novel approach to reduce the encoding/decoding time to just O(t) operations on small integers (of size O(lg n) bits), without increasing the space required.
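As a hedged illustration of the t-subset problem mentioned above (not the paper's O(t)-time scheme on small integers), the sketch below ranks and unranks a t-subset of a size-n universe in the combinatorial number system, so the rank fits in lg (n choose t) + O(1) bits; the function names and the small example are assumptions.

```python
from math import comb

# Combinatorial number system: a sorted t-subset {c_1 < ... < c_t} of {0, ..., n-1}
# maps bijectively to an integer rank in [0, C(n, t)).

def rank_subset(subset):
    """Map a sorted t-subset to its rank."""
    return sum(comb(c, i + 1) for i, c in enumerate(sorted(subset)))

def unrank_subset(r, n, t):
    """Inverse mapping: recover the t-subset from its rank by greedy choice."""
    subset = []
    for i in range(t, 0, -1):
        c = n - 1
        while comb(c, i) > r:
            c -= 1
        subset.append(c)
        r -= comb(c, i)
    return sorted(subset)

n, t = 10, 3
s = [1, 4, 8]
r = rank_subset(s)
assert unrank_subset(r, n, t) == s
assert r < comb(n, t)            # the rank always fits in ceil(lg C(n, t)) bits
```

Note that this naive unranking spends O(n) comparisons in the worst case, which is exactly the kind of cost the paper's scheme avoids.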

7 citations


01 Jan 2008
TL;DR: In this paper, the authors present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT), motivated by the text indexing setting, and show that their upper bound on the encoding length nearly matches the lower bound they construct for a hard-to-compress class of texts.
Abstract: In this paper, we present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT) that is motivated by the text indexing setting. For a text T of n symbols drawn from an alphabet Σ, our encoding scheme achieves bounds in terms of the hth-order empirical entropy Hh of the text, and takes linear time for encoding and decoding. We also describe a lower bound on the encoding length of the BWT by constructing an infinite (non-trivial) class of texts that are among the hardest to compress using the BWT. We then show that our upper bound on the encoding length is nearly tight with this lower bound for the class of texts we described. In designing our BWT encoding and its lower bound, we also address the t-subset problem; here, the goal is to store a subset of t items drawn from a universe [1..n] using just lg (n choose t) + O(1) bits of space. A number of solutions to this basic problem are known; however, encoding or decoding usually requires either O(t) operations on large integers [Knu05, Rus05] or O(n) operations. We provide a novel approach to reduce the encoding/decoding time to just O(t) operations on small integers (of size O(lg n) bits), without increasing the space required.

5 citations


01 Jan 2008
TL;DR: Shows how arithmetic coding works, describes an efficient implementation that uses table lookup as a fast alternative to arithmetic operations, and notes that the implementation can be sped up further by the use of parallel processing.
Abstract: Arithmetic coding provides an effective mechanism for removing redundancy in the encoding of data. We show how arithmetic coding works and describe an efficient implementation that uses table lookup as a fast alternative to arithmetic operations. The reduced-precision arithmetic has a provably negligible effect on the amount of compression achieved. We can speed up the implementation further by use of parallel processing. We discuss the role of probability models and how they provide probability information to the arithmetic coder. We conclude with perspectives on the comparative advantages and disadvantages of arithmetic coding.
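As a hedged toy illustration of the interval-narrowing idea only (deliberately using exact rational arithmetic, i.e. the opposite of the paper's reduced-precision, table-lookup implementation), the following sketch encodes a string to a single fraction inside the nested interval and decodes it back; the probability model and message are assumptions for the example.

```python
from fractions import Fraction

# Arithmetic coding with exact rationals: each symbol narrows [low, low + width) to the
# sub-interval proportional to its probability; any number inside the final interval
# identifies the message. Real coders do this with bounded-precision integers.

def encode(text, probs):
    low, width = Fraction(0), Fraction(1)
    for sym in text:
        cum = Fraction(0)
        for s, p in probs.items():
            if s == sym:
                low += width * cum
                width *= p
                break
            cum += p
    return low + width / 2          # midpoint of the final interval

def decode(code, probs, length):
    out = []
    for _ in range(length):
        cum = Fraction(0)
        for s, p in probs.items():
            if cum <= code < cum + p:
                out.append(s)
                code = (code - cum) / p   # rescale and recurse on the sub-interval
                break
            cum += p
    return "".join(out)

probs = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
msg = "abacab"
assert decode(encode(msg, probs), probs, len(msg)) == msg
```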

3 citations


Proceedings ArticleDOI
14 Jun 2008
TL;DR: This work provides comprehensive results for the parallel disk caching problem, giving a fully general solution with asymptotically tight competitive ratios; it also shows tight bounds for randomized algorithms against an oblivious adversary and gives an algorithm achieving better bounds in the resource augmentation model.
Abstract: We consider the natural extension of the well-known single disk caching problem to the parallel disk I/O model (PDM) [17]. The main challenge is to achieve as much parallelism as possible and avoid I/O bottlenecks. We are given a fast memory (cache) of size M memory blocks along with a request sequence Σ = (b1, b2, ..., bn) where each block bi resides on one of D disks. In each parallel I/O step, at most one block from each disk can be fetched. The task is to serve Σ in the minimum number of parallel I/Os. Thus, each I/O is analogous to a page fault. The difference here is that during each page fault, up to D blocks can be brought into memory, as long as all of the new blocks entering the memory reside on different disks. The problem has a long history [18, 12, 13, 26]. Note that this problem is non-trivial even if all requests in Σ are unique. This restricted version is called read-once. Despite the progress in the offline version [13, 15] and the read-once version [12], the general online problem has remained open. Here, we provide comprehensive results with a full general solution for the problem with asymptotically tight competitive ratios. To exploit parallelism, any parallel disk algorithm needs a certain amount of lookahead into future requests. To provide effective caching, an online algorithm must achieve an o(D) competitive ratio. We show a lower bound stating that, for lookahead L ≤ M, any online algorithm must be Ω(D)-competitive. For lookahead L greater than M(1+1/e), where e is a constant, the tight upper bound of O(√(MD/L)) on the competitive ratio is achieved by our algorithm SKEW. The previous algorithm tLRU [26] was O((MD/L)^(2/3))-competitive and this was also shown to be tight [26] for an LRU-based strategy. We achieve the tight ratio using a fairly different strategy than LRU. We also show tight results for randomized algorithms against an oblivious adversary and give an algorithm achieving better bounds in the resource augmentation model.
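As a hedged, naive baseline for intuition only (this is not the paper's SKEW algorithm and makes no competitive-ratio claim), the sketch below simulates serving a read-once request sequence in the PDM: each parallel I/O step fetches at most one block per disk, the cache holds M blocks, and prefetching may look at the next L requests; all names and parameters are assumptions.

```python
import random

# Toy read-once PDM simulator: requests are (block, disk) pairs, each block requested once.
# On a fault we spend one parallel I/O step and greedily prefetch, within the lookahead
# window, at most one not-yet-cached block per disk, subject to cache capacity.

def greedy_parallel_steps(requests, M, L):
    cache, steps, i, n = set(), 0, 0, len(requests)
    while i < n:
        blk, _ = requests[i]
        if blk in cache:
            cache.discard(blk)          # read-once: a consumed block is never needed again
            i += 1
            continue
        steps += 1                       # page fault: one parallel I/O step
        used_disks = set()
        for b, d in requests[i:i + L]:   # prefetch within the lookahead window
            if b in cache or d in used_disks:
                continue
            if len(cache) >= M:
                if b == blk:
                    cache.pop()          # always make room for the demanded block
                else:
                    break
            cache.add(b)
            used_disks.add(d)
    return steps

D, n = 8, 2000
reqs = [(f"b{j}", random.randrange(D)) for j in range(n)]
print(greedy_parallel_steps(reqs, M=64, L=256), "parallel I/O steps for", n, "requests")
```

With enough lookahead and blocks spread over the disks, the step count falls well below the number of requests, which is the parallelism the lower and upper bounds above quantify precisely.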

1 citation