
Showing papers by Jeffrey Scott Vitter published in 2008


Book
09 Jun 2008
TL;DR: Surveys the state of the art in the design and analysis of algorithms and data structures for external memory (or EM for short), where the goal is to exploit locality and parallelism in order to reduce I/O costs.
Abstract: Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this manuscript, we survey the state of the art in the design and analysis of algorithms and data structures for external memory (or EM for short), where the goal is to exploit locality and parallelism in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory. For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices, geometric data, and graphs. In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide convenient means in online data structures to make effective use of the data accessed from disk. We also re-examine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length such as character strings, when the data structure is compressed to save space, or when the allocated amount of internal memory can change dynamically. Programming tools and environments are available for simplifying the EM programming task. We report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than other methods used in practice.
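As a hedged illustration of the merging paradigm the survey discusses (a sketch under assumed parameters, not code from the survey), the following Python fragment performs run formation followed by repeated multiway merging in the EM model; the memory size M, block size B, and in-memory lists standing in for disk-resident runs are assumptions made for the example.

```python
import heapq

# Minimal sketch of external merge sort in the EM model: internal memory holds M items,
# blocks hold B items. "Disk" runs are simulated with in-memory lists; the point is the
# run-formation + multiway-merge structure, not real file I/O.

def form_runs(data, M):
    """Run formation: read M items at a time, sort them internally, write a sorted run."""
    return [sorted(data[i:i + M]) for i in range(0, len(data), M)]

def merge_runs(runs, M, B):
    """Repeated (M/B - 1)-way merge passes: one input block per run plus one output block."""
    fan_in = max(2, M // B - 1)
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), fan_in):
            group = runs[i:i + fan_in]
            next_runs.append(list(heapq.merge(*group)))
        runs = next_runs
    return runs[0] if runs else []

def external_merge_sort(data, M=1024, B=64):
    return merge_runs(form_runs(data, M), M, B)

if __name__ == "__main__":
    import random
    xs = [random.randint(0, 10**6) for _ in range(10_000)]
    assert external_merge_sort(xs) == sorted(xs)
```

Each merge pass reads and writes every block once, which is what yields the classical O((N/B) log_{M/B}(N/B)) sorting bound that disk striping alone does not achieve.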

244 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: A novel dictionary encoding scheme is proposed that builds upon edge linearizations of the classic trie data structure, achieves nearly optimal space, offers competitive I/O-search time, and is conscious of the query distribution.
Abstract: Current data structures for searching large string collections either fail to achieve minimum space or cause too many cache misses. In this paper we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and compressed. We provide new insights on front coding [24], introduce other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. Our second contribution is a novel dictionary encoding scheme that builds upon such linearizations and achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries [2, 5] and obtain a succinct variant whose space is close to the information-theoretic minimum.
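As a hedged sketch of one of the linearizations discussed, the snippet below implements plain front coding of a lexicographically sorted dictionary, storing each string as the length of its longest common prefix with the previous string plus the remaining suffix; the example words and function names are illustrative, not taken from the paper.

```python
# Front coding of a sorted string dictionary: each entry becomes
# (lcp with previous string, remaining suffix).

def front_encode(sorted_strings):
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded

def front_decode(encoded):
    out, prev = [], ""
    for lcp, suffix in encoded:
        s = prev[:lcp] + suffix
        out.append(s)
        prev = s
    return out

words = ["trial", "trie", "tried", "tries", "trio"]
enc = front_encode(words)   # [(0, 'trial'), (3, 'e'), (4, 'd'), (4, 's'), (3, 'o')]
assert front_decode(enc) == words
```

The shared prefixes are exactly the trie edges that the paper's linearizations spell out, which is why front coding's space can be related to the information-theoretic minimum.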

52 citations


Proceedings ArticleDOI
25 Mar 2008
TL;DR: This paper shows how to exploit a sampling technique to compress the existing O(n)-word index to an (n Hk (D) + o(n log sigma))-bit index with only a small sacrifice in search time.
Abstract: The past few years have witnessed several exciting results on compressed representation of a string T that supports efficient pattern matching, and the space complexity has been reduced to |T| Hk (T) + o (|T| log sigma) bits, where Hk(T) denotes the kth-order empirical entropy of T, and sigma is the size of the alphabet. In this paper we study compressed representation for another classical problem of string indexing, which is called dictionary matching in the literature. Precisely, a collection D of strings (called patterns) of total length n is to be indexed so that given a text T, the occurrences of the patterns in T can be found efficiently. In this paper we show how to exploit a sampling technique to compress the existing O(n)-word index to an (n Hk (D) + o(n log sigma))-bit index with only a small sacrifice in search time.
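To make the space bound concrete, the following hedged sketch (not the paper's index) computes the kth-order empirical entropy Hk, the quantity in which the n Hk(D) + o(n log sigma) bound is expressed; the sample text and function name are assumptions for the example.

```python
from collections import defaultdict, Counter
from math import log2

# k-th order empirical entropy H_k(T): for each length-k context, take the zeroth-order
# entropy of the symbols that follow it, weighted by how often the context occurs,
# then normalize by the text length.

def empirical_entropy(text, k):
    n = len(text)
    contexts = defaultdict(Counter)
    for i in range(n - k):
        contexts[text[i:i + k]][text[i + k]] += 1
    total_bits = 0.0
    for followers in contexts.values():
        m = sum(followers.values())
        total_bits += sum(-c * log2(c / m) for c in followers.values())
    return total_bits / n   # bits per symbol

print(empirical_entropy("mississippi", 0))  # ~1.82 bits/symbol
print(empirical_entropy("mississippi", 1))  # smaller: order-1 contexts make symbols more predictable
```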

33 citations


Proceedings ArticleDOI
25 Mar 2008
TL;DR: Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, while retaining the optimal search performance achieved by the String B-tree over the uncompressed sequences.
Abstract: Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(logB N + |p| + T/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. The SBC-tree is also dynamic and supports insert and delete operations efficiently. The insertion and deletion of all suffixes of a compressed sequence of length m take O(m logB (N + m)) amortized I/O operations. The SBC-tree index is realized inside PostgreSQL. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, while retaining the optimal search performance achieved by the String B-tree over the uncompressed sequences.
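For context, the snippet below is a hedged sketch of the input representation only: run-length encoding a sequence into (symbol, run-length) pairs, which is the compressed form the SBC-tree indexes without decompression. The SBC-tree itself (String B-tree plus 3-sided range structure) is not reproduced here, and the function names are illustrative.

```python
from itertools import groupby

# Run-length encoding: collapse maximal runs of equal symbols into (symbol, count) pairs.

def rle_encode(seq):
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

def rle_decode(runs):
    return "".join(sym * count for sym, count in runs)

s = "aaaabbbccd"
runs = rle_encode(s)            # [('a', 4), ('b', 3), ('c', 2), ('d', 1)]
assert rle_decode(runs) == s
```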

15 citations


Book ChapterDOI
01 Jan 2008

12 citations


Proceedings Article
19 Jan 2008
TL;DR: A novel approach is presented that reduces the encoding/decoding time to just O(t) operations on small integers (of size O(lg n) bits), without increasing the space required.
Abstract: In this paper, we present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT) that is motivated by the text indexing setting. For a text T of n symbols drawn from an alphabet Σ, our encoding scheme achieves bounds in terms of the hth-order empirical entropy Hh of the text, and takes linear time for encoding and decoding. We also describe a lower bound on the encoding length of the BWT by constructing an infinite (non-trivial) class of texts that are among the hardest to compress using the BWT. We then show that our upper bound on the encoding length is nearly tight with this lower bound for the class of texts we described. In designing our BWT encoding and its lower bound, we also address the t-subset problem; here, the goal is to store a subset of t items drawn from a universe [1..n] using just lg (n choose t) + O(1) bits of space. A number of solutions to this basic problem are known; however, encoding or decoding usually requires either O(t) operations on large integers [Knu05, Rus05] or O(n) operations. We provide a novel approach to reduce the encoding/decoding time to just O(t) operations on small integers (of size O(lg n) bits), without increasing the space required.
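As a hedged illustration of the t-subset problem mentioned above (not the paper's O(t)-time scheme on small integers), the sketch below ranks and unranks a t-subset of a size-n universe in the combinatorial number system, so the rank fits in lg (n choose t) + O(1) bits; the function names and the small example are assumptions.

```python
from math import comb

# Combinatorial number system: a sorted t-subset {c_1 < ... < c_t} of {0, ..., n-1}
# maps bijectively to an integer rank in [0, C(n, t)).

def rank_subset(subset):
    """Map a sorted t-subset to its rank."""
    return sum(comb(c, i + 1) for i, c in enumerate(sorted(subset)))

def unrank_subset(r, n, t):
    """Inverse mapping: recover the t-subset from its rank by greedy choice."""
    subset = []
    for i in range(t, 0, -1):
        c = n - 1
        while comb(c, i) > r:
            c -= 1
        subset.append(c)
        r -= comb(c, i)
    return sorted(subset)

n, t = 10, 3
s = [1, 4, 8]
r = rank_subset(s)
assert unrank_subset(r, n, t) == s
assert r < comb(n, t)            # the rank always fits in ceil(lg C(n, t)) bits
```

Note that this naive unranking spends O(n) comparisons in the worst case, which is exactly the kind of cost the paper's scheme avoids.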

7 citations


01 Jan 2008
TL;DR: In this paper, the authors present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT), motivated by the text indexing setting, and show that their upper bound on the encoding length nearly matches the lower bound they construct for a hard-to-compress class of texts.
Abstract: In this paper, we present a nearly tight analysis of the encoding length of the Burrows-Wheeler Transform (BWT) that is motivated by the text indexing setting. For a text T of n symbols drawn from an alphabet Σ, our encoding scheme achieves bounds in terms of the hth-order empirical entropy Hh of the text, and takes linear time for encoding and decoding. We also describe a lower bound on the encoding length of the BWT by constructing an infinite (non-trivial) class of texts that are among the hardest to compress using the BWT. We then show that our upper bound on the encoding length is nearly tight with this lower bound for the class of texts we described. In designing our BWT encoding and its lower bound, we also address the t-subset problem; here, the goal is to store a subset of t items drawn from a universe [1..n] using just lg (n choose t) + O(1) bits of space. A number of solutions to this basic problem are known; however, encoding or decoding usually requires either O(t) operations on large integers [Knu05, Rus05] or O(n) operations. We provide a novel approach to reduce the encoding/decoding time to just O(t) operations on small integers (of size O(lg n) bits), without increasing the space required.

5 citations


01 Jan 2008
TL;DR: Shows how arithmetic coding works, describes an efficient implementation that uses table lookup as a fast alternative to arithmetic operations, and notes that the implementation can be sped up further by the use of parallel processing.
Abstract: Arithmetic coding provides an effective mechanism for removing redundancy in the encoding of data. We show how arithmetic coding works and describe an efficient implementation that uses table lookup as a fast alternative to arithmetic operations. The reduced-precision arithmetic has a provably negligible effect on the amount of compression achieved. We can speed up the implementation further by use of parallel processing. We discuss the role of probability models and how they provide probability information to the arithmetic coder. We conclude with perspectives on the comparative advantages and disadvantages of arithmetic coding.
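As a hedged toy illustration of the interval-narrowing idea only (deliberately using exact rational arithmetic, i.e. the opposite of the paper's reduced-precision, table-lookup implementation), the following sketch encodes a string to a single fraction inside the nested interval and decodes it back; the probability model and message are assumptions for the example.

```python
from fractions import Fraction

# Arithmetic coding with exact rationals: each symbol narrows [low, low + width) to the
# sub-interval proportional to its probability; any number inside the final interval
# identifies the message. Real coders do this with bounded-precision integers.

def encode(text, probs):
    low, width = Fraction(0), Fraction(1)
    for sym in text:
        cum = Fraction(0)
        for s, p in probs.items():
            if s == sym:
                low += width * cum
                width *= p
                break
            cum += p
    return low + width / 2          # midpoint of the final interval

def decode(code, probs, length):
    out = []
    for _ in range(length):
        cum = Fraction(0)
        for s, p in probs.items():
            if cum <= code < cum + p:
                out.append(s)
                code = (code - cum) / p   # rescale and recurse on the sub-interval
                break
            cum += p
    return "".join(out)

probs = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
msg = "abacab"
assert decode(encode(msg, probs), probs, len(msg)) == msg
```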

3 citations


Proceedings ArticleDOI
14 Jun 2008
TL;DR: This work provides comprehensive results for the parallel disk caching problem, giving a fully general solution with asymptotically tight competitive ratios; it also shows tight bounds for randomized algorithms against an oblivious adversary and gives an algorithm achieving better bounds in the resource augmentation model.
Abstract: We consider the natural extension of the well-known single disk caching problem to the parallel disk I/O model (PDM) [17]. The main challenge is to achieve as much parallelism as possible and avoid I/O bottlenecks. We are given a fast memory (cache) of size M memory blocks along with a request sequence Σ = (b1, b2, ..., bn) where each block bi resides on one of D disks. In each parallel I/O step, at most one block from each disk can be fetched. The task is to serve Σ in the minimum number of parallel I/Os. Thus, each I/O is analogous to a page fault. The difference here is that during each page fault, up to D blocks can be brought into memory, as long as all of the new blocks entering the memory reside on different disks. The problem has a long history [18, 12, 13, 26]. Note that this problem is non-trivial even if all requests in Σ are unique. This restricted version is called read-once. Despite the progress in the offline version [13, 15] and the read-once version [12], the general online problem has remained open. Here, we provide comprehensive results with a full general solution for the problem with asymptotically tight competitive ratios. To exploit parallelism, any parallel disk algorithm needs a certain amount of lookahead into future requests. To provide effective caching, an online algorithm must achieve an o(D) competitive ratio. We show a lower bound stating that, for lookahead L ≤ M, any online algorithm must be Ω(D)-competitive. For lookahead L greater than M(1+1/e), where e is a constant, the tight upper bound of O(√(MD/L)) on the competitive ratio is achieved by our algorithm SKEW. The previous algorithm tLRU [26] was O((MD/L)^(2/3))-competitive and this was also shown to be tight [26] for an LRU-based strategy. We achieve the tight ratio using a fairly different strategy than LRU. We also show tight results for randomized algorithms against an oblivious adversary and give an algorithm achieving better bounds in the resource augmentation model.
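As a hedged, naive baseline for intuition only (this is not the paper's SKEW algorithm and makes no competitive-ratio claim), the sketch below simulates serving a read-once request sequence in the PDM: each parallel I/O step fetches at most one block per disk, the cache holds M blocks, and prefetching may look at the next L requests; all names and parameters are assumptions.

```python
import random

# Toy read-once PDM simulator: requests are (block, disk) pairs, each block requested once.
# On a fault we spend one parallel I/O step and greedily prefetch, within the lookahead
# window, at most one not-yet-cached block per disk, subject to cache capacity.

def greedy_parallel_steps(requests, M, L):
    cache, steps, i, n = set(), 0, 0, len(requests)
    while i < n:
        blk, _ = requests[i]
        if blk in cache:
            cache.discard(blk)          # read-once: a consumed block is never needed again
            i += 1
            continue
        steps += 1                       # page fault: one parallel I/O step
        used_disks = set()
        for b, d in requests[i:i + L]:   # prefetch within the lookahead window
            if b in cache or d in used_disks:
                continue
            if len(cache) >= M:
                if b == blk:
                    cache.pop()          # always make room for the demanded block
                else:
                    break
            cache.add(b)
            used_disks.add(d)
    return steps

D, n = 8, 2000
reqs = [(f"b{j}", random.randrange(D)) for j in range(n)]
print(greedy_parallel_steps(reqs, M=64, L=256), "parallel I/O steps for", n, "requests")
```

With enough lookahead and blocks spread over the disks, the step count falls well below the number of requests, which is the parallelism the lower and upper bounds above quantify precisely.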

1 citation