
Showing papers by "Jeffrey Scott Vitter published in 2002"


Book ChapterDOI
Lipyeow Lim1, Min Wang2, Sriram Padmanabhan2, Jeffrey Scott Vitter1, Ronald Parr1 
20 Aug 2002
TL;DR: This paper proposes XPathLearner, a method for estimating the selectivity of the most commonly used types of path expressions without looking at the XML data; because it is workload-aware, it can be more accurate under tight memory constraints than the more costly off-line methods.
Abstract: The Extensible Markup Language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions to optimize query execution plans. Selectivity estimation of XML path expressions is usually based on summary statistics about the structure of the underlying XML repository. All previous methods require an off-line scan of the XML repository to collect the statistics. In this paper, we propose XPathLearner, a method for estimating selectivity of the most commonly used types of path expressions without looking at the XML data. XPathLearner gathers and refines the statistics using query feedback in an on-line manner and is especially suited to queries in Internet-scale applications, since the underlying XML repository is either inaccessible or too large to be scanned in its entirety. Besides the on-line property, our method also has two other novel features: (a) XPathLearner is workload-aware in collecting the statistics and thus can be more accurate than the more costly off-line methods under tight memory constraints, and (b) XPathLearner automatically adjusts the statistics using query feedback when the underlying XML data change. We show empirically the estimation accuracy of our method using several real data sets.
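
The on-line estimation idea lends itself to a compact illustration. The sketch below is hypothetical: the class name, the first-order Markov decomposition over tag pairs, and the proportional update rule are illustrative choices, not the paper's exact Markov-histogram method. It estimates a path's selectivity from tag-pair statistics and nudges those statistics toward each observed query result:

```python
from collections import defaultdict

class PathSelectivityEstimator:
    """Illustrative on-line selectivity estimator for simple XML paths.

    Hypothetical sketch: a first-order Markov decomposition estimates
    the count of a path like ('a', 'b', 'c') from tag-pair statistics
    f(a,b) * f(b,c) / f(b), and query feedback nudges those statistics
    toward observed result sizes. The paper's Markov-histogram summary
    and its update rule differ in the details.
    """

    def __init__(self, learning_rate=0.5):
        self.pair = defaultdict(float)  # f(parent_tag, child_tag)
        self.lr = learning_rate

    def _tag_total(self, tag):
        # f(tag), derived as the total of pair counts rooted at tag.
        return sum(v for (a, _), v in self.pair.items() if a == tag)

    def estimate(self, path):
        """Estimated number of matches for a path of two or more tags."""
        est = self.pair[(path[0], path[1])]
        for i in range(1, len(path) - 1):
            total = self._tag_total(path[i])
            if total == 0:
                return 0.0
            est *= self.pair[(path[i], path[i + 1])] / total
        return est

    def feedback(self, path, actual):
        """Refine the statistics from one (query, true count) pair."""
        delta = self.lr * (actual - self.estimate(path)) / (len(path) - 1)
        for i in range(len(path) - 1):
            key = (path[i], path[i + 1])
            self.pair[key] = max(0.0, self.pair[key] + delta)
```

Repeated feedback concentrates accuracy on the paths the workload actually queries, which is the sense in which such a method is workload-aware.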

91 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a simple and efficient technique for performing bulk update and query operations on multidimensional spatial indexes. The technique uses ideas from the buffer tree lazy buffering technique and fully utilizes the available internal memory and the page size of the operating system.
Abstract: In recent years there has been an upsurge of interest in spatial databases. A major issue is how to manipulate efficiently massive amounts of spatial data stored on disk in multidimensional spatial indexes (data structures). Construction of spatial indexes (bulk loading) has been studied intensively in the database community. The continuous arrival of massive amounts of new data makes it important to update existing indexes (bulk updating) efficiently. In this paper we present a simple, yet efficient, technique for performing bulk update and query operations on multidimensional indexes. We present our technique in terms of the so-called R-tree and its variants, as they have emerged as practically efficient indexing methods for spatial data. Our method uses ideas from the buffer tree lazy buffering technique and fully utilizes the available internal memory and the page size of the operating system. We give a theoretical analysis of our technique, showing that it is efficient in terms of I/O communication, disk storage, and internal computation time. We also present the results of an extensive set of experiments showing that in practice our approach performs better than the previously best known bulk update methods with respect to update time, and that it produces a better quality index in terms of query performance. One important novel feature of our technique is that in most cases it allows us to perform a batch of updates and queries simultaneously. To be able to do so is essential in environments where queries have to be answered even while the index is being updated and reorganized.
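
The lazy buffering idea at the heart of the technique can be sketched independently of R-tree specifics. In the hypothetical toy below, each internal node queues incoming updates in a buffer and, when the buffer fills, pushes the whole batch one level down, so updates move between nodes a block at a time rather than one by one (the hash-based routing merely stands in for R-tree child selection):

```python
class BufferNode:
    """Node of a toy buffer-tree-style index (hypothetical sketch; the
    paper applies lazy buffering to R-trees, with bounding-box routing
    and buffers sized to the operating system's page size).

    Updates are not applied immediately: they accumulate in per-node
    buffers and get pushed one level down a whole batch at a time.
    """
    BUFFER_CAPACITY = 4  # stands in for "one disk block of operations"

    def __init__(self, children=None):
        self.children = children or []  # empty list means leaf
        self.buffer = []                # pending (op, key) pairs
        self.keys = set()               # leaf payload

    def apply(self, op, key):
        self.buffer.append((op, key))
        if len(self.buffer) >= self.BUFFER_CAPACITY:
            self.flush()

    def _route(self, key):
        # Deterministic toy routing; a real R-tree picks the child whose
        # bounding box needs the least enlargement to contain the key.
        return self.children[hash(key) % len(self.children)]

    def flush(self):
        pending, self.buffer = self.buffer, []
        for op, key in pending:
            if not self.children:           # leaf: apply for real
                if op == "insert":
                    self.keys.add(key)
                else:
                    self.keys.discard(key)
            else:                           # internal: push down in bulk
                self._route(key).apply(op, key)

    def flush_all(self):
        # Force every pending operation down, e.g. before querying.
        self.flush()
        for child in self.children:
            child.flush_all()

# Toy usage: inserts reach the leaves in batches, not one at a time.
leaves = [BufferNode() for _ in range(4)]
root = BufferNode(children=leaves)
for k in range(20):
    root.apply("insert", k)
root.flush_all()
assert sum(len(leaf.keys) for leaf in leaves) == 20
```

Because every buffer emptying moves a full block of operations at once, each operation is charged only a fraction of an I/O per level, which is where the bulk-update savings come from.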

63 citations


Journal Article
TL;DR: The design and implementation of the second phase of TPIE, a portable, extensible, flexible, and easy to use C++ programming environment for efficiently implementing I/O-algorithms and data structures, is described.
Abstract: In recent years, many theoretically I/O-efficient algorithms and data structures have been developed. The TPIE project at Duke University was started to investigate the practical importance of these theoretical results. The goal of this ongoing project is to provide a portable, extensible, flexible, and easy to use C++ programming environment for efficiently implementing I/O-algorithms and data structures. The TPIE library has been developed in two phases. The first phase focused on supporting algorithms with a sequential I/O pattern, while the recently developed second phase has focused on supporting on-line I/O-efficient data structures, which exhibit a more random I/O pattern. This paper describes the design and implementation of the second phase of TPIE.

61 citations


Book ChapterDOI
17 Sep 2002
TL;DR: TPIE, as discussed by the authors, is a C++ environment for implementing I/O-algorithms and data structures. It has been developed in two phases: the first focused on supporting algorithms with a sequential I/O pattern, while the recently developed second phase has focused on on-line I/O-efficient data structures, which exhibit a more random I/O pattern.
Abstract: In recent years, many theoretically I/O-efficient algorithms and data structures have been developed. The TPIE project at Duke University was started to investigate the practical importance of these theoretical results. The goal of this ongoing project is to provide a portable, extensible, flexible, and easy to use C++ programming environment for efficiently implementing I/O-algorithms and data structures. The TPIE library has been developed in two phases. The first phase focused on supporting algorithms with a sequential I/O pattern, while the recently developed second phase has focused on supporting on-line I/O-efficient data structures, which exhibit a more random I/O pattern. This paper describes the design and implementation of the second phase of TPIE.

52 citations


Journal ArticleDOI
TL;DR: This paper introduces a new cache-conscious sorting algorithm, R-MERGE, which achieves better performance in practice than algorithms that are superior in the theoretical models, and quantifies the performance effects of features not reflected in those models.
Abstract: Modern computer systems have increasingly complex memory systems. Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache line fetching, and streaming behavior. Inadequate models lead to poor algorithmic choices and an incomplete understanding of algorithm behavior on real machines. A key step toward developing better models is to quantify the performance effects of features not reflected in the models. This paper explores the effect of memory system features on sorting performance. We introduce a new cache-conscious sorting algorithm, R-MERGE, which achieves better performance in practice than algorithms that are superior in the theoretical models. R-MERGE is designed to minimize memory stall cycles rather than cache misses by considering features common to many system designs.
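
The abstract does not spell out R-MERGE's design, but the general cache-conscious pattern it belongs to is easy to sketch: sort runs sized to fit in cache (so each element is touched while cache-resident), then stream them through a single multiway merge. A minimal sketch, with run_size standing in for a tuned cache-sized parameter:

```python
import heapq

def cache_conscious_sort(data, run_size=1 << 15):
    """Generic cache-conscious mergesort sketch (not the paper's
    R-MERGE algorithm, whose details the abstract does not give).

    Phase 1 sorts runs small enough to stay cache-resident, so each
    element is handled while it is hot. Phase 2 merges all runs in one
    multiway pass with purely sequential (streaming) access to each run.
    """
    runs = [sorted(data[i:i + run_size])
            for i in range(0, len(data), run_size)]
    return list(heapq.merge(*runs))
```

In a real implementation run_size would be tuned to the cache, and the paper's point is that stall cycles, not miss counts alone, should drive such tuning; Python hides those effects, so this only shows the structure.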

38 citations


Book
01 Jan 2002
TL;DR: In this paper, the authors survey the state of the art in the design and analysis of external memory (or EM) algorithms, where the goal is to exploit locality in order to reduce the I/O costs.
Abstract: Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this paper we survey the state of the art in the design and analysis of external memory (or EM) algorithms, where the goal is to exploit locality in order to reduce the I/O costs. For sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We consider EM paradigms for computations involving matrices, geometric data, and graphs, and we look at problems caused by dynamic memory allocation. We report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss in this chapter are significantly faster than methods currently used in practice.
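
For reference, the I/O cost such algorithms optimize is measured in the standard external memory model, with N the input size, M the internal memory size, B the block size, and D the number of independent disks; the well-known sorting bound in this model is

```latex
\mathrm{Sort}(N) \;=\; \Theta\!\left( \frac{N}{DB} \,\log_{M/B} \frac{N}{B} \right) \text{ I/Os.}
```

Disk striping makes the D disks behave like one disk with block size DB, which shrinks the base of the logarithm from M/B to M/(DB); that inflated log factor is why striping can be nonoptimal for sorting, and why the independent-disk distribution and merging techniques mentioned above pay off.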

30 citations


Book
01 Mar 2002
TL;DR: The authors, both experts in the field of compression technologies and algorithm design, present some of the most promising algorithms for converting raw data to a compressed form for efficient broadcast.
Abstract: Video compression is a topic of increasing importance in a world where multimedia technologies and massive data sets are threatening to overflow the capacity of even the most powerful of today's computers. Internet as well as business applications such as videoconferencing, video-on-demand, and digital cable television all use compression techniques, either to decrease the required bandwidth for an application or to send more data through a bottleneck in the system. Buffering is used at both ends of the transmission to make the communication less "bursty". The interplay between compression and buffer control algorithms in order to address these performance problems and maintain high visual clarity has shown great results, and Efficient Algorithms for MPEG Video Compression is the first book dedicated to the subject. The authors, both experts in the field of compression technologies and algorithm design, present some of the most promising algorithms for converting raw data to a compressed form for efficient broadcast.

30 citations


Posted Content
TL;DR: The notion of approximate data structures, in which a small amount of error is tolerated in the output, is introduced, and the tolerance of prototypical algorithms to approximate data structures is considered.
Abstract: This paper explores the notion of approximate data structures, which return approximately correct answers to queries, but run faster than their exact counterparts. The paper describes approximate variants of the van Emde Boas data structure, which support the same dynamic operations as the standard van Emde Boas data structure (min, max, successor, predecessor, and existence queries, as well as insertion and deletion), except that answers to queries are approximate. The variants support all operations in constant time provided the performance guarantee is 1 + 1/polylog(n), and in O(log log n) time provided the performance guarantee is 1 + 1/polynomial(n), for n elements in the data structure. Applications described include Prim's minimum-spanning-tree algorithm, Dijkstra's single-source shortest paths algorithm, and an on-line variant of Graham's convex hull algorithm. To obtain output which approximates the desired output with the performance guarantee tending to 1, Prim's algorithm requires only linear time, Dijkstra's algorithm requires O(m log log n) time, and the on-line variant of Graham's algorithm requires constant amortized time per operation.
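
The notion can be made concrete with a toy approximate structure (hypothetical, and far simpler than the paper's van Emde Boas variants): if positive keys are bucketed by the rounded exponent floor(log_{1+eps} x), a successor query over the bucket indices returns a key within a (1 + eps) factor of the exact answer, and the universe the structure must search shrinks dramatically:

```python
import bisect
import math

class ApproxSuccessor:
    """Toy approximate successor structure (hypothetical illustration;
    the paper's variants are built on van Emde Boas trees instead).

    Positive keys are bucketed by floor(log_{1+eps} x), so queries are
    answered over the much smaller universe of bucket indices, and the
    returned key is within a (1 + eps) factor of the exact successor.
    """

    def __init__(self, eps=0.01):
        self.scale = math.log1p(eps)  # log(1 + eps)
        self.buckets = []             # sorted distinct bucket indices
        self.rep = {}                 # bucket index -> a stored key

    def _bucket(self, x):
        return int(math.log(x) / self.scale)

    def insert(self, x):
        b = self._bucket(x)
        if b not in self.rep:
            bisect.insort(self.buckets, b)
        self.rep[b] = x  # any key in the bucket is a valid representative

    def successor(self, x):
        """A stored key >= x/(1 + eps); exact up to relative error eps."""
        i = bisect.bisect_left(self.buckets, self._bucket(x))
        return self.rep[self.buckets[i]] if i < len(self.buckets) else None
```

Here a sorted Python list stands in for the van Emde Boas tree over bucket indices; the paper's point is that on that reduced universe, vEB machinery delivers constant or O(log log n) operation times.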

23 citations


Journal ArticleDOI
TL;DR: The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files.
Abstract: External sorting, the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks, is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is based on synchronous parallel I/O primitives provided by the TPIE programming environment [TPI]; whenever our program issues an I/O read (write) operation, one block of data is synchronously read from (written to) each disk in parallel. We compare the performance of SRM over a wide range of input sizes with that of disk-striped mergesort (DSM), which is widely used in practice. DSM consists of a standard mergesort in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is often significantly faster than with DSM (since SRM requires fewer passes). The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files. Since both parallel disk merging and multimedia processing deal with streams that get "consumed" at nonuniform and partially predictable rates, our techniques for lookahead based upon forecasting data may have relevance in video server applications.
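
The forecasting technique admits a compact illustration. While merging, the run whose in-memory block has the smallest last key is (ties aside) the next one to exhaust its buffer, so its next block can be read ahead of time. The toy below is single-threaded and hypothetical; the real SRM issues such reads to D disks in parallel and adds the forecast-and-flush policy for managing the shared buffer pool:

```python
def merge_runs(runs):
    """Multiway-merge sorted runs, each stored as a list of sorted blocks.

    A toy, single-threaded illustration of SRM-style lookahead
    forecasting, not the paper's implementation: the run whose buffered
    block has the smallest last key is (ties aside) the next to drain,
    so its next block is exactly the right block to read ahead of time.
    """
    bufs = {i: list(run[0]) for i, run in enumerate(runs) if run}
    next_idx = {i: 1 for i in bufs}
    out = []
    while bufs:
        i = min(bufs, key=lambda j: bufs[j][0])  # smallest head key
        out.append(bufs[i].pop(0))
        if not bufs[i]:
            # This drain was forecastable in advance: run i held the
            # smallest last key, so its read could already be in flight.
            if next_idx[i] < len(runs[i]):
                bufs[i] = list(runs[i][next_idx[i]])
                next_idx[i] += 1
            else:
                del bufs[i]
    return out

# merge_runs([[[1, 4], [9, 12]], [[2, 6], [7, 8]], [[3, 5], [10, 11]]])
# -> [1, 2, 3, ..., 12]
```

Because the needed block is known before any buffer empties, the read can overlap the merging, which is how the forecasting turns into effective parallel prefetching.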

12 citations


ReportDOI
29 Mar 2002
TL;DR: The goal of this proposal is to deepen the understanding of the limits of I/O systems and to construct external memory algorithms that are provably efficient, and to address bottleneck issues in parallel disks, text databases, and XML databases.
Abstract: The bottleneck in many applications that process massive amounts of data is the I/O communication between internal memory and external memory. The bottleneck is accentuated as processors get faster and parallel processors are used. Parallel disk arrays are often used to increase the I/O bandwidth. The goal of this proposal is to deepen our understanding of the limits of I/O systems and to construct external memory algorithms that are provably efficient. The three measures of performance are the number of I/Os, disk storage space, and CPU time. Even when the data fit entirely in memory, communication can still be the bottleneck, and the related issues of caching become important. Theoretical work involves the development and analysis of provably efficient external memory algorithms and cache-efficient algorithms for a variety of important application areas. We address several batched and on-line problems, involving text databases, prefetching and streaming data from parallel disks, and database selectivity estimation. Our experimental validation uses our TPIE programming environment. Plans for the coming year are to address bottleneck issues in parallel disks, text databases, and XML databases.

8 citations


Proceedings ArticleDOI
01 Jan 2002
TL;DR: This paper extends the SQL syntax to handle aggregate predicates, works out the semantics of such extensions so that they behave correctly in the existing database model, and proposes a new rk_SORT operator for the database engine.
Abstract: In this paper we consider aggregate predicates and their support in database systems. Aggregate predicates are the predicate equivalent to aggregate functions in that they can be used to search for tuples that satisfy some aggregate property over a set of tuples (as opposed to simply computing an aggregate property over a set of tuples). The importance of aggregate predicates is exemplified by many modern applications that require ranked search, or top-k queries. Such queries are the norm in multimedia and spatial databases. In order to support the concept of aggregate predicates in DBMS, we introduce several extensions in the query language and the database engine. Specifically, we extend the SQL syntax to handle aggregate predicates and work out the semantics of such extensions so that they behave correctly in the existing database model. We also propose a new rk_SORT operator for the database engine, and study relevant indexing and query optimization issues. Our approach provides several advantages, including enhanced usability and improved performance. By supporting aggregate predicates natively in the database engine, we are able to reuse existing indexing and query optimization techniques, without sacrificing generality or incurring the runtime overhead of database-external approaches. To the best of our knowledge, the proposed framework is the first to support user-defined indexing with aggregate predicates and search based upon user-defined ranking. We also provide empirical results from a simulation study that validates the effectiveness of our approach.
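
The abstract does not detail how rk_SORT evaluates ranked queries, but the query pattern it serves, top-k retrieval under a ranking function, has a standard bounded-memory form. The sketch below is a generic illustration (function and parameter names are hypothetical), not the paper's operator:

```python
import heapq

def top_k(rows, score, k):
    """Generic top-k ranked scan (illustration only; not the paper's
    rk_SORT operator, whose internals the abstract does not specify).

    Keeps a min-heap of the k best rows seen so far, so memory stays
    O(k) regardless of input size; results come back best-first.
    """
    heap = []  # entries are (score, tiebreak_index, row)
    for idx, row in enumerate(rows):
        entry = (score(row), idx, row)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # evict the current worst
    return [row for _, _, row in sorted(heap, reverse=True)]
```

For example, top_k(tuples, lambda t: t["rating"], 10) returns the ten highest-rated tuples while holding only ten in memory, the kind of evaluation a native ranked-search operator can push inside the engine instead of sorting everything.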

Proceedings ArticleDOI
21 Jul 2002
TL;DR: This paper introduces efficient algorithms for lexicographically optimally smoothing the aggregate bandwidth requirements over a shared network link, enabling efficient resource management that can better meet quality of service requirements without restricting the scalability of the system.
Abstract: We investigate the problem of smoothing multiplexed network traffic, when either a streaming server transmits data to multiple clients, or a server accesses data from multiple storage devices or other servers. We introduce efficient algorithms for lexicographically optimally smoothing the aggregate bandwidth requirements over a shared network link. In the data transmission problem, we consider the case in which the clients have different buffer capacities but no bandwidth constraints, or no buffer capacities but different bandwidth constraints. For the data access problem, we handle the general case of a shared buffer capacity and individual network bandwidth constraints. Previous approaches in the literature for the data access problem either handled only a single stream or did not compute the lexicographically optimal schedule. Lexicographically optimal smoothing (lexopt smoothing) has several advantages. By provably minimizing the variance of the required aggregate bandwidth, maximum resource requirements within the network become more predictable, and useful resource utilization increases. Fairness in sharing a network link by multiple users can be improved, and new requests from future clients are more likely to be successfully admitted without the need for frequently rescheduling previously accepted traffic. Efficient resource management at the network edges can better meet quality of service requirements without restricting the scalability of the system.
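
To see what smoothing buys even in the simplest setting, consider one stream, per-slot demands, and a client buffer. The hypothetical sketch below binary-searches the smallest feasible peak rate; the paper's algorithms go much further, smoothing multiplexed streams and minimizing the whole aggregate-rate profile lexicographically rather than just its peak:

```python
def min_peak_rate(demand, buffer_size, iters=60):
    """Minimum peak send rate for smoothing one stream (toy model).

    demand[t] is what the client consumes in slot t; the client buffer
    holds at most buffer_size of delivered-but-unconsumed data. This is
    a hypothetical single-stream warm-up: the paper smooths many
    multiplexed streams and minimizes the entire rate profile
    lexicographically, not merely its peak.
    """
    def feasible(rate):
        sent = consumed = 0.0
        for d in demand:
            # Greedily send as much as the rate and buffer space allow;
            # for a feasibility check, sending more never hurts.
            sent = min(sent + rate, consumed + buffer_size)
            consumed += d
            if sent < consumed - 1e-9:  # client would starve
                return False
        return True

    hi = max(demand, default=0.0) or 1.0
    if not feasible(hi):
        return None  # a single slot's demand exceeds the buffer
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Example: demand [2, 8, 2, 8] with an 8-unit buffer smooths to a
# constant rate of 5 instead of the unsmoothed peak of 8.
```

Flattening the peak this way is exactly what makes aggregate resource requirements predictable and leaves headroom for admitting future requests, the advantages the abstract claims for lexopt smoothing.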