
Showing papers by "Jeffrey Scott Vitter published in 2002"


Book ChapterDOI
Lipyeow Lim1, Min Wang2, Sriram Padmanabhan2, Jeffrey Scott Vitter1, Ronald Parr1 
20 Aug 2002
TL;DR: This paper proposes XPathLearner, a method for estimating the selectivity of the most commonly used types of path expressions without looking at the XML data; because it is workload-aware, it can be more accurate under tight memory constraints than the more costly off-line methods.
Abstract: The Extensible Markup Language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions to optimize query execution plans. Selectivity estimation of XML path expressions is usually based on summary statistics about the structure of the underlying XML repository. All previous methods require an off-line scan of the XML repository to collect the statistics. In this paper, we propose XPathLearner, a method for estimating selectivity of the most commonly used types of path expressions without looking at the XML data. XPathLearner gathers and refines the statistics using query feedback in an on-line manner and is especially suited to queries in Internet-scale applications, since the underlying XML repository is either inaccessible or too large to be scanned in its entirety. Besides the on-line property, our method also has two other novel features: (a) XPathLearner is workload-aware in collecting the statistics and thus can be more accurate than the more costly off-line methods under tight memory constraints, and (b) XPathLearner automatically adjusts the statistics using query feedback when the underlying XML data change. We show empirically the estimation accuracy of our method using several real data sets.
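
The on-line estimation idea lends itself to a compact illustration. The sketch below is hypothetical: the class name, the first-order Markov decomposition over tag pairs, and the proportional update rule are illustrative choices, not the paper's exact Markov-histogram method. It estimates a path's selectivity from tag-pair statistics and nudges those statistics toward each observed query result:

```python
from collections import defaultdict

class PathSelectivityEstimator:
    """Illustrative on-line selectivity estimator for simple XML paths.

    Hypothetical sketch: a first-order Markov decomposition estimates
    the count of a path like ('a', 'b', 'c') from tag-pair statistics
    f(a,b) * f(b,c) / f(b), and query feedback nudges those statistics
    toward observed result sizes. The paper's Markov-histogram summary
    and its update rule differ in the details.
    """

    def __init__(self, learning_rate=0.5):
        self.pair = defaultdict(float)  # f(parent_tag, child_tag)
        self.lr = learning_rate

    def _tag_total(self, tag):
        # f(tag), derived as the total of pair counts rooted at tag.
        return sum(v for (a, _), v in self.pair.items() if a == tag)

    def estimate(self, path):
        """Estimated number of matches for a path of two or more tags."""
        est = self.pair[(path[0], path[1])]
        for i in range(1, len(path) - 1):
            total = self._tag_total(path[i])
            if total == 0:
                return 0.0
            est *= self.pair[(path[i], path[i + 1])] / total
        return est

    def feedback(self, path, actual):
        """Refine the statistics from one (query, true count) pair."""
        delta = self.lr * (actual - self.estimate(path)) / (len(path) - 1)
        for i in range(len(path) - 1):
            key = (path[i], path[i + 1])
            self.pair[key] = max(0.0, self.pair[key] + delta)
```

Repeated feedback concentrates accuracy on the paths the workload actually queries, which is the sense in which such a method is workload-aware.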

91 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a simple and efficient technique for performing bulk update and query operations on multidimensional spatial indexes. The technique uses ideas from the buffer tree lazy buffering technique and fully utilizes the available internal memory and the page size of the operating system.
Abstract: In recent years there has been an upsurge of interest in spatial databases. A major issue is how to manipulate efficiently massive amounts of spatial data stored on disk in multidimensional spatial indexes (data structures). Construction of spatial indexes (bulk loading) has been studied intensively in the database community. The continuous arrival of massive amounts of new data makes it important to update existing indexes (bulk updating) efficiently. In this paper we present a simple, yet efficient, technique for performing bulk update and query operations on multidimensional indexes. We present our technique in terms of the so-called R-tree and its variants, as they have emerged as practically efficient indexing methods for spatial data. Our method uses ideas from the buffer tree lazy buffering technique and fully utilizes the available internal memory and the page size of the operating system. We give a theoretical analysis of our technique, showing that it is efficient in terms of I/O communication, disk storage, and internal computation time. We also present the results of an extensive set of experiments showing that in practice our approach performs better than the previously best known bulk update methods with respect to update time, and that it produces a better quality index in terms of query performance. One important novel feature of our technique is that in most cases it allows us to perform a batch of updates and queries simultaneously. To be able to do so is essential in environments where queries have to be answered even while the index is being updated and reorganized.
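
The lazy buffering idea at the heart of the technique can be sketched independently of R-tree specifics. In the hypothetical toy below, each internal node queues incoming updates in a buffer and, when the buffer fills, pushes the whole batch one level down, so updates move between nodes a block at a time rather than one by one (the hash-based routing merely stands in for R-tree child selection):

```python
class BufferNode:
    """Node of a toy buffer-tree-style index (hypothetical sketch; the
    paper applies lazy buffering to R-trees, with bounding-box routing
    and buffers sized to the operating system's page size).

    Updates are not applied immediately: they accumulate in per-node
    buffers and get pushed one level down a whole batch at a time.
    """
    BUFFER_CAPACITY = 4  # stands in for "one disk block of operations"

    def __init__(self, children=None):
        self.children = children or []  # empty list means leaf
        self.buffer = []                # pending (op, key) pairs
        self.keys = set()               # leaf payload

    def apply(self, op, key):
        self.buffer.append((op, key))
        if len(self.buffer) >= self.BUFFER_CAPACITY:
            self.flush()

    def _route(self, key):
        # Deterministic toy routing; a real R-tree picks the child whose
        # bounding box needs the least enlargement to contain the key.
        return self.children[hash(key) % len(self.children)]

    def flush(self):
        pending, self.buffer = self.buffer, []
        for op, key in pending:
            if not self.children:           # leaf: apply for real
                if op == "insert":
                    self.keys.add(key)
                else:
                    self.keys.discard(key)
            else:                           # internal: push down in bulk
                self._route(key).apply(op, key)

    def flush_all(self):
        # Force every pending operation down, e.g. before querying.
        self.flush()
        for child in self.children:
            child.flush_all()

# Toy usage: inserts reach the leaves in batches, not one at a time.
leaves = [BufferNode() for _ in range(4)]
root = BufferNode(children=leaves)
for k in range(20):
    root.apply("insert", k)
root.flush_all()
assert sum(len(leaf.keys) for leaf in leaves) == 20
```

Because every buffer emptying moves a full block of operations at once, each operation is charged only a fraction of an I/O per level, which is where the bulk-update savings come from.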

63 citations


Journal Article
TL;DR: The design and implementation of the second phase of TPIE, a portable, extensible, flexible, and easy to use C++ programming environment for efficiently implementing I/O-algorithms and data structures, is described.
Abstract: In recent years, many theoretically I/O-efficient algorithms and data structures have been developed. The TPIE project at Duke University was started to investigate the practical importance of these theoretical results. The goal of this ongoing project is to provide a portable, extensible, flexible, and easy to use C++ programming environment for efficiently implementing I/O-algorithms and data structures. The TPIE library has been developed in two phases. The first phase focused on supporting algorithms with a sequential I/O pattern, while the recently developed second phase has focused on supporting on-line I/O-efficient data structures, which exhibit a more random I/O pattern. This paper describes the design and implementation of the second phase of TPIE.

61 citations


Book ChapterDOI
17 Sep 2002
TL;DR: TPIE, as discussed by the authors, is a C++ environment for implementing I/O-algorithms and data structures. It has been developed in two phases: the first focused on supporting algorithms with a sequential I/O pattern, while the recently developed second phase has focused on on-line I/O-efficient data structures, which exhibit a more random I/O pattern.
Abstract: In recent years, many theoretically I/O-efficient algorithms and data structures have been developed. The TPIE project at Duke University was started to investigate the practical importance of these theoretical results. The goal of this ongoing project is to provide a portable, extensible, flexible, and easy to use C++ programming environment for efficiently implementing I/O-algorithms and data structures. The TPIE library has been developed in two phases. The first phase focused on supporting algorithms with a sequential I/O pattern, while the recently developed second phase has focused on supporting on-line I/O-efficient data structures, which exhibit a more random I/O pattern. This paper describes the design and implementation of the second phase of TPIE.

52 citations


Journal ArticleDOI
TL;DR: This paper introduces a new cache-conscious sorting algorithm, R-MERGE, which achieves better performance in practice than algorithms that are superior in the theoretical models, and quantifies the performance effects of features not reflected in those models.
Abstract: Modern computer systems have increasingly complex memory systems. Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache line fetching, and streaming behavior. Inadequate models lead to poor algorithmic choices and an incomplete understanding of algorithm behavior on real machines. A key step toward developing better models is to quantify the performance effects of features not reflected in the models. This paper explores the effect of memory system features on sorting performance. We introduce a new cache-conscious sorting algorithm, R-MERGE, which achieves better performance in practice than algorithms that are superior in the theoretical models. R-MERGE is designed to minimize memory stall cycles rather than cache misses by considering features common to many system designs.
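
The abstract does not spell out R-MERGE's design, but the general cache-conscious pattern it belongs to is easy to sketch: sort runs sized to fit in cache (so each element is touched while cache-resident), then stream them through a single multiway merge. A minimal sketch, with run_size standing in for a tuned cache-sized parameter:

```python
import heapq

def cache_conscious_sort(data, run_size=1 << 15):
    """Generic cache-conscious mergesort sketch (not the paper's
    R-MERGE algorithm, whose details the abstract does not give).

    Phase 1 sorts runs small enough to stay cache-resident, so each
    element is handled while it is hot. Phase 2 merges all runs in one
    multiway pass with purely sequential (streaming) access to each run.
    """
    runs = [sorted(data[i:i + run_size])
            for i in range(0, len(data), run_size)]
    return list(heapq.merge(*runs))
```

In a real implementation run_size would be tuned to the cache, and the paper's point is that stall cycles, not miss counts alone, should drive such tuning; Python hides those effects, so this only shows the structure.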

38 citations


Book
01 Jan 2002
TL;DR: In this paper, the authors survey the state of the art in the design and analysis of external memory (or EM) algorithms, where the goal is to exploit locality in order to reduce the I/O costs.
Abstract: Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this paper we survey the state of the art in the design and analysis of external memory (or EM) algorithms, where the goal is to exploit locality in order to reduce the I/O costs. For sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We consider EM paradigms for computations involving matrices, geometric data, and graphs, and we look at problems caused by dynamic memory allocation. We report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss in this chapter are significantly faster than methods currently used in practice.
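
For reference, the I/O cost such algorithms optimize is measured in the standard external memory model, with N the input size, M the internal memory size, B the block size, and D the number of independent disks; the well-known sorting bound in this model is

```latex
\mathrm{Sort}(N) \;=\; \Theta\!\left( \frac{N}{DB} \,\log_{M/B} \frac{N}{B} \right) \text{ I/Os.}
```

Disk striping makes the D disks behave like one disk with block size DB, which shrinks the base of the logarithm from M/B to M/(DB); that inflated log factor is why striping can be nonoptimal for sorting, and why the independent-disk distribution and merging techniques mentioned above pay off.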

30 citations


Book
01 Mar 2002
TL;DR: The authors, both experts in the field of compression technologies and algorithm design, present some of the most promising algorithms for converting raw data to a compressed form for efficient broadcast.
Abstract: Video compression is a topic of increasing importance in a world where multimedia technologies and massive data sets are threatening to overflow the capacity of even the most powerful of today's computers. Internet as well as business applications such as videoconferencing, video-on-demand, and digital cable television all use compression techniques, either to decrease the required bandwidth for an application or to send more data through a bottleneck in the system. Buffering is used at both ends of the transmission to make the communication less "bursty". The interplay between compression and buffer control algorithms in order to address these performance problems and maintain high visual clarity has shown great results, and Efficient Algorithms for MPEG Video Compression is the first book dedicated to the subject. The authors, both experts in the field of compression technologies and algorithm design, present some of the most promising algorithms for converting raw data to a compressed form for efficient broadcast.

30 citations


Posted Content
TL;DR: The notion of approximate data structures, in which a small amount of error is tolerated in the output, is introduced, and the tolerance of prototypical algorithms to approximate data structures is considered.
Abstract: This paper explores the notion of approximate data structures, which return approximately correct answers to queries, but run faster than their exact counterparts. The paper describes approximate variants of the van Emde Boas data structure, which support the same dynamic operations as the standard van Emde Boas data structure (min, max, successor, predecessor, and existence queries, as well as insertion and deletion), except that answers to queries are approximate. The variants support all operations in constant time provided the performance guarantee is 1 + 1/polylog(n), and in O(log log n) time provided the performance guarantee is 1 + 1/polynomial(n), for n elements in the data structure. Applications described include Prim's minimum-spanning-tree algorithm, Dijkstra's single-source shortest paths algorithm, and an on-line variant of Graham's convex hull algorithm. To obtain output which approximates the desired output with the performance guarantee tending to 1, Prim's algorithm requires only linear time, Dijkstra's algorithm requires O(m log log n) time, and the on-line variant of Graham's algorithm requires constant amortized time per operation.
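
The notion can be made concrete with a toy approximate structure (hypothetical, and far simpler than the paper's van Emde Boas variants): if positive keys are bucketed by the rounded exponent floor(log_{1+eps} x), a successor query over the bucket indices returns a key within a (1 + eps) factor of the exact answer, and the universe the structure must search shrinks dramatically:

```python
import bisect
import math

class ApproxSuccessor:
    """Toy approximate successor structure (hypothetical illustration;
    the paper's variants are built on van Emde Boas trees instead).

    Positive keys are bucketed by floor(log_{1+eps} x), so queries are
    answered over the much smaller universe of bucket indices, and the
    returned key is within a (1 + eps) factor of the exact successor.
    """

    def __init__(self, eps=0.01):
        self.scale = math.log1p(eps)  # log(1 + eps)
        self.buckets = []             # sorted distinct bucket indices
        self.rep = {}                 # bucket index -> a stored key

    def _bucket(self, x):
        return int(math.log(x) / self.scale)

    def insert(self, x):
        b = self._bucket(x)
        if b not in self.rep:
            bisect.insort(self.buckets, b)
        self.rep[b] = x  # any key in the bucket is a valid representative

    def successor(self, x):
        """A stored key >= x/(1 + eps); exact up to relative error eps."""
        i = bisect.bisect_left(self.buckets, self._bucket(x))
        return self.rep[self.buckets[i]] if i < len(self.buckets) else None
```

Here a sorted Python list stands in for the van Emde Boas tree over bucket indices; the paper's point is that on that reduced universe, vEB machinery delivers constant or O(log log n) operation times.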

23 citations


Journal ArticleDOI
TL;DR: The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files.
Abstract: External sorting, the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks, is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is based on synchronous parallel I/O primitives provided by the TPIE programming environment [TPI]; whenever our program issues an I/O read (write) operation, one block of data is synchronously read from (written to) each disk in parallel. We compare the performance of SRM over a wide range of input sizes with that of disk-striped mergesort (DSM), which is widely used in practice. DSM consists of a standard mergesort in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is often significantly faster than with DSM (since SRM requires fewer passes). The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files. Since both parallel disk merging and multimedia processing deal with streams that get "consumed" at nonuniform and partially predictable rates, our techniques for lookahead based upon forecasting data may have relevance in video server applications.
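
The forecasting technique admits a compact illustration. While merging, the run whose in-memory block has the smallest last key is (ties aside) the next one to exhaust its buffer, so its next block can be read ahead of time. The toy below is single-threaded and hypothetical; the real SRM issues such reads to D disks in parallel and adds the forecast-and-flush policy for managing the shared buffer pool:

```python
def merge_runs(runs):
    """Multiway-merge sorted runs, each stored as a list of sorted blocks.

    A toy, single-threaded illustration of SRM-style lookahead
    forecasting, not the paper's implementation: the run whose buffered
    block has the smallest last key is (ties aside) the next to drain,
    so its next block is exactly the right block to read ahead of time.
    """
    bufs = {i: list(run[0]) for i, run in enumerate(runs) if run}
    next_idx = {i: 1 for i in bufs}
    out = []
    while bufs:
        i = min(bufs, key=lambda j: bufs[j][0])  # smallest head key
        out.append(bufs[i].pop(0))
        if not bufs[i]:
            # This drain was forecastable in advance: run i held the
            # smallest last key, so its read could already be in flight.
            if next_idx[i] < len(runs[i]):
                bufs[i] = list(runs[i][next_idx[i]])
                next_idx[i] += 1
            else:
                del bufs[i]
    return out

# merge_runs([[[1, 4], [9, 12]], [[2, 6], [7, 8]], [[3, 5], [10, 11]]])
# -> [1, 2, 3, ..., 12]
```

Because the needed block is known before any buffer empties, the read can overlap the merging, which is how the forecasting turns into effective parallel prefetching.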

12 citations


ReportDOI
29 Mar 2002
TL;DR: The goal of this proposal is to deepen the understanding of the limits of I/O systems and to construct external memory algorithms that are provably efficient, and to address bottleneck issues in parallel disks, text databases, and XML databases.
Abstract: The bottleneck in many applications that process massive amounts of data is the I/O communication between internal memory and external memory. The bottleneck is accentuated as processors get faster and parallel processors are used. Parallel disk arrays are often used to increase the I/O bandwidth. The goal of this proposal is to deepen our understanding of the limits of I/O systems and to construct external memory algorithms that are provably efficient. The three measures of performance are the number of I/Os, disk storage space, and CPU time. Even when the data fit entirely in memory, communication can still be the bottleneck, and the related issues of caching become important. Theoretical work involves the development and analysis of provably efficient external memory algorithms and cache-efficient algorithms for a variety of important application areas. We address several batched and on-line problems, involving text databases, prefetching and streaming data from parallel disks, and database selectivity estimation. Our experimental validation uses our TPIE programming environment. Plans for the coming year are to address bottleneck issues in parallel disks, text databases, and XML databases.

8 citations


Proceedings ArticleDOI
01 Jan 2002
TL;DR: This paper extends the SQL syntax to handle aggregate predicates, works out the semantics of such extensions so that they behave correctly in the existing database model, and proposes a new rk_SORT operator for the database engine.
Abstract: In this paper we consider aggregate predicates and their support in database systems. Aggregate predicates are the predicate equivalent to aggregate functions in that they can be used to search for tuples that satisfy some aggregate property over a set of tuples (as opposed to simply computing an aggregate property over a set of tuples). The importance of aggregate predicates is exemplified by many modern applications that require ranked search, or top-k queries. Such queries are the norm in multimedia and spatial databases. In order to support the concept of aggregate predicates in DBMS, we introduce several extensions in the query language and the database engine. Specifically, we extend the SQL syntax to handle aggregate predicates and work out the semantics of such extensions so that they behave correctly in the existing database model. We also propose a new rk_SORT operator for the database engine, and study relevant indexing and query optimization issues. Our approach provides several advantages, including enhanced usability and improved performance. By supporting aggregate predicates natively in the database engine, we are able to reuse existing indexing and query optimization techniques, without sacrificing generality or incurring the runtime overhead of database-external approaches. To the best of our knowledge, the proposed framework is the first to support user-defined indexing with aggregate predicates and search based upon user-defined ranking. We also provide empirical results from a simulation study that validates the effectiveness of our approach.
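
The abstract does not detail how rk_SORT evaluates ranked queries, but the query pattern it serves, top-k retrieval under a ranking function, has a standard bounded-memory form. The sketch below is a generic illustration (function and parameter names are hypothetical), not the paper's operator:

```python
import heapq

def top_k(rows, score, k):
    """Generic top-k ranked scan (illustration only; not the paper's
    rk_SORT operator, whose internals the abstract does not specify).

    Keeps a min-heap of the k best rows seen so far, so memory stays
    O(k) regardless of input size; results come back best-first.
    """
    heap = []  # entries are (score, tiebreak_index, row)
    for idx, row in enumerate(rows):
        entry = (score(row), idx, row)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # evict the current worst
    return [row for _, _, row in sorted(heap, reverse=True)]
```

For example, top_k(tuples, lambda t: t["rating"], 10) returns the ten highest-rated tuples while holding only ten in memory, the kind of evaluation a native ranked-search operator can push inside the engine instead of sorting everything.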

Proceedings ArticleDOI
21 Jul 2002
TL;DR: This paper introduces efficient algorithms for lexicographically optimally smoothing the aggregate bandwidth requirements over a shared network link, enabling efficient resource management that can better meet quality of service requirements without restricting the scalability of the system.
Abstract: We investigate the problem of smoothing multiplexed network traffic, when either a streaming server transmits data to multiple clients, or a server accesses data from multiple storage devices or other servers. We introduce efficient algorithms for lexicographically optimally smoothing the aggregate bandwidth requirements over a shared network link. In the data transmission problem, we consider the case in which the clients have different buffer capacities but no bandwidth constraints, or no buffer capacities but different bandwidth constraints. For the data access problem, we handle the general case of a shared buffer capacity and individual network bandwidth constraints. Previous approaches in the literature for the data access problem either handled only a single stream or did not compute the lexicographically optimal schedule. Lexicographically optimal smoothing (lexopt smoothing) has several advantages. By provably minimizing the variance of the required aggregate bandwidth, maximum resource requirements within the network become more predictable, and useful resource utilization increases. Fairness in sharing a network link by multiple users can be improved, and new requests from future clients are more likely to be successfully admitted without the need for frequently rescheduling previously accepted traffic. Efficient resource management at the network edges can better meet quality of service requirements without restricting the scalability of the system.
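
To see what smoothing buys even in the simplest setting, consider one stream, per-slot demands, and a client buffer. The hypothetical sketch below binary-searches the smallest feasible peak rate; the paper's algorithms go much further, smoothing multiplexed streams and minimizing the whole aggregate-rate profile lexicographically rather than just its peak:

```python
def min_peak_rate(demand, buffer_size, iters=60):
    """Minimum peak send rate for smoothing one stream (toy model).

    demand[t] is what the client consumes in slot t; the client buffer
    holds at most buffer_size of delivered-but-unconsumed data. This is
    a hypothetical single-stream warm-up: the paper smooths many
    multiplexed streams and minimizes the entire rate profile
    lexicographically, not merely its peak.
    """
    def feasible(rate):
        sent = consumed = 0.0
        for d in demand:
            # Greedily send as much as the rate and buffer space allow;
            # for a feasibility check, sending more never hurts.
            sent = min(sent + rate, consumed + buffer_size)
            consumed += d
            if sent < consumed - 1e-9:  # client would starve
                return False
        return True

    hi = max(demand, default=0.0) or 1.0
    if not feasible(hi):
        return None  # a single slot's demand exceeds the buffer
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Example: demand [2, 8, 2, 8] with an 8-unit buffer smooths to a
# constant rate of 5 instead of the unsmoothed peak of 8.
```

Flattening the peak this way is exactly what makes aggregate resource requirements predictable and leaves headroom for admitting future requests, the advantages the abstract claims for lexopt smoothing.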