
Showing papers by "Jeffrey Scott Vitter published in 1999"


Journal ArticleDOI
01 Jun 1999
TL;DR: A novel method is presented that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time- and space-efficient manner, providing significantly more accurate results than other efficient approximation techniques such as random sampling.
Abstract: Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries. In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy. We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.

363 citations
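
As context, here is a minimal one-dimensional sketch of the general idea behind a wavelet-based compact data cube: take a Haar wavelet transform of the data, keep only the k largest coefficients, and answer range aggregates from the lossy reconstruction. This illustrates the principle only, not the paper's I/O-efficient construction for sparse high-dimensional arrays; the function names and the choice of k are hypothetical.

```python
import numpy as np

def haar_decompose(a):
    """Orthonormal 1-D Haar transform (input length must be a power of two)."""
    a = a.astype(float)
    coeffs = []
    while len(a) > 1:
        coeffs.append((a[0::2] - a[1::2]) / np.sqrt(2.0))  # detail coefficients
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)             # averages
    coeffs.append(a)                                       # overall average
    return coeffs

def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    a = coeffs[-1]
    for d in reversed(coeffs[:-1]):
        out = np.empty(2 * len(a))
        out[0::2] = (a + d) / np.sqrt(2.0)
        out[1::2] = (a - d) / np.sqrt(2.0)
        a = out
    return a

def compact_cube(data, k):
    """Keep only the k largest-magnitude coefficients (the 'compact' part)."""
    coeffs = haar_decompose(data)
    cutoff = np.sort(np.abs(np.concatenate(coeffs)))[-k]
    return [np.where(np.abs(c) >= cutoff, c, 0.0) for c in coeffs]

# Approximate range-sum query answered from the lossy representation.
data = np.random.zipf(2.0, 1024).astype(float)
approx = haar_reconstruct(compact_cube(data, k=64))
print("exact:", data[100:200].sum(), "approx:", round(approx[100:200].sum(), 1))
```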


Proceedings ArticleDOI
01 May 1999
TL;DR: The theory of indexability is applied to the problem of two-dimensional range searching and it is shown that the special case of 3-sided querying can be solved with constant redundancy and access overhead.
Abstract: In this paper we settle several longstanding open problems in the theory of indexability and external orthogonal range searching. In the first part of the paper, we apply the theory of indexability to the problem of two-dimensional range searching. We show that the special case of 3-sided querying can be solved with constant redundancy and access overhead. From this, we derive indexing schemes for general 4-sided range queries that exhibit an optimal tradeoff between redundancy and access overhead. In the second part of the paper, we develop dynamic external memory data structures for the two query types. Our structure for 3-sided queries occupies O(N/B) disk blocks, and it supports insertions and deletions in O(log_B N) I/Os and queries in O(log_B N + T/B) I/Os, where B is the disk block size, N is the number of points, and T is the query output size. These bounds are optimal. Our structure for general (4-sided) range searching occupies O((N/B) log(N/B) / log log_B N) disk blocks and answers queries in O(log_B N + T/B) I/Os, which is optimal. It also supports updates in O((log_B N) log(N/B) / log log_B N) I/Os.

167 citations
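
To make the paper's vocabulary concrete: an indexing scheme stores each data item in one or more size-B disk blocks (redundancy is the number of copies per item), and access overhead is the ratio of blocks read to the information-theoretic minimum of ceil(T/B) blocks for an output of size T. Below is a toy illustration of those two measures using a naive x-sorted layout, not the paper's scheme; it assumes a small in-memory index of block x-ranges, and all names are hypothetical.

```python
import math, random

B = 4  # block size, in points

def build_blocks(points):
    """Naive scheme: sort by x and pack into blocks of B (redundancy = 1)."""
    pts = sorted(points)
    return [pts[i:i + B] for i in range(0, len(pts), B)]

def three_sided_query(blocks, x1, x2, y0):
    """Report points with x1 <= x <= x2 and y >= y0, counting block reads.
    Blocks outside [x1, x2] are skipped via the assumed in-memory index."""
    reads, result = 0, []
    for blk in blocks:
        if blk[-1][0] < x1 or blk[0][0] > x2:
            continue
        reads += 1  # one I/O to fetch this block
        result.extend(p for p in blk if x1 <= p[0] <= x2 and p[1] >= y0)
    return result, reads

pts = [(random.random(), random.random()) for _ in range(64)]
out, reads = three_sided_query(build_blocks(pts), 0.2, 0.6, 0.9)
overhead = reads / max(1, math.ceil(len(out) / B))
# The naive layout's overhead blows up for selective y0; the paper's point
# is that 3-sided queries admit O(1) redundancy and O(1) access overhead.
print(len(out), "points,", reads, "reads, access overhead", round(overhead, 1))
```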


Book ChapterDOI
15 Jan 1999
TL;DR: This paper presents a simple yet efficient technique for performing bulk update and query operations on multidimensional indexes; the technique is described in terms of the so-called R-tree and its variants, as they have emerged as practically efficient indexing methods for spatial data.
Abstract: We present a simple lazy buffering technique for performing bulk operations on multidimensional spatial indexes (data structures), and show that it is efficient in theory as well as in practice. We present the technique in terms of the so-called R-tree and its variants, as they have emerged as practically efficient indexing methods for spatial data.

97 citations
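
A minimal sketch of the lazy-buffering idea, shown on a toy one-dimensional tree rather than an R-tree: each internal node holds a buffer of pending insertions and pushes them one level down in a batch when the buffer fills, so the cost of accessing a child is amortized over many updates. The class names and buffer capacity here are hypothetical.

```python
BUFFER_CAP = 8  # pending updates a node holds before flushing downward

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.buffer = []     # lazily buffered insertions
        self.items = []      # data stored at a leaf
        self.children = []   # (split_key, child) pairs at internal nodes

    def insert(self, key):
        """Buffer the update instead of walking root-to-leaf immediately."""
        self.buffer.append(key)
        if len(self.buffer) >= BUFFER_CAP:
            self.flush()

    def flush(self):
        """Push the whole buffer one level down in a single batch."""
        pending, self.buffer = self.buffer, []
        if self.leaf:
            self.items.extend(pending)  # a real R-tree would split here
            return
        for key in pending:
            for split_key, child in self.children:
                if key <= split_key:
                    child.insert(key)   # may trigger a cascading flush
                    break

root = Node(leaf=False)
root.children = [(0.5, Node()), (1.0, Node())]
for i in range(100):
    root.insert((i % 10) / 10)
# A final top-down flush would push any still-buffered updates to the leaves.
```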



Proceedings ArticleDOI
01 May 1999
TL;DR: Measurements of the read performance of multiple disks that share a SCSI bus under a heavy workload are described, and formulas that accurately characterize the observed performance are developed and validated.
Abstract: In modern I/O architectures, multiple disk drives are attached to each I/O controller. A study of the performance of such architectures under I/O-intensive workloads has revealed a performance impairment that results from a previously unknown form of convoy behavior in disk I/O. In this paper, we describe measurements of the read performance of multiple disks that share a SCSI bus under a heavy workload, and we develop and validate formulas that accurately characterize the observed performance on several platforms over a range of I/O sizes. Two terms in the formula clearly characterize the lost performance seen in our experiments. We describe techniques to deal with the performance impairment via user-level workarounds that achieve greater overlap of bus transfers with disk seeks and that increase the percentage of transfers that occur at the full bus bandwidth rather than at the lower bandwidth of a disk head. Experiments show bandwidth improvements when using these user-level techniques, but only in the case of large I/Os.

43 citations


Proceedings ArticleDOI
01 Jan 1999
TL;DR: An efficient external-memory dynamic data structure for point location in monotone planar subdivisions, and a new variant of B-trees, called level-balanced B-trees, which allow insert, delete, merge, and split operations in O((1 + (b/B) log_{M/B} (N/B)) log_b N) I/Os (amortized), even if each node stores a pointer to its parent.
Abstract: We present an efficient external-memory dynamic data structure for point location in monotone planar subdivisions. Our data structure uses O(N/B) disk blocks to store a monotone subdivision of size N, where B is the size of a disk block. It supports queries in O(log_B^2 N) I/Os (worst-case) and updates in O(log_B^2 N) I/Os (amortized). We also propose a new variant of B-trees, called level-balanced B-trees, which allow insert, delete, merge, and split operations in O((1 + (b/B) log_{M/B} (N/B)) log_b N) I/Os (amortized), for 2 <= b <= B/2, even if each node stores a pointer to its parent. Here M is the size of main memory. Besides being essential to our point-location data structure, we believe that level-balanced B-trees are of significant independent interest. They can, for example, be used to dynamically maintain a planar st-graph using O((1 + (b/B) log_{M/B} (N/B)) log_b N) = O(log_B^2 N) I/Os (amortized) per update, so that reachability queries can be answered in O(log_B N) I/Os (worst case).

31 citations
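
The clause "even if each node stores a pointer to its parent" is the crux: in an ordinary B-tree, splitting a node moves about half its children to a new sibling, and every moved child's parent pointer must be rewritten, roughly one I/O per child in external memory. The toy sketch below shows that cost; it illustrates the problem the paper's level-balanced rebalancing avoids, not the paper's solution.

```python
class Node:
    def __init__(self):
        self.children = []
        self.parent = None  # the parent pointer the paper must support

def split(node):
    """Split an overfull node: half its children move to a new sibling.
    The loop repointing each moved child is the expensive part, since
    every child may live in a different disk block."""
    sibling = Node()
    mid = len(node.children) // 2
    sibling.children = node.children[mid:]
    node.children = node.children[:mid]
    for child in sibling.children:  # Theta(B) pointer fix-ups per split
        child.parent = sibling
    sibling.parent = node.parent
    return sibling
```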


01 Jan 1999
TL;DR: The LEDA-SM library introduced in this paper extends the LEDA library towards secondary-memory computation, using I/O-efficient algorithms and data structures that do not suffer from the so-called I/O bottleneck.
Abstract: In recent years, many software libraries for in-core computation have been developed. Most internal-memory algorithms perform very badly when used in an external-memory setting. We introduce LEDA-SM, which extends the LEDA library [22] towards secondary-memory computation. LEDA-SM uses I/O-efficient algorithms and data structures that do not suffer from the so-called I/O bottleneck. LEDA is used for in-core computation. We explain the design of LEDA-SM and report on performance results.

26 citations
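
For flavor, here is a minimal sketch of the kind of structure such a library provides: an external-memory stack that keeps a small in-core buffer and moves whole blocks to disk, so each push or pop costs O(1/B) amortized I/Os. This is a hypothetical illustration, not LEDA-SM's actual interface.

```python
import os, pickle, tempfile

class ExternalStack:
    """Keeps at most 2*B elements in memory; spills whole blocks to disk."""
    def __init__(self, B=1024):
        self.B = B
        self.buf = []                   # in-core top of the stack
        self.spill = tempfile.TemporaryFile()
        self.offsets = []               # file offsets of spilled blocks

    def push(self, x):
        self.buf.append(x)
        if len(self.buf) == 2 * self.B:            # spill the oldest block
            block, self.buf = self.buf[:self.B], self.buf[self.B:]
            self.offsets.append(self.spill.tell())
            pickle.dump(block, self.spill)

    def pop(self):
        if not self.buf and self.offsets:          # reload the newest block
            self.spill.seek(self.offsets.pop())
            self.buf = pickle.load(self.spill)
            self.spill.seek(0, os.SEEK_END)
        return self.buf.pop()

s = ExternalStack(B=4)
for i in range(20):
    s.push(i)
assert [s.pop() for _ in range(20)] == list(range(19, -1, -1))
```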


Book ChapterDOI
11 Aug 1999
TL;DR: A variety of on-line data structures for external memory are discussed--some very old and some very new--such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees (for stabbing queries), priority search trees (for 3-sided 2-D range search), and R-trees and other spatial structures.
Abstract: The data sets for many of today's computer applications are too large to fit within the computer's internal memory and must instead be stored on external storage devices such as disks. A major performance bottleneck can be the input/output communication (or I/O) between the external and internal memories. In this paper we discuss a variety of on-line data structures for external memory--some very old and some very new--such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees (for stabbing queries), priority search trees (for 3-sided 2-D range search), and R-trees and other spatial structures. We also discuss several open problems along the way.

24 citations


Proceedings ArticleDOI
17 Oct 1999
TL;DR: This work defines the competitive worst-case notion of what it means for an MA algorithm to be dynamically optimal and proves fundamental lower bounds on the performance of MA algorithms for problems such as sorting, standard matrix multiplication, and several related problems.
Abstract: External memory algorithms play a key role in database management systems and large-scale processing systems. External memory algorithms are typically tuned for efficient performance given a fixed, statically allocated amount of internal memory. However, with the advent of real-time database systems and database systems based upon administratively defined goals, algorithms must increasingly be able to adapt in an online manner when the amount of internal memory allocated to them changes dynamically and unpredictably. We present a theoretical and applicable framework for memory-adaptive algorithms (or simply MA algorithms). We define the competitive worst-case notion of what it means for an MA algorithm to be dynamically optimal and prove fundamental lower bounds on the performance of MA algorithms for problems such as sorting, standard matrix multiplication, and several related problems. Our main tool for proving dynamic optimality is the notion of resource consumption, which measures how efficiently an MA algorithm adapts itself to memory fluctuations. We present the first dynamically optimal algorithms for sorting (based upon mergesort), permuting, FFT, permutation networks, buffer trees, (standard) matrix multiplication, and LU decomposition. In each case, dynamic optimality is demonstrated via a potential function argument showing that the algorithm's resource consumption is within a constant factor of optimal.

19 citations
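
A simplified sketch of what memory adaptivity means for mergesort: each merge pass chooses its fan-in from the memory available at the moment it starts, so the algorithm reacts online to a fluctuating allocation. The paper's model is stronger (memory can change during an operation, and optimality is proved via the resource-consumption measure); the names here, including the current_memory callback, are hypothetical.

```python
import heapq, random

def adaptive_mergesort(runs, current_memory, block=1):
    """Merge sorted runs until one remains; fan-in adapts per merge."""
    while len(runs) > 1:
        fan_in = max(2, current_memory() // block)  # one block per input run
        group, runs = runs[:fan_in], runs[fan_in:]
        runs.append(list(heapq.merge(*group)))
    return runs[0]

runs = [sorted(random.sample(range(1000), 20)) for _ in range(16)]
allocation = iter([8, 2, 2, 6, 4, 8, 8, 8, 8, 8, 8])  # fluctuating memory
merged = adaptive_mergesort(runs, lambda: next(allocation))
assert merged == sorted(x for r in runs for x in r)
```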


Proceedings ArticleDOI
01 Jun 1999
TL;DR: The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice.
Abstract: External sorting—the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks—is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks.

16 citations
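
The "simple randomized" ingredient, matching Knuth's name randomized striping, is the layout step: each sorted run is written round-robin across the D disks, with the stripe starting at a uniformly random disk, so that during merging the read load on the runs spreads evenly across disks in expectation. A minimal sketch of just that layout step; D, the block labels, and the function name are hypothetical.

```python
import random

D = 6  # number of parallel disks

def stripe_run(run_blocks, disks):
    """Write a run's blocks round-robin, starting at a random disk."""
    start = random.randrange(D)
    for i, blk in enumerate(run_blocks):
        disks[(start + i) % D].append(blk)

disks = [[] for _ in range(D)]
for r in range(4):  # lay out four sorted runs of ten blocks each
    stripe_run([f"run{r}-blk{j}" for j in range(10)], disks)
print([len(d) for d in disks])  # per-disk load, balanced in expectation
```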


Journal ArticleDOI
TL;DR: The results suggest that modifying LZ77, LZFG, and LZW along these lines yields improvements in compression of about 3%, 6%, and 15%, respectively.
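
For reference, the unmodified baseline of one of the three schemes named above: a textbook LZW encoder, which grows a dictionary of previously seen strings and emits dictionary codes. The paper's alphabet re-representation is not shown; this is only the standard algorithm it starts from.

```python
def lzw_encode(data: bytes):
    """Textbook LZW: extend the longest known prefix, emit its code."""
    table = {bytes([i]): i for i in range(256)}  # initial alphabet
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                      # keep growing the match
        else:
            out.append(table[w])        # emit code for longest match
            table[wc] = len(table)      # add the new string
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

print(lzw_encode(b"abababab"))  # repetition yields a short code stream
```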

01 Jan 1999
TL;DR: This thesis proposes novel approximation techniques used in both selectivity estimation and approximate computation of OLAP aggregates, and presents a novel classification algorithm (classifier) called MIND (MINing in Databases) that is scalable with respect to I/O efficiency; scalability is a key requirement for any data mining algorithm.
Abstract: In this thesis, we study two important techniques that are widely used in database systems: approximation and learning. Approximation has been an area of great interest and importance in the database community. A classic example of using approximation in database systems is selectivity estimation. Another example is using approximation techniques to answer OLAP (On-Line Analytical Processing) queries, an application that is quite new and was initiated by our work. In this thesis, we propose novel approximation techniques used in both selectivity estimation and approximate computation of OLAP aggregates. Our techniques are based on the powerful mathematical tool of wavelets and multiresolution analysis and are fundamentally different from traditional approaches. We present several methods that represent the first attempt to use wavelets in the domain of database approximation. Our methods offer substantial improvements in accuracy and efficiency over existing methods. We also develop efficient and scalable learning techniques for DBMSs to extract patterns from large databases in the context of data mining. Classification is an important and fundamental data mining problem. Almost all current classification algorithms have the restriction that the entire training set must fit in internal memory to achieve efficiency. We present a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND is scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm.

01 Jan 1999
TL;DR: This thesis describes an algorithm for removing geometric and topological flaws, such as missing polygons and cracks, from the input, and presents the Binary Space Partition (BSP), a hierarchical spatial decomposition used to efficiently implement both the model repair and hidden-surface removal algorithms.
Abstract: A central problem in computer graphics is hidden-surface removal : given a set of objects (some of which may be moving continuously), a continuously-moving viewpoint, and an image plane, maintain the scene visible from the viewpoint as projected onto the image plane. Algorithms developed for this problem in computational geometry are theoretically efficient but are often difficult to implement. On the other hand, computer graphics techniques are designed to be fast in practice but may perform badly in the worst-case. In this thesis, we describe our efforts to bridge this gap between theory and practice in the context of hidden-surface removal. Three themes underlie our research: (1) the idea of analysing algorithms in terms of the geometric complexity of the input, which encourages the development of algorithms that are provably efficient for data sets that typically arise in practice, (2) the notion of a kinetic data structure, which is a mechanism for efficiently processing continuously-moving objects, and (3) object complexity, a model inspired by the performance characteristics of current graphics hardware, in which algorithms simply determine which objects are visible rather than compute exactly which portions of each object are visible. We first describe an algorithm for removing geometric and topological flaws, such as missing polygons and cracks, from the input. Next, we present our algorithm for hidden-surface removal, which is based on the novel idea of employing a kinetic data structure for ray-shooting to determine a small superset of the visible objects and efficiently maintain this set as the viewpoint moves continuously. In the last part of the thesis, we describe the Binary Space Partition (BSP), a hierarchical spatial decomposition we use to efficiently implement both our model repair and hidden-surface removal algorithms. We present our BSP construction algorithms, which work particularly well for architectural environments and terrain-like data sets.
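
A minimal sketch of the BSP idea the thesis builds on: choose a segment as the splitter, partition the remaining segments between its two half-planes, and recurse. This toy version sends straddling segments to both sides instead of clipping them, and it is not the thesis's construction algorithm, which is tuned for architectural and terrain-like inputs.

```python
def side(line, pt, eps=1e-9):
    """Signed side of pt relative to the infinite line through line's endpoints."""
    (ax, ay), (bx, by) = line
    s = (bx - ax) * (pt[1] - ay) - (by - ay) * (pt[0] - ax)
    return 0 if abs(s) < eps else (1 if s > 0 else -1)

def build_bsp(segments):
    """Recursive BSP over 2-D segments; the first segment is the splitter."""
    if not segments:
        return None
    splitter, rest = segments[0], segments[1:]
    front, back = [], []
    for seg in rest:
        s0, s1 = side(splitter, seg[0]), side(splitter, seg[1])
        if s0 >= 0 and s1 >= 0:
            front.append(seg)
        elif s0 <= 0 and s1 <= 0:
            back.append(seg)
        else:                       # straddles the splitter
            front.append(seg)       # a full BSP would clip seg in two
            back.append(seg)
    return {"splitter": splitter,
            "front": build_bsp(front),
            "back": build_bsp(back)}

tree = build_bsp([((0, 0), (1, 0)), ((0, 1), (1, 2)), ((0, -1), (1, -2))])
```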

Journal ArticleDOI
TL;DR: An algorithm is described that re-represents the alphabet so that the representation of a character reflects its properties as a predictor of future text, and that uses an estimator from a restricted class to map contexts to predictions of upcoming characters.


Journal ArticleDOI
TL;DR: A new model of complexity is defined, called object complexity, for measuring the performance of hidden-surface removal algorithms, and an algorithm is presented that solves in the object complexity model the same problem that Bern [3] addressed in the scene complexity model.
Abstract: We define a new model of complexity, called object complexity, for measuring the performance of hidden-surface removal algorithms. This model is more appropriate for predicting the performance of these algorithms on current graphics rendering systems than the standard measure of scene complexity used in computational geometry. We also consider the problem of determining the set of visible windows in scenes consisting of n axis-parallel windows in ℝ³. We present an algorithm that runs in optimal Θ(n log n) time. The algorithm solves in the object complexity model the same problem that Bern [3] addressed in the scene complexity model.

Book
01 Jan 1999
TL;DR: External memory algorithms and data structures, by J. S. Vitter; Synopsis data structures for massive data sets, by P. B. Gibbons and Y. Matias; and A survey of out-of-core algorithms in numerical linear algebra, by S. Toledo.
Abstract: External memory algorithms and data structures, by J. S. Vitter; Synopsis data structures for massive data sets, by P. B. Gibbons and Y. Matias; Calculating robust depth measures for large data sets, by I. Al-Furaih, T. Johnson, and S. Ranka; Efficient cross-trees for external memory, by R. Grossi and G. F. Italiano; Computing on data streams, by M. R. Henzinger, P. Raghavan, and S. Rajagopalan; On maximum clique problems in very large graphs, by J. Abello, P. M. Pardalos, and M. G. C. Resende; I/O-optimal computation of segment intersections, by A. Crauser, P. Ferragina, K. Mehlhorn, U. Meyer, and E. A. Ramos; On showing lower bounds for external-memory computational geometry problems, by L. Arge and P. B. Miltersen; A survey of out-of-core algorithms in numerical linear algebra, by S. Toledo; Concrete software libraries, by K.-P. Vo; S(b)-tree library: An efficient way of indexing data, by K. V. Shvachko; ASP: Adaptive online parallel disk scheduling, by M. Kallahalla and P. J. Varman; Efficient schemes for distributing data on parallel memory systems, by S. K. Das and M. C. Pinotti; External memory techniques for isosurface extraction in scientific visualization, by Y.-J. Chiang and C. T. Silva; R-tree retrieval of unstructured volume data for visualization, by S. T. Leutenegger and K.-L. Ma.

Proceedings ArticleDOI
01 May 1999
TL;DR: A performance impairment that results from a previously unknown form of convoy behavior in disk I/O, called rounds, is reported on; the paper demonstrates the rounds behavior and quantifies its performance impact.
Abstract: In modern I/O architectures, multiple disk drives are attached to each I/O bus. Under I/O-intensive workloads, the disk latency for a request can be overlapped with the disk latency and data transfers of requests to other disks, potentially resulting in an aggregate I/O throughput at nearly bus bandwidth. This paper reports on a performance impairment that results from a previously unknown form of convoy behavior in disk I/O, which we call rounds. In rounds, independent requests to distinct disks convoy, so that each disk services one request before any disk services its next request. We analyze log files to describe the read performance of multiple Seagate Wren-7 disks that share a SCSI bus under a heavy workload, demonstrating the rounds behavior and quantifying its performance impact.