
Showing papers by "Joseph M. Hellerstein published in 1997"


Proceedings ArticleDOI
01 Jun 1997
TL;DR: In this article, the authors propose an online aggregation interface that allows users to both observe the progress of their aggregation queries and control execution on the fly, and present a suite of techniques that extend a database system to meet these requirements.
Abstract: Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to users and has been abandoned in most other areas of computing. In this paper we propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly. After outlining usability and performance requirements for a system supporting online aggregation, we present a suite of techniques that extend a database system to meet these requirements. These include methods for returning the output in random order, for providing control over the relative rate at which different aggregates are computed, and for computing running confidence intervals. Finally, we report on an initial implementation of online aggregation in POSTGRES.

1,109 citations



Proceedings ArticleDOI
01 Jun 1997
TL;DR: This paper presents general algorithms for concurrency control in tree-based access methods as well as a recovery protocol and a mechanism for ensuring repeatable read isolation outside the context of B-trees.
Abstract: This paper presents general algorithms for concurrency control in tree-based access methods as well as a recovery protocol and a mechanism for ensuring repeatable read. The algorithms are developed in the context of the Generalized Search Tree (GiST) data structure, an index structure supporting an extensible set of queries and data types. Although developed in a GiST context, the algorithms are generally applicable to many tree-based access methods. The concurrency control protocol is based on an extension of the link technique originally developed for B-trees, and completely avoids holding node locks during I/Os. Repeatable read isolation is achieved with a novel combination of predicate locks and two-phase locking of data records. To our knowledge, this is the first time that isolation issues have been addressed outside the context of B-trees. A discussion of the fundamental structural differences between B-trees and more general tree structures like GiSTs explains why the algorithms developed here deviate from their B-tree counterparts. An implementation of GiSTs emulating B-trees in DB2/Common Server is underway.
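The right-link idea the abstract alludes to can be sketched in a few lines. This is a toy, single-threaded illustration with hypothetical names (`Node`, `search_with_links`); the real protocol also involves latches, predicate checks, and a global NSN counter, all of which are elided here.

```python
class Node:
    """Toy tree node: `nsn` is a node sequence number advanced when
    the node splits, and `right` links to the split-off sibling."""
    def __init__(self, nsn, keys, right=None):
        self.nsn = nsn
        self.keys = set(keys)
        self.right = right

def search_with_links(node, key, expected_nsn):
    # We reached `node` via a parent entry read when the sequence
    # number was `expected_nsn`. A larger NSN on the node means it
    # split after that read; the moved entries are reachable through
    # the right-link chain, so we chase it instead of re-latching.
    while node is not None:
        if key in node.keys:
            return True
        if node.nsn > expected_nsn and node.right is not None:
            node = node.right
        else:
            return False

sibling = Node(nsn=6, keys={7, 9})
split_node = Node(nsn=6, keys={1, 3}, right=sibling)
# Key 9 moved to the sibling during a split we did not observe:
assert search_with_links(split_node, 9, expected_nsn=4)
```

The payoff, as the abstract notes, is that a searcher never has to hold node locks across I/Os: a concurrent split is detected after the fact and repaired by following links.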

169 citations


Proceedings ArticleDOI
01 Jun 1997
TL;DR: NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW), shows that parallel sorting on a NOW is competitive with sorting on the large-scale SMPs that have traditionally held the performance records.
Abstract: We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive with sorting on the large-scale SMPs that have traditionally held the performance records. On a 64-node cluster, we sort 6.0 GB in just under one minute, while a 32-node cluster finishes the Datamation benchmark in 2.41 seconds. Our implementations can be applied to a variety of disk, memory, and processor configurations; we highlight salient issues for tuning each component of the system. We evaluate the use of commodity operating systems and hardware for parallel sorting. We find existing OS primitives for memory management and file access adequate. Due to aggregate communication and disk bandwidth requirements, the bottleneck of our system is the workstation I/O bus.
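The high-level structure of a distributed sort like this (range-partition, exchange, local sort) can be sketched in a single process. This toy, hypothetically named `now_sort_sketch` only models the data movement; it ignores the disk, memory, and network tuning that the paper is actually about.

```python
import random

def now_sort_sketch(per_node_data, num_nodes, max_key):
    """Single-process sketch: each node range-partitions its records,
    'sends' each one to the node owning its key range, and every node
    then sorts what it received."""
    inboxes = [[] for _ in range(num_nodes)]
    width = max(1, (max_key + num_nodes) // num_nodes)
    for local_records in per_node_data:
        for key in local_records:
            inboxes[min(key // width, num_nodes - 1)].append(key)
    # Concatenating the nodes' sorted runs in node order is globally sorted,
    # because node i owns a key range strictly below node i+1's range.
    return [sorted(box) for box in inboxes]

random.seed(1)
nodes = [[random.randrange(1000) for _ in range(250)] for _ in range(4)]
runs = now_sort_sketch(nodes, num_nodes=4, max_key=999)
merged = [k for run in runs for k in run]
assert merged == sorted(k for node in nodes for k in node)
```

In the real system the "inbox" traffic is what saturates the workstation I/O bus, which is the bottleneck the abstract identifies.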

165 citations


Proceedings ArticleDOI
01 May 1997
TL;DR: This paper defines a framework for measuring the efficiency of an indexing scheme for a workload, based on two characterizations: storage redundancy and access overhead.
Abstract: We consider the problem of indexing general database workloads (combinations of data sets and sets of potential queries). We define a framework for measuring the efficiency of an indexing scheme for a workload based on two characterizations: storage redundancy (how many times each item in the data set is stored), and access overhead (how many times more blocks than necessary does a query retrieve). Using this framework we present some initial results, showing upper and lower bounds and trade-offs between them in the case of multi-dimensional range queries and set queries.
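The two measures defined in the abstract are simple enough to compute for a concrete layout. The sketch below follows the definitions as stated there (copies stored per item; blocks touched versus the minimum needed); the function names and the toy block layout are illustrative.

```python
import math

def storage_redundancy(blocks, data_items):
    """How many times, on average, each item in the data set is stored."""
    stored_copies = sum(len(b) for b in blocks)
    return stored_copies / len(data_items)

def access_overhead(blocks, query_items, block_size):
    """Blocks actually touched vs. the ceil(|answer| / B) lower bound."""
    touched = sum(1 for b in blocks if any(x in query_items for x in b))
    ideal = math.ceil(len(query_items) / block_size)
    return touched / ideal

# A non-redundant layout of 8 items into blocks of size 2:
blocks = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
print(storage_redundancy(blocks, range(8)))           # 1.0: each item stored once
print(access_overhead(blocks, {1, 2}, block_size=2))  # 2.0: two blocks touched for one block's worth of answers
```

The trade-off the paper studies is already visible here: storing items in more than one block (redundancy above 1) is the only way to drive the access overhead of every query in the workload toward 1.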

153 citations


01 Jan 1997
TL;DR: This paper describes the RD-Tree, an index structure for set-valued attributes that is an adaptation of the R-Tree that exploits a natural analogy between spatial objects and sets.
Abstract: The implementation of complex types in Object-Relational database systems requires the development of efficient access methods. In this paper we describe the RD-Tree, an index structure for set-valued attributes. The RD-Tree is an adaptation of the R-Tree that exploits a natural analogy between spatial objects and sets. A particular engineering difficulty arises in representing the keys in an RD-Tree. We propose several different representations, and describe the tradeoffs of using each. An implementation and validation of this work is underway in the SHORE object repository.
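The spatial analogy at the heart of the RD-Tree can be shown in miniature: where an R-Tree entry carries a bounding box, an RD-Tree entry carries a "bounding set", the union of all sets below it, which prunes containment searches. This is a hypothetical two-level sketch, not the paper's structure or key representations.

```python
class RDNode:
    """Toy RD-Tree node. An inner node's key is the union of the sets
    beneath it: the set-valued analogue of an R-Tree bounding box."""
    def __init__(self, children=None, data=None):
        self.children = children or []
        self.data = data
        if data is not None:
            self.key = frozenset(data)
        else:
            self.key = frozenset().union(*(c.key for c in self.children))

def search_supersets(node, query):
    """Return the leaf sets that contain `query` (containment search)."""
    if not set(query) <= node.key:  # bounding set prunes this subtree
        return []
    if node.data is not None:
        return [node.key]
    results = []
    for child in node.children:
        results.extend(search_supersets(child, query))
    return results

leaf_a = RDNode(data={1, 2, 3})
leaf_b = RDNode(data={2, 4})
root = RDNode(children=[leaf_a, leaf_b])
assert search_supersets(root, {2, 3}) == [frozenset({1, 2, 3})]
```

The engineering difficulty the abstract raises is exactly these keys: near the root the unions grow large, so a compact (possibly lossy) representation is needed, and the paper's contribution includes weighing those representations.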

52 citations



Journal Article
TL;DR: It is argued that online processing for large queries requires redesigning major portions of a database system, and a mass-market approach for designing and measuring data-intensive processing is proposed.
Abstract: The term "online" has become an all-too-common addendum to database system names of the day. In this article we reexamine the notion of processing queries online. We distinguish between online processing and preprocessing, and argue that online processing for large queries requires redesigning major portions of a database system. We highlight pressing applications for truly online processing, and sketch ongoing research in these applications at Berkeley. We also outline basic techniques for running long queries online. We close by reevaluating the typical measurements of cost/performance for online systems, and propose a mass-market approach for designing and measuring data-intensive processing.

16 citations



Proceedings Article
01 Jan 1997
TL;DR: This paper proposes changing the black-box model to one of a "crystal ball", in which users are given feedback on their queries as they run, so that they can predict the utility of their query results, control the behavior of the queries on the fly, and better understand the operation of the system.
Abstract: Information Systems (both databases and text-search programs) are typically architected as "black boxes": a user submits a request, the system performs an unknown sequence of operations, and after some time an answer set is returned. Two trends are conspiring to make such architectures undesirable. First, users of these systems are often quite naive, and unsure of what they are doing. Second, the queries submitted to these systems are taking increasing amounts of time to complete. These trends together lead to a frustrating experience for users: they are unsure if their inputs are appropriate, and the cost of an inappropriate input is often a long wait followed by a useless or misleading result. In this paper we propose changing the black-box model to one of a "crystal ball", in which users are given feedback on their queries as they run, so that they can predict the utility of their query results, control the behavior of the queries on the fly, and better understand the operation of the system. We highlight some initial work in this vein, and describe opportunities for similar efforts in new applications.

1 citation