
Showing papers by "Joseph M. Hellerstein" published in 2000


Journal ArticleDOI
16 May 2000
TL;DR: This paper introduces a query processing mechanism called an eddy, which continuously reorders operators in a query plan as it runs, and describes the moments of symmetry during which pipelined joins can be easily reordered, and the synchronization barriers that require inputs from different sources to be coordinated.
Abstract: In large federated and shared-nothing databases, resources can exhibit widely fluctuating characteristics. Assumptions made at the time a query is submitted will rarely hold throughout the duration of query processing. As a result, traditional static query optimization and execution techniques are ineffective in these environments. In this paper we introduce a query processing mechanism called an eddy, which continuously reorders operators in a query plan as it runs. We characterize the moments of symmetry during which pipelined joins can be easily reordered, and the synchronization barriers that require inputs from different sources to be coordinated. By combining eddies with appropriate join algorithms, we merge the optimization and execution phases of query processing, allowing each tuple to have a flexible ordering of the query operators. This flexibility is controlled by a combination of fluid dynamics and a simple learning algorithm. Our initial implementation demonstrates promising results, with eddies performing nearly as well as a static optimizer/executor in static scenarios, and providing dramatic improvements in dynamic execution environments.

902 citations
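The per-tuple routing idea is easy to see in miniature. The sketch below is not the paper's implementation: its ticket-reward rule is a crude stand-in for the lottery-style learning scheme the paper describes, and all names (ToyEddy, process) are invented for illustration.

```python
import random

class ToyEddy:
    """Toy eddy: routes each tuple through a set of selection operators
    in an adaptive order. Operators that filter tuples earn 'tickets',
    so later tuples tend to visit them earlier; a crude stand-in for
    the paper's lottery-based routing (names here are invented)."""

    def __init__(self, operators):
        self.operators = operators              # list of (name, predicate)
        self.tickets = {name: 1 for name, _ in operators}

    def _pick(self, pending):
        # Lottery draw among the operators this tuple has not visited.
        weights = [self.tickets[name] for name, _ in pending]
        return random.choices(pending, weights=weights, k=1)[0]

    def process(self, tup):
        pending = list(self.operators)
        while pending:
            name, pred = self._pick(pending)
            pending.remove((name, pred))
            if not pred(tup):
                self.tickets[name] += 1         # reward selective operators
                return False                    # tuple filtered out
        return True                             # tuple survived all operators

# The selective predicate accumulates tickets and drifts to the front.
eddy = ToyEddy([("selective", lambda t: t["x"] > 90),
                ("loose", lambda t: t["y"] != 0)])
survivors = [t for t in ({"x": i, "y": i % 7} for i in range(100))
             if eddy.process(t)]
print(len(survivors), eddy.tickets)
```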


Proceedings ArticleDOI
22 Oct 2000
TL;DR: The distributed hash table simplifies Internet service construction by decoupling service-specific logic from the complexities of persistent, consistent state management, and by allowing services to inherit the necessary service properties from the DDS rather than having to implement the properties themselves.
Abstract: This paper presents a new persistent data management layer designed to simplify cluster-based Internet service construction. This self-managing layer, called a distributed data structure (DDS), presents a conventional single-site data structure interface to service authors, but partitions and replicates the data across a cluster. We have designed and implemented a distributed hash table DDS that has properties necessary for Internet services (incremental scaling of throughput and data capacity, fault tolerance and high availability, high concurrency, consistency, and durability). The hash table uses two-phase commits to present a coherent view of its data across all cluster nodes, allowing any node to service any task. We show that the distributed hash table simplifies Internet service construction by decoupling service-specific logic from the complexities of persistent, consistent state management, and by allowing services to inherit the necessary service properties from the DDS rather than having to implement the properties themselves. We have scaled the hash table to a 128 node cluster, 1 terabyte of storage, and an in-core read throughput of 61,432 operations/s and write throughput of 13,582 operations/s.

269 citations
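To make the partitioning-plus-replication idea concrete, here is a toy sketch, not the paper's system: the "nodes" are in-process dicts, the two commit phases are reduced to plain method calls, and every identifier is invented for illustration.

```python
import hashlib

class ToyDDSTable:
    """Toy 'distributed' hash table: partitions keys across in-process
    node dicts and writes each key to R replicas. Real DDS nodes are
    separate servers with logging; the two commit phases here are
    ordinary method calls (all names invented)."""

    def __init__(self, n_nodes=4, replicas=2):
        self.nodes = [dict() for _ in range(n_nodes)]
        self.replicas = replicas

    def _replica_set(self, key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        n = len(self.nodes)
        return [self.nodes[(h + i) % n] for i in range(self.replicas)]

    def put(self, key, value):
        targets = self._replica_set(key)
        # Phase 1 ("prepare"): stage the write at every replica.
        staged = [(node, key, value) for node in targets]
        # Phase 2 ("commit"): apply everywhere, so any node can later
        # serve a read and still present a coherent view of the data.
        for node, k, v in staged:
            node[k] = v

    def get(self, key):
        return self._replica_set(key)[0].get(key)   # any replica can answer

table = ToyDDSTable()
table.put("user:42", {"name": "Ada"})
print(table.get("user:42"))
```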


Journal Article
TL;DR: A survey of prior work on adaptive query processing is presented, focusing on three characterizations of adaptivity: the frequency of adaptivity, the effects of adaptivity, and the extent of adaptivity, to set the stage for research in the Telegraph project.
Abstract: As query engines are scaled and federated, they must cope with highly unpredictable and changeable environments. In the Telegraph project, we are attempting to architect and implement a continuously adaptive query engine suitable for global-area systems, massive parallelism, and sensor networks. To set the stage for our research, we present a survey of prior work on adaptive query processing, focusing on three characterizations of adaptivity: the frequency of adaptivity, the effects of adaptivity, and the extent of adaptivity. Given this survey, we sketch directions for research in the Telegraph project.

223 citations



Journal ArticleDOI
TL;DR: This paper presents the database-centric subproject of CONTROL: a complete online query processing facility, implemented in a commercial Object-Relational DBMS from Informix.
Abstract: The goal of the CONTROL project at Berkeley is to develop systems for interactive analysis of large data sets. We focus on systems that provide users with iteratively refining answers to requests and online control of processing, thereby tightening the loop in the data analysis process. This paper presents the database-centric subproject of CONTROL: a complete online query processing facility, implemented in a commercial Object-Relational DBMS from Informix. We describe the algorithms at the core of the system, and detail the end-to-end issues required to bring the algorithms together and deliver a complete system.

62 citations
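The flavor of online query processing is easy to demonstrate for a single running aggregate. This sketch assumes randomly ordered input and a CLT-style confidence interval; it is a generic online-aggregation illustration, not the Informix implementation the paper describes.

```python
import math, random

def online_avg(stream, report_every=5000, z=1.96):
    """Stream randomly ordered values, maintaining a running mean and
    a CLT-style ~95% confidence interval that tightens as more tuples
    are seen; a user watching the estimates can stop at any time."""
    n = s = s2 = 0.0
    for v in stream:
        n += 1; s += v; s2 += v * v
        if n % report_every == 0:
            mean = s / n
            var = max(s2 / n - mean * mean, 0.0)
            yield n, mean, z * math.sqrt(var / n)   # (count, estimate, +/-)

data = [random.gauss(100, 15) for _ in range(20000)]
random.shuffle(data)            # random input order justifies the interval
for n, mean, half in online_avg(data):
    print(f"after {int(n):5d} tuples: avg = {mean:6.2f} +/- {half:.2f}")
```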


Journal ArticleDOI
01 Dec 2000
TL;DR: A pipelining, dynamically tunable reorder operator is presented for providing user control during long-running, data-intensive operations; it is responsive to dynamic preference changes, imposes minimal overheads in overall completion time, and provides dramatic improvements in the quality of the feedback over time.
Abstract: We present a pipelining, dynamically tunable reorder operator for providing user control during long-running, data-intensive operations. Users can see partial results and accordingly direct the processing by specifying preferences for various data items; data of interest is prioritized for early processing. The reordering mechanism is efficient and non-blocking and can be used over arbitrary data streams from files and indexes, as well as continuous data feeds. We also investigate several policies for the reordering based on the performance goals of various typical applications. We present performance results for reordering in the context of an online aggregation implementation in Informix and in the context of sorting and scrolling in a large-scale spreadsheet. Our experiments demonstrate that for a variety of data distributions and applications, reordering is responsive to dynamic preference changes, imposes minimal overheads in overall completion time, and provides dramatic improvements in the quality of the feedback over time. Surprisingly, preliminary experiments indicate that online reordering can also be useful in traditional batch query processing, because it can serve as a form of pipelined, approximate sorting.

26 citations
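A minimal sketch of the reordering idea, under simplifying assumptions (a single bounded in-memory buffer, user preferences as a mutable dict keyed by group): this is not the paper's operator, and every name in it is invented.

```python
class ReorderBuffer:
    """Pull items from an input iterator into a small bounded buffer
    and always deliver the buffered item whose group the user currently
    prefers; mutating `prefs` between calls reorders the remaining data."""

    def __init__(self, source, group_of, prefs, capacity=8):
        self.source = iter(source)
        self.group_of = group_of            # item -> group key
        self.prefs = prefs                  # mutable {group: priority}
        self.capacity = capacity
        self.buffer = []

    def next(self):
        while len(self.buffer) < self.capacity:     # keep the buffer full
            try:
                self.buffer.append(next(self.source))
            except StopIteration:
                break
        if not self.buffer:
            return None
        best = max(self.buffer,                     # highest-priority item
                   key=lambda it: self.prefs.get(self.group_of(it), 0))
        self.buffer.remove(best)
        return best

items = [(g, i) for i in range(6) for g in "AB"]
prefs = {"A": 1, "B": 1}
rb = ReorderBuffer(items, group_of=lambda it: it[0], prefs=prefs, capacity=4)
print(rb.next(), rb.next())
prefs["B"] = 10                 # user clicks on group B mid-stream
print(rb.next(), rb.next())
```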


26 Sep 2000
TL;DR: An interactive framework for data cleaning that tightly integrates transformation and discrepancy detection is presented, along with a set of transforms that can be used for transformations within data records as well as for higher-order transformations.
Abstract: Cleaning organizational data of discrepancies in structure and content is important for data warehousing and Enterprise Data Integration (EDI). Current commercial solutions for data cleaning involve many iterations of time-consuming "data quality" analysis to find errors, and long-running transformations to fix them. Users need to endure long waits and often write complex transformation programs. We present an interactive framework for data cleaning that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms, in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. In the background, the system incrementally searches for discrepancies on the latest transformed version of the data, flagging them as they are found. This allows users to gradually construct a transformation as discrepancies are found, and to clean the data without writing complex programs or enduring long delays. Balancing the goals of power, ease of specification, and interactive application, we choose a set of transforms that can be used for transformations within data records as well as for higher-order transformations. We also present initial work on optimizing a sequence of transforms.

24 citations
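The add-or-undo interaction loop can be mimicked with a stack of record-level functions. The sketch below is a generic illustration of undoable, immediately visible transforms, not the actual system; the transforms shown are hypothetical.

```python
class TransformStack:
    """Interactive, undoable cleaning transforms: each transform is a
    record -> record function; the whole stack is re-applied to the
    records currently visible on screen (all names here are invented)."""

    def __init__(self):
        self.transforms = []

    def add(self, fn):
        self.transforms.append(fn)      # extend the transformation

    def undo(self):
        if self.transforms:
            self.transforms.pop()       # take back the last transform

    def apply(self, record):
        for fn in self.transforms:
            record = fn(record)
        return record

# Hypothetical cleaning session over name/phone records.
stack = TransformStack()
stack.add(lambda r: {**r, "name": r["name"].strip().title()})
stack.add(lambda r: {**r, "phone": r["phone"].replace("-", "")})

visible = [{"name": "  ada LOVELACE ", "phone": "555-0100"}]
print([stack.apply(r) for r in visible])
stack.undo()                            # effect disappears immediately
print([stack.apply(r) for r in visible])
```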


Proceedings ArticleDOI
01 Feb 2000
TL;DR: The design and analysis of a customized access method for the content-based image retrieval system Blobworld is presented, along with several variants of the R-tree tailored to address the problems the analysis revealed.
Abstract: We present the design and analysis of a customized access method for the content-based image retrieval system, Blobworld. Using the amdb access method analysis tool, we analyzed three existing multidimensional access methods to support nearest neighbor search in the context of the Blobworld application. Based on this analysis, we propose several variants of the R-tree, tailored to address the problems the analysis revealed. We implemented the access methods we propose in the Generalized Search Trees (GiST) framework and analyzed them. We found that two of our access methods have better performance characteristics for the Blobworld application than any of the traditional multidimensional access methods we examined. Based on this experience, we draw conclusions for nearest neighbor access method design, and for the task of constructing custom access methods tailored to particular applications.

19 citations
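For context, the standard best-first nearest-neighbor strategy such access methods implement (Hjaltason/Samet style) fits in a few lines. This is a generic sketch with invented Node and mindist names, not the Blobworld code itself.

```python
import heapq, itertools

class Node:
    def __init__(self, entries, is_leaf):
        self.entries = entries          # data points if leaf, else children
        self.is_leaf = is_leaf

def nearest_neighbors(root, query, dist, mindist, k=1):
    """Best-first k-NN: one priority queue holds subtrees, ordered by a
    lower bound (mindist) on anything inside them, and data points,
    ordered by true distance. Popping in bound order yields exact
    nearest neighbors."""
    tie = itertools.count()             # heap tie-breaker: Nodes don't compare
    heap = [(0.0, next(tie), root)]
    results = []
    while heap and len(results) < k:
        bound, _, entry = heapq.heappop(heap)
        if isinstance(entry, Node):
            for child in entry.entries:
                key = dist(query, child) if entry.is_leaf else mindist(query, child)
                heapq.heappush(heap, (key, next(tie), child))
        else:
            results.append((bound, entry))      # popped in distance order
    return results

# Toy 1-D tree; a leaf's mindist is the distance to its closest point,
# standing in for a bounding predicate.
leaf1 = Node([1.0, 2.0, 8.0], is_leaf=True)
leaf2 = Node([5.0, 9.0], is_leaf=True)
root = Node([leaf1, leaf2], is_leaf=False)
d = lambda q, p: abs(q - p)
md = lambda q, node: min(abs(q - p) for p in node.entries)
print(nearest_neighbors(root, 6.0, d, md, k=2))  # [(1.0, 5.0), (2.0, 8.0)]
```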



01 Jan 2000
TL;DR: This research was undertaken in the context of the generalized search tree (GiST), a tree-structured template access method, which encapsulates standard AM search and update functions and is a suitable basis for AM extensibility in ORDBMSs.
Abstract: Today's extensible object-relational database management systems (ORDBMSs) are being deployed to support nontraditional applications such as dynamic web servers and geographic information systems. ORDBMSs distinguish themselves from purely relational DBMSs by providing an extensible architecture, built around a richer and user-extensible type system combined with object-oriented concepts such as type hierarchies. They retain standard features of relational databases such as declarative access, multiuser operation, transactional isolation and recoverability. One particular aspect of DBMS functionality that is critical to performance is their support for access methods (AMs). In traditional relational DBMSs, B+-trees [Com79] serve as the AM of choice to provide a very high level of performance for applications dealing with the standard SQL datatypes (numeric data, character strings, dates, etc.). In order to provide the same level of performance for non-traditional applications, B+-trees are not sufficient; instead, novel kinds of datatype-specific AMs are required. The most promising approach to supporting those novel AMs is an extensible architecture in which the core services of the ORDBMS can be complemented with externally-supplied AMs. In my dissertation, I investigate general issues that arise in the design and implementation of nontraditional AMs in an extensible ORDBMS. This research was undertaken in the context of the generalized search tree (GiST), a tree-structured template access method, which encapsulates standard AM search and update functions and is a suitable basis for AM extensibility in ORDBMSs. The dissertation contains three contributions. The first is an extension of the GiST API that makes it more flexible and at the same time improves performance when implemented in a typical commercial ORDBMS. The second comprises concurrency and recovery protocols that allow GiSTs to be useful in application scenarios where high concurrency and recoverability are required. With these protocols, GiSTs fully encapsulate physical concurrency, transactional isolation and recovery, and thereby relieve an external access method of the burden of dealing with these issues. The API extensions and the concurrency and recovery protocols together make GiSTs a high-performance alternative to custom AM development in commercial ORDBMSs. The third contribution is an AM performance analysis framework, implemented in a corresponding tool, that gives the AM developer a detailed picture of an AM's performance deficiencies while still retaining the GiST framework's independence of the datatype and application.

13 citations
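The extensibility contract at the heart of GiST (datatype-specific behavior supplied through a handful of key methods such as consistent, union, and penalty) can be caricatured in a few lines. This sketch keeps a single flat level of pages and omits splits, concurrency, and recovery; it illustrates the interface style only.

```python
class GiSTSketch:
    """Caricature of the GiST template: generic insert/search logic,
    with datatype behavior supplied as three callbacks echoing GiST's
    consistent/union/penalty key methods. Real GiSTs split pages and
    handle concurrency and recovery; this keeps one flat level of
    pages purely to show the extensibility contract."""

    def __init__(self, consistent, union, penalty, page_keys):
        self.consistent = consistent    # (key, query) -> could page match?
        self.union = union              # (key, item)  -> widened key
        self.penalty = penalty          # (key, item)  -> cost of inserting
        self.pages = [(k, []) for k in page_keys]

    def insert(self, item):
        # Choose the page whose key grows least, then widen its key.
        i = min(range(len(self.pages)),
                key=lambda j: self.penalty(self.pages[j][0], item))
        key, items = self.pages[i]
        self.pages[i] = (self.union(key, item), items + [item])

    def search(self, query):
        # Visit only pages whose key is consistent with the query.
        for key, items in self.pages:
            if self.consistent(key, query):
                yield from (x for x in items if self.consistent((x, x), query))

# Instantiating the template as a 1-D "R-tree": keys are intervals.
overlaps = lambda key, q: not (key[1] < q[0] or key[0] > q[1])
widen = lambda key, x: (min(key[0], x), max(key[1], x))
growth = lambda key, x: (max(key[1], x) - min(key[0], x)) - (key[1] - key[0])

tree = GiSTSketch(overlaps, widen, growth, page_keys=[(0, 10), (50, 60)])
for v in [3, 7, 55, 12, 58]:
    tree.insert(v)
print(list(tree.search((5, 13))))       # items falling in [5, 13] -> [7, 12]
```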



Journal ArticleDOI
16 May 2000
TL;DR: There are no standard benchmarks for advanced indexing problems, and there has been relatively little work on methodologies for index experimentation and customization.
Abstract: Indexes and access methods have been a staple of database research — and indeed of computer science in general — for decades. A glance at the contents of this year's SIGMOD and PODS proceedings shows another bumper crop of indexing papers. Given the hundreds of indexing papers published in the database literature, a pause for reflection seems in order. From a scientific perspective, it is natural to ask why definitive indexing solutions have eluded us for so many years. What is the grand challenge in indexing? What basic complexities or intricacies underlie this large body of work? What would constitute a successful completion of this research agenda, and what steps will best move us in that direction? Or is it the case that the problem space branches in so many ways that we should expect to continuously need to solve variants of the indexing problem? From the practitioner's perspective, the proliferation of indexing solutions in the literature may be more confusing than helpful. Comprehensively evaluating the research to date is a near-impossible task. An evaluation has to include both functionality (applicability to the practitioner's problem, integration with other data management services like buffer management, query processing and transactions) as well as performance for the practitioner's workloads. Unfortunately, there are no standard benchmarks for advanced indexing problems, and there has been relatively little work on methodologies for index experimentation and customization. How should the research community promote technology transfer in this area? Are the new extensibility interfaces in object-relational DBMSs conducive to this effort?

11 May 2000
TL;DR: An insertion algorithm called the Aggressive Insertion Policy is developed, which uses global rather than greedy information when making insertion decisions; new bounding predicates implemented in GiST show much better performance for the Blobworld image-search application than several traditional access methods.
Abstract: We present two new techniques for improving the performance of multidimensional indexes. For static data sets, we find that bulk loading techniques are effective at clustering data items in the index; however, traditional designs of an index's bounding predicates can lead to poor performance. We develop and implement in GiST three new bounding predicates, two of which have much better performance characteristics for our Blobworld image-search application than several traditional access methods. We then proceed to study dynamic data sets, the analysis of which leads to a focus on insertion algorithms. We develop, implement, and analyze an insertion algorithm called the Aggressive Insertion Policy, which uses global rather than greedy information when making insertion decisions.
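The abstract does not spell out the Aggressive Insertion Policy's mechanics, so the contrast below is hypothetical: a classic greedy least-enlargement choice next to an invented "global" criterion that scores total overlap across all pages after the insertion. It illustrates greedy-versus-global decision making only, not the paper's actual algorithm.

```python
def greedy_insert(pages, x):
    """Classic greedy choice (R-tree style): put x in the page whose
    bounding interval grows the least, ignoring all other pages."""
    def growth(key):
        return (max(key[1], x) - min(key[0], x)) - (key[1] - key[0])
    return min(range(len(pages)), key=lambda i: growth(pages[i]))

def global_insert(pages, x):
    """Invented 'global' choice (not the paper's actual policy): score
    each candidate placement by the total pairwise overlap among all
    page intervals afterwards, and keep the index least overlapping."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))
    best, best_cost = 0, float("inf")
    for i, key in enumerate(pages):
        trial = list(pages)
        trial[i] = (min(key[0], x), max(key[1], x))
        cost = sum(overlap(trial[a], trial[b])
                   for a in range(len(trial)) for b in range(a + 1, len(trial)))
        if cost < best_cost:
            best, best_cost = i, cost
    return best

pages = [(0, 10), (9, 20)]      # already-overlapping page intervals
print(greedy_insert(pages, 15), global_insert(pages, 15))
```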