
Showing papers by "Joseph M. Hellerstein published in 2003"


Proceedings Article
01 Jan 2003
TL;DR: The next generation Telegraph system, called TelegraphCQ, is focused on meeting the challenges that arise in handling large streams of continuous queries over high-volume, highly-variable data streams and leverages the PostgreSQL open source code base.
Abstract: Increasingly pervasive networks are leading towards a world where data is constantly in motion. In such a world, conventional techniques for query processing, which were developed under the assumption of a far more static and predictable computational environment, will not be sufficient. Instead, query processors based on adaptive dataflow will be necessary. The Telegraph project has developed a suite of novel technologies for continuously adaptive query processing. The next generation Telegraph system, called TelegraphCQ, is focused on meeting the challenges that arise in handling large streams of continuous queries over high-volume, highly-variable data streams. In this paper, we describe the system architecture and its underlying technology, and report on our ongoing implementation effort, which leverages the PostgreSQL open source code base. We also discuss open issues and our research agenda.

1,248 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: This work evaluates issues in the context of TinyDB, a distributed query processor for smart sensor devices, and shows how acquisitional techniques can provide significant reductions in power consumption on the authors' sensor devices.
Abstract: We discuss the design of an acquisitional query processor for data collection in sensor networks. Acquisitional issues are those that pertain to where, when, and how often data is physically acquired (sampled) and delivered to query processing operators. By focusing on the locations and costs of acquiring data, we are able to significantly reduce power consumption over traditional passive systems that assume the a priori existence of data. We discuss simple extensions to SQL for controlling data acquisition, and show how acquisitional issues influence query optimization, dissemination, and execution. We evaluate these issues in the context of TinyDB, a distributed query processor for smart sensor devices, and show how acquisitional techniques can provide significant reductions in power consumption on our sensor devices.
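The core acquisitional trade-off (sampling an attribute costs energy, so the optimizer should decide when and in what order to sample) can be sketched in a few lines of Python. TinyDB itself exposes an SQL dialect running on TinyOS motes; the attribute names, sample costs, and selectivities below are invented for illustration, not taken from the paper.

```python
# Hedged sketch: acquisition-cost-aware predicate ordering, in the spirit
# of the paper's point that the cost of physically sampling an attribute
# should drive query optimization. All numbers are illustrative.

def order_acquisitions(preds):
    """Order (name, sample_cost, selectivity) triples so that the expected
    acquisition cost  c1 + s1*c2 + s1*s2*c3 + ...  is minimized. For
    independent predicates, sorting by cost / (1 - selectivity) ascending
    is the classically optimal ordering."""
    return sorted(preds, key=lambda p: p[1] / (1.0 - p[2]))

def expected_cost(preds):
    total, pass_prob = 0.0, 1.0
    for _name, cost, sel in preds:
        total += pass_prob * cost      # pay to sample only if tuple still alive
        pass_prob *= sel               # fraction of tuples surviving the filter
    return total

preds = [
    ("light",    90.0, 0.9),   # expensive sensor, barely selective
    ("temp",     10.0, 0.5),   # cheap and selective: sample first
    ("humidity", 30.0, 0.2),
]
ordered = order_acquisitions(preds)
assert expected_cost(ordered) <= expected_cost(preds)
```

Here the ordering samples the cheap, selective temperature sensor before the expensive light sensor, cutting expected per-tuple cost from 112.5 to 34 units.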

1,031 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: Demonstrates the current version of TelegraphCQ, implemented by leveraging the code base of the open source PostgreSQL database system; the authors found that a significant portion of the PostgreSQL code was easily reusable.
Abstract: At Berkeley, we are developing TelegraphCQ [1, 2], a dataflow system for processing continuous queries over data streams. TelegraphCQ is based on a novel, highly-adaptive architecture supporting dynamic query workloads in volatile data streaming environments. In this demonstration we show our current version of TelegraphCQ, which we implemented by leveraging the code base of the open source PostgreSQL database system. Although TelegraphCQ differs significantly from a traditional database system, we found that a significant portion of the PostgreSQL code was easily reusable. We also found the extensibility features of PostgreSQL very useful, particularly its rich data types and the ability to load user-developed functions. Challenges: As discussed in [1], sharing and adaptivity are our main techniques for implementing a continuous query system. Doing this in the codebase of a conventional database posed a number of challenges:

767 citations


Book ChapterDOI
09 Sep 2003
TL;DR: This paper presents the initial design of PIER, a massively distributed query engine based on overlay networks, which is intended to bring database query processing facilities to new, widely distributed environments.
Abstract: The database research community prides itself on scalable technologies. Yet database systems traditionally do not excel on one important scalability dimension: the degree of distribution. This limitation has hampered the impact of database technologies on massively distributed systems like the Internet. In this paper, we present the initial design of PIER, a massively distributed query engine based on overlay networks, which is intended to bring database query processing facilities to new, widely distributed environments. We motivate the need for massively distributed queries, and argue for a relaxation of certain traditional database research goals in the pursuit of scalability and widespread adoption. We present simulation results showing PIER gracefully running relational queries across thousands of machines, and show results from the same software base in actual deployment on a large experimental cluster.

532 citations


Proceedings ArticleDOI
05 Mar 2003
TL;DR: A dataflow operator called flux is introduced that encapsulates adaptive state partitioning and dataflow routing that can be used for CQ operators under shifting processing and memory loads and can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case.
Abstract: The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple queries. The scalability of these dataflows is limited by their constituent, stateful operators - e.g. windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Long-running CQ dataflows must continue to function robustly in the face of these imbalances. To address this challenge, we introduce a dataflow operator called flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producer-consumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the flux mechanism and these policies can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case.
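The flux mechanism can be sketched with a toy partitioned operator. Assuming a per-key counter as the stateful consumer, the sketch below shows the two things a repartitioner must do together: move a key's state to a new partition, and repoint routing so tuples arriving afterwards follow it. The class and method names are illustrative, not the paper's interfaces.

```python
# Hedged sketch of the flux idea: a routing layer between producers and a
# partitioned, stateful consumer (here, a per-key counter) that can move
# a key's state between partitions mid-stream without losing tuples.

class Flux:
    def __init__(self, n_partitions):
        self.state = [dict() for _ in range(n_partitions)]  # partition -> {key: count}
        self.routing = {}                                   # key -> partition override

    def partition_of(self, key):
        return self.routing.get(key, hash(key) % len(self.state))

    def process(self, key):
        part = self.state[self.partition_of(key)]
        part[key] = part.get(key, 0) + 1

    def migrate(self, key, dest):
        src = self.partition_of(key)
        if src == dest:
            return
        # Move the key's accumulated state, then repoint the routing table
        # so tuples arriving after the migration land on the new partition.
        count = self.state[src].pop(key, 0)
        self.state[dest][key] = self.state[dest].get(key, 0) + count
        self.routing[key] = dest

flux = Flux(2)
stream = ["a", "b", "a", "c"] * 3
for i, key in enumerate(stream):
    if i == 6:                      # imbalance "detected" mid-stream (illustrative)
        flux.migrate("a", dest=1)
    flux.process(key)

# Counts come out as if one unpartitioned operator had seen the stream.
totals = {}
for part in flux.state:
    for k, v in part.items():
        totals[k] = totals.get(k, 0) + v
assert totals == {"a": 6, "b": 3, "c": 3}
```

The invariant being illustrated is that repartitioning mid-pipeline is invisible to the query answer; the paper's contribution is doing this with bounded stalls under real producer/consumer dataflow.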

415 citations


Book ChapterDOI
21 Feb 2003
TL;DR: Suggests that the peer-to-peer network does not have enough capacity to make naive use of either search technique attractive for Web search, and proposes a number of compromises that might achieve the last order of magnitude.
Abstract: This paper discusses the feasibility of peer-to-peer full-text keyword search of the Web. Two classes of keyword search techniques are in use or have been proposed: flooding of queries over an overlay network (as in Gnutella), and intersection of index lists stored in a distributed hash table. We present a simple feasibility analysis based on the resource constraints and search workload. Our study suggests that the peer-to-peer network does not have enough capacity to make naive use of either search technique attractive for Web search. The paper presents a number of existing and novel optimizations for P2P search based on distributed hash tables, estimates their effects on performance, and concludes that in combination these optimizations would bring the problem to within an order of magnitude of feasibility. The paper suggests a number of compromises that might achieve the last order of magnitude.
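The second technique the paper analyzes, intersection of index lists stored in a DHT, can be sketched with a dict standing in for the overlay. The simulated nodes, the "smallest posting list first" shipping order, and the transfer accounting are illustrative assumptions rather than the paper's exact cost model, but they show where the bandwidth goes.

```python
# Hedged sketch of DHT-based keyword search: each term's posting list is
# stored at the DHT node responsible for hash(term); a multi-keyword
# query fetches and intersects the lists.
import hashlib

N_NODES = 8

def node_for(term):
    return int(hashlib.sha1(term.encode()).hexdigest(), 16) % N_NODES

dht = {n: {} for n in range(N_NODES)}    # node id -> {term: posting list}

def publish(doc_id, terms):
    for t in terms:
        dht[node_for(t)].setdefault(t, set()).add(doc_id)

def search(terms):
    """Intersect posting lists, shipping the smallest list first so later
    hops only see surviving candidates (one of the paper's cost levers)."""
    lists = sorted((dht[node_for(t)].get(t, set()) for t in terms), key=len)
    result, transferred = lists[0], len(lists[0])
    for plist in lists[1:]:
        result = result & plist
        transferred += len(result)       # only candidates cross the network
    return result, transferred

publish(1, ["p2p", "search", "web"])
publish(2, ["p2p", "overlay"])
publish(3, ["search", "web", "ranking"])
docs, cost = search(["p2p", "search"])
assert docs == {1}
```

Even in this toy, the cost is dominated by the size of the posting lists shipped between nodes, which is exactly the term the paper's feasibility analysis tries to shrink with Bloom filters, caching, and compression.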

341 citations


Book ChapterDOI
22 Apr 2003
TL;DR: Initial results that extend the TinyDB sensornet query engine to support more sophisticated data analyses, focusing on three applications: topographic mapping, wavelet-based compression, and vehicle tracking are presented.
Abstract: High-level query languages are an attractive interface for sensor networks, potentially relieving application programmers from the burdens of distributed, embedded programming. In research to date, however, the proposed applications of such interfaces have been limited to simple data collection and aggregation schemes. In this paper, we present initial results that extend the TinyDB sensornet query engine to support more sophisticated data analyses, focusing on three applications: topographic mapping, wavelet-based compression, and vehicle tracking. We use these examples to motivate the feasibility of implementing sophisticated sensing applications in a query-based system, and present some initial results and research questions raised by this agenda.

260 citations


Posted Content
TL;DR: This article proposes research into several important new directions for database management systems, driven by the Internet and increasing amounts of scientific and sensor data.
Abstract: A group of senior database researchers gathers every few years to assess the state of database research and to point out problem areas that deserve additional focus. This report summarizes the discussion and conclusions of the sixth ad-hoc meeting held May 4-6, 2003 in Lowell, Mass. It observes that information management continues to be a critical component of most complex software systems. It recommends that database researchers increase focus on: integration of text, data, code, and streams; fusion of information from heterogeneous data sources; reasoning about uncertain data; unsupervised data mining for interesting correlations; information privacy; and self-adaptation and repair.

208 citations


Proceedings ArticleDOI
05 Mar 2003
TL;DR: A query architecture in which join operators are decomposed into their constituent data structures (State Modules), and dataflow among these SteMs is managed adaptively by an eddy routing operator, allowing continuously adaptive decisions for most major aspects of traditional query optimization.
Abstract: We present a query architecture in which join operators are decomposed into their constituent data structures (State Modules, or SteMs), and dataflow among these SteMs is managed adaptively by an eddy routing operator [R. Avnur et al., (2000)]. Breaking the encapsulation of joins serves two purposes. First, it allows the eddy to observe multiple physical operations embedded in a join algorithm, allowing for better calibration and control of these operations. Second, the SteM on a relation serves as a shared materialization point, enabling multiple competing access methods to share results, which can be leveraged by multiple competing join algorithms. Our architecture extends prior work significantly, allowing continuously adaptive decisions for most major aspects of traditional query optimization: choice of access methods and join algorithms, ordering of operators, and choice of a query spanning tree. SteMs introduce significant routing flexibility to the eddy, enabling more opportunities for adaptation, but also introducing the possibility of incorrect query results. We present constraints on eddy routing through SteMs that ensure correctness while preserving a great deal of flexibility. We also demonstrate the benefits of our architecture via experiments in the Telegraph dataflow system. We show that even a simple routing policy allows significant flexibility in adaptation, including novel effects like automatic "hybridization" of multiple algorithms for a single join.
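The decomposition can be sketched as a doubly pipelined hash join driven by an eddy-style routing loop: every arriving tuple is first built into its own relation's SteM, then probes the other relation's SteM. The class names and the fixed two-route policy below are illustrative simplifications of Telegraph's adaptive routing.

```python
# Hedged sketch of the SteM idea: each relation gets a State Module that
# stores its tuples and can be probed by tuples of the other relation.

class SteM:
    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.table = {}                      # join key -> list of tuples

    def build(self, tup):
        self.table.setdefault(self.key_fn(tup), []).append(tup)

    def probe(self, key):
        return self.table.get(key, [])

def eddy_join(stream):
    """stream yields ('R', tuple) or ('S', tuple); tuples are (key, payload)."""
    stems = {"R": SteM(lambda t: t[0]), "S": SteM(lambda t: t[0])}
    out = []
    for rel, tup in stream:                  # tuples arrive interleaved
        other = "S" if rel == "R" else "R"
        stems[rel].build(tup)                # route 1: build into own SteM
        for match in stems[other].probe(tup[0]):   # route 2: probe the other
            r, s = (tup, match) if rel == "R" else (match, tup)
            out.append((r[0], r[1], s[1]))
    return out

stream = [("R", (1, "r1")), ("S", (1, "s1")), ("S", (2, "s2")), ("R", (2, "r2"))]
assert eddy_join(stream) == [(1, "r1", "s1"), (2, "r2", "s2")]
```

Because the join state lives in the SteMs rather than inside an encapsulated join operator, a real eddy is free to reorder probes across many relations tuple by tuple; the build-before-probe discipline shown here is one of the correctness constraints the paper formalizes.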

149 citations


Journal Article
TL;DR: The experiences of extending a traditional DBMS towards managing data streams, and an overview of the current early-access release of the TelegraphCQ system are described.
Abstract: We are building TelegraphCQ, a system to process continuous queries over data streams. Although we had implemented some parts of this technology in earlier Java-based prototypes, our experiences were not positive. As a result, we decided to use PostgreSQL, an open source RDBMS, as a starting point for our new implementation. In March 2003, we completed an alpha milestone of TelegraphCQ. In this paper, we report on the development status of our project, with a focus on architectural issues. Specifically, we describe our experiences extending a traditional DBMS towards managing data streams, and give an overview of the current early-access release of the system.

105 citations


Journal ArticleDOI
01 Dec 2003
TL;DR: It is shown that battery powered sensor networks, with low-power multihop radios and low-cost processors, occupy a sweet spot in this spectrum that is rife with opportunity for novel database research.
Abstract: Though physical sensing instruments have long been used in astronomy, biology, and civil engineering, the recent emergence of wireless sensor networks and RFID has spurred a renaissance in sensor interest in both academia and industry. In this paper, we examine the spectrum of sensing platforms, from billion dollar satellites to tiny RF tags, and discuss the technological differences between them. We show that battery powered sensor networks, with low-power multihop radios and low-cost processors, occupy a sweet spot in this spectrum that is rife with opportunity for novel database research. We briefly summarize some of our research work in this space and present a number of examples of interesting sensor network-related problems that the database community is uniquely equipped to address.

01 Jan 2003
TL;DR: Describes the construction and use of a Prefix Hash Tree (PHT), a distributed data structure that supports range queries over DHTs by building an efficient, robust search tree out of ordinary DHT lookups.
Abstract: Distributed Hash Tables (DHTs) are scalable peer-to-peer systems that support exact match lookups. This paper describes the construction and use of a Prefix Hash Tree (PHT) – a distributed data structure that supports range queries over DHTs. PHTs use the hash-table interface of DHTs to construct a search tree that is efficient (insertions/lookups take only a logarithmic number of DHT lookups in the size of the data domain D being indexed) and robust (the failure of any given node in the search tree does not affect the availability of data stored at other nodes in the PHT).
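A compact sketch of the construction, with a Python dict standing in for the DHT: keys are fixed-width bit strings, each leaf bucket lives at the DHT entry named by its prefix label, and a full bucket splits into its two child prefixes. The bucket size, key width, and linear prefix walk are simplifications (a real PHT binary-searches on prefix length, and its range query visits only the leaves whose prefixes overlap the range).

```python
# Hedged sketch of a Prefix Hash Tree over a dict standing in for a DHT.

WIDTH, BUCKET = 6, 2
dht = {"": set()}                 # prefix label -> bucket (a "DHT node")

def to_bits(x):
    return format(x, f"0{WIDTH}b")

def leaf_for(bits):
    # A real PHT finds the leaf with a binary search on prefix length;
    # this linear walk keeps the sketch short.
    for i in range(WIDTH + 1):
        if bits[:i] in dht:
            return bits[:i]

def insert(x):
    bits = to_bits(x)
    label = leaf_for(bits)
    dht[label].add(x)
    if len(dht[label]) > BUCKET and len(label) < WIDTH:   # split the leaf
        keys = dht.pop(label)
        dht[label + "0"], dht[label + "1"] = set(), set()
        for k in keys:
            insert(k)             # redistribute (may cascade further splits)

def range_query(lo, hi):
    # Shortcut: scan all buckets. A real PHT walks only overlapping prefixes.
    out = set()
    for bucket in dht.values():
        out |= {k for k in bucket if lo <= k <= hi}
    return sorted(out)

for v in [5, 9, 12, 33, 47, 60]:
    insert(v)
assert range_query(8, 40) == [9, 12, 33]
```

The point of the structure is that every operation bottoms out in plain DHT get/put calls on prefix labels, so any off-the-shelf DHT can host it, and the loss of one trie node never hides data stored under other prefixes.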

Journal ArticleDOI
01 Sep 2003
TL;DR: It is shown that the database community's principles and technologies have an important role to play in the design of global-scale networked systems and applications, and a sampling of database query processing techniques are presented, and methods for adoption are discussed.
Abstract: A number of researchers have become interested in the design of global-scale networked systems and applications. Our thesis here is that the database community's principles and technologies have an important role to play in the design of these systems. The point of departure is at the roots of database research: we generalize Codd's notion of data independence to physical environments beyond storage systems. We note analogies between the development of database indexes and the new generation of structured peer-to-peer networks. We illustrate the emergence of data independence in networks by surveying a number of recent network facilities and applications, seen through a database lens. We present a sampling of database query processing techniques that can contribute in this arena, and discuss methods for adoption of these technologies.

01 Jan 2003
TL;DR: The utility and execution of recursive queries as an interface for querying distributed network graph structures and the relationship between in-network query processing and distance-vector-like routing protocols are explored.
Abstract: We explore the utility and execution of recursive queries as an interface for querying distributed network graph structures. To illustrate the power of recursive queries, we give several examples of computing structural properties of a P2P network such as reachability and resilience. To demonstrate the feasibility of our proposal, we sketch execution strategies for these queries using PIER, a P2P relational query processor over Distributed Hash Tables (DHTs). Finally, we discuss the relationship between in-network query processing and distance-vector-like routing protocols.
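The reachability example can be sketched as semi-naive evaluation of the textbook recursive query reachable(x,y) :- link(x,y). reachable(x,w) :- reachable(x,y), link(y,w). The centralized evaluation below is an illustrative stand-in for PIER's distributed execution over a DHT.

```python
# Hedged sketch: semi-naive evaluation of recursive reachability, the way
# a relational engine might run it (join only the newly derived "delta"
# against the base link relation each round).

def reachable(links):
    """Transitive closure of a set of (src, dst) links."""
    closure, delta = set(links), set(links)
    while delta:
        new = {(x, w)
               for (x, y) in delta
               for (z, w) in links if y == z} - closure
        closure |= new
        delta = new
    return closure

links = {("a", "b"), ("b", "c"), ("c", "d")}
assert ("a", "d") in reachable(links)
assert ("d", "a") not in reachable(links)
```

The delta-only join is what makes the iteration terminate without rescanning already-derived facts, and it is also the part that maps naturally onto rounds of in-network message exchange, which is the analogy to distance-vector routing the abstract draws.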

Journal Article
TL;DR: Amdb as discussed by the authors is a comprehensive graphical design tool for AMs that are constructed on top of the Generalized Search Tree abstraction, which allows the AM designer to detect and isolate deficiencies in an AM design.
Abstract: Designing and tuning access methods (AMs) has always been more of a black art than a rigorous discipline, with performance assessments being mostly reduced to presenting aggregate runtime or I/O numbers. This paper presents amdb, a comprehensive graphical design tool for AMs that are constructed on top of the Generalized Search Tree abstraction. At the core of amdb lies an analysis framework for AMs that defines performance metrics that are more useful than traditional summary numbers and thereby allow the AM designer to detect and isolate deficiencies in an AM design. Amdb complements the analysis framework with visualization and debugging functionality, allowing the AM designer to investigate the source of those deficiencies that were brought to light with the help of the performance metrics. Several AM design projects undertaken at U.C. Berkeley have confirmed the usefulness of the analysis framework and its integration with visualization facilities in amdb. The analysis process that produces the performance metrics is fully automated and takes a workload—a tree and a set of queries—as input; the metrics characterize the performance of each query as well as that of the tree structure. Central to the framework is the use of the optimal behavior—which can be approximated relatively efficiently—as a point of reference against which the actual observed performance is compared.
The framework applies to most balanced tree-structured AMs and is not restricted to particular types of data or queries.

Journal ArticleDOI
01 Sep 2003
TL;DR: In Spring 2003, Joe Hellerstein at Berkeley and Natassa Ailamaki at CMU collaborated in designing and running parallel editions of an undergraduate database course that exposed students to developing code in the core of a full-function database system.
Abstract: In Spring 2003, Joe Hellerstein at Berkeley and Natassa Ailamaki at CMU collaborated in designing and running parallel editions of an undergraduate database course that exposed students to developing code in the core of a full-function database system. As part of this exercise, our course teams developed new programming projects based on the PostgreSQL open-source DBMS. This report describes our experience with this effort.

Patent
W. Mills1, Joseph M. Hellerstein1
29 Aug 2003
TL;DR: In this paper, a technique for processing a request, for access to one or more services, sent from a first computing system to a second computing system, comprises the following steps/operations.
Abstract: Techniques for flexible and efficient access control are provided. For example, in one aspect of the invention, a technique for processing a request, for access to one or more services, sent from a first computing system to a second computing system, comprises the following steps/operations. A determination is made as to whether the request sent from the first computing system to the second computing system should be deferred. Then, the request is redirected when a determination is made that the request should be deferred, such that access to the one or more services is delayed.

01 Jan 2003
TL;DR: This paper performs a measurement study of Gnutella, a popular unstructured network used for file sharing, and proposes the use of a hybrid search infrastructure to improve the search coverage for rare items and presents some preliminary performance results.
Abstract: Unstructured Networks have been used extensively in P2P search systems today primarily for file sharing. These networks exploit heterogeneity in the network and offload most of the query processing load to more powerful nodes. As an alternative to unstructured networks, there have been recent proposals for using inverted indexes on structured networks for searching. These structured networks, otherwise known as distributed hash tables (DHTs), guarantee recall and are well suited for locating rare items. However, they may incur significant bandwidth for keyword-based searches. This paper performs a measurement study of Gnutella, a popular unstructured network used for file sharing. We focus primarily on studying Gnutella's search performance and recall, especially in light of recent ultrapeer enhancements. Our study reveals significant query overheads in Gnutella ultrapeers, and the presence of queries that may benefit from the use of DHTs. Based on our study, we propose the use of a hybrid search infrastructure to improve the search coverage for rare items and present some preliminary performance results.
Measurement and Analysis of Ultrapeer-based P2P Search Networks. Boon Thau Loo, Joseph M. Hellerstein, Ryan Huebsch, Scott Shenker, and Ion Stoica (UC Berkeley, Intel Berkeley Research, and the International Computer Science Institute). University of California, Berkeley, EECS Technical Report No. CSD-03-1277.

10 Jan 2003
TL;DR: Argues that a DHT-based query engine provides a unified framework for describing workloads and faultloads, injecting them into a DHT, and recording, measuring, and analyzing the resulting system behavior.
Abstract: The recent proliferation of decentralized distributed hash table (DHT) proposals suggests a need for DHT benchmarks, both to compare existing implementations and to guide future innovation. We argue that a DHT-based query engine provides a unified framework for describing workloads and faultloads, injecting them into a DHT, and recording and analyzing the observed system behavior. To illustrate this argument, we describe the possibilities and challenges of using one such DHT database engine, PIER, to describe and instantiate network dataflow patterns, and to measure and report the resulting system performance. Together, these capabilities form the foundation of a benchmarking tool.