
Showing papers by "Joseph M. Hellerstein" published in 2009


Journal ArticleDOI
01 Aug 2009
TL;DR: This paper highlights the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence, and describes database design methodologies that support the agile working style of analysts in these settings.
Abstract: As massive data acquisition and storage become increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

535 citations


Journal ArticleDOI
TL;DR: Database research is expanding, with major efforts in system architecture, new languages, cloud services, mobile and virtual worlds, and interplay between structure and text.
Abstract: Database research is expanding, with major efforts in system architecture, new languages, cloud services, mobile and virtual worlds, and interplay between structure and text.

75 citations


Proceedings ArticleDOI
17 Apr 2009
TL;DR: This work demonstrates USHER, an end-to-end system that automatically generates data entry forms that enforce and maintain data quality constraints during execution; it features a probabilistic engine that drives form-user interactions to encourage correct answers.
Abstract: Organizations in developing regions want to efficiently collect digital data, but standard data gathering practices from the developed world are often inappropriate. Traditional techniques for form design and data quality are expensive and labour-intensive. We propose a new data-driven approach to form design, execution (filling) and quality assurance. We demonstrate USHER, an end-to-end system that automatically generates data entry forms that enforce and maintain data quality constraints during execution. The system features a probabilistic engine that drives form-user interactions to encourage correct answers.

43 citations


Proceedings ArticleDOI
13 Apr 2009
TL;DR: This work uses a model-based approach that constructs and maintains a spanning tree within the network, rooted at the basestation, based on a formal model of the in-network tree construction task framed as an optimization problem.
Abstract: In this work we present new in-network techniques for communication-efficient approximate query processing in wireless sensornets. We use a model-based approach that constructs and maintains a spanning tree within the network, rooted at the basestation. The tree maintains compressed summary information for each link that is used to “stub out” traversal during query processing. Our work is based on a formal model of the in-network tree construction task framed as an optimization problem. We demonstrate hardness results for that problem, and develop efficient approximation algorithms for subtasks that are too expensive to compute exactly. We also propose efficient heuristics to accommodate a wider set of workloads, and empirically evaluate their performance and sensitivity to model changes.

30 citations
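
The "stub out" idea in the abstract above can be illustrated with a small sketch. It assumes, purely for illustration, that each link's summary is the minimum and maximum reading in the subtree below it; the paper's actual compressed summaries and model are not described here, and the names `Node`, `summarize`, and `range_query` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    reading: float
    children: List["Node"] = field(default_factory=list)
    lo: float = 0.0  # assumed summary: smallest reading in this subtree
    hi: float = 0.0  # assumed summary: largest reading in this subtree


def summarize(node):
    """Compute the (assumed) min/max summary for the subtree rooted at node."""
    node.lo = node.hi = node.reading
    for child in node.children:
        summarize(child)
        node.lo = min(node.lo, child.lo)
        node.hi = max(node.hi, child.hi)


def range_query(node, low, high, visited):
    """Return names of nodes whose reading lies in [low, high].

    A subtree whose summary cannot overlap the query range is skipped
    entirely, so no messages would be sent down that link.
    """
    if node.hi < low or node.lo > high:
        return []  # stub out this subtree
    visited.append(node.name)
    hits = [node.name] if low <= node.reading <= high else []
    for child in node.children:
        hits.extend(range_query(child, low, high, visited))
    return hits


# Hypothetical five-node network rooted at the basestation.
base = Node("base", 20.0, [
    Node("a", 22.0, [Node("a1", 23.0), Node("a2", 21.0)]),
    Node("b", 35.0, [Node("b1", 36.0)]),
])
summarize(base)
visited = []
hits = range_query(base, 30.0, 40.0, visited)
print(hits)     # ['b', 'b1']
print(visited)  # ['base', 'b', 'b1']
```

In this toy run the whole subtree under `a` is never contacted, which is the communication saving the summaries are meant to provide.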


01 Jan 2009
TL;DR: This paper describes the authors' experience using Overlog and Java to implement a “Big Data” analytics stack that is API-compatible with Hadoop and HDFS and has equivalent performance, and uses that experience to validate the enhanced programmer productivity afforded by declarative programming.
Abstract: Cloud computing makes datacenter clusters a commodity, potentially enabling a wide range of programmers to develop new scalable services. However, current cloud platforms do little to simplify truly distributed systems development. In this paper, we explore the use of a declarative, data-centric programming model to achieve this simplicity. We describe our experience using Overlog and Java to implement a “Big Data” analytics stack that is API-compatible with Hadoop and HDFS, with equivalent performance. We extended the system with complex features not yet available in Hadoop, including availability, scalability, and unique monitoring and debugging facilities. We present our experience to validate the enhanced programmer productivity afforded by declarative programming, and inform the design of new development environments for distributed programming.

26 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: The key data dependencies inherent in the dynamic programming at the heart of these optimizers are identified and used both to design a flexible parallel query optimization implementation, and to assess the opportunities for parallelism in this context.
Abstract: Query optimization is the most computationally complex task in a database management system. In many query optimizers, faster CPUs and increased RAM can translate directly to better query plans and thus better overall system performance. Although memory size continues to scale with Moore's Law, processor speeds are leveling off. Chip manufacturers are now focusing on multicore designs that integrate increasing numbers of cores in a single CPU. Query optimizers need to be parallelized in order to continue enjoying the growth trend of Moore's Law. In this paper, we address this problem in the context of the extensible optimizer architectures found in many commercial database systems. We identify the key data dependencies inherent in the dynamic programming at the heart of these optimizers. We use this insight both to design a flexible parallel query optimization implementation, and to assess the opportunities for parallelism in this context. The proposed solutions can serve as a blueprint for retrofitting existing industry-grade optimizers to leverage multicore architectures, without requiring significant rework of the underlying infrastructure.

23 citations
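
As a rough illustration of the data dependencies mentioned in the abstract above: in a textbook dynamic-programming join enumerator, the best plan for a set of k relations depends only on plans for strictly smaller sets, so all sets of the same size can be planned independently. The sketch below is not the paper's implementation; the cardinalities, the selectivity, the cost model, and the function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import prod

CARD = {"A": 1000, "B": 100, "C": 10, "D": 10000}  # hypothetical cardinalities
SELECTIVITY = 0.01  # hypothetical selectivity applied once per join


def size(subset):
    """Estimated result size of joining every relation in subset."""
    return prod(CARD[r] for r in subset) * SELECTIVITY ** (len(subset) - 1)


def best_plan(subset, table):
    """Cheapest (cost, plan) for subset; reads only entries for smaller subsets."""
    members = sorted(subset)
    best = None
    for k in range(1, len(members)):
        for left in combinations(members, k):
            left = frozenset(left)
            right = subset - left
            cost = table[left][0] + table[right][0] + size(subset)
            if best is None or cost < best[0]:
                best = (cost, (table[left][1], table[right][1]))
    return best


def optimize(relations, workers=4):
    """Fill the DP table level by level; each level is planned concurrently."""
    table = {frozenset([r]): (0.0, r) for r in relations}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for level in range(2, len(relations) + 1):
            subsets = [frozenset(c) for c in combinations(sorted(relations), level)]
            # Subsets of equal size never read each other's entries,
            # so they can be planned in any order or at the same time.
            results = list(pool.map(lambda s: best_plan(s, table), subsets))
            table.update(zip(subsets, results))
    return table[frozenset(relations)]


cost, plan = optimize(["A", "B", "C"])
print(plan)
```

Threads are used here only to show the dependency structure; in CPython they would not speed up this CPU-bound loop, and a real optimizer would use native threads or processes.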



Journal ArticleDOI
01 Aug 2009
TL;DR: Analysts in all areas of human knowledge, from science and engineering to economics, social science and journalism are drowning in data.
Abstract: Analysts in all areas of human knowledge, from science and engineering to economics, social science and journalism are drowning in data. New technologies for sensing, simulation, and communication are helping people to both collect and produce data at exponential rates.

8 citations


Book ChapterDOI
01 Jan 2009

4 citations


01 Jan 2009
TL;DR: The findings indicate that ideas from data management may yield dividends for the design of networked systems in two key areas: (1) declarative interfaces for simplicity yet breadth of expressiveness, and (2) automatic optimizations for automatic performance improvements on the users' behalf.
Abstract: In the face of progressively diverse networking technologies and application traffic, it is increasingly infeasible to custom engineer networked systems for each scenario. Moreover, an expanding class of networks, networked embedded systems, are very difficult to program, yet require a high degree of per-deployment programming customization. We investigate a declarative approach to building and optimizing networked systems, with emphasis on networked embedded systems. Our findings indicate that ideas from data management may yield dividends for the design of networked systems in two key areas: (1) declarative interfaces for simplicity yet breadth of expressiveness, and (2) automatic optimizations for automatic performance improvements on the users' behalf. This dissertation reports on three efforts. First, we designed and implemented DSN: a declarative language, runtime and compiler for networked embedded systems. The new logic-based language in DSN has been highly intuitive for programming: in one case, an algorithm designer's pseudocode mapped nearly line-for-line to working DSN code. Typically, lines of code are reduced by an order of magnitude vs. implementations in traditional embedded languages. We built a complementary compiler and runtime that showed negligible performance drop-off vs. hand-tuned C implementations. As a result, we have been able to build whole system stacks (save for device drivers) entirely declaratively in under a hundred lines of code. Next, we designed and implemented netopt, a network optimizer that relieves programmers from having to manually solve two general networking problems: rendezvous and proxy selection. As part of this effort, we created novel program analysis and transformation algorithms to automatically select optimal communication rendezvous and proxies based on traffic and network conditions. When combined with either the DSN system, or similar systems for PC-class devices, user programs get 1–2 orders of magnitude performance improvement without need for programmers' assistance. Lastly, we designed and implemented wireless-netopt, an extension of netopt for wireless networking. wireless-netopt includes three wireless network optimizations from different layers of the networking stack. We show that the declarative interface readily supports such new domain-specific optimizations. Furthermore, these optimizations can be applied automatically and without added programmer effort, benefiting programs by 2× in energy savings.

3 citations


Proceedings ArticleDOI
13 Apr 2009
TL;DR: This work investigates an approach that focuses on replacing custom engineering with automated optimization of declarative protocol specifications, and automates network rendezvous and proxy selection from program source.
Abstract: As the diversity of sensornet use cases increases, the combinations of environments and applications that will coexist will make custom engineering increasingly impractical. We investigate an approach that focuses on replacing custom engineering with automated optimization of declarative protocol specifications. Specifically, we automate network rendezvous and proxy selection from program source. These optimizations perform program transformations that are grounded in recursive query optimization, an area of database theory. Our prototype system implementation can automatically choose program executions that are as much as three, and usually one, orders of magnitude better than the original source programs.

01 Jan 2009
TL;DR: This paper recounts implementing the Paxos consensus protocol in Overlog, a distributed declarative programming language, and finds that Paxos translates easily to declarative logic, in large part because the primitives used in consensus protocol specifications map directly to simple Overlog constructs such as aggregation and selection.
Abstract: The Paxos consensus protocol can be specified concisely, but is notoriously difficult to implement in practice. We recount our experience building Paxos in Overlog, a distributed declarative programming language. We found that the Paxos algorithm is easily translated to declarative logic, in large part because the primitives used in consensus protocol specifications map directly to simple Overlog constructs such as aggregation and selection. We discuss the programming idioms that appear frequently in our implementation, and the applicability of declarative programming to related application domains.
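
To make the "aggregation and selection" remark above concrete, here is a small Python analogue (not Overlog, and not taken from the paper) of two steps a Paxos proposer performs: counting promises per ballot and keeping ballots with a majority, then choosing the value attached to the highest previously accepted ballot. The tuple layouts and function names are assumptions made for this sketch.

```python
from collections import defaultdict


def quorum_ballots(promises, num_acceptors):
    """Ballots promised by a majority of acceptors.

    promises: iterable of (ballot, acceptor) tuples (assumed layout).
    Aggregation: count distinct acceptors per ballot.
    Selection: keep only ballots whose count exceeds half the acceptors.
    """
    voters = defaultdict(set)
    for ballot, acceptor in promises:
        voters[ballot].add(acceptor)
    return {b for b, accs in voters.items() if len(accs) > num_acceptors // 2}


def value_to_propose(prior_accepts, default):
    """Value the proposer must use in its accept request.

    prior_accepts: list of (accepted_ballot, value) pairs reported in promises.
    A max aggregate picks the value with the highest accepted ballot;
    if nothing was accepted before, the proposer may use its own default.
    """
    if not prior_accepts:
        return default
    return max(prior_accepts, key=lambda pair: pair[0])[1]


promises = [(1, "a1"), (1, "a2"), (2, "a1"), (1, "a2")]
print(quorum_ballots(promises, num_acceptors=3))          # {1}
print(value_to_propose([(1, "x"), (3, "y")], default="z"))  # y
```

Each function is a group-by followed by a filter or a max, which is the kind of relational operation a declarative language expresses in a rule or two.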

01 Jan 2009
TL;DR: This thesis addresses the traditional database problem of query optimization in this new setting, making queries more energy efficient by minimizing the communication and sensing required to provide sufficient answers.
Abstract: Sensor networks are progressively becoming a standard in applications that require the monitoring of physical phenomena. Measurements like temperature, humidity, light, and acceleration are gathered at various locations and can be used to extract information on the phenomenon observed. Sensor networks are naturally distributed, and they display strong resource restrictions. Moreover, the gathered data comes in various degrees of uncertainty, due to noisy and dropped measurements, interference, and the unavoidable discretization of the examined domain. A basic task in sensor networks is to interactively gather data from a subset of nodes in the network. Surprisingly, this problem is non-trivial to implement efficiently and robustly, even for relatively static networks. In this thesis we address the traditional database problem of query optimization in this new setting. We identify the characteristics of sensor network environments and the requirements of applications that are relevant to querying. We focus on making queries more energy efficient by means of minimizing the communication and sensing that is required to provide sufficient answers. Our contributions include theoretical, algorithmic and empirical results. We provide complexity analysis for common data gathering tasks, develop algorithms that approximate the optimal query plans, and apply our techniques to a prototype implementation that tests our theory and algorithms over real world data, demonstrating the feasibility of our approach.

01 Jan 2009
TL;DR: This work presents the FM3 Proof Sketch aggregation protocol, which efficiently and securely computes various approximate order statistics including medians, median absolute deviations, quantiles, ranks, and frequent items and derives robustness and approximation guarantees for those queries in adversarial environments.
Abstract: In-network aggregation can save significant bandwidth in distributed query systems, but is subject to attack by adversaries. Prior work addressed settings where data sources are trusted, but the aggregation infrastructure needs to be secured. We study extensions that also make aggregate queries robust to adversarial data sources, which can inject spurious values into the data stream to be aggregated. Wagner [31] observed that the field of robust statistics can provide tools here, since robust estimators (medians, trimmed means, median absolute deviations, etc.) provide formal guarantees on the degree to which perturbed data can have an effect on aggregate results. This raises the challenge of developing verifiable in-network algorithms for robust estimators. Many of the natural robust estimators are built on order statistics, so we focus here on verifiable techniques for in-network computation of order statistics. To our knowledge, there is no mechanism that guarantees both the efficiency and verifiability of the order statistics computation. In this work, we present the FM3 Proof Sketch aggregation protocol, which efficiently and securely computes various approximate order statistics, including medians, median absolute deviations, quantiles, ranks, and frequent items. We derive robustness and approximation guarantees for those queries in adversarial environments, and demonstrate empirically that our scheme is practically useful via experiments on real and synthetic data.
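
The robustness property that motivates the abstract above can be seen in a few lines. This is a plain centralized computation on made-up readings, not the FM3 Proof Sketch protocol and not an in-network or verifiable algorithm; `mad` and `trimmed_mean` are helper names chosen for this sketch.

```python
from statistics import mean, median


def mad(values):
    """Median absolute deviation: median distance from the median."""
    m = median(values)
    return median(abs(v - m) for v in values)


def trimmed_mean(values, trim):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    return mean(ordered[trim:len(ordered) - trim])


clean = [10, 11, 12, 13, 14]       # hypothetical honest readings
attacked = [10, 11, 12, 13, 1000]  # one source injects a spurious value

print(mean(clean), mean(attacked))                        # 12 209.2
print(median(clean), median(attacked))                    # 12 12
print(mad(clean), mad(attacked))                          # 1 1
print(trimmed_mean(clean, 1), trimmed_mean(attacked, 1))  # 12 12
```

A single adversarial value moves the mean arbitrarily far, while the median, the median absolute deviation, and the trimmed mean are unchanged in this example.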

Journal ArticleDOI
01 Aug 2009
TL;DR: Many of the largest database-driven web sites use custom web-scale data managers (WDMs), such as Google's Bigtable and Amazon's Dynamo, for problems that on the surface are well-suited for relational database systems.
Abstract: Many of the largest database-driven web sites use custom web-scale data managers (WDMs). On the surface, these WDMs are being applied to problems that are well-suited for relational database systems. Some examples are the following:
• Map-Reduce [5], Hadoop [7], and Dryad [9] are used to process queries on large data sets using sequential scan and aggregation. Hive [8] is a data warehouse built on Hadoop.
• Google's Bigtable [3] is used to store a replicated table of rows of semi-structured data.
• Amazon's Dynamo [6] is used to store partitioned, replicated databases of key-value pairs. Cassandra [2] is similar.
• Object caching systems, such as memcached [10], Oracle's Coherence, and Microsoft's Velocity project, are used instead of a persistent store.