
Showing papers by "Joseph M. Hellerstein published in 2012"


Journal Article•DOI•
01 Apr 2012
TL;DR: GraphLab as discussed by the authors extends the shared-memory GraphLab abstraction to the substantially more challenging distributed setting while preserving strong data consistency guarantees, using graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency.
Abstract: While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.

1,505 citations
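To make the graph-parallel abstraction concrete, the sketch below shows a toy, single-machine analogue of a GraphLab-style vertex program: an update function reads a vertex and its neighbors, writes the vertex, and dynamically schedules further work. The function and data-structure names are illustrative only, not GraphLab's actual API.

```python
# Minimal sketch of the graph-parallel "vertex program" idea behind GraphLab:
# an update function reads a vertex and its neighborhood, writes the vertex,
# and may schedule neighbors for further (dynamic, asynchronous) updates.
# This toy runs sequentially; names and structure are illustrative only.
from collections import deque

def pagerank_update(v, graph, rank, damping=0.85, tol=1e-4):
    """Recompute one vertex's rank from its in-neighbors; return vertices to reschedule."""
    incoming = graph["in_edges"][v]
    new_rank = (1 - damping) + damping * sum(
        rank[u] / len(graph["out_edges"][u]) for u in incoming
    )
    changed = abs(new_rank - rank[v]) > tol
    rank[v] = new_rank
    # Dynamic scheduling: only neighbors of a changed vertex need more work.
    return graph["out_edges"][v] if changed else []

def run(graph):
    rank = {v: 1.0 for v in graph["vertices"]}
    queue = deque(graph["vertices"])
    while queue:
        v = queue.popleft()
        for w in pagerank_update(v, graph, rank):
            queue.append(w)
    return rank

graph = {
    "vertices": ["a", "b", "c"],
    "out_edges": {"a": ["b", "c"], "b": ["c"], "c": ["a"]},
    "in_edges": {"a": ["c"], "b": ["a"], "c": ["a", "b"]},
}
print(run(graph))
```

The distributed implementation described in the abstract replaces this sequential queue with asynchronous execution across machines, which is where the pipelined locking, data versioning, and Chandy-Lamport snapshots come in.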


Journal Article•DOI•
TL;DR: This work characterizes the process of industrial data analysis and documents how organizational features of an enterprise impact it, and describes recurring pain points, outstanding challenges, and barriers to adoption for visual analytic tools.
Abstract: Organizations rely on data analysts to model customer engagement, streamline operations, improve production, inform business decisions, and combat fraud. Though numerous analysis and visualization tools have been built to improve the scale and efficiency at which analysts can work, there has been little research on how analysis takes place within the social and organizational context of companies. To better understand the enterprise analysts' ecosystem, we conducted semi-structured interviews with 35 data analysts from 25 organizations across a variety of sectors, including healthcare, retail, marketing and finance. Based on our interview data, we characterize the process of industrial data analysis and document how organizational features of an enterprise impact it. We describe recurring pain points, outstanding challenges, and barriers to adoption for visual analytic tools. Finally, we discuss design implications and opportunities for visual analysis research.

403 citations


Journal Article•DOI•
01 Aug 2012
TL;DR: The MADlib project is introduced, including the background that led to its beginnings and the motivation for its open-source nature; an overview of the library's architecture and design patterns is provided, along with a description of various statistical methods in that context.
Abstract: MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind. In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open-source nature. We provide an overview of the library's architecture and design patterns, and provide a description of various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals. MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods, and ports to additional database platforms.

342 citations
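The snippet below sketches the usage pattern the abstract describes: the model is fit inside the database, so only the small fitted model travels back to the client, with no data export. The connection string and the "houses" table (with price, tax, bath, and size columns) are assumptions taken from MADlib's documented linear-regression example; check the exact call signature against the version installed in your database.

```python
# Sketch of the MADlib usage pattern: train in-database, fetch only the model.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=analyst")  # placeholder DSN
cur = conn.cursor()

# Train a linear regression entirely in-database over an existing table.
cur.execute("""
    SELECT madlib.linregr_train(
        'houses',                     -- source table already stored in the DBMS
        'houses_linregr',             -- output table for the fitted model
        'price',                      -- dependent variable
        'ARRAY[1, tax, bath, size]'   -- independent variables (with intercept)
    );
""")
conn.commit()

# Pull back only the (small) fitted model, not the underlying rows.
cur.execute("SELECT coef, r2 FROM houses_linregr;")
print(cur.fetchone())
cur.close()
conn.close()
```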


Posted Content•
TL;DR: MADlib as mentioned in this paper is a free, open source library of in-database analytic methods that provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools.
Abstract: MADlib is a free, open source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind. In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the library's architecture and design patterns, and provide a description of various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals. MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods, and ports to additional database platforms.

309 citations


Proceedings Article•DOI•
21 May 2012
TL;DR: Profiler is presented, a visual analysis tool for assessing quality issues in tabular data, which applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context.
Abstract: Data quality issues such as missing, erroneous, extreme and duplicate values undermine analysis and are time-consuming to find and fix. Automated methods can help identify anomalies, but determining what constitutes an error is context-dependent and so requires human judgment. While visualization tools can facilitate this process, analysts must often manually construct the necessary views, requiring significant expertise. We present Profiler, a visual analysis tool for assessing quality issues in tabular data. Profiler applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction with millions of data points. We present Profiler's architecture --- including modular components for custom data types, anomaly detection routines and summary visualizations --- and describe its application to motion picture, natural disaster and water quality data sets.

235 citations
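As a rough illustration of the kind of automatic flagging Profiler performs (this is not the tool's own code), the sketch below scans a small table for missing, extreme, and duplicate values; Profiler pairs checks like these with coordinated summary visualizations for inspecting the flagged data in context.

```python
# Illustrative data-quality checks: missing, extreme, and duplicate values.
import pandas as pd

def flag_quality_issues(df, z_thresh=3.0):
    issues = {}
    issues["missing"] = df.isna().sum().to_dict()           # missing values per column
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)    # standardized scores
    issues["extreme"] = (z.abs() > z_thresh).sum().to_dict()  # likely outliers per column
    issues["duplicate_rows"] = int(df.duplicated().sum())   # exact duplicate records
    return issues

readings = [7.0, 7.1, 7.2, 7.3, 7.1, 7.0, 7.2, 7.3, 7.1, 7.0, 42.0]
df = pd.DataFrame({
    "station": list("abcdeabcde") + [None],  # one missing station id, one repeated row
    "ph": readings,                          # 42.0 is an implausible pH reading
})
print(flag_quality_issues(df))
```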


Book•
14 Feb 2012
TL;DR: The intuition behind declarative programming of networks is presented, including roots in Datalog, extensions for networked environments, and the semantics of long-running queries over network state.
Abstract: Declarative Networking is a programming methodology that enables developers to concisely specify network protocols and services, which are directly compiled to a dataflow framework that executes the specifications. This paper provides an introduction to basic issues in declarative networking, including language design, optimization, and dataflow execution. We present the intuition behind declarative programming of networks, including roots in Datalog, extensions for networked environments, and the semantics of long-running queries over network state. We focus on a sublanguage we call Network Datalog (NDlog), including execution strategies that provide crisp eventual consistency semantics with significant flexibility in execution. We also describe a more general language called Overlog, which makes some compromises between expressive richness and semantic guarantees. We provide an overview of declarative network protocols, with a focus on routing protocols and overlay networks. Finally, we highlight related work in declarative networking, and new declarative approaches to related problems.

209 citations
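For readers unfamiliar with NDlog, the sketch below evaluates what a recursive path rule declares, namely network reachability as the fixpoint of a join, using a simple semi-naive loop in Python. This only mirrors the logic of the declarative specification; real declarative networking systems compile such rules into distributed dataflows.

```python
# What a recursive NDlog rule like
#   path(S, D) :- link(S, D).
#   path(S, D) :- link(S, Z), path(Z, D).
# computes, evaluated with a semi-naive fixpoint over a toy link table.
def reachability(links):
    path = set(links)          # base case: every link is a path
    delta = set(links)         # newly derived facts
    while delta:
        new = {(s, d) for (s, z) in links for (z2, d) in delta if z == z2}
        delta = new - path     # keep only facts not seen before (semi-naive)
        path |= delta
    return path

links = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(reachability(links)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```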


Journal Article•DOI•
01 Apr 2012
TL;DR: This work explains why partial quorums are regularly acceptable in practice, analyzing both the staleness of data they return and the latency benefits they offer, and introduces Probabilistically Bounded Staleness (PBS) consistency, which provides expected bounds on staleness with respect to both versions and wall clock time.
Abstract: Data store replication results in a fundamental trade-off between operation latency and data consistency. In this paper, we examine this trade-off in the context of quorum-replicated data stores. Under partial, or non-strict quorum replication, a data store waits for responses from a subset of replicas before answering a query, without guaranteeing that read and write replica sets intersect. As deployed in practice, these configurations provide only basic eventual consistency guarantees, with no limit to the recency of data returned. However, anecdotally, partial quorums are often "good enough" for practitioners given their latency benefits. In this work, we explain why partial quorums are regularly acceptable in practice, analyzing both the staleness of data they return and the latency benefits they offer. We introduce Probabilistically Bounded Staleness (PBS) consistency, which provides expected bounds on staleness with respect to both versions and wall clock time. We derive a closed-form solution for versioned staleness as well as model real-time staleness for representative Dynamo-style systems under internet-scale production workloads. Using PBS, we measure the latency-consistency trade-off for partial quorum systems. We quantitatively demonstrate how eventually consistent systems frequently return consistent data within tens of milliseconds while offering significant latency benefits.

202 citations
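A much-simplified version of the quorum analysis fits in a few lines: assuming read and write quorums are chosen uniformly at random and ignoring time, anti-entropy, and the Dynamo-specific details the paper models, the probability that a read misses every replica holding the latest write is C(N-W, R) / C(N, R). The sketch below is illustration of that intuition only.

```python
# Simplified staleness intuition behind PBS under a uniform-quorum assumption.
from math import comb

def p_stale_read(n, r, w):
    """Probability a read quorum misses every replica that saw the last write."""
    if r + w > n:               # strict quorums always intersect
        return 0.0
    return comb(n - w, r) / comb(n, r)

for (n, r, w) in [(3, 1, 1), (3, 1, 2), (3, 2, 1), (3, 2, 2)]:
    print(f"N={n} R={r} W={w}: P(stale) = {p_stale_read(n, r, w):.3f}")
```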


Proceedings Article•DOI•
14 Oct 2012
TL;DR: This paper generalizes Bloom to support lattices and extends the power of CALM analysis to whole programs containing arbitrary lattices, and shows how the Bloom interpreter can be generalized to support efficient evaluation of lattice-based code using well-known strategies from logic programming.
Abstract: In recent years there has been interest in achieving application-level consistency criteria without the latency and availability costs of strongly consistent storage infrastructure. A standard technique is to adopt a vocabulary of commutative operations; this avoids the risk of inconsistency due to message reordering. Another approach was recently captured by the CALM theorem, which proves that logically monotonic programs are guaranteed to be eventually consistent. In logic languages such as Bloom, CALM analysis can automatically verify that programs achieve consistency without coordination. In this paper we present BloomL, an extension to Bloom that takes inspiration from both of these traditions. BloomL generalizes Bloom to support lattices and extends the power of CALM analysis to whole programs containing arbitrary lattices. We show how the Bloom interpreter can be generalized to support efficient evaluation of lattice-based code using well-known strategies from logic programming. Finally, we use BloomL to develop several practical distributed programs, including a key-value store similar to Amazon Dynamo, and show how BloomL encourages the safe composition of small, easy-to-analyze lattices into larger programs.

118 citations
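The toy Python classes below mimic the lattice idea (BloomL itself is a Ruby-based DSL, so these names are illustrative, not its API): state grows only through a commutative, associative, idempotent merge, so replicas converge regardless of update order, and monotone functions such as set size compose safely into larger programs.

```python
# Toy analogue of Bloom^L lattices: order-insensitive merge plus monotone maps.
class MaxLattice:
    """lmax-style lattice: merge is max, so it is order-insensitive."""
    def __init__(self, value=0):
        self.value = value
    def merge(self, other):
        return MaxLattice(max(self.value, other.value))

class SetLattice:
    """lset-style lattice: merge is union."""
    def __init__(self, items=()):
        self.items = frozenset(items)
    def merge(self, other):
        return SetLattice(self.items | other.items)
    def size(self):
        # A monotone map into MaxLattice: growing the set never shrinks its size.
        return MaxLattice(len(self.items))

# Two replicas receive the same updates in different orders...
a = SetLattice({"x"}).merge(SetLattice({"y"})).merge(SetLattice({"z"}))
b = SetLattice({"z"}).merge(SetLattice({"x"})).merge(SetLattice({"y"}))
# ...and converge to the same state and the same derived monotone value.
assert a.items == b.items and a.size().value == b.size().value == 3
```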


Proceedings Article•DOI•
14 Oct 2012
TL;DR: This work exposes causal consistency's serious and inherent scalability limitations due to write propagation requirements and traditional dependency tracking mechanisms, and advocates the use of explicit causality, or application-defined happens-before relations.
Abstract: Causal consistency is the strongest consistency model that is available in the presence of partitions and provides useful semantics for human-facing distributed services. Here, we expose its serious and inherent scalability limitations due to write propagation requirements and traditional dependency tracking mechanisms. As an alternative to classic potential causality, we advocate the use of explicit causality, or application-defined happens-before relations. Explicit causality, a subset of potential causality, tracks only relevant dependencies and reduces several of the potential dangers of causal consistency.

95 citations
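The sketch below, which is not from the paper, illustrates explicit causality in a few lines: each write carries only the dependencies the application names (for example, a reply depends on the post it answers), and a replica defers applying a write until those named dependencies have been applied locally, instead of tracking all potential causality.

```python
# Minimal sketch of explicit (application-defined) causality tracking.
class Replica:
    def __init__(self):
        self.applied = set()
        self.pending = []          # writes whose dependencies are not yet satisfied

    def receive(self, write_id, deps, value):
        self.pending.append((write_id, frozenset(deps), value))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for item in list(self.pending):
                write_id, deps, value = item
                if deps <= self.applied:        # all explicit dependencies applied
                    self.applied.add(write_id)
                    self.pending.remove(item)
                    print(f"applied {write_id}: {value}")
                    progress = True

r = Replica()
# A reply is delivered before the post it explicitly depends on.
r.receive("reply1", deps={"post1"}, value="Nice photo!")
r.receive("post1", deps=set(), value="Vacation photo")
# Application order respects the application-defined happens-before: post1, then reply1.
```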


Proceedings Article•DOI•
07 Oct 2012
TL;DR: The hypothesis is that users prefer to specify quantified queries interactively by trial-and-error; DataPlay, a query tool with an underlying graphical query language, a unique data model and a graphical interface, provides two interaction features that support this style of query specification.
Abstract: Writing complex queries in SQL is a challenge for users. Prior work has developed several techniques to ease query specification but none of these techniques are applicable to a particularly difficult class of queries: quantified queries. Our hypothesis is that users prefer to specify quantified queries interactively by trial-and-error. We identify two impediments to this form of interactive trial-and-error query specification in SQL: (i) changing quantifiers often requires global syntactical query restructuring, and (ii) the absence of non-answers from SQL's results makes verifying query correctness difficult. We remedy these issues with DataPlay, a query tool with an underlying graphical query language, a unique data model and a graphical interface. DataPlay provides two interaction features that support trial-and-error query specification. First, DataPlay allows users to directly manipulate a graphical query by changing quantifiers and modifying dependencies between constraints. Users receive real-time feedback in the form of updated answers and non-answers. Second, DataPlay can auto-correct a user's query, based on user feedback about which tuples to keep or drop from the answers and non-answers. We evaluated the effectiveness of each interaction feature with a user study and we found that direct query manipulation is more effective than auto-correction for simple queries but auto-correction is more effective than direct query manipulation for more complex queries.

61 citations
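The toy example below illustrates the quantified-query problem DataPlay targets: flipping the quantifier ("took all required courses" versus "took any") is a one-word change here, and non-answers are reported with the reason they fail, which is what helps users verify correctness. The data and names are made up, and this is not DataPlay's query language.

```python
# Quantified query with answers and explained non-answers.
required = {"db", "os"}
taken = {
    "alice": {"db", "os", "ai"},
    "bob": {"db"},
    "carol": {"ai"},
}

def evaluate(quantifier):
    answers, non_answers = [], []
    for student, courses in taken.items():
        missing = required - courses
        ok = not missing if quantifier == "all" else bool(required & courses)
        if ok:
            answers.append(student)
        else:
            non_answers.append((student, f"missing {sorted(missing)}"))
    return answers, non_answers

print(evaluate("all"))   # (['alice'], [('bob', "missing ['os']"), ('carol', "missing ['db', 'os']")])
print(evaluate("any"))   # (['alice', 'bob'], [('carol', "missing ['db', 'os']")])
```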


Proceedings Article•DOI•
11 Mar 2012
TL;DR: It is found that Shreddr can significantly decrease the effort and cost of data entry, while maintaining a high level of quality, within this case study.
Abstract: For low-resource organizations working in developing regions, infrastructure and capacity for data collection have not kept pace with the increasing demand for accurate and timely data. Despite continued emphasis and investment, many data collection efforts still suffer from delays, inefficiency and difficulties maintaining quality. Data is often still "stuck" on paper forms, making it unavailable for decision-makers and operational staff. We apply techniques from computer vision, database systems and machine learning, and leverage new infrastructure -- online workers and mobile connectivity -- to redesign data entry with high data quality. Shreddr delivers self-serve, low-cost and on-demand data entry service allowing low-resource organizations to quickly transform stacks of paper into structured electronic records through a novel combination of optimizations: batch processing and compression techniques from database systems, automatic document processing using computer vision, and value verification through crowd-sourcing. In this paper, we describe Shreddr's design and implementation, and measure system performance with a large-scale evaluation in Mali, where Shreddr was used to enter over a million values from 36,819 pages. Within this case study, we found that Shreddr can significantly decrease the effort and cost of data entry, while maintaining a high level of quality.
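One of the optimizations mentioned above, value verification through crowd-sourcing, can be sketched as redundant transcription plus majority vote; the two-of-three policy and names below are illustrative, not Shreddr's actual implementation.

```python
# Rough sketch of crowd-sourced value verification by majority vote.
from collections import Counter

def verify(entries, min_agreement=2):
    """Return (value, accepted) for a list of independent transcriptions of one field."""
    value, votes = Counter(entries).most_common(1)[0]
    return value, votes >= min_agreement

print(verify(["12", "12", "72"]))    # ('12', True)   -- accepted by majority
print(verify(["male", "m", "M"]))    # ('male', False) -- no agreement, escalate for review
```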

Journal Article•DOI•
01 Aug 2012
TL;DR: This work introduces DataPlay as a sophisticated query specification tool and demonstrates its unique interaction models.
Abstract: DataPlay is a query tool that encourages a trial-and-error approach to query specification. DataPlay uses a graphical query language to make a particularly challenging query specification task - quantification - easier. It constrains the relational data model to enable the presentation of non-answers, in addition to answers, to aid query interpretation. Two novel features of DataPlay are suggesting semantic variations to a query and correcting queries by example. We introduce DataPlay as a sophisticated query specification tool and demonstrate its unique interaction models.

Book Chapter•DOI•
11 Sep 2012
TL;DR: This work begins with a model-theoretic semantics for Dedalus and introduces the ultimate model, which captures non-deterministic eventual outcomes of distributed programs, and identifies restricted sub-languages that guarantee confluence while providing adequate expressivity.
Abstract: Building on recent interest in distributed logic programming, we take a model-theoretic approach to analyzing confluence of asynchronous distributed programs. We begin with a model-theoretic semantics for Dedalus and introduce the ultimate model, which captures non-deterministic eventual outcomes of distributed programs. After showing the question of confluence undecidable for Dedalus, we identify restricted sub-languages that guarantee confluence while providing adequate expressivity. We observe that the semipositive restriction Dedalus+ guarantees confluence while capturing PTIME, but show that its restriction of negation makes certain simple and practical programs difficult to write. To remedy this, we introduce DedalusS, a restriction of Dedalus that allows a kind of stratified negation, but retains the confluence of Dedalus+ and similarly captures PTIME.

01 Jan 2012
TL;DR: This dissertation examines how social networks can be used for recruitment and promotion of a crowdsourced citizen science project and compares this recruiting method to the use of traditional media channels.
Abstract: This dissertation explores the application of computer science methodologies, techniques, and technologies to citizen science. Citizen science can be broadly defined as scientific research performed in part or in whole by volunteers who are not professional scientists. Such projects are increasingly making use of mobile and Internet technologies and social networking systems to collect or categorize data, and to coordinate efforts with other participants. The dissertation focuses on observations and experiences from the design, deployment, and testing of a citizen science project, Creek Watch. Creek Watch is a collaboration between an HCI research group and a government agency. The project allows anyone with an iPhone to submit photos and observations of their local waterways to authorities who use the data for water management, environmental programs, and cleanup events. The first version of Creek Watch was designed by a user-centered iterative design method, in collaboration with scientists who need data on waterways. As a result, the data collected by Creek Watch is useful to scientists and water authorities, while the App is usable by untrained novices. Users of Creek Watch submit reports on their local creek, stream, or other water body that include simple observations about water level, water flow rate, and trash. Observations are automatically time stamped and GPS tagged. Reports are submitted to a database at creekwatch.org, where scientists and members of the public alike can view reports and download data. The deployment of Creek Watch provided several lessons in the launch of an international citizen science mobile App. Subsequent versions of the iPhone App solved emergent problems with data quality by providing international translations, an instructional walk-through, and a confirmation screen for first-time submissions. This dissertation further examines how social networks can be used for recruitment and promotion of a crowdsourced citizen science project and compares this recruiting method to the use of traditional media channels. Results are presented from a series of campaigns to promote Creek Watch, including a press release with news pickups, a participation campaign through local organizations, and a social networking campaign through Facebook and Twitter. This dissertation also presents results from the trial of a feature that allows users to post Creek Watch reports automatically to Facebook or Twitter. Social networking was a worthwhile avenue for increasing awareness of the project, which increased the conversion rate from browsers to participants. The Facebook and Twitter campaign increased participation and was a better recruitment strategy than the participation campaign. However, targeting existing communities resulted in the largest increase in data submissions.

Proceedings Article•DOI•
21 May 2012
TL;DR: The utility of BloomUnit is illustrated by demonstrating an incremental process by which a programmer might provide and refine a set of queries and constraints until they define a rich set of correctness tests for a distributed system.
Abstract: We present BloomUnit, a testing framework for distributed programs written in the Bloom language. BloomUnit allows developers to write declarative test specifications that describe the input/output behavior of a software module. Test specifications are expressed as Bloom queries over (distributed) execution traces of the program under test. To allow execution traces to be produced automatically, BloomUnit synthesizes program inputs that satisfy user-provided constraints. For a given input, BloomUnit systematically explores the space of possible network message reorderings. BloomUnit searches this space efficiently by exploiting program semantics to ignore "uninteresting" message schedules. We illustrate the utility of BloomUnit by demonstrating an incremental process by which a programmer might provide and refine a set of queries and constraints until they define a rich set of correctness tests for a distributed system.
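A drastically simplified picture of what BloomUnit automates: run the module under test against every delivery order of a small message set and check a declarative property of the outcome. The toy below enumerates all schedules by brute force; BloomUnit itself works on Bloom programs and prunes schedules that program semantics show to be equivalent.

```python
# Explore message reorderings and check a property of the final state.
from itertools import permutations

def kv_store(messages):
    """Toy module under test: the last delivered put for a key wins."""
    store = {}
    for key, value in messages:
        store[key] = value
    return store

def check_all_orders(messages, prop):
    failures = []
    for order in permutations(messages):
        state = kv_store(order)
        if not prop(state):
            failures.append(order)
    return failures

msgs = [("x", 1), ("x", 2), ("y", 3)]
# Property: 'y' is always 3 (no conflicting writes to it) -- holds in every order.
print(check_all_orders(msgs, lambda s: s["y"] == 3))        # []
# Property: 'x' is always 2 -- order-dependent, so some schedules fail.
print(len(check_all_orders(msgs, lambda s: s["x"] == 2)))   # 3 of the 6 schedules fail
```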

Posted Content•
TL;DR: Probabilistically bounded staleness (PBS) as mentioned in this paper provides expected bounds on staleness with respect to both versions and wall clock time, and quantitatively demonstrates that eventually consistent systems frequently return consistent data within tens of milliseconds while offering significant latency benefits.
Abstract: Data store replication results in a fundamental trade-off between operation latency and data consistency. In this paper, we examine this trade-off in the context of quorum-replicated data stores. Under partial, or non-strict quorum replication, a data store waits for responses from a subset of replicas before answering a query, without guaranteeing that read and write replica sets intersect. As deployed in practice, these configurations provide only basic eventual consistency guarantees, with no limit to the recency of data returned. However, anecdotally, partial quorums are often "good enough" for practitioners given their latency benefits. In this work, we explain why partial quorums are regularly acceptable in practice, analyzing both the staleness of data they return and the latency benefits they offer. We introduce Probabilistically Bounded Staleness (PBS) consistency, which provides expected bounds on staleness with respect to both versions and wall clock time. We derive a closed-form solution for versioned staleness as well as model real-time staleness for representative Dynamo-style systems under internet-scale production workloads. Using PBS, we measure the latency-consistency trade-off for partial quorum systems. We quantitatively demonstrate how eventually consistent systems frequently return consistent data within tens of milliseconds while offering significant latency benefits.

Proceedings Article•DOI•
14 Oct 2012
TL;DR: Writing reliable distributed programs remains stubbornly difficult: beyond asynchrony, concurrency, and partial failure, the massive scale of modern systems has pushed many developers toward application-level consistency criteria in place of strongly consistent storage, raising the degree of difficulty still further.
Abstract: In recent years, distributed programming has become a topic of widespread interest among developers. However, writing reliable distributed programs remains stubbornly difficult. In addition to the inherent challenges of distribution---asynchrony, concurrency, and partial failure---many modern distributed systems operate at massive scale. Scalability concerns have in turn encouraged many developers to eschew strongly consistent distributed storage in favor of application-level consistency criteria [5, 10, 18], which has raised the degree of difficulty still further.