Showing papers by Eugene Wu published in 2012


Journal ArticleDOI
01 Aug 2012
TL;DR: This paper presents DBWipes, a novel data cleaning system that allows users to execute aggregate queries and interactively detect, understand, and clean errors in the query results.
Abstract: As data analytics becomes mainstream, and the complexity of the underlying data and computation grows, it will be increasingly important to provide tools that help analysts understand the underlying reasons when they encounter errors in the result. While data provenance has been a large step in providing tools to help debug complex workflows, its current form has limited utility when debugging aggregation operators that compute a single output from a large collection of inputs. Traditional provenance will return the entire input collection, which has very low precision. In contrast, users are seeking precise descriptions of the inputs that caused the errors. We propose a Ranked Provenance System, which identifies subsets of inputs that influenced the output error, describes each subset with human-readable predicates, and orders them by contribution to the error. In this demonstration, we will present DBWipes, a novel data cleaning system that allows users to execute aggregate queries, and interactively detect, understand, and clean errors in the query results. Conference attendees will explore anomalies in campaign donations from the current US presidential election and in readings from a 54-node sensor deployment.

14 citations
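The ranked-provenance idea in the abstract (identify subsets of inputs, describe each subset with a predicate, and order the predicates by contribution to the error) can be sketched in a few lines. The sketch below is illustrative only: the influence metric, the candidate-predicate enumeration, and the toy sensor data are assumptions for this example, not DBWipes' actual algorithm.

```python
# Illustrative sketch (not DBWipes' actual algorithm): rank candidate
# predicates by how much removing the matching tuples moves an aggregate
# result toward the value the user expected.

def influence(tuples, predicate, agg, expected):
    """Error reduction achieved by deleting the tuples matching `predicate`."""
    observed = agg(tuples)
    remaining = [t for t in tuples if not predicate(t)]
    if not remaining:
        return 0.0
    repaired = agg(remaining)
    return abs(observed - expected) - abs(repaired - expected)

def rank_predicates(tuples, candidates, agg, expected):
    """Order human-readable candidate predicates by contribution to the error."""
    scored = [(influence(tuples, pred, agg, expected), desc)
              for desc, pred in candidates]
    return sorted(scored, reverse=True)

# Toy data: the average temperature looks too high because of one faulty sensor.
readings = [
    {"sensor": 1, "temp": 21.0},
    {"sensor": 2, "temp": 22.0},
    {"sensor": 3, "temp": 95.0},  # hypothetical faulty reading
]
avg_temp = lambda ts: sum(t["temp"] for t in ts) / len(ts)
candidates = [(f"sensor = {s}", lambda t, s=s: t["sensor"] == s) for s in (1, 2, 3)]

for score, desc in rank_predicates(readings, candidates, avg_temp, expected=21.5):
    print(f"{desc}: influence {score:.2f}")
```

On the toy data, the predicate describing the faulty sensor receives the highest influence score, which is the kind of precise, human-readable description of error-causing inputs the abstract argues for.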


01 Jan 2012
TL;DR: This work takes advantage of humans' ability to batch-process data to reduce the cost of both rating- and comparison-based rankings, by showing crowd workers five to ten items at a time and having them rate or compare all of the items.
Abstract: From an early age, we sort jellybeans with friends, placing watermelon flavor at the top of our list and pushing licorice to the bottom. We contribute to larger-scale ranking operations when we rate a product in an online catalog. Given the prevalence of human-powered ranking in the digital realm, it is important to study the most effective user interfaces and algorithms for eliciting rankings from humans. In experiments [2] involving Mechanical Turk [1] workers, we found that average ratings (e.g., “On a scale from one to seven”) have a cost linear in the input and result in high-accuracy rankings. Comparison-based sorts (e.g., “Which of these pairs of glasses is cooler?”) have a cost up to quadratic in the input, but result in perfect rankings. We found that a hybrid of the two, where one rates the input and then runs comparisons on the nearly ranked data as a budget allows, is both cost- and accuracy-effective. Lest we leave you with the foul-tasting notion that people are simply binary comparator-computing cogs in an algorithm-powered machine, there’s more to humans than that! Humans are effective at batch-processing data in ways that machines are not. We take advantage of this ability to reduce the complexity of both rating- and comparison-based rankings by showing crowd workers five to ten items at a time and having them rate or compare all of the items. There is more to this story, but it’s time for a snack! When ranking with humans, ratings are cheap and nice. To improve quality, please compare, for a price.
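A rough sketch of the hybrid strategy (rate every item once, then spend a limited budget of pairwise comparisons to refine the nearly ranked order) might look like the following. The rating and comparison oracles, the adjacent-pair refinement pass, and the toy data are illustrative assumptions rather than the paper's exact algorithm or experimental setup.

```python
import random

# Illustrative sketch of the hybrid strategy: cheap ratings produce a rough
# order, then a limited budget of pairwise comparisons refines it.
def hybrid_rank(items, rate, compare, comparison_budget):
    # Phase 1: one rating per item (cost linear in the input).
    ranked = sorted(items, key=rate, reverse=True)

    # Phase 2: compare adjacent pairs, which are the most likely to be
    # misordered, swapping whenever the comparison disagrees with the ratings.
    spent = 0
    changed = True
    while changed and spent < comparison_budget:
        changed = False
        for i in range(len(ranked) - 1):
            if spent >= comparison_budget:
                break
            spent += 1
            if compare(ranked[i + 1], ranked[i]):  # True if the first item is better
                ranked[i], ranked[i + 1] = ranked[i + 1], ranked[i]
                changed = True
    return ranked

# Toy usage: hypothetical "true" quality scores, a noisy rating oracle, and a
# reliable pairwise-comparison oracle standing in for crowd workers.
truth = {"watermelon": 9, "cherry": 7, "lemon": 5, "licorice": 1}
noisy_rate = lambda item: truth[item] + random.gauss(0, 2)
exact_compare = lambda a, b: truth[a] > truth[b]

print(hybrid_rank(list(truth), noisy_rate, exact_compare, comparison_budget=10))
```

The sketch mirrors the cost argument in the abstract: the rating pass costs one question per item, and the comparison pass is capped by an explicit budget instead of growing quadratically with the input.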