Showing papers by "Eugene Wu" published in 2013


Journal ArticleDOI
01 Jun 2013
TL;DR: This work proposes Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results.
Abstract: Database users commonly explore large data sets by running aggregate queries that project the data down to a smaller number of points and dimensions, and visualizing the results. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. Unfortunately, databases and visualization systems do not provide a way to work backwards from an outlier point to the common properties of the (possibly many) unaggregated input tuples that correspond to that outlier. We propose Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results. Specifically, this explanation identifies predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, we develop a notion of influence of a predicate on a given output, and design several algorithms that efficiently search for maximum influence predicates over the input data. We show that these algorithms can quickly find outliers in two real data sets (from a sensor deployment and a campaign finance data set), and run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.

230 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: This work identifies a set of common semantics that can be leveraged to store fine-grained lineage efficiently, defines lineage representations that capture common locality properties in the lineage data, and introduces a set of APIs so operator developers can easily export lineage information from user-defined operators.
Abstract: Data lineage is a key component of provenance that helps scientists track and query relationships between input and output data. While current systems readily support lineage relationships at the file or data array level, finer-grained support at an array-cell level is impractical due to the lack of support for user-defined operators and the high runtime and storage overhead to store such lineage. We interviewed scientists in several domains to identify a set of common semantics that can be leveraged to efficiently store fine-grained lineage. We use the insights to define lineage representations that efficiently capture common locality properties in the lineage data, and a set of APIs so operator developers can easily export lineage information from user-defined operators. Finally, we introduce two benchmarks derived from astronomy and genomics, and show that our techniques can reduce lineage query costs by up to 10× while incurring substantially less impact on workflow runtime and storage.

43 citations


01 Jan 2013
TL;DR: MuckRaker is a tool that provides news consumers with datasets and visualizations that contextualize facts and figures in the articles they read. In doing so, it creates a synergistic relationship between news consumers and the database research community, providing training data to improve existing algorithms and a grand challenge for the next generation of dataspace management research.
Abstract: We present MuckRaker, a tool that provides news consumers with datasets and visualizations that contextualize facts and figures in the articles they read. MuckRaker takes advantage of data integration techniques to identify matching datasets, and makes use of data and schema extraction algorithms to identify data points of interest in articles. It presents the output of these algorithms to users requesting additional context, and allows users to further refine these outputs. In doing so, MuckRaker creates a synergistic relationship between news consumers and the database research community, providing training data to improve existing algorithms, and a grand challenge for the next generation of dataspace management research.

3 citations


Proceedings ArticleDOI
29 Jul 2013
TL;DR: Smart-phone applications run under conditions that developers find hard or impossible to emulate or anticipate during testing; feedback and analytics gathered after release drive further modifications, many of which are relatively small and can often be parameterized.
Abstract: Smart-phone applications ("apps") run across a wide range of environmental conditions, locations, and hardware platforms. They are often subject to an array of interactions that are hard or impossible for developers to emulate or even anticipate during testing. Once an application is released, feedback obtained from users and from analytics over usage and performance data results in further modifications. Many of these changes are relatively small, and can often be parameterized.

2 citations