
Showing papers by "Eugene Wu published in 2019"


Posted Content
TL;DR: A framework, called AlphaClean, that rethinks parameter tuning for data cleaning pipelines, which is significantly more robust to straggling data cleaning methods and redundancy in the data cleaning library, and can incorporate state-of-the-art cleaning systems such as HoloClean as cleaning operators.
Abstract: The analyst effort in data cleaning is gradually shifting away from the design of hand-written scripts to building and tuning complex pipelines of automated data cleaning libraries. Hyper-parameter tuning for data cleaning is very different from hyper-parameter tuning for machine learning, since the pipeline components and objective functions have structure that tuning algorithms can exploit. This paper proposes a framework, called AlphaClean, that rethinks parameter tuning for data cleaning pipelines. AlphaClean provides users with a rich library to define data quality measures with weighted sums of SQL aggregate queries. AlphaClean applies a generate-then-search framework where each pipelined cleaning operator contributes candidate transformations to a shared pool. Asynchronously, in separate threads, a search algorithm sequences them into cleaning pipelines that maximize the user-defined quality measures. This architecture allows AlphaClean to apply a number of optimizations, including incremental evaluation of the quality measures and learning dynamic pruning rules to reduce the search space. Our experiments on real and synthetic benchmarks suggest that AlphaClean finds solutions of up to 9x higher quality than naively applying state-of-the-art parameter tuning methods, is significantly more robust to straggling data cleaning methods and redundancy in the data cleaning library, and can incorporate state-of-the-art cleaning systems such as HoloClean as cleaning operators.
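
To make the quality-measure interface concrete, here is a minimal sketch, assuming a hypothetical city/state schema, of how a weighted sum of SQL aggregate queries could score a partially cleaned table; the table names, queries, weights, and quality() function are illustrative assumptions, not AlphaClean's actual API.

import sqlite3

def quality(conn: sqlite3.Connection) -> float:
    # Each measure is (weight, aggregate query returning one violation count).
    measures = [
        (0.7, "SELECT COUNT(*) FROM cities WHERE population < 0"),
        (0.3, "SELECT COUNT(*) FROM cities "
              "WHERE state NOT IN (SELECT abbr FROM states)"),
    ]
    penalty = sum(w * conn.execute(q).fetchone()[0] for w, q in measures)
    return -penalty  # higher is better; the pipeline search maximizes this

A generate-then-search loop would then apply candidate transformations drawn from the cleaning library and keep the sequence that maximizes quality(conn).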

30 citations


Journal ArticleDOI
TL;DR: In this paper, the Pixel Approximate Entropy (PAE) is proposed as a measure of visual complexity for line charts, which adapts the approximate entropy statistical measure commonly used to quantify regularity and unpredictability in time-series data.
Abstract: When inspecting information visualizations under time-critical settings, such as emergency response or monitoring the heart rate in a surgery room, the user only has a small amount of time to view the visualization “at a glance”. In these settings, it is important to provide a quantitative measure of the visualization to understand whether or not the visualization is too “complex” to accurately judge at a glance. This paper proposes Pixel Approximate Entropy (PAE), which adapts the approximate entropy statistical measure commonly used to quantify regularity and unpredictability in time-series data, as a measure of visual complexity for line charts. We show that PAE is correlated with user-perceived chart complexity, and that increased chart PAE correlates with reduced judgment accuracy. We also find that the correlation between PAE values and participants’ judgment increases when the user has less time to examine the line charts.
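
As a rough illustration of the underlying statistic, here is a minimal sketch of classic approximate entropy over a one-dimensional series, assuming the chart has already been reduced to one y-value per x-pixel column; the exact pixel mapping PAE uses is defined in the paper, so this reduction is our assumption.

import numpy as np

def approx_entropy(u: np.ndarray, m: int = 2, r: float = 0.2) -> float:
    # ApEn(m, r) = phi(m) - phi(m + 1), where phi measures how often
    # length-m windows of the series repeat to within tolerance r.
    def phi(m: int) -> float:
        n = len(u) - m + 1
        x = np.array([u[i:i + m] for i in range(n)])      # all length-m windows
        dist = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
        c = (dist <= r).sum(axis=1) / n                   # fraction of near matches
        return np.log(c).mean()
    return phi(m) - phi(m + 1)

pixels = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1 * np.random.randn(200)
print(approx_entropy(pixels, m=2, r=0.2 * pixels.std()))

Noisier, less predictable series yield larger values, matching the intuition that they are harder to judge at a glance.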

26 citations


Proceedings ArticleDOI
25 Jun 2019
TL;DR: DeepBase as mentioned in this paper is a system to inspect neural network behaviors through a unified interface, allowing users to annotate the data with high-level labels (e.g., part-of-speech tags, image captions).
Abstract: Although deep learning models perform remarkably well across a range of tasks such as language translation and object recognition, it remains unclear what high-level logic, if any, they follow. Understanding this logic may lead to more transparency, better model design, and faster experimentation. Recent machine learning research has leveraged statistical methods to identify hidden units that behave (e.g., activate) similarly to human understandable logic, but those analyses require considerable manual effort. Our insight is that many of those studies follow a common analysis pattern, and therefore there is opportunity to provide a declarative abstraction to easily express, execute and optimize them. This paper describes DeepBase, a system to inspect neural network behaviors through a unified interface. We model logic with user-provided hypothesis functions that annotate the data with high-level labels (e.g., part-of-speech tags, image captions). DeepBase lets users quickly identify individual or groups of units that have strong statistical dependencies with desired hypotheses. We discuss how DeepBase can express existing analyses, propose a set of simple and effective optimizations to speed up a standard Python implementation by up to 72x, and reproduce recent studies from the NLP literature.
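
The following sketch illustrates the common analysis pattern the abstract describes: score every hidden unit by its statistical dependency with a hypothesis annotation. DeepBase supports multiple dependency measures; plain Pearson correlation and all names below are simplifying assumptions, not DeepBase's API.

import numpy as np

def hypothesis_scores(activations: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # activations: (n_examples, n_units) recorded hidden-unit values.
    # labels: (n_examples,) hypothesis output, e.g., 1 if a token is a noun.
    a = activations - activations.mean(axis=0)
    l = labels - labels.mean()
    denom = np.sqrt((a ** 2).sum(axis=0) * (l ** 2).sum()) + 1e-12
    return (a * l[:, None]).sum(axis=0) / denom  # per-unit correlation

acts = np.random.randn(1000, 128)         # stand-in for recorded activations
is_noun = np.random.randint(0, 2, 1000)   # stand-in hypothesis annotations
top_units = np.argsort(-np.abs(hypothesis_scores(acts, is_noun)))[:5]

A system like DeepBase then optimizes how such scores are computed across many units and hypotheses at once.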

20 citations


Proceedings ArticleDOI
25 Jun 2019
TL;DR: This tutorial will walk through this prior art, paying particular attention to problems that may attract interest from the database community.
Abstract: The problem of data visualization is to transform data into a visual context such that people can easily understand the significance of the data. Data visualization has become especially important because it is the de facto standard for modern business intelligence and successful data science. This tutorial will cover three specific topics: visualization languages, which define how users can interact with various visualization systems; efficient data visualization, which processes the data and produces visualizations based on well-specified user queries; and smart data visualization, which recommends data visualizations based on underspecified user queries. In this tutorial, we will walk through this prior art, paying particular attention to problems that may attract interest from the database community.

16 citations


Proceedings ArticleDOI
25 Jun 2019
TL;DR: Precision Interfaces as discussed by the authors analyzes structural changes between input queries from an analysis and generates an output interface with widgets to express those changes, yielding useful interfaces for simple unanticipated tasks.
Abstract: Interactive tools make data analysis more efficient and more accessible to end-users by hiding the underlying query complexity and exposing interactive widgets for the parts of the query that matter to the analysis. However, creating custom-tailored (i.e., precise) interfaces is very costly, and automated approaches are desirable. We propose a syntactic approach that uses queries from an analysis to generate a tailored interface. We model interface widgets as functions I(q) -> q' that modify the current analysis query q, and interfaces as the set of queries that its widgets can express. Our system, Precision Interfaces, analyzes structural changes between input queries from an analysis, and generates an output interface with widgets to express those changes. Our experiments on the Sloan Digital Sky Survey query log suggest that Precision Interfaces can generate useful interfaces for simple unanticipated tasks, and our optimizations can generate interfaces from logs of up to 10,000 queries in under 10 seconds.
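
A minimal sketch of the widget model I(q) -> q': a widget is a function that rewrites the current query. Representing queries as strings with a single templated hole is our simplification; the system itself diffs parsed query ASTs.

from dataclasses import dataclass

@dataclass
class Slider:
    template: str   # query with one hole where the logged queries differed
    lo: float
    hi: float

    def apply(self, value: float) -> str:
        assert self.lo <= value <= self.hi
        return self.template.format(value)

# Two logged queries that differ only in a literal suggest a slider widget.
q1 = "SELECT * FROM stars WHERE mag < 4.0"
q2 = "SELECT * FROM stars WHERE mag < 7.5"
slider = Slider("SELECT * FROM stars WHERE mag < {}", lo=4.0, hi=7.5)
print(slider.apply(5.0))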

11 citations


Proceedings ArticleDOI
25 Jun 2019
TL;DR: SmartCrawl, a new framework to progressively crawl the deep web through a keyword-search API to enrich a local database in an effective way, is proposed; on both simulated and real-world hidden databases, SmartCrawl increases coverage over the local database as compared to the baselines.
Abstract: Data enrichment is the act of extending a local database with new attributes from external data sources. In this paper, we study a novel problem: how to progressively crawl the deep web (i.e., a hidden database) through a keyword-search API to enrich a local database in an effective way. This is challenging because these interfaces often limit the data access by enforcing the top-k constraint or limiting the number of queries that can be issued within a time window. In response, we propose SmartCrawl, a new framework to collect results effectively. Given a query budget b, SmartCrawl first constructs a query pool based on the local database, and then iteratively issues a set of most beneficial queries to the hidden database such that the union of the query results can cover the maximum number of local records. The key technical challenge is how to estimate query benefit, i.e., the number of local records that can be covered by a given query. A simple approach is to estimate it as the query frequency in the local database. We find that this is ineffective due to i) the impact of |ΔD|, where |ΔD| represents the number of local records that cannot be found in the hidden database, and ii) the top-k constraint enforced by the hidden database. We study how to mitigate the negative impacts of the two factors and propose effective optimization techniques to improve performance. The experimental results show that on both simulated and real-world hidden databases, SmartCrawl significantly increases coverage over the local database as compared to the baselines.
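
A hypothetical sketch of the outer crawl loop: build a keyword query pool from the local records, then greedily issue the query with the highest estimated benefit. Estimating benefit as the number of uncovered local records containing the keyword is the simple frequency-style baseline discussed above; search_api stands in for the hidden database's top-k keyword interface, and exact string matching is a toy simplification.

def crawl(local_records, search_api, budget):
    uncovered = set(local_records)
    pool = {kw for rec in local_records for kw in rec.lower().split()}
    for _ in range(budget):
        if not pool or not uncovered:
            break
        # Benefit estimate: uncovered local records mentioning the keyword.
        query = max(pool, key=lambda kw: sum(kw in r.lower() for r in uncovered))
        pool.discard(query)
        uncovered -= set(search_api(query))  # at most k results come back
    return uncovered  # local records still unmatched after the budget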

8 citations


Journal ArticleDOI
25 Mar 2019
TL;DR: This paper extends a recent online aggregation method, called Wander Join, that is optimized for queries that join tables, one of the most computationally expensive operations, and applies user interaction techniques that allow the user to view and adjust the convergence rate, providing more transparency and control over the online aggregation process.
Abstract: Progressive visualization offers a great deal of promise for big data visualization; however, current progressive visualization systems do not allow for continuous interaction. What if users want to see more confident results on a subset of the visualization? This can happen when users are in exploratory analysis mode but want to ask some directed questions of the data as well. In a progressive visualization system, the online aggregation algorithm determines the database sampling rate and resulting convergence rate, not the user. In this paper, we extend a recent method in online aggregation, called Wander Join, that is optimized for queries that join tables, one of the most computationally expensive operations. This extension leverages importance sampling to enable user-driven sampling when data joins are in the query. We applied user interaction techniques that allow the user to view and adjust the convergence rate, providing more transparency and control over the online aggregation process. By leveraging importance sampling, our extension of Wander Join also allows for stratified sampling of groups when there is data distribution skew. We also improve the convergence rate of filtering queries, but with additional overhead costs not needed in the original Wander Join algorithm.
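
The importance-sampling idea can be illustrated in a few lines: rows the user has focused on are sampled more often, and each sampled value is divided by its selection probability so the aggregate estimate stays unbiased. The focus weights below are hypothetical; the actual system applies this reweighting to join paths inside Wander Join.

import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(10.0, size=100_000)

focus = np.ones(len(values))
focus[:5_000] = 20.0                 # subset the user wants to converge faster
p = focus / focus.sum()              # per-row selection probability

idx = rng.choice(len(values), size=2_000, p=p)
estimate = np.mean(values[idx] / p[idx])  # Horvitz-Thompson-style sum estimate
print(estimate, values.sum())             # matches the true sum in expectation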

7 citations


Proceedings ArticleDOI
20 Nov 2019
TL;DR: This paper introduces a judicious adaptation of predicate analysis on analyzed query plans that avoids unnecessary query optimization, and presents a UDF translator that transparently compiles UDFs from general purpose languages into native equivalents.
Abstract: Result caching is crucial to the performance of data processing systems, but two trends complicate its use. First, immutable datasets make it difficult to efficiently employ powerful result caching techniques like predicate analysis, since predicate analysis typically requires optimized query plans but generating those plans can be costly with data immutability. Second, increased support for user-defined functions (UDFs), which are treated as black boxes by query engines, hinders aggressive result caching. This paper overcomes these problems by introducing 1) a judicious adaptation of predicate analysis on analyzed query plans that avoids unnecessary query optimization, and 2) a UDF translator that transparently compiles UDFs from general purpose languages into native equivalents. We then present Acorn, a concrete implementation of these techniques in Spark SQL that provides speedups of up to 5x across multiple benchmark and real Spark graph processing workloads.
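
A small sketch of the predicate analysis behind result caching: a new query can be answered from a cached result when its predicate is subsumed by the cached query's predicate. Restricting predicates to a numeric range on a single column is our simplification; Acorn performs this analysis on analyzed Spark SQL query plans.

from dataclasses import dataclass

@dataclass(frozen=True)
class RangePredicate:
    column: str
    lo: float
    hi: float

    def subsumes(self, other: "RangePredicate") -> bool:
        # self covers other iff other's range lies entirely within self's.
        return (self.column == other.column
                and self.lo <= other.lo
                and other.hi <= self.hi)

cached = RangePredicate("age", 18, 65)
incoming = RangePredicate("age", 21, 30)
if cached.subsumes(incoming):
    pass  # answer by re-filtering the cached result instead of rescanning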

4 citations


Proceedings ArticleDOI
02 May 2019
TL;DR: A corpus of Twitch streamer popularity measures and their behavior data on Twitch and third party platforms is collected to test the community-proposed relationship between behavior on social media accounts and popularity.
Abstract: Twitch, a live video-streaming platform, provides strong financial and social incentives for developing a follower base. While streamers benefit from Twitch's own features for forming a wide community of engaged viewers, many streamers look to external social media platforms to increase their reach and build their following. We collect a corpus of Twitch streamer popularity measures and their behavior data on Twitch and third-party platforms. We test the community-proposed relationship between behavior on social media accounts and popularity by examining the timing of creation and use of social media accounts. We conduct these experiments by studying the correlations between streamer behaviors and two popularity measures used by Twitch: followers and average concurrent viewers. We find that we cannot yet determine which behaviors have statistically significant correlations with popularity, and propose future directions for this research.
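
A minimal sketch of the kind of test described above: rank-correlate one behavior signal with a popularity measure and check significance. The arrays are synthetic stand-ins, not the collected corpus.

import numpy as np
from scipy.stats import spearmanr

days_since_twitter = np.random.exponential(300, size=500)  # account-age proxy
followers = np.random.lognormal(8, 2, size=500)

rho, p = spearmanr(days_since_twitter, followers)
print(f"rho={rho:.3f}, p={p:.3g}")  # p >= 0.05: no significant correlation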

3 citations


Posted Content
28 Jun 2019
TL;DR: DIEL is presented, a framework that achieves cross-layer autoscaling transparently under a simple, declarative interface and makes it easier to develop visualizations that are robust against changes to the size and location of data.
Abstract: We live in an era of big data and rich data visualization. As data sets increase in size, browser-based interactive visualizations eventually hit limits in storage and processing capacity. In order to provide interactivity over large datasets, visualization applications typically need to be extensively rewritten to make use of powerful back-end services. It would be far preferable if front-end developers could write visualizations once in a natural way, and have a framework take responsibility for transparently scaling up the visualization to use back-end services as needed. Achieving this goal requires rethinking how communication and state are managed by the framework: the mapping of interaction logic to server APIs or database queries, handling of results arriving asynchronously over the network, as well as basic cross-layer performance optimizations like caching. In this paper, we present DIEL, a framework that achieves this cross-layer autoscaling transparently under a simple, declarative interface. DIEL treats UI events as a stream of data that is captured in an event history for reuse. Developers declare what the state of the interface should be after the arrival of events. DIEL compiles these declarative specifications into relational queries over both event history and the data to be visualized. In doing so, DIEL makes it easier to develop visualizations that are robust against changes to the size and location of data. To evaluate the DIEL framework, we developed a prototype implementation and confirmed that DIEL supports a range of visualization and interaction designs. Visualizations written using DIEL can transparently and seamlessly scale to use back-end services with little intervention from the developer.
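
The core idea, interactions as rows in an event-history table and interface state as a declarative query over that history, can be sketched in plain SQL. The schema and names below are hypothetical, not DIEL's actual syntax.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brush_events (ts INTEGER, lo REAL, hi REAL)")
conn.execute("CREATE TABLE flights (delay REAL, distance REAL)")
conn.execute("INSERT INTO flights VALUES (12.0, 350.0), (30.0, 900.0)")

# Each brush appends an event instead of mutating interface state in place.
conn.execute("INSERT INTO brush_events VALUES (1, 0.0, 500.0)")
conn.execute("INSERT INTO brush_events VALUES (2, 100.0, 800.0)")

# The chart's state: an aggregate over rows selected by the latest brush.
state = conn.execute("""
    SELECT AVG(f.delay)
    FROM flights f,
         (SELECT lo, hi FROM brush_events ORDER BY ts DESC LIMIT 1) b
    WHERE f.distance BETWEEN b.lo AND b.hi
""").fetchone()
print(state)  # (12.0,): only the 350-mile flight falls in the latest brush

Because the same queries can run over local or remote tables, the framework can relocate them to back-end services without changing the specification.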

2 citations


Posted Content
TL;DR: DIEL, a declarative programming model to help developers reason about and reconcile concurrency-related issues, is presented and it is shown that resolving conflicts from concurrent processes in real-world interactive visualizations can be done in a few lines of DIEL code.
Abstract: Modern interactive visualizations are akin to distributed systems, where user interactions, background data processing, remote requests, and streaming data read and modify the interface at the same time. This concurrency is crucial to provide an interactive user experience---forbidding it can cripple responsiveness. However, it is notoriously challenging to program distributed systems, and concurrency can easily lead to ambiguous or confusing interface behaviors. In this paper, we present DIEL, a declarative programming model to help developers reason about and reconcile concurrency-related issues. Using DIEL, developers no longer need to procedurally describe how the interface should update based on different input events, but rather declaratively specify what the state of the interface should be as queries over event history. We show that resolving conflicts from concurrent processes in real-world interactive visualizations can be done in a few lines of DIEL code.
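
Continuing the event-history idea, one concurrency policy the abstract alludes to can be written as a query: when asynchronous responses race, show only the one answering the most recent request. The schema and names are hypothetical, not DIEL's syntax.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (ts INTEGER)")
conn.execute("CREATE TABLE responses (request_ts INTEGER, payload TEXT)")
conn.execute("INSERT INTO requests VALUES (1), (2)")
conn.execute("INSERT INTO responses VALUES (2, 'fresh'), (1, 'stale')")

# 'stale' arrived last but answers an old request, so it is never rendered;
# the interface cannot flicker back to out-of-date results.
row = conn.execute("""
    SELECT payload FROM responses
    WHERE request_ts = (SELECT MAX(ts) FROM requests)
""").fetchone()
print(row)  # ('fresh',)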

Posted Content
TL;DR: DIEL as discussed by the authors is a declarative framework that supports asynchronous events over distributed data, extending to this setting the high-level declarative specifications that modern browser-based interactive visualization methods feature.
Abstract: Interactive visualization design and research have primarily focused on local data and synchronous events. However, for more complex use cases---e.g., remote database access and streaming data sources---developers must grapple with distributed data and asynchronous events. Currently, constructing these use cases is difficult and time-consuming; developers are forced to operationally program low-level details like asynchronous database querying and reactive event handling. This approach is in stark contrast to modern methods for browser-based interactive visualization, which feature high-level declarative specifications. In response, we present DIEL, a declarative framework that supports asynchronous events over distributed data. Like many declarative visualization languages, DIEL developers need only specify what data they want, rather than procedural steps for how to assemble it; uniquely, DIEL models asynchronous events (e.g., user interactions or server responses) as streams of data that are captured in event logs. To specify the state of a user interface at any time, developers author declarative queries over the data and event logs; DIEL compiles and optimizes a corresponding dataflow graph, and synthesizes necessary low-level distributed systems details. We demonstrate DIEL's performance and expressivity through example interactive visualizations that make diverse use of remote data and coordination of asynchronous events. We further evaluate DIEL's usability using the Cognitive Dimensions of Notations framework, revealing wins such as ease of change, and compromises such as premature commitments.