Showing papers by "Eugene Wu" published in 2018


Journal Article•DOI•
01 Aug 2018
TL;DR: This paper reviews the WebTables project, places it in the broader context of the decade of work that followed, and shows how the progress over the past ten years sets up an exciting agenda for the future.
Abstract: In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activities around the WebTables project itself, as well as the broad topic of informal online structured data. In this paper, we will review the WebTables project, and try to place it in the broader context of the decade of work that followed. We will also show how the progress over the past ten years sets up an exciting agenda for the future, and will draw upon many corners of the data management community.

54 citations


Proceedings Article•DOI•
01 Feb 2018
TL;DR: This paper describes Smoke, an in-memory database engine that provides both fast lineage capture and lineage query processing; it tightly integrates the lineage capture logic into physical database operators, stores lineage in efficient lineage representations, and employs optimizations when future lineage queries are known up-front.
Abstract: Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, long-standing problem that lineage systems need to address, and the main focus of this paper, is to quickly capture lineage across a workflow in order to speed up future queries over lineage. Current lineage systems, however, either incur high lineage capture overheads, high lineage query processing costs, or both. In response, developers resort to manual implementations of applications that, in principle, can be expressed and optimized in lineage terms. This paper describes Smoke, an in-memory database engine that provides both fast lineage capture and lineage query processing. To do so, Smoke tightly integrates the lineage capture logic into physical database operators; stores lineage in efficient lineage representations; and employs optimizations if future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and lineage query costs by multiple orders of magnitude as compared to state-of-the-art alternatives. On real-world applications, we show that Smoke meets the latency requirements of interactive visualizations (e.g., <150ms) and outperforms hand-written implementations of data profiling primitives.
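
To make the capture-inside-operators idea concrete, here is a minimal sketch, not Smoke's actual code, of a selection operator that records row-level lineage inline as it executes, so that later backward lineage queries reduce to cheap index lookups. All names are invented for illustration.

```python
# Hypothetical sketch (not Smoke's actual code): a selection operator that
# captures lineage inline. For each output row it records the index of the
# input row that produced it, so backward lineage queries become array
# lookups instead of re-execution.

def select_with_lineage(rows, predicate):
    output, lineage = [], []
    for idx, row in enumerate(rows):
        if predicate(row):
            output.append(row)
            lineage.append(idx)  # capture happens during normal execution
    return output, lineage

rows = [{"city": "NYC", "sales": 10}, {"city": "SF", "sales": 7},
        {"city": "NYC", "sales": 3}]
out, lin = select_with_lineage(rows, lambda r: r["city"] == "NYC")
# Backward lineage: output row 1 came from input row 2.
assert lin[1] == 2
```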

45 citations


Proceedings Article•DOI•
10 Jun 2018
TL;DR: In this article, the authors highlight the connections between data provenance and interactive visualizations, and describe how an interactive visualization system that natively supports provenance can be easily extended with novel interactions.
Abstract: We highlight the connections between data provenance and interactive visualizations. To do so, we first incrementally add interactions to a visualization and show how these interactions are readily expressible in terms of provenance. We then describe how an interactive visualization system that natively supports provenance can be easily extended with novel interactions.
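
As an illustration of the paper's premise, the hypothetical sketch below expresses a tooltip (details-on-demand) interaction as a backward provenance query over a grouped aggregate; the function names are invented.

```python
# Hypothetical sketch: a tooltip over a bar in a SUM(sales) GROUP BY city
# chart is just the backward provenance of that bar, i.e., the base rows
# that contributed to the hovered aggregate.

from collections import defaultdict

def group_sum_with_provenance(rows, key, val):
    agg, prov = defaultdict(float), defaultdict(list)
    for i, r in enumerate(rows):
        agg[r[key]] += r[val]
        prov[r[key]].append(i)  # remember which rows fed each group
    return dict(agg), dict(prov)

rows = [{"city": "NYC", "sales": 10}, {"city": "SF", "sales": 7},
        {"city": "NYC", "sales": 3}]
bars, prov = group_sum_with_provenance(rows, "city", "sales")

def tooltip(bar_key):
    # the interaction, expressed as a provenance query
    return [rows[i] for i in prov[bar_key]]

print(tooltip("NYC"))  # the two NYC rows behind the hovered bar
```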

13 citations


Posted Content•
TL;DR: Smoke is introduced, an in-memory database engine in which neither lineage capture overhead nor lineage query processing needs to be compromised; it can meet the latency requirements of interactive visualizations and outperform hand-written implementations of data profiling primitives.
Abstract: Data lineage describes the relationship between individual input and output data items of a workflow, and has served as an integral ingredient for both traditional (e.g., debugging, auditing, data integration, and security) and emergent (e.g., interactive visualizations, iterative analytics, explanations, and cleaning) applications. The core, long-standing problem that lineage systems need to address, and the main focus of this paper, is to capture the relationships between input and output data items across a workflow with the goal of streamlining queries over lineage. Unfortunately, current lineage systems either incur high lineage capture overheads, high lineage query processing costs, or both. As a result, applications that, in principle, can express their logic declaratively in lineage terms resort to hand-tuned implementations. In response, we introduce Smoke, an in-memory database engine in which neither lineage capture overhead nor lineage query processing needs to be compromised. To do so, Smoke introduces tight integration of the lineage capture logic into physical database operators; efficient, write-optimized lineage representations for storage; and optimizations when future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and streamlines lineage queries by multiple orders of magnitude compared to state-of-the-art alternatives. Our experiments on real-world applications highlight that Smoke can meet the latency requirements of interactive visualizations (e.g., <150ms) and outperform hand-written implementations of data profiling primitives.

12 citations


Posted Content•
TL;DR: It is observed that traditional asynchronous interfaces, where results update in place, induce users to wait for each result before interacting, forgoing the benefits of asynchronous rendering; when results are instead rendered cumulatively over the recent history, users interact asynchronously and achieve faster task completion times.
Abstract: Asynchronous interfaces allow users to concurrently issue requests while existing ones are processed. While such interfaces are widely used to support non-blocking input in the presence of latency, it is not clear whether people can make use of asynchrony while the data is updating, since the UI changes dynamically and the updates can be hard to interpret. Interactive data visualization presents an interesting context for studying the effects of asynchronous interfaces, since interactions are frequent, task latencies can vary widely, and results often require interpretation. In this paper, we study the effects of introducing asynchrony into interactive visualizations, under different latencies and with different tasks. We observe that traditional asynchronous interfaces, where results update in place, induce users to wait for the result before interacting, not taking advantage of the asynchronous rendering of the results. However, when results are rendered cumulatively over the recent history, users perform asynchronous interactions and achieve faster task completion times.
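
The sketch below, with invented names, contrasts the two rendering policies studied: in-place updates that overwrite the display versus cumulative rendering over a window of recent results, which lets the user issue requests without waiting on earlier responses.

```python
# Hypothetical sketch of the two rendering policies. In-place rendering
# overwrites the display with each response; cumulative rendering keeps a
# window of recent results visible, so there is no pressure to wait.

import asyncio, random

async def handle_request(query):
    await asyncio.sleep(random.uniform(0.1, 0.5))  # simulated backend latency
    return f"result({query})"

async def session(queries, cumulative=True, history=3):
    display = []
    async def issue(q):
        res = await handle_request(q)
        if cumulative:
            display.append(res)       # recent results stay visible
            del display[:-history]    # keep only a short window
        else:
            display[:] = [res]        # in place: each result overwrites
        print(display)
    # the user issues every request without blocking on earlier responses
    await asyncio.gather(*(issue(q) for q in queries))

asyncio.run(session(["q1", "q2", "q3", "q4"]))
```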

10 citations


Posted Content•
TL;DR: It is shown that PAE correlates with user-perceived chart complexity and that increased chart PAE correlates with reduced judgment accuracy; moreover, the correlation between PAE values and participants' judgments strengthens when the user has less time to examine the line charts.
Abstract: When inspecting information visualizations under time-critical settings, such as emergency response or monitoring the heart rate in a surgery room, the user only has a small amount of time to view the visualization "at a glance". In these settings, it is important to provide a quantitative measure of the visualization to understand whether or not the visualization is too "complex" to accurately judge at a glance. This paper proposes Pixel Approximate Entropy (PAE), which adapts the approximate entropy statistical measure, commonly used to quantify regularity and unpredictability in time-series data, as a measure of visual complexity for line charts. We show that PAE is correlated with user-perceived chart complexity, and that increased chart PAE correlates with reduced judgment accuracy. We also find that the correlation between PAE values and participants' judgments increases when the user has less time to examine the line charts.
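
For reference, here is a minimal approximate-entropy implementation; PAE as described applies this statistic to the pixel sequence of a rendered line chart. The parameter choices (m=2, r=0.2·std) are common ApEn defaults, not necessarily the paper's.

```python
# Minimal approximate entropy: ApEn(m, r) = phi(m) - phi(m + 1), where
# phi(k) averages the log-fraction of length-k windows within Chebyshev
# distance r of each window. Regular series yield low ApEn; noisy ones high.

import math, random

def _std(xs):
    mu = sum(xs) / len(xs)
    return (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5

def approx_entropy(series, m=2, r=None):
    n = len(series)
    r = 0.2 * _std(series) if r is None else r
    def phi(k):
        windows = [series[i:i + k] for i in range(n - k + 1)]
        fracs = []
        for w in windows:
            # count windows within distance r (self-match included, so > 0)
            c = sum(max(abs(a - b) for a, b in zip(w, v)) <= r for v in windows)
            fracs.append(c / len(windows))
        return sum(math.log(f) for f in fracs) / len(windows)
    return phi(m) - phi(m + 1)

random.seed(0)
regular = [i % 2 for i in range(60)]          # predictable: low ApEn
noisy = [random.random() for _ in range(60)]  # irregular: higher ApEn
print(approx_entropy(regular), approx_entropy(noisy))
```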

7 citations


Proceedings Article•DOI•
27 May 2018
TL;DR: Precision Interfaces is a semi-automatic system that generates task-specific data analytics interfaces by turning a log of executed programs into an interface; this paper focuses on SQL query logs, but the approach generalizes to other languages.
Abstract: Building interactive tools to support data analysis is hard because it is not always clear what to build and how to build it. To address this problem, we present Precision Interfaces, a semi-automatic system to generate task-specific data analytics interfaces. Precision Interfaces can turn a log of executed programs into an interface by identifying micro-variations between the programs and mapping them to interface components. This paper focuses on SQL query logs, but we can generalize the approach to other languages. Our system operates in two steps: it first builds an interaction graph, which describes how the queries can be transformed into each other. Then, it finds a set of UI components that covers a maximal number of transformations. To restrict the domain of changes to be detected, our system uses a domain-specific language, PILang. We describe each of the system's components, showcase an early prototype on real program logs, and discuss future research opportunities. This demonstration highlights the potential for data-driven interactive interface mining from query logs. We will first walk participants through the process that Precision Interfaces goes through to generate interactive analysis interfaces from query logs. We will then show the versatility of Precision Interfaces by letting participants choose from multiple different interface modalities, interaction designs, and query logs to generate 2D point-and-click, gestural, and even natural language analysis interfaces for commonly performed analyses.
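
A hypothetical sketch of the core step, not the actual PILang DSL: detect a micro-variation between consecutive logged SQL queries and, when they differ in exactly one numeric literal, map that variation to a slider component.

```python
# Illustrative micro-variation detection: if two logged queries share the
# same template and differ in one numeric literal, that literal is a good
# candidate for a slider widget in the generated interface.

import re

NUM = r"\b\d+(?:\.\d+)?\b"

def diff_queries(q1, q2):
    """Return the (old, new) values if the queries differ in exactly one
    numeric literal, else None (a structural change)."""
    if re.sub(NUM, "?", q1) != re.sub(NUM, "?", q2):
        return None  # query templates differ; not a slider-style variation
    v1 = [float(m) for m in re.findall(NUM, q1)]
    v2 = [float(m) for m in re.findall(NUM, q2)]
    diffs = [(a, b) for a, b in zip(v1, v2) if a != b]
    return diffs[0] if len(diffs) == 1 else None

log = ["SELECT * FROM sales WHERE price < 10",
       "SELECT * FROM sales WHERE price < 20"]
change = diff_queries(*log)
if change:  # one varying literal -> expose it as a slider in the interface
    print(f"slider over price threshold, range {min(change)}..{max(change)}")
```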

7 citations


Posted Content•
TL;DR: It is described how an interactive visualization system that natively supports provenance can be easily extended with novel interactions, and how these interactions are readily expressible in terms of provenance.
Abstract: We highlight the connections between data provenance and interactive visualizations. To do so, we first incrementally add interactions to a visualization and show how these interactions are readily expressible in terms of provenance. We then describe how an interactive visualization system that natively supports provenance can be easily extended with novel interactions.

6 citations


Proceedings Article•DOI•
27 May 2018
TL;DR: This demonstration showcases lineage as the building block across a variety of data-intensive applications, including tooltips and details on demand; crossfilter; and data profiling, and shows how Smoke outperforms alternative lineage systems and meets or improves on existing hand-tuned implementations of these applications.
Abstract: Data lineage is a fundamental type of information that describes the relationships between input and output data items in a workflow. As such, an immense number of data-intensive applications with logic over the input-output relationships can be expressed declaratively in lineage terms. Unfortunately, many applications resort to hand-tuned implementations, either because lineage systems are not fast enough to meet their requirements or because developers are unaware of the lineage capabilities. Recently, we introduced a set of implementation design principles and associated techniques to optimize lineage-enabled database engines and realized them in our prototype database engine, namely, Smoke. In this demonstration, we showcase lineage as the building block across a variety of data-intensive applications, including tooltips and details on demand; crossfilter; and data profiling. In addition, we show how Smoke outperforms alternative lineage systems to meet or improve on existing hand-tuned implementations of these applications.

6 citations


Proceedings Article•
01 Jan 2018
TL;DR: In this paper, a perturbation-based explanation method for tree-ensembles is proposed to identify writing features that, if changed, will most improve the text quality.
Abstract: User-generated, multi-paragraph writing is pervasive and important on many social media platforms (e.g., Amazon reviews, Airbnb host profiles). Ensuring high-quality content is important. Unfortunately, content submitted by users is often not of high quality. Moreover, the characteristics that constitute high quality may even vary between domains in ways that users are unaware of. Automated writing feedback has the potential to immediately point out and suggest improvements during the writing process. Most approaches, however, focus on syntax/phrasing, which is only one characteristic of high-quality content. Existing research develops accurate quality prediction models. We propose combining these models with model explanation techniques to identify writing features that, if changed, will most improve the text quality. To this end, we develop a perturbation-based explanation method for a popular class of models called tree ensembles. Furthermore, we use a weak-supervision technique to adapt this method to generate feedback for specific text segments in addition to feedback for the entire document. Our user study finds that the perturbation-based approach, when combined with segment-specific feedback, can help improve writing quality on Amazon (review helpfulness) and Airbnb (host profile trustworthiness) by >14% (a 3X improvement over recent automated feedback techniques).
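
A hedged sketch of the perturbation idea (the paper's tree-ensemble method is more involved): perturb each writing feature of a document and rank features by how much the perturbation raises the model's predicted quality. The features and model here are toy stand-ins.

```python
# Toy stand-ins for writing features and a trained quality model; the
# perturbation loop below is the illustrated idea, not the paper's exact
# method.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# hypothetical features: [word_count, readability, num_typos]
X = rng.normal(size=(200, 3))
y = X[:, 0] + 2 * X[:, 1] - 3 * X[:, 2] + rng.normal(scale=0.1, size=200)
model = GradientBoostingRegressor().fit(X, y)

def feedback(doc, deltas=(-1.0, 1.0)):
    """Rank features by the largest quality gain any perturbation yields."""
    base = model.predict([doc])[0]
    gains = []
    for j in range(len(doc)):
        best = -np.inf
        for d in deltas:
            perturbed = doc.copy()
            perturbed[j] += d                      # nudge one feature
            best = max(best, model.predict([perturbed])[0] - base)
        gains.append((best, j))
    return sorted(gains, reverse=True)  # most improvable feature first

print(feedback(X[0]))  # e.g., lowering num_typos should rank highly
```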

5 citations


Posted Content•
TL;DR: DeepBase is described, a system that inspects neural network behaviors through a unified interface; it models logic with user-provided hypothesis functions that annotate the data with high-level labels, and lets users quickly identify individual units or groups of units that have strong statistical dependencies with desired hypotheses.
Abstract: Although deep learning models perform remarkably well across a range of tasks such as language translation and object recognition, it remains unclear what high-level logic, if any, they follow. Understanding this logic may lead to more transparency, better model design, and faster experimentation. Recent machine learning research has leveraged statistical methods to identify hidden units that behave (e.g., activate) similarly to human-understandable logic, but those analyses require considerable manual effort. Our insight is that many of those studies follow a common analysis pattern, which we term Deep Neural Inspection. There is an opportunity to provide a declarative abstraction to easily express, execute, and optimize such analyses. This paper describes DeepBase, a system to inspect neural network behaviors through a unified interface. We model logic with user-provided hypothesis functions that annotate the data with high-level labels (e.g., part-of-speech tags, image captions). DeepBase lets users quickly identify individual or groups of units that have strong statistical dependencies with desired hypotheses. We discuss how DeepBase can express existing analyses, propose a set of simple and effective optimizations to speed up a standard Python implementation by up to 72x, and reproduce recent studies from the NLP literature.
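
The sketch below illustrates the Deep Neural Inspection pattern as described, not DeepBase's actual API: score each hidden unit by the statistical dependency between its activations and a user-provided hypothesis function's labels, here via absolute Pearson correlation. All data and names are invented.

```python
# Hypothetical Deep Neural Inspection loop: rank hidden units by how
# strongly their per-token activations track a hypothesis function.

import numpy as np

def is_noun(token):  # a toy hypothesis function
    return 1.0 if token in {"cat", "dog", "table"} else 0.0

def inspect(activations, tokens, hypothesis):
    """activations: (num_tokens, num_units) array of hidden activations."""
    labels = np.array([hypothesis(t) for t in tokens])
    scores = []
    for u in range(activations.shape[1]):
        a = activations[:, u]
        if a.std() == 0 or labels.std() == 0:
            scores.append(0.0)
            continue
        scores.append(abs(np.corrcoef(a, labels)[0, 1]))
    return np.argsort(scores)[::-1]  # units ranked by dependency strength

rng = np.random.default_rng(1)
tokens = ["the", "cat", "sat", "on", "the", "table", "near", "dog"]
acts = rng.normal(size=(len(tokens), 4))
acts[:, 2] += 3 * np.array([is_noun(t) for t in tokens])  # unit 2 "detects" nouns
print(inspect(acts, tokens, is_noun))  # unit 2 should rank first
```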

Proceedings Article•DOI•
Pei Wang, Yongjun He, Ryan Shea, Jiannan Wang, Eugene Wu
27 May 2018
TL;DR: This work builds Deeper, a data enrichment system powered by the deep web that uses resources proportional to the size of the local database of interest, and identifies crawling a hidden database as a key challenge.
Abstract: Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or time-intensive by fully crawling a web-based data source. In this work, we explore a more targeted alternative that uses resources (in terms of web API calls) proportional to the size of the local database of interest. We build Deeper, a data enrichment system powered by the deep web. The goal of Deeper is to help data scientists to link a local database to a hidden database so that they can easily enrich the local database with the attributes from the hidden database. We find that a challenging problem is how to crawl a hidden database. This is different from a typical deep web crawling problem, whose goal is to crawl the entire hidden database rather than only the content relating to the data enrichment task. We demonstrate the limitations of straightforward solutions and propose an effective new crawling strategy. We also present the Deeper system architecture and discuss how to implement each component. During the demo, we will use Deeper to enrich a publication database and aim to show that (1) Deeper is an end-to-end data enrichment solution, and (2) the proposed crawling strategy is superior to the straightforward ones.
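
A minimal sketch of the targeted-crawling idea, with a mocked API standing in for a real hidden database (the paper's actual strategy is more sophisticated): issue one query per local record, so that API-call cost grows with the local table rather than with the hidden database.

```python
# Hypothetical targeted enrichment: one deep-web API call per local row,
# joining the first match's attributes into the local record.

def mock_hidden_db_api(keyword, limit=5):
    """Stand-in for a deep-web search API (e.g., a publication search)."""
    hidden = [{"title": "Smoke: Fine-grained Lineage", "venue": "VLDB"},
              {"title": "DeepBase: Deep Inspection", "venue": "SIGMOD"}]
    return [r for r in hidden if keyword.lower() in r["title"].lower()][:limit]

def enrich(local_rows, key="title"):
    enriched = []
    for row in local_rows:
        matches = mock_hidden_db_api(row[key])  # one API call per local row
        extra = matches[0] if matches else {}
        enriched.append({**row, **extra})       # join in the new attributes
    return enriched

local = [{"title": "Smoke"}, {"title": "DeepBase"}]
print(enrich(local))  # each row gains the hidden database's `venue` column
```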


Posted Content•
TL;DR: A simple UX design, interaction snapshots, allows users to interact concurrently while the snapshots load; for latencies up to 5 seconds, participants were able to complete extrema, threshold, and trend identification tasks with little negative impact.
Abstract: Latency is, unfortunately, a reality when working with large datasets. Guaranteeing imperceptible latency for interactivity is often prohibitively expensive: the application developer may be forced to migrate data processing engines, deal with complex error bounds on samples, or limit the application to users with high network bandwidth. Instead of relying on the backend, we propose a simple UX design: interaction snapshots. Responses to interaction requests are loaded asynchronously in "snapshots". With interaction snapshots, users can interact concurrently while the snapshots load. Our user study participants found it useful not to have to wait for each result and could easily navigate to prior snapshots. For latencies up to 5 seconds, participants were able to complete extrema, threshold, and trend identification tasks with little negative impact.
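
A small sketch, with invented names, of the snapshot mechanic itself: each response is appended to a navigable history rather than blocking the UI, so users can keep interacting and revisit prior snapshots.

```python
# Hypothetical snapshot history: responses append to a list the user can
# navigate, instead of overwriting a single blocking view.

class SnapshotHistory:
    def __init__(self):
        self.snapshots, self.cursor = [], -1

    def on_response(self, request, result):
        self.snapshots.append((request, result))
        self.cursor = len(self.snapshots) - 1  # jump to the newest snapshot

    def back(self):
        self.cursor = max(0, self.cursor - 1)  # revisit a prior snapshot
        return self.snapshots[self.cursor]

    def current(self):
        return self.snapshots[self.cursor]

h = SnapshotHistory()
h.on_response("brush [0,10]", "histogram A")
h.on_response("brush [5,15]", "histogram B")
assert h.back() == ("brush [0,10]", "histogram A")
```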

Posted Content•
TL;DR: A corpus of Twitch streamer popularity measures and a set of community-defined behavioral norms are collected, and it is found that studying the popularity and success of content creators over the long term is a promising and rich research area.
Abstract: Live video-streaming platforms such as Twitch enable top content creators to reap significant profits and influence. To that end, various behavioral norms are recommended to new entrants and to those seeking to increase their popularity and success. Chief among them are to simply put in the effort and to promote oneself on social media outlets such as Twitter, Instagram, and the like. But does following these behaviors indeed have a relationship with eventual popularity? In this paper, we collect a corpus of Twitch streamer popularity measures, spanning social and financial measures, along with their behavior data on Twitch and third-party platforms. We also compile a set of community-defined behavioral norms. We then perform temporal analysis to identify the increased predictive value that a streamer's future behavior contributes to predicting future popularity. At the population level, we find that behavioral information improves the prediction of relative growth that exceeds the median streamer. At the individual level, we find that although it is difficult to quickly become successful in absolute terms, streamers who put in considerable effort are more successful than the rest, and that creating social media accounts to promote oneself is effective irrespective of when the accounts are created. Ultimately, we find that studying the popularity and success of content creators over the long term is a promising and rich research area.