Showing papers by "Eugene Wu" published in 2018


Journal Article•DOI•
01 Aug 2018
TL;DR: This paper reviews the WebTables project, places it in the broader context of the decade of work that followed, and shows how the progress over the past ten years sets up an exciting agenda for the future.
Abstract: In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activities around the WebTables project itself, as well as the broad topic of informal online structured data. In this paper, we will review the WebTables project, and try to place it in the broader context of the decade of work that followed. We will also show how the progress over the past ten years sets up an exciting agenda for the future, and will draw upon many corners of the data management community.

54 citations


Proceedings Article•DOI•
01 Feb 2018
TL;DR: This paper describes Smoke, an in-memory database engine that provides both fast lineage capture and lineage query processing; it tightly integrates the lineage capture logic into physical database operators, stores lineage in efficient lineage representations, and employs optimizations when future lineage queries are known up-front.
Abstract: Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, long-standing problem that lineage systems need to address, and the main focus of this paper, is to quickly capture lineage across a workflow in order to speed up future queries over lineage. Current lineage systems, however, either incur high lineage capture overheads, high lineage query processing costs, or both. In response, developers resort to manual implementations of applications that, in principle, can be expressed and optimized in lineage terms. This paper describes Smoke, an in-memory database engine that provides both fast lineage capture and lineage query processing. To do so, Smoke tightly integrates the lineage capture logic into physical database operators; stores lineage in efficient lineage representations; and employs optimizations if future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and lineage query costs by multiple orders of magnitude as compared to state-of-the-art alternatives. On real-world applications, we show that Smoke meets the latency requirements of interactive visualizations (e.g., <150ms) and outperforms hand-written implementations of data profiling primitives.
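
To make the capture-inside-operators idea concrete, here is a minimal sketch, not Smoke's actual code, of a selection operator that records row-level lineage inline as it executes, so that later backward lineage queries reduce to cheap index lookups. All names are invented for illustration.

```python
# Hypothetical sketch (not Smoke's actual code): a selection operator that
# captures lineage inline. For each output row it records the index of the
# input row that produced it, so backward lineage queries become array
# lookups instead of re-execution.

def select_with_lineage(rows, predicate):
    output, lineage = [], []
    for idx, row in enumerate(rows):
        if predicate(row):
            output.append(row)
            lineage.append(idx)  # capture happens during normal execution
    return output, lineage

rows = [{"city": "NYC", "sales": 10}, {"city": "SF", "sales": 7},
        {"city": "NYC", "sales": 3}]
out, lin = select_with_lineage(rows, lambda r: r["city"] == "NYC")
# Backward lineage: output row 1 came from input row 2.
assert lin[1] == 2
```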

45 citations


Proceedings Article•DOI•
10 Jun 2018
TL;DR: In this article, the authors highlight the connections between data provenance and interactive visualizations, and describe how an interactive visualization system that natively supports provenance can be easily extended with novel interactions.
Abstract: We highlight the connections between data provenance and interactive visualizations. To do so, we first incrementally add interactions to a visualization and show how these interactions are readily expressible in terms of provenance. We then describe how an interactive visualization system that natively supports provenance can be easily extended with novel interactions.
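
As an illustration of the paper's premise, the hypothetical sketch below expresses a tooltip (details-on-demand) interaction as a backward provenance query over a grouped aggregate; the function names are invented.

```python
# Hypothetical sketch: a tooltip over a bar in a SUM(sales) GROUP BY city
# chart is just the backward provenance of that bar, i.e., the base rows
# that contributed to the hovered aggregate.

from collections import defaultdict

def group_sum_with_provenance(rows, key, val):
    agg, prov = defaultdict(float), defaultdict(list)
    for i, r in enumerate(rows):
        agg[r[key]] += r[val]
        prov[r[key]].append(i)  # remember which rows fed each group
    return dict(agg), dict(prov)

rows = [{"city": "NYC", "sales": 10}, {"city": "SF", "sales": 7},
        {"city": "NYC", "sales": 3}]
bars, prov = group_sum_with_provenance(rows, "city", "sales")

def tooltip(bar_key):
    # the interaction, expressed as a provenance query
    return [rows[i] for i in prov[bar_key]]

print(tooltip("NYC"))  # the two NYC rows behind the hovered bar
```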

13 citations


Posted Content•
TL;DR: Smoke is introduced, an in-memory database engine in which neither lineage capture overhead nor lineage query processing needs to be compromised; it can meet the latency requirements of interactive visualizations and outperform hand-written implementations of data profiling primitives.
Abstract: Data lineage describes the relationship between individual input and output data items of a workflow, and has served as an integral ingredient for both traditional (e.g., debugging, auditing, data integration, and security) and emergent (e.g., interactive visualizations, iterative analytics, explanations, and cleaning) applications. The core, long-standing problem that lineage systems need to address, and the main focus of this paper, is to capture the relationships between input and output data items across a workflow with the goal of streamlining queries over lineage. Unfortunately, current lineage systems either incur high lineage capture overheads, high lineage query processing costs, or both. As a result, applications that, in principle, can express their logic declaratively in lineage terms resort to hand-tuned implementations. In response, we introduce Smoke, an in-memory database engine in which neither lineage capture overhead nor lineage query processing needs to be compromised. To do so, Smoke introduces tight integration of the lineage capture logic into physical database operators; efficient, write-optimized lineage representations for storage; and optimizations when future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and streamlines lineage queries by multiple orders of magnitude compared to state-of-the-art alternatives. Our experiments on real-world applications highlight that Smoke can meet the latency requirements of interactive visualizations (e.g., <150ms) and outperform hand-written implementations of data profiling primitives.

12 citations


Posted Content•
TL;DR: It is observed that traditional asynchronous interfaces, where results update in place, induce users to wait for each result before interacting, forgoing the benefits of asynchronous rendering; when results are instead rendered cumulatively over the recent history, users interact asynchronously and achieve faster task completion times.
Abstract: Asynchronous interfaces allow users to concurrently issue requests while existing ones are processed. While such interfaces are widely used to support non-blocking input in the presence of latency, it is not clear whether people can make use of asynchrony while the data is updating, since the UI changes dynamically and the updates can be hard to interpret. Interactive data visualization presents an interesting context for studying the effects of asynchronous interfaces, since interactions are frequent, task latencies can vary widely, and results often require interpretation. In this paper, we study the effects of introducing asynchrony into interactive visualizations, under different latencies and with different tasks. We observe that traditional asynchronous interfaces, where results update in place, induce users to wait for the result before interacting, not taking advantage of the asynchronous rendering of the results. However, when results are rendered cumulatively over the recent history, users perform asynchronous interactions and achieve faster task completion times.
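
The sketch below, with invented names, contrasts the two rendering policies studied: in-place updates that overwrite the display versus cumulative rendering over a window of recent results, which lets the user issue requests without waiting on earlier responses.

```python
# Hypothetical sketch of the two rendering policies. In-place rendering
# overwrites the display with each response; cumulative rendering keeps a
# window of recent results visible, so there is no pressure to wait.

import asyncio, random

async def handle_request(query):
    await asyncio.sleep(random.uniform(0.1, 0.5))  # simulated backend latency
    return f"result({query})"

async def session(queries, cumulative=True, history=3):
    display = []
    async def issue(q):
        res = await handle_request(q)
        if cumulative:
            display.append(res)       # recent results stay visible
            del display[:-history]    # keep only a short window
        else:
            display[:] = [res]        # in place: each result overwrites
        print(display)
    # the user issues every request without blocking on earlier responses
    await asyncio.gather(*(issue(q) for q in queries))

asyncio.run(session(["q1", "q2", "q3", "q4"]))
```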

10 citations


Posted Content•
TL;DR: It is shown that PAE correlates with user-perceived chart complexity and that increased chart PAE correlates with reduced judgment accuracy; moreover, the correlation between PAE values and participants' judgments strengthens when the user has less time to examine the line charts.
Abstract: When inspecting information visualizations under time-critical settings, such as emergency response or monitoring the heart rate in a surgery room, the user only has a small amount of time to view the visualization "at a glance". In these settings, it is important to provide a quantitative measure of the visualization to understand whether or not the visualization is too "complex" to accurately judge at a glance. This paper proposes Pixel Approximate Entropy (PAE), which adapts the approximate entropy statistical measure, commonly used to quantify regularity and unpredictability in time-series data, as a measure of visual complexity for line charts. We show that PAE is correlated with user-perceived chart complexity, and that increased chart PAE correlates with reduced judgment accuracy. We also find that the correlation between PAE values and participants' judgments increases when the user has less time to examine the line charts.
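
For reference, here is a minimal approximate-entropy implementation; PAE as described applies this statistic to the pixel sequence of a rendered line chart. The parameter choices (m=2, r=0.2·std) are common ApEn defaults, not necessarily the paper's.

```python
# Minimal approximate entropy: ApEn(m, r) = phi(m) - phi(m + 1), where
# phi(k) averages the log-fraction of length-k windows within Chebyshev
# distance r of each window. Regular series yield low ApEn; noisy ones high.

import math, random

def _std(xs):
    mu = sum(xs) / len(xs)
    return (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5

def approx_entropy(series, m=2, r=None):
    n = len(series)
    r = 0.2 * _std(series) if r is None else r
    def phi(k):
        windows = [series[i:i + k] for i in range(n - k + 1)]
        fracs = []
        for w in windows:
            # count windows within distance r (self-match included, so > 0)
            c = sum(max(abs(a - b) for a, b in zip(w, v)) <= r for v in windows)
            fracs.append(c / len(windows))
        return sum(math.log(f) for f in fracs) / len(windows)
    return phi(m) - phi(m + 1)

random.seed(0)
regular = [i % 2 for i in range(60)]          # predictable: low ApEn
noisy = [random.random() for _ in range(60)]  # irregular: higher ApEn
print(approx_entropy(regular), approx_entropy(noisy))
```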

7 citations


Proceedings Article•DOI•
27 May 2018
TL;DR: Precision Interfaces is a semi-automatic system that generates task-specific data analytics interfaces by turning a log of executed programs into an interface; this paper focuses on SQL query logs, but the approach generalizes to other languages.
Abstract: Building interactive tools to support data analysis is hard because it is not always clear what to build and how to build it. To address this problem, we present Precision Interfaces, a semi-automatic system to generate task-specific data analytics interfaces. Precision Interfaces can turn a log of executed programs into an interface by identifying micro-variations between the programs and mapping them to interface components. This paper focuses on SQL query logs, but we can generalize the approach to other languages. Our system operates in two steps: it first builds an interaction graph, which describes how the queries can be transformed into each other. Then, it finds a set of UI components that covers a maximal number of transformations. To restrict the domain of changes to be detected, our system uses a domain-specific language, PILang. We describe each of the system's components, showcase an early prototype on real program logs, and discuss future research opportunities. This demonstration highlights the potential for data-driven interactive interface mining from query logs. We will first walk participants through the process that Precision Interfaces goes through to generate interactive analysis interfaces from query logs. We will then show the versatility of Precision Interfaces by letting participants choose from multiple different interface modalities, interaction designs, and query logs to generate 2D point-and-click, gestural, and even natural language analysis interfaces for commonly performed analyses.
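
A hypothetical sketch of the core step, not the actual PILang DSL: detect a micro-variation between consecutive logged SQL queries and, when they differ in exactly one numeric literal, map that variation to a slider component.

```python
# Illustrative micro-variation detection: if two logged queries share the
# same template and differ in one numeric literal, that literal is a good
# candidate for a slider widget in the generated interface.

import re

NUM = r"\b\d+(?:\.\d+)?\b"

def diff_queries(q1, q2):
    """Return the (old, new) values if the queries differ in exactly one
    numeric literal, else None (a structural change)."""
    if re.sub(NUM, "?", q1) != re.sub(NUM, "?", q2):
        return None  # query templates differ; not a slider-style variation
    v1 = [float(m) for m in re.findall(NUM, q1)]
    v2 = [float(m) for m in re.findall(NUM, q2)]
    diffs = [(a, b) for a, b in zip(v1, v2) if a != b]
    return diffs[0] if len(diffs) == 1 else None

log = ["SELECT * FROM sales WHERE price < 10",
       "SELECT * FROM sales WHERE price < 20"]
change = diff_queries(*log)
if change:  # one varying literal -> expose it as a slider in the interface
    print(f"slider over price threshold, range {min(change)}..{max(change)}")
```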

7 citations


Posted Content•
TL;DR: It is described how an interactive visualization system that natively supports provenance can be easily extended with novel interactions, and how these interactions are readily expressible in terms of provenance.
Abstract: We highlight the connections between data provenance and interactive visualizations. To do so, we first incrementally add interactions to a visualization and show how these interactions are readily expressible in terms of provenance. We then describe how an interactive visualization system that natively supports provenance can be easily extended with novel interactions.

6 citations


Proceedings Article•DOI•
27 May 2018
TL;DR: This demonstration showcases lineage as the building block across a variety of data-intensive applications, including tooltips and details on demand; crossfilter; and data profiling, and shows how Smoke outperforms alternative lineage systems and meets or improves on existing hand-tuned implementations of these applications.
Abstract: Data lineage is a fundamental type of information that describes the relationships between input and output data items in a workflow. As such, an immense number of data-intensive applications with logic over the input-output relationships can be expressed declaratively in lineage terms. Unfortunately, many applications resort to hand-tuned implementations, either because lineage systems are not fast enough to meet their requirements or because developers are unaware of the lineage capabilities. Recently, we introduced a set of implementation design principles and associated techniques to optimize lineage-enabled database engines and realized them in our prototype database engine, namely, Smoke. In this demonstration, we showcase lineage as the building block across a variety of data-intensive applications, including tooltips and details on demand; crossfilter; and data profiling. In addition, we show how Smoke outperforms alternative lineage systems to meet or improve on existing hand-tuned implementations of these applications.

6 citations


Proceedings Article•
01 Jan 2018
TL;DR: In this paper, a perturbation-based explanation method for tree-ensembles is proposed to identify writing features that, if changed, will most improve the text quality.
Abstract: User-generated, multi-paragraph writing is pervasive and important on many social media platforms (e.g., Amazon reviews, Airbnb host profiles). Ensuring high-quality content is important. Unfortunately, content submitted by users is often not of high quality. Moreover, the characteristics that constitute high quality may even vary between domains in ways that users are unaware of. Automated writing feedback has the potential to immediately point out and suggest improvements during the writing process. Most approaches, however, focus on syntax/phrasing, which is only one characteristic of high-quality content. Existing research develops accurate quality prediction models. We propose combining these models with model explanation techniques to identify writing features that, if changed, will most improve the text quality. To this end, we develop a perturbation-based explanation method for a popular class of models called tree ensembles. Furthermore, we use a weak-supervision technique to adapt this method to generate feedback for specific text segments in addition to feedback for the entire document. Our user study finds that the perturbation-based approach, when combined with segment-specific feedback, can help improve writing quality on Amazon (review helpfulness) and Airbnb (host profile trustworthiness) by >14% (a 3X improvement over recent automated feedback techniques).
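
A hedged sketch of the perturbation idea (the paper's tree-ensemble method is more involved): perturb each writing feature of a document and rank features by how much the perturbation raises the model's predicted quality. The features and model here are toy stand-ins.

```python
# Toy stand-ins for writing features and a trained quality model; the
# perturbation loop below is the illustrated idea, not the paper's exact
# method.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# hypothetical features: [word_count, readability, num_typos]
X = rng.normal(size=(200, 3))
y = X[:, 0] + 2 * X[:, 1] - 3 * X[:, 2] + rng.normal(scale=0.1, size=200)
model = GradientBoostingRegressor().fit(X, y)

def feedback(doc, deltas=(-1.0, 1.0)):
    """Rank features by the largest quality gain any perturbation yields."""
    base = model.predict([doc])[0]
    gains = []
    for j in range(len(doc)):
        best = -np.inf
        for d in deltas:
            perturbed = doc.copy()
            perturbed[j] += d                      # nudge one feature
            best = max(best, model.predict([perturbed])[0] - base)
        gains.append((best, j))
    return sorted(gains, reverse=True)  # most improvable feature first

print(feedback(X[0]))  # e.g., lowering num_typos should rank highly
```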

5 citations


Posted Content•
TL;DR: DeepBase is described, a system that inspects neural network behaviors through a unified interface; it models logic with user-provided hypothesis functions that annotate the data with high-level labels, and lets users quickly identify individual units or groups of units that have strong statistical dependencies with desired hypotheses.
Abstract: Although deep learning models perform remarkably well across a range of tasks such as language translation and object recognition, it remains unclear what high-level logic, if any, they follow. Understanding this logic may lead to more transparency, better model design, and faster experimentation. Recent machine learning research has leveraged statistical methods to identify hidden units that behave (e.g., activate) similarly to human-understandable logic, but those analyses require considerable manual effort. Our insight is that many of those studies follow a common analysis pattern, which we term Deep Neural Inspection. There is an opportunity to provide a declarative abstraction to easily express, execute, and optimize such analyses. This paper describes DeepBase, a system to inspect neural network behaviors through a unified interface. We model logic with user-provided hypothesis functions that annotate the data with high-level labels (e.g., part-of-speech tags, image captions). DeepBase lets users quickly identify individual or groups of units that have strong statistical dependencies with desired hypotheses. We discuss how DeepBase can express existing analyses, propose a set of simple and effective optimizations to speed up a standard Python implementation by up to 72x, and reproduce recent studies from the NLP literature.
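
The sketch below illustrates the Deep Neural Inspection pattern as described, not DeepBase's actual API: score each hidden unit by the statistical dependency between its activations and a user-provided hypothesis function's labels, here via absolute Pearson correlation. All data and names are invented.

```python
# Hypothetical Deep Neural Inspection loop: rank hidden units by how
# strongly their per-token activations track a hypothesis function.

import numpy as np

def is_noun(token):  # a toy hypothesis function
    return 1.0 if token in {"cat", "dog", "table"} else 0.0

def inspect(activations, tokens, hypothesis):
    """activations: (num_tokens, num_units) array of hidden activations."""
    labels = np.array([hypothesis(t) for t in tokens])
    scores = []
    for u in range(activations.shape[1]):
        a = activations[:, u]
        if a.std() == 0 or labels.std() == 0:
            scores.append(0.0)
            continue
        scores.append(abs(np.corrcoef(a, labels)[0, 1]))
    return np.argsort(scores)[::-1]  # units ranked by dependency strength

rng = np.random.default_rng(1)
tokens = ["the", "cat", "sat", "on", "the", "table", "near", "dog"]
acts = rng.normal(size=(len(tokens), 4))
acts[:, 2] += 3 * np.array([is_noun(t) for t in tokens])  # unit 2 "detects" nouns
print(inspect(acts, tokens, is_noun))  # unit 2 should rank first
```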

Proceedings Article•DOI•
Pei Wang, Yongjun He, Ryan Shea, Jiannan Wang, Eugene Wu
27 May 2018
TL;DR: This work builds Deeper, a data enrichment system powered by the deep web that uses resources proportional to the size of the local database of interest, and identifies crawling a hidden database as a key challenge.
Abstract: Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or time-intensive by fully crawling a web-based data source. In this work, we explore a more targeted alternative that uses resources (in terms of web API calls) proportional to the size of the local database of interest. We build Deeper, a data enrichment system powered by the deep web. The goal of Deeper is to help data scientists to link a local database to a hidden database so that they can easily enrich the local database with the attributes from the hidden database. We find that a challenging problem is how to crawl a hidden database. This is different from a typical deep web crawling problem, whose goal is to crawl the entire hidden database rather than only the content relating to the data enrichment task. We demonstrate the limitations of straightforward solutions and propose an effective new crawling strategy. We also present the Deeper system architecture and discuss how to implement each component. During the demo, we will use Deeper to enrich a publication database and aim to show that (1) Deeper is an end-to-end data enrichment solution, and (2) the proposed crawling strategy is superior to the straightforward ones.
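
A minimal sketch of the targeted-crawling idea, with a mocked API standing in for a real hidden database (the paper's actual strategy is more sophisticated): issue one query per local record, so that API-call cost grows with the local table rather than with the hidden database.

```python
# Hypothetical targeted enrichment: one deep-web API call per local row,
# joining the first match's attributes into the local record.

def mock_hidden_db_api(keyword, limit=5):
    """Stand-in for a deep-web search API (e.g., a publication search)."""
    hidden = [{"title": "Smoke: Fine-grained Lineage", "venue": "VLDB"},
              {"title": "DeepBase: Deep Inspection", "venue": "SIGMOD"}]
    return [r for r in hidden if keyword.lower() in r["title"].lower()][:limit]

def enrich(local_rows, key="title"):
    enriched = []
    for row in local_rows:
        matches = mock_hidden_db_api(row[key])  # one API call per local row
        extra = matches[0] if matches else {}
        enriched.append({**row, **extra})       # join in the new attributes
    return enriched

local = [{"title": "Smoke"}, {"title": "DeepBase"}]
print(enrich(local))  # each row gains the hidden database's `venue` column
```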


Posted Content•
TL;DR: A simple UX design, interaction snapshots, allows users to interact concurrently while the snapshots load; for latencies up to 5 seconds, participants were able to complete extrema, threshold, and trend identification tasks with little negative impact.
Abstract: Latency is, unfortunately, a reality when working with large datasets. Guaranteeing imperceptible latency for interactivity is often prohibitively expensive: the application developer may be forced to migrate data processing engines, deal with complex error bounds on samples, or limit the application to users with high network bandwidth. Instead of relying on the backend, we propose a simple UX design: interaction snapshots. Responses to interaction requests are loaded asynchronously in "snapshots". With interaction snapshots, users can interact concurrently while the snapshots load. Our user study participants found it useful not to have to wait for each result and could easily navigate to prior snapshots. For latencies up to 5 seconds, participants were able to complete extrema, threshold, and trend identification tasks with little negative impact.
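
A small sketch, with invented names, of the snapshot mechanic itself: each response is appended to a navigable history rather than blocking the UI, so users can keep interacting and revisit prior snapshots.

```python
# Hypothetical snapshot history: responses append to a list the user can
# navigate, instead of overwriting a single blocking view.

class SnapshotHistory:
    def __init__(self):
        self.snapshots, self.cursor = [], -1

    def on_response(self, request, result):
        self.snapshots.append((request, result))
        self.cursor = len(self.snapshots) - 1  # jump to the newest snapshot

    def back(self):
        self.cursor = max(0, self.cursor - 1)  # revisit a prior snapshot
        return self.snapshots[self.cursor]

    def current(self):
        return self.snapshots[self.cursor]

h = SnapshotHistory()
h.on_response("brush [0,10]", "histogram A")
h.on_response("brush [5,15]", "histogram B")
assert h.back() == ("brush [0,10]", "histogram A")
```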

Posted Content•
TL;DR: A corpus of Twitch streamer popularity measures and a set of community-defined behavioral norms are collected, and it is found that studying the popularity and success of content creators over the long term is a promising and rich research area.
Abstract: Live video-streaming platforms such as Twitch enable top content creators to reap significant profits and influence. To that end, various behavioral norms are recommended to new entrants and to those seeking to increase their popularity and success. Chief among them are to simply put in the effort and to promote oneself on social media outlets such as Twitter, Instagram, and the like. But does following these behaviors indeed have a relationship with eventual popularity? In this paper, we collect a corpus of Twitch streamer popularity measures, spanning social and financial measures, along with their behavior data on Twitch and third-party platforms. We also compile a set of community-defined behavioral norms. We then perform temporal analysis to identify the increased predictive value that a streamer's future behavior contributes to predicting future popularity. At the population level, we find that behavioral information improves the prediction of relative growth that exceeds the median streamer. At the individual level, we find that although it is difficult to quickly become successful in absolute terms, streamers who put in considerable effort are more successful than the rest, and that creating social media accounts to promote oneself is effective irrespective of when the accounts are created. Ultimately, we find that studying the popularity and success of content creators over the long term is a promising and rich research area.