
Showing papers by "Eugene Wu" published in 2015


Proceedings ArticleDOI
04 Nov 2015
TL;DR: A human-in-the-loop synthesis technique is developed which uses syntactic and data-driven steps to parse these sensor tags into a common namespace, which can enable portable building applications.
Abstract: Commercial buildings consume nearly 19% of delivered energy in the U.S., nearly half (42%) of which is consumed in buildings with digital control systems comprising wired sensor networks. These sensors have scant metadata and are represented by "tags" which are obscure, building-specific, and not machine-parseable. We develop a human-in-the-loop synthesis technique which uses syntactic and data-driven steps to parse these sensor tags into a common namespace, which can enable portable building applications. We show that our technique allows an expert to fully parse a large fraction (~70%) of the tags with 24, 15 and 43 examples for three large commercial buildings comprising 1586, 2522 and 1865 sensors respectively, and to deploy three portable applications on two buildings with less than 30 examples.
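The syntactic step of such tag parsing can be illustrated with a minimal sketch: split an obscure tag into tokens and translate each known abbreviation into a common namespace using an expert-provided mapping. The tag format and the SYNONYMS dictionary below are hypothetical illustrations, not the paper's actual data.

```python
import re

# Hypothetical expert-provided mapping from building-specific
# abbreviations to a common namespace; in the paper, an expert
# supplies a handful of labeled examples per building.
SYNONYMS = {
    "ZNT": "zone_temperature",
    "SUPFLOW": "supply_air_flow",
    "AHU": "air_handling_unit",
}

def parse_tag(tag):
    """Syntactic step: split a tag like 'BLDG1.AHU2.ZNT' into tokens
    and translate each known token into the common namespace;
    unknown tokens are left as-is for the expert to label."""
    tokens = re.split(r"[._\-]", tag)
    parsed = []
    for tok in tokens:
        key = re.sub(r"\d+$", "", tok)  # strip trailing instance numbers
        parsed.append(SYNONYMS.get(key.upper(), tok))
    return parsed
```

For example, `parse_tag("BLDG1.AHU2.ZNT")` maps the second and third tokens into the common namespace while leaving the unrecognized building prefix untouched; the data-driven step in the paper would then help resolve such leftovers.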

69 citations


Journal ArticleDOI
01 Dec 2015
TL;DR: This paper introduces CLAMShell, a system that speeds up crowds in order to achieve consistently low-latency data labeling and offers a taxonomy of the sources of labeling latency, and comprehensively tackles each source of latency.
Abstract: Data labeling is a necessary but often slow process that impedes the development of interactive systems for modern data analysis. Despite rising demand for manual data labeling, there is a surprising lack of work addressing its high and unpredictable latency. In this paper, we introduce CLAMShell, a system that speeds up crowds in order to achieve consistently low-latency data labeling. We offer a taxonomy of the sources of labeling latency and study several large crowd-sourced labeling deployments to understand their empirical latency profiles. Driven by these insights, we comprehensively tackle each source of latency, both by developing novel techniques such as straggler mitigation and pool maintenance and by optimizing existing methods such as crowd retainer pools and active learning. We evaluate CLAMShell in simulation and on live workers on Amazon's Mechanical Turk, demonstrating that our techniques can provide an order of magnitude speedup and variance reduction over existing crowdsourced labeling strategies.
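The straggler-mitigation idea can be sketched as follows: send the same labeling task to several workers at once and take the first response, so one slow worker cannot stall the batch. Worker latencies here are simulated stand-ins; CLAMShell runs the tasks on live crowd workers.

```python
import random

def label_with_straggler_mitigation(task, workers, redundancy=3):
    """Straggler mitigation (sketched): dispatch the same labeling
    task to `redundancy` workers and accept the fastest answer.
    Each simulated worker has a fixed latency and a labeling
    function; the effective latency is the minimum, not the
    maximum, over the redundant assignments."""
    responses = []
    for w in random.sample(workers, redundancy):
        latency = w["latency"]        # simulated response time
        label = w["label_fn"](task)
        responses.append((latency, label))
    return min(responses, key=lambda r: r[0])
```

Without redundancy, the batch latency is the slowest worker's; with it, the expected latency drops to the fastest of the sampled workers, which is the variance reduction the abstract refers to.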

62 citations


Journal ArticleDOI
01 Aug 2015
TL;DR: Conference attendees will be able to use the DataHub notebook - an IPython-based notebook for analyzing data and storing the results of data analysis, with DataHub as the common data store.
Abstract: While there have been many solutions proposed for storing and analyzing large volumes of data, all of these solutions have limited support for collaborative data analytics, especially given that many individuals and teams are simultaneously analyzing, modifying, and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native versioning capabilities: multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) an app ecosystem that hosts apps for various data-processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) thrift-based data serialization permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook, an IPython-based notebook for analyzing data and storing the results of data analysis.
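The versioning-with-conflict-inspection capability can be sketched with a toy version graph: commits apply changes on top of a parent version, and divergent versions are compared to surface conflicts rather than silently merging them. This is an illustrative model only, not DataHub's actual storage layer.

```python
class VersionedDataset:
    """Toy dataset version graph: each version is a snapshot
    (dict of key -> value) with a pointer to its parent."""

    def __init__(self):
        self.versions = {0: {}}   # version id -> snapshot
        self.parents = {0: None}
        self.next_id = 1

    def commit(self, parent, changes):
        """Create a new version by applying {key: value} changes
        on top of an existing version; returns the new version id."""
        snapshot = dict(self.versions[parent])
        snapshot.update(changes)
        vid = self.next_id
        self.next_id += 1
        self.versions[vid] = snapshot
        self.parents[vid] = parent
        return vid

    def conflicts(self, v1, v2):
        """Keys that two divergent versions set to different values;
        a system like DataHub surfaces these for users to inspect."""
        a, b = self.versions[v1], self.versions[v2]
        return sorted(k for k in set(a) & set(b) if a[k] != b[k])
```

Two attendees branching from the same base version and editing the same key would see that key reported as a conflict, matching the "browse the different versions and inspect conflicts" scenario above.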

57 citations


Journal ArticleDOI
01 Aug 2015
TL;DR: Wisteria is presented, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd, and driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation.
Abstract: Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine the analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to the full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations and, driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We give an overview of the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
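The separation of logical operations from physical implementations can be sketched with a toy optimizer: one logical operation (e.g., deduplication) maps to several physical implementations with different cost/quality trade-offs, and the system suggests the cheapest one meeting the analyst's requirement. The operator names and cost/quality numbers below are illustrative, not Wisteria's actual catalog or API.

```python
# Hypothetical catalog: logical op -> physical implementations,
# each with an estimated cost and output quality. In Wisteria,
# analyst feedback would refine these estimates over time.
PHYSICAL_IMPLS = {
    "deduplicate": {
        "exact_match":     {"cost": 1,   "quality": 0.60},
        "similarity_join": {"cost": 10,  "quality": 0.80},
        "crowdsourced":    {"cost": 100, "quality": 0.95},
    }
}

def suggest_impl(logical_op, min_quality):
    """Suggest the cheapest physical implementation of a logical
    operation that meets the required quality, or None if no
    implementation qualifies."""
    candidates = [
        (props["cost"], name)
        for name, props in PHYSICAL_IMPLS[logical_op].items()
        if props["quality"] >= min_quality
    ]
    return min(candidates)[1] if candidates else None
```

In-flight operator replacement then amounts to re-running this choice as quality estimates change, swapping, say, a similarity join for a crowdsourced pass when the cheaper operator proves insufficient.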

43 citations


Journal Article
TL;DR: The SampleClean project has developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned, and a gradient-descent algorithm is described that extends the key ideas to the increasingly common Machine Learning-based analytics.
Abstract: An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, incorrect, or inconsistent values. In the SampleClean project, we have developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned. Some forms of data corruption, such as duplication, can affect sampling probabilities, and thus, new techniques have to be designed to ensure correctness of the approximate query results. We first describe our initial project on computing statistically bounded estimates of sum, count, and avg queries from samples of cleaned data. We subsequently explored how the same techniques could apply to other problems in database research, namely, materialized view maintenance. To avoid expensive incremental maintenance, we maintain only a sample of rows in a view, and then leverage SampleClean to approximate aggregate query results. Finally, we describe our work on a gradient-descent algorithm that extends the key ideas to the increasingly common Machine Learning-based analytics.
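The key estimation idea can be sketched for a SUM query: duplicated records are over-represented in a uniform sample, so each cleaned value is scaled down by its duplicate count before extrapolating to the full table. This is a simplified sketch of the duplication correction, not the project's full estimator with its statistical bounds.

```python
def estimate_sum(cleaned_sample, population_size):
    """Estimate a SUM query from a cleaned sample (sketched).
    `cleaned_sample` is a list of (cleaned_value, num_duplicates)
    pairs; dividing each value by its duplicate count corrects the
    inflated probability that a duplicated record enters the sample,
    and the mean of the corrected values is scaled to the full
    (dirty) table size."""
    n = len(cleaned_sample)
    corrected = [value / duplicates for value, duplicates in cleaned_sample]
    return population_size * sum(corrected) / n
```

For instance, a record that appears twice in the dirty data contributes only half its value per sampled copy, so in expectation it is counted once, which is what keeps the estimate unbiased despite duplication.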

36 citations


Posted Content
TL;DR: CLAMShell as mentioned in this paper is a system that speeds up crowds in order to achieve consistently low-latency data labeling by developing novel techniques such as straggler mitigation and pool maintenance and optimizing existing methods such as crowd retainer pools and active learning.
Abstract: Data labeling is a necessary but often slow process that impedes the development of interactive systems for modern data analysis. Despite rising demand for manual data labeling, there is a surprising lack of work addressing its high and unpredictable latency. In this paper, we introduce CLAMShell, a system that speeds up crowds in order to achieve consistently low-latency data labeling. We offer a taxonomy of the sources of labeling latency and study several large crowd-sourced labeling deployments to understand their empirical latency profiles. Driven by these insights, we comprehensively tackle each source of latency, both by developing novel techniques such as straggler mitigation and pool maintenance and by optimizing existing methods such as crowd retainer pools and active learning. We evaluate CLAMShell in simulation and on live workers on Amazon's Mechanical Turk, demonstrating that our techniques can provide an order of magnitude speedup and variance reduction over existing crowdsourced labeling strategies.

7 citations


Proceedings Article
Eugene Wu
01 Jan 2015

1 citation