
Showing papers by "Eugene Wu" published in 2015


Proceedings ArticleDOI
04 Nov 2015
TL;DR: A human-in-the-loop synthesis technique is developed which uses syntactic and data-driven steps to parse these sensor tags into a common namespace, which can enable portable building applications.
Abstract: Commercial buildings consume nearly 19% of delivered energy in the U.S., nearly half (42%) of which is consumed in buildings with digital control systems comprising wired sensor networks. These sensors have scant metadata and are represented by "tags" which are obscure, building-specific, and not machine-parseable. We develop a human-in-the-loop synthesis technique which uses syntactic and data-driven steps to parse these sensor tags into a common namespace, which can enable portable building applications. We show that our technique allows an expert to fully parse a large fraction (~70%) of the tags with 24, 15 and 43 examples for three large commercial buildings comprising 1586, 2522 and 1865 sensors respectively, and to deploy three portable applications on two buildings with less than 30 examples.
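The syntactic step of such tag parsing can be illustrated with a minimal sketch: split an obscure tag into tokens and translate each known abbreviation into a common namespace using an expert-provided mapping. The tag format and the SYNONYMS dictionary below are hypothetical illustrations, not the paper's actual data.

```python
import re

# Hypothetical expert-provided mapping from building-specific
# abbreviations to a common namespace; in the paper, an expert
# supplies a handful of labeled examples per building.
SYNONYMS = {
    "ZNT": "zone_temperature",
    "SUPFLOW": "supply_air_flow",
    "AHU": "air_handling_unit",
}

def parse_tag(tag):
    """Syntactic step: split a tag like 'BLDG1.AHU2.ZNT' into tokens
    and translate each known token into the common namespace;
    unknown tokens are left as-is for the expert to label."""
    tokens = re.split(r"[._\-]", tag)
    parsed = []
    for tok in tokens:
        key = re.sub(r"\d+$", "", tok)  # strip trailing instance numbers
        parsed.append(SYNONYMS.get(key.upper(), tok))
    return parsed
```

For example, `parse_tag("BLDG1.AHU2.ZNT")` maps the second and third tokens into the common namespace while leaving the unrecognized building prefix untouched; the data-driven step in the paper would then help resolve such leftovers.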

69 citations


Journal ArticleDOI
01 Dec 2015
TL;DR: This paper introduces CLAMShell, a system that speeds up crowds in order to achieve consistently low-latency data labeling and offers a taxonomy of the sources of labeling latency, and comprehensively tackles each source of latency.
Abstract: Data labeling is a necessary but often slow process that impedes the development of interactive systems for modern data analysis. Despite rising demand for manual data labeling, there is a surprising lack of work addressing its high and unpredictable latency. In this paper, we introduce CLAMShell, a system that speeds up crowds in order to achieve consistently low-latency data labeling. We offer a taxonomy of the sources of labeling latency and study several large crowd-sourced labeling deployments to understand their empirical latency profiles. Driven by these insights, we comprehensively tackle each source of latency, both by developing novel techniques such as straggler mitigation and pool maintenance and by optimizing existing methods such as crowd retainer pools and active learning. We evaluate CLAMShell in simulation and on live workers on Amazon's Mechanical Turk, demonstrating that our techniques can provide an order of magnitude speedup and variance reduction over existing crowdsourced labeling strategies.
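The straggler-mitigation idea can be sketched as follows: send the same labeling task to several workers at once and take the first response, so one slow worker cannot stall the batch. Worker latencies here are simulated stand-ins; CLAMShell runs the tasks on live crowd workers.

```python
import random

def label_with_straggler_mitigation(task, workers, redundancy=3):
    """Straggler mitigation (sketched): dispatch the same labeling
    task to `redundancy` workers and accept the fastest answer.
    Each simulated worker has a fixed latency and a labeling
    function; the effective latency is the minimum, not the
    maximum, over the redundant assignments."""
    responses = []
    for w in random.sample(workers, redundancy):
        latency = w["latency"]        # simulated response time
        label = w["label_fn"](task)
        responses.append((latency, label))
    return min(responses, key=lambda r: r[0])
```

Without redundancy, the batch latency is the slowest worker's; with it, the expected latency drops to the fastest of the sampled workers, which is the variance reduction the abstract refers to.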

62 citations


Journal ArticleDOI
01 Aug 2015
TL;DR: Conference attendees will be able to use the DataHub notebook - an IPython-based notebook for analyzing data and storing the results of data analysis, with DataHub as the common data store.
Abstract: While there have been many solutions proposed for storing and analyzing large volumes of data, all of these solutions have limited support for collaborative data analytics, especially given that many individuals and teams are simultaneously analyzing, modifying, and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native versioning capabilities: multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) an app ecosystem that hosts apps for various data-processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) thrift-based data serialization permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook, an IPython-based notebook for analyzing data and storing the results of data analysis.
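The versioning-with-conflict-inspection capability can be sketched with a toy version graph: commits apply changes on top of a parent version, and divergent versions are compared to surface conflicts rather than silently merging them. This is an illustrative model only, not DataHub's actual storage layer.

```python
class VersionedDataset:
    """Toy dataset version graph: each version is a snapshot
    (dict of key -> value) with a pointer to its parent."""

    def __init__(self):
        self.versions = {0: {}}   # version id -> snapshot
        self.parents = {0: None}
        self.next_id = 1

    def commit(self, parent, changes):
        """Create a new version by applying {key: value} changes
        on top of an existing version; returns the new version id."""
        snapshot = dict(self.versions[parent])
        snapshot.update(changes)
        vid = self.next_id
        self.next_id += 1
        self.versions[vid] = snapshot
        self.parents[vid] = parent
        return vid

    def conflicts(self, v1, v2):
        """Keys that two divergent versions set to different values;
        a system like DataHub surfaces these for users to inspect."""
        a, b = self.versions[v1], self.versions[v2]
        return sorted(k for k in set(a) & set(b) if a[k] != b[k])
```

Two attendees branching from the same base version and editing the same key would see that key reported as a conflict, matching the "browse the different versions and inspect conflicts" scenario above.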

57 citations


Journal ArticleDOI
01 Aug 2015
TL;DR: Wisteria is presented, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd, and driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation.
Abstract: Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine the analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to the full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations and, driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We give an overview of the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
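The separation of logical operations from physical implementations can be sketched with a toy optimizer: one logical operation (e.g., deduplication) maps to several physical implementations with different cost/quality trade-offs, and the system suggests the cheapest one meeting the analyst's requirement. The operator names and cost/quality numbers below are illustrative, not Wisteria's actual catalog or API.

```python
# Hypothetical catalog: logical op -> physical implementations,
# each with an estimated cost and output quality. In Wisteria,
# analyst feedback would refine these estimates over time.
PHYSICAL_IMPLS = {
    "deduplicate": {
        "exact_match":     {"cost": 1,   "quality": 0.60},
        "similarity_join": {"cost": 10,  "quality": 0.80},
        "crowdsourced":    {"cost": 100, "quality": 0.95},
    }
}

def suggest_impl(logical_op, min_quality):
    """Suggest the cheapest physical implementation of a logical
    operation that meets the required quality, or None if no
    implementation qualifies."""
    candidates = [
        (props["cost"], name)
        for name, props in PHYSICAL_IMPLS[logical_op].items()
        if props["quality"] >= min_quality
    ]
    return min(candidates)[1] if candidates else None
```

In-flight operator replacement then amounts to re-running this choice as quality estimates change, swapping, say, a similarity join for a crowdsourced pass when the cheaper operator proves insufficient.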

43 citations


Journal Article
TL;DR: The SampleClean project has developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned, and a gradient-descent algorithm is described that extends the key ideas to the increasingly common Machine Learning-based analytics.
Abstract: An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, incorrect, or inconsistent values. In the SampleClean project, we have developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned. Some forms of data corruption, such as duplication, can affect sampling probabilities, and thus, new techniques have to be designed to ensure correctness of the approximate query results. We first describe our initial project on computing statistically bounded estimates of sum, count, and avg queries from samples of cleaned data. We subsequently explored how the same techniques could apply to other problems in database research, namely, materialized view maintenance. To avoid expensive incremental maintenance, we maintain only a sample of rows in a view, and then leverage SampleClean to approximate aggregate query results. Finally, we describe our work on a gradient-descent algorithm that extends the key ideas to the increasingly common Machine Learning-based analytics.
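The key estimation idea can be sketched for a SUM query: duplicated records are over-represented in a uniform sample, so each cleaned value is scaled down by its duplicate count before extrapolating to the full table. This is a simplified sketch of the duplication correction, not the project's full estimator with its statistical bounds.

```python
def estimate_sum(cleaned_sample, population_size):
    """Estimate a SUM query from a cleaned sample (sketched).
    `cleaned_sample` is a list of (cleaned_value, num_duplicates)
    pairs; dividing each value by its duplicate count corrects the
    inflated probability that a duplicated record enters the sample,
    and the mean of the corrected values is scaled to the full
    (dirty) table size."""
    n = len(cleaned_sample)
    corrected = [value / duplicates for value, duplicates in cleaned_sample]
    return population_size * sum(corrected) / n
```

For instance, a record that appears twice in the dirty data contributes only half its value per sampled copy, so in expectation it is counted once, which is what keeps the estimate unbiased despite duplication.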

36 citations


Posted Content
TL;DR: CLAMShell as mentioned in this paper is a system that speeds up crowds in order to achieve consistently low-latency data labeling by developing novel techniques such as straggler mitigation and pool maintenance and optimizing existing methods such as crowd retainer pools and active learning.
Abstract: Data labeling is a necessary but often slow process that impedes the development of interactive systems for modern data analysis. Despite rising demand for manual data labeling, there is a surprising lack of work addressing its high and unpredictable latency. In this paper, we introduce CLAMShell, a system that speeds up crowds in order to achieve consistently low-latency data labeling. We offer a taxonomy of the sources of labeling latency and study several large crowd-sourced labeling deployments to understand their empirical latency profiles. Driven by these insights, we comprehensively tackle each source of latency, both by developing novel techniques such as straggler mitigation and pool maintenance and by optimizing existing methods such as crowd retainer pools and active learning. We evaluate CLAMShell in simulation and on live workers on Amazon's Mechanical Turk, demonstrating that our techniques can provide an order of magnitude speedup and variance reduction over existing crowdsourced labeling strategies.

7 citations


Proceedings Article
Eugene Wu
01 Jan 2015

1 citation