Top 11 papers published by Yeye He from Microsoft in 2021

Proceedings Article•DOI•

Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples

Peng Li¹, Xiang Cheng¹, Xu Chu¹, Yeye He², Surajit Chaudhuri²

Georgia Institute of Technology¹, Microsoft²

09 Jun 2021

TL;DR: Auto-FuzzyJoin is an unsupervised framework that infers suitable fuzzy-join programs for given input tables, without requiring explicit human input such as labeled training data.

Abstract: Fuzzy similarity join is an important database operator widely used in practice. So far the research community has focused exclusively on optimizing fuzzy-join scalability. However, practitioners today also struggle to optimize fuzzy-join quality, because they face a daunting space of parameters (e.g., distance-functions, distance-thresholds, tokenization-options, etc.), and often have to resort to a manual trial-and-error approach to program these parameters in order to optimize fuzzy-join quality. This key challenge of automatically generating high-quality fuzzy-join programs has received surprisingly little attention thus far. In this work, we study the problem of "auto-program" fuzzy-joins. Leveraging a geometric interpretation of distance-functions, we develop an unsupervised Auto-FuzzyJoin framework that can infer suitable fuzzy-join programs on given input tables, without requiring explicit human input such as labeled training data. Using Auto-FuzzyJoin, users only need to provide two input tables L and R, and a desired precision target τ (say 0.9). Auto-FuzzyJoin leverages the fact that one of the inputs is a reference table to automatically program fuzzy-joins that meet the precision target τ in expectation, while maximizing fuzzy-join recall (defined as the number of correctly joined records). Experiments on both existing benchmarks and a new benchmark with 50 fuzzy-join tasks created from Wikipedia data suggest that the proposed Auto-FuzzyJoin significantly outperforms existing unsupervised approaches, and is surprisingly competitive even against supervised approaches (e.g., Magellan and DeepMatcher) when 50% of ground-truth labels are used as training data. We have released our code and benchmark on GitHub (https://github.com/chu-data-lab/AutomaticFuzzyJoin) to facilitate future research.
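
To make the parameter space mentioned in the abstract concrete, here is a minimal sketch of a hand-programmed fuzzy join with one distance function (Jaccard), two tokenization options, and a distance threshold. It is not the Auto-FuzzyJoin algorithm and does not use the released library's API; the function names (`word_tokens`, `char_trigrams`, `fuzzy_join`) and the sample records are assumptions made for illustration only.

```python
from typing import Callable, List, Set, Tuple


def word_tokens(s: str) -> Set[str]:
    """Tokenization option: lowercase whitespace-separated words."""
    return set(s.lower().split())


def char_trigrams(s: str) -> Set[str]:
    """Tokenization option: lowercase character 3-grams."""
    t = s.lower()
    return {t[i:i + 3] for i in range(max(len(t) - 2, 1))}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Distance function: 1 - |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def fuzzy_join(
    left: List[str],
    right: List[str],
    tokenize: Callable[[str], Set[str]],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Join each right record to its nearest left (reference) record,
    keeping the pair only if the distance is within the threshold."""
    left_tokens = [(l, tokenize(l)) for l in left]
    joined = []
    for r in right:
        r_tok = tokenize(r)
        best, best_d = None, None
        for l, l_tok in left_tokens:
            d = jaccard_distance(l_tok, r_tok)
            if best_d is None or d < best_d:
                best, best_d = l, d
        if best is not None and best_d <= threshold:
            joined.append((best, r, best_d))
    return joined


# Hypothetical sample data: `reference` plays the role of the reference table L.
reference = ["Georgia Institute of Technology", "University of Michigan"]
queries = ["georgia institute of technology atlanta", "Univ. of Chicago"]

strict = fuzzy_join(reference, queries, word_tokens, threshold=0.3)
loose = fuzzy_join(reference, queries, word_tokens, threshold=0.9)
```

With the strict threshold only the Georgia Tech pair is joined; with the loose threshold "Univ. of Chicago" is also joined to "University of Michigan", which is incorrect. Every choice of tokenizer and threshold trades precision against recall in this way, and that is the manual tuning the paper proposes to automate.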

10 citations

Journal Article•DOI•

Demonstration of Panda: a weakly supervised entity matching system

Renzhi Wu¹, Prem Sakala¹, Peng Li¹, Xu Chu¹, Yeye He²

Georgia Institute of Technology¹, Microsoft²

01 Jul 2021

TL;DR: This demonstration presents Panda, a weakly supervised system for entity matching, the problem of identifying tuple pairs in one or more relations that refer to the same real-world entities.

Abstract: Entity matching (EM) refers to the problem of identifying tuple pairs in one or more relations that refer to the same real world entities. Supervised machine learning (ML) approaches, and deep lear...

8 citations

Journal Article•DOI•

Demonstration of Panda: A Weakly Supervised Entity Matching System

Renzhi Wu¹, Prem Sakala¹, Peng Li¹, Xu Chu¹, Yeye He²

Georgia Institute of Technology¹, Microsoft²

21 Jun 2021-arXiv: Databases

TL;DR: Panda is a weakly supervised system specifically designed for entity matching, where labeling functions (LFs) are user-provided programs that can generate large amounts of (somewhat noisy) labels quickly and cheaply, which can then be combined via a labeling model to generate accurate final predictions.

Abstract: Entity matching (EM) refers to the problem of identifying tuple pairs in one or more relations that refer to the same real world entities. Supervised machine learning (ML) approaches, and deep learning based approaches in particular, typically achieve state-of-the-art matching results. However, these approaches require many labeled examples, in the form of matching and non-matching pairs, which are expensive and time-consuming to label. In this paper, we introduce Panda, a weakly supervised system specifically designed for EM. Panda uses the same labeling function abstraction as Snorkel, where labeling functions (LF) are user-provided programs that can generate large amounts of (somewhat noisy) labels quickly and cheaply, which can then be combined via a labeling model to generate accurate final predictions. To support users developing LFs for EM, Panda provides an integrated development environment (IDE) that lives in a modern browser architecture. Panda's IDE facilitates the development, debugging, and life-cycle management of LFs in the context of EM tasks, similar to how IDEs such as Visual Studio or Eclipse excel in general-purpose programming. Panda's IDE includes many novel features purpose-built for EM, such as smart data sampling, a builtin library of EM utility functions, automatically generated LFs, visual debugging of LFs, and finally, an EM-specific labeling model. We show in this demo that Panda IDE can greatly accelerate the development of high-quality EM solutions using weak supervision.
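
The labeling-function abstraction described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the three labeling functions, the record fields (`name`, `zip`), and the simple majority-vote combiner are all hypothetical, and the combiner stands in for (and is much simpler than) the EM-specific labeling model that Panda provides.

```python
from typing import Callable, Dict, List, Tuple

# Label values follow the common Snorkel-style convention.
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

Record = Dict[str, str]
Pair = Tuple[Record, Record]
LabelingFunction = Callable[[Record, Record], int]


def lf_same_name(a: Record, b: Record) -> int:
    """Vote MATCH when the normalized names are identical, else abstain."""
    same = a["name"].strip().lower() == b["name"].strip().lower()
    return MATCH if same else ABSTAIN


def lf_different_zip(a: Record, b: Record) -> int:
    """Vote NON_MATCH when the zip codes differ, else abstain."""
    return NON_MATCH if a["zip"] != b["zip"] else ABSTAIN


def lf_name_token_overlap(a: Record, b: Record) -> int:
    """Vote on the Jaccard overlap of the name tokens."""
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    if not ta or not tb:
        return ABSTAIN
    overlap = len(ta & tb) / len(ta | tb)
    if overlap >= 0.5:
        return MATCH
    if overlap == 0.0:
        return NON_MATCH
    return ABSTAIN


def majority_vote(pair: Pair, lfs: List[LabelingFunction]) -> int:
    """Combine noisy LF votes; abstain when MATCH and NON_MATCH votes tie."""
    votes = [lf(*pair) for lf in lfs]
    matches, non_matches = votes.count(MATCH), votes.count(NON_MATCH)
    if matches == non_matches:
        return ABSTAIN
    return MATCH if matches > non_matches else NON_MATCH


# Hypothetical records for illustration.
a = {"name": "Joe's Pizza", "zip": "10001"}
b = {"name": "joe's pizza ", "zip": "10001"}
c = {"name": "Blue Bottle Coffee", "zip": "94107"}

lfs = [lf_same_name, lf_different_zip, lf_name_token_overlap]
label_ab = majority_vote((a, b), lfs)
label_ac = majority_vote((a, c), lfs)
```

Each function encodes one cheap heuristic and may abstain. The combiner turns several such noisy votes into a single label per pair, so no hand-labeled pairs are needed.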

7 citations

Proceedings Article•DOI•

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Jie Song¹, Yeye He²

University of Michigan¹, Microsoft²

09 Jun 2021

TL;DR: This paper develops a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns" that accurately describe the underlying data domain, minimizing false positives while maximizing the number of data quality issues caught.

Abstract: Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed, and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data, these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns" that accurately describe the underlying data domain, which minimizes false positives while maximizing data quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
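
As a rough illustration of validating string-valued data against an inferred "pattern", the sketch below generalizes each training value into a regular expression and flags new values that match none of the learned patterns. It is a deliberately naive stand-in: the paper's approach is corpus-driven and chooses patterns using data-lake statistics, which this sketch does not attempt. The helper names (`generalize`, `infer_patterns`, `is_valid`) and the sample values are assumptions.

```python
import re
import string
from typing import Iterable, Set

DIGITS = "[0-9]+"
LETTERS = "[A-Za-z]+"


def generalize(value: str) -> str:
    """Turn a string into a regex: runs of ASCII digits become [0-9]+,
    runs of ASCII letters become [A-Za-z]+, other characters stay literal."""
    parts = []
    for ch in value:
        if ch in string.digits:
            token = DIGITS
        elif ch in string.ascii_letters:
            token = LETTERS
        else:
            token = re.escape(ch)
        is_run = token in (DIGITS, LETTERS)
        if is_run and parts and parts[-1] == token:
            continue  # extend the current digit or letter run
        parts.append(token)
    return "".join(parts)


def infer_patterns(values: Iterable[str]) -> Set[str]:
    """Collect the distinct generalized patterns seen in the training column."""
    return {generalize(v) for v in values}


def is_valid(value: str, patterns: Set[str]) -> bool:
    """A new value passes validation if it fully matches any learned pattern."""
    return any(re.fullmatch(p, value) for p in patterns)


# Hypothetical column from an earlier run of a recurring pipeline.
history = ["2021-06-09", "2021-07-01", "2021-03-07"]
patterns = infer_patterns(history)

ok = is_valid("2021-10-01", patterns)
changed_format = is_valid("10/01/2021", patterns)
missing_value = is_valid("N/A", patterns)
```

A fixed generalization level like this one is exactly where false positives come from: too specific a pattern rejects legitimate future values, too general a pattern misses real issues. The abstract's contribution is choosing that level from corpus evidence rather than by hand.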

5 citations

Posted Content•

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Jie Song¹, Yeye He²

University of Michigan¹, Microsoft²

10 Apr 2021-arXiv: Databases

TL;DR: This article develops a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation patterns that accurately describe the underlying data domain, minimizing false positives while maximizing the number of data quality issues caught.

Abstract: Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed, and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data, these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns" that accurately describe the underlying data domain, which minimizes false positives while maximizing data quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.

1 citation

Posted Content•

Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples

Peng Li¹, Xiang Cheng¹, Xu Chu¹, Yeye He², Surajit Chaudhuri²

Georgia Institute of Technology¹, Microsoft²

07 Mar 2021-arXiv: Databases

TL;DR: In this paper, an unsupervised framework is proposed to infer suitable fuzzy-join programs on given input tables, without requiring explicit human input such as labeled training data, by leveraging a geometric interpretation of distance functions.

Abstract: Fuzzy similarity join is an important database operator widely used in practice. So far the research community has focused exclusively on optimizing fuzzy-join scalability. However, practitioners today also struggle to optimize fuzzy-join quality, because they face a daunting space of parameters (e.g., distance-functions, distance-thresholds, tokenization-options, etc.), and often have to resort to a manual trial-and-error approach to program these parameters in order to optimize fuzzy-join quality. This key challenge of automatically generating high-quality fuzzy-join programs has received surprisingly little attention thus far. In this work, we study the problem of "auto-program" fuzzy-joins. Leveraging a geometric interpretation of distance-functions, we develop an unsupervised Auto-FuzzyJoin framework that can infer suitable fuzzy-join programs on given input tables, without requiring explicit human input such as labeled training data. Using Auto-FuzzyJoin, users only need to provide two input tables L and R, and a desired precision target τ (say 0.9). Auto-FuzzyJoin leverages the fact that one of the inputs is a reference table to automatically program fuzzy-joins that meet the precision target τ in expectation, while maximizing fuzzy-join recall (defined as the number of correctly joined records). Experiments on both existing benchmarks and a new benchmark with 50 fuzzy-join tasks created from Wikipedia data suggest that the proposed Auto-FuzzyJoin significantly outperforms existing unsupervised approaches, and is surprisingly competitive even against supervised approaches (e.g., Magellan and DeepMatcher) when 50% of ground-truth labels are used as training data.

1 citation

Auto-Tag: Tagging-Data-By-Example in Data Lakes using Pre-training and Inferred Domain Patterns

Yeye He, Jie Song, Yue Wang, Surajit Chaudhuri, Vishal Anil, Blake Lassiter, Yaron Goland, Gaurav Malhotra

01 Mar 2021

1 citation

Posted Content•

AutoPipeline: Synthesize Data Pipelines By-Target Using Reinforcement Learning and Search

Junwen Yang, Yeye He, Surajit Chaudhuri

25 Jun 2021-arXiv: Databases

TL;DR: In this article, the authors propose a "by-target" paradigm that allows users to easily specify the desired pipeline, which is a significant departure from the traditional by-example paradigm.

Abstract: Recent work has made significant progress in helping users to automate single data preparation steps, such as string-transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot, etc.). In this work, we propose to automate multiple such steps end-to-end, by synthesizing complex data pipelines with both string transformations and table-manipulation operators. We propose a novel "by-target" paradigm that allows users to easily specify the desired pipeline, which is a significant departure from the traditional by-example paradigm. Using by-target, users would provide input tables (e.g., csv or json files), and point us to a "target table" (e.g., an existing database table or BI dashboard) to demonstrate what the output from the desired pipeline should schematically "look like". While the problem is seemingly underspecified, our unique insight is that implicit table constraints such as FDs and keys can be exploited to significantly constrain the space to make the problem tractable. We develop an Auto-Pipeline system that learns to synthesize pipelines using reinforcement learning and search. Experiments on large numbers of real pipelines crawled from GitHub suggest that Auto-Pipeline can successfully synthesize 60-70% of these complex pipelines (up to 10 steps) in 10-20 seconds on average.
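
The implicit table constraints mentioned in the abstract (keys and functional dependencies, FDs) can be checked mechanically. The sketch below shows only that check, which a search procedure could use to discard candidate outputs that violate constraints observed in the target table. It does not implement the paper's synthesis, reinforcement learning, or search; the function names, column names, and sample rows are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Dict[str, Hashable]
FD = Tuple[Sequence[str], Sequence[str]]  # (left-hand columns, right-hand columns)


def is_key(rows: List[Row], cols: Sequence[str]) -> bool:
    """True if no two rows share the same values on `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows: List[Row], lhs: Sequence[str], rhs: Sequence[str]) -> bool:
    """True if equal values on `lhs` always imply equal values on `rhs`."""
    mapping = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        right = tuple(row[c] for c in rhs)
        if mapping.setdefault(left, right) != right:
            return False
    return True


def satisfies_constraints(
    rows: List[Row], keys: List[Sequence[str]], fds: List[FD]
) -> bool:
    """Check a candidate pipeline output against the target's keys and FDs."""
    return all(is_key(rows, k) for k in keys) and all(
        holds_fd(rows, lhs, rhs) for lhs, rhs in fds
    )


# Constraints assumed to hold on a hypothetical target table.
target_keys = [["order_id"]]
target_fds = [(["zip"], ["city"])]

candidate_ok = [
    {"order_id": 1, "zip": "98052", "city": "Redmond"},
    {"order_id": 2, "zip": "98052", "city": "Redmond"},
    {"order_id": 3, "zip": "30332", "city": "Atlanta"},
]
# A wrong join duplicates order 1, so order_id is no longer a key.
candidate_bad = candidate_ok + [{"order_id": 1, "zip": "30332", "city": "Atlanta"}]

ok = satisfies_constraints(candidate_ok, target_keys, target_fds)
bad = satisfies_constraints(candidate_bad, target_keys, target_fds)
```

Checks like these are cheap to run on intermediate results, which is one way such constraints could shrink an otherwise very large space of candidate pipelines.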

1 citation

An Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication

Yue Wang, Vivek Narasayya, Yeye He, Surajit Chaudhuri

01 Oct 2021

Proceedings Article•

Auto-Pipeline: Synthesize Data Pipelines By-Target Using Reinforcement Learning and Search

Junwen Yang¹, Yeye He², Surajit Chaudhuri²

University of Chicago¹, Microsoft²

01 Jul 2021

Posted Content•

Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search

Junwen Yang¹, Yeye He², Surajit Chaudhuri²

University of Chicago¹, Microsoft²

25 Jun 2021-arXiv: Databases

TL;DR: In this article, the authors propose a "by-target" paradigm that allows users to easily specify the desired pipeline, which is a significant departure from the traditional by-example paradigm.

Abstract: Recent work has made significant progress in helping users to automate single data preparation steps, such as string-transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot, etc.). In this work, we propose to automate multiple such steps end-to-end, by synthesizing complex data pipelines with both string transformations and table-manipulation operators. We propose a novel "by-target" paradigm that allows users to easily specify the desired pipeline, which is a significant departure from the traditional by-example paradigm. Using by-target, users would provide input tables (e.g., csv or json files), and point us to a "target table" (e.g., an existing database table or BI dashboard) to demonstrate what the output from the desired pipeline should schematically "look like". While the problem is seemingly underspecified, our unique insight is that implicit table constraints such as FDs and keys can be exploited to significantly constrain the space to make the problem tractable. We develop an Auto-Pipeline system that learns to synthesize pipelines using reinforcement learning and search. Experiments on large numbers of real pipelines crawled from GitHub suggest that Auto-Pipeline can successfully synthesize 60-70% of these complex pipelines with up to 10 steps.
