Author

Yeye He

Other affiliations: University of Wisconsin-Madison

Bio: Yeye He is an academic researcher at Microsoft. The author has contributed to research on topics including Set (abstract data type) and Table (database). The author has an h-index of 19 and has co-authored 59 publications receiving 1178 citations. Previous affiliations of Yeye He include University of Wisconsin-Madison.

Papers published on a yearly basis (years listed): 2009–2011 and 2013–2023

Papers

Journal Article•DOI•

Anonymization of set-valued data via top-down, local generalization

Yeye He¹, Jeffrey F. Naughton¹•Institutions (1)

University of Wisconsin-Madison¹

01 Aug 2009

TL;DR: A top-down, partition-based approach to anonymizing set-valued data that scales linearly with the input size and scores well on an information-loss data quality metric is proposed.

Abstract: Set-valued data, in which a set of values are associated with an individual, is common in databases ranging from market basket data, to medical databases of patients' symptoms and behaviors, to query engine search logs. Anonymizing this data is important if we are to reconcile the conflicting demands arising from the desire to release the data for study and the desire to protect the privacy of individuals represented in the data. Unfortunately, the bulk of existing anonymization techniques, which were developed for scenarios in which each individual is associated with only one sensitive value, are not well-suited for set-valued data. In this paper we propose a top-down, partition-based approach to anonymizing set-valued data that scales linearly with the input size and scores well on an information-loss data quality metric. We further note that our technique can be applied to anonymize the infamous AOL query logs, and discuss the merits and challenges in anonymizing query logs using our approach.

237 citations

Proceedings Article•DOI•

SEISA: set expansion by iterative similarity aggregation

Yeye He¹, Dong Xin²•Institutions (2)

University of Wisconsin-Madison¹, Microsoft²

28 Mar 2011

TL;DR: A new general framework based on iterative similarity aggregation is proposed, and results are presented to show that, when using general-purpose web data for set expansion, this approach outperforms previous techniques in terms of both precision and recall.

Abstract: In this paper, we study the problem of expanding a set of given seed entities into a more complete set by discovering other entities that also belong to the same concept set. A typical example is to use "Canon" and "Nikon" as seed entities, and derive other entities (e.g., "Olympus") in the same concept set of camera brands. In order to discover such relevant entities, we exploit several web data sources, including lists extracted from web pages and user queries from a web search engine. While these web data are highly diverse with rich information that usually cover a wide range of the domains of interest, they tend to be very noisy. We observe that previously proposed random walk based approaches do not perform very well on these noisy data sources. Accordingly, we propose a new general framework based on iterative similarity aggregation, and present detailed experimental results to show that, when using general-purpose web data for set expansion, our approach outperforms previous techniques in terms of both precision and recall.

98 citations

Journal Article•DOI•

ClusterJoin: a similarity joins framework using map-reduce

Akash Das Sarma¹, Yeye He², Surajit Chaudhuri²•Institutions (2)

Stanford University¹, Microsoft²

01 Aug 2014

TL;DR: A ClusterJoin framework is proposed that partitions the data space based on the underlying data distribution and distributes each record to the partitions in which it may produce join results under the distance threshold; a dynamic load balancing scheme using sampling provides strong probabilistic guarantees on the size of partitions and greatly improves scalability.

Abstract: Similarity join is the problem of finding pairs of records with similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution, and distributes each record to partitions in which they may produce join results based on the distance threshold. We design a set of strong candidate filters specific to different distance functions using a novel bisector-based framework, so that each record only needs to be distributed to a small number of partitions while still guaranteeing correctness. To address data skewness, which is common for high dimensional data, we further develop a dynamic load balancing scheme using sampling, which provides strong probabilistic guarantees on the size of partitions, and greatly improves scalability. Experimental evaluation using real data sets shows that our approach is considerably more scalable compared to state-of-the-art algorithms, especially for high dimensional data with low distance thresholds.

89 citations
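
The similarity join problem defined in the ClusterJoin abstract above (find all pairs of records whose similarity exceeds a threshold) can be illustrated with a brute-force baseline. This is only a minimal sketch under stated assumptions: it uses Jaccard similarity on sets as one example measure, it compares every pair in memory, and it is not the ClusterJoin algorithm (no partitioning, candidate filters, or MapReduce). The function names `jaccard` and `similarity_join` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two collections (assumed example measure)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(union)


def similarity_join(records, threshold):
    """Brute-force join: index pairs (i, j), i < j, with similarity > threshold.

    Compares all O(n^2) pairs; a scalable framework would avoid this.
    """
    return [
        (i, j)
        for (i, ra), (j, rb) in combinations(enumerate(records), 2)
        if jaccard(ra, rb) > threshold
    ]


records = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}, {"a", "b", "c", "d"}]
print(similarity_join(records, 0.6))  # prints [(0, 3), (1, 3)]
```

The quadratic number of comparisons in this baseline is the cost that partition-based approaches aim to reduce.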

Proceedings Article•DOI•

Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer Learning

Chen Zhao¹, Yeye He²•Institutions (2)

University of Maryland, College Park¹, Microsoft²

13 May 2019

TL;DR: This work proposes a transfer-learning approach to EM, leveraging pre-trained EM models from large-scale, production knowledge bases (KB), and suggests that the pre-trained approach is effective and outperforms existing EM methods.

Abstract: Entity matching (EM), also known as entity resolution, fuzzy join, and record linkage, refers to the process of identifying records corresponding to the same real-world entities from different data sources. It is an important and long-standing problem in data integration and data mining. So far progresses have been made mainly in the form of model improvements, where models with better accuracy are developed when large amounts of training data is available. In real-world applications we find that advanced approaches can often require too many labeled examples that is expensive to obtain, which has become a key obstacle to wider adoption. We in this work take a different tack, proposing a transfer-learning approach to EM, leveraging pre-trained EM models from large-scale, production knowledge bases (KB). Specifically, for each entity-type in KB, (e.g., location, organization, people, etc.), we use rich synonymous names of known entities in the KB as training data, to pre-train type-detection and EM models for each type, using a novel hierarchical neural network architecture we develop. Given a new EM task, with little or no training data, we can either fine-tune or directly leverage pre-trained EM models, to build end-to-end, high-quality EM systems. Experiments on a variety of real EM tasks suggest that the pre-trained approach is effective and outperforms existing EM methods.

88 citations

Proceedings Article•DOI•

Crawling deep web entity pages

Yeye He¹, Dong Xin², Venkatesh Ganti², Sriram Rajaraman², Nirav Shah²•Institutions (2)

University of Wisconsin-Madison¹, Google²

04 Feb 2013

TL;DR: This work describes a prototype system that specializes in crawling entity-oriented deep-web sites and proposes techniques tailored to important subproblems, including query generation, empty page filtering and URL deduplication, in the specific context of entity-oriented deep-web sites.

Abstract: Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.

69 citations

Cited by

Journal Article•DOI•

Machine learning

Thomas G. Dietterich¹•Institutions (1)

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Journal Article•

When is nearest neighbor meaningful

Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft

01 Jan 1999-Lecture Notes in Computer Science

TL;DR: The authors explore the effect of dimensionality on the nearest neighbor problem and show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.

Abstract: We explore the effect of dimensionality on the nearest neighbor problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed, and should be modified. In particular, most such techniques proposed in the literature are not evaluated versus simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied in high (10-15) dimensionality!

1,992 citations
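
The effect described in the abstract above, where the nearest and farthest distances converge as dimensionality grows, can be observed with a small simulation. This is a hedged sketch, not the paper's experimental setup: it assumes points drawn independently and uniformly from the unit hypercube, Euclidean distance, and a hypothetical helper named `nearest_to_farthest_ratio`. A ratio close to 1 means the nearest and farthest points are almost equally far from the query.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query point.

    Points and query are sampled uniformly from [0, 1]^dim (an assumption
    made for this illustration only).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


# Print the ratio for increasing dimensionality.
for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

Under these assumptions the ratio tends to rise toward 1 as `dim` increases, which is the situation in which a nearest-neighbor query carries little information.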

Book Chapter•DOI•

Fixed Point Theory

Klaus Deimling

01 Jan 1985

TL;DR: This chapter supplements the book's fixed point theorems, beginning with a group of results centred on Banach's fixed point theorem, which imposes only one simple condition on the map F, is easy to prove, and nevertheless allows a variety of applications.

Abstract: Formally we have arrived at the middle of the book. So you may need a pause for recovering, a pause which we want to fill up by some fixed point theorems supplementing those which you already met or which you will meet in later chapters. The first group of results centres around Banach’s fixed point theorem. The latter is certainly a nice result since it contains only one simple condition on the map F, since it is so easy to prove and since it nevertheless allows a variety of applications. Therefore it is not astonishing that many mathematicians have been attracted by the question to which extent the conditions on F and the space Ω can be changed so that one still gets the existence of a unique or of at least one fixed point. The number of results produced this way is still finite, but of a statistical magnitude, suggesting at a first glance that only a random sample can be covered by a chapter or even a book of the present size. Fortunately (or unfortunately?) most of the modifications have not found applications up to now, so that there is no reason to write a cookery book about conditions but to write at least a short outline of some ideas indicating that this field can be as interesting as other chapters. A systematic account of more recent ideas and examples in fixed point theory should however be written by one of the true experts. Strange as it is, such a book does not seem to exist though so many people are puzzling out so many results.

994 citations

Journal Article•DOI•

Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions

Wei Shen¹, Jianyong Wang², Jiawei Han³•Institutions (3)

Nankai University¹, Tsinghua University², University of Illinois at Urbana–Champaign³

01 Feb 2015-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A thorough overview and analysis of the main approaches to entity linking is presented, and various applications, the evaluation of entity linking systems, and future directions are discussed.

Abstract: The large number of potential applications from bridging web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.

702 citations

Book•

Submodular functions and optimization

Satoru Fujishige (藤重 悟)

01 Jan 1991

TL;DR: A book on submodular functions and optimization, covering submodular systems and base polyhedra, neoflows, submodular analysis (including the Lovasz extensions of submodular functions), and nonlinear optimization with submodular constraints.

Abstract: Introduction. 1. Mathematical Preliminaries. Submodular Systems and Base Polyhedra. 2. From Matroids to Submodular Systems. Matroids. Polymatroids. Submodular Systems. 3. Submodular Systems and Base Polyhedra. Fundamental Operations on Submodular Systems. Greedy Algorithm. Structures of Base Polyhedra. Intersecting- and Crossing-Submodular Functions. Related Polyhedra. Submodular Systems of Network Type. Neoflows. 4. The Intersection Problem. The Intersection Theorem. The Discrete Separation Theorem. The Common Base Problem. 5. Neoflows. The Equivalence of the Neoflow Problems. Feasibility for Submodular Flows. Optimality for Submodular Flows. Algorithms for Neoflows. Matroid Optimization. Submodular Analysis. 6. Submodular Functions and Convexity. Conjugate Functions and a Fenchel-Type Min-Max Theorem for Submodular and Supermodular Functions. Subgradients of Submodular Functions. The Lovasz Extensions of Submodular Functions. 7. Submodular Programs. Submodular Programs - Unconstrained Optimization. Submodular Programs - Constrained Optimization. Nonlinear Optimization with Submodular Constraints. 8. Separable Convex Optimization. Optimality Conditions. A Decomposition Algorithm. Discrete Optimization. 9. The Lexicographically Optimal Base Problem. Nonlinear Weight Functions. Linear Weight Functions. 10. The Weighted Max-Min and Min-Max Problems. Continuous Variables. Discrete Variables. 11. The Fair Resource Allocation Problem. Continuous Variables. Discrete Variables. 12. The Neoflow Problem with a Separable Convex Cost Function. References. Index.

505 citations
