Home
/
Authors
/
Jianhua Feng

Author

Jianhua Feng

Bio: Jianhua Feng is an academic researcher from Tsinghua University. The author has contributed to research in topics: XML & String metric. The author has an hindex of 44, co-authored 169 publications receiving 7242 citations.

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003
2002
2000

Papers

PDF

Open Access

More filters

Journal Article•DOI•

CrowdER: crowdsourcing entity resolution

[...]

Jiannan Wang¹, Tim Kraska², Michael J. Franklin², Jianhua Feng¹•Institutions (2)

Tsinghua University¹, University of California, Berkeley²

01 Jul 2012

TL;DR: This work proposes a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are use to verify only the most likely matching pairs, and develops a novel two-tiered heuristic approach for creating batched tasks.

...read moreread less

Abstract: Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers but even with batching, a human-only approach is infeasible for data sets of even moderate size, due to the large numbers of matches to be tested. Instead, we propose a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NP-Hard, but we develop a novel two-tiered heuristic approach for creating batched tasks. We describe this method, and present the results of extensive experiments on real data sets using a popular crowdsourcing platform. The experiments show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.

...read moreread less

499 citations

Posted Content•

CrowdER: Crowdsourcing Entity Resolution

[...]

Jiannan Wang¹, Tim Kraska², Michael J. Franklin², Jianhua Feng¹•Institutions (2)

Tsinghua University¹, University of California, Berkeley²

09 Aug 2012-arXiv: Databases

TL;DR: In this paper, a hybrid human-machine approach is proposed, in which machines are used to do an initial, coarse pass over all the data, and people were used to verify only the most likely matching pairs.

...read moreread less

450 citations

Proceedings Article•DOI•

EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data

[...]

Guoliang Li¹, Beng Chin Ooi², Jianhua Feng¹, Jianyong Wang¹, Lizhu Zhou¹ - Show less +1 more•Institutions (2)

Tsinghua University¹, National University of Singapore²

09 Jun 2008

TL;DR: An extended inverted index is proposed to facilitate keyword-based search, and a novel ranking mechanism for enhancing search effectiveness is presented, which achieves both high search efficiency and high accuracy.

...read moreread less

Abstract: Conventional keyword search engines are restricted to a given data model and cannot easily adapt to unstructured, semi-structured or structured data. In this paper, we propose an efficient and adaptive keyword search method, called EASE, for indexing and querying large collections of heterogenous data. To achieve high efficiency in processing keyword queries, we first model unstructured, semi-structured and structured data as graphs, and then summarize the graphs and construct graph indices instead of using traditional inverted indices. We propose an extended inverted index to facilitate keyword-based search, and present a novel ranking mechanism for enhancing search effectiveness. We have conducted an extensive experimental study using real datasets, and the results show that EASE achieves both high search efficiency and high accuracy, and outperforms the existing approaches significantly.

...read moreread less

422 citations

Journal Article•DOI•

Comparing stars: on approximating graph edit distance

[...]

Zhiping Zeng¹, Anthony K. H. Tung², Jianyong Wang¹, Jianhua Feng¹, Lizhu Zhou¹ - Show less +1 more•Institutions (2)

Tsinghua University¹, National University of Singapore²

01 Aug 2009

TL;DR: Three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time are introduced and result shows that these methods achieve good scalability in terms of both the number of graphs and the size of graphs.

...read moreread less

Abstract: Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision etc. Unfortunately, the problem of graph edit distance computation is NP-Hard in general. Accordingly, in this paper we introduce three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms AppFull and AppSub are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. Result shows that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of using our bounds in filtering and searching of graphs.

...read moreread less

413 citations

Proceedings Article•DOI•

Mining Individual Life Pattern Based on Location History

[...]

Yang Ye¹, Yu Zheng², Yukun Chen², Jianhua Feng¹, Xing Xie² - Show less +1 more•Institutions (2)

Tsinghua University¹, Microsoft²

18 May 2009

TL;DR: This paper proposes the novel notion of individual life pattern, which captures individual's general life style and regularity, and proposes the LP-Mine framework to effectively retrieve life patterns from raw individual GPS data.

...read moreread less

Abstract: The increasing pervasiveness of location-acquisition technologies (GPS, GSM networks, etc.) enables people to conveniently log their location history into spatial-temporal data, thus giving rise to the necessity as well as opportunity to discovery valuable knowledge from this type of data. In this paper, we propose the novel notion of individual life pattern, which captures individual's general life style and regularity. Concretely, we propose the life pattern normal form (the LP-normal form) to formally describe which kind of life regularity can be discovered from location history; then we propose the LP-Mine framework to effectively retrieve life patterns from raw individual GPS data. Our definition of life pattern focuses on significant places of individual life and considers diverse properties to combine the significant places. LP-Mine is comprised of two phases: the modelling phase and the mining phase. The modelling phase pre-processes GPS data into an available format as the input of the mining phase. The mining phase applies separate strategies to discover different types of pattern. Finally, we conduct extensive experiments using GPS data collected by volunteers in the real world to verify the effectiveness of the framework.

...read moreread less

292 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36

Collapse

Cited by

PDF

Open Access

More filters

Data Mining - Concepts and Techniques.

[...]

Petra Perner

01 Jan 2002

9,314 citations

Book•

Ontology Matching

[...]

Jérôme Euzenat, Pavel Shvaiko

05 Jun 2007

TL;DR: The second edition of Ontology Matching has been thoroughly revised and updated to reflect the most recent advances in this quickly developing area, which resulted in more than 150 pages of new content.

...read moreread less

Abstract: Ontologies tend to be found everywhere. They are viewed as the silver bullet for many applications, such as database integration, peer-to-peer systems, e-commerce, semantic web services, or social networks. However, in open or evolving systems, such as the semantic web, different parties would, in general, adopt different ontologies. Thus, merely using ontologies, like using XML, does not reduce heterogeneity: it just raises heterogeneity problems to a higher level. Euzenat and Shvaikos book is devoted to ontology matching as a solution to the semantic heterogeneity problem faced by computer systems. Ontology matching aims at finding correspondences between semantically related entities of different ontologies. These correspondences may stand for equivalence as well as other relations, such as consequence, subsumption, or disjointness, between ontology entities. Many different matching solutions have been proposed so far from various viewpoints, e.g., databases, information systems, and artificial intelligence. The second edition of Ontology Matching has been thoroughly revised and updated to reflect the most recent advances in this quickly developing area, which resulted in more than 150 pages of new content. In particular, the book includes a new chapter dedicated to the methodology for performing ontology matching. It also covers emerging topics, such as data interlinking, ontology partitioning and pruning, context-based matching, matcher tuning, alignment debugging, and user involvement in matching, to mention a few. More than 100 state-of-the-art matching systems and frameworks were reviewed. With Ontology Matching, researchers and practitioners will find a reference book that presents currently available work in a uniform framework. In particular, the work and the techniques presented in this book can be equally applied to database schema matching, catalog integration, XML schema matching and other related problems. The objectives of the book include presenting (i) the state of the art and (ii) the latest research results in ontology matching by providing a systematic and detailed account of matching techniques and matching systems from theoretical, practical and application perspectives.

...read moreread less

2,579 citations

Journal Article•DOI•

Trajectory Data Mining: An Overview

[...]

Yu Zheng¹•Institutions (1)

Microsoft¹

12 May 2015-ACM Transactions on Intelligent Systems and Technology

TL;DR: A systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics, and introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors.

...read moreread less

Abstract: The advances in location-acquisition and mobile computing techniques have generated massive spatial trajectory data, which represent the mobility of a diversity of moving objects, such as people, vehicles, and animals. Many techniques have been proposed for processing, managing, and mining trajectory data in the past decade, fostering a broad range of applications. In this article, we conduct a systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics. Following a road map from the derivation of trajectory data, to trajectory data preprocessing, to trajectory data management, and to a variety of mining tasks (such as trajectory pattern mining, outlier detection, and trajectory classification), the survey explores the connections, correlations, and differences among these existing techniques. This survey also introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors, to which more data mining and machine learning techniques can be applied. Finally, some public trajectory datasets are presented. This survey can help shape the field of trajectory data mining, providing a quick understanding of this field to the community.

...read moreread less

1,289 citations

Book•

Outlier Analysis

[...]

Charu C. Aggarwal

11 Jan 2013

TL;DR: Outlier Analysis is a comprehensive exposition, as understood by data mining experts, statisticians and computer scientists, and emphasis was placed on simplifying the content, so that students and practitioners can also benefit.

...read moreread less

Abstract: With the increasing advances in hardware technology for data collection, and advances in software technology (databases) for data organization, computer scientists have increasingly participated in the latest advancements of the outlier analysis field. Computer scientists, specifically, approach this field based on their practical experiences in managing large amounts of data, and with far fewer assumptions the data can be of any type, structured or unstructured, and may be extremely large. Outlier Analysisis a comprehensive exposition, as understood by data mining experts, statisticians and computer scientists. The book has been organized carefully, and emphasis was placed on simplifying the content, so that students and practitioners can also benefit. Chapters will typically cover one of three areas: methods and techniques commonly used in outlier analysis, such as linear methods, proximity-based methods, subspace methods, and supervised methods; data domains, such as, text, categorical, mixed-attribute, time-series, streaming, discrete sequence, spatial and network data; and key applications of these methods as applied to diverse domains such as credit card fraud detection, intrusion detection, medical diagnosis, earth science, web log analytics, and social network analysis are covered.

...read moreread less

1,278 citations

Book•

コンピュータ・サイエンス : ACM computing surveys

[...]

共立出版株式会社

01 Jan 1978

1,055 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse