MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering

doi:10.1109/ICDE.2019.00042

Proceedings Article


- pp 386-397

TLDR

MF-Join, a multi-level filtering approach for fuzzy string similarity join, is proposed; it provides a flexible framework that supports multiple similarity functions at both the element and record levels and clearly outperforms state-of-the-art methods.

Abstract:

As an essential operation in data integration and data cleaning, similarity join has attracted considerable attention from the database community. In many application scenarios, it is essential to support fuzzy matching, i.e., approximate matching between elements, which improves the effectiveness of string similarity join. To describe the fuzzy matching between strings, we consider two levels of similarity, i.e., element-level and record-level similarity. Then the problem of calculating fuzzy matching similarity can be transformed into finding the weighted maximal matching in a bipartite graph. In this paper, we propose MF-Join, a multi-level filtering approach for fuzzy string similarity join. MF-Join provides a flexible framework that can support multiple similarity functions at both levels. To improve performance, we devise and implement several techniques to enhance the filter power. Specifically, we utilize a partition-based signature at the element level and propose a frequency-aware partition strategy to improve the quality of signatures. We also devise a count filter at the record level to further prune dissimilar pairs. Moreover, we deduce an effective upper bound for the record-level similarity to reduce the computational overhead of verification. Experimental results on two popular datasets show that our proposed method clearly outperforms state-of-the-art methods.
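The two-level similarity described in the abstract can be illustrated with a minimal Python sketch. It is not the MF-Join implementation: the abstract does not fix the similarity functions, so the sketch assumes normalized edit similarity at the element level and a fuzzy Jaccard at the record level. All function names and the threshold `delta` are hypothetical. The matching is solved by an exponential bitmask dynamic program that is only suitable for short records, and the upper bound shown is a generic row/column-maximum bound, not necessarily the one derived in the paper.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def element_sim(a: str, b: str) -> float:
    """Assumed element-level similarity: normalized edit similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def weight_matrix(r, s, delta):
    """Bipartite edge weights; pairs below the element threshold get weight 0."""
    weights = []
    for a in r:
        row = []
        for b in s:
            w = element_sim(a, b)
            row.append(w if w >= delta else 0.0)
        weights.append(row)
    return weights


def fuzzy_overlap(weights):
    """Maximum total weight of a bipartite matching (bitmask DP over columns)."""
    n_cols = len(weights[0]) if weights else 0

    @lru_cache(maxsize=None)
    def best(row, used):
        if row == len(weights):
            return 0.0
        result = best(row + 1, used)  # option: leave this row unmatched
        for col in range(n_cols):
            if not used & (1 << col) and weights[row][col] > 0.0:
                candidate = weights[row][col] + best(row + 1, used | (1 << col))
                result = max(result, candidate)
        return result

    return best(0, 0)


def overlap_upper_bound(weights):
    """Cheap bound: a matching uses at most one edge per row and per column."""
    by_row = sum(max(row, default=0.0) for row in weights)
    by_col = sum(max(col, default=0.0) for col in zip(*weights))
    return min(by_row, by_col)


def fuzzy_jaccard(r, s, delta):
    """Assumed record-level similarity: overlap / (|R| + |S| - overlap)."""
    if not r and not s:
        return 1.0
    overlap = fuzzy_overlap(weight_matrix(r, s, delta))
    return overlap / (len(r) + len(s) - overlap)


if __name__ == "__main__":
    r = ["nba", "mcgrady"]
    s = ["macgrady", "nba"]
    print(fuzzy_jaccard(r, s, delta=0.8))
```

Because the fuzzy Jaccard grows monotonically with the overlap, any upper bound on the overlap gives an upper bound on the record-level similarity, so a pair whose bound falls below the join threshold can be discarded without running the matching. For records of realistic size, a polynomial-time assignment solver such as the Hungarian method would replace the bitmask program.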

Citations


Journal Article

Blocking and Filtering Techniques for Entity Resolution: A Survey

George Papadakis, +3 more

- 13 Mar 2020 -

ACM Computing Surveys

TL;DR: This survey reviews a large number of relevant works under two different but related frameworks, blocking and filtering, and provides a comprehensive list of those works, discussing them in the greater context.


Journal Article

A Transformation-Based Framework for KNN Set Similarity Search

Yong Zhang, +3 more

- 01 Mar 2020 -

IEEE Transactions on Knowledge and Data ...

TL;DR: A transformation-based framework is proposed for the problem of KNN set similarity search, which, given a collection of set records and a query set, returns the k results with the largest similarity to the query.


Journal Article

Deep Entity Matching: Challenges and Opportunities

Yuliang Li, +5 more

- 06 Jan 2021 -

Journal of Data and Information Quality

TL;DR: In this article, the authors report their recent system DITTO, which is an example of a modern entity matching system based on pretrained language models, and summarize recent solutions in applying deep learning and pre-trained language models for solving the entity matching task.


Posted Content

A Survey of Blocking and Filtering Techniques for Entity Resolution.

George Papadakis, +3 more

- 15 May 2019 -

arXiv: Databases

TL;DR: This survey organizes the bulk of works in the field into Blocking, Filtering, and hybrid techniques, facilitating their understanding and use, and provides in-depth coverage of each category, further classifying the corresponding works into novel sub-categories.


Posted Content

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach.

Yuyang Dong, +3 more

- 26 Oct 2020 -

arXiv: Information Retrieval

TL;DR: PEXESO is proposed, a framework for joinable table discovery in data lakes that identifies substantially more tables than equi-joins and outperforms other similarity-based options, and the join results are useful in data enrichment for machine learning tasks.


References


Journal Article

The Hungarian method for the assignment problem

Harold W. Kuhn

- 01 Mar 1955 -

Naval Research Logistics Quarterly

TL;DR: This paper has always been one of my favorite children, combining as it does elements of the duality of linear programming and combinatorial tools from graph theory, and it may be of some interest to tell the story of its origin.


Book

Introduction to Graph Theory

Douglas B. West

TL;DR: In this article, the authors introduce the concept of graph coloring and propose a graph coloring algorithm based on Euler's formula for k-chromatic graphs, which can be seen as a special case of the graph coloring problem.


Journal Article

Mining frequent patterns without candidate generation

Jiawei Han, +2 more

TL;DR: This study proposes a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develops an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth.


Proceedings Article

Google news personalization: scalable online collaborative filtering

Abhinandan S. Das, +3 more

TL;DR: This paper describes the approach to collaborative filtering for generating personalized recommendations for users of Google News using MinHash clustering, Probabilistic Latent Semantic Indexing, and covisitation counts, and combines recommendations from different algorithms using a linear model.


Proceedings Article

Similarity flooding: a versatile graph matching algorithm and its application to schema matching

Sergey Melnik, +2 more

TL;DR: This paper presents a matching algorithm based on a fixpoint computation that is usable across different scenarios and conducts a user study, in which the accuracy metric was used to estimate the labor savings that the users could obtain by utilizing the algorithm to obtain an initial matching.
