Proceedings Article

MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering

TLDR
MF-Join, a multi-level filtering approach for fuzzy string similarity join, is proposed; it provides a flexible framework that supports multiple similarity functions at both the element and record levels and clearly outperforms state-of-the-art methods.
Abstract
As an essential operation in data integration and data cleaning, similarity join has attracted considerable attention from the database community. In many application scenarios, it is essential to support fuzzy matching, which allows approximate matching between elements and thereby improves the effectiveness of string similarity join. To describe the fuzzy matching between strings, we consider two levels of similarity, i.e., element-level and record-level similarity. The problem of calculating fuzzy matching similarity can then be transformed into finding the weighted maximal matching in a bipartite graph. In this paper, we propose MF-Join, a multi-level filtering approach for fuzzy string similarity join. MF-Join provides a flexible framework that can support multiple similarity functions at both levels. To improve performance, we devise and implement several techniques to enhance the filtering power. Specifically, we utilize a partition-based signature at the element level and propose a frequency-aware partition strategy to improve the quality of signatures. We also devise a count filter at the record level to further prune dissimilar pairs. Moreover, we deduce an effective upper bound for the record-level similarity to reduce the computational overhead of verification. Experimental results on two popular datasets show that our proposed method clearly outperforms state-of-the-art methods.
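
The two-level similarity described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes Jaccard similarity over 2-grams at the element level, a hypothetical element threshold `delta`, and a fuzzy-Jaccard formula at the record level (matching weight divided by the sum of the record sizes minus the matching weight). The maximum-weight matching is found by brute force, which is only practical for very small records; a real system would use an assignment algorithm such as the Hungarian method. All function names are illustrative.

```python
from itertools import permutations


def qgrams(token, q=2):
    """Return the set of q-grams of a token (the token itself if shorter than q)."""
    if len(token) < q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}


def element_sim(a, b, q=2):
    """Element-level similarity: Jaccard over q-gram sets (an assumed choice)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def max_weight_matching(weights):
    """Brute-force maximum-weight bipartite matching over a weight matrix."""
    if not weights or not weights[0]:
        return 0.0
    rows, cols = len(weights), len(weights[0])
    if rows > cols:
        # Transpose so that every row can be assigned a distinct column.
        weights = [list(col) for col in zip(*weights)]
        rows, cols = cols, rows
    best = 0.0
    for perm in permutations(range(cols), rows):
        total = sum(weights[i][j] for i, j in enumerate(perm))
        best = max(best, total)
    return best


def fuzzy_jaccard(r, s, delta=0.5):
    """Record-level similarity from the best element-level matching.

    Edges whose element similarity is below `delta` get weight 0,
    so they contribute nothing to the matching.
    """
    if not r or not s:
        return 0.0
    weights = []
    for a in r:
        row = []
        for b in s:
            w = element_sim(a, b)
            row.append(w if w >= delta else 0.0)
        weights.append(row)
    m = max_weight_matching(weights)
    return m / (len(r) + len(s) - m)
```

For example, `fuzzy_jaccard(["data", "cleaning"], ["data", "cleanning"])` credits the misspelled token with a partial match, whereas exact token-level Jaccard would count it as a complete mismatch.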

Citations
Journal Article

Blocking and Filtering Techniques for Entity Resolution: A Survey

TL;DR: This paper reviews a large number of relevant works under two different but related frameworks, blocking and filtering, and provides a comprehensive list of these works, discussing them in the greater context.
Journal Article

A Transformation-Based Framework for KNN Set Similarity Search

TL;DR: A transformation-based framework is proposed to solve the problem of KNN set similarity search, which, given a collection of set records and a query set, returns the results with the largest similarity to the query.
Journal Article

Deep Entity Matching: Challenges and Opportunities

TL;DR: In this article, the authors report their recent system DITTO, which is an example of a modern entity matching system based on pretrained language models, and summarize recent solutions in applying deep learning and pre-trained language models for solving the entity matching task.
Posted Content

A Survey of Blocking and Filtering Techniques for Entity Resolution.

TL;DR: This survey organizes the bulk of works in the field into blocking, filtering, and hybrid techniques, facilitating their understanding and use, and provides in-depth coverage of each category, further classifying the corresponding works into novel sub-categories.
Posted Content

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach.

TL;DR: PEXESO is proposed, a framework for joinable table discovery in data lakes that identifies substantially more tables than equi-joins and outperforms other similarity-based options, and the join results are useful in data enrichment for machine learning tasks.
References
Journal Article

The Hungarian method for the assignment problem

TL;DR: This paper has always been one of my favorite children, combining as it does elements of the duality of linear programming and combinatorial tools from graph theory, and it may be of some interest to tell the story of its origin.
Book

Introduction to Graph Theory

TL;DR: In this article, the authors introduce the concept of graph coloring and propose a graph coloring algorithm based on Euler's formula for k-chromatic graphs, which can be seen as a special case of the graph coloring problem.
Journal Article

Mining frequent patterns without candidate generation

TL;DR: This study proposes a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develops an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth.
Proceedings Article

Google news personalization: scalable online collaborative filtering

TL;DR: This paper describes the approach to collaborative filtering for generating personalized recommendations for users of Google News using MinHash clustering, Probabilistic Latent Semantic Indexing, and covisitation counts, and combines recommendations from different algorithms using a linear model.
Proceedings Article

Similarity flooding: a versatile graph matching algorithm and its application to schema matching

TL;DR: This paper presents a matching algorithm based on a fixpoint computation that is usable across different scenarios and conducts a user study, in which the accuracy metric was used to estimate the labor savings that the users could obtain by utilizing the algorithm to obtain an initial matching.