Book Chapter

Efficient Approximate Entity Matching Using Jaro-Winkler Distance

Yaoshu Wang, +2 more
- pp 231-239
TLDR
This paper proposes an index-based method that relies on a filter-and-verify framework to support efficient Jaro-Winkler distance similarity search on a large dataset. It leverages e-variants methods to build the index structure and the pigeonhole principle to perform the search.
Abstract
Jaro-Winkler distance is a measure of the similarity between two strings. Because it performs well in matching personal and entity names, it is widely used in record linkage, entity linking, and information extraction. Given a query string q, Jaro-Winkler distance similarity search finds all strings in a dataset D whose Jaro-Winkler distance similarity with q is no more than a given threshold \(\tau \). As datasets grow, performing Jaro-Winkler distance similarity search efficiently becomes a challenging problem. In this paper, we propose an index-based method that relies on a filter-and-verify framework to support efficient Jaro-Winkler distance similarity search on a large dataset. We leverage e-variants methods to build the index structure and the pigeonhole principle to perform the search. The experimental results clearly demonstrate the efficiency of our methods.
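For orientation, the search problem stated in the abstract can be sketched as a brute-force baseline. This is not the paper's index-based method: it is only the "verify" step applied to every string, which is the cost a filter-and-verify index aims to avoid. The sketch assumes the textbook Jaro-Winkler definition (prefix scale 0.1, prefix length capped at 4) and assumes that the distance is taken as 1 minus the similarity; the function names `jaro`, `jaro_winkler`, and `linear_scan` are illustrative and do not come from the paper.

```python
def jaro(s: str, t: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(ls, lt) // 2 - 1, 0)
    s_matched = [False] * ls
    t_matched = [False] * lt
    m = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order, halved.
    s_seq = [c for c, hit in zip(s, s_matched) if hit]
    t_seq = [c for c, hit in zip(t, t_matched) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (m / ls + m / lt + (m - transpositions) / m) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def linear_scan(q: str, dataset: list[str], tau: float) -> list[str]:
    """Baseline search: verify every string in the dataset against q.

    Assumption: distance = 1 - similarity, and a string qualifies when
    its distance to q is no more than tau.
    """
    return [s for s in dataset if 1.0 - jaro_winkler(q, s) <= tau]


if __name__ == "__main__":
    names = ["MARHTA", "MARTA", "JONES"]
    print(linear_scan("MARTHA", names, tau=0.1))
```

The scan computes one similarity per string in D, so its cost grows linearly with the dataset size; an index that filters out most non-matching candidates before verification reduces the number of these computations.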

Citations
Posted Content

CanDID: Can-Do Decentralized Identity with Legacy Compatibility, Sybil-Resistance, and Accountability.

TL;DR: CanDID provides strong confidentiality for user’s keys, real-world identities, and data, yet prevents users from spawning multiple identities and allows identification (and blacklisting) of sanctioned users.
Proceedings Article

Understanding the Effect of Deplatforming on Social Networks

TL;DR: In this paper, the authors investigated the effect of deplatforming on the behavior of abusive users and found that users who get banned on Twitter/Reddit exhibit an increased level of activity and toxicity on Gab, although the audience they potentially reach decreases.
Journal Article

Pigeonring: A Principle for Faster Thresholded Similarity Search

TL;DR: A universal filtering framework is introduced to encompass the solutions to problems defined in the form of identifying data objects whose similarities or distances to the query are constrained by a threshold, and the pigeonhole principle is shown to be a special case of the new principle.
Journal Article

Mechanical and morphological investigation of bio-degradable magnesium AZ31 alloy for an orthopedic application

TL;DR: Bio-absorbable magnesium alloy AZ31 is used as base material with 1%Zn-3%Al-1-6%Mn which is prepared by stir casting technique and evaluated under mechanical and morphological aspects and can be preferred as an orthopedic implant for load bearing application.
Journal Article

Spell corrector for Bangla language using Norvig’s Algorithm and Jaro-Winkler distance

TL;DR: This paper presented a method for detecting and correcting spelling errors in Bangla words using Norvig's Algorithm and Jaro-Winkler distance, which achieved 97% accuracy when evaluated on 1000 Bangla words.
References
Proceedings Article

ArnetMiner: extraction and mining of academic social networks

TL;DR: The architecture and main features of the ArnetMiner system, which aims at extracting and mining academic social networks, are described and a unified modeling approach to simultaneously model topical aspects of papers, authors, and publication venues is proposed.
Journal Article

Duplicate Record Detection: A Survey

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Proceedings Article

A comparison of string distance metrics for name-matching tasks

TL;DR: Using an open-source Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Proceedings Article

Robust Disambiguation of Named Entities in Text

TL;DR: A robust method for collective disambiguation is presented, by harnessing context from knowledge bases and using a new form of coherence graph that significantly outperforms prior methods in terms of accuracy, with robust behavior across a variety of inputs.