
Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have been published within this topic, receiving 62,352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Proceedings Article
09 Jun 2008
TL;DR: This work expands the problem of record matching to take such user-defined string transformations as input, and demonstrates an improvement in record matching quality and efficient retrieval based on the index structure that is cognizant of transformations.
Abstract: Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob", which refer to the same name, or for more general forms of string transformations such as abbreviations. We expand the problem of record matching to take such user-defined string transformations as input. These transformations, coupled with an underlying similarity function, are used to define the similarity between two strings. We demonstrate the effectiveness of this approach via a fuzzy match operation that is used to look up an input record against a table of records, where we have an additional table of transformations as input. We demonstrate an improvement in record matching quality and efficient retrieval based on our index structure that is cognizant of transformations.
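As a rough illustration of the idea in this abstract, the sketch below combines a small transformation table with an underlying similarity function. It is not the authors' method or index structure: the token-level Jaccard similarity, the rule format, and the names `variants`, `jaccard`, and `similarity` are all assumptions chosen for brevity.

```python
from itertools import product


def variants(tokens, rules):
    """Enumerate token sets reachable by optionally rewriting each token.

    `rules` is a hypothetical transformation table: token -> set of synonyms.
    """
    options = [[tok] + sorted(rules.get(tok, ())) for tok in tokens]
    return [frozenset(combo) for combo in product(*options)]


def jaccard(a, b):
    """Token-set Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(s, t, rules):
    """Best underlying similarity over all transformed variants of s and t."""
    s_variants = variants(s.lower().split(), rules)
    t_variants = variants(t.lower().split(), rules)
    return max(jaccard(x, y) for x in s_variants for y in t_variants)


# Hypothetical transformation table (user-defined synonyms and abbreviations).
RULES = {"bob": {"robert"}, "st": {"street"}}
```

With these rules, `similarity("Bob Smith", "Robert Smith", RULES)` treats the two names as identical, whereas an empty rule table leaves them only partially similar. Enumerating every variant is exponential in the number of rewritable tokens, which is one reason an index that is aware of transformations matters in practice.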

16 citations

Patent
01 Aug 2002
TL;DR: A method of comparing version strings in a computing environment for use in version-specific computing tasks is presented, in which each of a first and a second version string is divided at each one of a set of predetermined delimiters to produce respective first and second sets of sequentially ordered string chunks.
Abstract: A method of comparing version strings in a computing environment for use in version-specific computing tasks. In one embodiment, the method divides each of a first and a second version string at each one of a set of predetermined delimiters to produce respective first and second sets of sequentially ordered string chunks. Next, string chunks of the same order from the first and second chunk sets are iteratively compared to determine matching of same-order string chunks, with the comparison continuing until a non-matching same-order string chunk pair is encountered. From the matching/non-matching comparisons, a determination may be made whether a specified quality relationship exists between the first and second version strings, where the quality relationship determines the propriety of a version-specific computing task.
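The chunk-by-chunk comparison described in this abstract can be sketched as follows. This is a minimal reading of the abstract, not the patented method: the delimiter set `".-_"`, the function names, and the `same_release_line` relationship are assumptions made for illustration.

```python
import re


def split_version(version, delimiters=".-_"):
    """Split a version string at every delimiter into ordered chunks."""
    pattern = "[" + re.escape(delimiters) + "]"
    return re.split(pattern, version)


def first_mismatch(a, b, delimiters=".-_"):
    """Index of the first non-matching same-order chunk pair, or None.

    Comparison stops at the first mismatch; a missing chunk in the
    shorter version counts as a mismatch at that position.
    """
    chunks_a = split_version(a, delimiters)
    chunks_b = split_version(b, delimiters)
    for index, (x, y) in enumerate(zip(chunks_a, chunks_b)):
        if x != y:
            return index
    if len(chunks_a) != len(chunks_b):
        return min(len(chunks_a), len(chunks_b))
    return None


def same_release_line(a, b, depth=2):
    """Hypothetical relationship: the first `depth` chunks agree."""
    mismatch = first_mismatch(a, b)
    return mismatch is None or mismatch >= depth
```

A caller could gate a version-specific task on `same_release_line(installed, required)`, for example allowing a patch only when the major and minor chunks agree.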

16 citations

Journal Article
TL;DR: This paper includes the swap operation, which interchanges two adjacent characters, in the set of allowable edit operations, and presents an O(t min(m,n))-time algorithm for the extended edit distance problem, where t represents the edit distance between the given strings; it also addresses the extended k-differences problem.
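To make the role of the swap operation concrete, the sketch below computes an edit distance in which an adjacent swap costs one operation. It uses the textbook O(mn) dynamic program for the restricted variant (often called optimal string alignment), not the faster algorithm the paper presents; whether this variant matches the paper's exact definition is an assumption, and the function name is hypothetical.

```python
def edit_distance_with_swaps(a, b):
    """Edit distance with insert, delete, change, and adjacent swap.

    Plain O(len(a) * len(b)) dynamic program; a swapped pair is not
    edited further (restricted, "optimal string alignment" variant).
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or change
            )
            # Adjacent swap: the last two characters are interchanged.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

Without the swap case, turning "abcd" into "abdc" would cost two changes; with it, the cost is a single operation.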

16 citations

Patent
09 May 2014
TL;DR: In this paper, a compression algorithm replaces duplicative strings with a copy pair indicating a location and length of a preceding identical string that is within a window from the duplicative string.
Abstract: A compression algorithm replaces duplicative strings with a copy pair indicating a location and length of a preceding identical string that is within a window from the duplicative string. Rather than replacing a longest matching string within a window from a given point with a copy pair, the longest matching string may be used provided it is at least two bytes larger than the next longest matching string or is at a distance that is less than some multiple of a distance to the next longest matching string. In another aspect, the length of the window in which a matching string may be found is dependent on a length of the matching string. In yet another aspect, rather than labeling each literal and copy pair to indicate what it is, strings of non-duplicative literals are represented by a label and a length of the string.
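The basic copy-pair idea that this abstract builds on can be sketched with a plain greedy scheme. The code below always takes the longest match in a fixed window and does not implement the patent's selection rules, length-dependent window, or literal-run labels; the token format, the parameter defaults, and the names `compress` and `decompress` are assumptions.

```python
def compress(data: bytes, window: int = 32, min_len: int = 3):
    """Greedy sketch: emit literals (ints) or (distance, length) copy pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the preceding window for the longest identical string.
        for start in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[start + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def decompress(tokens) -> bytes:
    """Rebuild the data; copies run byte by byte so overlaps work."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(token)
    return bytes(out)
```

Here a copy pair stores the distance back from the current position rather than an absolute location, a common convention that is assumed rather than taken from the patent.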

16 citations

Proceedings Article
01 Oct 2020
TL;DR: DeezyMatch is presented, a free, open-source software library written in Python for fuzzy string matching and candidate ranking that supports various deep neural network architectures for training new classifiers and for fine-tuning a pretrained model, which paves the way for transfer learning in fuzzy string matching.
Abstract: We present DeezyMatch, a free, open-source software library written in Python for fuzzy string matching and candidate ranking. Its pair classifier supports various deep neural network architectures for training new classifiers and for fine-tuning a pretrained model, which paves the way for transfer learning in fuzzy string matching. This approach is especially useful where only limited training examples are available. The learned DeezyMatch models can be used to generate rich vector representations from string inputs. The candidate ranker component in DeezyMatch uses these vector representations to find, for a given query, the best matching candidates in a knowledge base. It uses an adaptive searching algorithm applicable to large knowledge bases and query sets. We describe DeezyMatch’s functionality, design and implementation, accompanied by a use case in toponym matching and candidate ranking in realistic noisy datasets.

16 citations


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  8
2022  30
2021  32
2020  30
2019  48
2018  39