Topic
Approximate string matching
About: Approximate string matching is a research topic. Over the lifetime, 1903 publications have been published within this topic receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.
Papers published on a yearly basis
Papers
More filters
••
09 Jun 2008TL;DR: This work expands the problem of record matching to take such user-defined string transformations as input, and demonstrates an improvement in record matching quality and efficient retrieval based on the index structure that is cognizant of transformations.
Abstract: Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob" which refer to the same name, and more general forms of string transformations such as abbreviations. We expand the problem of record matching to take such user-defined string transformations as input. These transformations coupled with an underlying similarity function are used to define the similarity between two strings. We demonstrate the effectiveness of this approach via a fuzzy match operation that is used to lookup an input record against a table of records, where we have an additional table of transformations as input. We demonstrate an improvement in record matching quality and efficient retrieval based on our index structure that is cognizant of transformations.
16 citations
•
01 Aug 2002TL;DR: In this article, a method of comparing version strings in a computing environment for use in version-specific computing tasks is presented, where each of a first and a second version string at each one of a set of predetermined delimiters to produce respective first and second sets of sequentially ordered string chunks.
Abstract: A method of comparing version strings in a computing environment for use in version-specific computing tasks. In one embodiment, the method divides each of a first and a second version string at each one of a set of predetermined delimiters to produce respective first and second sets of sequentially ordered string chunks. Next, string chunks of the same order from the first and second chunk sets are iteratively compared to determine matching of same-order string chunks, with the comparison continuing until a non-matching same-order string chunk pair is encountered. From the matching/non-matching comparisons, a determination may be made whether a specified quality relationship exists between the first and second version strings, where the quality relationship determines the propriety of a version-specific computing task.
16 citations
••
TL;DR: This paper includes the swapoperation that interchanges two adjacent characters into the set of allowable edit operations, and presents anO(tmin(m,n))-time algorithm for the extended edit distance problem, where tmin represents the edit distance between the given strings, and n represents the extendedk-differences problem.
16 citations
•
09 May 2014
TL;DR: In this paper, a compression algorithm replaces duplicative strings with a copy pair indicating a location and length of a preceding identical string that is within a window from the duplicative string.
Abstract: A compression algorithm replaces duplicative strings with a copy pair indicating a location and length of a preceding identical string that is within a window from the duplicative string. Rather than a replacing a longest matching string within a window from a given point with a copy pair, the longest matching string may be used provide it is at least two bytes larger than the next longest matching string or is at a distance that is less than some multiple of a distance to the next longest matching string. In another aspect, the length of the window in which a matching string may be found is dependent on a length of the matching string. In yet another aspect, rather than labeling each literal and copy pair to indicate what it is, strings of non-duplicative literals are represented by a label and a length of the string.
16 citations
••
01 Oct 2020TL;DR: DeezyMatch is presented, a free, open-source software library written in Python for fuzzy string matching and candidate ranking that supports various deep neural network architectures for training new classifiers and for fine-tuning a pretrained model, which paves the way for transfer learning in fuzzystring matching.
Abstract: We present DeezyMatch, a free, open-source software library written in Python for fuzzy string matching and candidate ranking. Its pair classifier supports various deep neural network architectures for training new classifiers and for fine-tuning a pretrained model, which paves the way for transfer learning in fuzzy string matching. This approach is especially useful where only limited training examples are available. The learned DeezyMatch models can be used to generate rich vector representations from string inputs. The candidate ranker component in DeezyMatch uses these vector representations to find, for a given query, the best matching candidates in a knowledge base. It uses an adaptive searching algorithm applicable to large knowledge bases and query sets. We describe DeezyMatch’s functionality, design and implementation, accompanied by a use case in toponym matching and candidate ranking in realistic noisy datasets.
16 citations