Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have been published on this topic, receiving 62,352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Proceedings Article
01 Jan 2007
TL;DR: A novel method for approximate string matching, developed for the recognition of geographic and personal names, deals with abbreviations, name inversions, stopwords, and omission of parts.
Abstract: The problem of matching strings allowing errors has recently gained importance, considering the increasing volume of online textual data. In geotechnologies, approximate string matching algorithms find many applications, such as gazetteers, address matching, and geographic information retrieval. This paper presents a novel method for approximate string matching, developed for the recognition of geographic and personal names. The method deals with abbreviations, name inversions, stopwords, and omission of parts. Three similarity measures and a method to match individual words considering accent marks and other multilingual aspects were developed. Test results show high precision-recall rates and good overall matching efficiency.

14 citations
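
A minimal sketch of accent-insensitive name matching that tolerates inversions and stopwords, as described in the abstract above. It is not the paper's method: the stopword list, the function names, and the use of Jaccard similarity over word sets are all assumptions made for illustration, and abbreviations are not handled.

```python
import unicodedata

# Hypothetical stopword list; the paper's actual list is not given.
STOPWORDS = {"de", "da", "do", "of", "the"}

def strip_accents(word: str) -> str:
    """Drop combining marks so that 'São' compares equal to 'Sao'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_name(name: str) -> frozenset:
    """Lowercase, strip accents and punctuation, drop stopwords.

    Returning a set makes the comparison insensitive to word order,
    which covers inversions such as 'Silva, João' vs 'João Silva'.
    """
    tokens = strip_accents(name).lower().replace(",", " ").split()
    cleaned = (t.strip(".") for t in tokens)
    return frozenset(t for t in cleaned if t and t not in STOPWORDS)

def name_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the normalized word sets (an assumed measure)."""
    ta, tb = normalize_name(a), normalize_name(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Omitted name parts lower the score gradually rather than causing a mismatch, so a threshold on `name_similarity` would be needed to decide what counts as a match.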

Proceedings Article
01 Jan 2010
TL;DR: This article uses suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches, and A* parsing to validate candidate segments; the method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247 ms for a segment in a realistic scenario.
Abstract: We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247ms for a segment in a realistic scenario.

14 citations
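
The first stage named in the abstract above, finding exact n-gram matches with a suffix array, can be sketched as follows. This is an illustration only: it works on characters rather than the word n-grams a translation memory would likely use, builds the array by naive sorting, and leaves out the A* filtering and validation stages entirely. The function names are hypothetical.

```python
def build_suffix_array(text: str) -> list:
    """Return suffix start positions in lexicographic order of the suffixes.

    Naive construction by sorting slices; fine for a sketch, too slow
    for a large corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, suffix_array: list, pattern: str) -> list:
    """Return all start positions of `pattern` in `text`, in ascending order."""
    m = len(pattern)
    # Binary search for the first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are contiguous in the array from `lo` onward.
    hits = []
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)
```

In a filter-and-verify pipeline, positions returned for each n-gram of the query would serve as candidate locations to be checked by a more expensive approximate comparison.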

Book ChapterDOI
13 Nov 2006
TL;DR: An entropy based Audio-Fingerprint delivering a framed, small footprint AFP is used which reduces the problem to a string matching problem and is able to correctly identify different renditions of masterpieces as well as pop music in less than a second per comparison.
Abstract: In this paper we address the problem of matching musical renditions of the same piece of music, also known as performances. We use an entropy based Audio-Fingerprint delivering a framed, small footprint AFP which reduces the problem to a string matching problem. The Entropy AFP has very low resolution (750 ms per symbol), making it suitable for flexible string matching. We show experimental results using dynamic time warping (DTW), Levenshtein or edit distance and the Longest Common Subsequence (LCS) distance. We are able to correctly (100%) identify different renditions of masterpieces as well as pop music in less than a second per comparison. The three approaches are 100% effective, but LCS and Levenshtein can be computed online, making them suitable for monitoring applications (unlike DTW), and since they are distances a metric index could be used to speed up the recognition process.

14 citations
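
Two of the measures named in the abstract above, Levenshtein distance and an LCS-based distance, can be sketched with row-by-row dynamic programming. The abstract does not define its LCS distance, so the form used here (insertions and deletions only, i.e. `len(a) + len(b) - 2 * LCS`) is an assumption; the fingerprint extraction itself is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distance(a: str, b: str) -> int:
    """Assumed LCS distance: symbols of a and b that fall outside the LCS."""
    return len(a) + len(b) - 2 * lcs_length(a, b)
```

Both functions keep only one previous row and consume the first string one symbol at a time, which is one way to see why such measures can be updated as new symbols of a stream arrive.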

Proceedings ArticleDOI
01 Aug 2002
TL;DR: A method for measuring dissimilarities between cyclic strings is introduced, which computes a weighted mean between two (lower and upper) bounds of the exact cyclic edit distance, which are founded on a window-constrained edit graph related to the strings involved.
Abstract: A method for measuring dissimilarities between cyclic strings is introduced. It computes a weighted mean between two (lower and upper) bounds of the exact cyclic edit distance, which are founded on a window-constrained edit graph related to the strings involved. Weights are the ones which minimize the sum of squared relative errors of the weighted solution with respect to exact values, on a training set of string pairs. This method takes $O(n^2)$ time. Experiments on both artificial and real data show the highly accurate solutions achieved by this technique, which is clearly faster than the most efficient exact algorithms.

14 citations
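
For reference, the exact cyclic edit distance that the abstract above approximates can be computed by brute force: take the minimum Levenshtein distance over all rotations of one string. The sketch below shows only this cubic-time exact quantity, not the paper's bound-based approximation, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cyclic_edit_distance(a: str, b: str) -> int:
    """Minimum edit distance between a and every rotation of b.

    Rotating only one of the two strings is sufficient. With n rotations
    and an O(n^2) comparison each, this takes O(n^3) time for strings
    of length n.
    """
    rotations = max(1, len(b))  # the empty string has a single "rotation"
    return min(levenshtein(a, b[k:] + b[:k]) for k in range(rotations))
```

For example, `cyclic_edit_distance("abcd", "cdab")` is 0 because the second string is a rotation of the first, whereas the ordinary edit distance between them is not.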

Patent
23 Jan 2007
TL;DR: In this patent, the Levenshtein Distance Algorithm (LDA) is augmented with additional information in the form of adjustments based on particular character substitutions, insertions and deletions, together with weighting based on multiple alternatives for the OCR text string.
Abstract: Methods and systems of mapping of an optical character recognition (OCR) text string to a code included in a coding dictionary by supplementing the Levenshtein Distance Algorithm (LDA) with additional information in the form of adjustments based on particular character substitutions, insertions and deletions together with weighting based on multiple alternatives for the OCR text string. In one embodiment, an OCR text string mapping method (100) includes receiving (110) an OCR text string, comparing (120) it with selected text strings from a coding dictionary, computing (130) modified Levenshtein distances associated with the comparisons by determining (140) substitution penalties, determining (150) insertion penalties, determining (160) deletion penalties and combining (170) the penalties, selecting (180) the best matching text string from the coding dictionary based on the modified Levenshtein distances, determining (190) whether a maximum threshold distance is met, and assigning (200) a code associated with the best matching text string to the OCR text string when met, and assigning (210) a null or no code when not met.

14 citations
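
The flow described in the abstract above (compute a modified Levenshtein distance, select the best dictionary entry, apply a maximum threshold, assign a code or none) can be sketched as below. This is an assumed illustration, not the patented method: the penalty table, default costs, dictionary layout and function names are hypothetical, and the weighting over multiple OCR alternatives is not modeled.

```python
def weighted_levenshtein(ocr, candidate, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit distance with per-character-pair substitution penalties.

    `sub_cost` maps (ocr_char, candidate_char) to a penalty; unlisted
    pairs cost 1.0. Confusable pairs such as ('0', 'O') can be made cheap.
    """
    prev = [j * ins_cost for j in range(len(candidate) + 1)]
    for i, co in enumerate(ocr, 1):
        curr = [i * del_cost]
        for j, cc in enumerate(candidate, 1):
            sub = 0.0 if co == cc else sub_cost.get((co, cc), 1.0)
            curr.append(min(
                prev[j] + del_cost,      # drop a spurious OCR character
                curr[j - 1] + ins_cost,  # supply a character the OCR missed
                prev[j - 1] + sub,       # match or penalized substitution
            ))
        prev = curr
    return prev[-1]

def assign_code(ocr, dictionary, sub_cost, max_distance):
    """Return the code of the closest dictionary text, or None if too far."""
    best_text = min(dictionary, key=lambda t: weighted_levenshtein(ocr, t, sub_cost))
    best = weighted_levenshtein(ocr, best_text, sub_cost)
    return dictionary[best_text] if best <= max_distance else None

# Hypothetical data for illustration only.
CODES = {"COOK": "C1", "CLERK": "C2"}
PENALTIES = {("0", "O"): 0.2, ("1", "L"): 0.2}
```

With these example values, `assign_code("C00K", CODES, PENALTIES, 1.0)` selects the entry for "COOK", since two cheap zero-for-O substitutions stay under the threshold, while an unrelated string falls through to `None`.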


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  8
2022  30
2021  32
2020  30
2019  48
2018  39