Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared within this topic, receiving 62,352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Journal Article
TL;DR: It is shown that there is a close connection between semi-local string alignment and a certain class of traditional comparison networks known as transposition networks, and it is concluded that the transposition network method is a very general and flexible way of understanding and improving different string comparison algorithms.
Abstract: Computing string or sequence alignments is a classical method of comparing strings and has applications in many areas of computing, such as signal processing and bioinformatics. Semi-local string alignment is a recent generalisation of this method, in which the alignment of a given string and all substrings of another string are computed simultaneously at no additional asymptotic cost. In this paper, we show that there is a close connection between semi-local string alignment and a certain class of traditional comparison networks known as transposition networks. The transposition network approach can be used to represent different string comparison algorithms in a unified form, and in some cases provides generalisations or improvements on existing algorithms. This approach allows us to obtain new algorithms for sparse semi-local string comparison and for comparison of highly similar and highly dissimilar strings, as well as of run-length compressed strings. We conclude that the transposition network method is a very general and flexible way of understanding and improving different string comparison algorithms, as well as their efficient implementation.

7 citations

Proceedings Article
Y. Mishina, K. Kojima
03 Oct 1993
TL;DR: Measurements show that this string matching algorithm, using the Integrated Database Processor (IDP), is more than 10 times faster than a scalar program using the Aho-Corasick method.
Abstract: The paper describes a new string matching algorithm that is suitable for vector processors. The hardware implementation of the algorithm is also presented. The algorithm consists of two parts. In the first part, candidate strings that are similar to pattern strings are extracted from a text string (cutout part). Candidate strings may include noise strings, and these are removed in the second part of the algorithm. Each part is efficiently vectorized using vector instructions of conventional vector processors for numerical computations. Moreover, the cutout part is implemented as an added instruction of the Integrated Database Processor (IDP). Measurements show that this algorithm using IDP is more than 10 times faster than a scalar program using the Aho-Corasick method.

7 citations
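
The two-part structure described in the abstract above (extract candidates, then remove noise) can be illustrated with a minimal scalar sketch. The abstract does not say which similarity criterion the cutout part uses, so the prefix filter here is an assumption, as are the names `find_patterns` and `prefix_len`. The vectorization and the IDP instruction are not reproduced.

```python
def find_patterns(text, patterns, prefix_len=2):
    """Two-phase multi-pattern search: cheap candidate extraction, then verification."""
    # Group patterns by their first prefix_len characters (the assumed filter key).
    by_prefix = {}
    for p in patterns:
        if len(p) < prefix_len:
            raise ValueError("pattern shorter than prefix_len")
        by_prefix.setdefault(p[:prefix_len], []).append(p)

    # Phase 1 (candidate extraction): keep positions whose prefix matches
    # the prefix of at least one pattern. Some of these are noise.
    candidates = [
        i for i in range(len(text) - prefix_len + 1)
        if text[i:i + prefix_len] in by_prefix
    ]

    # Phase 2 (noise removal): full comparison at each candidate position.
    matches = []
    for i in candidates:
        for p in by_prefix[text[i:i + prefix_len]]:
            if text.startswith(p, i):
                matches.append((i, p))
    return matches


if __name__ == "__main__":
    print(find_patterns("she sells seashells", ["she", "sea", "shell"]))
```

Each phase is a simple loop over independent positions, which is the property that makes this kind of scheme a candidate for vector instructions.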

Journal Article
TL;DR: A new branch and bound (BB) pruning strategy that can be applied to dictionary-based approximate string matching when the dictionary is stored as a trie, which combines the advantages of partitioning the dictionary according to the string lengths, and the advantages gleaned by representing the dictionary H using the trie data structure.
Abstract: This paper deals with the problem of estimating a transmitted string X* by processing the corresponding string Y, which is a noisy version of X*. We assume that Y contains substitution, insertion, and deletion errors, and that X* is an element of a finite (but possibly large) dictionary, H. The best estimate X+ of X* is defined as that element of H which minimizes the generalized Levenshtein distance D(X, Y) between X and Y such that the total number of errors is not more than K, for all X ∈ H. The trie is a data structure that offers search costs that are independent of the document size. Tries also combine prefixes together, and so by using tries in approximate string matching we can utilize the information obtained in the process of evaluating any one D(X_i, Y), to compute any other D(X_j, Y), where X_i and X_j share a common prefix. In the artificial intelligence (AI) domain, branch and bound (BB) schemes are used when we want to prune paths that have costs above a certain threshold. These techniques have been applied to prune, for example, game trees. In this paper, we present a new BB pruning strategy that can be applied to dictionary-based approximate string matching when the dictionary is stored as a trie. The new strategy attempts to look ahead at each node, c, before moving further, by merely evaluating a certain local criterion at c. The search algorithm according to this pruning strategy will not traverse inside the subtrie(c) unless there is a "hope" of determining a suitable string in it. In other words, as opposed to the reported trie-based methods (Kashyap and Oommen in Inf Sci 23(2):123–142, 1981; Shang and Merrett in IEEE Trans Knowledge Data Eng 8(4):540–547, 1996), the pruning is done a priori before even embarking on the edit distance computations. The new strategy depends highly on the variance of the lengths of the strings in H. It combines the advantages of partitioning the dictionary according to the string lengths, and the advantages gleaned by representing H using the trie data structure. The results demonstrate a marked improvement (up to 30% when costs are of a 0/1 form, and up to 47% when costs are general) with respect to the number of operations needed on three benchmark dictionaries.

7 citations
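
The idea of sharing edit distance work across common prefixes, and of looking ahead at a node before descending, can be sketched as follows. The abstract does not state the paper's local criterion, so this sketch assumes a simple one: with 0/1 costs the edit distance is at least the difference in string lengths, so a subtrie is skipped when none of its word lengths lies within K of the length of Y. The names `TrieNode`, `build_trie`, and `trie_search` are hypothetical, and this is not the authors' algorithm.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False
        self.min_len = float("inf")  # shortest word stored in this subtrie
        self.max_len = 0             # longest word stored in this subtrie


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        node.min_len = min(node.min_len, len(w))
        node.max_len = max(node.max_len, len(w))
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
            node.min_len = min(node.min_len, len(w))
            node.max_len = max(node.max_len, len(w))
        node.is_word = True
    return root


def trie_search(root, y, k):
    """Return {word: distance} for words within unit-cost edit distance k of y."""
    results = {}
    n = len(y)

    def hopeless(node):
        # Assumed look-ahead criterion: every word length in the subtrie
        # differs from len(y) by more than k, so no word can be within k.
        return node.min_len > n + k or node.max_len < n - k

    def visit(node, prefix, row):
        # row[j] is the edit distance between prefix and y[:j].
        if node.is_word and row[n] <= k:
            results[prefix] = row[n]
        for ch, child in node.children.items():
            if hopeless(child):
                continue  # pruned before any distance computation
            new_row = [row[0] + 1]
            for j in range(1, n + 1):
                new_row.append(min(
                    new_row[j - 1] + 1,             # extra character in y
                    row[j] + 1,                     # extra character in prefix
                    row[j - 1] + (y[j - 1] != ch),  # substitution or match
                ))
            if min(new_row) <= k:  # conventional cutoff on the DP row
                visit(child, prefix + ch, new_row)

    if not hopeless(root):
        visit(root, "", list(range(n + 1)))
    return results
```

Each trie edge extends the dynamic programming table by one row, so words sharing a prefix share the rows computed for that prefix. The length check runs before a child's row is computed, which mirrors the a priori character of the pruning described in the abstract.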

01 Jan 2004
TL;DR: Preliminary experimental results suggest that cross-domain approximate string matching can be applied to searching a database of scanned typeset documents using handwritten queries without requiring the correction of recognition errors.
Abstract: In this paper, we show how cross-domain approximate string matching can be applied to searching a database of scanned typeset documents using handwritten queries without requiring the correction of recognition errors. We present preliminary experimental results that suggest this approach can significantly improve retrieval effectiveness.

7 citations

Journal Article
TL;DR: The method combines linking the text according to digrams, searching on the least‐frequent digram, and probing selected characters as a preliminary filter before full pattern comparison to reduce the number of character comparisons.
Abstract: We present a string matching or pattern matching method which is especially useful when a single block of text must be searched repeatedly for different patterns. The method combines linking the text according to digrams, searching on the least‐frequent digram, and probing selected characters as a preliminary filter before full pattern comparison. Tests on real alphabetic data show that the number of character comparisons may be decreased by two orders of magnitude compared with Knuth–Morris–Pratt and similar searching, but with an initialization overhead comparable to five to ten conventional searches. Copyright © 2001 John Wiley & Sons, Ltd.

7 citations
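
The three ingredients named in the abstract above (linking the text by digrams, searching on the least-frequent digram, probing selected characters before the full comparison) can be sketched roughly as below. This is an illustration under assumptions rather than the published method: a dictionary of position lists stands in for the linked text, the probe checks only the first and last pattern characters, and the names `build_digram_index` and `digram_search` are hypothetical.

```python
from collections import defaultdict


def build_digram_index(text):
    """Map each digram (pair of adjacent characters) to its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - 1):
        index[text[i:i + 2]].append(i)
    return index


def digram_search(text, index, pattern):
    """Return all start positions of pattern in text (exact matching)."""
    m = len(pattern)
    if m == 0:
        return []
    if m == 1:
        # No digram available; fall back to a plain scan.
        return [i for i, ch in enumerate(text) if ch == pattern]

    # Pick the pattern digram that occurs least often in the text.
    offset = min(range(m - 1),
                 key=lambda j: len(index.get(pattern[j:j + 2], ())))

    hits = []
    for pos in index.get(pattern[offset:offset + 2], ()):
        start = pos - offset
        if start < 0 or start + m > len(text):
            continue
        # Probe (assumed choice): first and last characters only.
        if text[start] != pattern[0] or text[start + m - 1] != pattern[-1]:
            continue
        # Full comparison for candidates that survive the probe.
        if text.startswith(pattern, start):
            hits.append(start)
    return hits
```

The index is built once per text and reused for every pattern, which corresponds to the initialization overhead the abstract mentions for repeated searches over a single block of text.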


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
Number of papers in the topic in previous years
Year  Papers
2023  8
2022  30
2021  32
2020  30
2019  48
2018  39