Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared on this topic, receiving 62,352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.


Papers
Proceedings Article
01 Sep 1992
TL;DR: This paper presents a new algorithmic technique for two-dimensional matching, that of periodicity analysis, and introduces a new pattern matching paradigm - Compressed Matching
Abstract: String matching is rich with a variety of algorithmic tools. In contrast, multidimensional matching has a rather sparse set of techniques. This paper presents a new algorithmic technique for two-dimensional matching, that of periodicity analysis. Periodicity in strings has been used to solve string matching problems. The success of these algorithms suggests that periodicity can be as important a tool in multidimensional matching. However, multidimensional periodicity is not as simple as it is in strings and was not formally studied or used in pattern matching. This paper's main contribution is defining and analysing two-dimensional periodicity in rectangular arrays. In addition, we introduce a new pattern matching paradigm - Compressed Matching. A text array T and a pattern array P are given in compressed forms c(T) and c(P). We seek all appearances of P in T, without decompressing T. By using periodicity analysis, we show that for two-dimensional run-length compression there is an O(|c(T)| log|P| + |P|) algorithm, which is almost optimal and can achieve a search time that is sublinear in the size of the text |T|.

87 citations
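
The two building blocks named in the abstract above, string periodicity and run-length compression, can be illustrated in one dimension with a minimal sketch. This is not the paper's two-dimensional algorithm: `smallest_period` uses the standard prefix (failure) function, `run_length_rows` encodes each row of an array independently, and both function names are hypothetical.

```python
def smallest_period(s):
    """Return the smallest p > 0 with s[i] == s[i + p] for all valid i (0 for "")."""
    n = len(s)
    if n == 0:
        return 0
    # fail[i] = length of the longest proper border of s[:i + 1]
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    # The longest border of the whole string determines its smallest period.
    return n - fail[-1]


def run_length_rows(array):
    """Run-length encode each row of a 2D array as a list of (symbol, count) pairs."""
    encoded = []
    for row in array:
        runs = []
        for symbol in row:
            if runs and runs[-1][0] == symbol:
                runs[-1] = (symbol, runs[-1][1] + 1)
            else:
                runs.append((symbol, 1))
        encoded.append(runs)
    return encoded
```

For example, `smallest_period("abcabcab")` is 3, because the string has the border "abcab" of length 5. A compressed-matching algorithm would work on the output of something like `run_length_rows` directly instead of expanding it back into the full array.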

Journal Article
TL;DR: Three filtering algorithms for exact string matching of multiple patterns are presented. They apply q-grams and bit parallelism and appeared to be substantially faster than earlier solutions for sets of 1,000--10,000 patterns.
Abstract: We present three algorithms for exact string matching of multiple patterns. Our algorithms are filtering methods, which apply q-grams and bit parallelism. We ran extensive experiments with them and compared them with various versions of earlier algorithms, e.g., different trie implementations of the Aho--Corasick algorithm. All of our algorithms appeared to be substantially faster than earlier solutions for sets of 1,000--10,000 patterns and the good performance of two of them continues to 100,000 patterns. The gain is because of the improved filtering efficiency caused by q-grams.

87 citations
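
The filtering idea in the abstract above can be sketched as a two-phase search: a cheap q-gram lookup selects candidate patterns at each text position, and only those candidates are verified character by character. This is a simplified illustration, not one of the paper's three algorithms; it omits bit parallelism, indexes only the first q-gram of each pattern, and the names `build_qgram_index` and `find_all` are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(patterns, q):
    """Map the first q-gram of every pattern to the patterns that start with it."""
    index = defaultdict(list)
    for pattern in patterns:
        if len(pattern) < q:
            raise ValueError("every pattern must have at least q characters")
        index[pattern[:q]].append(pattern)
    return index


def find_all(text, patterns, q=3):
    """Return (position, pattern) for every exact occurrence of any pattern."""
    index = build_qgram_index(patterns, q)
    matches = []
    for i in range(len(text) - q + 1):
        # Filtering phase: most positions have no candidate and are skipped.
        for pattern in index.get(text[i:i + q], ()):
            # Verification phase: confirm the full pattern at this position.
            if text.startswith(pattern, i):
                matches.append((i, pattern))
    return matches
```

With a larger q, fewer text positions survive the filter, which is the kind of filtering efficiency the abstract credits for the speedup.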

Patent
Lauri Karttunen
16 May 1997
TL;DR: A processor-implemented method of modifying a string of a regular language, which includes at least two symbols and at least two predetermined substrings, is described. The processor replaces the matching substring with the string of the lower language associated with the selected preselected substring and outputs the modified string.
Abstract: A processor-implemented method of modifying a string of a regular language, which includes at least two symbols and at least two predetermined substrings. Upon receipt of the string, the processor determines an initial position within the string of a substring matching one of the preselected substrings. To make this determination, the processor matches symbols of the string either starting from the left and proceeding to the right or starting from the right and proceeding to the left. After identifying the initial position, the processor then selects either the longest or the shortest of the preselected substrings. The processor then replaces the matching substring with the string of the lower language associated with the selected preselected substring and outputs the modified string.

86 citations
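
The replacement procedure in the abstract above can be sketched for the left-to-right case as follows. This is an illustration under simplifying assumptions, not the patented method: the preselected substrings and their lower-language strings are held in a plain dictionary called `rules`, right-to-left matching is not shown, and the function name `replace_directed` is hypothetical.

```python
def replace_directed(text, rules, longest=True):
    """Scan left to right, replacing the longest (or shortest) rule key that
    matches at each position with its associated string."""
    out = []
    i = 0
    while i < len(text):
        # All non-empty preselected substrings that match at position i.
        candidates = [upper for upper in rules if upper and text.startswith(upper, i)]
        if candidates:
            chosen = max(candidates, key=len) if longest else min(candidates, key=len)
            out.append(rules[chosen])
            i += len(chosen)  # continue after the replaced substring
        else:
            out.append(text[i])  # no rule applies here, copy the symbol
            i += 1
    return "".join(out)
```

With `rules = {"ab": "X", "abc": "Y"}`, the input "abcab" becomes "YX" under longest match and "XcX" under shortest match, which shows why the choice between the two has to be fixed in advance.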

Patent
23 Jul 1997
TL;DR: A dictionary-based data compression and decompression system is proposed in which, when the compressor matches a partial string W and a character C in the dictionary, a new string is entered into the dictionary with C as an extension character on the string PW, where P is the string corresponding to the last output compressed code signal.
Abstract: A dictionary based data compression and decompression system where, in the compressor, when a partial string W and a character C are matched in the dictionary, a new string is entered into the dictionary with C as an extension character on the string PW where P is the string corresponding to the last output compressed code signal. An update string is entered into the compression dictionary for each input character that is read and matched. The updating is immediate and interleaved with the character-by-character matching of the current string. The update process continues until the longest match is found in the dictionary. The code of the longest matched string is output in a string matching cycle. If a single character or multi-character string "A" exists in the dictionary, the string AAA . . . A is encoded in two compressed code signals regardless of the string length. This encoding results in an unrecognized code signal at the decompressor. The decompressor, in response to an unrecognized code signal, enters update strings into the decompressor dictionary in accordance with the recovered string corresponding to the previously received code signal, the unrecognized code signal, the extant code of the decompressor and the number of characters in the previously recovered string.

85 citations
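
For context, the unrecognized-code situation mentioned in the abstract above also arises in classic LZW, where the decompressor can receive a code it has not yet entered into its dictionary. The sketch below is textbook LZW, not the patented update scheme (it does not encode a run AAA . . . A in two codes); it assumes every input character has a code point below 256, and the function names are hypothetical.

```python
def lzw_compress(data):
    """Classic LZW: return a list of integer codes for the input string."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""
    codes = []
    for c in data:
        wc = w + c
        if wc in dictionary:
            w = wc  # keep extending the current match
        else:
            codes.append(dictionary[w])   # emit the longest match
            dictionary[wc] = next_code    # add the match plus one character
            next_code += 1
            w = c
    if w:
        codes.append(dictionary[w])
    return codes


def lzw_decompress(codes):
    """Invert lzw_compress, handling a code that is not yet in the dictionary."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # Unrecognized code: it must be the previous string extended
            # by that string's own first character.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out)
```

Compressing "AAAA" yields the codes 65, 256, 65. When the decompressor reads 256 it has no entry for that code yet and reconstructs "AA" from the previously recovered string, which is the classic form of the situation the abstract handles at the decompressor.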

Proceedings Article
01 Mar 2010
TL;DR: This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases based on the R-tree augmented with the min-wise signature and the linear hashing technique.
Abstract: This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases. The MHR-tree is based on the R-tree augmented with the min-wise signature and the linear hashing technique. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on set resemblance between the query string and the q-grams from the sub-trees of index nodes. MHR-tree supports a wide range of query predicates efficiently, including range and nearest neighbor queries. We also discuss how to estimate range query selectivity accurately. We present a novel adaptive algorithm for finding balanced partitions using both the spatial and string information stored in the tree. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approach.

84 citations
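
The min-wise signature over q-grams described in the abstract above can be sketched as follows. This is a generic min-hash illustration, not the MHR-tree itself: the seeded SHA-256 hash family, the default of 64 hash functions, and all function names are assumptions made for the example.

```python
import hashlib


def qgrams(s, q=2):
    """Return the set of length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _hash(gram, seed):
    """One member of a seeded hash family (an assumption of this sketch)."""
    digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minwise_signature(grams, num_hashes=64):
    """For each hash function, keep the minimum hash value over the set."""
    if not grams:
        raise ValueError("signature of an empty set is undefined")
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def merge_signatures(a, b):
    """Signature of the union of two sets: the component-wise minimum."""
    return [min(x, y) for x, y in zip(a, b)]


def estimate_resemblance(a, b):
    """Fraction of agreeing components, an estimate of |A & B| / |A | B|."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Because the signature of a union is the component-wise minimum, a parent node's signature can be derived from its children's signatures without revisiting the underlying strings, which is what makes a fixed-size summary per index node practical.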


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39