Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have appeared within this topic, receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.


Papers
Journal Article
TL;DR: An invariant handwritten Chinese character recognition system is proposed, and fuzzy matching is verified through extensive experiments with the character set, which show that the performance of the proposed invariant features is clearly superior to that of moment invariants.

11 citations

Journal Article
TL;DR: Two different versions of the problem of finding maximal multirepeats in a set of strings are presented, in the case of arbitrary gaps and when the gap is bounded in a small range c.
Abstract: A multirepeat in a string is a substring (factor) that appears a predefined number of times. A multirepeat is maximal if it cannot be extended either to the right or to the left and produce a multirepeat. In this paper, we present algorithms for two different versions of the problem of finding maximal multirepeats in a set of strings. In the case of arbitrary gaps, we propose an algorithm with O(σN²n + α) time complexity. When the gap is bounded in a small range c, we propose an algorithm with O((c² + σ²)mN²n log(Nn) + α) time complexity. Here, N is the number of strings, n the mean length of each string, m the multiplicity of the multirepeat and α the number of reported occurrences. Our results extend previous work by considering sets of strings as well as by generalizing pairs to multirepeats.

11 citations

Journal Article
TL;DR: This article discusses a uniform way of modifying bit-parallel approximate string matching algorithms to also permit a fourth type of edit operation, transposing two adjacent characters in the pattern; the resulting edit distance is also known as the Damerau edit distance.
Abstract: Using bit-parallelism has resulted in fast and practical algorithms for approximate string matching under the Levenshtein edit distance, which permits a single edit operation to insert, delete or substitute a character. Depending on the parameters of the search, currently the fastest non-filtering algorithms in practice are the O(kn⌈m/w⌉) algorithm of Wu & Manber, the O(⌈km/w⌉n) algorithm of Baeza-Yates & Navarro, and the O(⌈m/w⌉n) algorithm of Myers, where m is the pattern length, n is the text length, k is the error threshold and w is the computer word size. In this paper we discuss a uniform way of modifying each of these algorithms to permit also a fourth type of edit operation: transposing two adjacent characters in the pattern. This type of edit distance is also known as the Damerau edit distance. In the end we also present an experimental comparison of the resulting algorithms.

11 citations
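
To make the four edit operations in the entry above concrete, here is a minimal sketch of an edit distance that allows insertion, deletion, substitution and transposition of two adjacent characters. It uses plain dynamic programming, not the bit-parallel algorithms the abstract describes, and it computes the restricted variant often called optimal string alignment; that choice and the function name `osa_distance` are assumptions made for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition.

    Restricted (optimal string alignment) variant: no substring is edited
    more than once. Runs in O(len(a) * len(b)) time and space.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(osa_distance("ab", "ba"))           # one transposition
    print(osa_distance("kitten", "sitting"))  # no transposition helps here
```

Under plain Levenshtein distance the pair "ab" and "ba" would cost two edits; allowing transposition reduces that to one, which is the effect the fourth operation is meant to capture.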

Proceedings Article
17 Jun 2007
TL;DR: It is shown that fuzzy matching can recover new versions of GNU Emacs source from older versions, and can improve the performance of underlying distributed file storage systems by potentially saving significant network bandwidth and reducing file transmission costs.
Abstract: The fuzzy file block matching technique (fuzzy matching for short) was first proposed for opportunistic use of Content Addressable Storage. Fuzzy matching aims to increase the hit ratio in content-addressable storage providers, and thus can improve the performance of underlying distributed file storage systems by potentially saving significant network bandwidth and reducing file transmission costs. Fuzzy matching employs shingling to represent the fuzzy hashing of file blocks for similarity detection, and error-correcting information to reconstruct the canonical content of a file block from some similar blocks. In this paper, we present the implementation details of fuzzy matching and a very basic evaluation of its performance. In particular, we show that fuzzy matching can recover new versions of GNU Emacs source from older versions.

11 citations
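
The shingling step mentioned in the entry above can be illustrated with a short sketch. It covers only similarity detection between two blocks, not the error-correcting reconstruction, and the shingle length `k`, the use of Jaccard similarity and the function names are assumptions for illustration rather than details taken from the paper.

```python
from __future__ import annotations


def shingles(block: bytes, k: int = 4) -> set[bytes]:
    """Return the set of all length-k byte windows (shingles) of a block."""
    if len(block) < k:
        # A block shorter than k is treated as a single shingle.
        return {block} if block else set()
    return {block[i:i + k] for i in range(len(block) - k + 1)}


def jaccard(a: set[bytes], b: set[bytes]) -> float:
    """Jaccard similarity of two shingle sets, in the range [0.0, 1.0]."""
    if not a and not b:
        return 1.0  # two empty blocks are considered identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    old_block = b"abcdef"
    new_block = b"abcdeg"
    print(jaccard(shingles(old_block), shingles(new_block)))
```

Two blocks that differ in a single byte still share most of their shingles, so a high score suggests that a stored block may be a useful starting point for reconstructing the requested one.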

Book Chapter
31 Mar 2009
TL;DR: This work presents several non-trivial applications of Matryoshka counters in string matching algorithms, improving their worst- or average-case time complexities.
Abstract: Many algorithms, e.g. in the field of string matching, are based on handling many counters, which can be performed in parallel, even on a sequential machine, using bit-parallelism. The recently presented technique of nested counters (Matryoshka counters) [1] is to handle small counters most of the time, and refer to larger counters periodically, when the small counters may get full, to prevent overflow. In this work, we present several non-trivial applications of Matryoshka counters in string matching algorithms, improving their worst- or average-case time complexities. The set of problems comprises (Δ,α)-matching, matching with k insertions, episode matching, and matching under Levenshtein distance.

11 citations
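
The nested-counter idea in the entry above can be sketched in a simplified form: several small counters are packed into one integer and updated with a single addition, and they are flushed into full-width counters just before any of them could overflow. This is a two-level illustration only, not the construction from the cited work; the class name `NestedCounters`, the field width `small_bits` and the flush policy are all assumptions.

```python
class NestedCounters:
    """Two-level counters: packed small fields plus full-width totals."""

    def __init__(self, count: int, small_bits: int = 3):
        self.count = count
        self.bits = small_bits
        self.limit = (1 << small_bits) - 1  # largest value a small field holds
        self.packed = 0                     # all small counters in one integer
        self.large = [0] * count            # full-width counters
        self.steps = 0                      # additions since the last flush

    def add(self, flags):
        """Increment counter i by flags[i], where each flag is 0 or 1."""
        delta = 0
        for i, flag in enumerate(flags):
            delta |= (flag & 1) << (i * self.bits)
        # In a real matcher delta would come from a precomputed table;
        # the point is that one addition updates every small counter.
        self.packed += delta
        self.steps += 1
        if self.steps == self.limit:
            self.flush()  # a field may now be full, so move it up a level

    def flush(self):
        """Add every small counter into its large counter and reset."""
        for i in range(self.count):
            self.large[i] += (self.packed >> (i * self.bits)) & self.limit
        self.packed = 0
        self.steps = 0

    def values(self):
        """Current totals, combining both levels."""
        return [
            self.large[i] + ((self.packed >> (i * self.bits)) & self.limit)
            for i in range(self.count)
        ]


if __name__ == "__main__":
    counters = NestedCounters(count=3, small_bits=2)
    for _ in range(10):
        counters.add([1, 0, 1])
    print(counters.values())
```

Because each field grows by at most one per addition, flushing after `limit` additions guarantees that no field carries into its neighbour.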


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  8
2022  30
2021  32
2020  30
2019  48
2018  39