Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared within this topic, receiving 62,352 citations. The topic is also known as fuzzy string searching or fuzzy string matching.


Papers
Journal Article
TL;DR: By using generating function methods, it is shown that the number of distinct correlations of length n is independent of the alphabet size and is of order n^{log n}.

202 citations
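
For concreteness, the correlation of a string with itself can be read as an n-bit vector whose bit i is 1 when shifting the string by i positions leaves the overlapping parts equal. The sketch below is a brute-force illustration under that reading, not the generating-function method of the paper; the names `autocorrelation` and `count_correlations` are chosen for this example only.

```python
from itertools import product

def autocorrelation(s):
    """Bit i is 1 when s shifted by i positions matches itself on the overlap."""
    n = len(s)
    return tuple(int(s[i:] == s[:n - i]) for i in range(n))

def count_correlations(n, alphabet):
    """Count distinct autocorrelation vectors over all strings of length n.

    Brute force: enumerates len(alphabet)**n strings, so only small n is practical.
    """
    return len({autocorrelation(w) for w in product(alphabet, repeat=n)})

if __name__ == "__main__":
    # Compare a binary and a ternary alphabet for small lengths.
    for n in range(1, 7):
        print(n, count_correlations(n, "ab"), count_correlations(n, "abc"))
```

For each small length, the binary and ternary counts agree, which is the alphabet independence stated in the summary.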

Book Chapter
29 Apr 1992
TL;DR: An optimal sequential solution of the color set size problem and string matching applications including a linear time algorithm for the problem of finding the longest substring common to at least k out of m input strings for all k between 1 and m is given.
Abstract: The Color Set Size problem is: given a rooted tree of size n with l leaves colored from 1 to m, m ≤ l, for each vertex u find the number of different leaf colors in the subtree rooted at u. This problem formulation, together with the Generalized Suffix Tree data structure, has applications to string matching. This paper gives an optimal sequential solution of the color set size problem and string matching applications, including a linear time algorithm for the problem of finding the longest substring common to at least k out of m input strings for all k between 1 and m. In addition, parallel solutions to the above problems are given. These solutions may shed light on problems in computational biology, such as the multiple string alignment problem.

200 citations
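
To make the problem statement concrete, here is a minimal sketch that computes the color set size of every vertex by merging color sets bottom-up. It is a simple set-union baseline, not the optimal algorithm of the paper; the input layout (`children` as a dict of child lists, `leaf_color` as a dict from leaf to color) is an assumption made for this example.

```python
def color_set_sizes(children, leaf_color, root):
    """Return {vertex: number of distinct leaf colors in its subtree}.

    children: dict mapping a vertex to a list of its children (leaves may be absent).
    leaf_color: dict mapping each leaf to its color.
    Uses recursion, so very deep trees would need an iterative traversal.
    """
    sizes = {}

    def visit(u):
        kids = children.get(u, [])
        if not kids:
            colors = {leaf_color[u]}
        else:
            colors = set()
            for v in kids:
                sub = visit(v)
                # Merge the smaller set into the larger one to limit copying.
                if len(sub) > len(colors):
                    colors, sub = sub, colors
                colors |= sub
        sizes[u] = len(colors)
        return colors

    visit(root)
    return sizes

if __name__ == "__main__":
    # Hypothetical tree: root 0 has children 1 and 2; vertices 3..6 are leaves.
    children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
    leaf_color = {3: 1, 4: 2, 5: 1, 6: 1}
    print(color_set_sizes(children, leaf_color, 0))
```

In the string setting described in the abstract, the leaves would be suffixes in a generalized suffix tree and the color of a leaf would identify the input string it comes from.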

Proceedings Article
23 Sep 2007
TL;DR: Presents VGRAM, a novel technique that judiciously chooses high-quality grams of variable lengths from a collection of strings to support queries on the collection, and shows significant performance improvements on three existing algorithms.
Abstract: Many applications need to solve the following problem of approximate string matching: from a collection of strings, how to find those similar to a given string, or the strings in another (possibly the same) collection of strings? Many algorithms are developed using fixed-length grams, which are substrings of a string used as signatures to identify similar strings. In this paper we develop a novel technique, called VGRAM, to improve the performance of these algorithms. Its main idea is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and what is the relationship between the similarity of the gram sets of two strings and their edit distance. A primary advantage of the technique is that it can be adopted by a plethora of approximate string algorithms without the need to modify them substantially. We present our extensive experiments on real data sets to evaluate the technique, and show the significant performance improvements on three existing algorithms.

198 citations
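
As background for the fixed-length gram approach the abstract refers to, the sketch below shows a standard count filter: two strings within edit distance k must share at least max(|s|, |t|) - q + 1 - k*q of their q-grams, counted as multisets. This is the fixed-length baseline, not VGRAM itself, and the function names are illustrative assumptions.

```python
from collections import Counter

def qgrams(s, q):
    """Multiset of all length-q substrings of s (empty if s is shorter than q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def shared_grams(s, t, q):
    """Number of q-grams common to s and t, with multiplicity."""
    return sum((qgrams(s, q) & qgrams(t, q)).values())

def may_be_within(s, t, k, q=2):
    """Count filter for edit distance <= k.

    False means the pair is certainly farther apart than k edits.
    True means the pair is only a candidate and still needs verification.
    """
    bound = max(len(s), len(t)) - q + 1 - k * q
    return shared_grams(s, t, q) >= bound

if __name__ == "__main__":
    print(may_be_within("approximate", "aproximate", k=1))  # candidate pair
    print(may_be_within("approximate", "string", k=1))      # pruned pair
```

A filter like this only discards pairs; surviving candidates would still be checked with an exact edit distance computation.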

Proceedings Article
17 Oct 2015
TL;DR: This article shows that longest common subsequence, edit distance, and dynamic time warping have no strongly subquadratic time algorithms, i.e., no algorithms with running time O(n^{2-ε}) for any ε > 0, unless the Strong Exponential Time Hypothesis fails.
Abstract: Classic similarity measures of strings are longest common subsequence and Levenshtein distance (i.e., the classic edit distance). A classic similarity measure of curves is dynamic time warping. These measures can be computed by simple O(n^2) dynamic programming algorithms, and despite much effort no algorithms with significantly better running time are known. We prove that, even restricted to binary strings or one-dimensional curves, respectively, these measures do not have strongly subquadratic time algorithms, i.e., no algorithms with running time O(n^{2-ε}) for any ε > 0, unless the Strong Exponential Time Hypothesis fails. We generalize the result to edit distance for arbitrary fixed costs of the four operations (deletion in one of the two strings, matching, substitution), by identifying trivial cases that can be solved in constant time, and proving quadratic-time hardness on binary strings for all other cost choices. This improves and generalizes the known hardness result for Levenshtein distance [Backurs, Indyk STOC'15] by the restriction to binary strings and the generalization to arbitrary costs, and adds important problems to a recent line of research showing conditional lower bounds for a growing number of quadratic time problems. As our main technical contribution, we introduce a framework for proving quadratic-time hardness of similarity measures. To apply the framework it suffices to construct a single gadget, which encapsulates all the expressive power necessary to emulate a reduction from satisfiability. Finally, we prove quadratic-time hardness for longest palindromic subsequence and longest tandem subsequence via reductions from longest common subsequence, showing that conditional lower bounds based on the Strong Exponential Time Hypothesis also apply to string problems that are not necessarily similarity measures.

195 citations
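
The simple quadratic dynamic programs mentioned in the abstract can be sketched as follows for Levenshtein distance (unit costs) and longest common subsequence. This is the textbook formulation kept to two rows of the table, offered as an illustration rather than as code from the paper.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution.

    Runs in O(len(a) * len(b)) time and O(len(b)) space.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def lcs_length(a, b):
    """Length of a longest common subsequence, same quadratic table shape."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(lcs_length("AGGTAB", "GXTXAYB"))   # 4
```

Both functions fill a table with one cell per pair of prefixes, which is where the O(n^2) running time comes from.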

Journal Article
TL;DR: An efficient, highly accurate image retrieval system based on two novel features, the composite sub-band gradient vector and the energy distribution pattern string, both generated from the sub-images of a wavelet decomposition of the original image.

195 citations


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39