scispace - formally typeset
Search or ask a question
Topic

Approximate string matching

About: Approximate string matching is a research topic. Over the lifetime, 1903 publications have been published within this topic receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.


Papers
More filters
Journal ArticleDOI
TL;DR: The developed computer-assisted system can help marine mammalogists in their identification of dolphins, since it allows them to examine only a handful of candidate images instead of the currently used manual searching of the entire database.
Abstract: This paper presents a syntactic/semantic string representation scheme as well as a string matching method as part of a computer-assisted system to identify dolphins from photographs of their dorsal fins. A low-level string representation is constructed from the curvature function of a dolphin's fin trailing edge, consisting of positive and negative curvature primitives. A high-level string representation is then built over the low-level string via merging appropriate groupings of primitives in order to have a less sensitive representation to curvature fluctuations or noise. A family of syntactic/semantic distance measures between two strings is introduced. A composite distance measure is then defined and used as a dissimilarity measure for database search, highlighting both the syntax (structure or sequence) and semantic (attribute or feature) differences. The syntax consists of an ordered sequence of significant protrusions and intrusions on the edge, while the semantics consist of seven attributes extracted from the edge and its curvature function. The matching results are reported for a database of 624 images corresponding to 164 individual dolphins. The identification results indicate that the developed string matching method performs better than the previous matching methods including dorsal ratio, curvature, and curve matching. The developed computer-assisted system can help marine mammalogists in their identification of dolphins, since it allows them to examine only a handful of candidate images instead of the currently used manual searching of the entire database. © 2000 Biomedical Engineering Society. PAC00: 8780Tq, 4230Sy, 0705Pj

55 citations

Journal ArticleDOI
TL;DR: A randomized algorithm in deterministic time O(Nlog M) for estimating the score vector of matches between a text string of length N and a patternstring of length M, i.e., the vector obtained when the pattern is slid along the text, and the number of matches is counted for each position.
Abstract: We give a randomized algorithm in deterministic time O(Nlog M) for estimating the score vector of matches between a text string of length N and a pattern string of length M , i.e., the vector obtained when the pattern is slid along the text, and the number of matches is counted for each position. A direct application is approximate string matching. The randomized algorithm uses convolution to find an estimator of the scores; the variance of the estimator is particularly small for scores that are close to M , i.e., for approximate occurrences of the pattern in the text. No assumption is made about the probabilistic characteristics of the input, or about the size of the alphabet. The solution extends to string matching with classes, class complements, ``never match'' and ``always match'' symbols, to the weighted case and to higher dimensions.

55 citations

Book ChapterDOI
03 Apr 2002
TL;DR: A radically new indexing approach for approximate string matching where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space.
Abstract: We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us finding the R occurrences of a pattern of length m in a text of length n in average time O(mlog2 n+m2+R), using O(n log n) space and O(n log2 n) index construction time. This complexity improves by far over all other previous methods. We also show a simpler scheme needing O(n) space.

55 citations

Journal ArticleDOI
01 Jan 2012
TL;DR: This work presents an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time.
Abstract: Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of a Parikh vector q in the text s requires finding a substring t of s with p(t)=q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: Given a string s and two Parikh vectors u,v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u≤q≤v. This definition encompasses several natural versions of approximate Parikh vector search. We present an algorithm solving this problem in sub-linear expected time using a wavelet tree of s, which can be computed in time O(n) in a preprocessing phase. We then discuss a Scrabble-like variation of the problem, in which a weight function on the letters of s is given and one has to find all occurrences in s of a substring t with maximum weight having Parikh vector p(t)≤v. For the case of a binary alphabet, we present an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time.

55 citations


Network Information
Related Topics (5)
Server
79.5K papers, 1.4M citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
80% related
Scheduling (computing)
78.6K papers, 1.3M citations
79% related
Network packet
159.7K papers, 2.2M citations
78% related
Optimization problem
96.4K papers, 2.1M citations
78% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
20238
202230
202132
202030
201948
201839