
Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have been published within this topic, receiving 62,352 citations. The topic is also known as: fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Posted Content
TL;DR: This work calculates the distance between two string variables using the Jaro-Winkler distance metric, used in record linkage to compare first or last names in different sources.
Abstract: jarowinkler calculates the distance between two string variables using the Jaro-Winkler distance metric. The distance metric is often used in record linkage to compare first or last names in different sources.

8 citations
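
The Jaro-Winkler measure named in the abstract above can be sketched in a few lines. This is a minimal illustration of the textbook definition, not the code of the jarowinkler command itself; the function names, the prefix weight p = 0.1 and the prefix cap of 4 are customary choices assumed here, not taken from the abstract. The functions return a similarity in [0, 1]; under one common convention the distance is 1 minus that value.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the number of matched characters out of order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost is what makes the measure attractive for names in record linkage: two surnames that agree on their first few letters score higher than two that differ early.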

Journal Article
TL;DR: This work derives a sublinear-time algorithm for searching a noncircular pattern with k allowed mismatches, which is extended to the problem of approximate circular pattern matching with k mismatches and is the first nonfiltering method for approximate circular string matching in sublinear average time.
Abstract: We consider approximate string matching of a circular pattern consisting of the rotations of a pattern of length m. From SBNDM and Tuned Shift-Add, we derive a sublinear-time algorithm for searching a noncircular pattern with k allowed mismatches, which is extended to the problem of approximate circular pattern matching with k mismatches. We prove that the presented algorithms are average-optimal for m⋅⌈log₂(k+1)+1⌉ = O(w), where w is the size of the computer word in bits. Experiments conducted under the aforementioned condition show that the new k-mismatches algorithm for circular strings outperforms previous solutions in practice. In particular, our algorithm is the first nonfiltering method for approximate circular string matching in sublinear average time, which makes it more suitable than earlier filtering methods for high error levels k/m and small alphabets.

8 citations
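
The condition m⋅⌈log₂(k+1)+1⌉ = O(w) in the abstract above reflects how Shift-Add style algorithms keep one small mismatch counter per pattern position inside a machine word. The sketch below shows the classic Shift-Add idea for k mismatches, not the SBNDM-based sublinear algorithm of the paper. It uses Python's unbounded integers in place of a w-bit word, and the circular search simply tries every rotation, which is a naive assumption made only for illustration. All function names are hypothetical.

```python
def shift_add_search(text: str, pattern: str, k: int) -> list[int]:
    """Start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    b = k.bit_length() + 1                 # ceil(log2(k+1)) + 1 bits per field
    mask = (1 << (m * b)) - 1              # keep exactly m fields
    high = sum(1 << (i * b + b - 1) for i in range(m))  # overflow bit of each field

    def mismatch_vector(c: str) -> int:
        # Field i holds 1 if pattern[i] differs from character c.
        return sum(1 << (i * b) for i, p in enumerate(pattern) if p != c)

    table = {c: mismatch_vector(c) for c in set(text)}
    state = overflow = 0
    hits = []
    for j, c in enumerate(text):
        # Shift every counter to the next pattern position, add new mismatches.
        state = ((state << b) + table[c]) & mask
        # Remember counters that exceeded their range, then clear the flag
        # so a later addition cannot carry into the neighbouring field.
        overflow = ((overflow << b) | (state & high)) & mask
        state &= ~high
        # The top field counts mismatches of the alignment ending at j.
        if j >= m - 1 and ((state + overflow) >> ((m - 1) * b)) <= k:
            hits.append(j - m + 1)
    return hits


def circular_k_mismatch(text: str, pattern: str, k: int) -> list[int]:
    """Naive circular variant: search every distinct rotation separately."""
    rotations = {pattern[r:] + pattern[:r] for r in range(len(pattern))}
    found = set()
    for rot in rotations:
        found.update(shift_add_search(text, rot, k))
    return sorted(found)


print(shift_add_search("abcabd", "abd", 1))      # [0, 3]
print(circular_k_mismatch("xxcabxx", "abc", 0))  # [2]
```

Each text character costs a constant number of word operations as long as all m fields fit in one word, which is where the bound on m and k comes from.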

Book Chapter
29 Oct 2007
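TL;DR: This paper studies pattern matching with address errors, where the pattern is transformed through a sequence of rearrangement operations, each with an associated cost. It shows that the ℓ1 rearrangement distance can be approximated in linear time even for general strings, and gives efficient exact solutions and a faster approximation for the ℓ∞ rearrangement distance.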
Abstract: Recently, a new pattern matching paradigm was proposed: pattern matching with address errors. In this paradigm, approximate string matching problems are studied in which the content is unaltered and only the locations of the different entries may change. Specifically, a broad class of problems in this new paradigm was defined: the class of rearrangement errors. In this type of error, the pattern is transformed through a sequence of rearrangement operations, each with an associated cost. The natural ℓ1 and ℓ2 rearrangement systems were considered. A variant of the ℓ1-rearrangement distance problem seems more difficult: the one where the pattern is a general string that may have repeating symbols. The best algorithm presented for the general case is O(nm). In this paper, we show that even for general strings the problem can be approximated in linear time! This paper also considers another natural rearrangement system: the ℓ∞ rearrangement distance. For this new rearrangement system we provide efficient exact solutions for different variants of the problem, as well as a faster approximation.

8 citations
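
To make the rearrangement distances in the abstract above concrete, the following sketch computes the ℓ1 and ℓ∞ costs of turning one string into another of the same length when only positions may change. It assumes that pairing the i-th occurrence of each symbol in one string with the i-th occurrence in the other is an optimal pairing, which follows from a standard exchange argument for points on a line. It handles a single alignment only, not the full text-search problem or the approximation algorithms of the paper. The function names are hypothetical.

```python
from collections import defaultdict


def displacements(x: str, y: str):
    """Per-character moves |i - j| needed to rearrange x into y, or None."""
    if sorted(x) != sorted(y):
        return None                      # not rearrangements of each other
    pos_x = defaultdict(list)
    pos_y = defaultdict(list)
    for i, c in enumerate(x):
        pos_x[c].append(i)
    for j, c in enumerate(y):
        pos_y[c].append(j)
    # Assumption: matching occurrences of each symbol in order is optimal.
    return [abs(i - j) for c in pos_x for i, j in zip(pos_x[c], pos_y[c])]


def l1_rearrangement_distance(x: str, y: str):
    """Sum of all displacements (l1 cost), or None if no rearrangement exists."""
    moves = displacements(x, y)
    return None if moves is None else sum(moves)


def linf_rearrangement_distance(x: str, y: str):
    """Largest single displacement (l-infinity cost), or None."""
    moves = displacements(x, y)
    return None if moves is None else max(moves, default=0)


print(l1_rearrangement_distance("abc", "bca"))    # 4
print(linf_rearrangement_distance("abc", "bca"))  # 2
```

Evaluating this at every text position costs O(m) per alignment, which gives the O(nm) flavour of bound that the abstract mentions for the general case.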

Book Chapter
22 May 2013
TL;DR: This paper shows a slightly improved worst-case efficient multiple pattern matching algorithm, and a data structure that requires O(m) words of space and can be compressed to only use O(m log σ) bits of space while achieving query time O(n (log_σ m)^ε / y), and shows two other direct applications.
Abstract: In this paper we are concerned with the basic problem of string pattern matching: preprocess one or multiple fixed strings over an alphabet of size σ so as to be able to efficiently search for all occurrences of the string(s) in a given text T of length n. In our model, we assume that text and patterns are tightly packed so that any single character occupies log σ bits and thus any sequence of k consecutive characters in the text or the pattern occupies exactly k log σ bits. We first show a data structure that requires O(m) words of space (more precisely O(m log m) bits of space), where m is the total size of the patterns, and answers search queries in average-optimal O(n/y) time, where y is the length of the shortest pattern (y = m in case of a single pattern). This first data structure, while optimal in time, still requires O(m log m) bits of space, which might be too much considering that the patterns occupy only m log σ bits of space. We then show that our data structure can be compressed to only use O(m log σ) bits of space while achieving query time O(n (log_σ m)^ε / y), with ε any constant such that 0 < ε < 1. We finally show two other direct applications: average optimal pattern matching with worst-case guarantees and average optimal pattern matching with k differences. In the meantime we also show a slightly improved worst-case efficient multiple pattern matching algorithm.

8 citations
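
The packed-string model in the abstract above, where each character occupies log σ bits, can be illustrated with a small sketch. It packs strings into integers and compares a whole window against the pattern with a single integer comparison. This is only a linear scan meant to show the representation, not the average-optimal O(n/y) data structure of the paper. The function names and the example alphabet are assumptions, and the pattern is assumed to be non-empty.

```python
def pack(s: str, alphabet: str) -> tuple[int, int]:
    """Pack s into an integer using ceil(log2(sigma)) bits per character."""
    bits = max(1, (len(alphabet) - 1).bit_length())
    code = {c: i for i, c in enumerate(alphabet)}
    value = 0
    for c in s:
        value = (value << bits) | code[c]
    return value, bits


def packed_find(text: str, pattern: str, alphabet: str) -> list[int]:
    """Exact occurrences of pattern, comparing packed windows as integers."""
    packed_pattern, bits = pack(pattern, alphabet)
    m = len(pattern)
    mask = (1 << (m * bits)) - 1         # keeps the last m characters
    code = {c: i for i, c in enumerate(alphabet)}
    window = 0
    hits = []
    for j, c in enumerate(text):
        # Slide the packed window by one character.
        window = ((window << bits) | code[c]) & mask
        if j >= m - 1 and window == packed_pattern:
            hits.append(j - m + 1)
    return hits


print(pack("ACGT", "ACGT"))                    # (27, 2)
print(packed_find("ACGTACGT", "GTA", "ACGT"))  # [2]
```

With σ = 4 each character takes 2 bits, so a 64-bit word holds 32 characters. That density is what lets packed algorithms process many characters per word operation.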


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39