Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared within this topic, receiving 62,352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Journal Article
TL;DR: An algorithm that produces the shortest edit sequence transforming one string into another is presented and is optimal in the sense that it generates a minimal covering set of common substrings of one string with respect to another.
Abstract: The string-to-string correction problem is to find a minimal sequence of edit operations for changing a given string into another given string. Extant algorithms compute a longest common subsequence (LCS) of the two strings and then regard the characters not included in the LCS as the differences. However, an LCS does not necessarily include all possible matches, and therefore does not produce the shortest edit sequence. An algorithm that produces the shortest edit sequence transforming one string into another is presented. The algorithm is optimal in the sense that it generates a minimal covering set of common substrings of one string with respect to another. Two improvements of the basic algorithm are developed. The first improvement performs well on strings with few replicated symbols. The second improvement runs in time and space linear to the size of the input. Efficient algorithms for regenerating a string from an edit sequence are also presented.

239 citations
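The block-move idea in the abstract above can be illustrated with a small greedy sketch: at each position of the target, copy the longest substring that also occurs in the source, and fall back to adding a single character when nothing matches. This is an assumption-laden reading of the abstract, not the paper's code; the names `longest_match`, `block_moves`, and `apply_edits` are hypothetical, and the search here is a naive quadratic scan rather than the linear-time improvement the abstract mentions.

```python
from typing import List, Tuple


def longest_match(source: str, target: str, q: int) -> Tuple[int, int]:
    """Return (start, length) of the longest prefix of target[q:] found in source."""
    best_start, best_len = 0, 0
    for p in range(len(source)):
        length = 0
        while (p + length < len(source) and q + length < len(target)
               and source[p + length] == target[q + length]):
            length += 1
        if length > best_len:
            best_start, best_len = p, length
    return best_start, best_len


def block_moves(source: str, target: str) -> List[tuple]:
    """Greedily cover target with block moves from source (naive sketch)."""
    edits: List[tuple] = []
    q = 0
    while q < len(target):
        start, length = longest_match(source, target, q)
        if length > 0:
            # Copy source[start:start + length] into the output.
            edits.append(("move", start, length))
            q += length
        else:
            # The character does not occur in source at all: add it literally.
            edits.append(("add", target[q]))
            q += 1
    return edits


def apply_edits(source: str, edits: List[tuple]) -> str:
    """Regenerate the target string from the source and an edit sequence."""
    out = []
    for edit in edits:
        if edit[0] == "move":
            _, start, length = edit
            out.append(source[start:start + length])
        else:
            out.append(edit[1])
    return "".join(out)


edits = block_moves("abcdef", "defabcx")
print(edits)                         # [('move', 3, 3), ('move', 0, 3), ('add', 'x')]
print(apply_edits("abcdef", edits))  # defabcx
```

Note that the two moves reorder "abc" and "def", something an LCS-based diff cannot express as matches.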

Journal Article
TL;DR: A linear implementation of the optimal universal data compression methods of Lempel and Ziv is described and the main tool is McCreight's algorithm for constructing suffix trees.
Abstract: A linear implementation of the optimal universal data compression methods of Lempel and Ziv is described. The main tool is McCreight's algorithm for constructing suffix trees. Both bounded and unbounded memory are considered.

236 citations
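For orientation, a Lempel-Ziv style parse can be sketched as follows: each step either emits a literal character or a (start, length) reference to the longest earlier match. This is a naive quadratic illustration under our own assumptions, not the linear implementation described above; the suffix tree in the paper would take the place of the inner longest-match scan. The function names `lz_factorize` and `lz_decode` are hypothetical.

```python
from typing import List, Tuple, Union

Factor = Union[str, Tuple[int, int]]  # literal character or (start, length)


def lz_factorize(s: str) -> List[Factor]:
    """Parse s into literals and back-references (naive O(n^2) sketch)."""
    factors: List[Factor] = []
    i = 0
    while i < len(s):
        best_start, best_len = 0, 0
        for j in range(i):  # candidate starts strictly before position i
            length = 0
            # The match may run past i (self-overlap), as in LZ77.
            while i + length < len(s) and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_start, best_len = j, length
        if best_len == 0:
            factors.append(s[i])
            i += 1
        else:
            factors.append((best_start, best_len))
            i += best_len
    return factors


def lz_decode(factors: List[Factor]) -> str:
    """Rebuild the string, copying one character at a time to allow overlap."""
    out: List[str] = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            start, length = f
            for k in range(length):
                out.append(out[start + k])
    return "".join(out)


factors = lz_factorize("abababc")
print(factors)             # ['a', 'b', (0, 4), 'c']
print(lz_decode(factors))  # abababc
```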

Proceedings Article
01 Feb 2000
TL;DR: This work presents an algorithm that is faster than both the Galil-Giancarlo and Abrahamson algorithms, finding all locations where the pattern has at most k errors in time O(n√(k log k)).
Abstract: The string matching with mismatches problem is that of finding the number of mismatches between a pattern P of length m and every length-m substring of the text T. Currently, the fastest algorithms for this problem are the following. The Galil-Giancarlo algorithm finds all locations where the pattern has at most k errors (where k is part of the input) in time O(nk). The Abrahamson algorithm finds the number of mismatches at every location in time O(n√(m log m)). We present an algorithm that is faster than both. Our algorithm finds all locations where the pattern has at most k errors in time O(n√(k log k)). We also show an algorithm that solves the above problem in time O((n + nk^3/m) log k).

221 citations
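To make the problem statement above concrete, here is a naive O(nm) baseline for the k-mismatches problem. It is not any of the algorithms named in the abstract; it only shows what "all locations where the pattern has at most k errors" means. The function name `k_mismatch_positions` is hypothetical.

```python
from typing import List, Tuple


def k_mismatch_positions(text: str, pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for every alignment with at most k mismatches."""
    n, m = len(text), len(pattern)
    hits: List[Tuple[int, int]] = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            hits.append((i, mismatches))
    return hits


print(k_mismatch_positions("abcabd", "abd", 1))  # [(0, 1), (3, 0)]
print(k_mismatch_positions("abcabd", "abd", 0))  # [(3, 0)]
```

The early exit keeps each alignment to at most k + 1 mismatching comparisons, but the worst case remains O(nm) because matching characters are still compared one by one.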

Journal Article
TL;DR: Two polynomial-time approximation algorithms with approximation ratio 1 + ε for any small ε are presented, settling both the Closest String problem and the Closest Substring problem.
Abstract: The problem of finding a center string that is "close" to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings S = {s_1, s_2, ..., s_n}, each of length m, the Closest String problem is to find the smallest d and a string s of length m which is within Hamming distance d of each s_i ∈ S. This problem comes from coding theory when we are looking for a code not too far away from a given set of codes. The Closest Substring problem, with an additional input integer L, asks for the smallest d and a string s, of length L, which is within Hamming distance d of a substring, of length L, of each s_i. This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets and generating genetic probes in molecular biology. Whether there are efficient approximation algorithms for both problems are major open questions in this area. We present two polynomial-time approximation algorithms with approximation ratio 1 + ε for any small ε to settle both questions.

219 citations
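The Closest String definition above can be made concrete with an exhaustive search over all candidate centers. This is only a sketch of the problem statement for tiny inputs (it enumerates |alphabet|^m candidates), not the approximation scheme from the paper; the names `hamming`, `radius`, and `closest_string_bruteforce` are hypothetical, and all input strings are assumed to have equal length.

```python
from itertools import product
from typing import Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def radius(center: str, strings: Sequence[str]) -> int:
    """Largest Hamming distance from center to any input string."""
    return max(hamming(center, s) for s in strings)


def closest_string_bruteforce(strings: Sequence[str], alphabet: str) -> Tuple[str, int]:
    """Try every string of length m over alphabet; return (center, smallest d)."""
    m = len(strings[0])
    best, best_d = "", m + 1
    for candidate in product(alphabet, repeat=m):
        center = "".join(candidate)
        d = radius(center, strings)
        if d < best_d:
            best, best_d = center, d
    return best, best_d


print(closest_string_bruteforce(["aab", "abb", "bab"], "ab"))  # ('aab', 1)
```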

Patent
Barry Lynn Fritchman
02 May 2001
TL;DR: A method for matching a pattern string with a target string, where either string can contain single- or multi-character wild cards, is described. It preprocesses the pattern string into a prefix segment, a suffix segment, and zero or more interior segments.
Abstract: The method of the present invention is useful in a computer system including at least one client. The program executes a method for matching a pattern string with a target string, where either string can contain single or multi-character wild cards. The method includes the steps of preprocessing the pattern string into a prefix segment, a suffix segment, and zero or more interior segments. Next, the prefix segment, the suffix segment, and the interior segment(s) are matched with the target string.

217 citations
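The prefix / interior / suffix decomposition described above can be sketched for the simpler case where only the pattern contains wild cards. The wildcard characters are assumptions for illustration: `*` for the multi-character wild card and `?` for the single-character one. The patent's handling of wild cards in the target string is not covered, and the names `segment_matches` and `wildcard_match` are hypothetical.

```python
def segment_matches(segment: str, text: str, pos: int) -> bool:
    """True if segment matches text at pos, where '?' matches any one character."""
    if pos < 0 or pos + len(segment) > len(text):
        return False
    window = text[pos:pos + len(segment)]
    return all(p == "?" or p == t for p, t in zip(segment, window))


def wildcard_match(pattern: str, target: str) -> bool:
    """Match pattern against the whole target using segment decomposition."""
    parts = pattern.split("*")
    if len(parts) == 1:
        # No multi-character wild card: lengths must agree exactly.
        return len(pattern) == len(target) and segment_matches(pattern, target, 0)

    prefix, interior, suffix = parts[0], parts[1:-1], parts[-1]
    if len(prefix) + len(suffix) > len(target):
        return False
    end = len(target) - len(suffix)  # interior segments must fit before this index
    if not segment_matches(prefix, target, 0):
        return False
    if not segment_matches(suffix, target, end):
        return False

    pos = len(prefix)
    for seg in interior:
        # Take the leftmost occurrence; this leaves the most room for later segments.
        while pos + len(seg) <= end and not segment_matches(seg, target, pos):
            pos += 1
        if pos + len(seg) > end:
            return False
        pos += len(seg)
    return True


print(wildcard_match("ab*cd*ef", "abXXcdYYef"))  # True
print(wildcard_match("a*bc*d", "abd"))           # False
```

Anchoring the prefix and suffix first rejects many non-matches cheaply before any scanning for interior segments takes place.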


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39