Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared on this topic, receiving 62,352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.


Papers
Journal Article
TL;DR: Presents a new bit-parallel technique for approximate string matching based on the concept of a witness, which permits sampling some dynamic programming matrix values to bound, deduce, or compute others quickly; the resulting algorithm is the fastest for several combinations of m, k, and alphabet size.
Abstract: We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one, BPM (Myers, 1999), searches for a pattern of length m in a text of length n permitting k differences in $O(\lceil m/w \rceil n)$ time, where w is the width of the computer word. The second one, ABNDM (Navarro and Raffinot, 2000), extends a sublinear-time exact algorithm to approximate searching. ABNDM relies on another algorithm, BPA (Wu and Manber, 1992), which makes use of an $O(k \lceil m/w \rceil n)$ time algorithm for its internal workings. BPA is slow but flexible enough to support all operations required by ABNDM. We improve previous ABNDM analyses, showing that it is average-optimal in number of inspected characters, although the overall complexity is higher because of the $O(k \lceil m/w \rceil )$ work done per inspected character. We then show that the faster BPM can be adapted to support all the operations required by ABNDM. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The solution to those challenges is based on the concept of a witness, which permits sampling some dynamic programming matrix values to bound, deduce or compute others fast. The resulting algorithm is average-optimal for m ≤ w, assuming the alphabet size is constant. In practice, it performs better than the original ABNDM and is the fastest algorithm for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology. To show that the concept of witnesses can be used in further scenarios, we also improve a recent variant of BPM. The use of witnesses greatly improves the running time of this algorithm too.

33 citations
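
To make the $O(\lceil m/w \rceil n)$ bound quoted for BPM in the abstract above more concrete, here is a minimal Python sketch of a Myers-style bit-vector search. It is an illustration only, not the witness-based algorithm of the paper: the function name `myers_search` is hypothetical, and Python's unbounded integers stand in for a single machine word, so the sketch behaves as if m ≤ w.

```python
def myers_search(pattern, text, k):
    """Return end positions j in text where pattern matches with <= k differences.

    Sketch of Myers-style bit-parallel dynamic programming. The vertical
    deltas of one DP column are encoded in two bit masks (pv: +1, mv: -1).
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1          # keep only the m low bits (simulated word)
    high = 1 << (m - 1)          # bit of the last pattern row

    # peq[c] has bit i set when pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0             # initial column is 0, 1, ..., m
    score = m                    # value of the last DP row
    ends = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift without setting bit 0: row 0 stays 0, so a match may start anywhere.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            ends.append(j)
    return ends
```

For example, `myers_search("abc", "xabcx", 1)` reports the end positions 2, 3 and 4, since "ab", "abc" and "abcx" each end an occurrence of the pattern with at most one difference. Each text character costs a constant number of word operations, which is where the per-character $O(\lceil m/w \rceil)$ factor comes from when m exceeds one word.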

Journal Article
10 Oct 2017-PLOS ONE
TL;DR: An efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs), in which all threads in the same GPU warp share data using the warp-shuffle operation instead of accessing shared memory.
Abstract: Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using the warp-shuffle operation instead of accessing shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results on real DNA packages show that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over the sequential algorithm on a CPU and a previous parallel approximate string matching algorithm on GPUs, respectively.

32 citations
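
As a point of reference for the k-differences problem named in the abstract above, the following is a minimal sequential sketch of the classical column-by-column dynamic programming formulation. It is not the GPU algorithm of the paper and uses no warp-shuffle operations; the function name `k_differences` is hypothetical.

```python
def k_differences(pattern, text, k):
    """Return end positions j in text where pattern occurs with <= k differences.

    col[i] holds the minimum number of insertions, deletions and
    substitutions needed to match pattern[:i] against some substring of
    text ending at the current position.
    """
    m = len(pattern)
    col = list(range(m + 1))     # column before any text: 0, 1, ..., m
    ends = []
    for j, c in enumerate(text):
        diag = col[0]            # value of the previous column at row i - 1
        col[0] = 0               # an occurrence may start at any text position
        for i in range(1, m + 1):
            above_left = diag
            diag = col[i]        # save previous-column value before overwriting
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(above_left + cost,  # match or substitution
                         diag + 1,           # extra text character
                         col[i - 1] + 1)     # missing pattern character
        if col[m] <= k:
            ends.append(j)
    return ends
```

This runs in O(mn) time and O(m) space. Each cell depends only on three neighbouring cells, and cells on the same anti-diagonal do not depend on one another, which is one reason the problem is a common target for parallel hardware.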

Book Chapter
29 Aug 1988
TL;DR: Two string-matching algorithms belonging to the second family (fixed word, variable text) are presented, which respectively obey time and space constraints.
Abstract: Pattern recognition is a constantly growing field of research. Identification of patterns in images, for instance, is a first step towards their interpretation. More generally, all formal systems handling strings of symbols involve parsing phases to recognize certain patterns. Regular expressions are one of the techniques to specify simple patterns [26]. They lead to practicable algorithms available under most operating systems or editing tools, especially with Unix. String-matching is a particular case of pattern recognition. It consists in locating a word inside another word, called the text. Solutions to this problem can be divided into two families. In the first one, the text is considered as fixed while the word is variable. This situation occurs when the text is a dictionary, for example. The basic solution of that sort is due to Weiner, who introduced the notion of position trees [29]. It is a kind of index which has been improved in different ways (see [21], [5], [10]). For the second family of solutions to string-matching, it is the word that is fixed. The two most famous and efficient string-matching algorithms of this family have been designed by Knuth, Morris & Pratt [18] and Boyer & Moore [7]. They have been subject to several studies, improvements or extensions (see [1], [11], [13-16], [22], [23], [25], [28]). A variation on the initial problem arises when approximate patterns are considered (see [20], [27]). String-matching is close to detection of repetitions in strings (see [3], [10], [17], [25]). In fact, the study of regularities in strings is a part of the analysis of string-matching algorithms. In this paper, two string-matching algorithms belonging to the second family are presented. They respectively obey time and space constraints. Both algorithms start with a first phase during which the word alone is processed. Then, the search is done during a second phase which essentially supports the constraints.

32 citations
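
The two-phase structure described in the abstract above (first process the word alone, then scan the text) can be illustrated with the Knuth, Morris & Pratt algorithm that the abstract cites. The sketch below is a standard textbook formulation, not either of the two algorithms presented in the chapter; the function names are hypothetical.

```python
def failure_table(word):
    """Phase 1: process the word alone.

    fail[i] is the length of the longest proper prefix of word[:i + 1]
    that is also a suffix of it.
    """
    fail = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = fail[k - 1]
        if word[i] == word[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(word, text):
    """Phase 2: scan the text once, never moving backwards in it."""
    if not word:
        return []
    fail = failure_table(word)
    hits = []
    k = 0                        # number of word characters currently matched
    for i, c in enumerate(text):
        while k > 0 and c != word[k]:
            k = fail[k - 1]      # fall back to the next candidate prefix
        if c == word[k]:
            k += 1
        if k == len(word):
            hits.append(i - k + 1)
            k = fail[k - 1]      # allow overlapping occurrences
    return hits
```

For instance, `kmp_search("aba", "ababa")` finds the overlapping occurrences starting at positions 0 and 2. The preprocessing takes time linear in the word length and the search takes time linear in the text length.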

Proceedings Article
01 May 1999
TL;DR: It is shown that the multi-method dispatching problem can be transformed to a geometric problem on multi-dimensional integer grids, for which a data structure is developed that uses near-linear space and has log-logarithmic query time.
Abstract: Current object-oriented programming languages (OOPLs) rely on mono-method dispatching. Recent research has identified multi-methods as a new, powerful feature to be added to OOPLs, and several experimental OOPLs now have multi-methods. Their ultimate success and impact in practice depends, among other things, on whether multi-method dispatching can be supported efficiently. We show that the multi-method dispatching problem can be transformed to a geometric problem on multi-dimensional integer grids, for which we then develop a data structure that uses near-linear space and has log-logarithmic query time. This gives a solution whose performance almost matches that of the best known algorithm for mono-method dispatching. In this paper we study problems from two different areas: the multi-method dispatching problem for object-oriented (OO) languages and two string matching problems. It turns out that these problems are surprisingly similar: we prove that they can all be reduced to the same problem on multi-dimensional integer grids (see below for a description of this geometric problem). We present an efficient data structure for this problem, which allows various trade-offs between space and query time. This leads to significantly improved solutions to the multi-method dispatching problem and the string matching problems. In the rest of this introduction, as well as in the remainder of the paper, we first focus on the multi-method dispatching problem and then turn our attention to the string matching problems. Our geometric data structure has other applications as well, namely in two string matching problems: matching multiple rectangular patterns against a rectangular query text, and approximate dictionary matching with edit distance at most one. Our results for the former, long-standing open problem are substantially improved, near-linear time bounds. For the latter problem, which has applications in checking password security and the design of filtering tools, we obtain a near-linear solution as well. The multi-method dispatching problem. Object-oriented languages. The OO paradigm is becoming the norm for software development; languages such as Java, C++, and Smalltalk that embody some of the basic tenets of the OO paradigm are highly popular. Recent research has identified new, powerful features that can enhance the current OO technology, and the focus now is on understanding the implication of adding these features: their power and their cost in terms of the additional complexity. One such feature is the concept of multi-methods found in the new generation OO languages such as CommonLoops [BK+86], CLOS [BD+88], Poly-Glot [Ag91], Kea [MHH91], Cecil [Ch92] and Dylan [Ap94]. …

32 citations
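
The second string problem mentioned in the abstract above, approximate dictionary matching with edit distance at most one, can be illustrated with a simple hashing scheme. The sketch below is not the geometric data structure of the paper and makes no claim to its near-linear bounds; it indexes each dictionary word under itself and under every string obtained by deleting one character, then verifies candidates. The names `OneEditDictionary`, `deletions` and `within_one_edit` are hypothetical.

```python
from collections import defaultdict


def deletions(word):
    """Yield the word itself and every string obtained by deleting one character."""
    yield word
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]


def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a              # ensure len(a) <= len(b)
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1                   # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]   # at most one substitution at i
    return a[i:] == b[i + 1:]           # one extra character in b at i


class OneEditDictionary:
    def __init__(self, words):
        self.index = defaultdict(set)
        for w in words:
            for key in deletions(w):
                self.index[key].add(w)

    def lookup(self, query):
        """Return all dictionary words within edit distance one of query."""
        candidates = set()
        for key in deletions(query):
            candidates |= self.index.get(key, set())
        # Shared keys can pair words at distance two (e.g. "ab" and "ba"),
        # so every candidate is verified explicitly.
        return sorted(w for w in candidates if within_one_edit(w, query))
```

With the dictionary `["cat", "cart", "dog", "ab"]`, a lookup of `"cut"` returns `["cat"]`, while `"ba"` returns nothing because it is two edits away from `"ab"`. The index needs space proportional to the sum of squared word lengths, which is the kind of overhead that more careful data structures aim to avoid.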

Journal Article
TL;DR: In the present study, feature-based matching techniques, in their classical and robust versions, are described, and an automatic method of fuzzy alignment (FA) is introduced that allows automatic matching of two gel images with different numbers of features with unknown correspondence.
Abstract: Automatic alignment (matching) of two-dimensional gel electrophoresis images is of primary interest in the evolving field of proteomics. In the present study, feature-based matching techniques, in their classical and robust versions, are described, and an automatic method of fuzzy alignment (FA) is introduced. This method allows automatic matching of two gel images with different numbers of features with unknown correspondence. Performance of FA is tested on simulated and real data sets.

32 citations


Network Information
Related Topics (5)
Server
79.5K papers, 1.4M citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
80% related
Scheduling (computing)
78.6K papers, 1.3M citations
79% related
Network packet
159.7K papers, 2.2M citations
78% related
Optimization problem
96.4K papers, 2.1M citations
78% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39