Topic

Approximate string matching

About: Approximate string matching is a research topic. To date, 1,903 publications on this topic have received a total of 62,352 citations. The topic is also known as: fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Journal ArticleDOI
TL;DR: This study proposes a tribrid parallel method for bit-parallel algorithms such as the Shift-Or and Wu-Manber algorithms to improve the runtimes of exact and approximate string matching algorithms, and integrates the inclusive-scan scheme into a previous segmentation-based scheme to maximize search throughput.
Abstract: In this study, to substantially improve the runtimes of exact and approximate string matching algorithms, we propose a tribrid parallel method for bit-parallel algorithms such as the Shift-Or and Wu-Manber algorithms. Our underlying idea is to interpret bit-parallel algorithms as inclusive-scan operations, which allow these bit-parallel algorithms to run efficiently on a graphics processing unit (GPU); we achieve this speed-up here because inclusive-scan operations not only eliminate duplicate searches between threads but also realize a GPU-friendly memory access pattern that maximizes memory read/write throughput. To realize our ideas, we first define two binary operators and then present a proof regarding the associativity of these operators, which is necessary for the parallelization of the inclusive-scan operations. Finally, we integrate the inclusive-scan scheme into a previous segmentation-based scheme to maximize search throughput, identifying the best tradeoff point between synchronization cost and duplicate work. Through our experiments, we compared our proposed method with previous segmentation-based methods and indexing-based sequence aligners. For online string matching, our proposed method performed 6.7-16.7 times faster than previous methods, achieving a search throughput of up to 1.88 terabits per second (Tbps) on a GeForce GTX TITAN X GPU. We therefore conclude that our proposed method is quite effective for decreasing the runtimes of online string matching of short patterns.

24 citations
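
To make the bit-parallel idea in the abstract above concrete, here is a minimal sequential sketch of the classic Shift-Or algorithm for exact matching. It is not the paper's GPU method and does not show the inclusive-scan formulation; the function name `shift_or_search` is an assumption made for illustration.

```python
def shift_or_search(text: str, pattern: str) -> list[int]:
    """Return start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)

    # Bit i of state is 0 when pattern[0..i] matches the text ending here.
    state = all_ones
    hits = []
    for j, ch in enumerate(text):
        # Shifting in a 0 at bit 0 starts a new candidate match;
        # OR-ing the mask kills candidates that disagree with ch.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(j - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_or_search("abracadabra", "abra"))
```

Each text character costs one shift and one OR regardless of pattern length, as long as the pattern fits in a machine word. Python integers are unbounded, so the sketch does not enforce that limit.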

Patent
19 Jan 2001
TL;DR: A method is described for the manipulation, storage, modeling, visualization, and quantification of datasets that correspond to target strings. An iterative algorithm generates comparison strings corresponding to a set of points that can serve as the domain of an iterative function.
Abstract: There is described a method for manipulation, storage, modeling, visualization, and quantification of datasets, which correspond to target strings. An iterative algorithm is used to generate comparison strings corresponding to some set of points that can serve as the domain of an iterative function. Preferably these points are located in the complex plane, such as in and/or near the Mandelbrot set or a Julia set. The comparison string is scored by evaluating a function having the comparison string and one of the plurality of target strings as inputs. The evaluation may be repeated for a number of the other target strings. The score or some other property corresponding to the comparison string is used to determine the target string's placement on a map. The points are analyzed and/or compared by examining, either visually or mathematically, their relative locations, their absolute locations within the region, and/or metrics other than location.

24 citations

Journal Article
TL;DR: A novel improved algorithm, BMH2C, is presented. It computes the right shift using two characters and stores the shifts in a two-dimensional array, which increases the shift distance, reduces the number of comparisons, and improves matching speed.
Abstract: String matching technology is widely applied in many fields. Based on a discussion of the Brute-Force and Boyer-Moore algorithms and the most important improvements to them, a novel improved algorithm, BMH2C, is presented. The algorithm computes the right shift using two characters and stores the shifts in a two-dimensional array, which increases the shift distance, reduces the number of comparisons, and improves matching speed. Finally, test results comparing these algorithms are given.

24 citations
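
The abstract above does not spell out how BMH2C works, so the following is only a sketch of the general idea it names: a Horspool-style search whose shift is looked up from the last two characters of the current window. The function name `two_char_horspool` and the use of a dictionary in place of a two-dimensional array are assumptions, not the authors' implementation.

```python
def two_char_horspool(text: str, pattern: str) -> list[int]:
    """Exact search with a shift table keyed on two characters (sketch)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    if m == 1:
        return [i for i, ch in enumerate(text) if ch == pattern]

    # For each adjacent pair in the pattern, except the final pair, record
    # how far the window must move to align that pair with the window's
    # last two characters. Later (rightmost) pairs overwrite earlier ones,
    # which keeps the smallest safe shift.
    pair_shift: dict[tuple[str, str], int] = {}
    for p in range(m - 2):
        pair_shift[(pattern[p], pattern[p + 1])] = m - 2 - p

    hits = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            hits.append(i)
        a, b = text[i + m - 2], text[i + m - 1]
        # If the pair does not occur, the pattern can still start at the
        # window's last character when that character equals pattern[0].
        fallback = m - 1 if b == pattern[0] else m
        i += pair_shift.get((a, b), fallback)
    return hits


if __name__ == "__main__":
    print(two_char_horspool("abracadabra", "abra"))
```

Keying the table on a pair rather than a single character makes a full-length shift more likely, at the cost of a larger table.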

Proceedings ArticleDOI
27 May 2015
TL;DR: A new filtering method, called local filtering, is proposed, based on the idea that two strings exhibiting substantial local dissimilarities must be globally dissimilar, which can achieve substantial speedup compared with state-of-the-art methods and be robust against factors such as dataset characteristics and large edit distance thresholds.
Abstract: We study efficient query processing for approximate string queries, which find strings within a string collection whose edit distances to the query strings are within the given thresholds. Existing methods typically hinge on the property that globally similar strings must share at least a certain number of identical substrings or subsequences. They become ineffective when there are burst errors or when the number of errors is large. In this paper, we explore the opposite paradigm, focusing on finding the differences between the database strings and the query string. We propose a new filtering method, called local filtering, based on the idea that two strings exhibiting substantial local dissimilarities must be globally dissimilar. We propose the concept of (positional) local distance to quantify the minimum number of errors a query fragment contributes to the edit distance between the query and a data string. It also leads to effective pruning rules and can speed up verification via early termination. We devise a family of indexing methods based on the idea of precomputing (positional) local distances for all possible combinations of query fragments and edit distance thresholds. Based on careful analyses of subtle relationships among local distances, novel techniques are proposed to drastically reduce the amount of enumeration with little or no impact on the pruning power. Efficient query processing methods exploiting the new index and bit-parallelism are also proposed. Experimental results on real datasets show that our local filtering-based methods can achieve substantial speedup compared with state-of-the-art methods, and they are robust against factors such as dataset characteristics and large edit distance thresholds.

24 citations
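
The query type studied above (find all strings within edit distance k of a query) can be illustrated with a baseline that verifies each candidate by dynamic programming and stops early once the threshold cannot be met. This is a plain linear scan, not the paper's local filtering or its index; the names `within_edit_distance` and `approximate_query` are assumptions for illustration.

```python
def within_edit_distance(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= k."""
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        # Row minima never decrease, so the final distance must exceed k.
        if min(cur) > k:
            return False
        prev = cur
    return prev[-1] <= k


def approximate_query(collection: list[str], query: str, k: int) -> list[str]:
    """Linear-scan baseline: keep every string within distance k of query."""
    return [s for s in collection if within_edit_distance(s, query, k)]


if __name__ == "__main__":
    print(approximate_query(["kitten", "sitting", "mitten", "bitten"], "kitten", 1))
```

Filtering methods such as the one in the abstract aim to avoid running this verification on most of the collection.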

Proceedings Article
30 Aug 2005
TL;DR: The approach presented is based on representing sets of strings at higher levels of the index structure as tries suitably compressed in a way that reasoning about edit distance between a query string and a compressed trie at index nodes is still feasible.
Abstract: In various applications such as data cleansing, being able to retrieve categorical or numerical attributes based on notions of approximate match (e.g., edit distance, numerical distance) is of profound importance. Commonly, approximate match predicates are specified on combinations of attributes in conjunction. Existing database techniques for approximate retrieval, however, limit their applicability to single attribute retrieval through B-trees and their variants. In this paper, we propose a methodology that utilizes known multidimensional indexing structures for the problem of approximate multi-attribute retrieval. Our method enables indexing of a collection of string and/or numeric attributes to facilitate approximate retrieval using edit distance as an approximate match predicate for strings and numeric distance for numeric attributes. The approach presented is based on representing sets of strings at higher levels of the index structure as tries suitably compressed in a way that reasoning about edit distance between a query string and a compressed trie at index nodes is still feasible. We propose and evaluate various techniques to generate the compressed trie representation and fully specify our indexing methodology. Our experimental results show the benefits of our proposal when compared with various alternate strategies for the same problem.

24 citations
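
To show why a trie still permits reasoning about edit distance at internal nodes, here is a sketch of a well-known technique: walk an uncompressed trie while carrying one dynamic-programming row per node, and prune any subtree whose row minimum already exceeds the threshold. It does not reproduce the compressed tries or the multidimensional index of the paper; `build_trie`, `trie_search`, and the `"$end"` marker key are assumptions for illustration.

```python
def build_trie(words: list[str]) -> dict:
    """Build a nested-dict trie; the key "$end" stores a complete word."""
    root: dict = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$end"] = w
    return root


def trie_search(root: dict, query: str, k: int) -> list[tuple[str, int]]:
    """Return (word, distance) for every stored word within distance k."""
    results = []

    def visit(node: dict, row: list[int]) -> None:
        # row[j] = edit distance between this node's prefix and query[:j].
        if "$end" in node and row[-1] <= k:
            results.append((node["$end"], row[-1]))
        for ch, child in node.items():
            if ch == "$end":
                continue
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, 1):
                new_row.append(min(new_row[j - 1] + 1,
                                   row[j] + 1,
                                   row[j - 1] + (qc != ch)))
            # No extension of this prefix can get back under k: prune.
            if min(new_row) <= k:
                visit(child, new_row)

    visit(root, list(range(len(query) + 1)))
    return sorted(results)


if __name__ == "__main__":
    trie = build_trie(["cat", "cart", "car", "dog", "coat"])
    print(trie_search(trie, "cat", 1))
```

Shared prefixes are processed once, so the cost of the row computation is spread across all words below a node.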


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39