Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have been published within this topic, receiving 62352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Proceedings Article
10 Jun 2014
TL;DR: A space-efficient multiple string matching algorithm BVM, which makes use of bit-vector and succinct hash table to replace the automata used in factor-searching-based algorithms.
Abstract: Multiple string matching plays a fundamental role in network intrusion detection systems. Automata-based multiple string matching algorithms like AC, SBDM and SBOM are widely used in practice, but the huge memory usage of automata prevents them from being applied to a large-scale pattern set. Meanwhile, the poor cache locality of huge automata degrades the matching speed of these algorithms. Here we propose a space-efficient multiple string matching algorithm, BVM, which makes use of a bit-vector and a succinct hash table to replace the automata used in factor-searching-based algorithms. The space complexity of the proposed algorithm is \(O(rm^2 + \sum_{p \in P} |p|)\), which is more space-efficient than the classic automata-based algorithms. Experiments on datasets including Snort, ClamAV, URL blacklist and synthetic rules show that the proposed algorithm significantly reduces memory usage and still runs at a fast matching speed. Above all, BVM costs less than 0.75% of the memory usage of AC, and is capable of matching millions of patterns efficiently.

6 citations

Journal Article
TL;DR: Experimental results indicate that the algorithm is highly effective and that it outperforms the popular Basic Local Alignment Search Tool (BLAST) when searching for short sequences.
Abstract: This paper presents a new algorithm for searching for short fragments of sequences in long DNA sequences. A short sequence (pattern) is searched for in both DNA strands with a given maximum number of errors. Each DNA sequence (T) is preprocessed by compressing it using the Burrows-Wheeler transform and a wavelet tree. First, the pattern is divided into short overlapping words, and then their positions in T are determined using an FM-index. Connections between the words are then sought subject to the maximum number of errors allowed. Experimental results indicate that the algorithm is highly effective and that it outperforms the popular Basic Local Alignment Search Tool (BLAST) when searching for short sequences.
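
The word-based filtering described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: a plain dictionary of word positions stands in for the FM-index, only substitution errors are handled, and the function names, the word length `q`, and the vote threshold are assumptions made for illustration. The threshold relies on the q-gram lemma: each mismatch can destroy at most `q` of the `m - q + 1` overlapping words of a pattern of length `m`.

```python
from collections import defaultdict

def build_word_index(text, q):
    """Map each length-q word of text to its start positions.
    Hypothetical stand-in for the FM-index lookup in the abstract."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index

def reverse_complement(seq):
    """Return the opposite DNA strand, read in the 5'-to-3' direction."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(seq))

def find_with_mismatches(text, pattern, k, q=3):
    """Return sorted (start, mismatches) pairs with at most k substitutions."""
    m = len(pattern)
    index = build_word_index(text, q)
    # Each overlapping word of the pattern votes for the start it implies.
    votes = defaultdict(int)
    for offset in range(m - q + 1):
        for pos in index.get(pattern[offset:offset + q], ()):
            start = pos - offset
            if 0 <= start <= len(text) - m:
                votes[start] += 1
    # q-gram lemma: a true occurrence keeps at least this many words intact.
    threshold = (m - q + 1) - k * q
    if threshold <= 0:
        # The filter cannot exclude anything, so check every start position.
        candidates = range(len(text) - m + 1)
    else:
        candidates = [s for s, n in votes.items() if n >= threshold]
    hits = []
    for start in candidates:
        window = text[start:start + m]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= k:
            hits.append((start, mismatches))
    return sorted(hits)

def search_both_strands(text, pattern, k, q=3):
    """Search the pattern and its reverse complement in the same text."""
    return {
        "+": find_with_mismatches(text, pattern, k, q),
        "-": find_with_mismatches(text, reverse_complement(pattern), k, q),
    }
```

For example, `search_both_strands("ACGTACGTTAGC", "GCTAACGT", 0)` reports a hit only on the `"-"` strand, because the reverse complement of the pattern occurs in the text while the pattern itself does not.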

6 citations

01 Jan 2003
TL;DR: Presents a new algorithm for the Inexact Characteristic String Problem that uses Hamming distance instead of Levenshtein distance as a measure, together with an improved algorithm for the Common Substring Problem that is simpler and faster in practice by a constant factor than the previous algorithm.
Abstract: We present a new algorithm to solve the Inexact Characteristic String Problem (ICSP) using Hamming distance instead of Levenshtein distance as a measure. We embed our new algorithm and the previously known algorithm for Levenshtein distance in a common framework, which reveals an additional improvement to the Levenshtein distance algorithm. The ICSP can thus be solved in time O(||T|| + l*||S-T||) for Hamming distance and in time O(||T|| + k*l*||S-T||) for Levenshtein distance, where S is a set of strings, T is a non-empty subset of S (the target set), and l is the length of a shortest string in T. The ICSP has applications in probe and primer design. Both algorithms need to solve the Common Substring Problem for more than two strings. We present an improved algorithm for this problem that is simpler and faster in practice by a constant factor than the previous algorithm.

6 citations

Patent
Enyuan Wu
09 Sep 2011
TL;DR: In this article, a target string is broken into one or more target terms, and the target terms are matched to known terms in an index tree, where the terms in the index tree are associated with known string IDs.
Abstract: One or more techniques and/or systems are disclosed for matching a target string to a known string. A target string is broken into one or more target terms, and the one or more target terms are matched to known terms in an index tree. The index tree comprises one or more known terms from a plurality of known strings, where the respective known terms in the index tree are associated with one or more known string IDs. A known term that is associated with a known string ID (in the index tree, and to which a target term is matched), is comprised in a known string, which corresponds to the known string ID. The target string can be matched to the known string using the known string's corresponding known string ID that is associated with a desired number of occurrences in the matching of the one or more target terms.
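
The term-matching idea in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented method: a flat dictionary replaces the index tree, terms are produced by lowercasing and splitting on whitespace, and the names `build_term_index`, `match_target`, and `min_hits` are hypothetical.

```python
from collections import Counter, defaultdict

def build_term_index(known_strings):
    """known_strings maps a string ID to a known string.
    Returns a mapping from each term to the set of IDs containing it."""
    index = defaultdict(set)
    for string_id, text in known_strings.items():
        for term in text.lower().split():
            index[term].add(string_id)
    return index

def match_target(target, index, min_hits=2):
    """Break the target into terms, look each one up, and count how often
    each known string ID occurs among the matched terms."""
    counts = Counter()
    for term in target.lower().split():
        for string_id in index.get(term, ()):
            counts[string_id] += 1
    # Keep only IDs that reach the desired number of occurrences.
    return [(sid, n) for sid, n in counts.most_common() if n >= min_hits]

known = {
    1: "Acme Widget Company",
    2: "Acme Rocket Works",
    3: "Global Widget Supply",
}
index = build_term_index(known)
matches = match_target("acme widget co", index)
```

Here `matches` contains only string ID 1, which shares two terms with the target; IDs 2 and 3 share one term each and fall below `min_hits`.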

6 citations

Book Chapter
05 Oct 2004
TL;DR: This work considers the problem of finding all approximate occurrences of a given string q, with at most k differences, in a finite database or dictionary of strings, and considers the “triangular inequality”, the most important property in this case.
Abstract: We consider the problem of finding all approximate occurrences of a given string q, with at most k differences, in a finite database or dictionary of strings. The strings can be, for example, natural language words, such as the vocabulary of some document or set of documents. This has many important applications in both off-line (indexed) and on-line string matching. More precisely, we have a universe \({\mathbb U}\) of strings, and a non-negative distance function \(d: {\mathbb U} \times {\mathbb U} \rightarrow {\mathbb N}\). The distance function is a metric if it satisfies (i) \(d(x,y) = 0 ~ \Leftrightarrow ~ x = y\); (ii) \(d(x,y) = d(y,x)\); (iii) \(d(x,y) \le d(x,z) + d(z,y)\). The last item is called the “triangular inequality”, and is the most important property in our case. Many useful distance functions are known to be metric; in particular, edit (Levenshtein) distance is metric, and it is the one we use for \(d\).
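
To show why the triangular inequality matters for dictionary search, here is a minimal sketch of pivot-based pruning. It is not the chapter's algorithm: the single fixed pivot, the class name `PivotIndex`, and the example word list are assumptions. Property (iii) gives \(d(q,w) \ge |d(q,p) - d(p,w)|\) for any pivot p, so a word w whose precomputed distance to p differs from \(d(q,p)\) by more than k cannot be within k of q and is skipped without computing its distance.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

class PivotIndex:
    """Stores each dictionary word with its distance to one pivot string."""

    def __init__(self, words, pivot):
        self.pivot = pivot
        self.entries = [(w, levenshtein(pivot, w)) for w in words]

    def search(self, query, k):
        """Return (matches, evaluated): the words within distance k of query,
        and how many full distance computations were needed."""
        dq = levenshtein(query, self.pivot)
        matches, evaluated = [], 0
        for word, dw in self.entries:
            if abs(dq - dw) > k:
                continue  # ruled out by the triangular inequality
            evaluated += 1
            if levenshtein(query, word) <= k:
                matches.append(word)
        return matches, evaluated

words = ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]
pivot_index = PivotIndex(words, pivot="book")
matches, evaluated = pivot_index.search("bool", k=1)
```

In this run the query is compared against 5 of the 8 words; the other three are discarded from their stored pivot distances alone. Practical indexes typically use several pivots or a tree of them, which this sketch leaves out.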

6 citations


Network Information
Related Topics (5)
Server
79.5K papers, 1.4M citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
80% related
Scheduling (computing)
78.6K papers, 1.3M citations
79% related
Network packet
159.7K papers, 2.2M citations
78% related
Optimization problem
96.4K papers, 2.1M citations
78% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39