scispace - formally typeset
Search or ask a question
Topic

Approximate string matching

About: Approximate string matching is a research topic. Over the lifetime, 1903 publications have been published within this topic receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.


Papers
More filters
Journal ArticleDOI
TL;DR: A novel DNA sequences pattern matching algorithm called EPMA is presented, which utilizes fixed length 2-bits binary encoding, segmentation and multi-threading and can be extended to generalized string matching and their applications.
Abstract: To solve, manage and analyze biological problems using computer technology is called bioinformatics. With the emergent evolution in computing era, the volume of biological data has increased significantly. These large amounts of data have increased the need to analyze it in reasonable space and time. DNA sequences contain basic information of species, and pattern matching between different species is an important and challenging issue to cope with. There exist generalized string matching and some specialized DNA pattern matching algorithms in the literature. There is still need to develop fast and space efficient pattern matching algorithms that consider new hardware development. In this paper, we present a novel DNA sequences pattern matching algorithm called EPMA. The proposed algorithm utilizes fixed length 2-bits binary encoding, segmentation and multi-threading. The idea is to find the pattern with multiple searcher agents concurrently. The proposed algorithm is validated with comparative experimental results. The results show that the new algorithm is a good candidate for DNA sequence pattern matching applications. The algorithm effectively utilizes modern hardware and will help researchers in the sequence alignment, short read error correction, phylogenetic inference etc. Furthermore, the proposed method can be extended to generalized string matching and their applications.

21 citations

Journal ArticleDOI
TL;DR: An Aho-Corasick algorithm based parallel string matching that outperforms the existing bit-split string matching in the evaluations of Snort rules is proposed.
Abstract: As the variety of hazardous packet payload contents increases, the intrusion detection system (IDS) should be able to detect numerous patterns in real time. For this reason, this paper proposes an Aho-Corasick algorithm based parallel string matching. In order to balance memory usage between homogeneous finite-state machine (FSM) tiles for each string matcher, an optimal set of bit position groups is determined. Target patterns are sorted by binary-reflected gray code (BRGC), which reduces bit transitions in patterns mapped onto a string matcher. In the evaluations of Snort rules, the proposed string matching outperforms the existing bit-split string matching.

20 citations

Journal ArticleDOI
TL;DR: Two parallel string matching algorithms which adopt perfect hashing to compact a state transition table are proposed which reduce up to 99.5 percent memory requirements for storing the state Transition table compared to the traditional two-dimensional memory architecture.
Abstract: Multiple string matching has a wide range of applications such as network intrusion detection systems, spam filters, information retrieval systems, and bioinformatics. To accelerate multiple string matching, many hardware approaches are proposed to accelerate string matching. Among the hardware approaches, memory architectures have been widely adopted because of their flexibility and scalability. A conventional memory architecture compiles multiple string patterns into a state machine and performs string matching by traversing the corresponding state transition table. Due to the ever-increasing number of attack patterns, the memory used for storing the state transition table increased tremendously. Therefore, memory reduction has become a crucial issue in optimizing memory architectures. In this paper, we propose two parallel string matching algorithms which adopt perfect hashing to compact a state transition table. Different from most state-of-the-art approaches implemented on specific hardware such as TCAM, FPGA, or ASIC, our proposed approaches are easily implemented on commodity DRAM and extremely suitable to be implemented on GPUs. The proposed algorithms reduce up to 99.5 percent memory requirements for storing the state transition table compared to the traditional two-dimensional memory architecture. By studying existing approaches, our results obtain significant improvements in memory efficiency.

20 citations

Proceedings ArticleDOI
Jun Sun1, Zhulong Wang1, Hao Yu1, Fumihito Nishino1, Yukata Katsuyama1, Satoshi Naoi1 
20 Nov 2003
TL;DR: A novel stroke verification algorithm is used to effectively remove non-character strokes and build the binary text line image, which is segmented and recognized by dynamic programming.
Abstract: Images play a very important role in web content delivery. Many WWW images contain text information that can be used for web indexing and searching. A new text extraction and recognition algorithm is proposed in this paper. The character strokes in the image are first extracted by color clustering and connected component analysis. A novel stroke verification algorithm is used to effectively remove non-character strokes. The verified strokes are then used to build the binary text line image, which is segmented and recognized by dynamic programming. Since text in WWW image usually has close relationship with webpage content, approximate string matching is used to revise the recognition result by matching the content in the webpage with the content in the image. This effective post-processing not only improves the recognition performance, but also can be used in other applications such like image - webpage paragraph corresponding.

20 citations

Journal ArticleDOI
TL;DR: A new approach to pattern discovery called string pattern regression is presented, where a data set is given that consists of a string attribute and an objective numerical attribute, and an exact but efficient branch-and-bound algorithm is presented which is applicable to various pattern classes.
Abstract: We present a new approach to pattern discovery called string pattern regression, where we are given a data set that consists of a string attribute and an objective numerical attribute. The problem is to find the best string pattern that divides the data set in such a way that the distribution of the numerical attribute values of the set for which the pattern matches the string attribute, is most distinct, with respect to some appropriate measure, from the distribution of the numerical attribute values of the set for which the pattern does not match the string attribute. By solving this problem, we are able to discover, at the same time, a subset of the data whose objective numerical attributes are significantly different from rest of the data, as well as the splitting rule in the form of a string pattern that is conserved in the subset. Although the problem can be solved in linear time for the substring pattern class, the problem is NP-hard in the general case (i.e. more complex patterns), and we present an exact but efficient branch-and-bound algorithm which is applicable to various pattern classes. We apply our algorithm to intron sequences of human, mouse, fly, and zebrafish, and show the practicality of our approach and algorithm. We also discuss possible extensions of our algorithm, as well as promising applications, such as microarray gene expression data.

20 citations


Network Information
Related Topics (5)
Server
79.5K papers, 1.4M citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
80% related
Scheduling (computing)
78.6K papers, 1.3M citations
79% related
Network packet
159.7K papers, 2.2M citations
78% related
Optimization problem
96.4K papers, 2.1M citations
78% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
20238
202230
202132
202030
201948
201839