Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared within this topic, receiving 62,352 citations. The topic is also known as: fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Journal Article, DOI
TL;DR: The proposed algorithm for computing S*(Y) requires cubic time and uses the recursively computable dissimilarity measure Dk(X, Y), termed the kth distance between two strings X and Y, which is a dissimilarity measure between Y and a certain subset of the set of contiguous substrings of X.
Abstract: Let T(U) be the set of words in the dictionary H which contain U as a substring. The problem considered here is the estimation of the set T(U) when U is not known, but Y, a noisy version of U, is available. The suggested set estimate S*(Y) of T(U) is a proper subset of H such that its every element contains at least one substring which resembles Y most according to the Levenshtein metric. The proposed algorithm for the computation of S*(Y) requires cubic time. The algorithm uses the recursively computable dissimilarity measure Dk(X, Y), termed the kth distance between two strings X and Y, which is a dissimilarity measure between Y and a certain subset of the set of contiguous substrings of X. Another estimate of T(U), namely SM(Y), is also suggested. The accuracy of SM(Y) is only slightly less than that of S*(Y), but the computation time of SM(Y) is substantially less than that of S*(Y). Experimental results involving 1900 noisy substrings and dictionaries which are subsets of the 1023 most common English words [11] indicate that the accuracy of the estimate S*(Y) is around 99 percent and that of SM(Y) is about 98 percent.

28 citations
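
The idea of scoring a dictionary word by how closely one of its contiguous substrings resembles the noisy string Y can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard dynamic program for approximate substring matching (free start and end positions in X), not the paper's Dk(X, Y) recursion or its cubic-time algorithm, and the function names `best_substring_distance` and `closest_words` are hypothetical.

```python
from typing import List


def best_substring_distance(x: str, y: str) -> int:
    """Minimum Levenshtein distance between y and any contiguous substring of x."""
    # prev[j] holds the best cost of aligning y[:i] with a substring of x ending at j.
    # Row 0 is all zeros because the empty prefix of y matches an empty substring anywhere.
    prev = [0] * (len(x) + 1)
    for i in range(1, len(y) + 1):
        curr = [i] + [0] * len(x)  # y[:i] against an empty substring costs i edits
        for j in range(1, len(x) + 1):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitute
                          prev[j] + 1,         # drop a character of y
                          curr[j - 1] + 1)     # absorb an extra character of x
        prev = curr
    return min(prev)  # the substring may end at any position of x


def closest_words(dictionary: List[str], y: str) -> List[str]:
    """Words whose best-matching substring is nearest to y (assumes a non-empty dictionary)."""
    scores = {w: best_substring_distance(w, y) for w in dictionary}
    best = min(scores.values())
    return [w for w in dictionary if scores[w] == best]
```

For instance, `closest_words(["matching", "string", "search"], "strng")` keeps only the words tied for the smallest substring distance.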

Proceedings Article, DOI
10 Jul 2000
TL;DR: The algorithm and architecture of a processor for approximate string matching with high throughput rate is presented, dedicated for multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary.
Abstract: In this paper we present the algorithm and architecture of a processor for approximate string matching with high throughput rate. The processor is dedicated for multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary. The algorithm used for the approximate string matching is based on a dynamic programming procedure known as the string-to-string correction problem. It has been extended to fulfil the requirements of full text search in a database system, including string matching with wildcards and handling of idiomatic turns of some languages. The processor has been fabricated in a 0.6 μm CMOS technology. It performs a maximum of 8.5 billion character comparisons per second when operating at the specified clock frequency of 132 MHz.

27 citations
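
The dynamic programming procedure for string-to-string correction mentioned in the abstract above is commonly realised as the Wagner-Fischer recurrence. The sketch below is a software illustration only, not the processor's architecture. The wildcard handling is an assumption made for this example: a hypothetical `?` symbol in the pattern matches any single text character at zero cost.

```python
WILDCARD = "?"  # hypothetical single-character wildcard, assumed for this sketch


def edit_distance(pattern: str, text: str, wildcard: str = WILDCARD) -> int:
    """Unit-cost edit distance between pattern and text, with a one-character wildcard."""
    # prev[j] is the distance between the processed pattern prefix and text[:j].
    prev = list(range(len(text) + 1))
    for i, p in enumerate(pattern, start=1):
        curr = [i]  # pattern[:i] against empty text: i deletions
        for j, t in enumerate(text, start=1):
            same = p == t or p == wildcard
            curr.append(min(prev[j - 1] + (0 if same else 1),  # match or substitute
                            prev[j] + 1,                       # delete from pattern
                            curr[j - 1] + 1))                  # insert into pattern
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory grows with the text length rather than with the product of both lengths.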

Journal Article
TL;DR: This paper solves the smallest distance approximate seed problem and the restricted smallest approximate seed problem in polynomial time and proves that the general smallest approximate seed problem is NP-complete.
Abstract: In this paper we study approximate seeds of strings, that is, substrings of a given string x that cover (by concatenations or overlaps) a superstring of x, under a variety of distance rules (the Hamming distance, the edit distance, and the weighted edit distance). We solve the smallest distance approximate seed problem and the restricted smallest approximate seed problem in polynomial time and we prove that the general smallest approximate seed problem is NP-complete.

27 citations
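
To make the covering idea concrete, the sketch below checks a simpler, related notion under the Hamming distance: whether a candidate string s approximately covers x itself, meaning every position of x lies inside some length-|s| window of x within distance t of s. A seed relaxes this by covering a superstring of x, which the sketch does not handle. The function names and the threshold parameter `t` are assumptions for illustration, and this is not one of the paper's algorithms.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def is_approximate_cover(x: str, s: str, t: int) -> bool:
    """True if windows of x within Hamming distance t of s cover every position of x."""
    m = len(s)
    if m == 0 or m > len(x):
        return False
    covered_up_to = 0  # invariant: x[:covered_up_to] is covered so far
    for i in range(len(x) - m + 1):
        if i > covered_up_to:
            return False  # position covered_up_to can no longer be covered
        if hamming(x[i:i + m], s) <= t:
            covered_up_to = i + m
    return covered_up_to == len(x)
```

With `t = 0` this reduces to the exact cover test; raising `t` lets near-copies of s take part in the covering.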

Journal Article, DOI
TL;DR: ALFRED is presented, an alignment-free distance computation method which solves the generalized common substring search problem via exact computation and enables exact reconstruction of the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed.
Abstract: Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged to be one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Albeit having a good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performances with an expected run-time of [Formula: see text]. By applying our program to phylogenetic inference as a case study, we find that our program facilitates to exactly reconstruct the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in C++ programming language and the source code is freely available online.

27 citations
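
The quantity behind the generalized average common substring approach can be illustrated with a brute-force sketch: for each start position in one sequence, find the longest substring that matches some substring of the other sequence with at most k mismatches, then average those lengths. This is a quadratic-or-worse illustration with hypothetical function names, not ALFRED's algorithm, and it stops short of turning the average into a distance.

```python
def longest_match_with_mismatches(a: str, i: int, b: str, j: int, k: int) -> int:
    """Length of the longest common extension of a[i:] and b[j:] with at most k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(a) and j + length < len(b):
        if a[i + length] != b[j + length]:
            if mismatches == k:
                break  # one more mismatch would exceed the budget
            mismatches += 1
        length += 1
    return length


def average_common_substring(a: str, b: str, k: int = 0) -> float:
    """Mean, over start positions in a, of the longest k-mismatch match found anywhere in b."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    total = 0
    for i in range(len(a)):
        total += max(longest_match_with_mismatches(a, i, b, j, k)
                     for j in range(len(b)))
    return total / len(a)
```

Larger averages indicate more shared content between the two sequences. Note that the measure is not symmetric in `a` and `b`.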

Journal Article, DOI
TL;DR: This paper presents a novel approach that operates in a dynamic environment with a steady arrival of new strings belonging to the considered set; it needs only the previously computed median of the set together with the new string to compute an updated median string of the new set.
Abstract: The generalised median string is defined as a string that has the smallest sum of distances to the elements of a given set of strings. It is a valuable tool in representing a whole set of objects by a single prototype, and has interesting applications in pattern recognition. All algorithms for computing generalised median strings known from the literature are of static nature. That is, they require all elements of the underlying set of strings to be given when the algorithm is started. In this paper, we present a novel approach that is able to operate in a dynamic environment, where there is a steady arrival of new strings belonging to the considered set. Rather than computing the median from scratch upon arrival of each new string, the proposed algorithm needs only the median of the set computed before together with the new string to compute an updated median string of the new set. Our approach is experimentally compared to a greedy algorithm and the set median using both synthetic and real data.

27 citations
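
The set median used as a baseline in the abstract above is the member of the set with the smallest sum of distances to all members, whereas the generalised median may be any string. A minimal sketch of the set median follows, assuming unit-cost Levenshtein distance as the string distance. The function names are hypothetical, and the paper's dynamic update rule is not reproduced here.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),  # match or substitute
                            prev[j] + 1,               # delete from a
                            curr[j - 1] + 1))          # insert into a
        prev = curr
    return prev[-1]


def sum_of_distances(candidate: str, strings: Sequence[str]) -> int:
    """Objective minimised by a median string: total distance to every set element."""
    return sum(levenshtein(candidate, s) for s in strings)


def set_median(strings: Sequence[str]) -> str:
    """Element of the set with the smallest sum of distances (first one wins ties)."""
    if not strings:
        raise ValueError("the set of strings must be non-empty")
    return min(strings, key=lambda c: sum_of_distances(c, strings))
```

Recomputing `set_median` from scratch after each arrival costs a quadratic number of distance evaluations in the set size, which is the kind of cost an incremental update aims to avoid.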


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  8
2022  30
2021  32
2020  30
2019  48
2018  39