scispace - formally typeset
Search or ask a question
Topic

Approximate string matching

About: Approximate string matching is a research topic. Over the lifetime, 1903 publications have been published within this topic receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.


Papers
More filters
Patent
14 Jun 2007
TL;DR: In this paper, a two-phase process is used to identify representations of the same items within a collection of item representations, referred to as a blocking phase and a matching phase.
Abstract: A two-phase process quickly and accurately identifies representations of the same items within a collection of item representations. In the first phase, referred to as a “blocking phase,” frequency information indicating the frequency with which terms appear within the collection of item representations is used to quickly identify “candidate pairs” (i.e., pairs of item representations that have a relatively high probability of matching). The blocking phase results in a reduced subset of the data for further analysis during the second phase. In the second phase, referred to as a “matching phase,” the candidate pairs are analyzed using fuzzy matching functions to accurately identify “matching pairs” (i.e., representations of the same items).

11 citations

Proceedings ArticleDOI
27 Mar 2012
TL;DR: A fast text retrieval system to index and browse degraded historical documents, designed in a two level, coarse-to-fine approach, to increase the speed of the retrieval process.
Abstract: In this paper, we present a fast text retrieval system to index and browse degraded historical documents. The indexing and retrieval strategy is designed in a two level, coarse-to-fine approach, to increase the speed of the retrieval process. During the indexing step, the text parts in the images are encoded into sequences of primitives, obtained from two different codebooks: a coarse one corresponding to connected components and a fine one corresponding to glyph primitives. A glyph consists of a single character or a part of a character according to the shape complexity. During the querying step, the coarse and the fine signature are generated from the query image using both codebooks. Then, a bi-level approximate string matching algorithm is applied to find similar words, using coarse approach first, and then the fine approach if necessary, by exploiting predetermined hypothetical locations. An experimental evaluation on datasets of real life document images, gathered from historical books of different scripts, demonstrated the speed improvement and good accuracy in presence of degradation.

11 citations

Journal ArticleDOI
TL;DR: This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks, that includes algorithms for string matching and alignment problems, and consists of C/C++ library functions as well as Perl library functions.
Abstract: This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks. Namely, local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. Moreover, it also supports filtering operations to select strings from a set and establish their statistical significance, via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not been implemented in a single and consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory and practice by providing an extensible and easy to use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available at http://www.math.unipa.it/~raffaele/BATS/ under the GNU GPL.

11 citations

Patent
21 Oct 2015
TL;DR: In this article, a fuzzy word segmentation based non-multi-character word error automatic proofreading method is presented. But the method is not suitable for Chinese word error detection.
Abstract: The invention discloses a fuzzy word segmentation based non-multi-character word error automatic proofreading method. According to the method, accurate segmentation is carried out based on a correct word dictionary and a wrong character word dictionary to generate a word graph; then the similarity of Chinese word strings is calculated by utilizing a fuzzy matching algorithm, accurately segmented disperse strings are subjected to fuzzy matching, and a fuzzy matching result is added into the word graph to form a fuzzy word graph; and finally a shortest path of the fuzzy word graph is calculated by utilizing a binary model of words in combination with similarity, so that automatic proofreading of Chinese non-multi-character word errors is realized. According to the fuzzy word segmentation based non-multi-character word error automatic proofreading method provided by the invention, the system response is quick, the precision meets actual application demands, and the effectiveness and the accuracy are high.

11 citations

Proceedings ArticleDOI
03 Jun 2014
TL;DR: The hybrid algorithm, Maximum-Shift, shows efficient results compared to four string matching algorithms, Quick-Search, Horspool, Smith and Berry-Ravindran, in terms of the number of attempts and the total number of character comparisons.
Abstract: The string matching algorithms have broad applications in many areas of computer sciences. These areas include operating systems, information retrieval, editors, Internet searching engines, security applications and biological applications. Two important factors used to evaluate the performance of the sequential string matching algorithms are number of attempts and total number of character comparisons during the matching process. This research proposes to integrate the good properties of three single string matching algorithms, Quick-Search, Zuh-Takaoka and Horspool, to produce hybrid string matching algorithm called Maximum-Shift algorithm. Three datasets are used to test the proposed algorithm, which are, DNA, Protein sequence and English text. The hybrid algorithm, Maximum-Shift, shows efficient results compared to four string matching algorithms, Quick-Search, Horspool, Smith and Berry-Ravindran, in terms of the number of attempts and the total number of character comparisons.

11 citations


Network Information
Related Topics (5)
Server
79.5K papers, 1.4M citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
80% related
Scheduling (computing)
78.6K papers, 1.3M citations
79% related
Network packet
159.7K papers, 2.2M citations
78% related
Optimization problem
96.4K papers, 2.1M citations
78% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
20238
202230
202132
202030
201948
201839