Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have been published within this topic, receiving 62,352 citations. The topic is also known as: fuzzy string-searching algorithm and fuzzy string-matching algorithm.

Papers

Patent

Large scale item representation matching

Amir J. Padovitz¹, Dima Suponau¹, Wei Yu¹, Mikhail Bilenko¹

Microsoft¹

14 Jun 2007

TL;DR: A two-phase process, consisting of a blocking phase and a matching phase, identifies representations of the same items within a collection of item representations.

Abstract: A two-phase process quickly and accurately identifies representations of the same items within a collection of item representations. In the first phase, referred to as a “blocking phase,” frequency information indicating the frequency with which terms appear within the collection of item representations is used to quickly identify “candidate pairs” (i.e., pairs of item representations that have a relatively high probability of matching). The blocking phase results in a reduced subset of the data for further analysis during the second phase. In the second phase, referred to as a “matching phase,” the candidate pairs are analyzed using fuzzy matching functions to accurately identify “matching pairs” (i.e., representations of the same items).
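The two phases described above can be illustrated with a minimal Python sketch. It is not the patented method: the rare-term blocking rule, the `max_doc_freq` and `threshold` parameters, the function names, the sample product strings, and the use of `difflib.SequenceMatcher` as the fuzzy matching function are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(items, max_doc_freq=2):
    """Blocking phase (assumed rule): pair items that share a rare term."""
    tokens = [set(item.lower().split()) for item in items]
    # Document frequency: number of items in which each term appears.
    doc_freq = Counter(term for toks in tokens for term in toks)
    blocks = {}
    for idx, toks in enumerate(tokens):
        for term in toks:
            # Frequent terms are skipped because they discriminate poorly.
            if doc_freq[term] <= max_doc_freq:
                blocks.setdefault(term, []).append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


def matching_pairs(items, pairs, threshold=0.8):
    """Matching phase: keep candidate pairs with high fuzzy similarity."""
    matches = []
    for i, j in sorted(pairs):
        score = SequenceMatcher(None, items[i].lower(), items[j].lower()).ratio()
        if score >= threshold:
            matches.append((i, j))
    return matches


# Hypothetical item representations.
items = [
    "Canon PowerShot A520 digital camera",
    "Canon Power Shot A520 digital camera",
    "Nikon Coolpix 4600 digital camera",
    "Sony Cybershot digital camera case",
]
pairs = candidate_pairs(items)
matches = matching_pairs(items, pairs)
print(pairs, matches)
```

In this sketch the expensive similarity function runs only on the pairs that survive blocking, which is the point of the two-phase design: common terms such as "digital" and "camera" never generate candidate pairs.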

11 citations

Proceedings Article

An Efficient Coarse-to-Fine Indexing Technique for Fast Text Retrieval in Historical Documents

Partha Pratim Roy¹, Frédéric Rayar¹, Jean-Yves Ramel¹

François Rabelais University¹

27 Mar 2012

TL;DR: A fast text retrieval system to index and browse degraded historical documents, designed in a two-level, coarse-to-fine approach to increase the speed of the retrieval process.

Abstract: In this paper, we present a fast text retrieval system to index and browse degraded historical documents. The indexing and retrieval strategy is designed in a two-level, coarse-to-fine approach to increase the speed of the retrieval process. During the indexing step, the text parts in the images are encoded into sequences of primitives obtained from two different codebooks: a coarse one corresponding to connected components and a fine one corresponding to glyph primitives. A glyph consists of a single character or a part of a character, depending on the shape complexity. During the querying step, the coarse and fine signatures are generated from the query image using both codebooks. Then, a bi-level approximate string matching algorithm is applied to find similar words, using the coarse approach first and then the fine approach if necessary, by exploiting predetermined hypothetical locations. An experimental evaluation on datasets of real-life document images, gathered from historical books in different scripts, demonstrated the speed improvement and good accuracy in the presence of degradation.
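A coarse-to-fine search of this kind might be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: signatures are modeled as lists of integer codebook ids, the matcher is Sellers' classic approximate substring algorithm, and the names `approx_find`, `bilevel_search`, the error budgets, and the sample index are all hypothetical. The sketch also simplifies the paper's "fine approach if necessary" step into a plain verification pass over the lines that survive the coarse filter.

```python
def approx_find(pattern, text, max_errors):
    """Sellers' algorithm: end positions in `text` where `pattern`
    occurs with at most `max_errors` edits (insert, delete, substitute)."""
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    hits = []
    for j, symbol in enumerate(text):
        prev_diag = col[0]
        col[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == symbol else 1
            new = min(col[i] + 1, col[i - 1] + 1, prev_diag + cost)
            prev_diag = col[i]
            col[i] = new
        if col[m] <= max_errors:
            hits.append(j)
    return hits


def bilevel_search(query, index, coarse_errors=1, fine_errors=1):
    """Run the cheap coarse match first; verify survivors with fine codes."""
    coarse_query, fine_query = query
    results = []
    for line_id, (coarse_seq, fine_seq) in index.items():
        if not approx_find(coarse_query, coarse_seq, coarse_errors):
            continue  # rejected by the coarse filter
        if approx_find(fine_query, fine_seq, fine_errors):
            results.append(line_id)
    return results


# Hypothetical index: line id -> (coarse signature, fine signature).
index = {
    "line1": ([3, 7, 7, 2, 9], [31, 70, 71, 72, 20, 90]),
    "line2": ([3, 7, 7, 2, 5], [31, 55, 56, 57, 20, 50]),
    "line3": ([1, 4, 4, 8, 6], [10, 40, 41, 80, 60]),
}
query = ([7, 7, 2], [70, 71, 72, 20])
print(bilevel_search(query, index))
```

Here `line3` is discarded by the coarse pass alone, while `line2` passes the coarse pass but fails fine verification, so the more expensive fine comparison runs on only two of the three lines.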

11 citations

Journal Article

A basic analysis toolkit for biological sequences.

Raffaele Giancarlo¹, Alessandro Siragusa¹, Enrico Siragusa¹, Filippo Utro¹

University of Palermo¹

18 Sep 2007 - Algorithms for Molecular Biology

TL;DR: This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks, that includes algorithms for string matching and alignment problems, and consists of C/C++ library functions as well as Perl library functions.

Abstract: This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks: local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. It also supports filtering operations to select strings from a set and establish their statistical significance via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not previously been implemented in a single, consistent software package, as we do here. Therefore, our main contribution is to fill this gap between algorithmic theory and practice by providing an extensible and easy-to-use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available at http://www.math.unipa.it/~raffaele/BATS/ under the GNU GPL.
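As an illustration of one of the tasks listed above, the following is a textbook dynamic-programming sketch of the longest common subsequence in Python. It does not use or reproduce the BATS API (which is C/C++ and Perl); the function name `lcs` and the sample strings are assumptions for illustration.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings `a` and `b`."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Trace back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("AGGTAB", "GXTXAYB"))  # prints GTAB
```

The table costs O(mn) time and space, which is adequate for short sequences; a production library would typically offer space-saving variants for long biological sequences.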

11 citations

Patent

Fuzzy word segmentation based non-multi-character word error automatic proofreading method

Liu Liangliang, Wu Jiankang

21 Oct 2015

TL;DR: A fuzzy word segmentation based method for automatic proofreading of Chinese non-multi-character word errors is presented.

Abstract: The invention discloses a fuzzy word segmentation based non-multi-character word error automatic proofreading method. First, accurate segmentation is carried out based on a correct word dictionary and a wrong character word dictionary to generate a word graph. Next, the similarity of Chinese word strings is calculated using a fuzzy matching algorithm, the scattered strings left by accurate segmentation are fuzzy matched, and the fuzzy matching results are added to the word graph to form a fuzzy word graph. Finally, the shortest path through the fuzzy word graph is calculated using a bigram model of words in combination with similarity, so that automatic proofreading of Chinese non-multi-character word errors is realized. According to the invention, the system response is quick, the precision meets actual application demands, and the effectiveness and accuracy are high.

11 citations

Proceedings Article

Maximum-shift string matching algorithms

Hakem Adil Kadhim¹, NurAini AbdulRashid¹

Universiti Sains Malaysia¹

03 Jun 2014

TL;DR: The hybrid algorithm, Maximum-Shift, shows efficient results compared to four string matching algorithms, Quick-Search, Horspool, Smith and Berry-Ravindran, in terms of the number of attempts and the total number of character comparisons.

Abstract: String matching algorithms have broad applications in many areas of computer science, including operating systems, information retrieval, editors, Internet search engines, security applications and biological applications. Two important factors used to evaluate the performance of sequential string matching algorithms are the number of attempts and the total number of character comparisons during the matching process. This research proposes to integrate the good properties of three single string matching algorithms, Quick-Search, Zhu-Takaoka and Horspool, to produce a hybrid string matching algorithm called the Maximum-Shift algorithm. Three datasets are used to test the proposed algorithm: DNA, protein sequence and English text. The hybrid algorithm, Maximum-Shift, shows efficient results compared to four string matching algorithms, Quick-Search, Horspool, Smith and Berry-Ravindran, in terms of the number of attempts and the total number of character comparisons.
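To make the two evaluation metrics concrete, here is a minimal Python sketch of the Horspool algorithm, one of the baselines named above, instrumented to count attempts and character comparisons. It is not the Maximum-Shift algorithm, whose details are not given here; the function name `horspool_search` and the counting conventions (one attempt per window alignment, one comparison per character pair inspected) are assumptions.

```python
def horspool_search(pattern, text):
    """Exact matching with Horspool's bad-character shift.

    Returns (match positions, number of attempts, character comparisons).
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0, 0
    # Shift for each character of pattern[:-1]: distance from its
    # rightmost occurrence to the last pattern position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches, attempts, comparisons = [], 0, 0
    pos = 0
    while pos <= n - m:
        attempts += 1
        k = m - 1
        # Compare the window right to left.
        while k >= 0:
            comparisons += 1
            if pattern[k] != text[pos + k]:
                break
            k -= 1
        if k < 0:
            matches.append(pos)
        # Shift by the character aligned with the window's last position;
        # characters absent from pattern[:-1] allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches, attempts, comparisons


found, attempts, comparisons = horspool_search(
    "GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"
)
print(found, attempts, comparisons)
```

Running the same instrumented loop with a different shift rule (for example Quick-Search, which shifts on the character just past the window) would allow the two counts to be compared on the same input.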

11 citations

Performance

Metrics

1,942 papers

64,998 citations

No. of papers in the topic in previous years
Year	Papers
2023	8
2022	30
2021	32
2020	30
2019	48
2018	39