Topic

Approximate string matching

About: Approximate string matching is a research topic. To date, 1,903 publications on this topic have received 62,352 citations. The topic is also known as: fuzzy string-searching algorithm and fuzzy string-matching algorithm.


Papers
Proceedings ArticleDOI
28 Jun 2000
TL;DR: A fast approximate Chinese word-matching algorithm that handles not only character substitution errors but also insertion, deletion and string substitution errors, as well as Chinese "non-word" errors, making it easy to establish a two-level structure in Chinese spelling correction.
Abstract: A fast approximate Chinese word-matching algorithm is presented. The algorithm can be used to implement the concept of Chinese fuzzy matching. Based on the algorithm, an automatic Chinese text error correction approach using confusing-word substitution and language model evaluation is designed. Compared with Zhang's (1994) confusing-character substitution method, this new approach can deal with not only character substitution errors but also insertion, deletion and string substitution errors. In addition, the algorithm can handle Chinese "non-word" errors, making it possible and easy to establish a two-level structure in Chinese spelling correction.

6 citations

Proceedings ArticleDOI
I. Sadeh
30 Mar 1993
TL;DR: The duality between the two algorithms is proved with some asymptotic properties concerning the workings of an approximate string matching algorithm for ergodic stationary sources.
Abstract: Two practical universal source coding schemes are proposed. One is an approximate fixed-length string matching data compression, and the other is LZ-type quasi parsing by approximate string matching. It is shown that in the former algorithm the compression rate converges to the theoretical bound of R(D) for a large class of processes as the database size and the string length tend to infinity. A similar result holds for the latter algorithm in the limit of infinite database size. The performance of the two algorithms is evaluated for finite database size and finite string length. The duality between the two algorithms is proved with some asymptotic properties concerning the workings of an approximate string matching algorithm for ergodic stationary sources.

6 citations

Journal ArticleDOI
01 Mar 2016
TL;DR: A fuzzy string matching algorithm is applied for self‐citation detection and near full recall can be achieved with the proposed method while incurring only negligible precision loss.
Abstract: In this article I investigate the shortcomings of exact string match-based author self-citation detection methods. The contributions of this study are twofold. First, I apply a fuzzy string matching algorithm for self-citation detection and benchmark this approach and other common methods of exclusively author name-based self-citation detection against a manually curated ground truth sample. Near full recall can be achieved with the proposed method while incurring only negligible precision loss. Second, I report some important observations from the results about the extent of latent self-citations and their characteristics and give an example of the effect of improved self-citation detection on the document level self-citation rate of real data.
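The gap between exact and fuzzy author-name matching described above can be illustrated with a minimal sketch. The abstract does not name the fuzzy matching algorithm or any threshold, so the choice of Python's `difflib.SequenceMatcher`, the `0.85` cutoff, and the function names below are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy comparison of two author names.

    The similarity measure and the threshold are illustrative assumptions,
    not the method used in the article.
    """
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def is_self_citation(citing_authors, cited_authors, threshold=0.85):
    """Flag a citation when any citing author fuzzily matches a cited author."""
    return any(
        names_match(a, b, threshold)
        for a in citing_authors
        for b in cited_authors
    )


def is_self_citation_exact(citing_authors, cited_authors):
    """Baseline: exact string match on normalized names only."""
    cited = {normalize(b) for b in cited_authors}
    return any(normalize(a) in cited for a in citing_authors)
```

With these hypothetical helpers, a spelling variant such as "Smith, Jon" versus "Smith, John" is missed by the exact baseline but caught by the fuzzy comparison, which is the kind of latent self-citation the abstract refers to.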

6 citations

Patent
Jagir R. Hussan, Albee Jhoney
20 Dec 2002
TL;DR: In this paper, a method, system and computer program product for identifying occurrences of a sequence of ordered marker strings in a string are disclosed, which particularly relate to finding a gene in a DNA sequence.
Abstract: A method, system and computer program product for identifying occurrences of a sequence of ordered marker strings in a string are disclosed. The method includes the steps of identifying sub-strings in the string that match the marker, for each marker string except the last marker string in the ordered sequence of marker strings creating directed links between a sub-string that matches a particular marker string and all the sub-strings that match a subsequent marker string in the ordered sequence of marker strings, and identifying occurrences of the sequence in the string by tracing one or more corresponding paths from each sub-string that matches the first marker string to all sub-strings that match the last marker string by following the directed links. The method, system and computer program product disclosed particularly relate to finding a gene in a DNA sequence.
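The three steps in the abstract (find matches per marker, link each match to matches of the next marker, trace paths from first to last) can be sketched as follows. This is only an illustration: it uses exact substring matching, and it assumes a link is created only when the later match starts at or after the end of the earlier one. The abstract does not state either detail, and the function names are hypothetical.

```python
def find_all(text: str, marker: str) -> list[int]:
    """Return the start index of every (possibly overlapping) occurrence."""
    positions = []
    start = text.find(marker)
    while start != -1:
        positions.append(start)
        start = text.find(marker, start + 1)
    return positions


def find_marker_chains(text: str, markers: list[str]) -> list[tuple[int, ...]]:
    """Return every ordered chain of match positions, one per marker."""
    if not markers:
        return []

    # Step 1: sub-strings matching each marker.
    matches = [find_all(text, m) for m in markers]

    # Step 2: directed links from each match of marker i to matches of
    # marker i + 1. Assumption: the next match must not start before the
    # current match ends.
    links = []
    for i in range(len(markers) - 1):
        width = len(markers[i])
        links.append(
            {p: [q for q in matches[i + 1] if q >= p + width] for p in matches[i]}
        )

    # Step 3: trace paths from first-marker matches to last-marker matches.
    chains = []

    def trace(level: int, pos: int, path: list[int]) -> None:
        path = path + [pos]
        if level == len(markers) - 1:
            chains.append(tuple(path))
            return
        for nxt in links[level][pos]:
            trace(level + 1, nxt, path)

    for p in matches[0]:
        trace(0, p, [])
    return chains
```

For example, `find_marker_chains("ATGCCTAAATGTAA", ["ATG", "TAA"])` pairs each "ATG" position with every later "TAA" position, loosely resembling a start and stop marker pair in a DNA string.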

6 citations

Journal ArticleDOI
TL;DR: This paper proposes a method for genome data classification based on approximate matching and shows that classification accuracy increases with sampling size.
Abstract: Genomic data mining and knowledge extraction is an important problem in bioinformatics. Some research has been done on unknown genome identification, based on exact pattern matching of n-grams. In most real-world biological problems, exact matching may not give the desired results, and the use of n-grams leads to exponential explosion. In this paper we propose a method for genome data classification based on approximate matching. The algorithm works by selecting random samples from the genome database. Tolerance is allowed by generating candidates of varied length to query from these sample sequences. The Levenshtein distance is then computed for each candidate to determine whether they are k-fuzzily equal. The total number of fuzzy matches for each sequence is then calculated. The result is then classified using data mining techniques, namely naive Bayes, support vector machine, back propagation, and nearest neighbor. Experimental results are provided for different tolerance levels and show that accuracy increases with tolerance. We also show the effect of sampling size on classification accuracy: accuracy increases with sampling size. Genome data of two species, yeast and E. coli, are used to verify the proposed method.
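The Levenshtein check and the fuzzy-match count mentioned in the abstract can be sketched as below. This is a minimal illustration, assuming that "k-fuzzily equal" means an edit distance of at most k; the function names and the way candidates are compared to a query are hypothetical and do not reproduce the paper's sampling or candidate generation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def k_fuzzy_equal(a: str, b: str, k: int) -> bool:
    """Assumed definition: two strings are k-fuzzily equal if distance <= k."""
    return levenshtein(a, b) <= k


def count_fuzzy_matches(candidates: list[str], query: str, k: int) -> int:
    """Count candidates that are k-fuzzily equal to the query."""
    return sum(k_fuzzy_equal(c, query, k) for c in candidates)
```

Raising `k` admits more candidates as matches, which is the tolerance level the abstract varies in its experiments. The resulting per-sequence counts would then serve as features for a classifier.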

6 citations


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 80% related
Scheduling (computing): 78.6K papers, 1.3M citations, 79% related
Network packet: 159.7K papers, 2.2M citations, 78% related
Optimization problem: 96.4K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    8
2022    30
2021    32
2020    30
2019    48
2018    39