Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have been published within this topic, receiving 62352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.

Papers published on a yearly basis

[Chart: papers per year, 1973 to 2023]

Papers

Proceedings Article

Automatic Chinese text error correction approach based-on fast approximate Chinese word-matching algorithm

Zhang Lei¹, Zhou Ming, Huang Changning, Sun Maosong

Tsinghua University¹

28 Jun 2000

TL;DR: A fast approximate Chinese word-matching algorithm that can deal with not only character substitution errors but also insertion, deletion and string substitution errors and can handle Chinese "non-word" error, making it possible and easy to establish a two-level structure in Chinese spelling correction.

Abstract: A fast approximate Chinese word-matching algorithm is presented. The algorithm can be used to implement the Chinese fuzzy-matching conception. Based on the algorithm, an automatic Chinese text error correction approach using confusing-word substitution and language model evaluation is designed. Compared with Zhang's (1994) confusing-character substitution method, this new approach can deal with not only character substitution errors but also insertion, deletion and string substitution errors. Besides, the algorithm can handle Chinese "non-word" error, making it possible and easy to establish a two-level structure in Chinese spelling correction.

6 citations

Proceedings Article

On approximate string matching

I. Sadeh¹

Tel Aviv University¹

30 Mar 1993

TL;DR: The duality between the two algorithms is proved with some asymptotic properties concerning the workings of an approximate string matching algorithm for ergodic stationary sources.

Abstract: Two practical universal source coding schemes are proposed. One is an approximate fixed length string matching data compression, and the other is LZ-type quasi parsing by approximate string matching. It is shown that in the former algorithm the compression rate converges to the theoretical bound of R(D) for a large class of processes as the database size and the string length tend to infinity. A similar result holds for the latter algorithm in the limit of infinite data base size. The performance of the two algorithms is evaluated where data base size is finite and string length finite. The duality between the two algorithms is proved with some asymptotic properties concerning the workings of an approximate string matching algorithm for ergodic stationary sources.

6 citations

Journal Article

Enhanced self-citation detection by fuzzy author name matching and complementary error estimates

Paul Donner

01 Mar 2016

TL;DR: A fuzzy string matching algorithm is applied for self‐citation detection and near full recall can be achieved with the proposed method while incurring only negligible precision loss.

Abstract: In this article I investigate the shortcomings of exact string match-based author self-citation detection methods. The contributions of this study are twofold. First, I apply a fuzzy string matching algorithm for self-citation detection and benchmark this approach and other common methods of exclusively author name-based self-citation detection against a manually curated ground truth sample. Near full recall can be achieved with the proposed method while incurring only negligible precision loss. Second, I report some important observations from the results about the extent of latent self-citations and their characteristics and give an example of the effect of improved self-citation detection on the document level self-citation rate of real data.

6 citations
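
The fuzzy author-name matching described in the abstract above can be illustrated with a minimal sketch. The abstract does not name the matching algorithm, so this example substitutes the similarity ratio from Python's standard `difflib.SequenceMatcher` as a stand-in. The function names, the normalization rule, and the 0.8 threshold are all assumptions chosen for illustration, not the article's method.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rule)."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())


def names_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the two normalized names are similar enough.

    The 0.8 threshold is a hypothetical value, not taken from the article.
    """
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return ratio >= threshold


def is_self_citation(citing_authors, cited_authors, threshold: float = 0.8) -> bool:
    """Flag a citation when any citing author fuzzily matches any cited author."""
    return any(
        names_match(citing, cited, threshold)
        for citing in citing_authors
        for cited in cited_authors
    )


# An abbreviated and a full form of the same name are treated as a match,
# which an exact string comparison would miss.
print(is_self_citation(["Donner, P."], ["Donner, Paul", "Smith, J."]))
print(is_self_citation(["Smith, J."], ["Donner, Paul"]))
```

A lower threshold raises recall at the cost of precision, which is the trade-off the abstract evaluates against a manually curated sample.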

Patent

Gene finding using ordered sets

Jagir R. Hussan¹, Albee Jhoney¹

IBM¹

20 Dec 2002

TL;DR: A method, system and computer program product for identifying occurrences of a sequence of ordered marker strings in a string are disclosed; they particularly relate to finding a gene in a DNA sequence.

Abstract: A method, system and computer program product for identifying occurrences of a sequence of ordered marker strings in a string are disclosed. The method includes the steps of identifying sub-strings in the string that match the marker, for each marker string except the last marker string in the ordered sequence of marker strings creating directed links between a sub-string that matches a particular marker string and all the sub-strings that match a subsequent marker string in the ordered sequence of marker strings, and identifying occurrences of the sequence in the string by tracing one or more corresponding paths from each sub-string that matches the first marker string to all sub-strings that match the last marker string by following the directed links. The method, system and computer program product disclosed particularly relate to finding a gene in a DNA sequence.

6 citations
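
The three steps in the patent abstract above (find sub-strings matching each marker, link each match to matches of the next marker, trace paths from first-marker matches to last-marker matches) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: it uses exact matching, it only links to occurrences that start at or after the end of the current one, and the names `find_all` and `ordered_marker_paths` are hypothetical.

```python
def find_all(text: str, marker: str) -> list:
    """Start offsets of every exact occurrence of marker (overlaps allowed)."""
    positions = []
    start = text.find(marker)
    while start != -1:
        positions.append(start)
        start = text.find(marker, start + 1)
    return positions


def ordered_marker_paths(text: str, markers: list) -> list:
    """Return every chain of offsets that hits the markers in order."""
    if not markers:
        return []
    # Step 1: sub-strings in the text that match each marker.
    hits = [find_all(text, marker) for marker in markers]
    # Step 2: directed links from each match of marker i to the matches of
    # marker i + 1 (assumption: only those starting at or after its end).
    links = {}
    for i in range(len(markers) - 1):
        for pos in hits[i]:
            end = pos + len(markers[i])
            links[(i, pos)] = [nxt for nxt in hits[i + 1] if nxt >= end]
    # Step 3: follow the links depth-first from every first-marker match.
    paths = []

    def trace(i, pos, path):
        path = path + [pos]
        if i == len(markers) - 1:
            paths.append(path)
            return
        for nxt in links[(i, pos)]:
            trace(i + 1, nxt, path)

    for pos in hits[0]:
        trace(0, pos, [])
    return paths


# Two "ATG" occurrences can each reach the single "CC" and "TAA" occurrences.
print(ordered_marker_paths("ATGAAATGCCTAA", ["ATG", "CC", "TAA"]))
```

Each returned list holds one start offset per marker, so a caller can recover the span of a candidate region from the first and last entries.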

Journal Article

Genome data classification based on fuzzy matching

Nagamma Patil¹, Durga Toshniwal¹, Kumkum Garg²

Indian Institutes of Technology¹, Manipal University²

01 Mar 2013, CSI Transactions on ICT

TL;DR: This paper proposes a method for genome data classification based on approximate matching and shows that classification accuracy increases with sampling size.

Abstract: Genomic data mining and knowledge extraction is an important problem in bioinformatics. Some research work has been done on unknown genome identification and is based on exact pattern matching of n-grams. In most of the real world biological problems exact matching may not give desired results and the problem in using n-grams is exponential explosion. In this paper we propose a method for genome data classification based on approximate matching. The algorithm works by selecting random samples from the genome database. Tolerance is allowed by generating candidates of varied length to query from these sample sequences. The Levenshtein distance is then checked for each candidate and whether they are k-fuzzily equal. The total number of fuzzy matches for each sequence is then calculated. This is then classified using the data mining techniques namely, naive Bayes, support vector machine, back propagation and also by nearest neighbor. Experiment results are provided for different tolerance levels and they show that accuracy increases as tolerance does. We also show the effect of sampling size on the classification accuracy and it was observed that classification accuracy increases with sampling size. Genome data of two species namely Yeast and E. coli are used to verify proposed method.

6 citations
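
The Levenshtein check and the "k-fuzzily equal" test mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming that two strings are k-fuzzily equal when their edit distance is at most k. The helper `count_fuzzy_matches` and its fixed-length sliding window are simplifying assumptions; the paper generates candidates of varied length from random samples, which is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def k_fuzzy_equal(a: str, b: str, k: int) -> bool:
    """Assumed definition: edit distance no greater than the tolerance k."""
    return levenshtein(a, b) <= k


def count_fuzzy_matches(sequence: str, candidate: str, k: int) -> int:
    """Count windows of len(candidate) in sequence within distance k.

    Hypothetical helper: a fixed window is a simplification of the paper's
    varied-length candidates.
    """
    width = len(candidate)
    return sum(
        k_fuzzy_equal(sequence[i:i + width], candidate, k)
        for i in range(len(sequence) - width + 1)
    )


# Windows "ACGT" (distance 0) and "ACGA" (distance 1) both count at k = 1.
print(count_fuzzy_matches("ACGTACGA", "ACGT", 1))
```

The per-sequence match count is the kind of feature the abstract describes passing to a classifier; raising k admits more windows as matches.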

Performance

Metrics

Papers: 1,942
Citations: 64,998

No. of papers in the topic in previous years
Year	Papers
2023	8
2022	30
2021	32
2020	30
2019	48
2018	39
