Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared within this topic, receiving 62,352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.

Papers published on a yearly basis (chart; years shown: 1973 to 2023)

Papers

Journal Article

The longest common extension problem revisited and applications to approximate string searching

Lucian Ilie¹, Gonzalo Navarro², Liviu Tinta¹

University of Western Ontario¹, University of Chile²

01 Dec 2010-Journal of Discrete Algorithms

TL;DR: Two very simple algorithms for the Longest Common Extension problem are given that require no preprocessing; the first is 5 times faster than the best previous algorithms on average, whereas the second is faster on virtually all inputs.
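To make the problem concrete, here is a minimal Python sketch of a Longest Common Extension query answered by direct character comparison, with no preprocessing of the string. It only illustrates the problem setting; the function name `lce` is an assumption for this example, and the code is not taken from the paper.

```python
def lce(s: str, i: int, j: int) -> int:
    """Return the length of the longest common prefix of s[i:] and s[j:].

    Illustrative sketch only: scans both suffixes character by character,
    so it needs no preprocessing but takes time proportional to the answer.
    """
    n = len(s)
    k = 0
    # Extend the match while both positions stay inside the string
    # and the characters agree.
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k


if __name__ == "__main__":
    # "abcabcx"[0:] and "abcabcx"[3:] share the prefix "abc".
    print(lce("abcabcx", 0, 3))  # prints 3
```

Each query costs time proportional to the length of the common extension, which is short for most position pairs in typical text.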

48 citations

Proceedings Article

Peter Christen¹, Ross W. Gayler², David Hawking

Australian National University¹, Veda²

02 Nov 2009

TL;DR: Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach.

Abstract: Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch-mode and on static databases. However, many organisations are increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, online retail stores, eGovernment services, and digital libraries. A novel inverted index based approach for real-time entity resolution is presented in this paper. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other approaches to approximate query matching in that it allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be incorporated. Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach. The interested reader is referred to the longer version of this paper [5].
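The idea of computing similarities at build time and reusing them at query time can be sketched as follows. This is a simplified, single-attribute illustration under stated assumptions, not the authors' implementation: the class `SimilarityAwareIndex`, the blocking function `first_letter_block`, and the similarity function `ratio_similarity` are all hypothetical names chosen for this example, and both functions are pluggable, as the abstract describes.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def first_letter_block(value):
    """Hypothetical blocking (encoding) function: lowercased first character."""
    return value[:1].lower()


def ratio_similarity(a, b):
    """Hypothetical similarity function based on difflib's matching ratio."""
    return SequenceMatcher(None, a, b).ratio()


class SimilarityAwareIndex:
    """Single-attribute sketch of an inverted index with stored similarities."""

    def __init__(self, block=first_letter_block, sim=ratio_similarity):
        self.block = block
        self.sim = sim
        self.records = defaultdict(set)  # attribute value -> record ids
        self.blocks = defaultdict(set)   # block key -> attribute values
        self.sims = defaultdict(dict)    # value -> {other value: similarity}

    def add(self, record_id, value):
        """Build time: store similarities to all values in the same block."""
        if value not in self.records:
            key = self.block(value)
            for other in self.blocks[key]:
                s = self.sim(value, other)
                self.sims[value][other] = s
                self.sims[other][value] = s
            self.blocks[key].add(value)
        self.records[value].add(record_id)

    def query(self, value, threshold=0.0):
        """Query time: reuse stored similarities when the value is known."""
        if value in self.records:
            candidates = dict(self.sims[value])
            candidates[value] = 1.0  # exact match
        else:
            # Unseen value: compare only against values in its block.
            same_block = self.blocks.get(self.block(value), ())
            candidates = {other: self.sim(value, other) for other in same_block}
        results = [
            (rid, s)
            for other, s in candidates.items()
            if s >= threshold
            for rid in self.records[other]
        ]
        # Best matches first; ties broken by record id.
        results.sort(key=lambda pair: (-pair[1], pair[0]))
        return results


if __name__ == "__main__":
    index = SimilarityAwareIndex()
    index.add(1, "smith")
    index.add(2, "smyth")
    index.add(3, "jones")
    print(index.query("smith"))
```

In this sketch a query for a value already in the index performs no similarity computations at all, which is the trade the abstract describes: more work and storage at build time in exchange for fast lookups.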

48 citations

Journal Article

s-grams: Defining generalized n-grams for information retrieval

Anni Järvelin¹, Antti Järvelin¹, Kalervo Järvelin¹

University of Tampere¹

01 Jul 2007-Information Processing and Management

TL;DR: This paper defines the reduction of s-gram profiles to binary profiles in order to precisely define the (extended) Jaccard similarity function for s-grams, and shows that n-gram similarity/distance computations are special cases of the generalized definitions.

Abstract: n-grams have been used widely and successfully for approximate string matching in many areas. s-grams have been introduced recently as an n-gram based matching technique, where di-grams are formed of both adjacent and non-adjacent characters. s-grams have proved successful in approximate string matching across language boundaries in Information Retrieval (IR). s-grams however lack precise definitions. Also their similarity comparison lacks precise definition. In this paper, we give precise definitions for both. Our definitions are developed in a bottom-up manner, only assuming character strings and elementary mathematical concepts. Extending established practices, we provide novel definitions of s-gram profiles and the L1 distance metric for them. This is a stronger string proximity measure than the popular Jaccard similarity measure because Jaccard is insensitive to the counts of each n-gram in the strings to be compared. However, due to the popularity of Jaccard in IR experiments, we define the reduction of s-gram profiles to binary profiles in order to precisely define the (extended) Jaccard similarity function for s-grams. We also show that n-gram similarity/distance computations are special cases of our generalized definitions.
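The contrast between a count-based L1 distance and Jaccard similarity on binary profiles can be illustrated with a short Python sketch. It is a simplified reading of the abstract rather than the paper's formal definitions: the function names are assumptions, and pooling all chosen skip distances into a single profile is a simplification made for this example.

```python
from collections import Counter


def s_grams(text, skip):
    """Di-grams of characters that are `skip` positions apart.

    skip=0 pairs adjacent characters, i.e. ordinary bigrams.
    """
    step = skip + 1
    return [text[i] + text[i + step] for i in range(len(text) - step)]


def profile(text, skips=(0, 1)):
    """Count profile of all s-grams for the given skip distances.

    Assumption for this sketch: grams from all skips share one profile.
    """
    counts = Counter()
    for k in skips:
        counts.update(s_grams(text, k))
    return counts


def l1_distance(p, q):
    """Sum of absolute count differences; sensitive to gram multiplicity."""
    return sum(abs(p[g] - q[g]) for g in set(p) | set(q))


def jaccard(p, q):
    """Jaccard similarity on binary profiles (gram present or absent)."""
    a, b = set(p), set(q)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    p = profile("aaa", skips=(0,))  # {"aa": 2}
    q = profile("aa", skips=(0,))   # {"aa": 1}
    print(l1_distance(p, q))  # prints 1
    print(jaccard(p, q))      # prints 1.0
```

The two strings in the usage example have identical binary profiles, so Jaccard reports them as fully similar, while the L1 distance still registers the difference in counts.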

48 citations

Journal Article

On the Common Substring Alignment Problem

Gad M. Landau, Michal Ziv-Ukelson¹

IBM¹

01 Nov 2001-Journal of Algorithms

TL;DR: This paper describes an algorithm which is composed of an encoding stage and an alignment stage, and shows how to reduce the O(nℓ) alignment work, for each appearance of the common substring Y in a source string, to O(n), at the cost of O(nℓ) encoding work, which is executed only once (ℓ being the length of Y).

47 citations

Journal Article

Two-dimensional dictionary matching

Amihood Amir¹, Martin Farach²

Georgia Institute of Technology¹, Rutgers University²

21 Dec 1992-Information Processing Letters

TL;DR: This paper presents an algorithm for the Two-Dimensional Dictionary Problem, that of finding each occurrence of a set of two-dimensional patterns in a text.
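As a way to pin down the problem statement, the following brute-force Python sketch reports every occurrence of each pattern from a dictionary of two-dimensional patterns in a two-dimensional text. It is only a naive baseline for illustration, not the algorithm from the paper; the function name and the representation of grids as lists of equal-length strings are assumptions.

```python
def find_2d_occurrences(text, patterns):
    """Return (pattern name, row, column) for every occurrence.

    `text` is a list of equal-length strings; `patterns` maps a name to a
    rectangular pattern in the same representation. Brute force: every
    pattern is tried at every top-left position.
    """
    hits = []
    rows = len(text)
    cols = len(text[0]) if text else 0
    for name, pat in patterns.items():
        p_rows = len(pat)
        p_cols = len(pat[0]) if pat else 0
        if p_rows == 0 or p_cols == 0:
            continue  # skip empty patterns
        for r in range(rows - p_rows + 1):
            for c in range(cols - p_cols + 1):
                # Compare the pattern row by row against the text window.
                if all(text[r + dr][c:c + p_cols] == pat[dr]
                       for dr in range(p_rows)):
                    hits.append((name, r, c))
    return hits


if __name__ == "__main__":
    text = ["abab",
            "baba",
            "abab"]
    patterns = {"p": ["ab",
                      "ba"]}
    print(find_2d_occurrences(text, patterns))
    # prints [('p', 0, 0), ('p', 0, 2), ('p', 1, 1)]
```

The running time of this baseline grows with the text size times the total size of all patterns, which is the cost that dedicated dictionary-matching algorithms aim to avoid.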

47 citations


Performance

Metrics

Papers: 1,942
Citations: 64,998

No. of papers in the topic in previous years
Year	Papers
2023	8
2022	30
2021	32
2020	30
2019	48
2018	39
