Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have been published within this topic, receiving 62352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.
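
To make the topic concrete, here is a minimal sketch of one classical formulation: find every position in a text where a pattern occurs with at most k edits (insertions, deletions, substitutions). It uses the standard dynamic-programming recurrence in which a match may start at any text position. The function name `approx_find` and its return convention (exclusive end positions) are assumptions of this sketch and are not taken from any paper listed on this page.

```python
def approx_find(pattern: str, text: str, k: int) -> list[int]:
    """Return exclusive end positions in `text` where `pattern`
    matches some substring with edit distance <= k."""
    m = len(pattern)
    # prev[i]: edit distance between pattern[:i] and the best-matching
    # substring of the text that ends at the previous text position.
    prev = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra text character
                cur[i - 1] + 1,      # missing pattern character
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends
```

The sketch runs in O(|pattern| · |text|) time and keeps only two columns of the table. Filtration and indexing methods, such as those in the papers below, aim to avoid scanning the whole text this way.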

Papers published on a yearly basis (chart not reproduced; x-axis: publication year, 1973 to 2023)

Papers

Book Chapter•DOI•

Practical suffix tree construction

Sandeep Tata¹, Richard A. Hankins¹, Jignesh M. Patel¹•Institutions (1)

University of Michigan¹

31 Aug 2004

TL;DR: This paper presents a buffer management strategy for the O(n²) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature.

Abstract: Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n²) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n²) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.
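
As a rough illustration of quadratic-time suffix indexing, the sketch below builds an uncompressed suffix trie by inserting every suffix character by character, then answers exact substring queries against it. This is not the cache-efficient or disk-based algorithm from the paper, and it is not a true suffix tree (edges are not path-compressed). The names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of `text` into a nested-dict trie.
    Takes O(n^2) time and space in the worst case."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Follow the existing child for `ch`, or create it.
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """True if `pattern` is a substring of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Every substring of the text is a prefix of some suffix, so a query only needs to walk down from the root, taking time proportional to the pattern length.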

79 citations

Journal Article•DOI•

Multiseed Lossless Filtration

Gregory Kucherov¹, Laurent Noé¹, Mikhail A. Roytberg•Institutions (1)

French Institute for Research in Computer Science and Automation¹

01 Jan 2005-IEEE/ACM Transactions on Computational Biology and Bioinformatics

TL;DR: A method of seed-based lossless filtration for approximate string matching and related bioinformatics applications and a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database are reported.

Abstract: We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
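
The lossless property can be checked by brute force on small instances. The sketch below assumes a Hamming-distance setting: two strings of length m that differ in at most k positions. A seed is written as a string of `#` (position must match) and `-` (don't care), and a family is lossless if, for every placement of k mismatches, at least one seed fits at some offset without touching a mismatch. The notation and the names `seed_hits` and `is_lossless` are assumptions of this sketch. The paper's own algorithms are more efficient than this enumeration.

```python
from itertools import combinations


def seed_hits(seed: str, mismatches: frozenset, m: int) -> bool:
    """True if `seed` can be placed at some offset in 0..m-len(seed)
    so that none of its '#' positions falls on a mismatch."""
    required = [i for i, c in enumerate(seed) if c == "#"]
    for offset in range(m - len(seed) + 1):
        if all(offset + i not in mismatches for i in required):
            return True
    return False


def is_lossless(seeds: list, m: int, k: int) -> bool:
    """Brute-force check that the seed family detects every pair of
    length-m strings with at most k mismatches."""
    # Checking exactly k mismatches suffices: removing a mismatch
    # can never destroy an existing hit.
    for positions in combinations(range(m), k):
        mism = frozenset(positions)
        if not any(seed_hits(s, mism, m) for s in seeds):
            return False
    return True
```

For example, under these assumptions the contiguous seed `###` is lossless for m = 7, k = 1 but not for m = 5, k = 1, where a mismatch in the middle position blocks every placement.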

79 citations

Book Chapter•DOI•

Indexing Text with Approximate q-Grams

Gonzalo Navarro¹, Erkki Sutinen², Jani Tanninen², Jorma Tarhio²•Institutions (2)

University of Chile¹, University of Eastern Finland²

21 Jun 2000

TL;DR: A new index for approximate string matching is presented, and it is shown experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.

Abstract: We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected in the presence of some text q-samples that match approximately inside the pattern. We show experimentally that the parameterization mechanism of the related filtration scheme provides a compromise between the space requirement of the index and the error level for which the filtration is still efficient.
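
The indexing step described in the abstract can be sketched as follows: take a substring of length q every h characters and record where each distinct sample occurs. Only index construction is shown. The search-time filtration, which looks for samples that match approximately inside the pattern, is omitted. The function name `build_q_sample_index` and the parameter name `h` for the sampling interval are assumptions of this sketch.

```python
from collections import defaultdict


def build_q_sample_index(text: str, q: int, h: int) -> dict:
    """Map each q-sample (a substring of length q taken every h
    characters) to the list of text positions where it was sampled."""
    if q <= 0 or h < q:
        # Samples must be disjoint, so the step is at least q.
        raise ValueError("need q > 0 and h >= q")
    index = defaultdict(list)
    for pos in range(0, len(text) - q + 1, h):
        index[text[pos:pos + q]].append(pos)
    return dict(index)
```

A larger interval h stores fewer samples and so shrinks the index, which matches the trade-off between space and tolerated error level that the abstract describes.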

79 citations

Proceedings Article•DOI•

Random access to grammar-compressed strings

Philip Bille¹, Gad M. Landau², Rajeev Raman³, Kunihiko Sadakane⁴, Srinivasa Rao Satti⁵, Oren Weimann⁶•Institutions (6)

Technical University of Denmark¹, University of Haifa², University of Leicester³, National Institute of Informatics⁴, Seoul National University⁵, Weizmann Institute of Science⁶

23 Jan 2011

TL;DR: The authors present two representations of a string of length N compressed into a context-free grammar of size n, achieving O(log N) random access time with either O(n · α_k(n)) construction time and space on the pointer machine model or O(n) construction time and space on the RAM.

Abstract: Let S be a string of length N compressed into a context-free grammar 𝒮 of size n. We present two representations of 𝒮 achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the kth row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k⁴ + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy-paths in grammars.
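
For orientation, the sketch below shows the simple baseline for random access in a grammar-compressed string: store the expansion length of every nonterminal and descend from the start symbol. It assumes the grammar is a straight-line program in which each rule is either a single character or a pair of nonterminals. A query costs time proportional to the height of the grammar, not the O(log N) bound achieved by the paper's representations. The names `expansion_lengths` and `access` are hypothetical.

```python
def expansion_lengths(rules: dict) -> dict:
    """Length of the string each symbol expands to.
    A rule is a 1-character str (terminal) or a (left, right) tuple."""
    lengths: dict = {}

    def length(sym: str) -> int:
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):
                lengths[sym] = 1
            else:
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def access(rules: dict, lengths: dict, start: str, i: int) -> str:
    """Return character i of the expansion of `start` without
    decompressing; time is proportional to the grammar height."""
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    sym = start
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lengths[left]:
            sym = left            # target lies in the left child
        else:
            i -= lengths[left]    # skip the left child's expansion
            sym = right
```

With the rules `{"A": "a", "B": "b", "C": ("A", "B"), "D": ("C", "C"), "E": ("D", "A")}`, the symbol `E` expands to `ababa`, and `access` recovers each character by index.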

77 citations

Proceedings Article•DOI•

Query by rhythm: an approach for song retrieval in music databases

J.C.C. Chen¹, Arbee L. P. Chen¹•Institutions (1)

National Tsing Hua University¹

23 Feb 1998

TL;DR: This work proposes techniques for retrieving songs by rhythm from music databases by defining similarity measures on rhythm strings and proposing an index structure, called L-tree, to support efficient sub-string matching.

Abstract: We propose techniques for retrieving songs by rhythm from music databases. The rhythm of songs is modeled by rhythm strings. The song retrieval problem is then transformed to the string matching problem. In order to allow approximate string matching, we define similarity measures on rhythm strings. An index structure, called L-tree, is proposed to support efficient sub-string matching. Retrieval algorithms based on L-tree are then designed to provide approximate and sub-song retrieval. Experimental results show that this approach is effective and efficient.

77 citations

Performance metrics

Papers: 1,942
Citations: 64,998

No. of papers in the topic in previous years
Year	Papers
2023	8
2022	30
2021	32
2020	30
2019	48
2018	39
