
A comparison of approximate string matching algorithms

01 Dec 1996-Software - Practice and Experience (John Wiley & Sons, Inc.)-Vol. 26, Iss: 12, pp 1439-1458
Abstract: Experimental comparisons of the running time of approximate string matching algorithms for the k differences problem are presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.

Summary

Introduction

  • Experimental comparison of the running time of approximate string matching algorithms for the k differences problem is presented.
  • Tarhio and Ukkonen[8, 9] present an algorithm which is based on the Boyer-Moore approach and works in sublinear average time.
  • The theoretical analyses given in the literature are helpful, but it is important that the theory is complemented with sufficiently extensive experimental comparisons.
  • The algorithm evaluates a modified form of table D.


  • For every C-diagonal, Algorithm GP performs an iteration that evaluates it from the two previous C-diagonals (lines 7–38).
  • The evaluation of each entry starts with evaluating the Col value (line 11).
  • The sequence is updated on lines 28–35. Procedure Within(d) called on line 14 tests if text position d is within some interval of the k first reference triples in the sequence.
  • Instead of the whole C defined above, table C of the algorithm contains only three successive C-diagonals.
  • The use of this buffer of three diagonals is organized with variables B1, B2, and B3.


  • The scanning phase (lines 3–16) scans over the text and marks the parts that may contain approximate occurrences of P.
  • Parameter x of call EDP(x) tells how many columns should be evaluated for one marked diagonal.
  • The minimum value m for x is applicable for DC.
  • The scanning phase is almost identical to the original algorithm.
  • If f(x) and q(x) are the frequencies of character x in the pattern and in Q, variable Z has the value Σ_{x in Q} max(q(x) − f(x), 0). The value of Z is computed together with table C, which maintains the difference f(x) − q(x) for every x.


  • In the former case there is no approximate occurrence at the current alignment, and in the latter case a potential approximate occurrence has been found.
  • For determining the length of the shift, i.e. what is the next potential diagonal after h for marking, the authors search for the first diagonal after h where at least one of the characters t_{h+m}, t_{h+m−1}, …, t_{h+m−k} matches the corresponding character of P.


  • The authors performed an extensive test program on all seven algorithms DP, EDP, GP, DC, UW, MM, and ABM described in the previous sections.
  • In their tests, the authors used random patterns of varying lengths and random texts of length 100,000 characters over alphabets of different sizes.
  • Because algorithms EDP, DC, MM, and ABM were better than the others, the authors studied relations of their execution times more carefully.
  • The execution times of EDP and ABM on Sun (shown in Table II for some parameter values) were on the average 68 per cent and 60 per cent, respectively, of the corresponding times on Vaxstation.


SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 1(1), 1–4 (JANUARY 1988)
A Comparison of
Approximate String Matching Algorithms
PETTERI JOKINEN, JORMA TARHIO, AND ESKO UKKONEN
Department of Computer Science, P.O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland
(email: tarhio@cs.helsinki.fi)
SUMMARY
Experimental comparison of the running time of approximate string matching algorithms for the k differences problem is presented. Given a pattern string, a text string, and integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.
KEY WORDS String matching Edit distance k differences problem
INTRODUCTION
We consider the k differences problem, a version of the approximate string matching problem. Given two strings, text T = t_1 t_2 … t_n and pattern P = p_1 p_2 … p_m, and integer k, the task is to find the end points of all approximate occurrences of P in T. An approximate occurrence means a substring P' of T such that at most k editing operations (insertions, deletions, changes) are needed to convert P' to P.
There are several algorithms proposed for this problem; see e.g. the survey of Galil and Giancarlo.[1] The problem can be solved in time O(mn) by dynamic programming.[2, 3] A very simple improvement giving an O(kn) expected time solution for random strings is described by Ukkonen.[3] Later, Landau and Vishkin,[4, 5] Galil and Park,[6] and Ukkonen and Wood[7] gave different algorithms that consist of preprocessing the pattern in time O(m^2) (or O(m)) and scanning the text in worst-case time O(kn). Tarhio and Ukkonen[8, 9] present an algorithm which is based on the Boyer-Moore approach and works in sublinear average time. There are also several other efficient solutions,[10-17] and some[11-14] of them work in sublinear average time. Currently O(kn) is the best worst-case bound known if the preprocessing time is allowed to be at most O(m^2).
There are also fast algorithms[9, 17-20] for the k mismatches problem, a reduced form of the k differences problem in which a change is the only editing operation allowed.
It is clear that with such a multitude of different solutions to the same problem it is difficult to select a proper method for each particular approximate string matching task. The theoretical analyses given in the literature are helpful, but it is important that the theory is complemented with sufficiently extensive experimental comparisons.

We will present an experimental comparison of the running times of seven algorithms for the k differences problem. The tested algorithms are: two dynamic programming methods,[2, 3] the Galil-Park algorithm,[6] the Ukkonen-Wood algorithm,[7] an algorithm counting the distribution of characters,[18] the approximate Boyer-Moore algorithm,[9] and an algorithm based on maximal matches between the pattern and the text.[10] (The last algorithm[10] is very similar to the linear algorithm of Chang and Lawler,[11] although they were invented independently.) We give brief descriptions of the algorithms as well as Ada code for their central parts. As our emphasis is on the experiments, the reader is advised to consult the original references for more detailed descriptions of the methods.
The paper is organized as follows. First, the framework based on edit distance is introduced. Then the seven algorithms are presented. Finally, the comparison of the algorithms is described and its results are summarized.
THE K DIFFERENCES PROBLEM
We use the concept of edit distance[21, 22] to measure the goodness of approximate occurrences of a pattern. The edit distance between two strings A and B in alphabet Σ can be defined as the minimum number of editing steps needed to convert A to B. Each editing step is a rewriting step of the form a → ε (a deletion), ε → b (an insertion), or a → b (a change), where a, b are in Σ and ε is the empty string.
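As an illustrative aside (not from the paper, which gives Ada code only for the matching algorithms), the edit distance itself can be computed with the standard tabulation; a minimal Python sketch:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and changes turning a into b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))            # distances from a[:0] to every prefix of b
    for i in range(1, m + 1):
        cur = [i] + [0] * n              # converting a[:i] to the empty string costs i
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # change (or match)
        prev = cur
    return prev[n]
```

For example, edit_distance("kitten", "sitting") is 3: two changes and one insertion.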
The k differences problem is, given pattern P = p_1 p_2 … p_m and text T = t_1 t_2 … t_n in alphabet Σ, and integer k, to find all such j that the edit distance (i.e., the number of differences) between P and some substring of T ending at t_j is at most k. The basic solution of the problem is the following dynamic programming method:[2, 3] Let D be an (m + 1) by (n + 1) table such that D(i, j) is the minimum edit distance between p_1 p_2 … p_i and any substring of T ending at t_j. Then

    D(0, j) = 0,  0 ≤ j ≤ n;

    D(i, j) = min( D(i − 1, j) + 1,
                   D(i − 1, j − 1) + (if p_i = t_j then 0 else 1),
                   D(i, j − 1) + 1 ).

Table D can be evaluated column-by-column in time O(mn). Whenever D(m, j) is found to be at most k for some j, there is an approximate occurrence of P ending at t_j with edit distance D(m, j) ≤ k. Hence j is a solution to the k differences problem.
In Fig. 1 there is an example of table D for T = bcbacbbb and P = cacd. The pattern occurs at positions 5 and 6 of the text with at most 2 differences.
All the algorithms presented work within this model, but they utilize different approaches in restricting the number of entries that are necessary to evaluate in table D. Some of the algorithms work in two phases: scanning and checking. The scanning phase searches for potential occurrences of the pattern, and the checking phase verifies if the suggested occurrences are good or not. The checking is always done using dynamic programming.
(The comparison was carried out in 1991. Some of the newer methods will likely be faster than the tested algorithms for certain values of the problem parameters.)

           0  1  2  3  4  5  6  7  8
              b  c  b  a  c  b  b  b
    0      0  0  0  0  0  0  0  0  0
    1  c   1  1  0  1  1  0  1  1  1
    2  a   2  2  1  1  1  1  1  2  2
    3  c   3  3  2  2  2  1  2  2  3
    4  d   4  4  3  3  3  2  2  3  3

Figure 1. Table D.
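The table in Fig. 1 can be reproduced mechanically. The following Python sketch of the trivial column-by-column solution is a direct transcription of the recurrence (illustrative; the paper does not give code for Algorithm DP):

```python
def dp_matches(text, pat, k):
    """Report 1-based end positions j with D(m, j) <= k, column by column."""
    m = len(pat)
    prev = list(range(m + 1))            # column 0: D(i, 0) = i
    matches = []
    for j in range(1, len(text) + 1):
        cur = [0] * (m + 1)              # D(0, j) = 0 for every j
        for i in range(1, m + 1):
            cost = 0 if pat[i - 1] == text[j - 1] else 1
            cur[i] = min(prev[i] + 1, cur[i - 1] + 1, prev[i - 1] + cost)
        if cur[m] <= k:
            matches.append(j)
        prev = cur
    return matches
```

For T = bcbacbbb, P = cacd and k = 2 this reports [5, 6], agreeing with the bottom row of Fig. 1.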
ALGORITHMS
Dynamic programming
We consider two different versions of dynamic programming for the k differences problem. In the previous section we introduced the trivial solution, which computes all entries of table D. The code of this algorithm is straightforward,[2, 21] and we do not present it here. In the following, we refer to this solution as Algorithm DP.
Diagonal h of D, for h = −m, …, n, consists of all D(i, j) such that j − i = h. Considering computation along diagonals gives a simple way to limit unnecessary computation. It is easy to show that the entries on every diagonal h are monotonically increasing.[22] Therefore the computation along a diagonal can be stopped when the threshold value of k + 1 is reached, because the rest of the entries on that diagonal will be greater than k. This idea leads to Algorithm EDP (Enhanced Dynamic Programming), working in average time[3] O(kn). Algorithm EDP is shown in Fig. 2.
In Algorithm EDP, the text and the pattern are stored in tables T and P. Table D is evaluated a column at a time. The entries of the current column are stored in table H, and the value of D(i − 1, j − 1) is temporarily stored in variable C. A work space of O(m) is enough, because every D(i, j) depends only on entries D(i − 1, j), D(i, j − 1), and D(i − 1, j − 1). Variable Top tells the row where the topmost diagonal still under the threshold value k + 1 intersects the current column. On line 12 an approximate occurrence is reported when row m is reached.
Galil-Park
The O(kn) algorithm presented by Galil and Park[6] is based on the diagonalwise monotonicity of the entries of table D. It also uses so-called reference triples that represent matching substrings of the pattern and the text. This approach was already used by Landau and Vishkin.[4] The algorithm evaluates a modified form of table D. The core of the algorithm is shown in Fig. 3 as Algorithm GP.
In preprocessing of pattern P (procedure call Prefixes(P) on line 2), the upper triangular table Prefix(i, j), 1 ≤ i < j ≤ m, is computed, where Prefix(i, j) is the length of the longest common prefix of p_i … p_m and p_j … p_m.
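For concreteness, the Prefix table admits a simple O(m^2) tabulation from the recurrence Prefix(i, j) = 1 + Prefix(i + 1, j + 1) when p_i = p_j, and 0 otherwise. A Python sketch (illustrative; the paper gives no code for Prefixes):

```python
def prefix_table(pat):
    """prefix[i][j] (1-based, i < j): length of the longest common prefix
    of pat[i..m] and pat[j..m]."""
    m = len(pat)
    # One extra row/column of zeros serves as the border case j = m + 1.
    prefix = [[0] * (m + 2) for _ in range(m + 2)]
    for i in range(m, 0, -1):            # fill in decreasing i so (i+1, j+1) is ready
        for j in range(m, i, -1):
            if pat[i - 1] == pat[j - 1]:
                prefix[i][j] = prefix[i + 1][j + 1] + 1
    return prefix
```

For pat = "aaaa", prefix[1][2] is 3 (the common prefix of "aaaa" and "aaa").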
A reference triple (u, v, w) consists of start position u, end position v, and diagonal w such that substring t_u … t_v matches substring p_{u−w} … p_{v−w} and t_{v+1} ≠ p_{v+1−w}. Algorithm GP manipulates several triples; the components of the r-th triple are presented as U(r), V(r), and W(r).

1 begin
2 Top := k + 1;
3 for I in 0 .. m loop H(I) := I; end loop;
4 for J in 1 .. n loop
5 C := 0;
6 for I in 1 .. Top loop
7 if P(I) = T(J) then E := C;
8 else E := Min((H(I–1), H(I), C)) + 1; end if;
9 C := H(I); H(I) := E;
10 end loop;
11 while H(Top) > k loop Top := Top – 1; end loop;
12 if Top = m then Report_Match(J);
13 else Top := Top + 1; end if;
14 end loop;
15 end;
Figure 2. Algorithm EDP.
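The Ada routine in Fig. 2 can be transliterated into Python roughly as follows (an illustrative sketch keeping the original variable names; Report_Match is replaced by collecting end positions, and k < m is assumed):

```python
def edp(text, pat, k):
    """Enhanced dynamic programming: evaluate each column of D only down to
    the topmost diagonal whose entries can still be <= k (variable top)."""
    m = len(pat)
    H = list(range(m + 1))       # current column of D, 1-based rows
    top = k + 1                  # assumes k < m
    matches = []
    for j in range(1, len(text) + 1):
        c = 0                    # holds D(i-1, j-1)
        for i in range(1, top + 1):
            e = c if pat[i - 1] == text[j - 1] else min(H[i - 1], H[i], c) + 1
            c, H[i] = H[i], e
        while H[top] > k:
            top -= 1
        if top == m:
            matches.append(j)    # approximate occurrence ends at position j
        else:
            top += 1
    return matches
```

On the example of Fig. 1 (T = bcbacbbb, P = cacd, k = 2) this returns [5, 6], the same end positions as the full table.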
For diagonal d and integer e, let C(e, d) be the largest column j such that D(j − d, j) = e. In other words, the entries of value e on diagonal d of D end at column C(e, d). Now

    C(e, d) = Col + Jump(Col + 1 − d, Col + 1)

holds, where

    Col = max{ C(e − 1, d − 1) + 1, C(e − 1, d) + 1, C(e − 1, d + 1) }

and Jump(i, j) is the length of the longest common prefix of p_i … p_m and t_j … t_n for all i, j.
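Jump can always be computed on demand by direct character comparison; the algorithms below obtain it faster (via reference triples or a suffix automaton), but a naive illustrative sketch makes the definition concrete:

```python
def jump(pat, text, i, j):
    """Length of the longest common prefix of pat[i..m] and text[j..n] (1-based)."""
    l = 0
    while (i + l <= len(pat) and j + l <= len(text)
           and pat[i + l - 1] == text[j + l - 1]):
        l += 1
    return l
```

For example, jump("cacd", "bcbacbbb", 1, 2) is 1: "cacd" and the text suffix "cbacbbb" share only the leading "c".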
Let C-diagonal g consist of the entries C(e, d) such that e + d = g. For every C-diagonal, Algorithm GP performs an iteration that evaluates it from the two previous C-diagonals (lines 7–38). The evaluation of each entry starts with evaluating the Col value (line 11). The rest of the loop (lines 12–35) effectively finds the value Jump(Col + 1 − d, Col + 1) using the reference triples and table Prefix. A new C-value is stored on line 24.
The algorithm maintains an ordered sequence of reference triples. The sequence is updated on lines 28–35. Procedure Within(d), called on line 14, tests if text position d is within some interval of the k first reference triples in the sequence. In the positive case, variable R is updated to express the index of the reference triple whose interval contains text position d. A match is reported on line 26.
Instead of the whole C defined above, table C of the algorithm contains only three successive C-diagonals. The use of this buffer of three diagonals is organized with variables B1, B2, and B3.
Ukkonen-Wood
Another O(kn) algorithm, given by Ukkonen and Wood,[7] has an overall structure identical to the algorithm of Galil and Park. However, no reference triples are used. Instead, to find the necessary values Jump(i, j), the text is scanned with a modified suffix automaton for

1 begin
2 Prefixes(P);
3 for I in –1 .. k loop
4 C(I, 1) := –Infinity; C(I, 2) := –1;
5 end loop;
6 B1 := 0; B2 := 1; B3 := 2;
7 for J in 0 .. n – m + k loop
8 C(–1, B1) := J; R := 0;
9 for E in 0 .. k loop
10 H := J – E;
11 Col := Max((C(E–1, B2) + 1, C(E–1, B3) + 1, C(E–1, B1)));
12 Se := Col + 1; Found := false;
13 while not Found loop
14 if Within(Col + 1) then
15 F := V(R) – Col; G := Prefix(Col+1–H, Col+1–W(R));
16 if F = G then Col := Col + F;
17 else Col := Col + Min(F, G); Found := true; end if;
18 else
19 if Col – H < m and then P(Col+1–H) = T(Col+1) then
20 Col := Col + 1;
21 else Found := true; end if;
22 end if;
23 end loop;
24 C(E, B1) := Min(Col, m + H);
25 if C(E, B1) = H + m and then C(E–1, B2) < m + H then
26 Report_Match(H + m);
27 end if;
28 if V(E) >= C(E, B1) then
29 if E = 0 then U(E) := J + 1;
30 else U(E) := Max(U(E), V(E–1) + 1); end if;
31 else
32 V(E) := C(E, B1); W(E) := H;
33 if E = 0 then U(E) := J + 1;
34 else U(E) := Max(Se, V(E–1) + 1); end if;
35 end if;
36 end loop;
37 B := B1; B1 := B3; B3 := B2; B2 := B;
38 end loop;
39 end;
Figure 3. Algorithm GP.


