
A comparison of approximate string matching algorithms

01 Dec 1996-Software - Practice and Experience (John Wiley & Sons, Inc.)-Vol. 26, Iss: 12, pp 1439-1458
TL;DR: It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.
Abstract: Experimental comparisons of the running time of approximate string matching algorithms for the k differences problem are presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.

Summary (1 min read)

Introduction

  • Experimental comparison of the running time of approximate string matching algorithms for the k differences problem is presented.
  • Tarhio and Ukkonen [8, 9] present an algorithm which is based on the Boyer-Moore approach and works in sublinear average time.
  • The theoretical analyses given in the literature are helpful, but it is important to complement the theory with sufficiently extensive experimental comparisons.
  • The algorithm evaluates a modified form of table D.

  • For every C-diagonal, Algorithm GP performs an iteration that evaluates it from the two previous C-diagonals (lines 7–38).
  • The evaluation of each entry starts with evaluating the Col value (line 11).
  • The sequence is updated on lines 28–35. Procedure Within(d), called on line 14, tests if text position d is within some interval of the k first reference triples in the sequence.
  • Instead of the whole C defined above, table C of the algorithm contains only three successive C-diagonals.
  • The use of this buffer of three diagonals is organized with variables B1, B2, and B3.

  • The scanning phase (lines 3–16) scans over the text and marks the parts that may contain approximate occurrences of P.
  • Parameter x of call EDP(x) tells how many columns should be evaluated for one marked diagonal.
  • The minimum value m for x is applicable for DC.
  • The scanning phase is almost identical to the original algorithm.
  • If f(x) and q(x) are the frequencies of character x in the pattern and in Q, variable Z has the value Σ_{x in Q} max(q(x) - f(x), 0). The value of Z is computed together with table C, which maintains the difference f(x) - q(x) for every x.

  • In the former case there is no approximate occurrence at the current alignment, and in the latter case a potential approximate occurrence has been found.
  • For determining the length of the shift, i.e. what is the next potential diagonal after h for marking, the authors search for the first diagonal after h where at least one of the characters t(h+m), t(h+m-1), ..., t(h+m-k) matches the corresponding character of P.

  • The authors performed an extensive test program on all seven algorithms DP, EDP, GP, DC, UW, MM, and ABM described in the previous sections.
  • In their tests, the authors used random patterns of varying lengths and random texts of length 100,000 characters over alphabets of different sizes.
  • Because algorithms EDP, DC, MM, and ABM were better than the others, the authors studied relations of their execution times more carefully.
  • The execution times of EDP and ABM on Sun (shown in Table II for some parameter values) were on the average 68 per cent and 60 per cent, respectively, of the corresponding times on Vaxstation.


SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 1(1), 1–4 (JANUARY 1988)
A Comparison of
Approximate String Matching Algorithms
PETTERI JOKINEN, JORMA TARHIO, AND ESKO UKKONEN
Department of Computer Science, P.O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland
(email: tarhio@cs.helsinki.fi)
SUMMARY
Experimental comparison of the running time of approximate string matching algorithms for the k differences problem is presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.
KEY WORDS String matching; edit distance; k differences problem
INTRODUCTION
We consider the k differences problem, a version of the approximate string matching problem. Given two strings, text T = t1 t2 ... tn and pattern P = p1 p2 ... pm, and an integer k, the task is to find the end points of all approximate occurrences of P in T. An approximate occurrence means a substring P' of T such that at most k editing operations (insertions, deletions, changes) are needed to convert P' to P.
There are several algorithms proposed for this problem; see e.g. the survey of Galil and Giancarlo [1]. The problem can be solved in time O(mn) by dynamic programming [2, 3]. A very simple improvement giving an O(kn) expected time solution for random strings is described by Ukkonen [3]. Later, Landau and Vishkin [4, 5], Galil and Park [6], and Ukkonen and Wood [7] gave different algorithms that consist of preprocessing the pattern in time O(m^2) (or O(m)) and scanning the text in worst-case time O(kn). Tarhio and Ukkonen [8, 9] present an algorithm which is based on the Boyer-Moore approach and works in sublinear average time. There are also several other efficient solutions [10-17], and some of them [11-14] work in sublinear average time. Currently O(kn) is the best worst-case bound known if the preprocessing time is allowed to be at most O(m^2).
There are also fast algorithms [9, 17-20] for the k mismatches problem, a restricted form of the k differences problem in which a change is the only editing operation allowed.
It is clear that with such a multitude of different solutions to the same problem it is difficult to select a proper method for each particular approximate string matching task. The theoretical analyses given in the literature are helpful, but it is important to complement the theory with sufficiently extensive experimental comparisons.
CCC 0038–0644/88/010001–04 Received 1 March 1988
© 1988 by John Wiley & Sons, Ltd. Revised 25 March 1988

We will present an experimental comparison of the running times of seven algorithms for the k differences problem. The tested algorithms are: two dynamic programming methods [2, 3], the Galil-Park algorithm [6], the Ukkonen-Wood algorithm [7], an algorithm counting the distribution of characters [18], the approximate Boyer-Moore algorithm [9], and an algorithm based on maximal matches between the pattern and the text [10]. (The last algorithm [10] is very similar to the linear algorithm of Chang and Lawler [11], although they were invented independently.) We give brief descriptions of the algorithms as well as Ada code for their central parts. As our emphasis is on the experiments, the reader is advised to consult the original references for more detailed descriptions of the methods.
The paper is organized as follows. First, the framework based on edit distance is introduced. Then the seven algorithms are presented. Finally, the comparison of the algorithms is presented and its results are summarized.
THE K DIFFERENCES PROBLEM
We use the concept of edit distance [21, 22] to measure the goodness of approximate occurrences of a pattern. The edit distance between two strings, A and B, in alphabet Σ can be defined as the minimum number of editing steps needed to convert A to B. Each editing step is a rewriting step of the form a → ε (a deletion), ε → b (an insertion), or a → b (a change), where a, b are in Σ and ε is the empty string.
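As an illustrative aside (not part of the original paper), this definition translates directly into a small dynamic program; the Python sketch below uses unit cost for each of the three editing operations.

```python
def edit_distance(a, b):
    """Minimum number of editing steps (insertion, deletion, change)
    needed to convert string a into string b, all at unit cost."""
    m, n = len(a), len(b)
    # d[i][j] = edit distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete the i remaining characters
    for j in range(n + 1):
        d[0][j] = j                 # insert the j missing characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # deletion
                          d[i][j - 1] + 1,                          # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1])) # change / match
    return d[m][n]
```

For instance, edit_distance("kitten", "sitting") is 3 (two changes and one insertion).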
The k differences problem is, given pattern P = p1 p2 ... pm and text T = t1 t2 ... tn in alphabet Σ, and integer k, to find all such j that the edit distance (i.e., the number of differences) between P and some substring of T ending at tj is at most k. The basic solution of the problem is the following dynamic programming method [2, 3]:
Let D be an (m+1) by (n+1) table such that D(i, j) is the minimum edit distance between p1 p2 ... pi and any substring of T ending at tj. Then

    D(0, j) = 0,  0 ≤ j ≤ n,

    D(i, j) = min( D(i-1, j) + 1,
                   D(i-1, j-1) + (if pi = tj then 0 else 1),
                   D(i, j-1) + 1 ).

Table D can be evaluated column-by-column in time O(mn). Whenever D(m, j) is found to be at most k for some j, there is an approximate occurrence of P ending at tj with edit distance D(m, j) ≤ k. Hence j is a solution to the k differences problem.
In Fig. 1 there is an example of table D for T = bcbacbbb and P = cacd. The pattern occurs at positions 5 and 6 of the text with at most 2 differences.
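The basic method can be sketched as a short Python program (our illustration, not the paper's Ada code); it evaluates D column by column, keeping only the previous column, and reports every end position j with D(m, j) ≤ k.

```python
def dp_matches(text, pattern, k):
    """Basic dynamic programming (Algorithm DP): report all end positions j
    such that some substring of text ending at position j is within edit
    distance k of pattern."""
    m = len(pattern)
    prev = list(range(m + 1))          # column 0: D(i, 0) = i
    matches = []
    for j in range(1, len(text) + 1):
        curr = [0] * (m + 1)           # D(0, j) = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[i] = min(curr[i - 1] + 1,      # D(i-1, j) + 1 (deletion)
                          prev[i - 1] + cost,   # D(i-1, j-1) + 0/1 (match/change)
                          prev[i] + 1)          # D(i, j-1) + 1 (insertion)
        if curr[m] <= k:
            matches.append(j)
        prev = curr
    return matches
```

On the example of Fig. 1 (T = bcbacbbb, P = cacd, k = 2) this yields the end positions 5 and 6.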
All the algorithms presented work within this model, but they utilize different approaches
in restricting the number of entries that are necessary to evaluate in table
D
. Some of the algo-
rithms work in two phases: scanning and checking. The scanning phase searches for potential
occurrences of the pattern, and the checking phase verifies if the suggested occurrences are
good or not. The checking is always done using dynamic programming.
The comparison was carried out in 1991. Some of the newer methods will likely be faster than the tested algorithms for certain
values of problem parameters.

         0  1  2  3  4  5  6  7  8
            b  c  b  a  c  b  b  b
  0      0  0  0  0  0  0  0  0  0
  1  c   1  1  0  1  1  0  1  1  1
  2  a   2  2  1  1  1  1  1  2  2
  3  c   3  3  2  2  2  1  2  2  3
  4  d   4  4  3  3  3  2  2  3  3

Figure 1. Table D.
ALGORITHMS
Dynamic programming
We consider two different versions of dynamic programming for the k differences problem. In the previous section we introduced the trivial solution which computes all entries of table D. The code of this algorithm is straightforward [2, 21], and we do not present it here. In the following, we refer to this solution as Algorithm DP.
Diagonal h of D, for h = -m, ..., n, consists of all D(i, j) such that j - i = h. Considering computation along diagonals gives a simple way to limit unnecessary computation. It is easy to show that the entries on every diagonal h are monotonically increasing [22]. Therefore the computation along a diagonal can be stopped when the threshold value k + 1 is reached, because the rest of the entries on that diagonal will be greater than k. This idea leads to Algorithm EDP (Enhanced Dynamic Programming), working in average time O(kn) [3]. Algorithm EDP is shown in Fig. 2.
In Algorithm EDP, the text and the pattern are stored in tables T and P. Table D is evaluated a column at a time. The entries of the current column are stored in table H, and the value of D(i-1, j-1) is temporarily stored in variable C. A work space of O(m) is enough, because every D(i, j) depends only on the entries D(i-1, j), D(i, j-1), and D(i-1, j-1). Variable Top tells the row where the topmost diagonal still under the threshold value k + 1 intersects the current column. On line 12 an approximate occurrence is reported when row m is reached.
Galil-Park
The O(kn) algorithm presented by Galil and Park [6] is based on the diagonalwise monotonicity of the entries of table D. It also uses so-called reference triples that represent matching substrings of the pattern and the text. This approach was already used by Landau and Vishkin [4]. The algorithm evaluates a modified form of table D. The core of the algorithm is shown in Fig. 3 as Algorithm GP.
In the preprocessing of pattern P (procedure call Prefixes(P) on line 2), an upper triangular table Prefix(i, j), 1 ≤ i < j ≤ m, is computed, where Prefix(i, j) is the length of the longest common prefix of pi ... pm and pj ... pm.
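A naive way to build such a table is direct comparison of the suffixes of P; the Python sketch below (ours, not the paper's preprocessing routine, which can be organized more efficiently) simply makes the definition concrete.

```python
def prefixes(p):
    """Prefix[(i, j)] = length of the longest common prefix of p[i..m] and
    p[j..m], for 1 <= i < j <= m (1-based indices, as in the paper)."""
    m = len(p)
    prefix = {}
    for i in range(1, m + 1):
        for j in range(i + 1, m + 1):
            l = 0
            while j - 1 + l < m and p[i - 1 + l] == p[j - 1 + l]:
                l += 1
            prefix[(i, j)] = l
    return prefix
```

For P = cacd, Prefix(1, 3) = 1 (both suffixes start with c) and Prefix(2, 4) = 0.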
A reference triple (u, v, w) consists of a start position u, an end position v, and a diagonal w such that the substring tu ... tv matches the substring p(u-w) ... p(v-w) and t(v+1) ≠ p(v+1-w). Algorithm GP manipulates several triples; the components of the r-th triple are stored as U(r), V(r), and W(r).

1 begin
2 Top := k + 1;
3 for I in 0 .. m loop H(I) := I; end loop;
4 for J in 1 .. n loop
5 C := 0;
6 for I in 1 .. Top loop
7 if P(I) = T(J) then E := C;
8 else E := Min((H(I-1), H(I), C)) + 1; end if;
9 C := H(I); H(I) := E;
10 end loop;
11 while H(Top) > k loop Top := Top - 1; end loop;
12 if Top = m then Report_Match(J);
13 else Top := Top + 1; end if;
14 end loop;
15 end;
Figure 2. Algorithm EDP.
For diagonal d and integer e, let C(e, d) be the largest column j such that D(j - d, j) = e. In other words, the entries of value e on diagonal d of D end at column C(e, d). Now

    C(e, d) = Col + Jump(Col + 1 - d, Col + 1)

holds, where

    Col = max{ C(e-1, d-1) + 1, C(e-1, d) + 1, C(e-1, d+1) }

and Jump(i, j) is the length of the longest common prefix of pi ... pm and tj ... tn for all i, j.
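As a point of reference (this is not how Algorithm GP obtains these values — it derives them from the reference triples and the Prefix table), Jump(i, j) can be stated naively as the following Python sketch of ours:

```python
def jump(p, t, i, j):
    """Length of the longest common prefix of p[i..m] and t[j..n],
    with 1-based positions i and j as in the paper."""
    l = 0
    while (i - 1 + l < len(p) and j - 1 + l < len(t)
           and p[i - 1 + l] == t[j - 1 + l]):
        l += 1
    return l
```

For P = cacd and T = bcbacbbb, Jump(2, 4) = 2, since "acd" and "acbbb" share the prefix "ac".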
Let C-diagonal g consist of the entries C(e, d) such that e + d = g. For every C-diagonal, Algorithm GP performs an iteration that evaluates it from the two previous C-diagonals (lines 7–38). The evaluation of each entry starts with evaluating the Col value (line 11). The rest of the loop (lines 12–35) effectively finds the value Jump(Col + 1 - d, Col + 1) using the reference triples and table Prefix. A new C-value is stored on line 24.
The algorithm maintains an ordered sequence of reference triples. The sequence is updated on lines 28–35. Procedure Within(d), called on line 14, tests if text position d is within some interval of the k first reference triples in the sequence. In the positive case, variable R is updated to express the index of the reference triple whose interval contains text position d. A match is reported on line 26.
Instead of the whole C defined above, table C of the algorithm contains only three successive C-diagonals. The use of this buffer of three diagonals is organized with variables B1, B2, and B3.
Ukkonen-Wood
Another O(kn) algorithm, given by Ukkonen and Wood [7], has an overall structure identical to the algorithm of Galil and Park. However, no reference triples are used. Instead, to find the necessary values Jump(i, j), the text is scanned with a modified suffix automaton for

1 begin
2 Prefixes(P);
3 for I in -1 .. k loop
4 C(I, 1) := -Infinity; C(I, 2) := -1;
5 end loop;
6 B1 := 0; B2 := 1; B3 := 2;
7 for J in 0 .. n - m + k loop
8 C(-1, B1) := J; R := 0;
9 for E in 0 .. k loop
10 H := J - E;
11 Col := Max((C(E-1, B2) + 1, C(E-1, B3) + 1, C(E-1, B1)));
12 Se := Col + 1; Found := false;
13 while not Found loop
14 if Within(Col + 1) then
15 F := V(R) - Col; G := Prefix(Col+1-H, Col+1-W(R));
16 if F = G then Col := Col + F;
17 else Col := Col + Min(F, G); Found := true; end if;
18 else
19 if Col - H < m and then P(Col+1-H) = T(Col+1) then
20 Col := Col + 1;
21 else Found := true; end if;
22 end if;
23 end loop;
24 C(E, B1) := Min(Col, m + H);
25 if C(E, B1) = H + m and then C(E-1, B2) < m + H then
26 Report_Match(H + m);
27 end if;
28 if V(E) >= C(E, B1) then
29 if E = 0 then U(E) := J + 1;
30 else U(E) := Max(U(E), V(E-1) + 1); end if;
31 else
32 V(E) := C(E, B1); W(E) := H;
33 if E = 0 then U(E) := J + 1;
34 else U(E) := Max(Se, V(E-1) + 1); end if;
35 end if;
36 end loop;
37 B := B1; B1 := B3; B3 := B2; B2 := B;
38 end loop;
39 end;
Figure 3. Algorithm GP.

Citations
Journal ArticleDOI
TL;DR: This work considers the problem of jumbled matching, where the objective is to find all permuted occurrences of a pattern in a text, and presents online algorithms applying bit-parallelism to both types of jumbled matching.

1 citations

Journal ArticleDOI
TL;DR: An algorithm is given for approximate string matching and the proposed model computes the similarity and dissimilarity between the pair of strings leading to better approximation.

1 citations

Journal ArticleDOI
TL;DR: The aim is to construct a finite automaton recognizing the set of words that are at a bounded distance from some word of a given regular language, and introduces the family of regular expressions extended to similarity operators, that are called AREs (Approximate Regular Expressions).
Abstract: Our aim is to construct a finite automaton recognizing the set of words that are at a bounded distance from some word of a given regular language. We define new regular operators, the similarity operators, based on a generalization of the notion of distance and we introduce the family of regular expressions extended to similarity operators, that we call AREs (Approximate Regular Expressions). We set formulae to compute the Brzozowski derivatives and the Antimirov derivatives of an ARE, which allows us to give a solution to the ARE membership problem and to provide the construction of two recognizers for the language denoted by an ARE. As far as we know, the family of approximative regular expressions is introduced for the first time in this paper. Classical approximate regular expression matching algorithms are approximate matching algorithms on regular expressions. Our approach is rather to process an exact matching on approximate regular expressions.

1 citations

Book ChapterDOI
08 May 2019
TL;DR: This work designs efficient solutions to specific approximate matching problems on genomic sequences using a filtering technique based on the general abelian matching problem, which firstly locates the set of all candidate matching positions and then performs an additional verification test on the collected positions.
Abstract: Approximate matching in strings is a fundamental and challenging problem in computer science and in computational biology, and increasingly fast algorithms are highly demanded in many applications including text processing and dna sequence analysis. Recently efficient solutions to specific approximate matching problems on genomic sequences have been designed using a filtering technique, based on the general abelian matching problem, which firstly locates the set of all candidate matching positions and then perform an additional verification test on the collected positions.
References
Journal ArticleDOI
TL;DR: An algorithm is presented which solves the string-to-string correction problem in time proportional to the product of the lengths of the two strings.
Abstract: The string-to-string correction problem is to determine the distance between two strings as measured by the minimum cost sequence of “edit operations” needed to change the one string into the other. The edit operations investigated allow changing one symbol of a string into another single symbol, deleting one symbol from a string, or inserting a single symbol into a string. An algorithm is presented which solves this problem in time proportional to the product of the lengths of the two strings. Possible applications are to the problems of automatic spelling correction and determining the longest subsequence of characters common to two strings.

3,252 citations

Journal ArticleDOI
TL;DR: The algorithm has the unusual property that, in most cases, not all of the first i characters of the string are inspected.
Abstract: An algorithm is presented that searches for the location, “il” of the first occurrence of a character string, “pat,” in another string, “string.” During the search operation, the characters of pat are matched starting with the last character of pat. The information gained by starting the match at the end of the pattern often allows the algorithm to proceed in large jumps through the text being searched. Thus the algorithm has the unusual property that, in most cases, not all of the first i characters of string are inspected. The number of characters actually inspected (on the average) decreases as a function of the length of pat. For a random English pattern of length 5, the algorithm will typically inspect i/4 characters of string before finding a match at i. Furthermore, the algorithm has been implemented so that (on the average) fewer than i + patlen machine instructions are executed. These conclusions are supported with empirical evidence and a theoretical analysis of the average behavior of the algorithm. The worst case behavior of the algorithm is linear in i + patlen, assuming the availability of array space for tables linear in patlen plus the size of the alphabet.

2,542 citations

Journal ArticleDOI
TL;DR: The string-matching problem is a very common problem; there are many extensions to this problem: for example, one may be looking for a set of patterns, a pattern with "wild cards," or a regular expression.
Abstract: The string-matching problem is a very common problem. We are searching for a string P = p1p2...pm inside a large text file T = t1t2...tn, both sequences of characters from a finite character set Σ. The characters may be English characters in a text file, DNA base pairs, lines of source code, angles between edges in polygons, machines or machine parts in a production schedule, music notes and tempo in a musical score, and so forth. We want to find all occurrences of P in T; namely, we are searching for the set of starting positions F = {i | 1 ≤ i ≤ n - m + 1 such that t(i)t(i+1)...t(i+m-1) = P}. The two most famous algorithms for this problem are the Boyer-Moore algorithm [3] and the Knuth-Morris-Pratt algorithm [10]. There are many extensions to this problem; for example, we may be looking for a set of patterns, a pattern with "wild cards," or a regular expression. String-matching tools are included in every reasonable text editor, word processor, and many other applications.

806 citations

Journal ArticleDOI
TL;DR: An improved algorithm that works in time and in space O and algorithms that can be used in conjunction with extended edit operation sets, including, for example, transposition of adjacent characters.
Abstract: The edit distance between strings a 1 … a m and b 1 … b n is the minimum cost s of a sequence of editing steps (insertions, deletions, changes) that convert one string into the other. A well-known tabulating method computes s as well as the corresponding editing sequence in time and in space O ( mn ) (in space O (min( m, n )) if the editing sequence is not required). Starting from this method, we develop an improved algorithm that works in time and in space O ( s · min( m, n )). Another improvement with time O ( s · min( m, n )) and space O ( s · min( s, m, n )) is given for the special case where all editing steps have the same cost independently of the characters involved. If the editing sequence that gives cost s is not required, our algorithms can be implemented in space O (min( s, m, n )). Since s = O (max( m, n )), the new methods are always asymptotically as good as the original tabulating method. As a by-product, algorithms are obtained that, given a threshold value t , test in time O ( t · min( m, n )) and in space O (min( t, m, n )) whether s ⩽ t . Finally, different generalized edit distances are analyzed and conditions are given under which our algorithms can be used in conjunction with extended edit operation sets, including, for example, transposition of adjacent characters.

672 citations

Journal ArticleDOI
06 Jan 1992
TL;DR: Two string distance functions that are computable in linear time give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for the edit distance based string matching.
Abstract: We study approximate string matching in connection with two string distance functions that are computable in linear time. The first function is based on the so-called $q$-grams. An algorithm is given for the associated string matching problem that finds the locally best approximate occurrences of pattern $P$, $|P|=m$, in text $T$, $|T|=n$, in time $O(n\log (m-q))$. The occurrences with distance $\leq k$ can be found in time $O(n\log k)$. The other distance function is based on finding maximal common substrings and allows a form of approximate string matching in time $O(n)$. Both distances give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for the edit distance based string matching.

665 citations

Frequently Asked Questions (1)
Q1. What are the contributions in "A comparison of approximate string matching algorithms" ?

Experimental comparison of the running time of approximate string matching algorithms for the k differences problem is presented. The authors consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters.