# A comparison of approximate string matching algorithms

TL;DR: It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.

Abstract: Experimental comparisons of the running time of approximate string matching algorithms for the k differences problem are presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.

## Summary (1 min read)

### Introduction

- Experimental comparison of the running time of approximate string matching algorithms for the k differences problem is presented.
- Tarhio and Ukkonen [8, 9] present an algorithm that is based on the Boyer-Moore approach and works in sublinear average time.
- The theoretical analyses given in the literature are helpful, but they need to be complemented with sufficiently extensive experimental comparisons.
- The algorithm evaluates a modified form of table D.
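As a point of reference for the k differences problem, here is a minimal sketch of the standard column-by-column dynamic programming over table D. It assumes the usual formulation, in which row 0 of D is all zeros so that an occurrence may start anywhere in the text; the function name `k_differences` is hypothetical, and the code is not taken from the paper.

```python
def k_differences(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions j such that some substring of text
    ending at j is within k differences of pattern."""
    m = len(pattern)
    # Column 0 of table D: the first i pattern characters against an
    # empty text substring cost i differences.
    col = list(range(m + 1))
    ends = []
    for j, t in enumerate(text, start=1):
        prev_diag = col[0]          # D[i-1][j-1], starting with D[0][j-1] = 0
        col[0] = 0                  # D[0][j] = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            prev_col_i = col[i]     # D[i][j-1], saved before it is overwritten
            cost = 0 if pattern[i - 1] == t else 1
            col[i] = min(prev_diag + cost,   # match or change
                         prev_col_i + 1,     # extra text character
                         col[i - 1] + 1)     # missing pattern character
            prev_diag = prev_col_i
        if col[m] <= k:
            ends.append(j)
    return ends
```

For example, `k_differences("abc", "xabcx", 1)` reports every end position where the last row of D is at most 1. The sketch takes O(mn) time for a pattern of length m and a text of length n.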

### Algorithm GP

- For every C-diagonal, Algorithm GP performs an iteration that evaluates it from the two previous C-diagonals (lines 7–38).
- The evaluation of each entry starts with evaluating the Col value (line 11).
- The sequence is updated on lines 28–35. Procedure Within(d), called on line 14, tests whether text position d is within some interval of the k first reference triples in the sequence.
- Instead of the whole C defined above, table C of the algorithm contains only three successive C-diagonals.
- The use of this buffer of three diagonals is organized with variables B1, B2, and B3.

### Scanning phase and algorithm DC

- The scanning phase (lines 3–16) scans over the text and marks the parts that may contain approximate occurrences of P.
- Parameter x of call EDP(x) tells how many columns should be evaluated for one marked diagonal.
- The minimum value m for x is applicable for DC.
- The scanning phase is almost identical to the original algorithm.
- If f(x) and q(x) are the frequencies of character x in the pattern and in Q, variable Z has the value $Z = \sum_{x \in Q} \max(q(x) - f(x), 0)$. The value of Z is computed together with table C, which maintains the difference $f(x) - q(x)$ for every x.
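The incremental computation of Z can be sketched as follows. This is an illustrative sliding-window version, assuming that Q is the window of the last m text characters and that a window is marked when Z is at most k; the function name `mark_windows` and the use of a `Counter` for table C are choices made for this sketch, not details from the paper.

```python
from collections import Counter


def mark_windows(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions of length-m windows Q with Z <= k,
    where Z = sum over x of max(q(x) - f(x), 0)."""
    m = len(pattern)
    if m == 0 or len(text) < m:
        return []
    c = Counter(pattern)   # table C holds f(x) - q(x); Q is empty at the start
    z = 0
    marked = []
    for j, x in enumerate(text, start=1):
        # Character x enters Q: Z grows only if x is already "used up".
        if c[x] <= 0:
            z += 1
        c[x] -= 1
        if j > m:
            y = text[j - m - 1]   # character leaving Q
            c[y] += 1
            # Z shrinks only if y was in excess before it left.
            if c[y] <= 0:
                z -= 1
        if j >= m and z <= k:     # assumption: mark windows with Z <= k
            marked.append(j)
    return marked
```

Each text character is processed in constant time. A marked window is only a candidate: `mark_windows("abc", "cba", 0)` marks the window even though it is a permutation of the pattern, so marked parts still have to be verified.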

### Alignment check and shift

- In the former case there is no approximate occurrence at the current alignment, and in the latter case a potential approximate occurrence has been found.
- To determine the length of the shift, i.e. the next potential diagonal after h for marking, the authors search for the first diagonal after h where at least one of the characters $t_{h+m}, t_{h+m-1}, \ldots, t_{h+m-k}$ matches the corresponding character of P.
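The shift rule can be illustrated with a brute-force sketch. It assumes that on diagonal h pattern character p_i is aligned with text character t_{h+i} (1-based), and that the shift is m when no candidate diagonal has a match; both the fallback and the function name `next_diagonal_shift` are assumptions of this sketch. An efficient implementation would precompute the shifts in a table rather than search for them as done here.

```python
def next_diagonal_shift(pattern: str, text: str, h: int, k: int) -> int:
    """Smallest s >= 1 such that, on diagonal h + s, at least one of the
    text characters t_{h+m}, ..., t_{h+m-k} matches the pattern character
    aligned with it. Returns m if there is none (assumed fallback)."""
    m = len(pattern)
    for s in range(1, m):
        for i in range(k + 1):
            j = h + m - i        # 1-based text position being examined
            p = j - (h + s)      # 1-based pattern index aligned with t_j
            if 1 <= p <= m and 1 <= j <= len(text) \
                    and pattern[p - 1] == text[j - 1]:
                return s
    return m
```

With k = 0 only t_{h+m} is examined, so the rule reduces to a Horspool-style shift on the last aligned text character.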

### Experiments

- The authors performed an extensive test program on all seven algorithms DP, EDP, GP, DC, UW, MM, and ABM described in the previous sections.
- In their tests, the authors used random patterns of varying lengths and random texts of length 100,000 characters over alphabets of different sizes.
- Because algorithms EDP, DC, MM, and ABM were better than the others, the authors studied relations of their execution times more carefully.
- The execution times of EDP and ABM on Sun (shown in Table II for some parameter values) were on the average 68 per cent and 60 per cent, respectively, of the corresponding times on Vaxstation.


##### Citations

2,515 citations

### Cites background or methods from "A comparison of approximate string ..."

...There exist other surveys on approximate string matching, which are however too old for this fast moving area [Hall and Dowling 1980; Sankoff and Kruskal 1983; Apostolico and Galil 1985; Galil and Giancarlo 1988; Jokinen et al. 1996] (the last one was in its definitive form in 1991)....

[...]

...2 Ukkonen (1983). In 1983, Ukkonen [1985a] presented an algorithm able to compute the edit distance between two strings x and y in $O(ed(x, y)^2)$ time, or to check in time $O(k^2)$ whether that distance was ≤ k or not....

[...]

...Key: TU93 = [Tarhio and Ukkonen 1993], JTU96 = [Jokinen et al. 1996], Nav97a = [Navarro 1997a], CL94 = [Chang and Lawler 1994], Ukk92 = [Ukkonen 1992], BYN99 = [Baeza-Yates and Navarro 1999], WM92b = [Wu and Manber 1992b], BYP96 = [Baeza-Yates and Perleberg 1996], Shi96 = [Shi 1996], NBY99c = [Navarro and Baeza-Yates 1999c], Tak94 = [Takaoka 1994], CM94 = [Chang and Marr 1994], NBY98a = [Navarro and Baeza-Yates 1998a], NR00 = [Navarro and Raffinot 2000], ST95 = [Sutinen and Tarhio 1995], and GKHO97 = [Giegerich et al. 1997]. ... used Boyer-Moore-Horspool techniques [Boyer and Moore 1977; Horspool 1980] ... We divide this area in two parts: moderate and very long patterns....

[...]

...On the left, the automaton of Ukkonen [1985b] where each column is...

[...]

...Key: BYN99 = [Baeza-Yates and Navarro 1999], NBY98a = [Navarro and Baeza-Yates 1998a], JTU96 = [Jokinen et al. 1996], Ukk92 = [Ukkonen 1992], CL94 = [Chang and Lawler 1994], WM92b = [Wu and Manber 1992b], TU93 = [Tarhio and Ukkonen 1993], Tak94 = [Takaoka 1994], Shi96 = [Shi 1996], ST95 =…...

[...]
