Bounds on the Complexity of the Longest Common Subsequence Problem

doi:10.1145/321921.321922


A. V. AHO

Bell Laboratories, Murray Hill, New Jersey

D. S. HIRSCHBERG AND J. D. ULLMAN

Princeton University, Princeton, New Jersey

ABSTRACT. The problem of finding a longest common subsequence of two strings is discussed. This problem arises in data processing applications such as comparing two files and in genetic applications such as studying molecular evolution. The difficulty of computing a longest common subsequence of two strings is examined using the decision tree model of computation, in which vertices represent "equal-unequal" comparisons. It is shown that unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings. A general lower bound as a function of the ratio of alphabet size to string length is derived. The case where comparisons between symbols of the same string are forbidden is also considered, and it is shown that this problem is of linear complexity for a two-symbol alphabet and quadratic for an alphabet of three or more symbols.

KEY WORDS AND PHRASES: longest common subsequence, algorithm, computational complexity, file comparison, molecular evolution

CR CATEGORIES: 3.12, 3.73, 5.25

1. Introduction

A subsequence of a given string is any string obtained by deleting zero or more symbols from the given string. A longest common subsequence (LCS) of two strings is a subsequence of both that is as long as any other common subsequence. For example, "cled" and "coed" are the longest common subsequences of "schooled" and "encyclopedia".

Being able to determine a longest common subsequence of two strings is useful in data processing and genetic applications. In data processing a longest common subsequence is often used to measure the differences between two files of data. For example, we can consider a file to be a string in which each line of the file is treated as a single symbol. A longest common subsequence of two files identifies those portions of the files that are identical. A genetic application arises in the study of the evolution of long molecules such as proteins; there a longest common subsequence is used to measure the correlation between two such molecules [11, 14].

for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to this publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.

This research was partially supported by a National Science Foundation Fellowship to D. S. Hirschberg and by National Science Foundation Grant GJ-35570 to Princeton University.

A preliminary version of this paper was presented at the 15th Annual IEEE Symposium on Switching and Automata Theory, October 14-16, 1974.

Authors' present addresses: A. V. Aho, Bell Laboratories, Inc., 600 Mountain Avenue, Murray Hill, NJ 07974; D. S. Hirschberg, Department of Electrical Engineering, Rice University, Houston, TX 77001; J. D. Ullman, Department of Electrical Engineering, Princeton University, Princeton, NJ 08540.

Journal of the Association for Computing Machinery, Vol. 23, No. 1, January 1976, pp. 1-12.

Using dynamic programming, an LCS of two strings A and B can be computed in time proportional to the product of their lengths. For special cases an LCS can be computed in time less than the product. For example, if A and B are length n strings of digits 1, 2, ..., n, and no position of A matches more than one position of B, then an LCS of A and B can be computed in $O(n \log\log n)$ time by specializing the algorithms in [6, 9, 18] to integers and using van Emde Boas' integer merging technique [15]. Always being able to compute an LCS of two strings in time significantly less than the product of their lengths, however, appears very difficult [3].
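The dynamic programming approach mentioned above can be sketched as follows. This is a minimal illustration of the standard textbook recurrence, not necessarily the formulation used in the cited references; the function names `lcs` and `is_subsequence` are chosen here for illustration only.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # length[i][j] = LCS length of the suffixes a[i:] and b[j:]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def is_subsequence(x: str, y: str) -> bool:
    """True if x can be obtained from y by deleting zero or more symbols."""
    it = iter(y)
    return all(ch in it for ch in x)
```

The table has (m+1)(n+1) entries and each is filled with one symbol comparison, which is where the product of the lengths comes from. Applied to "schooled" and "encyclopedia", `lcs` returns a common subsequence of length 4.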

For this reason we believe that an attempt at a lower bound is in order. To derive lower bounds a precise model for a class of algorithms is necessary. The model we choose is that of a decision tree [1] in which every decision is whether or not two positions have the same symbol. This model fits various algorithms for the LCS problem which have appeared in the literature [7, 14, 16]. It has also been used to study the related string-to-string correction problem [17], the substring matching problem [5], and various problems on sets [13]. The model does not, however, fit the $O(n^2 \log\log n / \log n)$ algorithm of Paterson [12] nor the special case algorithms of [8, 9].

For the remainder of this paper A and B denote two strings of length n whose LCS we wish to compute.¹ Throughout, s denotes the total number of distinct symbols that can appear in A and B (the alphabet size). T(n, s) is the minimum number of comparisons under the decision tree model needed to find an LCS of A and B in the worst case.

We shall derive both upper and lower bounds on T(n, s). The use of lower bounds is clear. They say that there are no algorithms of lower time complexity which can be modeled by a decision tree with "equal-unequal" comparisons. We are thus told something about the way algorithms for the LCS problem must behave, if they exist at all.

The need for upper bounds on T(n, s) is less obvious. We shall use them to demonstrate that no stronger bounds on T(n, s) can be shown. In principle, an upper bound on T(n, s) is an algorithm for the LCS problem. The algorithm, however, may involve essentially different decision trees for each value of n and s. Thus, it is possible that no uniform algorithm taking strings of arbitrary lengths and finding their LCS can be obtained from a sequence of decision trees for all n and s, and such appears to be the case here.

Our principal results are the following:

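(1) $T(n, 2) = 2n - 1$ for $n \ge 1$.

(2) For all $n \ge 1$:

(i) $\frac{s}{2}\left(n + \frac{s}{2}\right) \le T(n, s) \le \min\left[n^2, (s-1)\left(2n - \frac{s}{2}\right)\right]$, for $2 \le s \le n$.

(ii) $3ns/4 \le T(n, s) \le n^2$, for $n \le s \le 4n/3$.

(iii) $T(n, s) = n^2$, for $s \ge 4n/3$.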

These upper and lower bounds on T(n, s) are shown in Figure 1.

(3) The special case where all comparisons are between symbols of different strings is shown to require $2n - 1$ comparisons if $s = 2$ and $n^2$ comparisons if $s \ge 3$.
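To see how the pieces of result (2) fit together, the bounds can be evaluated numerically. The sketch below is only an illustration under stated assumptions: it reads the lower bound in (2)(i) as (s/2)(n + s/2), which is the form that agrees with 3ns/4 at s = n, and it applies the (s-1)(2n - s/2) upper bound only for s ≤ 2n, the range in which it is derived later in the paper. The function names are hypothetical.

```python
from fractions import Fraction


def upper_bound(n: int, s: int) -> Fraction:
    """Smaller of n^2 and (s-1)(2n - s/2); the latter is used only for s <= 2n."""
    best = Fraction(n * n)
    if s <= 2 * n:
        best = min(best, (s - 1) * (2 * n - Fraction(s, 2)))
    return best


def lower_bound(n: int, s: int) -> Fraction:
    """Piecewise lower bound on T(n, s) (ASSUMED reading of result (2))."""
    if 3 * s >= 4 * n:                      # s >= 4n/3: T(n, s) = n^2
        return Fraction(n * n)
    if s >= n:                              # n <= s < 4n/3
        return Fraction(3 * n * s, 4)
    # 2 <= s < n: assumed form (s/2)(n + s/2)
    return Fraction(s, 2) * (n + Fraction(s, 2))
```

Under these assumptions the two bounds coincide at n = s = 2, where both give 3, and again for every s ≥ 4n/3, where both give n^2.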

¹ We can, in a straightforward manner, generalize the results of this paper to the case where the strings are of unequal length.


FIG. 1. Upper and lower bounds on T(n, s). [Figure not reproduced: T(n, s) plotted against s, with the horizontal axis marked at n, 4n/3, and 2n.]

2. Decision Trees

This section makes precise the decision tree model of computation. Intuitively, each path starting at the root of a decision tree represents a sequence of comparisons made between various positions in the strings A and B. These comparisons give us all the information we currently know about A and B. The information is in the form of which positions in A and B must contain identical or distinct symbols.

More formally, we define a decision tree with "equal-unequal" comparisons for the LCS problem as a rooted binary tree in which each interior vertex is labeled with a pair of integers and each leaf is labeled by two lists of positions from A and B, respectively. A pair of integers p:q at an interior vertex represents a comparison between the symbols in positions p and q of the two strings. (p and q can be positions in the same string.) Each list of positions at a leaf represents an LCS of A and B.

Since the only information we get about A and B comes from "equal-unequal" comparisons among symbols of the two strings, we are always dealing with relative values of symbols in various positions in the strings. Consequently, it is convenient to define an (n, s)-assignment (or assignment when n and s are clear) as a setting of values from some s-symbol alphabet to the positions of A and B. Intuitively, an assignment is a representative of an equivalence class of pairs of input strings. Given a path $P = v_1, v_2, \ldots, v_m$ from the root to some vertex (not necessarily a leaf) in a decision tree, we say (n, s)-assignment C is valid for P if for each pair of positions $p_i : q_i$ at vertex $v_i$, $1 \le i < m$, $v_{i+1}$ is the left son of $v_i$ if the symbols in positions $p_i$ and $q_i$ are equal according to C, and $v_{i+1}$ is the right son of $v_i$ otherwise. Thus an assignment C is valid for a path if C represents a class of pairs of input strings whose symbols are consistent with the outcomes of the comparisons made along the path.
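The definition of validity can be made concrete with a small sketch. The representation is an assumption made here for illustration: a path is a list of comparisons together with their outcomes, and positions are arbitrary hashable labels such as "a1" or "b2".

```python
from typing import Dict, List, Tuple

# A path is a list of (p, q, went_left) triples: positions p and q were
# compared, and went_left is True when the path continues to the left son,
# that is, when the comparison came out "equal".
Path = List[Tuple[str, str, bool]]


def is_valid(assignment: Dict[str, str], path: Path) -> bool:
    """True if every outcome recorded on the path agrees with the assignment."""
    return all(
        (assignment[p] == assignment[q]) == went_left
        for p, q, went_left in path
    )
```

For instance, for the path that finds a1 unequal to b1 and then a1 equal to b2, the assignment a1 = x, a2 = y, b1 = y, b2 = x is valid, while any assignment with a1 = b1 is not.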

We say a decision tree D solves the (n, s)-LCS problem (or just the LCS problem if n and s are clear) if for every leaf w of D and for every (n, s)-assignment C valid for the path from the root of D to w, the two lists of positions found at w are an LCS in the first and second strings, respectively.

The complexity of a decision tree is the length of a longest path in that tree. We define T(n, s) to be the minimum complexity over all decision trees that solve the (n, s)-LCS problem.
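As an illustration of these definitions, a decision tree can be represented directly as a data structure. The class and function names below (`Leaf`, `Compare`, `run`, `complexity`) are hypothetical and serve only to make the model tangible; the example tree is the obvious one for n = 1.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass
class Leaf:
    a_positions: List[int]   # 1-based positions of the LCS in A
    b_positions: List[int]   # 1-based positions of the LCS in B


@dataclass
class Compare:
    p: Tuple[str, int]       # ("A", i) or ("B", j): first position compared
    q: Tuple[str, int]       # second position compared
    left: "Tree"             # son taken when the two symbols are equal
    right: "Tree"            # son taken when they are unequal


Tree = Union[Leaf, Compare]


def run(tree: Tree, a: Sequence, b: Sequence) -> Leaf:
    """Follow the path determined by the inputs a and b down to a leaf."""
    strings = {"A": a, "B": b}
    while isinstance(tree, Compare):
        (sp, i), (sq, j) = tree.p, tree.q
        equal = strings[sp][i - 1] == strings[sq][j - 1]
        tree = tree.left if equal else tree.right
    return tree


def complexity(tree: Tree) -> int:
    """Number of comparisons on a longest root-to-leaf path."""
    if isinstance(tree, Leaf):
        return 0
    return 1 + max(complexity(tree.left), complexity(tree.right))


# For n = 1 a single comparison of a_1 with b_1 decides the answer.
TREE_N1 = Compare(("A", 1), ("B", 1), Leaf([1], [1]), Leaf([], []))
```

The tree `TREE_N1` has complexity 1, in agreement with the value 2n - 1 stated for a two-symbol alphabet when n = 1.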

A free decision tree is one which makes no comparisons whose outcomes are already known. (For example, if the symbols at positions p and q and at positions q and r have been compared and found equal, then, by transitivity, the symbols at positions p and r are also known to be equal.) We can, without loss of generality, assume that all decision trees being considered are free. This assumption allows us to consider decision trees in which there are no unnecessary comparisons.
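One way to detect a comparison whose outcome is already known is to maintain the equivalence classes implied so far. The sketch below is an assumption-laden illustration rather than part of the paper: it tracks only consequences of transitivity (equal classes, and classes recorded as unequal) and ignores deductions that depend on the alphabet size s. The class name `Knowledge` is hypothetical.

```python
class Knowledge:
    """Tracks what a sequence of equal-unequal outcomes already implies."""

    def __init__(self, positions):
        self.parent = {p: p for p in positions}
        self.unequal = set()      # frozensets of two class roots known to differ

    def find(self, p):
        """Return the root that names the class containing p."""
        while self.parent[p] != p:
            p = self.parent[p]
        return p

    def known(self, p, q):
        """Return True or False if the outcome of p:q is implied, else None."""
        rp, rq = self.find(p), self.find(q)
        if rp == rq:
            return True
        if frozenset((rp, rq)) in self.unequal:
            return False
        return None

    def record(self, p, q, equal):
        """Record the outcome of comparing positions p and q."""
        rp, rq = self.find(p), self.find(q)
        if equal:
            self.parent[rq] = rp
            # Re-root recorded inequalities that mentioned the absorbed class.
            self.unequal = {
                frozenset(rp if r == rq else r for r in pair)
                for pair in self.unequal
            }
        else:
            self.unequal.add(frozenset((rp, rq)))
```

A free decision tree, in this vocabulary, never asks a comparison p:q for which `known(p, q)` is not `None`.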

Example 1. To fix the model more closely, let us consider the case where $n = s = 2$. (That is, we are to find an LCS of two strings each of length 2, and each over the same two-symbol alphabet.) For convenience we let $A = a_1 a_2$ and $B = b_1 b_2$. In Figure 2 we see a decision tree that solves the (2, 2)-LCS problem. It has complexity 3, which we shall see is the minimum for this problem. Thus $T(2, 2) = 3$. []

FIG. 2. Decision tree solving the (2, 2)-LCS problem. [Figure not reproduced: a binary tree of comparisons such as $a_2 = b_2$ with yes/no branches.]

3. Upper Bounds

There are two trivial strategies that can be used to construct decision trees for a fixed n and s. The first strategy is to compare each symbol of one string with each symbol of the other. It yields the following theorem.

THEOREM 1. For all s and n, $T(n, s) \le n^2$.

For $s \ge 4n/3$ this result is the best possible under our model of computation.

The second strategy is to use comparisons to determine which portions of the two strings hold identical symbols. We cannot, of course, determine the actual symbol at a position with "equal-unequal" comparisons. If we know the partition of the two strings into equivalent positions, however, then we can surely select an LCS for the strings without making any additional "equal-unequal" comparisons. We are thus motivated to make the following definitions.

The (m, s)-string identification problem is, given a string of length m, to determine which positions hold the same symbols, assuming all symbols are chosen from an s-symbol alphabet. A decision tree with "equal-unequal" comparisons for the string identification problem is defined as for the LCS problem, except the leaves are labeled with partitions of the integers 1, 2, ..., m into at most s equivalence classes.

The notions of assignment and validity of an assignment are defined as for the LCS problem. A decision tree solves the (m, s)-string identification problem if for each leaf w, all valid assignments for the path from the root to w have equal symbols at a pair of positions if and only if those positions are in the same block of the partition at w. Finally, we can define l(m, s) to be the minimum over all decision trees D solving the (m, s)-string identification problem of the length of the longest path in D.

LEMMA 1. $T(n, s) \le l(2n, s)$.

PROOF. Concatenate the two strings of length n into one string of length 2n, identify the equivalent positions, and determine from them an LCS for the two strings of length n. Note that no algorithm to solve the LCS problem for general n and s is implied by this strategy, but using it we can, for fixed n and s, build a decision tree for the (n, s)-LCS problem given a decision tree for the (2n, s)-string identification problem. []

LEMMA 2. $l(m, s) \le (s-1)\left(m - \frac{s}{2}\right)$ for all $1 \le s \le m$.

PROOF. Visit in turn each position of the given string, comparing the symbol at that position with the representatives for each of the equivalence classes found so far. If the symbol matches the representative of some class, it is added to that class. If no match is found, the symbol becomes the representative of a new class. Hence, for $1 \le i \le s$, at most $i - 1$ comparisons are needed for the ith position. For $i > s$, $s - 1$ comparisons suffice, since the sth comparison will always succeed if all others have failed. The total number of comparisons is thus

$$\sum_{i=1}^{s-1} i + (m-s)(s-1) = (s-1)\left(m - \frac{s}{2}\right).$$ []
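The strategy in the proof of Lemma 2 can be sketched as a procedure that sees the string only through an equality test. This is an illustrative sketch: positions are 0-based here, and the names `identify`, `equal`, and `count_comparisons` are hypothetical.

```python
from typing import Callable, List, Tuple


def identify(m: int, s: int, equal: Callable[[int, int], bool]) -> List[int]:
    """Label positions 0..m-1 by class, using only equal(i, j) queries.

    `equal` is the only access to the symbols; the caller promises that
    at most s distinct symbols occur.
    """
    reps: List[int] = []          # one representative position per class
    labels: List[int] = []
    for i in range(m):
        label = None
        for c, r in enumerate(reps):
            # With all s classes found and s-1 failures, the last is forced.
            if len(reps) == s and c == s - 1:
                label = c
                break
            if equal(i, r):
                label = c
                break
        if label is None:         # no match: position i starts a new class
            label = len(reps)
            reps.append(i)
        labels.append(label)
    return labels


def count_comparisons(text: str, s: int) -> Tuple[List[int], int]:
    """Run identify on text and return (labels, comparisons made)."""
    calls = [0]

    def equal(i: int, j: int) -> bool:
        calls[0] += 1
        return text[i] == text[j]

    return identify(len(text), s, equal), calls[0]
```

Running `count_comparisons` on the concatenation of two length-n strings corresponds to the construction of Lemma 1 with m = 2n. The class labels then determine which pairs of positions match, so an LCS can be selected without further comparisons.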

From Lemmas 1 and 2 we conclude:

THEOREM 2. For all s and n, $T(n, s) \le (s-1)\left(2n - \frac{s}{2}\right)$.

Note that Theorem 1 is stronger than Theorem 2 when $s/n \ge 2 - \sqrt{2} \approx .586$, and Theorem 2 is stronger otherwise.
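The threshold $2 - \sqrt{2}$ can be recovered by a short calculation. The following is a sketch that ignores lower-order terms in n and writes $\alpha = s/n$, a symbol introduced only for this derivation:

```latex
\begin{align*}
n^2 &\le (s-1)\left(2n - \tfrac{s}{2}\right)
  \approx s\left(2n - \tfrac{s}{2}\right)
  = n^2\,\alpha\left(2 - \tfrac{\alpha}{2}\right), \qquad \alpha = s/n,\\
1 &\le 2\alpha - \tfrac{\alpha^2}{2}
  \iff \alpha^2 - 4\alpha + 2 \le 0
  \iff 2 - \sqrt{2} \le \alpha \le 2 + \sqrt{2}.
\end{align*}
```

Since Lemma 2 is applied with m = 2n and requires $s \le m$, only $\alpha \le 2$ is relevant, so the bound $n^2$ is the smaller of the two exactly when $\alpha \ge 2 - \sqrt{2} \approx 0.586$.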

4. String Identification

Since the string identification problem was used in the proof of Theorem 2 to bound from above the complexity of the LCS problem, let us digress a moment and show that the upper bound on l(m, s) of Lemma 2 is its exact value.

To prove this result we relate the string identification problem to graph coloring. Given a path P in a decision tree for the LCS or string identification problem we can associate with P an undirected graph $G_P$ as follows. Let $R_P$ relate two positions if they have been compared and found equal along path P. Let $\equiv_P$ be the least equivalence relation containing $R_P$. That is, $p \equiv_P q$ if and only if p = q or the fact that p and q have the same symbol is implied by the outcomes along path P. Then the vertices of the graph $G_P$ are the equivalence classes of $\equiv_P$, and there is an


References

A general method applicable to the search for similarities in the amino acid sequence of two proteins

The Design and Analysis of Computer Algorithms

The String-to-String Correction Problem

A linear space algorithm for computing maximal common subsequences

Matching Sequences under Deletion/Insertion Constraints
