Bounds on the Complexity of the Longest Common Subsequence Problem

Abstract
The problem of finding a longest common subsequence of two strings is discussed. This problem arises in data processing applications such as comparing two files and in genetic applications such as studying molecular evolution. The difficulty of computing a longest common subsequence of two strings is examined using the decision tree model of computation, in which vertices represent “equal - unequal” comparisons. It is shown that unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings. A general lower bound as a function of the ratio of alphabet size to string length is derived. The case where comparisons between symbols of the same string are forbidden is also considered and it is shown that this problem is of linear complexity for a two-symbol alphabet and quadratic for an alphabet of three or more symbols.


A. V. AHO
Bell Laboratories, Murray Hill, New Jersey
D. S. HIRSCHBERG AND J. D. ULLMAN
Princeton University, Princeton, New Jersey
KEY WORDS AND PHRASES: longest common subsequence, algorithm, computational complexity, file comparison, molecular evolution
CR CATEGORIES: 3.12, 3.73, 5.25
1. Introduction
A subsequence of a given string is any string obtained by deleting zero or more symbols from the given string. A longest common subsequence (LCS) of two strings is a subsequence of both that is as long as any other common subsequence. For example, "cled" and "coed" are the longest common subsequences of "schooled" and "encyclopedia".
Being able to determine a longest common subsequence of two strings is useful in data processing and genetic applications. In data processing a longest common subsequence is often used to measure the differences between two files of data. For example, we can consider a file to be a string in which each line of the file is treated as a single symbol. A longest common subsequence of two files identifies those portions of the files that are identical. A genetic application arises in the study of the evolution of long molecules such as proteins; there a longest common subsequence is used to measure the correlation between two such molecules [11, 14].

Copyright © 1976, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to this publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.
This research was partially supported by a National Science Foundation Fellowship to D. S. Hirschberg and by National Science Foundation Grant GJ-35570 to Princeton University.
A preliminary version of this paper was presented at the 15th Annual IEEE Symposium on Switching and Automata Theory, October 14-16, 1974.
Authors' present addresses: A. V. Aho, Bell Laboratories, Inc., 600 Mountain Avenue, Murray Hill, NJ 07974; D. S. Hirschberg, Department of Electrical Engineering, Rice University, Houston, TX 77001; J. D. Ullman, Department of Electrical Engineering, Princeton University, Princeton, NJ 08540.
Journal of the Association for Computing Machinery, Vol. 23, No. 1, January 1976, pp. 1-12.

Using dynamic programming an LCS of two strings $A$ and $B$ can be computed in time proportional to the product of their lengths. For special cases an LCS can be computed in time less than the product. For example, if $A$ and $B$ are length $n$ strings of digits $1, 2, \ldots, n$, and no position of $A$ matches more than one position of $B$, then an LCS of $A$ and $B$ can be computed in $O(n \log \log n)$ time by specializing the algorithms in [6, 9, 18] to integers and using van Emde Boas' integer merging technique [15]. Always being able to compute an LCS of two strings in time significantly less than the product of their lengths, however, appears very difficult [3].
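The dynamic programming approach mentioned above can be sketched as follows. This is the textbook table-filling formulation, offered as an illustration and not necessarily the exact procedure of any cited reference; the function name is an assumption.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one LCS of a and b using an O(len(a) * len(b)) table."""
    m, n = len(a), len(b)
    # length[i][j] = length of an LCS of the suffixes a[i:] and b[j:]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])

    # Walk the table from the top-left corner to recover one LCS.
    result = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(result)
```

Calling `longest_common_subsequence("schooled", "encyclopedia")` returns a common subsequence of length four. The two nested loops make the time proportional to the product of the two lengths.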
For this reason we believe that an attempt at a lower bound is in order. To derive lower bounds a precise model for a class of algorithms is necessary. The model we choose is that of a decision tree [1] in which every decision is whether or not two positions have the same symbol. This model fits various algorithms for the LCS problem which have appeared in the literature [7, 14, 16]. It has also been used to study the related string-to-string correction problem [17], the substring matching problem [5], and various problems on sets [13]. The model does not, however, fit the $O(n^2 \log \log n / \log n)$ algorithm of Paterson [12] nor the special case algorithms of [8, 9].
For the remainder of this paper $A$ and $B$ denote two strings of length $n$ whose LCS we wish to compute.¹ Throughout, $s$ denotes the total number of distinct symbols that can appear in $A$ and $B$ (the alphabet size). $T(n, s)$ is the minimum number of comparisons under the decision tree model needed to find an LCS of $A$ and $B$ in the worst case.
We shall derive both upper and lower bounds on $T(n, s)$. The use of lower bounds is clear. They say that there are no algorithms of lower time complexity which can be modeled by a decision tree with "equal-unequal" comparisons. We are thus told something about the way algorithms for the LCS problem must behave, if they exist at all.
The need for upper bounds on $T(n, s)$ is less obvious. We shall use them to demonstrate that no stronger bounds on $T(n, s)$ can be shown. In principle, an upper bound on $T(n, s)$ is an algorithm for the LCS problem. The algorithm, however, may involve essentially different decision trees for each value of $n$ and $s$. Thus, it is possible that no uniform algorithm taking strings of arbitrary lengths and finding their LCS can be obtained from a sequence of decision trees for all $n$ and $s$, and such appears to be the case here.
Our principal results are the following:
(1) $T(n, 2) = 2n - 1$ for $n \ge 1$.
(2) For all $n \ge 1$:
    (i) $\frac{s}{2}\left(n + \frac{s}{2}\right) \le T(n, s) \le \min\left[n^2, (s-1)\left(2n - \frac{s}{2}\right)\right]$, for $2 \le s \le n$.
    (ii) $3ns/4 \le T(n, s) \le n^2$, for $n \le s \le 4n/3$.
    (iii) $T(n, s) = n^2$, for $s \ge 4n/3$.
These upper and lower bounds on $T(n, s)$ are shown in Figure 1.
(3) The special case where all comparisons are between symbols of different strings is shown to require $2n - 1$ comparisons if $s = 2$ and $n^2$ comparisons if $s \ge 3$.
¹ We can, in a straightforward manner, generalize the results of this paper to the case where the strings are of unequal length.

[Figure 1: plot of the upper and lower bounds on $T(n, s)$ (vertical axis, marked at $n^2$) against $s$ (horizontal axis, marked at $.586n$, $n$, $4n/3$, and $2n$).]
FIG. 1. Upper and lower bounds on $T(n, s)$
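The bounds listed in the principal results can be tabulated for concrete values of $n$ and $s$. The sketch below is illustrative only: the function name is an assumption, and the lower bound in (2)(i) is read as $(s/2)(n + s/2)$, which is the reading that agrees with (2)(ii) at $s = n$.

```python
from typing import Tuple


def lcs_comparison_bounds(n: int, s: int) -> Tuple[float, float]:
    """Return (lower, upper) bounds on T(n, s) from the principal results."""
    if n < 1 or s < 2:
        raise ValueError("expected n >= 1 and s >= 2")
    if s == 2:
        return (2 * n - 1, 2 * n - 1)      # result (1): exact value
    if 3 * s >= 4 * n:
        return (n * n, n * n)              # result (2)(iii): exact value
    if s >= n:
        return (3 * n * s / 4, n * n)      # result (2)(ii)
    # Result (2)(i); the lower bound is read as (s/2)(n + s/2) (assumption).
    lower = (s / 2) * (n + s / 2)
    upper = min(n * n, (s - 1) * (2 * n - s / 2))
    return (lower, upper)


if __name__ == "__main__":
    n = 12
    for s in (2, 4, 8, 12, 14, 16, 24):
        print(s, lcs_comparison_bounds(n, s))
```

For fixed $n$, the gap between the two bounds closes at $s = 2$ and again once $s \ge 4n/3$.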
2. Decision Trees
This section makes precise the decision tree model of computation. Intuitively, each path starting at the root of a decision tree represents a sequence of comparisons made between various positions in the strings $A$ and $B$. These comparisons give us all the information we currently know about $A$ and $B$. The information is in the form of which positions in $A$ and $B$ must contain identical or distinct symbols.
More formally, we define a decision tree with "equal-unequal" comparisons for the LCS problem as a rooted binary tree in which each interior vertex is labeled with a pair of integers and each leaf is labeled by two lists of positions from $A$ and $B$, respectively. A pair of integers $p:q$ at an interior vertex represents a comparison between the symbols in positions $p$ and $q$ of the two strings. ($p$ and $q$ can be positions in the same string.) Each list of positions at a leaf represents an LCS of $A$ and $B$.
Since the only information we get about $A$ and $B$ comes from "equal-unequal" comparisons among symbols of the two strings, we are always dealing with relative values of symbols in various positions in the strings. Consequently, it is convenient to define an $(n, s)$-assignment (or assignment when $n$ and $s$ are clear) as a setting of values from some $s$-symbol alphabet to the positions of $A$ and $B$. Intuitively, an assignment is a representative of an equivalence class of pairs of input strings. Given a path $P = v_1, v_2, \ldots, v_m$ from the root to some vertex (not necessarily a leaf) in a decision tree, we say $(n, s)$-assignment $C$ is valid for $P$ if for each pair of positions $p_i : q_i$ at vertex $v_i$, $1 \le i < m$, $v_{i+1}$ is the left son of $v_i$ if the symbols in positions $p_i$ and $q_i$ are equal according to $C$, and $v_{i+1}$ is the right son of $v_i$ otherwise. Thus an assignment $C$ is valid for a path if $C$ represents a class of pairs of input strings whose symbols are consistent with the outcomes of the comparisons made along the path.
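The validity condition can be made concrete with a small check. In this sketch a path is recorded as a list of comparisons together with the branch taken, and an assignment is a pair of strings. The representation, the names, and the 0-based position indices are assumptions chosen for illustration.

```python
from typing import List, Tuple

Position = Tuple[str, int]              # ("A", 0) is the first position of A
Step = Tuple[Position, Position, bool]  # (p, q, took_left_branch)


def symbol_at(assignment: Tuple[str, str], position: Position) -> str:
    """Look up the symbol that the assignment gives to a position."""
    a, b = assignment
    which, index = position
    return a[index] if which == "A" else b[index]


def is_valid_for_path(assignment: Tuple[str, str], path: List[Step]) -> bool:
    """True if every recorded outcome agrees with the assignment.

    The left branch is taken exactly when the two compared symbols are equal.
    """
    for p, q, took_left_branch in path:
        equal = symbol_at(assignment, p) == symbol_at(assignment, q)
        if equal != took_left_branch:
            return False
    return True
```

For example, the path `[(("A", 0), ("B", 0), True), (("A", 1), ("B", 1), False)]` is valid for the assignments `("ab", "ac")` and `("xy", "xz")`, which lie in the same equivalence class, but not for `("ab", "ab")`.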
We say a decision tree $D$ solves the $(n, s)$-LCS problem (or just the LCS problem if $n$ and $s$ are clear) if for every leaf $w$ of $D$ and for every $(n, s)$-assignment $C$ valid for the path from the root of $D$ to $w$, the two lists of positions found at $w$ are an LCS in the first and second strings, respectively.
The complexity of a decision tree is the length of a longest path in that tree. We define $T(n, s)$ to be the minimum complexity over all decision trees that solve the $(n, s)$-LCS problem.
A free decision tree is one which makes no comparisons whose outcomes are already known. (For example, if the symbols at positions $p$ and $q$ and at positions $q$ and $r$ have been compared and found equal, then, by transitivity, the symbols at positions $p$ and $r$ are also known to be equal.) We can, without loss of generality, assume that all decision trees being considered are free. This assumption allows us to consider decision trees in which there are no unnecessary comparisons.
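One way to see which outcomes are "already known" is to track the equalities and inequalities observed so far. The following is a minimal sketch using a union-find structure; the class and method names are hypothetical. It captures only inferences from transitivity of equality and from recorded inequalities between classes, and ignores any further inference that depends on the alphabet size $s$.

```python
class ComparisonKnowledge:
    """Tracks what a sequence of "equal-unequal" outcomes already implies."""

    def __init__(self, positions: int) -> None:
        self.parent = list(range(positions))
        self.unequal = set()  # frozensets of two class representatives

    def find(self, p: int) -> int:
        """Return the representative of the class containing p."""
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]  # path halving
            p = self.parent[p]
        return p

    def known_outcome(self, p: int, q: int):
        """Return True or False if the outcome is implied, else None."""
        rp, rq = self.find(p), self.find(q)
        if rp == rq:
            return True
        if frozenset((rp, rq)) in self.unequal:
            return False
        return None

    def record(self, p: int, q: int, equal: bool) -> None:
        """Record the outcome of comparing positions p and q."""
        known = self.known_outcome(p, q)
        if known is not None and known != equal:
            raise ValueError("outcome contradicts earlier comparisons")
        rp, rq = self.find(p), self.find(q)
        if equal:
            self.parent[rq] = rp
            # Re-root stored inequalities that mention the merged class.
            self.unequal = {
                frozenset(self.find(x) for x in pair) for pair in self.unequal
            }
        else:
            self.unequal.add(frozenset((rp, rq)))
```

A free decision tree, in this picture, only asks about pairs for which `known_outcome` returns `None`.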
Example 1. To fix the model more closely, let us consider the case where $n = s = 2$. (That is, we are to find an LCS of two strings each of length 2, and each over the same two-symbol alphabet.) For convenience we let $A = a_1 a_2$ and $B = b_1 b_2$. In Figure 2 we see a decision tree that solves the (2, 2)-LCS problem. It has complexity 3, which we shall see is the minimum for this problem. Thus $T(2, 2) = 3$. □
[Figure 2: binary decision tree whose interior vertices are comparisons such as $a_2 = b_2$, each with a yes branch and a no branch.]
FIG. 2. Decision tree solving the (2, 2)-LCS problem
3. Upper Bounds
There are two trivial strategies that can be used to construct decision trees for a fixed $n$ and $s$. The first strategy is to compare each symbol of one string with each symbol of the other. It yields the following theorem.
THEOREM 1. For all $s$ and $n$, $T(n, s) \le n^2$.
For $s \ge 4n/3$ this result is the best possible under our model of computation.
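The first strategy can be sketched in code. Here the strings are hidden behind a function `equal(i, j)` that answers whether position `i` of $A$ and position `j` of $B$ hold the same symbol; that interface, the function name, and the 0-based indices are assumptions made for illustration. After the $n^2$ comparisons, an LCS is selected from the table of outcomes alone.

```python
from typing import Callable, List, Tuple


def lcs_by_cross_comparisons(
    n: int, equal: Callable[[int, int], bool]
) -> Tuple[List[int], List[int], int]:
    """Find an LCS using only equal(i, j) queries between the two strings.

    Returns the chosen positions in A, the chosen positions in B, and the
    number of comparisons made (always n * n).
    """
    comparisons = 0
    match = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            match[i][j] = equal(i, j)
            comparisons += 1

    # From here on only the match table is consulted, never the symbols.
    length = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if match[i][j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])

    positions_a: List[int] = []
    positions_b: List[int] = []
    i = j = 0
    while i < n and j < n:
        if match[i][j]:
            positions_a.append(i)
            positions_b.append(j)
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return positions_a, positions_b, comparisons
```

The two returned lists play the role of the leaf labels in the decision tree model: they name positions, not symbols.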
The second strategy is to use comparisons to determine which positions of the two strings hold identical symbols. We cannot, of course, determine the actual symbol at a position with "equal-unequal" comparisons. If we know the partition of the two strings into equivalent positions, however, then we can surely select an LCS for the strings without making any additional "equal-unequal" comparisons. We are thus motivated to make the following definitions.
The $(m, s)$-string identification problem is, given a string of length $m$, to determine which positions hold the same symbols, assuming all symbols are chosen from an $s$-symbol alphabet. A decision tree with "equal-unequal" comparisons for the string identification problem is defined as for the LCS problem, except the leaves are labeled with partitions of the integers $1, 2, \ldots, m$ into at most $s$ equivalence classes.
The notions of assignment and validity of an assignment are defined as for the LCS problem. A decision tree solves the $(m, s)$-string identification problem if for each leaf $w$, all valid assignments for the path from the root to $w$ have equal symbols at a pair of positions if and only if those positions are in the same block of the partition at $w$. Finally, we can define $l(m, s)$ to be the minimum over all decision trees $D$ solving the $(m, s)$-string identification problem of the length of the longest path in $D$.
LEMMA 1. $T(n, s) \le l(2n, s)$.
PROOF. Concatenate the two strings of length $n$ into one string of length $2n$, identify the equivalent positions, and determine from them an LCS for the two strings of length $n$. Note that no algorithm to solve the LCS problem for general $n$ and $s$ is implied by this strategy, but using it we can, for fixed $n$ and $s$, build a decision tree for the $(n, s)$-LCS problem given a decision tree for the $(2n, s)$-string identification problem. □
LEMMA 2. $l(m, s) \le (s-1)\left(m - \frac{s}{2}\right)$ for all $1 \le s \le m$.
PROOF. Visit in turn each position of the given string, comparing the symbol at that position with the representatives for each of the equivalence classes found so far. If the symbol matches the representative of some class, it is added to that class. If no match is found, the symbol becomes the representative of a new class. Hence, for $1 \le i \le s$, at most $i - 1$ comparisons are needed for the $i$th position. For $i > s$, $s - 1$ comparisons suffice, since the $s$th comparison will always succeed if all others have failed. The total number of comparisons is thus
$$\sum_{i=1}^{s-1} i + (m - s)(s - 1) = (s - 1)\left(m - \frac{s}{2}\right). \qquad \square$$
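The procedure in the proof of Lemma 2 can be sketched as follows. The string is accessed only through a function `equal(p, q)`; that interface, the function name, and the 0-based positions are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def identify_positions(
    m: int, s: int, equal: Callable[[int, int], bool]
) -> Tuple[List[int], int]:
    """Partition positions 0..m-1 into classes of equal symbols.

    equal(p, q) is the only access to the string, which is assumed to use
    at most s distinct symbols. Returns a class index for each position
    and the number of comparisons used.
    """
    representatives: List[int] = []   # one position per class found so far
    class_of: List[int] = []
    comparisons = 0
    for position in range(m):
        assigned = None
        for index, representative in enumerate(representatives):
            if index == s - 1:
                # s - 1 comparisons failed and all s classes exist, so the
                # symbol must belong to the last class: no comparison needed.
                assigned = index
                break
            comparisons += 1
            if equal(position, representative):
                assigned = index
                break
        if assigned is None:
            # No class matched: this position starts a new class.
            representatives.append(position)
            assigned = len(representatives) - 1
        class_of.append(assigned)
    return class_of, comparisons
```

With $m = 4$ and $s = 2$ the bound of Lemma 2 evaluates to 3, so by Lemma 1 this procedure gives $T(2, 2) \le 3$, in agreement with Example 1.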
From Lemmas 1 and 2 we conclude:
THEOREM 2. For all $s$ and $n$, $T(n, s) \le (s-1)\left(2n - \frac{s}{2}\right)$.
Note that Theorem 1 is stronger than Theorem 2 when $\frac{s}{n} \ge 2 - \sqrt{2} \approx .586$ and Theorem 2 is stronger otherwise.
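The threshold can be recovered by equating the two upper bounds. The sketch below writes $x = s/n$ and replaces $s - 1$ by $s$, an approximation that keeps only the leading terms for large $n$.

```latex
\begin{align*}
n^2 &= s\left(2n - \tfrac{s}{2}\right) \\
1 &= 2x - \tfrac{x^2}{2}, \qquad x = \tfrac{s}{n} \\
x^2 - 4x + 2 &= 0 \\
x &= 2 \pm \sqrt{2}
\end{align*}
```

Only the root $x = 2 - \sqrt{2} \approx .586$ lies in the range $0 \le x \le 1$ where the comparison of the two theorems is of interest.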
4. String Identification
Since the string identification problem was used in the proof of Theorem 2 to bound from above the complexity of the LCS problem, let us digress a moment and show that the upper bound on $l(m, s)$ of Lemma 2 is its exact value.
To prove this result we relate the string identification problem to graph coloring. Given a path $P$ in a decision tree for the LCS or string identification problem we can associate with $P$ an undirected graph $G_P$ as follows. Let $R_P$ relate two positions if they have been compared and found equal along path $P$. Let $\equiv_P$ be the least equivalence relation containing $R_P$. That is, $p \equiv_P q$ if and only if $p = q$ or the fact that $p$ and $q$ have the same symbol is implied by the outcomes along path $P$. Then the vertices of the graph $G_P$ are the equivalence classes of $\equiv_P$, and there is an
