A Taxonomy of Suffix Array Construction Algorithms

Simon J. Puglisi¹, W. F. Smyth¹,², and Andrew Turpin³

¹ Department of Computing, Curtin University, GPO Box U1987
Perth WA 6845, Australia
e-mail: puglissj@computing.edu.au

² Algorithms Research Group, Department of Computing & Software
McMaster University, Hamilton ON L8S 4K1, Canada
e-mail: smyth@mcmaster.ca
www.cas.mcmaster.ca/cas/research/groups.shtml

³ School of Computer Science & Information Technology
RMIT University, GPO Box 2476V
Melbourne VIC 3001, Australia
e-mail: aht@cs.rmit.edu.au
Abstract. In 1990 Manber & Myers proposed suffix arrays as a space-saving
alternative to suffix trees and described the first algorithms for suffix array
construction and use. Since that time, and especially in the last few years, suffix
array construction algorithms have proliferated in bewildering abundance.
This survey paper attempts to provide simple high-level descriptions of these
numerous algorithms that highlight both their distinctive features and their
commonalities, while avoiding as much as possible the complexities of
implementation details. We also provide comparisons of the algorithms' worst-case
time complexity and use of additional space, together with results of recent
experimental test runs on many of their implementations.
1 Introduction
Suffix arrays were introduced in 1990 by Manber & Myers [MM90, MM93], along
with algorithms for their construction and use as a space-saving alternative to suffix
trees. In the intervening fifteen years there have certainly been hundreds of research
articles published on the construction and use of suffix trees and their variants. Over
that period, it has been shown that

• practical space-efficient suffix array construction algorithms (SACAs) exist that
  require worst-case time linear in string length [KA03, KS03];
• SACAs exist that are even faster in practice, though with supralinear worst-case
  construction time requirements [LS99, BK03, MF04, M05];
Supported in part by grants from the Natural Sciences & Engineering Research Council of
Canada and the Australian Research Council.
Proceedings of the Prague Stringology Conference '05
• any problem whose solution can be computed using suffix trees is solvable with
  the same asymptotic complexity using suffix arrays [AKO04].

Thus suffix arrays have become the data structure of choice for many, if not all, of
the string processing problems to which suffix tree methodology is applicable.
In this survey paper we do not attempt to cover the entire suffix array literature.
Our more modest goal is to provide an overview of SACAs, in particular those modeled
on the efficient use of main memory; we exclude the substantial literature (for
example, [CF02]) that discusses strategies based on the use of secondary storage.
Further, we deal with the construction of compressed ("succinct") suffix arrays only
insofar as they relate to standard SACAs. For example, algorithms such as those of
Grossi et al. and references therein [GGV04] are not covered.
Section 2 provides an overview of the SACAs known to us, organized into a
"taxonomy" based primarily on the methodology used. As with all classification schemes,
there is room for argument: there are many cross-connections between algorithms
that occur in disjoint subtrees of the taxonomy, just as there may be between species
in a biological taxonomy. Our aim is to provide as comprehensive and, at the same
time, as accessible a description of SACAs as we can.

Also in Section 2 we present the vocabulary to be used for the structured
description of each of the algorithms that will be given in Section 3. Then in Section 4, we
report the results of experimental tests on many of the algorithms described and
so draw conclusions about their relative speed and space-efficiency.
2 Overview
We consider throughout a finite nonempty string x = x[1..n] of length n ≥ 1,
defined on an indexed alphabet Σ; that is,

• the letters λ_j, j = 1, 2, ..., σ = |Σ|, are ordered: λ_1 < λ_2 < ... < λ_σ;
• an array A[λ_1..λ_σ] can be defined in which, for every j ∈ 1..σ, A[λ_j] is accessible
  in constant time;
• σ ∈ O(n).
Essentially, we assume that Σ can be treated as a sequence of integers whose range is
not too large. Typically, the λ_j may be represented by ASCII codes 0..255 (English
alphabet) or binary integers 00..11 (DNA) or simply bits, as the case may be. We
shall generally assume that a letter can be stored in a byte and that n can be stored
in one computer word (four bytes).

The use of terminology not defined here follows [S03].
We are interested in computing the suffix array of x, which we write SA_x or
just SA; that is, an array SA[1..n] in which SA[j] = i iff x[i..n] is the jth suffix of
x in (ascending) lexicographical order (lexorder). For simplicity we will frequently
refer to x[i..n] simply as "suffix i"; also, it will often be convenient for processing to
incorporate into x at position n an ending sentinel $ assumed to be less than any λ_j.
Then, for example, on alphabet Σ = {$, a, b, c, d, e}:
        1  2  3  4  5  6  7  8  9 10 11 12
   x =  a  b  e  a  c  a  d  a  b  e  a  $
  SA = 12 11  8  1  4  6  9  2  5  7 10  3
Thus SA tells us that x[12..12] = $ is the least suffix, x[11..12] = a$ the second least,
and so on (alphabetical ordering of the letters assumed). Note that SA is always a
permutation of 1..n.
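As an illustration only (this is our own sketch, not one of the algorithms surveyed), the definition above can be transcribed directly into a brute-force construction; the function name and the 1-based array convention are ours:

```python
def naive_suffix_array(x):
    """Brute-force SA construction, for illustration only.

    Sorts all n suffixes directly, so the worst case is O(n^2 log n),
    far slower than the SACAs surveyed here. Returns a 1-based array
    (slot 0 unused), so that SA[j] = i iff x[i..n] is the j-th suffix
    of x in lexorder.
    """
    n = len(x)
    # ASCII '$' precedes all letters, matching the sentinel convention
    return [0] + sorted(range(1, n + 1), key=lambda i: x[i - 1:])
```

On the example string this reproduces the SA shown above.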
Often used in conjunction with SA_x is the lcp array lcp[1..n]: for every j ∈ 2..n,
lcp[j] is just the length of the longest common prefix of suffixes SA[j−1] and SA[j].
In our example:

        1  2  3  4  5  6  7  8  9 10 11 12
   x =  a  b  e  a  c  a  d  a  b  e  a  $
  SA = 12 11  8  1  4  6  9  2  5  7 10  3
 lcp =     0  1  4  1  1  0  3  0  0  0  2

Thus the longest common prefix of suffixes 11 and 8 is 1, that of suffixes 8 and 1
is 4. Since lcp can be computed in linear time from SA_x [KLAAP01, M04], also as a
byproduct of some of the SACAs discussed below, we do not consider its construction
further in this paper. However, the average lcp (that is, the average of the n−1
integers in the lcp array) is as we shall see a useful indicator of the relative
efficiency of certain SACAs, notably Algorithm S.

We remark that both SA and lcp can be computed in linear time by a preorder
traversal of a suffix tree.
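For concreteness, the linear-time lcp computation of [KLAAP01] can be sketched as follows (our own 1-based transcription, not code from that paper). It visits suffixes in text order, exploiting the fact that the matched prefix length can fall by at most 1 from suffix i to suffix i+1:

```python
def lcp_array(x, sa):
    """Linear-time lcp from SA, in the style of [KLAAP01] (a sketch).

    Arrays are 1-based with an unused slot 0, as in the paper:
    lcp[j] = length of the longest common prefix of suffixes
    SA[j-1] and SA[j], for j in 2..n.
    """
    n = len(x)
    isa = [0] * (n + 1)
    for j in range(1, n + 1):
        isa[sa[j]] = j
    lcp = [0] * (n + 1)
    h = 0
    for i in range(1, n + 1):        # suffixes in text order
        if isa[i] > 1:
            k = sa[isa[i] - 1]       # lexicographic predecessor of suffix i
            while i + h <= n and k + h <= n and x[i + h - 1] == x[k + h - 1]:
                h += 1
            lcp[isa[i]] = h
            if h > 0:
                h -= 1               # h can shrink by at most 1 per step
        else:
            h = 0                    # suffix i is the least; restart
    return lcp
```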
Many of the SACAs also make use of the inverse suffix array, written ISA_x
or ISA: an array ISA[1..n] in which

    ISA[i] = j  ⟺  SA[j] = i.

ISA[i] = j therefore says that suffix i has rank j in lexorder. Continuing our example:

        1  2  3  4  5  6  7  8  9 10 11 12
   x =  a  b  e  a  c  a  d  a  b  e  a  $
 ISA =  4  8 12  5  9  6 10  3  7 11  2  1

Thus ISA tells us that suffix 1 has rank 4 in lexorder, suffix 2 rank 8, and so on. Note
that ISA is also a permutation of 1..n, and so SA and ISA are computable, one from
the other, in Θ(n) time:

    for j ← 1 to n do
        SA[ISA[j]] ← j

As shown in Figure 1, this computation can if required also be done in place.
Many of the algorithms we shall be describing depend upon a partial sort of some
or all of the suffixes of x, partial because it is based on an ordering of the prefixes
of these suffixes that are of length h ≥ 1. We refer to this partial ordering as an
h-ordering of suffixes into h-order, and to the process itself as an h-sort. If two
or more suffixes are equal under h-order, we say that they have the same h-rank
and therefore fall into the same h-group; they are accordingly said to be h-equal.
Usually an h-sort is stable, so that any previous ordering of the suffixes is retained
within each h-group.
    for j ← 1 to n do
        i ← SA[j]
        -- negative entries already processed
        if i > 0 then
            j′ ← j
            repeat
                temp ← SA[i]; SA[i] ← −j′
                j′ ← i; i ← temp
            until i = j
            SA[i] ← j′
        else SA[j] ← −i

Figure 1: Algorithm for computing ISA from SA in place
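A direct transcription of Figure 1 into Python (our own sketch; a 1-based array with an unused slot 0, as in the pseudocode) is:

```python
def isa_in_place(sa):
    """Convert a suffix array to its inverse in place, as in Figure 1.

    `sa` is 1-based: sa[0] is unused and sa[1..n] is a permutation of
    1..n. Negated entries mark positions already holding their final
    (inverse) value; the sign is restored when the for loop reaches them.
    """
    n = len(sa) - 1
    for j in range(1, n + 1):
        i = sa[j]
        if i > 0:                # negative entries already processed
            jp = j
            while True:          # follow the permutation cycle from j
                temp = sa[i]
                sa[i] = -jp
                jp, i = i, temp
                if i == j:
                    break
            sa[i] = jp           # cycle start: store positive, no mark needed
        else:
            sa[j] = -i           # entry already holds its ISA value
```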
The results of an h-sort are often stored in an approximate suffix array, written
SA_h, and/or an approximate inverse suffix array, written ISA_h. Here is the result of
a 1-sort on all the suffixes of our example string:

         1  2  3  4  5  6  7  8  9 10 11 12
    x =  a  b  e  a  c  a  d  a  b  e  a  $
 SA_1 = 12 (1  4  6  8 11) (2  9) 5  7 (3 10)
ISA_1 =  2  7 11  2  9  2 10  2  7 11  2  1
    or   6  8 12  6  9  6 10  6  8 12  6  1
    or   2  3  6  2  4  2  5  2  3  6  2  1

The parentheses in SA_1 enclose 1-groups not yet reduced to a single entry, thus not
yet in final sorted order. Note that SA_h retains the property of being a permutation of
1..n, while ISA_h may not. Depending on the requirements of the particular algorithm,
ISA_h may as shown express the h-rank of each h-group in various ways:
• the leftmost position j in SA_h of a member of the h-group, also called the head
  of the h-group;
• the rightmost position j in SA_h of a member of the h-group, also called the tail
  of the h-group;
• the ordinal left-to-right counter of the h-group in SA_h.
Compare the result of a 3-sort:

         1  2  3  4  5  6  7  8  9 10 11 12
    x =  a  b  e  a  c  a  d  a  b  e  a  $
 SA_3 = 12 11 (1  8) 4  6 (2  9) 5  7 10  3
ISA_3 =  3  7 12  5  9  6 10  3  7 11  2  1
    or   4  8 12  5  9  6 10  4  8 11  2  1
    or   3  6 10  4  7  5  8  3  6  9  2  1

Observe that an (h+1)-sort is a refinement of an h-sort: all members of an (h+1)-group
belong to a single h-group.
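The h-sort vocabulary can be made concrete with a short sketch (ours, not the paper's; it uses a comparison sort for simplicity, whereas real SACAs h-sort much more efficiently). It returns SA_h together with the head form of ISA_h:

```python
def h_sort(x, h):
    """Stable h-sort of all suffixes of x[1..n] (illustrative sketch only).

    Returns (SA_h, ISA_h) as 1-based arrays (slot 0 unused), with ISA_h
    expressing each h-rank as the head, i.e. the leftmost SA_h position,
    of its h-group.
    """
    n = len(x)
    prefix = lambda i: x[i - 1:i - 1 + h]   # length-h prefix of suffix i
    # sorted() is stable, so any previous ordering within a group survives
    sa_h = [0] + sorted(range(1, n + 1), key=prefix)
    isa_h = [0] * (n + 1)
    head = 1
    for j in range(1, n + 1):
        if j > 1 and prefix(sa_h[j]) != prefix(sa_h[j - 1]):
            head = j                        # a new h-group starts here
        isa_h[sa_h[j]] = head
    return sa_h, isa_h
```

On the example string, h = 1 and h = 3 reproduce the tables above (with the head form of ISA_h).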
We now have available a vocabulary sufficient to characterize the main species of
SACA as follows.

(1) Prefix-Doubling
First a fast 1-sort is performed (since Σ is indexed, bucket sort can be used);
this yields SA_1/ISA_1. Then for every h = 1, 2, ..., SA_2h/ISA_2h are computed
in Θ(n) time from SA_h/ISA_h until every 2h-group is a singleton. The time
required is therefore O(n log n). There are two algorithms in this class: MM
[MM90, MM93] and LS [S98, LS99].
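A minimal prefix-doubling sketch (ours; it is neither MM nor LS, and it uses a comparison sort in each round instead of the Θ(n) bucket/radix step, so it does not achieve the O(n log n) bound):

```python
def prefix_doubling_sa(x):
    """Prefix-doubling sketch (our simplification, not algorithm MM or LS).

    Works 0-based internally; returns a 1-based SA (slot 0 unused).
    Each round extends an h-order to a 2h-order by sorting suffixes on
    the pair (h-rank of suffix i, h-rank of suffix i+h).
    """
    n = len(x)
    rank = [ord(c) for c in x]               # h-ranks after a 1-sort
    sa = sorted(range(n), key=lambda i: rank[i])
    h = 1
    while True:
        # suffixes shorter than h+1 rank lowest, like the $ sentinel
        key = lambda i: (rank[i], rank[i + h] if i + h < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:            # every 2h-group is a singleton
            break
        h *= 2
    return [0] + [i + 1 for i in sa]
```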
(2) Recursive
Form strings x′ and y from x, then show that once SA_x′ is computed, first
SA_y and finally SA_x can be computed in O(n) time. Hence the problem of
computing SA_x′ recursively replaces the computation of SA_x. Since |x′| is
always chosen so as to be less than 2|x|/3, the overall time requirement of these
algorithms is Θ(n). There are three main algorithms in this class: KA [KA03],
KS [KS03] and KJP [KJP04].
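The Θ(n) bound follows from the geometric shrinkage of the subproblem: if each recursion level does at most cn work and |x′| < 2|x|/3, then (a standard unrolling, not spelled out here in the survey)

```latex
T(n) \;\le\; T\!\left(\tfrac{2n}{3}\right) + cn
     \;\le\; cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     \;=\; 3cn \;\in\; O(n),
```

and since any SACA must at least read all of x, T(n) ∈ Θ(n).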
(3) Induced Copying
The key insight here is the same as for the recursive algorithms: a complete sort
of a selected subset of suffixes can be used to "induce" a complete sort of other
subsets of suffixes. The approach however is nonrecursive: an efficient suffix
sorting technique (for example, [BM93, MBM93, M97, BS97, SZ04]) is invoked
for the selected subset of suffixes. The general idea seems to have been first
proposed by Burrows & Wheeler [BW94], but it has been implemented in quite
different ways [IT99, S00, MF04, SS05, BK03, M05]. In general, these methods
are very efficient in practice, but may have worst-case asymptotic complexity
as high as O(n^2 log n).
The goal is to design SACAs that

• have minimal asymptotic complexity Θ(n);
• are fast "in practice" (that is, on collections of large real-world data sets such
  as [H04]);
• are lightweight; that is, use a small amount of working storage in addition
  to the 5n bytes required by x and SA_x.

To date none of the SACAs that has been proposed achieves all of these objectives.
Figure 2 presents our taxonomy of the fourteen species of SACA that have been
recognized so far; Table 1 summarizes their time and space requirements.