
A Taxonomy of Suffix Array Construction Algorithms

Simon J. Puglisi(1), W. F. Smyth(1,2), and Andrew Turpin(3)

(1) Department of Computing, Curtin University, GPO Box U1987,
    Perth WA 6845, Australia
    e-mail: puglissj@computing.edu.au
(2) Algorithms Research Group, Department of Computing & Software,
    McMaster University, Hamilton ON L8S 4K1, Canada
    e-mail: smyth@mcmaster.ca
    www.cas.mcmaster.ca/cas/research/groups.shtml
(3) School of Computer Science & Information Technology,
    RMIT University, GPO Box 2476V, Melbourne VIC 3001, Australia
    e-mail: aht@cs.rmit.edu.au
Abstrat.
In 1990 Manb er & Myers prop osed suÆx arrays as a spae-saving
alternative to suÆx trees and desrib ed the rst algorithms for suÆx array
onstrution and use. Sine that time, and esp eially in the last few years, suf-
x array onstrution algorithms have proliferated in b ewildering abundane.
This survey pap er attempts to provide simple high-level desriptions of these
numerous algorithms that highlight b oth their distintive features and their
ommonalities, while avoiding as muh as possible the omplexities of imple-
mentation details. We also provide omparisons of the algorithms' worst-ase
time omplexity and use of additional spae, together with results of reent
exp erimental test runs on many of their implementations.
1 Introduction

Suffix arrays were introduced in 1990 by Manber & Myers [MM90, MM93], along with algorithms for their construction and use as a space-saving alternative to suffix trees. In the intervening fifteen years there have certainly been hundreds of research articles published on the construction and use of suffix trees and their variants. Over that period, it has been shown that

- practical space-efficient suffix array construction algorithms (SACAs) exist that require worst-case time linear in string length [KA03, KS03];
- SACAs exist that are even faster in practice, though with supralinear worst-case construction time requirements [LS99, BK03, MF04, M05];
* Supported in part by grants from the Natural Sciences & Engineering Research Council of Canada and the Australian Research Council.

Pro eedings of the Prague Stringology Conferene '05
- any problem whose solution can be computed using suffix trees is solvable with the same asymptotic complexity using suffix arrays [AKO04].

Thus suffix arrays have become the data structure of choice for many, if not all, of the string processing problems to which suffix tree methodology is applicable.
In this survey paper we do not attempt to cover the entire suffix array literature. Our more modest goal is to provide an overview of SACAs, in particular those modeled on the efficient use of main memory; we exclude the substantial literature (for example, [CF02]) that discusses strategies based on the use of secondary storage. Further, we deal with the construction of compressed ("succinct") suffix arrays only insofar as they relate to standard SACAs. For example, algorithms such as those of Grossi et al. and references therein [GGV04] are not covered.
Setion 2 provides an overview of the SACAs known to us, organized into a \tax-
onomy" based primarily on the metho dology used. As with all lassiation shemes,
there is room for argument: there are many ross-onnetions b etween algorithms
that o ur in disjoint subtrees of the taxonomy, just as there may b e b etween sp eies
in a biologial taxonomy. Our aim is to provide as omprehensive and, at the same
time, as aessible a desription of SACAs as we an.
Also in Setion 2 we present the voabulary to b e used for the strutured desrip-
tion of eah of the algorithms that will b e given in Setion 3. Then in Setion 4, we
rep ort on the results of exp erimental results on many of the algorithms desrib ed and
so draw onlusions ab out their relative sp eed and spae-eÆieny.
2 Overview
We onsider throughout a nite nonempty
string
x
=
x
[1
::n
of
length
n
1,
dened on an
indexed
alphab et ; that is,
the letters
j
; j
= 1
;
2
; : : : ;
of
j
j
are ordered:
1
<
2
<
<
;
an array
A
[
1
::
an b e dened in whih, for every
j
2
1
::
,
A
[
j
is aessible
in onstant time;
1
2
O
(
n
).
Essentially, we assume that Σ can be treated as a sequence of integers whose range is not too large. Typically, the λ_j may be represented by ASCII codes 0..255 (English alphabet) or binary integers 00..11 (DNA) or simply bits, as the case may be. We shall generally assume that a letter can be stored in a byte and that n can be stored in one computer word (four bytes).

The use of terminology not defined here follows [S03].
We are interested in omputing the
suÆx array
of
x
, whih we write SA
x
or
just SA; that is, an array SA[1
::n
in whih SA[
j
=
i
i
x
[
i::n
is the
j
th
suÆx of
x
in (asending) lexiographial order (
lexorder
). For simpliity we will frequently
refer to
x
[
i::n
simply as \suÆx i"; also, it will often b e onvenient for pro essing to
inorp orate into
x
at p osition
n
an ending sentinel $ assumed to b e less than any
j
.
Then, for example, on alphab et =
f
$
; a; b; ; d; e
g
:
        1  2  3  4  5  6  7  8  9 10 11 12
 x   =  a  b  e  a  c  a  d  a  b  e  a  $
 SA  = 12 11  8  1  4  6  9  2  5  7 10  3
Thus SA tells us that x[12..12] = $ is the least suffix, x[11..12] = a$ the second least, and so on (alphabetical ordering of the letters assumed). Note that SA is always a permutation of 1..n.
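As a concrete check of this definition, here is a brief Python sketch (an illustration, not part of the paper) that builds SA by naively sorting all suffixes; positions are 1-based as in the text:

```python
def suffix_array_naive(x):
    """Build SA by sorting all suffixes directly.

    O(n^2 log n) in the worst case, but fine for illustration.
    Returns 1-based suffix positions, as in the paper.
    """
    n = len(x)
    # sort the start positions by the suffixes they denote
    return sorted(range(1, n + 1), key=lambda i: x[i - 1:])

x = "abeacadabea$"     # '$' sorts before the letters in ASCII
print(suffix_array_naive(x))
# [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
```

The output matches the SA shown above; the efficient SACAs surveyed here exist precisely because this direct approach is too slow on large strings.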
Often used in onjuntion with SA
x
is the
lp array
lp[1
::n
℄: for every
j
2
2
::n
,
lp[
j
is just the
longest ommon prex
of suÆxes SA[
j
1℄ and SA [
j
℄. In our
example:
        1  2  3  4  5  6  7  8  9 10 11 12
 x   =  a  b  e  a  c  a  d  a  b  e  a  $
 SA  = 12 11  8  1  4  6  9  2  5  7 10  3
 lcp =     0  1  4  1  1  0  3  0  0  0  2
Thus the longest ommon prex of suÆxes 11 and 8 is 1, that of suÆxes 8 and 1
is 4. Sine lp an b e omputed in linear time from SA
x
[KLAAP01, M04℄, also as a
bypro dut of some of the SACAs disussed b elow, we do not onsider its onstrution
further in this pap er. However, the
average lp
| that is, the average
lp of the
n
1 integers in the lp array | is as we shall see a useful indiator of the relative
eÆieny of ertain SACAs, notably Algorithm S.
We remark that b oth SA and lp an be omputed in linear time by a preorder
traversal of a suÆx tree.
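The linear-time computation of lcp from SA can be sketched in Python in the manner of [KLAAP01]: suffixes are visited in left-to-right text order, and the matched length h never decreases by more than one between consecutive suffixes, so total work is O(n). (This is an illustrative sketch, not the paper's own code; indices are 1-based as in the text.)

```python
def lcp_array(x, SA):
    """lcp[j] = length of the longest common prefix of suffixes
    SA[j-1] and SA[j], for j in 2..n; entries 0 and 1 are unused.
    Runs in O(n) time, in the style of Kasai et al. [KLAAP01]."""
    n = len(x)
    rank = [0] * (n + 1)            # rank[i] = position of suffix i in SA
    for j, i in enumerate(SA, 1):
        rank[i] = j
    lcp = [0] * (n + 1)
    h = 0                           # carried match length
    for i in range(1, n + 1):       # suffixes in text order
        if rank[i] > 1:
            k = SA[rank[i] - 2]     # suffix preceding i in SA
            while i + h <= n and k + h <= n and x[i + h - 1] == x[k + h - 1]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1              # lcp can shrink by at most 1
        else:
            h = 0
    return lcp

x = "abeacadabea$"
SA = [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(lcp_array(x, SA)[2:])
# [0, 1, 4, 1, 1, 0, 3, 0, 0, 0, 2]
```

The printed values reproduce the lcp row of the table above.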
Many of the SACAs also make use of the inverse suffix array, written ISA_x or ISA: an array ISA[1..n] in which

    ISA[i] = j  <=>  SA[j] = i.

ISA[i] = j therefore says that suffix i has rank j in lexorder. Continuing our example:
        1  2  3  4  5  6  7  8  9 10 11 12
 x   =  a  b  e  a  c  a  d  a  b  e  a  $
 ISA =  4  8 12  5  9  6 10  3  7 11  2  1
Thus ISA tells us that suffix 1 has rank 4 in lexorder, suffix 2 rank 8, and so on. Note that ISA is also a permutation of 1..n, and so SA and ISA are computable, one from the other, in Θ(n) time:

    for j <- 1 to n do
        SA[ISA[j]] <- j
As shown in Figure 1, this computation can, if required, also be done in place.
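With an extra array the conversion is a single pass, as the loop above shows; a Python sketch (illustrative, not from the paper, with 1-based values held in a 0-based list):

```python
def inverse(SA):
    """Return ISA such that ISA[i] = j  <=>  SA[j] = i (1-based values)."""
    ISA = [0] * (len(SA) + 1)       # slot 0 unused
    for j, i in enumerate(SA, 1):
        ISA[i] = j
    return ISA[1:]

SA = [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(inverse(SA))
# [4, 8, 12, 5, 9, 6, 10, 3, 7, 11, 2, 1]
```

Because inversion of a permutation is symmetric, the same routine applied to ISA recovers SA.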
Many of the algorithms we shall be describing depend upon a partial sort of some or all of the suffixes of x, partial because it is based on an ordering of the prefixes of these suffixes that are of length h ≥ 1. We refer to this partial ordering as an h-ordering of suffixes into h-order, and to the process itself as an h-sort. If two or more suffixes are equal under h-order, we say that they have the same h-rank and therefore fall into the same h-group; they are accordingly said to be h-equal. Usually an h-sort is stable, so that any previous ordering of the suffixes is retained within each h-group.
    for j <- 1 to n do
        i <- SA[j]
        -- negative entries have already been processed
        if i > 0 then
            j' <- j
            repeat
                temp <- SA[i]; SA[i] <- -j'
                j' <- i; i <- temp
            until i = j
            SA[i] <- j'
        else
            SA[j] <- -i

    Figure 1: Algorithm for computing ISA from SA in place
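A Python rendering of Figure 1's cycle-walking scheme may make it easier to follow (a sketch, with the 1-based indices of the pseudocode mapped onto Python's 0-based lists; negated entries mark positions whose ISA value is already known):

```python
def invert_in_place(SA):
    """Convert SA to ISA in place by walking each permutation cycle,
    negating entries to mark them as processed (cf. Figure 1)."""
    n = len(SA)
    for j in range(1, n + 1):
        i = SA[j - 1]
        if i > 0:                   # start of an unprocessed cycle
            jp = j
            while True:
                temp = SA[i - 1]
                SA[i - 1] = -jp     # record i's rank, negated as a marker
                jp, i = i, temp
                if i == j:          # cycle closed
                    break
            SA[i - 1] = jp          # cycle start gets its final (positive) rank
        else:
            SA[j - 1] = -i          # flip an earlier marker back to positive

SA = [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
invert_in_place(SA)
print(SA)
# [4, 8, 12, 5, 9, 6, 10, 3, 7, 11, 2, 1]
```

Each cycle start is the smallest index of its cycle, so it can safely receive its final positive value at once; every other cycle member is revisited later by the outer loop and un-negated there.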
The results of an h-sort are often stored in an approximate suffix array, written SA_h, and/or an approximate inverse suffix array, written ISA_h. Here is the result of a 1-sort on all the suffixes of our example string:
          1  2  3  4  5  6  7  8  9 10 11 12
 x     =  a  b  e  a  c  a  d  a  b  e  a  $
 SA_1  = 12 (1  4  6  8 11) (2  9)  5  7 (3 10)
 ISA_1 =  2  7 11  2  9  2 10  2  7 11  2  1
     or   6  8 12  6  9  6 10  6  8 12  6  1
     or   2  3  6  2  4  2  5  2  3  6  2  1
The parentheses in SA_1 enclose 1-groups not yet reduced to a single entry, thus not yet in final sorted order. Note that SA_h retains the property of being a permutation of 1..n, while ISA_h may not. Depending on the requirements of the particular algorithm, ISA_h may as shown express the h-rank of each h-group in various ways:

- the leftmost position j in SA_h of a member of the h-group, also called the head of the h-group;
- the rightmost position j in SA_h of a member of the h-group, also called the tail of the h-group;
- the ordinal left-to-right counter of the h-group in SA_h.
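These conventions can be made concrete with a small Python sketch (illustrative, not from the paper) that performs an h-sort and reports ISA_h in the head convention:

```python
def h_sort(x, h):
    """Stable h-sort of all suffixes: order them by their length-h
    prefixes. Returns SA_h and ISA_h, the latter expressing each
    h-rank as the head (leftmost SA_h position) of its h-group."""
    n = len(x)

    def pref(i):                    # length-h prefix of suffix i (1-based)
        return x[i - 1:i - 1 + h]

    SA_h = sorted(range(1, n + 1), key=pref)   # Python's sort is stable
    ISA_h = [0] * (n + 1)
    head = 1
    for j in range(1, n + 1):
        if j > 1 and pref(SA_h[j - 1]) != pref(SA_h[j - 2]):
            head = j                # a new h-group starts at position j
        ISA_h[SA_h[j - 1]] = head
    return SA_h, ISA_h[1:]

SA1, ISA1 = h_sort("abeacadabea$", 1)
print(SA1)    # [12, 1, 4, 6, 8, 11, 2, 9, 5, 7, 3, 10]
print(ISA1)   # [2, 7, 11, 2, 9, 2, 10, 2, 7, 11, 2, 1]
```

The printed ISA_1 matches the first (head) row of the table above; the tail and ordinal conventions differ only in what value is recorded per group.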
Compare the result of a 3-sort:
          1  2  3  4  5  6  7  8  9 10 11 12
 x     =  a  b  e  a  c  a  d  a  b  e  a  $
 SA_3  = 12 11 (1  8)  4  6 (2  9)  5  7 10  3
 ISA_3 =  3  7 12  5  9  6 10  3  7 11  2  1
     or   4  8 12  5  9  6 10  4  8 11  2  1
     or   3  6 10  4  7  5  8  3  6  9  2  1
Observe that an (h+1)-sort is a refinement of an h-sort: all members of an (h+1)-group belong to a single h-group.
We now have available a vocabulary sufficient to characterize the main species of SACA as follows.
(1) Prex-Doubling
First a fast 1-sort is p erformed (sine is indexed, buket sort an be used);
this yields SA
1
/
I S A
1
. Then for every
h
= 1
;
2
; : : :
, SA
2
h
/
I S A
2
h
are omputed
in (
n
) time from SA
h
/
I S A
h
until every 2
h
-group is a singleton. The time
required is therefore
O
(
n
log
n
). There are two algorithms in this lass: MM
[MM90, MM93℄ and LS [S98, LS99℄.
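The doubling step rests on the observation that the 2h-order of suffix i is determined by the pair of h-ranks (ISA_h[i], ISA_h[i+h]). A Python sketch of the idea (an illustration only, not MM or LS themselves; using a comparison sort per round gives O(n log^2 n), whereas MM and LS reach O(n log n) with radix and ternary-partition techniques):

```python
def suffix_array_doubling(x):
    """Prefix-doubling sketch: 1-sort, then repeatedly 2h-sort by the
    pair (h-rank of i, h-rank of i+h) until all groups are singletons."""
    n = len(x)
    rank = [ord(c) for c in x]          # order-equivalent to 1-sort ranks
    sa = list(range(n))                 # 0-based positions internally
    h = 1
    while True:
        # 2h-order key: h-rank of suffix i, then h-rank of suffix i+h
        key = lambda i: (rank[i], rank[i + h] if i + h < n else -1)
        sa.sort(key=key)
        new = [0] * n                   # renumber the 2h-groups
        for j in range(1, n):
            new[sa[j]] = new[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new
        if rank[sa[-1]] == n - 1:       # every 2h-group is a singleton
            break
        h *= 2
    return [i + 1 for i in sa]          # report 1-based positions

print(suffix_array_doubling("abeacadabea$"))
# [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
```

Since h doubles each round and each round sharpens the ordering, at most ceil(log2 n) rounds are needed.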
(2) Reursive
Form strings
x
0
and
y
from
x
, then show that if SA
x
0
is omputed, therefore
SA
y
and nally SA
x
an b e omputed in
O
(
n
) time. Hene the problem of
omputing SA
x
0
reursively replaes the omputation of SA
x
. Sine
j
x
0
j
is
always hosen so as to b e less than 2
j
x
j
=
3, the overall time requirement of these
algorithms is (
n
). There are three main algorithms in this lass: KA [KA03℄,
KS [KS03℄ and KJP [KJP04℄.
(3) Indued Copying
The key insight here is the same as for the reursive algorithms | a omplete sort
of a seleted subset of suÆxes an b e used to \indue" a omplete sort of other
subsets of suÆxes. The approah however is nonreursive: an eÆient suÆx
sorting tehnique (for example, [BM93, MBM93, M97, BS97, SZ04℄) is invoked
for the seleted subset of suÆxes. The general idea seems to have b een rst
prop osed by Burrows & Wheeler [BW94℄, but it has b een implemented in quite
dierent ways [IT99, S00, MF04, SS05, BK03, M05℄. In general, these metho ds
are very eÆient in pratie, but may have worst-ase asymptoti omplexity
as high as
O
(
n
2
log
n
).
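A toy Python illustration of induction in the spirit of this family (a sketch under stated assumptions, not a transcription of any specific algorithm above): directly sort only the suffixes that are lexicographically smaller than their right neighbour, park them at the tails of their first-letter buckets, then induce every remaining suffix with a single left-to-right scan.

```python
def suffix_array_induced(x):
    """Induced-copying sketch. x must end with a unique sentinel
    smaller than every other letter. Returns 1-based positions."""
    n = len(x)
    # classify each suffix: "smaller" (S) vs "larger" (L) than its neighbour
    stype = [False] * n
    stype[n - 1] = True                      # sentinel suffix is S-type
    for i in range(n - 2, -1, -1):
        stype[i] = x[i] < x[i + 1] or (x[i] == x[i + 1] and stype[i + 1])
    # first-letter bucket boundaries
    count = {}
    for c in x:
        count[c] = count.get(c, 0) + 1
    head, tail, pos = {}, {}, 0
    for c in sorted(count):
        head[c] = pos
        pos += count[c]
        tail[c] = pos - 1
    SA = [-1] * n
    # 1. fully sort the selected subset (naively here) and place the
    #    S-type suffixes at the tails of their buckets
    for i in sorted((i for i in range(n) if stype[i]),
                    key=lambda i: x[i:], reverse=True):
        SA[tail[x[i]]] = i
        tail[x[i]] -= 1
    # 2. induce the L-type suffixes in one left-to-right scan: when
    #    suffix p is reached in final order, p-1 (if L-type) is the
    #    smallest not-yet-placed suffix of its bucket
    for j in range(n):
        p = SA[j]
        if p > 0 and not stype[p - 1]:
            SA[head[x[p - 1]]] = p - 1
            head[x[p - 1]] += 1
    return [i + 1 for i in SA]

print(suffix_array_induced("abeacadabea$"))
# [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
```

The practical algorithms in this class differ chiefly in how the selected subset is chosen and sorted; the induction scan itself is linear, so the subset sort dominates the running time.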
The goal is to design SACAs that

- have minimal asymptotic complexity Θ(n);
- are fast "in practice" (that is, on collections of large real-world data sets such as [H04]);
- are lightweight; that is, use a small amount of working storage in addition to the 5n bytes required by x and SA_x.

To date none of the SACAs that has been proposed achieves all of these objectives. Figure 2 presents our taxonomy of the fourteen species of SACA that have been recognized so far; Table 1 summarizes their time and space requirements.