A Taxonomy of Suffix Array Construction Algorithms

Simon J. Puglisi¹, W. F. Smyth¹,², and Andrew Turpin³

¹ Department of Computing, Curtin University, GPO Box U1987
Perth WA 6845, Australia
e-mail: puglissj@computing.edu.au

² Algorithms Research Group, Department of Computing & Software
McMaster University, Hamilton ON L8S 4K1, Canada
e-mail: smyth@mcmaster.ca
www.cas.mcmaster.ca/cas/research/groups.shtml

³ School of Computer Science & Information Technology
RMIT University, GPO Box 2476V
Melbourne VIC 3001, Australia
e-mail: aht@cs.rmit.edu.au
Abstract. In 1990 Manber & Myers proposed suffix arrays as a space-saving
alternative to suffix trees and described the first algorithms for suffix array
construction and use. Since that time, and especially in the last few years, suffix
array construction algorithms have proliferated in bewildering abundance.
This survey paper attempts to provide simple high-level descriptions of these
numerous algorithms that highlight both their distinctive features and their
commonalities, while avoiding as much as possible the complexities of
implementation details. We also provide comparisons of the algorithms' worst-case
time complexity and use of additional space, together with results of recent
experimental test runs on many of their implementations.
1 Introduction
Suffix arrays were introduced in 1990 by Manber & Myers [MM90, MM93], along
with algorithms for their construction and use as a space-saving alternative to suffix
trees. In the intervening fifteen years there have certainly been hundreds of research
articles published on the construction and use of suffix trees and their variants. Over
that period, it has been shown that

• practical space-efficient suffix array construction algorithms (SACAs) exist that
  require worst-case time linear in string length [KA03, KS03];
• SACAs exist that are even faster in practice, though with supralinear worst-case
  construction time requirements [LS99, BK03, MF04, M05];
Supported in part by grants from the Natural Sciences & Engineering Research Council of
Canada and the Australian Research Council.
Proceedings of the Prague Stringology Conference '05
• any problem whose solution can be computed using suffix trees is solvable with
  the same asymptotic complexity using suffix arrays [AKO04].

Thus suffix arrays have become the data structure of choice for many, if not all, of
the string processing problems to which suffix tree methodology is applicable.
In this survey paper we do not attempt to cover the entire suffix array literature.
Our more modest goal is to provide an overview of SACAs, in particular those modeled
on the efficient use of main memory; we exclude the substantial literature (for
example, [CF02]) that discusses strategies based on the use of secondary storage.
Further, we deal with the construction of compressed ("succinct") suffix arrays only
insofar as they relate to standard SACAs. For example, algorithms such as those of
Grossi et al. and references therein [GGV04] are not covered.
Section 2 provides an overview of the SACAs known to us, organized into a
"taxonomy" based primarily on the methodology used. As with all classification schemes,
there is room for argument: there are many cross-connections between algorithms
that occur in disjoint subtrees of the taxonomy, just as there may be between species
in a biological taxonomy. Our aim is to provide as comprehensive and, at the same
time, as accessible a description of SACAs as we can.

Also in Section 2 we present the vocabulary to be used for the structured
description of each of the algorithms that will be given in Section 3. Then in Section 4, we
report the results of experimental tests on many of the algorithms described and
so draw conclusions about their relative speed and space-efficiency.
2 Overview
We consider throughout a finite nonempty string x = x[1..n] of length n ≥ 1,
defined on an indexed alphabet Σ; that is,

• the letters λ_j, j = 1, 2, ..., σ = |Σ|, are ordered: λ_1 < λ_2 < ... < λ_σ;
• an array A[λ_1..λ_σ] can be defined in which, for every j ∈ 1..σ, A[λ_j] is accessible
  in constant time;
• σ ∈ O(n).
Essentially, we assume that Σ can be treated as a sequence of integers whose range is
not too large. Typically, the λ_j may be represented by ASCII codes 0..255 (English
alphabet) or binary integers 00..11 (DNA) or simply bits, as the case may be. We
shall generally assume that a letter can be stored in a byte and that n can be stored
in one computer word (four bytes).

The use of terminology not defined here follows [S03].
We are interested in computing the suffix array of x, which we write SA_x or
just SA; that is, an array SA[1..n] in which SA[j] = i iff x[i..n] is the jth suffix of
x in (ascending) lexicographical order (lexorder). For simplicity we will frequently
refer to x[i..n] simply as "suffix i"; also, it will often be convenient for processing to
incorporate into x at position n an ending sentinel $ assumed to be less than any λ_j.
Then, for example, on alphabet Σ = {$, a, b, c, d, e}:
        1  2  3  4  5  6  7  8  9 10 11 12
   x =  a  b  e  a  c  a  d  a  b  e  a  $
  SA = 12 11  8  1  4  6  9  2  5  7 10  3
Thus SA tells us that x[12..12] = $ is the least suffix, x[11..12] = a$ the second least,
and so on (alphabetical ordering of the letters assumed). Note that SA is always a
permutation of 1..n.
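As an illustration only (this is our own sketch, not one of the algorithms surveyed), the definition above can be transcribed directly into a brute-force construction; the function name and the 1-based array convention are ours:

```python
def naive_suffix_array(x):
    """Brute-force SA construction, for illustration only.

    Sorts all n suffixes directly, so the worst case is O(n^2 log n),
    far slower than the SACAs surveyed here. Returns a 1-based array
    (slot 0 unused), so that SA[j] = i iff x[i..n] is the j-th suffix
    of x in lexorder.
    """
    n = len(x)
    # ASCII '$' precedes all letters, matching the sentinel convention
    return [0] + sorted(range(1, n + 1), key=lambda i: x[i - 1:])
```

On the example string this reproduces the SA shown above.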
Often used in conjunction with SA_x is the lcp array lcp[1..n]: for every j ∈ 2..n,
lcp[j] is just the length of the longest common prefix of suffixes SA[j−1] and SA[j].
In our example:

        1  2  3  4  5  6  7  8  9 10 11 12
   x =  a  b  e  a  c  a  d  a  b  e  a  $
  SA = 12 11  8  1  4  6  9  2  5  7 10  3
 lcp =     0  1  4  1  1  0  3  0  0  0  2

Thus the longest common prefix of suffixes 11 and 8 is 1, that of suffixes 8 and 1
is 4. Since lcp can be computed in linear time from SA_x [KLAAP01, M04], also as a
byproduct of some of the SACAs discussed below, we do not consider its construction
further in this paper. However, the average lcp (that is, the average of the n−1
integers in the lcp array) is as we shall see a useful indicator of the relative
efficiency of certain SACAs, notably Algorithm S.

We remark that both SA and lcp can be computed in linear time by a preorder
traversal of a suffix tree.
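For concreteness, the linear-time lcp computation of [KLAAP01] can be sketched as follows (our own 1-based transcription, not code from that paper). It visits suffixes in text order, exploiting the fact that the matched prefix length can fall by at most 1 from suffix i to suffix i+1:

```python
def lcp_array(x, sa):
    """Linear-time lcp from SA, in the style of [KLAAP01] (a sketch).

    Arrays are 1-based with an unused slot 0, as in the paper:
    lcp[j] = length of the longest common prefix of suffixes
    SA[j-1] and SA[j], for j in 2..n.
    """
    n = len(x)
    isa = [0] * (n + 1)
    for j in range(1, n + 1):
        isa[sa[j]] = j
    lcp = [0] * (n + 1)
    h = 0
    for i in range(1, n + 1):        # suffixes in text order
        if isa[i] > 1:
            k = sa[isa[i] - 1]       # lexicographic predecessor of suffix i
            while i + h <= n and k + h <= n and x[i + h - 1] == x[k + h - 1]:
                h += 1
            lcp[isa[i]] = h
            if h > 0:
                h -= 1               # h can shrink by at most 1 per step
        else:
            h = 0                    # suffix i is the least; restart
    return lcp
```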
Many of the SACAs also make use of the inverse suffix array, written ISA_x
or ISA: an array ISA[1..n] in which

    ISA[i] = j  ⟺  SA[j] = i.

ISA[i] = j therefore says that suffix i has rank j in lexorder. Continuing our example:

        1  2  3  4  5  6  7  8  9 10 11 12
   x =  a  b  e  a  c  a  d  a  b  e  a  $
 ISA =  4  8 12  5  9  6 10  3  7 11  2  1

Thus ISA tells us that suffix 1 has rank 4 in lexorder, suffix 2 rank 8, and so on. Note
that ISA is also a permutation of 1..n, and so SA and ISA are computable, one from
the other, in Θ(n) time:

    for j ← 1 to n do
        SA[ISA[j]] ← j

As shown in Figure 1, this computation can if required also be done in place.
Many of the algorithms we shall be describing depend upon a partial sort of some
or all of the suffixes of x, partial because it is based on an ordering of the prefixes
of these suffixes that are of length h ≥ 1. We refer to this partial ordering as an
h-ordering of suffixes into h-order, and to the process itself as an h-sort. If two
or more suffixes are equal under h-order, we say that they have the same h-rank
and therefore fall into the same h-group; they are accordingly said to be h-equal.
Usually an h-sort is stable, so that any previous ordering of the suffixes is retained
within each h-group.
    for j ← 1 to n do
        i ← SA[j]
        -- negative entries already processed
        if i > 0 then
            j′ ← j
            repeat
                temp ← SA[i]; SA[i] ← −j′
                j′ ← i; i ← temp
            until i = j
            SA[i] ← j′
        else SA[j] ← −i

Figure 1: Algorithm for computing ISA from SA in place
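A direct transcription of Figure 1 into Python (our own sketch; a 1-based array with an unused slot 0, as in the pseudocode) is:

```python
def isa_in_place(sa):
    """Convert a suffix array to its inverse in place, as in Figure 1.

    `sa` is 1-based: sa[0] is unused and sa[1..n] is a permutation of
    1..n. Negated entries mark positions already holding their final
    (inverse) value; the sign is restored when the for loop reaches them.
    """
    n = len(sa) - 1
    for j in range(1, n + 1):
        i = sa[j]
        if i > 0:                # negative entries already processed
            jp = j
            while True:          # follow the permutation cycle from j
                temp = sa[i]
                sa[i] = -jp
                jp, i = i, temp
                if i == j:
                    break
            sa[i] = jp           # cycle start: store positive, no mark needed
        else:
            sa[j] = -i           # entry already holds its ISA value
```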
The results of an h-sort are often stored in an approximate suffix array, written
SA_h, and/or an approximate inverse suffix array, written ISA_h. Here is the result of
a 1-sort on all the suffixes of our example string:

         1  2  3  4  5  6  7  8  9 10 11 12
    x =  a  b  e  a  c  a  d  a  b  e  a  $
 SA_1 = 12 (1  4  6  8 11) (2  9) 5  7 (3 10)
ISA_1 =  2  7 11  2  9  2 10  2  7 11  2  1
    or   6  8 12  6  9  6 10  6  8 12  6  1
    or   2  3  6  2  4  2  5  2  3  6  2  1

The parentheses in SA_1 enclose 1-groups not yet reduced to a single entry, thus not
yet in final sorted order. Note that SA_h retains the property of being a permutation of
1..n, while ISA_h may not. Depending on the requirements of the particular algorithm,
ISA_h may as shown express the h-rank of each h-group in various ways:
• the leftmost position j in SA_h of a member of the h-group, also called the head
  of the h-group;
• the rightmost position j in SA_h of a member of the h-group, also called the tail
  of the h-group;
• the ordinal left-to-right counter of the h-group in SA_h.
Compare the result of a 3-sort:

         1  2  3  4  5  6  7  8  9 10 11 12
    x =  a  b  e  a  c  a  d  a  b  e  a  $
 SA_3 = 12 11 (1  8) 4  6 (2  9) 5  7 10  3
ISA_3 =  3  7 12  5  9  6 10  3  7 11  2  1
    or   4  8 12  5  9  6 10  4  8 11  2  1
    or   3  6 10  4  7  5  8  3  6  9  2  1

Observe that an (h+1)-sort is a refinement of an h-sort: all members of an (h+1)-group
belong to a single h-group.
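The h-sort vocabulary can be made concrete with a short sketch (ours, not the paper's; it uses a comparison sort for simplicity, whereas real SACAs h-sort much more efficiently). It returns SA_h together with the head form of ISA_h:

```python
def h_sort(x, h):
    """Stable h-sort of all suffixes of x[1..n] (illustrative sketch only).

    Returns (SA_h, ISA_h) as 1-based arrays (slot 0 unused), with ISA_h
    expressing each h-rank as the head, i.e. the leftmost SA_h position,
    of its h-group.
    """
    n = len(x)
    prefix = lambda i: x[i - 1:i - 1 + h]   # length-h prefix of suffix i
    # sorted() is stable, so any previous ordering within a group survives
    sa_h = [0] + sorted(range(1, n + 1), key=prefix)
    isa_h = [0] * (n + 1)
    head = 1
    for j in range(1, n + 1):
        if j > 1 and prefix(sa_h[j]) != prefix(sa_h[j - 1]):
            head = j                        # a new h-group starts here
        isa_h[sa_h[j]] = head
    return sa_h, isa_h
```

On the example string, h = 1 and h = 3 reproduce the tables above (with the head form of ISA_h).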
We now have available a vocabulary sufficient to characterize the main species of
SACA as follows.

(1) Prefix-Doubling
First a fast 1-sort is performed (since Σ is indexed, bucket sort can be used);
this yields SA_1/ISA_1. Then for every h = 1, 2, ..., SA_2h/ISA_2h are computed
in Θ(n) time from SA_h/ISA_h until every 2h-group is a singleton. The time
required is therefore O(n log n). There are two algorithms in this class: MM
[MM90, MM93] and LS [S98, LS99].
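A minimal prefix-doubling sketch (ours; it is neither MM nor LS, and it uses a comparison sort in each round instead of the Θ(n) bucket/radix step, so it does not achieve the O(n log n) bound):

```python
def prefix_doubling_sa(x):
    """Prefix-doubling sketch (our simplification, not algorithm MM or LS).

    Works 0-based internally; returns a 1-based SA (slot 0 unused).
    Each round extends an h-order to a 2h-order by sorting suffixes on
    the pair (h-rank of suffix i, h-rank of suffix i+h).
    """
    n = len(x)
    rank = [ord(c) for c in x]               # h-ranks after a 1-sort
    sa = sorted(range(n), key=lambda i: rank[i])
    h = 1
    while True:
        # suffixes shorter than h+1 rank lowest, like the $ sentinel
        key = lambda i: (rank[i], rank[i + h] if i + h < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:            # every 2h-group is a singleton
            break
        h *= 2
    return [0] + [i + 1 for i in sa]
```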
(2) Recursive
Form strings x′ and y from x, then show that once SA_x′ is computed, first
SA_y and finally SA_x can be computed in O(n) time. Hence the problem of
computing SA_x′ recursively replaces the computation of SA_x. Since |x′| is
always chosen so as to be less than 2|x|/3, the overall time requirement of these
algorithms is Θ(n). There are three main algorithms in this class: KA [KA03],
KS [KS03] and KJP [KJP04].
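The Θ(n) bound follows from the geometric shrinkage of the subproblem: if each recursion level does at most cn work and |x′| < 2|x|/3, then (a standard unrolling, not spelled out here in the survey)

```latex
T(n) \;\le\; T\!\left(\tfrac{2n}{3}\right) + cn
     \;\le\; cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     \;=\; 3cn \;\in\; O(n),
```

and since any SACA must at least read all of x, T(n) ∈ Θ(n).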
(3) Induced Copying
The key insight here is the same as for the recursive algorithms: a complete sort
of a selected subset of suffixes can be used to "induce" a complete sort of other
subsets of suffixes. The approach however is nonrecursive: an efficient suffix
sorting technique (for example, [BM93, MBM93, M97, BS97, SZ04]) is invoked
for the selected subset of suffixes. The general idea seems to have been first
proposed by Burrows & Wheeler [BW94], but it has been implemented in quite
different ways [IT99, S00, MF04, SS05, BK03, M05]. In general, these methods
are very efficient in practice, but may have worst-case asymptotic complexity
as high as O(n^2 log n).
The goal is to design SACAs that

• have minimal asymptotic complexity Θ(n);
• are fast "in practice" (that is, on collections of large real-world data sets such
  as [H04]);
• are lightweight; that is, use a small amount of working storage in addition
  to the 5n bytes required by x and SA_x.

To date none of the SACAs that has been proposed achieves all of these objectives.
Figure 2 presents our taxonomy of the fourteen species of SACA that have been
recognized so far; Table 1 summarizes their time and space requirements.