SIAM J. COMPUT.
Vol. 35, No. 2, pp. 378–407
© 2005 Society for Industrial and Applied Mathematics
COMPRESSED SUFFIX ARRAYS AND SUFFIX TREES WITH
APPLICATIONS TO TEXT INDEXING AND STRING MATCHING
ROBERTO GROSSI AND JEFFREY SCOTT VITTER
Abstract. The proliferation of online text, such as found on the World Wide Web and in online
databases, motivates the need for space-efficient text indexing methods that support fast string
searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from
a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with
lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols,
with T being fully scanned only once, namely, when the index is created at preprocessing time.
The text indexing schemes published in the literature are greedy in terms of space usage: they
require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost
RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are
larger than the text itself by a multiplicative factor of Ω(lg_{|Σ|} n), which is significant when Σ is of
constant size, such as in ascii or unicode. On the other hand, these indexes support fast searching,
either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the
occ pattern occurrences.
We present a new text index that is based upon compressed representations of suffix arrays and
suffix trees. It achieves a fast O(m/lg_{|Σ|} n + lg^ε_{|Σ|} n) search time in the worst case, for any constant
0 < ε ≤ 1, using at most (ε^{-1} + O(1)) n lg |Σ| bits of storage. Our result thus presents for the first
time an efficient index whose size is provably linear in the size of the text in the worst case, and for
many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed
suffix array for a typical 100 MB ascii file can require 30–40 MB or less, while the raw suffix array
requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes.
Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive
cost, giving O(occ lg^ε_{|Σ|} n) time as a result. When the patterns are sufficiently long, we can use
auxiliary data structures in O(n lg |Σ|) bits to obtain a total search bound of O(m/lg_{|Σ|} n + occ)
time, which is optimal.
Key words. compression, text indexing, text retrieval, compressed data structures, suffix arrays,
suffix trees, string searching, pattern matching
AMS subject classifications. 68W05, 68Q25, 68P05, 68P10, 68P30
DOI. 10.1137/S0097539702402354
1. Introduction. A great deal of textual information is available in electronic
form in online databases and on the World Wide Web, and therefore devising effi-
cient text indexing methods to support fast string searching is an important topic for
investigation. A typical search scenario involves string matching in a text string T
of length n [49]: given an input pattern string P of length m, the goal is to find
occurrences of P in T . Each symbol in P and T belongs to a fixed alphabet Σ of
size |Σ| ≤ n. An occurrence of the pattern at position i means that the substring
T[i, i + m − 1] is equal to P, where T[i, j] denotes the concatenation of the symbols
Received by the editors February 9, 2002; accepted for publication (in revised form) May 2, 2005;
published electronically October 17, 2005. A preliminary version of these results appears in [37].
http://www.siam.org/journals/sicomp/35-2/40235.html
Dipartimento di Informatica, Universit`a di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
(grossi@di.unipi.it). This author’s work was supported in part by the United Nations Educational,
Scientific and Cultural Organization (UNESCO) and by the Italian Ministry of Research and Edu-
cation (MIUR).
Department of Computer Sciences, Purdue University, West Lafayette, IN 47907–2066 (jsv@
purdue.edu). Part of this work was done while the author was at Duke University and on sabbatical
at INRIA in Sophia Antipolis, France. It was supported in part by Army Research Office MURI
grants DAAH04–96–1–0013, DAAD19–01–1–0725, and DAAD19–03–1–0321 and by National Science
Foundation research grants CCR–9877133 and IIS–0415097.
in T at positions i, i + 1, ..., j.
In this paper, we consider three types of string matching queries: existential,
counting, and enumerative. An existential query returns a Boolean value that indi-
cates whether P is contained in T . A counting query computes the number occ of
occurrences of P in T, where occ ≤ n. An enumerative query outputs the list of occ
positions, where P occurs in T . Efficient offline string matching algorithms, such as
that of Knuth, Morris, and Pratt [49], can answer each individual query in O(m + n)
time via an efficient text scan.
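To make the three query types concrete, here is a small, purely illustrative sketch (ours, not the paper's) that answers all of them with one left-to-right scan of T; Python's built-in substring search stands in for the Knuth–Morris–Pratt algorithm of [49].

    def scan_queries(T: str, P: str):
        """Answer existential, counting, and enumerative queries by scanning T.
        Positions are reported 1-based, as in the paper."""
        positions = []                 # enumerative answer
        i = T.find(P)                  # built-in substring search
        while i != -1:
            positions.append(i + 1)    # 0-based index -> 1-based position
            i = T.find(P, i + 1)       # keep scanning; overlapping matches allowed
        # existential,  counting,        enumerative
        return bool(positions), len(positions), positions

    # Example on the 32-symbol text T used in Figures 1 and 2 below:
    T = "abbabbabbabbabaaabababbabbbabba#"
    print(scan_queries(T, "abba"))     # (True, 6, [1, 4, 7, 10, 21, 28])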
The large mass of existing online text documents makes it infeasible to scan
through all the documents for every query, because n is typically much larger than
the pattern length m and the number of occurrences occ. In this scenario, text indexes
are preferable, as they are especially efficient when several string searches are to be
performed on the same set of text documents. The text T needs to be entirely scanned
only once, namely, at preprocessing time when the indexes are created. After that,
searching is output-sensitive, that is, the time complexity of each online query is
proportional to either O(m lg |Σ| + occ) or O(m + lg n + occ), which is much less than
Θ(m + n) when n is sufficiently large.
The most popular indexes currently in use are inverted lists and signature files [48].
Inverted lists are theoretically and practically superior to signature files [72]. Their
versatility allows for several kinds of queries (exact, Boolean, ranked, and so on) whose
answers have a variety of output formats. They are efficient indexes for texts that
are structured as long sequences of terms (or words) in which T is partitioned into
nonoverlapping substrings T[i_k, j_k] (the terms), where 1 ≤ i_k ≤ j_k < i_{k+1} ≤ n. We
refer to the set of terms as the vocabulary. For each distinct term in the vocabulary,
the index maintains the inverted list (or position list) {i_k} of the occurrences of that
term in T. As a result, in order to search efficiently, queries must be limited
to terms or their prefixes; the index does not allow efficient searching of arbitrary
substrings of the text as in the string matching problem. For this reason, inverted
files are sometimes referred to as term-level or word-level text indexes.
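For comparison with the full-text indexes discussed below, a term-level inverted index of the kind just described can be sketched in a few lines; the whitespace tokenization used here is an assumption made only for this illustration.

    from collections import defaultdict

    def build_inverted_index(T: str):
        """Map each distinct term to its inverted list {i_k} of 1-based starting
        positions, assuming terms are the whitespace-separated words of T."""
        index = defaultdict(list)
        pos = 1                        # 1-based starting position of the current term
        for term in T.split(" "):
            index[term].append(pos)
            pos += len(term) + 1       # skip the term and the single separating space
        return index

    index = build_inverted_index("to be or not to be")
    print(index["to"])                 # [1, 14]: the inverted list of the term "to"
    print(index.get("o b", []))        # []: arbitrary substrings are not indexed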
Searching unstructured text to answer string matching queries adds a new diffi-
culty to text indexing. This case arises with DNA sequences and in some Eastern
languages (Burmese, Chinese, Taiwanese, Tibetan, etc.), which do not have a well-
defined notion of terms. The set of successful search keys is possibly much larger than
the set of terms in structured texts, because it consists of all feasible substrings of T;
that is, we can have as many as (n choose 2) = Θ(n²) distinct substrings in the worst case,
while the number of distinct terms is at most n (considered as nonoverlapping sub-
strings). Suffix arrays [55, 35], suffix trees [57, 68], and similar tries or automata [20]
are among the prominent data structures used for unstructured texts. Since they can
handle all the search keys in O(n) memory words, they are sometimes referred to as
full-text indexes.
The suffix tree for text T = T [1,n] is a compact trie whose n leaves represent the
n text suffixes T [1,n], T [2,n], ... ,T [n, n]. By “compact” we mean that each internal
node has at least two children. Each edge in the tree is labeled with one or more
symbols for purposes of search navigation. The leaf with value ℓ represents the suffix
T[ℓ, n]. The leaf values in an in-order traversal of the tree represent the n suffixes
of T in lexicographic order. An example suffix tree appears in Figure 1.
A suffix array SA = SA[1,n] for text T = T [1,n] consists of the values of the
leaves of the suffix tree in in-order, but without the tree structure information. In
other words, SA[i] = ℓ means that T[ℓ, n] is the ith smallest suffix of T in lexicographic
order. The suffix array corresponding to the suffix tree of Figure 1 appears in Figure 2.
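For concreteness, SA can be produced (very inefficiently) by sorting the n suffixes outright; the sketch below is only meant to reproduce the ordering of Figure 2 and is unrelated to the compressed representations developed in this paper. The optional order argument lets us use the ordering a < # < b assumed in the figures.

    def naive_suffix_array(T: str, order=None):
        """Return SA[1..n] as a list of 1-based positions: SA[i] = l means that
        T[l, n] is the ith smallest suffix.  Sorting explicit suffix copies takes
        O(n^2 log n) time and O(n^2) space -- fine for an example only."""
        if order is None:
            key = lambda l: T[l - 1:]                      # natural character order
        else:
            key = lambda l: [order[c] for c in T[l - 1:]]  # custom alphabet order
        return sorted(range(1, len(T) + 1), key=key)

    T = "abbabbabbabbabaaabababbabbbabba#"
    SA = naive_suffix_array(T, order={"a": 0, "#": 1, "b": 2})  # a < # < b, as in Figure 2
    print(SA[:6])   # [15, 16, 31, 13, 17, 19], matching the first six rows of Figure 2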
Fig. 1. Suffix tree built on text T = abbabbabbabbabaaabababbabbbabba# of length n = 32,
where the last character is an end-of-string symbol #. The rightmost subtree (the triangle representing
the suffixes of the form bb···#) is not expanded in the figure. The edge label a···# or b···#
on the edge leading to the leaf with value ℓ denotes the remaining characters of the suffix T[ℓ, n]
that have not already been traversed. For example, the first suffix in lexicographic order is the
suffix T[15, n], namely, aaabababbabbbabba#, and the last edge represents the 16-symbol substring
that follows the prefix aa.
To speed up searches, a separate array is often maintained, which contains auxiliary
information such as the lengths of the longest common prefixes of a subset of the
suffixes [55].
Suffix trees and suffix arrays organize the suffixes so as to support the efficient
search of their prefixes. Given a search pattern P , in order to find an occurrence
T[i, i + m − 1] = P, we can exploit the property that P must be a prefix of suf-
fix T[i, n]. In general, existential and counting queries take O(m lg |Σ|) time using
automata or suffix trees and their variations, and they take O(m + lg n) time using
suffix arrays along with longest common prefixes. Enumerative queries take an ad-
ditive output-sensitive cost O(occ). In this paper, we use the term “suffix array” to
denote the array containing the permutation of positions, 1, 2,...,n, but without the
longest common prefix information mentioned above. Full-text indexes such as suffix
arrays are more powerful than term-level inverted lists, since full-text indexes can also
implement inverted lists efficiently by storing the suffixes T[i_k, n] that correspond to
the occurrences of the terms.
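The O(m lg n)-time search just described can be sketched as two binary searches that delimit the contiguous interval of SA whose suffixes have P as a prefix; the O(m + lg n) refinement, which also uses the longest-common-prefix information, is omitted from this illustration.

    def sa_search(T: str, SA, P: str):
        """Existential, counting, and enumerative queries on a plain suffix array.
        Each comparison inspects at most m symbols, so the search costs O(m lg n)."""
        n, m = len(SA), len(P)
        prefix = lambda i: T[SA[i] - 1 : SA[i] - 1 + m]    # first m symbols of the ith suffix
        lo, hi = 0, n
        while lo < hi:                     # find the first suffix whose prefix is >= P
            mid = (lo + hi) // 2
            if prefix(mid) < P:
                lo = mid + 1
            else:
                hi = mid
        start, hi = lo, n
        while lo < hi:                     # find the first suffix whose prefix is > P
            mid = (lo + hi) // 2
            if prefix(mid) <= P:
                lo = mid + 1
            else:
                hi = mid
        occ = lo - start
        return occ > 0, occ, sorted(SA[start:lo])          # existential, counting, enumerative

    T = "abbabbabbabbabaaabababbabbbabba#"
    SA = sorted(range(1, len(T) + 1), key=lambda l: T[l - 1:])   # naive SA, natural order
    print(sa_search(T, SA, "abba"))    # (True, 6, [1, 4, 7, 10, 21, 28])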
1.1. Issues on space efficiency. Suffix arrays and suffix trees are data struc-
tures with increasing importance because of the growing list of their applications.
Besides string searching, they also have significant use in molecular biology, data com-
pression, data mining, and text retrieval, to name but a few applications [7, 38, 55].
However, the sizes of the data sets in these applications can become extremely large,
and space occupancy is often a critical issue. A major disadvantage that limits the
applicability of text indexes based upon suffix arrays and suffix trees is that they
occupy significantly more space than do inverted lists.
 1  15  aaabababbabbbabba#
 2  16  aabababbabbbabba#
 3  31  a#
 4  13  abaaabababbabbbabba#
 5  17  abababbabbbabba#
 6  19  ababbabbbabba#
 7  28  abba#
 8  10  abbabaaabababbabbbabba#
 9   7  abbabbabaaabababbabbbabba#
10   4  abbabbabbabaaabababbabbbabba#
11   1  abbabbabbabbabaaabababbabbbabba#
12  21  abbabbbabba#
13  24  abbbabba#
14  32  #
15  14  baaabababbabbbabba#
16  30  ba#
17  12  babaaabababbabbbabba#
18  18  bababbabbbabba#
19  27  babba#
20   9  babbabaaabababbabbbabba#
21   6  babbabbabaaabababbabbbabba#
22   3  babbabbabbabaaabababbabbbabba#
23  20  babbabbbabba#
24  23  babbbabba#
...
32  25  bbbabba#
Fig. 2. Suffix array for the text T shown in Figure 1, where a < # < b. Note that the array
values correspond to the leaf values in the suffix tree in Figure 1 traversed in in-order.
We can illustrate this point by a more careful accounting of the space requirements
in the unit cost RAM model. We assume that each symbol in the text T is encoded by
lg |Σ| bits, for a total of n lg |Σ| bits.
1
In suffix arrays, the positions of the n suffixes
of T are stored as a permutation of 1, 2,...,n, using n lg n bits (kept in an array
consisting of n words, each of lg n bits). Suffix trees require considerably more space:
between 4n lg n and 5n lg n bits (stored in 4n–5n words) [55]. In contrast, inverted
lists require only approximately 10% of the text size [58], and thus suffix arrays and
suffix trees require significantly more bits. From a theoretical point of view, if the
alphabet is very large, namely, if lg |Σ| = Θ(lg n), then suffix arrays require roughly
the same number of bits as the text. However, in practice, the alphabet size |Σ| is
typically a fixed constant, such as |Σ| = 256 in electronic documents in ascii or larger
in unicode format, and |Σ| = 4 in DNA sequences. In such cases in practice, suffix
arrays and suffix trees are larger than the text by a significant multiplicative factor
of Θ(lg_{|Σ|} n) = Θ(lg n). For example, a DNA sequence of n symbols (with |Σ| = 4)
can be stored with 2n bits in a computer. The suffix array for the sequence requires
instead at least n words of 4 bytes each, or 32n bits, which is 16 times larger than
the text itself. On the other hand, we cannot resort to inverted files since they do not
support a general search on unstructured sequences.
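The blow-up factors quoted above are straightforward to recompute; the snippet below fixes the machine word at 4 bytes, as in the paper's example, and is merely a sanity check of that arithmetic.

    import math

    def index_vs_text_bits(n: int, sigma: int, word_bits: int = 32):
        """Bits for the packed text (lg|Sigma| per symbol) versus a plain suffix
        array that stores one machine word per suffix position."""
        text_bits = n * math.ceil(math.log2(sigma))
        sa_bits = n * word_bits
        return text_bits, sa_bits, sa_bits / text_bits

    print(index_vs_text_bits(10**6, 4))     # DNA:   (2000000, 32000000, 16.0)
    print(index_vs_text_bits(10**6, 256))   # ASCII: (8000000, 32000000, 4.0)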
¹In this paper, we use the notation lg^c_b n = (log_b n)^c = (log n / log b)^c to denote the cth power of
the base-b logarithm of n. If no base b is specified, the implied base is 2.
In [62], Munro, Raman, and Rao solve the open question raised by Muthukrishnan
by showing how to represent suffix trees in n lg n + O(n) bits while allowing O(m)-time
search of binary pattern strings of length m. This result highlights the conceptual
barrier of n lg n bits of storage needed for text indexing. In this paper, we go one step
further and investigate whether it is possible to design a full-text index in o(n lg n)
bits, while still supporting efficient search.
The question of space usage is important in both theory and practice. Prior to
our work, the state of the art has taken for granted that at least n lg n bits are needed
to represent the permutation of the text positions for any efficient full-text index. On
the other hand, if we note that each text of n symbols is in one-to-one correspondence
with a suffix array, then we can easily see by a simple information-theoretic argument
that Ω(n lg |Σ|) bits are required to represent the permutation. The argument is based
upon the fact that there are |Σ|^n different text strings of length n over the alphabet Σ;
hence, there are that many different suffix arrays, and we need Ω(n lg |Σ|) bits to
distinguish them from one another. It is therefore an interesting problem to close this
gap in order to see if there is an efficient representation of suffix arrays that uses nearly
n lg |Σ| + O(n) bits in the worst case, even for random strings.
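The counting argument can be summarized in one displayed line (a restatement of the reasoning above, not an additional claim):

    \[
      \#\{\text{texts of length } n \text{ over } \Sigma\} \;=\; |\Sigma|^{n}
      \quad\Longrightarrow\quad
      \text{bits needed} \;\ge\; \lg\!\left(|\Sigma|^{n}\right) \;=\; n \lg |\Sigma| \;=\; \Omega(n \lg |\Sigma|).
    \]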
In order to have an idea of the computational difficulty of the question, let us
follow a simple approach that saves space. Let us consider binary alphabets. We
bunch every lg n bits together into a word (in effect, constructing a large alphabet)
and create a text of length n/ lg n and a pattern of length m/ lg n. The suffix array
on the new text requires O((n/lg n) lg n) = O(n) bits. Searching for a pattern of
length m must also consider situations when the pattern is not aligned at the precise
word boundaries. What is the searching cost? It appears that we have to handle
lg n situations, with a slowdown factor of lg n in the time complexity of the search.
However, this is not really so; we actually have to pay a much larger slowdown factor
of Ω(n) in the search cost, which makes querying the text index more expensive than
running the O(m + n)-time algorithms from scratch, such as in [49]. To see why, let
us examine the situation in which the pattern occurs k positions to the right of a
word boundary in the text. In order to query the index, we have to align the pattern
with the boundary by padding k bits to the left of the pattern. Since we do not know
the correct k bits to prepend to the pattern, we must try all 2^k possible settings of
the k bits. When k ≈ lg n, we have to query the index 2^k = Θ(n) times in the worst
case. (See the sparse suffix trees [47] cited in section 1.3 to partially alleviate this
drawback.)
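The exponential blow-up is easy to reproduce: for a pattern that may start k bits past a word boundary, the packed index must be queried once for every choice of the k unknown bits. The sketch below (purely illustrative) simply enumerates those padded patterns.

    from itertools import product

    def padded_patterns(P: str, k: int):
        """All 2^k bit strings that might precede P up to the nearest word
        boundary; each one yields a separate query to the packed index."""
        return ["".join(bits) + P for bits in product("01", repeat=k)]

    P = "1011"                              # a binary pattern
    print(padded_patterns(P, 2))            # ['001011', '011011', '101011', '111011']
    print(len(padded_patterns(P, 16)))      # 65536 = 2^16 queries when k = lg n for n = 2^16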
The above example shows that a small reduction in the index size can make query-
ing the index useless in the worst case, as it can cost at least as much as performing
a full scan of the text from scratch. In section 1.3, we describe previous results moti-
vated by the need to find an efficient solution to the problem of designing a full-text
index that saves space and time in the worst case. No data structures with the func-
tionality of suffix trees and suffix arrays that have appeared in the literature to date
use Θ(n lg |Σ|) + o(n lg n) bits and support fast queries in o(m lg |Σ|) or o(m + lg n)
worst-case time. Our goal in this paper is to simultaneously reduce both the space
bound and the query time bound.
1.2. Our results. In this paper, we begin the study of the compressibility of
suffix arrays and related full-text indexes. We assume for simplicity that the alphabet
Σ is of bounded size (i.e., ascii or unicode/utf8). We recall that the suffix array SA
for text T stores the suffixes of T in lexicographic order, as shown in Figure 2. We
represent SA in the form of a permutation of the starting positions, 1, 2,...,n, of the

References (selected)
D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, Fast pattern matching in strings, SIAM J. Comput., 6 (1977), pp. 323–350.
P. Weiner, Linear pattern matching algorithms, in Proc. 14th Annual IEEE Symposium on Switching and Automata Theory, 1973, pp. 1–11.
U. Manber and G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22 (1993), pp. 935–948.