SIAM J. COMPUT.
Vol. 35, No. 2, pp. 378–407
© 2005 Society for Industrial and Applied Mathematics
COMPRESSED SUFFIX ARRAYS AND SUFFIX TREES WITH
APPLICATIONS TO TEXT INDEXING AND STRING MATCHING
ROBERTO GROSSI AND JEFFREY SCOTT VITTER
Abstract. The proliferation of online text, such as found on the World Wide Web and in online
databases, motivates the need for space-efficient text indexing methods that support fast string
searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from
a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with
lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols,
with T being fully scanned only once, namely, when the index is created at preprocessing time.
The text indexing schemes published in the literature are greedy in terms of space usage: they
require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost
RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are
larger than the text itself by a multiplicative factor of Ω(lg_{|Σ|} n), which is significant when Σ is of
constant size, such as in ascii or unicode. On the other hand, these indexes support fast searching,
either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the
occ pattern occurrences.
We present a new text index that is based upon compressed representations of suffix arrays and
suffix trees. It achieves a fast O(m/lg_{|Σ|} n + lg^ε_{|Σ|} n) search time in the worst case, for any constant
0 < ε ≤ 1, using at most (ε^{-1} + O(1)) n lg |Σ| bits of storage. Our result thus presents for the first
time an efficient index whose size is provably linear in the size of the text in the worst case, and for
many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed
suffix array for a typical 100 MB ascii file can require 30–40 MB or less, while the raw suffix array
requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes.
Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive
cost, giving O(occ lg^ε_{|Σ|} n) time as a result. When the patterns are sufficiently long, we can use
auxiliary data structures in O(n lg |Σ|) bits to obtain a total search bound of O(m/lg_{|Σ|} n + occ)
time, which is optimal.
Key words. compression, text indexing, text retrieval, compressed data structures, suffix arrays,
suffix trees, string searching, pattern matching
AMS subject classifications. 68W05, 68Q25, 68P05, 68P10, 68P30
DOI. 10.1137/S0097539702402354
1. Introduction. A great deal of textual information is available in electronic
form in online databases and on the World Wide Web, and therefore devising effi-
cient text indexing methods to support fast string searching is an important topic for
investigation. A typical search scenario involves string matching in a text string T
of length n [49]: given an input pattern string P of length m, the goal is to find
occurrences of P in T . Each symbol in P and T belongs to a fixed alphabet Σ of
size |Σ| ≤ n. An occurrence of the pattern at position i means that the substring
T[i, i + m − 1] is equal to P, where T[i, j] denotes the concatenation of the symbols
Received by the editors February 9, 2002; accepted for publication (in revised form) May 2, 2005;
published electronically October 17, 2005. A preliminary version of these results appears in [37].
http://www.siam.org/journals/sicomp/35-2/40235.html
Dipartimento di Informatica, Universit`a di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
(grossi@di.unipi.it). This author’s work was supported in part by the United Nations Educational,
Scientific and Cultural Organization (UNESCO) and by the Italian Ministry of Research and Edu-
cation (MIUR).
Department of Computer Sciences, Purdue University, West Lafayette, IN 47907–2066 (jsv@
purdue.edu). Part of this work was done while the author was at Duke University and on sabbatical
at INRIA in Sophia Antipolis, France. It was supported in part by Army Research Office MURI
grants DAAH04–96–1–0013, DAAD19–01–1–0725, and DAAD19–03–1–0321 and by National Science
Foundation research grants CCR–9877133 and IIS–0415097.
in T at positions i, i + 1, ..., j.
In this paper, we consider three types of string matching queries: existential,
counting, and enumerative. An existential query returns a Boolean value that indi-
cates whether P is contained in T . A counting query computes the number occ of
occurrences of P in T, where occ ≤ n. An enumerative query outputs the list of occ
positions, where P occurs in T . Efficient offline string matching algorithms, such as
that of Knuth, Morris, and Pratt [49], can answer each individual query in O(m + n)
time via an efficient text scan.
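To make the three query types concrete, here is a small, purely illustrative sketch (ours, not the paper's) that answers all of them with one left-to-right scan of T; Python's built-in substring search stands in for the Knuth–Morris–Pratt algorithm of [49].

    def scan_queries(T: str, P: str):
        """Answer existential, counting, and enumerative queries by scanning T.
        Positions are reported 1-based, as in the paper."""
        positions = []                 # enumerative answer
        i = T.find(P)                  # built-in substring search
        while i != -1:
            positions.append(i + 1)    # 0-based index -> 1-based position
            i = T.find(P, i + 1)       # keep scanning; overlapping matches allowed
        # existential,  counting,        enumerative
        return bool(positions), len(positions), positions

    # Example on the 32-symbol text T used in Figures 1 and 2 below:
    T = "abbabbabbabbabaaabababbabbbabba#"
    print(scan_queries(T, "abba"))     # (True, 6, [1, 4, 7, 10, 21, 28])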
The large mass of existing online text documents makes it infeasible to scan
through all the documents for every query, because n is typically much larger than
the pattern length m and the number of occurrences occ. In this scenario, text indexes
are preferable, as they are especially efficient when several string searches are to be
performed on the same set of text documents. The text T needs to be entirely scanned
only once, namely, at preprocessing time when the indexes are created. After that,
searching is output-sensitive, that is, the time complexity of each online query is
proportional to either O(m lg |Σ| + occ) or O(m + lg n + occ), which is much less than
Θ(m + n) when n is sufficiently large.
The most popular indexes currently in use are inverted lists and signature files [48].
Inverted lists are theoretically and practically superior to signature files [72]. Their
versatility allows for several kinds of queries (exact, Boolean, ranked, and so on) whose
answers have a variety of output formats. They are efficient indexes for texts that
are structured as long sequences of terms (or words) in which T is partitioned into
nonoverlapping substrings T[i_k, j_k] (the terms), where 1 ≤ i_k ≤ j_k < i_{k+1} ≤ n. We
refer to the set of terms as the vocabulary. For each distinct term in the vocabulary,
the index maintains the inverted list (or position list) {i_k} of the occurrences of that
term in T. As a result, in order to search efficiently, queries must be limited
to terms or their prefixes; the index does not allow efficient searching of arbitrary
substrings of the text as in the string matching problem. For this reason, inverted
files are sometimes referred to as term-level or word-level text indexes.
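For comparison with the full-text indexes discussed below, a term-level inverted index of the kind just described can be sketched in a few lines; the whitespace tokenization used here is an assumption made only for this illustration.

    from collections import defaultdict

    def build_inverted_index(T: str):
        """Map each distinct term to its inverted list {i_k} of 1-based starting
        positions, assuming terms are the whitespace-separated words of T."""
        index = defaultdict(list)
        pos = 1                        # 1-based starting position of the current term
        for term in T.split(" "):
            index[term].append(pos)
            pos += len(term) + 1       # skip the term and the single separating space
        return index

    index = build_inverted_index("to be or not to be")
    print(index["to"])                 # [1, 14]: the inverted list of the term "to"
    print(index.get("o b", []))        # []: arbitrary substrings are not indexed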
Searching unstructured text to answer string matching queries adds a new diffi-
culty to text indexing. This case arises with DNA sequences and in some Eastern
languages (Burmese, Chinese, Taiwanese, Tibetan, etc.), which do not have a well-
defined notion of terms. The set of successful search keys is possibly much larger than
the set of terms in structured texts, because it consists of all feasible substrings of T;
that is, we can have as many as (n choose 2) = Θ(n²) distinct substrings in the worst case,
while the number of distinct terms is at most n (considered as nonoverlapping sub-
strings). Suffix arrays [55, 35], suffix trees [57, 68], and similar tries or automata [20]
are among the prominent data structures used for unstructured texts. Since they can
handle all the search keys in O(n) memory words, they are sometimes referred to as
full-text indexes.
The suffix tree for text T = T [1,n] is a compact trie whose n leaves represent the
n text suffixes T [1,n], T [2,n], ... ,T [n, n]. By “compact” we mean that each internal
node has at least two children. Each edge in the tree is labeled with one or more
symbols for purposes of search navigation. The leaf with value ℓ represents the suffix
T[ℓ, n]. The leaf values in an in-order traversal of the tree represent the n suffixes
of T in lexicographic order. An example suffix tree appears in Figure 1.
A suffix array SA = SA[1,n] for text T = T [1,n] consists of the values of the
leaves of the suffix tree in in-order, but without the tree structure information. In
other words, SA[i] = ℓ means that T[ℓ, n] is the ith smallest suffix of T in lexicographic
order. The suffix array corresponding to the suffix tree of Figure 1 appears in Figure 2.
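For concreteness, SA can be produced (very inefficiently) by sorting the n suffixes outright; the sketch below is only meant to reproduce the ordering of Figure 2 and is unrelated to the compressed representations developed in this paper. The optional order argument lets us use the ordering a < # < b assumed in the figures.

    def naive_suffix_array(T: str, order=None):
        """Return SA[1..n] as a list of 1-based positions: SA[i] = l means that
        T[l, n] is the ith smallest suffix.  Sorting explicit suffix copies takes
        O(n^2 log n) time and O(n^2) space -- fine for an example only."""
        if order is None:
            key = lambda l: T[l - 1:]                      # natural character order
        else:
            key = lambda l: [order[c] for c in T[l - 1:]]  # custom alphabet order
        return sorted(range(1, len(T) + 1), key=key)

    T = "abbabbabbabbabaaabababbabbbabba#"
    SA = naive_suffix_array(T, order={"a": 0, "#": 1, "b": 2})  # a < # < b, as in Figure 2
    print(SA[:6])   # [15, 16, 31, 13, 17, 19], matching the first six rows of Figure 2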
Fig. 1. Suffix tree built on text T = abbabbabbabbabaaabababbabbbabba# of length n = 32,
where the last character is an end-of-string symbol #. The rightmost subtree (the triangle representing
the suffixes of the form bb···#) is not expanded in the figure. The edge label a···# or b···#
on the edge leading to the leaf with value ℓ denotes the remaining characters of the suffix T[ℓ, n]
that have not already been traversed. For example, the first suffix in lexicographic order is the
suffix T[15, n], namely, aaabababbabbbabba#, and the last edge represents the 16-symbol substring
that follows the prefix aa.
To speed up searches, a separate array is often maintained, which contains auxiliary
information such as the lengths of the longest common prefixes of a subset of the
suffixes [55].
Suffix trees and suffix arrays organize the suffixes so as to support the efficient
search of their prefixes. Given a search pattern P , in order to find an occurrence
T[i, i + m − 1] = P, we can exploit the property that P must be a prefix of suf-
fix T[i, n]. In general, existential and counting queries take O(m lg |Σ|) time using
automata or suffix trees and their variations, and they take O(m + lg n) time using
suffix arrays along with longest common prefixes. Enumerative queries take an ad-
ditive output-sensitive cost O(occ). In this paper, we use the term “suffix array” to
denote the array containing the permutation of positions, 1, 2,...,n, but without the
longest common prefix information mentioned above. Full-text indexes such as suffix
arrays are more powerful than term-level inverted lists, since full-text indexes can also
implement inverted lists efficiently by storing the suffixes T[i_k, n] that correspond to
the occurrences of the terms.
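The O(m lg n)-time search just described can be sketched as two binary searches that delimit the contiguous interval of SA whose suffixes have P as a prefix; the O(m + lg n) refinement, which also uses the longest-common-prefix information, is omitted from this illustration.

    def sa_search(T: str, SA, P: str):
        """Existential, counting, and enumerative queries on a plain suffix array.
        Each comparison inspects at most m symbols, so the search costs O(m lg n)."""
        n, m = len(SA), len(P)
        prefix = lambda i: T[SA[i] - 1 : SA[i] - 1 + m]    # first m symbols of the ith suffix
        lo, hi = 0, n
        while lo < hi:                     # find the first suffix whose prefix is >= P
            mid = (lo + hi) // 2
            if prefix(mid) < P:
                lo = mid + 1
            else:
                hi = mid
        start, hi = lo, n
        while lo < hi:                     # find the first suffix whose prefix is > P
            mid = (lo + hi) // 2
            if prefix(mid) <= P:
                lo = mid + 1
            else:
                hi = mid
        occ = lo - start
        return occ > 0, occ, sorted(SA[start:lo])          # existential, counting, enumerative

    T = "abbabbabbabbabaaabababbabbbabba#"
    SA = sorted(range(1, len(T) + 1), key=lambda l: T[l - 1:])   # naive SA, natural order
    print(sa_search(T, SA, "abba"))    # (True, 6, [1, 4, 7, 10, 21, 28])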
1.1. Issues on space efficiency. Suffix arrays and suffix trees are data struc-
tures with increasing importance because of the growing list of their applications.
Besides string searching, they also have significant use in molecular biology, data com-
pression, data mining, and text retrieval, to name but a few applications [7, 38, 55].
However, the sizes of the data sets in these applications can become extremely large,
and space occupancy is often a critical issue. A major disadvantage that limits the
applicability of text indexes based upon suffix arrays and suffix trees is that they
occupy significantly more space than do inverted lists.
 1  15  aaabababbabbbabba#
 2  16  aabababbabbbabba#
 3  31  a#
 4  13  abaaabababbabbbabba#
 5  17  abababbabbbabba#
 6  19  ababbabbbabba#
 7  28  abba#
 8  10  abbabaaabababbabbbabba#
 9   7  abbabbabaaabababbabbbabba#
10   4  abbabbabbabaaabababbabbbabba#
11   1  abbabbabbabbabaaabababbabbbabba#
12  21  abbabbbabba#
13  24  abbbabba#
14  32  #
15  14  baaabababbabbbabba#
16  30  ba#
17  12  babaaabababbabbbabba#
18  18  bababbabbbabba#
19  27  babba#
20   9  babbabaaabababbabbbabba#
21   6  babbabbabaaabababbabbbabba#
22   3  babbabbabbabaaabababbabbbabba#
23  20  babbabbbabba#
24  23  babbbabba#
...
32  25  bbbabba#
Fig. 2. Suffix array for the text T shown in Figure 1, where a < # < b. Note that the array
values correspond to the leaf values in the suffix tree in Figure 1 traversed in in-order.
We can illustrate this point by a more careful accounting of the space requirements
in the unit cost RAM model. We assume that each symbol in the text T is encoded by
lg |Σ| bits, for a total of n lg |Σ| bits.
1
In suffix arrays, the positions of the n suffixes
of T are stored as a permutation of 1, 2,...,n, using n lg n bits (kept in an array
consisting of n words, each of lg n bits). Suffix trees require considerably more space:
between 4n lg n and 5n lg n bits (stored in 4n–5n words) [55]. In contrast, inverted
lists require only approximately 10% of the text size [58], and thus suffix arrays and
suffix trees require significantly more bits. From a theoretical point of view, if the
alphabet is very large, namely, if lg |Σ| = Θ(lg n), then suffix arrays require roughly
the same number of bits as the text. However, in practice, the alphabet size |Σ| is
typically a fixed constant, such as |Σ| = 256 in electronic documents in ascii or larger
in unicode format, and |Σ| = 4 in DNA sequences. In such cases in practice, suffix
arrays and suffix trees are larger than the text by a significant multiplicative factor
of Θ(lg_{|Σ|} n) = Θ(lg n). For example, a DNA sequence of n symbols (with |Σ| = 4)
can be stored with 2n bits in a computer. The suffix array for the sequence requires
instead at least n words of 4 bytes each, or 32n bits, which is 16 times larger than
the text itself. On the other hand, we cannot resort to inverted files since they do not
support a general search on unstructured sequences.
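The blow-up factors quoted above are straightforward to recompute; the snippet below fixes the machine word at 4 bytes, as in the paper's example, and is merely a sanity check of that arithmetic.

    import math

    def index_vs_text_bits(n: int, sigma: int, word_bits: int = 32):
        """Bits for the packed text (lg|Sigma| per symbol) versus a plain suffix
        array that stores one machine word per suffix position."""
        text_bits = n * math.ceil(math.log2(sigma))
        sa_bits = n * word_bits
        return text_bits, sa_bits, sa_bits / text_bits

    print(index_vs_text_bits(10**6, 4))     # DNA:   (2000000, 32000000, 16.0)
    print(index_vs_text_bits(10**6, 256))   # ASCII: (8000000, 32000000, 4.0)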
¹In this paper, we use the notation lg^c_b n = (log_b n)^c = (log n / log b)^c to denote the cth power of
the base-b logarithm of n. If no base b is specified, the implied base is 2.
In [62], Munro, Raman, and Rao solve the open question raised by Muthukrishnan
by showing how to represent suffix trees in n lg n + O(n) bits while allowing O(m)-time
search of binary pattern strings of length m. This result highlights the conceptual
barrier of n lg n bits of storage needed for text indexing. In this paper, we go one step
further and investigate whether it is possible to design a full-text index in o(n lg n)
bits, while still supporting efficient search.
The question of space usage is important in both theory and practice. Prior to
our work, the state of the art has taken for granted that at least n lg n bits are needed
to represent the permutation of the text positions for any efficient full-text index. On
the other hand, if we note that each text of n symbols is in one-to-one correspondence
with a suffix array, then we can easily see by a simple information-theoretic argument
that Ω(n lg |Σ|) bits are required to represent the permutation. The argument is based
upon the fact that there are |Σ|^n different text strings of length n over the alphabet Σ;
hence, there are that many different suffix arrays, and we need Ω(n lg |Σ|) bits to
distinguish them from one another. It is therefore an interesting problem to close this
gap in order to see if there is an efficient representation of suffix arrays that uses nearly
n lg |Σ| + O(n) bits in the worst case, even for random strings.
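The counting argument can be summarized in one displayed line (a restatement of the reasoning above, not an additional claim):

    \[
      \#\{\text{texts of length } n \text{ over } \Sigma\} \;=\; |\Sigma|^{n}
      \quad\Longrightarrow\quad
      \text{bits needed} \;\ge\; \lg\!\left(|\Sigma|^{n}\right) \;=\; n \lg |\Sigma| \;=\; \Omega(n \lg |\Sigma|).
    \]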
In order to have an idea of the computational difficulty of the question, let us
follow a simple approach that saves space. Let us consider binary alphabets. We
bunch every lg n bits together into a word (in effect, constructing a large alphabet)
and create a text of length n/ lg n and a pattern of length m/ lg n. The suffix array
on the new text requires O((n/lg n) lg n) = O(n) bits. Searching for a pattern of
length m must also consider situations when the pattern is not aligned at the precise
word boundaries. What is the searching cost? It appears that we have to handle
lg n situations, with a slowdown factor of lg n in the time complexity of the search.
However, this is not really so; we actually have to pay a much larger slowdown factor
of Ω(n) in the search cost, which makes querying the text index more expensive than
running the O(m + n)-time algorithms from scratch, such as in [49]. To see why, let
us examine the situation in which the pattern occurs k positions to the right of a
word boundary in the text. In order to query the index, we have to align the pattern
with the boundary by padding k bits to the left of the pattern. Since we do not know
the correct k bits to prepend to the pattern, we must try all 2^k possible settings of
the k bits. When k ≈ lg n, we have to query the index 2^k = Θ(n) times in the worst
case. (See the sparse suffix trees [47] cited in section 1.3 to partially alleviate this
drawback.)
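The exponential blow-up is easy to reproduce: for a pattern that may start k bits past a word boundary, the packed index must be queried once for every choice of the k unknown bits. The sketch below (purely illustrative) simply enumerates those padded patterns.

    from itertools import product

    def padded_patterns(P: str, k: int):
        """All 2^k bit strings that might precede P up to the nearest word
        boundary; each one yields a separate query to the packed index."""
        return ["".join(bits) + P for bits in product("01", repeat=k)]

    P = "1011"                              # a binary pattern
    print(padded_patterns(P, 2))            # ['001011', '011011', '101011', '111011']
    print(len(padded_patterns(P, 16)))      # 65536 = 2^16 queries when k = lg n for n = 2^16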
The above example shows that a small reduction in the index size can make query-
ing the index useless in the worst case, as it can cost at least as much as performing
a full scan of the text from scratch. In section 1.3, we describe previous results moti-
vated by the need to find an efficient solution to the problem of designing a full-text
index that saves space and time in the worst case. No data structures with the func-
tionality of suffix trees and suffix arrays that have appeared in the literature to date
use Θ(n lg |Σ|) + o(n lg n) bits and support fast queries in o(m lg |Σ|) or o(m + lg n)
worst-case time. Our goal in this paper is to simultaneously reduce both the space
bound and the query time bound.
1.2. Our results. In this paper, we begin the study of the compressibility of
suffix arrays and related full-text indexes. We assume for simplicity that the alphabet
Σ is of bounded size (i.e., ascii or unicode/utf8). We recall that the suffix array SA
for text T stores the suffixes of T in lexicographic order, as shown in Figure 2. We
represent SA in the form of a permutation of the starting positions, 1, 2,...,n, of the

References (selected)
D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, Fast pattern matching in strings, SIAM J. Comput., 6 (1977), pp. 323–350.
P. Weiner, Linear pattern matching algorithms, in Proc. 14th Annual IEEE Symposium on Switching and Automata Theory, 1973, pp. 1–11.
U. Manber and G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22 (1993), pp. 935–948.