Proceedings ArticleDOI

High-order entropy-compressed text indexes

TL;DR: A novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg|σ| bits.
Abstract: We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg |σ| bits. We show that compressed suffix arrays use just nHh + o(n) bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg |σ| + polylog(n)) time. The term Hh ≤ lg |σ| denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that Hh = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
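To make the Hh term concrete, the hth-order empirical entropy can be computed directly from symbol frequencies conditioned on the preceding h symbols. The sketch below is illustrative only (function name and structure are my own, not the paper's):

```python
from collections import Counter, defaultdict
from math import log2

def empirical_entropy(text: str, h: int) -> float:
    """Return the h-th order empirical entropy H_h of `text`, in bits per symbol.

    H_0 is the plain zeroth-order entropy; for h > 0 each symbol is
    conditioned on the h symbols preceding it (its context).
    """
    n = len(text)
    if h == 0:
        counts = Counter(text)
        return -sum(c / n * log2(c / n) for c in counts.values())
    # Group the symbol following each length-h context.
    contexts = defaultdict(list)
    for i in range(n - h):
        contexts[text[i:i + h]].append(text[i + h])
    total = 0.0
    for followers in contexts.values():
        m = len(followers)
        counts = Counter(followers)
        total += m * -sum(c / m * log2(c / m) for c in counts.values())
    return total / n

# A repetitive string has H_1 far below H_0: every context determines its successor.
s = "abababababababab"
assert empirical_entropy(s, 1) < empirical_entropy(s, 0)
```

The higher the order h, the more context is exploited, which is why nHh can be far smaller than n lg |σ| for compressible texts.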


Citations
Journal ArticleDOI
TL;DR: The relationship between text entropy and regularities that show up in index structures and permit compressing them are explained and the most relevant self-indexes are covered, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems.
Abstract: Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, which radically changed the status of this area in less than 5 years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously.In this article we present the main concepts underlying (compressed) self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems. Our aim is to give the background to understand and follow the developments in this area.

834 citations

Proceedings ArticleDOI
27 May 2018
TL;DR: In this paper, the authors propose to replace traditional index structures with learned models, theoretically analyze under which conditions learned indexes outperform traditional index structures, and describe the main challenges in designing learned index structures.
Abstract: Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that our learned indexes can have significant advantages over traditional indexes. More importantly, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work provides just a glimpse of what might be possible.
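A toy rendering of the "indexes are models" premise (my own sketch, not the authors' implementation): fit a simple linear model from keys to positions in the sorted array, record the worst-case prediction error, and correct each lookup with a binary search bounded by that error.

```python
import bisect

class LearnedIndex:
    """Toy learned index: a linear model key -> position plus an error bound."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position i against key value (1-D linear model).
        mean_k = sum(self.keys) / n
        mean_i = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys)
        self.slope = cov / var if var else 0.0
        self.intercept = mean_i - self.slope * mean_k
        # Worst-case prediction error bounds the final search window.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of `key` in the sorted array, or -1 if absent."""
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        j = bisect.bisect_left(self.keys, key, lo, hi)
        return j if j < len(self.keys) and self.keys[j] == key else -1

idx = LearnedIndex(range(0, 1000, 7))  # keys 0, 7, 14, ...
assert idx.lookup(21) == 3
assert idx.lookup(22) == -1
```

When the key distribution is nearly linear, the recorded error is tiny and each lookup touches only a constant-size window, which is the intuition behind the paper's results.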

742 citations

Journal ArticleDOI
TL;DR: Two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form are designed and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm.
Abstract: We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log^{1+ε} n) time for any chosen ε, 0 < ε < 1.
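For readers unfamiliar with the first of the two compressors mentioned in the TL;DR, a minimal Burrows-Wheeler Transform can be sketched as follows. This is the textbook sorted-rotations construction with a '$' sentinel (an O(n² log n) illustration, not the paper's machinery):

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert the BWT by repeatedly sorting prepended columns (naive but correct)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row.rstrip(sentinel)

assert bwt("banana") == "annb$aa"
assert inverse_bwt(bwt("banana")) == "banana"
```

The transform groups symbols by context, which is what makes the BWT output highly compressible and useful as an indexing substrate.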

656 citations


Cites background or methods from "High-order entropy-compressed text ..."


  • ...…research on compressed indexes has produced data structures that are more alphabet-friendly and achieve various tradeoffs between space usage and query time [Grossi et al. 2003; Rao 2002; Sadakane 2002, 2003; Grabowski et al. 2004; Navarro 2004; Mäkinen et al. 2004; Mäkinen and Navarro 2004]....


  • ...These second generation compressed indexes make use of new algorithmic tools such as succinct dictionaries [Raman et al. 2002], wavelet trees [Grossi et al. 2003] and compression boosting [Ferragina et al. 2005]....


  • ...Currently, the most space economical compressed indexes [Grossi et al. 2003; Ferragina et al. 2004] take nHk(T) + o(n) bits for k < α log_{|Σ|} n with α < 1....



Journal ArticleDOI
TL;DR: The result presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice.
Abstract: The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\lg_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(\mathit{occ})$ for listing the $\mathit{occ}$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $O(m / \lg_{|\Sigma|} n + \lg_{|\Sigma|}^\epsilon n)$ search time in the worst case, for any constant $0 < \epsilon \leq 1$, using at most $(\epsilon^{-1} + O(1)) \, n \lg |\Sigma|$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice.
As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30--40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(\mathit{occ} \, \lg_{|\Sigma|}^\epsilon n)$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m / \lg_{|\Sigma|} n + \mathit{occ})$ time, which is optimal.

559 citations


Cites background or methods from "High-order entropy-compressed text ..."

  • ...Assuming that the text is read-only and using a stronger version of the bit-probe model, Demaine and López-Ortiz [22] have shown in the worst case that any text index with alphabet size |Σ| = 2 that supports fast queries by probing O(m) bits in the text must use Ω(n) bits of extra storage space....


  • ...(He also uses our Lemma 2 in section 3.1 to show how to store the skip values of the suffix tree in O(n) bits [65].) The space of compressed suffix arrays has been further reduced to the order-k entropy (with a multiplicative constant of 1) by Grossi, Gupta, and Vitter [36] using a novel analysis based on a finite set model....


  • ...We remark that Sadakane [64] has shown that the space complexity in Theorem 1(ii) and Theorem 2(ii) can be restated in terms of the order-0 entropy H0 ≤ lg |Σ| of the string, giving as a result ε^{-1} H0 n + O(n) bits. Grossi, Gupta, and Vitter [36]...


Journal ArticleDOI
TL;DR: In the cell probe model, the O(lg lg m) additive term can be removed from the space bound, answering a question raised by Fich and Miltersen [1995] and Pagh [2001].
Abstract: We consider the indexable dictionary problem, which consists of storing a set S ⊆ {0,…,m − 1} for some integer m while supporting the operations of rank(x), which returns the number of elements in S that are less than x if x ∈ S, and −1 otherwise; and select(i), which returns the ith smallest element in S. We give a data structure that supports both operations in O(1) time on the RAM model and requires B(n, m) + o(n) + O(lg lg m) bits to store a set of size n, where B(n, m) = ⌈lg (m choose n)⌉ is the minimum number of bits required to store any n-element subset from a universe of size m. Previous dictionaries taking this space only supported (yes/no) membership queries in O(1) time. In the cell probe model we can remove the O(lg lg m) additive term in the space bound, answering a question raised by Fich and Miltersen [1995] and Pagh [2001]. We present extensions and applications of our indexable dictionary data structure, including: —an information-theoretically optimal representation of a k-ary cardinal tree that supports standard operations in constant time; —a representation of a multiset of size n from {0,…,m − 1} in B(n, m + n) + o(n) + O(lg lg m) bits that supports (appropriate generalizations of) rank and select operations in constant time; and —a representation of a sequence of n nonnegative integers summing up to m in B(n, m + n) + o(n) bits that supports prefix sum queries in constant time.
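The rank/select interface from this abstract can be mimicked with a naive structure for reference. The sketch below (my own; class and method names are illustrative) stores S explicitly and makes no attempt at the paper's contribution, which is answering both queries in O(1) time within B(n, m) + o(n) + O(lg lg m) bits:

```python
import bisect

class IndexableDictionary:
    """Naive reference version of the indexable-dictionary interface.

    Stores S explicitly; a succinct structure would answer both
    queries in O(1) time in near-information-theoretic space.
    """

    def __init__(self, elements):
        self.sorted = sorted(elements)
        self.members = set(elements)

    def rank(self, x):
        """Number of elements of S less than x if x is in S, else -1."""
        if x not in self.members:
            return -1
        return bisect.bisect_left(self.sorted, x)

    def select(self, i):
        """The i-th smallest element of S (1-based)."""
        return self.sorted[i - 1]

d = IndexableDictionary({3, 7, 42})
assert d.rank(7) == 1 and d.rank(5) == -1
assert d.select(3) == 42
```

Note the partial-rank convention above: rank is defined only for members of S, exactly as in the abstract, which is what allows the structure to be smaller than a full rank dictionary.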

415 citations

References
Book
11 May 1999
TL;DR: A comprehensive treatment of text and image compression, indexing, index construction, and querying, with guides to the MG system and the NZDL.
Abstract (table of contents): Preface; 1. Overview; 2. Text Compression; 3. Indexing; 4. Querying; 5. Index Construction; 6. Image Compression; 7. Textual Images; 8. Mixed Text and Images; 9. Implementation; 10. The Information Explosion; A. Guide to the MG System; B. Guide to the NZDL; References; Index.

2,068 citations


"High-order entropy-compressed text ..." refers background in this paper

  • ...Large alphabets are typical of phrase searching [5, 21], for example, in which the alphabet is made up of single words and its size cannot be considered a small constant....


Journal ArticleDOI
TL;DR: A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper, and it is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.
Abstract: A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space. From a complexity standpoint, suffix arrays permit on-line string searches of the type, “Is W a substring of A?” to be answered in time $O(P + \log N)$, where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in $O(N)$ time in the worst case, versus $O(N\log N)$ time for suffix arrays. However, an augmented algorithm is given that, regardless of the alphabet size, constructs suffix arrays in $O(N)$ expected time, albeit with lesser space efficiency. It is ...
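The sort-and-search paradigm described here can be sketched directly. The naive version below sorts full suffixes (so construction costs more than the paper's algorithms) and answers substring queries by binary search; names are my own:

```python
def build_suffix_array(text: str) -> list[int]:
    """Positions of all suffixes of `text`, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def is_substring(text: str, sa: list[int], w: str) -> bool:
    """Binary search over the suffix array: is `w` a substring of `text`?"""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare w against the length-|w| prefix of the mid suffix.
        if text[sa[mid]:sa[mid] + len(w)] < w:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(w)] == w

t = "mississippi"
sa = build_suffix_array(t)
assert is_substring(t, sa, "issi")
assert not is_substring(t, sa, "issp")
```

Every occurrence of w is a prefix of some suffix, so all matches form a contiguous run in the sorted order; that is what makes binary search sufficient.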

1,969 citations


"High-order entropy-compressed text ..." refers background or methods or result in this paper

  • ...We first perform a search of P in SA_{ℓ+lg t(n)}, which is stored explicitly along with LCP_{ℓ+lg t(n)}, the longest common prefix information required in [10]....


  • ...Similar to what we described in Section 2, level k = ℓ stores the suffix array SA_ℓ, inverse suffix array SA_ℓ^{-1}, and an array LCP_ℓ storing the longest common prefix information [10] to allow fast searching in SA_ℓ....


  • ...A standard suffix array [4, 10] is an array containing the position of each of the n suffixes of text T in lexicographical order....


Journal ArticleDOI
Edward M. McCreight
TL;DR: A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching that has the same asymptotic running time bound as previously published algorithms, but is more economical in space.
Abstract: A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching. This algorithm has the same asymptotic running time bound as previously published algorithms, but is more economical in space. Some implementation considerations are discussed, and new work on the modification of these search trees in response to incremental changes in the strings they index (the update problem) is presented.

1,661 citations

Proceedings ArticleDOI
12 Nov 2000
TL;DR: A data structure whose space occupancy is a function of the entropy of the underlying data set is devised, which achieves sublinear space and sublinear query time complexity and is shown how to plug into the Glimpse tool.
Abstract: We address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible and this space reduction is achieved at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because text T[1,u] is stored using O(H_k(T)) + o(1) bits per input symbol in the worst case, where H_k(T) is the kth order empirical entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1,p], the opportunistic data structure allows to search for the occurrences of P in T in O(p + occ log^ε u) time (for any fixed ε > 0). If data are uncompressible we achieve the best space bound currently known (Grossi and Vitter, 2000); on compressible data our solution improves the succinct suffix array of (Grossi and Vitter, 2000) and the classical suffix tree and suffix array data structures either in space or in query time or both. We also study our opportunistic data structure in a dynamic setting and devise a variant achieving effective search and update time bounds. Finally, we show how to plug our opportunistic data structure into the Glimpse tool (Manber and Wu, 1994). The result is an indexing tool which achieves sublinear space and sublinear query time complexity.
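The search mechanism behind such opportunistic (FM-type) indexes is backward search over the Burrows-Wheeler Transform. The sketch below (mine, not the paper's code) shows only the counting step, with rank answered by an O(n) scan where a real index uses constant-time compressed rank structures:

```python
from bisect import bisect_left

def make_bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations ('$' is the sentinel)."""
    s = text + "$"
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))

def count_occurrences(bwt: str, pattern: str) -> int:
    """Count occurrences of `pattern` using FM-index-style backward search."""
    sorted_syms = sorted(bwt)
    def C(c):  # number of symbols in the text strictly smaller than c
        return bisect_left(sorted_syms, c)
    def rank(c, i):  # occurrences of c in bwt[:i] -- naive O(n) scan
        return bwt[:i].count(c)
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):  # extend the match one symbol at a time, right to left
        lo = C(c) + rank(c, lo)
        hi = C(c) + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = make_bwt("abracadabra")
assert count_occurrences(b, "abra") == 2
assert count_occurrences(b, "zzz") == 0
```

Each iteration narrows [lo, hi) to the range of suffixes prefixed by the current pattern suffix, so the whole count takes |P| rank queries and never touches the original text.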

1,188 citations


"High-order entropy-compressed text ..." refers background or methods in this paper

  • ...Indexing the Associated Press file with the FM-index would require roughly 1 gigabyte according to the experiments in [3]....


  • ...Decompressing one text symbol of Sj at a time is inherently sequential as in [2] and [19, 20]....


  • ...1 Related Work A new trend in the design of advanced indexes for full-text searching of documents is represented by compressed suffix arrays [6, 18, 19, 20] and opportunistic FM-indexes [2, 3], in that they support the functionalities of suffix arrays and suffix trees, which are more powerful than classical inverted files [4]....



  • ...The FM-index [2, 3] is a self-indexing data structure requiring 5nHh + o(n) bits, while supporting searches in O(m + occ lg^{1+ε} n) time, where |Σ| = O(1)....


Journal ArticleDOI
TL;DR: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, and which is economical of index space and of reindexing time.
Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.

887 citations