Proceedings ArticleDOI

High-order entropy-compressed text indexes

TL;DR: A novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg|σ| bits.
Abstract: We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg |σ| bits. We show that compressed suffix arrays use just nHh + o(n) bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg |σ| + polylog(n)) time. The term Hh ≤ lg |σ| denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that Hh = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
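To make the Hh term concrete, the hth-order empirical entropy can be computed directly from symbol frequencies conditioned on the preceding h symbols. The sketch below is illustrative only (function name and structure are my own, not the paper's):

```python
from collections import Counter, defaultdict
from math import log2

def empirical_entropy(text: str, h: int) -> float:
    """Return the h-th order empirical entropy H_h of `text`, in bits per symbol.

    H_0 is the plain zeroth-order entropy; for h > 0 each symbol is
    conditioned on the h symbols preceding it (its context).
    """
    n = len(text)
    if h == 0:
        counts = Counter(text)
        return -sum(c / n * log2(c / n) for c in counts.values())
    # Group the symbol following each length-h context.
    contexts = defaultdict(list)
    for i in range(n - h):
        contexts[text[i:i + h]].append(text[i + h])
    total = 0.0
    for followers in contexts.values():
        m = len(followers)
        counts = Counter(followers)
        total += m * -sum(c / m * log2(c / m) for c in counts.values())
    return total / n

# A repetitive string has H_1 far below H_0: every context determines its successor.
s = "abababababababab"
assert empirical_entropy(s, 1) < empirical_entropy(s, 0)
```

The higher the order h, the more context is exploited, which is why nHh can be far smaller than n lg |σ| for compressible texts.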


Citations
Journal ArticleDOI
TL;DR: The relationship between text entropy and regularities that show up in index structures and permit compressing them are explained and the most relevant self-indexes are covered, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems.
Abstract: Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, which radically changed the status of this area in less than 5 years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously.In this article we present the main concepts underlying (compressed) self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems. Our aim is to give the background to understand and follow the developments in this area.

834 citations

Proceedings ArticleDOI
27 May 2018
TL;DR: In this paper, the authors propose to replace traditional index structures with learned models, theoretically analyze under which conditions learned indexes outperform traditional index structures, and describe the main challenges in designing learned index structures.
Abstract: Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that our learned indexes can have significant advantages over traditional indexes. More importantly, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work provides just a glimpse of what might be possible.
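A toy rendering of the "indexes are models" premise (my own sketch, not the authors' implementation): fit a simple linear model from keys to positions in the sorted array, record the worst-case prediction error, and correct each lookup with a binary search bounded by that error.

```python
import bisect

class LearnedIndex:
    """Toy learned index: a linear model key -> position plus an error bound."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position i against key value (1-D linear model).
        mean_k = sum(self.keys) / n
        mean_i = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys)
        self.slope = cov / var if var else 0.0
        self.intercept = mean_i - self.slope * mean_k
        # Worst-case prediction error bounds the final search window.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of `key` in the sorted array, or -1 if absent."""
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        j = bisect.bisect_left(self.keys, key, lo, hi)
        return j if j < len(self.keys) and self.keys[j] == key else -1

idx = LearnedIndex(range(0, 1000, 7))  # keys 0, 7, 14, ...
assert idx.lookup(21) == 3
assert idx.lookup(22) == -1
```

When the key distribution is nearly linear, the recorded error is tiny and each lookup touches only a constant-size window, which is the intuition behind the paper's results.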

742 citations

Journal ArticleDOI
TL;DR: Two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form are designed and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm.
Abstract: We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log^{1+ε} n) time for any chosen ε, 0 < ε < 1.
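For readers unfamiliar with the first of the two compressors mentioned in the TL;DR, a minimal Burrows-Wheeler Transform can be sketched as follows. This is the textbook sorted-rotations construction with a '$' sentinel (an O(n² log n) illustration, not the paper's machinery):

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert the BWT by repeatedly sorting prepended columns (naive but correct)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row.rstrip(sentinel)

assert bwt("banana") == "annb$aa"
assert inverse_bwt(bwt("banana")) == "banana"
```

The transform groups symbols by context, which is what makes the BWT output highly compressible and useful as an indexing substrate.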

656 citations


Cites background or methods from "High-order entropy-compressed text ..."


  • ...…research on compressed indexes has produced data structures that are more alphabet-friendly and achieve various tradeoffs between space usage and query time [Grossi et al. 2003; Rao 2002; Sadakane 2002, 2003; Grabowski et al. 2004; Navarro 2004; Mäkinen et al. 2004; Mäkinen and Navarro 2004]....


  • ...These second generation compressed indexes make use of new algorithmic tools such as succinct dictionaries [Raman et al. 2002], wavelet trees [Grossi et al. 2003] and compression boosting [Ferragina et al. 2005]....


  • ...Currently, the most space economical compressed indexes [Grossi et al. 2003; Ferragina et al. 2004] take nHk(T) + o(n) bits for k < α log_{|Σ|} n with α < 1....



Journal ArticleDOI
TL;DR: The result presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice.
Abstract: The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\lg_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(\mathit{occ})$ for listing the $\mathit{occ}$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $O(m / \lg_{|\Sigma|} n + \lg_{|\Sigma|}^\epsilon n)$ search time in the worst case, for any constant $0 < \epsilon \leq 1$, using at most $(\epsilon^{-1} + O(1)) \, n \lg |\Sigma|$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice.
As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30--40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(\mathit{occ} \, \lg_{|\Sigma|}^\epsilon n)$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m / \lg_{|\Sigma|} n + \mathit{occ})$ time, which is optimal.

559 citations


Cites background or methods from "High-order entropy-compressed text ..."

  • ...Assuming that the text is read-only and using a stronger version of the bit-probe model, Demaine and López-Ortiz [22] have shown in the worst case that any text index with alphabet size |Σ| = 2 that supports fast queries by probing O(m) bits in the text must use Ω(n) bits of extra storage space....


  • ...(He also uses our Lemma 2 in section 3.1 to show how to store the skip values of the suffix tree in O(n) bits [65].) The space of compressed suffix arrays has been further reduced to the order-k entropy (with a multiplicative constant of 1) by Grossi, Gupta, and Vitter [36] using a novel analysis based on a finite set model....


  • ...We remark that Sadakane [64] has shown that the space complexity in Theorem 1(ii) and Theorem 2(ii) can be restated in terms of the order-0 entropy H0 ≤ lg |Σ| of the string, giving as a result ε^{-1} H0 n + O(n) bits. Grossi, Gupta, and Vitter [36]...


Journal ArticleDOI
TL;DR: In the cell probe model, the O(lg lg m) additive term can be removed from the space bound, answering a question raised by Fich and Miltersen [1995] and Pagh [2001].
Abstract: We consider the indexable dictionary problem, which consists of storing a set S ⊆ {0,…,m − 1} for some integer m while supporting the operations of rank(x), which returns the number of elements in S that are less than x if x ∈ S, and −1 otherwise; and select(i), which returns the ith smallest element in S. We give a data structure that supports both operations in O(1) time on the RAM model and requires B(n, m) + o(n) + O(lg lg m) bits to store a set of size n, where B(n, m) = ⌈lg (m choose n)⌉ is the minimum number of bits required to store any n-element subset from a universe of size m. Previous dictionaries taking this space only supported (yes/no) membership queries in O(1) time. In the cell probe model we can remove the O(lg lg m) additive term in the space bound, answering a question raised by Fich and Miltersen [1995] and Pagh [2001]. We present extensions and applications of our indexable dictionary data structure, including: —an information-theoretically optimal representation of a k-ary cardinal tree that supports standard operations in constant time; —a representation of a multiset of size n from {0,…,m − 1} in B(n, m + n) + o(n) + O(lg lg m) bits that supports (appropriate generalizations of) rank and select operations in constant time; and —a representation of a sequence of n nonnegative integers summing up to m in B(n, m + n) + o(n) bits that supports prefix sum queries in constant time.
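The rank/select interface from this abstract can be mimicked with a naive structure for reference. The sketch below (my own; class and method names are illustrative) stores S explicitly and makes no attempt at the paper's contribution, which is answering both queries in O(1) time within B(n, m) + o(n) + O(lg lg m) bits:

```python
import bisect

class IndexableDictionary:
    """Naive reference version of the indexable-dictionary interface.

    Stores S explicitly; a succinct structure would answer both
    queries in O(1) time in near-information-theoretic space.
    """

    def __init__(self, elements):
        self.sorted = sorted(elements)
        self.members = set(elements)

    def rank(self, x):
        """Number of elements of S less than x if x is in S, else -1."""
        if x not in self.members:
            return -1
        return bisect.bisect_left(self.sorted, x)

    def select(self, i):
        """The i-th smallest element of S (1-based)."""
        return self.sorted[i - 1]

d = IndexableDictionary({3, 7, 42})
assert d.rank(7) == 1 and d.rank(5) == -1
assert d.select(3) == 42
```

Note the partial-rank convention above: rank is defined only for members of S, exactly as in the abstract, which is what allows the structure to be smaller than a full rank dictionary.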

415 citations

References
Book
11 May 1999
TL;DR: A comprehensive treatment of text and image compression, indexing, index construction, and querying, with guides to the MG system and the NZDL.
Abstract (table of contents): Preface; 1. Overview; 2. Text Compression; 3. Indexing; 4. Querying; 5. Index Construction; 6. Image Compression; 7. Textual Images; 8. Mixed Text and Images; 9. Implementation; 10. The Information Explosion; A. Guide to the MG System; B. Guide to the NZDL; References; Index.

2,068 citations


"High-order entropy-compressed text ..." refers background in this paper

  • ...Large alphabets are typical of phrase searching [5, 21], for example, in which the alphabet is made up of single words and its size cannot be considered a small constant....


Journal ArticleDOI
TL;DR: A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper, and it is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.
Abstract: A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space. From a complexity standpoint, suffix arrays permit on-line string searches of the type, “Is W a substring of A?” to be answered in time $O(P + \log N)$, where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in $O(N)$ time in the worst case, versus $O(N\log N)$ time for suffix arrays. However, an augmented algorithm is given that, regardless of the alphabet size, constructs suffix arrays in $O(N)$ expected time, albeit with lesser space efficiency. It is ...
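The sort-and-search paradigm described here can be sketched directly. The naive version below sorts full suffixes (so construction costs more than the paper's algorithms) and answers substring queries by binary search; names are my own:

```python
def build_suffix_array(text: str) -> list[int]:
    """Positions of all suffixes of `text`, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def is_substring(text: str, sa: list[int], w: str) -> bool:
    """Binary search over the suffix array: is `w` a substring of `text`?"""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare w against the length-|w| prefix of the mid suffix.
        if text[sa[mid]:sa[mid] + len(w)] < w:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(w)] == w

t = "mississippi"
sa = build_suffix_array(t)
assert is_substring(t, sa, "issi")
assert not is_substring(t, sa, "issp")
```

Every occurrence of w is a prefix of some suffix, so all matches form a contiguous run in the sorted order; that is what makes binary search sufficient.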

1,969 citations


"High-order entropy-compressed text ..." refers background or methods or result in this paper

  • ...We first perform a search of P in SA_{ℓ+lg t(n)}, which is stored explicitly along with LCP_{ℓ+lg t(n)}, the longest common prefix information required in [10]....


  • ...Similar to what we described in Section 2, level k = ℓ stores the suffix array SA_ℓ, inverse suffix array SA_ℓ^{-1}, and an array LCP_ℓ storing the longest common prefix information [10] to allow fast searching in SA_ℓ....


  • ...A standard suffix array [4, 10] is an array containing the position of each of the n suffixes of text T in lexicographical order....


Journal ArticleDOI
Edward M. McCreight
TL;DR: A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching that has the same asymptotic running time bound as previously published algorithms, but is more economical in space.
Abstract: A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching. This algorithm has the same asymptotic running time bound as previously published algorithms, but is more economical in space. Some implementation considerations are discussed, and new work on the modification of these search trees in response to incremental changes in the strings they index (the update problem) is presented.

1,661 citations

Proceedings ArticleDOI
12 Nov 2000
TL;DR: A data structure whose space occupancy is a function of the entropy of the underlying data set is devised, which achieves sublinear space and sublinear query time complexity and is shown how to plug into the Glimpse tool.
Abstract: We address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible and this space reduction is achieved at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because text T[1,u] is stored using O(H_k(T)) + o(1) bits per input symbol in the worst case, where H_k(T) is the kth order empirical entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1,p], the opportunistic data structure allows to search for the occurrences of P in T in O(p + occ log^ε u) time (for any fixed ε > 0). If data are uncompressible we achieve the best space bound currently known (Grossi and Vitter, 2000); on compressible data our solution improves the succinct suffix array of (Grossi and Vitter, 2000) and the classical suffix tree and suffix array data structures either in space or in query time or both. We also study our opportunistic data structure in a dynamic setting and devise a variant achieving effective search and update time bounds. Finally, we show how to plug our opportunistic data structure into the Glimpse tool (Manber and Wu, 1994). The result is an indexing tool which achieves sublinear space and sublinear query time complexity.
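The search mechanism behind such opportunistic (FM-type) indexes is backward search over the Burrows-Wheeler Transform. The sketch below (mine, not the paper's code) shows only the counting step, with rank answered by an O(n) scan where a real index uses constant-time compressed rank structures:

```python
from bisect import bisect_left

def make_bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations ('$' is the sentinel)."""
    s = text + "$"
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))

def count_occurrences(bwt: str, pattern: str) -> int:
    """Count occurrences of `pattern` using FM-index-style backward search."""
    sorted_syms = sorted(bwt)
    def C(c):  # number of symbols in the text strictly smaller than c
        return bisect_left(sorted_syms, c)
    def rank(c, i):  # occurrences of c in bwt[:i] -- naive O(n) scan
        return bwt[:i].count(c)
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):  # extend the match one symbol at a time, right to left
        lo = C(c) + rank(c, lo)
        hi = C(c) + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = make_bwt("abracadabra")
assert count_occurrences(b, "abra") == 2
assert count_occurrences(b, "zzz") == 0
```

Each iteration narrows [lo, hi) to the range of suffixes prefixed by the current pattern suffix, so the whole count takes |P| rank queries and never touches the original text.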

1,188 citations


"High-order entropy-compressed text ..." refers background or methods in this paper

  • ...Indexing the Associated Press file with the FM-index would require roughly 1 gigabyte according to the experiments in [3]....


  • ...Decompressing one text symbol of Sj at a time is inherently sequential as in [2] and [19, 20]....


  • ...1 Related Work A new trend in the design of advanced indexes for full-text searching of documents is represented by compressed suffix arrays [6, 18, 19, 20] and opportunistic FM-indexes [2, 3], in that they support the functionalities of suffix arrays and suffix trees, which are more powerful than classical inverted files [4]....



  • ...The FM-index [2, 3] is a self-indexing data structure requiring 5nHh + o(n) bits, while supporting searches in O(m + occ lg^{1+ε} n) time, where |Σ| = O(1)....


Journal ArticleDOI
TL;DR: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, and which is economical of index space and of reindexing time.
Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.

887 citations