Book ChapterDOI

Move-to-front, distance coding, and inversion frequencies revisited

TL;DR: This paper analyzes Move-to-Front, Distance Coding and Inversion Frequencies from the point of view of how effective they are in the task of compressing low-entropy strings, that is, strings which have many regularities and are therefore highly compressible.
Abstract
Move-to-Front, Distance Coding and Inversion Frequencies are three somewhat related techniques used to process the output of the Burrows-Wheeler Transform. In this paper we analyze these techniques from the point of view of how effective they are in the task of compressing low-entropy strings, that is, strings which have many regularities and are therefore highly compressible. This is a non-trivial task since many compressors have non-constant overheads that become non-negligible when the input string is highly compressible. Because of the properties of the Burrows-Wheeler transform, being locally optimal ensures an algorithm compresses low-entropy strings effectively. Informally, local optimality implies that an algorithm is able to effectively compress an arbitrary partition of the input string. We show that in their original formulation neither Move-to-Front, nor Distance Coding, nor Inversion Frequencies is locally optimal. Then, we describe simple variants of the above algorithms which are locally optimal. To achieve local optimality with Move-to-Front it suffices to combine it with Run Length Encoding. To achieve local optimality with Distance Coding and Inversion Frequencies we use a novel "escape and re-enter" strategy. Since we build on previous results, our analyses are simple and shed new light on the inner workings of the three techniques considered in this paper.
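The abstract's claim that Move-to-Front becomes locally optimal once combined with Run Length Encoding can be illustrated with a minimal sketch. The function names below (`mtf_encode`, `rle0`) are illustrative, not from the paper; the zero-run encoding shown is one common way to pair RLE with MTF output, since equal adjacent symbols become runs of zeros after MTF.

```python
def mtf_encode(s, alphabet):
    """Move-to-Front: emit each symbol's current index, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.pop(i)
        table.insert(0, ch)
    return out

def mtf_decode(ranks, alphabet):
    """Inverse transform: look up each index, output the symbol, move it to the front."""
    table = list(alphabet)
    out = []
    for i in ranks:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

def rle0(ranks):
    """Collapse runs of zeros (repeated symbols after MTF) into (0, run_length) pairs."""
    out = []
    i = 0
    while i < len(ranks):
        if ranks[i] == 0:
            j = i
            while j < len(ranks) and ranks[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append((ranks[i], 1))
            i += 1
    return out
```

For example, `mtf_encode("aabbbc", "abc")` yields `[0, 0, 1, 0, 0, 2]`, and `rle0` compacts the zero runs to `[(0, 2), (1, 1), (0, 2), (2, 1)]`, which is why MTF+RLE handles the long runs typical of BWT output well.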


Citations
Journal ArticleDOI

The myriad virtues of wavelet trees

TL;DR: A novel framework, called Pruned Wavelet Trees, is proposed that aims for the best combination of wavelet trees of properly-designed shapes and compressors, either binary (like Run-Length encoders) or non-binary (like Huffman and Arithmetic encoders).
Journal Article

The Myriad Virtues of Wavelet Trees

TL;DR: This paper provides a complete theoretical analysis of a wide class of compression algorithms based on Wavelet Trees and proves high-order entropy bounds for the challenging combination of Burrows-Wheeler Transform and Wavelet trees.
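Since several of the citing papers above analyze wavelet trees, a minimal pointer-based sketch may help; this is an illustrative implementation of the standard structure (a balanced binary partition of the alphabet with one bitvector per node), not code from any of these papers, and it uses naive prefix sums where a real index would use constant-time rank on compressed bitvectors.

```python
class WaveletTree:
    """Minimal wavelet tree over a string, supporting rank(c, i) queries."""

    def __init__(self, s, alphabet=None):
        self.alphabet = sorted(set(s)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:
            self.bits = None  # leaf: a single symbol, no bitvector needed
            return
        mid = len(self.alphabet) // 2
        left_set = set(self.alphabet[:mid])
        # One bit per position: 0 = symbol goes to left child, 1 = right child.
        self.bits = [0 if c in left_set else 1 for c in s]
        self.left = WaveletTree([c for c in s if c in left_set],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in s if c not in left_set],
                                 self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        if self.bits is None:
            return i  # leaf: every position holds c
        mid = len(self.alphabet) // 2
        ones = sum(self.bits[:i])  # naive rank; a real index precomputes this
        if c in self.alphabet[:mid]:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)
```

A query such as `WaveletTree("abracadabra").rank("a", 11)` descends one bitvector per level, which is the mechanism the entropy analyses in these papers account for level by level.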
Journal ArticleDOI

Balancing and clustering of words in the Burrows-Wheeler transform

TL;DR: Empirical observations suggest that balance is the combinatorial property of the input word that ensures optimal BWT compression; this hypothesis is corroborated by experiments on "real" text, using local entropy as a measure of the degree of balance of a word.
Proceedings ArticleDOI

Wavelet Trees: From Theory to Practice

TL;DR: It is shown that the run-length $\delta$ coding size of wavelet trees achieves the 0-order empirical entropy size of the original string with leading constant 1, when the string's 0-order empirical entropy is asymptotically less than the logarithm of the alphabet size.
Journal ArticleDOI

Words with Simple Burrows-Wheeler Transforms

TL;DR: An alternative proof of this result is given, and the words over the alphabet whose Burrows-Wheeler Transforms have a simple form are described; these words share some properties with standard words.
References
Book

Elements of information theory

TL;DR: The author examines the role of entropy, inequality, and randomness in the design and construction of codes.

A Block-sorting Lossless Data Compression Algorithm

TL;DR: A block-sorting lossless data compression algorithm is presented, together with an implementation whose performance is compared with widely available data compressors running on the same hardware.
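Since all three techniques in the paper process Burrows-Wheeler Transform output, a naive sketch of the transform may be useful. This is an illustrative quadratic-time version via sorted rotations with an assumed sentinel `$`; practical implementations build the BWT from a suffix array instead.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler Transform: last column of the sorted rotations of s + sentinel."""
    s = s + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last, sentinel="$"):
    """Invert the BWT by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]
```

For instance, `bwt("banana")` gives `"annb$aa"`: the transform clusters equal symbols into runs, which is exactly what makes MTF, Distance Coding, and Inversion Frequencies effective as second-stage encoders.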
Journal ArticleDOI

Universal codeword sets and representations of the integers

TL;DR: An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the nth code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate.
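The universal integer codes introduced in this reference are easy to sketch. Below is an illustrative implementation (names are my own) of the Elias gamma code, which spends about 2 lg n bits on a positive integer n, and the Elias delta code, which gamma-codes the length of n's binary representation and so needs only lg n + O(lg lg n) bits.

```python
def elias_gamma(n):
    """Elias gamma code: (len-1) zeros, then the binary representation of n."""
    assert n >= 1
    b = bin(n)[2:]                       # binary without the '0b' prefix
    return "0" * (len(b) - 1) + b

def elias_delta(n):
    """Elias delta code: gamma-code the bit length, then the bits after the leading 1."""
    assert n >= 1
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]
```

For example, `elias_gamma(5)` is `"00101"` and `elias_delta(5)` is `"01101"`; codes of this kind are the natural back end for the integer streams that Distance Coding and Inversion Frequencies produce.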
Journal ArticleDOI

Compressed full-text indexes

TL;DR: The relationship between text entropy and regularities that show up in index structures and permit compressing them are explained and the most relevant self-indexes are covered, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems.
Proceedings ArticleDOI

High-order entropy-compressed text indexes

TL;DR: A novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg|σ| bits.