Journal ArticleDOI

Compactly encoding unstructured inputs with differential compression

TL;DR
This work presents new differencing algorithms that operate at a fine granularity (the atomic unit of change), make no assumptions about the format or alignment of input data, and in practice use linear time, use constant space, and give good compression.
Abstract
The subject of this article is differential compression, the algorithmic task of finding common strings between versions of data and using them to encode one version compactly by describing it as a set of changes from its companion. A main goal of this work is to present new differencing algorithms that (i) operate at a fine granularity (the atomic unit of change), (ii) make no assumptions about the format or alignment of input data, and (iii) in practice use linear time, use constant space, and give good compression. We present new algorithms, which do not always compress optimally but use considerably less time or space than existing algorithms. One new algorithm runs in O(n) time and O(1) space in the worst case (where each unit of space contains ⌈log n⌉ bits), as compared to algorithms that run in O(n) time and O(n) space or in O(n^2) time and O(1) space. We introduce two new techniques for differential compression and apply these to give additional algorithms that improve compression and time performance. We experimentally explore the properties of our algorithms by running them on actual versioned data. Finally, we present theoretical results that limit the compression power of differencing algorithms that are restricted to making only a single pass over the data.
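To make the copy/add encoding model concrete, below is a minimal Python sketch of greedy, byte-granularity differencing against a seed hash table: hash short seeds of the reference version, scan the new version, extend any seed match byte by byte into a COPY command, and emit unmatched bytes as ADD literals. This is an illustration of the general technique, not the article's algorithm; all names here (encode_delta, SEED_LEN) are invented for the example.

```python
# A minimal greedy differencing sketch (illustrative, not the article's
# algorithm). Encodes `version` as COPY/ADD commands against `reference`.
SEED_LEN = 4  # illustrative seed length

def encode_delta(reference: bytes, version: bytes):
    # Index the offset of each SEED_LEN-byte seed in the reference.
    seeds = {}
    for i in range(len(reference) - SEED_LEN + 1):
        seeds.setdefault(reference[i:i + SEED_LEN], i)

    delta, literal, pos = [], bytearray(), 0
    while pos < len(version):
        match = seeds.get(version[pos:pos + SEED_LEN])
        if match is None:
            literal.append(version[pos])  # no seed match: buffer a literal
            pos += 1
            continue
        # Extend the seed match as far as the two versions agree.
        length = SEED_LEN
        while (match + length < len(reference)
               and pos + length < len(version)
               and reference[match + length] == version[pos + length]):
            length += 1
        if literal:
            delta.append(("ADD", bytes(literal)))
            literal = bytearray()
        delta.append(("COPY", match, length))
        pos += length
    if literal:
        delta.append(("ADD", bytes(literal)))
    return delta

def decode_delta(reference: bytes, delta) -> bytes:
    # Replay the commands to reconstruct the encoded version.
    out = bytearray()
    for cmd in delta:
        if cmd[0] == "COPY":
            _, offset, length = cmd
            out += reference[offset:offset + length]
        else:
            out += cmd[1]
    return bytes(out)
```

On versioned inputs that share long substrings, the delta collapses each shared run into a single COPY command, and decode_delta reconstructs the new version exactly. Note that the seed table makes this sketch use linear space; the article's contribution is achieving good compression in linear time and constant space, with no assumptions about the format or alignment of the input data.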

Citations
Proceedings Article

Sparse indexing: large scale, inline deduplication using sampling and locality

TL;DR: Presents sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve the chunk-lookup disk bottleneck that inline, chunk-based deduplication schemes face in large-scale backup.
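To illustrate the sampling idea in the TL;DR above, here is a small Python sketch of a sparse chunk index: only chunk hashes that fall in a deterministic sample are held in memory, and a sampled hit identifies a previously stored segment whose neighboring chunks are likely duplicates too. This is a generic illustration of sampled fingerprint lookup, not the paper's exact design; SAMPLE_MASK, SparseIndex, and the segment model are invented for the example.

```python
import hashlib

SAMPLE_MASK = 0x3F  # keep roughly 1 in 64 chunk hashes (illustrative rate)

def chunk_hash(chunk: bytes) -> int:
    # 64-bit prefix of SHA-256 as the chunk fingerprint.
    return int.from_bytes(hashlib.sha256(chunk).digest()[:8], "big")

class SparseIndex:
    """In-memory index over a sample of chunk hashes (illustrative)."""

    def __init__(self):
        self.sampled = {}  # sampled hash -> id of the segment holding it

    def add_segment(self, seg_id: int, chunks) -> None:
        for c in chunks:
            h = chunk_hash(c)
            if h & SAMPLE_MASK == 0:  # hash falls in the sample
                self.sampled[h] = seg_id

    def candidate_segments(self, chunks):
        # Segments that likely contain duplicates of the incoming chunks.
        return {self.sampled[h]
                for h in map(chunk_hash, chunks)
                if h & SAMPLE_MASK == 0 and h in self.sampled}
```

Because only sampled hashes are kept, the index stays small enough for RAM; a hit directs the system to load the full chunk list of one stored segment and deduplicate against it, exploiting the stream locality the TL;DR describes rather than consulting an on-disk index for every chunk.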
Proceedings ArticleDOI

Extreme Binning: Scalable, parallel deduplication for chunk-based file backup

TL;DR: Presents Extreme Binning, a scalable, parallel deduplication technique for non-traditional backup workloads made up of individual files with no locality among consecutive files in a given window of time.
Journal ArticleDOI

Pastiche: making backup cheap and easy

TL;DR: Pastiche exploits excess disk capacity to perform peer-to-peer backup with no administrative costs, and provides mechanisms for confidentiality, integrity, and detection of failed or malicious peers.
Proceedings Article

Redundancy elimination within large collections of files

TL;DR: The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner.
References
Book

Information Theory and Reliable Communication

TL;DR: This book discusses Coding for Discrete Sources, Techniques for Coding and Decoding, and Source Coding with a Fidelity Criterion.
Journal ArticleDOI

A universal algorithm for sequential data compression

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
Book

Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology

TL;DR: This book introduces suffix trees and their use in sequence alignment, covers core string edits, alignments, and dynamic programming, and extends these core problems to applications in computational biology.
Journal ArticleDOI

Compression of individual sequences via variable-rate coding

TL;DR: The proposed concept of compressibility is shown to play a role analogous to that of entropy in classical information theory, where one deals with probabilistic ensembles of sequences rather than with individual sequences.