Journal ArticleDOI
Compactly encoding unstructured inputs with differential compression
TLDR
This work presents new differencing algorithms that operate at a fine granularity (the atomic unit of change), make no assumptions about the format or alignment of input data, and in practice use linear time, use constant space, and give good compression.
Abstract
The subject of this article is differential compression, the algorithmic task of finding common strings between versions of data and using them to encode one version compactly by describing it as a set of changes from its companion. A main goal of this work is to present new differencing algorithms that (i) operate at a fine granularity (the atomic unit of change), (ii) make no assumptions about the format or alignment of input data, and (iii) in practice use linear time, use constant space, and give good compression. We present new algorithms, which do not always compress optimally but use considerably less time or space than existing algorithms. One new algorithm runs in O(n) time and O(1) space in the worst case (where each unit of space contains ⌈log n⌉ bits), as compared to algorithms that run in O(n) time and O(n) space or in O(n²) time and O(1) space. We introduce two new techniques for differential compression and apply these to give additional algorithms that improve compression and time performance. We experimentally explore the properties of our algorithms by running them on actual versioned data. Finally, we present theoretical results that limit the compression power of differencing algorithms that are restricted to making only a single pass over the data.
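As a concrete illustration of the copy/insert style of delta encoding described above, here is a minimal sketch that greedily matches the target against a seed-hash index of the reference. It is illustrative only, not the paper's algorithms (which control time and space much more carefully); the SEED length, command format, and function names are arbitrary choices for this example.

```python
# Illustrative copy/insert delta encoding (not the paper's exact algorithms).
# The reference version is indexed by fixed-length "seeds"; the target is
# scanned left to right, emitting "copy" commands for material found in the
# reference and "insert" commands for unmatched bytes.

SEED = 4  # seed (q-gram) length used to look up candidate matches

def encode(reference: bytes, target: bytes):
    """Encode `target` as copy/insert commands against `reference`."""
    # Index every seed of the reference by its first starting offset.
    index = {}
    for i in range(len(reference) - SEED + 1):
        index.setdefault(reference[i:i + SEED], i)

    delta, literal, pos = [], bytearray(), 0
    while pos < len(target):
        src = index.get(target[pos:pos + SEED])
        if src is None:
            literal.append(target[pos])   # no match here: buffer a literal byte
            pos += 1
            continue
        # Extend the match as far as reference and target agree.
        length = 0
        while (src + length < len(reference) and pos + length < len(target)
               and reference[src + length] == target[pos + length]):
            length += 1
        if literal:
            delta.append(("insert", bytes(literal)))
            literal = bytearray()
        delta.append(("copy", src, length))
        pos += length
    if literal:
        delta.append(("insert", bytes(literal)))
    return delta

def decode(reference: bytes, delta) -> bytes:
    """Rebuild the target version from the reference and the delta."""
    out = bytearray()
    for cmd in delta:
        if cmd[0] == "copy":
            _, src, length = cmd
            out += reference[src:src + length]
        else:
            out += cmd[1]
    return bytes(out)

if __name__ == "__main__":
    old = b"the quick brown fox jumps over the lazy dog"
    new = b"the quick red fox jumps over the lazy cat"
    d = encode(old, new)
    assert decode(old, d) == new
```

The compressed encoding is the delta alone; anyone holding the reference version can reconstruct the target from it, which is exactly the setting the article analyzes.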
Citations
Proceedings Article
Sparse indexing: large scale, inline deduplication using sampling and locality
Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, Peter Thomas Camble +5 more
TL;DR: Sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve the chunk-lookup disk bottleneck that inline, chunk-based deduplication schemes face in large-scale backup, is presented.
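The sampling idea can be sketched roughly as follows, assuming SHA-1 chunk fingerprints and a leading-zero-bits predicate for choosing which hashes ("hooks") stay in RAM; the paper's actual design (stream segmentation, champion selection, on-disk manifests) is considerably richer, and the class and parameter names here are invented for illustration.

```python
# Rough sketch of sparse indexing's sampling idea (illustrative only).
# Only sampled chunk hashes ("hooks") are kept in RAM; each hook points at the
# segments that contained it, and an incoming segment is deduplicated only
# against the manifests of those few candidate segments.

import hashlib
from collections import defaultdict

SAMPLE_BITS = 3  # keep roughly 1 in 2**SAMPLE_BITS chunk hashes as hooks

def chunk_hash(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()

def is_hook(h: bytes) -> bool:
    return h[0] >> (8 - SAMPLE_BITS) == 0  # sampling predicate on the hash value

class SparseIndex:
    def __init__(self):
        self.hooks = defaultdict(list)  # hook hash -> ids of segments containing it (RAM)
        self.manifests = {}             # segment id -> set of its chunk hashes (on disk)

    def dedupe_segment(self, seg_id, chunks):
        hashes = [chunk_hash(c) for c in chunks]
        # Candidate segments are found via the incoming segment's own hooks.
        candidates = {s for h in hashes if is_hook(h) for s in self.hooks[h]}
        known = set().union(*(self.manifests[s] for s in candidates)) if candidates else set()
        new_chunks = [c for c, h in zip(chunks, hashes) if h not in known]
        # Record this segment so later, similar segments can deduplicate against it.
        self.manifests[seg_id] = set(hashes)
        for h in hashes:
            if is_hook(h):
                self.hooks[h].append(seg_id)
        return new_chunks  # only these chunks need to be written to storage
```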
Proceedings ArticleDOI
Extreme Binning: Scalable, parallel deduplication for chunk-based file backup
TL;DR: Extreme Binning is presented, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time.
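The binning idea can be sketched as below, assuming a file's representative chunk ID is simply the minimum of its chunk hashes; the BinStore name and the in-memory data structures are made up for this example, and the real system keeps bins on disk with a much smaller primary index in RAM.

```python
# Minimal sketch of the representative-chunk binning idea (illustrative).
# Similar files tend to share their minimum chunk hash, so deduplicating a file
# only against the bin selected by that hash catches most duplicates without a
# full in-memory chunk index.

import hashlib

def chunk_hash(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()

class BinStore:
    def __init__(self):
        self.primary = {}  # representative chunk ID -> bin (set of chunk hashes)

    def add_file(self, chunks):
        hashes = [chunk_hash(c) for c in chunks]
        rep = min(hashes)                               # representative chunk ID
        bin_ = self.primary.setdefault(rep, set())
        new_chunks = [c for c, h in zip(chunks, hashes) if h not in bin_]
        bin_.update(hashes)                             # grow the bin with this file
        return new_chunks                               # chunks that must be stored
```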
Journal ArticleDOI
Pastiche: making backup cheap and easy
TL;DR: Pastiche exploits excess disk capacity to perform peer-to-peer backup with no administrative costs, and provides mechanisms for confidentiality, integrity, and detection of failed or malicious peers.
Proceedings Article
Redundancy elimination within large collections of files
TL;DR: The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner.
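A much-simplified sketch of such a pipeline follows, assuming SHA-1 block hashes, a single min-hash shingle feature for resemblance detection, and deflate with a preset dictionary standing in for a real delta encoder; it illustrates the combination of the three techniques rather than the REBL implementation itself.

```python
# Simplified duplicate-suppression / delta / compression pipeline (illustrative).
# Each block is checked for an exact duplicate by hash, then for a resembling
# block to delta-encode against, and is plainly compressed as a last resort.

import hashlib
import zlib

def feature(block: bytes, width: int = 8) -> bytes:
    """One resemblance feature: the minimum hash over the block's shingles."""
    shingles = [block[i:i + width] for i in range(max(1, len(block) - width + 1))]
    return min(hashlib.sha1(s).digest() for s in shingles)

def delta_against(base: bytes, block: bytes) -> bytes:
    """Stand-in for a real delta encoder: deflate with the base as preset dictionary."""
    c = zlib.compressobj(zdict=base)
    return c.compress(block) + c.flush()

def store(blocks):
    by_hash, by_feature, encoded = {}, {}, []
    for b in blocks:
        h = hashlib.sha1(b).digest()
        if h in by_hash:
            encoded.append(("dup", h))                      # duplicate suppression
        elif (f := feature(b)) in by_feature:
            base_h = by_feature[f]
            encoded.append(("delta", base_h, delta_against(by_hash[base_h], b)))
        else:
            encoded.append(("zlib", zlib.compress(b)))      # plain compression
            by_hash[h], by_feature[f] = b, h
    return encoded
```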
References
Book
Information Theory and Reliable Communication
TL;DR: This chapter discusses Coding for Discrete Sources, Techniques for Coding and Decoding, and Source Coding with a Fidelity Criterion.
Journal ArticleDOI
A universal algorithm for sequential data compression
Jacob Ziv, A. Lempel +1 more
TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
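For intuition, the sliding-window parsing underlying this code can be sketched as below; the window management and the coding of the output triples are greatly simplified relative to the paper, so this is a toy illustration rather than the 1977 scheme itself.

```python
# Toy LZ77-style parsing (illustrative). Each step emits a
# (distance, length, next_symbol) triple describing the longest match of the
# upcoming text that starts inside the already-seen window.

def lz77_parse(data: bytes, window: int = 4096):
    out, pos = [], 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for cand in range(max(0, pos - window), pos):
            length = 0
            while (pos + length < len(data) - 1
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        out.append((best_dist, best_len, data[pos + best_len]))
        pos += best_len + 1
    return out

def lz77_unparse(triples) -> bytes:
    out = bytearray()
    for dist, length, sym in triples:
        for _ in range(length):
            out.append(out[-dist])  # copy from the window (a match may overlap itself)
        out.append(sym)
    return bytes(out)

assert lz77_unparse(lz77_parse(b"abracadabra abracadabra")) == b"abracadabra abracadabra"
```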
Book
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
TL;DR: This book introduces suffix trees and their use in sequence alignment, covers core string edits, alignments, and dynamic programming, and extends these core problems to broader sequence-analysis applications.
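As a small example of the core string-edit and dynamic-programming material, here is a standard Levenshtein edit-distance computation in O(|a|·|b|) time and O(|b|) space; it is illustrative and not taken from the book.

```python
# Standard dynamic-programming edit distance (insertions, deletions, substitutions).

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                       # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb))) # match or substitute
        prev = curr
    return prev[len(b)]

assert edit_distance("kitten", "sitting") == 3
```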
Journal ArticleDOI
Compression of individual sequences via variable-rate coding
Jacob Ziv, A. Lempel +1 more
TL;DR: The proposed concept of compressibility is shown to play a role analogous to that of entropy in classical information theory where one deals with probabilistic ensembles of sequences rather than with individual sequences.
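For intuition, a toy LZ78-style incremental parse is sketched below; the paper's compressibility results are stated in terms of the number of phrases such a parse produces, and the output coding here is simplified for illustration.

```python
# Toy LZ78-style incremental parsing (illustrative). Each phrase is the shortest
# prefix of the remaining input not seen before, coded as
# (index of its longest previously seen prefix, next byte).

def lz78_parse(data: bytes):
    dictionary = {b"": 0}                  # phrase -> index; 0 is the empty phrase
    out, phrase = [], b""
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate             # keep extending the current phrase
        else:
            out.append((dictionary[phrase], byte))
            dictionary[candidate] = len(dictionary)
            phrase = b""
    if phrase:                             # flush a trailing, already-seen phrase
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_unparse(pairs) -> bytes:
    phrases = [b""]
    for idx, byte in pairs:
        phrases.append(phrases[idx] + bytes([byte]))
    return b"".join(phrases[1:])

assert lz78_unparse(lz78_parse(b"aababbabbb")) == b"aababbabbb"
```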