Proceedings ArticleDOI

Cluster-based delta compression of a collection of files

Z. Ouyang, +3 more
- pp 257-268
TLDR
This work proposes a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings and demonstrates the efficacy of this approach on collections of Web pages.
Abstract
Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
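To make the reduction concrete, here is a minimal Python sketch of the pipeline, under stated assumptions: pairwise delta sizes are approximated with zlib's preset-dictionary mode rather than a dedicated delta compressor, networkx is assumed to be available for the branching computation, and the clustering step that prunes the quadratic pair enumeration is only marked by a comment.

import itertools
import zlib
import networkx as nx  # assumed dependency; provides maximum_branching

def delta_size(target: bytes, reference: bytes) -> int:
    # Stand-in for a real delta compressor: zlib with `reference` as a
    # preset dictionary (zlib only sees the last 32 KB of the dictionary).
    comp = zlib.compressobj(level=9, zdict=reference)
    return len(comp.compress(target) + comp.flush())

def plan_deltas(files: dict[str, bytes]) -> list[tuple[str, str]]:
    # Edge src -> dst is weighted by the bytes saved when dst is
    # delta-encoded against src instead of being compressed on its own.
    solo = {name: len(zlib.compress(data, 9)) for name, data in files.items()}
    g = nx.DiGraph()
    g.add_nodes_from(files)
    for src, dst in itertools.permutations(files, 2):
        # Cluster-based pruning would go here: enumerate only pairs of
        # files that fall into the same cluster of similar files.
        saving = solo[dst] - delta_size(files[dst], files[src])
        if saving > 0:
            g.add_edge(src, dst, weight=saving)
    # A maximum-weight branching gives each file at most one reference
    # (in-degree <= 1) while maximizing the total savings.
    branching = nx.maximum_branching(g, attr="weight")
    return list(branching.edges)  # (reference, target) pairs

Files that end up with no incoming edge in the branching are compressed on their own; the rest are encoded as deltas against their chosen reference.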


Citations
Proceedings ArticleDOI

Locality-sensitive hashing scheme based on p-stable distributions

TL;DR: A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the lp norm, based on p-stable distributions, that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p < 1.
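The hash family behind this scheme has a compact closed form: project a point onto a random direction drawn from a p-stable distribution, add a random shift, and quantize into buckets of width w. A small illustrative sketch for p = 2, where the Gaussian distribution is 2-stable (numpy assumed; the names and the choice of w are hypothetical):

import numpy as np

def make_lsh(dim: int, w: float, rng: np.random.Generator):
    a = rng.standard_normal(dim)  # Gaussian entries are 2-stable, so the
                                  # resulting hash is sensitive to l2 distance
    b = rng.uniform(0.0, w)       # random offset within one bucket
    return lambda v: int(np.floor((a @ v + b) / w))

Points that are close under the norm collide with higher probability; concatenating several such functions and repeating across multiple tables yields the usual LSH accuracy/cost trade-off.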
Proceedings Article

Redundancy elimination within large collections of files

TL;DR: The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner.
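As a rough illustration of the duplicate-suppression layer, one of REBL's three techniques, the sketch below stores each distinct block exactly once, keyed by its digest. Fixed-size blocks are used purely for brevity; REBL itself works with content-defined chunks and additionally delta-encodes blocks that are merely similar.

import hashlib

def dedup(blobs: list[bytes], block_size: int = 4096):
    store: dict[str, bytes] = {}   # digest -> unique block payload
    recipes: list[list[str]] = []  # each blob as a sequence of digests
    for data in blobs:
        recipe = []
        for i in range(0, len(data), block_size):
            chunk = data[i:i + block_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # duplicates stored once
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes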
Proceedings Article

Application-specific Delta-encoding via Resemblance Detection.

TL;DR: It is found that delta-encoding using this resemblance detection technique can improve on simple compression by up to a factor of two, depending on workload, and that a small fraction of objects can potentially account for a large portion of these savings.
Journal ArticleDOI

Improving duplicate elimination in storage systems

TL;DR: A new object partitioning technique is proposed, called fingerdiff, that improves upon existing schemes in several important respects, most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization.
Journal ArticleDOI

A Survey and Classification of Storage Deduplication Systems

TL;DR: A classification of deduplication systems is presented according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope; the survey describes the different approaches used for each.
References
Journal ArticleDOI

A universal algorithm for sequential data compression

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
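The flavor of the method can be conveyed by a toy greedy parse: each step emits the longest match found in a sliding window, plus one literal. This is a didactic sketch only; practical implementations bound the match search and entropy-code the output, and the triple format below is an arbitrary choice.

def lz77_parse(data: bytes, window: int = 4096):
    # Emit (offset, length, next_literal) triples, longest match first.
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while (i + k + 1 < len(data) and j + k < i
                   and data[j + k] == data[i + k]):
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out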
Proceedings ArticleDOI

Approximate nearest neighbors: towards removing the curse of dimensionality

TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.
Proceedings ArticleDOI

On the resemblance and containment of documents

Andrei Z. Broder
- 11 Jun 1997
TL;DR: The basic idea is to reduce the resemblance and containment questions to set intersection problems that can be easily evaluated by a process of random sampling done independently for each document.
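The sampling idea is straightforward to sketch: represent each document by its set of w-token shingles and keep, for each of several salted hash functions (standing in for random permutations), the minimum hash value over that set. The fraction of positions where two signatures agree estimates the resemblance. All names and parameters below are illustrative:

import hashlib

def shingles(text: str, w: int = 4) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + w]) for i in range(max(1, len(toks) - w + 1))}

def signature(shingle_set: set, num_hashes: int = 128) -> list:
    # The minimum under a salted hash plays the role of the minimum
    # under a random permutation of the shingle universe.
    return [min(int(hashlib.sha1(f"{salt}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for salt in range(num_hashes)]

def resemblance_estimate(sig_a: list, sig_b: list) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)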
Journal ArticleDOI

RCS—a system for version control

TL;DR: Basic version control concepts are introduced, the practice of version control using RCS is discussed, and usage statistics show that RCS's delta method is space- and time-efficient.
Journal ArticleDOI

Finding optimum branchings

TL;DR: An implementation of the algorithm is given that runs in O(m log n) time if the problem graph has n vertices and m edges; a modification for dense graphs gives a running time of O(n^2).