Proceedings ArticleDOI
Cluster-based delta compression of a collection of files
Z. Ouyang, N. Memon, T. Suel, D. Trendafilov
- pp 257-268
TL;DR: This work proposes a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings and demonstrates the efficacy of this approach on collections of Web pages.
Abstract: Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
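The abstract's core construction — a weighted directed graph whose edge u → v is the saving from delta-encoding file v against file u, with the optimal encoding given by a maximum-weight branching — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses zlib as a crude stand-in for a real delta coder (the paper uses tools such as zdelta/vcdiff-style encoders), and it replaces the exact branching algorithm with a simple greedy heuristic; `delta_cost` and `greedy_delta_branching` are hypothetical names.

```python
import zlib

def delta_cost(ref: bytes, target: bytes) -> int:
    # Crude proxy for a delta encoding's size: the extra compressed bytes
    # needed for target once the compressor has already seen ref.
    # (A real system would call a dedicated delta coder here.)
    return len(zlib.compress(ref + target)) - len(zlib.compress(ref))

def greedy_delta_branching(files: dict) -> dict:
    # files: name -> bytes. Returns a parent map child -> reference file,
    # i.e. a branching: each file has at most one parent and no cycles.
    solo = {n: len(zlib.compress(d)) for n, d in files.items()}

    # Edge u -> v weighted by the bytes saved when v is delta-encoded
    # against u instead of compressed on its own.
    edges = []
    for u in files:
        for v in files:
            if u != v:
                saving = solo[v] - delta_cost(files[u], files[v])
                if saving > 0:
                    edges.append((saving, u, v))
    edges.sort(reverse=True)  # try the largest savings first

    parent = {}

    def creates_cycle(u, v):
        # Adding u -> v closes a cycle iff v is already an ancestor of u.
        while u in parent:
            u = parent[u]
            if u == v:
                return True
        return False

    for saving, u, v in edges:
        if v not in parent and not creates_cycle(u, v):
            parent[v] = u  # store v as a delta against u
    return parent
```

The greedy pass yields a valid branching but not necessarily the maximum-weight one; the paper's point is precisely that computing the optimal branching over all n² candidate edges is too expensive at scale, motivating the clustering-based pruning of the edge set.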
Citations
Proceedings ArticleDOI
Locality-sensitive hashing scheme based on p-stable distributions
TL;DR: A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p<1.
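The hash family this citation describes can be illustrated for p = 2, where the 2-stable distribution is the standard Gaussian: each hash projects a vector onto a random direction, shifts it, and quantizes into buckets of width w, so nearby points tend to collide. A minimal sketch under those assumptions (`make_pstable_hash` is a hypothetical name; real LSH indexes combine many such hashes into tables):

```python
import math
import random

def make_pstable_hash(dim: int, w: float = 4.0, seed: int = 0):
    # h(v) = floor((a . v + b) / w), with a ~ N(0,1)^dim (2-stable for
    # the l2 norm) and b ~ Uniform[0, w); w controls bucket width.
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

    return h
```

Vectors at small l2 distance land in the same bucket with high probability, which is what makes the family locality-sensitive for approximate nearest-neighbor search.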
Proceedings Article
Redundancy elimination within large collections of files
TL;DR: The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner.
Proceedings Article
Application-specific Delta-encoding via Resemblance Detection.
Fred Douglis, Arun Iyengar, et al.
TL;DR: It is found that delta-encoding using this resemblance detection technique can improve on simple compression by up to a factor of two, depending on workload, and that a small fraction of objects can potentially account for a large portion of these savings.
Journal ArticleDOI
Improving duplicate elimination in storage systems
TL;DR: A new object partitioning technique is proposed, called fingerdiff, that improves upon existing schemes in several important respects, most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization.
Journal ArticleDOI
A Survey and Classification of Storage Deduplication Systems
João Paulo, José Pereira, et al.
TL;DR: A classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope is identified and describes the different approaches used for each.
References
Journal ArticleDOI
A universal algorithm for sequential data compression
Jacob Ziv, A. Lempel
TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable- to-block codes designed to match a completely specified source.
Proceedings ArticleDOI
Approximate nearest neighbors: towards removing the curse of dimensionality
Piotr Indyk, Rajeev Motwani
TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.
Proceedings ArticleDOI
On the resemblance and containment of documents
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
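The sampling idea behind this citation — estimating set intersection per document, independently — is commonly realized with min-hash over shingles: the probability that two sets share the minimum under a random hash equals their Jaccard resemblance. A minimal sketch under that assumption (helper names `shingles`, `minhash_sketch`, and `estimated_resemblance` are hypothetical; the shingle size and number of hash functions are arbitrary choices):

```python
import hashlib

def shingles(text: str, k: int = 4) -> set:
    # Contiguous k-word shingles of the document.
    words = text.split()
    return {' '.join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_sketch(sh: set, num_hashes: int = 64) -> list:
    # One seeded hash function per slot; keep the minimum value seen.
    # The fraction of slots where two sketches agree estimates the
    # Jaccard resemblance |A ∩ B| / |A ∪ B| of the shingle sets.
    return [
        min(int.from_bytes(hashlib.sha1(f'{seed}:{s}'.encode()).digest()[:8],
                           'big')
            for s in sh)
        for seed in range(num_hashes)
    ]

def estimated_resemblance(a: str, b: str) -> float:
    sa, sb = minhash_sketch(shingles(a)), minhash_sketch(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Because each document's sketch is computed independently, sketches can be built once and compared pairwise later, which is what makes the technique suitable for pruning candidate pairs in large collections.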
Journal ArticleDOI
RCS—a system for version control
TL;DR: Basic version control concepts are introduced and the practice of version control using RCS is discussed, and usage statistics show that RCS's delta method is space and time efficient.
Journal ArticleDOI
Finding optimum branchings
TL;DR: An implementation of the algorithm which runs in O(m log n) time if the problem graph has n vertices and m edges is given, and a modification for dense graphs gives a running time of O(n^2).