Approximating the number of differences
between remote sets
Sachin Agarwal
Deutsche Telekom AG, Laboratories
Email: sachin.agarwal@telekom.de
Ari Trachtenberg
Boston University
Email: trachten@bu.edu
Abstract
We consider the problem of approximating the number of differences between sets held on remote hosts using
minimum communication. Efficient solutions to this problem are important for streamlining a variety of communica-
tion sensitive network applications, including data synchronization in mobile networks, gossip protocols and content
delivery networks. Using tools from the field of interactive communication, we show that this problem requires
about as much communication as the problem of exactly determining such differences. As a result, we propose a
heuristic solution based on the counting Bloom filter. We provide analytic bounds on the expected performance of
our protocol and also experimental evidence that it can outperform existing difference approximation techniques.
A version of this work will appear at the IEEE Information Theory Workshop, Punta del Este, Uruguay,
March 2006.
I. INTRODUCTION
Many distributed network systems maintain copies of the same information on different hosts of
a network. In order to maintain even weak data consistency, hosts must periodically reconcile their
differences with other hosts as connections become available or according to prescribed scheduling.
In a dense or constrained network the decision to reconcile should be based, in part, on the number
of differences between the reconciling hosts’ data sets. Although hosts with many differences between
them should probably be fully reconciled, hosts that are fairly similar might wait for more differences to
accumulate. Unfortunately, simple solutions, such as update timestamps, do not scale well to dynamic or
large networks because of the need to maintain an update history with respect to every other host [1].
In this paper we introduce a new approach for approximating the number of differences between two
remote data sets, based on a variant of the counting Bloom filter. Formally, the problem is as follows:
given two hosts A and B with data sets $S_A$ and $S_B$ respectively, we wish to approximate the size of
the mutual difference $S_A \oplus S_B \triangleq (S_A - S_B) \cup (S_B - S_A)$. Our goal is to measure this size as accurately as
possible and using as little communication as possible, measured both in terms of transmitted bytes and
rounds of communication. As a secondary goal, we also seek to reduce the computational cost involved
with such an approximation. We note that the accuracy of the approximations provided in this paper is
statistical in nature (i.e., an average accuracy over a number of approximations).
This work was completed while Sachin Agarwal was at Boston University. This work is based, in part, upon work supported by the
National Science Foundation under Grants CCR-0133521 and ANI-0240333.

A. Organization
In Section II we provide a baseline information-theoretic analysis of the difference approximation prob-
lem. Thereafter, we describe some existing protocols for difference approximation and very briefly review
traditional Bloom filters. Our heuristic, a wrapped filter approximation based on the counting Bloom filter,
is presented in Section III. The accuracy of this technique depends on the number of false positives
incurred, which we analyze in Section IV. Finally, in Section V we experimentally compare our
approach to existing approximation techniques. Conclusions and directions for future work are discussed
in Section VI.
II. PRELIMINARIES
We first provide an information-theoretic baseline of the amount of communication needed for difference
approximation, and thereafter proceed to describe some existing solutions to the problem and some
fundamental properties of the Bloom filter.
A. Information-theoretic bounds
1) Tools: We shall rely on two tools in the analysis of the complexity of difference approximation.
The first is a well-known result of Yao [2], based on a deterministic two-way communication model
in which two remote users with data $X \in M$ and $Y \in N$ respectively alternate, sending each other one
bit at a time, with the goal of computing a given deterministic function $f(X, Y)$. Yao showed that at
least $\log_2(d(f)) - 2$ bits of communication are needed to correctly communicate $f$, with $d(f)$ being the
minimum number of monochromatic rectangles needed to partition $f$ on $M \times N$.
The second tool comes from the work of Orlitsky and Roche [3], in which remote users have random
variables $X$ and $Y$. In this one-way communication model, the first user repeatedly sends (as one block) an
instance of his variable to the second user, who then attempts to compute $f(X, Y)$ with respect to instances
of its own random variable, with vanishing block error probability. Orlitsky and Roche showed that the
number of bits that must be transmitted per block for this model is given by the graph entropy [4] $H_G(X|Y)$
of the characteristic graph [5] of the function $f$ (the graph whose vertices are the support set of $X$ and
whose edges $(x, x')$ are pairs such that there exists a $y$ with $p(x, y), p(x', y) > 0$ and $f(x, y) \neq f(x', y)$);
namely, $H_G(X|Y) \triangleq \min I(W; X|Y)$, where $I$ denotes mutual information and $W$ represents a random
variable over the set of independent sets (i.e., induced subgraphs with no edges) of $G$, with conditional
probability satisfying $\sum_{w \ni x} p(w|x) = 1$ for instances $w \in W$ and $x \in X$.
The minimization is taken only over Markov chains $W \to X \to Y$.
2) Results: All the algorithms discussed in this paper compute the approximate number of differences
between sets on remote hosts. Unfortunately, determining the exact number of set differences requires a
large amount of communication, essentially at least as much information as is contained in the sets themselves.
This follows from the fact that computing set equality is a special case of computing set differences [2, 3].
The following lemma shows that approximating differences within an additive constant also requires
much communication.
Lemma II.1 For a fixed U, no algorithm can deterministically and definitively compute an approximation
$\hat{\Delta}(S, S')$ of the number of differences $\Delta(S, S')$ with
$$\Delta(S, S') - k \;\leq\; \hat{\Delta}(S, S') \;\leq\; \Delta(S, S') + k \qquad \forall\, S, S' \subseteq U,$$
using less than $\Omega(|U|)$ bits of communication at worst.
Proof: Consider the boolean function $f(S, S')$ defined to be 1 exactly when $\hat{\Delta}(S, S') \leq k$. Clearly
computing $\hat{\Delta}$ requires at least as much communication as computing $f$. On the other hand, the number
of ones in any row $f(S, S')$, $S' \subseteq U$, will, at most, consist of all sets that differ by $\pm k$ elements from
$S$; there are $O(|U|^{2k})$ such sets. As such, $2^{|U|} / |U|^{2k}$ monochromatic rectangles are needed to partition the space
of $f$, leading to the stated result under Yao's theorem [2]. In fact the result can be generalized to any
approximation that results in a function $f$ with asymptotically less than $2^{|U|}$ ones in any row.
We can deduce a stronger result under the model of Orlitsky and Roche, essentially showing that one
cannot efficiently approximate set difference in many cases. This result is based on the following lemma
generalizing a similar result in [6].
Lemma II.2 Let $q_G(P)$ denote the probability that two vertices, chosen independently with distribution
$P$, do not form an edge in $G$. Then
$$H_G(X|Y) \;\geq\; \log \frac{1}{q_G(P)}.$$
Proof: By definition,
$$I(W; X|Y) = \sum_{w,x,y} p(w, x, y) \log \frac{p(w, x|y)}{p(w|y)\, p(x|y)}.$$
Applying the log sum inequality,
$$I(W; X|Y) \;\geq\; \sum_{y} p(y) \left[ \sum_{w,x} p(w, x|y) \right] \log \frac{\sum_{w,x} p(w, x|y)}{\sum_{w,x} p(w|y)\, p(x|y)}
\;=\; \sum_{y} p(y) \log \frac{1}{\sum_{w} p(w|y) \sum_{x \in w} p(x|y)}. \qquad (1)$$
Adapting the technique in [6], we see that
$$\sum_{x \in w} p(w|y) \;=\; \sum_{x \in w} \sum_{z \in w} p(w, z|y) \qquad (2)$$
$$\;\leq\; \sum_{(x,z)\ \text{is not an edge}} p(w, z|y), \qquad (3)$$
where the last line follows from the fact that $w$ is an independent set. Inserting (3) into (1) concludes the
lemma.
Lemma II.2 is tight enough to show that difference approximation typically requires communication of
the same size as the sets being compared.
Theorem II.3 Consider two sets $X, Y \subseteq U$ generated independently, with elements chosen with probability
$p$. Then a one-way communication algorithm approximating differences within an additive error
$o(|U|)$ requires at least $\Omega(|U|)$ bits of communication.
Proof: Suppose an algorithm $f$ approximates set differences within an additive error $k$ that is $o(n)$,
where $n \triangleq |U|$. Then the characteristic graph $G$ of the function computed by this algorithm will have edges
between all sets of distance $> 2k$, and the graph $G'$ with only these edges lower-bounds the graph entropy
of $f$ because of the sub-additivity of $H_G$ [7]. We can compute the probability $q_{G'}(P)$ of two randomly
chosen vertices not corresponding to an edge as follows:
$$q_{G'}(P) = \sum_{i=0}^{2k} \binom{n}{i} \alpha^{i} (1 - \alpha)^{n-i},$$
where $\alpha = p^2 + (1 - p)^2$ is the probability that the two sets agree on a given element (i.e., both contain
it or neither does). Noting that $\alpha(1 - \alpha) \leq \frac{1}{4}$ and that $(1 - \alpha) \leq \frac{1}{2}$, we get:
$$q_{G'}(P) \;\leq\; \sum_{i=0}^{2k} \binom{n}{i} \left(\frac{1}{4}\right)^{i} \left(\frac{1}{2}\right)^{n-2i} \;\leq\; \left(\frac{1}{2}\right)^{n} \sum_{i=0}^{2k} \binom{n}{i}.$$
Thus
$$\log \frac{1}{q_{G'}(P)} \;\geq\; n - \log \left( \sum_{i=0}^{2k} \binom{n}{i} \right) \;\geq\; n - k \log \frac{n}{k},$$
which is $\Omega(n)$.
In fact, Theorem II.3 can be trivially generalized.
Corollary II.4 Any algorithm on remote sets $X, Y \subseteq U$ returning an approximation $f(X, Y)$ with
$$f_1(X \oplus Y) \;\leq\; f(X, Y) \;\leq\; f_2(X \oplus Y)$$
for some functions $f_1, f_2$ such that
$$\exists\, c > 0: \quad f_1(x) > f_2(0) \quad \forall\, x > c|U|,$$
requires at least $\Omega(|U|)$ bits of one-way communication.
The difficulty of efficiently providing hard approximation guarantees leads us to the consideration of
heuristic techniques.
B. Existing solutions
Various existing techniques for approximating set difference size are surveyed nicely in [8]. A simple
protocol for approximating differences involves random sampling, in which host A transmits k randomly
chosen elements to host B for comparison. If B has r of the transmitted elements, then we approximation
that
r
k
of the elements of B are common to A. The main problems with random sampling are a high error
rate and low resolution, as we shall see in Section V.
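
To make this concrete, the following minimal Python sketch (the function name and parameters are our own illustration, not from the paper) implements the sampling estimator described above: host A sends k randomly chosen elements, host B counts how many of them it holds, and the ratio r/k is taken as the fraction of elements common to both hosts.

    import random

    def sample_overlap_estimate(set_a, set_b, k=100):
        """Estimate the fraction of elements common to A and B by sampling.

        Host A sends k randomly chosen elements; host B counts how many of
        them (r) it also holds. The ratio r/k estimates the common fraction;
        the extrapolated difference count is our own rough illustration.
        """
        sample = random.sample(list(set_a), min(k, len(set_a)))  # sent A -> B
        r = sum(1 for x in sample if x in set_b)                 # computed at B
        common_fraction = r / len(sample)
        est_a_minus_b = (1 - common_fraction) * len(set_a)       # rough extrapolation
        return common_fraction, est_a_minus_b

The low resolution noted above is visible here: the estimate can only move in steps of 1/k, so small differences between large sets are easily reported as zero.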
The problem of determining similarity across documents [9] is also related to our work in this paper,
though its solutions are generally more complicated due to the relative complexity of the similarity
metric [10, 11]. Some of these approaches are based on clever sampling-based techniques called min-
wise sketches [12]. Though better than random sampling, min-wise sketches suffer from poor data
compressibility.
C. Bloom filter basics
Bloom filters [13–15] are used to perform efficient membership queries on sets. The Bloom filter of a
set is a bit array, initially all zeros; each element of the set is hashed with several hash functions into
corresponding locations in the array, which are thereby set to 1. Testing whether a specific element x is in a set
thus involves checking whether the appropriate bits are 1 in the Bloom filter of the set; if they are not,
then x is certainly not in the set, but otherwise the Bloom filter reports that x is in the set. In the latter
case, it is possible for the Bloom filter to incorrectly report that x is an element of the set (i.e., a false
positive) when, in fact, it is not.
The probability of a false positive of a Bloom filter for a set S is denoted $P_f(S)$ and depends on the
number of elements in the set $|S|$, the length of the Bloom filter $m$, and the number of (independent)
hash functions $k$ used to compute the Bloom filter. This false positive probability is given in [14] as
$$P_f(S) = \left( 1 - \left( 1 - \frac{1}{m} \right)^{k|S|} \right)^{k}. \qquad (4)$$
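
For reference, here is a minimal Python sketch of the standard Bloom filter just described, together with the estimate of Equation (4); the salted-hash construction is our own illustrative choice, since the paper only assumes k independent hash functions.

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m = m              # number of bits in the filter
            self.k = k              # number of hash functions
            self.bits = [0] * m
            self.n = 0              # number of inserted elements

        def _locations(self, item):
            # Derive k hash locations from salted SHA-256 digests (an
            # illustrative stand-in for k independent hash functions).
            for j in range(self.k):
                digest = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._locations(item):
                self.bits[pos] = 1
            self.n += 1

        def __contains__(self, item):
            # False: definitely not in the set; True: in the set or a false positive.
            return all(self.bits[pos] for pos in self._locations(item))

        def false_positive_probability(self):
            # Equation (4): P_f(S) = (1 - (1 - 1/m)^(k|S|))^k
            return (1 - (1 - 1 / self.m) ** (self.k * self.n)) ** self.k

For instance, with m = 1000 bits, k = 4 hash functions, and 100 inserted elements, Equation (4) evaluates to roughly 0.01.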

Protocol 1 Unwrapping a wrapped filter $W_{S_B}$ against a host set $S_A$.
for each set element $s_i \in S_A$ do
    copy $W_{temp} = W_{S_B}$
    for each hash function $h_j$ do
        if $W_{temp}[h_j(s_i)] > 0$ then
            $W_{temp}[h_j(s_i)] = W_{temp}[h_j(s_i)] - 1$
        else proceed to the next element $s_i$
    copy $W_{S_B} = W_{temp}$
return the approximation $\delta_A = \frac{1}{k} \sum_{i=1}^{m} W_{S_B}[i]$
III. WRAPPED FILTER APPROXIMATION
Wrapped filters hold condensed set membership information with more precision than a Bloom filter.
The additional precision comes at the expense of higher communication costs, but, surprisingly, this
expense is outweighed by the benefits of improved performance. As we show later, wrapped filters often
provide a more accurate approximation of set difference per communicated bit than traditional Bloom
filters.
A. Wrapping
Wrapped filters are constructed in a fashion similar to counting Bloom filters [13, 14]. A wrapped filter
$W_S$ of a set $S = \{s_1, s_2, s_3, \ldots, s_n\}$ is first initialized with all zeroes, and then set elements are added
to the filter by incrementing locations in $W_S$ corresponding to $k$ independent hashes $h_i(\cdot)$ of these
elements. More precisely, we increment $W_S[h_j(s_i)]$ for each set element $s_i \in S$ and hash function $h_j$ in
order to construct the wrapped filter $W_S$.
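
A minimal Python sketch of this construction follows (our own illustration; the salted-hash locations stand in for the paper's k independent hashes $h_j$). Each element increments its k counter locations, and, as discussed later in this section, a deletion simply decrements the same locations.

    import hashlib

    class WrappedFilter:
        """Counting-Bloom-style filter whose entries are counters, not bits."""

        def __init__(self, m, k):
            self.m = m                   # filter length
            self.k = k                   # number of hash functions
            self.counters = [0] * m

        def _locations(self, item):
            # k hash locations per element (salted SHA-256 is an illustrative
            # stand-in for the paper's independent hashes h_j).
            for j in range(self.k):
                digest = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._locations(item):
                self.counters[pos] += 1

        def remove(self, item):
            # Deletions decrement the same locations rather than forcing a rebuild.
            for pos in self._locations(item):
                self.counters[pos] -= 1

        def might_contain(self, item):
            # Treating all non-zero entries as ones recovers an ordinary Bloom filter.
            return all(self.counters[pos] > 0 for pos in self._locations(item))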
The wrapped filter clearly generalizes the Bloom filter in that we may transform the former into the
latter by treating all non-zero entries as ones. Host A can use this Bloom filter property of a wrapped
filter to determine $|S_A - S_B|$ by inspecting B's wrapped filter $W_{S_B}$; in other words, all elements of $S_A$
that do not fit the Bloom filter can be considered to be in $S_A - S_B$. Conversely, the unwrapping algorithm
in Section III-B allows us to approximate $|S_B - S_A|$ from the same wrapped filter, giving an overall
approximation for the mutual difference $|S_A \oplus S_B|$.
Unlike Bloom filters, wrapped filters also have the feature of incrementally handling both insertions
and deletions. Thus, whereas a Bloom filter for a set would have to be recomputed upon deletion of an
element, one may simply decrement the corresponding hash locations for this element in the wrapped
filter. The price for this feature is that each entry can now take any of $kn$ values (where $n = |S|$ is the size
of the set being wrapped), requiring a worst case of $m \log(kn)$ bits of storage memory and communication
for a filter of size $m$; in contrast, Bloom filters require only $m$ bits of communication. Fortunately, in the
expected case each entry has a value of only $\frac{kn}{m}$, giving an expected multiplicative storage overhead
of $\log(\frac{kn}{m})$ over traditional Bloom filters.
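
As a rough numeric illustration, with parameter values of our own choosing: for $n = 10{,}000$ elements, $k = 4$ hash functions, and a filter of length $m = 10{,}000$, the worst case is $m \log_2(kn) \approx 10{,}000 \times 15.3 \approx 153{,}000$ bits, while the expected entry value is $kn/m = 4$, for an expected overhead factor of about $\log_2(kn/m) = 2$ over the $10{,}000$ bits of a plain Bloom filter.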
B. Unwrapping
Host A can unwrap a wrapped filter $W_{S_B}$ to approximate $|S_B - S_A|$. This unwrapping procedure is
presented formally in Protocol 1.
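
The following Python sketch (our own illustration; the hash_locations callable is an assumed stand-in for the k hash functions used to build the filter) mirrors Protocol 1: the k counters of each local element are tentatively decremented, the decrements are committed only if every counter was positive, and the residual weight divided by k is returned as the approximation $\delta_A$ of $|S_B - S_A|$.

    def unwrap(wrapped_filter, hash_locations, local_set, k):
        """Protocol 1 sketch: estimate |S_B - S_A| from B's wrapped filter.

        wrapped_filter : list of counters received from host B (W_{S_B})
        hash_locations : callable mapping an element to its k filter indices
        local_set      : host A's set S_A
        k              : number of hash functions used to build the filter
        """
        counters = list(wrapped_filter)       # work on a local copy
        for element in local_set:
            tentative = list(counters)        # copy W_temp = W_{S_B}
            fits = True
            for pos in hash_locations(element):
                if tentative[pos] > 0:
                    tentative[pos] -= 1       # decrement this hash location
                else:
                    fits = False              # element does not fit; discard changes
                    break
            if fits:
                counters = tentative          # copy W_{S_B} = W_temp
        # Residual weight divided by k approximates |S_B - S_A|.
        return sum(counters) / k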
The strength of the wrapped filter rests in two features of the unwrapping algorithm. First, the total
weight of the wrapped filter (i.e., $\sum_{s_i \in S} W_{S_B}(s_i)$) decreases as each set element is unwrapped. As a
result, the false positive probability also generally decreases with each unwrapping, yielding a better
overall approximation, as we shall see in Section IV.

References
B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors."
R. A. Wagner and M. J. Fischer, "The string-to-string correction problem."
L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary cache: a scalable wide-area web cache sharing protocol."
A. Z. Broder, "On the resemblance and containment of documents," June 1997.
A. C.-C. Yao, "Some complexity questions related to distributive computing (preliminary report)."