Approximating the number of differences
between remote sets
Sachin Agarwal
Deutsche Telekom AG, Laboratories
Email: sachin.agarwal@telekom.de
Ari Trachtenberg
Boston University
Email: trachten@bu.edu
Abstract
We consider the problem of approximating the number of differences between sets held on remote hosts using
minimum communication. Efficient solutions to this problem are important for streamlining a variety of communica-
tion sensitive network applications, including data synchronization in mobile networks, gossip protocols and content
delivery networks. Using tools from the field of interactive communication, we show that this problem requires
about as much communication as the problem of exactly determining such differences. As a result, we propose a
heuristic solution based on the counting Bloom filter. We provide analytic bounds on the expected performance of
our protocol and also experimental evidence that it can outperform existing difference approximation techniques.
A version of this work will appear at the IEEE Information Theory Workshop, Punta del Este, Uruguay,
March 2006.
I. INTRODUCTION
Many distributed network systems maintain copies of the same information on different hosts of
a network. In order to maintain even weak data consistency, hosts must periodically reconcile their
differences with other hosts as connections become available or according to prescribed scheduling.
In a dense or constrained network the decision to reconcile should be based, in part, on the number
of differences between the reconciling hosts’ data sets. Although hosts with many differences between
them should probably be fully reconciled, hosts that are fairly similar might wait for more differences to
accumulate. Unfortunately, simple solutions, such as update timestamps, do not scale well to dynamic or
large networks because of the need to maintain an update history with respect to every other host [1].
In this paper we introduce a new approach for approximating the number of differences between two
remote data sets, based on a variant of the counting Bloom filter. Formally, the problem is as follows:
given two hosts A and B with data sets $S_A$ and $S_B$ respectively, we wish to approximate the size of
the mutual difference $S_A \oplus S_B \triangleq (S_A - S_B) \cup (S_B - S_A)$. Our goal is to measure this size as accurately as
possible and using as little communication as possible, measured both in terms of transmitted bytes and
rounds of communication. As a secondary goal, we also seek to reduce the computational cost involved
with such an approximation. We note that the accuracy of the approximations provided in this paper is
statistical in nature (i.e., an average accuracy over a number of approximations).
This work was completed while Sachin Agarwal was at Boston University. This work is based, in part, upon work supported by the
National Science Foundation under Grants CCR-0133521 and ANI-0240333.

A. Organization
In Section II we provide a baseline information-theoretic analysis of the difference approximation prob-
lem. Thereafter, we describe some existing protocols for difference approximation and very briefly review
traditional Bloom filters. Our heuristic, a wrapped filter approximation based on the counting Bloom filter,
is presented in Section III. The accuracy of this technique depends on the number of false positives
incurred, which we analyze in Section IV. Finally, in Section V we experimentally compare our
approach to existing approximation techniques. Conclusions and directions for future work are discussed
in Section VI.
II. PRELIMINARIES
We first provide an information-theoretic baseline of the amount of communication needed for difference
approximation, and thereafter proceed to describe some existing solutions to the problem and some
fundamental properties of the Bloom filter.
A. Information-theoretic bounds
1) Tools: We shall rely on two tools in the analysis of the complexity of difference approximation.
The first is a well-known result of Yao [2], based on a deterministic two-way communication model
in which two remote users with data $X \in M$ and $Y \in N$ respectively alternate, sending each other one
bit at a time, with the goal of computing a given deterministic function $f(X, Y)$. Yao showed that at
least $\log_2(d(f)) - 2$ bits of communication are needed to correctly communicate $f$, with $d(f)$ being the
minimum number of monochromatic rectangles needed to partition $f$ on $M \times N$.
The second tool comes from the work of Orlitsky and Roche [3], in which remote users have random
variables $X$ and $Y$. In this one-way communication model, the first user repeatedly sends (as one block) an
instance of his variable to the second user, who then attempts to compute $f(X, Y)$ with respect to instances
of its own random variable, with vanishing block error probability. Orlitsky and Roche showed that the
number of bits that must be transmitted per block for this model is given by the graph entropy [4] $H_G(X|Y)$
of the characteristic graph [5] of the function $f$ (the graph whose vertices are the support set of $X$ and
whose edges $(x, x')$ are pairs such that there exists a $y$ with $p(x, y), p(x', y) > 0$ and $f(x, y) \neq f(x', y)$);
namely, $H_G(X|Y) \triangleq \min I(W; X|Y)$, where $I$ denotes mutual information and $W$ represents a random
variable over the set of independent sets (i.e., induced subgraphs with no edges) of $G$, with conditional
probability satisfying $\sum_{w \ni x} p(w|x) = 1$ for instances $w \in W$ and $x \in X$.
The minimization is taken only over Markov chains $W \to X \to Y$.
2) Results: All the algorithms discussed in this paper compute the approximate number of differences
between sets on remote hosts. Unfortunately, determining the exact number of set differences requires a
large amount of communication, essentially at least as much information as is contained in the sets themselves.
This follows from the fact that computing set equality is a special case of computing set differences [2, 3].
The following lemma shows that approximating differences within an additive constant also requires
much communication.
Lemma II.1 For a fixed U, no algorithm can deterministically and definitively compute an approximation
$\hat{\Delta}(S, S')$ of the number of differences $\Delta(S, S')$ with
$$\Delta(S, S') - k \;\leq\; \hat{\Delta}(S, S') \;\leq\; \Delta(S, S') + k \qquad \forall\, S, S' \subseteq U,$$
using less than $\Omega(|U|)$ bits of communication at worst.
Proof: Consider the boolean function $f(S, S')$ defined to be 1 exactly when $\hat{\Delta}(S, S') \leq k$. Clearly
computing $\hat{\Delta}$ requires at least as much communication as computing $f$. On the other hand, the number
of ones in any row $f(S, S')$, $S' \subseteq U$, will, at most, consist of all sets that differ by $\pm k$ elements from
$S$; there are $O(|U|^{2k})$ such sets. As such, $2^{|U|} / |U|^{2k}$ monochromatic rectangles are needed to partition the space
of $f$, leading to the stated result under Yao's theorem [2]. In fact the result can be generalized to any
approximation that results in a function $f$ with asymptotically less than $2^{|U|}$ ones in any row.
We can deduce a stronger result under the model of Orlitsky and Roche, essentially showing that one
cannot efficiently approximate set difference in many cases. This result is based on the following lemma
generalizing a similar result in [6].
Lemma II.2 Let $q_G(P)$ denote the probability that two vertices, chosen independently with distribution
$P$, do not form an edge in $G$. Then
$$H_G(X|Y) \;\geq\; \log \frac{1}{q_G(P)}.$$
Proof: By definition,
$$I(W; X|Y) = \sum_{w,x,y} p(w, x, y) \log \frac{p(w, x|y)}{p(w|y)\, p(x|y)}.$$
Applying the log sum inequality,
$$I(W; X|Y) \;\geq\; \sum_{y} p(y) \left[ \sum_{w,x} p(w, x|y) \right] \log \frac{\sum_{w,x} p(w, x|y)}{\sum_{w,x} p(w|y)\, p(x|y)}
\;=\; \sum_{y} p(y) \log \frac{1}{\sum_{w} p(w|y) \sum_{x \in w} p(x|y)}. \qquad (1)$$
Adapting the technique in [6], we see that
$$\sum_{x \in w} p(w|y) \;=\; \sum_{x \in w} \sum_{z \in w} p(w, z|y) \qquad (2)$$
$$\;\leq\; \sum_{(x,z)\ \text{is not an edge}} p(w, z|y), \qquad (3)$$
where the last line follows from the fact that $w$ is an independent set. Inserting (3) into (1) concludes the
lemma.
Lemma II.2 is tight enough to show that difference approximation typically requires communication of
the same size as the sets being compared.
Theorem II.3 Consider two sets $X, Y \subseteq U$ generated independently, with elements chosen with probability
$p$. Then a one-way communication algorithm approximating differences within an additive error
$o(|U|)$ requires at least $\Omega(|U|)$ bits of communication.
Proof: Suppose an algorithm $f$ approximates set differences within an additive error $k$ that is $o(n)$,
where $n \triangleq |U|$. Then the characteristic graph $G$ of the function computed by this algorithm will have edges
between all sets of distance $> 2k$, and the graph $G'$ with only these edges lower-bounds the graph entropy
of $f$ because of the sub-additivity of $H_G$ [7]. We can compute the probability $q_{G'}(P)$ of two randomly
chosen vertices not corresponding to an edge as follows:
$$q_{G'}(P) = \sum_{i=0}^{2k} \binom{n}{i} \alpha^{i} (1 - \alpha)^{n-i},$$
where $\alpha = p^2 + (1 - p)^2$ is the probability that the two sets agree on a given element (i.e., both contain
it or neither does). Noting that $\alpha(1 - \alpha) \leq \frac{1}{4}$ and that $(1 - \alpha) \leq \frac{1}{2}$, we get:
$$q_{G'}(P) \;\leq\; \sum_{i=0}^{2k} \binom{n}{i} \left(\frac{1}{4}\right)^{i} \left(\frac{1}{2}\right)^{n-2i} \;\leq\; \left(\frac{1}{2}\right)^{n} \sum_{i=0}^{2k} \binom{n}{i}.$$
Thus
$$\log \frac{1}{q_{G'}(P)} \;\geq\; n - \log \left( \sum_{i=0}^{2k} \binom{n}{i} \right) \;\geq\; n - k \log \frac{n}{k},$$
which is $\Omega(n)$.
In fact, Theorem II.3 can be trivially generalized.
Corollary II.4 Any algorithm on remote sets $X, Y \subseteq U$ returning an approximation $f(X, Y)$ with
$$f_1(X \oplus Y) \;\leq\; f(X, Y) \;\leq\; f_2(X \oplus Y)$$
for some functions $f_1, f_2$ such that
$$\exists\, c > 0: \quad f_1(x) > f_2(0) \quad \forall\, x > c|U|,$$
requires at least $\Omega(|U|)$ bits of one-way communication.
The difficulty of efficiently providing hard approximation guarantees leads us to the consideration of
heuristic techniques.
B. Existing solutions
Various existing techniques for approximating set difference size are surveyed nicely in [8]. A simple
protocol for approximating differences involves random sampling, in which host A transmits k randomly
chosen elements to host B for comparison. If B has r of the transmitted elements, then we approximation
that
r
k
of the elements of B are common to A. The main problems with random sampling are a high error
rate and low resolution, as we shall see in Section V.
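
To make this concrete, the following minimal Python sketch (the function name and parameters are our own illustration, not from the paper) implements the sampling estimator described above: host A sends k randomly chosen elements, host B counts how many of them it holds, and the ratio r/k is taken as the fraction of elements common to both hosts.

    import random

    def sample_overlap_estimate(set_a, set_b, k=100):
        """Estimate the fraction of elements common to A and B by sampling.

        Host A sends k randomly chosen elements; host B counts how many of
        them (r) it also holds. The ratio r/k estimates the common fraction;
        the extrapolated difference count is our own rough illustration.
        """
        sample = random.sample(list(set_a), min(k, len(set_a)))  # sent A -> B
        r = sum(1 for x in sample if x in set_b)                 # computed at B
        common_fraction = r / len(sample)
        est_a_minus_b = (1 - common_fraction) * len(set_a)       # rough extrapolation
        return common_fraction, est_a_minus_b

The low resolution noted above is visible here: the estimate can only move in steps of 1/k, so small differences between large sets are easily reported as zero.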
The problem of determining similarity across documents [9] is also related to our work in this paper,
though its solutions are generally more complicated due to the relative complexity of the similarity
metric [10, 11]. Some of these approaches are based on clever sampling-based techniques called min-
wise sketches [12]. Though better than random sampling, min-wise sketches suffer from poor data
compressibility.
C. Bloom filter basics
Bloom filters [13–15] are used to perform efficient membership queries on sets. The Bloom filter of a
set is a bit array, initially all zeros; each element of the set is hashed with several hash functions into
corresponding locations in the array, which are thereby set to 1. Testing whether a specific element x is in a set
thus involves checking whether the appropriate bits are 1 in the Bloom filter of the set; if they are not,
then x is certainly not in the set, but otherwise the Bloom filter reports that x is in the set. In the latter
case, it is possible for the Bloom filter to incorrectly report that x is an element of the set (i.e., a false
positive) when, in fact, it is not.
The probability of a false positive of a Bloom filter for a set S is denoted $P_f(S)$ and depends on the
number of elements in the set $|S|$, the length of the Bloom filter $m$, and the number of (independent)
hash functions $k$ used to compute the Bloom filter. This false positive probability is given in [14] as
$$P_f(S) = \left( 1 - \left( 1 - \frac{1}{m} \right)^{k|S|} \right)^{k}. \qquad (4)$$
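
For reference, here is a minimal Python sketch of the standard Bloom filter just described, together with the estimate of Equation (4); the salted-hash construction is our own illustrative choice, since the paper only assumes k independent hash functions.

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m = m              # number of bits in the filter
            self.k = k              # number of hash functions
            self.bits = [0] * m
            self.n = 0              # number of inserted elements

        def _locations(self, item):
            # Derive k hash locations from salted SHA-256 digests (an
            # illustrative stand-in for k independent hash functions).
            for j in range(self.k):
                digest = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._locations(item):
                self.bits[pos] = 1
            self.n += 1

        def __contains__(self, item):
            # False: definitely not in the set; True: in the set or a false positive.
            return all(self.bits[pos] for pos in self._locations(item))

        def false_positive_probability(self):
            # Equation (4): P_f(S) = (1 - (1 - 1/m)^(k|S|))^k
            return (1 - (1 - 1 / self.m) ** (self.k * self.n)) ** self.k

For instance, with m = 1000 bits, k = 4 hash functions, and 100 inserted elements, Equation (4) evaluates to roughly 0.01.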

Protocol 1 Unwrapping a wrapped filter $W_{S_B}$ against a host set $S_A$.
for each set element $s_i \in S_A$ do
    copy $W_{temp} = W_{S_B}$
    for each hash function $h_j$ do
        if $W_{temp}[h_j(s_i)] > 0$ then
            $W_{temp}[h_j(s_i)] = W_{temp}[h_j(s_i)] - 1$
        else proceed to the next element $s_i$
    copy $W_{S_B} = W_{temp}$
return the approximation $\delta_A = \frac{1}{k} \sum_{i=1}^{m} W_{S_B}[i]$
III. WRAPPED FILTER APPROXIMATION
Wrapped filters hold condensed set membership information with more precision than a Bloom filter.
The additional precision comes at the expense of higher communication costs, but, surprisingly, this
expense is outweighed by the benefits of improved performance. As we show later, wrapped filters often
provide a more accurate approximation of set difference per communicated bit than traditional Bloom
filters.
A. Wrapping
Wrapped filters are constructed in a fashion similar to counting Bloom filters [13, 14]. A wrapped filter
$W_S$ of a set $S = \{s_1, s_2, s_3, \ldots, s_n\}$ is first initialized with all zeroes, and then set elements are added
to the filter by incrementing locations in $W_S$ corresponding to $k$ independent hashes $h_i(\cdot)$ of these
elements. More precisely, we increment $W_S[h_j(s_i)]$ for each set element $s_i \in S$ and hash function $h_j$ in
order to construct the wrapped filter $W_S$.
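
A minimal Python sketch of this construction follows (our own illustration; the salted-hash locations stand in for the paper's k independent hashes $h_j$). Each element increments its k counter locations, and, as discussed later in this section, a deletion simply decrements the same locations.

    import hashlib

    class WrappedFilter:
        """Counting-Bloom-style filter whose entries are counters, not bits."""

        def __init__(self, m, k):
            self.m = m                   # filter length
            self.k = k                   # number of hash functions
            self.counters = [0] * m

        def _locations(self, item):
            # k hash locations per element (salted SHA-256 is an illustrative
            # stand-in for the paper's independent hashes h_j).
            for j in range(self.k):
                digest = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._locations(item):
                self.counters[pos] += 1

        def remove(self, item):
            # Deletions decrement the same locations rather than forcing a rebuild.
            for pos in self._locations(item):
                self.counters[pos] -= 1

        def might_contain(self, item):
            # Treating all non-zero entries as ones recovers an ordinary Bloom filter.
            return all(self.counters[pos] > 0 for pos in self._locations(item))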
The wrapped filter clearly generalizes the Bloom filter in that we may transform the former into the
latter by treating all non-zero entries as ones. Host A can use this Bloom filter property of a wrapped
filter to determine $|S_A - S_B|$ by inspecting B's wrapped filter $W_{S_B}$; in other words, all elements of $S_A$
that do not fit the Bloom filter can be considered to be in $S_A - S_B$. Conversely, the unwrapping algorithm
in Section III-B allows us to approximate $|S_B - S_A|$ from the same wrapped filter, giving an overall
approximation for the mutual difference $|S_A \oplus S_B|$.
Unlike Bloom filters, wrapped filters also have the feature of incrementally handling both insertions
and deletions. Thus, whereas a Bloom filter for a set would have to be recomputed upon deletion of an
element, one may simply decrement the corresponding hash locations for this element in the wrapped
filter. The price for this feature is that each entry can now take any of $kn$ values (where $n = |S|$ is the size
of the set being wrapped), requiring a worst case of $m \log(kn)$ bits of storage memory and communication
for a filter of size $m$; in contrast, Bloom filters require only $m$ bits of communication. Fortunately, in the
expected case each entry has a value of only $\frac{kn}{m}$, giving an expected multiplicative storage overhead
of $\log(\frac{kn}{m})$ over traditional Bloom filters.
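
As a rough numeric illustration, with parameter values of our own choosing: for $n = 10{,}000$ elements, $k = 4$ hash functions, and a filter of length $m = 10{,}000$, the worst case is $m \log_2(kn) \approx 10{,}000 \times 15.3 \approx 153{,}000$ bits, while the expected entry value is $kn/m = 4$, for an expected overhead factor of about $\log_2(kn/m) = 2$ over the $10{,}000$ bits of a plain Bloom filter.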
B. Unwrapping
Host A can unwrap a wrapped filter $W_{S_B}$ to approximate $|S_B - S_A|$. This unwrapping procedure is
presented formally in Protocol 1.
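
The following Python sketch (our own illustration; the hash_locations callable is an assumed stand-in for the k hash functions used to build the filter) mirrors Protocol 1: the k counters of each local element are tentatively decremented, the decrements are committed only if every counter was positive, and the residual weight divided by k is returned as the approximation $\delta_A$ of $|S_B - S_A|$.

    def unwrap(wrapped_filter, hash_locations, local_set, k):
        """Protocol 1 sketch: estimate |S_B - S_A| from B's wrapped filter.

        wrapped_filter : list of counters received from host B (W_{S_B})
        hash_locations : callable mapping an element to its k filter indices
        local_set      : host A's set S_A
        k              : number of hash functions used to build the filter
        """
        counters = list(wrapped_filter)       # work on a local copy
        for element in local_set:
            tentative = list(counters)        # copy W_temp = W_{S_B}
            fits = True
            for pos in hash_locations(element):
                if tentative[pos] > 0:
                    tentative[pos] -= 1       # decrement this hash location
                else:
                    fits = False              # element does not fit; discard changes
                    break
            if fits:
                counters = tentative          # copy W_{S_B} = W_temp
        # Residual weight divided by k approximates |S_B - S_A|.
        return sum(counters) / k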
The strength of the wrapped filter rests in two features of the unwrapping algorithm. First, the total
weight of the wrapped filter (i.e., $\sum_{s_i \in S} W_{S_B}(s_i)$) decreases as each set element is unwrapped. As a
result, the false positive probability also generally decreases with each unwrapping, yielding a better
overall approximation, as we shall see in Section IV.

References
B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors."
R. A. Wagner and M. J. Fischer, "The string-to-string correction problem."
L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary cache: a scalable wide-area web cache sharing protocol."
A. Z. Broder, "On the resemblance and containment of documents," June 1997.
A. C.-C. Yao, "Some complexity questions related to distributive computing (preliminary report)."