
Theory and Practice of Bloom Filters for Distributed Systems

TL;DR: An overview of the basic and advanced probabilistic techniques is given, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.
Abstract: Many network solutions and overlay networks utilize probabilistic techniques to reduce information processing and networking costs. This survey article presents a number of frequently used and useful probabilistic techniques. Bloom filters and their variants are of prime importance, and they are heavily used in various distributed systems. This has been reflected in recent research and many new algorithms have been proposed for distributed systems that are either directly or indirectly based on Bloom filters. In this survey, we give an overview of the basic and advanced techniques, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.

Summary (7 min read)

Introduction

  • This survey presents a number of frequently used and useful probabilistic techniques.
  • Fast matching of arbitrary identifiers to values is a basic requirement for a large number of applications.
  • Given that there are millions or even billions of data elements, developing efficient solutions for storing, updating, and querying them becomes increasingly important.
  • Section II introduces the functionality and parameters of the Bloom filter as a hash-based, probabilistic data structure.

II. BLOOM FILTERS

  • The Bloom filter is a space-efficient probabilistic data structure that supports set membership queries.
  • The weak point of Bloom filters is the possibility of a false positive.
  • The bits that correspond to z (positions 15, 10 and 7) were set through the addition of elements b, y and l.
  • The development of uniform hashing techniques has been an active area of research.
  • Finally, the size of the set that is inserted into the filter determines the false positive rate.

A. False Positive Probability

  • The authors derive the false positive probability of a Bloom filter and the optimal number of hash functions for a given target false positive probability.
  • The authors start with the assumption that a hash function selects each array position with equal probability.
  • Now, the authors want to minimize the probability of false positives by minimizing $(1 - e^{-kn/m})^k$ with respect to k. This means that in order to maintain a fixed false positive probability, the length of a Bloom filter must grow linearly with the number of elements inserted in the filter.
  • There are other data structures that use space closer to the lower bound, but they are more complicated (cf. [5], [6], [7]).

B. Operations

  • Standard Bloom filters do not support the removal of elements.
  • Therefore a number of dedicated structures have been proposed that support deletions.
  • The bit-vector nature of the Bloom filter allows the union of two or more Bloom filters simply by performing bitwise OR on the bit-vectors.
  • One straightforward approach is to assume the same m and hash functions and to take the logical AND operation between the two bit-vectors.
  • Host A can then check false positives with B in a final round.

C. Hashing techniques

  • Hash functions are the key building block of probabilistic filters.
  • The n-size array can be used to store the information associated with each element x ∈ S [5].
  • For Bloom filter operations, the double hashing scheme reduces the number of true hash computations from k down to two without any increase in the asymptotic false positive probability [16].
  • When applied to hash table constructions, multiple hashing provides a probabilistic method to limit the effects of collisions by distributing elements more or less evenly.
  • While this is a great aid to theoretical analyses, hash function implementations are known to behave far worse than truly random ones.

III. BLOOM FILTER VARIANTS

  • A number of Bloom filter variants have been proposed that address some of the limitations of the original structure, including counting, deletion, multisets, and space-efficiency.
  • The authors start their examination with the basic counting Bloom filter construction, and then proceed to more elaborate structures including Bloomier and Spectral filters.

A. Counting Bloom Filters

  • As mentioned in the treatment of standard Bloom filters, they do not support element deletions.
  • To avoid counter overflow, the authors need to choose sufficiently large counters.
  • A counting Bloom filter also has the ability to keep approximate counts of items.
  • The upper bound is given by the formula below.
  • When an element is placed into the table, following the d-left hashing technique, d candidate buckets are obtained by computing d independent hash values of the element; a minimal counting-filter sketch follows this list.
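To make the counter-based deletion idea concrete, here is a rough Python sketch (the hashing scheme and parameters are illustrative, not from the survey; real implementations typically pack small fixed-width counters):

    import hashlib

    class CountingBloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.counters = [0] * m   # often 4-bit counters in practice; plain ints here

        def _pos(self, x):
            # k salted digests stand in for k independent hash functions.
            return [int(hashlib.md5(f"{j}:{x}".encode()).hexdigest(), 16) % self.m
                    for j in range(self.k)]

        def insert(self, x):
            for i in self._pos(x):
                self.counters[i] += 1

        def delete(self, x):
            # Decrementing the k counters removes x without disturbing other elements.
            for i in self._pos(x):
                if self.counters[i] > 0:
                    self.counters[i] -= 1

        def ismember(self, x):
            return all(self.counters[i] > 0 for i in self._pos(x))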

C. Compressed Bloom Filter

  • Compressing a Bloom filter improves performance when a Bloom filter is passed in a message between distributed nodes.
  • This structure is particularly useful when information must be transmitted repeatedly, and the bandwidth is a limiting factor [7].
  • If the optimal number of hash functions k (the value minimizing the false positive probability) is used, then the probability that any given bit is set in the bitstring representing the filter is 1/2.
  • The key idea in compressed Bloom filters is that by changing the way bits are distributed in the filter, it can be compressed for transmission purposes.
  • After transmission, the filter is decompressed for use.

E. Hierarchical Bloom Filters

  • Shanmugasundaram et al. [31] presented a data structure called Hierarchical Bloom Filter to support substring matching.
  • The filter works by splitting an input string into a number of fixed-size blocks.
  • These blocks are then inserted into a standard Bloom filter.
  • This substring matching may result in combinations of strings that are incorrectly reported as being in the set (false positives).
  • For the second level, two subsequent blocks are concatenated and inserted; a toy sketch of this splitting scheme follows this list.
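A toy Python sketch of the first two levels of this splitting scheme, with the block offset appended so that block position can be checked (the block size, the use of a single hash function, and the key encoding are all assumptions, not details from [31]):

    import hashlib

    def hbf_insert(bits, s, block=4, levels=2):
        m = len(bits)
        for level in range(1, levels + 1):
            size = block * level   # level 2 concatenates two subsequent blocks
            for off in range(0, len(s) - size + 1, block):
                key = f"{s[off:off+size]}|{off}|{level}"
                bits[int(hashlib.sha1(key.encode()).hexdigest(), 16) % m] = 1

    def hbf_query_block(bits, sub, off, level=1):
        key = f"{sub}|{off}|{level}"
        return bits[int(hashlib.sha1(key.encode()).hexdigest(), 16) % len(bits)] == 1

    bits = [0] * 4096
    hbf_insert(bits, "ABCDEFGH")
    print(hbf_query_block(bits, "ABCD", 0))   # True: first block at offset 0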

F. Spectral Bloom Filters

  • Spectral Bloom filters generalize Bloom filters to storing an approximate multiset and support frequency queries [32].
  • The answer to any multiplicity query is never smaller than the true multiplicity, and greater only with probability ε.
  • Spectral refers to the range within which multiplicity answers are given.
  • The space usage is similar to that of a Bloom filter for a set of the same size (including the counters to store the frequency values).
  • A further improvement of the error rate can be achieved using the recurring minimum (RM) method, which stores elements with a single minimum (among the k counters) in a secondary Spectral Bloom filter with a smaller error probability; a sketch of the basic minimum-selection query follows this list.
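A hedged Python sketch of the basic minimum-selection query (the counter packing and the recurring-minimum refinement of [32] are omitted; hashing is illustrative):

    import hashlib

    class SpectralFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.counters = [0] * m

        def _pos(self, x):
            return [int(hashlib.md5(f"{j}:{x}".encode()).hexdigest(), 16) % self.m
                    for j in range(self.k)]

        def insert(self, x):
            for i in self._pos(x):
                self.counters[i] += 1

        def multiplicity(self, x):
            # Minimum over the k counters: never below the true count,
            # and above it only with small probability.
            return min(self.counters[i] for i in self._pos(x))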

H. Decaying Bloom Filters

  • Duplicate element detection is an important problem, especially pertaining to data stream processing [36].
  • This motivates approximate detection of duplicates among newly arrived data elements of a data stream.
  • This can be accomplished within a fixed time window.
  • The Decaying Bloom Filter (DBF) structure has been proposed for this application scenario.
  • A variant of DBF has been applied for hint-based routing in wireless sensor networks [39].

I. Stable Bloom Filter

  • The Stable Bloom Filter or SBF [41] is another solution to duplicate element detection.
  • The SBF guarantees that the expected fraction of zeros in the SBF stays constant.
  • The SBF introduces both false positives and false negatives, but with rates improved from standard Bloom filters or standard buffering.
  • When adding an element, P counters chosen at random are first decremented (by one).
  • Please see the full paper [41] for details on setting all the parameters; a rough sketch of the update rule follows this list.
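A rough Python sketch of the update rule described above (the ordering of the steps and the values of Max, P, k, and m are illustrative assumptions; parameter setting per [41] is omitted):

    import hashlib, random

    class StableBloomFilter:
        def __init__(self, m, k, P, Max=3):
            self.m, self.k, self.P, self.Max = m, k, P, Max
            self.cells = [0] * m

        def _pos(self, x):
            return [int(hashlib.md5(f"{j}:{x}".encode()).hexdigest(), 16) % self.m
                    for j in range(self.k)]

        def seen(self, x):
            # Query first: was x (probably) seen before?
            duplicate = all(self.cells[i] > 0 for i in self._pos(x))
            # Decrement P randomly chosen cells so the expected fraction
            # of zeros stays constant ("stability")...
            for i in random.sample(range(self.m), self.P):
                if self.cells[i] > 0:
                    self.cells[i] -= 1
            # ...then record x by setting its k cells to Max.
            for i in self._pos(x):
                self.cells[i] = self.Max
            return duplicate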

K. Adaptive Bloom filters

  • The Adaptive Bloom Filter (ABF) [43] is an alternative construction to counting Bloom filters especially well suited for applications where large counters are to be supported without overflows and under unpredictable collision rate dynamics (e.g., network traffic applications).
  • The key idea of the ABF is to count the appearances of elements by an increasing set of hash functions.
  • The key idea is to take advantage of differing flow sizes and to increase or decrease the signature lengths of flows, making them easier or harder to identify in the filter.
  • The construction can adaptively reduce the false positive rate by removing some bits of the signature, thus effectively removing the flow from the structure.
  • A related technique for handling time-varying sets, called double buffering, uses two bitmaps, active and inactive, to support time-dependent Bloom filters.

N. Scalable Bloom filters

  • One caveat with Bloom Filters is having to dimension the maximum filter size (m) a priori.
  • This is commonly done by application designers by establishing an upper bound on the expected fpr and estimating the maximum required capacity (n).
  • Scalable Bloom Filters (SBF) [47] are a BF variant that can adapt dynamically to the number of elements stored, while assuring a maximum false positive probability.
  • Set membership queries require testing for element presence in each filter, thus the requirement on increasing sizes and tightening of error probabilities as the BF scales up.
  • Parameters of the SBF, in addition to the initial bit size m and target fpr, include the expected growth rate (s) and the error probability tightening ratio (r); a condensed sketch of the growth mechanism follows this list.
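A condensed Python sketch of the growth mechanism (the per-stage k and capacity follow the standard approximations; details such as slicing differ from [47]):

    import hashlib, math

    class ScalableBloomFilter:
        def __init__(self, m0=1024, p0=0.01, s=2, r=0.5):
            self.s, self.r = s, r
            self.stages = []          # each stage is a plain Bloom filter
            self._grow(m0, p0)

        def _grow(self, m, p):
            k = max(1, round(math.log2(1 / p)))               # k ~ log2(1/p)
            cap = int(m * (math.log(2) ** 2) / -math.log(p))  # n from m and p
            self.stages.append({"bits": [0] * m, "k": k, "cap": cap,
                                "n": 0, "m": m, "p": p})

        def _pos(self, x, m, k):
            return [int(hashlib.md5(f"{j}:{x}".encode()).hexdigest(), 16) % m
                    for j in range(k)]

        def insert(self, x):
            st = self.stages[-1]
            if st["n"] >= st["cap"]:
                # Current stage is full: add a larger stage with tighter error.
                self._grow(st["m"] * self.s, st["p"] * self.r)
                st = self.stages[-1]
            for i in self._pos(x, st["m"], st["k"]):
                st["bits"][i] = 1
            st["n"] += 1

        def ismember(self, x):
            # Every stage must be tested; the overall fpr is bounded by
            # the (geometrically tightening) sum of stage fprs.
            return any(all(st["bits"][i] for i in self._pos(x, st["m"], st["k"]))
                       for st in self.stages)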

O. Dynamic Bloom Filter

  • Standard BFs and their mainstream variants suffer from inefficiencies when the cardinality of the set under representation is unknown prior to design and deployment.
  • In distributed applications, BF reconstruction is cumbersome and may hinder interoperability.
  • The DBF is based on the notion of an active Bloom filter.
  • The element is then inserted into the active BF.
  • If multiple filters return true, the element removal may result in, at most, k potential false negatives.

P. Split Bloom Filters

  • A Split Bloom filter (SPBF) [49] employs a constant s × m bit matrix for set representation, where s is a pre-defined constant based on the estimation of maximum set cardinality.
  • The SPBF aims at overcoming the limitation of standard BFs which do not take sets of variable sizes into account.
  • The basic idea of the SPBF is to allocate more memory space to enhance the capacity of the filter before its implementation and actual deployment.
  • The false match probability increases as the set cardinality grows.
  • An existing SPBF must be reconstructed using a new bit matrix if the false match probability exceeds an upper bound.

Q. Retouched Bloom filters

  • The Retouched Bloom filter (RBF) [50] builds upon two observations.
  • First, for many BF applications, some false positives are more troublesome than others, and these can be identified after BF construction but prior to deployment.
  • Second, there are cases where a low level of false negatives is acceptable.
  • The novel idea behind the RBF is the bit clearing process by which false positives are removed by resetting individual bits.
  • In case of a random bit clearing process, the gains are neutral, i.e., the fpr decrease equals the fnr increase.

R. Generalized Bloom Filters

  • A GBF starts out as an arbitrary bit vector set with both 1s and 0s, and information is encoded by setting chosen bits to either 0 or 1, departing thus from the notion that empty bit cells represent the absence of information.
  • As a result, the GBF is a more general binary classifier than the standard Bloom filter.
  • In the GBF, the false-positive probability is upper bounded and it does not depend on the initial condition of the filter.
  • The generalization brought by the set of hash functions resetting bits introduces false negatives, whose probability can be upper bounded and does not depend either on the bit filter initial set-up.
  • The GBF returns false if any bit is inverted, i.e., the queried element does not belong to the set with high probability.

T. Data Popularity Conscious Bloom Filters

  • In many information processing environments, the underlying popularities of data items and queries are not identical; rather, they differ and are skewed.
  • An intuitive approach to take data item popularity into account is to use longer encodings and more hash functions for important elements and shorter encodings and fewer hash functions for less important ones.
  • Thus the Bloom filter construction lends itself well to data popularity-conscious filtering as well; however, this requires minimizing the false positive rate by adapting the number of hashes used for each element to its popularity in sets and membership queries.
  • To this end, an object importance metric was proposed in [55].
  • The problem was modeled as a constrained nonlinear integer program, and two polynomial-time solutions were presented with bounded approximation ratios.

V. Weighted Bloom filter

  • Bruck et al. [57] propose Weighted Bloom filter (WBF), a Bloom filter variant that exploits the a priori knowledge of the frequency of element requests by varying the number of hash functions (k) accordingly as a function of the element query popularity.
  • Hence, a WBF incorporates the information on the query frequencies and the membership likelihood of the elements into its optimal design, which fits many applications well in which popular elements are queried much more often than others.
  • The rationale behind the WBF design is to consider the filter fpr as a weighted sum of each individual element’s false positive probability, where the weight is positively correlated with the element’s query frequency and is negatively correlated with the element’s probability of being a member.
  • As a consequence, in applications where the query frequencies can be estimated or collected and follow, for instance, a step or Zipf distribution, the WBF largely outperforms the traditional Bloom filter in fpr.
  • Even a simple binary classification of elements between hot and cold can result in false positive improvements of a few orders of magnitude.

W. Secure Bloom filters

  • The hashing nature of Bloom filters provides some basic security in the sense that the identities of the set elements represented by the BF are not directly visible to an observer.
  • However, BFs are vulnerable to correlation attacks, where the similarity of BFs' contents can be deduced by comparing BF indexes for overlaps, or lack thereof.
  • Encrypted Bloom filters by Bellovin and Cheswick [59] are a privacy-preserving variant of Bloom filters which introduces a semi-trusted third party to transform one party's queries into a form suitable for querying the other party's BF, in such a way that the original query's privacy is preserved.
  • Rather than keeping all parties' keys undisclosed and securing the BF operations with keyed hash functions as per Goh [58], Bellovin and Cheswick propose a specialized form of encryption function where operations can be done on encrypted data.
  • More specifically, their proposal is based on the Pohlig-Hellman cipher, which forms an Abelian group over its keys when encrypting any given element.

X. Summary and discussion

  • Table II summarizes the distinguishing features of the Bloom filter variants discussed in this section.
  • The different Bloom filter designs aim at addressing specific concerns regarding space and transmission efficiency, false positive rate, dynamic operation in terms of increasing workload, dynamic operation in terms of insertions and deletions, counting and frequencies, popularity-aware operation, and mapping to elements and sets instead of simple set membership tests.
  • For each variant, Table II indicates the output type (e.g., boolean, frequency, value) and whether counting (C), deletion (D), or popularity-awareness (P) are supported (Yes/No/Maybe), or whether false negatives (FN) are introduced.
  • Making this choice and optimizing the parameters for the expected use cases are fundamental factors in achieving the desired performance in practice.
  • Ultimately, which probabilistic data structure is best suited depends a lot on the application specifics.

IV. BLOOM FILTERS IN DISTRIBUTED COMPUTING

  • The authors have surveyed techniques for probabilistic representation of sets and functions.
  • The applications of these structures are manifold, and they are widely used in various networking systems, such as Web proxies and caches, database servers, and routers.
  • In packet routing and forwarding, Bloom filters and their variants play important roles in flow detection and classification.
  • Probabilistic techniques can be used to store and process measurement data summaries in routers and other network entities.
  • For more detail, see Figure 15 at the end of this article.

A. Caching

  • Bloom filters have been applied extensively to caching in distributed environments.
  • Figure 10 illustrates the use of a Bloom filter-based summary cache at a proxy.
  • Within a single proxy, a Bloom filter representing the local content cache needs to be recreated when the content changes.
  • Each chunk modulo the digest size is used as the value for one of the Bloom filter hash functions.
  • Bigtable uses Bloom filters to reduce the disk lookups for non-existent rows or columns [65].

B. P2P Networks

  • Bloom filters have been extensively applied in P2P environments for various tasks, such as compactly storing keyword-based searches and indices [67], synchronizing sets over the network, and summarizing content.
  • In [68], the applications and parameters of Bloom filters in P2P networks are discussed.
  • Ideally, the state should be such that it allows for accurate matching of queries and takes sublinear space (or near constant space).
  • They present a locality-aware P2P system architecture called Foreseer, which explicitly exploits geographical locality and temporal locality by constructing a neighbor overlay and a friend overlay, respectively.
  • Tribler uses Bloom filters to keep the databases that maintain the social trust network synchronized between peers.

C. Packet Routing and Forwarding

  • Bloom filters have been used to improve network router performance [76].
  • In [77], Bloom filters are used for high-speed network packet filtering.
  • By using direct lookup array and Controlled Prefix Expansion (CPE), worst-case performance is limited to two hash probes and one array access per lookup.
  • The other extreme approach to support multicast is to move state from the network elements to the packets themselves in the form of Bloom filter-based representations of the multicast trees.
  • More importantly, matching of an incoming packet can now be performed in parallel over all tuples.

D. Monitoring and Measurement

  • Network monitoring and measurement are key application areas for Bloom filters and their variants.
  • The authors briefly examine some key cases in this domain, for example the detection of heavy flows, Iceberg queries, packet attribution, and approximate state machines.
  • Bloom filter variants that are able to count elements are good candidate structures for supporting Iceberg queries.
  • Packet and payload attribution is another application area in measurement for Bloom filters.
  • It solves the central problems (counter space and flow-to-counter association) of per-flow measurement by "braiding" a hierarchy of counters with random graphs.

E. Security

  • The hashing nature of the Bloom filter makes it a natural fit for security applications.
  • Two years later, Manber and Wu [108] presented two extensions to enhance the Bloom-filter-based check for weak passwords.
  • When the CBF was empty to the degree α, the attack string was considered detected, and the full string matcher was used to check for false positives.
  • The authors report a greater than 99% detection rate and false positive ratios of 1% or less.
  • In [118], Wolf presents a mechanism where packet forwarding is dependent on credentials represented as a packet-header-sized Bloom filter.

F. Other Applications

  • This section summarizes use of Bloom filters in several other interesting applications.
  • Figure 14 shows an overview of device wakeup using a Bloom filter.
  • Millions of path queries can be stored efficiently.
  • Their Bloom pre-calculation scheme provides high-speed identification with a small amount of memory by storing pre-calculated outputs of the tags in Bloom filters.
  • The differential file, with updated records, would be accessed only when the record to fetch was contained in the Bloom filter, indicating that the record in the database is not up-to-date.

V. SUMMARY

  • Bloom filters are a general aid for network processing and improving the performance and scalability of distributed systems.
  • In Figure 15, the Bloom filter variants introduced in this paper are categorized by application domain and supported features.
  • Variants that support a certain feature are found inside a highlighted area labeled with the name of that feature.
  • The variants that support this are derived from the Counting Bloom Filter and include an array of fixed- or variable-size counters.
  • These allow, for example, in-word matches for text search.


Theory and Practice of Bloom Filters for Distributed Systems

Sasu Tarkoma, Christian Esteve Rothenberg, and Eemil Lagerspetz

(S. Tarkoma and E. Lagerspetz are with the University of Helsinki, Department of Computer Science. C. E. Rothenberg is with the University of Campinas (Unicamp), Department of Computer Engineering and Industrial Automation.)
Abstract—Many network solutions and overlay networks utilize probabilistic techniques to reduce information processing and networking costs. This survey article presents a number of frequently used and useful probabilistic techniques. Bloom filters and their variants are of prime importance, and they are heavily used in various distributed systems. This has been reflected in recent research, and many new algorithms have been proposed for distributed systems that are either directly or indirectly based on Bloom filters. In this survey, we give an overview of the basic and advanced techniques, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.
Index Terms—Bloom filters, probabilistic structures, distributed systems
I. INTRODUCTION
Many network solutions and overlay networks utilize prob-
abilistic techniques to reduce information processing and net-
working costs. This survey presents a number of frequently
used and useful probabilistic techniques. Bloom filters (BF)
and their variants are of prime importance, and they are heavily
used in various distributed systems. This has been reflected in
recent research and many new algorithms have been proposed
for distributed systems that are either directly or indirectly
based on Bloom filters.
Fast matching of arbitrary identifiers to values is a basic
requirement for a large number of applications. Data objects
are typically referenced using locally or globally unique identi-
fiers. Recently, many distributed systems have been developed
using probabilistic globally unique random bit strings as node
identifiers. For example, a node tracks a large number of peers
that advertise files or parts of files. Fast mapping from host
identifiers to object identifiers and vice versa are needed. The
number of these identifiers in memory may be great, which
motivates the development of fast and compact matching
algorithms.
Given that there are millions or even billions of data
elements, developing efficient solutions for storing, updating,
and querying them becomes increasingly important. The key
idea behind the data structures discussed in this survey is that
by allowing the representation of the set of elements to lose
some information, in other words to become lossy, the storage
requirements can be significantly reduced.
The data structures presented in this survey for probabilistic
representation of sets are based on the seminal work by Burton
Bloom in 1970. Bloom first described a compact probabilistic
data structure that was used to represent words in a dictionary.
There was little interest in using Bloom filters for networking
until 1995, after which this area has gained widespread interest
both in academia and in the industry. This survey provides
an up-to-date view to this emerging area of research and
development that was first surveyed in the work of Broder
and Mitzenmacher [1].
Section II introduces the functionality and parameters of the
Bloom filter as a hash-based, probabilistic data structure. The
theoretical analysis is complemented with practical examples
and common practices in the underpinning hashing techniques.
Section III surveys as many as twenty-three Bloom filter
variants discussing their key features and their differential be-
haviour. Section IV covers a number of recent applications in
distributed systems, such as caches, database servers, routers,
security, and packet forwarding relying on packet header size
Bloom filters. Finally, Section V concludes the survey with a
brief summary on the rationale behind the widespread use of
the polymorphic Bloom filter data structure.
II. BLOOM FILTERS
The Bloom filter is a space-efficient probabilistic data struc-
ture that supports set membership queries. The data structure
was conceived by Burton H. Bloom in 1970 [2]. The structure
offers a compact probabilistic way to represent a set that can
result in false positives (claiming an element to be part of
the set when it was not inserted), but never in false negatives
(reporting an inserted element to be absent from the set). This
makes Bloom filters useful for many different kinds of tasks
that involve lists and sets. The basic operations involve adding
elements to the set and querying for element membership in
the probabilistic set representation.
The basic Bloom filter does not support the removal of ele-
ments; however, a number of extensions have been developed
that also support removals. The accuracy of a Bloom filter
depends on the size of the filter, the number of hash functions
used in the filter, and the number of elements added to the set.
The more elements are added to a Bloom filter, the higher the
probability that the query operation reports false positives.
Broder and Mitzenmacher have coined the Bloom filter
principle [1]:
Whenever a list or set is used, and space is at a
premium, consider using a Bloom filter if the effect
of false positives can be mitigated.
A Bloom filter is an array of m bits for representing a set $S = \{x_1, x_2, \ldots, x_n\}$ of n elements. Initially all the bits in the filter are set to zero. The key idea is to use k hash functions, $h_i(x)$, $1 \le i \le k$, to map items $x \in S$ to random numbers uniform in the range $1, \ldots, m$. The hash functions are assumed to be uniform. The MD5 hash algorithm is a popular choice for the hash functions.
An element $x \in S$ is inserted into the filter by setting the bits $h_i(x)$ to one for $1 \le i \le k$. Conversely, $y$ is assumed a member of $S$ if the bits $h_i(y)$ are set, and guaranteed not to be a member if any bit $h_i(y)$ is not set. Algorithm 1 presents the pseudocode for the insertion operation. Algorithm 2 gives the pseudocode for the membership test of a given element x in the filter. The weak point of Bloom filters is the possibility of a false positive. False positives are elements that are not part of S but are reported as being in the set by the filter.
Data: x is the object key to insert into the Bloom filter.
Function: insert(x)
for j := 1 ... k do            /* loop over all k hash functions */
    i := h_j(x);
    if B[i] == 0 then          /* the filter had a zero bit at position i */
        B[i] := 1;
    end
end
Algorithm 1: Pseudocode for Bloom filter insertion

Data: x is the object key for which membership is tested.
Function: ismember(x) returns true or false to the membership test
m := 1;
j := 1;
while m == 1 and j <= k do
    i := h_j(x);
    if B[i] == 0 then
        m := 0;
    end
    j := j + 1;
end
return m;
Algorithm 2: Pseudocode for Bloom filter membership test
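As a concrete companion to Algorithms 1 and 2, here is a minimal Python sketch (not from the paper); deriving the k positions by salting a single MD5 digest is an implementation convenience standing in for k independent uniform hash functions:

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m = m            # number of bits
            self.k = k            # number of hash functions
            self.bits = [0] * m   # the bit array B

        def _positions(self, x):
            # k salted digests stand in for k independent hash functions h_1..h_k.
            for j in range(self.k):
                h = hashlib.md5(f"{j}:{x}".encode()).hexdigest()
                yield int(h, 16) % self.m

        def insert(self, x):          # Algorithm 1
            for i in self._positions(x):
                self.bits[i] = 1

        def ismember(self, x):        # Algorithm 2
            return all(self.bits[i] == 1 for i in self._positions(x))

    bf = BloomFilter(m=32, k=3)
    for e in ("x", "y", "z"):
        bf.insert(e)
    print(bf.ismember("x"), bf.ismember("w"))   # True, (almost surely) False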
Figure 1 presents an overview of a Bloom filter. The Bloom
filter consists of a bitstring of length 32. Three elements have
been inserted, namely x, y, and z. Each of the elements have
been hashed using k = 3 hash functions to bit positions in
the bitstring. The corresponding bits have been set to 1. Now,
when an element not in the set, w, is looked up, it will be
hashed using the same three hash functions into bit positions.
In this case, one of the positions is zero and hence the Bloom
filter reports correctly that the element is not in the set. It may
happen that all the bit positions of an element report that the
corresponding bits have been set. When this occurs, the Bloom
filter will erroneously report that the element is a member of
the set. These erroneous reports are called false positives. We
observe that for the inserted elements, the hashed positions
correctly report that the bit is set in the bitstring.
Fig. 1. Overview of a Bloom filter
Fig. 2. Addition and query example using a Bloom filter

Figure 2 illustrates a practical example of a Bloom filter through adding and querying elements. In this example, the Bloom filter is a bitstring of length 16. The bit positions are numbered 0 to 15, from right to left. Three hash functions are used: $h_1$, $h_2$, and $h_3$, being MD5, SHA1, and CRC32,
respectively.
respectively. The elements added are text strings containing
only a single letter. The Bloom filter starts out empty, with
all bits unset, or zero. When adding an element, the values
of h
1
through h
3
(modulo 16) are calculated for the element,
and corresponding bit positions are set to one. After adding
a and b, the Bloom filter has positions 15, 9, 8, 3 and 1 set.
In this case, a and b have one common bit position (8). We
further add elements y and l. After this, positions 15, 14, 13,
10, 9, 8, 7, 5, 3 and 1 are set. When we query for q and z, the
same hash functions are used. Bit positions that correspond
to q and z are examined. If the three bits for an element
are set, that element is assumed to be present. In the case
of q, position 0 is not set, and therefore q is guaranteed not to
be present in the Bloom filter. However, z is assumed to be
present, since the corresponding bits have been set. We know
that z is a false positive: it is reported present though it is not
actually contained in the set of added elements. The bits that
correspond to z (positions 15, 10 and 7) were set through the
addition of elements b, y and l.
For optimal performance, each of the k hash functions
should be a member of the class of universal hash functions,
which means that the hash functions map each item in the
universe to a random number uniform over the range. The
development of uniform hashing techniques has been an
active area of research. An almost ideal solution for uniform
hashing is presented in [3]. In practice, hash functions yielding
sufficiently uniformly distributed outputs, such as MD5 or
CRC32, are useful for most probabilistic filter purposes. For
candidate implementations, see the empirical evaluation of 25
hash functions by Henke et al. [4]. Later in Section II-C we
discuss relevant hashing techniques further.
A Bloom filter constructed based on S requires space O(n) and can answer membership queries in O(1) time. Given $x \in S$, the Bloom filter will always report that x belongs to S, but given $y \notin S$ the Bloom filter may report that $y \in S$.

TABLE I
KEY BLOOM FILTER PARAMETERS

  Parameter                           Effect of increase
  Number of hash functions (k)        More computation; lower false positive rate as $k \to k_{opt}$
  Size of filter (m)                  More space needed; lower false positive rate
  Number of elements in the set (n)   Higher false positive rate
Table I examines the behaviour of three key parameters
when their value is either decreased or increased. Increasing
or decreasing the number of hash functions towards $k_{opt}$ can lower the false positive ratio while increasing computation in
insertions and lookups. The cost is directly proportional to the
number of hash functions. The size of the filter can be used to
tune the space requirements and the false positive rate (fpr).
A larger filter will result in fewer false positives. Finally, the
size of the set that is inserted into the filter determines the
false positive rate. We note that although no false negatives
(fn) occur with regular BFs, some variants will be presented
later in the article that may result in false negatives.
A. False Positive Probability
We now derive the false positive probability of a Bloom filter and the optimal number of hash functions for a given target false positive probability. We start with the assumption that a hash function selects each array position with equal probability. Let m denote the number of bits in the Bloom filter. When inserting an element into the filter, the probability that a certain bit is not set to one by a hash function is

    $1 - \frac{1}{m}$.    (1)
Now, there are k hash functions, and the probability that none of them has set a specific bit to one is given by

    $\left(1 - \frac{1}{m}\right)^{k}$.    (2)

After inserting n elements into the filter, the probability that a given bit is still zero is

    $\left(1 - \frac{1}{m}\right)^{kn}$.    (3)

And consequently the probability that the bit is one is

    $1 - \left(1 - \frac{1}{m}\right)^{kn}$.    (4)

For an element membership test, if all of the k array positions in the filter computed by the hash functions are set to one, the Bloom filter claims that the element belongs to the set. The probability of this happening when the element is not part of the set is given by

    $\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$.    (5)
Fig. 3. False positive probability rate for Bloom filters (p as a function of the number of inserted elements n, for filter sizes m = 64, 512, 1024, 2048, and 4096).
We note that $e^{-kn/m}$ is a very close approximation of $\left(1 - \frac{1}{m}\right)^{kn}$ [1]. The false positive probability decreases as the size of the Bloom filter, m, increases. The probability increases with n as more elements are added. Now, we want to minimize the probability of false positives by minimizing $\left(1 - e^{-kn/m}\right)^{k}$ with respect to k. This is accomplished by taking the derivative and setting it equal to zero, which gives the optimal value of k:

    $k_{opt} = \frac{m}{n} \ln 2 \approx \frac{9m}{13n}$.    (6)

This results in the false positive probability of

    $\left(\frac{1}{2}\right)^{k} \approx 0.6185^{m/n}$.    (7)
Using the optimal number of hashes $k_{opt}$, the false positive probability can be rewritten and bounded:

    $p \ge \left(\frac{1}{2}\right)^{\frac{m}{n} \ln 2}$.    (8)

This means that in order to maintain a fixed false positive probability, the length of a Bloom filter must grow linearly with the number of elements inserted in the filter. The number of bits m for the desired number of elements n and false positive rate p is given by

    $m = -\frac{n \ln p}{(\ln 2)^2}$.    (9)
Figure 3 presents the false positive probability rate p as a
function of the number of elements n in the filter and the filter
size m. An optimal number of hash functions k = (m/n) ln 2
has been assumed.
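As a quick numerical companion to Equations (5), (6), and (9), the following Python helper (a sketch, not from the paper) dimensions a filter for a target false positive rate:

    import math

    def dimension_bloom(n, p):
        # Eq. (9): bits needed for n elements at target fpr p
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        # Eq. (6): optimal number of hash functions
        k = max(1, round((m / n) * math.log(2)))
        # Eq. (5): resulting approximate false positive probability
        fpr = (1 - math.exp(-k * n / m)) ** k
        return m, k, fpr

    print(dimension_bloom(n=10_000, p=0.01))  # roughly (95851, 7, ~0.01)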
There is a factor of $\log_2 e \approx 1.44$ between the amount of space used by a Bloom filter and the optimal amount of space that can be used. There are other data structures that use space closer to the lower bound, but they are more complicated (cf. [5], [6], [7]).
Recently, Bose et al. [8] have shown that the false positive
analysis originally given by Bloom and repeated in many sub-
sequent articles is optimistic and only a good approximation
for large Bloom filters. The revisited analysis proves that the
commonly used estimate (Eq. 5) is actually a lower bound and
the real false positive rate is larger than expected by theory,
especially for small values of m.
B. Operations
Standard Bloom filters do not support the removal of
elements. Removal of an element can be implemented by
using a second Bloom filter that contains elements that have
been removed. The problem of this approach is that the false
positives of the second filter result in false negatives in the
composite filter, which is undesirable. Therefore a number of
dedicated structures have been proposed that support deletions.
These are examined later in this survey.
A number of operations involving Bloom filters can be
implemented easily, for example the union and halving of a
Bloom filter. The bit-vector nature of the Bloom filter allows
the union of two or more Bloom filters simply by performing
bitwise OR on the bit-vectors. Given two sets $S_1$ and $S_2$, a Bloom filter B that represents the union $S = S_1 \cup S_2$ can be created by taking the OR of the original Bloom filters, $B = B_1 \lor B_2$, assuming that m and the hash functions are the same. The merged filter B will report any element belonging to $S_1$ or $S_2$ as belonging to set S. The following theorem gives a lower bound for the false positive rate of the union of Bloom filters [9]:

Theorem 1: The false positive probability of $BF(A \cup B)$ is not less than that of $BF(A)$ and $BF(B)$. At the same time, the false positive probability of $BF(A) \cup BF(B)$ is also not less than that of $BF(A)$ and $BF(B)$.
If the BF size m is divisible by 2, halving can be easily
done by bitwise ORing the first and second halves together.
Now, the range of the hash functions needs to be accordingly
constrained, for instance, by applying the mod(m/2) to the
hash outputs.
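A short Python sketch of these two bit-vector operations (assuming equal m and identical hash functions, as stated above; names are illustrative):

    def union(bits_a, bits_b):
        # Bitwise OR of two equally sized Bloom filter bit arrays.
        assert len(bits_a) == len(bits_b)
        return [a | b for a, b in zip(bits_a, bits_b)]

    def halve(bits):
        # OR the first and second halves together; lookups must then
        # take the hash outputs mod (m/2).
        m = len(bits)
        assert m % 2 == 0
        half = m // 2
        return [bits[i] | bits[half + i] for i in range(half)]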
Bloom filters can be used to approximate set intersection; however, this is more complicated than the union operation. One straightforward approach is to assume the same m and hash functions and to take the logical AND operation between the two bit-vectors. The following theorem gives the probability for this to hold [9]:

Theorem 2: If $BF(A \cap B)$, $BF(A)$, and $BF(B)$ use the same m and hash functions, then $BF(A \cap B) = BF(A) \land BF(B)$ with probability $(1 - 1/m)^{k^2 |A - A \cap B| \times |B - A \cap B|}$.

The inner product of the bit-vectors is an indicator of the size of the intersection [1]. The idea of a bloomjoin was presented by Mackert and Lohman in 1986 [10]. In a bloomjoin, two hosts, A and B, compute the intersection of two sets $S_1$ and $S_2$, where A has the first set and B the second. It is not feasible to send all the elements from A to B, and vice versa. In a bloomjoin, $S_1$ is represented using a Bloom filter and sent from A to B. B can then compute the intersection and send back this set. Host A can then check false positives with B in a final round.
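The bloomjoin exchange can be sketched as follows in Python (message passing is elided; the filter construction and the salted-hash scheme are illustrative assumptions):

    import hashlib

    def positions(x, m=1024, k=3):
        # k salted digest positions standing in for independent hashes
        return [int(hashlib.sha1(f"{j}:{x}".encode()).hexdigest(), 16) % m
                for j in range(k)]

    def to_filter(s, m=1024, k=3):
        bits = [0] * m
        for x in s:
            for i in positions(x, m, k):
                bits[i] = 1
        return bits

    def bloomjoin(s1, s2, m=1024, k=3):
        # Host A sends BF(S1); host B computes the candidate matches,
        # i.e. elements of S2 passing the filter (may include false positives).
        bf = to_filter(s1, m, k)
        candidates = {x for x in s2
                      if all(bf[i] for i in positions(x, m, k))}
        # Final round at host A removes the false positives exactly.
        return candidates & s1

    print(bloomjoin({"a", "b", "c"}, {"b", "c", "d"}))  # {'b', 'c'}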
C. Hashing techniques
Hash functions are the key building block of probabilistic
filters. There is a large literature on hash functions spanning
from randomness analysis to security evaluation over many
networking and computing applications. We focus on the best
practices and recent developments in hashing techniques which
are relevant to the performance and practicality of Bloom filter
constructs. For further details, deeper theoretical foundations
and system-specific applications we refer to related work, such
as [4], [11], [12], [13].
One noteworthy property of Bloom filters is that the false
positive performance depends only on the bit-per-element ratio
(m/n) and not on the form or size of the hashed elements.
As long as the size of the elements can be bounded, hashing
time can be assumed to be a constant factor. Considering the
trend in computational power versus memory access time, the
practical bottleneck is the amount of (slow) memory accesses
rather than the hash computation time. Nevertheless, whenever
a filter application needs to run at line speed, hardware-
amenable per-packet operations are critical [13].
In the following subsections, we briefly present hashing
techniques that are the basis for good Bloom filter implemen-
tations. We start with perfect hashing, which is an alternative
to Bloom filters when the set is known beforehand and it is
static. Double hashing allows reducing the number of true hash
computations. Partitioned hashing and multiple hashing deal
with how bits are allocated in a Bloom filter. Finally, the use
of simple hash functions is considered.
1) Perfect Hashing Scheme: A simple technique called
perfect hashing (or explicit hashing) can be used to store a
static set S of values in an optimal manner using a perfect hash
function. A perfect hash function is a computable bijection
from S to an array of |S| = n hash buckets. The n-size
array can be used to store the information associated with each element $x \in S$ [5].
Bloom filter like functionality can be obtained by, given a set of elements S, first finding a perfect hash function P and then storing at each location an $f = \lceil \log_2(1/\epsilon) \rceil$ bit fingerprint, computed using some (pseudo-)random hash function H. Figure 4 illustrates this perfect hashing scheme.

Lookup of x simply consists of computing P(x) and checking whether the stored hash function value matches H(x). When $x \in S$, the correct value is always returned, and when $x \notin S$ a false positive (claiming the element is in S) occurs with probability at most $\epsilon$. This follows from the definition of 2-universal hashing by Carter and Wegman [14]: any element y not in S has probability at most $\epsilon$ of having the same hash function value h(y) as the element in S that maps to the same entry of the array.
While space efficient, this approach is unsuitable for dynamic environments, because the perfect hash function needs to be recomputed whenever the set S changes.

Fig. 4. Example of explicit hashing (elements 1-5 mapped by a perfect hash function to an array of fingerprints).
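As a toy illustration (not the construction of [5] or [15]), the following Python sketch brute-forces a seed that makes a salted hash perfect on a tiny static set, then stores f-bit fingerprints:

    import hashlib

    def h(x, seed, mod):
        return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) % mod

    def build(S, f=8):
        n = len(S)
        # Brute-force a seed under which h(., seed, n) is a bijection on S;
        # feasible only for tiny static sets, hence "toy".
        seed = 0
        while len({h(x, seed, n) for x in S}) < n:
            seed += 1
        table = [None] * n
        for x in S:
            table[h(x, seed, n)] = h(x, "fp", 2 ** f)  # f-bit fingerprint
        return seed, table

    def query(x, seed, table, f=8):
        # False positives occur with probability about 2^-f.
        return table[h(x, seed, len(table))] == h(x, "fp", 2 ** f)

    seed, table = build({"a", "b", "c", "d"})
    print(query("a", seed, table), query("zzz", seed, table))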
Another technique for minimal perfect hashing was intro-
duced by Antichi et al. [15]. It relies on Bloom filters and
Blooming Trees to turn the imperfect hashing of a Bloom
filter into a perfect hashing. The technique gives space and
time savings. This technique also requires a static set S, but
can handle a huge number of elements.
2) Double Hashing: The improvement of the double hashing technique over basic hashing is being able to generate k hash values based on only two universal hash functions as base generators (or "seed" hashes). As a practical consequence, Bloom filters can be built with fewer hashing operations without sacrificing performance. Kirsch and Mitzenmacher have shown [16] that only two independent hash functions, $h_1(x)$ and $h_2(x)$, are required to generate additional "pseudo" hashes defined as:

    $h_i(x) = h_1(x) + f(i) \, h_2(x)$    (10)

where i is the hash value index, f(i) can be any arbitrary function of i (e.g., $i^2$), and x is the element being hashed. For Bloom filter operations, the double hashing scheme reduces the number of true hash computations from k down to two without any increase in the asymptotic false positive probability [16].
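A brief Python sketch of Equation (10), here with f(i) = i and MD5/SHA1 as the two base hashes (both choices are assumptions, not prescriptions from [16]):

    import hashlib

    def k_positions(x, m, k):
        # Eq. (10) with f(i) = i: h_i(x) = h1(x) + i*h2(x) mod m.
        # Only two true hash computations, then k derived positions.
        h1 = int(hashlib.md5(x.encode()).hexdigest(), 16) % m
        h2 = int(hashlib.sha1(x.encode()).hexdigest(), 16) % m
        return [(h1 + i * h2) % m for i in range(k)]

    print(k_positions("flow-42", m=1024, k=5))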
3) Partitioned Hashing: In this hashing technique, the k hash functions are allocated disjoint ranges of m/k consecutive bits instead of the full m-bit array space. Following the same false positive probability analysis of Sec. II-A, the probability of a specific bit being 0 in a partitioned Bloom filter can be approximated as:

    $(1 - k/m)^{n} \approx e^{-kn/m}$    (11)

While the asymptotic performance remains the same, in practice, partitioned Bloom filters exhibit a poorer false positive performance as they tend to have larger fill factors (more 1s) due to the m/k bit range restriction. This can be explained by the observation that:

    $(1 - 1/m)^{kn} > (1 - k/m)^{n}$    (12)
4) Multiple Hashing: Multiple hashing is a popular technique that exploits the notion of having multiple hash choices and having the power to choose the most convenient candidate. When applied to hash table constructions, multiple hashing provides a probabilistic method to limit the effects of collisions by distributing elements more or less evenly. The original idea was proposed by Azar et al. in their seminal work on balanced allocations [17]. Formulating hashing as a balls-into-bins problem, the authors show that if n balls are placed sequentially into m bins for m = O(n), with each ball being placed in one of a constant number $d \ge 2$ of randomly chosen bins, then, after all balls are inserted, the maximal load in a bin is, with high probability, $(\ln \ln n)/\ln d + O(1)$. Vöcking et al. [18] elaborate on this observation and propose the always-go-left algorithm (or d-left hashing scheme) to break ties when inserting (chained) elements into the least loaded one among the d partitioned candidates.
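A minimal Python sketch of d-left placement (bucket counters stand in for chained elements; the salted-hash scheme and sizes are assumptions):

    import hashlib

    def dleft_insert(x, buckets, d):
        # buckets is a list of d equally sized sub-tables (partitions).
        size = len(buckets[0])
        # One candidate bucket per partition, from d salted hashes.
        cands = [int(hashlib.sha1(f"{j}:{x}".encode()).hexdigest(), 16) % size
                 for j in range(d)]
        # Choose the least loaded candidate; on ties, always go left (lowest j).
        j = min(range(d), key=lambda j: (buckets[j][cands[j]], j))
        buckets[j][cands[j]] += 1

    buckets = [[0] * 64 for _ in range(4)]   # d = 4 partitions of 64 buckets
    for i in range(1000):
        dleft_insert(f"item{i}", buckets, d=4)
    print(max(max(b) for b in buckets))      # maximal load stays small w.h.p.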
As a result this hashing technique provides an almost
optimal (up to an additive constant) load-balancing scheme.
In addition to the balancing improvement, partitioning the
hash buckets (i.e., bins) into groups makes d-left hashing
more hardware friendly as it allows the parallelized look-
up of the d hash locations. Thus, hash partitioning and tie-
breaking have elevated d-left hashing as an optimal technique
for building high performance (negligible overflow probabil-
ities) data structures such as the multiple level hash tables
(MHT) [19] or counting Bloom filters [20]. A breakthrough
Bloom filter design was recently proposed using an open-
addressed multiple choice hash table based on d-left hashing,
element fingerprints (a smaller representation like the last f
bits of the element hash) and dynamic bit reassignment [21].
After all optimizations, the authors show that the performance
is comparable to plain Bloom filter constructs, outperforms
traditional counting Bloom filter constructs (see d-left CBF
in Sec. III-B), and easily extensible to support practical
networking applications (e.g., flow tracking in Sec. IV-D).
The power of (two) choices has been exploited by Lumetta
and Mitzenmacher to improve the false positive performance
of Bloom filters [22]. The key idea consists of considering not
one but two groups of k hash functions. On element insertion,
the selection criteria is based on the group of k hash functions
that sets fewer bits to 1. The caveat is that when checking for
elements, both groups of k hash functions need to be checked
since there is no information on which group was initially used
and false positives can potentially be claimed for either group.
Although it may appear counter-intuitive, under some settings (high m/n ratios), setting fewer ones in the filter actually pays off despite the double checking operations.
Fundamentally similar in exploiting the power of choices
in producing less dense (improved) Bloom filters, the method
proposed by Hao et al. [23] is based on a partitioned hashing
technique which results in a choice of hash functions that set
fewer bits. Experimental results show that this improvement
can be as much as a ten-fold increase in performance over
standard constructs. However, the choice of hash functions
cannot be done on an element basis as in [22], and its
applicability is constrained to non-dynamic environments.
5) Simple hash functions: A common assumption is to
consider output hash values as truly random, that is, each
hashed element is independently mapped to a uniform location.
While this is a great aid to theoretical analyses, hash function
implementations are known to behave far worse than truly ran-
dom ones. On the other hand, empirical works using standard
universal hashing have been reporting negligible differences in
practical performance compared to predictions assuming ideal
hashing (see [24] for the case of Bloom filters).
Mitzenmacher and Vadhan [25] provide the seeds to formally explain this gap between the theory and practice

Citations
More filters
Journal ArticleDOI
01 Jan 2015
TL;DR: This paper presents an in-depth analysis of the hardware infrastructure, southbound and northbound application programming interfaces (APIs), network virtualization layers, network operating systems (SDN controllers), network programming languages, and network applications, and presents the key building blocks of an SDN infrastructure using a bottom-up, layered approach.
Abstract: The Internet has led to the creation of a digital society, where (almost) everything is connected and is accessible from anywhere. However, despite their widespread adoption, traditional IP networks are complex and very hard to manage. It is both difficult to configure the network according to predefined policies, and to reconfigure it to respond to faults, load, and changes. To make matters even more difficult, current networks are also vertically integrated: the control and data planes are bundled together. Software-defined networking (SDN) is an emerging paradigm that promises to change this state of affairs, by breaking vertical integration, separating the network's control logic from the underlying routers and switches, promoting (logical) centralization of network control, and introducing the ability to program the network. The separation of concerns, introduced between the definition of network policies, their implementation in switching hardware, and the forwarding of traffic, is key to the desired flexibility: by breaking the network control problem into tractable pieces, SDN makes it easier to create and introduce new abstractions in networking, simplifying network management and facilitating network evolution. In this paper, we present a comprehensive survey on SDN. We start by introducing the motivation for SDN, explain its main concepts and how it differs from traditional networking, its roots, and the standardization activities regarding this novel paradigm. Next, we present the key building blocks of an SDN infrastructure using a bottom-up, layered approach. We provide an in-depth analysis of the hardware infrastructure, southbound and northbound application programming interfaces (APIs), network virtualization layers, network operating systems (SDN controllers), network programming languages, and network applications. We also look at cross-layer problems such as debugging and troubleshooting. In an effort to anticipate the future evolution of this new paradigm, we discuss the main ongoing research efforts and challenges of SDN. In particular, we address the design of switches and control platforms—with a focus on aspects such as resiliency, scalability, performance, security, and dependability—as well as new opportunities for carrier transport networks and cloud providers. Last but not least, we analyze the position of SDN as a key enabler of a software-defined environment.

3,589 citations


Cites background from "Theory and Practice of Bloom Filter..."

  • ...traffic matrix estimation [262], fine-grained monitoring of wildcard rules [365], two-stage Bloom filters [366] to represent monitoring rules and provide high measurement accuracy without incurring in extra memory or control plane traffic overhead [309], and special monitoring functions (extensions to OpenFlow) in forwarding devices to reduce traffic and...

    [...]

Posted Content
TL;DR: Software-Defined Networking (SDN) as discussed by the authors is an emerging paradigm that promises to change this state of affairs, by breaking vertical integration, separating the network's control logic from the underlying routers and switches, promoting (logical) centralization of network control, and introducing the ability to program the network.
Abstract: Software-Defined Networking (SDN) is an emerging paradigm that promises to change this state of affairs, by breaking vertical integration, separating the network's control logic from the underlying routers and switches, promoting (logical) centralization of network control, and introducing the ability to program the network. The separation of concerns introduced between the definition of network policies, their implementation in switching hardware, and the forwarding of traffic, is key to the desired flexibility: by breaking the network control problem into tractable pieces, SDN makes it easier to create and introduce new abstractions in networking, simplifying network management and facilitating network evolution. In this paper we present a comprehensive survey on SDN. We start by introducing the motivation for SDN, explain its main concepts and how it differs from traditional networking, its roots, and the standardization activities regarding this novel paradigm. Next, we present the key building blocks of an SDN infrastructure using a bottom-up, layered approach. We provide an in-depth analysis of the hardware infrastructure, southbound and northbound APIs, network virtualization layers, network operating systems (SDN controllers), network programming languages, and network applications. We also look at cross-layer problems such as debugging and troubleshooting. In an effort to anticipate the future evolution of this new paradigm, we discuss the main ongoing research efforts and challenges of SDN. In particular, we address the design of switches and control platforms -- with a focus on aspects such as resiliency, scalability, performance, security and dependability -- as well as new opportunities for carrier transport networks and cloud providers. Last but not least, we analyze the position of SDN as a key enabler of a software-defined environment.

1,968 citations

Journal ArticleDOI
TL;DR: A novel taxonomy is introduced to study Named Data Networking features in depth and identifies a set of open challenges which should be addressed by researchers in due course.

228 citations


Cites methods from "Theory and Practice of Bloom Filter..."

  • ...[57] have used modified bloom filter [63] and proposed a new mapping bloom filter (MBF)....

    [...]

Journal ArticleDOI
TL;DR: Lighter is a fast, memory-efficient tool for correcting sequencing errors that uses a pair of Bloom filters, one holding a sample of the input k-mers and the other likely to be correct, and is both faster and more memory- efficient than competing approaches while achieving comparable accuracy.
Abstract: Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.

222 citations

Proceedings ArticleDOI
09 May 2017
TL;DR: Monkey, an LSM-based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget is presented, and how to use this model to answer what-if design questions about how changes in environmental parameters impact performance is shown.
Abstract: In this paper, we show that key-value stores backed by an LSM-tree exhibit an intrinsic trade-off between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult to tune trade-off among these metrics. We pinpoint the problem to the fact that all modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters' false positive rates in each level. We present Monkey, an LSM-based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget. The insight is that worst-case lookup cost is proportional to the sum of the false positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize this sum. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of LevelDB that Monkey reduces lookup latency by an increasing margin as the data volume grows (50%-80% for the data sizes we experimented with). Furthermore, we map the LSM-tree design space onto a closed-form model that enables co-tuning the merge policy, the buffer size and the filters' false positive rates to trade among lookup cost, update cost and/or main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the various LSM-tree design elements accordingly.

172 citations


Cites background from "Theory and Practice of Bloom Filter..."

  • ...This relationship is captured by the following equation [32]....

References
Journal ArticleDOI
TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
Abstract: In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency.The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods.In such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to “catch” the small fraction of errors associated with the new methods. An example is discussed which illustrates possible areas of application for the new methods.Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.

7,390 citations


Additional excerpts

  • ...Bloom in 1970 [2]....

Proceedings Article
01 Jan 2006
TL;DR: Bigtable is a distributed storage system for managing structured data, designed to scale to a very large size: petabytes of data across thousands of commodity servers; projects using it at Google include web indexing, Google Earth, and Google Finance.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

4,843 citations

Proceedings Article
07 Sep 1999
TL;DR: Experimental results indicate that the novel scheme for approximate similarity search based on hashing scales well even for a relatively large number of dimensions, and that it gives significant improvement in running time over methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
Abstract: The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

3,705 citations


"Theory and Practice of Bloom Filter..." refers methods in this paper

  • ...The DSBF is implemented using locality-sensitive hash functions [53], [54] and allows false positives and false...

Journal ArticleDOI
12 Nov 2000
TL;DR: OceanStore's monitoring of usage patterns allows adaptation to regional outages and denial-of-service attacks; monitoring also enhances performance through pro-active movement of data.
Abstract: OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through pro-active movement of data. A prototype implementation is currently under development.

3,376 citations


"Theory and Practice of Bloom Filter..." refers methods in this paper

  • ...Rhea and Kubiatowicz [69] designed a probabilistic routing algorithm for P2P location mechanisms in the OceanStore project....

Journal ArticleDOI
TL;DR: The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
Abstract: A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in recent years have they become popular in the networking literature. The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.

2,199 citations


"Theory and Practice of Bloom Filter..." refers background in this paper

  • ...Broder and Mitzenmacher have coined the Bloom filter principle [1]:...

  • ...The inner product of the bit-vectors is an indicator of the size of the intersection [1]....

  • ...$(1 - 1/m)^{kn}$ [1]....

  • ...The analysis from [27] reveals that 4 bits per counter should suffice for most applications [1], [28]....

  • ...and Mitzenmacher [1]....

Frequently Asked Questions (19)
Q1. What contributions have the authors mentioned in the paper "Theory and Practice of Bloom Filters for Distributed Systems"?

This survey article presents a number of frequently used and useful probabilistic techniques. This has been reflected in recent research and many new algorithms have been proposed for distributed systems that are either directly or indirectly based on Bloom filters. In this survey, the authors give an overview of the basic and advanced techniques, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization. 

Increasing or decreasing the number of hash functions towards k_opt can lower the false positive ratio, at the cost of additional computation during insertions and lookups. 

The accuracy of a Bloom filter depends on the size of the filter, the number of hash functions used in the filter, and the number of elements added to the set. 
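
This dependence is the false positive approximation derived earlier, p ≈ (1 − e^(−kn/m))^k. A minimal Python sketch of the relationship (illustrative, not code from the survey):

```python
import math

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability of a Bloom filter with
    m bits, n inserted elements, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> int:
    """Number of hash functions minimizing the false positive rate."""
    return max(1, round((m / n) * math.log(2)))

# Example: 10 bits per element with the optimal k gives roughly 0.8%.
m, n = 10_000, 1_000
k = optimal_k(m, n)
print(k, false_positive_rate(m, n, k))  # 7, ~0.0082
```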

When applied to hash table construction, multiple hashing provides a probabilistic method to limit the effects of collisions by distributing elements more or less evenly across the table. 

The element membership query iterates over the set of BFs in the DBF and returns true if any of them contains the element, as in the sketch below. 
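
A minimal sketch of a DBF in Python; the small BloomFilter class here is an illustrative stand-in (salted SHA-256 hashing and a crude capacity counter), not the survey's construction:

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter: m bits, k salted hashes."""
    def __init__(self, m=1024, k=4, capacity=100):
        self.bits, self.m, self.k = 0, m, k
        self.count, self.capacity = 0, capacity

    def _positions(self, element):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, element):
        for pos in self._positions(element):
            self.bits |= 1 << pos
        self.count += 1

    def contains(self, element):
        return all(self.bits >> pos & 1 for pos in self._positions(element))

    def is_full(self):
        return self.count >= self.capacity

class DynamicBloomFilter:
    """DBF sketch: a growing list of fixed-capacity Bloom filters."""
    def __init__(self):
        self.filters = [BloomFilter()]

    def add(self, element):
        if self.filters[-1].is_full():        # last filter at capacity
            self.filters.append(BloomFilter())
        self.filters[-1].add(element)

    def contains(self, element):
        # Membership query: true if ANY constituent filter matches.
        return any(bf.contains(element) for bf in self.filters)
```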

If the optimal number of hash functions k, chosen to minimize the false positive probability, is used, then the probability that any given bit is set in the filter's bit-vector is 1/2. 
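
This is a consequence of the optimal choice of k; a short derivation, consistent with the false positive analysis earlier in the survey:

```latex
k_{\mathrm{opt}} = \frac{m}{n}\ln 2
\quad\Longrightarrow\quad
\Pr[\text{bit set}]
  = 1 - \left(1 - \frac{1}{m}\right)^{k_{\mathrm{opt}} n}
  \approx 1 - e^{-k_{\mathrm{opt}} n / m}
  = 1 - e^{-\ln 2}
  = \frac{1}{2}
```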

The construction can adaptively reduce the false positive rate by removing some bits of the signature, thus effectively removing the flow from the structure. 

A related technique for handling time-varying sets, called double buffering, uses two bitmaps, one active and one inactive, to support time-dependent Bloom filters. 
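
One plausible reading of the scheme as a sketch (the exact warm-up policy varies between proposals); it reuses the illustrative BloomFilter class from the DBF sketch above:

```python
class DoubleBufferedFilter:
    """Double-buffering sketch for time-varying sets: inserts go to the
    active bitmap, and once it passes half capacity they also warm up
    the inactive bitmap, so that a swap keeps the recent elements."""
    def __init__(self, make_filter):
        self.make_filter = make_filter        # e.g. BloomFilter above
        self.active = make_filter()
        self.inactive = make_filter()

    def add(self, element):
        self.active.add(element)
        if self.active.count > self.active.capacity // 2:
            self.inactive.add(element)        # warm-up phase
        if self.active.is_full():
            # Warmed-up bitmap becomes active; the stale one is reset.
            self.active, self.inactive = self.inactive, self.make_filter()

    def contains(self, element):
        return self.active.contains(element)
```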

The proposed mechanism adapts to set growth by adding, as needed, "slices" of traditional Bloom filters with increasing sizes and tighter error probabilities. 

Set membership queries require testing for element presence in each filter; hence the requirement for increasing sizes and tighter error probabilities as the BF scales up (see the sketch below). 
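
A sketch of this scalable construction; make_slice is a hypothetical factory returning a Bloom filter sized for a given capacity and error target, and the growth and tightening parameters are illustrative defaults:

```python
class ScalableBloomFilter:
    """Scalable Bloom filter sketch: when the newest slice fills up,
    append a slice with larger capacity and a geometrically tighter
    error probability, so the compound error stays bounded (roughly
    error / (1 - tightening) over the series of slices)."""
    def __init__(self, make_slice, capacity=1000, error=0.01,
                 growth=2, tightening=0.5):
        self.make_slice = make_slice
        self.capacity, self.error = capacity, error
        self.growth, self.tightening = growth, tightening
        self.slices = [make_slice(capacity, error)]

    def add(self, element):
        if self.slices[-1].is_full():
            self.capacity *= self.growth
            self.error *= self.tightening
            self.slices.append(self.make_slice(self.capacity, self.error))
        self.slices[-1].add(element)

    def contains(self, element):
        # Every slice must be tested, which is why per-slice error
        # probabilities tighten as the structure grows.
        return any(s.contains(element) for s in self.slices)
```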

1) Perfect Hashing Scheme: A simple technique called perfect hashing (or explicit hashing) can be used to store a static set S of values in an optimal manner using a perfect hash function. 
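
Purely as an illustration of the idea, a brute-force search for a collision-free (perfect) salted hash over a tiny static set, with short fingerprints stored at the resulting slots; practical perfect hashing constructions are far more efficient and scale to large sets:

```python
import hashlib

def salted_hash(salt: int, key: str, buckets: int) -> int:
    digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return int(digest, 16) % buckets

def find_perfect_salt(keys, max_tries=1_000_000) -> int:
    """Try salts until one hashes every key to a distinct bucket.
    Only feasible for tiny sets (success probability ~ n!/n^n)."""
    n = len(keys)
    for salt in range(max_tries):
        if len({salted_hash(salt, k, n) for k in keys}) == n:
            return salt
    raise RuntimeError("no perfect salt found")

# Store a short fingerprint of each element at its perfect-hash slot.
keys = ["alpha", "beta", "gamma", "delta"]
salt = find_perfect_salt(keys)
table = [None] * len(keys)
for k in keys:
    slot = salted_hash(salt, k, len(keys))
    table[slot] = hashlib.sha256(k.encode()).hexdigest()[:4]
```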

This is motivated by applications such as Web caches and P2P information sharing, which frequently use Bloom filters to distribute routing tables. 

The caveat is that when checking for an element, both groups of k hash functions need to be tested, since there is no record of which group was used at insertion time, and false positives can arise from either group. 

When a string is inserted, it is first broken into blocks, which are inserted into the filter hierarchy starting from the lowest level, as in the sketch below. 
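
One possible reading of this hierarchical insertion as a sketch; the block size and the doubling of the span per level are illustrative assumptions, and `levels` is assumed to be a list of Bloom-filter-like objects, one per hierarchy level:

```python
def insert_string(levels, s, block=4):
    """Break the string into fixed-size blocks; level 0 stores single
    blocks, level 1 pairs of adjacent blocks, level 2 runs of four,
    and so on up the hierarchy."""
    blocks = [s[i:i + block] for i in range(0, len(s), block)]
    for level, bf in enumerate(levels):
        span = 2 ** level             # blocks per inserted substring
        for i in range(0, len(blocks) - span + 1, span):
            bf.add("".join(blocks[i:i + span]))
```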

The probability that the ith counter is incremented j times is a binomial random variable:

$$P(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk - j} \qquad (13)$$

The probability that any counter is at least j is bounded above by $m \, P(c(i) \geq j)$ (a union bound over the m counters), which can be calculated from the above formula. 
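
A small helper that evaluates this union bound numerically; the binomial terms are computed in log space to avoid overflow for large nk, and the parameter values in the example are illustrative:

```python
from math import exp, lgamma, log

def log_binom_pmf(N: int, t: int, p: float) -> float:
    """log of C(N, t) * p^t * (1 - p)^(N - t), via lgamma for stability."""
    return (lgamma(N + 1) - lgamma(t + 1) - lgamma(N - t + 1)
            + t * log(p) + (N - t) * log(1 - p))

def max_counter_bound(n: int, m: int, k: int, j: int, terms: int = 200) -> float:
    """Union bound m * P(c(i) >= j) from Eq. (13); the tail is summed
    term by term, and terms beyond the first few are negligible."""
    N, p = n * k, 1.0 / m
    tail = sum(exp(log_binom_pmf(N, t, p))
               for t in range(j, min(j + terms, N) + 1))
    return m * tail

# With ~10 counters per element and k = 7, the chance that ANY 4-bit
# counter overflows (reaches j = 16) is vanishingly small (~1e-11),
# which is why 4 bits per counter suffice for most applications.
print(max_counter_bound(n=10_000, m=100_000, k=7, j=16))
```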

The problem of knowing which candidate element fingerprint to delete, in case of fingerprint collisions, can be neatly solved by breaking the problem into two parts: creating the fingerprint, and finding the d locations by applying additional (pseudo-)random permutations. 

The range of the hash functions needs to be constrained accordingly, for instance by applying mod(m/2) to the hash outputs; both steps are sketched below. 
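
A sketch of both steps, with salted SHA-256 hashing standing in for the (pseudo-)random permutations, and each of the d candidate locations constrained to its own subtable of size m/d (the mod(m/2) case corresponds to d = 2):

```python
import hashlib

def fingerprint(x: str, bits: int = 16) -> int:
    """Step 1: derive a short fingerprint of the element."""
    h = int(hashlib.sha256(x.encode()).hexdigest(), 16)
    return h & ((1 << bits) - 1)

def d_locations(fp: int, d: int, m: int) -> list:
    """Step 2: derive the d candidate locations from the fingerprint
    alone, so a deletion can find them without knowing the element."""
    subtable = m // d                       # constrained hash range
    locs = []
    for i in range(d):
        h = int(hashlib.sha256(f"{i}:{fp}".encode()).hexdigest(), 16)
        locs.append(i * subtable + h % subtable)
    return locs
```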

As a consequence, in applications where the query frequencies can be estimated or collected and follow, for instance, a step or Zipf distribution, the WBF largely outperforms the traditional Bloom filter in false positive rate. 

This can be explained by the observation that

$$\left(1 - \frac{1}{m}\right)^{kn} > \left(1 - \frac{k}{m}\right)^{n} \qquad (12)$$

4) Multiple Hashing: Multiple hashing is a popular technique that exploits the notion of having multiple hash choices and having the power to choose the most convenient candidate.
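
A minimal sketch of the classic two-choice variant of this idea: each element is hashed with two functions and placed in the less-loaded of its two candidate buckets, which keeps bucket loads close to even (salted SHA-256 simulates the two hash functions):

```python
import hashlib

def _h(seed: int, key: str, buckets: int) -> int:
    digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16) % buckets

def insert_two_choice(table: list, key: str) -> None:
    """Place the key in the less-loaded of its two candidate buckets."""
    b1, b2 = _h(0, key, len(table)), _h(1, key, len(table))
    target = b1 if len(table[b1]) <= len(table[b2]) else b2
    table[target].append(key)

# Usage: bucket loads stay near-uniform across many insertions.
table = [[] for _ in range(8)]
for i in range(100):
    insert_two_choice(table, f"item-{i}")
print([len(b) for b in table])
```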