
Theory and Practice of Bloom Filters for Distributed Systems

TL;DR: An overview of the basic and advanced probabilistic techniques is given, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.
Abstract: Many network solutions and overlay networks utilize probabilistic techniques to reduce information processing and networking costs. This survey article presents a number of frequently used and useful probabilistic techniques. Bloom filters and their variants are of prime importance, and they are heavily used in various distributed systems. This has been reflected in recent research and many new algorithms have been proposed for distributed systems that are either directly or indirectly based on Bloom filters. In this survey, we give an overview of the basic and advanced techniques, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.

Summary (7 min read)

Introduction

  • This survey presents a number of frequently used and useful probabilistic techniques.
  • Fast matching of arbitrary identifiers to values is a basic requirement for a large number of applications.
  • Given that there are millions or even billions of data elements, developing efficient solutions for storing, updating, and querying them becomes increasingly important.
  • Section II introduces the functionality and parameters of the Bloom filter as a hash-based, probabilistic data structure.

II. BLOOM FILTERS

  • The Bloom filter is a space-efficient probabilistic data structure that supports set membership queries.
  • The weak point of Bloom filters is the possibility of a false positive.
  • The bits that correspond to z (positions 15, 10 and 7) were set through the addition of elements b, y and l.
  • The development of uniform hashing techniques has been an active area of research.
  • Finally, the size of the set that is inserted into the filter determines the false positive rate.

A. False Positive Probability

  • The authors derive the false positive probability of a Bloom filter and the optimal number of hash functions for a given target false positive probability.
  • The authors start with the assumption that a hash function selects each array position with equal probability.
  • Now, the authors want to minimize the probability of false positives by minimizing $(1 - e^{-kn/m})^k$ with respect to k. This means that in order to maintain a fixed false positive probability, the length of a Bloom filter must grow linearly with the number of elements inserted in the filter.
  • There are other data structures that use space closer to the lower bound, but they are more complicated (cf. [5], [6], [7]).

B. Operations

  • Standard Bloom filters do not support the removal of elements.
  • Therefore a number of dedicated structures have been proposed that support deletions.
  • The bit-vector nature of the Bloom filter allows the union of two or more Bloom filters simply by performing bitwise OR on the bit-vectors.
  • One straightforward approach is to assume the same m and hash functions and to take the logical AND operation between the two bit-vectors.
  • Host A can then check false positives with B in a final round.

C. Hashing techniques

  • Hash functions are the key building block of probabilistic filters.
  • The n-size array can be used to store the information associated with each element x ∈ S [5].
  • For Bloom filter operations, the double hashing scheme reduces the number of true hash computations from k down to two without any increase in the asymptotic false positive probability [16].
  • When applied to hash table constructions, multiple hashing provides a probabilistic method to limit the effects of collisions by distributing elements more or less evenly.
  • While this is a great aid to theoretical analyses, hash function implementations are known to behave far worse than truly random ones.

III. BLOOM FILTER VARIANTS

  • A number of Bloom filter variants have been proposed that address some of the limitations of the original structure, including counting, deletion, multisets, and space-efficiency.
  • The authors start their examination with the basic counting Bloom filter construction, and then proceed to more elaborate structures including Bloomier and Spectral filters.

A. Counting Bloom Filters

  • As mentioned in the treatment of standard Bloom filters, they do not support element deletions.
  • To avoid counter overflow, the authors need to choose sufficiently large counters.
  • A counting Bloom filter also has the ability to keep approximate counts of items.
  • The upper bound is given by the formula below.
  • When an element is placed into the table, following the d-left hashing technique, d candidate buckets are obtained by computing d independent hash values of the element; a minimal counting-filter sketch follows this list.
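To make the counter-based deletion idea concrete, here is a rough Python sketch (the hashing scheme and parameters are illustrative, not from the survey; real implementations typically pack small fixed-width counters):

    import hashlib

    class CountingBloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.counters = [0] * m   # often 4-bit counters in practice; plain ints here

        def _pos(self, x):
            # k salted digests stand in for k independent hash functions.
            return [int(hashlib.md5(f"{j}:{x}".encode()).hexdigest(), 16) % self.m
                    for j in range(self.k)]

        def insert(self, x):
            for i in self._pos(x):
                self.counters[i] += 1

        def delete(self, x):
            # Decrementing the k counters removes x without disturbing other elements.
            for i in self._pos(x):
                if self.counters[i] > 0:
                    self.counters[i] -= 1

        def ismember(self, x):
            return all(self.counters[i] > 0 for i in self._pos(x))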

C. Compressed Bloom Filter

  • Compressing a Bloom filter improves performance when a Bloom filter is passed in a message between distributed nodes.
  • This structure is particularly useful when information must be transmitted repeatedly, and the bandwidth is a limiting factor [7].
  • If the optimal number of hash functions k (the value minimizing the false positive probability) is used, then the probability that any given bit is set in the bitstring representing the filter is 1/2.
  • The key idea in compressed Bloom filters is that by changing the way bits are distributed in the filter, it can be compressed for transmission purposes.
  • After transmission, the filter is decompressed for use.

E. Hierarchical Bloom Filters

  • Shanmugasundaram et al. [31] presented a data structure called Hierarchical Bloom Filter to support substring matching.
  • The filter works by splitting an input string into a number of fixed-size blocks.
  • These blocks are then inserted into a standard Bloom filter.
  • This substring matching may result in combinations of strings that are incorrectly reported as being in the set (false positives).
  • For the second level, two subsequent blocks are concatenated and inserted; a toy sketch of this splitting scheme follows this list.
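A toy Python sketch of the first two levels of this splitting scheme, with the block offset appended so that block position can be checked (the block size, the use of a single hash function, and the key encoding are all assumptions, not details from [31]):

    import hashlib

    def hbf_insert(bits, s, block=4, levels=2):
        m = len(bits)
        for level in range(1, levels + 1):
            size = block * level   # level 2 concatenates two subsequent blocks
            for off in range(0, len(s) - size + 1, block):
                key = f"{s[off:off+size]}|{off}|{level}"
                bits[int(hashlib.sha1(key.encode()).hexdigest(), 16) % m] = 1

    def hbf_query_block(bits, sub, off, level=1):
        key = f"{sub}|{off}|{level}"
        return bits[int(hashlib.sha1(key.encode()).hexdigest(), 16) % len(bits)] == 1

    bits = [0] * 4096
    hbf_insert(bits, "ABCDEFGH")
    print(hbf_query_block(bits, "ABCD", 0))   # True: first block at offset 0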

F. Spectral Bloom Filters

  • Spectral Bloom filters generalize Bloom filters to storing an approximate multiset and support frequency queries [32].
  • The answer to any multiplicity query is never smaller than the true multiplicity, and greater only with probability ε.
  • Spectral refers to the range within which multiplicity answers are given.
  • The space usage is similar to that of a Bloom filter for a set of the same size (including the counters to store the frequency values).
  • A further improvement of the error rate can be achieved using the recurring minimum (RM) method, which stores elements with a single minimum (among the k counters) in a secondary Spectral Bloom filter with a smaller error probability; a sketch of the basic minimum-selection query follows this list.
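A hedged Python sketch of the basic minimum-selection query (the counter packing and the recurring-minimum refinement of [32] are omitted; hashing is illustrative):

    import hashlib

    class SpectralFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.counters = [0] * m

        def _pos(self, x):
            return [int(hashlib.md5(f"{j}:{x}".encode()).hexdigest(), 16) % self.m
                    for j in range(self.k)]

        def insert(self, x):
            for i in self._pos(x):
                self.counters[i] += 1

        def multiplicity(self, x):
            # Minimum over the k counters: never below the true count,
            # and above it only with small probability.
            return min(self.counters[i] for i in self._pos(x))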

H. Decaying Bloom Filters

  • Duplicate element detection is an important problem, especially pertaining to data stream processing [36].
  • This motivates approximate detection of duplicates among newly arrived data elements of a data stream.
  • This can be accomplished within a fixed time window.
  • The Decaying Bloom Filter (DBF) structure has been proposed for this application scenario.
  • A variant of DBF has been applied for hint-based routing in wireless sensor networks [39].

I. Stable Bloom Filter

  • The Stable Bloom Filter or SBF [41] is another solution to duplicate element detection.
  • The SBF guarantees that the expected fraction of zeros in the SBF stays constant.
  • The SBF introduces both false positives and false negatives, but with rates improved from standard Bloom filters or standard buffering.
  • When adding an element, P counters chosen at random are first decremented (by one).
  • Please see the full paper [41] for details on setting all the parameters; a rough sketch of the update rule follows this list.
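A rough Python sketch of the update rule described above (the ordering of the steps and the values of Max, P, k, and m are illustrative assumptions; parameter setting per [41] is omitted):

    import hashlib, random

    class StableBloomFilter:
        def __init__(self, m, k, P, Max=3):
            self.m, self.k, self.P, self.Max = m, k, P, Max
            self.cells = [0] * m

        def _pos(self, x):
            return [int(hashlib.md5(f"{j}:{x}".encode()).hexdigest(), 16) % self.m
                    for j in range(self.k)]

        def seen(self, x):
            # Query first: was x (probably) seen before?
            duplicate = all(self.cells[i] > 0 for i in self._pos(x))
            # Decrement P randomly chosen cells so the expected fraction
            # of zeros stays constant ("stability")...
            for i in random.sample(range(self.m), self.P):
                if self.cells[i] > 0:
                    self.cells[i] -= 1
            # ...then record x by setting its k cells to Max.
            for i in self._pos(x):
                self.cells[i] = self.Max
            return duplicate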

K. Adaptive Bloom filters

  • The Adaptive Bloom Filter (ABF) [43] is an alternative construction to counting Bloom filters especially well suited for applications where large counters are to be supported without overflows and under unpredictable collision rate dynamics (e.g., network traffic applications).
  • The key idea of the ABF is to count the appearances of elements by an increasing set of hash functions.
  • The key idea is to take advantage of differing flow sizes and to increase or decrease the signature lengths of flows, making them easier or harder to identify in the filter.
  • The construction can adaptively reduce the false positive rate by removing some bits of the signature, thus effectively removing the flow from the structure.
  • A related technique for handling time-varying sets, called double buffering, uses two bitmaps, active and inactive, to support time-dependent Bloom filters.

N. Scalable Bloom filters

  • One caveat with Bloom Filters is having to dimension the maximum filter size (m) a priori.
  • This is commonly done by application designers by establishing an upper bound on the expected fpr and estimating the maximum required capacity (n).
  • Scalable Bloom Filters (SBF) [47] are a BF variant that can adapt dynamically to the number of elements stored, while assuring a maximum false positive probability.
  • Set membership queries require testing for element presence in each filter, thus the requirement on increasing sizes and tightening of error probabilities as the BF scales up.
  • Parameters of the SBF, in addition to the initial bit size m and target fpr, include the expected growth rate (s) and the error probability tightening ratio (r); a condensed sketch of the growth mechanism follows this list.
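A condensed Python sketch of the growth mechanism (the per-stage k and capacity follow the standard approximations; details such as slicing differ from [47]):

    import hashlib, math

    class ScalableBloomFilter:
        def __init__(self, m0=1024, p0=0.01, s=2, r=0.5):
            self.s, self.r = s, r
            self.stages = []          # each stage is a plain Bloom filter
            self._grow(m0, p0)

        def _grow(self, m, p):
            k = max(1, round(math.log2(1 / p)))               # k ~ log2(1/p)
            cap = int(m * (math.log(2) ** 2) / -math.log(p))  # n from m and p
            self.stages.append({"bits": [0] * m, "k": k, "cap": cap,
                                "n": 0, "m": m, "p": p})

        def _pos(self, x, m, k):
            return [int(hashlib.md5(f"{j}:{x}".encode()).hexdigest(), 16) % m
                    for j in range(k)]

        def insert(self, x):
            st = self.stages[-1]
            if st["n"] >= st["cap"]:
                # Current stage is full: add a larger stage with tighter error.
                self._grow(st["m"] * self.s, st["p"] * self.r)
                st = self.stages[-1]
            for i in self._pos(x, st["m"], st["k"]):
                st["bits"][i] = 1
            st["n"] += 1

        def ismember(self, x):
            # Every stage must be tested; the overall fpr is bounded by
            # the (geometrically tightening) sum of stage fprs.
            return any(all(st["bits"][i] for i in self._pos(x, st["m"], st["k"]))
                       for st in self.stages)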

O. Dynamic Bloom Filter

  • Standard BFs and their mainstream variants suffer from inefficiencies when the cardinality of the set under representation is unknown prior to design and deployment.
  • In distributed applications, BF reconstruction is cumbersome and may hinder interoperability.
  • The DBF is based on the notion of an active Bloom filter.
  • The element is then inserted into the active BF.
  • If multiple filters return true, the element removal may result in, at most, k potential false negatives.

P. Split Bloom Filters

  • A Split Bloom filter (SPBF) [49] employs a constant s × m bit matrix for set representation, where s is a pre-defined constant based on the estimation of maximum set cardinality.
  • The SPBF aims at overcoming the limitation of standard BFs which do not take sets of variable sizes into account.
  • The basic idea of the SPBF is to allocate more memory space to enhance the capacity of the filter before its implementation and actual deployment.
  • The false match probability increases as the set cardinality grows.
  • An existing SPBF must be reconstructed using a new bit matrix if the false match probability exceeds an upper bound.

Q. Retouched Bloom filters

  • The Retouched Bloom filter (RBF) [50] builds upon two observations.
  • First, for many BF applications, some false positives are more troublesome than others, and these can be identified after BF construction but prior to deployment.
  • Second, there are cases where a low level of false negatives is acceptable.
  • The novel idea behind the RBF is the bit clearing process by which false positives are removed by resetting individual bits.
  • In case of a random bit clearing process, the gains are neutral, i.e., the fpr decrease equals the fnr increase.

R. Generalized Bloom Filters

  • A GBF starts out as an arbitrary bit vector set with both 1s and 0s, and information is encoded by setting chosen bits to either 0 or 1, departing thus from the notion that empty bit cells represent the absence of information.
  • As a result, the GBF is a more general binary classifier than the standard Bloom filter.
  • In the GBF, the false-positive probability is upper bounded and it does not depend on the initial condition of the filter.
  • The generalization brought by the set of hash functions resetting bits introduces false negatives, whose probability can be upper bounded and does not depend either on the bit filter initial set-up.
  • The GBF returns false if any bit is inverted, i.e., the queried element does not belong to the set with high probability.

T. Data Popularity Conscious Bloom Filters

  • In many information processing environments, the underlying popularities of data items and queries are not identical; rather, they differ and are skewed.
  • An intuitive approach to take data item popularity into account is to use longer encodings and more hash functions for important elements and shorter encodings and fewer hash functions for less important ones.
  • Thus the Bloom filter construction lends itself well to data popularity-conscious filtering as well; however, this requires minimizing the false positive rate by adapting the number of hashes used for each element to its popularity in sets and membership queries.
  • To this end, an object importance metric was proposed in [55].
  • The problem was modeled as a constrained nonlinear integer program, and two polynomial-time solutions were presented with bounded approximation ratios.

V. Weighted Bloom filter

  • Bruck et al. [57] propose Weighted Bloom filter (WBF), a Bloom filter variant that exploits the a priori knowledge of the frequency of element requests by varying the number of hash functions (k) accordingly as a function of the element query popularity.
  • Hence, a WBF incorporates the information on the query frequencies and the membership likelihood of the elements into its optimal design, which fits many applications well in which popular elements are queried much more often than others.
  • The rationale behind the WBF design is to consider the filter fpr as a weighted sum of each individual element’s false positive probability, where the weight is positively correlated with the element’s query frequency and is negatively correlated with the element’s probability of being a member.
  • As a consequence, in applications where the query frequencies can be estimated or collected and follow, for instance, a step or Zipf distribution, the WBF largely outperforms the traditional Bloom filter in fpr.
  • Even a simple binary classification of elements between hot and cold can result in false positive improvements of a few orders of magnitude.

W. Secure Bloom filters

  • The hashing nature of Bloom filters provides some basic security in the sense that the identities of the set elements represented by the BF are not directly visible to an observer.
  • However, BFs are vulnerable to correlation attacks, where the similarity of BFs' contents can be deduced by comparing BF indexes for overlaps, or lack thereof.
  • Encrypted Bloom filters by Bellovin and Cheswick [59] are a privacy-preserving variant of Bloom filters which introduces a semi-trusted third party to transform one party's queries into a form suitable for querying the other party's BF, in such a way that the original query's privacy is preserved.
  • Rather than keeping all parties' keys undisclosed and securing the BF operations with keyed hash functions as per Goh [58], Bellovin and Cheswick propose a specialized form of encryption function where operations can be done on encrypted data.
  • More specifically, their proposal is based on the Pohlig-Hellman cipher, which forms an Abelian group over its keys when encrypting any given element.

X. Summary and discussion

  • Table II summarizes the distinguishing features of the Bloom filter variants discussed in this section.
  • The different Bloom filter designs aim at addressing specific concerns regarding space and transmission efficiency, false positive rate, dynamic operation in terms of increasing workload, dynamic operation in terms of insertions and deletions, counting and frequencies, popularity-aware operation, and mapping to elements and sets instead of simple set membership tests.
  • For each variant, Table II indicates the output type (e.g., boolean, frequency, value) and whether counting (C), deletion (D), or popularity-awareness (P) are supported (Yes/No/Maybe), or whether false negatives (FN) are introduced.
  • Making this choice and optimizing the parameters for the expected use cases are fundamental factors in achieving the desired performance in practice.
  • Ultimately, which probabilistic data structure is best suited depends a lot on the application specifics.

IV. BLOOM FILTERS IN DISTRIBUTED COMPUTING

  • The authors have surveyed techniques for probabilistic representation of sets and functions.
  • The applications of these structures are manifold, and they are widely used in various networking systems, such as Web proxies and caches, database servers, and routers.
  • In packet routing and forwarding, Bloom filters and their variants play important roles in flow detection and classification.
  • Probabilistic techniques can be used to store and process measurement data summaries in routers and other network entities.
  • For more detail, see Figure 15 at the end of this article.

A. Caching

  • Bloom filters have been applied extensively to caching in distributed environments.
  • Figure 10 illustrates the use of a Bloom filter-based summary cache at a proxy.
  • Within a single proxy, a Bloom filter representing the local content cache needs to be recreated when the content changes.
  • Each chunk modulo the digest size is used as the value for one of the Bloom filter hash functions.
  • Bigtable uses Bloom filters to reduce the disk lookups for non-existent rows or columns [65].

B. P2P Networks

  • Bloom filters have been extensively applied in P2P environments for various tasks, such as compactly storing keyword-based searches and indices [67], synchronizing sets over the network, and summarizing content.
  • In [68], the applications and parameters of Bloom filters in P2P networks are discussed.
  • Ideally, the state should be such that it allows for accurate matching of queries and takes sublinear space (or near constant space).
  • They present a locality-aware P2P system architecture called Foreseer, which explicitly exploits geographical locality and temporal locality by constructing a neighbor overlay and a friend overlay, respectively.
  • Tribler uses Bloom filters to keep the databases that maintain the social trust network synchronized between peers.

C. Packet Routing and Forwarding

  • Bloom filters have been used to improve network router performance [76].
  • In [77], Bloom filters are used for high-speed network packet filtering.
  • By using direct lookup array and Controlled Prefix Expansion (CPE), worst-case performance is limited to two hash probes and one array access per lookup.
  • The other extreme approach to support multicast is to move state from the network elements to the packets themselves in the form of Bloom filter-based representations of the multicast trees.
  • More importantly, matching of an incoming packet can now be performed in parallel over all tuples.

D. Monitoring and Measurement

  • Network monitoring and measurement are key application areas for Bloom filters and their variants.
  • The authors briefly examine some key cases in this domain, for example the detection of heavy flows, Iceberg queries, packet attribution, and approximate state machines.
  • Bloom filter variants that are able to count elements are good candidate structures for supporting Iceberg queries.
  • Packet and payload attribution is another application area in measurement for Bloom filters.
  • It solves the central problems (counter space and flow-to-counter association) of per-flow measurement by "braiding" a hierarchy of counters with random graphs.

E. Security

  • The hashing nature of the Bloom filter makes it a natural fit for security applications.
  • Two years later, Manber and Wu [108] presented two extensions to enhance the Bloom-filter-based check for weak passwords.
  • When the CBF was empty to the degree α, the attack string was considered detected, and the full string matcher was used to check for false positives.
  • The authors report a greater than 99% detection rate and false positive ratios of 1% or less.
  • In [118], Wolf presents a mechanism where packet forwarding is dependent on credentials represented as a packet-header-sized Bloom filter.

F. Other Applications

  • This section summarizes use of Bloom filters in several other interesting applications.
  • Figure 14 shows an overview of device wakeup using a Bloom filter.
  • Millions of path queries can be stored efficiently.
  • Their Bloom pre-calculation scheme provides high-speed identification with a small amount of memory by storing pre-calculated outputs of the tags in Bloom filters.
  • The differential file, with updated records, would be accessed only when the record to fetch was contained in the Bloom filter, indicating that the record in the database is not up-to-date.

V. SUMMARY

  • Bloom filters are a general aid for network processing and improving the performance and scalability of distributed systems.
  • In Figure 15, the Bloom filter variants introduced in this paper are categorized by application domain and supported features.
  • Variants that support a certain feature are found inside a highlighted area labeled with the name of that feature.
  • The variants that support this are derived from the Counting Bloom Filter and include an array of fixed- or variable-size counters.
  • These allow, for example, in-word matches for text search.


Theory and Practice of Bloom Filters for Distributed Systems

Sasu Tarkoma, Christian Esteve Rothenberg, and Eemil Lagerspetz

(S. Tarkoma and E. Lagerspetz are with the University of Helsinki, Department of Computer Science. C. E. Rothenberg is with the University of Campinas (Unicamp), Department of Computer Engineering and Industrial Automation.)
Abstract—Many network solutions and overlay networks utilize probabilistic techniques to reduce information processing and networking costs. This survey article presents a number of frequently used and useful probabilistic techniques. Bloom filters and their variants are of prime importance, and they are heavily used in various distributed systems. This has been reflected in recent research, and many new algorithms have been proposed for distributed systems that are either directly or indirectly based on Bloom filters. In this survey, we give an overview of the basic and advanced techniques, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.
Index Terms—Bloom filters, probabilistic structures, distributed systems
I. INTRODUCTION
Many network solutions and overlay networks utilize prob-
abilistic techniques to reduce information processing and net-
working costs. This survey presents a number of frequently
used and useful probabilistic techniques. Bloom filters (BF)
and their variants are of prime importance, and they are heavily
used in various distributed systems. This has been reflected in
recent research and many new algorithms have been proposed
for distributed systems that are either directly or indirectly
based on Bloom filters.
Fast matching of arbitrary identifiers to values is a basic
requirement for a large number of applications. Data objects
are typically referenced using locally or globally unique identi-
fiers. Recently, many distributed systems have been developed
using probabilistic globally unique random bit strings as node
identifiers. For example, a node tracks a large number of peers
that advertise files or parts of files. Fast mapping from host
identifiers to object identifiers and vice versa are needed. The
number of these identifiers in memory may be great, which
motivates the development of fast and compact matching
algorithms.
Given that there are millions or even billions of data
elements, developing efficient solutions for storing, updating,
and querying them becomes increasingly important. The key
idea behind the data structures discussed in this survey is that
by allowing the representation of the set of elements to lose
some information, in other words to become lossy, the storage
requirements can be significantly reduced.
The data structures presented in this survey for probabilistic
representation of sets are based on the seminal work by Burton
Bloom in 1970. Bloom first described a compact probabilistic
data structure that was used to represent words in a dictionary.
There was little interest in using Bloom filters for networking
until 1995, after which this area has gained widespread interest
both in academia and in the industry. This survey provides
an up-to-date view to this emerging area of research and
development that was first surveyed in the work of Broder
and Mitzenmacher [1].
Section II introduces the functionality and parameters of the
Bloom filter as a hash-based, probabilistic data structure. The
theoretical analysis is complemented with practical examples
and common practices in the underpinning hashing techniques.
Section III surveys as many as twenty-three Bloom filter
variants discussing their key features and their differential be-
haviour. Section IV covers a number of recent applications in
distributed systems, such as caches, database servers, routers,
security, and packet forwarding relying on packet header size
Bloom filters. Finally, Section V concludes the survey with a
brief summary on the rationale behind the widespread use of
the polymorphic Bloom filter data structure.
II. BLOOM FILTERS
The Bloom filter is a space-efficient probabilistic data struc-
ture that supports set membership queries. The data structure
was conceived by Burton H. Bloom in 1970 [2]. The structure
offers a compact probabilistic way to represent a set that can
result in false positives (claiming an element to be part of
the set when it was not inserted), but never in false negatives
(reporting an inserted element to be absent from the set). This
makes Bloom filters useful for many different kinds of tasks
that involve lists and sets. The basic operations involve adding
elements to the set and querying for element membership in
the probabilistic set representation.
The basic Bloom filter does not support the removal of ele-
ments; however, a number of extensions have been developed
that also support removals. The accuracy of a Bloom filter
depends on the size of the filter, the number of hash functions
used in the filter, and the number of elements added to the set.
The more elements are added to a Bloom filter, the higher the
probability that the query operation reports false positives.
Broder and Mitzenmacher have coined the Bloom filter
principle [1]:
Whenever a list or set is used, and space is at a
premium, consider using a Bloom filter if the effect
of false positives can be mitigated.
A Bloom filter is an array of m bits for representing a set $S = \{x_1, x_2, \ldots, x_n\}$ of n elements. Initially all the bits in the filter are set to zero. The key idea is to use k hash functions, $h_i(x)$, $1 \le i \le k$, to map items $x \in S$ to random numbers uniform in the range $1, \ldots, m$. The hash functions are assumed to be uniform. The MD5 hash algorithm is a popular choice for the hash functions.
An element $x \in S$ is inserted into the filter by setting the bits $h_i(x)$ to one for $1 \le i \le k$. Conversely, $y$ is assumed a member of $S$ if the bits $h_i(y)$ are set, and guaranteed not to be a member if any bit $h_i(y)$ is not set. Algorithm 1 presents the pseudocode for the insertion operation. Algorithm 2 gives the pseudocode for the membership test of a given element x in the filter. The weak point of Bloom filters is the possibility of a false positive. False positives are elements that are not part of S but are reported as being in the set by the filter.
Data: x is the object key to insert into the Bloom filter.
Function: insert(x)
for j := 1 ... k do            /* loop over all k hash functions */
    i := h_j(x);
    if B[i] == 0 then          /* the filter had a zero bit at position i */
        B[i] := 1;
    end
end
Algorithm 1: Pseudocode for Bloom filter insertion

Data: x is the object key for which membership is tested.
Function: ismember(x) returns true or false to the membership test
m := 1;
j := 1;
while m == 1 and j <= k do
    i := h_j(x);
    if B[i] == 0 then
        m := 0;
    end
    j := j + 1;
end
return m;
Algorithm 2: Pseudocode for Bloom filter membership test
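As a concrete companion to Algorithms 1 and 2, here is a minimal Python sketch (not from the paper); deriving the k positions by salting a single MD5 digest is an implementation convenience standing in for k independent uniform hash functions:

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m = m            # number of bits
            self.k = k            # number of hash functions
            self.bits = [0] * m   # the bit array B

        def _positions(self, x):
            # k salted digests stand in for k independent hash functions h_1..h_k.
            for j in range(self.k):
                h = hashlib.md5(f"{j}:{x}".encode()).hexdigest()
                yield int(h, 16) % self.m

        def insert(self, x):          # Algorithm 1
            for i in self._positions(x):
                self.bits[i] = 1

        def ismember(self, x):        # Algorithm 2
            return all(self.bits[i] == 1 for i in self._positions(x))

    bf = BloomFilter(m=32, k=3)
    for e in ("x", "y", "z"):
        bf.insert(e)
    print(bf.ismember("x"), bf.ismember("w"))   # True, (almost surely) False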
Figure 1 presents an overview of a Bloom filter. The Bloom
filter consists of a bitstring of length 32. Three elements have
been inserted, namely x, y, and z. Each of the elements have
been hashed using k = 3 hash functions to bit positions in
the bitstring. The corresponding bits have been set to 1. Now,
when an element not in the set, w, is looked up, it will be
hashed using the same three hash functions into bit positions.
In this case, one of the positions is zero and hence the Bloom
filter reports correctly that the element is not in the set. It may
happen that all the bit positions of an element report that the
corresponding bits have been set. When this occurs, the Bloom
filter will erroneously report that the element is a member of
the set. These erroneous reports are called false positives. We
observe that for the inserted elements, the hashed positions
correctly report that the bit is set in the bitstring.
Fig. 1. Overview of a Bloom filter
Fig. 2. Addition and query example using a Bloom filter

Figure 2 illustrates a practical example of a Bloom filter through adding and querying elements. In this example, the Bloom filter is a bitstring of length 16. The bit positions are numbered 0 to 15, from right to left. Three hash functions are used: $h_1$, $h_2$, and $h_3$, being MD5, SHA1, and CRC32,
respectively.
respectively. The elements added are text strings containing
only a single letter. The Bloom filter starts out empty, with
all bits unset, or zero. When adding an element, the values
of h
1
through h
3
(modulo 16) are calculated for the element,
and corresponding bit positions are set to one. After adding
a and b, the Bloom filter has positions 15, 9, 8, 3 and 1 set.
In this case, a and b have one common bit position (8). We
further add elements y and l. After this, positions 15, 14, 13,
10, 9, 8, 7, 5, 3 and 1 are set. When we query for q and z, the
same hash functions are used. Bit positions that correspond
to q and z are examined. If the three bits for an element
are set, that element is assumed to be present. In the case
of q, position 0 is not set, and therefore q is guaranteed not to
be present in the Bloom filter. However, z is assumed to be
present, since the corresponding bits have been set. We know
that z is a false positive: it is reported present though it is not
actually contained in the set of added elements. The bits that
correspond to z (positions 15, 10 and 7) were set through the
addition of elements b, y and l.
For optimal performance, each of the k hash functions
should be a member of the class of universal hash functions,
which means that the hash functions map each item in the
universe to a random number uniform over the range. The
development of uniform hashing techniques has been an
active area of research. An almost ideal solution for uniform
hashing is presented in [3]. In practice, hash functions yielding
sufficiently uniformly distributed outputs, such as MD5 or
CRC32, are useful for most probabilistic filter purposes. For
candidate implementations, see the empirical evaluation of 25
hash functions by Henke et al. [4]. Later in Section II-C we
discuss relevant hashing techniques further.
A Bloom filter constructed based on S requires space O(n) and can answer membership queries in O(1) time. Given $x \in S$, the Bloom filter will always report that x belongs to S, but given $y \notin S$ the Bloom filter may report that $y \in S$.

TABLE I
KEY BLOOM FILTER PARAMETERS

  Parameter                           Effect of increase
  Number of hash functions (k)        More computation; lower false positive rate as $k \to k_{opt}$
  Size of filter (m)                  More space needed; lower false positive rate
  Number of elements in the set (n)   Higher false positive rate
Table I examines the behaviour of three key parameters
when their value is either decreased or increased. Increasing
or decreasing the number of hash functions towards $k_{opt}$ can lower the false positive ratio while increasing computation in
insertions and lookups. The cost is directly proportional to the
number of hash functions. The size of the filter can be used to
tune the space requirements and the false positive rate (fpr).
A larger filter will result in fewer false positives. Finally, the
size of the set that is inserted into the filter determines the
false positive rate. We note that although no false negatives
(fn) occur with regular BFs, some variants will be presented
later in the article that may result in false negatives.
A. False Positive Probability
We now derive the false positive probability of a Bloom filter and the optimal number of hash functions for a given target false positive probability. We start with the assumption that a hash function selects each array position with equal probability. Let m denote the number of bits in the Bloom filter. When inserting an element into the filter, the probability that a certain bit is not set to one by a hash function is

    $1 - \frac{1}{m}$.    (1)
Now, there are k hash functions, and the probability that none of them has set a specific bit to one is given by

    $\left(1 - \frac{1}{m}\right)^{k}$.    (2)

After inserting n elements into the filter, the probability that a given bit is still zero is

    $\left(1 - \frac{1}{m}\right)^{kn}$.    (3)

And consequently the probability that the bit is one is

    $1 - \left(1 - \frac{1}{m}\right)^{kn}$.    (4)

For an element membership test, if all of the k array positions in the filter computed by the hash functions are set to one, the Bloom filter claims that the element belongs to the set. The probability of this happening when the element is not part of the set is given by

    $\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$.    (5)
Fig. 3. False positive probability rate for Bloom filters (p as a function of the number of inserted elements n, for filter sizes m = 64, 512, 1024, 2048, and 4096).
We note that $e^{-kn/m}$ is a very close approximation of $\left(1 - \frac{1}{m}\right)^{kn}$ [1]. The false positive probability decreases as the size of the Bloom filter, m, increases. The probability increases with n as more elements are added. Now, we want to minimize the probability of false positives by minimizing $\left(1 - e^{-kn/m}\right)^{k}$ with respect to k. This is accomplished by taking the derivative and setting it equal to zero, which gives the optimal value of k:

    $k_{opt} = \frac{m}{n} \ln 2 \approx \frac{9m}{13n}$.    (6)

This results in the false positive probability of

    $\left(\frac{1}{2}\right)^{k} \approx 0.6185^{m/n}$.    (7)
Using the optimal number of hashes $k_{opt}$, the false positive probability can be rewritten and bounded:

    $p \ge \left(\frac{1}{2}\right)^{\frac{m}{n} \ln 2}$.    (8)

This means that in order to maintain a fixed false positive probability, the length of a Bloom filter must grow linearly with the number of elements inserted in the filter. The number of bits m for the desired number of elements n and false positive rate p is given by

    $m = -\frac{n \ln p}{(\ln 2)^2}$.    (9)
Figure 3 presents the false positive probability rate p as a
function of the number of elements n in the filter and the filter
size m. An optimal number of hash functions k = (m/n) ln 2
has been assumed.
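As a quick numerical companion to Equations (5), (6), and (9), the following Python helper (a sketch, not from the paper) dimensions a filter for a target false positive rate:

    import math

    def dimension_bloom(n, p):
        # Eq. (9): bits needed for n elements at target fpr p
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        # Eq. (6): optimal number of hash functions
        k = max(1, round((m / n) * math.log(2)))
        # Eq. (5): resulting approximate false positive probability
        fpr = (1 - math.exp(-k * n / m)) ** k
        return m, k, fpr

    print(dimension_bloom(n=10_000, p=0.01))  # roughly (95851, 7, ~0.01)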
There is a factor of $\log_2 e \approx 1.44$ between the amount of space used by a Bloom filter and the optimal amount of space that can be used. There are other data structures that use space closer to the lower bound, but they are more complicated (cf. [5], [6], [7]).
Recently, Bose et al. [8] have shown that the false positive
analysis originally given by Bloom and repeated in many sub-
sequent articles is optimistic and only a good approximation
for large Bloom filters. The revisited analysis proves that the
commonly used estimate (Eq. 5) is actually a lower bound and
the real false positive rate is larger than expected by theory,
especially for small values of m.
B. Operations
Standard Bloom filters do not support the removal of
elements. Removal of an element can be implemented by
using a second Bloom filter that contains elements that have
been removed. The problem of this approach is that the false
positives of the second filter result in false negatives in the
composite filter, which is undesirable. Therefore a number of
dedicated structures have been proposed that support deletions.
These are examined later in this survey.
A number of operations involving Bloom filters can be
implemented easily, for example the union and halving of a
Bloom filter. The bit-vector nature of the Bloom filter allows
the union of two or more Bloom filters simply by performing
bitwise OR on the bit-vectors. Given two sets $S_1$ and $S_2$, a Bloom filter B that represents the union $S = S_1 \cup S_2$ can be created by taking the OR of the original Bloom filters, $B = B_1 \lor B_2$, assuming that m and the hash functions are the same. The merged filter B will report any element belonging to $S_1$ or $S_2$ as belonging to set S. The following theorem gives a lower bound for the false positive rate of the union of Bloom filters [9]:

Theorem 1: The false positive probability of $BF(A \cup B)$ is not less than that of $BF(A)$ and $BF(B)$. At the same time, the false positive probability of $BF(A) \cup BF(B)$ is also not less than that of $BF(A)$ and $BF(B)$.
If the BF size m is divisible by 2, halving can be easily
done by bitwise ORing the first and second halves together.
Now, the range of the hash functions needs to be accordingly
constrained, for instance, by applying the mod(m/2) to the
hash outputs.
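A short Python sketch of these two bit-vector operations (assuming equal m and identical hash functions, as stated above; names are illustrative):

    def union(bits_a, bits_b):
        # Bitwise OR of two equally sized Bloom filter bit arrays.
        assert len(bits_a) == len(bits_b)
        return [a | b for a, b in zip(bits_a, bits_b)]

    def halve(bits):
        # OR the first and second halves together; lookups must then
        # take the hash outputs mod (m/2).
        m = len(bits)
        assert m % 2 == 0
        half = m // 2
        return [bits[i] | bits[half + i] for i in range(half)]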
Bloom filters can be used to approximate set intersection; however, this is more complicated than the union operation. One straightforward approach is to assume the same m and hash functions and to take the logical AND operation between the two bit-vectors. The following theorem gives the probability for this to hold [9]:

Theorem 2: If $BF(A \cap B)$, $BF(A)$, and $BF(B)$ use the same m and hash functions, then $BF(A \cap B) = BF(A) \land BF(B)$ with probability $(1 - 1/m)^{k^2 |A - A \cap B| \times |B - A \cap B|}$.

The inner product of the bit-vectors is an indicator of the size of the intersection [1]. The idea of a bloomjoin was presented by Mackert and Lohman in 1986 [10]. In a bloomjoin, two hosts, A and B, compute the intersection of two sets $S_1$ and $S_2$, where A has the first set and B the second. It is not feasible to send all the elements from A to B, and vice versa. In a bloomjoin, $S_1$ is represented using a Bloom filter and sent from A to B. B can then compute the intersection and send back this set. Host A can then check false positives with B in a final round.
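The bloomjoin exchange can be sketched as follows in Python (message passing is elided; the filter construction and the salted-hash scheme are illustrative assumptions):

    import hashlib

    def positions(x, m=1024, k=3):
        # k salted digest positions standing in for independent hashes
        return [int(hashlib.sha1(f"{j}:{x}".encode()).hexdigest(), 16) % m
                for j in range(k)]

    def to_filter(s, m=1024, k=3):
        bits = [0] * m
        for x in s:
            for i in positions(x, m, k):
                bits[i] = 1
        return bits

    def bloomjoin(s1, s2, m=1024, k=3):
        # Host A sends BF(S1); host B computes the candidate matches,
        # i.e. elements of S2 passing the filter (may include false positives).
        bf = to_filter(s1, m, k)
        candidates = {x for x in s2
                      if all(bf[i] for i in positions(x, m, k))}
        # Final round at host A removes the false positives exactly.
        return candidates & s1

    print(bloomjoin({"a", "b", "c"}, {"b", "c", "d"}))  # {'b', 'c'}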
C. Hashing techniques
Hash functions are the key building block of probabilistic
filters. There is a large literature on hash functions spanning
from randomness analysis to security evaluation over many
networking and computing applications. We focus on the best
practices and recent developments in hashing techniques which
are relevant to the performance and practicality of Bloom filter
constructs. For further details, deeper theoretical foundations
and system-specific applications we refer to related work, such
as [4], [11], [12], [13].
One noteworthy property of Bloom filters is that the false
positive performance depends only on the bit-per-element ratio
(m/n) and not on the form or size of the hashed elements.
As long as the size of the elements can be bounded, hashing
time can be assumed to be a constant factor. Considering the
trend in computational power versus memory access time, the
practical bottleneck is the amount of (slow) memory accesses
rather than the hash computation time. Nevertheless, whenever
a filter application needs to run at line speed, hardware-
amenable per-packet operations are critical [13].
In the following subsections, we briefly present hashing
techniques that are the basis for good Bloom filter implemen-
tations. We start with perfect hashing, which is an alternative
to Bloom filters when the set is known beforehand and it is
static. Double hashing allows reducing the number of true hash
computations. Partitioned hashing and multiple hashing deal
with how bits are allocated in a Bloom filter. Finally, the use
of simple hash functions is considered.
1) Perfect Hashing Scheme: A simple technique called
perfect hashing (or explicit hashing) can be used to store a
static set S of values in an optimal manner using a perfect hash
function. A perfect hash function is a computable bijection
from S to an array of |S| = n hash buckets. The n-size
array can be used to store the information associated with each element $x \in S$ [5].
Bloom filter like functionality can be obtained by, given a set of elements S, first finding a perfect hash function P and then storing at each location an $f = \lceil \log_2(1/\epsilon) \rceil$ bit fingerprint, computed using some (pseudo-)random hash function H. Figure 4 illustrates this perfect hashing scheme.

Lookup of x simply consists of computing P(x) and checking whether the stored hash function value matches H(x). When $x \in S$, the correct value is always returned, and when $x \notin S$ a false positive (claiming the element is in S) occurs with probability at most $\epsilon$. This follows from the definition of 2-universal hashing by Carter and Wegman [14]: any element y not in S has probability at most $\epsilon$ of having the same hash function value h(y) as the element in S that maps to the same entry of the array.
While space efficient, this approach is unsuitable for dynamic environments, because the perfect hash function needs to be recomputed whenever the set S changes.

Fig. 4. Example of explicit hashing (elements 1-5 mapped by a perfect hash function to an array of fingerprints).
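As a toy illustration (not the construction of [5] or [15]), the following Python sketch brute-forces a seed that makes a salted hash perfect on a tiny static set, then stores f-bit fingerprints:

    import hashlib

    def h(x, seed, mod):
        return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) % mod

    def build(S, f=8):
        n = len(S)
        # Brute-force a seed under which h(., seed, n) is a bijection on S;
        # feasible only for tiny static sets, hence "toy".
        seed = 0
        while len({h(x, seed, n) for x in S}) < n:
            seed += 1
        table = [None] * n
        for x in S:
            table[h(x, seed, n)] = h(x, "fp", 2 ** f)  # f-bit fingerprint
        return seed, table

    def query(x, seed, table, f=8):
        # False positives occur with probability about 2^-f.
        return table[h(x, seed, len(table))] == h(x, "fp", 2 ** f)

    seed, table = build({"a", "b", "c", "d"})
    print(query("a", seed, table), query("zzz", seed, table))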
Another technique for minimal perfect hashing was intro-
duced by Antichi et al. [15]. It relies on Bloom filters and
Blooming Trees to turn the imperfect hashing of a Bloom
filter into a perfect hashing. The technique gives space and
time savings. This technique also requires a static set S, but
can handle a huge number of elements.
2) Double Hashing: The improvement of the double hashing technique over basic hashing is being able to generate k hash values based on only two universal hash functions as base generators (or "seed" hashes). As a practical consequence, Bloom filters can be built with fewer hashing operations without sacrificing performance. Kirsch and Mitzenmacher have shown [16] that only two independent hash functions, $h_1(x)$ and $h_2(x)$, are required to generate additional "pseudo" hashes defined as:

    $h_i(x) = h_1(x) + f(i) \, h_2(x)$    (10)

where i is the hash value index, f(i) can be any arbitrary function of i (e.g., $i^2$), and x is the element being hashed. For Bloom filter operations, the double hashing scheme reduces the number of true hash computations from k down to two without any increase in the asymptotic false positive probability [16].
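A brief Python sketch of Equation (10), here with f(i) = i and MD5/SHA1 as the two base hashes (both choices are assumptions, not prescriptions from [16]):

    import hashlib

    def k_positions(x, m, k):
        # Eq. (10) with f(i) = i: h_i(x) = h1(x) + i*h2(x) mod m.
        # Only two true hash computations, then k derived positions.
        h1 = int(hashlib.md5(x.encode()).hexdigest(), 16) % m
        h2 = int(hashlib.sha1(x.encode()).hexdigest(), 16) % m
        return [(h1 + i * h2) % m for i in range(k)]

    print(k_positions("flow-42", m=1024, k=5))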
3) Partitioned Hashing: In this hashing technique, the k hash functions are allocated disjoint ranges of m/k consecutive bits instead of the full m-bit array space. Following the same false positive probability analysis of Sec. II-A, the probability of a specific bit being 0 in a partitioned Bloom filter can be approximated as:

    $(1 - k/m)^{n} \approx e^{-kn/m}$    (11)

While the asymptotic performance remains the same, in practice, partitioned Bloom filters exhibit a poorer false positive performance as they tend to have larger fill factors (more 1s) due to the m/k bit range restriction. This can be explained by the observation that:

    $(1 - 1/m)^{kn} > (1 - k/m)^{n}$    (12)
4) Multiple Hashing: Multiple hashing is a popular technique that exploits the notion of having multiple hash choices and having the power to choose the most convenient candidate. When applied to hash table constructions, multiple hashing provides a probabilistic method to limit the effects of collisions by distributing elements more or less evenly. The original idea was proposed by Azar et al. in their seminal work on balanced allocations [17]. Formulating hashing as a balls-into-bins problem, the authors show that if n balls are placed sequentially into m bins for m = O(n), with each ball being placed in one of a constant number $d \ge 2$ of randomly chosen bins, then, after all balls are inserted, the maximal load in a bin is, with high probability, $(\ln \ln n)/\ln d + O(1)$. Vöcking et al. [18] elaborate on this observation and propose the always-go-left algorithm (or d-left hashing scheme) to break ties when inserting (chained) elements into the least loaded one among the d partitioned candidates.
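A minimal Python sketch of d-left placement (bucket counters stand in for chained elements; the salted-hash scheme and sizes are assumptions):

    import hashlib

    def dleft_insert(x, buckets, d):
        # buckets is a list of d equally sized sub-tables (partitions).
        size = len(buckets[0])
        # One candidate bucket per partition, from d salted hashes.
        cands = [int(hashlib.sha1(f"{j}:{x}".encode()).hexdigest(), 16) % size
                 for j in range(d)]
        # Choose the least loaded candidate; on ties, always go left (lowest j).
        j = min(range(d), key=lambda j: (buckets[j][cands[j]], j))
        buckets[j][cands[j]] += 1

    buckets = [[0] * 64 for _ in range(4)]   # d = 4 partitions of 64 buckets
    for i in range(1000):
        dleft_insert(f"item{i}", buckets, d=4)
    print(max(max(b) for b in buckets))      # maximal load stays small w.h.p.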
As a result this hashing technique provides an almost
optimal (up to an additive constant) load-balancing scheme.
In addition to the balancing improvement, partitioning the
hash buckets (i.e., bins) into groups makes d-left hashing
more hardware friendly as it allows the parallelized look-
up of the d hash locations. Thus, hash partitioning and tie-
breaking have elevated d-left hashing as an optimal technique
for building high performance (negligible overflow probabil-
ities) data structures such as the multiple level hash tables
(MHT) [19] or counting Bloom filters [20]. A breakthrough
Bloom filter design was recently proposed using an open-
addressed multiple choice hash table based on d-left hashing,
element fingerprints (a smaller representation like the last f
bits of the element hash) and dynamic bit reassignment [21].
After all optimizations, the authors show that the performance
is comparable to plain Bloom filter constructs, outperforms
traditional counting Bloom filter constructs (see d-left CBF
in Sec. III-B), and easily extensible to support practical
networking applications (e.g., flow tracking in Sec. IV-D).
The power of (two) choices has been exploited by Lumetta
and Mitzenmacher to improve the false positive performance
of Bloom filters [22]. The key idea consists of considering not
one but two groups of k hash functions. On element insertion,
the selection criteria is based on the group of k hash functions
that sets fewer bits to 1. The caveat is that when checking for
elements, both groups of k hash functions need to be checked
since there is no information on which group was initially used
and false positives can potentially be claimed for either group.
Although it may appear counter-intuitive, under some settings (high m/n ratios), setting fewer ones in the filter actually pays off despite the double checking operations.
Fundamentally similar in exploiting the power of choices
in producing less dense (improved) Bloom filters, the method
proposed by Hao et al. [23] is based on a partitioned hashing
technique which results in a choice of hash functions that set
fewer bits. Experimental results show that this improvement
can be as much as a ten-fold increase in performance over
standard constructs. However, the choice of hash functions
cannot be done on an element basis as in [22], and its
applicability is constrained to non-dynamic environments.
5) Simple hash functions: A common assumption is to
consider output hash values as truly random, that is, each
hashed element is independently mapped to a uniform location.
While this is a great aid to theoretical analyses, hash function
implementations are known to behave far worse than truly ran-
dom ones. On the other hand, empirical works using standard
universal hashing have been reporting negligible differences in
practical performance compared to predictions assuming ideal
hashing (see [24] for the case of Bloom filters).
Mitzenmacher and Vadhan [25] provide the seeds to formally explain this gap between the theory and practice

Citations
More filters
Journal ArticleDOI
01 Jan 2015
TL;DR: This paper presents an in-depth analysis of the hardware infrastructure, southbound and northbound application programming interfaces (APIs), network virtualization layers, network operating systems (SDN controllers), network programming languages, and network applications, and presents the key building blocks of an SDN infrastructure using a bottom-up, layered approach.
Abstract: The Internet has led to the creation of a digital society, where (almost) everything is connected and is accessible from anywhere. However, despite their widespread adoption, traditional IP networks are complex and very hard to manage. It is both difficult to configure the network according to predefined policies, and to reconfigure it to respond to faults, load, and changes. To make matters even more difficult, current networks are also vertically integrated: the control and data planes are bundled together. Software-defined networking (SDN) is an emerging paradigm that promises to change this state of affairs, by breaking vertical integration, separating the network's control logic from the underlying routers and switches, promoting (logical) centralization of network control, and introducing the ability to program the network. The separation of concerns, introduced between the definition of network policies, their implementation in switching hardware, and the forwarding of traffic, is key to the desired flexibility: by breaking the network control problem into tractable pieces, SDN makes it easier to create and introduce new abstractions in networking, simplifying network management and facilitating network evolution. In this paper, we present a comprehensive survey on SDN. We start by introducing the motivation for SDN, explain its main concepts and how it differs from traditional networking, its roots, and the standardization activities regarding this novel paradigm. Next, we present the key building blocks of an SDN infrastructure using a bottom-up, layered approach. We provide an in-depth analysis of the hardware infrastructure, southbound and northbound application programming interfaces (APIs), network virtualization layers, network operating systems (SDN controllers), network programming languages, and network applications. We also look at cross-layer problems such as debugging and troubleshooting. In an effort to anticipate the future evolution of this new paradigm, we discuss the main ongoing research efforts and challenges of SDN. In particular, we address the design of switches and control platforms—with a focus on aspects such as resiliency, scalability, performance, security, and dependability—as well as new opportunities for carrier transport networks and cloud providers. Last but not least, we analyze the position of SDN as a key enabler of a software-defined environment.

3,589 citations


Cites background from "Theory and Practice of Bloom Filter..."

  • ...traffic matrix estimation [262], fine-grained monitoring of wildcard rules [365], two-stage Bloom filters [366] to represent monitoring rules and provide high measurement accuracy without incurring in extra memory or control plane traffic overhead [309], and special monitoring functions (extensions to OpenFlow) in forwarding devices to reduce traffic and...

    [...]

Posted Content
TL;DR: Software-Defined Networking (SDN) as discussed by the authors is an emerging paradigm that promises to change this state of affairs, by breaking vertical integration, separating the network's control logic from the underlying routers and switches, promoting (logical) centralization of network control, and introducing the ability to program the network.
Abstract: Software-Defined Networking (SDN) is an emerging paradigm that promises to change this state of affairs, by breaking vertical integration, separating the network's control logic from the underlying routers and switches, promoting (logical) centralization of network control, and introducing the ability to program the network. The separation of concerns introduced between the definition of network policies, their implementation in switching hardware, and the forwarding of traffic, is key to the desired flexibility: by breaking the network control problem into tractable pieces, SDN makes it easier to create and introduce new abstractions in networking, simplifying network management and facilitating network evolution. In this paper we present a comprehensive survey on SDN. We start by introducing the motivation for SDN, explain its main concepts and how it differs from traditional networking, its roots, and the standardization activities regarding this novel paradigm. Next, we present the key building blocks of an SDN infrastructure using a bottom-up, layered approach. We provide an in-depth analysis of the hardware infrastructure, southbound and northbound APIs, network virtualization layers, network operating systems (SDN controllers), network programming languages, and network applications. We also look at cross-layer problems such as debugging and troubleshooting. In an effort to anticipate the future evolution of this new paradigm, we discuss the main ongoing research efforts and challenges of SDN. In particular, we address the design of switches and control platforms -- with a focus on aspects such as resiliency, scalability, performance, security and dependability -- as well as new opportunities for carrier transport networks and cloud providers. Last but not least, we analyze the position of SDN as a key enabler of a software-defined environment.

1,968 citations

Journal ArticleDOI
TL;DR: A novel taxonomy is introduced to study Named Data Networking features in depth and identifies a set of open challenges which should be addressed by researchers in due course.

228 citations


Cites methods from "Theory and Practice of Bloom Filter..."

  • ...[57] have used modified bloom filter [63] and proposed a new mapping bloom filter (MBF)....

    [...]

Journal ArticleDOI
TL;DR: Lighter is a fast, memory-efficient tool for correcting sequencing errors that uses a pair of Bloom filters, one holding a sample of the input k-mers and the other likely to be correct, and is both faster and more memory- efficient than competing approaches while achieving comparable accuracy.
Abstract: Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.

222 citations

Proceedings ArticleDOI
09 May 2017
TL;DR: Monkey, an LSM-based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget is presented, and how to use this model to answer what-if design questions about how changes in environmental parameters impact performance is shown.
Abstract: In this paper, we show that key-value stores backed by an LSM-tree exhibit an intrinsic trade-off between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult to tune trade-off among these metrics. We pinpoint the problem to the fact that all modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters' false positive rates in each level. We present Monkey, an LSM-based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget. The insight is that worst-case lookup cost is proportional to the sum of the false positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize this sum. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of LevelDB that Monkey reduces lookup latency by an increasing margin as the data volume grows (50%-80% for the data sizes we experimented with). Furthermore, we map the LSM-tree design space onto a closed-form model that enables co-tuning the merge policy, the buffer size and the filters' false positive rates to trade among lookup cost, update cost and/or main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the various LSM-tree design elements accordingly.

172 citations


Cites background from "Theory and Practice of Bloom Filter..."

  • ...This relationship is captured by the following equation [32]....

References
Journal ArticleDOI
TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
Abstract: In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency.The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods.In such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to “catch” the small fraction of errors associated with the new methods. An example is discussed which illustrates possible areas of application for the new methods.Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.

7,390 citations


Additional excerpts

  • ...Bloom in 1970 [2]....

Proceedings Article
01 Jan 2006
TL;DR: Bigtable is a distributed storage system for managing structured data, designed to scale to a very large size: petabytes of data across thousands of commodity servers; projects using it at Google include web indexing, Google Earth, and Google Finance.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

4,843 citations

Proceedings Article
07 Sep 1999
TL;DR: Experimental results indicate that the novel scheme for approximate similarity search based on hashing scales well even for a relatively large number of dimensions, and that it gives significant improvement in running time over methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
Abstract: The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

3,705 citations


"Theory and Practice of Bloom Filter..." refers methods in this paper

  • ...The DSBF is implemented using locality-sensitive hash functions [53], [54] and allows false positives and false...

Journal ArticleDOI
12 Nov 2000
TL;DR: OceanStore's monitoring of usage patterns allows adaptation to regional outages and denial-of-service attacks; monitoring also enhances performance through pro-active movement of data.
Abstract: OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through pro-active movement of data. A prototype implementation is currently under development.

3,376 citations


"Theory and Practice of Bloom Filter..." refers methods in this paper

  • ...Rhea and Kubiatowicz [69] designed a probabilistic routing algorithm for P2P location mechanisms in the OceanStore project....

Journal ArticleDOI
TL;DR: The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
Abstract: A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in recent years have they become popular in the networking literature. The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.

2,199 citations


"Theory and Practice of Bloom Filter..." refers background in this paper

  • ...Broder and Mitzenmacher have coined the Bloom filter principle [1]:...

  • ...The inner product of the bit-vectors is an indicator of the size of the intersection [1]....

  • ...$(1 - 1/m)^{kn}$ [1]....

  • ...The analysis from [27] reveals that 4 bits per counter should suffice for most applications [1], [28]....

  • ...and Mitzenmacher [1]....

Frequently Asked Questions (19)
Q1. What contributions have the authors mentioned in the paper "Theory and Practice of Bloom Filters for Distributed Systems"?

This survey article presents a number of frequently used and useful probabilistic techniques. This has been reflected in recent research and many new algorithms have been proposed for distributed systems that are either directly or indirectly based on Bloom filters. In this survey, the authors give an overview of the basic and advanced techniques, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization. 

Increasing or decreasing the number of hash functions towards k_opt can lower the false positive ratio, at the cost of additional computation during insertions and lookups. 

The accuracy of a Bloom filter depends on the size of the filter, the number of hash functions used in the filter, and the number of elements added to the set. 
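
This dependence is the false positive approximation derived earlier, p ≈ (1 − e^(−kn/m))^k. A minimal Python sketch of the relationship (illustrative, not code from the survey):

```python
import math

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability of a Bloom filter with
    m bits, n inserted elements, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> int:
    """Number of hash functions minimizing the false positive rate."""
    return max(1, round((m / n) * math.log(2)))

# Example: 10 bits per element with the optimal k gives roughly 0.8%.
m, n = 10_000, 1_000
k = optimal_k(m, n)
print(k, false_positive_rate(m, n, k))  # 7, ~0.0082
```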

When applied to hash table construction, multiple hashing provides a probabilistic method to limit the effects of collisions by distributing elements more or less evenly across the table. 

The element membership query iterates over the set of BFs in the DBF and returns true if any of them contains the element, as in the sketch below. 
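
A minimal sketch of a DBF in Python; the small BloomFilter class here is an illustrative stand-in (salted SHA-256 hashing and a crude capacity counter), not the survey's construction:

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter: m bits, k salted hashes."""
    def __init__(self, m=1024, k=4, capacity=100):
        self.bits, self.m, self.k = 0, m, k
        self.count, self.capacity = 0, capacity

    def _positions(self, element):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, element):
        for pos in self._positions(element):
            self.bits |= 1 << pos
        self.count += 1

    def contains(self, element):
        return all(self.bits >> pos & 1 for pos in self._positions(element))

    def is_full(self):
        return self.count >= self.capacity

class DynamicBloomFilter:
    """DBF sketch: a growing list of fixed-capacity Bloom filters."""
    def __init__(self):
        self.filters = [BloomFilter()]

    def add(self, element):
        if self.filters[-1].is_full():        # last filter at capacity
            self.filters.append(BloomFilter())
        self.filters[-1].add(element)

    def contains(self, element):
        # Membership query: true if ANY constituent filter matches.
        return any(bf.contains(element) for bf in self.filters)
```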

If the optimal number of hash functions k, chosen to minimize the false positive probability, is used, then the probability that any given bit is set in the filter's bit-vector is 1/2. 
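
This is a consequence of the optimal choice of k; a short derivation, consistent with the false positive analysis earlier in the survey:

```latex
k_{\mathrm{opt}} = \frac{m}{n}\ln 2
\quad\Longrightarrow\quad
\Pr[\text{bit set}]
  = 1 - \left(1 - \frac{1}{m}\right)^{k_{\mathrm{opt}} n}
  \approx 1 - e^{-k_{\mathrm{opt}} n / m}
  = 1 - e^{-\ln 2}
  = \frac{1}{2}
```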

The construction can adaptively reduce the false positive rate by removing some bits of the signature, thus effectively removing the flow from the structure. 

A related technique for handling time-varying sets, called double buffering, uses two bitmaps, one active and one inactive, to support time-dependent Bloom filters. 
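
One plausible reading of the scheme as a sketch (the exact warm-up policy varies between proposals); it reuses the illustrative BloomFilter class from the DBF sketch above:

```python
class DoubleBufferedFilter:
    """Double-buffering sketch for time-varying sets: inserts go to the
    active bitmap, and once it passes half capacity they also warm up
    the inactive bitmap, so that a swap keeps the recent elements."""
    def __init__(self, make_filter):
        self.make_filter = make_filter        # e.g. BloomFilter above
        self.active = make_filter()
        self.inactive = make_filter()

    def add(self, element):
        self.active.add(element)
        if self.active.count > self.active.capacity // 2:
            self.inactive.add(element)        # warm-up phase
        if self.active.is_full():
            # Warmed-up bitmap becomes active; the stale one is reset.
            self.active, self.inactive = self.inactive, self.make_filter()

    def contains(self, element):
        return self.active.contains(element)
```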

The proposed mechanism adapts to set growth by adding, as needed, "slices" of traditional Bloom filters with increasing sizes and tighter error probabilities. 

Set membership queries require testing for element presence in each filter; hence the requirement for increasing sizes and tighter error probabilities as the BF scales up (see the sketch below). 
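
A sketch of this scalable construction; make_slice is a hypothetical factory returning a Bloom filter sized for a given capacity and error target, and the growth and tightening parameters are illustrative defaults:

```python
class ScalableBloomFilter:
    """Scalable Bloom filter sketch: when the newest slice fills up,
    append a slice with larger capacity and a geometrically tighter
    error probability, so the compound error stays bounded (roughly
    error / (1 - tightening) over the series of slices)."""
    def __init__(self, make_slice, capacity=1000, error=0.01,
                 growth=2, tightening=0.5):
        self.make_slice = make_slice
        self.capacity, self.error = capacity, error
        self.growth, self.tightening = growth, tightening
        self.slices = [make_slice(capacity, error)]

    def add(self, element):
        if self.slices[-1].is_full():
            self.capacity *= self.growth
            self.error *= self.tightening
            self.slices.append(self.make_slice(self.capacity, self.error))
        self.slices[-1].add(element)

    def contains(self, element):
        # Every slice must be tested, which is why per-slice error
        # probabilities tighten as the structure grows.
        return any(s.contains(element) for s in self.slices)
```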

1) Perfect Hashing Scheme: A simple technique called perfect hashing (or explicit hashing) can be used to store a static set S of values in an optimal manner using a perfect hash function. 
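
Purely as an illustration of the idea, a brute-force search for a collision-free (perfect) salted hash over a tiny static set, with short fingerprints stored at the resulting slots; practical perfect hashing constructions are far more efficient and scale to large sets:

```python
import hashlib

def salted_hash(salt: int, key: str, buckets: int) -> int:
    digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return int(digest, 16) % buckets

def find_perfect_salt(keys, max_tries=1_000_000) -> int:
    """Try salts until one hashes every key to a distinct bucket.
    Only feasible for tiny sets (success probability ~ n!/n^n)."""
    n = len(keys)
    for salt in range(max_tries):
        if len({salted_hash(salt, k, n) for k in keys}) == n:
            return salt
    raise RuntimeError("no perfect salt found")

# Store a short fingerprint of each element at its perfect-hash slot.
keys = ["alpha", "beta", "gamma", "delta"]
salt = find_perfect_salt(keys)
table = [None] * len(keys)
for k in keys:
    slot = salted_hash(salt, k, len(keys))
    table[slot] = hashlib.sha256(k.encode()).hexdigest()[:4]
```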

This is motivated by applications such as Web caches and P2P information sharing, which frequently use Bloom filters to distribute routing tables. 

The caveat is that when checking for an element, both groups of k hash functions need to be tested, since there is no record of which group was used at insertion time, and false positives can arise from either group. 

When a string is inserted, it is first broken into blocks, which are inserted into the filter hierarchy starting from the lowest level, as in the sketch below. 
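
One possible reading of this hierarchical insertion as a sketch; the block size and the doubling of the span per level are illustrative assumptions, and `levels` is assumed to be a list of Bloom-filter-like objects, one per hierarchy level:

```python
def insert_string(levels, s, block=4):
    """Break the string into fixed-size blocks; level 0 stores single
    blocks, level 1 pairs of adjacent blocks, level 2 runs of four,
    and so on up the hierarchy."""
    blocks = [s[i:i + block] for i in range(0, len(s), block)]
    for level, bf in enumerate(levels):
        span = 2 ** level             # blocks per inserted substring
        for i in range(0, len(blocks) - span + 1, span):
            bf.add("".join(blocks[i:i + span]))
```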

The probability that the ith counter is incremented j times is a binomial random variable:

$$P(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk - j} \qquad (13)$$

The probability that any counter is at least j is bounded above by $m \, P(c(i) \geq j)$ (a union bound over the m counters), which can be calculated from the above formula. 
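
A small helper that evaluates this union bound numerically; the binomial terms are computed in log space to avoid overflow for large nk, and the parameter values in the example are illustrative:

```python
from math import exp, lgamma, log

def log_binom_pmf(N: int, t: int, p: float) -> float:
    """log of C(N, t) * p^t * (1 - p)^(N - t), via lgamma for stability."""
    return (lgamma(N + 1) - lgamma(t + 1) - lgamma(N - t + 1)
            + t * log(p) + (N - t) * log(1 - p))

def max_counter_bound(n: int, m: int, k: int, j: int, terms: int = 200) -> float:
    """Union bound m * P(c(i) >= j) from Eq. (13); the tail is summed
    term by term, and terms beyond the first few are negligible."""
    N, p = n * k, 1.0 / m
    tail = sum(exp(log_binom_pmf(N, t, p))
               for t in range(j, min(j + terms, N) + 1))
    return m * tail

# With ~10 counters per element and k = 7, the chance that ANY 4-bit
# counter overflows (reaches j = 16) is vanishingly small (~1e-11),
# which is why 4 bits per counter suffice for most applications.
print(max_counter_bound(n=10_000, m=100_000, k=7, j=16))
```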

The problem of knowing which candidate element fingerprint to delete, in case of fingerprint collisions, can be neatly solved by breaking the problem into two parts: creating the fingerprint, and finding the d locations by applying additional (pseudo-)random permutations. 

The range of the hash functions needs to be constrained accordingly, for instance by applying mod(m/2) to the hash outputs; both steps are sketched below. 
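
A sketch of both steps, with salted SHA-256 hashing standing in for the (pseudo-)random permutations, and each of the d candidate locations constrained to its own subtable of size m/d (the mod(m/2) case corresponds to d = 2):

```python
import hashlib

def fingerprint(x: str, bits: int = 16) -> int:
    """Step 1: derive a short fingerprint of the element."""
    h = int(hashlib.sha256(x.encode()).hexdigest(), 16)
    return h & ((1 << bits) - 1)

def d_locations(fp: int, d: int, m: int) -> list:
    """Step 2: derive the d candidate locations from the fingerprint
    alone, so a deletion can find them without knowing the element."""
    subtable = m // d                       # constrained hash range
    locs = []
    for i in range(d):
        h = int(hashlib.sha256(f"{i}:{fp}".encode()).hexdigest(), 16)
        locs.append(i * subtable + h % subtable)
    return locs
```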

As a consequence, in applications where the query frequencies can be estimated or collected and follow, for instance, a step or Zipf distribution, the WBF largely outperforms the traditional Bloom filter in false positive rate. 

This can be explained by the observation that

$$\left(1 - \frac{1}{m}\right)^{kn} > \left(1 - \frac{k}{m}\right)^{n} \qquad (12)$$

4) Multiple Hashing: Multiple hashing is a popular technique that exploits the notion of having multiple hash choices and having the power to choose the most convenient candidate.
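
A minimal sketch of the classic two-choice variant of this idea: each element is hashed with two functions and placed in the less-loaded of its two candidate buckets, which keeps bucket loads close to even (salted SHA-256 simulates the two hash functions):

```python
import hashlib

def _h(seed: int, key: str, buckets: int) -> int:
    digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16) % buckets

def insert_two_choice(table: list, key: str) -> None:
    """Place the key in the less-loaded of its two candidate buckets."""
    b1, b2 = _h(0, key, len(table)), _h(1, key, len(table))
    target = b1 if len(table[b1]) <= len(table[b2]) else b2
    table[target].append(key)

# Usage: bucket loads stay near-uniform across many insertions.
table = [[] for _ in range(8)]
for i in range(100):
    insert_two_choice(table, f"item-{i}")
print([len(b) for b in table])
```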