Theory and Practice of Bloom Filters for Distributed Systems
Summary
Introduction
- This survey presents a number of frequently used and useful probabilistic techniques.
- Fast matching of arbitrary identifiers to values is a basic requirement for a large number of applications.
- Given that there are millions or even billions of data elements, developing efficient solutions for storing, updating, and querying them becomes increasingly important.
- Section II introduces the functionality and parameters of the Bloom filter as a hash-based, probabilistic data structure.
II. BLOOM FILTERS
- The Bloom filter is a space-efficient probabilistic data structure that supports set membership queries.
- The weak point of Bloom filters is the possibility of false positives.
- For example, a query for an element z yields a false positive when the bits that correspond to z (positions 15, 10, and 7) were set through the addition of other elements b, y and l.
- The development of uniform hashing techniques has been an active area of research.
- Finally, the size of the set that is inserted into the filter determines the false positive rate.
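The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the parameters m and k, and the derivation of k indices from a single SHA-256 digest, are illustrative choices.

```python
import hashlib

class BloomFilter:
    """Minimal sketch: an m-bit array probed by k derived hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indexes(self, item):
        # Derive k indices from one SHA-256 digest (an illustrative choice).
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # True means "possibly in the set" (false positives are possible);
        # False means "definitely not in the set" (no false negatives).
        return all(self.bits[i] for i in self._indexes(item))

bf = BloomFilter(m=1024, k=4)
for word in ["alpha", "beta", "gamma"]:
    bf.add(word)
assert "alpha" in bf          # inserted elements are always found
```

Note that a negative answer is always definitive, while a positive answer only holds with high probability.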
A. False Positive Probability
- The authors now derive the false positive probability of a Bloom filter and the optimal number of hash functions for a given false positive probability.
- The authors start with the assumption that a hash function selects each array position with equal probability.
- Now, the authors want to minimize the probability of false positives by minimizing (1 − e^(−kn/m))^k with respect to k (Eq. 8).
- This means that in order to maintain a fixed false positive probability, the length of a Bloom filter must grow linearly with the number of elements inserted in the filter.
- There are other data structures that use space closer to the lower bound, but they are more complicated (cf. [5], [6], [7]).
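The formulas above can be checked numerically. This sketch uses the standard approximations p ≈ (1 − e^(−kn/m))^k and k_opt = (m/n)·ln 2; the concrete m and n values are illustrative.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate FPP of a Bloom filter: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Number of hash functions that minimizes the FPP: (m/n) * ln 2."""
    return (m / n) * math.log(2)

# With 10 bits per element, the optimal k is about 6.93; rounding to k = 7
# gives a false positive probability of roughly 0.8%.
m, n = 10_000, 1_000
k = round(optimal_k(m, n))
assert k == 7
assert 0.005 < false_positive_rate(m, n, k) < 0.012
```

Doubling n while keeping m fixed visibly degrades the rate, in line with the linear-growth observation above.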
B. Operations
- Standard Bloom filters do not support the removal of elements.
- Therefore a number of dedicated structures have been proposed that support deletions.
- The bit-vector nature of the Bloom filter allows the union of two or more Bloom filters simply by performing bitwise OR on the bit-vectors.
- One straightforward approach to intersect two Bloom filters is to assume the same m and hash functions and to take the logical AND of the two bit-vectors.
- Host A can then check false positives with B in a final round.
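The union and (approximate) intersection operations described above reduce to bitwise OR and AND over the bit-vectors. A toy sketch with hand-written six-bit vectors:

```python
def bf_union(a, b):
    """Exact union: equals the filter built from the union of both sets,
    assuming identical m and hash functions."""
    return [x | y for x, y in zip(a, b)]

def bf_intersection(a, b):
    """Approximate intersection: AND may keep bits that were set by
    different elements in each filter, adding extra false positives."""
    return [x & y for x, y in zip(a, b)]

a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 1, 0, 1]
assert bf_union(a, b) == [1, 1, 1, 1, 0, 1]
assert bf_intersection(a, b) == [1, 0, 0, 1, 0, 0]
```

The union is lossless in the sense that it equals the filter one would get by inserting both element sets into a fresh filter; the intersection is only an over-approximation.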
C. Hashing techniques
- Hash functions are the key building block of probabilistic filters.
- The n-size array can be used to store the information associated with each element x ∈ S [5].
- For Bloom filter operations, the double hashing scheme reduces the number of true hash computations from k down to two without any increase in the asymptotic false positive probability [16].
- When applied to hash table constructions, multiple hashing provides a probabilistic method to limit the effects of collisions by distributing elements more or less evenly.
- While this is a great aid to theoretical analyses, hash function implementations are known to behave far worse than truly random ones.
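A sketch of the double hashing scheme mentioned above: the combination g_i(x) = (h1(x) + i·h2(x)) mod m follows the scheme analyzed in [16], while deriving both base hashes from a single SHA-256 digest is an illustrative shortcut of this sketch.

```python
import hashlib

def double_hash_indexes(item: bytes, m: int, k: int):
    """Simulate k hash functions with two: g_i(x) = (h1(x) + i*h2(x)) mod m."""
    d = hashlib.sha256(item).digest()        # both halves of one digest
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1  # odd stride avoids degenerate cycles
    return [(h1 + i * h2) % m for i in range(k)]

idx = double_hash_indexes(b"example", m=1024, k=4)
assert len(idx) == 4 and all(0 <= i < 1024 for i in idx)
```

Only two real hash computations are needed regardless of k, which matters when hashing dominates insertion and lookup cost.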
III. BLOOM FILTER VARIANTS
- A number of Bloom filter variants have been proposed that address some of the limitations of the original structure, including counting, deletion, multisets, and space-efficiency.
- The authors start their examination with the basic counting Bloom filter construction, and then proceed to more elaborate structures including Bloomier and Spectral filters.
A. Counting Bloom Filters
- As mentioned in the treatment of standard Bloom filters, they do not support element deletions.
- To avoid counter overflow, the authors need to choose sufficiently large counters.
- A counting Bloom filter also has the ability to keep approximate counts of items.
- The upper bound on the counter values follows from the binomial distribution of counter increments: the probability that any counter is at least j is bounded above by m·C(nk, j)(1/m)^j(1 − 1/m)^(nk−j).
- When an element is placed into the table, following the d-left hashing technique, d candidate buckets are obtained by computing d independent hash values of the element.
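A counting Bloom filter replaces each bit with a small counter so that deletion becomes possible. The sketch below is illustrative: the cap of 15 mimics the commonly used 4-bit counters, and the hash derivation is a placeholder.

```python
import hashlib

class CountingBloomFilter:
    """Counting BF sketch: small capped counters make deletion possible."""

    def __init__(self, m, k, max_count=15):  # 15 mimics 4-bit counters
        self.m, self.k, self.max_count = m, k, max_count
        self.counters = [0] * m

    def _indexes(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            if self.counters[i] < self.max_count:  # stick at cap, no overflow
                self.counters[i] += 1

    def remove(self, item):
        # Only remove items known to be present, or false negatives appear.
        for i in self._indexes(item):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def __contains__(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))

cbf = CountingBloomFilter(m=512, k=4)
cbf.add("flow-1")
assert "flow-1" in cbf
cbf.remove("flow-1")
assert "flow-1" not in cbf    # deletion works, unlike the standard BF
```

Letting a counter stick at its cap trades a small probability of later false negatives (on removal) for bounded counter width.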
C. Compressed Bloom Filter
- Compressing a Bloom filter improves performance when a Bloom filter is passed in a message between distributed nodes.
- This structure is particularly useful when information must be transmitted repeatedly, and the bandwidth is a limiting factor [7].
- If the optimal number of hash functions k (the value minimizing the false positive probability) is used, then the probability that a bit is set in the bitstring representing the filter is 1/2.
- The key idea in compressed Bloom filters is that by changing the way bits are distributed in the filter, it can be compressed for transmission purposes.
- After transmission, the filter is decompressed for use.
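The effect can be illustrated with synthetic bit arrays: at the optimal k about half the bits are set, which is essentially incompressible, while the compressed-Bloom-filter regime uses a larger but sparser array that shrinks well on the wire. The densities, array sizes, and use of zlib below are illustrative choices of this sketch.

```python
import random
import zlib

def bits_to_bytes(bits):
    """Pack a 0/1 list into bytes for transmission."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

random.seed(1)
# Optimal-k filter: ~half the bits are set, so it is nearly incompressible.
dense = [1 if random.random() < 0.5 else 0 for _ in range(10_000)]
# Compressed-BF regime: a larger but sparser array compresses well.
sparse = [1 if random.random() < 0.1 else 0 for _ in range(20_000)]

dense_wire = len(zlib.compress(bits_to_bytes(dense)))    # raw: 1250 bytes
sparse_wire = len(zlib.compress(bits_to_bytes(sparse)))  # raw: 2500 bytes
assert dense_wire > 1200   # no real savings over the 1250 raw bytes
assert sparse_wire < 2500  # clear savings over the 2500 raw bytes
```

The design point of a compressed Bloom filter is thus chosen for the transmitted size rather than for the in-memory size.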
E. Hierarchical Bloom Filters
- Shanmugasundaram et al. [31] presented a data structure called Hierarchical Bloom Filter to support substring matching.
- The filter works by splitting an input string into a number of fixed-size blocks.
- These blocks are then inserted into a standard Bloom filter.
- This substring matching may result in combinations of strings that are incorrectly reported as being in the set (false positives).
- For the second level, two subsequent blocks are concatenated and inserted into the second level.
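A minimal sketch of the two lowest levels of such a hierarchy: the block size, the offset tagging of keys, and the hashing details are illustrative simplifications, not the exact construction of [31].

```python
import hashlib

BLOCK = 4  # fixed block size (illustrative)

class HierarchicalBF:
    """Two-level sketch: level 0 stores (offset, block), level 1 stores
    (offset, concatenated block pair), all in one shared bit array."""

    def __init__(self, m=4096, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indexes(self, key):
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def _set(self, key):
        for i in self._indexes(key):
            self.bits[i] = 1

    def _test(self, key):
        return all(self.bits[i] for i in self._indexes(key))

    def insert(self, s):
        blocks = [s[i:i + BLOCK] for i in range(0, len(s), BLOCK)]
        for off, blk in enumerate(blocks):    # level 0: single blocks
            self._set(f"0|{off}|{blk}")
        for off in range(len(blocks) - 1):    # level 1: adjacent pairs
            self._set(f"1|{off}|{blocks[off] + blocks[off + 1]}")

    def pair_match(self, b1, b2, off):
        # Level-1 checks weed out false block combinations from level 0.
        return self._test(f"1|{off}|{b1 + b2}")

hbf = HierarchicalBF()
hbf.insert("abcdefgh")                  # blocks "abcd" and "efgh"
assert hbf.pair_match("abcd", "efgh", 0)
```

Checking concatenated pairs at the second level is what reduces the false combinations that single-block matches would otherwise report.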
F. Spectral Bloom Filters
- Spectral Bloom filters generalize Bloom filters to storing an approximate multiset and support frequency queries [32].
- The answer to any multiplicity query is never smaller than the true multiplicity, and greater only with probability ε.
- Spectral refers to the range within which multiplicity answers are given.
- The space usage is similar to that of a Bloom filter for a set of the same size (including the counters to store the frequency values).
- A further improvement of the error rate can be achieved using the recurring minimum (RM) method, which consists of storing elements with a single minimum (among the k counters) in a secondary Spectral Bloom filter with a smaller error probability.
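A sketch of a Spectral Bloom filter using the basic minimum selection (MS) estimator; the counter array layout and hash derivation are illustrative simplifications of [32].

```python
import hashlib

class SpectralBF:
    """Spectral BF sketch with the minimum selection (MS) estimator."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indexes(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def estimate(self, item):
        # The minimum of the k counters never underestimates the multiplicity
        # and overestimates only with small probability (epsilon).
        return min(self.counters[i] for i in self._indexes(item))

sp = SpectralBF(m=1024, k=4)
for _ in range(3):
    sp.add("x")
assert sp.estimate("x") >= 3    # answers are never below the true count
```

Taking the minimum works because every insertion of an item increments all of its k counters, so each counter is an upper bound on the item's multiplicity.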
H. Decaying Bloom Filters
- Duplicate element detection is an important problem, especially pertaining to data stream processing [36].
- This motivates approximate detection of duplicates among newly arrived data elements of a data stream.
- This can be accomplished within a fixed time window.
- The Decaying Bloom Filter (DBF) structure has been proposed for this application scenario.
- A variant of DBF has been applied for hint-based routing in wireless sensor networks [39].
I. Stable Bloom Filter
- The Stable Bloom Filter or SBF [41] is another solution to duplicate element detection.
- The SBF guarantees that the expected fraction of zeros in the SBF stays constant.
- The SBF introduces both false positives and false negatives, but with rates improved from standard Bloom filters or standard buffering.
- When adding an element, P counters chosen at random are first decremented (by one).
- Please see the full paper [41] for details on setting all the parameters.
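A hedged sketch of the SBF update rule described above; all parameter values below (m, k, P, Max) are placeholders, and [41] discusses how to set them properly.

```python
import hashlib
import random

class StableBloomFilter:
    """SBF sketch: decrement P random cells, then set the element's k cells
    to a maximum value, keeping the expected fraction of zeros constant."""

    def __init__(self, m=1000, k=3, p=10, max_val=3, seed=0):
        self.m, self.k, self.p, self.max_val = m, k, p, max_val
        self.cells = [0] * m
        self.rng = random.Random(seed)  # placeholder parameters; see [41]

    def _indexes(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def seen_then_add(self, item):
        """Report whether item looks like a duplicate, then record it."""
        seen = all(self.cells[i] > 0 for i in self._indexes(item))
        for j in self.rng.sample(range(self.m), self.p):  # decay step
            if self.cells[j] > 0:
                self.cells[j] -= 1
        for i in self._indexes(item):                     # record step
            self.cells[i] = self.max_val
        return seen

stbf = StableBloomFilter()
first = stbf.seen_then_add("pkt-42")
second = stbf.seen_then_add("pkt-42")
assert first is False   # new element on an empty filter
assert second is True   # an immediate repeat is still present in the cells
```

The random decrements are what "forget" old elements over time, producing false negatives for stale duplicates but keeping the filter stable on an unbounded stream.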
K. Adaptive Bloom filters
- The Adaptive Bloom Filter (ABF) [43] is an alternative construction to counting Bloom filters especially well suited for applications where large counters are to be supported without overflows and under unpredictable collision rate dynamics (e.g., network traffic applications).
- The key idea of the ABF is to count the appearances of elements by an increasing set of hash functions.
- The key idea is to take advantage of differing flow sizes and increase or decrease the signature lengths of flows, making them easier or harder to identify in the filter.
- The construction can adaptively reduce the false positive rate by removing some bits of the signature, thus effectively removing the flow from the structure.
- A related technique for handling time-varying sets, called double buffering, uses two bitmaps, active and inactive, to support time-dependent Bloom filters.
N. Scalable Bloom filters
- One caveat with Bloom Filters is having to dimension the maximum filter size (m) a priori.
- This is commonly done by application designers by establishing an upper bound on the expected fpr and estimating the maximum required capacity (n).
- Scalable Bloom Filters (SBF) [47] are a BF variant that can adapt dynamically to the number of elements stored, while assuring a maximum false positive probability.
- Set membership queries require testing for element presence in each filter, hence the requirement of increasing sizes and tightening error probabilities as the SBF scales up.
- Parameters of the SBF in addition to the initial bit size m and target fpr include the expected growth rate (s) and the error probability tightening ratio (r).
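A sketch of the slice-growing mechanism: each slice is a plain Bloom filter of fixed capacity, and a new, larger slice with a tighter error target is added when the current one fills up. The per-slice capacity formula and k = ⌈log2(1/p)⌉ follow the usual SBF analysis, while the concrete constants are illustrative.

```python
import hashlib
import math

class _BFSlice:
    """One plain Bloom filter slice of fixed capacity."""

    def __init__(self, m, k):
        self.m, self.k, self.count = m, k, 0
        self.bits = [0] * m
        self.capacity = max(1, int(m * math.log(2) / k))  # n at near-optimal k

    def _indexes(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[i] for i in self._indexes(item))

class ScalableBloomFilter:
    """Grow by slices of size m0*s^i with error probabilities p0*r^i."""

    def __init__(self, m0=256, p0=0.01, s=2, r=0.5):
        self.m0, self.p0, self.s, self.r = m0, p0, s, r
        self.slices = [self._new_slice(0)]

    def _new_slice(self, i):
        p_i = self.p0 * (self.r ** i)                # tightened error target
        k_i = max(1, math.ceil(math.log2(1 / p_i)))  # k = ceil(log2(1/p))
        return _BFSlice(self.m0 * (self.s ** i), k_i)

    def add(self, item):
        if self.slices[-1].count >= self.slices[-1].capacity:
            self.slices.append(self._new_slice(len(self.slices)))
        self.slices[-1].add(item)

    def __contains__(self, item):
        # Membership = present in any slice (hence the per-slice error budget).
        return any(item in sl for sl in self.slices)

scbf = ScalableBloomFilter()
for i in range(500):
    scbf.add(f"elem-{i}")
assert all(f"elem-{i}" in scbf for i in range(500))  # no false negatives
assert len(scbf.slices) > 1                          # the filter grew
```

Because a query must miss in every slice to return false, the tightening ratio r keeps the overall error bounded by a geometric series.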
O. Dynamic Bloom Filter
- Standard BFs and their mainstream variants suffer from inefficiencies when the cardinality of the set under representation is unknown prior to design and deployment.
- In distributed applications, BF reconstruction is cumbersome and may hinder interoperability.
- The DBF is based on the notion of an active Bloom filter.
- The element is then inserted into the active BF.
- If multiple filters return true, the element removal may result in, at most, k potential false negatives.
P. Split Bloom Filters
- A Split Bloom filter (SPBF) [49] employs a constant s × m bit matrix for set representation, where s is a pre-defined constant based on the estimation of maximum set cardinality.
- The SPBF aims at overcoming the limitation of standard BFs which do not take sets of variable sizes into account.
- The basic idea of the SPBF is to allocate more memory space to enhance the capacity of the filter before its implementation and actual deployment.
- The false match probability increases as the set cardinality grows.
- An existing SPBF must be reconstructed using a new bit matrix if the false match probability exceeds an upper bound.
Q. Retouched Bloom filters
- The Retouched Bloom filter (RBF) [50] builds upon two observations.
- First, in many BF applications some false positives are more troublesome than others, and these can be identified after BF construction but prior to deployment.
- Second, there are cases where a low level of false negatives is acceptable.
- The novel idea behind the RBF is the bit clearing process by which false positives are removed by resetting individual bits.
- In case of a random bit clearing process, the gains are neutral, i.e., the fpr decrease equals the fnr increase.
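A toy sketch of the bit clearing process: the 8-bit filter and the two stand-in "hash functions" are purely hypothetical, contrived so that the element "z" happens to be a false positive that can then be retouched away.

```python
def retouch(bits, hash_fns, troublesome):
    """Clear one set bit per known troublesome false positive, trading a
    controlled false negative rate for a lower false positive rate."""
    for item in troublesome:
        for h in hash_fns:
            i = h(item)
            if bits[i]:
                bits[i] = 0  # one cleared bit makes the query fail
                break
    return bits

# Toy 8-bit filter with two stand-in "hash functions" (hypothetical).
hash_fns = [lambda x: sum(x.encode()) % 8,
            lambda x: (sum(x.encode()) // 3) % 8]
bits = [0] * 8
for item in ["a", "b"]:
    for h in hash_fns:
        bits[h(item)] = 1

def query(item):
    return all(bits[h(item)] for h in hash_fns)

# "z" happens to be a false positive here; retouching removes it.
assert query("z")
retouch(bits, hash_fns, ["z"])
assert not query("z")
# The cost: "b" shared the cleared bit and is now a false negative.
assert not query("b")
```

The example also shows the trade-off concretely: clearing the bit that removed the false positive "z" simultaneously turned the legitimate member "b" into a false negative.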
R. Generalized Bloom Filters
- A GBF starts out as an arbitrary bit vector containing both 1s and 0s, and information is encoded by setting chosen bits to either 0 or 1, thus departing from the notion that empty bit cells represent the absence of information.
- As a result, the GBF is a more general binary classifier than the standard Bloom filter.
- In the GBF, the false-positive probability is upper bounded and it does not depend on the initial condition of the filter.
- The generalization brought by the set of hash functions that reset bits introduces false negatives, whose probability can also be upper bounded and likewise does not depend on the initial set-up of the bit filter.
- The GBF returns false if any bit is inverted, i.e., the queried element does not belong to the set with high probability.
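A sketch of the GBF's two hash-function groups; the reset-wins conflict arbitration, the group salting, and the hash derivation are assumptions of this sketch rather than details taken from the source.

```python
import hashlib

class GeneralizedBloomFilter:
    """GBF sketch: k0 hash functions reset bits to 0, k1 set bits to 1.
    The array may start in any state; false negatives become possible."""

    def __init__(self, m, k0, k1, initial_bits=None):
        self.m, self.k0, self.k1 = m, k0, k1
        self.bits = list(initial_bits) if initial_bits else [0] * m

    def _indexes(self, item, group, count):
        d = hashlib.sha256(f"{group}|{item}".encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(count)]

    def add(self, item):
        for i in self._indexes(item, "h", self.k1):
            self.bits[i] = 1        # set group
        for i in self._indexes(item, "g", self.k0):
            self.bits[i] = 0        # reset group wins on conflicts

    def query(self, item):
        # Member only if no bit is inverted: all g-bits 0 and all h-bits 1.
        g = self._indexes(item, "g", self.k0)
        h = self._indexes(item, "h", self.k1)
        return (all(self.bits[i] == 0 for i in g)
                and all(self.bits[i] == 1 for i in h))

gbf = GeneralizedBloomFilter(m=512, k0=2, k1=3, initial_bits=[1] * 512)
assert gbf.query("key") is False  # arbitrary initial bits, no instant member
gbf.add("key")
assert all(gbf.bits[i] == 0 for i in gbf._indexes("key", "g", 2))
```

Because membership requires specific bits to be 0, a filter full of 1s (or of 0s) represents the empty set, which is the sense in which the false positive bound is independent of the initial condition.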
T. Data Popularity Conscious Bloom Filters
- In many information processing environments, the underlying popularities of data items and queries are not identical; rather, they differ and are skewed.
- An intuitive approach to take data item popularity into account is to use longer encodings and more hash functions for important elements and shorter encodings and fewer hash functions for less important ones.
- Thus the Bloom filter construction lends itself well to data popularity-conscious filtering; however, this requires minimizing the false positive rate by adapting the number of hashes used for each element to its popularity in sets and membership queries.
- To this end, an object importance metric was proposed in [55].
- The problem was modeled as a constrained nonlinear integer program, and two polynomial-time solutions were presented with bounded approximation ratios.
V. Weighted Bloom filter
- Bruck et al. [57] propose Weighted Bloom filter (WBF), a Bloom filter variant that exploits the a priori knowledge of the frequency of element requests by varying the number of hash functions (k) accordingly as a function of the element query popularity.
- Hence, a WBF incorporates the information on the query frequencies and the membership likelihood of the elements into its optimal design, which fits many applications well in which popular elements are queried much more often than others.
- The rationale behind the WBF design is to consider the filter fpr as a weighted sum of each individual element’s false positive probability, where the weight is positively correlated with the element’s query frequency and is negatively correlated with the element’s probability of being a member.
- As a consequence, in applications where the query frequencies can be estimated or collected and follow, for instance, a step or Zipf distribution, the WBF largely outperforms the traditional Bloom filter in fpr.
- Even a simple binary classification of elements between hot and cold can result in false positive improvements of a few orders of magnitude.
W. Secure Bloom filters
- The hashing nature of Bloom filters provides some basic security in the sense that the identities of the set elements represented by the BF are not clearly visible to an observer.
- However, BFs are vulnerable to correlation attacks, where the similarity of BFs' contents can be deduced by comparing BF indexes for overlaps, or lack thereof.
- Encrypted Bloom filters by Bellovin and Cheswick [59] propose a privacy-preserving filter variant of Bloom filters which introduces a semi-trusted third party to transform one party’s queries to a form suitable for querying the other party’s BF, in such a way that the original query privacy is preserved.
- Instead of keeping all parties' keys undisclosed and securing the BF operations with keyed hash functions as per Goh [58], Bellovin and Cheswick propose a specialized form of encryption function where operations can be done on encrypted data.
- More specifically, their proposal is based on the Pohlig-Hellman cipher, which forms an Abelian group over its keys when encrypting any given element.
X. Summary and discussion
- Table II summarizes the distinguishing features of the Bloom filter variants discussed in this section.
- The different Bloom filter designs aim at addressing specific concerns regarding space and transmission efficiency, false positive rate, dynamic operation in terms of increasing workload, dynamic operation in terms of insertions and deletions, counting and frequencies, popularity-aware operation, and mapping to elements and sets instead of simple set membership tests.
- For each variant, table II indicates the output type (e.g., boolean, frequency, value) and whether counting (C), deletion (D), or popularity-awareness (P) are supported (Yes/No/Maybe), or false negatives (FN) are introduced.
- Making this choice and optimizing the parameters for the expected use cases are fundamental factors in achieving the desired performance in practice.
- Ultimately, which probabilistic data structure is best suited depends a lot on the application specifics.
IV. BLOOM FILTERS IN DISTRIBUTED COMPUTING
- The authors have surveyed techniques for probabilistic representation of sets and functions.
- The applications of these structures are manifold, and they are widely used in various networking systems, such as Web proxies and caches, database servers, and routers.
- In packet routing and forwarding, Bloom filters and their variants play important roles in flow detection and classification.
- Probabilistic techniques can be used to store and process measurement data summaries in routers and other network entities.
- For more detail, see Figure 15 at the end of this article.
A. Caching
- Bloom filters have been applied extensively to caching in distributed environments.
- Figure 10 illustrates the use of a Bloom filter-based summary cache at a proxy.
- Within a single proxy, a Bloom filter representing the local content cache needs to be recreated when the content changes.
- Each chunk modulo the digest size is used as the value for one of the Bloom filter hash functions.
- Bigtable uses Bloom filters to reduce the disk lookups for non-existent rows or columns [65].
B. P2P Networks
- Bloom filters have been extensively applied in P2P environments for various tasks, such as compactly storing keyword-based searches and indices [67], synchronizing sets over the network, and summarizing content.
- In [68], the applications and parameters of Bloom filters in P2P networks are discussed.
- Ideally, the state should be such that it allows for accurate matching of queries and takes sublinear space (or near constant space).
- They present a locality-aware P2P system architecture called Foreseer, which explicitly exploits geographical locality and temporal locality by constructing a neighbor overlay and a friend overlay, respectively.
- Tribler uses Bloom filters to keep the databases that maintain the social trust network synchronized between peers.
C. Packet Routing and Forwarding
- Bloom filters have been used to improve network router performance [76].
- In [77], Bloom filters are used for high-speed network packet filtering.
- By using direct lookup array and Controlled Prefix Expansion (CPE), worst-case performance is limited to two hash probes and one array access per lookup.
- The other extreme approach to support multicast is to move state from the network elements to the packets themselves in the form of Bloom filter-based representations of the multicast trees.
- More importantly, matching of an incoming packet can now be performed in parallel over all tuples.
D. Monitoring and Measurement
- Network monitoring and measurement are key application areas for Bloom filters and their variants.
- The authors briefly examine some key cases in this domain, for example the detection of heavy flows, Iceberg queries, packet attribution, and approximate state machines.
- Bloom filter variants that are able to count elements are good candidate structures for supporting Iceberg queries.
- Packet and payload attribution is another application area in measurement for Bloom filters.
- It solves the central problems (counter space and flow-to-counter association) of per-flow measurement by "braiding" a hierarchy of counters with random graphs.
E. Security
- The hashing nature of the Bloom filter makes it a natural fit for security applications.
- Two years later, Manber and Wu [108] presented two extensions to enhance the Bloom-filter-based check for weak passwords.
- When the CBF was empty to the degree α, the attack string was considered detected, and the full string matcher was used to check for false positives.
- The authors report a greater than 99% detection rate and false positive ratios of 1% or less.
- In [118], Wolf presents a mechanism where packet forwarding is dependent on credentials represented as a Bloom filter that fits in a packet header.
F. Other Applications
- This section summarizes use of Bloom filters in several other interesting applications.
- Figure 14 shows an overview of device wakeup using a Bloom filter.
- Millions of path queries can be stored efficiently.
- Their Bloom pre-calculation scheme provides high-speed identification with a small amount of memory by storing pre-calculated outputs of the tags in Bloom filters.
- The differential file, with updated records, would be accessed only when the record to fetch was contained in the Bloom filter, indicating that the record in the database is not up-to-date.
V. SUMMARY
- Bloom filters are a general aid for network processing and improving the performance and scalability of distributed systems.
- In Figure 15, the Bloom filter variants introduced in this paper are categorized by application domain and supported features.
- Variants that support a certain feature are found inside a highlighted area labeled with the name of that feature.
- The variants that support counting are derived from the Counting Bloom Filter and include an array of fixed- or variable-size counters.
- These allow, for example, in-word matches for text search.
Frequently Asked Questions (19)
Q2. What is the effect of increasing or decreasing the number of hash functions towards kopt?
Moving the number of hash functions towards kopt lowers the false positive ratio, while additional hash functions increase the computation needed for insertions and lookups.
Q3. What is the accuracy of a Bloom filter?
The accuracy of a Bloom filter depends on the size of the filter, the number of hash functions used in the filter, and the number of elements added to the set.
Q4. What is the main idea of multiple hashing?
When applied to hash table constructions, multiple hashing provides a probabilistic method to limit the effects of collisions by distributing elements more or less evenly.
Q5. What is the function that returns true if any of the BFs contain the element?
The query element membership operation iterates the set of BFs in the DBF and returns true if any of the BFs contain the element.
Q6. What is the probability of a bit being set in the bitstring?
If the optimal value of the number of hash functions k in order to minimize the false positive probability is used then the probability that a bit is set in the bitstring representing the filter is 1/2.
Q7. What is the effect of the construction on false positives?
The construction can adaptively reduce the false positive rate by removing some bits of the signature, thus effectively removing the flow from the structure.
Q8. What is the technique for handling time-varying sets?
A related technique for handling time-varying sets, called double buffering, uses two bitmaps, active and inactive, to support time-dependent Bloom filters.
Q9. How does the proposed mechanism adapt to set growth?
The proposed mechanism adapts to set growth by adding “slices” of traditional Bloom Filters of increasing sizes and tighter error probabilities, added as needed.
Q10. What is the simplest way to test for element presence in a set?
Set membership queries require testing for element presence in each filter, thus the requirement on increasing sizes and tightening of error probabilities as the BF scales up.
Q11. What is the description of the perfect hashing scheme?
1) Perfect Hashing Scheme: A simple technique called perfect hashing (or explicit hashing) can be used to store a static set S of values in an optimal manner using a perfect hash function.
Q12. Why is the Bloom filter used to distribute routing tables?
This is motivated by applications such as Web caches and P2P information sharing, which frequently use Bloom filters to distribute routing tables.
Q13. What is the caveat to multiple hashing?
The caveat is that when checking for elements, both groups of k hash functions need to be checked since there is no information on which group was initially used and false positives can potentially be claimed for either group.
Q14. What is the hierarchy of blocks used to insert a string?
When a string is inserted, it is first broken into blocks which are inserted into the filter hierarchy starting from the lowest level.
Q15. What is the probability that the ith counter is incremented j times?
The probability that the ith counter is incremented j times is a binomial random variable: P(c(i) = j) = C(nk, j)(1/m)^j(1 − 1/m)^(nk−j) (13). The probability that any counter is at least j is bounded above by m·P(c(i) = j), which can be calculated using the above formula.
Q16. How can the authors solve the problem of deleting a fingerprint?
The problem of knowing which candidate element fingerprint to delete – in case of fingerprint collisions – can be neatly solved by breaking the problem into two parts, namely the creation of the fingerprint, and finding the d locations by making additional (pseudo)-random permutations.
Q17. How are the hash functions constrained?
The range of the hash functions needs to be accordingly constrained, for instance by applying mod (m/2) to the hash outputs.
Q18. What is the effect of the WBF on the query frequency?
As a consequence, in applications where the query frequencies can be estimated or collected and result for instance in a step or the Zipf distribution, the WBF largely outperforms in fpr the traditional Bloom filter.
Q19. What is the popular explanation for multiple hashing?
This can be explained by the observation that (1 − 1/m)^(kn) > (1 − k/m)^n (12). Multiple hashing is a popular technique that exploits the notion of having multiple hash choices and the power to choose the most convenient candidate.