Network Applications of Bloom Filters: A Survey

doi:10.1080/15427951.2004.10129096

Home
/
Papers
/
Network Applications of Bloom Filters: A Survey

Journal Article•DOI•

Network Applications of Bloom Filters: A Survey

Andrei Z. Broder¹, Michael Mitzenmacher²•Institutions (2)

IBM¹, Harvard University²

01 Jan 2004-Internet Mathematics (A.K. Peters)-Vol. 1, Iss: 4, pp 485-509

TL;DR: The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.

read less

Abstract: A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in recent years have they become popular in the networking literature. The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Data streams: algorithms and applications

[...]

S. Muthukrishnan¹•Institutions (1)

Rutgers University¹

01 Aug 2005-Foundations and Trends in Theoretical Computer Science

TL;DR: Data Streams: Algorithms and Applications surveys the emerging area of algorithms for processing data streams and associated applications, which rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity.

...read moreread less

Abstract: In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [1].

...read moreread less

1,598 citations

Book•

Data Streams: Algorithms and Applications

[...]

S. Muthukrishnan¹•Institutions (1)

Rutgers University¹

01 Jan 2005

TL;DR: In this paper, the authors present a survey of basic mathematical foundations for data streaming systems, including basic mathematical ideas, basic algorithms, and basic algorithms and algorithms for data stream processing.

...read moreread less

Abstract: 1 Introduction 2 Map 3 The Data Stream Phenomenon 4 Data Streaming: Formal Aspects 5 Foundations: Basic Mathematical Ideas 6 Foundations: Basic Algorithmic Techniques 7 Foundations: Summary 8 Streaming Systems 9 New Directions 10 Historic Notes 11 Concluding Remarks Acknowledgements References

...read moreread less

1,506 citations

Proceedings Article•DOI•

RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response

[...]

Úlfar Erlingsson¹, Vasyl Pihur¹, Aleksandra Korolova²•Institutions (2)

Google¹, University of Southern California²

25 Jul 2014-arXiv: Cryptography and Security

TL;DR: RAPPOR as discussed by the authors is a system for crowdsourcing statistics from end-user client software, anonymously, with strong privacy guarantees, allowing the forest of client data to be studied, without permitting the possibility of looking at individual trees.

...read moreread less

Abstract: Randomized Aggregatable Privacy-Preserving Ordinal Response, or RAPPOR, is a technology for crowdsourcing statistics from end-user client software, anonymously, with strong privacy guarantees. In short, RAPPORs allow the forest of client data to be studied, without permitting the possibility of looking at individual trees. By applying randomized response in a novel manner, RAPPOR provides the mechanisms for such collection as well as for efficient, high-utility analysis of the collected data. In particular, RAPPOR permits statistics to be collected on the population of client-side strings with strong privacy guarantees for each client, and without linkability of their reports. This paper describes and motivates RAPPOR, details its differential-privacy and utility guarantees, discusses its practical deployment and properties in the face of different attack models, and, finally, gives results of its application to both synthetic and real-world data.

...read moreread less

1,338 citations

Proceedings Article•

Avoiding the disk bottleneck in the data domain deduplication file system

[...]

Benjamin Zhu¹, Kai Li², Hugo Patterson¹•Institutions (2)

Data Domain¹, Princeton University²

26 Feb 2008

TL;DR: Three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck are described, which enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/ sec for multi- stream throughput.

...read moreread less

Abstract: Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment. This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.

...read moreread less

934 citations

Proceedings Article•DOI•

RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response

[...]

Úlfar Erlingsson¹, Vasyl Pihur¹, Aleksandra Korolova²•Institutions (2)

Google¹, University of Southern California²

03 Nov 2014

TL;DR: This paper describes and motivates RAPPOR, details its differential-privacy and utility guarantees, discusses its practical deployment and properties in the face of different attack models, and gives results of its application to both synthetic and real-world data.

...read moreread less

880 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Proceedings Article•DOI•

Chord: A scalable peer-to-peer lookup service for internet applications

[...]

Ion Stoica¹, Robert Morris², David R. Karger², M. Frans Kaashoek², Hari Balakrishnan² - Show less +1 more•Institutions (2)

University of California, Berkeley¹, Massachusetts Institute of Technology²

27 Aug 2001

TL;DR: Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.

...read moreread less

Abstract: A fundamental problem that confronts peer-to-peer applications is to efficiently locate the node that stores a particular data item. This paper presents Chord, a distributed lookup protocol that addresses this problem. Chord provides support for just one operation: given a key, it maps the key onto a node. Data location can be easily implemented on top of Chord by associating a key with each data item, and storing the key/data item pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.

...read moreread less

10,286 citations

Journal Article•DOI•

Space/time trade-offs in hash coding with allowable errors

[...]

Burton H. Bloom

01 Jul 1970-Communications of The ACM

TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.

...read moreread less

Abstract: In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency.The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods.In such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to “catch” the small fraction of errors associated with the new methods. An example is discussed which illustrates possible areas of application for the new methods.Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.

...read moreread less

7,390 citations

"Network Applications of Bloom Filte..." refers methods in this paper

...Burton Bloom introduced Bloom filters in the 1970s [Bloom 70], and ever since they have been very popular in database applications....
[...]
...Bloom introduced Bloom filters in conjunction with an application to hyphenation programs [Bloom 70]....
[...]

Proceedings Article•DOI•

A scalable content-addressable network

[...]

Sylvia Ratnasamy¹, Paul Francis², Mark Handley², Richard M. Karp¹, Scott Shenker² - Show less +1 more•Institutions (2)

University of California, Berkeley¹, AT&T²

27 Aug 2001

TL;DR: The concept of a Content-Addressable Network (CAN) as a distributed infrastructure that provides hash table-like functionality on Internet-like scales is introduced and its scalability, robustness and low-latency properties are demonstrated through simulation.

...read moreread less

Abstract: Hash tables - which map "keys" onto "values" - are an essential building block in modern software systems. We believe a similar functionality would be equally valuable to large distributed systems. In this paper, we introduce the concept of a Content-Addressable Network (CAN) as a distributed infrastructure that provides hash table-like functionality on Internet-like scales. The CAN is scalable, fault-tolerant and completely self-organizing, and we demonstrate its scalability, robustness and low-latency properties through simulation.

...read moreread less

6,703 citations

Journal Article•DOI•

OceanStore: an architecture for global-scale persistent storage

[...]

John Kubiatowicz¹, David Bindel¹, Yan Chen¹, Steven E. Czerwinski¹, Patrick Eaton¹, Dennis Geels¹, Ramakrishna Gummadi¹, Sean Rhea¹, Hakim Weatherspoon¹, Westley Weimer¹, Chris Wells¹, Ben Y. Zhao¹ - Show less +8 more•Institutions (1)

University of California, Berkeley¹

12 Nov 2000

TL;DR: OceanStore monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through pro-active movement of data.

...read moreread less

Abstract: OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through pro-active movement of data. A prototype implementation is currently under development.

...read moreread less

3,376 citations

"Network Applications of Bloom Filte..." refers methods in this paper

...Rhea and Kubiatowicz [Rhea and Kubiatowicz 02] utilize the ideas in the basic protocol in Section 6.1 to design a probabilistic routing algorithm for peer-to-peer location mechanisms, in conjunction with the OceanStore project [Kubiatowicz et al. 00]....
[...]

Journal Article•DOI•

Summary cache: a scalable wide-area web cache sharing protocol

[...]

Li Fan¹, Pei Cao², Jussara M. Almeida¹, Andrei Z. Broder•Institutions (2)

University of Wisconsin-Madison¹, Cisco Systems, Inc.²

01 Jun 2000-IEEE ACM Transactions on Networking

TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.

...read moreread less

Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless it is not widely deployed due to the overhead of existing protocols. In this paper we demonstrate the benefits of cache sharing, measure the overhead of the existing protocols, and propose a new protocol called "summary cache". In this new protocol, each proxy keeps a summary of the cache directory of each participating proxy, and checks these summaries for potential hits before sending any queries. Two factors contribute to our protocol's low overhead: the summaries are updated only periodically, and the directory representations are very economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to existing protocols such as the Internet cache protocol (ICP), summary cache reduces the number of intercache protocol messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP. Hence summary cache scales to a large number of proxies. (This paper is a revision of Fan et al. 1998; we add more data and analysis in this version.).

...read moreread less

2,174 citations

"Network Applications of Bloom Filte..." refers background or methods in this paper

...The analysis from [Fan et al. 00] reveals that 4 bits per counter should suffice for most applications....
[...]
...Analysis from [8] (which will appear in the full version of the survey) reveals that 4 bits per counter should suÆce for most applications....
[...]
...To avoid this problem, [8] introduces the idea of a counting Bloom lter....
[...]
...Fan, Cao, Almeida, and Broder describe Summary Cache, which uses Bloom filters for Web cache sharing [Fan et al. 00]....
[...]
...Fan, Cao, Almeida, and Broder describe Summary Cache, which uses Bloom lters for Web cache sharing [8]....
[...]