Journal ArticleDOI

An improved data stream summary: the count-min sketch and its applications

01 Apr 2005-Journal of Algorithms (Academic Press, Inc.)-Vol. 55, Iss: 1, pp 58-75
TL;DR: In this paper, the authors introduce a sublinear space data structure called the count-min sketch for summarizing data streams. It allows fundamental queries in data stream summarization, such as point, range, and inner product queries, to be approximately answered very quickly; in addition, it can be applied to solve several important problems on data streams, such as finding quantiles and frequent items.
About: This article was published in the Journal of Algorithms on 2005-04-01 and is currently open access. It has received 1939 citations to date. The article focuses on the topics: Count-Min sketch & Data stream mining.

Summary (2 min read)

1 Introduction

  • This setup is the data stream scenario that has emerged recently.
  • Second, processing an update should be fast and simple; likewise, answering queries of a given type should be fast and have usable accuracy guarantees.
  • In recent years, several different sketches have been proposed in the data stream context that allow a number of simple aggregation functions to be approximated.
  • Although sketches use small space, the space used typically has an Ω(1/ε²) multiplicative factor.
  • Many sketch constructions require time linear in the size of the sketch to process each update to the underlying data [2, 13].

2 Preliminaries

  • For convenience, the authors shall usually drop t and refer only to the current state of the vector.
  • The former is known as the cash register case and the latter the turnstile case [16] .
  • The general case occurs in important scenarios too, for example in distributed settings where one considers the subtraction of one vector from another, say.
  • In databases, the point and range queries are of interest in summarizing the data distribution approximately; and inner-product queries allow approximation of join size of relations.
  • The authors will also study the use of these queries to compute more complex functions on data streams.

3 Count-Min Sketches

  • The authors now introduce their data structure, the Count-Min, or CM, sketch.
  • It is named after the two basic operations used to answer point queries, counting first and computing the minimum next.
  • The authors use e to denote the base of the natural logarithm function, ln. The update procedure hashes each arriving item with each of the d hash functions and adds the update amount to one counter per row.
  • The space used by Count-Min sketches is the array of wd counts, which takes wd words, and d hash functions, each of which can be stored using 2 words when using the pairwise functions described in [15] .
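The structure is just a d × w array of counters plus d pairwise-independent hash functions, so it is easy to sketch in code. The following is a minimal illustration, not the authors' implementation; the class name, the choice of prime, and the use of Python's random module to draw the hash coefficients are assumptions made for this example.

```python
import math
import random

class CountMinSketch:
    def __init__(self, eps, delta, prime=2**31 - 1):
        self.w = math.ceil(math.e / eps)          # width: controls additive error eps * ||a||_1
        self.d = math.ceil(math.log(1 / delta))   # depth: controls failure probability delta
        self.prime = prime
        # d pairwise-independent hash functions of the form h(x) = ((a*x + b) mod p) mod w
        self.hashes = [(random.randrange(1, prime), random.randrange(prime))
                       for _ in range(self.d)]
        self.counts = [[0] * self.w for _ in range(self.d)]

    def update(self, item, c=1):
        """Process an update (item, c): add c to one counter in each of the d rows."""
        for j, (a, b) in enumerate(self.hashes):
            self.counts[j][((a * item + b) % self.prime) % self.w] += c

    def point_query(self, item):
        """Estimate the count of item as the minimum over its d counters
        (never an underestimate when all updates are non-negative)."""
        return min(self.counts[j][((a * item + b) % self.prime) % self.w]
                   for j, (a, b) in enumerate(self.hashes))
```

For example, ε = 0.01 and δ = 0.01 give w = 272 and d = 5, i.e. 1360 counters in total.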

4.1 Point Query

  • The authors first show the analysis for point queries for the non-negative case.
  • The best known previous result using sketches was in [4]: there, sketches were used to approximate point queries.
  • Results were stated in terms of the frequencies of individual items.
  • Join size estimation is important in database query planners in order to determine the best order in which to evaluate queries.
  • The following corollary follows from the above theorem.

Corollary 1. The join size of two relations on a particular attribute can be approximated up to ε||a||₁||b||₁ with probability 1 − δ, by keeping space O((1/ε) log(1/δ)).

  • Previous results have used the "tug-of-war" sketches [1] .
  • When the distribution of items is non-uniform, for example when certain items contribute a large amount to the join size, then the two norms are closer, and the guarantees of the CM sketch method are closer to those of the existing method.
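As a rough illustration of how the join-size (inner product) estimate could be computed from two sketches built with identical hash functions, the helper below takes the minimum, over the d rows, of the row-wise dot products; it reuses the hypothetical CountMinSketch class sketched above.

```python
def inner_product_estimate(sk_a, sk_b):
    """Estimate the inner product (join size) of the two summarized vectors,
    assuming both sketches share the same width and hash functions."""
    assert sk_a.w == sk_b.w and sk_a.hashes == sk_b.hashes
    return min(sum(ca * cb for ca, cb in zip(row_a, row_b))
               for row_a, row_b in zip(sk_a.counts, sk_b.counts))
```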

4.3 Range Query

  • Then, given a range query Q(l, r), compute the at most 2 log₂ n dyadic ranges which canonically cover the range, and pose that many point queries to the sketches, returning the sum of the queries as the estimate (see the sketch after this list).
  • Applying the same Markov inequality argument as before, the probability that this additive error exceeds 2ε log n · ||a||₁ for any one estimator is less than 1/e; hence, for all of them the probability is at most δ.
  • The above theorem states the bound for the standard CM sketch size.
  • An obvious improvement of this technique in practice is to keep exact counts for the first few levels of the hierarchy, where there are only a small number of dyadic ranges.
  • This improves the space, time and accuracy of the algorithm in practice, although the asymptotic bounds are unaffected.
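A possible rendering of this dyadic-range scheme, assuming the universe size n is a power of two and reusing the hypothetical CountMinSketch class from above: one sketch per level of the hierarchy, with each level's sketch updated at the start index of the item's containing dyadic range.

```python
def update_all_levels(sketches_per_level, item, n, c=1):
    """Apply an update at every dyadic level; level 0 is [0, n), the deepest level is single items."""
    depth = n.bit_length() - 1                       # assumes n is a power of two
    for level in range(depth + 1):
        width = n >> level
        # identify a dyadic range by its start index
        sketches_per_level[level].update((item // width) * width, c)

def dyadic_cover(l, r, lo, hi, level=0, out=None):
    """Canonically cover [l, r] with at most 2*log2(n) dyadic ranges from the hierarchy over [lo, hi)."""
    if out is None:
        out = []
    if r < lo or hi - 1 < l:                         # disjoint: contributes nothing
        return out
    if l <= lo and hi - 1 <= r:                      # fully contained: take this whole dyadic range
        out.append((level, lo))
        return out
    mid = (lo + hi) // 2
    dyadic_cover(l, r, lo, mid, level + 1, out)
    dyadic_cover(l, r, mid, hi, level + 1, out)
    return out

def range_query(sketches_per_level, l, r, n):
    """Estimate a[l] + ... + a[r] as the sum of one point query per covering dyadic range."""
    return sum(sketches_per_level[level].point_query(start)
               for level, start in dyadic_cover(l, r, 0, n))
```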

5.1 Quantiles in the Turnstile Model

  • In [13] the authors showed that finding the approximate φ-quantiles of the data subject to insertions and deletions can be reduced to the problem of computing range sums.
  • The method of [13] uses Random Subset Sums to compute range sums.
  • By replacing this structure with Count-Min sketches, the improved results follow immediately.
  • By keeping log n sketches, one for each dyadic range and setting the accuracy parameter for each to be ε/ log n and the probability guarantee to δφ/ log(n), the overall probability guarantee for all 1/φ quantiles is achieved.
  • It is instructive to contrast their bounds with those for the problem in the weaker Cash Register Model, where items are only inserted (recall that in their stronger Turnstile Model, items may be deleted as well).
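Returning to the reduction itself, one way an approximate φ-quantile could be recovered once range sums are available is to binary-search for the smallest index whose estimated prefix sum reaches φ||a||₁. The function range_sum below is a stand-in for the sketch-based range estimator described above, not part of the paper.

```python
def approx_quantile(range_sum, total, phi, n):
    """Return the smallest k in [0, n) whose estimated prefix sum range_sum(0, k)
    reaches phi * total, where total = ||a||_1."""
    lo, hi = 0, n - 1
    target = phi * total
    while lo < hi:
        mid = (lo + hi) // 2
        if range_sum(0, mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo
```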

5.2 Heavy Hitters in the Turnstile Model

  • The authors adopt the solution given in [5] , which describes a divide and conquer procedure to find the heavy hitters.
  • This keeps sketches for computing range sums: log n different sketches, one for each different dyadic range.
  • In order to find all the heavy hitters, a parallel binary search is performed, descending one level of the hierarchy at each step.
  • Nodes in the hierarchy (corresponding to dyadic ranges) whose estimated weight exceeds the threshold of (φ + ε)||a||₁ are split into two ranges, and investigated recursively.
  • The authors instead must limit the number of items output whose true frequency is less than the fraction φ.
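A sketch of that divide-and-conquer search, under the same assumptions as before (n a power of two, one hypothetical CountMinSketch per dyadic level, ranges identified by their start index):

```python
def heavy_hitters(sketches_per_level, n, phi, eps, total):
    """Descend the dyadic hierarchy level by level, keeping every child range whose
    estimated weight exceeds (phi + eps) * total, and return the surviving single items."""
    threshold = (phi + eps) * total                  # total = ||a||_1
    depth = n.bit_length() - 1                       # assumes n is a power of two
    frontier = [0]                                   # start of the root range [0, n)
    for level in range(1, depth + 1):
        width = n >> level
        next_frontier = []
        for start in frontier:
            for child in (start, start + width):     # the two halves of the parent range
                if sketches_per_level[level].point_query(child) >= threshold:
                    next_frontier.append(child)
        frontier = next_frontier
    return frontier                                  # candidate heavy hitter item identifiers
```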


Citations
Journal ArticleDOI
TL;DR: The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
Abstract: A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in recent years have they become popular in the networking literature. The aim of this paper is to survey the ways in which Bloom filters have been used and modified in a variety of network problems, with the aim of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.

2,199 citations

Journal ArticleDOI
TL;DR: Data Streams: Algorithms and Applications surveys the emerging area of algorithms for processing data streams and associated applications, which rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity.
Abstract: In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [1].

1,598 citations

Book
01 Jan 2005
TL;DR: In this book, the author presents a survey of the foundations of data streaming, covering basic mathematical ideas, basic algorithmic techniques, streaming systems, and directions for future work.
Abstract: 1 Introduction 2 Map 3 The Data Stream Phenomenon 4 Data Streaming: Formal Aspects 5 Foundations: Basic Mathematical Ideas 6 Foundations: Basic Algorithmic Techniques 7 Foundations: Summary 8 Streaming Systems 9 New Directions 10 Historic Notes 11 Concluding Remarks Acknowledgements References

1,506 citations

Proceedings ArticleDOI
14 Jun 2009
TL;DR: In this article, the authors provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability, and demonstrate the feasibility of this approach with experimental results for a new use case.
Abstract: Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case --- multitask learning with hundreds of thousands of tasks.

955 citations

Book
20 Nov 2014
TL;DR: This volume covers mining aspects of data streams comprehensively: each contributed chapter contains a survey on the topic, the key ideas in the field for that particular topic, and future research directions.
Abstract: This book primarily discusses issues related to the mining aspects of data streams, and it is unique in its primary focus on the subject. The volume covers mining aspects of data streams comprehensively: each contributed chapter contains a survey on the topic, the key ideas in the field for that particular topic, and future research directions. The book is intended for a professional audience composed of researchers and practitioners in industry, and is also appropriate for advanced-level students in computer science.

726 citations

References
Book
01 Jan 1995
TL;DR: This book introduces the basic concepts in the design and analysis of randomized algorithms and presents basic tools such as probability theory and probabilistic analysis that are frequently used in algorithmic applications.
Abstract: For many applications, a randomized algorithm is either the simplest or the fastest algorithm available, and sometimes both. This book introduces the basic concepts in the design and analysis of randomized algorithms. The first part of the text presents basic tools such as probability theory and probabilistic analysis that are frequently used in algorithmic applications. Algorithmic examples are also given to illustrate the use of each tool in a concrete setting. In the second part of the book, each chapter focuses on an important area to which randomized algorithms can be applied, providing a comprehensive and representative selection of the algorithms that might be used in each of these areas. Although written primarily as a text for advanced undergraduates and graduate students, this book should also prove invaluable as a reference for professionals and researchers.

4,412 citations

Proceedings ArticleDOI
03 Jun 2002
TL;DR: The need for and research issues arising from a new model of data processing, where data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams are motivated.
Abstract: In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

2,933 citations

Proceedings ArticleDOI
03 Jun 2002
TL;DR: This paper shows that XML's ordered data model can indeed be efficiently supported by a relational database system, and proposes three order encoding methods that can be used to represent XML order in the relational data model, and also proposes algorithms for translating ordered XPath expressions into SQL using these encoding methods.
Abstract: XML is quickly becoming the de facto standard for data exchange over the Internet. This is creating a new set of data management requirements involving XML, such as the need to store and query XML documents. Researchers have proposed using relational database systems to satisfy these requirements by devising ways to "shred" XML documents into relations, and translate XML queries into SQL queries over these relations. However, a key issue with such an approach, which has largely been ignored in the research literature, is how (and whether) the ordered XML data model can be efficiently supported by the unordered relational data model. This paper shows that XML's ordered data model can indeed be efficiently supported by a relational database system. This is accomplished by encoding order as a data value. We propose three order encoding methods that can be used to represent XML order in the relational data model, and also propose algorithms for translating ordered XPath expressions into SQL using these encoding methods. Finally, we report the results of an experimental study that investigates the performance of the proposed order encoding methods on a workload of ordered XML queries and updates.

2,402 citations

Journal ArticleDOI
TL;DR: Data Streams: Algorithms and Applications surveys the emerging area of algorithms for processing data streams and associated applications, which rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity.
Abstract: In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [1].

1,598 citations

Frequently Asked Questions (18)
Q1. What are the contributions in "An improved data stream summary: the count-min sketch and its applications" ?

The authors introduce a new sublinear space data structure, the Count-Min Sketch, for summarizing data streams. The time and space bounds the authors show for using the CM sketch to solve these problems significantly improve those previously known, typically from 1/ε² to 1/ε in factor.

Accuracy guarantees will be made in terms of a pair of user-specified parameters, ε and δ, meaning that the error in answering a query is within a factor of ε with probability at least 1 − δ.

This sketch has the advantages that: (1) space used is proportional to 1/ε; (2) the update time is significantly sublinear in the size of the sketch; (3) it requires only pairwise independent hash functions that are simple to construct; (4) this sketch can be used for several different queries and multiple applications; and (5) all the constants are made explicit and are small.

Any range query can be reduced to at most 2 log₂ n dyadic range queries, which in turn can each be reduced to a single point query.

Choosing CM sketches over Random Subset Sums improves both the query time and the update time from O((1/ε²) log²(n) log((log n)/(εδ))), by a factor of more than (34/ε²) log n.

When the distribution of items is non-uniform, for example when certain items contribute a large amount to the join size, then the two norms are closer, and the guarantees of the CM sketch method are closer to those of the existing method.

The previously best known space bounds for finding approximate quantiles are O((1/ε)(log²(1/ε) + log² log(1/δ))) space for a randomized sampling solution and O((1/ε) log(ε||a||₁)) space for a deterministic solution [14].

Given a range query Q(l, r), compute the at most 2 log₂ n dyadic ranges which canonically cover the range, and pose that many point queries to the sketches, returning the sum of the queries as the estimate.

These have applications to computing correlations between data streams and tracking the number of distinct elements in streams, both of which are of great interest. 

The φ-heavy hitters of a multiset of ||a||₁ (integer) values, each in the range 1…n, consist of those items whose multiplicity exceeds the fraction φ of the total cardinality, i.e., a_i ≥ φ||a||₁.

Sketches are typically a few kilobytes up to a megabyte or so, and processing this much data for every update severely limits the update speed. 

By keeping log n sketches, one for each dyadic range and setting the accuracy parameter for each to be ε/ log n and the probability guarantee to δφ/ log(n), the overall probability guarantee for all 1/φ quantiles is achieved. 

Nodes in the hierarchy (corresponding to dyadic ranges) whose estimated weight exceeds the threshold of (φ + ε)||a||₁ are split into two ranges, and investigated recursively.

The best previous bounds for this problem in the turnstile model are given in [13], where range queries are answered by keeping O(log n) sketches, each of size O((1/ε′²) log(n) log((log n)/δ′)), to give approximations with additive error ε||a||₁ with probability 1 − δ′.

Corollary 1. The join size of two relations on a particular attribute can be approximated up to ε||a||₁||b||₁ with probability 1 − δ, by keeping space O((1/ε) log(1/δ)).

In [13] the authors showed that finding the approximate φ-quantiles of the data subject to insertions and deletions can be reduced to the problem of computing range sums. 

the space used by the algorithm should be small, at most polylogarithmic in n, the space required to represent a explicitly. 

Theorem 4. ε-approximate φ-quantiles can be found with probability at least 1 − δ by keeping a data structure with space O((1/ε) log²(n) log((log n)/(φδ))).