
Foundations and Trends® in Theoretical Computer Science
Vol. 1, No. 2 (2005) 117–236
© 2005 S. Muthukrishnan

Data Streams: Algorithms and Applications
S. Muthukrishnan
Rutgers University, New Brunswick, NJ, USA, muthu@cs.rutgers.edu
Abstract
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].

1 Introduction
We study the emerging area of algorithms for processing data streams
and associated applications, as an applied algorithms research agenda.
We begin with three puzzles.
1.1 Puzzle 1: Finding Missing Numbers
Let π be a permutation of {1,...,n}. Further, let π_{-1} be π with one element missing. Paul shows Carole π_{-1}[i] in increasing order i. Carole's task is to determine the missing integer. It is trivial to do the task if Carole can memorize all the numbers she has seen thus far (formally, she has an n-bit vector), but if n is large, this is impractical. Let us assume she has only a few, say O(log n), bits of memory. Nevertheless, Carole must determine the missing integer.
This starter puzzle has a simple solution: Carole stores

    s = n(n+1)/2 − Σ_{j≤i} π_{-1}[j],

which is the missing integer in the end. Each input integer entails one subtraction. The total number of bits stored is no more than 2 log n. This is nearly optimal because Carole needs at least log n bits in the worst case since she needs to output the missing integer. (In fact, there exists the following optimal algorithm for Carole using log n bits. For each i, store the parity sum of the ith bits of all numbers seen thus far. The final parity sum bits are the bits of the missing number.) A similar solution will work even if n is unknown, for example by letting n = max_{j≤i} π_{-1}[j] each time.
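The two solutions above, the running difference and the bit-parity variant, can be sketched as follows. This is a minimal illustration in Python, where `stream` is any iterable presenting π_{-1}:

```python
def find_missing(n, stream):
    """Recover the one missing number from a stream holding a
    permutation of {1, ..., n} with a single element removed.
    Carole keeps only the running difference s: O(log n) bits."""
    s = n * (n + 1) // 2          # sum of the full set {1, ..., n}
    for x in stream:
        s -= x                    # one subtraction per input integer
    return s                      # whatever was never subtracted


def find_missing_parity(n, stream):
    """The log n-bit variant: XOR accumulates the per-bit parity
    sums of all numbers seen, against those of the full set."""
    acc = 0
    for i in range(1, n + 1):
        acc ^= i                  # parity sums over the full set
    for x in stream:
        acc ^= x                  # cancel each number that appears
    return acc                    # the bits of the missing number
```

Either function returns the missing integer; the XOR form makes explicit that only one machine word of state is kept.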
Paul and Carole have a history. It started with the "twenty questions" problem solved in [201]. Paul, who stood for Paul Erdős, was the one who asked questions. Carole is an anagram of Oracle. Aptly, she was the one who answered questions. Joel Spencer and Peter Winkler used Paul and Carole to coincide with Pusher and Chooser respectively in studying certain chip games in which Carole chose which groups the chips fall into and Paul determined which group of chips to push. In the puzzle above, Paul permutes and Carole cumulates.
Generalizing the puzzle a little further, let π_{-2} be π with two elements missing. The natural solution would be for Carole to store

    s = n(n+1)/2 − Σ_{j≤i} π_{-2}[j]  and  p = n! / Π_{j≤i} π_{-2}[j],

giving two equations with two unknown numbers, but this will result in storing a large number of bits since n! is large. Instead, Carole can use far fewer bits tracking

    s = n(n+1)/2 − Σ_{j≤i} π_{-2}[j]  and  ss = n(n+1)(2n+1)/6 − Σ_{j≤i} (π_{-2}[j])².
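A short sketch of this two-deficit scheme, assuming exact integer arithmetic (Python's built-in big integers stand in for the bounded-bit counters):

```python
import math

def find_two_missing(n, stream):
    """Recover the two missing numbers from a stream holding a
    permutation of {1, ..., n} with two elements removed, tracking
    only the running sum and sum-of-squares deficits s and ss."""
    s = n * (n + 1) // 2                  # Σ i over the full set
    ss = n * (n + 1) * (2 * n + 1) // 6   # Σ i² over the full set
    for x in stream:
        s -= x
        ss -= x * x
    # Now s = a + b and ss = a² + b² for the missing pair (a, b).
    ab = (s * s - ss) // 2                # since (a+b)² = a² + b² + 2ab
    d = math.isqrt(s * s - 4 * ab)        # discriminant of z² − s·z + ab
    return (s - d) // 2, (s + d) // 2     # the two roots a ≤ b
```

Solving the quadratic z² − s·z + ab = 0 plays the role of the "two equations with two unknowns" in the text.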
In general, what is the smallest number of bits needed to identify the k missing numbers in π_{-k}? Following the approach above, the solution is to maintain the power sums

    s_p(x_1,...,x_k) = Σ_{i=1}^{k} (x_i)^p,

for p = 1,...,k, and to solve for the x_i's. A different method uses elementary symmetric polynomials [169]. The ith such polynomial σ_i(x_1,...,x_k) is the sum of all possible i-term products of the parameters, i.e.,

    σ_i(x_1,...,x_k) = Σ_{j_1 < ··· < j_i} x_{j_1} ··· x_{j_i}.
Carole continuously maintains the σ_i's for the missing k items in the field F_q for some prime n ≤ q ≤ 2n, as Paul presents the numbers one after the other (the details are in [169]). Since

    Π_{i=1,...,k} (z − x_i) = Σ_{i=0}^{k} (−1)^i σ_i(x_1,...,x_k) z^{k−i},

Carole needs to factor this polynomial in F_q to determine the missing numbers. No deterministic algorithms are known for the factoring problem, but there are randomized algorithms that take roughly O(k² log n) bits and time [214]. The elementary symmetric polynomial approach above comes from [169], where the authors solve the set reconciliation problem in the communication complexity model. The subset reconciliation problem is related to our puzzle.
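The power-sum variant can be sketched as follows. This is an illustration only: it works over exact big integers and finds roots by brute-force testing over {1,...,n}, rather than maintaining the σ_i's in F_q and factoring as in [169]:

```python
def find_k_missing(n, k, stream):
    """Recover the k missing numbers from a stream holding {1,...,n}
    with k elements removed, by maintaining the power-sum deficits
    s_p = Σ_{i=1..n} i^p − Σ_seen x^p for p = 1..k."""
    # Power sums of the full set {1, ..., n}.
    s = [sum(i ** p for i in range(1, n + 1)) for p in range(1, k + 1)]
    for x in stream:
        for p in range(1, k + 1):
            s[p - 1] -= x ** p        # one update per power per item
    # Newton's identities convert the power sums p_1..p_k of the
    # missing numbers into elementary symmetric polynomials e_1..e_k:
    #   m·e_m = Σ_{i=1..m} (−1)^{i−1} e_{m−i} p_i.
    e = [1]                            # e_0 = 1
    for m in range(1, k + 1):
        acc = sum((-1) ** (i - 1) * e[m - i] * s[i - 1]
                  for i in range(1, m + 1))
        e.append(acc // m)             # e_m is always an integer here
    # The missing numbers are the roots of Σ_i (−1)^i e_i z^{k−i};
    # for a sketch we simply test every candidate in {1, ..., n}.
    def poly(z):
        return sum((-1) ** i * e[i] * z ** (k - i) for i in range(k + 1))
    return sorted(z for z in range(1, n + 1) if poly(z) == 0)
```

The polynomial being tested is exactly Π(z − x_i) expanded via the identity displayed above, with the σ_i's computed from the power sums instead of maintained directly.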
Generalizing the puzzle, Paul may present a multiset of elements in
{1,...,n} with a single missing integer, i.e., he is allowed to re-present
integers he showed before; Paul may present updates showing which
integers to insert and which to delete, and Carole’s task is to find
the integers that are no longer present; etc. All of these problems are
no longer (mere) puzzles; they are derived from motivating data stream
applications.
1.2 Puzzle 2: Fishing
Say Paul goes fishing. There are many different fish species U = {1,...,u}. Paul catches one fish at a time, a_t ∈ U being the fish species he catches at time t. c_t[j] = |{a_i | a_i = j, i ≤ t}| is the number of times he catches the species j up to time t. Species j is rare at time t if it appears precisely once in his catch up to time t. The rarity ρ[t] of his catch at time t is the ratio of the number of rare j's to u:

    ρ[t] = |{j | c_t[j] = 1}| / u.

Paul can calculate ρ[t] precisely with a 2u-bit vector and a counter for the current number of rare species, updating the data structure in O(1) operations per fish caught. However, Paul wants to store only as many bits as will fit his tiny suitcase, i.e., o(u), preferably O(1) bits.
Suppose Paul has a deterministic algorithm to compute ρ[t] precisely. Feed Paul any set S ⊆ U of fish species, and say Paul's algorithm stores only o(u) bits in his suitcase. Now we can check if any i ∈ S by simply feeding Paul i and checking ρ[t + 1]: the number of rare items decreases by one if and only if i ∈ S. This way we can recover the entire S from his suitcase by feeding different i's one at a time, which is impossible in general if Paul had only stored o(|S|) bits. Therefore, if Paul wishes to work out of his one suitcase, he cannot compute ρ[t] exactly. This argument has elements of lower bound proofs found in the area of data streams.
However, proceeding to the task at hand, Paul can approximate ρ[t]. Paul picks k random fish species, each independently and uniformly with probability 1/u at the beginning, and maintains the number of times each of these fish types appears in his bounty, as he catches fish one after another. Say X_1[t],...,X_k[t] are these counts after time t. Paul outputs

    ρ̂[t] = |{i | X_i[t] = 1}| / k

as an estimator for ρ. We have,

    Pr(X_i[t] = 1) = |{j | c_t[j] = 1}| / u = ρ[t],

for any fixed i, and the probability is over the random choice of the fish type X_i. If ρ[t] is large, say at least 1/k, then ρ̂[t] is a good estimator for ρ[t] with arbitrarily small ε and significant probability.
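A simulation sketch of this estimator (not code from the survey; the function name, `stream`, and `seed` are illustrative):

```python
import random

def rarity_estimator(u, k, stream, seed=0):
    """Estimate the rarity ρ[t] = |{j : c_t[j] = 1}| / u by sampling
    k species uniformly from {1, ..., u} up front and tracking only
    the catch counts X_1..X_k of those species."""
    rng = random.Random(seed)
    # k independent uniform picks (duplicates allowed, as in the text).
    sampled = [rng.randrange(1, u + 1) for _ in range(k)]
    counts = {s: 0 for s in sampled}
    for fish in stream:
        if fish in counts:
            counts[fish] += 1          # X_i[t] updates in O(1) per catch
    # Fraction of sampled species seen exactly once estimates ρ[t].
    return sum(1 for s in sampled if counts[s] == 1) / k
```

Only k counters are kept, independent of u, which is the entire point of the estimator.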
However, typically, ρ is unlikely to be large because presumably u is much larger than the number of species found at any spot Paul fishes. Choosing a random species from {1,...,u} and waiting for it to be caught is ineffective. We can make it more realistic by redefining rarity with respect to the species Paul in fact sees in his catch. Let

    γ[t] = |{j | c_t[j] = 1}| / |{j | c_t[j] ≠ 0}|.

As before, Paul would have to approximate γ[t] because he cannot compute it exactly using a small number of bits. Following [28], define a family H of hash functions [n] → [n] (where [n] = {1,...,n}) to be min-wise independent if for any X ⊆ [n] and x ∈ X, we have

    Pr_{h∈H}[h(x) = min h(X)] = 1/|X|,

where h(X) = {h(x) : x ∈ X}. Paul chooses k min-wise independent hash functions h_1, h_2,...,h_k for some parameter k to be determined
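The construction continues beyond this excerpt. As an illustration of where it is headed, the following sketch simulates min-wise independent hashes with random permutations and estimates γ[t] by checking, for each hash, whether the species attaining the minimum hash value is rare (all names here are hypothetical):

```python
import random

def gamma_estimator(n, k, stream, seed=0):
    """Estimate γ[t], the fraction of *seen* species that are rare,
    with k min-wise hashes. Random permutations of [n] stand in for
    a min-wise independent family: the species attaining min h_i is
    a uniformly random seen species, so it is rare with prob. γ[t]."""
    rng = random.Random(seed)
    hashes = []
    for _ in range(k):
        perm = list(range(n))
        rng.shuffle(perm)              # one "hash function" h_i
        hashes.append(perm)
    min_val = [None] * k               # smallest hash value seen so far
    min_count = [0] * k                # catch count of the min species
    counts = {}
    for fish in stream:
        counts[fish] = counts.get(fish, 0) + 1
        for i, h in enumerate(hashes):
            v = h[fish - 1]            # species are 1-based
            if min_val[i] is None or v < min_val[i]:
                min_val[i] = v         # new minimum-hash species
                min_count[i] = counts[fish]
            elif v == min_val[i]:
                min_count[i] = counts[fish]  # same species caught again
    # Fraction of min-hash samples whose species appears exactly once.
    return sum(1 for c in min_count if c == 1) / k
```

Note the state is k (value, count) pairs plus the hash descriptions, with no dependence on u; the `counts` dictionary above is only a convenience of the simulation.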
