
Foundations and Trends® in Theoretical Computer Science
Vol. 1, No. 2 (2005) 117–236
© 2005 S. Muthukrishnan

Data Streams: Algorithms and Applications
S. Muthukrishnan
Rutgers University, New Brunswick, NJ, USA, muthu@cs.rutgers.edu
Abstract
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].

1 Introduction
We study the emerging area of algorithms for processing data streams
and associated applications, as an applied algorithms research agenda.
We begin with three puzzles.
1.1 Puzzle 1: Finding Missing Numbers
Let π be a permutation of {1,...,n}. Further, let π_{-1} be π with one element missing. Paul shows Carole π_{-1}[i] in increasing order i. Carole's task is to determine the missing integer. It is trivial to do the task if Carole can memorize all the numbers she has seen thus far (formally, she has an n-bit vector), but if n is large, this is impractical. Let us assume she has only a few, say O(log n), bits of memory. Nevertheless, Carole must determine the missing integer.
This starter puzzle has a simple solution: Carole stores

    s = n(n+1)/2 − Σ_{j≤i} π_{-1}[j],

which is the missing integer in the end. Each input integer entails one subtraction. The total number of bits stored is no more than 2 log n. This is nearly optimal because Carole needs at least log n bits in the worst case since she needs to output the missing integer. (In fact, there exists the following optimal algorithm for Carole using log n bits. For each i, store the parity sum of the ith bits of all numbers seen thus far. The final parity sum bits are the bits of the missing number.) A similar solution will work even if n is unknown, for example by letting n = max_{j≤i} π_{-1}[j] each time.
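The two solutions above, the running difference and the bit-parity variant, can be sketched as follows. This is a minimal illustration in Python, where `stream` is any iterable presenting π_{-1}:

```python
def find_missing(n, stream):
    """Recover the one missing number from a stream holding a
    permutation of {1, ..., n} with a single element removed.
    Carole keeps only the running difference s: O(log n) bits."""
    s = n * (n + 1) // 2          # sum of the full set {1, ..., n}
    for x in stream:
        s -= x                    # one subtraction per input integer
    return s                      # whatever was never subtracted


def find_missing_parity(n, stream):
    """The log n-bit variant: XOR accumulates the per-bit parity
    sums of all numbers seen, against those of the full set."""
    acc = 0
    for i in range(1, n + 1):
        acc ^= i                  # parity sums over the full set
    for x in stream:
        acc ^= x                  # cancel each number that appears
    return acc                    # the bits of the missing number
```

Either function returns the missing integer; the XOR form makes explicit that only one machine word of state is kept.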
Paul and Carole have a history. It started with the "twenty questions" problem solved in [201]. Paul, who stood for Paul Erdős, was the one who asked questions. Carole is an anagram of Oracle. Aptly, she was the one who answered questions. Joel Spencer and Peter Winkler used Paul and Carole to coincide with Pusher and Chooser respectively in studying certain chip games in which Carole chose which groups the chips fall into and Paul determined which group of chips to push. In the puzzle above, Paul permutes and Carole cumulates.
Generalizing the puzzle a little further, let π_{-2} be π with two elements missing. The natural solution would be for Carole to store

    s = n(n+1)/2 − Σ_{j≤i} π_{-2}[j]  and  p = n! / Π_{j≤i} π_{-2}[j],

giving two equations with two unknown numbers, but this will result in storing a large number of bits since n! is large. Instead, Carole can use far fewer bits tracking

    s = n(n+1)/2 − Σ_{j≤i} π_{-2}[j]  and  ss = n(n+1)(2n+1)/6 − Σ_{j≤i} (π_{-2}[j])².
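A short sketch of this two-deficit scheme, assuming exact integer arithmetic (Python's built-in big integers stand in for the bounded-bit counters):

```python
import math

def find_two_missing(n, stream):
    """Recover the two missing numbers from a stream holding a
    permutation of {1, ..., n} with two elements removed, tracking
    only the running sum and sum-of-squares deficits s and ss."""
    s = n * (n + 1) // 2                  # Σ i over the full set
    ss = n * (n + 1) * (2 * n + 1) // 6   # Σ i² over the full set
    for x in stream:
        s -= x
        ss -= x * x
    # Now s = a + b and ss = a² + b² for the missing pair (a, b).
    ab = (s * s - ss) // 2                # since (a+b)² = a² + b² + 2ab
    d = math.isqrt(s * s - 4 * ab)        # discriminant of z² − s·z + ab
    return (s - d) // 2, (s + d) // 2     # the two roots a ≤ b
```

Solving the quadratic z² − s·z + ab = 0 plays the role of the "two equations with two unknowns" in the text.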
In general, what is the smallest number of bits needed to identify the k missing numbers in π_{-k}? Following the approach above, the solution is to maintain the power sums

    s_p(x_1,...,x_k) = Σ_{i=1}^{k} (x_i)^p,

for p = 1,...,k, and to solve for the x_i's. A different method uses elementary symmetric polynomials [169]. The ith such polynomial σ_i(x_1,...,x_k) is the sum of all possible i-term products of the parameters, i.e.,

    σ_i(x_1,...,x_k) = Σ_{j_1 < ··· < j_i} x_{j_1} ··· x_{j_i}.
Carole continuously maintains the σ_i's for the missing k items in the field F_q for some prime n ≤ q ≤ 2n, as Paul presents the numbers one after the other (the details are in [169]). Since

    Π_{i=1,...,k} (z − x_i) = Σ_{i=0}^{k} (−1)^i σ_i(x_1,...,x_k) z^{k−i},

Carole needs to factor this polynomial in F_q to determine the missing numbers. No deterministic algorithms are known for the factoring problem, but there are randomized algorithms that take roughly O(k² log n) bits and time [214]. The elementary symmetric polynomial approach above comes from [169], where the authors solve the set reconciliation problem in the communication complexity model. The subset reconciliation problem is related to our puzzle.
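The power-sum variant can be sketched as follows. This is an illustration only: it works over exact big integers and finds roots by brute-force testing over {1,...,n}, rather than maintaining the σ_i's in F_q and factoring as in [169]:

```python
def find_k_missing(n, k, stream):
    """Recover the k missing numbers from a stream holding {1,...,n}
    with k elements removed, by maintaining the power-sum deficits
    s_p = Σ_{i=1..n} i^p − Σ_seen x^p for p = 1..k."""
    # Power sums of the full set {1, ..., n}.
    s = [sum(i ** p for i in range(1, n + 1)) for p in range(1, k + 1)]
    for x in stream:
        for p in range(1, k + 1):
            s[p - 1] -= x ** p        # one update per power per item
    # Newton's identities convert the power sums p_1..p_k of the
    # missing numbers into elementary symmetric polynomials e_1..e_k:
    #   m·e_m = Σ_{i=1..m} (−1)^{i−1} e_{m−i} p_i.
    e = [1]                            # e_0 = 1
    for m in range(1, k + 1):
        acc = sum((-1) ** (i - 1) * e[m - i] * s[i - 1]
                  for i in range(1, m + 1))
        e.append(acc // m)             # e_m is always an integer here
    # The missing numbers are the roots of Σ_i (−1)^i e_i z^{k−i};
    # for a sketch we simply test every candidate in {1, ..., n}.
    def poly(z):
        return sum((-1) ** i * e[i] * z ** (k - i) for i in range(k + 1))
    return sorted(z for z in range(1, n + 1) if poly(z) == 0)
```

The polynomial being tested is exactly Π(z − x_i) expanded via the identity displayed above, with the σ_i's computed from the power sums instead of maintained directly.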
Generalizing the puzzle, Paul may present a multiset of elements in
{1,...,n} with a single missing integer, i.e., he is allowed to re-present
integers he showed before; Paul may present updates showing which
integers to insert and which to delete, and Carole’s task is to find
the integers that are no longer present; etc. All of these problems are
no longer (mere) puzzles; they are derived from motivating data stream
applications.
1.2 Puzzle 2: Fishing
Say Paul goes fishing. There are many different fish species U = {1,...,u}. Paul catches one fish at a time, a_t ∈ U being the fish species he catches at time t. c_t[j] = |{a_i | a_i = j, i ≤ t}| is the number of times he catches the species j up to time t. Species j is rare at time t if it appears precisely once in his catch up to time t. The rarity ρ[t] of his catch at time t is the ratio of the number of rare j's to u:

    ρ[t] = |{j | c_t[j] = 1}| / u.

Paul can calculate ρ[t] precisely with a 2u-bit vector and a counter for the current number of rare species, updating the data structure in O(1) operations per fish caught. However, Paul wants to store only as many bits as will fit his tiny suitcase, i.e., o(u), preferably O(1) bits.
Suppose Paul has a deterministic algorithm to compute ρ[t] precisely. Feed Paul any set S ⊆ U of fish species, and say Paul's algorithm stores only o(u) bits in his suitcase. Now we can check if any i ∈ S by simply feeding Paul i and checking ρ[t + 1]: the number of rare items decreases by one if and only if i ∈ S. This way we can recover the entire S from his suitcase by feeding different i's one at a time, which is impossible in general if Paul had only stored o(|S|) bits. Therefore, if Paul wishes to work out of his one suitcase, he cannot compute ρ[t] exactly. This argument has elements of lower bound proofs found in the area of data streams.
However, proceeding to the task at hand, Paul can approximate ρ[t]. Paul picks k random fish species, each independently and uniformly with probability 1/u at the beginning, and maintains the number of times each of these fish types appears in his bounty, as he catches fish one after another. Say X_1[t],...,X_k[t] are these counts after time t. Paul outputs

    ρ̂[t] = |{i | X_i[t] = 1}| / k

as an estimator for ρ. We have,

    Pr(X_i[t] = 1) = |{j | c_t[j] = 1}| / u = ρ[t],

for any fixed i, and the probability is over the random choice of the fish type X_i. If ρ[t] is large, say at least 1/k, then ρ̂[t] is a good estimator for ρ[t] with arbitrarily small ε and significant probability.
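A simulation sketch of this estimator (not code from the survey; the function name, `stream`, and `seed` are illustrative):

```python
import random

def rarity_estimator(u, k, stream, seed=0):
    """Estimate the rarity ρ[t] = |{j : c_t[j] = 1}| / u by sampling
    k species uniformly from {1, ..., u} up front and tracking only
    the catch counts X_1..X_k of those species."""
    rng = random.Random(seed)
    # k independent uniform picks (duplicates allowed, as in the text).
    sampled = [rng.randrange(1, u + 1) for _ in range(k)]
    counts = {s: 0 for s in sampled}
    for fish in stream:
        if fish in counts:
            counts[fish] += 1          # X_i[t] updates in O(1) per catch
    # Fraction of sampled species seen exactly once estimates ρ[t].
    return sum(1 for s in sampled if counts[s] == 1) / k
```

Only k counters are kept, independent of u, which is the entire point of the estimator.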
However, typically, ρ is unlikely to be large because presumably u is much larger than the number of species found at any spot Paul fishes. Choosing a random species from {1,...,u} and waiting for it to be caught is ineffective. We can make it more realistic by redefining rarity with respect to the species Paul in fact sees in his catch. Let

    γ[t] = |{j | c_t[j] = 1}| / |{j | c_t[j] ≠ 0}|.

As before, Paul would have to approximate γ[t] because he cannot compute it exactly using a small number of bits. Following [28], define a family H of hash functions [n] → [n] (where [n] = {1,...,n}) to be min-wise independent if for any X ⊆ [n] and x ∈ X, we have

    Pr_{h∈H}[h(x) = min h(X)] = 1/|X|,

where h(X) = {h(x) : x ∈ X}. Paul chooses k min-wise independent hash functions h_1, h_2,...,h_k for some parameter k to be determined
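The construction continues beyond this excerpt. As an illustration of where it is headed, the following sketch simulates min-wise independent hashes with random permutations and estimates γ[t] by checking, for each hash, whether the species attaining the minimum hash value is rare (all names here are hypothetical):

```python
import random

def gamma_estimator(n, k, stream, seed=0):
    """Estimate γ[t], the fraction of *seen* species that are rare,
    with k min-wise hashes. Random permutations of [n] stand in for
    a min-wise independent family: the species attaining min h_i is
    a uniformly random seen species, so it is rare with prob. γ[t]."""
    rng = random.Random(seed)
    hashes = []
    for _ in range(k):
        perm = list(range(n))
        rng.shuffle(perm)              # one "hash function" h_i
        hashes.append(perm)
    min_val = [None] * k               # smallest hash value seen so far
    min_count = [0] * k                # catch count of the min species
    counts = {}
    for fish in stream:
        counts[fish] = counts.get(fish, 0) + 1
        for i, h in enumerate(hashes):
            v = h[fish - 1]            # species are 1-based
            if min_val[i] is None or v < min_val[i]:
                min_val[i] = v         # new minimum-hash species
                min_count[i] = counts[fish]
            elif v == min_val[i]:
                min_count[i] = counts[fish]  # same species caught again
    # Fraction of min-hash samples whose species appears exactly once.
    return sum(1 for c in min_count if c == 1) / k
```

Note the state is k (value, count) pairs plus the hash descriptions, with no dependence on u; the `counts` dictionary above is only a convenience of the simulation.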
