Data streams: algorithms and applications
Frequently Asked Questions (17)
Q2. What are some basic algorithms that can be used in the data stream context?
A number of basic algorithmic techniques (binary search, the greedy technique, dynamic programming, divide and conquer, etc.) apply directly in the data stream context, mostly in conjunction with samples or random projections.
Q3. What are the main uses of Lp sketches?
F2 and L2 are used to measure deviations in anomaly detection [154] or are interpreted as self-join sizes [13]. With variants of L1 sketches, the authors can dynamically track most frequent items [55], quantiles [107], wavelets and histograms [104], etc. in the Turnstile model. Using Lp sketches for p → 0, the authors can estimate the number of distinct elements at any time in the Turnstile model [47].
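To make the self-join-size use concrete, here is a minimal Python sketch of an F2 estimator that supports Turnstile updates (positive and negative). It uses the tug-of-war (AMS) sign sketch, a standard F2 estimator that is swapped in here for brevity; it is not the p-stable construction. The class name `F2Sketch`, the parameters `groups` and `per_group`, and the cubic-polynomial sign hash are assumptions made for illustration, and items are assumed to be integers.

```python
import random
import statistics


class F2Sketch:
    """Tug-of-war style estimator of F2 = sum_i A[i]^2 (illustrative only)."""

    def __init__(self, groups=5, per_group=16, seed=0):
        rng = random.Random(seed)
        self.p = (1 << 61) - 1  # large prime modulus for the hash polynomials
        self.groups = groups
        self.per_group = per_group
        k = groups * per_group
        # One random cubic polynomial per counter; its low bit gives a +/-1 sign.
        # Assumption: this stands in for a 4-wise independent sign family.
        self.coef = [[rng.randrange(self.p) for _ in range(4)] for _ in range(k)]
        self.z = [0] * k

    def _sign(self, j, i):
        a, b, c, d = self.coef[j]
        h = (((a * i + b) * i + c) * i + d) % self.p
        return 1 if h & 1 else -1

    def update(self, i, delta):
        """Apply A[i] += delta; delta may be negative (Turnstile model)."""
        for j in range(len(self.z)):
            self.z[j] += delta * self._sign(j, i)

    def estimate(self):
        """Median of group means of the squared counters."""
        sq = [v * v for v in self.z]
        n = self.per_group
        means = [sum(sq[g * n:(g + 1) * n]) / n for g in range(self.groups)]
        return statistics.median(means)


# Usage: the true F2 (self-join size) of this stream is 3^2 + 2^2 + 1^2 = 14.
sketch = F2Sketch()
for item in [1, 1, 1, 2, 2, 3]:
    sketch.update(item, +1)
print(sketch.estimate())  # an approximation of 14, not an exact value
```

Because the sketch is linear in the updates, deleting an item cancels its earlier insertion exactly, which is what makes it usable in the Turnstile model.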
Q4. What are the main reasons to monitor database contents?
Other reasons to monitor database contents are approximate query answering and data quality monitoring, two rich areas in their own right with extensive literature and work.
Q5. What is the space required to answer point queries correctly?
The space required to answer point queries correctly with any constant probability and error at most $\varepsilon \|A\|_1$ is $\Omega(\varepsilon^{-1})$ over general distributions.
Q6. How can the authors solve the problem of counting the number of distinct IP addresses?
Counting the number of distinct IP addresses that are currently using a link can be solved by determining the number of nonzero A[i]’s at any time.
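The reduction can be illustrated with an exact baseline in Python. This is only a sketch of the problem statement: it stores every nonzero A[i], so it uses linear space rather than the small space a streaming algorithm would aim for. The class name `DistinctTracker` and the sample addresses are hypothetical.

```python
from collections import defaultdict


class DistinctTracker:
    """Exact baseline: tracks A[i] for every i and counts the nonzero entries."""

    def __init__(self):
        self.A = defaultdict(int)

    def update(self, i, delta):
        """Apply A[i] += delta, e.g. +1 when a flow starts and -1 when it ends."""
        self.A[i] += delta
        if self.A[i] == 0:
            del self.A[i]  # keep only nonzero entries

    def distinct(self):
        """Number of i with A[i] != 0."""
        return len(self.A)


tracker = DistinctTracker()
updates = [("10.0.0.1", +1), ("10.0.0.2", +1), ("10.0.0.1", +1), ("10.0.0.2", -1)]
for ip, delta in updates:
    tracker.update(ip, delta)
print(tracker.distinct())  # 1: only 10.0.0.1 still has a nonzero count
```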
Q7. How do the authors compute the highest B-term approximation to a signal?
With at most O(B + logN) storage, the authors can compute the highest (best) B-term approximation to a signal exactly in the Timeseries model.
Q8. What is the advantage of having small memory during the processing?
In addition, having small memory during the processing is useful: for example, even if the data resides on disk or tape, the small-memory “summary” can be maintained in main memory or even in the cache, which yields very efficient query implementations in general.
Q9. How does the algorithm compute the quantiles of a cash register stream?
The deterministic sampling procedure above outputs (φ,ε)-quantiles on a cash register stream using space $O\left(\frac{\log(\varepsilon \|A\|_1)}{\varepsilon}\right)$.
Q10. How many distinct values can be approximated in the inverse signal?
The number of distinct values, that is $|\{j \mid A[j] \neq 0\}|$, can be approximated up to a fixed error with constant probability in O(1) space by known methods [22].
Q11. How much space does the CM sketch use?
Consider inserting a million ($\approx 2^{20}$) IP addresses; the CM sketch still maintains $O((1/\varepsilon) \log(1/\delta))$ counts, each of size $\log \|A\|_1 \approx 20$ bits, which is much less than the input size.
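A minimal Python sketch of a Count-Min sketch shows where the counter count comes from. It uses the standard parameter choice of width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉; the class name `CountMinSketch`, the seed, and the particular hash family (random linear functions modulo a Mersenne prime) are assumptions made for illustration, and items are assumed to be integers such as 32-bit IPv4 addresses.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min sketch for nonnegative updates to integer items."""

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)          # counters per row
        self.depth = math.ceil(math.log(1 / delta))   # number of rows
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.p = (1 << 61) - 1  # prime modulus for the hash functions
        rng = random.Random(seed)
        # One hash (a*x + b mod p) mod width per row; assumed pairwise independent.
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(self.depth)]

    def _col(self, row, item):
        a, b = self.hashes[row]
        return ((a * item + b) % self.p) % self.width

    def update(self, item, count=1):
        """Add count to one counter in every row."""
        for r in range(self.depth):
            self.table[r][self._col(r, item)] += count

    def query(self, item):
        """Point query: the minimum over rows never underestimates A[item]."""
        return min(self.table[r][self._col(r, item)] for r in range(self.depth))


cms = CountMinSketch(eps=0.01, delta=0.01)
print(cms.width, cms.depth)  # 272 5, i.e. 1360 counters regardless of stream length

rng = random.Random(1)
truth = {}
for _ in range(5000):
    x = rng.randrange(1000)
    cms.update(x)
    truth[x] = truth.get(x, 0) + 1
```

With nonnegative updates, each estimate is at least the true count, and it exceeds the true count by more than ε‖A‖₁ with probability at most δ.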
Q12. What is the probability that any item in the input stream is in S?
Produce multiset S consisting of $O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$ samples (x,i), where x is an item in [1,N] and i is its count in the input stream; further, the probability that any x ∈ [1,N] is in S is $1/|\{j \mid A[j] \neq 0\}|$.
Q13. What are the main types of data streams that are generated by monitoring applications?
The automatic data feeds that generate modern data streams arise out of monitoring applications, be they atmospheric, astronomical, networking, financial or sensor-related.
Q14. What is the way to capture a subset of the stream?
Companies such as NARUS can capture a suitable subset of the stream and use standard database technology on top of it for specific analyses.
Q15. How do the authors achieve O(1) amortized update time?
Each new data item needs O(B + logN) time to be processed; by batch processing, the authors can make update time O(1) in the amortized case.
Q16. What is the B term representation for a signal?
From a fundamental result of Parseval from 1799 [187, 25], it follows that the best B-term representation for signal A is $R = \sum_{i \in \Lambda} c_i \psi_i$, where $c_i = \langle A, \psi_i \rangle$ and $\Lambda$ of size k maximizes $\sum_{i \in \Lambda} c_i^2$.
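The statement can be checked numerically with a small Python example. This is an offline illustration, not the small-space streaming algorithm: it assumes the orthonormal Haar wavelet basis as the ψ_i and a signal whose length is a power of two. The function names `haar_transform`, `inverse_haar`, and `best_b_term` are hypothetical.

```python
import math


def haar_transform(signal):
    """Orthonormal Haar coefficients of a signal whose length is a power of two."""
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        avg = [(approx[i] + approx[i + 1]) / math.sqrt(2) for i in pairs]
        det = [(approx[i] - approx[i + 1]) / math.sqrt(2) for i in pairs]
        details = det + details  # coarser levels go in front
        approx = avg
    return approx + details


def inverse_haar(coeffs):
    """Invert haar_transform."""
    approx = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        det = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        nxt = []
        for a, d in zip(approx, det):
            nxt += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        approx = nxt
    return approx


def best_b_term(signal, B):
    """Keep the B coefficients of largest magnitude; return (R, dropped energy)."""
    c = haar_transform(signal)
    keep = set(sorted(range(len(c)), key=lambda i: abs(c[i]), reverse=True)[:B])
    kept = [c[i] if i in keep else 0.0 for i in range(len(c))]
    dropped = sum(c[i] ** 2 for i in range(len(c)) if i not in keep)
    return inverse_haar(kept), dropped


signal = [3, 1, 4, 1, 5, 9, 2, 6]
approx, dropped = best_b_term(signal, B=3)
error = sum((s - a) ** 2 for s, a in zip(signal, approx))
print(error, dropped)  # equal up to floating-point rounding, by Parseval
```

Because the basis is orthonormal, the squared error of the approximation equals the energy of the dropped coefficients, so keeping the largest-magnitude coefficients minimizes the error.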
Q17. What are the important tasks that need to be done in near-real time?
These are time-critical tasks that need to be done in near-real time to keep pace with the rate of stream updates and to reflect rapidly changing trends in the data.