Proceedings Article

Sketching probabilistic data streams

TL;DR
These algorithms offer strong randomized estimation guarantees while using only sublinear space in the size of the stream(s), and rely on novel, concise streaming sketch synopses that extend conventional sketching ideas to the probabilistic streams setting.
Abstract
The management of uncertain, probabilistic data has recently emerged as a useful paradigm for dealing with the inherent unreliabilities of several real-world application domains, including data cleaning, information integration, and pervasive, multi-sensor computing. Unlike conventional data sets, a set of probabilistic tuples defines a probability distribution over an exponential number of possible worlds (i.e., "grounded", deterministic databases). This "possible-worlds" interpretation allows for clean query semantics but also raises hard computational problems for probabilistic database query processors. To further complicate matters, in many scenarios (e.g., large-scale process and environmental monitoring using multiple sensor modalities), probabilistic data tuples arrive and need to be processed in a streaming fashion; that is, using limited memory and CPU resources and without the benefit of multiple passes over a static probabilistic database. Such probabilistic data streams raise a host of new research challenges for stream-processing engines that, to date, remain largely unaddressed. In this paper, we propose the first space- and time-efficient algorithms for approximating complex aggregate queries (including the number of distinct values and join/self-join sizes) over probabilistic data streams. Following the possible-worlds semantics, such aggregates essentially define probability distributions over the space of possible aggregation results, and our goal is to characterize such distributions through efficient approximations of their key moments (such as expectation and variance). Our algorithms offer strong randomized estimation guarantees while using only sublinear space in the size of the stream(s), and rely on novel, concise streaming sketch synopses that extend conventional sketching ideas to the probabilistic streams setting. Our experimental results verify the effectiveness of our approach.
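
To make the possible-worlds semantics concrete, the following minimal Python sketch computes the expected number of distinct values for a tiny probabilistic stream in two ways: by enumerating every possible world, and by a closed form. It assumes a tuple-independent model, in which each `(value, probability)` pair exists independently of the others; that model, the function names, and the sample stream are illustrative assumptions, not details taken from the paper. It is also not the paper's sketch-based algorithm: both functions use space linear in the stream, and they only illustrate the quantity that such a sketch would approximate.

```python
from itertools import product


def expected_distinct_closed_form(stream):
    """Expected distinct count, assuming independent tuples (an assumption).

    For each value, multiply the probabilities that each of its tuples is
    absent; one minus that product is the chance the value appears at all.
    """
    absent = {}  # value -> probability that no tuple carrying it exists
    for value, prob in stream:
        absent[value] = absent.get(value, 1.0) * (1.0 - prob)
    return sum(1.0 - q for q in absent.values())


def expected_distinct_by_enumeration(stream):
    """Brute force over all 2^n possible worlds; feasible only for tiny n."""
    total = 0.0
    for world in product([False, True], repeat=len(stream)):
        weight = 1.0      # probability of this particular world
        present = set()   # distinct values that exist in this world
        for included, (value, prob) in zip(world, stream):
            weight *= prob if included else (1.0 - prob)
            if included:
                present.add(value)
        total += weight * len(present)
    return total


# Hypothetical stream of (value, existence probability) pairs.
stream = [("a", 0.5), ("a", 0.5), ("b", 0.25), ("c", 1.0)]
closed = expected_distinct_closed_form(stream)
brute = expected_distinct_by_enumeration(stream)
print(closed, brute)
```

The enumeration visits 2^n worlds for n tuples, which is why the abstract calls the underlying problems computationally hard and why compact, single-pass synopses are attractive.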


Citations
Book

Managing and Mining Sensor Data

TL;DR: Managing and Mining Sensor Data is a contributed volume by prominent leaders in this field, intended as a secondary textbook or reference for advanced-level students in computer science; practitioners and researchers working in the field will also find it useful.
Proceedings Article

Event queries on correlated probabilistic streams

TL;DR: This paper proposes Lahar, an event processing system for probabilistic event streams. Using a novel static analysis and novel algorithms, it yields much higher recall and precision than deterministic techniques that operate over only the most probable tuples.
Proceedings Article

Finding frequent items in probabilistic data

TL;DR: This paper proposes a new definition based on the possible world semantics that has been widely adopted for many query types in uncertain data management, trying to find all the items that are likely to be frequent in a randomly generated possible world.
Proceedings Article

Approximation algorithms for clustering uncertain data

TL;DR: The core mining problem of clustering uncertain data is studied: natural generalizations of standard clustering optimization criteria are defined, and a variety of bicriteria approximation algorithms are given, including the first known guaranteed approximation algorithms for clustering uncertain data.
Journal Article

Mining Frequent Subgraph Patterns from Uncertain Graph Data

TL;DR: This paper is the first to investigate the problem of mining frequent subgraph patterns from uncertain graph data. It uses efficient methods to determine whether a subgraph pattern can be output, and a new pruning method to reduce the complexity of examining subgraph patterns.
References
Proceedings Article

Models and issues in data stream systems

TL;DR: This paper motivates the need for, and the research issues arising from, a new model of data processing in which data does not take the form of persistent relations but rather arrives in multiple, continuous, rapid, time-varying data streams.
Journal Article

An improved data stream summary: the count-min sketch and its applications

TL;DR: In this paper, the authors introduce a sublinear-space data structure called the count-min sketch for summarizing data streams. It allows fundamental queries in data stream summarization, such as point, range, and inner product queries, to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams, such as finding quantiles and frequent items.
Journal Article

Probabilistic counting algorithms for data base applications

TL;DR: A class of probabilistic counting algorithms with which one can estimate the number of distinct elements in a large collection of data in a single pass using only a small additional storage and only a few operations per element scanned is introduced.
Journal Article

Approximate frequency counts over data streams

TL;DR: This talk will trace the history of the Approximate Frequency Counts paper, how it was conceptualized and how it influenced data stream research.
Proceedings Article

The space complexity of approximating the frequency moments

TL;DR: It turns out that the numbers $F_0$, $F_1$ and $F_2$ can be approximated in logarithmic space, whereas the approximation of $F_k$ for $k \geq 6$ requires $n^{\Omega(1)}$ space.