Proceedings ArticleDOI

Models and issues in data stream systems

03 Jun 2002-pp 1-16
TL;DR: Motivates the need for, and the research issues arising from, a new model of data processing in which data does not take the form of persistent relations but rather arrives in multiple, continuous, rapid, time-varying data streams.
Abstract: In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

Summary (7 min read)

1 Introduction

  • Recently a new class of data-intensive applications has become widely recognized: applications in which the data is modeled best not as persistent relations but rather as transient data streams.
  • In this paper the authors consider fundamental models and issues in developing a general-purpose Data Stream Management System (DSMS).
  • (Any glaring omissions are, naturally, their own fault.)
  • The focus is on two specific links: a customer link, C, which connects the network of a customer to the ISP's network, and a backbone link, B, which connects two routers within the backbone network of the ISP.

3 Review of Data Stream Projects

  • The authors now provide an overview of several past and current projects related to data stream management.
  • A restricted subset of SQL was used as the query language in order to provide guarantees about efficient evaluation and append-only query results.
  • OpenCQ uses a query processing algorithm based on incremental view maintenance, while NiagaraCQ addresses scalability in number of queries by proposing techniques for grouping continuous queries for efficient evaluation.
  • Of particular importance is work on self-maintenance [15, 45, 71], ensuring that enough data has been saved to maintain a view even when the base data is unavailable, and on the related problem of data expiration [36], determining when certain base data can be discarded without compromising the ability to maintain a view.
  • The Telegraph project [8, 47, 58, 59] shares some target applications and basic technical ideas with a DSMS.

4 Queries over Data Streams

  • Query processing in the data stream model of computation comes with its own unique challenges.
  • The authors will outline what they consider to be the most interesting of these challenges, and describe several alternative approaches for resolving them.
  • The issues raised in this section will frame the discussion in the rest of the paper.

4.1 Unbounded Memory Requirements

  • Since data streams are potentially unbounded in size, the amount of storage required to compute an exact answer to a data stream query may also grow without bound.
  • The continuous data stream model is most applicable to problems where timely query responses are important and there are large volumes of data that are being continually produced at a high rate over time.
  • For this reason, the authors are interested in algorithms that are able to confine themselves to main memory without accessing disk.
  • They consider a limited class of queries and, for that class, provide a complete characterization of the queries that require a potentially unbounded amount of memory (proportional to the size of the input data streams) to answer.
  • Their result shows that without knowing the size of the input data streams, it is impossible to place a limit on the memory requirements for most common queries involving joins, unless the domains of the attributes involved in the query are restricted (either based on known characteristics of the data or through the imposition of query predicates). A toy sketch illustrating this unbounded growth appears below.
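
A toy Python sketch (ours, not the paper's) of the phenomenon: a symmetric hash join over two streams must remember every tuple it has seen, because a matching tuple may still arrive arbitrarily late on the other stream. The arrival format and the key name 'k' are illustrative assumptions.

from collections import defaultdict

def symmetric_hash_join(arrivals):
    """arrivals yields ('A' or 'B', tuple_dict); joins streams A and B on key 'k'."""
    seen_a, seen_b = defaultdict(list), defaultdict(list)
    for side, t in arrivals:
        own, other = (seen_a, seen_b) if side == 'A' else (seen_b, seen_a)
        own[t['k']].append(t)   # state grows with the streams: nothing can be discarded
        for match in other[t['k']]:
            yield (t, match)

Unless the domain of the join attribute is restricted, the two hash tables grow in proportion to the number of tuples that have arrived, which is exactly the unbounded-memory behavior described above.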

4.2 Approximate Query Answering

  • As described in the previous section, when the authors are limited to a bounded amount of memory it is not always possible to produce exact answers for data stream queries; however, high-quality approximate answers are often acceptable in lieu of exact answers.
  • Approximation algorithms for problems defined over data streams have been a fruitful research area in the algorithms community in recent years, as discussed in detail in Section 6.
  • Based on these summarization techniques, the authors have seen some work on approximate query answering.
  • In the next two subsections, the authors will touch upon several approaches to approximation, some of which are peculiar to the data stream model of computation.

4.3 Sliding Windows

  • One technique for producing an approximate answer to a data stream query is to evaluate the query not over the entire past history of the data streams, but rather only over sliding windows of recent data from the streams.
  • For example, only data from the last week could be considered in producing query answers, with data older than one week being discarded (see the sketch after this list).
  • It is well-defined and easily understood: the semantics of the approximation are clear, so that users of the system can be confident that they understand what is given up in producing the approximate answer.
  • Research in temporal databases [80] is concerned primarily with maintaining a full history of each data value over time, while in a data stream system the authors are concerned primarily with processing new data elements on-the-fly.
  • The sequence database model assumes that the database system has control over which sequence to process tuples from next, e.g., when merging multiple sequences, an assumption that cannot be made in a data stream system.
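
The following minimal Python sketch (our illustration, not code from the paper) maintains a sum over a time-based sliding window; elements older than the window width are expired and stop influencing the answer.

from collections import deque

class SlidingWindowSum:
    """Sum of values whose timestamp lies within the last `width` time units."""
    def __init__(self, width):
        self.width = width
        self.buf = deque()      # (timestamp, value) pairs in timestamp order
        self.total = 0.0

    def insert(self, ts, value):
        self.buf.append((ts, value))
        self.total += value
        # expire elements that have fallen out of the window
        while self.buf and self.buf[0][0] <= ts - self.width:
            _, old = self.buf.popleft()
            self.total -= old

    def answer(self):
        return self.total

Note that the buffer itself still holds the full window contents; when the window is too large for memory, approximation techniques are needed, as discussed in Section 6.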

4.4 Batch Processing, Sampling, and Synopses

  • The authors describe a general framework for these techniques.
  • Suppose that a data stream query is answered using a data structure that can be maintained incrementally.
  • The most general description of such a data structure is that it supports two operations, update and computeAnswer.
  • The update operation is invoked to update the data structure as each new data element arrives, and the computeAnswer operation produces new or updated results to the query.
  • When processing continuous queries, the best scenario is that both operations are fast relative to the arrival rate of elements in the data streams; this two-operation interface is sketched below.
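
A minimal Python rendering of this interface (the operation names follow the paper; the example subclass is ours):

class StreamSummary:
    """A query-answering data structure maintained incrementally over a stream."""
    def update(self, element):
        """Fold one newly arrived stream element into the structure."""
        raise NotImplementedError
    def compute_answer(self):
        """Produce new or updated results to the query."""
        raise NotImplementedError

class RunningAverage(StreamSummary):
    # Trivial instance in which both operations are fast, the ideal case.
    def __init__(self):
        self.count, self.total = 0, 0.0
    def update(self, element):
        self.count += 1
        self.total += element
    def compute_answer(self):
        return self.total / self.count if self.count else None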

Batch Processing

  • The first scenario is that the update operation is fast but the computeAnswer operation is slow.
  • The query answer may be considered approximate in the sense that it is not timely, i.e., it represents the exact answer at a point in the recent past rather than the exact answer at the present moment.
  • This approach of approximation through batch processing is attractive because it does not cause any uncertainty about the accuracy of the answer, sacrificing timeliness instead.
  • An algorithm that cannot keep up with the peak data stream rate may be able to handle the average stream rate quite comfortably by buffering the streams when their rate is high and catching up during the slow periods.
  • This is the approach used in the XJoin algorithm [88]; a minimal batching loop in this spirit is sketched below.
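
A minimal sketch of this scenario, reusing the StreamSummary interface above (ours, not the XJoin code): the cheap update runs on every arrival, while the expensive computeAnswer runs only once per batch, so the answer lags the stream but remains exact for a recent prefix.

def batch_process(arrivals, summary, every=1000):
    """Apply a fast update per element; emit an answer once per `every` elements."""
    for i, element in enumerate(arrivals, 1):
        summary.update(element)                 # fast: keeps up with peak rate
        if i % every == 0:
            yield summary.compute_answer()      # slow: amortized over the batch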

Sampling

  • In the second scenario, computeAnswer may be fast, but the update operation is slow: it takes longer than the average inter-arrival time of the data elements.
  • In this case it is infeasible to process every tuple; instead, some tuples must be skipped altogether, so that the query is evaluated over a sample of the data stream rather than over the entire data stream.
  • The authors obtain an approximate answer, but in some cases one can give confidence bounds on the degree of error introduced by the sampling process [48] .
  • Unfortunately, for many situations (including most queries involving joins [20, 22] ), sampling-based approaches cannot give reliable approximation guarantees.
  • Designing sampling-based algorithms that can produce approximate answers that are provably close to the exact answer is an important and active area of research.

Synopsis Data Structures

  • Quite obviously, data structures where both the update and the computeAnswer operations are fast are most desirable.
  • For classes of data stream queries where no exact data structure with the desired properties exists, one can often design an approximate data structure that maintains a small synopsis or sketch of the data rather than an exact representation, and therefore is able to keep computation per data element to a minimum.
  • Performing data reduction through synopsis data structures as an alternative to batch processing or sampling is a fruitful research area with particular relevance to the data stream computation model.
  • Synopsis data structures are discussed in more detail in Section 6.

4.5 Blocking Operators

  • A blocking query operator is a query operator that is unable to produce the first tuple of its output until it has seen its entire input.
  • When the answer is larger, however, such as when the query answer is a relation that is to be produced in sorted order, it is more practical to maintain a data structure with the up-to-date answer, since continually retransmitting the entire answer would be cumbersome.
  • Tucker et al. [86] have proposed a different approach to blocking operators.
  • Upon seeing a punctuation asserting, for example, that all future tuples will have daynumber at least 10, an aggregation operator that was grouping by daynumber could stream out its answers for all daynumbers less than 10 (a small sketch of this pattern follows this list).
  • Closely related is the idea of schema-level assertions on data streams, which also may help with blocking operators and other aspects of data stream processing.
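
A small Python sketch of punctuation-aware aggregation (the message format is a hypothetical illustration, not the notation of Tucker et al.):

from collections import defaultdict

def sum_by_day(stream):
    """stream yields ('tuple', daynumber, value) or ('punct', d), where a
    punctuation ('punct', d) asserts that no future tuple has daynumber < d."""
    pending = defaultdict(float)
    for msg in stream:
        if msg[0] == 'tuple':
            _, day, value = msg
            pending[day] += value
        else:                                  # punctuation unblocks closed groups
            _, d = msg
            for day in sorted(k for k in pending if k < d):
                yield (day, pending.pop(day))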

4.6 Queries Referencing Past Data

  • In the data stream model of computation, once a data element has been streamed by, it cannot be revisited.
  • In the data stream model of computation, if the appropriate summary structure is not present, then no further recourse is available.
  • Ad hoc queries also raise the issue of adaptivity in query plans.
  • Extending this idea to adapt the joint plan for a set of continuous queries as new queries are added and old ones are removed remains an open research area.

5 Proposal for a DSMS

  • At Stanford the authors have begun the design and prototype implementation of a comprehensive DSMS called STREAM (for STanford StREam DatA Manager) [82] .
  • As discussed in earlier sections, in a DSMS traditional one-time queries are replaced or augmented with continuous queries, and techniques such as sliding windows, synopsis structures, approximate answers, and adaptive query processing become fundamental features of the system.
  • Other aspects of a complete DBMS also need to be reconsidered, including query languages, storage and buffer management, user and application interfaces, and transaction support.
  • In this section the authors will focus primarily on the query language and query processing components of a DSMS and only touch upon other issues based on their initial experiences.

5.1 Query Language for a DSMS

  • Any general-purpose data management system must have a flexible and intuitive method by which the users of the system can express their queries.
  • It is also a declarative language that gives the system flexibility in selecting the optimal evaluation procedure to produce the desired answer.
  • This interface is intuitive and gives the user more control over the exact series of steps by which the query answer is obtained than is provided by a declarative query language.
  • As in SQL-99, physical windows are specified using the ROWS keyword (e.g., ROWS 50 PRECEDING), while logical windows are specified via the RANGE keyword (e.g., RANGE 15 MINUTES PRECEDING).
  • The timestamp attribute is the ordering attribute for the records.

5.2 Timestamps in Streams

  • In the previous section, sliding windows are defined with respect to a timestamp or sequence number attribute representing a tuple's arrival time.
  • Timestamp issues also arise when a set of distributed streams makes up a single logical stream, as in the web monitoring application described in Section 2.2, or in truly distributed streams such as sensor networks, where comparing timestamps across stream elements may be relevant.
  • In the previous section the authors introduced implicit timestamps, in which the system adds a special field to each incoming tuple, and explicit timestamps, in which a data attribute is designated as the timestamp.
  • Implicit timestamps are used when the data source does not already include timestamp information, or when the exact moment in time associated with a tuple is not important, but general considerations such as "recent" or "old" may be important.
  • The clause ROWS 10 PRECEDING specifies a window consisting of the previous 10 tuples, strictly sorted by timestamp order.

5.3 Query Processing Architecture of a DSMS

  • The authors describe the query processing architecture of their DSMS.
  • One example operator is a sliding window join, which maintains a sliding window synopsis for each join input (Section 4.3).
  • During execution, an operator reads data from its input queues, updates the synopsis structures that it maintains, and writes results to its output queues (this read-update-emit loop is sketched after this list).
  • To handle these fluctuations, all of their operators are adaptive.
  • The scheduler needs to provide rate synchronization within operators (such as stream joins) and also across pipelined operators in query plans [8, 89] .
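
A schematic Python event loop for one such operator (our sketch; the process() method and the bounded per-queue batch are assumed conventions, not the STREAM implementation):

import queue

def run_operator(op, in_queues, out_queue, batch=64):
    """Read from input queues, let the operator update its synopses, emit results."""
    while True:
        for idx, q in enumerate(in_queues):
            for _ in range(batch):             # bounded work per queue: a crude form
                try:                           # of rate synchronization across inputs
                    t = q.get_nowait()
                except queue.Empty:
                    break
                for result in op.process(idx, t):   # updates synopses internally
                    out_queue.put(result)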

6 Algorithmic Issues

  • The algorithms community has been fairly active of late in the area of data streams, typically motivated by problems in databases and networking.
  • The main complexity measure is the space used by the algorithm, although the time required to process each stream element is also relevant.
  • In some cases, the algorithm maintains a data structure which can be used to compute the value of the function on demand, and then the time required to process each such query also becomes of interest.
  • For most interesting problems it is easy to prove a space lower bound that precludes this possibility, thereby forcing us to settle for bounds that are merely sublinear in the size of the input.
  • Most of these summary structures have been considered for traditional databases [13] .

6.1 Random Samples

  • In fact, the join synopsis in the AQUA system [2] is nothing but a uniform sample of the base relation.
  • Recently stratified sampling has been proposed as an alternative to uniform sampling to reduce error due to the variance in data and also to reduce error for group-by queries [1, 19] .
  • The reservoir sampling algorithm of Vitter [90] makes one pass over the data set and is well suited to the data stream model (sketched below).
  • There is also an extension by Chaudhuri, Motwani and Narasayya [22] to the case of weighted sampling.
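
A compact version of Vitter's Algorithm R (a standard formulation, lightly adapted): one pass, O(k) memory, and every element of the stream ends up in the sample with equal probability k/n.

import random

def reservoir_sample(stream, k):
    sample = []
    for n, item in enumerate(stream, 1):
        if n <= k:
            sample.append(item)                # fill the reservoir first
        else:
            j = random.randrange(n)            # uniform in [0, n)
            if j < k:                          # replace with probability k/n
                sample[j] = item
    return sample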

6.2 Sketching Techniques

  • In their seminal paper, Alon, Matias and Szegedy [5] introduced the notion of randomized sketching which has been widely used ever since.
  • Sketching involves building a summary of a data stream using a small amount of memory, from which it is possible to estimate the answer to certain queries (typically, "distance" queries) over the data set; a toy version of the AMS estimator appears after this list.
  • It remains an open problem to come up with techniques to maintain correlated aggregates [37] that have provable guarantees.
  • Consider the unary representation of the vector.
  • Feigenbaum et al. [33] showed how to construct such a family of range-summable {+1, -1}-valued hash functions with limited (four-way) independence.
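
A toy version of the AMS estimator for the second frequency moment F2 (our simplification: fully random +/-1 signs drawn lazily per item, rather than the four-wise independent hash families used in the literature):

import random
from statistics import median

def ams_f2_estimate(stream, num_means=16, num_medians=5, seed=0):
    rng = random.Random(seed)
    counters = [[0] * num_means for _ in range(num_medians)]
    signs = [dict() for _ in range(num_medians * num_means)]

    def sign(table, x):                        # lazily drawn +/-1 sign per item
        if x not in table:
            table[x] = rng.choice((-1, 1))
        return table[x]

    for x in stream:
        for r in range(num_medians):
            for c in range(num_means):
                counters[r][c] += sign(signs[r * num_means + c], x)

    # averaging squares within a row reduces variance; the median across
    # rows boosts the confidence of the estimate
    return median(sum(z * z for z in row) / num_means for row in counters)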

6.3 Histograms

  • They have been employed for a multitude of tasks such as query size estimation, approximate query answering, and data mining.
  • The authors consider the summarization of data streams using histograms.
  • There are several different types of histograms that have been proposed in the literature.
  • Equi-depth histograms partition the domain into buckets such that the number of data values falling into each bucket is uniform across all buckets.
  • End-biased histograms maintain exact counts of items that occur with frequency above a threshold, and approximate the other counts by a uniform distribution (a small offline example follows this list).
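
For concreteness, an offline equi-depth construction in Python (our sketch; maintaining such boundaries incrementally over a stream is the hard part):

def equi_depth_boundaries(values, num_buckets):
    """Choose boundaries so each bucket holds ~len(values)/num_buckets values."""
    ordered = sorted(values)
    per_bucket = len(ordered) / num_buckets
    return [ordered[min(int(round(i * per_bucket)), len(ordered) - 1)]
            for i in range(1, num_buckets)]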

V-Optimal Histograms over Data Streams

  • Jagadish et al. [54] showed how to compute optimal V-optimal histograms for a given data set using dynamic programming (the dynamic program is sketched after this list).
  • Guha, Koudas and Shim [43] adapted this algorithm to sorted data streams.
  • A robust approximation is built by repeatedly adding a dyadic interval of constant value which best reduces the approximation error.
  • This translates to efficiently computing the range sum of p-stable random variables (used for computing the sketch of Indyk [50]) over the dyadic interval.
  • Gilbert et al. [39] show how to construct such efficiently range-summable p-stable random variables.
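
The offline dynamic program of Jagadish et al. [54], sketched in Python (interface and variable names are ours): dp[j][k] is the minimum total sum of squared error when the first j values are covered by k buckets, each approximated by its mean.

def v_optimal_error(data, num_buckets):
    n = len(data)
    prefix, prefix_sq = [0.0] * (n + 1), [0.0] * (n + 1)
    for i, v in enumerate(data):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i, j):                     # squared error of one bucket over data[i:j]
        s, s2, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
        return s2 - s * s / m

    INF = float('inf')
    dp = [[INF] * (num_buckets + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for k in range(1, min(j, num_buckets) + 1):
            dp[j][k] = min(dp[i][k - 1] + sse(i, j) for i in range(k - 1, j))
    return dp[n][num_buckets]

The quadratic time and linear-in-n space of this program are what make it unsuitable for streams, motivating the sketch-based approximations above.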

6.4 Wavelets

  • Wavelet coefficients are projections of the given signal (the set of data values) onto an orthogonal set of basis vectors.
  • The choice of basis vectors determines the type of wavelets.
  • Wavelet coefficients have the desirable property that the signal reconstructed from the top few wavelet coefficients best approximates the original signal in terms of the L2 norm.
  • Recent papers have demonstrated the efficacy of wavelets for different tasks such as selectivity estimation [63] , data cube approximation [93] and computing multi-dimensional aggregates [92] .
  • To extend this body of work to data streams, it becomes important to devise techniques for computing wavelets in the streaming model; the classical offline Haar decomposition below shows the target computation.
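
For reference, an orthonormal Haar decomposition with top-k thresholding in Python (our sketch; the input length is assumed to be a power of two):

import heapq
import math

def haar_decompose(signal):
    """Orthonormal 1-D Haar transform; dropping small coefficients then
    inverting minimizes the L2 reconstruction error."""
    coeffs, s = [], list(signal)
    while len(s) > 1:
        half = len(s) // 2
        avgs  = [(s[2*i] + s[2*i+1]) / math.sqrt(2) for i in range(half)]
        diffs = [(s[2*i] - s[2*i+1]) / math.sqrt(2) for i in range(half)]
        coeffs.extend(diffs)
        s = avgs
    coeffs.extend(s)                    # the single remaining scaled average
    return coeffs

def keep_top_k(coeffs, k):
    keep = set(heapq.nlargest(k, range(len(coeffs)), key=lambda i: abs(coeffs[i])))
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]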

6.5 Sliding Windows

  • As discussed in Section 4, sliding windows prevent stale data from influencing analysis and statistics, and also serve as a tool for approximation in face of bounded memory.
  • There has been very little work on extending summarization techniques to sliding windows and it remains a ripe research area.
  • The authors briefly describe some of the recent work.
  • Datar et al. [26] showed how to maintain simple statistics over sliding windows, including the sketches used for computing the L1 and L2 norms; a simplified version of their bucket-based counting scheme is sketched after this list.
  • Some open problems for sliding windows are: clustering, maintaining top wavelet coefficients, maintaining statistics like variance, and computing correlated aggregates [37] .
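
A simplified sketch, in the spirit of the exponential histograms of Datar et al. [26], for approximately counting the 1s among the last `window` arrivals (our constants and bookkeeping are simplified relative to the paper):

from collections import deque

class ApproxWindowCounter:
    def __init__(self, window, max_same=2):
        self.window, self.max_same = window, max_same
        self.buckets = deque()   # (time of newest 1 in bucket, bucket size), oldest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        while self.buckets and self.buckets[0][0] <= self.t - self.window:
            self.buckets.popleft()      # bucket has slid entirely out of the window
        if bit != 1:
            return
        self.buckets.append((self.t, 1))
        size = 1                        # merge cascades: bound the buckets per size
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_same:
                break
            i, j = same[0], same[1]     # merge the two oldest buckets of this size
            bs = list(self.buckets)
            bs[j] = (bs[j][0], 2 * size)
            del bs[i]
            self.buckets = deque(bs)
            size *= 2

    def estimate(self):
        sizes = [s for _, s in self.buckets]
        return sum(sizes) - sizes[0] // 2 if sizes else 0

Only the oldest bucket may straddle the window boundary, so halving its contribution bounds the relative error by roughly 1/max_same.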

6.6 Negative Results

  • Henzinger, Raghavan, and Rajagopalan [49] provided space lower bounds for concrete problems in the data stream model.
  • This serves as a reminder that while it may be possible to prove strong space lower bounds for stream computations, considerations from applications sometimes enable us to circumvent the negative results.
  • Saks and Sun [73] provide space lower bounds for distance approximation between two vectors under the Lp norm, for p > 2, in the data stream model.
  • Space lower bounds for maintaining simple statistics like count, sum, min/max, and number of distinct values under the sliding windows model can be found in the work of Datar et al. [26] .
  • It is useful for deriving space lower bounds for data stream algorithms that resort to oblivious sampling.

Data Mining

  • Decision trees are another form of synopsis used for prediction.
  • Clustering is yet another way to summarize data.
  • Consider the k-median formulation for clustering: given n data points in a metric space, the objective is to choose k representative points such that the sum of the errors over the n data points is minimized.
  • The "error" for each data point is the distance from that point to the nearest of the chosen representative points.
  • When enough weighted cluster centers have accumulated from clustering different subsets of the data, the authors cluster those centers into higher-level cluster centers, and so on; this divide-and-conquer scheme is sketched below.
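
A schematic of this divide-and-conquer scheme in Python (our sketch; base_cluster(weighted_points, k) stands for any in-memory k-median routine returning k (center, weight) pairs and is assumed, not implemented):

def cluster_stream(chunks, k, base_cluster, level_capacity=1000):
    levels = [[]]   # levels[i]: weighted centers i merge-rounds removed from raw data
    for chunk in chunks:
        levels[0].extend(base_cluster([(p, 1) for p in chunk], k))
        i = 0
        while len(levels[i]) > level_capacity:
            merged = base_cluster(levels[i], k)   # re-cluster centers into higher-level centers
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(merged)
            i += 1
    return base_cluster([c for lvl in levels for c in lvl], k)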

Multiple Streams

  • Gibbons and Tirthapura [38] considered the problem of computing simple functions, such as the number of distinct elements, over unions of data streams.
  • Assuming positive answers to the "meta-questions" above, the authors see several fundamental aspects to the design of data stream systems, some of which they discussed in detail in this paper.
  • One important general question is the interface provided by the DSMS.
  • Other fundamental issues discussed in the paper include timestamping and ordering, support for sliding window queries, and dealing effectively with blocking operators.
  • Another issue the authors touched on only briefly in Section 4.5 is that of constraints over streams, and how they can be exploited in query processing.


Models and Issues in Data Stream Systems
Brian Babcock Shivnath Babu Mayur Datar Rajeev Motwani Jennifer Widom
Department of Computer Science
Stanford University
Stanford, CA 94305
{babcock, shivnath, datar, rajeev, widom}@cs.stanford.edu
Abstract
In this overview paper we motivate the need for and research issues arising from a new model of
data processing. In this model, data does not take the form of persistent relations, but rather arrives in
multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to
data stream systems and current projects in the area, the paper explores topics in stream query languages,
new requirements and challenges in query processing, and algorithmic issues.
1 Introduction
Recently a new class of data-intensive applications has become widely recognized: applications in which
the data is modeled best not as persistent relations but rather as transient data streams. Examples of such
applications include financial applications, network monitoring, security, telecommunications data manage-
ment, web applications, manufacturing, sensor networks, and others. In the data stream model, individual
data items may be relational tuples, e.g., network measurements, call records, web page visits, sensor read-
ings, and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable
and unbounded streams appears to yield some fundamentally new research problems.
In all of the applications cited above, it is not feasible to simply load the arriving data into a tradi-
tional database management system (DBMS) and operate on it there. Traditional DBMS’s are not designed
for rapid and continuous loading of individual data items, and they do not directly support the continuous
queries [84] that are typical of data stream applications. Furthermore, it is recognized that both approxima-
tion [13] and adaptivity [8] are key ingredients in executing queries and performing other processing (e.g.,
data analysis and mining) over rapid data streams, while traditional DBMS’s focus largely on the opposite
goal of precise answers computed by stable query plans.
In this paper we consider fundamental models and issues in developing a general-purpose Data Stream
Management System (DSMS). We are developing such a system at Stanford [82], and we will touch on some
of our own work in this paper. However, we also attempt to provide a general overview of the area, along
with its related and current work. (Any glaring omissions are, naturally, our own fault.)
We begin in Section 2 by considering the data stream model and queries over streams. In this section we
take a simple view: streams are append-only relations with transient tuples, and queries are SQL operating
over these logical relations. In later sections we discuss several issues that complicate the model and query
language, such as ordering, timestamping, and sliding windows. Section 2 also presents some concrete
examples to ground our discussion.
In Section 3 we review recent projects geared specifically towards data stream processing, as well as
a plethora of past research in areas related to data streams: active databases, continuous queries, filtering
Work supported by NSF Grant IIS-0118173. Mayur Datar was also supported by a Microsoft Graduate Fellowship. Rajeev
Motwani received partial support from an Okawa Foundation Research Grant.

systems, view management, sequence databases, and others. Although much of this work clearly has ap-
plications to data stream processing, we hope to show in this paper that there are many new problems to
address in realizing a complete DSMS.
Section 4 delves more deeply into the area of query processing, uncovering a number of important issues,
including:
Queries that require an unbounded amount of memory to evaluate precisely, and approximate query
processing techniques to address this problem.
Sliding window query processing (i.e., considering “recent” portions of the streams only), both as
an approximation technique and as an option in the query language since many applications prefer
sliding-window queries.
Batch processing, sampling, and synopsis structures to handle situations where the flow rate of the
input streams may overwhelm the query processor.
The meaning and implementation of blocking operators (e.g., aggregation and sorting) in the presence
of unending streams.
Continuous queries that are registered when portions of the data streams have already "passed by," yet
the queries wish to reference stream history.
Section 5 then outlines some details of a query language and an architecture for a DSMS query processor
designed specifically to address the issues above.
In Section 6 we review algorithmic results in data stream processing. Our focus is primarily on sketching
techniques and building summary structures (synopses). We also touch upon sliding window computations,
present some negative results, and discuss a few additional algorithmic issues.
We conclude in Section 7 with some remarks on the evolution of this new field, and a summary of
directions for further work.
2 The Data Stream Model
In the data stream model, some or all of the input data that are to be operated on are not available for random
access from disk or memory, but rather arrive as one or more continuous data streams. Data streams differ
from the conventional stored relation model in several ways:
The data elements in the stream arrive online.
The system has no control over the order in which data elements arrive to be processed, either within
a data stream or across data streams.
Data streams are potentially unbounded in size.
Once an element from a data stream has been processed it is discarded or archived; it cannot be
retrieved easily unless it is explicitly stored in memory, which typically is small relative to the size of
the data streams.
Operating in the data stream model does not preclude the presence of some data in conventional stored
relations. Often, data stream queries may perform joins between data streams and stored relational data.
For the purposes of this paper, we will assume that if stored relations are used, their contents remain static.
Thus, we preclude any potential transaction-processing issues that might arise from the presence of updates
to stored relations that occur concurrently with data stream processing.

2.1 Queries
Queries over continuous data streams have much in common with queries in a traditional database manage-
ment system. However, there are two important distinctions peculiar to the data stream model. The first
distinction is between one-time queries and continuous queries [84]. One-time queries (a class that includes
traditional DBMS queries) are queries that are evaluated once over a point-in-time snapshot of the data set,
with the answer returned to the user. Continuous queries, on the other hand, are evaluated continuously as
data streams continue to arrive. Continuous queries are the more interesting class of data stream queries, and
it is to them that we will devote most of our attention. The answer to a continuous query is produced over
time, always reflecting the stream data seen so far. Continuous query answers may be stored and updated as
new data arrives, or they may be produced as data streams themselves. Sometimes one or the other mode
is preferred. For example, aggregation queries may involve frequent changes to answer tuples, dictating the
stored approach, while join queries are monotonic and may produce rapid, unbounded answers, dictating
the stream approach.
The second distinction is between predefined queries and ad hoc queries. A predefined query is one
that is supplied to the data stream management system before any relevant data has arrived. Predefined
queries are generally continuous queries, although scheduled one-time queries can also be predefined. Ad
hoc queries, on the other hand, are issued online after the data streams have already begun. Ad hoc queries
can be either one-time queries or continuous queries. Ad hoc queries complicate the design of a data stream
management system, both because they are not known in advance for the purposes of query optimization,
identification of common subexpressions across queries, etc., and more importantly because the correct
answer to an ad hoc query may require referencing data elements that have already arrived on the data
streams (and potentially have already been discarded). Ad hoc queries are discussed in more detail in
Section 4.6.
2.2 Motivating Examples
Examples motivating a data stream system can be found in many application domains including finance,
web applications, security, networking, and sensor monitoring.
Traderbot [85] is a web-based financial search engine that evaluates queries over real-time streaming
financial data such as stock tickers and news feeds. The Traderbot web site [85] gives some examples
of one-time and continuous queries that are commonly posed by its customers.
Modern security applications often apply sophisticated rules over network packet streams. For exam-
ple, iPolicy Networks [52] provides an integrated security platform providing services such as firewall
support and intrusion detection over multi-gigabit network packet streams. Such a platform needs to
perform complex stream processing including URL-filtering based on table lookups, and correlation
across multiple network traffic flows.
Large web sites monitor web logs (clickstreams) online to enable applications such as personaliza-
tion, performance monitoring, and load-balancing. Some web sites served by widely distributed web
servers (e.g., Yahoo [95]) may need to coordinate many distributed clickstream analyses, e.g., to track
heavily accessed web pages as part of their real-time performance monitoring.
There are several emerging applications in the area of sensor monitoring [16, 58] where a large number
of sensors are distributed in the physical world and generate streams of data that need to be combined,
monitored, and analyzed.

The application domain that we use for more detailed examples is network traffic management, which
involves monitoring network packet header information across a set of routers to obtain information on
traffic flow patterns. Based on a description of Babu and Widom [10], we delve into this example in some
detail to help illustrate that continuous queries arise naturally in real applications and that conventional
DBMS technology does not adequately support such queries.
Consider the network traffic management system of a large network, e.g., the backbone network of an
Internet Service Provider (ISP) [30]. Such systems monitor a variety of continuous data streams that may be
characterized as unpredictable and arriving at a high rate, including both packet traces and network perfor-
mance measurements. Typically, current traffic-management tools either rely on a special-purpose system
that performs online processing of simple hand-coded continuous queries, or they just log the traffic data and
perform periodic offline query processing. Conventional DBMS’s are deemed inadequate to provide the kind
of online continuous query processing that would be most beneficial in this domain. A data stream system
that could provide effective online processing of continuous queries over data streams would allow network
operators to install, modify, or remove appropriate monitoring queries to support efficient management of
the ISP’s network resources.
Consider the following concrete setting. Network packet traces are being collected from a number of
links in the network. The focus is on two specific links: a customer link, C, which connects the network of
a customer to the ISP’s network, and a backbone link, B, which connects two routers within the backbone
network of the ISP. Let
and
denote two streams of packet traces corresponding to these two links. We
assume, for simplicity, that the traces contain just the five fields of the packet header that are listed below.
src: IP address of packet sender.
dest: IP address of packet destination.
id: Identification number given by sender so that destination can uniquely identify each packet.
len: Length of the packet.
time: Time when packet header was recorded.
Consider first the continuous query Q1, which computes load on the link B averaged over one-minute
intervals, notifying the network operator when the load crosses a specified threshold t. The functions
getminute and notifyoperator have the natural interpretation.

Q1: SELECT notifyoperator(sum(len))
    FROM B
    GROUP BY getminute(time)
    HAVING sum(len) > t
While the functionality of such a query may possibly be achieved in a DBMS via the use of triggers, we
are likely to prefer the use of special techniques for performance reasons. For example, consider the case
where the link B has a very high throughput (e.g., if it were an optical link). In that case, we may choose to
compute an approximate answer to Q1 by employing random sampling on the stream B, a task outside the
reach of standard trigger mechanisms.
The second query Q2 isolates flows in the backbone link and determines the amount of traffic generated
by each flow. A flow is defined here as a sequence of packets grouped in time, and sent from a specific
source to a specific destination.


Q2: SELECT flowid, src, dest, sum(len) AS flowlen
    FROM (SELECT src, dest, len, time
          FROM B
          ORDER BY time)
    GROUP BY src, dest, getflowid(src, dest, time) AS flowid
Here getflowid is a user-defined function which takes the source IP address, the destination IP address,
and the timestamp of a packet, and returns the identifier of the flow to which the packet belongs. We assume
that the data in the view (or table expression) in the FROM clause is passed to the getflowid function in
the order defined by the ORDER BY clause.
Observe that handling Q2 over stream B is particularly challenging due to the presence of GROUP BY
and ORDER BY clauses, which lead to "blocking" operators in a query execution plan.
Consider now the task of determining the fraction of the backbone link’s traffic that can be attributed to
the customer network. This query, Q3, is an example of the kind of ad hoc continuous queries that may be
registered during periods of congestion to determine whether the customer network is the likely cause.

Q3: (SELECT count(*)
     FROM C, B
     WHERE C.src = B.src AND C.dest = B.dest
       AND C.id = B.id)
    ÷
    (SELECT count(*) FROM B)
Observe that Q3 joins streams C and B on their keys to obtain a count of the number of common packets.
Since joining two streams could potentially require unbounded intermediate storage (for example if there is
no bound on the delay between a packet showing up on the two links), the user may prefer to compute an
approximate answer. One approximation technique would be to maintain bounded-memory synopses of the
two streams (see Section 6); alternatively, one could exploit aspects of the application semantics to bound
the required storage (e.g., we may know that joining tuples are very likely to occur within a bounded time
window).
Our final example, Q4, is a continuous query for monitoring the source-destination pairs in the top 5
percent in terms of backbone traffic. For ease of exposition, we employ the WITH construct from SQL-
99 [87].

Q4: WITH Load AS
       (SELECT src, dest, sum(len) AS traffic
        FROM B
        GROUP BY src, dest)
    SELECT src, dest, traffic
    FROM Load AS L1
    WHERE (SELECT count(*)
           FROM Load AS L2
           WHERE L2.traffic > L1.traffic)
          < (SELECT 0.05 * count(*) FROM Load)
    ORDER BY traffic

Citations
More filters
01 Jan 2006
TL;DR: There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linoff [BL99].
Abstract: The book Knowledge Discovery in Databases, edited by Piatetsky-Shapiro and Frawley [PSF91], is an early collection of research papers on knowledge discovery from data. The book Advances in Knowledge Discovery and Data Mining, edited by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy [FPSSe96], is a collection of later research results on knowledge discovery and data mining. There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linoff [BL99], Building Data Mining Applications for CRM by Berson, Smith, and Thearling [BST99], Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank [WF05], Principles of Data Mining (Adaptive Computation and Machine Learning) by Hand, Mannila, and Smyth [HMS01], The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman [HTF01], Data Mining: Introductory and Advanced Topics by Dunham, and Data Mining: Multimedia, Soft Computing, and Bioinformatics by Mitra and Acharya [MA03]. There are also books containing collections of papers on particular aspects of knowledge discovery, such as Machine Learning and Data Mining: Methods and Applications edited by Michalski, Bratko, and Kubat [MBK98], and Relational Data Mining edited by Dzeroski and Lavrac [De01], as well as many tutorial notes on data mining in major database, data mining and machine learning conferences.

2,591 citations

Journal ArticleDOI
TL;DR: The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art and aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Abstract: Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.

2,374 citations


Cites background from "Models and issues in data stream sy..."

  • ...Two basic types of sliding windows are [Babcock et al. 2002] (i) sequence based, where the size of a window is characterized by the number of observations, and (ii) timestamp based, where the size of a window is defined by duration time....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors introduce a sublinear space data structure called the countmin sketch for summarizing data streams, which allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc.

1,939 citations

Proceedings ArticleDOI
13 Jun 2003
TL;DR: A new symbolic representation of time series is introduced that is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measuresdefined on the original series.
Abstract: The parallel explosions of interest in streaming data, and data mining of time series have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real valued. Many researchers have also considered transforming real valued time series into symbolic representations, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities, in addition to allowing formerly "batch-only" problems to be tackled by the streaming community. While many symbolic representations of time series have been introduced over the past decades, they all suffer from three fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. Finally, most of these symbolic approaches require one to have access to all the data, before creating the symbolic representation. This last feature explicitly thwarts efforts to use the representations with streaming algorithms. In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead. We will demonstrate the utility of our representation on the classic data mining tasks of clustering, classification, query by content and anomaly detection.

1,922 citations


Cites background from "Models and issues in data stream sy..."

  • ...The parallel explosions of interest in streaming data [4, 8, 10, 18], and data mining of time series [6, 7, 9, 20, 21, 24, 26, 34] have had surprisingly little intersection....

    [...]

Book ChapterDOI
09 Sep 2003
TL;DR: A fundamentally different philosophy for data stream clustering is discussed which is guided by application-centered requirements and uses the concepts of a pyramidal time frame in conjunction with a microclustering approach.
Abstract: The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is to divide the clustering process into an online component which periodically stores detailed summary statistics and an offline component which uses only this summary statistics. The offline component is utilized by the analyst who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a microclustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.

1,836 citations


Cites background from "Models and issues in data stream sy..."

  • ...The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters....

    [...]

References
More filters
Book
01 Jan 1995
TL;DR: This book introduces the basic concepts in the design and analysis of randomized algorithms and presents basic tools such as probability theory and probabilistic analysis that are frequently used in algorithmic applications.
Abstract: For many applications, a randomized algorithm is either the simplest or the fastest algorithm available, and sometimes both. This book introduces the basic concepts in the design and analysis of randomized algorithms. The first part of the text presents basic tools such as probability theory and probabilistic analysis that are frequently used in algorithmic applications. Algorithmic examples are also given to illustrate the use of each tool in a concrete setting. In the second part of the book, each chapter focuses on an important area to which randomized algorithms can be applied, providing a comprehensive and representative selection of the algorithms that might be used in each of these areas. Although written primarily as a text for advanced undergraduates and graduate students, this book should also prove invaluable as a reference for professionals and researchers.

4,412 citations

Proceedings ArticleDOI
01 Aug 2000
TL;DR: This paper describes and evaluates VFDT, an anytime system that builds decision trees using constant memory and constant time per example, and applies it to mining the continuous stream of Web access data from the whole University of Washington main campus.
Abstract: Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. This paper describes and evaluates VFDT, an anytime system that builds decision trees using constant memory and constant time per example. VFDT can incorporate tens of thousands of examples per second using off-the-shelf hardware. It uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a conventional learner. We study VFDT's properties and demonstrate its utility through an extensive set of experiments on synthetic data. We apply VFDT to mining the continuous stream of Web access data from the whole University of Washington main campus.

2,171 citations


"Models and issues in data stream sy..." refers background in this paper

  • ...[28, 29] have studied the problem of maintaining decision trees over datastreams....

    [...]

Book
01 Jan 1996
TL;DR: This chapter surveys the theory of two-party communication complexity and presents results regarding the following models of computation: • Finite automata • Turing machines • Decision trees • Ordered binary decision diagrams • VLSI chips • Networks of threshold gates.
Abstract: In this chapter we survey the theory of two-party communication complexity. This field of theoretical computer science aims at studying the following, seemingly very simple, scenario: There are two players Alice who holds an n-bit string x and Bob who holds an n-bit string y. Their goal is to communicate in order to compute the value of some boolean function f(x, y), while exchanging a number of bits which is as small as possible. In the first part of this survey we present, mainly by giving examples, some of the results (and techniques) developed as part of this theory. We put an emphasis on proving lower bounds on the amount of communication that must be exchanged in the above scenario for certain functions f . In the second part of this survey we will exemplify the wide applicability of the results proved in the first part to other areas of computer science. While it is obvious that there are many applications of the results to problems in which communication is involved (e.g., in distributed systems), we concentrate on applications in which communication does not appear explicitly in the statement of the problems. In particular, we present results regarding the following models of computation: • Finite automata • Turing machines • Decision trees • Ordered binary decision diagrams (OBDDs) • VLSI chips • Networks of threshold gates We provide references to many other issues and applications of communication complexity which are not discussed in this survey.

2,004 citations


"Models and issues in data stream sy..." refers background in this paper

  • ...are derived from results in communication complexity [56]....

    [...]

  • ...We feel that techniques based on communication complexity results [56] will prove useful in this context....

    [...]

Proceedings ArticleDOI
26 Aug 2001
TL;DR: An efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner is proposed, called CVFDT, which stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate.
Abstract: Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.

1,790 citations


"Models and issues in data stream sy..." refers background in this paper

  • ...[28, 29] have studied the problem of maintaining decision trees over datastreams....

    [...]

01 Jan 1999
TL;DR: XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer, and has been endorsed by the Director as a W3C Recommendation.
Abstract: XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer. Status of this document This document has been reviewed by W3C Members and other interested parties and has been endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited as a normative reference from other documents. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web. The list of known errors in this specification is available at http://www.w3.org/1999/11/REC-xpath-19991116-errata. Comments on this specification may be sent to www-xpath-comments@w3.org; archives of the comments are available. The English version of this specification is the only normative version. However, for translations of this document, see http://www.w3.org/Style/XSL/translations.html. A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR. This specification is joint work of the XSL Working Group and the XML Linking Working Group and so is part of the W3C Style activity and of the W3C XML activity.

1,785 citations

Frequently Asked Questions (13)
Q1. What are the future works in "Models and issues in data stream systems"?

Their approach at Stanford is to extend SQL to support stream-oriented primitives, providing a purely declarative interface as in traditional database systems, although the authors also allow direct submission of query plans. From a purely theoretical perspective, perhaps the most interesting open question is that of defining extensions of relational operators to handle data stream constructs, and to study the resulting "stream algebra" and other properties of these extensions. Another issue the authors touched on only briefly in Section 4.5 is that of constraints over streams, and how they can be exploited in query processing.

In this overview paper the authors motivate the need for and research issues arising from a new model of data processing. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues. 

The key contribution of Alon et al. [5] was a sketching technique to estimate the second frequency moment F2 using space only logarithmic in the input size, while providing arbitrarily small approximation factors.

The underlying source of data for their examples will be a stream of telephone call records, each with four attributes: customer id, type, minutes, and timestamp. 

Other aspects of a complete DBMS also need to be reconsidered, including query languages, storage and buffer management, user and application interfaces, and transaction support. 

When the sliding window serves mostly to increase query processing efficiency, then the best-effort approach, which allows wide latitude over the ordering of tuples, is usually acceptable. 

It is futile to attempt to make use of all the data when computing an answer, because data arrives faster than it can be processed. 

The idea behind list-efficient streaming algorithms is that instead of being presented one data item at a time, they are implicitly presented with a list of data items in a succinct form. 

In the case where the sliding window is large enough so that the entire contents of the window cannot be buffered in memory, there are also theoretical challenges in designing algorithms that can give approximate answers using only the available memory. 

A more ambitious approach to handling ad hoc queries that reference past data is to maintain summaries of data streams (in the form of general-purpose synopses or aggregates) that can be used to give approximate answers to future ad hoc queries. 

One technique for producing an approximate answer to a data stream query is to evaluate the query not over the entire past history of the data streams, but rather only over sliding windows of recent data from the streams. 

This translates to efficiently computing the range sum of p-stable random variables (used for computing the sketch of Indyk [50]) over the dyadic interval.

This work has led to some general techniques for data reduction and synopsis construction, including: sketches [5, 35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92].