Proceedings ArticleDOI

Models and issues in data stream systems

03 Jun 2002-pp 1-16
TL;DR: Motivates the need for, and the research issues arising from, a new model of data processing in which data does not take the form of persistent relations but rather arrives in multiple, continuous, rapid, time-varying data streams.
Abstract: In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

Summary (7 min read)

1 Introduction

  • Recently a new class of data-intensive applications has become widely recognized: applications in which the data is modeled best not as persistent relations but rather as transient data streams.
  • In this paper the authors consider fundamental models and issues in developing a general-purpose Data Stream Management System (DSMS).
  • (Any glaring omissions are, naturally, their own fault.)
  • The focus is on two specific links: a customer link, C, which connects the network of a customer to the ISP's network, and a backbone link, B, which connects two routers within the backbone network of the ISP.

3 Review of Data Stream Projects

  • The authors now provide an overview of several past and current projects related to data stream management.
  • A restricted subset of SQL was used as the query language in order to provide guarantees about efficient evaluation and append-only query results.
  • OpenCQ uses a query processing algorithm based on incremental view maintenance, while NiagaraCQ addresses scalability in number of queries by proposing techniques for grouping continuous queries for efficient evaluation.
  • Of particular importance is work on self-maintenance [15, 45, 71], ensuring that enough data has been saved to maintain a view even when the base data is unavailable, and on the related problem of data expiration [36], determining when certain base data can be discarded without compromising the ability to maintain a view.
  • The Telegraph project [8, 47, 58, 59] shares some target applications and basic technical ideas with a DSMS.

4 Queries over Data Streams

  • Query processing in the data stream model of computation comes with its own unique challenges.
  • The authors will outline what they consider to be the most interesting of these challenges, and describe several alternative approaches for resolving them.
  • The issues raised in this section will frame the discussion in the rest of the paper.

4.1 Unbounded Memory Requirements

  • Since data streams are potentially unbounded in size, the amount of storage required to compute an exact answer to a data stream query may also grow without bound.
  • The continuous data stream model is most applicable to problems where timely query responses are important and there are large volumes of data that are being continually produced at a high rate over time.
  • For this reason, the authors are interested in algorithms that are able to confine themselves to main memory without accessing disk.
  • They consider a limited class of queries and, for that class, provide a complete characterization of the queries that require a potentially unbounded amount of memory (proportional to the size of the input data streams) to answer.
  • Their result shows that without knowing the size of the input data streams, it is impossible to place a limit on the memory requirements for most common queries involving joins, unless the domains of the attributes involved in the query are restricted (either based on known characteristics of the data or through the imposition of query predicates). A toy sketch illustrating this unbounded growth appears below.
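
A toy Python sketch (ours, not the paper's) of the phenomenon: a symmetric hash join over two streams must remember every tuple it has seen, because a matching tuple may still arrive arbitrarily late on the other stream. The arrival format and the key name 'k' are illustrative assumptions.

from collections import defaultdict

def symmetric_hash_join(arrivals):
    """arrivals yields ('A' or 'B', tuple_dict); joins streams A and B on key 'k'."""
    seen_a, seen_b = defaultdict(list), defaultdict(list)
    for side, t in arrivals:
        own, other = (seen_a, seen_b) if side == 'A' else (seen_b, seen_a)
        own[t['k']].append(t)   # state grows with the streams: nothing can be discarded
        for match in other[t['k']]:
            yield (t, match)

Unless the domain of the join attribute is restricted, the two hash tables grow in proportion to the number of tuples that have arrived, which is exactly the unbounded-memory behavior described above.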

4.2 Approximate Query Answering

  • As described in the previous section, when the authors are limited to a bounded amount of memory it is not always possible to produce exact answers for data stream queries; however, high-quality approximate answers are often acceptable in lieu of exact answers.
  • Approximation algorithms for problems defined over data streams have been a fruitful research area in the algorithms community in recent years, as discussed in detail in Section 6.
  • Based on these summarization techniques, the authors have seen some work on approximate query answering.
  • In the next two subsections, the authors will touch upon several approaches to approximation, some of which are peculiar to the data stream model of computation.

4.3 Sliding Windows

  • One technique for producing an approximate answer to a data stream query is to evaluate the query not over the entire past history of the data streams, but rather only over sliding windows of recent data from the streams.
  • For example, only data from the last week could be considered in producing query answers, with data older than one week being discarded (see the sketch after this list).
  • It is well-defined and easily understood: the semantics of the approximation are clear, so that users of the system can be confident that they understand what is given up in producing the approximate answer.
  • Research in temporal databases [80] is concerned primarily with maintaining a full history of each data value over time, while in a data stream system the authors are concerned primarily with processing new data elements on-the-fly.
  • The sequence database model assumes that the database system has control over which sequence to process tuples from next, e.g., when merging multiple sequences, an assumption that cannot be made in a data stream system.
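
The following minimal Python sketch (our illustration, not code from the paper) maintains a sum over a time-based sliding window; elements older than the window width are expired and stop influencing the answer.

from collections import deque

class SlidingWindowSum:
    """Sum of values whose timestamp lies within the last `width` time units."""
    def __init__(self, width):
        self.width = width
        self.buf = deque()      # (timestamp, value) pairs in timestamp order
        self.total = 0.0

    def insert(self, ts, value):
        self.buf.append((ts, value))
        self.total += value
        # expire elements that have fallen out of the window
        while self.buf and self.buf[0][0] <= ts - self.width:
            _, old = self.buf.popleft()
            self.total -= old

    def answer(self):
        return self.total

Note that the buffer itself still holds the full window contents; when the window is too large for memory, approximation techniques are needed, as discussed in Section 6.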

4.4 Batch Processing, Sampling, and Synopses

  • The authors describe a general framework for these techniques.
  • Suppose that a data stream query is answered using a data structure that can be maintained incrementally.
  • The most general description of such a data structure is that it supports two operations, update and computeAnswer.
  • The update operation is invoked to update the data structure as each new data element arrives, and the computeAnswer operation produces new or updated results to the query.
  • When processing continuous queries, the best scenario is that both operations are fast relative to the arrival rate of elements in the data streams; this two-operation interface is sketched below.
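
A minimal Python rendering of this interface (the operation names follow the paper; the example subclass is ours):

class StreamSummary:
    """A query-answering data structure maintained incrementally over a stream."""
    def update(self, element):
        """Fold one newly arrived stream element into the structure."""
        raise NotImplementedError
    def compute_answer(self):
        """Produce new or updated results to the query."""
        raise NotImplementedError

class RunningAverage(StreamSummary):
    # Trivial instance in which both operations are fast, the ideal case.
    def __init__(self):
        self.count, self.total = 0, 0.0
    def update(self, element):
        self.count += 1
        self.total += element
    def compute_answer(self):
        return self.total / self.count if self.count else None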

Batch Processing

  • The first scenario is that the update operation is fast but the computeAnswer operation is slow.
  • The query answer may be considered approximate in the sense that it is not timely, i.e., it represents the exact answer at a point in the recent past rather than the exact answer at the present moment.
  • This approach of approximation through batch processing is attractive because it does not cause any uncertainty about the accuracy of the answer, sacrificing timeliness instead.
  • An algorithm that cannot keep up with the peak data stream rate may be able to handle the average stream rate quite comfortably by buffering the streams when their rate is high and catching up during the slow periods.
  • This is the approach used in the XJoin algorithm [88]; a minimal batching loop in this spirit is sketched below.
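
A minimal sketch of this scenario, reusing the StreamSummary interface above (ours, not the XJoin code): the cheap update runs on every arrival, while the expensive computeAnswer runs only once per batch, so the answer lags the stream but remains exact for a recent prefix.

def batch_process(arrivals, summary, every=1000):
    """Apply a fast update per element; emit an answer once per `every` elements."""
    for i, element in enumerate(arrivals, 1):
        summary.update(element)                 # fast: keeps up with peak rate
        if i % every == 0:
            yield summary.compute_answer()      # slow: amortized over the batch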

Sampling

  • In the second scenario, computeAnswer may be fast, but the update operation is slow: it takes longer than the average inter-arrival time of the data elements.
  • In this case it is infeasible to process every tuple; instead, some tuples must be skipped altogether, so that the query is evaluated over a sample of the data stream rather than over the entire data stream.
  • The authors obtain an approximate answer, but in some cases one can give confidence bounds on the degree of error introduced by the sampling process [48] .
  • Unfortunately, for many situations (including most queries involving joins [20, 22] ), sampling-based approaches cannot give reliable approximation guarantees.
  • Designing sampling-based algorithms that can produce approximate answers that are provably close to the exact answer is an important and active area of research.

Synopsis Data Structures

  • Quite obviously, data structures where both the update and the computeAnswer operations are fast are most desirable.
  • For classes of data stream queries where no exact data structure with the desired properties exists, one can often design an approximate data structure that maintains a small synopsis or sketch of the data rather than an exact representation, and therefore is able to keep computation per data element to a minimum.
  • Performing data reduction through synopsis data structures as an alternative to batch processing or sampling is a fruitful research area with particular relevance to the data stream computation model.
  • Synopsis data structures are discussed in more detail in Section 6.

4.5 Blocking Operators

  • A blocking query operator is a query operator that is unable to produce the first tuple of its output until it has seen its entire input.
  • When the answer is larger, however, such as when the query answer is a relation that is to be produced in sorted order, it is more practical to maintain a data structure with the up-to-date answer, since continually retransmitting the entire answer would be cumbersome.
  • Tucker et al. [86] have proposed a different approach to blocking operators.
  • Upon seeing a punctuation asserting, for example, that all future tuples will have daynumber at least 10, an aggregation operator that was grouping by daynumber could stream out its answers for all daynumbers less than 10 (a small sketch of this pattern follows this list).
  • Closely related is the idea of schema-level assertions on data streams, which also may help with blocking operators and other aspects of data stream processing.
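
A small Python sketch of punctuation-aware aggregation (the message format is a hypothetical illustration, not the notation of Tucker et al.):

from collections import defaultdict

def sum_by_day(stream):
    """stream yields ('tuple', daynumber, value) or ('punct', d), where a
    punctuation ('punct', d) asserts that no future tuple has daynumber < d."""
    pending = defaultdict(float)
    for msg in stream:
        if msg[0] == 'tuple':
            _, day, value = msg
            pending[day] += value
        else:                                  # punctuation unblocks closed groups
            _, d = msg
            for day in sorted(k for k in pending if k < d):
                yield (day, pending.pop(day))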

4.6 Queries Referencing Past Data

  • In the data stream model of computation, once a data element has been streamed by, it cannot be revisited.
  • In the data stream model of computation, if the appropriate summary structure is not present, then no further recourse is available.
  • Ad hoc queries also raise the issue of adaptivity in query plans.
  • Extending this idea to adapt the joint plan for a set of continuous queries as new queries are added and old ones are removed remains an open research area.

5 Proposal for a DSMS

  • At Stanford the authors have begun the design and prototype implementation of a comprehensive DSMS called STREAM (for STanford StREam DatA Manager) [82] .
  • As discussed in earlier sections, in a DSMS traditional one-time queries are replaced or augmented with continuous queries, and techniques such as sliding windows, synopsis structures, approximate answers, and adaptive query processing become fundamental features of the system.
  • Other aspects of a complete DBMS also need to be reconsidered, including query languages, storage and buffer management, user and application interfaces, and transaction support.
  • In this section the authors will focus primarily on the query language and query processing components of a DSMS and only touch upon other issues based on their initial experiences.

5.1 Query Language for a DSMS

  • Any general-purpose data management system must have a flexible and intuitive method by which the users of the system can express their queries.
  • It is also a declarative language that gives the system flexibility in selecting the optimal evaluation procedure to produce the desired answer.
  • This interface is intuitive and gives the user more control over the exact series of steps by which the query answer is obtained than is provided by a declarative query language.
  • As in SQL-99, physical windows are specified using the ROWS keyword (e.g., ROWS 50 PRECEDING), while logical windows are specified via the RANGE keyword (e.g., RANGE 15 MINUTES PRECEDING).
  • The timestamp attribute is the ordering attribute for the records.

5.2 Timestamps in Streams

  • In the previous section, sliding windows are defined with respect to a timestamp or sequence number attribute representing a tuple's arrival time.
  • Timestamp issues also arise when a set of distributed streams makes up a single logical stream, as in the web monitoring application described in Section 2.2, or in truly distributed streams such as sensor networks, where comparing timestamps across stream elements may be relevant.
  • In the previous section the authors introduced implicit timestamps, in which the system adds a special field to each incoming tuple, and explicit timestamps, in which a data attribute is designated as the timestamp.
  • Implicit timestamps are used when the data source does not already include timestamp information, or when the exact moment in time associated with a tuple is not important, but general considerations such as "recent" or "old" may be important.
  • The clause ROWS 10 PRECEDING specifies a window consisting of the previous 10 tuples, strictly sorted by timestamp order.

5.3 Query Processing Architecture of a DSMS

  • The authors describe the query processing architecture of their DSMS.
  • One example operator is a sliding window join, which maintains a sliding window synopsis for each join input (Section 4.3).
  • During execution, an operator reads data from its input queues, updates the synopsis structures that it maintains, and writes results to its output queues (this read-update-emit loop is sketched after this list).
  • To handle these fluctuations, all of their operators are adaptive.
  • The scheduler needs to provide rate synchronization within operators (such as stream joins) and also across pipelined operators in query plans [8, 89] .
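
A schematic Python event loop for one such operator (our sketch; the process() method and the bounded per-queue batch are assumed conventions, not the STREAM implementation):

import queue

def run_operator(op, in_queues, out_queue, batch=64):
    """Read from input queues, let the operator update its synopses, emit results."""
    while True:
        for idx, q in enumerate(in_queues):
            for _ in range(batch):             # bounded work per queue: a crude form
                try:                           # of rate synchronization across inputs
                    t = q.get_nowait()
                except queue.Empty:
                    break
                for result in op.process(idx, t):   # updates synopses internally
                    out_queue.put(result)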

6 Algorithmic Issues

  • The algorithms community has been fairly active of late in the area of data streams, typically motivated by problems in databases and networking.
  • The main complexity measure is the space used by the algorithm, although the time required to process each stream element is also relevant.
  • In some cases, the algorithm maintains a data structure which can be used to compute the value of the function on demand, and then the time required to process each such query also becomes of interest.
  • For most interesting problems it is easy to prove a space lower bound that precludes this possibility, thereby forcing us to settle for bounds that are merely sublinear in the size of the input.
  • Most of these summary structures have been considered for traditional databases [13] .

6.1 Random Samples

  • In fact, the join synopsis in the AQUA system [2] is nothing but a uniform sample of the base relation.
  • Recently stratified sampling has been proposed as an alternative to uniform sampling to reduce error due to the variance in data and also to reduce error for group-by queries [1, 19] .
  • The reservoir sampling algorithm of Vitter [90] makes one pass over the data set and is well suited to the data stream model (sketched below).
  • There is also an extension by Chaudhuri, Motwani and Narasayya [22] to the case of weighted sampling.
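
A compact version of Vitter's Algorithm R (a standard formulation, lightly adapted): one pass, O(k) memory, and every element of the stream ends up in the sample with equal probability k/n.

import random

def reservoir_sample(stream, k):
    sample = []
    for n, item in enumerate(stream, 1):
        if n <= k:
            sample.append(item)                # fill the reservoir first
        else:
            j = random.randrange(n)            # uniform in [0, n)
            if j < k:                          # replace with probability k/n
                sample[j] = item
    return sample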

6.2 Sketching Techniques

  • In their seminal paper, Alon, Matias and Szegedy [5] introduced the notion of randomized sketching which has been widely used ever since.
  • Sketching involves building a summary of a data stream using a small amount of memory, from which it is possible to estimate the answer to certain queries (typically, "distance" queries) over the data set; a toy version of the AMS estimator appears after this list.
  • It remains an open problem to come up with techniques to maintain correlated aggregates [37] that have provable guarantees.
  • Consider the unary representation of the vector.
  • Feigenbaum et al. [33] showed how to construct such a family of range-summable {+1, -1}-valued hash functions with limited (four-way) independence.
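
A toy version of the AMS estimator for the second frequency moment F2 (our simplification: fully random +/-1 signs drawn lazily per item, rather than the four-wise independent hash families used in the literature):

import random
from statistics import median

def ams_f2_estimate(stream, num_means=16, num_medians=5, seed=0):
    rng = random.Random(seed)
    counters = [[0] * num_means for _ in range(num_medians)]
    signs = [dict() for _ in range(num_medians * num_means)]

    def sign(table, x):                        # lazily drawn +/-1 sign per item
        if x not in table:
            table[x] = rng.choice((-1, 1))
        return table[x]

    for x in stream:
        for r in range(num_medians):
            for c in range(num_means):
                counters[r][c] += sign(signs[r * num_means + c], x)

    # averaging squares within a row reduces variance; the median across
    # rows boosts the confidence of the estimate
    return median(sum(z * z for z in row) / num_means for row in counters)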

6.3 Histograms

  • They have been employed for a multitude of tasks such as query size estimation, approximate query answering, and data mining.
  • The authors consider the summarization of data streams using histograms.
  • There are several different types of histograms that have been proposed in the literature.
  • Equi-depth histograms partition the domain into buckets such that the number of data values falling into each bucket is uniform across all buckets.
  • End-biased histograms maintain exact counts of items that occur with frequency above a threshold, and approximate the other counts by a uniform distribution (a small offline example follows this list).
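
For concreteness, an offline equi-depth construction in Python (our sketch; maintaining such boundaries incrementally over a stream is the hard part):

def equi_depth_boundaries(values, num_buckets):
    """Choose boundaries so each bucket holds ~len(values)/num_buckets values."""
    ordered = sorted(values)
    per_bucket = len(ordered) / num_buckets
    return [ordered[min(int(round(i * per_bucket)), len(ordered) - 1)]
            for i in range(1, num_buckets)]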

V-Optimal Histograms over Data Streams

  • Jagadish et al. [54] showed how to compute optimal V-optimal histograms for a given data set using dynamic programming (the dynamic program is sketched after this list).
  • Guha, Koudas and Shim [43] adapted this algorithm to sorted data streams.
  • A robust approximation is built by repeatedly adding a dyadic interval of constant value which best reduces the approximation error.
  • This translates to efficiently computing the range sum of p-stable random variables (used for computing the sketch of Indyk [50]) over the dyadic interval.
  • Gilbert et al. [39] show how to construct such efficiently range-summable p-stable random variables.
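
The offline dynamic program of Jagadish et al. [54], sketched in Python (interface and variable names are ours): dp[j][k] is the minimum total sum of squared error when the first j values are covered by k buckets, each approximated by its mean.

def v_optimal_error(data, num_buckets):
    n = len(data)
    prefix, prefix_sq = [0.0] * (n + 1), [0.0] * (n + 1)
    for i, v in enumerate(data):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i, j):                     # squared error of one bucket over data[i:j]
        s, s2, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
        return s2 - s * s / m

    INF = float('inf')
    dp = [[INF] * (num_buckets + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for k in range(1, min(j, num_buckets) + 1):
            dp[j][k] = min(dp[i][k - 1] + sse(i, j) for i in range(k - 1, j))
    return dp[n][num_buckets]

The quadratic time and linear-in-n space of this program are what make it unsuitable for streams, motivating the sketch-based approximations above.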

6.4 Wavelets

  • Wavelet coefficients are projections of the given signal (the set of data values) onto an orthogonal set of basis vectors.
  • The choice of basis vectors determines the type of wavelets.
  • Wavelet coefficients have the desirable property that the signal reconstructed from the top few wavelet coefficients best approximates the original signal in terms of the L2 norm.
  • Recent papers have demonstrated the efficacy of wavelets for different tasks such as selectivity estimation [63] , data cube approximation [93] and computing multi-dimensional aggregates [92] .
  • To extend this body of work to data streams, it becomes important to devise techniques for computing wavelets in the streaming model; the classical offline Haar decomposition below shows the target computation.
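
For reference, an orthonormal Haar decomposition with top-k thresholding in Python (our sketch; the input length is assumed to be a power of two):

import heapq
import math

def haar_decompose(signal):
    """Orthonormal 1-D Haar transform; dropping small coefficients then
    inverting minimizes the L2 reconstruction error."""
    coeffs, s = [], list(signal)
    while len(s) > 1:
        half = len(s) // 2
        avgs  = [(s[2*i] + s[2*i+1]) / math.sqrt(2) for i in range(half)]
        diffs = [(s[2*i] - s[2*i+1]) / math.sqrt(2) for i in range(half)]
        coeffs.extend(diffs)
        s = avgs
    coeffs.extend(s)                    # the single remaining scaled average
    return coeffs

def keep_top_k(coeffs, k):
    keep = set(heapq.nlargest(k, range(len(coeffs)), key=lambda i: abs(coeffs[i])))
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]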

6.5 Sliding Windows

  • As discussed in Section 4, sliding windows prevent stale data from influencing analysis and statistics, and also serve as a tool for approximation in face of bounded memory.
  • There has been very little work on extending summarization techniques to sliding windows and it remains a ripe research area.
  • The authors briefly describe some of the recent work.
  • Datar et al. [26] showed how to maintain simple statistics over sliding windows, including the sketches used for computing the L1 and L2 norms; a simplified version of their bucket-based counting scheme is sketched after this list.
  • Some open problems for sliding windows are: clustering, maintaining top wavelet coefficients, maintaining statistics like variance, and computing correlated aggregates [37] .
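
A simplified sketch, in the spirit of the exponential histograms of Datar et al. [26], for approximately counting the 1s among the last `window` arrivals (our constants and bookkeeping are simplified relative to the paper):

from collections import deque

class ApproxWindowCounter:
    def __init__(self, window, max_same=2):
        self.window, self.max_same = window, max_same
        self.buckets = deque()   # (time of newest 1 in bucket, bucket size), oldest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        while self.buckets and self.buckets[0][0] <= self.t - self.window:
            self.buckets.popleft()      # bucket has slid entirely out of the window
        if bit != 1:
            return
        self.buckets.append((self.t, 1))
        size = 1                        # merge cascades: bound the buckets per size
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_same:
                break
            i, j = same[0], same[1]     # merge the two oldest buckets of this size
            bs = list(self.buckets)
            bs[j] = (bs[j][0], 2 * size)
            del bs[i]
            self.buckets = deque(bs)
            size *= 2

    def estimate(self):
        sizes = [s for _, s in self.buckets]
        return sum(sizes) - sizes[0] // 2 if sizes else 0

Only the oldest bucket may straddle the window boundary, so halving its contribution bounds the relative error by roughly 1/max_same.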

6.6 Negative Results

  • Henzinger, Raghavan, and Rajagopalan [49] provided space lower bounds for concrete problems in the data stream model.
  • This serves as a reminder that while it may be possible to prove strong space lower bounds for stream computations, considerations from applications sometimes enable us to circumvent the negative results.
  • Saks and Sun [73] provide space lower bounds for distance approximation between two vectors under the Lp norm, for p > 2, in the data stream model.
  • Space lower bounds for maintaining simple statistics like count, sum, min/max, and number of distinct values under the sliding windows model can be found in the work of Datar et al. [26] .
  • It is useful for deriving space lower bounds for data stream algorithms that resort to oblivious sampling.

Data Mining

  • Decision trees are another form of synopsis used for prediction.
  • Clustering is yet another way to summarize data.
  • Consider the k-median formulation for clustering: given n data points in a metric space, the objective is to choose k representative points such that the sum of the errors over the n data points is minimized.
  • The "error" for each data point is the distance from that point to the nearest of the chosen representative points.
  • When enough weighted cluster centers have accumulated from clustering different subsets of the data, the authors cluster those centers into higher-level cluster centers, and so on; this divide-and-conquer scheme is sketched below.
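
A schematic of this divide-and-conquer scheme in Python (our sketch; base_cluster(weighted_points, k) stands for any in-memory k-median routine returning k (center, weight) pairs and is assumed, not implemented):

def cluster_stream(chunks, k, base_cluster, level_capacity=1000):
    levels = [[]]   # levels[i]: weighted centers i merge-rounds removed from raw data
    for chunk in chunks:
        levels[0].extend(base_cluster([(p, 1) for p in chunk], k))
        i = 0
        while len(levels[i]) > level_capacity:
            merged = base_cluster(levels[i], k)   # re-cluster centers into higher-level centers
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(merged)
            i += 1
    return base_cluster([c for lvl in levels for c in lvl], k)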

Multiple Streams

  • Gibbons and Tirthapura [38] considered the problem of computing simple functions, such as the number of distinct elements, over unions of data streams.
  • Assuming positive answers to the "meta-questions" above, the authors see several fundamental aspects to the design of data stream systems, some of which they discussed in detail in this paper.
  • One important general question is the interface provided by the DSMS.
  • Other fundamental issues discussed in the paper include timestamping and ordering, support for sliding window queries, and dealing effectively with blocking operators.
  • Another issue the authors touched on only briefly in Section 4.5 is that of constraints over streams, and how they can be exploited in query processing.


Models and Issues in Data Stream Systems
Brian Babcock Shivnath Babu Mayur Datar Rajeev Motwani Jennifer Widom
Department of Computer Science
Stanford University
Stanford, CA 94305
{babcock, shivnath, datar, rajeev, widom}@cs.stanford.edu
Abstract
In this overview paper we motivate the need for and research issues arising from a new model of
data processing. In this model, data does not take the form of persistent relations, but rather arrives in
multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to
data stream systems and current projects in the area, the paper explores topics in stream query languages,
new requirements and challenges in query processing, and algorithmic issues.
1 Introduction
Recently a new class of data-intensive applications has become widely recognized: applications in which
the data is modeled best not as persistent relations but rather as transient data streams. Examples of such
applications include financial applications, network monitoring, security, telecommunications data manage-
ment, web applications, manufacturing, sensor networks, and others. In the data stream model, individual
data items may be relational tuples, e.g., network measurements, call records, web page visits, sensor read-
ings, and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable
and unbounded streams appears to yield some fundamentally new research problems.
In all of the applications cited above, it is not feasible to simply load the arriving data into a tradi-
tional database management system (DBMS) and operate on it there. Traditional DBMS’s are not designed
for rapid and continuous loading of individual data items, and they do not directly support the continuous
queries [84] that are typical of data stream applications. Furthermore, it is recognized that both approxima-
tion [13] and adaptivity [8] are key ingredients in executing queries and performing other processing (e.g.,
data analysis and mining) over rapid data streams, while traditional DBMS’s focus largely on the opposite
goal of precise answers computed by stable query plans.
In this paper we consider fundamental models and issues in developing a general-purpose Data Stream
Management System (DSMS). We are developing such a system at Stanford [82], and we will touch on some
of our own work in this paper. However, we also attempt to provide a general overview of the area, along
with its related and current work. (Any glaring omissions are, naturally, our own fault.)
We begin in Section 2 by considering the data stream model and queries over streams. In this section we
take a simple view: streams are append-only relations with transient tuples, and queries are SQL operating
over these logical relations. In later sections we discuss several issues that complicate the model and query
language, such as ordering, timestamping, and sliding windows. Section 2 also presents some concrete
examples to ground our discussion.
In Section 3 we review recent projects geared specifically towards data stream processing, as well as
a plethora of past research in areas related to data streams: active databases, continuous queries, filtering
Work supported by NSF Grant IIS-0118173. Mayur Datar was also supported by a Microsoft Graduate Fellowship. Rajeev
Motwani received partial support from an Okawa Foundation Research Grant.

systems, view management, sequence databases, and others. Although much of this work clearly has ap-
plications to data stream processing, we hope to show in this paper that there are many new problems to
address in realizing a complete DSMS.
Section 4 delves more deeply into the area of query processing, uncovering a number of important issues,
including:
Queries that require an unbounded amount of memory to evaluate precisely, and approximate query
processing techniques to address this problem.
Sliding window query processing (i.e., considering “recent” portions of the streams only), both as
an approximation technique and as an option in the query language since many applications prefer
sliding-window queries.
Batch processing, sampling, and synopsis structures to handle situations where the flow rate of the
input streams may overwhelm the query processor.
The meaning and implementation of blocking operators (e.g., aggregation and sorting) in the presence
of unending streams.
Continuous queries that are registered when portions of the data streams have already "passed by," yet
the queries wish to reference stream history.
Section 5 then outlines some details of a query language and an architecture for a DSMS query processor
designed specifically to address the issues above.
In Section 6 we review algorithmic results in data stream processing. Our focus is primarily on sketching
techniques and building summary structures (synopses). We also touch upon sliding window computations,
present some negative results, and discuss a few additional algorithmic issues.
We conclude in Section 7 with some remarks on the evolution of this new field, and a summary of
directions for further work.
2 The Data Stream Model
In the data stream model, some or all of the input data that are to be operated on are not available for random
access from disk or memory, but rather arrive as one or more continuous data streams. Data streams differ
from the conventional stored relation model in several ways:
The data elements in the stream arrive online.
The system has no control over the order in which data elements arrive to be processed, either within
a data stream or across data streams.
Data streams are potentially unbounded in size.
Once an element from a data stream has been processed it is discarded or archived; it cannot be
retrieved easily unless it is explicitly stored in memory, which typically is small relative to the size of
the data streams.
Operating in the data stream model does not preclude the presence of some data in conventional stored
relations. Often, data stream queries may perform joins between data streams and stored relational data.
For the purposes of this paper, we will assume that if stored relations are used, their contents remain static.
Thus, we preclude any potential transaction-processing issues that might arise from the presence of updates
to stored relations that occur concurrently with data stream processing.

2.1 Queries
Queries over continuous data streams have much in common with queries in a traditional database manage-
ment system. However, there are two important distinctions peculiar to the data stream model. The first
distinction is between one-time queries and continuous queries [84]. One-time queries (a class that includes
traditional DBMS queries) are queries that are evaluated once over a point-in-time snapshot of the data set,
with the answer returned to the user. Continuous queries, on the other hand, are evaluated continuously as
data streams continue to arrive. Continuous queries are the more interesting class of data stream queries, and
it is to them that we will devote most of our attention. The answer to a continuous query is produced over
time, always reflecting the stream data seen so far. Continuous query answers may be stored and updated as
new data arrives, or they may be produced as data streams themselves. Sometimes one or the other mode
is preferred. For example, aggregation queries may involve frequent changes to answer tuples, dictating the
stored approach, while join queries are monotonic and may produce rapid, unbounded answers, dictating
the stream approach.
The second distinction is between predefined queries and ad hoc queries. A predefined query is one
that is supplied to the data stream management system before any relevant data has arrived. Predefined
queries are generally continuous queries, although scheduled one-time queries can also be predefined. Ad
hoc queries, on the other hand, are issued online after the data streams have already begun. Ad hoc queries
can be either one-time queries or continuous queries. Ad hoc queries complicate the design of a data stream
management system, both because they are not known in advance for the purposes of query optimization,
identification of common subexpressions across queries, etc., and more importantly because the correct
answer to an ad hoc query may require referencing data elements that have already arrived on the data
streams (and potentially have already been discarded). Ad hoc queries are discussed in more detail in
Section 4.6.
2.2 Motivating Examples
Examples motivating a data stream system can be found in many application domains including finance,
web applications, security, networking, and sensor monitoring.
Traderbot [85] is a web-based financial search engine that evaluates queries over real-time streaming
financial data such as stock tickers and news feeds. The Traderbot web site [85] gives some examples
of one-time and continuous queries that are commonly posed by its customers.
Modern security applications often apply sophisticated rules over network packet streams. For exam-
ple, iPolicy Networks [52] provides an integrated security platform providing services such as firewall
support and intrusion detection over multi-gigabit network packet streams. Such a platform needs to
perform complex stream processing including URL-filtering based on table lookups, and correlation
across multiple network traffic flows.
Large web sites monitor web logs (clickstreams) online to enable applications such as personaliza-
tion, performance monitoring, and load-balancing. Some web sites served by widely distributed web
servers (e.g., Yahoo [95]) may need to coordinate many distributed clickstream analyses, e.g., to track
heavily accessed web pages as part of their real-time performance monitoring.
There are several emerging applications in the area of sensor monitoring [16, 58] where a large number
of sensors are distributed in the physical world and generate streams of data that need to be combined,
monitored, and analyzed.

The application domain that we use for more detailed examples is network traffic management, which
involves monitoring network packet header information across a set of routers to obtain information on
traffic flow patterns. Based on a description of Babu and Widom [10], we delve into this example in some
detail to help illustrate that continuous queries arise naturally in real applications and that conventional
DBMS technology does not adequately support such queries.
Consider the network traffic management system of a large network, e.g., the backbone network of an
Internet Service Provider (ISP) [30]. Such systems monitor a variety of continuous data streams that may be
characterized as unpredictable and arriving at a high rate, including both packet traces and network perfor-
mance measurements. Typically, current traffic-management tools either rely on a special-purpose system
that performs online processing of simple hand-coded continuous queries, or they just log the traffic data and
perform periodic offline query processing. Conventional DBMS’s are deemed inadequate to provide the kind
of online continuous query processing that would be most beneficial in this domain. A data stream system
that could provide effective online processing of continuous queries over data streams would allow network
operators to install, modify, or remove appropriate monitoring queries to support efficient management of
the ISP’s network resources.
Consider the following concrete setting. Network packet traces are being collected from a number of
links in the network. The focus is on two specific links: a customer link, C, which connects the network of
a customer to the ISP’s network, and a backbone link, B, which connects two routers within the backbone
network of the ISP. Let
and
denote two streams of packet traces corresponding to these two links. We
assume, for simplicity, that the traces contain just the five fields of the packet header that are listed below.
src: IP address of packet sender.
dest: IP address of packet destination.
id: Identification number given by sender so that destination can uniquely identify each packet.
len: Length of the packet.
time: Time when packet header was recorded.
Consider first the continuous query Q1, which computes load on the link B averaged over one-minute
intervals, notifying the network operator when the load crosses a specified threshold t. The functions
getminute and notifyoperator have the natural interpretation.

Q1: SELECT notifyoperator(sum(len))
    FROM B
    GROUP BY getminute(time)
    HAVING sum(len) > t
While the functionality of such a query may possibly be achieved in a DBMS via the use of triggers, we
are likely to prefer the use of special techniques for performance reasons. For example, consider the case
where the link B has a very high throughput (e.g., if it were an optical link). In that case, we may choose to
compute an approximate answer to Q1 by employing random sampling on the stream B, a task outside the
reach of standard trigger mechanisms.
The second query Q2 isolates flows in the backbone link and determines the amount of traffic generated
by each flow. A flow is defined here as a sequence of packets grouped in time, and sent from a specific
source to a specific destination.


Q2: SELECT flowid, src, dest, sum(len) AS flowlen
    FROM (SELECT src, dest, len, time
          FROM B
          ORDER BY time)
    GROUP BY src, dest, getflowid(src, dest, time) AS flowid
Here getflowid is a user-defined function which takes the source IP address, the destination IP address,
and the timestamp of a packet, and returns the identifier of the flow to which the packet belongs. We assume
that the data in the view (or table expression) in the FROM clause is passed to the getflowid function in
the order defined by the ORDER BY clause.
Observe that handling Q2 over stream B is particularly challenging due to the presence of GROUP BY
and ORDER BY clauses, which lead to "blocking" operators in a query execution plan.
Consider now the task of determining the fraction of the backbone link’s traffic that can be attributed to
the customer network. This query, Q3, is an example of the kind of ad hoc continuous queries that may be
registered during periods of congestion to determine whether the customer network is the likely cause.

Q3: (SELECT count(*)
     FROM C, B
     WHERE C.src = B.src AND C.dest = B.dest
       AND C.id = B.id)
    ÷
    (SELECT count(*) FROM B)
Observe that Q3 joins streams C and B on their keys to obtain a count of the number of common packets.
Since joining two streams could potentially require unbounded intermediate storage (for example if there is
no bound on the delay between a packet showing up on the two links), the user may prefer to compute an
approximate answer. One approximation technique would be to maintain bounded-memory synopses of the
two streams (see Section 6); alternatively, one could exploit aspects of the application semantics to bound
the required storage (e.g., we may know that joining tuples are very likely to occur within a bounded time
window).
Our final example, Q4, is a continuous query for monitoring the source-destination pairs in the top 5
percent in terms of backbone traffic. For ease of exposition, we employ the WITH construct from SQL-
99 [87].

Q4: WITH Load AS
       (SELECT src, dest, sum(len) AS traffic
        FROM B
        GROUP BY src, dest)
    SELECT src, dest, traffic
    FROM Load AS L1
    WHERE (SELECT count(*)
           FROM Load AS L2
           WHERE L2.traffic > L1.traffic)
          < (SELECT 0.05 * count(*) FROM Load)
    ORDER BY traffic

Citations
More filters
01 Jan 2006
TL;DR: There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linoff [BL99].
Abstract: The book Knowledge Discovery in Databases, edited by Piatetsky-Shapiro and Frawley [PSF91], is an early collection of research papers on knowledge discovery from data. The book Advances in Knowledge Discovery and Data Mining, edited by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy [FPSSe96], is a collection of later research results on knowledge discovery and data mining. There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linoff [BL99], Building Data Mining Applications for CRM by Berson, Smith, and Thearling [BST99], Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank [WF05], Principles of Data Mining (Adaptive Computation and Machine Learning) by Hand, Mannila, and Smyth [HMS01], The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman [HTF01], Data Mining: Introductory and Advanced Topics by Dunham, and Data Mining: Multimedia, Soft Computing, and Bioinformatics by Mitra and Acharya [MA03]. There are also books containing collections of papers on particular aspects of knowledge discovery, such as Machine Learning and Data Mining: Methods and Applications edited by Michalski, Bratko, and Kubat [MBK98], and Relational Data Mining edited by Dzeroski and Lavrac [De01], as well as many tutorial notes on data mining in major database, data mining and machine learning conferences.

2,591 citations

Journal ArticleDOI
TL;DR: The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art and aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Abstract: Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.

2,374 citations


Cites background from "Models and issues in data stream sy..."

  • ...Two basic types of sliding windows are [Babcock et al. 2002] (i) sequence based, where the size of a window is characterized by the number of observations, and (ii) timestamp based, where the size of a window is defined by duration time....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors introduce a sublinear space data structure called the countmin sketch for summarizing data streams, which allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc.

1,939 citations

Proceedings ArticleDOI
13 Jun 2003
TL;DR: A new symbolic representation of time series is introduced that is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measuresdefined on the original series.
Abstract: The parallel explosions of interest in streaming data, and data mining of time series have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real valued. Many researchers have also considered transforming real valued time series into symbolic representations, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities, in addition to allowing formerly "batch-only" problems to be tackled by the streaming community. While many symbolic representations of time series have been introduced over the past decades, they all suffer from three fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. Finally, most of these symbolic approaches require one to have access to all the data, before creating the symbolic representation. This last feature explicitly thwarts efforts to use the representations with streaming algorithms. In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead. We will demonstrate the utility of our representation on the classic data mining tasks of clustering, classification, query by content and anomaly detection.

1,922 citations


Cites background from "Models and issues in data stream sy..."

  • ...The parallel explosions of interest in streaming data [4, 8, 10, 18], and data mining of time series [6, 7, 9, 20, 21, 24, 26, 34] have had surprisingly little intersection....

    [...]

Book ChapterDOI
09 Sep 2003
TL;DR: A fundamentally different philosophy for data stream clustering is discussed which is guided by application-centered requirements and uses the concepts of a pyramidal time frame in conjunction with a microclustering approach.
Abstract: The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is to divide the clustering process into an online component which periodically stores detailed summary statistics and an offline component which uses only this summary statistics. The offline component is utilized by the analyst who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a microclustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.

1,836 citations


Cites background from "Models and issues in data stream sy..."

  • ...The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters....

    [...]

References
More filters
Book
01 Jan 1995
TL;DR: This book introduces the basic concepts in the design and analysis of randomized algorithms and presents basic tools such as probability theory and probabilistic analysis that are frequently used in algorithmic applications.
Abstract: For many applications, a randomized algorithm is either the simplest or the fastest algorithm available, and sometimes both. This book introduces the basic concepts in the design and analysis of randomized algorithms. The first part of the text presents basic tools such as probability theory and probabilistic analysis that are frequently used in algorithmic applications. Algorithmic examples are also given to illustrate the use of each tool in a concrete setting. In the second part of the book, each chapter focuses on an important area to which randomized algorithms can be applied, providing a comprehensive and representative selection of the algorithms that might be used in each of these areas. Although written primarily as a text for advanced undergraduates and graduate students, this book should also prove invaluable as a reference for professionals and researchers.

4,412 citations

Proceedings ArticleDOI
01 Aug 2000
TL;DR: This paper describes and evaluates VFDT, an anytime system that builds decision trees using constant memory and constant time per example, and applies it to mining the continuous stream of Web access data from the whole University of Washington main campus.
Abstract: Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. This paper describes and evaluates VFDT, an anytime system that builds decision trees using constant memory and constant time per example. VFDT can incorporate tens of thousands of examples per second using off-the-shelf hardware. It uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a conventional learner. We study VFDT's properties and demonstrate its utility through an extensive set of experiments on synthetic data. We apply VFDT to mining the continuous stream of Web access data from the whole University of Washington main campus.

2,171 citations


"Models and issues in data stream sy..." refers background in this paper

  • ...[28, 29] have studied the problem of maintaining decision trees over datastreams....

    [...]

Book
01 Jan 1996
TL;DR: This chapter surveys the theory of two-party communication complexity and presents results regarding the following models of computation: • Finite automata • Turing machines • Decision trees • Ordered binary decision diagrams • VLSI chips • Networks of threshold gates.
Abstract: In this chapter we survey the theory of two-party communication complexity. This field of theoretical computer science aims at studying the following, seemingly very simple, scenario: There are two players Alice who holds an n-bit string x and Bob who holds an n-bit string y. Their goal is to communicate in order to compute the value of some boolean function f(x, y), while exchanging a number of bits which is as small as possible. In the first part of this survey we present, mainly by giving examples, some of the results (and techniques) developed as part of this theory. We put an emphasis on proving lower bounds on the amount of communication that must be exchanged in the above scenario for certain functions f . In the second part of this survey we will exemplify the wide applicability of the results proved in the first part to other areas of computer science. While it is obvious that there are many applications of the results to problems in which communication is involved (e.g., in distributed systems), we concentrate on applications in which communication does not appear explicitly in the statement of the problems. In particular, we present results regarding the following models of computation: • Finite automata • Turing machines • Decision trees • Ordered binary decision diagrams (OBDDs) • VLSI chips • Networks of threshold gates We provide references to many other issues and applications of communication complexity which are not discussed in this survey.

2,004 citations


"Models and issues in data stream sy..." refers background in this paper

  • ...are derived from results in communication complexity [56]....

    [...]

  • ...We feel that techniques based on communication complexity results [56] will prove useful in this context....

    [...]

Proceedings ArticleDOI
26 Aug 2001
TL;DR: An efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner is proposed, called CVFDT, which stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate.
Abstract: Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.

1,790 citations


"Models and issues in data stream sy..." refers background in this paper

  • ...[28, 29] have studied the problem of maintaining decision trees over datastreams....

    [...]

01 Jan 1999
TL;DR: XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer, and has been endorsed by the Director as a W3C Recommendation.
Abstract: XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer. Status of this document This document has been reviewed by W3C Members and other interested parties and has been endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited as a normative reference from other documents. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web. The list of known errors in this specification is available at http://www.w3.org/1999/11/REC-xpath-19991116-errata. Comments on this specification may be sent to www-xpath-comments@w3.org; archives of the comments are available. The English version of this specification is the only normative version. However, for translations of this document, see http://www.w3.org/Style/XSL/translations.html. A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR. This specification is joint work of the XSL Working Group and the XML Linking Working Group and so is part of the W3C Style activity and of the W3C XML activity.

1,785 citations

Frequently Asked Questions (13)
Q1. What are the future works in "Models and issues in data stream systems"?

Their approach at Stanford is to extend SQL to support stream-oriented primitives, providing a purely declarative interface as in traditional database systems, although the authors also allow direct submission of query plans. From a purely theoretical perspective, perhaps the most interesting open question is that of defining extensions of relational operators to handle data stream constructs, and to study the resulting "stream algebra" and other properties of these extensions. Another issue the authors touched on only briefly in Section 4.5 is that of constraints over streams, and how they can be exploited in query processing.

In this overview paper the authors motivate the need for and research issues arising from a new model of data processing. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues. 

The key contribution of Alon et al. [5] was a sketching technique to estimate the second frequency moment F2 using space only logarithmic in the input size, while providing arbitrarily small approximation factors.

The underlying source of data for their examples will be a stream of telephone call records, each with four attributes: customer id, type, minutes, and timestamp. 

Other aspects of a complete DBMS also need to be reconsidered, including query languages, storage and buffer management, user and application interfaces, and transaction support. 

When the sliding window serves mostly to increase query processing efficiency, then the best-effort approach, which allows wide latitude over the ordering of tuples, is usually acceptable. 

It is futile to attempt to make use of all the data when computing an answer, because data arrives faster than it can be processed. 

The idea behind list-efficient streaming algorithms is that instead of being presented one data item at a time, they are implicitly presented with a list of data items in a succinct form. 

In the case where the sliding window is large enough so that the entire contents of the window cannot be buffered in memory, there are also theoretical challenges in designing algorithms that can give approximate answers using only the available memory. 

A more ambitious approach to handling ad hoc queries that reference past data is to maintain summaries of data streams (in the form of general-purpose synopses or aggregates) that can be used to give approximate answers to future ad hoc queries. 

One technique for producing an approximate answer to a data stream query is to evaluate the query not over the entire past history of the data streams, but rather only over sliding windows of recent data from the streams. 

This translates to efficiently computing the range sum of p-stable random variables (used for computing the sketch of Indyk [50]) over the dyadic interval.

This work has led to some general techniques for data reduction and synopsis construction, including: sketches [5, 35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92].