Proceedings ArticleDOI

High-performance complex event processing over streams

TL;DR: This paper proposes a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications and describes a query plan-based approach to efficiently implementing this language.
Abstract: In this paper, we present the design, implementation, and evaluation of a system that executes complex event queries over real-time streams of RFID readings encoded as events. These complex event queries filter and correlate events to match specific patterns, and transform the relevant events into new composite events for the use of external monitoring applications. Stream-based execution of these queries enables time-critical actions to be taken in environments such as supply chain management, surveillance and facility management, healthcare, etc. We first propose a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications. We then describe a query plan-based approach to efficiently implementing this language. Our approach uses native operators to efficiently handle query-defined sequences, which are a key component of complex event processing, and pipeline such sequences to subsequent operators that are built by leveraging relational techniques. We also develop a large suite of optimization techniques to address challenges such as large sliding windows and intermediate result sizes. We demonstrate the effectiveness of our approach through a detailed performance analysis of our prototype implementation under a range of data and query workloads as well as through a comparison to a state-of-the-art stream processor.
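
To make the query model concrete, consider the shoplifting scenario discussed for this paper: an item is read at a shelf, never read at a checkout counter, and then read at a building exit. The sketch below is a minimal illustration of such a sequence-with-negation query over a time window; the event fields and function names are hypothetical, and the paper's actual engine compiles such queries into native NFA-based sequence operators rather than hand-written loops.

    # Minimal sketch of a SEQ(SHELF, !COUNTER, EXIT) pattern with a time
    # window. Hypothetical names; illustration only, not the paper's code.
    from dataclasses import dataclass

    @dataclass
    class Event:
        etype: str    # "SHELF", "COUNTER", or "EXIT"
        tag_id: str   # RFID tag identifier
        ts: float     # reading timestamp

    def detect_shoplifting(stream, window):
        """Yield (shelf, exit) pairs: a SHELF reading followed, with no
        intervening COUNTER reading for the same tag, by an EXIT reading
        within `window` time units."""
        partial = {}  # tag_id -> pending SHELF event (a partial match)
        for e in stream:
            if e.etype == "SHELF":
                partial[e.tag_id] = e
            elif e.etype == "COUNTER":
                partial.pop(e.tag_id, None)      # negation cancels the match
            elif e.etype == "EXIT":
                shelf = partial.pop(e.tag_id, None)
                if shelf is not None and e.ts - shelf.ts <= window:
                    yield (shelf, e)             # emit a composite event

    readings = [Event("SHELF", "t1", 0.0), Event("COUNTER", "t2", 1.0),
                Event("EXIT", "t1", 5.0)]
    for shelf, exit_ in detect_shoplifting(readings, window=12.0):
        print("alert:", shelf.tag_id)            # alert: t1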


Citations
Journal ArticleDOI
TL;DR: This paper presents a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics, and presents the prevalent Hadoop framework for addressing big data challenges.
Abstract: Recent technological advancements have led to a deluge of data from distinctive domains (e.g., health care and scientific sensors, user-generated data, Internet and financial companies, and supply chain systems) over the past two decades. The term big data was coined to capture the meaning of this emerging trend. In addition to its sheer volume, big data also exhibits other unique characteristics as compared with traditional data. For instance, big data is commonly unstructured and requires more real-time analysis. This development calls for new system architectures for data acquisition, transmission, storage, and large-scale data processing mechanisms. In this paper, we present a literature survey and system tutorial for big data analytics platforms, aiming to provide an overall picture for nonexpert readers and instill a do-it-yourself spirit for advanced audiences to customize their own big-data solutions. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of numerous approaches and mechanisms from research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data challenges. Finally, we outline several evaluation benchmarks and potential research directions for big data systems.

1,002 citations


Cites background from "High-performance complex event proc..."

  • ...A shoplifting example that uses high-level complex events is discussed in [254]....


Journal ArticleDOI
TL;DR: A general, unifying model is proposed to capture the different aspects of an IFP system and used to provide a complete and precise classification of the systems and mechanisms proposed so far.
Abstract: A large number of distributed applications require continuous and timely processing of information as it flows from the periphery to the center of the system. Examples include intrusion detection systems which analyze network traffic in real-time to identify possible attacks; environmental monitoring applications which process raw data coming from sensor networks to identify critical situations; or applications performing online analysis of stock prices to identify trends and forecast future values. Traditional DBMSs, which need to store and index data before processing it, can hardly fulfill the requirements of timeliness coming from such domains. Accordingly, during the last decade, different research communities developed a number of tools, which we collectively call information flow processing (IFP) systems, to support these scenarios. They differ in their system architecture, data model, rule model, and rule language. In this article, we survey these systems to help researchers, who often come from different backgrounds, in understanding how the various approaches they adopt may complement each other. In particular, we propose a general, unifying model to capture the different aspects of an IFP system and use it to provide a complete and precise classification of the systems and mechanisms proposed so far.

918 citations


Cites background from "High-performance complex event proc..."

  • ...Sase [Wu et al. 2006] is a monitoring system designed to perform complex queries over real-time flows of RFID readings....


Proceedings ArticleDOI
09 Jun 2008
TL;DR: This paper presents a formal evaluation model that offers precise semantics for this new class of queries and a query evaluation framework permitting optimizations in a principled way; it further analyzes the runtime complexity of query evaluation using this model and develops a suite of techniques that improve runtime efficiency by exploiting sharing in storage and processing.
Abstract: Pattern matching over event streams is increasingly being employed in many areas including financial services, RFID-based inventory management, click stream analysis, and electronic health systems. While regular expression matching is well studied, pattern matching over streams presents two new challenges: languages for pattern matching over streams are significantly richer than languages for regular expression matching, and efficient evaluation of these pattern queries over streams requires new algorithms and optimizations, since the conventional wisdom for stream query processing (i.e., using selection-join-aggregation) is inadequate. In this paper, we present a formal evaluation model that offers precise semantics for this new class of queries and a query evaluation framework permitting optimizations in a principled way. We further analyze the runtime complexity of query evaluation using this model and develop a suite of techniques that improve runtime efficiency by exploiting sharing in storage and processing. Our experimental results provide insights into the various factors affecting runtime performance and demonstrate the significant performance gains of our sharing techniques.
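
As a rough illustration of the "sharing in storage" idea mentioned in this abstract, partial matches can record integer offsets into one shared event buffer instead of copying events into every match. This is a hedged sketch under assumed structures (SharedBuffer and Run are hypothetical names), not the paper's actual data structures.

    # Sketch of storage sharing: events live once in a shared buffer and
    # partial matches ("runs") hold only offsets into it.
    class SharedBuffer:
        def __init__(self):
            self.events = []                 # single physical copy of events
        def append(self, event):
            self.events.append(event)
            return len(self.events) - 1      # offset for runs to record

    class Run:
        def __init__(self, offsets=()):
            self.offsets = list(offsets)     # indices into the shared buffer
        def extend(self, offset):
            return Run(self.offsets + [offset])   # copies small ints, not events
        def materialize(self, buf):
            return [buf.events[i] for i in self.offsets]

    buf = SharedBuffer()
    i = buf.append({"type": "A", "price": 10})
    j = buf.append({"type": "B", "price": 12})
    match = Run().extend(i).extend(j)
    print(match.materialize(buf))            # both events, stored only once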

441 citations

Proceedings Article
01 Jan 2007
TL;DR: This work describes the design and implementation of the Cornell Cayuga System for scalable event processing and presents a query language based on Cayuga Algebra for naturally expressing complex event patterns.
Abstract: We describe the design and implementation of the Cornell Cayuga System for scalable event processing. We present a query language based on Cayuga Algebra for naturally expressing complex event patterns. We also describe several novel system design and implementation issues, focusing on Cayuga’s query processor, its indexing approach, how Cayuga handles simultaneous events, and its specialized garbage collector.

393 citations


Cites background from "High-performance complex event proc..."

  • ...Complex event systems such as SNOOP [1], ODE [11] and SASE [18] are closest in spirit to our own work....


Journal ArticleDOI
TL;DR: This article presents a survey of optimizations for stream processing, organized in a style similar to catalogs of design patterns or refactorings, to help future streaming system builders stand on the shoulders of giants from not just their own community.
Abstract: Various research communities have independently arrived at stream processing as a programming model for efficient and parallel computing. These communities include digital signal processing, databases, operating systems, and complex event processing. Since each community faces applications with challenging performance requirements, each of them has developed some of the same optimizations, but often with conflicting terminology and unstated assumptions. This article presents a survey of optimizations for stream processing. It is aimed both at users who need to understand and guide the system’s optimizer and at implementers who need to make engineering tradeoffs. To consolidate terminology, this article is organized as a catalog, in a style similar to catalogs of design patterns or refactorings. To make assumptions explicit and help understand tradeoffs, each optimization is presented with its safety constraints (when does it preserve correctness?) and a profitability experiment (when does it improve performance?). We hope that this survey will help future streaming system builders to stand on the shoulders of giants from not just their own community.
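
One optimization from this catalog, operator fusion, is also what the excerpts below attribute to SASE. The hedged sketch contrasts an unfused pipeline, which materializes an intermediate list between the source and a filter, with a fused one that evaluates the filter inside the source loop; all function names are hypothetical.

    # Sketch of operator fusion: the fused version piggy-backs the filter
    # on the source loop, so no intermediate collection is built.
    def parse(reading):
        return reading                        # stand-in for decoding a raw reading

    def unfused(readings, pred):
        parsed = [parse(r) for r in readings] # intermediate result materialized
        return [e for e in parsed if pred(e)]

    def fused(readings, pred):
        for r in readings:                    # filter runs inside the source loop
            e = parse(r)
            if pred(e):
                yield e                       # streamed, no intermediate list

    even = lambda e: e % 2 == 0
    print(unfused(range(10), even))           # [0, 2, 4, 6, 8]
    print(list(fused(range(10), even)))       # same output, less intermediate state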

314 citations


Cites methods or result from "High-performance complex event proc..."

  • ...An example is SASE [Wu et al. 2006]....


  • ...A related approach is SASE, which can fuse certain operators with the source operator and then implement these operators by a different algorithm [Wu et al. 2006]....


  • ...For instance, when SASE fuses a source operator that reads input data with a downstream operator, it combines them such that the downstream operator is piggy-backed incrementally on the source operator, producing fewer intermediate results [Wu et al. 2006]....


  • ...For example, Galax uses nested-relational algebra for XML processing [Ré et al. 2006], and SASE uses a custom algebra for finding temporal patterns across sequences of data items [Wu et al. 2006]....


References
Proceedings Article
01 Jan 2003
TL;DR: The next generation Telegraph system, called TelegraphCQ, is focused on meeting the challenges that arise in handling large streams of continuous queries over high-volume, highly-variable data streams and leverages the PostgreSQL open source code base.
Abstract: Increasingly pervasive networks are leading towards a world where data is constantly in motion. In such a world, conventional techniques for query processing, which were developed under the assumption of a far more static and predictable computational environment, will not be sufficient. Instead, query processors based on adaptive dataflow will be necessary. The Telegraph project has developed a suite of novel technologies for continuously adaptive query processing. The next generation Telegraph system, called TelegraphCQ, is focused on meeting the challenges that arise in handling large streams of continuous queries over high-volume, highly-variable data streams. In this paper, we describe the system architecture and its underlying technology, and report on our ongoing implementation effort, which leverages the PostgreSQL open source code base. We also discuss open issues and our research agenda.

1,248 citations


"High-performance complex event proc..." refers background or methods in this paper

  • ...Languages for stream processing [2][7][19] lack constructs to address non-occurrences of events and become unwieldy for specifying complex order-oriented constraints....


  • ...We compare two algorithms: the Basic algorithm (presented in Section 3.2) that constructs event sequences from the runtime stack used by the NFA, and the AIS algorithm (presented in Section 4.1) that builds active instance stacks for sequence construction....

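
The excerpt above contrasts the Basic algorithm with active instance stacks (AIS). The following simplified sketch of the stack idea for SEQ(A, B, C) omits predicates and windows, and all names are hypothetical: each arriving instance records the current top of the previous type's stack, and sequences are enumerated by following those pointers instead of replaying the NFA's runtime stack.

    # Sketch of active instance stacks: one stack per event type; matches
    # are enumerated by a depth-first walk over the recorded pointers.
    def build_stacks(stream, types=("A", "B", "C")):
        stacks = {t: [] for t in types}
        prev = {types[i]: types[i - 1] for i in range(1, len(types))}
        for etype, payload in stream:
            if etype == types[0]:
                stacks[etype].append((payload, None))
            elif etype in prev and stacks[prev[etype]]:
                ptr = len(stacks[prev[etype]]) - 1   # most recent predecessor
                stacks[etype].append((payload, ptr))
        return stacks

    def matches(stacks, types=("A", "B", "C")):
        def expand(level, max_idx):
            for i in range(max_idx + 1):
                payload, ptr = stacks[types[level]][i]
                if level == 0:
                    yield (payload,)
                else:
                    for prefix in expand(level - 1, ptr):
                        yield prefix + (payload,)
        yield from expand(len(types) - 1, len(stacks[types[-1]]) - 1)

    stream = [("A", "a1"), ("B", "b1"), ("A", "a2"), ("C", "c1")]
    print(list(matches(build_stacks(stream))))   # [('a1', 'b1', 'c1')]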

Proceedings ArticleDOI
03 Jun 2002
TL;DR: This paper proposes a novel holistic twig join algorithm, TwigStack, that uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern.
Abstract: XML employs a tree-structured data model, and, naturally, XML queries specify patterns of selection predicates on multiple elements related by a tree structure. Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing. Prior work has typically decomposed the twig pattern into binary structural (parent-child and ancestor-descendant) relationships, and twig matching is achieved by: (i) using structural join algorithms to match the binary relationships against the XML database, and (ii) stitching together these basic matches. A limitation of this approach for matching twig patterns is that intermediate result sizes can get large, even when the input and output sizes are more manageable. In this paper, we propose a novel holistic twig join algorithm, TwigStack, for matching an XML query twig pattern. Our technique uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern. When the twig pattern uses only ancestor-descendant relationships between elements, TwigStack is I/O and CPU optimal among all sequential algorithms that read the entire input: it is linear in the sum of sizes of the input lists and the final result list, but independent of the sizes of intermediate results. We then show how to use (a modification of) B-trees, along with TwigStack, to match query twig patterns in sub-linear time. Finally, we complement our analysis with experimental results on a range of real and synthetic data, and query twig patterns.
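
The note below observes that SASE's active instance stacks resemble PathStacks in stack arrangement. As a rough illustration of the linked-stack idea for the simpler path case (a//b//c with only ancestor-descendant edges), this sketch models elements as (tag, start, end) intervals from a well-nested document; it is a simplification for intuition, not the paper's full TwigStack algorithm.

    # Sketch of PathStack-style matching for the path pattern a//b//c.
    # Each pattern node gets a stack; an entry points at the top of the
    # parent stack at push time, and matches are emitted on leaf pushes.
    def path_stack(elements, path):
        stacks = [[] for _ in path]
        out = []
        for tag, start, end in sorted(elements, key=lambda e: e[1]):
            for s in stacks:                     # pop elements that ended:
                while s and s[-1][0][2] < start: # they contain nothing to come
                    s.pop()
            for q, qtag in enumerate(path):
                if tag != qtag:
                    continue
                if q == 0:
                    stacks[0].append(((tag, start, end), None))
                elif stacks[q - 1]:              # viable only under an ancestor
                    stacks[q].append(((tag, start, end),
                                      len(stacks[q - 1]) - 1))
                    if q == len(path) - 1:       # leaf push: emit matches now
                        out.extend(expand(stacks, q, len(stacks[q]) - 1))
        return out

    def expand(stacks, level, idx):
        elem, ptr = stacks[level][idx]
        if level == 0:
            return [(elem,)]
        return [prefix + (elem,)
                for i in range(ptr + 1)
                for prefix in expand(stacks, level - 1, i)]

    doc = [("a", 1, 10), ("b", 2, 9), ("a", 3, 8), ("c", 4, 5)]
    print(path_stack(doc, ["a", "b", "c"]))  # [(('a',1,10), ('b',2,9), ('c',4,5))]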

1,014 citations


"High-performance complex event proc..." refers methods in this paper

  • ...Active Instance Stacks (AIS) in SASE appear similar to PathStacks for XML pattern matching [3] in stack arrangements....


Proceedings ArticleDOI
01 May 1999
TL;DR: It is proved that for predicates reducible to conjunctions of elementary tests, the expected time to match a random event is no greater than O(N^(1-λ)), where N is the number of subscriptions and λ is a closed-form expression that depends on the number and type of attributes (in some cases, λ = 1/2).
Abstract: Content-based subscription systems are an emerging alternative to traditional publish-subscribe systems, because they permit more flexible subscriptions along multiple dimensions. In these systems, each subscription is a predicate which may test arbitrary attributes within an event. However, the matching problem for content-based systems, determining for each event the subset of all subscriptions whose predicates match the event, is still an open problem. We present an efficient, scalable solution to the matching problem. Our solution has an expected time complexity that is sub-linear in the number of subscriptions, and it has a space complexity that is linear. Specifically, we prove that for predicates reducible to conjunctions of elementary tests, the expected time to match a random event is no greater than O(N^(1-λ)), where N is the number of subscriptions and λ is a closed-form expression that depends on the number and type of attributes (in some cases, λ = 1/2). We present some optimizations to our algorithms that improve the search time. We also present the results of simulations that validate the theoretical bounds and that show acceptable performance levels for tens of thousands of subscriptions.
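
To fix intuition for the matching problem itself, the sketch below uses a naive counting approach over equality predicates: index each elementary test, count how many of a subscription's tests an event satisfies, and report subscriptions whose whole conjunction is satisfied. This is an illustration only, with hypothetical names; the paper's actual algorithm organizes tests into a matching structure to achieve the sub-linear expected time quoted above.

    # Sketch of a counting-based matcher for conjunctions of equality tests.
    from collections import defaultdict

    class Matcher:
        def __init__(self):
            self.index = defaultdict(set)   # (attribute, value) -> subscription ids
            self.size = {}                  # subscription id -> number of predicates

        def subscribe(self, sub_id, predicates):
            # predicates: dict attribute -> required value (one conjunction)
            self.size[sub_id] = len(predicates)
            for attr, value in predicates.items():
                self.index[(attr, value)].add(sub_id)

        def match(self, event):
            # A subscription matches when all of its elementary tests hold.
            hits = defaultdict(int)
            for attr, value in event.items():
                for sub_id in self.index.get((attr, value), ()):
                    hits[sub_id] += 1
            return [s for s, n in hits.items() if n == self.size[s]]

    m = Matcher()
    m.subscribe("s1", {"type": "airplane", "price": 100})
    m.subscribe("s2", {"type": "car"})
    print(m.match({"type": "airplane", "price": 100, "seller": "x"}))  # ['s1']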

736 citations

Proceedings Article
01 Jan 2003
TL;DR: The architectural challenges facing the design of large-scale distributed stream processing systems are described, and novel approaches for addressing load management, high availability, and federated operation issues are discussed.
Abstract: Stream processing fits a large class of new applications for which conventional DBMSs fall short. Because many stream-oriented systems are inherently geographically distributed and because distribution offers scalable load management and higher availability, future stream processing systems will operate in a distributed fashion. They will run across the Internet on computers typically owned by multiple cooperating administrative domains. This paper describes the architectural challenges facing the design of large-scale distributed stream processing systems, and discusses novel approaches for addressing load management, high availability, and federated operation issues. We describe two stream processing systems, Aurora* and Medusa, which are being designed to explore complementary solutions to these challenges. We begin in Section 2 with a brief description of our centralized stream processing system, Aurora [4]. We then discuss two complementary efforts to extend Aurora to a distributed environment: Aurora* and Medusa. Aurora* assumes an environment in which all nodes fall under a single administrative domain. Medusa provides the infrastructure to support federated operation of nodes across administrative boundaries. After describing the architectures of these two systems in Section 3, we consider three design challenges common to both: infrastructures and protocols supporting communication amongst nodes (Section 4), load sharing in response to variable network conditions (Section 5), and high availability in the presence of failures (Section 6). We also discuss high-level policy specifications employed by the two systems in Section 7. For all of these issues, we believe that the push-based nature of stream-based applications not only raises new challenges but also offers the possibility of new domain-specific solutions.

624 citations


"High-performance complex event proc..." refers background or methods in this paper

  • ...been extensively studied in the field of stream processing [2][7][9][13], we expect to adopt many stream processing techniques in our system....


  • ...Stream processing systems in the relational setting [7][9][19][24] are not optimized for complex event processing, whereas event processing systems very recently developed [18][26][29] have not focused on fast implementations....


Proceedings ArticleDOI
01 May 2001
TL;DR: In this article, the authors describe an attempt at the construction of such algorithms and its implementation using a combination of data structures, application-specific caching policies, and application-specific query processing, which can handle 600 events per second for a typical workload containing 6 million subscriptions.
Abstract: Publish/Subscribe is the paradigm in which users express long-term interests (“subscriptions”) and some agent “publishes” events (e.g., offers). The job of Publish/Subscribe software is to send events to the owners of subscriptions satisfied by those events. For example, a user subscription may consist of an interest in an airplane of a certain type, not to exceed a certain price. A published event may consist of an offer of an airplane with certain properties including price. Each subscription consists of a conjunction of (attribute, comparison operator, value) predicates. A subscription closely resembles a trigger in that it is a long-lived conditional query associated with an action (usually, informing the subscriber). However, it is less general than a trigger, so novel data structures and implementations may enable the creation of more scalable, high-performance publish/subscribe systems. This paper describes an attempt at the construction of such algorithms and its implementation. Using a combination of data structures, application-specific caching policies, and application-specific query processing, our system can handle 600 events per second for a typical workload containing 6 million subscriptions.

587 citations