Journal ArticleDOI

High speed and robust event correlation

01 May 1996-IEEE Communications Magazine (IEEE)-Vol. 34, Iss: 5, pp 82-90
TL;DR: The authors describe a network management system and illustrate its application to managing a distributed database application on a complex enterprise network.
Abstract: The authors describe a network management system and illustrate its application to managing a distributed database application on a complex enterprise network.
Citations
Proceedings ArticleDOI
23 Jun 2002
TL;DR: This work presents a dynamic analysis methodology that automates problem determination in these environments by coarse-grained tagging of numerous real client requests as they travel through the system and using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault.
Abstract: Traditional problem determination techniques rely on static dependency models that are difficult to generate accurately in today's large, distributed, and dynamic application environments such as e-commerce systems. We present a dynamic analysis methodology that automates problem determination in these environments by 1) coarse-grained tagging of numerous real client requests as they travel through the system and 2) using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault. To validate our methodology, we have implemented Pinpoint, a framework for root cause analysis on the J2EE platform that requires no knowledge of the application components. Pinpoint consists of three parts: a communications layer that traces client requests, a failure detector that uses traffic-sniffing and middleware instrumentation, and a data analysis engine. We evaluate Pinpoint by injecting faults into various application components and show that Pinpoint identifies the faulty components with high accuracy and produces few false-positives.
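
Pinpoint's correlation step can be illustrated with a minimal Python sketch: rank components by the failure rate of the tagged requests that touched them. The names and the simple scoring rule below are hypothetical, not Pinpoint's actual data-mining implementation.

    # Minimal sketch: correlate request failures with the components
    # each tagged request touched (hypothetical names and scoring).
    from collections import defaultdict

    def rank_suspect_components(traces):
        """traces: list of (components_touched, failed) pairs, one per request."""
        seen = defaultdict(int)    # requests touching each component
        failed = defaultdict(int)  # failed requests touching each component
        for components, is_failed in traces:
            for c in components:
                seen[c] += 1
                if is_failed:
                    failed[c] += 1
        # Score each component by the failure rate of the requests that used it.
        return sorted(((failed[c] / seen[c], c) for c in seen), reverse=True)

    traces = [
        ({"web", "auth", "db"}, True),
        ({"web", "catalog"}, False),
        ({"web", "auth", "db"}, True),
        ({"web", "catalog", "db"}, False),
    ]
    for score, comp in rank_suspect_components(traces):
        print(f"{comp}: {score:.2f}")  # "auth" ranks first at 1.00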

910 citations


Cites background from "High speed and robust event correla..."

  • ...The number of software and hardware components in these systems increases as new functionalities are added and as components are replicated for performance and fault tolerance, often increasing the complexity of the system....

    [...]

Journal ArticleDOI
Klaus Julisch
TL;DR: A novel alarm-clustering method is proposed that supports the human analyst in identifying root causes and shows that the alarm load decreases quite substantially if the identified root causes are eliminated so that they can no longer trigger alarms in the future.
Abstract: It is a well-known problem that intrusion detection systems overload their human operators by triggering thousands of alarms per day. This paper presents a new approach for handling intrusion detection alarms more efficiently. Central to this approach is the notion that each alarm occurs for a reason, which is referred to as the alarm's root cause. This paper observes that a few dozen rather persistent root causes generally account for over 90% of the alarms that an intrusion detection system triggers. Therefore, we argue that alarms should be handled by identifying and removing the most predominant and persistent root causes. To make this paradigm practicable, we propose a novel alarm-clustering method that supports the human analyst in identifying root causes. We present experiments with real-world intrusion detection alarms to show how alarm clustering helped us identify root causes. Moreover, we show that the alarm load decreases quite substantially if the identified root causes are eliminated so that they can no longer trigger alarms in the future.
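
The clustering idea can be sketched in a few lines of Python: generalize alarm attributes along a hierarchy and surface the largest clusters as root-cause candidates. The generalization function and field names below are hypothetical; the paper's algorithm is considerably more elaborate.

    # Minimal sketch of alarm clustering by attribute generalization
    # (hypothetical hierarchy and fields).
    from collections import Counter

    def generalize_ip(ip):
        # Collapse an address to its /24 network; a crude stand-in
        # for a proper generalization hierarchy.
        return ".".join(ip.split(".")[:3]) + ".0/24"

    alarms = [
        {"src": "10.0.1.5", "sig": "portscan"},
        {"src": "10.0.1.9", "sig": "portscan"},
        {"src": "10.0.1.7", "sig": "portscan"},
        {"src": "192.168.3.2", "sig": "login-fail"},
    ]

    clusters = Counter((generalize_ip(a["src"]), a["sig"]) for a in alarms)
    # Large clusters point at a single persistent root cause.
    for key, size in clusters.most_common():
        print(size, key)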

481 citations


Cites background or methods from "High speed and robust event correla..."

  • ...Yet other systems implement root cause analysis by means of case-based reasoning [Lewis 1993] or codebooks [Yemini et al. 1996]....

    [...]

  • ...In network fault management [Bouloutas et al. 1994; Houck et al. 1995; Jakobson and Weissman 1993, 1995; Lewis 1993; Nygate 1995; Ohsie 1998; Yemini et al. 1996], alarms indicate problems in a network's operation, such as hardware or software failures, performance degradations, or misconfigurations....

    [...]

01 Jan 2002
TL;DR: Recovery Oriented Computing (ROC) takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved; by concentrating on reducing Mean Time to Repair rather than extending Mean Time to Failure, it offers higher availability.
Abstract: It is time to broaden our performance-dominated research agenda. A four-order-of-magnitude increase in performance since the first ASPLOS in 1982 means that few outside the CS&E research community believe that speed is the only problem of computer hardware and software. Current systems crash and freeze so frequently that people become violent. Fast but flaky should not be our 21st century legacy. Recovery Oriented Computing (ROC) takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved. By concentrating on Mean Time to Repair (MTTR) rather than Mean Time to Failure (MTTF), ROC reduces recovery time and thus offers higher availability. Since a large portion of system administration is dealing with failures, ROC may also reduce total cost of ownership. A one-to-two order of magnitude reduction in cost means that the purchase price of hardware and software is now a small part of the total cost of ownership. In addition to giving the motivation and definition of ROC, we introduce failure data for Internet sites that shows that the leading cause of outages is operator error. We also demonstrate five ROC techniques in five case studies, which we hope will influence designers of architectures and operating systems. If we embrace availability and maintainability, systems of the future may compete on recovery performance rather than just SPEC performance, and on total cost of ownership rather than just system price. Such a change may restore our pride in the architectures and operating systems we craft.
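
The MTTR argument can be made concrete with the standard availability formula A = MTTF / (MTTF + MTTR); this is general reliability arithmetic, not something specific to the ROC paper.

    # A tenfold reduction in repair time buys the same availability
    # as a tenfold increase in time to failure.
    def availability(mttf_hours, mttr_hours):
        return mttf_hours / (mttf_hours + mttr_hours)

    print(availability(1000, 10))   # ~0.9901  baseline
    print(availability(10000, 10))  # ~0.9990  10x MTTF
    print(availability(1000, 1))    # ~0.9990  10x faster repair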

470 citations


Cites methods from "High speed and robust event correla..."

  • ...One is to use models and dependency graphs to perform diagnosis [Choi99] [Gruschke98] [Katker97] [Lee00] [Yemini96]....

    [...]

Proceedings ArticleDOI
27 Aug 2007
TL;DR: An Inference Graph model is introduced that is well-adapted to user-perceptible problems rooted in conditions giving rise to both partial service degradation and hard faults; taking this multi-level structure into account leads to a 30% improvement in fault localization over two-level approaches.
Abstract: Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an Inference Graph model, which is well-adapted to user-perceptible problems rooted in conditions giving rise to both partial service degradation and hard faults. Further, we introduce the Sherlock system to discover Inference Graphs in the operational enterprise, infer critical attributes, and then leverage the result to automatically detect and localize problems. To illuminate strengths and limitations of the approach, we provide results from a prototype deployment in a large enterprise network, as well as from testbed emulations and simulations. In particular, we find that taking into account multi-level structure leads to a 30% improvement in fault localization, as compared to two-level approaches.
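
The localization step can be sketched against a toy dependency graph: score each candidate root cause by how many observed failures it explains minus how many healthy observations it contradicts. The topology and scoring below are hypothetical; Sherlock's actual Inference Graph is probabilistic and multi-level.

    # Minimal sketch of fault localization over a dependency graph.
    # Which observable services depend on each candidate root cause:
    depends_on = {
        "dns-server":  {"webmail", "portal"},
        "sql-backend": {"portal", "payroll"},
        "link-7":      {"payroll"},
    }

    observed_down = {"portal", "payroll"}
    observed_up = {"webmail"}

    def score(root):
        affected = depends_on[root]
        return len(affected & observed_down) - len(affected & observed_up)

    for root in sorted(depends_on, key=score, reverse=True):
        print(root, score(root))
    # "sql-backend" explains both failures without contradicting the
    # healthy webmail observation, so it ranks first.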

405 citations


Cites methods from "High speed and robust event correla..."

  • ...Today, enterprises use sophisticated commercial tools, such as EMC’s SMARTS [21], HP Openview [13], IBM Tivoli [19], or Microsoft Operations Manager [10]....

    [...]

Journal ArticleDOI
TL;DR: The challenges of fault localization in complex communication systems are discussed, and an overview of solutions proposed over the last ten years is presented together with their advantages and shortcomings.

397 citations


Cites background from "High speed and robust event correla..."

  • ...This paper discusses the challenges of fault localization in complex communication systems and presents an overview of solutions proposed in the course of the last ten years, while discussing their advantages and shortcomings....

    [...]

  • ...It is fair to say that despite this research effort, fault localization in complex communication systems remains an open research problem....

    [...]

  • ...In particular, they are unable to model situations in which failure of a device may depend on a logical combination of other device failures [41]....

    [...]

References
Journal ArticleDOI
TL;DR: The authors discuss the development of an alarm correlation model and a corresponding software support system that allow efficient specification of alarm correlation by the domain experts themselves and emphasis is placed on the end-user orientation of IMPACT, the intelligent management platform for alarm correlation tasks which implements the proposed model.
Abstract: The authors discuss the development of an alarm correlation model and a corresponding software support system that allow efficient specification of alarm correlation by the domain experts themselves. Emphasis is placed on the end-user orientation of IMPACT, the intelligent management platform for alarm correlation tasks which implements the proposed model. The desire was to lower the barrier between the network management application development process and the end user of the application, the network management personnel. IMPACT is a step towards this goal. The proposed alarm correlation model was used for three purposes: intelligent alarm filtering, alarm generalization and fault diagnosis.
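
A flavor of the end-user-specified correlation rules that IMPACT aims to make easy can be sketched as follows; the rule format and all names are hypothetical, not IMPACT's actual rule language.

    # Minimal sketch of declarative alarm-correlation rules
    # (hypothetical format and thresholds).
    rules = [
        # (rule name, predicate over a batch of alarms, conclusion)
        ("link-down-storm",
         lambda alarms: sum(a["type"] == "LINK_DOWN" for a in alarms) >= 3,
         "probable trunk failure: suppress individual LINK_DOWN alarms"),
    ]

    def correlate(alarms):
        return [concl for name, pred, concl in rules if pred(alarms)]

    batch = [{"type": "LINK_DOWN"}] * 4
    print(correlate(batch))  # -> ['probable trunk failure: ...']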

292 citations

Book ChapterDOI
S. Klinger, S. Yemini, Yechiam Yemini, D. Ohsie, Salvatore J. Stolfo
01 Jan 1995
TL;DR: Preliminary benchmarks of the SEMS demonstrate that the coding approach provides a speedup of at least two orders of magnitude over other published correlation systems, and scales well to very large domains involving thousands of problems.
Abstract: This paper describes a novel approach to event correlation in networks based on coding techniques. Observable symptom events are viewed as a code that identifies the problems that caused them; correlation is performed by decoding the set of observed symptoms. The coding approach has been implemented in the SMARTS Event Management System (SEMS), a server running under Sun Solaris 2.3. Preliminary benchmarks of the SEMS demonstrate that the coding approach provides a speedup of at least two orders of magnitude over other published correlation systems. In addition, it is resilient to high rates of symptom loss and false alarms. Finally, the coding approach scales well to very large domains involving thousands of problems.
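
Decoding can be pictured as nearest-codeword matching over symptom vectors, which is what makes the approach tolerant of lost and spurious symptoms. The codebook below is a hypothetical toy; SEMS derives and minimizes its codebook automatically from a causality model.

    # Each problem maps to the binary vector of symptoms it causes;
    # decoding picks the problem whose code is closest (in Hamming
    # distance) to the observed symptom vector.
    codebook = {
        "router-R1-down":  (1, 1, 0, 0),
        "link-L3-down":    (0, 1, 1, 0),
        "server-S9-crash": (0, 0, 1, 1),
    }

    def decode(observed):
        def hamming(a, b):
            return sum(x != y for x, y in zip(a, b))
        return min(codebook, key=lambda p: hamming(codebook[p], observed))

    # One spurious symptom (the last bit) does not derail decoding:
    print(decode((1, 1, 0, 1)))  # -> router-R1-down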

239 citations


"High speed and robust event correla..." refers methods in this paper

  • ...Our approach to correlation is based on coding techniques [6]....

    [...]

Journal ArticleDOI
TL;DR: A description is given of the Network Management Analysis and Testing Environment (NETMATE) project, the prime goal of which is to develop a unified and comprehensive software environment for network management to oversee and orchestrate the operations of diverse devices and protocols in large, heterogeneous computer networks.
Abstract: A description is given of the Network Management Analysis and Testing Environment (NETMATE) project, the prime goal of which is to develop a unified and comprehensive software environment for network management to oversee and orchestrate the operations of diverse devices and protocols in large, heterogeneous computer networks. The overall NETMATE architecture is discussed, and the network management functions performed by each component are described. The problem of network modeling and the NETMATE approach to it are presented. The current implementation status of NETMATE is given, and some conclusions are offered.

73 citations

Journal ArticleDOI
TL;DR: In this paper, the authors discuss design issues that they have encountered during an investigation of expert systems for network management that they believe to be generic to the real-time diagnosis of self-correcting networks.
Abstract: The authors discuss design issues that they have encountered during an investigation of expert systems for network management that they believe to be generic to the real-time diagnosis of self-correcting networks. By real-time they mean that the diagnostic system must keep pace with a dynamic process, that is, the flow of alarms from intelligent network elements. The objective is to present the operator with a set of recommended actions rather than large volumes of raw alarm data. They outline the general requirements of such a system and then suggest how each can be addressed using an expert-system approach.

47 citations

Book ChapterDOI
01 Jan 1994
TL;DR: This model enriches the traditional Entity-Relationship (E-R) model with object-oriented abstractions, constraints, and event-correlation information, and is thus particularly suitable for supporting complex event management systems.
Abstract: This paper presents a preliminary report on a semantic model of managed information, ERC. This model enriches the traditional Entity-Relationship (E-R) model with object-oriented abstractions, constraints, and event-correlation information. The model has much more expressive power than the CMIP management information model (which in turn is more expressive than SNMP's model), yet is simpler overall and more efficient to implement. It is thus particularly suitable for supporting complex event management systems. The paper presents the model, illustrates it through example applications, and briefly describes an efficient, protocol-independent implementation of the model used in a distributed management system.
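
One way to picture an E-R model enriched with correlation information is as entities whose relationships carry event-propagation annotations. The structure below is a hypothetical illustration, not the ERC specification.

    # Minimal sketch: entities, relationships, and event-propagation
    # annotations that drive correlation (hypothetical structure).
    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        name: str
        relations: dict = field(default_factory=dict)   # relation -> entities
        propagates: dict = field(default_factory=dict)  # event -> (relation, symptom)

    card = Entity("card-7", propagates={"card-failure": ("hosts", "port-down")})
    port = Entity("port-7.2")
    card.relations["hosts"] = [port]

    def symptoms_of(entity, event):
        rel, symptom = entity.propagates[event]
        return [(e.name, symptom) for e in entity.relations[rel]]

    print(symptoms_of(card, "card-failure"))  # [('port-7.2', 'port-down')]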

7 citations