Author

S. Kliger

Bio: S. Kliger is an academic researcher. The author has contributed to research on topics including Local area network and Enterprise private network. The author has an h-index of 1 and has co-authored 1 publication, which has received 404 citations.

Papers
Journal ArticleDOI
TL;DR: The authors describe a network management system and illustrate its application to managing a distributed database application on a complex enterprise network.
Abstract: The authors describe a network management system and illustrate its application to managing a distributed database application on a complex enterprise network.

404 citations


Cited by
Proceedings ArticleDOI
23 Jun 2002
TL;DR: Presents a dynamic analysis methodology that automates problem determination in large, dynamic application environments by coarse-grained tagging of numerous real client requests as they travel through the system, then using data mining techniques to correlate the believed failures and successes of these requests and determine which components are most likely to be at fault.
Abstract: Traditional problem determination techniques rely on static dependency models that are difficult to generate accurately in today's large, distributed, and dynamic application environments such as e-commerce systems. We present a dynamic analysis methodology that automates problem determination in these environments by 1) coarse-grained tagging of numerous real client requests as they travel through the system and 2) using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault. To validate our methodology, we have implemented Pinpoint, a framework for root cause analysis on the J2EE platform that requires no knowledge of the application components. Pinpoint consists of three parts: a communications layer that traces client requests, a failure detector that uses traffic-sniffing and middleware instrumentation, and a data analysis engine. We evaluate Pinpoint by injecting faults into various application components and show that Pinpoint identifies the faulty components with high accuracy and produces few false-positives.
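
The core of the approach can be sketched in a few lines. Below is a minimal illustration in Python (not the Pinpoint implementation; the component names and traces are invented, and a simple per-component failure rate stands in for the paper's data mining step): each traced request records the set of components it touched plus a believed success/failure flag, and components are ranked by how strongly their presence correlates with failure.

    # Minimal sketch of Pinpoint-style fault correlation. All names and
    # data are hypothetical; Pinpoint itself uses data mining/clustering
    # rather than this simple conditional failure rate.
    from collections import defaultdict

    def rank_suspects(traces):
        # traces: iterable of (components, failed) pairs, where components
        # is the set of component names one tagged request passed through
        # and failed is that request's believed-failure flag.
        stats = defaultdict(lambda: [0, 0])  # component -> [failures, appearances]
        for components, failed in traces:
            for c in components:
                stats[c][1] += 1
                stats[c][0] += int(failed)
        # Score each component by the failure rate of requests that touched it.
        return sorted(((f / n, c) for c, (f, n) in stats.items()), reverse=True)

    traces = [
        ({"web", "auth", "db"}, False),
        ({"web", "cart", "db"}, True),
        ({"web", "cart"}, True),
        ({"web", "auth"}, False),
    ]
    for score, component in rank_suspects(traces):
        print(f"{component}: failure correlation {score:.2f}")

On this toy input the "cart" component scores highest, mirroring how Pinpoint's correlation step surfaces the components most associated with failed requests.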

910 citations

Journal ArticleDOI
Klaus Julisch
TL;DR: A novel alarm-clustering method is proposed that supports the human analyst in identifying root causes and shows that the alarm load decreases quite substantially if the identified root causes are eliminated so that they can no longer trigger alarms in the future.
Abstract: It is a well-known problem that intrusion detection systems overload their human operators by triggering thousands of alarms per day. This paper presents a new approach for handling intrusion detection alarms more efficiently. Central to this approach is the notion that each alarm occurs for a reason, which is referred to as the alarm's root cause. This paper observes that a few dozen rather persistent root causes generally account for over 90% of the alarms that an intrusion detection system triggers. Therefore, we argue that alarms should be handled by identifying and removing the most predominant and persistent root causes. To make this paradigm practicable, we propose a novel alarm-clustering method that supports the human analyst in identifying root causes. We present experiments with real-world intrusion detection alarms to show how alarm clustering helped us identify root causes. Moreover, we show that the alarm load decreases quite substantially if the identified root causes are eliminated so that they can no longer trigger alarms in the future.
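
To make the clustering idea concrete, here is a deliberately simplified sketch in Python (the alarms, attributes, and single generalization step are invented; the paper's method uses attribute-oriented induction over generalization hierarchies rather than one fixed subnet cut): alarms are generalized until large groups emerge, and the largest groups are candidate persistent root causes.

    # Toy root-cause-oriented alarm grouping; a stand-in for the paper's
    # attribute-oriented induction, with hypothetical alarms.
    from collections import Counter

    def generalize(alarm):
        # One hand-picked generalization step: lift the source IP to its
        # /24 subnet so alarms sharing a root cause fall into one group.
        signature, src_ip = alarm
        subnet = ".".join(src_ip.split(".")[:3]) + ".0/24"
        return (signature, subnet)

    alarms = [
        ("PORTSCAN", "10.0.1.5"), ("PORTSCAN", "10.0.1.9"),
        ("PORTSCAN", "10.0.1.77"), ("SQLI", "192.168.3.4"),
    ]
    # The largest generalized groups are candidate persistent root causes.
    for group, count in Counter(map(generalize, alarms)).most_common():
        print(count, group)

Removing whatever lies behind the dominant ("PORTSCAN", "10.0.1.0/24") group, for example a misconfigured scanner on that subnet, is what the paper means by eliminating a root cause so it can no longer trigger alarms.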

481 citations

01 Jan 2002
TL;DR: Recovery Oriented Computing (ROC) takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved, and thus offers higher availability.
Abstract: It is time to broaden our performance-dominated research agenda. A four order of magnitude increase in performance since the first ASPLOS in 1982 means that few outside the CS&E research community believe that speed is the only problem of computer hardware and software. Current systems crash and freeze so frequently that people become violent. Fast but flaky should not be our 21st century legacy. Recovery Oriented Computing (ROC) takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved. By concentrating on Mean Time to Repair (MTTR) rather than Mean Time to Failure (MTTF), ROC reduces recovery time and thus offers higher availability. Since a large portion of system administration is dealing with failures, ROC may also reduce total cost of ownership. A one to two order of magnitude reduction in cost means that the purchase price of hardware and software is now a small part of the total cost of ownership. In addition to giving the motivation and definition of ROC, we introduce failure data for Internet sites showing that the leading cause of outages is operator error. We also demonstrate five ROC techniques in five case studies, which we hope will influence designers of architectures and operating systems. If we embrace availability and maintainability, systems of the future may compete on recovery performance rather than just SPEC performance, and on total cost of ownership rather than just system price. Such a change may restore our pride in the architectures and operating systems we craft.
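
The availability arithmetic behind the MTTR argument is worth making explicit; the figures below are illustrative only. With steady-state availability = MTTF / (MTTF + MTTR), a tenfold cut in repair time buys the same availability gain as a tenfold stretch in time to failure, which is the quantitative heart of the ROC position.

    # Back-of-the-envelope availability arithmetic (illustrative numbers).
    def availability(mttf_hours, mttr_hours):
        # Steady-state availability: fraction of time the system is up.
        return mttf_hours / (mttf_hours + mttr_hours)

    print(f"baseline            : {availability(1000, 10):.4%}")   # 99.0099%
    print(f"10x faster recovery : {availability(1000, 1):.4%}")    # 99.9001%
    print(f"10x longer uptime   : {availability(10000, 10):.4%}")  # 99.9001%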

470 citations

Proceedings ArticleDOI
27 Aug 2007
TL;DR: Introduces an Inference Graph model well suited to user-perceptible problems rooted in conditions that cause both partial service degradation and hard faults; taking multi-level structure into account yields a 30% improvement in fault localization over two-level approaches.
Abstract: Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an Inference Graph model, which is well-adapted to user-perceptible problems rooted in conditions giving rise to both partial service degradation and hard faults. Further, we introduce the Sherlock system to discover Inference Graphs in the operational enterprise, infer critical attributes, and then leverage the result to automatically detect and localize problems. To illuminate strengths and limitations of the approach, we provide results from a prototype deployment in a large enterprise network, as well as from testbed emulations and simulations. In particular, we find that taking into account multi-level structure leads to a 30% improvement in fault localization, as compared to two-level approaches.
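
As a toy illustration of the kind of reasoning such a graph supports, the Python sketch below scores candidate root-cause sets against observed symptoms with a noisy-OR model over a two-level slice (the components, edge probabilities, and observations are all invented; Sherlock's actual Inference Graphs are multi-level, with meta-nodes and approximate inference).

    # Toy noisy-OR root-cause ranking over a two-level dependency slice.
    # All structure and probabilities are hypothetical.
    from itertools import combinations

    # P(symptom appears | root cause is faulty)
    edges = {"server": {"web": 0.9, "dns": 0.1},
             "link":   {"web": 0.7, "dns": 0.8}}
    observed = {"web": True, "dns": False}  # True = failure symptom seen

    def likelihood(faulty):
        # P(observations | exactly this set of root causes is faulty),
        # combining causes per symptom with a noisy-OR.
        p = 1.0
        for symptom, seen in observed.items():
            p_ok = 1.0
            for cause in faulty:
                p_ok *= 1.0 - edges[cause][symptom]
            p_fail = 1.0 - p_ok
            p *= p_fail if seen else 1.0 - p_fail
        return p

    hypotheses = [set(c) for r in (1, 2) for c in combinations(edges, r)]
    for h in sorted(hypotheses, key=likelihood, reverse=True):
        print(sorted(h), f"{likelihood(h):.3f}")

Here the single-fault hypothesis {"server"} wins (0.810), since it explains the web failure while leaving dns healthy; real Inference Graphs extend this ranking across many levels of shared dependencies.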

405 citations

Journal ArticleDOI
TL;DR: The challenges of fault localization in complex communication systems are discussed, and an overview of solutions proposed over the last ten years is presented, along with their advantages and shortcomings.

397 citations