
Showing papers on "Dependability published in 1999"


Journal ArticleDOI
TL;DR: An infrastructure supporting two simultaneous processes in self-adaptive software is described: system evolution, the consistent application of change over time, and system adaptation, the cycle of detecting changing circumstances and planning and deploying responsive modifications.
Abstract: Self-adaptive software requires high dependability, robustness, adaptability, and availability. The article describes an infrastructure supporting two simultaneous processes in self-adaptive software: system evolution, the consistent application of change over time, and system adaptation, the cycle of detecting changing circumstances and planning and deploying responsive modifications.
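The evolution/adaptation split described above amounts to a monitoring-and-change control loop. The sketch below is only an illustration of that cycle under assumed names (monitor, planner, and deployer are hypothetical placeholders, not part of the infrastructure the article describes).

```python
# Hypothetical sketch of the adaptation cycle: detect changed circumstances,
# plan a responsive modification, then deploy it. All names are illustrative.
import time

def adaptation_loop(monitor, planner, deployer, poll_interval_s=5.0):
    """Run the detect -> plan -> deploy cycle of a self-adaptive system."""
    while True:
        observation = monitor()              # detect changing circumstances
        if observation.requires_change:
            plan = planner(observation)      # plan a responsive modification
            deployer(plan)                   # apply the change consistently
        time.sleep(poll_interval_s)
```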

1,080 citations


Book ChapterDOI
20 Sep 1999
TL;DR: The automatic train operating system for METEOR, the first driverless metro in the city of Paris, is designed to manage the traffic of vehicles controlled automatically or manually; its safety-critical software was developed using the B formal method together with the Vital Coded Processor.
Abstract: The automatic train operating system for METEOR, the first driverless metro in the city of Paris, is designed to manage the traffic of the vehicles controlled automatically or manually. This system, developed by Matra Transport International for the RATP, requires a very high level of dependability and safety for the users and the operator. To achieve this, the safety critical software located in the different control units (ground, line and on-board) was developed using the B formal method together with the Vital Coded Processor. This architecture thus ensures an optimum level of safety agreed with the customer. This experience with the METEOR project has convinced Matra Transport International of the advantages of using this B formal method for large-scale industrial developments.

304 citations


Journal ArticleDOI
TL;DR: Chameleon, as described in this paper, is an adaptive infrastructure that allows different levels of availability requirements to be simultaneously supported in a networked environment, through the use of special ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability).
Abstract: This paper presents Chameleon, an adaptive infrastructure which allows different levels of availability requirements to be simultaneously supported in a networked environment. Chameleon provides dependability through the use of special ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability) that control all operations in the Chameleon environment. Three broad classes of ARMORs are defined: 1) Managers oversee other ARMORs and recover from failures in their subordinates. 2) Daemons provide communication gateways to the ARMORs at the host node. They also make available a host's resources to the Chameleon environment. 3) Common ARMORs implement specific techniques for providing application-required dependability. Employing ARMORs, Chameleon makes available different fault-tolerant configurations and maintains run-time adaptation to changes in the availability requirements of an application. The flexible ARMOR architecture allows their composition to be reconfigured at run-time, i.e., the ARMORs may dynamically adapt to changing application requirements. In this paper, we describe the ARMOR architecture, including the ARMOR class hierarchy, basic building blocks, ARMOR composition, and the use of ARMOR factories. We present how ARMORs can be reconfigured and reengineered and demonstrate how the architecture serves our objective of providing an adaptive software infrastructure. To our knowledge, Chameleon is one of the few real implementations which enables multiple fault tolerance strategies to exist in the same environment and supports fault-tolerant execution of substantially off-the-shelf applications via a software infrastructure only. Chameleon provides fault tolerance from the application's point of view as well as from the software infrastructure's point of view. To demonstrate the Chameleon capabilities, we have implemented a prototype infrastructure which provides a set of ARMORs to initialize the environment and to support the dual and TMR application execution modes. Through this testbed environment, we measure the execution overhead and recovery times from failures in the user application, the Chameleon ARMORs, the hardware, and the operating system.
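The three broad ARMOR classes can be pictured as a small class hierarchy. The sketch below is a hypothetical illustration of that structure, not Chameleon's actual code; all method and attribute names are invented.

```python
# Hypothetical sketch of the three broad ARMOR classes described in the paper.
class Armor:
    """Base class: Adaptive, Reconfigurable, and Mobile Object for Reliability."""
    def handle_event(self, event):
        raise NotImplementedError

class ManagerArmor(Armor):
    """Oversees subordinate ARMORs and recovers from their failures."""
    def __init__(self, subordinates):
        self.subordinates = subordinates
    def handle_event(self, event):
        if event.kind == "subordinate_failure":
            self.recover(event.source)
    def recover(self, failed_armor):
        ...  # restart or relocate the failed subordinate

class DaemonArmor(Armor):
    """Per-host gateway: exposes the host's resources to the environment."""
    def handle_event(self, event):
        ...  # forward messages between local ARMORs and remote managers

class CommonArmor(Armor):
    """Implements a specific dependability technique (e.g., checkpointing)."""
    def handle_event(self, event):
        ...  # apply the technique required by the application
```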

173 citations


Proceedings ArticleDOI
16 Dec 1999
TL;DR: This paper presents a measurement-based dependability study of a Networked Windows NT system based on field data collected from NT System Logs from 503 servers running in a production environment over a four-month period.
Abstract: This paper presents a measurement-based dependability study of a networked Windows NT system based on field data collected from NT system logs from 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, (7) maintenance and configuration contribute 24% of system downtime.
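For perspective, the following back-of-the-envelope calculation converts the quoted per-server availability and downtime shares into hours over an approximate four-month window; only the percentages come from the study, the rest is illustrative arithmetic.

```python
# Illustrative arithmetic: downtime implied by 99% availability over ~4 months.
window_hours = 4 * 30 * 24          # approximate four-month observation period
availability = 0.99                 # "average availability of an individual server is over 99%"
downtime_hours = (1 - availability) * window_hours
print(f"~{downtime_hours:.0f} hours of downtime per server over the window")
# With 22% of downtime attributed to system software and 10% to hardware:
print(f"system software share: ~{0.22 * downtime_hours:.1f} h, "
      f"hardware share: ~{0.10 * downtime_hours:.1f} h")
```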

149 citations


Patent
17 Nov 1999
TL;DR: In this article, a method for increased software dependability is proposed, which includes learning how to predict an outage of a software system running on a computer and, based on that learning, predicting an imminent outage and avoiding it.
Abstract: A method (and system) for increased software dependability includes learning how to predict an outage of a software system running on a computer and, based on the learning, predicting an imminent outage and avoiding the outage.

129 citations


Journal ArticleDOI
TL;DR: This paper proposes new methods to: 1) Perform fault tolerance based task clustering, which determines the best placement of assertion and duplicate-and-compare tasks, 2) Derive the best error recovery topology using a small number of extra processing elements, 3) Exploit multidimensional assertions, and 4) Share assertions to reduce the fault tolerance overhead.
Abstract: Embedded systems employed in critical applications demand high reliability and availability in addition to high performance. Hardware-software co-synthesis of an embedded system is the process of partitioning, mapping, and scheduling its specification into hardware and software modules to meet performance, cost, reliability, and availability goals. In this paper, we address the problem of hardware-software co-synthesis of fault-tolerant real-time heterogeneous distributed embedded systems. Fault detection capability is imparted to the embedded system by adding assertion and duplicate-and-compare tasks to the task graph specification prior to co-synthesis. The dependability (reliability and availability) of the architecture is evaluated during co-synthesis. Our algorithm, called COFTA (Co-synthesis Of Fault-Tolerant Architectures), allows the user to specify multiple types of assertions for each task. It uses the assertion or combination of assertions which achieves the required fault coverage without incurring too much overhead. We propose new methods to: 1) Perform fault tolerance based task clustering, which determines the best placement of assertion and duplicate-and-compare tasks, 2) Derive the best error recovery topology using a small number of extra processing elements, 3) Exploit multidimensional assertions, and 4) Share assertions to reduce the fault tolerance overhead. Our algorithm can tackle multirate systems commonly found in multimedia applications. Application of the proposed algorithm to a large number of real-life telecom transport system examples (the largest example consisting of 2,172 tasks) shows its efficacy. For fault-secure architectures, which just have fault detection capabilities, COFTA is able to achieve up to 48.8 percent and 25.6 percent savings in embedded system cost over architectures employing duplication and task-based fault tolerance techniques, respectively. The average cost overhead of COFTA fault-secure architectures over simplex architectures is only 7.3 percent. In the case of fault-tolerant architectures, which can not only detect but also tolerate faults, COFTA is able to achieve up to 63.1 percent and 23.8 percent savings in embedded system cost over architectures employing triple-modular redundancy and task-based fault tolerance techniques, respectively. The average cost overhead of COFTA fault-tolerant architectures over simplex architectures is only 55.4 percent.
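A rough sketch of the pre-co-synthesis step described above, in which assertion or duplicate-and-compare tasks are added to the task graph; the data structures and task names below are hypothetical, not COFTA's.

```python
# Hypothetical sketch: augment a task graph with fault-detection tasks before
# co-synthesis, in the spirit of the assertion / duplicate-and-compare approach.
def add_fault_detection(task_graph, assertions):
    """task_graph: dict mapping task -> list of successor tasks.
    assertions: dict mapping task -> assertion task name, or None to fall back
    to duplicate-and-compare."""
    augmented = {task: list(succ) for task, succ in task_graph.items()}
    for task, check in assertions.items():
        if check is not None:
            augmented[task].append(check)          # cheap assertion checks the output
        else:
            dup, cmp_task = task + "_dup", task + "_cmp"
            augmented[task].append(cmp_task)       # compare original and duplicate
            augmented[dup] = [cmp_task]
    return augmented

# t1 gets an assertion task; t2 has no suitable assertion, so it is duplicated.
graph = {"t1": ["t2"], "t2": []}
print(add_fault_detection(graph, {"t1": "t1_assert", "t2": None}))
```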

111 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a generic fault-tolerant computer architecture based on commercial off-the-shelf (COTS) components (both processor hardware boards and real-time operating systems).
Abstract: The development and validation of fault-tolerant computers for critical real-time applications are currently both costly and time consuming. Often, the underlying technology is out-of-date by the time the computers are ready for deployment. Obsolescence can become a chronic problem when the systems in which they are embedded have lifetimes of several decades. This paper gives an overview of the work carried out in a project that is tackling the issues of cost and rapid obsolescence by defining a generic fault-tolerant computer architecture based essentially on commercial off-the-shelf (COTS) components (both processor hardware boards and real-time operating systems). The architecture uses a limited number of specific, but generic, hardware and software components to implement an architecture that can be configured along three dimensions: redundant channels, redundant lanes, and integrity levels. The two dimensions of physical redundancy allow the definition of a wide variety of instances with different fault tolerance strategies. The integrity level dimension allows application components of different levels of criticality to coexist in the same instance. The paper describes the main concepts of the architecture, the supporting environments for development and validation, and the prototypes currently being implemented.

107 citations


Book
01 Jan 1999
TL;DR: In this article, the authors discuss the importance of maintainability in the Government Procurement Process and in the Commercial Sector, and the importance, purpose, and results of maintainability efforts.
Abstract: CHAPTER 1: INTRODUCTION. What is Maintainability? The Importance, Purpose, and Results of Maintainability Efforts. Maintainability in the Government Procurement Process and in the Commercial Sector. Maintenance Engineering Versus Maintainability Engineering. Maintainability Science and Downtime. Maintainability Standards, Handbooks, and Information Sources. Maintainability Terms and Definitions. Problems. References. CHAPTER 2: MAINTAINABILITY MANAGEMENT. Introduction. Maintainability Management Functions in the Product Life Cycle. Maintainability Organization Functions and Tasks. Maintainability Organizational Structures. Maintainability Program Plan. Personnel Associated with Maintainability. Maintainability Design Reviews. Problems. References. CHAPTER 3: MAINTAINABILITY MEASURES, FUNCTIONS, AND MODELS. Introduction. Maintainability Measures. Maintainability Functions. System Effectiveness and Related Availability and Dependability Models. Mathematical Models. Problems. References. CHAPTER 4: MAINTAINABILITY TOOLS. Introduction. Failure Mode, Effects, and Criticality Analysis. Fault Tree Analysis. Cause and Effect Diagram. Total Quality Management. Maintainability Allocation. Problems. References. CHAPTER 5: SPECIFIC MAINTAINABILITY DESIGN CONSIDERATIONS. Introduction. Maintainability Design Characteristics. Standardization. Interchangeability. Modularization. Simplification. Accessibility. Identification. Accessibility and Identification Checklist. General Maintainability Design Guidelines and Common Maintainability Design Errors. Problems. References. CHAPTER 6: HUMAN FACTORS CONSIDERATIONS. Introduction. Human Factors Problems in Maintenance and Typical Human Behaviors. Human Body Measurements. Human Sensory Capacities. Environmental Factors. Auditory and Visual Warning Devices. Selected Formulas for Human Factors. Problems. References. CHAPTER 7: SAFETY CONSIDERATIONS. Introduction. Safety and Maintainability Design. Electrical, Mechanical, and Other Hazards. Safety Analysis Tools. Safety and Human Behavior. Safety Checklist. Problems. References. CHAPTER 8: COST CONSIDERATIONS. Introduction. Costs Associated with Maintainability. Reliability Cost. Discounting Formulas. Life Cycle Costing. Maintenance Cost Estimation Models. Maintainability, Maintenance Costs, and Cost Comparisons. Problems. References. CHAPTER 9: RELIABILITY-CENTERED MAINTENANCE. Introduction. The Definition of Reliability-Centered Maintenance. The RCM Process. RCM Implementation. RCM Review Groups. Methods of Monitoring Equipment Condition. RCM Applications and Achievements. Reasons for RCM Failures. Problems. References. CHAPTER 10: MAINTAINABILITY TESTING, DEMONSTRATION, AND DATA. Introduction. Planning and Control Requirements for Maintainability Testing and Demonstration. Test Approaches. Testing Methods. Preparing for Maintainability Demonstrations and Evaluating the Results. Checklists for Maintainability Demonstration Plans, Procedures, and Reports. Testability. Maintainability Data. Problems. References. CHAPTER 11: MAINTENANCE MODELS AND WARRANTIES. Introduction. Maintenance Models. Warranties. Problems. References. CHAPTER 12: TOPICS IN RELIABILITY. Introduction. Reliability and Maintainability. Bathtub Hazard Rate Concept. Reliability Terms, Definitions, and Formulas. Static Structures. Dynamic Structures. System Availability. Reliability Data Sources. Problems. References. Index.

72 citations


Journal ArticleDOI
TL;DR: In this article, the authors provide perspectives on issues and problems that impact the verification and validation (V&V) of KBSs and provide an overview of different techniques and tools that have been developed for performing V&V activities.
Abstract: Knowledge-based systems (KBSs) are being used in many application areas where their failures can be costly because of losses in services, property, or even life. To ensure their reliability and dependability, it is therefore important that these systems be verified and validated before they are deployed. This paper provides perspectives on issues and problems that impact the verification and validation (V&V) of KBSs. Some of the reasons why V&V of KBSs is difficult are presented. The paper also provides an overview of different techniques and tools that have been developed for performing V&V activities. Finally, some of the research issues that are relevant for future work in this field are discussed.

72 citations


Patent
15 Dec 1999
TL;DR: In this paper, the authors present a method for automatically selecting the ACARS or ATN subnetwork most suited for exchanging digital messages with the ground, with regard to the aircraft's equipment possibilities, those existing on the ground in the zone flown over, the costs and dependability of possible links, and the preferences of the pilot, the airline, and the control services.
Abstract: The invention concerns the management aboard an aircraft of the aeronautical digital ACARS and ATN telecommunication networks. It concerns a method for automatically selecting the ACARS or ATN subnetwork most suited for exchanging digital messages with the ground, with regard to the aircraft's equipment possibilities, those existing on the ground in the zone flown over, the costs and dependability of possible links, and the preferences of the pilot, the airline, and the control services. The method essentially consists in: generating and updating a database containing data on costs, performance, security/dependability, aircraft configuration, availability of communication subnetworks, and the instructions of the pilot, the airline, and the control services; and automatically selecting a communication mode by subnetwork of the ACARS and ATN networks, taking into account the order of preference established from criteria based on the data contained in the database (342).
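The patented selection logic amounts to ranking the currently available subnetworks against cost, dependability, and preference criteria held in a database. The fragment below is a hypothetical illustration of such a preference-ordered choice, with invented criteria weights and subnetwork names, not the patented method itself.

```python
# Hypothetical sketch: pick the communication subnetwork with the best score
# among those currently available, using criteria like those in the patent.
def select_subnetwork(candidates, weights):
    """candidates: dicts with 'name', 'available', 'cost', 'dependability', 'preference'.
    weights: relative importance of each criterion."""
    usable = [c for c in candidates if c["available"]]
    def score(c):
        return (weights["dependability"] * c["dependability"]
                + weights["preference"] * c["preference"]
                - weights["cost"] * c["cost"])
    return max(usable, key=score)["name"] if usable else None

subnets = [
    {"name": "VHF", "available": True, "cost": 0.2, "dependability": 0.7, "preference": 0.9},
    {"name": "SATCOM", "available": True, "cost": 0.8, "dependability": 0.9, "preference": 0.5},
]
print(select_subnetwork(subnets, {"cost": 1.0, "dependability": 1.0, "preference": 1.0}))
```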

71 citations


Journal ArticleDOI
TL;DR: Establishing computer system dependability benchmarks would make tests much easier and enable comparison of results across different machines.
Abstract: Computer-based systems are expected to be more and more dependable. For that, they have to operate correctly even in the presence of faults, and their fault tolerance must be thoroughly tested by the injection of faults, both real and artificial. Users should start to request reports from manufacturers on the outcomes of such experiments, and on the mechanisms built into systems to handle faults. For injecting artificial physical faults, fault injection offers a reasonably mature option today, with software-implemented fault injection (SWIFI) tools being preferred for most applications because of their flexibility and low cost. For injecting software bugs, although some promising ideas are being researched, no established technique yet exists. In any case, establishing computer system dependability benchmarks would make tests much easier and enable comparison of results across different machines.
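Software-implemented fault injection of the kind mentioned above typically flips bits in a program's state to emulate a transient physical fault. The toy sketch below illustrates the basic idea on an in-memory buffer; it is not one of the tools the article surveys.

```python
# Toy illustration of software-implemented fault injection: flip a random bit
# in a byte buffer to emulate the effect of a transient physical fault.
import random

def inject_bit_flip(buffer: bytearray, rng=random):
    """Flip one randomly chosen bit in-place and return its (byte, bit) location."""
    byte_index = rng.randrange(len(buffer))
    bit_index = rng.randrange(8)
    buffer[byte_index] ^= 1 << bit_index
    return byte_index, bit_index

state = bytearray(b"dependability benchmark")
byte_i, bit_i = inject_bit_flip(state)
print(f"flipped bit {bit_i} of byte {byte_i}: {state!r}")
```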

Journal ArticleDOI
TL;DR: Real-time dependable (RTD) channels are presented, a communication-oriented abstraction that can be configured to meet the QoS requirements of a variety of distributed applications.
Abstract: Communication services that provide enhanced Quality of Service (QoS) guarantees related to dependability and real time are important for many applications in distributed systems. This paper presents real-time dependable (RTD) channels, a communication-oriented abstraction that can be configured to meet the QoS requirements of a variety of distributed applications. This customization ability is based on using CactusRT, a system that supports the construction of middleware services out of software modules called micro-protocols. Each micro-protocol implements a different semantic property or property variant and interacts with other micro-protocols using an event-driven model supported by the CactusRT runtime system. In addition to RTD channels CactusRT and its implementation are described. This prototype executes on a cluster of Pentium PCs running the OpenGroup/RI MK 7.3 Mach real-time operating system and CORDS, a system for building network protocols based on the x-kernel.

Journal ArticleDOI
TL;DR: Two fault injection methodologies are presented, stress-based injection and path-based injection; both are based on resource activity analysis to ensure that injections cause fault tolerance activity and, thus, the resulting exercise of fault tolerance mechanisms.
Abstract: The objective of fault injection is to mimic the existence of faults and to force the exercise of the fault tolerance mechanisms of the target system. To maximize the efficacy of each injection, the locations, timing, and conditions for faults being injected must be carefully chosen. Faults should be injected with a high probability of being accessed. This paper presents two fault injection methodologies-stress-based injection and path-based injection; both are based on resource activity analysis to ensure that injections cause fault tolerance activity and, thus, the resulting exercise of fault tolerance mechanisms. The difference between these two methods is that stress-based injection validates the system dependability by monitoring the run-time workload activity at the system level to select faults that coincide with the locations and times of greatest workload activity, while path-based injection validates the system from the application perspective by using an analysis of the program flow and resource usage at the application program level to select faults during the program execution. These two injection methodologies focus separately on the system and process viewpoints to facilitate the testing of system dependability. Details of these two injection methodologies are discussed in this paper, along with their implementations, experimental results, and advantages and disadvantages.
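Stress-based injection can be pictured as triggering injections only when the monitored workload activity is near its peak, so the injected fault is likely to be accessed. The sketch below is a hypothetical illustration of that selection rule, not the authors' implementation.

```python
# Hypothetical sketch of stress-based selection: inject only when monitored
# workload activity is near its observed peak, so the fault is likely exercised.
def stress_based_trigger(activity_samples, threshold_fraction=0.9):
    """Return the indices of samples whose activity is within the top fraction
    of the observed maximum; these are candidate injection times."""
    peak = max(activity_samples)
    cutoff = threshold_fraction * peak
    return [i for i, a in enumerate(activity_samples) if a >= cutoff]

cpu_utilization = [0.10, 0.35, 0.92, 0.88, 0.95, 0.40]
print(stress_based_trigger(cpu_utilization))   # -> [2, 3, 4] with the default cutoff
```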

Proceedings ArticleDOI
01 Jan 1999
TL;DR: This paper presents the BBN built by EDF to model one of its assessment approaches, which applies to products for which EDF writes the requirements specification and then monitors the development carried out by an external supplier.
Abstract: Assessment of safety-critical systems including software cannot rely only on conventional techniques based on statistics and dependability models. In such systems, the predominant faults usually are design faults, which are very hard to predict. Therefore, the assessment can only be qualitative, and is performed by experts, who take into account various evidence sources. The aim of the SERENE European project is to improve the understandability and repeatability of such assessments, thanks to a representation of the expert's reasoning by a mathematical model (a Bayesian belief network). The subject of this paper is the presentation of the BBN built by EDF to model one of its assessment approaches, which applies to the products for which EDF writes the requirements specification and then monitors the development made by an external supplier. Before it yields reliable forecasts, this kind of model will no doubt require many years of calibration, by comparing the predictions it gives with the real, observed safety level of the evaluated systems. However, the authors think that in the short term such models can bring a rationale to the discussions between experts. They will also help in determining which are the most influential variables in the design process of a system, which is a necessary prerequisite for setting up any kind of field experience collection.

Proceedings ArticleDOI
02 May 1999
TL;DR: An automatic transformation is defined for the generation of models to capture system dependability attributes, like reliability, and will be integrated in the toolsets available for the ESPRIT LTR HIDE project.
Abstract: The paper deals with the automatic dependability analysis of systems designed using UML. An automatic transformation is defined for the generation of models to capture system dependability attributes, like reliability. The transformation concentrates on structural UML views, available early in the design, to operate at different levels of refinement, and tries to capture only the information relevant for dependability to limit the size (state space) of the models. Due to the modular construction, these models can be refined later as more detailed, relevant information becomes available. Moreover, a careful selection of the critical parts to be detailed allows one to avoid an explosion of the model size. An implementation of the transformation is in progress and will be integrated in the toolsets available for the ESPRIT LTR HIDE project.

Book ChapterDOI
Mogens Blanke1
01 Jan 1999
TL;DR: This paper gives an overview of recent progress in theory and methods to analyze and develop fault-tolerant control systems and shows how the different concepts are used and the benefits from active fault tolerance as compared to a traditionally designed control architecture.
Abstract: Fault-tolerant control offers enhanced availability and reduced risk of safety hazards when component failure and other unexpected events occur in a controlled plant. Fault-tolerant control merges several disciplines into a framework with common goals. The fault-tolerant properties are obtained through on-line fault detection and isolation, automatic condition assessment, and calculation of appropriate remedial actions. The final step is activation of the necessary actions through software. The actions to accommodate a fault cover a wide range of possibilities and underlying theory. Appropriate re-tuning can sometimes suffice, estimation of a signal replacing a measurement from a faulty sensor is needed in other events, and some cases require complex re-configuration or on-line redesign. The basis for a remedial action is always detection of an undesired event and the correct assessment of the situation through isolation of the fault. Analysis of the effects of the not-normal conditions, and the possible remedial actions, is a truly complex problem in most cases. The paper gives an overview of recent progress in theory and methods to analyze and develop fault-tolerant control systems. Fault propagation analysis and severity assessment are shown to be the basic means to evaluate safety and dependability. Following this, an analysis of structure will disclose available redundancy and possibilities to recover from faults in the system. These overall tools lead to requirements for fault detection and isolation. Fault detection theory has been the subject of intensive study for two decades. Nevertheless, the requirements from the use in fault-tolerant architectures have caused new challenges and further development. This paper focuses on recent results in overall design methods for fault-tolerant control systems. An example shows how the different concepts are used and illustrates the benefits from active fault tolerance as compared to a traditionally designed control architecture.
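One of the accommodation steps mentioned above, replacing a faulty sensor's measurement with an estimate, can be sketched as a simple residual test. The code below is a generic illustration under assumed names and thresholds, not the design method the paper presents.

```python
# Generic illustration: detect a sensor fault via a residual test and fall back
# to a model-based estimate of the signal (a simple form of fault accommodation).
def accommodated_measurement(measured, estimated, residual_threshold):
    """Return the measurement if it agrees with the model estimate,
    otherwise flag the sensor as faulty and use the estimate instead."""
    residual = abs(measured - estimated)
    faulty = residual > residual_threshold
    return (estimated if faulty else measured), faulty

value, sensor_faulty = accommodated_measurement(measured=12.7, estimated=4.1,
                                                residual_threshold=2.0)
print(value, sensor_faulty)   # -> 4.1 True: the estimate replaces the reading
```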

Proceedings ArticleDOI
06 Jan 1999
TL;DR: It is shown how the DSPN model capabilities are able to deal with various peculiar features of phased-mission systems, including those systems where the next phase to be performed can be chosen at the time the preceding phase ends.
Abstract: We focus on analytical modeling for the dependability evaluation of phased-mission systems. Because of their dynamic behavior, systems showing a phased behavior offer challenges in modeling. We propose the modeling and evaluation of phased-mission system dependability through the Deterministic and Stochastic Petri Nets (DSPN). The DSPN approach to the phased-mission systems offers many advantages, concerning both the modeling and the solution. The DSPN model of the mission can be a very concise one, and it can be efficiently solved for dependability evaluation purposes. The solution procedure is supported by the existence of an analytical solution for the transient probabilities of the marking process underlying the DSPN model. This analytical solution can be fully automated. We show how the DSPN model capabilities are able to deal with various peculiar features of phased-mission systems, including those systems where the next phase to be performed can be chosen at the time the preceding phase ends.

Proceedings ArticleDOI
17 Nov 1999
TL;DR: An automatic transformation from UML diagrams to Timed Petri Nets for model based dependability evaluation is applied, which completely hides the mathematical background, thus eliminating the need for a specific expertise in abstract mathematics and the tedious remodeling of the system for mathematical analysis.
Abstract: Even though a thorough system specification improves the quality of the design, it is not sufficient to guarantee that a system will satisfy its reliability targets. Within this paper, we present an application example of one of the activities performed in the European ESPRIT project HIDE, aiming at the creation of an integrated environment where design toolsets based on UML are augmented with modeling and analysis tools for the automatic validation of the system under design. We apply an automatic transformation from UML diagrams to Timed Petri Nets for model-based dependability evaluation. It allows a designer to use UML as a front-end for the specification of both the system and the user requirements, and to evaluate dependability figures of the system from the early phases of the design, thus obtaining precious clues for design refinement. The transformation completely hides the mathematical background, thus eliminating the need for specific expertise in abstract mathematics and the tedious remodeling of the system for mathematical analysis.

Journal ArticleDOI
TL;DR: In this article, the authors propose a flexible framework for assessing dependability of measurement using generalizability theory (GT) to estimate the total proportion of variance in ratings due to error rather than focusing on one source of error at a time.
Abstract: Classical approaches to the assessment of reliability neglect to take into account multiple sources of error and to consider diverse measurement contexts. Generalizability theory (GT) offers a flexible framework for assessing dependability of measurement. With GT, investigators can estimate the total proportion of variance in ratings that is due to error rather than focusing on one source of error at a time. Simultaneous consideration of multiple sources of error allows investigators to assess the overall impact of measurement error in terms of attenuation of study findings and reduction of statistical power. Estimation of variance components allows for flexible application of findings to a variety of possible future research designs. Illustrative analyses demonstrate the special advantages of GT for planning studies in which observer ratings will be used.
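For a single-facet persons-by-raters design, the dependability (generalizability) coefficient discussed above has the standard variance-components form shown below, with n_r the number of raters averaged per person; this is the textbook expression, included here only for concreteness.

```latex
% Generalizability (dependability) coefficient for a persons x raters design,
% relative decisions, with n_r raters averaged per person:
%   sigma^2_p      : universe-score (person) variance
%   sigma^2_{pr,e} : residual variance (person-by-rater interaction plus error)
E\rho^{2} \;=\; \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \dfrac{\sigma^{2}_{pr,e}}{n_{r}}}
```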

Journal ArticleDOI
TL;DR: A hierarchical and modular methodology is proposed for modeling and evaluation of phased-mission systems in which phases have constant predetermined duration, and where missions may evolve dynamically, selecting the next phase to be performed according to the system state.
Abstract: This paper proposes a hierarchical and modular methodology for modeling and evaluation of phased-mission systems in which phases have constant predetermined duration, and where missions may evolve dynamically, selecting the next phase to be performed according to the system state. A 2-level modeling is proposed: the higher level models the mission itself; the lower one models the various phases. The modeling and resolution of the phases and of the dependencies among phases are handled separately. This methodology is applied using an example of a space application. This method is compared with previous models. The advantages of this approach are its great flexibility, easy applicability, and reusability of the defined models. It permits: (1) obtaining information on the overall behavior of the system; and (2) focusing on each single phase to detect system dependability bottlenecks. The explicit modeling of the phase changes: (1) is a neat and easily understandable representation of the interphase dependencies; and (2) allows a straightforward modeling of the mission-profile dynamic selection. General-purpose tools available to the reliability community can easily manage the computational complexity of the analysis.

Book ChapterDOI
01 Sep 1999
TL;DR: It is shown that any FT can be easily mapped into a BN and that basic inference techniques on the latter may be used to obtain classical parameters computed using the former (i.e. reliability of the Top Event or of any sub-system, criticality of components, etc.).
Abstract: Bayesian Networks (BN) provide a robust probabilistic method of reasoning under uncertainty. They have been successfully applied in a variety of real-world tasks and their suitability for dependability analysis is now considered by several researchers. In the present paper, we aim at defining a formal comparison between BN and one of the most popular techniques for dependability analysis: Fault Trees (FT). We will show that any FT can be easily mapped into a BN and that basic inference techniques on the latter may be used to obtain classical parameters computed using the former (i.e. reliability of the Top Event or of any sub-system, criticality of components, etc.). Moreover, we will discuss how, by using BN, some additional power can be obtained, both at the modeling and at the analysis level. In particular, dependency among components and noisy gates can be easily accommodated in the BN framework, together with the possibility of performing general diagnostic analysis. The comparison of the two methodologies is carried out through the analysis of an example that consists of a redundant multiprocessor system, with local and shared memories, local mirrored disks and a single bus.
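The mapping described above turns each FT gate into a BN node whose conditional probability table encodes the gate's Boolean function. The fragment below illustrates the idea for an AND gate over two independent basic events in plain Python (no BN library), showing that the BN marginal reproduces the classical FT result.

```python
# Illustration of mapping a fault-tree AND gate to a Bayesian-network node:
# the child's conditional probability table is deterministic (1 iff both
# parents have failed), so its marginal equals the product of the parents'
# failure probabilities, exactly as classical FT analysis would compute.
def and_gate_failure_probability(p_a, p_b):
    cpt = {(a, b): 1.0 if (a and b) else 0.0        # P(gate fails | A, B)
           for a in (False, True) for b in (False, True)}
    prob = 0.0
    for a in (False, True):
        for b in (False, True):
            p_parents = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
            prob += cpt[(a, b)] * p_parents
    return prob

print(and_gate_failure_probability(0.01, 0.02))     # 0.0002, same as the FT result
```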

Journal ArticleDOI
TL;DR: The complete transient analysis of Op allows discussion of the Poisson approximation by Littlewood for his model; results are obtained for the distribution function of the number of failures in a fixed mission and for dependability metrics which are much more informative than the usual ones in a white-box approach.
Abstract: Dependability evaluation is a basic component in assessing the quality of repairable systems. A general model (Op) is presented and is specifically designed for software systems; it allows the evaluation of various dependability metrics, in particular of availability measures. Op is of the structural type, based on Markov process theory. In particular, Op is an attempt to overcome some limitations of the well-known Littlewood reliability model for modular software. This paper gives the mathematical results necessary for the transient analysis of this general model, and algorithms that can efficiently evaluate it. More specifically, from the parameters describing the evolution of the execution process when there is no failure, the failure processes together with the way they affect the execution, and the recovery process, results are obtained for the distribution function of the number of failures in a fixed mission and for dependability metrics which are much more informative than the usual ones in a white-box approach. The estimation procedures of the Op parameters are briefly discussed. Some simple examples illustrate the interest in such a structural view and explain how to consider reliability growth of part of the software with the transformation approach developed by Laprie et al. The complete transient analysis of Op allows discussion of the Poisson approximation by Littlewood for his model.

Journal ArticleDOI
TL;DR: A hierarchical simulation methodology that enables accurate system evaluation under realistic faults and conditions and is demonstrated and validated in the case study of Myrinet (a commercial, high-speed network) based network system.
Abstract: This paper presents a hierarchical simulation methodology that enables accurate system evaluation under realistic faults and conditions. In this methodology, effects of low-level (i.e., transistor or circuit level) faults are propagated to higher levels (i.e., system level) using fault dictionaries. The primary fault models are obtained via simulation of the transistor-level effect of a radiation particle penetrating a device. The resulting current bursts constitute the first-level fault dictionary and are used in the circuit-level simulation to determine the impact on circuit latches and flip-flops. The latched outputs constitute the next level fault dictionary in the hierarchy and are applied in conducting fault injection simulation at the chip-level under selected workloads or application programs. Faults injected at the chip-level result in memory corruptions, which are used to form the next level fault dictionary for the system-level simulation of an application running on simulated hardware. When an application terminates, either normally or abnormally, the overall fault impact on the software behavior is quantified and analyzed. The system in this sense can be a single workstation or a network. The simulation method is demonstrated and validated in the case study of Myrinet (a commercial, high-speed network) based network system.
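A fault dictionary is essentially a lookup from a lower-level fault manifestation to its effect at the next level. The sketch below illustrates chaining such dictionaries across levels; the entries are entirely hypothetical and are not taken from the Myrinet case study.

```python
# Hypothetical sketch of chained fault dictionaries: a device-level event is
# looked up level by level until its system-level manifestation is found.
device_to_circuit = {"current_burst_A": "latch_L3_flipped"}
circuit_to_chip = {"latch_L3_flipped": "register_R7_corrupted"}
chip_to_system = {"register_R7_corrupted": "memory_word_0x1F40_corrupted"}

def propagate(fault, dictionaries):
    """Follow a low-level fault through successive fault dictionaries."""
    effect = fault
    for level in dictionaries:
        effect = level.get(effect, effect)   # unmapped faults pass through unchanged
    return effect

print(propagate("current_burst_A",
                [device_to_circuit, circuit_to_chip, chip_to_system]))
```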

Patent
Joseph S. Rosen1
09 Dec 1999
TL;DR: In this paper, the authors propose a mechanism for applying expert knowledge and machine-learning routines to a continuous stream of information, producing fault localization knowledge in the form of decision-tree based classification and dependability models.
Abstract: The present invention includes a mechanism for applying expert knowledge and machine-learning routines to a continuous stream of information. The present method comprises learning a set of dependability models, one for each classification model, that characterize the situations in which each of the classification models is able to make correct predictions. At appropriate intervals the method produces new fault localization knowledge in the form of decision-tree based classification and dependability models. Such knowledge is used to enhance the existing classification knowledge already available. Each of these classification models has a particular sub-domain where it is the most reliable, and hence the best choice to use. For future unlabeled examples, these dependability models are consulted to select the most appropriate classification model, and the prediction of that classification model is then accepted.

Journal ArticleDOI
TL;DR: In this paper, a new class of coloured Petri nets is introduced, which is well suited to the modelling of manufacturing systems, and a library of model templates helps to create large models.
Abstract: The design of a manufacturing system requires modelling and performance evaluation techniques. To support this process, a modelling method based on Petri nets is proposed in this paper. A new class of coloured Petri nets is introduced, which is well suited to the modelling of manufacturing systems. Using this net class, the structure and the work plans of a manufacturing system can both be modelled separately. A library of model templates helps to create large models. The different model parts are merged automatically to create a complete model of the manufacturing system. Measures of interest can be obtained from the model by numerical analysis or simulation, showing its performance and dependability. The usefulness of the approach is shown by applying the proposed techniques to a real-life manufacturing system.

Proceedings ArticleDOI
08 Sep 1999
TL;DR: This paper presents an overview of the Mobius project, which aims to provide a modeling framework and software environment that support multiple modeling formalisms, methods for model composition and connection, and a way to integrate multiple analytical/numerical- and simulation-based model solution methods.
Abstract: There have been significant advances in methods for specifying and solving models that aim to predict the performance and dependability of computer systems and networks. At the same time, however, there have been dramatic increases in the complexity of the systems whose performance and dependability must be evaluated, and considerable increases in the expectations of analysts that use performance/dependability evaluation tools. This paper briefly reviews the progress that has been made in the development of performance/dependability evaluation tools, and argues that the next important step is the creation of modeling frameworks and software environments that support multi-level, multi-formalism modeling and multiple solution methods within a single integrated framework. In addition, this paper presents an overview of the Mobius project, which aims to provide a modeling framework and software environment that support multiple modeling formalisms, methods for model composition and connection, and a way to integrate multiple analytical/numerical- and simulation-based model solution methods. Finally, it suggests research that must take place to make this aim a reality, and thus facilitate the performance and dependability evaluation of complex computer systems and networks.

Journal ArticleDOI
TL;DR: This paper proposes a new algorithm based on the classical uniformization technique in which a test to detect the stationary behavior of the system is used to stop the computation if the stationarity is reached, and provides the transient availability measures and bounds for the steady state availability.
Abstract: Point availability and expected interval availability are dependability measures respectively defined by the probability that a system is in operation at a given instant and by the mean percentage of time during which a system is in operation over a finite observation period. We consider a repairable computer system and we assume, as usual, that the system is modeled by a finite Markov process. We propose in this paper a new algorithm to compute these two availability measures. This algorithm is based on the classical uniformization technique in which a test to detect the stationary behavior of the system is used to stop the computation if the stationarity is reached. In that case, the algorithm gives not only the transient availability measures, but also the steady state availability, with significant computational savings, especially when the time at which measures are needed is large. In the case where the stationarity is not reached, the algorithm provides the transient availability measures and bounds for the steady state availability. It is also shown how the new algorithm can be extended to the computation of performability measures.
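The classical uniformization step underlying the algorithm expresses the transient state distribution as a Poisson-weighted sum of powers of the uniformized transition matrix. The sketch below computes point availability this way for a toy two-state (up/down) repairable system; it illustrates only the underlying technique, not the paper's stationarity-detecting algorithm.

```python
# Point availability of a toy 2-state repairable system via uniformization:
# pi(t) = sum_n exp(-L*t) (L*t)^n / n! * pi(0) P^n, with P = I + Q / L.
import math
import numpy as np

def point_availability(failure_rate, repair_rate, t, terms=200):
    Q = np.array([[-failure_rate, failure_rate],
                  [repair_rate, -repair_rate]])       # state 0 = up, state 1 = down
    L = max(failure_rate, repair_rate) * 1.1          # uniformization rate >= max |Q_ii|
    P = np.eye(2) + Q / L
    pi = np.array([1.0, 0.0])                         # start in the up state
    result = np.zeros(2)
    poisson_term = math.exp(-L * t)                   # Poisson weight for n = 0
    for n in range(terms):
        result += poisson_term * pi
        pi = pi @ P
        poisson_term *= L * t / (n + 1)
    return result[0]                                  # probability of being up at time t

print(point_availability(failure_rate=0.01, repair_rate=0.5, t=100.0))
```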

Journal ArticleDOI
TL;DR: This work defines a set of alternative architectures, gives some elements for constructing their dependability models, and compares their availability, to provide a quantified means of helping in the definition of a new architecture for CAUTRA, a subset of the French Air Traffic Control system.
Abstract: The aim of our work is to provide a quantified means of helping in the definition of a new architecture for CAUTRA, a subset of the French Air Traffic Control system. In this paper, we define a set of alternative architectures, give some elements for constructing their dependability models, and compare their availability. Modeling is carried out following a modular and systematic approach, based on the derivation of block models at a high level of abstraction. In a second step, the blocks are replaced by their equivalent Generalized Stochastic Petri Nets to build up the detailed model of the architecture. The evaluations performed permit identification of a subset of architectures whose availability meets the dependability requirements and also identification of the best architecture among this subset.

Proceedings Article
30 Jul 1999
TL;DR: This paper will discuss how both modeling and analysis issues can be naturally dealt with by BN, and how some limitations intrinsic to combinatorial dependability methods such as Fault Trees can be overcome using BN.
Abstract: Bayesian Networks (BN) provide robust probabilistic methods of reasoning under uncertainty, but although their formal grounds are strictly based on the notion of conditional dependence, not much attention has been paid so far to their use in dependability analysis. The aim of this paper is to propose BN as a suitable tool for dependability analysis, by challenging the formalism with basic issues arising in dependability tasks. We will discuss how both modeling and analysis issues can be naturally dealt with by BN. Moreover, we will show how some limitations intrinsic to combinatorial dependability methods such as Fault Trees can be overcome using BN. This will be pursued through the study of a real-world example concerning the reliability analysis of a redundant digital Programmable Logic Controller (PLC) with majority voting 2:3.

Proceedings ArticleDOI
02 May 1999
TL;DR: This paper introduces the concept of a QoS-oriented gateway to integrate a variety of QoS enforcement and implementation mechanisms controlling the underlying distributed interactions and discusses the functions performed by such a component in achieving the desired overall end-to-end QoS.
Abstract: As networks and the use of communications within applications continue to grow and find more uses, so too does the demand for more control and manageability of various "system properties" through middleware. An important component supporting an integrated property architecture is the concept of an object gateway, which is a quality-of-service (QoS)-aware element transparently inserted at the transport layer between clients and objects to provide managed communication behavior for the particular property being supported. In this paper, we introduce the concept of a QoS-oriented gateway to integrate a variety of QoS enforcement and implementation mechanisms controlling the underlying distributed interactions. We discuss the functions performed by such a component in achieving the desired overall end-to-end QoS, and the design considerations underlying our current implementation. We conclude with experiences to date with two variations of the gateway: one controlling managed latency and throughput using bandwidth allocation, and one controlling dependability through the coordination of object replicas.