
Showing papers on "Dependability published in 2012"


Journal ArticleDOI
TL;DR: A bibliographical review covering the last decade is presented on the application of Bayesian networks to dependability, risk analysis, and maintenance, showing an increasing trend in the literature related to these domains.

635 citations


Book
05 Oct 2012
TL;DR: This introductory reference and tutorial is ideal for self-directed learning or classroom instruction, and is an excellent reference for practitioners, including architects, developers, integrators, validators, certifiers, first-level technical leaders, and project managers.
Abstract: Conventional build-then-test practices are making today's embedded, software-reliant systems unaffordable to build. In response, more than thirty leading industrial organizations have joined SAE (formerly, the Society of Automotive Engineers) to define the SAE Architecture Analysis & Design Language (AADL) AS-5506 Standard, a rigorous and extensible foundation for model-based engineering analysis practices that encompass software system design, integration, and assurance. Using AADL, you can conduct lightweight and rigorous analyses of critical real-time factors such as performance, dependability, security, and data integrity. You can integrate additional established and custom analysis/specification techniques into your engineering environment, developing a fully unified architecture model that makes it easier to build reliable systems that meet customer expectations. Model-Based Engineering with AADL is the first guide to using this new international standard to optimize your development processes. Coauthored by Peter H. Feiler, the standard's author and technical lead, this introductory reference and tutorial is ideal for self-directed learning or classroom instruction, and is an excellent reference for practitioners, including architects, developers, integrators, validators, certifiers, first-level technical leaders, and project managers. Packed with real-world examples, it introduces all aspects of the AADL notation as part of an architecture-centric, model-based engineering approach to discovering embedded software systems problems earlier, when they cost less to solve. Throughout, the authors compare AADL to other modeling notations and approaches, while presenting the language via a complete case study: the development and analysis of a realistic example system through repeated refinement and analysis. Part One introduces both the AADL language and core Model-Based Engineering (MBE) practices, explaining basic software systems modeling and analysis in the context of an example system, and offering practical guidelines for effectively applying AADL. Part Two describes the characteristics of each AADL element, including their representations, applicability, and constraints. The Appendix includes comprehensive listings of AADL language elements, properties incorporated in the AADL standard, and a description of the book's example system.

303 citations


Journal ArticleDOI
12 Jan 2012-Sensors
TL;DR: This work proposes a methodology based on the automatic generation of a fault tree to evaluate the reliability and availability of Wireless Sensor Networks when permanent faults occur on network devices.
Abstract: Wireless Sensor Networks (WSN) currently represent the best candidate to be adopted as the communication solution for the last mile connection in process control and monitoring applications in industrial environments. Most of these applications have stringent dependability (reliability and availability) requirements, as a system failure may result in economic losses, put people in danger, or lead to environmental damage. Among the different types of faults that can lead to a system failure, permanent faults on network devices have a major impact. They can hamper communications over long periods of time and consequently disturb, or even disable, control algorithms. The lack of a structured approach for evaluating permanent faults prevents system designers from making optimized decisions that minimize these occurrences. In this work we propose a methodology based on the automatic generation of a fault tree to evaluate the reliability and availability of Wireless Sensor Networks when permanent faults occur on network devices. The proposal supports any topology, different levels of redundancy, network reconfigurations, criticality of devices, and arbitrary failure conditions. The proposed methodology is particularly suitable for the design and validation of Wireless Sensor Networks when trying to optimize their reliability and availability requirements.
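The paper's tool chain is not reproduced here; as a loose illustration of how a fault tree maps device failure probabilities to a system-level dependability figure, the sketch below evaluates a tiny AND/OR tree whose gate structure and probabilities are invented for this example.

```python
# Illustrative sketch only: the WSN is assumed to fail if the sink fails OR
# both redundant routing paths fail (structure and probabilities are invented,
# not taken from the paper).

def or_gate(*p):
    # P(at least one input event occurs), assuming independent events
    q = 1.0
    for pi in p:
        q *= (1.0 - pi)
    return 1.0 - q

def and_gate(*p):
    # P(all input events occur), assuming independent events
    q = 1.0
    for pi in p:
        q *= pi
    return q

p_sink = 0.01      # probability the sink node has a permanent fault
p_path_a = 0.05    # probability routing path A is down
p_path_b = 0.05    # probability routing path B is down

p_system_failure = or_gate(p_sink, and_gate(p_path_a, p_path_b))
print(f"System unavailability: {p_system_failure:.4f}")   # ~0.0125
```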

180 citations


Journal ArticleDOI
TL;DR: An approach and support tools are illustrated that enable a holistic view of the design and run-time management of adaptive software systems, based on formal (probabilistic) models that are used at design time to reason about dependability of the application in quantitative terms.
Abstract: Modern software systems are increasingly requested to be adaptive to changes in the environment in which they are embedded. Moreover, adaptation often needs to be performed automatically, through self-managed reactions enacted by the application at run time. Off-line, human-driven changes should be requested only if self-adaptation cannot be achieved successfully. To support this kind of autonomic behavior, software systems must be empowered by a rich run-time support that can monitor the relevant phenomena of the surrounding environment to detect changes, analyze the data collected to understand the possible consequences of changes, reason about the ability of the application to continue to provide the required service, and finally react if an adaptation is needed. This paper focuses on non-functional requirements, which constitute an essential component of the quality that modern software systems need to exhibit. Although the proposed approach is quite general, it is mainly exemplified in the paper in the context of service-oriented systems, where the quality of service (QoS) is regulated by contractual obligations between the application provider and its clients. We analyze the case where an application, exported as a service, is built as a composition of other services. Non-functional requirements—such as reliability and performance—heavily depend on the environment in which the application is embedded. Thus changes in the environment may ultimately adversely affect QoS satisfaction. We illustrate an approach and support tools that enable a holistic view of the design and run-time management of adaptive software systems. The approach is based on formal (probabilistic) models that are used at design time to reason about dependability of the application in quantitative terms. Models continue to exist at run time to enable continuous verification and detection of changes that require adaptation.
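As a rough sketch of the kind of quantitative, model-based reasoning described (not the authors' tooling or models), the snippet below re-evaluates the reliability of a sequential service composition from monitored per-service success probabilities and flags when an assumed contractual target is no longer met.

```python
# Minimal sketch (not the paper's tool chain): the reliability of a sequential
# composition is the product of per-service success probabilities; the runtime
# monitor re-estimates these values and triggers adaptation when an assumed
# SLA threshold is violated.

def composition_reliability(per_service_reliability):
    r = 1.0
    for p in per_service_reliability:
        r *= p
    return r

# design-time estimates vs. values observed by the monitor (all values assumed)
design_time = [0.999, 0.995, 0.990]
monitored   = [0.999, 0.950, 0.990]   # one service degraded in the field

REQUIRED = 0.97                        # hypothetical contractual target
for label, probs in [("design", design_time), ("runtime", monitored)]:
    r = composition_reliability(probs)
    status = "ok" if r >= REQUIRED else "adaptation needed"
    print(f"{label}: R={r:.4f} ({status})")
```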

141 citations


Journal Article
TL;DR: The paper discusses the need for implementing these systems in various application domains and the research challenges in defining an appropriate formalism that represents more than networking and information technology: information and knowledge integrated into physical objects.
Abstract: Cyber-Physical Systems (CPSs) represent an emerging research area that has attracted the attention of many researchers. Starting from the definition of CPS, the paper discusses the need for implementing these systems in various application domains and the research challenges in defining an appropriate formalism that represents more than networking and information technology: information and knowledge integrated into physical objects. As CPSs are expected to play a major role in the design and development of future engineering systems, a short state of the art regarding the main CPS research areas (generic architecture, design principles, modeling, dependability, and implementation) ends the paper.

125 citations


Journal ArticleDOI
TL;DR: The advocated methodology aims to reduce the likelihood of manifestation of hidden failures and potential cascading events by adjusting the security/dependability balance of protection systems.
Abstract: Recent blackouts offer testimonies of the crucial role played by protection relays in a reliable power system. It is argued that embracing the paradigm shift of adaptive protection is a fundamental step toward a reliable power grid. The purpose of this paper is to present a methodology to implement a security/dependability adaptive protection scheme. The advocated methodology aims to reduce the likelihood of manifestation of hidden failures and potential cascading events by adjusting the security/dependability balance of protection systems. The proposed methodology is based on wide-area measurements obtained with the aid of phasor measurement units. A data-mining algorithm, known as decision trees, is used to classify the power system state and to predict the optimal security/dependability bias of a critical protection scheme. The methodology is tested on a detailed 4000-bus system.
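A rough sketch of the classification step only (the paper works on wide-area measurements from a detailed 4000-bus system; the features, values, and labels below are synthetic): a decision tree is trained on labeled system snapshots to predict the preferred security/dependability bias of a protection scheme.

```python
# Synthetic illustration of "decision tree predicts protection bias".
from sklearn.tree import DecisionTreeClassifier

# each row: [tie-line flow (MW), voltage angle spread (deg), spinning reserve (MW)]
X = [
    [400, 10, 900],
    [850, 35, 250],
    [420, 12, 880],
    [900, 40, 200],
]
y = ["dependability", "security", "dependability", "security"]  # preferred bias

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[780, 30, 300]]))   # stressed conditions -> ['security']
```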

101 citations


Journal ArticleDOI
TL;DR: The survey shows that more works are devoted to reliability and safety, fewer to availability and maintainability, and none to integrity, and more research is needed for tool development to automate the derivation of analysis models and to give feedback to designers.
Abstract: The goal is to survey dependability modeling and analysis of software and systems specified with UML, with focus on reliability, availability, maintainability, and safety (RAMS). From the literature published in the last decade, 33 approaches presented in 43 papers were identified. They are evaluated according to three sets of criteria regarding UML modeling issues, addressed dependability characteristics, and quality assessment of the surveyed approaches. The survey shows that more works are devoted to reliability and safety, fewer to availability and maintainability, and none to integrity. Many methods support early life-cycle phases (from requirements to design). More research is needed for tool development to automate the derivation of analysis models and to give feedback to designers.

93 citations


Journal ArticleDOI
TL;DR: A method based on the combined application of genetic algorithms and a finite element method (FEM) investigation is proposed and applied for the serviceability assessment of a long-span suspension bridge.
Abstract: A long-span suspension bridge is a complex structural system that interacts with the surrounding environment and the users. The environmental actions and the corresponding loads (wind, temperature, rain, earthquake, etc.), together with the live loads (railway traffic, highway traffic), have a strong influence on the dynamic response of the bridge, and can significantly influence the structural behavior and alter its geometry, thus limiting the serviceability performance even up to a partial closure. This article presents some general considerations and operative aspects of the activities related to the analysis and design of such a complex structural system. Specific reference is made to the dependability assessment and the performance requirements of the whole system, while focus is placed on methods for handling the completeness and the uncertainty in the assessment of the load scenarios. Aiming at the serviceability assessment, a method based on the combined application of genetic algorithms and a finite element method (FEM) investigation is proposed and applied.

92 citations



BookDOI
31 Jul 2012
TL;DR: This book provides an overview of the work of two successive ESPRIT Basic Research Projects on Predictably Dependable Computing Systems (PDCS), as well as their major achievements.
Abstract: This book provides an overview of the work of two successive ESPRIT Basic Research Projects on Predictably Dependable Computing Systems (PDCS), as well as their major achievements. The purpose of the projects has been "to contribute to making the process of designing and constructing dependable computing systems much more predictable and cost-effective". The book contains a carefully edited selection of papers on all four main topics in PDCS: fault prevention, fault tolerance, fault removal, and fault forecasting. Problems of real-time and distributed systems, system structuring, qualitative evaluation, and software dependability modelling are emphasized. The book reports on the latest research on PDCS from a team including many of Europe's leading researchers.

88 citations


Proceedings ArticleDOI
02 Jun 2012
TL;DR: The case demonstrates the feasibility of fully capturing a system-level design as a single comprehensive formal model and analyzing it automatically using a toolset based on (probabilistic) model checkers.
Abstract: This paper reports on the usage of a broad palette of formal modeling and analysis techniques on a regular industrial-size design of an ultra-modern satellite platform. These efforts were carried out in parallel with the conventional software development of the satellite platform. The model itself is expressed in a formalized dialect of AADL. Its formal nature enables rigorous and automated analysis, for which the recently developed COMPASS toolset was used. The whole effort revealed numerous inconsistencies in the early design documents, and the use of formal analyses provided additional insight into discrete system behavior (comprising nearly 50 million states) and into hybrid system behavior involving discrete and continuous variables, and enabled the automated generation of large fault trees (66 nodes) for safety analysis that typically are constructed by hand. The model's size pushed the computational tractability of the algorithms underlying the formal analyses, and revealed bottlenecks for future theoretical research. Additionally, the effort led to newly learned practices from which subsequent formal modeling and analysis efforts shall benefit, especially when they are injected into the conventional software development lifecycle. The case demonstrates the feasibility of fully capturing a system-level design as a single comprehensive formal model and analyzing it automatically using a toolset based on (probabilistic) model checkers.

Journal ArticleDOI
TL;DR: The results indicate that while the total number of measurement devices for system observability may increase (and therefore, the observability is improved), the total cost for the plan is reduced and the proposed method is advantageous over the techniques optimizing each section independently.
Abstract: The term wide area measurement system (WAMS) refers to a system including new digital metering devices (e.g., phasor measurement units) together with a communication system designed for monitoring, operating, and controlling power systems. Generally, a WAMS process includes three main functions: data acquisition, data transmission, and data processing, performed by measurement devices, communication systems, and energy management systems, respectively. While these three functions are strongly interdependent, most research carried out on this topic investigates these functions independently. In this paper, meter placement and the required communication infrastructure for the state estimation program are co-optimized simultaneously. To perform this, the two planning issues are jointly formulated in a single genetic algorithm (GA) problem. Then, three IEEE test networks without any conventional measurements and communications are used to investigate the advantages of accounting for the interdependency of these two sections during optimization. The results confirm that the proposed method is advantageous over techniques optimizing each section independently. The results indicate that while the total number of measurement devices for system observability may increase (and therefore the observability is improved), the total cost for the plan is reduced.
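A sketch of the co-optimization idea only (the bus data, costs, and observability rule below are invented, not the IEEE test cases of the paper): a GA chromosome is a set of metered buses, and its fitness jointly prices the meters and the communication links needed to reach the control center, heavily penalizing unobservable plans.

```python
# Hypothetical joint-cost fitness for a GA that minimizes it.
METER_COST = 10.0
LINK_COST_PER_KM = 0.5
PENALTY = 1e6

# assumed distance (km) from each bus to the nearest usable communication hub
dist_to_hub = {1: 4.0, 2: 12.0, 3: 7.5, 4: 20.0}
# assumed observability rule: buses 1 and 3 must be metered
required = {1, 3}

def fitness(metered_buses):
    cost = METER_COST * len(metered_buses)
    cost += sum(LINK_COST_PER_KM * dist_to_hub[b] for b in metered_buses)
    if not required.issubset(metered_buses):
        cost += PENALTY            # unobservable plans are effectively rejected
    return cost                    # the GA minimizes this joint cost

print(fitness({1, 3}))             # feasible plan: 10+2.0 + 10+3.75 = 25.75
print(fitness({1}) > fitness({1, 3}))   # True: the cheaper-looking plan is infeasible
```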

Journal ArticleDOI
TL;DR: After presenting the advantages, possibilities, and challenges of using GNSS in rail transportation, a procedure is proposed to work in that direction, based on experiments for the evaluation of RAMS properties related to satellite-based localization units.
Abstract: Satellite-based localization technologies are strategic opportunities in railway applications because they offer new possibilities of service and have advantages that current technologies, relying mainly on infrastructures deployed along tracks, cannot equal. GNSSs (Global Navigation Satellite Systems) can, in particular, offer localization services in ERTMS (European Rail Traffic Management System), the system developed within the European railway community to harmonize, at the European scale, railway signalling and control/command systems. However, using GNSS in such safety applications is slowed down by the need to comply with railway standards. Indeed, demonstrations of RAMS properties (Reliability, Availability, Maintainability, Safety) are required for new solutions embedded in trains. They aim at verifying whether all dependability (RAM) and safety aspects are controlled over the lifecycle of the solutions before using them operationally. No RAMS evaluation technique exists for systems based on signal propagation and subject to failures provoked by environmental effects. The major challenge is thus to develop proof methods that provide the means to fulfil the railway certification process. In this article, we propose a procedure to work in that direction after having presented the advantages, the possibilities, and the challenges of using GNSS in rail transportation. The procedure is based on experiments for the evaluation of RAMS properties related to satellite-based localization units. We apply the method to different position measurements obtained in several typical railway environments. The obtained results are discussed from the dependability and safety points of view.

Proceedings ArticleDOI
13 Dec 2012
TL;DR: This paper investigates the benefits of a warm-standby replication mechanism in a Eucalyptus cloud computing environment and shows enhanced dependability for the proposed redundant system, as well as a decrease in the annual downtime.
Abstract: High availability in cloud computing services is essential for maintaining customer confidence and avoiding revenue losses due to SLA violation penalties. Since the software and hardware components of cloud infrastructures may have limited reliability, fault tolerance mechanisms are a means of achieving the necessary dependability requirements. This paper investigates the benefits of a warm-standby replication mechanism in a Eucalyptus cloud computing environment. A hierarchical heterogeneous modeling approach is used to represent a redundant architecture and compare its availability to that of a non-redundant architecture. Both hardware and software failures are considered in the proposed analytical models. The results show an enhanced dependability for the proposed redundant system, as well as a decrease in the annual downtime. The results also demonstrate that the simple replacement of hardware by more reliable machines would not produce improvements in system availability to the same extent as would the fault tolerant approach.
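As a back-of-the-envelope illustration of why redundancy can pay off more than better hardware alone (the MTTF/MTTR figures are assumed and the paper's hierarchical models are not reproduced), the sketch below compares the steady-state availability and annual downtime of a single controller against an independent warm-standby pair.

```python
# Crude approximation: the redundant service is assumed down only when both
# (independent) units are down at once; switchover effects are ignored.
MTTF_HOURS = 2000.0   # assumed mean time to failure of one controller
MTTR_HOURS = 8.0      # assumed mean time to repair

a_single = MTTF_HOURS / (MTTF_HOURS + MTTR_HOURS)
a_redundant = 1.0 - (1.0 - a_single) ** 2

HOURS_PER_YEAR = 8760.0
for name, a in [("non-redundant", a_single), ("warm standby", a_redundant)]:
    downtime = (1.0 - a) * HOURS_PER_YEAR
    print(f"{name}: availability={a:.6f}, downtime={downtime:.2f} h/year")
```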

Proceedings Article
07 Oct 2012
TL;DR: The paper takes the position that a consistency benchmark should paint a comprehensive picture of the relationship between the storage system under consideration, the workload, the pattern of failures, and the consistency observed by clients as they execute the workload.
Abstract: Large-scale key-value storage systems sacrifice consistency in the interest of dependability (i.e., partition-tolerance and availability), as well as performance (i.e., latency). Such systems provide eventual consistency, which--to this point--has been difficult to quantify in real systems. Given the many implementations and deployments of eventually-consistent systems (e.g., NoSQL systems), attempts have been made to measure this consistency empirically, but they suffer from important drawbacks. For example, state-of-the art consistency benchmarks exercise the system only in restricted ways and disrupt the workload, which limits their accuracy. In this paper, we take the position that a consistency benchmark should paint a comprehensive picture of the relationship between the storage system under consideration, the workload, the pattern of failures, and the consistency observed by clients. To illustrate our point, we first survey prior efforts to quantify eventual consistency. We then present a benchmarking technique that overcomes the shortcomings of existing techniques to measure the consistency observed by clients as they execute the workload under consideration. This method is versatile and minimally disruptive to the system under test. As a proof of concept, we demonstrate this tool on Cassandra.
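A minimal illustration of the client-observed consistency idea, not the authors' benchmark: one client writes a new value while other clients poll the same key, and the time until every reader sees the new value is recorded. The write/read callables are placeholders to be wired to the actual store under test.

```python
# Sketch of measuring client-observed staleness after a write.
import time

def measure_staleness(write, read, key, value, readers, timeout_s=10.0):
    """Return seconds between a write and the moment all readers observe it."""
    t_write = time.monotonic()
    write(key, value)
    pending = set(readers)
    while pending and time.monotonic() - t_write < timeout_s:
        for r in list(pending):
            if read(r, key) == value:
                pending.discard(r)   # this reader now sees the new value
        time.sleep(0.01)
    return None if pending else time.monotonic() - t_write

# Usage idea: plug in the client calls of the store under test, e.g.
#   write = lambda k, v: session.put(k, v)       # hypothetical write path
#   read  = lambda replica, k: replica.get(k)    # hypothetical read path
```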

Proceedings ArticleDOI
02 Apr 2012
TL;DR: In this paper, the effect of distributed generation on protection concepts and approaches is examined, and special considerations are provided on ensuring security and dependability, as well as on protection parameterization and coordination.
Abstract: The National Institute of Standards and Technology, NIST, is assigned by the US Department of Energy, DoE, to drive Smart Grid developments and harmonization efforts in the power industry. Distributed Generation has been identified as one of the important areas for Smart Grid development. Multiple generation sources, bi-directional power flow, and power flow time coordination and management bring significant benefits and challenges for the existing and emerging power grids and microgrids. In particular, the effect of distributed generation on protection concepts and approaches needs to be understood and accounted for. This paper describes distributed generation concepts, applications and scenarios. Benefits and challenges are discussed and analyzed using a number of real-life examples. Special considerations are provided on ensuring security and dependability, as well as on protection parameterization and coordination.

Book ChapterDOI
18 Jun 2012
TL;DR: An adapted fault taxonomy suitable for autonomous robots is presented, and information is given on the nature, relevance, and impact of faults in robot systems that is beneficial for researchers dealing with fault mitigation and management in autonomous systems.
Abstract: Faults that occur in an autonomous robot system negatively affect its dependability. Achieving truly dependable autonomous systems requires dealing with these faults in some way. In order to do this efficiently, one has to have information on the nature of these faults. Very few studies on this topic have been conducted so far. In this paper we present the results of a survey on faults of autonomous robots that we conducted in the context of RoboCup. The major contribution of this paper is twofold. First, we present an adapted fault taxonomy suitable for autonomous robots. Second, we give information on the nature, relevance, and impact of faults in robot systems that is beneficial for researchers dealing with fault mitigation and management in autonomous systems.

Book
12 Jan 2012
TL;DR: Fundamentals of Dependable Computing for Software Engineers presents the essential elements of computer system dependability and provides a framework for engineers to reason and make decisions about software and its dependability.
Abstract: Fundamentals of Dependable Computing for Software Engineers presents the essential elements of computer system dependability. The book describes a comprehensive dependability-engineering process and explains the roles of software and software engineers in computer system dependability. Readers will learn: why dependability matters; what it means for a system to be dependable; how to build a dependable software system; and how to assess whether a software system is adequately dependable. The author focuses on the actions needed to reduce the rate of failure to an acceptable level, covering material essential for engineers developing systems with extreme consequences of failure, such as safety-critical systems, security-critical systems, and critical infrastructure systems. The text explores the systems engineering aspects of dependability and provides a framework for engineers to reason and make decisions about software and its dependability. It also offers a comprehensive approach to achieving software dependability and includes a bibliography of the most relevant literature. Emphasizing the software engineering elements of dependability, this book helps software and computer engineers in fields requiring ultra-high levels of dependability, such as avionics, medical devices, automotive electronics, weapon systems, and advanced information systems, construct software systems that are dependable and within budget and time constraints.

Journal ArticleDOI
TL;DR: This paper adopts a concise mathematical tool, stochastic Petri nets (SPNs), to analyze the dependability of control center networks in the smart grid, and presents a general model of control center networks that considers different backup strategies for critical components.
Abstract: As an indispensable infrastructure for the future, the smart grid is being implemented to save energy, reduce costs, and increase reliability. In the smart grid, control center networks have attracted a great deal of attention, because their security and dependability issues are critical to the entire smart grid. Several studies have been conducted in the field of smart grid security, but little work focuses on the dependability analysis of control center networks. In this paper, we adopt a concise mathematical tool, stochastic Petri nets (SPNs), to analyze the dependability of control center networks in the smart grid. We present a general model of control center networks that considers different backup strategies for critical components. With the general SPN model, we can measure dependability in terms of two metrics, i.e., reliability and availability, by analyzing the transient and steady-state probabilities simultaneously. To avoid the state-space explosion problem in computing, a state-space explosion avoidance method is proposed as well. Finally, we study a specific case to demonstrate the feasibility and efficiency of the proposed model in the dependability analysis of control center networks in the smart grid.
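The paper's SPN models are not reproduced here; as a tiny illustration of the underlying computation, the sketch below solves the balance equations of a two-state repairable component (assumed failure and repair rates) to obtain its steady-state availability, the kind of measure an SPN tool derives for much larger models.

```python
# Steady-state availability of a two-state (UP/DOWN) repairable component,
# computed from the CTMC balance equations with NumPy. Rates are assumed.
import numpy as np

lam = 1.0 / 1000.0   # assumed failure rate (per hour)
mu = 1.0 / 4.0       # assumed repair rate (per hour)

# generator matrix over states [UP, DOWN]
Q = np.array([[-lam, lam],
              [ mu, -mu]])

# solve pi Q = 0 together with the normalization sum(pi) = 1
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(f"steady-state availability = {pi[0]:.6f}")   # mu/(lam+mu) ~ 0.996016
```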

Journal ArticleDOI
TL;DR: A compiler-based methodology for facilitating the design of fault-tolerant embedded systems is presented, based on a generic microprocessor architecture that eases the implementation of software-based techniques by providing a uniform, isolated-from-target hardening core that allows the automatic generation of protected source code (hardened code).
Abstract: The protection of processor-based systems to mitigate the harmful effect of transient faults (soft errors) is gaining importance as technology shrinks. At the same time, for large segments of embedded markets, parameters like cost and performance continue to be as important as reliability. This paper presents a compiler-based methodology for facilitating the design of fault-tolerant embedded systems. The methodology is supported by an infrastructure that makes it easy to combine hardware/software soft-error mitigation techniques in order to best satisfy both usual design constraints and dependability requirements. It is based on a generic microprocessor architecture that facilitates the implementation of software-based techniques, providing a uniform isolated-from-target hardening core that allows the automatic generation of protected source code (hardened code). Two case studies are presented. In the first one, several software-based mitigation techniques are implemented and evaluated, showing the flexibility of the infrastructure. In the second one, a customized fault-tolerant embedded system is designed by combining selective protection in both hardware and software. Several trade-offs among performance, code size, reliability, and hardware costs have been explored. Results show the applicability of the approach. Among the developed software-based mitigation techniques, a novel selective version of the well-known SWIFT-R is presented.
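SWIFT-R itself is a compiler transformation applied at the instruction level, so the Python below is only a conceptual illustration of its recovery idea: keep three copies of a value and majority-vote before it is consumed, so that a single corrupted copy is masked and repaired.

```python
# Conceptual triplicate-and-vote illustration (not the paper's hardened code).

def protected(value):
    return [value, value, value]          # triplicated "register"

def vote(copies):
    a, b, c = copies
    good = a if a == b or a == c else b   # majority value (b == c in that case)
    copies[0] = copies[1] = copies[2] = good   # repair the corrupted copy
    return good

x = protected(42)
x[1] = 13              # simulated soft error flipping one copy
assert vote(x) == 42   # the fault is out-voted and the copies are repaired
print(x)               # [42, 42, 42]
```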

Journal ArticleDOI
TL;DR: A framework for the assessment of WSNs based on the automated generation of analytical models is presented; it hides modeling details and allows designers to focus on simulation results to drive their design choices.
Abstract: Wireless Sensor Networks (WSNs) are widely recognized as a promising solution to build next-generation monitoring systems. Their industrial uptake is however still compromised by the low level of trust in their performance and dependability. Whereas analytical models represent a valid means of assessing nonfunctional properties via simulation, their wide use is still limited by the complexity and dynamicity of WSNs, which lead to unaffordable modeling costs. To reduce this gap between research achievements and industrial development, this paper presents a framework for the assessment of WSNs based on the automated generation of analytical models. The framework hides modeling details, and it allows designers to focus on simulation results to drive their design choices. Models are generated starting from a high-level specification of the system and from a preliminary characterization of its fault-free behavior, using behavioral simulators. The benefits of the framework are shown in the context of two case studies, based on the wireless monitoring of civil structures.

Journal ArticleDOI
TL;DR: The role of middleware is addressed, focusing on how adaptation services can be used to improve dependability in instrumented cyber-physical systems based on the principles of "computational reflection."
Abstract: In this paper, we address the role of middleware in enabling robust and resilient cyber-physical systems (CPSs) of the future. In particular, we will focus on how adaptation services can be used to improve dependability in instrumented cyber-physical systems based on the principles of “computational reflection.” CPS environments incorporate a variety of sensing and actuation devices in a distributed architecture; such a deployment is used to create a digital representation of the evolving physical world and its processes for use by a broad range of applications. CPS applications, in particular, mission critical tasks, must execute dependably despite disruptions caused by failures and limitations in sensing, communications, and computation. This paper discusses a range of applications, their reliability needs, and potential dependability holes that can cause performance degradation and application failures. In particular, we distinguish between the notion of infrastructure and information dependability and illustrate the need to formally model and reason about a range of CPS applications and their dependability needs. Formal methods based tools can help us design meaningful cross-layer adaptation techniques at different system layers of the CPS environment and thereby achieve end-to-end dependability at both the infrastructure and information levels.

Book ChapterDOI
02 Nov 2012
TL;DR: This volume is a comprehensive overview of the state of the art in a field of continuously growing practical importance, and is aimed at academic and industrial researchers in these areas as well as graduate students and lecturers in related fields.
Abstract: The resilience of computing systems includes their dependability as well as their fault tolerance and security. It defines the ability of a computing system to perform properly in the presence of various kinds of disturbances and to recover from any service degradation. These properties are immensely important in a world where many aspects of our daily life depend on the correct, reliable and secure operation of often large-scale distributed computing systems. Wolter and her co-editors grouped the 20 chapters from leading researchers into seven parts: an introduction and motivating examples, modeling techniques, model-driven prediction, measurement and metrics, testing techniques, case studies, and conclusions. The core is formed by 12 technical papers, which are framed by motivating real-world examples and case studies, thus illustrating the necessity and the application of the presented methods. While the technical chapters are independent of each other and can be read in any order, the reader will benefit more from the case studies if he or she reads them together with the related techniques. The papers combine topics like modeling, benchmarking, testing, performance evaluation, and dependability, and aim at academic and industrial researchers in these areas as well as graduate students and lecturers in related fields. In this volume, they will find a comprehensive overview of the state of the art in a field of continuously growing practical importance.

Journal ArticleDOI
TL;DR: The main contribution of this work is that it exploits the semantics of the WS-AT services to minimize the use of Byzantine Agreement (BA), instead of applying BFT techniques naively, which would be prohibitively expensive.
Abstract: The Web Services Atomic Transactions (WS-AT) specification makes it possible for businesses to engage in standard distributed transaction processing over the Internet using Web Services technology. For such business applications, trustworthy coordination of WS-AT is crucial. In this paper, we explain how to render WS-AT coordination trustworthy by applying Byzantine Fault Tolerance (BFT) techniques. More specifically, we show how to protect the core services described in the WS-AT specification, namely, the Activation service, the Registration service, the Completion service and the Coordinator service, against Byzantine faults. The main contribution of this work is that it exploits the semantics of the WS-AT services to minimize the use of Byzantine Agreement (BA), instead of applying BFT techniques naively, which would be prohibitively expensive. We have incorporated our BFT protocols and mechanisms into an open-source framework that implements the WS-AT specification. The resulting BFT framework for WS-AT is useful for business applications that are based on WS-AT and that require a high degree of dependability, security, and trust.
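The paper's optimized WS-AT protocols are not shown here; the sketch below only illustrates the standard replication arithmetic that any BFT scheme builds on, namely that tolerating f Byzantine faults requires at least 3f + 1 replicas and a quorum of 2f + 1 matching replies.

```python
# Classical BFT sizing: n >= 3f + 1 replicas, quorum of 2f + 1 matching replies.

def bft_parameters(f):
    n = 3 * f + 1          # minimum number of replicas
    quorum = 2 * f + 1     # matching replies needed before trusting a result
    return n, quorum

for f in range(1, 4):
    n, q = bft_parameters(f)
    print(f"tolerate f={f} Byzantine replicas: n={n} replicas, quorum={q}")
```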

Journal ArticleDOI
TL;DR: The rescheduling component is designed as a middleware service that aims to increase the dependability of large-scale distributed systems and offers an improved mechanism for resource management.
Abstract: Scheduling is a key component for performance guarantees in the case of distributed applications running in large-scale heterogeneous environments. Another function of the scheduler in such systems is the implementation of resilience mechanisms to cope with possible faults. In this case resilience is best approached using dedicated rescheduling mechanisms. The performance of rescheduling is very important in the context of large-scale distributed systems and dynamic behavior. The paper proposes a generic rescheduling algorithm. The algorithm can use a wide variety of scheduling heuristics that can be selected by users in advance, depending on the system's structure. The rescheduling component is designed as a middleware service that aims to increase the dependability of large-scale distributed systems. The system was evaluated in a real-world implementation for a Grid system. The proposed approach supports fault tolerance and offers an improved mechanism for resource management. The evaluation of the proposed rescheduling algorithm was performed using modeling and simulation. We present experimental results confirming the performance and capabilities of the proposed rescheduling algorithm.
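A generic sketch of the rescheduling idea under a pluggable heuristic; the task and node structures and the min-load heuristic below are invented for illustration and are not the paper's middleware API.

```python
# When a node fails, its tasks are re-placed by whichever heuristic was chosen.

def min_load_heuristic(task, nodes):
    # pick the live node with the smallest current load
    return min(nodes, key=lambda n: sum(nodes[n]))

def reschedule(failed_node, nodes, heuristic=min_load_heuristic):
    orphaned = nodes.pop(failed_node, [])
    for task in orphaned:
        target = heuristic(task, nodes)
        nodes[target].append(task)
    return nodes

# node -> list of task costs currently assigned (hypothetical cluster state)
cluster = {"n1": [3, 4], "n2": [10], "n3": [2]}
print(reschedule("n2", cluster))   # n2's task moves to the least-loaded node
```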

Journal ArticleDOI
TL;DR: A procedure is described to obtain maintainability indicators for industrial devices, in order to achieve a better design with respect to maintainability requirements, to improve maintainability in a specific industrial environment, and to foresee maintainability problems due to possible changes in a device's operating conditions.

Proceedings ArticleDOI
03 Jun 2012
TL;DR: An instruction scheduling technique is presented that aims at improving the reliability of a software program given a user-provided tolerable performance overhead; the resulting reliability-driven instruction scheduler provides on average a 22% reduction in program failures.
Abstract: An instruction scheduling technique is presented that aims at improving the reliability of a software program given a user-provided tolerable performance overhead. A look-ahead-based heuristic schedules instructions by evaluating the reliability of dependent instructions while reducing the impact of spatial and temporal vulnerabilities of various processor components. Our reliability-driven instruction scheduler (implemented in the GCC compiler) provides on average a 22% reduction in program failures compared to the state of the art.
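A much-simplified sketch of the scheduling idea (the dependence graph and the exposure metric below are invented; the paper's GCC implementation and vulnerability models are not reproduced): among the instructions whose operands are ready, prefer the one that consumes the oldest pending value, shortening the window during which results sit exposed to soft errors.

```python
# Greedy list scheduler that favors consuming long-lived values early.
instrs = {
    "i1": {"deps": [], "reads": []},
    "i2": {"deps": [], "reads": []},
    "i3": {"deps": ["i1"], "reads": ["i1"]},
    "i4": {"deps": ["i2"], "reads": ["i2"]},
}

def schedule(instrs):
    done, order, produced_at, t = set(), [], {}, 0
    while len(done) < len(instrs):
        ready = [i for i in instrs if i not in done
                 and all(d in done for d in instrs[i]["deps"])]
        # exposure of an instruction = age of the oldest value it reads
        def exposure(i):
            ages = [t - produced_at[v] for v in instrs[i]["reads"]]
            return max(ages) if ages else -1
        nxt = max(ready, key=exposure)
        order.append(nxt)
        produced_at[nxt] = t
        done.add(nxt)
        t += 1
    return order

print(schedule(instrs))   # consumers are pulled close to their producers
```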

Book ChapterDOI
15 Dec 2012
TL;DR: Experimental results in an institute-wide cloud computing system show that the detection accuracy of the algorithm improves as it evolves and that it can achieve 92.1% detection sensitivity and 83.8% detection specificity, which makes it well suited for building highly dependable clouds.
Abstract: Modern production utility clouds contain thousands of computing and storage servers. Such a scale, combined with the ever-growing system complexity of their components and interactions, introduces a key challenge for anomaly detection and resource management for highly dependable cloud computing. Autonomic anomaly detection is a crucial technique for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. We propose a new hybrid self-evolving anomaly detection framework using one-class and two-class support vector machines. Experimental results in an institute-wide cloud computing system show that the detection accuracy of the algorithm improves as it evolves and that it can achieve 92.1% detection sensitivity and 83.8% detection specificity, which makes it well suited for building highly dependable clouds.
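A rough sketch of the hybrid idea on synthetic metrics (not the paper's cloud traces or its self-evolving framework): a one-class SVM flags suspicious samples when only mostly-normal data is available, and a two-class SVM refines the decision once confirmed labels accumulate.

```python
# Synthetic two-phase anomaly detection with one-class and two-class SVMs.
import numpy as np
from sklearn.svm import OneClassSVM, SVC

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.3, scale=0.05, size=(200, 2))    # e.g. CPU, I/O load
anomalous = rng.normal(loc=0.9, scale=0.05, size=(10, 2))

# phase 1: unsupervised detection trained on mostly-normal data
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(normal)
print(ocsvm.predict(anomalous))        # mostly -1 (flagged as anomalies)

# phase 2: supervised refinement with the labels confirmed so far
X = np.vstack([normal, anomalous])
y = np.array([0] * len(normal) + [1] * len(anomalous))
svc = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(svc.predict([[0.85, 0.92]]))     # -> [1], anomalous
```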

Proceedings ArticleDOI
25 Jun 2012
TL;DR: This paper presents a novel approach to assess time coalescence techniques, based on the use of automatically generated logs, and focuses on supercomputer logs, due to the increasing importance of automatic event log analysis for these systems.
Abstract: This paper presents a novel approach to assess time coalescence techniques. These techniques are widely used to reconstruct the failure process of a system and to estimate dependability measurements from its event logs. The approach is based on the use of automatically generated logs, accompanied by exact knowledge of the ground truth on the failure process. The assessment is conducted by comparing the presumed failure process, reconstructed via coalescence, with the ground truth. We focus on supercomputer logs, due to the increasing importance of automatic event log analysis for these systems. Experimental results show how the approach makes it possible to compare different time coalescence techniques and to identify their weaknesses with respect to given system settings. In addition, the results revealed an interesting correlation between errors caused by the coalescence and errors in the estimation of dependability measurements.
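A minimal sketch of the kind of coalescence technique being assessed (the window and timestamps are invented): events whose timestamps fall within W seconds of the previous event are merged into the same presumed failure, which is precisely where the grouping errors studied in the paper can arise.

```python
# Fixed-window time coalescence of log event timestamps.

def coalesce(timestamps, window_s=60):
    """Group sorted event timestamps into tuples using a fixed time window."""
    groups = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][-1] <= window_s:
            groups[-1].append(t)     # same presumed failure as the previous event
        else:
            groups.append([t])       # start a new presumed failure
    return groups

events = [0, 10, 35, 500, 505, 2000]
print(coalesce(events))              # [[0, 10, 35], [500, 505], [2000]]
```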

Proceedings ArticleDOI
25 Jun 2012
TL;DR: NINEPIN is a non-invasive and energy-efficient performance isolation mechanism that mitigates performance interference among heterogeneous applications hosted in virtualized servers and is capable of increasing data center utility.
Abstract: A virtualized data center faces the important but challenging issue of performance isolation among heterogeneous customer applications. Performance interference resulting from the contention for shared resources among co-located virtual servers has a significant impact on the dependability of application QoS. We propose and develop NINEPIN, a non-invasive and energy-efficient performance isolation mechanism that mitigates performance interference among heterogeneous applications hosted in virtualized servers. It is capable of increasing data center utility. Its novel hierarchical control framework aligns performance isolation goals with the incentive to regulate the system towards optimal operating conditions. The framework combines machine learning based self-adaptive modeling of performance interference and energy consumption, utility optimization based performance targeting, and robust model predictive control based target tracking. We implement NINEPIN on a virtualized HP ProLiant blade server hosting SPEC CPU2006 and RUBiS benchmark applications. Experimental results demonstrate that NINEPIN outperforms a representative performance isolation approach, Q-Clouds, improving overall system utility and reducing energy consumption.