Showing papers on "Dependability published in 2003"

PDF

Open Access

Book•

[...]

Rogério de Lemos¹, Cristina Gacek², Alexander Romanovsky²•Institutions (2)

University of Kent¹, Newcastle University²

21 Aug 2003

TL;DR: In this paper, the authors present a survey of work on software architectures and dependability, focusing on the work at the 2007 International Conference on Dependable Systems and Networks and the Third Workshop on the Role of Software Architecture for Testing and Analysis (ROSATEA 2007).

...read moreread less

Abstract: As software systems become ubiquitous, the issues of dependability become more and more crucial. Given that solutions to these issues must be considered from the very beginning of the design process, it is reasonable that dependability is addressed at the architectural level. This book was born of an effort to bring together the research communities of software architectures and dependability. This state-of-the-art survey contains expanded and peer-reviewed papers based on the carefully selected contributions to two workshops: the Workshop on Architecting Dependable Systems (WADS 2007), organized at the 2007 International Conference on Dependable Systems and Networks (DSN 2007), held in Edinburgh, UK in June 2007 and the Third Workshop on the Role of Software Architecture for Testing and Analysis (ROSATEA 2007) organized as part of a federated conference on Component-Based Software Engineering and Software Architecture (CompArch 2007), held in Medford, MA, USA in July 2007. It also contains a number of invited papers written by recognized experts in the area. The 14 papers are organized in topical sections on critical infrastructures, rigorous design/fault tolerance, and verification and validation.

...read moreread less

208 citations

Book Chapter•DOI•

Intrusion-tolerant architectures: concepts and design

[...]

Paulo Veríssimo, Nuno Neves, Miguel Correia

01 Jan 2003-Lecture Notes in Computer Science

TL;DR: The paper describes the fundamental concepts behind IT, tracing their connection with classical fault tolerance and security, and discusses the main strategies and mechanisms for architecting IT systems, and reports on recent advances on distributed IT system architectures.

...read moreread less

Abstract: There is a significant body of research on distributed computing architectures, methodologies and algorithms, both in the fields of fault tolerance and security. Whilst they have taken separate paths until recently, the problems to be solved are of similar nature. In classical dependability, fault tolerance has been the workhorse of many solutions. Classical security-related work has on the other hand privileged, with few exceptions, intrusion prevention. Intrusion tolerance (IT) is a new approach that has slowly emerged during the past decade, and gained impressive momentum recently. Instead of trying to prevent every single intrusion, these are allowed, but tolerated: the system triggers mechanisms that prevent the intrusion from generating a system security failure. The paper describes the fundamental concepts behind IT, tracing their connection with classical fault tolerance and security. We discuss the main strategies and mechanisms for architecting IT systems, and report on recent advances on distributed IT system architectures.

...read moreread less

202 citations

Journal Article•DOI•

Integration of management systems: focus on safety in the nuclear industry

[...]

Ingo Beckmerhagen, H.P. Berg, Stanislav Karapetrovic, W.O. Willborn

01 Mar 2003-International Journal of Quality & Reliability Management

TL;DR: In this paper, the integration of a safety management system within an integrated management system (IMS) framework is discussed, by means of an example from the nuclear industry, where safety is of such a paramount importance in nuclear plants, it makes sense to integrate safety requirements within a quality management system.

...read moreread less

Abstract: The need to create integrated management systems (IMS) in order to handle the proliferation of management system standards is undeniable. There is also evidence in literature and practice that organizations are slowly starting to tackle the IMS issue, mainly by putting an integrated quality and environmental management system in place. Due to the existence of internationally accepted standards covering these two fields, namely ISO 9000 and 14000 series, such a scope of integration comes as no surprise. However, can and should other systems, for example, the ones for occupational health and safety, dependability, social accountability or complaints handling, be included? What would such an integration mean for the existing organizational structures and how could be it be accomplished? When we attempt to address IMS issues, do we really talk about the integration of standards, systems, both or neither? These and other important questions regarding IMS are addressed here. By means of an example from the nuclear industry, this paper focuses in particular on the integration of a safety management system within an IMS framework. Since safety is of such a paramount importance in nuclear plants, it makes sense to integrate safety requirements within a quality management system, as a possible first step in the integration efforts. Subsequently, other function‐specific requirements may be included to form a “real” IMS.

...read moreread less

177 citations

Proceedings Article•DOI•

Towards a rigorous definition of information system survivability

[...]

John C. Knight¹, Elisabeth A. Strunk¹, Kevin Sullivan¹•Institutions (1)

University of Virginia¹

22 Apr 2003

TL;DR: This paper presents the basis for a rigorous definition of survivability, a precise notion of what forms of degraded service are acceptable to users, under what circumstances each form is most useful, and the fraction of time such degraded service levels are acceptable.

...read moreread less

Abstract: The computer systems that provide the information underpinnings for critical infrastructure applications, both military and civilian, are essential to the operation of those applications. Failure of the information systems can cause a major loss of service, and so their dependability is a major concern. Current facets of dependability, such as reliability and availability, do not address the needs of critical information systems adequately because they do not include the notion of degraded service as an explicit requirement. What is needed is a precise notion of what forms of degraded service are acceptable to users, under what circumstances each form is most useful, and the fraction of time such degraded service levels are acceptable. This concept is termed survivability. In this paper, we present the basis for a rigorous definition of survivability and an example of its use.

...read moreread less

151 citations

Proceedings Article•DOI•

Autonomic Computing - a means of achieving dependability?

[...]

Roy Sterritt¹, David W. Bustard¹•Institutions (1)

Ulster University¹

07 Apr 2003

TL;DR: The purpose of the paper is to consider how autonomic computing can provide a framework for dependability.

...read moreread less

Abstract: Autonomic computing is emerging as a significant new approach to the design of computing systems. Its goal is the development of systems that are self-configuring, self-healing, self-protecting and self-optimizing. Dependability is a long-standing desirable property of all computer-based systems. The purpose of the paper is to consider how autonomic computing can provide a framework for dependability.

...read moreread less

144 citations

Journal Article•DOI•

Flexibility in logistic systems—modeling and performance evaluation

[...]

Miryam Barad¹, D Even Sapir•Institutions (1)

Tel Aviv University¹

11 Aug 2003-International Journal of Production Economics

TL;DR: A bottom-up framework for flexibility in logistic systems is suggested before one among the proposed system logistic flexibility types, denoted here trans-routing flexibility, is quantitatively investigated.

...read moreread less

141 citations

Book Chapter•DOI•

Increasing system dependability through architecture-based self-repair

[...]

David Garlan¹, Shang-Wen Cheng¹, Bradley Schmerl¹•Institutions (1)

Carnegie Mellon University¹

01 Jan 2003-Lecture Notes in Computer Science

TL;DR: In this paper, the authors describe a partial solution in which stylized architectural design models are maintained at run time as a vehicle for automatically monitoring system behavior, for detecting when that behavior falls outside of acceptable ranges, and for deciding on a high-level repair strategy.

...read moreread less

Abstract: One increasingly important technique for improving system dependability is to provide mechanisms for a system to adapt at run time in order to accommodate varying resources, system errors, and changing requirements. For such "self-repairing" systems one of the hard problems is determining when a change is needed, and knowing what kind of adaptation is required. In this paper we describe a partial solution in which stylized architectural design models are maintained at run time as a vehicle for automatically monitoring system behavior, for detecting when that behavior falls outside of acceptable ranges, and for deciding on a high-level repair strategy. The main innovative feature of the approach is the ability to specialize a generic run time adaptation framework to support particular architectural styles and properties of interest. Specifically, a formal description of an architectural style defines for a family of related systems the conditions under which adaptation should be considered, provides an analytic basis for detecting anomalies, and serves as a basis for developing sound repair strategies.

...read moreread less

133 citations

Journal Article•DOI•

AQuA: an adaptive architecture that provides dependable distributed objects

[...]

Yansong Ren¹, David E. Bakken², T. Courtney³, Michel Cukier⁴, David A. Karr, Paul Rubel⁵, C. Sabnis, William H. Sanders³, Richard E. Schantz⁵, Mouna Seri³ - Show less +6 more•Institutions (5)

Bell Labs¹, Washington State University², University of Illinois at Urbana–Champaign³, University of Maryland, College Park⁴, BBN Technologies⁵

01 Jan 2003-IEEE Transactions on Computers

TL;DR: The AQuA architecture is described and the active replication pass-first scheme is presented, in detail, and results from the study of fault detection, recovery, and blocking times are presented.

...read moreread less

Abstract: Building dependable distributed systems from commercial off-the-shelf components is of growing practical importance. For both cost and production reasons, there is interest in approaches and architectures that facilitate building such systems. The AQuA architecture is one such approach; its goal is to provide adaptive fault tolerance to CORBA applications by replicating objects. The AQuA architecture allows application programmers to request desired levels of dependability during applications' runtimes. It provides fault tolerance mechanisms to ensure that a CORBA client can always obtain reliable services, even if the CORBA server object that provides the desired services suffers from crash failures and value faults. AQuA includes a replicated dependability manager that provides dependability management by configuring the system in response to applications' requests and changes in system resources due to faults. It uses Maestro/Ensemble to provide group communication services. It contains a gateway to intercept standard CORBA IIOP messages to allow any standard CORBA application to use AQuA. It provides different types of replication schemes to forward messages reliably to the remote replicated objects. All of the replication schemes ensure strong, data consistency among replicas. This paper describes the AQuA architecture and presents, in detail, the active replication pass-first scheme. In addition, the interface to the dependability manager and the design of the dependability manager replication are also described. Finally, we describe performance measurements that were conducted for the active replication pass-first scheme, and we present results from our study of fault detection, recovery, and blocking times.

...read moreread less

130 citations

Proceedings Article•DOI•

Testing of droplet-based microelectrofluidic systems

[...]

Fei Su¹, Sule Ozev, Krishnendu Chakrabarty•Institutions (1)

Duke University¹

30 Sep 2003

TL;DR: This paper presents a costeffective concurrent test methodology for droplet-based microelectrofluidic systems, presents a classification of catastrophic and parametric faults in such systems and shows how faults can be detected by electrostatically controlling and tracking droplet motion.

...read moreread less

Abstract: Composite microsystems that integrate mechanical and fluidic components are fast emerging as the next generation of system-on-chip designs. As these systems become widespread in safety-critical biomedical applications, dependability emerges as a critical performance parameter. In this paper, we present a costeffective concurrent test methodology for droplet-based microelectrofluidic systems. We present a classification of catastrophic and parametric faults in such systems and show how faults can be detected by electrostatically controlling and tracking droplet motion. We then present tolerance analysis based on Monte-Carlo simulations to characterize the impact of parameter variations on system performance. To the best of our knowledge, this constitutes the first attempt to define a fault model and to develop a test methodology for droplet-based microelectrofluidic systems.

...read moreread less

127 citations

Proceedings Article•DOI•

JAGR: an autonomous self-recovering application server

[...]

George Candea¹, Emre Kiciman¹, S. Zhang¹, P. Keyani¹, Armando Fox¹ - Show less +1 more•Institutions (1)

Stanford University¹

25 Jun 2003

TL;DR: This paper demonstrates that the dependability of generic, evolving J2EE applications can be enhanced through a combination of a few recovery-oriented techniques, resulting in JAGR-JBoss with application-generic recovery - a self-recovering execution platform.

...read moreread less

Abstract: This paper demonstrates that the dependability of generic, evolving J2EE applications can be enhanced through a combination of a few recovery-oriented techniques. Our goal is to reduce downtime by automatically and efficiently recovering from a broad class of transient software failures without having to modify applications. We describe here the integration of three new techniques into JBoss, an open-source J2EE application server. The resulting system is JAGR-JBoss with application-generic recovery - a self-recovering execution platform. JAGR combines application-generic failure-path inference (AFPI), path-based failure detection, and micro-reboots. AFPI uses controlled fault injection and observation to infer paths that faults follow through a J2EE application. Path-based failure detection uses tagging of client requests and statistical analysis to identify anomalous component behavior. Micro-reboots are fast reboots we perform at the sub-application level to recover components from transient failures; by selectively rebooting only those components that are necessary to repair the failure, we reduce recovery time. These techniques are designed to be autonomous and application-generic, making them well suited to the rapidly changing software of Internet services.

...read moreread less

115 citations

Book Chapter•DOI•

A robust and scalable peer-to-peer gossiping protocol

[...]

Spyros Voulgaris¹, Márk Jelasity², Maarten van Steen¹•Institutions (2)

VU University Amsterdam¹, University of Bologna²

14 Jul 2003-Lecture Notes in Computer Science

TL;DR: This paper presents the newscast model and reports on experiments using a Java implementation, demonstrating that the system indeed shows the scalability and dependability properties predicted by the previous theoretical and simulation results.

...read moreread less

Abstract: The newscast model is a general approach for communication in large agent-based distributed systems. The two basic services—membership management and information dissemination—are implemented by the same epidemic-style protocol. In this paper we present the newscast model and report on experiments using a Java implementation. The experiments involve communication in a large, wide-area cluster computer. By analysis of the outcome of the experiments we demonstrate that the system indeed shows the scalability and dependability properties predicted by our previous theoretical and simulation results.

...read moreread less

Book Chapter•DOI•

A dependability benchmark for OLTP application environments

[...]

Marco Vieira¹, Henrique Madeira²•Institutions (2)

Polytechnic Institute of Coimbra¹, University of Coimbra²

09 Sep 2003

TL;DR: This paper proposes a dependability benchmark for OLTP systems that uses the workload of the TPC-C performance benchmark and specifies the measures and all the steps required to evaluate both the performance and key dependability features ofOLTP systems, with emphasis on availability.

...read moreread less

Abstract: The ascendance of networked information in our economy and daily lives has increased the awareness of the importance of dependability features. OLTP (On-Line Transaction Processing) systems constitute the kernel of the information systems used today to support the daily operations of most of the business. Although these systems comprise the best examples of complex business-critical systems, no practical way has been proposed so far to characterize the impact of faults in such systems or to compare alternative solutions concerning dependability features. This paper proposes a dependability benchmark for OLTP systems. This dependability benchmark uses the workload of the TPC-C performance benchmark and specifies the measures and all the steps required to evaluate both the performance and key dependability features of OLTP systems, with emphasis on availability. This dependability benchmark is presented through a concrete example of benchmarking the performance and dependability of several different transactional systems configurations. The effort required to run the dependability benchmark is also discussed in detail.

...read moreread less

Book Chapter•DOI•

Safeguarding SCADA Systems with Anomaly Detection

[...]

John Bigham¹, David Gamez¹, Ning Lu¹•Institutions (1)

Queen Mary University of London¹

21 Sep 2003

TL;DR: The performance of invariant induction and n- gram anomaly-detectors will be compared and plans for taking this work further by integrating the output from several anomaly- detecting techniques using Bayesian networks are outlined.

...read moreread less

Abstract: This paper will show how the accuracy and security of SCADA systems can be improved by using anomaly detection to identify bad values caused by attacks and faults. The performance of invariant induction and n- gram anomaly-detectors will be compared and this paper will also outline plans for taking this work further by integrating the output from several anomaly- detecting techniques using Bayesian networks. Although the methods outlined in this paper are illustrated using the data from an electricity network, this research springs from a more general attempt to improve the security and dependability of SCADA systems using anomaly detection.

...read moreread less

Proceedings Article•DOI•

Coordinated forward error recovery for composite Web services

[...]

Valérie Issarny¹, Ferda Tartanoglu¹, Alexander Romanovsky², Nicole Lévy•Institutions (2)

French Institute for Research in Computer Science and Automation¹, Newcastle University²

20 Oct 2003

TL;DR: This paper defines the notion of Web Service Composition Action (WSCA), based on the Coordinated Atomic Action concept, which allows structuring composite Web services in terms of dependable actions, and introduces a framework enabling the development of Composite Web services based on WSCAs, consisting of an XML-based language for the specification of WSCA.

...read moreread less

Abstract: This paper proposes a solution based on forward error recovery, oriented towards providing dependability of composite Web services. While exploiting their possible support for fault tolerance (e.g., transactional support at the level of each service), the proposed solution has no impact on the autonomy of the individual Web services, our solution lies in system structuring in terms of co-operative atomic actions that have a well-defined behavior, both in the absence and in the presence of service failures. More specifically, we define the notion of Web Service Composition Action (WSCA), based on the Coordinated Atomic Action concept, which allows structuring composite Web services in terms of dependable actions. Fault tolerance can then be obtained as an emergent property of the aggregation of several potentially non-dependable services. We further introduce a framework enabling the development of composite Web services based on WSCAs, consisting of an XML-based language for the specification of WSCAs.

...read moreread less

Dissertation•

Timely Actions in the Presence of Uncertain Timeliness

[...]

António Casimiro

01 Dec 2003

TL;DR: This paper proposes an architectural construct that takes a generic approach to the problem of programming in the presence of uncertain timeliness, called the Timely Computing Base, TCB, and discusses the application programming interface for accessing the TCB services.

...read moreread less

Abstract: Real-time behavior is specified in compliance with timeliness requirements, which in essence calls for synchronous system models. However systems often rely on unpredictable and unreliable infrastructures, that suggest the use of asynchronous models. Several models have been proposed to address this issue. We propose an architectural construct that takes a generic approach to the problem of programming in the presence of uncertain timeliness. We assume the existence of a component, capable of executing timing functions, which helps applications with varying degrees of synchrony to behave reliably despite the occurrence of timing failures. We call this component the Timely Computing Base, TCB. This paper describes the TCB architecture and model, and discusses the application programming interface for accessing the TCB services. The implementation of the TCB services uses fail-awareness techniques to increase the coverage of TCB properties.

...read moreread less

Elements of the Self-Healing System Problem Space

[...]

Philip Koopman¹•Institutions (1)

Carnegie Mellon University¹

01 Jan 2003

TL;DR: This paper proposes a taxonomy for describing the problem space for self-healing systems including fault models, system responses, system completeness, and design context to help researchers understand what aspects of the system dependability problem they are (and aren’t) addressing with specific research projects.

...read moreread less

Abstract: One of the potential approaches to achieving dependable system operation is to incorporate so-called “self-healing” mechanisms into system architectures and implementations. A previous workshop on this topic exposed a wide diversity of researcher perspectives on what self-healing systems really are. This paper proposes a taxonomy for describing the problem space for self-healing systems including fault models, system responses, system completeness, and design context. It is hoped that this taxonomy will help researchers understand what aspects of the system dependability problem they are (and aren’t) addressing with specific research projects.

...read moreread less

Book Chapter•DOI•

Dependability in the web services architecture

[...]

Ferda Tartanoglu¹, Valérie Issarny¹, Alexander Romanovsky², Nicole Lévy³•Institutions (3)

French Institute for Research in Computer Science and Automation¹, Newcastle University², Versailles Saint-Quentin-en-Yvelines University³

01 Jan 2003-Lecture Notes in Computer Science

TL;DR: In this paper, the authors present a survey of base fault tolerance mechanisms, considering both backward and forward error recovery mechanisms, and show how they are adapted to deal with the specifics of the Web in the light of ongoing work in the area.

...read moreread less

Abstract: The Web services architecture is expected to play a prominent role in developing next generation distributed systems. This chapter discusses how to build dependable systems based on the Web services architecture. More specifically, it surveys base fault tolerance mechanisms, considering both backward and forward error recovery mechanisms, and shows how they are adapted to deal with the specifics of the Web in the light of ongoing work in the area. Existing solutions, targeting the development of dependable composite Web services, may be subdivided into two categories that are respectively related to the specification of Web services composition and to the design of dedicated distributed protocols.

...read moreread less

Journal Article•DOI•

Using run-time reconfiguration for fault injection applications

[...]

L. Antoni, Regis Leveugle, Bela Feher¹•Institutions (1)

Budapest University of Technology and Economics¹

27 Oct 2003-IEEE Transactions on Instrumentation and Measurement

TL;DR: An alternative approach to Fault injection techniques is proposed, based on hardware emulation and run-time reconfiguration, which is carried out by direct modifications in the bitstream, so that re-synthesizing the description can be avoided.

...read moreread less

Abstract: The probability of faults occurring in the field increases with the evolution of the CMOS technologies. It becomes, therefore, increasingly important to analyze the potential consequences of such faults on the applications. Fault injection techniques have been used for years to validate the dependability level of circuits and systems, and approaches have been proposed to analyze very early in the design process the functional consequences of faults. These approaches are based on the high-level description of the circuit or system and classically use simulation. Recently, hardware emulation on FPGA-based systems has been proposed to accelerate the experiments; in that case, an important characteristic is the time to reconfigure the hardware, including re-synthesis, place and route, and bitstream downloading. In this paper, an alternative approach is proposed, based on hardware emulation and run-time reconfiguration. Fault injection is carried out by direct modifications in the bitstream, so that re-synthesizing the description can be avoided. Moreover, with some FPGA families (e.g., Virtex or AT6000), it is possible to reconfigure the hardware partially at run-time. Important time-savings can be achieved when taking advantage of these features, since the injection of a fault necessitates the reconfiguration of only a few resources of the device. The injection process is detailed for several types of faults and experimental results are discussed.

...read moreread less

Journal Article•DOI•

Managing multiple and dependable identities

[...]

Ernesto Damiani¹, S. De Capitani di Vimercati¹, Pierangela Samarati¹•Institutions (1)

University of Milan¹

01 Nov 2003-IEEE Internet Computing

TL;DR: This work investigates the problems inherent in identity management, emphasizing the requirements for multiplicity and dependability, and enables a new generation of advanced MDDI services on the global information infrastructure.

...read moreread less

Abstract: Digital management of multiple robust identities is a crucial issue in developing the next generation of distributed applications. Our daily activities increasingly rely on remote resources and services - specifically, on interactions between different, remotely located parties. Because these parties might (and sometimes should) know little about each other, digital identities - electronic representations of individuals' or organizations' sensitive information - help introduce them to each other and control the amount of information transferred. In its broadest sense, identity management encompasses definitions and life-cycle management for digital identities and profiles, as well as environments for exchanging and validating such information. Digital identity management - especially support for identity dependability and multiplicity - is crucial for building and maintaining trust relationships in today's globally interconnected society. We investigate the problems inherent in identity management, emphasizing the requirements for multiplicity and dependability. We enable a new generation of advanced MDDI services on the global information infrastructure.

...read moreread less

Journal Article•DOI•

Deadlock-free dynamic reconfiguration schemes for increased network dependability

[...]

Timothy Mark Pinkston¹, Ruoming Pang, José Duato•Institutions (1)

University of Southern California¹

01 Aug 2003-IEEE Transactions on Parallel and Distributed Systems

TL;DR: Efficient and deadlock-free dynamic reconfiguration schemes that are applicable to routing algorithms and networks which use wormhole, virtual cut-through, or store-and-forward switching, combined with hard link-level flow control are proposed.

...read moreread less

Abstract: Network-based parallel computing systems often require the ability to reconfigure the routing algorithm to reflect changes in network topology if and when voluntary or involuntary changes occur. The process of reconfiguring a network's routing capabilities may be very inefficient and/or deadlock-prone if not handled properly. We propose efficient and deadlock-free dynamic reconfiguration schemes that are applicable to routing algorithms and networks which use wormhole, virtual cut-through, or store-and-forward switching, combined with hard link-level flow control. One requirement is that the network architecture use virtual channels or duplicate physical channels for deadlock-handling as well as performance purposes. The proposed schemes do not impede the injection, transmission, or delivery of user packets during the reconfiguration process. Instead, they provide uninterrupted service, increased availability/reliability, and improved overall quality-of-service support as compared to traditional techniques based on static reconfiguration.

...read moreread less

A Survey of Dependability Issues in Mobile Wireless Networks

[...]

C. Basile, Marc-Olivier Killijian, David Powell

01 Jan 2003

Proceedings Article•DOI•

A framework for scalable analysis and design of system-wide graceful degradation in distributed embedded systems

[...]

Charles Shelton¹, Philip Koopman, William Nace•Institutions (1)

Carnegie Mellon University¹

04 Aug 2003

TL;DR: A framework that will enable scalable analysis and design of graceful degradation in distributed embedded systems and improves the system dependability by identifying architectural properties that enhance a system's ability to gracefully degrade is presented.

...read moreread less

Abstract: We present a framework that will enable scalable analysis and design of graceful degradation in distributed embedded systems. We define graceful degradation in terms of utility. A system that gracefully degrades suffers a proportional loss of system utility as individual software and hardware components fail. However, explicitly designing a system to gracefully degrade; i.e. handle all possible combinations of component failures, becomes impractical for systems with more than a few components. We avoid this exponential complexity of component combinations by exploiting the structure of the system architecture to partition components into subsystems. We view each subsystem as a configuration of components that changes when components are removed or added. Thus, a subsystem's utility changes when components fail or are repaired. We then view the system as a composition of subsystems that each contribute to overall system utility. We demonstrate the scalability of our framework by applying it to an example automobile navigation system. Using this framework, we improve the system dependability by identifying architectural properties that enhance a system's ability to gracefully degrade.

...read moreread less

Proceedings Article•DOI•

An Accurate Analysis of the Effects of Soft Errors in the Instruction and Data Caches of a Pipelined Microprocessor

[...]

Maurizio Rebaudengo¹, Matteo Sonza Reorda¹, Massimo Violante¹•Institutions (1)

Polytechnic University of Turin¹

03 Mar 2003

TL;DR: An efficient simulation-based fault injection environment is developed and an extensive analysis of the effects of soft errors on a processor running several applications under different memory configurations is presented.

...read moreread less

Abstract: Instruction and data caches are well known architectural solutions that allow significant improvement in the performance of high-end processors. Due to their sensitivity to soft errors, they are often disabled in safety critical applications, thus sacrificing performance for improved dependability. In this paper, we report an accurate analysis of the effects of soft errors in the instruction and data caches of a soft core implementing the SPARC architecture. Thanks to an efficient simulation-based fault injection environment we developed, we are able to present in this paper an extensive analysis of the effects of soft errors on a processor running several applications under different memory configurations. The procedure we followed allows the precise computation of the processor failure rate when the cache is enabled even without resorting to expensive radiation experiments.

...read moreread less

Fault Tolerance within a Grid Environment

[...]

Paul Townend¹, Jie Xu¹•Institutions (1)

Durham University¹

01 Jan 2003

TL;DR: This approach combines a replication-based fault tolerance approach with both dynamic prioritization and dynamic scheduling, which concludes that timing, omission and interaction faults may become more prevalent in Grid applications than is the case in traditional distributed systems.

...read moreread less

Abstract: Fault tolerance is an important property in Grid computing as the dependability of individual Grid resources may not be able to be guaranteed; also as resources are used outside of organizational boundaries, it becomes increasingly difficult to guarantee that a resource being used is not malicious in some way. As part of the e-Demand project at the University of Durham we are seeking to develop both an improved fault model for Grid computing and a method for providing fault tolerance for Grid applications that will provide protection against both malicious and erroneous services. We have firstly begun to investigate whether the traditional distributed systems fault model can be readily applied to Grid computing, or whether improvements and alterations need to be made. From our initial investigation, we have concluded that timing, omission and interaction faults may become more prevalent in Grid applications than is the case in traditional distributed systems. From this initial fault model, we have begun to develop an approach for fault tolerance based on the idea of job replication, as anomalous results (either maliciously altered or simply wrong) should be caught at the voting stage. This approach combines a replication-based fault tolerance approach with both dynamic prioritization and dynamic scheduling.

...read moreread less

Book Chapter•DOI•

Stochastic dependability analysis of system architecture based on UML models

[...]

István Majzik¹, András Pataricza¹, Andrea Bondavalli²•Institutions (2)

Budapest University of Technology and Economics¹, University of Florence²

01 Jan 2003-Lecture Notes in Computer Science

TL;DR: The work in this paper defines a dependability modeling and model based evaluation approach based on UML to be used in the early phases of the system design to capture system dependability attributes like reliability and availabiUty, thus providing guidelines for the choice among different architectural and design solutions.

...read moreread less

Abstract: The work in this paper is devoted to the definition of a dependability modeling and model based evaluation approach based on UML models. It is to be used in the early phases of the system design to capture system dependability attributes like reliability and availabiUty, thus providing guidelines for the choice among different architectural and design solutions. We show how structural UML diagrams can be processed to filter out the dependability related information and how a system-wide dependability model is constructed. Due to the modular construction, this model can be refined later as more detailed information becomes available. We discuss the model refinement based on the General Resource Model, an extension of UML. We show that the dependability model can be constructed automatically by using graph transformation techniques.

...read moreread less

Proceedings Article•DOI•

Assessing the Dependability of SOAP RPC-Based Web Services by Fault Injection

[...]

Looker¹, Jie Xu•Institutions (1)

Durham University¹

01 Oct 2003

TL;DR: This paper implements an extendable fault injector framework, and undertaken some proof of concept experiments with a system based around Apache SOAP and Apache Tomcat, to derive a new method and fault model for testing web services.

...read moreread less

Abstract: This paper presents our research on devising a dependability assessment method for SOAP-based Web Services using network level fault injection We compare existing DCE middleware dependability testing research with the requirements of testing SOAP RPC-based applications and derive a new method and fault model for testing web services From this we have implemented an extendable fault injector framework and undertaken some proof of concept experiments with a system based around Apache SOAP and Apache Tomcat We also present results from our initial experiments, which uncovered a discrepancy within our system We finally detail future research, including plans to adapt this fault injector framework from the stateless environment of a standard web service to the stateful environment of an OGSA service

...read moreread less

Proceedings Article•DOI•

Dependability enhancement for IEEE 802.11 wireless LAN with redundancy techniques

[...]

Dongyan Chen¹, Sachin Garg², C. Kintala², Kishor S. Trivedi¹•Institutions (2)

Duke University¹, Avaya²

22 Jun 2003

TL;DR: The paper outlines the details of how redundancy may be implemented by making enhancements to the basic 802.11 channel access protocol and presents the reliability, availability and survivability analysis of the two configurations and shows that the redundancy schemes demonstrate significant improvement in connection dependability over Mobile terminal.

...read moreread less

Abstract: The presence of physical obstacles and radio interference results in the so called “shadow regions” in wireless networks. When a mobile station roams into a shadow region, it loses its network connectivity. In cellular networks, in order to minimize the connection unreliability, careful cell planning is required to prevent the occurrance of the shadow regions in the first place. In 802.11b/g wireless LANs, however, due to the limited frequency spectrum, it is not always possible to prevent a shadow region by adding another cell at a different frequency. Our contribution in this paper is to propose the alternate approach of tolerating the existence of “shadow regions” as opposed to prevention in order to enhance the connection dependability. A redundant access point (AP) is placed in the shadow region to serve the mobile stations which roam into that region. Since the redundant AP operates on the same frequency as the primary AP, it does not constitute a separate cell. In fact, the primary and the secondary AP communicate to grant medium access to stations within the shadow region. We consider two configurations, which differ in how the two APs communicate with each other. In the first, the secondary AP is connected to the same distribution system as the primary AP. In the second, the secondary AP acts as a wireless forwarding bridge for traffic to/from the mobile stations in the shadow region to the primary AP. The paper outlines the details of how redundancy may be implemented by making enhancements to the basic 802.11 channel access protocol. To evaluate the dependability of the network under study, we present the reliability, availability and survivability analysis of the two configurations and compare them with the scheme with no redundancy. With numerical examples, we show that the redundancy schemes demonstrate significant improvement in connection dependability over Mobile terminal

...read moreread less

Proceedings Article•DOI•

Multi-legged arguments:the impact of diversity upon confidence in dependability arguments

[...]

Robin Bloomfield, Bev Littlewood¹•Institutions (1)

City University London¹

22 Jun 2003

TL;DR: Diverse arguments are examined ‐ in particular arguments to support claims about system dependability (reliability, safety) and whether the probabilistic approach that has been so successful in design diversity can be applied to diversity in arguments.

...read moreread less

Abstract: Intellectual diversity ‐ difference ‐ has long been used in human affairs to minimise the impact of mistakes. In the past couple of decades design diversity has been used to seek dependability in software-based systems. This use of design diversity prompted the first formal studies of the efficacy of intellectual diversity. In this paper we examine diverse arguments ‐ in particular arguments to support claims about system dependability (reliability, safety). Our purpose is to see whether the probabilistic approach that has been so successful in design diversity can be applied to diversity in arguments. The work reported here is somewhat tentative and speculative.

...read moreread less

Proceedings Article•DOI•

Importance analysis with Markov chains

[...]

Ricardo M. Fricks¹, Kishor S. Trivedi•Institutions (1)

Motorola¹

28 Feb 2003

TL;DR: Novel techniques for computing importance measures in state space dependability models through combinatorial models and structure functions are introduced, and reward functions in a Markov reward model are utilized for this purpose.

...read moreread less

Abstract: In this paper, the authors introduce novel techniques for computing importance measures in state space dependability models. Specifically, reward functions in a Markov reward model (MRM) are utilized for this purpose, in contrast to the common method of computing importance measures through combinatorial models and structure functions. The advantage of bringing these measures in the context of MRMs is that the mapping extends the applicability of these substantial results of reliability engineering, previously considered only associated with fault trees and other combinatorial modeling techniques. As a consequence, software packages that allows the automatic description of MRMs can easily compute the importance measures under this new circumstance.

...read moreread less

Proceedings Article•DOI•

Dependable and secure data storage and retrieval in mobile, wireless networks

[...]

Stefano Chessa¹, P. Maestrini²•Institutions (2)

University of Pisa¹, National Research Council²

22 Jun 2003

TL;DR: A distributed data storage for mobile, wireless networks based on a peer-to-peer paradigm provides support to create and share files under a write-once model, and ensures data confidentiality and dependability by encoding files in a Redundant Residue Number System.

...read moreread less

Abstract: This paper introduces a distributed data storage for mobile, wireless networks based on a peer-to-peer paradigm. The distributed storage provides support to create and share files under a write-once model, and ensures at the same time data confidentiality and dependability by encoding files in a Redundant Residue Number System. More specifically files are partitioned into records and each record in encoded separately as (h+r)-tuples of data residues using h+r moduli. In turn, the residues are distributed among the mobiles in the network. Dependability is ensured since data can be reconstructed in the presence of up to s≤r residue erasures, combined with up to ¨ o 2 s r − corrupted residues, and data confidentiality is ensured since recovering the original information requires knowledge of the entire set of moduli.

...read moreread less

Collapse