Conference

Pacific Rim International Symposium on Fault-Tolerant Systems 

About: Pacific Rim International Symposium on Fault-Tolerant Systems is an academic conference. It publishes mainly in the areas of fault tolerance and fault coverage. Over its lifetime, the conference has published 77 papers, which have received 503 citations.

Papers
Proceedings Article
15 Dec 1997
TL;DR: This paper formally verifies the start-up algorithm of the DACAPO protocol, which uses TDMA (Time Division Multiple Access) bus arbitration, showing that an ensemble of four communicating stations becomes synchronized and operational within a bounded time from an arbitrary initial state.
Abstract: This paper presents a formal verification of the start-up algorithm of the DACAPO protocol. The protocol uses TDMA (Time Division Multiple Access) bus arbitration. It was verified that an ensemble of four communicating stations becomes synchronized and operational within a bounded time from an arbitrary initial state. The system model included a clock drift corresponding to ±10⁻³. The protocol was modeled using a network of timed automata, and verification was performed using the symbolic model checker UPPAAL.

80 citations

Proceedings Article
P.E. Chung, Yennun Huang, S. Yajnik, Glenn S. Fowler, Kiem-Phong Vo, Yi-Min Wang
15 Dec 1997
TL;DR: The CosMiC system is a user-level process migration environment that aims to provide fault-tolerance by migrating a process from a failed machine to another machine.
Abstract: The CosMiC system is a user-level process migration environment. Process migration is defined as the mechanism to checkpoint the state of an unfinished process, transfer the state from one machine to another and resume process execution on the new machine. The main purposes of process migration are: (1) to utilize the CPU power and balance load on all machines in an environment; (2) to provide fault-tolerance by migrating a process from a failed machine to another machine.

27 citations

Proceedings Article
15 Dec 1997
TL;DR: This paper analyzes and compares two time-based checkpointing protocols, obtains a closed-form expression for the forward progress of each, and determines the checkpoint interval that maximizes forward progress.
Abstract: Time-based checkpointing protocols are a recently proposed way to improve a system's dependability. They claim to have the advantages of coordinated protocols without the normal costs of coordination. This paper investigates that claim, by analyzing and comparing two time-based checkpointing protocols. The analysis is performed by determining the forward progress of a system using each protocol, and it is described in such a way as to be easily modifiable for other time-based protocols. By carefully analyzing the behavior of each protocol between renewal points, we are able to obtain a closed-form expression for the forward progress of the two protocols considered. We also determine the checkpoint interval value that will maximize forward progress. A validation of the analytical model is then performed via a detailed simulation. The results obtained from the model show the advantages and disadvantages of each protocol.

26 citations

Proceedings Article
15 Dec 1997
TL;DR: A checkpointing implementation for MPI programs is presented, which is transparent, and requires no changes to the application programs, and combines coordinated, uncoordinated and message logging techniques.
Abstract: Many scientific problems can be distributed on a large number of processes to take advantage of low cost workstations. In a parallel system, a failure on any processor can halt the computation and require restarting all applications. Checkpointing is a simple technique to recover the failed execution. Message Passing Interface (MPI) is a standard proposed for writing portable message-passing parallel programs. In this paper, we present a checkpointing implementation for MPI programs, which is transparent, and requires no changes to the application programs. Our implementation combines coordinated, uncoordinated and message logging techniques.

23 citations

Proceedings Article
15 Dec 1997
TL;DR: This paper develops algorithms to simulate the failure behavior of three commonly used fault tolerant architectures, viz., Distributed Recovery Block (DRB), N-Version Programming (NVP) and N-Self Checking Programming (NSCP), and demonstrates them on illustrative numerical examples.
Abstract: Fault tolerance is a survival attribute of complex computer systems and software in their ability to deliver continuous service to their users in the presence of faults. Formulating an analytic model for dependability and performance evaluation of hardware/software fault tolerant architectures can be quite cumbersome. Also, in practice, isolating the effect of various parameters on a system, while holding the others constant requires exploring a variety of scenarios. It is economically infeasible to build several such systems. Simulation offers an attractive mechanism for dependability evaluation and the study of the influence of various parameters on the failure behavior of the system. In this paper, we develop algorithms to simulate the failure behavior of three commonly used fault tolerant architectures, viz., Distributed Recovery Block (DRB), N-Version Programming (NVP) and N-Self Checking Programming (NSCP). We demonstrate the ability of the approach to simulate complex failure scenarios with various dependencies using some illustrative numerical examples.

22 citations

Performance
Metrics
No. of papers from the Conference in previous years
Year    Papers
1998    1
1997    36
1995    1
1991    39