Showing papers on "Rollback published in 1994"

PDF

Open Access

Patent•

Progressive retry method and apparatus for software failure recovery in multi-process message-passing applications

[...]

W.K. Fuchs¹, Yennun Huang¹, Yi-Min Wang¹•Institutions (1)

22 Jun 1994

TL;DR: A progressive retry recovery system based on checkpointing, message logging, rollback, message replaying and message reordering is disclosed in this article, which minimizes the number of involved processes as well as the total rollback distance.

...read moreread less

Abstract: A progressive retry recovery system based on checkpointing, message logging, rollback, message replaying and message reordering is disclosed. The disclosed progressive retry system minimizes the number of involved processes as well as the total rollback distance. The progressive retry method consists of a number of retry steps which gradually increase the scope of the rollback when a previous retry step fails. Step one attempts to bypass a software fault by having the faulty process replay the messages in its message log. Step two will attempt to bypass the software fault by having the faulty process reorder and then replay the messages in its message log. Step three will attempt to bypass the software fault by having the processes which have sent messages to the faulty process resend those messages to the faulty process. Step four will attempt to bypass the software fault by having the processes which have sent messages to the faulty process reorder and then resend their in-transit messages. Step five will attempt to bypass the software fault by implementing a large scope roll back of all monitored processes to the latest consistent global checkpoint. A mechanism is included for verifying the piecewise deterministic assumption.

...read moreread less

122 citations

Proceedings Article•DOI•

Adaptive checkpointing in Time Warp

[...]

Robert Rönngren¹, Rassul Ayani¹•Institutions (1)

Royal Institute of Technology¹

01 Jul 1994

TL;DR: This paper analyses the effects of doing the state saving less frequently and presents a method that allows each logical process to adapt its state saving interval to its rollback behaviour and indicates that the proposed method improves performance of the Time Warp system.

...read moreread less

Abstract: In Time Warp optimistic discrete event simulation, there exists a need to occasionally save the states of the logical processes. The state saving often constitutes a substantial overhead. However it is not necessary to save each state of a logical process since states can be restored from earlier states by re-executing intermediate events. In this paper, we analyse the effects of doing the state saving less frequently and present a method that allows each logical process to adapt its state saving interval to its rollback behaviour. Experimental results indicate that the proposed method improves performance of the Time Warp system.

...read moreread less

99 citations

Patent•

Method and system for performing resource updates and recovering operational records within a fault-tolerant transaction-oriented data processing system

[...]

Andrew John Schofield¹, Anthony Robert Washer¹•Institutions (1)

IBM¹

01 Sep 1994

TL;DR: In this article, a fault-tolerant transaction processing system and method stores records associated with operations of the system in order to permit recovery in the event of a need to roll back a transaction or to restart the system.

...read moreread less

Abstract: A fault-tolerant transaction processing system and method stores records associated with operations of the system in order to permit recovery in the event of a need to roll back a transaction or to restart the system. At least some of the operational records are stored as a recovery log in low-speed non-volatile storage and at least some are stored as a recovery list in high speed volatile storage. Rollback of an individual transaction is effected by reference to the recovery list whereas restart of the system is effected by reference to the recovery log.

...read moreread less

92 citations

Patent•

Electric vehicle drive train with rollback detection and compensation

[...]

Charles Edward Konrad¹•Institutions (1)

General Electric¹

08 Feb 1994

TL;DR: In this article, an electric vehicle drive train includes a controller for detecting and compensating for vehicle rollback, such as when the vehicle is started upward on an incline, and a gear selector permits the driver to select an intended or desired direction of vehicle movement.

...read moreread less

Abstract: An electric vehicle drive train includes a controller for detecting and compensating for vehicle rollback, as when the vehicle is started upward on an incline. The vehicle includes an electric motor rotatable in opposite directions corresponding to opposite directions of vehicle movement. A gear selector permits the driver to select an intended or desired direction of vehicle movement. If a speed end rotational sensor associated with the motor indicates vehicle movement opposite to the intended direction of vehicle movement, the motor is driven to a torque output magnitude as a nonconstant function of the rollback speed to counteract the vehicle rollback. The torque function may be either a linear function of speed or a function of the speed squared.

...read moreread less

80 citations

Posted Content•

Optimistic concurrency control revisited

[...]

Rainer Unland

01 Jan 1994

TL;DR: This paper will lay bare the shortcomings of the original approach and present some major improvements and several techniques will be presented which especially support read transactions with the consequence that the number of backups can be decreased substantially.

...read moreread less

Abstract: Several years ago optimistic concurrency control gained much attention in the database community. However, two-phase locking was already well established, especially in the relational database market. Concerning traditional database systems most developers felt that pessimistic concurrency control might not be the best solution for concurrency control, but, a well-known and accepted one. With the work on new generation database systems, however, there has been a revival of optimistic concurrency control (at least a partial one). This paper will reconsider optimistic concurrency control. It will lay bare the shortcomings of the original approach and present some major improvements. Moreover, several techniques will be presented which especially support read transactions with the consequence that the number of backups can be decreased substantially. Finally, a general solution for the starvation problem is presented. The solution is perfectly consistent with the underlying optimistic approach.

...read moreread less

68 citations

Proceedings Article•DOI•

Timeliness via speculation for real-time databases

[...]

Bestavros¹, Braoudakis¹•Institutions (1)

Boston University¹

07 Dec 1994

TL;DR: This work proposes a Speculative Concurrency Control technique (SCC) technique that minimizes the impact of block ages and rollbacks, and presents a number of SCC-based algorithms that differ in the level of speculation they introduce, and the amount of System resources they require.

...read moreread less

Abstract: Various concurrency control algorithms differ in the time when conflicts are detected, and in the way they are resolved. Pessimistic (PCC) protocols detect conflicts as soon as they occur and resolve them using blocking. Optimistic (OCC) protocols detect conflicts at transaction commit time and resolve them using rollbacks. For real-time databases, blockages and rollbacks are hazards that increase the likelihood of transactions missing their deadlines. We propose a Speculative Concurrency Control (SCC) technique that minimizes the impact of block ages and rollbacks. SCC relies on added system resources to speculate on potential serialization orders, ensuring that if such serialization orders materialize, the hazards of blockages and roll-back are minimized. We present a number of SCC-based algorithms that differ in the level of speculation they introduce, and the amount of System resources (mainly memory) they require. We show the performance gains (in terms of number of satisfied timing constraints) to be expected when a representative SCC algorithm (SCC-2S) is adopted. >

...read moreread less

65 citations

Proceedings Article•DOI•

Coordinated checkpointing-rollback error recovery for distributed shared memory multicomputers

[...]

G. Janakiraman¹, Yuval Tamir¹•Institutions (1)

University of California, Los Angeles¹

25 Oct 1994

TL;DR: An error recovery scheme based on coordinated checkpointing and rollback for DSM multicomputers is proposed, and performance evaluation based on trace-driven simulations indicates that this scheme incurs less checkpoint traffic than recovery schemes previously proposed for DSM systems.

...read moreread less

Abstract: Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determined only by the reliability requirements of the application. Efficient adaptation of this approach to DSM multicomputers is complicated by the absence of explicit messages in DSM systems, the presence of a shared and partially replicated address space, and the presence of a distributed coherency directory. We present solutions to these issues, and propose an error recovery scheme based on coordinated checkpointing and rollback for DSM multicomputers. Our performance evaluation based on trace-driven simulations indicates that this scheme incurs less checkpoint traffic than recovery schemes previously proposed for DSM systems. >

...read moreread less

63 citations

Proceedings Article•DOI•

Roll-forward and rollback recovery: performance-reliability trade-off

[...]

Dhiraj K. Pradhan¹, Nitin H. Vaidya¹•Institutions (1)

Texas A&M University¹

15 Jun 1994

TL;DR: It is shown that the roll-forward schemes improve performance with only a small loss in reliability as compared to rollback schemes.

...read moreread less

Abstract: Performance and reliability achieved by a modular redundant system depend on the recovery scheme used. Typically, gain in performance using comparable resources results in reduced reliability. Several high performance computers are noted for small mean time to failure. Performance is measured here in terms of mean and variance of the task completion time, reliability being a task-based measure defined as the probability that a task is completed correctly. Two roll-forward schemes are compared with two rollback schemes for achieving recovery in duplex systems. The roll-forward schemes discussed here are based on a roll-forward checkpointing concept. Roll-forward recovery schemes achieve significantly better performance than rollback schemes by avoiding rollback in most common fault scenarios. It is shown that the roll-forward schemes improve performance with only a small loss in reliability as compared to rollback schemes. >

...read moreread less

57 citations

Proceedings Article•DOI•

pGVT: an algorithm for accurate GVT estimation

[...]

Loy M. D'Souza, Xianzhi Fan, Philip A. Wilsey

01 Jul 1994

TL;DR: A new algorithm for GVT estimation called pGVT was designed to support accurate estimates of the actual GVT value and it operates in an environment where the communication subsystem does not support FIFO message delivery and where message delivery failure may occur.

...read moreread less

Abstract: The time warp mechanism uses memory space to save event and state information for rollback processing. As the simulation advances in time, old state and event information can be discarded and the memory space reclaimed. This reclamation process is called fossil collection and is guided by a global time value called Global Virtual Time (GVT). That is, GVT represents the greatest minimum time of the fully committed events (the time before which no rollback will occur). GVT is then used to establish a boundary for fossil collection. This paper presents a new algorithm for GVT estimation called pGVT. pGVT was designed to support accurate estimates of the actual GVT value and it operates in an environment where the communication subsystem does not support FIFO message delivery and where message delivery failure may occur. We show that pGVT correctly estimates GVT values and present some performance comparisons with other GVT algorithms.

...read moreread less

55 citations

Patent•

Method and system for selectable consistency level maintenance in a resilent database system

[...]

Alain Azagury¹, Danny Dolev¹, German Goft¹, John Marberg¹, James Gregory Ranweiler¹, Julian Satran¹ - Show less +2 more•Institutions (1)

IBM¹

15 Apr 1994

TL;DR: In this article, a replica database which is fully consistent with the primary database is provided for seamless switchover in the event of a primary database failure, while a second replica database may be provided to respond to queries by applications which do not require fully consistent data, greatly enhancing the efficiency of access to that database.

...read moreread less

Abstract: In a resilient database system which includes a journaled database which is implemented at one or more locations within a distributed data processing system, multiple diverse consistency levels are specified which each detail a level of consistency to be maintained between a primary database and a replica database. A user is then permitted to select a particular level of consistency for each replica database. Thereafter, each update to a record within the primary database is utilized to initiate an update to the corresponding record within each replica database in a manner which is consistent with the selected level of consistency for that replica database. In this manner, a replica database which is fully consistent with the primary database may be provided for seamless switchover in the event of a primary database failure, while a second replica database may be provided to respond to queries by applications which do not require fully consistent data, greatly enhancing the efficiency of access to that database.

...read moreread less

51 citations

Proceedings Article•DOI•

Using message semantics to reduce rollback in optimistic message logging recovery schemes

[...]

Hong Va Leong¹, Divyakant Agrawal¹•Institutions (1)

University of California, Santa Barbara¹

21 Jun 1994

TL;DR: This work proposes to identify messages that are not influential in a computation through message semantics and develops an algorithm for identifying these messages, which gives rise to recoverable states that dominate the recoverable state defined under conventional model.

...read moreread less

Abstract: Recovery from failures can be achieved through asynchronous checkpointing and optimistic message logging. These schemes have low overheads during failure-free operations. Central to these protocols is the determination of a maximal consistent global state, which is recoverable. Message semantics is not exploited in most existing recovery protocols to determine the recoverable state. We propose to identify messages that are not influential in a computation through message semantics. These messages can be logically removed from the computation without changing its meaning or result. We show that considering these messages in the recoverable state computation gives rise to recoverable states that dominate the recoverable state defined under conventional model. We then develop an algorithm for identifying these messages. This technique can also be applied to ensure a more timely commitment for output in a distributed computation. >

...read moreread less

Journal Article•DOI•

The performance of cache-based error recovery in multiprocessors

[...]

B. Janssens¹, W.K. Fuchs¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

01 Oct 1994-IEEE Transactions on Parallel and Distributed Systems

TL;DR: The results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval.

...read moreread less

Abstract: Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval. >

...read moreread less

Journal Article•DOI•

Efficient algorithms for optimistic crash recovery

[...]

Subbarayan Venkatesan¹, Tony T.-Y. Juang¹•Institutions (1)

University of Texas at Dallas¹

01 Oct 1994-Distributed Computing

TL;DR: The rollback problem as a closure problem is formulates and a centralized closure algorithm is presented together with two efficient distributed implementations for solving the problem.

...read moreread less

Abstract: Recovery from transient processor failures can be achieved by using optimistic message logging and checkpointing. The faulty processors roll back, and some/all of the non-faulty processors also may have to roll back. This paper formulates the rollback problem as a closure problem. A centralized closure algorithm is presented together with two efficient distributed implementations. Several related problems are also considered and distributed algorithms are presented for solving them.

...read moreread less

Patent•

Sets and holds in virtual time logic simulation for parallel processors

[...]

Philip Lee Childs¹, Joseph F. Skovira¹•Institutions (1)

IBM¹

30 Dec 1994

TL;DR: In this article, user-initiated requests for setting and holding simulated logic circuit elements are accommodated during forward simulation, rollback, and advance of system global virtual time.

...read moreread less

Abstract: In an event-driven, virtual time logic simulation system, user-initiated requests for setting and holding simulated logic circuit elements are accommodated during forward simulation, rollback, and advance of system global virtual time.

...read moreread less

Journal Article•DOI•

Memory management algorithms for optimistic parallel simulation

[...]

Yi-Bing Lin¹•Institutions (1)

Telcordia Technologies¹

01 Mar 1994-Information Sciences

TL;DR: This paper evaluates four Time Warp memory management algorithms: fossil collection, message sendback, cancelback and artificial rollback and shows that if an algorithm satisfies this second criterion, then the amount of memory consumed by Time Warp simulation is bounded by the amount consumed by sequential simulation.

...read moreread less

Proceedings Article•DOI•

Cost of state saving & rollback

[...]

John G. Cleary¹, Fabian Gomes², Brian W. Unger², Zhonge Xiao², Raimar Thudt³ - Show less +1 more•Institutions (3)

University of Waikato¹, University of Calgary², Siemens³

01 Jul 1994

TL;DR: Experimental results show the necessity and sufficiency of incremental state saving for this application and approaches to state saving and rollback for a shared memory, optimistically synchronized, simulation executive are presented.

...read moreread less

Abstract: Approaches to state saving and rollback for a shared memory, optimistically synchronized, simulation executive are presented. An analysis of copy state saving and incremental state saving is made and these two schemes are compared. Two benchmark programs are then described, one a simple, all overhead, model and one a performance model of a regional Canadian public telephone network. The latter is a large SS7 common channel signalling model that represents a very challenging, practical, test application for parallel simulation. Experimental results are presented which show the necessity and sufficiency of incremental state saving for this application.

...read moreread less

Proceedings Article•DOI•

Reducing interprocessor dependence in recoverable distributed shared memory

[...]

B. Janssens¹, W.K. Fuchs¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

25 Oct 1994

TL;DR: This paper develops an ownership timestamp scheme to tolerate the loss of block state information and develops a passive server model of execution where interactions between processors are considered atomic.

...read moreread less

Abstract: Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfer of blocks of data, resulting in reduced dependency tracking overhead and reduced potential for rollback propagation. We develop an ownership timestamp scheme to tolerate the loss of block state information and develop a passive server model of execution where interactions between processors are considered atomic. With our scheme, dependencies are significantly reduced compared to the traditional message-passing model. >

...read moreread less

Patent•

Method and apparatus for block-level auditing and database recovery in a transaction processing system

[...]

Thomas P. Cooper¹, Michael J. Hill¹, Dennis R. Konrad¹, Thomas L. Nowatzki¹•Institutions (1)

Unisys¹

22 Dec 1994

TL;DR: In this article, a transaction processing audit and recovery system is described, in which an audit manager logs in audit records only changed blocks of data of a segment of a database, and a recovery manager reads the audit records and copies the changed blocks back to the database backing storage.

...read moreread less

Abstract: A transaction processing audit and recovery system is disclosed. After processing a transaction, an audit manager logs in audit records only changed blocks of data of a segment of a database. Upon failure of database backing storage, a prior copy of the database is reloaded to database backing storage that is available and a recovery manager reads the audit records and copies the changed blocks back to the database backing storage. An outboard file cache system is used in conjunction with the recovery manager to recover the database. The outboard file cache provides cache storage for segments of the database and writes non-contiguous blocks of one or more segments as directed in a single input/output request initiated from the recovery manager.

...read moreread less

Proceedings Article•DOI•

Checkpoint/rollback in a distributed system using coarse-grained dataflow

[...]

D. Cummings¹, L. Alkalaj¹•Institutions (1)

California Institute of Technology¹

15 Jun 1994

TL;DR: This paper describes the CosMOS distributed checkpoint/rollback approach, which exploits the fact that a COSMOS application program is based on a coarse-grained dataflow programming paradigm and therefore most of the state of a distributed application programs is contained in the data tokens.

...read moreread less

Abstract: The Common Spaceborne Multicomputer Operating System (COSMOS) is a spacecraft operating system for distributed memory multiprocessors, designed to meet the on-board computing requirements of long-life interplanetary missions. One of the main features of COSMOS is software-implemented fault-tolerance, including 2-way voting, 3-way voting, and check point/rollback. This paper describes the COSMOS distributed checkpoint/rollback approach, which exploits the fact that a COSMOS application program is based on a coarse-grained dataflow programming paradigm and therefore most of the state of a distributed application program is contained in the data tokens. Furthermore, all computers maintain a consistent view of this dynamic state, which facilitates the implementation of a coordinated checkpoint. >

...read moreread less

Patent•

Distributed database management

[...]

Amy Chang¹, Daniel Jerome Coyle¹, Timothy R. Malkemus¹, Walter G. Wilson¹•Institutions (1)

IBM¹

08 Nov 1994

TL;DR: In parallel database management systems, database update requests typically result in activity at several nodes and a coordination process monitors for failure or success of updates is required if the update of any node fails as discussed by the authors.

...read moreread less

Abstract: In parallel database management systems, database update requests typically result in activity at several nodes. Rollback of all updates is required if the update of any node fails. A coordination process monitors for failure or success of updates. The coordinator further provides for distinguishing activities that have taken place at any given node from the other nodes for different database update requests. Savepoints are local. This allows rollback of a selected update without affecting nodes which did not process the update.

...read moreread less

Book Chapter•DOI•

On the semantics of (bi)temporal variable databases

[...]

James Clifford¹, Tomas Isakowitz¹•Institutions (1)

New York University¹

02 May 1994

TL;DR: This paper proposes a framework for providing a formal specification of the precise semantics of this type of database, which it is called a variable database, and discusses several alternative semantics that can be given to these temporal variable databases incorporating one or more of these variables.

...read moreread less

Abstract: Numerous proposals for extending the relational data model to incorporate the temporal dimension of data have appeared during the past several years. These have ranged from historical data models, incorporating a valid time dimension, to rollback data models, incorporating a transaction time dimension, to bitemporal data models, incorporating both of these temporal dimensions. Many of these models have been presented in a non-traditional fashion, allowing the use of variables at the instance level. Unfortunately, the precise semantics of these database objects, e.g. tuples, with variables has not been made clear. In this paper we propose a framework for providing a formal specification of the precise semantics of this type of database, which we call a variable database. In addition, since more than one possible interpretation can be given to the specific temporal variables, such as now and ∞, which have appeared in the literature, we discuss several alternative semantics that can be given to these temporal variable databases incorporating one or more of these variables. We also present a constraint on the way such databases are allowed to evolve in time if they are to support a rollback operator.

...read moreread less

Proceedings Article•DOI•

Group communications algorithm for dynamically updating in distributed systems

[...]

Hiroaki Higaki

19 Dec 1994

TL;DR: In this article, a dynamic updating technique is proposed to achieve extension or modification of functions in a distributed system, which can be invoked asynchronously by each process with the assurance of correct execution of the system.

...read moreread less

Abstract: The paper proposes a novel updating technique, dynamically updating to achieve extension or modification of functions in a distributed system. Usual updating techniques require multiple processes to suspend simultaneously in order to avoid an unspecified reception caused by the conflict of different versions of processes. By using the proposed dynamically updating technique, the updating operation can be invoked asynchronously by each process with the assurance of correct execution of the system, i.e., the system can cope with the effect of an unspecified reception caused by a mixture of multiple version processes. This is implemented by using a novel distributed algorithm that consists of group communication, checkpoint setting, and rollback recovery. This algorithm achieves rollback recovery with the lowest overhead, i.e., a set of checkpoints determines the last global state for consistent rollback recovery and a set of processes that need to rollback simultaneously is the smallest.

...read moreread less

Book•

Oracle8 DBA handbook

[...]

Kevin Loney, Marlene Theriault

01 Jan 1994

TL;DR: By following the techniques in this book, you'll no longer have to worry about disasters striking your databases, and Administering the database will become easier as the users get a better product, while the database works-and works well.

...read moreread less

Abstract: From the Book: Whether you're an experienced DBA, a new DBA, or an application developer, you need to know how the internal structures of the ORACLE8 database work and interact. Properly managing the database's internals will allow your database to meet two goals: it wiIl work, and it wiII work well. In this book, you'll find the information you need to achieve both of these goals. The emphasis throughout is on managing the database's capabilities in an effective, efficient manner to deliver a quality product. The end result will be a database that is dependable, robust, secure, extensible, and designed to meet the objectives of the applications it supports. Several components are critical to these goals, and you'll see that all of them are covered here in depth. A well-designed logical and physical database architecture will improve performance and ease administration by properly distributing database objects. Determining the correct number and size of rollback segments will allow your database to support all of its transactions. You'll also see appropriate monitoring, security, and tuning strategies for stand-alone and networked databases. Optimal backup and recovery procedures are also provided to help ensure the database's recoverability. The focus in all of these sections is on the proper planning and management techniques for each area. You'll also find information on how to manage specific problems, such as dealing with very large databases or very high availability requirements. Networking issues and the management of distributed and client/server databases are thoroughly covered. SQL*Net (now known as Net8), networking configurations,snapshots, location transparency, and everything else you need to successfully implement a distributed or client/server database are described in detail in Part III of this book. You'll also find real-world examples and for every major configuration. In addition to the commands needed to perform DBA activities, you will also see the Oracle Enterprise Manager screens that perform similar functions. In addition to descriptions of the ORACLE8i features, you will also see sections that compare prior releases to ORACLE8i, to facilitate your migration path. "Solutions" sections throughout the book offer common solutions to the most frequently encountered problems. By following the techniques in this book, you'll no longer have to worry about disasters striking your databases. Your systems can be designed and implemented so well that tuning efforts will be minimal. Administering the database will become easier as the users get a better product, while the database works-and works well.

...read moreread less

Book Chapter•DOI•

The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems

[...]

Johan Vounckx¹, Geert Deconinck¹, Rudy Lauwereins¹, G. Viehover, R. Wagner, Henrique Madeira², João Gabriel Silva², Frank Balbach³, Jörn Altmann³, Bernd Bieker, Harald Willeke - Show less +7 more•Institutions (3)

Katholieke Universiteit Leuven¹, University of Coimbra², University of Erlangen-Nuremberg³

18 Apr 1994

TL;DR: The FTMPS-project provides a solution to the need for faulttolerance in large systems by developing and being implemented a complete fault-tolerance approach.

...read moreread less

Abstract: The FTMPS-project provides a solution to the need for faulttolerance in large systems A complete fault-tolerance approach is developed and being implemented The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures Combined with the diagnosis software, the necessary information for the OSS (statistics and visualisation) and the possibly reconfiguration is collected Backward error recovery based on checkpointing and rollback, is implemented

...read moreread less

Book Chapter•DOI•

A Fault-Tolerant Mechanism for Simple Controllers

[...]

João Gabriel Silva¹, Luis Silva¹, Henrique Madeira¹, Jorge Bernardino¹•Institutions (1)

University of Coimbra¹

04 Oct 1994

TL;DR: A proposal for a simple and low-cost fault-tolerant technique, yet very effective, that can be used even in the simplest controllers, that uses behaviour-based error detection, with checkpointing and rollback, to give resiliency to the application.

...read moreread less

Abstract: There are many industrial controllers where no systematic fault-tolerant mechanisms are included because of cost constraints. This paper addresses that problem by making a proposal for a simple and low-cost fault-tolerant technique, yet very effective, that can be used even in the simplest controllers. The mechanism is able of tolerating both hardware and software faults. It uses behaviour-based error detection, with checkpointing and rollback, to give resiliency to the application. The programs are made of possibly non-deterministic processes that communicate solely by messages. The technique, called RP-Actions, also guarantees that the recovery is domino-effect free. Software bugs are caught by acceptance tests as in recovery blocks. Forward error recovery is used for time, since time cannot be rolled back. Several implementations of the proposed mechanisms were made — we present in this paper some important results.

...read moreread less

Proceedings Article•DOI•

Efficient checkpointing over local area networks

[...]

Avi Ziv¹, Jehoshua Bruck•Institutions (1)

Stanford University¹

12 Jun 1994

TL;DR: A novel checkpointing approach that enables efficient performance over local area networks and shows that in some cases the overhead of the DMR checkpointing schemes over LAN's can be improved by as much as 20%.

...read moreread less

Abstract: Parallel and distributed computing on clusters of workstations is becoming very popular as it provides a cost effective way for high performance computing. In these systems, the bandwidth of the communication subsystem (using Ethernet technology) is about an order of magnitude smaller compared to the bandwidth of the storage subsystem. Hence, storing a state in a checkpoint is much more efficient than comparing states over the network. In this paper we present a novel checkpointing approach that enables efficient performance over local area networks. The main idea is that we use two types of checkpoints: compare-checkpoints (comparing the states of the redundant processes to detect faults) and store-checkpoints (where the state is only stored). The store-checkpoints reduce the rollback needed after a fault is detected, without performing many unnecessary comparisons. As a particular example of this approach we analyzed the DMR checkpointing scheme with store-checkpoints. Our main result is that the overhead of the execution time can be significantly reduced when store-checkpoints are introduced. We have implemented a prototype of the new DMR scheme and run it on workstations connected by a LAN. The experimental results we obtained match the analytical results and show that in some cases the overhead of the DMR checkpointing schemes over LAN's can be improved by as much as 20%.

...read moreread less

Patent•

Fault-tolerant transaction-oriented data processing

[...]

Andrew John Schofield¹, Anthony Robert Washer¹•Institutions (1)

IBM¹

19 Aug 1994

...read moreread less

Proceedings Article•DOI•

Adaptive redundancy for fault-tolerant real-time systems

[...]

Chia-Mei Chen¹, Satish K. Tripathi¹, Sheng-Tzong Cheng¹•Institutions (1)

University of Maryland, College Park¹

12 Jun 1994

TL;DR: This paper proposes a fault-tolerance approach, a hybrid method of rollback and replication, for the real-time systems which require both system reliability and the guarantee of meeting deadlines.

...read moreread less

Abstract: Reliability is an important aspect of real-time systems because the result of a real-time application may be valid only if the application functions correctly and its timing constraints are satisfied There are two kinds of faults: hardware and software faults In this paper, we consider hardware transient faults Full replication or full hardware redundancy can achieve a high degree of reliability; however, it may waste resources We propose a fault-tolerance approach, a hybrid method of rollback and replication, for the real-time systems which require both system reliability and the guarantee of meeting deadlines We define that a task is fault-tolerant if it can be recovered from a transient error either by rollback or duplication Our approach attempts to make as many tasks fault-tolerant as possible

...read moreread less

Journal Article•DOI•

Cost of state saving & rollback

[...]

ClearyJohn, GomesFabian, UngerBrian, XiaoZhonge, ThudtRaimar - Show less +1 more

01 Jul 1994-ACM Sigsim Simulation Digest

TL;DR: Approaches to state saving and rollback for a shared memory, optimistically synchronized, simulation executive are presented and an analysis of copy statesaving and incremental state saving is made.

...read moreread less

Proceedings Article•DOI•

Increasing concurrency through optimism: a reason for HOPE

[...]

Crispin Cowan, Hanan Lutfiyya, Mike Bauer

08 Mar 1994