The Recovery Manager of the System R Database Manager

doi:10.1145/356842.356847

JIM GRAY

Tandem Computers, 19333 Vallco Parkway, Cupertino, California 95014

PAUL McJONES

Xerox Corporation, 3333 Coyote Hill Road, Palo Alto, California 94304

MIKE BLASGEN, BRUCE LINDSAY, RAYMOND LORIE, TOM PRICE, FRANCO PUTZOLU, AND IRVING TRAIGER

IBM San Jose Research Laboratory, 5600 Cottle Road, San Jose, California 95193

The recovery subsystem of an experimental data management system is described and evaluated. The transaction concept allows application programs to commit, abort, or partially undo their effects. The DO-UNDO-REDO protocol allows new recoverable types and operations to be added to the recovery system. Application programs can record data in the transaction log to facilitate application-specific recovery. Transaction undo and redo are based on records kept in a transaction log. The checkpoint mechanism is based on differential files (shadows). The recovery log is recorded on disk rather than tape.

Keywords and Phrases: transactions, database, recovery, reliability

CR Categories: 4.33

INTRODUCTION

Application Interface to System R

Making computers easier to use is the goal of most software. Database management systems, in particular, provide a programming interface to ease the task of writing electronic bookkeeping programs. The recovery manager of such a system in turn eases the task of writing fault-tolerant application programs.

System R [ASTR76] is a database system which supports the relational model of data. The SQL language [CHAM76] provides operators that manipulate the database. Typically, a user writes a PL/I or COBOL program which has embedded SQL statements. A collection of such statements is required to make a consistent transformation of the database. To transfer funds from one account to another, for example, requires two SQL statements: one to debit the first account and one to credit the second account. In addition, the transaction probably records the transfer in a history file for later reporting and for auditing purposes. Figure 1 gives an example of such a program written in pseudo-PL/I.

The program effects a consistent transformation of the books of a hypothetical bank. Its actions are either to

• discover an error,
• accept the input message, and
• produce a failure message,

or to

• discover no errors,
• accept the input message,
• debit the source account by AMOUNT,
• credit the destination account by AMOUNT,
• record the transaction in a history file, and
• produce a success message.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

Computing Surveys, Vol. 13, No. 2, June 1981

CONTENTS

INTRODUCTION
  Application Interface to System R
  Structure of System R
  Model of Failures
1. DESCRIPTION OF SYSTEM R RECOVERY MANAGER
  1.1 What Is a Transaction?
  1.2 Transaction Save Points
  1.3 Summary
2. IMPLEMENTATION OF SYSTEM R RECOVERY
  2.1 Files, Versions, and Shadows
  2.2 Logs and the DO, UNDO, REDO Protocol
  2.3 Commit Processing
  2.4 Transaction UNDO
  2.5 Transaction Save Points
  2.6 System Configuration, Startup and Shutdown
  2.7 System Checkpoint
  2.8 System Restart
  2.9 Media Failure
  2.10 Managing the Log
  2.11 Recovery and Locking
3. EVALUATION
  3.1 Implementation Cost
  3.2 Execution Cost
  3.3 I/O Cost
  3.4 Success Rate
  3.5 Complexity
  3.6 Disk-Based Log
  3.7 Save Points
  3.8 Shadows
  3.9 Message Recovery, an Oversight
  3.10 New Features
ACKNOWLEDGMENTS
REFERENCES

The programmer who writes such a program ensures its correctness by ensuring that it performs the desired transformation on both the database state and the outside world (via messages). The programmer and the user both want the execution to be

• atomic: either all actions are performed (the transaction has an effect) or the results of all actions are undone (the transaction has no effect);
• durable: once the transaction completes, its effects cannot be lost due to computer failure;
• consistent: the transaction occurs as though it had executed on a system which sequentially executes only one transaction at a time.

In order to state this intention, the SQL programmer brackets the transformations with the SQL statements BEGIN_TRANSACTION, to signal the beginning of the transaction, and COMMIT_TRANSACTION, to signal its completion. If the programmer wants to return to the beginning of the transaction, the command RESTORE_TRANSACTION will undo all actions since the issuance of the BEGIN_TRANSACTION command (see Figure 1).

The System R recovery manager supports these commands and guarantees an atomic, durable execution.
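The transaction bracketing described above can be illustrated in a modern setting. The following is a minimal sketch, not System R code: it uses Python's standard sqlite3 module in place of PL/I with embedded SQL, and the table layout, the account numbers, and the function names make_bank, transfer, and balances are assumptions made for the example. Explicit BEGIN, COMMIT, and ROLLBACK statements stand in for BEGIN_TRANSACTION, COMMIT_TRANSACTION, and RESTORE_TRANSACTION.

```python
import sqlite3


def make_bank():
    """Create an in-memory bank with two accounts (hypothetical schema)."""
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # transaction brackets below are issued explicitly by the program.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE accounts ("
        " number INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute(
        "CREATE TABLE history (debit INTEGER, credit INTEGER, amount INTEGER)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    return conn


def transfer(conn, debit, credit, amount):
    """Debit, credit, and record the transfer; all three happen or none do."""
    conn.execute("BEGIN")  # plays the role of BEGIN_TRANSACTION
    try:
        for sign, number in ((-1, debit), (+1, credit)):
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE number = ?",
                (sign * amount, number),
            )
            if cur.rowcount != 1:
                raise ValueError(f"no such account: {number}")
        conn.execute(
            "INSERT INTO history VALUES (?, ?, ?)", (debit, credit, amount)
        )
    except (sqlite3.Error, ValueError):
        conn.execute("ROLLBACK")  # plays the role of RESTORE_TRANSACTION
        return "TRANSFER FAILED"
    conn.execute("COMMIT")  # plays the role of COMMIT_TRANSACTION
    return "TRANSFER DONE"


def balances(conn):
    """Return {account number: balance} for inspection."""
    return dict(conn.execute("SELECT number, balance FROM accounts"))
```

In this sketch a transfer that overdraws the source account, or that names a nonexistent destination after the debit has already been applied, leaves both balances and the history table exactly as they were before the call.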

System R generally runs several transactions concurrently. The concurrency control mechanism of System R hides such concurrency from the programmer by a locking technique [ESWA76, GRAY78, NAUM78] and gives the appearance of a consistent system.

Structure of System R

System R consists of an external layer called the Research Data System (RDS), and a completely internal layer called the Research Storage System (RSS) (see Figure 2).

The external layer provides a relational data model, and operators thereon. It also provides catalog management, a data dictionary, authorization, and alternate views of data. The RDS is manipulated using the language SQL [CHAM76]. The SQL compiler maps SQL statements into sequences of RSS calls.

The RSS is a nonsymbolic record-at-a-time access method. It supports the notions of file, record type, record instance, field within record, index (B-tree associative and sequential access path), parent-child set (an access path supporting the operations PARENT, FIRST_CHILD, NEXT_SIBLING, PREVIOUS_SIBLING with direct pointers), and cursor (which navigates over access paths to locate records). Unfortunately, these objects have the nonstandard names "segment," "relation," "tuple," "field," "image," "link," and "scan" in the System R documentation. The former, more standard, names are used here. RSS provides actions to create instances of these objects and to retrieve, modify, and delete them.

    FUNDS_TRANSFER: PROCEDURE;
      $BEGIN_TRANSACTION;
      ON ERROR DO;                             /* in case of error */
        $RESTORE_TRANSACTION;                  /* undo all work */
        GET INPUT MESSAGE;                     /* reacquire input */
        PUT MESSAGE ('TRANSFER FAILED');       /* report failure */
        GO TO COMMIT;
      END;
      GET INPUT MESSAGE;                       /* get and parse input */
      EXTRACT ACCOUNT_DEBIT, ACCOUNT_CREDIT,
        AMOUNT FROM MESSAGE;
      $UPDATE ACCOUNTS                         /* do debit */
        SET BALANCE = BALANCE - AMOUNT
        WHERE ACCOUNTS.NUMBER = ACCOUNT_DEBIT;
      $UPDATE ACCOUNTS                         /* do credit */
        SET BALANCE = BALANCE + AMOUNT
        WHERE ACCOUNTS.NUMBER = ACCOUNT_CREDIT;
      $INSERT INTO HISTORY                     /* keep audit trail */
        <DATE, MESSAGE>;
      PUT MESSAGE ('TRANSFER DONE');           /* report success */
    COMMIT:                                    /* commit updates */
      $COMMIT_TRANSACTION;
    END;                                       /* end of program */

Figure 1. A simple PL/I-SQL program which transfers funds from one account to another.

    Application Programs in PL/I or COBOL, plus SQL

    Research Data System (RDS)
      • Supports the relational data model
      • Supports the relational language SQL
      • Does naming and authorization
      • Compiles SQL statements into RSS call sequences

    Research Storage System (RSS)
      • Provides nonsymbolic record-at-a-time database access
      • Maps records onto operating system files
      • Provides transaction concept (recovery and locking)

    Operating System
      • Provides file system to manage disks
      • Provides I/O system to manage terminals
      • Provides process structure (multiprogramming)

    Hardware

Figure 2. System R consists of two layers above the operating system. The RSS provides the transaction concept, recovery notions, and a record-at-a-time data access method. The RDS accepts application PL/I or COBOL programs containing SQL statements. It translates them into COBOL or PL/I programs plus subroutines which represent the compilation of the SQL statements into RSS calls.

The RSS support of data is substantially more sophisticated than that normally found in an access method; it supports variable-length fields, indices on multiple fields, multiple record types per file, interfile and intrafile sets, physical clustering of records by attribute, and a catalog describing the data, which is kept as a file which may be manipulated like any other data.

Another major contribution of the RSS is its support of the notion of transaction, a unit of recovery consisting of an application-specified sequence of RSS actions. An application declares the start of a transaction by issuing a BEGIN action. Thereafter all RSS actions by that application are within the scope of that transaction until the application issues a COMMIT or an ABORT action. The RSS assumes all responsibility for running concurrent transactions and for assuring that each transaction sees a consistent view of the database. The RSS is also responsible for recovering the data to their most recent consistent state in the event of transaction, action, system, or media failure or a user request to cancel the transaction.


A final component of System R is the operating system. System R runs under the VM/370 [GRAY75] and MVS operating systems on IBM S/370 processors. The System R recovery manager is also part of the SQL/DS product running on DOS/CICS. The operating system provides processes, a simple file system, and terminal management.

System R allocates an operating system process for each user to run both the user's application program and the System R database manager. Application programs are written in a conventional programming language (e.g., COBOL or PL/I) augmented with the SQL language. A SQL preprocessor maps the SQL statements to sequences of RSS calls. Typically, a single application program or group of programs (main plus subroutines) constitutes a transaction. In this paper we ignore the RDS and assume that application programs, like those produced by the SQL compiler, consist of conventional programs which invoke sequences of RSS operations.

Model of Failures

The recovery manager eases the task of writing fault-tolerant programs. It does so by the careful use of redundancy. Choosing appropriate redundancy requires a quantitative model of system failures.

In our experience about 97 percent of all transactions execute successfully. Of the remainder, almost all fail because of incorrect user input or because of user cancellation. Occasionally (much less than 1 percent) transactions are aborted by the system as a result of some overload such as deadlock. In a typical system running one transaction per second, transaction undo occurs about twice a minute. Because of its frequency, transaction undo must run about as fast as forward processing of transactions.
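The "about twice a minute" figure follows directly from the two numbers quoted above. A short check, with variable names chosen here only for illustration:

```python
# Figures taken from the paragraph above.
transactions_per_minute = 60   # one transaction per second
success_fraction = 0.97        # about 97 percent execute successfully

# Every transaction that does not succeed must be undone.
undos_per_minute = transactions_per_minute * (1 - success_fraction)
print(f"{undos_per_minute:.1f} undos per minute")  # 1.8, i.e. about two
```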

Table 1. Frequency and Recovery Time of Failures
(Recovery manager trade-offs)

    Fault               Frequency            Recovery time
    Transaction abort   Several per minute   Milliseconds
    System restart      Several per month    Seconds
    Media failure       Several per year     Minutes

Every few days the system restarts (following a crash). Almost all crashes are due to hardware or operating system failures, although System R also initiates crash and restart whenever it detects damage to its data structures. The state of primary memory is lost after a crash. We assume that the state of the disks (secondary and tertiary storage) is preserved across crashes, so at restart the most recently committed state is reconstructed from the surviving disk state by referencing a log of recent activity to restore the work of committed and aborted transactions. This process completes within a matter of seconds or minutes.
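The idea of rebuilding the committed state from a surviving disk state plus a log can be sketched as follows. This is a deliberately simplified illustration, not the System R restart algorithm (which combines the log with shadow files): the record layout, the function name restart, and the example values are assumptions, and the sketch relies on locking to guarantee that no two unfinished transactions have updated the same item.

```python
def restart(disk_state, log):
    """Rebuild the most recent committed state after a crash.

    disk_state: dict of item -> value as found on disk; it may contain
        writes of uncommitted transactions and miss writes of committed ones.
    log: list of ("UPDATE", txn, item, old_value, new_value) and
        ("COMMIT", txn) records, in the order they were written.
    """
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    state = dict(disk_state)

    # Undo pass: scan the log backwards and restore the old values
    # written by transactions that never committed.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _, item, old_value, _ = rec
            state[item] = old_value

    # Redo pass: scan the log forwards and reapply the new values
    # written by committed transactions.
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, _, item, _, new_value = rec
            state[item] = new_value

    return state


# T1 committed a transfer; T2 was still running when the system crashed.
log = [
    ("UPDATE", "T1", "A", 100, 70),
    ("UPDATE", "T1", "B", 50, 80),
    ("COMMIT", "T1"),
    ("UPDATE", "T2", "A", 70, 0),
]
# On disk, T2's uncommitted write to A survived but T1's write to B did not.
recovered = restart({"A": 0, "B": 50}, log)
```

Here `recovered` holds A = 70 and B = 80: the effect of the committed transaction T1 is restored and the effect of the unfinished transaction T2 is removed.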

Occasionally, the integrity of the disk state will be lost at restart. This may be caused by hardware failure (disk head crash or disk dropped on the floor) or by software failure (bad data written on a disk page by System R or other program). Such events are called media failures and initiate a reconstruction of the current state from an archive version (old and undamaged version of the system state) plus a log of activity since that time. This procedure is invoked once or twice a year and is expected to complete within an hour.

If all these recovery procedures fail, the user will have lost data owing to an unrecoverable failure. We have very limited statistics on unrecoverable failures. The current release of System R has experienced about 25 years of service in a variety of installations, and to our knowledge almost all unrecoverable failures have resulted from operations errors (e.g., failure to make archive dumps) or from bugs in the operating system utility for dumping and restoring disks. The fact that the archive mechanism is only a minor source of unrecoverable failure probably indicates that it is appropriately designed. Table 1 summarizes this discussion.

If the archive mechanism fails once every hundred years of operation, and if there are 10,000 installations of System R, then it will fail someone once a month. From this perspective, it might be underdesigned.

We assume that System R, the operating system, the microcode, and the hardware all have bugs in them. However, each of these systems does quite a bit of checking of its data structures (defensive programming). We postulate that these errors are detected and that the system crashes before the data are seriously corrupted. If this assumption is incorrect, then the situation is treated as a media failure. This attitude assumes that the archive and log mechanism are very reliable and have failure modes independent of the other parts of the system.

Some commercial systems are much more demanding. They run hundreds of transactions per second, and because they have hundreds of disks, they see disk failures hundreds of times as frequently as typical users of System R (once a week rather than once a year). They also cannot tolerate downtimes exceeding a few minutes. Although the concepts presented in this paper are applicable to such systems, much more redundancy is needed to meet such demands (e.g., duplexed processors and disks, and utilities which can recover small parts of the database without having to recover it all every time). The recovery manager presented here is a textbook one, whose basic facilities are only a subset of those provided by more sophisticated systems.

The transaction model is an unrealizable ideal. At best, careful use of redundancy minimizes the probability of unrecoverable failures and consequent loss of committed updates. Redundant copies are designed to have independent failure modes, making it unlikely that all records will be lost at once. However, Murphy's law ensures that all recovery techniques will sometimes fail. As seen below, however, System R can tolerate any single failure and can often tolerate multiple failures.

1. DESCRIPTION OF SYSTEM R RECOVERY MANAGER

1.1 What is a Transaction?

The RSS provides actions on the objects it implements. These actions include operations to create, destroy, manipulate, retrieve, and modify RSS objects (files, record types, record instances, indices, sets, and cursors). Each RSS action is atomic (it either happens or has no effect) and consistent (if any two actions relate to the same object, they appear to execute in some serial order). These two qualities are ensured by (1) undoing the partial effects of any actions which fail and (2) locking necessary RSS resources for the duration of the action.

RSS actions are rather primitive. In general, functions like "hire an employee" or "make a deposit in an account" require several actions. The user, in mapping abstractions like "employee" or "account" into such a system, must combine several actions into an atomic transaction. The classic example of an atomic transaction is a funds transfer which debits one account, credits another, writes an activity record, and does some terminal input or output. The user of such a transaction wants it to be an all-or-nothing affair, in that he does not want only some of the actions to have occurred. If the transaction is correctly implemented, it looks and acts atomic.

In a multiuser environment, transactions take on the additional attribute that any two transactions concurrently operating on common objects appear to run serially (i.e., as though there were no concurrency). This property is called consistency and is handled by the RSS lock subsystem [ESWA76, GRAY76, GRAY78, NAUM78].

The application declares a sequence of actions to be a transaction by beginning the sequence with a BEGIN action and ending it with a COMMIT action. All intervening actions by that application (be it one or several processes) are considered to be parts of a single recovery unit. If the application gets into trouble, it may issue the ABORT action which undoes all actions in the transaction. Further, the system may unilaterally abort in-progress transactions in case of an authorization violation, resource limit, deadlock, system shutdown, or crash. Figure 3 shows the three possible outcomes (commit, abort, or system abortion) of a transaction, and Figure 4 shows the outcomes of five sample transactions in the event of a system crash.
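The BEGIN, COMMIT, and ABORT actions described above can be illustrated with a toy recovery unit that remembers how to undo each of its actions. This is a minimal in-memory sketch under stated assumptions, not the RSS implementation: the class name Transaction, its methods, and the use of a Python dictionary as the database are all hypothetical, and locking and durability are ignored.

```python
_ABSENT = object()  # marks "the item did not exist before the update"


class Transaction:
    """A recovery unit over a dict-like store (hypothetical interface)."""

    def __init__(self, store):
        # Creating the object plays the role of the BEGIN action.
        self.store = store
        self.undo_log = []   # (item, previous value), oldest first
        self.active = True

    def write(self, item, value):
        """Perform one action, first recording how to undo it."""
        if not self.active:
            raise RuntimeError("transaction already ended")
        self.undo_log.append((item, self.store.get(item, _ABSENT)))
        self.store[item] = value

    def commit(self):
        """COMMIT: the updates stand; the undo information is discarded."""
        self.undo_log.clear()
        self.active = False

    def abort(self):
        """ABORT: undo every action, most recent first."""
        while self.undo_log:
            item, previous = self.undo_log.pop()
            if previous is _ABSENT:
                self.store.pop(item, None)
            else:
                self.store[item] = previous
        self.active = False


store = {"acct1": 100}

t1 = Transaction(store)
t1.write("acct1", 70)
t1.write("history", ["debit acct1 by 30"])
t1.abort()                      # store is {"acct1": 100} again

t2 = Transaction(store)
t2.write("acct1", 70)
t2.commit()                     # store is now {"acct1": 70}
```

Undoing in reverse order matters when one transaction updates the same item more than once: the value restored last is then the one that held before the transaction began.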

If a transaction either aborts or is aborted, the system must undo all actions of that transaction. Once a transaction commits, however, its updates and messages to


References

The notions of consistency and predicate locks in a database system

Notes on Data Base Operating Systems

System R: relational approach to database management

SEQUEL 2: a unified approach to data definition, manipulation, and control

Physical integrity in a large segmented database
