A methodology for implementing highly concurrent data objects

Maurice Herlihy
01 Nov 1993 - Vol. 15, Iss. 5, pp. 745-770

A Methodology for Implementing Highly
Concurrent Data Objects
MAURICE HERLIHY
Digital Equipment Corporation
A concurrent object is a data structure shared by concurrent processes. Conventional techniques
for implementing concurrent objects typically rely on critical sections: ensuring that only one
process at a time can operate on the object. Nevertheless, critical sections are poorly suited for
asynchronous systems: if one process is halted or delayed in a critical section, other, nonfaulty
processes will be unable to progress. By contrast, a concurrent object implementation is lock free
if it always guarantees that some process will complete an operation in a finite number of steps,
and it is wait free if it guarantees that each process will complete an operation in a finite
number of steps. This paper proposes a new methodology for constructing lock-free and wait-free
implementations of concurrent objects. The object's representation and operations are written as
stylized sequential programs, with no explicit synchronization. Each sequential operation
is automatically transformed into a lock-free or wait-free operation using novel synchronization
and memory management algorithms. These algorithms are presented for a multiple
instruction/multiple data (MIMD) architecture in which n processes communicate by applying
atomic read, write, load_linked, and store_conditional operations to a shared memory.
Categories and Subject Descriptors: D.2.1 [Software Engineering]: Requirements/Specifications—methodologies; D.3.3 [Programming Languages]: Language Constructs and
Features—concurrent programming structures; D.4.1 [Operating Systems]: Process Management—concurrency; deadlocks; synchronization
General Terms: Algorithms, Management, Performance, Theory
1. INTRODUCTION
A concurrent object is a data structure shared by concurrent processes.
Conventional techniques for implementing concurrent objects typically rely
on critical sections to ensure that only one process at a time is allowed
to access the object. Nevertheless, critical sections are poorly suited for
asynchronous systems; if one process is halted or delayed in a critical section,
other, faster processes will be unable to progress. Possible sources of
unexpected delay include page faults, cache misses, scheduling preemption, and
perhaps even processor failure.
By contrast, a concurrent object implementation is lock free if some process
must complete an operation after the system as a whole takes a finite number
Author's address: Digital Equipment Corporation, Cambridge Research Laboratory, One Kendall
Square, Cambridge, MA 02139; email: herlihy@crl.dec.com.
Permission to copy without fee all or part of this material is granted provided that the copies are
not made or distributed for direct commercial advantage, the ACM copyright notice and the title
of the publication and its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or
specific permission.
© 1993 ACM 0164-0925/93/1100-0745 $03.50
ACM Transactions on Programming Languages and Systems, Vol. 15, No. 5, November 1993, Pages 745-770.

of steps,¹ and it is wait free if each process must complete an operation after
taking a finite number of steps. The lock-free condition guarantees that some
process will always make progress despite arbitrary halting failures or delays
by other processes, while the wait-free condition guarantees that all non-
halted processes make progress. Either condition rules out the use of critical
sections, since a process that halts in a critical section can force other
processes trying to enter that critical section to run forever without making
progress. The lock-free condition is appropriate for systems where starvation
is unlikely, while the (strictly stronger) wait-free condition may be appropri-
ate when some processes are inherently slower than others, as in certain
heterogeneous architectures.
The theoretical issues surrounding lock-free synchronization protocols have
received a fair amount of attention, but the practical issues have not. In this
paper, we make a first step toward addressing these practical aspects by
proposing a new methodology for constructing lock-free and wait-free imple-
mentations of concurrent objects. Our approach focuses on two distinct issues:
ease of reasoning and performance.
—It is no secret that reasoning about concurrent programs is difficult. A
practical methodology should permit a programmer to design, say, a cor-
rect lock-free priority queue, without ending up with a publishable result.
—The lock-free and wait-free properties, like most kinds of fault-tolerance,
incur a cost, especially in the absence of failures or delays. A methodology
can be considered practical only if (1) we understand the inherent costs of
the resulting programs, (2) this cost can be kept to acceptable levels, and
(3) the programmer has some ability to influence these costs.
We address the reasoning issue by having programmers implement data
objects as stylized sequential programs, with no explicit synchronization.
Each sequential implementation is automatically transformed into a lock-
free or wait-free implementation via a collection of novel synchronization and
memory management techniques introduced in this paper. If the sequential
implementation is a correct sequential program, and if it follows certain
simple conventions described below, then the transformed program will be a
correct concurrent implementation. The advantage of starting with sequen-
tial programs is clear: the formidable problem of reasoning about concurrent
programs and data structures is reduced to the more familiar sequential
domain. (Because programmers are required to follow certain conventions,
this methodology is not intended to parallelize arbitrary sequential programs
after the fact.)
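The copy-and-install idea behind this transformation can be sketched in a few lines of C. The sketch below is ours, not the paper's protocol: it uses a C11 compare-and-swap as a stand-in for store_conditional, handles one toy operation on a small object, and simply leaks superseded versions, whereas the algorithms developed later add per-process copies, consistency checks, and explicit memory management.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* A toy sequential object: a counter with one operation. */
typedef struct { long value; } Counter;

/* The sequential code: no synchronization at all. */
static long seq_inc(Counter *c) { return ++c->value; }

/* Shared state is a pointer to the current version of the object. */
static _Atomic(Counter *) shared;

void counter_init(long v) {
    Counter *c = malloc(sizeof *c);
    c->value = v;
    atomic_store(&shared, c);
}

/* Lock-free wrapper: copy the object, run the sequential operation on
 * the private copy, then try to install the copy as the new version.
 * The compare-and-swap plays the role of store_conditional here: it
 * succeeds only if no other process installed a version in the interim. */
long concurrent_inc(void) {
    for (;;) {
        Counter *old = atomic_load(&shared);
        Counter *copy = malloc(sizeof *copy);
        *copy = *old;                   /* private copy, like load_linked */
        long result = seq_inc(copy);
        if (atomic_compare_exchange_strong(&shared, &old, copy))
            return result;              /* installed: operation took effect */
        free(copy);                     /* lost a race: retry from scratch */
    }
}
```

Note that some process always succeeds in installing its copy, so the sketch is lock free; making it wait free additionally requires the helping mechanism the paper develops.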
To address the performance issue, we built and tested prototype implemen-
tations of several concurrent objects on a multiprocessor. We show that a
naive implementation of our methodology performs poorly because of exces-
sive memory contention, but simple techniques from the literature (such as
exponential backoff) have a dramatic effect on performance. We also compare
¹ The lock-free condition is sometimes called nonblocking.
our implementations with more conventional implementations based on spin
locks. Even in the absence of timing anomalies, our example implementations
sometimes outperform conventional spin-lock techniques, and lie within a
factor of two of more sophisticated spin-lock techniques.
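The exponential backoff mentioned above can be sketched as follows; the delay bounds and the compare-and-swap retry loop are illustrative choices of ours, not the measured configuration from the experiments.

```c
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic long counter;   /* the contended shared variable */

/* Spin for a random delay, then double the delay ceiling (up to a cap),
 * so that processes that keep losing back off more and more. */
static void backoff(unsigned *limit) {
    unsigned spins = (unsigned)rand() % *limit;
    for (volatile unsigned i = 0; i < spins; i++)
        ;                               /* busy-wait, touching no shared data */
    if (*limit < 1024)
        *limit *= 2;
}

/* Retry loop with backoff: after each failed update attempt, wait a
 * randomized, geometrically growing delay before trying again, which
 * reduces memory contention on the shared variable. */
long inc_with_backoff(void) {
    unsigned limit = 2;
    for (;;) {
        long old = atomic_load(&counter);
        if (atomic_compare_exchange_weak(&counter, &old, old + 1))
            return old + 1;
        backoff(&limit);
    }
}
```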
We focus on a multiple instruction/multiple data (MIMD) architecture
in which n asynchronous processes communicate by applying atomic read,
write, load_linked, and store_conditional operations to a shared memory.
The load_linked operation copies the value of a shared variable to a local
variable. A subsequent store_conditional to the shared variable will change
its value only if no other process has modified that variable in the interim.
Either way, the store_conditional returns an indication of success or failure.
(Note that a store_conditional is permitted to fail even if the variable has not
changed. We assume that such spurious failures are rare, though possible.)
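A small software model makes this contract concrete. The sketch below is our own simulation, tracking a modification count per variable; real hardware implements the same contract through the cache-coherence protocol rather than a counter, and, unlike this model, may fail spuriously.

```c
#include <stdbool.h>

/* A simulated shared variable: its value plus a count of completed stores. */
static struct { long value; unsigned long version; } sv;

/* Per-process link state: the version observed by the last load_linked. */
static unsigned long linked_version;

/* Copy the shared value to the caller, remembering when we read it. */
long load_linked(void) {
    linked_version = sv.version;
    return sv.value;
}

/* Change the variable only if no store intervened since our load_linked,
 * and report success or failure either way. */
bool store_conditional(long v) {
    if (sv.version != linked_version)
        return false;              /* someone modified it in the interim */
    sv.value = v;
    sv.version++;                  /* breaks every outstanding link */
    return true;
}
```

An atomic increment is then the canonical retry loop: read the variable with load_linked, attempt store_conditional with the incremented value, and repeat until the store succeeds.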
We chose to focus on the load_linked and store_conditional synchronization
primitives for three reasons. First, they can be implemented efficiently
in cache-coherent architectures [Jensen et al. 1987; Kane 1989; Sites 1992],
since store_conditional need only check whether the cached copy of the
shared variable has been invalidated. Second, many other "classical" synchronization
primitives are provably inadequate—we have shown elsewhere
[Herlihy 1991] that it is impossible² to construct lock-free or wait-free
implementations of many simple and useful data types using any combination
of read, write, test&set, fetch&add [Gottlieb et al. 1984], and memory-to-register
swap. The load_linked and store_conditional operations, however,
are universal—at least in principle, they are powerful enough to
transform a sequential implementation of any object into a lock-free or
wait-free implementation. Finally, we have found load_linked and store_conditional
easy to use. Elsewhere [Herlihy 1990], we present a collection of
synchronization and memory management algorithms based on compare&swap
[IBM]. Although these algorithms have the same functionality as those
given here, they are less efficient and conceptually more complex.
In our prototype implementations, we used the C language [Kernighan and
Ritchie 1988] on an Encore Multimax [Encore 1989] with eighteen NS32532
processors. This architecture does not provide load_linked or store_conditional
primitives, so we simulated them using short critical sections. Naturally,
our simulation is less efficient than direct hardware support. For
example, a successful store_conditional requires twelve machine instructions
rather than one. Nevertheless, these prototype implementations are instructive
because they allow us to compare the relative efficiency of different
implementations using load_linked and store_conditional, and because they
still permit an approximate comparison of the relative efficiency of waiting
versus nonwaiting techniques. We assume readers have some knowledge of
the syntax and semantics of C.
In Section 2, we give a brief survey of related work. Section 3 describes our
model. In Section 4, we present protocols for transforming sequential
implementations of small objects into lock-free and wait-free implementations,
together with experimental results showing that our techniques can be made
to perform well even when each process has a dedicated processor. In Section
5, we extend this methodology to encompass large objects. Section 6 summarizes
our results and concludes with a discussion.

² Although our impossibility results were presented in terms of wait-free implementations, they
hold for lock-free implementations as well.
2. RELATED WORK
Early work on lock-free protocols focused on impossibility results [Chor et al.
1987; Dolev et al. 1987; Dwork et al. 1986; 1988; Fischer et al. 1985; Herlihy
1991], showing that certain problems cannot be solved in asynchronous
systems using certain primitives. By contrast, a synchronization primitive is
universal if it can be used to transform any sequential object implementation
into a wait-free concurrent implementation. The author [Herlihy 1991] gives
a necessary and sufficient condition for universality: a synchronization
primitive is universal in an n-process system if and only if it solves the well-known
consensus problem [Fischer et al. 1985] for n processes. Although this result
established that wait-free (and lock-free) implementations are possible in
principle, the construction given was too inefficient to be practical. Plotkin
[1989] gives a detailed universal construction for a sticky-bit primitive. This
construction is also of theoretical rather than practical interest. Elsewhere
[Herlihy 1990], the author gives a simple and relatively efficient technique
for transforming stylized sequential object implementations into lock-free and
wait-free implementations using the compare&swap synchronization
primitive. Although the overall approach is similar to the one presented here, the
details are quite different. In particular, the constructions presented in this
paper are simpler and more efficient, for reasons discussed below.
Many researchers have studied the problem of constructing wait-free atomic
registers from simpler primitives [Burns and Peterson 1987; Lamport 1986;
Li et al. 1991; Peterson 1983; Peterson and Burns 1986]. Atomic registers,
however, have few if any interesting applications for concurrent data
structures, since they cannot be combined to construct lock-free or wait-free
implementations of most common data types [Herlihy 1991]. There exists an
extensive literature on concurrent data structures constructed from more
powerful primitives. Gottlieb et al. [1983] give a highly concurrent queue
implementation based on the replace-add operation, a variant of fetch&add.
This implementation permits concurrent enqueuing and dequeuing processes,
but it is blocking, since it uses critical sections to synchronize access to
individual queue elements. Lamport [1983] gives a wait-free queue
implementation that permits one enqueuing process to execute concurrently with
one dequeuing process. Herlihy and Wing [1987] give a lock-free queue
implementation, employing fetch&add and swap, that permits an arbitrary
number of enqueuing and dequeuing processes. Lanin and Shasha [1988] give
a lock-free set implementation that uses operations similar to compare&swap.
There exists an extensive literature on locking algorithms for concurrent
B-trees [Bayer and Schkolnick 1977; Lehman and Yao 1981] and for
related search structures [Biswas and Browne 1987; Ellis 1980; Ford and
Calhoun 1984; Guibas and Sedgewick 1978; Jones 1989]. More recent
approaches to implementing lock-free data structures include Allemany and
Felton's work on operating system support [Allemany and Felton 1992], and
Herlihy and Moss's work on hardware support [Herlihy and Moss 1993].
The load_linked and store_conditional synchronization primitives were
first proposed as part of the S-1 project [Jensen et al. 1987] at Lawrence
Livermore Laboratories, and they are currently supported in the MIPS-II
architecture [Kane 1989] and Digital's Alpha [Sites 1992]. They are closely
related to the compare&swap operation first introduced by the IBM 370
architecture [IBM].
Our techniques are distantly related to optimistic concurrency control
methods from the database literature [Kung and Robinson 1981]. In these
schemes, transactions execute without synchronization, but each transaction
must be validated before it is allowed to commit to ensure that synchroniza-
tion conflicts did not occur. Our method also checks after the fact whether
synchronization conflicts occurred, but the technical details are entirely
different.
3. OVERVIEW
A concurrent system consists of a collection of n sequential processes that
communicate through shared typed objects. Processes are sequential—each
process applies a sequence of operations to objects, alternately issuing an
invocation and then receiving the associated response. We make no fairness
assumptions about processes. A process can halt, or display arbitrary varia-
tions in speed. In particular, one process cannot tell whether another has
halted or is just running very slowly.
Objects are data structures in memory. Each object has a type, which
defines a set of possible values and a set of primitive operations that provide
the only means to manipulate that object. Each object has a sequential
specification that defines how the object behaves when its operations are
invoked one at a time by a single process. For example, the behavior of a
queue object can be specified by requiring that enqueue insert an item in the
queue, and that dequeue remove the oldest item present in the queue. In a
concurrent system, however, an object’s operations can be invoked by concur-
rent processes, and it is necessary to give a meaning to interleaved operation
executions.
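That sequential specification can itself be written down as ordinary C, in the spirit of the stylized sequential programs the methodology takes as input. This linked-list version is our own illustration; the paper's own examples and coding conventions appear in later sections.

```c
#include <stdlib.h>

/* Sequential specification of the queue: enqueue inserts an item, and
 * dequeue removes the oldest item present in the queue. */
typedef struct Node { int item; struct Node *next; } Node;
typedef struct { Node *head, *tail; } Queue;

/* Insert an item at the tail. */
void enqueue(Queue *q, int item) {
    Node *n = malloc(sizeof *n);
    n->item = item;
    n->next = NULL;
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
}

/* Remove and return the oldest item; precondition: queue nonempty. */
int dequeue(Queue *q) {
    Node *n = q->head;
    int item = n->item;
    q->head = n->next;
    if (!q->head)
        q->tail = NULL;
    free(n);
    return item;
}
```

Such a specification says nothing about concurrency; the next paragraphs explain how interleaved executions against it are given meaning.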
An object is linearizable [Herlihy and Wing 1987] if each operation appears
to take effect instantaneously at some point between the operation’s invoca-
tion and response. Linearizability implies that processes appear to be inter-
leaved at the granularity of complete operations, and that the order of
nonoverlapping operations is preserved. As discussed in more detail else-
where [Herlihy and Wing 1987], the notion of linearizability generalizes and
unifies a number of ad hoc correctness conditions in the literature, and it is
related to (but not identical with) correctness criteria such as sequential
consistency [Lamport 1979] and strict serializability [Papadimitriou 1979].
