A methodology for implementing highly concurrent data objects

Maurice Herlihy
01 Nov 1993 - Vol. 15, Iss. 5, pp. 745-770

A Methodology for Implementing Highly
Concurrent Data Objects
MAURICE HERLIHY
Digital Equipment Corporation
A concurrent object is a data structure shared by concurrent processes. Conventional techniques
for implementing concurrent objects typically rely on critical sections: ensuring that only one
process at a time can operate on the object. Nevertheless, critical sections are poorly suited for
asynchronous systems: if one process is halted or delayed in a critical section, other, nonfaulty
processes will be unable to progress. By contrast, a concurrent object implementation is lock free
if it always guarantees that some process will complete an operation in a finite number of steps,
and it is wait free if it guarantees that each process will complete an operation in a finite
number of steps. This paper proposes a new methodology for constructing lock-free and wait-free
implementations of concurrent objects. The object's representation and operations are written as
stylized sequential programs, with no explicit synchronization. Each sequential operation
is automatically transformed into a lock-free or wait-free operation using novel synchronization
and memory management algorithms. These algorithms are presented for a multiple
instruction/multiple data (MIMD) architecture in which n processes communicate by applying
atomic read, write, load_linked, and store_conditional operations to a shared memory.
Categories and Subject Descriptors: D.2.1 [Software Engineering]: Requirements/Specifications—methodologies; D.3.3 [Programming Languages]: Language Constructs and
Features—concurrent programming structures; D.4.1 [Operating Systems]: Process Management—concurrency; deadlocks; synchronization
General Terms: Algorithms, Management, Performance, Theory
1. INTRODUCTION
A concurrent object is a data structure shared by concurrent processes.
Conventional techniques for implementing concurrent objects typically rely
on critical sections to ensure that only one process at a time is allowed
to access the object. Nevertheless, critical sections are poorly suited for
asynchronous systems; if one process is halted or delayed in a critical section,
other, faster processes will be unable to progress. Possible sources of
unexpected delay include page faults, cache misses, scheduling preemption, and
perhaps even processor failure.
By contrast, a concurrent object implementation is lock free if some process
must complete an operation after the system as a whole takes a finite number
Author's address: Digital Equipment Corporation, Cambridge Research Laboratory, One Kendall
Square, Cambridge, MA 02139; email: herlihy@crl.dec.com.
Permission to copy without fee all or part of this material is granted provided that the copies are
not made or distributed for direct commercial advantage, the ACM copyright notice and the title
of the publication and its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or
specific permission.
© 1993 ACM 0164-0925/93/1100-0745 $03.50
ACM Transactions on Programming Languages and Systems, Vol. 15, No. 5, November 1993, Pages 745-770.

of steps,¹ and it is wait free if each process must complete an operation after
taking a finite number of steps. The lock-free condition guarantees that some
process will always make progress despite arbitrary halting failures or delays
by other processes, while the wait-free condition guarantees that all non-
halted processes make progress. Either condition rules out the use of critical
sections, since a process that halts in a critical section can force other
processes trying to enter that critical section to run forever without making
progress. The lock-free condition is appropriate for systems where starvation
is unlikely, while the (strictly stronger) wait-free condition may be appropri-
ate when some processes are inherently slower than others, as in certain
heterogeneous architectures.
The theoretical issues surrounding lock-free synchronization protocols have
received a fair amount of attention, but the practical issues have not. In this
paper, we make a first step toward addressing these practical aspects by
proposing a new methodology for constructing lock-free and wait-free imple-
mentations of concurrent objects. Our approach focuses on two distinct issues:
ease of reasoning and performance.
—It is no secret that reasoning about concurrent programs is difficult. A
practical methodology should permit a programmer to design, say, a cor-
rect lock-free priority queue, without ending up with a publishable result.
—The lock-free and wait-free properties, like most kinds of fault-tolerance,
incur a cost, especially in the absence of failures or delays. A methodology
can be considered practical only if (1) we understand the inherent costs of
the resulting programs, (2) this cost can be kept to acceptable levels, and
(3) the programmer has some ability to influence these costs.
We address the reasoning issue by having programmers implement data
objects as stylized sequential programs, with no explicit synchronization.
Each sequential implementation is automatically transformed into a lock-
free or wait-free implementation via a collection of novel synchronization and
memory management techniques introduced in this paper. If the sequential
implementation is a correct sequential program, and if it follows certain
simple conventions described below, then the transformed program will be a
correct concurrent implementation. The advantage of starting with sequen-
tial programs is clear: the formidable problem of reasoning about concurrent
programs and data structures is reduced to the more familiar sequential
domain. (Because programmers are required to follow certain conventions,
this methodology is not intended to parallelize arbitrary sequential programs
after the fact.)
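The copy-and-install idea behind this transformation can be sketched in a few lines of C. The sketch below is ours, not the paper's protocol: it uses a C11 compare-and-swap as a stand-in for store_conditional, handles one toy operation on a small object, and simply leaks superseded versions, whereas the algorithms developed later add per-process copies, consistency checks, and explicit memory management.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* A toy sequential object: a counter with one operation. */
typedef struct { long value; } Counter;

/* The sequential code: no synchronization at all. */
static long seq_inc(Counter *c) { return ++c->value; }

/* Shared state is a pointer to the current version of the object. */
static _Atomic(Counter *) shared;

void counter_init(long v) {
    Counter *c = malloc(sizeof *c);
    c->value = v;
    atomic_store(&shared, c);
}

/* Lock-free wrapper: copy the object, run the sequential operation on
 * the private copy, then try to install the copy as the new version.
 * The compare-and-swap plays the role of store_conditional here: it
 * succeeds only if no other process installed a version in the interim. */
long concurrent_inc(void) {
    for (;;) {
        Counter *old = atomic_load(&shared);
        Counter *copy = malloc(sizeof *copy);
        *copy = *old;                   /* private copy, like load_linked */
        long result = seq_inc(copy);
        if (atomic_compare_exchange_strong(&shared, &old, copy))
            return result;              /* installed: operation took effect */
        free(copy);                     /* lost a race: retry from scratch */
    }
}
```

Note that some process always succeeds in installing its copy, so the sketch is lock free; making it wait free additionally requires the helping mechanism the paper develops.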
To address the performance issue, we built and tested prototype implemen-
tations of several concurrent objects on a multiprocessor. We show that a
naive implementation of our methodology performs poorly because of exces-
sive memory contention, but simple techniques from the literature (such as
exponential backoff) have a dramatic effect on performance. We also compare
¹ The lock-free condition is sometimes called nonblocking.
our implementations with more conventional implementations based on spin
locks. Even in the absence of timing anomalies, our example implementations
sometimes outperform conventional spin-lock techniques, and lie within a
factor of two of more sophisticated spin-lock techniques.
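The exponential backoff mentioned above can be sketched as follows; the delay bounds and the compare-and-swap retry loop are illustrative choices of ours, not the measured configuration from the experiments.

```c
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic long counter;   /* the contended shared variable */

/* Spin for a random delay, then double the delay ceiling (up to a cap),
 * so that processes that keep losing back off more and more. */
static void backoff(unsigned *limit) {
    unsigned spins = (unsigned)rand() % *limit;
    for (volatile unsigned i = 0; i < spins; i++)
        ;                               /* busy-wait, touching no shared data */
    if (*limit < 1024)
        *limit *= 2;
}

/* Retry loop with backoff: after each failed update attempt, wait a
 * randomized, geometrically growing delay before trying again, which
 * reduces memory contention on the shared variable. */
long inc_with_backoff(void) {
    unsigned limit = 2;
    for (;;) {
        long old = atomic_load(&counter);
        if (atomic_compare_exchange_weak(&counter, &old, old + 1))
            return old + 1;
        backoff(&limit);
    }
}
```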
We focus on a multiple instruction/multiple data (MIMD) architecture
in which n asynchronous processes communicate by applying atomic read,
write, load_linked, and store_conditional operations to a shared memory.
The load_linked operation copies the value of a shared variable to a local
variable. A subsequent store_conditional to the shared variable will change
its value only if no other process has modified that variable in the interim.
Either way, the store_conditional returns an indication of success or failure.
(Note that a store_conditional is permitted to fail even if the variable has not
changed. We assume that such spurious failures are rare, though possible.)
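A small software model makes this contract concrete. The sketch below is our own simulation, tracking a modification count per variable; real hardware implements the same contract through the cache-coherence protocol rather than a counter, and, unlike this model, may fail spuriously.

```c
#include <stdbool.h>

/* A simulated shared variable: its value plus a count of completed stores. */
static struct { long value; unsigned long version; } sv;

/* Per-process link state: the version observed by the last load_linked. */
static unsigned long linked_version;

/* Copy the shared value to the caller, remembering when we read it. */
long load_linked(void) {
    linked_version = sv.version;
    return sv.value;
}

/* Change the variable only if no store intervened since our load_linked,
 * and report success or failure either way. */
bool store_conditional(long v) {
    if (sv.version != linked_version)
        return false;              /* someone modified it in the interim */
    sv.value = v;
    sv.version++;                  /* breaks every outstanding link */
    return true;
}
```

An atomic increment is then the canonical retry loop: read the variable with load_linked, attempt store_conditional with the incremented value, and repeat until the store succeeds.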
We chose to focus on the load_linked and store_conditional synchronization
primitives for three reasons. First, they can be implemented efficiently
in cache-coherent architectures [Jensen et al. 1987; Kane 1989; Sites 1992],
since store_conditional need only check whether the cached copy of the
shared variable has been invalidated. Second, many other "classical" synchronization
primitives are provably inadequate—we have shown elsewhere
[Herlihy 1991] that it is impossible² to construct lock-free or wait-free
implementations of many simple and useful data types using any combination
of read, write, test&set, fetch&add [Gottlieb et al. 1984], and memory-to-register
swap. The load_linked and store_conditional operations, however,
are universal—at least in principle, they are powerful enough to
transform a sequential implementation of any object into a lock-free or
wait-free implementation. Finally, we have found load_linked and store_conditional
easy to use. Elsewhere [Herlihy 1990], we present a collection of
synchronization and memory management algorithms based on compare&swap
[IBM]. Although these algorithms have the same functionality as those
given here, they are less efficient and conceptually more complex.
In our prototype implementations, we used the C language [Kernighan and
Ritchie 1988] on an Encore Multimax [Encore 1989] with eighteen NS32532
processors. This architecture does not provide load_linked or store_conditional
primitives, so we simulated them using short critical sections. Naturally,
our simulation is less efficient than direct hardware support. For
example, a successful store_conditional requires twelve machine instructions
rather than one. Nevertheless, these prototype implementations are instructive
because they allow us to compare the relative efficiency of different
implementations using load_linked and store_conditional, and because they
still permit an approximate comparison of the relative efficiency of waiting
versus nonwaiting techniques. We assume readers have some knowledge of
the syntax and semantics of C.
In Section 2, we give a brief survey of related work. Section 3 describes our
model. In Section 4, we present protocols for transforming sequential
implementations of small objects into lock-free and wait-free implementations,
together with experimental results showing that our techniques can be made
to perform well even when each process has a dedicated processor. In Section
5, we extend this methodology to encompass large objects. Section 6 summarizes
our results and concludes with a discussion.

² Although our impossibility results were presented in terms of wait-free implementations, they
hold for lock-free implementations as well.
2. RELATED WORK
Early work on lock-free protocols focused on impossibility results [Chor et al.
1987; Dolev et al. 1987; Dwork et al. 1986; 1988; Fischer et al. 1985; Herlihy
1991], showing that certain problems cannot be solved in asynchronous
systems using certain primitives. By contrast, a synchronization primitive is
universal if it can be used to transform any sequential object implementation
into a wait-free concurrent implementation. The author [Herlihy 1991] gives
a necessary and sufficient condition for universality: a synchronization
primitive is universal in an n-process system if and only if it solves the well-known
consensus problem [Fischer et al. 1985] for n processes. Although this result
established that wait-free (and lock-free) implementations are possible in
principle, the construction given was too inefficient to be practical. Plotkin
[1989] gives a detailed universal construction for a sticky-bit primitive. This
construction is also of theoretical rather than practical interest. Elsewhere
[Herlihy 1990], the author gives a simple and relatively efficient technique
for transforming stylized sequential object implementations into lock-free and
wait-free implementations using the compare&swap synchronization
primitive. Although the overall approach is similar to the one presented here, the
details are quite different. In particular, the constructions presented in this
paper are simpler and more efficient, for reasons discussed below.
Many researchers have studied the problem of constructing wait-free atomic
registers from simpler primitives [Burns and Peterson 1987; Lamport 1986;
Li et al. 1991; Peterson 1983; Peterson and Burns 1986]. Atomic registers,
however, have few if any interesting applications for concurrent data
structures, since they cannot be combined to construct lock-free or wait-free
implementations of most common data types [Herlihy 1991]. There exists an
extensive literature on concurrent data structures constructed from more
powerful primitives. Gottlieb et al. [1983] give a highly concurrent queue
implementation based on the replace-add operation, a variant of fetch&add.
This implementation permits concurrent enqueuing and dequeuing processes,
but it is blocking, since it uses critical sections to synchronize access to
individual queue elements. Lamport [1983] gives a wait-free queue
implementation that permits one enqueuing process to execute concurrently with
one dequeuing process. Herlihy and Wing [1987] give a lock-free queue
implementation, employing fetch&add and swap, that permits an arbitrary
number of enqueuing and dequeuing processes. Lanin and Shasha [1988] give
a lock-free set implementation that uses operations similar to compare&swap.
There exists an extensive literature on locking algorithms for concurrent
B-trees [Bayer and Schkolnick 1977; Lehman and Yao 1981] and for
related search structures [Biswas and Browne 1987; Ellis 1980; Ford and
Calhoun 1984; Guibas and Sedgewick 1978; Jones 1989]. More recent
approaches to implementing lock-free data structures include Allemany and
Felton's work on operating system support [Allemany and Felton 1992], and
Herlihy and Moss's work on hardware support [Herlihy and Moss 1993].
The load_linked and store_conditional synchronization primitives were
first proposed as part of the S-1 project [Jensen et al. 1987] at Lawrence
Livermore Laboratories, and they are currently supported in the MIPS-II
architecture [Kane 1989] and Digital's Alpha [Sites 1992]. They are closely
related to the compare&swap operation first introduced by the IBM 370
architecture [IBM].
Our techniques are distantly related to optimistic concurrency control
methods from the database literature [Kung and Robinson 1981]. In these
schemes, transactions execute without synchronization, but each transaction
must be validated before it is allowed to commit to ensure that synchroniza-
tion conflicts did not occur. Our method also checks after the fact whether
synchronization conflicts occurred, but the technical details are entirely
different.
3. OVERVIEW
A concurrent system consists of a collection of n sequential processes that
communicate through shared typed objects. Processes are sequential—each
process applies a sequence of operations to objects, alternately issuing an
invocation and then receiving the associated response. We make no fairness
assumptions about processes. A process can halt, or display arbitrary varia-
tions in speed. In particular, one process cannot tell whether another has
halted or is just running very slowly.
Objects are data structures in memory. Each object has a type, which
defines a set of possible values and a set of primitive operations that provide
the only means to manipulate that object. Each object has a sequential
specification that defines how the object behaves when its operations are
invoked one at a time by a single process. For example, the behavior of a
queue object can be specified by requiring that enqueue insert an item in the
queue, and that dequeue remove the oldest item present in the queue. In a
concurrent system, however, an object’s operations can be invoked by concur-
rent processes, and it is necessary to give a meaning to interleaved operation
executions.
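That sequential specification can itself be written down as ordinary C, in the spirit of the stylized sequential programs the methodology takes as input. This linked-list version is our own illustration; the paper's own examples and coding conventions appear in later sections.

```c
#include <stdlib.h>

/* Sequential specification of the queue: enqueue inserts an item, and
 * dequeue removes the oldest item present in the queue. */
typedef struct Node { int item; struct Node *next; } Node;
typedef struct { Node *head, *tail; } Queue;

/* Insert an item at the tail. */
void enqueue(Queue *q, int item) {
    Node *n = malloc(sizeof *n);
    n->item = item;
    n->next = NULL;
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
}

/* Remove and return the oldest item; precondition: queue nonempty. */
int dequeue(Queue *q) {
    Node *n = q->head;
    int item = n->item;
    q->head = n->next;
    if (!q->head)
        q->tail = NULL;
    free(n);
    return item;
}
```

Such a specification says nothing about concurrency; the next paragraphs explain how interleaved executions against it are given meaning.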
An object is linearizable [Herlihy and Wing 1987] if each operation appears
to take effect instantaneously at some point between the operation’s invoca-
tion and response. Linearizability implies that processes appear to be inter-
leaved at the granularity of complete operations, and that the order of
nonoverlapping operations is preserved. As discussed in more detail else-
where [Herlihy and Wing 1987], the notion of linearizability generalizes and
unifies a number of ad hoc correctness conditions in the literature, and it is
related to (but not identical with) correctness criteria such as sequential
consistency [Lamport 1979] and strict serializability [Papadimitriou 1979].
