What is the main focus of section 6?

Section 6 focuses on a simple subclass of optimistic replication systems, called state-transfer systems, and several interesting techniques available to them.

Why do sites have to undo and redo operations?

Because sites may receive operations in different orders, they must undo and redo operations repeatedly as they gradually learn the final order.

What is the meaning of state-transfer systems?

Such systems are called state-transfer systems, as they only need to record and transmit the final values of objects, not the sequence of operations.

What are the key design choices for optimistic replication systems?

Section 3 introduces six key design choices for optimistic replication systems: the number of masters, state vs. operation transfer, scheduling, conflict management, operation propagation, and consistency guarantees.

(Open Access) Optimistic replication (2005) | Yasushi Saito

Optimistic replication

Yasushi Saito

Hewlett-Packard Laboratories, Palo Alto, CA (USA)

and

Marc Shapiro

Microsoft Research Ltd., Cambridge (UK)

Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance.

This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short

term, in order to support concurrent work practices and to tolerate failures in low-quality communication links.

The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular.

Optimistic replication techniques are different from traditional “pessimistic” ones. Instead of synchronous

replica coordination, an optimistic algorithm propagates changes in the background, discovers conﬂicts after they

happen and reaches agreement on the ﬁnal contents incrementally.

We explore the solution space for optimistic replication algorithms. This paper identiﬁes key challenges facing

optimistic replication systems — ordering operations, detecting and resolving conﬂicts, propagating changes

efﬁciently, and bounding replica divergence — and provides a comprehensive survey of techniques developed for

addressing these challenges.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—Dis-

tributed applications; H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Replication, Distributed Systems, Internet

1. INTRODUCTION

Data replication consists of maintaining multiple copies of critical data, called replicas, on

separate computers. It is a critical enabling technology of distributed services, improving

both their availability and performance. Availability is improved by allowing access to the

data even when some of the replicas are unavailable. Performance improvements concern

reduced latency, which improves by letting users access nearby replicas and avoiding remote

network access, and increased throughput, by letting multiple computers serve the

data.

This work is supported in part by DARPA Grant F30602-97-2-0226 and National Science Foundation Grant #

EIA-9870740.

Authors’ addresses: Yasushi Saito, Hewlett-Packard Laboratories, 1501 Page Mill Rd, MS 1U-34, Palo Alto, CA,

93403, USA. mailto:yasushi@cs.washington.edu, http://www.hpl.hp.com/personal/Yasushi Saito. Marc

Shapiro, Microsoft Research Ltd., 7 J J Thomson Ave, Cambridge CB3 0FB, United Kingdom. mailto:Marc.Shapiro@acm.org,

http://www-sor.inria.fr/~shapiro/.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without

fee provided that the copies are not made or distributed for proﬁt or commercial advantage, the copyright notice,

the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc.

To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior speciﬁc permission

and/or a fee.

This paper surveys optimistic replication algorithms. Compared to traditional “pessimistic”

techniques, optimistic replication promises higher availability and performance,

but lets replicas temporarily diverge and lets users see inconsistent data. The remainder of

this introduction overviews the concept of optimistic replication, deﬁnes its basic elements,

and compares it to traditional replication techniques.

1.1 Traditional replication techniques and their limitations

Traditional replication techniques try to maintain single-copy consistency — they give

users an illusion of having a single, highly available copy of data [Bernstein and Goodman

1983; Bernstein et al. 1987]. This goal can be achieved in many ways, but the basic concept

remains the same: traditional techniques block access to a replica unless it is provably up

to date. We call these techniques “pessimistic” for this reason. For example, primary-copy

algorithms, used widely in commercial systems, elect a primary replica that is responsible

for handling all accesses to a particular object [Bernstein et al. 1987; Dietterich 1994; Oracle 1996].

After an update, the primary synchronously writes the change to the secondary

replicas. If the primary crashes, secondaries confer to elect a new primary. Such pessimistic

techniques perform well in local-area networks, in which latencies are small and

failures uncommon. Given the continuing progress of Internet technologies, it is tempting

to apply pessimistic algorithms to wide-area data replication. We cannot expect good

performance and availability in this environment, however, for three key reasons.

First, the Internet remains slow and unreliable. The Internet’s communication end-to-end

latency and availability do not seem to be improving [Zhang et al. 2000; Chandra

et al. 2001]. In addition, mobile computers with intermittent connectivity are becoming

increasingly popular. A pessimistic replication algorithm, attempting to synchronize with

an unavailable site, would block completely. Well-known impossibility results even raise

the possibility that it might corrupt data; for instance, it is impossible to agree on a single

primary after a failure when network delay is unpredictable [Fischer et al. 1985; Chandra

and Toueg 1996].

Second, pessimistic algorithms scale poorly in the wide area. It is difﬁcult to build a

large, pessimistically replicated system with frequent updates, because its throughput and

availability suffer as the number of sites increases [Yu and Vahdat 2001; Yu and Vahdat

2002]. This is why many Internet and mobile services are optimistic, for instance Usenet

[Spencer and Lawrence 1998; Lidl et al. 1994], DNS [Mockapetris 1987; Mockapetris and

Dunlap 1988; Albitz and Liu 2001], and mobile ﬁle and database systems [Walker et al.

1983; Kistler and Satyanarayanan 1992; Moore 1995; Ratner 1998].

Third, some human activities require asynchronous data sharing. Cooperative engineering

or program development often requires people to work in relative isolation. It is better

to allow concurrent operations, and to repair occasional conﬂicts after they happen, than to

lock out the data while someone is editing it.

1.2 What is optimistic replication?

Optimistic replication is a group of techniques for sharing data efﬁciently in wide-area

or mobile environments. The key feature that separates optimistic replication algorithms

from their pessimistic counterparts is their approach to concurrency control. Pessimistic

algorithms synchronously coordinate replicas during accesses and block the other users

during an update. In contrast, optimistic algorithms let data be read or written without

a priori synchronization, based on the “optimistic” assumption that problems will occur

only rarely, if at all. Updates are propagated in the background, and occasional conflicts

are fixed after they happen. It is not a new idea, but its use has exploded due to the

proliferation of the Internet and mobile computing technologies.

Optimistic algorithms offer many advantages over their pessimistic counterparts. First,

they improve availability: applications make progress even when network links and sites

are unreliable. Second, they are flexible with respect to networking, because techniques

such as epidemic replication propagate operations reliably to all replicas, even when the

communication graph is unknown and variable. Third, optimistic algorithms should be

able to scale to a large number of replicas, because they require little synchronization

among sites. Fourth, sites and users are highly autonomous: for example, services such

as FTP and Usenet mirroring [Nakagawa 1996; Krasel 2000] let a replica be added with

no change to existing sites. Optimistic replication also enables asynchronous collaboration

between users, for instance in CVS [Cederqvist et al. 2001; Vesperman 2003] or Lotus

Notes [Kawell et al. 1988]. Finally, optimistic algorithms provide quick feedback, as they

can apply updates tentatively as soon as they are submitted.

These benefits, however, come at a cost. Any distributed system faces a trade-off between

availability and consistency [Fox and Brewer 1999; Yu and Vahdat 2002]. Where a

pessimistic algorithm waits, an optimistic one speculates. Optimistic replication faces the

unique challenges of diverging replicas and conﬂicts between concurrent operations. It is

thus applicable only for applications that can tolerate occasional conﬂicts and inconsistent

data. Fortunately, in many real-world systems, especially ﬁle systems, conﬂicts are known

to be rather rare, thanks to the data partitioning and access arbitration that naturally happen

between users [Ousterhout et al. 1985; Baker et al. 1991; Vogels 1999; Wang et al. 2001].

1.3 Elements of optimistic replication

This section introduces some basic concepts of optimistic replication and defines common

terms used throughout the paper. Figure 1 illustrates how these concepts fit together,

and Table 1 provides a reference for common terms. This section provides only a terse

overview, as later ones will go into more detail.

1.3.1 Objects, replicas, and sites. Any replicated system has a concept of the minimal

unit of replication. We call such a unit an object. A replica is a copy of an object stored at

a site, that is, a computer. A site may store replicas of multiple objects, but we often use the terms

replica and site interchangeably, since most optimistic replication algorithms manage each

object independently. When describing algorithms, it is useful to distinguish sites that can

update an object — called master sites — from those that store read-only replicas. We use

the symbol N to denote the total number of replicas and M to denote the number of master

replicas for a given object. Common values are M = 1 (single-master systems) and M = N.

1.3.2 Operations. An optimistic replication system must allow access to a replica even

while it is disconnected. In this paper, we call a self-contained update to an object an

operation. To update an object, a user submits an operation at some site. An operation

includes a prescription to update the object as well as a precondition for detecting conﬂicts.

The concrete nature of prescriptions and preconditions varies widely among systems.

Our earliest reference is from Johnson and Thomas [1976], but the idea was certainly developed much earlier.

Tolerating Byzantine (malicious) failures is outside our scope; we cite a few recent papers in this area: Spreitzer

et al. [1997], Minsky [2002] and Mazières and Shasha [2002].

Fig. 1. Elements of optimistic replication and their roles. Disks represent replicas, memo sheets represent operations, and arrows represent communications between replicas.

(a) Operation submission: Users at different sites submit operations independently.

(b) Propagation: Sites communicate and exchange operations.

(c) Scheduling: Sites compute the ordering of operations.

(d) Conflict resolution: Sites detect conflicts and transform offending operations to produce results intended by users.

(e) Commitment: Sites agree on the final ordering and reconciliation result. Their changes become permanent.

Many systems support only whole-object updates, including Palm [PalmSource 2002] and

DNS [Albitz and Liu 2001]. Such systems are called state-transfer systems, as they only

need to record and transmit the ﬁnal values of objects, not the sequence of operations.

Other systems, called operation-transfer systems, allow for more sophisticated descriptions

of updates. For example, updates in Bayou [Terry et al. 1995] are written in SQL.
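
The distinction between the two kinds of system can be made concrete with a small sketch. It is illustrative only: the names Operation, apply_operation, add_five and overwrite are hypothetical and do not come from any system cited here. It models an operation as a precondition paired with a prescription, and shows a state-transfer update as the special case that overwrites the whole object.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Dict[str, int]


@dataclass(frozen=True)
class Operation:
    """A self-contained update: a precondition plus a prescription."""
    name: str
    precondition: Callable[[State], bool]    # True if the update may be applied
    prescription: Callable[[State], State]   # returns the updated object


def apply_operation(state: State, op: Operation) -> Tuple[State, bool]:
    """Apply op if its precondition holds; otherwise leave state unchanged."""
    if not op.precondition(state):
        return state, False                  # precondition violated: a conflict
    return op.prescription(state), True


# Operation transfer (assumed example): ship a description of the update.
add_five = Operation(
    name="add 5 to x",
    precondition=lambda s: "x" in s,
    prescription=lambda s: {**s, "x": s["x"] + 5},
)


# State transfer (assumed example): ship only the final value of the object.
def overwrite(new_value: State) -> Operation:
    return Operation(
        name="overwrite object",
        precondition=lambda s: True,
        prescription=lambda s: dict(new_value),
    )


replica: State = {"x": 1}
replica, ok = apply_operation(replica, add_five)
```

In this sketch a state-transfer site only ever needs the latest value passed to overwrite, whereas an operation-transfer site must keep the operations themselves so that they can be shipped and replayed elsewhere.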

A site applies an operation locally immediately, and it exchanges and applies remote

operations in the background. Such systems are said to offer eventual consistency, because

they guarantee that the state of replicas will converge only eventually. Such a weak guarantee

is enough for many optimistic replication applications, but some systems provide

stronger guarantees, e.g., that a replica’s state is never more than 1 hour old.

1.3.3 Propagation. An operation submitted by the user of a replica is tentatively applied

to the local replica to let the user continue working based on that update. It is also

logged, i.e., remembered in order to be propagated to other sites later. These systems often

deploy epidemic propagation to let all sites receive operations, even when they cannot

communicate with each other directly [Demers et al. 1987]. Epidemic propagation lets any

two sites that happen to communicate exchange their local operations as well as operations

they received from a third site — an operation spreads like a virus does among humans.
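
A minimal sketch of epidemic propagation follows, under simplifying assumptions: every site keeps its whole log in memory, operations are identified by a hypothetical (site name, sequence number) pair, and an exchange copies every missing entry in both directions. The class Site and its methods are invented for illustration; real systems exchange far less data per session.

```python
from typing import Dict, Tuple

OpId = Tuple[str, int]  # (originating site, local sequence number); assumed scheme


class Site:
    def __init__(self, name: str) -> None:
        self.name = name
        self.next_seq = 0
        self.log: Dict[OpId, str] = {}   # every operation this site knows about

    def submit(self, payload: str) -> OpId:
        """Log a locally submitted operation so it can be propagated later."""
        op_id = (self.name, self.next_seq)
        self.next_seq += 1
        self.log[op_id] = payload
        return op_id

    def exchange_with(self, other: "Site") -> None:
        """Pairwise session: each side copies the operations it lacks,
        including ones the peer originally received from a third site."""
        for op_id, payload in list(other.log.items()):
            self.log.setdefault(op_id, payload)
        for op_id, payload in list(self.log.items()):
            other.log.setdefault(op_id, payload)


a, b, c = Site("A"), Site("B"), Site("C")
a.submit("x = 1")
c.submit("y = 2")
a.exchange_with(b)   # B learns A's operation
b.exchange_with(c)   # C learns A's operation through B; B learns C's
```

Sites A and C never communicate directly here, yet C ends up holding A's operation. A still lacks C's operation until its next session with B or C.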

1.3.4 Tentative execution and scheduling. Because of background propagation, operations

are not always received in the same order at all sites. Each site must reconstruct an

appropriate ordering that produces an equivalent result across sites and matches the users’

intuitive expectations. Thus, an operation is initially considered tentative. A site might

reorder or transform operations repeatedly until it agrees with others on the ﬁnal operation

ordering. We use the term scheduling to refer to the (often non-deterministic) ordering

policy.
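
As a sketch of why reordering is needed, assume one particular scheduling policy: operations are ordered by a (timestamp, site name) stamp. That policy is an assumption made for this example, not the only one the text allows. Each replica below undoes its tentative state and redoes its whole log in stamp order whenever a new operation arrives. The names Replica, add_three and double are hypothetical.

```python
from typing import Callable, List, Tuple

Stamp = Tuple[int, str]                 # (timestamp, site name): a total order
Op = Tuple[Stamp, Callable[[int], int]]


class Replica:
    def __init__(self, initial: int = 0) -> None:
        self.initial = initial
        self.log: List[Op] = []
        self.state = initial

    def receive(self, op: Op) -> None:
        """Add op to the log, then undo and redo in the scheduled order."""
        self.log.append(op)
        self.log.sort(key=lambda entry: entry[0])   # schedule by stamp
        self.state = self.initial                   # undo all tentative work
        for _, update in self.log:                  # redo in scheduled order
            self.state = update(self.state)


add_three: Op = ((1, "A"), lambda v: v + 3)
double: Op = ((2, "B"), lambda v: v * 2)

r1, r2 = Replica(), Replica()
r1.receive(add_three)
r1.receive(double)
r2.receive(double)       # arrives first at r2, applied tentatively
r2.receive(add_three)    # forces r2 to undo and redo
```

Both replicas reach the same state even though they received the two operations in opposite orders. Replaying the full log on every arrival is the simplest way to express undo and redo, and is chosen here for clarity rather than efficiency.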

1.3.5 Detecting and resolving conﬂicts. With no a priori site coordination, multiple

users may update the same object at the same time. One could simply ignore such a situation;

for instance, a room-booking system could handle two requests for the same room

by picking one arbitrarily and discarding the other. However, simply dropping concurrent

requests is not desirable in many applications, including room booking. This problem is

called lost updates.

A better way to handle this problem is to detect operations that are in conﬂict and resolve

them, for example, by letting people renegotiate their schedule. A conﬂict happens when

the precondition of an operation is violated, if it is to be executed according to the system’s

scheduling policy. In many systems, preconditions are built implicitly into the replication

algorithm. The simplest example is when all concurrent operations are ﬂagged to be in

conﬂict, as with the Palm Pilot [PalmSource 2002] and the Coda mobile ﬁle system [Kumar

and Satyanarayanan 1995]. Other systems let users write preconditions explicitly — for

example, in a room booking system written in Bayou, a precondition might check the status

of the room and disallow double booking [Terry et al. 1995].
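
The room-booking case can be sketched as follows. This is not Bayou's interface; Booking and schedule are hypothetical names, and the precondition is simply "the slot is still free" when the operation is executed in scheduled order. A request whose precondition fails is flagged as a conflict rather than silently dropped.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Slot = Tuple[str, int]  # (room, hour); assumed granularity for the example


@dataclass(frozen=True)
class Booking:
    user: str
    room: str
    hour: int


def schedule(bookings: List[Booking]) -> Tuple[Dict[Slot, str], List[Booking]]:
    """Execute bookings in the given order; return (calendar, conflicts)."""
    calendar: Dict[Slot, str] = {}
    conflicts: List[Booking] = []
    for booking in bookings:
        slot = (booking.room, booking.hour)
        if slot in calendar:              # precondition violated: double booking
            conflicts.append(booking)     # flag it so it can be resolved later
        else:
            calendar[slot] = booking.user
    return calendar, conflicts


calendar, conflicts = schedule([
    Booking("alice", "R1", 10),
    Booking("bob", "R1", 10),     # concurrent request for the same slot
    Booking("carol", "R2", 10),
])
```

The conflicts list is what distinguishes this from the lost-update behaviour described above: the losing request is kept and reported, so the users involved can renegotiate.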

Conﬂict resolution is usually highly application speciﬁc. Most systems simply ﬂag a

conﬂict and let users ﬁx it manually. Some systems can resolve a conﬂict automatically.

For example, in Coda, concurrent writes to a ’*.o’ file can be resolved simply by recompiling

the source file [Kumar and Satyanarayanan 1995]. We discuss conflict detection and

resolution in more detail in Sections 5 and 6.

1.3.6 Commitment. Scheduling and conflict resolution often both involve non-deterministic

choices, e.g., regarding ordering of concurrent operations. Moreover, a

replica may not have received all the operations that others have. Commitment refers to

an algorithm to converge the state of replicas by letting sites agree on the set of operations

and their ﬁnal ordering and conﬂict-resolution results.

1.4 Comparison with advanced transaction models

Optimistic replication is related to relaxed (or advanced) transaction models [Elmagarmid

1992; Ramamritham and Chrysanthis 1996]. Both relax the ACID requirements of traditional

databases to improve performance and availability, but the motives are different.

Advanced transaction models try to increase the system’s throughput by, for example,

letting transactions read values produced by non-committed transactions [Pu et al. 1995].

Designed for a single-node or well-connected distributed database, they require frequent

communication during transaction execution.

Optimistic replication systems, in contrast, are designed to work with a high degree of

asynchrony and autonomy. Sites exchange operations in the background and still agree on

a common state. They must learn about relationships between operations, often long after

they were submitted, and at sites different from where submitted. Their techniques, such

as the use of operations, scheduling, and conﬂict detection, reﬂect the characteristics of

environments for which they are designed. Preconditions play a role similar to traditional

concurrency control mechanisms, such as two-phase locking or optimistic concurrency

control [Bernstein et al. 1987], but they operate without inter-site coordination. Conflict

resolution corresponds to transaction abortion, in that both are designed to ﬁx problems in

concurrency control.

That said, there are many commonalities between optimistic replication and advanced

ACID demands that a group of operations, called a transaction, be: Atomic (all-or-nothing), Consistent (safe

when executed sequentially), Isolated (intermediate state is not observable) and Durable (the final state is persistent)

[Gray and Reuter 1993].


References

Time, clocks, and the ordering of events in a distributed system

Impossibility of distributed consensus with one faulty process

Concurrency Control and Recovery in Database Systems

Hypertext Transfer Protocol -- HTTP/1.1
