
Relativistic red-black trees

TL;DR
Algorithms for concurrently reading and modifying a red-black tree (RBTree) are presented, which have deterministic response times for a given tree size and uncontended read performance that is at least 60% faster than other known approaches.
Abstract
This paper presents algorithms for concurrently reading and modifying a red-black tree (RBTree). The algorithms allow wait-free, linearly scalable lookups in the presence of concurrent inserts and deletes. They have deterministic response times for a given tree size and uncontended read performance that is at least 60% faster than other known approaches. The techniques used to derive these algorithms arise from a concurrent programming methodology called relativistic programming. Relativistic programming introduces write-side delay primitives that allow the writer to pay most of the cost of synchronization between readers and writers. Only minimal synchronization overhead is placed on readers. Relativistic programming avoids unnecessarily strict ordering of read and write operations while still providing the capability to enforce linearizability. This paper shows how relativistic programming can be used to build a concurrent RBTree with synchronization-free readers and both lock-based and transactional memory-based writers. Copyright © 2013 John Wiley & Sons, Ltd.


Portland State University
PDXScholar
Computer Science Faculty Publications and Presentations
Computer Science
1-2011
Relativistic Red-Black Trees
Philip William Howard
Portland State University
Jonathan Walpole
Portland State University
Follow this and additional works at: https://pdxscholar.library.pdx.edu/compsci_fac
Part of the Computer and Systems Architecture Commons, and the Databases and Information
Systems Commons
Let us know how access to this document benefits you.
Citation Details
Howard, Philip W., and Jonathan Walpole. Relativistic red-black trees. Technical Report 10-06, Portland
State University, Computer Science Department, 2010.
This Technical Report is brought to you for free and open access. It has been accepted for inclusion in Computer
Science Faculty Publications and Presentations by an authorized administrator of PDXScholar. Please contact us if
we can make this document more accessible: pdxscholar@pdx.edu.

Relativistic Red-Black Trees
Philip W. Howard
Portland State University
pwh@cecs.pdx.edu
Jonathan Walpole
Portland State University
walpole@cs.pdx.edu
Abstract
Operating system performance and scalability on shared-
memory many-core systems depends critically on efficient
access to shared data structures. Scalability has proven dif-
ficult to achieve for many data structures. In this paper we
present a novel and highly scalable concurrent red-black
tree.
Red-black trees are widely used in operating systems,
but typically exhibit poor scalability. Our red-black tree has
linear read scalability, uncontended read performance that
is at least 25% faster than other known approaches, and
deterministic lookup times for a given tree size, making it
suitable for realtime applications.
Keywords synchronization, data structures, scalability, con-
current programming, red-black trees
1. Introduction
The advent of many-core hardware introduces the need
for highly scalable operating system designs. Many-core
hardware poses a special challenge for symmetric shared-
memory operating system architectures because it dramat-
ically increases the degree of concurrency and at the same
time decreases the locality of accesses to kernel data. The
conventional strategy of using mutual exclusion severely
limits scalability by serializing accesses to shared data and
requiring extensive inter-core communication.
Some researchers have chosen to address this chal-
lenge by throwing out symmetric shared-memory multi-
processor operating system architecture, focusing instead
on OS architectures that are non-symmetric in their use of
shared-memory, or forgo shared-memory alto-
gether [Baumann 2009]. Our research takes a different ap-
proach. We continue to assume a shared-memory operating
system architecture can scale [Boyd-Wickizer 2010] and
work toward this by weakening the ordering constraints that
normally govern concurrent accesses to shared data struc-
tures. In this paper we illustrate our approach by considering
a particular data structure, namely a red-black tree.
Red-black trees are used to store sorted ⟨key,value⟩ pairs,
and are widely used in operating systems. They are used
in the Linux kernel for I/O schedulers, the process sched-
uler, the ext3 file system, and in many other places [Land-
ley 2007]. Linux kernel primitives that manipulate red-black
trees do not have concurrency control embedded in them. In-
stead, higher level uses of the primitives must manage con-
currency. This is typically done through mutual exclusion
which does not scale [Piggin 2010].
Our approach is similar to RCU-based approaches that
have been applied to simpler Linux data structures such as
lists and hash tables [Triplett 2010]. We refer to these al-
gorithms as “relativistic” because they weaken the ordering
requirements on concurrent reads and updates such that each
reader observes the data structure in its own temporal frame
of reference. Our research is attempting to generalize the
concept of relativistic programming. While this goal has yet
to be reached, the development of a relativistic data struc-
ture as complex as a red-black tree represents a significant
milestone.
Our relativistic red-black tree has the following perfor-
mance and scalability properties:
1. Linear scalability of read accesses even in the presence
of concurrent updates. This property has been tested out
to 64 hardware threads.
2. Updaters can proceed concurrently with any number of
readers, but not other updaters.
3. Safe, fast, wait-free read access in the presence of up-
dates.
By fast we mean performance that approaches that of
unsynchronized access.¹ Our read access achieves 93% of the
throughput of unsynchronized read access over a wide range
of tree sizes and thread counts. Our implementation is also
25% faster than the best lock-based implementation for an
uncontended read. As contention increases, the advantage of
our implementation grows significantly.

¹ Unsynchronized access is safe for single-threaded or read-only implemen-
tations, but not for multi-threaded implementations that include updates.

1 2010/12/8
By wait-free we mean that the read path does not use
locks, does not block, and never needs to wait for another
thread (neither a reader nor an updater). Furthermore, the
read path does not require any atomic instructions and on
x86 does not require memory barriers.
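The paper does not show the read path at this point, but its shape can be sketched in C. In the fragment below the node layout, the read_once macro, and all names are our assumptions rather than the paper's code; it shows a lookup built only from plain pointer loads, with no locks, no atomic read-modify-write instructions, and (on x86) no memory barriers:

```c
#include <stddef.h>

/* Hypothetical node layout; the paper does not prescribe one. */
struct rbnode {
    long key;
    void *value;
    struct rbnode *left, *right;   /* child pointers published by writers */
};

/* Force a single load of a shared pointer (kernel READ_ONCE style).
 * On x86 such plain loads need no barriers, which is why the read
 * path is barrier-free there. */
#define read_once(p) (*(volatile __typeof__(p) *)&(p))

/* Wait-free lookup: never blocks and never waits for another thread. */
static void *rb_lookup(struct rbnode *root, long key)
{
    struct rbnode *node = read_once(root);

    while (node != NULL) {
        if (key == node->key)
            return node->value;
        node = (key < node->key) ? read_once(node->left)
                                 : read_once(node->right);
    }
    return NULL;
}
```

For this to be safe, writers must publish new nodes with the corresponding store-side ordering; that write-side cost is exactly what relativistic programming shifts onto updaters.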
The rest of this paper is outlined as follows: Section 2
gives an overview of red-black trees, the operations they sup-
port, and the mechanisms that are used to preserve the bal-
anced nature of red-black trees. This section also discusses
the state of the art for parallelizing red-black trees. Section 3
discusses the ordering constraints that are, and are not, pre-
served by relativistic programming. This section also pro-
vides a justification for why these weakened ordering con-
straints are appropriate for concurrent red-black trees. Sec-
tion 4 presents our implementation for a relativistic red-
black tree. Section 5 shows the performance of our imple-
mentation compared with red-black trees implemented us-
ing other synchronization mechanisms. Section 6 discusses
some of the issues and trade-offs involved in performing
complete tree traversals (as opposed to single look-ups). Fi-
nally, Section 7 presents concluding remarks.
2. Red-Black Trees
Since red-black trees are well known and well documented
[Guibas 1978, Plauger 1999, Schneier 1992], we do not give
a complete explanation of them. Rather, we give a brief
overview to facilitate a discussion of our relativistic imple-
mentation. In particular, we discuss the individual steps that
make up red-black tree algorithms without discussing the
glue that combines these steps. This is because the glue is
not impacted by the relativistic implementation.
Red-black trees are partially balanced, sorted, binary
trees. The trees store ⟨key,value⟩ pairs. They support the
following operations:
insert(key, value) inserts a new ⟨key,value⟩ pair into the
tree.
lookup(key) returns the value associated with a key.
delete(key) removes a ⟨key,value⟩ pair from the tree.
first()/last() returns the first (lowest keyed) / last (highest
keyed) value in the tree.
next()/prev() returns the next/previous value in key-sorted
order from the tree.
Red-black trees are sorted by preserving the following
properties:
1. All nodes on the left branch of a subtree have a key ≤ the
key of the root of the subtree.
2. All nodes on the right branch of a subtree have a key >
the key of the root of the subtree.
The tree is balanced by assigning a color to each node
(red or black) and preserving the following properties:
1. Both children of a red node are black.
2. The black depth of every leaf is the same. The black depth
is the number of black nodes encountered on the path
from the root to the leaf.
These invariants are sufficient to guarantee O(log(N))
lookups because the longest possible path (alternating black
and red nodes) is at most twice the shortest possible path
(all black nodes). The operations required to rebalance a tree
following an insert or delete are limited to the path from the
inserted/deleted node back to the root. The rebalancing is,
worst case, O(log(N)), meaning that inserts and deletes can
also be done in O(log(N)).
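As an illustration of these invariants, a checker might look like the following C sketch. The node layout and names are ours, not the paper's; the function computes the black depth of a subtree and rejects any subtree that violates either balance property:

```c
#include <stddef.h>

enum color { RED, BLACK };

struct rb {
    enum color color;
    struct rb *left, *right;
};

/* Returns the black depth of the subtree rooted at n, or -1 if either
 * red-black invariant is violated. Empty children count as black. */
static int black_depth(const struct rb *n)
{
    if (n == NULL)
        return 1;                        /* empty child: one black node */

    /* Invariant 1: both children of a red node are black. */
    if (n->color == RED &&
        ((n->left  && n->left->color  == RED) ||
         (n->right && n->right->color == RED)))
        return -1;

    /* Invariant 2: both subtrees have the same black depth. */
    int l = black_depth(n->left);
    int r = black_depth(n->right);
    if (l < 0 || r < 0 || l != r)
        return -1;

    return l + (n->color == BLACK ? 1 : 0);
}
```

Because the longest root-to-leaf path alternates red and black while the shortest is all black, any tree this checker accepts has height at most twice log(N), which is what makes the O(log(N)) bounds hold.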
We will use the following definitions in the explanation
of the tree operations:
internal node A node with two non-empty children.
leaf A node with at least one empty child.
Observe that if next() is called on any internal node, the
result is always a leaf. This is true because next() returns the
left-most node of the right subtree.
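A minimal C sketch of this observation follows. The field names are illustrative; a general next() would also need parent links to handle nodes with an empty right child, which we omit here:

```c
#include <stddef.h>

struct node {
    long key;
    struct node *left, *right;
};

/* In-order successor of an internal node n (both children non-empty):
 * the left-most node of n's right subtree. Since we always step to a
 * node with an empty left child, the result has at least one empty
 * child and is therefore a leaf in the paper's terminology. */
static struct node *internal_next(struct node *n)
{
    struct node *s = n->right;

    while (s->left != NULL)
        s = s->left;
    return s;
}
```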
The following steps are used to implement red-black
trees:
Insertion New nodes are always inserted at the bottom of
the tree. This is possible because if prev(new-node) is an
internal node, then from the observation above, the new
node must be a leaf. If prev(new-node) is a leaf, the new
node will be a child of that node on an empty branch. The
insert may leave the tree unbalanced. If so, restructures or
recolors (see below) are required to restore the balance
properties of the tree.
Delete Nodes are always deleted from the bottom of the tree
(possibly following a swap—see below). The delete may
leave the tree unbalanced. If so, restructures or recolors
(see below) are required to restore the balance properties
of the tree.
Swap If an interior node needs to be deleted, it is first
swapped with next(deleted-node) prior to removal. This
makes the node to be deleted a leaf.
Restructures Restructure operations, sometimes called
rotations, are used to rebalance the tree. Restructures
always involve three adjacent nodes: child, parent, and
grandparent. See Figure 1 for an illustration of the two
types of restructure operations.
Recolor Nodes get recolored as part of the rebalancing pro-
cess. Recoloring doesn’t involve changing the structure
of the tree, only the colors applied to particular nodes.
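The structural part of a restructure is small. The following C sketch shows the left diag restructure of Figure 1 as a pointer rotation; node names follow the figure, but recoloring and the copy-based relativistic details discussed later are deliberately omitted, so this is only the shape of the step, not the paper's implementation:

```c
#include <stddef.h>

struct node {
    struct node *left, *right;
};

/* Left diag restructure (Figure 1): C is the grandparent, B its left
 * child, and A is B's left child. B is promoted to subtree root, and
 * the subtree that was B's right child (subtree 3) moves under C. */
static struct node *diag_restructure_left(struct node *c)
{
    struct node *b = c->left;

    c->left  = b->right;   /* subtree 3 becomes C's left child */
    b->right = c;          /* C becomes B's right child */
    return b;              /* B is the new root of this subtree */
}
```

The right-hand version is the mirror image, and the zig restructure is two such rotations in opposite directions, which is why only the left versions are shown in the figure.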

[Figure 1: diagrams of the DiagRestructure and ZigRestructure operations,
each rearranging three adjacent nodes A, B, and C and their subtrees 1-4.]

Figure 1. Restructure operations used to rebalance a red-
black tree. There are left and right versions of these, but they
are symmetric so only the left version is shown here.
2.1 Concurrent red-black trees
In thinking about concurrent red-black trees, it is useful to
make a distinction between events and operations. In our
terminology, operations are composed of events. Events are
steps in an operation which have a visible effect. Events can
be thought of as instantaneous. Since operations are typically
composed of multiple events, they have a duration.
We will use the following definitions to describe concur-
rent implementations:
S_n the Start of operation n.
F_n the Finish of operation n.
E_n the Effect of operation n. For example, if operation n is
an insert, the effect of that operation is that the tree has a
new node.
a → b defines a happens-before relation such that a happens
before b.
If either of the following two relations holds, operations a
and b are said to be concurrent:
S_a → S_b → F_a
S_b → S_a → F_b
Graphically, this means that the time-lines of the two oper-
ations overlap. There is no implied happens-before relation
between the effects of two concurrent operations. The effects
of the two operations could occur in any order.
Implementations of objects that allow updates to happen
concurrently with reads require additional properties so that
every intermediate representation of the data structure can be
mapped to a value of the abstract object [Herlihy 1990]. For
a sorted tree, the following properties must be maintained:
1. Lookups will always find a node that exists in the tree.
2. Traversals will always return nodes in the correct order
without skipping any nodes that exist in the tree.
Because reads have a duration, and because updates can
proceed concurrent with reads, it’s possible that the tree will
change during a read. As a result, we need to be specific in
what we mean by “nodes that exist in the tree”. In particular,
it means the following: if operation r is a read looking for
node N (or a complete traversal of the tree), operation i is
the insert of node N , and operation d is the delete of node
N , then
1. If F_i → S_r and F_r → S_d then N exists in the tree and
must be observed by r in the correct traversal order.
2. If F_r → S_i or if F_d → S_r then N does not exist in the
tree and must not be observed by r.
3. If i is concurrent with r then N may or may not be
observed by r depending on whether the relative view
of r is E_i → E_r or E_r → E_i.
4. If d is concurrent with r then N may or may not be
observed by r depending on whether the relative view
of r is E_r → E_d or E_d → E_r.
Another way to state these properties is as follows: Prop-
erties one and two state that any update that strictly precedes
a read must be observable by the read, and any update that
strictly follows a read must not be observable by the read.
Properties three and four state that any update that is con-
current with the read may or may not be observable by the
read.
2.2 The State of the Art
The most common way to synchronize access to a red-black
tree is through locking. Unfortunately, this approach doesn’t
scale because accesses are serialized. Since accesses can
be easily divided into reads (lookups) and writes (inserts,
deletes), a reader-writer lock can be used which allows read
parallelism. This approach scales for some number of read
threads, but eventually the contention for the lock dominates
and the approach no longer scales (see the performance data
in Section 5.1 for evidence of this).
Fine grained locking of red-black trees is problematic.
Since updates may affect all the nodes from where the update
occurred back to the root, the simplest approach of acquiring
a write lock on all nodes that might change degrades to
coarse grain locking—all updaters must acquire a write lock
on the root. If one attempts to only acquire write locks on the
nodes that will actually be changed, it is difficult to avoid
deadlock. If the locks are acquired from the bottom up, a
reader progressing down the tree, but above the updater, may

acquire a lock that prevents the write from completing. If
the locks are acquired from the top down, another updater
may change the structure of the tree between the time the
initial change was made (e.g. an insert) and the time when
the necessary locks are acquired to perform a restructure.
Transactional Memory approaches provide a more au-
tomatic approach to disjoint concurrency. However, as the
changes required to rebalance a tree progress up the tree,
more and more concurrent read transactions would get in-
validated. We haven’t done any investigation to determine
what percentage of concurrent transactions might get inval-
idated, so we can not predict the performance impact of the
invalidations.
Bronson [2010] developed a concurrent AVL tree.² Their
approach allows readers to proceed without locks, but the
readers have to check each step of the way to see if the
tree has changed or is in the process of changing. If so,
the reader has to wait and retry. Since readers don’t acquire
locks, this simplifies the fine grained locking of the writers.
Their approach is quite complicated and this degrades read
performance as more code must execute at each node of
the tree. Their approach allows concurrent updates and their
performance data show good scalability. We are working to
port their implementation from Java to C to perform a fair
side-by-side comparison, but we have not yet completed this
work. Work done to date indicates that our read approach is
much faster.
A number of researchers have attempted to decouple
rebalancing from insert and delete [Guibas 1978, Hanke
1998]. This allows updates to proceed more quickly because
individual inserts and deletes don’t have to rebalance the
tree. The rebalancing work can potentially be done in paral-
lel and some redundant work can be skipped. None of this
improves read access time, and readers and writers still need
some synchronization between them.
3. Relativistic Programming
The name for relativistic programming is borrowed from
Einstein’s theory of relativity in which each observer is al-
lowed to have their own frame of reference. In relativis-
tic programming, each reader is allowed to have their own
frame of reference with respect to the order of updates.
Relativistic programming is characterized by the follow-
ing two properties:
1. Writes can occur concurrently with reads.
2. Writes are not totally ordered with respect to reads.
Consider the time-line in figure 2. If operations A and B
are writes and operation C is a read, then C can observe the
writes in either order. In particular, since A is concurrent
with C, both E_A → E_C and E_C → E_A are equally
valid. The same is true of B and C. Combining all three
operations, the ordering E_B → E_C → E_A is valid even
though this violates the happens-before relation between the
non-concurrent A and B.

² AVL trees are similar to red-black trees, but they have a different balance
property.
[Figure 2: a time-line in which operation A completes before operation B
begins, while operation C overlaps both.]

Figure 2. Operation C can see operations A and B in any
order.
It is important to note that the order mentioned above rep-
resents the reference frame of a particular reader. There is
no “global observer” which determines the “correct” order.
Each reader has their own relative view of concurrent oper-
ations which may differ from the view of other concurrent
readers.
3.1 Is this OK?
While it might be disconcerting to have writes appear to
happen in different orders, there are two conditions which, if
met, make this acceptable. The conditions are as follows:
1. The underlying data structure does not have an inherent
time order.
2. The updates are independent or commutative.
Last In First Out Stacks and First In First Out Queues have
an inherent time order (thus the First and Last in their
names). As a result, these data structures are not good fits
for relativistic implementations.³ However, many other data
structures (lists, dictionaries, trees, etc.) have no such inher-
ent time order and thus allow a relativistic implementation.
To illustrate what is meant by independent or commutable
updates, consider a phone company that uses a tree to main-
tain phone book information. If two customers call the phone
company to change their service, the two calls can be han-
dled in either order. Neither call affects the other so they are
independent. They are also commutable because the results
are the same regardless of the order. Any query that saw nei-
ther, either, or both updates is equally valid. Even printing a
phone book that included neither, either, or both updates is
equally valid. If either of these customers called to complain
about their inclusion or omission from the phone book, the
phone company could legitimately reply that the book was
printed either just before or just after their information was
entered into the system.
3.2 Memory Management
Relativistic programming requires a mechanism to reclaim
memory that has been freed by one thread while still in
use by another. Freed memory comes from two sources:
nodes that are removed from the tree and the “old” copy of
³ Some researchers have proposed weak ordering for LIFOs and FIFOs.
This would yield a Later In Earlier Out or Earlier In Earlier Out structure
that would be suitable for relativistic techniques.
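The reclamation pattern this section describes can be sketched in C. The names below (wait_for_readers, remove_and_reclaim) are our shorthand rather than the paper's API, and the wait itself is stubbed out; in the Linux kernel the analogous primitive is synchronize_rcu(). The point is the ordering: unlink, wait for pre-existing readers, then free:

```c
#include <stdlib.h>

struct rbnode { long key; };

static int readers_drained;   /* records that the writer waited */

/* Stand-in for an RCU-style wait-for-readers primitive: it must block
 * until every reader that started before the call has finished.
 * Here it only sets a flag so the pattern can be exercised. */
static void wait_for_readers(void)
{
    readers_drained = 1;
}

/* Writer-side reclamation: the node is unlinked first (readers that
 * began earlier may still be traversing it), the writer waits for
 * those readers to drain, and only then is the memory freed. */
static void remove_and_reclaim(struct rbnode *victim)
{
    /* ... victim has already been unlinked under the writer lock ... */
    wait_for_readers();
    free(victim);   /* safe: no reader can still hold a reference */
}
```

The same pattern covers both sources of freed memory: nodes removed from the tree and old copies replaced during an update.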

References

Linearizability: a correctness condition for concurrent objects [Herlihy 1990]
Skip lists: a probabilistic alternative to balanced trees
The multikernel: a new OS architecture for scalable multicore systems [Baumann 2009]
A dichromatic framework for balanced trees [Guibas 1978]
Frequently Asked Questions

Q1. What are the contributions in "Relativistic red-black trees"?

In this paper the authors present a novel and highly scalable concurrent red-black tree. Their red-black tree has linear read scalability, uncontended read performance that is at least 25% faster than other known approaches, and deterministic lookup times for a given tree size, making it suitable for realtime applications.

The authors are also working to solve the multi-update problem so that the relativistic implementation can have concurrent updaters as well as concurrent readers with a single updater.

This is because all the synchronization mechanisms except locking allow read concurrency and because the time is dominated by the traversal, not by the synchronization.

The consequence is that a traversal will take O(N log(N)) time, and the tree that is traversed may not represent any state present in a globally ordered time.

To allow this type of traversal using the relativistic read and update algorithms described earlier, the mutex used for the write lock is replaced with a reader-writer lock.