
Concurrent cache-oblivious b-trees

Abstract
This paper presents concurrent cache-oblivious (CO) B-trees. We extend the cache-oblivious model to a parallel or distributed setting and present three concurrent CO B-trees. Our first data structure is a concurrent lock-based exponential CO B-tree. This data structure supports insertions and non-blocking searches/successor queries. The second and third data structures are lock-based and lock-free variations, respectively, on the packed-memory CO B-tree. These data structures support range queries and deletions in addition to the other operations. Each data structure achieves the same serial performance as the original data structure on which it is based. In a concurrent setting, we show that these data structures are linearizable, meaning that completed operations appear to an outside viewer as though they occurred in some serialized order. The lock-based data structures are also deadlock free, and the lock-free data structure guarantees forward progress by at least one process.


CACHE-OBLIVIOUS B-TREES

MICHAEL A. BENDER, ERIK D. DEMAINE, AND MARTIN FARACH-COLTON
Abstract. This paper presents two dynamic search trees attaining near-optimal performance on any hierarchical memory. The data structures are independent of the parameters of the memory hierarchy, e.g., the number of memory levels, the block-transfer size at each level, and the relative speeds of memory levels. The performance is analyzed in terms of the number of memory transfers between two memory levels with an arbitrary block-transfer size of B; this analysis can then be applied to every adjacent pair of levels in a multilevel memory hierarchy. Both search trees match the optimal search bound of Θ(1 + log_{B+1} N) memory transfers. This bound is also achieved by the classic B-tree data structure on a two-level memory hierarchy with a known block-transfer size B. The first search tree supports insertions and deletions in Θ(1 + log_{B+1} N) amortized memory transfers, which matches the B-tree's worst-case bounds. The second search tree supports scanning S consecutive elements optimally in Θ(1 + S/B) memory transfers and supports insertions and deletions in Θ(1 + log_{B+1} N + (log² N)/B) amortized memory transfers, matching the performance of the B-tree for B = Ω(log N log log N).

Key words. Memory hierarchy, cache efficiency, data structures, search trees

AMS subject classifications. 68P05, 68P30, 68P20

DOI. 10.1137/S0097539701389956
1. Introduction. The memory hierarchies of modern computers are becoming
increasingly steep. Typically, an L1 cache access is two orders of magnitude faster
than a main memory access and six orders of magnitude faster than a disk access [27].
Thus, it is dangerously inaccurate to design algorithms assuming a flat memory with
uniform access times.
Many computational models attempt to capture the effects of the memory hier-
archy on the running times of algorithms. There is a tradeoff between the accuracy of
the model and its ease of use. One body of work explores multilevel memory hierar-
chies [2, 3, 5, 7, 43, 44, 49, 51], though the proliferation of parameters in these models
makes them cumbersome for algorithm design. A second body of work concentrates
on two-level memory hierarchies, either main memory and disk [4, 12, 32, 49, 50] or
cache and main memory [36, 45]. With these models the programmer must anticipate
which level of the memory hierarchy is the bottleneck. For example, a B-tree that has
been tuned to run on disk has poor performance in memory.
1.1. Cache-Oblivious Algorithms. The cache-oblivious model enables us to
reason about a simple two-level memory but prove results about an unknown mul-
tilevel memory. This model was introduced by Frigo et al. [31] and Prokop [40].
They show that several basic problems—namely, matrix multiplication, matrix trans-
pose, the fast Fourier transform (FFT), and sorting—have optimal algorithms that
are cache oblivious. Optimal cache-oblivious algorithms have also been found for LU
∗ Received by the editors May 31, 2001; accepted for publication (in revised form) May 25, 2005; published electronically DATE. A preliminary version of this paper appeared in FOCS 2000 [18].
† Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400 (bender@cs.sunysb.edu). This author's work was supported in part by HRL Laboratories, ISX Corporation, Sandia National Laboratories, and NSF grants EIA-0112849 and CCR-0208670.
‡ Computer Science and Artificial Intelligence Laboratory, MIT, 32 Vassar Street, Cambridge, MA 02139 (edemaine@mit.edu). This author's work was supported in part by NSF grant EIA-0112849.
§ Department of Computer Science, Rutgers University, Piscataway, NJ 08855 (farach@cs.rutgers.edu). This author's work was supported by NSF grant CCR-9820879.

decomposition [21, 46] and static binary search [40]. These algorithms perform an
asymptotically optimal number of memory transfers for any memory hierarchy and
at all levels of the hierarchy. More precisely, the number of memory transfers be-
tween any two levels is within a constant factor of optimal. In particular, any linear
combination of the transfer counts is optimized.
The theory of cache-oblivious algorithms is based on the ideal-cache model of Frigo et al. [31] and Prokop [40]. In the ideal-cache model there are two levels in the memory hierarchy, called cache and main memory, although they could represent any pair of levels. Main memory is partitioned into memory blocks, each consisting of a fixed number B of consecutive cells. The cache has size M, and consequently has capacity to store M/B memory blocks.¹ In this paper, we require that M/B be greater than a sufficiently large constant. The cache is fully associative, that is, it can contain an arbitrary set of M/B memory blocks at any time.
The parameters B and M are unknown to the cache-oblivious algorithm or data
structure. As a result, the algorithm cannot explicitly manage memory, and this
burden is taken on by the system. When the algorithm accesses a location in memory
that is not stored in cache, the system fetches the relevant memory block from main
memory in what is called a memory transfer. If the cache is full, a memory block
is elected for replacement based on an optimal offline analysis of the future memory
accesses of the algorithm.
Although this model may superficially seem unrealistic, Frigo et al. show that it
can be simulated by essentially any memory system with a small constant-factor over-
head. Thus, if we run a cache-oblivious algorithm on a multilevel memory hierarchy,
we can use the ideal-cache model to analyze the number of memory transfers between
each pair of adjacent levels. See [31, 40] for details.
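The two-level model above is easy to sketch in code. The simulator below is our illustration, not part of the paper: it counts memory transfers for a sequence of cell accesses, substituting LRU for the model's optimal offline replacement, which (as just noted) costs only a constant factor.

```python
from collections import OrderedDict

def count_transfers(accesses, B, M):
    """Count memory transfers for a sequence of cell accesses on a
    two-level hierarchy with block size B and cache size M.  Uses LRU
    replacement as a stand-in for the ideal-cache model's optimal
    offline policy (Frigo et al. show LRU loses only a constant factor)."""
    assert M % B == 0 and M // B >= 1
    cache = OrderedDict()            # block id -> None, in LRU order
    capacity = M // B                # the cache holds M/B blocks
    transfers = 0
    for x in accesses:
        blk = x // B                 # memory is split into aligned B-cell blocks
        if blk in cache:
            cache.move_to_end(blk)   # hit: refresh LRU position
        else:
            transfers += 1           # miss: fetch the block from main memory
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[blk] = None
    return transfers

# A scan of N consecutive cells costs about N/B transfers,
# whatever B the system actually uses.
print(count_transfers(range(1000), B=10, M=100))   # -> 100
```

Running the same access sequence with different (B, M) pairs is exactly how one analyzes a cache-oblivious algorithm at each pair of adjacent levels.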
The concept of algorithms that are uniformly optimal across multiple memory
models was considered previously by Aggarwal et al. [2]. These authors introduce the
Hierarchical Memory Model (HMM), in which the cost to access memory location x
is df(x)e where f(x) is monotone nondecreasing and polynomially bounded. They
give algorithms for matrix multiplication and the FFT that are optimal for any cost
function f (x). One distinction between the HMM model and the cache-oblivious
model is that, in the HMM model, memory is managed by the algorithm designer,
whereas in the cache-oblivious model, memory is managed by the existing caching and
paging mechanisms. Also, the HMM model does not include block transfers, though
Aggarwal, Chandra, and Snir [3] later extended the HMM to the Block Transfer (BT)
model to take into account block transfers. In the BT model the algorithm can choose
and vary the block size, whereas in the cache-oblivious model the block size is fixed
and unknown.
1.2. B-Trees. In this paper, we initiate the study of dynamic cache-oblivious
data structures by developing cache-oblivious search trees.
The classic I/O-efficient search tree is the B-tree [13]. The basic idea is to maintain
a balanced tree of N elements with node fanout proportional to the memory block
size B. Thus, one block read determines the next node out of Θ(B) nodes, so a search completes in Θ(1 + log_{B+1} N) memory transfers.² A simple information-theoretic argument shows that this bound is optimal.
¹ Note that B and M are parameters, not constants. Consequently, they must be preserved in asymptotic notation in order to obtain accurate running-time estimates.
² We use B + 1 as the base of the logarithm to correctly capture that the special case of B = 1 corresponds to the RAM.
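As a quick numerical check on that base (our illustration, not from the paper), the number of block reads for a search is the height of a fanout-(B+1) tree over N keys:

```python
import math

def btree_search_transfers(N, B):
    """Memory transfers to search a B-tree on N keys: one block read per
    level, and a tree with fanout B + 1 has about log_{B+1} N levels."""
    return max(1, math.ceil(math.log(N + 1, B + 1)))

# The B = 1 row is exactly binary search on a RAM, which is why the
# logarithm's base is B + 1 rather than B.
for B in (1, 64, 1024):
    print(B, btree_search_transfers(10**6, B))
```

For a million keys this gives 20 transfers at B = 1 (plain binary search) but only a handful once B reaches disk-block sizes.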

The B-tree is designed for a two-level hierarchy, and the situation becomes more complex with more than two levels. We need a multilevel structure, with one level per transfer block size. Suppose B₁ > B₂ > ··· > B_L are the block sizes between the L + 1 levels of memory. At the top level we have a B₁-tree; each node of this B₁-tree is a B₂-tree; etc. Even when it is possible to determine all these parameters, such a data structure is cumbersome. Also, each level of recursion incurs a constant-factor wastage in storage, in order to amortize dynamic changes, leading to suboptimal memory-transfer performance for L = ω(1).
1.3. Results. We develop two cache-oblivious search trees. These results are the
first demonstration that even irregular and dynamic problems, such as data structures,
can be solved efficiently in the cache-oblivious model. Since the conference version [18] of this paper appeared, many other data-structural problems have been addressed in the cache-oblivious model; see Table 1.1.
Our results achieve the memory-transfer bounds listed below. The parameter N
denotes the number of elements stored in the tree. Updates refer to both key insertions
and deletions.
1. The first cache-oblivious search tree attains the following memory-transfer bounds:
   Search: O(1 + log_{B+1} N), which is optimal and matches the search bound of B-trees.
   Update: O(1 + log_{B+1} N) amortized, which matches the update bound of B-trees, though the B-tree bound is worst case.
2. The second cache-oblivious search tree adds the scan operation (also called the range search operation). Given a key x and a positive integer S, the scan operation accesses S elements in key order, starting after x. The memory-transfer bounds are as follows:
   Search: O(1 + log_{B+1} N).
   Scan: O(1 + S/B), which is optimal.
   Update: O(1 + log_{B+1} N + (log² N)/B) amortized, which matches the B-tree update bound of O(1 + log_{B+1} N) when B = Ω(log N log log N).
This last relation between B and N usually holds in external memory but often does not hold in internal memory.
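To see where that crossover sits, one can plug numbers into the two terms of the second tree's update bound; the snippet below is our illustrative arithmetic, not part of the data structure.

```python
import math

def update_terms(N, B):
    """The two terms of the second tree's update bound:
    log_{B+1} N and (log^2 N) / B."""
    return math.log(N, B + 1), math.log2(N) ** 2 / B

# For disk-sized blocks the (log^2 N)/B term is negligible; for small
# in-memory "blocks" it dominates, matching the remark above.
N = 2 ** 30
for B in (8, 2 ** 20):
    search_term, extra_term = update_terms(N, B)
    print(B, round(search_term, 2), round(extra_term, 5))
```

At N = 2³⁰ the extra term is over a hundred transfers when B = 8 (a cache-line-sized block) but far below one when B is a megabyte-scale disk block.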
In the development of these data structures, we build and identify tools for cache-oblivious manipulation of data. These tools have since been used in many of the cache-oblivious data structures listed in Table 1.1. In Section 2.1, we show how to linearize a tree according to what we call the van Emde Boas layout, along the lines of Prokop's static search tree [40]. In Section 2.2, we describe a type of strongly weight-balanced search tree [11] useful for maintaining locality of reference. Following the work of Itai, Konheim, and Rodeh [33] and Willard [52, 53, 54], we develop a packed-memory array for maintaining an ordered collection of N items in an array of size O(N) subject to insertions and deletions in O(1 + (log² N)/B) amortized memory transfers; see Section 2.3. This structure can be thought of as a cache-oblivious linked list that supports scanning S consecutive elements in O(1 + S/B) memory transfers (instead of the naïve O(S)) and updates in O(1 + (log² N)/B) amortized memory transfers.
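As a rough illustration of the packed-memory idea (gaps plus window rebalancing), here is a drastically simplified sketch of ours. The fixed density threshold, the doubling policy, and the linear predecessor scan are all our simplifications, not the paper's; the real structure of Section 2.3 pairs the array with a search index and tunes density thresholds by window depth.

```python
import bisect

class PackedMemoryArray:
    """Simplified packed-memory array: ordered keys in an array with
    gaps.  An insert rebalances the smallest aligned power-of-two
    window that stays under a fixed density threshold, doubling the
    array when no window qualifies.  (The paper's version uses
    depth-dependent thresholds to achieve O(1 + (log^2 N)/B) amortized
    memory transfers per update.)"""

    def __init__(self, capacity=8, tau=0.75):
        self.slots = [None] * capacity
        self.tau = tau

    def _spread(self, lo, hi, items):
        # lay `items` out evenly over slots [lo, hi)
        for i in range(lo, hi):
            self.slots[i] = None
        step = (hi - lo) / len(items)
        for j, x in enumerate(items):
            self.slots[lo + int(j * step)] = x

    def insert(self, key):
        n = len(self.slots)
        # slot of the successor of `key` (linear scan for brevity)
        idx = next((i for i, x in enumerate(self.slots)
                    if x is not None and x >= key), n - 1)
        size = 1
        while size <= n:
            lo = (idx // size) * size            # aligned window around idx
            count = sum(1 for x in self.slots[lo:lo + size] if x is not None)
            if (count + 1) / size <= self.tau:   # room under the threshold?
                items = sorted(x for x in self.slots[lo:lo + size]
                               if x is not None)
                bisect.insort(items, key)
                self._spread(lo, lo + size, items)
                return
            size *= 2
        # every window is too dense: double the array and rebuild
        items = sorted(x for x in self.slots if x is not None)
        bisect.insort(items, key)
        self.slots = [None] * (2 * n)
        self._spread(0, 2 * n, items)

    def scan(self):
        # a range scan is a left-to-right walk over a contiguous region
        return [x for x in self.slots if x is not None]
```

Because the keys always sit in sorted order in one contiguous array, scanning S consecutive elements touches O(1 + S/B) blocks for every block size B at once.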
1.4. Notation. We define the hyperfloor of x, denoted ⌊⌊x⌋⌋, to be 2^⌊log x⌋, i.e., the largest power of 2 that is at most x.³ Thus, x/2 < ⌊⌊x⌋⌋ ≤ x. Similarly, the
³ All logarithms are base 2 if not otherwise specified.

B-trees:
  Simplification via packed-memory structure / low-height trees [20, 25]
  Simplification and persistence via exponential structures [42, 17]
  Implicit [29, 30]
Static search trees:
  Basic layout [40]
  Experiments [35]
  Optimal constant factor [14]
Linked lists supporting scans [15]
Priority queues [8, 23, 26]
Trie layout [6, 19]
Computational geometry:
  Distribution sweeping [22]
  Voronoi diagrams [34]
  Orthogonal range searching [1, 9]
  Rectangle stabbing [10]
Lower bounds [24]
Table 1.1
Related work in cache-oblivious data structures. These results, except the static search tree of [40], appeared after the conference version of this paper.
hyperceiling ⌈⌈x⌉⌉ is defined to be 2^⌈log x⌉. Analogously, we define hyperhyperfloor and hyperhyperceiling by ⌊⌊⌊x⌋⌋⌋ = 2^⌊⌊log x⌋⌋ and ⌈⌈⌈x⌉⌉⌉ = 2^⌈⌈log x⌉⌉. These operators satisfy √x < ⌊⌊⌊x⌋⌋⌋ ≤ x and x ≤ ⌈⌈⌈x⌉⌉⌉ < x².
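These definitions translate directly into code; the following sketch (ours, for illustration) also checks the stated inequalities over a range of integers.

```python
import math

def hyperfloor(x):        # floor-floor(x) = 2^floor(log x)
    return 1 << math.floor(math.log2(x))

def hyperceiling(x):      # ceil-ceil(x) = 2^ceil(log x)
    return 1 << math.ceil(math.log2(x))

def hyperhyperfloor(x):   # 2^(hyperfloor of log x)
    return 1 << hyperfloor(math.log2(x))

def hyperhyperceiling(x): # 2^(hyperceiling of log x)
    return 1 << hyperceiling(math.log2(x))

# The operators' ranges, exactly as stated in the text:
for x in range(2, 4096):
    assert x / 2 < hyperfloor(x) <= x <= hyperceiling(x) < 2 * x
    assert math.sqrt(x) < hyperhyperfloor(x) <= x <= hyperhyperceiling(x) < x * x
```

For example, hyperfloor(5) = 4, hyperceiling(5) = 8, and hyperhyperceiling(17) = 2^⌈⌈log 17⌉⌉ = 2^8 = 256, which is below 17² = 289 as promised.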
2. Tools for Cache-Oblivious Data Structures.
2.1. Static Layout and Searches. We first present a cache-oblivious static search-tree structure, which is the starting point for the dynamic structures. Consider an O(log N)-height search tree in which every node has at least two and at most a constant number of children and in which all leaves are on the same level. We describe a mapping from the nodes of the tree to positions in memory. The cost of any search in this layout is Θ(1 + log_{B+1} N) memory transfers, which is optimal up to constant factors. Our layout is a modified version of Prokop's layout for a complete binary tree whose height is a power of 2 [40, pp. 61–62]. We call the layout the van Emde Boas layout because it resembles the van Emde Boas data structure [47, 48].⁴
The van Emde Boas layout proceeds recursively. Let h be the height of the tree, or more precisely, the number of levels of nodes in the tree. Suppose first that h is a power of 2. Conceptually split the tree at the middle level of edges, between nodes of height h/2 and h/2 + 1. This breaks the tree into the top recursive subtree A of height h/2, and several bottom recursive subtrees B₁, B₂, . . . , B_ℓ, each of height h/2. If all nonleaf nodes have the same number of children, then the recursive subtrees all have size roughly √N, and ℓ is roughly √N. The layout of the tree is obtained by recursively laying out each subtree and combining these layouts in the order A, B₁, B₂, . . . , B_ℓ; see Figure 2.1.
If h is not a power of 2, we assign a number of levels that is a power of 2 to the bottom recursive subtrees and assign the remaining levels to the top recursive subtree. More precisely, the bottom subtrees have height ⌈⌈h/2⌉⌉ (= ⌊⌊h − 1⌋⌋) and
⁴ We do not use a van Emde Boas tree (we use a normal tree with pointers from each node to its parent and children), but the order of the nodes in memory is reminiscent of van Emde Boas trees.

Fig. 2.1. The van Emde Boas layout. Left: in general; right: of a tree of height 5.
the top subtree has height h − ⌈⌈h/2⌉⌉. This rounding scheme is important for later dynamic structures because the heights of the cut lines in the lower trees do not vary with N. In contrast, this property is not shared by the simple rounding scheme of assigning ⌊h/2⌋ levels to the top recursive subtree and ⌈h/2⌉ levels to the bottom recursive subtrees.
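The recursion above, including the rounding rule, can be written out directly. The sketch below is our illustration: it lists the nodes of a complete binary tree, numbered 1, 2, 3, ... in level order, in van Emde Boas memory order, with h counting levels of nodes as in the text.

```python
def veb_order(h):
    """Nodes of a complete binary tree with h levels (heap-numbered
    1..2^h - 1 in level order), listed in van Emde Boas memory order."""
    def layout(root, height):
        if height == 1:
            return [root]
        hb = 1                       # bottom-subtree height: hyperceiling(height/2)
        while 2 * hb < height:
            hb *= 2
        ht = height - hb             # the top subtree gets the remaining levels
        order = layout(root, ht)     # lay out A ...
        roots = [root]               # ... then B_1, ..., B_l, left to right:
        for _ in range(ht):          # bottom roots sit ht levels below `root`
            roots = [c for r in roots for c in (2 * r, 2 * r + 1)]
        for r in roots:
            order.extend(layout(r, hb))
        return order
    return layout(1, h)

# Height 4 reproduces Prokop's layout for a complete binary tree;
# height 5 exercises the rounding rule (top height 1, bottoms height 4).
print(veb_order(4))  # -> [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]
```

Note that for h = 5 the top recursive subtree is just the root, matching the height-5 example in Figure 2.1.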
The memory-transfer analysis views the van Emde Boas layout at a particular level of detail. Each level of detail is a partition of the tree into disjoint recursive subtrees. In the finest level of detail, 0, each node forms its own recursive subtree. In the coarsest level of detail, ⌈log₂ h⌉, the entire tree forms the unique recursive subtree. Level of detail k is derived by starting with the entire tree, recursively partitioning it as described above, and exiting a branch of the recursion upon reaching a recursive subtree of height ≤ 2^k. The key property of the van Emde Boas layout is that, at any level of detail, each recursive subtree is stored in a contiguous block of memory.
One useful consequence of our rounding scheme is the following.
Lemma 2.1. At level of detail k, all recursive subtrees except the one containing the root have the same height of 2^k. The recursive subtree containing the root has height between 1 and 2^k inclusive.
Proof. The proof follows from a simple induction on the level of detail. Consider a tree T of height h. At the coarsest level of detail, ⌈log₂ h⌉, there is a single recursive subtree, which includes the root. In this case the lemma is trivial. Suppose by induction that the lemma holds for level of detail k. In this level of detail the recursive subtree containing the root of T has height h′, where 1 ≤ h′ ≤ 2^k, and all other recursive subtrees have height 2^k. To progress to the next finer level of detail, k − 1, all recursive subtrees that do not contain the root are recursively split once more so that they have height 2^{k−1}. If the height h′ of the top recursive subtree is at most 2^{k−1}, then it is not split in level of detail k − 1. Otherwise, the subtree containing the root is split into bottom recursive subtrees of height 2^{k−1} and a top recursive subtree of height h′′ ≤ 2^{k−1}. The inductive step follows.
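Lemma 2.1 can also be checked mechanically. The sketch below (ours, for illustration) cuts a height-h tree at level of detail k using the rounding rule from Section 2.1 and collects the resulting subtree heights; only heights matter here, so no tree is materialized.

```python
def detail_heights(h, k):
    """Cut a tree of height h at level of detail k (stop splitting once
    a subtree has height <= 2^k).  Returns the pair
    (height of the subtree containing the root, set of other heights)."""
    if h <= 2 ** k:
        return h, set()
    hb = 1                          # bottom-subtree height: hyperceiling(h/2)
    while 2 * hb < h:
        hb *= 2
    root_h, others = detail_heights(h - hb, k)       # top recursive subtree
    bottom_h, bottom_others = detail_heights(hb, k)  # each bottom subtree splits alike
    return root_h, others | bottom_others | {bottom_h}

# Every non-root subtree has height exactly 2^k, and the root's subtree
# has height between 1 and 2^k, exactly as the lemma states.
for h in range(1, 65):
    for k in range(7):
        root_h, others = detail_heights(h, k)
        assert 1 <= root_h <= 2 ** k
        assert others <= {2 ** k}
```

For instance, a height-5 tree at level of detail 1 decomposes into a height-1 subtree at the root and height-2 subtrees everywhere else.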
Lemma 2.2. Consider an N-node search tree T that is stored in a van Emde Boas layout. Suppose that each node in T has between δ ≥ 2 and ∆ = O(1) children. Let h be the height of T. Then a search in T uses at most 4 log_δ ∆ · log_{B+1} N + log_{B+1} ∆ = O(1 + log_{B+1} N) memory transfers.
Proof. Let k be the coarsest level of detail such that every recursive subtree
