
Concurrent cache-oblivious b-trees

Abstract
This paper presents concurrent cache-oblivious (CO) B-trees. We extend the cache-oblivious model to a parallel or distributed setting and present three concurrent CO B-trees. Our first data structure is a concurrent lock-based exponential CO B-tree. This data structure supports insertions and non-blocking searches/successor queries. The second and third data structures are lock-based and lock-free variations, respectively, on the packed-memory CO B-tree. These data structures support range queries and deletions in addition to the other operations. Each data structure achieves the same serial performance as the original data structure on which it is based. In a concurrent setting, we show that these data structures are linearizable, meaning that completed operations appear to an outside viewer as though they occurred in some serialized order. The lock-based data structures are also deadlock free, and the lock-free data structure guarantees forward progress by at least one process.


CACHE-OBLIVIOUS B-TREES

MICHAEL A. BENDER, ERIK D. DEMAINE, AND MARTIN FARACH-COLTON
Abstract. This paper presents two dynamic search trees attaining near-optimal performance on any hierarchical memory. The data structures are independent of the parameters of the memory hierarchy, e.g., the number of memory levels, the block-transfer size at each level, and the relative speeds of memory levels. The performance is analyzed in terms of the number of memory transfers between two memory levels with an arbitrary block-transfer size of B; this analysis can then be applied to every adjacent pair of levels in a multilevel memory hierarchy. Both search trees match the optimal search bound of Θ(1 + log_{B+1} N) memory transfers. This bound is also achieved by the classic B-tree data structure on a two-level memory hierarchy with a known block-transfer size B. The first search tree supports insertions and deletions in Θ(1 + log_{B+1} N) amortized memory transfers, which matches the B-tree's worst-case bounds. The second search tree supports scanning S consecutive elements optimally in Θ(1 + S/B) memory transfers and supports insertions and deletions in Θ(1 + log_{B+1} N + (log² N)/B) amortized memory transfers, matching the performance of the B-tree for B = Ω(log N log log N).

Key words. Memory hierarchy, cache efficiency, data structures, search trees

AMS subject classifications. 68P05, 68P30, 68P20

DOI. 10.1137/S0097539701389956
1. Introduction. The memory hierarchies of modern computers are becoming
increasingly steep. Typically, an L1 cache access is two orders of magnitude faster
than a main memory access and six orders of magnitude faster than a disk access [27].
Thus, it is dangerously inaccurate to design algorithms assuming a flat memory with
uniform access times.
Many computational models attempt to capture the effects of the memory hier-
archy on the running times of algorithms. There is a tradeoff between the accuracy of
the model and its ease of use. One body of work explores multilevel memory hierar-
chies [2, 3, 5, 7, 43, 44, 49, 51], though the proliferation of parameters in these models
makes them cumbersome for algorithm design. A second body of work concentrates
on two-level memory hierarchies, either main memory and disk [4, 12, 32, 49, 50] or
cache and main memory [36, 45]. With these models the programmer must anticipate
which level of the memory hierarchy is the bottleneck. For example, a B-tree that has
been tuned to run on disk has poor performance in memory.
1.1. Cache-Oblivious Algorithms. The cache-oblivious model enables us to
reason about a simple two-level memory but prove results about an unknown mul-
tilevel memory. This model was introduced by Frigo et al. [31] and Prokop [40].
They show that several basic problems—namely, matrix multiplication, matrix trans-
pose, the fast Fourier transform (FFT), and sorting—have optimal algorithms that
are cache oblivious. Optimal cache-oblivious algorithms have also been found for LU
∗ Received by the editors May 31, 2001; accepted for publication (in revised form) May 25, 2005; published electronically DATE. A preliminary version of this paper appeared in FOCS 2000 [18].
† Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400 (bender@cs.sunysb.edu). This author's work was supported in part by HRL Laboratories, ISX Corporation, Sandia National Laboratories, and NSF grants EIA-0112849 and CCR-0208670.
‡ Computer Science and Artificial Intelligence Laboratory, MIT, 32 Vassar Street, Cambridge, MA 02139 (edemaine@mit.edu). This author's work was supported in part by NSF grant EIA-0112849.
§ Department of Computer Science, Rutgers University, Piscataway, NJ 08855 (farach@cs.rutgers.edu). This author's work was supported by NSF grant CCR-9820879.

decomposition [21, 46] and static binary search [40]. These algorithms perform an
asymptotically optimal number of memory transfers for any memory hierarchy and
at all levels of the hierarchy. More precisely, the number of memory transfers be-
tween any two levels is within a constant factor of optimal. In particular, any linear
combination of the transfer counts is optimized.
The theory of cache-oblivious algorithms is based on the ideal-cache model of Frigo et al. [31] and Prokop [40]. In the ideal-cache model there are two levels in the memory hierarchy, called cache and main memory, although they could represent any pair of levels. Main memory is partitioned into memory blocks, each consisting of a fixed number B of consecutive cells. The cache has size M, and consequently has capacity to store M/B memory blocks.¹ In this paper, we require that M/B be greater than a sufficiently large constant. The cache is fully associative, that is, it can contain an arbitrary set of M/B memory blocks at any time.
The parameters B and M are unknown to the cache-oblivious algorithm or data
structure. As a result, the algorithm cannot explicitly manage memory, and this
burden is taken on by the system. When the algorithm accesses a location in memory
that is not stored in cache, the system fetches the relevant memory block from main
memory in what is called a memory transfer. If the cache is full, a memory block
is elected for replacement based on an optimal offline analysis of the future memory
accesses of the algorithm.
Although this model may superficially seem unrealistic, Frigo et al. show that it
can be simulated by essentially any memory system with a small constant-factor over-
head. Thus, if we run a cache-oblivious algorithm on a multilevel memory hierarchy,
we can use the ideal-cache model to analyze the number of memory transfers between
each pair of adjacent levels. See [31, 40] for details.
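The two-level model above is easy to sketch in code. The simulator below is our illustration, not part of the paper: it counts memory transfers for a sequence of cell accesses, substituting LRU for the model's optimal offline replacement, which (as just noted) costs only a constant factor.

```python
from collections import OrderedDict

def count_transfers(accesses, B, M):
    """Count memory transfers for a sequence of cell accesses on a
    two-level hierarchy with block size B and cache size M.  Uses LRU
    replacement as a stand-in for the ideal-cache model's optimal
    offline policy (Frigo et al. show LRU loses only a constant factor)."""
    assert M % B == 0 and M // B >= 1
    cache = OrderedDict()            # block id -> None, in LRU order
    capacity = M // B                # the cache holds M/B blocks
    transfers = 0
    for x in accesses:
        blk = x // B                 # memory is split into aligned B-cell blocks
        if blk in cache:
            cache.move_to_end(blk)   # hit: refresh LRU position
        else:
            transfers += 1           # miss: fetch the block from main memory
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[blk] = None
    return transfers

# A scan of N consecutive cells costs about N/B transfers,
# whatever B the system actually uses.
print(count_transfers(range(1000), B=10, M=100))   # -> 100
```

Running the same access sequence with different (B, M) pairs is exactly how one analyzes a cache-oblivious algorithm at each pair of adjacent levels.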
The concept of algorithms that are uniformly optimal across multiple memory
models was considered previously by Aggarwal et al. [2]. These authors introduce the
Hierarchical Memory Model (HMM), in which the cost to access memory location x
is df(x)e where f(x) is monotone nondecreasing and polynomially bounded. They
give algorithms for matrix multiplication and the FFT that are optimal for any cost
function f (x). One distinction between the HMM model and the cache-oblivious
model is that, in the HMM model, memory is managed by the algorithm designer,
whereas in the cache-oblivious model, memory is managed by the existing caching and
paging mechanisms. Also, the HMM model does not include block transfers, though
Aggarwal, Chandra, and Snir [3] later extended the HMM to the Block Transfer (BT)
model to take into account block transfers. In the BT model the algorithm can choose
and vary the block size, whereas in the cache-oblivious model the block size is fixed
and unknown.
1.2. B-Trees. In this paper, we initiate the study of dynamic cache-oblivious
data structures by developing cache-oblivious search trees.
The classic I/O-efficient search tree is the B-tree [13]. The basic idea is to maintain
a balanced tree of N elements with node fanout proportional to the memory block
size B. Thus, one block read determines the next node out of Θ(B) nodes, so a search completes in Θ(1 + log_{B+1} N) memory transfers.² A simple information-theoretic argument shows that this bound is optimal.
¹ Note that B and M are parameters, not constants. Consequently, they must be preserved in asymptotic notation in order to obtain accurate running-time estimates.
² We use B + 1 as the base of the logarithm to correctly capture that the special case of B = 1 corresponds to the RAM.
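As a quick numerical check on that base (our illustration, not from the paper), the number of block reads for a search is the height of a fanout-(B+1) tree over N keys:

```python
import math

def btree_search_transfers(N, B):
    """Memory transfers to search a B-tree on N keys: one block read per
    level, and a tree with fanout B + 1 has about log_{B+1} N levels."""
    return max(1, math.ceil(math.log(N + 1, B + 1)))

# The B = 1 row is exactly binary search on a RAM, which is why the
# logarithm's base is B + 1 rather than B.
for B in (1, 64, 1024):
    print(B, btree_search_transfers(10**6, B))
```

For a million keys this gives 20 transfers at B = 1 (plain binary search) but only a handful once B reaches disk-block sizes.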

The B-tree is designed for a two-level hierarchy, and the situation becomes more complex with more than two levels. We need a multilevel structure, with one level per transfer block size. Suppose B₁ > B₂ > ··· > B_L are the block sizes between the L + 1 levels of memory. At the top level we have a B₁-tree; each node of this B₁-tree is a B₂-tree; etc. Even when it is possible to determine all these parameters, such a data structure is cumbersome. Also, each level of recursion incurs a constant-factor wastage in storage, in order to amortize dynamic changes, leading to suboptimal memory-transfer performance for L = ω(1).
1.3. Results. We develop two cache-oblivious search trees. These results are the
first demonstration that even irregular and dynamic problems, such as data structures,
can be solved efficiently in the cache-oblivious model. Since the conference version [18] of this paper appeared, many other data-structural problems have been addressed in the cache-oblivious model; see Table 1.1.
Our results achieve the memory-transfer bounds listed below. The parameter N
denotes the number of elements stored in the tree. Updates refer to both key insertions
and deletions.
1. The first cache-oblivious search tree attains the following memory-transfer bounds:
   Search: O(1 + log_{B+1} N), which is optimal and matches the search bound of B-trees.
   Update: O(1 + log_{B+1} N) amortized, which matches the update bound of B-trees, though the B-tree bound is worst case.
2. The second cache-oblivious search tree adds the scan operation (also called the range search operation). Given a key x and a positive integer S, the scan operation accesses S elements in key order, starting after x. The memory-transfer bounds are as follows:
   Search: O(1 + log_{B+1} N).
   Scan: O(1 + S/B), which is optimal.
   Update: O(1 + log_{B+1} N + (log² N)/B) amortized, which matches the B-tree update bound of O(1 + log_{B+1} N) when B = Ω(log N log log N).
This last relation between B and N usually holds in external memory but often does not hold in internal memory.
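To see where that crossover sits, one can plug numbers into the two terms of the second tree's update bound; the snippet below is our illustrative arithmetic, not part of the data structure.

```python
import math

def update_terms(N, B):
    """The two terms of the second tree's update bound:
    log_{B+1} N and (log^2 N) / B."""
    return math.log(N, B + 1), math.log2(N) ** 2 / B

# For disk-sized blocks the (log^2 N)/B term is negligible; for small
# in-memory "blocks" it dominates, matching the remark above.
N = 2 ** 30
for B in (8, 2 ** 20):
    search_term, extra_term = update_terms(N, B)
    print(B, round(search_term, 2), round(extra_term, 5))
```

At N = 2³⁰ the extra term is over a hundred transfers when B = 8 (a cache-line-sized block) but far below one when B is a megabyte-scale disk block.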
In the development of these data structures, we build and identify tools for cache-oblivious manipulation of data. These tools have since been used in many of the cache-oblivious data structures listed in Table 1.1. In Section 2.1, we show how to linearize a tree according to what we call the van Emde Boas layout, along the lines of Prokop's static search tree [40]. In Section 2.2, we describe a type of strongly weight-balanced search tree [11] useful for maintaining locality of reference. Following the work of Itai, Konheim, and Rodeh [33] and Willard [52, 53, 54], we develop a packed-memory array for maintaining an ordered collection of N items in an array of size O(N) subject to insertions and deletions in O(1 + (log² N)/B) amortized memory transfers; see Section 2.3. This structure can be thought of as a cache-oblivious linked list that supports scanning S consecutive elements in O(1 + S/B) memory transfers (instead of the naïve O(S)) and updates in O(1 + (log² N)/B) amortized memory transfers.
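As a rough illustration of the packed-memory idea (gaps plus window rebalancing), here is a drastically simplified sketch of ours. The fixed density threshold, the doubling policy, and the linear predecessor scan are all our simplifications, not the paper's; the real structure of Section 2.3 pairs the array with a search index and tunes density thresholds by window depth.

```python
import bisect

class PackedMemoryArray:
    """Simplified packed-memory array: ordered keys in an array with
    gaps.  An insert rebalances the smallest aligned power-of-two
    window that stays under a fixed density threshold, doubling the
    array when no window qualifies.  (The paper's version uses
    depth-dependent thresholds to achieve O(1 + (log^2 N)/B) amortized
    memory transfers per update.)"""

    def __init__(self, capacity=8, tau=0.75):
        self.slots = [None] * capacity
        self.tau = tau

    def _spread(self, lo, hi, items):
        # lay `items` out evenly over slots [lo, hi)
        for i in range(lo, hi):
            self.slots[i] = None
        step = (hi - lo) / len(items)
        for j, x in enumerate(items):
            self.slots[lo + int(j * step)] = x

    def insert(self, key):
        n = len(self.slots)
        # slot of the successor of `key` (linear scan for brevity)
        idx = next((i for i, x in enumerate(self.slots)
                    if x is not None and x >= key), n - 1)
        size = 1
        while size <= n:
            lo = (idx // size) * size            # aligned window around idx
            count = sum(1 for x in self.slots[lo:lo + size] if x is not None)
            if (count + 1) / size <= self.tau:   # room under the threshold?
                items = sorted(x for x in self.slots[lo:lo + size]
                               if x is not None)
                bisect.insort(items, key)
                self._spread(lo, lo + size, items)
                return
            size *= 2
        # every window is too dense: double the array and rebuild
        items = sorted(x for x in self.slots if x is not None)
        bisect.insort(items, key)
        self.slots = [None] * (2 * n)
        self._spread(0, 2 * n, items)

    def scan(self):
        # a range scan is a left-to-right walk over a contiguous region
        return [x for x in self.slots if x is not None]
```

Because the keys always sit in sorted order in one contiguous array, scanning S consecutive elements touches O(1 + S/B) blocks for every block size B at once.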
1.4. Notation. We define the hyperfloor of x, denoted ⌊⌊x⌋⌋, to be 2^⌊log x⌋, i.e., the largest power of 2 that is at most x.³ Thus, x/2 < ⌊⌊x⌋⌋ ≤ x. Similarly, the
³ All logarithms are base 2 if not otherwise specified.

B-trees:
  Simplification via packed-memory structure / low-height trees [20, 25]
  Simplification and persistence via exponential structures [42, 17]
  Implicit [29, 30]
Static search trees:
  Basic layout [40]
  Experiments [35]
  Optimal constant factor [14]
Linked lists supporting scans [15]
Priority queues [8, 23, 26]
Trie layout [6, 19]
Computational geometry:
  Distribution sweeping [22]
  Voronoi diagrams [34]
  Orthogonal range searching [1, 9]
  Rectangle stabbing [10]
Lower bounds [24]
Table 1.1
Related work in cache-oblivious data structures. These results, except the static search tree of [40], appeared after the conference version of this paper.
hyperceiling ⌈⌈x⌉⌉ is defined to be 2^⌈log x⌉. Analogously, we define hyperhyperfloor and hyperhyperceiling by ⌊⌊⌊x⌋⌋⌋ = 2^⌊⌊log x⌋⌋ and ⌈⌈⌈x⌉⌉⌉ = 2^⌈⌈log x⌉⌉. These operators satisfy √x < ⌊⌊⌊x⌋⌋⌋ ≤ x and x ≤ ⌈⌈⌈x⌉⌉⌉ < x².
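These definitions translate directly into code; the following sketch (ours, for illustration) also checks the stated inequalities over a range of integers.

```python
import math

def hyperfloor(x):        # floor-floor(x) = 2^floor(log x)
    return 1 << math.floor(math.log2(x))

def hyperceiling(x):      # ceil-ceil(x) = 2^ceil(log x)
    return 1 << math.ceil(math.log2(x))

def hyperhyperfloor(x):   # 2^(hyperfloor of log x)
    return 1 << hyperfloor(math.log2(x))

def hyperhyperceiling(x): # 2^(hyperceiling of log x)
    return 1 << hyperceiling(math.log2(x))

# The operators' ranges, exactly as stated in the text:
for x in range(2, 4096):
    assert x / 2 < hyperfloor(x) <= x <= hyperceiling(x) < 2 * x
    assert math.sqrt(x) < hyperhyperfloor(x) <= x <= hyperhyperceiling(x) < x * x
```

For example, hyperfloor(5) = 4, hyperceiling(5) = 8, and hyperhyperceiling(17) = 2^⌈⌈log 17⌉⌉ = 2^8 = 256, which is below 17² = 289 as promised.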
2. Tools for Cache-Oblivious Data Structures.
2.1. Static Layout and Searches. We first present a cache-oblivious static search-tree structure, which is the starting point for the dynamic structures. Consider an O(log N)-height search tree in which every node has at least two and at most a constant number of children and in which all leaves are on the same level. We describe a mapping from the nodes of the tree to positions in memory. The cost of any search in this layout is Θ(1 + log_{B+1} N) memory transfers, which is optimal up to constant factors. Our layout is a modified version of Prokop's layout for a complete binary tree whose height is a power of 2 [40, pp. 61–62]. We call the layout the van Emde Boas layout because it resembles the van Emde Boas data structure [47, 48].⁴
The van Emde Boas layout proceeds recursively. Let h be the height of the tree, or more precisely, the number of levels of nodes in the tree. Suppose first that h is a power of 2. Conceptually split the tree at the middle level of edges, between nodes of height h/2 and h/2 + 1. This breaks the tree into the top recursive subtree A of height h/2, and several bottom recursive subtrees B₁, B₂, . . . , B_ℓ, each of height h/2. If all nonleaf nodes have the same number of children, then the recursive subtrees all have size roughly √N, and ℓ is roughly √N. The layout of the tree is obtained by recursively laying out each subtree and combining these layouts in the order A, B₁, B₂, . . . , B_ℓ; see Figure 2.1.
If h is not a power of 2, we assign a number of levels that is a power of 2 to the bottom recursive subtrees and assign the remaining levels to the top recursive subtree. More precisely, the bottom subtrees have height ⌈⌈h/2⌉⌉ (= ⌊⌊h − 1⌋⌋) and
⁴ We do not use a van Emde Boas tree (we use a normal tree with pointers from each node to its parent and children), but the order of the nodes in memory is reminiscent of van Emde Boas trees.

Fig. 2.1. The van Emde Boas layout. Left: in general; right: of a tree of height 5.
the top subtree has height h − ⌈⌈h/2⌉⌉. This rounding scheme is important for later dynamic structures because the heights of the cut lines in the lower trees do not vary with N. In contrast, this property is not shared by the simple rounding scheme of assigning ⌊h/2⌋ levels to the top recursive subtree and ⌈h/2⌉ levels to the bottom recursive subtrees.
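The recursion above, including the rounding rule, can be written out directly. The sketch below is our illustration: it lists the nodes of a complete binary tree, numbered 1, 2, 3, ... in level order, in van Emde Boas memory order, with h counting levels of nodes as in the text.

```python
def veb_order(h):
    """Nodes of a complete binary tree with h levels (heap-numbered
    1..2^h - 1 in level order), listed in van Emde Boas memory order."""
    def layout(root, height):
        if height == 1:
            return [root]
        hb = 1                       # bottom-subtree height: hyperceiling(height/2)
        while 2 * hb < height:
            hb *= 2
        ht = height - hb             # the top subtree gets the remaining levels
        order = layout(root, ht)     # lay out A ...
        roots = [root]               # ... then B_1, ..., B_l, left to right:
        for _ in range(ht):          # bottom roots sit ht levels below `root`
            roots = [c for r in roots for c in (2 * r, 2 * r + 1)]
        for r in roots:
            order.extend(layout(r, hb))
        return order
    return layout(1, h)

# Height 4 reproduces Prokop's layout for a complete binary tree;
# height 5 exercises the rounding rule (top height 1, bottoms height 4).
print(veb_order(4))  # -> [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]
```

Note that for h = 5 the top recursive subtree is just the root, matching the height-5 example in Figure 2.1.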
The memory-transfer analysis views the van Emde Boas layout at a particular level of detail. Each level of detail is a partition of the tree into disjoint recursive subtrees. In the finest level of detail, 0, each node forms its own recursive subtree. In the coarsest level of detail, ⌈log₂ h⌉, the entire tree forms the unique recursive subtree. Level of detail k is derived by starting with the entire tree, recursively partitioning it as described above, and exiting a branch of the recursion upon reaching a recursive subtree of height ≤ 2^k. The key property of the van Emde Boas layout is that, at any level of detail, each recursive subtree is stored in a contiguous block of memory.
One useful consequence of our rounding scheme is the following.
Lemma 2.1. At level of detail k, all recursive subtrees except the one containing the root have the same height of 2^k. The recursive subtree containing the root has height between 1 and 2^k inclusive.
Proof. The proof follows from a simple induction on the level of detail. Consider a tree T of height h. At the coarsest level of detail, ⌈log₂ h⌉, there is a single recursive subtree, which includes the root. In this case the lemma is trivial. Suppose by induction that the lemma holds for level of detail k. In this level of detail the recursive subtree containing the root of T has height h′, where 1 ≤ h′ ≤ 2^k, and all other recursive subtrees have height 2^k. To progress to the next finer level of detail, k − 1, all recursive subtrees that do not contain the root are recursively split once more so that they have height 2^{k−1}. If the height h′ of the top recursive subtree is at most 2^{k−1}, then it is not split in level of detail k − 1. Otherwise, the subtree containing the root is split into bottom recursive subtrees of height 2^{k−1} and a top recursive subtree of height h′′ ≤ 2^{k−1}. The inductive step follows.
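Lemma 2.1 can also be checked mechanically. The sketch below (ours, for illustration) cuts a height-h tree at level of detail k using the rounding rule from Section 2.1 and collects the resulting subtree heights; only heights matter here, so no tree is materialized.

```python
def detail_heights(h, k):
    """Cut a tree of height h at level of detail k (stop splitting once
    a subtree has height <= 2^k).  Returns the pair
    (height of the subtree containing the root, set of other heights)."""
    if h <= 2 ** k:
        return h, set()
    hb = 1                          # bottom-subtree height: hyperceiling(h/2)
    while 2 * hb < h:
        hb *= 2
    root_h, others = detail_heights(h - hb, k)       # top recursive subtree
    bottom_h, bottom_others = detail_heights(hb, k)  # each bottom subtree splits alike
    return root_h, others | bottom_others | {bottom_h}

# Every non-root subtree has height exactly 2^k, and the root's subtree
# has height between 1 and 2^k, exactly as the lemma states.
for h in range(1, 65):
    for k in range(7):
        root_h, others = detail_heights(h, k)
        assert 1 <= root_h <= 2 ** k
        assert others <= {2 ** k}
```

For instance, a height-5 tree at level of detail 1 decomposes into a height-1 subtree at the root and height-2 subtrees everywhere else.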
Lemma 2.2. Consider an N-node search tree T that is stored in a van Emde Boas layout. Suppose that each node in T has between δ ≥ 2 and ∆ = O(1) children. Let h be the height of T. Then a search in T uses at most 4 log_δ ∆ · log_{B+1} N + log_{B+1} ∆ = O(1 + log_{B+1} N) memory transfers.
Proof. Let k be the coarsest level of detail such that every recursive subtree
