Cache-Oblivious Algorithms

doi:10.1145/2071379.2071383

Journal Article•DOI•

Matteo Frigo¹, Charles E. Leiserson¹, Harald Prokop¹, Sridhar Ramachandran¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 2012-ACM Transactions on Algorithms (ACM)-Vol. 8, Iss: 1, pp 4

TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.

Abstract: This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + log_M n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/(B√M)) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
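
As an illustration of the divide-and-conquer style behind the transpose bound above, the following is a minimal Python sketch, not the authors' code. It assumes a non-empty list-of-lists matrix. The names `transpose_block` and `cache_oblivious_transpose` and the `cutoff` parameter are hypothetical; the cutoff only trims recursion overhead and is not derived from M or B.

```python
def transpose_block(a, b, r0, r1, c0, c1, cutoff=4):
    """Write the transpose of the block a[r0:r1][c0:c1] into b (b[j][i] = a[i][j])."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= cutoff and cols <= cutoff:
        # Base case (assumed cutoff): copy the small block directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        # Halve the larger dimension so blocks shrink toward square shapes.
        mid = r0 + rows // 2
        transpose_block(a, b, r0, mid, c0, c1, cutoff)
        transpose_block(a, b, mid, r1, c0, c1, cutoff)
    else:
        mid = c0 + cols // 2
        transpose_block(a, b, r0, r1, c0, mid, cutoff)
        transpose_block(a, b, r0, r1, mid, c1, cutoff)


def cache_oblivious_transpose(a):
    """Return the n x m transpose of a non-empty m x n list-of-lists matrix."""
    m, n = len(a), len(a[0])
    b = [[None] * m for _ in range(n)]
    transpose_block(a, b, 0, m, 0, n)
    return b
```

For example, `cache_oblivious_transpose([[1, 2, 3], [4, 5, 6]])` returns `[[1, 4], [2, 5], [3, 6]]`. No cache size or line length appears anywhere in the code, which is the sense in which the recursion is cache oblivious.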

Citations

Proceedings Article•DOI•

The pochoir stencil compiler

Yuan Tang¹, Rezaul Chowdhury², Bradley C. Kuszmaul³, Chi-Keung Luk⁴, Charles E. Leiserson³•Institutions (4)

Fudan University¹, Boston University², Massachusetts Institute of Technology³, Intel⁴

04 Jun 2011

TL;DR: The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++, which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm.

Abstract: A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++, which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.
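
To make the definition of a stencil computation concrete, here is a minimal Python sketch of a one-dimensional three-point stencil with periodic boundaries, written as a plain loop over time steps. It is not Pochoir's specification language and does not use trapezoidal decompositions or hyperspace cuts. The names `heat_step` and `run_stencil` and the coefficient `alpha` are assumptions made for illustration.

```python
def heat_step(u, alpha=0.25):
    """Apply one update: each point becomes a function of itself and its two neighbors."""
    n = len(u)
    # Index arithmetic modulo n gives periodic (wrap-around) boundaries.
    return [
        u[i] + alpha * (u[(i - 1) % n] - 2 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]


def run_stencil(u, steps, alpha=0.25):
    """Repeat the update for the given number of time steps (naive loop order)."""
    for _ in range(steps):
        u = heat_step(u, alpha)
    return u
```

This naive order sweeps the whole grid once per time step, so a grid larger than the cache is reloaded on every sweep; the cache-efficient algorithms described in the abstract reorder these same updates to reuse data across time steps.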

364 citations

Proceedings Article•DOI•

Implicit and explicit optimizations for stencil computations

Shoaib Kamil¹, Kaushik Datta², Samuel Williams², Leonid Oliker¹, John Shalf¹, Katherine Yelick¹•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

22 Oct 2006

TL;DR: Several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor are examined, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure.

Abstract: Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.

150 citations

Journal Article•DOI•

Communication lower bounds and optimal algorithms for numerical linear algebra

Grey Ballard¹, Erin Carson², James Demmel², Mark Hoemmen¹, Nicholas Knight², Oded Schwartz²•Institutions (2)

Sandia National Laboratories¹, University of California, Berkeley²

01 May 2014-Acta Numerica

TL;DR: This paper describes lower bounds on communication in linear algebra, and presents lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices.

Abstract: The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical (O(n³)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.

138 citations

Cache-Oblivious Algorithms and Data Structures

Erik D. Demaine¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 2003

TL;DR: A recent body of work has developed cache-oblivious algorithms and data structures that perform as well or nearly as well as standard external-memory structures which require knowledge of the cache/memory size and block transfer size.

Abstract: A recent direction in the design of cache-efficient and disk-efficient algorithms and data structures is the notion of cache obliviousness, introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999. Cache-oblivious algorithms perform well on a multilevel memory hierarchy without knowing any parameters of the hierarchy, only knowing the existence of a hierarchy. Equivalently, a single cache-oblivious algorithm is efficient on all memory hierarchies simultaneously. While such results might seem impossible, a recent body of work has developed cache-oblivious algorithms and data structures that perform as well or nearly as well as standard external-memory structures which require knowledge of the cache/memory size and block transfer size. Here we describe several of these results with the intent of elucidating the techniques behind their design. Perhaps the most exciting of these results are the data structures, which form general building blocks immediately leading to several algorithmic results.

97 citations

Proceedings Article•DOI•

Cache-oblivious string B-trees

Michael A. Bender¹, Martin Farach-Colton², Bradley C. Kuszmaul³•Institutions (3)

Stony Brook University¹, Rutgers University², Massachusetts Institute of Technology³

26 Jun 2006

TL;DR: This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways: searches asymptotically optimally and inserts and deletes nearly optimally, and maintains an index whose size is proportional to the front-compressed size of the dictionary.

Abstract: B-trees are the data structure of choice for maintaining searchable data on disk. However, B-trees perform suboptimally when keys are long or of variable length; when keys are compressed, even when using front compression, the standard B-tree compression scheme; for range queries; and with respect to memory effects such as disk prefetching. This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways: The COSB-tree searches asymptotically optimally and inserts and deletes nearly optimally. It maintains an index whose size is proportional to the front-compressed size of the dictionary. Furthermore, unlike standard front-compressed strings, keys can be decompressed in a memory-efficient manner. It performs range queries with no extra disk seeks; in contrast, B-trees incur disk seeks when skipping from leaf block to leaf block. It utilizes all levels of a memory hierarchy efficiently and makes good use of disk locality by using cache-oblivious layout strategies.
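
Front compression, named above as the standard B-tree compression scheme, can be sketched as follows: each key in sorted order is stored as the length of the prefix it shares with its predecessor, plus the remaining suffix. This is a minimal Python illustration of that general idea only, not the COSB-tree's layout; the function names are hypothetical.

```python
def front_compress(sorted_keys):
    """Encode sorted string keys as (shared_prefix_length, suffix) pairs."""
    entries = []
    prev = ""
    for key in sorted_keys:
        # Count leading characters shared with the previous key.
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        entries.append((shared, key[shared:]))
        prev = key
    return entries


def front_decompress(entries):
    """Rebuild the keys by replaying the entries in order."""
    keys = []
    prev = ""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys
```

For example, `front_compress(["car", "card", "care", "cat"])` returns `[(0, "car"), (3, "d"), (3, "e"), (2, "t")]`. In this simple form, recovering one key means replaying every entry before it, which is the kind of decompression cost the abstract contrasts with the COSB-tree.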

86 citations

References

Proceedings Article•DOI•

Uniform memory hierarchies

Bowen Alpern¹, Larry Carter¹, Ephraim Feig¹•Institutions (1)

IBM¹

22 Oct 1990

TL;DR: The authors introduce a model, called the uniform memory hierarchy (UMH), which reflects the hierarchical nature of computer memory more accurately than the RAM (random-access-machine) model, which assumes that any item in memory can be accessed with unit cost.

Abstract: The authors introduce a model, called the uniform memory hierarchy (UMH) model, which reflects the hierarchical nature of computer memory more accurately than the RAM (random-access-machine) model, which assumes that any item in memory can be accessed with unit cost. In the model memory occurs as a sequence of increasingly large levels. Data are transferred between levels in fixed-size blocks (the size is level dependent). Within a level blocks are random access. The model is easily extended to handle parallelism. The UMH model is really a family of models parameterized by the rate at which the bandwidth decays as one travels up the hierarchy. A program is parsimonious on a UMH if the leading terms of the program's (time) complexity on the UMH and on a RAM are identical. If these terms differ by more than a constant factor, then the program is inefficient. The authors analyze two standard FFT programs with the same RAM complexity. One is efficient; the other is not.

65 citations

Journal Article•DOI•

Towards a theory of cache-efficient algorithms

Sandeep Sen¹, Siddhartha Chatterjee², Neeraj Dumir¹•Institutions (2)

Indian Institute of Technology Delhi¹, IBM²

01 Nov 2002-Journal of the ACM

TL;DR: In this article, the authors present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters.

Abstract: We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-efficient algorithms in the single-level cache model for fundamental problems like sorting, FFT, and an important subclass of permutations. We also analyze the average-case cache behavior of mergesort, show that ignoring associativity concerns could lead to inferior performance, and present supporting experimental evidence. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and for dealing with the hitherto unresolved problem of limited associativity.

61 citations

DOI•

Cache-Oblivious Data Structures

Lars Arge, Gerth Stølting Brodal, Rolf Fagerberg

01 Jan 2004

TL;DR: This chapter covers the cache-oblivious model, fundamental primitives such as the k-merger, the kd-tree, and 2d orthogonal range searching.

Abstract: University of Southern Denmark. Chapter outline: 38.1 The Cache-Oblivious Model; 38.2 Fundamental Primitives (Van Emde Boas Layout, k-Merger); 38.3 Dynamic B-Trees (Density Based, Exponential Tree Based); 38.4 Priority Queues (Merge Based Priority Queue: Funnel Heap, Exponential Level Based Priority Queue); 38.5 2d Orthogonal Range Searching (Cache-Oblivious kd-Tree, Cache-Oblivious Range Tree).

45 citations

"Cache-Oblivious Algorithms" refers methods in this paper

...Excellent surveys on cache-oblivious algorithms and data structures include [Arge et al. 2005; Brodal 2004; Demaine 2002]....
[...]

Book Chapter•DOI•

A Characterization of Temporal Locality and Its Portability across Memory Hierarchies

Gianfranco Bilardi¹, Enoch Peserico²•Institutions (2)

University of Padua¹, Massachusetts Institute of Technology²

08 Jul 2001

TL;DR: This paper formulates and investigates the question of whether a given algorithm can be coded in a way efficiently portable across machines with different hierarchical memory systems, modeled as a(x)-HRAMs (Hierarchical RAMs), and proposes the decomposition-tree memory manager and the reoccurrence-width memory manager.

Abstract: This paper formulates and investigates the question of whether a given algorithm can be coded in a way efficiently portable across machines with different hierarchical memory systems, modeled as a(x)-HRAMs (Hierarchical RAMs), where the time to access a location x is a(x). The width decomposition framework is proposed to provide a machine-independent characterization of temporal locality of a computation by a suitable set of space reuse parameters. Using this framework, it is shown that, when the schedule, i.e. the order by which operations are executed, is fixed, efficient portability is achievable. We propose (a) the decomposition-tree memory manager, which achieves time within a logarithmic factor of optimal on all HRAMs, and (b) the reoccurrence-width memory manager, which achieves time within a constant factor of optimal for the important class of uniform HRAMs. We also show that, when the schedule is considered as a degree of freedom of the implementation, there are computations whose optimal schedule does vary with the access function. In particular, we exhibit some computations for which any schedule is bound to be a polynomial factor slower than optimal on at least one of two sufficiently different machines. On the positive side, we show that relatively few schedules are sufficient to provide a near optimal solution on a wide class of HRAMs.

42 citations

Journal Article•DOI•

Large-scale sorting in uniform memory hierarchies

Jeffrey Scott Vitter¹, Mark H. Nodine¹•Institutions (1)

Brown University¹

01 Jan 1993-Journal of Parallel and Distributed Computing

TL;DR: This work presents several efficient algorithms for sorting on the uniform memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its parallelization P-UMH, and develops optimal sorting algorithms for all bandwidths for other versions of UMH.

41 citations

"Cache-Oblivious Algorithms" refers background or methods in this paper

...Finally, we present simulation results proving that optimal cache-oblivious algorithms satisfying the regularity condition are also optimal (in expectation) in the previously studied SUMH [5, 28] and HMM [1] models....
[...]
...It can also be shown [Prokop 1999] that cache-oblivious algorithms satisfying (14) are also optimal (in expectation) in the previously studied SUMH [Alpern et al. 1990; Vitter and Nodine 1993] and HMM [Aggarwal et al. 1987a] models....
[...]
...Specifically, we prove (with only minor assumptions) that optimal cache-oblivious algorithms in the ideal-cache model are also optimal in the hierarchical memory model (HMM) [1] and in the serial uniform memory hierarchy (SUMH) model [5, 28]....
[...]
...Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 21, 28, 30], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a “bucket splitting” technique to select pivots incrementally during the distribution....
[...]
...In the more restrictive SUMH model [28], however, only one bus is active at a time....
[...]