Journal ArticleDOI

Cache-Oblivious Algorithms

TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.
Abstract: This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $M$ and cache-line length $B$ where $M = \Omega(B^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/B)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/B)(1 + \log_M n))$. We also give a $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix that incurs $\Theta(1 + (mn + np + mp)/B + mnp/(B\sqrt{M}))$ cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
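To make the matrix-transpose claim concrete, here is a minimal Python sketch of a divide-and-conquer transpose in the cache-oblivious style: it halves the longer dimension until the subproblem is small, and never consults a cache size or line length. The function name `transpose_recursive` and the `base` cutoff are assumptions made for illustration; `base` only limits recursion overhead and is not a hardware parameter. Python lists are not contiguous row-major storage, so this shows the recursion structure rather than real cache behavior, and it is not claimed to be the paper's exact procedure.

```python
def transpose_recursive(a, b, r0=0, r1=None, c0=0, c1=None, base=8):
    """Write the transpose of the block a[r0:r1][c0:c1] into b (b[j][i] = a[i][j]).

    `base` is an illustrative recursion cutoff, not a cache parameter.
    """
    if r1 is None:
        r1 = len(a)
    if c1 is None:
        c1 = len(a[0]) if a else 0
    rows, cols = r1 - r0, c1 - c0
    if rows <= 0 or cols <= 0:
        return
    if max(rows, cols) <= base:
        # Small block: copy it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        # Split the longer dimension (rows) in half and recurse on each part.
        mid = r0 + rows // 2
        transpose_recursive(a, b, r0, mid, c0, c1, base)
        transpose_recursive(a, b, mid, r1, c0, c1, base)
    else:
        # Split the longer dimension (columns) in half and recurse on each part.
        mid = c0 + cols // 2
        transpose_recursive(a, b, r0, r1, c0, mid, base)
        transpose_recursive(a, b, r0, r1, mid, c1, base)


# Usage: transpose a 3 x 5 matrix into a preallocated 5 x 3 matrix.
a = [[i * 5 + j for j in range(5)] for i in range(3)]
b = [[None] * 3 for _ in range(5)]
transpose_recursive(a, b)
```

Splitting the longer dimension keeps subblocks roughly square, which is what lets a subblock eventually fit in whatever cache exists without the code knowing its size.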
Citations
Proceedings ArticleDOI
04 Jun 2011
TL;DR: The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++ which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm.
Abstract: A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++ which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.
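For readers unfamiliar with the term, the following Python sketch shows what a stencil computation is, using a plain loop-based 1-D heat-style update with periodic boundaries. It is the straightforward baseline, not Pochoir's specification language and not the trapezoidal or hyperspace-cut algorithm; the function name `heat_1d` and the coefficient `alpha` are assumptions chosen for illustration.

```python
def heat_1d(u, steps, alpha=0.25):
    """Apply `steps` sweeps of a 3-point stencil with periodic boundaries.

    Each point is updated from itself and its two nearest neighbors.
    `alpha` is an illustrative diffusion coefficient.
    """
    cur = list(u)
    n = len(cur)
    for _ in range(steps):
        nxt = cur[:]
        for i in range(n):
            left = cur[(i - 1) % n]    # wrap around at the left edge
            right = cur[(i + 1) % n]   # wrap around at the right edge
            nxt[i] = cur[i] + alpha * (left - 2 * cur[i] + right)
        cur = nxt
    return cur


# Usage: one sweep spreads a single spike to its neighbors.
print(heat_1d([0.0, 0.0, 4.0, 0.0], steps=1))
```

Each sweep reads the whole grid once, so for grids larger than the cache nothing is reused between time steps; that lack of reuse is the cost the cache-efficient decompositions mentioned in the abstract aim to reduce.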

364 citations

Proceedings ArticleDOI
22 Oct 2006
TL;DR: Several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor are examined, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure.
Abstract: Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.

150 citations

Journal ArticleDOI
TL;DR: This paper describes lower bounds on communication in linear algebra, and presents lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices.
Abstract: The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical ($O(n^3)$) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.

138 citations

01 Jan 2003
TL;DR: A recent body of work has developed cache-oblivious algorithms and data structures that perform as well or nearly as well as standard external-memory structures which require knowledge of the cache/memory size and block transfer size.
Abstract: A recent direction in the design of cache-efficient and disk-efficient algorithms and data structures is the notion of cache obliviousness, introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999. Cache-oblivious algorithms perform well on a multilevel memory hierarchy without knowing any parameters of the hierarchy, only knowing the existence of a hierarchy. Equivalently, a single cache-oblivious algorithm is efficient on all memory hierarchies simultaneously. While such results might seem impossible, a recent body of work has developed cache-oblivious algorithms and data structures that perform as well or nearly as well as standard external-memory structures which require knowledge of the cache/memory size and block transfer size. Here we describe several of these results with the intent of elucidating the techniques behind their design. Perhaps the most exciting of these results are the data structures, which form general building blocks immediately leading to several algorithmic results.

97 citations

Proceedings ArticleDOI
26 Jun 2006
TL;DR: This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways: searches asymptotically optimally and inserts and deletes nearly optimally, and maintains an index whose size is proportional to the front-compressed size of the dictionary.
Abstract: B-trees are the data structure of choice for maintaining searchable data on disk. However, B-trees perform suboptimally when keys are long or of variable length; when keys are compressed, even when using front compression, the standard B-tree compression scheme; for range queries; and with respect to memory effects such as disk prefetching. This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways: The COSB-tree searches asymptotically optimally and inserts and deletes nearly optimally. It maintains an index whose size is proportional to the front-compressed size of the dictionary. Furthermore, unlike standard front-compressed strings, keys can be decompressed in a memory-efficient manner. It performs range queries with no extra disk seeks; in contrast, B-trees incur disk seeks when skipping from leaf block to leaf block. It utilizes all levels of a memory hierarchy efficiently and makes good use of disk locality by using cache-oblivious layout strategies.
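Front compression, which the abstract names as the standard B-tree compression scheme, can be sketched in a few lines of Python: each key in sorted order is stored as the length of the prefix it shares with the previous key plus the remaining suffix. The function names `front_compress` and `front_decompress` are hypothetical, and this is the textbook scheme only, not the COSB-tree's layout.

```python
def front_compress(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    out = []
    prev = ""
    for key in sorted_keys:
        k = 0
        limit = min(len(prev), len(key))
        while k < limit and prev[k] == key[k]:
            k += 1
        out.append((k, key[k:]))
        prev = key
    return out


def front_decompress(entries):
    """Rebuild the keys; each key depends on the one decoded just before it."""
    keys = []
    prev = ""
    for k, suffix in entries:
        key = prev[:k] + suffix
        keys.append(key)
        prev = key
    return keys


# Usage: compress and restore a small sorted dictionary.
keys = ["car", "card", "care", "cat", "dog"]
packed = front_compress(keys)
print(packed)
print(front_decompress(packed) == keys)
```

In this plain scheme, recovering one key means replaying the entries before it, which illustrates the decompression cost that the abstract contrasts with the COSB-tree's memory-efficient decompression.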

86 citations

References
Proceedings ArticleDOI
Bowen Alpern, Larry Carter, Ephraim Feig
22 Oct 1990
TL;DR: The authors introduce a model, called the uniform memory hierarchy (UMH), which reflects the hierarchical nature of computer memory more accurately than the RAM (random-access-machine) model, which assumes that any item in memory can be accessed with unit cost.
Abstract: The authors introduce a model, called the uniform memory hierarchy (UMH) model, which reflects the hierarchical nature of computer memory more accurately than the RAM (random-access-machine) model, which assumes that any item in memory can be accessed with unit cost. In the model memory occurs as a sequence of increasingly large levels. Data are transferred between levels in fixed-size blocks (the size is level dependent). Within a level blocks are random access. The model is easily extended to handle parallelism. The UMH model is really a family of models parameterized by the rate at which the bandwidth decays as one travels up the hierarchy. A program is parsimonious on a UMH if the leading terms of the program's (time) complexity on the UMH and on a RAM are identical. If these terms differ by more than a constant factor, then the program is inefficient. The authors analyze two standard FFT programs with the same RAM complexity. One is efficient; the other is not.

65 citations

Journal ArticleDOI
TL;DR: In this article, the authors present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters.
Abstract: We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-efficient algorithms in the single-level cache model for fundamental problems like sorting, FFT, and an important subclass of permutations. We also analyze the average-case cache behavior of mergesort, show that ignoring associativity concerns could lead to inferior performance, and present supporting experimental evidence. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and for dealing with the hitherto unresolved problem of limited associativity.

61 citations

DOI
01 Jan 2004
TL;DR: Chapter outline (University of Southern Denmark) covering the cache-oblivious model, fundamental primitives (van Emde Boas layout, k-merger), dynamic B-trees, priority queues, and 2d orthogonal range searching.
Abstract: University of Southern Denmark. Chapter outline: 38.1 The Cache-Oblivious Model; 38.2 Fundamental Primitives (Van Emde Boas Layout, k-Merger); 38.3 Dynamic B-Trees (Density Based, Exponential Tree Based); 38.4 Priority Queues (Merge Based Priority Queue: Funnel Heap, Exponential Level Based Priority Queue); 38.5 2d Orthogonal Range Searching (Cache-Oblivious kd-Tree, Cache-Oblivious Range Tree).

45 citations


"Cache-Oblivious Algorithms" refers to methods in this paper

  • ...Excellent surveys on cache-oblivious algorithms and data structures include [Arge et al. 2005; Brodal 2004; Demaine 2002]....

    [...]

Book ChapterDOI
08 Jul 2001
TL;DR: This paper formulates and investigates the question of whether a given algorithm can be coded in a way efficiently portable across machines with different hierarchical memory systems, modeled as a(x)-HRAMs (Hierarchical RAMs), and proposes the decomposition-tree memory manager and the reoccurrence-width memory manager.
Abstract: This paper formulates and investigates the question of whether a given algorithm can be coded in a way efficiently portable across machines with different hierarchical memory systems, modeled as a(x)-HRAMs (Hierarchical RAMs), where the time to access a location x is a(x). The width decomposition framework is proposed to provide a machine-independent characterization of temporal locality of a computation by a suitable set of space reuse parameters. Using this framework, it is shown that, when the schedule, i.e. the order by which operations are executed, is fixed, efficient portability is achievable. We propose (a) the decomposition-tree memory manager, which achieves time within a logarithmic factor of optimal on all HRAMs, and (b) the reoccurrence-width memory manager, which achieves time within a constant factor of optimal for the important class of uniform HRAMs. We also show that, when the schedule is considered as a degree of freedom of the implementation, there are computations whose optimal schedule does vary with the access function. In particular, we exhibit some computations for which any schedule is bound to be a polynomial factor slower than optimal on at least one of two sufficiently different machines. On the positive side, we show that relatively few schedules are sufficient to provide a near optimal solution on a wide class of HRAMs.

42 citations

Journal ArticleDOI
TL;DR: This work presents several efficient algorithms for sorting on the uniform memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its parallelization P-UMH, and develops optimal sorting algorithms for all bandwidths for other versions of UMH.

41 citations


"Cache-Oblivious Algorithms" refers to background or methods in this paper

  • ...Finally, we present simulation results proving that optimal cache-oblivious algorithms satisfying the regularity condition are also optimal (in expectation) in the previously studied SUMH [5, 28] and HMM [1] models....

    [...]

  • ...It can also be shown [Prokop 1999] that cache-oblivious algorithms satisfying (14) are also optimal (in expectation) in the previously studied SUMH [Alpern et al. 1990; Vitter and Nodine 1993] and HMM [Aggarwal et al. 1987a] models....

    [...]

  • ...Specifically, we prove (with only minor assumptions) that optimal cache-oblivious algorithms in the ideal-cache model are also optimal in the hierarchical memory model (HMM) [1] and in the serial uniform memory hierarchy (SUMH) model [5, 28]....

    [...]

  • ...Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 21, 28, 30], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a “bucket splitting” technique to select pivots incrementally during the distribution....

    [...]

  • ...In the more restrictive SUMH model [28], however, only one bus is active at a time....

    [...]