Journal Article
Cache-Oblivious Algorithms
TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.
Abstract:
This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + log_M n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/(B√M)) cache faults.
We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
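The abstract's central idea, that no cache size or line length appears in the code, can be illustrated with a divide-and-conquer matrix transpose. The following is a minimal sketch, not the article's own listing: it assumes row-major flat lists, and the names `transpose_oblivious` and `CUTOFF` are hypothetical. `CUTOFF` is a fixed constant that only limits recursion overhead; it is not tuned to any hardware parameter.

```python
# Sketch of a cache-oblivious transpose: recursively halve the larger
# dimension until the sub-block is tiny, then copy element by element.
# CUTOFF is an illustrative constant, not a cache-size parameter.
CUTOFF = 16


def transpose_oblivious(a, m, n):
    """Return the n x m transpose of the row-major m x n matrix `a`."""
    b = [None] * (m * n)

    def rec(r0, r1, c0, c1):
        rows, cols = r1 - r0, c1 - c0
        if rows * cols <= CUTOFF:
            # Base case: the block is small, so copy it directly.
            for i in range(r0, r1):
                for j in range(c0, c1):
                    b[j * m + i] = a[i * n + j]
        elif rows >= cols:
            # Split the row range in half.
            mid = r0 + rows // 2
            rec(r0, mid, c0, c1)
            rec(mid, r1, c0, c1)
        else:
            # Split the column range in half.
            mid = c0 + cols // 2
            rec(r0, r1, c0, mid)
            rec(r0, r1, mid, c1)

    rec(0, m, 0, n)
    return b


def transpose_naive(a, m, n):
    """Reference row-by-row transpose used only for comparison."""
    return [a[i * n + j] for j in range(n) for i in range(m)]
```

For example, `transpose_oblivious([1, 2, 3, 4, 5, 6], 2, 3)` treats the input as a 2 × 3 matrix and returns its 3 × 2 transpose in row-major order. Because the recursion eventually produces sub-blocks that fit in any cache level, the same code is intended to behave well across the whole hierarchy without per-machine tuning.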
Citations
Proceedings Article
The pochoir stencil compiler
TL;DR: The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++, which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm.
Proceedings Article
Implicit and explicit optimizations for stencil computations
TL;DR: Several optimizations are examined on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5 and the heterogeneous multicore design of the Cell processor, including an implicit cache-oblivious approach and a cache-aware algorithm blocked to match the cache structure.
Journal Article
Communication lower bounds and optimal algorithms for numerical linear algebra
TL;DR: This paper describes lower bounds on communication in linear algebra, including lower bounds for Strassen-like algorithms and for iterative methods, in particular Krylov subspace methods applied to sparse matrices.
Cache-Oblivious Algorithms and Data Structures
TL;DR: A recent body of work has developed cache-oblivious algorithms and data structures that perform as well, or nearly as well, as standard external-memory structures, which require knowledge of the cache/memory size and block transfer size.
Proceedings Article
Cache-oblivious string B-trees
TL;DR: This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways: searches asymptotically optimally and inserts and deletes nearly optimally, and maintains an index whose size is proportional to the front-compressed size of the dictionary.
References
Proceedings Article
A fast Fourier transform compiler
TL;DR: The internals of this special-purpose compiler, called genfft, are described in some detail, and it is argued that a specialized compiler is a valuable tool.
Gaussian Elimination is not Optimal
TL;DR: In this paper, the author gives an algorithm which computes the coefficients of the product of two square matrices A and B of order n with fewer than 4.7 · n^(log 7) arithmetical operations (all logarithms in this paper are for base 2).
Journal Article
Algorithms for parallel memory, I: Two-level memories
TL;DR: In this article, the authors provide the first optimal algorithms, in terms of the number of input/output operations (I/Os) required between internal memory and multiple secondary storage devices, for sorting, FFT, matrix transposition, standard matrix multiplication, and related problems.
Proceedings Article
A model for hierarchical memory
TL;DR: An algorithm that uses LRU policy at the successive “levels” of the memory hierarchy is shown to be optimal for arbitrary memory access time.