Cache-Oblivious Algorithms

doi:10.1145/2071379.2071383

Journal Article

Matteo Frigo, +3 more

01 Jan 2012

ACM Transactions on Algorithms

Vol. 8, Iss. 1, p. 4

TLDR

It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.
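
The LRU claim is usually stated as a bound of the following shape. This is a sketch in assumed notation, not quoted from the summary: Q* denotes the number of cache misses under optimal replacement, Q_LRU the number under LRU, n the problem size, and M and B the cache size and cache-line length.

```latex
Q_{\mathrm{LRU}}(n; M, B) \;\le\; 2\, Q^{*}\!\left(n; \tfrac{M}{2}, B\right)
```

Read this way, an LRU cache of size M incurs at most twice the misses of an optimally managed cache of half the size. If halving the cache changes an algorithm's miss count by only a constant factor, its asymptotic miss bound under optimal replacement therefore carries over to LRU.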

Abstract:

This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + log_M n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/(B√M)) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
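
To make "cache oblivious" concrete, here is a minimal sketch of a recursive divide-and-conquer matrix transpose that never refers to M or B. It is an illustration under stated assumptions, not the article's code: the function names and the `base` cutoff are hypothetical, and Python lists of lists are not contiguous in memory, so the sketch shows only the shape of the recursion, not real cache behavior.

```python
def transpose_block(a, b, r0, r1, c0, c1, base=16):
    """Write the transpose of rows r0..r1-1, columns c0..c1-1 of a into b."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: transpose directly. `base` is a fixed constant that
        # only trims recursion overhead; it is not tuned to any cache size.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        # Split the longer dimension (rows) in half and recurse.
        mid = r0 + rows // 2
        transpose_block(a, b, r0, mid, c0, c1, base)
        transpose_block(a, b, mid, r1, c0, c1, base)
    else:
        # Split the longer dimension (columns) in half and recurse.
        mid = c0 + cols // 2
        transpose_block(a, b, r0, r1, c0, mid, base)
        transpose_block(a, b, r0, r1, mid, c1, base)


def cache_oblivious_transpose(a):
    """Return the n x m transpose of an m x n matrix given as a list of rows."""
    m = len(a)
    n = len(a[0]) if m else 0
    b = [[None] * m for _ in range(n)]
    transpose_block(a, b, 0, m, 0, n)
    return b


if __name__ == "__main__":
    print(cache_oblivious_transpose([[1, 2, 3], [4, 5, 6]]))
```

Always halving the longer dimension means that at some recursion depth the sub-blocks fit in cache, whatever its size, which is the intuition behind the Θ(1 + mn/B) bound quoted in the abstract.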

Citations

Algorithmic cache of sorted tables for feature selection: Speeding up methods based on consistency and information theory measures

Deriving parametric multi-way recursive divide-and-conquer dynamic programming algorithms using polyhedral compilers

Cache-oblivious scheduling of streaming pipelines

Performance and Power Characteristics of Matrix Multiplication Algorithms on Multicore and Shared Memory Machines

NUMA-aware multicore Matrix Multiplication

Related Papers (5)

The input/output complexity of sorting and related problems

I/O complexity: The red-blue pebble game

Communication lower bounds for distributed-memory matrix multiplication

Gaussian elimination is not optimal

External memory algorithms and data structures: dealing with massive data