Cache-Oblivious Algorithms

doi:10.1145/2071379.2071383

Home
/
Papers
/
Cache-Oblivious Algorithms

Journal Article•DOI•

Cache-Oblivious Algorithms

Matteo Frigo¹, Charles E. Leiserson¹, Harald Prokop¹, Sridhar Ramachandran¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 2012-ACM Transactions on Algorithms (ACM)-Vol. 8, Iss: 1, pp 4

TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.

read less

Abstract: This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B2), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + logM n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/B√M) cache faults.We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.

...read moreread less

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Cache-Adaptive Analysis

[...]

Michael A. Bender¹, Erik D. Demaine², Roozbeh Ebrahimi³, Jeremy T. Fineman⁴, Rob Johnson¹, Andrea Lincoln⁵, Jayson Lynch², Samuel McCauley¹ - Show less +4 more•Institutions (5)

Stony Brook University¹, Massachusetts Institute of Technology², Google³, Georgetown University⁴, Stanford University⁵

11 Jul 2016

TL;DR: These techniques make analyzing algorithms in the cache-adaptive model almost as easy as in the external memory, or DAM model, and give algorithm designers clear guidelines for creating optimally cache- Adaptive algorithms.

...read moreread less

Abstract: Memory efficiency and locality have substantial impact on the performance of programs, particularly when operating on large data sets. Thus, memory- or I/O-efficient algorithms have received significant attention both in theory and practice. The widespread deployment of multicore machines, however, brings new challenges. Specifically, since the memory (RAM) is shared across multiple processes, the effective memory-size allocated to each process fluctuates over time. This paper presents techniques for designing and analyzing algorithms in a cache-adaptive setting, where the RAM available to the algorithm changes over time. These techniques make analyzing algorithms in the cache-adaptive model almost as easy as in the external memory, or DAM model. Our techniques enable us to analyze a wide variety of algorithms --- Master-Method-style algorithms, Akra-Bazzi-style algorithms, collections of mutually recursive algorithms, and algorithms, such as FFT, that break problems of size N into subproblems of size Theta(Nc).We demonstrate the effectiveness of these techniques by deriving several results: 1. We give a simple recipe for determining whether common divide-and-conquer cache-oblivious algorithms are optimally cache adaptive. 2. We show how to bound an algorithm's non-optimality. We give a tight analysis showing that a class of cache-oblivious algorithms is a logarithmic factor worse than optimal. 3. We show the generality of our techniques by analyzing the cache-oblivious FFT algorithm, which is not covered by the above theorems. Nonetheless, the same general techniques can show that it is at most O(loglog N) away from optimal in the cache adaptive setting, and that this bound is tight.These general theorems give concrete results about several algorithms that could not be analyzed using earlier techniques. For example, our results apply to Fast Fourier Transform, matrix multiplication, Jacobi Multipass Filter, and cache-oblivious dynamic-programming algorithms, such as Longest Common Subsequence and Edit Distance.Our results also give algorithm designers clear guidelines for creating optimally cache-adaptive algorithms.

...read moreread less

8 citations

Posted Content•

The Cost of Address Translation

[...]

Tomasz Jurkiewicz¹, Kurt Mehlhorn¹•Institutions (1)

Max Planck Society¹

04 Dec 2012-arXiv: Data Structures and Algorithms

TL;DR: This paper proposes the VAT model (virtual address translation) to account for the cost of address translations and analyzes the algorithms mentioned above and others in the model, finding the predictions agree with the measurements.

...read moreread less

Abstract: Modern computers are not random access machines (RAMs). They have a memory hierarchy, multiple cores, and virtual memory. In this paper, we address the computational cost of address translation in virtual memory. Starting point for our work is the observation that the analysis of some simple algorithms (random scan of an array, binary search, heapsort) in either the RAM model or the EM model (external memory model) does not correctly predict growth rates of actual running times. We propose the VAT model (virtual address translation) to account for the cost of address translations and analyze the algorithms mentioned above and others in the model. The predictions agree with the measurements. We also analyze the VAT-cost of cache-oblivious algorithms.

...read moreread less

8 citations

Cites background or methods from "Cache-Oblivious Algorithms"

...he number of cache faults is bounded by 4dC(M=4;dB;n) provided that M> 4g(dB). Here M= M=a, B= P=aand a> 1 is the size (in addressable units) of the items handled by the algorithm. Funnel sort [Frigo et al. 2012] is an optimal cache-oblivious sorting algorithm. On an EM-machine with cache size Mfand block size Be, it sorts nitems with C(M;fB;ne ) = O n Be & logn=Mf logM=fBe ’! ACM Journal of Experimental...
[...]
...he-oblivious algorithms that match the performance of the best EM-algorithm for the problem are known for several fundamental algorithmic problems, e.g., sorting, FFT, matrix multiply, and searching [Frigo et al. 2012]. Do all these algorithms automatically have small VAT-complexity via Theorem 7.1? Unfortunately, the answer is no. Observe that the theorem refers to the cache misses in a machine with memory size a...
[...]

Book Chapter•DOI•

Smart Stores: A Scalable Foot Traffic Collection and Prediction System

[...]

Soheila Abrishami¹, Piyush Kumar¹, Wickus Nienaber•Institutions (1)

Florida State University¹

12 Jul 2017

TL;DR: A large scale data collection and prediction system for store foot traffic collected from wireless access points deployed at over 100 businesses across the United States for a period of more than one year is designed.

...read moreread less

Abstract: An accurate foot traffic prediction system can help retail businesses, physical stores, and restaurants optimize their labor schedule and costs, and reduce food wastage. In this paper, we design a large scale data collection and prediction system for store foot traffic. Our data has been collected from wireless access points deployed at over 100 businesses across the United States for a period of more than one year. This data is centrally processed and analyzed to predict the foot traffic for the next 168 hours (a week). Our current predictor is based on Support Vector Regression (SVR). There are a few other predictors that we have found that are similar in accuracy to SVR. For our collected data the average foot traffic per hour is 35 per store. Our prediction result is on average within 22% of the actual result for a 168 hour (a week) period.

...read moreread less

8 citations

Journal Article•DOI•

An Efficient External Memory Algorithm for Terrain Viewshed Computation

[...]

Chaulio R. Ferreira¹, Marcus V. A. Andrade¹, Salles V. G. Magalhães¹, W. Randolph Franklin²•Institutions (2)

Universidade Federal de Viçosa¹, Rensselaer Polytechnic Institute²

21 Jun 2016-ACM Transactions on Spatial Algorithms and Systems

TL;DR: This article presents TiledVS, a fast external algorithm and implementation for computing viewsheds that subdivides the terrain into tiles that are stored compressed on disk and then paged into memory with a custom cache data structure and least recently used algorithm.

...read moreread less

Abstract: This article presents TiledVS, a fast external algorithm and implementation for computing viewsheds. TiledVS is intended for terrains that are too large for internal memory, even more than 100,000×100,000 points. It subdivides the terrain into tiles that are stored compressed on disk and then paged into memory with a custom cache data structure and least recently used algorithm. If there is sufficient available memory to store a whole row of tiles, which is easy, then this specialized data management is faster than relying on the operating system’s virtual memory management. Applications of viewshed computation include siting radio transmitters, surveillance, and visual environmental impact measurement.TiledVS runs a rotating line of sight from the observer to points on the region boundary. For each boundary point, it computes the visibility of all terrain points close to the line of sight. The running time is linear in the number of points. No terrain tile is read more than twice. TiledVS is very fast, for instance, processing a 104,000×104,000 terrain on a modest computer with only 512MB of RAM took only 1½ hours. On large datasets, TiledVS was several times faster than competing algorithms, such as the ones included in GRASS. The source code of TiledVS is freely available for nonprofit researchers to study, use, and extend.A preliminary version of this algorithm appeared in a four-page ACM SIGSPATIAL GIS 2012 conference paper, “More Efficient Terrain Viewshed Computation on Massive Datasets Using External Memory.” This more detailed version adds the fast lossless compression stage that reduces the time by 30p to 40p, and many more experiments and comparisons.

...read moreread less

8 citations

Proceedings Article•DOI•

Computing betweenness centrality in external memory

[...]

Lars Arge¹, Michael T. Goodrich², Freek van Walderveen¹•Institutions (2)

Aarhus University¹, University of California, Irvine²

23 Dec 2013

TL;DR: This paper describes the first known external-memory and cache-oblivious algorithms for computing betweenness centrality, and describes general algorithms for networks with weighted and unweighted edges and a specialized algorithm with small diameters, as is common in social networks exhibiting the “small worlds” phenomenon.

...read moreread less

Abstract: Betweenness centrality is one of the most well-known measures of the importance of nodes in a social-network graph. In this paper we describe the first known external-memory and cache-oblivious algorithms for computing betweenness centrality. We present four different external-memory algorithms exhibiting various tradeoffs with respect to performance. Two of the algorithms are cache-oblivious. We describe general algorithms for networks with weighted and unweighted edges and a specialized algorithm for networks with small diameters, as is common in social networks exhibiting the “small worlds” phenomenon.

...read moreread less

8 citations

1
2
3
4
5
6
7
8
9
10
11
…
12
13
14
15
16
17
18
…
19
20
21
22
23
24
25
26
27
28
29
30
31
32

Collapse

References

PDF

Open Access

More filters

Book•

Matrix computations

[...]

Gene H. Golub

01 Jan 1983

34,729 citations

Book•

Introduction to Algorithms

[...]

Thomas H. Cormen¹, Charles E. Leiserson¹, Ronald L. Rivest¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 1990

TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.

...read moreread less

Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition,this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition,Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity,and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition,this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further,the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm,a design technique,an application area,or a related topic. The chapters are not dependent on one another,so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally,the new edition offers a 25% increase over the first edition in the number of problems,giving the book 155 problems and over 900 exercises thatreinforcethe concepts the students are learning.

...read moreread less

21,651 citations

Introduction to Algorithms

[...]

Adhi Harmoko S, M.Komp, Joseph Marie Jacquard, Konrad Zuse, Eniac - Show less +1 more

01 Jan 2005

19,250 citations

Journal Article•DOI•

An algorithm for the machine calculation of complex Fourier series

[...]

J.W. Cooley, John W. Tukey

01 Apr 1965-Mathematics of Computation

TL;DR: Good generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series, applicable to certain problems in which one must multiply an N-vector by an N X N matrix which can be factored into m sparse matrices.

...read moreread less

Abstract: An efficient method for the calculation of the interactions of a 2' factorial ex- periment was introduced by Yates and is widely known by his name. The generaliza- tion to 3' was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N X N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results inma procedure requiring a number of operations proportional to N log N rather than N2. These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2' and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series N-1 (1) X(j) = EA(k)-Wjk, j = 0 1, * ,N- 1, k=0

...read moreread less

11,795 citations

"Cache-Oblivious Algorithms" refers methods in this paper

...The basic algorithm is the well-known “six-step” variant [Bailey 1990; Vitter and Shriver 1994b] of the Cooley-Tukey FFT algorithm [Cooley and Tukey 1965]....
[...]

Book•

Computer Architecture: A Quantitative Approach

[...]

John L. Hennessy¹, David A. Patterson²•Institutions (2)

Stanford University¹, University of California, Berkeley²

01 Dec 1989

TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.

...read moreread less

Abstract: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today. In this edition, the authors bring their trademark method of quantitative analysis not only to high-performance desktop machine design, but also to the design of embedded and server systems. They have illustrated their principles with designs from all three of these domains, including examples from consumer electronics, multimedia and Web technologies, and high-performance computing.

...read moreread less

11,671 citations

"Cache-Oblivious Algorithms" refers background or methods in this paper

...We assume that the caches satisfy the inclusion property [Hennessy and Patterson 1996, p. 723], which says that the values stored in cache i are also stored in cache i + 1 (where cache 1 is the cache closest to the processor)....
[...]
...Moreover, the iterative algorithm behaves erratically, apparently due to so-called “conflict” misses [Hennessy and Patterson 1996, p. 390], where limited cache associativity interacts with the regular addressing of the matrix to cause systematic interference....
[...]
...Our strategy for the simulation is to use an LRU (least-recently used) replacement strategy [Hennessy and Patterson 1996, p. 378] in place of the optimal and omniscient replacement strategy....
[...]
...The ideal cache is fully associative [Hennessy and Patterson 1996, Ch. 5]: cache blocks can be stored anywhere in the cache....
[...]