Journal ArticleDOI

Cache-Oblivious Algorithms

TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.
Abstract: This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + log_M n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/(B√M)) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
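For illustration, the transpose bound Θ(1 + mn/B) comes from a divide-and-conquer recursion that halves the larger dimension until a subproblem fits in cache, whatever the cache parameters happen to be. A minimal C sketch in that spirit (the row-major layout, the helper name, and the base-case cutoff are illustrative assumptions, not details from the article):

#include <stddef.h>

/* Cache-oblivious out-of-place transpose sketch: recursively split the larger
 * dimension so that, at some depth, each subproblem fits in cache even though
 * the cache size is never referenced.
 * A is m x n (row-major, leading dimension lda); B receives the n x m result. */
static void transpose_rec(const double *A, double *B,
                          size_t m, size_t n, size_t lda, size_t ldb)
{
    const size_t CUTOFF = 16;           /* practical base case; the analysis recurses to 1 */
    if (m <= CUTOFF && n <= CUTOFF) {
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (m >= n) {                /* split the rows of A */
        size_t h = m / 2;
        transpose_rec(A, B, h, n, lda, ldb);
        transpose_rec(A + h * lda, B + h, m - h, n, lda, ldb);
    } else {                            /* split the columns of A */
        size_t h = n / 2;
        transpose_rec(A, B, m, h, lda, ldb);
        transpose_rec(A + h, B + h * ldb, m, n - h, lda, ldb);
    }
}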
Citations
Journal ArticleDOI
TL;DR: This paper proposes an improved theoretical model for optimizing indexed-face triangular mesh layouts by jointly considering both the vertex and the face layout; the approach outperforms the CM and SFC methods, yielding better theoretical results and improving the runtime efficiency of large mesh simplification.
Abstract: Highly detailed polygonal meshes are now produced in large numbers by a wide range of multimedia applications. Insufficient main memory and the poor locality of reference of the input mesh sequence cause frequent page swaps and extremely low runtime efficiency. To address this issue, techniques based on graph layout, matrix bandwidth minimization, and space-filling curves have been reported. However, such models only optimize the vertex layout, on the basis of graph distance or matrix bandwidth, while leaving the face layout either unoptimized or dependent on an additional sorting procedure. In this paper, we propose an improved theoretical model for optimizing indexed-face triangular mesh layouts that jointly considers both the vertex and the face layout. Unlike previous attempts, we need no additional effort to optimize one layout after the other by sorting; the two layouts are generated together on the fly with respect to both the vertex and triangle bandwidths. According to the experimental results, our approach outperforms the CM and SFC methods, not only yielding better theoretical results but also improving the runtime efficiency of large mesh simplification.
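For context, the SFC baseline compared against above reorders mesh elements along a space-filling curve so that spatially nearby elements end up nearby in memory. A minimal C sketch of a Z-order (Morton) key for quantized 2D positions, a common way to implement such a layout and purely illustrative rather than the authors' method:

#include <stdint.h>

/* Interleave the low 16 bits of x and y into a 32-bit Morton (Z-order) code.
 * Sorting vertices by this key groups spatially nearby vertices together,
 * which is the idea behind the space-filling-curve layouts the paper
 * compares against. */
static uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t key = 0;
    for (int bit = 0; bit < 16; bit++) {
        key |= (uint32_t)((x >> bit) & 1u) << (2 * bit);
        key |= (uint32_t)((y >> bit) & 1u) << (2 * bit + 1);
    }
    return key;
}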

6 citations

Proceedings ArticleDOI
22 Sep 2009
TL;DR: This paper aims at minimizing the number of cache misses paid during the execution of the matrix product kernel on a multicore processor, and shows how to achieve the best possible tradeoff between shared and distributed caches.
Abstract: The multicore revolution is underway. Classical algorithms must be revisited in order to take the hierarchical memory layout into account. In this paper, we aim at minimizing the number of cache misses paid during the execution of the matrix product kernel on a multicore processor, and we show how to achieve the best possible tradeoff between shared and distributed caches. Comprehensive simulation results confirm the analytical performance predictions and fully establish the practical significance of our new algorithms.
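For reference, the cache-miss savings at stake come from tiling the matrix product so that the working set of each tile fits in cache. A minimal single-threaded C sketch of plain tiling (the tile size bs is an illustrative parameter, and this is not the paper's multicore algorithm):

#include <stddef.h>

/* Tiled (blocked) matrix product C += A * B with bs x bs tiles, row-major.
 * A is m x n, B is n x p, C is m x p. Working on one tile of each operand at
 * a time keeps the active data small enough to stay in cache, which is the
 * effect the paper's multicore algorithms balance across shared and
 * distributed caches. */
static void matmul_tiled(const double *A, const double *B, double *C,
                         size_t m, size_t n, size_t p, size_t bs)
{
    for (size_t ii = 0; ii < m; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < p; jj += bs)
                for (size_t i = ii; i < ii + bs && i < m; i++)
                    for (size_t k = kk; k < kk + bs && k < n; k++)
                        for (size_t j = jj; j < jj + bs && j < p; j++)
                            C[i * p + j] += A[i * n + k] * B[k * p + j];
}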

6 citations

Book ChapterDOI
07 Jan 2013
TL;DR: In this paper, the authors propose the VAT model (virtual address translation) to account for the cost of address translation, analyze several simple algorithms (random scan of an array, binary search, heapsort) and others in the model, and also analyze the VAT-cost of cache-oblivious algorithms.
Abstract: Modern computers are not random access machines (RAMs). They have a memory hierarchy, multiple cores, and virtual memory. In this paper, we address the computational cost of address translation in virtual memory. The starting point for our work is the observation that the analysis of some simple algorithms (random scan of an array, binary search, heapsort) in either the RAM model or the EM model (external memory model) does not correctly predict growth rates of actual running times. We propose the VAT model (virtual address translation) to account for the cost of address translations and analyze the algorithms mentioned above and others in the model. The predictions agree with the measurements. We also analyze the VAT-cost of cache-oblivious algorithms.
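The mismatch is easy to reproduce for the random-scan example. A minimal C sketch that times a sequential and a random scan of the same array (the array size, the crude shuffle, and the use of clock() are illustrative choices, not the paper's benchmark):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum N array elements once sequentially and once in a random order.
 * In the RAM model both loops cost Theta(N); on real hardware the random
 * scan additionally pays for cache misses, TLB misses, and page-table
 * walks, which is the gap the VAT model sets out to explain. */
int main(void)
{
    const size_t N = 1u << 26;                 /* 2^26 doubles = 512 MB, plus a 512 MB index array */
    double *a = malloc(N * sizeof *a);
    size_t *perm = malloc(N * sizeof *perm);
    if (!a || !perm) return 1;

    for (size_t i = 0; i < N; i++) { a[i] = 1.0; perm[i] = i; }
    for (size_t i = N - 1; i > 0; i--) {       /* Fisher-Yates shuffle (crude, demo only) */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }

    clock_t t0 = clock();
    double s1 = 0.0;
    for (size_t i = 0; i < N; i++) s1 += a[i];
    clock_t t1 = clock();
    double s2 = 0.0;
    for (size_t i = 0; i < N; i++) s2 += a[perm[i]];
    clock_t t2 = clock();

    printf("sequential: %.2fs  random: %.2fs  (sums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    free(a); free(perm);
    return 0;
}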

6 citations

Journal ArticleDOI
20 Feb 2018
TL;DR: In this paper, a lower bound on the communication complexity of DAG computations is derived in terms of the switching potential of a DAG, i.e., the number of permutations that the DAG can realize when viewed as a switching network.
Abstract: Communication is a major factor determining the performance of algorithms on current computing systems; it is therefore valuable to provide tight lower bounds on the communication complexity of computations. This article presents a lower bound technique for the communication complexity in the bulk-synchronous parallel (BSP) model of a given class of DAG computations. The derived bound is expressed in terms of the switching potential of a DAG, that is, the number of permutations that the DAG can realize when viewed as a switching network. The proposed technique yields tight lower bounds for the fast Fourier transform (FFT), and for any sorting and permutation network. A stronger bound is also derived for the periodic balanced sorting network, by applying this technique to suitable subnetworks. Finally, we demonstrate that the switching potential captures communication requirements even in computational models different from BSP, such as the I/O model and the LPRAM.

5 citations

Book ChapterDOI
24 May 2017
TL;DR: This work shows how to compute the minimum cut of a graph cache-efficiently, giving a cache-oblivious algorithm that incurs O(⌈(E/B)(log⁴ E) log_{M/B} E⌉) cache misses and a simpler one that incurs O(⌈(V²/B) log³ V⌉) cache misses.
Abstract: We show how to compute the minimum cut of a graph cache-efficiently. Let B be the width of a cache line and M be the size of the cache. On a graph with V vertices and E edges, we give a cache-oblivious algorithm that incurs O(⌈(E/B)(log⁴ E) log_{M/B} E⌉) cache misses and a simpler one that incurs O(⌈(V²/B) log³ V⌉) cache misses.

5 citations

References
Book
01 Jan 1983

34,729 citations

Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,651 citations

01 Jan 2005

19,250 citations

Journal ArticleDOI
TL;DR: Good generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series, applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices.
Abstract: An efficient method for the calculation of the interactions of a 2^m factorial experiment was introduced by Yates and is widely known by his name. The generalization to 3^m was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than N². These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2^m and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series X(j) = Σ_{k=0}^{N−1} A(k)·W^{jk}, j = 0, 1, …, N−1. (1)
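For the case N = 2^m highlighted in the abstract, the N log N operation count comes from recursively splitting the series into even- and odd-indexed terms. A minimal recursive radix-2 C sketch (out of place, using the e^{-2πi/N} sign convention, and not the in-place binary-computer variant the abstract describes); call it with a scratch buffer the same length as x:

#include <complex.h>
#include <string.h>

/* Recursive radix-2 Cooley-Tukey DFT of x[0..n-1], n a power of two.
 * Computes X(j) = sum_k x(k) * exp(-2*pi*i*j*k/n); the split into even- and
 * odd-indexed terms yields the N log N operation count discussed above. */
static void fft_rec(double complex *x, double complex *scratch, size_t n)
{
    const double PI = 3.14159265358979323846;
    if (n < 2) return;
    size_t h = n / 2;
    for (size_t k = 0; k < h; k++) {           /* deinterleave even/odd terms */
        scratch[k]     = x[2 * k];
        scratch[k + h] = x[2 * k + 1];
    }
    memcpy(x, scratch, n * sizeof *x);
    fft_rec(x, scratch, h);                    /* transform the even half */
    fft_rec(x + h, scratch, h);                /* transform the odd half */
    for (size_t k = 0; k < h; k++) {           /* combine with twiddle factors */
        double complex w = cexp(-2.0 * PI * I * (double)k / (double)n);
        double complex e = x[k], o = w * x[k + h];
        x[k]     = e + o;
        x[k + h] = e - o;
    }
}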

11,795 citations


"Cache-Oblivious Algorithms" refers methods in this paper

  • ...The basic algorithm is the well-known “six-step” variant [Bailey 1990; Vitter and Shriver 1994b] of the Cooley-Tukey FFT algorithm [Cooley and Tukey 1965]....


Book
01 Dec 1989
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
Abstract: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today. In this edition, the authors bring their trademark method of quantitative analysis not only to high-performance desktop machine design, but also to the design of embedded and server systems. They have illustrated their principles with designs from all three of these domains, including examples from consumer electronics, multimedia and Web technologies, and high-performance computing.

11,671 citations


"Cache-Oblivious Algorithms" refers background or methods in this paper

  • ...We assume that the caches satisfy the inclusion property [Hennessy and Patterson 1996, p. 723], which says that the values stored in cache i are also stored in cache i + 1 (where cache 1 is the cache closest to the processor)....


  • ...Moreover, the iterative algorithm behaves erratically, apparently due to so-called “conflict” misses [Hennessy and Patterson 1996, p. 390], where limited cache associativity interacts with the regular addressing of the matrix to cause systematic interference....


  • ...Our strategy for the simulation is to use an LRU (least-recently used) replacement strategy [Hennessy and Patterson 1996, p. 378] in place of the optimal and omniscient replacement strategy....


  • ...The ideal cache is fully associative [Hennessy and Patterson 1996, Ch. 5]: cache blocks can be stored anywhere in the cache....

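The last two excerpts concern LRU replacement and full associativity. A toy C sketch of a fully associative cache managed with LRU eviction (a purely illustrative simulator, not anything from the article):

#include <stdbool.h>

/* Toy fully associative cache of NLINES blocks with LRU replacement.
 * access_block() returns true on a hit; on a miss it evicts the least-
 * recently used block. Linear search keeps the sketch short; a real
 * simulator would use a hash map plus a doubly linked list. */
#define NLINES 8

static long lines[NLINES];     /* block addresses currently cached */
static int  age[NLINES];       /* larger age = less recently used  */
static int  used = 0;

static bool access_block(long block)
{
    int victim = 0;
    for (int i = 0; i < used; i++) {           /* age every resident block */
        age[i]++;
        if (age[i] > age[victim]) victim = i;
    }
    for (int i = 0; i < used; i++)
        if (lines[i] == block) { age[i] = 0; return true; }   /* hit */
    if (used < NLINES) victim = used++;        /* cold miss: fill an empty line */
    lines[victim] = block;                     /* otherwise evict the LRU line */
    age[victim] = 0;
    return false;                              /* miss */
}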