Journal ArticleDOI

Cache-Oblivious Algorithms

TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.
Abstract: This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size M and cache-line length B where M = Ω(B²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + log_M n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/B√M) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
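The transpose bound above comes from divide-and-conquer: splitting along the longer dimension keeps subproblems roughly square, so each eventually fits in any cache level regardless of the (unknown) parameters M and B. A minimal Python sketch of that idea (an illustration, not the authors' code; the base-case tile size 16 is an arbitrary cutoff):

```python
def co_transpose(a, b, r0, r1, c0, c1):
    """Recursively set b = a^T on the submatrix a[r0:r1, c0:c1].

    No cache parameter appears anywhere: the recursion alone makes
    the access pattern cache-friendly at every level of the hierarchy.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= 16 and cols <= 16:          # base case: small tile
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:                     # split the row range
        mid = r0 + rows // 2
        co_transpose(a, b, r0, mid, c0, c1)
        co_transpose(a, b, mid, r1, c0, c1)
    else:                                  # split the column range
        mid = c0 + cols // 2
        co_transpose(a, b, r0, r1, c0, mid)
        co_transpose(a, b, r0, r1, mid, c1)

m, n = 33, 20
a = [[i * n + j for j in range(n)] for i in range(m)]
b = [[0] * m for _ in range(n)]
co_transpose(a, b, 0, m, 0, n)
```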
Citations
Proceedings ArticleDOI
10 Feb 2018
TL;DR: Novel scalable parallel algorithms are presented for finding global minimum cuts and connected components, which are important and fundamental problems in graph processing, together with an approximate variant of the minimum cut algorithm that approximates the exact solutions well while using a fraction of the cores in a fraction of the time.
Abstract: We present novel scalable parallel algorithms for finding global minimum cuts and connected components, which are important and fundamental problems in graph processing. To take advantage of future massively parallel architectures, our algorithms are communication-avoiding: they reduce the costs of communication across the network and the cache hierarchy. The fundamental technique underlying our work is the randomized sparsification of a graph: removing a fraction of graph edges, deriving a solution for such a sparsified graph, and using the result to obtain a solution for the original input. We design and implement sparsification with O(1) synchronization steps. Our global minimum cut algorithm decreases communication costs and computation compared to the state-of-the-art, while our connected components algorithm incurs few cache misses and synchronization steps. We validate our approach by evaluating MPI implementations of the algorithms on a petascale supercomputer. We also provide an approximate variant of the minimum cut algorithm and show that it approximates the exact solutions well while using a fraction of cores in a fraction of time.
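The sparsification idea can be illustrated on connected components with a sequential union-find sketch (a toy illustration, not the paper's parallel MPI algorithm; the sampling probability p is an arbitrary choice here): unioning a random fraction of the edges first already contracts most of each component, and a single cleanup pass over the remaining edges restores exactness.

```python
import random

def connected_components_sparsified(n, edges, p=0.5, seed=0):
    """Components of an n-vertex graph, sketching the sparsify-then-fix
    idea: process a random edge sample first, then the leftovers."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(u, v):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    sampled, rest = [], []
    for e in edges:
        (sampled if rng.random() < p else rest).append(e)
    for u, v in sampled:                    # solve the sparsified graph
        union(u, v)
    for u, v in rest:                       # cleanup pass: exact answer
        union(u, v)
    return {find(v) for v in range(n)}      # one representative per component

comps = connected_components_sparsified(5, [(0, 1), (1, 2), (3, 4)])
```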

40 citations

Proceedings ArticleDOI
TL;DR: In this paper, the authors consider a variant of the external memory (EM) model that charges ω > 1 for writing a block of size B to the secondary memory, and present variants of three EM sorting algorithms (multi-way mergesort, sample sort, and heapsort using buffer trees) that asymptotically reduce the number of writes over the original algorithms and perform roughly ω block reads for every block write.
Abstract: Emerging memory technologies have a significant gap between the cost, both in time and in energy, of writing to memory versus reading from memory. In this paper we present models and algorithms that account for this difference, with a focus on write-efficient sorting algorithms. First, we consider the PRAM model with asymmetric write cost, and show that sorting can be performed in O(n) writes, O(n log n) reads, and logarithmic depth (parallel time). Next, we consider a variant of the External Memory (EM) model that charges ω > 1 for writing a block of size B to the secondary memory, and present variants of three EM sorting algorithms (multi-way mergesort, sample sort, and heapsort using buffer trees) that asymptotically reduce the number of writes over the original algorithms, and perform roughly ω block reads for every block write. Finally, we define a variant of the Ideal-Cache model with asymmetric write costs, and present write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication. Adapting prior bounds for work-stealing and parallel-depth-first schedulers to the asymmetric setting, these yield parallel cache complexity bounds for machines with private caches or with a shared cache, respectively.
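Why multi-way merging helps when writes are expensive can be seen with a back-of-the-envelope I/O count for k-way external mergesort (a rough illustrative model of my own, not the paper's analysis or its actual algorithms): each merge pass reads and writes every block once, and a larger fan-in k shrinks the number of passes, hence the number of costly writes.

```python
import math

def em_sort_io(n, M, B, k):
    """Rough block-read/block-write counts for k-way external mergesort
    over n items with memory M and block size B: run formation plus
    each merge pass touches every block once (illustrative model)."""
    runs = math.ceil(n / M)         # sorted runs after run formation
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / k)  # one k-way merge pass
        passes += 1
    blocks = math.ceil(n / B)
    reads = writes = blocks * (passes + 1)
    return reads, writes

# A larger fan-in k means fewer passes, hence fewer (expensive) writes:
r2, w2 = em_sort_io(2**20, 2**12, 2**6, 2)
r16, w16 = em_sort_io(2**20, 2**12, 2**6, 16)
```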

34 citations

Book ChapterDOI
30 Jul 2003
TL;DR: An open issue on dictionaries dating back to the sixties is closed, showing that these bounds can be achieved simultaneously in the worst case for searching and updating by suitably maintaining a permutation of the n keys in the array.
Abstract: We close an open issue on dictionaries dating back to the sixties. An array of n keys can be sorted so that searching takes O(log n) time. Alternatively, it can be organized as a heap so that inserting and deleting keys take O(log n) time. We show that these bounds can be simultaneously achieved in the worst case for searching and updating by suitably maintaining a permutation of the n keys in the array. The resulting data structure is called implicit as it uses just O(1) extra memory cells beside the n cells for the array. The data structure is also cache-oblivious, attaining O(log_B n) block transfers in the worst case for any (unknown) value of the block size B, without wasting a single cell of memory at any level of the memory hierarchy.
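The two classical baselines this result combines can be seen with Python's standard library (purely illustrative: a sorted array gives O(log n) search but O(n) insertion, a binary heap gives O(log n) updates but O(n) search; the cited implicit dictionary achieves the logarithmic bounds for both in a single array):

```python
import bisect
import heapq

keys = [3, 7, 11, 19, 23]           # sorted array: O(log n) search
i = bisect.bisect_left(keys, 11)
found = i < len(keys) and keys[i] == 11

heap = keys[:]                       # binary heap: O(log n) updates
heapq.heapify(heap)
heapq.heappush(heap, 5)
smallest = heapq.heappop(heap)       # delete-min
```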

33 citations

Journal ArticleDOI
TL;DR: This paper begins by applying two standard cache-friendly optimizations to the Floyd--Warshall algorithm and shows limited performance improvements, then discusses the unidirectional space time representation (USTR), which can be used to reduce the amount of processor-memory traffic by a factor of O(√C), where C is the cache size.
Abstract: The topic of cache performance has been well studied in recent years. Compiler optimizations exist and optimizations have been done for many problems. Much of this work has focused on dense linear algebra problems. At first glance, the Floyd--Warshall algorithm appears to fall into this category. In this paper, we begin by applying two standard cache-friendly optimizations to the Floyd--Warshall algorithm and show limited performance improvements. We then discuss the unidirectional space time representation (USTR). We show analytically that the USTR can be used to reduce the amount of processor-memory traffic by a factor of O(√C), where C is the cache size, for a large class of algorithms. Since the USTR leads to a tiled implementation, we develop a tile size selection heuristic to intelligently narrow the search space for the tile size that minimizes total execution time. Using the USTR, we develop a cache-friendly implementation of the Floyd--Warshall algorithm. We show experimentally that this implementation minimizes the level-1 and level-2 cache misses and TLB misses and, therefore, exhibits the best overall performance. Using this implementation, we show a 2x improvement in performance over the best compiler optimized implementation on three different architectures. Finally, we show analytically that our implementation of the Floyd--Warshall algorithm is asymptotically optimal with respect to processor-memory traffic. We show experimental results for the Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We demonstrate improved cache performance using the Simplescalar simulator.
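For reference, the textbook Floyd--Warshall triple loop that the paper takes as its starting point is sketched below (the paper's contribution is reordering this computation into cache-friendly tiles; this baseline sketch deliberately does not tile):

```python
def floyd_warshall(dist):
    """Classic O(n^3) all-pairs shortest paths, updated in place.
    dist is an n x n matrix of edge weights with float('inf') where
    there is no edge and 0 on the diagonal."""
    n = len(dist)
    for k in range(n):
        dk = dist[k]
        for i in range(n):
            di = dist[i]
            dik = di[k]
            for j in range(n):
                d = dik + dk[j]      # path i -> k -> j
                if d < di[j]:
                    di[j] = d
    return dist

INF = float('inf')
dist = floyd_warshall([[0, 3, INF],
                       [INF, 0, 4],
                       [1, INF, 0]])
```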

27 citations

Book ChapterDOI
29 Aug 2000
TL;DR: Compiler technology to translate iterative versions of a number of numerical kernels into block-recursive form is described, and the cache behavior and performance of these compiler-generated block-recursive codes are studied.
Abstract: Block-recursive codes for dense numerical linear algebra computations appear to be well-suited for execution on machines with deep memory hierarchies because they are effectively blocked for all levels of the hierarchy. In this paper, we describe compiler technology to translate iterative versions of a number of numerical kernels into block-recursive form. We also study the cache behavior and performance of these compiler generated block-recursive codes.

27 citations

References
Book
01 Jan 1983

34,729 citations

Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,651 citations

01 Jan 2005

19,250 citations

Journal ArticleDOI
TL;DR: Good generalized these methods and gave elegant algorithms, one class of applications of which is the calculation of Fourier series; they are applicable to certain problems in which one must multiply an N-vector by an N × N matrix that can be factored into m sparse matrices.
Abstract: An efficient method for the calculation of the interactions of a 2^m factorial experiment was introduced by Yates and is widely known by his name. The generalization to 3^m was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than N². These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2^m and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series X(j) = Σ_{k=0}^{N−1} A(k) W^{jk}, j = 0, 1, …, N − 1.  (1)
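The series above is exactly what the Cooley--Tukey factorization computes in O(N log N) operations. A minimal recursive radix-2 sketch (illustrative only, for N a power of two; it uses the common sign convention W = e^{-2πi/N}, which may differ from the paper's only in the sign of the exponent):

```python
import cmath

def fft(a):
    """Recursive radix-2 Cooley-Tukey FFT: split into even- and
    odd-indexed halves, solve each recursively, then combine with
    twiddle factors W^j = exp(-2*pi*i*j/N)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2])
    odd = fft(a[1::2])
    out = [0j] * n
    for j in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + t
        out[j + n // 2] = even[j] - t
    return out

impulse = fft([1, 0, 0, 0])      # DFT of an impulse is flat
constant = fft([1, 1, 1, 1])     # DFT of a constant concentrates at 0
```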

11,795 citations


"Cache-Oblivious Algorithms" refers methods in this paper

  • ...The basic algorithm is the well-known “six-step” variant [Bailey 1990; Vitter and Shriver 1994b] of the Cooley-Tukey FFT algorithm [Cooley and Tukey 1965]....


Book
01 Dec 1989
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
Abstract: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today. In this edition, the authors bring their trademark method of quantitative analysis not only to high-performance desktop machine design, but also to the design of embedded and server systems. They have illustrated their principles with designs from all three of these domains, including examples from consumer electronics, multimedia and Web technologies, and high-performance computing.

11,671 citations


"Cache-Oblivious Algorithms" refers background or methods in this paper

  • ...We assume that the caches satisfy the inclusion property [Hennessy and Patterson 1996, p. 723], which says that the values stored in cache i are also stored in cache i + 1 (where cache 1 is the cache closest to the processor)....


  • ...Moreover, the iterative algorithm behaves erratically, apparently due to so-called “conflict” misses [Hennessy and Patterson 1996, p. 390], where limited cache associativity interacts with the regular addressing of the matrix to cause systematic interference....


  • ...Our strategy for the simulation is to use an LRU (least-recently used) replacement strategy [Hennessy and Patterson 1996, p. 378] in place of the optimal and omniscient replacement strategy....


  • ...The ideal cache is fully associative [Hennessy and Patterson 1996, Ch. 5]: cache blocks can be stored anywhere in the cache....

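The LRU policy referenced above can be sketched in a few lines (an illustrative cache model, not the paper's simulation argument): on every access a block becomes most recently used, and on a miss with a full cache the least recently used block is evicted.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU replacement policy over fixed-size blocks,
    counting misses (the quantity the ideal-cache analyses bound)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # insertion order = recency order
        self.misses = 0

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)        # hit: most recent now
        else:
            self.misses += 1                      # miss: fetch the block
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)   # evict the LRU block
            self.blocks[block] = True

cache = LRUCache(2)
for b in [1, 2, 1, 3, 2]:
    cache.access(b)
```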