Journal ArticleDOI

Cache-Oblivious Algorithms

TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal across multiple levels, and that the optimal replacement assumed by the ideal-cache model can be simulated efficiently by LRU replacement.
Abstract: This article presents asymptotically optimal algorithms for rectangular matrix transpose, fast Fourier transform (FFT), and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache of size M and cache-line length B where M = Ω(B²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/B). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/B)(1 + log_M n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/B + mnp/B√M) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We offer empirical evidence that cache-oblivious algorithms perform well in practice.
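The divide-and-conquer idea behind the cache-oblivious transpose can be sketched as follows (an illustrative sketch, not the authors' code; the list-of-lists representation and the base-case cutoff of 4 are my choices). Recursively halving the larger dimension guarantees that, at some recursion depth, every subproblem fits in a cache of any size M and line length B, even though neither parameter appears in the code.

```python
def co_transpose(A, T, r0, r1, c0, c1):
    """Transpose the block A[r0:r1, c0:c1] into T[c0:c1, r0:r1]."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 4 and cols <= 4:
        # Base case: copy a small block directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                T[j][i] = A[i][j]
    elif rows >= cols:
        # Split the longer dimension in half and recurse.
        mid = r0 + rows // 2
        co_transpose(A, T, r0, mid, c0, c1)
        co_transpose(A, T, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2
        co_transpose(A, T, r0, r1, c0, mid)
        co_transpose(A, T, r0, r1, mid, c1)

m, n = 3, 5
A = [[i * n + j for j in range(n)] for i in range(m)]
T = [[0] * m for _ in range(n)]
co_transpose(A, T, 0, m, 0, n)
```

The same "split the larger dimension" recursion underlies the paper's Θ(1 + mn/B) cache-miss bound.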
Citations
Proceedings ArticleDOI
26 Jul 2009
TL;DR: This paper presents a concept of active memory operations: on-chip network transactions that operate based on microcode provided by the software designer. They can replace multiple memory-access transactions over the on-chip network, together with the related local processing-element computation, with a smaller number of high-level transactions and near-memory computation.
Abstract: Memory access latency and memory-related operations are often the performance bottleneck in parallel applications. In this paper, we present a concept of active memory operations which is an on-chip network transaction that operates based on the microcode provided by the software designer. Utilizing the active memory operation, we can replace multiple transactions of memory accesses over the on-chip network and related local processing element computation with a smaller number of high-level transactions and near-memory computation. We implemented a processor called active memory processor which is located near the memory and executes the active memory operations. In our case studies, we applied the concept to three real-world applications (parallelized JPEG, FFT, and text indexing for data mining) running on a 36-tile architecture with 32 cores and 4 memories and found that the programmable transaction approach can improve performance by 34.3% to 618% at the cost of additional design effort.

10 citations

Posted Content
TL;DR: In this paper, a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion was proposed, targeting multi-core processor architectures, and the complexity estimates and experimental comparisons demonstrate the advantages of this new approach.
Abstract: We propose a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion, targeting multi-core processor architectures. Complexity estimates and experimental comparisons demonstrate the advantages of this new approach.

9 citations

Proceedings ArticleDOI
06 Jul 2020
TL;DR: This paper provides theoretical support for automatic HBM management by developing simple algorithms that can automatically control HBM and deliver good performance on multicore systems: a priority-based approach that is simple, efficiently implementable, and O(1)-competitive for makespan when all multicore threads are independent.
Abstract: This paper develops an algorithmic foundation for automated management of the multilevel-memory systems common to new supercomputers. In particular, the High-Bandwidth Memory (HBM) of these systems has a similar latency to that of DRAM and a smaller capacity, but it has much larger bandwidth. Systems equipped with HBM do not fit in classic memory-hierarchy models due to HBM's atypical characteristics. Unlike caches, which are generally managed automatically by the hardware, programmers of some current HBM-equipped supercomputers can choose to explicitly manage HBM themselves. This process is problem specific and resource intensive. Vendors offer this option because there is no consensus on how to automatically manage HBM to guarantee good performance, or whether this is even possible. In this paper, we give theoretical support for automatic HBM management by developing simple algorithms that can automatically control HBM and deliver good performance on multicore systems. HBM management is starkly different from traditional caching both in terms of optimization objectives and algorithm development. Since DRAM and HBM have similar latencies, minimizing HBM misses (provably) turns out not to be the right memory-management objective. Instead, we directly focus on minimizing makespan. In addition, while cache-management algorithms must focus on what pages to keep in cache; HBM management requires answering two questions: (1) which pages to keep in HBM and (2) how to use the limited bandwidth from HBM to DRAM. It turns out that the natural approach of using LRU for the first question and FCFS (First-Come-First-Serve) for the second question is provably bad. Instead, we provide a priority based approach that is simple, efficiently implementable and $O(1)$-competitive for makespan when all multicore threads are independent.

9 citations


Cites background or methods from "Cache-Oblivious Algorithms"

  • ...HBM does not fit into a standard memory hierarchy model [12, 30], because in traditional hierarchies, both the latency and bandwidth improve as the levels get smaller....

  • ...These include the seminal ideal cache-model of Frigo et al. [29, 30] and Prokop [30], which was based on Sleator and Tarjan’s [44]’s classic paging results; cache-adaptive analysis [16, 17, 37]; and parallel caching models based on work stealing [23]....

  • ...Unlike the Ideal Cache model [29, 30], our HBM model has two resources to manage: the HBM itself and the far channel between...

Proceedings ArticleDOI
13 Jun 2011
TL;DR: This work first enriches previous work in the area of compressed text-indexing, providing an optimal data structure that requires Θ(|T| log A / ℓ) bits, where ℓ ≥ 1 is the additive error on any answer.
Abstract: We study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Σ of size A is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form CountH(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases [15, 14, 5]. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing [8, 11, 6, 17], providing an optimal data structure that requires Θ(|T| log A / ℓ) bits, where ℓ ≥ 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.

9 citations


Cites methods from "Cache-Oblivious Algorithms"

  • ...We conclude by mentioning a solution [13] that solves the weak prefix search problem efficiently in the Cache-Oblivious Model [14], and, thus, makes the above approach suitable for this model....

Journal ArticleDOI
TL;DR: Experimental results suggest that each of the methods has its niche of excellence, roughly as follows: the theoretically optimal solutions based on a stack of back pointers generally perform best on sparse arrays, whose access frequency is less than 1% of the number of their entries.
Abstract: Initialization of an array, out of which only a small, initially unknown portion will eventually be used, is a frequent need in programming. A folklore solution for initializing an array of n entries in constant time uses 2n⌈log₂ n⌉ extra bits to realize a stack of back pointers to the actually used entries of the array. Navarro has given a succinct version of this technique, which requires only n + o(n) bits of auxiliary storage. We describe, analyze, and experimentally compare these solutions and their space-efficient but theoretically suboptimal alternatives based on a simple bitmap for keeping track of the array entries which have been assigned a value. Experimental results suggest that each of the methods has its niche of excellence, roughly as follows: the theoretically optimal solutions based on a stack of back pointers perform in general best on sparse arrays, whose access frequency is less than 1% of the number of their entries. Brute-force initialization of the entire array seems generally to give the best overall performance for dense arrays whose access frequency is over 10% of their size. For the remaining cases of arrays with 1-10% access frequency, the methods which use a simple bitmap appear to give the best performance. The experiments show that the choice of a suitable implementation may yield substantial speed-ups, up to hundreds of times, in the performance of initializable array operations. Copyright © 2015 John Wiley & Sons Ltd.
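The folklore back-pointer scheme described above can be sketched as follows (my illustration, not the paper's code; Python zero-fills new lists, so the "uninitialized" auxiliary arrays here only illustrate the invariant). Entry i counts as written iff where[i] points into the used prefix of the stack and the stack entry points back at i, so no array ever needs clearing.

```python
class InitArray:
    """Array with O(1) initialization via the folklore back-pointer stack."""

    def __init__(self, n, default=0):
        self.default = default
        self.data = [0] * n    # values (conceptually uninitialized)
        self.where = [0] * n   # back pointer into stack (conceptually garbage)
        self.stack = [0] * n   # indices written so far
        self.top = 0           # number of entries written

    def _written(self, i):
        # Valid iff the back pointer lands in the used prefix
        # and the stack entry points back at i.
        w = self.where[i]
        return w < self.top and self.stack[w] == i

    def __setitem__(self, i, v):
        if not self._written(i):
            self.where[i] = self.top
            self.stack[self.top] = i
            self.top += 1
        self.data[i] = v

    def __getitem__(self, i):
        return self.data[i] if self._written(i) else self.default
```

The two index arrays (where and stack) are what cost the 2n⌈log₂ n⌉ extra bits mentioned in the abstract.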

9 citations

References
Book
01 Jan 1983

34,729 citations

Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,651 citations

01 Jan 2005

19,250 citations

Journal ArticleDOI
TL;DR: Good generalized these methods and gave elegant algorithms, one class of applications of which is the calculation of Fourier series; they are applicable to certain problems in which one must multiply an N-vector by an N × N matrix that can be factored into m sparse matrices.
Abstract: An efficient method for the calculation of the interactions of a 2^m factorial experiment was introduced by Yates and is widely known by his name. The generalization to 3^m was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than N². These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2^m and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series (1): X(j) = Σ_{k=0}^{N−1} A(k)·W^{jk}, j = 0, 1, …, N−1
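The series above can be evaluated with O(N log N) operations as the abstract describes. A minimal recursive radix-2 sketch for N = 2^m (my illustration, not the paper's in-place formulation; the sign convention W = e^(−2πi/N) is my choice):

```python
import cmath

def fft(a):
    """Radix-2 Cooley-Tukey FFT of a sequence whose length is a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    # Split into even- and odd-indexed halves and solve recursively.
    even = fft(a[0::2])
    odd = fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

Each level of recursion does O(N) work across O(log N) levels, replacing the N² operations of the direct sum.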

11,795 citations


"Cache-Oblivious Algorithms" refers methods in this paper

  • ...The basic algorithm is the well-known “six-step” variant [Bailey 1990; Vitter and Shriver 1994b] of the Cooley-Tukey FFT algorithm [Cooley and Tukey 1965]....

Book
01 Dec 1989
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
Abstract: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today. In this edition, the authors bring their trademark method of quantitative analysis not only to high-performance desktop machine design, but also to the design of embedded and server systems. They have illustrated their principles with designs from all three of these domains, including examples from consumer electronics, multimedia and Web technologies, and high-performance computing.

11,671 citations


"Cache-Oblivious Algorithms" refers background or methods in this paper

  • ...We assume that the caches satisfy the inclusion property [Hennessy and Patterson 1996, p. 723], which says that the values stored in cache i are also stored in cache i + 1 (where cache 1 is the cache closest to the processor)....

  • ...Moreover, the iterative algorithm behaves erratically, apparently due to so-called “conflict” misses [Hennessy and Patterson 1996, p. 390], where limited cache associativity interacts with the regular addressing of the matrix to cause systematic interference....

  • ...Our strategy for the simulation is to use an LRU (least-recently used) replacement strategy [Hennessy and Patterson 1996, p. 378] in place of the optimal and omniscient replacement strategy....

  • ...The ideal cache is fully associative [Hennessy and Patterson 1996, Ch. 5]: cache blocks can be stored anywhere in the cache....
