Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
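The paper's recipe, nested parallelism whose natural sequential order is already cache-efficient, can be illustrated with a standard cache-oblivious example. The sketch below is illustrative rather than taken from the paper; the base-case cutoff and the use of std::async are assumptions standing in for a work-stealing runtime such as Cilk.

    #include <future>

    // Cache-oblivious matrix transpose (b = transpose of a, for n x n
    // row-major matrices): recursively split the longer dimension.
    // The natural sequential order visits each block's cache lines
    // contiguously, and forking the two halves gives O(log n) depth.
    // Top-level call: transpose(a, b, 0, n, 0, n, n).
    void transpose(const double* a, double* b,
                   int i0, int i1, int j0, int j1, int n) {
        const int BASE = 32;                    // assumed cutoff; tune per machine
        int di = i1 - i0, dj = j1 - j0;
        if (di <= BASE && dj <= BASE) {
            for (int i = i0; i < i1; ++i)
                for (int j = j0; j < j1; ++j)
                    b[j * n + i] = a[i * n + j];
        } else if (di >= dj) {
            int im = i0 + di / 2;               // fork on rows
            auto top = std::async(std::launch::async, transpose,
                                  a, b, i0, im, j0, j1, n);
            transpose(a, b, im, i1, j0, j1, n);
            top.get();
        } else {
            int jm = j0 + dj / 2;               // fork on columns
            auto lft = std::async(std::launch::async, transpose,
                                  a, b, i0, i1, j0, jm, n);
            transpose(a, b, i0, i1, jm, j1, n);
            lft.get();
        }
    }

Because the parallel forks and the sequential order visit the same blocks, the sequential cache bound carries over to parallel schedules via the mappings the paper describes.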


Citations
Dissertation
26 Apr 2013
TL;DR: Low-degree parallelism in computation is explored, providing evidence of fundamental differences, in practice and in theory, between systems with a sublinear number of processors and those with a linear number, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.
Abstract: Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general-purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits.

Observing that in practice multi-cores have a small number of cores, we propose a model of low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and with a linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.

Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared-cache setting, we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of both faults and the amount of cache used, and has applications in energy-efficient caching and in shared-cache scenarios.

The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and a GPU, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.

Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model that allows for constant-time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
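The thesis's claim that divide-and-conquer programs parallelize with simple modifications to sequential code can be made concrete with a sketch (not the thesis's code; the cutoff constant and the use of std::async are assumptions): fork one recursive call only while fewer than about log2(p) levels have forked, then fall back to the sequential path.

    #include <future>
    #include <numeric>
    #include <vector>

    // Divide-and-conquer sum, modified for a small number of workers:
    // fork the left half only while forks_left > 0, so at most
    // 2^forks_left leaves run concurrently; otherwise run sequentially.
    long long dc_sum(const std::vector<long long>& v, size_t lo, size_t hi,
                     int forks_left) {
        if (hi - lo < 4096 || forks_left == 0)   // assumed sequential cutoff
            return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
        size_t mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async, dc_sum,
                               std::cref(v), lo, mid, forks_left - 1);
        long long right = dc_sum(v, mid, hi, forks_left - 1);
        return left.get() + right;
    }

For p cores one would call dc_sum(v, 0, v.size(), log2(p)), matching the model's assumption of a processor count logarithmic in the input size.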

2 citations


Cites background or methods from "Low depth cache-oblivious algorithm..."

  • ...Similarly, when scheduled on a PMSH using a PDF scheduler, the cache at each level i incurs fewer than Q(Mi − pBid, Bi) misses, and the computation takes at most W′/p + d′ steps, where d′ and W′ are, respectively, the depth and work of the computation including the latencies of data misses [Blelloch et al., 2010]....


  • ...show that for a nested-parallel computation of depth d scheduled with a work-stealing scheduler, the cache complexity at each level of the hierarchy satisfies Qp(n, Mi, Bi) ≤ Q(n, Mi, Bi) + O(pMid/Bi) with probability 1 − δ, where Mi and Bi are the cache and line sizes at level i, respectively, Q(n, Mi, Bi) is the sequential cache complexity of the computation at each level, and δ is an arbitrarily small positive constant [Blelloch et al., 2010] (a worked reading of this bound follows this excerpt list)....


  • ...which each processor has a private multi-level cache hierarchy, as well as a Parallel Multi-level Shared Hierarchy (PMSH), where all processors share a multi-level cache hierarchy [Blelloch et al., 2010]....


  • ...While there exist many sequential cache-oblivious algorithms with good cache complexity, several of these do not show natural parallelizations with low depth [Blelloch et al., 2010]....

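A worked reading of the work-stealing bound quoted above shows why low depth is the key factor (this paraphrases the cited result rather than quoting it):

    \[
      Q_p(n; M_i, B_i) \;\le\; Q(n; M_i, B_i) + O\!\left(\frac{p\,M_i\,d}{B_i}\right)
    \]

For an algorithm with polylogarithmic depth, as in the cited paper, the overhead term O(p·Mi·polylog(n)/Bi) grows only polylogarithmically in n, so it is dominated by the sequential cache complexity Q(n; Mi, Bi) for large inputs; a depth of, say, d = Θ(√n) would instead let the overhead term dominate.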

Dissertation
08 Jul 2013
TL;DR: This thesis contributes two new pattern definitions that extend the reach of data analysis by pattern mining, gradual patterns and periodic patterns with unrestricted gaps, as well as ParaMiner, a pioneering algorithm for generic pattern mining that allows practitioners to directly specify the patterns they are interested in.
Abstract: Pattern mining is the area of data mining concerned with finding regularities in data. This document presents my contributions to this domain along three axes:

1. The domain is young, and there are still kinds of regularities that data analysts would like to discover in data but that are not yet handled. We contributed two new pattern definitions that extend the reach of data analysis by pattern mining: gradual patterns and periodic patterns with unrestricted gaps. We also proposed ParaMiner, a pioneering algorithm for generic pattern mining that allows practitioners to directly specify the patterns they are interested in (a sketch of this idea follows the abstract).

2. Pattern mining is extremely demanding on computational resources. In order to reduce mining time, we studied how to exploit the parallelism of multicore processors. Our results show that some well-established techniques in pattern mining are ill-adapted to parallelism, and we propose solutions.

3. Our ultimate goal is to make pattern mining easier for data analysts to use. There is much to do in this area, as analysts are currently presented with unusable lists of millions of patterns. We present our first results in the context of mining execution traces of processors.
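The generic-mining interface mentioned in axis 1 can be sketched as a depth-first enumerator parameterized by a user-supplied predicate. This is an assumed illustration of the interface style, not ParaMiner's actual algorithm, which adds dataset reduction and parallel exploration.

    #include <functional>
    #include <vector>

    // Depth-first enumeration of itemsets over items {0, ..., nItems-1}.
    // The caller's predicate decides which patterns are interesting; a
    // branch is pruned as soon as its prefix is rejected (this assumes
    // the predicate is anti-monotone, as with a minimum-support test).
    void enumerate(std::vector<int>& pattern, int next, int nItems,
                   const std::function<bool(const std::vector<int>&)>& keep,
                   std::vector<std::vector<int>>& out) {
        for (int item = next; item < nItems; ++item) {
            pattern.push_back(item);
            if (keep(pattern)) {               // user-specified criterion
                out.push_back(pattern);
                enumerate(pattern, item + 1, nItems, keep, out);
            }
            pattern.pop_back();
        }
    }

Passing a predicate that counts support in a transaction database recovers frequent-itemset mining; other predicates yield other pattern families without changing the enumerator.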

2 citations

Proceedings ArticleDOI
23 Jul 2017
TL;DR: In this article, a cache-oblivious adaptation of matrix multiplication to be incorporated in the parallel TU decomposition for rectangular matrices over finite fields, based on the Morton-hybrid space-filling curve representation, is presented.
Abstract: We present a cache-oblivious adaptation of matrix multiplication to be incorporated in the parallel TU decomposition for rectangular matrices over finite fields, based on the Morton-hybrid space-filling curve representation. To realise this, we introduce the concepts of alignment and containment of sub-matrices under the Morton-hybrid layout. We redesign the decompositions within the recursive matrix multiplication to force the base case to avoid all jumps in address space, at the expense of extra recursive matrix multiplication (MM) calls. We show that the resulting cache-oblivious adaptation has low span, and our experiments demonstrate that its sequential evaluation order yields significant improvement in run-time, despite the recursion overhead. We also observe orders-of-magnitude reductions in cache misses, which promises to yield a highly I/O-efficient multithreaded deployment of this algorithm on parallel machines with private or shared caches.
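For context, the Morton (Z-order) curve underlying the Morton-hybrid layout interleaves the bits of the row and column indices, which is what makes aligned sub-matrices contiguous in address space. A minimal sketch of the index map follows (the hybrid variant, which switches to conventional row-major tiles below a cutoff, is omitted):

    #include <cstdint>

    // Morton (Z-order) index of element (row, col): interleave the bits
    // of the two indices. Every aligned 2^k x 2^k sub-matrix then
    // occupies one contiguous range of addresses, so recursive blocked
    // algorithms avoid jumps in address space.
    uint64_t morton(uint32_t row, uint32_t col) {
        uint64_t z = 0;
        for (int b = 0; b < 32; ++b) {
            z |= (uint64_t)((col >> b) & 1) << (2 * b);
            z |= (uint64_t)((row >> b) & 1) << (2 * b + 1);
        }
        return z;
    }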

1 citation

Journal ArticleDOI
TL;DR: This paper examines several matrix layouts based on space-filling curves that allow for a cache-oblivious adaptation of parallel TU decomposition for rectangular matrices over finite fields and shows that the Morton-hybrid order incurs the least cost for index conversion routines that are required throughout the matrix decomposition.

1 citation

Posted Content
TL;DR: A cache-oblivious adaptation of matrix multiplication to be incorporated in the parallel TU decomposition for rectangular matrices over finite fields, based on the Morton-hybrid space-filling curve representation, which promises a highly I/O efficient multithreaded deployment of this algorithm on parallel machines with private or shared caches.
Abstract: We present a cache-oblivious adaptation of matrix multiplication to be incorporated in the parallel TU decomposition for rectangular matrices over finite fields, based on the Morton-hybrid space-filling curve representation. To realise this, we introduce the concepts of alignment and containment of sub-matrices under the Morton-hybrid layout. We redesign the decompositions within the recursive matrix multiplication to force the base case to avoid all jumps in address space, at the expense of extra recursive matrix multiplication (MM) calls. We show that the resulting cache-oblivious adaptation has low span, and our experiments demonstrate that its sequential evaluation order yields orders-of-magnitude improvement in run-time, despite the recursion overhead.

1 citation


Cites background from "Low depth cache-oblivious algorithm..."

  • ...Particularly, nested parallel algorithms for which the natural sequential execution has low cache complexity will also attain good cache complexity on parallel machines with private or shared caches [4]....


  • ...The implications for parallel performance can be captured using the results from [4], which reveal that nested parallel algorithms for which the natural sequential execution has low cache complexity will also attain good cache complexity on parallel machines with private or shared caches....


References
Journal ArticleDOI
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model; yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
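For reference, the standard BSP cost of one superstep (standard accounting, not a formula quoted from this abstract) combines the maximum local work w_i, the largest number h of messages any processor sends or receives, the bandwidth parameter g, and the barrier latency L:

    \[
      T_{\text{superstep}} \;=\; \max_i w_i \;+\; g\,h \;+\; L
    \]

With illustrative numbers w = 10^6 operations, h = 10^3 messages, g = 4, and L = 10^4, a superstep costs 10^6 + 4·10^3 + 10^4 = 1.014 × 10^6 time units, so computation dominates communication and the algorithm maps efficiently onto the model.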

3,885 citations


Additional excerpts

  • ...7] and distributed memory machines [48, 33, 12]....


Journal ArticleDOI
TL;DR: This article shows that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules, and analyzes the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule by a factor that depends on the size of fast memory.
Abstract: In this article we study the amortized efficiency of the “move-to-front” and similar rules for dynamically maintaining a linear list. Under the assumption that accessing the ith element from the front of the list takes t(i) time, we show that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules. Other natural heuristics, such as the transpose and frequency count rules, do not share this property. We generalize our results to show that move-to-front is within a constant factor of optimum as long as the access cost is a convex function. We also study paging, a setting in which the access cost is not convex. The paging rule corresponding to move-to-front is the “least recently used” (LRU) replacement rule. We analyze the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule (Belady's MIN algorithm) by a factor that depends on the size of fast memory. No on-line paging algorithm has better amortized performance.
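The LRU rule analyzed in the article is easy to state in code; the following minimal sketch (the struct name and fixed capacity are illustrative) keeps resident pages in recency order and evicts from the back on a miss:

    #include <list>
    #include <unordered_map>

    // Least-recently-used paging: on each access, move the page to the
    // front of a recency list; on a miss with a full cache, evict the
    // page at the back (the least recently used one).
    struct LruCache {
        size_t capacity;
        std::list<int> recency;                                   // front = most recent
        std::unordered_map<int, std::list<int>::iterator> where;  // page -> list node

        explicit LruCache(size_t k) : capacity(k) {}

        bool access(int page) {                    // returns true on a hit
            auto it = where.find(page);
            if (it != where.end()) {
                recency.splice(recency.begin(), recency, it->second);
                return true;
            }
            if (recency.size() == capacity) {      // miss: evict LRU page
                where.erase(recency.back());
                recency.pop_back();
            }
            recency.push_front(page);
            where[page] = recency.begin();
            return false;
        }
    };

The article's bound places LRU with k pages within roughly a factor of k/(k − h + 1) of the optimal offline rule given h ≤ k pages, which is the sense in which its efficiency "differs by a factor that depends on the size of fast memory."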

2,378 citations


"Low depth cache-oblivious algorithm..." refers background in this paper

  • ...It follows from [47] that the number of cache misses at each level under the multi-level LRU policy is within a factor of two of the number of misses for a cache half the size running the optimal replacement policy....


Journal ArticleDOI
TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
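The "work" T1 and "critical-path length" T∞ in this performance model bound the p-processor running time through the greedy-scheduling bound, which Cilk's work-stealing scheduler achieves in expectation:

    \[
      T_p \;\le\; \frac{T_1}{p} + O(T_\infty)
    \]

For example (illustrative numbers), a computation with T1 = 10^9 operations and T∞ = 10^5 sustains near-linear speedup up to about T1/T∞ = 10^4 processors, beyond which the critical path dominates.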

1,688 citations


"Low depth cache-oblivious algorithm..." refers background in this paper

  • ...A common form of programming in this model is based on nested parallelism—consisting of nested parallel loops and/or fork-join constructs [13, 26, 20, 35, 44]....


Book
01 Oct 1992
TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.
Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful.

Features:
  • Uses the PRAM (parallel random access machine) as the model for parallel computation.
  • Covers all essential classes of parallel algorithms.
  • Rich exercise sets.
  • Written by a highly respected author within the field.

1,577 citations


Additional excerpts

  • ...A basic strategy for list ranking [40] is the following: (i) shrink the list to size O(n/log n), and (ii) apply pointer jumping on this shorter list (a sketch of pointer jumping follows this excerpt)....

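Pointer jumping, step (ii) in the excerpt above, repeatedly replaces each node's successor by its successor's successor, doubling the distance covered per round, so O(log n) rounds compute every node's rank. Below is a minimal sequential simulation of the PRAM algorithm (array names and conventions are illustrative):

    #include <vector>

    // Pointer jumping for list ranking: next[i] is i's successor (the
    // tail points to itself); rank[i] starts at 1 for every non-tail
    // node and 0 for the tail, and ends as i's distance to the tail.
    // Each round doubles the distance jumped, so O(log n) rounds
    // suffice; on a PRAM each round is a single parallel O(1) step.
    void list_rank(std::vector<int>& next, std::vector<int>& rank) {
        const size_t n = next.size();
        for (size_t round = 1; round < n; round <<= 1) {
            std::vector<int> next2(n);
            std::vector<int> rank2(n);
            for (size_t i = 0; i < n; ++i) {   // each i in parallel on a PRAM
                rank2[i] = rank[i] + rank[next[i]];
                next2[i] = next[next[i]];
            }
            next.swap(next2);
            rank.swap(rank2);
        }
    }

The double buffering (next2, rank2) mimics the PRAM's synchronous reads-then-writes semantics within each round.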

Proceedings ArticleDOI
01 Jul 1993
TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or on overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
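The four parameters are conventionally written L (latency), o (per-message send/receive overhead), g (gap, the reciprocal of per-processor bandwidth), and P (processor count). Under standard LogP accounting (not quoted from this abstract), a single small message and a pipelined burst of k messages cost:

    \[
      T_{1\ \text{msg}} = L + 2o, \qquad
      T_{k\ \text{msgs}} = L + 2o + (k-1)\,g
    \]

With illustrative values L = 6, o = 2, and g = 4 cycles, one message takes 10 cycles while a burst of k = 100 takes 6 + 4 + 99·4 = 406 cycles, showing how the model rewards pipelined communication over many individual round trips.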

1,515 citations