Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
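The design pattern the abstract describes can be made concrete with a small sketch. The following is our own illustration, not pseudocode from the paper: a divide-and-conquer matrix transpose whose two recursive calls at each level are independent (so a fork-join runtime could execute them in parallel, giving low depth), while the natural sequential order is cache-oblivious because subproblems eventually fit in any cache.

```python
# A minimal sketch (ours, not the paper's) of a nested-parallel,
# cache-oblivious divide-and-conquer: out-of-place transpose of an
# n x m block of A into B. The two recursive calls per level are
# independent, so the depth is polylogarithmic if they are forked in
# parallel; run sequentially, the recursion is the classic
# cache-oblivious order.

def transpose(A, B, ri, ci, n, m):
    """Write the transpose of the n-by-m block of A at (ri, ci) into B."""
    if n * m <= 16:                 # base case small enough for any cache
        for i in range(n):
            for j in range(m):
                B[ci + j][ri + i] = A[ri + i][ci + j]
    elif n >= m:                    # always split the longer dimension
        transpose(A, B, ri, ci, n // 2, m)
        transpose(A, B, ri + n // 2, ci, n - n // 2, m)
    else:
        transpose(A, B, ri, ci, n, m // 2)
        transpose(A, B, ri, ci + m // 2, n, m - m // 2)

A = [[4 * i + j for j in range(4)] for i in range(3)]   # 3 x 4 input
B = [[0] * 3 for _ in range(4)]                         # 4 x 3 output
transpose(A, B, 0, 0, 3, 4)
```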


Citations
Book ChapterDOI
01 Jan 2017
TL;DR: Experimental results show that CAB can significantly improve the performance of memory-bound applications compared with the traditional random work-stealing policy.
Abstract: In this chapter, we discuss emerging dynamic task scheduling policies that can improve the performance of parallel applications on multi-socket architectures. Current multi-core computers often adopt a multi-socket multi-core architecture with a shared cache in each socket, but traditional task scheduling policies (for example, work-stealing) tend to pollute the shared cache and incur more cache misses. Because of its good performance, we use the traditional random work-stealing policy as the baseline in this chapter. To relieve this problem, we present a Cache-Aware Bi-tier work-stealing (CAB) policy. CAB improves the performance of memory-bound applications by reducing the memory footprint and cache misses of tasks running inside the same CPU socket. CAB adaptively uses a task graph partitioner to divide an execution task graph into an inter-socket tier and an intra-socket tier: tasks in the inter-socket tier are scheduled across sockets, while tasks in the intra-socket tier are scheduled within the same socket. Experimental results show that CAB significantly improves the performance of memory-bound applications compared with the traditional random work-stealing policy.
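The bi-tier idea can be illustrated with a toy partitioner. The sketch below is ours, not CAB's (the chapter's partitioner is adaptive, and the names here are hypothetical): cut the task tree at a chosen depth, distribute the subtree roots at the cut across sockets, and keep each subtree inside one socket's shared cache.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    children: list = field(default_factory=list)

def bi_tier_partition(root, cut_depth, num_sockets):
    """Map socket id -> subtree roots. Tasks above the cut form the
    inter-socket tier; each subtree below it stays within one socket."""
    assignment = {s: [] for s in range(num_sockets)}
    frontier, next_socket = [(root, 0)], 0
    while frontier:
        task, depth = frontier.pop()
        if depth == cut_depth or not task.children:
            assignment[next_socket % num_sockets].append(task)
            next_socket += 1        # round-robin stand-in for CAB's policy
        else:
            frontier.extend((c, depth + 1) for c in task.children)
    return assignment
```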

1 citation

Proceedings ArticleDOI
Yuan Tang, Weiguo Gao
09 Aug 2021
TL;DR: PACO as discussed by the authors is a processor-aware but cache-oblivious approach to parallelizing Strassen's algorithm in a homogeneous shared-memory setting; it can achieve perfect strong scaling on an arbitrary number, even a prime number, of processors within a certain range.
Abstract: Frigo et al. proposed an ideal cache model and a recursive technique to design sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that it is a fundamental open problem to extend the technique to an arbitrary architecture. Ballard et al. raised another open question on how to parallelize Strassen's algorithm exactly and efficiently on an arbitrary number of processors. We propose a novel way of partitioning a cache-oblivious algorithm to achieve perfect strong scaling on an arbitrary number, even a prime number, of processors within a certain range in a shared-memory setting. Our approach is Processor-Aware but Cache-Oblivious (PACO). We apply the approach to classic rectangular matrix-matrix multiplication (MM) and Strassen's algorithm, and provide an almost exact solution to the open problem on parallelizing Strassen. Though this paper focuses mainly on a homogeneous shared-memory setting, we also discuss extensions of our approach to distributed-memory and heterogeneous settings. Our approach may provide a new perspective on extending the recursive cache-oblivious technique to an arbitrary architecture. Preliminary experiments show that our MM algorithm significantly outperforms Intel MKL's dgemm. A full version of this paper is hosted on arXiv.
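For reference, the Strassen recursion itself looks as follows. This is the textbook algorithm (assuming square matrices with power-of-two side), not the paper's PACO partitioning, but the seven independent recursive products are exactly what a parallel schedule must distribute across processors.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Textbook Strassen recursion; assumes square matrices with
    power-of-two side. The seven M products are mutually independent."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                      # fall back to classic MM
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])
```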

1 citation

Proceedings ArticleDOI
Yuan Tang
06 Jul 2020
TL;DR: PACO as discussed by the authors is a processor-aware but cache-oblivious approach for parallelizing Strassen's algorithm on an arbitrary number of processors in a shared-memory setting, and it can be extended to a distributed-memory architecture or a heterogeneous computing system.
Abstract: Frigo et al. proposed an ideal cache model and a recursive cache-oblivious technique to design sequential cache-efficient algorithms in an oblivious fashion. Ballard et al. pointed out that it is a fundamental open problem to extend the technique to an arbitrary architecture. Ballard et al. raised another open question on how to parallelize Strassen's algorithm exactly and efficiently on an arbitrary number of processors. We propose a novel way of partitioning a cache-oblivious algorithm to achieve perfect strong scaling on an arbitrary number, even a prime number, of processors within a certain range in a shared-memory setting. Our approach is Processor-Aware but Cache-Oblivious (PACO). We demonstrate our approach on several important cache-oblivious algorithms, including longest common subsequence (LCS), classic rectangular matrix multiplication (MM), Strassen's algorithm, and comparison-based sorting. With our approach, we provide an almost exact solution to the open problem on parallelizing Strassen. We discuss how to extend our approach to a distributed-memory architecture, or even a heterogeneous computing system. Hence, our work may provide a new perspective on the fundamental open problem of extending the recursive cache-oblivious technique to an arbitrary architecture. In preliminary experiments, our algorithms significantly outperform state-of-the-art Processor-Oblivious (PO) and Processor-Aware (PA) counterparts.
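The phrase "an arbitrary number, even a prime number, of processors" contrasts with recursive splitting, which naturally yields power-of-two parallelism. A minimal sketch of that contrast (ours, not the paper's actual partitioning scheme): a contiguous, near-equal split of the output rows is valid for any p.

```python
def split_rows(m, p):
    """Partition m rows into p contiguous near-equal chunks; works for
    any p, including primes, unlike a power-of-two recursive split."""
    base, extra = divmod(m, p)
    chunks, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

print(split_rows(10, 3))   # [(0, 4), (4, 7), (7, 10)]
```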

1 citation

Dissertation
01 Jan 2014
TL;DR: This dissertation employs heterogeneous parallel computing architectures to accelerate proximity computation for various applications, and proposes an out-of-core proximity computation algorithm to handle massive data that requires more memory than a single computing resource in a heterogeneous computing system provides.
Abstract: Proximity computation is one of the most fundamental geometric operations for various applications, including physically-based simulations, computer graphics, and robotics, and it is also one of the most time-consuming parts of those applications. There have been numerous attempts to accelerate proximity queries, such as adopting an acceleration hierarchy to cull redundant computations. Even though these methods are general and improve the performance of various proximity queries by several orders of magnitude, there are ever-growing demands for further improving query performance, since model complexities are also ever growing. Recently, the number of cores on a single chip has continued to increase in order to achieve higher computing power, and various heterogeneous computing architectures consisting of different types of parallel computing resources have been introduced. However, prior acceleration techniques such as acceleration hierarchies gave little consideration to utilizing such parallel architectures and heterogeneous computing environments. Since we are increasingly seeing more heterogeneous computing environments, it is becoming more important to utilize them for proximity queries in an efficient and robust manner. In this thesis, we employ heterogeneous parallel computing architectures to accelerate the performance of proximity computation for various applications. To efficiently utilize heterogeneous computing resources, we propose parallel computing systems and algorithms for proximity computation. We start with a specific proximity query and design a novel, efficient parallel algorithm based on knowledge of the query and the computing resources. We then extend our method to various proximity queries and propose a general proximity computing framework, and we improve the utilization efficiency of computing resources by designing an optimization-based scheduling algorithm. With the proposed methods, an order-of-magnitude improvement is achieved on various queries by using up to two hexa-core CPUs and four different GPUs over using a single CPU core. In addition, we propose an out-of-core proximity computation algorithm to handle massive data that requires a larger memory space than the memory size of a computing resource in a heterogeneous computing system, especially for particle-based fluid simulation. The proximity computing system using the out-of-core algorithm works robustly on large-scale scenes and achieves up to a two-orders-of-magnitude performance improvement over a previous out-of-core approach. These results demonstrate the efficiency and robustness of our approaches.
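The out-of-core pattern the dissertation describes can be sketched generically. Everything below is a hypothetical illustration (the callbacks load_block and load_halo are ours, not the thesis's API): split space into blocks and run the neighbor search block by block, loading each block plus a halo of radius r so no cross-block pair within the query radius is missed.

```python
def out_of_core_neighbors(blocks, load_block, load_halo, r):
    """blocks: block ids; load_block/load_halo: user I/O callbacks that
    fetch one block's particles (and its boundary halo) into memory."""
    for b in blocks:
        pts = load_block(b)        # only this block is memory-resident
        halo = load_halo(b, r)     # boundary particles of adjacent blocks
        yield b, neighbor_pairs(pts, pts + halo, r)

def neighbor_pairs(query, data, r):
    """All (q, d) pairs within distance r (toy quadratic kernel)."""
    r2 = r * r
    return [(q, d) for q in query for d in data if q is not d
            and sum((a - b) ** 2 for a, b in zip(q, d)) <= r2]
```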

Cites background from "Low depth cache-oblivious algorithm..."

  • ...Theoretically, achieving the optimal performance in this context is non-trivial and thus has been studied only for particular problems such as sorting and FFTs [75] on the shared memory model with the same parallel cores....


TL;DR: The phase-parallel framework as mentioned in this paper assigns a rank to each object and processes the objects based on the order of their ranks, so that all objects with the same rank can be processed in parallel.
Abstract: Some recent papers showed that many sequential iterative algorithms can be directly parallelized by identifying the dependences between the input objects. This approach yields many simple and practical parallel algorithms, but there are still challenges in achieving work-efficiency and high parallelism. Work-efficiency means that the number of operations is asymptotically the same as the best sequential solution. This can be hard for certain problems where the number of dependences between objects is asymptotically more than the optimal sequential work, so we cannot even afford to generate them. To achieve high parallelism, we want to process as many objects as possible in parallel. The goal is to achieve Õ(D) span for a problem with deepest dependence length D; we refer to this property as round-efficiency. This paper presents work-efficient and round-efficient algorithms for a variety of classic problems and proposes general approaches to do so. To efficiently parallelize many sequential iterative algorithms, we propose the phase-parallel framework. The framework assigns a rank to each object and processes the objects based on the order of their ranks; all objects with the same rank can be processed in parallel. To enable work-efficiency and high parallelism, we use two types of general techniques. Type 1 algorithms aim to use range queries to extract all objects with the same rank, avoiding evaluating all the dependences; we discuss activity selection and Dijkstra's algorithm using the Type 1 framework. Type 2 algorithms aim to wake up an object when the last object it depends on is finished; we discuss activity selection, longest increasing subsequence (LIS), greedy maximal independent set (MIS), and many other algorithms using the Type 2 framework. All of our algorithms are (nearly) work-efficient and round-efficient, many of them improve previous best bounds, and some of them (e.g., LIS) are the first to achieve work-efficiency with round-efficiency. We also implement many of them. On inputs with reasonable dependence depth, our algorithms are highly parallelized and significantly outperform their sequential counterparts.
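A minimal sketch of the phase-parallel pattern (the function names are ours, not the paper's): assign each object a rank equal to the length of its longest dependence chain, then process rank by rank; objects of equal rank are independent, so the number of rounds equals the dependence depth D.

```python
from collections import defaultdict

def phase_parallel(objects, deps, process):
    """deps[v]: objects v depends on; process(batch): one parallel round.
    Recursive rank computation -- fine for shallow dependence graphs."""
    rank = {}
    def get_rank(v):
        if v not in rank:
            rank[v] = 1 + max((get_rank(u) for u in deps.get(v, [])),
                              default=0)
        return rank[v]
    rounds = defaultdict(list)
    for v in objects:
        rounds[get_rank(v)].append(v)
    for r in sorted(rounds):
        process(rounds[r])         # all same-rank objects in parallel
```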
References
Journal ArticleDOI
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model, yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if it is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
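The BSP cost model is compact enough to state in one line: a superstep with per-processor work w, at most h messages sent or received per processor, gap g, and barrier latency l costs w + g·h + l, and a program's cost is the sum over its supersteps. A tiny calculator as an illustration:

```python
def bsp_cost(supersteps, g, l):
    """supersteps: list of (w, h) pairs; returns the BSP running time
    sum over supersteps of (w + g * h + l)."""
    return sum(w + g * h + l for (w, h) in supersteps)

print(bsp_cost([(1000, 10), (500, 40)], g=4, l=100))   # 1900
```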

3,885 citations


Additional excerpts

  • ...7] and distributed memory machines [48, 33, 12]....


Journal ArticleDOI
TL;DR: This article shows that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules, and analyzes the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule by a factor that depends on the size of fast memory.
Abstract: In this article we study the amortized efficiency of the “move-to-front” and similar rules for dynamically maintaining a linear list. Under the assumption that accessing the ith element from the front of the list takes t(i) time, we show that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules. Other natural heuristics, such as the transpose and frequency count rules, do not share this property. We generalize our results to show that move-to-front is within a constant factor of optimum as long as the access cost is a convex function. We also study paging, a setting in which the access cost is not convex. The paging rule corresponding to move-to-front is the “least recently used” (LRU) replacement rule. We analyze the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule (Belady's MIN algorithm) by a factor that depends on the size of fast memory. No on-line paging algorithm has better amortized performance.
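To make the LRU result concrete, here is a small miss-counting simulator (ours, for illustration); the article's theorem bounds its output against the off-line rule (Belady's MIN) running with a smaller fast memory.

```python
from collections import OrderedDict

def lru_misses(trace, k):
    """Count misses of an LRU cache with k slots on an access trace."""
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)           # x is now most recently used
        else:
            misses += 1
            if len(cache) == k:
                cache.popitem(last=False)  # evict least recently used
            cache[x] = True
    return misses

print(lru_misses("abcabcabc", 2))   # 9: LRU misses everything on a cycle
```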

2,378 citations


"Low depth cache-oblivious algorithm..." refers background in this paper

  • ...It follows from [47] that the number of cache misses at each level under the multi-level LRU policy is within a factor of two of the number of misses for a cache half the size running the optimal replacement policy....


Journal ArticleDOI
TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
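Work and critical-path length compose recursively over fork-join structure, which is why they model performance so cleanly. A minimal sketch (ours): work sums over forked children, span takes the maximum, and the paper's bound says a greedy/work-stealing scheduler runs a fully strict program in roughly T_1/p + T_inf time on p processors.

```python
def work_and_span(node):
    """node = (cost, children); children are forked in parallel.
    Returns (T_1, T_inf): total work and critical-path length."""
    cost, children = node
    if not children:
        return cost, cost
    results = [work_and_span(c) for c in children]
    work = cost + sum(w for w, _ in results)
    span = cost + max(s for _, s in results)
    return work, span

leaf = (5, [])
print(work_and_span((1, [leaf, leaf, leaf])))   # (16, 6)
```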

1,688 citations


"Low depth cache-oblivious algorithm..." refers background in this paper

  • ...A common form of programming in this model is based on nested parallelism—consisting of nested parallel loops and/or fork-join constructs [13, 26, 20, 35, 44]....


Book
01 Oct 1992
TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.
Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features:

  • Uses the PRAM (parallel random access machine) as the model for parallel computation.
  • Covers all essential classes of parallel algorithms.
  • Rich exercise sets.
  • Written by a highly respected author within the field.

1,577 citations


Additional excerpts

  • ...A basic strategy for list ranking [40] is the following: (i) shrink the list to size O(n/ log n), and (ii) apply pointer jumping on this shorter list....

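Pointer jumping, the second step of the strategy quoted above, halves every node's remaining distance to the tail in each synchronous round, so O(log n) rounds suffice. A minimal sketch (ours):

```python
def list_rank(next_ptr):
    """next_ptr[i]: successor of i (the tail points to itself).
    Returns each node's distance to the tail after O(log n) rounds."""
    n = len(next_ptr)
    rank = [0 if next_ptr[i] == i else 1 for i in range(n)]
    nxt = list(next_ptr)
    for _ in range(n.bit_length()):            # ceil(log2 n) rounds suffice
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]  # jump: follow two hops
    return rank

print(list_rank([1, 2, 3, 3]))   # [3, 2, 1, 0]
```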

Proceedings ArticleDOI
01 Jul 1993
TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or on overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making the analysis of interesting problems intractable. The model is based on four parameters that abstractly specify the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
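LogP's four parameters (L: latency, o: overhead, g: gap, P: processors) make simple schedules easy to analyze. As an illustration (ours, not from the paper), the standard greedy one-to-all broadcast, where every informed processor keeps sending to uninformed ones, can be simulated directly: a send started at time t completes at t + 2o + L, and the sender may start its next send after max(g, o).

```python
import heapq

def logp_broadcast_time(L, o, g, P):
    """Completion time of a greedy one-to-all broadcast under LogP."""
    ready = [0.0]          # times at which informed procs can next send
    informed = [0.0]       # times at which each proc learned the datum
    while len(informed) < P:
        t = heapq.heappop(ready)
        arrival = t + 2 * o + L          # o (send) + L (wire) + o (receive)
        informed.append(arrival)
        heapq.heappush(ready, t + max(g, o))  # sender's next send slot
        heapq.heappush(ready, arrival)        # receiver becomes a sender
    return max(informed)

print(logp_broadcast_time(L=4, o=1, g=2, P=8))
```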

1,515 citations