Low depth cache-oblivious algorithms

doi:10.1145/1810479.1810519

Proceedings Article•DOI•

Guy E. Blelloch¹, Phillip B. Gibbons², Harsha Vardhan Simhadri¹

Carnegie Mellon University¹, Intel²

13 Jun 2010, pp. 189-199

TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.

Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
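
As one hedged illustration of the approach described in the abstract (a classic textbook example, not code taken from the paper), the sketch below transposes a matrix by recursive halving. The two recursive calls in each branch touch disjoint data, so a nested-parallel runtime could fork them, which gives logarithmic recursion depth; running them in the written order is the natural sequential evaluation and uses no cache parameters. The function name `transpose` and the small `base` cutoff are assumptions made for illustration. The cutoff only limits recursion overhead and is not tuned to any cache size.

```python
def transpose(a, b, r0, r1, c0, c1, base=4):
    """Write the transpose of the block a[r0:r1][c0:c1] into b."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
        return
    if rows >= cols:
        # Split the longer dimension; the two halves are independent,
        # so a fork-join runtime could run them in parallel.
        mid = r0 + rows // 2
        transpose(a, b, r0, mid, c0, c1, base)
        transpose(a, b, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose(a, b, r0, r1, c0, mid, base)
        transpose(a, b, r0, r1, mid, c1, base)


# Usage: transpose a 7 x 5 matrix into a 5 x 7 matrix.
a = [[i * 5 + j for j in range(5)] for i in range(7)]
b = [[None] * 7 for _ in range(5)]
transpose(a, b, 0, 7, 0, 5)
```

Here the sequential order is simply the depth-first traversal of the recursion, which is the order a single processor would follow.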

Citations

Book•

Performance Analysis and Tuning for General Purpose Graphics Processing Units

Hyesoon Kim¹, Richard Vuduc¹, Sara S. Baghsorkhi², Jee Choi¹, Wen-mei W. Hwu³,⁴

Georgia Institute of Technology¹, Intel², National Center for Supercomputing Applications³, University of Illinois at Urbana–Champaign⁴

26 Nov 2012

TL;DR: This book provides a high-level overview of current GPGPU architectures and programming models, and reviews the principles that are used in previous shared memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms.

Abstract: General-purpose graphics processing units (GPGPU) have emerged as an important class of shared memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles that are used in previous shared memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to give architects hints for understanding the algorithmic aspects of GPGPU platforms. We also provide detailed performance analysis and guide optimizations from high-level algorithms down to low-level, instruction-level optimizations. As a case study, we use the fast multipole method (FMM) for n-body particle simulations. We also briefly survey the state of the art in GPU performance analysis tools and techniques. Table of Contents: GPU Design, Programming, and Trends / Performance Principles / From Principles to Practice: Analysis and Tuning / Using Detailed Performance Analysis to Guide Optimization

29 citations

Proceedings Article•DOI•

Optimal Parallel Algorithms in the Binary-Forking Model

Guy E. Blelloch¹, Jeremy T. Fineman², Yan Gu³, Yihan Sun³

Carnegie Mellon University¹, University of Washington², University of California, Riverside³

06 Jul 2020

TL;DR: In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously; costs are measured in terms of work (total number of instructions) and span (longest dependence chain).

Abstract: In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference. In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously. The tasks share memory, supporting reads, writes and test-and-sets. Costs are measured in terms of work (total number of instructions), and span (longest dependence chain). The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it does not assume arbitrary-way forks nor synchronous operations, both of which are hard to implement in modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an Ω(log n) overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model. Most of the algorithms are simple. Many are randomized.
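
To make the two cost measures concrete, here is a minimal sketch (not an algorithm from the paper) that sums an array by recursive binary forking and tallies work and span along the way. The cost accounting is an assumption chosen for illustration: each leaf and each fork-join node is charged one unit. The name `reduce_sum` is hypothetical, and the recursion is executed sequentially; the two recursive calls mark where a runtime would fork.

```python
def reduce_sum(xs, lo, hi):
    """Sum xs[lo:hi] by binary forking; return (total, work, span).

    Assumed cost model: one unit per leaf and one unit per fork-join node.
    Requires hi > lo.
    """
    if hi - lo == 1:
        return xs[lo], 1, 1
    mid = (lo + hi) // 2
    # These two calls are independent and would be forked in parallel.
    left, work_l, span_l = reduce_sum(xs, lo, mid)
    right, work_r, span_r = reduce_sum(xs, mid, hi)
    # Work adds across both children; span follows the longer child.
    return left + right, work_l + work_r + 1, max(span_l, span_r) + 1


total, work, span = reduce_sum(list(range(8)), 0, 8)
# For 8 elements under this accounting: total = 28, work = 15, span = 4.
```

Under this accounting the work grows linearly in the input size while the span grows logarithmically, which is the gap between total instructions and the longest dependence chain that the model measures.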

29 citations

Journal Article•DOI•

Resource Oblivious Sorting on Multicores

Richard Cole¹, Vijaya Ramachandran²

New York University¹, University of Texas at Austin²

06 Aug 2015, arXiv: Data Structures and Algorithms

TL;DR: In this paper, the authors present a deterministic sorting algorithm, SPMS (Sample, Partition, and Merge Sort), that interleaves the partitioning of a sample sort with merging.

Abstract: We present a deterministic sorting algorithm, SPMS (Sample, Partition, and Merge Sort), that interleaves the partitioning of a sample sort with merging. Sequentially, it sorts $n$ elements in $O(n \log n)$ time cache-obliviously with an optimal number of cache misses. The parallel complexity (or critical path length) of the algorithm is $O(\log n \cdot \log\log n)$, which improves on previous bounds for optimal cache oblivious sorting. The algorithm also has low false sharing costs. When scheduled by a work-stealing scheduler in a multicore computing environment with a global shared memory and $p$ cores, each having a cache of size $M$ organized in blocks of size $B$, the costs of the additional cache misses and false sharing misses due to this parallel execution are bounded by the cost of $O(S \cdot M/B)$ and $O(S \cdot B)$ cache misses respectively, where $S$ is the number of steals performed during the execution. Finally, SPMS is resource oblivious in that the dependence on machine parameters appears only in the analysis of its performance, and not within the algorithm itself.

27 citations

Proceedings Article•DOI•

Can traditional programming bridge the Ninja performance gap for parallel computing applications

Satish, Kim, Chhugani, Saito, Krishnaiyer, Smelyanskiy, Girkar, Dubey

01 Jan 2012

27 citations

Proceedings Article•DOI•

Randomized Incremental Convex Hull is Highly Parallel

Guy E. Blelloch¹, Yan Gu², Julian Shun³, Yihan Sun²

Carnegie Mellon University¹, University of California, Riverside², Massachusetts Institute of Technology³

06 Jul 2020

TL;DR: A strong theoretical analysis is provided showing that for n points in any constant dimension, the standard incremental algorithm is inherently parallel, and it is shown that for problems where the size of the support set can be bounded by a constant, the depth of the configuration dependence graph is shallow.

Abstract: The randomized incremental convex hull algorithm is one of the most practical and important geometric algorithms in the literature. Due to its simplicity, and the fact that many points or facets can be added independently, it is also widely used in parallel convex hull implementations. However, to date there have been no non-trivial theoretical bounds on the parallelism available in these implementations. In this paper, we provide a strong theoretical analysis showing that the standard incremental algorithm is inherently parallel. In particular, we show that for n points in any constant dimension, the algorithm has O(log n) dependence depth with high probability. This leads to a simple work-optimal parallel algorithm with polylogarithmic span with high probability. Our key technical contribution is a new definition and analysis of the configuration dependence graph extending the traditional configuration space, which allows for asynchrony in adding configurations. To capture the "true" dependence between configurations, we define the support set of configuration c to be the set of already added configurations that it depends on. We show that for problems where the size of the support set can be bounded by a constant, the depth of the configuration dependence graph is shallow (O(log n) with high probability for input size n). In addition to convex hull, our approach also extends to several related problems, including half-space intersection and finding the intersection of a set of unit circles. We believe that the configuration dependence graph and its analysis is a general idea that could potentially be applied to more problems.

26 citations

Cites methods from "Low depth cache-oblivious algorithm..."

...tal model for parallelism and has been widely used in analyzing parallel algorithms [1, 2, 11, 12, 15, 16, 29, 30, 33], and are also supported by programming systems such as Cilk [37], the Java fork-join framework [46], X10 [25], Habanero [23], TBB [44], and TPL [57]....
References

Journal Article•DOI•

A bridging model for parallel computation

Leslie G. Valiant¹

Harvard University¹

01 Aug 1990, Communications of the ACM

TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate bridge between software and hardware for parallel computation, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.

Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled on to this model; yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.

3,885 citations

Additional excerpts

...7] and distributed memory machines [48, 33, 12]....

Journal Article•DOI•

Amortized efficiency of list update and paging rules

Daniel D. Sleator¹, Robert E. Tarjan¹

Bell Labs¹

01 Feb 1985, Communications of the ACM

TL;DR: This article shows that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules, and analyzes the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule by a factor that depends on the size of fast memory.

Abstract: In this article we study the amortized efficiency of the “move-to-front” and similar rules for dynamically maintaining a linear list. Under the assumption that accessing the ith element from the front of the list takes t(i) time, we show that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules. Other natural heuristics, such as the transpose and frequency count rules, do not share this property. We generalize our results to show that move-to-front is within a constant factor of optimum as long as the access cost is a convex function. We also study paging, a setting in which the access cost is not convex. The paging rule corresponding to move-to-front is the “least recently used” (LRU) replacement rule. We analyze the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule (Belady's MIN algorithm) by a factor that depends on the size of fast memory. No on-line paging algorithm has better amortized performance.

2,378 citations

"Low depth cache-oblivious algorithm..." refers background in this paper

...It follows from [47] that the number of cache misses at each level under the multi-level LRU policy is within a factor of two of the number of misses for a cache half the size running the optimal replacement policy....
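
The comparison in this excerpt can be explored with a small simulation. The sketch below is a single-level illustration only (the excerpt concerns a multi-level policy), and the function names `lru_misses` and `min_misses` are hypothetical. It counts misses for LRU and for the off-line rule that evicts the item whose next use lies farthest in the future (Belady's MIN), so the two can be compared on a cache and on one half its size.

```python
from collections import OrderedDict


def lru_misses(trace, size):
    """Count misses of an LRU cache holding `size` items, starting empty."""
    cache = OrderedDict()
    misses = 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)           # mark x as most recently used
        else:
            misses += 1
            if len(cache) >= size:
                cache.popitem(last=False)  # evict the least recently used
            cache[x] = True
    return misses


def min_misses(trace, size):
    """Count misses of the off-line MIN rule with `size` items, starting empty."""
    cache = set()
    misses = 0
    for t, x in enumerate(trace):
        if x in cache:
            continue
        misses += 1
        if len(cache) >= size:
            def next_use(y):
                # Index of the next access to y, or len(trace) if none.
                for u in range(t + 1, len(trace)):
                    if trace[u] == y:
                        return u
                return len(trace)
            cache.remove(max(cache, key=next_use))
        cache.add(x)
    return misses


# A cyclic scan over 10 items is a hard case for LRU.
trace = [i % 10 for i in range(100)]
lru_full = lru_misses(trace, 8)   # LRU with the full-size cache
opt_half = min_misses(trace, 4)   # MIN with half the size
```

The quadratic `next_use` scan keeps the sketch short; a production simulator would precompute next-use indices instead.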

Journal Article•DOI•

Cilk: An Efficient Multithreaded Runtime System

Robert D. Blumofe¹, Christopher F. Joerg¹, Bradley C. Kuszmaul¹, Charles E. Leiserson¹, Keith H. Randall¹, Yuli Zhou¹

Massachusetts Institute of Technology¹

25 Aug 1996, Journal of Parallel and Distributed Computing

TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.

1,688 citations

"Low depth cache-oblivious algorithm..." refers background in this paper

...A common form of programming in this model is based on nested parallelism—consisting of nested parallel loops and/or fork-join constructs [13, 26, 20, 35, 44]....

Book•

An introduction to parallel algorithms

Joseph JaJa¹

University of Maryland, College Park¹

01 Oct 1992

TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.

Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features: uses PRAM (parallel random access machine) as the model for parallel computation; covers all essential classes of parallel algorithms; rich exercise sets; written by a highly respected author within the field.

1,577 citations

Additional excerpts

...A basic strategy for list ranking [40] is the following: (i) shrink the list to size O(n/ log n), and (ii) apply pointer jumping on this shorter list....
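
Step (ii) of this strategy can be sketched as follows. This is a minimal sequential simulation of pointer jumping, assuming the list is given as a successor array in which the tail points to itself; the name `list_rank` and that input convention are assumptions for illustration. Each round doubles the distance every pointer spans, so the number of rounds is logarithmic in the list length. Applied to the full list, pointer jumping performs O(n log n) work, which is why the strategy first shrinks the list in step (i); that shrinking step is not shown here.

```python
def list_rank(succ):
    """Rank list nodes by pointer jumping.

    succ[i] is the successor of node i; the tail satisfies succ[t] == t.
    Returns (rank, rounds), where rank[i] is the number of links from
    node i to the tail.
    """
    n = len(succ)
    nxt = list(succ)
    # Invariant: rank[i] is the distance from i to nxt[i].
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # One synchronous round: every node reads the old arrays, and the
        # new arrays are built from those reads (conceptually in parallel).
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
        rounds += 1
    return rank, rounds


# The list 3 -> 0 -> 2 -> 1 -> 4, with node 4 as the tail.
rank, rounds = list_rank([2, 4, 1, 0, 4])
# rank is [3, 1, 2, 4, 0], reached after 2 rounds.
```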

Proceedings Article•DOI•

LogP: towards a realistic model of parallel computation

David E. Culler¹, Richard M. Karp¹, David A. Patterson¹, Abhijit Sahay¹, Klaus Erik Schauser¹, Eunice E. Santos¹, Ramesh Subramonian¹, Thorsten von Eicken¹

University of California, Berkeley¹

01 Jul 1993

TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.

Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration, in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.

1,515 citations