Low depth cache-oblivious algorithms

doi:10.1145/1810479.1810519

Home
/
Papers
/
Low depth cache-oblivious algorithms

Proceedings Article•DOI•

Low depth cache-oblivious algorithms

Guy E. Blelloch¹, Phillip B. Gibbons², Harsha Vardhan Simhadri¹•Institutions (2)

Carnegie Mellon University¹, Intel²

13 Jun 2010-pp 189-199

TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.

read less

Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

I/O-Optimal Distribution Sweeping on Private-Cache Chip Multiprocessors

[...]

Deepak Ajwani¹, Nodari Sitchinava², Norbert Zeh³•Institutions (3)

University College Cork¹, Aarhus University², Dalhousie University³

16 May 2011

TL;DR: A new one-dimensional batched range counting algorithm on a sorted list of ranges and points that achieves an I/O complexity of $O((N + K)/PB)$, where $K$ is the sum of the counts of all the ranges.

...read moreread less

Abstract: The parallel external memory (PEM) model has been used as a basis for the design and analysis of a wide range of algorithms for private-cache multi-core architectures. As a tool for developing geometric algorithms in this model, a parallel version of the I/O-efficient distribution sweeping framework was introduced recently, and a number of algorithms for problems on axis-aligned objects were obtained using this framework. The obtained algorithms were efficient but not optimal. In this paper, we improve the framework to obtain algorithms with the optimal I/O complexity of $O(sort {P}(N) + K/PB)$ for a number of problems on axis-aligned objects, $P$ denotes the number of cores/processors, $B$ denotes the number of elements that fit in a cache line, $N$ and $K$ denote the sizes of the input and output, respectively, and $sort {P}(N)$ denotes the I/O complexity of sorting $N$ items using $P$ processors in the PEM model. To obtain the above improvement, we present a new one-dimensional batched range counting algorithm on a sorted list of ranges and points that achieves an I/O complexity of $O((N + K)/PB)$, where $K$ is the sum of the counts of all the ranges. The key to achieving efficient load balancing among the processors in this algorithm is a new method to count the output without enumerating it, which might be of independent interest.

...read moreread less

6 citations

Cites background from "Low depth cache-oblivious algorithm..."

...It would be particularly interesting to see if an I/O-optimal low-depth cache-oblivious distribution sweeping paradigm can be designed, along the lines of [ 14 ]....
[...]
...In [9]–[ 14 ] several different multicore models were considered and cache- and processor-oblivious algorithms were presented for fundamental combinatorial, graph, and matrix-based problems....
[...]

Journal Article•DOI•

Computational geometry in the parallel external memory model

[...]

Nodari Sitchinava¹•Institutions (1)

Karlsruhe Institute of Technology¹

01 Jul 2012-Sigspatial Special

TL;DR: Continued advances in VLSI scaling combined with unsustainable power consumption of frequency scaling resulted in parallel processors having become mainstream as demonstrated by modern multicores, and prototypes boast up to 48 cores on a single chip.

...read moreread less

Abstract: Continued advances in VLSI scaling combined with unsustainable power consumption of frequency scaling resulted in parallel processors having become mainstream as demonstrated by modern multicores. Current off-the-shelf processors already contain up to 16 cores and the prototypes boast up to 48 cores on a single chip.

...read moreread less

6 citations

Cites background from "Low depth cache-oblivious algorithm..."

...see [9, 10, 12]), the PEM model offers the simplest way to study parallelism and cache-efficiency required for efficient computations on modern multicores....
[...]

Proceedings Article•DOI•

Learning with Analytical Models

[...]

Huda Ibeid¹, Siping Meng¹, Oliver Dobon¹, Luke N. Olson¹, William Gropp¹ - Show less +1 more•Institutions (1)

University of Illinois at Urbana–Champaign¹

20 May 2019

TL;DR: The proposed hybrid model aims to minimize prediction cost while providing reasonable prediction accuracy, and improves the prediction accuracy in comparison to pure machine learning techniques while using small training datasets, thus making it suitable for hardware and workload changes.

...read moreread less

Abstract: To understand and predict the performance of scientific applications, several analytical and machine learning approaches have been proposed, each having its advantages and disadvantages. In this paper, we propose and validate a hybrid approach for performance modeling and prediction, which combines analytical and machine learning models. The proposed hybrid model aims to minimize prediction cost while providing reasonable prediction accuracy. Our validation results show that the hybrid model is able to learn and correct the analytical models to better match the actual performance. Furthermore, the proposed hybrid model improves the prediction accuracy in comparison to pure machine learning techniques while using small training datasets, thus making it suitable for hardware and workload changes.

...read moreread less

5 citations

Cites methods from "Low depth cache-oblivious algorithm..."

...For a cache with size Z and cacheline length L in elements, a cache-oblivious algorithm [13] for multiplying a sparse H ×H matrix with h non-zeros by a vector establishes an upper bound on cache misses in the SpMV as O( L + H...
[...]

Posted Content•

Parallel In-Place Algorithms: Theory and Practice

[...]

Yan Gu, Omar Obeya, Julian Shun

01 Mar 2021-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: In this paper, the Decomposable Property is introduced to enable existing non-in-place but highly-optimized parallel algorithms to be converted into parallel in-place algorithms.

...read moreread less

Abstract: Many parallel algorithms use at least linear auxiliary space in the size of the input to enable computations to be done independently without conflicts. Unfortunately, this extra space can be prohibitive for memory-limited machines, preventing large inputs from being processed. Therefore, it is desirable to design parallel in-place algorithms that use sublinear (or even polylogarithmic) auxiliary space. In this paper, we bridge the gap between theory and practice for parallel in-place (PIP) algorithms. We first define two computational models based on fork-join parallelism, which reflect modern parallel programming environments. We then introduce a variety of new parallel in-place algorithms that are simple and efficient, both in theory and in practice. Our algorithmic highlight is the Decomposable Property introduced in this paper, which enables existing non-in-place but highly-optimized parallel algorithms to be converted into parallel in-place algorithms. Using this property, we obtain algorithms for random permutation, list contraction, tree contraction, and merging that take linear work, $O(n^{1-\epsilon})$ auxiliary space, and $O(n^\epsilon\cdot\text{polylog}(n))$ span for $0<\epsilon<1$. We also present new parallel in-place algorithms for scan, filter, merge, connectivity, biconnectivity, and minimum spanning forest using other techniques. In addition to theoretical results, we present experimental results for implementations of many of our parallel in-place algorithms. We show that on a 72-core machine with two-way hyper-threading, the parallel in-place algorithms usually outperform existing parallel algorithms for the same problems that use linear auxiliary space, indicating that the theory developed in this paper indeed leads to practical benefits in terms of both space usage and running time.

...read moreread less

5 citations

Book Chapter•DOI•

Cache Oblivious Minimum Cut

[...]

Barbara Geissmann¹, Lukas Gianinazzi¹•Institutions (1)

ETH Zurich¹

24 May 2017

TL;DR: This work shows how to compute the minimum cut of a graph cache-efficiently using a cache oblivious algorithm and a simpler one that incurs cache misses.

...read moreread less

Abstract: We show how to compute the minimum cut of a graph cache-efficiently. Let B be the width of a cache line and M be the size of the cache. On a graph with V vertices and E edges, we give a cache oblivious algorithm that incurs $O(\lceil \frac{E}{B} (\log ^4 E) \log _{M/B} E\rceil )$ cache misses and a simpler one that incurs $O(\lceil \frac{V^2}{B} \log ^3 V\rceil )$ cache misses.

...read moreread less

5 citations

1
2
3
4
5
6
7
8
9
10
11
…
12
13
14
15
16
17
18
…
19
20
21
22
23
24
25

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

A bridging model for parallel computation

[...]

Leslie G. Valiant¹•Institutions (1)

Harvard University¹

01 Aug 1990-Communications of The ACM

TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, and results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.

...read moreread less

Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled on to this model; yet it can be effeciently implemented in hardware. The author argues that an analogous bridge between software and hardware in required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.

...read moreread less

3,885 citations

Additional excerpts

...7] and distributed memory machines [48, 33, 12]....
[...]

Journal Article•DOI•

Amortized efficiency of list update and paging rules

[...]

Daniel D. Sleator¹, Robert E. Tarjan¹•Institutions (1)

Bell Labs¹

01 Feb 1985-Communications of The ACM

TL;DR: This article shows that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules, and analyzes the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule by a factor that depends on the size of fast memory.

...read moreread less

Abstract: In this article we study the amortized efficiency of the “move-to-front” and similar rules for dynamically maintaining a linear list. Under the assumption that accessing the ith element from the front of the list takes t(i) time, we show that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules. Other natural heuristics, such as the transpose and frequency count rules, do not share this property. We generalize our results to show that move-to-front is within a constant factor of optimum as long as the access cost is a convex function. We also study paging, a setting in which the access cost is not convex. The paging rule corresponding to move-to-front is the “least recently used” (LRU) replacement rule. We analyze the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule (Belady's MIN algorithm) by a factor that depends on the size of fast memory. No on-line paging algorithm has better amortized performance.

...read moreread less

2,378 citations

"Low depth cache-oblivious algorithm..." refers background in this paper

...It follows from [47] that the number of cache misses at each level under the multi-level LRU policy is within a factor of two of the number of misses for a cache half the size running the optimal replacement policy....
[...]

Journal Article•DOI•

Cilk: An Efficient Multithreaded Runtime System

[...]

Robert D. Blumofe¹, Christopher F. Joerg¹, Bradley C. Kuszmaul¹, Charles E. Leiserson¹, Keith H. Randall¹, Yuli Zhou¹ - Show less +2 more•Institutions (1)

Massachusetts Institute of Technology¹

25 Aug 1996-Journal of Parallel and Distributed Computing

TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.

...read moreread less

1,688 citations

"Low depth cache-oblivious algorithm..." refers background in this paper

...A common form of programming in this model is based on nested parallelism—consisting of nested parallel loops and/or fork-join constructs [13, 26, 20, 35, 44]....
[...]

Book•

An introduction to parallel algorithms

[...]

Joseph JaJa¹•Institutions (1)

University of Maryland, College Park¹

01 Oct 1992

TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.

...read moreread less

Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features *Uses PRAM (parallel random access machine) as the model for parallel computation. *Covers all essential classes of parallel algorithms. *Rich exercise sets. *Written by a highly respected author within the field. 0201548569B04062001

...read moreread less

1,577 citations

Additional excerpts

...A basic strategy for list ranking [40] is the following: (i) shrink the list to size O(n/ log n), and (ii) apply pointer jumping on this shorter list....
[...]

Proceedings Article•DOI•

LogP: towards a realistic model of parallel computation

[...]

David E. Culler¹, Richard M. Karp¹, David A. Patterson¹, Abhijit Sahay¹, Klaus Erik Schauser¹, Eunice E. Santos¹, Ramesh Subramonian¹, Thorsten von Eicken¹ - Show less +4 more•Institutions (1)

University of California, Berkeley¹

01 Jul 1993

TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.

...read moreread less

Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. it is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration, in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.

...read moreread less

1,515 citations