Low depth cache-oblivious algorithms

doi:10.1145/1810479.1810519

Proceedings Article•DOI•

Guy E. Blelloch¹, Phillip B. Gibbons², Harsha Vardhan Simhadri¹•Institutions (2)

Carnegie Mellon University¹, Intel²

13 Jun 2010-pp 189-199

TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.

Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
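
The design recipe in the abstract (a nested-parallel algorithm with low depth whose natural sequential order is cache-friendly) can be illustrated with a textbook divide-and-conquer matrix transpose. This is a minimal sketch and not code from the paper; the names `transpose_rec`, `transpose`, and the `base` cutoff are assumptions made for illustration. The two recursive calls touch disjoint data, so a nested-parallel runtime could fork them, while running them one after the other gives the sequential evaluation order.

```python
def transpose_rec(a, b, r0, r1, c0, c1, base=4):
    """Write the transpose of the block a[r0:r1][c0:c1] into b.

    The longer side is halved until the block is at most base x base.
    No cache parameter appears in the code, which is what makes the
    recursion cache-oblivious. `base` is only a hypothetical cutoff.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
        return
    if rows >= cols:
        mid = r0 + rows // 2
        # Independent halves: a parallel runtime could fork these two calls.
        transpose_rec(a, b, r0, mid, c0, c1, base)
        transpose_rec(a, b, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose_rec(a, b, r0, r1, c0, mid, base)
        transpose_rec(a, b, r0, r1, mid, c1, base)


def transpose(a):
    """Return the transpose of a non-empty rectangular list of lists."""
    n, m = len(a), len(a[0])
    b = [[None] * n for _ in range(m)]
    transpose_rec(a, b, 0, n, 0, m)
    return b
```

Because each level of the recursion halves one dimension, the depth of the fork-join structure grows logarithmically in the matrix dimensions, while small enough sub-blocks fit in cache regardless of the cache size.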

Citations

Proceedings Article•DOI•

Ligra: a lightweight graph processing framework for shared memory

Julian Shun¹, Guy E. Blelloch¹•Institutions (1)

Carnegie Mellon University¹

23 Feb 2013

TL;DR: This paper presents a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.

Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
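
The two routines described in this abstract, one mapping over vertices and one mapping over edges with a switch based on frontier density, can be sketched sequentially as follows. This is an illustrative sketch only, not the framework's actual interface: the names `vertex_map`, `edge_map`, `bfs`, and the `threshold_div` parameter are assumptions, and a real parallel version would need an atomic update in place of the plain assignment used here.

```python
def vertex_map(frontier, fn):
    """Apply fn to every vertex in the subset; keep those where fn is true."""
    return {v for v in frontier if fn(v)}


def edge_map(out_nbrs, in_nbrs, frontier, update, cond, threshold_div=20):
    """Apply update(u, v) to edges (u, v) with u in frontier and cond(v) true.

    Returns the set of targets v for which some update returned True.
    A large frontier is processed densely (each candidate target scans its
    in-neighbours); a small one sparsely (each frontier vertex scans its
    out-neighbours). The threshold is a hypothetical tunable.
    """
    num_edges = sum(len(ns) for ns in out_nbrs)
    frontier_work = len(frontier) + sum(len(out_nbrs[u]) for u in frontier)
    nxt = set()
    if frontier_work > num_edges // threshold_div:
        for v in range(len(in_nbrs)):          # dense (pull) traversal
            if not cond(v):
                continue
            for u in in_nbrs[v]:
                if u in frontier and update(u, v):
                    nxt.add(v)
                if not cond(v):                # target settled, stop early
                    break
    else:
        for u in frontier:                     # sparse (push) traversal
            for v in out_nbrs[u]:
                if cond(v) and update(u, v):
                    nxt.add(v)
    return nxt


def bfs(out_nbrs, in_nbrs, source, threshold_div=20):
    """Return a BFS parent array; unreachable vertices keep parent -1."""
    parent = [-1] * len(out_nbrs)
    parent[source] = source

    def update(u, v):
        if parent[v] == -1:
            parent[v] = u
            return True
        return False

    def cond(v):
        return parent[v] == -1

    frontier = {source}
    while frontier:
        frontier = edge_map(out_nbrs, in_nbrs, frontier, update, cond,
                            threshold_div)
    return parent
```

With `out_nbrs = [[1, 2], [3], [3], [4], [], []]` and the matching in-neighbour lists, `bfs(out_nbrs, in_nbrs, 0)` assigns vertices 1 and 2 the parent 0, vertex 3 one of them as parent, vertex 4 the parent 3, and leaves the isolated vertex 5 at -1.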

816 citations

Proceedings Article•DOI•

Multicore triangle computations without tuning

Julian Shun¹, Kanat Tangwongsan²•Institutions (2)

Carnegie Mellon University¹, Mahidol University International College²

13 Apr 2015

TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations; the algorithms scale to billions of nodes and edges and are much faster than existing parallel approximate triangle counting implementations.

Abstract: Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs.

143 citations

Proceedings Article•DOI•

Internally deterministic parallel algorithms can be fast

Guy E. Blelloch¹, Jeremy T. Fineman², Phillip B. Gibbons³, Julian Shun¹•Institutions (3)

Carnegie Mellon University¹, University of Washington², Intel³

25 Feb 2012

TL;DR: The main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code.

Abstract: The virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and implies a sequential semantics---i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally and experimentally reasoning about performance. On the other hand one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural or special purpose and/or (ii) slower or less scalable. In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures.
We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.

141 citations

Cites methods from "Low depth cache-oblivious algorithm..."

...Comparison Sort: We use a low-depth cache-efficient sample sort [9]....
[...]

Journal Article•DOI•

Can traditional programming bridge the Ninja performance gap for parallel computing applications

Nadathur Satish¹, Changkyu Kim¹, Jatin Chhugani¹, Hideki Saito¹, Rakesh Krishnaiyer¹, Mikhail Smelyanskiy¹, Milind B. Girkar¹, Pradeep Dubey¹•Institutions (1)

Intel¹

09 Jun 2012

TL;DR: It is demonstrated that the otherwise uncontrolled growth of the Ninja gap can be contained, giving more stable and predictable performance growth over future architectures and offering strong evidence that radical language changes are not required.

Abstract: Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-vs-programming effort effectiveness of common multi-core processors and upcoming many-core architectures in delivering significant speedup, and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the "Ninja gap", which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap if left unaddressed will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort in producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and even further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC) which has more cores and wider SIMD. 
We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer a more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.

87 citations

Cites methods from "Low depth cache-oblivious algorithm..."

...There have been various techniques proposed to address these algorithmic changes, either using compiler assisted optimization [27], using cache-oblivious algorithms [6] or specialized languages like Sequoia [21]....
[...]

Proceedings Article•DOI•

Scheduling irregular parallel computations on hierarchical caches

Guy E. Blelloch¹, Jeremy T. Fineman¹, Phillip B. Gibbons², Harsha Vardhan Simhadri¹•Institutions (2)

Carnegie Mellon University¹, Intel²

04 Jun 2011

TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.

Abstract: For nested-parallel computations with low depth (span, critical path length) analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.

84 citations

Cites background from "Low depth cache-oblivious algorithm..."

...However, all future references fit into cache until reaching a supertask that does not fit in cache, at which point the

Problem | Span | Cache complexity $Q^*$
Scan (prefix sums, etc.) | $O(\log n)$ | $O(\lceil n/B \rceil)$
Matrix Transpose ($n \times m$ matrix) [20] | $O(\log(n + m))$ | $O(\lceil nm/B \rceil)$
Matrix Multiplication ($\sqrt{n} \times \sqrt{n}$ matrix) [20] | $O(\sqrt{n})$ | $O(\lceil n^{1.5}/B \rceil / \sqrt{M + 1})$
Matrix Inversion ($\sqrt{n} \times \sqrt{n}$ matrix) | $O(\sqrt{n})$ | $O(\lceil n^{1.5}/B \rceil / \sqrt{M + 1})$
Quicksort [22] | $O(\log^2 n)$ | $O(\lceil n/B \rceil (1 + \log \lceil n/(M + 1) \rceil))$
Sample Sort [10] | $O(\log^2 n)$ | $O(\lceil n/B \rceil \lceil \log_{M+2} n \rceil)$
Sparse-Matrix Vector Multiply [10] ($m$ nonzeros, $n^{\epsilon}$ edge separators) | $O(\log^2 n)$ | $O(\lceil m/B + n/(M + 1)^{1-\epsilon} \rceil)$
Convex Hull (e.g., see [8]) | $O(\log^2 n)$ | $O(\lceil n/B \rceil \lceil \log_{M+2} n \rceil)$
Barnes Hut tree (e.g., see [8]) | $O(\log^2 n)$ | $O(\lceil n/B \rceil (1 + \log \lceil n/(M + 1) \rceil))$

Table 1: Cache complexities of some algorithms analyzed in the PCO model....
[...]
...Unfortunately, current dynamic parallelism approaches have important limitations: they either apply to hierarchies of only private or only shared caches [1,9,10,16,21], require some strict balance criteria [7,15], or require a joint algorithm/scheduler analysis [7, 13–16]....
[...]
...Sample Sort [10] | $O(\log^2 n)$ | $O(\lceil n/B \rceil \lceil \log_{M+2} n \rceil)$
Sparse-Matrix Vector Multiply [10] ($m$ nonzeros, $n^{\epsilon}$ edge separators) | $O(\log^2 n)$ | $O(\lceil m/B + n/(M + 1)^{1-\epsilon} \rceil)$...
[...]
...A pair of common abstract measures for capturing parallel cache based locality are the number of misses given a sequential ordering of a parallel computation [1, 9, 10, 21], and the depth (span, critical path length) of the computation....
[...]

References

Journal Article•DOI•

A bridging model for parallel computation

Leslie G. Valiant¹•Institutions (1)

Harvard University¹

01 Aug 1990-Communications of The ACM

TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate bridging model between software and hardware for parallel computation, and results are given quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.

Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model, yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.

3,885 citations

Additional excerpts

...7] and distributed memory machines [48, 33, 12]....
[...]

Journal Article•DOI•

Amortized efficiency of list update and paging rules

Daniel D. Sleator¹, Robert E. Tarjan¹•Institutions (1)

Bell Labs¹

01 Feb 1985-Communications of The ACM

TL;DR: This article shows that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules, and analyzes the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule by a factor that depends on the size of fast memory.

Abstract: In this article we study the amortized efficiency of the “move-to-front” and similar rules for dynamically maintaining a linear list. Under the assumption that accessing the ith element from the front of the list takes t(i) time, we show that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules. Other natural heuristics, such as the transpose and frequency count rules, do not share this property. We generalize our results to show that move-to-front is within a constant factor of optimum as long as the access cost is a convex function. We also study paging, a setting in which the access cost is not convex. The paging rule corresponding to move-to-front is the “least recently used” (LRU) replacement rule. We analyze the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule (Belady's MIN algorithm) by a factor that depends on the size of fast memory. No on-line paging algorithm has better amortized performance.
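
The move-to-front rule studied in this abstract can be sketched in a few lines. The sketch assumes the simplest cost function, t(i) = i, so accessing the element in position i (counting from 1) costs i; the function name `move_to_front_cost` is hypothetical.

```python
def move_to_front_cost(initial, requests):
    """Serve requests on a list under move-to-front.

    Assumed cost model: accessing the i-th element from the front
    (1-based) costs i. Returns (total cost, final list order).
    """
    lst = list(initial)
    total = 0
    for x in requests:
        i = lst.index(x)            # 0-based position of the requested item
        total += i + 1              # access cost t(i) = i in 1-based terms
        lst.insert(0, lst.pop(i))   # move the accessed item to the front
    return total, lst
```

For the initial list `["a", "b", "c"]` and requests `["c", "c", "b"]`, the accesses cost 3, 1, and 3, for a total of 7, whereas leaving the list unchanged would cost 3 + 3 + 2 = 8.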

2,378 citations

"Low depth cache-oblivious algorithm..." refers background in this paper

...It follows from [47] that the number of cache misses at each level under the multi-level LRU policy is within a factor of two of the number of misses for a cache half the size running the optimal replacement policy....
[...]
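
The comparison in the excerpt above, LRU against the optimal offline policy on a cache half the size, can be checked on small traces with the following sketch. The function names `lru_misses` and `min_misses` are hypothetical, both caches are assumed to start empty, and the experiment only examines individual traces rather than proving the bound.

```python
from collections import OrderedDict


def lru_misses(trace, size):
    """Count misses of an initially empty LRU cache holding `size` items."""
    cache = OrderedDict()
    misses = 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)            # mark x as most recently used
        else:
            misses += 1
            if len(cache) >= size:
                cache.popitem(last=False)   # evict the least recently used
            cache[x] = True
    return misses


def min_misses(trace, size):
    """Count misses of the offline rule that evicts the item used farthest ahead."""
    n = len(trace)
    next_use = [float("inf")] * n
    last_seen = {}
    for i in range(n - 1, -1, -1):          # next_use[i]: next index of trace[i]
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i
    cache = {}                              # item -> index of its next use
    misses = 0
    for i, x in enumerate(trace):
        if x not in cache:
            misses += 1
            if len(cache) >= size:
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[x] = next_use[i]
    return misses
```

On the trace `[1, 2, 3, 1, 2, 3]` with two slots, LRU misses on all 6 accesses while the offline rule misses 4 times. Comparing `lru_misses(trace, M)` with `2 * min_misses(trace, M // 2)` on longer traces gives a quick empirical view of the factor-of-two relationship; a small additive slack may be needed on short traces.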

Journal Article•DOI•

Cilk: An Efficient Multithreaded Runtime System

Robert D. Blumofe¹, Christopher F. Joerg¹, Bradley C. Kuszmaul¹, Charles E. Leiserson¹, Keith H. Randall¹, Yuli Zhou¹•Institutions (1)

Massachusetts Institute of Technology¹

25 Aug 1996-Journal of Parallel and Distributed Computing

TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.

1,688 citations

"Low depth cache-oblivious algorithm..." refers background in this paper

...A common form of programming in this model is based on nested parallelism—consisting of nested parallel loops and/or fork-join constructs [13, 26, 20, 35, 44]....
[...]
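
The fork-join style of nested parallelism mentioned in the excerpt above, together with the "work" and "critical-path length" measures from the TL;DR, can be illustrated with a divide-and-conquer sum. This is a sequential sketch under a simplified cost model (every leaf and every combine step costs one unit); the function name `fork_join_sum` is hypothetical and no real parallel runtime is involved.

```python
def fork_join_sum(xs, lo=0, hi=None):
    """Return (total, work, depth) for a divide-and-conquer sum of xs[lo:hi].

    Assumed cost model: each leaf and each combine step costs one unit.
    Work adds across the two branches; depth takes the longer branch,
    because the branches could run in parallel between fork and join.
    """
    if hi is None:
        hi = len(xs)
    if hi - lo <= 1:
        return (xs[lo] if hi > lo else 0), 1, 1
    mid = (lo + hi) // 2
    left, w1, d1 = fork_join_sum(xs, lo, mid)    # first forked branch
    right, w2, d2 = fork_join_sum(xs, mid, hi)   # second forked branch
    return left + right, w1 + w2 + 1, max(d1, d2) + 1   # join
```

For eight elements the call returns work 15 (8 leaves plus 7 combines) and depth 4, showing work linear in n and depth logarithmic in n.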

Book•

An introduction to parallel algorithms

Joseph JaJa¹•Institutions (1)

University of Maryland, College Park¹

01 Oct 1992

TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.

Abstract: Written by an authority in the field, this book provides an introduction to the design and analysis of parallel algorithms. The emphasis is on the application of the PRAM (parallel random access machine) model of parallel computation, with all its variants, to algorithm analysis. Special attention is given to the selection of relevant data structures and to algorithm design principles that have proved to be useful. Features:
* Uses PRAM (parallel random access machine) as the model for parallel computation.
* Covers all essential classes of parallel algorithms.
* Rich exercise sets.
* Written by a highly respected author within the field.

1,577 citations

Additional excerpts

...A basic strategy for list ranking [40] is the following: (i) shrink the list to size O(n/ log n), and (ii) apply pointer jumping on this shorter list....
[...]
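
Step (ii) of the list-ranking strategy quoted above, pointer jumping, can be sketched as a sequence of synchronous rounds. The sketch simulates the rounds sequentially and omits step (i), the shrinking of the list; the function name `list_rank` and the convention that the tail points to itself are assumptions for illustration.

```python
def list_rank(succ):
    """Rank a linked list by pointer jumping.

    succ[i] is the index of the node after i; the tail points to itself
    (an assumed convention). Returns (rank, rounds), where rank[i] is the
    number of links from node i to the tail.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    # Continue until every pointer already reaches the tail.
    while any(nxt[nxt[i]] != nxt[i] for i in range(n)):
        # One synchronous round: all reads use the old arrays, and each
        # position could be updated by its own processor in parallel.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
        rounds += 1
    return rank, rounds
```

For the four-node list `succ = [1, 2, 3, 3]` the ranks are `[3, 2, 1, 0]` after 2 rounds. Each round doubles the distance covered by every pointer, so the number of rounds is logarithmic in the list length, but every round touches all n nodes, which is why shrinking the list first helps keep the total work low.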

Proceedings Article•DOI•

LogP: towards a realistic model of parallel computation

David E. Culler¹, Richard M. Karp¹, David A. Patterson¹, Abhijit Sahay¹, Klaus Erik Schauser¹, Eunice E. Santos¹, Ramesh Subramonian¹, Thorsten von Eicken¹•Institutions (1)

University of California, Berkeley¹

01 Jul 1993

TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.

Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration, in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.

1,515 citations