Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TLDR
This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.

Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
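To make the approach concrete, here is a minimal sketch (illustrative only, not code from the paper) of a nested-parallel, divide-and-conquer reduction: its depth is O(log n), and its natural sequential evaluation order scans the input left to right, giving a sequential cache-oblivious cost of O(n/B) misses for block size B. The grain-size cutoff kGrain and the use of std::async as a stand-in for a work-stealing runtime such as Cilk are assumptions of the sketch.

    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Illustrative nested-parallel reduction: O(log n) depth, and the
    // natural sequential order touches the array once, left to right,
    // so the sequential cache-oblivious cost is O(n/B) misses.
    long long reduce(const std::vector<long long>& a,
                     std::size_t lo, std::size_t hi) {
        const std::size_t kGrain = 4096;  // recursion cutoff; not tuned to any cache size
        if (hi - lo <= kGrain)
            return std::accumulate(a.begin() + lo, a.begin() + hi, 0LL);
        std::size_t mid = lo + (hi - lo) / 2;
        // Fork the two halves; a work-stealing scheduler would spawn here.
        auto left = std::async(std::launch::async, reduce, std::cref(a), lo, mid);
        long long right = reduce(a, mid, hi);
        return left.get() + right;
    }

    int main() {
        std::vector<long long> a(1 << 20, 1);
        std::cout << reduce(a, 0, a.size()) << "\n";  // prints 1048576
    }

Because the recursion never consults cache parameters, the same code is cache-oblivious at every level of a cache hierarchy; the low depth is what the paper's mappings then exploit to bound parallel cache complexity.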



Citations
Proceedings ArticleDOI

Ligra: a lightweight graph processing framework for shared memory

TL;DR: This paper presents a lightweight graph processing framework designed specifically for shared-memory parallel/multicore machines; it makes graph traversal algorithms easy to write, and its implementations are significantly more efficient than previously reported results obtained with graph frameworks on machines with many more cores.
Proceedings ArticleDOI

Multicore triangle computations without tuning

TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations; the algorithms scale to billions of nodes and edges, and the approximate counting implementation is much faster than existing parallel alternatives.
Proceedings ArticleDOI

Internally deterministic parallel algorithms can be fast

TL;DR: The main contribution is to demonstrate that, for a wide body of problems, efficient internally deterministic algorithms exist, and moreover that these algorithms are natural to reason about and not complicated to code.
Journal ArticleDOI

Can traditional programming bridge the Ninja performance gap for parallel computing applications?

TL;DR: It is demonstrated that the otherwise uncontrolled growth of the Ninja gap can be contained, offering more stable and predictable performance growth across future architectures and strong evidence that radical language changes are not required.
Proceedings ArticleDOI

Scheduling irregular parallel computations on hierarchical caches

TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
References
Proceedings ArticleDOI

DAG-consistent distributed shared memory

TL;DR: This work introduces DAG (directed acyclic graph) consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming and provides empirical evidence of the flexibility and efficiency of DAG consistency for applications that include blocked matrix multiplication, Strassen's matrix multiplication algorithm and a Barnes-Hut code.

Cache-Oblivious Algorithms and Data Structures

TL;DR: A recent body of work has developed cache-oblivious algorithms and data structures that perform as well or nearly as well as standard external-memory structures which require knowledge of the cache/memory size and block transfer size.
Proceedings ArticleDOI

Effectively sharing a cache among threads

TL;DR: This paper gives the perhaps surprising result that, for sufficiently parallel computations, the shared cache need only be an additive amount larger than the single-processor cache, together with some theoretical justification for designing machines with shared caches.
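For reference, the bound can be stated as follows (a standard formulation of this result, given here as an assumption rather than quoted from the page): if a computation has depth D and incurs Q_1(M_1) misses when run sequentially with a cache of size M_1, then a parallel depth-first schedule on p processors sharing a cache of size M_p = M_1 + p·D incurs at most Q_1(M_1) misses, so the extra space needed is only the additive term p·D.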
Book ChapterDOI

A Bridging Model for Multi-core Computing

TL;DR: It is suggested that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for a bridging model aimed at capturing the most basic resource parameters of multi-core architectures.
Proceedings ArticleDOI

Cache-efficient dynamic programming algorithms for multicores

TL;DR: This work develops a generic chip-multiprocessor (CMP) algorithm with an associated tiling sequence and provides a parallel schedule that yields a cache-efficient parallel execution up to the critical path length of the underlying dynamic programming algorithm.