Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TLDR
This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.

Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
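To make the approach concrete, here is a minimal sketch (illustrative only, not code from the paper) of a nested-parallel, divide-and-conquer reduction: its depth is O(log n), and its natural sequential evaluation order scans the input left to right, giving a sequential cache-oblivious cost of O(n/B) misses for block size B. The grain-size cutoff kGrain and the use of std::async as a stand-in for a work-stealing runtime such as Cilk are assumptions of the sketch.

    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Illustrative nested-parallel reduction: O(log n) depth, and the
    // natural sequential order touches the array once, left to right,
    // so the sequential cache-oblivious cost is O(n/B) misses.
    long long reduce(const std::vector<long long>& a,
                     std::size_t lo, std::size_t hi) {
        const std::size_t kGrain = 4096;  // recursion cutoff; not tuned to any cache size
        if (hi - lo <= kGrain)
            return std::accumulate(a.begin() + lo, a.begin() + hi, 0LL);
        std::size_t mid = lo + (hi - lo) / 2;
        // Fork the two halves; a work-stealing scheduler would spawn here.
        auto left = std::async(std::launch::async, reduce, std::cref(a), lo, mid);
        long long right = reduce(a, mid, hi);
        return left.get() + right;
    }

    int main() {
        std::vector<long long> a(1 << 20, 1);
        std::cout << reduce(a, 0, a.size()) << "\n";  // prints 1048576
    }

Because the recursion never consults cache parameters, the same code is cache-oblivious at every level of a cache hierarchy; the low depth is what the paper's mappings then exploit to bound parallel cache complexity.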



Citations
Proceedings ArticleDOI

Ligra: a lightweight graph processing framework for shared memory

TL;DR: This paper presents a lightweight graph processing framework designed specifically for shared-memory parallel/multicore machines; it makes graph traversal algorithms easy to write, and its implementations are significantly more efficient than previously reported results obtained with graph frameworks on machines with many more cores.
Proceedings ArticleDOI

Multicore triangle computations without tuning

TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations; the algorithms scale to billions of nodes and edges, and the approximate counting implementation is much faster than existing parallel alternatives.
Proceedings ArticleDOI

Internally deterministic parallel algorithms can be fast

TL;DR: The main contribution is to demonstrate that, for a wide body of problems, efficient internally deterministic algorithms exist, and moreover that these algorithms are natural to reason about and not complicated to code.
Journal ArticleDOI

Can traditional programming bridge the Ninja performance gap for parallel computing applications?

TL;DR: It is demonstrated that the otherwise uncontrolled growth of the Ninja gap can be contained, offering more stable and predictable performance growth across future architectures and strong evidence that radical language changes are not required.
Proceedings ArticleDOI

Scheduling irregular parallel computations on hierarchical caches

TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
References
Proceedings ArticleDOI

DAG-consistent distributed shared memory

TL;DR: This work introduces DAG (directed acyclic graph) consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming and provides empirical evidence of the flexibility and efficiency of DAG consistency for applications that include blocked matrix multiplication, Strassen's matrix multiplication algorithm and a Barnes-Hut code.

Cache-Oblivious Algorithms and Data Structures

TL;DR: A recent body of work has developed cache-oblivious algorithms and data structures that perform as well or nearly as well as standard external-memory structures which require knowledge of the cache/memory size and block transfer size.
Proceedings ArticleDOI

Effectively sharing a cache among threads

TL;DR: This paper gives the perhaps surprising result that, for sufficiently parallel computations, the shared cache need only be an additive amount larger than the single-processor cache, together with some theoretical justification for designing machines with shared caches.
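For reference, the bound can be stated as follows (a standard formulation of this result, given here as an assumption rather than quoted from the page): if a computation has depth D and incurs Q_1(M_1) misses when run sequentially with a cache of size M_1, then a parallel depth-first schedule on p processors sharing a cache of size M_p = M_1 + p·D incurs at most Q_1(M_1) misses, so the extra space needed is only the additive term p·D.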
Book ChapterDOI

A Bridging Model for Multi-core Computing

TL;DR: It is suggested that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for a bridging model aimed at capturing the most basic resource parameters of multi-core architectures.
Proceedings ArticleDOI

Cache-efficient dynamic programming algorithms for multicores

TL;DR: This work develops a generic chip-multiprocessor (CMP) algorithm with an associated tiling sequence and provides a parallel schedule that yields a cache-efficient parallel execution up to the critical path length of the underlying dynamic programming algorithm.