Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TLDR
This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
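As a minimal illustration of the nested-parallel, low-depth style the abstract describes (a sketch, not an algorithm from the paper; the function name `co_reduce` and the base-case threshold are assumptions for exposition), consider a divide-and-conquer reduction: its natural sequential evaluation order scans each contiguous half in turn, which is cache-oblivious, while the two recursive calls are independent, so a fork-join runtime could execute them in parallel for O(log n) depth.

```python
def co_reduce(a, lo, hi):
    """Sum a[lo:hi] by recursive halving.

    Sequential order visits each half contiguously (cache-oblivious);
    the two recursive calls are independent, so a nested-parallel
    scheduler could fork them, giving polylogarithmic depth.
    """
    if hi - lo <= 1024:           # base case: one contiguous scan
        return sum(a[lo:hi])
    mid = (lo + hi) // 2
    # In a fork-join runtime these two calls would run in parallel.
    left = co_reduce(a, lo, mid)
    right = co_reduce(a, mid, hi)
    return left + right
```

The same recursive structure underlies the general approach: depth is the critical path of the recursion tree, while the cache cost is analyzed on the sequential execution order alone.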



Citations
Proceedings ArticleDOI

Ligra: a lightweight graph processing framework for shared memory

TL;DR: This paper presents a lightweight graph processing framework designed specifically for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Proceedings ArticleDOI

Multicore triangle computations without tuning

TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges and are much faster than existing parallel approximate triangle counting implementations.
Proceedings ArticleDOI

Internally deterministic parallel algorithms can be fast

TL;DR: The main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code.
Journal ArticleDOI

Can traditional programming bridge the Ninja performance gap for parallel computing applications?

TL;DR: It is demonstrated that the otherwise uncontrolled growth of the Ninja gap can be contained, offering more stable and predictable performance growth on future architectures and strong evidence that radical language changes are not required.
Proceedings ArticleDOI

Scheduling irregular parallel computations on hierarchical caches

TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
References
Proceedings ArticleDOI

Thread scheduling for multiprogrammed multiprocessors

TL;DR: A user-level thread scheduler for shared-memory multiprocessors, which achieves linear speedup whenever P is small relative to the parallelism T1/T∞.
Proceedings ArticleDOI

External-memory graph algorithms

TL;DR: A collection of new techniques for designing and analyzing external-memory algorithms for graph problems is presented, illustrating how these techniques can be applied to a wide variety of specific problems.
Proceedings ArticleDOI

A model for hierarchical memory

TL;DR: An algorithm that uses LRU policy at the successive “levels” of the memory hierarchy is shown to be optimal for arbitrary memory access time.
Proceedings ArticleDOI

The data locality of work stealing

TL;DR: Initial experiments on iterative data-parallel applications show that the work-stealing scheduling algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% under multiprogrammed workloads; the paper also presents a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor.