Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TLDR
This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
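As a minimal illustration of the nested-parallel, low-depth style the abstract describes (a sketch, not an algorithm from the paper; the function name `co_reduce` and the base-case threshold are assumptions for exposition), consider a divide-and-conquer reduction: its natural sequential evaluation order scans each contiguous half in turn, which is cache-oblivious, while the two recursive calls are independent, so a fork-join runtime could execute them in parallel for O(log n) depth.

```python
def co_reduce(a, lo, hi):
    """Sum a[lo:hi] by recursive halving.

    Sequential order visits each half contiguously (cache-oblivious);
    the two recursive calls are independent, so a nested-parallel
    scheduler could fork them, giving polylogarithmic depth.
    """
    if hi - lo <= 1024:           # base case: one contiguous scan
        return sum(a[lo:hi])
    mid = (lo + hi) // 2
    # In a fork-join runtime these two calls would run in parallel.
    left = co_reduce(a, lo, mid)
    right = co_reduce(a, mid, hi)
    return left + right
```

The same recursive structure underlies the general approach: depth is the critical path of the recursion tree, while the cache cost is analyzed on the sequential execution order alone.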



Citations
Proceedings ArticleDOI

Ligra: a lightweight graph processing framework for shared memory

TL;DR: This paper presents a lightweight graph processing framework designed specifically for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Proceedings ArticleDOI

Multicore triangle computations without tuning

TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges and are much faster than existing parallel approximate triangle counting implementations.
Proceedings ArticleDOI

Internally deterministic parallel algorithms can be fast

TL;DR: The main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code.
Journal ArticleDOI

Can traditional programming bridge the Ninja performance gap for parallel computing applications?

TL;DR: It is demonstrated that the otherwise uncontrolled growth of the Ninja gap can be contained, offering more stable and predictable performance growth on future architectures and strong evidence that radical language changes are not required.
Proceedings ArticleDOI

Scheduling irregular parallel computations on hierarchical caches

TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
References
Proceedings ArticleDOI

Thread scheduling for multiprogrammed multiprocessors

TL;DR: A user-level thread scheduler for shared-memory multiprocessors, which achieves linear speedup whenever P is small relative to the parallelism T1/T∞.
Proceedings ArticleDOI

External-memory graph algorithms

TL;DR: A collection of new techniques for designing and analyzing external-memory algorithms for graph problems is presented, illustrating how these techniques can be applied to a wide variety of specific problems.
Proceedings ArticleDOI

A model for hierarchical memory

TL;DR: An algorithm that uses LRU policy at the successive “levels” of the memory hierarchy is shown to be optimal for arbitrary memory access time.
Proceedings ArticleDOI

The data locality of work stealing

TL;DR: Initial experiments on iterative data-parallel applications show that the work-stealing scheduling algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% under multiprogrammed workloads; the paper also presents a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor.