Proceedings ArticleDOI
Low depth cache-oblivious algorithms
Guy E. Blelloch, Phillip B. Gibbons, Harsha Vardhan Simhadri
pp. 189–199
TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
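The work/depth accounting behind the abstract's claims can be illustrated with a small sketch (a hypothetical example, not code from the paper). In a nested-parallel divide-and-conquer reduction, the two recursive calls are independent, so work adds while depth (span) takes the max; the natural sequential order visits the left subtree entirely before the right, which is the order whose cache complexity the cache-oblivious model analyzes:

```python
def reduce_dc(a):
    """Divide-and-conquer sum; returns (result, work, span).

    The two recursive calls are independent, so in a nested-parallel
    execution work adds while span takes the max: work is O(n),
    span is O(log n)."""
    if len(a) == 1:
        return a[0], 1, 1
    mid = len(a) // 2
    ls, lw, lsp = reduce_dc(a[:mid])   # left half
    rs, rw, rsp = reduce_dc(a[mid:])   # right half (parallel with left)
    return ls + rs, lw + rw + 1, max(lsp, rsp) + 1
```

For n = 8 this gives work 15 (2n − 1 nodes) and span 4 (log₂ n + 1 levels), the "low depth" shape the paper exploits.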
Citations
Proceedings ArticleDOI
CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures
Quan Chen, Minyi Guo, Zhiyi Huang
TL;DR: A Cache Aware Task-Stealing (CATS) scheduler, which uses the shared cache efficiently with an online profiling method and schedules tasks with shared data to the same socket, and adopts an online DAG partitioner based on the profiling information to ensure tasks with shared data can efficiently utilize the shared cache.
Proceedings ArticleDOI
Parallel triangle counting in massive streaming graphs
TL;DR: In this article, the authors present a fast parallel algorithm for estimating the number of triangles in a massive undirected graph whose edges arrive as a stream, designed for shared-memory multicore machines and making efficient use of parallelism and the memory hierarchy.
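The quantity that streaming algorithm estimates can be pinned down with a short exact sketch (an illustration only; the paper's contribution is a parallel streaming *estimator*, not this sequential count). For each edge (u, v), the common neighbors of u and v are exactly the triangles through that edge, and summing over all edges counts each triangle three times:

```python
def count_triangles(edges):
    """Exact triangle count of an undirected simple graph given as a
    list of unique edges. Shown only to define the quantity the
    streaming algorithm estimates."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # |N(u) & N(v)| = triangles through edge (u, v); each triangle
    # is counted once per each of its three edges.
    total = sum(len(adj[u] & adj[v]) for u, v in edges)
    return total // 3
```

On the complete graph K4 (six edges) this returns 4, one triangle per 3-subset of vertices.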
Proceedings ArticleDOI
A Top-Down Parallel Semisort
TL;DR: This work implements the parallel integer sorting algorithm of Rajasekaran and Reif, but instead of processing bits of integers in a reduced range in a bottom-up fashion, it processes the hashed values of keys directly top-down.
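The semisort guarantee itself is weaker than sorting: equal keys must end up contiguous, but the groups may appear in any order. A minimal sequential sketch of that contract (an illustration, not the paper's parallel top-down algorithm, which distributes hashed keys into buckets in parallel):

```python
from collections import defaultdict

def semisort(keys):
    """Arrange keys so that equal keys are contiguous, with no
    guarantee on the relative order of distinct keys -- the semisort
    contract. A real implementation hashes keys and scatters them
    into buckets in parallel; this sketch just groups by key."""
    groups = defaultdict(list)
    for k in keys:
        groups[k].append(k)
    out = []
    for group in groups.values():
        out.extend(group)
    return out
```

Because no total order is imposed, semisort can be done with hashing alone, which is why it is often cheaper than a full comparison sort.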
Proceedings ArticleDOI
Parallel and I/O efficient set covering algorithms
TL;DR: This algorithm is the first efficient external-memory or cache-oblivious set covering algorithm for the case when neither the sets nor the elements fit in memory, leading to I/O cost equivalent to that of sorting in the cache-oblivious or parallel cache-oblivious models.
Proceedings ArticleDOI
Sorting with Asymmetric Read and Write Costs
TL;DR: This paper considers the PRAM model with asymmetric write cost, and presents write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication, which yield provably good bounds for parallel machines with private caches or with a shared cache.
References
Journal ArticleDOI
A bridging model for parallel computation
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate bridging model, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
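The BSP cost model admits a one-line sketch (parameter names follow common BSP usage and are an assumption, not taken from this page): a superstep costs the maximum local computation, plus an h-relation of communication charged at the gap g per word, plus the barrier latency l.

```python
def bsp_superstep_cost(w, h, g, l):
    """Cost of one BSP superstep:
    w = max local computation on any processor,
    h = max words sent or received by any processor (an h-relation),
    g = communication gap (time per word),
    l = barrier synchronization latency."""
    return w + h * g + l
```

A program's total cost is the sum of this expression over its supersteps, which is what makes BSP usable as a bridging model between algorithms and hardware.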
Journal ArticleDOI
Amortized efficiency of list update and paging rules
TL;DR: This article shows that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules, and analyzes the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule by a factor that depends on the size of fast memory.
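The move-to-front rule analyzed there is simple enough to sketch directly (a minimal illustration of the rule, not the paper's amortized analysis): accessing the item at 1-based position i costs i, after which the item moves to the front, so recently used items get cheap.

```python
def mtf_cost(requests, initial):
    """Total cost of serving a request sequence on a self-organizing
    list with move-to-front. Accessing position i (1-based) costs i;
    the accessed item is then moved to the front. Sleator and Tarjan
    show this is within a constant factor of the optimal offline rule."""
    lst = list(initial)
    total = 0
    for r in requests:
        i = lst.index(r)
        total += i + 1        # pay the access cost
        lst.pop(i)
        lst.insert(0, r)      # move the item to the front
    return total
```

For example, serving ['a', 'a', 'b'] on the initial list ['b', 'a'] costs 2 + 1 + 2 = 5.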
Journal ArticleDOI
Cilk: An Efficient Multithreaded Runtime System
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou
TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
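The work/critical-path performance model mentioned there reduces to the classic greedy-scheduler bound, sketched below (a standard textbook formula, not code from the Cilk paper): with work T₁ and span T∞, a greedy scheduler on P processors runs in at most T₁/P + T∞ time, and Cilk's work-stealing scheduler matches this to within constant factors in expectation.

```python
def greedy_schedule_bound(work, span, p):
    """Brent-style upper bound on parallel running time:
    T_P <= T_1 / P + T_inf, where work = T_1 (total operations)
    and span = T_inf (critical-path length)."""
    return work / p + span
```

For instance, a computation with work 1000 and span 10 runs in at most 110 time units on 10 processors, so span, not processor count, limits speedup once P is large.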
Book
An introduction to parallel algorithms
TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with emphasis on applying the PRAM model of parallel computation, in all its variants, to algorithm analysis.
Proceedings ArticleDOI
LogP: towards a realistic model of parallel computation
David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice E. Santos, Ramesh Subramonian, Thorsten von Eicken
TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
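The LogP cost accounting can be sketched in a few lines (parameter usage follows the standard LogP convention; the gap assumption g ≥ o is mine): a single message pays sender overhead o, network latency L, and receiver overhead o, while a pipelined burst is rate-limited by the gap g between injections.

```python
def logp_message_time(L, o):
    """Time for one point-to-point message in LogP:
    sender overhead o + network latency L + receiver overhead o."""
    return o + L + o

def logp_pipelined(n, L, o, g):
    """Time for n back-to-back messages from one sender to one
    receiver, assuming the gap dominates the overhead (g >= o):
    successive injections are spaced g apart, and the last message
    still pays o + L + o end to end."""
    return (n - 1) * g + o + L + o
```

With L = 5, o = 1, g = 2, one message takes 7 time units and a burst of four takes 13, showing how the gap parameter, not latency, governs sustained throughput.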