Proceedings ArticleDOI

Low depth cache-oblivious algorithms

TLDR
This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
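
To make the design approach in the abstract concrete, here is a minimal sketch of a divide-and-conquer matrix transpose, a standard textbook example of a cache-oblivious algorithm. It is not taken from the paper, and the names `transpose`, `transpose_matrix`, and `base` are assumptions made for illustration. The two recursive calls at each level touch disjoint parts of the input and output, so a nested-parallel runtime could fork them, giving logarithmic depth; running them one after the other, as written here, is the natural sequential evaluation order whose cache complexity is analyzed in the cache-oblivious model.

```python
def transpose(src, dst, r0, r1, c0, c1, base=8):
    """Write the transpose of the block src[r0:r1][c0:c1] into dst.

    `base` only bounds recursion overhead; it is a hypothetical constant
    and is not tuned to any cache size or line size.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: transpose it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
        return
    if rows >= cols:
        # Split the longer dimension in half. The two calls are
        # independent and could be forked in parallel.
        mid = r0 + rows // 2
        transpose(src, dst, r0, mid, c0, c1, base)
        transpose(src, dst, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose(src, dst, r0, r1, c0, mid, base)
        transpose(src, dst, r0, r1, mid, c1, base)


def transpose_matrix(src):
    """Return a new list-of-lists holding the transpose of src."""
    n_rows = len(src)
    n_cols = len(src[0]) if src else 0
    dst = [[None] * n_rows for _ in range(n_cols)]
    if n_rows and n_cols:
        transpose(src, dst, 0, n_rows, 0, n_cols)
    return dst
```

Because the recursion keeps halving the larger dimension, every subproblem eventually fits in cache regardless of the cache parameters, which is what makes the code oblivious to them. Python lists are used here only for readability; a real implementation would operate on a contiguous array.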


Citations
Journal ArticleDOI

A Class of In-Place Linear Transformations Possessing the Cache-Oblivious Property

TL;DR: A family of in-place linear transformations based on block lower/upper (LU) decompositions is presented, of which the known transformation via LU decomposition is a special case, and it is shown that the proposed family includes a class of transformations that possesses the cache-oblivious property.
Posted Content

Low-Depth Parallel Algorithms for the Binary-Forking Model without Atomics.

TL;DR: This paper designs efficient parallel algorithms in the binary-forking model without atomics for three fundamental problems: Strassen's matrix multiplication (MM), comparison-based sorting, and the Fast Fourier Transform (FFT).

AutoMatch: Automated Matching of Compute Kernels to Heterogeneous HPC Architectures

TL;DR: The empirical evaluation shows that AutoMatch is highly accurate across five different heterogeneous architectures, identifying the best architecture for each workload in 96% of the test cases, and its workload distribution scheme has a comparable performance to a profiling-driven oracle.
DissertationDOI

Cache Based Optimization of Stencil Computations an Algorithmic Approach

TL;DR: This thesis presents comprehensive cache aware and cache oblivious algorithms to optimize stencil computations on structured rectangular 2D and 3D grids and tailor their frameworks to meet the new performance challenge on these architectures.
Proceedings ArticleDOI

Survey: Computational Models for Asymmetric Read and Write Costs

TL;DR: This survey reviews existing computational models that measure the cost of operations and memory accesses on NVMs, as well as schedulers for parallel algorithms on the new hardware, and lists some existing results on lower and upper bounds for common problems such as sorting, searching, and graph traversal under these models.
References
Journal ArticleDOI

A bridging model for parallel computation

TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, and results are given quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
Journal ArticleDOI

Amortized efficiency of list update and paging rules

TL;DR: This article shows that move-to-front is within a constant factor of optimum among a wide class of list maintenance rules, and analyzes the amortized complexity of LRU, showing that its efficiency differs from that of the off-line paging rule by a factor that depends on the size of fast memory.
Journal ArticleDOI

Cilk: An Efficient Multithreaded Runtime System

TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
Book

An introduction to parallel algorithms

TL;DR: This book provides an introduction to the design and analysis of parallel algorithms, with the emphasis on the application of the PRAM model of parallel computation, with all its variants, to algorithm analysis.
Proceedings ArticleDOI

LogP: towards a realistic model of parallel computation

TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.