Resource oblivious sorting on multicores
Richard Cole, Vijaya Ramachandran
- pp 226-237
TLDR
A new deterministic sorting algorithm is presented that interleaves the partitioning of a sample sort with merging, incurs an optimal number of cache misses, and improves on previous bounds for deterministic sample sort.
Abstract:
We present a new deterministic sorting algorithm that interleaves the partitioning of a sample sort with merging. Sequentially, it sorts n elements in O(n log n) time cache-obliviously with an optimal number of cache misses. The parallel complexity (or critical path length) of the algorithm is O(log n log log n), which improves on previous bounds for deterministic sample sort. Given a multicore computing environment with a global shared memory and p cores, each having a cache of size M organized in blocks of size B, our algorithm can be scheduled effectively on these p cores in a cache-oblivious manner.
We improve on the above cache-oblivious processor-aware parallel implementation by using the Priority Work Stealing Scheduler (PWS) that we presented recently in a companion paper [12]. The PWS scheduler is both processor- and cache-oblivious (i.e., resource oblivious), and it tolerates asynchrony among the cores. Using PWS, we obtain a resource oblivious scheduling of our sorting algorithm that matches the performance of the processor-aware version. Our analysis includes the delay incurred by false-sharing. We also establish good bounds for our algorithm with the randomized work stealing scheduler.
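For readers unfamiliar with the sample-sort skeleton that the abstract builds on, the following is a minimal sequential sketch of the general idea: pick pivots from a deterministic sample, partition the input into buckets, and recurse. It is not the authors' algorithm. It does not interleave partitioning with merging, is not cache-oblivious, and runs on a single core. The function name `sample_sort` and the parameters `bucket_count` and `cutoff` are hypothetical choices made for illustration.

```python
import bisect


def sample_sort(items, bucket_count=4, cutoff=16):
    """Illustrative deterministic sample sort (sequential sketch only).

    Assumption: this shows the generic sample-sort pattern, not the
    interleaved partition/merge scheme described in the abstract.
    """
    # Small inputs (or a degenerate bucket count) fall back to the built-in sort.
    if len(items) <= cutoff or bucket_count < 2:
        return sorted(items)

    # Deterministic sample: take every `step`-th element, then sort the sample.
    step = max(1, len(items) // (bucket_count * 4))
    sample = sorted(items[::step])

    # Choose bucket_count - 1 evenly spaced pivots from the sorted sample.
    pivots = [sample[(i * len(sample)) // bucket_count]
              for i in range(1, bucket_count)]

    # Partition: each element goes to the bucket selected by binary search.
    buckets = [[] for _ in range(bucket_count)]
    for x in items:
        buckets[bisect.bisect_right(pivots, x)].append(x)

    # If partitioning made no progress (e.g. all keys equal), stop recursing.
    if any(len(b) == len(items) for b in buckets):
        return sorted(items)

    # Recursively sort each bucket and concatenate in pivot order.
    result = []
    for bucket in buckets:
        result.extend(sample_sort(bucket, bucket_count, cutoff))
    return result
```

In a parallel setting the buckets are independent after partitioning, which is what makes the recursive calls natural units of work for a scheduler such as work stealing; the sketch above simply processes them one after another.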
Citations
Proceedings Article
Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication
James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, Omer Spillinger
TL;DR: This work obtains the first communication-optimal algorithm for all dimensions of rectangular matrices by combining the dimension-splitting technique with the recursive BFS/DFS approach, and shows significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
Proceedings Article
Scheduling irregular parallel computations on hierarchical caches
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Journal Article
Oblivious algorithms for multicores and networks of processors
TL;DR: This work introduces a multicore-oblivious (MO) approach to algorithms and schedulers for HM, and presents efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components.
Proceedings Article
Oblivious algorithms for multicores and network of processors
TL;DR: This work introduces a multicore-oblivious (MO) approach to algorithms and schedulers for HM, and presents efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components.
Proceedings Article
Cache-adaptive algorithms
Michael A. Bender, Roozbeh Ebrahimi, Jeremy T. Fineman, Golnaz Ghasemiesfeh, Rob Johnson, Samuel McCauley
TL;DR: The cache-adaptive model, which generalizes the external-memory model to apply to environments in which the amount of memory available to an algorithm can fluctuate, is introduced, and it is established that if a cache-oblivious algorithm is optimal on "square" memory profiles then, given resource augmentation it is ideal on all memory profiles.
References
Journal Article
The input/output complexity of sorting and related problems
Alok Aggarwal, Jeffrey S. Vitter
TL;DR: Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
Journal Article
Scheduling multithreaded computations by work stealing
TL;DR: This paper gives the first provably good work-stealing scheduler for multithreaded computations with dependencies, and shows that the expected time to execute a fully strict computation on P processors using this scheduler is T_1/P + O(T_∞), where T_1 is the minimum serial execution time and T_∞ is the minimum execution time with an infinite number of processors.
Journal Article
Parallel merge sort
TL;DR: A parallel implementation of merge sort on a CREW PRAM is given that uses n processors and O(log n) time; the constant in the running time is small.
Proceedings Article
Cache-oblivious algorithms
TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.
Proceedings Article
An O(n log n) sorting network
TL;DR: A sorting network of size O(n log n) and depth O(log n) is described; the network is centered around a few elementary steps and a derived procedure (ε-nearsort).