Resource oblivious sorting on multicores

doi:10.1007/978-3-642-14165-2_20

- pp 226-237

TLDR

A new deterministic sorting algorithm is presented that interleaves the partitioning of a sample sort with merging. It incurs an optimal number of cache misses, and its critical path length improves on previous bounds for deterministic sample sort.

Abstract:

We present a new deterministic sorting algorithm that interleaves the partitioning of a sample sort with merging. Sequentially, it sorts n elements in O(n log n) time cache-obliviously with an optimal number of cache misses. The parallel complexity (or critical path length) of the algorithm is O(log n log log n), which improves on previous bounds for deterministic sample sort. Given a multicore computing environment with a global shared memory and p cores, each having a cache of size M organized in blocks of size B, our algorithm can be scheduled effectively on these p cores in a cache-oblivious manner. We improve on the above cache-oblivious processor-aware parallel implementation by using the Priority Work Stealing Scheduler (PWS) that we presented recently in a companion paper [12]. The PWS scheduler is both processor- and cache-oblivious (i.e., resource oblivious), and it tolerates asynchrony among the cores. Using PWS, we obtain a resource oblivious scheduling of our sorting algorithm that matches the performance of the processor-aware version. Our analysis includes the delay incurred by false-sharing. We also establish good bounds for our algorithm with the randomized work stealing scheduler.
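The partitioning idea behind sample sort can be illustrated with a short sequential sketch. This is not the algorithm of the chapter: it does not interleave partitioning with merging, it is neither parallel nor cache-oblivious, and its position-based sampling gives no balance guarantee. The function name `sample_sort` and the parameter `base_case` are hypothetical, chosen only for this illustration.

```python
from bisect import bisect_left
from math import isqrt


def sample_sort(items, base_case=32):
    """Return a sorted copy of items using a simple deterministic sample sort.

    Illustrative sketch only; not the algorithm described in the abstract.
    """
    n = len(items)
    if n <= base_case:
        return sorted(items)

    # Deterministic sample: every step-th element, roughly sqrt(n) of them.
    step = max(1, isqrt(n))
    pivots = sorted(set(items[step - 1::step]))
    k = len(pivots)

    # buckets[i] holds elements strictly between pivots[i-1] and pivots[i];
    # equal[i] holds elements equal to pivots[i], which need no further sorting.
    buckets = [[] for _ in range(k + 1)]
    equal = [[] for _ in range(k)]
    for x in items:
        i = bisect_left(pivots, x)
        if i < k and pivots[i] == x:
            equal[i].append(x)
        else:
            buckets[i].append(x)

    # Each bucket is strictly smaller than the input because the sampled
    # elements always land in an equal[] list, so the recursion terminates.
    out = []
    for i in range(k):
        out.extend(sample_sort(buckets[i], base_case))
        out.extend(equal[i])
    out.extend(sample_sort(buckets[k], base_case))
    return out
```

For any list of mutually comparable values, `sample_sort(values)` should return the same list as `sorted(values)`. In a parallel setting the recursive calls on the buckets are independent of one another, which is the source of parallelism that a scheduler would exploit.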

Citations

Proceedings ArticleDOI

Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication

James Demmel, +6 more

TL;DR: This work obtains the first communication-optimal algorithm for all dimensions of rectangular matrices by combining the dimension-splitting technique with the recursive BFS/DFS approach, and shows significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.

Proceedings ArticleDOI

Scheduling irregular parallel computations on hierarchical caches

Guy E. Blelloch, +3 more

TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.

Journal ArticleDOI

Oblivious algorithms for multicores and networks of processors

Rezaul Chowdhury, +3 more

- 01 Jul 2013 -

Journal of Parallel and Distributed Comp...

TL;DR: This work introduces a multicore-oblivious (MO) approach to algorithms and schedulers for HM, a multicore model with hierarchical multi-level caching, and presents efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components.

Proceedings ArticleDOI

Oblivious algorithms for multicores and network of processors

Rezaul Chowdhury, +3 more

TL;DR: This work introduces a multicore-oblivious (MO) approach to algorithms and schedulers for HM, a multicore model with hierarchical multi-level caching, and presents efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components.

Proceedings ArticleDOI

Cache-adaptive algorithms

Michael A. Bender, +5 more

TL;DR: This work introduces the cache-adaptive model, which generalizes the external-memory model to environments in which the amount of memory available to an algorithm can fluctuate. It establishes that if a cache-oblivious algorithm is optimal on "square" memory profiles then, given resource augmentation, it is optimal on all memory profiles.

References

Journal ArticleDOI

The input/output complexity of sorting and related problems

Alok Aggarwal, +1 more

- 01 Sep 1988 -

Communications of the ACM

TL;DR: Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.

Journal ArticleDOI

Scheduling multithreaded computations by work stealing

Robert D. Blumofe, +1 more

- 01 Sep 1999 -

Journal of the ACM

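TL;DR: This paper gives the first provably good work-stealing scheduler for multithreaded computations with dependencies, and shows that the expected time to execute a fully strict computation on P processors using this scheduler is T_1/P + O(T_∞), where T_1 is the minimum serial execution time and T_∞ is the minimum execution time with an infinite number of processors.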

Journal ArticleDOI

Parallel merge sort

Richard Cole

- 01 Aug 1988 -

SIAM Journal on Computing

TL;DR: This paper gives a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time; the constant in the running time is small.

Proceedings ArticleDOI

Cache-oblivious algorithms

Matteo Frigo, +3 more

TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.

Proceedings ArticleDOI

An O(n log n) sorting network

Miklós Ajtai, +2 more

TL;DR: A sorting network of size O(n log n) and depth O(log n) is described. The network is centered around a few elementary steps, among them a derived procedure called ε-nearsort.
