Fast, Scalable Parallel Comparison Sort on Hybrid Multicore Architectures

doi:10.1109/IPDPSW.2013.129

Proceedings Article

Dip Sankar Banerjee, +2 more

- pp 1060-1069

TLDR

This work presents a hybrid comparison-based sorting algorithm that uses a many-core GPU and a multi-core CPU together, and shows that its performance gains can also be obtained on other hybrid CPU+GPU platforms.

Abstract:

Sorting has been a topic of immense research value since the inception of Computer Science. Hybrid computing on multicore architectures involves computing simultaneously on a tightly coupled heterogeneous collection of devices. In this work, we consider a multi-core CPU along with a many-core GPU as our experimental hybrid platform, and we present a hybrid comparison-based sorting algorithm that uses both devices to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters and then creating independent sublists. Sorting the independent sublists results in sorting the entire original list. On a CPU+GPU platform consisting of an Intel i7 980 and an Nvidia GTX 580, our algorithm achieves a 20% gain over the current best known comparison sort result, published by Davidson et al. [InPar 2012]. On the same experimental platform, our results are better by 40% on average than a similar GPU-alone algorithm proposed by Leischner et al. [IPDPS 2010]. Our results also show that our algorithm and its implementation scale with the size of the input. We also show that such performance gains can be obtained on other hybrid CPU+GPU platforms.
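
The splitter-based scheme summarized in the abstract can be illustrated with a small, sequential sketch. This is not the authors' implementation: the function name `sample_sort`, the parameters `num_buckets`, `oversample`, and `seed`, and the random-sampling strategy for choosing splitters are all assumptions made for illustration. It only shows why sorting the independent sublists yields a fully sorted list.

```python
import bisect
import random


def sample_sort(data, num_buckets=4, oversample=8, seed=0):
    """Sequential sketch of splitter-based sorting (hypothetical helper)."""
    if len(data) <= num_buckets:
        return sorted(data)

    rng = random.Random(seed)

    # Assumption: splitters are taken at even intervals from a sorted random sample.
    sample_size = min(len(data), num_buckets * oversample)
    sample = sorted(rng.sample(data, sample_size))
    step = len(sample) / num_buckets
    splitters = [sample[int(i * step)] for i in range(1, num_buckets)]

    # Route each element to the sublist bounded by its neighbouring splitters.
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        buckets[bisect.bisect_right(splitters, x)].append(x)

    # Every element of bucket j is <= every element of bucket j + 1, so the
    # sublists are independent. In a hybrid setting each one could be handed
    # to a different device; here they are simply sorted one after another.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

For example, `sample_sort([5, 3, 9, 1, 7, 2, 8], num_buckets=2)` returns `[1, 2, 3, 5, 7, 8, 9]`. Because the bucket index is monotone in the key, concatenating the sorted buckets needs no final merge step, which is what makes the sublists suitable for independent processing.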

Citations

Proceedings Article

Can GPUs sort strings efficiently

Aditya Deshpande, +1 more

TL;DR: This paper presents a fast and efficient string sort on the GPU that is built on the available radix sort, and achieves a speedup of up to 10× over current GPU methods, especially on large datasets.

Journal Article

Kepler GPU accelerated recursive sorting using dynamic parallelism

B. Neelima, +4 more

- 25 Feb 2017 -

Concurrency and Computation: Practice an...

TL;DR: This paper focuses on the performance gain obtained on the Kepler graphics processing units (GPUs) for multi‐key quicksort and the GPU implementation of string sorting algorithm using singleton elements in the literature.

Proceedings Article

String sorting on multi and many-threaded architectures: A comparative study

B. Neelima, +2 more

TL;DR: A comparative study of the most popular and efficient string sorting algorithms implemented on CPU and GPU machines is presented, along with an efficient parallel multi-key quicksort implementation that uses a ternary search tree to increase the speedup and efficiency of sorting large sets of string data.

Proceedings Article

Architecture- and workload- aware heterogeneous algorithms for sparse matrix vector multiplication

Sivaramakrishna Bharadwaj Indarapu, +2 more

TL;DR: This paper considers a class of sparse matrices that exhibit a scale-free nature, identifies a scheme that works well for such matrices, and uses simple and effective mechanisms to determine the appropriate amount of work to be allotted to the CPU and the GPU.

Proceedings Article

Applications of Ear Decomposition to Efficient Heterogeneous Algorithms for Shortest Path/Cycle Problems

Debarshi Dutta, +3 more

TL;DR: The applicability of an ear decomposition of graphs to problems such as all-pairs shortest paths and minimum cost cycle basis is studied, and it is shown that the resulting solutions are scalable in terms of both memory usage and speedup over the best known current implementations.

References

Book Chapter

Fast in-place sorting with CUDA based on bitonic sort

Hagen Peters, +2 more

TL;DR: A high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs is presented. The authors adapt bitonic sort to arbitrary input lengths and assign compare/exchange operations to threads in a way that reduces low-performance global-memory accesses, which greatly increases the performance of the implementation.

Proceedings Article

An improved supercomputer sorting benchmark

Thearling, +1 more

TL;DR: The authors investigate the use of entropy as a measure of data distribution and propose that it, along with larger datasets, be added to existing sorting benchmarks (such as NAS). They were able to sort 1 billion 32-bit keys in less than 17 s on a 1024-processor CM-5.

Proceedings Article

Fast and scalable list ranking on the GPU

M. Suhail Rehman, +2 more

TL;DR: This paper describes two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on massively multi-threaded hardware, and presents a GPU-optimized, Recursive Helman-JaJa (RHJ) algorithm.

Journal Article

Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort

Hagen Peters, +2 more

- 01 May 2011 -

Concurrency and Computation: Practice an...

TL;DR: This work assigned compare/exchange operations to threads in a way that decreases low‐performance global‐memory access and makes efficient use of high‐performance shared memory, which greatly increases the performance of this in‐place, comparison‐based sorting algorithm.

Journal Article

Optimization of linked list prefix computations on multithreaded GPUs using CUDA

Zheng Wei, +1 more

- 27 Dec 2012 -

Parallel Processing Letters

TL;DR: An optimized multithreaded GPU algorithm for prefix computations is introduced; it uses a randomization process to reduce the problem to a large number of fine-grain computations, and the processing cost per element is shown to be close to the best possible.
