Proceedings ArticleDOI
Fast, Scalable Parallel Comparison Sort on Hybrid Multicore Architectures
Dip Sankar Banerjee,Parikshit Sakurikar,Kishore Kothapalli +2 more
- pp 1060-1069
TLDR
This work presents a hybrid comparison-based sorting algorithm which utilizes a many-core GPU and a multi-core CPU to perform sorting and shows that such performance gains can be obtained on other hybrid CPU+GPU platforms.
Abstract:
Sorting has been a topic of immense research value since the inception of computer science. Hybrid computing on multicore architectures involves computing simultaneously on a tightly coupled heterogeneous collection of devices. In this work, we consider a multicore CPU along with a many-core GPU as our experimental hybrid platform and present a hybrid comparison-based sorting algorithm that uses both devices to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters and then creating independent sublists. Sorting the independent sublists results in sorting the entire original list. On a CPU+GPU platform consisting of an Intel i7 980 and an Nvidia GTX 580, our algorithm achieves a 20% gain over the current best known comparison sort result, published by Davidson et al. [InPar 2012]. On the same platform, our results are better by 40% on average than a similar GPU-only algorithm proposed by Leischner et al. [IPDPS 2010]. Our results also show that our algorithm and its implementation scale with the size of the input, and that such performance gains can be obtained on other hybrid CPU+GPU platforms.
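The splitter-based idea described in the abstract can be illustrated with a minimal, sequential sketch. This is not the authors' implementation: the function name `splitter_sort`, the parameters `num_buckets` and `oversample`, and the random-sampling rule for choosing splitters are all assumptions made for illustration, and the CPU/GPU work division is only indicated in a comment.

```python
import bisect
import random


def splitter_sort(data, num_buckets=8, oversample=4, seed=0):
    """Sort a list by partitioning it around sampled splitters.

    Hypothetical sketch: names and the sampling rule are assumptions,
    not taken from the paper.
    """
    if len(data) <= num_buckets:
        return sorted(data)

    rng = random.Random(seed)
    # Draw a random sample and take evenly spaced elements as splitters.
    sample_size = min(len(data), num_buckets * oversample)
    sample = sorted(rng.sample(data, sample_size))
    step = len(sample) // num_buckets
    splitters = [sample[i * step] for i in range(1, num_buckets)]

    # Route each element to the bucket delimited by neighbouring splitters.
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        buckets[bisect.bisect_right(splitters, x)].append(x)

    # The buckets are independent, so in a hybrid setting each one could
    # be handed to a different device (CPU or GPU) and sorted concurrently.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

Because every element in one bucket is no larger than any element in the next, concatenating the individually sorted buckets yields the fully sorted list, which is the property the abstract relies on.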
Citations
Proceedings ArticleDOI
Can GPUs sort strings efficiently
Aditya Deshpande,P. J. Narayanan +1 more
TL;DR: This paper presents a fast and efficient string sort on the GPU that is built on the available radix sort and achieves a speedup of up to 10x over current GPU methods, especially on large datasets.
Journal ArticleDOI
Kepler GPU accelerated recursive sorting using dynamic parallelism
TL;DR: This paper focuses on the performance gain obtained on the Kepler graphics processing units (GPUs) for multi‐key quicksort and the GPU implementation of string sorting algorithm using singleton elements in the literature.
Proceedings ArticleDOI
String sorting on multi and many-threaded architectures: A comparative study
TL;DR: A comparative study of the most popular and efficient string sorting algorithms implemented on CPU and GPU machines is presented, along with an efficient parallel multi-key quicksort implementation that uses a ternary search tree to increase the speedup and efficiency of sorting large sets of string data.
Proceedings ArticleDOI
Architecture- and workload- aware heterogeneous algorithms for sparse matrix vector multiplication
TL;DR: This paper considers a class of sparse matrices that exhibit a scale-free nature and identifies a scheme that works well for such matrices and uses simple and effective mechanisms to determine the appropriate amount of work to be allotted to the CPU and the GPU.
Proceedings ArticleDOI
Applications of Ear Decomposition to Efficient Heterogeneous Algorithms for Shortest Path/Cycle Problems
TL;DR: The applicability of an ear decomposition of graphs to problems such as all-pairs shortest paths and minimum cost cycle basis is studied, and it is shown that the resulting solutions are scalable in terms of both memory usage and speedup over the best known current implementations.
References
Book ChapterDOI
Fast in-place sorting with CUDA based on bitonic sort
TL;DR: A high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs is presented; the authors adapted bitonic sort for arbitrary input length and assigned compare/exchange operations to threads in a way that decreases low-performance global-memory access, thereby greatly increasing the performance of the implementation.
Proceedings ArticleDOI
An improved supercomputer sorting benchmark
Thearling,Smith +1 more
TL;DR: The authors investigate the use of entropy as a measure of data distribution and propose that it, along with larger datasets, be added to existing sorting benchmarks (such as NAS); they were able to sort 1 billion 32-bit keys in less than 17 s on a 1024-processor CM-5.
Proceedings ArticleDOI
Fast and scalable list ranking on the GPU
TL;DR: This paper describes two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on massively multi-threaded hardware, and presents a GPU-optimized, Recursive Helman-JaJa (RHJ) algorithm.
Journal ArticleDOI
Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort
TL;DR: This work assigned compare/exchange operations to threads in a way that decreases low‐performance global‐memory access and makes efficient use of high‐performance shared memory, which greatly increases the performance of this in‐place, comparison‐based sorting algorithm.
Journal ArticleDOI
Optimization of linked list prefix computations on multithreaded GPUs using CUDA
Zheng Wei,Joseph JaJa +1 more
TL;DR: An optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations is introduced and the processing cost per element is shown to be close to the best possible.