Proceedings ArticleDOI

Fast, Scalable Parallel Comparison Sort on Hybrid Multicore Architectures

TLDR
This work presents a hybrid comparison-based sorting algorithm that uses a many-core GPU and a multi-core CPU to perform sorting, and shows that its performance gains can also be obtained on other hybrid CPU+GPU platforms.
Abstract
Sorting has been a topic of immense research value since the inception of Computer Science. Hybrid computing on multicore architectures involves computing simultaneously on a tightly coupled heterogeneous collection of devices. In this work, we consider a multicore CPU along with a many-core GPU as our experimental hybrid platform, and we present a hybrid comparison-based sorting algorithm that uses both devices to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters, which creates independent sublists. Because the splitters order the sublists relative to one another, sorting each sublist independently sorts the entire original list. On a CPU+GPU platform consisting of an Intel i7 980 and an Nvidia GTX 580, our algorithm achieves a 20% gain over the best known comparison sort result, published by Davidson et al. [InPar 2012]. On the same platform, our results are better by 40% on average than a similar GPU-only algorithm proposed by Leischner et al. [IPDPS 2010]. Our results also show that our algorithm and its implementation scale with the size of the input. We also show that such performance gains can be obtained on other hybrid CPU+GPU platforms.
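
The splitter-based idea in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the authors' implementation: the function name `splitter_sort`, the parameters `num_buckets`, `oversample`, and `seed`, and the choice of splitters from a random sample are all assumptions made for illustration, and a thread pool merely stands in for the CPU and GPU workers that would sort the independent sublists on a hybrid platform.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def splitter_sort(data, num_buckets=8, oversample=4, seed=0):
    """Sort `data` by partitioning it around splitters (hypothetical helper)."""
    data = list(data)
    if len(data) <= num_buckets:
        return sorted(data)

    # Assumption: splitters are taken at even intervals from a sorted random sample.
    rng = random.Random(seed)
    sample_size = min(len(data), num_buckets * oversample)
    sample = sorted(rng.sample(data, sample_size))
    step = len(sample) / num_buckets
    splitters = [sample[int(i * step)] for i in range(1, num_buckets)]

    # Route each element to the sublist whose key range contains it.
    # Every element of sublist i is smaller than every element of sublist i + 1.
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        buckets[bisect.bisect_right(splitters, x)].append(x)

    # The sublists are independent, so they can be sorted by separate workers.
    # A thread pool is only a stand-in for the CPU and GPU devices here.
    with ThreadPoolExecutor() as pool:
        sorted_buckets = list(pool.map(sorted, buckets))

    # Concatenating the sorted sublists in splitter order yields the sorted list.
    return [x for bucket in sorted_buckets for x in bucket]
```

Because no element ever needs to move between sublists after the split, the final step is a plain concatenation rather than a merge, which is what makes the sublists suitable for distribution across different devices.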


Citations
Proceedings ArticleDOI

Can GPUs sort strings efficiently

TL;DR: This paper presents a fast and efficient string sort on the GPU that is built on the available radix sort, and achieves a speedup of up to 10x over current GPU methods, especially on large datasets.
Journal ArticleDOI

Kepler GPU accelerated recursive sorting using dynamic parallelism

TL;DR: This paper focuses on the performance gain obtained on the Kepler graphics processing units (GPUs) for multi‐key quicksort and the GPU implementation of string sorting algorithm using singleton elements in the literature.
Proceedings ArticleDOI

String sorting on multi and many-threaded architectures: A comparative study

TL;DR: This paper presents a comparative study of the most popular and efficient string sorting algorithms that have been implemented on CPU and GPU machines, and an efficient parallel multi-key quicksort implementation that uses a ternary search tree to increase the speedup and efficiency of sorting large sets of string data.
Proceedings ArticleDOI

Architecture- and workload- aware heterogeneous algorithms for sparse matrix vector multiplication

TL;DR: This paper considers a class of sparse matrices that exhibit a scale-free nature and identifies a scheme that works well for such matrices and uses simple and effective mechanisms to determine the appropriate amount of work to be allotted to the CPU and the GPU.
Proceedings ArticleDOI

Applications of Ear Decomposition to Efficient Heterogeneous Algorithms for Shortest Path/Cycle Problems

TL;DR: The applicability of an ear decomposition of graphs to problems such as all-pairs shortest paths and minimum cost cycle basis is studied, and it is shown that the resulting solutions are scalable in terms of both memory usage and speedup over the best known current implementations.
References
Book ChapterDOI

Fast in-place sorting with CUDA based on bitonic sort

TL;DR: A high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs is presented. The authors adapted bitonic sort for arbitrary input length and assigned compare/exchange operations to threads in a way that decreases low-performance global-memory access and thereby greatly increases the performance of the implementation.
Proceedings ArticleDOI

An improved supercomputer sorting benchmark

Thearling, +1 more
TL;DR: The authors have investigated the use of entropy as a measure of data distribution and propose that it, along with larger datasets, be added to existing sorting benchmarks (such as NAS). They were able to sort 1 billion 32-bit keys in less than 17 s on a 1024-processor CM-5.
Proceedings ArticleDOI

Fast and scalable list ranking on the GPU

TL;DR: This paper describes two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on massively multi-threaded hardware, and presents a GPU-optimized, Recursive Helman-JaJa (RHJ) algorithm.
Journal ArticleDOI

Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort

TL;DR: This work assigned compare/exchange operations to threads in a way that decreases low‐performance global‐memory access and makes efficient use of high‐performance shared memory, which greatly increases the performance of this in‐place, comparison‐based sorting algorithm.
Journal ArticleDOI

Optimization of linked list prefix computations on multithreaded GPUs using CUDA

TL;DR: An optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations is introduced and the processing cost per element is shown to be close to the best possible.