Fast, Scalable Parallel Comparison Sort on Hybrid Multicore Architectures

doi:10.1109/IPDPSW.2013.129

Proceedings Article•DOI•

Dip Sankar Banerjee, Parikshit Sakurikar, Kishore Kothapalli

20 May 2013-pp 1060-1069

TL;DR: This work presents a hybrid comparison based sorting algorithm which utilizes a many-core GPU and a multi-core CPU to perform sorting and shows that such performance gains can be obtained on other hybrid CPU+GPU platforms.

Abstract: Sorting has been a topic of immense research value since the inception of Computer Science. Hybrid computing on multicore architectures involves computing simultaneously on a tightly coupled heterogeneous collection of devices. In this work, we consider a multicore CPU along with a many-core GPU as our experimental hybrid platform, and we present a hybrid comparison-based sorting algorithm that uses both devices to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters, followed by creating independent sublists. Sorting the independent sublists results in sorting the entire original list. On a CPU+GPU platform consisting of an Intel i7 980 and an Nvidia GTX 580, our algorithm achieves a 20% gain over the current best known comparison sort result, published by Davidson et al. [InPar 2012]. On the same platform, our results are better by 40% on average than a similar GPU-alone algorithm proposed by Leischner et al. [IPDPS 2010]. Our results also show that our algorithm and its implementation scale with the size of the input. We also show that such performance gains can be obtained on other hybrid CPU+GPU platforms.
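
The splitter idea in the abstract can be illustrated with a small single-threaded sketch. This is not the authors' implementation: the function name `splitter_sort`, the parameters `num_buckets` and `oversample`, and the random sampling strategy are all assumptions made for illustration, and the buckets are sorted one after another here rather than being divided between a CPU and a GPU.

```python
import random
from bisect import bisect_right


def splitter_sort(data, num_buckets=8, oversample=4, seed=0):
    """Sort `data` by partitioning it around sampled splitters (hypothetical sketch)."""
    if len(data) <= num_buckets:
        return sorted(data)
    rng = random.Random(seed)
    # Draw an oversampled random sample and take evenly spaced splitters from it.
    sample = sorted(rng.sample(data, min(len(data), num_buckets * oversample)))
    step = len(sample) // num_buckets
    splitters = [sample[i * step] for i in range(1, num_buckets)]
    # The bucket index of x is the number of splitters that are <= x,
    # so every element of bucket i is <= every element of bucket i + 1.
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        buckets[bisect_right(splitters, x)].append(x)
    # Buckets are independent sublists: in a hybrid setting each one could be
    # handed to a different device. Here they are sorted sequentially.
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))
    return out
```

Because the buckets are ordered relative to one another, concatenating the sorted buckets gives the sorted list with no final merge step, which is what makes the sublists independent units of work.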

Citations

Proceedings Article•DOI•

Can GPUs sort strings efficiently

Aditya Deshpande, P. J. Narayanan

01 Dec 2013

TL;DR: This paper presents a fast and efficient string sort on the GPU that is built on the available radix sort and achieves a speedup of up to 10× over current GPU methods, especially on large datasets.

Abstract: String sorting, or variable-length key sorting, has lagged in performance on the GPU even as fixed-length key sorting has improved dramatically. Radix sort is the fastest sorting method on GPUs. In this paper, we present a fast and efficient string sort on the GPU that is built on the available radix sort. Our method sorts strings from left to right in steps, moving only indexes and small prefixes for efficiency. We reduce the number of sort steps by adaptively consuming maximum string bytes based on the number of segments in each step. Performance is improved by using Thrust primitives for most steps and by removing singleton segments from consideration. Over 70% of the string sort time is spent on Thrust primitives. This provides high performance along with high adaptability to future GPUs. We achieve a speedup of up to 10× over current GPU methods, especially on large datasets. We also scale to much larger input sizes. We present results on easy and difficult strings defined using their after-sort tie lengths.

19 citations

Journal Article•DOI•

Kepler GPU accelerated recursive sorting using dynamic parallelism

B. Neelima¹, Bharath Shamsundar¹, Anjjan Narayan², Rithesh G Prabhu³, Crystal Gomes⁴

N.M.A.M. Institute of Technology¹, Georgia Institute of Technology², University of Southern California³, T. A. Pai Management Institute⁴

25 Feb 2017-Concurrency and Computation: Practice and Experience

TL;DR: This paper focuses on the performance gain obtained on Kepler graphics processing units (GPUs) for multi‐key quicksort, including a comparison with the GPU implementation of the string sorting algorithm using singleton elements from the literature.

Abstract: This paper focuses on the performance gain obtained on the Kepler graphics processing units (GPUs) for multi‐key quicksort. Because multi‐key quicksort is a recursive algorithm, many researchers have found it tedious to parallelize on multi‐ and many‐core architectures. A survey of the state‐of‐the‐art string sorting algorithms and a robust insight into the Kepler GPU architecture gave rise to an intriguing research idea of matching the template of multi‐key quicksort with the dynamic parallelism feature offered by Kepler‐based GPUs. The CPU parallel implementation has an improvement of 33 to 50% and 62 to 75% when compared with 8‐bit and 16‐bit parallel most significant digit radix sort, respectively. The GPU implementation of multi‐key quicksort gives 6× to 18× speed up compared with the CPU parallel implementation of parallel multi‐key quicksort. The GPU implementation of multi‐key quicksort achieves 1.5× to 3× speed up when compared with the GPU implementation of string sorting algorithm using singleton elements in the literature. Copyright © 2016 John Wiley & Sons, Ltd.
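
For readers unfamiliar with multi‐key quicksort, the following is a minimal sequential sketch of the textbook algorithm (three-way partitioning on one character position at a time). It is not the paper's GPU code: the function name `multikey_quicksort` is hypothetical, it builds new lists instead of sorting in place, and it does not use dynamic parallelism. The three recursive calls at the end are the independent pieces of work that a parallel version could launch concurrently.

```python
def multikey_quicksort(strings, depth=0):
    """Return a sorted copy of `strings`, comparing one character position at a time."""
    if len(strings) <= 1:
        return list(strings)

    def char_at(s):
        # Treat end-of-string as smaller than any real character.
        return ord(s[depth]) if depth < len(s) else -1

    pivot = char_at(strings[len(strings) // 2])
    less = [s for s in strings if char_at(s) < pivot]
    equal = [s for s in strings if char_at(s) == pivot]
    greater = [s for s in strings if char_at(s) > pivot]
    # Strings sharing the pivot character are ordered by their next character.
    # If the pivot is end-of-string, they are identical and need no more work.
    if pivot != -1:
        equal = multikey_quicksort(equal, depth + 1)
    return multikey_quicksort(less, depth) + equal + multikey_quicksort(greater, depth)
```

Every string in one call shares its first `depth` characters with the others, so only a single character is compared per string per step; this is the property that distinguishes the method from a quicksort that compares whole strings.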

10 citations

Cites methods from "Fast, Scalable Parallel Comparison ..."

...[21] have made use of a hybrid CPU+GPU platform and implemented a faster merge sort algorithm than Davidson et al....

Proceedings Article•DOI•

String sorting on multi and many-threaded architectures: A comparative study

B. Neelima¹, Anjjan S Narayan¹, Rithesh G Prabhu¹

N.M.A.M. Institute of Technology¹

01 Dec 2014

TL;DR: A comparative study on the most popular and efficient string sorting algorithms that have been implemented on CPU and GPU machines and an efficient parallel multi-key quicksort implementation which uses ternary search tree in order to increase the speed up and efficiency of sorting large set of string data are produced.

Abstract: The increase in the amount of data is evident in recent times. The amount of data stored and retrieved is increasing at a fast rate. Processing text data consumes large amount of memory in terms of storage and extraction. Sorting the stored data is one of the most favorable methods that can be used in order to increase the efficiency of extracting stored data. Graphic Processing Units (GPUs) have evolved from being used as dedicated graphic rendering modules to being used to exploit fast parallelism for large computational purposes. The use of GPUs for sorting strings large in size has produced effective and fast results when compared to using CPUs. This paper produces a comparative study on the most popular and efficient string sorting algorithms that have been implemented on CPU and GPU machines. This paper also proposes an efficient parallel multi-key quicksort implementation which uses ternary search tree in order to increase the speed up and efficiency of sorting large set of string data.

8 citations

Proceedings Article•DOI•

Architecture- and workload- aware heterogeneous algorithms for sparse matrix vector multiplication

Sivaramakrishna Bharadwaj Indarapu¹, Manoj Maramreddy¹, Kishore Kothapalli¹

International Institute of Information Technology, Hyderabad¹

09 Oct 2014

TL;DR: This paper considers a class of sparse matrices that exhibit a scale-free nature and identifies a scheme that works well for such matrices and uses simple and effective mechanisms to determine the appropriate amount of work to be allotted to the CPU and the GPU.

Abstract: Multiplying a sparse matrix with a vector, denoted spmv, is a fundamental operation in linear algebra with several applications. Hence, efficient and scalable implementation of spmv has been a topic of immense research. Recent efforts are aimed at implementations on GPUs, multicore architectures, and such emerging computational platforms. Owing to the highly irregular nature of spmv, it is observed that GPUs and CPUs can offer comparable performance. In this paper, we propose three heterogeneous algorithms for spmv that simultaneously utilize both the CPU and the GPU. This is shown to lead to better resource utilization apart from performance gains. Our experiments of the work division schemes on standard datasets indicate that it is not in general possible to choose the most appropriate scheme given a matrix. We therefore consider a class of sparse matrices that exhibit a scale-free nature and identify a scheme that works well for such matrices. Finally, we use simple and effective mechanisms to determine the appropriate amount of work to be allotted to the CPU and the GPU.
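
The abstract does not describe its three work division schemes, so the sketch below shows only one plausible kind of division, chosen here as an assumption: a static split of the rows of a CSR matrix so that the first part holds a given fraction of the nonzeros. The function names `spmv_csr_rows` and `split_row_by_nonzeros` are hypothetical, and both parts run sequentially in plain Python, where a hybrid implementation would send one row range to each device.

```python
def spmv_csr_rows(row_ptr, col_idx, values, x, row_start, row_end):
    """Compute entries row_start..row_end-1 of y = A @ x for a CSR matrix A."""
    out = []
    for r in range(row_start, row_end):
        acc = 0.0
        # Nonzeros of row r occupy positions row_ptr[r] .. row_ptr[r + 1] - 1.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        out.append(acc)
    return out


def split_row_by_nonzeros(row_ptr, fraction):
    """Smallest r such that rows [0, r) hold at least `fraction` of the nonzeros."""
    target = fraction * row_ptr[-1]
    for r, nnz_before in enumerate(row_ptr):
        if nnz_before >= target:
            return r
    return len(row_ptr) - 1


# Example matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1, 2, 3, 4, 5]
x = [1, 1, 1]

split = split_row_by_nonzeros(row_ptr, 0.5)
# The two row ranges are independent, so each could go to a different device.
y = (spmv_csr_rows(row_ptr, col_idx, values, x, 0, split)
     + spmv_csr_rows(row_ptr, col_idx, values, x, split, len(row_ptr) - 1))
```

Splitting by nonzero count rather than by row count matters for scale-free matrices, where a few rows hold most of the nonzeros and an equal-rows split would leave one side with far more work.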

8 citations

Cites background from "Fast, Scalable Parallel Comparison ..."

...Several challenge problems from parallel computing such as sorting [20], graph traversals [21], and the like are already known to have highly efficient implementations on a variety of accelerators....

Proceedings Article•DOI•

Applications of Ear Decomposition to Efficient Heterogeneous Algorithms for Shortest Path/Cycle Problems

Debarshi Dutta¹, Meher Chaitanya¹, Kishore Kothapalli¹, Debajyoti Bera²

International Institute of Information Technology, Hyderabad¹, Indraprastha Institute of Information Technology²

01 May 2017

TL;DR: The applicability of an ear decomposition of graphs to problems such as all-pairs-shortest-paths and minimum cost cycle basis is studied and it is shown that the resulting solutions are scalable in terms of both memory usage and also their speedup over best known current implementations.

Abstract: Graph algorithms play an important role in several fields of sciences and engineering. Prominent among them are the All-Pairs-Shortest-Paths (APSP) and related problems. Indeed there are several efficient implementations for such problems on a variety of modern multi- and many-core architectures. It can be noticed that for several graph problems, parallelism offers only a limited success as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, some of these graphs exhibit clear structural properties due to their sparsity. This calls for particular solution strategies aimed at scalable processing of large, sparse graphs on modern parallel architectures. In this paper, we study the applicability of an ear decomposition of graphs to problems such as all-pairs-shortest-paths and minimum cost cycle basis. Through experimentation, we show that the resulting solutions are scalable in terms of both memory usage and also their speedup over best known current implementations. We believe that our techniques have the potential to be relevant for designing scalable solutions for other computations on large sparse graphs.

3 citations

Cites background from "Fast, Scalable Parallel Comparison ..."

...Over the past few years several carefully handcrafted heterogeneous algorithms on various heterogeneous platforms are designed for a variety of fundamental problems from parallel computing such as sorting [7], sparse matrix operations [37], [27], dense linear algebra routines [36], graph algorithm [20], [5] and the like....

References

Journal Article•DOI•

Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator

Makoto Matsumoto¹, Takuji Nishimura¹

Keio University¹

01 Jan 1998-ACM Transactions on Modeling and Computer Simulation

TL;DR: A new algorithm called Mersenne Twister (MT) is proposed for generating uniform pseudorandom numbers, which provides a super astronomical period of 2^19937 − 1 and 623-dimensional equidistribution up to 32-bit accuracy, while using a working area of only 624 words.

Abstract: A new algorithm called Mersenne Twister (MT) is proposed for generating uniform pseudorandom numbers. For a particular choice of parameters, the algorithm provides a super astronomical period of 2^19937 − 1 and 623-dimensional equidistribution up to 32-bit accuracy, while using a working area of only 624 words. This is a new variant of the previously proposed generators, TGFSR, modified so as to admit a Mersenne-prime period. The characteristic polynomial has many terms. The distribution up to v bits accuracy for 1 ≤ v ≤ 32 is also shown to be good. An algorithm is also given that checks the primitivity of the characteristic polynomial of MT with computational complexity O(p^2) where p is the degree of the polynomial. We implemented this generator in portable C-code. It passed several stringent statistical tests, including diehard. Its speed is comparable to other modern generators. Its merits are due to the efficient algorithms that are unique to polynomial calculations over the two-element field.

5,819 citations

Proceedings Article•DOI•

Scalable parallel programming with CUDA

John R. Nickolls¹, Ian Buck¹, Michael Garland¹, Kevin Skadron²

Nvidia¹, University of Virginia²

11 Aug 2008

TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.

Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

2,216 citations

Book•

Parallel Programming in OpenMP

Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror E. Maydan¹, Jeff McDonald, Ramesh Menon

Tensilica¹

11 Oct 2000

TL;DR: Aimed at the working researcher or scientific C/C++ or Fortran programmer, this text introduces the competent research programmer to a new vocabulary of idioms and techniques for parallelizing software using OpenMP.

Abstract: Aimed at the working researcher or scientific C/C++ or Fortran programmer, this text introduces the competent research programmer to a new vocabulary of idioms and techniques for parallelizing software using OpenMP.

1,253 citations

Journal Article•DOI•

Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?

John R. Nickolls¹, Ian Buck¹, Michael Garland¹, Kevin Skadron²

Nvidia¹, University of Virginia²

01 Mar 2008-ACM Queue

TL;DR: In this article, the authors present a framework to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism on manycore GPUs with widely varying numbers of cores.

Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore’s law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

1,148 citations

Proceedings Article•DOI•

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Victor W. Lee¹, Changkyu Kim¹, Jatin Chhugani¹, Michael E. Deisher¹, Daehyun Kim¹, Anthony D. Nguyen¹, Nadathur Satish¹, Mikhail Smelyanskiy¹, Srinivas Chennupaty¹, Per Hammarlund¹, Ronak Singhal¹, Pradeep Dubey¹

Intel¹

19 Jun 2010

TL;DR: This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Abstract: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

810 citations