Proceedings Article

Fast, Scalable Parallel Comparison Sort on Hybrid Multicore Architectures

TL;DR: This work presents a hybrid comparison-based sorting algorithm that uses a many-core GPU and a multicore CPU together to perform sorting, and shows that the resulting performance gains can also be obtained on other hybrid CPU+GPU platforms.
Abstract: Sorting has been a topic of immense research value since the inception of Computer Science. Hybrid computing on multicore architectures involves computing simultaneously on a tightly coupled heterogeneous collection of devices. In this work, we consider a multicore CPU along with a many-core GPU as our experimental hybrid platform, and we present a hybrid comparison-based sorting algorithm that uses both devices to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters, followed by creating independent sublists. Sorting the independent sublists results in sorting the entire original list. On a CPU+GPU platform consisting of an Intel i7 980 and an Nvidia GTX 580, our algorithm achieves a 20% gain over the current best known comparison sort result, published by Davidson et al. [InPar 2012]. On the same experimental platform, our results are better by 40% on average than a similar GPU-alone algorithm proposed by Leischner et al. [IPDPS 2010]. Our results also show that our algorithm and its implementation scale with the size of the input. We also show that such performance gains can be obtained on other hybrid CPU+GPU platforms.
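The splitter-based idea described in the abstract can be illustrated with a minimal, sequential sketch. This is not the authors' implementation: the function name `sample_sort`, the parameters `num_buckets` and `oversample`, and the random-sampling rule for choosing splitters are all assumptions made for illustration. In a hybrid setting, the independent sublists would presumably be divided between the CPU and the GPU; here they are simply sorted one after another.

```python
import bisect
import random


def sample_sort(data, num_buckets=4, oversample=8, seed=0):
    """Sort `data` by partitioning it around splitters (hypothetical sketch).

    Assumed scheme: draw a random sample, sort it, pick evenly spaced
    splitters, route every element to the sublist its value falls into,
    then sort each sublist independently and concatenate.
    """
    sample_size = num_buckets * oversample
    if len(data) <= sample_size:
        # Too small to be worth splitting; sort directly.
        return sorted(data)

    rng = random.Random(seed)
    sample = sorted(rng.sample(data, sample_size))
    # num_buckets - 1 splitters, evenly spaced in the sorted sample.
    splitters = [sample[i * oversample] for i in range(1, num_buckets)]

    # Sublist i holds the elements for which exactly i splitters are <= x,
    # so every element of sublist i is smaller than every element of i + 1.
    sublists = [[] for _ in range(num_buckets)]
    for x in data:
        sublists[bisect.bisect_right(splitters, x)].append(x)

    # The sublists are independent: each can be sorted on its own
    # (sequentially here), and concatenation yields the sorted list.
    result = []
    for sub in sublists:
        result.extend(sorted(sub))
    return result
```

Because no element of one sublist needs to be compared with an element of another after partitioning, the per-sublist sorts carry no dependencies on each other, which is the property that makes this family of algorithms attractive for parallel hardware.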
Citations
Proceedings Article

01 Dec 2013
TL;DR: This paper presents a fast and efficient string sort on the GPU that is built on the available radix sort, and achieves a speedup of up to 10× over current GPU methods, especially on large datasets.
Abstract: String sorting or variable-length key sorting has lagged in performance on the GPU even as fixed-length key sorting has improved dramatically. Radix sorting is the fastest on GPUs. In this paper, we present a fast and efficient string sort on the GPU that is built on the available radix sort. Our method sorts strings from left to right in steps, moving only indexes and small prefixes for efficiency. We reduce the number of sort steps by adaptively consuming maximum string bytes based on the number of segments in each step. Performance is improved by using Thrust primitives for most steps and by removing singleton segments from consideration. Over 70% of the string sort time is spent on Thrust primitives. This provides high performance along with high adaptability to future GPUs. We achieve a speedup of up to 10× over current GPU methods, especially on large datasets. We also scale to much larger input sizes. We present results on easy and difficult strings defined using their after-sort tie lengths.

18 citations

Proceedings Article

01 Dec 2014
TL;DR: This paper presents a comparative study of the most popular and efficient string sorting algorithms implemented on CPU and GPU machines, and proposes an efficient parallel multi-key quicksort implementation that uses a ternary search tree to increase the speedup and efficiency of sorting large sets of string data.
Abstract: The increase in the amount of data is evident in recent times. The amount of data stored and retrieved is increasing at a fast rate. Processing text data consumes a large amount of memory in terms of storage and extraction. Sorting the stored data is one of the most favorable methods for increasing the efficiency of extracting stored data. Graphics Processing Units (GPUs) have evolved from being used as dedicated graphics rendering modules to being used to exploit fast parallelism for large computational purposes. The use of GPUs for sorting strings large in size has produced effective and fast results when compared to using CPUs. This paper presents a comparative study of the most popular and efficient string sorting algorithms that have been implemented on CPU and GPU machines. This paper also proposes an efficient parallel multi-key quicksort implementation which uses a ternary search tree in order to increase the speedup and efficiency of sorting large sets of string data.

8 citations

Proceedings Article

09 Oct 2014
TL;DR: This paper considers a class of sparse matrices that exhibit a scale-free nature, identifies a scheme that works well for such matrices, and uses simple and effective mechanisms to determine the appropriate amount of work to be allotted to the CPU and the GPU.
Abstract: Multiplying a sparse matrix with a vector, denoted spmv, is a fundamental operation in linear algebra with several applications. Hence, efficient and scalable implementation of spmv has been a topic of immense research. Recent efforts are aimed at implementations on GPUs, multicore architectures, and other such emerging computational platforms. Owing to the highly irregular nature of spmv, it is observed that GPUs and CPUs can offer comparable performance. In this paper, we propose three heterogeneous algorithms for spmv that simultaneously utilize both the CPU and the GPU. This is shown to lead to better resource utilization apart from performance gains. Our experiments with the work division schemes on standard datasets indicate that it is not in general possible to choose the most appropriate scheme given a matrix. We therefore consider a class of sparse matrices that exhibit a scale-free nature and identify a scheme that works well for such matrices. Finally, we use simple and effective mechanisms to determine the appropriate amount of work to be allotted to the CPU and the GPU.

7 citations


Journal Article

TL;DR: This paper focuses on the performance gain obtained on Kepler graphics processing units (GPUs) for multi‐key quicksort, and compares it with parallel CPU implementations and with the GPU implementation of a string sorting algorithm using singleton elements from the literature.
Abstract: This paper focuses on the performance gain obtained on the Kepler graphics processing units (GPUs) for multi‐key quicksort. Because multi‐key quicksort is a recursive algorithm, many researchers have found it tedious to parallelize on multi‐core and many‐core architectures. A survey of the state‐of‐the‐art string sorting algorithms and a thorough understanding of the Kepler GPU architecture gave rise to the research idea of matching the template of multi‐key quicksort with the dynamic parallelism feature offered by Kepler‐based GPUs. The CPU parallel implementation shows an improvement of 33 to 50% and 62 to 75% when compared with 8‐bit and 16‐bit parallel most significant digit radix sort, respectively. The GPU implementation of multi‐key quicksort gives a 6× to 18× speedup compared with the CPU parallel implementation of multi‐key quicksort. The GPU implementation of multi‐key quicksort achieves a 1.5× to 3× speedup when compared with the GPU implementation of a string sorting algorithm using singleton elements in the literature. Copyright © 2016 John Wiley & Sons, Ltd.

7 citations


Proceedings Article

01 May 2017
TL;DR: The applicability of an ear decomposition of graphs to problems such as all-pairs-shortest-paths and minimum cost cycle basis is studied, and it is shown that the resulting solutions are scalable in terms of both memory usage and speedup over the best known current implementations.
Abstract: Graph algorithms play an important role in several fields of science and engineering. Prominent among them are the All-Pairs-Shortest-Paths (APSP) and related problems. Indeed, there are several efficient implementations for such problems on a variety of modern multi- and many-core architectures. It can be noticed that for several graph problems, parallelism offers only limited success, as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, some of these graphs exhibit clear structural properties due to their sparsity. This calls for particular solution strategies aimed at scalable processing of large, sparse graphs on modern parallel architectures. In this paper, we study the applicability of an ear decomposition of graphs to problems such as all-pairs-shortest-paths and minimum cost cycle basis. Through experimentation, we show that the resulting solutions are scalable in terms of both memory usage and speedup over the best known current implementations. We believe that our techniques have the potential to be relevant for designing scalable solutions for other computations on large sparse graphs.

3 citations


References
Journal Article

TL;DR: A new algorithm called Mersenne Twister (MT) is proposed for generating uniform pseudorandom numbers, which provides a super astronomical period of 2^19937 − 1 and 623-dimensional equidistribution up to 32-bit accuracy, while using a working area of only 624 words.
Abstract: A new algorithm called Mersenne Twister (MT) is proposed for generating uniform pseudorandom numbers. For a particular choice of parameters, the algorithm provides a super astronomical period of 2^19937 − 1 and 623-dimensional equidistribution up to 32-bit accuracy, while using a working area of only 624 words. This is a new variant of the previously proposed generators, TGFSR, modified so as to admit a Mersenne-prime period. The characteristic polynomial has many terms. The distribution up to v bits accuracy for 1 ≤ v ≤ 32 is also shown to be good. An algorithm is also given that checks the primitivity of the characteristic polynomial of MT with computational complexity O(p^2), where p is the degree of the polynomial. We implemented this generator in portable C-code. It passed several stringent statistical tests, including diehard. Its speed is comparable to other modern generators. Its merits are due to the efficient algorithms that are unique to polynomial calculations over the two-element field.

5,418 citations

Proceedings Article

11 Aug 2008
TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

2,143 citations

Book

11 Oct 2000
TL;DR: Aimed at the working researcher or scientific C/C++ or Fortran programmer, this text introduces the competent research programmer to a new vocabulary of idioms and techniques for parallelizing software using OpenMP.
Abstract: Aimed at the working researcher or scientific C/C++ or Fortran programmer, this text introduces the competent research programmer to a new vocabulary of idioms and techniques for parallelizing software using OpenMP.

1,253 citations

Journal Article

TL;DR: This article observes that multicore CPUs and manycore GPUs have made mainstream processor chips parallel systems, and poses the challenge of developing mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.
Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore’s law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

1,065 citations

Proceedings Article

19 Jun 2010
TL;DR: This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
Abstract: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

790 citations

