GPU-Accelerated Large-Scale Distributed Sorting Coping with Device Memory Capacity

doi:10.1109/TBDATA.2015.2511001

Journal Article

Hideyuki Shamoto, +4 more

- 01 Mar 2016 -

IEEE Transactions on Big Data

- Vol. 2, Iss: 1, pp 57-69

TLDR

This work investigates the applicability of GPU devices to splitter-based sorting algorithms and extends HykSort, an existing splitter-based algorithm, by offloading costly computation phases to GPUs. It finds that performance is mostly bottlenecked by the CPU-GPU host-to-device bandwidth.

Abstract:

Splitter-based parallel sorting algorithms are known to be highly efficient for distributed sorting due to their low communication complexity. Although GPU accelerators could help reduce the computation cost in general, their effectiveness in distributed sorting algorithms remains unclear. We investigate the applicability of GPU devices to splitter-based algorithms and extend HykSort, an existing splitter-based algorithm, by offloading costly computation phases to GPUs. To cope with data volumes exceeding the GPU memory capacity, an out-of-core local sort is used, with a small overhead of about 7.5 percent when the data size is tripled. We evaluate the performance of our implementation on the TSUBAME2.5 supercomputer, which comprises over 4,000 NVIDIA K20x GPUs. Weak scaling analysis shows a 389-fold speedup with 0.25 TB/s throughput when sorting 4 TB of 64-bit integer values on 1,024 nodes compared to running on one node; this is 1.40 times faster than the reference CPU implementation. Detailed analysis, however, reveals that the performance is mostly bottlenecked by the CPU-GPU host-to-device bandwidth. With orders-of-magnitude improvements announced for next-generation GPUs, the performance boost will be tremendous, in accordance with other successful GPU accelerations.
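The out-of-core local sort mentioned in the abstract can be illustrated with a generic chunk-and-merge scheme. The following is a minimal sketch, not the authors' implementation: `device_sort`, `out_of_core_sort`, and `device_capacity` are hypothetical names, and Python's built-in `sorted` stands in for a GPU sort so that the example runs without a device.

```python
import heapq
from typing import List


def device_sort(chunk: List[int]) -> List[int]:
    """Stand-in for a GPU sort (assumption: a real version would copy the
    chunk to the device, sort it there, and copy it back)."""
    return sorted(chunk)


def out_of_core_sort(data: List[int], device_capacity: int) -> List[int]:
    """Sort data that does not fit in (simulated) device memory.

    device_capacity is a hypothetical limit on how many keys the
    device can hold at once.
    """
    if device_capacity <= 0:
        raise ValueError("device_capacity must be positive")

    # Step 1: cut the input into chunks that fit on the device
    # and sort each chunk independently.
    runs = [
        device_sort(data[start:start + device_capacity])
        for start in range(0, len(data), device_capacity)
    ]

    # Step 2: k-way merge of the sorted runs on the host.
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    keys = [5, 3, 9, 1, 7, 2, 8]
    print(out_of_core_sort(keys, device_capacity=3))
```

In this sketch every key crosses the host-device boundary once in each direction, which is consistent with the abstract's observation that host-to-device bandwidth, rather than the sort itself, tends to dominate.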

Citations

Journal Article

An Efficient and Unique TF/IDF Algorithmic Model-Based Data Analysis for Handling Applications with Big Data Streaming

Celestine Iwendi, +4 more

- 11 Nov 2019 -

Electronics

TL;DR: Proposes a novel TF/IDF algorithm with the temporal Louvain approach to categorize documents into hierarchical structures that show the relationships between variables, which is a boon to analysts making essential decisions.

Proceedings Article

Bonsai: high-performance adaptive merge tree sorting

Nikola Samardzic, +4 more

TL;DR: It is shown that merge trees can be implemented on FPGAs to offer state-of-the-art performance over many problem sizes and memory hierarchies, and Bonsai, an adaptive sorting solution that takes into account the off-chip memory bandwidth and the amount of on-chip resources to optimize sorting time, is developed.

Book Chapter

The Impact of Big Data Analytics and Challenges to Cyber Security

Anandakumar Haldorai, +1 more

TL;DR: This chapter analyzes the real-time uses of big data for both individuals and society, concentrating on seven key areas of usage: big data for business optimization and customer analytics, big data and healthcare, big data and science, big data and finance, big data as an enabler of openness and efficiency in government, big data and the emerging energy distribution systems, and big data security.

Journal Article

Accelerating the similarity self-join using the GPU

Michael Gowanlock, +2 more

- 28 Jun 2019 -

Journal of Parallel and Distributed Comp...

TL;DR: This work proposes several techniques to optimize the self-join using the GPU that include a GPU-efficient index that employs a bounded search, a batching scheme to accommodate large result sets, and duplicate search removal with low overhead.

Journal Article

A hybrid CPU/GPU approach for optimizing sorting throughput

Michael Gowanlock, +2 more

TL;DR: It is demonstrated that, while out-of-place GPU sorting achieves the best performance, an in-place sort has the potential to further reduce some host-side bottlenecks, which encourages several future research priorities.

References

Proceedings Article

Sorting networks and their applications

Kenneth E. Batcher

TL;DR: To achieve high throughput rates today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.

Proceedings Article

GPUTeraSort: high performance graphics co-processor sorting for large database management

Naga K. Govindaraju, +3 more

TL;DR: Overall, the results indicate that using a GPU as a co-processor can significantly improve the performance of sorting algorithms on large databases.

Book Chapter

Thrust: A Productivity-Oriented Library for CUDA

Nathan Bell, +1 more

TL;DR: Thrust is a parallel template library for writing CUDA C/C++ applications with minimal programming effort, while still allowing developers to make fine-grained decisions about how computations are decomposed into parallel threads and executed on the device.

Proceedings Article

A comparison of sorting algorithms for the connection machine CM-2

Guy E. Blelloch, +5 more

TL;DR: A fast sorting algorithm for the Connection Machine Supercomputer model CM-2 is developed and it is shown that any O(lg n)-depth family of sorting networks can be used to sort n numbers in O(lg n) time in the bounded-degree fixed interconnection network domain.

Journal Article

Parallel sorting by regular sampling

Hanmao Shi, +1 more

- 01 Apr 1992 -

Journal of Parallel and Distributed Comp...

TL;DR: The algorithm reduces memory and bus contention, which many parallel sorting algorithms suffer from, by using a regular sampling of the data to ensure good pivot selection and is shown to be asymptotically optimal.
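Regular sampling is one way to choose the splitters (pivots) that splitter-based sorts rely on. The sketch below simulates the idea in a single process, under stated assumptions: each inner list plays the role of one processor's data, `psrs_sort` is a hypothetical name, and the sample and pivot offsets are simplified relative to the published algorithm. In a distributed setting the partition exchange would be an all-to-all communication step rather than list slicing.

```python
import bisect
import heapq
from typing import List


def psrs_sort(blocks: List[List[int]]) -> List[int]:
    """Single-process simulation of sorting by regular sampling.

    blocks[i] stands in for the keys held by processor i.
    """
    p = len(blocks)

    # Phase 1: every "processor" sorts its own block.
    local = [sorted(b) for b in blocks]
    if p == 1:
        return local[0]

    # Phase 1b: take up to p evenly spaced samples from each sorted block.
    samples: List[int] = []
    for b in local:
        step = max(1, len(b) // p)
        samples.extend(b[::step][:p])
    samples.sort()
    if not samples:
        return []

    # Phase 2: pick p - 1 pivots at regular intervals of the sorted samples.
    pivots = [samples[(i * len(samples)) // p] for i in range(1, p)]

    # Phase 3: split each sorted block at the pivots; slice j goes to
    # "processor" j (this models the all-to-all exchange).
    buckets: List[List[List[int]]] = [[] for _ in range(p)]
    for b in local:
        cuts = [0] + [bisect.bisect_right(b, v) for v in pivots] + [len(b)]
        for j in range(p):
            buckets[j].append(b[cuts[j]:cuts[j + 1]])

    # Phase 4: each "processor" merges the sorted slices it received;
    # concatenating the results in processor order gives the sorted output.
    result: List[int] = []
    for runs in buckets:
        result.extend(heapq.merge(*runs))
    return result


if __name__ == "__main__":
    print(psrs_sort([[9, 4, 7, 1], [8, 2, 6], [5, 3, 0, 10, 11]]))
```

Because every block is already sorted when the pivots are applied, the split positions can be found by binary search, and each processor ends up with keys in a disjoint value range.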

Related Papers (5)

Large-scale distributed sorting for GPU-based heterogeneous supercomputers

Comparison based sorting for systems with multiple GPUs

Fast Four‐Way Parallel Radix Sorting on GPUs

BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications

Relational Query Co-Processing on Graphics Processors 1