Journal Article

GPU-Accelerated Large-Scale Distributed Sorting Coping with Device Memory Capacity

TLDR
This work investigates the applicability of GPU devices to splitter-based sorting algorithms and extends HykSort, an existing splitter-based algorithm, by offloading costly computation phases to GPUs. It finds that performance is mostly bottlenecked by the CPU-GPU host-to-device bandwidth.
Abstract
Splitter-based parallel sorting algorithms are known to be highly efficient for distributed sorting due to their low communication complexity. Although GPU accelerators can help reduce computation cost in general, their effectiveness in distributed sorting algorithms remains unclear. We investigate the applicability of GPU devices to splitter-based algorithms and extend HykSort, an existing splitter-based algorithm, by offloading costly computation phases to GPUs. To cope with data volumes that exceed the GPU memory capacity, an out-of-core local sort is used, with a small overhead of about 7.5 percent when the data size is tripled. We evaluate the performance of our implementation on the TSUBAME2.5 supercomputer, which comprises over 4,000 NVIDIA K20x GPUs. Weak scaling analysis shows a 389-fold speedup with 0.25 TB/s throughput when sorting 4 TB of 64-bit integer values on 1,024 nodes, compared to running on one node; this is 1.40 times faster than the reference CPU implementation. Detailed analysis, however, reveals that the performance is mostly bottlenecked by the CPU-GPU host-to-device bandwidth. With orders-of-magnitude improvements announced for next-generation GPUs, the performance boost will be tremendous, in accordance with other successful GPU accelerations.
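
The abstract does not spell out how the out-of-core local sort works. One common way to sort data larger than device memory is to sort capacity-sized chunks one at a time and then merge the sorted runs on the host. The sketch below illustrates that general idea only and is not the paper's implementation. The names `sort_chunk_on_device`, `out_of_core_sort`, and `device_capacity` are hypothetical, and the "device" sort is a CPU stand-in for what would be a host-to-device copy, a GPU sort, and a copy back.

```python
import heapq
from typing import List


def sort_chunk_on_device(chunk: List[int]) -> List[int]:
    """Hypothetical stand-in for a GPU sort of one chunk.

    A real version would copy the chunk to device memory, sort it
    there, and copy the result back; here we simply sort on the CPU.
    """
    return sorted(chunk)


def out_of_core_sort(data: List[int], device_capacity: int) -> List[int]:
    """Sort `data` when at most `device_capacity` keys fit on the device.

    Assumed scheme (not taken from the paper): sort each chunk
    independently, then k-way merge the sorted runs on the host.
    """
    if device_capacity <= 0:
        raise ValueError("device_capacity must be positive")

    # Phase 1: produce sorted runs, each small enough for the device.
    runs = [
        sort_chunk_on_device(data[start:start + device_capacity])
        for start in range(0, len(data), device_capacity)
    ]

    # Phase 2: k-way merge of the sorted runs on the host.
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    keys = [5, 3, 9, 1, 7, 2, 8]
    print(out_of_core_sort(keys, device_capacity=3))
```

In a scheme like this, every key crosses the host-device link at least twice (once in, once out), which is consistent with the abstract's observation that host-to-device bandwidth dominates the runtime.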
