This work proposes two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization, and evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms — Fast Fourier Transform, dynamic programming, and bitonic sort.
Abstract:
While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which, in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms — Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the micro-benchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
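To make the lock-based idea concrete, here is a minimal CUDA sketch of a lock-based inter-block barrier in the spirit described above. It assumes a single global counter `g_mutex` and that `goalVal` equals the number of blocks in the grid; the names and structure are illustrative, not the authors' exact code.

```cuda
// Global counter used as the barrier state (assumption: initialized to 0
// before the kernel launch, e.g., via cudaMemset).
__device__ volatile int g_mutex = 0;

// Lock-based inter-block barrier: every block increments a shared counter
// and spins until all blocks have arrived.
__device__ void gpu_sync_lock_based(int goalVal)
{
    // Only thread 0 of each block participates in the global handshake.
    if (threadIdx.x == 0) {
        // Announce this block's arrival at the barrier.
        atomicAdd((int *)&g_mutex, 1);
        // Spin until every block in the grid has arrived.
        while (g_mutex != goalVal) { /* busy-wait */ }
    }
    // Make the remaining threads of the block wait for thread 0.
    __syncthreads();
}
```

Because every block contends on the same atomic counter, this variant serializes arrivals at the barrier, which is the motivation for the lock-free alternative evaluated in the paper.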
TL;DR: An efficient algorithm is presented that performs well on a variety of different graphs, adapts well to the dynamic setting, and outperforms a competing dynamic graph framework on the authors' test set.
TL;DR: This thesis shows the importance of memory optimizations for GPU systems, addresses the problems of data transfer and global atomic memory contention, and provides a theoretical model that can correctly predict the comparative performance of memory-movement techniques for a given data-intensive application and system.
TL;DR: In this article, a new technique, GEDF-VD, which integrates Global Earliest Deadline First (GEDF) with Virtual Deadline (VD), is proposed to schedule parallel tasks under the gang model.
TL;DR: This thesis introduces the umbrella concept of “anti-parallel patterns”: parts of parallel programs that cause performance to be less than expected. It presents a number of benchmark programs written using NVIDIA’s CUDA technology in order to model the behaviour of each pattern.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
TL;DR: To achieve high throughput rates, today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
TL;DR: The Radix-2 Frameworks, a collection of general, high-performance FFTs designed to solve multi-dimensional FFT, prime-factor, and convolution problems, are presented.
TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.
TL;DR: In this article, the authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units, and exploit blocking similarly to vector computers and heterogeneity of the system.
Q1. What are the contributions mentioned in the paper "Inter-block gpu communication via fast barrier synchronization" ?
The authors propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, their GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
Q2. What are the future works in "Inter-block gpu communication via fast barrier synchronization" ?
As for future work, the authors will further investigate the reasons for the irregularity of the FFT's synchronization time versus the number of blocks in the kernel. Second, the authors will propose a general model to characterize algorithms' parallelism properties, based on which better performance can be obtained when parallelizing them on multi- and many-core architectures.
Q3. What is the way to improve the performance of a bitonic sort?
For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABiSort to O(n log n) with an adaptive data structure that enables merges to be done in linear time.
Q4. How do the authors allocate shared memory on an SM to each block?
In addition, the authors allocate all available shared memory on an SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint.
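This one-block-per-SM trick can be expressed at kernel launch time by requesting (nearly) all of an SM's shared memory per block via the dynamic shared-memory parameter. The sketch below is illustrative, not the authors' code; the 16 KB figure matches GPUs of the paper's era (e.g., the GTX 280), and the kernel name and arguments are assumptions.

```cuda
// Assumed kernel: any kernel that uses the inter-block barrier.
__global__ void myKernel(float *data, int n);

void launch(float *d_data, int n, int numBlocks, int threadsPerBlock)
{
    // Shared memory per SM on this hardware generation (assumption: 16 KB).
    size_t smemPerSM = 16 * 1024;
    // Reserve slightly less than the full amount so the kernel still fits
    // (static allocations and arguments also consume shared memory).
    size_t smemPerBlock = smemPerSM - 256;

    // With each block claiming almost all shared memory on an SM, the
    // scheduler cannot co-locate two blocks on the same SM.
    myKernel<<<numBlocks, threadsPerBlock, smemPerBlock>>>(d_data, n);
}
```

This matters for a GPU barrier: if more blocks are launched than can be resident at once, non-resident blocks can never reach the barrier and the kernel deadlocks, so pinning one block per SM bounds the number of blocks safely.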
Q5. What are the three well-known algorithms that the authors integrate into their synchronization approach?
In addition, the authors integrate each of their GPU synchronization approaches in a micro-benchmark and three well-known algorithms: FFT, dynamic programming, and bitonic sort.
Q6. How many threads can be sorted in a block?
Another parallel implementation of the bitonic sort is in the CUDA SDK [21], but there is only one block in the kernel to use the available barrier function __syncthreads(), thus restricting the maximum number of items that can be sorted to 512 — the maximum number of threads in a block.
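A single-block bitonic sort of this style can be sketched as below; this is an illustrative skeleton in the manner of the CUDA SDK sample, not its verbatim code. It shows why `__syncthreads()` is the limiting factor: the barrier only synchronizes threads within one block, so all `n` items must be handled by that block's threads (here, one item per thread).

```cuda
// Single-block bitonic sort: n must equal blockDim.x (one item per thread)
// and the kernel must be launched with n * sizeof(int) dynamic shared memory.
__global__ void bitonicSortSingleBlock(int *values, int n)
{
    extern __shared__ int shared[];
    unsigned int tid = threadIdx.x;

    shared[tid] = values[tid];
    __syncthreads();

    for (unsigned int k = 2; k <= (unsigned int)n; k <<= 1) {   // subsequence length
        for (unsigned int j = k >> 1; j > 0; j >>= 1) {         // comparison stride
            unsigned int ixj = tid ^ j;                         // partner index
            if (ixj > tid) {
                bool ascending = ((tid & k) == 0);
                if ((shared[tid] > shared[ixj]) == ascending) {
                    int tmp = shared[tid];
                    shared[tid] = shared[ixj];
                    shared[ixj] = tmp;
                }
            }
            // Intra-block barrier: only valid within a single block, which
            // is exactly what caps the sort at one block's worth of threads.
            __syncthreads();
        }
    }
    values[tid] = shared[tid];
}
```

Replacing each `__syncthreads()` with an inter-block barrier is what lets the sort scale past one block, which is how the paper's GPU synchronization is applied to bitonic sort.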
Q7. What is the reason why the barrier function can not guarantee that inter-block communication is correct?
As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed.
Q8. How many threads are used to check the elements of Arrayin in parallel?
It is worth noting that in the step 2) above, rather than having one thread to check all elements of Arrayin in serial as in [29], the authors use N threads to check the elements of Arrayin in parallel.
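Following the `Arrayin`/`Arrayout` description in the text, the lock-free barrier can be sketched as below. The step numbering mirrors the answer above; everything beyond the two flag arrays and the parallel check is an assumption, and per the memory-consistency caveat in Q7, a fence such as `__threadfence()` may be required before the flag writes on real hardware.

```cuda
// Lock-free inter-block barrier. Assumptions: one flag per block in Arrayin
// and Arrayout (both zero-initialized), gridDim.x <= blockDim.x so block 0
// has one checking thread per block, and goalVal is nonzero and fresh per use.
__device__ void gpu_sync_lock_free(int goalVal,
                                   volatile int *Arrayin,
                                   volatile int *Arrayout)
{
    int bid = blockIdx.x;
    int nBlocks = gridDim.x;

    // Step 1: thread 0 of each block announces this block's arrival.
    if (threadIdx.x == 0)
        Arrayin[bid] = goalVal;

    if (bid == 0) {
        // Step 2: block 0 uses N threads (one per block) to check Arrayin
        // in parallel rather than scanning it with a single thread.
        if (threadIdx.x < nBlocks)
            while (Arrayin[threadIdx.x] != goalVal) { /* busy-wait */ }
        __syncthreads();

        // Step 3: once all blocks have arrived, block 0 releases them.
        if (threadIdx.x < nBlocks)
            Arrayout[threadIdx.x] = goalVal;
    }

    // Step 4: thread 0 of each block waits for its release flag.
    if (threadIdx.x == 0)
        while (Arrayout[bid] != goalVal) { /* busy-wait */ }
    __syncthreads();
}
```

Because no two blocks ever write the same flag, this scheme needs no atomic operations, which is where its advantage over the lock-based counter comes from.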
Q9. How does the research on mapping dynamic programming work?
Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU uses graphics primitives [14], [15] in a task parallel fashion.