This work proposes two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization, and evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform, dynamic programming, and bitonic sort.
Abstract:
While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which, in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the micro-benchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
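The lock-based approach described in the abstract can be sketched as a single global atomic counter that every block increments on arrival. The following CUDA fragment is a minimal illustration of that idea, not the authors' exact code; the names `g_mutex` and `__gpu_sync_atomic` are chosen here for clarity.

```cuda
// Sketch of a GPU lock-based inter-block barrier: one global mutex
// counter, incremented once per block, spun on until all blocks arrive.
__device__ volatile int g_mutex = 0;

__device__ void __gpu_sync_atomic(int goalVal)
{
    // Only thread 0 of each block announces the block's arrival.
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_mutex, 1);
        // Spin until every block in the grid has incremented the mutex.
        while (g_mutex != goalVal) { /* busy-wait */ }
    }
    // Release the remaining threads of this block.
    __syncthreads();
}
```

The `goalVal` argument is the number of blocks expected at the barrier (typically incremented per barrier instance so the counter never needs resetting); the trailing `__syncthreads()` ensures no thread proceeds before thread 0 observes global completion.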
TL;DR: A single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model, the WaveRNN, and a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once.
TL;DR: A new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy that guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
TL;DR: Through micro-kernel benchmarks, it is shown that the PT approach can achieve up to an order-of-magnitude speedup over non-PT kernels, but can also result in performance loss in many cases.
TL;DR: This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol, called TC-Weak.
TL;DR: A model of Nvidia GPU hardware is proposed, which correctly models every behaviour witnessed in the authors' experiments, and is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.
TL;DR: This paper takes Smith-Waterman as a case study to explore the architectural features of Graphics Processing Units (GPUs) and evaluate the challenges the hardware architecture poses, as well as the software modifications needed to map the program architecture on to the GPU.
TL;DR: This paper presents two protocols that address factors hindering the performance of Herlihy's standard non-blocking protocol, and introduces a second, optimistic protocol that avoids copying, except in the case when a thread of control is delayed during its attempted update.
TL;DR: A fine-grained parallelization of a single instance of the dynamic programming (DP) algorithm that is mapped to the GPU is proposed, which incorporates a set of techniques aimed to substantially improve GPU performance: matrix re-alignment, coalesced memory access, tiling, and GPU (rather than CPU) synchronization.
TL;DR: An adaptive software combining tree is used to implement a scalable barrier with O(1) recognition latency, and it is shown how adaptive combining trees can be used to implement the fuzzy barrier.
Q1. What are the contributions mentioned in the paper "Inter-block gpu communication via fast barrier synchronization" ?
The authors propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, their GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
Q2. What are the future works in "Inter-block gpu communication via fast barrier synchronization" ?
As for future work, the authors will first investigate the reasons for the irregularity of the FFT's synchronization time versus the number of blocks in the kernel. Second, the authors will propose a general model to characterize algorithms' parallelism properties, based on which better performance can be obtained for their parallelization on multi- and many-core architectures.
Q3. What is the way to improve the performance of a bitonic sort?
For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABisort to O(n log n) with an adaptive data structure that enables merges to be done in linear time.
Q4. How do the authors allocate shared memory on an SM to each block?
In addition, the authors allocate all available shared memory on an SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint.
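This occupancy trick can be expressed at kernel launch time by requesting (nearly) all of an SM's shared memory as the dynamic shared-memory size, so the hardware scheduler can place at most one block per SM. The snippet below is an illustrative sketch, not the authors' code; the 16 KB figure matches GT200-era hardware and the kernel name is hypothetical.

```cuda
// Illustrative launch configuration: ask for almost all shared memory
// per block so no two blocks fit on the same SM.
extern __shared__ int sdata[];          // dynamically sized shared memory

__global__ void myKernel(int *data)
{
    sdata[threadIdx.x] = data[threadIdx.x];  // placeholder work
}

void launch(int *d_data, int numSMs, int threadsPerBlock)
{
    // ~16 KB per SM on GT200-era GPUs; leave a small margin for
    // kernel arguments and compiler-reserved shared memory.
    size_t smemPerBlock = 16 * 1024 - 256;
    myKernel<<<numSMs, threadsPerBlock, smemPerBlock>>>(d_data);
}
```

With one block pinned to each SM, every block is guaranteed to be resident simultaneously, which is a precondition for any spin-based inter-block barrier (a non-resident block could never reach the barrier, deadlocking the grid).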
Q5. What are the three well-known algorithms that the authors integrate into their synchronization approach?
In addition, the authors integrate each of their GPU synchronization approaches in a micro-benchmark and three well-known algorithms: FFT, dynamic programming, and bitonic sort.
Q6. How many threads can be sorted in a block?
Another parallel implementation of the bitonic sort is in the CUDA SDK [21], but there is only one block in the kernel to use the available barrier function __syncthreads(), thus restricting the maximum number of items that can be sorted to 512 — the maximum number of threads in a block.
Q7. What is the reason why the barrier function can not guarantee that inter-block communication is correct?
As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed.
Q8. How many threads are used to check the elements of Arrayin in parallel?
It is worth noting that in the step 2) above, rather than having one thread to check all elements of Arrayin in serial as in [29], the authors use N threads to check the elements of Arrayin in parallel.
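The parallel-check step described above can be sketched as follows. This is a hedged reconstruction of the lock-free barrier using the `Arrayin`/`Arrayout` names from the text; the function name and exact signature are illustrative, not the authors' verbatim code.

```cuda
// Sketch of the GPU lock-free inter-block barrier. Each block i sets
// Arrayin[i] on arrival; block 0 uses N threads (one per block) to
// check all N flags in parallel, then releases every block via Arrayout.
__device__ void __gpu_sync_lockfree(int goalVal,
                                    volatile int *Arrayin,
                                    volatile int *Arrayout)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int nBlocks = gridDim.x;

    // Step 1: thread 0 of each block signals this block's arrival.
    if (tid == 0)
        Arrayin[bid] = goalVal;

    if (bid == 0) {
        // Step 2: N threads of block 0 each spin on one flag,
        // checking the N elements of Arrayin in parallel.
        if (tid < nBlocks)
            while (Arrayin[tid] != goalVal) { /* busy-wait */ }
        __syncthreads();

        // Step 3: block 0 releases all blocks.
        if (tid < nBlocks)
            Arrayout[tid] = goalVal;
    }

    // Step 4: each block waits for its own release flag.
    if (tid == 0)
        while (Arrayout[bid] != goalVal) { /* busy-wait */ }
    __syncthreads();
}
```

Because no atomic operation is needed (each flag has exactly one writer), the scheme is lock-free; using `goalVal` as a per-barrier token avoids having to reset the flag arrays between successive barriers.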
Q9. How does the research on mapping dynamic programming work?
Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU uses graphics primitives [14], [15] in a task parallel fashion.