Inter-block GPU communication via fast barrier synchronization
Summary
Introduction
- The authors propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization.
- To quantify the execution time of each phase, the authors propose a general performance model that partitions the kernel execution time into three phases: kernel launch, computation, and synchronization.
- First, the authors propose two GPU synchronization strategies for inter-block synchronization.
- From their experimental results, though their proposed barrier synchronization is efficient, the low efficacy of __threadfence causes significant overhead, especially when the number of blocks in a kernel is large.
II. OVERVIEW OF CUDA ON THE NVIDIA GTX 280
- The NVIDIA GeForce GTX 280 GPU card consists of 240 streaming processors (SPs), each clocked at 1296 MHz.
- These 240 SPs are grouped into 30 streaming multiprocessors (SMs), each of which contains 8 streaming processors.
- Global memory can be accessed by all threads and blocks on the GPU, and thus, is often used to communicate data across different blocks via a CPU barrier synchronization, as explained later.
- In general, only the compute-intensive and data-parallel parts of a program are parallelized with CUDA and are implemented as kernels that are compiled to the device instruction set.
- Currently, this type of data communication occurs via global memory, followed by a barrier synchronization via the CPU.
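A minimal sketch of this pattern, assuming a hypothetical kernel compute_phase and a grid of 30 blocks: each kernel launch computes one phase on global memory, and the CPU-side synchronization between launches acts as the global barrier. (cudaDeviceSynchronize is the modern replacement for the cudaThreadSynchronize call the paper uses.)

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one phase of computation whose results other
// blocks must see before the next phase starts.
__global__ void compute_phase(float *data, int n, int phase) {
    // ... read/write global memory for this phase ...
}

// CPU barrier synchronization: relaunch the kernel once per phase.
void run_phases(float *d_data, int n, int numPhases) {
    for (int phase = 0; phase < numPhases; ++phase) {
        compute_phase<<<30, 256>>>(d_data, n, phase);
        cudaDeviceSynchronize();  // explicit CPU barrier between phases
    }
}
```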
IV. A MODEL FOR KERNEL EXECUTION TIME AND SPEEDUP
- The kernel launch time depends on the data transfer rate from the host to the device as well as the size of kernel code and parameters.
- In the CPU explicit synchronization, the kernel function __kernel_func is followed by the function cudaThreadSynchronize, which does not return until all prior operations on the device are completed.
- Since kernel launch is an asynchronous operation, if there are multiple kernel launches, the launch time of a kernel can be overlapped with the computation and synchronization time of previous kernels.
- First, kernel launch time can be combined with the synchronization time in the CPU explicit synchronization; second, it can be overlapped in the CPU implicit synchronization. According to Equation (5) and the corresponding ρ values, the maximum speedups of the three aforementioned algorithms when only the computation is accelerated are shown in Table II; a reconstruction of the model follows this list.
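A plausible reconstruction of the model in LaTeX, assuming the symbols $t_O$ (kernel launch time), $t_C$ (computation time), $t_S$ (synchronization time), and $M$ kernel launches; the paper's exact notation may differ:

$$T = \sum_{i=1}^{M}\left(t_O^{(i)} + t_C^{(i)} + t_S^{(i)}\right)$$

If only the computation is accelerated and $\rho = t_C/(t_O + t_C + t_S)$ is the fraction of kernel time spent computing, the achievable speedup is bounded by

$$\text{speedup} \le \frac{1}{1-\rho},$$

which is consistent with the per-algorithm maximum speedups reported in Table II.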
V. PROPOSED GPU SYNCHRONIZATION
- Since, in the CUDA programming model, the execution of a thread block is non-preemptive, care must be taken to avoid deadlocks when designing GPU synchronization.
- Consider a scenario where multiple thread blocks are mapped to one SM and the active block waits at a global barrier: the inactive blocks can never be scheduled to reach the barrier, so the active block spins forever and the kernel deadlocks.
- The authors' solution to this problem is to have a one-to-one mapping between thread blocks and SMs.
- In other words, for a GPU with ‘Y’ SMs, the authors ensure that at most ‘Y’ blocks are used in the kernel, as sketched after this list.
- In the following discussion, the authors will present two alternative GPU synchronization designs: GPU lock-based synchronization and GPU lock-free synchronization.
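A sketch of how the one-to-one mapping could be arranged with the standard CUDA runtime API; my_kernel and threadsPerBlock are hypothetical placeholders:

```cuda
#include <cuda_runtime.h>

// Launch at most one block per SM so that every block is resident for the
// entire kernel and no block can be starved while others spin at the barrier.
void launch_one_block_per_sm() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int numBlocks = prop.multiProcessorCount;  // e.g., 30 on a GTX 280
    // my_kernel<<<numBlocks, threadsPerBlock>>>(...);  // hypothetical launch
}
```

The authors additionally allocate all available shared memory on an SM to each block (see Q4 below) so that the occupancy limit, not just the grid size, enforces one block per SM.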
A. GPU Lock-Based Synchronization
- The basic idea of GPU lock-based synchronization [31] is to use a global mutex variable to count the number of thread blocks that reach the synchronization point.
- The leading thread will then repeatedly compare g_mutex to a target value goalVal.
- The value of goalVal is then incremented by N, the number of blocks in the kernel, each time the barrier function is called.
- In the GPU lock-based synchronization, the execution time of the barrier function __gpu_sync consists of three parts — atomic addition, checking of g_mutex, and synchronization of threads within a block via __syncthreads.
- The atomic addition can only be executed sequentially by different blocks, while the g_mutex checking and intra-block synchronization can be executed in parallel.
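A minimal sketch of the lock-based barrier as described above, assuming a 1-D thread block; the names g_mutex and goalVal follow the summary, but the code may differ in detail from the paper's:

```cuda
// Global counter of blocks that have reached the barrier. Declared volatile
// so the spinning thread re-reads it from memory on every loop iteration.
__device__ volatile int g_mutex = 0;

__device__ void __gpu_sync(int goalVal) {
    // Only the leading thread of each block updates and polls the mutex.
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_mutex, 1);             // serialized across blocks
        while (g_mutex != goalVal) { /* spin */ }  // wait for all blocks
    }
    __syncthreads();  // release the remaining threads of this block
}
```

Because g_mutex is never reset, the i-th barrier call passes goalVal = i * N, where N is the number of blocks in the kernel; this is the increment-by-N behavior noted above.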
B. GPU Lock-Free Synchronization
- In the GPU lock-based synchronization, the mutex variable g_mutex is incremented with the atomic function atomicAdd; the lock-free approach below avoids atomic operations altogether.
- As shown in Figure 6, their lock-free synchronization approach uses two arrays Arrayin and Arrayout to coordinate the synchronization requests from various blocks.
- The algorithm proceeds in three steps, sketched below: 1) the leading thread of each block sets its element of Arrayin to goalVal; 2) the first N threads of block 0 check, in parallel, that all N elements of Arrayin equal goalVal, and then set the corresponding elements of Arrayout to goalVal; 3) a block continues its execution once its leading thread sees the corresponding element in Arrayout set to goalVal.
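A sketch following the three steps above; the array names come from the summary, thread indexing is simplified to 1-D, and the code assumes blockDim.x >= gridDim.x so block 0 can watch one flag per thread:

```cuda
__device__ void __gpu_sync(int goalVal,
                           volatile int *Arrayin, volatile int *Arrayout) {
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int nBlockNum = gridDim.x;

    // Step 1: the leading thread of each block announces its arrival.
    if (tid == 0) Arrayin[bid] = goalVal;

    if (bid == 0) {
        // Step 2: the first nBlockNum threads of block 0 each watch one
        // Arrayin flag in parallel (no atomic operations anywhere).
        if (tid < nBlockNum) {
            while (Arrayin[tid] != goalVal) { /* spin */ }
        }
        __syncthreads();
        if (tid < nBlockNum) Arrayout[tid] = goalVal;
    }

    // Step 3: each block continues once its Arrayout flag is set.
    if (tid == 0) {
        while (Arrayout[bid] != goalVal) { /* spin */ }
    }
    __syncthreads();
}
```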
C. Synchronization Time Verification via a Micro-benchmark
- To verify the execution time of the synchronization function __gpu_sync for each synchronization method, a micro-benchmark that computes the mean of two floats 10,000 times is used (a sketch follows this list).
- Here, each result is the average of three runs.
- From Figure 8, the computation time is only about 5 ms, while the time needed by the CPU implicit synchronization is about 60 ms, i.e., 12 times the computation time.
- For the GPU lock-based synchronization, the synchronization time is linear in the number of blocks in a kernel: more synchronization time is needed for a kernel with a larger number of blocks. (A footnote notes that, since at most 30 blocks can be resident on a GTX 280, the threads that check Arrayin are in the same warp and execute in parallel.)
- The reason is that, as the authors analyzed in Section V-A, more blocks mean more atomic add operations must be executed for the synchronization.
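A sketch of what such a micro-benchmark could look like; the kernel and array names are hypothetical, and __gpu_sync is the lock-based barrier sketched in Section V-A (the lock-free variant would add the two arrays). The computation is deliberately trivial so the measured time is dominated by synchronization:

```cuda
__global__ void mean_microbenchmark(const float *a, const float *b, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int iter = 1; iter <= 10000; ++iter) {
        out[i] = (a[i] + b[i]) * 0.5f;  // mean of two floats
        __gpu_sync(iter * gridDim.x);   // goalVal grows by the block count per call
    }
}
```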
VI. ALGORITHMS USED FOR PERFORMANCE EVALUATION
- Inter-block synchronization can be used in many algorithms.
- The authors choose three of them that can benefit from their proposed GPU synchronization methods.
- The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4].
A. Fast Fourier Transformation
- A Discrete Fourier Transformation (DFT) transforms a sequence of values into its frequency components or, inversely, converts the frequency components back to the original data sequence.
- Fast Fourier Transform (FFT) is an efficient algorithm for computing the DFT.
- The difference in computational cost can be substantial for long data sequences, especially those with thousands or millions of points.
- Within each iteration, the computation of different points is independent and can be done in parallel, because each point depends only on points from the previous iteration.
- The barrier used here can be multiple kernel launches (CPU synchronization) or the GPU synchronization approaches proposed in this paper.
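A sketch of the GPU-synchronized variant, with the butterfly body elided and names hypothetical: all logn stages run inside one kernel launch, with __gpu_sync (using the lock-based signature from the Section V-A sketch) as the inter-stage barrier instead of one kernel launch per stage.

```cuda
__global__ void fft_all_stages(float2 *data, int n, int logn) {
    for (int stage = 1; stage <= logn; ++stage) {
        // ... butterfly computation for the points this thread owns,
        //     reading only results produced in stage - 1 ...
        __gpu_sync(stage * gridDim.x);  // all blocks finish this stage first
    }
}
```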
C. Bitonic Sort
- Bitonic sort, devised by Ken Batcher [4], is one of the fastest sorting networks [13], a special type of sorting algorithm whose sequence of comparisons is fixed in advance.
- The main idea behind bitonic sort is a divide-and-conquer strategy.
- In the divide step, the input sequence is divided into two subsequences, and each is sorted with bitonic sort itself, one in ascending order and the other in descending order.
- In the conquer step, with the two sorted subsequences as the input, the bitonic merge is used to combine them to get the whole sorted sequence [13].
- In each iteration, the numbers to be sorted are divided into pairs, and a compare-and-swap operation is applied to each pair; this can be executed in parallel for different pairs, as the sketch below shows.
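A standard CUDA formulation of one compare-and-swap step (the common SDK-style kernel, not necessarily the authors' exact code). With the proposed GPU barriers, the host-side loop over (k, j) steps can move inside a single kernel, with __gpu_sync between steps:

```cuda
// One bitonic step: each thread handles one compare-and-swap pair.
// k is the current subsequence size, j the compare distance; the host
// loops k = 2, 4, ..., n and j = k/2, k/4, ..., 1.
__global__ void bitonic_step(float *a, int j, int k) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ixj = i ^ j;              // index of this thread's partner
    if (ixj > i) {
        if ((i & k) == 0) {                // ascending part of the sequence
            if (a[i] > a[ixj]) { float t = a[i]; a[i] = a[ixj]; a[ixj] = t; }
        } else {                           // descending part
            if (a[i] < a[ixj]) { float t = a[i]; a[i] = a[ixj]; a[ixj] = t; }
        }
    }
}
```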
VII. EXPERIMENTAL RESULTS AND ANALYSIS
A. Overview
- To evaluate the performance of their proposed GPU synchronization approaches, the authors implement them in the three algorithms described in Section VI.
- Corresponding to the best performance of each algorithm with each synchronization approach, the percentages of computation time and synchronization time are presented and analyzed.
- The authors' experiments are performed on a GeForce GTX 280 GPU card, which has 30 SMs and 240 processing cores clocked at 1296 MHz.
- The host machine has an Intel Core 2 Duo processor with 2 MB of L2 cache, clocked at 2.2 GHz.
- As in the micro-benchmark, each result is the average of three runs.
B. Kernel Execution Time
- Figure 10 shows the decrease in kernel execution time with their proposed GPU synchronization approaches and its variation versus the number of blocks in the kernel.
- When the number of blocks in the kernel is larger than 30 or smaller than 9, the kernel execution time is higher than with a block count between 9 and 30.
- Figure 10(a) shows the performance of FFT, Figure 10(b) is for SWat, and Figure 10(c) displays the kernel execution time of bitonic sort.
- Third, the kernel execution time difference between the CPU implicit synchronization and the proposed GPU synchronization is much smaller for FFT than for SWat and bitonic sort.
- The more blocks in the kernel, the greater the performance improvement of the lock-free approach over the GPU lock-based synchronization approach.
C. Synchronization Time
- The authors show the synchronization time variation versus the number of blocks in the kernel.
- Here, the synchronization time is the difference between the total kernel execution time and the computation time; the latter is obtained by running each algorithm with the GPU synchronization approach but with the synchronization function __gpu_sync removed (a sketch of the timing method follows this list).
- With 9 blocks in the kernel, the time needed for the GPU lock-based synchronization is close to that of the GPU lock-free synchronization.
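A standard way to obtain these times with CUDA events (all API calls below are real CUDA runtime functions; the kernel launch is a hypothetical placeholder). The synchronization time is then the timed total minus the time of a build with __gpu_sync removed:

```cuda
#include <cuda_runtime.h>

// Returns the elapsed time of one kernel launch in milliseconds.
float time_kernel_ms() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    // my_kernel<<<blocks, threads>>>(...);  // hypothetical kernel launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```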
D. Percentages of the Computation Time and the Synchronization Time
- Figure 12 shows the performance breakdown in percentage of the three algorithms when different synchronization approaches are used.
- As a result, changes in synchronization time have less impact on FFT's total kernel execution time than on that of SWat and bitonic sort.
- This indicates that inter-block communication time occupies a large part of the total execution time in some algorithms.
- Thus, decreasing the synchronization time can greatly improve the performance of some algorithms.
E. Costs of Guaranteeing Inter-Block Communication Correctness
- As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed.
- Consider the GPU lock-free synchronization: from Figure 13(a), for FFT, when the number of blocks in the kernel is larger than 14, more time is needed to execute the kernel with the GPU lock-free synchronization (once __threadfence is included) than with the CPU synchronization.
- The corresponding threshold is 18 blocks for SWat and 12 for bitonic sort.
- From these results, although the barrier itself can be implemented efficiently in software, the cost of guaranteeing correctness with the function __threadfence is very high; guaranteeing that writes to shared or global memory are read correctly via __threadfence is not an efficient approach (see the sketch below for where the fence goes).
- This overhead is not expected on the next generation of NVIDIA GPUs, “Fermi”, which has a different architecture and a more efficient implementation of __threadfence; the fence is still needed for correct inter-block data communication.
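A minimal sketch of where the fence goes, reusing the lock-free barrier sketched in Section V-B; the exact placement in the paper may differ:

```cuda
__device__ void __gpu_sync_fenced(int goalVal,
                                  volatile int *Arrayin, volatile int *Arrayout) {
    // Commit this block's global-memory writes before signaling arrival,
    // so other blocks read correct data after the barrier.
    __threadfence();
    __gpu_sync(goalVal, Arrayin, Arrayout);  // barrier from Section V-B
}
```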
VIII. CONCLUSION
- In the current GPU architecture, inter-block communication on GPUs requires a barrier synchronization.
- The authors propose two synchronization approaches: GPU lock-based synchronization and GPU lock-free synchronization.
- For each of these methods, the authors quantify its efficacy with the aforementioned performance model.
- In addition, based on the kernel execution time model, the authors partition the kernel execution time into the computation time and the synchronization time for the three algorithms.
- As a result, better approaches such as efficient hardware barrier implementation or memory flush functions are needed to support efficient and correct inter-block communication on a GPU.
Frequently Asked Questions
Q2. What are the future works in "Inter-block gpu communication via fast barrier synchronization" ?
As for future work, the authors will further investigate the reasons for the irregularity of the FFT's synchronization time versus the number of blocks in the kernel. Second, the authors will propose a general model to characterize algorithms' parallelism properties, based on which better performance can be obtained for their parallelization on multi- and many-core architectures.
Q3. What is the way to improve the performance of a bitonic sort?
For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABiSort to O(n log n) with an adaptive data structure that enables merges to be done in linear time.
Q4. How do the authors allocate shared memory on an SM to each block?
In addition, the authors allocate all available shared memory on an SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint.
Q5. What are the three well-known algorithms that the authors integrate into their synchronization approach?
In addition, the authors integrate each of their GPU synchronization approaches in a micro-benchmark and three well-known algorithms: FFT, dynamic programming, and bitonic sort.
Q6. How many threads can be sorted in a block?
Another parallel implementation of the bitonic sort is in the CUDA SDK [21], but there is only one block in the kernel to use the available barrier function __syncthreads(), thus restricting the maximum number of items that can be sorted to 512 — the maximum number of threads in a block.
Q7. What is the reason why the barrier function can not guarantee that inter-block communication is correct?
As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed.
Q8. How many threads are used to check the elements of Arrayin in parallel?
It is worth noting that in the step 2) above, rather than having one thread to check all elements of Arrayin in serial as in [29], the authors use N threads to check the elements of Arrayin in parallel.
Q9. How does the research on mapping dynamic programming work?
Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU uses graphics primitives [14], [15] in a task parallel fashion.