Proceedings ArticleDOI

Inter-block GPU communication via fast barrier synchronization

Shucai Xiao1, Wu-chun Feng1
19 Apr 2010-pp 1-12
TL;DR: This work proposes two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization, and evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms — Fast Fourier Transform, dynamic programming, and bitonic sort.
Abstract: While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Interblock communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms — Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the microbenchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.

Summary (5 min read)

Introduction

  • The authors propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization.
  • To quantify the execution time of each phase, the authors propose a general performance model that partitions the kernel execution time into the three aforementioned phases.
  • First, the authors propose two GPU synchronization strategies for inter-block synchronization.
  • From their experimental results, though their proposed barrier synchronization is efficient, the low efficacy of __threadfence causes significant overhead, especially when the number of blocks in a kernel is large.

II. OVERVIEW OF CUDA ON THE NVIDIA GTX 280

  • The NVIDIA GeForce GTX 280 GPU card consists of 240 streaming processors (SPs), each clocked at 1296 MHz.
  • These 240 SPs are grouped into 30 streaming multiprocessors (SMs), each of which contains 8 streaming processors.
  • Global memory can be accessed by all threads and blocks on the GPU, and thus, is often used to communicate data across different blocks via a CPU barrier synchronization, as explained later.
  • In general, only the compute-intensive and data-parallel parts of a program are parallelized with CUDA and are implemented as kernels that are compiled to the device instruction set.
  • Currently, this type of data communication occurs via global memory, followed by a barrier synchronization via the CPU.

IV. A MODEL FOR KERNEL EXECUTION TIME AND SPEEDUP

  • The kernel launch time depends on the data transfer rate from the host to the device as well as the size of kernel code and parameters.
  • As the authors can see, in the CPU explicit synchronization, the kernel function __kernel_func is followed by the function cudaThreadSynchronize, which will not return until all prior operations on the device are completed.
  • Since kernel launch is an asynchronous operation, if there are multiple kernel launches, kernel launch time can be overlapped by previous kernels’ computation time and synchronization time.
  • First, kernel launch time can be combined with the synchronization time in the CPU explicit synchronization; second, it can be overlapped in the CPU implicit synchronization. According to Equation (5), corresponding to these ρ values, if only the computation is accelerated, the maximum speedups of the three aforementioned algorithms are as shown in Table II.

V. PROPOSED GPU SYNCHRONIZATION

  • Since, in the CUDA programming model, the execution of a thread block is non-preemptive, care must be taken to avoid deadlocks in GPU synchronization design.
  • Consider a scenario where multiple thread blocks are mapped to one SM and the active block is waiting for the completion of a global barrier.
  • The authors solution to this problem is to have a one-to-one mapping between thread blocks and SMs.
  • In other words, for a GPU with ‘Y’ SMs, the authors ensure that at most ‘Y’ blocks are used in the kernel.
  • In the following discussion, the authors will present two alternative GPU synchronization designs: GPU lock-based synchronization and GPU lock-free synchronization.

A. GPU Lock-Based Synchronization

  • The basic idea of GPU lock-based synchronization [31] is to use a global mutex variable to count the number of thread blocks that reach the synchronization point.
  • The leading thread will then repeatedly compare g_mutex to a target value goalVal.
  • The value of goalVal is then incremented by N each time the barrier function is subsequently called.
  • In the GPU lock-based synchronization, the execution time of the barrier function __gpu_sync consists of three parts — atomic addition, checking of g_mutex, and synchronization of threads within a block via __syncthreads.
  • The atomic addition can only be executed sequentially by different blocks, while the g_mutex checking and intra-block synchronization can be executed in parallel.
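
Following the two points above, a sketch of the resulting cost model (our notation, not the paper's): if t_a, t_c, and t_s denote the times of one atomic addition, the g_mutex check, and the intra-block __syncthreads, respectively, and N is the number of blocks in the kernel, then

t_{lock\text{-}based} \approx N \cdot t_a + t_c + t_s

since the N atomic additions serialize across blocks while the checking and the intra-block synchronization proceed in parallel.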

B. GPU Lock-Free Synchronization

  • In the GPU lock-based synchronization, the mutex variable g_mutex is incremented with the atomic function atomicAdd.
  • As shown in Figure 6, their lock-free synchronization approach uses two arrays Arrayin and Arrayout to coordinate the synchronization requests from various blocks.
  • The algorithm is outlined in three steps: 1) when a block reaches the barrier, its leading thread sets the corresponding element in Arrayin to goalVal; 2) the threads of one designated block each monitor one element of Arrayin and, once all elements equal goalVal, set the corresponding elements in Arrayout to goalVal; 3) a block will continue its execution once its leading thread sees the corresponding element in Arrayout set to goalVal (see the sketch below).
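
A minimal sketch of the lock-free barrier described above (our reconstruction from the summary, not the paper's exact code; it assumes a 1-D grid and 1-D blocks, with blockDim.x >= gridDim.x so that the checking block has one thread per block):

#define MAX_BLOCKS 30   // at most 30 blocks on a GTX 280 (one block per SM)

// arrival and release flags, one element per block (zero-initialized by default)
__device__ volatile int Arrayin[MAX_BLOCKS];
__device__ volatile int Arrayout[MAX_BLOCKS];

__device__ void __gpu_sync_lockfree(int goalVal)
{
    int tid     = threadIdx.x;
    int bid     = blockIdx.x;
    int nBlocks = gridDim.x;

    // 1) the leading thread of each block announces that its block has arrived
    if (tid == 0) {
        Arrayin[bid] = goalVal;
    }

    // 2) one designated block (block 0 here) watches the arrival flags, one thread
    //    per block, and releases all blocks once every flag equals goalVal
    if (bid == 0) {
        if (tid < nBlocks) {
            while (Arrayin[tid] != goalVal) { /* spin */ }
        }
        __syncthreads();
        if (tid < nBlocks) {
            Arrayout[tid] = goalVal;
        }
    }

    // 3) each block continues once its leading thread sees its release flag set
    if (tid == 0) {
        while (Arrayout[bid] != goalVal) { /* spin */ }
    }
    __syncthreads();
}

No atomic operation is needed: each flag has a single writer, and goalVal grows on every call, so the flags never have to be reset.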

C. Synchronization Time Verification via a Micro-benchmark

  • To verify the execution time of the synchronization function __gpu_sync for each synchronization method, a micro-benchmark that computes the mean of two floats 10,000 times is used.
  • Here, each result is the average of three runs.
  • From Figure 8, the computation time is only about 5ms, while the time needed by the CPU implicit synchronization is about 60ms, which is 12 times the computation time.
  • For the GPU lock-based synchronization, the synchronization time is linear in the number of blocks in the kernel, and more synchronization time is needed for a kernel with a larger number of blocks. (Since at most 30 blocks can be set on a GTX 280, the threads that check Arrayin are in the same warp and are executed in parallel.)
  • The reason is that, as the authors analyzed in Section V-A, more blocks mean that more atomic add operations must be executed for the synchronization.
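
A sketch of such a micro-benchmark (our reconstruction, not the paper's code): the per-iteration computation is tiny, so the measured kernel time is dominated by the 10,000 barrier operations.

__device__ void __gpu_sync(int goalVal);   // one of the barrier functions from Section V

__global__ void mean_microbenchmark(const float *a, const float *b, float *out)
{
    int i       = blockIdx.x * blockDim.x + threadIdx.x;
    int N       = gridDim.x;                // number of blocks in the kernel
    int goalVal = 0;

    for (int iter = 0; iter < 10000; ++iter) {
        out[i] = (a[i] + b[i]) * 0.5f;      // the (tiny) computation: mean of two floats

        goalVal += N;                       // target grows by N on every barrier call
        __gpu_sync(goalVal);                // with CPU synchronization, each iteration
                                            // would instead be a separate kernel launch
    }
}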

VI. ALGORITHMS USED FOR PERFORMANCE EVALUATION

  • Inter-block synchronization can be used in many algorithms.
  • The authors choose three of them that can benefit from their proposed GPU synchronization methods.
  • The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4].

A. Fast Fourier Transformation

  • A Discrete Fourier Transformation (DFT) transforms a sequence of values into its frequency components or, inversely, converts the frequency components back to the original data sequence.
  • Fast Fourier Transformation (FFT) is a fast way of DFT computation.
  • The computation difference can be substantial for long data sequences, especially when the sequence has thousands or millions of points.
  • Within each iteration, the computation of different points is independent and can be done in parallel, because each point depends only on points from the previous iteration.
  • The barrier used here can be multiple kernel launches (CPU synchronization) or the GPU synchronization approaches proposed in this paper.

C. Bitonic Sort

  • Bitonic sort is one of the fastest sorting networks [13], which is a special type of sorting algorithm devised by Ken Batcher [4].
  • The main idea behind bitonic sort is using a divide-and-conquer strategy.
  • In the divide step, the input sequence is divided into two subsequences and each sequence is sorted with bitonic sort itself, where one is in the ascending order and the other is in the descending order.
  • In the conquer step, with the two sorted subsequences as the input, the bitonic merge is used to combine them to get the whole sorted sequence [13].
  • In each iteration, the numbers to be sorted are divided into pairs and a compare-and-swap operation is applied to each pair; different pairs can be processed in parallel.
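
A sketch of one compare-and-swap step of bitonic sort (a generic textbook formulation, not the paper's kernel): each thread handles one pair, different pairs are independent, and a barrier is required between steps.

__global__ void bitonic_step(int *keys, unsigned int j, unsigned int k)
{
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ixj = i ^ j;                   // index of this thread's partner

    if (ixj > i) {                              // each pair is handled exactly once
        bool ascending = ((i & k) == 0);        // direction of this bitonic subsequence
        if ((keys[i] > keys[ixj]) == ascending) {
            int tmp   = keys[i];                // compare-and-swap
            keys[i]   = keys[ixj];
            keys[ixj] = tmp;
        }
    }
}

// Illustrative host-side driver:
//   for (k = 2; k <= n; k <<= 1)
//     for (j = k >> 1; j > 0; j >>= 1)
//       bitonic_step<<<n / 256, 256>>>(d_keys, j, k);
// With CPU synchronization every inner step is a separate kernel launch; with the
// GPU barriers proposed in this paper the whole loop nest can live inside one kernel.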

A. Overview

  • To evaluate the performance of their proposed GPU synchronization approaches, the authors implement them in the three algorithms described in Section VI.
  • Corresponding to the best performance of each algorithm with each synchronization approach, the percentages of computation time and synchronization time are presented and analyzed.
  • The authors' experiments are performed on a GeForce GTX 280 GPU card, which has 30 SMs and 240 processing cores clocked at 1296 MHz.
  • For the host machine, the processor is an Intel Core 2 Duo CPU with 2 MB of L2 cache and a clock speed of 2.2 GHz.
  • As in the micro-benchmark, each result is the average of three runs.

B. Kernel Execution Time

  • Figure 10 shows the kernel execution time decrease achieved with their proposed GPU synchronization approaches and its variation versus the number of blocks in the kernel.
  • This is because, when the number of blocks in the kernel is larger than 30 or less than 9, kernel execution times are higher than with a block number between 9 and 30.
  • Figure 10(a) shows the performance of FFT, Figure 10(b) is for SWat, and Figure 10(c) displays the kernel execution time of bitonic sort.
  • Third, kernel execution time difference between the CPU implicit synchronization and the proposed GPU synchronization of FFT is much less than that of SWat and bitonic sort.
  • The more blocks are set in the kernel, the more performance improvement can be obtained compared to the GPU lock-based synchronization approach.

C. Synchronization Time

  • The authors show the synchronization time variation versus the number of blocks in the kernel.
  • Here, the synchronization time is the difference between the total kernel execution time and the computation time; the computation time is obtained by running an implementation of each algorithm with the GPU synchronization approach but with the synchronization function __gpu_sync removed (see the timing sketch after this list).
  • With 9 blocks in the kernel, the time needed for the GPU lock-based synchronization is close to that of the GPU lock-free synchronization.
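
A sketch of how this separation can be measured (our illustration; the launcher functions are hypothetical placeholders):

#include <cuda_runtime.h>

// times one kernel run in milliseconds using CUDA events
float time_kernel_ms(void (*launch)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();                          // launches the kernel under test
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// float total   = time_kernel_ms(launch_with_gpu_sync);
// float compute = time_kernel_ms(launch_without_gpu_sync);  // __gpu_sync removed
// float sync    = total - compute;                          // synchronization time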

D. Percentages of the Computation Time and the Synchronization Time

  • Figure 12 shows the performance breakdown in percentage of the three algorithms when different synchronization approaches are used.
  • As a result, synchronization time changes have less impact on the total kernel execution time compared to SWat and bitonic sort.
  • This indicates that inter-block communication time occupies a large part of the total execution time in some algorithms.
  • Thus, decreasing the synchronization time can greatly improve the performance of some algorithms.

E. Costs of Guaranteeing Inter-Block Communication Correctness

  • As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed.
  • Considering the GPU lock-free synchronization, Figure 13(a) shows that, for FFT, when the number of blocks in the kernel is larger than 14, more time is needed to execute the kernel with the GPU lock-free synchronization.
  • The threshold is 18 and 12 for SWat and bitonic sort, respectively.
  • From these results, though the barrier itself can be implemented efficiently in software, the cost of guaranteeing correctness with the function __threadfence is very high; in other words, using __threadfence to guarantee that writes to shared memory or global memory are read correctly is not an efficient approach (a sketch of where __threadfence would sit in the barrier follows this list).
  • This overhead is not expected on the next generation of NVIDIA GPU, "Fermi", on which, with a more efficient implementation of __threadfence and a different architecture, __threadfence is needed for correct inter-block data communication.
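
A sketch of where __threadfence would sit in the lock-based barrier of Figure 4 (our reading of this section, not the paper's exact code; g_mutex is the mutex variable from Figure 4):

__device__ void __gpu_sync_fenced(int goalVal)
{
    int tid_in_block = threadIdx.x * blockDim.y + threadIdx.y;

    __syncthreads();     // wait until every thread in the block has issued its writes
    __threadfence();     // make this thread's prior global-memory writes visible to
                         // threads in other blocks (available since CUDA 2.2)

    if (tid_in_block == 0) {
        atomicAdd((int *)&g_mutex, 1);
        while (g_mutex != goalVal) { /* spin */ }
    }
    __syncthreads();
}

It is this extra fence, rather than the barrier logic itself, that the authors identify as the dominant cost once the number of blocks grows.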

VIII. CONCLUSION

  • In the current GPU architecture, inter-block communication on GPUs requires barrier synchronization.
  • Second, the authors propose two synchronization approaches: GPU lock-based synchronization and GPU lock-free synchronization.
  • For each of these methods, the authors quantify its efficacy with the aforementioned performance model.
  • In addition, based on the kernel execution time model, the authors partition the kernel execution time into the computation time and the synchronization time for the three algorithms.
  • As a result, better approaches such as efficient hardware barrier implementation or memory flush functions are needed to support efficient and correct inter-block communication on a GPU.



Inter-Block GPU Communication via Fast
Barrier Synchronization
Shucai Xiao and Wu-chun Feng
Department of Electrical and Computer Engineering
Department of Computer Science
Virginia Tech
Blacksburg, Virginia 24061
Email: {shucai, wfeng}@vt.edu
Abstract—While GPGPU stands for general-purpose compu-
tation on graphics processing units, the lack of explicit support
for inter-block communication on the GPU arguably hampers its
broader adoption as a general-purpose computing device. Inter-
block communication on the GPU occurs via global memory
and then requires barrier synchronization across the blocks,
i.e., inter-block GPU communication via barrier synchronization.
Currently, such synchronization is only available via the CPU,
which in turn, can incur significant overhead.
We propose two approaches for inter-block GPU communi-
cation via barrier synchronization: GPU lock-based synchro-
nization and GPU lock-free synchronization. We then evaluate
the efficacy of each approach via a micro-benchmark as well
as three well-known algorithms: Fast Fourier Transform
(FFT), dynamic programming, and bitonic sort. For the micro-
benchmark, the experimental results show that our GPU lock-
free synchronization performs 8.4 times faster than CPU explicit
synchronization and 4.0 times faster than CPU implicit synchro-
nization. When integrated with the FFT, dynamic programming,
and bitonic sort algorithms, our GPU lock-free synchronization
further improves performance by 10%, 26%, and 40%, respec-
tively, and ultimately delivers an overall speed-up of 70x, 13x,
and 24x, respectively.
I. INTRODUCTION
Today, improving the computational capability of a proces-
sor comes from increasing its number of processing cores
rather than increasing its clock speed. This is reflected in
both traditional multi-core processors and many-core graphics
processing units (GPUs).
Originally, GPUs were designed for graphics-based ap-
plications. With the elimination of key architecture limita-
tions, GPUs have evolved to become more widely used for
general-purpose computation, i.e., general-purpose computa-
tion on the GPU (GPGPU). Programming models such as
NVIDIA's Compute Unified Device Architecture (CUDA) [22]
and AMD/ATI’s Brook+ [2] enable applications to be more
easily mapped onto the GPU. With these programming models,
more and more applications have been mapped to GPUs and
accelerated [6], [7], [10], [12], [18], [19], [23], [24], [26], [30].
However, GPUs typically map well only to data or task
parallel applications whose execution requires minimal or even
no inter-block communication [9], [24], [26], [30]. Why?
There exists no explicit support for inter-block communication
on the GPU. Currently, such inter-block communication occurs
via global memory and requires a barrier synchronization to
complete the communication, which is (inefficiently) imple-
mented via the host CPU. Hereafter, we refer to such CPU-
based barrier synchronization as CPU synchronization.
In general, when a program (i.e., kernel) executes on the
GPU, its execution time consists of three phases: (1) kernel
launch to the GPU, (2) computation on the GPU, and (3)
inter-block GPU communication via barrier synchronization.¹
With different approaches for synchronization, the percentage
of time that each of these three phases takes will differ.
Furthermore, some of the phases may overlap in time. To
quantify the execution time of each phase, we propose a
general performance model that partitions the kernel execution
time into the three aforementioned phases. Based on our model
and code profiling while using the current state of the art
in barrier synchronization, i.e., CPU implicit synchronization
(see Section IV), inter-block communication via barrier syn-
chronization can consume more than 50% of the total kernel
execution time, as shown in Table I.
TABLE I
PERCENT OF TIME SPENT ON INTER-BLOCK COMMUNICATION

Algorithms                                      FFT      SWat     Bitonic sort
% of time spent on inter-block communication    17.8%    49.2%    59.6%

(SWat: Smith-Waterman)
Hence, in contrast to previous work that mainly focuses on
optimizing the GPU computation, we focus on reducing the
inter-block communication time via barrier synchronization.
To achieve this, we propose a set of GPU synchronization
strategies, which can synchronize the execution of different
blocks without the involvement of the host CPU, thus avoiding
the costly operation of a kernel launch from the CPU to
GPU. To the best of our knowledge, this work is the first that
systematically addresses how to better support more general-
purpose computation by significantly reducing the inter-block
¹Because inter-block GPU communication time is dominated by the inter-block synchronization time, we will use inter-block synchronization time instead of inter-block GPU communication time hereafter.

communication time (rather than the computation time) on a
GPU.
We propose two types of GPU synchronization, one with
locks and the other without. For the former, we use one
mutual-exclusive (mutex) variable and an atomic add operation
to implement GPU lock-based synchronization. With respect to
the latter, which we refer to as GPU lock-free synchronization,
we use two arrays, instead of mutex variables, and eliminate
the need for atomic operations. With this approach, each thread
within a single block controls the execution of a different
block, and the intra-block synchronization is achieved by
synchronizing the threads within the block with the existing
barrier function __syncthreads().
We then introduce these GPU synchronization strategies into three different algorithms, Fast Fourier Transform (FFT) [16], dynamic programming (e.g., Smith-Waterman [25]), and bitonic sort [4], and evaluate their effectiveness. Specifically, based on our performance model,
we analyze the percentage of time spent computing versus
synchronizing for each of the algorithms.
Finally, according to the work of Volkov et al. [29],
correctness of inter-block communication via GPU synchro-
nization cannot be guaranteed unless a memory consistency
model is assumed. To solve this problem, a new function
__threadfence() is introduced in CUDA 2.2. This function will block the calling thread until prior writes to global memory or shared memory are visible to other threads [22]. It is
expected that additional overhead will be caused by integrating
__threadfence() into our barrier functions. From our
experiment results, when the number of blocks is more than
18 in the kernel, the performance of all three algorithms is worse than that with the CPU implicit synchronization. As
a result, though barriers can be implemented efficiently in
software, guaranteeing the inter-block communication correct-
ness with __threadfence() causes a lot of overhead, so
implementing efficient barrier synchronization via hardware or
improving the memory flush efficiency become necessary for
efficient and correct inter-block communication on GPUs. It is
worth noting that even without __threadfence() called
in our barrier functions, all results are correct in our thousands
of runs.
Overall, the contributions of this paper are four-fold. First,
we propose two GPU synchronization strategies for inter-
block synchronization. These strategies do not involve the host
CPU, and in turn, reduce the synchronization time between
blocks. Second, we propose a performance model for kernel
execution time and speedup that characterizes the efficacy of
different synchronization approaches. Third, we integrate our
proposed GPU synchronization strategies into three widely
used algorithms, Fast Fourier Transform (FFT), dynamic programming, and bitonic sort, and obtain performance
improvements of 9.08%, 25.47%, and 40.39%, respectively,
over the traditional CPU synchronization approach. Fourth,
we show the cost of guaranteeing inter-block communication
correctness via __threadfence(). From our experiment
results, though our proposed barrier synchronization is effi-
cient, the low efficacy of __threadfence() causes a lot
of overhead, especially when the number of blocks in a kernel
is large.
The rest of the paper is organized as follows. Section II
provides an overview of the NVIDIA GTX 280 architecture
and CUDA programming model. The related work is described
in Section III. Section IV presents the time partition model
for kernel execution time. Section V describes our GPU
synchronization approaches. In Section VI, we give a brief
description of the algorithms that we use to evaluate our
proposed GPU synchronization strategies, and Section VII
presents and analyzes the experimental results. Section VIII
concludes the paper.
II. OVERVIEW OF CUDA ON THE NVIDIA GTX 280
The NVIDIA GeForce GTX 280 GPU card consists of 240
streaming processors (SPs), each clocked at 1296 MHz. These
240 SPs are grouped into 30 streaming multiprocessors (SMs),
each of which contains 8 streaming processors. The on-chip
memory for each SM contains 16,384 registers and 16 KB
of shared memory, which can only be accessed by threads
executing on that SM; this grouping of threads on an SM is
denoted as a block. The off-chip memory (or device memory)
contains 1 GB of GDDR3 global memory and supports a
memory bandwidth of 141.7 gigabytes per second (GB/s).
Global memory can be accessed by all threads and blocks
on the GPU, and thus, is often used to communicate data
across different blocks via a CPU barrier synchronization, as
explained later.
NVIDIA provides the CUDA programming model and
software environment [22]. It is an extension to the C program-
ming language. In general, only the compute-intensive and
data-parallel parts of a program are parallelized with CUDA
and are implemented as kernels that are compiled to the device
instruction set. A kernel must be launched to the device before
it can be executed.
In CUDA, threads within a block can communicate via
shared memory or global memory. The barrier function
__syncthreads() ensures proper communication. We re-
fer to this as intra-block communication.
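
A minimal illustration of such intra-block communication (ours, not from the paper): threads exchange data through shared memory, with __syncthreads() separating the write and read phases.

__global__ void rotate_within_block(int *out)
{
    __shared__ int buf[256];                 // assumes blockDim.x == 256
    int tid = threadIdx.x;

    buf[tid] = tid;                          // each thread writes its own slot
    __syncthreads();                         // intra-block barrier
    out[blockIdx.x * blockDim.x + tid] = buf[(tid + 1) % blockDim.x];   // read a neighbor's slot
}
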
However, there is no explicit support for data communica-
tion across different blocks, i.e., inter-block communication.
Currently, this type of data communication occurs via global
memory, followed by a barrier synchronization via the CPU.
That is, the barrier is implemented by terminating the current
kernel’s execution and re-launching the kernel, which is an
expensive operation.
III. RELATED WORK
Our work is most closely related to two areas of research:
(1) algorithmic mapping of data parallel algorithms onto the
GPU, specifically for FFT, dynamic programming, and bitonic
sort and (2) synchronization protocols in multi- and many-core
environments.
To the best of our knowledge, all known algorithmic map-
pings of FFT, dynamic programming, and bitonic sort take

the same general approach. The algorithm is mapped onto the
GPU in as much of a “data parallel” or “task parallel” fashion
as possible in order to minimize or even eliminate inter-
block communication because such communication requires
an expensive barrier synchronization. For example, running a
single (constrained) problem instance per SM, i.e., 30 separate
problem instances on the NVIDIA GTX 280, obviates the need
for inter-block communication altogether.
To accelerate FFT [16], Govindaraju et al. [6] use efficient
memory access to optimize FFT performance. Specifically,
when the number of points in a sequence is small, shared
memory is used; if there are too many points in a sequence
to store in shared memory, then techniques for coalesced
global memory access are used. In addition, Govindaraju
et al. propose a hierarchical implementation to compute a
large sequence’s FFT by combining the FFTs of smaller
subsequences that can be calculated on shared memory. In
all of these FFT implementations, the necessary barrier syn-
chronization is done by the CPU via kernel launches. Another
work is that of Volkov et al. [30], which tries to accelerate
the FFT by designing a hierarchical communication scheme
to minimize inter-block communication. Finally, Nukada et al.
[20] accelerate the 3-D FFT through shared memory usage and
optimizing the number of threads and registers via appropriate
localization. Note that all of the aforementioned approaches
focus on optimizing the GPU computation and minimizing
or eliminating the inter-block communication rather than by
optimizing the performance of inter-block communication.
Past research on mapping dynamic programming, e.g., the
Smith-Waterman (SWat) algorithm, onto the GPU uses graph-
ics primitives [14], [15] in a task parallel fashion. More recent
work uses CUDA, but again, largely in a task parallel man-
ner [18], [19], [26] or in a fine-grain parallel approach [31].
In the task parallel approach, no inter-block communication
is needed, but the problem size it supports is limited to 1K
characters. While the fine-grain parallel approach can support
sequences of up to 7K characters, inter-block communication
time consumes about 50% of the total matrix filling time.
So if a better inter-block synchronization method is used,
performance improvements can be obtained.
For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABisort to O(n log n) with an adaptive
data structure that enables merges to be done in linear time.
Another parallel implementation of the bitonic sort is in the
CUDA SDK [21], but there is only one block in the kernel to
use the available barrier function __syncthreads(), thus
restricting the maximum number of items that can be sorted
to 512, the maximum number of threads in a block. If our
proposed inter-block GPU synchronization is used, multiple
blocks can be set in the kernel, which in turn, will significantly
increase the maximum number of items that can be sorted.
Many types of software barriers have been designed for
shared-memory environments [1], [3], [8], [11], [17], but none
of them can be directly applied to GPU environments. This
is because multiple CUDA thread blocks can be scheduled
to be executed on a single SM and the CUDA blocks do
not yield execution. That is, blocks run to completion
once spawned by the CUDA thread scheduler. This may result
in deadlocks, and thus, cannot be resolved in the same way
as in traditional CPU processing environments, where one
can yield the waiting process to execute other processes.
One way of addressing this is our GPU lock-based barrier
synchronization [31]. This approach leverages a traditional
shared mutex barrier and avoids deadlock by ensuring a one-
to-one mapping between the SMs and the thread blocks.
Cederman et al. [5] implement a dynamic load-balancing
method on the GPU that is based on the lock-free synchro-
nization method found on traditional multi-core processors.
However, this scheme controls task assignment instead of
addressing inter-block communication. In addition, we note
that lock-free synchronization generally performs worse than
lock-based methods on traditional multi-core processors, but
its performance is better than that of the lock-based method
on the GPU in our work.
The work of Stuart et al. [27] focuses on data communica-
tion between multiple GPUs, i.e., inter-GPU communication.
Though their approach can be used for inter-block communi-
cation across different SMs on the same GPU, the performance
is projected to be quite poor because data needs to be moved
to the CPU host memory first and then transferred back to the
device memory, which is unnecessary for data communication
on a single GPU card.
The most closely related work to ours is that of Volkov et al.
[29]. Volkov et al. propose a global software synchronization
method that does not use atomic operations to accelerate
dense linear-algebra constructs. However, as [29] notes, their
synchronization method has not been implemented into any
real application to test the performance improvement. Further-
more, their proposed synchronization cannot guarantee that
previous accesses to all levels of the memory hierarchy have
completed. Finally, Volkov et al. used only one thread to check
all arrival variables, hence serializing this portion of inter-
block synchronization and adversely affecting its performance.
In contrast, our proposed GPU synchronization approaches
guarantee the completion of memory accesses with the existing
memory access model in CUDA. This is because a new
function __threadfence() is added in CUDA 2.2, which
can guarantee that all writes to global memory are visible to other
threads, so correctness of reads after the barrier function
can be guaranteed. In addition, we integrate each of our
GPU synchronization approaches in a micro-benchmark and
three well-known algorithms: FFT, dynamic programming, and
bitonic sort. Finally, we use multiple threads in a block to
check all the arrival variables, which can be executed in
parallel, thus achieving a good performance.
IV. A MODEL FOR KERNEL EXECUTION TIME AND SPEEDUP

Fig. 1. Total Kernel Execution Time Composition
Fig. 2. CPU Explicit/Implicit Synchronization Function Call: (a) CPU explicit synchronization; (b) CPU implicit synchronization

In general, a kernel's execution time on GPUs consists of three components: kernel launch time, computation time, and synchronization time, which can be represented as

T = \sum_{i=1}^{M} \left( t_O^{(i)} + t_C^{(i)} + t_S^{(i)} \right)    (1)

where M is the number of kernel launches, t_O^{(i)} is the kernel launch time, t_C^{(i)} is the computation time, and t_S^{(i)} is the synchronization time for the i-th kernel launch, as shown in Figure 1. Each of the three time components is impacted by a
few factors. For instance, the kernel launch time depends on
the data transfer rate from the host to the device as well as the
size of kernel code and parameters. For the computation time,
it is affected by memory access methods, thread organization
(number of threads per block and number of blocks per grid)
in the kernel, etc. Similarly, the synchronization time will be
different with different synchronization approaches used.
Figure 2 shows the pseudo-code of implementing bar-
rier synchronization via kernel launches, where Figure 2(a)
is the function call of CPU Explicit Synchronization and
Figure 2(b) is for CPU Implicit Synchronization. As we
can see, in the CPU explicit synchronization, the kernel
function __kernel_func() is followed by the function
cudaThreadSynchronize(), which will not return un-
til all prior operations on the device are completed. As a
result, the three operations kernel launch, computation,
and synchronization are executed sequentially in the CPU
explicit synchronization. In contrast, in the CPU implicit syn-
chronization, cudaThreadSynchronize() is not called.
Since kernel launch is an asynchronous operation, if there are
multiple kernel launches, kernel launch time can be overlapped
by previous kernels’ computation time and synchronization
time. So, in the CPU implicit synchronization approach, except
for the first kernel launch, subsequent ones are pipelined
with computation and synchronization of previous kernel’s
execution, and the execution time of multiple kernel launches
can be represented as

T = t_O^{(1)} + \sum_{i=1}^{M} \left( t_C^{(i)} + t_{CIS}^{(i)} \right)    (2)

where M is the number of kernel launches, t_O^{(1)} is the time for the first kernel launch, and t_C^{(i)} and t_{CIS}^{(i)} are the computation time and synchronization time for the i-th kernel launch, respectively.

Fig. 3. GPU Synchronization Function Call
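
A host-side sketch of the two CPU synchronization patterns that Figure 2 describes (our reconstruction; __kernel_func and cudaThreadSynchronize are named in the text, while numBlocks, threadsPerBlock, d_data, and M are illustrative placeholders):

__global__ void __kernel_func(float *data) { /* one step of the computation */ }

void run_with_cpu_explicit_sync(float *d_data, int numBlocks, int threadsPerBlock, int M)
{
    for (int i = 0; i < M; ++i) {
        __kernel_func<<<numBlocks, threadsPerBlock>>>(d_data);
        cudaThreadSynchronize();   // Figure 2(a): wait after every launch
    }
}

void run_with_cpu_implicit_sync(float *d_data, int numBlocks, int threadsPerBlock, int M)
{
    for (int i = 0; i < M; ++i) {
        // Figure 2(b): launches are asynchronous and queue in order, so launch i+1
        // is pipelined behind the computation and synchronization of launch i
        __kernel_func<<<numBlocks, threadsPerBlock>>>(d_data);
    }
    cudaThreadSynchronize();       // a single wait at the end, if the host needs the result
}
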
With respect to the GPU Synchronization, Figure 3 shows
the pseudo-code of how functions are called. In this approach,
a kernel is launched only once. When barrier synchroniza-
tion is needed, a barrier function __gpu_sync() is called
instead of re-launching the kernel. In Figure 3, the function
__device_func() implements the same functionality as
the kernel function __kernel_func() in Figure 2, but it
is a device function instead of a global one, so it is called on
the device rather than on the host. In the GPU synchronization,
kernel execution time can be expressed as
T = t_O + \sum_{i=1}^{M} \left( t_C^{(i)} + t_{GS}^{(i)} \right)    (3)

where M is the number of barriers needed for the kernel's execution, t_O is the kernel launch time, and t_C^{(i)} and t_{GS}^{(i)} are the computation time and synchronization time for the i-th loop, respectively.
From Equations (1), (2), and (3), an algorithm can be
accelerated by decreasing any of the three time components.
With the properties of kernel launch time considered², we ignore the kernel launch time in the following discussion. If the synchronization time is reduced, then according to Amdahl's Law, the maximum kernel execution speedup is constrained by

S_T = \frac{T}{t_C + (T - t_C)/S_S} = \frac{1}{t_C/T + (1 - t_C/T)/S_S} = \frac{1}{\rho + (1 - \rho)/S_S}    (4)

where S_T is the kernel execution speedup gained by reducing the synchronization time, \rho = t_C / T is the percentage of the computation time t_C in the total kernel execution time T, t_S = T - t_C is the synchronization time of the CPU implicit synchronization (which is our baseline, as mentioned later), and S_S is the synchronization speedup. Similarly, if only the computation is accelerated, the maximum overall speedup is constrained by

S_T = \frac{1}{\rho / S_C + (1 - \rho)}    (5)

where S_C is the computation speedup.

²Three properties are considered. First, kernel launch time can be combined with the synchronization time in the CPU explicit synchronization; second, it can be overlapped in the CPU implicit synchronization; third, the kernel is launched only once in the GPU synchronization.

In Equation (4), the smaller ρ is, the more speedup can be gained with a fixed S_S; while in Equation (5), the larger ρ is, the more speedup can be obtained with a fixed S_C.
In practice, different algorithms have different ρ values. For
example, for the three algorithms used in this paper, FFT has
a ρ value larger than 0.8, while SWat and bitonic sort have
a ρ of about 0.5 and 0.4, respectively. According to Equation
(5), corresponding to these ρ values, if only the computation
is accelerated, the maximum speedups of the three aforementioned algorithms are as shown in Table II. As can be observed, very
low speedup can be obtained in these three algorithms if only
the computation is accelerated. Since most of the previous
work focuses on optimizing the computation, i.e., decreases
the computation time t
C
, the more optimization is performed
on an algorithm, the smaller ρ will become. At this time,
decreasing the computation time will not help much for the
overall performance. On the other side, if we decrease the
synchronization time, large kernel execution speedup can be
obtained.
TABLE II
POSSIBLE MAXIMUM SPEEDUP WITH ONLY COMPUTATION
ACCELERATED
Algorithms FFT SWat Bitonic sort
ρ 0.82 0.51 0.40
Possible maximum speedup 5.61 2.03 1.68
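
As a quick numerical check of the model (our arithmetic, using the rounded ρ values above, so the results differ slightly from the table, which presumably uses unrounded measurements): for bitonic sort, ρ = 0.40, and letting S_C → ∞ in Equation (5) gives

S_T \rightarrow \frac{1}{1 - \rho} = \frac{1}{0.60} \approx 1.67,

close to the 1.68 reported in Table II. Conversely, Equation (4) says that speeding up only the synchronization by S_S = 4 would yield S_T = 1/(0.40 + 0.60/4) ≈ 1.82, with an upper bound of 1/ρ = 2.5 as S_S → ∞.
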
In this paper, we will focus on decreasing the synchroniza-
tion time. This is due to three facts:
1) A lot of work [6], [10], [15], [19], [25] has been proposed to decrease the computation time. Techniques such as shared memory usage and divergent branch removal have been widely used.
2) No work has been done to decrease the synchronization
time for algorithms to be executed on a GPU;
3) In some algorithms, the synchronization time consumes
a large part of the kernel execution time (e.g., SWat and
bitonic sort in Figure 12), which results in a small ρ
value.
With the above model for speedup brought by synchro-
nization time reduction, we propose two GPU synchronization
approaches in the next section, and time consumption of each
of them is modeled and analyzed quantitatively.
V. PROPOSED GPU SYNCHRONIZATION
Since, in the CUDA programming model, the execution of a
thread block is non-preemptive, care must be taken to avoid
deadlocks in GPU synchronization design. Consider a scenario
where multiple thread blocks are mapped to one SM and the
active block is waiting for the completion of a global barrier.
A deadlock will occur in this case because unscheduled thread
blocks will not be able to reach the barrier without preemption.
Our solution to this problem is to have a one-to-one mapping
between thread blocks and SMs. In other words, for a GPU
with ‘Y’ SMs, we ensure that at most ‘Y’ blocks are used in
the kernel. In addition, we allocate all available shared memory
on an SM to each block so that no two blocks can be scheduled
to the same SM because of the memory constraint.
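
A sketch of how this one-to-one mapping can be enforced from the host (our illustration, not the paper's code; my_kernel is a placeholder name):

#include <cuda_runtime.h>

__global__ void my_kernel(float *data);

void launch_one_block_per_sm(float *d_data, int threadsPerBlock)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int    numBlocks = prop.multiProcessorCount;   // 'Y' blocks for 'Y' SMs (30 on a GTX 280)
    size_t smemBytes = prop.sharedMemPerBlock;     // claim the SM's shared memory so no second
                                                   // block fits; in practice slightly less may be
                                                   // needed if the kernel declares static shared memory

    my_kernel<<<numBlocks, threadsPerBlock, smemBytes>>>(d_data);
}
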
In the following discussion, we will present two alternative
GPU synchronization designs: GPU lock-based synchroniza-
tion and GPU lock-free synchronization. The first one uses
a mutex variable and CUDA atomic operations; while the
second method uses a lock-free algorithm that avoids the use
of expensive CUDA atomic operations.
A. GPU Lock-Based Synchronization
The basic idea of GPU lock-based synchronization [31]
is to use a global mutex variable to count the number of
thread blocks that reach the synchronization point. As shown
in Figure 4, in the barrier function __gpu_sync(), after a
block completes its computation, one of its threads (we call it
the leading thread.) will atomically add 1 to g_mutex. The
leading thread will then repeatedly compare g_mutex to a
target value goalVal. If g_mutex is equal to goalVal,
the synchronization is completed and each thread block can
proceed with its next stage of computation. In our design,
goalVal is set to the number of blocks N in the kernel when
the barrier function is first called. The value of goalVal is
then incremented by N each time the barrier function is
successively called. This design is more efficient than keeping
goalVal constant and resetting g_mutex after each barrier
because the former saves the number of instructions and avoids
conditional branching.
//the mutex variable
__device__ volatile int g_mutex;

//GPU lock-based synchronization function
__device__ void __gpu_sync(int goalVal)
{
    //thread ID in a block
    int tid_in_block = threadIdx.x * blockDim.y + threadIdx.y;

    // only thread 0 is used for synchronization
    if (tid_in_block == 0) {
        atomicAdd((int *)&g_mutex, 1);

        //only when all blocks add 1 to g_mutex
        //will g_mutex equal goalVal
        while (g_mutex != goalVal) {
            //Do nothing here
        }
    }
    __syncthreads();
}

Fig. 4. Code snapshot of the GPU Lock-Based Synchronization
In the GPU lock-based synchronization, the execution time
of the barrier function __gpu_sync() consists of three

Citations
Posted Content
TL;DR: A single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model, the WaveRNN, and a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once.
Abstract: Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.

520 citations

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy that guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
Abstract: Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the fastest CPU implementation. In this paper, we present a new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy. It guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.

235 citations

Proceedings ArticleDOI
13 May 2012
TL;DR: Through micro-kernel benchmarks, it is shown the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases.
Abstract: In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and identify four primary use cases it could be useful in addressing—CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.

202 citations

Proceedings ArticleDOI
23 Feb 2013
TL;DR: This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol, called TC-Weak.
Abstract: While scalable coherence has been extensively studied in the context of general purpose chip multiprocessors (CMPs), GPU architectures present a new set of challenges. Introducing conventional directory protocols adds unnecessary coherence traffic overhead to existing GPU applications. Moreover, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC) [34, 54], explored the use of time-based approaches in CMP coherence protocols. This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present an implementation of TC, called TC-Weak, which eliminates LCC's trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications with inter-workgroup communication by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also find that write-through protocols outperform a writeback protocol on a GPU as the latter suffers from increased traffic due to unnecessary refills of write-once data.

166 citations

Proceedings ArticleDOI
14 Mar 2015
TL;DR: A model of Nvidia GPU hardware is proposed, which correctly models every behaviour witnessed in the authors' experiments, and is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.
Abstract: Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent programs), we questioned the assumptions in programming guides and vendor documentation about the guarantees provided by hardware. We developed a tool to generate thousands of litmus tests and run them under stressful workloads. We observed a litany of previously elusive weak behaviours, and exposed folklore beliefs about GPU programming---often supported by official tutorials---as false. As a way forward, we propose a model of Nvidia GPU hardware, which correctly models every behaviour witnessed in our experiments. The model is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.

110 citations


Additional excerpts

  • ...[15, 26, 45]) simply omit memory model considerations....

    [...]

References
Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other Pair of segments with greater similarity (homology).

10,262 citations


"Inter-block GPU communication via f..." refers background or methods in this paper

  • ...Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU used graphics primitives [15, 14] in a task parallel fashion....

    [...]

  • ..., Smith-Waterman [25]), and bitonic sort [4] — and evaluate their effectiveness....

    [...]

  • ...This is due to three facts: 1) There have been a lot of works [19, 25, 15, 6, 10] to decrease the computation time....

    [...]

  • ...The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4]....

    [...]

  • ...We then introduce these GPU synchronization strategies into three different algorithms — Fast Fourier Transform (FFT) [16], dynamic programming (e.g., Smith-Waterman [25]), and bitonic sort [4] — and evaluate their effectiveness....

    [...]

Proceedings ArticleDOI
30 Apr 1968
TL;DR: To achieve high throughput rates today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
Abstract: To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently. A major problem in the design of such a computing system is the connecting together of the various parts of the system (the I/O devices, memories, processing units, etc.) in such a way that all the required data transfers can be accommodated. One common scheme is a high-speed bus which is time-shared by the various parts; speed of available hardware limits this scheme. Another scheme is a cross-bar switch or matrix; limiting factors here are the amount of hardware (an m × n matrix requires m × n cross-points) and the fan-in and fan-out of the hardware.

2,553 citations


"Inter-block GPU communication via f..." refers background or methods in this paper

  • ...More detailed information about bitonic sort is in [4]....

    [...]

  • ...The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4]....

    [...]

  • ...Bitonic sort is one of the fastest sorting networks [13], which is a special type of sorting algorithm devised by Ken Batcher [4]....

    [...]

  • ..., Smith-Waterman [25]), and bitonic sort [4] — and evaluate their effectiveness....

    [...]

Book
01 Jan 1992
TL;DR: The Radix-2 Frameworks, a collection of general and high performance FFTs designed to solve the multi-Dimensional FFT problem of Prime Factor and Convolution, are presented.
Abstract: 1. The Radix-2 Frameworks. Matrix Notation and Algorithms The FFT Idea The Cooley-Tukey Factorization Weight and Butterfly Computations Bit Reversal and Transposition The Cooley-Tukey Framework The Stockham Autosort Frameworks The Pease Framework Decimation in Frequency and Inverse FFTs 2. General Radix Frameworks. The General Radix Ideas Index Reversal and Transposition Mixed-Radix Factorizations Radix-4 and Radix-8 Frameworks The Split-Radix Frameworks 3. High Performance Frameworks. The Multiple DFT Problem Matrix Transposition The Large Single-Vector FFT Problem Multi-Dimensional FFT Problem Distributed Memory FFTs Shared Memory FFTs 4. Selected Topics. Prime Factor FFTs Convolution FFTs of Real Data Cosine and Sine Transforms Fast Poisson Solvers Bibliography Index.

1,222 citations


"Inter-block GPU communication via f..." refers background or methods in this paper

  • ...A detailed description of the FFT algorithm can be found in [16]....

    [...]

  • ...We then introduce these GPU synchronization strategies into three different algorithms — Fast Fourier Transform (FFT) [16], dynamic programming (e....

    [...]

  • ...The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4]....

    [...]

  • ...To accelerate FFT [16], Govindaraju et al....

    [...]

Proceedings ArticleDOI
20 Feb 2008
TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations.
Abstract: GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve between a 10.5X to 457X speedup in kernel codes and between 1.16X to 431X total application speedup.

993 citations


"Inter-block GPU communication via f..." refers background or methods in this paper

  • ...However, GPUs typically map well only to data or task parallel applications whose execution requires minimal or even no interblock communication [9, 24, 26, 30]....

    [...]

  • ...With these programming models, more and more applications have been mapped to GPUs and accelerated [6, 7, 10, 12, 18, 19, 23, 24, 26, 30]....

    [...]

Proceedings ArticleDOI
15 Nov 2008
TL;DR: In this article, the authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units, and exploit blocking similarly to vector computers and heterogeneity of the system.
Abstract: We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80--90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.

787 citations

Frequently Asked Questions (9)
Q1. What are the contributions mentioned in the paper "Inter-block gpu communication via fast barrier synchronization" ?

The authors propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, their GPU lock-free synchronization further improves performance by 10 %, 26 %, and 40 %, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively. 

As for future work, the authors will further investigate the reasons for the irregularity of the FFT's synchronization time versus the number of blocks in the kernel. Second, the authors will propose a general model to characterize algorithms' parallelism properties, based on which better performance can be obtained for their parallelization on multi- and many-core architectures.

For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABisort to O(n log n) with an adaptive data structure that enables merges to be done in linear time.

In addition, the authors allocate all available shared memory on an SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint. 

In addition, the authors integrate each of their GPU synchronization approaches in a micro-benchmark and three well-known algorithms: FFT, dynamic programming, and bitonic sort. 

Another parallel implementation of the bitonic sort is in the CUDA SDK [21], but there is only one block in the kernel to use the available barrier function __syncthreads(), thus restricting the maximum number of items that can be sorted to 512 — the maximum number of threads in a block. 

As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed. 

It is worth noting that in the step 2) above, rather than having one thread to check all elements of Arrayin in serial as in [29], the authors use N threads to check the elements of Arrayin in parallel. 

Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU uses graphics primitives [14], [15] in a task parallel fashion. 
