Proceedings ArticleDOI

Inter-block GPU communication via fast barrier synchronization

Shucai Xiao1, Wu-chun Feng1
19 Apr 2010-pp 1-12
TL;DR: This work proposes two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization, and evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms — Fast Fourier Transform, dynamic programming, and bitonic sort.
Abstract: While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Interblock communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms — Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the microbenchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.

Summary (5 min read)

Introduction

  • The authors propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization.
  • To quantify the execution time of each phase, the authors propose a general performance model that partitions the kernel execution time into the three aforementioned phases.
  • First, the authors propose two GPU synchronization strategies for inter-block synchronization.
  • From their experimental results, though their proposed barrier synchronization is efficient, the low efficacy of __threadfence causes significant overhead, especially when the number of blocks in a kernel is large.

II. OVERVIEW OF CUDA ON THE NVIDIA GTX 280

  • The NVIDIA GeForce GTX 280 GPU card consists of 240 streaming processors (SPs), each clocked at 1296 MHz.
  • These 240 SPs are grouped into 30 streaming multiprocessors (SMs), each of which contains 8 streaming processors.
  • Global memory can be accessed by all threads and blocks on the GPU, and thus, is often used to communicate data across different blocks via a CPU barrier synchronization, as explained later.
  • In general, only the compute-intensive and data-parallel parts of a program are parallelized with CUDA and are implemented as kernels that are compiled to the device instruction set.
  • Currently, this type of data communication occurs via global memory, followed by a barrier synchronization via the CPU.

IV. A MODEL FOR KERNEL EXECUTION TIME AND SPEEDUP

  • The kernel launch time depends on the data transfer rate from the host to the device as well as the size of kernel code and parameters.
  • As the authors can see, in the CPU explicit synchronization, the kernel function __kernel_func is followed by the function cudaThreadSynchronize, which will not return until all prior operations on the device are completed.
  • Since kernel launch is an asynchronous operation, if there are multiple kernel launches, kernel launch time can be overlapped by previous kernels’ computation time and synchronization time.
  • First, kernel launch time can be combined with the synchronization time in the CPU explicit synchronization; second, it can be overlapped in the CPU implicit synchronization. According to Equation (5), corresponding to these ρ values, if only the computation is accelerated, the maximum speedups of the three aforementioned algorithms are as shown in Table II.

V. PROPOSED GPU SYNCHRONIZATION

  • Since, in the CUDA programming model, the execution of a thread block is non-preemptive, care must be taken to avoid deadlocks in GPU synchronization design.
  • Consider a scenario where multiple thread blocks are mapped to one SM and the active block is waiting for the completion of a global barrier.
  • The authors solution to this problem is to have a one-to-one mapping between thread blocks and SMs.
  • In other words, for a GPU with ‘Y’ SMs, the authors ensure that at most ‘Y’ blocks are used in the kernel.
  • In the following discussion, the authors will present two alternative GPU synchronization designs: GPU lock-based synchronization and GPU lock-free synchronization.

A. GPU Lock-Based Synchronization

  • The basic idea of GPU lock-based synchronization [31] is to use a global mutex variable to count the number of thread blocks that reach the synchronization point.
  • The leading thread will then repeatedly compare g_mutex to a target value goalVal.
  • The value of goalVal is then incremented by N each time the barrier function is subsequently called.
  • In the GPU lock-based synchronization, the execution time of the barrier function __gpu_sync consists of three parts — atomic addition, checking of g_mutex, and synchronization of threads within a block via __syncthreads.
  • The atomic addition can only be executed sequentially by different blocks, while the g_mutex checking and intra-block synchronization can be executed in parallel.
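
Following the two points above, a sketch of the resulting cost model (our notation, not the paper's): if t_a, t_c, and t_s denote the times of one atomic addition, the g_mutex check, and the intra-block __syncthreads, respectively, and N is the number of blocks in the kernel, then

t_{lock\text{-}based} \approx N \cdot t_a + t_c + t_s

since the N atomic additions serialize across blocks while the checking and the intra-block synchronization proceed in parallel.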

B. GPU Lock-Free Synchronization

  • In the GPU lock-based synchronization, the mutex variable g_mutex is incremented with the atomic function atomicAdd.
  • As shown in Figure 6, their lock-free synchronization approach uses two arrays Arrayin and Arrayout to coordinate the synchronization requests from various blocks.
  • The algorithm is outlined in three steps: 1) when a block reaches the barrier, its leading thread sets the corresponding element in Arrayin to goalVal; 2) the threads of one designated block each monitor one element of Arrayin and, once all elements equal goalVal, set the corresponding elements in Arrayout to goalVal; 3) a block will continue its execution once its leading thread sees the corresponding element in Arrayout set to goalVal (see the sketch below).
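
A minimal sketch of the lock-free barrier described above (our reconstruction from the summary, not the paper's exact code; it assumes a 1-D grid and 1-D blocks, with blockDim.x >= gridDim.x so that the checking block has one thread per block):

#define MAX_BLOCKS 30   // at most 30 blocks on a GTX 280 (one block per SM)

// arrival and release flags, one element per block (zero-initialized by default)
__device__ volatile int Arrayin[MAX_BLOCKS];
__device__ volatile int Arrayout[MAX_BLOCKS];

__device__ void __gpu_sync_lockfree(int goalVal)
{
    int tid     = threadIdx.x;
    int bid     = blockIdx.x;
    int nBlocks = gridDim.x;

    // 1) the leading thread of each block announces that its block has arrived
    if (tid == 0) {
        Arrayin[bid] = goalVal;
    }

    // 2) one designated block (block 0 here) watches the arrival flags, one thread
    //    per block, and releases all blocks once every flag equals goalVal
    if (bid == 0) {
        if (tid < nBlocks) {
            while (Arrayin[tid] != goalVal) { /* spin */ }
        }
        __syncthreads();
        if (tid < nBlocks) {
            Arrayout[tid] = goalVal;
        }
    }

    // 3) each block continues once its leading thread sees its release flag set
    if (tid == 0) {
        while (Arrayout[bid] != goalVal) { /* spin */ }
    }
    __syncthreads();
}

No atomic operation is needed: each flag has a single writer, and goalVal grows on every call, so the flags never have to be reset.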

C. Synchronization Time Verification via a Micro-benchmark

  • To verify the execution time of the synchronization function __gpu_sync for each synchronization method, a micro-benchmark that computes the mean of two floats 10,000 times is used.
  • Here, each result is the average of three runs.
  • From Figure 8, the computation time is only about 5ms, while the time needed by the CPU implicit synchronization is about 60ms, which is 12 times the computation time.
  • For the GPU lock-based synchronization, the synchronization time is linear in the number of blocks in the kernel, and more synchronization time is needed for a kernel with a larger number of blocks. (Since at most 30 blocks can be set on a GTX 280, the threads that check Arrayin are in the same warp and are executed in parallel.)
  • The reason is that, as the authors analyzed in Section V-A, more blocks mean that more atomic add operations must be executed for the synchronization.
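
A sketch of such a micro-benchmark (our reconstruction, not the paper's code): the per-iteration computation is tiny, so the measured kernel time is dominated by the 10,000 barrier operations.

__device__ void __gpu_sync(int goalVal);   // one of the barrier functions from Section V

__global__ void mean_microbenchmark(const float *a, const float *b, float *out)
{
    int i       = blockIdx.x * blockDim.x + threadIdx.x;
    int N       = gridDim.x;                // number of blocks in the kernel
    int goalVal = 0;

    for (int iter = 0; iter < 10000; ++iter) {
        out[i] = (a[i] + b[i]) * 0.5f;      // the (tiny) computation: mean of two floats

        goalVal += N;                       // target grows by N on every barrier call
        __gpu_sync(goalVal);                // with CPU synchronization, each iteration
                                            // would instead be a separate kernel launch
    }
}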

VI. ALGORITHMS USED FOR PERFORMANCE EVALUATION

  • Inter-block synchronization can be used in many algorithms.
  • The authors choose three of them that can benefit from their proposed GPU synchronization methods.
  • The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4].

A. Fast Fourier Transformation

  • A Discrete Fourier Transformation (DFT) transforms a sequence of values into its frequency components or, inversely, converts the frequency components back to the original data sequence.
  • Fast Fourier Transformation (FFT) is a fast way of DFT computation.
  • The computation difference can be substantial for long data sequences, especially when the sequence has thousands or millions of points.
  • Within each iteration, the computation of different points is independent and can be done in parallel, because each point depends only on points from the previous iteration.
  • The barrier used here can be multiple kernel launches (CPU synchronization) or the GPU synchronization approaches proposed in this paper.

C. Bitonic Sort

  • Bitonic sort is one of the fastest sorting networks [13], which is a special type of sorting algorithm devised by Ken Batcher [4].
  • The main idea behind bitonic sort is using a divide-and-conquer strategy.
  • In the divide step, the input sequence is divided into two subsequences and each sequence is sorted with bitonic sort itself, where one is in the ascending order and the other is in the descending order.
  • In the conquer step, with the two sorted subsequences as the input, the bitonic merge is used to combine them to get the whole sorted sequence [13].
  • In each iteration, the numbers to be sorted are divided into pairs and a compare-and-swap operation is applied to each pair; different pairs can be processed in parallel.
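
A sketch of one compare-and-swap step of bitonic sort (a generic textbook formulation, not the paper's kernel): each thread handles one pair, different pairs are independent, and a barrier is required between steps.

__global__ void bitonic_step(int *keys, unsigned int j, unsigned int k)
{
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ixj = i ^ j;                   // index of this thread's partner

    if (ixj > i) {                              // each pair is handled exactly once
        bool ascending = ((i & k) == 0);        // direction of this bitonic subsequence
        if ((keys[i] > keys[ixj]) == ascending) {
            int tmp   = keys[i];                // compare-and-swap
            keys[i]   = keys[ixj];
            keys[ixj] = tmp;
        }
    }
}

// Illustrative host-side driver:
//   for (k = 2; k <= n; k <<= 1)
//     for (j = k >> 1; j > 0; j >>= 1)
//       bitonic_step<<<n / 256, 256>>>(d_keys, j, k);
// With CPU synchronization every inner step is a separate kernel launch; with the
// GPU barriers proposed in this paper the whole loop nest can live inside one kernel.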

A. Overview

  • To evaluate the performance of their proposed GPU synchronization approaches, the authors implement them in the three algorithms described in Section VI.
  • Corresponding to the best performance of each algorithm with each synchronization approach, the percentages of computation time and synchronization time are presented and analyzed.
  • The authors' experiments are performed on a GeForce GTX 280 GPU card, which has 30 SMs and 240 processing cores clocked at 1296 MHz.
  • For the host machine, the processor is an Intel Core 2 Duo CPU with 2 MB of L2 cache and a clock speed of 2.2 GHz.
  • As in the micro-benchmark, each result is the average of three runs.

B. Kernel Execution Time

  • Figure 10 shows the kernel execution time decrease achieved with their proposed GPU synchronization approaches and its variation versus the number of blocks in the kernel.
  • This is because, when the number of blocks in the kernel is larger than 30 or less than 9, kernel execution times are higher than with a block number between 9 and 30.
  • Figure 10(a) shows the performance of FFT, Figure 10(b) is for SWat, and Figure 10(c) displays the kernel execution time of bitonic sort.
  • Third, kernel execution time difference between the CPU implicit synchronization and the proposed GPU synchronization of FFT is much less than that of SWat and bitonic sort.
  • The more blocks are set in the kernel, the more performance improvement can be obtained compared to the GPU lock-based synchronization approach.

C. Synchronization Time

  • The authors show the synchronization time variation versus the number of blocks in the kernel.
  • Here, the synchronization time is the difference between the total kernel execution time and the computation time; the computation time is obtained by running an implementation of each algorithm with the GPU synchronization approach but with the synchronization function __gpu_sync removed (see the timing sketch after this list).
  • With 9 blocks in the kernel, the time needed for the GPU lock-based synchronization is close to that of the GPU lock-free synchronization.
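
A sketch of how this separation can be measured (our illustration; the launcher functions are hypothetical placeholders):

#include <cuda_runtime.h>

// times one kernel run in milliseconds using CUDA events
float time_kernel_ms(void (*launch)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();                          // launches the kernel under test
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// float total   = time_kernel_ms(launch_with_gpu_sync);
// float compute = time_kernel_ms(launch_without_gpu_sync);  // __gpu_sync removed
// float sync    = total - compute;                          // synchronization time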

D. Percentages of the Computation Time and the Synchronization Time

  • Figure 12 shows the performance breakdown in percentage of the three algorithms when different synchronization approaches are used.
  • As a result, synchronization time changes have less impact on the total kernel execution time compared to SWat and bitonic sort.
  • This indicates that inter-block communication time occupies a large part of the total execution time in some algorithms.
  • Thus, decreasing the synchronization time can greatly improve the performance of some algorithms.

E. Costs of Guaranteeing Inter-Block Communication Correctness

  • As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed.
  • Considering the GPU lock-free synchronization, Figure 13(a) shows that, for FFT, when the number of blocks in the kernel is larger than 14, more time is needed to execute the kernel with the GPU lock-free synchronization.
  • The threshold is 18 and 12 for SWat and bitonic sort, respectively.
  • From these results, though the barrier itself can be implemented efficiently in software, the cost of guaranteeing correctness with the function __threadfence is very high; in other words, using __threadfence to guarantee that writes to shared memory or global memory are read correctly is not an efficient approach (a sketch of where __threadfence would sit in the barrier follows this list).
  • This overhead is not expected on the next generation of NVIDIA GPU, "Fermi", on which, with a more efficient implementation of __threadfence and a different architecture, __threadfence is needed for correct inter-block data communication.
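
A sketch of where __threadfence would sit in the lock-based barrier of Figure 4 (our reading of this section, not the paper's exact code; g_mutex is the mutex variable from Figure 4):

__device__ void __gpu_sync_fenced(int goalVal)
{
    int tid_in_block = threadIdx.x * blockDim.y + threadIdx.y;

    __syncthreads();     // wait until every thread in the block has issued its writes
    __threadfence();     // make this thread's prior global-memory writes visible to
                         // threads in other blocks (available since CUDA 2.2)

    if (tid_in_block == 0) {
        atomicAdd((int *)&g_mutex, 1);
        while (g_mutex != goalVal) { /* spin */ }
    }
    __syncthreads();
}

It is this extra fence, rather than the barrier logic itself, that the authors identify as the dominant cost once the number of blocks grows.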

VIII. CONCLUSION

  • In the current GPU architecture, inter-block communication on GPUs requires barrier synchronization.
  • Second, the authors propose two synchronization approaches: GPU lock-based synchronization and GPU lock-free synchronization.
  • For each of these methods, the authors quantify its efficacy with the aforementioned performance model.
  • In addition, based on the kernel execution time model, the authors partition the kernel execution time into the computation time and the synchronization time for the three algorithms.
  • As a result, better approaches such as efficient hardware barrier implementation or memory flush functions are needed to support efficient and correct inter-block communication on a GPU.



Inter-Block GPU Communication via Fast
Barrier Synchronization
Shucai Xiao and Wu-chun Feng
Department of Electrical and Computer Engineering
Department of Computer Science
Virginia Tech
Blacksburg, Virginia 24061
Email: {shucai, wfeng}@vt.edu
Abstract—While GPGPU stands for general-purpose compu-
tation on graphics processing units, the lack of explicit support
for inter-block communication on the GPU arguably hampers its
broader adoption as a general-purpose computing device. Inter-
block communication on the GPU occurs via global memory
and then requires barrier synchronization across the blocks,
i.e., inter-block GPU communication via barrier synchronization.
Currently, such synchronization is only available via the CPU,
which in turn, can incur significant overhead.
We propose two approaches for inter-block GPU communi-
cation via barrier synchronization: GPU lock-based synchro-
nization and GPU lock-free synchronization. We then evaluate
the efficacy of each approach via a micro-benchmark as well
as three well-known algorithms: Fast Fourier Transform
(FFT), dynamic programming, and bitonic sort. For the micro-
benchmark, the experimental results show that our GPU lock-
free synchronization performs 8.4 times faster than CPU explicit
synchronization and 4.0 times faster than CPU implicit synchro-
nization. When integrated with the FFT, dynamic programming,
and bitonic sort algorithms, our GPU lock-free synchronization
further improves performance by 10%, 26%, and 40%, respec-
tively, and ultimately delivers an overall speed-up of 70x, 13x,
and 24x, respectively.
I. INTRODUCTION
Today, improving the computational capability of a proces-
sor comes from increasing its number of processing cores
rather than increasing its clock speed. This is reflected in
both traditional multi-core processors and many-core graphics
processing units (GPUs).
Originally, GPUs were designed for graphics-based ap-
plications. With the elimination of key architecture limita-
tions, GPUs have evolved to become more widely used for
general-purpose computation, i.e., general-purpose computa-
tion on the GPU (GPGPU). Programming models such as
NVIDIA's Compute Unified Device Architecture (CUDA) [22]
and AMD/ATI’s Brook+ [2] enable applications to be more
easily mapped onto the GPU. With these programming models,
more and more applications have been mapped to GPUs and
accelerated [6], [7], [10], [12], [18], [19], [23], [24], [26], [30].
However, GPUs typically map well only to data or task
parallel applications whose execution requires minimal or even
no inter-block communication [9], [24], [26], [30]. Why?
There exists no explicit support for inter-block communication
on the GPU. Currently, such inter-block communication occurs
via global memory and requires a barrier synchronization to
complete the communication, which is (inefficiently) imple-
mented via the host CPU. Hereafter, we refer to such CPU-
based barrier synchronization as CPU synchronization.
In general, when a program (i.e., kernel) executes on the
GPU, its execution time consists of three phases: (1) kernel
launch to the GPU, (2) computation on the GPU, and (3)
inter-block GPU communication via barrier synchronization.¹
With different approaches for synchronization, the percentage
of time that each of these three phases takes will differ.
Furthermore, some of the phases may overlap in time. To
quantify the execution time of each phase, we propose a
general performance model that partitions the kernel execution
time into the three aforementioned phases. Based on our model
and code profiling while using the current state of the art
in barrier synchronization, i.e., CPU implicit synchronization
(see Section IV), inter-block communication via barrier syn-
chronization can consume more than 50% of the total kernel
execution time, as shown in Table I.
TABLE I
PERCENT OF TIME SPENT ON INTER-BLOCK COMMUNICATION

Algorithms                                      FFT      SWat     Bitonic sort
% of time spent on inter-block communication    17.8%    49.2%    59.6%

(SWat: Smith-Waterman)
Hence, in contrast to previous work that mainly focuses on
optimizing the GPU computation, we focus on reducing the
inter-block communication time via barrier synchronization.
To achieve this, we propose a set of GPU synchronization
strategies, which can synchronize the execution of different
blocks without the involvement of the host CPU, thus avoiding
the costly operation of a kernel launch from the CPU to
GPU. To the best of our knowledge, this work is the first that
systematically addresses how to better support more general-
purpose computation by significantly reducing the inter-block
¹Because inter-block GPU communication time is dominated by the inter-block synchronization time, we will use inter-block synchronization time instead of inter-block GPU communication time hereafter.

communication time (rather than the computation time) on a
GPU.
We propose two types of GPU synchronization, one with
locks and the other without. For the former, we use one
mutual-exclusive (mutex) variable and an atomic add operation
to implement GPU lock-based synchronization. With respect to
the latter, which we refer to as GPU lock-free synchronization,
we use two arrays, instead of mutex variables, and eliminate
the need for atomic operations. With this approach, each thread
within a single block controls the execution of a different
block, and the intra-block synchronization is achieved by
synchronizing the threads within the block with the existing
barrier function __syncthreads().
We then introduce these GPU synchronization strategies into three different algorithms, Fast Fourier Transform (FFT) [16], dynamic programming (e.g., Smith-Waterman [25]), and bitonic sort [4], and evaluate their effectiveness. Specifically, based on our performance model,
we analyze the percentage of time spent computing versus
synchronizing for each of the algorithms.
Finally, according to the work of Volkov et al. [29],
correctness of inter-block communication via GPU synchro-
nization cannot be guaranteed unless a memory consistency
model is assumed. To solve this problem, a new function
__threadfence() is introduced in CUDA 2.2. This function will block the calling thread until prior writes to global memory or shared memory are visible to other threads [22]. It is
expected that additional overhead will be caused by integrating
__threadfence() into our barrier functions. From our
experiment results, when the number of blocks is more than
18 in the kernel, the performance of all three algorithms is worse than that with the CPU implicit synchronization. As
a result, though barriers can be implemented efficiently in
software, guaranteeing the inter-block communication correct-
ness with __threadfence() causes a lot of overhead, so
implementing efficient barrier synchronization via hardware or
improving the memory flush efficiency become necessary for
efficient and correct inter-block communication on GPUs. It is
worth noting that even without __threadfence() called
in our barrier functions, all results are correct in our thousands
of runs.
Overall, the contributions of this paper are four-fold. First,
we propose two GPU synchronization strategies for inter-
block synchronization. These strategies do not involve the host
CPU, and in turn, reduce the synchronization time between
blocks. Second, we propose a performance model for kernel
execution time and speedup that characterizes the efficacy of
different synchronization approaches. Third, we integrate our
proposed GPU synchronization strategies into three widely
used algorithms, Fast Fourier Transform (FFT), dynamic programming, and bitonic sort, and obtain performance
improvements of 9.08%, 25.47%, and 40.39%, respectively,
over the traditional CPU synchronization approach. Fourth,
we show the cost of guaranteeing inter-block communication
correctness via __threadfence(). From our experiment
results, though our proposed barrier synchronization is effi-
cient, the low efficacy of __threadfence() causes a lot
of overhead, especially when the number of blocks in a kernel
is large.
The rest of the paper is organized as follows. Section II
provides an overview of the NVIDIA GTX 280 architecture
and CUDA programming model. The related work is described
in Section III. Section IV presents the time partition model
for kernel execution time. Section V describes our GPU
synchronization approaches. In Section VI, we give a brief
description of the algorithms that we use to evaluate our
proposed GPU synchronization strategies, and Section VII
presents and analyzes the experimental results. Section VIII
concludes the paper.
II. OVERVIEW OF CUDA ON THE NVIDIA GTX 280
The NVIDIA GeForce GTX 280 GPU card consists of 240
streaming processors (SPs), each clocked at 1296 MHz. These
240 SPs are grouped into 30 streaming multiprocessors (SMs),
each of which contains 8 streaming processors. The on-chip
memory for each SM contains 16,384 registers and 16 KB
of shared memory, which can only be accessed by threads
executing on that SM; this grouping of threads on an SM is
denoted as a block. The off-chip memory (or device memory)
contains 1 GB of GDDR3 global memory and supports a
memory bandwidth of 141.7 gigabytes per second (GB/s).
Global memory can be accessed by all threads and blocks
on the GPU, and thus, is often used to communicate data
across different blocks via a CPU barrier synchronization, as
explained later.
NVIDIA provides the CUDA programming model and
software environment [22]. It is an extension to the C program-
ming language. In general, only the compute-intensive and
data-parallel parts of a program are parallelized with CUDA
and are implemented as kernels that are compiled to the device
instruction set. A kernel must be launched to the device before
it can be executed.
In CUDA, threads within a block can communicate via
shared memory or global memory. The barrier function
__syncthreads() ensures proper communication. We re-
fer to this as intra-block communication.
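
A minimal illustration of such intra-block communication (ours, not from the paper): threads exchange data through shared memory, with __syncthreads() separating the write and read phases.

__global__ void rotate_within_block(int *out)
{
    __shared__ int buf[256];                 // assumes blockDim.x == 256
    int tid = threadIdx.x;

    buf[tid] = tid;                          // each thread writes its own slot
    __syncthreads();                         // intra-block barrier
    out[blockIdx.x * blockDim.x + tid] = buf[(tid + 1) % blockDim.x];   // read a neighbor's slot
}
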
However, there is no explicit support for data communica-
tion across different blocks, i.e., inter-block communication.
Currently, this type of data communication occurs via global
memory, followed by a barrier synchronization via the CPU.
That is, the barrier is implemented by terminating the current
kernel’s execution and re-launching the kernel, which is an
expensive operation.
III. RELATED WORK
Our work is most closely related to two areas of research:
(1) algorithmic mapping of data parallel algorithms onto the
GPU, specifically for FFT, dynamic programming, and bitonic
sort and (2) synchronization protocols in multi- and many-core
environments.
To the best of our knowledge, all known algorithmic map-
pings of FFT, dynamic programming, and bitonic sort take

the same general approach. The algorithm is mapped onto the
GPU in as much of a “data parallel” or “task parallel” fashion
as possible in order to minimize or even eliminate inter-
block communication because such communication requires
an expensive barrier synchronization. For example, running a
single (constrained) problem instance per SM, i.e., 30 separate
problem instances on the NVIDIA GTX 280, obviates the need
for inter-block communication altogether.
To accelerate FFT [16], Govindaraju et al. [6] use efficient
memory access to optimize FFT performance. Specifically,
when the number of points in a sequence is small, shared
memory is used; if there are too many points in a sequence
to store in shared memory, then techniques for coalesced
global memory access are used. In addition, Govindaraju
et al. propose a hierarchical implementation to compute a
large sequence’s FFT by combining the FFTs of smaller
subsequences that can be calculated on shared memory. In
all of these FFT implementations, the necessary barrier syn-
chronization is done by the CPU via kernel launches. Another
work is that of Volkov et al. [30], which tries to accelerate
the FFT by designing a hierarchical communication scheme
to minimize inter-block communication. Finally, Nukada et al.
[20] accelerate the 3-D FFT through shared memory usage and
optimizing the number of threads and registers via appropriate
localization. Note that all of the aforementioned approaches
focus on optimizing the GPU computation and minimizing
or eliminating the inter-block communication rather than by
optimizing the performance of inter-block communication.
Past research on mapping dynamic programming, e.g., the
Smith-Waterman (SWat) algorithm, onto the GPU uses graph-
ics primitives [14], [15] in a task parallel fashion. More recent
work uses CUDA, but again, largely in a task parallel man-
ner [18], [19], [26] or in a fine-grain parallel approach [31].
In the task parallel approach, no inter-block communication
is needed, but the problem size it supports is limited to 1K
characters. While the fine-grain parallel approach can support
sequences of up to 7K characters, inter-block communication
time consumes about 50% of the total matrix filling time.
So if a better inter-block synchronization method is used,
performance improvements can be obtained.
For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABisort to O(n log n) with an adaptive
data structure that enables merges to be done in linear time.
Another parallel implementation of the bitonic sort is in the
CUDA SDK [21], but there is only one block in the kernel to
use the available barrier function __syncthreads(), thus
restricting the maximum number of items that can be sorted
to 512, the maximum number of threads in a block. If our
proposed inter-block GPU synchronization is used, multiple
blocks can be set in the kernel, which in turn, will significantly
increase the maximum number of items that can be sorted.
Many types of software barriers have been designed for
shared-memory environments [1], [3], [8], [11], [17], but none
of them can be directly applied to GPU environments. This
is because multiple CUDA thread blocks can be scheduled
to be executed on a single SM and the CUDA blocks do
not yield execution. That is, blocks run to completion
once spawned by the CUDA thread scheduler. This may result
in deadlocks, and thus, cannot be resolved in the same way
as in traditional CPU processing environments, where one
can yield the waiting process to execute other processes.
One way of addressing this is our GPU lock-based barrier
synchronization [31]. This approach leverages a traditional
shared mutex barrier and avoids deadlock by ensuring a one-
to-one mapping between the SMs and the thread blocks.
Cederman et al. [5] implement a dynamic load-balancing
method on the GPU that is based on the lock-free synchro-
nization method found on traditional multi-core processors.
However, this scheme controls task assignment instead of
addressing inter-block communication. In addition, we note
that lock-free synchronization generally performs worse than
lock-based methods on traditional multi-core processors, but
its performance is better than that of the lock-based method
on the GPU in our work.
The work of Stuart et al. [27] focuses on data communica-
tion between multiple GPUs, i.e., inter-GPU communication.
Though their approach can be used for inter-block communi-
cation across different SMs on the same GPU, the performance
is projected to be quite poor because data needs to be moved
to the CPU host memory first and then transferred back to the
device memory, which is unnecessary for data communication
on a single GPU card.
The most closely related work to ours is that of Volkov et al.
[29]. Volkov et al. propose a global software synchronization
method that does not use atomic operations to accelerate
dense linear-algebra constructs. However, as [29] notes, their
synchronization method has not been implemented into any
real application to test the performance improvement. Further-
more, their proposed synchronization cannot guarantee that
previous accesses to all levels of the memory hierarchy have
completed. Finally, Volkov et al. used only one thread to check
all arrival variables, hence serializing this portion of inter-
block synchronization and adversely affecting its performance.
In contrast, our proposed GPU synchronization approaches
guarantee the completion of memory accesses with the existing
memory access model in CUDA. This is because a new
function __threadfence() is added in CUDA 2.2, which
can guarantee that all writes to global memory are visible to other
threads, so correctness of reads after the barrier function
can be guaranteed. In addition, we integrate each of our
GPU synchronization approaches in a micro-benchmark and
three well-known algorithms: FFT, dynamic programming, and
bitonic sort. Finally, we use multiple threads in a block to
check all the arrival variables, which can be executed in
parallel, thus achieving a good performance.
IV. A MODEL FOR KERNEL EXECUTION TIME AND SPEEDUP

Fig. 1. Total Kernel Execution Time Composition
Fig. 2. CPU Explicit/Implicit Synchronization Function Call: (a) CPU explicit synchronization; (b) CPU implicit synchronization

In general, a kernel's execution time on GPUs consists of three components: kernel launch time, computation time, and synchronization time, which can be represented as

T = \sum_{i=1}^{M} \left( t_O^{(i)} + t_C^{(i)} + t_S^{(i)} \right)    (1)

where M is the number of kernel launches, t_O^{(i)} is the kernel launch time, t_C^{(i)} is the computation time, and t_S^{(i)} is the synchronization time for the i-th kernel launch, as shown in Figure 1. Each of the three time components is impacted by a
few factors. For instance, the kernel launch time depends on
the data transfer rate from the host to the device as well as the
size of kernel code and parameters. For the computation time,
it is affected by memory access methods, thread organization
(number of threads per block and number of blocks per grid)
in the kernel, etc. Similarly, the synchronization time will be
different with different synchronization approaches used.
Figure 2 shows the pseudo-code of implementing bar-
rier synchronization via kernel launches, where Figure 2(a)
is the function call of CPU Explicit Synchronization and
Figure 2(b) is for CPU Implicit Synchronization. As we
can see, in the CPU explicit synchronization, the kernel
function __kernel_func() is followed by the function
cudaThreadSynchronize(), which will not return un-
til all prior operations on the device are completed. As a
result, the three operations kernel launch, computation,
and synchronization are executed sequentially in the CPU
explicit synchronization. In contrast, in the CPU implicit syn-
chronization, cudaThreadSynchronize() is not called.
Since kernel launch is an asynchronous operation, if there are
multiple kernel launches, kernel launch time can be overlapped
by previous kernels’ computation time and synchronization
time. So, in the CPU implicit synchronization approach, except
for the first kernel launch, subsequent ones are pipelined
with computation and synchronization of previous kernel’s
execution, and the execution time of multiple kernel launches
can be represented as

T = t_O^{(1)} + \sum_{i=1}^{M} \left( t_C^{(i)} + t_{CIS}^{(i)} \right)    (2)

where M is the number of kernel launches, t_O^{(1)} is the time for the first kernel launch, and t_C^{(i)} and t_{CIS}^{(i)} are the computation time and synchronization time for the i-th kernel launch, respectively.

Fig. 3. GPU Synchronization Function Call
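
A host-side sketch of the two CPU synchronization patterns that Figure 2 describes (our reconstruction; __kernel_func and cudaThreadSynchronize are named in the text, while numBlocks, threadsPerBlock, d_data, and M are illustrative placeholders):

__global__ void __kernel_func(float *data) { /* one step of the computation */ }

void run_with_cpu_explicit_sync(float *d_data, int numBlocks, int threadsPerBlock, int M)
{
    for (int i = 0; i < M; ++i) {
        __kernel_func<<<numBlocks, threadsPerBlock>>>(d_data);
        cudaThreadSynchronize();   // Figure 2(a): wait after every launch
    }
}

void run_with_cpu_implicit_sync(float *d_data, int numBlocks, int threadsPerBlock, int M)
{
    for (int i = 0; i < M; ++i) {
        // Figure 2(b): launches are asynchronous and queue in order, so launch i+1
        // is pipelined behind the computation and synchronization of launch i
        __kernel_func<<<numBlocks, threadsPerBlock>>>(d_data);
    }
    cudaThreadSynchronize();       // a single wait at the end, if the host needs the result
}
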
With respect to the GPU Synchronization, Figure 3 shows
the pseudo-code of how functions are called. In this approach,
a kernel is launched only once. When barrier synchroniza-
tion is needed, a barrier function __gpu_sync() is called
instead of re-launching the kernel. In Figure 3, the function
__device_func() implements the same functionality as
the kernel function __kernel_func() in Figure 2, but it
is a device function instead of a global one, so it is called on
the device rather than on the host. In the GPU synchronization,
kernel execution time can be expressed as
T = t_O + \sum_{i=1}^{M} \left( t_C^{(i)} + t_{GS}^{(i)} \right)    (3)

where M is the number of barriers needed for the kernel's execution, t_O is the kernel launch time, and t_C^{(i)} and t_{GS}^{(i)} are the computation time and synchronization time for the i-th loop, respectively.
From Equations (1), (2), and (3), an algorithm can be
accelerated by decreasing any of the three time components.
With the properties of kernel launch time considered², we ignore the kernel launch time in the following discussion. If the synchronization time is reduced, then according to Amdahl's Law, the maximum kernel execution speedup is constrained by

S_T = \frac{T}{t_C + (T - t_C)/S_S} = \frac{1}{t_C/T + (1 - t_C/T)/S_S} = \frac{1}{\rho + (1 - \rho)/S_S}    (4)

where S_T is the kernel execution speedup gained by reducing the synchronization time, \rho = t_C / T is the percentage of the computation time t_C in the total kernel execution time T, t_S = T - t_C is the synchronization time of the CPU implicit synchronization (which is our baseline, as mentioned later), and S_S is the synchronization speedup. Similarly, if only the computation is accelerated, the maximum overall speedup is constrained by

S_T = \frac{1}{\rho / S_C + (1 - \rho)}    (5)

where S_C is the computation speedup.

²Three properties are considered. First, kernel launch time can be combined with the synchronization time in the CPU explicit synchronization; second, it can be overlapped in the CPU implicit synchronization; third, the kernel is launched only once in the GPU synchronization.

In Equation (4), the smaller ρ is, the more speedup can be gained with a fixed S_S; while in Equation (5), the larger ρ is, the more speedup can be obtained with a fixed S_C.
In practice, different algorithms have different ρ values. For
example, for the three algorithms used in this paper, FFT has
a ρ value larger than 0.8, while SWat and bitonic sort have
a ρ of about 0.5 and 0.4, respectively. According to Equation
(5), corresponding to these ρ values, if only the computation
is accelerated, the maximum speedups of the three aforementioned algorithms are as shown in Table II. As can be observed, very
low speedup can be obtained in these three algorithms if only
the computation is accelerated. Since most of the previous
work focuses on optimizing the computation, i.e., decreases
the computation time t
C
, the more optimization is performed
on an algorithm, the smaller ρ will become. At this time,
decreasing the computation time will not help much for the
overall performance. On the other side, if we decrease the
synchronization time, large kernel execution speedup can be
obtained.
TABLE II
POSSIBLE MAXIMUM SPEEDUP WITH ONLY COMPUTATION
ACCELERATED
Algorithms FFT SWat Bitonic sort
ρ 0.82 0.51 0.40
Possible maximum speedup 5.61 2.03 1.68
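
As a quick numerical check of the model (our arithmetic, using the rounded ρ values above, so the results differ slightly from the table, which presumably uses unrounded measurements): for bitonic sort, ρ = 0.40, and letting S_C → ∞ in Equation (5) gives

S_T \rightarrow \frac{1}{1 - \rho} = \frac{1}{0.60} \approx 1.67,

close to the 1.68 reported in Table II. Conversely, Equation (4) says that speeding up only the synchronization by S_S = 4 would yield S_T = 1/(0.40 + 0.60/4) ≈ 1.82, with an upper bound of 1/ρ = 2.5 as S_S → ∞.
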
In this paper, we will focus on decreasing the synchroniza-
tion time. This is due to three facts:
1) A lot of work [6], [10], [15], [19], [25] has been proposed to decrease the computation time. Techniques such as shared memory usage and divergent branch removal have been widely used.
2) No work has been done to decrease the synchronization
time for algorithms to be executed on a GPU;
3) In some algorithms, the synchronization time consumes
a large part of the kernel execution time (e.g., SWat and
bitonic sort in Figure 12), which results in a small ρ
value.
With the above model for speedup brought by synchro-
nization time reduction, we propose two GPU synchronization
approaches in the next section, and time consumption of each
of them is modeled and analyzed quantitatively.
V. PROPOSED GPU SYNCHRONIZATION
Since, in the CUDA programming model, the execution of a
thread block is non-preemptive, care must be taken to avoid
deadlocks in GPU synchronization design. Consider a scenario
where multiple thread blocks are mapped to one SM and the
active block is waiting for the completion of a global barrier.
A deadlock will occur in this case because unscheduled thread
blocks will not be able to reach the barrier without preemption.
Our solution to this problem is to have a one-to-one mapping
between thread blocks and SMs. In other words, for a GPU
with ‘Y’ SMs, we ensure that at most ‘Y’ blocks are used in
the kernel. In addition, we allocate all available shared memory
on an SM to each block so that no two blocks can be scheduled
to the same SM because of the memory constraint.
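
A sketch of how this one-to-one mapping can be enforced from the host (our illustration, not the paper's code; my_kernel is a placeholder name):

#include <cuda_runtime.h>

__global__ void my_kernel(float *data);

void launch_one_block_per_sm(float *d_data, int threadsPerBlock)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int    numBlocks = prop.multiProcessorCount;   // 'Y' blocks for 'Y' SMs (30 on a GTX 280)
    size_t smemBytes = prop.sharedMemPerBlock;     // claim the SM's shared memory so no second
                                                   // block fits; in practice slightly less may be
                                                   // needed if the kernel declares static shared memory

    my_kernel<<<numBlocks, threadsPerBlock, smemBytes>>>(d_data);
}
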
In the following discussion, we will present two alternative
GPU synchronization designs: GPU lock-based synchroniza-
tion and GPU lock-free synchronization. The first one uses
a mutex variable and CUDA atomic operations; while the
second method uses a lock-free algorithm that avoids the use
of expensive CUDA atomic operations.
A. GPU Lock-Based Synchronization
The basic idea of GPU lock-based synchronization [31]
is to use a global mutex variable to count the number of
thread blocks that reach the synchronization point. As shown
in Figure 4, in the barrier function __gpu_sync(), after a
block completes its computation, one of its threads (we call it
the leading thread.) will atomically add 1 to g_mutex. The
leading thread will then repeatedly compare g_mutex to a
target value goalVal. If g_mutex is equal to goalVal,
the synchronization is completed and each thread block can
proceed with its next stage of computation. In our design,
goalVal is set to the number of blocks N in the kernel when
the barrier function is first called. The value of goalVal is
then incremented by N each time the barrier function is
successively called. This design is more efficient than keeping
goalVal constant and resetting g_mutex after each barrier
because the former saves the number of instructions and avoids
conditional branching.
//the mutex variable
__device__ volatile int g_mutex;

//GPU lock-based synchronization function
__device__ void __gpu_sync(int goalVal)
{
    //thread ID in a block
    int tid_in_block = threadIdx.x * blockDim.y + threadIdx.y;

    // only thread 0 is used for synchronization
    if (tid_in_block == 0) {
        atomicAdd((int *)&g_mutex, 1);

        //only when all blocks add 1 to g_mutex
        //will g_mutex equal goalVal
        while (g_mutex != goalVal) {
            //Do nothing here
        }
    }
    __syncthreads();
}

Fig. 4. Code snapshot of the GPU Lock-Based Synchronization
In the GPU lock-based synchronization, the execution time
of the barrier function __gpu_sync() consists of three

Citations
Posted Content
TL;DR: A single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model, the WaveRNN, and a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once.
Abstract: Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.

520 citations

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy that guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
Abstract: Breadth-first search (BFS) has wide applications in electronic design automation (EDA) as well as in other fields. Researchers have tried to accelerate BFS on the GPU, but the two published works are both asymptotically slower than the fastest CPU implementation. In this paper, we present a new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy. It guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.

235 citations

Proceedings ArticleDOI
13 May 2012
TL;DR: Through micro-kernel benchmarks, it is shown the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases.
Abstract: In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and identify four primary use cases it could be useful in addressing—CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.

202 citations

Proceedings ArticleDOI
23 Feb 2013
TL;DR: This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol, called TC-Weak.
Abstract: While scalable coherence has been extensively studied in the context of general purpose chip multiprocessors (CMPs), GPU architectures present a new set of challenges. Introducing conventional directory protocols adds unnecessary coherence traffic overhead to existing GPU applications. Moreover, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC) [34, 54], explored the use of time-based approaches in CMP coherence protocols. This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present an implementation of TC, called TC-Weak, which eliminates LCC's trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications with inter-workgroup communication by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also find that write-through protocols outperform a writeback protocol on a GPU as the latter suffers from increased traffic due to unnecessary refills of write-once data.

166 citations

Proceedings ArticleDOI
14 Mar 2015
TL;DR: A model of Nvidia GPU hardware is proposed, which correctly models every behaviour witnessed in the authors' experiments, and is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.
Abstract: Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent programs), we questioned the assumptions in programming guides and vendor documentation about the guarantees provided by hardware. We developed a tool to generate thousands of litmus tests and run them under stressful workloads. We observed a litany of previously elusive weak behaviours, and exposed folklore beliefs about GPU programming---often supported by official tutorials---as false. As a way forward, we propose a model of Nvidia GPU hardware, which correctly models every behaviour witnessed in our experiments. The model is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.

110 citations


Additional excerpts

  • ...[15, 26, 45]) simply omit memory model considerations....

    [...]

References
Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other Pair of segments with greater similarity (homology).

10,262 citations


"Inter-block GPU communication via f..." refers background or methods in this paper

  • ...Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU used graphics primitives [15, 14] in a task parallel fashion....

    [...]

  • ..., Smith-Waterman [25]), and bitonic sort [4] — and evaluate their effectiveness....

    [...]

  • ...This is due to three facts: 1) There have been a lot of works [19, 25, 15, 6, 10] to decrease the computation time....

    [...]

  • ...The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4]....

    [...]

  • ...We then introduce these GPU synchronization strategies into three different algorithms — Fast Fourier Transform (FFT) [16], dynamic programming (e.g., Smith-Waterman [25]), and bitonic sort [4] — and evaluate their effectiveness....

    [...]

Proceedings ArticleDOI
30 Apr 1968
TL;DR: To achieve high throughput rates today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
Abstract: To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently. A major problem in the design of such a computing system is the connecting together of the various parts of the system (the I/O devices, memories, processing units, etc.) in such a way that all the required data transfers can be accommodated. One common scheme is a high-speed bus which is time-shared by the various parts; speed of available hardware limits this scheme. Another scheme is a cross-bar switch or matrix; limiting factors here are the amount of hardware (an m × n matrix requires m × n cross-points) and the fan-in and fan-out of the hardware.

2,553 citations


"Inter-block GPU communication via f..." refers background or methods in this paper

  • ...More detailed information about bitonic sort is in [4]....

    [...]

  • ...The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4]....

    [...]

  • ...Bitonic sort is one of the fastest sorting networks [13], which is a special type of sorting algorithm devised by Ken Batcher [4]....

    [...]

  • ..., Smith-Waterman [25]), and bitonic sort [4] — and evaluate their effectiveness....

    [...]

Book
01 Jan 1992
TL;DR: The Radix-2 Frameworks, a collection of general and high performance FFTs designed to solve the multi-Dimensional FFT problem of Prime Factor and Convolution, are presented.
Abstract: 1. The Radix-2 Frameworks. Matrix Notation and Algorithms The FFT Idea The Cooley-Tukey Factorization Weight and Butterfly Computations Bit Reversal and Transposition The Cooley-Tukey Framework The Stockham Autosort Frameworks The Pease Framework Decimation in Frequency and Inverse FFTs 2. General Radix Frameworks. The General Radix Ideas Index Reversal and Transposition Mixed-Radix Factorizations Radix-4 and Radix-8 Frameworks The Split-Radix Frameworks 3. High Performance Frameworks. The Multiple DFT Problem Matrix Transposition The Large Single-Vector FFT Problem Multi-Dimensional FFT Problem Distributed Memory FFTs Shared Memory FFTs 4. Selected Topics. Prime Factor FFTs Convolution FFTs of Real Data Cosine and Sine Transforms Fast Poisson Solvers Bibliography Index.

1,222 citations


"Inter-block GPU communication via f..." refers background or methods in this paper

  • ...A detailed description of the FFT algorithm can be found in [16]....

    [...]

  • ...We then introduce these GPU synchronization strategies into three different algorithms — Fast Fourier Transform (FFT) [16], dynamic programming (e....

    [...]

  • ...The three algorithms are Fast Fourier Transformation [16], Smith-Waterman [25], and bitonic sort [4]....

    [...]

  • ...To accelerate FFT [16], Govindaraju et al....

    [...]

Proceedings ArticleDOI
20 Feb 2008
TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations.
Abstract: GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve between a 10.5X to 457X speedup in kernel codes and between 1.16X to 431X total application speedup.

993 citations


"Inter-block GPU communication via f..." refers background or methods in this paper

  • ...However, GPUs typically map well only to data or task parallel applications whose execution requires minimal or even no interblock communication [9, 24, 26, 30]....

    [...]

  • ...With these programming models, more and more applications have been mapped to GPUs and accelerated [6, 7, 10, 12, 18, 19, 23, 24, 26, 30]....

    [...]

Proceedings ArticleDOI
15 Nov 2008
TL;DR: In this article, the authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units, and exploit blocking similarly to vector computers and heterogeneity of the system.
Abstract: We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80--90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.

787 citations

Frequently Asked Questions (9)
Q1. What are the contributions mentioned in the paper "Inter-block gpu communication via fast barrier synchronization" ?

The authors propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, their GPU lock-free synchronization further improves performance by 10 %, 26 %, and 40 %, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively. 

As for future work, the authors will further investigate the reasons for the irregularity of the FFT's synchronization time versus the number of blocks in the kernel. Second, the authors will propose a general model to characterize algorithms' parallelism properties, based on which better performance can be obtained for their parallelization on multi- and many-core architectures.

For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABisort to O(n log n) with an adaptive data structure that enables merges to be done in linear time.

In addition, the authors allocate all available shared memory on an SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint. 

In addition, the authors integrate each of their GPU synchronization approaches in a micro-benchmark and three well-known algorithms: FFT, dynamic programming, and bitonic sort. 

Another parallel implementation of the bitonic sort is in the CUDA SDK [21], but there is only one block in the kernel to use the available barrier function __syncthreads(), thus restricting the maximum number of items that can be sorted to 512 — the maximum number of threads in a block. 

As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed. 

It is worth noting that in the step 2) above, rather than having one thread to check all elements of Arrayin in serial as in [29], the authors use N threads to check the elements of Arrayin in parallel. 

Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU uses graphics primitives [14], [15] in a task parallel fashion. 
