
Inter-Block GPU Communication via Fast
Barrier Synchronization
Shucai Xiao and Wu-chun Feng
Department of Electrical and Computer Engineering
Department of Computer Science
Virginia Tech
Blacksburg, Virginia 24061
Email: {shucai, wfeng}@vt.edu
Abstract—While GPGPU stands for general-purpose compu-
tation on graphics processing units, the lack of explicit support
for inter-block communication on the GPU arguably hampers its
broader adoption as a general-purpose computing device. Inter-
block communication on the GPU occurs via global memory
and then requires barrier synchronization across the blocks,
i.e., inter-block GPU communication via barrier synchronization.
Currently, such synchronization is only available via the CPU,
which in turn, can incur significant overhead.
We propose two approaches for inter-block GPU communi-
cation via barrier synchronization: GPU lock-based synchro-
nization and GPU lock-free synchronization. We then evaluate
the efficacy of each approach via a micro-benchmark as well
as three well-known algorithms: Fast Fourier Transform
(FFT), dynamic programming, and bitonic sort. For the micro-
benchmark, the experimental results show that our GPU lock-
free synchronization performs 8.4 times faster than CPU explicit
synchronization and 4.0 times faster than CPU implicit synchro-
nization. When integrated with the FFT, dynamic programming,
and bitonic sort algorithms, our GPU lock-free synchronization
further improves performance by 10%, 26%, and 40%, respec-
tively, and ultimately delivers an overall speed-up of 70x, 13x,
and 24x, respectively.
I. INTRODUCTION
Today, improving the computational capability of a proces-
sor comes from increasing its number of processing cores
rather than increasing its clock speed. This is reflected in
both traditional multi-core processors and many-core graphics
processing units (GPUs).
Originally, GPUs were designed for graphics-based ap-
plications. With the elimination of key architecture limita-
tions, GPUs have evolved to become more widely used for
general-purpose computation, i.e., general-purpose computa-
tion on the GPU (GPGPU). Programming models such as
NVIDIA's Compute Unified Device Architecture (CUDA) [22]
and AMD/ATI’s Brook+ [2] enable applications to be more
easily mapped onto the GPU. With these programming models,
more and more applications have been mapped to GPUs and
accelerated [6], [7], [10], [12], [18], [19], [23], [24], [26], [30].
However, GPUs typically map well only to data or task
parallel applications whose execution requires minimal or even
no inter-block communication [9], [24], [26], [30]. Why?
There exists no explicit support for inter-block communication
on the GPU. Currently, such inter-block communication occurs
via global memory and requires a barrier synchronization to
complete the communication, which is (inefficiently) imple-
mented via the host CPU. Hereafter, we refer to such CPU-
based barrier synchronization as CPU synchronization.
In general, when a program (i.e., kernel) executes on the
GPU, its execution time consists of three phases: (1) kernel
launch to the GPU, (2) computation on the GPU, and (3)
inter-block GPU communication via barrier synchronization.¹
With different approaches for synchronization, the percentage
of time that each of these three phases takes will differ.
Furthermore, some of the phases may overlap in time. To
quantify the execution time of each phase, we propose a
general performance model that partitions the kernel execution
time into the three aforementioned phases. Based on our model
and code profiling while using the current state of the art
in barrier synchronization, i.e., CPU implicit synchronization
(see Section IV), inter-block communication via barrier syn-
chronization can consume more than 50% of the total kernel
execution time, as shown in Table I.
TABLE I
PERCENT OF TIME SPENT ON INTER-BLOCK COMMUNICATION (SWat: Smith-Waterman)

Algorithms                                     FFT      SWat     Bitonic sort
% of time spent on inter-block communication   17.8%    49.2%    59.6%
Hence, in contrast to previous work that mainly focuses on
optimizing the GPU computation, we focus on reducing the
inter-block communication time via barrier synchronization.
To achieve this, we propose a set of GPU synchronization
strategies, which can synchronize the execution of different
blocks without the involvement of the host CPU, thus avoiding
the costly operation of a kernel launch from the CPU to
GPU. To the best of our knowledge, this work is the first that
systematically addresses how to better support more general-
purpose computation by significantly reducing the inter-block
communication time (rather than the computation time) on a GPU.

¹ Because inter-block GPU communication time is dominated by the inter-block synchronization time, we will use inter-block synchronization time instead of inter-block GPU communication time hereafter.
We propose two types of GPU synchronization, one with
locks and the other without. For the former, we use one
mutual-exclusive (mutex) variable and an atomic add operation
to implement GPU lock-based synchronization. With respect to
the latter, which we refer to as GPU lock-free synchronization,
we use two arrays, instead of mutex variables, and eliminate
the need for atomic operations. With this approach, each thread
within a single block controls the execution of a different
block, and the intra-block synchronization is achieved by
synchronizing the threads within the block with the existing
barrier function __syncthreads().
We then introduce these GPU synchronization strate-
gies into three different algorithms: Fast Fourier Transform (FFT) [16], dynamic programming (e.g., Smith-Waterman [25]), and bitonic sort [4], and evaluate their
effectiveness. Specifically, based on our performance model,
we analyze the percentage of time spent computing versus
synchronizing for each of the algorithms.
Finally, according to the work of Volkov et al. [29],
correctness of inter-block communication via GPU synchro-
nization cannot be guaranteed unless a memory consistency
model is assumed. To address this problem, a new function __threadfence() was introduced in CUDA 2.2. This function blocks the calling thread until its prior writes to global or shared memory are visible to other threads [22]. Integrating __threadfence() into our barrier functions is expected to add overhead. From our experimental results, when the kernel contains more than 18 blocks, the performance of all three algorithms is worse than with CPU implicit synchronization. As a result, although barriers can be implemented efficiently in software, guaranteeing inter-block communication correctness with __threadfence() incurs considerable overhead; implementing efficient barrier synchronization in hardware, or improving the efficiency of the memory flush, therefore becomes necessary for efficient and correct inter-block communication on GPUs. It is worth noting that even without calling __threadfence() in our barrier functions, all results were correct across thousands of runs.
Overall, the contributions of this paper are four-fold. First,
we propose two GPU synchronization strategies for inter-
block synchronization. These strategies do not involve the host
CPU, and in turn, reduce the synchronization time between
blocks. Second, we propose a performance model for kernel
execution time and speedup that characterizes the efficacy of
different synchronization approaches. Third, we integrate our
proposed GPU synchronization strategies into three widely
used algorithms: Fast Fourier Transform (FFT), dynamic programming, and bitonic sort, and obtain performance improvements of 9.08%, 25.47%, and 40.39%, respectively,
over the traditional CPU synchronization approach. Fourth,
we show the cost of guaranteeing inter-block communication
correctness via __threadfence(). From our experimental results, although our proposed barrier synchronization is efficient, the inefficiency of __threadfence() incurs substantial overhead, especially when the number of blocks in a kernel is large.
The rest of the paper is organized as follows. Section II
provides an overview of the NVIDIA GTX 280 architecture
and CUDA programming model. The related work is described
in Section III. Section IV presents the time partition model
for kernel execution time. Section V describes our GPU
synchronization approaches. In Section VI, we give a brief
description of the algorithms that we use to evaluate our
proposed GPU synchronization strategies, and Section VII
presents and analyzes the experimental results. Section VIII
concludes the paper.
II. OVERVIEW OF CUDA ON THE NVIDIA GTX 280
The NVIDIA GeForce GTX 280 GPU card consists of 240
streaming processors (SPs), each clocked at 1296 MHz. These
240 SPs are grouped into 30 streaming multiprocessors (SMs),
each of which contains 8 streaming processors. The on-chip
memory for each SM contains 16,384 registers and 16 KB
of shared memory, which can only be accessed by threads
executing on that SM; this grouping of threads on an SM is
denoted as a block. The off-chip memory (or device memory)
contains 1 GB of GDDR3 global memory and supports a
memory bandwidth of 141.7 gigabytes per second (GB/s).
Global memory can be accessed by all threads and blocks
on the GPU, and thus, is often used to communicate data
across different blocks via a CPU barrier synchronization, as
explained later.
NVIDIA provides the CUDA programming model and
software environment [22]. It is an extension to the C program-
ming language. In general, only the compute-intensive and
data-parallel parts of a program are parallelized with CUDA
and are implemented as kernels that are compiled to the device
instruction set. A kernel must be launched to the device before
it can be executed.
In CUDA, threads within a block can communicate via
shared memory or global memory. The barrier function
__syncthreads() ensures proper communication. We re-
fer to this as intra-block communication.
However, there is no explicit support for data communica-
tion across different blocks, i.e., inter-block communication.
Currently, this type of data communication occurs via global
memory, followed by a barrier synchronization via the CPU.
That is, the barrier is implemented by terminating the current
kernel’s execution and re-launching the kernel, which is an
expensive operation.
III. RELATED WORK
Our work is most closely related to two areas of research:
(1) algorithmic mapping of data parallel algorithms onto the
GPU, specifically for FFT, dynamic programming, and bitonic
sort and (2) synchronization protocols in multi- and many-core
environments.
To the best of our knowledge, all known algorithmic map-
pings of FFT, dynamic programming, and bitonic sort take

the same general approach. The algorithm is mapped onto the
GPU in as much of a “data parallel” or “task parallel” fashion
as possible in order to minimize or even eliminate inter-
block communication because such communication requires
an expensive barrier synchronization. For example, running a
single (constrained) problem instance per SM, i.e., 30 separate
problem instances on the NVIDIA GTX 280, obviates the need
for inter-block communication altogether.
To accelerate FFT [16], Govindaraju et al. [6] use efficient
memory access to optimize FFT performance. Specifically,
when the number of points in a sequence is small, shared
memory is used; if there are too many points in a sequence
to store in shared memory, then techniques for coalesced
global memory access are used. In addition, Govindaraju
et al. propose a hierarchical implementation to compute a
large sequence’s FFT by combining the FFTs of smaller
subsequences that can be calculated on shared memory. In
all of these FFT implementations, the necessary barrier syn-
chronization is done by the CPU via kernel launches. Another
work is that of Volkov et al. [30], which tries to accelerate
the FFT by designing a hierarchical communication scheme
to minimize inter-block communication. Finally, Nukada et al.
[20] accelerate the 3-D FFT through shared memory usage and
optimizing the number of threads and registers via appropriate
localization. Note that all of the aforementioned approaches
focus on optimizing the GPU computation and minimizing
or eliminating the inter-block communication rather than
optimizing the performance of inter-block communication.
Past research on mapping dynamic programming, e.g., the
Smith-Waterman (SWat) algorithm, onto the GPU uses graph-
ics primitives [14], [15] in a task parallel fashion. More recent
work uses CUDA, but again, largely in a task parallel man-
ner [18], [19], [26] or in a fine-grain parallel approach [31].
In the task parallel approach, no inter-block communication
is needed, but the problem size it supports is limited to 1K
characters. While the fine-grain parallel approach can support
sequences of up to 7K characters, inter-block communication
time consumes about 50% of the total matrix filling time.
Thus, if a better inter-block synchronization method were used, further performance improvements could be obtained.
For bitonic sort, Greß et al. [7] improve the algorithmic
complexity of GPU-ABisort to O (n log n) with an adaptive
data structure that enables merges to be done in linear time.
Another parallel implementation of the bitonic sort is in the
CUDA SDK [21], but there is only one block in the kernel to
use the available barrier function __syncthreads(), thus
restricting the maximum number of items that can be sorted
to 512, the maximum number of threads in a block. If our
proposed inter-block GPU synchronization is used, multiple
blocks can be set in the kernel, which in turn, will significantly
increase the maximum number of items that can be sorted.
Many types of software barriers have been designed for
shared-memory environments [1], [3], [8], [11], [17], but none
of them can be directly applied to GPU environments. This
is because multiple CUDA thread blocks can be scheduled to execute on a single SM, and CUDA blocks do not yield execution; that is, blocks run to completion once spawned by the CUDA thread scheduler. This may result in deadlocks that cannot be resolved in the same way as in traditional CPU processing environments, where the waiting process can yield so that other processes may execute.
One way of addressing this is our GPU lock-based barrier
synchronization [31]. This approach leverages a traditional
shared mutex barrier and avoids deadlock by ensuring a one-
to-one mapping between the SMs and the thread blocks.
Cederman et al. [5] implement a dynamic load-balancing
method on the GPU that is based on the lock-free synchro-
nization method found on traditional multi-core processors.
However, this scheme controls task assignment instead of
addressing inter-block communication. In addition, we note
that lock-free synchronization generally performs worse than
lock-based methods on traditional multi-core processors, but
its performance is better than that of the lock-based method
on the GPU in our work.
The work of Stuart et al. [27] focuses on data communica-
tion between multiple GPUs, i.e., inter-GPU communication.
Though their approach can be used for inter-block communi-
cation across different SMs on the same GPU, the performance
is projected to be quite poor because data needs to be moved
to the CPU host memory first and then transferred back to the
device memory, which is unnecessary for data communication
on a single GPU card.
The most closely related work to ours is that of Volkov et al.
[29]. Volkov et al. propose a global software synchronization
method that does not use atomic operations to accelerate
dense linear-algebra constructs. However, as [29] notes, their
synchronization method has not been implemented into any
real application to test the performance improvement. Further-
more, their proposed synchronization cannot guarantee that
previous accesses to all levels of the memory hierarchy have
completed. Finally, Volkov et al. used only one thread to check
all arrival variables, hence serializing this portion of inter-
block synchronization and adversely affecting its performance.
In contrast, our proposed GPU synchronization approaches
guarantee the completion of memory accesses with the existing
memory access model in CUDA. This is because the new function __threadfence(), added in CUDA 2.2, guarantees that all prior writes to global memory are visible to other threads, so the correctness of reads after the barrier function can be guaranteed. In addition, we integrate each of our
GPU synchronization approaches in a micro-benchmark and
three well-known algorithms: FFT, dynamic programming, and
bitonic sort. Finally, we use multiple threads in a block to
check all the arrival variables in parallel, thus achieving good performance.
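The lock-free barrier itself is presented later in the paper (Section V-B, not included in this excerpt); the following is only our sketch of how such an arrival/release scheme can be structured, based on the description above, with hypothetical array names and block 0 acting as the coordinating block.

#define NUM_BLOCKS 30                      // one block per SM on the GTX 280

__device__ volatile int Arrayin[NUM_BLOCKS];
__device__ volatile int Arrayout[NUM_BLOCKS];

__device__ void __gpu_sync_lockfree(int goalVal)
{
    int tid     = threadIdx.x * blockDim.y + threadIdx.y;
    int bid     = blockIdx.x * gridDim.y + blockIdx.y;
    int nblocks = gridDim.x * gridDim.y;

    // each block's leading thread announces its arrival
    if (tid == 0) Arrayin[bid] = goalVal;

    // in block 0, thread i waits for block i, then releases it
    if (bid == 0) {
        if (tid < nblocks) {
            while (Arrayin[tid] != goalVal) { /* spin */ }
        }
        __syncthreads();                   // all arrivals observed by block 0
        if (tid < nblocks) Arrayout[tid] = goalVal;
    }

    // each block's leading thread waits to be released
    if (tid == 0) {
        while (Arrayout[bid] != goalVal) { /* spin */ }
    }
    __syncthreads();                       // release the remaining threads of the block
}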
IV. A MODEL FOR KERNEL EXECUTION TIME AND
SPEEDUP
Fig. 1. Total Kernel Execution Time Composition

Fig. 2. CPU Explicit/Implicit Synchronization Function Call: (a) CPU explicit synchronization; (b) CPU implicit synchronization

In general, a kernel's execution time on GPUs consists of three components: kernel launch time, computation time, and synchronization time, which can be represented as
$$T = \sum_{i=1}^{M}\left(t_O^{(i)} + t_C^{(i)} + t_S^{(i)}\right) \qquad (1)$$

where $M$ is the number of kernel launches, $t_O^{(i)}$ is the kernel launch time, $t_C^{(i)}$ is the computation time, and $t_S^{(i)}$ is the synchronization time for the $i$-th kernel launch, as shown in Figure 1. Each of the three time components is impacted by a
few factors. For instance, the kernel launch time depends on
the data transfer rate from the host to the device as well as the
size of the kernel code and parameters. The computation time is affected by memory access methods, thread organization (number of threads per block and number of blocks per grid) in the kernel, etc. Similarly, the synchronization time differs with the synchronization approach used.
Figure 2 shows the pseudo-code of implementing bar-
rier synchronization via kernel launches, where Figure 2(a)
is the function call of CPU Explicit Synchronization and
Figure 2(b) is for CPU Implicit Synchronization. As we
can see, in the CPU explicit synchronization, the kernel
function __kernel_func() is followed by the function
cudaThreadSynchronize(), which will not return un-
til all prior operations on the device are completed. As a
result, the three operations (kernel launch, computation, and synchronization) are executed sequentially in the CPU
explicit synchronization. In contrast, in the CPU implicit syn-
chronization, cudaThreadSynchronize() is not called.
Since kernel launch is an asynchronous operation, if there are
multiple kernel launches, kernel launch time can be overlapped
by previous kernels’ computation time and synchronization
time. So, in the CPU implicit synchronization approach, except
for the first kernel launch, subsequent ones are pipelined
with the computation and synchronization of the previous kernel's
execution, and the execution time of multiple kernel launches
can be represented as
$$T = t_O^{(1)} + \sum_{i=1}^{M}\left(t_C^{(i)} + t_{CIS}^{(i)}\right) \qquad (2)$$

where $M$ is the number of kernel launches, $t_O^{(1)}$ is the time for the first kernel launch, and $t_C^{(i)}$ and $t_{CIS}^{(i)}$ are the computation time and synchronization time for the $i$-th kernel launch, respectively.
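Since Figure 2 itself is not reproduced in this excerpt, the following sketch reflects our reading of the two call patterns it depicts; __kernel_func, d_data, and the launch configuration are placeholders.

__global__ void __kernel_func(float *d_data);   // computation kernel, defined elsewhere

// (a) CPU explicit synchronization: every launch is followed by
// cudaThreadSynchronize(), so launch, computation, and synchronization
// are strictly serialized.
void cpu_explicit_sync(float *d_data, int M)
{
    dim3 grid(30), threads(256);
    for (int i = 0; i < M; i++) {
        __kernel_func<<<grid, threads>>>(d_data);
        cudaThreadSynchronize();                 // returns only when the device is idle
    }
}

// (b) CPU implicit synchronization: no per-launch call; launches to the
// same stream are queued in order, so launch i+1 overlaps the completion
// of kernel i and only the first launch is fully exposed.
void cpu_implicit_sync(float *d_data, int M)
{
    dim3 grid(30), threads(256);
    for (int i = 0; i < M; i++) {
        __kernel_func<<<grid, threads>>>(d_data);
    }
    cudaThreadSynchronize();                     // wait once for the final kernel
}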
Fig. 3. GPU Synchronization Function Call

With respect to the GPU Synchronization, Figure 3 shows
the pseudo-code of how functions are called. In this approach,
a kernel is launched only once. When barrier synchroniza-
tion is needed, a barrier function __gpu_sync() is called
instead of re-launching the kernel. In Figure 3, the function
__device_func() implements the same functionality as
the kernel function __kernel_func() in Figure 2, but it
is a device function instead of a global one, so it is called on
the device rather than on the host. In the GPU synchronization,
kernel execution time can be expressed as
$$T = t_O + \sum_{i=1}^{M}\left(t_C^{(i)} + t_{GS}^{(i)}\right) \qquad (3)$$

where $M$ is the number of barriers needed for the kernel's execution, $t_O$ is the kernel launch time, and $t_C^{(i)}$ and $t_{GS}^{(i)}$ are the computation time and synchronization time for the $i$-th loop, respectively.
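Figure 3 is likewise not reproduced; the sketch below shows the call structure it describes under our reading: the kernel is launched once, each stage runs as a device function, and __gpu_sync() replaces the kernel re-launch. The goalVal arithmetic follows the description in Section V-A; the kernel name and launch configuration are placeholders.

__device__ void __device_func(float *d_data);   // same work as __kernel_func, but a device function
__device__ void __gpu_sync(int goalVal);        // barrier function from Section V

__global__ void kernel_with_gpu_sync(float *d_data, int M)
{
    int nblocks = gridDim.x * gridDim.y;         // number of blocks N in the kernel
    for (int i = 1; i <= M; i++) {
        __device_func(d_data);                   // computation for the i-th stage
        __gpu_sync(i * nblocks);                 // goalVal = N, 2N, 3N, ... on successive calls
    }
}

// Host side: a single launch, with one block per SM (see Section V):
//   kernel_with_gpu_sync<<<30, 256>>>(d_data, M);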
From Equations (1), (2), and (3), an algorithm can be
accelerated by decreasing any of the three time components.
With the properties of kernel launch time considered², we ignore the kernel launch time in the following discussion. If the synchronization time is reduced, then according to Amdahl's Law, the maximum kernel execution speedup is constrained by
$$S_T = \frac{T}{t_C + (T - t_C)/S_S} = \frac{1}{\frac{t_C}{T} + \left(1 - \frac{t_C}{T}\right)/S_S} = \frac{1}{\rho + (1 - \rho)/S_S} \qquad (4)$$
where $S_T$ is the kernel execution speedup gained by reducing the synchronization time, $\rho = t_C / T$ is the percentage of the computation time $t_C$ in the total kernel execution time $T$, $t_S = T - t_C$ is the synchronization time of the CPU implicit synchronization (our baseline, as mentioned later), and $S_S$ is the synchronization speedup.

² Three properties are considered. First, the kernel launch time can be combined with the synchronization time in the CPU explicit synchronization; second, it can be overlapped in the CPU implicit synchronization; third, the kernel is launched only once in the GPU synchronization.

Similarly, if only the computation is accelerated, the maximum overall speedup is constrained by
$$S_T = \frac{1}{\rho / S_C + (1 - \rho)} \qquad (5)$$

where $S_C$ is the computation speedup.
In Equation (4), the smaller the $\rho$ is, the more speedup can be gained with a fixed $S_S$; while in Equation (5), the larger the $\rho$ is, the more speedup can be obtained with a fixed $S_C$.
In practice, different algorithms have different $\rho$ values. For example, for the three algorithms used in this paper, FFT has a $\rho$ value larger than 0.8, while SWat and bitonic sort have a $\rho$ of about 0.5 and 0.4, respectively. According to Equation (5), for these $\rho$ values, the maximum speedup achievable by accelerating only the computation is shown in Table II. As can be observed, only a modest speedup can be obtained for these three algorithms if only the computation is accelerated. Moreover, since most previous work focuses on optimizing the computation, i.e., decreasing the computation time $t_C$, the more an algorithm is optimized, the smaller $\rho$ becomes; beyond that point, further decreasing the computation time contributes little to overall performance. In contrast, decreasing the synchronization time can yield a large kernel execution speedup.
TABLE II
POSSIBLE MAXIMUM SPEEDUP WITH ONLY COMPUTATION ACCELERATED

Algorithms                 FFT     SWat    Bitonic sort
ρ                          0.82    0.51    0.40
Possible maximum speedup   5.61    2.03    1.68
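For intuition, the Table II entries are simply the limiting case of Equation (5) as the computation speedup grows without bound:

$$\lim_{S_C \to \infty} S_T = \frac{1}{1 - \rho}, \qquad \text{e.g., } \rho = 0.82 \;\Rightarrow\; S_T \approx \frac{1}{0.18} \approx 5.6,$$

which matches the FFT entry in Table II up to rounding of $\rho$.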
In this paper, we will focus on decreasing the synchroniza-
tion time. This is motivated by three facts:
1) A large body of work [6], [10], [15], [19], [25] has already been proposed to decrease the computation time; techniques such as shared-memory usage and divergent-branch removal have been widely used.
2) No work has been done to decrease the synchronization time of algorithms executed on a GPU.
3) In some algorithms, the synchronization time consumes a large part of the kernel execution time (e.g., SWat and bitonic sort in Figure 12), which results in a small $\rho$ value.
With the above model for the speedup gained from reducing synchronization time, we propose two GPU synchronization approaches in the next section and quantitatively model and analyze the time consumption of each.
V. PROPOSED GPU SYNCHRONIZATION
Since in the CUDA programming model the execution of a thread block is non-preemptive, care must be taken to avoid
deadlocks in GPU synchronization design. Consider a scenario
where multiple thread blocks are mapped to one SM and the
active block is waiting for the completion of a global barrier.
A deadlock will occur in this case because unscheduled thread
blocks will not be able to reach the barrier without preemption.
Our solution to this problem is to have a one-to-one mapping
between thread blocks and SMs. In other words, for a GPU
with ‘Y’ SMs, we ensure that at most ‘Y’ blocks are used in
the kernel. In addition, we allocate all available shared memory
on an SM to each block so that no two blocks can be scheduled
to the same SM because of the memory constraint.
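A minimal launch-time sketch of this one-to-one mapping (our own, assuming the 30-SM, 16 KB-per-SM GTX 280 of Section II; the kernel name and thread count are placeholders): one block per SM, with most of the SM's shared memory requested dynamically so that no second block can be co-scheduled.

__global__ void my_kernel(float *d_data, int goalVal);   // placeholder kernel that calls __gpu_sync()

void launch_one_block_per_sm(float *d_data, int goalVal)
{
    const int NUM_SMS = 30;                  // GTX 280 has 30 SMs
    const int THREADS_PER_BLOCK = 256;
    // Claim nearly all of the 16 KB of per-SM shared memory; the exact
    // headroom depends on the kernel's static shared-memory usage.
    size_t smem_bytes = 15 * 1024;
    my_kernel<<<NUM_SMS, THREADS_PER_BLOCK, smem_bytes>>>(d_data, goalVal);
}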
In the following discussion, we will present two alternative
GPU synchronization designs: GPU lock-based synchroniza-
tion and GPU lock-free synchronization. The first one uses
a mutex variable and CUDA atomic operations, while the
second method uses a lock-free algorithm that avoids the use
of expensive CUDA atomic operations.
A. GPU Lock-Based Synchronization
The basic idea of GPU lock-based synchronization [31]
is to use a global mutex variable to count the number of
thread blocks that reach the synchronization point. As shown
in Figure 4, in the barrier function __gpu_sync(), after a
block completes its computation, one of its threads (we call it
the leading thread) will atomically add 1 to g_mutex. The
leading thread will then repeatedly compare g_mutex to a
target value goalVal. If g_mutex is equal to goalVal,
the synchronization is completed and each thread block can
proceed with its next stage of computation. In our design,
goalVal is set to the number of blocks N in the kernel when
the barrier function is first called. The value of goalVal is
then incremented by N each time when the barrier function is
successively called. This design is more efficient than keeping
goalVal constant and resetting g_mutex after each barrier
because it saves instructions and avoids conditional branching.
//the mutex variable
__device__ volatile int g_mutex;

//GPU lock-based synchronization function
__device__ void __gpu_sync(int goalVal)
{
    //thread ID in a block
    int tid_in_block = threadIdx.x * blockDim.y + threadIdx.y;

    //only thread 0 is used for synchronization
    if (tid_in_block == 0) {
        atomicAdd((int *)&g_mutex, 1);

        //only when all blocks add 1 to g_mutex
        //will g_mutex equal goalVal
        while (g_mutex != goalVal) {
            //Do nothing here
        }
    }
    __syncthreads();
}
Fig. 4. Code snapshot of the GPU Lock-Based Synchronization
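As a usage note (our own sketch, not taken from the paper's figures): if data written to global memory before the barrier must be visible after it, the __threadfence() call discussed in Section I can be placed at the start of the barrier, before the block announces its arrival. One plausible placement, reusing g_mutex from Figure 4, is shown below.

//lock-based barrier with an explicit memory fence (CUDA 2.2+)
__device__ void __gpu_sync_fenced(int goalVal)
{
    __threadfence();    // flush this thread's prior writes to global memory
    __syncthreads();    // ensure every thread in the block has fenced

    int tid_in_block = threadIdx.x * blockDim.y + threadIdx.y;
    if (tid_in_block == 0) {
        atomicAdd((int *)&g_mutex, 1);
        while (g_mutex != goalVal) {
            //spin until all blocks have arrived
        }
    }
    __syncthreads();    // release the remaining threads of the block
}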
In the GPU lock-based synchronization, the execution time
of the barrier function __gpu_sync() consists of three

References

T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, 1981.

K. E. Batcher, "Sorting networks and their applications," in Proc. AFIPS Spring Joint Computer Conference, 1968.

C. Van Loan, Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.

S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.

V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in Proc. ACM/IEEE Conference on Supercomputing (SC), 2008.