This work proposes two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization, and evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform, dynamic programming, and bitonic sort.
Abstract:
While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which, in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the micro-benchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
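The lock-based approach described in the abstract can be sketched as a single global atomic counter that every block increments on arrival. The following CUDA fragment is a minimal illustration of that idea, not the authors' exact code; the names `g_mutex` and `__gpu_sync_atomic` are chosen here for clarity.

```cuda
// Sketch of a GPU lock-based inter-block barrier: one global mutex
// counter, incremented once per block, spun on until all blocks arrive.
__device__ volatile int g_mutex = 0;

__device__ void __gpu_sync_atomic(int goalVal)
{
    // Only thread 0 of each block announces the block's arrival.
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_mutex, 1);
        // Spin until every block in the grid has incremented the mutex.
        while (g_mutex != goalVal) { /* busy-wait */ }
    }
    // Release the remaining threads of this block.
    __syncthreads();
}
```

The `goalVal` argument is the number of blocks expected at the barrier (typically incremented per barrier instance so the counter never needs resetting); the trailing `__syncthreads()` ensures no thread proceeds before thread 0 observes global completion.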
TL;DR: A single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model, the WaveRNN, and a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once.
TL;DR: A new GPU implementation of BFS that uses a hierarchical queue management technique and a three-layer kernel arrangement strategy that guarantees the same computational complexity as the fastest sequential version and can achieve up to 10 times speedup.
TL;DR: Through micro-kernel benchmarks, it is shown that the PT approach can achieve up to an order-of-magnitude speedup over non-PT kernels, but can also result in performance loss in many cases.
TL;DR: This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol, called TC-Weak.
TL;DR: A model of Nvidia GPU hardware is proposed, which correctly models every behaviour witnessed in the authors' experiments, and is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.
TL;DR: This paper takes Smith-Waterman as a case study to explore the architectural features of Graphics Processing Units (GPUs) and evaluate the challenges the hardware architecture poses, as well as the software modifications needed to map the program architecture on to the GPU.
TL;DR: This paper presents two protocols that address factors hindering the performance of Herlihy's standard non-blocking protocol, and introduces a second, optimistic protocol that avoids copying, except in the case when a thread of control is delayed during its attempted update.
TL;DR: A fine-grained parallelization of a single instance of the dynamic programming (DP) algorithm that is mapped to the GPU is proposed, which incorporates a set of techniques aimed to substantially improve GPU performance: matrix re-alignment, coalesced memory access, tiling, and GPU (rather than CPU) synchronization.
TL;DR: An adaptive software combining tree is used to implement a scalable barrier with O(1) recognition latency, and it is shown how adaptive combining trees can be used to implement the fuzzy barrier.
Q1. What are the contributions mentioned in the paper "Inter-block gpu communication via fast barrier synchronization" ?
The authors propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, their GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
Q2. What are the future works in "Inter-block gpu communication via fast barrier synchronization" ?
As for future work, the authors will first investigate the reasons for the irregularity of the FFT's synchronization time versus the number of blocks in the kernel. Second, the authors will propose a general model to characterize algorithms' parallelism properties, based on which better performance can be obtained for their parallelization on multi- and many-core architectures.
Q3. What is the way to improve the performance of a bitonic sort?
For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABisort to O(n log n) with an adaptive data structure that enables merges to be done in linear time.
Q4. How do the authors allocate shared memory on an SM to each block?
In addition, the authors allocate all available shared memory on an SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint.
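This occupancy trick can be expressed at kernel launch time by requesting (nearly) all of an SM's shared memory as the dynamic shared-memory size, so the hardware scheduler can place at most one block per SM. The snippet below is an illustrative sketch, not the authors' code; the 16 KB figure matches GT200-era hardware and the kernel name is hypothetical.

```cuda
// Illustrative launch configuration: ask for almost all shared memory
// per block so no two blocks fit on the same SM.
extern __shared__ int sdata[];          // dynamically sized shared memory

__global__ void myKernel(int *data)
{
    sdata[threadIdx.x] = data[threadIdx.x];  // placeholder work
}

void launch(int *d_data, int numSMs, int threadsPerBlock)
{
    // ~16 KB per SM on GT200-era GPUs; leave a small margin for
    // kernel arguments and compiler-reserved shared memory.
    size_t smemPerBlock = 16 * 1024 - 256;
    myKernel<<<numSMs, threadsPerBlock, smemPerBlock>>>(d_data);
}
```

With one block pinned to each SM, every block is guaranteed to be resident simultaneously, which is a precondition for any spin-based inter-block barrier (a non-resident block could never reach the barrier, deadlocking the grid).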
Q5. What are the three well-known algorithms that the authors integrate into their synchronization approach?
In addition, the authors integrate each of their GPU synchronization approaches in a micro-benchmark and three well-known algorithms: FFT, dynamic programming, and bitonic sort.
Q6. How many threads can be sorted in a block?
Another parallel implementation of the bitonic sort is in the CUDA SDK [21], but there is only one block in the kernel to use the available barrier function __syncthreads(), thus restricting the maximum number of items that can be sorted to 512 — the maximum number of threads in a block.
Q7. What is the reason why the barrier function can not guarantee that inter-block communication is correct?
As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed.
Q8. How many threads are used to check the elements of Arrayin in parallel?
It is worth noting that in the step 2) above, rather than having one thread to check all elements of Arrayin in serial as in [29], the authors use N threads to check the elements of Arrayin in parallel.
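The parallel-check step described above can be sketched as follows. This is a hedged reconstruction of the lock-free barrier using the `Arrayin`/`Arrayout` names from the text; the function name and exact signature are illustrative, not the authors' verbatim code.

```cuda
// Sketch of the GPU lock-free inter-block barrier. Each block i sets
// Arrayin[i] on arrival; block 0 uses N threads (one per block) to
// check all N flags in parallel, then releases every block via Arrayout.
__device__ void __gpu_sync_lockfree(int goalVal,
                                    volatile int *Arrayin,
                                    volatile int *Arrayout)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int nBlocks = gridDim.x;

    // Step 1: thread 0 of each block signals this block's arrival.
    if (tid == 0)
        Arrayin[bid] = goalVal;

    if (bid == 0) {
        // Step 2: N threads of block 0 each spin on one flag,
        // checking the N elements of Arrayin in parallel.
        if (tid < nBlocks)
            while (Arrayin[tid] != goalVal) { /* busy-wait */ }
        __syncthreads();

        // Step 3: block 0 releases all blocks.
        if (tid < nBlocks)
            Arrayout[tid] = goalVal;
    }

    // Step 4: each block waits for its own release flag.
    if (tid == 0)
        while (Arrayout[bid] != goalVal) { /* busy-wait */ }
    __syncthreads();
}
```

Because no atomic operation is needed (each flag has exactly one writer), the scheme is lock-free; using `goalVal` as a per-barrier token avoids having to reset the flag arrays between successive barriers.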
Q9. How does the research on mapping dynamic programming work?
Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU uses graphics primitives [14], [15] in a task parallel fashion.