Inter-block GPU communication via fast barrier synchronization
Citations
Communication Architectures for Scalable GPU-centric Computing Systems
Designing Efficient Barriers and Semaphores for Graphics Processing Units
M&C: A Software Solution to Reduce Errors Caused by Incoherent Caches on GPUs in Unstructured Graphic Algorithm
Using scheduling entropy amplification in CUDA/OpenMP code to exhibit non-reproducibility issues
References
Identification of common molecular subsequences
Sorting networks and their applications
Computational Frameworks for the Fast Fourier Transform
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Benchmarking GPUs to tune dense linear algebra
Frequently Asked Questions (9)
Q2. What are the future works in "Inter-block GPU communication via fast barrier synchronization" ?
For future work, the authors will first investigate the reasons for the irregularity of the FFT's synchronization time versus the number of blocks in the kernel. Second, they will propose a general model to characterize algorithms' parallelism properties, on the basis of which better performance can be obtained when parallelizing them on multi- and many-core architectures.
Q3. What is the way to improve the performance of a bitonic sort?
For bitonic sort, Greß et al. [7] improve the algorithmic complexity of GPU-ABiSort to O(n log n) with an adaptive data structure that enables merges to be done in linear time.
Q4. How do the authors allocate shared memory on an SM to each block?
In addition, the authors allocate all available shared memory on an SM to each block, so that no two blocks can be scheduled onto the same SM because of the shared-memory constraint.
Q5. What are the three well-known algorithms that the authors integrate into their synchronization approach?
In addition, the authors integrate each of their GPU synchronization approaches into a micro-benchmark and three well-known algorithms: FFT, dynamic programming, and bitonic sort.
Q6. How many threads can be sorted in a block?
Another parallel implementation of the bitonic sort is in the CUDA SDK [21], but it uses only one block in the kernel so that the available barrier function __syncthreads() can be used, restricting the maximum number of items that can be sorted to 512, the maximum number of threads in a block.
Q7. What is the reason why the barrier function can not guarantee that inter-block communication is correct?
As described in [29], the barrier function cannot guarantee that inter-block communication is correct unless a memory consistency model is assumed.
Q8. How many threads are used to check the elements of Arrayin in parallel?
It is worth noting that in step 2 above, rather than having one thread check all elements of Arrayin serially as in [29], the authors use N threads to check the elements of Arrayin in parallel.
Q9. How does the research on mapping dynamic programming work?
Past research on mapping dynamic programming, e.g., the Smith-Waterman (SWat) algorithm, onto the GPU uses graphics primitives [14], [15] in a task parallel fashion.