Cooperative heterogeneous computing for parallel processing on CPU/GPU hybrids
Summary
1. Introduction
- General-Purpose computing on Graphics Processing Units (GPGPU) has recently emerged as a powerful computing paradigm because of the massive parallelism provided by several hundreds of processing cores [4, 15].
- Executing the kernel on otherwise idle CPU cores in addition to the GPU eventually provides additional computing power for the kernel execution.
- The paper proposes Cooperative Heterogeneous Computing (CHC), a new computing paradigm for explicitly processing CUDA applications in parallel on sets of heterogeneous processors including x86-based general-purpose multi-core processors and graphics processing units.
- The authors present a theoretical analysis of the expected performance to demonstrate the maximum feasible improvement of their proposed system.
- In addition, a performance evaluation on a real system has been performed, and the results show that speedups as high as 3.08 have been achieved.
3. Motivation
- The role of the host CPU for the CUDA kernel is largely limited to controlling and accessing the graphics device, while the GPU device provides a massive amount of data parallelism.
- As soon as the host calls the kernel function, the device starts to execute the kernel with a large number of hardware threads on the GPU device.
- The authors' CHC system uses the idle computing resources by executing the CUDA kernel concurrently on both CPU and GPU (as described in Fig. 1(b)).
- This approach would also be helpful for programming a single-chip heterogeneous multi-core processor that integrates a CPU and a GPU.
- The key difference on which the authors focus is the use of idle computing resources through concurrent execution of the same CUDA kernel on both CPU and GPU, thereby easing the GPU's burden.
4. Design
- An overview of the proposed CHC system is shown in Fig. 2. The first component is the Workload Distribution Module (WDM), designed to apply the distribution ratio to the kernel configuration information.
- As seen in Fig. 2, this procedure extracts the PTX code from the CUDA binary to prepare the LLVM code for cooperative computing.
- The CUDA kernel execution typically needs some startup time to initialize the GPU device.
- In the CHC framework, the GPU start-up process and the PTX-to-LLVM translation are performed simultaneously to hide the translation overhead, as in the sketch below.
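A minimal sketch of this overlap, assuming illustrative placeholder functions (translatePTXtoLLVM and initializeGPUDevice are not the paper's actual API):

```cpp
#include <future>

// Illustrative placeholders for the two start-up tasks (not the paper's API).
void translatePTXtoLLVM() { /* PTX -> LLVM IR translation */ }
void initializeGPUDevice() { /* CUDA context and device start-up */ }

void prepareKernel() {
    // Run the translation on a worker thread so that it overlaps with the
    // (slow) GPU device start-up, hiding the translation overhead.
    auto translation = std::async(std::launch::async, translatePTXtoLLVM);
    initializeGPUDevice();
    translation.wait();  // both must finish before the kernel can launch
}
```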
4.1. Workload distribution module and method
- The input of WDM is the kernel configuration information and the output specifies two different portions of the kernel, each for CPU cores and the GPU device.
- In order to divide the CUDA kernel, the workload distribution module determines the number of thread blocks to be detached from the grid, considering the dimension of the grid and the workload distribution ratio.
- WDM then delivers the generated execution configurations (i.e., the output of the WDM) to the CPU and GPU loaders.
- Therefore, the first identifier of the CPU's sub-kernel will be (dGrid.y × GPURatio) + 1.
- The proposed work distribution splits the kernel at the granularity of a thread block; a minimal sketch of this split follows.
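The sketch below shows one reading of this block-granularity split along the grid's y dimension (splitGrid, Dim3, and the 0-based cpuFirstBlockY field are illustrative names, not the paper's actual code):

```cpp
#include <cmath>

struct Dim3 { unsigned x, y, z; };

struct KernelSplit {
    Dim3 gpuGrid;             // thread blocks kept on the GPU
    Dim3 cpuGrid;             // thread blocks detached for the CPU cores
    unsigned cpuFirstBlockY;  // first y-identifier of the CPU's sub-kernel (0-based)
};

// Split the grid along its y dimension at thread-block granularity,
// according to the workload distribution ratio assigned to the GPU.
KernelSplit splitGrid(Dim3 grid, double gpuRatio) {
    unsigned gpuRows = static_cast<unsigned>(std::floor(grid.y * gpuRatio));
    KernelSplit s;
    s.gpuGrid = { grid.x, gpuRows, grid.z };
    s.cpuGrid = { grid.x, grid.y - gpuRows, grid.z };
    s.cpuFirstBlockY = gpuRows;  // the CPU portion starts where the GPU portion ends
    return s;
}
```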
4.2. Memory consolidation for transparent memory space
- A programmer writing CUDA applications should assign memory spaces in the device memory of the graphics hardware.
- For this purpose, the host system should preserve pointer variables pointing to the location in the device memory.
- The abstraction layer uses double-pointer data structures (similar to [19]) for pointer variables, mapping one pointer variable onto two memory addresses: one in the main memory and one in the device memory.
- Whenever a pointer variable is referenced, the abstraction layer translates the pointer to the corresponding memory addresses for both CPU and GPU.
- The addresses of these memory spaces are stored in a TMS data structure (e.g., TMS1), and the framework maps the pointer variable onto the TMS data structure, as sketched below.
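A minimal sketch of this double-pointer mapping, assuming a hypothetical TMS table keyed by the pointer value the program sees (tmsRegister and tmsResolve are illustrative names, not the paper's API):

```cpp
#include <unordered_map>

struct TMS {            // one logical allocation, two physical addresses
    void* hostAddr;     // emulated global memory in main memory (CPU side)
    void* deviceAddr;   // real global memory on the GPU device
};

// Table mapping the pointer handle seen by the program onto its TMS entry.
static std::unordered_map<const void*, TMS> tmsTable;

// Record an allocation pair; the returned handle is what the CUDA program
// subsequently uses as its "device pointer".
void* tmsRegister(void* hostAddr, void* deviceAddr) {
    tmsTable[deviceAddr] = TMS{hostAddr, deviceAddr};
    return deviceAddr;
}

// Resolve a pointer reference to the address appropriate for each side.
void* tmsResolve(const void* handle, bool onCPU) {
    const TMS& entry = tmsTable.at(handle);
    return onCPU ? entry.hostAddr : entry.deviceAddr;
}
```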
4.3. Global scheduling queue for thread scheduling
- GPU is a throughput-oriented architecture which shows outstanding performance with applications having a large amount of data parallelism [7].
- In their scheduling scheme, the authors allow a thread block to be assigned dynamically to any available core.
- For this purpose, the authors have implemented a work sharing scheme using a Global Scheduling Queue (GSQ) [3].
- This scheduling algorithm enqueues a task (i.e., a thread block) into a global queue so that any worker thread on an available core can consume the task.
- In addition, any core that finishes its assigned thread block early takes another thread block instead of sitting idle; a minimal sketch follows.
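A minimal sketch of such a work-sharing queue, assuming the tasks are the thread-block indices of the CPU's sub-kernel (an atomic counter stands in for the GSQ; all names are illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<unsigned> nextBlock{0};  // shared cursor over the CPU's thread blocks

// Each worker repeatedly claims the next unprocessed thread block, so a
// core that finishes early simply fetches more work instead of idling.
void workerLoop(unsigned numBlocks, void (*runBlock)(unsigned)) {
    for (unsigned b = nextBlock.fetch_add(1); b < numBlocks;
         b = nextBlock.fetch_add(1)) {
        runBlock(b);
    }
}

// Spawn one worker thread per hardware core and drain the queue.
void runSubKernelOnCPU(unsigned numBlocks, void (*runBlock)(unsigned)) {
    nextBlock = 0;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        workers.emplace_back(workerLoop, numBlocks, runBlock);
    for (auto& w : workers) w.join();
}
```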
4.4. Limitations on global memory consistency
- CHC emulates the global memory on the CPU side as well.
- Thread blocks in the CPU can access the emulated global memory and perform the atomic operations.
- Their system does not allow global memory atomic operations between thread blocks on the CPU and thread blocks on the GPU, in order to avoid severe performance degradation.
- In fact, discrete GPUs have their own memory and communicate with the main memory over PCI Express, which incurs long latencies.
- This architectural limit suggests that the CHC prototype need not provide global memory atomic operations between CPU and GPU.
5. Results
- The proposed CHC framework has been fully implemented on a desktop system with two Intel Xeon X5550 2.66 GHz quad-core processors and an NVIDIA GeForce 9400 GT device.
- This configuration is also applicable to a single-chip heterogeneous multi-core processor that has an integrated GPU, which is generally slower than discrete GPUs.
- The authors adapt 14 CUDA applications that do not require global memory synchronization across CPU and GPU at runtime: twelve from the NVIDIA CUDA Software Development Kit (SDK) [16], plus SpMV [2] and MD5 hashing [9].
- In the benchmark table, the columns represent, from left to right, the application name, the number of computation kernels, the number of thread blocks in the kernels, a description of the kernel, and the work distribution ratio used in the CHC framework.
- The authors measured the execution time of kernel launches and compared the CHC framework against GPU-only computing.
5.1. Initial analysis
- For the initial analysis, the authors have measured the execution delay using only the GPU device and the delay using only the host CPU (through the LLVM JIT compilation technique [5, 6, 11]).
- In addition, the workload has been configured either as executing only one thread block or as executing the complete set of thread blocks.
- The maximum achievable performance improvement can be experimentally deduced from these initial execution delays, as depicted in Table 2. Fig. 5 shows how to find it: the x-axis represents the workload ratio in terms of thread blocks assigned to the CPU cores versus thread blocks on the GPU device.
- Therefore, the execution delay for the GPU is proportionally reduced along the x-axis (see the worked model below).
- For SpMV, the Compressed Sparse Row (CSR) format is used.
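A simple model consistent with this analysis (our reading, not a formula quoted from the paper): let T_g and T_c be the measured delays of running the whole kernel on the GPU alone and on the CPU alone, and let r be the fraction of thread blocks kept on the GPU.

```latex
% With both sides running concurrently, the kernel finishes when the
% slower side does; the optimum balances the two.
T(r) = \max\bigl(r\,T_g,\ (1-r)\,T_c\bigr), \qquad
r^{*} = \frac{T_c}{T_c + T_g}, \qquad
T(r^{*}) = \frac{T_g\,T_c}{T_g + T_c}
```

Under this model, the maximum speedup over GPU-only execution is T_g / T(r*) = 1 + T_g / T_c, which corresponds to the intersection point one reads off Fig. 5.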
5.2. Performance improvements of CHC framework
- Table 2 shows the maximum performance and the optimal distribution ratio obtained from the initial analysis.
- In addition, the actual execution time and the actual work distribution ratio using CHC are also presented.
- In fact, the optimal distribution ratio is used to determine the work distribution ratio on CHC.
- In more detail, the applications with exponential, trigonometric, or power arithmetic operations (BINO, BLKS, MERT, MONT, and CONV) show little performance improvement, presumably because such operations are far more expensive on the CPU cores than on the GPU's special function units.
6. Conclusions
- The paper has introduced three key features for the efficient exploitation of the thread level parallelism provided by CUDA on the CPU multi-cores in addition to the GPU device.
- The experiments demonstrate that the proposed framework successfully achieves efficient parallel execution and that the performance results obtained are close to the values deduced from the theoretical analysis.
- The authors also plan to design an efficient thread block distribution technique considering data access patterns and thread divergence.
Frequently Asked Questions
Q2. What have the authors stated for future works in "Cooperative heterogeneous computing for parallel processing on cpu/gpu hybrids" ?
The authors believe the cooperative heterogeneous computing can be utilized in the future heterogeneous multi-core processors which are expected to include even more GPU cores as well as CPU cores. As future work, the authors will first develop a dynamic control scheme on deciding the workload distribution ratio. The authors also plan to design an efficient thread block distribution technique considering data access patterns and thread divergence. In fact, the future CHC framework needs to address the performance trade-offs considering the CUDA application configurations on various GPU and CPU models.
Q3. What is the main purpose of the GPU?
GPU is a throughput-oriented architecture which shows outstanding performance with applications having a large amount of data parallelism [7].
Q4. What is the main role of the host CPU for the CUDA kernel?
One of the major roles of the host CPU for the CUDA kernel is limited to controlling and accessing the graphics devices, while the GPU device provides a massive amount of data parallelism.
Q5. What is the main purpose of the Merge framework?
The Merge framework has extended EXOCHI for parallel execution on CPU and GMA; however, it still requires additional APIs and porting time [13].
Q6. What is the purpose of the framework?
It dynamically distributes the workload, but that framework targets only generalized reduction applications, whereas the authors' system targets general CUDA applications.
Q7. What is the main purpose of EXOCHI?
In addition, EXOCHI provides a programming environment that enhances computing performance for media kernels on multi-core CPUs with the Intel Graphics Media Accelerator (GMA) [20].
Q8. What is the purpose of the experiments?
The experiments demonstrate that the proposed framework successfully achieves efficient parallel execution and that the performance results obtained are close to the values deduced from the theoretical analysis.
Q9. What is the way to schedule thread blocks?
Ocelot uses a locality-aware static partitioning scheme in its thread scheduler, which assigns each thread block considering load balancing between neighboring worker threads [6].
Q10. What is the purpose of the CHC system?
Their CHC system uses the idle computing resources by executing the CUDA kernel concurrently on both CPU and GPU (as described in Fig. 1(b)).
Q11. What is the input of the WDM?
The input of WDM is the kernel configuration information and the output specifies two different portions of the kernel, each for CPU cores and the GPU device.
Q12. What is the purpose of the scheduling algorithm?
This scheduling algorithm enqueues a task (i.e., a thread block) into a global queue so that any worker thread on an available core can consume the task.
Q13. Why would parallel processing on CPU and GPU become a major computing paradigm?
Considering that future computer systems are expected to incorporate more cores in both general-purpose processors and graphics devices, parallel processing on CPU and GPU would become a great computing paradigm for high-performance applications.
Q14. Is it easy to predict the performance of a CUDA program?
It is quite hard to predict the characteristics of a CUDA program, since the runtime behavior strongly relies on dynamic characteristics of the kernel [1, 10].