
Cooperative heterogeneous computing for parallel processing on CPU/GPU hybrids

TL;DR: A cooperative heterogeneous computing framework is presented which enables the efficient utilization of the available computing resources of host CPU cores for CUDA kernels, which are designed to run only on the GPU, without any source recompilation.
Abstract: This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups as high as 3.08 compared to the baseline GPU-only processing.

Summary

1. Introduction

  • General-Purpose computing on Graphics Processing Units has recently emerged as a powerful computing paradigm because of the massive parallelism provided by several hundreds of processing cores [4, 15].
  • This will eventually provide additional computing power for the kernel execution while utilizing the idle CPU cores.
  • The paper proposes Cooperative Heterogeneous Computing (CHC), a new computing paradigm for explicitly processing CUDA applications in parallel on sets of heterogeneous processors including x86 based general-purpose multi-core processors and graphics processing units.
  • The authors present a theoretical analysis of the expected performance to demonstrate the maximum feasible improvement of their proposed system.
  • In addition, the performance evaluation on a real system has been performed and the results show that speedups as high as 3.08 have been achieved.

3. Motivation

  • One of the major roles of the host CPU for the CUDA kernel is limited to controlling and accessing the graphics devices, while the GPU device provides a massive amount of data parallelism.
  • As soon as the host calls the kernel function, the device starts to execute the kernel with a large number of hardware threads on the GPU device.
  • The authors' CHC system uses this idle computing resource through concurrent execution of the CUDA kernel on both CPU and GPU (as described in Fig. 1(b)).
  • This would be quite helpful to program a single chip heterogeneous multi-core processor including CPU and GPU as well.
  • The key difference on which the authors focus is in the use of idle computing resources with concurrent execution of the same CUDA kernel on both CPU and GPU, thereby easing the GPU burden.

4. Design

  • An overview of their proposed CHC system is shown in Fig. 2. The first of its two runtime procedures is the Workload Distribution Module (WDM), designed to apply the distribution ratio to the kernel configuration information.
  • As seen in Fig. 2, this procedure extracts the PTX code from the CUDA binary to prepare the LLVM code for cooperative computing.
  • The CUDA kernel execution typically needs some startup time to initialize the GPU device.
  • In the CHC framework, the GPU start-up process and the PTX-to-LLVM translation are simultaneously performed to hide the PTX-to-LLVM translation overhead.

4.1. Workload distribution module and method

  • The input of WDM is the kernel configuration information and the output specifies two different portions of the kernel, each for CPU cores and the GPU device.
  • In order to divide the CUDA kernel, the workload distribution module determines the number of thread blocks to be detached from the grid, considering the dimension of the grid and the workload distribution ratio, as depicted in Fig. 3.
  • WDM then delivers the generated execution configurations (i.e., the output of the WDM) to the CPU and GPU loaders.
  • Therefore, the first identifier of the CPU's sub-kernel will be (dGrid.y × GPU_Ratio) + 1.
  • The proposed work distribution splits the kernel at the granularity of a thread block.

4.2. Memory consolidation for transparent memory space

  • A programmer writing CUDA applications should assign memory spaces in the device memory of the graphics hardware.
  • For this purpose, the host system should preserve pointer variables pointing to the location in the device memory.
  • The abstraction layer uses double-pointer data structures (similar to [19]) for pointer variables to map one pointer variable onto two memory addresses: one in the main memory and one in the device memory.
  • Whenever a pointer variable is referenced, the abstraction layer translates the pointer to the memory addresses, for both CPU and GPU.
  • The addresses of these memory spaces are stored in a TMS data structure (e.g., TMS1), and the framework maps the pointer variable on the TMS data structure.

4.3. Global scheduling queue for thread scheduling

  • GPU is a throughput-oriented architecture which shows outstanding performance with applications having a large amount of data parallelism [7].
  • In their scheduling scheme, the authors allow a thread block to be assigned dynamically to any available core.
  • For this purpose, the authors have implemented a work sharing scheme using a Global Scheduling Queue (GSQ) [3].
  • This scheduling algorithm enqueues a task (i.e., a thread block) into a global queue so that any worker thread on an available core can consume the task.
  • In addition, any core which finishes its assigned thread block early can take on another thread block rather than remain idle.

4.4. Limitations on global memory consistency

  • CHC emulates the global memory on the CPU-side as well.
  • Thread blocks in the CPU can access the emulated global memory and perform the atomic operations.
  • Their system does not allow the global memory atomic operations between the thread blocks on the CPU and the thread blocks on the GPU to avoid severe performance degradation.
  • In fact, discrete GPUs have their own memory and communicate with the main memory through the PCI express, which causes long latency problems.
  • This architectural limit suggests that the CHC prototype need not provide global memory atomic operations between CPU and GPU.

5. Results

  • The proposed CHC framework has been fully implemented on a desktop system with two Intel Xeon X5550 2.66 GHz quad-core processors and an NVIDIA GeForce 9400 GT device.
  • This configuration is also applicable to a single-chip heterogeneous multi-core processor that has an integrated GPU, which is generally slower than discrete GPUs.
  • The authors adopt 14 CUDA applications which do not require global memory synchronization across CPU and GPU at runtime: twelve from the NVIDIA CUDA Software Development Kit (SDK) [16], plus SpMV [2] and MD5 hashing [9].
  • From left to right the columns represent the application name, the number of computation kernels, the number of thread blocks in the kernels, a description of the kernel, and work distribution ratio used in the CHC framework.
  • The authors measured the execution time of kernel launches and compared the CHC framework against GPU-only computing.

5.1. Initial analysis

  • For the initial analysis, the authors have measured the execution delay using only the GPU device and the delay using only the host CPU (through the LLVM JIT compilation technique [5, 6, 11]).
  • In addition, the workload has been configured either as executing only one thread block or as executing the complete set of thread blocks.
  • The maximum performance improvement achievable based on the initial execution delays can be experimentally deduced, as depicted in Table 2. Fig. 5 shows the way to find it; the x-axis represents the workload ratio in terms of thread blocks assigned to the CPU cores against thread blocks on the GPU device.
  • Therefore, the execution delay for GPU is proportionally reduced along the x-axis.
  • Compressed Sparse Row (CSR) format is used.

5.2. Performance improvements of CHC framework

  • Table 2 shows the maximum performance and the optimal distribution ratio obtained from the initial analysis.
  • In addition, the actual execution time and the actual work distribution ratio using CHC are also presented.
  • In fact, the optimal distribution ratio is used to determine the work distribution ratio on CHC.
  • In more detail, the applications with exponential, trigonometric, or power arithmetic operations (BINO, BLKS, MERT, MONT, and CONV) show little performance improvement.

6. Conclusions

  • The paper has introduced three key features for the efficient exploitation of the thread level parallelism provided by CUDA on the CPU multi-cores in addition to the GPU device.
  • The experiments demonstrate that the proposed framework successfully achieves efficient parallel execution and that the performance results obtained are close to the values deduced from the theoretical analysis.
  • The authors also plan to design an efficient thread block distribution technique considering data access patterns and thread divergence.


Cooperative Heterogeneous Computing for Parallel Processing on CPU/GPU Hybrids

Changmin Lee and Won W. Ro
School of Electrical and Electronic Engineering
Yonsei University
Seoul 120-749, Republic of Korea
{exahz, wro}@yonsei.ac.kr

Jean-Luc Gaudiot
Department of Electrical Engineering and Computer Science
University of California, Irvine
Irvine, CA 92697-2625
gaudiot@uci.edu
Abstract

This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups as high as 3.08 compared to the baseline GPU-only processing.
1. Introduction

General-Purpose computing on Graphics Processing Units (GPGPU) has recently emerged as a powerful computing paradigm because of the massive parallelism provided by several hundreds of processing cores [4, 15]. Under the GPGPU concept, NVIDIA has developed a C-based programming model, Compute Unified Device Architecture (CUDA), which provides greater programmability for high-performance graphics devices. As a matter of fact, general-purpose computing on graphics devices with CUDA helps improve the performance of many applications under the concept of a Single Instruction Multiple Thread (SIMT) model.

Although the GPGPU paradigm successfully provides significant computation throughput, its performance could still be improved if we could utilize the idle CPU resource. Indeed, in general, the host CPU is being held while the CUDA kernel executes on the GPU devices; the CPU is not allowed to resume execution until the GPU has completed the kernel code and has provided the computation results. The main motivation of our research is to exploit parallelism across the host CPU cores in addition to the GPU cores. This will eventually provide additional computing power for the kernel execution while utilizing the idle CPU cores. Our ultimate goal is thus to provide a technique which eventually exploits sufficient parallelism across heterogeneous processors.

The paper proposes Cooperative Heterogeneous Computing (CHC), a new computing paradigm for explicitly processing CUDA applications in parallel on sets of heterogeneous processors including x86-based general-purpose multi-core processors and graphics processing units. There have been several previous research projects which have aimed at exploiting parallelism on CPU and GPU. However, those previous approaches require either additional programming language support or API development. As opposed to those previous efforts, our CHC is a software framework that provides a virtual layer for transparent execution over host CPU cores. This enables the direct execution of CUDA code, while simultaneously providing sufficient portability and backward compatibility.
To achieve an efficient cooperative execution model, we have developed three important techniques:

  • A workload distribution module (WDM) for the CUDA kernel, to map each kernel onto CPU and GPU
  • A memory model that supports a transparent memory space (TMS) to manage the main memory together with the GPU memory
  • A global scheduling queue (GSQ) that supports balanced thread scheduling and distribution on each of the CPU cores
We present a theoretical analysis of the expected performance to demonstrate the maximum feasible improvement of our proposed system. In addition, the performance evaluation on a real system has been performed, and the results show that speedups as high as 3.08 have been achieved. On average, the complete CHC system shows a performance improvement of 1.42 over GPU-only computation with 14 CUDA applications.

Figure 1. Execution flow of a CUDA program: limitation where the CPU stalls and waits during kernel execution (a), and cooperative execution in parallel with reduced execution time (b).

The rest of the paper is organized as follows. Section 2 reviews related work, and Section 3 introduces the existing CUDA programming model and describes the motivation of this work. Section 4 presents the design and limitations of the CHC framework. Section 5 gives preliminary results. Finally, we conclude this work in Section 6.
2. Related work

There have been several prior research projects which aim at mapping an explicitly parallel program for graphics devices onto multi-core CPUs or heterogeneous architectures. MCUDA [18] automatically translates CUDA codes for general purpose multi-core processors, applying source-to-source translation. This implies that the MCUDA technique translates the kernel source code into a code written in a general purpose high-level language, which requires one additional step of source recompilation.

Twin Peaks [8] maps an OpenCL-compatible program targeted for GPUs onto multi-core CPUs by using the LLVM (Low Level Virtual Machine) intermediate representation for various instruction sets. Ocelot [6], which inspired our runtime system, uses a dynamic translation technique to map a CUDA program onto multi-core CPUs. Ocelot converts PTX code into LLVM code at runtime without recompilation and optimizes the PTX and LLVM code for execution by the CPU. The proposed framework in this paper is largely different from these translation techniques (MCUDA, Twin Peaks, and Ocelot) in that we support cooperative execution for parallel processing over both CPU cores and GPU cores.
In addition, EXOCHI provides a programming environment that enhances computing performance for media kernels on multicore CPUs with the Intel Graphics Media Accelerator (GMA) [20]. However, this programming model uses the CPU cores only for serial execution. The Merge framework has extended EXOCHI for parallel execution on CPU and GMA; however, it still requires APIs and additional porting time [13]. Lee et al. have presented a framework which aims at porting an OpenCL program to the Cell BE processor [12]. They have implemented a runtime system that manages software-managed caches and coherence protocols.

Ravi et al. [17] have proposed a compiler and a runtime framework that generate a hybrid code running on both CPU and GPU. It dynamically distributes the workload, but the framework targets only generalized reduction applications, while our system targets general CUDA applications. Qilin [14], the study most relevant to our proposed framework, has shown an adaptive kernel mapping using a dynamic work distribution. The Qilin system trains a program to maintain databases for its adaptive mapping scheme. In fact, Qilin requires and strongly relies on its own programming interface. This implies that the system cannot directly port existing CUDA codes; rather, programmers must modify the source code to fit its interfaces. As an alternative, CHC is designed for seamless porting of the existing CUDA code on CPU cores and GPU cores. In other words, we focus on providing backward compatibility of the CUDA runtime APIs.
3. Motivation

One of the major roles of the host CPU for the CUDA kernel is limited to controlling and accessing the graphics devices, while the GPU device provides a massive amount of data parallelism. Fig. 1(a) shows an example where the host controls the execution flow of the program only, while the device is responsible for executing the kernel. Once a CUDA program is started, the host processor executes the program sequentially until the kernel code is encountered. As soon as the host calls the kernel function, the device starts to execute the kernel with a large number of hardware threads on the GPU device. In fact, the host processor is held in the idle state until the device reaches the end of the kernel execution.

As a result, the idle time causes an inefficient utilization of the CPU hardware resources of the host machine. Our CHC system uses this idle computing resource through concurrent execution of the CUDA kernel on both CPU and GPU (as described in Fig. 1(b)). Considering that future computer systems are expected to incorporate more cores in both general purpose processors and graphics devices, parallel processing on CPU and GPU would become a great computing paradigm for high-performance applications. This would be quite helpful for programming a single-chip heterogeneous multi-core processor including CPU and GPU as well. Note that Intel and AMD have already shipped commercial heterogeneous multi-core processors.
In fact, CUDA is capable of enabling asynchronous concurrent execution between host and device. The concurrent execution returns control to the host before the device has completed a requested task (i.e., non-blocking). However, the CPU that regains control can only perform functions such as memory copies, setting other input data, or kernel launches using streams. The key difference on which we focus is the use of idle computing resources for concurrent execution of the same CUDA kernel on both CPU and GPU, thereby easing the GPU burden.
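For reference, the stream-based concurrency that standard CUDA already offers looks roughly like the following sketch; the kernel, buffer names, and sizes are illustrative rather than taken from the paper, and the point is that the host only stages transfers and launches instead of executing any part of the kernel itself.

#include <cuda_runtime.h>

// Illustrative kernel; the name and body are placeholders, not from the paper.
__global__ void kernel_func(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

void async_launch(float* d_in, float* d_out, const float* h_in, size_t bytes,
                  dim3 grid, dim3 block) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    kernel_func<<<grid, block, 0, stream>>>(d_in, d_out);  // non-blocking launch

    // The host regains control here, but in the standard CUDA model it only
    // stages further copies or launches; it does not execute kernel work.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}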
4. Design

An overview of our proposed CHC system is shown in Fig. 2. It contains two runtime procedures for each kernel launched, and each kernel execution undergoes both procedures. The first includes the Workload Distribution Module (WDM), designed to apply the distribution ratio to the kernel configuration information. The modified configuration information is then delivered to both the CPU loader and the GPU loader. Two sub-kernels (Kernel_CPU and Kernel_GPU) are loaded and executed, based on the modified kernel configurations produced by the WDM.

The second procedure is designed to translate the PTX code into the LLVM intermediate representation (LLVM IR). As seen in Fig. 2, this procedure extracts the PTX code from the CUDA binary to prepare the LLVM code for cooperative computing. On the GPU device, our runtime system passes the PTX code through the CUDA device driver, which means that the GPU executes the kernel in the original manner using PTX-JIT compilation. On the CPU core side, CHC uses the PTX translator provided in Ocelot in order to convert PTX instructions into LLVM IR [6]. This LLVM IR is used as the kernel context for all thread blocks running on CPU cores, and LLVM-JIT is utilized to execute the kernel context [11].

Figure 2. An overview of the CHC runtime system. The CUDA executable binary supplies the kernel configuration information to the Workload Distribution Module and the PTX assembly code to the PTX-to-LLVM translator; the GPU loader launches Kernel_GPU on the device through the CUDA driver (PTX-JIT), while the CPU loader runs Kernel_CPU on the CPU cores via LLVM-JIT, with the global scheduling queue and the abstraction layer for the TMS between main memory and the GPU's global memory.

The CUDA kernel execution typically needs some start-up time to initialize the GPU device. In the CHC framework, the GPU start-up process and the PTX-to-LLVM translation are performed simultaneously to hide the PTX-to-LLVM translation overhead.
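A minimal sketch of how such an overlap could be arranged is shown below; the translation helper is hypothetical (the paper does not show the CHC implementation), and cudaFree(0) is used only as a common idiom for forcing CUDA context creation.

#include <cuda_runtime.h>
#include <string>
#include <thread>

// Hypothetical stand-in for the Ocelot-based PTX-to-LLVM translation step.
static std::string translate_ptx_to_llvm(const std::string& ptx) {
    return "<llvm-ir for: " + ptx + ">";
}

void prepare_kernel(const std::string& ptx_code) {
    // Start GPU initialization (context creation) in the background ...
    std::thread gpu_init([] { cudaFree(0); });

    // ... while the PTX-to-LLVM translation runs on the CPU, so the
    // translation latency is hidden behind the device start-up time.
    std::string llvm_ir = translate_ptx_to_llvm(ptx_code);

    gpu_init.join();
    (void)llvm_ir;  // handed to the CPU loader in the real framework
}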
4.1. Workload distribution module and method

The input of the WDM is the kernel configuration information, and the output specifies two different portions of the kernel, one for the CPU cores and one for the GPU device. The kernel configuration information contains the execution configuration, which provides the dimension of a grid and that of a block. The dimension of a grid can be efficiently used by our workload distribution module.
In order to divide the CUDA kernel, the workload distribution module determines the number of thread blocks to be detached from the grid, considering the dimension of the grid and the workload distribution ratio, as depicted in Fig. 3. As a result, WDM generates two additional execution configurations, one for the CPU and the other for the GPU. WDM then delivers the generated execution configurations (i.e., the output of the WDM) to the CPU and GPU loaders. With these execution configurations, each loader can now make a sub-kernel by using the kernel context such as LLVM and PTX.

Figure 3. Work distribution flow and kernel mapping to CPU and GPU. For a launch kernel_func<<<dGrid, dBlock>>>(), the grid is split as sub1_dGrid.x := dGrid.x, sub1_dGrid.y := dGrid.y × CPU_Ratio for the CPU loader, and sub2_dGrid.x := dGrid.x, sub2_dGrid.y := dGrid.y × GPU_Ratio for the GPU loader.
Typically, WDM assigns the front portion of thread blocks to the GPU side, while the rest is assigned to the CPU side. Therefore, the first identifier of the CPU's sub-kernel will be (dGrid.y × GPU_Ratio) + 1. Each thread block can then identify its assigned data with this identifier, since both sides have an identical memory space.

In order to find the optimal workload distribution ratio, we could try to predict the runtime behavior, such as the execution delay on the CPU cores. However, it is quite hard to predict the characteristics of a CUDA program, since the runtime behavior strongly relies on dynamic characteristics of the kernel [1, 10]. For this reason, Qilin used an empirical approach to achieve its proposed adaptive mapping [14]. In fact, our proposed CHC also adopts a heuristic approach to determine the workload distribution ratio. The CHC framework then performs the dynamic work distribution at runtime based on this ratio. The proposed work distribution splits the kernel at the granularity of a thread block.
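The sketch below illustrates this split for a two-dimensional grid, assuming (as in Fig. 3) that the grid is divided along its y-dimension; the structure and function names, as well as the rounding policy, are illustrative rather than taken from the CHC implementation.

#include <cuda_runtime.h>

struct SubKernelConfig {
    dim3 grid;             // sub-grid handed to one loader
    unsigned firstBlockY;  // y-index of the first thread block this side owns
};

void split_grid(dim3 dGrid, float gpuRatio,
                SubKernelConfig& cpuCfg, SubKernelConfig& gpuCfg) {
    // The front portion of the grid goes to the GPU, the remainder to the CPU.
    unsigned gpuBlocksY = static_cast<unsigned>(dGrid.y * gpuRatio);
    unsigned cpuBlocksY = dGrid.y - gpuBlocksY;

    gpuCfg.grid = dim3(dGrid.x, gpuBlocksY);
    gpuCfg.firstBlockY = 0;

    cpuCfg.grid = dim3(dGrid.x, cpuBlocksY);
    cpuCfg.firstBlockY = gpuBlocksY;  // the paper expresses this as (dGrid.y × GPU_Ratio) + 1
}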
4.2. Memory consolidation for transparent memory space

A programmer writing CUDA applications should assign memory spaces in the device memory of the graphics hardware. These memory locations (or addresses) are used for the input and output data. In the CUDA model, data can be copied between the host memory and the dedicated memory on the device. For this purpose, the host system should preserve pointer variables pointing to the locations in the device memory.

As opposed to the original CUDA model, two different memory addresses exist for one pointer variable in our proposed CHC framework. The key design problem is caused by the fact that the computation results of the CPU side are stored in the main memory, which is different from the device memory. To address this problem, we propose and design an abstraction layer, the Transparent Memory Space (TMS), to preserve two different memory addresses in a pointer variable at a time.
Accessing memory addresses. The abstraction layer uses double-pointer data structures (similar to [19]) for pointer variables, to map one pointer variable onto two memory addresses: one in the main memory and one in the device memory. As seen in Fig. 4, we have declared the abstraction layer that manages a list of the TMS data structures. Whenever a pointer variable is referenced, the abstraction layer translates the pointer to the memory addresses for both CPU and GPU. For example, when a pointer variable (e.g., d_out) is used to allocate device memory using cudaMalloc(), the framework assigns memory spaces both in the device memory and in the host memory. The addresses of these memory spaces are stored in a TMS data structure (e.g., TMS1), and the framework maps the pointer variable onto the TMS data structure. Thus, the runtime framework can perform the address translation for a pointer variable.
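A minimal sketch of such an abstraction layer is given below; the entry layout, the mapping table, and the chcMalloc/resolve_for_cpu names are assumptions for illustration, since the paper does not show the actual data structures (in CHC the ordinary cudaMalloc() call itself is handled by the framework).

#include <cuda_runtime.h>
#include <cstdlib>
#include <unordered_map>

// One TMS entry: two addresses for the same logical buffer.
struct TMSEntry {
    void* host_addr;    // backing allocation in main memory (used by Kernel_CPU)
    void* device_addr;  // allocation in GPU device memory (used by Kernel_GPU)
};

// Mapping table keyed by the pointer value the program actually holds.
static std::unordered_map<void*, TMSEntry> g_tms;

cudaError_t chcMalloc(void** devPtr, size_t size) {
    TMSEntry e;
    e.host_addr = std::malloc(size);                     // CPU-side space
    cudaError_t err = cudaMalloc(&e.device_addr, size);  // GPU-side space
    *devPtr = e.device_addr;     // the program's pointer variable sees one address
    g_tms[e.device_addr] = e;    // ... but the framework remembers both
    return err;
}

// At kernel launch, each loader resolves an argument to its own memory domain.
void* resolve_for_cpu(void* arg) {
    auto it = g_tms.find(arg);
    return (it != g_tms.end()) ? it->second.host_addr : arg;
}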
Launching a kernel. For launching a kernel, pointer variables defined in advance may be used as arguments of the kernel function. At that time, the CPU and GPU loaders obtain each translated address from the mapping table, so that each sub-kernel retains actual addresses in its own memory domain.

Merging separated data. After finishing the kernel computation, the computation results are copied to the host memory (cudaMemcpy()) to perform further operations. Therefore, merging the data of the two separate memory domains is required. To reduce memory copy overhead, the framework traces the memory addresses which are modified by the CPU-side computation.
float *h_in;
float *h_out;
float *d_in;
float *d_out;
...
cudaMalloc((void**)&d_in, size);
cudaMalloc((void**)&d_out, size);
...
cudaMemcpy(d_in, h_in, size, ...);
kernel_func<<<...>>>(d_in, d_out);
cudaMemcpy(h_out, d_out, size, ...);
...

Figure 4. Anatomy of transparent memory space. The mapping table binds d_in to TMS0 and d_out to TMS1; each TMS entry holds one main-memory address and one device-memory address, so the CPU and GPU loaders each pass their sub-kernel the address in its own memory domain.

4.3. Global scheduling queue for thread scheduling

The GPU is a throughput-oriented architecture which shows outstanding performance for applications having a large amount of data parallelism [7]. However, to achieve meaningful performance from the CPU side, scheduling thread blocks with an efficient policy is important.

Ocelot uses a locality-aware static partitioning scheme in its thread scheduler, which assigns each thread block considering load balancing between neighboring worker threads [6]. However, this static partitioning method may cause some cores to finish their execution early. In our scheduling scheme, we allow a thread block to be assigned dynamically to any available core. For this purpose, we have implemented a work sharing scheme using a Global Scheduling Queue (GSQ) [3]. This scheduling algorithm enqueues a task (i.e., a thread block) into a global queue so that any worker thread on an available core can consume the task. Thus, this scheduling scheme allows a worker thread in each core to pick up only one thread block at a time and achieves load balancing. In addition, any core which finishes its assigned thread block early can take another thread block without being idle.
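A compact sketch of this kind of dynamic work sharing over the CPU-side thread blocks is shown below; the paper does not show the actual GSQ code, so the interface and the use of an atomic counter as the shared queue are assumptions.

#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Run numBlocks CPU-side thread blocks on numWorkers worker threads,
// letting any available core take the next unprocessed block.
void run_cpu_blocks(unsigned numBlocks, unsigned numWorkers,
                    const std::function<void(unsigned)>& runBlock) {
    std::atomic<unsigned> next{0};  // plays the role of the global scheduling queue
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            // A worker that finishes a block early immediately picks up another,
            // so no core sits idle while blocks remain.
            for (unsigned b = next.fetch_add(1); b < numBlocks; b = next.fetch_add(1))
                runBlock(b);
        });
    }
    for (auto& t : workers) t.join();
}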
4.4. Limitations on global memory consistency

CHC emulates the global memory on the CPU side as well. Thread blocks on the CPU can access the emulated global memory and perform atomic operations. However, our system does not allow global memory atomic operations between the thread blocks on the CPU and the thread blocks on the GPU, to avoid severe performance degradation. In fact, discrete GPUs have their own memory and communicate with the main memory through PCI Express, which causes long latency problems. This architectural limit suggests that the CHC prototype need not provide global memory atomic operations between CPU and GPU.
5. Results

The proposed CHC framework has been fully implemented on a desktop system with two Intel Xeon X5550 2.66 GHz quad-core processors and an NVIDIA GeForce 9400 GT device. The aim of the CHC framework is to demonstrate the feasibility of parallel kernel execution on CPU and GPU to improve CUDA execution on low-end GPUs. This configuration is also applicable to a single-chip heterogeneous multi-core processor that has an integrated GPU, which is generally slower than discrete GPUs.

We adopt 14 CUDA applications which do not require global memory synchronization across CPU and GPU at runtime: twelve from the NVIDIA CUDA Software Development Kit (SDK) [16], plus SpMV [2] and MD5 hashing [9]. Table 1 summarizes these applications and kernels. From left to right, the columns represent the application name, the number of computation kernels, the number of thread blocks in the kernels, a description of the kernel, and the work distribution ratio used in the CHC framework. We measured the execution time of kernel launches and compared the CHC framework against GPU-only computing. The validity of the CHC results was verified against a computation executed on the CPU only.
5.1. Initial analysis

For the initial analysis, we have measured the execution delay using only the GPU device and the delay using only the host CPU (through the LLVM JIT compilation technique [5, 6, 11]). In addition, the workload has been configured either as executing only one thread block or as executing the complete set of thread blocks.

The maximum performance improvement achievable based on the initial execution delays can be experimentally deduced, as depicted in Table 2. Fig. 5 shows the way to find it: the x-axis represents the workload ratio in terms of thread blocks assigned to the CPU cores against thread blocks on the GPU device. With more thread blocks on the CPU cores, fewer thread blocks are assigned to the GPU device. Therefore, the execution delay for the GPU is proportionally reduced along the x-axis.

From the above observation, the maximum value between the CPU execution delay and the GPU execution delay at a given workload ratio can be considered as the total execution time.
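Under the simplifying assumption that each side's delay scales linearly with the number of thread blocks it receives, the best ratio and the corresponding bound can be written compactly, as in the illustrative sketch below (t_cpu_all and t_gpu_all denote the measured delays for running the complete set of thread blocks on the CPU only and on the GPU only).

struct Split {
    double cpu_ratio;  // fraction of thread blocks assigned to the CPU cores
    double time;       // estimated total kernel time at that ratio
};

// Total time at CPU ratio r is max(r * t_cpu_all, (1 - r) * t_gpu_all);
// the minimum of that maximum lies where the two delays intersect.
Split best_split(double t_cpu_all, double t_gpu_all) {
    double r = t_gpu_all / (t_cpu_all + t_gpu_all);
    return { r, r * t_cpu_all };
}

// Estimated maximum speedup over GPU-only execution:
//   t_gpu_all / best_split(t_cpu_all, t_gpu_all).time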


References (partial list)

  • C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," Proc. International Symposium on Code Generation and Optimization (CGO), Mar. 2004.
  • A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," Proc. International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr. 2009.
  • R. D. Blumofe and C. E. Leiserson, "Scheduling Multithreaded Computations by Work Stealing," Journal of the ACM, 1999.
  • J. Nickolls and W. J. Dally, "The GPU Computing Era," IEEE Micro, 2010.
  • C.-K. Luk, S. Hong, and H. Kim, "Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping," Proc. International Symposium on Microarchitecture (MICRO), Dec. 2009.

Frequently Asked Questions

Q1. What contributions have the authors mentioned in the paper "Cooperative Heterogeneous Computing for Parallel Processing on CPU/GPU Hybrids"?
This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. 

The authors believe that cooperative heterogeneous computing can be utilized in future heterogeneous multi-core processors, which are expected to include even more GPU cores as well as CPU cores. As future work, the authors will first develop a dynamic control scheme for deciding the workload distribution ratio. The authors also plan to design an efficient thread block distribution technique considering data access patterns and thread divergence. In fact, the future CHC framework needs to address the performance trade-offs considering the CUDA application configurations on various GPU and CPU models.
