
Cooperative heterogeneous computing for parallel processing on CPU/GPU hybrids

TL;DR: A cooperative heterogeneous computing framework is presented which enables the efficient utilization of the available computing resources of host CPU cores for CUDA kernels, which are designed to run only on the GPU, without any source recompilation.
Abstract: This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups as high as 3.08 compared to the baseline GPU-only processing.

Summary

1. Introduction

  • General-Purpose computing on Graphics Processing Units has recently emerged as a powerful computing paradigm because of the massive parallelism provided by several hundreds of processing cores [4, 15].
  • This will eventually provide additional computing power for the kernel execution while utilizing the idle CPU cores.
  • The paper proposes Cooperative Heterogeneous Computing (CHC), a new computing paradigm for explicitly processing CUDA applications in parallel on sets of heterogeneous processors including x86 based general-purpose multi-core processors and graphics processing units.
  • The authors present a theoretical analysis of the expected performance to demonstrate the maximum feasible improvement of their proposed system.
  • In addition, the performance evaluation on a real system has been performed and the results show that speedups as high as 3.08 have been achieved.

3. Motivation

  • One of the major roles of the host CPU for the CUDA kernel is limited to controlling and accessing the graphics devices, while the GPU device provides a massive amount of data parallelism.
  • As soon as the host calls the kernel function, the device starts to execute the kernel with a large number of hardware threads on the GPU device.
  • The authors' CHC system uses this idle computing resource through concurrent execution of the CUDA kernel on both CPU and GPU (as described in Fig. 1(b)).
  • This would be quite helpful to program a single chip heterogeneous multi-core processor including CPU and GPU as well.
  • The key difference on which the authors focus is in the use of idle computing resources with concurrent execution of the same CUDA kernel on both CPU and GPU, thereby easing the GPU burden.

4. Design

  • An overview of their proposed CHC system is shown in Fig. 2. The first of its two runtime procedures is the Workload Distribution Module (WDM), designed to apply the distribution ratio to the kernel configuration information.
  • As seen in Fig. 2, this procedure extracts the PTX code from the CUDA binary to prepare the LLVM code for cooperative computing.
  • The CUDA kernel execution typically needs some startup time to initialize the GPU device.
  • In the CHC framework, the GPU start-up process and the PTX-to-LLVM translation are simultaneously performed to hide the PTX-to-LLVM translation overhead.

4.1. Workload distribution module and method

  • The input of WDM is the kernel configuration information and the output specifies two different portions of the kernel, each for CPU cores and the GPU device.
  • In order to divide the CUDA kernel, the workload distribution module determines the number of thread blocks to be detached from the grid, considering the dimension of the grid and the workload distribution ratio, as depicted in Fig. 3.
  • WDM then delivers the generated execution configurations (i.e., the output of the WDM) to the CPU and GPU loaders.
  • Therefore, the first identifier of the CPU's sub-kernel will be (dGrid.y × GPU_Ratio) + 1.
  • The proposed work distribution splits the kernel at the granularity of a thread block.

4.2. Memory consolidation for transparent memory space

  • A programmer writing CUDA applications should assign memory spaces in the device memory of the graphics hardware.
  • For this purpose, the host system should preserve pointer variables pointing to the location in the device memory.
  • The abstraction layer uses double-pointer data structures (similar to [19]) for pointer variables to map one pointer variable onto two memory addresses: one in the main memory and one in the device memory.
  • Whenever a pointer variable is referenced, the abstraction layer translates the pointer to the memory addresses, for both CPU and GPU.
  • The addresses of these memory spaces are stored in a TMS data structure (e.g., TMS1), and the framework maps the pointer variable on the TMS data structure.

4.3. Global scheduling queue for thread scheduling

  • GPU is a throughput-oriented architecture which shows outstanding performance with applications having a large amount of data parallelism [7].
  • In their scheduling scheme, the authors allow a thread block to be assigned dynamically to any available core.
  • For this purpose, the authors have implemented a work sharing scheme using a Global Scheduling Queue (GSQ) [3].
  • This scheduling algorithm enqueues a task (i.e., a thread block) into a global queue so that any worker thread on an available core can consume the task.
  • In addition, any core which finishes its assigned thread block early can take on another thread block rather than remain idle.

4.4. Limitations on global memory consistency

  • CHC emulates the global memory on the CPU-side as well.
  • Thread blocks in the CPU can access the emulated global memory and perform the atomic operations.
  • Their system does not allow the global memory atomic operations between the thread blocks on the CPU and the thread blocks on the GPU to avoid severe performance degradation.
  • In fact, discrete GPUs have their own memory and communicate with the main memory through the PCI express, which causes long latency problems.
  • This architectural limit suggests that the CHC prototype need not provide global memory atomic operations between CPU and GPU.

5. Results

  • The proposed CHC framework has been fully implemented on a desktop system with two Intel Xeon X5550 2.66 GHz quad-core processors and an NVIDIA GeForce 9400 GT device.
  • This configuration is also applicable to a single-chip heterogeneous multi-core processor that has an integrated GPU, which is generally slower than discrete GPUs.
  • The authors adopt 14 CUDA applications which do not require global memory synchronization across CPU and GPU at runtime: twelve from the NVIDIA CUDA Software Development Kit (SDK) [16], plus SpMV [2] and MD5 hashing [9].
  • From left to right the columns represent the application name, the number of computation kernels, the number of thread blocks in the kernels, a description of the kernel, and work distribution ratio used in the CHC framework.
  • The authors measured the execution time of kernel launches and compared the CHC framework against GPU-only computing.

5.1. Initial analysis

  • For the initial analysis, the authors have measured the execution delay using only the GPU device and the delay using only the host CPU (through the LLVM JIT compilation technique [5, 6, 11]).
  • In addition, the workload has been configured either as executing only one thread block or as executing the complete set of thread blocks.
  • The maximum performance improvement achievable based on the initial execution delays can be experimentally deduced, as depicted in Table 2. Fig. 5 shows the way to find it; the x-axis represents the workload ratio in terms of thread blocks assigned to the CPU cores against thread blocks on the GPU device.
  • Therefore, the execution delay for GPU is proportionally reduced along the x-axis.
  • Compressed Sparse Row (CSR) format is used.

5.2. Performance improvements of CHC framework

  • Table 2 shows the maximum performance and the optimal distribution ratio obtained from the initial analysis.
  • In addition, the actual execution time and the actual work distribution ratio using CHC are also presented.
  • In fact, the optimal distribution ratio is used to determine the work distribution ratio on CHC.
  • In more detail, the applications with exponential, trigonometric, or power arithmetic operations (BINO, BLKS, MERT, MONT, and CONV) show little performance improvement.

6. Conclusions

  • The paper has introduced three key features for the efficient exploitation of the thread level parallelism provided by CUDA on the CPU multi-cores in addition to the GPU device.
  • The experiments demonstrate that the proposed framework successfully achieves efficient parallel execution and that the performance results obtained are close to the values deduced from the theoretical analysis.
  • The authors also plan to design an efficient thread block distribution technique considering data access patterns and thread divergence.


Cooperative Heterogeneous Computing for Parallel Processing on CPU/GPU Hybrids

Changmin Lee and Won W. Ro
School of Electrical and Electronic Engineering
Yonsei University
Seoul 120-749, Republic of Korea
{exahz, wro}@yonsei.ac.kr

Jean-Luc Gaudiot
Department of Electrical Engineering and Computer Science
University of California, Irvine
Irvine, CA 92697-2625
gaudiot@uci.edu
Abstract

This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups as high as 3.08 compared to the baseline GPU-only processing.
1. Introduction

General-Purpose computing on Graphics Processing Units (GPGPU) has recently emerged as a powerful computing paradigm because of the massive parallelism provided by several hundreds of processing cores [4, 15]. Under the GPGPU concept, NVIDIA has developed a C-based programming model, Compute Unified Device Architecture (CUDA), which provides greater programmability for high-performance graphics devices. As a matter of fact, general-purpose computing on graphics devices with CUDA helps improve the performance of many applications under the concept of a Single Instruction Multiple Thread (SIMT) model.

Although the GPGPU paradigm successfully provides significant computation throughput, its performance could still be improved if we could utilize the idle CPU resource. Indeed, in general, the host CPU is being held while the CUDA kernel executes on the GPU devices; the CPU is not allowed to resume execution until the GPU has completed the kernel code and has provided the computation results. The main motivation of our research is to exploit parallelism across the host CPU cores in addition to the GPU cores. This will eventually provide additional computing power for the kernel execution while utilizing the idle CPU cores. Our ultimate goal is thus to provide a technique which eventually exploits sufficient parallelism across heterogeneous processors.

The paper proposes Cooperative Heterogeneous Computing (CHC), a new computing paradigm for explicitly processing CUDA applications in parallel on sets of heterogeneous processors including x86-based general-purpose multi-core processors and graphics processing units. There have been several previous research projects which have aimed at exploiting parallelism on CPU and GPU. However, those previous approaches require either additional programming language support or API development. As opposed to those previous efforts, our CHC is a software framework that provides a virtual layer for transparent execution over host CPU cores. This enables the direct execution of CUDA code, while simultaneously providing sufficient portability and backward compatibility.
To achieve an efficient cooperative execution model, we have developed three important techniques:

  • A workload distribution module (WDM) for the CUDA kernel, to map each kernel onto CPU and GPU
  • A memory model that supports a transparent memory space (TMS) to manage the main memory together with the GPU memory
  • A global scheduling queue (GSQ) that supports balanced thread scheduling and distribution on each of the CPU cores
We present a theoretical analysis of the expected performance to demonstrate the maximum feasible improvement of our proposed system. In addition, the performance evaluation on a real system has been performed, and the results show that speedups as high as 3.08 have been achieved. On average, the complete CHC system shows a performance improvement of 1.42 over GPU-only computation with 14 CUDA applications.

Figure 1. Execution flow of a CUDA program: limitation where the CPU stalls and waits during kernel execution (a), and cooperative execution in parallel with reduced execution time (b).

The rest of the paper is organized as follows. Section 2 reviews related work, and Section 3 introduces the existing CUDA programming model and describes the motivation of this work. Section 4 presents the design and limitations of the CHC framework. Section 5 gives preliminary results. Finally, we conclude this work in Section 6.
2. Related work

There have been several prior research projects which aim at mapping an explicitly parallel program for graphics devices onto multi-core CPUs or heterogeneous architectures. MCUDA [18] automatically translates CUDA codes for general purpose multi-core processors, applying source-to-source translation. This implies that the MCUDA technique translates the kernel source code into a code written in a general purpose high-level language, which requires one additional step of source recompilation.

Twin Peaks [8] maps an OpenCL-compatible program targeted for GPUs onto multi-core CPUs by using the LLVM (Low Level Virtual Machine) intermediate representation for various instruction sets. Ocelot [6], which inspired our runtime system, uses a dynamic translation technique to map a CUDA program onto multi-core CPUs. Ocelot converts PTX code into LLVM code at runtime without recompilation and optimizes the PTX and LLVM code for execution by the CPU. The proposed framework in this paper is largely different from these translation techniques (MCUDA, Twin Peaks, and Ocelot) in that we support cooperative execution for parallel processing over both CPU cores and GPU cores.
In addition, EXOCHI provides a programming environment that enhances computing performance for media kernels on multicore CPUs with the Intel Graphics Media Accelerator (GMA) [20]. However, this programming model uses the CPU cores only for serial execution. The Merge framework has extended EXOCHI for parallel execution on CPU and GMA; however, it still requires APIs and additional porting time [13]. Lee et al. have presented a framework which aims at porting an OpenCL program to the Cell BE processor [12]. They have implemented a runtime system that manages software-managed caches and coherence protocols.

Ravi et al. [17] have proposed a compiler and a runtime framework that generate a hybrid code running on both CPU and GPU. It dynamically distributes the workload, but the framework targets only generalized reduction applications, while our system targets general CUDA applications. Qilin [14], the study most relevant to our proposed framework, has shown an adaptive kernel mapping using a dynamic work distribution. The Qilin system trains a program to maintain databases for its adaptive mapping scheme. In fact, Qilin requires and strongly relies on its own programming interface. This implies that the system cannot directly port existing CUDA codes; rather, programmers must modify the source code to fit its interfaces. As an alternative, CHC is designed for seamless porting of the existing CUDA code on CPU cores and GPU cores. In other words, we focus on providing backward compatibility of the CUDA runtime APIs.
3. Motivation

One of the major roles of the host CPU for the CUDA kernel is limited to controlling and accessing the graphics devices, while the GPU device provides a massive amount of data parallelism. Fig. 1(a) shows an example where the host controls the execution flow of the program only, while the device is responsible for executing the kernel. Once a CUDA program is started, the host processor executes the program sequentially until the kernel code is encountered. As soon as the host calls the kernel function, the device starts to execute the kernel with a large number of hardware threads on the GPU device. In fact, the host processor is held in the idle state until the device reaches the end of the kernel execution.

As a result, the idle time causes an inefficient utilization of the CPU hardware resources of the host machine. Our CHC system uses this idle computing resource through concurrent execution of the CUDA kernel on both CPU and GPU (as described in Fig. 1(b)). Considering that future computer systems are expected to incorporate more cores in both general purpose processors and graphics devices, parallel processing on CPU and GPU would become a great computing paradigm for high-performance applications. This would be quite helpful for programming a single-chip heterogeneous multi-core processor including CPU and GPU as well. Note that Intel and AMD have already shipped commercial heterogeneous multi-core processors.
In fact, CUDA is capable of enabling asynchronous concurrent execution between host and device. The concurrent execution returns control to the host before the device has completed a requested task (i.e., non-blocking). However, the CPU that regains control can only perform functions such as memory copies, setting other input data, or kernel launches using streams. The key difference on which we focus is the use of idle computing resources for concurrent execution of the same CUDA kernel on both CPU and GPU, thereby easing the GPU burden.
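For reference, the stream-based concurrency that standard CUDA already offers looks roughly like the following sketch; the kernel, buffer names, and sizes are illustrative rather than taken from the paper, and the point is that the host only stages transfers and launches instead of executing any part of the kernel itself.

#include <cuda_runtime.h>

// Illustrative kernel; the name and body are placeholders, not from the paper.
__global__ void kernel_func(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

void async_launch(float* d_in, float* d_out, const float* h_in, size_t bytes,
                  dim3 grid, dim3 block) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    kernel_func<<<grid, block, 0, stream>>>(d_in, d_out);  // non-blocking launch

    // The host regains control here, but in the standard CUDA model it only
    // stages further copies or launches; it does not execute kernel work.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}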
4. Design

An overview of our proposed CHC system is shown in Fig. 2. It contains two runtime procedures for each kernel launched, and each kernel execution undergoes both procedures. The first includes the Workload Distribution Module (WDM), designed to apply the distribution ratio to the kernel configuration information. The modified configuration information is then delivered to both the CPU loader and the GPU loader. Two sub-kernels (Kernel_CPU and Kernel_GPU) are loaded and executed, based on the modified kernel configurations produced by the WDM.

The second procedure is designed to translate the PTX code into the LLVM intermediate representation (LLVM IR). As seen in Fig. 2, this procedure extracts the PTX code from the CUDA binary to prepare the LLVM code for cooperative computing. On the GPU device, our runtime system passes the PTX code through the CUDA device driver, which means that the GPU executes the kernel in the original manner using PTX-JIT compilation. On the CPU core side, CHC uses the PTX translator provided in Ocelot in order to convert PTX instructions into LLVM IR [6]. This LLVM IR is used as the kernel context for all thread blocks running on CPU cores, and LLVM-JIT is utilized to execute the kernel context [11].

Figure 2. An overview of the CHC runtime system. The CUDA executable binary supplies the kernel configuration information to the Workload Distribution Module and the PTX assembly code to the PTX-to-LLVM translator; the GPU loader launches Kernel_GPU on the device through the CUDA driver (PTX-JIT), while the CPU loader runs Kernel_CPU on the CPU cores via LLVM-JIT, with the global scheduling queue and the abstraction layer for the TMS between main memory and the GPU's global memory.

The CUDA kernel execution typically needs some start-up time to initialize the GPU device. In the CHC framework, the GPU start-up process and the PTX-to-LLVM translation are performed simultaneously to hide the PTX-to-LLVM translation overhead.
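A minimal sketch of how such an overlap could be arranged is shown below; the translation helper is hypothetical (the paper does not show the CHC implementation), and cudaFree(0) is used only as a common idiom for forcing CUDA context creation.

#include <cuda_runtime.h>
#include <string>
#include <thread>

// Hypothetical stand-in for the Ocelot-based PTX-to-LLVM translation step.
static std::string translate_ptx_to_llvm(const std::string& ptx) {
    return "<llvm-ir for: " + ptx + ">";
}

void prepare_kernel(const std::string& ptx_code) {
    // Start GPU initialization (context creation) in the background ...
    std::thread gpu_init([] { cudaFree(0); });

    // ... while the PTX-to-LLVM translation runs on the CPU, so the
    // translation latency is hidden behind the device start-up time.
    std::string llvm_ir = translate_ptx_to_llvm(ptx_code);

    gpu_init.join();
    (void)llvm_ir;  // handed to the CPU loader in the real framework
}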
4.1. Workload distribution module and method

The input of the WDM is the kernel configuration information, and the output specifies two different portions of the kernel, one for the CPU cores and one for the GPU device. The kernel configuration information contains the execution configuration, which provides the dimension of a grid and that of a block. The dimension of a grid can be efficiently used by our workload distribution module.
In order to divide the CUDA kernel, the workload distribution module determines the number of thread blocks to be detached from the grid, considering the dimension of the grid and the workload distribution ratio, as depicted in Fig. 3. As a result, WDM generates two additional execution configurations, one for the CPU and the other for the GPU. WDM then delivers the generated execution configurations (i.e., the output of the WDM) to the CPU and GPU loaders. With these execution configurations, each loader can now make a sub-kernel by using the kernel context such as LLVM and PTX.

Figure 3. Work distribution flow and kernel mapping to CPU and GPU. For a launch kernel_func<<<dGrid, dBlock>>>(), the grid is split as sub1_dGrid.x := dGrid.x, sub1_dGrid.y := dGrid.y × CPU_Ratio for the CPU loader, and sub2_dGrid.x := dGrid.x, sub2_dGrid.y := dGrid.y × GPU_Ratio for the GPU loader.
Typically, WDM assigns the front portion of thread blocks to the GPU side, while the rest is assigned to the CPU side. Therefore, the first identifier of the CPU's sub-kernel will be (dGrid.y × GPU_Ratio) + 1. Each thread block can then identify its assigned data with this identifier, since both sides have an identical memory space.

In order to find the optimal workload distribution ratio, we could try to predict the runtime behavior, such as the execution delay on the CPU cores. However, it is quite hard to predict the characteristics of a CUDA program, since the runtime behavior strongly relies on dynamic characteristics of the kernel [1, 10]. For this reason, Qilin used an empirical approach to achieve its proposed adaptive mapping [14]. In fact, our proposed CHC also adopts a heuristic approach to determine the workload distribution ratio. The CHC framework then performs the dynamic work distribution at runtime based on this ratio. The proposed work distribution splits the kernel at the granularity of a thread block.
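The sketch below illustrates this split for a two-dimensional grid, assuming (as in Fig. 3) that the grid is divided along its y-dimension; the structure and function names, as well as the rounding policy, are illustrative rather than taken from the CHC implementation.

#include <cuda_runtime.h>

struct SubKernelConfig {
    dim3 grid;             // sub-grid handed to one loader
    unsigned firstBlockY;  // y-index of the first thread block this side owns
};

void split_grid(dim3 dGrid, float gpuRatio,
                SubKernelConfig& cpuCfg, SubKernelConfig& gpuCfg) {
    // The front portion of the grid goes to the GPU, the remainder to the CPU.
    unsigned gpuBlocksY = static_cast<unsigned>(dGrid.y * gpuRatio);
    unsigned cpuBlocksY = dGrid.y - gpuBlocksY;

    gpuCfg.grid = dim3(dGrid.x, gpuBlocksY);
    gpuCfg.firstBlockY = 0;

    cpuCfg.grid = dim3(dGrid.x, cpuBlocksY);
    cpuCfg.firstBlockY = gpuBlocksY;  // the paper expresses this as (dGrid.y × GPU_Ratio) + 1
}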
4.2. Memory consolidation for transparent memory space

A programmer writing CUDA applications should assign memory spaces in the device memory of the graphics hardware. These memory locations (or addresses) are used for the input and output data. In the CUDA model, data can be copied between the host memory and the dedicated memory on the device. For this purpose, the host system should preserve pointer variables pointing to the locations in the device memory.

As opposed to the original CUDA model, two different memory addresses exist for one pointer variable in our proposed CHC framework. The key design problem is caused by the fact that the computation results of the CPU side are stored in the main memory, which is different from the device memory. To address this problem, we propose and design an abstraction layer, the Transparent Memory Space (TMS), to preserve two different memory addresses in a pointer variable at a time.
Accessing memory addresses. The abstraction layer uses double-pointer data structures (similar to [19]) for pointer variables, to map one pointer variable onto two memory addresses: one in the main memory and one in the device memory. As seen in Fig. 4, we have declared the abstraction layer that manages a list of the TMS data structures. Whenever a pointer variable is referenced, the abstraction layer translates the pointer to the memory addresses for both CPU and GPU. For example, when a pointer variable (e.g., d_out) is used to allocate device memory using cudaMalloc(), the framework assigns memory spaces both in the device memory and in the host memory. The addresses of these memory spaces are stored in a TMS data structure (e.g., TMS1), and the framework maps the pointer variable onto the TMS data structure. Thus, the runtime framework can perform the address translation for a pointer variable.
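A minimal sketch of such an abstraction layer is given below; the entry layout, the mapping table, and the chcMalloc/resolve_for_cpu names are assumptions for illustration, since the paper does not show the actual data structures (in CHC the ordinary cudaMalloc() call itself is handled by the framework).

#include <cuda_runtime.h>
#include <cstdlib>
#include <unordered_map>

// One TMS entry: two addresses for the same logical buffer.
struct TMSEntry {
    void* host_addr;    // backing allocation in main memory (used by Kernel_CPU)
    void* device_addr;  // allocation in GPU device memory (used by Kernel_GPU)
};

// Mapping table keyed by the pointer value the program actually holds.
static std::unordered_map<void*, TMSEntry> g_tms;

cudaError_t chcMalloc(void** devPtr, size_t size) {
    TMSEntry e;
    e.host_addr = std::malloc(size);                     // CPU-side space
    cudaError_t err = cudaMalloc(&e.device_addr, size);  // GPU-side space
    *devPtr = e.device_addr;     // the program's pointer variable sees one address
    g_tms[e.device_addr] = e;    // ... but the framework remembers both
    return err;
}

// At kernel launch, each loader resolves an argument to its own memory domain.
void* resolve_for_cpu(void* arg) {
    auto it = g_tms.find(arg);
    return (it != g_tms.end()) ? it->second.host_addr : arg;
}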
Launching a kernel. For launching a kernel, pointer variables defined in advance may be used as arguments of the kernel function. At that time, the CPU and GPU loaders obtain each translated address from the mapping table, so that each sub-kernel retains actual addresses in its own memory domain.

Merging separated data. After finishing the kernel computation, the computation results are copied to the host memory (cudaMemcpy()) to perform further operations. Therefore, merging the data of the two separate memory domains is required. To reduce memory copy overhead, the framework traces the memory addresses which are modified by the CPU-side computation.
float *h_in;
float *h_out;
float *d_in;
float *d_out;
...
cudaMalloc((void**)&d_in, size);
cudaMalloc((void**)&d_out, size);
...
cudaMemcpy(d_in, h_in, size, ...);
kernel_func<<<...>>>(d_in, d_out);
cudaMemcpy(h_out, d_out, size, ...);
...

Figure 4. Anatomy of transparent memory space. The mapping table binds d_in to TMS0 and d_out to TMS1; each TMS entry holds one main-memory address and one device-memory address, so the CPU and GPU loaders each pass their sub-kernel the address in its own memory domain.

4.3. Global scheduling queue for thread scheduling

The GPU is a throughput-oriented architecture which shows outstanding performance for applications having a large amount of data parallelism [7]. However, to achieve meaningful performance from the CPU side, scheduling thread blocks with an efficient policy is important.

Ocelot uses a locality-aware static partitioning scheme in its thread scheduler, which assigns each thread block considering load balancing between neighboring worker threads [6]. However, this static partitioning method may cause some cores to finish their execution early. In our scheduling scheme, we allow a thread block to be assigned dynamically to any available core. For this purpose, we have implemented a work sharing scheme using a Global Scheduling Queue (GSQ) [3]. This scheduling algorithm enqueues a task (i.e., a thread block) into a global queue so that any worker thread on an available core can consume the task. Thus, this scheduling scheme allows a worker thread in each core to pick up only one thread block at a time and achieves load balancing. In addition, any core which finishes its assigned thread block early can take another thread block without being idle.
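A compact sketch of this kind of dynamic work sharing over the CPU-side thread blocks is shown below; the paper does not show the actual GSQ code, so the interface and the use of an atomic counter as the shared queue are assumptions.

#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Run numBlocks CPU-side thread blocks on numWorkers worker threads,
// letting any available core take the next unprocessed block.
void run_cpu_blocks(unsigned numBlocks, unsigned numWorkers,
                    const std::function<void(unsigned)>& runBlock) {
    std::atomic<unsigned> next{0};  // plays the role of the global scheduling queue
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            // A worker that finishes a block early immediately picks up another,
            // so no core sits idle while blocks remain.
            for (unsigned b = next.fetch_add(1); b < numBlocks; b = next.fetch_add(1))
                runBlock(b);
        });
    }
    for (auto& t : workers) t.join();
}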
4.4. Limitations on global memory consistency

CHC emulates the global memory on the CPU side as well. Thread blocks on the CPU can access the emulated global memory and perform atomic operations. However, our system does not allow global memory atomic operations between the thread blocks on the CPU and the thread blocks on the GPU, to avoid severe performance degradation. In fact, discrete GPUs have their own memory and communicate with the main memory through PCI Express, which causes long latency problems. This architectural limit suggests that the CHC prototype need not provide global memory atomic operations between CPU and GPU.
5. Results

The proposed CHC framework has been fully implemented on a desktop system with two Intel Xeon X5550 2.66 GHz quad-core processors and an NVIDIA GeForce 9400 GT device. The aim of the CHC framework is to demonstrate the feasibility of parallel kernel execution on CPU and GPU to improve CUDA execution on low-end GPUs. This configuration is also applicable to a single-chip heterogeneous multi-core processor that has an integrated GPU, which is generally slower than discrete GPUs.

We adopt 14 CUDA applications which do not require global memory synchronization across CPU and GPU at runtime: twelve from the NVIDIA CUDA Software Development Kit (SDK) [16], plus SpMV [2] and MD5 hashing [9]. Table 1 summarizes these applications and kernels. From left to right, the columns represent the application name, the number of computation kernels, the number of thread blocks in the kernels, a description of the kernel, and the work distribution ratio used in the CHC framework. We measured the execution time of kernel launches and compared the CHC framework against GPU-only computing. The validity of the CHC results was verified against a computation executed on the CPU only.
5.1. Initial analysis

For the initial analysis, we have measured the execution delay using only the GPU device and the delay using only the host CPU (through the LLVM JIT compilation technique [5, 6, 11]). In addition, the workload has been configured either as executing only one thread block or as executing the complete set of thread blocks.

The maximum performance improvement achievable based on the initial execution delays can be experimentally deduced, as depicted in Table 2. Fig. 5 shows the way to find it: the x-axis represents the workload ratio in terms of thread blocks assigned to the CPU cores against thread blocks on the GPU device. With more thread blocks on the CPU cores, fewer thread blocks are assigned to the GPU device. Therefore, the execution delay for the GPU is proportionally reduced along the x-axis.

From the above observation, the maximum value between the CPU execution delay and the GPU execution delay at a given workload ratio can be considered as the total execution time.
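Under the simplifying assumption that each side's delay scales linearly with the number of thread blocks it receives, the best ratio and the corresponding bound can be written compactly, as in the illustrative sketch below (t_cpu_all and t_gpu_all denote the measured delays for running the complete set of thread blocks on the CPU only and on the GPU only).

struct Split {
    double cpu_ratio;  // fraction of thread blocks assigned to the CPU cores
    double time;       // estimated total kernel time at that ratio
};

// Total time at CPU ratio r is max(r * t_cpu_all, (1 - r) * t_gpu_all);
// the minimum of that maximum lies where the two delays intersect.
Split best_split(double t_cpu_all, double t_gpu_all) {
    double r = t_gpu_all / (t_cpu_all + t_gpu_all);
    return { r, r * t_cpu_all };
}

// Estimated maximum speedup over GPU-only execution:
//   t_gpu_all / best_split(t_cpu_all, t_gpu_all).time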


References (partial list)

  • C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," Proc. International Symposium on Code Generation and Optimization (CGO), Mar. 2004.
  • A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," Proc. International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr. 2009.
  • R. D. Blumofe and C. E. Leiserson, "Scheduling Multithreaded Computations by Work Stealing," Journal of the ACM, 1999.
  • J. Nickolls and W. J. Dally, "The GPU Computing Era," IEEE Micro, 2010.
  • C.-K. Luk, S. Hong, and H. Kim, "Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping," Proc. International Symposium on Microarchitecture (MICRO), Dec. 2009.

Frequently Asked Questions

Q1. What contributions have the authors mentioned in the paper "Cooperative Heterogeneous Computing for Parallel Processing on CPU/GPU Hybrids"?
This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. 

The authors believe that cooperative heterogeneous computing can be utilized in future heterogeneous multi-core processors, which are expected to include even more GPU cores as well as CPU cores. As future work, the authors will first develop a dynamic control scheme for deciding the workload distribution ratio. The authors also plan to design an efficient thread block distribution technique considering data access patterns and thread divergence. In fact, the future CHC framework needs to address the performance trade-offs considering the CUDA application configurations on various GPU and CPU models.
