scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

04 Oct 2009-pp 44-54
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.



Rodinia: A Benchmark Suite for Heterogeneous Computing
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee and Kevin Skadron
{sc5nf, mwb7w, jm6dg, dt2f, jws9c, sl4ge, ks7h}@virginia.edu
Department of Computer Science, University of Virginia
Abstract—This paper presents and characterizes Rodinia, a
benchmark suite for heterogeneous computing. To help architects
study emerging platforms such as GPUs (Graphics Processing
Units), Rodinia includes applications and kernels which target
multi-core CPU and GPU platforms. The choice of applications
is inspired by Berkeley’s dwarf taxonomy. Our characterization
shows that the Rodinia benchmarks cover a wide range of
parallel communication patterns, synchronization techniques and
power consumption, and has led to some important architectural
insight, such as the growing importance of memory-bandwidth
limitations and the consequent importance of data layout.
I. INTRODUCTION
With the microprocessor industry’s shift to multicore archi-
tectures, research in parallel computing is essential to ensure
future progress in mainstream computer systems. This in turn
requires standard benchmark programs that researchers can use
to compare platforms, identify performance bottlenecks, and
evaluate potential solutions. Several current benchmark suites
provide parallel programs, but only for conventional, general-
purpose CPU architectures.
However, various accelerators, such as GPUs and FPGAs,
are increasingly popular because they are becoming easier
to program and offer dramatically better performance for
many applications. These accelerators differ significantly from
CPUs in architecture, middleware and programming models.
GPUs also offer parallelism at scales not currently available
with other microprocessors. Existing benchmark suites neither
support these accelerators’ APIs nor represent the kinds of ap-
plications and parallelism that are likely to drive development
of such accelerators. Understanding accelerators’ architectural
strengths and weaknesses is important for computer systems
researchers as well as for programmers, who will gain insight
into the most effective data structures and algorithms for each
platform. Hardware and compiler innovation for accelerators
and for heterogeneous system design may be just as com-
mercially and socially beneficial as for conventional CPUs.
Inhibiting such innovation, however, is the lack of a benchmark
suite providing a diverse set of applications for heterogeneous
systems.
In this paper, we extend and characterize the Rodinia
benchmark suite [4], a set of applications developed to address
these concerns. These applications have been implemented for
both GPUs and multicore CPUs using CUDA and OpenMP.
The suite is structured to span a range of parallelism and data-
sharing characteristics. Each application or kernel is carefully
chosen to represent different types of behavior according
to the Berkeley dwarves [1]. The suite now covers diverse
dwarves and application domains and currently includes nine
applications or kernels. We characterize the suite to ensure
that it covers a diverse range of behaviors and to illustrate
interesting differences between CPUs and GPUs.
In our CPU vs. GPU comparisons using Rodinia, we
have also discovered that the major architectural differences
between CPUs and GPUs have important implications for
software. For instance, the GPU offers a very low ratio of on-
chip storage to number of threads, but also offers specialized
memory spaces that can mitigate these costs: the per-block
shared memory (PBSM), constant, and texture memories. Each
is suited to different data-use patterns. The GPU’s lack of
persistent state in the PBSM results in less efficient commu-
nication among producer and consumer kernels. GPUs do not
easily allow runtime load balancing of work among threads
within a kernel, and thread resources can be wasted as a
result. Finally, discrete GPUs have high kernel-call and data-
transfer costs. Although we used some optimization techniques
to alleviate these issues, they remain a bottleneck for some
applications.
The benchmarks have been evaluated on an NVIDIA
GeForce GTX 280 GPU with a 1.3 GHz shader clock and a 3.2
GHz Quad-core Intel Core 2 Extreme CPU. The applications
exhibit diverse behavior, with speedups ranging from 5.5 to
80.8 over single-threaded CPU programs and from 1.6 to
26.3 over four-threaded CPU programs, varying CPU-GPU
communication overheads (2%-76%, excluding I/O and initial
setup), and varying GPU power consumption overheads (38W-
83W).
The contributions of this paper are as follows:
• We illustrate the need for a new benchmark suite for heterogeneous computing, with GPUs and multicore CPUs used as a case study.
• We characterize the diversity of the Rodinia benchmarks to show that each benchmark represents unique behavior.
• We use the benchmarks to illustrate some important architectural differences between CPUs and GPUs.
II. MOTIVATION
The basic requirements of a benchmark suite for general
purpose computing include supporting diverse applications
with various computation patterns, employing state-of-the-art
algorithms, and providing input sets for testing different situ-
ations. Driven by the fast development of multicore/manycore
CPUs, power limits, and increasing popularity of various
accelerators (e.g., GPUs, FPGAs, and the STI Cell [16]),

achieving good application performance on future architectures
is expected to require taking advantage of multithreading, large
numbers of cores, and specialized hardware. Most of the
previous benchmark suites focused on providing serial and
parallel applications for conventional, general-purpose CPU
architectures rather than heterogeneous architectures contain-
ing accelerators.
A. General Purpose CPU Benchmarks
SPEC CPU [31] and EEMBC [6] are two widely used
benchmark suites for evaluating general purpose CPUs
and embedded processors, respectively. For instance, SPEC
CPU2006, dedicated to compute-intensive workloads, repre-
sents a snapshot of scientific and engineering applications.
But both suites are primarily serial in nature. OMP2001 from
SPEC and MultiBench 1.0 from EEMBC have been released
to partially address this problem. Neither, however, provides
implementations that can run on GPUs or other accelerators.
SPLASH-2 [34] is an early parallel application suite com-
posed of multithreaded applications from scientific and graph-
ics domains. However, the algorithms are no longer state-of-
the-art, data sets are too small, and some forms of paral-
lelization are not represented (e.g. software pipelining) [2].
Parsec [2] addresses some limitations of previous bench-
mark suites. It provides workloads in the RMS (Recognition,
Mining and Synthesis) [18] and system application domains
and represents a wider range of parallelization techniques.
Neither SPLASH nor Parsec, however, support GPUs or other
accelerators. Many Parsec applications are also optimized for
multicore processors assuming a modest number of cores,
making them difficult to port to manycore organizations such
as GPUs. We are exploring ports of Parsec applications to
GPUs (e.g., Stream Cluster), but find that those relying
on task pipelining do not port well unless each stage is also
heavily parallelizable.
B. Specialized and GPU Benchmark Suites
Other parallel benchmark suites include MineBench [28]
for data mining applications, MediaBench [20] and ALP-
Bench [17] for multimedia applications, and BioParallel [14]
for biomedical applications. The motivation for developing
these benchmark suites was to provide a suite of applications
which are representative of those application domains, but not
necessarily to provide a diverse range of behaviors. None of
these suites support GPUs or other accelerators.
The Parboil benchmark suite [33] is an effort to benchmark
GPUs, but its application set is narrower than Rodinia’s and
no diversity characterization has been published. Most of the
benchmarks only consist of single kernels.
C. Benchmarking Heterogeneous Systems
Prior to Rodinia, there was no well-designed bench-
mark suite specifically for research in heterogeneous com-
puting. In addition to ensuring diversity of the applications,
an essential feature of such a suite must be implementations
for both multicore CPUs and the accelerators (only GPUs, so
far). A diverse, multi-platform benchmark suite helps software,
middleware, and hardware researchers in a variety of ways:
• Accelerators offer significant performance and efficiency benefits compared to CPUs for many applications. A benchmark suite with implementations for both CPUs and GPUs allows researchers to compare the two architectures, identify the inherent architectural advantages and needs of each platform, and design accordingly.
• Fused CPU-GPU processors and other heterogeneous multiprocessor SoCs are likely to become common in PCs, servers and HPC environments. Architects need a set of diverse applications to help decide what hardware features should be included in the limited area budgets to best support common computation patterns shared by various applications.
• Implementations for both multicore-CPU and GPU can help compiler efforts to port existing CPU languages/APIs to the GPU by providing reference implementations.
• Diverse implementations for both multicore-CPU and GPU can help software developers by providing exemplars for different types of applications, assisting in the porting of new applications.
III. THE RODINIA BENCHMARK SUITE
Rodinia so far targets GPUs and multicore CPUs as a
starting point in developing a broader treatment of het-
erogeneous computing. Rodinia is maintained online at
http://lava.cs.virginia.edu/wiki/rodinia. In order to cover di-
verse behaviors, the Berkeley Dwarves [1] are used as guide-
lines for selecting benchmarks. Even though programs repre-
senting a particular dwarf may have varying characteristics,
they share strong underlying patterns [1]. The dwarves are
defined at a high level of abstraction to allow reasoning about
the program behaviors.
The Rodinia suite has the following features:
• The suite consists of four applications and five kernels. They have been parallelized with OpenMP for multicore CPUs and with the CUDA API for GPUs. The Similarity Score kernel is programmed using Mars’ MapReduce API framework [10]. We use various optimization techniques in the applications and take advantage of various on-chip compute resources.
• The workloads exhibit various types of parallelism, data-access patterns, and data-sharing characteristics. So far we have only implemented a subset of the dwarves, including Structured Grid, Unstructured Grid, Dynamic Programming, Dense Linear Algebra, MapReduce, and Graph Traversal. We plan to expand Rodinia in the future to cover the remaining dwarves. Previous work has shown the applicability of GPUs to applications from other dwarves such as Combinational Logic [4], Fast Fourier Transform (FFT) [23], N-Body [25], and Monte Carlo [24].
• The Rodinia applications cover a diverse range of application domains. In Table I we show the applications along with their corresponding dwarves and domains. Each application is representative of its respective domain. Users are given the flexibility to specify different input sizes for various uses.
• Even applications within the same dwarf show different features. For instance, the Structured Grid applications are at the core of scientific computing, but our choice of three of them is not random: SRAD represents a regular application in this domain; we use HotSpot to demonstrate the impact of inter-multiprocessor synchronization on application performance; and Leukocyte Tracking utilizes diversified parallelization and optimization techniques. We classify K-means and Stream Cluster as Dense Linear Algebra applications because their characteristics are closest to the description of this dwarf: each operates on strips of rows and columns. Although we believe that the dwarf taxonomy is fairly comprehensive, there are some important categories of applications that still need to be added (e.g., sorting).
Although the dwarves are a useful guiding principle, as
mentioned above, our work with different instances of the
same dwarf suggests that the dwarf taxonomy alone may
not be sufficient to ensure adequate diversity and that some
important behaviors may not be captured. This is an interesting
area for future research.
TABLE I
RODINIA APPLICATIONS AND KERNELS (* DENOTES KERNEL).

Application / Kernel   | Dwarf                | Domain
K-means                | Dense Linear Algebra | Data Mining
Needleman-Wunsch       | Dynamic Programming  | Bioinformatics
HotSpot*               | Structured Grid      | Physics Simulation
Back Propagation*      | Unstructured Grid    | Pattern Recognition
SRAD                   | Structured Grid      | Image Processing
Leukocyte Tracking     | Structured Grid      | Medical Imaging
Breadth-First Search*  | Graph Traversal      | Graph Algorithms
Stream Cluster*        | Dense Linear Algebra | Data Mining
Similarity Scores*     | MapReduce            | Web Mining
A. Workloads
Leukocyte Tracking (LC) detects and tracks rolling leuko-
cytes (white blood cells) in video microscopy of blood ves-
sels [3]. In the application, cells are detected in the first
video frame and then tracked through subsequent frames.
The major processes include computing for each pixel the
maximal Gradient Inverse Coefficient of Variation (GICOV)
score across a range of possible ellipses and computing, in
the area surrounding each cell, a Motion Gradient Vector Flow
(MGVF) matrix.
Speckle Reducing Anisotropic Diffusion (SRAD) is a
diffusion algorithm based on partial differential equations and
used for removing the speckles in an image without sacrificing
important image features. SRAD is widely used in ultrasonic
and radar imaging applications. The inputs to the program
are ultrasound images and the value of each point in the
computation domain depends on its four neighbors.
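This four-neighbor dependence can be illustrated with a minimal CUDA stencil; the sketch below is a simplified illustration rather than the Rodinia SRAD source, and the kernel name, the coefficient array c, and the update rule are placeholders.

```cuda
// Minimal 4-neighbor stencil sketch (not the Rodinia SRAD source).
// Each output point depends only on itself and its four neighbors,
// so every point in the frame can be updated in parallel.
__global__ void diffusion_step(const float *in, const float *c,
                               float *out, int rows, int cols)
{
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= cols || y >= rows) return;

    int i = y * cols + x;
    // Clamp indices at the image border so edge points reuse themselves.
    int n = (y > 0)        ? i - cols : i;
    int s = (y < rows - 1) ? i + cols : i;
    int w = (x > 0)        ? i - 1    : i;
    int e = (x < cols - 1) ? i + 1    : i;

    float lap = in[n] + in[s] + in[w] + in[e] - 4.0f * in[i];
    out[i] = in[i] + 0.25f * c[i] * lap;  // c: placeholder diffusion coefficient
}
```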
HotSpot (HS) is a thermal simulation tool [13] used for
estimating processor temperature based on an architectural
floor plan and simulated power measurements. Our benchmark
includes the 2D transient thermal simulation kernel of HotSpot,
which iteratively solves a series of differential equations for
block temperatures. The inputs to the program are power and
initial temperatures. Each output cell in the grid represents
the average temperature value of the corresponding area of
the chip.
Back Propagation (BP) is a machine-learning algorithm
that trains the weights of connecting nodes on a layered neural
network. The application is comprised of two phases: the
Forward Phase, in which the activations are propagated from
the input to the output layer, and the Backward Phase, in which
the error between the observed and requested values in the
output layer is propagated backwards to adjust the weights
and bias values. Our parallelized versions are based on a CMU
implementation [7].
Needleman-Wunsch (NW) is a global optimization method
for DNA sequence alignment. The potential pairs of sequences
are organized in a 2-D matrix. The algorithm fills the matrix
with scores, which represent the value of the maximum
weighted path ending at that cell. A trace-back process is used
to search the optimal alignment. A parallel Needleman-Wunsch
algorithm processes the score matrix in diagonal strips from
top-left to bottom-right.
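The recurrence behind this strip-wise parallelism is the standard Needleman-Wunsch update (textbook form, not copied from the Rodinia source): cell (i, j) depends only on its north, west, and north-west neighbors, so all cells on one anti-diagonal are independent.

```cuda
// Standard Needleman-Wunsch cell update (illustrative sketch).
// sub is the substitution score for the two symbols at (i, j);
// gap is the (typically negative) gap penalty.
__host__ __device__ inline int nw_cell(const int *score, int cols,
                                       int i, int j, int sub, int gap)
{
    int diag = score[(i - 1) * cols + (j - 1)] + sub; // align both symbols
    int up   = score[(i - 1) * cols + j] + gap;       // gap in one sequence
    int left = score[i * cols + (j - 1)] + gap;       // gap in the other
    int best = diag > up ? diag : up;
    return best > left ? best : left;
}
```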
K-means (KM) is a clustering algorithm used extensively
in data mining. This identifies related points by associating
each data point with its nearest cluster, computing new cluster
centroids, and iterating until convergence. Our OpenMP im-
plementation is based on the Northwestern MineBench [28]
implementation.
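The data-parallel core of K-means is the assignment step, sketched below with one thread per point; this is an illustrative kernel, not the MineBench-derived Rodinia code, and all names are placeholders.

```cuda
// Sketch of the K-means assignment step: one thread per data point.
__global__ void assign_clusters(const float *points, const float *centroids,
                                int *membership, int n, int k, int dim)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;

    int best = 0;
    float best_dist = 1e30f;                   // effectively +infinity here
    for (int c = 0; c < k; c++) {
        float d = 0.0f;
        for (int j = 0; j < dim; j++) {
            float diff = points[p * dim + j] - centroids[c * dim + j];
            d += diff * diff;                  // squared Euclidean distance
        }
        if (d < best_dist) { best_dist = d; best = c; }
    }
    membership[p] = best;                      // nearest centroid wins
}
```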
Stream Cluster (SC) solves the online clustering problem.
For a stream of input points, it finds a pre-determined number
of medians so that each point is assigned to its nearest
center [2]. The quality of the clustering is measured by the
sum of squared distances (SSQ) metric. The original code
is from the Parsec Benchmark suite developed by Princeton
University [2]. We ported the Parsec implementation to CUDA
and OpenMP.
Breadth-First Search (BFS) traverses all the connected
components in a graph. Large graphs involving millions of
vertices are common in scientific and engineering applications.
The CUDA version of BFS was contributed by IIIT [9].
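A level-synchronous formulation in the spirit of that implementation assigns one thread per vertex and relaunches the kernel once per frontier; the sketch below assumes a CSR graph (row_ofs, cols) and is not the Rodinia source.

```cuda
// One level of a level-synchronous BFS (illustrative sketch, CSR graph).
// The host relaunches the kernel, level by level, until next_frontier
// stays empty. dist[] is initialized to -1 (unvisited).
__global__ void bfs_level(const int *row_ofs, const int *cols,
                          int *dist, const bool *frontier,
                          bool *next_frontier, int level, int n_vertices)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_vertices || !frontier[v]) return;

    for (int e = row_ofs[v]; e < row_ofs[v + 1]; e++) {
        int u = cols[e];
        if (dist[u] == -1) {          // unvisited neighbor
            dist[u] = level + 1;      // benign race: all writers store level+1
            next_frontier[u] = true;
        }
    }
}
```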
Similarity Score (SS) is used in web document clustering
to compute the pairwise similarity between web
documents. The source code is from the Mars project [10] at
The Hong Kong University of Science and Technology. Mars
hides the programming complexity of the GPU behind the
simple and familiar MapReduce interface.
B. NVIDIA CUDA
For GPU implementations, the Rodinia suite uses
CUDA [22], an extension to C for GPUs. CUDA represents
the GPU as a co-processor that can run a large number of
threads. The threads are managed by representing parallel
tasks as kernels mapped over a domain. Kernels are scalar
and represent the work to be done by a single thread. A
kernel is invoked as a thread at every point in the domain.
Thread creation is managed in hardware, allowing fast thread
creation. The parallel threads share memory and synchronize
using barriers.
An important feature of CUDA is that the threads are time-
sliced in SIMD groups of 32 called warps. Each warp of 32
threads operates in lockstep. Divergent threads are handled
using hardware masking until they reconverge. Different warps
in a thread block need not operate in lockstep, but if threads
within a warp follow divergent paths, only threads on the same
path can be executed simultaneously. In the worst case, all 32
threads in a warp following different paths would result in
sequential execution of the threads across the warp.
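The following toy kernel shows how such divergence arises; it is illustrative only. Branching on the lane index splits every warp, whereas branching on a warp-aligned index would not.

```cuda
// Toy example of intra-warp divergence: even and odd lanes of the same
// 32-thread warp take different paths, so the hardware masks and
// serializes the two paths.
__global__ void divergent(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[tid] = data[tid] * 2.0f;   // path A: even lanes
    else
        data[tid] = data[tid] + 1.0f;   // path B: odd lanes, serialized vs. A
    // Branching on (threadIdx.x / 32) instead would keep whole warps on
    // one path and avoid the serialization.
}
```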
CUDA is currently supported only on NVIDIA GPUs, but
recent work has shown that CUDA programs can be compiled
to execute efficiently on multi-core CPUs [32].
The NVIDIA GTX 280 GPU used in this study has 30
streaming multiprocessors (SMs). Each SM has 8 streaming
processors (SPs) for a total of 240 SPs. Each group of 8 SPs
shares 16 kB of fast per-block shared memory (similar to
scratchpad memory). Each group of three SMs (i.e., 24 SPs)
shares a texture unit. An SP contains a scalar floating point
ALU that can also perform integer operations. Instructions
are executed in a SIMD fashion across all SPs in a given
multiprocessor. The GTX 280 has 1 GB of device memory.
C. CUDA vs. OpenMP Implementations
One challenge of designing the Rodinia suite is that there
is no single language for programming the platforms we
target, which forced us to choose two different languages
at the current stage. More general languages or APIs that
seek to provide a universal programming standard, such as
OpenCL [26], may address this problem. However, since
OpenCL tools were not available at the time of this writing,
this is left for future work.
Our decision to choose CUDA and OpenMP actually pro-
vides a real benefit. Because they lie at the extremes of data-
parallel programming models (fine-grained vs. coarse-grained,
explicit vs. implicit), comparing the two implementations of a
program provides insight into pros and cons of different ways
of specifying and optimizing parallelism and data manage-
ment.
Even though CUDA programmers must specify the tasks
of threads and thread blocks in a more fine-grained way
than in OpenMP, the basic parallel decompositions in most
CUDA and OpenMP applications are not fundamentally dif-
ferent. Aside from dealing with other offloading issues, in
a straightforward data-parallel application programmers can
relatively easily convert the OpenMP loop body into a CUDA
kernel body by replacing the for-loop indices with thread
indices over an appropriate domain (e.g., in Breadth-First
Search). Reductions, however, must be implemented manually
in CUDA (although CUDA libraries [30] make the reduction
easier), while in OpenMP this is handled by the compiler (e.g.,
in Back Propagation and SRAD).
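The conversion can be sketched on a generic element-wise loop (not taken from any particular Rodinia benchmark): the OpenMP loop index simply becomes a thread index over the same domain.

```cuda
// OpenMP version: the compiler distributes loop iterations across cores.
void scale_omp(float *a, const float *b, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}

// CUDA version: one thread per iteration, with the grid covering the domain.
__global__ void scale_kernel(float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // guard: the grid may overshoot n
        a[i] = 2.0f * b[i];
}
// Launch: scale_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
```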
Further optimizations, however, expose significant architec-
tural differences. Examples include taking advantage of data-
locality using specialized memories in CUDA, as opposed
to relying on large caches on the CPU, and reducing SIMD
divergence (as discussed in Section VI-B).
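One pattern that illustrates both points at once is the manual reduction: where OpenMP needs only a reduction(+:sum) clause, a CUDA kernel typically stages partial sums through the per-block shared memory, as in the sketch below (a common textbook pattern, not the code of any specific Rodinia benchmark).

```cuda
// Per-block tree reduction through shared memory (illustrative sketch;
// assumes blockDim.x == 256 and a power-of-two block size).
__global__ void block_sum(const float *in, float *block_out, int n)
{
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)               // one partial sum per block
        block_out[blockIdx.x] = buf[0];
}
```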
IV. METHODOLOGY AND EXPERIMENT SETUP
In this section, we explain the dimensions along which we
characterize the Rodinia benchmarks:
Diversity Analysis Characterization of diversity of the
benchmarks is necessary to identify whether the suite provides
sufficient coverage.
Parallelization and Speedup The Rodinia applications are
parallelized in various ways and a variety of optimizations
have been applied to obtain satisfactory performance. We
examine how well each application maps to the two target
platforms.
Computation vs. Communication Many accelerators such
as GPUs use a co-processor model in which computationally-
intensive portions of an application are offloaded to the ac-
celerator by the host processor. The communication overhead
between GPUs and CPUs often becomes a major performance
consideration.
Synchronization Synchronization overhead can be a barrier
to achieving good performance for applications utilizing fine-
grained synchronization. We analyze synchronization primi-
tives and strategies and their impact on application perfor-
mance.
Power Consumption An advantage of accelerator-based
computing is its potential to achieve better power-efficiency
than CPU-based computing. We show the diversity of the
Rodinia benchmarks in terms of power consumption.
All of our measurement results are obtained by running
the applications on real hardware. The benchmarks have been
evaluated on an NVIDIA GeForce GTX 280 GPU with 1.3
GHz shader clock and a 3.2 GHz Quad-core Intel Core 2
Extreme CPU. The system contains an NVIDIA nForce 790i-
based motherboard and the GPU is connected using PCIe 2.0.
We use NVIDIA driver version 177.11 and CUDA version 2.2,
except for the Similarity Score application, whose Mars [10]
infrastructure only supports CUDA versions up to 1.1.
V. DIVERSITY ANALYSIS
We use the Microarchitecture-Independent Workload Char-
acterization (MICA) framework developed by Hoste and Eeck-
hout [11] to evaluate the application diversity of the Rodinia
benchmark suite. MICA provides a Pin [19] toolkit to collect
metrics such as instruction mix, instruction-level parallelism,
register traffic, working set, data-stream size and branch-
predictability. Each metric also includes several sub-metrics
with a total of 47 program characteristics. The MICA method-
ology uses a Genetic Algorithm to minimize the number of
inherent program characteristics that need to be measured
by exploiting correlation between characteristics. It reduces
the 47-dimensional application characteristic space to an 8-
dimensional space without compromising the methodology’s
ability to compare benchmarks [11].

Fig. 1. Kiviat diagrams representing the eight microarchitecture-independent
characteristics of each benchmark.
The metrics used in MICA are microarchitecture indepen-
dent but not independent of the instruction set architecture
(ISA) and the compiler. Despite this limitation, Hoste and
Eeckhout [12] show that these metrics can provide a fairly
accurate characterization, even across different platforms.
We measure the single-core, CPU version of the applications
from the Rodinia benchmark suite with the MICA tool as
described by Hoste and Eeckhout [11], except that we calculate
the percentage of all arithmetic operations instead of the
percentage of only multiply operations. Our rationale for
performing the analysis using the single-threaded CPU version
of each benchmark is that the underlying set of computations
to be performed is the same as in the parallelized or GPU
version; verifying this assumption is left for future work. We use
Kiviat plots to visualize each benchmark’s inherent behavior,
with each axis representing one of the eight microarchitecture-
independent characteristics. The data was normalized to have
a zero mean and a unit standard deviation. Figure 1 shows
the Kiviat plots for the Rodinia programs, demonstrating that
each application exhibits diverse behavior.
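The normalization is an ordinary per-metric z-score, summarized by the host-side sketch below (illustrative; not part of the MICA tooling).

```cuda
// Per-metric z-score: rescale one metric's values (one per benchmark)
// to zero mean and unit standard deviation before plotting.
#include <cmath>
#include <vector>

void zscore(std::vector<float>& metric)
{
    float mean = 0.0f, var = 0.0f;
    for (float v : metric) mean += v;
    mean /= metric.size();
    for (float v : metric) var += (v - mean) * (v - mean);
    float stddev = std::sqrt(var / metric.size());  // assumes non-constant data
    for (float& v : metric) v = (v - mean) / stddev;
}
```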
Fig. 2. The speedup of the GPU implementations over the equivalent single-
and four-threaded CPU implementations. The execution time for calculating
the speedup is measured on the CPU and GPU for the core part of the
computation, excluding the I/O and initial setup. Figure 4 gives a detailed
breakdown of each CUDA implementation’s runtime.
VI. PARALLELIZATION AND OPTIMIZATION
A. Performance
Figure 2 shows the speedup of each benchmark’s CUDA
implementation running on a GPU relative to OpenMP im-
plementations running on a multicore CPU. The speedups
range from 5.5 to 80.8 over the single-threaded CPU im-
plementations and from 1.6 to 26.3 over the four-threaded
CPU implementations. Although we have not spent equal
effort optimizing all Rodinia applications, we believe that
the majority of the performance diversity results from the
diverse application characteristics inherent in the bench-
marks. SRAD, HotSpot, and Leukocyte are relatively compute-
intensive, while Needleman-Wunsch, Breadth-First Search, K-
means, and Stream Cluster are limited by the GPU’s off-
chip memory bandwidth. The application performance is also
determined by overheads involved in offloading (e.g., CPU-
GPU memory transfer overhead and kernel call overhead),
which we discuss further in the following sections.
The performance of the CPU implementations also depends
on the compiler’s ability to generate efficient code to better
utilize the CPU hardware (e.g. SSE units). We compared the
performance of some Rodinia benchmarks when compiled
with gcc 4.2.4, the compiler used in this study, and icc 10.1.
The SSE capabilities of icc were enabled by default in our 64-
bit environment. For the single-threaded CPU implementation,
for instance, Needleman-Wunsch compiled with icc is 3%
faster than when compiled with gcc, and SRAD compiled with
icc is 23% slower than when compiled with gcc. For the four-
threaded CPU implementations, Needleman-Wunsch compiled
with icc is 124% faster than when compiled with gcc, and
SRAD compiled with icc is 20% slower than when compiled
with gcc. Given such performance differences due to using
different compilers, for a fair comparison with the GPU, it
would be desirable to hand-code the critical loops of some
CPU implementations in assembly with SSE instructions.
However, this would require low-level programming that is
significantly more complex than CUDA programming, which
is beyond the scope of this paper.

Citations
More filters
01 Jan 2012
TL;DR: By including versions of varying levels of optimization of the same fundamental algorithm, the Parboil benchmarks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware.
Abstract: The Parboil benchmarks are a set of throughput computing applications useful for studying the performance of throughput computing architecture and compilers. The name comes from the culinary term for a partial cooking process, which represents our belief that useful throughput computing benchmarks must be “cooked”, or preselected to implement a scalable algorithm with fine-grained parallel tasks. But useful benchmarks for this field cannot be “fully cooked”, because the architectures and programming models and supporting tools are evolving rapidly enough that static benchmark codes will lose relevance very quickly. We have collected benchmarks from throughput computing application researchers in many different scientific and commercial fields including image processing, biomolecular simulation, fluid dynamics, and astronomy. Each benchmark includes several implementations. Some implementations we provide as readable base implementations from which new optimization efforts can begin, and others as examples of the current state-of-the-art targeting specific CPU and GPU architectures. As we continue to optimize these benchmarks for new and existing architectures ourselves, we will also gladly accept new implementations and benchmark contributions from developers to recognize those at the frontier of performance optimization on each architecture. Finally, by including versions of varying levels of optimization of the same fundamental algorithm, the benchmarks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware. Less optimized versions are presented as challenges to the compiler and architecture research communities: to develop the technology that automatically raises the performance of simpler implementations to the performance level of sophisticated programmer-optimized implementations, or demonstrate any other performance or programmability improvements. We hope that these benchmarks will facilitate effective demonstrations of such technology.

695 citations


Cites methods from "Rodinia: A benchmark suite for hete..."

  • ...The Rodinia benchmarks published by the University of Virginia [2] are very similar in philosophy and development to the Parboil benchmarks....

    [...]

Proceedings ArticleDOI
14 Mar 2010
TL;DR: The Scalable HeterOgeneous Computing benchmark suite (SHOC) is a spectrum of programs that test the performance and stability of scalable heterogeneous computing systems and includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
Abstract: Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.

620 citations

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work proposes a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements, and accurately tracks the power consumption trend over time.
Abstract: General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. More finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.

558 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity.
Abstract: Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.

541 citations


Cites background from "Rodinia: A benchmark suite for hete..."

  • ...It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent....

    [...]

Proceedings ArticleDOI
03 Dec 2011
TL;DR: This work proposes two independent ideas: the large warp microarchitecture and two-level warp scheduling that improve performance by 19.1% over traditional GPU cores for a wide variety of general purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.
Abstract: Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways. First, threads are grouped into fixed-size SIMD batches known as warps, and second, many such warps are concurrently executed on a single GPU core. Despite these techniques, the computational resources on GPU cores are still underutilized, resulting in performance far short of what could be delivered. Two reasons for this are conditional branch instructions and stalls due to long latency operations. To improve GPU performance, computational resources must be more effectively utilized. To accomplish this, we propose two independent ideas: the large warp microarchitecture and two-level warp scheduling. We show that when combined, our mechanisms improve performance by 19.1% over traditional GPU cores for a wide variety of general purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.

441 citations


Cites methods from "Rodinia: A benchmark suite for hete..."

  • ...We created parallel applications adapted from existing benchmark suites including Rodinia [5], MineBench [18], and NVIDIA’s CUDA SDK [19] in addition to creating one of our own (blackjack)....

    [...]

  • ...[5] S. Che et al. Rodinia: A benchmark suite for heterogeneous computing....

    [...]

References
More filters
Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation, and to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.

4,019 citations


"Rodinia: A benchmark suite for hete..." refers methods in this paper

  • ...While developing and characterizing these benchmarks, we have experienced first-hand the following challenges of the GPU platform: Data Structure Mapping: Programmers must find efficient mappings of their applications’ data structures to CUDA’s hierarchical (grid of thread blocks) domain model....

    [...]

Proceedings ArticleDOI
01 May 1995
TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

4,002 citations


"Rodinia: A benchmark suite for hete..." refers background in this paper

  • ...A diverse, multi-platform benchmark suite helps software, middleware, and hardware researchers in a variety of ways: • Accelerators offer significant performance and efficiency benefits compared to CPUs for many applications....

    [...]

Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations


"Rodinia: A benchmark suite for hete..." refers background or methods in this paper

  • ...Needleman-Wunsch uses 16 threads per block as discussed earlier, and Leukocyte uses different thread block sizes (128 and 256) for its two kernels because it operates on different working sets in the detection and tracking phases....

    [...]

  • ...• Fused CPU-GPU processors and other heterogeneous multiprocessor SoCs are likely to become common in PCs, servers and HPC environments....

    [...]

18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met for nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. • To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

2,262 citations


"Rodinia: A benchmark suite for hete..." refers background in this paper

  • ...Our decision to choose CUDA and OpenMP actually provides a real benefit....

    [...]

  • ...Each application or kernel is carefully chosen to represent different types of behavior according to the Berkeley dwarves [1]....

    [...]

Proceedings ArticleDOI
11 Aug 2008
TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

2,216 citations


"Rodinia: A benchmark suite for hete..." refers methods in this paper

  • ...For GPU implementations, the Rodinia suite uses CUDA [22], an extension to C for GPUs....

    [...]

Frequently Asked Questions (15)
Q1. What contributions have the authors mentioned in the paper "Rodinia: a benchmark suite for heterogeneous computing"?

This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. 

Directions for future work include: • Adding new applications to cover further dwarves, such as sparse matrix, sorting, etc. The authors plan to provide different download versions of applications for steps where they add major incremental optimizations. The authors plan to extend the Rodinia benchmarks to support more platforms, such as FPGAs, STI Cell, etc. The authors plan to extend their diversity analysis by using the clustering analysis performed by Joshi et al. [15], which requires a principal components analysis (PCA) that they have left to future work.

SPEC CPU [31] and EEMBC [6] are two widely used benchmark suites for evaluating general purpose CPUs and embedded processors, respectively. 

The most important optimizations are to reduce CPU-GPU communication and to maximize locality of memory accesses within each warp (ideally allowing a single, coalesced memory transaction to fulfill an entire warp’s loads).

For the single-threaded CPU implementation, for instance, Needleman-Wunsch compiled with icc is 3% faster than when compiled with gcc, and SRAD compiled with icc is 23% slower than when compiled with gcc. 

The basic requirements of a benchmark suite for general purpose computing include supporting diverse applications with various computation patterns, employing state-of-the-art algorithms, and providing input sets for testing different situations. 

The limit on registers and shared memory available per SM can constrain the number of active threads, sometimes exposing memory latency [29]. 

The authors plan to extend their diversity analysis by using the clustering analysis performed by Joshi et al. [15], which requires a principal components analysis (PCA) that the authors have left to future work.

Needleman-Wunsch exhibits an L2 miss rate of 41.2% due to its unconventional memory access patterns (diagonal strips) which are poorly handled by prefetching. 

Applications such as SRAD and Leukocyte exhibit relatively low overhead because the majority of their computations are independent. 

A diverse, multi-platform benchmark suite helps software, middleware, and hardware researchers in a variety of ways: • Accelerators offer significant performance and efficiency benefits compared to CPUs for many applications.

The authors also choose different numbers of threads per thread block for different applications; generally block sizes are chosen to maximize thread occupancy, although in some cases smaller thread blocks and reduced occupancy provide improved performance.

Other constraints include the fact that threads cannot fork new threads, that the architecture presents a 32-wide SIMD organization, and that only one kernel can run at a time.

The authors use Kiviat plots to visualize each benchmark’s inherent behavior, with each axis representing one of the eight microarchitecture-independent characteristics.

The compiler was able to automatically parallelize two of the Rodinia applications, HotSpot and SRAD, after the authors made minimal modifications to the code.