scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

04 Oct 2009-pp 44-54
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.



Rodinia: A Benchmark Suite for Heterogeneous Computing
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee and Kevin Skadron
{sc5nf, mwb7w, jm6dg, dt2f, jws9c, sl4ge, ks7h}@virginia.edu
Department of Computer Science, University of Virginia
Abstract—This paper presents and characterizes Rodinia, a
benchmark suite for heterogeneous computing. To help architects
study emerging platforms such as GPUs (Graphics Processing
Units), Rodinia includes applications and kernels which target
multi-core CPU and GPU platforms. The choice of applications
is inspired by Berkeley’s dwarf taxonomy. Our characterization
shows that the Rodinia benchmarks cover a wide range of
parallel communication patterns, synchronization techniques and
power consumption, and has led to some important architectural
insight, such as the growing importance of memory-bandwidth
limitations and the consequent importance of data layout.
I. INTRODUCTION
With the microprocessor industry’s shift to multicore archi-
tectures, research in parallel computing is essential to ensure
future progress in mainstream computer systems. This in turn
requires standard benchmark programs that researchers can use
to compare platforms, identify performance bottlenecks, and
evaluate potential solutions. Several current benchmark suites
provide parallel programs, but only for conventional, general-
purpose CPU architectures.
However, various accelerators, such as GPUs and FPGAs,
are increasingly popular because they are becoming easier
to program and offer dramatically better performance for
many applications. These accelerators differ significantly from
CPUs in architecture, middleware and programming models.
GPUs also offer parallelism at scales not currently available
with other microprocessors. Existing benchmark suites neither
support these accelerators’ APIs nor represent the kinds of ap-
plications and parallelism that are likely to drive development
of such accelerators. Understanding accelerators’ architectural
strengths and weaknesses is important for computer systems
researchers as well as for programmers, who will gain insight
into the most effective data structures and algorithms for each
platform. Hardware and compiler innovation for accelerators
and for heterogeneous system design may be just as com-
mercially and socially beneficial as for conventional CPUs.
Inhibiting such innovation, however, is the lack of a benchmark
suite providing a diverse set of applications for heterogeneous
systems.
In this paper, we extend and characterize the Rodinia
benchmark suite [4], a set of applications developed to address
these concerns. These applications have been implemented for
both GPUs and multicore CPUs using CUDA and OpenMP.
The suite is structured to span a range of parallelism and data-
sharing characteristics. Each application or kernel is carefully
chosen to represent different types of behavior according
to the Berkeley dwarves [1]. The suite now covers diverse
dwarves and application domains and currently includes nine
applications or kernels. We characterize the suite to ensure
that it covers a diverse range of behaviors and to illustrate
interesting differences between CPUs and GPUs.
In our CPU vs. GPU comparisons using Rodinia, we
have also discovered that the major architectural differences
between CPUs and GPUs have important implications for
software. For instance, the GPU offers a very low ratio of on-
chip storage to number of threads, but also offers specialized
memory spaces that can mitigate these costs: the per-block
shared memory (PBSM), constant, and texture memories. Each
is suited to different data-use patterns. The GPU’s lack of
persistent state in the PBSM results in less efficient commu-
nication among producer and consumer kernels. GPUs do not
easily allow runtime load balancing of work among threads
within a kernel, and thread resources can be wasted as a
result. Finally, discrete GPUs have high kernel-call and data-
transfer costs. Although we used some optimization techniques
to alleviate these issues, they remain a bottleneck for some
applications.
The benchmarks have been evaluated on an NVIDIA
GeForce GTX 280 GPU with a 1.3 GHz shader clock and a 3.2
GHz Quad-core Intel Core 2 Extreme CPU. The applications
exhibit diverse behavior, with speedups ranging from 5.5 to
80.8 over single-threaded CPU programs and from 1.6 to
26.3 over four-threaded CPU programs, varying CPU-GPU
communication overheads (2%-76%, excluding I/O and initial
setup), and varying GPU power consumption overheads (38W-
83W).
The contributions of this paper are as follows:
• We illustrate the need for a new benchmark suite for heterogeneous computing, with GPUs and multicore CPUs used as a case study.
• We characterize the diversity of the Rodinia benchmarks to show that each benchmark represents unique behavior.
• We use the benchmarks to illustrate some important architectural differences between CPUs and GPUs.
II. MOTIVATION
The basic requirements of a benchmark suite for general
purpose computing include supporting diverse applications
with various computation patterns, employing state-of-the-art
algorithms, and providing input sets for testing different situ-
ations. Driven by the fast development of multicore/manycore
CPUs, power limits, and increasing popularity of various
accelerators (e.g., GPUs, FPGAs, and the STI Cell [16]),

achieving good application performance on future architectures
is expected to require taking advantage of multithreading, large
numbers of cores, and specialized hardware. Most of the
previous benchmark suites focused on providing serial and
parallel applications for conventional, general-purpose CPU
architectures rather than heterogeneous architectures contain-
ing accelerators.
A. General Purpose CPU Benchmarks
SPEC CPU [31] and EEMBC [6] are two widely used
benchmark suites for evaluating general purpose CPUs
and embedded processors, respectively. For instance, SPEC
CPU2006, dedicated to compute-intensive workloads, repre-
sents a snapshot of scientific and engineering applications.
But both suites are primarily serial in nature. OMP2001 from
SPEC and MultiBench 1.0 from EEMBC have been released
to partially address this problem. Neither, however, provides
implementations that can run on GPUs or other accelerators.
SPLASH-2 [34] is an early parallel application suite com-
posed of multithreaded applications from scientific and graph-
ics domains. However, the algorithms are no longer state-of-
the-art, data sets are too small, and some forms of paral-
lelization are not represented (e.g. software pipelining) [2].
Parsec [2] addresses some limitations of previous bench-
mark suites. It provides workloads in the RMS (Recognition,
Mining and Synthesis) [18] and system application domains
and represents a wider range of parallelization techniques.
Neither SPLASH nor Parsec, however, support GPUs or other
accelerators. Many Parsec applications are also optimized for
multicore processors assuming a modest number of cores,
making them difficult to port to manycore organizations such
as GPUs. We are exploring ports of Parsec applications to
GPUs (e.g., Stream Cluster), but find that those relying
on task pipelining do not port well unless each stage is also
heavily parallelizable.
B. Specialized and GPU Benchmark Suites
Other parallel benchmark suites include MineBench [28]
for data mining applications, MediaBench [20] and ALP-
Bench [17] for multimedia applications, and BioParallel [14]
for biomedical applications. The motivation for developing
these benchmark suites was to provide a suite of applications
which are representative of those application domains, but not
necessarily to provide a diverse range of behaviors. None of
these suites support GPUs or other accelerators.
The Parboil benchmark suite [33] is an effort to benchmark
GPUs, but its application set is narrower than Rodinia’s and
no diversity characterization has been published. Most of the
benchmarks only consist of single kernels.
C. Benchmarking Heterogeneous Systems
Prior to Rodinia, there was no well-designed bench-
mark suite specifically for research in heterogeneous com-
puting. In addition to ensuring diversity of the applications,
an essential feature of such a suite must be implementations
for both multicore CPUs and the accelerators (only GPUs, so
far). A diverse, multi-platform benchmark suite helps software,
middleware, and hardware researchers in a variety of ways:
• Accelerators offer significant performance and efficiency benefits compared to CPUs for many applications. A benchmark suite with implementations for both CPUs and GPUs allows researchers to compare the two architectures, identify the inherent architectural advantages and needs of each platform, and design accordingly.
• Fused CPU-GPU processors and other heterogeneous multiprocessor SoCs are likely to become common in PCs, servers and HPC environments. Architects need a set of diverse applications to help decide what hardware features should be included in the limited area budgets to best support common computation patterns shared by various applications.
• Implementations for both multicore-CPU and GPU can help compiler efforts to port existing CPU languages/APIs to the GPU by providing reference implementations.
• Diverse implementations for both multicore-CPU and GPU can help software developers by providing exemplars for different types of applications, assisting in the porting of new applications.
III. THE RODINIA BENCHMARK SUITE
Rodinia so far targets GPUs and multicore CPUs as a
starting point in developing a broader treatment of het-
erogeneous computing. Rodinia is maintained online at
http://lava.cs.virginia.edu/wiki/rodinia. In order to cover di-
verse behaviors, the Berkeley Dwarves [1] are used as guide-
lines for selecting benchmarks. Even though programs repre-
senting a particular dwarf may have varying characteristics,
they share strong underlying patterns [1]. The dwarves are
defined at a high level of abstraction to allow reasoning about
the program behaviors.
The Rodinia suite has the following features:
• The suite consists of four applications and five kernels. They have been parallelized with OpenMP for multicore CPUs and with the CUDA API for GPUs. The Similarity Score kernel is programmed using Mars’ MapReduce API framework [10]. We use various optimization techniques in the applications and take advantage of various on-chip compute resources.
• The workloads exhibit various types of parallelism, data-access patterns, and data-sharing characteristics. So far we have only implemented a subset of the dwarves, including Structured Grid, Unstructured Grid, Dynamic Programming, Dense Linear Algebra, MapReduce, and Graph Traversal. We plan to expand Rodinia in the future to cover the remaining dwarves. Previous work has shown the applicability of GPUs to applications from other dwarves such as Combinational Logic [4], Fast Fourier Transform (FFT) [23], N-Body [25], and Monte Carlo [24].
• The Rodinia applications cover a diverse range of application domains. In Table I we show the applications along with their corresponding dwarves and domains. Each application is representative of its respective domain. Users are given the flexibility to specify different input sizes for various uses.
• Even applications within the same dwarf show different features. For instance, the Structured Grid applications are at the core of scientific computing, but our choice of three of them is not random: SRAD represents a regular application in this domain; we use HotSpot to demonstrate the impact of inter-multiprocessor synchronization on application performance; and Leukocyte Tracking utilizes diversified parallelization and optimization techniques. We classify K-means and Stream Cluster as Dense Linear Algebra applications because their characteristics are closest to the description of this dwarf: each operates on strips of rows and columns. Although we believe that the dwarf taxonomy is fairly comprehensive, there are some important categories of applications that still need to be added (e.g., sorting).
Although the dwarves are a useful guiding principle, as
mentioned above, our work with different instances of the
same dwarf suggests that the dwarf taxonomy alone may
not be sufficient to ensure adequate diversity and that some
important behaviors may not be captured. This is an interesting
area for future research.
TABLE I
RODINIA APPLICATIONS AND KERNELS (* DENOTES KERNEL).

Application / Kernel   | Dwarf                | Domain
K-means                | Dense Linear Algebra | Data Mining
Needleman-Wunsch       | Dynamic Programming  | Bioinformatics
HotSpot*               | Structured Grid      | Physics Simulation
Back Propagation*      | Unstructured Grid    | Pattern Recognition
SRAD                   | Structured Grid      | Image Processing
Leukocyte Tracking     | Structured Grid      | Medical Imaging
Breadth-First Search*  | Graph Traversal      | Graph Algorithms
Stream Cluster*        | Dense Linear Algebra | Data Mining
Similarity Scores*     | MapReduce            | Web Mining
A. Workloads
Leukocyte Tracking (LC) detects and tracks rolling leuko-
cytes (white blood cells) in video microscopy of blood ves-
sels [3]. In the application, cells are detected in the first
video frame and then tracked through subsequent frames.
The major processes include computing for each pixel the
maximal Gradient Inverse Coefficient of Variation (GICOV)
score across a range of possible ellipses and computing, in
the area surrounding each cell, a Motion Gradient Vector Flow
(MGVF) matrix.
Speckle Reducing Anisotropic Diffusion (SRAD) is a
diffusion algorithm based on partial differential equations and
used for removing the speckles in an image without sacrificing
important image features. SRAD is widely used in ultrasonic
and radar imaging applications. The inputs to the program
are ultrasound images and the value of each point in the
computation domain depends on its four neighbors.
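This four-neighbor dependence can be illustrated with a minimal CUDA stencil; the sketch below is a simplified illustration rather than the Rodinia SRAD source, and the kernel name, the coefficient array c, and the update rule are placeholders.

```cuda
// Minimal 4-neighbor stencil sketch (not the Rodinia SRAD source).
// Each output point depends only on itself and its four neighbors,
// so every point in the frame can be updated in parallel.
__global__ void diffusion_step(const float *in, const float *c,
                               float *out, int rows, int cols)
{
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= cols || y >= rows) return;

    int i = y * cols + x;
    // Clamp indices at the image border so edge points reuse themselves.
    int n = (y > 0)        ? i - cols : i;
    int s = (y < rows - 1) ? i + cols : i;
    int w = (x > 0)        ? i - 1    : i;
    int e = (x < cols - 1) ? i + 1    : i;

    float lap = in[n] + in[s] + in[w] + in[e] - 4.0f * in[i];
    out[i] = in[i] + 0.25f * c[i] * lap;  // c: placeholder diffusion coefficient
}
```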
HotSpot (HS) is a thermal simulation tool [13] used for
estimating processor temperature based on an architectural
floor plan and simulated power measurements. Our benchmark
includes the 2D transient thermal simulation kernel of HotSpot,
which iteratively solves a series of differential equations for
block temperatures. The inputs to the program are power and
initial temperatures. Each output cell in the grid represents
the average temperature value of the corresponding area of
the chip.
Back Propagation (BP) is a machine-learning algorithm
that trains the weights of connecting nodes on a layered neural
network. The application is comprised of two phases: the
Forward Phase, in which the activations are propagated from
the input to the output layer, and the Backward Phase, in which
the error between the observed and requested values in the
output layer is propagated backwards to adjust the weights
and bias values. Our parallelized versions are based on a CMU
implementation [7].
Needleman-Wunsch (NW) is a global optimization method
for DNA sequence alignment. The potential pairs of sequences
are organized in a 2-D matrix. The algorithm fills the matrix
with scores, which represent the value of the maximum
weighted path ending at that cell. A trace-back process is used
to search the optimal alignment. A parallel Needleman-Wunsch
algorithm processes the score matrix in diagonal strips from
top-left to bottom-right.
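The recurrence behind this strip-wise parallelism is the standard Needleman-Wunsch update (textbook form, not copied from the Rodinia source): cell (i, j) depends only on its north, west, and north-west neighbors, so all cells on one anti-diagonal are independent.

```cuda
// Standard Needleman-Wunsch cell update (illustrative sketch).
// sub is the substitution score for the two symbols at (i, j);
// gap is the (typically negative) gap penalty.
__host__ __device__ inline int nw_cell(const int *score, int cols,
                                       int i, int j, int sub, int gap)
{
    int diag = score[(i - 1) * cols + (j - 1)] + sub; // align both symbols
    int up   = score[(i - 1) * cols + j] + gap;       // gap in one sequence
    int left = score[i * cols + (j - 1)] + gap;       // gap in the other
    int best = diag > up ? diag : up;
    return best > left ? best : left;
}
```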
K-means (KM) is a clustering algorithm used extensively
in data mining. This identifies related points by associating
each data point with its nearest cluster, computing new cluster
centroids, and iterating until convergence. Our OpenMP im-
plementation is based on the Northwestern MineBench [28]
implementation.
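The data-parallel core of K-means is the assignment step, sketched below with one thread per point; this is an illustrative kernel, not the MineBench-derived Rodinia code, and all names are placeholders.

```cuda
// Sketch of the K-means assignment step: one thread per data point.
__global__ void assign_clusters(const float *points, const float *centroids,
                                int *membership, int n, int k, int dim)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;

    int best = 0;
    float best_dist = 1e30f;                   // effectively +infinity here
    for (int c = 0; c < k; c++) {
        float d = 0.0f;
        for (int j = 0; j < dim; j++) {
            float diff = points[p * dim + j] - centroids[c * dim + j];
            d += diff * diff;                  // squared Euclidean distance
        }
        if (d < best_dist) { best_dist = d; best = c; }
    }
    membership[p] = best;                      // nearest centroid wins
}
```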
Stream Cluster (SC) solves the online clustering problem.
For a stream of input points, it finds a pre-determined number
of medians so that each point is assigned to its nearest
center [2]. The quality of the clustering is measured by the
sum of squared distances (SSQ) metric. The original code
is from the Parsec Benchmark suite developed by Princeton
University [2]. We ported the Parsec implementation to CUDA
and OpenMP.
Breadth-First Search (BFS) traverses all the connected
components in a graph. Large graphs involving millions of
vertices are common in scientific and engineering applications.
The CUDA version of BFS was contributed by IIIT [9].
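A level-synchronous formulation in the spirit of that implementation assigns one thread per vertex and relaunches the kernel once per frontier; the sketch below assumes a CSR graph (row_ofs, cols) and is not the Rodinia source.

```cuda
// One level of a level-synchronous BFS (illustrative sketch, CSR graph).
// The host relaunches the kernel, level by level, until next_frontier
// stays empty. dist[] is initialized to -1 (unvisited).
__global__ void bfs_level(const int *row_ofs, const int *cols,
                          int *dist, const bool *frontier,
                          bool *next_frontier, int level, int n_vertices)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_vertices || !frontier[v]) return;

    for (int e = row_ofs[v]; e < row_ofs[v + 1]; e++) {
        int u = cols[e];
        if (dist[u] == -1) {          // unvisited neighbor
            dist[u] = level + 1;      // benign race: all writers store level+1
            next_frontier[u] = true;
        }
    }
}
```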
Similarity Score (SS) is used in web document clustering
to compute the pairwise similarity between web
documents. The source code is from the Mars project [10] at
The Hong Kong University of Science and Technology. Mars
hides the programming complexity of the GPU behind the
simple and familiar MapReduce interface.
B. NVIDIA CUDA
For GPU implementations, the Rodinia suite uses
CUDA [22], an extension to C for GPUs. CUDA represents
the GPU as a co-processor that can run a large number of
threads. The threads are managed by representing parallel
tasks as kernels mapped over a domain. Kernels are scalar
and represent the work to be done by a single thread. A
kernel is invoked as a thread at every point in the domain.
Thread creation is managed in hardware, allowing fast thread
creation. The parallel threads share memory and synchronize
using barriers.
An important feature of CUDA is that the threads are time-
sliced in SIMD groups of 32 called warps. Each warp of 32
threads operates in lockstep. Divergent threads are handled
using hardware masking until they reconverge. Different warps
in a thread block need not operate in lockstep, but if threads
within a warp follow divergent paths, only threads on the same
path can be executed simultaneously. In the worst case, all 32
threads in a warp following different paths would result in
sequential execution of the threads across the warp.
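The following toy kernel shows how such divergence arises; it is illustrative only. Branching on the lane index splits every warp, whereas branching on a warp-aligned index would not.

```cuda
// Toy example of intra-warp divergence: even and odd lanes of the same
// 32-thread warp take different paths, so the hardware masks and
// serializes the two paths.
__global__ void divergent(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[tid] = data[tid] * 2.0f;   // path A: even lanes
    else
        data[tid] = data[tid] + 1.0f;   // path B: odd lanes, serialized vs. A
    // Branching on (threadIdx.x / 32) instead would keep whole warps on
    // one path and avoid the serialization.
}
```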
CUDA is currently supported only on NVIDIA GPUs, but
recent work has shown that CUDA programs can be compiled
to execute efficiently on multi-core CPUs [32].
The NVIDIA GTX 280 GPU used in this study has 30
streaming multiprocessors (SMs). Each SM has 8 streaming
processors (SPs) for a total of 240 SPs. Each group of 8 SPs
shares 16 kB of fast per-block shared memory (similar to
scratchpad memory). Each group of three SMs (i.e., 24 SPs)
shares a texture unit. An SP contains a scalar floating point
ALU that can also perform integer operations. Instructions
are executed in a SIMD fashion across all SPs in a given
multiprocessor. The GTX 280 has 1 GB of device memory.
C. CUDA vs. OpenMP Implementations
One challenge of designing the Rodinia suite is that there
is no single language for programming the platforms we
target, which forced us to choose two different languages
at the current stage. More general languages or APIs that
seek to provide a universal programming standard, such as
OpenCL [26], may address this problem. However, since
OpenCL tools were not available at the time of this writing,
this is left for future work.
Our decision to choose CUDA and OpenMP actually pro-
vides a real benefit. Because they lie at the extremes of data-
parallel programming models (fine-grained vs. coarse-grained,
explicit vs. implicit), comparing the two implementations of a
program provides insight into pros and cons of different ways
of specifying and optimizing parallelism and data manage-
ment.
Even though CUDA programmers must specify the tasks
of threads and thread blocks in a more fine-grained way
than in OpenMP, the basic parallel decompositions in most
CUDA and OpenMP applications are not fundamentally dif-
ferent. Aside from dealing with other offloading issues, in
a straightforward data-parallel application programmers can
relatively easily convert the OpenMP loop body into a CUDA
kernel body by replacing the for-loop indices with thread
indices over an appropriate domain (e.g., in Breadth-First
Search). Reductions, however, must be implemented manually
in CUDA (although CUDA libraries [30] make the reduction
easier), while in OpenMP this is handled by the compiler (e.g.,
in Back Propagation and SRAD).
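The conversion can be sketched on a generic element-wise loop (not taken from any particular Rodinia benchmark): the OpenMP loop index simply becomes a thread index over the same domain.

```cuda
// OpenMP version: the compiler distributes loop iterations across cores.
void scale_omp(float *a, const float *b, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}

// CUDA version: one thread per iteration, with the grid covering the domain.
__global__ void scale_kernel(float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // guard: the grid may overshoot n
        a[i] = 2.0f * b[i];
}
// Launch: scale_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
```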
Further optimizations, however, expose significant architec-
tural differences. Examples include taking advantage of data-
locality using specialized memories in CUDA, as opposed
to relying on large caches on the CPU, and reducing SIMD
divergence (as discussed in Section VI-B).
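One pattern that illustrates both points at once is the manual reduction: where OpenMP needs only a reduction(+:sum) clause, a CUDA kernel typically stages partial sums through the per-block shared memory, as in the sketch below (a common textbook pattern, not the code of any specific Rodinia benchmark).

```cuda
// Per-block tree reduction through shared memory (illustrative sketch;
// assumes blockDim.x == 256 and a power-of-two block size).
__global__ void block_sum(const float *in, float *block_out, int n)
{
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)               // one partial sum per block
        block_out[blockIdx.x] = buf[0];
}
```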
IV. METHODOLOGY AND EXPERIMENT SETUP
In this section, we explain the dimensions along which we
characterize the Rodinia benchmarks:
Diversity Analysis Characterization of diversity of the
benchmarks is necessary to identify whether the suite provides
sufficient coverage.
Parallelization and Speedup The Rodinia applications are
parallelized in various ways and a variety of optimizations
have been applied to obtain satisfactory performance. We
examine how well each application maps to the two target
platforms.
Computation vs. Communication Many accelerators such
as GPUs use a co-processor model in which computationally-
intensive portions of an application are offloaded to the ac-
celerator by the host processor. The communication overhead
between GPUs and CPUs often becomes a major performance
consideration.
Synchronization Synchronization overhead can be a barrier
to achieving good performance for applications utilizing fine-
grained synchronization. We analyze synchronization primi-
tives and strategies and their impact on application perfor-
mance.
Power Consumption An advantage of accelerator-based
computing is its potential to achieve better power-efficiency
than CPU-based computing. We show the diversity of the
Rodinia benchmarks in terms of power consumption.
All of our measurement results are obtained by running
the applications on real hardware. The benchmarks have been
evaluated on an NVIDIA GeForce GTX 280 GPU with 1.3
GHz shader clock and a 3.2 GHz Quad-core Intel Core 2
Extreme CPU. The system contains an NVIDIA nForce 790i-
based motherboard and the GPU is connected using PCIe 2.0.
We use NVIDIA driver version 177.11 and CUDA version 2.2,
except for the Similarity Score application, whose Mars [10]
infrastructure only supports CUDA versions up to 1.1.
V. DIVERSITY ANALYSIS
We use the Microarchitecture-Independent Workload Char-
acterization (MICA) framework developed by Hoste and Eeck-
hout [11] to evaluate the application diversity of the Rodinia
benchmark suite. MICA provides a Pin [19] toolkit to collect
metrics such as instruction mix, instruction-level parallelism,
register traffic, working set, data-stream size and branch-
predictability. Each metric also includes several sub-metrics
with a total of 47 program characteristics. The MICA method-
ology uses a Genetic Algorithm to minimize the number of
inherent program characteristics that need to be measured
by exploiting correlation between characteristics. It reduces
the 47-dimensional application characteristic space to an 8-
dimensional space without compromising the methodology’s
ability to compare benchmarks [11].

Fig. 1. Kiviat diagrams representing the eight microarchitecture-independent
characteristics of each benchmark.
The metrics used in MICA are microarchitecture indepen-
dent but not independent of the instruction set architecture
(ISA) and the compiler. Despite this limitation, Hoste and
Eeckhout [12] show that these metrics can provide a fairly
accurate characterization, even across different platforms.
We measure the single-core, CPU version of the applications
from the Rodinia benchmark suite with the MICA tool as
described by Hoste and Eeckhout [11], except that we calculate
the percentage of all arithmetic operations instead of the
percentage of only multiply operations. Our rationale for
performing the analysis using the single-threaded CPU version
of each benchmark is that the underlying set of computations
to be performed is the same as in the parallelized or GPU
version; verifying this assumption is left for future work. We use
Kiviat plots to visualize each benchmark’s inherent behavior,
with each axis representing one of the eight microarchitecture-
independent characteristics. The data was normalized to have
a zero mean and a unit standard deviation. Figure 1 shows
the Kiviat plots for the Rodinia programs, demonstrating that
each application exhibits diverse behavior.
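The normalization is an ordinary per-metric z-score, summarized by the host-side sketch below (illustrative; not part of the MICA tooling).

```cuda
// Per-metric z-score: rescale one metric's values (one per benchmark)
// to zero mean and unit standard deviation before plotting.
#include <cmath>
#include <vector>

void zscore(std::vector<float>& metric)
{
    float mean = 0.0f, var = 0.0f;
    for (float v : metric) mean += v;
    mean /= metric.size();
    for (float v : metric) var += (v - mean) * (v - mean);
    float stddev = std::sqrt(var / metric.size());  // assumes non-constant data
    for (float& v : metric) v = (v - mean) / stddev;
}
```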
Fig. 2. The speedup of the GPU implementations over the equivalent single-
and four-threaded CPU implementations. The execution time for calculating
the speedup is measured on the CPU and GPU for the core part of the
computation, excluding the I/O and initial setup. Figure 4 gives a detailed
breakdown of each CUDA implementation’s runtime.
VI. PARALLELIZATION AND OPTIMIZATION
A. Performance
Figure 2 shows the speedup of each benchmark’s CUDA
implementation running on a GPU relative to OpenMP im-
plementations running on a multicore CPU. The speedups
range from 5.5 to 80.8 over the single-threaded CPU im-
plementations and from 1.6 to 26.3 over the four-threaded
CPU implementations. Although we have not spent equal
effort optimizing all Rodinia applications, we believe that
the majority of the performance diversity results from the
diverse application characteristics inherent in the bench-
marks. SRAD, HotSpot, and Leukocyte are relatively compute-
intensive, while Needleman-Wunsch, Breadth-First Search, K-
means, and Stream Cluster are limited by the GPU’s off-
chip memory bandwidth. The application performance is also
determined by overheads involved in offloading (e.g., CPU-
GPU memory transfer overhead and kernel call overhead),
which we discuss further in the following sections.
The performance of the CPU implementations also depends
on the compiler’s ability to generate efficient code to better
utilize the CPU hardware (e.g. SSE units). We compared the
performance of some Rodinia benchmarks when compiled
with gcc 4.2.4, the compiler used in this study, and icc 10.1.
The SSE capabilities of icc were enabled by default in our 64-
bit environment. For the single-threaded CPU implementation,
for instance, Needleman-Wunsch compiled with icc is 3%
faster than when compiled with gcc, and SRAD compiled with
icc is 23% slower than when compiled with gcc. For the four-
threaded CPU implementations, Needleman-Wunsch compiled
with icc is 124% faster than when compiled with gcc, and
SRAD compiled with icc is 20% slower than when compiled
with gcc. Given such performance differences due to using
different compilers, for a fair comparison with the GPU, it
would be desirable to hand-code the critical loops of some
CPU implementations in assembly with SSE instructions.
However, this would require low-level programming that is
significantly more complex than CUDA programming, which
is beyond the scope of this paper.

Citations
More filters
01 Jan 2012
TL;DR: By including versions of varying levels of optimization of the same fundamental algorithm, the Parboil benchmarks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware.
Abstract: The Parboil benchmarks are a set of throughput computing applications useful for studying the performance of throughput computing architecture and compilers. The name comes from the culinary term for a partial cooking process, which represents our belief that useful throughput computing benchmarks must be “cooked”, or preselected to implement a scalable algorithm with fine-grained parallel tasks. But useful benchmarks for this field cannot be “fully cooked”, because the architectures and programming models and supporting tools are evolving rapidly enough that static benchmark codes will lose relevance very quickly. We have collected benchmarks from throughput computing application researchers in many different scientific and commercial fields including image processing, biomolecular simulation, fluid dynamics, and astronomy. Each benchmark includes several implementations. Some implementations we provide as readable base implementations from which new optimization efforts can begin, and others as examples of the current state-of-the-art targeting specific CPU and GPU architectures. As we continue to optimize these benchmarks for new and existing architectures ourselves, we will also gladly accept new implementations and benchmark contributions from developers to recognize those at the frontier of performance optimization on each architecture. Finally, by including versions of varying levels of optimization of the same fundamental algorithm, the benchmarks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware. Less optimized versions are presented as challenges to the compiler and architecture research communities: to develop the technology that automatically raises the performance of simpler implementations to the performance level of sophisticated programmer-optimized implementations, or demonstrate any other performance or programmability improvements. We hope that these benchmarks will facilitate effective demonstrations of such technology.

695 citations


Cites methods from "Rodinia: A benchmark suite for hete..."

  • ...The Rodinia benchmarks published by the University of Virginia [2] are very similar in philosophy and development to the Parboil benchmarks....

    [...]

Proceedings ArticleDOI
14 Mar 2010
TL;DR: The Scalable HeterOgeneous Computing benchmark suite (SHOC) is a spectrum of programs that test the performance and stability of scalable heterogeneous computing systems and includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
Abstract: Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.

620 citations

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work proposes a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements, and accurately tracks the power consumption trend over time.
Abstract: General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. More finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.

558 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity.
Abstract: Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.

541 citations


Cites background from "Rodinia: A benchmark suite for hete..."

  • ...It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent....

    [...]

Proceedings ArticleDOI
03 Dec 2011
TL;DR: This work proposes two independent ideas: the large warp microarchitecture and two-level warp scheduling that improve performance by 19.1% over traditional GPU cores for a wide variety of general purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.
Abstract: Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways. First, threads are grouped into fixed-size SIMD batches known as warps, and second, many such warps are concurrently executed on a single GPU core. Despite these techniques, the computational resources on GPU cores are still underutilized, resulting in performance far short of what could be delivered. Two reasons for this are conditional branch instructions and stalls due to long latency operations. To improve GPU performance, computational resources must be more effectively utilized. To accomplish this, we propose two independent ideas: the large warp microarchitecture and two-level warp scheduling. We show that when combined, our mechanisms improve performance by 19.1% over traditional GPU cores for a wide variety of general purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.

441 citations


Cites methods from "Rodinia: A benchmark suite for hete..."

  • ...We created parallel applications adapted from existing benchmark suites including Rodinia [5], MineBench [18], and NVIDIA’s CUDA SDK [19] in addition to creating one of our own (blackjack)....

    [...]

  • ...[5] S. Che et al. Rodinia: A benchmark suite for heterogeneous computing....

    [...]

References
More filters
Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation, and to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.

4,019 citations


"Rodinia: A benchmark suite for hete..." refers methods in this paper

  • ...While developing and characterizing these benchmarks, we have experienced first-hand the following challenges of the GPU platform: Data Structure Mapping: Programmers must find efficient mappings of their applications’ data structures to CUDA’s hierarchical (grid of thread blocks) domain model....

    [...]

Proceedings ArticleDOI
01 May 1995
TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

4,002 citations


"Rodinia: A benchmark suite for hete..." refers background in this paper

  • ...A diverse, multi-platform benchmark suite helps software, middleware, and hardware researchers in a variety of ways: • Accelerators offer significant performance and efficiency benefits compared to CPUs for many applications....

    [...]

Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations


"Rodinia: A benchmark suite for hete..." refers background or methods in this paper

  • ...Needleman-Wunsch uses 16 threads per block as discussed earlier, and Leukocyte uses different thread block sizes (128 and 256) for its two kernels because it operates on different working sets in the detection and tracking phases....

    [...]

  • ...• Fused CPU-GPU processors and other heterogeneous multiprocessor SoCs are likely to become common in PCs, servers and HPC environments....

    [...]

18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met for nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. • To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

2,262 citations


"Rodinia: A benchmark suite for hete..." refers background in this paper

  • ...Our decision to choose CUDA and OpenMP actually provides a real benefit....

    [...]

  • ...Each application or kernel is carefully chosen to represent different types of behavior according to the Berkeley dwarves [1]....

    [...]

Proceedings ArticleDOI
11 Aug 2008
TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

2,216 citations


"Rodinia: A benchmark suite for hete..." refers methods in this paper

  • ...For GPU implementations, the Rodinia suite uses CUDA [22], an extension to C for GPUs....

    [...]

Frequently Asked Questions (15)
Q1. What contributions have the authors mentioned in the paper "Rodinia: a benchmark suite for heterogeneous computing"?

This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. 

Directions for future work include: • Adding new applications to cover further dwarves, such as sparse matrix, sorting, etc. The authors plan to provide different download versions of applications for steps where they add major incremental optimizations. The authors plan to extend the Rodinia benchmarks to support more platforms, such as FPGAs, STI Cell, etc. The authors plan to extend their diversity analysis by using the clustering analysis performed by Joshi et al. [15], which requires a principal components analysis (PCA) that they have left to future work.

SPEC CPU [31] and EEMBC [6] are two widely used benchmark suites for evaluating general purpose CPUs and embedded processors, respectively. 

The most important optimizations are to reduce CPU-GPU communication and to maximize locality of memory accesses within each warp (ideally allowing a single, coalesced memory transaction to fulfill an entire warp’s loads).

For the single-threaded CPU implementation, for instance, Needleman-Wunsch compiled with icc is 3% faster than when compiled with gcc, and SRAD compiled with icc is 23% slower than when compiled with gcc. 

The basic requirements of a benchmark suite for general purpose computing include supporting diverse applications with various computation patterns, employing state-of-the-art algorithms, and providing input sets for testing different situations. 

The limit on registers and shared memory available per SM can constrain the number of active threads, sometimes exposing memory latency [29]. 

The authors plan to extend their diversity analysis by using the clustering analysis performed by Joshi et al. [15], which requires a principal components analysis (PCA) that the authors have left to future work.

Needleman-Wunsch exhibits an L2 miss rate of 41.2% due to its unconventional memory access patterns (diagonal strips) which are poorly handled by prefetching. 

Applications such as SRAD and Leukocyte exhibit relatively low overhead because the majority of their computations are independent. 

A diverse, multi-platform benchmark suite helps software, middleware, and hardware researchers in a variety of ways: • Accelerators offer significant performance and efficiency benefits compared to CPUs for many applications.

The authors also choose different numbers of threads per thread block for different applications; generally block sizes are chosen to maximize thread occupancy, although in some cases smaller thread blocks and reduced occupancy provide improved performance.

Other constraints include the fact that threads cannot fork new threads, that the architecture presents a 32-wide SIMD organization, and that only one kernel can run at a time.

The authors use Kiviat plots to visualize each benchmark’s inherent behavior, with each axis representing one of the eight microarchitecture-independent characteristics.

The compiler was able to automatically parallelize two of the Rodinia applications, HotSpot and SRAD, after the authors made minimal modifications to the code.