Analyzing CUDA Workloads Using a Detailed GPU Simulator
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt
University of British Columbia,
Vancouver, BC, Canada
{bakhoda,gyuan,wwlfung,henryw,aamodt}@ece.ubc.ca
Abstract
Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including the choice of interconnect topology, use of caches, design of the memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
1. Introduction
While single-thread performance of commercial superscalar microprocessors is still increasing, a clear trend today is for computer manufacturers to provide multithreaded hardware that strongly encourages software developers to provide explicit parallelism when possible. One important class of parallel computer hardware is the modern graphics processing unit (GPU) [22, 25]. With contemporary GPUs recently crossing the teraflop barrier [2, 34] and specific efforts to make GPUs easier to program for non-graphics applications [1, 29, 33], there is widespread interest in using GPU hardware to accelerate non-graphics applications.
Since its introduction by NVIDIA Corporation in February 2007, the CUDA programming model [29, 33] has been used to develop many applications for GPUs. CUDA provides an easy-to-learn extension of the ANSI C language. The programmer specifies parallel threads, each of which runs scalar code. While short vector data types are available, their use by the programmer is not required to achieve peak performance, making CUDA a more attractive programming model to those less familiar with traditional data-parallel architectures. This execution model has been dubbed a single instruction, multiple thread (SIMT) model [22] to distinguish it from the more traditional single instruction, multiple data (SIMD) model.
As of February 2009, NVIDIA listed 209 third-party applications on its CUDA Zone website [30]. Of the 136 applications listed with performance claims, 52 are reported to obtain a speedup of 50× or more, and of these, 29 are reported to obtain a speedup of 100× or more. As these applications already achieve tremendous benefits, this paper instead focuses on evaluating CUDA applications with reported speedups below 50×, since this group of applications appears most in need of software tuning or changes to hardware design.
This paper makes the following contributions:
• It presents data characterizing the performance of twelve existing CUDA applications collected on a research GPU simulator (GPGPU-Sim).
• It shows that the non-graphics applications we study tend to be more sensitive to bisection bandwidth than to latency.
• It shows that, for certain applications, decreasing the number of threads running concurrently on the hardware can improve performance by reducing contention for on-chip resources.
• It provides an analysis of application characteristics, including the dynamic instruction mix, SIMD warp branch divergence properties, and DRAM locality characteristics.
We believe the observations made in this paper will provide useful guidance for directing future architecture and software research.
The rest of this paper is organized as follows. In Section 2 we describe our baseline architecture and the microarchitecture design choices that we explore, before describing our simulation infrastructure and the benchmarks used in this study.

[Figure 1. Modeled system and GPU architecture [11]: (a) Overview; (b) Detail of shader core. Dashed portions (L1 and L2 caches for local/global accesses) are omitted from the baseline.]
Our experimental methodology is described in Section 3, and Section 4 presents and analyzes results. Section 5 reviews related work and Section 6 concludes the paper.
2. Design and Implementation
In this section we describe the GPU architecture we simulated, provide an overview of our simulator infrastructure, and then describe the benchmarks we selected for our study.
2.1. Baseline Architecture
Figure 1(a) shows an overview of the system we simulated. The applications evaluated in this paper were written using CUDA [29, 33]. In the CUDA programming model, the GPU is treated as a co-processor onto which an application running on a CPU can launch a massively parallel compute kernel. The kernel is comprised of a grid of scalar threads. Each thread is given a unique identifier which can be used to help divide up work among the threads. Within a grid, threads are grouped into blocks, which are also referred to as cooperative thread arrays (CTAs) [22]. Within a single CTA, threads have access to a common fast memory called the shared memory and can, if desired, perform barrier synchronizations.
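As a concrete example of these constructs (ours, not taken from the paper), the toy kernel below computes one partial sum per CTA: each scalar thread derives a unique global identifier from its CTA and thread indices, the CTA stages data in its shared memory, and barrier synchronizations order the phases of a tree reduction.

```cuda
// Minimal sketch of the CUDA constructs described above (illustrative example).
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float buf[256];                      // fast per-CTA shared memory
    int tid = threadIdx.x;                          // position within the CTA
    int gid = blockIdx.x * blockDim.x + tid;        // unique global thread identifier
    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                                // barrier across all threads of the CTA
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = buf[0];               // one result per CTA
}
// Host side: block_sum<<<num_ctas, 256>>>(d_in, d_partial, n); launches a grid of CTAs.
```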
Figure 1(a) also shows our baseline GPU architecture. The GPU consists of a collection of small data-parallel compute cores, labeled shader cores in Figure 1, connected by an interconnection network to multiple memory modules (each labeled memory controller). Each shader core is a unit similar in scope to a streaming multiprocessor (SM) in NVIDIA terminology [33]. Threads are distributed to shader cores at the granularity of entire CTAs, and per-CTA resources, such as registers, shared memory space, and thread slots, are not freed until all threads within a CTA have completed execution. If resources permit, multiple CTAs can be assigned to a single shader core, thus sharing a common pipeline for their execution. Our simulator omits graphics-specific hardware not exposed to CUDA.
Figure 1(b) shows the detailed implementation of a single shader core. In this paper, each shader core has a SIMD width of 8 and uses a 24-stage, in-order pipeline without forwarding. The 24-stage pipeline is motivated by details in the CUDA Programming Guide [33], which indicates that at least 192 active threads are needed to avoid stalling on true data dependencies between consecutive instructions from a single thread (in the absence of long latency memory operations). We model this pipeline with six logical pipeline stages (fetch, decode, execute, memory1, memory2, writeback) with superpipelining of degree 4 (memory1 is an empty stage in our model). Threads are scheduled onto the SIMD pipeline in fixed groups of 32 threads called warps [22]. All 32 threads in a given warp execute the same instruction with different data values over four consecutive clock cycles in all pipelines (the SIMD cores are effectively 8-wide). We use the immediate post-dominator reconvergence mechanism described in [11] to handle branch divergence, where some scalar threads within a warp evaluate a branch as "taken" and others evaluate it as "not taken".
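To make the divergence scenario concrete, the short kernel below (our illustration, not from the paper) contains a branch whose two sides are taken by different lanes of the same warp; under SIMT execution the two paths are serialized and execution reconverges at the immediate post-dominator of the branch, which here is the first statement after the if/else.

```cuda
// Illustrative sketch of intra-warp branch divergence.
__global__ void divergent_branch(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x & 1)     // odd lanes of the warp evaluate the branch as "taken"
        data[i] *= 3;
    else                     // even lanes evaluate it as "not taken"
        data[i] += 7;
    data[i] += 1;            // immediate post-dominator: the full warp reconverges here
}
```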
[Figure 2. Layout of memory controller nodes (shaded) among shader core nodes in the 6x6 mesh.]

Threads running on the GPU in the CUDA programming model have access to several memory regions (global, local, constant, texture, and shared [33]), and our simulator models accesses to each of these memory spaces. In particular, each shader core has access to a 16KB low-latency, highly-banked per-core shared memory; to global texture memory with a per-core texture cache; and to global constant memory with a per-core constant cache. Local and global memory accesses always require off-chip memory accesses in our baseline configuration. For the per-core texture cache, we implement a 4D blocking address scheme as described in [14], which essentially permutes the bits in requested addresses to promote spatial locality in a 2D space rather than in linear space. For the constant cache, we allow single-cycle access as long as all threads in a warp are requesting the same data. Otherwise, a port conflict occurs, forcing the data to be sent out over multiple cycles and resulting in pipeline stalls [33]. Multiple memory accesses from threads within a single warp to a localized region are coalesced into fewer wide memory accesses to improve DRAM efficiency¹. To alleviate the DRAM bandwidth bottleneck that many applications face, a common technique used by CUDA programmers is to load frequently accessed data into the fast on-chip shared memory [40].
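The sketch below (ours, not the paper's) shows this common pattern: consecutive threads of a warp read consecutive 4-byte words, so the hardware can coalesce the loads into a few wide DRAM accesses, and the values are then reused repeatedly from on-chip shared memory instead of being re-fetched from DRAM.

```cuda
// Illustrative sketch of staging frequently accessed data in shared memory.
__global__ void stage_and_reuse(const float *in, float *out, int n)
{
    __shared__ float buf[256];
    int base = blockIdx.x * blockDim.x;
    int gid = base + threadIdx.x;
    int valid = min((int)blockDim.x, n - base);   // elements this CTA actually loads
    if (gid < n)
        buf[threadIdx.x] = in[gid];      // lane k loads word k: coalesced within the warp
    __syncthreads();
    if (gid < n) {
        float acc = 0.0f;
        for (int j = 0; j < valid; ++j)  // repeated reads are served by shared memory
            acc += buf[j];
        out[gid] = acc;                  // coalesced store
    }
}
```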
Thread scheduling inside a shader core is performed with zero overhead on a fine-grained basis. Every 4 cycles, warps ready for execution are selected by the warp scheduler and issued to the SIMD pipelines in a loose round-robin fashion that skips non-ready warps, such as those waiting on global memory accesses. In other words, whenever any thread inside a warp faces a long-latency operation, all the threads in the warp are taken out of the scheduling pool until the long-latency operation is over. Meanwhile, other warps that are not waiting are sent to the pipeline for execution in round-robin order. The many threads running on each shader core thus allow a shader core to tolerate long-latency operations without reducing throughput.
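A minimal host-style sketch of such a loose round-robin pick is shown below; the structure and names are ours (GPGPU-Sim's internal implementation may differ).

```cuda
// Hypothetical sketch of a "loose" round-robin warp selector (names are ours).
struct WarpState { bool ready; };   // ready == not waiting on a long-latency operation

int pick_next_warp(const WarpState *warps, int num_warps, int last_issued)
{
    for (int i = 1; i <= num_warps; ++i) {
        int w = (last_issued + i) % num_warps;   // visit warps in round-robin order
        if (warps[w].ready)
            return w;                            // issue the first ready warp found
    }
    return -1;                                   // no warp ready: the pipeline idles this cycle
}
```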
In order to access global memory, memory requests must be sent via an interconnection network to the corresponding memory controllers, which are physically distributed over the chip. To avoid protocol deadlock, we model physically separate send and receive interconnection networks. Using separate logical networks to break protocol deadlock is another alternative, but one we did not explore. Each on-chip memory controller then interfaces to two off-chip GDDR3 DRAM chips². Figure 2 shows the physical layout of the memory controllers in our 6x6 mesh configuration as shaded areas³. The address decoding scheme is designed so that successive 2KB DRAM pages [19] are distributed across different banks and different chips to maximize row locality while spreading the load among the memory controllers.
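As a hypothetical illustration of an address decode with this property (the exact bit assignments are ours, not the simulator's), the low bits select the offset within a 2KB page, and the bits immediately above select the chip, controller, and bank, so that consecutive 2KB pages land on different banks and chips:

```cuda
// Hypothetical DRAM address decode sketch (bit layout is ours, for illustration only).
struct DramCoord { unsigned chip, ctrl, bank, row, col; };

DramCoord decode_address(unsigned long long addr, unsigned num_ctrls)
{
    DramCoord c;
    c.col = (unsigned)(addr & 0x7FF);              // offset within a 2KB DRAM page
    unsigned long long page = addr >> 11;          // consecutive pages differ by 1 here
    c.chip = (unsigned)(page & 0x1);  page >>= 1;  // 2 GDDR3 chips per controller
    c.ctrl = (unsigned)(page % num_ctrls); page /= num_ctrls;
    c.bank = (unsigned)(page & 0x3);  page >>= 2;  // assume 4 banks per chip
    c.row  = (unsigned)page;                       // remaining bits select the DRAM row
    return c;
}
```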
1. When memory accesses within a warp cannot be coalesced into a single memory access, the memory stage will stall until all memory accesses are issued from the shader core. In our design, the shader core can issue a maximum of 1 access every 2 cycles.
2. GDDR3 stands for Graphics Double Data Rate 3 [19]. Graphics DRAM is typically optimized to provide higher peak data bandwidth.
3. Note that with area-array (i.e., "flip-chip") designs it is possible to place I/O buffers anywhere on the die [6].
2.2. GPU Architectural Exploration
This section describes some of the GPU architectural design options explored in this paper. Evaluations of these design options are presented in Section 4.
2.2.1. Interconnect. The on-chip interconnection network can be designed in various ways based on its cost and performance. Cost is determined by the complexity and number of routers as well as the density and length of wires. Performance depends on the latency, bandwidth, and path diversity of the network [9]. (Path diversity indicates the number of routes a message can take from the source to the destination.)
Butterfly networks offer minimal hop count for a given router radix while having no path diversity and requiring very long wires. A crossbar interconnect can be seen as a one-stage butterfly and scales quadratically in area as the number of ports increases. A 2D torus interconnect can be implemented on chip with nearly uniformly short wires and offers good path diversity, which can lead to a more load-balanced network. Ring and mesh interconnects are both special types of torus interconnects. The main drawback of a mesh network is its relatively higher latency due to a larger hop count. As we will show in Section 4, our benchmarks are not particularly sensitive to latency, so we chose a mesh network as our baseline while exploring the other choices for interconnect topology.
2.2.2. CTA distribution. GPUs can use the abundance of parallelism in data-parallel applications to tolerate memory access latency by interleaving the execution of warps. These warps may either be from the same CTA or from different CTAs running on the same shader core. One advantage of running multiple smaller CTAs on a shader core rather than using a single larger CTA relates to the use of barrier synchronization points within a CTA [40]: threads from one CTA can make progress while threads from another CTA are waiting at a barrier. For a given number of threads per CTA, allowing more CTAs to run on a shader core provides additional memory latency tolerance, though it may imply increased register and shared memory resource use. However, even if sufficient on-chip resources exist to allow more CTAs per core, completely filling up all CTA slots for a memory-intensive compute kernel may reduce performance by increasing contention in the interconnection network and DRAM controllers. We issue CTAs in a breadth-first manner across shader cores, selecting the shader core that has the minimum number of CTAs running on it, so as to spread the workload as evenly as possible among all cores.
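A sketch of this breadth-first issue policy is given below; the data structures and names are ours, not the simulator's.

```cuda
// Hypothetical sketch of breadth-first CTA distribution across shader cores.
int select_core_for_cta(const int *ctas_running, const bool *has_free_resources, int num_cores)
{
    int best = -1;
    for (int c = 0; c < num_cores; ++c) {
        if (!has_free_resources[c])            // core lacks registers/shared memory/thread slots
            continue;
        if (best < 0 || ctas_running[c] < ctas_running[best])
            best = c;                          // prefer the core running the fewest CTAs
    }
    return best;                               // -1 means every core is currently full
}
```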
[Figure 3. Compilation flow for GPGPU-Sim from a CUDA application in comparison to the normal CUDA compilation flow: (a) CUDA flow with GPU hardware; (b) GPGPU-Sim.]

2.2.3. Memory Access Coalescing. The minimum access granularity for GDDR3 memory is 16 bytes, and scalar threads in CUDA applications typically access 4 bytes each [19]. To improve memory system efficiency, it thus makes sense to group accesses from multiple, concurrently issued scalar threads into a single access to a small, contiguous memory region. The CUDA programming guide indicates that parallel memory accesses from every half-warp of 16 threads can be coalesced into fewer wide memory accesses if they all access a contiguous memory region [33]. Our baseline models similar intra-warp memory coalescing behavior (we attempt to coalesce memory accesses from all 32 threads in a warp).
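The following host-style sketch (ours; the simulator's actual implementation may differ) captures the essence of intra-warp coalescing: the addresses issued by the threads of a warp are mapped to the set of aligned segments they touch, and one wide access is generated per distinct segment rather than one per thread.

```cuda
// Hypothetical sketch of intra-warp memory access coalescing.
#include <set>
#include <vector>

std::vector<unsigned long long>
coalesce_warp(const std::vector<unsigned long long> &lane_addrs, unsigned seg_bytes)
{
    std::set<unsigned long long> segments;                 // distinct aligned segments touched
    for (unsigned long long a : lane_addrs)
        segments.insert(a / seg_bytes);
    std::vector<unsigned long long> wide_requests;
    for (unsigned long long s : segments)
        wide_requests.push_back(s * seg_bytes);            // one wide access per segment
    return wide_requests;   // best case: 32 contiguous 4-byte accesses -> a single request
}
```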
A related issue is that, since the GPU is heavily multithreaded, a balanced design must support many outstanding memory requests at once. While microprocessors typically employ miss-status holding registers (MSHRs) [21] that use associative comparison logic to merge simultaneous requests for the same cache block, the number of outstanding misses that can be supported is typically small (e.g., the original Intel Pentium 4 used four MSHRs [16]). One way to support a far greater number of outstanding memory requests is to use a FIFO for outstanding memory requests [17]. Similarly, our baseline does not attempt to eliminate multiple requests for the same block of memory on cache misses or local/global memory accesses. However, we also explore the possibility of improving performance by coalescing read memory requests from later warps that require access to data for which a memory request is already in progress due to another warp running on the same shader core. We call this inter-warp memory coalescing. We observe that inter-warp memory coalescing can significantly reduce memory traffic for applications that contain data-dependent accesses to memory. The data for inter-warp merging quantifies the benefit of supporting large-capacity MSHRs that can detect a secondary access to an outstanding request [45].
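A minimal sketch of this inter-warp merging idea is shown below (the table structure and names are ours): before a read is sent to memory, an outstanding-request table keyed by block address is checked, and a hit simply records the new warp as a waiter instead of generating additional traffic.

```cuda
// Hypothetical sketch of inter-warp merging against outstanding requests.
#include <unordered_map>
#include <vector>

struct OutstandingRead { std::vector<int> waiting_warps; };
std::unordered_map<unsigned long long, OutstandingRead> pending;  // keyed by block address

bool issue_read(unsigned long long block_addr, int warp_id)
{
    auto it = pending.find(block_addr);
    if (it != pending.end()) {                       // secondary access: merge with the
        it->second.waiting_warps.push_back(warp_id); // request already in flight
        return false;                                // no new memory traffic generated
    }
    pending[block_addr].waiting_warps.push_back(warp_id);
    return true;                                     // primary access: send the read to memory
}
```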
2.2.4. Caching. While coalescing memory requests captures spatial locality among threads, memory bandwidth requirements may be further reduced with caching if an application contains temporal locality or spatial locality within the access pattern of individual threads. We evaluate the performance impact of adding first-level, per-core L1 caches for local and global memory accesses to the design described in Section 2.1. We also evaluate the effects of adding a shared L2 cache on the memory side of the interconnection network at the memory controller. While threads can only read from texture and constant memory, they can both read and write local and global memory. In our evaluation of caches for local and global memory we model non-coherent caches. (Note that threads from different CTAs in the applications we study do not communicate through global memory.)
2.3. Extending GPGPU-Sim to Support CUDA
We extended GPGPU-Sim, the cycle-accurate simulator we developed for our earlier work [11]. GPGPU-Sim models various aspects of a massively parallel architecture with highly programmable pipelines similar to contemporary GPU architectures. A drawback of the previous version of GPGPU-Sim was the difficult and time-consuming process of converting/parallelizing existing applications [11]. We overcome this difficulty by extending GPGPU-Sim to support the CUDA Parallel Thread Execution (PTX) [35] instruction set. This enables us to simulate the numerous existing, optimized CUDA applications on GPGPU-Sim. Our current simulator infrastructure runs CUDA applications without source code modifications on Linux-based platforms, but does require access to the application's source code. To build a CUDA application for our simulator, we replace the common.mk makefile used in the CUDA SDK with a version that builds the application to run on our microarchitecture simulator (other, more complex build scenarios may require more extensive makefile changes).
Figure 3 shows how a CUDA application can be compiled for simulation on GPGPU-Sim and compares this compilation flow to the normal CUDA compilation flow [33]. Both compilation flows use cudafe to transform the source code of a CUDA application into host C code running on the CPU and device C code running on the GPU. The GPU C code is then compiled into PTX assembly (labeled ".ptx" in Figure 3) by nvopencc, an open-source compiler provided by NVIDIA based on Open64 [28, 36]. The PTX assembler (ptxas) then assembles the PTX assembly code into the target GPU's native ISA (labeled "cubin.bin" in Figure 3(a)). The assembled code is then combined with the host C code and compiled into a single executable linked with the CUDA Runtime API library (labeled "libcuda.a" in Figure 3) by a standard C compiler. In the normal CUDA compilation flow (used with NVIDIA GPU hardware), the resulting executable calls the CUDA Runtime API to set up and invoke compute kernels on the GPU via the NVIDIA CUDA driver.
When a CUDA application is compiled to use GPGPU-Sim, many steps remain the same. However, rather than linking against the NVIDIA-supplied libcuda.a binary, we link against our own libcuda.a binary. Our libcuda.a implements "stub" functions for the interface defined by the header files supplied with CUDA. These stub functions set up and invoke simulation sessions of the compute kernels on GPGPU-Sim (as shown in Figure 3(b)). Before the first simulation session, GPGPU-Sim parses the text-format PTX assembly code generated by nvopencc to obtain code for the compute kernels. Because the PTX assembly code has no restriction on register usage (to improve portability between different GPU architectures), nvopencc performs register allocation using far more registers than typically required to avoid spilling. To improve the realism of our performance model, we determine the register usage per thread and shared memory used per CTA using ptxas⁴. We then use this information to limit the number of CTAs that can run concurrently on a shader core. The GPU binary (cubin.bin) produced by ptxas is not used by GPGPU-Sim. After parsing the PTX assembly code, but before beginning simulation, GPGPU-Sim performs an immediate post-dominator analysis on each kernel to annotate branch instructions with reconvergence points for the stack-based SIMD control flow handling mechanism described by Fung et al. [11]. During a simulation, a PTX functional simulator executes instructions from multiple threads according to their scheduling order as specified by the performance simulator. When the simulation completes, the host CPU code is allowed to resume execution. In our current implementation, host code runs on a normal CPU, thus our performance measurements are for the GPU code only.
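To illustrate the stub idea, a sketch of two such replacement entry points is shown below. The internal calls (gpgpusim_copy, gpgpusim_run_kernel) are hypothetical names of ours, not the simulator's actual interface; only the exported CUDA Runtime API signatures (as of the CUDA 1.1-era runtime) are real.

```cuda
// Hypothetical sketch of stub CUDA Runtime API functions in our custom libcuda.a.
// gpgpusim_copy and gpgpusim_run_kernel are illustrative placeholders.
#include <cuda_runtime.h>   // type definitions come from the normal CUDA headers

void gpgpusim_copy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
void gpgpusim_run_kernel(const char *kernel_entry);

extern "C" cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                                  enum cudaMemcpyKind kind)
{
    gpgpusim_copy(dst, src, count, kind);   // move data to/from the simulated GPU memory
    return cudaSuccess;
}

extern "C" cudaError_t cudaLaunch(const char *kernel_entry)   // CUDA 1.1-era signature
{
    gpgpusim_run_kernel(kernel_entry);      // run a timing simulation of the kernel;
    return cudaSuccess;                     // host code resumes when the session completes
}
```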
2.4. Benchmarks
Our benchmarks are listed in Table 1 along with the main application properties, such as the organization of threads into CTAs and grids as well as the different memory spaces on the GPU exploited by each application. Multiple entries separated by semicolons in the grid and CTA dimensions indicate that the application runs multiple kernels.
For comparison purposes we also simulated the following benchmarks from NVIDIA's CUDA software development kit (SDK) [32]: Black-Scholes Option Pricing, Fast Walsh Transform, Binomial Option Pricing, Separable Convolution, 64-bin Histogram, Matrix Multiply, Parallel Reduction, Scalar Product, Scan of Large Arrays, and Matrix Transpose. Due to space limitations, and since most of these benchmarks already perform well on GPUs, we only report details for Black-Scholes (BLK), a financial options pricing application, and Fast Walsh Transform (FWT), widely used in signal and image processing and compression. We also report the harmonic mean of all SDK applications simulated, denoted as SDK in the data bar charts in Section 4.
4. By default, the version of ptxas in CUDA 1.1 appears to attempt to avoid spilling registers provided the number of registers per thread is less than 128, and none of the applications we studied reached this limit. Directing ptxas to further restrict the number of registers leads to an increase in local memory usage above that explicitly used in the PTX assembly, while increasing the register limit does not increase the number of registers used.
Below, we describe the CUDA applications not in the SDK that we use as benchmarks in our study. These applications were developed by the researchers cited below and run unmodified on our simulator.
AES Encryption (AES) [24] This application, developed by Manavski [24], implements the Advanced Encryption Standard (AES) algorithm in CUDA to encrypt and decrypt files. The application has been optimized by the developer so that constants are stored in constant memory, the expanded key is stored in texture memory, and the input data is processed in shared memory. We encrypt a 256KB picture using 128-bit encryption.
Graph Algorithm: Breadth-First Search (BFS) [15] Developed by Harish and Narayanan [15], this application performs breadth-first search on a graph. As each node in the graph is mapped to a different thread, the amount of parallelism in this application scales with the size of the input graph. BFS suffers from performance loss due to heavy global memory traffic and branch divergence. We perform breadth-first search on a random graph with 65,536 nodes and an average of 6 edges per node.
Coulombic Potential (CP) [18, 41] CP is part of the Parboil Benchmark suite developed by the IMPACT research group at UIUC [18, 41]. CP is useful in the field of molecular dynamics. Loops are manually unrolled to reduce loop overheads, and the point charge data is stored in constant memory to take advantage of caching. CP has been heavily optimized (it has been shown to achieve a 647× speedup versus a CPU version [40]). We simulate 200 atoms on a grid size of 256×256.
gpuDG (DG) [46] gpuDG is a discontinuous Galerkin time-domain solver, used in the field of electromagnetics to calculate radar scattering from 3D objects and analyze waveguides, particle accelerators, and EM compatibility [46]. Data is loaded into shared memory from texture memory. The inner loop consists mainly of matrix-vector products. We use the 3D version with polynomial order N=6 and reduce the number of time steps to 2 to reduce simulation time.
3D Laplace Solver (LPS) [12] Laplace is a highly parallel finance application [12]. As well as using shared memory, care was taken by the application developer to ensure coalesced global memory accesses. We observe that this benchmark suffers some performance loss due to branch divergence. We run one iteration on a 100x100x100 grid.
LIBOR Monte Carlo (LIB) [13] LIBOR performs Monte
