Analyzing CUDA Workloads Using a Detailed GPU Simulator
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt
University of British Columbia,
Vancouver, BC, Canada
{bakhoda,gyuan,wwlfung,henryw,aamodt}@ece.ubc.ca
Abstract
Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including the choice of interconnect topology, use of caches, design of the memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
1. Introduction
While single-thread performance of commercial superscalar microprocessors is still increasing, a clear trend today is for computer manufacturers to provide multithreaded hardware that strongly encourages software developers to provide explicit parallelism when possible. One important class of parallel computer hardware is the modern graphics processing unit (GPU) [22, 25]. With contemporary GPUs recently crossing the teraflop barrier [2, 34] and specific efforts to make GPUs easier to program for non-graphics applications [1, 29, 33], there is widespread interest in using GPU hardware to accelerate non-graphics applications.
Since its introduction by NVIDIA Corporation in February 2007, the CUDA programming model [29, 33] has been used to develop many applications for GPUs. CUDA provides an easy-to-learn extension of the ANSI C language. The programmer specifies parallel threads, each of which runs scalar code. While short vector data types are available, their use by the programmer is not required to achieve peak performance, making CUDA a more attractive programming model to those less familiar with traditional data-parallel architectures. This execution model has been dubbed a single instruction, multiple thread (SIMT) model [22] to distinguish it from the more traditional single instruction, multiple data (SIMD) model.
As of February 2009, NVIDIA listed 209 third-party applications on its CUDA Zone website [30]. Of the 136 applications listed with performance claims, 52 are reported to obtain a speedup of 50× or more, and of these, 29 are reported to obtain a speedup of 100× or more. As these applications already achieve tremendous benefits, this paper instead focuses on evaluating CUDA applications with reported speedups below 50×, since this group of applications appears most in need of software tuning or changes to hardware design.
This paper makes the following contributions:
• It presents data characterizing the performance of twelve existing CUDA applications collected on a research GPU simulator (GPGPU-Sim).
• It shows that the non-graphics applications we study tend to be more sensitive to bisection bandwidth than to latency.
• It shows that, for certain applications, decreasing the number of threads running concurrently on the hardware can improve performance by reducing contention for on-chip resources.
• It provides an analysis of application characteristics, including the dynamic instruction mix, SIMD warp branch divergence properties, and DRAM locality characteristics.
We believe the observations made in this paper will provide useful guidance for directing future architecture and software research.
The rest of this paper is organized as follows. In Section 2 we describe our baseline architecture and the microarchitecture design choices that we explore, before describing our simulation infrastructure and the benchmarks used in this study.

[Figure 1. Modeled system and GPU architecture [11]: (a) Overview; (b) Detail of shader core. Dashed portions (L1 and L2 caches for local/global accesses) are omitted from the baseline.]
Our experimental methodology is described in Section 3, and Section 4 presents and analyzes results. Section 5 reviews related work and Section 6 concludes the paper.
2. Design and Implementation
In this section we describe the GPU architecture we simulated, provide an overview of our simulator infrastructure, and then describe the benchmarks we selected for our study.
2.1. Baseline Architecture
Figure 1(a) shows an overview of the system we simulated. The applications evaluated in this paper were written using CUDA [29, 33]. In the CUDA programming model, the GPU is treated as a co-processor onto which an application running on a CPU can launch a massively parallel compute kernel. The kernel is comprised of a grid of scalar threads. Each thread is given a unique identifier which can be used to help divide up work among the threads. Within a grid, threads are grouped into blocks, which are also referred to as cooperative thread arrays (CTAs) [22]. Within a single CTA, threads have access to a common fast memory called the shared memory and can, if desired, perform barrier synchronizations.
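As a concrete example of these constructs (ours, not taken from the paper), the toy kernel below computes one partial sum per CTA: each scalar thread derives a unique global identifier from its CTA and thread indices, the CTA stages data in its shared memory, and barrier synchronizations order the phases of a tree reduction.

```cuda
// Minimal sketch of the CUDA constructs described above (illustrative example).
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float buf[256];                      // fast per-CTA shared memory
    int tid = threadIdx.x;                          // position within the CTA
    int gid = blockIdx.x * blockDim.x + tid;        // unique global thread identifier
    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                                // barrier across all threads of the CTA
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = buf[0];               // one result per CTA
}
// Host side: block_sum<<<num_ctas, 256>>>(d_in, d_partial, n); launches a grid of CTAs.
```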
Figure 1(a) also shows our baseline GPU architecture. The GPU consists of a collection of small data-parallel compute cores, labeled shader cores in Figure 1, connected by an interconnection network to multiple memory modules (each labeled memory controller). Each shader core is a unit similar in scope to a streaming multiprocessor (SM) in NVIDIA terminology [33]. Threads are distributed to shader cores at the granularity of entire CTAs, and per-CTA resources, such as registers, shared memory space, and thread slots, are not freed until all threads within a CTA have completed execution. If resources permit, multiple CTAs can be assigned to a single shader core, thus sharing a common pipeline for their execution. Our simulator omits graphics-specific hardware not exposed to CUDA.
Figure 1(b) shows the detailed implementation of a single shader core. In this paper, each shader core has a SIMD width of 8 and uses a 24-stage, in-order pipeline without forwarding. The 24-stage pipeline is motivated by details in the CUDA Programming Guide [33], which indicates that at least 192 active threads are needed to avoid stalling on true data dependencies between consecutive instructions from a single thread (in the absence of long latency memory operations). We model this pipeline with six logical pipeline stages (fetch, decode, execute, memory1, memory2, writeback) with superpipelining of degree 4 (memory1 is an empty stage in our model). Threads are scheduled onto the SIMD pipeline in fixed groups of 32 threads called warps [22]. All 32 threads in a given warp execute the same instruction with different data values over four consecutive clock cycles in all pipelines (the SIMD cores are effectively 8-wide). We use the immediate post-dominator reconvergence mechanism described in [11] to handle branch divergence, where some scalar threads within a warp evaluate a branch as "taken" and others evaluate it as "not taken".
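To make the divergence scenario concrete, the short kernel below (our illustration, not from the paper) contains a branch whose two sides are taken by different lanes of the same warp; under SIMT execution the two paths are serialized and execution reconverges at the immediate post-dominator of the branch, which here is the first statement after the if/else.

```cuda
// Illustrative sketch of intra-warp branch divergence.
__global__ void divergent_branch(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x & 1)     // odd lanes of the warp evaluate the branch as "taken"
        data[i] *= 3;
    else                     // even lanes evaluate it as "not taken"
        data[i] += 7;
    data[i] += 1;            // immediate post-dominator: the full warp reconverges here
}
```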
[Figure 2. Layout of memory controller nodes (shaded) among shader core nodes in the 6x6 mesh.]

Threads running on the GPU in the CUDA programming model have access to several memory regions (global, local, constant, texture, and shared [33]), and our simulator models accesses to each of these memory spaces. In particular, each shader core has access to a 16KB low-latency, highly-banked per-core shared memory; to global texture memory with a per-core texture cache; and to global constant memory with a per-core constant cache. Local and global memory accesses always require off-chip memory accesses in our baseline configuration. For the per-core texture cache, we implement a 4D blocking address scheme as described in [14], which essentially permutes the bits in requested addresses to promote spatial locality in a 2D space rather than in linear space. For the constant cache, we allow single-cycle access as long as all threads in a warp are requesting the same data. Otherwise, a port conflict occurs, forcing the data to be sent out over multiple cycles and resulting in pipeline stalls [33]. Multiple memory accesses from threads within a single warp to a localized region are coalesced into fewer wide memory accesses to improve DRAM efficiency¹. To alleviate the DRAM bandwidth bottleneck that many applications face, a common technique used by CUDA programmers is to load frequently accessed data into the fast on-chip shared memory [40].
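The sketch below (ours, not the paper's) shows this common pattern: consecutive threads of a warp read consecutive 4-byte words, so the hardware can coalesce the loads into a few wide DRAM accesses, and the values are then reused repeatedly from on-chip shared memory instead of being re-fetched from DRAM.

```cuda
// Illustrative sketch of staging frequently accessed data in shared memory.
__global__ void stage_and_reuse(const float *in, float *out, int n)
{
    __shared__ float buf[256];
    int base = blockIdx.x * blockDim.x;
    int gid = base + threadIdx.x;
    int valid = min((int)blockDim.x, n - base);   // elements this CTA actually loads
    if (gid < n)
        buf[threadIdx.x] = in[gid];      // lane k loads word k: coalesced within the warp
    __syncthreads();
    if (gid < n) {
        float acc = 0.0f;
        for (int j = 0; j < valid; ++j)  // repeated reads are served by shared memory
            acc += buf[j];
        out[gid] = acc;                  // coalesced store
    }
}
```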
Thread scheduling inside a shader core is performed with zero overhead on a fine-grained basis. Every 4 cycles, warps ready for execution are selected by the warp scheduler and issued to the SIMD pipelines in a loose round-robin fashion that skips non-ready warps, such as those waiting on global memory accesses. In other words, whenever any thread inside a warp faces a long-latency operation, all the threads in the warp are taken out of the scheduling pool until the long-latency operation is over. Meanwhile, other warps that are not waiting are sent to the pipeline for execution in round-robin order. The many threads running on each shader core thus allow a shader core to tolerate long-latency operations without reducing throughput.
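A minimal host-style sketch of such a loose round-robin pick is shown below; the structure and names are ours (GPGPU-Sim's internal implementation may differ).

```cuda
// Hypothetical sketch of a "loose" round-robin warp selector (names are ours).
struct WarpState { bool ready; };   // ready == not waiting on a long-latency operation

int pick_next_warp(const WarpState *warps, int num_warps, int last_issued)
{
    for (int i = 1; i <= num_warps; ++i) {
        int w = (last_issued + i) % num_warps;   // visit warps in round-robin order
        if (warps[w].ready)
            return w;                            // issue the first ready warp found
    }
    return -1;                                   // no warp ready: the pipeline idles this cycle
}
```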
In order to access global memory, memory requests must be sent via an interconnection network to the corresponding memory controllers, which are physically distributed over the chip. To avoid protocol deadlock, we model physically separate send and receive interconnection networks. Using separate logical networks to break protocol deadlock is another alternative, but one we did not explore. Each on-chip memory controller then interfaces to two off-chip GDDR3 DRAM chips². Figure 2 shows the physical layout of the memory controllers in our 6x6 mesh configuration as shaded areas³. The address decoding scheme is designed so that successive 2KB DRAM pages [19] are distributed across different banks and different chips to maximize row locality while spreading the load among the memory controllers.
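As a hypothetical illustration of an address decode with this property (the exact bit assignments are ours, not the simulator's), the low bits select the offset within a 2KB page, and the bits immediately above select the chip, controller, and bank, so that consecutive 2KB pages land on different banks and chips:

```cuda
// Hypothetical DRAM address decode sketch (bit layout is ours, for illustration only).
struct DramCoord { unsigned chip, ctrl, bank, row, col; };

DramCoord decode_address(unsigned long long addr, unsigned num_ctrls)
{
    DramCoord c;
    c.col = (unsigned)(addr & 0x7FF);              // offset within a 2KB DRAM page
    unsigned long long page = addr >> 11;          // consecutive pages differ by 1 here
    c.chip = (unsigned)(page & 0x1);  page >>= 1;  // 2 GDDR3 chips per controller
    c.ctrl = (unsigned)(page % num_ctrls); page /= num_ctrls;
    c.bank = (unsigned)(page & 0x3);  page >>= 2;  // assume 4 banks per chip
    c.row  = (unsigned)page;                       // remaining bits select the DRAM row
    return c;
}
```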
1. When memory accesses within a warp cannot be coalesced into a single memory access, the memory stage will stall until all memory accesses are issued from the shader core. In our design, the shader core can issue a maximum of 1 access every 2 cycles.
2. GDDR3 stands for Graphics Double Data Rate 3 [19]. Graphics DRAM is typically optimized to provide higher peak data bandwidth.
3. Note that with area-array (i.e., "flip-chip") designs it is possible to place I/O buffers anywhere on the die [6].
2.2. GPU Architectural Exploration
This section describes some of the GPU architectural design options explored in this paper. Evaluations of these design options are presented in Section 4.
2.2.1. Interconnect. The on-chip interconnection network can be designed in various ways based on its cost and performance. Cost is determined by the complexity and number of routers as well as the density and length of wires. Performance depends on the latency, bandwidth, and path diversity of the network [9]. (Path diversity indicates the number of routes a message can take from the source to the destination.)
Butterfly networks offer minimal hop count for a given router radix while having no path diversity and requiring very long wires. A crossbar interconnect can be seen as a one-stage butterfly and scales quadratically in area as the number of ports increases. A 2D torus interconnect can be implemented on chip with nearly uniformly short wires and offers good path diversity, which can lead to a more load-balanced network. Ring and mesh interconnects are both special types of torus interconnects. The main drawback of a mesh network is its relatively higher latency due to a larger hop count. As we will show in Section 4, our benchmarks are not particularly sensitive to latency, so we chose a mesh network as our baseline while exploring the other choices for interconnect topology.
2.2.2. CTA distribution. GPUs can use the abundance of parallelism in data-parallel applications to tolerate memory access latency by interleaving the execution of warps. These warps may either be from the same CTA or from different CTAs running on the same shader core. One advantage of running multiple smaller CTAs on a shader core rather than using a single larger CTA relates to the use of barrier synchronization points within a CTA [40]: threads from one CTA can make progress while threads from another CTA are waiting at a barrier. For a given number of threads per CTA, allowing more CTAs to run on a shader core provides additional memory latency tolerance, though it may imply increased register and shared memory resource use. However, even if sufficient on-chip resources exist to allow more CTAs per core, completely filling up all CTA slots for a memory-intensive compute kernel may reduce performance by increasing contention in the interconnection network and DRAM controllers. We issue CTAs in a breadth-first manner across shader cores, selecting the shader core that has the minimum number of CTAs running on it, so as to spread the workload as evenly as possible among all cores.
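A sketch of this breadth-first issue policy is given below; the data structures and names are ours, not the simulator's.

```cuda
// Hypothetical sketch of breadth-first CTA distribution across shader cores.
int select_core_for_cta(const int *ctas_running, const bool *has_free_resources, int num_cores)
{
    int best = -1;
    for (int c = 0; c < num_cores; ++c) {
        if (!has_free_resources[c])            // core lacks registers/shared memory/thread slots
            continue;
        if (best < 0 || ctas_running[c] < ctas_running[best])
            best = c;                          // prefer the core running the fewest CTAs
    }
    return best;                               // -1 means every core is currently full
}
```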
[Figure 3. Compilation flow for GPGPU-Sim from a CUDA application in comparison to the normal CUDA compilation flow: (a) CUDA flow with GPU hardware; (b) GPGPU-Sim.]

2.2.3. Memory Access Coalescing. The minimum access granularity for GDDR3 memory is 16 bytes, and scalar threads in CUDA applications typically access 4 bytes each [19]. To improve memory system efficiency, it thus makes sense to group accesses from multiple, concurrently issued scalar threads into a single access to a small, contiguous memory region. The CUDA programming guide indicates that parallel memory accesses from every half-warp of 16 threads can be coalesced into fewer wide memory accesses if they all access a contiguous memory region [33]. Our baseline models similar intra-warp memory coalescing behavior (we attempt to coalesce memory accesses from all 32 threads in a warp).
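The following host-style sketch (ours; the simulator's actual implementation may differ) captures the essence of intra-warp coalescing: the addresses issued by the threads of a warp are mapped to the set of aligned segments they touch, and one wide access is generated per distinct segment rather than one per thread.

```cuda
// Hypothetical sketch of intra-warp memory access coalescing.
#include <set>
#include <vector>

std::vector<unsigned long long>
coalesce_warp(const std::vector<unsigned long long> &lane_addrs, unsigned seg_bytes)
{
    std::set<unsigned long long> segments;                 // distinct aligned segments touched
    for (unsigned long long a : lane_addrs)
        segments.insert(a / seg_bytes);
    std::vector<unsigned long long> wide_requests;
    for (unsigned long long s : segments)
        wide_requests.push_back(s * seg_bytes);            // one wide access per segment
    return wide_requests;   // best case: 32 contiguous 4-byte accesses -> a single request
}
```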
A related issue is that, since the GPU is heavily multithreaded, a balanced design must support many outstanding memory requests at once. While microprocessors typically employ miss-status holding registers (MSHRs) [21] that use associative comparison logic to merge simultaneous requests for the same cache block, the number of outstanding misses that can be supported is typically small (e.g., the original Intel Pentium 4 used four MSHRs [16]). One way to support a far greater number of outstanding memory requests is to use a FIFO for outstanding memory requests [17]. Similarly, our baseline does not attempt to eliminate multiple requests for the same block of memory on cache misses or local/global memory accesses. However, we also explore the possibility of improving performance by coalescing read memory requests from later warps that require access to data for which a memory request is already in progress due to another warp running on the same shader core. We call this inter-warp memory coalescing. We observe that inter-warp memory coalescing can significantly reduce memory traffic for applications that contain data-dependent accesses to memory. The data for inter-warp merging quantifies the benefit of supporting large-capacity MSHRs that can detect a secondary access to an outstanding request [45].
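A minimal sketch of this inter-warp merging idea is shown below (the table structure and names are ours): before a read is sent to memory, an outstanding-request table keyed by block address is checked, and a hit simply records the new warp as a waiter instead of generating additional traffic.

```cuda
// Hypothetical sketch of inter-warp merging against outstanding requests.
#include <unordered_map>
#include <vector>

struct OutstandingRead { std::vector<int> waiting_warps; };
std::unordered_map<unsigned long long, OutstandingRead> pending;  // keyed by block address

bool issue_read(unsigned long long block_addr, int warp_id)
{
    auto it = pending.find(block_addr);
    if (it != pending.end()) {                       // secondary access: merge with the
        it->second.waiting_warps.push_back(warp_id); // request already in flight
        return false;                                // no new memory traffic generated
    }
    pending[block_addr].waiting_warps.push_back(warp_id);
    return true;                                     // primary access: send the read to memory
}
```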
2.2.4. Caching. While coalescing memory requests captures spatial locality among threads, memory bandwidth requirements may be further reduced with caching if an application contains temporal locality or spatial locality within the access pattern of individual threads. We evaluate the performance impact of adding first-level, per-core L1 caches for local and global memory accesses to the design described in Section 2.1. We also evaluate the effects of adding a shared L2 cache on the memory side of the interconnection network at the memory controller. While threads can only read from texture and constant memory, they can both read and write local and global memory. In our evaluation of caches for local and global memory we model non-coherent caches. (Note that threads from different CTAs in the applications we study do not communicate through global memory.)
2.3. Extending GPGPU-Sim to Support CUDA
We extended GPGPU-Sim, the cycle-accurate simulator we developed for our earlier work [11]. GPGPU-Sim models various aspects of a massively parallel architecture with highly programmable pipelines similar to contemporary GPU architectures. A drawback of the previous version of GPGPU-Sim was the difficult and time-consuming process of converting/parallelizing existing applications [11]. We overcome this difficulty by extending GPGPU-Sim to support the CUDA Parallel Thread Execution (PTX) [35] instruction set. This enables us to simulate the numerous existing, optimized CUDA applications on GPGPU-Sim. Our current simulator infrastructure runs CUDA applications without source code modifications on Linux-based platforms, but does require access to the application's source code. To build a CUDA application for our simulator, we replace the common.mk makefile used in the CUDA SDK with a version that builds the application to run on our microarchitecture simulator (other, more complex build scenarios may require more extensive makefile changes).
Figure 3 shows how a CUDA application can be compiled for simulation on GPGPU-Sim and compares this compilation flow to the normal CUDA compilation flow [33]. Both compilation flows use cudafe to transform the source code of a CUDA application into host C code running on the CPU and device C code running on the GPU. The GPU C code is then compiled into PTX assembly (labeled ".ptx" in Figure 3) by nvopencc, an open-source compiler provided by NVIDIA based on Open64 [28, 36]. The PTX assembler (ptxas) then assembles the PTX assembly code into the target GPU's native ISA (labeled "cubin.bin" in Figure 3(a)). The assembled code is then combined with the host C code and compiled into a single executable linked with the CUDA Runtime API library (labeled "libcuda.a" in Figure 3) by a standard C compiler. In the normal CUDA compilation flow (used with NVIDIA GPU hardware), the resulting executable calls the CUDA Runtime API to set up and invoke compute kernels on the GPU via the NVIDIA CUDA driver.
When a CUDA application is compiled to use GPGPU-Sim, many steps remain the same. However, rather than linking against the NVIDIA-supplied libcuda.a binary, we link against our own libcuda.a binary. Our libcuda.a implements "stub" functions for the interface defined by the header files supplied with CUDA. These stub functions set up and invoke simulation sessions of the compute kernels on GPGPU-Sim (as shown in Figure 3(b)). Before the first simulation session, GPGPU-Sim parses the text-format PTX assembly code generated by nvopencc to obtain code for the compute kernels. Because the PTX assembly code has no restriction on register usage (to improve portability between different GPU architectures), nvopencc performs register allocation using far more registers than typically required to avoid spilling. To improve the realism of our performance model, we determine the register usage per thread and shared memory used per CTA using ptxas⁴. We then use this information to limit the number of CTAs that can run concurrently on a shader core. The GPU binary (cubin.bin) produced by ptxas is not used by GPGPU-Sim. After parsing the PTX assembly code, but before beginning simulation, GPGPU-Sim performs an immediate post-dominator analysis on each kernel to annotate branch instructions with reconvergence points for the stack-based SIMD control flow handling mechanism described by Fung et al. [11]. During a simulation, a PTX functional simulator executes instructions from multiple threads according to their scheduling order as specified by the performance simulator. When the simulation completes, the host CPU code is allowed to resume execution. In our current implementation, host code runs on a normal CPU, thus our performance measurements are for the GPU code only.
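To illustrate the stub idea, a sketch of two such replacement entry points is shown below. The internal calls (gpgpusim_copy, gpgpusim_run_kernel) are hypothetical names of ours, not the simulator's actual interface; only the exported CUDA Runtime API signatures (as of the CUDA 1.1-era runtime) are real.

```cuda
// Hypothetical sketch of stub CUDA Runtime API functions in our custom libcuda.a.
// gpgpusim_copy and gpgpusim_run_kernel are illustrative placeholders.
#include <cuda_runtime.h>   // type definitions come from the normal CUDA headers

void gpgpusim_copy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
void gpgpusim_run_kernel(const char *kernel_entry);

extern "C" cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                                  enum cudaMemcpyKind kind)
{
    gpgpusim_copy(dst, src, count, kind);   // move data to/from the simulated GPU memory
    return cudaSuccess;
}

extern "C" cudaError_t cudaLaunch(const char *kernel_entry)   // CUDA 1.1-era signature
{
    gpgpusim_run_kernel(kernel_entry);      // run a timing simulation of the kernel;
    return cudaSuccess;                     // host code resumes when the session completes
}
```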
2.4. Benchmarks
Our benchmarks are listed in Table 1 along with the main application properties, such as the organization of threads into CTAs and grids as well as the different memory spaces on the GPU exploited by each application. Multiple entries separated by semicolons in the grid and CTA dimensions indicate that the application runs multiple kernels.
For comparison purposes we also simulated the following benchmarks from NVIDIA's CUDA software development kit (SDK) [32]: Black-Scholes Option Pricing, Fast Walsh Transform, Binomial Option Pricing, Separable Convolution, 64-bin Histogram, Matrix Multiply, Parallel Reduction, Scalar Product, Scan of Large Arrays, and Matrix Transpose. Due to space limitations, and since most of these benchmarks already perform well on GPUs, we only report details for Black-Scholes (BLK), a financial options pricing application, and Fast Walsh Transform (FWT), widely used in signal and image processing and compression. We also report the harmonic mean of all SDK applications simulated, denoted as SDK in the data bar charts in Section 4.
4. By default, the version of ptxas in CUDA 1.1 appears to attempt to avoid spilling registers provided the number of registers per thread is less than 128, and none of the applications we studied reached this limit. Directing ptxas to further restrict the number of registers leads to an increase in local memory usage above that explicitly used in the PTX assembly, while increasing the register limit does not increase the number of registers used.
Below, we describe the CUDA applications not in the SDK that we use as benchmarks in our study. These applications were developed by the researchers cited below and run unmodified on our simulator.
AES Encryption (AES) [24] This application, developed by Manavski [24], implements the Advanced Encryption Standard (AES) algorithm in CUDA to encrypt and decrypt files. The application has been optimized by the developer so that constants are stored in constant memory, the expanded key is stored in texture memory, and the input data is processed in shared memory. We encrypt a 256KB picture using 128-bit encryption.
Graph Algorithm: Breadth-First Search (BFS) [15] Developed by Harish and Narayanan [15], this application performs breadth-first search on a graph. As each node in the graph is mapped to a different thread, the amount of parallelism in this application scales with the size of the input graph. BFS suffers from performance loss due to heavy global memory traffic and branch divergence. We perform breadth-first search on a random graph with 65,536 nodes and an average of 6 edges per node.
Coulombic Potential (CP) [18, 41] CP is part of the Parboil Benchmark suite developed by the IMPACT research group at UIUC [18, 41]. CP is useful in the field of molecular dynamics. Loops are manually unrolled to reduce loop overheads, and the point charge data is stored in constant memory to take advantage of caching. CP has been heavily optimized (it has been shown to achieve a 647× speedup versus a CPU version [40]). We simulate 200 atoms on a grid size of 256×256.
gpuDG (DG) [46] gpuDG is a discontinuous Galerkin time-domain solver, used in the field of electromagnetics to calculate radar scattering from 3D objects and analyze waveguides, particle accelerators, and EM compatibility [46]. Data is loaded into shared memory from texture memory. The inner loop consists mainly of matrix-vector products. We use the 3D version with polynomial order N=6 and reduce the number of time steps to 2 to reduce simulation time.
3D Laplace Solver (LPS) [12] Laplace is a highly parallel finance application [12]. As well as using shared memory, care was taken by the application developer to ensure coalesced global memory accesses. We observe that this benchmark suffers some performance loss due to branch divergence. We run one iteration on a 100x100x100 grid.
LIBOR Monte Carlo (LIB) [13] LIBOR performs Monte
