Analyzing CUDA workloads using a detailed GPU simulator
Citations
Dark Silicon and the End of Multicore Scaling
An Analytical Model for a GPU Architecture with Memory-Level and Thread-Level Parallelism Awareness
A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator
GPUWattch: Enabling Energy Optimizations in GPGPUs
References
Scalable Parallel Programming with CUDA
NVIDIA Tesla: A Unified Graphics and Computing Architecture
Memory Access Scheduling
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA
Frequently Asked Questions (19)
Q2. How many levels of reflections and shadows are taken into account?
Up to 5 levels of reflections and shadows are taken into account, so thread behavior depends on what object the ray hits (if it hits any at all), making the kernel susceptible to branch divergence.
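A minimal CUDA sketch of this effect (hypothetical kernel, helpers, and data layout, not the benchmark's actual code): lanes in a warp whose rays miss exit the bounce loop early, while their warp-mates keep iterating, so the SIMD hardware must mask off the idle lanes.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for scene intersection and shading; the real
// ray tracer is far more involved.
__device__ int   intersect(float org, float dir) { return (org + dir > 0.f) ? 1 : -1; }
__device__ float shadeHit(int id)                { return 0.5f * (float)id; }

// Up to max_depth (e.g. 5) bounces: which path each thread takes depends
// on what its ray hits, so threads in the same warp diverge.
__global__ void trace(const float *org, const float *dir,
                      float *out, int n_rays, int max_depth) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;
    float color = 0.f;
    for (int depth = 0; depth < max_depth; ++depth) {
        int hit = intersect(org[i], dir[i]);
        if (hit < 0) break;      // divergent exit: this lane idles while
        color += shadeHit(hit);  // its warp-mates continue bouncing
    }
    out[i] = color;
}
```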
Q3. What can be done to increase the number of threads running simultaneously?
Increasing the number of simultaneously running threads can improve performance because it gives the hardware a greater ability to hide memory access latencies.
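A small illustration using the modern CUDA occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor, which postdates the paper's toolchain): the more blocks that can be resident at once, the more warps the scheduler can switch to while others wait on memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // memory latency here is hidden
}                                       // by switching to other warps

int main() {
    // Ask the runtime how many 256-thread blocks of saxpy fit on one
    // multiprocessor; more resident threads means more latency hiding.
    int numBlocks = 0;
    const int blockSize = 256;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, saxpy, blockSize, 0);
    printf("resident blocks per SM at blockSize=%d: %d\n", blockSize, numBlocks);
    return 0;
}
```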
Q4. What is the reason why BFS performs poorly?
BFS also performs poorly since threads in adjacent nodes in the graph (which are grouped into warps) behave differently, causing more than 75% of its warps to have less than 50% occupancy.
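A hypothetical level-synchronous BFS step (CSR graph layout, not the benchmark's actual code) showing where the underpopulated warps come from: consecutive node IDs share a warp, but only nodes on the current frontier do any work.

```cuda
#include <cuda_runtime.h>

// One BFS level, one thread per node over a CSR graph. Threads whose
// node is not on the current frontier return immediately, so a warp of
// 32 adjacent nodes often has only a few active lanes.
__global__ void bfs_step(const int *row_ptr, const int *col_idx,
                         int *dist, int level, int n_nodes, int *done) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_nodes || dist[v] != level) return;  // most lanes idle here
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
        int u = col_idx[e];
        if (dist[u] < 0) {        // unvisited neighbor: claim it
            dist[u] = level + 1;
            *done = 0;            // benign race: any writer keeps us going
        }
    }
}
```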
Q5. What is the reason why the unfilled warps in NN are not due to branch?
In NN, two of the four kernels have only a single thread per block, yet they take up the bulk of the execution time; the unfilled warps in NN are therefore not due to branch divergence.
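A hypothetical launch mirroring that pattern (not NN's actual code): with one thread per block, every warp carries a single active lane no matter how uniform the control flow is.

```cuda
#include <cuda_runtime.h>

__global__ void tiny(float *out) { out[blockIdx.x] = (float)blockIdx.x; }

int main() {
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));
    // One thread per block, as in two of NN's kernels: each warp has a
    // single active lane, so 31 of 32 SIMD lanes sit idle with no
    // branch divergence involved.
    tiny<<<1024, 1>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```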
Q6. How many active threads are needed to avoid stalling?
The 24-stage pipeline is motivated by details in the CUDA Programming Guide [33], which indicates that at least 192 active threads are needed to avoid stalling for true data dependencies between consecutive instructions from a single thread (in the absence of long latency memory operations).
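One way to arrive at that figure from the pipeline model described in Q16 (a sanity check, not an additional claim from the paper): a warp issues into the 8-wide pipeline over 4 consecutive cycles, so a 24-stage pipeline holds 24 / 4 = 6 warps in flight, and 6 warps × 32 threads = 192 active threads.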
Q7. How does the graph algorithm scale with the size of the input graph?
As each node in the graph is mapped to a different thread, the amount of parallelism in this application scales with the size of the input graph.
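Concretely, with a thread-per-node mapping like the hypothetical bfs_step sketched above, the grid is sized from the node count, so the launch (and the available parallelism) grows with the input graph:

```cuda
// Host-side launch fragment (building on the hypothetical bfs_step
// above): one thread per node, grid sized from n_nodes.
int threadsPerBlock = 256;
int blocks = (n_nodes + threadsPerBlock - 1) / threadsPerBlock;
bfs_step<<<blocks, threadsPerBlock>>>(row_ptr, col_idx, dist, level, n_nodes, done);
```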
Q8. Why does nvopencc use more registers than is required to avoid spilling?
Because the PTX assembly code has no restriction on register usage (to improve portability between different GPU architectures), nvopencc performs register allocation using far more registers than typically required to avoid spilling.
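For contrast, current toolchains expose knobs that cap per-thread register use and force the backend to spill instead; these are standard nvcc/CUDA features rather than part of the nvopencc flow the paper studies.

```cuda
#include <cuda_runtime.h>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) tells the
// compiler to keep register usage low enough for the requested
// occupancy, spilling to local memory if it must.
__global__ void __launch_bounds__(256, 4)
scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
// The same cap can be applied globally at compile time:
//   nvcc --maxrregcount=32 scale.cu
```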
Q9. How many applications are reported to have a speedup of more than 50?
Of the 136 applications listed with performance claims, 52 are reported to obtain a speedup of 50× or more, and of these 29 are reported to obtain a speedup of 100× or more.
Q10. How did the authors validate their simulator against a Geforce 8600GTS?
The authors also validated their simulator against an NVIDIA GeForce 8600 GTS (a "low end" graphics card) by configuring their simulator to use four shader cores and two memory controllers.
Q11. What does the current simulator infrastructure require to run CUDA applications?
Their current simulator infrastructure runs CUDA applications without source code modifications on Linux-based platforms, but it does require access to the application's source code.
Q12. What configuration is unable to run using a baseline?
For the baseline configuration, some benchmarks are already resource-constrained to only 1 or 2 CTAs per shader core, making them unable to run under a configuration with fewer resources.
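A worked example with hypothetical numbers (not taken from the paper): if a shader core has 16,384 registers and a kernel's 256-thread CTAs each need 32 registers per thread, one CTA consumes 8,192 registers, so at most 2 CTAs fit; a configuration that halves the register file leaves room for only 1, and any further cut would leave the kernel unable to launch at all.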
Q13. What is the scheduling policy for a shader?
Given the widely-varying workload-dependent behavior, always scheduling the maximal number of CTAs supported by a shader core is not always the best scheduling policy.
Q14. How did the authors explore the effect of shader core resources on performance?
The authors explored the effects of varying the resources that limit the number of threads and hence CTAs that can run concurrently on a shader core, without modifying the source code for the benchmarks.
Q15. How many cycles of latency do the authors add to each router?
Without affecting peak throughput, the authors add an extra pipelined latency of 4, 8, or 16 cycles to each router on top of their baseline router’s 2-cycle latency.
Q16. How many threads in a warp execute together in the pipeline?
All 32 threads in a given warp execute the same instruction with different data values over four consecutive clock cycles in all pipelines (the SIMD cores are effectively 8-wide).
Q17. What is the reason why different topologies do not change the performance of benchmarks?
As the authors will show in the next section, one of the reasons why different topologies do not change the performance of most benchmarks dramatically is that the benchmarks are not sensitive to small variations in latency, as long as the interconnection network provides sufficient bandwidth.
Q18. How much latency is the input port to the return interconnect stalled?
Their analysis shows that for the baseline configuration, the input port to the return interconnect from memory to the shader cores is stalled 16% of the time on average.
Q19. What is the average speedup of a mesh?
On average, their baseline mesh interconnect performs comparably to a crossbar with an input speedup of two for the workloads that the authors consider.