Scalable parallel programming with CUDA

doi:10.1145/1401132.1401152

Home
/
Papers
/
Scalable parallel programming with CUDA

Proceedings Article•DOI•

Scalable parallel programming with CUDA

John R. Nickolls¹, Ian Buck¹, Michael Garland¹, Kevin Skadron²•Institutions (2)

Nvidia¹, University of Virginia²

11 Aug 2008-Vol. 6, Iss: 2, pp 40-53

TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.

read less

Abstract: The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Rodinia: A benchmark suite for heterogeneous computing

[...]

Shuai Che¹, Michael Boyer¹, Jiayuan Meng¹, David Tarjan¹, Jeremy W. Sheaffer¹, Sang-Ha Lee¹, Kevin Skadron¹ - Show less +3 more•Institutions (1)

University of Virginia¹

04 Oct 2009

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

...read moreread less

Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

...read moreread less

2,697 citations

Cites methods from "Scalable parallel programming with ..."

...For GPU implementations, the Rodinia suite uses CUDA [22], an extension to C for GPUs....
[...]

Proceedings Article•DOI•

Channel Pruning for Accelerating Very Deep Neural Networks

[...]

Yihui He¹, Xiangyu Zhang², Jian Sun²•Institutions (2)

Carnegie Mellon University¹, Xi'an Jiaotong University²

01 Oct 2017

TL;DR: In this paper, a LASSO regression based channel selection and least square reconstruction is proposed to accelerate very deep convolutional neural networks, which achieves 5× speedup along with only 0.3% increase of error.

...read moreread less

Abstract: In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhance the compatibility with various architectures. Our pruned VGG-16 achieves the state-of-the-art results by 5× speed-up along with only 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet, Xception and suffers only 1.4%, 1.0% accuracy loss under 2× speedup respectively, which is significant.

...read moreread less

2,082 citations

Proceedings Article•DOI•

Analyzing CUDA workloads using a detailed GPU simulator

[...]

Ali Bakhoda¹, George L. Yuan¹, Wilson W. L. Fung¹, Henry Wong¹, Tor M. Aamodt¹ - Show less +1 more•Institutions (1)

University of British Columbia¹

26 Apr 2009

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.

...read moreread less

Abstract: Modern Graphic Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight in designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth rather than latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.

...read moreread less

1,558 citations

Journal Article•DOI•

NVIDIA Tesla: A Unified Graphics and Computing Architecture

[...]

Erik Lindholm¹, John R. Nickolls¹, S. Oberman¹, John S. Montrym¹•Institutions (1)

Nvidia¹

01 Mar 2008-IEEE Micro

TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.

...read moreread less

Abstract: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

...read moreread less

1,492 citations

Journal Article•DOI•

A survey of the recent architectures of deep convolutional neural networks

[...]

Asifullah Khan¹, Anabia Sohail¹, Umme Zahoora¹, Aqsa Saeed Qureshi¹•Institutions (1)

Pakistan Institute of Engineering and Applied Sciences¹

01 Dec 2020-Artificial Intelligence Review

TL;DR: Deep Convolutional Neural Networks (CNNs) as mentioned in this paper are a special type of Neural Networks, which has shown exemplary performance on several competitions related to Computer Vision and Image Processing.

...read moreread less

Abstract: Deep Convolutional Neural Network (CNN) is a special type of Neural Networks, which has shown exemplary performance on several competitions related to Computer Vision and Image Processing. Some of the exciting application areas of CNN include Image Classification and Segmentation, Object Detection, Video Processing, Natural Language Processing, and Speech Recognition. The powerful learning ability of deep CNN is primarily due to the use of multiple feature extraction stages that can automatically learn representations from the data. The availability of a large amount of data and improvement in the hardware technology has accelerated the research in CNNs, and recently interesting deep CNN architectures have been reported. Several inspiring ideas to bring advancements in CNNs have been explored, such as the use of different activation and loss functions, parameter optimization, regularization, and architectural innovations. However, the significant improvement in the representational capacity of the deep CNN is achieved through architectural innovations. Notably, the ideas of exploiting spatial and channel information, depth and width of architecture, and multi-path information processing have gained substantial attention. Similarly, the idea of using a block of layers as a structural unit is also gaining popularity. This survey thus focuses on the intrinsic taxonomy present in the recently reported deep CNN architectures and, consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature-map exploitation, channel boosting, and attention. Additionally, the elementary understanding of CNN components, current challenges, and applications of CNN are also provided.

...read moreread less

1,328 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

NVIDIA Tesla: A Unified Graphics and Computing Architecture

[...]

Erik Lindholm¹, John R. Nickolls¹, S. Oberman¹, John S. Montrym¹•Institutions (1)

Nvidia¹

01 Mar 2008-IEEE Micro

...read moreread less

1,492 citations

Journal Article•DOI•

Brook for GPUs: stream computing on graphics hardware

[...]

Ian Buck¹, Tim Foley¹, Daniel Reiter Horn¹, Jeremy Sugerman¹, Kayvon Fatahalian¹, Mike Houston¹, Pat Hanrahan¹ - Show less +3 more•Institutions (1)

Stanford University¹

01 Aug 2004

TL;DR: This paper presents Brook for GPUs, a system for general-purpose computation on programmable graphics hardware that abstracts and virtualizes many aspects of graphics hardware, and presents an analysis of the effectiveness of the GPU as a compute engine compared to the CPU.

...read moreread less

Abstract: In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming co-processor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications, the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.

...read moreread less

1,288 citations

Book•

Matrix Computations, 3rd Edition

[...]

Golub Gene H. Et.Al

06 Jul 2007

948 citations

Journal Article•DOI•

Accelerating molecular modeling applications with graphics processors

[...]

John E. Stone¹, James C. Phillips¹, Peter L. Freddolino¹, David J. Hardy¹, Leonardo G. Trabuco¹, Klaus Schulten¹ - Show less +2 more•Institutions (1)

University of Illinois at Urbana–Champaign¹

01 Dec 2007-Journal of Computational Chemistry

TL;DR: An overview of recent advances in programmable GPUs is presented, with an emphasis on their application to molecular mechanics simulations and the programming techniques required to obtain optimal performance in these cases.

...read moreread less

Abstract: Molecular mechanics simulations offer a computational approach to study the behavior of biomolecules at atomic detail, but such simulations are limited in size and timescale by the available computing resources. State- of-the-art graphics processing units (GPUs) can perform over 500 billion arithmetic operations per second, a tremen- dous computational resource that can now be utilized for general purpose computing as a result of recent advances in GPU hardware and software architecture. In this article, an overview of recent advances in programmable GPUs is presented, with an emphasis on their application to molecular mechanics simulations and the programming techni- ques required to obtain optimal performance in these cases. We demonstrate the use of GPUs for the calculation of long-range electrostatics and nonbonded forces for molecular dynamics simulations, where GPU-based calculations are typically 10-100 times faster than heavily optimized CPU-based implementations. The application of GPU accel- eration to biomolecular simulation is also demonstrated through the use of GPU-accelerated Coulomb-based ion placement and calculation of time-averaged potentials from molecular dynamics trajectories. A novel approximation to Coulomb potential calculation, the multilevel summation method, is introduced and compared with direct Cou- lomb summation. In light of the performance obtained for this set of calculations, future applications of graphics processors to molecular dynamics simulations are discussed.

...read moreread less

727 citations

Additional excerpts

...x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7]...
[...]
...x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[i] += x[i+4];...
[...]

Proceedings Article•DOI•

Scan primitives for GPU computing

[...]

Shubhabrata Sengupta¹, Mark J. Harris², Yao Zhang¹, John D. Owens¹•Institutions (2)

University of California, Davis¹, Nvidia²

04 Aug 2007

TL;DR: Using the scan primitives, this work shows novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyzes the performance of the scanPrimitives, several sort algorithms that use the scan Primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.

...read moreread less

Abstract: The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.

...read moreread less

635 citations