Author

Pawan Harish

Bio: Pawan Harish is an academic researcher from the International Institute of Information Technology, Hyderabad. The author has contributed to research in the topics of Rendering (computer graphics) and Pixel. The author has an h-index of 7 and has co-authored 13 publications receiving 1,059 citations. Previous affiliations of Pawan Harish include École Polytechnique Fédérale de Lausanne.

Papers
Book Chapter
18 Dec 2007
TL;DR: This work presents a few fundamental graph algorithms - including breadth-first search, single-source shortest path, and all-pairs shortest path - implemented in CUDA on large graphs using the G80 line of Nvidia GPUs.
Abstract: Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers have been reported but are accessible only to a few. Graphics Processing Units (GPUs) of today have high computation power and low price. They have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. We present a few fundamental algorithms - including breadth-first search, single-source shortest path, and all-pairs shortest path - using CUDA on large graphs. We can compute the single-source shortest path on a 10 million vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases, the optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors.
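The following is a minimal sketch of the level-synchronous, one-thread-per-vertex BFS formulation the abstract describes, assuming a CSR (compressed sparse row) graph representation; kernel and variable names are illustrative, not the authors' code.

```cuda
// One BFS level: each thread owns a vertex, expands it only if it is in the
// current frontier, and labels unvisited neighbours. The write to levels[] is
// a benign race because every writer stores the same value.
__global__ void bfs_level_kernel(const int *row_offsets, const int *col_indices,
                                 int *levels,            // -1 = unvisited
                                 bool *frontier, bool *next_frontier,
                                 int num_vertices, int current_level, bool *changed)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || !frontier[v]) return;

    frontier[v] = false;
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int nbr = col_indices[e];
        if (levels[nbr] == -1) {
            levels[nbr] = current_level + 1;
            next_frontier[nbr] = true;
            *changed = true;
        }
    }
}
```

The host would initialize levels[source] = 0 and frontier[source] = true, then launch this kernel once per level, swapping the frontier arrays, until changed remains false.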

763 citations

Proceedings Article
01 Aug 2009
TL;DR: This paper presents a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Borůvka's approach for undirected graphs, implemented using scalable primitives such as scan, segmented scan and split.
Abstract: Graphics Processing Units are used for much general-purpose processing due to the high compute power available on them. Regular, data-parallel algorithms map well to the SIMD architecture of current GPUs. Irregular algorithms on discrete structures like graphs are harder to map to them. Efficient data-mapping primitives can play a crucial role in mapping such algorithms onto the GPU. In this paper, we present a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Borůvka's approach for undirected graphs. We implement it using scalable primitives such as scan, segmented scan, and split. The irregular steps of supervertex formation and recursive graph construction are mapped to primitives like split on categories involving vertex ids and edge weights. We obtain a 30 to 50 times speedup over the CPU implementation on most graphs and a 3 to 10 times speedup over our previous GPU implementation. We construct the minimum spanning tree on a 5 million node and 30 million edge graph in under 1 second on one quarter of the Tesla S1070 GPU.
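As one hedged illustration of how such primitives express an irregular step, the sketch below picks each vertex's minimum-weight outgoing edge (the core of a Borůvka iteration) with a segmented reduction, using Thrust's reduce_by_key in place of the segmented-scan primitive the paper describes; all names are illustrative, not the authors' code.

```cuda
// Sketch of one Borůvka step: find each vertex's lightest outgoing edge with a
// segmented reduction. Edges are assumed to be grouped by source vertex (CSR
// order); thrust::reduce_by_key stands in for the paper's segmented scan.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/pair.h>

using WeightedEdge = thrust::pair<float, int>;   // (weight, edge id)

struct MinByWeight {
    __host__ __device__
    WeightedEdge operator()(const WeightedEdge &a, const WeightedEdge &b) const {
        return (a.first <= b.first) ? a : b;     // keep the lighter edge
    }
};

// edge_src:  source vertex of every edge, sorted (CSR order)
// edge_info: (weight, edge id) for every edge, in the same order
// Produces one (vertex, lightest edge) pair per distinct source vertex.
void min_edge_per_vertex(const thrust::device_vector<int> &edge_src,
                         const thrust::device_vector<WeightedEdge> &edge_info,
                         thrust::device_vector<int> &out_vertex,
                         thrust::device_vector<WeightedEdge> &out_min_edge)
{
    thrust::reduce_by_key(edge_src.begin(), edge_src.end(),
                          edge_info.begin(),
                          out_vertex.begin(),
                          out_min_edge.begin(),
                          thrust::equal_to<int>(),
                          MinByWeight());
}
```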

126 citations

01 Jan 2009
TL;DR: This paper presents fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model.
Abstract: Graphics Processing Units (GPUs) provide high computation power at a low cost and are important compute accelerators with a massively multithreaded architecture. In this paper, we present fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model. Our implementations exhibit high performance, especially on large graphs. We use two data-parallel programming methodologies for these algorithms. One is an iterative, mask-based approach that processes valid data elements like vertices and edges using independent threads for each. The other is a divide-and-conquer approach that reduces the problem into smaller problems that are handled later using the same approach. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The massively multithreaded model of the GPU makes it possible to adopt the data-parallel approach even for irregular algorithms like graph algorithms, using O(V) or O(E) simultaneous threads. The algorithms and the underlying techniques presented in this paper are likely to be applicable to many irregular algorithms. We show the impact of our implementations on random, scale-free, and real-life graphs of up to millions of vertices on high-end and low-end GPUs. The availability and spread of GPUs to desktops and laptops make them ideal candidates to accelerate graph operations over CPU-only implementations. Practical implementations of basic operations go a long way in realizing their potential.
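A minimal sketch of the iterative, mask-based methodology, applied here to single-source shortest path relaxation: a mask marks the vertices that still need work, one thread handles each, and a flag tells the host whether another iteration is required. This is an assumption-laden illustration, not the paper's implementation.

```cuda
// Iterative, mask-based pattern: one thread per vertex, a mask selecting the
// active vertices, and a flag the host polls to decide whether to iterate again.
__global__ void relax_kernel(const int *row_offsets, const int *col_indices,
                             const int *edge_weights,
                             int *dist, bool *mask, bool *changed,
                             int num_vertices)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || !mask[v]) return;
    mask[v] = false;

    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = col_indices[e];
        int candidate = dist[v] + edge_weights[e];
        // atomicMin avoids lost updates when several threads relax the same vertex;
        // it returns the old value, so a larger old value means we improved dist[u].
        if (atomicMin(&dist[u], candidate) > candidate) {
            mask[u] = true;
            *changed = true;
        }
    }
}
// Host side: set dist[src] = 0 and mask[src] = true, then launch the kernel in a
// loop, copying *changed back each iteration, until it remains false.
```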

94 citations

Journal Article
TL;DR: Garuda is a client-server-based display wall framework that uses off-the-shelf hardware and a standard network that can transparently render any application built using the open scene graph (OSG) API to a tiled display without any modification by the user.
Abstract: Cluster-based tiled display walls can provide cost-effective and scalable displays with high resolution and a large display area. The software to drive them needs to scale too if arbitrarily large displays are to be built. Chromium is a popular software API used to construct such displays. Chromium transparently renders any OpenGL application to a tiled display by partitioning and sending individual OpenGL primitives to each client per frame. Visualization applications often deal with massive geometric data with millions of primitives. Transmitting them every frame results in huge network requirements that adversely affect the scalability of the system. In this paper, we present Garuda, a client-server-based display wall framework that uses off-the-shelf hardware and a standard network. Garuda is scalable to large tile configurations and massive environments. It can transparently render any application built using the open scene graph (OSG) API to a tiled display without any modification by the user. The Garuda server uses an object-based scene structure represented using a scene graph. The server determines the objects visible to each display tile using a novel adaptive algorithm that culls the scene graph to a hierarchy of frustums. Required parts of the scene graph are transmitted to the clients, which cache them to exploit the interframe redundancy. A multicast-based protocol is used to transmit the geometry to exploit the spatial redundancy present in tiled display systems. A geometry push philosophy from the server helps keep the clients in sync with one another. Neither the server nor a client needs to render the entire scene, making the system suitable for interactive rendering of massive models. Transparent rendering is achieved by intercepting the cull, draw, and swap functions of OSG and replacing them with our own. We demonstrate the performance and scalability of the Garuda system for different configurations of display wall. We also show that the server and network loads grow sublinearly with the increase in the number of tiles, which makes our scheme suitable to construct very large displays.
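A simplified host-side sketch of the central idea: traverse the scene graph once, test each node's bound against every tile's frustum, and record per tile the objects that tile must receive; subtrees invisible on all tiles are pruned. The types and the flat list of per-tile frustums are illustrative simplifications, not Garuda's or OSG's actual API (the paper's algorithm is adaptive and culls against a hierarchy of frustums).

```cuda
#include <cstddef>
#include <vector>

// Illustrative types, not Garuda's or OSG's actual data structures.
struct Sphere  { float x, y, z, r; };
struct Frustum { float planes[6][4]; };                 // ax + by + cz + d >= 0 is "inside"

static bool intersects(const Sphere &s, const Frustum &f) {
    for (const auto &p : f.planes)
        if (p[0] * s.x + p[1] * s.y + p[2] * s.z + p[3] < -s.r)
            return false;                               // fully outside one plane
    return true;
}

struct SceneNode {
    Sphere bound;
    std::vector<SceneNode *> children;
    int object_id = -1;                                 // leaf objects carry an id
};

// Walk the scene graph once; leaves record which display tiles need them, and
// subtrees whose bound misses every tile frustum are skipped entirely.
void cull(const SceneNode *node,
          const std::vector<Frustum> &tile_frustums,
          std::vector<std::vector<int>> &visible_per_tile)
{
    bool any_visible = false;
    for (std::size_t t = 0; t < tile_frustums.size(); ++t) {
        if (!intersects(node->bound, tile_frustums[t])) continue;
        any_visible = true;
        if (node->object_id >= 0)
            visible_per_tile[t].push_back(node->object_id);
    }
    if (!any_visible) return;
    for (const SceneNode *child : node->children)
        cull(child, tile_frustums, visible_per_tile);
}
```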

56 citations

Journal Article
TL;DR: This article proposes a parallel prioritized Jacobian-based inverse kinematics algorithm that can handle complex articulated bodies at interactive frame rates and demonstrates how the GPU can further exploit fine-grain parallelism not directly available on a multicore processor.
Abstract: In this article, we present a parallel prioritized Jacobian-based inverse kinematics algorithm for multithreaded architectures. We solve damped least squares inverse kinematics using a parallel line search by identifying and sampling critical input parameters. Parallel competing execution paths are spawned for each parameter in order to select the optimum that minimizes the error criteria. Our algorithm is highly scalable and can handle complex articulated bodies at interactive frame rates. We show results on complex skeletons consisting of more than 600 degrees of freedom while being controlled using multiple end effectors. We implement the algorithm both on multicore and GPU architectures and demonstrate how the GPU can further exploit fine-grain parallelism not directly available on a multicore processor. Our implementations are 10 to 150 times faster compared to a state-of-the-art serial implementation while providing higher accuracy. We also demonstrate the scalability of the algorithm over multiple scenarios and explore the GPU implementation in detail.
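For reference, the standard (unprioritized) damped-least-squares update that such solvers iterate is shown below; the notation is mine, and the paper's prioritized formulation additionally projects lower-priority tasks into the null space of higher-priority ones, which this sketch omits.

```latex
% J: end-effector Jacobian, e: task-space error, \Delta\theta: joint update,
% \lambda: damping factor (a natural candidate for the parallel line search)
\Delta\theta \;=\; J^{\top}\left(J J^{\top} + \lambda^{2} I\right)^{-1} e
```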

23 citations


Cited by
Proceedings Article
04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power consumption, and it has led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

2,697 citations

Proceedings Article
26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Abstract: Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including the choice of interconnect topology, use of caches, design of the memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
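Observation (2) can be illustrated with a common CUDA idiom (a sketch under my own assumptions, not taken from the paper): launch fewer thread blocks than the GPU could host and use a grid-stride loop so every element is still processed, which throttles concurrency and can ease memory-system contention.

```cuda
// Deliberately under-subscribing the GPU: the grid-stride loop covers all n
// elements regardless of grid size, so the launch below can use a small,
// tunable number of blocks. The kernel body is a stand-in for a memory-bound
// workload; the right grid size is workload-dependent.
__global__ void scale_kernel(const float *in, float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)                   // grid-stride loop
        out[i] = 2.0f * in[i];
}

// Host: instead of ceil(n / 256.0) blocks, launch a deliberately small grid,
// e.g. scale_kernel<<<32, 256>>>(d_in, d_out, n);
```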

1,558 citations

Proceedings Article
13 Jun 2015
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.

718 citations

Proceedings Article
25 Feb 2012
TL;DR: This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity.
Abstract: Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
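A hedged sketch of the prefix-sum idea: an exclusive scan over the frontier vertices' out-degrees gives every vertex an exact write offset into the next edge frontier, so no thread does more than its own share of edge work and the total stays O(|V|+|E|). Names and structure are illustrative, not the authors' implementation.

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Expand one BFS level: thread i owns frontier vertex i and copies that
// vertex's neighbours to a pre-computed, contiguous slot in the edge frontier.
__global__ void expand_kernel(const int *frontier,      // vertex ids in current frontier
                              int frontier_size,
                              const int *row_offsets,    // CSR
                              const int *col_indices,
                              const int *write_offsets,  // exclusive scan of out-degrees
                              int *edge_frontier)        // gathered neighbours
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int v     = frontier[i];
    int begin = row_offsets[v];
    int count = row_offsets[v + 1] - begin;
    int out   = write_offsets[i];
    for (int k = 0; k < count; ++k)
        edge_frontier[out + k] = col_indices[begin + k];
}

// Host side (per level): gather the frontier's out-degrees, then
//   thrust::exclusive_scan(degrees.begin(), degrees.end(), offsets.begin());
// and launch expand_kernel with one thread per frontier vertex.
```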

541 citations

Proceedings Article
04 Nov 2012
TL;DR: This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures.
Abstract: GPUs have been used to accelerate many regular applications and, more recently, irregular applications in which the control flow and memory access patterns are data-dependent and statically unpredictable. This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures. For a suite of 13 benchmarks, we find that (i) irregularity at the warp level varies widely, (ii) control-flow irregularity and memory-access irregularity are largely independent of each other, and (iii) most kernels, including regular ones, exhibit some irregularity. A program's irregularity can change between different inputs, systems, and arithmetic precision but generally stays in a specific region of the irregularity space. Whereas some highly tuned implementations of irregular algorithms exhibit little irregularity, trading off extra irregularity for better locality or less work can improve overall performance.

371 citations