Author

Pawan Harish

Bio: Pawan Harish is an academic researcher from the International Institute of Information Technology, Hyderabad. The author has contributed to research in the topics of Rendering (computer graphics) and Pixel. The author has an h-index of 7 and has co-authored 13 publications receiving 1,059 citations. Previous affiliations of Pawan Harish include École Polytechnique Fédérale de Lausanne.

Papers
Book Chapter
18 Dec 2007
TL;DR: This work presents a few fundamental graph algorithms - including breadth-first search, single-source shortest path, and all-pairs shortest path - implemented in CUDA on large graphs using the G80 line of Nvidia GPUs.
Abstract: Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers have been reported but are accessible only to a few. Graphics Processing Units (GPUs) of today have high computation power and low price. They have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. We present a few fundamental algorithms - including breadth-first search, single-source shortest path, and all-pairs shortest path - using CUDA on large graphs. We can compute the single-source shortest path on a 10 million vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases, the optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors.
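The following is a minimal sketch of the level-synchronous, one-thread-per-vertex BFS formulation the abstract describes, assuming a CSR (compressed sparse row) graph representation; kernel and variable names are illustrative, not the authors' code.

```cuda
// One BFS level: each thread owns a vertex, expands it only if it is in the
// current frontier, and labels unvisited neighbours. The write to levels[] is
// a benign race because every writer stores the same value.
__global__ void bfs_level_kernel(const int *row_offsets, const int *col_indices,
                                 int *levels,            // -1 = unvisited
                                 bool *frontier, bool *next_frontier,
                                 int num_vertices, int current_level, bool *changed)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || !frontier[v]) return;

    frontier[v] = false;
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int nbr = col_indices[e];
        if (levels[nbr] == -1) {
            levels[nbr] = current_level + 1;
            next_frontier[nbr] = true;
            *changed = true;
        }
    }
}
```

The host would initialize levels[source] = 0 and frontier[source] = true, then launch this kernel once per level, swapping the frontier arrays, until changed remains false.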

763 citations

Proceedings Article
01 Aug 2009
TL;DR: This paper presents a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Borůvka's approach for undirected graphs, implemented using scalable primitives such as scan, segmented scan and split.
Abstract: Graphics Processing Units are used for much general-purpose processing due to the high compute power available on them. Regular, data-parallel algorithms map well to the SIMD architecture of current GPUs. Irregular algorithms on discrete structures like graphs are harder to map to them. Efficient data-mapping primitives can play a crucial role in mapping such algorithms onto the GPU. In this paper, we present a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Borůvka's approach for undirected graphs. We implement it using scalable primitives such as scan, segmented scan, and split. The irregular steps of supervertex formation and recursive graph construction are mapped to primitives like split on categories involving vertex ids and edge weights. We obtain a 30 to 50 times speedup over the CPU implementation on most graphs and a 3 to 10 times speedup over our previous GPU implementation. We construct the minimum spanning tree on a 5 million node and 30 million edge graph in under 1 second on one quarter of the Tesla S1070 GPU.
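As one hedged illustration of how such primitives express an irregular step, the sketch below picks each vertex's minimum-weight outgoing edge (the core of a Borůvka iteration) with a segmented reduction, using Thrust's reduce_by_key in place of the segmented-scan primitive the paper describes; all names are illustrative, not the authors' code.

```cuda
// Sketch of one Borůvka step: find each vertex's lightest outgoing edge with a
// segmented reduction. Edges are assumed to be grouped by source vertex (CSR
// order); thrust::reduce_by_key stands in for the paper's segmented scan.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/pair.h>

using WeightedEdge = thrust::pair<float, int>;   // (weight, edge id)

struct MinByWeight {
    __host__ __device__
    WeightedEdge operator()(const WeightedEdge &a, const WeightedEdge &b) const {
        return (a.first <= b.first) ? a : b;     // keep the lighter edge
    }
};

// edge_src:  source vertex of every edge, sorted (CSR order)
// edge_info: (weight, edge id) for every edge, in the same order
// Produces one (vertex, lightest edge) pair per distinct source vertex.
void min_edge_per_vertex(const thrust::device_vector<int> &edge_src,
                         const thrust::device_vector<WeightedEdge> &edge_info,
                         thrust::device_vector<int> &out_vertex,
                         thrust::device_vector<WeightedEdge> &out_min_edge)
{
    thrust::reduce_by_key(edge_src.begin(), edge_src.end(),
                          edge_info.begin(),
                          out_vertex.begin(),
                          out_min_edge.begin(),
                          thrust::equal_to<int>(),
                          MinByWeight());
}
```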

126 citations

01 Jan 2009
TL;DR: This paper presents fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model.
Abstract: Graphics Processing Units (GPUs) provide high computation power at a low cost and are important compute accelerators with a massively multithreaded architecture. In this paper, we present fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model. Our implementations exhibit high performance, especially on large graphs. We use two data-parallel programming methodologies for these algorithms. One is an iterative, mask-based approach that processes valid data elements like vertices and edges using independent threads for each. The other is a divide-and-conquer approach that reduces the problem into smaller problems that are handled later using the same approach. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The massively multithreaded model of the GPU makes it possible to adopt the data-parallel approach even for irregular algorithms like graph algorithms, using O(V) or O(E) simultaneous threads. The algorithms and the underlying techniques presented in this paper are likely to be applicable to many irregular algorithms. We show the impact of our implementations on random, scale-free, and real-life graphs of up to millions of vertices on high-end and low-end GPUs. The availability and spread of GPUs to desktops and laptops make them ideal candidates to accelerate graph operations over CPU-only implementations. Practical implementations of basic operations go a long way in realizing their potential.
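A minimal sketch of the iterative, mask-based methodology, applied here to single-source shortest path relaxation: a mask marks the vertices that still need work, one thread handles each, and a flag tells the host whether another iteration is required. This is an assumption-laden illustration, not the paper's implementation.

```cuda
// Iterative, mask-based pattern: one thread per vertex, a mask selecting the
// active vertices, and a flag the host polls to decide whether to iterate again.
__global__ void relax_kernel(const int *row_offsets, const int *col_indices,
                             const int *edge_weights,
                             int *dist, bool *mask, bool *changed,
                             int num_vertices)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || !mask[v]) return;
    mask[v] = false;

    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = col_indices[e];
        int candidate = dist[v] + edge_weights[e];
        // atomicMin avoids lost updates when several threads relax the same vertex;
        // it returns the old value, so a larger old value means we improved dist[u].
        if (atomicMin(&dist[u], candidate) > candidate) {
            mask[u] = true;
            *changed = true;
        }
    }
}
// Host side: set dist[src] = 0 and mask[src] = true, then launch the kernel in a
// loop, copying *changed back each iteration, until it remains false.
```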

94 citations

Journal Article
TL;DR: Garuda is a client-server-based display wall framework that uses off-the-shelf hardware and a standard network that can transparently render any application built using the open scene graph (OSG) API to a tiled display without any modification by the user.
Abstract: Cluster-based tiled display walls can provide cost-effective and scalable displays with high resolution and a large display area. The software to drive them needs to scale too if arbitrarily large displays are to be built. Chromium is a popular software API used to construct such displays. Chromium transparently renders any OpenGL application to a tiled display by partitioning and sending individual OpenGL primitives to each client per frame. Visualization applications often deal with massive geometric data with millions of primitives. Transmitting them every frame results in huge network requirements that adversely affect the scalability of the system. In this paper, we present Garuda, a client-server-based display wall framework that uses off-the-shelf hardware and a standard network. Garuda is scalable to large tile configurations and massive environments. It can transparently render any application built using the open scene graph (OSG) API to a tiled display without any modification by the user. The Garuda server uses an object-based scene structure represented using a scene graph. The server determines the objects visible to each display tile using a novel adaptive algorithm that culls the scene graph to a hierarchy of frustums. Required parts of the scene graph are transmitted to the clients, which cache them to exploit the interframe redundancy. A multicast-based protocol is used to transmit the geometry to exploit the spatial redundancy present in tiled display systems. A geometry push philosophy from the server helps keep the clients in sync with one another. Neither the server nor a client needs to render the entire scene, making the system suitable for interactive rendering of massive models. Transparent rendering is achieved by intercepting the cull, draw, and swap functions of OSG and replacing them with our own. We demonstrate the performance and scalability of the Garuda system for different configurations of display wall. We also show that the server and network loads grow sublinearly with the increase in the number of tiles, which makes our scheme suitable to construct very large displays.
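A simplified host-side sketch of the central idea: traverse the scene graph once, test each node's bound against every tile's frustum, and record per tile the objects that tile must receive; subtrees invisible on all tiles are pruned. The types and the flat list of per-tile frustums are illustrative simplifications, not Garuda's or OSG's actual API (the paper's algorithm is adaptive and culls against a hierarchy of frustums).

```cuda
#include <cstddef>
#include <vector>

// Illustrative types, not Garuda's or OSG's actual data structures.
struct Sphere  { float x, y, z, r; };
struct Frustum { float planes[6][4]; };                 // ax + by + cz + d >= 0 is "inside"

static bool intersects(const Sphere &s, const Frustum &f) {
    for (const auto &p : f.planes)
        if (p[0] * s.x + p[1] * s.y + p[2] * s.z + p[3] < -s.r)
            return false;                               // fully outside one plane
    return true;
}

struct SceneNode {
    Sphere bound;
    std::vector<SceneNode *> children;
    int object_id = -1;                                 // leaf objects carry an id
};

// Walk the scene graph once; leaves record which display tiles need them, and
// subtrees whose bound misses every tile frustum are skipped entirely.
void cull(const SceneNode *node,
          const std::vector<Frustum> &tile_frustums,
          std::vector<std::vector<int>> &visible_per_tile)
{
    bool any_visible = false;
    for (std::size_t t = 0; t < tile_frustums.size(); ++t) {
        if (!intersects(node->bound, tile_frustums[t])) continue;
        any_visible = true;
        if (node->object_id >= 0)
            visible_per_tile[t].push_back(node->object_id);
    }
    if (!any_visible) return;
    for (const SceneNode *child : node->children)
        cull(child, tile_frustums, visible_per_tile);
}
```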

56 citations

Journal Article
TL;DR: This article proposes a parallel prioritized Jacobian-based inverse kinematics algorithm that can handle complex articulated bodies at interactive frame rates and demonstrates how the GPU can further exploit fine-grain parallelism not directly available on a multicore processor.
Abstract: In this article, we present a parallel prioritized Jacobian-based inverse kinematics algorithm for multithreaded architectures. We solve damped least squares inverse kinematics using a parallel line search by identifying and sampling critical input parameters. Parallel competing execution paths are spawned for each parameter in order to select the optimum that minimizes the error criteria. Our algorithm is highly scalable and can handle complex articulated bodies at interactive frame rates. We show results on complex skeletons consisting of more than 600 degrees of freedom while being controlled using multiple end effectors. We implement the algorithm both on multicore and GPU architectures and demonstrate how the GPU can further exploit fine-grain parallelism not directly available on a multicore processor. Our implementations are 10 to 150 times faster compared to a state-of-the-art serial implementation while providing higher accuracy. We also demonstrate the scalability of the algorithm over multiple scenarios and explore the GPU implementation in detail.
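For reference, the standard (unprioritized) damped-least-squares update that such solvers iterate is shown below; the notation is mine, and the paper's prioritized formulation additionally projects lower-priority tasks into the null space of higher-priority ones, which this sketch omits.

```latex
% J: end-effector Jacobian, e: task-space error, \Delta\theta: joint update,
% \lambda: damping factor (a natural candidate for the parallel line search)
\Delta\theta \;=\; J^{\top}\left(J J^{\top} + \lambda^{2} I\right)^{-1} e
```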

23 citations


Cited by
Proceedings Article
04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power consumption, and it has led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

2,697 citations

Proceedings Article
26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Abstract: Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including the choice of interconnect topology, use of caches, design of the memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
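Observation (2) can be illustrated with a common CUDA idiom (a sketch under my own assumptions, not taken from the paper): launch fewer thread blocks than the GPU could host and use a grid-stride loop so every element is still processed, which throttles concurrency and can ease memory-system contention.

```cuda
// Deliberately under-subscribing the GPU: the grid-stride loop covers all n
// elements regardless of grid size, so the launch below can use a small,
// tunable number of blocks. The kernel body is a stand-in for a memory-bound
// workload; the right grid size is workload-dependent.
__global__ void scale_kernel(const float *in, float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)                   // grid-stride loop
        out[i] = 2.0f * in[i];
}

// Host: instead of ceil(n / 256.0) blocks, launch a deliberately small grid,
// e.g. scale_kernel<<<32, 256>>>(d_in, d_out, n);
```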

1,558 citations

Proceedings Article
13 Jun 2015
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.

718 citations

Proceedings Article
25 Feb 2012
TL;DR: This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity.
Abstract: Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
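A hedged sketch of the prefix-sum idea: an exclusive scan over the frontier vertices' out-degrees gives every vertex an exact write offset into the next edge frontier, so no thread does more than its own share of edge work and the total stays O(|V|+|E|). Names and structure are illustrative, not the authors' implementation.

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Expand one BFS level: thread i owns frontier vertex i and copies that
// vertex's neighbours to a pre-computed, contiguous slot in the edge frontier.
__global__ void expand_kernel(const int *frontier,      // vertex ids in current frontier
                              int frontier_size,
                              const int *row_offsets,    // CSR
                              const int *col_indices,
                              const int *write_offsets,  // exclusive scan of out-degrees
                              int *edge_frontier)        // gathered neighbours
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int v     = frontier[i];
    int begin = row_offsets[v];
    int count = row_offsets[v + 1] - begin;
    int out   = write_offsets[i];
    for (int k = 0; k < count; ++k)
        edge_frontier[out + k] = col_indices[begin + k];
}

// Host side (per level): gather the frontier's out-degrees, then
//   thrust::exclusive_scan(degrees.begin(), degrees.end(), offsets.begin());
// and launch expand_kernel with one thread per frontier vertex.
```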

541 citations

Proceedings Article
04 Nov 2012
TL;DR: This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures.
Abstract: GPUs have been used to accelerate many regular applications and, more recently, irregular applications in which the control flow and memory access patterns are data-dependent and statically unpredictable. This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures. For a suite of 13 benchmarks, we find that (i) irregularity at the warp level varies widely, (ii) control-flow irregularity and memory-access irregularity are largely independent of each other, and (iii) most kernels, including regular ones, exhibit some irregularity. A program's irregularity can change between different inputs, systems, and arithmetic precision but generally stays in a specific region of the irregularity space. Whereas some highly tuned implementations of irregular algorithms exhibit little irregularity, trading off extra irregularity for better locality or less work can improve overall performance.

371 citations