Home
/
Authors
/
Anshuman Goswami

Author

Anshuman Goswami

Other affiliations: Georgia Institute of Technology

Bio: Anshuman Goswami is an academic researcher from Nvidia. The author has contributed to research in topics: Scheduling (computing) & Catalysis. The author has an hindex of 4, co-authored 7 publications receiving 71 citations. Previous affiliations of Anshuman Goswami include Georgia Institute of Technology.

Topics: Scheduling (computing), Catalysis, Chemistry, Reactivity (psychology), CUDA ...read more

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Scheduling multi-tenant cloud workloads on accelerator-based systems

[...]

Dipanjan Sengupta¹, Anshuman Goswami¹, Karsten Schwan¹, Krishna Pallavi¹•Institutions (1)

Georgia Institute of Technology¹

16 Nov 2014

TL;DR: The Strings scheduler realizes the vision of a dynamic model where GPUs are treated as first class schedulable entities by decomposing the GPU scheduling problem into a combination of load balancing and per-device scheduling.

...read moreread less

Abstract: Accelerator-based systems are making rapid inroads into becoming platforms of choice for high end cloud services. There is a need therefore, to move from the current model in which high performance applications explicitly and programmatically select the GPU devices on which to run, to a dynamic model where GPUs are treated as first class schedulable entities. The Strings scheduler realizes this vision by decomposing the GPU scheduling problem into a combination of load balancing and per-device scheduling. (i) Device-level scheduling efficiently uses all of a GPU's hardware resources, including its computational and data movement engines, and (ii) load balancing goes beyond obtaining high throughput, to ensure fairness through prioritizing GPU requests that have attained least service. With its methods, Strings achieves improvements in system throughput and fairness of up to 8.70x and 13%, respectively, compared to the CUDA runtime.

...read moreread less

34 citations

Proceedings Article•DOI•

GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM

[...]

Sreeram Potluri¹, Anshuman Goswami¹, Davide Rossetti¹, Chris J. Newburn¹, Manjunath Gorentla Venkata², Neena Imam² - Show less +2 more•Institutions (2)

Nvidia¹, Oak Ridge National Laboratory²

01 Dec 2017

TL;DR: This work evaluates different design alternatives for Mellanox InfiniBand adapters in CUDA, taking into consideration the relaxed memory model, automatic memory access coalescing and thread hierarchy on the GPU, and implements a 2dstencil application kernel using NVSHMEM.

...read moreread less

Abstract: GPUs have become an essential component for building compute clusters with high compute density and high performance per watt. As such clusters scale to have 1000s of GPUs, efficiently moving data between the GPUs becomes imperative to get maximum performance. NVSHMEM is an implementation of the OpenSHMEM standard for NVIDIA GPU clusters which allows communication to be issued from inside GPU kernels. In earlier work, we have shown how NVSHMEM can be used to achieve better application performance on GPUs connected through PCIe or NVLink. As part of this effort, we implement IB verbs for Mellanox InfiniBand adapters in CUDA. We evaluate different design alternatives, taking into consideration the relaxed memory model, automatic memory access coalescing and thread hierarchy on the GPU. We also consider correctness issues that arise in these designs. We take advantage of these designs transparently or through API extensions in NVSHMEM. With micro-benchmarks, we show that a Nvidia Pascal P100 GPU is able saturate the network bandwidth using only one or two of its 56 available streaming multiprocessors (SM). On a single GPU using a single IB EDR adapter, we achieve a throughput of around 90 million messages per second. In addition, we implement a 2dstencil application kernel using NVSHMEM and compare its performance with a CUDA-aware MPI-based implementation that uses GPUDirect RDMA. Speedups in the range of 23% to 42% are seen for input sizes large enough to fill the occupancy of Nvidia Pascal P100 GPUs on 2 to 4 nodes indicating that there are gains to be had by eliminating the CPU from the communication path when all computation runs on the GPU.

...read moreread less

23 citations

Proceedings Article•DOI•

Landrush: rethinking in-situ analysis for GPGPU workflows

[...]

Anshuman Goswami¹, Yuan Tian², Karsten Schwan¹, Fang Zheng³, Jeffrey Young¹, Matthew Wolf¹, Greg Eisenhauer¹, Scott Klasky² - Show less +4 more•Institutions (3)

Georgia Institute of Technology¹, Oak Ridge National Laboratory², IBM³

16 May 2016

TL;DR: The 'Landrush' approach to GPU sharing proposes a solution that utilizes idle cycles on the GPU to provide an improved time-to-answer, that is, the total time to run the scientific simulation and analysis of the generated data.

...read moreread less

Abstract: In-situ analysis on the output data of scientific simulations has been made necessary by ever-growing output data volumes and increasing costs of data movement as supercomputing is moving towards exascale. With hardware accelerators like GPUs becoming increasingly common in high end machines, new opportunities arise to co-locate scientific simulations and online analysis performed on the scientific data generated by the simulations. However, the asynchronous nature of GPGPU programming models and the limited context-switching capabilities on the GPU pose challenges to co-locating the scientific simulation and analysis on the same GPU. This paper dives deeper into these challenges to understand how best to co-locate analysis with scientific simulations on the GPUs in HPC clusters. Specifically, our 'Landrush' approach to GPU sharing proposes a solution that utilizes idle cycles on the GPU to provide an improved time-to-answer, that is, the total time to run the scientific simulation and analysis of the generated data. Landrush is demonstrated with experimental results obtained from leadership high-end applications on ORNL's Titan supercomputer, which show that (i) GPU-based scientific simulations have varying degrees of idle cycles to afford useful analysis task co-location, and (ii) the inability to context switch on the GPU at instruction granularity can be overcome by careful control of the analysis kernel launches and software-controlled early completion of analysis kernel executions. Results show that Landrush is superior in terms of time-to-answer compared to serially running simulations followed by analysis or by relying on the GPU driver and hardwired thread dispatcher to run analysis concurrently on a single GPU.

...read moreread less

14 citations

Proceedings Article•DOI•

GPUShare: Fair-Sharing Middleware for GPU Clouds

[...]

Anshuman Goswami¹, Jeffrey Young¹, Karsten Schwan¹, Naila Farooqui¹, Ada Gavrilovska¹, Matthew Wolf¹, Greg Eisenhauer¹ - Show less +3 more•Institutions (1)

Georgia Institute of Technology¹

23 May 2016

TL;DR: GPUShare is presented, a software-based mechanism that can yield a kernel before all of its threads have run, thus giving finer control over the time slice for which the GPU is allocated to a process and improves fair GPU sharing across tenants.

...read moreread less

Abstract: Many new cloud-focused applications such as deeplearning and graph analytics have started to rely on the highcomputing throughput of GPUs, but cloud providers cannotcurrently support fine-grained time-sharing on GPUs to enablemulti-tenancy for these types of applications. Currently, schedulingis performed by the GPU driver in combination with ahardware thread dispatcher to maximize utilization. However, when multiple applications with contrasting kernel running timesand high-utilization of the GPU need to be co-located, thisapproach unduly favors one or more of the applications at theexpense of others. This paper presents GPUShare, a middleware solution forGPU fair sharing among high-utilization, long-running applications. It begins by analyzing the scenarios under which thecurrent driver-based multi-process scheduling fails, noting thatsuch scenarios are quite common. It then describes a softwarebasedmechanism that can yield a kernel before all of its threadshave run, thus giving finer control over the time slice for whichthe GPU is allocated to a process. In controlling time slices onthe GPU by yielding kernels, GPUShare improves fair GPUsharing across tenants and outperforms the CUDA driver byup to 45% for two tenants and by up to 89% for more thantwo tenants, while incurring a maximum overhead of only 12%.Additional improvements are obtained from having a centralscheduler that further smooths out disparities across tenants'GPU shares improving fair sharing by up to 92% for two tenantsand by up to 76% for more than two tenants.

...read moreread less

13 citations

Journal Article•DOI•

Influence of framework Al density in chabazite zeolites on copper ion mobility and reactivity during NOx selective catalytic reduction with NH3

[...]

Siddarth H. Krishna, Anshuman Goswami, Yuji Wang, Casey B. Jones, David P. Dean, Jeffrey T. Miller, William F. Schneider, Rajamani Pachayappan Gounder - Show less +4 more

01 Mar 2023-Nature Catalysis

3 citations

Cited by

PDF

Open Access

More filters

Fast parallel algorithms for short-range molecular dynamics

[...]

Steven J. Plimpton¹•Institutions (1)

Sandia National Laboratories¹

01 May 1993

TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.

...read moreread less

Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently—those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers--the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and a 1840-node Intel Paragon performs up to 165 faster than a single Cray C9O processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

...read moreread less

29,323 citations

Journal Article•DOI•

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

[...]

Ang Li¹, Shuaiwen Leon Song¹, Jieyang Chen², Jiajia Li¹, Xu Liu³, Nathan R. Tallent¹, Kevin J. Barker¹ - Show less +3 more•Institutions (3)

Pacific Northwest National Laboratory¹, Oak Ridge National Laboratory², College of William & Mary³

01 Jan 2020-IEEE Transactions on Parallel and Distributed Systems

TL;DR: A thorough evaluation on five latest types of modern GPU interconnects from six high-end servers and HPC platforms shows that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance.

...read moreread less

Abstract: High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.

...read moreread less

118 citations

Journal Article•DOI•

GPU Virtualization and Scheduling Methods: A Comprehensive Survey

[...]

Cheol-Ho Hong¹, Ivor Spence¹, Dimitrios S. Nikolopoulos¹•Institutions (1)

Queen's University Belfast¹

29 Jun 2017-ACM Computing Surveys

TL;DR: An extensive and in-depth survey of GPU virtualization techniques and their scheduling methods is presented and a perspective on the challenges and opportunities for virtualization of heterogeneous computing environments is delivered.

...read moreread less

Abstract: The integration of graphics processing units (GPUs) on high-end compute nodes has established a new accelerator-based heterogeneous computing model, which now permeates high-performance computing. The same paradigm nevertheless has limited adoption in cloud computing or other large-scale distributed computing paradigms. Heterogeneous computing with GPUs can benefit the Cloud by reducing operational costs and improving resource and energy efficiency. However, such a paradigm shift would require effective methods for virtualizing GPUs, as well as other accelerators. In this survey article, we present an extensive and in-depth survey of GPU virtualization techniques and their scheduling methods. We review a wide range of virtualization techniques implemented at the GPU library, driver, and hardware levels. Furthermore, we review GPU scheduling methods that address performance and fairness issues between multiple virtual machines sharing GPUs. We believe that our survey delivers a perspective on the challenges and opportunities for virtualization of heterogeneous computing environments.

...read moreread less

84 citations

Proceedings Article•DOI•

GraphReduce: processing large-scale graphs on accelerator-based systems

[...]

Dipanjan Sengupta¹, Shuaiwen Leon Song², Kapil Agarwal¹, Karsten Schwan¹•Institutions (2)

Georgia Institute of Technology¹, Pacific Northwest National Laboratory²

15 Nov 2015

TL;DR: GraphReduce is presented, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity and significantly outperforms other competing out-of-memory approaches.

...read moreread less

Abstract: Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and device. GraphReduce-based programming is performed via device functions that include gatherMap, gatherReduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Extensive experimental evaluations for a wide variety of graph inputs and algorithms demonstrate that GraphReduce significantly outperforms other competing out-of-memory approaches.

...read moreread less

81 citations

Proceedings Article•DOI•

Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks

[...]

Soroush Ghodrati¹, Byung Hoon Ahn¹, Joon Kyung Kim¹, Sean Kinzer¹, Brahmendra Reddy Yatham¹, Navateja Alla¹, Hardik Sharma, Mohammad Alian², Eiman Ebrahimi³, Nam Sung Kim⁴, Cliff Young⁵, Hadi Esmaeilzadeh¹ - Show less +8 more•Institutions (5)

University of California, San Diego¹, University of Kansas², Nvidia³, University of Illinois at Urbana–Champaign⁴, Google⁵

01 Oct 2020

TL;DR: This paper defines Planaria1, a microarchitectural capability that can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime that enables spatially co-locating multiple DNN inference services on the same hardware, offering simultaneous multi-tenant DNN acceleration.

...read moreread less

Abstract: Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industries and markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enabler of this rather quick and invasive shift in the industry. To that end, mostly accelerator-based INFaaS (Google’s TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications. However, as the demand for such services grows, merely scaling-out the number of accelerators is not economically cost-effective. Although multi-tenancy has propelled datacenter scalability, it has not been a primary factor in designing DNN accelerators due to the arms race for higher speed and efficiency. This paper sets out to explore this timely requirement of multi-tenancy through a new dimension: dynamic architecture fission. To that end, we define Planaria1 that can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime. This microarchitectural capability enables spatially co-locating multiple DNN inference services on the same hardware, offering simultaneous multi-tenant DNN acceleration. To realize this dynamic reconfigurability, we first devise breakable omni-directional systolic arrays for DNN acceleration that allows omni-directional flow of data. Second, it uses this capability and a unique organization of on-chip memory, interconnection, and compute resources to enable fission in systolic array based DNN accelerators. Architecture fission and its associated flexibility enables an extra degree of freedom for task scheduling, that even allows breaking the accelerator with regard to the server load, DNN topology, and task priority. As such, it can simultaneously co-locate DNNs to enhance utilization, throughput, QoS, and fairness. We compare the proposed design to PREMA [4], a recent effort that offers multi-tenancy by time-multiplexing the DNN accelerator across multiple tasks. We use the same frequency, the same amount of compute and memory resources for both accelerators. The results show significant benefits with (soft, medium, hard) QoS requirements, in throughput (7.4×, 7.2×, 12.2×), SLA satisfaction rate (45%, 15%, 16%), and fairness (2.1×, 2.3×, 1.9×).

...read moreread less

72 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18

Collapse