Proceedings ArticleDOI

Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework

TL;DR: A framework is presented that enables applications executing within virtual machines to transparently share one or more GPUs; even when contention is high, the consolidation algorithm is effective in improving throughput, and the runtime overhead of the framework is low.
Abstract: Driven by the emergence of GPUs as a major player in high performance computing and the rapidly growing popularity of cloud environments, GPU instances are now being offered by cloud providers. The use of GPUs in a cloud environment, however, is still in its initial stages, and the challenge of making the GPU a true shared resource in the cloud has not yet been addressed. This paper presents a framework that enables applications executing within virtual machines to transparently share one or more GPUs. Our contributions are twofold: we extend open-source GPU virtualization software to include efficient GPU sharing, and we propose solutions to the conceptual problem of GPU kernel consolidation. In particular, we introduce a method for computing the affinity score between two or more kernels, which provides an indication of potential performance improvements upon kernel consolidation. In addition, we explore molding as a means to achieve efficient GPU sharing even for kernels with high or conflicting resource requirements. We use these concepts to develop an algorithm that efficiently maps a set of kernels onto a pair of GPUs. We extensively evaluate our framework using eight popular GPU kernels and two Fermi GPUs. We find that even when contention is high our consolidation algorithm is effective in improving throughput, and that the runtime overhead of our framework is low.
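
The abstract describes the affinity score and molding only at a high level. As a rough illustration of the idea, the Python sketch below scores candidate kernel pairs by how well their compute and memory-bandwidth demands complement each other, molds oversized grids down, and greedily picks the best pair to consolidate. The resource model, the scoring formula, and all names are assumptions made for illustration, not the paper's actual algorithm.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Kernel:
    name: str
    compute_util: float    # assumed fraction of SM throughput used (0..1)
    bandwidth_util: float  # assumed fraction of memory bandwidth used (0..1)
    blocks: int            # thread blocks in the launch grid

def affinity(a: Kernel, b: Kernel) -> float:
    """Hypothetical affinity score: complementary kernels (one compute-
    bound, one bandwidth-bound) score high; oversubscription is penalized."""
    compute = a.compute_util + b.compute_util
    bandwidth = a.bandwidth_util + b.bandwidth_util
    overflow = max(0.0, compute - 1.0) + max(0.0, bandwidth - 1.0)
    return (min(compute, 1.0) + min(bandwidth, 1.0)) / 2.0 - overflow

def mold(k: Kernel, max_blocks: int) -> Kernel:
    """Molding stand-in: shrink the grid so the kernel fits alongside
    another, assuming demand scales linearly with the number of blocks."""
    if k.blocks <= max_blocks:
        return k
    s = max_blocks / k.blocks
    return Kernel(k.name, k.compute_util * s, k.bandwidth_util * s, max_blocks)

kernels = [Kernel("blackscholes", 0.8, 0.2, 512),
           Kernel("stream_copy", 0.1, 0.9, 2048),
           Kernel("matmul", 0.7, 0.4, 1024)]
HALF_GPU = 1024  # assumed per-kernel block budget when two kernels share

best = max(combinations(kernels, 2),
           key=lambda p: affinity(mold(p[0], HALF_GPU), mold(p[1], HALF_GPU)))
print("consolidate:", best[0].name, "+", best[1].name)
```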
Citations
Proceedings ArticleDOI
16 Mar 2013
TL;DR: This work studies concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs, and proposes transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage.
Abstract: Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite we find concurrent execution is often no better than serialized execution. We identify that the lack of control over resource allocation to kernels is a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.
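
The elastic-kernel transformation itself rewrites CUDA source, but the core trick is index arithmetic: a large logical grid is executed by a physical grid whose size the scheduler controls, with each physical block striding over several logical block IDs. The toy Python simulation below (all names invented) shows just that remapping, not the paper's actual transformation.

```python
def run_elastic(logical_blocks: int, physical_blocks: int, kernel_body):
    """Simulate an elastic kernel: a scheduler picks the physical grid
    size, and each physical block strides over the logical block IDs of
    the original launch."""
    for phys_id in range(physical_blocks):   # hardware schedules these blocks
        logical_id = phys_id
        while logical_id < logical_blocks:   # grid-stride loop over block IDs
            kernel_body(logical_id)
            logical_id += physical_blocks

# Toy "kernel body": record which logical block ran.
ran = []
run_elastic(logical_blocks=10, physical_blocks=3, kernel_body=ran.append)
assert sorted(ran) == list(range(10))        # every logical block ran once
```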

211 citations


Cites background or methods from "Supporting GPU sharing in cloud env..."

  • ...[14] which uses GPU concurrency to improve GPU throughput for applications in the cloud....


  • ...Past work [1, 7, 8, 14] has motivated GPU concurrency as a method to improve GPU throughput....


  • ...Although these works partition GPU resources among concurrent kernels, the granularity of their techniques is either too coarse, operating at the level of a thread block [1, 7, 8], or their techniques are not general enough to apply to all kernels [14]....


  • ...[14] look at GPGPU applications in the cloud and Guevara et al....


Journal ArticleDOI
14 Jun 2014
TL;DR: This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies; it extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Abstract: GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to meet key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
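
The proposals are hardware extensions, so they cannot be reproduced directly in software; the sketch below only illustrates the kind of policy the abstract describes, splitting a GPU's SMs among runnable kernels in proportion to priority while keeping every kernel runnable. The weights, the rounding rule, and all names are invented for illustration.

```python
def allocate_sms(total_sms: int, priorities: dict[str, int]) -> dict[str, int]:
    """Hypothetical priority-proportional split of SMs among kernels,
    guaranteeing every runnable kernel at least one SM."""
    weight = sum(priorities.values())
    alloc = {k: max(1, total_sms * p // weight) for k, p in priorities.items()}
    # Hand out (or reclaim) any rounding remainder, highest priority first.
    leftover = total_sms - sum(alloc.values())
    for k in sorted(priorities, key=priorities.get, reverse=True):
        if leftover == 0:
            break
        step = 1 if leftover > 0 else -1
        if alloc[k] + step >= 1:
            alloc[k] += step
            leftover -= step
    return alloc

print(allocate_sms(15, {"render": 4, "analytics": 1, "background": 1}))
# {'render': 11, 'analytics': 2, 'background': 2}
```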

191 citations


Cites methods from "Supporting GPU sharing in cloud env..."

  • ...[29] rely on the molding technique (changing the dimensions of grid and thread blocks while preserving the correctness of the computation), when possible....


Proceedings ArticleDOI
30 Nov 2015
TL;DR: A framework that integrates reconfigurable accelerators in a standard server with virtualised resource management and communication is discussed, and a case study is presented that quantifies the efficiency benefits and break-even point for integrating FPGAs in the cloud.
Abstract: Hardware accelerators implement custom architectures to significantly speed up computations in a wide range of domains. As performance scaling in server-class CPUs slows, we propose the integration of hardware accelerators in the cloud as a way to maintain a positive performance trend. Field programmable gate arrays (FPGAs) represent the ideal way to integrate accelerators in the cloud, since they can be reprogrammed as needs change and allow multiple accelerators to share optimised communication infrastructure. We discuss a framework that integrates reconfigurable accelerators in a standard server with virtualised resource management and communication. We then present a case study that quantifies the efficiency benefits and break-even point for integrating FPGAs in the cloud.
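
As a flavor of the break-even analysis mentioned above, the sketch below uses an Amdahl-style model to ask what fraction of a job must be FPGA-acceleratable before a pricier FPGA-equipped server wins on cost per job. The formula and all prices are illustrative assumptions, not the paper's measured numbers.

```python
def break_even_fraction(cpu_cost: float, fpga_cost: float, speedup: float) -> float:
    """Fraction f of runtime that must be FPGA-accelerated (at the given
    speedup) for an FPGA-equipped server to match a plain server's cost
    per job. Amdahl-style: accelerated runtime = (1 - f) + f / speedup."""
    return (1 - cpu_cost / fpga_cost) / (1 - 1 / speedup)

# E.g. a server that costs 30% more but accelerates hot loops 10x:
f = break_even_fraction(cpu_cost=1.00, fpga_cost=1.30, speedup=10.0)
print(f"break-even at {f:.0%} of runtime accelerated")  # ~26%
```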

156 citations


Cites background from "Supporting GPU sharing in cloud env..."

  • ...GPUs can offer significant performance benefits but cloud integration can be troublesome, since the architectures are designed to be used monolithically [4]....


Proceedings ArticleDOI
25 Mar 2016
TL;DR: Baymax is presented, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization.
Abstract: Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different than contention on multi-core CPUs and introduces a new set of challenges in reducing QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.
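
Baymax's contribution is duration prediction plus runtime reordering on a non-preemptive accelerator; the toy sketch below captures only the flavor of such a policy, slotting batch tasks ahead of user-facing work only when their predicted durations leave the latency target intact. The task fields, the numbers, and the policy itself are hypothetical, not the actual Baymax scheduler.

```python
from collections import deque

def order_tasks(tasks, tail_target_ms):
    """Toy QoS-aware ordering: each task carries a predicted GPU duration;
    a batch task runs ahead of user-facing work only if the queuing delay
    it adds keeps user-facing tasks under the tail-latency target."""
    user = [t for t in tasks if t["user_facing"]]
    batch = deque(t for t in tasks if not t["user_facing"])
    schedule, queue_ms = [], 0.0
    for t in user:
        # Opportunistically run batch work that fits in the remaining slack.
        while batch and queue_ms + batch[0]["pred_ms"] + t["pred_ms"] <= tail_target_ms:
            b = batch.popleft()
            schedule.append(b)
            queue_ms += b["pred_ms"]
        schedule.append(t)
        queue_ms += t["pred_ms"]
    schedule.extend(batch)  # leftover batch work runs after user-facing tasks
    return schedule

tasks = [{"name": "dnn_infer", "pred_ms": 8, "user_facing": True},
         {"name": "training_chunk", "pred_ms": 40, "user_facing": False},
         {"name": "asr_decode", "pred_ms": 12, "user_facing": True}]
print([t["name"] for t in order_tasks(tasks, tail_target_ms=25)])
# ['dnn_infer', 'asr_decode', 'training_chunk']
```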

117 citations


Cites background from "Supporting GPU sharing in cloud env..."

  • ...GPU resource sharing has been studied at both system [65, 66] and architecture levels [67, 68] to address the resource contention and performance interference....


Proceedings Article
19 Jun 2014
TL;DR: gVirt is introduced, a product-level GPU virtualization implementation with: 1) full GPU virtualization running a native graphics driver in the guest, and 2) mediated pass-through that achieves both good performance and scalability, as well as secure isolation among guests.
Abstract: Graphics Processing Unit (GPU) virtualization is an enabling technology in emerging virtualization scenarios. Unfortunately, existing GPU virtualization approaches are still suboptimal in performance and full feature support. This paper introduces gVirt, a product-level GPU virtualization implementation with: 1) full GPU virtualization running a native graphics driver in the guest, and 2) mediated pass-through that achieves both good performance and scalability, as well as secure isolation among guests. gVirt presents a virtual full-fledged GPU to each VM. VMs can directly access performance-critical resources, without intervention from the hypervisor in most cases, while privileged operations from the guest are trap-and-emulated at minimal cost. Experiments demonstrate that gVirt can achieve up to 95% native performance for GPU-intensive workloads, and scale well up to 7 VMs.
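
The heart of mediated pass-through is the split between direct access and trapping; the fragment below sketches that dispatch for a made-up register layout. The address ranges and handler names are invented, not gVirt's real ones.

```python
PASS_THROUGH_RANGES = [(0x0000, 0x0FFF)]   # invented performance-critical MMIO

def handle_mmio_write(addr, value, hw_write, emulate):
    """Dispatch a guest register write: performance-critical ranges go
    straight to hardware (pass-through); anything privileged traps to a
    mediator for emulation."""
    if any(lo <= addr <= hi for lo, hi in PASS_THROUGH_RANGES):
        hw_write(addr, value)    # direct access: no hypervisor involvement
    else:
        emulate(addr, value)     # trap-and-emulate: mediator validates it

log = []
handle_mmio_write(0x0040, 1, hw_write=lambda a, v: log.append(("hw", a)),
                  emulate=lambda a, v: log.append(("trap", a)))
handle_mmio_write(0x2000, 1, hw_write=lambda a, v: log.append(("hw", a)),
                  emulate=lambda a, v: log.append(("trap", a)))
print(log)   # [('hw', 64), ('trap', 8192)]
```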

114 citations

References
Proceedings ArticleDOI
15 Nov 2008
TL;DR: Using the Amazon cloud fee structure and a real-life astronomy application, the cost-performance trade-offs of different execution and resource provisioning plans are studied, and it is shown that by provisioning the right amount of storage and compute resources, cost can be significantly reduced with no significant impact on application performance.
Abstract: Utility grids such as the Amazon EC2 cloud and Amazon S3 offer computational and storage resources that can be used on-demand for a fee by compute- and data-intensive applications. The cost of running an application on such a cloud depends on the compute, storage, and communication resources it will provision and consume. Different execution plans of the same application may result in significantly different costs. Using the Amazon cloud fee structure and a real-life astronomy application, we study via simulation the cost-performance trade-offs of different execution and resource provisioning plans. We also study these trade-offs in the context of the storage and communication fees of Amazon S3 when used for long-term application data archival. Our results show that by provisioning the right amount of storage and compute resources, cost can be significantly reduced with no significant impact on application performance.
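
A minimal version of this kind of cost comparison can be written down directly: each provisioning plan's cost is compute plus transfer plus storage, and plans trade one term against another. The rates below are placeholders, not actual (or 2008-era) Amazon prices.

```python
def plan_cost(cpu_hours, gb_transferred, gb_stored, months_stored,
              cpu_rate=0.10, transfer_rate=0.09, storage_rate=0.023):
    """Toy cloud cost model: compute + data transfer + long-term storage.
    All rates are made-up placeholders in $/hour, $/GB, and $/GB-month."""
    return (cpu_hours * cpu_rate
            + gb_transferred * transfer_rate
            + gb_stored * storage_rate * months_stored)

# Two plans for the same workflow: recompute intermediates vs. store them.
recompute = plan_cost(cpu_hours=120, gb_transferred=5, gb_stored=2, months_stored=12)
store_all = plan_cost(cpu_hours=80, gb_transferred=5, gb_stored=50, months_stored=12)
print(f"recompute: ${recompute:.2f}  store-all: ${store_all:.2f}")
```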

690 citations


"Supporting GPU sharing in cloud env..." refers background in this paper

  • ...growing popularity of cloud environments, including their use for high performance applications [4, 13, 17, 32]....


Proceedings ArticleDOI
12 Dec 2009
TL;DR: Adaptive mapping is proposed, a fully automatic technique to map computations to processing elements on a CPU+GPU machine; by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption relative to static mappings, on average, for a set of important computation benchmarks.
Abstract: Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware/software configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption relative to static mappings, on average, for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.
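
Qilin's real adaptive mapping fits execution-time models from training runs; the sketch below shows only the equal-finish-time special case, splitting work in proportion to throughputs measured in a profiling run. The rates and names are made up.

```python
def adapt_split(n_items, cpu_rate, gpu_rate):
    """Split the input so CPU and GPU finish together, given measured
    throughputs (items/sec) from a small profiling run."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    n_gpu = round(n_items * gpu_share)
    return n_items - n_gpu, n_gpu

n_cpu, n_gpu = adapt_split(1_000_000, cpu_rate=50_000, gpu_rate=450_000)
print(n_cpu, n_gpu)  # 100000 900000: both sides finish in ~2 s
```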

565 citations


"Supporting GPU sharing in cloud env..." refers background in this paper

  • ...The work presented in this paper is driven by two independent recent trends, which are the emergence of GPUs as a major player in high performance computing [1, 2, 18, 21, 29, 30], and the rapidly...


Proceedings ArticleDOI
01 Apr 2009
TL;DR: Experimental evaluation with RUBiS and TPC-W benchmarks along with production-trace-driven workloads indicates that AutoControl can detect and mitigate CPU and disk I/O bottlenecks that occur over time and across multiple nodes by allocating each resource accordingly.
Abstract: Virtualized data centers enable sharing of resources among hosted applications. However, it is difficult to satisfy service-level objectives (SLOs) of applications on shared infrastructure, as application workloads and resource consumption patterns change over time. In this paper, we present AutoControl, a resource control system that automatically adapts to dynamic workload changes to achieve application SLOs. AutoControl is a combination of an online model estimator and a novel multi-input, multi-output (MIMO) resource controller. The model estimator captures the complex relationship between application performance and resource allocations, while the MIMO controller allocates the right amount of multiple virtualized resources to achieve application SLOs. Our experimental evaluation with RUBiS and TPC-W benchmarks along with production-trace-driven workloads indicates that AutoControl can detect and mitigate CPU and disk I/O bottlenecks that occur over time and across multiple nodes by allocating each resource accordingly. We also show that AutoControl can be used to provide service differentiation according to the application priorities during resource contention.
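
AutoControl couples an online model estimator with a MIMO controller; the sketch below strips that down to a single-resource proportional feedback rule, just to show the shape of the control loop. The gain, bounds, and workload numbers are invented, not the system's actual design.

```python
def control_step(alloc, perf, slo, gain=0.5, min_alloc=0.05, max_alloc=1.0):
    """Bare-bones feedback step: raise a VM's resource cap when measured
    performance misses its SLO, lower it when there is headroom."""
    error = (slo - perf) / slo            # > 0 means the SLO is being missed
    new_alloc = alloc * (1 + gain * error)
    return min(max_alloc, max(min_alloc, new_alloc))

alloc = 0.30                              # start at 30% of a CPU
for measured_rps in [80, 85, 95, 100]:    # the app's SLO is 100 req/s
    alloc = control_step(alloc, perf=measured_rps, slo=100)
    print(f"alloc -> {alloc:.2f}")
```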

553 citations


"Supporting GPU sharing in cloud env..." refers background or methods in this paper

  • ...There has been much research on the topic of virtualized resource allocation [5, 11, 16, 20, 25]....


  • ...[20] P. Padala et al. Automated control of multiple virtualized resources....



  • ...A similar control model was used by Padala [20] for virtualized resources management, targeting web applications....


  • ...[11] J. Heo, X. Zhu, P. Padala, and Z....


Proceedings ArticleDOI
23 May 2009
TL;DR: This work compares and contrasts the performance and monetary cost-benefits of clouds for desktop grid applications, ranging in computational size and storage, and examines performance measurements and monetary expenses of real desktop grids and the Amazon elastic compute cloud.
Abstract: Cloud Computing has taken commercial computing by storm. However, adoption of cloud computing platforms and services by the scientific community is in its infancy as the performance and monetary cost-benefits for scientific applications are not perfectly clear. This is especially true for desktop grids (aka volunteer computing) applications. We compare and contrast the performance and monetary cost-benefits of clouds for desktop grid applications, ranging in computational size and storage. We address the following questions: (i) What are the performance tradeoffs in using one platform over the other? (ii) What are the specific resource requirements and monetary costs of creating and deploying applications on each platform? (iii) In light of those monetary and performance cost-benefits, how do these platforms compare? (iv) Can cloud computing platforms be used in combination with desktop grids to improve cost-effectiveness even further? We examine those questions using performance measurements and monetary expenses of real desktop grids and the Amazon elastic compute cloud.
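
Question (iv) above lends itself to a back-of-the-envelope model: let the volunteer grid finish what it can by the deadline for free, and burst the overflow to paid cloud nodes. The sketch below does exactly that; the rates and the hourly price are invented, not the paper's measurements.

```python
import math

def hybrid_plan(tasks: int, deadline_h: float, grid_rate: float,
                cloud_rate: float, cloud_price: float):
    """The volunteer grid processes grid_rate tasks/hour for free; any
    overflow is spread over just enough cloud nodes (cloud_rate tasks/hour
    each, cloud_price $/hour each) to meet the deadline."""
    overflow = max(0.0, tasks - grid_rate * deadline_h)
    nodes = math.ceil(overflow / (cloud_rate * deadline_h)) if overflow else 0
    return nodes, nodes * cloud_price * deadline_h

nodes, cost = hybrid_plan(tasks=10_000, deadline_h=24,
                          grid_rate=300, cloud_rate=50, cloud_price=0.10)
print(f"{nodes} cloud nodes, ${cost:.2f}")   # 3 cloud nodes, $7.20
```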

383 citations


"Supporting GPU sharing in cloud env..." refers background in this paper

  • ...growing popularity of cloud environments, including their use for high performance applications [4, 13, 17, 32]....


Proceedings ArticleDOI
20 Oct 2006
TL;DR: This work describes Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses, and compares the performance of Accelerator versions of a set of compute-intensive benchmarks against hand-written pixel shaders.
Abstract: GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses instead. Programmers use a conventional imperative programming language and a library that provides only high-level data-parallel operations. No aspects of GPUs are exposed to programmers. The library implementation compiles the data-parallel operations on the fly to optimized GPU pixel shader code and API calls. We describe the compilation techniques used to do this. We evaluate the effectiveness of using data parallelism to program GPUs by providing results for a set of compute-intensive benchmarks. We compare the performance of Accelerator versions of the benchmarks against hand-written pixel shaders. The speeds of the Accelerator versions are typically within 50% of the speeds of hand-written pixel shader code. Some benchmarks significantly outperform C versions on a CPU: they are up to 18 times faster than C code running on a CPU.
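
Accelerator's key design point is that the library exposes only whole-array data-parallel operations and compiles the resulting expression tree to shader code at evaluation time. The Python sketch below mimics that structure with a lazy expression tree whose "compilation" is just interpretation over lists; the class and its operators are invented stand-ins, not Accelerator's API.

```python
class ParallelArray:
    """Lazy whole-array operations: arithmetic builds an expression tree,
    which a real system would JIT-compile to GPU shader code; here
    evaluation is plain interpretation over Python lists."""
    def __init__(self, data=None, op=None, args=()):
        self.data, self.op, self.args = data, op, args

    def __add__(self, other):
        return ParallelArray(op=lambda x, y: x + y, args=(self, other))

    def __mul__(self, other):
        return ParallelArray(op=lambda x, y: x * y, args=(self, other))

    def evaluate(self):
        if self.data is not None:          # leaf node: concrete data
            return self.data
        lhs, rhs = (a.evaluate() for a in self.args)
        return [self.op(x, y) for x, y in zip(lhs, rhs)]

a = ParallelArray([1.0, 2.0, 3.0])
b = ParallelArray([4.0, 5.0, 6.0])
print((a * a + b).evaluate())              # [5.0, 9.0, 15.0]
```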

357 citations


"Supporting GPU sharing in cloud env..." refers background in this paper

  • ...The work presented in this paper is driven by two independent recent trends, which are the emergence of GPUs as a major player in high performance computing [1, 2, 18, 21, 29, 30], and the rapidly...
