Proceedings ArticleDOI

Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework

TL;DR: A framework is presented that enables applications executing within virtual machines to transparently share one or more GPUs; even when contention is high, the consolidation algorithm is effective in improving throughput, and the runtime overhead of the framework is low.
Abstract: Driven by the emergence of GPUs as a major player in high performance computing and the rapidly growing popularity of cloud environments, GPU instances are now being offered by cloud providers. The use of GPUs in a cloud environment, however, is still in its initial stages, and the challenge of making the GPU a true shared resource in the cloud has not yet been addressed. This paper presents a framework that enables applications executing within virtual machines to transparently share one or more GPUs. Our contributions are twofold: we extend open-source GPU virtualization software to include efficient GPU sharing, and we propose solutions to the conceptual problem of GPU kernel consolidation. In particular, we introduce a method for computing the affinity score between two or more kernels, which provides an indication of potential performance improvements upon kernel consolidation. In addition, we explore molding as a means to achieve efficient GPU sharing even for kernels with high or conflicting resource requirements. We use these concepts to develop an algorithm that efficiently maps a set of kernels onto a pair of GPUs. We extensively evaluate our framework using eight popular GPU kernels and two Fermi GPUs. We find that even when contention is high our consolidation algorithm is effective in improving throughput, and that the runtime overhead of our framework is low.
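
The abstract describes the affinity score and molding only at a high level. As a rough illustration of the idea, the Python sketch below scores candidate kernel pairs by how well their compute and memory-bandwidth demands complement each other, molds oversized grids down, and greedily picks the best pair to consolidate. The resource model, the scoring formula, and all names are assumptions made for illustration, not the paper's actual algorithm.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Kernel:
    name: str
    compute_util: float    # assumed fraction of SM throughput used (0..1)
    bandwidth_util: float  # assumed fraction of memory bandwidth used (0..1)
    blocks: int            # thread blocks in the launch grid

def affinity(a: Kernel, b: Kernel) -> float:
    """Hypothetical affinity score: complementary kernels (one compute-
    bound, one bandwidth-bound) score high; oversubscription is penalized."""
    compute = a.compute_util + b.compute_util
    bandwidth = a.bandwidth_util + b.bandwidth_util
    overflow = max(0.0, compute - 1.0) + max(0.0, bandwidth - 1.0)
    return (min(compute, 1.0) + min(bandwidth, 1.0)) / 2.0 - overflow

def mold(k: Kernel, max_blocks: int) -> Kernel:
    """Molding stand-in: shrink the grid so the kernel fits alongside
    another, assuming demand scales linearly with the number of blocks."""
    if k.blocks <= max_blocks:
        return k
    s = max_blocks / k.blocks
    return Kernel(k.name, k.compute_util * s, k.bandwidth_util * s, max_blocks)

kernels = [Kernel("blackscholes", 0.8, 0.2, 512),
           Kernel("stream_copy", 0.1, 0.9, 2048),
           Kernel("matmul", 0.7, 0.4, 1024)]
HALF_GPU = 1024  # assumed per-kernel block budget when two kernels share

best = max(combinations(kernels, 2),
           key=lambda p: affinity(mold(p[0], HALF_GPU), mold(p[1], HALF_GPU)))
print("consolidate:", best[0].name, "+", best[1].name)
```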
Citations
Proceedings ArticleDOI
16 Mar 2013
TL;DR: This work studies concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs, and proposes transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage.
Abstract: Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite we find concurrent execution is often no better than serialized execution. We identify that the lack of control over resource allocation to kernels is a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.
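
The elastic-kernel transformation itself rewrites CUDA source, but the core trick is index arithmetic: a large logical grid is executed by a physical grid whose size the scheduler controls, with each physical block striding over several logical block IDs. The toy Python simulation below (all names invented) shows just that remapping, not the paper's actual transformation.

```python
def run_elastic(logical_blocks: int, physical_blocks: int, kernel_body):
    """Simulate an elastic kernel: a scheduler picks the physical grid
    size, and each physical block strides over the logical block IDs of
    the original launch."""
    for phys_id in range(physical_blocks):   # hardware schedules these blocks
        logical_id = phys_id
        while logical_id < logical_blocks:   # grid-stride loop over block IDs
            kernel_body(logical_id)
            logical_id += physical_blocks

# Toy "kernel body": record which logical block ran.
ran = []
run_elastic(logical_blocks=10, physical_blocks=3, kernel_body=ran.append)
assert sorted(ran) == list(range(10))        # every logical block ran once
```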

211 citations


Cites background or methods from "Supporting GPU sharing in cloud env..."

  • ...[14] which uses GPU concurrency to improve GPU throughput for applications in the cloud....


  • ...Past work [1, 7, 8, 14] has motivated GPU concurrency as a method to improve GPU throughput....


  • ...Although these works partition GPU resources among concurrent kernels, the granularity of their techniques is either too coarse, operating at the level of a thread block [1, 7, 8], or their techniques are not general enough to apply to all kernels [14]....


  • ...[14] look at GPGPU applications in the cloud and Guevara et al....


Journal ArticleDOI
14 Jun 2014
TL;DR: This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies; it extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Abstract: GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to meet key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
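
The proposals are hardware extensions, so they cannot be reproduced directly in software; the sketch below only illustrates the kind of policy the abstract describes, splitting a GPU's SMs among runnable kernels in proportion to priority while keeping every kernel runnable. The weights, the rounding rule, and all names are invented for illustration.

```python
def allocate_sms(total_sms: int, priorities: dict[str, int]) -> dict[str, int]:
    """Hypothetical priority-proportional split of SMs among kernels,
    guaranteeing every runnable kernel at least one SM."""
    weight = sum(priorities.values())
    alloc = {k: max(1, total_sms * p // weight) for k, p in priorities.items()}
    # Hand out (or reclaim) any rounding remainder, highest priority first.
    leftover = total_sms - sum(alloc.values())
    for k in sorted(priorities, key=priorities.get, reverse=True):
        if leftover == 0:
            break
        step = 1 if leftover > 0 else -1
        if alloc[k] + step >= 1:
            alloc[k] += step
            leftover -= step
    return alloc

print(allocate_sms(15, {"render": 4, "analytics": 1, "background": 1}))
# {'render': 11, 'analytics': 2, 'background': 2}
```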

191 citations


Cites methods from "Supporting GPU sharing in cloud env..."

  • ...[29] rely on the molding technique (changing the dimensions of grid and thread blocks while preserving the correctness of the computation), when possible....


Proceedings ArticleDOI
30 Nov 2015
TL;DR: A framework that integrates reconfigurable accelerators in a standard server with virtualised resource management and communication is discussed, and a case study is presented that quantifies the efficiency benefits and break-even point for integrating FPGAs in the cloud.
Abstract: Hardware accelerators implement custom architectures to significantly speed up computations in a wide range of domains. As performance scaling in server-class CPUs slows, we propose the integration of hardware accelerators in the cloud as a way to maintain a positive performance trend. Field programmable gate arrays (FPGAs) represent the ideal way to integrate accelerators in the cloud, since they can be reprogrammed as needs change and allow multiple accelerators to share optimised communication infrastructure. We discuss a framework that integrates reconfigurable accelerators in a standard server with virtualised resource management and communication. We then present a case study that quantifies the efficiency benefits and break-even point for integrating FPGAs in the cloud.
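
As a flavor of the break-even analysis mentioned above, the sketch below uses an Amdahl-style model to ask what fraction of a job must be FPGA-acceleratable before a pricier FPGA-equipped server wins on cost per job. The formula and all prices are illustrative assumptions, not the paper's measured numbers.

```python
def break_even_fraction(cpu_cost: float, fpga_cost: float, speedup: float) -> float:
    """Fraction f of runtime that must be FPGA-accelerated (at the given
    speedup) for an FPGA-equipped server to match a plain server's cost
    per job. Amdahl-style: accelerated runtime = (1 - f) + f / speedup."""
    return (1 - cpu_cost / fpga_cost) / (1 - 1 / speedup)

# E.g. a server that costs 30% more but accelerates hot loops 10x:
f = break_even_fraction(cpu_cost=1.00, fpga_cost=1.30, speedup=10.0)
print(f"break-even at {f:.0%} of runtime accelerated")  # ~26%
```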

156 citations


Cites background from "Supporting GPU sharing in cloud env..."

  • ...GPUs can offer significant performance benefits but cloud integration can be troublesome, since the architectures are designed to be used monolithically [4]....


Proceedings ArticleDOI
25 Mar 2016
TL;DR: Baymax is presented, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization.
Abstract: Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different than contention on multi-core CPUs and introduces a new set of challenges in reducing QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.
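
Baymax's contribution is duration prediction plus runtime reordering on a non-preemptive accelerator; the toy sketch below captures only the flavor of such a policy, slotting batch tasks ahead of user-facing work only when their predicted durations leave the latency target intact. The task fields, the numbers, and the policy itself are hypothetical, not the actual Baymax scheduler.

```python
from collections import deque

def order_tasks(tasks, tail_target_ms):
    """Toy QoS-aware ordering: each task carries a predicted GPU duration;
    a batch task runs ahead of user-facing work only if the queuing delay
    it adds keeps user-facing tasks under the tail-latency target."""
    user = [t for t in tasks if t["user_facing"]]
    batch = deque(t for t in tasks if not t["user_facing"])
    schedule, queue_ms = [], 0.0
    for t in user:
        # Opportunistically run batch work that fits in the remaining slack.
        while batch and queue_ms + batch[0]["pred_ms"] + t["pred_ms"] <= tail_target_ms:
            b = batch.popleft()
            schedule.append(b)
            queue_ms += b["pred_ms"]
        schedule.append(t)
        queue_ms += t["pred_ms"]
    schedule.extend(batch)  # leftover batch work runs after user-facing tasks
    return schedule

tasks = [{"name": "dnn_infer", "pred_ms": 8, "user_facing": True},
         {"name": "training_chunk", "pred_ms": 40, "user_facing": False},
         {"name": "asr_decode", "pred_ms": 12, "user_facing": True}]
print([t["name"] for t in order_tasks(tasks, tail_target_ms=25)])
# ['dnn_infer', 'asr_decode', 'training_chunk']
```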

117 citations


Cites background from "Supporting GPU sharing in cloud env..."

  • ...GPU resource sharing has been studied at both system [65, 66] and architecture levels [67, 68] to address the resource contention and performance interference....


Proceedings Article
19 Jun 2014
TL;DR: gVirt is introduced, a product-level GPU virtualization implementation with: 1) full GPU virtualization running a native graphics driver in the guest, and 2) mediated pass-through that achieves both good performance and scalability, as well as secure isolation among guests.
Abstract: Graphics Processing Unit (GPU) virtualization is an enabling technology in emerging virtualization scenarios. Unfortunately, existing GPU virtualization approaches are still suboptimal in performance and full feature support. This paper introduces gVirt, a product-level GPU virtualization implementation with: 1) full GPU virtualization running a native graphics driver in the guest, and 2) mediated pass-through that achieves both good performance and scalability, as well as secure isolation among guests. gVirt presents a virtual full-fledged GPU to each VM. VMs can directly access performance-critical resources, without intervention from the hypervisor in most cases, while privileged operations from the guest are trap-and-emulated at minimal cost. Experiments demonstrate that gVirt can achieve up to 95% native performance for GPU-intensive workloads, and scale well up to 7 VMs.
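
The heart of mediated pass-through is the split between direct access and trapping; the fragment below sketches that dispatch for a made-up register layout. The address ranges and handler names are invented, not gVirt's real ones.

```python
PASS_THROUGH_RANGES = [(0x0000, 0x0FFF)]   # invented performance-critical MMIO

def handle_mmio_write(addr, value, hw_write, emulate):
    """Dispatch a guest register write: performance-critical ranges go
    straight to hardware (pass-through); anything privileged traps to a
    mediator for emulation."""
    if any(lo <= addr <= hi for lo, hi in PASS_THROUGH_RANGES):
        hw_write(addr, value)    # direct access: no hypervisor involvement
    else:
        emulate(addr, value)     # trap-and-emulate: mediator validates it

log = []
handle_mmio_write(0x0040, 1, hw_write=lambda a, v: log.append(("hw", a)),
                  emulate=lambda a, v: log.append(("trap", a)))
handle_mmio_write(0x2000, 1, hw_write=lambda a, v: log.append(("hw", a)),
                  emulate=lambda a, v: log.append(("trap", a)))
print(log)   # [('hw', 64), ('trap', 8192)]
```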

114 citations

References
Proceedings ArticleDOI
15 Nov 2008
TL;DR: Using the Amazon cloud fee structure and a real-life astronomy application, the cost-performance trade-offs of different execution and resource provisioning plans are studied, and it is shown that by provisioning the right amount of storage and compute resources, cost can be significantly reduced with no significant impact on application performance.
Abstract: Utility grids such as the Amazon EC2 cloud and Amazon S3 offer computational and storage resources that can be used on-demand for a fee by compute- and data-intensive applications. The cost of running an application on such a cloud depends on the compute, storage, and communication resources it will provision and consume. Different execution plans of the same application may result in significantly different costs. Using the Amazon cloud fee structure and a real-life astronomy application, we study via simulation the cost-performance trade-offs of different execution and resource provisioning plans. We also study these trade-offs in the context of the storage and communication fees of Amazon S3 when used for long-term application data archival. Our results show that by provisioning the right amount of storage and compute resources, cost can be significantly reduced with no significant impact on application performance.
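
A minimal version of this kind of cost comparison can be written down directly: each provisioning plan's cost is compute plus transfer plus storage, and plans trade one term against another. The rates below are placeholders, not actual (or 2008-era) Amazon prices.

```python
def plan_cost(cpu_hours, gb_transferred, gb_stored, months_stored,
              cpu_rate=0.10, transfer_rate=0.09, storage_rate=0.023):
    """Toy cloud cost model: compute + data transfer + long-term storage.
    All rates are made-up placeholders in $/hour, $/GB, and $/GB-month."""
    return (cpu_hours * cpu_rate
            + gb_transferred * transfer_rate
            + gb_stored * storage_rate * months_stored)

# Two plans for the same workflow: recompute intermediates vs. store them.
recompute = plan_cost(cpu_hours=120, gb_transferred=5, gb_stored=2, months_stored=12)
store_all = plan_cost(cpu_hours=80, gb_transferred=5, gb_stored=50, months_stored=12)
print(f"recompute: ${recompute:.2f}  store-all: ${store_all:.2f}")
```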

690 citations


"Supporting GPU sharing in cloud env..." refers background in this paper

  • ...growing popularity of cloud environments, including their use for high performance applications [4, 13, 17, 32]....


Proceedings ArticleDOI
12 Dec 2009
TL;DR: Adaptive mapping is proposed, a fully automatic technique to map computations to processing elements on a CPU+GPU machine; by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption relative to static mappings, on average, for a set of important computation benchmarks.
Abstract: Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware/software configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption relative to static mappings, on average, for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.
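
Qilin's real adaptive mapping fits execution-time models from training runs; the sketch below shows only the equal-finish-time special case, splitting work in proportion to throughputs measured in a profiling run. The rates and names are made up.

```python
def adapt_split(n_items, cpu_rate, gpu_rate):
    """Split the input so CPU and GPU finish together, given measured
    throughputs (items/sec) from a small profiling run."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    n_gpu = round(n_items * gpu_share)
    return n_items - n_gpu, n_gpu

n_cpu, n_gpu = adapt_split(1_000_000, cpu_rate=50_000, gpu_rate=450_000)
print(n_cpu, n_gpu)  # 100000 900000: both sides finish in ~2 s
```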

565 citations


"Supporting GPU sharing in cloud env..." refers background in this paper

  • ...The work presented in this paper is driven by two independent recent trends, which are the emergence of GPUs as a major player in high performance computing [1, 2, 18, 21, 29, 30], and the rapidly...


Proceedings ArticleDOI
01 Apr 2009
TL;DR: Experimental evaluation with RUBiS and TPC-W benchmarks along with production-trace-driven workloads indicates that AutoControl can detect and mitigate CPU and disk I/O bottlenecks that occur over time and across multiple nodes by allocating each resource accordingly.
Abstract: Virtualized data centers enable sharing of resources among hosted applications. However, it is difficult to satisfy service-level objectives (SLOs) of applications on shared infrastructure, as application workloads and resource consumption patterns change over time. In this paper, we present AutoControl, a resource control system that automatically adapts to dynamic workload changes to achieve application SLOs. AutoControl is a combination of an online model estimator and a novel multi-input, multi-output (MIMO) resource controller. The model estimator captures the complex relationship between application performance and resource allocations, while the MIMO controller allocates the right amount of multiple virtualized resources to achieve application SLOs. Our experimental evaluation with RUBiS and TPC-W benchmarks along with production-trace-driven workloads indicates that AutoControl can detect and mitigate CPU and disk I/O bottlenecks that occur over time and across multiple nodes by allocating each resource accordingly. We also show that AutoControl can be used to provide service differentiation according to the application priorities during resource contention.
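
AutoControl couples an online model estimator with a MIMO controller; the sketch below strips that down to a single-resource proportional feedback rule, just to show the shape of the control loop. The gain, bounds, and workload numbers are invented, not the system's actual design.

```python
def control_step(alloc, perf, slo, gain=0.5, min_alloc=0.05, max_alloc=1.0):
    """Bare-bones feedback step: raise a VM's resource cap when measured
    performance misses its SLO, lower it when there is headroom."""
    error = (slo - perf) / slo            # > 0 means the SLO is being missed
    new_alloc = alloc * (1 + gain * error)
    return min(max_alloc, max(min_alloc, new_alloc))

alloc = 0.30                              # start at 30% of a CPU
for measured_rps in [80, 85, 95, 100]:    # the app's SLO is 100 req/s
    alloc = control_step(alloc, perf=measured_rps, slo=100)
    print(f"alloc -> {alloc:.2f}")
```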

553 citations


"Supporting GPU sharing in cloud env..." refers background or methods in this paper

  • ...There has been much research on the topic of virtualized resource allocation [5, 11, 16, 20, 25]....


  • ...[20] P. Padala et al. Automated control of multiple virtualized resources....



  • ...A similar control model was used by Padala [20] for virtualized resources management, targeting web applications....


  • ...[11] J. Heo, X. Zhu, P. Padala, and Z....


Proceedings ArticleDOI
23 May 2009
TL;DR: This work compares and contrasts the performance and monetary cost-benefits of clouds for desktop grid applications, ranging in computational size and storage, and examines performance measurements and monetary expenses of real desktop grids and the Amazon elastic compute cloud.
Abstract: Cloud Computing has taken commercial computing by storm. However, adoption of cloud computing platforms and services by the scientific community is in its infancy as the performance and monetary cost-benefits for scientific applications are not perfectly clear. This is especially true for desktop grids (aka volunteer computing) applications. We compare and contrast the performance and monetary cost-benefits of clouds for desktop grid applications, ranging in computational size and storage. We address the following questions: (i) What are the performance tradeoffs in using one platform over the other? (ii) What are the specific resource requirements and monetary costs of creating and deploying applications on each platform? (iii) In light of those monetary and performance cost-benefits, how do these platforms compare? (iv) Can cloud computing platforms be used in combination with desktop grids to improve cost-effectiveness even further? We examine those questions using performance measurements and monetary expenses of real desktop grids and the Amazon elastic compute cloud.
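
Question (iv) above lends itself to a back-of-the-envelope model: let the volunteer grid finish what it can by the deadline for free, and burst the overflow to paid cloud nodes. The sketch below does exactly that; the rates and the hourly price are invented, not the paper's measurements.

```python
import math

def hybrid_plan(tasks: int, deadline_h: float, grid_rate: float,
                cloud_rate: float, cloud_price: float):
    """The volunteer grid processes grid_rate tasks/hour for free; any
    overflow is spread over just enough cloud nodes (cloud_rate tasks/hour
    each, cloud_price $/hour each) to meet the deadline."""
    overflow = max(0.0, tasks - grid_rate * deadline_h)
    nodes = math.ceil(overflow / (cloud_rate * deadline_h)) if overflow else 0
    return nodes, nodes * cloud_price * deadline_h

nodes, cost = hybrid_plan(tasks=10_000, deadline_h=24,
                          grid_rate=300, cloud_rate=50, cloud_price=0.10)
print(f"{nodes} cloud nodes, ${cost:.2f}")   # 3 cloud nodes, $7.20
```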

383 citations


"Supporting GPU sharing in cloud env..." refers background in this paper

  • ...growing popularity of cloud environments, including their use for high performance applications [4, 13, 17, 32]....


Proceedings ArticleDOI
20 Oct 2006
TL;DR: This work describes Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses, and compares the performance of Accelerator versions of a set of compute-intensive benchmarks against hand-written pixel shaders.
Abstract: GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses instead. Programmers use a conventional imperative programming language and a library that provides only high-level data-parallel operations. No aspects of GPUs are exposed to programmers. The library implementation compiles the data-parallel operations on the fly to optimized GPU pixel shader code and API calls. We describe the compilation techniques used to do this. We evaluate the effectiveness of using data parallelism to program GPUs by providing results for a set of compute-intensive benchmarks. We compare the performance of Accelerator versions of the benchmarks against hand-written pixel shaders. The speeds of the Accelerator versions are typically within 50% of the speeds of hand-written pixel shader code. Some benchmarks significantly outperform C versions on a CPU: they are up to 18 times faster than C code running on a CPU.
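
Accelerator's key design point is that the library exposes only whole-array data-parallel operations and compiles the resulting expression tree to shader code at evaluation time. The Python sketch below mimics that structure with a lazy expression tree whose "compilation" is just interpretation over lists; the class and its operators are invented stand-ins, not Accelerator's API.

```python
class ParallelArray:
    """Lazy whole-array operations: arithmetic builds an expression tree,
    which a real system would JIT-compile to GPU shader code; here
    evaluation is plain interpretation over Python lists."""
    def __init__(self, data=None, op=None, args=()):
        self.data, self.op, self.args = data, op, args

    def __add__(self, other):
        return ParallelArray(op=lambda x, y: x + y, args=(self, other))

    def __mul__(self, other):
        return ParallelArray(op=lambda x, y: x * y, args=(self, other))

    def evaluate(self):
        if self.data is not None:          # leaf node: concrete data
            return self.data
        lhs, rhs = (a.evaluate() for a in self.args)
        return [self.op(x, y) for x, y in zip(lhs, rhs)]

a = ParallelArray([1.0, 2.0, 3.0])
b = ParallelArray([4.0, 5.0, 6.0])
print((a * a + b).evaluate())              # [5.0, 9.0, 15.0]
```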

357 citations


"Supporting GPU sharing in cloud env..." refers background in this paper

  • ...The work presented in this paper is driven by two independent recent trends, which are the emergence of GPUs as a major player in high performance computing [1, 2, 18, 21, 29, 30], and the rapidly...
