Proceedings ArticleDOI

GPUShare: Fair-Sharing Middleware for GPU Clouds

TLDR
GPUShare is presented, a software-based mechanism that can yield a kernel before all of its threads have run, thus giving finer control over the time slice for which the GPU is allocated to a process and improving fair GPU sharing across tenants.
Abstract
Many new cloud-focused applications such as deep learning and graph analytics have started to rely on the high computing throughput of GPUs, but cloud providers cannot currently support fine-grained time-sharing on GPUs to enable multi-tenancy for these types of applications. Currently, scheduling is performed by the GPU driver in combination with a hardware thread dispatcher to maximize utilization. However, when multiple applications with contrasting kernel running times and high utilization of the GPU need to be co-located, this approach unduly favors one or more of the applications at the expense of others. This paper presents GPUShare, a middleware solution for GPU fair sharing among high-utilization, long-running applications. It begins by analyzing the scenarios under which the current driver-based multi-process scheduling fails, noting that such scenarios are quite common. It then describes a software-based mechanism that can yield a kernel before all of its threads have run, thus giving finer control over the time slice for which the GPU is allocated to a process. In controlling time slices on the GPU by yielding kernels, GPUShare improves fair GPU sharing across tenants and outperforms the CUDA driver by up to 45% for two tenants and by up to 89% for more than two tenants, while incurring a maximum overhead of only 12%. Additional improvements are obtained from having a central scheduler that further smooths out disparities across tenants' GPU shares, improving fair sharing by up to 92% for two tenants and by up to 76% for more than two tenants.
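The paper's implementation is not shown here, but the yielding idea the abstract describes can be sketched in miniature. In this hypothetical Python sketch (names and structure are illustrative, not GPUShare's actual code), a long-running kernel processes work in fixed chunks and the middleware caps how many chunks run per time slice instead of waiting for every thread block to finish; on a real GPU the "chunk" would be a batch of thread blocks and the cap a yield flag raised from the host.

```python
# Minimal sketch (hypothetical names, not the paper's code) of kernel
# yielding: a long-running kernel runs in chunks, and a central scheduler
# grants each tenant a bounded slice of chunks before yielding the GPU.

class YieldableKernel:
    def __init__(self, name, total_chunks):
        self.name = name
        self.remaining = total_chunks

    def run(self, max_chunks):
        """Run until done or until the slice (max_chunks) is exhausted."""
        ran = 0
        while self.remaining > 0 and ran < max_chunks:
            self.remaining -= 1  # process one chunk of thread blocks
            ran += 1
        return ran

def round_robin(kernels, slice_chunks):
    """Central scheduler: grant each tenant an equal slice until all finish."""
    schedule = []
    pending = list(kernels)
    while pending:
        for k in list(pending):
            k.run(slice_chunks)
            schedule.append(k.name)
            if k.remaining == 0:
                pending.remove(k)
    return schedule
```

With tenants of unequal kernel lengths, both appear in alternation early in the schedule rather than the longer kernel monopolizing the device, which is the fairness property the abstract measures.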


Citations
Proceedings ArticleDOI

CuMF_SGD: Parallelized Stochastic Gradient Descent for Matrix Factorization on GPUs

TL;DR: This paper first designs high-performance GPU computation kernels that accelerate individual SGD updates by exploiting model parallelism, then designs efficient schemes that parallelize SGD updates by exploiting data parallelism, and scales cuMF_SGD to large data sets that cannot fit into one GPU's memory.
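The per-rating SGD update that cuMF_SGD parallelizes can be sketched serially. This is an illustrative Python/NumPy sketch (function names and hyperparameters are assumptions, not the paper's kernels): it factorizes a ratings matrix as R ≈ P·Qᵀ, and the key property enabling data parallelism is that each update touches only one row of P and one row of Q.

```python
import numpy as np

# Illustrative serial version of the SGD matrix-factorization update
# that cuMF_SGD batches across GPU threads (hypothetical sketch).

def sgd_mf(ratings, n_users, n_items, k=8, lr=0.05, reg=0.02, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            # Each (u, i, r) update touches only row u of P and row i of Q,
            # which is what makes lock-free data parallelism feasible.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    errs = [(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errs)))
```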
Proceedings ArticleDOI

Dynamic application reconfiguration on heterogeneous hardware

TL;DR: Through TornadoVM, a virtual machine capable of reconfiguring applications, at runtime, for hardware acceleration based on the currently available hardware resources, this paper introduces a new level of compilation in which applications can benefit from heterogeneous hardware.
Proceedings ArticleDOI

Wheel: Accelerating CNNs with Distributed GPUs via Hybrid Parallelism and Alternate Strategy

TL;DR: Wheel first partitions the layers of a CNN into two kinds of modules: convolutional module and fully-connected module, and deploys them following the proposed hybrid parallelism, which reduces the transmitted data and fully using GPUs simultaneously.
Proceedings ArticleDOI

GLoop: an event-driven runtime for consolidating GPGPU applications

TL;DR: GLoop is presented, a software runtime that enables consolidating GPGPU apps, including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while proportionally scheduling them on a shared GPU in an isolated manner.
References
Proceedings Article

The PageRank Citation Ranking : Bringing Order to the Web

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
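The computation the paper describes is, at its core, power iteration on a damped link matrix. A minimal pure-Python sketch (variable names are illustrative; real implementations use sparse matrices for web-scale graphs):

```python
# Power-iteration sketch of PageRank: repeatedly redistribute each page's
# rank along its outgoing links, damped by d, until the ranks stabilize.

def pagerank(links, d=0.85, iters=100):
    """links: dict mapping node -> list of outgoing neighbors."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = d * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:  # dangling node: spread its rank uniformly
                for w in nodes:
                    new[w] += d * rank[v] / n
        rank = new
    return rank
```

The ranks always sum to 1, and pages with many inbound links from well-ranked pages come out on top.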
Posted Content

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Proceedings ArticleDOI

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Journal ArticleDOI

Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment

TL;DR: The problem of multiprogram scheduling on a single processor is studied from the viewpoint of the characteristics peculiar to the program functions that need guaranteed service and it is shown that an optimum fixed priority scheduler possesses an upper bound to processor utilization.
Book

Scheduling algorithms for multiprogramming in a hard real-time environment

TL;DR: In this paper, the problem of multiprogram scheduling on a single processor is studied from the viewpoint of the characteristics peculiar to the program functions that need guaranteed service, and it is shown that an optimum fixed priority scheduler possesses an upper bound to processor utilization which may be as low as 70 percent for large task sets.
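The "as low as 70 percent" figure is the well-known Liu & Layland rate-monotonic bound: n periodic tasks are schedulable under fixed priorities whenever total utilization does not exceed n(2^(1/n) − 1), which decreases toward ln 2 ≈ 0.693 as n grows. A one-function sketch (not code from the paper):

```python
import math

# Liu & Layland least upper bound on processor utilization for n periodic
# tasks under rate-monotonic fixed-priority scheduling: n * (2^(1/n) - 1).
# For n = 1 the bound is 1.0; as n -> infinity it falls to ln 2 ~ 0.693,
# the "as low as 70 percent" figure quoted in the summary above.

def rm_bound(n):
    """Least upper bound on utilization for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)
```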