Author

Peifeng Yu

Bio: Peifeng Yu is an academic researcher from the University of Michigan. The author has contributed to research on topics including general-purpose computing on graphics processing units (GPGPU) and computer science, has an h-index of 3, and has co-authored 3 publications receiving 35 citations.

Papers
Posted Content
TL;DR: Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications and enforces fine-grained sharing by performing iteration scheduling and addressing the associated memory management issues; these primitives can be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases.
Abstract: GPU computing is becoming increasingly more popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption is expensive. Worse, when a DL application cannot completely use a GPU's resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization. We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, in order to achieve fine-grained GPU sharing among multiple DL applications. Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases. Our integration of Salus with TensorFlow and evaluation on popular DL jobs show that Salus can improve the average completion time of DL training jobs by $3.19\times$, GPU utilization for hyper-parameter tuning by $2.38\times$, and GPU utilization of DL inference applications by $42\times$ over not sharing the GPU and $7\times$ over NVIDIA MPS with small overhead.
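
To illustrate the iteration-scheduling idea the abstract describes, here is a minimal, self-contained sketch of iteration-granularity GPU time sharing; the job names, priorities, and per-iteration times are hypothetical, and this is not Salus's implementation.

```python
from collections import deque

class Job:
    def __init__(self, name, priority, iters, iter_time):
        self.name, self.priority = name, priority
        self.remaining, self.iter_time = iters, iter_time

def run(jobs, policy="priority"):
    """One GPU, many jobs: switch only at iteration boundaries (cheap job switching)."""
    queue, clock = deque(jobs), 0.0
    while queue:
        if policy == "priority":            # prioritization: always run the highest-priority pending job
            job = min(queue, key=lambda j: j.priority)
        else:                               # fairness: round-robin, one iteration per turn
            job = queue[0]
            queue.rotate(-1)
        clock += job.iter_time              # execute exactly one training iteration, then yield the GPU
        job.remaining -= 1
        if job.remaining == 0:
            queue.remove(job)
            print(f"{job.name} done at t={clock:.2f}s")

run([Job("resnet50", priority=1, iters=3, iter_time=0.30),
     Job("lstm",     priority=2, iters=3, iter_time=0.12)])
```

Because switching happens only between iterations, the scheduler can enforce either policy without the cost of full GPU context teardown, which is the property the paper's sharing primitives target.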

35 citations

15 Mar 2020
TL;DR: Salus as mentioned in this paper enables fine-grained GPU sharing among multiple DL applications by exposing the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues.
Abstract: GPU computing is becoming increasingly more popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption is expensive. Worse, when a DL application cannot completely use a GPU's resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization. We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, in order to achieve fine-grained GPU sharing among multiple DL applications. Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases. Our integration of Salus with TensorFlow and evaluation on popular DL jobs show that Salus can improve the average completion time of DL training jobs by $3.19\times$, GPU utilization for hyper-parameter tuning by $2.38\times$, and GPU utilization of DL inference applications by $42\times$ over not sharing the GPU and $7\times$ over NVIDIA MPS with small overhead.

6 citations

Proceedings ArticleDOI
07 May 2017
TL;DR: It is argued that by introducing a common representation of learning tasks and a hardware abstraction model to capture compute heterogeneity, machine learning researchers could be relieved from dealing with low-level systems issues and systems researchers from being tied to any specific framework.
Abstract: In recent years, deep learning has pervaded many areas of computing due to the confluence of an explosive growth of large-scale computing capabilities, availability of datasets, and advances in learning techniques. While this rapid growth has resulted in diverse deep learning frameworks, it has also led to inefficiencies for both the users and developers of these frameworks. Specifically, adopting useful techniques across frameworks -- both to perform learning tasks and to optimize performance -- involves significant repetitions and reinventions. In this paper, we observe that despite their diverse origins, many of these frameworks share architectural similarities. We argue that by introducing a common representation of learning tasks and a hardware abstraction model to capture compute heterogeneity, we might be able to relieve machine learning researchers from dealing with low-level systems issues and systems researchers from being tied to any specific framework. We expect this decoupling to accelerate progress in both domains.
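
The decoupling the paper argues for can be sketched as a framework-neutral task representation executed through a hardware abstraction; the `Op`, `Backend`, and `CPUBackend` names below are illustrative and not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Op:                       # common representation: a graph of named ops with symbolic inputs
    name: str
    inputs: tuple

class Backend:                  # hardware abstraction model: what any device must provide
    def alloc(self, nbytes): ...
    def launch(self, op, args): ...

class CPUBackend(Backend):
    def alloc(self, nbytes):
        return bytearray(nbytes)
    def launch(self, op, args):
        print(f"CPU executes {op.name} on {len(args)} inputs")

def execute(graph, backend):
    """Framework-agnostic executor: walks the task graph and defers all device work to the backend."""
    for op in graph:
        backend.launch(op, op.inputs)

graph = [Op("matmul", ("x", "w")), Op("relu", ("h",))]
execute(graph, CPUBackend())
```

A learning task expressed once in the common representation could then run on any backend that implements the abstraction, which is the separation of concerns the paper advocates.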

5 citations

Journal ArticleDOI
TL;DR: Orloj is presented, a dynamic DNN serving system that captures this variance in dynamic DNNs using empirical distributions of expected request execution times, and then batches and schedules requests without knowing a request's precise execution time.
Abstract: Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, inference requests to emerging dynamic DNNs – e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers – are data-dependent. They exhibit poor performance when served using existing solutions because they experience large variance in request execution times depending on the input – the longest request in a batch inflates the execution times of the smaller ones, causing SLO misses in the absence of careful batching. In this paper, we present Orloj, a dynamic DNN serving system that captures this variance in dynamic DNNs using empirical distributions of expected request execution times, and then efficiently batches and schedules them without knowing a request's precise execution time. Orloj significantly outperforms state-of-the-art serving solutions for high-variance dynamic DNN workloads by 51–80% in finish rate under tight SLO constraints, and over 100% under more relaxed SLO settings. For well-studied static DNN workloads, Orloj maintains performance comparable to the state-of-the-art.
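
The core idea, batching under uncertainty using an empirical latency distribution, can be sketched as follows; the latency profile, SLO, and admission rule are hypothetical and do not reproduce Orloj's actual algorithm.

```python
import random

random.seed(0)
# empirical per-request latencies in milliseconds, e.g. gathered from past executions (made up here)
profile = [random.uniform(5, 40) for _ in range(1000)]

def batch_cost_quantile(batch_size, q=0.95, trials=500):
    """Estimate the q-quantile of a batch's execution time; the slowest request dominates the batch."""
    samples = [max(random.choices(profile, k=batch_size)) for _ in range(trials)]
    samples.sort()
    return samples[int(q * (trials - 1))]

def admit(queue_len, slo_ms=60.0):
    """Grow the batch while the 95th-percentile estimated cost still fits inside the SLO."""
    batch = 0
    while batch < queue_len and batch_cost_quantile(batch + 1) <= slo_ms:
        batch += 1
    return batch

print("batch size chosen:", admit(queue_len=16, slo_ms=60.0))
```

The point is that a scheduler can reason about a batch's likely cost from the distribution of past executions even though no individual request's execution time is known in advance.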

Cited by
Proceedings ArticleDOI
09 Mar 2020
TL;DR: This work proposes SwapAdvisor, which performs joint optimization along three dimensions based on a given dataflow graph (operator scheduling, memory allocation, and swap decisions) and can train models up to 12 times the GPU memory limit while achieving 53-99% of the throughput of a hypothetical baseline with infinite GPU memory.
Abstract: It is known that deeper and wider neural networks can achieve better accuracy. But it is difficult to continue the trend to increase model size due to limited GPU memory. One promising solution is to support swapping between GPU and CPU memory. However, existing work on swapping only handles certain models and does not achieve satisfactory performance. Deep learning computation is commonly expressed as a dataflow graph which can be analyzed to improve swapping. We propose SwapAdvisor, which performs joint optimization along 3 dimensions based on a given dataflow graph: operator scheduling, memory allocation, and swap decisions. SwapAdvisor explores the vast search space using a custom-designed genetic algorithm. Evaluations using a variety of large models show that SwapAdvisor can train models up to 12 times the GPU memory limit while achieving 53-99% of the throughput of a hypothetical baseline with infinite GPU memory.
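
A rough sketch of the search style the abstract describes: a genetic algorithm over per-tensor swap decisions under a GPU memory budget. Tensor sizes, swap costs, and GA parameters are made up, and this is not SwapAdvisor's implementation, which also searches over operator scheduling and memory allocation.

```python
import random

random.seed(1)
sizes = [random.randint(1, 8) for _ in range(20)]     # tensor sizes in GB (hypothetical)
swap_cost = [s * 0.5 for s in sizes]                  # cost of swapping each tensor (hypothetical)
BUDGET = 24                                           # GPU memory budget in GB

def fitness(plan):
    """plan[i] == True means tensor i is swapped out to host memory."""
    resident = sum(s for s, bit in zip(sizes, plan) if not bit)
    cost = sum(c for c, bit in zip(swap_cost, plan) if bit)
    penalty = 1e6 * max(0, resident - BUDGET)         # infeasible plans are heavily penalized
    return -(cost + penalty)                          # higher fitness is better

def evolve(pop_size=40, generations=60, mutate_p=0.05):
    pop = [[random.random() < 0.5 for _ in sizes] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(len(sizes))        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutate_p) for bit in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print("tensors swapped:", sum(best),
      "resident GB:", sum(s for s, b in zip(sizes, best) if not b))
```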

93 citations

Proceedings ArticleDOI
12 Jun 2018
TL;DR: It is shown that there exist significant latency-throughput trade-offs, but the behavior is very complex; several factors that affect performance and yield this complex behavior are demonstrated.
Abstract: We study performance characteristics of convolutional neural networks (CNN) for mobile computer vision systems. CNNs have proven to be a powerful and efficient approach to implement such systems. However, the system performance depends largely on the utilization of hardware accelerators, which are able to speed up the execution of the underlying mathematical operations tremendously through massive parallelism. Our contribution is performance characterization of multiple CNN-based models for object recognition and detection with several different hardware platforms and software frameworks, using both local (on-device) and remote (network-side server) computation. The measurements are conducted using real workloads and real processing platforms. On the platform side, we concentrate especially on TensorFlow and TensorRT. Our measurements include embedded processors found on mobile devices and high-performance processors that can be used on the network side of mobile systems. We show that there exist significant latency-throughput trade-offs, but the behavior is very complex. We demonstrate and discuss several factors that affect the performance and yield this complex behavior.
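
The kind of measurement behind such a characterization can be sketched as a batch-size sweep that reports per-batch latency and derived throughput; the `infer` stand-in below replaces a real TensorFlow or TensorRT model call, and its cost model is invented.

```python
import time

def infer(batch):
    # hypothetical cost model: fixed launch overhead plus a per-image cost
    time.sleep(0.002 + 0.0005 * len(batch))

def characterize(batch_sizes=(1, 2, 4, 8, 16, 32), repeats=20):
    """Sweep batch size and report the latency-throughput trade-off."""
    for bs in batch_sizes:
        batch = [None] * bs
        start = time.perf_counter()
        for _ in range(repeats):
            infer(batch)
        latency = (time.perf_counter() - start) / repeats      # seconds per batch
        throughput = bs / latency                              # images per second
        print(f"batch={bs:3d}  latency={latency*1e3:6.1f} ms  throughput={throughput:7.1f} img/s")

characterize()
```

Larger batches raise throughput but also raise per-batch latency, which is the basic trade-off the study measures across platforms and frameworks.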

68 citations

Proceedings Article
TL;DR: This paper presents a characterization study of a two-month workload trace collected from a production MLaaS cluster with over 6,000 GPUs in Alibaba, describes the current solutions, and calls for further investigations into the challenges that remain open.
Abstract: With the sustained technological advances in machine learning (ML) and the availability of massive datasets recently, tech companies are deploying large ML-as-a-Service (MLaaS) clouds, often with heterogeneous GPUs, to provision a host of ML applications. However, running diverse ML workloads in heterogeneous GPU clusters raises a number of challenges. In this paper, we present a characterization study of a two-month workload trace collected from a production MLaaS cluster with over 6,000 GPUs in Alibaba. We explain the challenges posed to cluster scheduling, including the low GPU utilization, the long queueing delays, the presence of hard-to-schedule tasks demanding high-end GPUs with picky scheduling requirements, the imbalanced load across heterogeneous machines, and the potential bottleneck on CPUs. We describe our current solutions and call for further investigations into the challenges that remain open to address. We have released the trace for public access, which is the most comprehensive in terms of the workloads and cluster scale.
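
Metrics like queueing delay and cluster-wide GPU utilization can be computed from a job-level trace roughly as below; the column names and numbers are hypothetical and do not follow the released trace's schema.

```python
import pandas as pd

# tiny made-up trace: one row per job
trace = pd.DataFrame({
    "submit_time":  [0, 10, 20, 25],
    "start_time":   [0, 40, 45, 50],
    "end_time":     [30, 90, 60, 80],
    "gpu_request":  [1, 8, 1, 2],
    "gpu_util_avg": [0.9, 0.2, 0.6, 0.1],   # average utilization of the allocated GPUs
})

trace["queue_delay"] = trace["start_time"] - trace["submit_time"]
print("median queueing delay:", trace["queue_delay"].median())

# GPU-weighted utilization: large, underutilized allocations drag the cluster average down
weighted = (trace["gpu_util_avg"] * trace["gpu_request"]).sum() / trace["gpu_request"].sum()
print("cluster GPU utilization:", round(weighted, 2))
```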

65 citations

Proceedings ArticleDOI
15 Apr 2020
TL;DR: Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can significantly reduce the average job completion time while providing fairness and preventing starvation among users in a shared cluster.
Abstract: Modern deep learning frameworks support a variety of hardware, including CPU, GPU, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources - each with a different rate of computation - to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario by integrating it with Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.
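
The abstract's reduction to min-cost bipartite matching can be sketched with the classic jobs-to-(device, position) formulation, solved here with SciPy's assignment solver; the processing times are hypothetical and the sketch ignores AlloX's dynamic fairness mechanism.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# processing time of each job on each interchangeable device type (e.g. CPU vs GPU), hypothetical
p = np.array([[10.0, 2.0],    # job 0
              [ 4.0, 3.0],    # job 1
              [ 6.0, 1.5]])   # job 2
n_jobs, n_devices = p.shape

# columns are (device, position-from-the-end) slots; a job placed k-th from the end of a device's
# queue contributes k times its processing time to the sum of completion times
slots = [(d, k) for d in range(n_devices) for k in range(1, n_jobs + 1)]
cost = np.array([[k * p[j, d] for (d, k) in slots] for j in range(n_jobs)])

rows, cols = linear_sum_assignment(cost)               # min-cost bipartite matching
for j, c in zip(rows, cols):
    d, k = slots[c]
    print(f"job {j} -> device {d}, {k}-th from the end (cost {cost[j, c]:.1f})")
print("total weighted completion-time cost:", cost[rows, cols].sum())
```

Minimizing the matching cost minimizes the sum of job completion times across devices with different speeds, which is the performance half of the problem the paper pairs with fair allocation.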

60 citations

Proceedings Article
Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Li Zhi, Feng Yihui, Wei Lin, Yangqing Jia
01 Jan 2020
TL;DR: AntMan, a deep learning infrastructure that co-designs cluster schedulers with deep learning frameworks and has been deployed in production at Alibaba to manage tens of thousands of daily deep learning jobs across thousands of GPUs, is presented.
Abstract: Efficiently scheduling deep learning jobs on large-scale GPU clusters is crucial for job performance, system throughput, and hardware utilization. It is getting ever more challenging as deep learning workloads become more complex. This paper presents AntMan, a deep learning infrastructure that co-designs cluster schedulers with deep learning frameworks and has been deployed in production at Alibaba to manage tens of thousands of daily deep learning jobs across thousands of GPUs. AntMan accommodates the fluctuating resource demands of deep learning training jobs. As such, it utilizes the spare GPU resources to co-execute multiple jobs on a shared GPU. AntMan exploits unique characteristics of deep learning training to introduce dynamic scaling mechanisms for memory and computation within the deep learning frameworks. This allows fine-grained coordination between jobs and prevents job interference. Evaluations show that AntMan improves the overall GPU memory utilization by 42% and computation utilization by 34% in our multi-tenant cluster without compromising fairness, presenting a new approach to efficiently utilizing GPUs at scale.
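
A framework-neutral sketch of the dynamic memory scaling that lets an opportunistic job share a GPU with a guaranteed job: the opportunistic job caps its own memory use below whatever the guaranteed job leaves free, shrinking immediately and growing back gradually. The `free_gpu_memory` helper and all numbers are hypothetical, not AntMan's actual mechanism.

```python
TOTAL_GB = 16

def free_gpu_memory(guaranteed_usage_gb):
    """Stand-in for querying the device; returns memory not used by the guaranteed job."""
    return TOTAL_GB - guaranteed_usage_gb

def opportunistic_step(current_cap_gb, guaranteed_usage_gb, headroom_gb=1.0):
    """Dynamic memory scaling: shrink below the free space (minus headroom), grow back slowly."""
    target = max(0.0, free_gpu_memory(guaranteed_usage_gb) - headroom_gb)
    if target < current_cap_gb:
        return target                          # shrink immediately to avoid interfering
    return min(current_cap_gb + 1.0, target)   # grow back one step at a time

cap = 4.0
for usage in [2, 2, 10, 12, 6, 3]:             # guaranteed job's fluctuating demand over time (GB)
    cap = opportunistic_step(cap, usage)
    print(f"guaranteed uses {usage:>2} GB -> opportunistic cap {cap:.1f} GB")
```

The asymmetry (shrink fast, grow slowly) is one simple way to keep the opportunistic job from interfering with the guaranteed job while still reclaiming spare capacity when demand drops.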

58 citations