Author

Peifeng Yu

Bio: Peifeng Yu is an academic researcher from the University of Michigan. The author has contributed to research on topics including general-purpose computing on graphics processing units (GPGPU) and computer science, has an h-index of 3, and has co-authored 3 publications receiving 35 citations.

Papers
Posted Content
TL;DR: Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications and enforces fine-grained sharing by performing iteration scheduling and addressing the associated memory management issues; these primitives can be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases.
Abstract: GPU computing is becoming increasingly more popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption is expensive. Worse, when a DL application cannot completely use a GPU's resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization. We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, in order to achieve fine-grained GPU sharing among multiple DL applications. Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases. Our integration of Salus with TensorFlow and evaluation on popular DL jobs show that Salus can improve the average completion time of DL training jobs by $3.19\times$, GPU utilization for hyper-parameter tuning by $2.38\times$, and GPU utilization of DL inference applications by $42\times$ over not sharing the GPU and $7\times$ over NVIDIA MPS with small overhead.
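
To illustrate the iteration-scheduling idea the abstract describes, here is a minimal, self-contained sketch of iteration-granularity GPU time sharing; the job names, priorities, and per-iteration times are hypothetical, and this is not Salus's implementation.

```python
from collections import deque

class Job:
    def __init__(self, name, priority, iters, iter_time):
        self.name, self.priority = name, priority
        self.remaining, self.iter_time = iters, iter_time

def run(jobs, policy="priority"):
    """One GPU, many jobs: switch only at iteration boundaries (cheap job switching)."""
    queue, clock = deque(jobs), 0.0
    while queue:
        if policy == "priority":            # prioritization: always run the highest-priority pending job
            job = min(queue, key=lambda j: j.priority)
        else:                               # fairness: round-robin, one iteration per turn
            job = queue[0]
            queue.rotate(-1)
        clock += job.iter_time              # execute exactly one training iteration, then yield the GPU
        job.remaining -= 1
        if job.remaining == 0:
            queue.remove(job)
            print(f"{job.name} done at t={clock:.2f}s")

run([Job("resnet50", priority=1, iters=3, iter_time=0.30),
     Job("lstm",     priority=2, iters=3, iter_time=0.12)])
```

Because switching happens only between iterations, the scheduler can enforce either policy without the cost of full GPU context teardown, which is the property the paper's sharing primitives target.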

35 citations

15 Mar 2020
TL;DR: Salus as mentioned in this paper enables fine-grained GPU sharing among multiple DL applications by exposing the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues.
Abstract: GPU computing is becoming increasingly more popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption is expensive. Worse, when a DL application cannot completely use a GPU's resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization. We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, in order to achieve fine-grained GPU sharing among multiple DL applications. Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases. Our integration of Salus with TensorFlow and evaluation on popular DL jobs show that Salus can improve the average completion time of DL training jobs by $3.19\times$, GPU utilization for hyper-parameter tuning by $2.38\times$, and GPU utilization of DL inference applications by $42\times$ over not sharing the GPU and $7\times$ over NVIDIA MPS with small overhead.

6 citations

Proceedings ArticleDOI
07 May 2017
TL;DR: It is argued that by introducing a common representation of learning tasks and a hardware abstraction model to capture compute heterogeneity, machine learning researchers could be relieved from dealing with low-level systems issues and systems researchers from being tied to any specific framework.
Abstract: In recent years, deep learning has pervaded many areas of computing due to the confluence of an explosive growth of large-scale computing capabilities, availability of datasets, and advances in learning techniques. While this rapid growth has resulted in diverse deep learning frameworks, it has also led to inefficiencies for both the users and developers of these frameworks. Specifically, adopting useful techniques across frameworks -- both to perform learning tasks and to optimize performance -- involves significant repetitions and reinventions. In this paper, we observe that despite their diverse origins, many of these frameworks share architectural similarities. We argue that by introducing a common representation of learning tasks and a hardware abstraction model to capture compute heterogeneity, we might be able to relieve machine learning researchers from dealing with low-level systems issues and systems researchers from being tied to any specific framework. We expect this decoupling to accelerate progress in both domains.
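
The decoupling the paper argues for can be sketched as a framework-neutral task representation executed through a hardware abstraction; the `Op`, `Backend`, and `CPUBackend` names below are illustrative and not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Op:                       # common representation: a graph of named ops with symbolic inputs
    name: str
    inputs: tuple

class Backend:                  # hardware abstraction model: what any device must provide
    def alloc(self, nbytes): ...
    def launch(self, op, args): ...

class CPUBackend(Backend):
    def alloc(self, nbytes):
        return bytearray(nbytes)
    def launch(self, op, args):
        print(f"CPU executes {op.name} on {len(args)} inputs")

def execute(graph, backend):
    """Framework-agnostic executor: walks the task graph and defers all device work to the backend."""
    for op in graph:
        backend.launch(op, op.inputs)

graph = [Op("matmul", ("x", "w")), Op("relu", ("h",))]
execute(graph, CPUBackend())
```

A learning task expressed once in the common representation could then run on any backend that implements the abstraction, which is the separation of concerns the paper advocates.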

5 citations

Journal ArticleDOI
TL;DR: Orloj is presented, a dynamic DNN serving system that captures this variance in dynamic DNNs using empirical distributions of expected request execution times, and then batches and schedules requests without knowing a request's precise execution time.
Abstract: Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, inference requests to emerging dynamic DNNs – e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers – are data-dependent. They exhibit poor performance when served using existing solutions because they experience large variance in request execution times depending on the input – the longest request in a batch inflates the execution times of the smaller ones, causing SLO misses in the absence of careful batching. In this paper, we present Orloj, a dynamic DNN serving system that captures this variance in dynamic DNNs using empirical distributions of expected request execution times, and then efficiently batches and schedules them without knowing a request's precise execution time. Orloj significantly outperforms state-of-the-art serving solutions for high-variance dynamic DNN workloads by 51–80% in finish rate under tight SLO constraints, and over 100% under more relaxed SLO settings. For well-studied static DNN workloads, Orloj maintains performance comparable to the state-of-the-art.
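
The core idea, batching under uncertainty using an empirical latency distribution, can be sketched as follows; the latency profile, SLO, and admission rule are hypothetical and do not reproduce Orloj's actual algorithm.

```python
import random

random.seed(0)
# empirical per-request latencies in milliseconds, e.g. gathered from past executions (made up here)
profile = [random.uniform(5, 40) for _ in range(1000)]

def batch_cost_quantile(batch_size, q=0.95, trials=500):
    """Estimate the q-quantile of a batch's execution time; the slowest request dominates the batch."""
    samples = [max(random.choices(profile, k=batch_size)) for _ in range(trials)]
    samples.sort()
    return samples[int(q * (trials - 1))]

def admit(queue_len, slo_ms=60.0):
    """Grow the batch while the 95th-percentile estimated cost still fits inside the SLO."""
    batch = 0
    while batch < queue_len and batch_cost_quantile(batch + 1) <= slo_ms:
        batch += 1
    return batch

print("batch size chosen:", admit(queue_len=16, slo_ms=60.0))
```

The point is that a scheduler can reason about a batch's likely cost from the distribution of past executions even though no individual request's execution time is known in advance.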

Cited by
Proceedings ArticleDOI
09 Mar 2020
TL;DR: This work proposes SwapAdvisor, which performs joint optimization along three dimensions based on a given dataflow graph (operator scheduling, memory allocation, and swap decisions) and can train models up to 12 times the GPU memory limit while achieving 53-99% of the throughput of a hypothetical baseline with infinite GPU memory.
Abstract: It is known that deeper and wider neural networks can achieve better accuracy. But it is difficult to continue the trend to increase model size due to limited GPU memory. One promising solution is to support swapping between GPU and CPU memory. However, existing work on swapping only handles certain models and does not achieve satisfactory performance. Deep learning computation is commonly expressed as a dataflow graph which can be analyzed to improve swapping. We propose SwapAdvisor, which performs joint optimization along 3 dimensions based on a given dataflow graph: operator scheduling, memory allocation, and swap decisions. SwapAdvisor explores the vast search space using a custom-designed genetic algorithm. Evaluations using a variety of large models show that SwapAdvisor can train models up to 12 times the GPU memory limit while achieving 53-99% of the throughput of a hypothetical baseline with infinite GPU memory.
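
A rough sketch of the search style the abstract describes: a genetic algorithm over per-tensor swap decisions under a GPU memory budget. Tensor sizes, swap costs, and GA parameters are made up, and this is not SwapAdvisor's implementation, which also searches over operator scheduling and memory allocation.

```python
import random

random.seed(1)
sizes = [random.randint(1, 8) for _ in range(20)]     # tensor sizes in GB (hypothetical)
swap_cost = [s * 0.5 for s in sizes]                  # cost of swapping each tensor (hypothetical)
BUDGET = 24                                           # GPU memory budget in GB

def fitness(plan):
    """plan[i] == True means tensor i is swapped out to host memory."""
    resident = sum(s for s, bit in zip(sizes, plan) if not bit)
    cost = sum(c for c, bit in zip(swap_cost, plan) if bit)
    penalty = 1e6 * max(0, resident - BUDGET)         # infeasible plans are heavily penalized
    return -(cost + penalty)                          # higher fitness is better

def evolve(pop_size=40, generations=60, mutate_p=0.05):
    pop = [[random.random() < 0.5 for _ in sizes] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(len(sizes))        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutate_p) for bit in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print("tensors swapped:", sum(best),
      "resident GB:", sum(s for s, b in zip(sizes, best) if not b))
```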

93 citations

Proceedings ArticleDOI
12 Jun 2018
TL;DR: It is shown that there exist significant latency-throughput trade-offs, but the behavior is very complex; several factors that affect performance and yield this complex behavior are demonstrated.
Abstract: We study performance characteristics of convolutional neural networks (CNN) for mobile computer vision systems. CNNs have proven to be a powerful and efficient approach to implement such systems. However, the system performance depends largely on the utilization of hardware accelerators, which are able to speed up the execution of the underlying mathematical operations tremendously through massive parallelism. Our contribution is performance characterization of multiple CNN-based models for object recognition and detection with several different hardware platforms and software frameworks, using both local (on-device) and remote (network-side server) computation. The measurements are conducted using real workloads and real processing platforms. On the platform side, we concentrate especially on TensorFlow and TensorRT. Our measurements include embedded processors found on mobile devices and high-performance processors that can be used on the network side of mobile systems. We show that there exist significant latency-throughput trade-offs, but the behavior is very complex. We demonstrate and discuss several factors that affect the performance and yield this complex behavior.
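
The kind of measurement behind such a characterization can be sketched as a batch-size sweep that reports per-batch latency and derived throughput; the `infer` stand-in below replaces a real TensorFlow or TensorRT model call, and its cost model is invented.

```python
import time

def infer(batch):
    # hypothetical cost model: fixed launch overhead plus a per-image cost
    time.sleep(0.002 + 0.0005 * len(batch))

def characterize(batch_sizes=(1, 2, 4, 8, 16, 32), repeats=20):
    """Sweep batch size and report the latency-throughput trade-off."""
    for bs in batch_sizes:
        batch = [None] * bs
        start = time.perf_counter()
        for _ in range(repeats):
            infer(batch)
        latency = (time.perf_counter() - start) / repeats      # seconds per batch
        throughput = bs / latency                              # images per second
        print(f"batch={bs:3d}  latency={latency*1e3:6.1f} ms  throughput={throughput:7.1f} img/s")

characterize()
```

Larger batches raise throughput but also raise per-batch latency, which is the basic trade-off the study measures across platforms and frameworks.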

68 citations

Proceedings Article
TL;DR: This paper presents a characterization study of a two-month workload trace collected from a production MLaaS cluster with over 6,000 GPUs in Alibaba, describes the current solutions, and calls for further investigations into the challenges that remain open.
Abstract: With the sustained technological advances in machine learning (ML) and the availability of massive datasets recently, tech companies are deploying large ML-as-a-Service (MLaaS) clouds, often with heterogeneous GPUs, to provision a host of ML applications. However, running diverse ML workloads in heterogeneous GPU clusters raises a number of challenges. In this paper, we present a characterization study of a two-month workload trace collected from a production MLaaS cluster with over 6,000 GPUs in Alibaba. We explain the challenges posed to cluster scheduling, including the low GPU utilization, the long queueing delays, the presence of hard-to-schedule tasks demanding high-end GPUs with picky scheduling requirements, the imbalanced load across heterogeneous machines, and the potential bottleneck on CPUs. We describe our current solutions and call for further investigations into the challenges that remain open to address. We have released the trace for public access, which is the most comprehensive in terms of the workloads and cluster scale.
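
Metrics like queueing delay and cluster-wide GPU utilization can be computed from a job-level trace roughly as below; the column names and numbers are hypothetical and do not follow the released trace's schema.

```python
import pandas as pd

# tiny made-up trace: one row per job
trace = pd.DataFrame({
    "submit_time":  [0, 10, 20, 25],
    "start_time":   [0, 40, 45, 50],
    "end_time":     [30, 90, 60, 80],
    "gpu_request":  [1, 8, 1, 2],
    "gpu_util_avg": [0.9, 0.2, 0.6, 0.1],   # average utilization of the allocated GPUs
})

trace["queue_delay"] = trace["start_time"] - trace["submit_time"]
print("median queueing delay:", trace["queue_delay"].median())

# GPU-weighted utilization: large, underutilized allocations drag the cluster average down
weighted = (trace["gpu_util_avg"] * trace["gpu_request"]).sum() / trace["gpu_request"].sum()
print("cluster GPU utilization:", round(weighted, 2))
```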

65 citations

Proceedings ArticleDOI
15 Apr 2020
TL;DR: Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can significantly reduce the average job completion time while providing fairness and preventing starvation among users in a shared cluster.
Abstract: Modern deep learning frameworks support a variety of hardware, including CPU, GPU, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources - each with a different rate of computation - to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario by integrating it with Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.
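
The abstract's reduction to min-cost bipartite matching can be sketched with the classic jobs-to-(device, position) formulation, solved here with SciPy's assignment solver; the processing times are hypothetical and the sketch ignores AlloX's dynamic fairness mechanism.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# processing time of each job on each interchangeable device type (e.g. CPU vs GPU), hypothetical
p = np.array([[10.0, 2.0],    # job 0
              [ 4.0, 3.0],    # job 1
              [ 6.0, 1.5]])   # job 2
n_jobs, n_devices = p.shape

# columns are (device, position-from-the-end) slots; a job placed k-th from the end of a device's
# queue contributes k times its processing time to the sum of completion times
slots = [(d, k) for d in range(n_devices) for k in range(1, n_jobs + 1)]
cost = np.array([[k * p[j, d] for (d, k) in slots] for j in range(n_jobs)])

rows, cols = linear_sum_assignment(cost)               # min-cost bipartite matching
for j, c in zip(rows, cols):
    d, k = slots[c]
    print(f"job {j} -> device {d}, {k}-th from the end (cost {cost[j, c]:.1f})")
print("total weighted completion-time cost:", cost[rows, cols].sum())
```

Minimizing the matching cost minimizes the sum of job completion times across devices with different speeds, which is the performance half of the problem the paper pairs with fair allocation.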

60 citations

Proceedings Article
Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Li Zhi, Feng Yihui, Wei Lin, Yangqing Jia
01 Jan 2020
TL;DR: AntMan, a deep learning infrastructure that co-designs cluster schedulers with deep learning frameworks and has been deployed in production at Alibaba to manage tens of thousands of daily deep learning jobs across thousands of GPUs, is presented.
Abstract: Efficiently scheduling deep learning jobs on large-scale GPU clusters is crucial for job performance, system throughput, and hardware utilization. It is getting ever more challenging as deep learning workloads become more complex. This paper presents AntMan, a deep learning infrastructure that co-designs cluster schedulers with deep learning frameworks and has been deployed in production at Alibaba to manage tens of thousands of daily deep learning jobs across thousands of GPUs. AntMan accommodates the fluctuating resource demands of deep learning training jobs. As such, it utilizes the spare GPU resources to co-execute multiple jobs on a shared GPU. AntMan exploits unique characteristics of deep learning training to introduce dynamic scaling mechanisms for memory and computation within the deep learning frameworks. This allows fine-grained coordination between jobs and prevents job interference. Evaluations show that AntMan improves the overall GPU memory utilization by 42% and computation utilization by 34% in our multi-tenant cluster without compromising fairness, presenting a new approach to efficiently utilizing GPUs at scale.
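
A framework-neutral sketch of the dynamic memory scaling that lets an opportunistic job share a GPU with a guaranteed job: the opportunistic job caps its own memory use below whatever the guaranteed job leaves free, shrinking immediately and growing back gradually. The `free_gpu_memory` helper and all numbers are hypothetical, not AntMan's actual mechanism.

```python
TOTAL_GB = 16

def free_gpu_memory(guaranteed_usage_gb):
    """Stand-in for querying the device; returns memory not used by the guaranteed job."""
    return TOTAL_GB - guaranteed_usage_gb

def opportunistic_step(current_cap_gb, guaranteed_usage_gb, headroom_gb=1.0):
    """Dynamic memory scaling: shrink below the free space (minus headroom), grow back slowly."""
    target = max(0.0, free_gpu_memory(guaranteed_usage_gb) - headroom_gb)
    if target < current_cap_gb:
        return target                          # shrink immediately to avoid interfering
    return min(current_cap_gb + 1.0, target)   # grow back one step at a time

cap = 4.0
for usage in [2, 2, 10, 12, 6, 3]:             # guaranteed job's fluctuating demand over time (GB)
    cap = opportunistic_step(cap, usage)
    print(f"guaranteed uses {usage:>2} GB -> opportunistic cap {cap:.1f} GB")
```

The asymmetry (shrink fast, grow slowly) is one simple way to keep the opportunistic job from interfering with the guaranteed job while still reclaiming spare capacity when demand drops.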

58 citations