Showing papers by "Richard Harper" published in 2020


Proceedings Article
01 May 2020
TL;DR: This paper demonstrates that it is possible to predict DL workload GPU utilization via extracting information from its model computation graph and proposes a prediction engine to proactively determine the GPU utilization of heterogeneous DL workloads without the need for in-depth or isolated online profiling.
Abstract: Understanding the GPU utilization of Deep Learning (DL) workloads is important for enhancing resource efficiency and cost-benefit decision making for DL frameworks in the cloud. Current approaches to determining DL workload GPU utilization rely on online profiling within isolated GPU devices, which must be performed for every unique DL workload submission, resulting in resource under-utilization and reduced service availability. In this paper, we propose a prediction engine to proactively determine the GPU utilization of heterogeneous DL workloads without the need for in-depth or isolated online profiling. We demonstrate that it is possible to predict DL workload GPU utilization by extracting information from its model computation graph. Our experiments show that the prediction engine achieves an RMSLE of 0.154, and can be exploited by DL schedulers to achieve up to 61.5% improvement in GPU cluster utilization.

15 citations
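
The abstract above reports prediction accuracy as RMSLE (root mean squared logarithmic error). As a reminder of what that metric measures, here is a minimal sketch of the standard definition. The function name `rmsle` and its list-based inputs are assumptions made for illustration; this is not the authors' evaluation code, and the listing does not say how utilization values were scaled.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error of two equal-length sequences.

    Illustrative helper (hypothetical name). Values must be greater than -1,
    for example utilization fractions or percentages.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), so zero values are handled without error.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on a logarithmic scale, the metric penalizes relative rather than absolute differences between predicted and observed values.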


Book Chapter
02 Oct 2020
TL;DR: This paper proposes Horus, an interference-aware resource manager for DL systems, which estimates job resource utilization and co-location patterns to determine effective DL job placement, minimizing the likelihood of interference and improving system resource utilization and makespan.
Abstract: Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems, ranging from a single GPU device to machine clusters, require state-of-the-art resource management to increase resource utilization and job throughput. While co-location (multiple jobs placed within the same GPU) has been identified as an effective means to achieve this, it incurs performance interference that directly degrades DL training and inference performance. Existing approaches to mitigating interference require resource-intensive and time-consuming kernel profiling that is ill-suited to runtime scheduling decisions. Current DL system resource managers are not designed to deal with these problems. This paper proposes Horus, an interference-aware resource manager for DL systems. Instead of relying on expensive kernel profiling, our approach estimates job resource utilization and co-location patterns to determine effective DL job placement that minimizes the likelihood of interference and improves system resource utilization and makespan. Our analysis shows that interference causes up to 3.2x DL job slowdown. We integrated our approach within the Kubernetes resource manager and conducted experiments in a DL cluster by training 2,500 DL jobs using 13 different model types. Results demonstrate that Horus is able to outperform other DL resource managers by up to 61.5% for resource utilization and 33.6% for makespan.

9 citations
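
To make the idea of utilization-based co-location in the abstract above concrete, here is a deliberately simplified sketch. It is not Horus's placement algorithm, which this listing does not describe. The function `place_job`, the 100% capacity threshold, and the least-loaded rule are all assumptions chosen for illustration.

```python
def place_job(job_util, gpu_loads, capacity=100.0):
    """Pick a GPU for a new job from predicted utilization figures.

    Hypothetical helper, not taken from the paper.
    job_util:  predicted GPU utilization of the new job, in percent.
    gpu_loads: summed predicted utilization already placed on each GPU.
    Returns the index of the chosen GPU, or None if no GPU has room.
    """
    best_index = None
    best_load = None
    for index, load in enumerate(gpu_loads):
        combined = load + job_util
        if combined > capacity:
            continue  # co-locating here would oversubscribe this GPU
        # Prefer the GPU left with the lowest combined load (assumed heuristic).
        if best_load is None or combined < best_load:
            best_index, best_load = index, combined
    return best_index
```

For example, `place_job(30, [80, 40, 60])` selects the GPU at index 1, since that placement leaves the lowest combined predicted load without exceeding capacity.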


Proceedings Article
01 Jan 2020
TL;DR: This workshop aims to gather researchers and practitioners around a problem that information systems research has analysed at length but that has been less discussed among CSCW researchers: the failure of academic research to have an impact on practice.
Abstract: Workplaces in all sectors are experiencing digitalization spurred primarily by increasing access to data and AI. Many initiatives are failing to produce expected outcomes, and are even producing negative outcomes for workplace wellbeing. The insights generated by CSCW researchers seem to have failed to reach their targets: the challenges and opportunities for successful appropriation of technology have rarely been adopted by managers, or they were not articulated in a way that facilitated follow-on success. The failure of academic research to impact the world is a known problem (information systems research abounds with analyses of managerial challenges that have not been noted by managers themselves), but it has been less discussed among CSCW researchers. In this workshop, we wish to gather researchers and practitioners interested

1 citation


Book Chapter
01 Jan 2020
TL;DR: ‘HCI in the wild’ was meant to be a call to get HCI investigations out of the lab and into the melee of real life, though it raises questions about what kinds of methods and topics are suited to exploring in this melee as against in the lab.
Abstract: ‘HCI in the wild’ was meant to be a call to get HCI investigations out of the lab and into the melee of real life. This is of course a commendable suggestion, though it raises questions about what kinds of methods and topics are suited to exploring in this melee as against in the lab. Claims by some experimentalists that they seek ecological validity in lab studies largely miss the point, since the things that studies in the wild seek are essentially only those things that occur outside the lab, and hence are not things that can be replicated, modelled, or emulated. But in any case, some of those who have taken up the call for studies in the wild have taken it rather too literally: they have sought wild places, places where HCI researchers have not gone before. Needless to say, this being HCI, the places in question are not often that wild: woods near Brighton, for example, or street life in south Cambridge. What they ignore as they venture into these settings is the melee of office life, the place where the bulk of computer systems are located and the place in which, oddly enough, increasingly little HCI research gets done.