
Showing papers by "Lingjia Tang published in 2014"


Proceedings ArticleDOI
13 Dec 2014
TL;DR: This paper demonstrates through a real-system investigation that the fundamental difference between resource sharing behaviors on CMP and SMT architectures calls for a redesign of the way the authors model interference, and proposes SMiTe, a methodology that enables precise performance prediction for SMT co-location on real-system commodity processors.
Abstract: One of the key challenges for improving efficiency in warehouse scale computers (WSCs) is to improve server utilization while guaranteeing the quality of service (QoS) of latency-sensitive applications. To this end, prior work has proposed techniques to precisely predict performance and QoS interference to identify 'safe' application co-locations. However, such techniques are only applicable to resources shared across cores. Achieving such precise interference prediction on real-system simultaneous multithreading (SMT) architectures has been a significantly challenging open problem due to the complexity introduced by sharing resources within a core. In this paper, we demonstrate through a real-system investigation that the fundamental difference between resource sharing behaviors on CMP and SMT architectures calls for a redesign of the way we model interference. For SMT servers, the interference on different shared resources, including private caches, memory ports, as well as integer and floating-point functional units, does not correlate across resources. This insight suggests the necessity of decoupling interference into multiple resource sharing dimensions. In this work, we propose SMiTe, a methodology that enables precise performance prediction for SMT co-location on real-system commodity processors. With a set of Rulers, which are carefully designed software stressors that apply pressure to a multidimensional space of shared resources, we quantify application sensitivity and contentiousness in a decoupled manner. We then establish a regression model to combine the sensitivity and contentiousness in different dimensions to predict performance interference. Using this methodology, we are able to precisely predict the performance interference in SMT co-location with an average error of 2.80% on SPEC CPU2006 and 1.79% on CloudSuite. Our evaluation shows that SMiTe allows us to improve the utilization of WSCs by up to 42.57% while enforcing an application's QoS requirements.
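The abstract's decoupled prediction scheme can be sketched as follows. This is a minimal illustration, not the paper's actual model: the resource names, example numbers, and the simple product-sum regression form are all assumptions for illustration; the paper fits its regression on real SMT hardware measurements.

```python
# Hypothetical resource dimensions, loosely following the abstract's list
RESOURCES = ["private_cache", "memory_port", "int_unit", "fp_unit"]

def predict_slowdown(sensitivity, contentiousness, weights, bias=0.0):
    """Combine per-resource sensitivity and contentiousness.

    sensitivity[r]: target app's slowdown when co-run with the Ruler
    stressing resource r in isolation.
    contentiousness[r]: co-runner's measured pressure on resource r.
    weights / bias: regression coefficients fit offline on training pairs.
    """
    return bias + sum(weights[r] * sensitivity[r] * contentiousness[r]
                      for r in RESOURCES)

# Made-up example profiles for one app/co-runner pair
sens = {"private_cache": 1.30, "memory_port": 1.10,
        "int_unit": 1.05, "fp_unit": 1.00}
cont = {"private_cache": 0.8, "memory_port": 0.4,
        "int_unit": 0.2, "fp_unit": 0.1}
weights = {r: 1.0 for r in RESOURCES}

predicted = predict_slowdown(sens, cont, weights)
```

The key point the sketch captures is the decoupling: each shared-resource dimension contributes its own sensitivity-times-contentiousness term, rather than a single aggregate interference score.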

141 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: This work introduces protean code, a novel approach for enacting arbitrary compiler transformations at runtime for native programs running on commodity hardware with negligible (<1%) overhead, and designs PC3D, Protean Code for Cache Contention in Datacenters.
Abstract: Rampant dynamism due to load fluctuations, co-runner changes, and varying levels of interference poses a threat to application quality of service (QoS) and has limited our ability to allow co-locations in modern warehouse scale computers (WSCs). Instruction set features such as the non-temporal memory access hints found in modern ISAs (both ARM and x86) may be useful in mitigating these effects. However, despite the challenge of this dynamism and the availability of an instruction set mechanism that might help address the problem, a key capability missing in the system software stack in modern WSCs is the ability to dynamically transform (and re-transform) the executing application code to apply these instruction set features when necessary. In this work we introduce protean code, a novel approach for enacting arbitrary compiler transformations at runtime for native programs running on commodity hardware with negligible (<1%) overhead. The fundamental insight behind the underlying mechanism of protean code is that, instead of maintaining full control throughout the program's execution as with traditional dynamic optimizers, protean code allows the original binary to execute continuously and diverts control flow only at a set of virtualized points, allowing rapid and seamless rerouting to the new code variants. In addition, the protean code compiler embeds IR with high-level semantic information into the program, empowering the dynamic compiler to perform rich analysis and transformations online with little overhead. Using a fully functional protean code compiler and runtime built on LLVM, we design PC3D, Protean Code for Cache Contention in Datacenters. PC3D dynamically employs non-temporal access hints to achieve utilization improvements of up to 2.8x (1.5x on average) higher than state-of-the-art contention mitigation runtime techniques at a QoS target of 98%.
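The "virtualized points" idea above can be sketched conceptually: the running code is never paused wholesale; only a small indirection is retargeted to reroute execution to a new variant. This is a loose Python analogy, not the paper's LLVM-based mechanism; all names here (`ProteanPoint`, the two scan variants) are invented for illustration.

```python
import threading

class ProteanPoint:
    """Conceptual stand-in for a virtualized control-flow point.

    The original code keeps executing; the runtime only swaps the
    target of this one indirection to reroute calls to a new variant.
    """
    def __init__(self, variant):
        self._variant = variant
        self._lock = threading.Lock()

    def retarget(self, new_variant):
        # the runtime compiler installs a freshly generated variant
        with self._lock:
            self._variant = new_variant

    def __call__(self, *args, **kwargs):
        return self._variant(*args, **kwargs)

def scan_original(data):
    # original variant: ordinary, cache-polluting accesses
    return sum(data)

def scan_nontemporal(data):
    # stand-in for a variant recompiled with non-temporal access hints
    return sum(data)

hot_point = ProteanPoint(scan_original)
result_before = hot_point([1, 2, 3])   # dispatches to the original
hot_point.retarget(scan_nontemporal)   # runtime reroutes the point
result_after = hot_point([1, 2, 3])    # same semantics, new variant
```

In the real system the indirection is at the machine-code level and the variants differ in their use of non-temporal store/load hints, but the control-flow-diversion principle is the same.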

47 citations


Proceedings Article
19 Jun 2014
TL;DR: Introduces a hyperthread-aware power model that differentiates between the state in which both hardware threads of a core are in use and the state in which only one is, enabling accurate power attribution to each logical CPU in modern servers.
Abstract: Quantifying the power consumption of individual applications co-running on a single server is a critical component for software-based power capping, scheduling, and provisioning techniques in modern datacenters. However, with the proliferation of hyperthreading in the last few generations of server-grade processor designs, the challenge of accurately and dynamically performing this power attribution to individual threads has been significantly exacerbated. Due to the sharing of core-level resources such as functional units, prior techniques are not suitable to attribute the power consumption between hyperthreads sharing a physical core. In this paper, we present HaPPy, a runtime mechanism that quantifies and attributes power consumption to individual jobs at fine granularity. Specifically, we introduce a hyperthread-aware power model that differentiates between the state in which both hardware threads of a core are in use and the state in which only one thread is in use. By capturing these two different states, we are able to accurately attribute power to each logical CPU in modern servers. We conducted experiments with several Google production workloads on an Intel Sandy Bridge server. Compared to a prior hyperthread-oblivious model, HaPPy is substantially more accurate, reducing the prediction error from 20.5% to 7.5% on average and from 31.5% to 9.4% in the worst case.
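The two-state attribution idea can be sketched as follows. This is a simplified illustration, not HaPPy's actual model: the per-state wattages and the even split of the dual-thread state are assumptions; the paper derives core states and power from hardware performance counters.

```python
def attribute_core_power(samples, p_one=12.0, p_both=16.0):
    """Attribute a physical core's power to its two hyperthreads.

    samples: per-time-slice (t0_active, t1_active) activity booleans.
    p_one / p_both: hypothetical calibrated core power (watts) for the
    one-thread-active and both-threads-active states.
    Returns average watts attributed to each logical CPU.
    """
    watts = [0.0, 0.0]
    for t0, t1 in samples:
        if t0 and t1:
            # dual-thread state: both threads share the (higher) power
            watts[0] += p_both / 2
            watts[1] += p_both / 2
        elif t0:
            watts[0] += p_one   # single-thread state, thread 0
        elif t1:
            watts[1] += p_one   # single-thread state, thread 1
    n = max(len(samples), 1)
    return [w / n for w in watts]

# Made-up activity trace: both active, only t0, idle, both active
trace = [(True, True), (True, False), (False, False), (True, True)]
per_thread = attribute_core_power(trace)
```

A hyperthread-oblivious model would charge each logical CPU as if it owned the core, double-counting the shared dual-thread state; distinguishing the two states is what removes that error.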

37 citations


Journal ArticleDOI
01 Apr 2014
TL;DR: POPPA is a runtime system that enables fair pricing by delivering precise online interference detection, facilitating the adoption of co-location on supercomputers; it quantifies inter-application interference within 4% mean absolute error on a variety of co-located benchmark and real scientific workloads.
Abstract: Co-location, where multiple jobs share compute nodes in large-scale HPC systems, has been shown to increase aggregate throughput and energy efficiency by 10--20%. However, system operators disallow co-location due to fair-pricing concerns, i.e., the lack of a pricing mechanism that considers performance interference from co-running jobs. In the current pricing model, application execution time determines the price, which results in unfair prices paid by the minority of users whose jobs suffer from co-location. This paper presents POPPA, a runtime system that enables fair pricing by delivering precise online interference detection, facilitating the adoption of co-location on supercomputers. POPPA leverages a novel shutter mechanism, a cyclic, fine-grained interference sampling mechanism that accurately deduces the interference between co-runners, to provide unbiased pricing of jobs that share nodes. POPPA is able to quantify inter-application interference within 4% mean absolute error on a variety of co-located benchmark and real scientific workloads.
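The shutter-based pricing logic can be sketched as follows. This is a hedged simplification of the idea described in the abstract: the function names, the IPC-based slowdown estimate, and the pricing formula are illustrative assumptions, not POPPA's exact implementation.

```python
def estimate_slowdown(corun_ipc, shutter_ipc):
    """Estimate a job's interference-induced slowdown.

    corun_ipc: instructions/cycle measured while co-runners run normally.
    shutter_ipc: instructions/cycle measured during brief, cyclic
    'shutter' windows in which co-runners are paused, approximating
    the job's solo performance.
    """
    return shutter_ipc / corun_ipc

def fair_price(measured_hours, slowdown, rate_per_hour):
    # Charge for the time the job would have taken running alone,
    # rather than the inflated co-located execution time.
    return measured_hours / slowdown * rate_per_hour

# Made-up numbers: solo IPC 1.0 vs 0.8 co-located => 1.25x slowdown
slowdown = estimate_slowdown(corun_ipc=0.8, shutter_ipc=1.0)
price = fair_price(measured_hours=100.0, slowdown=slowdown,
                   rate_per_hour=0.1)
```

The point of the shutter mechanism is that the solo baseline is sampled online, cyclically and briefly, so pricing stays unbiased without ever running the job in isolation end to end.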

4 citations