Author

Huanxing Shen

Bio: Huanxing Shen is an academic researcher from Intel. The author has contributed to research on the topics of Cache and Workload, has an h-index of 1, and has co-authored 1 publication receiving 4 citations.
Topics: Cache, Workload

Papers
Proceedings ArticleDOI
Huanxing Shen, Cong Li
01 Nov 2019
TL;DR: Proposes a meta-learning approach that discriminates increases in cache miss metrics, taking cache occupancy data as the precondition, to detect cache interference under a given workload intensity.
Abstract: While workload colocation improves cluster utilization in cloud environments, it introduces performance-impacting contention on unmanaged resources. We address the problem of detecting contention on the last-level cache with low-level platform counters, but without application performance data. The detection is performed in a noisy environment with a mix of contention and non-contention cases, but without ground truth. We propose a meta-learning approach to discriminate the increase of cache miss metrics, taking the cache occupancy data as the precondition. We assume that, given a certain workload intensity, increasing cache misses will be observed when the cache occupancy of the workload drops below its hot data size. Leveraging this assumption, the threshold of cache miss metrics for detecting cache interference under that workload intensity is found by inducing the most discriminating rule from the noisy history. Similarly, we determine whether the cache interference impacts performance by discriminating the increase of cycles-per-instruction metrics given the interference signal. Experimental results indicate that the new approach achieves decent performance in identifying cache contention with performance impact in noisy environments.
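The threshold-induction idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name, the sample format, and the use of accuracy as the rule-quality criterion are assumptions for illustration. Given noisy (occupancy, miss-metric) samples at one workload intensity, the precondition "occupancy below the hot data size" labels the presumed contention cases, and the sketch picks the miss-metric threshold whose rule best discriminates them:

```python
def best_threshold(samples, hot_size):
    """Induce the most discriminating miss-metric threshold from noisy history.

    samples: list of (cache_occupancy, miss_metric) pairs at one workload
    intensity. The precondition occupancy < hot_size labels presumed
    contention cases; we pick the threshold t such that the rule
    "miss_metric >= t predicts contention" agrees best with those labels.
    """
    labels = [occ < hot_size for occ, _ in samples]
    best_t, best_acc = None, -1.0
    for t in sorted(m for _, m in samples):   # candidate thresholds
        preds = [m >= t for _, m in samples]
        acc = sum(p == l for p, l in zip(preds, labels)) / len(samples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

With clean data the induced threshold separates the two regimes exactly; with noisy labels the accuracy score simply trades off false positives against false negatives.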

6 citations


Cited by

Proceedings ArticleDOI
Li Yi, Cong Li, Jianmei Guo
01 Oct 2020
TL;DR: Shows that CPI is more sensitive than RCPI in identifying micro-architectural performance change in some cases, and that using CPI without referring to the workload intensity is probably inappropriate, which provokes a discussion of the right way to use CPI.
Abstract: Originally used for micro-architectural performance characterization, the metric of cycles per instruction (CPI) is now emerging as a proxy for workload performance measurement in runtime cloud environments. It has been used to evaluate the performance per workload before and after applying a system configuration change and to detect contentions on the micro-architectural resources in workload colocation. In this paper, we re-examine the use of CPI on two representative cloud computing workloads. An alternative metric, reference cycles per instruction (RCPI), is defined for comparison. We show that CPI is more sensitive than RCPI in identifying micro-architectural performance change in some cases. However, in other cases with a different frequency scaling, we observe a better CPI value given a worse performance. We conjecture that both observations are due to the bias of CPI towards scenarios with a low core frequency. We next demonstrate that a significant change in either CPI or RCPI does not necessarily indicate a boost or loss in performance, since both CPI and RCPI are dependent on workload intensities. This implies that the use of CPI without referring to the workload intensity is probably inappropriate. This provokes the discussion of the right way to use CPI, e.g., modeling CPI as a dependent variable given other relevant factors as the independent variables.
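The two metrics compared in the abstract are straightforward counter ratios; a minimal sketch (function names are invented here) makes the frequency-scaling relationship between them explicit:

```python
def cpi(core_cycles, instructions):
    # cycles per instruction on the core clock, which varies with frequency scaling
    return core_cycles / instructions

def rcpi(ref_cycles, instructions):
    # cycles per instruction on the fixed-frequency reference clock
    return ref_cycles / instructions

# For the same interval of wall time, core_cycles and ref_cycles differ by
# the ratio of the two clock frequencies, so rcpi = cpi * (ref_freq / core_freq).
# At a core frequency below the reference clock, the same stall time costs
# fewer core cycles, which is one way to see the bias the paper conjectures.
```

For example, a workload retiring 10^6 instructions over the same wall-time interval at a 1 GHz core clock against a 2 GHz reference clock accumulates twice as many reference cycles as core cycles.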

4 citations

Proceedings ArticleDOI
Huanxing Shen, Cong Li
28 Sep 2020
TL;DR: In this article, the authors propose a new method for runtime estimation of application memory latency, which helps discover the causal relationships behind performance, with applications in mitigating memory access interference in workload co-location and in dissecting performance problems in the memory subsystem.
Abstract: Various runtime factors impact memory latency and consequently impact application performance. Unfortunately, the causal relationship is buried, especially at runtime. In this paper, we propose a new method for runtime estimation of application memory latency which helps discover the causal relationship. The new method leverages hardware performance counters to calculate the average time that memory requests wait before getting fulfilled. We evaluate the method empirically in multiple scenarios, and the estimation closely approximates the ground truth. We further demonstrate two examples of using the runtime estimation of application memory latency in application performance optimization and analysis: one in mitigating memory access interference in workload co-location, and the other in dissecting a performance problem in the memory subsystem.
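One plausible reading of "the average time that memory requests wait before getting fulfilled" is a Little's-law estimate over an occupancy-style counter; the counter semantics and function below are assumptions for illustration, not the paper's exact formulation:

```python
def avg_mem_latency_ns(occupancy_cycle_sum, completed_requests, clock_ghz):
    """Estimate average memory request latency from two counters.

    occupancy_cycle_sum: sum over cycles of the number of in-flight
        memory requests (an occupancy counter for the interval).
    completed_requests: requests fulfilled in the same interval.
    By Little's law, average latency in cycles is occupancy / throughput;
    dividing by the clock rate in GHz converts cycles to nanoseconds.
    """
    latency_cycles = occupancy_cycle_sum / completed_requests
    return latency_cycles / clock_ghz  # cycles / (cycles per ns) = ns
```

For instance, an occupancy sum of 10,000 request-cycles over 100 completed requests on a 2 GHz clock works out to 100 cycles, i.e., 50 ns per request.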

1 citation

Proceedings ArticleDOI
01 Nov 2022
TL;DR: In this paper, the authors introduce Probabilistic Induction of Theft Evictions, or PInTE, which allows controllable contention induction via data movement towards eviction in the last-level cache replacement policy.
Abstract: Cache contention analysis remains complex without a controlled and lightweight method of inducing contention for shared resources. Prior art commonly leverages a second workload on an adjacent core to cause contention, with the workload being either real or tunable. Using a secondary workload comes with unique problems in simulation: real workloads aren't controllable and can result in many combinations to measure a broad range of contention, while tunable workloads provide control but don't guarantee contention without filling all cache sets with contention behavior. Lastly, running multiple workloads increases the runtime of simulation environments by 2.4× on average. We introduce Probabilistic Induction of Theft Evictions, or PInTE, which allows controllable contention induction via data movement towards eviction in the last-level cache replacement policy. PInTE provides configurable contention with 2.6× fewer experiments, 2.2× less average time, and 5.6× less total time for a set of SPEC 17 speed-based traces. Further, PInTE incurs −8.46% average relative error in performance when compared to real contention. Run-time and reuse behavior of workloads under PInTE contention approximate behavior under real contention: the information distance is 0.03 bits and 0.84 bits, respectively. Additionally, PInTE enables a first-time contention sensitivity analysis of SPEC, as well as case studies which evaluate the resilience of micro-architectural techniques under growing contention.
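As a rough illustration of probabilistically pushing resident lines toward eviction inside a replacement policy, consider the toy cache set below. The class name, the LRU-style aging, and the parameter `p` are all invented for this sketch; the paper's actual mechanism operates on the last-level cache replacement state in simulation and is not reproduced here:

```python
import random

class TheftInducingSet:
    """Toy cache set: LRU replacement plus a probabilistic 'theft' step
    that, with probability p per access, demotes a random resident line
    to the eviction (LRU) position. A sketch of controllable contention
    induction, not PInTE's actual mechanism."""

    def __init__(self, ways, p, seed=0):
        self.ways, self.p = ways, p
        self.lines = []               # index 0 = next eviction victim (LRU)
        self.rng = random.Random(seed)
        self.demotions = 0            # how many lines were aged by 'theft'

    def access(self, tag):
        if tag in self.lines:
            self.lines.remove(tag)    # hit: refresh recency below
        elif len(self.lines) >= self.ways:
            self.lines.pop(0)         # miss in a full set: evict the LRU line
        self.lines.append(tag)        # most recently used at the end
        # probabilistic theft: age one non-LRU resident line to the LRU slot
        if len(self.lines) > 1 and self.rng.random() < self.p:
            victim = self.rng.choice(self.lines[1:])
            self.lines.remove(victim)
            self.lines.insert(0, victim)
            self.demotions += 1
```

With `p = 0` the set degenerates to plain LRU; raising `p` dials in more induced evictions without running a second workload, which is the control knob the abstract is after.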
Huanxing Shen, Cong Li
01 Jan 2018
TL;DR: This paper presents Zeno, a novel system which automatically identifies and diagnoses stragglers in jobs using machine learning methods, and which is able to generate simple and easy-to-read rules with both valuable insights and decent performance in predicting stragglers.
Abstract: Modern distributed computing frameworks for cloud computing and high-performance computing typically accelerate job performance by dividing a large job into small tasks for execution parallelism. Some tasks, however, may run far behind others, jeopardizing the job completion time. In this paper, we present Zeno, a novel system which automatically identifies and diagnoses stragglers for jobs using machine learning methods. First, the system identifies stragglers with an unsupervised clustering method which groups the tasks based on their execution time. It then uses a supervised rule learning algorithm to learn diagnosis rules that infer the stragglers from their resource assignment and usage data. Zeno is evaluated on traces from Google's Borg system and Alibaba's Fuxi system. The results demonstrate that our system is able to generate simple and easy-to-read rules with both valuable insights and decent performance in predicting stragglers.
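The unsupervised identification step described above can be sketched as a one-dimensional two-cluster split of task execution times, with the slower cluster flagged as stragglers. This is an assumed simplification (plain 2-means on execution time), not Zeno's exact clustering algorithm:

```python
def identify_stragglers(times, iters=20):
    """Split task execution times into two clusters with 1-D 2-means;
    tasks in the slower cluster are flagged as stragglers. A sketch of
    the unsupervised identification step, not Zeno's exact algorithm."""
    c = [min(times), max(times)]            # initial centroids: fast, slow
    for _ in range(iters):
        groups = ([], [])
        for t in times:
            # assign each task to the nearer centroid (index 1 = slow cluster)
            groups[abs(t - c[0]) > abs(t - c[1])].append(t)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return [t for t in times if abs(t - c[0]) > abs(t - c[1])]
```

The flagged tasks would then feed the supervised rule learner, paired with their resource assignment and usage features, to produce the diagnosis rules the paper describes.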