Author

Cong Li

Bio: Cong Li is an academic researcher from Intel. The author has contributed to research in the topics of Cache and Workload. The author has an h-index of 1 and has co-authored 2 publications receiving 4 citations.
Topics: Cache, Workload

Papers
Proceedings ArticleDOI
Huanxing Shen, Cong Li
01 Nov 2019
TL;DR: A meta-learning approach is proposed that discriminates increases in cache miss metrics, taking cache occupancy data as the precondition, in order to detect cache interference under a given workload intensity.
Abstract: While workload colocation improves cluster utilization in cloud environments, it introduces performance-impacting contentions on unmanaged resources. We address the problem of detecting contentions on the last-level cache using low-level platform counters, but without application performance data. The detection is performed in a noisy environment with a mix of contention cases and non-contention cases, but without the ground truth. We propose a meta-learning approach to discriminate the increase of cache miss metrics taking the cache occupancy data as the precondition. We assume that, given a certain workload intensity, increasing cache misses will be observed when the cache occupancy of the workload drops below its hot data size. Leveraging this assumption, the threshold of cache miss metrics to detect cache interference under the workload intensity is found by inducing the most discriminating rule from the noisy history. Similarly, we determine whether the cache interference impacts performance by discriminating the increase of cycles-per-instruction metrics with the interference signal. Experimental results indicate that the new approach achieves decent performance in identifying cache contentions with performance impact in noisy environments.

6 citations
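The core of the approach, inducing the most discriminating cache-miss rule from noisy history, can be sketched in a few lines of Python. This is a minimal reconstruction under the paper's stated assumption; the function name and the simple separation score are illustrative stand-ins, not the authors' exact meta-learning procedure.

```python
# Minimal sketch of the threshold induction, assuming history samples of
# (cache occupancy, miss metric) pairs collected at one workload intensity.

def induce_miss_threshold(samples, hot_data_size):
    """samples: pairs of (occupancy_mb, misses_per_kilo_instructions)."""
    samples = list(samples)
    # Precondition from the paper's assumption: contention is plausible
    # only when occupancy has dropped below the workload's hot data size.
    low = sorted(m for occ, m in samples if occ < hot_data_size)
    high = sorted(m for occ, m in samples if occ >= hot_data_size)
    if not low or not high:
        return None  # not enough history on one side of the precondition

    best_t, best_score = None, float("-inf")
    for t in low:  # candidate thresholds from the suspicious samples
        # Score the rule "miss metric > t" by how cleanly it separates the
        # low-occupancy (suspected contention) group from the rest.
        tp = sum(m > t for m in low) / len(low)
        fp = sum(m > t for m in high) / len(high)
        score = tp - fp
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```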

Proceedings ArticleDOI
Li Yi, Cong Li, Jianmei Guo
01 Oct 2020
TL;DR: It is shown that CPI is more sensitive than RCPI in identifying micro-architectural performance change in some cases, and that using CPI without referring to the workload intensity is probably inappropriate, which provokes the discussion of the right way to use CPI.
Abstract: Originally used for micro-architectural performance characterization, the metric of cycles per instruction (CPI) is now emerging as a proxy for workload performance measurement in runtime cloud environments. It has been used to evaluate the performance per workload before and after applying a system configuration change and to detect contentions on micro-architectural resources in workload colocation. In this paper, we re-examine the use of CPI on two representative cloud computing workloads. An alternative metric, reference cycles per instruction (RCPI), is defined for comparison. We show that CPI is more sensitive than RCPI in identifying micro-architectural performance change in some cases. However, in other cases with a different frequency scaling, we observe a better CPI value given a worse performance. We conjecture that both observations are due to the bias of CPI towards scenarios with a low core frequency. We next demonstrate that a significant change in either CPI or RCPI does not necessarily indicate a boost or loss in performance, since both CPI and RCPI depend on workload intensities. This implies that using CPI without referring to the workload intensity is probably inappropriate, which provokes the discussion of the right way to use CPI, e.g., modeling CPI as a dependent variable given other relevant factors as the independent variables.

4 citations
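The distinction between the two metrics is easy to illustrate: CPI divides unhalted core cycles, which tick at the current DVFS frequency, by retired instructions, while RCPI divides fixed-rate reference cycles by retired instructions. The toy sketch below uses assumed clock rates and an assumed interval to reproduce the low-frequency bias the abstract describes.

```python
# Toy illustration of CPI vs. RCPI. Core cycles tick at the current DVFS
# frequency; reference cycles tick at a fixed rate. Numbers are assumed.

def cpi(core_cycles, instructions):
    """Cycles per instruction from unhalted core cycles (DVFS-dependent)."""
    return core_cycles / instructions

def rcpi(ref_cycles, instructions):
    """Reference cycles per instruction (fixed-rate clock, DVFS-independent)."""
    return ref_cycles / instructions

instructions = 1_000_000            # same work retired in the same wall time
ref_hz, wall_s = 2.4e9, 0.01        # fixed reference clock, 10 ms interval
for core_hz in (1.2e9, 2.4e9, 3.6e9):
    core_cycles = core_hz * wall_s  # assume the core never halts
    ref_cycles = ref_hz * wall_s
    # Identical performance at every operating point, yet CPI "improves"
    # as the core frequency drops, while RCPI stays constant.
    print(f"{core_hz / 1e9:.1f} GHz: CPI={cpi(core_cycles, instructions):5.1f} "
          f"RCPI={rcpi(ref_cycles, instructions):5.1f}")
```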


Cited by

Journal ArticleDOI
TL;DR: The proposed DVFS-based power management techniques are particularly effective for a class of memory-intensive benchmarks: they improve EE by 121% to 183% and PxEE by 100% to 141%.
Abstract: This paper describes the results of our measurement-based study, conducted on an Intel Core i7 processor running the SPEC CPU2017 benchmark suites, that evaluates the impact of dynamic voltage and frequency scaling (DVFS) on performance (P), energy efficiency (EE), and their product (PxEE). The results indicate that the default DVFS-based power management techniques heavily favor performance, resulting in poor energy efficiency. To remedy this problem, we introduce, implement, and evaluate four DVFS-based power management techniques driven by the following metrics derived from the processor's performance monitoring unit: (i) the total pipeline slot stall ratio (FS-PS), (ii) the total cycle stall ratio (FS-TS), (iii) the total memory-related cycle stall ratio (FS-MS), and (iv) the number of last-level cache misses per kilo instructions (FS-LLCM). The proposed techniques linearly map these metrics onto the available processor clock frequencies. The experimental evaluation shows that the proposed techniques significantly improve the EE and PxEE metrics compared to the existing approaches: EE improves by 44% to 92%, and PxEE improves by 31% to 48% when all the benchmarks are considered together. Furthermore, we find that the proposed techniques are particularly effective for a class of memory-intensive benchmarks: they improve EE by 121% to 183% and PxEE by 100% to 141%. Finally, we elucidate the advantages and disadvantages of each of the proposed techniques and offer recommendations on using them.

2 citations
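The paper's linear mapping of stall metrics onto the available clock frequencies admits a simple sketch. The frequency table below is an assumed example rather than the paper's configuration, and the cpufreq details in the comments are standard Linux conventions, not details taken from the paper.

```python
# Sketch of the linear mapping from a stall metric to a clock frequency.
# Real frequency tables come from cpufreq (scaling_available_frequencies),
# and a userspace governor would write the choice to scaling_setspeed.

FREQS_KHZ = [800_000, 1_200_000, 1_600_000, 2_000_000, 2_400_000, 2_800_000]

def pick_frequency(stall_ratio, freqs=FREQS_KHZ):
    """Map a stall ratio in [0, 1] linearly onto the available frequencies:
    no stalls -> highest frequency, fully stalled -> lowest frequency."""
    stall_ratio = min(max(stall_ratio, 0.0), 1.0)
    idx = round((1.0 - stall_ratio) * (len(freqs) - 1))
    return freqs[idx]

# e.g., pick_frequency(0.0) -> 2_800_000 and pick_frequency(1.0) -> 800_000,
# so a mostly-stalled (memory-bound) phase is run at a low frequency.
```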

Proceedings ArticleDOI
Huanxing Shen, Cong Li
28 Sep 2020
TL;DR: In this article, the authors propose a new method for runtime estimation of application memory latency that helps discover the causal relationship between runtime factors and application performance, with applications to mitigating memory access interference in workload co-location and to dissecting performance problems in the memory subsystem.
Abstract: Various runtime factors impact memory latency and consequently impact application performance. Unfortunately, the causal relationship is buried, especially at runtime. In this paper we propose a new method for runtime estimation of application memory latency which helps discover the causal relationship. The new method leverages hardware performance counters to calculate the average time that memory requests wait before getting fulfilled. We evaluate the method empirically in multiple scenarios, and the estimation closely approximates the ground truth. We further demonstrate two examples of using the runtime estimation of application memory latency in application performance optimization and analysis: one in mitigating memory access interference in workload co-location, and the other in dissecting a performance problem in the memory subsystem.

1 citation
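One plausible way to realize such an estimator, consistent with the abstract's description but not confirmed by it, is to apply Little's law to an occupancy counter of the memory request queue; the argument names below are hypothetical stand-ins for platform-specific PMU events.

```python
# Plausible reconstruction via Little's law (L = lambda * W): average
# latency equals accumulated queue occupancy divided by request count.

def avg_memory_latency_ns(occupancy_cycles, num_requests, clock_hz):
    """occupancy_cycles: per-cycle count of in-flight memory requests,
    accumulated over the interval; num_requests: requests that entered
    the queue; clock_hz: rate of the clock the occupancy counter uses."""
    if num_requests == 0:
        return 0.0
    latency_cycles = occupancy_cycles / num_requests
    return latency_cycles / clock_hz * 1e9
```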

Proceedings ArticleDOI
07 Nov 2022
TL;DR: In this article, a QoS-aware power management controller for heterogeneous black-box workloads in public clouds is proposed, which is designed to work without offline profiling or prior knowledge about the black-box workloads.
Abstract: Energy consumption in cloud data centers has become an increasingly important contributor to greenhouse gas emissions and operating costs. To reduce energy-related costs and improve environmental sustainability, most modern data centers consolidate Virtual Machine (VM) workloads belonging to different application classes, some being latency-critical (LC) and others being more tolerant to performance changes, known as best-effort (BE). However, in public cloud scenarios, the real classes of applications are often opaque to data center operators. The heterogeneous applications from different cloud tenants are usually consolidated onto the same hosts to improve energy efficiency, but it is not trivial to guarantee decent performance isolation among colocated workloads. We tackle these challenges by introducing Demeter, a QoS-aware power management controller for heterogeneous black-box workloads in public clouds. Demeter is designed to work without offline profiling or prior knowledge about black-box workloads. Through correlation analysis between network throughput and CPU resource utilization, Demeter automatically classifies black-box workloads as either LC or BE. By provisioning differentiated CPU management strategies (including dynamic core allocation and frequency scaling) to LC and BE workloads, Demeter achieves considerable power savings together with a minimal impact on the performance of all workloads. We discuss the design and implementation of Demeter in this work, and conduct extensive experimental evaluations to reveal its effectiveness. Our results show that Demeter not only meets the performance demand of all workloads, but also responds quickly to dynamic load changes in our cloud environment. In addition, Demeter saves an average of 10.6% in power consumption compared to state-of-the-art mechanisms.
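The classification step admits a compact sketch. The assumption, taken from the abstract, is that a latency-critical VM's CPU utilization tracks its network throughput while a best-effort batch job's does not; the correlation cutoff below is an illustrative value, not Demeter's tuned parameter.

```python
# Sketch of the correlation-based LC/BE classification, assuming aligned
# time series of per-VM network throughput and CPU utilization samples.

from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx and sy else 0.0

def classify_vm(net_mbps, cpu_util, cutoff=0.5):
    """An LC workload's CPU use tracks its request rate, so throughput and
    utilization correlate strongly; a BE batch job's CPU use does not."""
    return "LC" if pearson(net_mbps, cpu_util) > cutoff else "BE"
```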