Author

Fei Guo

Bio: Fei Guo is an academic researcher from North Carolina State University. The author has contributed to research in topics: Cache & Thread (computing). The author has an h-index of 6 and has co-authored 6 publications receiving 1125 citations.

Papers
Proceedings ArticleDOI
12 Feb 2005
TL;DR: Three performance models are proposed that predict the impact of cache sharing on co-scheduled threads; the most accurate, the inductive probability model, achieves an average error of only 3.9%.
Abstract: This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a chip multi-processor (CMP) architecture. Cache sharing impacts threads nonuniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the inductive probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.
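A minimal sketch of the kind of input these models consume: an isolated L2 stack distance profile for one thread, built from a per-set LRU stack over a trace of (set, tag) accesses. The trace format and function names are illustrative assumptions, not the paper's tooling.

```python
# Hypothetical sketch: building an L2 stack distance profile for one thread.
from collections import defaultdict

def stack_distance_profile(accesses, associativity):
    """accesses: iterable of (set_index, tag) pairs from an isolated run.
    Returns counts where counts[d] is the number of accesses with LRU stack
    distance d; distances >= associativity (misses) are lumped into the last bin."""
    lru_stacks = defaultdict(list)              # per-set LRU stack, MRU at index 0
    counts = [0] * (associativity + 1)
    for set_idx, tag in accesses:
        stack = lru_stacks[set_idx]
        if tag in stack:
            d = stack.index(tag)                # 0 = MRU hit, associativity-1 = LRU hit
            stack.remove(tag)
        else:
            d = associativity                   # not resident: miss bucket
        counts[min(d, associativity)] += 1
        stack.insert(0, tag)                    # promote to MRU
        del stack[associativity:]               # evict beyond the set's capacity
    return counts

# Toy trace for a 4-way cache: one reuse at distance 1, three misses
print(stack_distance_profile([(0, 'a'), (0, 'b'), (0, 'a'), (0, 'c')], 4))  # [0, 1, 0, 0, 3]
```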

543 citations

Proceedings ArticleDOI
12 Jun 2007
TL;DR: A QoS-enabled memory architecture for CMP platforms that enables more cache resources and memory resources for high-priority applications based on guidance from the operating environment and allows dynamic resource reassignment during run-time to further optimize the performance of the high-priority application with minimal degradation to low-priority applications.
Abstract: As we enter the era of CMP platforms with multiple threads/cores on the die, the diversity of the simultaneous workloads running on them is expected to increase. The rapid deployment of virtualization as a means to consolidate workloads on to a single platform is a prime example of this trend. In such scenarios, the quality of service (QoS) that each individual workload gets from the platform can widely vary depending on the behavior of the simultaneously running workloads. While the number of cores assigned to each workload can be controlled, there is no hardware or software support in today's platforms to control allocation of platform resources such as cache space and memory bandwidth to individual workloads. In this paper, we propose a QoS-enabled memory architecture for CMP platforms that addresses this problem. The QoS-enabled memory architecture enables more cache resources (i.e. space) and memory resources (i.e. bandwidth) for high priority applications based on guidance from the operating environment. The architecture also allows dynamic resource reassignment during run-time to further optimize the performance of the high priority application with minimal degradation to low-priority applications. To achieve these goals, we describe the hardware/software support required in the platform as well as the operating environment (O/S and virtual machine monitor). Our evaluation framework consists of detailed platform simulation models and a QoS-enabled version of Linux. Based on evaluation experiments, we show the effectiveness of a QoS-enabled architecture and summarize key findings/trade-offs.
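As a rough illustration of the dynamic reassignment idea (not the paper's mechanism), one can imagine a run-time step that shifts last-level-cache ways from the low-priority to the high-priority class whenever the high-priority application misses its performance target; the names and thresholds below are assumptions.

```python
# Illustrative sketch of priority-guided cache-way reassignment; not the paper's design.
def reassign(alloc, hp_perf, hp_target, step=2):
    """alloc: {'high': ways, 'low': ways}. Shift ways toward the high-priority
    class while its measured performance (hp_perf) is below its target."""
    if hp_perf < hp_target and alloc['low'] > step:
        alloc['high'] += step
        alloc['low'] -= step
    return alloc

alloc = {'high': 8, 'low': 8}
print(reassign(alloc, hp_perf=0.9, hp_target=1.2))   # {'high': 10, 'low': 6}
```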

282 citations

Proceedings ArticleDOI
01 Dec 2007
TL;DR: This paper investigates a framework that would be needed for a CMP to fully provide QoS, and proposes novel throughput optimization techniques that include: exploiting various QoS execution modes, and a microarchitecture technique that steals excess resources from a job while still meeting its QoS target.
Abstract: The trends in enterprise IT toward service-oriented computing, server consolidation, and virtual computing point to a future in which workloads are becoming increasingly diverse in terms of performance, reliability, and availability requirements. It can be expected that more and more applications with diverse requirements will run on a CMP and share platform resources such as the lowest level cache and off-chip bandwidth. In this environment, it is desirable to have microarchitecture and software support that can provide a guarantee of a certain level of performance, which we refer to as performance Quality of Service. In this paper, we investigate a framework that would be needed for a CMP to fully provide QoS. We found that the ability of a CMP to partition platform resources alone is not sufficient for fully providing QoS. We also need an appropriate way to specify a QoS target, and an admission control policy that accepts jobs only when their QoS targets can be satisfied. We also found that providing strict QoS often leads to a significant reduction in throughput due to resource fragmentation. We propose novel throughput optimization techniques that include: (1) exploiting various QoS execution modes, and (2) a microarchitecture technique that steals excess resources from a job while still meeting its QoS target. We evaluated our QoS framework with a full system simulation of a 4-core CMP and a recent version of the Linux Operating System. We found that compared to an unoptimized scheme, the throughput can be improved by up to 47%, making the throughput significantly closer to a non-QoS CMP.
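A hedged sketch of the admission-control idea described above: a job is accepted only if an unreserved slice of the partitionable resources is predicted to meet its QoS target. The resource model and the performance predictor are placeholders, not the paper's framework.

```python
# Hypothetical admission-control check over two partitionable resources.
def admit(job, free_ways, free_bw, predict_perf):
    """job: dict with 'qos_target', 'ways_needed', 'bw_needed'.
    predict_perf(ways, bw) -> predicted normalized performance for this job."""
    if job['ways_needed'] > free_ways or job['bw_needed'] > free_bw:
        return False                               # insufficient (or fragmented) resources
    return predict_perf(job['ways_needed'], job['bw_needed']) >= job['qos_target']

job = {'qos_target': 0.8, 'ways_needed': 4, 'bw_needed': 0.25}
print(admit(job, free_ways=6, free_bw=0.4,
            predict_perf=lambda ways, bw: 0.15 * ways + bw))   # True (0.85 >= 0.8)
```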

125 citations

Journal ArticleDOI
26 Jun 2006
TL;DR: This paper is the first to propose an analytical model which predicts the performance of cache replacement policies; the model is based on probability theory and relies solely on the statistical properties of the application, without heuristics or rules of thumb.
Abstract: Due to the increasing gap between CPU and memory speed, cache performance plays an increasingly critical role in determining the overall performance of microprocessor systems. One of the important factors that affect cache performance is the cache replacement policy. Despite this importance, current analytical cache performance models ignore the impact of cache replacement policies on cache performance. To the best of our knowledge, this paper is the first to propose an analytical model which predicts the performance of cache replacement policies. The input to our model is a simple circular sequence profile of each application, which requires very little storage overhead. The output of the model is the predicted miss rates of an application under different replacement policies. The model is based on probability theory and utilizes Markov processes to compute each cache access's miss probability. The model makes realistic assumptions and relies solely on the statistical properties of the application, without relying on heuristics or rules of thumb. The model's run time is less than 0.1 seconds, much lower than that of trace simulations. We validate the model by comparing the predicted miss rates of seventeen SPEC2000 and NAS benchmark applications against miss rates obtained by detailed execution-driven simulations, across a range of different cache sizes, associativities, and four replacement policies, and show that the model is very accurate. The model's average prediction error is 1.41%, and there are only 14 out of 952 validation points in which the prediction errors are larger than 10%.
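For context, a far simpler calculation than the paper's Markov-based model (and not a substitute for it): under LRU, an access to a set-associative cache misses exactly when its stack distance is at least the associativity, so an LRU miss rate can be read directly off a stack distance histogram like the one sketched earlier.

```python
# Baseline-only sketch: LRU miss rate from a stack distance histogram.
def lru_miss_rate(stack_distance_counts, associativity):
    total = sum(stack_distance_counts)
    misses = sum(stack_distance_counts[associativity:])
    return misses / total if total else 0.0

print(lru_miss_rate([120, 60, 30, 15, 75], associativity=4))   # 75 / 300 = 0.25
```

Modeling other policies (e.g. random or pseudo-LRU) is where the paper's probabilistic machinery comes in.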

74 citations

Journal ArticleDOI
TL;DR: This paper describes QoS prototypes and presents preliminary case studies of multi-tasking and virtualization usage models sharing one critical CMP resource (last-level cache) to demonstrate how proper management of the cache resource can provide service differentiation and deterministic performance behavior when running disparate workloads in future CMP platforms.
Abstract: As more and more cores are enabled on the die of future CMP platforms, we expect that several diverse workloads will run simultaneously on the platform. A key example of this trend is the growth of virtualization usage models. When multiple virtual machines or applications or threads run simultaneously, the quality of service (QoS) that the platform provides to each individual thread is non-deterministic today. This occurs because the simultaneously running threads place very different demands on the shared resources (cache space, memory bandwidth, etc.) in the platform and in most cases contend with each other. In this paper, we first present case studies that show how this results in non-deterministic performance. Unlike the compute resources managed through scheduling, platform resource allocation to individual threads cannot be controlled today. In order to provide better determinism and QoS, we then examine resource management mechanisms and present QoS-aware architectures and execution environments. The main contribution of this paper is the architecture feasibility analysis through prototypes that allow experimentation with QoS-aware execution environments and architectural resources. We describe these QoS prototypes and then present preliminary case studies of multi-tasking and virtualization usage models sharing one critical CMP resource (last-level cache). We then demonstrate how proper management of the cache resource can provide service differentiation and deterministic performance behavior when running disparate workloads in future CMP platforms.

66 citations


Cited by
Proceedings ArticleDOI
09 Jun 2007
TL;DR: A Dynamic Insertion Policy (DIP) is proposed to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses; DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
Abstract: The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulting in inefficient use of cache space. Cache performance can be improved if some fraction of the working set is retained in the cache so that at least that fraction of the working set can contribute to cache hits. We show that simple changes to the insertion policy can significantly reduce cache misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP), which places the incoming line in the LRU position instead of the MRU position. LIP protects the cache from thrashing and results in a close to optimal hit rate for applications that have a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an enhancement of LIP that adapts to changes in the working set while maintaining the thrashing protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses. The proposed insertion policies do not require any change to the existing cache structure, are trivial to implement, and have a storage requirement of less than two bytes. We show that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
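A minimal sketch of the three insertion policies on a single LRU-ordered set (index 0 = MRU, last index = LRU). The BIP epsilon and the set-dueling bookkeeping of DIP are simplified for illustration and do not reproduce the paper's hardware details.

```python
import random

def insert_lru(cache_set, line):                 # traditional policy: insert at MRU
    cache_set.insert(0, line)

def insert_lip(cache_set, line):                 # LIP: insert at the LRU position
    cache_set.append(line)

def insert_bip(cache_set, line, epsilon=1/32):   # BIP: MRU insertion only occasionally
    if random.random() < epsilon:
        cache_set.insert(0, line)
    else:
        cache_set.append(line)

def insert_dip(cache_set, line, psel, threshold=512):
    """DIP: a saturating counter (psel), updated by misses in dedicated leader
    sets, selects LRU insertion or BIP for all follower sets."""
    if psel < threshold:
        insert_lru(cache_set, line)
    else:
        insert_bip(cache_set, line)
```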

722 citations

Proceedings ArticleDOI
03 Dec 2011
TL;DR: Bubble-Up is presented, a characterization methodology that enables accurate prediction of the performance degradation that results from contention for shared resources in the memory subsystem; it can predict the performance interference between co-located applications to within 1% to 2% of the actual degradation.
Abstract: As much of the world's computing continues to move into the cloud, the overprovisioning of computing resources to ensure the performance isolation of latency-sensitive tasks, such as web search, in modern datacenters is a major contributor to low machine utilization. Being unable to accurately predict performance degradation due to contention for shared resources on multicore systems has led to the heavy-handed approach of simply disallowing the co-location of high-priority, latency-sensitive tasks with other tasks. Performing this precise prediction has been a challenging and unsolved problem. In this paper, we present Bubble-Up, a characterization methodology that enables the accurate prediction of the performance degradation that results from contention for shared resources in the memory subsystem. By using a bubble to apply a tunable amount of “pressure” to the memory subsystem on processors in production datacenters, our methodology can predict the performance interference between co-located applications with an accuracy within 1% to 2% of the actual performance degradation. Using this methodology to arrive at “sensible” co-locations in Google's production datacenters with real-world large-scale applications, we can improve the utilization of a 500-machine cluster by 50% to 90% while guaranteeing a high quality of service for latency-sensitive applications.
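A hedged sketch of the prediction step implied above: each latency-sensitive application is characterized once by a sensitivity curve (bubble pressure versus measured degradation), each candidate co-runner by a single pressure score, and the predicted interference is an interpolation on that curve. All numbers below are made up for illustration.

```python
# Hypothetical Bubble-Up-style lookup; the curve and pressure units are illustrative.
def predict_degradation(sensitivity_curve, corunner_pressure):
    """sensitivity_curve: sorted list of (pressure, degradation_pct) points."""
    for (p0, d0), (p1, d1) in zip(sensitivity_curve, sensitivity_curve[1:]):
        if p0 <= corunner_pressure <= p1:
            frac = (corunner_pressure - p0) / (p1 - p0)    # linear interpolation
            return d0 + frac * (d1 - d0)
    return sensitivity_curve[-1][1]                        # beyond the measured range

curve = [(0, 0.0), (5, 2.0), (10, 6.5), (20, 14.0)]        # pressure (MB) vs % slowdown
print(predict_degradation(curve, corunner_pressure=8))     # 4.7 (% predicted slowdown)
```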

596 citations

Journal ArticleDOI
Onur Mutlu, Thomas Moscibroda
01 Jun 2008
TL;DR: A parallelism-aware batch scheduler that seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities, and is also simpler to implement than STFM.
Abstract: In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads' DRAM-bank-level parallelism. Requests whose latencies would otherwise have been overlapped could effectively become serialized. As a result both fairness and system throughput degrade, and some threads can starve for long time periods. This paper proposes a fundamentally new approach to designing a shared DRAM controller that provides quality of service to threads, while also improving system throughput. Our parallelism-aware batch scheduler (PAR-BS) design is based on two key ideas. First, PAR-BS processes DRAM requests in batches to provide fairness and to avoid starvation of requests. Second, to optimize system throughput, PAR-BS employs a parallelism-aware DRAM scheduling policy that aims to process requests from a thread in parallel in the DRAM banks, thereby reducing the memory-related stall-time experienced by the thread. PAR-BS seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities. We evaluate the design trade-offs involved in PAR-BS and compare it to four previously proposed DRAM scheduler designs on 4-, 8-, and 16-core systems. Our evaluations show that, averaged over 100 4-core workloads, PAR-BS improves fairness by 1.11X and system throughput by 8.3% compared to the best previous scheduling technique, Stall-Time Fair Memory (STFM) scheduling. Based on simple request prioritization rules, PAR-BS is also simpler to implement than STFM.
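A hedged sketch of the two PAR-BS ideas in simplified form: batch outstanding requests and service marked requests first, and rank threads within a batch so that threads with the fewest requests at their busiest bank go first, which tends to preserve each thread's bank-level parallelism. Batching and tie-breaking details are simplified relative to the paper.

```python
from collections import defaultdict

def rank_threads(batch):
    """batch: list of requests, each a dict with 'thread' and 'bank'.
    Returns {thread: rank}; lower rank = higher priority (max-total rule)."""
    per_bank = defaultdict(lambda: defaultdict(int))
    total = defaultdict(int)
    for req in batch:
        per_bank[req['thread']][req['bank']] += 1
        total[req['thread']] += 1
    shortest_first = sorted(per_bank, key=lambda t: (max(per_bank[t].values()), total[t]))
    return {t: r for r, t in enumerate(shortest_first)}

def request_priority(req, ranks):
    """Scheduler sort key: marked (in-batch) requests first, then thread rank,
    then row-buffer hits, then age (older first)."""
    return (not req['marked'], ranks.get(req['thread'], len(ranks)),
            not req['row_hit'], req['arrival_time'])
```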

575 citations

Journal ArticleDOI
13 Mar 2010
TL;DR: This study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling, and finds a classification scheme that addresses not only contention for cache space, but contention for other shared resources, such as the memory controller, memory bus and prefetching hardware.
Abstract: Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how and to what extent contention for shared resources can be mitigated via thread scheduling. Scheduling is an attractive tool, because it does not require extra hardware and is relatively easy to integrate into the system. Our study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling. The most difficult part of the problem is to find a classification scheme for threads, which would determine how they affect each other when competing for shared resources. We provide a comprehensive analysis of such classification schemes using a newly proposed methodology that makes it possible to evaluate these schemes separately from the scheduling algorithm itself and to compare them to the optimal. As a result of this analysis we discovered a classification scheme that addresses not only contention for cache space, but also contention for other shared resources, such as the memory controller, memory bus and prefetching hardware. To show the applicability of our analysis we design a new scheduling algorithm, which we prototype at user level, and demonstrate that it performs within 2% of the optimal. We also conclude that the highest impact of contention-aware scheduling techniques is not in improving performance of a workload as a whole but in improving quality of service or performance isolation for individual applications.
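One way to picture the resulting placement policy, as a rough sketch rather than the authors' code: classify threads by a simple memory-intensity proxy (e.g. last-level-cache misses per thousand instructions) and spread the most intensive threads across memory domains so they do not share a cache and memory controller.

```python
# Illustrative contention-aware placement; thread names and MPKI values are made up.
def distribute(threads, num_domains):
    """threads: list of (name, llc_mpki). Assigns threads to domains round-robin
    in descending miss intensity, returning {domain_id: [names]}."""
    ordered = sorted(threads, key=lambda t: t[1], reverse=True)
    domains = {d: [] for d in range(num_domains)}
    for i, (name, _) in enumerate(ordered):
        domains[i % num_domains].append(name)
    return domains

print(distribute([('mcf', 35.1), ('milc', 22.4), ('gcc', 3.2), ('povray', 0.1)], 2))
# {0: ['mcf', 'gcc'], 1: ['milc', 'povray']} -> the two memory-intensive threads
# land on different domains
```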

532 citations

Proceedings ArticleDOI
13 Jun 2015
TL;DR: Heracles is presented, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service and dynamically manages multiple hardware and software isolation mechanisms to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks.
Abstract: User-facing, latency-sensitive services, such as web search, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting underutilization hurts both the affordability and energy efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
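A hedged sketch of a Heracles-style control step: periodically read the latency-critical (LC) service's latency slack and load, then grow, shrink, or revoke the share of resources lent to best-effort (BE) tasks. The thresholds and the actuation hook are placeholders, not the published controller.

```python
import time

def control_step(slack, lc_load, be_share, step=0.05):
    """slack: (SLO latency - measured latency) / SLO latency."""
    if slack < 0.0 or lc_load > 0.85:
        return 0.0                                   # SLO at risk: revoke all BE resources
    if slack < 0.10:
        return max(0.0, be_share - step)             # little headroom: shrink BE share
    return min(1.0 - lc_load, be_share + step)       # healthy slack: grow BE share

def run(period_s, read_slack, read_load, apply_share):
    be_share = 0.0
    while True:
        be_share = control_step(read_slack(), read_load(), be_share)
        apply_share(be_share)                        # e.g. adjust cores, cache ways, bandwidth caps
        time.sleep(period_s)
```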

464 citations