Proceedings ArticleDOI

Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

09 Dec 2006 - pp. 423-432
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
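
The partitioning step can be illustrated with a small sketch: assuming each application's monitor yields a curve of hits as a function of allocated ways (the paper derives such curves from per-way LRU hit counters via the stack property), a greedy allocator repeatedly hands the next way to whichever application gains the most from it. The function and numbers below are illustrative, not the paper's exact hardware algorithm; the paper also discusses a lookahead refinement for non-convex utility curves, which this basic greedy loop does not capture.

```python
# Greedy utility-based way allocation, in the spirit of UCP (illustrative sketch).
# hits_with_ways[a][k] = hits application a would see with k ways (k = 0..max),
# as would be provided by a per-application monitoring circuit.

def partition_ways(hits_with_ways, total_ways, min_ways=1):
    n_apps = len(hits_with_ways)
    alloc = [min_ways] * n_apps                      # every application keeps at least min_ways
    for _ in range(total_ways - sum(alloc)):         # hand out the remaining ways one at a time
        best_app, best_gain = None, -1
        for a in range(n_apps):
            if alloc[a] + 1 < len(hits_with_ways[a]):
                gain = hits_with_ways[a][alloc[a] + 1] - hits_with_ways[a][alloc[a]]
                if gain > best_gain:
                    best_app, best_gain = a, gain
        if best_app is None:
            break
        alloc[best_app] += 1
    return alloc

# Example: app 0 saturates quickly, app 1 keeps benefiting from extra ways.
h0 = [0, 90, 95, 96, 96, 96, 96, 96, 96]        # hits of app 0 with 0..8 ways
h1 = [0, 20, 40, 60, 80, 100, 120, 140, 160]    # hits of app 1 with 0..8 ways
print(partition_ways([h0, h1], total_ways=8))   # -> [1, 7]: most ways go to app 1
```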


Citations
Journal ArticleDOI
01 Jun 2008
TL;DR: This work explores more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count, to achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on memory-intensive multi-programmed workloads on a quad-core processor.
Abstract: Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on our memory-intensive multi-programmed workloads on a quad-core processor. The significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning. Our scalable L2 MHA yields an additional 17.8% performance improvement over our 3D-stacked memory architecture.
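
The Vector Bloom Filter itself is specific to that paper, but the general idea of using a Bloom filter to avoid a full associative search of the miss-status holding registers (MSHRs) can be sketched generically. The class below is a plain Bloom filter written only to illustrate the membership test; it is not the paper's data structure, and a real MSHR filter would need a counting variant so entries can be cleared when misses retire.

```python
# Minimal Bloom filter sketch for filtering MSHR lookups: a positive answer means
# "possibly an outstanding miss for this block", a negative answer means
# "definitely not", so the full MSHR search can be skipped on negatives.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)          # one byte per bit, for simplicity

    def _indices(self, key):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.n_bits

    def add(self, key):
        for idx in self._indices(key):
            self.bits[idx] = 1

    def may_contain(self, key):
        return all(self.bits[idx] for idx in self._indices(key))

# Usage: consult the filter before searching the MSHR file for a block address.
outstanding = BloomFilter()
outstanding.add(0x4F80)
print(outstanding.may_contain(0x4F80))   # True  (possibly present)
print(outstanding.may_contain(0x1234))   # False (definitely not present)
```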

679 citations


Cites background from "Utility-Based Cache Partitioning: A..."

  • ..., IPC) but then continue simulating the program so that it continues to compete for the shared L2 cache and bus resources (this is effectively the same as some other recent multi-core simulation methodologies [35])....

Proceedings ArticleDOI
13 Apr 2010
TL;DR: Q-Clouds, a QoS-aware control framework that tunes resource allocations to mitigate performance interference effects, uses online feedback to build a multi-input multi-output (MIMO) model that captures interference interactions and then uses that model to perform closed-loop resource management.
Abstract: Cloud computing offers users the ability to access large pools of computational and storage resources on demand. Multiple commercial clouds already allow businesses to replace, or supplement, privately owned IT assets, alleviating them from the burden of managing and maintaining these facilities. However, there are issues that must be addressed before this vision of utility computing can be fully realized. In existing systems, customers are charged based upon the amount of resources used or reserved, but no guarantees are made regarding the application level performance or quality-of-service (QoS) that the given resources will provide. As cloud providers continue to utilize virtualization technologies in their systems, this can become problematic. In particular, the consolidation of multiple customer applications onto multicore servers introduces performance interference between collocated workloads, significantly impacting application QoS. To address this challenge, we advocate that the cloud should transparently provision additional resources as necessary to achieve the performance that customers would have realized if they were running in isolation. Accordingly, we have developed Q-Clouds, a QoS-aware control framework that tunes resource allocations to mitigate performance interference effects. Q-Clouds uses online feedback to build a multi-input multi-output (MIMO) model that captures performance interference interactions, and uses it to perform closed loop resource management. In addition, we utilize this functionality to allow applications to specify multiple levels of QoS as application Q-states. For such applications, Q-Clouds dynamically provisions underutilized resources to enable elevated QoS levels, thereby improving system efficiency. Experimental evaluations of our solution using benchmark applications illustrate the benefits: performance interference is mitigated completely when feasible, and system utilization is improved by up to 35% using Q-states.
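
As a rough illustration of closed-loop resource management with an interference model (not Q-Clouds' actual algorithm), the toy controller below assumes a linear model p = A·u relating per-VM resource allocations u to observed performance p, and iteratively adjusts allocations until each VM reaches its QoS target. The matrix, targets, and gain are invented for the sketch.

```python
# Toy closed-loop resource controller with an assumed linear interference model.
import numpy as np

A = np.array([[1.0, -0.3],        # VM0's performance drops as VM1's allocation grows
              [-0.2, 1.0]])       # and vice versa (interference terms)
targets = np.array([0.6, 0.6])    # QoS targets (e.g., performance normalized to running alone)
u = np.array([0.5, 0.5])          # initial resource allocations (e.g., CPU shares)
gain = 0.5

for step in range(20):
    p = A @ u                                              # "measured" performance this interval
    error = targets - p
    u = np.clip(u + gain * np.linalg.solve(A, error), 0.0, 1.0)

print(np.round(u, 3), np.round(A @ u, 3))                  # allocations and resulting performance
```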

614 citations


Additional excerpts

  • ...For example, hardware techniques have been designed that dynamically partition cache resources based upon utility metrics [27], or integrate novel insertion policies to pseudopartition caches [36]....

Proceedings ArticleDOI
03 Dec 2011
TL;DR: Bubble-Up, a characterization methodology that enables accurate prediction of the performance degradation caused by contention for shared resources in the memory subsystem, can predict the performance interference between co-located applications to within 1% to 2% of the actual degradation.
Abstract: As much of the world's computing continues to move into the cloud, the overprovisioning of computing resources to ensure the performance isolation of latency-sensitive tasks, such as web search, in modern datacenters is a major contributor to low machine utilization. Being unable to accurately predict performance degradation due to contention for shared resources on multicore systems has led to the heavy-handed approach of simply disallowing the co-location of high-priority, latency-sensitive tasks with other tasks. Performing this precise prediction has been a challenging and unsolved problem. In this paper, we present Bubble-Up, a characterization methodology that enables the accurate prediction of the performance degradation that results from contention for shared resources in the memory subsystem. By using a bubble to apply a tunable amount of “pressure” to the memory subsystem on processors in production datacenters, our methodology can predict the performance interference between co-located applications with an accuracy within 1% to 2% of the actual performance degradation. Using this methodology to arrive at “sensible” co-locations in Google's production datacenters with real-world large-scale applications, we can improve the utilization of a 500-machine cluster by 50% to 90% while guaranteeing a high quality of service for latency-sensitive applications.
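
The prediction step can be pictured with a small sketch: profile the latency-sensitive task against a bubble of increasing memory pressure to obtain a sensitivity curve, measure each candidate co-runner's pressure score, and read the curve at that score. The curve, scores, and threshold below are made-up illustrations of the idea, not data or tooling from the paper.

```python
# Bubble-Up-style prediction sketch: interpolate a profiled sensitivity curve
# at a co-runner's measured pressure score to predict co-located QoS.
import bisect

# sensitivity curve: (bubble pressure in MB, normalized QoS) pairs from profiling
sensitivity = [(0, 1.00), (5, 0.99), (10, 0.97), (20, 0.92), (40, 0.85), (80, 0.80)]

def predict_qos(curve, pressure_mb):
    """Linear interpolation on the profiled sensitivity curve."""
    xs = [x for x, _ in curve]
    i = bisect.bisect_left(xs, pressure_mb)
    if i == 0:
        return curve[0][1]
    if i == len(curve):
        return curve[-1][1]
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (pressure_mb - x0) / (x1 - x0)

batch_pressure = {"mapreduce-A": 12.0, "encoder-B": 45.0}   # measured bubble scores (invented)
for name, pressure in batch_pressure.items():
    qos = predict_qos(sensitivity, pressure)
    print(f"{name}: predicted QoS {qos:.2f}",
          "-> co-locate" if qos >= 0.95 else "-> do not co-locate")
```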

596 citations


Additional excerpts

  • ...One direction that has attracted much research attention is the management of shared cache and bandwidth through techniques such as resource partitioning [4, 20, 27–34, 38], throttling [7] and adaptive cache replacement policies [14]....

Journal ArticleDOI
13 Mar 2010
TL;DR: This study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling, and it identifies a classification scheme that addresses not only contention for cache space but also contention for other shared resources, such as the memory controller, memory bus, and prefetching hardware.
Abstract: Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how, and to what extent, contention for shared resources can be mitigated via thread scheduling. Scheduling is an attractive tool because it does not require extra hardware and is relatively easy to integrate into the system. Our study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling. The most difficult part of the problem is to find a classification scheme for threads that determines how they affect each other when competing for shared resources. We provide a comprehensive analysis of such classification schemes using a newly proposed methodology that makes it possible to evaluate these schemes separately from the scheduling algorithm itself and to compare them to the optimal. As a result of this analysis we discovered a classification scheme that addresses not only contention for cache space but also contention for other shared resources, such as the memory controller, memory bus and prefetching hardware. To show the applicability of our analysis we design a new scheduling algorithm, which we prototype at user level, and demonstrate that it performs within 2% of the optimal. We also conclude that the highest impact of contention-aware scheduling techniques is not in improving performance of a workload as a whole but in improving quality of service or performance isolation for individual applications.
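
One simple classification-plus-scheduling heuristic of the general kind analyzed here is to rank threads by a memory-intensity score (for example, last-level cache misses per 1K instructions) and spread the most intensive threads across chips so they do not share a cache. The sketch below is only an illustration of that idea, not the specific algorithm prototyped in the paper, and the workload numbers are invented.

```python
# Contention-aware spreading heuristic: classify threads by memory intensity
# and deal them round-robin so heavy threads land on different chips.

def spread_threads(miss_rates, n_chips, cores_per_chip):
    """miss_rates: {thread_name: misses per 1K instructions}. Returns per-chip assignments."""
    ranked = sorted(miss_rates, key=miss_rates.get, reverse=True)   # most intensive first
    chips = [[] for _ in range(n_chips)]
    for i, thread in enumerate(ranked):
        chips[i % n_chips].append(thread)
    assert all(len(c) <= cores_per_chip for c in chips)
    return chips

workload = {"mcf": 38.0, "milc": 30.5, "gcc": 4.1, "povray": 0.3}
print(spread_threads(workload, n_chips=2, cores_per_chip=2))
# -> the heavy 'mcf' and 'milc' end up on different chips
```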

532 citations


Cites background or methods or result from "Utility-Based Cache Partitioning: A..."

  • ...SDC, and other solutions relying on stack distance profiles such as cache partitioning [17, 22, 26], assumed that the dominant cause of performance degradation is contention for space in the shared cache, i.e., when co-scheduled threads evict each other's data from the shared cache....

  • ...It has been documented in previous studies [6, 14, 15, 17, 22, 24] that the execution time of a thread can vary greatly depending on which threads run on the other cores of the same chip....

  • ...In addition to SDC, many other studies of cache contention used stack-distance or reuse-distance profiles for managing contention [5, 17, 22, 24]....

  • ...In the work on Utility Cache Partitioning [17], a custom hardware solution estimates each application’s number of hits and misses for all possible numbers of ways allocated to the application in the cache (the technique is based on stack-distance profiles)....

  • ...Methods such as utility cache partitioning (UCP) [17] and page coloring [6, 24, 27] were devised to mitigate cache contention....

Proceedings ArticleDOI
13 Jun 2015
TL;DR: Heracles is presented, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service; it dynamically manages multiple hardware and software isolation mechanisms to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks.
Abstract: User-facing, latency-sensitive services, such as web search, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy-efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
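
The top-level behaviour of such a controller can be sketched as a simple loop: while the latency-critical service has latency slack against its SLO, grow the resources given to best-effort tasks; when the slack shrinks or disappears, back off or pause them. The hooks (measure_tail_latency, set_be_cores), thresholds, and step sizes below are assumptions made for illustration, not Heracles' actual interfaces or policy.

```python
# Minimal sketch of a Heracles-style top-level control loop (illustrative only).
import time

SLO_MS = 20.0            # latency target for the latency-critical (LC) service (assumed)
GROW_STEP, MAX_BE = 1, 8 # best-effort (BE) core step and cap (assumed)

def control_loop(measure_tail_latency, set_be_cores, interval_s=2.0):
    be_cores = 0
    while True:
        latency = measure_tail_latency()              # e.g., 99th-percentile latency in ms
        slack = (SLO_MS - latency) / SLO_MS
        if slack < 0.0:
            be_cores = 0                              # SLO violated: pause best-effort work
        elif slack < 0.1:
            be_cores = max(0, be_cores - GROW_STEP)   # little slack: back off
        else:
            be_cores = min(MAX_BE, be_cores + GROW_STEP)  # plenty of slack: grow BE share
        set_be_cores(be_cores)
        time.sleep(interval_s)
```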

464 citations


Cites background from "Utility-Based Cache Partitioning: A..."

  • ...Most cache partitioning schemes have been evaluated with a utility-based policy that optimizes for aggregate throughput [64]....

  • ...Isolation mechanisms: There is significant work on shared cache isolation, including soft partitioning based on replacement policies [77, 78], way-partitioning [65, 64], and fine-grained partitioning [68, 49, 71]....

  • ...It is already well understood that, even when the colocation is between throughput tasks, it is best to dynamically manage cache partitioning using either hardware [30, 64, 15] or software [58, 43] techniques....

References
Journal ArticleDOI
TL;DR: A new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms.
Abstract: The design of efficient storage hierarchies generally involves the repeated running of "typical" program address traces through a simulated storage system while various hierarchy design parameters are adjusted. This paper describes a new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms. The technique depends on an algorithm classification, called "stack algorithms," examples of which are "least frequently used," "least recently used," "optimal," and "random replacement" algorithms. The techniques yield the exact access frequency to each storage device, which can be used to estimate the overall performance of actual storage hierarchies.
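
The one-pass idea for LRU can be sketched directly: maintain an LRU stack over the trace and record each reference's stack distance; the hit count for any fully associative LRU cache of size C is then the number of references with distance at most C. The code below is a minimal illustration (with a linear-time stack search), not the paper's implementation.

```python
# One-pass stack-distance profiling in the spirit of stack algorithms:
# a single pass over the trace yields hit counts for every cache size at once.
from collections import Counter

def stack_distance_profile(trace):
    stack, depths = [], Counter()
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr) + 1     # 1-based stack distance
            stack.remove(addr)
        else:
            depth = float("inf")              # cold miss at every cache size
        depths[depth] += 1
        stack.insert(0, addr)                 # move to top (most recently used)
    return depths

def hits_for_size(depths, size):
    return sum(count for d, count in depths.items() if d <= size)

trace = ["a", "b", "c", "a", "b", "d", "a", "c"]
profile = stack_distance_profile(trace)
for size in (1, 2, 3, 4):
    print(size, hits_for_size(profile, size))   # hits under LRU for caches of size 1..4
```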

1,329 citations


"Utility-Based Cache Partitioning: A..." refers background in this paper

  • ...Fortunately, the baseline LRU policy obeys the stack property [10], which means that an access that hits in an LRU-managed cache containing N ways is guaranteed to also hit if the cache had more than N ways (the number of sets being constant)....

Book
01 Jan 1976
TL;DR: A probability textbook covering combinatorial analysis, the axioms of probability, conditional probability and independence, random variables (discrete, continuous, and jointly distributed), properties of expectation, and limit theorems.
Abstract: 1. Combinatorial Analysis 2. Axioms of Probability 3. Conditional Probability and Independence 4. Random Variables 5. Continuous Random Variables 6. Jointly Distributed Random Variables 7. Properties of Expectation 8. Limit Theorems 9. Additional Topics in Probability 10. Simulation Appendix A. Answers to Selected Problems Appendix B. Solutions to Self-Test Problems and Exercises Index

1,244 citations

Journal ArticleDOI
TL;DR: A new model, the “working set model,” is developed: the working set of a process, defined to be the collection of its most recently used pages, provides knowledge vital to the dynamic management of paged memories.
Abstract: Probably the most basic reason behind the absence of a general treatment of resource allocation in modern computer systems is the lack of an adequate model for program behavior. In this paper a new model, the “working set model,” is developed. The working set of pages associated with a process, defined to be the collection of its most recently used pages, provides knowledge vital to the dynamic management of paged memories. “Process” and “working set” are shown to be manifestations of the same ongoing computational activity; then “processor demand” and “memory demand” are defined; and resource allocation is formulated as the problem of balancing demands against available equipment.
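
The definition itself is easy to state in code: the working set W(t, τ) is the set of distinct pages referenced in the last τ references before time t. A minimal sketch:

```python
# Working-set computation over a page reference trace (illustrative sketch).

def working_set(trace, t, tau):
    """Distinct pages referenced in the window (t - tau, t] of the trace."""
    start = max(0, t - tau)
    return set(trace[start:t])

trace = [1, 2, 1, 3, 2, 2, 4, 1, 5, 5]
print(working_set(trace, t=8, tau=4))   # pages touched in references 5..8 -> {1, 2, 4}
```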

995 citations


"Utility-Based Cache Partitioning: A..." refers background in this paper

  • ...Demand is determined by the number of unique cache blocks accessed in a given interval [4]....

Proceedings ArticleDOI
29 Sep 2004
TL;DR: It is found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness, and two algorithms are proposed that optimize fairness.
Abstract: This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with execution-time fairness. Execution-time fairness is defined as how uniformly the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Second, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, our algorithms improve fairness by a factor of 4× on average, while increasing throughput by 15%, compared to a non-partitioned shared cache.
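
One execution-time-fairness idea described above can be sketched as follows: compute each thread's slowdown (shared execution time divided by its time running alone) and measure how far apart the slowdowns are; a smaller gap means fairer sharing. The metric shown is a simplified stand-in for the paper's metrics, and the numbers are invented.

```python
# Simplified execution-time fairness sketch based on relative slowdowns.

def slowdowns(times_shared, times_alone):
    return {t: times_shared[t] / times_alone[t] for t in times_shared}

def fairness_gap(times_shared, times_alone):
    s = list(slowdowns(times_shared, times_alone).values())
    return max(s) - min(s)                     # 0.0 means perfectly uniform slowdowns

alone  = {"gzip": 10.0, "art": 10.0}           # execution times running alone (invented)
shared = {"gzip": 11.0, "art": 16.0}           # 'art' suffers much more from sharing
print(slowdowns(shared, alone))                # {'gzip': 1.1, 'art': 1.6}
print(round(fairness_gap(shared, alone), 3))   # 0.5 -> unfair; partitioning should shrink this
```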

544 citations


"Utility-Based Cache Partitioning: A..." refers methods in this paper

  • ...The proposed framework can also be used to implement execution-time fairness [7] without requiring any profile information....

Proceedings ArticleDOI
03 Dec 1997
TL;DR: This work presents an analytical model for QoS management in systems which must satisfy application needs along multiple dimensions such as timeliness, reliable delivery schemes, cryptographic security and data quality, and refers to this model as Q-RAM (QoS-based Resource Allocation Model).
Abstract: Quality of service (QoS) has been receiving wide attention in many research communities including networking, multimedia systems, real-time systems and distributed systems. In large distributed systems such as those used in defense systems, on-demand service and inter-networked systems, applications contending for system resources must satisfy timing, reliability and security constraints as well as application-specific quality requirements. Allocating sufficient resources to different applications in order to satisfy various requirements is a fundamental problem in these situations. A basic yet flexible model for performance-driven resource allocations can therefore be useful in making appropriate tradeoffs. We present an analytical model for QoS management in systems which must satisfy application needs along multiple dimensions such as timeliness, reliable delivery schemes, cryptographic security and data quality. We refer to this model as Q-RAM (QoS-based Resource Allocation Model). The model assumes a system with multiple concurrent applications, each of which can operate at different levels of quality based on the system resources available to it. The goal of the model is to be able to allocate resources to the various applications such that the overall system utility is maximized under the constraint that each application can meet its minimum needs. We identify resource profiles of applications which allow such decisions to be made efficiently and in real-time. We also identify application utility functions along different dimensions which are composable to form unique application requirement profiles. We use a video-conferencing system to illustrate the model.
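
A toy version of the allocation problem conveys the flavour of the model: each application exposes a utility profile over discrete resource levels with a minimum acceptable level, and the allocator maximizes total utility within a budget. The greedy marginal-utility heuristic below is only an illustration and is not guaranteed to be optimal, consistent with the excerpted observation that optimal partitioning is NP-hard; profiles and numbers are invented.

```python
# Toy utility-maximizing allocation in the spirit of Q-RAM (illustrative sketch).

def allocate(profiles, budget):
    """profiles: {app: [(resource_units, utility), ...]} sorted by resource_units,
    where the first entry is the minimum acceptable level."""
    level = {app: 0 for app in profiles}                     # index into each profile
    spent = sum(profiles[app][0][0] for app in profiles)     # satisfy every minimum first
    if spent > budget:
        raise ValueError("cannot satisfy minimum requirements")
    while True:
        best, best_rate = None, 0.0
        for app, prof in profiles.items():
            i = level[app]
            if i + 1 < len(prof):
                d_res = prof[i + 1][0] - prof[i][0]
                d_util = prof[i + 1][1] - prof[i][1]
                if spent + d_res <= budget and d_util / d_res > best_rate:
                    best, best_rate = app, d_util / d_res
        if best is None:
            break
        spent += profiles[best][level[best] + 1][0] - profiles[best][level[best]][0]
        level[best] += 1
    return {app: profiles[app][level[app]] for app in profiles}

video = [(2, 1.0), (4, 2.5), (8, 3.0)]     # (resource units, utility) per quality level
audio = [(1, 0.8), (2, 1.4)]
print(allocate({"video": video, "audio": audio}, budget=7))
# -> {'video': (4, 2.5), 'audio': (2, 1.4)}: 6 of 7 units spent
```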

517 citations


"Utility-Based Cache Partitioning: A..." refers background in this paper

  • ...Finding an optimal solution to the partitioning problem has been shown to be NP-hard [14]....
