
Showing papers by "Moinuddin K. Qureshi" published in 2006


Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.

1,083 citations
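
The key idea in UCP is to allocate cache ways by marginal utility rather than by demand. The Python sketch below only illustrates that idea; it is not the paper's partitioning algorithm, which may differ in how it walks the miss curves, and the function and variable names are made up for the example. The per-application miss curves stand in for the information the monitoring circuits collect.

```python
# Illustrative sketch of utility-based way allocation: give each way to whichever
# application currently gains the largest reduction in misses from one more way.

def partition_ways(miss_curves, total_ways):
    """miss_curves[i][w] = estimated misses of app i with w ways (w = 0..total_ways).
    Returns the number of ways allocated to each application."""
    num_apps = len(miss_curves)
    alloc = [0] * num_apps
    for _ in range(total_ways):
        # Marginal utility of one additional way for each application.
        gains = [miss_curves[i][alloc[i]] - miss_curves[i][alloc[i] + 1]
                 for i in range(num_apps)]
        winner = max(range(num_apps), key=lambda i: gains[i])
        alloc[winner] += 1
    return alloc

# Example: app 0 benefits steeply from extra ways, app 1 barely benefits.
curves = [
    [100, 60, 35, 20, 12, 8, 6, 5, 5],      # app 0: misses vs. ways 0..8
    [100, 95, 92, 90, 89, 88, 88, 88, 88],  # app 1
]
print(partition_ways(curves, 8))  # most of the 8 ways go to app 0
```

Because each way goes to the application with the largest marginal miss reduction, an application that merely demands cache without benefiting from it ends up with a small share, which is the behavior the abstract contrasts with demand-based LRU partitioning.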


Journal ArticleDOI
01 May 2006
TL;DR: Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%; a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR) is also proposed to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls.
Abstract: Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses: some misses occur in isolation while some occur in parallel with other misses. Isolated misses are more costly to performance than parallel misses. However, traditional cache replacement is not aware of the MLP-dependent cost differential between different misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement by using a runtime technique to compute the MLP-based cost for each cache miss. It then describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR) to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.

316 citations
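
As a rough illustration of making replacement MLP-aware, the sketch below attaches a quantized MLP-based cost to each line and evicts the line with the lowest combined recency-plus-cost value, so lines whose misses would be cheap (parallel) are preferred victims over lines whose misses would be isolated and costly. The weights, field widths, and class names are assumptions for the example, not the exact parameters of the paper's replacement mechanism or of SBAR.

```python
# Illustrative MLP-aware victim selection: each line carries a quantized MLP-based
# cost recorded at fill time (isolated miss -> high cost, parallel miss -> low cost).
# The victim is the line with the smallest recency + cost, so cheap-to-refetch lines
# are evicted before costly isolated-miss lines.

from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    recency: int   # 0 = least recently used ... assoc-1 = most recently used
    mlp_cost: int  # quantized MLP-based cost, e.g. 0 (very parallel) .. 7 (isolated)

def pick_victim(cache_set, cost_weight=1):
    """Evict the line minimizing recency + cost_weight * mlp_cost."""
    return min(cache_set, key=lambda ln: ln.recency + cost_weight * ln.mlp_cost)

# Example 4-way set: the LRU line (recency 0) was an isolated miss (cost 7),
# while a more recent line (recency 1) was fetched in parallel (cost 0).
s = [Line(0xA, 0, 7), Line(0xB, 1, 0), Line(0xC, 2, 3), Line(0xD, 3, 5)]
print(hex(pick_victim(s).tag))  # 0xb: the parallel-miss line goes, not the LRU line
```

SBAR then chooses at runtime between a policy like this and a traditional one, based on which reduces memory-related stalls more; the sketch covers only the victim-selection half.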


01 Jan 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.

82 citations
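
This entry describes the same UCP mechanism, so instead of repeating the allocator, the sketch below models the other half: the kind of utility information the per-application monitoring circuit gathers. It is a software model rather than the hardware circuit itself, and the class and method names are illustrative. Counting hits by LRU-stack position gives, in one pass, the hits (and therefore misses) an application would see for every possible number of ways, which is exactly the miss-curve input the partitioning step needs.

```python
# Software model of a utility monitor: an LRU stack counts, for every hit, the
# recency position at which the hit occurred. Hits at positions 0..w-1 are the
# hits the application would get with w ways, which yields its miss curve.

class UtilityMonitor:
    def __init__(self, assoc):
        self.assoc = assoc
        self.hit_counts = [0] * assoc   # hits per recency position (0 = MRU)
        self.accesses = 0
        self.stack = []                 # MRU at index 0

    def access(self, tag):
        self.accesses += 1
        if tag in self.stack:
            pos = self.stack.index(tag)
            self.hit_counts[pos] += 1
            self.stack.pop(pos)
        elif len(self.stack) == self.assoc:
            self.stack.pop()            # evict the LRU entry
        self.stack.insert(0, tag)       # promote/insert at MRU

    def miss_curve(self):
        """misses[w] = misses this access trace would incur with w ways."""
        hits_upto = 0
        curve = [self.accesses]         # 0 ways: every access misses
        for w in range(self.assoc):
            hits_upto += self.hit_counts[w]
            curve.append(self.accesses - hits_upto)
        return curve

mon = UtilityMonitor(assoc=4)
for t in [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]:
    mon.access(t)
print(mon.miss_curve())  # misses for 0, 1, 2, 3, 4 ways
```

The abstract notes the hardware circuit stays under 2kB; one cheap way to approximate a monitor like this in hardware is to track only a sampled subset of the cache sets.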


Patent
27 Feb 2006
TL;DR: In this article, a technique for demand-based error correction is proposed that reduces the storage overhead of cache memories containing error correction codes (ECC) while maintaining substantially the same cache performance.
Abstract: A technique for demand-based error correction. More particularly, at least one embodiment of the invention relates to a technique to reduce storage overhead of cache memories containing error correction codes (ECC) while maintaining substantially the same performance of the cache.

13 citations
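
The patent abstract does not spell out the mechanism, so the snippet below only illustrates where the ECC storage overhead it targets comes from, using standard SECDED parameters (8 check bits per 64-bit word). The line size and word size are assumptions made for the arithmetic, not figures from the patent.

```python
# Back-of-the-envelope illustration (standard SECDED parameters, not the patent's
# specific scheme) of ECC storage overhead: protecting every 64-bit word with an
# 8-bit SECDED code adds 12.5% to the cache data array.

LINE_BYTES = 64
WORD_BITS = 64
SECDED_BITS_PER_WORD = 8          # SEC-DED for a 64-bit word uses 8 check bits

words_per_line = LINE_BYTES * 8 // WORD_BITS
ecc_bits_per_line = words_per_line * SECDED_BITS_PER_WORD
overhead = ecc_bits_per_line / (LINE_BYTES * 8)
print(f"{ecc_bits_per_line} ECC bits per {LINE_BYTES}-byte line "
      f"({overhead:.1%} storage overhead)")  # 64 bits per line, 12.5%
```

A demand-based scheme aims to avoid paying this fixed cost for every line; how it does so is not described in the abstract.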


01 Jan 2006
TL;DR: Line distillation is proposed, a technique to increase cache utilization by filtering the unused data from a subset of the lines and condensing the remaining useful data into smaller line-sizes; a distill cache that supports line distillation and heterogeneous line-sizes reduces cache miss-rate by 21% on average.
Abstract: Cache hierarchies play a very important role in bridging the speed gap between processors and memory. As this gap increases, it becomes increasingly important to intelligently design and manage a cache system. The performance of current caches is reduced because more than half of the data that is brought into the cache is never referenced, resulting in very low utilization. We propose line distillation, a technique to increase cache utilization by filtering the unused data from a subset of the lines and condensing the remaining useful data into smaller line-sizes. We describe three flavors of line distillation: naive-distillation, static-K-distillation, and adaptive-distillation. We also introduce the distill cache, a cache that supports line distillation and heterogeneous line-sizes. The line distillation technique reduces cache miss-rate by 21% on average.

1 Introduction

Caches exploit temporal and spatial locality that exists in memory reference streams. Temporal locality is exploited by keeping a copy of the data associated with a memory reference so that subsequent references to the same address can be satisfied by the cache. Spatial locality is exploited by caching more data than is necessary for a single memory reference in anticipation of future accesses to contiguous addresses. In this paper, we explore spatial locality as it affects cache design decisions.

There are three basic transactions that take place in a cache: access, fill, and evict. Cache accesses consist of loads and stores which take place between the cache and the processor. Access transactions take place at the word-size granularity as defined by the ISA which the machine is implementing and for which the cache is being designed. Line fills and evictions occur between the cache and the next level of the memory hierarchy and refer to placing data into and removing data from the cache, respectively. Fill and evict transactions take place at the line-size granularity as defined by the microarchitect designing the cache. The line-size must be at least as large as the word-size but is otherwise independent, and a typical line-size is 8-16 times the corresponding word-size in a machine. Using large line-sizes reduces the storage requirement for the cache’s tag structure and provides a performance improvement proportional to

1 citation
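
As a conceptual model of the idea in the abstract (not the paper's distill cache organization, and without its naive/static-K/adaptive flavors), the sketch below tracks which words of a line are actually referenced and, on eviction, retains only those words in a region meant to hold smaller, word-granularity entries. The threshold and structure names are illustrative assumptions.

```python
# Conceptual model of line distillation: record a per-word footprint for each
# line, and when the full line is evicted, keep only the referenced words in a
# word-granularity region instead of discarding the whole line.

LINE_WORDS = 8          # assumed 8 words per line
KEEP_THRESHOLD = 3      # distill only sparsely used lines (illustrative)

class TrackedLine:
    def __init__(self, tag):
        self.tag = tag
        self.used = [False] * LINE_WORDS   # per-word footprint bits

    def touch(self, word_index):
        self.used[word_index] = True

def distill_on_eviction(line, word_region):
    """On eviction of a full line, retain its referenced words if there are few of them."""
    used_words = [i for i, u in enumerate(line.used) if u]
    if 0 < len(used_words) <= KEEP_THRESHOLD:
        for w in used_words:
            word_region[(line.tag, w)] = True   # stand-in for a small-line entry
    return used_words

word_region = {}
ln = TrackedLine(tag=0x40)
ln.touch(0); ln.touch(1)                     # only two of eight words referenced
print(distill_on_eviction(ln, word_region))  # [0, 1] -> kept at word granularity
print(word_region)
```

A real distill cache would also need to service later hits to the retained words and fall back to the next level for the discarded ones; the dictionary here just stands in for that word-organized region.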