
Showing papers by "Moinuddin K. Qureshi published in 2008"


Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper proposes the Thread-Aware Dynamic Insertion Policy (TADIP), an adaptive insertion policy that takes into account the memory requirements of each of the concurrently executing applications and provides performance benefits similar to doubling the size of an LRU-managed cache.
Abstract: Chip Multiprocessors (CMPs) allow different applications to execute concurrently on a single chip. When applications with differing memory demands compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance when the aggregate working set size exceeds the size of the shared cache. In such cases, shared cache performance can be significantly improved by preserving the entire working sets of the applications that can co-exist in the cache and preserving some portion of the working sets of the remaining applications. This paper investigates the use of adaptive insertion policies to manage shared caches. We show that directly extending the recently proposed Dynamic Insertion Policy (DIP) is inadequate for shared caches because DIP is unaware of the characteristics of individual applications. We propose the Thread-Aware Dynamic Insertion Policy (TADIP), which takes into account the memory requirements of each of the concurrently executing applications. Our evaluation with multi-programmed workloads for 2-core, 4-core, 8-core, and 16-core CMPs shows that a TADIP-managed shared cache improves overall throughput by as much as 94%, 64%, 26%, and 16% respectively (on average 14%, 18%, 15%, and 17%) over the baseline LRU policy. The performance benefit of TADIP is 2.6x that of DIP and 1.3x that of the recently proposed Utility-based Cache Partitioning (UCP) scheme. We also show that a TADIP-managed shared cache provides performance benefits similar to doubling the size of an LRU-managed cache. Furthermore, TADIP requires a total storage overhead of less than two bytes per core, does not require changes to the existing cache structure, and performs similarly to LRU for LRU-friendly workloads.

321 citations
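The core mechanism the abstract describes is a per-thread insertion decision: LRU-friendly threads insert new lines at the MRU position, while thrashing threads fall back to bimodal insertion. A minimal sketch in C, assuming 10-bit selector counters and a 1/32 bimodal probability (illustrative parameters, not the paper's exact values):

```c
/* Sketch of a TADIP-style per-thread insertion decision. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define PSEL_MAX    1023          /* 10-bit saturating counter */
#define BIP_EPSILON 32            /* MRU insertion once per 32 fills */

/* One policy selector per thread, trained by set dueling (not shown):
 * misses in a thread's LRU-dedicated sets increment its counter,
 * misses in its BIP-dedicated sets decrement it. */
static uint16_t psel[NUM_THREADS];

/* On a cache fill for thread `tid`: return true to insert the new line
 * at the MRU position (classic LRU insertion), false to insert at the
 * LRU position (bimodal insertion, which keeps a thrashing thread's
 * streaming data from evicting the other threads' working sets). */
bool insert_at_mru(int tid)
{
    bool thread_prefers_bip = psel[tid] > PSEL_MAX / 2;
    if (!thread_prefers_bip)
        return true;                        /* LRU-friendly thread */
    return (rand() % BIP_EPSILON) == 0;     /* rare MRU fill under BIP */
}
```

Because the only per-thread state is one selector counter, this kind of design is consistent with the abstract's claim of less than two bytes of overhead per core.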


Proceedings ArticleDOI
01 Mar 2008
TL;DR: This paper proposes Feedback-Driven Threading (FDT), a framework that dynamically controls the number of threads using run-time information, and shows that FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads depending on the amount of data-synchronization.
Abstract: Extracting high performance from emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, the performance of some multi-threaded applications will benefit from the increased number of threads, whereas the performance of other multi-threaded applications will become limited by data-synchronization and off-chip bandwidth. For applications limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application becomes limited by data-synchronization, by bandwidth, or by neither depends not only on the application but also on the input set and the machine configuration. Therefore, controlling the number of threads based on the run-time behavior of the application can significantly improve performance and reduce power. This paper proposes Feedback-Driven Threading (FDT), a framework to dynamically control the number of threads using run-time information. FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads depending on the amount of data-synchronization. Our evaluation shows that SAT can reduce execution time and power by up to 66% and 78%, respectively. Similarly, FDT can be used to implement Bandwidth-Aware Threading (BAT), which predicts the minimum number of threads required to saturate the off-chip bus. Our evaluation shows that BAT reduces on-chip power by up to 78%. When SAT and BAT are combined, average execution time is reduced by 17% and power by 59%. The proposed techniques leverage existing performance counters and require minimal support from the threading library.

200 citations
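The feedback loop the abstract implies can be sketched as a training phase that grows the thread count until synchronization or off-chip bandwidth limits further gains. In the C sketch below, the counter-reading stubs and the thresholds are assumptions for illustration; a real implementation would read hardware performance counters, as the paper's techniques do:

```c
/* Sketch of an FDT-style thread-count controller.  read_sync_fraction()
 * and read_bus_utilization() are hypothetical stubs standing in for
 * performance-counter reads; the 0.5 and 0.95 thresholds are
 * illustrative, not the paper's. */
static double read_sync_fraction(void)   { return 0.10; }  /* stub */
static double read_bus_utilization(void) { return 0.50; }  /* stub */
static double run_training_region(int nthreads)            /* stub */
{
    (void)nthreads;
    return 1.0;   /* would return measured execution time */
}

int choose_thread_count(int max_threads)
{
    int best = 1;
    double best_time = run_training_region(1);

    for (int n = 2; n <= max_threads; n *= 2) {
        double t = run_training_region(n);

        /* BAT-style check: once the off-chip bus saturates, more
         * threads only burn power, so stop growing. */
        if (read_bus_utilization() > 0.95)
            break;

        /* SAT-style check: once time in critical sections dominates,
         * additional threads serialize and performance degrades. */
        if (read_sync_fraction() > 0.5)
            break;

        if (t < best_time) { best_time = t; best = n; }
    }
    return best;
}
```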


Journal ArticleDOI
TL;DR: A simple mechanism that dynamically changes the insertion policy used by LRU replacement reduces cache misses by 21 percent and requires a total storage overhead of less than 2 bytes.
Abstract: The commonly used LRU replacement policy causes thrashing for memory-intensive workloads. A simple mechanism that dynamically changes the insertion policy used by LRU replacement reduces cache misses by 21 percent and requires a total storage overhead of less than 2 bytes.

35 citations
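The "simple mechanism" is DIP-style set dueling: a few leader sets always use LRU insertion, a few always use bimodal insertion, and a single saturating counter tallies which group misses less. A minimal C sketch, assuming 1024 sets, leaders chosen by simple modulo, and a 10-bit selector (all illustrative assumptions):

```c
/* Sketch of DIP-style set dueling. */
#include <stdbool.h>
#include <stdint.h>

#define PSEL_MAX 1023             /* 10-bit saturating selector */

static uint16_t psel = PSEL_MAX / 2;

static bool is_lru_leader(uint32_t set) { return (set % 32) == 0; }
static bool is_bip_leader(uint32_t set) { return (set % 32) == 1; }

/* Called on every cache miss: a miss in a leader set is a vote
 * against that leader's insertion policy. */
void duel_on_miss(uint32_t set)
{
    if (is_lru_leader(set) && psel < PSEL_MAX)
        psel++;                   /* LRU leader missed: lean toward BIP */
    else if (is_bip_leader(set) && psel > 0)
        psel--;                   /* BIP leader missed: lean toward LRU */
}

/* Follower sets (the vast majority) obey the winning policy. */
bool follower_uses_bip(void)
{
    return psel > PSEL_MAX / 2;
}
```

A single selector of this kind is the only added state, which is how the abstract's total overhead comes in under 2 bytes.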


Patent
Moinuddin K. Qureshi
01 Aug 2008
TL;DR: A cache controller monitors a counter associated with a cache and determines whether the counter indicates that a plurality of non-dedicated cache sets within the cache should operate as spill cache sets or as receive cache sets.
Abstract: A mechanism for improving cache performance in a data processing system is provided. A cache controller monitors a counter associated with a cache and determines whether the counter indicates that a plurality of non-dedicated cache sets within the cache should operate as spill cache sets or as receive cache sets. If the counter indicates that the non-dedicated cache sets should operate as spill cache sets, the cache controller sets them to spill an evicted cache line to an associated cache set in another cache on a cache miss. If the counter indicates that the non-dedicated cache sets should operate as receive cache sets, the cache controller sets them to receive an evicted cache line from another cache set on a cache miss.

19 citations
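The decision the abstract describes reduces to one monitored counter flipping a group of non-dedicated sets between spilling their evictions to a peer cache and receiving spills from one. A hedged C sketch, in which the counter width, threshold, and function names are assumptions rather than the patent's claims:

```c
/* Sketch of a counter-driven spill/receive controller. */
#include <stdint.h>

#define CTR_MAX 63                /* 6-bit saturating counter, assumed */

typedef enum { MODE_SPILL, MODE_RECEIVE } set_mode_t;

/* The cache controller monitors this counter; dedicated always-spill
 * and always-receive sets would train it (training not shown). */
static uint8_t monitor_ctr = CTR_MAX / 2;

set_mode_t nondedicated_mode(void)
{
    return (monitor_ctr > CTR_MAX / 2) ? MODE_SPILL : MODE_RECEIVE;
}

/* On a miss that evicts a line from a non-dedicated set: either spill
 * the victim to the associated set in another cache, or treat this set
 * as a receiver and drop the victim normally. */
void on_nondedicated_eviction(void /* victim line, peer cache elided */)
{
    if (nondedicated_mode() == MODE_SPILL) {
        /* forward the evicted line to the paired set in the peer cache */
    } else {
        /* receive mode: keep capacity free for lines spilled by peers */
    }
}
```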


Journal ArticleDOI

11 citations