
Showing papers by "Moinuddin K. Qureshi published in 2008"


Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper proposes the Thread-Aware Dynamic Insertion Policy (TADIP), an adaptive insertion policy that takes into account the memory requirements of each of the concurrently executing applications and provides performance benefits similar to doubling the size of an LRU-managed cache.
Abstract: Chip Multiprocessors (CMPs) allow different applications to execute concurrently on a single chip. When applications with differing memory demands compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance when the aggregate working set size exceeds the size of the shared cache. In such cases, shared cache performance can be significantly improved by preserving the entire working sets of the applications that can co-exist in the cache and preserving some portion of the working sets of the remaining applications. This paper investigates the use of adaptive insertion policies to manage shared caches. We show that directly extending the recently proposed Dynamic Insertion Policy (DIP) is inadequate for shared caches because DIP is unaware of the characteristics of individual applications. We propose the Thread-Aware Dynamic Insertion Policy (TADIP), which takes into account the memory requirements of each of the concurrently executing applications. Our evaluation with multi-programmed workloads for 2-core, 4-core, 8-core, and 16-core CMPs shows that a TADIP-managed shared cache improves overall throughput by as much as 94%, 64%, 26%, and 16% respectively (on average 14%, 18%, 15%, and 17%) over the baseline LRU policy. The performance benefit of TADIP is 2.6x that of DIP and 1.3x that of the recently proposed Utility-based Cache Partitioning (UCP) scheme. We also show that a TADIP-managed shared cache provides performance benefits similar to doubling the size of an LRU-managed cache. Furthermore, TADIP requires a total storage overhead of less than two bytes per core, does not require changes to the existing cache structure, and performs similarly to LRU for LRU-friendly workloads.

321 citations
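The core mechanism the abstract describes is a per-thread insertion decision: LRU-friendly threads insert new lines at the MRU position, while thrashing threads fall back to bimodal insertion. A minimal sketch in C, assuming 10-bit selector counters and a 1/32 bimodal probability (illustrative parameters, not the paper's exact values):

```c
/* Sketch of a TADIP-style per-thread insertion decision. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define PSEL_MAX    1023          /* 10-bit saturating counter */
#define BIP_EPSILON 32            /* MRU insertion once per 32 fills */

/* One policy selector per thread, trained by set dueling (not shown):
 * misses in a thread's LRU-dedicated sets increment its counter,
 * misses in its BIP-dedicated sets decrement it. */
static uint16_t psel[NUM_THREADS];

/* On a cache fill for thread `tid`: return true to insert the new line
 * at the MRU position (classic LRU insertion), false to insert at the
 * LRU position (bimodal insertion, which keeps a thrashing thread's
 * streaming data from evicting the other threads' working sets). */
bool insert_at_mru(int tid)
{
    bool thread_prefers_bip = psel[tid] > PSEL_MAX / 2;
    if (!thread_prefers_bip)
        return true;                        /* LRU-friendly thread */
    return (rand() % BIP_EPSILON) == 0;     /* rare MRU fill under BIP */
}
```

Because the only per-thread state is one selector counter, this kind of design is consistent with the abstract's claim of less than two bytes of overhead per core.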


Proceedings ArticleDOI
01 Mar 2008
TL;DR: This paper proposes Feedback-Driven Threading (FDT), a framework that dynamically controls the number of threads using run-time information, and shows that FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads depending on the amount of data-synchronization.
Abstract: Extracting high performance from emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, the performance of some multi-threaded applications will benefit from the increased number of threads, whereas the performance of other multi-threaded applications will become limited by data-synchronization and off-chip bandwidth. For applications limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application becomes limited by data-synchronization, by bandwidth, or by neither depends not only on the application but also on the input set and the machine configuration. Therefore, controlling the number of threads based on the run-time behavior of the application can significantly improve performance and reduce power. This paper proposes Feedback-Driven Threading (FDT), a framework to dynamically control the number of threads using run-time information. FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads depending on the amount of data-synchronization. Our evaluation shows that SAT can reduce execution time and power by up to 66% and 78%, respectively. Similarly, FDT can be used to implement Bandwidth-Aware Threading (BAT), which predicts the minimum number of threads required to saturate the off-chip bus. Our evaluation shows that BAT reduces on-chip power by up to 78%. When SAT and BAT are combined, average execution time is reduced by 17% and power by 59%. The proposed techniques leverage existing performance counters and require minimal support from the threading library.

200 citations
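The feedback loop the abstract implies can be sketched as a training phase that grows the thread count until synchronization or off-chip bandwidth limits further gains. In the C sketch below, the counter-reading stubs and the thresholds are assumptions for illustration; a real implementation would read hardware performance counters, as the paper's techniques do:

```c
/* Sketch of an FDT-style thread-count controller.  read_sync_fraction()
 * and read_bus_utilization() are hypothetical stubs standing in for
 * performance-counter reads; the 0.5 and 0.95 thresholds are
 * illustrative, not the paper's. */
static double read_sync_fraction(void)   { return 0.10; }  /* stub */
static double read_bus_utilization(void) { return 0.50; }  /* stub */
static double run_training_region(int nthreads)            /* stub */
{
    (void)nthreads;
    return 1.0;   /* would return measured execution time */
}

int choose_thread_count(int max_threads)
{
    int best = 1;
    double best_time = run_training_region(1);

    for (int n = 2; n <= max_threads; n *= 2) {
        double t = run_training_region(n);

        /* BAT-style check: once the off-chip bus saturates, more
         * threads only burn power, so stop growing. */
        if (read_bus_utilization() > 0.95)
            break;

        /* SAT-style check: once time in critical sections dominates,
         * additional threads serialize and performance degrades. */
        if (read_sync_fraction() > 0.5)
            break;

        if (t < best_time) { best_time = t; best = n; }
    }
    return best;
}
```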


Journal ArticleDOI
TL;DR: A simple mechanism that dynamically changes the insertion policy used by LRU replacement reduces cache misses by 21 percent and requires a total storage overhead of less than 2 bytes.
Abstract: The commonly used LRU replacement policy causes thrashing for memory-intensive workloads. A simple mechanism that dynamically changes the insertion policy used by LRU replacement reduces cache misses by 21 percent and requires a total storage overhead of less than 2 bytes.

35 citations
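The "simple mechanism" is DIP-style set dueling: a few leader sets always use LRU insertion, a few always use bimodal insertion, and a single saturating counter tallies which group misses less. A minimal C sketch, assuming 1024 sets, leaders chosen by simple modulo, and a 10-bit selector (all illustrative assumptions):

```c
/* Sketch of DIP-style set dueling. */
#include <stdbool.h>
#include <stdint.h>

#define PSEL_MAX 1023             /* 10-bit saturating selector */

static uint16_t psel = PSEL_MAX / 2;

static bool is_lru_leader(uint32_t set) { return (set % 32) == 0; }
static bool is_bip_leader(uint32_t set) { return (set % 32) == 1; }

/* Called on every cache miss: a miss in a leader set is a vote
 * against that leader's insertion policy. */
void duel_on_miss(uint32_t set)
{
    if (is_lru_leader(set) && psel < PSEL_MAX)
        psel++;                   /* LRU leader missed: lean toward BIP */
    else if (is_bip_leader(set) && psel > 0)
        psel--;                   /* BIP leader missed: lean toward LRU */
}

/* Follower sets (the vast majority) obey the winning policy. */
bool follower_uses_bip(void)
{
    return psel > PSEL_MAX / 2;
}
```

A single selector of this kind is the only added state, which is how the abstract's total overhead comes in under 2 bytes.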


Patent
Moinuddin K. Qureshi
01 Aug 2008
TL;DR: A cache controller monitors a counter associated with a cache and determines whether the counter indicates that a plurality of non-dedicated cache sets within the cache should operate as spill cache sets or as receive cache sets.
Abstract: A mechanism for improving cache performance in a data processing system is provided. A cache controller monitors a counter associated with a cache and determines whether the counter indicates that a plurality of non-dedicated cache sets within the cache should operate as spill cache sets or as receive cache sets. If the counter indicates that the non-dedicated cache sets should operate as spill cache sets, the cache controller sets them to spill an evicted cache line to an associated cache set in another cache on a cache miss. If the counter indicates that the non-dedicated cache sets should operate as receive cache sets, the cache controller sets them to receive an evicted cache line from another cache set on a cache miss.

19 citations
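The decision the abstract describes reduces to one monitored counter flipping a group of non-dedicated sets between spilling their evictions to a peer cache and receiving spills from one. A hedged C sketch, in which the counter width, threshold, and function names are assumptions rather than the patent's claims:

```c
/* Sketch of a counter-driven spill/receive controller. */
#include <stdint.h>

#define CTR_MAX 63                /* 6-bit saturating counter, assumed */

typedef enum { MODE_SPILL, MODE_RECEIVE } set_mode_t;

/* The cache controller monitors this counter; dedicated always-spill
 * and always-receive sets would train it (training not shown). */
static uint8_t monitor_ctr = CTR_MAX / 2;

set_mode_t nondedicated_mode(void)
{
    return (monitor_ctr > CTR_MAX / 2) ? MODE_SPILL : MODE_RECEIVE;
}

/* On a miss that evicts a line from a non-dedicated set: either spill
 * the victim to the associated set in another cache, or treat this set
 * as a receiver and drop the victim normally. */
void on_nondedicated_eviction(void /* victim line, peer cache elided */)
{
    if (nondedicated_mode() == MODE_SPILL) {
        /* forward the evicted line to the paired set in the peer cache */
    } else {
        /* receive mode: keep capacity free for lines spilled by peers */
    }
}
```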


Journal ArticleDOI

11 citations