
Showing papers by "Moinuddin K. Qureshi published in 2012"


Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst, and proposes a simple and highly effective Memory Access Predictor.
Abstract: This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, has organized DRAM caches similarly to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster without the need to wait for cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill and an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides 8.7% performance improvement, the "idealized" SRAM Tag design provides 24%, and our simple latency-optimized design provides 35%.
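
The abstract does not spell out the predictor's internals, so the following is a minimal sketch of one way a per-core hit/miss predictor with a small storage budget could be organized, assuming a table of 2-bit saturating counters indexed by a hash of the requesting instruction's address; the table size, hashing, and update policy here are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a DRAM-cache hit/miss predictor in the spirit of the Memory
# Access Predictor described above. All sizes and policies are illustrative.

class MemoryAccessPredictor:
    """Per-core table of 2-bit saturating counters indexed by a hash of the
    instruction address. High counter values predict "hit in DRAM cache";
    low values predict "miss", so the memory access can start immediately."""

    def __init__(self, entries=256):            # 256 x 2 bits = 64 bytes of state
        self.counters = [2] * entries           # start weakly predicting "hit"
        self.entries = entries

    def _index(self, pc):
        return hash(pc) % self.entries

    def predict_hit(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, was_hit):
        i = self._index(pc)
        if was_hit:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# Usage: consult the predictor before probing the DRAM cache; on a predicted
# miss, the off-chip memory access can proceed in parallel with the probe.
predictor = MemoryAccessPredictor()
if not predictor.predict_hit(pc=0x401a2c):
    pass  # start the main-memory access without waiting for miss detection
```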

259 citations


Journal ArticleDOI
09 Jun 2012
TL;DR: This paper proposes PreSET, an architectural technique that leverages the fundamental property of PCM devices that writes are slow only in one direction and are almost as fast as reads in the other direction and reduces effective read latency from 982 cycles to 594 cycles and increases system performance by 34%, while improving the energy-delay-product by 25%.
Abstract: Phase Change Memory (PCM) is a promising technology for building future main memory systems. A prominent characteristic of PCM is that it has write latency much higher than read latency. Servicing such slow writes causes significant contention for read requests. For our baseline PCM system, the slow writes increase the effective read latency by almost 2X, causing significant performance degradation. This paper alleviates the problem of slow writes by exploiting the fundamental property of PCM devices that writes are slow only in one direction (SET operation) and are almost as fast as reads in the other direction (RESET operation). Therefore, a write operation to a line in which all memory cells have been SET prior to the write will incur much lower latency. We propose PreSET, an architectural technique that leverages this property to pro-actively SET all the bits in a given memory line well in advance of the anticipated write to that memory line. Our proposed design initiates a PreSET request for a memory line as soon as that line becomes dirty in the cache, thereby allowing a large window of time for the PreSET operation to complete. Our evaluations show that PreSET is more effective and incurs lower storage overhead than previously proposed write cancellation techniques. We also describe static and dynamic throttling schemes to limit the rate of PreSET operations. Our proposal reduces effective read latency from 982 cycles to 594 cycles and increases system performance by 34%, while improving the energy-delay-product by 25%.
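
As a rough illustration of the flow the abstract describes, here is a minimal sketch of a controller that issues a PreSET request when a cached line first becomes dirty and later charges a fully-SET line a much smaller writeback latency. The queue size, static throttle, and latency figures are placeholder assumptions, not values from the paper.

```python
# Sketch of the PreSET idea: SET a memory line pro-actively once its cached
# copy becomes dirty, so the eventual writeback only needs fast RESET pulses.

from collections import deque

class PreSetController:
    def __init__(self, max_outstanding=32):
        self.preset_queue = deque()
        self.max_outstanding = max_outstanding   # simple static throttle
        self.preset_done = set()                 # lines whose cells are all SET

    def on_line_becomes_dirty(self, line_addr):
        # Issue PreSET early, giving it a large window of time to complete
        # before the line is evicted and written back.
        if len(self.preset_queue) < self.max_outstanding:
            self.preset_queue.append(line_addr)

    def service_preset(self):
        if self.preset_queue:
            line = self.preset_queue.popleft()
            self.preset_done.add(line)           # model: all cells of the line are SET

    def writeback_latency(self, line_addr, slow=1000, fast=150):
        # A write to a fully-SET line only programs its 0 bits (RESET), which
        # is close to read latency; otherwise the slow SET pulses dominate.
        # The cycle counts are illustrative, not the paper's numbers.
        return fast if line_addr in self.preset_done else slow
```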

160 citations


Journal ArticleDOI
09 Jun 2012
TL;DR: FLEXclusion is proposed, a design that dynamically selects between exclusion and non-inclusion depending on workload behavior and reduces the on-chip LLC insertion traffic by 72.6% and improves performance by 5.9% when implemented with negligible hardware changes.
Abstract: Exclusive last-level caches (LLCs) reduce memory accesses by effectively utilizing cache capacity. However, they require excessive on-chip bandwidth to support frequent insertions of cache lines on eviction from upper-level caches. Non-inclusive caches, on the other hand, have the advantage of using the on-chip bandwidth more effectively but suffer from a higher miss rate. Traditionally, the decision to use the cache as exclusive or non-inclusive is made at design time. However, the best option for a cache organization depends on application characteristics, such as working set size and the amount of traffic generated by LLC insertions. This paper proposes FLEXclusion, a design that dynamically selects between exclusion and non-inclusion depending on workload behavior. With FLEXclusion, the cache behaves like an exclusive cache when the application benefits from extra cache capacity, and it acts as a non-inclusive cache when the additional capacity is not useful, so that on-chip bandwidth consumption can be reduced. FLEXclusion leverages the observation that both non-inclusion and exclusion rely on similar hardware support, so our proposal can be implemented with negligible hardware changes. Our evaluations show that a FLEXclusive cache reduces the on-chip LLC insertion traffic by 72.6% compared to an exclusive design and improves performance by 5.9% compared to a non-inclusive design.
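
The abstract does not detail the selection mechanism, so the sketch below shows one plausible way to drive the exclusion/non-inclusion decision: periodically compare a measure of how much the extra capacity is helping against the LLC insertion traffic it costs. The counters, thresholds, and interval length are illustrative assumptions, not the mechanism evaluated in the paper.

```python
# Sketch of a FLEXclusion-style mode controller: pick the LLC mode for the
# next interval based on observed benefit of exclusion vs. its traffic cost.

EXCLUSIVE, NON_INCLUSIVE = "exclusive", "non-inclusive"

class FlexclusionController:
    def __init__(self, interval=1_000_000, hit_benefit_threshold=1_000,
                 traffic_threshold=50_000):
        self.interval = interval                 # accesses per decision interval
        self.hit_benefit_threshold = hit_benefit_threshold
        self.traffic_threshold = traffic_threshold
        self.mode = EXCLUSIVE
        self._reset_counters()

    def _reset_counters(self):
        self.accesses = 0
        self.exclusive_only_hits = 0             # hits that needed the extra capacity
        self.llc_insertions = 0                  # clean-victim insertions from upper levels

    def record_access(self, hit_needed_extra_capacity, caused_insertion):
        self.accesses += 1
        self.exclusive_only_hits += hit_needed_extra_capacity
        self.llc_insertions += caused_insertion
        if self.accesses >= self.interval:
            self._decide()

    def _decide(self):
        # Keep exclusion only when the extra capacity pays for its insertion traffic.
        if (self.exclusive_only_hits >= self.hit_benefit_threshold
                or self.llc_insertions < self.traffic_threshold):
            self.mode = EXCLUSIVE
        else:
            self.mode = NON_INCLUSIVE
        self._reset_counters()
```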

37 citations


Patent
18 Jun 2012
TL;DR: In this article, a dynamic system coupled with "pre-silicon" design methodologies and "post-silicon" current-optimizing programming methodologies is proposed to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections.
Abstract: A dynamic system is coupled with “pre-Silicon” design methodologies and “post-Silicon” current-optimizing programming methodologies to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections (e.g., Controlled Collapse Chip Connections, or C4s). The mechanism consists of measuring or estimating power consumption at a certain granularity within the chip, converting the power information into C4 current information, and triggering throttling mechanisms (including token-based throttling) where applicable to keep the current delivered per C4 within pre-established limits or periods. Design aids are used to allocate C4s throughout the chip based on the current delivery requirements. The system, coupled with the design and programming methodologies for improving and optimizing current delivery, is extendable to connections across layers in a multilayer 3D chip stack.
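
As a rough sketch of the described flow, the code below converts per-block power estimates into per-C4 currents and flags blocks for throttling when a supplying C4 would exceed its limit. The supply voltage, per-C4 limit, even current sharing, and block-to-C4 map are illustrative assumptions, not values from the patent.

```python
# Sketch: estimate per-block power, map it onto the C4s feeding each block,
# and throttle blocks whose C4s would exceed their current limit.

VDD = 1.0                      # supply voltage (V), illustrative
C4_CURRENT_LIMIT = 0.25        # per-C4 current limit (A), illustrative

def c4_currents(block_power_watts, block_to_c4s):
    """Convert each block's power estimate into current per supplying C4,
    assuming the block's current (P / V) is shared evenly by its C4s."""
    currents = {}
    for block, power in block_power_watts.items():
        c4s = block_to_c4s[block]
        share = (power / VDD) / len(c4s)
        for c4 in c4s:
            currents[c4] = currents.get(c4, 0.0) + share
    return currents

def blocks_to_throttle(block_power_watts, block_to_c4s):
    """Return the blocks fed by any C4 that exceeds its current limit."""
    currents = c4_currents(block_power_watts, block_to_c4s)
    hot_c4s = {c4 for c4, amps in currents.items() if amps > C4_CURRENT_LIMIT}
    return [b for b, c4s in block_to_c4s.items() if hot_c4s & set(c4s)]

# Example: the core's C4s are overloaded, so the core gets throttled.
power = {"core0": 2.0, "l2_cache": 0.5}
c4_map = {"core0": ["c4_0", "c4_1", "c4_2"], "l2_cache": ["c4_3", "c4_4"]}
print(blocks_to_throttle(power, c4_map))   # ['core0']
```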

13 citations


Patent
01 May 2012
TL;DR: Probabilistic Set Associative Cache (PAC) as mentioned in this paper is a (1+P)-way set-associative cache, where the chosen parameter called Override Probability P determines the average associativity; for example, for P=0.1 it effectively operates as a 1.1-way set-associative cache.
Abstract: A computer cache memory organization called Probabilistic Set Associative Cache (PAC) has the hardware complexity and latency of a direct-mapped cache but functions as a set-associative cache for a fraction of the time, thus yielding hit rates better than those of a direct-mapped cache. The organization is considered a (1+P)-way set-associative cache, where the chosen parameter called Override Probability P determines the average associativity; for example, for P=0.1 it effectively operates as a 1.1-way set-associative cache.
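
To make the "(1+P)-way" notion concrete, here is a minimal sketch in which each access is handled direct-mapped except that, with Override Probability P, it is treated as a two-way set-associative lookup. The two-way override is an illustrative stand-in for the patent's actual mechanism, used only to show why the average associativity works out to 1+P.

```python
# Sketch: estimate the average associativity of a (1+P)-way probabilistic cache.

import random

def effective_associativity(p_override, trials=100_000):
    """Monte-Carlo estimate of the average number of ways consulted per access."""
    ways = 0
    for _ in range(trials):
        if random.random() < p_override:
            ways += 2      # override: behave like a 2-way set-associative cache
        else:
            ways += 1      # common case: a single direct-mapped probe
    return ways / trials

print(effective_associativity(0.1))   # ~1.1, i.e. a "1.1-way" cache for P = 0.1
```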

9 citations


Patent
11 Dec 2012
TL;DR: In this paper, a method for preventing errors in a DRAM (dynamic random access memory) due to weak cells is presented; it includes determining the location of a weak cell in a row, receiving data to write to the DRAM, and encoding the data into a bit vector to be written to memory.
Abstract: This disclosure includes a method for preventing errors in a DRAM (dynamic random access memory) due to weak cells. The method involves determining the location of a weak cell in a DRAM row, receiving data to write to the DRAM, and encoding the data into a bit vector to be written to memory. For each weak cell location, the corresponding bit of the bit vector equals the reliable logic state of that weak cell, and the bit vector is longer than the data.
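
One simple encoding consistent with the abstract is sketched below: data bits are placed only in strong cells, and each weak-cell position stores that cell's known-reliable logic state, which is why the stored bit vector must be longer than the data. The row length, zero padding, and helper names are illustrative, not the patent's exact scheme.

```python
# Sketch of a weak-cell-aware row encoding: weak positions hold their reliable
# state, strong positions carry the data, so the vector is longer than the data.

def encode_row(data_bits, weak_cells, row_length):
    """data_bits: list of 0/1; weak_cells: {bit position: reliable logic state}."""
    assert len(data_bits) + len(weak_cells) <= row_length
    vector, d = [], iter(data_bits)
    for pos in range(row_length):
        if pos in weak_cells:
            vector.append(weak_cells[pos])     # weak cell stores its reliable state
        else:
            vector.append(next(d, 0))          # strong cells carry the data (0-padded)
    return vector

def decode_row(vector, weak_cells, data_length):
    """Recover the data by skipping the weak-cell positions."""
    data = [bit for pos, bit in enumerate(vector) if pos not in weak_cells]
    return data[:data_length]

# Example: an 8-bit row with a weak cell at position 3 whose reliable state is 1.
weak = {3: 1}
stored = encode_row([1, 0, 1, 1, 0, 1], weak, row_length=8)
assert decode_row(stored, weak, data_length=6) == [1, 0, 1, 1, 0, 1]
```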

7 citations


Patent
18 Jun 2012
TL;DR: In this article, a dynamic system coupled with "pre-silicon" design methodologies and "post-silicon" current-optimizing programming methodologies is proposed to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections.
Abstract: A dynamic system is coupled with “pre-Silicon” design methodologies and “post-Silicon” current-optimizing programming methodologies to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections (e.g., Controlled Collapse Chip Connections, or C4s). The mechanism consists of measuring or estimating power consumption at a certain granularity within the chip, converting the power information into C4 current information, and triggering throttling mechanisms (including token-based throttling) where applicable to keep the current delivered per C4 within pre-established limits or periods. Design aids are used to allocate C4s throughout the chip based on the current delivery requirements. The system, coupled with the design and programming methodologies for improving and optimizing current delivery, is extendable to connections across layers in a multilayer 3D chip stack.

4 citations


Patent
08 Aug 2012
TL;DR: In this paper, a method for determining an optimized refresh rate involves testing a refresh rate on rows of cells, determining an error rate of the rows, evaluating the error rate, and repeating these steps for a decreased refresh rate until the error rate is greater than a constraint, at which point a slow refresh rate is set.
Abstract: A method for determining an optimized refresh rate involves testing a refresh rate on rows of cells, determining an error rate of the rows, evaluating the error rate of the rows, and repeating these steps for a decreased refresh rate until the error rate is greater than a constraint, at which point a slow refresh rate is set.
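
The procedure is essentially a search loop, sketched below: keep lengthening the refresh interval (i.e., lowering the refresh rate) while the measured error rate of the tested rows stays within the constraint, and keep the last interval that passed. The measurement callback, step factor, and bounds are illustrative placeholders, not the patent's parameters.

```python
# Sketch of the refresh-tuning loop described above.

def find_refresh_interval(measure_error_rate, start_interval_ms=64,
                          max_error_rate=1e-9, step=2, max_interval_ms=2048):
    """measure_error_rate(interval_ms) -> observed error rate of the tested rows."""
    interval = start_interval_ms
    best = interval
    while interval <= max_interval_ms:
        if measure_error_rate(interval) > max_error_rate:
            break                       # too many errors: stop relaxing the refresh
        best = interval                 # last interval that met the constraint
        interval *= step                # refresh less often and test again
    return best                         # slowest refresh found to be acceptable

# Example with a stand-in measurement: errors appear beyond a 512 ms interval.
print(find_refresh_interval(lambda ms: 0.0 if ms <= 512 else 1e-6))   # 512
```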

4 citations


Patent
18 Jun 2012
TL;DR: In this article, the authors propose a system and method of operating an integrated circuit (IC) having a fixed layout of one or more blocks, each having one or more current sources that draw electrical current from a power source; the method includes dynamically issuing a reserve amount of tokens to a block configured to perform operations in response to instructions, determining for each issuance of an instruction to the block whether that block's reserve token amount exceeds zero, and either issuing the instruction to the block if the token reserve for that block is greater than one and decrementing the block's reserve token amount by one token after issuance, or preventing issuance of an instruction to the block.
Abstract: A system and method of operating an integrated circuit (IC) having a fixed layout of one or more blocks having one or more current sources therein that draw electrical current from a power source. The method includes dynamically issuing, to a block configured to perform operations responsive to an instruction received at the block, a reserve amount of tokens; determining for each issuance of an instruction to the block whether that block's reserve token amount exceeds zero; and one of: issuing the instruction to the block if the token reserve for that block is greater than one, and decrementing, after issuance of the instruction, the block's reserve token amount by one token; or preventing issuance of an instruction to the block. In the method, each block may be initialized to have a reserve token amount of zero, a token expiration period, a token generation cycle, and a token generation amount.
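
A minimal sketch of the per-block token scheme follows: instructions issue to a block only while it has tokens in reserve, each issue consumes a token, and the reserve is replenished every token-generation cycle. The parameter values are illustrative, and token expiration is approximated here by simply not carrying unused tokens into the next period.

```python
# Sketch of token-based instruction throttling for a block of an IC.

class TokenThrottledBlock:
    def __init__(self, generation_cycle=100, generation_amount=8):
        self.tokens = 0                      # reserve token amount starts at zero
        self.generation_cycle = generation_cycle
        self.generation_amount = generation_amount
        self.last_generation = 0

    def tick(self, cycle):
        # Replenish the reserve every generation cycle. In this sketch, unused
        # tokens are dropped at replenishment, standing in for the patent's
        # token expiration period.
        if cycle - self.last_generation >= self.generation_cycle:
            self.tokens = self.generation_amount
            self.last_generation = cycle

    def try_issue(self, cycle):
        self.tick(cycle)
        if self.tokens > 0:                  # issue only while tokens remain
            self.tokens -= 1                 # each issued instruction costs one token
            return True
        return False                         # throttle: hold the instruction

# Example: tokens arrive at cycle 100, after which the block can issue 8 instructions.
block = TokenThrottledBlock()
issued = sum(block.try_issue(cycle) for cycle in range(150))
print(issued)   # 8
```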

3 citations


Patent
18 Jun 2012
TL;DR: In this article, a dynamic system coupled with "pre-silicon" design methodologies and "post-silicon" current-optimizing programming methodologies is proposed to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections.
Abstract: A dynamic system is coupled with “pre-Silicon” design methodologies and “post-Silicon” current-optimizing programming methodologies to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections (e.g., Controlled Collapse Chip Connections, or C4s). The mechanism consists of measuring or estimating power consumption at a certain granularity within the chip, converting the power information into C4 current information, and triggering throttling mechanisms (including token-based throttling) where applicable to keep the current delivered per C4 within pre-established limits or periods. Design aids are used to allocate C4s throughout the chip based on the current delivery requirements. The system, coupled with the design and programming methodologies for improving and optimizing current delivery, is extendable to connections across layers in a multilayer 3D chip stack.

3 citations