
Showing papers by "Moinuddin K. Qureshi published in 2012"


Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst, and proposes a simple and highly effective Memory Access Predictor.
Abstract: This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, has organized DRAM caches similarly to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster without the need to wait for cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill and an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides 8.7% performance improvement, the "idealized" SRAM Tag design provides 24%, and our simple latency-optimized design provides 35%.
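
The abstract does not spell out the predictor's internals, so the following is a minimal sketch of one way a per-core hit/miss predictor with a small storage budget could be organized, assuming a table of 2-bit saturating counters indexed by a hash of the requesting instruction's address; the table size, hashing, and update policy here are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a DRAM-cache hit/miss predictor in the spirit of the Memory
# Access Predictor described above. All sizes and policies are illustrative.

class MemoryAccessPredictor:
    """Per-core table of 2-bit saturating counters indexed by a hash of the
    instruction address. High counter values predict "hit in DRAM cache";
    low values predict "miss", so the memory access can start immediately."""

    def __init__(self, entries=256):            # 256 x 2 bits = 64 bytes of state
        self.counters = [2] * entries           # start weakly predicting "hit"
        self.entries = entries

    def _index(self, pc):
        return hash(pc) % self.entries

    def predict_hit(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, was_hit):
        i = self._index(pc)
        if was_hit:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# Usage: consult the predictor before probing the DRAM cache; on a predicted
# miss, the off-chip memory access can proceed in parallel with the probe.
predictor = MemoryAccessPredictor()
if not predictor.predict_hit(pc=0x401a2c):
    pass  # start the main-memory access without waiting for miss detection
```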

259 citations


Journal ArticleDOI
09 Jun 2012
TL;DR: This paper proposes PreSET, an architectural technique that leverages the fundamental property of PCM devices that writes are slow only in one direction and are almost as fast as reads in the other direction and reduces effective read latency from 982 cycles to 594 cycles and increases system performance by 34%, while improving the energy-delay-product by 25%.
Abstract: Phase Change Memory (PCM) is a promising technology for building future main memory systems. A prominent characteristic of PCM is that it has write latency much higher than read latency. Servicing such slow writes causes significant contention for read requests. For our baseline PCM system, the slow writes increase the effective read latency by almost 2X, causing significant performance degradation. This paper alleviates the problem of slow writes by exploiting the fundamental property of PCM devices that writes are slow only in one direction (SET operation) and are almost as fast as reads in the other direction (RESET operation). Therefore, a write operation to a line in which all memory cells have been SET prior to the write will incur much lower latency. We propose PreSET, an architectural technique that leverages this property to pro-actively SET all the bits in a given memory line well in advance of the anticipated write to that memory line. Our proposed design initiates a PreSET request for a memory line as soon as that line becomes dirty in the cache, thereby allowing a large window of time for the PreSET operation to complete. Our evaluations show that PreSET is more effective and incurs lower storage overhead than previously proposed write cancellation techniques. We also describe static and dynamic throttling schemes to limit the rate of PreSET operations. Our proposal reduces effective read latency from 982 cycles to 594 cycles and increases system performance by 34%, while improving the energy-delay-product by 25%.
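
As a rough illustration of the flow the abstract describes, here is a minimal sketch of a controller that issues a PreSET request when a cached line first becomes dirty and later charges a fully-SET line a much smaller writeback latency. The queue size, static throttle, and latency figures are placeholder assumptions, not values from the paper.

```python
# Sketch of the PreSET idea: SET a memory line pro-actively once its cached
# copy becomes dirty, so the eventual writeback only needs fast RESET pulses.

from collections import deque

class PreSetController:
    def __init__(self, max_outstanding=32):
        self.preset_queue = deque()
        self.max_outstanding = max_outstanding   # simple static throttle
        self.preset_done = set()                 # lines whose cells are all SET

    def on_line_becomes_dirty(self, line_addr):
        # Issue PreSET early, giving it a large window of time to complete
        # before the line is evicted and written back.
        if len(self.preset_queue) < self.max_outstanding:
            self.preset_queue.append(line_addr)

    def service_preset(self):
        if self.preset_queue:
            line = self.preset_queue.popleft()
            self.preset_done.add(line)           # model: all cells of the line are SET

    def writeback_latency(self, line_addr, slow=1000, fast=150):
        # A write to a fully-SET line only programs its 0 bits (RESET), which
        # is close to read latency; otherwise the slow SET pulses dominate.
        # The cycle counts are illustrative, not the paper's numbers.
        return fast if line_addr in self.preset_done else slow
```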

160 citations


Journal ArticleDOI
09 Jun 2012
TL;DR: FLEXclusion is proposed, a design that dynamically selects between exclusion and non-inclusion depending on workload behavior and reduces the on-chip LLC insertion traffic by 72.6% and improves performance by 5.9% when implemented with negligible hardware changes.
Abstract: Exclusive last-level caches (LLCs) reduce memory accesses by effectively utilizing cache capacity. However, they require excessive on-chip bandwidth to support frequent insertions of cache lines on eviction from upper-level caches. Non-inclusive caches, on the other hand, have the advantage of using the on-chip bandwidth more effectively but suffer from a higher miss rate. Traditionally, the decision to use the cache as exclusive or non-inclusive is made at design time. However, the best option for a cache organization depends on application characteristics, such as working set size and the amount of traffic generated by LLC insertions. This paper proposes FLEXclusion, a design that dynamically selects between exclusion and non-inclusion depending on workload behavior. With FLEXclusion, the cache behaves like an exclusive cache when the application benefits from extra cache capacity, and it acts as a non-inclusive cache when the additional capacity is not useful, so that on-chip bandwidth consumption can be reduced. FLEXclusion leverages the observation that both non-inclusion and exclusion rely on similar hardware support, so our proposal can be implemented with negligible hardware changes. Our evaluations show that a FLEXclusive cache reduces the on-chip LLC insertion traffic by 72.6% compared to an exclusive design and improves performance by 5.9% compared to a non-inclusive design.
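
The abstract does not detail the selection mechanism, so the sketch below shows one plausible way to drive the exclusion/non-inclusion decision: periodically compare a measure of how much the extra capacity is helping against the LLC insertion traffic it costs. The counters, thresholds, and interval length are illustrative assumptions, not the mechanism evaluated in the paper.

```python
# Sketch of a FLEXclusion-style mode controller: pick the LLC mode for the
# next interval based on observed benefit of exclusion vs. its traffic cost.

EXCLUSIVE, NON_INCLUSIVE = "exclusive", "non-inclusive"

class FlexclusionController:
    def __init__(self, interval=1_000_000, hit_benefit_threshold=1_000,
                 traffic_threshold=50_000):
        self.interval = interval                 # accesses per decision interval
        self.hit_benefit_threshold = hit_benefit_threshold
        self.traffic_threshold = traffic_threshold
        self.mode = EXCLUSIVE
        self._reset_counters()

    def _reset_counters(self):
        self.accesses = 0
        self.exclusive_only_hits = 0             # hits that needed the extra capacity
        self.llc_insertions = 0                  # clean-victim insertions from upper levels

    def record_access(self, hit_needed_extra_capacity, caused_insertion):
        self.accesses += 1
        self.exclusive_only_hits += hit_needed_extra_capacity
        self.llc_insertions += caused_insertion
        if self.accesses >= self.interval:
            self._decide()

    def _decide(self):
        # Keep exclusion only when the extra capacity pays for its insertion traffic.
        if (self.exclusive_only_hits >= self.hit_benefit_threshold
                or self.llc_insertions < self.traffic_threshold):
            self.mode = EXCLUSIVE
        else:
            self.mode = NON_INCLUSIVE
        self._reset_counters()
```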

37 citations


Patent
18 Jun 2012
TL;DR: In this article, a dynamic system coupled with "pre-silicon" design methodologies and "post-silicon" current-optimizing programming methodologies is proposed to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections.
Abstract: A dynamic system is coupled with “pre-Silicon” design methodologies and “post-Silicon” current-optimizing programming methodologies to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections (e.g., Controlled Collapse Chip Connections, or C4s). The mechanism consists of measuring or estimating power consumption at a certain granularity within the chip, converting the power information into C4 current information, and triggering throttling mechanisms (including token-based throttling) where applicable to keep the current delivered per C4 within pre-established limits or periods. Design aids are used to allocate C4s throughout the chip based on the current delivery requirements. The system, coupled with the design and programming methodologies for improving and optimizing current delivery, is extendable to connections across layers in a multilayer 3D chip stack.
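
As a rough sketch of the described flow, the code below converts per-block power estimates into per-C4 currents and flags blocks for throttling when a supplying C4 would exceed its limit. The supply voltage, per-C4 limit, even current sharing, and block-to-C4 map are illustrative assumptions, not values from the patent.

```python
# Sketch: estimate per-block power, map it onto the C4s feeding each block,
# and throttle blocks whose C4s would exceed their current limit.

VDD = 1.0                      # supply voltage (V), illustrative
C4_CURRENT_LIMIT = 0.25        # per-C4 current limit (A), illustrative

def c4_currents(block_power_watts, block_to_c4s):
    """Convert each block's power estimate into current per supplying C4,
    assuming the block's current (P / V) is shared evenly by its C4s."""
    currents = {}
    for block, power in block_power_watts.items():
        c4s = block_to_c4s[block]
        share = (power / VDD) / len(c4s)
        for c4 in c4s:
            currents[c4] = currents.get(c4, 0.0) + share
    return currents

def blocks_to_throttle(block_power_watts, block_to_c4s):
    """Return the blocks fed by any C4 that exceeds its current limit."""
    currents = c4_currents(block_power_watts, block_to_c4s)
    hot_c4s = {c4 for c4, amps in currents.items() if amps > C4_CURRENT_LIMIT}
    return [b for b, c4s in block_to_c4s.items() if hot_c4s & set(c4s)]

# Example: the core's C4s are overloaded, so the core gets throttled.
power = {"core0": 2.0, "l2_cache": 0.5}
c4_map = {"core0": ["c4_0", "c4_1", "c4_2"], "l2_cache": ["c4_3", "c4_4"]}
print(blocks_to_throttle(power, c4_map))   # ['core0']
```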

13 citations


Patent
01 May 2012
TL;DR: Probabilistic Set Associative Cache (PAC) as mentioned in this paper is a (1+P)-way set-associative cache, where the chosen parameter called Override Probability P determines the average associativity; for example, for P=0.1 it effectively operates as a 1.1-way set-associative cache.
Abstract: A computer cache memory organization called Probabilistic Set Associative Cache (PAC) has the hardware complexity and latency of a direct-mapped cache but functions as a set-associative cache for a fraction of the time, thus yielding hit rates better than those of a direct-mapped cache. The organization is considered a (1+P)-way set-associative cache, where the chosen parameter called Override Probability P determines the average associativity; for example, for P=0.1 it effectively operates as a 1.1-way set-associative cache.
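
To make the "(1+P)-way" notion concrete, here is a minimal sketch in which each access is handled direct-mapped except that, with Override Probability P, it is treated as a two-way set-associative lookup. The two-way override is an illustrative stand-in for the patent's actual mechanism, used only to show why the average associativity works out to 1+P.

```python
# Sketch: estimate the average associativity of a (1+P)-way probabilistic cache.

import random

def effective_associativity(p_override, trials=100_000):
    """Monte-Carlo estimate of the average number of ways consulted per access."""
    ways = 0
    for _ in range(trials):
        if random.random() < p_override:
            ways += 2      # override: behave like a 2-way set-associative cache
        else:
            ways += 1      # common case: a single direct-mapped probe
    return ways / trials

print(effective_associativity(0.1))   # ~1.1, i.e. a "1.1-way" cache for P = 0.1
```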

9 citations


Patent
11 Dec 2012
TL;DR: In this paper, a method for preventing errors in a DRAM (dynamic random access memory) due to weak cells is presented; it includes determining the location of a weak cell in a row, receiving data to write to the DRAM, and encoding the data into a bit vector to be written to memory.
Abstract: This disclosure includes a method for preventing errors in a DRAM (dynamic random access memory) due to weak cells. The method involves determining the location of a weak cell in a DRAM row, receiving data to write to the DRAM, and encoding the data into a bit vector to be written to memory. For each weak cell location, the corresponding bit of the bit vector equals the reliable logic state of that weak cell, and the bit vector is longer than the data.
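
One simple encoding consistent with the abstract is sketched below: data bits are placed only in strong cells, and each weak-cell position stores that cell's known-reliable logic state, which is why the stored bit vector must be longer than the data. The row length, zero padding, and helper names are illustrative, not the patent's exact scheme.

```python
# Sketch of a weak-cell-aware row encoding: weak positions hold their reliable
# state, strong positions carry the data, so the vector is longer than the data.

def encode_row(data_bits, weak_cells, row_length):
    """data_bits: list of 0/1; weak_cells: {bit position: reliable logic state}."""
    assert len(data_bits) + len(weak_cells) <= row_length
    vector, d = [], iter(data_bits)
    for pos in range(row_length):
        if pos in weak_cells:
            vector.append(weak_cells[pos])     # weak cell stores its reliable state
        else:
            vector.append(next(d, 0))          # strong cells carry the data (0-padded)
    return vector

def decode_row(vector, weak_cells, data_length):
    """Recover the data by skipping the weak-cell positions."""
    data = [bit for pos, bit in enumerate(vector) if pos not in weak_cells]
    return data[:data_length]

# Example: an 8-bit row with a weak cell at position 3 whose reliable state is 1.
weak = {3: 1}
stored = encode_row([1, 0, 1, 1, 0, 1], weak, row_length=8)
assert decode_row(stored, weak, data_length=6) == [1, 0, 1, 1, 0, 1]
```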

7 citations


Patent
18 Jun 2012
TL;DR: In this article, a dynamic system coupled with "pre-silicon" design methodologies and "post-silicon" current-optimizing programming methodologies is proposed to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections.
Abstract: A dynamic system is coupled with “pre-Silicon” design methodologies and “post-Silicon” current-optimizing programming methodologies to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections (e.g., Controlled Collapse Chip Connections, or C4s). The mechanism consists of measuring or estimating power consumption at a certain granularity within the chip, converting the power information into C4 current information, and triggering throttling mechanisms (including token-based throttling) where applicable to keep the current delivered per C4 within pre-established limits or periods. Design aids are used to allocate C4s throughout the chip based on the current delivery requirements. The system, coupled with the design and programming methodologies for improving and optimizing current delivery, is extendable to connections across layers in a multilayer 3D chip stack.

4 citations


Patent
08 Aug 2012
TL;DR: In this paper, a method for determining an optimized refresh rate involves testing a refresh rate on rows of cells, determining an error rate of the rows, evaluating the error rate, and repeating these steps for a decreased refresh rate until the error rate is greater than a constraint, at which point a slow refresh rate is set.
Abstract: A method for determining an optimized refresh rate involves testing a refresh rate on rows of cells, determining an error rate of the rows, evaluating the error rate of the rows, and repeating these steps for a decreased refresh rate until the error rate is greater than a constraint, at which point a slow refresh rate is set.
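
The procedure is essentially a search loop, sketched below: keep lengthening the refresh interval (i.e., lowering the refresh rate) while the measured error rate of the tested rows stays within the constraint, and keep the last interval that passed. The measurement callback, step factor, and bounds are illustrative placeholders, not the patent's parameters.

```python
# Sketch of the refresh-tuning loop described above.

def find_refresh_interval(measure_error_rate, start_interval_ms=64,
                          max_error_rate=1e-9, step=2, max_interval_ms=2048):
    """measure_error_rate(interval_ms) -> observed error rate of the tested rows."""
    interval = start_interval_ms
    best = interval
    while interval <= max_interval_ms:
        if measure_error_rate(interval) > max_error_rate:
            break                       # too many errors: stop relaxing the refresh
        best = interval                 # last interval that met the constraint
        interval *= step                # refresh less often and test again
    return best                         # slowest refresh found to be acceptable

# Example with a stand-in measurement: errors appear beyond a 512 ms interval.
print(find_refresh_interval(lambda ms: 0.0 if ms <= 512 else 1e-6))   # 512
```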

4 citations


Patent
18 Jun 2012
TL;DR: In this article, the authors propose a system and method of operating an integrated circuit (IC) having a fixed layout of one or more blocks, each having one or more current sources that draw electrical current from a power source; the method includes dynamically issuing a reserve amount of tokens to a block configured to perform operations in response to instructions, determining for each issuance of an instruction to the block whether that block's reserve token amount exceeds zero, and either issuing the instruction to the block if the token reserve for that block is greater than one and decrementing the block's reserve token amount by one token after issuance, or preventing issuance of an instruction to the block.
Abstract: A system and method of operating an integrated circuit (IC) having a fixed layout of one or more blocks having one or more current sources therein that draw electrical current from a power source. The method includes dynamically issuing, to a block configured to perform operations responsive to an instruction received at the block, a reserve amount of tokens; determining for each issuance of an instruction to the block whether that block's reserve token amount exceeds zero; and one of: issuing the instruction to the block if the token reserve for that block is greater than one, and decrementing, after issuance of the instruction, the block's reserve token amount by one token; or preventing issuance of an instruction to the block. In the method, each block may be initialized to have a reserve token amount of zero, a token expiration period, a token generation cycle, and a token generation amount.
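
A minimal sketch of the per-block token scheme follows: instructions issue to a block only while it has tokens in reserve, each issue consumes a token, and the reserve is replenished every token-generation cycle. The parameter values are illustrative, and token expiration is approximated here by simply not carrying unused tokens into the next period.

```python
# Sketch of token-based instruction throttling for a block of an IC.

class TokenThrottledBlock:
    def __init__(self, generation_cycle=100, generation_amount=8):
        self.tokens = 0                      # reserve token amount starts at zero
        self.generation_cycle = generation_cycle
        self.generation_amount = generation_amount
        self.last_generation = 0

    def tick(self, cycle):
        # Replenish the reserve every generation cycle. In this sketch, unused
        # tokens are dropped at replenishment, standing in for the patent's
        # token expiration period.
        if cycle - self.last_generation >= self.generation_cycle:
            self.tokens = self.generation_amount
            self.last_generation = cycle

    def try_issue(self, cycle):
        self.tick(cycle)
        if self.tokens > 0:                  # issue only while tokens remain
            self.tokens -= 1                 # each issued instruction costs one token
            return True
        return False                         # throttle: hold the instruction

# Example: tokens arrive at cycle 100, after which the block can issue 8 instructions.
block = TokenThrottledBlock()
issued = sum(block.try_issue(cycle) for cycle in range(150))
print(issued)   # 8
```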

3 citations


Patent
18 Jun 2012
TL;DR: In this article, a dynamic system coupled with "pre-silicon" design methodologies and "post-silicon" current-optimizing programming methodologies is proposed to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections.
Abstract: A dynamic system is coupled with “pre-Silicon” design methodologies and “post-Silicon” current-optimizing programming methodologies to improve and optimize current delivery into a chip, which is limited by the physical properties of the connections (e.g., Controlled Collapse Chip Connections, or C4s). The mechanism consists of measuring or estimating power consumption at a certain granularity within the chip, converting the power information into C4 current information, and triggering throttling mechanisms (including token-based throttling) where applicable to keep the current delivered per C4 within pre-established limits or periods. Design aids are used to allocate C4s throughout the chip based on the current delivery requirements. The system, coupled with the design and programming methodologies for improving and optimizing current delivery, is extendable to connections across layers in a multilayer 3D chip stack.

3 citations