
Showing papers by "Moinuddin K. Qureshi" published in 2015


Proceedings ArticleDOI
22 Jun 2015
TL;DR: This paper proposes AVATAR, a VRT-aware multirate refresh scheme that adaptively raises the refresh rate of individual rows at runtime as VRT failures appear; it provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.
Abstract: Multirate refresh techniques exploit the non-uniformity in retention times of DRAM cells to reduce DRAM refresh overheads. Such techniques rely on accurate profiling of retention times and perform faster refresh only for the few rows that contain cells with low retention times. Unfortunately, the retention time of a cell can change at runtime due to Variable Retention Time (VRT), which makes it impractical to deploy multirate refresh reliably. Based on experimental data from 24 DRAM chips, we develop architecture-level models for analyzing the impact of VRT. We show that simply relying on ECC DIMMs to correct VRT failures is untenable, as it causes a data error once every few months. We propose AVATAR, a VRT-aware multirate refresh scheme that adaptively changes the refresh rate of individual rows at runtime based on current VRT failures. AVATAR provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.
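
The adaptive policy can be pictured as a per-row rate table that is upgraded whenever ECC scrubbing catches a retention error. Below is a minimal sketch of that policy; the rate table, the 64ms/256ms periods, and the scrub callback are illustrative assumptions, not the paper's exact mechanism.

```python
# A minimal sketch of AVATAR-style adaptive multirate refresh. Intervals,
# structures, and hooks are illustrative, not the paper's implementation.

FAST_MS, SLOW_MS = 64, 256   # refresh periods for weak vs. strong rows

class AvatarRefreshController:
    def __init__(self, num_rows, weak_rows):
        # Initial per-row rates come from offline retention-time profiling.
        self.period = [FAST_MS if r in weak_rows else SLOW_MS
                       for r in range(num_rows)]

    def on_ecc_corrected_error(self, row):
        # A corrected error in a slow-rate row is treated as evidence of a
        # VRT cell whose retention dropped after profiling: upgrade the row
        # to the fast rate so future leakage cannot outrun refresh.
        self.period[row] = FAST_MS

    def rows_to_refresh(self, tick_ms):
        # Called every FAST_MS; slow rows refresh only on every 4th tick.
        return [r for r, p in enumerate(self.period)
                if p == FAST_MS or tick_ms % SLOW_MS == 0]
```

Because VRT failures appear slowly, only a small population of rows migrates to the fast rate over time, which is why most of the refresh savings survive.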

222 citations


Proceedings ArticleDOI
14 Mar 2015
TL;DR: This paper proposes Dual Counter Encryption (DEUCE), which is based on the observation that a typical writeback changes only a few words and therefore re-encrypts only the words that have changed, improving performance by 27% and increasing lifetime by 2x.
Abstract: Phase Change Memory (PCM) is an emerging Non-Volatile Memory (NVM) technology that has the potential to provide scalable high-density memory systems. While the non-volatility of PCM is a desirable property for saving leakage power, it also has the undesirable effect of making PCM main memories susceptible to new security vulnerabilities, for example, sensitive data remaining accessible if a PCM DIMM is stolen. PCM memories can be made secure by encrypting the data. Unfortunately, such encryption comes with a significant overhead in terms of bits written to PCM memory: half of the bits in the line change on every write, even if the actual number of bits being written to memory is small. Our studies show that a typical writeback modifies, on average, only 12% of the bits in the cacheline. Thus, encryption causes almost a 4x increase in the number of bits written to PCM memories. Such extraneous bit writes cause a significant increase in write power and reductions in write endurance and write bandwidth. To provide the benefit of secure memory in a write-efficient manner, this paper proposes Dual Counter Encryption (DEUCE). DEUCE is based on the observation that a typical writeback changes only a few words, so DEUCE re-encrypts only the words that have changed. We show that DEUCE reduces the number of modified bits per writeback for a secure memory from 50% to 24%, which improves performance by 27% and increases lifetime by 2x.
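
The mechanism keeps two counters per line: a leading counter that advances on every write and a trailing counter that advances once per epoch, with a per-word bit recording which counter encrypted each word. The sketch below illustrates this under stated assumptions: a hash-derived pad stands in for AES in counter mode, and the epoch length and word count are illustrative, not the paper's parameters.

```python
# A minimal sketch of DEUCE-style dual-counter encryption.
import hashlib

EPOCH = 32           # full-line re-encryption every EPOCH writes
WORDS_PER_LINE = 8   # 64-bit words per cacheline (illustrative)

def pad(addr, counter, word):
    # One-time pad for one word, derived from (line address, counter, word).
    h = hashlib.sha256(f"{addr}:{counter}:{word}".encode()).digest()
    return int.from_bytes(h[:8], "little")

class DeuceLine:
    def __init__(self, addr):
        self.addr = addr
        self.ctr = 0                              # leading counter
        self.modified = [False] * WORDS_PER_LINE  # words under leading ctr
        self.cipher = [pad(addr, 0, w) for w in range(WORDS_PER_LINE)]

    def trailing_ctr(self):
        return (self.ctr // EPOCH) * EPOCH        # last epoch boundary

    def write(self, plain, dirty):
        self.ctr += 1
        if self.ctr % EPOCH == 0:
            # Epoch boundary: re-encrypt every word with the new counter.
            self.modified = [False] * WORDS_PER_LINE
            todo = range(WORDS_PER_LINE)
        else:
            # Only words touched since the epoch move to the leading
            # counter; untouched words keep their old ciphertext bits.
            self.modified = [m or d for m, d in zip(self.modified, dirty)]
            todo = [w for w in range(WORDS_PER_LINE) if self.modified[w]]
        for w in todo:
            self.cipher[w] = plain[w] ^ pad(self.addr, self.ctr, w)

    def read(self):
        # Each word decrypts under whichever counter encrypted it.
        return [self.cipher[w] ^ pad(
                    self.addr,
                    self.ctr if self.modified[w] else self.trailing_ctr(),
                    w)
                for w in range(WORDS_PER_LINE)]
```

Because unmodified words keep their ciphertext bits across writes, only the typically few modified words cost bit flips in PCM, which is where the endurance and power savings come from.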

129 citations


Journal ArticleDOI
TL;DR: Two architectural solutions are proposed that mitigate Row Hammering with negligible performance loss: Counter-Based Row Activation, which tracks per-row activation counts, and Probabilistic Row Activation, which obviates the storage overhead of tracking by simply having the memory controller issue dummy activations to neighboring rows with a small probability on every memory access.
Abstract: DRAM scaling has been the prime driver of the increasing capacity of main memory systems. Unfortunately, lower technology nodes worsen cell reliability, as they increase the coupling between adjacent DRAM cells, thereby exacerbating different failure modes. This paper investigates the reliability problem due to Row Hammering, whereby frequent activations of a given row can cause data loss in its neighboring rows. As DRAM scales to lower technology nodes, the threshold number of row activations that causes data loss in neighboring rows decreases, making Row Hammering a challenging problem for future DRAM chips. To overcome Row Hammering, we propose two architectural solutions. First, Counter-Based Row Activation (CRA) uses a counter with each row to count the number of row activations; if the count exceeds the row hammering threshold, a dummy activation is sent to neighboring rows proactively to refresh the data. Second, Probabilistic Row Activation (PRA) obviates the storage overhead of tracking and simply allows the memory controller to proactively issue dummy activations to neighboring rows with a small probability on every memory access. Our evaluations show that these solutions are effective at mitigating Row Hammering while causing negligible performance loss (less than 1%).
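
PRA's effectiveness rests on a simple probabilistic argument: an attacker needs tens of thousands of activations to flip a bit, and the chance that none of them triggers a dummy neighbor refresh shrinks exponentially. A minimal sketch, assuming a hypothetical dram handle with an issue_activate method and an illustrative probability:

```python
# A minimal sketch of Probabilistic Row Activation (PRA). The dram handle
# and its issue_activate method are hypothetical; P_DUMMY is illustrative.
import random

P_DUMMY = 0.001   # per-activation chance of refreshing the neighbors

def activate(dram, row):
    dram.issue_activate(row)
    if random.random() < P_DUMMY:
        # Refresh both physical neighbors. With a hammering threshold of
        # T ~ 100,000 activations, the chance that no dummy activation
        # fires during a hammering run is (1 - P_DUMMY)**T ~ e**-100,
        # effectively zero, while the activation overhead is only ~0.2%.
        for victim in (row - 1, row + 1):
            if 0 <= victim < dram.num_rows:
                dram.issue_activate(victim)
```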

118 citations


Proceedings ArticleDOI
13 Jun 2015
TL;DR: Bandwidth Efficient ARchitecture (BEAR) for DRAM caches integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes; it reduces the bandwidth consumption of the DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%.
Abstract: Die-stacking memory technology can enable gigascale DRAM caches that operate at 4x-8x higher bandwidth than commodity DRAM. Such caches can improve system performance by servicing data at a faster rate when the requested data is found in the cache, potentially increasing the memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache uses the available memory bandwidth not only for data transfer on cache hits, but also for secondary operations such as cache miss detection, fill on cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, we want the bandwidth consumed by such secondary operations to be negligible, so that almost all the bandwidth is available for transferring useful data from the DRAM cache to the processor. We evaluate a 1GB DRAM cache, architected as an Alloy Cache, and show that even the most bandwidth-efficient DRAM cache proposal consumes 3.8x the bandwidth of an idealized DRAM cache that spends no bandwidth on secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes the Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces the bandwidth consumption of the DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. BEAR, with negligible overhead, outperforms an idealized SRAM Tag-Store design that incurs an unacceptable overhead of 64 megabytes, as well as Sector Cache designs that incur an SRAM storage overhead of 6 megabytes.
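
The miss-detection component is the easiest to illustrate: if a small SRAM structure can prove that a line is absent from the DRAM cache, a guaranteed miss skips the in-DRAM tag probe and goes straight to off-chip memory. The counting Bloom filter below is an illustrative stand-in for that idea, not the structure BEAR actually uses:

```python
# A minimal sketch of probe-avoidance for DRAM-cache miss detection. A
# counting Bloom filter (illustrative stand-in) tracks cache contents.
import hashlib

class PresenceFilter:
    def __init__(self, slots=1 << 16, hashes=4):
        self.counts = [0] * slots
        self.slots, self.k = slots, hashes

    def _idx(self, line_addr):
        for i in range(self.k):
            h = hashlib.sha256(f"{line_addr}:{i}".encode()).digest()
            yield int.from_bytes(h[:4], "little") % self.slots

    def insert(self, line_addr):       # called on DRAM-cache fill
        for i in self._idx(line_addr):
            self.counts[i] += 1

    def remove(self, line_addr):       # called on DRAM-cache eviction
        for i in self._idx(line_addr):
            self.counts[i] -= 1

    def may_contain(self, line_addr):  # False => definite miss: skip probe
        return all(self.counts[i] > 0 for i in self._idx(line_addr))
```

Because a Bloom filter has no false negatives, a "not present" answer is always safe to act on; only "may be present" answers pay the probe bandwidth.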

86 citations


Proceedings ArticleDOI
13 Jun 2015
TL;DR: This work introduces FlashMap, an SSD interface optimized for memory-mapped SSD files; by folding the address translations of the virtual memory layer, the file system, and the FTL into a single set of page tables, it reduces critical-path latency and improves DRAM caching efficiency.
Abstract: Applications can map data on SSDs into virtual memory to transparently scale beyond DRAM capacity, permitting them to leverage high SSD capacities with few code changes. Obtaining good performance for memory-mapped SSD content, however, is hard because the virtual memory layer, the file system, and the flash translation layer (FTL) perform address translations, sanity checks, and permission checks independently of each other. We introduce FlashMap, an SSD interface that is optimized for memory-mapped SSD files. FlashMap combines all the address translations into page tables that are used to index files and also to store the FTL-level mappings, without altering the guarantees of the file system or the FTL. It uses the state in the OS memory manager and the page tables to perform sanity and permission checks, respectively. By combining these layers, FlashMap reduces critical-path latency and improves DRAM caching efficiency. We find that this increases application performance by up to 3.32x compared to state-of-the-art SSD file-mapping mechanisms. Additionally, the latency of SSD accesses is reduced by up to 53.2%.
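
The performance argument is easiest to see as a translation pipeline. A minimal sketch contrasting the conventional three-layer path with a combined one; the dict-based tables are illustrative stand-ins for page tables, the file system's extent map, and the FTL's mapping table:

```python
# A minimal sketch of the translation collapse behind FlashMap. Dicts stand
# in for the page table, the file system's extent map, and the FTL table.

# Conventional path: three dependent lookups on the critical path, each
# layer maintaining (and caching) its own mapping state.
def lookup_stacked(vaddr, page_table, extent_map, ftl_table):
    file_offset = page_table[vaddr]           # virtual memory layer
    logical_block = extent_map[file_offset]   # file system
    return ftl_table[logical_block]           # flash translation layer

# FlashMap-style path: the leaf page-table entry already holds the flash
# location, so a single lookup performs the whole translation; sanity and
# permission checks ride on existing OS and page-table state.
def lookup_combined(vaddr, combined_table):
    return combined_table[vaddr]
```

Collapsing the layers also means only one set of translation entries competes for DRAM space, which is where the caching-efficiency gain comes from.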

52 citations


Proceedings ArticleDOI
01 Feb 2015
TL;DR: A combination of Early Read and Turbo Read can reduce the PCM read latency by 30%, improve the system performance by 21%, and reduce the Energy Delay Product (EDP) by 28%, while requiring minimal changes to the memory system.
Abstract: Phase Change Memory (PCM) is an emerging memory technology that can enable scalable high-density main memory systems. Unfortunately, PCM has higher read latency than DRAM, resulting in lower system performance. This paper investigates architectural techniques to improve the read latency of PCM. We observe that there is a wide distribution in cell resistance in both the SET state and the RESET state, and that the read latency of PCM is designed conservatively to handle the worst-case cell. If PCM sensing can be tuned to exploit the variability in cell resistance, read latency can be reduced. We propose two schemes to enable better-than-worst-case read latency for PCM systems. Our first proposal, Early Read, reads the data earlier than the specified time period. Our key observation that Early Read causes only unidirectional errors (SET being read as RESET) allows us to detect data errors efficiently using Berger codes. In the uncommon case that Early Read causes data errors, we simply retry the read operation with the original latency. Our evaluations show that Early Read can reduce the read latency by 25% while incurring a storage overhead of only 10 bits per 64-byte line. Our second proposal, Turbo Read, reduces the sensing time for read operations by pumping higher current, at the expense of accidentally switching the PCM cell with small probability during the read operation. We analyze Error Correction Codes (ECC) and Probabilistic Row Scrubbing (PRS) for maintaining data integrity under Turbo Read. We show that a combination of Early Read and Turbo Read can reduce PCM read latency by 30%, improve system performance by 21%, and reduce the Energy Delay Product (EDP) by 28%, while requiring minimal changes to the memory system.
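
The Berger-code choice follows directly from the unidirectional-error observation: storing the count of zero bits detects any pattern of 1-to-0 flips, because such flips can only raise the recomputed zero count while only lowering (or preserving) the stored count. A minimal sketch, with widths taken from the abstract:

```python
# A minimal sketch of Berger-code detection for Early Read. A 64-byte
# (512-bit) line needs ceil(log2(513)) = 10 check bits, matching the
# 10-bit-per-line overhead quoted in the abstract.

DATA_BITS = 512

def berger_check(data):
    # Check symbol = number of zero bits in the data word.
    return DATA_BITS - bin(data).count("1")

def early_read_ok(read_data, stored_check):
    # A 1->0 misread (SET sensed as RESET) in the data raises the
    # recomputed zero count, while 1->0 flips inside the stored check
    # symbol can only lower its value. The two can never re-converge, so
    # equality implies the aggressive early read returned correct data.
    return berger_check(read_data) == stored_check

# On a mismatch, the controller simply retries the read at the full,
# conservative latency; this case is rare enough that average read
# latency still drops by 25%.
```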

50 citations


Proceedings ArticleDOI
22 Jun 2015
TL;DR: This paper proposes Morphable ECC, which reduces refresh operations during idle mode by 16x and idle-mode memory power by 2x, while retaining performance within 2% of a system that does not use any ECC.
Abstract: Energy consumption is a primary consideration that determines the usability of emerging mobile computing devices such as smartphones. Refresh operations for main memory account for a significant fraction of overall energy consumption, especially during idle periods: the processor can be switched off quickly, but memory contents must continue to be refreshed to avoid data loss. Given that mobile devices are idle most of the time, reducing refresh power in idle mode is critical to maximizing the duration for which the device remains usable. The frequency of refresh operations can be reduced significantly by using strong multi-bit error correcting codes (ECC). Unfortunately, strong ECC incurs high latency, which causes significant performance degradation (as high as 21%, and on average 10%). To obtain both low refresh power in idle periods and high performance in active periods, this paper proposes Morphable ECC (MECC). During idle periods, MECC keeps the memory protected with 6-bit ECC (ECC-6) and employs a refresh period of 1 second, instead of the typical 64ms. During active operation, MECC reduces the refresh interval to 64ms and converts memory from ECC-6 to a weaker ECC (single-bit error correction) on demand, thus avoiding the high latency of ECC-6 except for the first access in active mode. Our proposal reduces refresh operations during idle mode by 16x and idle-mode memory power by 2x, while retaining performance within 2% of a system that does not use any ECC.
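
The policy amounts to a small state machine: strong ECC plus a 1-second refresh period while idle, then a lazy per-line downgrade to SECDED once the system wakes. A minimal sketch, assuming hypothetical memory-controller hooks (lines, set_refresh_interval, encode/decode routines) that the paper does not name:

```python
# A minimal sketch of the MECC mode switch. The mem handle and all its
# methods are hypothetical stand-ins for memory-controller operations.

IDLE_REFRESH_MS, ACTIVE_REFRESH_MS = 1000, 64

class MorphableEcc:
    def __init__(self, mem):
        self.mem = mem
        self.ecc6_lines = set()            # lines still under strong ECC-6

    def enter_idle(self):
        # Re-encode everything with ECC-6, then slow refresh to 1 second.
        for line in self.mem.lines():
            self.mem.encode_ecc6(line)
            self.ecc6_lines.add(line)
        self.mem.set_refresh_interval(IDLE_REFRESH_MS)

    def enter_active(self):
        # The refresh rate must rise immediately; ECC conversion is lazy.
        self.mem.set_refresh_interval(ACTIVE_REFRESH_MS)

    def on_access(self, line):
        if line in self.ecc6_lines:
            # First touch after wakeup: pay the ECC-6 latency once, then
            # drop the line to fast single-bit-correcting SECDED.
            self.mem.decode_ecc6(line)
            self.mem.encode_secded(line)
            self.ecc6_lines.discard(line)
```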

38 citations


Journal ArticleDOI
TL;DR: This article presents FaultSim, a fast configurable memory-reliability simulation tool for 2D and 3D-stacked memory systems, and discusses the novel algorithms and data structures FaultSim uses to accelerate the evaluation of different resilience schemes.
Abstract: As memory systems scale, maintaining their Reliability, Availability, and Serviceability (RAS) is becoming more complex. To make matters worse, recent studies of DRAM failures in data centers and supercomputer environments have highlighted that large-granularity failures are common in DRAM chips. Furthermore, the move toward 3D-stacked memories can make the system vulnerable to newer failure modes, such as those arising from faults in Through-Silicon Vias (TSVs). To architect future systems and to use emerging technology, system designers will need to employ strong error correction and repair techniques. Unfortunately, evaluating the relative effectiveness of these reliability mechanisms is difficult and is traditionally done with analytical models, which are both error-prone and time-consuming to develop. To this end, this article proposes FaultSim, a fast configurable memory-reliability simulation tool for 2D and 3D-stacked memory systems. FaultSim employs Monte Carlo simulations driven by real-world failure statistics. We discuss the novel algorithms and data structures used in FaultSim to accelerate the evaluation of different resilience schemes. We implement BCH-1 (SECDED) and ChipKill codes in FaultSim and validate them against an analytical model, observing deviations of only 0.032% and 8.41%, respectively. FaultSim can simulate 1 million Monte Carlo trials (each covering a period of 7 years) of BCH-1 and ChipKill codes in only 34 and 33 seconds, respectively.
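
The Monte Carlo approach is straightforward to sketch: sample a random set of faults for each trial, apply the error-correction rule, and count the trials where correction fails. The toy below uses a made-up per-bit fault rate and a simplified SECDED rule (a word with two or more faulty bits is uncorrectable); the real tool is driven by field failure statistics and also models row, column, and bank-granularity faults.

```python
# A minimal Monte Carlo sketch in the spirit of FaultSim; all rates and
# sizes are illustrative.
import math
import random

YEARS, TRIALS = 7, 10_000
WORDS, BITS_PER_WORD = 1_000_000, 72   # 64 data + 8 check bits per word
BIT_FAULTS_PER_YEAR = 1e-7             # made-up per-bit fault rate

def poisson(lam):
    # Knuth's method; adequate for the modest means used here.
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def trial_fails():
    n_faults = poisson(BIT_FAULTS_PER_YEAR * YEARS * WORDS * BITS_PER_WORD)
    hits = {}
    for _ in range(n_faults):
        w = random.randrange(WORDS)    # each fault lands in a random word
        hits[w] = hits.get(w, 0) + 1
    # SECDED corrects one bad bit per word; two or more means data loss.
    return any(c >= 2 for c in hits.values())

failures = sum(trial_fails() for _ in range(TRIALS))
print(f"P(uncorrectable error within {YEARS} years) ~ {failures / TRIALS:.4f}")
```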

25 citations