Showing papers by "Moinuddin K. Qureshi published in 2016"


Proceedings ArticleDOI
12 Mar 2016
TL;DR: Proposes Low-Cost Inter-Linked Subarrays (LISA), a new DRAM substrate that enables fast and efficient data movement across a large range of memory at low cost; each of its three applications improves performance and energy efficiency, and their combined benefit exceeds that of each alone across a variety of workloads and system configurations.
Abstract: This paper introduces a new DRAM design that enables fast and energy-efficient bulk data movement across subarrays in a DRAM chip. While bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and energy consumption. Prior work proposed to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM. We propose a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling an array of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA's three applications significantly improves performance and memory energy efficiency, and their combined benefit is higher than the benefit of each alone, on a variety of workloads and system configurations.
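
The key trade-off can be sketched with a toy model: off-chip copies cost time proportional to the data size, while LISA's hop-based row-buffer movement costs time proportional to the subarray distance. All parameters below (row size, bus width, per-hop latency) are illustrative assumptions, not figures from the paper.

```python
# Toy cost model contrasting a conventional off-chip copy with LISA-style
# inter-subarray row movement. Every constant here is a placeholder
# assumption chosen for illustration, not a value from the paper.

ROW_BYTES = 8192          # assumed DRAM row size
BUS_BYTES_PER_CYCLE = 8   # assumed width of the narrow off-chip channel

def offchip_copy_cycles(row_bytes=ROW_BYTES):
    # Conventional path: the row crosses the channel twice
    # (DRAM -> processor, then processor -> DRAM).
    return 2 * row_bytes // BUS_BYTES_PER_CYCLE

def lisa_copy_cycles(src_subarray, dst_subarray, hop_cycles=8):
    # LISA path: the whole row buffer moves one subarray per hop over the
    # inter-subarray links, so cost scales with distance, not with size.
    return abs(dst_subarray - src_subarray) * hop_cycles

print(offchip_copy_cycles())     # 2048 cycles for the off-chip round trip
print(lisa_copy_cycles(0, 3))    # 24 cycles for a three-hop in-DRAM copy
```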

217 citations


Journal ArticleDOI
18 Jun 2016
TL;DR: Proposes eXposed On-Die Error Detection (XED), which exposes on-die error-detection information without requiring changes to memory standards or incurring bandwidth overheads; XED can also enable Chipkill systems to provide double-Chipkill-level reliability while avoiding the associated storage, performance, and power overheads.
Abstract: Large-granularity memory failures continue to be a critical impediment to system reliability. To make matters worse, as DRAM scales to smaller nodes, the frequency of unreliable bits in DRAM chips continues to increase. To mitigate such scaling-related failures, memory vendors are planning to equip existing DRAM chips with On-Die ECC. For maintaining compatibility with memory standards, On-Die ECC is kept invisible from the memory controller. This paper explores how to design high-reliability memory systems in the presence of On-Die ECC. We show that if On-Die ECC is not exposed to the memory system, a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefit compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM. To this end, we propose eXposed On-Die Error Detection (XED), which exposes the on-die error-detection information without requiring changes to the memory standards or consuming bandwidth. When the On-Die ECC detects an error, XED transmits a pre-defined "catch-word" instead of the corrected data value. On receiving the catch-word, the memory controller uses the parity stored in the ninth chip of the ECC-DIMM to correct the faulty chip (similar to RAID-3). Our studies show that XED provides Chipkill-level reliability (172x higher than SECDED) while incurring negligible overheads, with 21% lower execution time than Chipkill. We also show that XED can enable Chipkill systems to provide double-Chipkill-level reliability while avoiding the associated storage, performance, and power overheads.
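
The catch-word protocol admits a compact illustration. The sketch below shows the RAID-3-style correction step a controller could perform, assuming 8 data chips plus a ninth XOR-parity chip and a fixed catch-word pattern; these specifics, and the omitted handling of real data that happens to equal the catch-word, are simplifications for illustration, not the paper's exact design.

```python
# Sketch of RAID-3-style correction at the memory controller: if one chip
# returns the catch-word, rebuild its data by XORing the parity chip with
# the remaining healthy chips. Layout and pattern are assumptions.

CATCH_WORD = b"\xde\xad\xbe\xef\xde\xad\xbe\xef"  # hypothetical pattern

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def read_cache_line(data_chips, parity_chip):
    """data_chips: list of 8 equal-length byte strings (one per chip);
    parity_chip: byte-wise XOR of the 8 data chips, as in RAID-3."""
    faulty = [i for i, d in enumerate(data_chips) if d == CATCH_WORD]
    if not faulty:
        return b"".join(data_chips)
    if len(faulty) > 1:
        raise RuntimeError("multiple chips signalled errors; not correctable here")
    # Parity XOR all healthy chips yields the faulty chip's original data.
    recovered = parity_chip
    for i, d in enumerate(data_chips):
        if i != faulty[0]:
            recovered = xor_bytes(recovered, d)
    data_chips[faulty[0]] = recovered
    return b"".join(data_chips)
```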

60 citations


Proceedings ArticleDOI
15 Oct 2016
TL;DR: Proposes CANDY, a low-cost and scalable solution to two key challenges in architecting giga-scale Coherent DRAM Caches, including Sharing-Aware Bypass, which dynamically detects read-write shared data at run time and forces such data to bypass the DRAM cache.
Abstract: This paper investigates the use of DRAM caches for multi-node systems. Current systems architect the DRAM cache as a Memory-Side Cache (MSC), restricting the DRAM cache to caching only local data and relying on only the small on-die caches for remote data. As MSC keeps only local data, it is implicitly coherent and obviates the need for coherence support. Unfortunately, as accessing data in a remote node incurs a significant inter-node network latency, MSC suffers this latency overhead on every on-die cache miss to remote data. A desirable alternative is to allow the DRAM cache to cache both local and remote data. However, as data blocks can then be cached in multiple DRAM caches, this design requires coherence support for DRAM caches to ensure correctness, and is termed a Coherent DRAM Cache (CDC). We identify two key challenges in architecting a giga-scale CDC. First, the coherence directory can be as large as a few tens of megabytes. Second, cache misses to read-write shared data in a CDC incur longer delays due to the need to access the DRAM cache. To address both problems, this paper proposes CANDY, a low-cost and scalable solution consisting of two techniques, one for each challenge. First, CANDY places the coherence directory in 3D DRAM to avoid SRAM storage overhead, and re-purposes the existing on-die coherence directory as a DRAM-cache Coherence Buffer that caches recently accessed directory entries. Second, we propose Sharing-Aware Bypass, which dynamically detects read-write shared data at run time and forces such data to bypass the DRAM cache. Our experiments on a 4-node system with a 1GB DRAM cache per node show that CANDY outperforms MSC by 25% while incurring a negligible overhead of 8KB per node. CANDY is within 5% of an impractical system that has a 64MB SRAM directory per node and zero cache latency to access read-write shared data.
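
A minimal sketch of the Sharing-Aware Bypass idea follows, using an assumed last-writer heuristic: flag a block as read-write shared once a second node touches data that another node wrote, and route flagged blocks around the DRAM cache. The detection mechanism and data structures are illustrative, not the paper's hardware design.

```python
# Toy model of Sharing-Aware Bypass: detect read-write shared blocks at
# run time and make them bypass the DRAM cache. The last-writer heuristic
# and unbounded tables here are simplifying assumptions.

class SharingAwareBypass:
    def __init__(self):
        self.last_writer = {}   # block address -> node id of last writer
        self.rw_shared = set()  # blocks flagged as read-write shared

    def access(self, node, addr, is_write):
        writer = self.last_writer.get(addr)
        if writer is not None and writer != node:
            # A second node touched a block another node wrote: flag it.
            self.rw_shared.add(addr)
        if is_write:
            self.last_writer[addr] = node
        # Flagged blocks bypass the DRAM cache; everything else is cached.
        return "bypass" if addr in self.rw_shared else "cache"

d = SharingAwareBypass()
print(d.access(0, 0x1000, True))   # "cache": first touch, node 0 writes
print(d.access(1, 0x1000, False))  # "bypass": read-write sharing detected
```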

25 citations


Proceedings Article
22 Jun 2016
TL;DR: In the first such study of the virtual memory system, finds that changes to the memory manager are highly centralized around key functionalities, such as the memory allocator, the page fault handler, and the memory resource controller.
Abstract: We present a comprehensive and quantitative study of the development of the Linux memory manager. The study examines 4587 committed patches over five years (2009-2015), starting from Linux version 2.6.32. Insights derived from this study concern the development process of the virtual memory system, including its patch distribution and patterns, and techniques for memory optimizations and semantics. Specifically, we find that changes to the memory manager are highly centralized around key functionalities, such as the memory allocator, the page fault handler, and the memory resource controller. Unexpectedly, the well-developed memory manager still suffers from an increasing number of bugs. Memory optimizations mainly focus on data structures, memory policies, and the fast path. To the best of our knowledge, this is the first such study of the virtual memory system.
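
As a rough illustration of how such a patch study could be mechanized, the sketch below tallies files touched under mm/ by category using git history. The file-to-category mapping, and the use of touched files as a proxy for patches, are assumptions for illustration; the paper's actual methodology and taxonomy are more detailed.

```python
# Hypothetical sketch: bucket Linux mm/ changes by functionality by
# scanning git history. The category map is an illustrative assumption.

import subprocess
from collections import Counter

CATEGORIES = {
    "mm/slab": "memory allocator",
    "mm/slub": "memory allocator",
    "mm/page_alloc": "memory allocator",
    "mm/memory": "page fault handler",
    "mm/memcontrol": "memory resource controller",
}

def classify(path):
    for prefix, category in CATEGORIES.items():
        if path.startswith(prefix):
            return category
    return "other"

def tally(repo, since="v2.6.32"):
    # List the files touched by each commit under mm/ since the given tag;
    # counting touched files serves as a crude proxy for patch distribution.
    out = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:",
         f"{since}..HEAD", "--", "mm/"],
        capture_output=True, text=True, check=True).stdout
    return Counter(classify(p) for p in out.splitlines() if p)

# Example usage: print(tally("/path/to/linux"))
```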

21 citations


Journal ArticleDOI
TL;DR: Proposes Citadel, a robust memory architecture that retains each cache line within one bank, enabling a high-performance and low-power memory system while efficiently protecting stacked memory from large-granularity failures.
Abstract: Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to newer failure modes (e.g., due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data-striping reduces memory-level parallelism, causing significant slowdown and higher power consumption. This article proposes Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank. By retaining cache lines within banks, Citadel enables a high-performance and low-power memory system and also efficiently protects the stacked memory system from large-granularity failures. Citadel consists of three components: TSV-Swap, which can tolerate both faulty data TSVs and faulty address TSVs; Tri-Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures; and Dynamic Dual-Granularity Sparing (DDS), which can mitigate permanent faults by dynamically sparing faulty memory regions at either a row or a bank granularity. Our evaluations with real-world data for DRAM failures show that Citadel provides performance and power similar to maintaining the entire cache line in the same bank, yet provides 700x higher reliability than ChipKill-like ECC codes.
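
The intuition behind Tri-Dimensional Parity can be shown with a small sketch: keep XOR parity along each axis of a 3D arrangement of memory blocks, and rebuild a failed slice (here, a whole bank) from the surviving slices and one parity plane. The geometry and reconstruction policy below are simplifying assumptions, not Citadel's exact organization.

```python
# Minimal sketch of axis-wise XOR parity over a 3D arrangement of memory
# blocks (bank x row x column here, as an assumed geometry), with recovery
# of one failed bank from the axis-0 parity plane.

import numpy as np

def make_parities(mem):
    """mem: 3D uint8 array; returns one XOR-parity plane per axis."""
    return [np.bitwise_xor.reduce(mem, axis=ax) for ax in range(3)]

def rebuild_bank(mem, parities, bad_bank):
    """Reconstruct a failed bank (an axis-0 slice) by XORing the axis-0
    parity plane with all surviving banks."""
    rebuilt = parities[0].copy()
    for b in range(mem.shape[0]):
        if b != bad_bank:
            rebuilt ^= mem[b]
    return rebuilt

mem = np.random.randint(0, 256, size=(4, 8, 16), dtype=np.uint8)
parities = make_parities(mem)
assert (rebuild_bank(mem, parities, 2) == mem[2]).all()
```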

14 citations


Proceedings ArticleDOI
11 Sep 2016
TL;DR: Develops novel energy-aware persistence (EAP) principles that identify data durability (logging) as the dominant factor in energy increase, and proposes a relaxed durability mechanism, ACI-RD, that relaxes data logging without affecting the correctness of an application.
Abstract: Next-generation byte-addressable nonvolatile memories (NVMs) such as PCM, Memristor, and 3D XPoint are attractive solutions for mobile and other end-user devices, as they offer memory scalability as well as fast persistent storage. However, NVM's limitations of slow writes and high write energy are magnified for applications that require atomic, consistent, isolated, and durable (ACID) persistence. To maintain ACID persistence guarantees, applications not only need to perform extra writes to NVM but also need to execute a significant number of additional CPU instructions to perform NVM writes in a transactional manner. Our analysis shows that maintaining persistence with ACID guarantees increases CPU energy by up to 7.3x and NVM energy by up to 5.1x compared to a baseline with no ACID guarantees. For computing platforms such as mobile devices, where energy consumption is a critical factor, it is important that the energy cost of persistence be reduced. To address the energy overheads of persistence with ACID guarantees, we develop novel energy-aware persistence (EAP) principles that identify data durability (logging) as the dominant factor in the energy increase. Next, for low-energy states, we formulate energy-efficient durability techniques that include a mechanism to switch between performance-oriented and energy-efficient logging modes, support for NVM group commit, and a memory management method that reduces energy by trading capacity via less frequent garbage collection. For critical energy states, we propose a relaxed durability mechanism, ACI-RD, that relaxes data logging without affecting the correctness of an application. Finally, we evaluate EAP's principles with real applications and benchmarks. Our experimental results demonstrate up to a 2x reduction in CPU energy and a 2.4x reduction in NVM energy usage compared to traditional ACID persistence.
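
One of the durability techniques, NVM group commit, is easy to sketch: batch transaction log records so the durability cost (a persist barrier) is paid once per group rather than once per transaction. The batching threshold and the stand-in flush below are assumptions for illustration, not the paper's implementation.

```python
# Illustrative group-commit log: records accumulate in a pending batch and
# are persisted together, amortizing the durability cost across the group.

class GroupCommitLog:
    def __init__(self, group_size=8):
        self.group_size = group_size   # transactions per durable flush
        self.pending = []

    def append(self, tx_record):
        self.pending.append(tx_record)
        if len(self.pending) >= self.group_size:
            self.flush()

    def flush(self):
        # One write plus one persist barrier for the whole batch; on real
        # NVM this would be cache-line writeback and fence instructions
        # rather than a print.
        print(f"persisting {len(self.pending)} records in one flush")
        self.pending.clear()

log = GroupCommitLog(group_size=3)
for i in range(7):
    log.append({"tx": i})
log.flush()  # drain the tail; durability is deferred until the group commits
```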

10 citations