Bio: Samira Khan is an academic researcher from the University of Virginia. The author has contributed to research in topics: DRAM & Cache. The author has an h-index of 26 and has co-authored 48 publications receiving 2730 citations. Previous affiliations of Samira Khan include Carnegie Mellon University & the University of Texas at San Antonio.
09 Mar 2015
TL;DR: This paper proposes Adaptive-Latency DRAM (AL-DRAM), a mechanism that adaptively reduces the timing parameters for DRAM modules based on the current operating condition, and shows that dynamically optimizing the DRAM timing parameters can reliably improve system performance.
Abstract: In current systems, memory accesses to a DRAM chip must obey a set of minimum latency restrictions specified in the DRAM standard. Such timing parameters exist to guarantee reliable operation. When deciding the timing parameters, DRAM manufacturers incorporate a very large margin as a provision against two worst-case scenarios. First, due to process variation, some outlier chips are much slower than others and cannot be operated as fast. Second, chips become slower at higher temperatures, and all chips need to operate reliably at the highest supported (i.e., worst-case) DRAM temperature (85°C). In this paper, we show that typical DRAM chips operating at typical temperatures (e.g., 55°C) are capable of providing a much smaller access latency, but are nevertheless forced to operate at the largest latency of the worst-case. Our goal in this paper is to exploit the extra margin that is built into the DRAM timing parameters to improve performance. Using an FPGA-based testing platform, we first characterize the extra margin for 115 DRAM modules from three major manufacturers. Our results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C without sacrificing correctness. Based on this characterization, we propose Adaptive-Latency DRAM (AL-DRAM), a mechanism that adaptively reduces the timing parameters for DRAM modules based on the current operating condition. AL-DRAM does not require any changes to the DRAM chip or its interface. We evaluate AL-DRAM on a real system that allows us to reconfigure the timing parameters at runtime. We show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. We discuss and show why AL-DRAM does not compromise reliability. We conclude that dynamically optimizing the DRAM timing parameters can reliably improve system performance.
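The selection step described in this abstract can be pictured as a table lookup keyed on temperature. The following is only a minimal sketch under stated assumptions: the abstract does not name the four timing parameters, so `tRCD`, `tRAS`, `tWR` and `tRP` are used here as familiar DRAM timing names, every nanosecond value is a placeholder rather than a measured result, and `TimingSet`, `PROFILE` and `select_timings` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingSet:
    """Four DRAM timing parameters in nanoseconds (placeholder values only)."""
    tRCD: float
    tRAS: float
    tWR: float
    tRP: float


# Worst-case timings used when no profiled entry applies (illustrative numbers).
STANDARD = TimingSet(tRCD=13.75, tRAS=35.0, tWR=15.0, tRP=13.75)

# Hypothetical per-module profile: (temperature ceiling in C, timings safe up to it).
PROFILE = [
    (55.0, TimingSet(tRCD=10.0, tRAS=23.75, tWR=10.0, tRP=11.25)),
    (70.0, TimingSet(tRCD=11.25, tRAS=28.75, tWR=12.5, tRP=12.5)),
]


def select_timings(temp_c, profile=PROFILE, fallback=STANDARD):
    """Return the timings of the lowest temperature ceiling that covers temp_c."""
    for limit, timings in sorted(profile, key=lambda entry: entry[0]):
        if temp_c <= limit:
            return timings
    # Hotter than every profiled ceiling: fall back to the worst-case timings.
    return fallback
```

Calling `select_timings(50.0)` picks the 55°C entry, while `select_timings(85.0)` falls back to `STANDARD`, which mirrors the idea that reduced timings are used only where the profile says they are safe.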
22 Jun 2015
TL;DR: AVATAR is proposed, a VRT-aware multirate refresh scheme that adaptively changes the refresh rate for different rows at runtime based on current VRT failures, and provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.
Abstract: Multirate refresh techniques exploit the non-uniformity in retention times of DRAM cells to reduce the DRAM refresh overheads. Such techniques rely on accurate profiling of retention times of cells, and perform faster refresh only for a few rows which have cells with low retention times. Unfortunately, retention times of some cells can change at runtime due to Variable Retention Time (VRT), which makes it impractical to reliably deploy multirate refresh. Based on experimental data from 24 DRAM chips, we develop architecture-level models for analyzing the impact of VRT. We show that simply relying on ECC DIMMs to correct VRT failures is unusable as it causes a data error once every few months. We propose AVATAR, a VRT-aware multirate refresh scheme that adaptively changes the refresh rate for different rows at runtime based on current VRT failures. AVATAR provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.
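A rough way to picture the adaptive step: rows start in a fast or slow refresh bin based on profiling, and a row that later shows a corrected error is moved to the fast bin. The sketch below is a simplified model, not the AVATAR design itself; the 64 ms and 256 ms periods, the `RefreshTable` class and its method names are all assumptions made for illustration.

```python
FAST_MS = 64    # assumed refresh period for rows with weak cells
SLOW_MS = 256   # assumed refresh period for all other rows


class RefreshTable:
    """Per-row refresh period that can be tightened at runtime."""

    def __init__(self, num_rows, weak_rows):
        # Initial profiling: only rows known to hold weak cells refresh fast.
        self.period_ms = [SLOW_MS] * num_rows
        for row in weak_rows:
            self.period_ms[row] = FAST_MS

    def on_corrected_error(self, row):
        """A corrected error hints at a retention-time change: upgrade the row."""
        self.period_ms[row] = FAST_MS

    def refreshes_per_second(self):
        """Total row refreshes issued per second under the current periods."""
        return sum(1000.0 / period for period in self.period_ms)
```

With four rows and one weak row, the table issues 27.34375 row refreshes per second instead of the 62.5 needed when every row uses the fast period; upgrading one more row raises it to 39.0625.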
04 Dec 2010
TL;DR: This paper introduces sampling dead block prediction, a technique that samples program counters (PCs) to determine when a cache block is likely to be dead, and shows how this technique can reduce the number of LLC misses over LRU and be used to significantly improve a cache with a default random replacement policy.
Abstract: Last-level caches (LLCs) are large structures with significant power requirements. They can be quite inefficient. On average, a cache block in a 2MB LRU-managed LLC is dead 86% of the time, i.e., it will not be referenced again before it is evicted. This paper introduces sampling dead block prediction, a technique that samples program counters (PCs) to determine when a cache block is likely to be dead. Rather than learning from accesses and evictions from every set in the cache, a sampling predictor keeps track of a small number of sets using partial tags. Sampling allows the predictor to use far less state than previous predictors to make predictions with superior accuracy. Dead block prediction can be used to drive a dead block replacement and bypass optimization. A sampling predictor can reduce the number of LLC misses over LRU by 11.7% for memory-intensive single-thread benchmarks and 23% for multi-core workloads. The reduction in misses yields a geometric mean speedup of 5.9% for single-thread benchmarks and a geometric mean normalized weighted speedup of 12.5% for multi-core workloads. Due to the reduced state and number of accesses, the sampling predictor consumes only 3.1% of the dynamic power and 1.2% of the leakage power of a baseline 2MB LLC, comparing favorably with more costly techniques. The sampling predictor can even be used to significantly improve a cache with a default random replacement policy.
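The idea of learning from a few sampled sets with partial tags can be sketched as follows. This is a deliberately simplified, single-table model and not the predictor evaluated in the paper: the sampling ratio, the 15-bit partial tag, the saturating-counter width, the threshold and all names are assumptions.

```python
class SamplingDeadBlockPredictor:
    """Toy PC-indexed dead block predictor trained only on sampled sets."""

    def __init__(self, sample_every=64, ways=2, table_size=256,
                 threshold=2, counter_max=3):
        self.sample_every = sample_every
        self.ways = ways
        self.threshold = threshold
        self.counter_max = counter_max
        self.counters = [0] * table_size   # saturating counters indexed by PC
        self.sampler = {}                  # set index -> [[partial_tag, last_pc], ...], MRU first

    def _index(self, pc):
        return pc % len(self.counters)

    def access(self, set_index, tag, pc):
        """Train on an access; sets outside the sample are ignored."""
        if set_index % self.sample_every != 0:
            return
        entries = self.sampler.setdefault(set_index, [])
        partial_tag = tag & 0x7FFF         # keep only 15 tag bits (assumed width)
        for pos, entry in enumerate(entries):
            if entry[0] == partial_tag:
                # Block was reused, so its previous last-touch PC was not "dead".
                i = self._index(entry[1])
                self.counters[i] = max(0, self.counters[i] - 1)
                entry[1] = pc
                entries.insert(0, entries.pop(pos))
                return
        if len(entries) == self.ways:
            # LRU eviction from the sampler: the last PC to touch it left it dead.
            victim = entries.pop()
            i = self._index(victim[1])
            self.counters[i] = min(self.counter_max, self.counters[i] + 1)
        entries.insert(0, [partial_tag, pc])

    def predict_dead(self, pc):
        """Predict whether a block last touched by this PC is dead."""
        return self.counters[self._index(pc)] >= self.threshold
```

A PC that streams through blocks without reuse quickly crosses the threshold, while a PC whose blocks are hit again keeps its counter low; a replacement or bypass policy could then consult `predict_dead` for every set, sampled or not.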
01 Oct 2016
TL;DR: This paper proposes the In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal. It addresses two key challenges: how to achieve high parallelism in the presence of serial accesses in pointer chasing, and how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit.
Abstract: Pointer chasing is a fundamental operation, used by many important data-intensive applications (e.g., databases, key-value stores, graph processing workloads) to traverse linked data structures. This operation is both memory bound and latency sensitive, as it (1) exhibits irregular access patterns that cause frequent cache and TLB misses, and (2) requires the data from every memory access to be sent back to the CPU to determine the next pointer to access. Our goal is to accelerate pointer chasing by performing it inside main memory, thereby avoiding inefficient and high-latency data transfers between main memory and the CPU. To this end, we propose the In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal.
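The serial dependence described above can be made concrete with a toy model in which every visited node costs one memory-to-CPU transfer. The sketch counts transfers only and does not model IMPICA's hardware, parallelism or address translation; `memory` is a plain dictionary standing in for main memory, and both function names are hypothetical.

```python
def cpu_side_lookup(memory, head, key):
    """Traverse a linked list from the CPU: each node must be fetched
    before the address of the next node is known."""
    transfers = 0
    addr = head
    while addr is not None:
        node_key, value, next_addr = memory[addr]   # one memory-to-CPU transfer
        transfers += 1
        if node_key == key:
            return value, transfers
        addr = next_addr
    return None, transfers


def in_memory_lookup(memory, head, key):
    """Same traversal, modelled as running next to the memory:
    only the final result crosses to the CPU."""
    value, _ = cpu_side_lookup(memory, head, key)
    return value, 1


# Three-node list stored at addresses 0, 8 and 16: (key, value, next address).
memory = {0: (5, "a", 8), 8: (7, "b", 16), 16: (9, "c", None)}
```

Looking up key 9 costs three transfers in the CPU-side model and one in the memory-side model, which is the gap that performing the traversal inside memory aims to close.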
14 Jun 2016
TL;DR: Flexible-LatencY DRAM (FLY-DRAM) is proposed, a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance. It exploits the spatial locality of slower cells within DRAM and accesses the faster DRAM regions with reduced latencies for the fundamental operations.
Abstract: Long DRAM latency is a critical performance bottleneck in current systems. DRAM access latency is defined by three fundamental operations that take place within the DRAM cell array: (i) activation of a memory row, which opens the row to perform accesses; (ii) precharge, which prepares the cell array for the next memory access; and (iii) restoration of the row, which restores the values of cells in the row that were destroyed due to activation. There is significant latency variation for each of these operations across the cells of a single DRAM chip due to irregularity in the manufacturing process. As a result, some cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation. The goal of this work is to (i) experimentally characterize and understand the latency variation across cells within a DRAM chip for these three fundamental DRAM operations, and (ii) develop new mechanisms that exploit our understanding of the latency variation to reliably improve performance. To this end, we comprehensively characterize 240 DRAM chips from three major vendors, and make several new observations about latency variation within DRAM. We find that (i) there is large latency variation across the cells for each of the three operations; (ii) variation characteristics exhibit significant spatial locality: slower cells are clustered in certain regions of a DRAM chip; and (iii) the three fundamental operations exhibit different reliability characteristics when the latency of each operation is reduced. Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance. The key idea of FLY-DRAM is to exploit the spatial locality of slower cells within DRAM, and access the faster DRAM regions with reduced latencies for the fundamental operations. 
Our evaluations show that FLY-DRAM improves the performance of a wide range of applications by 13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors' real DRAM chips, in a simulated 8-core system. We conclude that the experimental characterization and analysis of latency variation within modern DRAM, provided by this work, can lead to new techniques that improve DRAM and system performance.
TL;DR: This keynote argues that there is in fact an even more profound change that the authors are facing: the programmability aspect that is intimately associated with all IoT systems.
14 Jun 2014
TL;DR: This paper exposes the vulnerability of commodity DRAM chips to disturbance errors, and shows that repeatedly reading from the same address in DRAM, which activates the same row, can corrupt data in nearby rows.
Abstract: Memory isolation is a key property of a reliable and secure computing system--an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology scales down to smaller dimensions, it becomes more difficult to prevent DRAM cells from electrically interacting with each other. In this paper, we expose the vulnerability of commodity DRAM chips to disturbance errors. By reading from the same address in DRAM, we show that it is possible to corrupt data in nearby addresses. More specifically, activating the same row in DRAM corrupts data in nearby rows. We demonstrate this phenomenon on Intel and AMD systems using a malicious program that generates many DRAM accesses. We induce errors in most DRAM modules (110 out of 129) from three major DRAM manufacturers. From this we conclude that many deployed systems are likely to be at risk. We identify the root cause of disturbance errors as the repeated toggling of a DRAM row's wordline, which stresses inter-cell coupling effects that accelerate charge leakage from nearby rows. We provide an extensive characterization study of disturbance errors and their behavior using an FPGA-based testing platform. Among our key findings, we show that (i) it takes as few as 139K accesses to induce an error and (ii) up to one in every 1.7K cells is susceptible to errors. After examining various potential ways of addressing the problem, we propose a low-overhead solution to prevent the errors
18 Jun 2016
TL;DR: Cnvlutin (CNV) is a value-based approach to hardware acceleration that eliminates most ineffectual operations (multiplications in which one of the inputs is zero), improving performance and energy over a state-of-the-art accelerator with no accuracy loss.
Abstract: This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.
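The zero-skipping idea can be illustrated in software with a storage format that keeps only non-zero activations together with their positions. This is a sketch of the general principle, not the CNV hardware or its actual data format; `encode_nonzero` and `sparse_dot` are hypothetical names.

```python
def encode_nonzero(activations):
    """Keep only non-zero activations as (value, offset) pairs.
    The skip decision is made once, when the data is stored."""
    return [(value, offset) for offset, value in enumerate(activations) if value != 0]


def sparse_dot(encoded, weights):
    """Dot product over the encoded activations; returns (result, multiplications)."""
    total = 0
    multiplications = 0
    for value, offset in encoded:
        total += value * weights[offset]   # offset selects the matching weight
        multiplications += 1
    return total, multiplications


activations = [0, 3, 0, 0, 2, 0, 1, 0]
weights = [1, 2, 3, 4, 5, 6, 7, 8]
```

For these inputs the encoded form holds three pairs, so the dot product needs three multiplications instead of eight while producing the same result as the dense computation.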
14 Apr 2014
TL;DR: PMFS, a light-weight POSIX file system that exploits PM's byte-addressability to avoid overheads of block-oriented storage and enable direct PM access by applications (with memory-mapped I/O), is implemented.
Abstract: Emerging byte-addressable, non-volatile memory technologies offer performance within an order of magnitude of DRAM, prompting their inclusion in the processor memory subsystem. However, such load/store accessible Persistent Memory (PM) has implications on system design, both hardware and software. In this paper, we explore system software support to enable low-overhead PM access by new and legacy applications. To this end, we implement PMFS, a light-weight POSIX file system that exploits PM's byte-addressability to avoid overheads of block-oriented storage and enable direct PM access by applications (with memory-mapped I/O). PMFS exploits the processor's paging and memory ordering features for optimizations such as fine-grained logging (for consistency) and transparent large page support (for faster memory-mapped I/O). To provide strong consistency guarantees, PMFS requires only a simple hardware primitive that provides software enforceable guarantees of durability and ordering of stores to PM. Finally, PMFS uses the processor's existing features to protect PM from stray writes, thereby improving reliability. Using a hardware emulator, we evaluate PMFS's performance with several workloads over a range of PM performance characteristics. PMFS shows significant (up to an order of magnitude) gains over traditional file systems (such as ext4) on a RAMDISK-like PM block device, demonstrating the benefits of optimizing system software for PM.
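The access style that memory-mapped I/O gives an application, updating a few bytes in place rather than rewriting a block, can be shown with Python's standard `mmap` module. This sketch uses an ordinary temporary file, not persistent memory, and it does not use any PMFS interface; `make_zeroed_file` and `update_in_place` are hypothetical helpers.

```python
import mmap
import os
import tempfile


def make_zeroed_file(size):
    """Create a temporary file of `size` zero bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x00" * size)
    return path


def update_in_place(path, offset, data):
    """Map the file into memory and store `data` at a byte offset."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapped:
            mapped[offset:offset + len(data)] = data   # byte-granular store
            mapped.flush()                             # write changes back to the file
```

After `update_in_place(path, 100, b"PM")`, reading the file back shows the two bytes at offset 100 and the rest unchanged. On a conventional file system the `flush` still goes through block-oriented storage, which is the overhead the abstract says a PM-aware file system can avoid.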