Home
/
Authors
/
Samira Khan

Author

Samira Khan

Other affiliations: Carnegie Mellon University, University of Texas at San Antonio, Intel

Bio: Samira Khan is an academic researcher from University of Virginia. The author has contributed to research in topics: Dram & Cache. The author has an hindex of 26, co-authored 48 publications receiving 2730 citations. Previous affiliations of Samira Khan include Carnegie Mellon University & University of Texas at San Antonio.

Topics: Dram, Cache, Cache pollution, Memory rank, Cache coloring ...read more

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Adaptive-latency DRAM: Optimizing DRAM timing for the common-case

[...]

Donghyuk Lee¹, Yoongu Kim¹, Gennady Pekhimenko¹, Samira Khan¹, Vivek Seshadri¹, Kevin K. Chang¹, Onur Mutlu¹ - Show less +3 more•Institutions (1)

Carnegie Mellon University¹

09 Mar 2015

TL;DR: Adaptive-Latency DRAM (AL-DRAM), a mechanism that adoptively reduces the timing parameters for DRAM modules based on the current operating condition, is proposed and shown that dynamically optimizing the DRAM timing parameters can reliably improve system performance.

...read moreread less

Abstract: In current systems, memory accesses to a DRAM chip must obey a set of minimum latency restrictions specified in the DRAM standard. Such timing parameters exist to guarantee reliable operation. When deciding the timing parameters, DRAM manufacturers incorporate a very large margin as a provision against two worst-case scenarios. First, due to process variation, some outlier chips are much slower than others and cannot be operated as fast. Second, chips become slower at higher temperatures, and all chips need to operate reliably at the highest supported (i.e., worst-case) DRAM temperature (85° C). In this paper, we show that typical DRAM chips operating at typical temperatures (e.g., 55° C) are capable of providing a much smaller access latency, but are nevertheless forced to operate at the largest latency of the worst-case. Our goal in this paper is to exploit the extra margin that is built into the DRAM timing parameters to improve performance. Using an FPGA-based testing platform, we first characterize the extra margin for 115 DRAM modules from three major manufacturers. Our results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C without sacrificing correctness. Based on this characterization, we propose Adaptive-Latency DRAM (AL-DRAM), a mechanism that adoptively reduces the timing parameters for DRAM modules based on the current operating condition. AL-DRAM does not require any changes to the DRAM chip or its interface. We evaluate AL-DRAM on a real system that allows us to reconfigure the timing parameters at runtime. We show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. We discuss and show why AL-DRAM does not compromise reliability. We conclude that dynamically optimizing the DRAM timing parameters can reliably improve system performance.

...read moreread less

236 citations

Proceedings Article•DOI•

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems

[...]

Moinuddin K. Qureshi¹, Dae Hyun Kim¹, Samira Khan², Prashant J. Nair¹, Onur Mutlu² - Show less +1 more•Institutions (2)

Georgia Institute of Technology¹, Carnegie Mellon University²

22 Jun 2015

TL;DR: AVATAR is proposed, a VRT-aware multirate refresh scheme that adaptively changes the refresh rate for different rows at runtime based on current VRT failures, and provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.

...read moreread less

Abstract: Multirate refresh techniques exploit the non-uniformity in retention times of DRAM cells to reduce the DRAM refresh overheads. Such techniques rely on accurate profiling of retention times of cells, and perform faster refresh only for a few rows which have cells with low retention times. Unfortunately, retention times of some cells can change at runtime due to Variable Retention Time (VRT), which makes it impractical to reliably deploy multirate refresh. Based on experimental data from 24 DRAM chips, we develop architecture-level models for analyzing the impact of VRT. We show that simply relying on ECC DIMMs to correct VRT failures is unusable as it causes a data error once every few months. We propose AVATAR, a VRT-aware multirate refresh scheme that adaptively changes the refresh rate for different rows at runtime based on current VRT failures. AVATAR provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.

...read moreread less

222 citations

Proceedings Article•DOI•

Sampling Dead Block Prediction for Last-Level Caches

[...]

Samira Khan¹, Yingying Tian¹, Daniel A. Jimenez¹•Institutions (1)

University of Texas at San Antonio¹

04 Dec 2010

TL;DR: This paper introduces sampling dead block prediction, a technique that samples program counters (PCs) to determine when a cache block is likely to be dead, and shows how this technique can reduce the number of LLC misses over LRU and be used to significantly improve a cache with a default random replacement policy.

...read moreread less

Abstract: Last-level caches (LLCs) are large structures with significant power requirements. They can be quite inefficient. On average, a cache block in a 2MB LRU-managed LLC is dead 86% of the time, i.e., it will not be referenced again before it is evicted. This paper introduces sampling dead block prediction, a technique that samples program counters (PCs) to determine when a cache block is likely to be dead. Rather than learning from accesses and evictions from every set in the cache, a sampling predictor keeps track of a small number of sets using partial tags. Sampling allows the predictor to use far less state than previous predictors to make predictions with superior accuracy. Dead block prediction can be used to drive a dead block replacement and bypass optimization. A sampling predictor can reduce the number of LLC misses over LRU by 11.7% for memory-intensive single-thread benchmarks and 23% for multi-core workloads. The reduction in misses yields a geometric mean speedup of 5.9% for single-thread benchmarks and a geometric mean normalized weighted speedup of 12.5% for multi-core workloads. Due to the reduced state and number of accesses, the sampling predictor consumes only 3.1% of the of the dynamic power and 1.2% of the leakage power of a baseline 2MB LLC, comparing favorably with more costly techniques. The sampling predictor can even be used to significantly improve a cache with a default random replacement policy.

...read moreread less

207 citations

Proceedings Article•DOI•

Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation

[...]

Kevin Hsieh¹, Samira Khan², Nandita Vijaykumar¹, Kevin K. Chang¹, Amirali Boroumand¹, Saugata Ghose¹, Onur Mutlu³ - Show less +3 more•Institutions (3)

Carnegie Mellon University¹, University of Virginia², ETH Zurich³

01 Oct 2016

TL;DR: The In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal and addresses the key challenges of how to achieve high parallelism in the presence of serial accesses in pointer chasing, and how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit.

...read moreread less

Abstract: Pointer chasing is a fundamental operation, used by many important data-intensive applications (e.g., databases, key-value stores, graph processing workloads) to traverse linked data structures. This operation is both memory bound and latency sensitive, as it (1) exhibits irregular access patterns that cause frequent cache and TLB misses, and (2) requires the data from every memory access to be sent back to the CPU to determine the next pointer to access. Our goal is to accelerate pointer chasing by performing it inside main memory, thereby avoiding inefficient and high-latency data transfers between main memory and the CPU. To this end, we propose the In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal.

...read moreread less

205 citations

Proceedings Article•DOI•

Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization

[...]

Kevin K. Chang¹, Abhijith Kashyap¹, Hasan Hassan¹, Saugata Ghose¹, Kevin Hsieh¹, Donghyuk Lee¹, Tianshi Li¹, Gennady Pekhimenko¹, Samira Khan², Onur Mutlu¹ - Show less +6 more•Institutions (2)

Carnegie Mellon University¹, University of Virginia²

14 Jun 2016

TL;DR: Flexible-LatencY DRAM is proposed, a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance and exploit the spatial locality of slower cells within DRAM, and access the faster DRAM regions with reduced latencies for the fundamental operations.

...read moreread less

Abstract: Long DRAM latency is a critical performance bottleneck in current systems. DRAM access latency is defined by three fundamental operations that take place within the DRAM cell array: (i) activation of a memory row, which opens the row to perform accesses; (ii) precharge, which prepares the cell array for the next memory access; and (iii) restoration of the row, which restores the values of cells in the row that were destroyed due to activation. There is significant latency variation for each of these operations across the cells of a single DRAM chip due to irregularity in the manufacturing process. As a result, some cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation. The goal of this work is to (i) experimentally characterize and understand the latency variation across cells within a DRAM chip for these three fundamental DRAM operations, and (ii) develop new mechanisms that exploit our understanding of the latency variation to reliably improve performance. To this end, we comprehensively characterize 240 DRAM chips from three major vendors, and make several new observations about latency variation within DRAM. We find that (i) there is large latency variation across the cells for each of the three operations; (ii) variation characteristics exhibit significant spatial locality: slower cells are clustered in certain regions of a DRAM chip; and (iii) the three fundamental operations exhibit different reliability characteristics when the latency of each operation is reduced. Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance. The key idea of FLY-DRAM is to exploit the spatial locality of slower cells within DRAM, and access the faster DRAM regions with reduced latencies for the fundamental operations. Our evaluations show that FLY-DRAM improves the performance of a wide range of applications by 13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors' real DRAM chips, in a simulated 8-core system. We conclude that the experimental characterization and analysis of latency variation within modern DRAM, provided by this work, can lead to new techniques that improve DRAM and system performance.

...read moreread less

203 citations

1
2
3
4
…
5
6
7
8
9
10
11
12

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•

Service-oriented computing

[...]

Mike P. Papazoglou, Dimitrios Georgakopoulos

01 Jan 2003-Communications of The ACM

TL;DR: This keynote argues that there is in fact even more profound change that the authors are facing – the programmability aspect that is intimately associated with all IoT systems.

...read moreread less

1,171 citations

Journal Article•DOI•

Flipping bits in memory without accessing them: an experimental study of DRAM disturbance errors

[...]

Yoongu Kim¹, Ross Daly¹, Jeremie S. Kim¹, Chris Fallin¹, Ji-Hye Lee¹, Donghyuk Lee¹, Christopher B. Wilkerson², Konrad K. Lai², Onur Mutlu¹ - Show less +5 more•Institutions (2)

Carnegie Mellon University¹, Intel²

14 Jun 2014

TL;DR: This paper exposes the vulnerability of commodity DRAM chips to disturbance errors, and shows that it is possible to corrupt data in nearby addresses by reading from the same address in DRAM by activating the same row inDRAM.

...read moreread less

Abstract: Memory isolation is a key property of a reliable and secure computing system--an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology scales down to smaller dimensions, it becomes more difficult to prevent DRAM cells from electrically interacting with each other. In this paper, we expose the vulnerability of commodity DRAM chips to disturbance errors. By reading from the same address in DRAM, we show that it is possible to corrupt data in nearby addresses. More specifically, activating the same row in DRAM corrupts data in nearby rows. We demonstrate this phenomenon on Intel and AMD systems using a malicious program that generates many DRAM accesses. We induce errors in most DRAM modules (110 out of 129) from three major DRAM manufacturers. From this we conclude that many deployed systems are likely to be at risk. We identify the root cause of disturbance errors as the repeated toggling of a DRAM row's wordline, which stresses inter-cell coupling effects that accelerate charge leakage from nearby rows. We provide an extensive characterization study of disturbance errors and their behavior using an FPGA-based testing platform. Among our key findings, we show that (i) it takes as few as 139K accesses to induce an error and (ii) up to one in every 1.7K cells is susceptible to errors. After examining various potential ways of addressing the problem, we propose a low-overhead solution to prevent the errors

...read moreread less

999 citations

Design Of Analog Cmos Integrated Circuits

[...]

Melanie Hartmann

01 Jan 2016

TL;DR: The design of analog cmos integrated circuits is universally compatible with any devices to read and is available in the book collection an online access to it is set as public so you can download it instantly.

...read moreread less

Abstract: Thank you very much for downloading design of analog cmos integrated circuits. Maybe you have knowledge that, people have look hundreds times for their favorite novels like this design of analog cmos integrated circuits, but end up in malicious downloads. Rather than reading a good book with a cup of coffee in the afternoon, instead they cope with some malicious virus inside their laptop. design of analog cmos integrated circuits is available in our book collection an online access to it is set as public so you can download it instantly. Our digital library saves in multiple countries, allowing you to get the most less latency time to download any of our books like this one. Merely said, the design of analog cmos integrated circuits is universally compatible with any devices to read.

...read moreread less

912 citations

Journal Article•DOI•

Cnvlutin: ineffectual-neuron-free deep neural network computing

[...]

Jorge Albericio¹, Patrick Judd¹, Tayler Hetherington², Tor M. Aamodt², Natalie Enright Jerger¹, Andreas Moshovos¹ - Show less +2 more•Institutions (2)

University of Toronto¹, University of British Columbia²

18 Jun 2016

TL;DR: Cnvolutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss.

...read moreread less

Abstract: This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.

...read moreread less

687 citations

Proceedings Article•DOI•

System software for persistent memory

[...]

Subramanya R. Dulloor¹, Sanjeev Kumar², Anil S. Keshavamurthy², Lantz Philip R², Dheeraj Reddy², Rajesh M. Sankaran², Jeff Jackson² - Show less +3 more•Institutions (2)

Georgia Institute of Technology¹, Intel²

14 Apr 2014

TL;DR: PMFS, a light-weight POSIX file system that exploits PM's byte-addressability to avoid overheads of block-oriented storage and enable direct PM access by applications (with memory-mapped I/O), is implemented.

...read moreread less

Abstract: Emerging byte-addressable, non-volatile memory technologies offer performance within an order of magnitude of DRAM, prompting their inclusion in the processor memory subsystem. However, such load/store accessible Persistent Memory (PM) has implications on system design, both hardware and software. In this paper, we explore system software support to enable low-overhead PM access by new and legacy applications. To this end, we implement PMFS, a light-weight POSIX file system that exploits PM's byte-addressability to avoid overheads of block-oriented storage and enable direct PM access by applications (with memory-mapped I/O). PMFS exploits the processor's paging and memory ordering features for optimizations such as fine-grained logging (for consistency) and transparent large page support (for faster memory-mapped I/O). To provide strong consistency guarantees, PMFS requires only a simple hardware primitive that provides software enforceable guarantees of durability and ordering of stores to PM. Finally, PMFS uses the processor's existing features to protect PM from stray writes, thereby improving reliability.Using a hardware emulator, we evaluate PMFS's performance with several workloads over a range of PM performance characteristics. PMFS shows significant (up to an order of magnitude) gains over traditional file systems (such as ext4) on a RAMDISK-like PM block device, demonstrating the benefits of optimizing system software for PM.

...read moreread less

622 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse