Author

Moinuddin K. Qureshi

Other affiliations: IBM, University of Texas at Austin, Intel
Bio: Moinuddin K. Qureshi is an academic researcher at the Georgia Institute of Technology. His research topics include Cache and Cache pollution. He has an h-index of 44 and has co-authored 131 publications receiving 9,956 citations. Previous affiliations of Moinuddin K. Qureshi include IBM and the University of Texas at Austin.


Papers
Proceedings ArticleDOI
20 Oct 2018
TL;DR: This paper provides the key insight that randomized mapping can be accomplished efficiently by accessing the cache with an encrypted address, as encryption would cause the lines that map to the same set of a conventional cache to get scattered to different sets.
Abstract: Modern processors share the last-level cache between all the cores to efficiently utilize the cache space. Unfortunately, such sharing makes the cache vulnerable to attacks whereby an adversary can infer the access pattern of a co-running application by carefully orchestrating evictions using cache conflicts. Conflict-based attacks can be mitigated by randomizing the location of the lines in the cache. Unfortunately, prior proposals for randomized mapping require storage-intensive tables and are effective only if the OS can classify the applications into protected and unprotected groups. The goal of this paper is to mitigate conflict-based attacks while incurring negligible storage and performance overheads, and without relying on OS support. This paper provides the key insight that randomized mapping can be accomplished efficiently by accessing the cache with an encrypted address, as encryption would cause the lines that map to the same set of a conventional cache to get scattered to different sets. This paper proposes CEASE, a design that uses Low-Latency Block-Cipher (LLBC) to translate the physical line-address into an encrypted line-address, and accesses the cache with this encrypted line-address. We analyze efficient designs for LLBC that can perform encryption and decryption within two cycles. We also propose CEASER, a design that periodically changes the encryption key and performs dynamic-remapping to improve robustness. CEASER provides strong security (tolerates 100+ years of attack), has low performance overhead (1% slowdown), requires a storage overhead of less than 24 bytes for the newly added structures, and does not need any OS support.
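
As a rough illustration of the key insight above, the toy sketch below (in Python, with an assumed cache geometry and a generic keyed hash standing in for CEASER's invertible low-latency block cipher) shows how keying the line-address mapping scatters addresses that would all collide in one set of a conventional cache:

import hashlib

NUM_SETS = 1024                  # assumed cache geometry

def randomized_set(line_addr: int, key: bytes) -> int:
    # Stand-in for accessing the cache with an encrypted line address.
    # CEASER uses an invertible low-latency block cipher; a one-way keyed
    # hash is enough to show the scattering effect.
    digest = hashlib.blake2b(line_addr.to_bytes(8, "little"),
                             key=key, digest_size=4).digest()
    return int.from_bytes(digest, "little") % NUM_SETS

key = b"epoch-0"
conflicting = [7 + i * NUM_SETS for i in range(8)]    # all map to set 7 normally
print([a % NUM_SETS for a in conflicting])            # conventional: [7, 7, 7, ...]
print([randomized_set(a, key) for a in conflicting])  # encrypted: scattered sets

Periodically changing the key and remapping resident lines, as CEASER does, re-scatters this mapping before an attacker can learn it.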

195 citations

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper proposes ArchShield, an architectural framework that employs runtime testing to identify faulty DRAM cells; ArchShield efficiently tolerates error rates as high as 10^-4 (100x higher than ECC alone), causes less than 2% performance degradation, and still maintains 1-bit error tolerance against soft errors.
Abstract: DRAM scaling has been the prime driver for increasing the capacity of the main memory system over the past three decades. Unfortunately, scaling DRAM to smaller technology nodes has become challenging due to the inherent difficulty in designing smaller geometries, coupled with the problems of device variation and leakage. Future DRAM devices are likely to experience significantly high error rates. Techniques that can tolerate errors efficiently can enable DRAM to scale to smaller technology nodes. However, existing techniques such as row/column sparing and ECC become prohibitive at high error rates. To develop cost-effective solutions for tolerating high error rates, this paper advocates a cross-layer approach. Rather than hiding the faulty cell information within the DRAM chips, we expose it to the architectural level. We propose ArchShield, an architectural framework that employs runtime testing to identify faulty DRAM cells. ArchShield tolerates these faults using two components, a Fault Map that keeps information about faulty words in a cache line, and Selective Word-Level Replication (SWLR) that replicates faulty words for error resilience. Both the Fault Map and SWLR are integrated in a reserved area of DRAM memory. Our evaluations with an 8GB DRAM DIMM show that ArchShield can efficiently tolerate error rates as high as 10^-4 (100x higher than ECC alone) while causing less than 2% performance degradation and still maintaining 1-bit error tolerance against soft errors.
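
The sketch below is a minimal software analogue of the two components named in the abstract, with invented names and structures; it only shows how a fault map and word-level replication interact on reads and writes, not the DRAM-resident layout ArchShield actually uses:

fault_map = {}    # line_addr -> bitmask of faulty word positions in that line
replicas = {}     # (line_addr, word_idx) -> replicated (good) copy of the word
memory = {}       # (line_addr, word_idx) -> word stored in (possibly faulty) DRAM

def mark_faulty(line_addr, word_idx):
    # populated by runtime testing in ArchShield
    fault_map[line_addr] = fault_map.get(line_addr, 0) | (1 << word_idx)

def write_word(line_addr, word_idx, value):
    memory[(line_addr, word_idx)] = value
    if fault_map.get(line_addr, 0) & (1 << word_idx):
        replicas[(line_addr, word_idx)] = value   # keep a reliable copy elsewhere

def read_word(line_addr, word_idx):
    if fault_map.get(line_addr, 0) & (1 << word_idx):
        return replicas[(line_addr, word_idx)]    # serve faulty words from the replica
    return memory[(line_addr, word_idx)]

mark_faulty(0x40, 2)
write_word(0x40, 2, 0xBEEF)
print(hex(read_word(0x40, 2)))     # 0xbeef, read via the replicated copy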

178 citations

Journal ArticleDOI
09 Jun 2012
TL;DR: This paper proposes PreSET, an architectural technique that leverages the fundamental property of PCM devices that writes are slow only in one direction and almost as fast as reads in the other; PreSET reduces effective read latency from 982 cycles to 594 cycles, increases system performance by 34%, and improves the energy-delay product by 25%.
Abstract: Phase Change Memory (PCM) is a promising technology for building future main memory systems. A prominent characteristic of PCM is that it has write latency much higher than read latency. Servicing such slow writes causes significant contention for read requests. For our baseline PCM system, the slow writes increase the effective read latency by almost 2X, causing significant performance degradation. This paper alleviates the problem of slow writes by exploiting the fundamental property of PCM devices that writes are slow only in one direction (SET operation) and are almost as fast as reads in the other direction (RESET operation). Therefore, a write operation to a line in which all memory cells have been SET prior to the write will incur much lower latency. We propose PreSET, an architectural technique that leverages this property to proactively SET all the bits in a given memory line well in advance of the anticipated write to that memory line. Our proposed design initiates a PreSET request for a memory line as soon as that line becomes dirty in the cache, thereby allowing a large window of time for the PreSET operation to complete. Our evaluations show that PreSET is more effective and incurs lower storage overhead than previously proposed write cancellation techniques. We also describe static and dynamic throttling schemes to limit the rate of PreSET operations. Our proposal reduces effective read latency from 982 cycles to 594 cycles and increases system performance by 34%, while improving the energy-delay product by 25%.
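
A minimal sketch of the PreSET timeline follows, with assumed latencies and bookkeeping structures (not the paper's parameters), showing why a writeback to an already-SET line leaves only the fast pulses on the critical path:

SLOW_SET_CYCLES = 150      # assumed latency of a SET-dominated write
FAST_RESET_CYCLES = 50     # assumed latency when all cells are already SET

preset_pending = set()     # lines with a background PreSET in flight
preset_done = set()        # lines whose cells are all SET

def on_line_becomes_dirty(line_addr):
    preset_pending.add(line_addr)          # large window before the eventual writeback

def on_preset_complete(line_addr):
    preset_pending.discard(line_addr)
    preset_done.add(line_addr)

def writeback_cycles(line_addr):
    if line_addr in preset_done:
        preset_done.discard(line_addr)
        return FAST_RESET_CYCLES           # only fast RESET pulses remain
    return SLOW_SET_CYCLES                 # PreSET did not finish in time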

160 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance and incurs only a 2.5% increase in bandwidth utilization; analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
Abstract: A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs. The Tagless Coherence Directory (TL) is a scalable coherence solution that uses an implicit, conservative representation of sharing information. Conceptually, TL consists of a grid of small Bloom filters. The grid has one column per core and one row per cache set. TL uses 48% less area, 57% less leakage power, and 44% less dynamic energy than a conventional coherence directory for a 16-core CMP with 1MB private L2 caches. Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance, and incurs only a 2.5% increase in bandwidth utilization. Analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
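
The toy model below sketches the "grid of small Bloom filters" organization, one filter per (cache set, core). The filter size, hash functions, and geometry are assumptions for illustration; the lookup is conservative in the same sense as TL, i.e. it may report false sharers but never misses a real one:

import hashlib

NUM_CORES, NUM_SETS, FILTER_BITS = 16, 1024, 64   # assumed geometry

# grid[set][core] is a small Bloom filter stored as an int bit-vector
grid = [[0] * NUM_CORES for _ in range(NUM_SETS)]

def _bits(tag):
    h = hashlib.sha256(str(tag).encode()).digest()
    return [h[i] % FILTER_BITS for i in range(3)]     # three hash functions

def record_fill(block_addr, core):
    s, tag = block_addr % NUM_SETS, block_addr // NUM_SETS
    for b in _bits(tag):
        grid[s][core] |= 1 << b                       # "core may cache this block"

def possible_sharers(block_addr):
    s, tag = block_addr % NUM_SETS, block_addr // NUM_SETS
    bits = _bits(tag)
    # conservative: may include false positives, never misses a true sharer
    return [c for c in range(NUM_CORES)
            if all((grid[s][c] >> b) & 1 for b in bits)]

record_fill(0x1234, core=3)
print(possible_sharers(0x1234))   # contains 3 (plus any false positives)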

155 citations

Journal ArticleDOI
01 Dec 2014
TL;DR: This paper shows that using NVRAM only for the logging subsystem (NV-Logging) provides much higher transactions per dollar than simply replacing all disk storage with NVRAM, and shows that the software overheads associated with centralized log buffers cause performance bottlenecks and limit scaling.
Abstract: Emerging byte-addressable, non-volatile memory technologies (NVRAM) like phase-change memory can increase the capacity of future memory systems by orders of magnitude. Compared to systems that rely on disk storage, NVRAM-based systems promise significant improvements in performance for key applications like online transaction processing (OLTP). Unfortunately, NVRAM systems suffer from two drawbacks: their asymmetric read-write performance and the notably higher cost of the new memory technologies compared to disk. This paper investigates the cost-effective use of NVRAM in transaction systems. It shows that using NVRAM only for the logging subsystem (NV-Logging) provides much higher transactions per dollar than simply replacing all disk storage with NVRAM. Specifically, for NV-Logging, we show that the software overheads associated with centralized log buffers cause performance bottlenecks and limit scaling. The per-transaction logging methods described in the paper help avoid these overheads, enabling concurrent logging for multiple transactions. Experimental results with a faithful emulation of future NVRAM-based servers using the TPCC, TATP, and TPCB benchmarks show that NV-Logging improves throughput by 1.42 - 2.72x over the costlier option of replacing all disk storage with NVRAM. Results also show that NV-Logging performs 1.21 - 6.71x better than when logs are placed into the PMFS NVRAM-optimized file system. Compared to state-of-the-art distributed logging, NV-Logging delivers a 20.4% throughput improvement.
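
A minimal sketch of per-transaction logging as described above, with an invented TxnLog class and a placeholder persist step standing in for NVRAM writes; the point is only that appends touch no shared buffer, so transactions do not serialize on a centralized log:

class TxnLog:
    """Per-transaction log buffer (stand-in for an NVRAM-resident buffer)."""
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.records = []

    def append(self, record: bytes):
        self.records.append(record)        # no shared lock on the append path

    def commit(self, persist_fn):
        for rec in self.records:
            persist_fn(self.txn_id, rec)   # e.g. write and flush to NVRAM
        self.records.clear()

# Two transactions log concurrently without touching a shared buffer.
log_a, log_b = TxnLog(1), TxnLog(2)
log_a.append(b"update row 10")
log_b.append(b"update row 42")
log_a.commit(lambda tid, rec: None)        # placeholder persist step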

151 citations


Cited by
Journal ArticleDOI
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

1,556 citations

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.
Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
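
As a back-of-the-envelope illustration (not the paper's detailed model, and with made-up power numbers), combining a fixed power budget with Amdahl's law already shows why speedup falls far short of core count:

def active_cores(power_budget_w, core_power_w, num_cores):
    # how many cores the budget can keep powered on simultaneously
    return min(num_cores, int(power_budget_w // core_power_w))

def amdahl_speedup(parallel_fraction, cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# e.g. a 64-core chip, 100 W budget, 3 W per core, 95% parallel code:
cores_on = active_cores(100, 3, 64)              # 33 of 64 cores can be lit
print(cores_on, amdahl_speedup(0.95, cores_on))  # ~12.7x, far below 64x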

1,379 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory; PRIME distinguishes itself from prior work on NN acceleration with significant performance improvement and energy saving.
Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360× and reduces energy consumption by ~895×, across the evaluated machine learning benchmarks.
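
The snippet below illustrates the crossbar matrix-vector multiplication that PRIME exploits, in an idealized form: conductances encode weights, input voltages encode the vector, and each column current sums the products in one step. DAC/ADC stages and device non-idealities are omitted, and the numbers are arbitrary:

def crossbar_mvm(G, V):
    # G: conductance matrix (rows = inputs, cols = outputs), V: voltage per row.
    # By Kirchhoff's current law, each column current is a dot product.
    cols = len(G[0])
    return [sum(G[i][j] * V[i] for i in range(len(G))) for j in range(cols)]

weights = [[0.1, 0.4], [0.2, 0.3], [0.5, 0.0]]   # 3 inputs, 2 outputs
inputs = [1.0, 0.5, 0.25]
print(crossbar_mvm(weights, inputs))             # [0.325, 0.55]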

1,197 citations

Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
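
The sketch below approximates the partitioning decision with a simple greedy allocator over per-application miss curves; the paper's UMON monitoring hardware and lookahead algorithm are not reproduced, and the curves here are invented:

def partition(miss_curves, total_ways):
    # miss_curves[a][w] = misses of app a when given w ways
    # (each curve has total_ways + 1 entries, for 0..total_ways ways)
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # give the next way to the app with the largest marginal miss reduction
        gains = [curve[alloc[a]] - curve[alloc[a] + 1]
                 for a, curve in enumerate(miss_curves)]
        best = max(range(len(gains)), key=lambda a: gains[a])
        alloc[best] += 1
    return alloc

# Example: two apps sharing 8 ways; app0 is cache-sensitive, app1 is not.
app0 = [100, 60, 40, 30, 25, 22, 20, 19, 18]
app1 = [50, 48, 46, 44, 42, 40, 38, 36, 34]
print(partition([app0, app1], 8))   # [6, 2]: most ways go to the app that benefits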

1,083 citations