Home
/
Authors
/
Moinuddin K. Qureshi

Author

Moinuddin K. Qureshi

Other affiliations: IBM, University of Texas at Austin, Intel ...read more

Bio: Moinuddin K. Qureshi is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics: Cache & Cache pollution. The author has an hindex of 44, co-authored 131 publications receiving 9956 citations. Previous affiliations of Moinuddin K. Qureshi include IBM & University of Texas at Austin.

Topics: Cache, Cache pollution, CPU cache, Smart Cache, Cache coloring ...read more

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003

Papers

PDF

Open Access

More filters

Posted Content•

EQUAL: Improving the Fidelity of Quantum Annealers by Injecting Controlled Perturbations

[...]

Ramin Ayanzadeh¹, Poulami Das, Swamit S. Tannu, Moinuddin K. Qureshi•Institutions (1)

Georgia Institute of Technology¹

24 Aug 2021-arXiv: Quantum Physics

TL;DR: In this article, an ensemble of quantum machine instructions (QMIs) is generated by adding controlled perturbations to the program QMI to steer the program away from encountering the same bias during all trials.

...read moreread less

Abstract: Quantum computing is an information processing paradigm that uses quantum-mechanical properties to speedup computationally hard problems. Although promising, existing gate-based quantum computers consist of only a few dozen qubits and are not large enough for most applications. On the other hand, existing QAs with few thousand of qubits have the potential to solve some domain-specific optimization problems. QAs are single instruction machines and to execute a program, the problem is cast to a Hamiltonian, embedded on the hardware, and a single quantum machine instruction (QMI) is run. Unfortunately, noise and imperfections in hardware result in sub-optimal solutions on QAs even if the QMI is run for thousands of trials. The limited programmability of QAs mean that the user executes the same QMI for all trials. This subjects all trials to a similar noise profile throughout the execution, resulting in a systematic bias. We observe that systematic bias leads to sub-optimal solutions and cannot be alleviated by executing more trials or using existing error-mitigation schemes. To address this challenge, we propose EQUAL (Ensemble Quantum Annealing). EQUAL generates an ensemble of QMIs by adding controlled perturbations to the program QMI. When executed on the QA, the ensemble of QMIs steers the program away from encountering the same bias during all trials and thus, improves the quality of solutions. Our evaluations using the 2041-qubit D-Wave QA show that EQUAL bridges the difference between the baseline and the ideal by an average of 14% (and up to 26%), without requiring any additional trials. EQUAL can be combined with existing error mitigation schemes to further bridge the difference between the baseline and ideal by an average of 55% (and up to 68%).

...read moreread less

1 citations

Dissertation•

Adaptive caching for high-performance memory systems

[...]

Moinuddin K. Qureshi

01 Jan 2007

1 citations

Line Distillation: A Mechanism to Improve Cache Utilization

[...]

Moinuddin K. Qureshi, David A. Thompson, Thomas R. Puzaky, Yale N. Patt

01 Jan 2006

TL;DR: Line distillation is proposed, a technique to increase cache utilization bytering the unused data from a subset of the lines and condensing the remaining useful data into smaller line-sizes, which reduces the storage requirement for the cache’s tag structure and provides a performance improvement proportional to the word-size.

...read moreread less

Abstract: Cache hierarchies play a very important role in bridging the speed gap between processors and memory. As thisgap increases, it becomes increasingly important to intelligently design and manage a cache system. The performanceof current caches is reduced because more than half of the data that is brought into the cache is never referenced,resulting in very low utilization. We propose line distillation, a technique to increase cache utilization by ﬁltering theunused data from a subset of the lines and condensing the remaining useful data into smaller line-sizes. We describethree ﬂavors of line distillation: naive-distillation, static-K-distillation, and adaptive-distillation. We also introducethe distill cache, a cache that supports line distillation and heterogeneous line-sizes. The line distillation techniquereduces cache miss-rate by 21% on average. 1 Introduction Caches exploit temporal and spatial locality that exists in memory reference streams. Temporal locality is exploitedby keeping a copy of the data associated with a memory reference so that subsequent references to the same addresscan be satisﬁed by the cache. Spatial locality is exploited by caching more data than is necessary for a single memoryreference in anticipation of future accesses to contiguous addresses. In this paper, we explore spatial locality as itaffects cache design decisions.There are three basic transactions that take place in a cache: access, ﬁll, and evict. Cache accesses consist ofloads and stores which take place between the cache and the processor. Access transactions take place at the word-sizegranularity as deﬁned by the ISA which the machine is implementing and for which the cache is being designed. Lineﬁlls and evictions occur between the cache and the next level of the memory hierarchy and refer to placing data into andremoving data from the cache, respectively. Fill and evict transactions take place at the line-size granularity as deﬁnedby the microarchitect designing the cache. The line-size must be at least as large as the word-size but is otherwiseindependent, and a typical line-size is 8-16 times the corresponding word-size in a machine. Using large line-sizesreduces the storage requirement for the cache’s tag structure and provides a performance improvement proportional to

...read moreread less

1 citations

Proceedings Article•

Probabilistic Integrity for Low-Overhead Secure Memories

[...]

Gururaj Saileshwar¹, Moinuddin K. Qureshi¹•Institutions (1)

Georgia Institute of Technology¹

20 Oct 2018

TL;DR: In navigating this performance-security tradeoff, SGX and SME end up at different ends of the spectrum as shown in Fig 1.

...read moreread less

Abstract: Main-memories are prone to attacks [1], [4], [12] that allow an adversary to take control of the system by reading and tampering memory-contents. Commercial solutions like Intel’s Software Guard Extensions (SGX) [3] and AMD’s Secure Memory Encryption (SME) [5] attempt to secure memory against attacks. However, providing security requires accessing metadata, resulting in storage and performance overheads. In navigating this performance-security tradeoff, SGX and SME end up at different ends of the spectrum as shown in Fig 1.

...read moreread less

Journal Article•DOI•

SecDDR: Enabling Low-Cost Secure Memories by Protecting the DDR Interface

[...]

Ali Fakhrzadehgan, Prakash Ramrakhyani, Moinuddin K. Qureshi, Mattan Erez

01 Sep 2022-arXiv.org

TL;DR: SecDDR is proposed, a low-cost RAP that targets direct-attached memories, like DDRx, and only adds a small amount of logic to memory components and does not change the underlying DDR protocol, making it practical for widespread adoption.

...read moreread less

Abstract: —The security goals of cloud providers and users include memory conﬁdentiality and integrity, which requires implementing Replay-Attack protection (RAP). RAP can be achieved using integrity trees or mutually authenticated channels. Integrity trees incur signiﬁcant performance overheads and are impractical for protecting large memories. Mutually authenticated channels have been proposed only for packetized memory interfaces that address only a very small niche domain and require fundamental changes to memory system architecture. We propose SecDDR , a low-cost RAP that targets direct-attached memories, like DDRx. SecDDR avoids memory-side data authen- tication, and thus, only adds a small amount of logic to memory components and does not change the underlying DDR protocol, making it practical for widespread adoption. In contrast to prior mutual authentication proposals, which require trusting the entire memory module, SecDDR targets untrusted modules by placing its limited security logic on the DRAM die (or package) of the ECC chip. Our evaluation shows that SecDDR performs within 1% of an encryption-only memory without RAP and that SecDDR provides 18.8% and 7.8% average performance improvements (up to 190.4% and 24.8%) relative to a 64-ary integrity tree and an authenticated channel, respectively.

...read moreread less

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
…
24
25
26
27
28
29

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Dark Silicon and the End of Multicore Scaling

[...]

Hadi Esmaeilzadeh¹, Emily Blem², R. St. Amant³, Karthikeyan Sankaralingam², Doug Burger⁴ - Show less +1 more•Institutions (4)

University of Washington¹, University of Wisconsin-Madison², University of Texas at Austin³, Microsoft⁴

01 May 2012-IEEE Micro

TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

...read moreread less

Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

...read moreread less

1,556 citations

DOI•

International Technology Roadmap for Semiconductors 2003の要求清浄度について－シリコンウエハ表面と雰囲気環境に要求される清浄度, 分析方法の現状について－

[...]

飯田裕幸, 竹田菊男, 藤本武利

20 Sep 2004

1,387 citations

Proceedings Article•DOI•

Dark silicon and the end of multicore scaling

[...]

Hadi Esmaeilzadeh¹, Emily Blem², Renee St. Amant³, Karthikeyan Sankaralingam², Doug Burger⁴ - Show less +1 more•Institutions (4)

University of Washington¹, University of Wisconsin-Madison², University of Texas at Austin³, Microsoft⁴

04 Jun 2011

TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.

...read moreread less

Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.

...read moreread less

1,379 citations

Journal Article•DOI•

PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory

[...]

Ping Chi¹, Shuangchen Li¹, Cong Xu², Tao Zhang³, Jishen Zhao¹, Yongpan Liu⁴, Yu Wang⁴, Yuan Xie¹ - Show less +4 more•Institutions (4)

University of California¹, Hewlett-Packard², Nvidia³, Tsinghua University⁴

18 Jun 2016

TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory, and distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving.

...read moreread less

Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has showed its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory. In PRIME, a portion of ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves the performance by ~2360× and the energy consumption by ~895×, across the evaluated machine learning benchmarks.

...read moreread less

1,197 citations

Proceedings Article•DOI•

Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

[...]

Moinuddin K. Qureshi, Yale N. Patt

09 Dec 2006

TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.

...read moreread less

Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.

...read moreread less

1,083 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse