Home
/
Authors
/
Moinuddin K. Qureshi

Author

Moinuddin K. Qureshi

Other affiliations: IBM, University of Texas at Austin, Intel ...read more

Bio: Moinuddin K. Qureshi is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics: Cache & Cache pollution. The author has an hindex of 44, co-authored 131 publications receiving 9956 citations. Previous affiliations of Moinuddin K. Qureshi include IBM & University of Texas at Austin.

Topics: Cache, Cache pollution, CPU cache, Smart Cache, Cache coloring ...read more

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Adaptive insertion policies for managing shared caches

[...]

Aamer Jaleel¹, William C. Hasenplaugh¹, Moinuddin K. Qureshi², Julien Sebot¹, Simon C. Steely¹, Joel Emer¹ - Show less +2 more•Institutions (2)

Intel¹, IBM²

25 Oct 2008

TL;DR: This paper proposes Thread-Aware Dynamic Insertion Policy (TADIP), a adaptive insertion policy that can take into account the memory requirements of each of the concurrently executing applications and provides performance benefits similar to doubling the size of an LRU-managed cache.

...read moreread less

Abstract: Chip Multiprocessors (CMPs) allow different applications to concurrently execute on a single chip. When applications with differing demands for memory compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance when the aggregate working set size is greater than the shared cache. In such cases, shared cache performance can be significantly improved by preserving the entire working set of applications that can co-exist in the cache and preserving some portion of the working set of the remaining applications. This paper investigates the use of adaptive insertion policies to manage shared caches. We show that directly extending the recently proposed dynamic insertion policy (DIP) is inadequate for shared caches since DIP is unaware of the characteristics of individual applications. We propose Thread-Aware Dynamic Insertion Policy (TADIP) that can take into account the memory requirements of each of the concurrently executing applications. Our evaluation with multi-programmed workloads for 2-core, 4-core, 8-core, and 16-core CMPs show that a TADIP-managed shared cache improves overall throughput by as much as 94%, 64%, 26%, and 16% respectively (on average 14%, 18%, 15%, and 17%) over the baseline LRU policy. The performance benefit of TADIP is 2.6x compared to DIP and 1.3x compared to the recently proposed Utility-based Cache Partitioning (UCP) scheme. We also show that a TADIP-managed shared cache provides performance benefits similar to doubling the size of an LRU-managed cache. Furthermore, TADIP requires a total storage overhead of less than two bytes per core, does not require changes to the existing cache structure, and performs similar to LRU for LRU friendly workloads.

...read moreread less

321 citations

Journal Article•DOI•

A Case for MLP-Aware Cache Replacement

[...]

Moinuddin K. Qureshi¹, Daniel N. Lynch¹, Onur Mutlu¹, Yale N. Patt¹•Institutions (1)

University of Texas at Austin¹

01 May 2006

TL;DR: Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23% and a novel, low-hardware overhead mechanism called sampling based adaptive replacement (SBAR) is proposed, to dynamically choose between an MLp-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory related stalls.

...read moreread less

Abstract: Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses - some misses occur in isolation while some occur in parallel with other misses. Isolated misses are more costly on performance than parallel misses. However, traditional cache replacement is not aware of the MLP-dependent cost differential between different misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement by using a runtime technique to compute the MLP-based cost for each cache miss. It then describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware overhead mechanism called Sampling Based Adaptive Replacement (SBAR), to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.

...read moreread less

316 citations

Proceedings Article•DOI•

Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing

[...]

Moinuddin K. Qureshi¹, Michele M. Franceschini¹, Luis A. Lastras-Montano¹•Institutions (1)

IBM¹

01 Apr 2010

TL;DR: This work proposes adaptive Write Cancellation policies, which can abort the processing of a scheduled write requests if a read request arrives to the same bank within a predetermined period, and Write Pausing, which exploits the iterative write algorithms used in PCM to pause at the end of each write iteration to service any pending reads.

...read moreread less

Abstract: Phase Change Memory (PCM) is emerging as a promising technology to build large-scale main memory systems in a cost-effective manner. A characteristic of PCM is that it has write latency much higher than read latency. A higher write latency can typically be tolerated using buffers. However, once a write request is scheduled for service to a bank, it can still cause increased latency for later arriving read requests to the same bank. We show that for the baseline PCM system with read-priority scheduling, the write requests increase the effective read latency to 2.3x (on average), causing significant performance degradation. To reduce the read latency of PCM devices under such scenarios, we propose adaptive Write Cancellation policies. Such policies can abort the processing of a scheduled write requests if a read request arrives to the same bank within a predetermined period. We also propose Write Pausing, which exploits the iterative write algorithms used in PCM to pause at the end of each write iteration to service any pending reads. For the baseline system, the proposed technique removes 75% of the latency increase incurred by read requests and improves overall system performance by 46% (on average), while requiring negligible hardware and simple extensions to PCM controller.

...read moreread less

313 citations

Proceedings Article•DOI•

Not All Qubits Are Created Equal: A Case for Variability-Aware Policies for NISQ-Era Quantum Computers

[...]

Swamit S. Tannu¹, Moinuddin K. Qureshi¹•Institutions (1)

Georgia Institute of Technology¹

04 Apr 2019

TL;DR: In this paper, the authors proposed Variation-Aware Qubit Movement (VQM) and Variation Aware Qubit Allocation (VAQA) policies that optimize the movement and allocation of qubits to avoid the weaker qubits and links, and guide more operations towards the stronger qu bits and links.

...read moreread less

Abstract: Existing and near-term quantum computers are not yet large enough to support fault-tolerance. Such systems with few tens to few hundreds of qubits are termed as Noisy Intermediate Scale Quantum computers (NISQ), and these systems can provide benefits for a class of quantum algorithms. In this paper, we study the problems of Qubit-Allocation (mapping of program qubits to machine qubits) and Qubit-Movement (routing qubits from one location to another for entanglement). We observe that there can be variation in the error rates of different qubits and links, which can impact the decisions for qubit movement and qubit allocation. We analyze publicly available characterization data for the IBM-Q20 to quantify the variation and show that there is indeed significant variability in the error rates of the qubits and the links connecting them. We show that the device variability has a significant impact on the overall system reliability. To exploit the variability in error rate, we propose Variation-Aware Qubit Movement (VQM) and Variation-Aware Qubit Allocation (VQA), policies that optimize the movement and allocation of qubits to avoid the weaker qubits and links, and guide more operations towards the stronger qubits and links. Our evaluations, with a simulation-based model of IBM-Q20, show that Variation-Aware policies can improve the system reliability by up to 1.7x. We also evaluate our policies on the IBM-Q5 machine and demonstrate that our proposal significantly improves the reliability of real systems (up to 1.9X).

...read moreread less

295 citations

Proceedings Article•DOI•

Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design

[...]

Moinuddin K. Qureshi¹, Gabe H. Loh²•Institutions (2)

Georgia Institute of Technology¹, Advanced Micro Devices²

01 Dec 2012

TL;DR: This paper proposes a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst, and proposes a simple and highly effective Memory Access Predictor.

...read moreread less

Abstract: This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, have organized DRAM caches similar to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster without the need to wait for a cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill, as well as an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides 8.7% performance improvement, the "idealized" SRAM Tag design provides 24%, and our simple latency-optimized design provides 35%.

...read moreread less

259 citations

1
2
3
4
5
…
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Dark Silicon and the End of Multicore Scaling

[...]

Hadi Esmaeilzadeh¹, Emily Blem², R. St. Amant³, Karthikeyan Sankaralingam², Doug Burger⁴ - Show less +1 more•Institutions (4)

University of Washington¹, University of Wisconsin-Madison², University of Texas at Austin³, Microsoft⁴

01 May 2012-IEEE Micro

TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

...read moreread less

Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

...read moreread less

1,556 citations

DOI•

International Technology Roadmap for Semiconductors 2003の要求清浄度について－シリコンウエハ表面と雰囲気環境に要求される清浄度, 分析方法の現状について－

[...]

飯田裕幸, 竹田菊男, 藤本武利

20 Sep 2004

1,387 citations

Proceedings Article•DOI•

Dark silicon and the end of multicore scaling

[...]

Hadi Esmaeilzadeh¹, Emily Blem², Renee St. Amant³, Karthikeyan Sankaralingam², Doug Burger⁴ - Show less +1 more•Institutions (4)

University of Washington¹, University of Wisconsin-Madison², University of Texas at Austin³, Microsoft⁴

04 Jun 2011

TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.

...read moreread less

Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.

...read moreread less

1,379 citations

Journal Article•DOI•

PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory

[...]

Ping Chi¹, Shuangchen Li¹, Cong Xu², Tao Zhang³, Jishen Zhao¹, Yongpan Liu⁴, Yu Wang⁴, Yuan Xie¹ - Show less +4 more•Institutions (4)

University of California¹, Hewlett-Packard², Nvidia³, Tsinghua University⁴

18 Jun 2016

TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory, and distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving.

...read moreread less

Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has showed its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory. In PRIME, a portion of ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves the performance by ~2360× and the energy consumption by ~895×, across the evaluated machine learning benchmarks.

...read moreread less

1,197 citations

Proceedings Article•DOI•

Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

[...]

Moinuddin K. Qureshi, Yale N. Patt

09 Dec 2006

TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.

...read moreread less

Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.

...read moreread less

1,083 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse