Author

Moinuddin K. Qureshi

Other affiliations: IBM, University of Texas at Austin, Intel
Bio: Moinuddin K. Qureshi is an academic researcher from Georgia Institute of Technology. The author has contributed to research on topics including Cache & Cache pollution. The author has an h-index of 44 and has co-authored 131 publications receiving 9,956 citations. Previous affiliations of Moinuddin K. Qureshi include IBM & University of Texas at Austin.


Papers
Posted Content
TL;DR: This paper proposes Variation-Aware Qubit Movement (VQM) and Variation-Aware Qubit Allocation (VQA), policies that optimize the movement and allocation of qubits to avoid the weaker qubits and links and guide more operations towards the stronger qubits and links.
Abstract: Recently, IBM, Google, and Intel showcased quantum computers ranging from 49 to 72 qubits. While these systems represent a significant milestone in the advancement of quantum computing, existing and near-term quantum computers are not yet large enough to fully support quantum error correction. Such systems, with a few tens to a few hundred qubits, are termed Noisy Intermediate-Scale Quantum (NISQ) computers, and they can provide benefits for a class of quantum algorithms. In this paper, we study the problems of Qubit Allocation (mapping of program qubits to machine qubits) and Qubit Movement (routing qubits from one location to another to perform entanglement). We observe that the error rates of different qubits and links vary, which can affect the decisions for qubit movement and qubit allocation. We analyze characterization data for the IBM-Q20 quantum computer gathered over 52 days to understand and quantify the variation in error rates, and find that there is indeed significant variability in the error rates of the qubits and the links connecting them. We define reliability metrics for NISQ computers and show that device variability has a substantial impact on overall system reliability. To exploit the variability in error rate, we propose Variation-Aware Qubit Movement (VQM) and Variation-Aware Qubit Allocation (VQA), policies that optimize the movement and allocation of qubits to avoid the weaker qubits and links and guide more operations towards the stronger qubits and links. We show that our variation-aware policies improve the reliability of the NISQ system by up to 2.5x.
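To make the routing side of this concrete, the minimal Python sketch below illustrates the variation-aware idea: pick the SWAP route whose links have the highest combined success probability rather than simply the fewest hops. The coupling graph and error rates are made up for illustration, not the paper's IBM-Q20 data, and this is not the exact VQM policy.

```python
# Sketch: route between two machine qubits maximizing the product of link
# success probabilities. Graph and error rates are illustrative only.
import heapq
import math

def best_route(edges, link_error, src, dst):
    """Return (path, success_prob) maximizing the product of (1 - error) over links."""
    adj = {}
    for a, b in edges:
        w = -math.log(1.0 - link_error[(a, b)])  # Dijkstra minimizes sums, so use -log
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, math.inf):
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[dst])

edges = [(0, 1), (1, 2), (0, 3), (3, 4), (4, 2)]
err = {(0, 1): 0.05, (1, 2): 0.05, (0, 3): 0.01, (3, 4): 0.01, (4, 2): 0.01}
print(best_route(edges, err, 0, 2))  # prefers the longer but more reliable 0-3-4-2 route
```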

48 citations

Proceedings ArticleDOI
02 Oct 2017
TL;DR: The minimum operational temperature is reported for 55 DIMMs, and a significant fraction of DRAM chips continue to work at temperatures as low as 80 K; this study is an initial step towards evaluating the effectiveness of cryogenic DRAM as a main memory for quantum computers.
Abstract: A quantum computer can solve fundamentally difficult problems by utilizing properties of quantum bits (qubits). It consists of a quantum substrate connected to a conventional computer, termed the control processor. The control processor can manipulate and measure the state of the qubits and acts as an interface between the qubits and the programmer. Unfortunately, qubits are extremely noise-sensitive, and to minimize noise, qubits are operated at cryogenic temperatures. To build a scalable quantum computer, a control processor that can work at cryogenic temperatures is essential [3, 14]. In this paper, we focus on the challenges of building a memory system for a cryogenic control processor. A scalable quantum computer will require large memory capacity for storing the program and the data generated by quantum error correction. To this end, we evaluate the feasibility of a cryogenic DRAM-based memory system by characterizing commercial DRAM modules at cryogenic temperatures. We report the minimum operational temperature for 55 DIMMs (consisting of a total of 750 DRAM chips) and analyze the error patterns in commodity DRAM devices operated at cryogenic temperatures. Our study shows that a significant fraction of DRAM chips continue to work at temperatures as low as 80 K. This study is an initial step towards evaluating the effectiveness of cryogenic DRAM as a main memory for quantum computers.
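Purely as an illustration of the kind of post-processing such a characterization involves, the hypothetical sketch below tabulates, per DIMM, the lowest temperature at which it still passed testing; the record format and values are invented, not the paper's data.

```python
# Illustrative only: summarize made-up pass/fail test records into a
# per-DIMM minimum operational temperature.
results = [
    # (dimm_id, temperature_K, errors_observed)
    ("D01", 300, 0), ("D01", 80, 0), ("D01", 77, 12),
    ("D02", 300, 0), ("D02", 80, 3),
]

def min_operational_temperature(records):
    """Lowest temperature at which each DIMM still read back with zero errors."""
    best = {}
    for dimm, temp_k, errors in records:
        if errors == 0:
            best[dimm] = min(best.get(dimm, float("inf")), temp_k)
    return best

print(min_operational_temperature(results))  # {'D01': 80, 'D02': 300}
```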

43 citations

Proceedings ArticleDOI
24 Jun 2017
TL;DR: This paper proposes DICE, a dynamic design that can adapt between spatial indexing and TSI depending on the compressibility of the data, along with low-cost Cache Index Predictors (CIP) that accurately predict the cache indexing scheme on an access, avoiding the need to probe both indices to retrieve a given cache line.
Abstract: This paper investigates compression for DRAM caches. As the capacity of a DRAM cache is typically large, prior techniques on cache compression, which solely focus on improving cache capacity, provide only a marginal benefit. We show that more performance benefit can be obtained if the compression of the DRAM cache is tailored to provide higher bandwidth. If a DRAM cache can provide two compressed lines in a single access, and both lines are useful, the effective bandwidth of the DRAM cache would double. Unfortunately, it is not straightforward to compress DRAM caches for bandwidth. The typically used Traditional Set Indexing (TSI) maps consecutive lines to consecutive sets, so the multiple compressed lines obtained from a set come from spatially distant locations and are unlikely to be used within a short period of each other. We can change the indexing of the cache to place consecutive lines in the same set to improve bandwidth; however, when the data is incompressible, such spatial indexing reduces effective capacity and causes significant slowdown. Ideally, we would like to have spatial indexing when the data is compressible and TSI otherwise. To this end, we propose Dynamic-Indexing Cache comprEssion (DICE), a dynamic design that can adapt between spatial indexing and TSI, depending on the compressibility of the data. We also propose low-cost Cache Index Predictors (CIP) that can accurately predict the cache indexing scheme on access in order to avoid probing both indices for retrieving a given cache line. Our studies with a 1GB DRAM cache, on a wide range of workloads (including SPEC and Graph), show that DICE improves performance by 19.0% and reduces energy-delay product by 36% on average. DICE is within 3% of a design that has double the capacity and double the bandwidth. DICE incurs a storage overhead of less than 1KB and does not rely on any OS support.
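The sketch below contrasts the two address-to-set mappings the abstract describes, using illustrative constants; it is not the DICE hardware design or its predictor, only the indexing idea.

```python
# Sketch of the two indexing schemes (cache geometry is illustrative).
NUM_SETS = 1 << 14          # e.g. 16K sets
LINE_SIZE = 64              # bytes per cache line

def traditional_set_index(addr):
    """TSI: consecutive lines map to consecutive sets."""
    return (addr // LINE_SIZE) % NUM_SETS

def spatial_set_index(addr, lines_per_set=2):
    """Spatial indexing: a pair of adjacent lines maps to the same set, so one
    access to a compressed set can return both neighbors."""
    return (addr // (LINE_SIZE * lines_per_set)) % NUM_SETS

base = 0x1000
print(traditional_set_index(base), traditional_set_index(base + LINE_SIZE))  # 64, 65
print(spatial_set_index(base), spatial_set_index(base + LINE_SIZE))          # 32, 32
```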

41 citations

Posted Content
TL;DR: This work considers surface code error correction, the most popular family of error-correcting codes for quantum computing, and designs a fully pipelined decoder micro-architecture for the Union-Find decoding algorithm that significantly speeds up decoding.
Abstract: Quantum computation promises significant computational advantages over classical computation for some problems. However, quantum hardware suffers from much higher error rates than classical hardware. As a result, extensive quantum error correction is required to execute a useful quantum algorithm. The decoder is a key component of the error correction scheme: its role is to identify errors faster than they accumulate in the quantum computer, and it must be implemented with minimal hardware resources in order to scale to the regime of practical applications. In this work, we consider surface code error correction, the most popular family of error-correcting codes for quantum computing, and we design a decoder micro-architecture for the Union-Find decoding algorithm. We propose a three-stage, fully pipelined hardware implementation of the decoder that significantly speeds up decoding. We then optimize the amount of decoding hardware required to perform error correction simultaneously over all the logical qubits of the quantum computer. By sharing resources between logical qubits, we obtain a 67% reduction in the number of hardware units, and the memory capacity is reduced by 70%. Moreover, we reduce the bandwidth required for the decoding process by a factor of at least 30x using low-overhead compression algorithms. Finally, we provide numerical evidence that our optimized micro-architecture can be executed fast enough to correct errors in a quantum computer.
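The decoder is built around the classic union-find data structure; the sketch below shows the generic software primitive (union by rank with path compression) that the hardware accelerates, not the three-stage pipelined micro-architecture itself.

```python
# Generic union-find primitive, as used to merge syndrome clusters in
# Union-Find decoding. This is a software analogy, not the hardware design.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path compression: point nodes closer to the cluster root as we walk up.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra             # merge the smaller cluster into the larger
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

# Merging clusters as defects grow toward each other:
uf = UnionFind(8)
uf.union(1, 2); uf.union(2, 5)
print(uf.find(1) == uf.find(5))  # True: defects 1 and 5 now belong to one cluster
```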

41 citations

Proceedings ArticleDOI
22 Jun 2015
TL;DR: This paper proposes Morphable ECC, which reduces refresh operations during idle mode by 16x and memory power in idle mode by 2x, while retaining performance within 2% of a system that does not use any ECC.
Abstract: Energy consumption is a primary consideration that determines the usability of emerging mobile computing devices such as smartphones. Refresh operations for main memory account for a significant fraction of the overall energy consumption, especially during idle periods: the processor can be switched off quickly, but memory contents must continue to be refreshed to avoid data loss. Given that mobile devices are idle most of the time, reducing refresh power in idle mode is critical to maximize the duration for which the device remains usable. The frequency of refresh operations in memory can be reduced significantly by using strong multi-bit error correction codes (ECC). Unfortunately, strong ECC incurs high latency, which causes significant performance degradation (as high as 21%, and on average 10%). To obtain both low refresh power in idle periods and high performance in active periods, this paper proposes Morphable ECC (MECC). During idle periods, MECC keeps the memory protected with 6-bit ECC (ECC-6) and employs a refresh period of 1 second, instead of the typical refresh period of 64 ms. During active operation, MECC reduces the refresh interval to 64 ms and converts memory from ECC-6 to a weaker ECC (single-bit error correction) on a demand basis, thus avoiding the high latency of ECC-6 except for the first access during the active mode. Our proposal reduces refresh operations during idle mode by 16x and memory power in idle mode by 2x, while retaining performance within 2% of a system that does not use any ECC.
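As a reading aid, the toy state machine below mirrors the policy as the abstract describes it: strong ECC-6 with a 1-second refresh while idle, and demand-based conversion to single-bit ECC with a 64 ms refresh when active. Class and field names are illustrative, not the paper's implementation.

```python
# Toy model of the MECC mode switch described in the abstract.
IDLE, ACTIVE = "idle", "active"

class MorphableECC:
    def __init__(self, num_lines):
        self.mode = IDLE
        self.refresh_period_ms = 1000            # 1 s refresh while idle
        self.line_ecc = ["ECC6"] * num_lines     # all lines strongly protected

    def enter_active(self):
        self.mode = ACTIVE
        self.refresh_period_ms = 64              # back to the standard 64 ms

    def enter_idle(self):
        self.mode = IDLE
        self.refresh_period_ms = 1000
        self.line_ecc = ["ECC6"] * len(self.line_ecc)

    def access(self, line):
        # Demand-based conversion: pay the ECC-6 latency only on the first
        # access after wake-up, then keep the line in fast single-bit ECC.
        if self.mode == ACTIVE and self.line_ecc[line] == "ECC6":
            self.line_ecc[line] = "ECC1"
            return "slow (decode ECC-6, rewrite as ECC-1)"
        return "fast"

mem = MorphableECC(num_lines=4)
mem.enter_active()
print(mem.access(0))  # slow (first touch after wake-up)
print(mem.access(0))  # fast
```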

38 citations


Cited by
Journal ArticleDOI
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

1,556 citations

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.
Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
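The paper's model is far more detailed (Pareto frontiers built from measurements of over 150 processors), but the back-of-the-envelope sketch below shows the basic mechanism it quantifies: a fixed power budget limits how many cores can be powered, and Amdahl's law caps the speedup from the cores that remain. All numbers here are illustrative, not the paper's projections.

```python
# Textbook Amdahl-style calculation of power-limited multicore speedup.
def amdahl_speedup(parallel_fraction, cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

def powered_cores(chip_power_budget_w, per_core_power_w, cores_on_chip):
    """Cores you can actually light up; the rest is 'dark silicon'."""
    return min(cores_on_chip, int(chip_power_budget_w // per_core_power_w))

cores_on_chip = 64
usable = powered_cores(chip_power_budget_w=100, per_core_power_w=2.5,
                       cores_on_chip=cores_on_chip)
print(usable, "of", cores_on_chip, "cores powered")          # 40 of 64 cores powered
print(round(amdahl_speedup(0.95, usable), 1), "x speedup")   # ~13.6x, far below 40x
```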

1,379 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory; PRIME distinguishes itself from prior work on NN acceleration with significant performance improvement and energy saving.
Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360× and reduces energy consumption by ~895× across the evaluated machine learning benchmarks.
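As a software analogy for why a crossbar suits NN layers (this is not PRIME's circuit design), the sketch below treats stored weights as cell conductances and an input vector as row voltages, so each column's accumulated current is one dot product.

```python
# Sketch of matrix-vector multiplication the way a ReRAM crossbar performs it:
# per-row input "voltages" scale per-cell "conductances", and column currents sum.
def crossbar_mvm(conductances, input_voltages):
    """conductances[row][col] ~ weight; returned current per column = dot product."""
    num_cols = len(conductances[0])
    currents = [0.0] * num_cols
    for row, v in enumerate(input_voltages):
        for col in range(num_cols):
            currents[col] += conductances[row][col] * v   # Ohm's law + Kirchhoff sum
    return currents

weights = [[0.2, 0.5],
           [0.8, 0.1],
           [0.4, 0.9]]
inputs = [1.0, 0.5, 0.25]
print(crossbar_mvm(weights, inputs))  # one pass computes both output neurons
```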

1,197 citations

Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance gain from additional cache resources. It is better for performance to invest cache resources in the application that benefits more from them rather than in the application that merely demands more of them. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective hardware circuit that requires less than 2 kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves the performance of a dual-core system by up to 23%, and by 11% on average, over LRU-based cache partitioning.
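The sketch below is a simplified greedy version of utility-based way allocation; the paper uses a lookahead algorithm driven by UMON hardware monitors, and the miss curves here are made up for illustration.

```python
# Greedy utility-based allocation of cache ways (simplified stand-in for UCP).
def partition_ways(miss_curves, total_ways):
    """miss_curves[app][w] = misses if the app gets w ways (w = 0..total_ways)."""
    alloc = {app: 0 for app in miss_curves}
    for _ in range(total_ways):
        # Give the next way to whichever application saves the most misses with it.
        best_app = max(alloc, key=lambda a: miss_curves[a][alloc[a]] - miss_curves[a][alloc[a] + 1])
        alloc[best_app] += 1
    return alloc

miss_curves = {
    "A": [100, 70, 45, 30, 20, 14, 10, 8, 7],     # high utility: each way cuts many misses
    "B": [ 90, 88, 86, 84, 82, 80, 78, 76, 74],   # streaming-like: little benefit per way
}
print(partition_ways(miss_curves, total_ways=8))  # {'A': 7, 'B': 1}: ways go where they cut misses
```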

1,083 citations