Author

Moinuddin K. Qureshi

Other affiliations: IBM, University of Texas at Austin, Intel
Bio: Moinuddin K. Qureshi is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics including Cache and Cache pollution. The author has an h-index of 44 and has co-authored 131 publications receiving 9,956 citations. Previous affiliations of Moinuddin K. Qureshi include IBM and University of Texas at Austin.


Papers
Posted Content
TL;DR: This paper shows that merely porting an existing CMOS-based accelerator to superconducting technology provides a 10.6X improvement in energy efficiency. It also investigates solutions to make the accelerator tolerant of new fault modes and shows how this fault-tolerant design can be leveraged to reduce the operating current.
Abstract: As the scaling of conventional CMOS-based technologies slows down, there is growing interest in alternative technologies that can improve performance or energy efficiency. Superconducting circuits based on Josephson Junctions (JJs) are an emerging technology that can provide devices that switch with picosecond latencies while consuming two orders of magnitude less switching energy than CMOS. While JJ-based circuits can provide high operating frequency and energy efficiency, this technology faces three critical challenges: limited device density and the lack of an area-efficient technology for memory structures, reduced gate fanout compared to CMOS, and new failure modes from Flux-Traps that occur due to the operating environment. The lack of a dense memory technology restricts the use of superconducting technology in the near term to application domains that have high compute intensity but require a negligible amount of memory. In this paper, we study the use of superconducting technology to build an accelerator for SHA-256 engines commonly used in Bitcoin mining applications. We show that merely porting an existing CMOS-based accelerator to superconducting technology provides a 10.6X improvement in energy efficiency. Redesigning the accelerator to suit the unique constraints of superconducting technology (such as low fanout) improves the energy efficiency to 12.2X. We also investigate solutions to make the accelerator tolerant of new fault modes and show how this fault-tolerant design can be leveraged to reduce the operating current, thereby increasing the overall energy efficiency to 46X compared to CMOS. Our paper also develops a workflow for evaluating area, performance, and power for accelerators built in superconducting technology, and this workflow can help other researchers explore designs using this technology.
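For context, the workload this accelerator targets is Bitcoin's proof-of-work, which applies SHA-256 twice to an 80-byte block header and compares the digest against a difficulty target. A minimal functional sketch in software (the paper builds a hardware engine; the header bytes and target below are illustrative, not from the paper):

```python
import hashlib
import struct

def double_sha256(data: bytes) -> bytes:
    """Bitcoin's proof-of-work hash: SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header_prefix: bytes, target: int, max_nonce: int = 2**20) -> int | None:
    """Scan nonces until the double-SHA-256 digest falls below the target."""
    for nonce in range(max_nonce):
        header = header_prefix + struct.pack("<I", nonce)
        digest = double_sha256(header)
        # Bitcoin interprets the digest as a little-endian integer.
        if int.from_bytes(digest, "little") < target:
            return nonce
    return None

# Illustrative 76-byte header prefix and an easy target.
print("found nonce:", mine(b"\x00" * 76, target=1 << 240))
```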

5 citations

Posted Content
TL;DR: This paper proposes Zombie-Based Mitigation (ZBM), a simple hardware-based design that guards against attacks by treating hits on Zombie lines as misses, avoiding any timing leaks to the attacker.
Abstract: OS-based page sharing is a commonly used optimization in modern systems to reduce memory footprint. Unfortunately, such sharing can enable Flush+Reload cache attacks, whereby a spy periodically flushes a cache line of shared data (using the clflush instruction) and reloads it to infer the access patterns of a victim application. Current proposals to mitigate Flush+Reload attacks are impractical, as they either disable page sharing, require application rewrites, require OS support, or incur ISA changes. Ideally, we want to tolerate attacks without requiring any OS or ISA support while incurring negligible performance and storage overheads. This paper makes the key observation that when a cache line is invalidated due to a Flush-Caused Invalidation (FCI), the tag and data of the invalidated line are still resident in the cache and can be used for detecting Flush-based attacks. We refer to lines invalidated due to FCI as Zombie lines. Our design explicitly marks such lines as Zombies, preserves the Zombie lines in the cache, and uses the hits and misses to Zombie lines to tolerate the attacks. We propose Zombie-Based Mitigation (ZBM), a simple hardware-based design that guards against attacks by treating hits on Zombie lines as misses, avoiding any timing leaks to the attacker. We analyze the robustness of ZBM using three spy programs: attacking AES T-Tables, attacking RSA Square-and-Multiply, and Function Watcher (FW), and show that ZBM successfully defends against these attacks. Our solution requires negligible storage (4 bits per cache line), retains OS-based page sharing, requires no OS/ISA changes, and does not incur slowdown for benign applications.
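A toy model of the mechanism described above (the class and field names are my own, not the paper's hardware design): on a clflush, the line's tag and data stay resident but the line is marked as a Zombie, and a later hit on a Zombie line is reported with miss-like behavior so the spy cannot tell whether the victim touched the line.

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    valid: bool = True
    zombie: bool = False  # set on Flush-Caused Invalidation (FCI)

class ZombieCache:
    """Toy direct-mapped cache sketching the ZBM idea (illustrative only)."""
    def __init__(self, num_sets: int):
        self.sets: list[Line | None] = [None] * num_sets
        self.num_sets = num_sets

    def clflush(self, addr: int) -> None:
        line = self.sets[addr % self.num_sets]
        if line is not None and line.tag == addr:
            # Keep tag/data resident, but mark the line as a Zombie.
            line.valid, line.zombie = False, True

    def access(self, addr: int) -> str:
        idx = addr % self.num_sets
        line = self.sets[idx]
        if line is not None and line.tag == addr:
            if line.zombie:
                # Hit on a Zombie line: serve it with miss-like timing so a
                # Flush+Reload spy observes no difference from a real miss.
                line.valid, line.zombie = True, False
                return "zombie-hit (timed as miss)"
            return "hit"
        self.sets[idx] = Line(tag=addr)
        return "miss"

c = ZombieCache(64)
c.access(5)         # miss: install the shared line
c.clflush(5)        # spy flushes it
print(c.access(5))  # victim access -> zombie-hit, timed as a miss
```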

5 citations

Proceedings ArticleDOI
01 Oct 2022
TL;DR: AQUA is proposed, a Rowhammer mitigation that breaks the spatial correlation between aggressor and victim rows by dynamically quarantining the aggressor row in a dedicated region of memory, incurring an order of magnitude less slowdown and SRAM overhead.
Abstract: Rowhammer allows an attacker to induce bit flips in a row by rapidly accessing neighboring rows. Rowhammer is a severe security threat, as it can be used to escalate privilege or break confidentiality. Moreover, the threshold of activations needed to induce Rowhammer continues to decrease, and new attacks like Half-Double break existing solutions that refresh victim rows. The recently proposed Randomized Row-Swap (RRS) scheme is resilient to Half-Double, as it provides mitigation by swapping an aggressor row with a random row. However, to ensure security, the threshold for triggering a row-swap must be set much lower than the Rowhammer threshold, leading to a significant performance loss of 20% on average at a Rowhammer threshold of 1K. Furthermore, the SRAM overhead for storing the indirection table of RRS becomes prohibitively large: 2.4MB per rank at a Rowhammer threshold of 1K. Our goal is to develop a scalable Rowhammer mitigation that incurs negligible performance and storage overheads. To this end, we propose AQUA, a Rowhammer mitigation that breaks the spatial correlation between aggressor and victim rows by dynamically quarantining the aggressor row in a dedicated region of memory. AQUA allows for an effective row migration threshold much higher than in RRS, leading to an order of magnitude less slowdown and SRAM overhead. As the security of AQUA does not rely on keeping the destination row a secret, we further reduce the SRAM overhead of the indirection table by storing it in DRAM and accessing it on demand. We derive the size of the quarantine region required to ensure security for AQUA and show that reserving about 1% of DRAM is sufficient to mitigate Rowhammer at a threshold of 1K. Our evaluations show that AQUA incurs an average slowdown of 2% and an SRAM overhead (for mapping and migration) of only 41KB per rank at a Rowhammer threshold of 1K.
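A simplified sketch of the quarantine idea (the threshold value and data structures below are illustrative, not the paper's derived parameters): per-row activation counts trigger migration of an aggressor row into a reserved quarantine region, and an indirection table redirects subsequent accesses, so the hammered row is no longer physically adjacent to its original victims.

```python
from collections import defaultdict

MIGRATION_THRESHOLD = 512  # illustrative; the paper derives safe values

class AquaSketch:
    """Toy model of quarantining aggressor rows (not the paper's hardware)."""
    def __init__(self, quarantine_rows: list[int]):
        self.free_quarantine = list(quarantine_rows)  # reserved ~1% of DRAM
        self.remap: dict[int, int] = {}               # indirection table
        self.activations = defaultdict(int)

    def activate(self, row: int) -> int:
        """Return the physical row actually activated."""
        phys = self.remap.get(row, row)
        self.activations[phys] += 1
        if self.activations[phys] >= MIGRATION_THRESHOLD and self.free_quarantine:
            # Migrate the aggressor away from its neighbors; the victims of
            # the old location are no longer adjacent to the hammered row.
            self.remap[row] = self.free_quarantine.pop()
            self.activations[phys] = 0
        return self.remap.get(row, row)

aqua = AquaSketch(quarantine_rows=list(range(10_000, 10_100)))
for _ in range(600):
    aqua.activate(42)
print(aqua.activate(42))  # now served from the quarantine region
```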

5 citations

Journal ArticleDOI
TL;DR: This paper proposes the Cocktail Party Attack (CPA), which recovers private inputs from gradients aggregated over a very large batch size; CPA outperforms prior gradient inversion attacks, scales to ImageNet-sized inputs, and works on batch sizes of up to 1024.
Abstract: Federated learning (FL) aims to perform privacy-preserving machine learning on distributed data held by multiple data owners. To this end, FL requires the data owners to perform training locally and share the gradient updates (instead of the private inputs) with the central server, which are then securely aggregated over multiple data owners. Although aggregation by itself does not provably offer privacy protection, prior work showed that it may suffice if the batch size is sufficiently large. In this paper, we propose the Cocktail Party Attack (CPA), which, contrary to prior belief, is able to recover the private inputs from gradients aggregated over a very large batch size. CPA leverages the crucial insight that the aggregate gradient of a fully connected layer is a linear combination of its inputs, which leads us to frame gradient inversion as a blind source separation (BSS) problem (informally called the cocktail party problem). We adapt independent component analysis (ICA), a classic solution to the BSS problem, to recover private inputs for fully connected and convolutional networks, and show that CPA significantly outperforms prior gradient inversion attacks, scales to ImageNet-sized inputs, and works on large batch sizes of up to 1024.
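The core observation is that the aggregated weight gradient of a fully connected layer behaves like a linear mixture of the batch inputs, which is exactly the blind source separation setting. A minimal demonstration of that recovery step using sklearn's FastICA on synthetic "inputs" (this is the generic BSS recovery under an idealized linear-mixture assumption, not the paper's full attack pipeline):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Synthetic "private inputs": 4 independent, non-Gaussian source signals.
n_inputs, dim = 4, 256
sources = rng.laplace(size=(n_inputs, dim))

# The aggregate gradient observed by the server behaves like a linear
# mixture of the batch inputs: observed = A @ X for some mixing matrix A.
mixing = rng.normal(size=(n_inputs, n_inputs))
observed = mixing @ sources

# ICA recovers the sources up to permutation and scaling.
ica = FastICA(n_components=n_inputs, random_state=0)
recovered = ica.fit_transform(observed.T).T

# Correlate each recovered component with its best-matching true source.
corr = np.abs(np.corrcoef(np.vstack([sources, recovered]))[:n_inputs, n_inputs:])
print("best-match correlations:", corr.max(axis=1).round(3))
```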

4 citations

Patent
27 Jul 2010
TL;DR: This patent proposes a mechanism for approximating data switching activity in a data processing system, based on identifying a set of data storage devices and the bits within them to be monitored.
Abstract: A mechanism is provided for approximating data switching activity in a data processing system. A data switching activity identification mechanism in the data processing system receives an identification of a set of data storage devices, and of a set of bits in those data storage devices, to be monitored for data switching activity. The mechanism sums the counts of identified bits that have changed state across the set of data storage devices to form an approximation of data switching activity. A power manager in the data processing system then adjusts a set of operational parameters associated with the data processing system using this approximation of data switching activity.
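In software terms, the counting step amounts to XOR-ing consecutive samples of each monitored device and summing the population count of the bits that toggled. A hedged analogy of the hardware described above (the patent specifies circuitry; the register samples and masks here are illustrative):

```python
def switching_activity(samples: list[int], mask: int) -> int:
    """Approximate switching activity: count monitored bits that toggle
    between consecutive samples of a storage device's contents."""
    toggles = 0
    for prev, curr in zip(samples, samples[1:]):
        toggles += ((prev ^ curr) & mask).bit_count()
    return toggles

# Two monitored 8-bit registers; watch only the low nibble of each.
reg_a = [0b0000_0000, 0b0000_1111, 0b0000_1010]
reg_b = [0b1111_0000, 0b1111_0001, 0b1111_0011]
total = switching_activity(reg_a, 0x0F) + switching_activity(reg_b, 0x0F)
print("approximate switching activity:", total)  # feeds the power manager
```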

4 citations


Cited by
Journal ArticleDOI
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

1,556 citations

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.
Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
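A toy version of the kind of model the paper builds (the power and performance numbers below are purely illustrative, not the paper's calibrated Pareto-frontier data): given a chip power budget and a per-core power cost, compute how many of the placed cores can be powered on, and the resulting Amdahl-limited speedup.

```python
def speedup(cores_on: int, parallel_fraction: float) -> float:
    """Amdahl's-law speedup with cores_on identical active cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores_on)

# Illustrative scaling step: area lets us place 64 cores, but the power
# budget only lets us turn on 24 of them -- the rest is dark silicon.
chip_power_budget = 120.0  # watts (illustrative)
core_power = 5.0           # watts per active core (illustrative)
cores_placed = 64
cores_on = min(cores_placed, int(chip_power_budget / core_power))

dark_fraction = 1 - cores_on / cores_placed
print(f"active cores: {cores_on}, dark silicon: {dark_fraction:.0%}")
print(f"speedup (f=0.95): {speedup(cores_on, 0.95):.1f}x")
```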

1,379 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory; PRIME distinguishes itself from prior work on NN acceleration with significant performance improvements and energy savings.
Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360× and reduces energy consumption by ~895× across the evaluated machine learning benchmarks.
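The reason a ReRAM crossbar performs matrix-vector multiplication efficiently is that the computation happens in analog: input voltages drive the rows, cell conductances encode the weights, and by Kirchhoff's current law each column current is the dot product of the voltage vector with that column's conductances. A numeric sketch of this idealized behavior (ignoring device non-idealities such as wire resistance and cell variation):

```python
import numpy as np

# Weights stored as cell conductances G (siemens); inputs applied as row
# voltages V. Column currents I = G^T @ V realize y = W^T x in analog.
G = np.array([[1.0, 0.5],
              [0.2, 0.8],
              [0.6, 0.1]])      # 3 rows (inputs) x 2 columns (outputs)
V = np.array([0.3, 1.0, 0.5])  # input voltages on the 3 rows

# Ohm's law per cell, Kirchhoff's current law per column (bitline):
I = G.T @ V
print("column currents:", I)   # [0.8, 1.0] -- each is a dot product
```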

1,197 citations

Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this paper, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
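The partitioning decision can be illustrated with a greedy marginal-utility allocator, a simpler stand-in for the paper's actual partitioning algorithm: given each application's miss curve (misses as a function of allocated ways, as gathered by the monitoring circuits), repeatedly give the next way to the application with the largest miss reduction. A sketch with made-up miss curves:

```python
def partition_ways(miss_curves: list[list[int]], total_ways: int) -> list[int]:
    """Greedy utility-based allocation: miss_curves[a][w] is app a's miss
    count when given w ways (index 0 = no ways). Returns ways per app."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # Marginal utility of one more way for each application.
        gains = [
            curve[alloc[a]] - curve[alloc[a] + 1] if alloc[a] + 1 < len(curve) else 0
            for a, curve in enumerate(miss_curves)
        ]
        winner = max(range(len(gains)), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc

# Illustrative miss curves for two apps sharing an 8-way cache: app 0
# benefits strongly from its first few ways, app 1 barely benefits.
curves = [
    [1000, 500, 300, 200, 150, 140, 135, 133, 132],  # app 0
    [400, 390, 380, 375, 372, 370, 369, 368, 368],   # app 1
]
print(partition_ways(curves, total_ways=8))  # -> [6, 2]
```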

1,083 citations