Author

Moinuddin K. Qureshi

Other affiliations: IBM, University of Texas at Austin, Intel
Bio: Moinuddin K. Qureshi is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics including Cache and Cache pollution, has an h-index of 44, and has co-authored 131 publications receiving 9,956 citations. Previous affiliations of Moinuddin K. Qureshi include IBM and University of Texas at Austin.


Papers
Patent
11 Dec 2012
TL;DR: In this patent, a method is described for preventing errors in a DRAM (dynamic random access memory) due to weak cells: it determines the location of a weak cell in a DRAM row, receives data to write to the DRAM, and encodes the data into a longer bit vector to be written to memory.
Abstract: This disclosure includes a method for preventing errors in a DRAM (dynamic random access memory) due to weak cells that includes determining the location of a weak cell in a DRAM row, receiving data to write to the DRAM, and encoding the data into a bit vector to be written to memory. For each weak cell location, the corresponding bit from the bit vector is equal to the reliable logic state of the weak cell and the bit vector is longer than the data.
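
A minimal sketch of one way such an encoding could work. The patent specifies only that the bit vector is longer than the data and that each weak-cell position holds that cell's reliable state; the per-word inversion scheme, word width, and names below are illustrative assumptions.

```python
WORD_BITS = 8

def encode(word, weak_cells):
    """word: list of WORD_BITS bits (0/1).
    weak_cells: dict {bit_position: reliable_value}.
    Returns the word plus one flip bit appended (the longer bit vector)."""
    flipped = [b ^ 1 for b in word]
    if all(word[pos] == val for pos, val in weak_cells.items()):
        return word + [0]        # store as-is, flip bit clear
    if all(flipped[pos] == val for pos, val in weak_cells.items()):
        return flipped + [1]     # store inverted, flip bit set
    raise ValueError("weak cells demand conflicting polarities")

def decode(vector):
    word, flip = vector[:WORD_BITS], vector[WORD_BITS]
    return [b ^ 1 for b in word] if flip else word

# Example: bit 3 is weak and only reliably holds a 1.
data = [0, 1, 0, 0, 1, 1, 0, 1]
assert decode(encode(data, {3: 1})) == data
```

With one weak cell per word this always succeeds; either the plain or the inverted word matches the cell's reliable value.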

7 citations

Patent
30 Sep 2009
TL;DR: In this patent, a mechanism is proposed that emulates a larger linesize in a cache built with a smaller linesize, using gang fetching and gang replacement, and that applies set dueling to dynamically select the best-performing linesize for the remaining (follower) sets.
Abstract: A mechanism is provided in a cache for emulating larger linesize in a substrate with smaller linesize using gang fetching and gang replacement. Gang fetching fetches multiple lines on a cache miss to ensure that all smaller lines that make up the larger line are resident in cache at the same time. Gang replacement evicts all smaller lines in cache that would have been evicted had the cache linesize been larger. The mechanism provides adaptive linesize using set dueling, dynamically selecting between multiple linesizes depending on which linesize performs best at runtime. Set dueling dedicates a portion of the sets of the cache to always use the smaller linesize and dedicates one or more portions of the sets to always emulate larger linesizes. One or more counters keep track of which linesize has the best performance, and the cache uses that linesize for the remainder of the sets.
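
A minimal sketch of the set-dueling selection described above. The set counts and counter width are assumptions; in hardware this is a single saturating policy-selection counter.

```python
NUM_SETS = 1024
PSEL_MAX = 1023                    # 10-bit saturating counter (assumed)

small_sets = set(range(0, 32))     # dedicated to the smaller linesize
large_sets = set(range(32, 64))    # dedicated to emulating the larger one
psel = PSEL_MAX // 2               # policy-selection counter

def on_miss(set_index):
    """Update the duel on a miss in a dedicated set."""
    global psel
    if set_index in small_sets:    # small linesize missed: vote for large
        psel = min(PSEL_MAX, psel + 1)
    elif set_index in large_sets:  # large linesize missed: vote for small
        psel = max(0, psel - 1)

def linesize_for(set_index):
    """Dedicated sets keep their fixed policy; followers use the winner."""
    if set_index in small_sets:
        return "small"
    if set_index in large_sets:
        return "large"
    return "large" if psel > PSEL_MAX // 2 else "small"
```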

7 citations

Proceedings ArticleDOI
01 Dec 2014
TL;DR: It is demonstrated that by taking the workload-specific context switch overhead into consideration, the CPU scheduler can make better decisions to minimize the context switch penalty for the most affected workloads, thereby resulting in substantial performance improvements.
Abstract: Virtualization allows the platform to have an increased number of logical processors by multiplexing the underlying resources across different virtual machines. The hardware resources get time-shared not only between different virtual machines, but also between different workloads of the same virtual machine. An important source of performance degradation in such a scenario comes from the cache warmup penalties a workload experiences when it gets scheduled, as the working set belonging to the workload gets displaced by other concurrently running workloads. We show that a virtual machine that time-switches between four workloads can slow some of those workloads down by as much as 54%. However, such performance degradation depends on workload behavior, with some workloads experiencing negligible degradation and others severe degradation. We propose Elastic Time Slicing (ETS) to reduce the context switch overhead for the most affected workloads. We demonstrate that by taking the workload-specific context switch overhead into consideration, the CPU scheduler can make better decisions to minimize the context switch penalty for the most affected workloads, resulting in substantial performance improvements. ETS enhances performance without compromising response time, thereby achieving dual benefits. To facilitate ETS, we develop a low-overhead hardware-based mechanism that dynamically estimates the sensitivity of a given workload to context switching. We evaluate the accuracy of the mechanism under various cache management policies and show that it is very reliable. Context-switch-related warmup penalties increase as optimizations are applied to address traditional cache misses. For the first time, we assess the impact of advanced replacement policies and establish that it is significant.
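
A hedged sketch of how the sensitivity estimate could be formed, assuming it is approximated by comparing the miss rate just after a workload is rescheduled against its steady-state miss rate; the paper's actual hardware circuit is not specified here, and all names below are illustrative.

```python
def context_switch_sensitivity(misses_after_resume, accesses_after_resume,
                               steady_miss_rate):
    """Extra misses per access attributable to cache warmup after a
    context switch; zero means the workload is insensitive."""
    resume_rate = misses_after_resume / max(1, accesses_after_resume)
    return max(0.0, resume_rate - steady_miss_rate)

# A scheduler could then grant elastic (longer) time slices to workloads
# whose sensitivity exceeds a threshold, amortizing the warmup cost.
```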

6 citations

Posted Content
TL;DR: TicToc, proposed in this paper, is an organization that provisions both TIC and TOC to get the hit and miss benefits of both; evaluations show that TicToc enables a 10% speedup over a baseline TIC, nearing the 14% speedup possible with an idealized DRAM cache with 64MB of SRAM tags.
Abstract: This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is becoming a viable alternative to DRAM as it enables high-capacity and non-volatile main memory systems; however, 3D-XPoint has 4-8x slower reads, and even worse writes. As such, effective DRAM caching in front of 3D-XPoint is important to enable a high-capacity, low-latency, and high-write-bandwidth memory. There are two major approaches for DRAM cache design: (1) a Tag-Inside-Cacheline (TIC) organization that optimizes for hits, by storing the tag next to each line such that one access gets both tag and data, and (2) a Tag-Outside-Cacheline (TOC) organization that optimizes for misses, by storing tags from multiple data lines together such that one tag access gets information for several data lines. Ideally, we desire the low hit-latency of TIC and the low miss-bandwidth of TOC. To this end, we propose TicToc, an organization that provisions both TIC and TOC to get the hit and miss benefits of both. However, we find that naively combining both actually performs worse than TIC, because one needs to pay bandwidth to maintain both sets of metadata. The main contribution of this work is developing architectural techniques to reduce the bandwidth of maintaining both TIC and TOC metadata. We find that the majority of the bandwidth cost is due to maintaining the TOC dirty bits. We propose the DRAM Cache Dirtiness Bit, which carries DRAM cache dirty information to last-level caches, to prune repeated dirty-bit checks for known dirty lines. We then propose Preemptive Dirty Marking, which predicts which lines will be written and proactively marks the dirty bit at install time, to amortize the initial dirty-bit update. Our evaluations on a 4GB DRAM cache with 3D-XPoint memory show that TicToc enables a 10% speedup over the baseline TIC, nearing the 14% speedup possible with an idealized DRAM cache with 64MB of SRAM tags, while needing only 34KB of SRAM.
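
A back-of-envelope model of the TIC/TOC bandwidth tradeoff the abstract describes; the transfer counts are simplified assumptions that ignore fills and writebacks.

```python
def accesses_per_request(hit_rate, org, lines_per_tag_line=8):
    """Average DRAM-cache transfers per request (simplified)."""
    if org == "TIC":
        # Tag travels with the data line: one access serves both tag check
        # and data on a hit; a miss still spends that one access.
        return 1.0
    if org == "TOC":
        # One tag-line read covers several data lines (amortized), plus a
        # separate data read only when the tag check hits.
        return 1.0 / lines_per_tag_line + hit_rate
    raise ValueError(org)

print(accesses_per_request(0.9, "TIC"), accesses_per_request(0.9, "TOC"))
```

This is why TicToc keeps both kinds of metadata: TIC gives one-access hits, TOC's tags let misses skip wasted data reads, and the paper's techniques then cut the cost of keeping the duplicated metadata (chiefly the TOC dirty bits) current.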

5 citations

Proceedings ArticleDOI
04 Jul 2019
TL;DR: A TicToc organization is proposed that provisions both TIC and TOC to get the hit and miss benefits of both, and proposes a DRAM Cache Dirtiness Bit technique that carries DRAM cache dirty information to last-level caches, to help prune repeated dirty-bit updates for known dirty lines.
Abstract: This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is becoming a viable alternative to DRAM as it enables high-capacity and non-volatile main memory systems. However, 3D-XPoint has several characteristics that limit it from outright replacing DRAM: 4-8x slower reads, and even worse writes. As such, effective DRAM caching in front of 3D-XPoint is important to enable a high-capacity, low-latency, and high-write-bandwidth memory. There are currently two major approaches for DRAM cache design: (1) a Tag-Inside-Cacheline (TIC) organization that optimizes for hits, by storing the tag next to each line such that one access gets both tag and data, and (2) a Tag-Outside-Cacheline (TOC) organization that optimizes for misses, by storing tags from multiple data lines together in a tag-line such that one access to a tag-line gets information on several data lines. Ideally, we would like to have the low hit-latency of TIC designs and the low miss-bandwidth of TOC designs. To this end, we propose a TicToc organization that provisions both TIC and TOC to get the hit and miss benefits of both. We find that naively combining both techniques actually performs worse than TIC individually, because one has to pay the bandwidth cost of maintaining both sets of metadata. The main contribution of this work is developing architectural techniques to reduce the bandwidth cost of accessing and maintaining both TIC and TOC metadata. We find that most of the update bandwidth is due to maintaining the TOC dirty information. We propose a DRAM Cache Dirtiness Bit technique that carries DRAM cache dirty information to last-level caches, to help prune repeated dirty-bit updates for known dirty lines. We also propose a Preemptive Dirty Marking (PDM) technique that predicts which lines will be written and proactively marks the dirty bit at install time, to help avoid the initial dirty-bit update for dirty lines. To support PDM, we develop a novel PC-based Write-Predictor to aid in marking only write-likely lines. Our evaluations on a 4GB DRAM cache in front of 3D-XPoint show that our TicToc organization enables a 10% speedup over the baseline TIC, nearing the 14% speedup possible with an idealized DRAM cache design with 64MB of SRAM tags, while needing only 34KB of SRAM.
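
A minimal sketch of the two dirty-metadata techniques described above; the structures and the predictor's indexing are illustrative assumptions, not the paper's exact hardware.

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    toc_dirty: bool = False    # dirty bit held in TOC metadata (in DRAM)
    dirty_known: bool = False  # DRAM Cache Dirtiness Bit carried in the LLC

write_pc_table = set()         # PCs observed to produce stores (predictor)

def on_install(line, pc):
    """Preemptive Dirty Marking: mark write-likely lines dirty at install,
    folding the initial dirty-bit update into the fill itself."""
    if pc in write_pc_table:
        line.toc_dirty = True
        line.dirty_known = True

def on_store(line, pc):
    write_pc_table.add(pc)     # train the PC-based write predictor
    if line.dirty_known:
        return                 # already known dirty: prune the repeated update
    line.toc_dirty = True      # pay the TOC dirty-bit write once
    line.dirty_known = True
```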

5 citations


Cited by
Journal ArticleDOI
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

1,556 citations

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.
Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
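
A back-of-envelope version of the paper's modeling idea, combining an Amdahl-style speedup limit with a power cap on the number of cores that can be lit. The real study fits Pareto frontiers to measurements from over 150 processors; all numbers below are placeholders.

```python
def multicore_speedup(f_parallel, cores_on_chip, tdp_watts, watts_per_core):
    """Amdahl-style speedup limited by how many cores the power budget
    can keep powered; the rest of the chip stays dark."""
    powered = min(cores_on_chip, int(tdp_watts / watts_per_core))
    dark_fraction = 1.0 - powered / cores_on_chip
    speedup = 1.0 / ((1.0 - f_parallel) + f_parallel / powered)
    return speedup, dark_fraction

# e.g. a 95%-parallel workload on a 256-core chip with a 100 W budget:
s, dark = multicore_speedup(0.95, 256, 100.0, 0.8)  # ~17x speedup, ~51% dark
```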

1,379 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory, and distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving.
Abstract: Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied for accelerating neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable these morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360× and reduces energy consumption by ~895×, across the evaluated machine learning benchmarks.
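
A conceptual sketch of why a ReRAM crossbar computes a matrix-vector product in a single step, idealized to omit ADC/DAC quantization and device nonlinearity: each cell's conductance G[i][j] encodes a weight, and driving row voltages V[i] yields column currents I[j] = sum_i V[i] * G[i][j] by Kirchhoff's law.

```python
def crossbar_mvm(G, V):
    """G: rows x cols conductance matrix, V: row voltage vector.
    Returns the column currents, i.e. the matrix-vector product."""
    cols = len(G[0])
    return [sum(V[i] * G[i][j] for i in range(len(G))) for j in range(cols)]

# A neural-network layer maps its weights onto G and activations onto V,
# which is how PRIME repurposes memory arrays as NN accelerators.
print(crossbar_mvm([[0.1, 0.2], [0.3, 0.4]], [1.0, 0.5]))  # [0.25, 0.4]
```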

1,197 citations

Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
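
An illustrative sketch of utility-based partitioning for two applications, following the paper's idea. The real UMON hardware derives the per-way miss curves from sampled LRU stack-distance counters; here the curves are given directly, and the names are assumptions.

```python
def partition_two(misses_a, misses_b, total_ways):
    """misses_x[w] = estimated misses for app x when given w ways.
    Returns (ways_a, ways_b) minimizing combined misses, >= 1 way each."""
    best = None
    for ways_a in range(1, total_ways):
        ways_b = total_ways - ways_a
        total = misses_a[ways_a] + misses_b[ways_b]
        if best is None or total < best[0]:
            best = (total, ways_a, ways_b)
    return best[1], best[2]

# Example: app A benefits steeply from extra cache, app B barely at all,
# so utility-based partitioning gives A most of the ways even if B has
# higher demand under LRU.
misses_a = [900, 700, 500, 320, 200, 150, 120, 110, 105]  # index = ways
misses_b = [400, 390, 385, 382, 380, 379, 379, 379, 379]
print(partition_two(misses_a, misses_b, 8))               # -> (7, 1)
```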

1,083 citations