Journal ArticleDOI

Accelerating dependent cache misses with an enhanced memory controller

TLDR
This work proposes adding just enough functionality to dynamically identify dependent-miss instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM, allowing memory requests issued by the new Enhanced Memory Controller to experience a 20% lower latency than if issued by the core.
Abstract
On-chip contention increases memory access latency for multicore processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory-intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.
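To make the dependent-miss pattern concrete, here is a minimal C sketch (illustrative only; the `node` type and `sum_list` function are assumptions, not code from the paper). In a linked-list traversal, the address of each load is produced by the previous load, so when the nodes are scattered in memory the resulting cache misses serialize, and only a short slice of instructions stands between one miss and the next, which is the slice the EMC migrates to the memory controller.

```c
#include <stddef.h>

/* Illustrative node type: assume the nodes are allocated far apart,
 * so each dereference of n->next is likely to miss in the caches.  */
struct node {
    struct node *next;
    long payload;
};

/* The load of n->next cannot issue until the previous miss returns
 * its data from DRAM: a chain of dependent cache misses. The few
 * instructions between misses (pointer dereference, add, compare)
 * are exactly the kind of short dependent slice the EMC executes
 * at the memory controller as soon as the data arrives.            */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->payload;
    return sum;
}
```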


Citations
Proceedings ArticleDOI

Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation

TL;DR: The In-Memory PoInter Chasing Accelerator (IMPICA) leverages the logic layer within 3D-stacked memory for linked data structure traversal, addressing two key challenges: how to achieve high parallelism in the presence of the serial accesses inherent in pointer chasing, and how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit.
Proceedings ArticleDOI

Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities

TL;DR: Two new runtime techniques are developed: a regression-based affinity prediction model and mechanism that accurately identifies which kernels would benefit from PIM and offloads them to the GPU cores in memory, and a concurrent kernel management mechanism that uses the affinity prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on the main GPU cores and the GPU cores in memory.
Journal ArticleDOI

Processing data where it makes sense: Enabling in-memory computation

TL;DR: In this paper, the authors survey recent research that aims to practically enable computation close to data, discussing at least two promising directions for processing-in-memory (PIM): (1) performing massively parallel bulk operations in memory by exploiting the analog operational properties of DRAM, with low-cost changes, and (2) exploiting the logic layer in 3D-stacked memory technology to accelerate important data-intensive applications.
Journal ArticleDOI

RowHammer: A Retrospective

TL;DR: In this paper, Kim et al. comprehensively survey the scientific literature on RowHammer-based attacks as well as mitigation techniques that prevent RowHammer, and discuss what other related vulnerabilities may be lurking in DRAM and other types of memories, e.g., NAND flash memory or phase change memory, that could potentially threaten the foundations of secure systems.
Posted Content

The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser

TL;DR: This work discusses the RowHammer problem in DRAM, which is a prime (and perhaps the first) example of how a circuit-level failure mechanism can cause a practical and widespread system security vulnerability, and describes and advocates a principled approach to memory reliability and security research that can enable us to better anticipate and prevent such vulnerabilities.
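For context, the disturbance mechanism both RowHammer papers above study is triggered by rapidly re-activating DRAM rows. The following minimal C sketch (an illustration under stated assumptions, not code from either paper) shows the well-known two-address hammering loop for x86-64, assuming `addr_a` and `addr_b` map to different rows of the same DRAM bank:

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2) */

/* Repeatedly activate two DRAM rows. The cache-line flushes ensure
 * every iteration reaches DRAM instead of hitting in the cache; at
 * a high enough activation rate, bits in physically adjacent rows
 * may flip. addr_a and addr_b are assumed to map to different rows
 * of the same bank.                                                */
static void hammer(volatile uint64_t *addr_a,
                   volatile uint64_t *addr_b,
                   long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*addr_a;                       /* activate row A */
        (void)*addr_b;                       /* activate row B */
        _mm_clflush((const void *)addr_a);   /* evict line A   */
        _mm_clflush((const void *)addr_b);   /* evict line B   */
    }
}
```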
References
Proceedings ArticleDOI

McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA²P and EDAP.
Proceedings ArticleDOI

Automatically characterizing large scale program behavior

TL;DR: This work quantifies the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explores the large-scale behavior of several programs, and develops a set of clustering-based algorithms capable of analyzing this behavior.
Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

TL;DR: In this article, a hardware technique to improve cache performance is presented: a small fully-associative buffer is placed between a cache and its refill path, and prefetched data is held in this buffer rather than being placed directly in the cache.

CACTI 6.0: A Tool to Model Large Caches

TL;DR: This report details the analytical models assumed for the newly added modules of CACTI 6.0, along with their validation analysis; CACTI 6.0 is a significantly enhanced version of the tool that primarily focuses on interconnect design for large caches.
Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.