Open Access Journal ArticleDOI

CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory

TL;DR
This article analyzes the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and proposes CAIRO, a compiler-assisted technique and decision model for enabling instruction-level offloading of PIM without any burden on programmers.
Abstract: 
Three-dimensional (3D) stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again. PIM offers the benefits of bandwidth and energy savings by offloading computations to functional units inside the memory. Several memory vendors have also started to integrate computation logic into the memory; the latest version of the Hybrid Memory Cube (HMC), for example, supports up to 18 in-memory atomic instructions. Although industry prototypes have motivated studies investigating efficient methods and architectures for PIM, researchers have not proposed a systematic way to identify the benefits of instruction-level PIM offloading. As a result, compiler support for recognizing offloading candidates and utilizing instruction-level PIM offloading is unavailable. In this article, we analyze the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and propose CAIRO, a compiler-assisted technique and decision model for enabling instruction-level offloading of PIM without any burden on programmers. To develop CAIRO, we analyzed how instruction offloading enables performance gain in both CPU and GPU workloads. Our studies show that the performance gain from bandwidth savings, the ratio of cache misses to total cache accesses, and the overhead of host atomic instructions are the key factors in selecting an offloading candidate. Based on our analytical models, we characterize the properties of beneficial and nonbeneficial candidates for offloading. We evaluate CAIRO with 27 multithreaded CPU and 36 GPU benchmarks. In our evaluation, CAIRO not only doubles the speedup for a set of PIM-beneficial workloads by exploiting HMC-atomic instructions but also prevents slowdown caused by incorrect offloading decisions for other workloads.
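As a rough illustration of such a decision model, the Python sketch below tests one offloading candidate against the three factors the abstract names. It is a minimal sketch under assumed inputs: the `InstrProfile` fields, the thresholds, and the `should_offload` helper are hypothetical stand-ins, not CAIRO's actual analytical model.

```python
from dataclasses import dataclass

# Hypothetical profile of one offloading candidate (e.g., an atomic
# add to a vertex array in a graph kernel). Field names are
# illustrative, not CAIRO's internal representation.
@dataclass
class InstrProfile:
    miss_ratio: float          # cache misses / total cache accesses
    bytes_host: int            # bytes moved if executed on the host
    bytes_hmc: int             # bytes moved if offloaded as an HMC atomic
    host_atomic_cycles: float  # average cost of the host atomic instruction
    hmc_atomic_cycles: float   # average cost of the in-memory atomic

def should_offload(p: InstrProfile,
                   min_miss_ratio: float = 0.5,
                   min_bw_saving: float = 0.25) -> bool:
    """Offload only when the candidate looks PIM-beneficial:
    mostly-missing accesses, real bandwidth savings, and an
    in-memory atomic no slower than the host atomic."""
    bw_saving = 1.0 - p.bytes_hmc / p.bytes_host
    return (p.miss_ratio >= min_miss_ratio
            and bw_saving >= min_bw_saving
            and p.hmc_atomic_cycles <= p.host_atomic_cycles)

# A cache-unfriendly graph update: almost always misses, and offloading
# replaces a read-modify-write round trip with a single HMC command.
candidate = InstrProfile(miss_ratio=0.9, bytes_host=128, bytes_hmc=16,
                         host_atomic_cycles=40.0, hmc_atomic_cycles=20.0)
print(should_offload(candidate))  # True
```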


Citations
Proceedings ArticleDOI

PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture

TL;DR: In this paper, the authors propose a new PIM architecture that does not change the existing sequential programming models and automatically decides whether to execute PIM operations in memory or on the host processors, depending on data locality.
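That locality-aware policy can be pictured in a few lines: run the operation on the host when its data is likely cached, and ship it to memory otherwise. The Python sketch below is only a software illustration of the dispatch idea; the set-based locality monitor and all names are hypothetical, not the hardware mechanism described in the paper.

```python
# Minimal sketch of locality-aware dispatch for a PIM-enabled
# instruction. A real design tracks locality in hardware; here a set
# of recently touched cache-line addresses stands in for that monitor.
LINE = 64
recently_touched: set[int] = set()

def pim_add(memory: dict[int, int], addr: int, value: int) -> None:
    line = addr // LINE
    if line in recently_touched:
        # Likely a cache hit: execute the operation on the host.
        memory[addr] = memory.get(addr, 0) + value
    else:
        # Likely a miss: offload to the in-memory compute unit
        # (modeled here as the same update, performed "remotely").
        offload_to_memory(memory, addr, value)
    recently_touched.add(line)

def offload_to_memory(memory: dict[int, int], addr: int, value: int) -> None:
    memory[addr] = memory.get(addr, 0) + value

mem: dict[int, int] = {}
pim_add(mem, 0, 5)  # cold line: offloaded to memory
pim_add(mem, 0, 5)  # warm line: executed on the host
print(mem[0])       # 10
```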
Proceedings ArticleDOI

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

TL;DR: This work comprehensively analyzes the energy and performance impact of data movement for several widely-used Google consumer workloads, and finds that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads by performing part of the computation close to memory.
Proceedings ArticleDOI

CoNDA: efficient cache coherence support for near-data accelerators

TL;DR: CoNDA is proposed, a coherence mechanism that lets a near-data accelerator (NDA) optimistically execute an NDA kernel under the assumption that the NDA has all necessary coherence permissions; this optimistic execution allows CoNDA to gather information on the memory accesses performed by the NDA and by the rest of the system.
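The optimistic flow resembles a software transaction: execute without acquiring coherence permissions, record the read and write sets, and commit only if the host did not touch the same data. The sketch below is a hypothetical software analogue of that flow under assumed names; it does not model CoNDA's hardware signatures or coherence protocol.

```python
# Hypothetical software analogue of optimistic NDA execution:
# run the kernel, track its read/write sets, then commit only if the
# host did not concurrently write any address the kernel touched.
def run_optimistically(kernel, memory: dict, host_writes: set[int]) -> bool:
    reads: set[int] = set()
    writes: dict[int, int] = {}

    def load(addr):
        reads.add(addr)
        return writes.get(addr, memory.get(addr, 0))

    def store(addr, val):
        writes[addr] = val

    kernel(load, store)

    # Conflict check: a host write to anything the NDA touched
    # invalidates the optimistic run and forces a re-execution.
    if (reads | writes.keys()) & host_writes:
        return False  # caller re-executes with coherence resolved
    memory.update(writes)  # commit the NDA's updates
    return True

mem = {10: 1}
ok = run_optimistically(lambda ld, st: st(10, ld(10) + 1), mem, host_writes=set())
print(ok, mem[10])  # True 2
```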
Journal ArticleDOI

Near-memory computing: Past, present, and future

TL;DR: In this article, the authors survey the prior art on near-memory computing (NMC) across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues, along with future research directions.
Proceedings ArticleDOI

Opportunistic computing in GPU architectures

TL;DR: This paper develops two offloading techniques, called LLC-Compute and Omni-Compute, which employ simple bookkeeping hardware to enable GPU cores to execute instructions offloaded by other GPU cores.
References
Journal ArticleDOI

Hitting the memory wall: implications of the obvious

TL;DR: This work observes that processor performance improves much faster than DRAM performance, so average memory access time will increasingly be dominated by memory latency, and argues that without a fundamental change in this trend, overall system performance will soon be limited by a "memory wall."
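The argument rests on the familiar average-access-time relation t_avg = p * t_cache + (1 - p) * t_mem: as DRAM latency measured in CPU cycles grows, the miss term dominates no matter how good the hit ratio is. The numbers in the sketch below are illustrative only, not figures from the paper.

```python
# Illustrative numbers only: average memory access time
#   t_avg = p * t_cache + (1 - p) * t_mem
# As CPU cycle times shrink relative to DRAM latency, t_mem (in
# cycles) grows and t_avg becomes dominated by the miss term.
p_hit = 0.95    # assumed cache hit ratio
t_cache = 1     # cache access time, cycles
for t_mem in (50, 200, 800):  # DRAM latency in CPU cycles over time
    t_avg = p_hit * t_cache + (1 - p_hit) * t_mem
    print(f"t_mem={t_mem:4d} cycles -> t_avg={t_avg:5.1f} cycles")
```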
Journal ArticleDOI

DRAMSim2: A Cycle Accurate Memory System Simulator

TL;DR: This article describes the process of validating DRAMSim2's timing against manufacturer Verilog models in an effort to prove the accuracy of its simulation results.
Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
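The underlying programming model pairs each memory partition with its own compute unit and moves work, not data, between partitions. The Python sketch below is a loose software analogue of that idea; the vault count, names, and message format are hypothetical and make no claim to match Tesseract's actual ISA.

```python
# Loose software analogue of partitioned in-memory graph processing:
# each vault owns a slice of the vertex array, and updates to remote
# vertices are sent as messages instead of moving the data.
NUM_VAULTS = 4
vaults = [dict() for _ in range(NUM_VAULTS)]   # per-vault vertex -> value
inboxes = [[] for _ in range(NUM_VAULTS)]      # pending remote updates

def owner(v: int) -> int:
    return v % NUM_VAULTS  # static vertex-to-vault partitioning

def send_update(src_vault: int, v: int, delta: int) -> None:
    dst = owner(v)
    if dst == src_vault:
        vaults[dst][v] = vaults[dst].get(v, 0) + delta  # local update
    else:
        inboxes[dst].append((v, delta))                 # remote message

def drain(vault: int) -> None:
    # The vault's compute unit applies queued updates to local memory.
    for v, delta in inboxes[vault]:
        vaults[vault][v] = vaults[vault].get(v, 0) + delta
    inboxes[vault].clear()

send_update(0, 7, 3)  # vertex 7 lives in vault 3: message, not data, moves
drain(3)
print(vaults[3][7])   # 3
```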
Proceedings ArticleDOI

Hybrid memory cube new DRAM architecture increases density and performance

TL;DR: The Hybrid Memory Cube is a three-dimensional DRAM architecture that improves latency, bandwidth, power, and density. Heterogeneous dies are stacked with significantly more connections, thereby reducing the distance signals travel.