CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory
TLDR
This article analyzes the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and proposes CAIRO, a compiler-assisted technique and decision model for enabling instruction-level offloading of PIM without any burden on programmers.
Abstract
Three-dimensional (3D)-stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again, which offer the benefits of bandwidth and energy savings by offloading computations to functional units inside the memory. Several memory vendors have also started to integrate computation logic into the memory, such as the Hybrid Memory Cube (HMC), the latest version of which supports up to 18 in-memory atomic instructions. Although industry prototypes have motivated studies investigating efficient methods and architectures for PIM, researchers have not proposed a systematic way to identify the benefits of instruction-level PIM offloading. As a result, compiler support for recognizing offloading candidates and utilizing instruction-level PIM offloading is unavailable. In this article, we analyze the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and propose CAIRO, a compiler-assisted technique and decision model for enabling instruction-level offloading of PIM without any burden on programmers. To develop CAIRO, we analyzed how instruction offloading enables performance gain in both CPU and GPU workloads. Our studies show that performance gain from bandwidth savings, the ratio of cache misses to total cache accesses, and the overhead of host atomic instructions are the key factors in selecting an offloading candidate. Based on our analytical models, we characterize the properties of beneficial and nonbeneficial candidates for offloading. We evaluate CAIRO with 27 multithreaded CPU and 36 GPU benchmarks. In our evaluation, CAIRO not only doubles the speedup for a set of PIM-beneficial workloads by exploiting HMC-atomic instructions but also prevents slowdown caused by incorrect offloading decisions for other workloads.
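The decision factors named in the abstract (bandwidth savings, cache miss ratio, and host atomic overhead) can be sketched as a toy cost comparison. This is a minimal illustration of the idea, not CAIRO's actual analytical model; the function name `should_offload`, the cycle constants, and the bandwidth weight are all assumptions for illustration.

```python
def should_offload(miss_ratio, host_atomic_cycles, pim_atomic_cycles,
                   bytes_saved_per_op, threshold=0.0):
    """Hypothetical offloading heuristic inspired by the factors CAIRO
    weighs; the paper's actual analytical model differs in detail.

    A host atomic that misses in cache pays a full round trip to memory,
    while an offloaded PIM atomic sends a single command. Cache hits are
    cheap on the host and would only be slowed down by offloading.
    """
    hit_cycles = 4  # assumed cost of an atomic that hits in cache
    # Expected host cost per op: hits are fast, misses pay the full latency.
    host_cost = (1 - miss_ratio) * hit_cycles + miss_ratio * host_atomic_cycles
    # Offloaded cost is roughly flat: every op becomes one PIM command.
    pim_cost = pim_atomic_cycles
    # Bandwidth savings only matter for ops that would have missed.
    bandwidth_bonus = miss_ratio * bytes_saved_per_op * 0.1  # assumed weight
    gain = (host_cost - pim_cost) + bandwidth_bonus
    return gain > threshold

# A miss-heavy atomic (typical of irregular graph updates) is offloaded ...
print(should_offload(miss_ratio=0.9, host_atomic_cycles=200,
                     pim_atomic_cycles=40, bytes_saved_per_op=64))   # True
# ... while a cache-friendly one stays on the host.
print(should_offload(miss_ratio=0.05, host_atomic_cycles=200,
                     pim_atomic_cycles=40, bytes_saved_per_op=64))   # False
```

The point of the sketch is the asymmetry the paper exploits: graph workloads with irregular access patterns have high miss ratios, which simultaneously raises host atomic cost and bandwidth savings, making offloading profitable; low-miss-ratio candidates are left on the host to avoid slowdown.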
Citations
Proceedings ArticleDOI
PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture
TL;DR: In this article, the authors propose a new PIM architecture that does not change the existing sequential programming models and automatically decides whether to execute PIM operations in memory or processors depending on the locality of data.
Proceedings ArticleDOI
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks
Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Dae Hyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu, et al.
TL;DR: This work comprehensively analyzes the energy and performance impact of data movement for several widely used Google consumer workloads, and finds that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads by performing part of the computation close to memory.
Proceedings ArticleDOI
CoNDA: efficient cache coherence support for near-data accelerators
Amirali Boroumand, Hongzhong Zheng, Onur Mutlu, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Rachata Ausavarungnirun, Kevin Hsieh, Nastaran Hajinazar, Krishna T. Malladi, et al.
TL;DR: CoNDA is proposed, a coherence mechanism that lets an NDA optimistically execute an NDA kernel under the assumption that the NDA has all necessary coherence permissions, which allows CoNDA to gather information on the memory accesses performed by the NDA and by the rest of the system.
Journal ArticleDOI
Near-memory computing: Past, present, and future
Gagandeep Singh, Lorenzo Chelini, Stefano Corda, Ahsan Javed Awan, Sander Stuijk, Roel Jordans, Henk Corporaal, Albert-Jan Boonstra, et al.
TL;DR: In this article, the authors survey the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues with future research directions.
Proceedings ArticleDOI
Opportunistic computing in GPU architectures
Ashutosh Pattnaik, Xulong Tang, Onur Kayiran, Adwait Jog, Asit K. Mishra, Mahmut Kandemir, Anand Sivasubramaniam, Chita R. Das, et al.
TL;DR: This paper develops two offloading techniques, called LLC-Compute and Omni-Compute, which employ simple bookkeeping hardware to enable GPU cores to compute instructions offloaded by other GPU cores.
References
Journal ArticleDOI
Hitting the memory wall: implications of the obvious
William A. Wulf, Sally A. McKee
TL;DR: This note observes that microprocessor speed has been improving far faster than DRAM speed, and argues that unless this trend changes, overall system performance will soon be dominated by memory access time: the "memory wall."
Journal ArticleDOI
DRAMSim2: A Cycle Accurate Memory System Simulator
TL;DR: The process of validating DRAMSim2 timing against manufacturer Verilog models in an effort to prove the accuracy of simulation results is described.
Proceedings ArticleDOI
A scalable processing-in-memory accelerator for parallel graph processing
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Proceedings ArticleDOI
Hybrid memory cube new DRAM architecture increases density and performance
Joe M. Jeddeloh, Brent Keeth
TL;DR: The Hybrid Memory Cube is a three-dimensional DRAM architecture that improves latency, bandwidth, power, and density; heterogeneous dies are stacked with significantly more connections, thereby reducing the distance signals travel.