Proceedings ArticleDOI

TOP-PIM: throughput-oriented programmable processing in memory

TLDR
This work explores the use of 3D die stacking to move memory-intensive computations closer to memory and introduces a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware.
Abstract
As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical to continued performance scaling. Moving computation closer to memory presents an opportunity to reduce both energy and data movement overheads. We explore the use of 3D die stacking to move memory-intensive computations closer to memory. This approach to processing in memory addresses some drawbacks of prior research on in-memory computing and is commercially viable in the foreseeable future. Because 3D stacking provides increased bandwidth, we study throughput-oriented computing using programmable GPU compute units across a broad range of benchmarks, including graph and HPC applications. We also introduce a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware. Our results show that, on average, viable PIM configurations show moderate performance losses (27%) in return for significant energy efficiency improvements (76% reduction in energy-delay product, EDP) relative to a representative mainstream GPU at 22nm technology. At 16nm technology, on average, viable PIM configurations are performance competitive with a representative mainstream GPU (7% speedup) and provide even greater energy efficiency improvements (85% reduction in EDP).
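To illustrate the flavor of such an analytical model, the sketch below estimates in-memory-processor runtime and energy from metrics measured on a baseline GPU and compares energy-delay product (EDP). It is a minimal first-order illustration, not the paper's actual methodology; every function name, parameter, and scaling factor is a hypothetical placeholder.

# Hypothetical first-order model: predict PIM runtime/energy from baseline GPU metrics.
# All inputs and scaling factors are illustrative assumptions, not values from the paper.

def predict_pim(baseline_time_s, compute_fraction, dram_energy_pj_per_bit,
                bytes_moved, pim_freq_scale=0.6, pim_energy_per_bit_scale=0.3):
    """Estimate PIM runtime, memory energy, and EDP reduction relative to a baseline GPU.

    baseline_time_s          -- measured kernel time on today's GPU
    compute_fraction         -- fraction of baseline time bound by compute (vs. memory)
    dram_energy_pj_per_bit   -- off-chip DRAM access energy on the baseline (pJ/bit)
    bytes_moved              -- DRAM traffic measured on the baseline
    pim_freq_scale           -- assumed PIM compute-unit frequency relative to the GPU
    pim_energy_per_bit_scale -- assumed stacked-memory access energy relative to DRAM
    """
    # Compute-bound portion slows down on the lower-clocked in-memory units;
    # memory-bound portion is assumed to keep pace thanks to stacked bandwidth.
    compute_time = baseline_time_s * compute_fraction / pim_freq_scale
    memory_time = baseline_time_s * (1.0 - compute_fraction)
    pim_time = compute_time + memory_time

    # Memory access energy shrinks because data no longer crosses off-chip links.
    bits = bytes_moved * 8
    baseline_mem_energy = bits * dram_energy_pj_per_bit * 1e-12   # joules
    pim_mem_energy = baseline_mem_energy * pim_energy_per_bit_scale

    # EDP = energy * delay; this toy model counts only memory access energy,
    # a deliberate simplification of a full energy model.
    baseline_edp = baseline_mem_energy * baseline_time_s
    pim_edp = pim_mem_energy * pim_time
    return pim_time, pim_mem_energy, 1.0 - pim_edp / baseline_edp

# Example: a mostly memory-bound kernel (30% compute-bound) moving 4 GB of data.
t, e, edp_reduction = predict_pim(baseline_time_s=0.010, compute_fraction=0.3,
                                  dram_energy_pj_per_bit=20.0, bytes_moved=4e9)
print(f"predicted PIM time: {t*1e3:.2f} ms, memory energy: {e:.3f} J, "
      f"EDP reduction: {edp_reduction:.1%}")

With these placeholder numbers the compute-bound portion lengthens runtime while the cheaper stacked-memory accesses cut energy enough to reduce EDP, mirroring the performance-versus-efficiency trade-off the abstract reports.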



Citations
Journal ArticleDOI

PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory

TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory; it distinguishes itself from prior work on NN acceleration with significant performance improvements and energy savings.
Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Proceedings ArticleDOI

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

TL;DR: PipeLayer is presented, a ReRAM-based PIM accelerator for CNNs that supports both training and testing; it proposes a highly parallel design based on the notions of parallelism granularity and weight replication, which enables pipelined execution of both training and testing without introducing the potential stalls of previous work.
Proceedings ArticleDOI

Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology

TL;DR: Ambit is proposed, an Accelerator-in-Memory for bulk bitwise operations that largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area).
Proceedings ArticleDOI

PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture

TL;DR: In this article, the authors propose a new PIM architecture that does not change the existing sequential programming models and automatically decides whether to execute PIM operations in memory or on the host processors depending on the locality of data.
References
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power consumption characteristics, and has led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Proceedings ArticleDOI

Architecting phase change memory as a scalable dram alternative

TL;DR: This work proposes area-neutral architectural enhancements, crafted from a fundamental understanding of PCM technology parameters, that address PCM's limitations and make it competitive with DRAM.
Book ChapterDOI

On Graph Kernels: Hardness Results and Efficient Alternatives

TL;DR: As most ‘real-world’ data is structured, research in kernel methods has begun investigating kernels for various kinds of structured data, but only very specific graphs such as trees and strings have been considered.
Proceedings ArticleDOI

Shortest-path kernels on graphs

TL;DR: This work proposes graph kernels based on shortest paths, which are computable in polynomial time, retain expressivity and are still positive definite, and shows significantly higher classification accuracy than walk-based kernels.
Journal ArticleDOI

3D-Stacked Memory Architectures for Multi-core Processors

TL;DR: This work explores more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count, to achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on memory-intensive multi-programmed workloads on a quad-core processor.