Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

13 Jun 2015 - Vol. 43, Iss. 3, pp. 105-117
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
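To give a concrete flavor of the programming interface and inter-partition communication the abstract describes, the sketch below expresses one PageRank superstep over a graph striped across memory partitions ("vaults"), with contributions pushed to the partition that owns each destination vertex. The remote_add() and barrier-style synchronization here are hypothetical stand-ins emulated in a single process for illustration; they are not Tesseract's actual API.

```cpp
// Vertex-centric PageRank pass over a graph partitioned across memory vaults.
// remote_add() models a message sent to the vault owning the destination vertex;
// everything is emulated in one process purely for illustration.
#include <cstdio>
#include <vector>

struct Partition {
    std::vector<std::vector<int>> out_edges;  // adjacency lists of local vertices
    std::vector<double> rank;                 // current rank of local vertices
    std::vector<double> next;                 // rank accumulated for the next iteration
};

constexpr int kPartitions = 4;
constexpr double kDamping = 0.85;

// Hypothetical remote operation: add a contribution to a vertex owned by
// another partition. In a PIM design this work happens near that vault's memory.
void remote_add(std::vector<Partition>& parts, int global_v, double value) {
    parts[global_v % kPartitions].next[global_v / kPartitions] += value;
}

void pagerank_superstep(std::vector<Partition>& parts, int total_vertices) {
    for (auto& p : parts)
        for (double& x : p.next) x = (1.0 - kDamping) / total_vertices;

    // Each partition scans only its local vertices (a sequential, prefetch-friendly
    // access pattern) and pushes contributions toward the owners of the destinations.
    for (auto& p : parts) {
        for (size_t v = 0; v < p.out_edges.size(); ++v) {
            if (p.out_edges[v].empty()) continue;
            double contrib = kDamping * p.rank[v] / p.out_edges[v].size();
            for (int dst : p.out_edges[v]) remote_add(parts, dst, contrib);
        }
    }
    // Barrier: all partitions finish before the new ranks become visible.
    for (auto& p : parts) p.rank = p.next;
}

int main() {
    // Tiny 4-vertex ring, one vertex per partition (vertex v lives in vault v % kPartitions).
    std::vector<Partition> parts(kPartitions);
    for (int v = 0; v < 4; ++v) {
        parts[v % kPartitions].out_edges.push_back({(v + 1) % 4});
        parts[v % kPartitions].rank.push_back(0.25);
        parts[v % kPartitions].next.push_back(0.0);
    }
    for (int i = 0; i < 10; ++i) pagerank_superstep(parts, 4);
    for (int p = 0; p < kPartitions; ++p) std::printf("vertex %d rank %.3f\n", p, parts[p].rank[0]);
    return 0;
}
```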

Citations
Proceedings ArticleDOI
01 Feb 2020
TL;DR: PIM-CapsNet, as described in this paper, is a hybrid computing architecture for CapsNet that preserves the GPU's on-chip computing capability for accelerating CNN-type layers, while pipelining with an off-chip in-memory acceleration solution that tackles the routing procedure's inefficiency by leveraging the processing-in-memory capability of today's 3D-stacked memory.
Abstract: In recent years, CNNs have achieved great success in image processing tasks such as image recognition and object detection. Unfortunately, the classification of traditional CNNs is easily misled by increasingly complex image features, because their pooling operations cannot preserve accurate position and pose information of the objects. To address this challenge, a novel neural network structure called the Capsule Network has been proposed, which introduces equivariance through capsules to significantly enhance the learning ability for image segmentation and object detection. Because they require a high volume of matrix operations, CapsNets are generally accelerated on modern GPU platforms that provide highly optimized software libraries for common deep learning tasks. However, based on our performance characterization on modern GPUs, CapsNets exhibit low efficiency due to the special program and execution features of their routing procedure, including massive unshareable intermediate variables and intensive synchronizations, which are very difficult to optimize at the software level. To address these challenges, we propose a hybrid computing architecture design named PIM-CapsNet. It preserves the GPU's on-chip computing capability for accelerating the CNN-type layers in CapsNet, while pipelining with an off-chip in-memory acceleration solution that effectively tackles the routing procedure's inefficiency by leveraging the processing-in-memory capability of today's 3D-stacked memory. Using the routing procedure's inherent parallelism, our design enables hierarchical improvements in CapsNet inference efficiency by minimizing data movement and maximizing parallel processing in memory. Evaluation results demonstrate that our proposed design achieves substantial improvements in both performance and energy savings for CapsNet inference, with almost zero accuracy loss. The results also suggest good performance scalability in optimizing the routing procedure with increasing network size.
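For readers unfamiliar with the routing procedure that PIM-CapsNet offloads to memory, the sketch below is a minimal, scalar C++ rendition of routing-by-agreement as described by Sabour et al. The dimensions and iteration count are arbitrary; the point is only to show why the step produces many per-pair coupling coefficients (unshareable intermediates) and needs repeated global reductions (synchronizations).

```cpp
// Minimal routing-by-agreement sketch (illustrative only, not PIM-CapsNet's code).
// u_hat[i][j] is the prediction vector from lower capsule i to upper capsule j.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

Vec squash(const Vec& s) {
    double n2 = 0.0;
    for (double x : s) n2 += x * x;
    double scale = n2 / (1.0 + n2) / (std::sqrt(n2) + 1e-9);
    Vec v(s.size());
    for (size_t k = 0; k < s.size(); ++k) v[k] = scale * s[k];
    return v;
}

// One routing pass: returns the output vectors v[j] of the upper capsules.
std::vector<Vec> route(const std::vector<std::vector<Vec>>& u_hat, int iters) {
    size_t I = u_hat.size(), J = u_hat[0].size(), D = u_hat[0][0].size();
    std::vector<std::vector<double>> b(I, std::vector<double>(J, 0.0));  // routing logits
    std::vector<Vec> v(J, Vec(D, 0.0));
    for (int r = 0; r < iters; ++r) {
        for (size_t j = 0; j < J; ++j) std::fill(v[j].begin(), v[j].end(), 0.0);
        for (size_t i = 0; i < I; ++i) {
            // c[i][.] = softmax(b[i][.]): per-capsule coupling coefficients
            std::vector<double> c(J);
            double denom = 0.0;
            for (size_t j = 0; j < J; ++j) denom += std::exp(b[i][j]);
            for (size_t j = 0; j < J; ++j) c[j] = std::exp(b[i][j]) / denom;
            for (size_t j = 0; j < J; ++j)
                for (size_t d = 0; d < D; ++d) v[j][d] += c[j] * u_hat[i][j][d];
        }
        for (size_t j = 0; j < J; ++j) v[j] = squash(v[j]);  // global reduction point
        for (size_t i = 0; i < I; ++i)                       // agreement update of logits
            for (size_t j = 0; j < J; ++j)
                for (size_t d = 0; d < D; ++d) b[i][j] += u_hat[i][j][d] * v[j][d];
    }
    return v;
}

int main() {
    std::vector<std::vector<Vec>> u_hat(8, std::vector<Vec>(3, Vec{0.1, 0.2, 0.3, 0.4}));
    auto v = route(u_hat, 3);
    std::printf("v[0][0] = %.3f\n", v[0][0]);
    return 0;
}
```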

16 citations

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This work proposes and implements a new ReRAM-based processing-in-memory architecture called RPBFS, in which graphs can be processed and persistently stored, and designs an efficient graph traversal scheme.
Abstract: Graph algorithms such as breadth-first search (BFS) have been gaining ever-increasing importance in the era of Big Data. However, memory bandwidth remains the key performance bottleneck for graph processing. To address this problem, we utilize processing-in-memory (PIM), combined with non-volatile metal-oxide resistive random access memory (ReRAM), to improve the performance of both computation and I/O. The idea is to integrate computation logic into the memory where the data resides. We propose and implement a new ReRAM-based processing-in-memory architecture called RPBFS, in which graphs can be processed and persistently stored. We also design an efficient graph traversal scheme. Benefiting from low data movement overhead and bank-level parallel computation, RPBFS shows a significant performance improvement over both CPU-based and GPU-based BFS implementations. On a suite of real-world graphs, our architecture yields up to a 33.8× speedup.
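The citation contexts quoted below refer to a CSR-like layout in which a per-vertex "location pointer" delimits its adjacency list in the crossbar. As background for the traversal RPBFS accelerates in ReRAM, the sketch below is a plain CSR breadth-first search of that shape; the array names are generic illustrations, not the paper's data structures.

```cpp
// Plain CSR breadth-first search. Vertex v's neighbors are adj[row_ptr[v] .. row_ptr[v+1]),
// mirroring the "location pointer" layout described in the citation contexts below.
#include <cstdio>
#include <queue>
#include <vector>

std::vector<int> bfs_csr(const std::vector<int>& row_ptr,
                         const std::vector<int>& adj,
                         int source) {
    std::vector<int> level(row_ptr.size() - 1, -1);  // -1 = not yet visited
    std::queue<int> frontier;
    level[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        int v = frontier.front();
        frontier.pop();
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            int u = adj[e];
            if (level[u] == -1) {            // first visit
                level[u] = level[v] + 1;
                frontier.push(u);
            }
        }
    }
    return level;
}

int main() {
    // 5-vertex example: 0->1, 0->2, 1->3, 2->3, 3->4
    std::vector<int> row_ptr = {0, 2, 3, 4, 5, 5};
    std::vector<int> adj = {1, 2, 3, 3, 4};
    auto level = bfs_csr(row_ptr, adj, 0);
    for (size_t v = 0; v < level.size(); ++v) std::printf("vertex %zu level %d\n", v, level[v]);
    return 0;
}
```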

16 citations


Cites background or methods from "A scalable processing-in-memory acc..."

  • ...In [3], the authors demonstrate that increasing computation cores is inefficient because higher performance would require bigger memory bandwidth....

  • ...The location pointer of Vertex 4 is [2, 2], so the cells after location [2, 2] to [3, 1] are adjacent vertices of Vertex 5....

  • ...Driven by the 3D-stacking technology in recent years, PIM is resurgent by putting logic layer into 3D stacked memories [3]....

  • ...The next step is to attain the adjacent vertices of Vertex 5 from the coordinate [2, 2] to [3, 1] in crossbar by activating wordline No....

  • ...To maximize the available memory bandwidth, [3] integrates PIM technology into 3D-stacked memory....

Posted Content
TL;DR: This survey aims to bring these domains together holistically around the specification, modeling/simulation, benchmarking, and verification of complex chips, to present the latest in each of these areas, to highlight potential gaps and challenges, and to discuss opportunities for the next generation of energy-efficient systems.
Abstract: Computing systems have undergone several inflexion points - while Moore's law guided the semiconductor industry to cram more and more transistors and logic into the same volume, the limits of instruction-level parallelism (ILP) and the end of Dennard scaling drove the industry towards multi-core chips. We have now entered the era of domain-specific architectures for new workloads like AI and ML. These trends continue, arguably with other limits, along with challenges imposed by tighter integration, extreme form factors, and diverse workloads, making systems more complex from an energy efficiency perspective. Many research surveys have covered different aspects of techniques in hardware and microarchitecture across devices, servers, HPC, and data center systems, along with software, algorithms, and frameworks for energy efficiency and thermal management. Somewhat in parallel, the semiconductor industry has developed techniques and standards around the specification, modeling, and verification of complex chips; these areas have not been addressed in detail by previous research surveys. This survey aims to bring these domains together and is composed of a systematic categorization of key aspects of building energy-efficient systems: (a) specification - the ability to precisely specify the power intent or properties at different layers, (b) modeling and simulation of the entire system or subsystem (hardware or software or both) so as to be able to perform what-if analysis, (c) techniques used for implementing energy efficiency at different levels of the stack, (d) verification techniques used to provide guarantees that the functionality of complex designs is preserved, and (e) energy efficiency standards and consortiums that aim to standardize different aspects of energy efficiency, including cross-layer optimizations.

16 citations


Cites background from "A scalable processing-in-memory acc..."

  • ...[6] propose Tesseract, a programmable PIM accelerator for large scale graph processing using 3D integration....

Proceedings ArticleDOI
14 Mar 2016
TL;DR: This work proposes buffered compares, a less-invasive processing-in-memory solution that can be used with existing processor-memory interfaces such as DDR3/4 with minimal changes, and shows that it significantly improves the performance and efficiency of the system on the tested workloads.
Abstract: We propose an approach called buffered compares, a less-invasive processing-in-memory solution that can be used with existing processor memory interfaces such as DDR3/4 with minimal changes. The approach is based on the observation that multi-bank architecture, a key feature of modern main memory DRAM devices, can be used to provide huge internal bandwidth without any major modification. We place a small buffer and a simple ALU per bank, define a set of new DRAM commands to fill the buffer and feed data to the ALU, and return the result for a set of commands (not for each command) to the host memory controller. By exploiting the under-utilized internal bandwidth using ‘compare-n-op’ operations, which are frequently used in many applications, we not only reduce the amount of energy-inefficient processor-memory communication, but also accelerate the computation of big data processing applications by utilizing parallelism of the buffered compare units in DRAM banks. Experimental results show that our solution significantly improves the performance and efficiency of the system on the tested workloads.
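As an illustration only, the sketch below emulates the "compare-n-op" idea in plain software: the table is partitioned across banks, each bank-local unit scans its own rows and returns a single aggregate, so only one result per bank (rather than every row) crosses the memory interface. The bank count, function names, and the match-count aggregation are assumptions for illustration; the actual proposal defines new DRAM commands and per-bank buffers/ALUs rather than software loops.

```cpp
// Software emulation of a "compare-n-op" offload: each bank compares its local
// rows against a key and returns only an aggregate (here, a match count),
// instead of streaming every row to the host. Illustrative sketch only.
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kBanks = 8;  // assumed bank count for the example

struct Bank {
    std::vector<uint32_t> rows;  // data resident in this bank
};

// Hypothetical per-bank buffered-compare unit: compare each element with the
// key, accumulate locally, and return a single value to the memory controller.
uint64_t bank_compare_count(const Bank& bank, uint32_t key) {
    uint64_t count = 0;
    for (uint32_t v : bank.rows)
        if (v == key) ++count;
    return count;
}

int main() {
    // Distribute a table across banks round-robin, as a DRAM address mapping might.
    std::vector<Bank> banks(kBanks);
    for (uint32_t i = 0; i < 1024; ++i) banks[i % kBanks].rows.push_back(i % 7);

    uint32_t key = 3;
    uint64_t total = 0;
    for (const Bank& b : banks) total += bank_compare_count(b, key);  // one result per bank
    std::printf("rows equal to %u: %llu\n", key, static_cast<unsigned long long>(total));
    return 0;
}
```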

16 citations

Journal ArticleDOI
Wonjun Lee, Chang Hyun Kim, Yoonah Paik, Jongsun Park, Il Park, Seon Wook Kim
TL;DR: This paper shows how to design and operate the PIM computing units inside DRAM by effectively coordinating with standard DRAM operations while achieving the full computing performance and minimizing the implementation cost.
Abstract: The computing domain of today's computer systems is moving very fast from arithmetic to data processing as data volumes grow exponentially. As a result, processing-in-memory (PIM) studies have been actively conducted to support data processing in or near memory devices, addressing the limited bandwidth and high power consumption caused by data movement between the CPU/GPU and memory. However, most PIM studies so far have placed the processing units only as an accelerator on the base die of 3D-stacked DRAM, rather than inside the memory devices, and they do not service standard DRAM requests during PIM execution. Therefore, in this paper, we show how to design and operate PIM computing units inside DRAM by effectively coordinating with standard DRAM operations, while achieving the full computing performance and minimizing the implementation cost. To achieve these goals, we extend a standard DRAM state diagram to describe the PIM behaviors in the same way standard DRAM commands are scheduled and operated on the DRAM devices, and we exploit several levels of parallelism to overlap memory and computing operations. We also present how the entire architecture stack, from applications to operating systems, memory controllers, and PIM devices, should work together for effective execution, by applying our approaches to our experimental platform. On our HBM2-based experimental platform, which includes 16-cycle MAC (Multiply-and-Add) units and 8-cycle reducers for matrix-vector multiplication, we achieved 406% and 35.2% faster performance with the all-bank and per-bank schedulings, respectively, on a (1024×1024) × (1024×1) 8-bit integer matrix-vector multiplication, compared with executing only its operand burst reads at the full external DRAM bandwidth. It should be noted that the performance of PIM on a base die of a 3D-stacked memory cannot be better than that provided by the full bandwidth in any case.
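To make the per-bank view of that workload concrete, the sketch below stripes the rows of an 8-bit integer matrix-vector multiplication across banks, with each bank's MAC unit producing the dot products for the rows it holds. The bank count, row-striped layout, and 32-bit accumulator are assumptions for illustration, not the paper's design.

```cpp
// Per-bank view of an int8 matrix-vector multiplication: matrix rows are striped
// across banks, each bank's MAC unit computes dot products for its own rows,
// and the host only collects the finished outputs. Sizes are illustrative.
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kBanks = 16;
constexpr int kRows = 1024;
constexpr int kCols = 1024;

int main() {
    std::vector<std::vector<int8_t>> matrix(kRows, std::vector<int8_t>(kCols, 1));
    std::vector<int8_t> vec(kCols, 2);
    std::vector<int32_t> result(kRows, 0);

    // "All-bank" style pass: every bank works on its own stripe of rows.
    for (int bank = 0; bank < kBanks; ++bank) {
        for (int row = bank; row < kRows; row += kBanks) {   // rows owned by this bank
            int32_t acc = 0;                                  // MAC accumulator
            for (int col = 0; col < kCols; ++col)
                acc += static_cast<int32_t>(matrix[row][col]) * vec[col];
            result[row] = acc;
        }
    }
    std::printf("result[0] = %d (expected %d)\n", result[0], 2 * kCols);
    return 0;
}
```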

16 citations


Cites background or result from "A scalable processing-in-memory acc..."

  • ...The performance of the previous studies implementing PIM on a base die of a 3D-stacked memory [19]–[21] cannot be better than that provided by the external full memory bandwidth...

  • ...First, the standard memory commands need to be neither blocked nor handled differently during the PIM execution; thus, at any time during the PIM computation, we can service high priority standard memory requests and naturally satisfy their performance requirement, which was not presented in the previous PIM studies [19], [21]–[23]....

  • ...Tesseract [21] focused on the scalability of PIM memory for large-scale graph analysis [32], [33], [51]....

  • ...the standard memory requests are assumed to be not received when the PIM operation is in progress [19], [21]–[23], [42]....

References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....

Journal Article
TL;DR: Google as discussed by the authors is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Journal ArticleDOI
TL;DR: This work presents a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan-Lin (KL) algorithm for refining during uncoarsening.
Abstract: Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [Bui and Jones, Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445-452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, NM, 1993]. From the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan-Lin (KL) algorithm for refining during uncoarsening. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.
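The heavy-edge heuristic mentioned in the abstract is simple enough to sketch: visit vertices in some order and match each unmatched vertex with the unmatched neighbor connected by the heaviest edge; matched pairs are then collapsed to form the coarser graph. The code below is a generic illustration of that matching step under those assumptions, not METIS's implementation.

```cpp
// Heavy-edge matching: the coarsening step pairs each vertex with the unmatched
// neighbor reachable over its heaviest edge. Generic sketch, not METIS's code.
#include <cstdio>
#include <utility>
#include <vector>

// adj[v] = list of (neighbor, edge weight)
std::vector<int> heavy_edge_matching(const std::vector<std::vector<std::pair<int, int>>>& adj) {
    int n = static_cast<int>(adj.size());
    std::vector<int> match(n, -1);  // match[v] = partner vertex, or -1 while unvisited
    for (int v = 0; v < n; ++v) {
        if (match[v] != -1) continue;
        int best = -1, best_w = -1;
        for (auto [u, w] : adj[v])
            if (match[u] == -1 && u != v && w > best_w) { best = u; best_w = w; }
        if (best != -1) { match[v] = best; match[best] = v; }
        else match[v] = v;  // no available neighbor: the vertex is matched with itself
    }
    return match;
}

int main() {
    // 4-vertex example: heavy edge 0-1 (w=5), lighter edges 0-2 (1), 2-3 (4), 1-3 (2)
    std::vector<std::vector<std::pair<int, int>>> adj = {
        {{1, 5}, {2, 1}},
        {{0, 5}, {3, 2}},
        {{0, 1}, {3, 4}},
        {{2, 4}, {1, 2}},
    };
    auto match = heavy_edge_matching(adj);
    for (int v = 0; v < 4; ++v) std::printf("vertex %d matched with %d\n", v, match[v]);
    return 0;
}
```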

5,629 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in a recent previous work [51]....

  • ...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....

Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
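Since the Tesseract evaluation uses Pin as the simulator frontend (see the citation context below), a concrete sense of what a Pintool looks like may help. The snippet below is modeled on the canonical instruction-counting example from Pin's documentation; it is a sketch rather than anything from the Tesseract toolchain and must be built against the Pin kit to run.

```cpp
// Minimal Pintool modeled on Pin's classic instruction-count example:
// insert a call before every instruction and report the total at exit.
#include <iostream>
#include "pin.H"

static UINT64 icount = 0;

static VOID DoCount() { icount++; }

// Called by Pin for every instruction as it is first encountered (at JIT time).
static VOID Instruction(INS ins, VOID* v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)DoCount, IARG_END);
}

// Called when the instrumented application exits.
static VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Executed instructions: " << icount << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;          // parse Pin/tool command line
    INS_AddInstrumentFunction(Instruction, 0);   // register instrumentation routine
    PIN_AddFiniFunction(Fini, 0);                // register exit callback
    PIN_StartProgram();                          // never returns
    return 0;
}
```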

4,019 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....

Proceedings ArticleDOI
06 Jun 2010
TL;DR: A model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers, whose implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
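The vertex-centric model described above boils down to a per-vertex compute step invoked every superstep: consume incoming messages, optionally update state, send messages along outgoing edges, and vote to halt when nothing changes. The sketch below is a single-process emulation of that pattern for single-source shortest paths; the message-passing plumbing is hypothetical and does not reflect Pregel's actual (unpublished) C++ API.

```cpp
// Single-process emulation of a Pregel-style superstep loop for single-source
// shortest paths: each active vertex combines its incoming messages, updates its
// distance if improved, and messages its neighbors; otherwise it votes to halt.
#include <climits>
#include <cstdio>
#include <utility>
#include <vector>

struct Edge { int dst; int weight; };

int main() {
    // 4-vertex example graph; compute shortest paths from vertex 0.
    std::vector<std::vector<Edge>> out = {
        {{1, 1}, {2, 4}}, {{2, 2}, {3, 6}}, {{3, 3}}, {}};
    std::vector<int> dist(out.size(), INT_MAX);
    std::vector<std::vector<int>> inbox(out.size());
    inbox[0].push_back(0);  // seed message to the source vertex

    bool any_message = true;
    while (any_message) {                                    // one loop iteration = one superstep
        any_message = false;
        std::vector<std::vector<int>> next_inbox(out.size());
        for (size_t v = 0; v < out.size(); ++v) {
            int best = dist[v];
            for (int m : inbox[v]) if (m < best) best = m;   // combine incoming messages
            if (best < dist[v]) {                            // improved: update and message neighbors
                dist[v] = best;
                for (const Edge& e : out[v]) {
                    next_inbox[e.dst].push_back(best + e.weight);
                    any_message = true;
                }
            }                                                // otherwise this vertex votes to halt
        }
        inbox = std::move(next_inbox);
    }
    for (size_t v = 0; v < dist.size(); ++v) std::printf("dist[%zu] = %d\n", v, dist[v]);
    return 0;
}
```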

3,840 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....

  • ...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....
