scispace - formally typeset
Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

TLDR
This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract
The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.

read more

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI

PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory

TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM based main memory, and distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving.
Proceedings ArticleDOI

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

TL;DR: PipeLayer is presented, a ReRAM-based PIM accelerator for CNNs that support both training and testing and proposes highly parallel design based on the notion of parallelism granularity and weight replication, which enables the highly pipelined execution of bothTraining and testing, without introducing the potential stalls in previous work.
Proceedings ArticleDOI

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

TL;DR: The hardware architecture and software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory, are presented and it is shown that despite the use of small SRAM buffers, the presence of3D memory simplifies dataflow scheduling for NN computations.
Proceedings ArticleDOI

Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology

TL;DR: Ambit is proposed, an Accelerator-in-Memory for bulk bitwise operations that largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area).
Journal ArticleDOI

Neurocube: a programmable digital neuromorphic architecture with high-density 3D memory

TL;DR: The basic architecture of the Neurocube is presented and an analysis of the logic tier synthesized in 28nm and 15nm process technologies are presented and the performance is evaluated through the mapping of a Convolutional Neural Network and estimating the subsequent power and performance for both training and inference.
References
More filters
Proceedings ArticleDOI

FlexRAM: Toward an advanced Intelligent Memory system

TL;DR: A PIM chip and a PIM-based memory system to satisfy requirements of general purpose and low programming cost and Evaluation of the system through simulations shows that 4 FlexRAM chips often allow a workstation to run 25-40 times faster.
Journal ArticleDOI

Near-Data Processing: Insights from a MICRO-46 Workshop

TL;DR: The many reasons why NDP is compelling today are described and key upcoming challenges in realizing the potential of NDP are identified.
Journal ArticleDOI

Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems

TL;DR: Extensive evaluations across a variety of modern memory-intensive GPU workloads show that TOM significantly improves performance compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
Proceedings ArticleDOI

Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture

TL;DR: The potential of PIM-based architectures in accelerating the performance of three irregular computations, sparse conjugate gradient, a natural-join database operation and an object-oriented database query are demonstrated.
Proceedings ArticleDOI

PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor

TL;DR: It is shown how 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing and that a PicoServer performs comparably to a Pentium 4-like class machine while consuming only about 1/10 of the power.
Related Papers (5)