Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

TLDR
This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract
The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
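The abstract's key mechanism is shipping computation to the memory partition that owns the data instead of pulling the data across the memory bus. Below is a minimal sketch of that idea; the names (pim_put, pim_barrier) are hypothetical stand-ins modeled on the put/barrier remote-function-call interface the paper describes, not its exact API, and the stubs execute locally only so the sketch compiles.

```cpp
// Hypothetical Tesseract-style vertex program (assumed names, not the
// paper's exact API). On real hardware pim_put would enqueue the call at
// the vault owning `dst`; the stub below just runs it locally.
#include <cstddef>
#include <vector>

struct Vertex { double rank = 1.0, next = 0.0; int out_degree = 1; };

template <typename F>
void pim_put(Vertex* dst, F remote_fn, double arg) { remote_fn(dst, arg); }
void pim_barrier() { /* global synchronization across all PIM cores */ }

// Runs on the core that owns *w, so the update needs no lock.
void add_contribution(Vertex* w, double c) { w->next += c; }

// One PageRank superstep over the vertices local to this core: computation
// moves to the data, so no remote vertex is ever read over the memory bus.
void superstep(std::vector<Vertex>& local,
               const std::vector<std::vector<Vertex*>>& out_edges,
               std::size_t n_total) {
    for (std::size_t v = 0; v < local.size(); ++v) {
        double c = 0.85 * local[v].rank / local[v].out_degree;
        for (Vertex* dst : out_edges[v])
            pim_put(dst, add_contribution, c);   // may target a remote vault
    }
    pim_barrier();                               // all remote calls delivered
    for (Vertex& v : local) { v.rank = 0.15 / n_total + v.next; v.next = 0.0; }
}
```

Because each vertex is updated only by the core owning its partition, the increment in add_contribution needs no synchronization, which is what lets bandwidth and performance scale with the number of memory partitions.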



Citations
Proceedings ArticleDOI

Identifying the potential of near data processing for apache spark

TL;DR: The case for an NDP architecture comprising programmable-logic-based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark is built through extensive profiling of Apache Spark workloads on an Ivy Bridge server.
Proceedings ArticleDOI

SpZip: architectural support for effective data compression in irregular applications

TL;DR: SpZip is an architectural approach that makes data compression practical for irregular applications, such as graph analytics and sparse linear algebra, which exhibit frequent indirect, data-dependent accesses to single elements or short sequences of elements that cause high main-memory traffic and limit performance.
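The traffic SpZip targets comes from traversing compressed, pointer-like structures. Purely as an illustration of the kind of compression such systems exploit (this is generic delta-plus-varint coding of sorted adjacency lists, not SpZip's actual format or hardware), a neighbor stream can be shrunk like this:

```cpp
// Sorted neighbor lists have small gaps, so delta + variable-length byte
// (LEB128-style) encoding shrinks the indirect-access stream.
#include <cstdint>
#include <vector>

// Encode one sorted neighbor list as varint-coded deltas.
std::vector<uint8_t> compress(const std::vector<uint32_t>& neigh) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t v : neigh) {
        uint32_t gap = v - prev;          // gaps are small for sorted lists
        prev = v;
        while (gap >= 0x80) {             // 7 bits per byte, MSB = "more"
            out.push_back(uint8_t(gap) | 0x80);
            gap >>= 7;
        }
        out.push_back(uint8_t(gap));
    }
    return out;
}

std::vector<uint32_t> decompress(const std::vector<uint8_t>& in) {
    std::vector<uint32_t> neigh;
    uint32_t prev = 0, gap = 0;
    int shift = 0;
    for (uint8_t b : in) {
        gap |= uint32_t(b & 0x7F) << shift;
        if (b & 0x80) { shift += 7; continue; }
        prev += gap;                      // undo the delta
        neigh.push_back(prev);
        gap = 0; shift = 0;
    }
    return neigh;
}
```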
Journal ArticleDOI

An Overview of Processing-in-Memory Circuits for Artificial Intelligence and Machine Learning

TL;DR: This paper presents a comprehensive investigation of state-of-the-art PIM research based on various memory device types, such as static random-access memory (SRAM), dynamic random-access memory (DRAM), and resistive memory (ReRAM), and overviews PIM designs in each memory type, covering bit cells, circuits, and architecture.
Proceedings ArticleDOI

Concurrent Data Structures with Near-Data-Processing: an Architecture-Aware Implementation

TL;DR: An empirical evaluation of several NDP-aware algorithms for general-purpose concurrent data structures such as linked lists, skiplists, and FIFO queues reveals that the potential benefits of NDP-based concurrent data structures are smaller than earlier studies had suggested.
Journal ArticleDOI

SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

TL;DR: This work focuses on the development of near-bank PIM designs that tightly couple a PIM core with each DRAM bank, exploiting bank-level parallelism to expose the high on-chip memory bandwidth of standard DRAM to processors.
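A sketch of the 1D row partitioning that such near-bank designs exploit, with std::thread standing in for per-bank PIM cores (an illustrative assumption of mine, not SparseP's code):

```cpp
// Each PIM core owns the CSR rows resident in its DRAM bank and produces
// the matching slice of y = A*x; matrix traffic stays bank-local, and only
// the (replicated) input vector x is shared.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct CSR {
    std::vector<size_t> row_ptr;   // size n_rows + 1
    std::vector<size_t> col_idx;
    std::vector<double> val;
};

// Work of one PIM core: rows [lo, hi).
void spmv_partition(const CSR& a, const std::vector<double>& x,
                    std::vector<double>& y, size_t lo, size_t hi) {
    for (size_t r = lo; r < hi; ++r) {
        double acc = 0.0;
        for (size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k)
            acc += a.val[k] * x[a.col_idx[k]];
        y[r] = acc;
    }
}

void spmv(const CSR& a, const std::vector<double>& x,
          std::vector<double>& y, size_t n_cores) {
    size_t n = a.row_ptr.size() - 1, chunk = (n + n_cores - 1) / n_cores;
    std::vector<std::thread> cores;
    for (size_t c = 0; c < n_cores; ++c) {
        size_t lo = c * chunk, hi = std::min(n, lo + chunk);
        if (lo < hi)
            cores.emplace_back([&, lo, hi] { spmv_partition(a, x, y, lo, hi); });
    }
    for (auto& t : cores) t.join();
}
```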
References
Journal ArticleDOI

The anatomy of a large-scale hypertextual Web search engine

Sergey Brin, +1 more - 01 Jan 1998

TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext; it is designed to crawl and index the Web efficiently, produce much more satisfying search results than existing systems, and deal effectively with uncontrolled hypertext collections where anyone can publish anything they want.
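For reference, the ranking recurrence at the heart of this paper is PageRank, commonly stated as below, where d is the damping factor (typically 0.85), N the number of pages, in(u) the set of pages linking to u, and C(v) the out-degree of v (the original paper writes it without the 1/N normalization):

$$\mathrm{PR}(u) = \frac{1-d}{N} + d \sum_{v \in \mathrm{in}(u)} \frac{\mathrm{PR}(v)}{C(v)}$$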
Journal ArticleDOI

A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs

TL;DR: This work presents a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan--Lin (KL) algorithm for refining during uncoarsening.
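A simplified sketch of the heavy-edge heuristic named above (Metis' actual implementation differs in details such as tie-breaking and vertex-weight limits): visit vertices in random order and match each unmatched vertex with the unmatched neighbor joined by the heaviest edge, so coarsening collapses as much edge weight as possible.

```cpp
// Heavy-edge matching for one coarsening level (simplified illustration).
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Edge { size_t to; int weight; };

// Returns match[v] = partner of v (or v itself if left unmatched).
std::vector<size_t> heavy_edge_matching(const std::vector<std::vector<Edge>>& g) {
    size_t n = g.size();
    std::vector<size_t> order(n), match(n);
    for (size_t v = 0; v < n; ++v) { order[v] = v; match[v] = v; }
    std::shuffle(order.begin(), order.end(), std::mt19937{42});

    std::vector<bool> matched(n, false);
    for (size_t v : order) {
        if (matched[v]) continue;
        size_t best = v;
        int best_w = -1;
        for (const Edge& e : g[v])        // heaviest still-unmatched neighbor
            if (!matched[e.to] && e.to != v && e.weight > best_w) {
                best = e.to;
                best_w = e.weight;
            }
        if (best != v) {                  // v and best collapse at the next level
            matched[v] = matched[best] = true;
            match[v] = best;
            match[best] = v;
        }
    }
    return match;
}
```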
Journal ArticleDOI

Pin: building customized program analysis tools with dynamic instrumentation

TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
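For flavor, a complete Pintool is only a few lines; the sketch below is condensed from Pin's classic instruction-counting example. An instrumentation routine decides where analysis calls are injected when code is first decoded, and the analysis routine then runs at execution time.

```cpp
// Minimal instruction-counting Pintool (after Pin's inscount example).
#include "pin.H"
#include <iostream>

static UINT64 icount = 0;

VOID docount() { icount++; }             // analysis: runs per executed instruction

VOID Instruction(INS ins, VOID* v) {     // instrumentation: runs once per
    INS_InsertCall(ins, IPOINT_BEFORE,   // instruction when first decoded
                   (AFUNPTR)docount, IARG_END);
}

VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Executed " << icount << " instructions" << std::endl;
}

int main(int argc, char* argv[]) {
    PIN_Init(argc, argv);                // parse Pin's command line
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                  // never returns
    return 0;
}
```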
Proceedings ArticleDOI

Pregel: a system for large-scale graph processing

TL;DR: A model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.
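A single-node sketch of the superstep semantics described above (class and method names are placeholders of mine; the real system is a distributed C++ framework): each active vertex consumes the messages sent to it in the previous superstep, updates its value, sends messages along its out-edges, and votes to halt; the run ends when every vertex is halted and no messages are in flight.

```cpp
// Emulation of one Pregel-style superstep (illustrative, not Pregel's API).
#include <cstddef>
#include <vector>

struct VertexState {
    double value = 0.0;
    std::vector<size_t> out_edges;
    bool halted = false;
};

// `inbox[v]` holds messages addressed to v; returns the next superstep's inbox.
std::vector<std::vector<double>>
superstep(std::vector<VertexState>& vertices,
          const std::vector<std::vector<double>>& inbox) {
    std::vector<std::vector<double>> outbox(vertices.size());
    for (size_t v = 0; v < vertices.size(); ++v) {
        if (vertices[v].halted && inbox[v].empty()) continue; // stays asleep
        vertices[v].halted = false;                           // message wakes it
        double sum = 0.0;
        for (double m : inbox[v]) sum += m;                   // Compute()
        vertices[v].value += sum;
        for (size_t dst : vertices[v].out_edges)              // SendMessageTo()
            outbox[dst].push_back(vertices[v].value /
                                  vertices[v].out_edges.size());
        vertices[v].halted = true;                            // VoteToHalt()
    }
    return outbox;
}
```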