Proceedings ArticleDOI
A scalable processing-in-memory accelerator for parallel graph processing
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, Kiyoung Choi
Vol. 43, Iss. 3, pp. 105–117
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract:
The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. 
Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
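The abstract describes per-partition computation with an efficient method of communication between memory partitions. The core idea can be illustrated with a small sketch; all names here (`Partition`, `put`, `barrier`) are hypothetical stand-ins for the non-blocking message-passing style the paper describes, not Tesseract's actual hardware interface.

```python
# Hypothetical sketch of Tesseract-style partitioned graph processing.
# Each memory partition owns a slice of the vertices; work on a remote
# vertex is shipped to its owner as a non-blocking "put" message instead
# of a remote memory read, and a barrier separates message delivery from
# the next computation phase.

class Partition:
    def __init__(self, pid):
        self.pid = pid
        self.vertices = {}      # vertex id -> value owned by this partition
        self.inbox = []         # queued remote function calls

    def put(self, vid, func, arg):
        # Non-blocking remote call: queued now, executed at the barrier.
        self.inbox.append((vid, func, arg))

    def drain(self):
        for vid, func, arg in self.inbox:
            self.vertices[vid] = func(self.vertices[vid], arg)
        self.inbox.clear()


def barrier(partitions):
    # All queued remote calls complete before the next phase begins.
    for p in partitions:
        p.drain()


# Toy usage: two partitions, four counter vertices; each edge sends +1
# to its destination vertex on whichever partition owns it.
parts = [Partition(0), Partition(1)]
parts[0].vertices = {0: 0, 1: 0}
parts[1].vertices = {2: 0, 3: 0}
owner = {0: 0, 1: 0, 2: 1, 3: 1}

edges = [(0, 2), (1, 3), (2, 1), (3, 0)]
for src, dst in edges:
    parts[owner[dst]].put(dst, lambda v, delta: v + delta, 1)
barrier(parts)

print(parts[1].vertices[2])   # vertex 2 received one message -> 1
```

The point of the non-blocking style is that a partition never stalls waiting for remote data: updates travel to where the data lives, which is what lets performance scale with the number of memory partitions.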
Citations
Posted Content
Decoupling GPU Programming Models from Resource Management for Enhanced Programming Ease, Portability, and Performance
Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, Onur Mutlu
TL;DR: Zorua is a new resource virtualization framework that decouples the programmer-specified resource usage of a GPU application from the actual allocation in the on-chip hardware resources, and enables this decoupling by virtualizing each resource transparently to the programmer.
Journal ArticleDOI
RevaMp3D: Architecting the Processor Core and Cache Hierarchy for Systems with Monolithically-Integrated Logic and Memory
Nika Mansouri Ghiasi, Mohammad Sadrosadati, Geraldo F. Oliveira, Konstantinos Kanellopoulos, Rachata Ausavarungnirun, Juan Gómez Luna, Aditya Manglik, João Dinis Ferreira, Jeremie S. Kim, Christina Giannoula, Nandita Vijaykumar, Ji-Soo Park, Onur Mutlu
TL;DR: It is shown that in M3D systems the performance and energy bottlenecks shift from main memory to the processor core and memory hierarchy; the goal is to redesign the core and cache hierarchy, given the fundamentally new trade-offs of M3D technology, to benefit a wide range of workloads.
Posted Content
Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips
Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, Onur Mutlu
TL;DR: Flexible-LatencY DRAM is proposed, a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance: it exploits the spatial locality of slower cells within DRAM to access the faster DRAM regions with reduced latency for the fundamental operations.
Journal ArticleDOI
Optimizing Vertex Pressure Dynamic Graph Partitioning in Many-Core Systems
Andrew McCrabb, Valeria Bertacco
TL;DR: This work examines the effectiveness and efficiency of different vertex-pressure repartitioning schemes, which move vertices so as to co-locate them near their most relevant neighbors, and indicates that optimized dynamic repartitioning techniques can often provide over 2x performance speedup over state-of-the-art static solutions.
Proceedings ArticleDOI
Gzippo
Xing Li, Rachata Ausavarungnirun, Xiao Liu, Xue Liu, Xuan Zhang, Heng Lu, Zhuoran Song, Naifeng Jing, Xiaoyao Liang
TL;DR: Gzippo as discussed by the authors employs a tandem-isomorphic crossbar architecture both to eliminate redundant searches and sequential indexing during iterations, and to remove sparsity that leads to non-effective computation on zero values.
References
Journal ArticleDOI
The anatomy of a large-scale hypertextual Web search engine
Sergey Brin, Lawrence Page
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext, and looks at the problem of how to deal effectively with uncontrolled hypertext collections where anyone can publish anything they want.
Journal ArticleDOI
A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs
George Karypis, Vipin Kumar
TL;DR: This work presents a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan–Lin (KL) algorithm for refining during uncoarsening.
Journal ArticleDOI
Pin: building customized program analysis tools with dynamic instrumentation
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, Kim Hazelwood
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Proceedings ArticleDOI
Pregel: a system for large-scale graph processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski
TL;DR: A model for processing large graphs is presented, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.
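The vertex-centric, superstep-based model this TL;DR summarizes can be sketched compactly. The following is a minimal single-machine sketch of message-passing PageRank in that style; function and variable names are illustrative, not Pregel's actual C++ API.

```python
# Sketch of a Pregel-style computation: in each superstep, every vertex
# combines the messages it received in the previous superstep into a new
# value, then sends messages along its out-edges; a global barrier
# (here, the end of the loop body) separates supersteps.

DAMPING = 0.85

def pagerank_supersteps(out_edges, num_steps):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    messages = {v: [] for v in out_edges}

    for _ in range(num_steps):
        # Compute phase: fold incoming messages into a new rank.
        for v in out_edges:
            incoming = sum(messages[v])
            rank[v] = (1 - DAMPING) / n + DAMPING * incoming
        # Message phase: scatter rank shares to out-neighbors,
        # to be consumed in the next superstep.
        messages = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    messages[t].append(share)
    return rank

# Toy usage: a 3-cycle, where by symmetry ranks converge toward 1/3 each.
graph = {0: [1], 1: [2], 2: [0]}
ranks = pagerank_supersteps(graph, 100)
print(ranks[0])
```

Because vertices only interact through messages delivered at superstep boundaries, there is no shared mutable state between vertices within a superstep, which is the property that makes the model easy to distribute and to reason about.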