(PDF) A scalable processing-in-memory accelerator for parallel graph processing (2015) | Junwhan Ahn

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation

[...]

Kevin Hsieh¹, Samira Khan², Nandita Vijaykumar¹, Kevin K. Chang¹, Amirali Boroumand¹, Saugata Ghose¹, Onur Mutlu³ - Show less +3 more•Institutions (3)

Carnegie Mellon University¹, University of Virginia², ETH Zurich³

01 Oct 2016

TL;DR: The In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal and addresses the key challenges of how to achieve high parallelism in the presence of serial accesses in pointer chasing, and how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit.

...read moreread less

Abstract: Pointer chasing is a fundamental operation, used by many important data-intensive applications (e.g., databases, key-value stores, graph processing workloads) to traverse linked data structures. This operation is both memory bound and latency sensitive, as it (1) exhibits irregular access patterns that cause frequent cache and TLB misses, and (2) requires the data from every memory access to be sent back to the CPU to determine the next pointer to access. Our goal is to accelerate pointer chasing by performing it inside main memory, thereby avoiding inefficient and high-latency data transfers between main memory and the CPU. To this end, we propose the In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal.

...read moreread less

205 citations

Cites background from "A scalable processing-in-memory acc..."

...IMPICA addresses the key challenges of (1) how to achieve high parallelism in the presence of serial accesses in pointer chasing, and (2) how to effectively perform virtualto-physical address translation on the memory side without requiring expensive accesses to the CPU’s memory management unit....
[...]
...IMPICA also significantly reduces overall system energy consumption (by 41%, 23%, and 10% for the three commonly-used data structures, and by 6% for DBx1000)....
[...]
...We then discuss opportunities for acceleration within 3D-stacked memory....
[...]

Proceedings Article•DOI•

Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities

[...]

Ashutosh Pattnaik¹, Xulong Tang¹, Adwait Jog², Onur Kayiran³, Asit K. Mishra⁴, Mahmut Kandemir¹, Onur Mutlu⁵, Chita R. Das¹ - Show less +4 more•Institutions (5)

Pennsylvania State University¹, College of William & Mary², Advanced Micro Devices³, Intel⁴, Carnegie Mellon University⁵

11 Sep 2016

TL;DR: Two new runtime techniques are developed: a regression-based affinity prediction model and mechanism that accurately identifies which kernels would benefit from PIM and offloads them to GPU cores in memory, and a concurrent kernel management mechanism that uses the affinity Prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on main GPU cores and the GPU core in memory.

...read moreread less

Abstract: Processing data in or near memory (PIM), as opposed to in conventional computational units in a processor, can greatly alleviate the performance and energy penalties of data transfers from/to main memory. Graphics Processing Unit (GPU) architectures and applications, where main memory bandwidth is a critical bottleneck, can benefit from the use of PIM. To this end, an application should be properly partitioned and scheduled to execute on either the main, powerful GPU cores that are far away from memory or the auxiliary, simple GPU cores that are close to memory (e.g., in the logic layer of 3D-stacked DRAM). This paper investigates two key code scheduling issues in such a GPU architecture that has PIM capabilities, to maximize performance and energy-efficiency: (1) how to automatically identify the code segments, or kernels, to be offloaded to the cores in memory, and (2) how to concurrently schedule multiple kernels on the main GPU cores and the auxiliary GPU cores in memory. We develop two new runtime techniques: (1) a regression-based affinity prediction model and mechanism that accurately identifies which kernels would benefit from PIM and offloads them to GPU cores in memory, and (2) a concurrent kernel management mechanism that uses the affinity prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on main GPU cores and the GPU cores in memory. Our experimental evaluations across 25 GPU applications demonstrate that these two techniques can significantly improve both application performance (by 25% and 42%, respectively, on average) and energy efficiency (by 28% and 27%).

...read moreread less

200 citations

Cites background from "A scalable processing-in-memory acc..."

..., Processing-In Memory (PIM) [3,4,27,33], also known as Processing-Near Memory (PNM) or Near-Data Computing (NDC) [13]....
[...]
...With the significant advances in adoption of 3D-stacked memory technology that tightly combines a logic layer and DRAM layers [3, 4, 48, 64, 71, 86, 109], this limitation has been overcome and PIM has become a likelyviable approach to improve system design....
[...]
...3D-stacked memory technology brings new dimensions and better feasibility to PIM-based architectures [3, 4, 14, 15, 27, 39, 41, 44, 70, 71, 73, 87, 109]....
[...]
...As we discussed in Section 1, 3D-stacked memory technology enables the ability to place computational units in the base logic layer that is underneath the memory stacks [3,4,48,64,71,86,109]....
[...]

Journal Article•DOI•

Fast Bulk Bitwise AND and OR in DRAM

[...]

Vivek Seshadri¹, Kevin Hsieh¹, Amirali Boroumand¹, Donghyuk Lee¹, Michael Kozuch², Onur Mutlu¹, Phillip B. Gibbons², Todd C. Mowry¹ - Show less +4 more•Institutions (2)

Carnegie Mellon University¹, Intel²

01 Jul 2015-IEEE Computer Architecture Letters

TL;DR: This work proposes a new and simple mechanism to implement bulk bitwise AND and OR operations in DRAM, which is faster and more efficient than existing mechanisms.

...read moreread less

Abstract: Bitwise operations are an important component of modern day programming, and are used in a variety of applications such as databases. In this work, we propose a new and simple mechanism to implement bulk bitwise AND and OR operations in DRAM, which is faster and more efficient than existing mechanisms. Our mechanism exploits existing DRAM operation to perform a bitwise AND/OR of two DRAM rows completely within DRAM. The key idea is to simultaneously connect three cells to a bitline before the sense-amplification. By controlling the value of one of the cells, the sense amplifier forces the bitline to the bitwise AND or bitwise OR of the values of the other two cells. Our approach can improve the throughput of bulk bitwise AND/OR operations by $9.7X$ and reduce their energy consumption by $50.5X$ . Since our approach exploits existing DRAM operation as much as possible, it requires negligible changes to DRAM logic. We evaluate our approach using a real-world implementation of a bit-vector based index for databases. Our mechanism improves the performance of commonly-used range queries by 30 percent on average.

...read moreread less

193 citations

Cites background from "A scalable processing-in-memory acc..."

..., [4,5,8,9,24]) have been proposed to exploit the logic layer to implement some computation close to DRAM....
[...]

Proceedings Article•DOI•

HRL: Efficient and flexible reconfigurable logic for near-data processing

[...]

Mingyu Gao¹, Christos Kozyrakis¹•Institutions (1)

Stanford University¹

12 Mar 2016

TL;DR: Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays, and achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.

...read moreread less

Abstract: The energy constraints due to the end of Dennard scaling, the popularity of in-memory analytics, and the advances in 3D integration technology have led to renewed interest in near-data processing (NDP) architectures that move processing closer to main memory. Due to the limited power and area budgets of the logic layer, the NDP compute units should be area and energy efficient while providing sufficient compute capability to match the high bandwidth of vertical memory channels. They should also be flexible to accommodate a wide range of applications. Towards this goal, NDP units based on fine-grained (FPGA) and coarse-grained (CGRA) reconfigurable logic have been proposed as a compromise between the efficiency of custom engines and the flexibility of programmable cores. Unfortunately, FPGAs incur significant area overheads for bit-level reconfiguration, while CGRAs consume significant power in the interconnect and are inefficient for irregular data layouts and control flows. This paper presents Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays. HRL combines both coarse-grained and fine-grained logic blocks, separates routing networks for data and control signals, and uses specialized units to effectively support branch operations and irregular data layouts in analytics workloads. HRL has the power efficiency of FPGA and the area efficiency of CGRA. It improves performance per Watt by 2.2x over FPGA and 1.7x over CGRA. For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.

...read moreread less

184 citations

Cites background or methods from "A scalable processing-in-memory acc..."

...For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application....
[...]
...CGRA arrays incur high power overheads due to the powerful interconnect for complicated data flow support [21], and are typically inefficient for irregular data and control flow patterns....
[...]

Proceedings Article•DOI•

GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition

[...]

Mingxing Zhang¹, Youwei Zhuo², Chao Wang², Mingyu Gao³, Yongwei Wu¹, Kang Chen¹, Christos Kozyrakis³, Xuehai Qian² - Show less +4 more•Institutions (3)

Tsinghua University¹, University of Southern California², Stanford University³

01 Feb 2018

TL;DR: It is argued that a PIM-based graph processing system should take data organization as a first-order design consideration and proposed GraphP, a novel HMC-based software/hardware co-designed graphprocessing system that drastically reduces communication and energy consumption compared to TESSERACT.

...read moreread less

Abstract: Processing-In-Memory (PIM) is an effective technique that reduces data movements by integrating processing units within memory. The recent advance of “big data” and 3D stacking technology make PIM a practical and viable solution for the modern data processing workloads. It is exemplified by the recent research interests on PIM-based acceleration. Among them, TESSERACT is a PIM-enabled parallel graph processing architecture based on Micron’s Hybrid Memory Cube (HMC), one of the most prominent 3D-stacked memory technologies. It implements a Pregel-like vertex-centric programming model, so that users could develop programs in the familiar interface while taking advantage of PIM. Despite the orders of magnitude speedup compared to DRAM-based systems, TESSERACT generates excessive crosscube communications through SerDes links, whose bandwidth is much less than the aggregated local bandwidth of HMCs. Our investigation indicates that this is because of the restricted data organization required by the vertex programming model. In this paper, we argue that a PIM-based graph processing system should take data organization as a first-order design consideration. Following this principle, we propose GraphP, a novel HMC-based software/hardware co-designed graph processing system that drastically reduces communication and energy consumption compared to TESSERACT. GraphP features three key techniques. 1) “Source-cut” partitioning, which fundamentally changes the cross-cube communication from one remote put per cross-cube edge to one update per replica. 2) “Two-phase Vertex Program”, a programming model designed for the “source-cut” partitioning with two operations: GenUpdate and ApplyUpdate. 3) Hierarchical communication and overlapping, which further improves performance with unique opportunities offered by the proposed partitioning and programming model. We evaluate GraphP using a cycle accurate simulator with 5 real-world graphs and 4 algorithms. The results show that it provides on average 1.7 speedup and 89% energy saving compared to TESSERACT.

...read moreread less

179 citations

Cites methods or result from "A scalable processing-in-memory acc..."

...[16] have tried to use METIS [17] to obtain a better partitioning for TESSERACT, but the result is not that promising....
[...]
...TESSERACT [16] is a PIM-enabled parallel graph processing architecture....
[...]
...In fact, the results in [16] confirms this observation: the bandwidth utilization of TESSERACT is usually less than 40%....
[...]

Collapse

A scalable processing-in-memory accelerator for parallel graph processing

Citations

Cites background from "A scalable processing-in-memory acc..."

Cites background from "A scalable processing-in-memory acc..."

Cites background from "A scalable processing-in-memory acc..."

Cites background or methods from "A scalable processing-in-memory acc..."

Cites methods or result from "A scalable processing-in-memory acc..."

References

"A scalable processing-in-memory acc..." refers methods in this paper

"A scalable processing-in-memory acc..." refers methods in this paper

"A scalable processing-in-memory acc..." refers methods in this paper

"A scalable processing-in-memory acc..." refers methods in this paper

Related Papers (5)