Proceedings ArticleDOI
NDMiner: accelerating graph pattern mining using near data processing
TL;DR: This paper presents NDMiner, a Near Data Processing (NDP) architecture that improves the performance of GPM workloads, along with a new graph remapping scheme in memory and a hardware-based set-operation reordering technique to optimize bank-, rank-, and channel-level parallelism in DRAM.
Abstract:
Graph Pattern Mining (GPM) algorithms mine structural patterns in graphs. The performance of GPM workloads is bottlenecked by control flow and memory stalls. This is because of data-dependent branches used in set intersection and difference operations that dominate the execution time. This paper first conducts a systematic GPM workload analysis and uncovers four new observations to inform the optimization effort. First, GPM workloads mostly fetch inputs of costly set operations from different memory banks. Second, to avoid redundant computation, modern GPM workloads employ symmetry breaking that discards several data reads, resulting in cache pollution and wasted DRAM bandwidth. Third, sparse pattern mining algorithms perform redundant memory reads and computations. Fourth, GPM workloads do not fully utilize the in-DRAM data parallelism. Based on these observations, this paper presents NDMiner, a Near Data Processing (NDP) architecture that improves the performance of GPM workloads. To reduce in-memory data transfer when fetching data from different memory banks, NDMiner integrates compute units into the buffer chip of DRAM to offload set operations. To alleviate the memory bandwidth wasted by symmetry breaking, NDMiner integrates a load elision unit in hardware that detects the satisfiability of symmetry breaking constraints and terminates unnecessary loads. To optimize the performance of sparse pattern mining, NDMiner employs compiler optimizations and maps reduced reads and composite computation to NDP hardware, improving the algorithmic efficiency of sparse GPM. Finally, NDMiner proposes a new graph remapping scheme in memory and a hardware-based set-operation reordering technique to best optimize bank-, rank-, and channel-level parallelism in DRAM. To orchestrate NDP computation, this paper presents design modifications at the host ISA, compiler, and memory controller.
We compare the performance of NDMiner with state-of-the-art software and hardware baselines using a mix of dense and sparse GPM algorithms. Our evaluation shows that NDMiner significantly outperforms software and hardware baselines by 6.4X and 2.5X, on average, while incurring a negligible area overhead on CPU and DRAM.
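The set-operation bottleneck and symmetry-breaking waste described in the abstract can be illustrated with a minimal sketch, a software triangle-counting kernel in Python over an illustrative example graph. This is not NDMiner's implementation; it only shows the neighbor-set intersections that dominate GPM execution time and the symmetry-breaking constraint (w > v > u) whose early checks NDMiner's load elision unit exploits in hardware.

```python
def count_triangles(adj):
    """Count triangles in an undirected graph.

    adj: dict mapping each vertex to the set of its neighbors.
    """
    count = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                # Symmetry breaking: this branch contributes nothing new.
                # In software, v's neighbor list may already have been
                # fetched before the check fails -- the wasted DRAM reads
                # that a hardware load elision unit can terminate early.
                continue
            # This set intersection is the costly operation that NDMiner
            # offloads to compute units in the DRAM buffer chip.
            common = adj[u] & adj[v]
            # Second symmetry-breaking constraint (w > v) so each
            # triangle {u, v, w} is counted exactly once.
            count += sum(1 for w in common if w > v)
    return count

# Illustrative graph with two triangles: {0, 1, 2} and {0, 2, 3}.
graph = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}
print(count_triangles(graph))  # prints 2
```

Larger patterns (4-cliques, motifs) nest more of these intersection and difference operations, which is why set operations and their data-dependent branches dominate GPM runtime.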
Citations
Proceedings ArticleDOI
Mint: An Accelerator For Mining Temporal Motifs
Nishil Talati, Haojie Ye, Sanketh Vedula, Kuan-Yu Chen, Yuhan Chen, Daniel Liu, Yichao Yuan, David Blaauw, Alex C. Bronstein, Trevor Mudge, Ronald G. Dreslinski, et al.
TL;DR: In this paper, the authors propose a task-centric programming model that enables decoupled, asynchronous execution by keeping task context information on-chip, and design a domain-specific hardware accelerator around its data path and memory subsystem.
Journal ArticleDOI
Software Systems Implementation and Domain-Specific Architectures towards Graph Analytics
Hai Jin, Hao Qi, Jin Zhao, Xinyu Jiang, Yu Huang, Chuangyi Gui, Qinggang Wang, Xinyang Shen, Yi Zhang, Ao Hu, Dan Chen, Chaoqiang Liu, Haifeng Liu, Haiheng He, Xiangyu Ye, Runze Wang, Jingrui Yuan, Pengcheng Yao, Yu Zhang, Long Zheng, Xiaofei Liao, et al.
TL;DR: In this article, the authors discuss the future challenges of graph analytics and present several programming models, execution modes, and messaging strategies to improve the utilization of traditional hardware and the performance of graph applications.
Proceedings Article
Arya: Arbitrary Graph Pattern Mining with Decomposition-based Sampling
Zeying Zhu, Kan Wu, Zaoxing Liu, et al.
TL;DR: Arya combines graph decomposition theory with edge-sampling-based approximation to reduce the complexity of mining complex patterns on graphs with up to tens of billions of edges, a scale previously possible only on supercomputers.
Journal ArticleDOI
PIMMiner: A High-performance PIM Architecture-aware Graph Mining Framework
Jiya Su, Peng Jiang, Rujia Wang, et al.
TL;DR: PIMMiner is a high-performance, PIM-architecture-aware graph mining framework that enhances locality and internal bandwidth utilization, and reduces remote bank accesses and load imbalance, through cohesive algorithm-architecture co-design.
Proceedings ArticleDOI
Shogun: A Task Scheduling Framework for Graph Mining Accelerators
TL;DR: Shogun enables adaptive, locality-aware, out-of-order task scheduling by deploying a task tree to decouple the task-generation and execution pipeline stages; it further develops accelerator optimizations, including task-tree splitting for load balance and search-tree merging to explore multiple search trees in parallel on one PE.
References
Journal ArticleDOI
ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, Vivek Srikumar, et al.
TL;DR: This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
Journal ArticleDOI
PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory
TL;DR: This work proposes a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory, and distinguishes itself from prior work on NN acceleration with significant performance improvement and energy saving.
Proceedings ArticleDOI
A scalable processing-in-memory accelerator for parallel graph processing
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Journal ArticleDOI
A case for intelligent RAM
David A. Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christos Kozyrakis, R. Thomas, Katherine Yelick, et al.
TL;DR: This work reviews the state of microprocessors and DRAMs today, explores some of the opportunities and challenges for IRAMs, and estimates the performance and energy efficiency of three IRAM designs.
Journal ArticleDOI
Ramulator: A Fast and Extensible DRAM Simulator
TL;DR: This paper presents Ramulator, a fast and cycle-accurate DRAM simulator that is built from the ground up for extensibility, and is able to provide out-of-the-box support for a wide array of DRAM standards.