scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

13 Jun 2015-Vol. 43, Iss: 3, pp 105-117
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.

Content maybe subject to copyright    Report

Citations
More filters
01 Jan 2017
TL;DR: This dissertation expands the study on vulnerability assessment of LLC through investigating the impact of Process Variations (PV) on narrow resistive sensing margins in high-density NVM arrays, including on-chip cache and primary memory, and proposes LI2 enhancement to attain high performance delivery in the post-Moore’s Law era.
Abstract: While technology scaling enables increased density for memory cells, the intrinsic high leakage power of conventional CMOS technology and the demand for reduced energy consumption inspires the use of emerging technology alternatives such as eDRAM and Non-Volatile Memory (NVM) including STT-MRAM, PCM, and RRAM The utilization of emerging technology in Last Level Cache (LLC) designs which occupies a signifcant fraction of total die area in Chip Multi Processors (CMPs) introduces new dimensions of vulnerability, energy consumption, and performance delivery To be specific, a part of this research focuses on eDRAM Bit Upset Vulnerability Factor (BUVF) to assess vulnerable portion of the eDRAM refresh cycle where the critical charge varies depending on the write voltage, storage and bit-line capacitance This dissertation broaden the study on vulnerability assessment of LLC through investigating the impact of Process Variations (PV) on narrow resistive sensing margins in high-density NVM arrays, including on-chip cache and primary memory Large-latency and power-hungry Sense Amplifers (SAs) have been adapted to combat PV in the past Herein, a novel approach is proposed to leverage the PV in NVM arrays using Self-Organized Sub-bank (SOS) design SOS engages the preferred SA alternative based on the intrinsic as-built behavior of the resistive sensing timing margin to reduce the latency and power consumption while maintaining acceptable access time On the other hand, this dissertation investigates a novel technique to prioritize the service to 1) Extensive Read Reused Accessed blocks of the LLC that are silently dropped from higher levels of cache, and 2) the portion of the working set that may exhibit distant re-reference interval in L2 In particular, we develop a lightweight Multi-level Access History Profiler to effciently identify ERRA blocks through aggregating the LLC block addresses tagged with identical Most Signifcant Bits into a single entry Experimental results indicate that the proposed technique can reduce the L2 read miss ratio by 517% on average across PARSEC and SPEC2006 workloads In addition, this dissertation will broaden and apply advancements in theories of subspace recovery to pioneer computationally-aware in-situ operand reconstruction via the novel Logic In Interconnect (LI2) scheme LI2 will be developed, validated, and re?ned both theoretically and experimentally to realize a radically different approach to post-Moore's Law computing by leveraging low-rank matrices features offering data reconstruction instead of fetching data from main memory to reduce energy/latency cost per data movement We propose LI2 enhancement to attain high performance delivery in the post-Moore's Law era through equipping the contemporary micro-architecture design with a customized memory controller which orchestrates the memory request for fetching low-rank matrices to customized Fine Grain Reconfigurable Accelerator (FGRA) for reconstruction while the other memory requests are serviced as before The goal of LI2…

3 citations


Cites methods from "A scalable processing-in-memory acc..."

  • ...Ahn et al. [8] tuned a PIM accelerator for large-scale graph processing called Tesseract....

    [...]

  • ...Apart from the advancements in designing high performance many-core processors, the progress of tightly stacking logic and memory elements in a package has enabled the researchers to redesign the conventional Processing-In-Memory (PIM) concept for in-memory big-data processing [8]....

    [...]

  • ...Regarding to the specialized memory access characteristics of graph processing workloads, new hardware prefetchers are utilized in Tesseract to enhance the utilization of memory bandwidth....

    [...]

  • ...[8] tuned a PIM accelerator for large-scale graph processing called Tesseract....

    [...]

  • ...% of cl ea n v ict im bl oc ks [1, 8) [8, 16) [16, 32) [32, 64) > 64 # of Re-reference:...

    [...]

Posted Content
Onur Mutlu1, Jeremie S. Kim1
TL;DR: A principled approach to memory reliability and security research is described and advocated that can enable us to better anticipate and prevent vulnerabilities in DRAM and other types of memories, as the memory technologies scale to higher densities.
Abstract: This retrospective paper describes the RowHammer problem in Dynamic Random Access Memory (DRAM), which was initially introduced by Kim et al. at the ISCA 2014 conference~\cite{rowhammer-isca2014}. RowHammer is a prime (and perhaps the first) example of how a circuit-level failure mechanism can cause a practical and widespread system security vulnerability. It is the phenomenon that repeatedly accessing a row in a modern DRAM chip causes bit flips in physically-adjacent rows at consistently predictable bit locations. RowHammer is caused by a hardware failure mechanism called {\em DRAM disturbance errors}, which is a manifestation of circuit-level cell-to-cell interference in a scaled memory technology. Researchers from Google Project Zero demonstrated in 2015 that this hardware failure mechanism can be effectively exploited by user-level programs to gain kernel privileges on real systems. Many other follow-up works demonstrated other practical attacks exploiting RowHammer. In this article, we comprehensively survey the scientific literature on RowHammer-based attacks as well as mitigation techniques to prevent RowHammer. We also discuss what other related vulnerabilities may be lurking in DRAM and other types of memories, e.g., NAND flash memory or Phase Change Memory, that can potentially threaten the foundations of secure systems, as the memory technologies scale to higher densities. We conclude by describing and advocating a principled approach to memory reliability and security research that can enable us to better anticipate and prevent such vulnerabilities.

3 citations

Journal ArticleDOI
TL;DR: In this article , a novel ReRAM-based accelerator, PattPIM, is proposed to achieve space compression and computation reuse by exploring DNN WPR based on practical ReRAM crossbars.
Abstract: Resistive random access memory (ReRAM)-based processing-in-memory (PIM) architecture has been designed to accelerate deep neural networks (DNNs) by concurring computation and memory barriers. To further improve memory and computation efficiency, the weight sparsity characteristic has been explored to optimize the ReRAM-based DNN accelerators. However, these designs only focus on compressing zero weights to eliminate ineffectual computation. In this article, we thoroughly analyze the weight distribution characteristics of several typical DNN models and observe many nonzero weight pattern repetitions (WPRs). Therefore, there is an opportunity to further improve the performance and energy efficiency by reusing these WPR. We propose a novel ReRAM-based accelerator—PattPIM, to achieve space compression and computation reuse by exploring DNN WPR based on practical ReRAM crossbars. In PattPIM, we propose a configurable WPR-aware DNN engine and a WPR-to-OU mapping scheme to save both space and computation resources. An intraprocessing engine (PE) pipeline is designed to improve the parallelism of the computation process. Furthermore, we adopt an approximate weight pattern transform algorithm to improve the DNN WPR ratio to enhance the reuse efficiency with negligible accuracy loss. Our evaluation with 6 DNN models shows that the proposed PattPIM delivers significant performance improvement, ReRAM resource efficiency and energy saving.

3 citations

Posted Content
TL;DR: Simulation-based evaluation with seven data-intensive applications shows Tvarak's performance and energy efficiency, including Redis set-only performance by only 3%, compared to 50% reduction for a state-of-the-art software-only approach.
Abstract: Tvarak efficiently implements system-level redundancy for direct-access (DAX) NVM storage. Production storage systems complement device-level ECC (which covers media errors) with system-checksums and cross-device parity. This system-level redundancy enables detection of and recovery from data corruption due to device firmware bugs (e.g., reading data from the wrong physical location). Direct access to NVM penalizes software-only implementations of system-level redundancy, forcing a choice between lack of data protection or significant performance penalties. Offloading the update and verification of system-level redundancy to Tvarak, a hardware controller co-located with the last-level cache, enables efficient protection of data from such bugs in memory controller and NVM DIMM firmware. Simulation-based evaluation with seven data-intensive applications shows Tvarak's performance and energy efficiency. For example, Tvarak reduces Redis set-only performance by only 3%, compared to 50% reduction for a state-of-the-art software-only approach.

3 citations


Additional excerpts

  • ...computation [2, 5, 6, 13, 20, 25, 32, 59, 60, 71, 78] continues....

    [...]

Journal ArticleDOI
Leandro Fiorin1, Rik Jongerius1, Erik Vermij1, Jan van Lunteren1, Christoph Hagleitner1 
TL;DR: This paper discusses the implementation of a 3D-stacked near-memory accelerator, targeting radio astronomy and scientific applications, and shows that the accelerator can achieve an energy efficiency improvement compared to alternative solutions.
Abstract: Processing-in-memory and near-memory computing have recently been rediscovered as a way to alleviate the “memory wall problem” of traditional computing architectures. In this paper, we discuss the implementation of a 3D-stacked near-memory accelerator, targeting radio astronomy and scientific applications. After exploring the design space of the architecture by focusing on minimizing the execution power of the processing pipeline of the SKA1-Low central signal processor, we show that our accelerator can achieve an energy efficiency of up to 390 GFLOPS/W, corresponding to an energy consumption one order of magnitude lower than alternative state-of-the-art implementations. When running additional mathematical and streaming-oriented kernels, our accelerator achieves from 6.4 $\times$ to 20 $\times$ energy efficiency improvement compared to alternative solutions.

3 citations


Cites background from "A scalable processing-in-memory acc..."

  • ...The exploitation of the in-memory bandwidth available on 3D-stacked memory devices is presented in [18], in which processing-inmemory is used for parallel processing of big-data graphs....

    [...]

References
More filters
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....

    [...]

Journal Article
TL;DR: Google as discussed by the authors is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Journal ArticleDOI
TL;DR: This work presents a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of theSize of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan--Lin (KL) algorithm for refining during uncoarsening.
Abstract: Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [Bui and Jones, Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445--452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, NM, 1993]. From the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan--Lin (KL) algorithm for refining during uncoarsening. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.

5,629 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in a recent previous work [51]....

    [...]

  • ...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....

    [...]

Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation, and to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.

4,019 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....

    [...]

Proceedings ArticleDOI
06 Jun 2010
TL;DR: A model for processing large graphs that has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.

3,840 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....

    [...]

  • ...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....

    [...]