Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

13 Jun 2015 - Vol. 43, Iss. 3, pp. 105-117
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution for achieving memory-capacity-proportional performance, and presents Tesseract, a programmable PIM accelerator for large-scale graph processing.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of 3D integration technology, which facilitates stacking logic and memory dies in a single package and was not available when the PIM concept was originally examined. To take advantage of this new technology and enable memory-capacity-proportional performance, we design Tesseract, a programmable PIM accelerator for large-scale graph processing. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for the memory access patterns of graph processing, which operate based on hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves an 87% average energy reduction over conventional systems.
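
The abstract's programming interface pairs vertex-centric code with non-blocking remote function calls between memory partitions and hint-driven prefetchers. As a rough illustration only (a hedged sketch, not the paper's verbatim API: `put`, `barrier`, and the prefetch-hint argument are assumptions), an in-degree-counting kernel on such a system might look like:

```cpp
// Hypothetical sketch of vertex-centric code on a Tesseract-like PIM
// system. `put` models a non-blocking function call shipped to the
// memory partition owning the destination vertex; `barrier` models
// global synchronization. Names and signatures are illustrative only.
#include <vector>

struct Vertex { long in_degree = 0; std::vector<int> out_edges; };
std::vector<Vertex> g;   // each memory partition owns a slice of `g`

void put(int dst, void (*fn)(int), int prefetch_hint);  // assumed runtime call
void barrier();                                         // assumed runtime call

// Runs at the partition owning `dst`: the update touches only local
// DRAM, so only the small message crosses partitions, not vertex data.
void bump_in_degree(int dst) { ++g[dst].in_degree; }

void count_in_degrees(int begin, int end /* this partition's vertices */) {
    for (int v = begin; v < end; ++v)
        for (int dst : g[v].out_edges)
            // The hint lets a hardware prefetcher start fetching g[dst]
            // before the remote function body executes, mirroring the
            // abstract's hint-driven prefetchers.
            put(dst, bump_in_degree, /*prefetch_hint=*/dst);
    barrier();  // all remote updates land before the next superstep
}
```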


Citations
Proceedings ArticleDOI
04 Jan 2022
TL;DR: DR-STRaNGe, an end-to-end system design for DRAM-based TRNGs, is proposed; it reduces RNG interference by separating RNG requests from regular memory requests in the memory controller, improves fairness across applications with an RNG-aware memory request scheduler, and hides long TRNG latencies using a random number buffering mechanism combined with a new DRAM idleness predictor that accurately identifies idle DRAM periods.
Abstract: Random number generation is an important task in a wide variety of critical applications including cryptographic algorithms, scientific simulations, and industrial testing tools. True Random Number Generators (TRNGs) produce cryptographically-secure truly random data by sampling a physical entropy source that typically requires custom hardware and suffers from long latency. To enable high-bandwidth and low-latency TRNGs on widely-available commodity devices, recent works propose hardware TRNGs that generate random numbers using commodity DRAM as an entropy source. Although prior works demonstrate promising TRNG mechanisms using DRAM, practical integration of such mechanisms into real systems poses various challenges.

We identify three key challenges for using DRAM-based TRNGs in current systems: (1) generating random numbers with DRAM-based TRNGs can degrade overall system performance by slowing down concurrently-running applications due to the interference between RNG and regular memory operations in the memory controller (i.e., RNG interference), (2) this RNG interference can degrade system fairness by causing unfair prioritization of applications that intensively use random numbers (i.e., RNG applications), and (3) RNG applications can experience significant slowdown due to the high latency of DRAM-based TRNGs.

To address these challenges, we propose DR-STRaNGe, an end-to-end system design for DRAM-based TRNGs that (1) reduces the RNG interference by separating RNG requests from regular memory requests in the memory controller, (2) improves fairness across applications with an RNG-aware memory request scheduler, and (3) hides the large TRNG latencies using a random number buffering mechanism combined with a new DRAM idleness predictor that accurately identifies idle DRAM periods.

We evaluate DR-STRaNGe using a comprehensive set of 186 multi-programmed workloads. Compared to an RNG-oblivious baseline system, DR-STRaNGe improves the performance of non-RNG and RNG applications on average by 17.9% and 25.1%, respectively. DR-STRaNGe improves system fairness by 32.1% on average when generating random numbers at a 5 Gb/s throughput. DR-STRaNGe reduces energy consumption by 21% compared to the RNG-oblivious baseline design by reducing the time spent for RNG and non-RNG memory accesses by 15.8%.
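
A minimal sketch of the buffering idea follows, under assumed interfaces (the predictor heuristic, buffer sizing, and `dram_trng_sample` are illustrative, not the paper's implementation): random numbers are pregenerated into a FIFO during predicted DRAM idle periods, so RNG applications usually skip the long TRNG latency and regular requests see less RNG interference.

```cpp
// Hedged sketch of DR-STRaNGe-style random number buffering.
#include <cstddef>
#include <cstdint>
#include <deque>

struct IdlenessPredictor {
    // Toy stand-in for the paper's predictor: call DRAM idle if no
    // demand request has arrived for a while.
    uint64_t last_request_cycle = 0;
    bool predict_idle(uint64_t now) const {
        return now - last_request_cycle > 1000;
    }
};

class RngBuffer {
    std::deque<uint64_t> buf_;
    static constexpr std::size_t kCapacity = 4096;

    static uint64_t dram_trng_sample();  // assumed DRAM-based TRNG primitive

public:
    // Called by the memory controller when it predicts idle DRAM time.
    void maybe_refill(const IdlenessPredictor& p, uint64_t now) {
        while (p.predict_idle(now) && buf_.size() < kCapacity)
            buf_.push_back(dram_trng_sample());  // slow path, hidden in idle gaps
    }

    // Fast path for RNG applications: no DRAM access if data is buffered.
    bool pop(uint64_t& out) {
        if (buf_.empty()) return false;  // caller falls back to a blocking read
        out = buf_.front();
        buf_.pop_front();
        return true;
    }
};
```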

6 citations

Journal ArticleDOI
TL;DR: The state of the art established in recent years is described, and trends and challenges in research and development that point towards the future of graph processing systems are outlined.
Abstract: Driven by a multitude of use cases, graph data analytics has become a hot topic in research and industry. Particularly on big graphs, performing complex analytical queries efficiently to derive new insights is a challenging task. Systems that aim at solving the technical part of this challenge are often referred to as graph processing systems. They allow expressing and executing analytic algorithms and queries, while hiding most of the technical details related to efficiently storing and processing graph data. Since 2010, work on graph processing systems for distributed systems as well as shared memory systems has virtually exploded. In this article, we give an overview of this work with the particular focus on graph processing systems for large multiprocessor machines. We describe the state of the art established in recent years and outline trends and challenges in research and development that point towards the future of graph processing systems.

6 citations


Cites background from "A scalable processing-in-memory acc..."

  • Other examples that leverage modern hardware to accelerate graph processing include processing-in-memory (PIM) [4] and field-programmable gate arrays (FPGA) [77].

Journal ArticleDOI
TL;DR: In this article, a task-to-HMC mapping is performed to hide the average communication latency of intermediate DNN processing results, and a task schedule is generated using retiming to accelerate DNN inference while maximizing resource utilization.
Abstract: Processing-in-memory (PIM) incorporates computational logic into the memory domain and is the most promising solution to alleviate the memory bandwidth problem in deep neural network (DNN) processing. The hybrid memory cube (HMC), a 3D-stacked memory structure, can efficiently implement the PIM architecture while maximizing reuse of existing legacy hardware. To accelerate DNN inference, multiple HMCs can be connected, and data-independent tasks can be assigned to processing elements (PEs) within each HMC. However, owing to the packet-switched network structure, inter-HMC interconnects exhibit variable and unpredictable latencies depending on the data transmission path and link contention. A well-designed task schedule using context switching can effectively hide communication latency and improve PE utilization. Nevertheless, as the number of HMCs increases, the wide variability of inter-HMC communication latencies causes frequent context switching, degrading overall performance. This paper proposes a DNN task scheduling method that effectively exploits task parallelism by reducing the communication latency variance caused by HMC interconnect characteristics. Task partitions are generated to exploit parallelism while keeping inter-HMC traffic within the sustainable link bandwidth. Task-to-HMC mapping is performed to hide the average communication latency of intermediate DNN processing results. A task schedule is generated using retiming to accelerate DNN inference while maximizing resource utilization. The effectiveness of the proposed method was verified through simulations of various realistic DNN applications on the ZSim x86-64 simulator. The simulations revealed that the proposed scheduling reduced DNN processing time by 18.19% compared to conventional methods in which each HMC operates independently.
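
As a toy illustration of the bandwidth-aware mapping step (a hedged sketch, not the paper's algorithm, which additionally applies retiming-based scheduling; the `Task` model, the pipeline assumption, and the cost model are all assumptions):

```cpp
// Greedy task-to-HMC mapping for a pipeline of DNN tasks: prefer the
// least compute-loaded HMC, but cross an inter-HMC link only if the
// predecessor's traffic still fits in the sustainable link budget.
#include <cstddef>
#include <vector>

struct Task {
    double compute;     // PE work this task needs
    double traffic_in;  // data received from its predecessor task
};

std::vector<int> map_tasks(const std::vector<Task>& tasks,
                           int num_hmcs, double link_budget) {
    std::vector<double> pe_load(num_hmcs, 0.0), link_load(num_hmcs, 0.0);
    std::vector<int> placement(tasks.size());
    int prev = 0;
    for (std::size_t t = 0; t < tasks.size(); ++t) {
        int best = prev;  // staying local consumes no link bandwidth
        for (int h = 0; h < num_hmcs; ++h) {
            bool fits = (h == prev) ||
                        link_load[h] + tasks[t].traffic_in <= link_budget;
            if (fits && pe_load[h] < pe_load[best]) best = h;
        }
        if (best != prev) link_load[best] += tasks[t].traffic_in;
        pe_load[best] += tasks[t].compute;
        placement[t] = best;
        prev = best;
    }
    return placement;
}
```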

6 citations

Proceedings ArticleDOI
02 Jun 2019
TL;DR: This work proposes static and dynamic techniques to optimize the thermal behavior of PIM architectures running intensive in-memory search operations, and tests the proposed design on two important categories of applications that benefit from search-based PIM acceleration: hyper-dimensional computing and database query.
Abstract: Recently, Processing-In-Memory (PIM) techniques exploiting resistive RAM (ReRAM) have been used to accelerate various big data applications. ReRAM-based in-memory search is a powerful operation which efficiently finds required data in a large data set. However, such operations draw a large amount of current, which may create serious thermal issues, especially in state-of-the-art 3D stacking chips. Therefore, designing PIM accelerators based on in-memory search requires careful consideration of temperature. In this work, we propose static and dynamic techniques to optimize the thermal behavior of PIM architectures running intensive in-memory search operations. Our experiments show the proposed design significantly reduces the peak chip temperature and dynamic management overhead. We test our proposed design on two important categories of applications which benefit from search-based PIM acceleration: hyper-dimensional computing and database query. Validated experiments show that the proposed method can reduce the steady-state temperature by at least 15.3 °C, which extends the lifetime of the ReRAM device by 57.2% on average. Furthermore, the proposed fine-grained dynamic thermal management provides a 17.6% performance improvement over state-of-the-art methods.
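
The dynamic side of such a design can be as simple as a feedback loop that throttles the in-memory search rate against a temperature threshold. A minimal sketch, assuming an on-chip thermal sensor and a throttle knob (`read_temp` and `set_search_rate` are hypothetical):

```cpp
// Toy fine-grained dynamic thermal management loop for search-based PIM.
#include <algorithm>

constexpr double kTripC = 85.0;  // throttle above this temperature
constexpr double kSafeC = 75.0;  // restore rate below this temperature

double read_temp();                // assumed on-chip thermal sensor (°C)
void   set_search_rate(double r);  // assumed: fraction of peak search rate

void dtm_step(double& rate) {
    const double t = read_temp();
    if (t > kTripC)      rate = std::max(0.25, rate * 0.5);   // back off fast
    else if (t < kSafeC) rate = std::min(1.0,  rate + 0.05);  // recover slowly
    set_search_rate(rate);
}
```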

6 citations


Cites background from "A scalable processing-in-memory acc..."

  • ...Data movement is the main bottleneck of current computing systems when the size of data increases over the cache capacity of the processing core [25]....


Journal ArticleDOI
TL;DR: This work proposes a non-volatile processing-in-memory (PIM) architecture that is extremely energy-efficient, supports minimal-overhead checkpointing for intermittent computing, can operate over a wide temperature range, and has a natural resilience to radiation.
Abstract: Beyond-edge devices can operate outside the reach of the power grid and without batteries. Such devices can be deployed in large numbers in regions that are difficult to access. Using machine learning, these devices can solve complex problems and relay valuable information back to a host. Many such devices deployed in low Earth orbit can even be used as nanosatellites. Due to the harsh and unpredictable nature of the environment, these devices must be highly energy-efficient, be capable of operating intermittently over a wide temperature range, and be tolerant of radiation. Here, we propose a non-volatile processing-in-memory architecture that is extremely energy-efficient, supports minimal overhead checkpointing for intermittent computing, can operate in a wide range of temperatures, and has a natural resilience to radiation.
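
One way to read "minimal overhead checkpointing for intermittent computing": if weights and results already live in non-volatile memory, saving progress shrinks to persisting a single progress marker. A toy sketch under that assumption (`infer` and `power_ok` are hypothetical):

```cpp
// Hedged sketch of intermittent execution with NVM-resident state.
#include <cstdint>

struct PersistentState {      // resides in non-volatile memory
    uint32_t next_input;      // progress marker surviving power loss
    uint32_t results[1024];   // outputs accumulated directly in NVM
};

uint32_t infer(uint32_t x);   // assumed in-memory ML kernel
bool power_ok();              // assumed harvested-energy check

void run(PersistentState* s, uint32_t n) {
    // After a power failure, execution resumes from the last marker.
    for (uint32_t i = s->next_input; i < n && power_ok(); ++i) {
        s->results[i] = infer(i);
        s->next_input = i + 1;  // the whole checkpoint: one word to NVM
    }
}
```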

6 citations

References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
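
This reference is best known for describing PageRank, the link-analysis algorithm that exploits hypertext structure, which is also a canonical workload for graph processing systems such as Tesseract. For orientation, a minimal power-iteration sketch (simplified: dangling nodes and convergence testing omitted):

```cpp
// Plain power-iteration PageRank with damping factor d = 0.85.
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<double> pagerank(const std::vector<std::vector<int>>& out_edges,
                             int iters = 20, double d = 0.85) {
    const std::size_t n = out_edges.size();
    std::vector<double> pr(n, 1.0 / n), next(n);
    for (int it = 0; it < iters; ++it) {
        std::fill(next.begin(), next.end(), (1.0 - d) / n);  // teleport term
        for (std::size_t v = 0; v < n; ++v)
            for (int dst : out_edges[v])   // spread rank along out-links
                next[dst] += d * pr[v] / out_edges[v].size();
        pr.swap(next);
    }
    return pr;  // pr[v]: stationary probability of a random surfer at v
}
```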

14,696 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....


Journal ArticleDOI
TL;DR: This work presents a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan--Lin (KL) algorithm for refinement during uncoarsening.
Abstract: Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [Bui and Jones, Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445--452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, NM, 1993]. From the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan--Lin (KL) algorithm for refining during uncoarsening. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.
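
The heavy-edge heuristic itself is simple to state: while coarsening, match each unmatched vertex with the unmatched neighbor joined by the heaviest edge, then collapse matched pairs into coarse vertices. A compact sketch of one matching pass (simplified relative to METIS, which iterates this across multiple levels and refines on the way back up):

```cpp
// One pass of heavy-edge matching. Adjacency is (neighbor, weight) pairs.
#include <algorithm>
#include <random>
#include <utility>
#include <vector>

std::vector<int> heavy_edge_matching(
        const std::vector<std::vector<std::pair<int, int>>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> match(n, -1), order(n);
    for (int v = 0; v < n; ++v) order[v] = v;
    std::shuffle(order.begin(), order.end(), std::mt19937{42});  // random visit order

    for (int v : order) {
        if (match[v] != -1) continue;          // already matched
        int best = -1, best_w = -1;
        for (auto [u, w] : adj[v])             // pick heaviest free neighbor
            if (match[u] == -1 && w > best_w) { best = u; best_w = w; }
        if (best != -1) { match[v] = best; match[best] = v; }
        else            { match[v] = v; }      // no free neighbor: self-match
    }
    return match;  // each matched pair becomes one vertex of the coarse graph
}
```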

5,629 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in a recent previous work [51]....


  • ...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....


Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
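
The basic-block counting mentioned above is the canonical "first Pintool". The sketch below follows the introductory examples distributed with Pin (reconstructed from memory; consult the Pin manual for authoritative details): an instrumentation routine walks each trace and inserts an analysis call before every basic block.

```cpp
// Basic-block counting Pintool, in the style of Pin's example tools.
#include <iostream>
#include "pin.H"

static UINT64 bblCount = 0;

// Analysis routine: executed every time an instrumented basic block runs.
VOID DoCount() { bblCount++; }

// Instrumentation routine: Pin calls this once per trace at JIT time.
VOID Trace(TRACE trace, VOID* v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)DoCount, IARG_END);
}

VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Basic blocks executed: " << bblCount << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;  // parse Pin's command line
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                  // run the application; never returns
    return 0;
}
```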

4,019 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....


Proceedings ArticleDOI
06 Jun 2010
TL;DR: A computational model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
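
The superstep model above is easy to make concrete. Below is a hypothetical C++ rendering of the paper's classic example, propagating the maximum value through a graph (the real Pregel API is a C++ Compute() method over a message iterator; the types and helpers here are assumptions):

```cpp
// Sketch of a Pregel-style vertex program: consume last superstep's
// messages, maybe update state, send messages, vote to halt.
#include <algorithm>
#include <vector>

struct MaxValueVertex {
    int value;
    bool halted = false;

    // Provided by the framework in a real system; assumed here.
    void send_to_all_neighbors(int msg);
    void vote_to_halt() { halted = true; }

    void compute(const std::vector<int>& messages, int superstep) {
        int incoming = messages.empty()
            ? value
            : *std::max_element(messages.begin(), messages.end());
        if (superstep == 0 || incoming > value) {
            value = std::max(value, incoming);
            send_to_all_neighbors(value);  // wakes neighbors next superstep
        } else {
            vote_to_halt();  // sleeps until a new message arrives
        }
    }
};
```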

3,840 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....


  • ...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....
