Proceedings Article

A scalable processing-in-memory accelerator for parallel graph processing

13 Jun 2015-Vol. 43, Iss: 3, pp 105-117
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. 
Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.


Citations
Proceedings Article
13 Aug 2018
TL;DR: A novel memory coalescer methodology is introduced that facilitates memory bandwidth efficiency and the overall performance through an efficient and scalable memory request coalescing interface for HMC.
Abstract: Arguably, many data-intensive applications pose significant challenges to conventional architectures and memory systems, especially when applications exhibit non-contiguous, irregular, and small memory access patterns. The long memory access latency can dramatically slow down the overall performance of applications. The growing demand for high memory bandwidth and low-latency access has stimulated the advent of novel 3D-stacked memory devices such as the Hybrid Memory Cube (HMC), which provides significantly higher bandwidth compared with the conventional JEDEC DDR devices. Even though many existing studies have been devoted to achieving high bandwidth throughput of HMC, the bandwidth potential cannot be fully exploited due to the lack of a highly efficient memory coalescing and interfacing methodology for HMC devices. In this research, we introduce a novel memory coalescer methodology that facilitates memory bandwidth efficiency and the overall performance through an efficient and scalable memory request coalescing interface for HMC. We present the design and implementation of this approach on RISC-V embedded cores with attached HMC devices. Our evaluation results show that the new memory coalescer eliminates 47.47% of memory accesses to HMC and improves the overall performance by 13.14% on average.
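The general idea of request coalescing described in this abstract can be illustrated with a minimal sketch. This is not the paper's design: the 64-byte block size, the `coalesce` function, and the sample addresses are assumptions chosen only to show how several small accesses to the same aligned block can be served by one larger request.

```python
from collections import OrderedDict

BLOCK_BYTES = 64  # assumed coalescing granularity, not taken from the paper


def coalesce(addresses, block_bytes=BLOCK_BYTES):
    """Group byte addresses that fall in the same aligned block.

    Returns a list of (block_base_address, [original addresses]) pairs,
    ordered by the first time each block was seen.
    """
    blocks = OrderedDict()
    for addr in addresses:
        base = addr - (addr % block_bytes)  # align down to the block boundary
        blocks.setdefault(base, []).append(addr)
    return list(blocks.items())


# Hypothetical stream of small, irregular accesses.
requests = [0, 8, 16, 72, 4096, 4100, 24, 130]
merged = coalesce(requests)
print(f"{len(requests)} accesses -> {len(merged)} coalesced requests")
```

In this toy stream, eight small accesses collapse into four block-sized requests; how much a real coalescer saves depends on the access pattern and on how long requests can be buffered before they are issued.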

4 citations


Cites background from "A scalable processing-in-memory acc..."

  • ...Given that many data-intensive applications rarely achieve the desired performance with traditional architectures, research and development efforts for efficiently handling massive memory accesses have drawn increasing attention in recent years [8, 41]....

    [...]

Posted Content
TL;DR: Bitlet as discussed by the authors is an analytical modeling tool that can be used, in a parameterized fashion, to estimate the performance and the power/energy of a PIM-based system and thereby assess the affinity of workloads for PIM as opposed to traditional computing.
Abstract: Nowadays, data-intensive applications are gaining popularity and, together with this trend, processing-in-memory (PIM)-based systems are being given more attention and have become more relevant. This paper describes an analytical modeling tool called Bitlet that can be used, in a parameterized fashion, to estimate the performance and the power/energy of a PIM-based system and thereby assess the affinity of workloads for PIM as opposed to traditional computing. The tool uncovers interesting tradeoffs between, mainly, the PIM computation complexity (cycles required to perform a computation through PIM), the amount of memory used for PIM, the system memory bandwidth, and the data transfer size. Despite its simplicity, the model reveals new insights when applied to real-life examples. The model is demonstrated for several synthetic examples and then applied to explore the influence of different parameters on two systems - IMAGING and FloatPIM. Based on the demonstrations, insights about PIM and its combination with CPU are concluded.

4 citations

Proceedings Article
29 Jul 2019
TL;DR: This work proposes MessageFusion, a domain-specific architecture that greatly reduces network traffic by computing many vertex-updates at the source node, as well as in the network, and leverages a novel edge-reordering mechanism to boost the number of partial update operations that can be completed before reaching their destination.
Abstract: The natural ability of graphs to capture complex relationships within a large amount of data makes graph-based algorithms critical kernels for a wide range of data-analytics applications. Recent processing-in-memory solutions based on a network of 3D-stacked memory, such as Hybrid Memory Cubes (HMCs), have been shown to be a good fit to run graph-based algorithms. However, the communication bandwidth of the network limits their energy and performance efficiencies. In this work, we propose MessageFusion, a domain-specific architecture that greatly reduces network traffic by computing many vertex-updates at the source node, as well as in the network. To this end, we observe that, for many algorithms, vertex-updates need not be atomic, but can be decomposed and computed in a distributed manner. MessageFusion leverages a novel edge-reordering mechanism to boost the number of partial update operations that can be completed before reaching their destination. In addition, to counteract the power overhead introduced by MessageFusion’s edge-reordering mechanism, our solution employs module-level, utilization-based power-gating techniques. Our experimental evaluation shows that MessageFusion achieves 3× energy savings over a highly-optimized processing-in-memory solution, while also improving performance by 2.1×, on average.
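The observation that vertex-updates can be decomposed can be sketched in software as follows. This is an illustration of the general principle only, not MessageFusion's hardware mechanism: the `combine_at_source` helper, the choice of `min` as the reduction, and the sample updates are assumptions. The sketch relies on the reduction being associative and commutative, so partial results computed at the source give the same final value as applying every update at the destination.

```python
def combine_at_source(updates, reduce_fn):
    """Fold all updates bound for the same destination vertex into one value.

    `updates` is an iterable of (destination_vertex, value) pairs produced
    at one source node; `reduce_fn` must be associative and commutative.
    """
    combined = {}
    for dst, value in updates:
        if dst in combined:
            combined[dst] = reduce_fn(combined[dst], value)
        else:
            combined[dst] = value
    return combined


# Hypothetical distance updates (as in a shortest-path style algorithm).
updates = [(7, 5.0), (3, 2.0), (7, 1.5), (7, 4.0), (3, 9.0)]
partial = combine_at_source(updates, min)
print(f"{len(updates)} updates -> {len(partial)} messages on the network")
```

Here five updates shrink to two messages, one per destination vertex. A sum-based algorithm would pass an addition function instead of `min`.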

4 citations


Cites background or methods from "A scalable processing-in-memory acc..."

  • ...5%, leading to a power density of 124 mW/mm2, which is still under the thermal constraint (133 mW/mm2) reported in [5]....

    [...]

  • ...We refer to prior works for estimation of the logic layer, including the SerDes links, and the DRAM layers of the HMC [5],...

    [...]

  • ...(rather than conventional caches of [5])....

    [...]

  • ...Finally, our baseline architecture includes an edge prefetcher as in [5] to stream the edgeArray from the local vault’s memory....

    [...]

  • ...Specifically, we simulated Tesseract [5], a domain-specific architecture, with 16 HMCs and 32 vaults per HMC, on an infrastructure based on...

    [...]

Journal Article
TL;DR: This review covers Chiplet packaging technologies (2.5D, 3D, and fan-out packaging) that combine processor cores and memory chips to realize the function of PIM, and analyzes some of their application results.
Abstract: With the rapid development of 5G, artificial intelligence (AI), and high-performance computing (HPC), there is a huge increase in the data exchanged between the processor and memory. However, the “storage wall” caused by the von Neumann architecture severely limits the computational performance of the system. To efficiently process such large amounts of data and break up the “storage wall”, it is necessary to develop processing-in-memory (PIM) technology. Chiplet combines processor cores and memory chips with advanced packaging technologies, such as 2.5D, 3 dimensions (3D), and fan-out packaging. This improves the quality and bandwidth of signal transmission and alleviates the “storage wall” problem. This paper reviews the Chiplet packaging technology that has achieved the function of PIM in recent years and analyzes some of its application results. First, the research status and development direction of PIM are presented and summarized. Second, the Chiplet packaging technologies that can realize the function of PIM are introduced, which are divided into 2.5D, 3D packaging, and fan-out packaging according to their physical form. Further, the form and characteristics of their implementation of PIM are summarized. Finally, this paper is concluded, and the future development of Chiplet in the field of PIM is discussed.

4 citations

Journal Article
TL;DR: DStore is a holistic key-value store that explores near-data processing (NDP) and on-demand scheduling for compaction optimization in an LSM-tree key-value store; it not only accomplishes compaction for key-value stores but also improves system performance.
Abstract: Log-structured merge tree (LSM-tree)-based key-value stores are widely deployed in large-scale storage systems. The underlying reason is that the traditional relational databases cannot reach the high performance required by big-data applications. As high-throughput alternatives to relational databases, LSM-tree-based key-value stores can support high-throughput write operations and provide high sequential bandwidth in storage systems. However, the compaction process triggers write amplification and degrades write performance, especially under update-intensive workloads. To address this issue, we design a holistic key-value store to explore near-data processing (NDP) and on-demand scheduling for compaction optimization in an LSM-tree key-value store, named DStore. DStore makes full use of various computing capacities in the host-side and device-side subsystems. DStore dynamically divides the whole host-side compaction tasks into the above two-side subsystems according to two-side different computing capabilities. Meanwhile, the device must be featured with an NDP model. The divided compaction tasks are performed by the host and the device in parallel. In DStore, the NDP-based devices exhibit low-latency and high-bandwidth performance, thus facilitating key-value stores. DStore not only accomplishes compaction for key-value stores but also improves the system performance. We implement our DStore prototype in a real-world platform, and different kinds of testbeds are employed in our experiment. LevelDB and a static compaction optimization using the NDP model (called Co-KV) are used to compare with the DStore in our evaluation. Results show that DStore achieves about $3.7\times$ performance improvement over LevelDB under the db_bench workload. In addition, DStore-enabled key-value stores outperform LevelDB by a factor of about $3.3\times$ and 77% in terms of throughput and latency under the YCSB benchmark, respectively.

4 citations

References
Journal Article
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....

    [...]

Journal Article
TL;DR: Google as discussed by the authors is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Journal Article
TL;DR: This work presents a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan--Lin (KL) algorithm for refining during uncoarsening.
Abstract: Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [Bui and Jones, Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445--452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, NM, 1993]. From the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan--Lin (KL) algorithm for refining during uncoarsening. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.

5,629 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in a recent previous work [51]....

    [...]

  • ...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....

    [...]

Journal Article
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation, and to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.

4,019 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....

    [...]

Proceedings Article
06 Jun 2010
TL;DR: A model for processing large graphs that has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
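The vertex-centric model described in this abstract can be sketched in a few lines. This is a single-process illustration, not the distributed framework's API: `run_supersteps`, the toy graph, and the choice of maximum-value propagation as the algorithm are assumptions made for the example. Each iteration (superstep) lets a vertex read the messages sent to it in the previous iteration, update its own state, and send new messages along its outgoing edges.

```python
def run_supersteps(graph, values, max_supersteps=50):
    """Propagate the maximum value through a graph, one superstep at a time.

    `graph` maps each vertex to its list of out-neighbours; `values` maps
    each vertex to its initial value. Returns the final vertex values.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in graph}
        any_change = False
        for v in graph:
            messages = inbox[v]  # messages sent in the previous superstep
            if messages and max(messages) > values[v]:
                values[v] = max(messages)      # modify own state
                for n in graph[v]:             # notify out-neighbours
                    outbox[n].append(values[v])
                any_change = True
        if not any_change:  # no vertex changed, so every vertex is idle
            break
        inbox = outbox  # barrier: messages become visible next superstep
    return values


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
initial = {"a": 3, "b": 6, "c": 2, "d": 1}
result = run_supersteps(graph, initial)
print(result)
```

Swapping `inbox` for `outbox` only at the end of each loop iteration plays the role of the synchronization barrier between iterations, which is what makes the program's behaviour easy to reason about.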

3,840 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....

    [...]

  • ...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....

    [...]