Proceedings Article • DOI

A scalable processing-in-memory accelerator for parallel graph processing

13 Jun 2015 - Vol. 43, Iss. 3, pp. 105-117
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
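
A hedged sketch of the programming style the abstract describes: per-partition cores, updates pushed to the memory partition that owns the destination vertex via non-blocking remote function calls, and a prefetch hint for the specialized prefetchers. All names (put, barrier, Vertex) are hypothetical stand-ins, not the paper's actual interface; the stub put() runs updates locally so the sketch is self-contained.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Vertex {
  double pagerank = 1.0;
  double next_pagerank = 0.0;
  uint32_t out_degree = 0;
  std::vector<uint32_t> successors;  // destination vertex IDs
};

std::vector<Vertex> graph;  // stand-in for memory spread across partitions

// Non-blocking remote function call: conceptually queued for execution by
// the core owning vertex `id`; `hint` would feed the prefetchers. (Stubbed
// to execute immediately so the example compiles and runs.)
void put(uint32_t id, const std::function<void(Vertex&)>& fn,
         const void* /*hint*/) {
  fn(graph[id]);
}

void barrier() {}  // stub: with the local put() above, nothing is pending

void pagerank_iteration() {
  for (Vertex& v : graph) {
    if (v.out_degree == 0) continue;
    const double contrib = v.pagerank / v.out_degree;
    for (uint32_t w : v.successors)  // push the update to w's partition
      put(w, [contrib](Vertex& t) { t.next_pagerank += contrib; }, &graph[w]);
  }
  barrier();  // wait until all remote updates have been applied
}

int main() {
  graph.resize(2);
  graph[0].out_degree = 1; graph[0].successors = {1};
  graph[1].out_degree = 1; graph[1].successors = {0};
  pagerank_iteration();
  return 0;
}
```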


Citations
Posted Content
27 Jan 2019
TL;DR: Eva-CiM provides a comprehensive multi-level (device to architecture) tool chain that leverages existing modeling and simulation tools such as GEM5, McPAT, and DESTINY, and introduces several novel modeling/analysis approaches, including models for capturing memory access and dependency-aware ISA traces, and for quantifying interactions between the host CPU and the CiM module.
Abstract: Computing-in-Memory (CiM) architectures aim to reduce costly data transfers by performing arithmetic and logic operations in memory and hence relieve the pressure due to the memory wall. However, determining whether a given workload can really benefit from CiM, which memory hierarchy and what device technology should be adopted by a CiM architecture requires in-depth study that is not only time consuming but also demands significant expertise in architectures and compilers. This paper presents an energy evaluation framework, Eva-CiM, for systems based on CiM architectures. Eva-CiM encompasses a multi-level (from device to architecture) comprehensive tool chain by leveraging existing modeling and simulation tools such as GEM5, McPAT [2] and DESTINY [3]. To support high-confidence prediction, rapid design space exploration and ease of use, Eva-CiM introduces several novel modeling/analysis approaches including models for capturing memory access and dependency-aware ISA traces, and for quantifying interactions between the host CPU and CiM modules. Eva-CiM can readily produce energy estimates of the entire system for a given program, a processor architecture, and the CiM array and technology specifications. Eva-CiM is validated by comparing with DESTINY [3] and [4], and enables findings including practical contributions from CiM-supported accesses, CiM-sensitive benchmarking as well as the pros and cons of increased memory size for CiM. Eva-CiM also enables exploration over different configurations and device technologies, showing 1.3-6.0X energy improvement for SRAM and 2.0-7.9X for FeFET-RAM, respectively.

10 citations


Cites methods from "A scalable processing-in-memory acc..."

  • ...Our experiments employ 17 benchmarks from a wide range of application based on prior works [8], [11], [16], [19], [34], as summarized in Table IV....


Proceedings Article • DOI
30 Sep 2019
TL;DR: PIMS, implemented in the logic layer of a 3D-stacked memory, exploits the high bandwidth provided by through-silicon vias to reduce redundant memory traffic and is presented as an in-memory accelerator for stencil computations.
Abstract: Stencil computation is a classic computational kernel present in many high-performance scientific applications, like image processing and partial differential equation (PDE) solvers. A stencil computation sweeps over a multi-dimensional grid and repeatedly updates values associated with points using the values from neighboring points. Stencil computations often employ large datasets that exceed cache capacity, leading to excessive accesses to the memory subsystem. As such, 3D stencil computations on large grid sizes are memory-bound. In this paper we present PIMS, an in-memory accelerator for stencil computations. PIMS, implemented in the logic layer of a 3D-stacked memory, exploits the high bandwidth provided by through-silicon vias to reduce redundant memory traffic. Our comprehensive evaluation using three different grid sizes with six categories of orders indicates that the proposed architecture reduces data movement by 48.25% on average and reduces bank conflicts by up to 65.55%.
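
For readers unfamiliar with the kernel shape, a generic 7-point (order-1) 3D stencil sweep looks roughly like the following. This is an illustrative sketch, not PIMS code, though the fragment quoted below comes from the same kind of loop nest.

```cpp
#include <vector>

// One sweep of a 7-point 3D stencil over an n*n*n grid: each interior point
// of b is updated from its own value in a plus its six face neighbors.
void stencil_sweep(const std::vector<double>& a, std::vector<double>& b,
                   const double c[4], int n) {
  auto idx = [n](int i, int j, int k) { return (i * n + j) * n + k; };
  for (int i = 1; i < n - 1; ++i)
    for (int j = 1; j < n - 1; ++j)
      for (int k = 1; k < n - 1; ++k)
        b[idx(i, j, k)] = c[0] * a[idx(i, j, k)]
                        + c[1] * (a[idx(i - 1, j, k)] + a[idx(i + 1, j, k)])
                        + c[2] * (a[idx(i, j - 1, k)] + a[idx(i, j + 1, k)])
                        + c[3] * (a[idx(i, j, k - 1)] + a[idx(i, j, k + 1)]);
}
```

Each update touches three planes of the grid (i-1, i, i+1); once those planes exceed cache capacity, they must be re-fetched from memory on every sweep, which is why large-grid 3D stencils are memory-bound.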

10 citations


Cites background from "A scalable processing-in-memory acc..."

  • ...In [1, 7, 8], the proposed solutions require the programmers to identify the code that should be run near to memory....


  • ...Several PIM architectures and programming models have been proposed by academic projects [1, 2, 11, 38] and multiple memory vendors are starting to adopt 3D die stacking in mass-produced memory....


  • ...+ c[1] * (grid_a[i-1][j][k] + grid_a[i+1][j][k]...


Proceedings Article • DOI
01 Aug 2017
TL;DR: This work analyzes the implementation of both HPCG and Graph500 on a novel shared-memory near-data processing architecture, and shows that, with many of the existing data-locality optimizations for this specific graph workload applied, local memory bandwidth is not the crucial parameter, and a high-bandwidth as well as low-latency interconnect are arguably more important.
Abstract: HPCG and Graph500 can be regarded as the two most relevant benchmarks for high-performance computing systems. Existing supercomputer designs, however, tend to focus on floating-point peak performance, a metric less relevant for these two benchmarks, leaving resources underutilized and resulting in little performance improvement for these benchmarks over time. In this work, we analyze the implementation of both benchmarks on a novel shared-memory near-data processing architecture. We study a number of aspects: 1. a system parameter design exploration, 2. software optimizations, and 3. the exploitation of unique architectural features like user-enhanced coherence, as well as the exploitation of data locality for inter-near-data-processor traffic. For the HPCG benchmark, we show a factor 2.5x application-level speedup with respect to a CPU, and a factor 2.5x power-efficiency improvement with respect to a GPU. For the Graph500 benchmark, we show up to a factor 3.5x speedup with respect to a CPU. Furthermore, we show that, with many of the existing data-locality optimizations for this specific graph workload applied, local memory bandwidth is not the crucial parameter, and a high-bandwidth, low-latency interconnect is arguably more important, shining a new light on the near-data processing characteristics most relevant for this type of heavily optimized graph processing.

10 citations


Cites methods from "A scalable processing-in-memory acc..."

  • ...In [14], a special-purpose graph-processing architecture is presented, utilizing the logic layer of stacked memories to achieve very high performance....


Proceedings Article • DOI
01 Dec 2016
TL;DR: It is concluded that computing nodes with a small number of computing threads consume more energy than more powerful ones, an observation that can help cloud administrators with energy-efficient resource allocation.
Abstract: In modern cloud computing and analytics applications, large-scale data is often represented in the form of graphs. Many recent works have focused on understanding and improving the performance of graph processing frameworks. Power consumption, which also serves as a key factor in the deployment and management of graph processing frameworks, has not been extensively studied. In this paper, we demonstrate the use of an online software power estimation tool that is capable of obtaining fine-grained power traces. By leveraging component-level power behavior, we show that static power consumption still constitutes a significant portion of the total power. Moreover, we illustrate the impact of various dynamic voltage and frequency scaling policies on these workloads, and observe that setting the computing node to its maximum frequency achieves optimal performance and energy consumption. From our analysis of the impact of machine scale-up, we conclude that computing nodes with a small number of computing threads consume more energy than more powerful ones. This observation can help cloud administrators with energy-efficient resource allocation.

9 citations


Cites background from "A scalable processing-in-memory acc..."

  • ...On the hardware side, domain-specific accelerators dedicated towards graph processing has also been proposed [10], [11] to achieve high performance....


  • ...[10] purposes to alleviate conventional concept of processing-in-memory (PIM) to design a programmable PIM accelerator that can achieve memory-capacity-proportional performance for large scale graph processing....


Proceedings Article • DOI
20 Jun 2019
TL;DR: This work presents a compiler-driven approach to reducing data movement costs called Computing with Near Data (CND), built upon a concept called "recomputation", in which a costly data access is replaced by a few less costly data accesses plus some extra computation, if the cumulative cost of the latter is less than that of the costly data access.
Abstract: One cost that plays a significant role in shaping the overall performance of both single-threaded and multi-threaded applications in modern computing systems is the cost of moving data between compute elements and storage elements. Traditional approaches to address this cost are code and data layout reorganizations and various hardware enhancements. More recently, an alternative paradigm, called Near Data Computing (NDC) or Near Data Processing (NDP), has been shown to be effective in reducing data movement costs, by moving computation to data, instead of the traditional approach of moving data to computation. Unfortunately, the existing Near Data Computing proposals require significant modifications to hardware and are yet to be widely adopted. In this paper, we present a software-only (compiler-driven) approach to reducing data movement costs in both single-threaded and multi-threaded applications. Our approach, referred to as Computing with Near Data (CND), is built upon a concept called "recomputation," in which a costly data access is replaced by a few less costly data accesses plus some extra computation, if the cumulative cost of the latter is less than that of the costly data access. If implemented carefully, CND can successfully trade off data access with computation, and considering the continuously increasing latency gap between the two, doing so can significantly reduce the execution latencies of both sequential and parallel application programs. We i) quantify the intrinsic recomputability of a set of single-threaded and multi-threaded applications, ii) propose a practical, compiler-driven approach that automatically transforms a given application code fragment to a version that employs recomputation, iii) discuss an optimization strategy that increases recomputability, and iv) compare CND, both qualitatively and quantitatively, against NDC. Our experimental analysis of CND reveals that i) the average recomputability across our benchmarks is 51.1%, ii) our compiler-driven strategy is able to exploit 79.3% of the recomputation opportunities presented by our workloads, and iii) our enhancements increase the value of the recomputability metric significantly. As a result, our compiler-driven approach with the proposed enhancements brings an average execution time improvement of 40.1%.
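
The recomputation idea can be made concrete with a toy example. This is an illustrative sketch only; the variable names and cost scenario are assumptions, not taken from the paper.

```cpp
#include <vector>

// Baseline: sum[i] was computed earlier (as x[i] + y[i]) and stored; reading
// it back later is a costly access if sum[] has since been evicted from cache.
double baseline(const std::vector<double>& sum, int i) {
  return sum[i] * sum[i];
}

// CND-style recomputation: if x[] and y[] are still cache-resident, re-derive
// the value instead of re-reading it. Two cheap hits plus one add can cost
// less than one miss, which is exactly the cost inequality the abstract states.
double recomputed(const std::vector<double>& x, const std::vector<double>& y,
                  int i) {
  const double s = x[i] + y[i];
  return s * s;
}
```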

9 citations

References
Journal Article • DOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
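
The abstract omits the paper's central formula; for reference, PageRank as defined in this paper is:

```latex
% Pages T_1 .. T_n link to page A, C(T) is the number of links leaving T,
% and d is a damping factor (the paper suggests d = 0.85).
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```

This is the quantity the Tesseract paper computes in its PageRank workload.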

14,696 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....


Journal Article
TL;DR: Google, as discussed by the authors, is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext; it is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Journal Article • DOI
TL;DR: This work presents a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan--Lin (KL) algorithm for refinement during uncoarsening.
Abstract: Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [Bui and Jones, Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445--452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, NM, 1993]. From the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan--Lin (KL) algorithm for refining during uncoarsening. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.
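
For concreteness, here is a minimal sketch of driving the METIS 5 k-way partitioner this paper describes; the toy graph is made up, and the Tesseract usage quoted below corresponds to nparts = 512 with ncon = 3 (balancing vertices, outgoing edges, and incoming edges per partition).

```cpp
#include <metis.h>   // build against libmetis (METIS 5.x)
#include <cstdio>
#include <vector>

int main() {
  // 4-vertex cycle in CSR form: xadj[v]..xadj[v+1] index v's neighbors.
  std::vector<idx_t> xadj   = {0, 2, 4, 6, 8};
  std::vector<idx_t> adjncy = {1, 3, 0, 2, 1, 3, 0, 2};
  idx_t nvtxs = 4, ncon = 1, nparts = 2, objval = 0;
  std::vector<idx_t> part(nvtxs);

  // NULL weight/option arguments select METIS defaults; multi-constraint
  // partitioning would instead pass ncon > 1 and a vwgt array carrying that
  // many weights per vertex.
  if (METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                          /*vwgt=*/NULL, /*vsize=*/NULL, /*adjwgt=*/NULL,
                          &nparts, /*tpwgts=*/NULL, /*ubvec=*/NULL,
                          /*options=*/NULL, &objval, part.data()) != METIS_OK)
    return 1;
  for (idx_t v = 0; v < nvtxs; ++v)
    std::printf("vertex %d -> partition %d\n", (int)v, (int)part[v]);
  return 0;
}
```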

5,629 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in a recent previous work [51]....


  • ...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....


Journal Article • DOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
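
As a concrete illustration of the API described above, here is a minimal basic-block-counting Pintool (the same task used in the Valgrind/DynamoRIO comparison). It follows Pin's published C++ API, though details can vary across Pin versions, and it must be built with the Pin kit rather than as a standalone program.

```cpp
#include "pin.H"
#include <iostream>

static UINT64 bblCount = 0;

// Analysis routine: runs every time an instrumented basic block executes.
VOID CountBbl() { bblCount++; }

// Instrumentation routine: Pin hands us each trace once, at JIT time; we
// insert a call to CountBbl before every basic block in the trace.
VOID Trace(TRACE trace, VOID* v) {
  for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
    BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)CountBbl, IARG_END);
}

VOID Fini(INT32 code, VOID* v) {
  std::cerr << "basic blocks executed: " << bblCount << std::endl;
}

int main(int argc, char* argv[]) {
  if (PIN_Init(argc, argv)) return 1;     // parse Pin's command line
  TRACE_AddInstrumentFunction(Trace, 0);  // register JIT-time callback
  PIN_AddFiniFunction(Fini, 0);           // report when the app exits
  PIN_StartProgram();                     // never returns
  return 0;
}
```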

4,019 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....


Proceedings Article • DOI
06 Jun 2010
TL;DR: A model for processing large graphs designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers, whose implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
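
A hedged sketch of the vertex-centric superstep loop described above, using a PageRank-style update to make the model concrete. This is not Google's Pregel API, just a self-contained illustration of messages, supersteps, and voting to halt.

```cpp
#include <cstddef>
#include <vector>

struct VertexState {
  double value = 1.0;
  std::vector<std::size_t> out_edges;  // destination vertex indices
  bool active = true;
};

// One superstep: each active vertex consumes messages sent in the previous
// superstep, updates its own value, sends messages along its out-edges for
// the next superstep, and eventually votes to halt.
void superstep(std::vector<VertexState>& v,
               const std::vector<std::vector<double>>& inbox,
               std::vector<std::vector<double>>& outbox,
               std::size_t step, std::size_t max_steps) {
  for (std::size_t i = 0; i < v.size(); ++i) {
    if (!v[i].active) continue;
    if (step >= 1) {  // PageRank-style combine of incoming messages
      double sum = 0;
      for (double m : inbox[i]) sum += m;
      v[i].value = 0.15 / v.size() + 0.85 * sum;
    }
    if (step < max_steps && !v[i].out_edges.empty()) {
      const double share = v[i].value / v[i].out_edges.size();
      for (std::size_t w : v[i].out_edges) outbox[w].push_back(share);
    } else {
      v[i].active = false;  // vote to halt; a new message would reactivate it
    }
  }
}
```

A driver loop would swap inbox and outbox between supersteps and stop once every vertex has halted and no messages are in flight, which is the termination condition the paper's synchronous model implies.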

3,840 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....


  • ...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....
