Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

13 Jun 2015 - Vol. 43, Iss. 3, pp. 105-117
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
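
For intuition about the programming model the abstract describes, the sketch below is a minimal host-side emulation (plain C++) of its central idea: vertex updates are issued as non-blocking function calls that execute at the memory partition owning the destination vertex and become visible at a barrier. The partition count, the put_remote/barrier names, and the PageRank-style update are illustrative assumptions, not Tesseract's actual API.

```cpp
// Minimal host-side emulation of a Tesseract-style vertex program: updates are
// expressed as non-blocking "remote function calls" queued at the memory
// partition (vault) that owns the destination vertex and applied at a barrier.
// API names and the vertex-to-vault mapping are illustrative, not the paper's.
#include <cstdio>
#include <functional>
#include <vector>

constexpr int kVaults = 4;                      // number of memory partitions

struct Vertex { double rank; double next; std::vector<int> out; };

std::vector<Vertex> g;                                          // the whole graph
std::vector<std::vector<std::function<void()>>> queues(kVaults); // per-vault queues

int vault_of(int v) { return v % kVaults; }     // simple vertex -> vault mapping

// Non-blocking remote call: queued at the destination vertex's vault.
void put_remote(int dst, std::function<void()> fn) {
    queues[vault_of(dst)].push_back(std::move(fn));
}

// Barrier: every vault drains its queue before the next iteration starts.
void barrier() {
    for (auto& q : queues) { for (auto& fn : q) fn(); q.clear(); }
}

int main() {
    // Tiny 4-vertex graph; rank starts at 1.0 for every vertex.
    g = {{1.0, 0.0, {1, 2}}, {1.0, 0.0, {2}}, {1.0, 0.0, {0}}, {1.0, 0.0, {0, 1}}};
    for (int iter = 0; iter < 10; ++iter) {
        for (int v = 0; v < (int)g.size(); ++v)
            for (int w : g[v].out) {
                double contrib = g[v].rank / g[v].out.size();
                put_remote(w, [w, contrib] { g[w].next += contrib; });
            }
        barrier();
        for (auto& v : g) { v.rank = 0.15 + 0.85 * v.next; v.next = 0.0; }
    }
    for (int v = 0; v < (int)g.size(); ++v)
        std::printf("rank[%d] = %.3f\n", v, g[v].rank);
}
```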


Citations
Proceedings ArticleDOI
01 Sep 2016
TL;DR: This talk discusses major challenges facing modern memory systems in the presence of greatly increasing demand for data and its fast analysis, and examines some promising research and design directions to overcome these challenges and thus enable scalable memory systems for the future.
Abstract: The memory system is a fundamental performance and energy bottleneck in almost all computing systems. Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck [103, 110]. At the same time, DRAM technology is experiencing difficult technology scaling challenges that make the maintenance and enhancement of its capacity, energy efficiency, and reliability significantly more costly with conventional techniques (see, for example [27, 32, 53, 56–58, 66, 67, 70–72, 82, 84, 85, 92, 107, 114, 123]). In fact, recent reliability issues with DRAM [97], such as the RowHammer problem [66, 107], are already threatening system security and predictability [10, 68, 107]. In this talk, we first discuss major challenges facing modern memory systems in the presence of greatly increasing demand for data and its fast analysis. We then examine some promising research and design directions to overcome these challenges and thus enable scalable memory systems for the future. We discuss three key solution directions: 1) enabling new memory architectures, functions, interfaces, and better integration of memory and the rest of the system (e.g., [2, 3, 9, 24–27, 43–45, 47, 48, 56, 58, 65, 78, 79, 82, 84, 89, 107, 114, 115, 118, 123, 127, 129, 132, 134, 135]), 2) designing a memory system that intelligently employs emerging non-volatile memory (NVM) technologies and coordinates memory and storage management (e.g., [52, 69–72, 87, 93–95, 122, 124, 146–148]), 3) reducing memory interference and providing predictable performance to applications sharing the memory system (e.g., [5, 23, 29–31, 34–37, 42, 51, 59, 60, 63, 64, 73–75, 80, 99, 101, 108, 109, 113, 130, 138–143]). In the first solution direction, we will try to answer the question: can we design system architectures that treat memory as a more central, more intelligent, more autonomous component? We will discuss compute-capable memory architectures [2, 3, 9, 26, 43, 44, 47, 61, 115, 127, 129, 132, 134, 135], DRAM retention time analysis and mechanisms for refresh reduction [56, 57, 84, 85, 114, 123], DRAM error analysis and online detection and management of memory errors [27, 56–58, 66, 82, 85, 96, 114, 123], heterogeneous-reliability memory designs [89], DRAM latency analysis and low-latency DRAM architectures [25, 26, 45, 65, 78, 79, 82, 129], high-bandwidth DRAM architectures [65, 80, 81], memory energy reduction techniques [26, 27, 32, 65, 129], memory compression mechanisms [117–121], and new virtual memory architectures [47, 133]. In the second solution direction, we will try to answer the question: can we enable a system design where an emerging NVM technology can effectively replace, augment, or perhaps even surpass DRAM? We aim to discuss architecting memory systems to incorporate emerging memory technologies as DRAM replacement [69–72, 87, 88, 94, 122, 147, 148], hybrid memory systems that incorporate multiple different memory technologies and manage them to obtain the best of multiple technologies [83, 93, 95, 124, 146], systems that can merge memory and storage management into a single unified interface to make the most of the fast, byte-addressable persistence characteristics of emerging NVMs [87, 95, 124], and techniques for enabling programmers and systems to more easily take advantage of byte-level persistence characteristics of NVM [28, 124].
In the third solution direction, we will try to answer the question: can we enable a flexible and configurable memory system that can provide predictable performance and good quality-of-service to applications, enable software to enforce various different QoS policies, and maximize system performance? To this end, we aim to discuss new memory controller designs [5, 7, 37, 49, 51, 59, 60, 63, 64, 73, 75, 76, 80, 99–101, 108, 109, 130, 138–142, 148], new network-on-chip designs [6, 8, 29, 30, 39–42, 98, 144, 145], new cache designs [7, 55, 117, 128, 130, 131], memory partitioning mechanisms [80, 101], source throttling mechanisms [6, 8, 23, 35, 36, 54, 112, 113], application scheduling mechanisms [31, 143], and intelligent prefetch management techniques [33, 34, 36, 50, 73, 74, 76, 111, 131, 136]. We will also touch upon new, open-source infrastructures my research group has released [27, 45, 46, 66, 67, 81, 82, 107, 124, 125, 128, 140–142] to facilitate exploration of novel ideas in all three solution directions. Finally, if time permits, we will describe our ongoing related work in combating technology scaling, reliability, and lifetime challenges of NAND flash memory (e.g., [12–21, 38, 86, 90, 91, 96]) to enable much more reliable, durable and high-performance storage systems.

5 citations

Proceedings ArticleDOI
01 Aug 2020
TL;DR: This paper develops and open-sources SAGA-Bench, a benchmark for streaming graph analytics, and performs workload characterization at the architecture level, revealing that the graph update phase exhibits lower utilization of architecture resources than the compute phase.
Abstract: Many application scenarios such as social network analysis and real-time financial fraud detection involve performing batched updates and analytics on a time-evolving or streaming graph. Despite their importance, streaming graph analytics workloads have not been systematically studied at either the software or the architecture levels. This paper fills this gap through three contributions. First, we develop and open-source SAGA-Bench, a benchmark for streaming graph analytics, which puts together different data structures and compute models on the same platform for a fair and systematic characterization. Second, we perform software-level characterization using SAGA-Bench. Our profiling reveals that the best data structure for a streaming graph depends on the per-batch degree distribution of the graph. We also observe that the incremental compute model provides performance benefits especially for larger graphs. Finally, we show that the graph update phase contributes at least 40% of the streaming graph processing latency in many cases. Third, we perform workload characterization at the architecture level. Our study reveals that the graph update phase exhibits lower utilization of architecture resources than the compute phase. Furthermore, the hardware resource utilization of the update phase strongly depends on the underlying structure of the batches of the graph. Finally, between compute and update phases, the former exhibits a higher L3 cache hit ratio, whereas the latter shows a higher L2 cache hit ratio.
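
To make the two phases the abstract distinguishes concrete, the sketch below applies a batch of edge insertions to an adjacency-list structure (the update phase) and then recomputes an analytic only for the vertices touched by the batch (an incremental compute phase). The data structure and the degree-count analytic are placeholder assumptions, not SAGA-Bench's implementations.

```cpp
// Sketch of a streaming-graph processing loop: an update phase that applies a
// batch of edge insertions, followed by an incremental compute phase that
// touches only vertices affected by the batch. Placeholder structures only.
#include <cstdio>
#include <unordered_set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;   // (source, destination)

int main() {
    const int n = 6;
    std::vector<std::vector<int>> adj(n);   // adjacency lists
    std::vector<int> out_degree(n, 0);      // the (trivial) analytic we maintain

    std::vector<std::vector<Edge>> batches = {
        {{0, 1}, {0, 2}, {3, 4}},           // batch 1
        {{1, 2}, {4, 5}, {0, 3}},           // batch 2
    };

    for (const auto& batch : batches) {
        // Update phase: insert the batch into the graph structure.
        std::unordered_set<int> affected;
        for (const Edge& e : batch) {
            adj[e.first].push_back(e.second);
            affected.insert(e.first);
        }
        // Incremental compute phase: recompute only affected vertices instead
        // of re-running the analytic over the whole graph.
        for (int v : affected) out_degree[v] = (int)adj[v].size();
    }
    for (int v = 0; v < n; ++v)
        std::printf("out_degree[%d] = %d\n", v, out_degree[v]);
}
```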

5 citations

Proceedings ArticleDOI
21 May 2018
TL;DR: This paper proposes memory optimizations for a "sea of simple MIMD cores (SSMC)" PNM architecture, called Millipede, which (pre)fetches and operates on entire memory rows to exploit BMLAs' row-density and employs cross-corelet flow-control to prevent eviction.
Abstract: The technology-push of die stacking and application pull of Big Data machine learning analytics (BMLA) have created a unique opportunity for processing-near-memory (PNM). This paper makes four contributions: (1) While previous PNM work explores general MapReduce workloads, we identify three application characteristics of most BMLAs: (a) irregular-and-compute-light (i.e., perform only a few operations per input word which include data-dependent branches and indirect memory accesses); (b) compact (i.e., the relevant portion of the input data and the intermediate live data for each thread are small); and (c) memory-row-dense (i.e., process the input data without skipping over many bytes). These characteristics, except for irregularity, are necessary for bandwidth- and energy-efficient PNM, irrespective of the architecture. (2) Based on these characteristics, we propose memory optimizations for a "sea of simple MIMD cores (SSMC)" PNM architecture, called Millipede, which (pre)fetches and operates on entire memory rows to exploit BMLAs' row-density. Instead of this row-oriented access and compute-schedule, traditional multicores opportunistically improve row locality while fetching and operating on cache blocks. (3) Millipede employs well-known MIMD execution to handle BMLAs' irregularity, and sequential prefetch of input data to hide memory latency. In Millipede, however, one corelet prefetches a row for all the corelets which may stray far from each other due to their MIMD execution. Consequently, a leading corelet may prematurely evict the prefetched data before a lagging corelet has consumed the data. Millipede employs cross-corelet flow-control to prevent such eviction. (4) Millipede further exploits its flow-controlled prefetch for frequency scaling based on coarse-grain compute-memory rate-matching which decreases (increases) the processor clock speed when the prefetch buffers are empty (full). Using simulations, we compare PNM architectures to show that Millipede improves performance and energy by 135% and 27% over a GPGPU with prefetch, and by 35% and 36% over SSMC with prefetch, when all three PNM architectures use the same resources (i.e., number of cores and on-processor-die memory) and identical die-stacking.
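
The cross-corelet flow control described in contribution (3) can be illustrated with a toy single-threaded emulation: the prefetcher may evict the oldest buffered row only after every corelet has consumed it, so a leading corelet cannot starve a lagging one. The buffer size, corelet count, and "laggard" schedule below are invented for illustration and do not model Millipede's hardware.

```cpp
// Toy emulation of cross-corelet flow control on a shared row-prefetch buffer:
// a newly prefetched row may evict the oldest buffered row only after every
// corelet has consumed that oldest row. All names and sizes are illustrative.
#include <cstdio>
#include <deque>
#include <vector>

constexpr int kCorelets   = 4;
constexpr int kBufferRows = 2;     // rows the shared prefetch buffer can hold
constexpr int kTotalRows  = 6;     // rows of input to stream through

int main() {
    std::deque<int> buffer;                    // buffered row ids, oldest first
    std::vector<int> next_row(kCorelets, 0);   // next row id each corelet needs
    int next_prefetch = 0;

    for (int step = 0; ; ++step) {
        // Prefetch: fill free slots; evict the oldest row only if all corelets
        // are already past it (the cross-corelet flow-control condition).
        while (next_prefetch < kTotalRows) {
            if ((int)buffer.size() == kBufferRows) {
                bool all_past = true;
                for (int r : next_row) all_past &= (r > buffer.front());
                if (!all_past) break;          // stall the prefetcher instead
                buffer.pop_front();
            }
            buffer.push_back(next_prefetch++);
        }
        // Consume: each corelet takes its next row if it is buffered. The last
        // corelet is a laggard that only progresses every other step (straying).
        for (int c = 0; c < kCorelets; ++c) {
            if (c == kCorelets - 1 && step % 2) continue;
            for (int row : buffer)
                if (row == next_row[c]) {
                    std::printf("step %d: corelet %d consumed row %d\n", step, c, row);
                    ++next_row[c];
                    break;
                }
        }
        bool done = true;
        for (int r : next_row) done &= (r == kTotalRows);
        if (done) break;
    }
}
```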

4 citations


Cites background or methods from "A scalable processing-in-memory acc..."

  • ...For comparison purposes, we use GPGPUsim to simulate PNM architectures based on a GPGPU, Variable Warp Sizing (VWS) [41] which is currently the best branch-optimized GPGPU (for BMLAs’ branches), and SSMC (representing previous multicores without row-orientedness [11], [10], [12])....


  • ...Further, Tesseract is not row-oriented and would incur straying similar to conventional multicores and plain SSMC....


  • ...While processing-in-memory (PIM) has been around for decades [5], [6], [7], [8], [9], [10], [11], [12], [13], there have been three problems....


  • ...While Tesseract [12] targets graph workloads via MIMD and inter-core communication, such workloads are not row-dense or compact....


Journal ArticleDOI
TL;DR: MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in the authors' implementation, and improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.
Abstract: This article describes Memory Squeeze (MemSZ), a new approach for lossy general-purpose memory compression. MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in our implementation. Our compressor is placed between the memory controller and the cache hierarchy of a processor to reduce the memory traffic of applications that tolerate approximations in parts of their data. Thereby, the available off-chip bandwidth is utilized more efficiently, improving system performance and energy efficiency. Two alternative multi-core variants of the MemSZ system are described. The first variant has a shared last-level cache (LLC) on the processor-die, which is modified to store both compressed and uncompressed data. The second has a 3D-stacked DRAM cache with larger cache lines that match the granularity of the compressed memory blocks and stores only uncompressed data. For applications that tolerate aggressive approximation in large fractions of their data, MemSZ reduces baseline memory traffic by up to 81%, execution time by up to 62%, and energy costs by up to 25%, introducing up to 1.8% error to the application output. Compared to the current state-of-the-art lossy memory compression design, MemSZ improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.
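
For intuition about the SZ-style compression MemSZ builds on, the sketch below predicts each value from its predecessor, quantizes the prediction error against a user-set error bound, and stores values exactly when they cannot be predicted within the bound. This is a simplified scalar illustration under assumed encodings, not MemSZ's parallel hardware pipeline.

```cpp
// Simplified SZ-style error-bounded compression of a block of floats: predict
// each value from the previous reconstructed value, quantize the prediction
// error against an error bound, and fall back to storing the value exactly
// when it is unpredictable. Illustrative only; not MemSZ's hardware design.
#include <cmath>
#include <cstdio>
#include <vector>

struct Compressed {
    std::vector<int>   codes;     // quantization code per value (0 = unpredictable)
    std::vector<float> literals;  // exact values for unpredictable entries
};

Compressed compress(const std::vector<float>& data, float bound) {
    Compressed out;
    float prev = 0.0f;                              // previous *reconstructed* value
    for (float x : data) {
        float err = x - prev;                       // predictor: previous value
        int   q   = (int)std::lround(err / (2 * bound));
        float rec = prev + q * 2 * bound;           // what the decompressor will see
        if (std::fabs(x - rec) <= bound && q > -128 && q < 128) {
            out.codes.push_back(q + 128);           // small code, 0 is reserved
            prev = rec;
        } else {
            out.codes.push_back(0);                 // unpredictable: store literally
            out.literals.push_back(x);
            prev = x;
        }
    }
    return out;
}

std::vector<float> decompress(const Compressed& c, float bound) {
    std::vector<float> out;
    float prev = 0.0f;
    size_t lit = 0;
    for (int code : c.codes) {
        prev = (code == 0) ? c.literals[lit++] : prev + (code - 128) * 2 * bound;
        out.push_back(prev);
    }
    return out;
}

int main() {
    std::vector<float> block = {1.00f, 1.02f, 1.05f, 1.04f, 9.50f, 9.52f, 9.51f, 9.49f};
    const float bound = 0.05f;                      // user-chosen error bound
    Compressed c = compress(block, bound);
    std::vector<float> r = decompress(c, bound);
    for (size_t i = 0; i < block.size(); ++i)
        std::printf("%.2f -> %.2f (|err| = %.3f)\n", block[i], r[i], std::fabs(block[i] - r[i]));
}
```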

4 citations


Cites background from "A scalable processing-in-memory acc..."

  • ...In addition, new emerging data-intensive applications further increase memory traffic [4, 5, 47]....


References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....


Journal Article
TL;DR: Google, as discussed by the authors, is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Journal ArticleDOI
TL;DR: This work presents a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan–Lin (KL) algorithm for refining during uncoarsening.
Abstract: Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [Bui and Jones, Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445–452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, NM, 1993]. From the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan–Lin (KL) algorithm for refining during uncoarsening. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.
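
The heavy-edge coarsening idea can be illustrated with a greedy matching pass: each unmatched vertex pairs with the unmatched neighbor joined by the heaviest edge, and each pair collapses into one vertex of the coarser graph. The sketch below is a bare-bones version of that heuristic, not METIS's implementation.

```cpp
// Bare-bones heavy-edge matching, the coarsening heuristic the abstract
// describes: each unmatched vertex is matched to the unmatched neighbor joined
// by the heaviest edge; matched pairs collapse into one coarse vertex.
#include <cstdio>
#include <vector>

struct Edge { int to; int weight; };

int main() {
    // Small weighted graph as adjacency lists.
    std::vector<std::vector<Edge>> adj = {
        /*0*/ {{1, 5}, {2, 1}},
        /*1*/ {{0, 5}, {3, 2}},
        /*2*/ {{0, 1}, {3, 4}},
        /*3*/ {{1, 2}, {2, 4}},
    };
    const int n = (int)adj.size();
    std::vector<int> match(n, -1);

    // Greedy heavy-edge matching.
    for (int v = 0; v < n; ++v) {
        if (match[v] != -1) continue;
        int best = -1, best_w = -1;
        for (const Edge& e : adj[v])
            if (match[e.to] == -1 && e.weight > best_w) { best = e.to; best_w = e.weight; }
        if (best != -1) { match[v] = best; match[best] = v; }
        else match[v] = v;                     // no unmatched neighbor: stays single
    }

    // Collapse matched pairs into coarse vertices.
    std::vector<int> coarse_id(n, -1);
    int next_id = 0;
    for (int v = 0; v < n; ++v)
        if (coarse_id[v] == -1) { coarse_id[v] = next_id; coarse_id[match[v]] = next_id; ++next_id; }

    for (int v = 0; v < n; ++v)
        std::printf("vertex %d -> coarse vertex %d\n", v, coarse_id[v]);
    std::printf("coarse graph has %d vertices\n", next_id);
}
```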

5,629 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in a recent previous work [51]....


  • ...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....


Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
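
Since basic-block counting is the abstract's running performance example, a minimal Pintool in that style looks roughly like the sketch below (modeled on Pin's bundled example tools; exact headers, types, and build setup depend on the Pin kit version).

```cpp
// A minimal basic-block counting Pintool in the style of Pin's bundled example
// tools (based on the public Pin API; details depend on the Pin kit version).
#include <iostream>
#include "pin.H"

static UINT64 bblCount = 0;   // basic blocks executed
static UINT64 insCount = 0;   // instructions executed

// Analysis routine: called before every executed basic block.
VOID CountBbl(UINT32 numInstrs) {
    bblCount += 1;
    insCount += numInstrs;
}

// Instrumentation routine: insert the analysis call into every basic block.
VOID Trace(TRACE trace, VOID* v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)CountBbl,
                       IARG_UINT32, BBL_NumIns(bbl), IARG_END);
}

VOID Fini(INT32 code, VOID* v) {
    std::cerr << "basic blocks: " << bblCount
              << ", instructions: " << insCount << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;        // parse Pin's command line
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                        // never returns
    return 0;
}
```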

4,019 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....


Proceedings ArticleDOI
06 Jun 2010
TL;DR: This paper presents a computational model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
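
The vertex-centric model the abstract describes can be emulated on a single machine: in each superstep a vertex reads the messages sent to it in the previous superstep, updates its state, sends messages along its outgoing edges, and effectively halts when it has nothing new to propagate. The sketch below runs single-source shortest paths in that style and is only an illustration of the model, not Google's Pregel API.

```cpp
// Single-machine emulation of a vertex-centric, superstep-based computation:
// single-source shortest paths, where each vertex keeps the minimum distance
// seen so far and sends updates to its neighbors only when it improves.
#include <cstdio>
#include <limits>
#include <vector>

struct Edge { int to; double len; };
const double kInf = std::numeric_limits<double>::infinity();

int main() {
    std::vector<std::vector<Edge>> out = {
        {{1, 1.0}, {2, 4.0}}, {{2, 1.5}, {3, 3.0}}, {{3, 1.0}}, {},
    };
    const int n = (int)out.size();
    std::vector<double> dist(n, kInf);
    std::vector<std::vector<double>> inbox(n), next_inbox(n);
    inbox[0].push_back(0.0);                      // seed the source vertex

    bool messages_pending = true;
    while (messages_pending) {                    // one iteration == one superstep
        messages_pending = false;
        for (int v = 0; v < n; ++v) {
            if (inbox[v].empty()) continue;       // vertex is halted this superstep
            double best = dist[v];
            for (double m : inbox[v]) if (m < best) best = m;   // Compute()
            if (best < dist[v]) {
                dist[v] = best;
                for (const Edge& e : out[v]) {    // send updates to neighbors
                    next_inbox[e.to].push_back(best + e.len);
                    messages_pending = true;
                }
            }
        }
        inbox.swap(next_inbox);
        for (auto& box : next_inbox) box.clear();
    }
    for (int v = 0; v < n; ++v) std::printf("dist(0 -> %d) = %g\n", v, dist[v]);
}
```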

3,840 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....


  • ...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....
