Proceedings ArticleDOI

A scalable processing-in-memory accelerator for parallel graph processing

13 Jun 2015 - Vol. 43, Iss: 3, pp 105-117
TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.
Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
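
To make the programming model concrete, here is a minimal single-partition sketch of the style of computation the abstract describes: vertex updates expressed as non-blocking remote function calls that execute at the memory partition owning the target vertex, applied in bulk at a barrier. The put()/barrier() names, the PageRank-style kernel, and the single-partition simplification are illustrative assumptions, not the paper's actual Tesseract API.

```cpp
// Hypothetical Tesseract-style vertex program (illustrative API, not the
// paper's): PageRank-like contributions are sent as non-blocking remote
// function calls to the partition that owns the destination vertex, and all
// queued calls are applied at a barrier.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Vertex { double value; double next; std::vector<int> out; };

// One memory partition: owns vertices, queues deferred remote calls.
struct Partition {
    std::vector<Vertex> verts;
    std::queue<std::function<void()>> pending;

    // Non-blocking "remote function call": run the update near the data.
    // A real PIM design would also pass a prefetch hint (the address the
    // call will touch) so the in-order core rarely stalls on DRAM.
    void put(int v, double contrib) {
        pending.push([this, v, contrib] { verts[v].next += contrib; });
    }

    void barrier() {  // drain every queued update before the next phase
        while (!pending.empty()) { pending.front()(); pending.pop(); }
    }
};

int main() {
    Partition p;  // a single partition keeps the sketch short
    p.verts = {{1.0, 0.0, {1, 2}}, {1.0, 0.0, {2}}, {1.0, 0.0, {0}}};
    for (int it = 0; it < 10; ++it) {
        for (const auto& u : p.verts)               // scatter along out-edges
            for (int v : u.out) p.put(v, u.value / u.out.size());
        p.barrier();                                // apply remote updates
        for (auto& v : p.verts) { v.value = 0.15 + 0.85 * v.next; v.next = 0; }
    }
    for (const auto& v : p.verts) std::printf("%.3f\n", v.value);
}
```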


Citations
Proceedings ArticleDOI
18 May 2020
TL;DR: A novel ReRAM-based graph analytics accelerator, named Spara, which dynamically maximizes the workload density of crossbars using a tightly-coupled bank-parallel architecture.
Abstract: Resistive random access memory (ReRAM) addresses the high memory bandwidth requirement of graph analytics by integrating computing logic in the memory. Due to the matrix-structured crossbar architecture, existing ReRAM-based accelerators, when handling real-world graphs that often have skewed degree distributions, suffer from a severe sparsity problem arising from zero fillings and activation nondeterminism, incurring substantial ineffectual computation. In this paper, we observe that the sparsity stems from the consecutive mapping of source and destination vertex indices onto the wordlines and bitlines of a crossbar. Although exhaustive graph reordering mitigates this sparsity-induced inefficiency, its totally random (source and destination) vertex mapping leads to expensive overheads. This work exploits the insight of a mid-point vertex mapping, with random wordlines and consecutive bitlines. A cost-effective preprocessing step is proposed that exploits this insight by rapidly exploring crossbar-fit vertex reorderings, although it ignores the sparsity arising from activation dynamics. We present a novel ReRAM-based graph analytics accelerator, named Spara, which maximizes the workload density of crossbars dynamically using a further-proposed, tightly-coupled bank-parallel architecture. Results on real-world and synthesized graphs show that Spara outperforms GraphR and GraphSAR by 8.21× and 5.01× in terms of performance, and by 8.97× and 5.68× in terms of energy savings (on average), while incurring a reasonable (<9.98%) preprocessing overhead.
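
As a rough, self-contained illustration of the sparsity argument (not Spara's actual mapping pipeline), the sketch below counts how densely a skewed graph's edges fill fixed-size crossbar tiles when source vertices map consecutively to wordlines versus under an arbitrary reordering; the synthetic graph, the 8×8 tile size, and the random reordering are all invented for the demonstration.

```cpp
// Illustrative only: compare crossbar "workload density" (non-zero cells per
// occupied 8x8 crossbar tile) for consecutive vs. reordered vertex-to-wordline
// mapping on a synthetic skewed graph. Graph, tile size, and reordering are
// made up for the demonstration.
#include <algorithm>
#include <cstdio>
#include <random>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // (source, destination)
constexpr int kTile = 8;           // crossbar dimension

// Density = edges / (occupied tiles * cells per tile). Low density means many
// zero fillings, i.e., wasted crossbar computation.
double density(const std::vector<Edge>& edges, const std::vector<int>& rowMap) {
    std::set<std::pair<int, int>> tiles;
    for (const auto& [u, v] : edges)
        tiles.insert({rowMap[u] / kTile, v / kTile});
    return double(edges.size()) / (double(tiles.size()) * kTile * kTile);
}

int main() {
    const int n = 64;
    std::mt19937 rng(42);
    std::vector<Edge> edges;
    for (int u = 0; u < n; ++u)        // skewed: low-id sources are hubs
        for (int v = 0; v < n; ++v)
            if (rng() % (u + 2) == 0) edges.push_back({u, v});

    std::vector<int> ident(n), shuffled(n);
    for (int i = 0; i < n; ++i) ident[i] = shuffled[i] = i;
    std::shuffle(shuffled.begin(), shuffled.end(), rng);

    std::printf("consecutive mapping density: %.3f\n", density(edges, ident));
    std::printf("reordered mapping density:   %.3f\n", density(edges, shuffled));
}
```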

22 citations

Proceedings ArticleDOI
12 May 2022
TL;DR: This work proposes SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence- to-sequence mapping, for both short and long reads.
Abstract: A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph (i.e., S2G) and sequence-to-sequence (i.e., S2S) mapping pipelines. First, SeGraM outperforms state-of-the-art S2G mapping tools by 5.9×/3.9× and 106×/742× for long and short reads, respectively, while reducing power consumption by 4.1×/4.4× and 3.0×/3.2×. Second, BitAlign outperforms a state-of-the-art S2G alignment tool by 41×-539× and three S2S alignment accelerators by 1.2×-4.8×. We conclude that SeGraM is a high-performance and low-cost universal genomics mapping accelerator that efficiently supports both sequence-to-graph and sequence-to-sequence mapping pipelines.
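
MinSeed is described as minimizer-based, so a generic minimizer sketch helps explain the seeding step: from every window of w consecutive k-mers, only the smallest k-mer is kept as a seed, shrinking the set of candidate locations to query. The parameters (k = 5, w = 4), the lexicographic ordering, and the toy read below are arbitrary choices for illustration, not SeGraM's.

```cpp
// Generic minimizer selection (the idea behind minimizer-based seeding, not
// SeGraM's implementation): keep only the lexicographically smallest k-mer
// of every window of w consecutive k-mers as a seed.
#include <algorithm>
#include <cstdio>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> minimizers(const std::string& s, int k, int w) {
    std::vector<std::string> kmers;
    for (size_t i = 0; i + k <= s.size(); ++i) kmers.push_back(s.substr(i, k));

    std::set<std::string> picked;            // de-duplicated seeds
    for (size_t i = 0; i + w <= kmers.size(); ++i) {
        std::string best = kmers[i];         // smallest k-mer in the window
        for (int j = 1; j < w; ++j) best = std::min(best, kmers[i + j]);
        picked.insert(best);
    }
    return {picked.begin(), picked.end()};
}

int main() {
    // Toy read; real mappers hash k-mers and use canonical strands.
    for (const auto& m : minimizers("GAACTTGGAGTCTACGAG", 5, 4))
        std::printf("seed: %s\n", m.c_str()); // far fewer seeds than k-mers
}
```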

21 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: DRAF is introduced, an architecture for bit-level reconfigurable logic that uses DRAM subarrays to implement dense lookup tables and overlaps DRAM operations like bitline precharge and charge restoration with routing within the reconfigurable routing fabric to minimize the impact of DRAM latency.
Abstract: FPGAs are a popular target for application-specific accelerators because they lead to a good balance between flexibility and energy efficiency. However, FPGA lookup tables introduce significant area and power overheads, making it difficult to use FPGA devices in environments with tight cost and power constraints. This is the case for datacenter servers, where a modestly-sized FPGA cannot accommodate the large number of diverse accelerators that datacenter applications need. This paper introduces DRAF, an architecture for bit-level reconfigurable logic that uses DRAM subarrays to implement dense lookup tables. DRAF overlaps DRAM operations like bitline precharge and charge restoration with routing within the reconfigurable routing fabric to minimize the impact of DRAM latency. It also supports multiple configuration contexts that can be used to quickly switch between different accelerators with minimal latency. Overall, DRAF trades off some of the performance of FPGAs for significant gains in area and power. DRAF improves area density by 10x over FPGAs and power consumption by more than 3x, enabling DRAF to satisfy demanding applications within strict power and cost constraints. While accelerators mapped to DRAF are 2-3x slower than those in FPGAs, they still deliver a 13x speedup and an 11x reduction in power consumption over a Xeon core for a wide range of datacenter tasks, including analytics and interactive services like speech recognition.
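
The core idea, that a DRAM row can serve as a lookup table, can be shown in a few lines: a k-input boolean function is just a 2^k-entry truth table, so evaluating it reduces to a row read plus a column select. The sketch below is a software analogy with toy sizes, not DRAF's hardware design.

```cpp
// Software analogy for a DRAM-row lookup table (toy sizes, not DRAF's
// hardware): a k-input boolean function is a 2^k-bit truth table, so
// "computing" it is one table read, which DRAF maps to a subarray row read
// plus a column select.
#include <bitset>
#include <cstdio>

constexpr int K = 4;               // LUT inputs
std::bitset<1 << K> table;         // one 16-entry truth table

// Program the LUT with a 4-input "at least 3 ones" threshold function.
void program() {
    for (int x = 0; x < (1 << K); ++x)
        table[x] = std::bitset<K>(x).count() >= 3;
}

// Evaluate = index the truth table with the packed input bits.
bool eval(bool a, bool b, bool c, bool d) {
    return table[a | (b << 1) | (c << 2) | (d << 3)];
}

int main() {
    program();
    std::printf("f(1,1,0,1) = %d\n", eval(1, 1, 0, 1));  // 3 ones -> 1
    std::printf("f(1,0,0,1) = %d\n", eval(1, 0, 0, 1));  // 2 ones -> 0
}
```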

21 citations

Proceedings ArticleDOI
01 Feb 2021
TL;DR: GradPIM as discussed by the authors is a processing-in-memory architecture that accelerates the parameter updates of deep neural network training by extending DDR4 SDRAM to utilize bank-group parallelism.
Abstract: In this paper, we present GradPIM, a processing-in-memory architecture which accelerates the parameter updates of deep neural network training. As one of the processing-in-memory (PIM) techniques that could be realized in the near future, we propose a simple, incremental architectural design that does not invade the existing memory protocol. Extending DDR4 SDRAM to utilize bank-group parallelism makes our operation designs in the PIM module efficient in terms of hardware cost and performance. Our experimental results show that the proposed architecture can improve the performance of DNN training and greatly reduce the memory bandwidth requirement while posing only a minimal amount of overhead to the protocol and DRAM area.
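
A conceptual sketch of the bank-group parallelism being exploited (not GradPIM's actual design): parameters are striped across four "bank groups", and each group applies the SGD update w -= lr * g to its own stripe independently, so updates proceed in parallel without the gradient crossing the memory bus twice. Threads stand in for bank groups here; the stripe layout and learning rate are illustrative.

```cpp
// Conceptual sketch of bank-group-parallel parameter updates (not GradPIM's
// actual design): parameters are striped across four "bank groups" and each
// group applies w -= lr * g to its own stripe. Threads model the per-group
// update units; stripes are disjoint, so no synchronization is needed.
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

constexpr int kBankGroups = 4;

void update_stripe(std::vector<float>& w, const std::vector<float>& g,
                   float lr, int bg) {
    for (size_t i = bg; i < w.size(); i += kBankGroups)  // interleaved stripe
        w[i] -= lr * g[i];
}

int main() {
    std::vector<float> w(16, 1.0f), g(16, 0.5f);
    std::vector<std::thread> groups;
    for (int bg = 0; bg < kBankGroups; ++bg)  // one updater per bank group
        groups.emplace_back(update_stripe, std::ref(w), std::ref(g), 0.1f, bg);
    for (auto& t : groups) t.join();
    std::printf("w[0] after update: %.2f\n", w[0]);  // 1.0 - 0.1 * 0.5 = 0.95
}
```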

21 citations

Posted Content
TL;DR: A novel filtering algorithm, GRIM-Filter, optimized to exploit emerging 3D-stacked memory systems that integrate computation within a stacked logic layer, enabling processing-in-memory (PIM).
Abstract: Motivation: Seed filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. Read mappers 1) quickly generate possible mapping locations (i.e., seeds) for each read, 2) extract reference sequences at each of the mapping locations, and then 3) check similarity between each read and its associated reference sequences with a computationally expensive dynamic programming algorithm (alignment) to determine the origin of the read. Location filters come into play before alignment, discarding seed locations that alignment would have deemed a poor match. The ideal location filter would discard all poor matching locations prior to alignment such that there is no wasted computation on poor alignments. Results: We propose a novel filtering algorithm, GRIM-Filter, optimized to exploit emerging 3D-stacked memory systems that integrate computation within a stacked logic layer, enabling processing-in-memory (PIM). GRIM-Filter quickly filters locations by 1) introducing a new representation of coarse-grained segments of the reference genome and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for 5% error acceptance rates, GRIM-Filter eliminates 5.59x-6.41x more false negatives and exhibits end-to-end speedups of 1.81x-3.65x compared to mappers employing the best previous filtering algorithm.
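
The filtering idea can be stated in a few lines of code: each coarse-grained segment (bin) of the reference keeps a bit vector marking which k-mers occur in it, and a candidate location is forwarded to alignment only if enough of the read's k-mers are present in the corresponding bin. The bin contents, k = 5, and the threshold below are toy values, and a set of strings stands in for the packed bit vectors GRIM-Filter actually uses.

```cpp
// Toy version of the bin bit-vector filter (a set of strings stands in for
// GRIM-Filter's packed bit vectors): a candidate location passes only if
// enough of the read's k-mers occur in the bin covering that location.
#include <cstdio>
#include <string>
#include <unordered_set>

constexpr int K = 5;

// Build one bin's "bit vector": the set of k-mers present in the segment.
std::unordered_set<std::string> build_bin(const std::string& segment) {
    std::unordered_set<std::string> bits;
    for (size_t i = 0; i + K <= segment.size(); ++i)
        bits.insert(segment.substr(i, K));
    return bits;
}

// Count the read's k-mers present in the bin; pass if above the threshold.
bool passes(const std::unordered_set<std::string>& bin,
            const std::string& read, int threshold) {
    int hits = 0;
    for (size_t i = 0; i + K <= read.size(); ++i)
        if (bin.count(read.substr(i, K))) ++hits;
    return hits >= threshold;  // below threshold: skip costly alignment
}

int main() {
    auto bin = build_bin("GAACTTGGAGTCTACGAGTTTT");   // one reference segment
    std::printf("likely match:  %d\n", passes(bin, "GAACTTGGAGTC", 6));  // 1
    std::printf("filtered away: %d\n", passes(bin, "CCCCCCCCCCCC", 6));  // 0
}
```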

21 citations


Cites background from "A scalable processing-in-memory accelerator for parallel graph processing"

  • ...GRIM-Filter takes in read sequences and generates filter masks on the 3D-stacked logic die using precomputed bit vectors, while the CPU filters locations, queried from the hash table, using the resulting filter mask....


  • ...and only executes instructions within the CPU [21, 22, 26, 27]....


  • ...Our evaluation system is an Intel® Core™ i7-2600 CPU @ 3.40GHz with 16 GB of RAM for all experiments....


  • ...When the read mapper is ready to align a read to a segment of the reference sequence... [Figure 3: flow diagram for the GRIM-Filter algorithm, contrasting GRIM-Filter's in-memory operations (extracting tokens from the read sequence, gathering regional bit vectors, thresholding the sum vector into a token filter mask) with the read mapper's in-CPU operations (reference string storage and edit-distance calculation).]


  • ...We show that when we run mapping with GRIM-Filter on a commodity CPU, we see 1.81x-3.65x performance improvement....


References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....



Journal ArticleDOI
TL;DR: This work presents a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement, and presents a much faster variation of the Kernighan-Lin (KL) algorithm for refining during uncoarsening.
Abstract: Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [Bui and Jones, Proc. of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 1993, 445-452; Hendrickson and Leland, A Multilevel Algorithm for Partitioning Graphs, Tech. report SAND 93-1301, Sandia National Laboratories, Albuquerque, NM, 1993]. From the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan-Lin (KL) algorithm for refining during uncoarsening. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.
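
A one-pass sketch of the heavy-edge heuristic described above (a simplification, not METIS itself): visit vertices in random order and match each unmatched vertex to the unmatched neighbor sharing the heaviest edge, so the heaviest edges are collapsed into coarse vertices rather than left to be cut.

```cpp
// Simplified one-pass heavy-edge matching (a sketch of the heuristic, not
// METIS): each unmatched vertex, visited in random order, is matched to the
// unmatched neighbor with the heaviest connecting edge; matched pairs would
// then be collapsed into coarse vertices.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

struct Nbr { int v; int weight; };

int main() {
    // Tiny weighted graph: adj[u] lists (neighbor, edge weight).
    std::vector<std::vector<Nbr>> adj = {
        {{1, 9}, {2, 1}},   // 0
        {{0, 9}, {3, 2}},   // 1
        {{0, 1}, {3, 8}},   // 2
        {{1, 2}, {2, 8}}};  // 3
    std::vector<int> match(adj.size(), -1);
    std::vector<int> order = {0, 1, 2, 3};
    std::shuffle(order.begin(), order.end(), std::mt19937(7));

    for (int u : order) {
        if (match[u] != -1) continue;
        int best = -1, bestW = -1;
        for (const auto& [v, w] : adj[u])   // heaviest unmatched neighbor
            if (match[v] == -1 && w > bestW) { best = v; bestW = w; }
        if (best != -1) { match[u] = best; match[best] = u; }
    }
    // The heavy edges (0,1) and (2,3) get collapsed; the light ones are cut.
    for (size_t u = 0; u < adj.size(); ++u)
        std::printf("vertex %zu matched with %d\n", u, match[u]);
}
```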

5,629 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in a recent previous work [51]....


  • ...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....


Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
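
The canonical instruction-counting Pintool (adapted from the inscount0 example bundled with Pin) shows the instrumentation/analysis split the abstract describes: an instrumentation routine runs once per static instruction as Pin's dynamic compiler encounters it, and the analysis call it inserts runs before every dynamic instance.

```cpp
// Instruction-counting Pintool, adapted from the inscount0 example that
// ships with Pin. Build with the Pin kit's makefiles, then run e.g.:
//   pin -t obj-intel64/inscount.so -- /bin/ls
#include <iostream>
#include "pin.H"

static UINT64 icount = 0;

// Analysis routine: runs before every dynamic instruction.
VOID docount() { icount++; }

// Instrumentation routine: called once per static instruction as Pin's
// dynamic compiler first encounters it; inserts the analysis call.
VOID Instruction(INS ins, VOID* v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Executed " << icount << " instructions" << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;          // parse Pin's command line
    INS_AddInstrumentFunction(Instruction, 0);   // register the two hooks
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                          // never returns
    return 0;
}
```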

4,019 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....


Proceedings ArticleDOI
06 Jun 2010
TL;DR: A model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers, whose implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
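
The vertex-centric model is easy to sketch: in each superstep, every vertex consumes the messages sent to it in the previous superstep, updates its state, and sends messages along its out-edges, with a barrier between supersteps. Below is a toy PageRank-flavored re-implementation of that loop (not Google's Pregel; the damping constants and three-vertex graph are arbitrary).

```cpp
// Toy superstep loop in the vertex-centric style (not Google's Pregel):
// each superstep, vertices read last superstep's messages, update state,
// and send new messages along out-edges; the inbox swap is the barrier.
#include <algorithm>
#include <cstdio>
#include <vector>

struct V { double rank; std::vector<int> out; };

int main() {
    std::vector<V> g = {{1.0, {1, 2}}, {1.0, {2}}, {1.0, {0}}};
    std::vector<double> inbox(g.size(), 0.0), next(g.size(), 0.0);

    for (int step = 0; step < 30; ++step) {       // supersteps
        std::fill(next.begin(), next.end(), 0.0);
        for (const auto& u : g)                   // compute() at each vertex
            for (int v : u.out)                   // send along out-edges
                next[v] += u.rank / u.out.size();
        inbox.swap(next);                         // barrier between supersteps
        for (size_t u = 0; u < g.size(); ++u)     // consume messages
            g[u].rank = 0.15 + 0.85 * inbox[u];
    }
    for (size_t u = 0; u < g.size(); ++u)
        std::printf("vertex %zu: rank %.3f\n", u, g[u].rank);
}
```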

3,840 citations


"A scalable processing-in-memory acc..." refers methods in this paper

  • ...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....


  • ...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....
