A scalable processing-in-memory accelerator for parallel graph processing

doi:10.1145/2749469.2750386

Citations

Proceedings Article

GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

Damla Senol Cali¹, Gurpreet S. Kalsi², Zülal Bingöl³, Can Firtina⁴, Lavanya Subramanian⁵, Jeremie S. Kim⁴, Rachata Ausavarungnirun⁶, Mohammed Alser⁴, Juan Gómez-Luna⁴, Amirali Boroumand¹, Anant Nori², Allison Scibisz¹, Sreenivas Subramoney², Can Alkan³, Saugata Ghose⁷, Onur Mutlu⁴

Carnegie Mellon University¹, Intel², Bilkent University³, ETH Zurich⁴, Facebook⁵, King Mongkut's University of Technology North Bangkok⁶, University of Illinois at Urbana–Champaign⁷

01 Oct 2020

TL;DR: GenASM is an approximate string matching acceleration framework that accelerates read alignment for both long and short reads and performs pre-alignment filtering for short reads with 3.7× the performance of a state-of-the-art pre-alignment filter.

Abstract: Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism’s DNA sequence (known as reads). The first step of genome sequence analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM), which is used at multiple points during the mapping process. ASM enables read mapping to account for sequencing errors and genetic variations in the reads. We propose GenASM, the first ASM acceleration framework for genome sequence analysis. GenASM performs bitvector-based ASM, which can efficiently accelerate multiple steps of genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint. Using this modified algorithm, we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized systolic-array-based compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth, resulting in an efficient design whose performance scales linearly as we increase the number of compute units working in parallel. We demonstrate that GenASM provides significant performance and power benefits for three different use cases in genome sequence analysis. First, GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116× and 3.9×, respectively, while reducing power consumption by 37× and 2.7×. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111× and 1.9×. Second, GenASM accelerates pre-alignment filtering for short reads, with 3.7× the performance of a state-of-the-art pre-alignment filter, while reducing power consumption by 1.7× and significantly improving the filtering accuracy. Third, GenASM accelerates edit distance calculation, with 22–12501× and 9.3–400× speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while reducing power consumption by 548–582× and 67×. We conclude that GenASM is a flexible, high-performance, and low-power framework, and we briefly discuss four other use cases that can benefit from GenASM.
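
The abstract above names Bitap as the underlying bitvector-based ASM algorithm. As a rough illustration of the idea, the following is a minimal sketch of a textbook (Wu-Manber style) Bitap search in Python. It is not GenASM's modified algorithm or its hardware design; the function name `bitap_search` and the `(end_index, errors)` return format are assumptions made for this example.

```python
def bitap_search(text, pattern, k):
    """Report (end_index, errors) for each text position where `pattern`
    matches a substring ending there with at most `k` edits.

    Textbook Bitap sketch (hypothetical helper, not GenASM's variant).
    Bit j of R[d] is set when pattern[0..j] matches a suffix of the text
    read so far with at most d errors.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # Per-character bitmask: bit j is set where pattern[j] == ch.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)

    full = (1 << m) - 1      # keep only the m pattern bits
    accept = 1 << (m - 1)    # bit that signals a full-pattern match

    # Before any text is read, a prefix of length d can be "matched"
    # by deleting d pattern characters.
    R = [((1 << d) - 1) & full for d in range(k + 1)]

    hits = []
    for i, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev_old = R[0]                      # R[d-1] before this step
        R[0] = ((R[0] << 1) | 1) & mask      # exact-match transition
        for d in range(1, k + 1):
            cur_old = R[d]
            R[d] = (
                (((cur_old << 1) | 1) & mask)   # match
                | prev_old                      # extra character in text
                | ((prev_old << 1) | 1)         # substitution
                | ((R[d - 1] << 1) | 1)         # character missing in text
            ) & full
            prev_old = cur_old
        # Record the smallest error count that accepts at this position.
        for d in range(k + 1):
            if R[d] & accept:
                hits.append((i, d))
                break
    return hits
```

Calling `bitap_search("xxabcxx", "abc", 0)` performs exact matching, while a larger `k` also reports positions reachable through substitutions, insertions, or deletions. Python integers are unbounded, so the sketch sidesteps the word-size limits that a hardware or C implementation would need to handle.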

92 citations

Journal Article

Processing-in-memory: A workload-driven perspective

Saugata Ghose¹, Amirali Boroumand¹, Jeremie S. Kim¹, Juan Gómez-Luna², Onur Mutlu¹,²

Carnegie Mellon University¹, ETH Zurich²

08 Aug 2019 - IBM Journal of Research and Development

TL;DR: This article describes work on systematically identifying opportunities for PIM in real applications, quantifies potential gains for popular emerging applications (e.g., machine learning, data analytics, genome analysis), and describes challenges that remain for the widespread adoption of PIM.

Abstract: Many modern and emerging applications must process increasingly large volumes of data. Unfortunately, prevalent computing paradigms are not designed to efficiently handle such large-scale data: The energy and performance costs to move this data between the memory subsystem and the CPU now dominate the total costs of computation. This forces system architects and designers to fundamentally rethink how to design computers. Processing-in-memory (PIM) is a computing paradigm that avoids most data movement costs by bringing computation to the data. New opportunities in modern memory systems are enabling architectures that can perform varying degrees of processing inside the memory subsystem. However, many practical system-level issues must be tackled to construct PIM architectures, including enabling workloads and programmers to easily take advantage of PIM. This article examines three key domains of work toward the practical construction and widespread adoption of PIM architectures. First, we describe our work on systematically identifying opportunities for PIM in real applications and quantify potential gains for popular emerging applications (e.g., machine learning, data analytics, genome analysis). Second, we aim to solve several key issues in programming these applications for PIM architectures. Third, we describe challenges that remain for the widespread adoption of PIM.

91 citations

Cites background from "A scalable processing-in-memory acc..."

...examples of entire application offloading: Tesseract [5] and GRIM-Filter [6]....
[...]
...coherence proposed by prior PIM works [5, 21, 75] either...
[...]
...Tesseract combines this new architecture with a message-passing-based programming model, where message passing is used to perform operations on the graph nodes by moving the operations to the vaults where the corresponding graph nodes are stored....
[...]
...8 , and reduces the energy consumption by 87%, over a conventional CPU-only system [5]....
[...]
...Tesseract adds an in-order core to each vault in an HMC-like 3D-stacked memory and implements an efficient communication protocol between these in-order cores....
[...]
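
The citation context above describes Tesseract's programming model as moving operations to the vaults where the corresponding graph nodes are stored. The toy Python simulation below sketches only that routing idea. The vault count, the modulo-based `owner` mapping, and the `propagate` helper are all hypothetical choices for illustration and are not taken from the Tesseract paper.

```python
from collections import defaultdict, deque

NUM_VAULTS = 4  # hypothetical number of memory vaults


def owner(vertex):
    """Assumed placement rule: vertex v is stored in vault v mod NUM_VAULTS."""
    return vertex % NUM_VAULTS


def propagate(edges, values):
    """One round of neighbor updates done by message passing.

    Instead of reading remote data, each edge (src, dst) sends an
    "add values[src] to dst" message to the vault that owns dst.
    Each vault then applies only the messages for its own vertices.
    """
    inboxes = [deque() for _ in range(NUM_VAULTS)]

    # Send phase: route every operation to the owning vault.
    for src, dst in edges:
        inboxes[owner(dst)].append((dst, values[src]))

    # Apply phase: each vault drains its inbox and updates local vertices.
    accumulated = defaultdict(int)
    for vault_id, inbox in enumerate(inboxes):
        while inbox:
            dst, amount = inbox.popleft()
            assert owner(dst) == vault_id  # the update is always local
            accumulated[dst] += amount
    return dict(accumulated)
```

In this sketch the vaults are drained one after another; in a real PIM system they would run concurrently, which is where the parallelism comes from.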

Proceedings Article

Concurrent Data Structures for Near-Memory Computing

Zhiyu Liu¹, Irina Calciu², Maurice Herlihy¹, Onur Mutlu³

Brown University¹, VMware², ETH Zurich³

24 Jul 2017

TL;DR: This paper is the first to examine the design of concurrent data structures for PIM, and shows two main results: (1) naive PIM data structures cannot outperform state-of-the-art concurrent data structures, and (2) novel designs for PIM data structures, using techniques such as combining, partitioning, and pipelining, can outperform traditional concurrent data structures, with a significantly simpler design.

Abstract: The performance gap between memory and CPU has grown exponentially. To bridge this gap, hardware architects have proposed near-memory computing (also called processing-in-memory, or PIM), where a lightweight processor (called a PIM core) is located close to memory. Due to its proximity to memory, a memory access from a PIM core is much faster than that from a CPU core. New advances in 3D integration and die-stacked memory make PIM viable in the near future. Prior work has shown significant performance improvements by using PIM for embarrassingly parallel and data-intensive applications, as well as for pointer-chasing traversals in sequential data structures. However, current server machines have hundreds of cores, and algorithms for concurrent data structures exploit these cores to achieve high throughput and scalability, with significant benefits over sequential data structures. Thus, it is important to examine how PIM performs with respect to modern concurrent data structures and understand how concurrent data structures can be developed to take advantage of PIM. This paper is the first to examine the design of concurrent data structures for PIM. We show two main results: (1) naive PIM data structures cannot outperform state-of-the-art concurrent data structures, such as pointer-chasing data structures and FIFO queues, (2) novel designs for PIM data structures, using techniques such as combining, partitioning and pipelining, can outperform traditional concurrent data structures, with a significantly simpler design.

90 citations

Cites background or methods from "A scalable processing-in-memory acc..."

...Although some researchers have studied how PIM memory can help speed up concurrent operations to data structures, such as parallel graph processing [1] and parallel pointer chasing on linked data structures [30], the applications they consider require very simple, if any, synchronization between operations....
[...]
...A PIM core is a lightweight CPU that may be slower than a full-fledged CPU with respect to computation speed [1]....
[...]
...For example, one PIM design [1, 2, 9, 53] assumes that memory is organized in multiple vaults, each having an in-order PIM core to manage it....
[...]
...Prior work has already shown significant performance improvements by using PIM for embarrassingly parallel and data-intensive applications [1, 3, 29, 53, 54], as well as for pointer-chasing traversals [23, 30] in sequential data structures....
[...]
...Prior work ([1, 23, 30]) has shown that pointer chasing can be done more efficiently by a PIM core for a sequential data structure....
[...]

Proceedings Article

NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning

Gagandeep Singh¹, Juan Gómez-Luna¹, Giovanni Mariani¹, Geraldo F. Oliveira¹, Stefano Corda¹, Sander Stuijk¹, Onur Mutlu¹, Henk Corporaal¹

Eindhoven University of Technology¹

02 Jun 2019

TL;DR: This paper presents NAPEL, a high-level performance and energy estimation framework for NMC architectures that leverages ensemble learning to build a model based on microarchitectural parameters and application characteristics, and that can make accurate predictions for previously unseen applications.

Abstract: The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. A promising paradigm to alleviate this data movement bottleneck is near-memory computing (NMC), which consists of placing compute units close to the memory/storage units. There is substantial research effort that proposes NMC architectures and identifies workloads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times for design space exploration. In order to enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models. We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model that is based on microarchitectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of 8.5% and 11.6% for performance and energy estimations, respectively, compared to the NMC simulator. NAPEL is also capable of making accurate predictions for previously unseen applications.

89 citations

Cites background or methods from "A scalable processing-in-memory acc..."

..., [2, 4, 7, 15, 20]) for architectural performance and energy evaluation....
[...]
...The NMC subsystem consists of a 3D-stacked memory [2, 22, 23, 29] with processing elements (PEs) embedded in its logic layer....
[...]
...In this work, we model NMC PEs as in-order, single-issue cores with a private cache as proposed in previous work [2, 11], taking into account the limited thermal and area budget in the logic layer....
[...]

Journal Article

Caribou: intelligent distributed storage

Zsolt István¹, David Sidler¹, Gustavo Alonso¹

ETH Zurich¹

01 Aug 2017

TL;DR: This paper explores near-data processing in database engines, i.e., the option of offloading part of the computation directly to the storage nodes, and implements the ideas in Caribou, an intelligent distributed storage layer incorporating many of the lessons learned while building systems with specialized hardware.

Abstract: The ever-increasing amount of data being handled in data centers causes an intrinsic inefficiency: moving data around is expensive in terms of bandwidth, latency, and power consumption, especially given the low computational complexity of many database operations. In this paper we explore near-data processing in database engines, i.e., the option of offloading part of the computation directly to the storage nodes. We implement our ideas in Caribou, an intelligent distributed storage layer incorporating many of the lessons learned while building systems with specialized hardware. Caribou provides access to DRAM/NVRAM storage over the network through a simple key-value store interface, with each storage node providing high-bandwidth near-data processing at line rate and fault tolerance through replication. The result is a highly efficient, distributed, intelligent data storage that can be used to both boost performance and reduce power consumption and real estate usage in the data center thanks to the micro-server architecture adopted.

89 citations

Cites background from "A scalable processing-in-memory acc..."

...The idea of performing near-data computation is not new, and has been explored both in the context of main memory (active memory [2, 17, 39, 66]) and persistent storage (active disk [1, 14, 23, 46])....
[...]
