Proceedings ArticleDOI

Near-Memory and In-Storage FPGA Acceleration for Emerging Cognitive Computing Workloads

TL;DR: The case for leveraging FPGAs in near-memory and in-storage settings is made, opportunities and challenges in such scenarios are presented, and a conceptual FPGA-based near-data processing architecture is introduced.
Abstract: The slowdown in Moore’s Law has resulted in poor scaling of performance and energy. This slowdown in scaling has been accompanied by the explosive growth of cognitive computing applications, creating a demand for high-performance and energy-efficient solutions. Amidst this climate, FPGA-based accelerators are emerging as a potential platform for deploying accelerators for cognitive computing workloads. However, the slowdown in scaling also limits the scaling of memory and I/O bandwidths. Additionally, a growing fraction of energy is spent on data transfer between off-chip memory and the compute units. Thus, now more than ever, there is a need to leverage near-memory and in-storage computing to maximize the bandwidth available to accelerators, and further improve energy efficiency. In this paper, we make the case for leveraging FPGAs in near-memory and in-storage settings, and present opportunities and challenges in such scenarios. We introduce a conceptual FPGA-based near-data processing architecture, and discuss innovations in architecture, systems, and compilers for accelerating cognitive computing workloads.
Citations
Proceedings ArticleDOI
12 Jun 2022
TL;DR: This work introduces a new near-memory acceleration scheme for in-memory database operations, called Acceleration DIMM (AxDIMM), which behaves like a normal DIMM through the standard DIMM-compatible interface, but has embedded computing units for data-intensive operations.
Abstract: The significant overhead needed to transfer the data between CPUs and memory devices is one of the hottest issues in many areas of computing, such as database management systems. Disaggregated computing on the memory devices is being highlighted as one promising approach. In this work, we introduce a new near-memory acceleration scheme for in-memory database operations, called Acceleration DIMM (AxDIMM). It behaves like a normal DIMM through the standard DIMM-compatible interface, but has embedded computing units for data-intensive operations. With the minimized data transfer overhead, it reduces CPU resource consumption, relieves the memory bandwidth bottleneck, and boosts energy efficiency. We implement scan operations, one of the most data-intensive database operations, within AxDIMM and compare its performance with SIMD (Single Instruction Multiple Data) implementation on CPU. Our investigation shows that the acceleration achieves 6.8x more throughput than the SIMD implementation.
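
For intuition, here is a minimal sketch of the kind of scan kernel being compared: a column scan with a greater-than predicate. The near-memory variant runs the same loop on the DIMM-side compute units so that only matching row indices, rather than the full column, cross the memory bus. The function name and types are illustrative, not AxDIMM's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline predicate scan over a column of 32-bit integers.
 * On the host this loop is typically vectorized with SIMD; in a near-memory
 * setting it would execute on the DIMM's embedded compute units instead. */
size_t scan_gt(const int32_t *col, size_t n, int32_t threshold, uint32_t *out_rows)
{
    size_t matches = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] > threshold)
            out_rows[matches++] = (uint32_t)i;   /* emit only matching row ids */
    }
    return matches;
}
```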

6 citations

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a hardware/software design methodology for associative in-memory processors that aims to decrease the energy consumption and area requirements of a processor architecture specifically programmed to perform a given task.

5 citations

Proceedings ArticleDOI
17 Jun 2021
TL;DR: In this article, a framework for automatic generation of FPGA-based accelerators capable of data filtering and transformation for key-value stores based on simple data-format specifications is presented.
Abstract: Near-Data Processing is a promising approach to overcome the limitations of slow I/O interfaces in the quest to analyze the ever-growing amount of data stored in database systems. Next to CPUs, FPGAs will play an important role in the realization of functional units operating close to data stored in non-volatile memories such as Flash. It is essential that the NDP device understands the formats and layouts of the persistent data in order to perform operations in-situ. To this end, carefully optimized format parsers and layout accessors are needed. However, designing such FPGA-based Near-Data Processing accelerators requires significant effort and expertise. To make FPGA-based Near-Data Processing accessible to non-FPGA experts, we present a framework for the automatic generation of FPGA-based accelerators capable of data filtering and transformation for key-value stores based on simple data-format specifications. The evaluation shows that our framework is able to generate accelerators that are almost identical in performance to the manually optimized designs of prior work, while requiring little to no FPGA-specific knowledge and additionally providing improved flexibility and more powerful functionality.
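
As a rough illustration of what such a generated accessor, filter, and transformation compute, here is a software sketch for an assumed fixed-layout record format; the field names, layout, and functions are placeholders, not the framework's actual output.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed fixed-layout value format of a key-value store entry. */
typedef struct {
    uint64_t key;
    uint32_t timestamp;
    uint32_t price_cents;
    char     category[8];
} record_t;

/* In-situ filter: keep records within a price range and a given category. */
static bool keep_record(const record_t *r, uint32_t lo, uint32_t hi,
                        const char *category)
{
    return r->price_cents >= lo && r->price_cents <= hi &&
           strncmp(r->category, category, sizeof r->category) == 0;
}

/* In-situ transformation: project only the fields the host query needs,
 * so far less data is shipped back over the I/O interface. */
typedef struct { uint64_t key; uint32_t price_cents; } projected_t;

static projected_t project(const record_t *r)
{
    projected_t p = { r->key, r->price_cents };
    return p;
}
```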

2 citations

Journal ArticleDOI
TL;DR: This brief investigates the use of analog memristive CAMs to improve the area density of MBCs, enabling denser memories and lower power consumption; simulation results indicate significant area and energy savings.
Abstract: Memory-based reconfigurable computing rose to outperform conventional FPGAs in logic function density, as well as being less impacted by the programmable interconnection. However, the major drawback is that the area and power overheads are significant, and this gets worse in high-density, high-performance Memory-based Computing (MBC) devices. In this brief, we investigate the usage of analog memristive CAMs to improve the area density of MBCs, thus enabling denser memories and smaller power consumption. Simulation results indicate significant area and energy savings, up to 2.2× and 2.6×, respectively, compared to the non-volatile counterparts, at the cost of increasing latency by up to 20%.
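
As a point of reference, the sketch below models only the function a CAM computes: matching a search key against stored words with don't-care bits. An analog memristive CAM evaluates all rows in parallel inside the array itself, which is where the area and energy savings come from. The row count and word width are illustrative.

```c
#include <stdint.h>

#define CAM_ROWS 16

/* Each row stores a value and a care mask; bits with care=0 are "don't care". */
typedef struct {
    uint16_t value;
    uint16_t care_mask;
} cam_row_t;

/* Returns the index of the first matching row, or -1 if none matches.
 * A hardware CAM performs this comparison on every row simultaneously. */
int cam_search(const cam_row_t rows[CAM_ROWS], uint16_t key)
{
    for (int i = 0; i < CAM_ROWS; i++) {
        if (((key ^ rows[i].value) & rows[i].care_mask) == 0)
            return i;
    }
    return -1;
}
```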

1 citation

Proceedings ArticleDOI
01 Jul 2022
TL;DR: This work presents an automated design and optimization flow for near on-chip-memory computing systems, placing dedicated hardware accelerators directly next to the on-chip memory and allowing a much richer set of optimizations than traditional Register Transfer Level (RTL) based approaches.
Abstract: This work presents an automated design and optimization flow for near on-chip-memory computing systems, in which dedicated hardware accelerators are placed directly next to the on-chip memory. The salient feature of our proposed flow is that it allows these complex systems to be designed completely at the behavioral level, thus allowing a much richer set of optimizations than traditional Register Transfer Level (RTL) based approaches. Moreover, raising the level of design abstraction allows the effect of different optimizations on the overall area, performance, and power to be evaluated quickly. In addition, it allows systems with particular area and performance trade-offs to be generated quickly by simply setting different combinations of synthesis options. Experimental results under different constraints show the effectiveness of our proposed approach.
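
To make the exploration concrete, the sketch below enumerates combinations of two assumed synthesis options (loop unrolling factor and pipelining) and prints an area/latency estimate for each; the option names and the cost model are placeholders, since a real flow would invoke the high-level synthesis tool for every combination.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  unroll;    /* loop unrolling factor (assumed option)   */
    bool pipeline;  /* enable pipelining (assumed option)       */
    int  area;      /* estimated area, arbitrary units          */
    int  latency;   /* estimated latency, arbitrary cycles      */
} design_point_t;

/* Placeholder cost model standing in for an actual synthesis run. */
static design_point_t evaluate(int unroll, bool pipeline)
{
    design_point_t p = { unroll, pipeline, 0, 0 };
    p.area    = 100 * unroll + (pipeline ? 150 : 0);
    p.latency = (pipeline ? 800 : 1600) / unroll;
    return p;
}

int main(void)
{
    const int unroll_opts[] = { 1, 2, 4, 8 };
    for (int u = 0; u < 4; u++) {
        for (int pipe = 0; pipe <= 1; pipe++) {
            design_point_t p = evaluate(unroll_opts[u], pipe != 0);
            printf("unroll=%d pipeline=%d -> area=%d latency=%d\n",
                   p.unroll, (int)p.pipeline, p.area, p.latency);
        }
    }
    return 0;
}
```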
References
Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, which achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.
Abstract: Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models—aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
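
For a rough sense of how a single instruction can expand into a large number of multiply-accumulates, the sketch below shows a tiled matrix-vector multiply in which one "instruction" corresponds to one call; the tile dimensions are illustrative constants, not Brainwave's actual parameters.

```c
#include <stddef.h>

#define TILE_ROWS 128   /* assumed number of parallel MAC lanes           */
#define TILE_COLS 128   /* assumed elements accumulated per lane per call */

/* One "matrix-vector multiply" instruction: y += W * x over a tile.
 * The outer loop corresponds to lanes that run in parallel in hardware;
 * the inner loop is the per-lane chain of multiply-accumulates. */
void mvm_tile(const float W[TILE_ROWS][TILE_COLS],
              const float x[TILE_COLS],
              float y[TILE_ROWS])
{
    for (size_t r = 0; r < TILE_ROWS; r++) {
        float acc = y[r];
        for (size_t c = 0; c < TILE_COLS; c++)
            acc += W[r][c] * x[c];
        y[r] = acc;
    }
}
```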

498 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: The basic architecture of the Neurocube and an analysis of the logic tier synthesized in 28nm and 15nm process technologies are presented, and performance is evaluated by mapping a Convolutional Neural Network and estimating the subsequent power and performance for both training and inference.
Abstract: This paper presents a programmable and scalable digital neuromorphic architecture based on 3D high-density memory integrated with a logic tier for efficient neural computing. The proposed architecture consists of clusters of processing engines, connected by a 2D mesh network, as a processing tier, which is integrated in 3D with multiple tiers of DRAM. The PE clusters access multiple memory channels (vaults) in parallel. The operating principle, referred to as memory-centric computing, embeds specialized state-machines within the vault controllers of HMC to drive data into the PE clusters. The paper presents the basic architecture of the Neurocube and an analysis of the logic tier synthesized in 28nm and 15nm process technologies. The performance of the Neurocube is evaluated and illustrated through the mapping of a Convolutional Neural Network and estimating the subsequent power and performance for both training and inference.
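
A minimal sketch of the memory-centric mapping described above, under the assumption that neurons are partitioned so each vault's controller streams only locally stored weights to the PE cluster beneath it; the partitioning rule and names are illustrative, not the Neurocube's actual scheduler.

```c
#include <stddef.h>

#define NUM_VAULTS 16   /* assumed number of HMC vaults / PE clusters */

/* Assign each output neuron to the vault that stores its weights. */
static size_t vault_of_neuron(size_t neuron) { return neuron % NUM_VAULTS; }

/* Drive one layer: every neuron is processed by the PE cluster attached to
 * its home vault, so weight traffic never needs to cross vaults. */
void forward_layer(size_t num_neurons,
                   void (*compute_on_vault)(size_t vault, size_t neuron))
{
    for (size_t n = 0; n < num_neurons; n++)
        compute_on_vault(vault_of_neuron(n), n);
}
```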

415 citations


"Near-Memory and In-Storage FPGA Acc..." refers background in this paper

  • ...HMC is designed to be connected to a host processor and be chained together to provide higher capacities....

  • ...Thus, the growing demands for energy efficiency, memory bandwidth, and fast access to large amounts of data have prompted researchers to explore placing general-purpose compute and accelerators closer to memory and storage, in the form of near memory acceleration [4, 5] and in-storage computing [6–8]....

  • ...Recent interest in NMA has been driven by the development of stacked DRAM organizations such as Hybrid Memory Cubes (HMC) and High Bandwidth Memory (HBM)....

  • ...While similar in spirit, HBM and HMC have different approaches and targets....

  • ...Note that this is a less costly way of integrating FPGAs in the memory system than embedding them in the logic layers of HBM/HMC, or even 2.5D package integration....

Proceedings ArticleDOI
12 Mar 2016
TL;DR: This paper presents Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays and achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.
Abstract: The energy constraints due to the end of Dennard scaling, the popularity of in-memory analytics, and the advances in 3D integration technology have led to renewed interest in near-data processing (NDP) architectures that move processing closer to main memory. Due to the limited power and area budgets of the logic layer, the NDP compute units should be area and energy efficient while providing sufficient compute capability to match the high bandwidth of vertical memory channels. They should also be flexible to accommodate a wide range of applications. Towards this goal, NDP units based on fine-grained (FPGA) and coarse-grained (CGRA) reconfigurable logic have been proposed as a compromise between the efficiency of custom engines and the flexibility of programmable cores. Unfortunately, FPGAs incur significant area overheads for bit-level reconfiguration, while CGRAs consume significant power in the interconnect and are inefficient for irregular data layouts and control flows. This paper presents Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays. HRL combines both coarse-grained and fine-grained logic blocks, separates routing networks for data and control signals, and uses specialized units to effectively support branch operations and irregular data layouts in analytics workloads. HRL has the power efficiency of FPGA and the area efficiency of CGRA. It improves performance per Watt by 2.2x over FPGA and 1.7x over CGRA. For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.

184 citations


"Near-Memory and In-Storage FPGA Acc..." refers background in this paper

  • ...HMC is designed to be connected to a host processor and be chained together to provide higher capacities....

  • ...Thus, the growing demands for energy efficiency, memory bandwidth, and fast access to large amounts of data have prompted researchers to explore placing general-purpose compute and accelerators closer to memory and storage, in the form of near memory acceleration [4, 5] and in-storage computing [6–8]....

  • ...Recent interest in NMA has been driven by the development of stacked DRAM organizations such as Hybrid Memory Cubes (HMC) and High Bandwidth Memory (HBM)....

  • ...While similar in spirit, HBM and HMC have different approaches and targets....

  • ...Note that this is a less costly way of integrating FPGAs in the memory system than embedding them in the logic layers of HBM/HMC, or even 2.5D package integration....

Proceedings ArticleDOI
13 Jun 2015
TL;DR: BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency high-throughput inter-controller network, is presented, showing that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications.
Abstract: Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data and daily twitter feeds where the datasets of interest are 5TB to 20TB. For such a dataset, one would need a cluster with 100 servers, each with 128GB to 256GB of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a ram-cloud system falls sharply even if only 5%–10% of the references are to the secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.

174 citations


"Near-Memory and In-Storage FPGA Acc..." refers background in this paper

  • ...These in-storage processing units are typically embedded CPUs [8], but recent attempts have leveraged FPGAs as well [6, 7]....

Proceedings ArticleDOI
06 Oct 2014
TL;DR: The potential of making programmability a central feature of the SSD interface is explored, and it is found that defining SSD semantics in software is easy and beneficial, and that Willow makes it feasible for a wide range of IO-intensive applications to benefit from a customized SSD interface.
Abstract: We explore the potential of making programmability a central feature of the SSD interface. Our prototype system, called Willow, allows programmers to augment and extend the semantics of an SSD with application-specific features without compromising file system protections. The SSD Apps running on Willow give applications low-latency, high-bandwidth access to the SSD's contents while reducing the load that IO processing places on the host processor. The programming model for SSD Apps provides great flexibility, supports the concurrent execution of multiple SSD Apps in Willow, and supports the execution of trusted code in Willow. We demonstrate the effectiveness and flexibility of Willow by implementing six SSD Apps and measuring their performance. We find that defining SSD semantics in software is easy and beneficial, and that Willow makes it feasible for a wide range of IO-intensive applications to benefit from a customized SSD interface.

150 citations


"Near-Memory and In-Storage FPGA Acc..." refers background in this paper

  • ...These in-storage processing units are typically embedded CPUs [8], but recent attempts have leveraged FPGAs as well [6, 7]....
