
Showing papers by "Wayne Luk" published in 2021


Proceedings ArticleDOI
TL;DR: In this paper, a reconfigurable architecture for reducing the latency of recurrent neural networks (RNNs) that are used for detecting gravitational waves is presented, which is based on optimizing the initiation intervals (II) in a multi-layer LSTM (Long Short-Term Memory) network, by identifying appropriate reuse factors for each layer.
Abstract: This paper presents novel reconfigurable architectures for reducing the latency of recurrent neural networks (RNNs) that are used for detecting gravitational waves. Gravitational interferometers such as the LIGO detectors capture cosmic events such as black hole mergers, which happen at unknown times and have varying durations, producing time-series data. We have developed a new architecture capable of accelerating RNN inference for analyzing time-series data from LIGO detectors. This architecture is based on optimizing the initiation intervals (II) in a multi-layer LSTM (Long Short-Term Memory) network by identifying appropriate reuse factors for each layer. A customizable template for this architecture has been designed, which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools. The proposed approach has been evaluated on two LSTM models, targeting a ZYNQ 7045 FPGA and a U250 FPGA. Experimental results show that with balanced II, the number of DSPs can be reduced by up to 42% while achieving the same IIs. Compared to other FPGA-based LSTM designs, our design achieves about 4.92 to 12.4 times lower latency.
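The key optimization here is arithmetic that can be sketched in a few lines: each LSTM layer needs a certain number of multiply-accumulates per timestep, and a per-layer reuse factor trades multipliers (DSPs) against initiation interval. The sketch below, with hypothetical layer sizes and a hypothetical target II, illustrates how balanced reuse factors keep every layer at the same II with the fewest multipliers; it is not the paper's tool.

```python
# Minimal sketch: balancing initiation intervals (II) across LSTM layers
# by picking a per-layer reuse factor. Layer sizes and target II are hypothetical.

def lstm_macs(input_size, hidden_size):
    # An LSTM cell computes 4 gates, each an (input+hidden) x hidden matrix-vector product.
    return 4 * hidden_size * (input_size + hidden_size)

def balance_reuse_factors(layers, target_ii):
    """For each layer, choose the largest reuse factor not exceeding the target II,
    so every layer meets the same II while using as few multipliers (DSPs) as possible."""
    plan = []
    for input_size, hidden_size in layers:
        macs = lstm_macs(input_size, hidden_size)
        reuse = min(target_ii, macs)          # cannot reuse more than there are MACs
        dsps = -(-macs // reuse)              # ceiling division: multipliers needed
        plan.append({"macs": macs, "reuse_factor": reuse, "dsps": dsps, "ii": reuse})
    return plan

# Hypothetical 2-layer model: input width 4, hidden widths 32 and 16.
for layer in balance_reuse_factors([(4, 32), (32, 16)], target_ii=64):
    print(layer)
```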

21 citations


Proceedings ArticleDOI
05 Dec 2021
TL;DR: In this article, the authors proposed an FPGA-based hardware architecture to accelerate BNNs inferred through Monte Carlo Dropout, which can achieve up to 4 times higher energy efficiency and 9 times better compute efficiency compared with other state-of-the-art BNN accelerators.
Abstract: Neural networks (NNs) have demonstrated their potential in a wide range of applications such as image recognition, decision making or recommendation systems. However, standard NNs are unable to capture their model uncertainty which is crucial for many safety-critical applications including healthcare and autonomous vehicles. In comparison, Bayesian neural networks (BNNs) are able to express uncertainty in their prediction via a mathematical grounding. Nevertheless, BNNs have not been as widely used in industrial practice, mainly because of their expensive computational cost and limited hardware performance. This work proposes a novel FPGA-based hardware architecture to accelerate BNNs inferred through Monte Carlo Dropout. Compared with other state-of-the-art BNN accelerators, the proposed accelerator can achieve up to 4 times higher energy efficiency and 9 times better compute efficiency. Considering partial Bayesian inference, an automatic framework is proposed, which explores the trade-off between hardware and algorithmic performance. Extensive experiments are conducted to demonstrate that our proposed framework can effectively find the optimal points in the design space.
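For reference, the algorithm the accelerator targets, Monte Carlo Dropout, can be summarised in software as repeated stochastic forward passes. The sketch below uses a hypothetical PyTorch classifier; the mean of the sampled predictions is the output and their spread is the uncertainty estimate.

```python
# Minimal software sketch of Monte Carlo Dropout inference: run T stochastic
# forward passes with dropout active and use the spread of the predictions
# as an uncertainty estimate. The model below is a hypothetical small classifier.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

def mc_dropout_predict(model, x, samples=20):
    model.train()                 # keep dropout layers stochastic at inference time
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(dim=-1) for _ in range(samples)])
    return preds.mean(dim=0), preds.std(dim=0)   # predictive mean and uncertainty

mean, std = mc_dropout_predict(model, torch.randn(1, 784))
print(mean.argmax(dim=-1), std.max())
```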

6 citations


Journal ArticleDOI
TL;DR: This work introduces a novel method for fast and accurate estimation of different metrics that are of importance when performing design space exploration, based on a Gaussian process regression model parametrised by the features of the accelerator and the target NN to be accelerated.
Abstract: Contemporary advances in neural networks (NNs) have demonstrated their potential in different applications such as in image classification, object detection or natural language processing. In particular, reconfigurable accelerators have been widely used for the acceleration of NNs due to their reconfigurability and efficiency in specific application instances. To determine the configuration of the accelerator, it is necessary to conduct design space exploration to optimize the performance. However, the process of design space exploration is time consuming because of the slow performance evaluation for different configurations. Therefore, there is a demand for an accurate and fast performance prediction method to speed up design space exploration. This work introduces a novel method for fast and accurate estimation of different metrics that are of importance when performing design space exploration. The method is based on a Gaussian process regression model parametrised by the features of the accelerator and the target NN to be accelerated. We evaluate the proposed method together with other popular machine learning based methods in estimating the latency and energy consumption of our implemented accelerator on two different hardware platforms targeting convolutional neural networks. We demonstrate improvements in estimation accuracy, without the need for significant implementation effort or tuning.
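A minimal software sketch of the estimation idea follows: fit a Gaussian process regressor on features describing the accelerator configuration and the target NN, then query it for an unseen configuration together with an uncertainty estimate. The feature set and training data below are synthetic placeholders, not the paper's.

```python
# Minimal sketch: Gaussian process regression mapping accelerator/NN features to a
# performance metric (e.g. latency). Features and data are synthetic placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Features: [parallelism, buffer size (KB), NN MACs (millions)] for sampled configurations.
X = rng.uniform([1, 16, 10], [64, 512, 500], size=(40, 3))
y = 50.0 / X[:, 0] * X[:, 2] + rng.normal(0, 0.5, 40)   # synthetic latency (ms)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=[10, 100, 100]) + WhiteKernel(),
                              normalize_y=True).fit(X, y)
pred, std = gp.predict(np.array([[32, 256, 100]]), return_std=True)
print(f"predicted latency {pred[0]:.2f} ms +/- {std[0]:.2f}")
```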

6 citations


Journal ArticleDOI
TL;DR: In this article, a highly customized streaming hardware architecture that focuses on improving the compute efficiency for streaming applications by providing full-stack acceleration of CNNs on FPGAs is presented.
Abstract: Due to the huge success and rapid development of convolutional neural networks (CNNs), there is a growing demand for hardware accelerators that accommodate a variety of CNNs to improve their inference latency and energy efficiency, in order to enable their deployment in real-time applications. Among popular platforms, field-programmable gate arrays (FPGAs) have been widely adopted for CNN acceleration because of their capability to provide superior energy efficiency and low-latency processing, while supporting high reconfigurability, making them favorable for accelerating rapidly evolving CNN algorithms. This article introduces a highly customized streaming hardware architecture that focuses on improving the compute efficiency for streaming applications by providing full-stack acceleration of CNNs on FPGAs. The proposed accelerator maps most computational functions, that is, convolutional and deconvolutional layers, into a single unified module, and implements the residual and concatenative connections between the functions with high efficiency, to support the inference of mainstream CNNs with different topologies. This architecture is further optimized by exploiting different levels of parallelism, layer fusion, and fully leveraging digital signal processing blocks (DSPs). The proposed accelerator has been implemented on Intel's Arria 10 GX1150 hardware and evaluated with a wide range of benchmark models. The results demonstrate a throughput of over 1.3 TOP/s and up to 97% compute (multiply-accumulate, MAC) efficiency, outperforming state-of-the-art FPGA accelerators.
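As a rough sanity check of the reported compute efficiency, MAC efficiency can be computed as achieved throughput divided by the peak throughput of the instantiated multipliers. The multiplier count and clock frequency in the sketch below are illustrative assumptions, not the paper's exact configuration.

```python
# Back-of-the-envelope compute (MAC) efficiency: achieved ops/s over peak ops/s
# of the instantiated multipliers. Multiplier count and clock are illustrative.
def mac_efficiency(achieved_tops, multipliers, freq_mhz):
    peak_tops = multipliers * 2 * freq_mhz * 1e6 / 1e12   # 1 MAC = 2 ops (mul + add)
    return achieved_tops / peak_tops

print(f"{mac_efficiency(achieved_tops=1.3, multipliers=3036, freq_mhz=220):.1%}")
```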

5 citations


Proceedings ArticleDOI
17 Feb 2021
TL;DR: In this paper, the authors propose a novel alignment pipeline that considers all information in sequencing data for biologically accurate acceleration of short read mapping, and accelerate the memory-bound operations that have been a bottleneck in short read alignment.
Abstract: Existing FPGA accelerators for short read mapping often fail to utilize the complete biological information in sequencing data for the sake of simple hardware design, leading to missed or incorrect alignments. Furthermore, their performance may not be optimized across hardware platforms. This paper proposes a novel alignment pipeline that considers all information in sequencing data for biologically accurate acceleration of short read mapping. To ensure the performance of the proposed design is optimized across different platforms, we accelerate the memory-bound operations which have been a bottleneck in short read mapping. Specifically, we partition the FM-index into buckets. The length of each bucket is equal to an optimal multiple of the memory burst size and is determined through data-driven exploration. A tool has been developed to obtain the optimal design parameters for different hardware platforms. Experimental results indicate that our design maximizes alignment accuracy compared to the state-of-the-art software Bowtie, mapping reads 4.48x as fast. Compared to the previous hardware aligner, our design achieves 97.7% accuracy, reporting 4.48M more valid alignments at a similar speed.
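The bucketing idea can be illustrated with a small software sketch: each bucket stores the cumulative symbol counts up to its start plus the stretch of BWT characters it covers, so a single memory burst can answer an occurrence query. The sketch below is a simplified, software-only illustration with an assumed burst and bucket size, not the paper's hardware design.

```python
# Simplified sketch of bucketing an FM-index occurrence table: each bucket stores
# the cumulative counts up to its start plus the BWT characters it covers, sized so
# that one bucket fits an integer multiple of the memory burst size.
ALPHABET = "ACGT"

def build_buckets(bwt, chars_per_bucket):
    buckets, counts = [], {c: 0 for c in ALPHABET}
    for start in range(0, len(bwt), chars_per_bucket):
        chunk = bwt[start:start + chars_per_bucket]
        buckets.append({"base_counts": dict(counts), "chars": chunk})
        for c in chunk:
            counts[c] += 1
    return buckets

def occ(buckets, chars_per_bucket, c, pos):
    """Number of occurrences of c in bwt[0:pos]; touches exactly one bucket."""
    b = buckets[pos // chars_per_bucket]
    return b["base_counts"][c] + b["chars"][: pos % chars_per_bucket].count(c)

bwt = "ACGTACGTGGCA"
buckets = build_buckets(bwt, chars_per_bucket=8)
print(occ(buckets, 8, "G", 10), bwt[:10].count("G"))   # the two counts should agree
```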

5 citations


Proceedings ArticleDOI
27 Jul 2021
TL;DR: In this paper, a reinforcement learning method known as Reward-modulated STDP is presented as an online learning algorithm in the network; system performance is evaluated in a single box of the designed architecture using 6000 concurrent hardware threads, demonstrating scaling to networks with up to 2 million neurons and 400 million synapses.
Abstract: Neuromorphic computing systems simulate spiking neural networks that are used for research into how biological neural networks function, as well as for applied engineering such as robotics, pattern recognition, and machine learning. In this paper, we present a neuromorphic system based on an asynchronous event-based hardware platform. We present three algorithms for implementing spiking networks on our asynchronous hardware platform. We also discuss different trade-offs between synchronisation and messaging costs. A reinforcement learning method known as Reward-modulated STDP is presented as an online learning algorithm in the network. We evaluate the system performance in a single box of our designed architecture using 6000 concurrent hardware threads and demonstrate scaling to networks with up to 2 million neurons and 400 million synapses. The performance of our architecture is also compared to existing neuromorphic platforms, showing a 20 times speed-up over the Brian simulator on an x86 machine, and a 16 times speed-up over a 48-chip SpiNNaker node.
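A minimal software sketch of reward-modulated STDP, in its generic textbook form rather than the paper's specific implementation, is shown below: spike pairings accumulate an eligibility trace, and a global reward signal gates the actual weight change. All constants are illustrative.

```python
# Minimal sketch of reward-modulated STDP for one synapse: spike pairings build an
# eligibility trace, and the weight only changes when a reward signal arrives.
import numpy as np

A_PLUS, A_MINUS = 0.01, 0.012     # STDP amplitudes (illustrative)
TAU_PLUS = TAU_MINUS = 20.0       # STDP time constants (ms)
TAU_E = 200.0                     # eligibility-trace decay (ms)
LR = 0.1

def stdp(dt):
    """Pair contribution for post-minus-pre spike time difference dt (ms)."""
    return A_PLUS * np.exp(-dt / TAU_PLUS) if dt > 0 else -A_MINUS * np.exp(dt / TAU_MINUS)

w, trace = 0.5, 0.0
events = [("pair", 5.0), ("step", None), ("reward", 1.0)]   # illustrative event stream
for kind, value in events:
    trace *= np.exp(-1.0 / TAU_E)              # decay the trace each 1 ms step
    if kind == "pair":
        trace += stdp(value)                    # potentiating pairing (post after pre)
    elif kind == "reward":
        w += LR * value * trace                 # the reward gates the weight update
print(w, trace)
```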

4 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a design methodology to facilitate rigorous development of complex applications targeting reconfigurable hardware, which relies on analytical estimation of system performance and area utilisation for a given specific application and a particular system instance consisting of a control-flow machine working in conjunction with one or more reconfigurable dataflow accelerators.
Abstract: We propose a design methodology to facilitate rigorous development of complex applications targeting reconfigurable hardware. Our methodology relies on analytical estimation of system performance and area utilisation for a given specific application and a particular system instance consisting of a control-flow machine working in conjunction with one or more reconfigurable dataflow accelerators. The targeted application is carefully analyzed, and the parts identified for hardware acceleration are reimplemented as a set of representative software models. Next, with the results of the application analysis, a suitable system architecture is devised and its performance is evaluated to determine bottlenecks, allowing predictable design. The architecture is iteratively refined until the final version, satisfying the specification requirements in terms of performance and required hardware area, is obtained. We validate the presented methodology using a widely accepted convolutional neural network (VGG-16) and an important HPC application (BQCD). In both cases, our methodology revealed and alleviated all system bottlenecks before the hardware implementation was started. As a result, the architectures were implemented first-time right, achieving state-of-the-art performance within 15% of our modelling estimations.
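The kind of analytical estimate such a methodology relies on can be sketched briefly: bound a kernel's runtime by the slower of its compute time and its data-movement time for a candidate system instance. The sketch below uses placeholder parameters and is not the paper's actual model.

```python
# Minimal sketch of an analytical performance estimate for one accelerated kernel:
# runtime is bounded by the slower of compute and data movement. Numbers are
# placeholders for a candidate dataflow-accelerator instance, not the paper's models.
def estimate_runtime_s(ops, bytes_moved, pes, freq_hz, ops_per_pe_cycle, mem_bw_gbs):
    compute_s = ops / (pes * ops_per_pe_cycle * freq_hz)
    memory_s = bytes_moved / (mem_bw_gbs * 1e9)
    bound = "memory" if memory_s > compute_s else "compute"
    return max(compute_s, memory_s), bound

t, bound = estimate_runtime_s(ops=30.8e9, bytes_moved=0.6e9,
                              pes=2048, freq_hz=200e6, ops_per_pe_cycle=2, mem_bw_gbs=38)
print(f"{t*1e3:.2f} ms ({bound}-bound)")
```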

3 citations


Proceedings ArticleDOI
15 Jul 2021
TL;DR: Simodense as mentioned in this paper is a recently released open-source softcore optimized for evaluating custom SIMD instructions, which can help with the challenges found in today's FPGAs by providing RTL-based programmability.
Abstract: This demo elaborates on the programmability aspect of Simodense, a recently released open-source softcore optimised for evaluating custom SIMD instructions. CPUs featuring small reconfigurable areas for implementing custom instructions represent an alternative path in computer architecture that can help with the challenges found in today's FPGAs. By providing RTL-based programmability for implementing custom SIMD instructions, highly integrated accelerators can be developed while benefiting from pre-existing CPU logic, such as the caches and their high bandwidth to main memory.

3 citations


Proceedings ArticleDOI
09 May 2021
TL;DR: In this paper, the authors propose a flexible debug instrumentation that allows for the live debugging of machine learning systems during training, which can be used to gather data in a large variety of ways that would likely not be anticipated at compile time.
Abstract: FPGAs have recently shown promise for accelerating machine learning training. This has led to research into the co-design of narrow-precision accelerator architectures and the investigation of novel machine learning models. Such research can be extremely expensive, as the steep cost of training a model can increase several-fold due to the need to perform hyper-parameter tuning and adjustments to the model to ensure acceptable convergence speed and accuracy. In this scenario, monitoring key data on-chip is essential to more quickly understand and diagnose problems, significantly reducing training costs. Previous work has proposed on-chip debug instrumentation to monitor key signals for both general-purpose circuits and inference algorithms. This instrumentation either performs limited on-chip compression, or is extremely restricted in the amount of run-time customization that may occur. We argue that for training applications, the extremely long and expensive training runs warrant significantly more flexibility in the on-chip instrumentation, even at the expense of some chip area. In this paper, we propose flexible debug instrumentation that allows for the live debugging of machine learning systems during training. Different from previous debug instrumentation, our instrumentation offers firmware programmability, allowing the researcher to gather data in a large variety of ways that would likely not be anticipated at compile time.

3 citations


Journal ArticleDOI
TL;DR: Artisan offers complete design-flow orchestration in a unified programming environment based on Python 3 to enable accessible codification of reusable optimisation strategies that can be automatically applied to high-level application descriptions.
Abstract: In today's increasingly heterogeneous compute landscape, there is high demand for design tools that offer seemingly contradictory features: portable programming abstractions that hide underlying architectural detail, and the capability to optimise and exploit architectural features. Our meta-programming approach, Artisan, decouples application functionality from optimisation concerns to address the complexity of mapping high-level application descriptions onto heterogeneous platforms from which they are abstracted. With Artisan, application experts focus on algorithmic behaviour, while platform and domain experts focus on optimisation and mapping. Artisan offers complete design-flow orchestration in a unified programming environment based on Python 3 to enable accessible codification of reusable optimisation strategies that can be automatically applied to high-level application descriptions. We have developed and evaluated an Artisan prototype and a set of customised meta-programs used to automatically optimise six case study applications for CPU+FPGA targets. In our experiments, Artisan-optimised designs achieve the same order of magnitude speedup as manually optimised designs compared to corresponding unoptimised software.


Proceedings ArticleDOI
05 Jul 2021
TL;DR: Simodense as discussed by the authors is a high-performance open-source RISC-V (RV32IM) softcore, optimized for exploring custom SIMD instructions, and its memory system is optimized for streaming bandwidth, such as very wide blocks for the last level cache.
Abstract: Simodense is a high-performance open-source RISC-V (RV32IM) softcore, optimised for exploring custom SIMD instructions. In order to maximise SIMD instruction performance, the design’s memory system is optimised for streaming bandwidth, such as very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications with custom instructions. This paper also provides insights on the effectiveness of adding FPGA resources in general purpose processors in the form of reconfigurable SIMD instructions.

Posted Content
TL;DR: In this article, a three-phase co-design framework is proposed to locate designs on the Pareto frontier for deep neural networks (DNNs) by decoupling DNN training from the design space exploration of hardware architecture and neural architecture.
Abstract: Recent advances in algorithm-hardware co-design for deep neural networks (DNNs) have demonstrated their potential in automatically designing neural architectures and hardware designs. Nevertheless, it is still a challenging optimization problem due to the expensive training cost and the time-consuming hardware implementation, which makes the exploration of the vast design space of neural architecture and hardware design intractable. In this paper, we demonstrate that our proposed approach is capable of locating designs on the Pareto frontier. This capability is enabled by a novel three-phase co-design framework, with the following new features: (a) decoupling DNN training from the design space exploration of hardware architecture and neural architecture, (b) providing a hardware-friendly neural architecture space by considering hardware characteristics in constructing the search cells, (c) adopting Gaussian process to predict accuracy, latency and power consumption to avoid time-consuming synthesis and place-and-route processes. In comparison with the manually-designed ResNet101, InceptionV2 and MobileNetV2, we can achieve up to 5% higher accuracy with up to 3x speed-up on the ImageNet dataset. Compared with other state-of-the-art co-design frameworks, our found network and hardware configuration can achieve 2% to 6% higher accuracy, 2x to 26x lower latency and 8.5x higher energy efficiency.
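The Pareto-frontier step can be illustrated with a short sketch: given predicted metrics for candidate (neural architecture, hardware) pairs, keep only the designs that are not dominated in all metrics. The candidate values below are synthetic placeholders, not predictions from the paper.

```python
# Minimal sketch of the Pareto-frontier step: keep only designs not dominated by
# another design that is at least as good in every metric and strictly better in one.
def pareto_front(candidates):
    def dominates(a, b):   # higher accuracy, lower latency, lower power is better
        better_eq = a["acc"] >= b["acc"] and a["lat"] <= b["lat"] and a["pow"] <= b["pow"]
        strictly = a["acc"] > b["acc"] or a["lat"] < b["lat"] or a["pow"] < b["pow"]
        return better_eq and strictly
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

designs = [
    {"name": "d0", "acc": 0.74, "lat": 5.0, "pow": 9.0},
    {"name": "d1", "acc": 0.76, "lat": 6.5, "pow": 8.0},
    {"name": "d2", "acc": 0.73, "lat": 7.0, "pow": 10.0},   # dominated by d0
]
print([d["name"] for d in pareto_front(designs)])   # ['d0', 'd1']
```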

Posted Content
TL;DR: In this article, a set of vector instruction types for exploring custom SIMD instructions in a softcore is presented, allowing simultaneous access to a relatively high number of operands, reducing the instruction count where applicable.
Abstract: This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count where applicable. Additionally, a high-performance open-source RISC-V (RV32IM) softcore is introduced, optimised for exploring custom SIMD instructions and streaming performance. By providing instruction templates for instruction development in HDL/Verilog, efficient FPGA-based instructions can be developed with a few lines of low-level code. In order to improve custom SIMD instruction performance, the softcore's cache hierarchy is optimised for bandwidth, such as with very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications on an FPGA. Although the exploration is based on the softcore, the goal is to provide a means to experiment with advanced SIMD instructions which could be loaded as custom instructions in future CPUs that feature reconfigurable regions. Finally, we provide some insights on the challenges and effectiveness of such future micro-architectures.

Journal ArticleDOI
TL;DR: In this paper, the authors present a Function-as-a-Service (FaaS) approach for deploying managed cloud functions onto heterogeneous cloud infrastructures, including hardware accelerators such as GPUs and FPGAs.
Abstract: This paper presents a Function-as-a-Service (FaaS) approach for deploying managed cloud functions onto heterogeneous cloud infrastructures. Current FaaS systems, such as AWS Lambda, allow domain-specific functionality, such as AI, HPC and image processing, to be deployed in the cloud while abstracting users from infrastructure and platform concerns. Existing approaches, however, use a single type of resource configuration to execute all function requests. In this paper, we present a novel FaaS approach that allows cloud functions to be effectively executed across heterogeneous compute resources, including hardware accelerators such as GPUs and FPGAs. We implement heterogeneous scheduling to tailor resource selection to each request, taking into account performance and cost concerns. In this way, our approach makes use of different processor types and quantities (e.g. 2 CPU cores), uniquely suited to handle different types of workload, potentially providing improved performance at a reduced cost. We validate our approach in three application domains: machine learning, bio-informatics, and physics, and target a hardware platform with a combined computational capacity of 24 FPGAs and 12 CPU cores. Compared to traditional FaaS, our approach achieves a cost improvement for non-uniform traffic of up to 8.9 times, while maintaining performance objectives.
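The per-request scheduling idea can be sketched as follows: among the resource configurations that meet a request's latency objective, pick the cheapest. The resource profiles and prices below are illustrative assumptions, not the paper's measurements.

```python
# Minimal sketch of per-request heterogeneous scheduling: among the resource
# configurations that can meet a request's latency objective, pick the cheapest.
RESOURCES = [
    {"name": "2x CPU core", "latency_ms": {"ml": 500, "bio": 400}, "cost_per_s": 0.00010},
    {"name": "FPGA",        "latency_ms": {"ml": 120, "bio": 60},  "cost_per_s": 0.00060},
    {"name": "GPU",         "latency_ms": {"ml": 80,  "bio": 90},  "cost_per_s": 0.00100},
]

def schedule(workload, latency_objective_ms):
    feasible = [r for r in RESOURCES if r["latency_ms"][workload] <= latency_objective_ms]
    if not feasible:                          # nothing meets the objective: take the fastest
        return min(RESOURCES, key=lambda r: r["latency_ms"][workload])
    # cheapest per request = price rate x time the request occupies the resource
    return min(feasible, key=lambda r: r["cost_per_s"] * r["latency_ms"][workload] / 1000)

print(schedule("ml", latency_objective_ms=1000)["name"])   # CPU is cheapest and fast enough
print(schedule("ml", latency_objective_ms=150)["name"])    # FPGA: needs an accelerator
```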

Proceedings ArticleDOI
01 Jul 2021
TL;DR: In this article, the authors propose techniques to improve the performance of networked processor templates by adding custom instructions to processors in the network, while retaining the simplicity of using processor templates.
Abstract: Processor templates are a well-established way to design for FPGA technology, easing the task of implementation by reducing it to choosing a template and writing software for it – while avoiding the need for hardware design experience and circumventing the installation and execution of FPGA design tools. Networked processor templates allow designers to achieve scalable parallelism by covering a network of processors, while retaining the simplicity of using processor templates. This paper proposes techniques to improve the performance of networked processor templates, by adding custom instructions to processors in the network. An approach has been developed to systematically choose parts of a design to implement as custom instructions. Performance and area models have also been devised to allow the prediction of performance and area usage of a design targeting a particular FPGA, enabling design trade-offs to be explored prior to implementation. The proposed approach has been evaluated on various applications including N-body simulation and dissipative particle dynamics, demonstrating its potential for delivering hardware acceleration based on custom instructions targeting state-of-the-art FPGAs.
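The flavour of such performance and area models can be sketched with simple arithmetic: a custom instruction replaces part of each processor's per-iteration software cycles, and the resulting design must still fit the target FPGA. All parameters in the sketch below are illustrative, not the paper's models.

```python
# Rough sketch of the performance/area trade-off when turning part of a kernel into a
# custom instruction on each networked processor. All parameters are illustrative.
def evaluate(n_procs, sw_cycles_per_iter, ci_cycles_per_iter, ci_fraction,
             iters, luts_per_proc, luts_per_ci, fpga_luts):
    # The custom instruction replaces a fraction of the per-iteration software cycles.
    cycles_with_ci = sw_cycles_per_iter * (1 - ci_fraction) + ci_cycles_per_iter
    speedup = sw_cycles_per_iter / cycles_with_ci
    luts = n_procs * (luts_per_proc + luts_per_ci)     # simple additive area model
    fits = luts <= fpga_luts
    total_cycles = iters / n_procs * cycles_with_ci    # iterations spread over the network
    return speedup, total_cycles, fits

print(evaluate(n_procs=64, sw_cycles_per_iter=400, ci_cycles_per_iter=20, ci_fraction=0.8,
               iters=1_000_000, luts_per_proc=3000, luts_per_ci=1500, fpga_luts=1_200_000))
```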

Proceedings ArticleDOI
01 Feb 2021
TL;DR: In this article, the authors proposed a fully spectral convolutional neural network (CNN) approach based on a novel adaptive Rectified Linear Unit (ReLU) activation in the spectral domain.
Abstract: Computing convolutional layers in the frequency domain can largely reduce the computation overhead for training and inference of convolutional neural networks (CNNs). However, existing designs based on this idea require repeated spatial- and frequency-domain transforms due to the absence of nonlinear functions in the frequency domain, which makes the benefit less attractive for low-latency inference. This paper presents a fully spectral CNN approach by proposing a novel adaptive Rectified Linear Unit (ReLU) activation in the spectral domain. The proposed design maintains the non-linearity in the network while taking hardware efficiency into account at the algorithm level. The spectral model size is further optimized by merging and fusing layers. A customized hardware architecture is then proposed to implement the designed spectral network on an FPGA device, with DSP optimizations for 8-bit fixed-point multipliers. Our hardware accelerator is implemented on Intel's Arria 10 device and applied to the MNIST, SVHN, AT&T and CIFAR-10 datasets. Experimental results show speed improvements of 6x to 10x over state-of-the-art spatial designs and 4x to 5.7x over FFT-based designs, while achieving similar accuracy across the benchmark datasets.
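The core saving of spectral CNNs can be illustrated in a few lines: after an FFT, convolution becomes element-wise multiplication. The paper's adaptive spectral ReLU is its own contribution and is not reproduced in the sketch below.

```python
# Minimal sketch of convolution in the frequency domain: an FFT turns spatial
# (circular) convolution into element-wise multiplication.
import numpy as np

def spectral_conv2d(image, kernel):
    H, W = image.shape
    F_img = np.fft.fft2(image)
    F_ker = np.fft.fft2(kernel, s=(H, W))        # zero-pad the kernel to the image size
    return np.real(np.fft.ifft2(F_img * F_ker))  # circular convolution of image with kernel

rng = np.random.default_rng(0)
img, ker = rng.standard_normal((8, 8)), rng.standard_normal((3, 3))
out = spectral_conv2d(img, ker)

# Cross-check one output sample against the direct circular-convolution definition.
direct = sum(ker[i, j] * img[(5 - i) % 8, (5 - j) % 8] for i in range(3) for j in range(3))
print(np.allclose(out[5, 5], direct))            # True
```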

Journal ArticleDOI
TL;DR: Otune, a novel overlay-based approach for rapid in-circuit debugging and tuning of Deep Neural Network (DNN) designs targeting Field-Programmable Gate Arrays (FPGAs), is presented; it enables tuning of FPGA-based DNN designs for edge systems, which would benefit the development of adaptive learning systems.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper implemented the diffeomorphic log-demons algorithm on GPU and achieved a 1.3M voxel image registration in 286ms using GPU performance-aware programming techniques.
Abstract: Intensity-based image registration has been proven essential in many applications, accredited to its unparalleled ability to resolve image misalignments. However, the long registration time needed for image realignment prohibits its use in intra-operative navigation systems. There has been much work on accelerating the registration process by improving the algorithm's robustness, but the heavy computation required by the registration algorithm itself has remained unaddressed. Intensity-based registration methods involve operations with high arithmetic load and memory access demand, which can be reduced by graphics processing units (GPUs). Although GPUs are widespread and affordable, there is a lack of open-source GPU implementations optimized for non-rigid image registration. This paper demonstrates performance-aware programming techniques, involving systematic exploitation of GPU features, by implementing the diffeomorphic log-demons algorithm. By resolving the pinpointed computation bottlenecks on GPU, our implementation of diffeomorphic log-demons on an Nvidia GTX Titan X GPU achieves ~95 times speed-up compared to the CPU and registers a 1.3-M voxel image in 286 ms. Even for large 37-M voxel images, our implementation is able to register in 8.56 s, attaining ~258 times speed-up. Our solution involves effective employment of GPU computation units, memory, and data bandwidth to resolve computation bottlenecks. The computation bottlenecks in diffeomorphic log-demons are pinpointed, analyzed, and resolved using various GPU performance-aware programming techniques. The proposed fast computation of basic image operations not only enhances the computation of diffeomorphic log-demons, but can also potentially be extended to speed up many other intensity-based approaches. Our implementation is open-source on GitHub at https://bit.ly/2PYZxQz .
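For orientation, a heavily simplified classic demons iteration is sketched below; the full diffeomorphic log-demons algorithm additionally works with stationary velocity fields and an exponential map, which this sketch does not reproduce. It is a 2-D NumPy/SciPy illustration, not the paper's GPU implementation.

```python
# Simplified sketch of one classic "demons" registration step: a gradient-based force
# pulls the warped moving image towards the fixed image, followed by Gaussian smoothing.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def demons_step(fixed, moving, disp, sigma_fluid=1.0, sigma_diffusion=1.0):
    ys, xs = np.meshgrid(np.arange(fixed.shape[0]), np.arange(fixed.shape[1]), indexing="ij")
    warped = map_coordinates(moving, [ys + disp[0], xs + disp[1]], order=1, mode="nearest")
    diff = warped - fixed
    gy, gx = np.gradient(fixed)
    denom = gy**2 + gx**2 + diff**2
    denom[denom == 0] = 1.0
    update = np.stack([-diff * gy / denom, -diff * gx / denom])       # demons force
    update = gaussian_filter(update, [0, sigma_fluid, sigma_fluid])   # fluid-like smoothing
    disp = gaussian_filter(disp + update, [0, sigma_diffusion, sigma_diffusion])
    return disp

fixed = np.zeros((32, 32)); fixed[10:20, 10:20] = 1.0
moving = np.zeros((32, 32)); moving[12:22, 12:22] = 1.0
disp = np.zeros((2, 32, 32))
for _ in range(50):
    disp = demons_step(fixed, moving, disp)
print(np.abs(disp).max())   # displacement pulls the shifted square towards the fixed one
```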

Posted Content
TL;DR: Wang et al. as discussed by the authors proposed an FPGA-based hardware design to accelerate Bayesian LSTM-based RNNs, which can achieve up to 10 times speedup with nearly 106 times higher energy efficiency.
Abstract: Neural networks have demonstrated their great performance in a wide range of tasks. Especially in time-series analysis, recurrent architectures based on long-short term memory (LSTM) cells have manifested excellent capability to model time dependencies in real-world data. However, standard recurrent architectures cannot estimate their uncertainty which is essential for safety-critical applications such as in medicine. In contrast, Bayesian recurrent neural networks (RNNs) are able to provide uncertainty estimation with improved accuracy. Nonetheless, Bayesian RNNs are computationally and memory demanding, which limits their practicality despite their advantages. To address this issue, we propose an FPGA-based hardware design to accelerate Bayesian LSTM-based RNNs. To further improve the overall algorithmic-hardware performance, a co-design framework is proposed to explore the optimal algorithmic-hardware configurations for Bayesian RNNs. We conduct extensive experiments on health-related tasks to demonstrate the improvement of our design and the effectiveness of our framework. Compared with GPU implementation, our FPGA-based design can achieve up to 10 times speedup with nearly 106 times higher energy efficiency. To the best of our knowledge, this is the first work targeting the acceleration of Bayesian RNNs on FPGAs.

Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, an FPGA-based accelerator for agent-based epidemic modeling for COVID-19 is presented, where the authors propose to partition the calculation properly to decouple the on-chip resource usage from the population size.
Abstract: Agent-based models (ABMs) can provide realistic dynamics for epidemics at the individual level so that users can observe and predict the spreading pattern and the effectiveness of intervention over time and space. This paper proposes an FPGA-based accelerator for agent-based epidemic modeling for COVID-19. The optimizations enabling the effective acceleration of the simulation procedure are presented. The key idea is to partition the calculation properly to decouple the on-chip resource usage from the population size. Also, an algorithmic adaptation is proposed to reduce the latency caused by conditional branches within loops. An experimental implementation on an Intel Arria 10 GX 10AX115S2F45I1SG FPGA running at 240MHz achieves 2.2 and 1.9 times speed-up respectively over a CPU reference using 10 cores on an Intel Xeon Gold 6230 CPU and a GPU reference on an Nvidia GeForce RTX 2080 Ti GPU.
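The partitioning idea can be illustrated with a toy model: agents are processed in fixed-size tiles, so the working set is independent of the total population, and state updates are written with masks rather than per-agent branches. The sketch below is a toy SIR-style step, not the paper's epidemic model.

```python
# Minimal sketch of the partitioning idea: agents are streamed in fixed-size tiles
# (standing in for a bounded on-chip buffer), and the infection update uses masks
# instead of per-agent conditionals. A toy SIR step with illustrative rates.
import numpy as np

TILE = 4096                      # fixed tile size, independent of the population size
S, I, R = 0, 1, 2
rng = np.random.default_rng(0)

def step(state, beta_eff, gamma):
    """One day: infect susceptibles with probability beta_eff, recover infected with gamma."""
    for start in range(0, state.size, TILE):          # stream one tile at a time
        tile = state[start:start + TILE]
        u = rng.random(tile.size)
        infect = (tile == S) & (u < beta_eff)         # mask-based, branch-free update
        recover = (tile == I) & (u < gamma)
        state[start:start + TILE] = np.where(infect, I, np.where(recover, R, tile))
    return state

pop = np.full(100_000, S, dtype=np.int8)
pop[:50] = I                                          # seed infections
for day in range(30):
    frac_inf = np.mean(pop == I)
    pop = step(pop, beta_eff=0.3 * frac_inf, gamma=0.1)
print((pop == S).mean(), (pop == I).mean(), (pop == R).mean())
```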

Proceedings ArticleDOI
01 May 2021
TL;DR: In this paper, the authors describe a systematic process to migrate physical parameterisations from sequential code for CPU execution into FPGA designs, and describe the steps required to automate the process.
Abstract: Efficient utilisation of computational resources is one of the critical challenges in Numerical Weather Prediction (NWP) due to tight time constraints and the complexity of the numerical models. Enabling hardware acceleration is therefore of vital interest to the weather and climate modelling community. In this paper, we describe a systematic process to migrate physical parameterisations from sequential code for CPU execution into FPGA designs; a set of conditions that the code must be able to satisfy for the process to be suitable (Single Pass Exclusive Mutability of Current Cell); and describe the steps required to automate the process. We showcase the migration of a cloud microphysics parameterisation which forms a significant portion of the runtime in operational weather models, and show that the design produces results using an order of magnitude less energy than CPU and GPU implementations while achieving significantly higher throughput. We show that when incorporated inside a larger NWP system, a further throughput increase by a factor of five is possible.
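The "Single Pass Exclusive Mutability of Current Cell" condition can be illustrated with a toy kernel: every output depends only on the current cell's own inputs and is written exactly once, which is what makes the loop streamable. The kernel below is a toy saturation adjustment with illustrative values, not the operational cloud-microphysics scheme.

```python
# Illustrative single-pass, per-cell kernel: each cell is read and written exactly once,
# using only that cell's own state. A toy saturation-adjustment step, not the paper's scheme.
import numpy as np

def saturation_adjust(qv, qc, qsat, timestep=1.0, rate=0.5):
    """Per-cell: condense excess vapour into cloud water (or evaporate the deficit)."""
    excess = qv - qsat                                   # depends only on the current cell
    dq = np.clip(excess * rate * timestep, -qc, None)    # cannot evaporate more than exists
    return qv - dq, qc + dq                              # each output written exactly once

qv = np.array([0.012, 0.008, 0.015])   # water-vapour mixing ratio (kg/kg), illustrative
qc = np.array([0.000, 0.001, 0.002])   # cloud-water mixing ratio
qsat = np.array([0.010, 0.010, 0.010]) # saturation mixing ratio
print(saturation_adjust(qv, qc, qsat))
```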

Journal ArticleDOI
TL;DR: In this paper, a promising path in achieving next-generation high-performance computing platforms is presented, which will handle extreme data- and compute-intensive problems that are intractable with today's technology.
Abstract: Next-generation high-performance computing platforms will handle extreme data- and compute-intensive problems that are intractable with today’s technology. A promising path in achieving the next le...