
Showing papers by Wayne Luk published in 2022


Proceedings ArticleDOI
20 Sep 2022
TL;DR: FABNet, a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs, is proposed, together with a novel adaptable butterfly accelerator that can be configured at runtime via dedicated hardware control to accelerate different butterfly layers using a single unified hardware engine.
Abstract: Attention-based neural networks have become pervasive in many AI tasks. Despite their excellent algorithmic performance, the use of the attention mechanism and feedforward network (FFN) demands excessive computational and memory resources, which often compromises their hardware performance. Although various sparse variants have been introduced, most approaches only focus on mitigating the quadratic scaling of attention on the algorithm level, without explicitly considering the efficiency of mapping their methods onto real hardware designs. Furthermore, most efforts only focus on either the attention mechanism or the FFNs without jointly optimizing both parts, causing most of the current designs to lack scalability when dealing with different input lengths. This paper systematically considers the sparsity patterns in different variants from a hardware perspective. On the algorithmic level, we propose FABNet, a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs. On the hardware level, a novel adaptable butterfly accelerator is proposed that can be configured at runtime via dedicated hardware control to accelerate different butterfly layers using a single unified hardware engine. On the Long-Range-Arena dataset, FABNet achieves the same accuracy as the vanilla Transformer while reducing the amount of computation by 10–66× and the number of parameters by 2–22×. By jointly optimizing the algorithm and hardware, our FPGA-based butterfly accelerator achieves 14.2–23.2× speedup over state-of-the-art accelerators normalized to the same computational budget. Compared with optimized CPU and GPU designs on Raspberry Pi 4 and Jetson Nano, our system is up to 273.8× and 15.1× faster under the same power budget.
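The butterfly sparsity pattern referenced above factorises a dense n×n weight matrix into log2(n) sparse factors with only two non-zeros per row, reducing a linear layer from O(n²) to O(n log n) operations and parameters. The following is a minimal NumPy sketch of that factorisation idea, not the paper's FABNet or accelerator code; the function names are illustrative.

```python
import numpy as np

def butterfly_factor(n, stride, rng):
    """One sparse butterfly factor: row i mixes inputs i and i^stride (two non-zeros per row)."""
    B = np.zeros((n, n))
    for i in range(n):
        j = i ^ stride                     # partner index at this butterfly stage
        B[i, i] = rng.standard_normal()
        B[i, j] = rng.standard_normal()
    return B

def butterfly_matrix(n, rng):
    """Product of log2(n) factors: 2*n*log2(n) parameters instead of n*n."""
    W, stride = np.eye(n), 1
    while stride < n:
        W = butterfly_factor(n, stride, rng) @ W
        stride *= 2
    return W

rng = np.random.default_rng(0)
n = 8                                      # n must be a power of two
x = rng.standard_normal(n)
y = butterfly_matrix(n, rng) @ x           # a real kernel applies the sparse factors directly
```

Applying the log2(n) sparse factors one after another, rather than forming the dense product, is what gives a single unified hardware engine a regular structure to reuse across layers.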

9 citations


Journal ArticleDOI
TL;DR: A novel latency-hiding architecture for recurrent neural network (RNN) acceleration using column-wise matrix–vector multiplication (MVM) instead of the state-of-the-art row-wise operation, eliminating data dependencies to increase HW utilization and enhance system throughput.
Abstract: This article presents a reconfigurable accelerator for REcurrent Neural networks with fine-grained cOlumn-Wise matrix–vector multiplicatioN (RENOWN). We propose a novel latency-hiding architecture for recurrent neural network (RNN) acceleration using column-wise matrix–vector multiplication (MVM) instead of the state-of-the-art row-wise operation. This hardware (HW) architecture can eliminate data dependencies to improve the throughput of RNN inference systems. Besides, we introduce a configurable checkerboard tiling strategy which allows large weight matrices, while incorporating various configurations of element-based parallelism (EP) and vector-based parallelism (VP). These optimizations improve the exploitation of parallelism to increase HW utilization and enhance system throughput. Evaluation results show that our design can achieve over 29.6 tera operations per second (TOPS) which would be among the highest for field-programmable gate array (FPGA)-based RNN designs. Compared to state-of-the-art accelerators on FPGAs, our design achieves 3.7–14.8 times better performance and has the highest HW utilization.
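The column-wise MVM mentioned above accumulates partial results as each input element arrives, so the recurrent computation need not wait for the complete vector, unlike the row-wise (dot-product) form. A small NumPy sketch of the two orderings, for illustration only and not the RENOWN hardware:

```python
import numpy as np

def mvm_row_wise(W, x):
    # Each output element is a full dot product: it cannot finish until the
    # entire input vector x (e.g. the previous RNN hidden state) is available.
    return np.array([np.dot(W[i, :], x) for i in range(W.shape[0])])

def mvm_column_wise(W, x):
    # Partial sums for every output are updated as each element x[j] arrives,
    # which is what lets the hardware hide latency and avoid data-dependency stalls.
    y = np.zeros(W.shape[0])
    for j, xj in enumerate(x):
        y += W[:, j] * xj
    return y

rng = np.random.default_rng(0)
W, x = rng.standard_normal((4, 3)), rng.standard_normal(3)
assert np.allclose(mvm_row_wise(W, x), mvm_column_wise(W, x))
```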

7 citations


Proceedings ArticleDOI
13 Jun 2022
TL;DR: A novel reconfigurable architecture is presented to accelerate Graph Neural Networks (GNNs) for JEDI-net, a jet identification algorithm in particle physics which achieves state-of-the-art accuracy; custom strength reduction for its matrix multiplication operations avoids the costly multiplication of the adjacency matrix with the input feature matrix.
Abstract: This paper presents a novel reconfigurable architecture to accelerate Graph Neural Networks (GNNs) for JEDI-net, a jet identification algorithm in particle physics which achieves state-of-the-art accuracy. The challenge is to deploy JEDI-net for online selection targeting the Large Hadron Collider (LHC) experiments with low latency. This paper proposes a custom strength reduction for the matrix multiplication operations of the GNN-based JEDI-net, which avoids the costly multiplication of the adjacency matrix with the input feature matrix. It exploits sparsity patterns and binary adjacency matrices to increase hardware efficiency while reducing latency. The throughput is further enhanced by a coarse-grained pipeline enabled by adopting a column-major data layout. Evaluation results show that our FPGA implementation is 11 times faster and consumes 12 times less power than a GPU implementation. Moreover, the throughput of our FPGA design is sufficiently high to enable deployment of JEDI-net in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy.
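For JEDI-net's fully connected graphs the adjacency matrix is binary (all-ones minus the identity), so its product with the feature matrix collapses into column sums, which is the essence of the strength reduction described above. A minimal NumPy sketch of that identity; it illustrates the algebra only, not the paper's hardware design:

```python
import numpy as np

def aggregate_naive(X):
    """Aggregate features over a fully connected graph (no self-loops) by explicitly
    building the binary adjacency matrix and multiplying: O(n^2 * f) multiplications."""
    n = X.shape[0]
    A = np.ones((n, n)) - np.eye(n)
    return A @ X

def aggregate_reduced(X):
    """Strength-reduced form: every node receives the sum of all other nodes' features,
    so A @ X becomes the column sums minus the node's own row, with no multiplications."""
    return X.sum(axis=0, keepdims=True) - X

X = np.random.default_rng(0).standard_normal((6, 4))   # 6 nodes, 4 features each
assert np.allclose(aggregate_naive(X), aggregate_reduced(X))
```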

4 citations


Journal ArticleDOI
TL;DR: A custom code transformation with strength reduction for the matrix multiplication operations in the interaction-network based GNNs with fully connected graphs, which avoids the costly multiplication of the adjacency matrix with the input feature matrix.
Abstract: This work proposes a novel reconfigurable architecture for low latency Graph Neural Network (GNN) design specifically for particle detectors. Adopting FPGA-based GNNs for particle detectors is challenging since it requires sub-microsecond latency to deploy the networks for online event selection in the Level-1 triggers for the CERN Large Hadron Collider experiments. This paper proposes a custom code transformation with strength reduction for the matrix multiplication operations in the interaction-network based GNNs with fully connected graphs, which avoids the costly multiplication of the adjacency matrix with the input feature matrix. It exploits sparsity patterns as well as binary adjacency matrices, and avoids irregular memory access, leading to a reduction in latency and an improvement in hardware efficiency. In addition, we introduce an outer-product based matrix multiplication approach which is enhanced by the strength reduction for low latency design. Also, a fusion step is introduced to further reduce the design latency. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with a much better latency but also finds a high accuracy design under a given latency constraint. Finally, a customizable template for this low latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 24 times faster and consumes up to 45 times less power than a GPU implementation. Compared to our previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy.

4 citations


Proceedings ArticleDOI
01 Aug 2022
TL;DR: In this article, an outer-product based matrix multiplication approach customized for the GNN-based JEDI-net is proposed to increase data spatial locality and reduce design latency; it is further enhanced by code transformation with strength reduction, which exploits sparsity patterns and binary adjacency matrices.
Abstract: This work proposes a novel reconfigurable architecture for reducing the latency of JEDI-net, a Graph Neural Network (GNN) based algorithm for jet tagging in particle physics, which achieves state-of-the-art accuracy. Accelerating JEDI-net is challenging since it requires low latency to deploy the network for event selection at the CERN Large Hadron Collider. This paper proposes an outer-product based matrix multiplication approach customized for GNN-based JEDI-net, which increases data spatial locality and reduces design latency. It is further enhanced by code transformation with strength reduction which exploits sparsity patterns and binary adjacency matrices to increase hardware efficiency while reducing latency. In addition, a customizable template for this architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools. Evaluation results show that our FPGA implementation is up to 9.5 times faster and consumes up to 6.5 times less power than a GPU implementation. Moreover, the throughput of our FPGA design is sufficiently high to enable deployment of JEDI-net in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy.
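The outer-product formulation streams one column of the left operand and one row of the right operand at a time and accumulates rank-1 updates, which suits a column-major layout and deep pipelining. A small NumPy sketch comparing it with the usual inner-product form; this is illustrative only, not the paper's FPGA kernel:

```python
import numpy as np

def matmul_inner(A, B):
    """Inner-product form: each C[i, j] needs a whole row of A and a whole column of B."""
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

def matmul_outer(A, B):
    """Outer-product form: stream column k of A and row k of B and accumulate a rank-1
    update; consecutive accesses follow a column-major layout, improving locality."""
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N))
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
assert np.allclose(matmul_inner(A, B), matmul_outer(A, B))
```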

3 citations


Journal ArticleDOI
TL;DR: This paper systematically exploits the extensive structured sparsity and redundant computation in BayesNNs, introduced by Monte Carlo Dropout and its associated sampling required during uncertainty estimation and prediction, to address real-world hardware performance issues.
Abstract: Bayesian neural networks (BayesNNs) have demonstrated their advantages in various safety-critical applications, such as autonomous driving or healthcare, due to their ability to capture and represent model uncertainty. However, standard BayesNNs need to be run repeatedly for Monte Carlo sampling to quantify their uncertainty, which puts a burden on their real-world hardware performance. To address this performance issue, this paper systematically exploits the extensive structured sparsity and redundant computation in BayesNNs. Different from the unstructured or structured sparsity existing in standard convolutional NNs, the structured sparsity of BayesNNs is introduced by Monte Carlo Dropout and its associated sampling required during uncertainty estimation and prediction, which can be exploited through both algorithmic and hardware optimizations. We first classify the observed sparsity patterns into three categories: dropout sparsity, layer sparsity and sample sparsity. On the algorithmic side, a framework is proposed to automatically explore these three sparsity categories without sacrificing algorithmic performance. We demonstrate that structured sparsity can be exploited to accelerate CPU designs by up to 49 times, and GPU designs by up to 40 times. On the hardware side, a novel hardware architecture is proposed to accelerate BayesNNs, which achieves high hardware performance using runtime adaptable hardware engines and intelligent skipping support. Upon implementing the proposed hardware design on an FPGA, our experiments demonstrate that the algorithm-optimized BayesNNs can achieve up to 56 times speedup when compared with unoptimized BayesNNs. Compared with the optimized GPU implementation, our FPGA design achieves up to 7.6 times speedup and up to 39.3 times higher energy efficiency.
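The dropout and sample sparsity described above come from Monte Carlo Dropout: each of the repeated forward passes zeroes a random subset of activations, and those zeroed rows (and, across samples, redundant work) can be skipped in hardware. A toy NumPy sketch of MC-Dropout prediction for a two-layer network, purely illustrative and not the paper's framework (the usual 1/(1-p) rescaling is omitted for brevity):

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, n_samples=10, p_drop=0.5, rng=None):
    """Repeat the forward pass with fresh dropout masks; the spread of the outputs
    serves as an uncertainty estimate. Rows zeroed by the mask need no computation."""
    if rng is None:
        rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_samples):
        mask = rng.random(W1.shape[0]) > p_drop       # structured (row-wise) sparsity
        h = np.maximum(W1 @ x, 0.0) * mask
        preds.append(W2 @ h)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)      # prediction and its uncertainty

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((64, 16)), rng.standard_normal((3, 64))
mean, std = mc_dropout_predict(rng.standard_normal(16), W1, W2, rng=rng)
```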

2 citations


Book ChapterDOI
TL;DR: In this article, a hardware-aware optimization strategy for deploying DL models to FPGAs is presented, which automatically identifies hardware configurations that maximize resource utilization for a given level of computation throughput.
Abstract: AI solutions, such as Deep Learning (DL), are becoming increasingly prevalent in edge devices. Many of these applications require low latency processing of large amounts of data within a tight power budget. In this context, reconfigurable embedded devices are a compelling option. Deploying DL models to reconfigurable devices does, however, present considerable challenges. One key issue is reconciling the often large compute requirements of DL models with the limited available resources on edge devices. In this paper, we present a hardware-aware optimization strategy for deploying DL models to FPGAs, which automatically identifies hardware configurations that maximize resource utilization for a given level of computation throughput. We demonstrate our optimization approach on a sample neural network containing a combination of convolutional and fully connected layers, running on a sample FPGA target device, achieving a factor of 3.5 reduction in DSP block usage without affecting throughput when using the performance mode. When using the compact mode, a factor of 7.4 reduction in DSP block usage is achieved, at the cost of a 1.8 times decrease in throughput. Our approach works completely automatically without the need for human intervention or domain knowledge.

2 citations



Journal ArticleDOI
TL;DR: A novel FPGA-based hardware architecture to accelerate both 2D and 3D BayesCNNs based on Monte Carlo Dropout is proposed and an automatic framework capable of supporting partial Bayesian inference is proposed to explore the trade-off between algorithm and hardware performance.
Abstract: Neural networks (NNs) have demonstrated their potential in a variety of domains ranging from computer vision (CV) to natural language processing. Among various NNs, two-dimensional (2-D) and three-dimensional (3-D) convolutional NNs (CNNs) have been widely adopted for a broad spectrum of applications, such as image classification and video recognition, due to their excellent capabilities in extracting 2-D and 3-D features. However, standard 2-D and 3-D CNNs are not able to capture their model uncertainty which is crucial for many safety-critical applications, including healthcare and autonomous driving. In contrast, Bayesian CNNs (BayesCNNs), as a variant of CNNs, have demonstrated their ability to express uncertainty in their prediction via a mathematical grounding. Nevertheless, BayesCNNs have not been widely used in industrial practice due to their compute requirements stemming from sampling and subsequent forward passes through the whole network multiple times. As a result, these requirements significantly increase the amount of computation and memory consumption in comparison to standard CNNs. This article proposes a novel field-programmable gate array (FPGA)-based hardware architecture to accelerate both 2-D and 3-D BayesCNNs based on Monte Carlo dropout (MCD). Compared with other state-of-the-art accelerators for BayesCNNs, the proposed design can achieve up to four times higher energy efficiency and nine times better compute efficiency. An automatic framework capable of supporting partial Bayesian inference is proposed to explore the tradeoff between algorithm and hardware performance. Extensive experiments are conducted to demonstrate that our framework can effectively find the optimal implementations in the design space.

2 citations


Proceedings ArticleDOI
01 Aug 2022
TL;DR: POLSCA is a compiler framework that improves the polyhedral HLS workflow by automatic code transformation: it decomposes a design before polyhedral optimization to balance code complexity and parallelism, while revising memory interfaces of polyhedral-transformed code to make partitioning explicit for HLS tools.
Abstract: Polyhedral optimization can parallelize nested affine loops for high-level synthesis (HLS), but polyhedral tools are HLS-agnostic and can worsen performance. Moreover, HLS tools require user directives which can produce unreadable polyhedral-transformed code. To address these two challenges, we present POLSCA, a compiler framework that improves polyhedral HLS workflow by automatic code transformation. POLSCA decomposes a design before polyhedral optimization to balance code complexity and parallelism, while revising memory interfaces of polyhedral-transformed code to make partitioning explicit for HLS tools; it enables designs to benefit more easily from polyhedral optimization. Experiments on Polybench/C show that POLSCA designs are 1.5 times faster on average compared with baseline designs generated directly from applying HLS on C code.

2 citations


Proceedings ArticleDOI
05 Dec 2022
TL;DR: In this article, a novel TNN-based architecture for real-time particle triggers is proposed, efficiently mapped to Field-Programmable Gate Arrays (FPGAs) using High-Level Synthesis.
Abstract: High Energy Physics studies the fundamental forces and elementary particles of the Universe. With the unprecedented scale of experiments comes the challenge of accurate, ultra-low latency decision-making. Transformer Neural Networks (TNNs) have been proven to accomplish cutting-edge accuracy in classification for hadronic jet tagging. Nevertheless, software-centered solutions targeting CPUs and GPUs lack the inference speed required for real-time particle triggers, most notably those at the CERN Large Hadron Collider. This paper proposes a novel TNN-based architecture, efficiently mapped to Field-Programmable Gate Arrays, that outperforms GPU inference capabilities involving state-of-the-art neural network models by approximately 1000 times while preserving comparable classification accuracy. The design offers high customizability and aims to bridge the gap between hardware and software development by using High-Level Synthesis. Moreover, we propose a novel model-independent post-training quantization search algorithm that works in general hardware environments according to user-defined constraints. Experimental evaluation yields a 64% reduction in overall bit-widths with a 2% accuracy loss.
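The post-training quantization search mentioned above trades bit-width against accuracy. As a hedged illustration of the basic building block (not the paper's search algorithm), the sketch below applies symmetric uniform quantization to a weight tensor at several bit-widths and reports the resulting error; a search would lower per-layer bit-widths while a user-defined accuracy budget still holds:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform post-training quantization of a tensor to 'bits' bits."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))                      # stand-in for one layer's weights
for bits in (16, 8, 6, 4, 3):
    err = float(np.abs(w - quantize_uniform(w, bits)).mean())
    print(f"{bits:2d} bits  mean abs error {err:.5f}")  # error grows as bit-widths shrink
```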

Proceedings ArticleDOI
09 Jun 2022
TL;DR: This paper presents a novel approach, Covoh, which captures families of hardware designs as parametric block descriptions, such that the behaviour of design instances can be verified by numerical and symbolic simulation.
Abstract: Verifying the correctness of optimizations is a key challenge in hardware acceleration. Incorrect optimizations can produce designs unfit for purpose. This paper presents a novel approach, Covoh, which captures families of hardware designs as parametric block descriptions, such that the behaviour of design instances can be verified by numerical and symbolic simulation. In this work, hardware optimizations are expressed as transformations of parametric descriptions, and their parametric verification based on the Coq proof assistant is guided by verification strategies. Repositories of design descriptions and verification strategies have been developed to facilitate design development in Covoh. Its use in verifying two optimizations illustrates the capability of Covoh. The first, a variation of Horner’s Rule, maps an O(n²) design to an O(n) design. The second, used in optimizing avionics monitoring, maps an O(2ⁿ) design to an O(n) design. The effectiveness of such optimizations is demonstrated with FPGA implementations: varying the value of a single parameter that controls pipelining would, for example, lead to a family of functionally-verified designs with different trade-offs, from ones with low throughput, low resource usage and low power consumption to ones with high throughput, high resource usage and high power consumption.
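The first optimization mentioned above is easiest to see on polynomial evaluation: computing each power separately takes O(n²) multiplications, while Horner's rule needs only O(n) and maps to a linear chain of multiply-add stages. The paper verifies such transformations over parametric hardware descriptions in Coq; the Python sketch below only illustrates the underlying algorithmic identity:

```python
def poly_naive(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x^n with each power built from scratch: O(n^2) multiplies."""
    total = 0.0
    for i, c in enumerate(coeffs):
        term = c
        for _ in range(i):
            term *= x
        total += term
    return total

def poly_horner(coeffs, x):
    """Horner's rule: ((cn*x + c_{n-1})*x + ...)*x + c0, only O(n) multiplies."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

coeffs = [3.0, -2.0, 0.5, 1.0]        # 3 - 2x + 0.5x^2 + x^3
assert abs(poly_naive(coeffs, 2.0) - poly_horner(coeffs, 2.0)) < 1e-9
```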

Proceedings ArticleDOI
10 Jul 2022
TL;DR: A novel machine learning (ML)-based framework to tackle QCP as a bilevel optimization problem, which significantly reduces the SWAP cost and achieves the same level of optimality while reducing the runtime cost by up to 40 times.
Abstract: Quantum circuit placement (QCP) is the process of mapping the synthesized logical quantum programs on physical quantum machines, which introduces additional SWAP gates and affects the performance of quantum circuits. Nevertheless, determining the minimal number of SWAP gates has been demonstrated to be an NP-complete problem. Various heuristic approaches have been proposed to address QCP, but they suffer from suboptimality due to the lack of exploration. Although exact approaches can achieve higher optimality, they are not scalable for large quantum circuits due to the massive design space and expensive runtime. By formulating QCP as a bilevel optimization problem, this paper proposes a novel machine learning (ML)-based framework to tackle this challenge. To address the lower-level combinatorial optimization problem, we adopt a policy-based deep reinforcement learning (DRL) algorithm with knowledge transfer to enable the generalization ability of our framework. An evolutionary algorithm is then deployed to solve the upper-level discrete search problem, which optimizes the initial mapping with a lower SWAP cost. The proposed ML-based approach provides a new paradigm to overcome the drawbacks in both traditional heuristic and exact approaches while enabling the exploration of optimality-runtime trade-off. Compared with the leading heuristic approaches, our ML-based method significantly reduces the SWAP cost by up to 100%. In comparison with the leading exact search, our proposed algorithm achieves the same level of optimality while reducing the runtime cost by up to 40 times.
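To make the cost model concrete: every two-qubit gate acting on physical qubits that are not adjacent on the device's coupling graph must be preceded by SWAP gates, so the quality of the initial mapping directly drives the SWAP count. The sketch below is a naive per-gate distance estimate under an assumed line-shaped coupling map; it is not the paper's DRL/evolutionary method, only an illustration of why QCP matters:

```python
from collections import deque

def all_pairs_distance(coupling_edges, n_qubits):
    """BFS shortest-path distances between physical qubits on the coupling graph."""
    adj = {q: [] for q in range(n_qubits)}
    for a, b in coupling_edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[0] * n_qubits for _ in range(n_qubits)]
    for s in range(n_qubits):
        seen, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist

def naive_swap_estimate(two_qubit_gates, mapping, dist):
    """Each gate on logical qubits (p, q) needs dist-1 SWAPs to become adjacent
    (gates routed independently, ignoring that SWAPs also change the mapping)."""
    return sum(dist[mapping[p]][mapping[q]] - 1 for p, q in two_qubit_gates)

dist = all_pairs_distance([(0, 1), (1, 2), (2, 3)], 4)              # 4 qubits in a line
gates = [(0, 1), (0, 3), (1, 2)]                                    # logical two-qubit gates
print(naive_swap_estimate(gates, {0: 0, 1: 1, 2: 2, 3: 3}, dist))   # -> 2
```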

Journal ArticleDOI
TL;DR: This research provides useful insights into the impact of curation algorithms on how information propagates and on content diversity on social media, and shows how agent-based modelling can reveal specific properties of curation algorithms, which can be used to improve such algorithms.
Abstract: Social media networks have drastically changed how people communicate and seek information. Due to the scale of information on these platforms, newsfeed curation algorithms have been developed to sort through this information and curate what users see. However, these algorithms are opaque and it is difficult to understand their impact on human communication flows. Some papers have criticised newsfeed curation algorithms that, while promoting user engagement, heighten online polarisation, misinformation, and the formation of echo chambers. Agent-based modelling offers the opportunity to simulate the complex interactions between these algorithms, what users see, and the propagation of information on social media. This article uses agent-based modelling to compare the impact of four different newsfeed curation algorithms on the spread of misinformation and polarisation. This research has the following contributions: (1) implementing newsfeed curation algorithm logic on an agent-based model; (2) comparing the impact of different curation algorithm objectives on misinformation and polarisation; and (3) calibration and empirical validation using real Twitter data. This research provides useful insights into the impact of curation algorithms on how information propagates and on content diversity on social media. Moreover, we show how agent-based modelling can reveal specific properties of curation algorithms, which can be used in improving such algorithms.

Proceedings ArticleDOI
01 Aug 2022
TL;DR: In this paper, a novel management architecture is introduced to unify heterogeneous processing elements into compute pools; a call-and-response approach to computation allows for different processing element implementations, connections, latencies and non-deterministic behaviour.
Abstract: FPGA designs do not typically include all available processing elements, e.g., LUTs, DSPs and embedded cores. Additional work is required to manage their different implementations and behaviour, which can unbalance parallel pipelines and complicate development. In this paper we introduce a novel management architecture to unify heterogeneous processing elements into compute pools. A pool formed of E processing elements, each implementing the same function, serves D parallel function calls. A call-and-response approach to computation allows for different processing element implementations, connections, latencies and non-deterministic behaviour. Our rotating scheduler automatically arbitrates access to processing elements, uses greatly simplified routing, and scales linearly with D parallel accesses to the compute pool. Processing elements can easily be added to improve performance, or removed to reduce resource use and routing, facilitating higher operating frequencies. Migrating to larger or smaller FPGAs thus comes at a known performance cost. We assess our framework with a range of neural network activation functions (ReLU, LReLU, ELU, GELU, sigmoid, swish, softplus and tanh) on the Xilinx Alveo U280.
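A software analogue of the compute-pool idea: D callers share E identical processing elements, and a rotating (round-robin) arbiter decides which element answers each call. The sketch below is a behavioural illustration with made-up class and method names, not the hardware scheduler itself:

```python
from collections import deque

class ComputePool:
    """Round-robin arbitration of parallel function calls over a pool of identical
    processing elements (each element is any callable implementing the function)."""
    def __init__(self, elements):
        self.elements = deque(elements)

    def serve(self, calls):
        results = []
        for x in calls:                    # D parallel calls arriving in one round
            pe = self.elements[0]          # element currently at the head of the rotation
            self.elements.rotate(-1)       # rotate so the next call hits the next element
            results.append(pe(x))
        return results

pool = ComputePool([lambda v: max(v, 0.0)] * 4)   # e.g. a pool of four ReLU elements
print(pool.serve([-1.0, 0.5, 2.0]))               # -> [0.0, 0.5, 2.0]
```

Adding or removing elements only changes the rotation length, which mirrors how the hardware pool can grow or shrink at a known performance cost.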

Proceedings ArticleDOI
11 Feb 2022
TL;DR: This paper proposes, analyzes, and evaluates a novel acceleration strategy for causal discovery, which has low communication costs and can effectively exploit FPGA on-chip memory and parallelism, and is the first FPGA-based acceleration approach for constraint-based causal discovery.
Abstract: Causal discovery is a technique to find the causal relationship between variables using data. This technique has many applications in data mining and knowledge discovery. However, the high data dimensionality results in a significant computational efficiency problem. A common speed bottleneck in conventional causal discovery methods is the execution of conditional independence (CI) tests. This paper proposes, analyzes, and evaluates a novel acceleration strategy for causal discovery, which has low communication costs and can effectively exploit FPGA on-chip memory and parallelism. First, we propose an algorithmic method to shift the speed bottleneck from CI test execution to CI test generation. Second, we design a hardware accelerator for CI test generation on FPGAs. Third, we evaluate the proposed approach by comparing the accuracy-speed trade-off against four state-of-the-art accelerated causal discovery tools on CPUs and GPUs. Our accelerated implementation running on an Intel Arria 10 GX FPGA shows a superior accuracy-speed trade-off in 12 causal discovery problems. The implementation achieves up to 8.8 times speedup over the cuPC software running on an NVIDIA GeForce RTX 2080 Ti GPU. It also achieves up to 155.7 times speedup over the stable.fast software running on an Intel Xeon Silver 4110 octa-core CPU. To the best of our knowledge, the proposed approach is the first FPGA-based acceleration approach for constraint-based causal discovery.
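Constraint-based (PC-style) causal discovery spends most of its time running conditional-independence tests such as the partial-correlation (Fisher z) test sketched below in NumPy. The sketch shows only what a single CI test computes; the paper's contribution, shifting the bottleneck to CI test generation and accelerating that on an FPGA, is not modelled here:

```python
import math
import numpy as np

def fisher_z_ci_test(data, x, y, cond, alpha=0.01):
    """Return True if x and y are judged conditionally independent given 'cond'."""
    idx = [x, y] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)                             # precision matrix of the sub-block
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    z = 0.5 * math.log((1 + r) / (1 - r))                  # Fisher z-transform
    stat = math.sqrt(data.shape[0] - len(cond) - 3) * abs(z)
    p_value = math.erfc(stat / math.sqrt(2))               # two-sided Gaussian tail
    return p_value > alpha

rng = np.random.default_rng(0)
z0 = rng.standard_normal(5000)                             # common cause
data = np.column_stack([z0 + 0.1 * rng.standard_normal(5000),
                        z0 + 0.1 * rng.standard_normal(5000),
                        z0])
print(fisher_z_ci_test(data, 0, 1, []))                    # False: marginally dependent
print(fisher_z_ci_test(data, 0, 1, [2]))                   # True: independent given the cause
```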

Book ChapterDOI
TL;DR: In this article, a restricted permutation network (RPN) is proposed to automatically generate a restricted subset of local permutations, preserving the features of the dataset while simplifying the generation to improve scalability.
Abstract: Permutation is a fundamental way of data augmentation. However, it is not commonly used in image-based systems with hardware acceleration due to distortion of spatial correlation and generation complexity. This paper proposes the Restricted Permutation Network (RPN), a scalable architecture to automatically generate a restricted subset of local permutations, preserving the features of the dataset while simplifying the generation to improve scalability. RPN reduces the spatial complexity from O(N log N) to O(N), making it easily scalable to 64 inputs and beyond, with a 21 times speedup in generation and significantly reduced data storage and transfer, while maintaining the same level of accuracy as the original dataset for deep learning training. Experiments show that Convolutional Neural Networks (CNNs) trained on the augmented dataset can be as accurate as those trained on the original one. Combining three to five networks in general improves the network accuracy by 5%. Network training can be accelerated by training multiple sub-networks in parallel with a reduced training data set and fewer epochs, resulting in up to 5 times speedup with a negligible loss in accuracy. This opens up the opportunity to easily split a long iterative training process into independent parallelizable processes, facilitating the trade-off between resources and run time.

Journal ArticleDOI
TL;DR: Various applications, such as N-body simulation and dissipative particle dynamics, demonstrate how hardware acceleration based on custom instructions can target state-of-the-art FPGAs.
Abstract: As Field-programmable Gate Arrays (FPGAs) continue to increase in size and capability, there is an increasing need to develop performant designs in good time. Networked processor templates support scalable on-chip parallelism while avoiding the complexity and long run time of FPGA synthesis tools. The performance of the resulting designs can be further enhanced by adding custom instructions to processors in the network. We show how to systematically choose parts of a design to implement as custom instructions. We illustrate the use of performance and area models targeting a particular networked processor template to explore design trade-offs. Various applications, such as N-body simulation and dissipative particle dynamics, demonstrate how hardware acceleration based on custom instructions can target state-of-the-art FPGAs.

29 Aug 2022
TL;DR: The proposed model can be used for testing resiliency and robustness of trading algorithms and providing advice for policymakers, and has excellent capability of reproducing realistic stylised facts in financial markets.
Abstract: This paper describes simulations and analysis of flash crash scenarios in an agent-based modelling framework. We design, implement, and assess a novel high-frequency agent-based financial market simulator that generates realistic millisecond-level financial price time series for the E-Mini S&P 500 futures market. Specifically, a microstructure model of a single security traded on a central limit order book is provided, where different types of traders follow different behavioural rules. The model is calibrated using the machine learning surrogate modelling approach. Statistical test and moment coverage ratio results show that the model has excellent capability of reproducing realistic stylised facts in financial markets. By introducing an institutional trader that mimics the real-world Sell Algorithm 1 on May 6th, 2010, the proposed high-frequency agent-based financial market simulator is used to simulate the Flash Crash that took place that day. We scrutinise the market dynamics during the simulated flash crash and show that the simulated dynamics are consistent with what happened in historical flash crash scenarios. With the help of Monte Carlo simulations, we discover functional relationships between the amplitude of the simulated 2010 Flash Crash and three conditions: the percentage of volume of the Sell Algorithm, the market maker inventory limit, and the trading frequency of fundamental traders. Similar analyses are carried out for mini flash crash events. An innovative "Spiking Trader" is introduced to the model, aiming at precipitating mini flash crash events. We analyse the market dynamics during the course of a typical simulated mini flash crash event and study the conditions affecting its characteristics. The proposed model can be used for testing resiliency and robustness of trading algorithms and providing advice for policymakers.

28 Sep 2022
TL;DR: In this paper, an FPGA-based low-latency graph neural network (LL-GNN) design for particle detectors is presented, which is enhanced by exploiting the structured adjacency matrix and a column-major data layout.
Abstract: This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors, delivering unprecedented low latency performance. Incorporating FPGA-based GNNs into particle detectors presents a unique challenge since it requires sub-microsecond latency to deploy the networks for online event selection with a data rate of hundreds of terabytes per second in the Level-1 triggers at the CERN Large Hadron Collider experiments. This paper proposes a novel outer-product based matrix multiplication approach, which is enhanced by exploiting the structured adjacency matrix and a column-major data layout. Moreover, a fusion step is introduced to further reduce the end-to-end design latency by eliminating unnecessary boundaries. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with a much better latency but also finds a high accuracy design under given latency constraints. To facilitate this, a customizable template for this low latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 9.0 times faster and consumes up to 12.4 times less power than a GPU implementation. Compared to the previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy. The proposed LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.

Proceedings ArticleDOI
09 Jun 2022
TL;DR: This paper reviews the massively micro-parallel compute system POETS (Partially Ordered Event Triggered System) and illustrates its potential for speeding up demanding applications, showing significant wallclock speedup and power consumption improvement over conventional systems.
Abstract: This paper reviews the massively micro-parallel compute system POETS (Partially Ordered Event Triggered System) and illustrates its potential for speeding up demanding applications. Application domains that benefit from POETS include simulations of physical systems that can be discretised as a mesh. The problem graph is distributed over a large compute mesh; each mesh vertex contains a processor – an FPGA-based RISC-V thread supporting custom instructions in our prototype – and a small amount of local problem state data. There is no central overseer of any sort and processors cannot see memory besides their own. A problem-graph vertex communicates a state change to a neighbour by sending an asynchronous packet. The packets are fixed size and small – currently 64 bytes – and the hardware communications infrastructure is very fast. Applications can use an asynchronous ‘packet storm’ approach, run synchronously using a hardware idle barrier, or run in a globally asynchronous, locally synchronous manner. Results show significant wallclock speedup and power consumption improvement over conventional systems: for one application we show a 40-fold speedup over a conventional CPU-based system; versus a multi-GPU system, the POETS cluster is 26% faster, 60% more power efficient, and 34% more energy efficient.

Proceedings ArticleDOI
15 May 2022
TL;DR: Alongside traditional look-up tables (LUTs), Digital Signal Processors (DSPs) and block memories (BRAMs), modern FPGAs include many specialised processing elements such as CPU cores and AI accelerators.
Abstract: Alongside traditional look-up tables (LUTs), Digital Signal Processors (DSPs) and block memories (BRAMs), modern FPGAs include many specialised processing elements such as CPU cores and AI accelerators [1]. Applications wish to maximise performance by using all available processing elements, but additional work is required to support and manage their different implementations and behaviours.

Proceedings ArticleDOI
28 May 2022
TL;DR: A customizable FPGA-based design is proposed to accelerate binarized GCNs (BiGCNs), which can be parameterized by different loop unrolling and memory partition factors.
Abstract: Graph convolutional networks (GCNs) have demonstrated their excellent algorithmic performance in various graph-based learning applications. Nevertheless, the massive amount of computation required for analyzing graph data structures puts a heavy burden on the hardware performance, limiting their deployment in real-life scenarios. To address this issue, we propose a customizable FPGA-based design to accelerate binarized GCNs (BiGCNs). The proposed accelerator is parameterized by different loop unrolling and memory partition factors, which can be reconfigured to fulfill different user needs. To ease the bandwidth requirement of BiGCNs, our design overlaps data transfer with computation. We also adopt COO (coordinate) format storage for the adjacency matrix to skip redundant computation and improve hardware performance. Our experimental results demonstrate that the proposed FPGA-based BiGCN design achieves 202× and 10.6× speedup over CPU and GPU implementations on the Flickr dataset.
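The COO (coordinate) storage mentioned above keeps only the non-zero entries of the adjacency matrix as (row, column, value) triples, so aggregation touches no zero entries. A minimal NumPy sketch of that aggregation step; the binarized weights and the transfer/compute overlap of the actual accelerator are not modelled:

```python
import numpy as np

def coo_spmm(rows, cols, vals, X, n_out):
    """Sparse-dense product A @ X with A in COO format: only stored non-zeros are visited."""
    out = np.zeros((n_out, X.shape[1]))
    for r, c, v in zip(rows, cols, vals):
        out[r] += v * X[c]                 # for a binary adjacency, this is a pure addition
    return out

# Toy 3-node graph with edges 0->1, 1->2, 2->0 and 2-dimensional node features
rows, cols, vals = np.array([0, 1, 2]), np.array([1, 2, 0]), np.ones(3)
X = np.arange(6, dtype=float).reshape(3, 2)
print(coo_spmm(rows, cols, vals, X, 3))
```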

Proceedings ArticleDOI
10 Jul 2022
TL;DR: An evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload, and hardware performance improvement on optimized CPU and GPU implementations is demonstrated.
Abstract: Quantifying the uncertainty of neural networks (NNs) has been required by many safety-critical applications such as autonomous driving or medical diagnosis. Recently, Bayesian transformers have demonstrated their capabilities in providing high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism that is core to the transformer architecture, and the repeated Monte Carlo sampling to quantify the predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that the sparsity brings hardware performance improvement on our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with hardware optimizations, reduce the latency of Bayesian transformers by up to 13, 12 and 20 times on CPU, GPU and FPGA platforms respectively, while achieving higher algorithmic performance.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a fully spectral CNN using a novel spectral-domain adaptive rectified linear unit (ReLU) layer, which completely removes the compute-intensive transformations between the spatial and frequency domains within the network.
Abstract: Computing convolutional layers in the frequency domain using the fast Fourier transform (FFT) has been demonstrated to be effective in reducing the computational complexity of convolutional neural networks (CNNs). Nevertheless, the main challenge of this approach lies in the frequent and repeated transformations between the spatial and frequency domains due to the absence of nonlinear functions in the spectral domain, which makes the benefit less attractive for low-latency inference, especially on embedded platforms. To overcome the drawbacks of existing FFT-based convolution, we propose a fully spectral CNN using a novel spectral-domain adaptive rectified linear unit (ReLU) layer, which completely removes the compute-intensive transformations between the spatial and frequency domains within the network. The proposed fully spectral CNNs maintain the nonlinearity of spatial CNNs while taking hardware efficiency into account. We then propose a deeply customized and compute-efficient hardware architecture to accelerate the fully spectral CNN inference on a field-programmable gate array (FPGA). Different hardware optimizations, such as spectral-domain intralayer and interlayer pipeline techniques, are introduced to further improve throughput. To achieve a load-balanced pipeline, a design space exploration (DSE) framework is proposed to optimize the resource allocation between hardware modules according to the resource constraints. On an Intel Arria 10 SX160 FPGA, our optimized accelerator achieves a throughput of 204 Gop/s with 80% compute efficiency. Compared with the state-of-the-art spatial and FFT-based implementations on the same device, our accelerator is 4×–6.6× and 3.0×–4.4× faster while maintaining a similar level of accuracy across different benchmark datasets.
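The starting point of the work above is the convolution theorem: a (circular) convolution in the spatial domain equals an element-wise product in the frequency domain, so FFT-based layers avoid the O(K²) multiply-accumulates per output. The NumPy sketch below checks that equivalence; the paper's spectral-domain adaptive ReLU, which removes the transforms back to the spatial domain, is not shown:

```python
import numpy as np

def circular_conv2d_spatial(x, k):
    """Direct circular 2-D convolution: O(H*W*K*K) multiply-accumulates."""
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            for a in range(k.shape[0]):
                for b in range(k.shape[1]):
                    out[i, j] += k[a, b] * x[(i - a) % H, (j - b) % W]
    return out

def circular_conv2d_fft(x, k):
    """Same result via FFT: element-wise product in the frequency domain."""
    H, W = x.shape
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k, s=(H, W))))

rng = np.random.default_rng(0)
x, k = rng.standard_normal((8, 8)), rng.standard_normal((3, 3))
assert np.allclose(circular_conv2d_spatial(x, k), circular_conv2d_fft(x, k))
```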

Proceedings ArticleDOI
09 Jun 2022
TL;DR: This paper introduces Design-Flow Patterns, which capture modular, recurring application-agnostic elements involved in mapping and optimising application descriptions onto efficient CPU and GPU targets and is the first to codify and programmatically coordinate these elements into fully automated, customisable, and reusable end-to-end design-flows.
Abstract: Continuing advances in heterogeneous and parallel computing enable massive performance gains in domains such as AI and HPC. Such gains often involve using hardware accelerators, such as FPGAs and GPUs, to speed up specific workloads. However, to make effective use of emerging heterogeneous architectures, optimisation is typically done manually by highly-skilled developers with in-depth understanding of the target hardware. The process is tedious, error-prone, and must be repeated for each new application. This paper introduces Design-Flow Patterns, which capture modular, recurring application-agnostic elements involved in mapping and optimising application descriptions onto efficient CPU and GPU targets. Our approach is the first to codify and programmatically coordinate these elements into fully automated, customisable, and reusable end-to-end design-flows. We implement key design-flow patterns using the meta-programming tool Artisan, and evaluate automated design-flows applied to three sequential C++ applications. Compared to single-threaded implementations, our approach generates multi-threaded OpenMP CPU designs achieving up to 18 times speedup on a CPU platform with 32-threads, as well as HIP GPU designs achieving up to 1184 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU.

Proceedings ArticleDOI
28 May 2022
TL;DR: In this article, a ring-based architecture is proposed to leverage parallel accesses to the constituent block memories that form an FPGA's on-chip memory, benefiting low latency applications that rely on highly-complex functions, numerical precision via iterative computation, or many parallel data-paths accessing a shared memory resource.
Abstract: Memory-based computing stores pre-computed function results in memory to be read at runtime. FPGAs group together multiple block memories (BRAMs) to form this memory, all accessed as a single monolithic device. We introduce a novel ring-based architecture to leverage parallel accesses to these constituent BRAMs, benefiting low latency applications that rely on: highly-complex functions; numerical precision via iterative computation; or many parallel data-paths accessing a shared memory resource. The implemented function’s performance is independent of its complexity, enabling significant latency reductions for compute-bound operations. We assess common functions (sqrt, power, trigonometric, hyperbolic functions) on the Xilinx Alveo U280 FPGA. Our function-agnostic memory-compute core can serve 1024 parallel function calls at 300MHz and reduce latency by 4.4–29× versus traditional FPGA implementations.
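Memory-based computing is easiest to picture in software as a pre-computed lookup table: the function is evaluated once over a grid at build time, and a runtime query is just an index calculation plus a memory read, whatever the function's complexity. A small NumPy sketch under assumed table size and input range; the ring-based parallel access scheme of the paper is not modelled:

```python
import numpy as np

TABLE_BITS = 10                              # 2**10 pre-computed entries
X_MIN, X_MAX = 0.0, 4.0                      # assumed input range
GRID = np.linspace(X_MIN, X_MAX, 2 ** TABLE_BITS)
SQRT_TABLE = np.sqrt(GRID)                   # any costly function could be stored instead

def sqrt_from_table(x):
    """Runtime cost: one index computation and one table read per query."""
    pos = (np.asarray(x) - X_MIN) / (X_MAX - X_MIN) * (2 ** TABLE_BITS - 1)
    idx = np.clip(np.rint(pos).astype(int), 0, 2 ** TABLE_BITS - 1)
    return SQRT_TABLE[idx]

queries = np.array([0.5, 2.0, 3.99])
print(sqrt_from_table(queries))              # table approximations
print(np.sqrt(queries))                      # direct evaluation for comparison
```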

Posted ContentDOI
01 May 2022 - Wilmott
TL;DR: XGB-Chiarella is a powerful new approach for deploying agent-based models to generate realistic intra-day artificial financial price data.
Abstract: This article presents XGB-Chiarella, a powerful new approach for deploying agent-based models to generate realistic intra-day artificial financial price data.

Journal ArticleDOI
TL;DR: This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator supporting both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences, and contributes to high performance and energy-efficient FPGA-based multi-RNN inference designs for datacenters.
Abstract: This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator supporting both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences. It increases processing capabilities and quality of service of cloud-based neural processing units (NPUs) by improving their hardware utilization and by reducing design latency, with two innovations. First, a custom coarse-grained multi-threaded RNN/Long Short-Term Memory (LSTM) hardware architecture, switching tasks among threads when RNN computational engines meet data hazards. Second, the partitioning of this hardware architecture into multiple full-fledged sub-accelerator cores, enabling spatial co-execution of multiple RNN/LSTM inferences. These innovations improve the exploitation of the available parallelism to increase runtime hardware utilization and boost design throughput. Evaluation results show that a dual-threaded quad-core Remarn NPU achieves 2.91 times higher performance while only occupying 5.0% more area than a single-threaded one on a Stratix 10 FPGA. When compared with a Tesla V100 GPU implementation, our design achieves 6.5 times better performance and 15.6 times higher power efficiency, showing that our approach contributes to high performance and energy-efficient FPGA-based multi-RNN inference designs for datacenters.