
Showing papers on "Field-programmable gate array published in 2021"


Proceedings ArticleDOI
17 Feb 2021
TL;DR: AutoSA as mentioned in this paper is an end-to-end compilation framework for generating systolic arrays on FPGA, which is based on the polyhedral framework and further incorporates a set of optimizations on different dimensions to boost performance.
Abstract: While systolic array architectures have the potential to deliver tremendous performance, it is notoriously challenging to customize an efficient systolic array processor for a target application. Designing systolic arrays requires knowledge for both high-level characteristics of the application and low-level hardware details, thus making it a demanding and inefficient process. To relieve users from the manual iterative trial-and-error process, we present AutoSA, an end-to-end compilation framework for generating systolic arrays on FPGA. AutoSA is based on the polyhedral framework, and further incorporates a set of optimizations on different dimensions to boost performance. An efficient and comprehensive design space exploration is performed to search for high-performance designs. We have demonstrated AutoSA on a wide range of applications, on which AutoSA achieves high performance within a short amount of time. As an example, for matrix multiplication, AutoSA achieves 934 GFLOPs, 3.41 TOPs, and 6.95 TOPs in floating point, 16-bit and 8-bit integer data types on Xilinx Alveo U250.
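The paper's headline example, systolic matrix multiplication, follows the computation pattern below (a minimal sequential software model for illustration only; AutoSA itself emits FPGA hardware, and the function name here is ours):

```python
# Illustrative software model (not AutoSA's generated RTL): an
# output-stationary systolic array computing C = A @ B, where PE (i, j)
# accumulates one output element as operands are pumped through.
def systolic_matmul(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    # In hardware all PEs fire every cycle; a sequential model can simply
    # walk the reduction dimension t and let each PE consume A[i][t], B[t][j].
    for t in range(k):
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j]
    return C
```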

77 citations


Journal ArticleDOI
TL;DR: The ANN-MPC approach can significantly reduce the computing need and allow the use of more accurate high-order system models due to the simple mathematical expression of ANN, and retain the robustness for system parameter uncertainties by flexibly setting the input elements.
Abstract: There has been an increasing interest in using model predictive control (MPC) for power electronic applications. However, the exponential increase in computational complexity and demand of computing resources hinders the practical adoption of this highly promising control technique. In this paper, a new MPC approach using an artificial neural network (termed ANN-MPC) is proposed to overcome these barriers. The ANN-MPC approach can significantly reduce the computing need and allow the use of more accurate high-order system models due to the simple mathematical expression of ANN. This is particularly important for multi-level and multi-phase power systems as their number of switching states increases exponentially. Furthermore, the ANN-MPC approach can retain the robustness for system parameter uncertainties by flexibly setting the constraint conditions. The basic concept, ANN structure, off-line training method, and online operation of ANN-MPC are described in detail. The computing resource requirements of the ANN-MPC and conventional MPC are analyzed and compared. The ANN-MPC concept is validated by both simulation and experimental results on two kW-class flying capacitor multilevel converters. It is demonstrated that the FPGA-based ANN-MPC controller can significantly reduce the FPGA resource requirement while offering the same control performance as the conventional MPC.
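The conventional finite-control-set MPC loop that ANN-MPC replaces can be sketched as follows (the toy first-order plant and all names here are assumptions for illustration, not the paper's converter model):

```python
# Hypothetical one-step finite-control-set MPC: the controller enumerates
# every switching state, predicts the next state with a discrete model,
# and picks the state of minimum cost. The exponential growth of this
# enumeration with levels/phases is what ANN-MPC is designed to avoid.
def fcs_mpc_step(x, x_ref, switching_states, predict, cost):
    best_u, best_j = None, float("inf")
    for u in switching_states:
        x_next = predict(x, u)      # model-based one-step prediction
        j = cost(x_next, x_ref)     # e.g. squared tracking error
        if j < best_j:
            best_u, best_j = u, j
    return best_u

# Toy first-order plant x[k+1] = 0.9*x[k] + u (assumed, illustration only)
predict = lambda x, u: 0.9 * x + u
cost = lambda x, r: (x - r) ** 2
```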

70 citations


Journal ArticleDOI
15 Mar 2021
TL;DR: This paper provides an exhaustive study of DES and AES implementations on field-programmable gate arrays (FPGAs), showing how hardware dedicated to a single task can outperform general-purpose computers.
Abstract: In recent days, the increasing number of Internet and wireless network users has accelerated the need for encryption mechanisms and devices to protect user data shared across unsecured networks. These mechanisms provide data security, integrity, and verification. Symmetric block ciphers play an essential role in Internet traffic encryption. The Data Encryption Standard (DES) and the Advanced Encryption Standard (AES) are the privacy-encryption algorithms underlying data protection standards, and both have been realized in software as well as in hardware. Hardware implementations of DES and AES offer many advantages, such as increased performance and improved security. This paper provides an exhaustive study of DES and AES implementations on field-programmable gate arrays (FPGAs). Since an FPGA can be dedicated to a single task, it can outperform a general-purpose computer for that task.

57 citations


Journal ArticleDOI
TL;DR: Low-power, high-speed hardware architectures for the efficient field-programmable gate array (FPGA) implementation of the advanced encryption standard (AES) algorithm to provide data security, with a modified positive polarity Reed-Muller (MPPRM) architecture inserted to reduce hardware requirements.
Abstract: Nowadays, a huge amount of digital data is frequently exchanged among different embedded devices over wireless communication technologies. Data security is considered an important parameter for avoiding information loss and preventing cyber-crimes. This research article details the low-power, high-speed hardware architectures for the efficient field-programmable gate array (FPGA) implementation of the advanced encryption standard (AES) algorithm to provide data security. This work does not depend on look-up tables (LUTs) for implementing the SubBytes and InvSubBytes transformations of AES encryption and decryption; the new architecture uses combinational logic circuits for the SubBytes and InvSubBytes transformations. Eliminating the LUTs removes unwanted delays, and a subpipelining structure is introduced to improve the speed of the AES algorithm. Here, a modified positive polarity Reed-Muller (MPPRM) architecture is inserted to reduce the total hardware requirements, and comparisons are made with different implementations. With the MPPRM architecture introduced in the SubBytes stages, an efficient MixColumns and InvMixColumns architecture suited to subpipelined round units is added. The performance of the proposed AES-MPPRM architecture is analyzed in terms of number of slice registers, flip flops, number of slice LUTs, number of logical elements, slices, bonded IOB, operating frequency and delay. It is compared with five different AES architectures: LAES, AES-CTR, AES-CFA, AES-BSRD, and AES-EMCBE. The LUT count of the AES-MPPRM architecture designed on the Spartan-6 is reduced by up to 15.45% when compared to the AES-BSRD.
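The LUT-free SubBytes idea, computing the S-box as a GF(2^8) multiplicative inverse followed by the AES affine transform, can be sketched functionally (a behavioral reference, not the paper's combinational circuit):

```python
# LUT-free S-box sketch: SubBytes = multiplicative inverse in GF(2^8)
# followed by an affine transform, computed with logic rather than a table.
def gf_mul(a, b):
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B        # AES irreducible polynomial x^8+x^4+x^3+x+1
        b >>= 1
    return p

def gf_inv(a):
    # a^254 = a^-1 in GF(2^8); 0 maps to 0 by AES convention
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def sub_byte(x):
    y = gf_inv(x)
    res = 0x63                # AES affine constant
    for i in range(8):
        bit = ((y >> i) ^ (y >> ((i + 4) % 8)) ^ (y >> ((i + 5) % 8))
               ^ (y >> ((i + 6) % 8)) ^ (y >> ((i + 7) % 8))) & 1
        res ^= bit << i
    return res
```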

53 citations


Journal ArticleDOI
TL;DR: Generic area-optimized, low-latency accurate, and approximate softcore multiplier architectures, which exploit the underlying architectural features of FPGAs, i.e., lookup table (LUT) structures and fast-carry chains to reduce the overall critical path delay (CPD) and resource utilization of multipliers
Abstract: Multiplication is one of the widely used arithmetic operations in a variety of applications, such as image/video processing and machine learning. FPGA vendors provide high-performance multipliers in the form of DSP blocks. These multipliers are not only limited in number and have fixed locations on FPGAs but can also create additional routing delays and may prove inefficient for smaller bit-width multiplications. Therefore, FPGA vendors additionally provide optimized soft IP cores for multiplication. However, in this work, we advocate that these soft multiplier IP cores for FPGAs still need better designs to provide high performance and resource efficiency. Towards this, we present generic area-optimized, low-latency accurate and approximate softcore multiplier architectures, which exploit the underlying architectural features of FPGAs, i.e., look-up table (LUT) structures and fast carry chains, to reduce the overall critical path delay and resource utilization of multipliers. Compared to the Xilinx multiplier LogiCORE IP, our proposed unsigned and signed accurate architectures provide up to 25% and 53% reduction in LUT utilization, respectively, for different sizes of multipliers. Moreover, with our unsigned approximate multiplier architectures, a reduction of up to 51% in the critical path delay can be achieved with an insignificant loss in output accuracy when compared with the LogiCORE IP. For illustration, we have deployed the proposed multiplier architectures in accelerators used in image and video applications, and evaluated them for area and performance gains. Our library of accurate and approximate multipliers is open-source and available online at https://cfaed.tu-dresden.de/pd-downloads to fuel further research and development in this area, facilitate reproducible research, and thereby enable a new research direction for the FPGA community.
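One common way approximate multipliers trade accuracy for logic, truncating low-order partial-product columns, can be sketched as below (an illustrative scheme of the general technique, not the paper's exact architecture):

```python
# Sketch of partial-product truncation: zero out the 'trunc' least
# significant columns of each partial product, so the small adders and
# carry logic for those columns can be dropped in hardware.
def truncated_mul(a, b, bits=8, trunc=4):
    acc = 0
    for i in range(bits):
        if b >> i & 1:
            pp = a << i
            acc += pp >> trunc << trunc   # discard columns below 'trunc'
    return acc
```

With `trunc=0` the result is exact; larger `trunc` saves logic at the cost of a bounded, always-non-positive error.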

43 citations


Journal ArticleDOI
20 Oct 2021
TL;DR: In this article, the authors present a deterministic approach to correcting circuit errors by locally correcting hardware errors within individual optical gates, and apply their approach to simulations of large scale optical neural networks and infinite impulse response filters implemented in programmable photonics, finding that they remain resilient to component error well beyond modern day process tolerances.
Abstract: Programmable photonic circuits of reconfigurable interferometers can be used to implement arbitrary operations on optical modes, providing a flexible platform for accelerating tasks in quantum simulation, signal processing, and artificial intelligence. A major obstacle to scaling up these systems is static fabrication error, where small component errors within each device accrue to produce significant errors within the circuit computation. Mitigating this error usually requires numerical optimization dependent on real-time feedback from the circuit, which can greatly limit the scalability of the hardware. Here we present a deterministic approach to correcting circuit errors by locally correcting hardware errors within individual optical gates. We apply our approach to simulations of large scale optical neural networks and infinite impulse response filters implemented in programmable photonics, finding that they remain resilient to component error well beyond modern day process tolerances. Our results highlight a potential way to scale up programmable photonics to hundreds of modes with current fabrication processes.

40 citations


Journal ArticleDOI
TL;DR: This paper optimizes the CNN-based model for hardware implementation, which establishes a foundation for efficiently mapping the network on a field-programmable gate array (FPGA), and proposes a hardware architecture for the CNN-based remote sensing object detection model.
Abstract: In recent years, convolutional neural network (CNN)-based methods have been widely used for optical remote sensing object detection and have shown excellent performance. Some aerospace systems, such as satellites or aircrafts, need to adopt these methods to observe objects on the ground. Due to the limited budget of the logical resources and power consumption in these systems, an embedded device is a good choice to implement the CNN-based methods. However, it is still a challenge to strike a balance between performance and power consumption. In this paper, we propose an efficient hardware-implementation method for optical remote sensing object detection. Firstly, we optimize the CNN-based model for hardware implementation, which establishes a foundation for efficiently mapping the network on a field-programmable gate array (FPGA). In addition, we propose a hardware architecture for the CNN-based remote sensing object detection model. In this architecture, a general processing engine (PE) is proposed to implement multiple types of convolutions in the network using the uniform module. An efficient data storage and access scheme is also proposed, and it achieves low-latency calculations and a high memory bandwidth utilization rate. Finally, we deployed the improved YOLOv2 network on a Xilinx ZYNQ xc7z035 FPGA to evaluate the performance of our design. The experimental results show that the performance of our implementation on an FPGA is only 0.18% lower than that on a graphics processing unit (GPU) in mean average precision (mAP). Under a 200 MHz working frequency, our design achieves a throughput of 111.5 giga-operations per second (GOP/s) with a 5.96 W on-chip power consumption. Comparison with the related works demonstrates that the proposed design has obvious advantages in terms of energy efficiency and that it is suitable for deployment on embedded devices.

37 citations


Book ChapterDOI
01 Jan 2021
TL;DR: In this article, a novel 3-dimensional (3D) vertically integrated adaptive computing structure is presented, combining state-of-the-art processing and interconnection technology through the vertical integration of two Configurable Array Processor chips.
Abstract: This work presents a novel 3-dimensional (3D) vertically integrated adaptive computing structure. This 3D FPGA combines state-of-the-art processing and interconnection technology, vertically stacking two Configurable Array Processor chips. The Configurable Array Processor is an array of heterogeneous processing elements, while the Intelligent Configurable Switch contains a switch controller, on-chip program and data memory, and a data frame buffer along with a Direct Memory Access (DMA) controller. The 3D FPGA architecture targets real-time communication and multimedia signal processing as a next-generation computing system; a design and verification methodology, including high-level modeling and architecture exploration, is used to determine the optimal hardware specification at an early design stage. The router can handle several destination addresses at the same time, recombine packets, send the same data to different destination nodes, and avoid congestion. The innovative multicast 3D NoC router implements the proposed algorithm; the NoC is designed in Verilog HDL, its operation is simulated in the ModelSim software, and the design has been synthesized for FPGA.

35 citations


Journal ArticleDOI
TL;DR: In this article, a pseudo-random number generator (PRNG) with a feedback controller based on a Hopfield neural network chaotic oscillator is proposed, in which a neuron is exposed to electromagnetic radiation.
Abstract: When implementing a pseudo-random number generator (PRNG) for neural network chaos-based systems on FPGAs, chaotic degradation caused by numerical accuracy constraints can have a dramatic impact on the performance of the PRNG. To suppress this degradation, a PRNG with a feedback controller based on a Hopfield neural network chaotic oscillator is proposed, in which a neuron is exposed to electromagnetic radiation. We choose the magnetic flux across the cell membrane of the neuron as a feedback condition of the feedback controller to disturb other neurons, thus avoiding periodicity. The proposed PRNG is modeled and simulated in Vivado 2018.3 and implemented and synthesized on a Xilinx ZYNQ-XC7Z020 FPGA using Verilog HDL. As the basic entropy source, the Hopfield neural network with one neuron exposed to electromagnetic radiation has been implemented on the FPGA using the high-precision 32-bit fourth-order Runge-Kutta (RK4) algorithm with IEEE 754-1985 floating-point arithmetic. The post-processing module consists of 32 registers and 15 XOR comparators. The binary data generated by the scheme was tested and analyzed using the NIST 800.22 statistical test suite. The results show that it has high security and randomness. Finally, an image encryption and decryption system based on the PRNG is designed and implemented on FPGA. The feasibility of the system is proved by simulation and security analysis.

34 citations


Journal ArticleDOI
TL;DR: The experimental results show that the hyper-chaotic oscillator has a higher level of security than the chaotic one, but it is slower and utilizes more FPGA resources.
Abstract: Hyper-chaotic systems can exhibit a higher level of complexity in comparison with chaotic systems. However, they require more resources when they are realized on a modular field-programmable gate array (FPGA). In this paper, we introduce a full hardware/software comparison and security analysis of three-dimensional chaotic and four-dimensional hyper-chaotic oscillator systems. The two systems (previously implemented only in analog form) are realized on a modular FPGA hardware platform to generate high-speed random bit-streams. The realization is performed using two versions of VHDL code: one generated automatically using the MATLAB HDL-Coder, and an optimized one written manually. The work explores the features of each oscillator system, such as throughput, FPGA resource utilization, operating clock frequency, and security of the generated bit-streams, to show a compromise solution on these features. The experimental results show that the hyper-chaotic oscillator has a higher level of security than the chaotic one, but it is slower and utilizes more FPGA resources. However, when the overall comparison measure figure of merit (FOM) is used, the chaotic system shows 188% better FOM than the hyper-chaotic system (for the automatically generated version) and 183% (for the manually written one).

32 citations


Journal ArticleDOI
TL;DR: It is demonstrated that a neural network autoencoder model can be implemented in a radiation tolerant ASIC to perform lossy data compression alleviating the data transmission problem while preserving critical information of the detector energy profile.
Abstract: Despite advances in the programmable logic capabilities of modern trigger systems, a significant bottleneck remains in the amount of data to be transported from the detector to off-detector logic where trigger decisions are made. We demonstrate that a neural network (NN) autoencoder model can be implemented in a radiation-tolerant application-specific integrated circuit (ASIC) to perform lossy data compression alleviating the data transmission problem while preserving critical information of the detector energy profile. For our application, we consider the high-granularity calorimeter from the Compact Muon Solenoid (CMS) experiment at the CERN Large Hadron Collider. The advantage of the machine learning approach is in the flexibility and configurability of the algorithm. By changing the NN weights, a unique data compression algorithm can be deployed for each sensor in different detector regions and changing detector or collider conditions. To meet area, performance, and power constraints, we perform quantization-aware training to create an optimized NN hardware implementation. The design is achieved through the use of high-level synthesis tools and the hls4ml framework and was processed through synthesis and physical layout flows based on a low-power (LP)-CMOS 65-nm technology node. The flow anticipates 200 Mrad of ionizing radiation to select gates and reports a total area of 3.6 mm2 and consumes 95 mW of power. The simulated energy consumption per inference is 2.4 nJ. This is the first radiation-tolerant on-detector ASIC implementation of an NN that has been designed for particle physics applications.

Journal ArticleDOI
TL;DR: In this article, the authors investigate the implementation of a high-performance TDL-TDC addressed to 28-nm 7-Series Xilinx FPGA, taking into account the comparison between different technological nodes from 65-nm to 20-nm.
Abstract: The Field-Programmable Gate Array (FPGA) structure poses several constraints that make the implementation of complex asynchronous circuits, such as Time-Mode (TM) circuits, almost unfeasible. In particular, in Programmable Logic (PL) devices such as FPGAs, the operation of the logic is usually synchronous with the system clock. However, it can happen that very high performance specifications demand abandoning this paradigm and following an asynchronous implementation. The main driver forcing the use of programmable logic solutions instead of tailored Application-Specific Integrated Circuits (ASICs), which best suit an asynchronous design, is the request from the research community and industrial R&D for fast prototyping at low Non-Recurring Engineering (NRE) cost. For instance, in the case of a highly resolved Time-to-Digital Converter (TDC), a signal clocked at some hundreds of MHz implemented in an FPGA allows implementing a TDC with nanosecond resolution. If a higher resolution is required, the signal frequency cannot be increased further, and one of the aces up the designer's sleeve is the propagation delay of the logic, used to quantize time intervals by means of a so-called Tapped Delay-Line (TDL). Implementing a TDL-based TDC in an FPGA requires special attention from the designer, both in making the best use of all available resources and in foreseeing how signals propagate inside these devices. In this paper, we investigate the implementation of a high-performance TDL-TDC addressed to 28-nm 7-Series Xilinx FPGAs, taking into account the comparison between different technological nodes from 65 nm to 20 nm. In this context, the term high-performance means extended dynamic range (up to 10.3 s), high resolution and single-shot precision (up to 366 fs and 12 ps r.m.s. respectively), low differential and integral non-linearity (up to 250 fs and 2.5 ps respectively), and multi-channel capability (up to 16).
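The fine-time measurement of a TDL-TDC reduces to decoding the thermometer code sampled along the delay line, roughly as follows (a uniform, pre-calibrated tap delay is assumed here for simplicity; real tap delays are non-uniform and must be calibrated):

```python
# Sketch of tapped delay-line quantization: the hit signal propagates
# along a carry chain; flip-flops sample which taps it reached at the
# next clock edge, giving a thermometer code whose length is the fine time.
def tdl_fine_time(thermometer_bits, tap_delay_ps):
    ones = 0
    for b in thermometer_bits:
        if not b:
            break            # ideal thermometer code: ones, then zeros
        ones += 1
    return ones * tap_delay_ps
```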

Proceedings ArticleDOI
22 Jun 2021
TL;DR: In this paper, the authors proposed an acceleration framework coupling the balanced model compression at the algorithm level and FPGA-implementation optimization at the hardware level, which can significantly save storage space.
Abstract: Recently, Transformers have gradually gained popularity and perform outstandingly for many Natural Language Processing (NLP) tasks. However, Transformers suffer from heavy computation and memory footprint, making them difficult to deploy on embedded devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms for its advantages. However, the trained Transformer models are too large to fit into an FPGA fabric. To accommodate Transformers on FPGAs and achieve efficient execution, we propose an acceleration framework coupling balanced model compression at the algorithm level and FPGA-implementation optimization at the hardware level. At the algorithm level, we adopt block-balanced pruning and propose an efficient sparse matrix storage format for this pruning technique, named Compressed Block Row (CBR). At the hardware level, we design an accelerator for the sparse model and abstract a performance analytic model to evaluate the accelerator's performance. Experiments show that our CBR format performs better than general formats and can significantly save storage space, and that our accelerator achieves 38× and 1.93× speedup compared to other works on CPU and GPU, respectively.
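A plausible reading of a block-row sparse format in the spirit of CBR (the paper's exact field layout may differ) is to store, per block row, only the nonzero blocks plus their block-column indices:

```python
# Sketch of a block-row sparse format: dense blocks that are entirely
# zero after block-balanced pruning are dropped; surviving blocks are
# kept with a block-column index array and a per-block-row pointer array,
# analogous to CSR but at block granularity.
def to_block_rows(mat, bs):
    blocks, col_idx, row_ptr = [], [], [0]
    for bi in range(0, len(mat), bs):
        for bj in range(0, len(mat[0]), bs):
            blk = [row[bj:bj + bs] for row in mat[bi:bi + bs]]
            if any(any(v for v in row) for row in blk):
                blocks.append(blk)
                col_idx.append(bj // bs)
        row_ptr.append(len(blocks))
    return blocks, col_idx, row_ptr
```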

Journal ArticleDOI
TL;DR: In this paper, a configurable and scalable core for real-time object detection and classification based on YOLO targeting embedded platforms has been proposed, which accelerates the execution of all the algorithm steps, including preprocessing, model inference and post-processing.
Abstract: Object detection and classification is an essential task of computer vision. A very efficient algorithm for detection and classification is YOLO (You Only Look Once). We consider hardware architectures to run YOLO in real-time on embedded platforms. Designing a new dedicated accelerator for each new version of YOLO is not feasible given the fast delivery of new versions. This work’s primary goal is to design a configurable and scalable core for creating specific object detection and classification systems based on YOLO, targeting embedded platforms. The core accelerates the execution of all the algorithm steps, including pre-processing, model inference and post-processing. It considers a fixed-point format, linearised activation functions, batch-normalisation, folding, and a hardware structure that exploits most of the available parallelism in CNN processing. The proposed core is configured for real-time execution of YOLOv3-Tiny and YOLOv4-Tiny, integrated into a RISC-V-based system-on-chip architecture and prototyped in an UltraScale XCKU040 FPGA (Field Programmable Gate Array). The solution achieves a performance of 32 and 31 frames per second for YOLOv3-Tiny and YOLOv4-Tiny, respectively, with a 16-bit fixed-point format. Compared to previous proposals, it improves the frame rate at a higher performance efficiency. The performance, area efficiency and configurability of the proposed core enable the fast development of real-time YOLO-based object detectors on embedded systems.
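The 16-bit fixed-point arithmetic such a core relies on can be sketched with simple quantization helpers (the Q8.8-style integer/fraction split and the saturation policy are assumptions for illustration):

```python
# 16-bit fixed-point helpers: a real number is scaled by 2^frac_bits,
# rounded to an integer, and saturated to the signed 16-bit range
# rather than wrapping, as hardware quantizers commonly do.
def to_fixed(x, frac_bits=8, width=16):
    lo, hi = -(1 << width - 1), (1 << width - 1) - 1
    v = int(round(x * (1 << frac_bits)))
    return max(lo, min(hi, v))

def from_fixed(v, frac_bits=8):
    return v / (1 << frac_bits)
```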

Journal ArticleDOI
TL;DR: A lightweight neural network architecture termed SparkNet, capable of significantly reducing the weight parameters and computation demands; the network model of SparkNet and the proposed accelerator architecture are both specifically built for FPGA.

Journal ArticleDOI
TL;DR: A lightweight advanced encryption standard (AES), a high-security symmetric cryptography algorithm, implemented on a field-programmable gate array (FPGA) and in 65-nm technology for resource-constrained IoT devices; ASIC results show improvements in area over previous similar works.
Abstract: Due to the fast-growing number of connected tiny devices to the Internet of Things (IoT), providing end-to-end security is vital. Therefore, it is essential to design the cryptosystem based on the requirement of resource-constrained IoT devices. This article presents a lightweight advanced encryption standard (AES), a high-secure symmetric cryptography algorithm, implementation on field-programmable gate array (FPGA) and 65-nm technology for resource-constrained IoT devices. The proposed architecture includes 8-bit datapath and five main blocks. We design two specified register banks, Key-Register and State-Register, for storing the plain text, keys, and intermediate data. To reduce the area, Shift-Rows is embedded inside the State-Register. To adapt the Mix-Column to 8-bit datapath, we design an optimized 8-bit block for Mix-Columns with four internal registers, which accept 8-bit and send back 8-bit. Also, a shared optimized Sub-Bytes is employed for the key expansion phase and encryption phase. To optimize Sub-Bytes, we merge and simplify some parts of the Sub-Bytes. To reduce power consumption, we apply the clock gating technique to the design. Application-specific integrated circuit (ASIC) implementation results show improvements in area over previous similar works ranging from 2.4% to 35%. Based on the results, the proposed design is a suitable cryptosystem for tiny IoT devices.
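The Mix-Columns transformation that the paper adapts to an 8-bit datapath is, functionally, a few xtime (multiply-by-x) operations and XORs per output byte; a behavioral reference (not the paper's serialized hardware) on one state column:

```python
# MixColumns on one AES state column via the xtime primitive; an 8-bit
# datapath like the paper's processes these bytes serially through a
# small block with four internal registers.
def xtime(a):
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def mix_column(col):
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,   # 2*a0 ^ 3*a1 ^ a2 ^ a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```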

Proceedings ArticleDOI
01 Feb 2021
TL;DR: Soft embedded FPGA redaction as discussed by the authors is a hardware obfuscation approach that allows the designer to substitute security-critical IP blocks within a design with a synthesizable eFPGA fabric.
Abstract: In recent years, IC reverse engineering and IC fabrication supply chain security have grown to become significant economic and security threats for designers, system integrators, and end customers. Many of the existing logic locking and obfuscation techniques have shown to be vulnerable to attack once the attacker has access to the design netlist either through reverse engineering or through an untrusted fabrication facility. We introduce soft embedded FPGA redaction, a hardware obfuscation approach that allows the designer to substitute security-critical IP blocks within a design with a synthesizable eFPGA fabric. This method fully conceals the logic and the routing of the critical IP and is compatible with standard ASIC flows for easy integration and process portability. To demonstrate eFPGA redaction, we obfuscate a RISC-V control path and a GPS P-code generator. We also show that the modified netlists are resilient to SAT attacks with moderate VLSI overheads. The secure RISC-V design has 1.89x area and 2.36x delay overhead while the GPS design has 1.39x area and negligible delay overhead when implemented on an industrial 22nm FinFET CMOS process.

Journal ArticleDOI
TL;DR: This work designs an artificial intelligence and Internet of Things empowered edge-cloud collaborative computing (ECCC) system based on the energy-efficient field-programmable gate array (FPGA)-based CNN accelerators for the purpose of realizing a low-latency and low-power FT system.
Abstract: Convolutional neural networks (CNNs) have become the critical technology to realize face detection and face recognition in the face tracking (FT) system. However, traditional CNNs usually have nontrivial computational time and high energy consumption, making them inappropriate to be deployed in the large-scale time-sensitive FT system. To address this challenge, we design an artificial intelligence and Internet of Things (AIoT) empowered edge-cloud collaborative computing (ECCC) system based on the energy-efficient field-programmable gate array (FPGA)-based CNN accelerators for the purpose of realizing a low-latency and low-power FT. First, we present the AIoT-empowered ECCC system architecture, which consists of an intelligent computing subsystem, an Internet-of-Things (IoT) subsystem, an edge-cloud collaborative subsystem, and an application subsystem. In what follows, we investigate the enabling technologies for these subsystems. Thereafter, we develop an FPGA-based hardware accelerator dedicated to the compact MobileNet CNN by using the hardware design techniques, such as systolic array, matrix tiling, fixed-point precision, and parallelism. Furthermore, we integrate the FPGA accelerators with CPUs and GPUs to build a context-aware CPU/GPU/FPGA heterogeneous computing system. Finally, we implement a delay-aware energy-efficient scheduling algorithm dedicated to this heterogeneous system. With the above hardware and software codesign mechanism, the energy cost and execution time of CNNs can be decreased significantly. The real-world experiments on the CPU/GPU/FPGA-based ECCC system proved the effectiveness of the proposed schemes in reducing the latency and improving the power efficiency of the FT system.

Journal ArticleDOI
TL;DR: In this article, the authors proposed two different field-programmable gate array (FPGA)-based EdDSA implementations, i.e., efficient and high-performance Ed25519 architectures applicable for a security level comparable to AES-128.
Abstract: This article presents highly optimized implementations of the Ed25519 digital signature algorithm [Edwards-curve digital signature algorithm (EdDSA)]. This algorithm significantly improves the execution time without sacrificing security, compared to existing digital signature algorithms. Although EdDSA is employed in many widely used protocols, such as TLS and SSH, there appear to be extremely few hardware implementations that focus only on EdDSA. Hence, we propose two different field-programmable gate array (FPGA)-based EdDSA implementations, i.e., efficient and high-performance Ed25519 architectures applicable for a security level comparable to AES-128. Our proposed efficient Ed25519 scheme achieves an improvement of more than 84% compared to the best previous work by reducing the required area. It also incorporates a more than 8× speedup. Furthermore, our proposed high-performance architecture shows a 21× speedup with more than 6200 digital signatures per second, showing a significant improvement in terms of utilized area × time on a Xilinx Zynq-7020 FPGA. Finally, effective side-channel countermeasures are embedded in our proposed designs, which also outperform the previous works.

Journal ArticleDOI
TL;DR: This work proposes a high performance hardware architecture for NewHope key exchange, and achieves more than 4.8 times better in terms of area-time product compared to previous results of hardware implementation of NewHope-Simple from Oder and Guneysu at Latin-crypt 2017.
Abstract: Lattice-based cryptography is a highly promising candidate for protection against the threat of quantum attacks. At Usenix Security 2016, Alkim, Ducas, Popplemann, and Schwabe proposed a post-quantum key exchange scheme called NewHope, based on a variant of the lattice problem, the ring-learning-with-errors (RLWE) problem. In this work, we propose a high-performance hardware architecture for NewHope. Our implementation requires 6,680 slices, 9,412 FFs, 18,756 LUTs, 8 DSPs and 14 BRAMs on a Xilinx Zynq-7000 equipped with a 28 nm Artix-7 7020 FPGA. In our hardware design of NewHope key exchange, the three phases of key exchange cost 51.9, 78.6 and 21.1 μs, respectively. It achieves a more than 4.8 times better area-time product compared to the previous hardware implementation of NewHope-Simple from Oder and Guneysu at Latincrypt 2017.
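The computational core that such RLWE hardware accelerates is negacyclic polynomial multiplication via the number-theoretic transform (NTT). The toy sketch below uses NewHope's modulus q = 12289 but a scaled-down ring size (n = 8 instead of 1024) and naive O(n²) transforms for clarity; real hardware pipelines O(n log n) butterflies.

```python
# Negacyclic NTT multiplication in Z_q[x]/(x^n + 1), toy parameters.
q = 12289  # NewHope's modulus; q - 1 = 2^12 * 3
n = 8      # scaled down from NewHope's n = 1024 for illustration

# Find a generator of Z_q^* (its order q-1 has prime factors 2 and 3),
# then derive a primitive 2n-th root of unity psi.
g = next(g for g in range(2, q)
         if pow(g, (q - 1) // 2, q) != 1 and pow(g, (q - 1) // 3, q) != 1)
psi = pow(g, (q - 1) // (2 * n), q)
omega = psi * psi % q  # primitive n-th root of unity

def ntt(a, root, weight):
    # Weighted DFT: out[k] = sum_j a[j] * weight^j * root^(j*k) mod q
    return [sum(a[j] * pow(weight, j, q) * pow(root, j * k, q)
                for j in range(n)) % q for k in range(n)]

def poly_mul_ntt(a, b):
    # Pointwise product in the NTT domain == negacyclic convolution in time.
    hc = [x * y % q for x, y in zip(ntt(a, omega, psi), ntt(b, omega, psi))]
    inv_n = pow(n, q - 2, q)
    inv_psi = pow(psi, q - 2, q)
    c = ntt(hc, pow(omega, q - 2, q), 1)
    return [x * inv_n % q * pow(inv_psi, j, q) % q for j, x in enumerate(c)]
```

The ψ-weighting folds the reduction modulo xⁿ + 1 into the transform itself, so no explicit polynomial reduction step is needed; this is the structure the hardware's butterfly units exploit.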

Journal ArticleDOI
TL;DR: To the best of our knowledge, this work is the first to explore the possibility of compressing networks for radio frequency fingerprinting; as such, the experiments can be seen as a means of characterizing the informational capacity associated with this specific learning task.
Abstract: Deep learning methods have been very successful at radio frequency fingerprinting tasks, predicting the identity of transmitting devices with high accuracy. We study radio frequency fingerprinting deployments at resource-constrained edge devices. We use structured pruning to jointly train and sparsify neural networks tailored to edge hardware implementations. We compress convolutional layers by a 27.2x factor while incurring a negligible prediction accuracy decrease (less than 1%). We demonstrate the efficacy of our approach over multiple edge hardware platforms, including a Samsung Galaxy S10 phone and a Xilinx ZCU104 FPGA. Our method yields significant inference speedups, 11.5x on the FPGA and 3x on the smartphone, as well as high efficiency: the FPGA processing time is 17x smaller than on a V100 GPU. To the best of our knowledge, we are the first to explore the possibility of compressing networks for radio frequency fingerprinting; as such, our experiments can be seen as a means of characterizing the informational capacity associated with this specific learning task.
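Structured pruning, as opposed to unstructured weight pruning, removes whole filters so the surviving network stays dense and hardware-friendly. A minimal sketch of the filter-selection step (the L1-norm criterion and helper names are illustrative assumptions; the paper prunes jointly with training, which this does not show):

```python
# Structured (filter-level) pruning sketch: rank whole filters by L1 norm
# and keep only the strongest fraction, so remaining compute stays dense.

def prune_filters(filters, keep_ratio):
    """Keep the keep_ratio fraction of filters with the largest L1 norm."""
    norms = [(sum(abs(w) for w in f), i) for i, f in enumerate(filters)]
    norms.sort(reverse=True)          # strongest filters first
    keep = max(1, int(len(filters) * keep_ratio))
    kept_idx = sorted(i for _, i in norms[:keep])
    return [filters[i] for i in kept_idx], kept_idx
```

With keep_ratio = 1/27.2 per layer, this style of pruning would yield the paper's reported 27.2x compression while leaving regular dense kernels for the FPGA to execute.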

Proceedings ArticleDOI
01 Feb 2021
TL;DR: In this paper, the authors investigated the hardware acceleration of BERT on FPGA for edge computing and proposed an accelerator tailored for the fully quantized BERT, evaluated on Xilinx ZCU102 and ZCU111 FPGAs.
Abstract: BERT is the most recent Transformer-based model that achieves state-of-the-art performance in various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGA for edge computing. To tackle the issue of huge computational complexity and memory footprint, we propose to fully quantize the BERT (FQ-BERT), including weights, activations, softmax, layer normalization, and all the intermediate results. Experiments demonstrate that the FQ-BERT can achieve 7.94× compression for weights with negligible performance loss. We then propose an accelerator tailored for the FQ-BERT and evaluate it on Xilinx ZCU102 and ZCU111 FPGAs. It can achieve a performance-per-watt of 3.18 fps/W, which is 28.91× and 12.72× over an Intel(R) Core(TM) i7-8700 CPU and an NVIDIA K80 GPU, respectively.
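The basic building block of full quantization is a linear quantize/dequantize round trip. A hedged sketch (symmetric per-tensor quantization shown here is one common choice; the paper's exact scheme, bit widths, and treatment of softmax/layer norm are not reproduced):

```python
# Symmetric linear quantization to signed integers, the kind of fixed-point
# mapping fully-quantized Transformer accelerators apply to weights and
# activations. Bit widths are illustrative.

def quantize(x, bits):
    qmax = 2 ** (bits - 1) - 1
    m = max(abs(v) for v in x)
    scale = m / qmax if m else 1.0
    # Round to the nearest representable integer, clamped to the range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

The reconstruction error is bounded by one quantization step (the scale), which is why 8-bit weights typically cost negligible accuracy while shrinking storage roughly 4x versus FP32.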

Proceedings ArticleDOI
17 Feb 2021
TL;DR: The Stratix 10 NX device as mentioned in this paper is a variant of FPGA specifically optimized for the AI application space, which provides the dense arrays of low precision multipliers typically used in AI implementations.
Abstract: The advent of AI has driven the adoption of high density low precision arithmetic on FPGAs. This has resulted in new methods in mapping both arithmetic functions as well as dataflows onto the fabric, as well as some changes to the embedded DSP Blocks. Technologies outside of the FPGA realm have also evolved, such as the addition of tensor structures for GPUs, and also the introduction of numerous AI ASSPs, all of which have a higher claimed performance and efficiency than current FPGAs. In this paper we will introduce the Stratix 10 NX device (NX), which is a variant of FPGA specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft logic fabric, a new type of DSP Block provides the dense arrays of low precision multipliers typically used in AI implementations. The architecture of the block is tuned for the common matrix-matrix or vector-matrix multiplications in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared-exponent support for block floating-point FP16 and FP12 numerics. All additions/accumulations can be done in INT32 or IEEE754 single precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We will also describe methods by which the smaller precision multipliers can be aggregated to create larger multipliers that are more applicable to standard signal processing requirements. In terms of overall compute throughput, Stratix 10 NX achieves 143 INT8/FP16 TOPs/FLOPs, or 286 INT4/FP12 TOPs/FLOPs at 600MHz. Depending on the configuration, power efficiency is in the range of 1-4 TOPs or TFLOPs/W.
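The aggregation of small multipliers into a larger one works by splitting each operand into high and low halves and recombining four small partial products with shifts. A sketch (the 8-to-16-bit widths are illustrative; the NX block's exact composition scheme is not reproduced here):

```python
# Building a 16x16 unsigned multiply from four 8x8 partial products,
# the classic decomposition behind aggregating small DSP multipliers:
#   a*b = (aH*2^8 + aL) * (bH*2^8 + bL)
#       = aH*bH*2^16 + (aH*bL + aL*bH)*2^8 + aL*bL

def mul16_from_mul8(a, b):
    assert 0 <= a < 1 << 16 and 0 <= b < 1 << 16
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    return ((a_hi * b_hi << 16) + (a_hi * b_lo << 8)
            + (a_lo * b_hi << 8) + a_lo * b_lo)
```

In hardware, the shifted additions map onto the block's accumulation/cascade paths, which is what lets an INT8-optimized block still serve wider signal-processing multiplies.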

Proceedings ArticleDOI
21 May 2021
TL;DR: In this article, a reconfigurable YOLOv3 FPGA hardware accelerator based on the AXI bus ARM+FPGA architecture is proposed to detect small targets.
Abstract: The emergence of YOLOv3 makes it possible to detect small targets. Due to the characteristics of the YOLO network itself, the YOLOv3 network has exceptionally high requirements for computing power and memory bandwidth, and it usually needs to be deployed on a dedicated hardware acceleration platform. FPGAs are logically reconfigurable hardware chips with substantial advantages in terms of performance and power consumption, making them a good choice for deploying a deep convolutional network. In this paper, we propose a reconfigurable YOLOv3 FPGA hardware accelerator based on the AXI-bus ARM+FPGA architecture. The YOLOv3 network is quantized through Vitis AI, and a series of operations such as model compression and data pre-processing save accelerator chip resources and external storage access time. Pipelined operation enables the FPGA to achieve higher throughput. Compared with the GPU implementation of the YOLOv3 model, the hardware implementation of the FPGA-based YOLOv3 accelerator has lower energy consumption and can achieve higher throughput.
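The convolution loop that such an accelerator pipelines can be sketched in a few lines. This toy model (tile size and loop structure are illustrative assumptions, not the paper's design) shows the sliding-window multiply-accumulate with output rows processed in tiles, the unit of work a burst from external memory would cover:

```python
# Toy tiled 2D sliding-window multiply-accumulate ("valid" convolution
# without kernel flipping), the core loop a YOLO FPGA accelerator pipelines.

def conv2d_valid(img, ker):
    kh, kw = len(ker), len(ker[0])
    oh = len(img) - kh + 1
    ow = len(img[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    TILE = 2  # output rows per burst from external memory (illustrative)
    for r0 in range(0, oh, TILE):
        for r in range(r0, min(r0 + TILE, oh)):
            for c in range(ow):
                acc = 0
                for i in range(kh):
                    for j in range(kw):
                        acc += img[r + i][c + j] * ker[i][j]
                out[r][c] = acc
    return out
```

On the FPGA, the two inner kernel loops unroll into parallel multiply-accumulate units, and the row tiling bounds the on-chip buffer size.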

Journal ArticleDOI
TL;DR: A frequency-tracker-based sliding-scale technique and a moving-average filter to improve the linearity and resolution of a multichannel field-programmable gate array (FPGA)-based time-to-digital converter (TDC) and its calibration techniques are presented.
Abstract: A multichannel field-programmable gate array (FPGA)-based time-to-digital converter (TDC) and its calibration techniques are presented. Herein, a frequency-tracker-based sliding-scale technique and a moving-average filter to improve the linearity and resolution are proposed. The error calibration technique automatically detects and corrects conversion errors caused by variations and mismatches in the propagation delays. The gain calibration extracts the average bin width of the fine TDC and resolves any linearity degradation in the coarse/fine interpolation architecture. The proposed techniques were applied to a four-channel TDC design implemented on a Xilinx Artix-7 FPGA. The measured differential and integral nonlinearities of all channels were within 0.51 least significant bit of 4.88 ps. The root-mean-squared resolution of the output code was 2.90–8.03 ps across a wide input range of 350 $\mu \text{s}$ .
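Two of the abstract's ideas are easy to sketch in software: the moving-average filter over output codes, and histogram-based (code-density) estimation of bin widths, which is the standard statistical route to the average-bin-width gain calibration described. The numbers and helper names below are illustrative, not the paper's implementation:

```python
# Sketch of TDC post-processing: a moving-average filter over output codes,
# and code-density bin-width estimation for gain calibration.

def moving_average(codes, w):
    # Simple length-w sliding mean; trades input bandwidth for resolution.
    return [sum(codes[i:i + w]) / w for i in range(len(codes) - w + 1)]

def bin_widths_ps(histogram, window_ps):
    # Code-density test: with uniformly distributed hits over a window of
    # window_ps picoseconds, each bin's width is proportional to its count.
    total = sum(histogram)
    return [window_ps * h / total for h in histogram]
```

Dividing the measurement window by the total hit count gives the average bin width the gain calibration needs, while per-bin widths expose the delay-line mismatches behind the DNL/INL figures.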

Journal ArticleDOI
TL;DR: The hybrid companion circuit modeling method and the compact-electromagnetic transient program (EMTP) algorithm are proposed in this article to make sure that the simulation loop of the SST can complete in less than $1~\mu \text{s}$ .
Abstract: The switching frequency of power electronic devices has become much higher than before, which brings great challenges to real-time simulators. The small time step in high switching frequency simulations remarkably increases the difficulty to satisfy the real-time requirement. This article realizes the accurate real-time simulation of a solid-state transformer (SST) with a switching frequency of 50 kHz at the time step of 250 ns on a field-programmable gate array (FPGA)-based platform. The hybrid companion circuit modeling method and the compact-electromagnetic transient program (EMTP) algorithm are proposed in this article to make sure that the simulation loop can complete in less than $1~\mu \text{s}$ . The hybrid companion circuit modeling method avoids using an excessive number of bits to represent simulation variables. The compact-EMTP algorithm is designed by combining the sequential computation tasks of the traditional EMTP to fully utilize the parallelized hardware structure of the FPGA. Besides, a circuit partition method is adopted to further parallelize the circuit solution of the SST circuit. In these ways, the simulation loop of the SST can be completed in 38 clock cycles (about 237.5 ns). The simulation results show that the real-time simulation waveforms are almost consistent with those of the off-line simulation software. Besides, the hardware-in-the-loop (HIL) simulation can also be performed on this platform to test the control functions of the SST.
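The companion-circuit idea underlying EMTP-style solvers is worth a one-node example: with backward Euler, a capacitor C at time step dt becomes a conductance G = C/dt in parallel with a history current G·v_prev, so each step reduces to solving a purely resistive network. This sketch uses a single RC node with illustrative values, not the paper's SST model:

```python
# Companion-circuit (resistive equivalent) time stepping for a series
# Vs -- R -- node -- C -- gnd circuit, discretized by backward Euler.
# KCL at the node: (Vs - v)/R = C dv/dt
#   => v_n * (1/R + C/dt) = Vs/R + (C/dt) * v_{n-1}

def simulate_rc(Vs, R, C, dt, steps):
    G_r = 1.0 / R
    G_c = C / dt             # capacitor companion conductance
    v = 0.0
    trace = []
    for _ in range(steps):
        i_hist = G_c * v     # capacitor history current source
        v = (Vs * G_r + i_hist) / (G_r + G_c)
        trace.append(v)
    return trace
```

Because every step is the same resistive solve with updated history sources, the per-step work maps naturally onto a fixed FPGA pipeline with a deterministic cycle count, which is what makes the 38-cycle simulation loop possible.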

Journal ArticleDOI
TL;DR: The proposed OMNI framework is a framework for accelerating sparse CNNs on hardware accelerators that uses hardware amenable on-chip memory partition patterns to seamlessly engage the software CNN model compression and hardware CNN acceleration.
Abstract: Convolutional neural networks (CNNs), one of today’s main flavors of deep learning techniques, dominate various image recognition tasks. As the model size of modern CNNs continues to grow, neural network compression techniques have been proposed to prune the redundant neurons and synapses. However, prior techniques disconnect the software neural networks compression and hardware acceleration, which fail to balance multiple design parameters, including sparsity, performance, hardware area cost, and efficiency. More concretely, prior unstructured pruning techniques achieve high sparsity at the expense of extra performance overhead, while prior structured pruning techniques relying on strict sparse patterns lead to low sparsity and extra hardware cost. In this article, we propose OMNI, a framework for accelerating sparse CNNs on hardware accelerators. The innovation of OMNI stems from that it uses hardware amenable on-chip memory partition patterns to seamlessly engage the software CNN model compression and hardware CNN acceleration. To accelerate the compute-intensive convolution kernel, a promising hardware optimization approach is memory partition, which divides the original weight kernels into several groups so that the different hardware processing elements can simultaneously access the weights. We exploit the memory partition patterns including block, cyclic, or hybrid as a means of CNN compression patterns. Our software CNN model compression balances the sparsity across different groups and our hardware accelerator employs hardware parallelization coordinately with the sparse patterns, leading to a desirable compromise between sparsity and performance. We further develop performance models to help the designers to quickly identify the pattern factors subject to an area constraint. Last, we evaluate our design on application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) platforms.
Experiments demonstrate that OMNI achieves $3.4\times $ – $6.2\times $ speedup for the modern CNNs, over a comparably ideal dense CNN accelerator. OMNI shows $114.7\times $ energy efficiency improvement compared with GPU platform. OMNI is also evaluated on Xilinx ZC706 and ZCU102 FPGA platforms, achieving 41.5 GOP/s and 125.3 GOP/s, respectively.
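The block and cyclic memory partition patterns the abstract reuses as pruning patterns have simple index mappings: block partitioning gives each bank a contiguous range of addresses, while cyclic partitioning interleaves consecutive addresses across banks. A minimal sketch (bank counts are illustrative):

```python
# Index-to-bank mappings for the two classic on-chip memory partition
# patterns: block (contiguous ranges per bank) vs cyclic (interleaved).

def block_partition(n, banks):
    size = (n + banks - 1) // banks   # ceil(n / banks) addresses per bank
    return [i // size for i in range(n)]

def cyclic_partition(n, banks):
    return [i % banks for i in range(n)]
```

Pruning so that surviving weights are balanced across banks under one of these mappings is what lets the processing elements fetch one weight per bank per cycle with no access conflicts.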

Journal ArticleDOI
TL;DR: This paper proposes field-programmable gate array (FPGA) acceleration of a scalable multi-layer perceptron (MLP) neural network for classifying handwritten digits; results show a greater than $10\times $ speedup compared with prior implementations.
Abstract: This paper proposes field-programmable gate array (FPGA) acceleration of a scalable multi-layer perceptron (MLP) neural network for classifying handwritten digits. First, an investigation of the network architectures is conducted to find the optimal FPGA design corresponding to different classification rates. As a case study, a specific single-hidden-layer MLP network is then implemented with an eight-stage pipelined structure on a Xilinx UltraScale FPGA. It mainly contains a timing controller designed in Verilog Hardware Description Language (HDL) and sigmoid neurons integrated via Xilinx IPs. Finally, experimental results show a greater than $10\times $ speedup compared with prior implementations. The proposed FPGA architecture is expandable to other specifications with different accuracy (up to 95.82%) and hardware cost.
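FPGA sigmoid neurons are commonly realized as lookup tables rather than exact exponentials. The sketch below shows one common approach, a tabulated sigmoid with linear interpolation; the input range and table size are illustrative assumptions, not the Xilinx IP's internals:

```python
# Lookup-table sigmoid with linear interpolation, the kind of approximation
# an FPGA neuron uses instead of computing exp() directly.
import math

LO, HI, N = -8.0, 8.0, 256       # illustrative range and table size
STEP = (HI - LO) / N
TABLE = [1.0 / (1.0 + math.exp(-(LO + i * STEP))) for i in range(N + 1)]

def sigmoid_lut(x):
    if x <= LO:
        return 0.0
    if x >= HI:
        return 1.0
    t = (x - LO) / STEP
    i = int(t)
    frac = t - i
    # Linear interpolation between adjacent table entries.
    return TABLE[i] * (1 - frac) + TABLE[i + 1] * frac
```

A 256-entry table with interpolation keeps the worst-case error well below typical fixed-point quantization noise, so classification accuracy is essentially unaffected.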

Journal ArticleDOI
TL;DR: Comprehensive benchmarks of accuracy, run-time, and energy efficiency of a wide range of vision kernels and neural networks are conducted on multiple embedded platforms: an ARM Cortex-A57 CPU, an Nvidia Jetson TX2 GPU, and a Xilinx ZCU102 FPGA.

Proceedings ArticleDOI
05 Dec 2021
TL;DR: Bambu as mentioned in this paper is an open-source high-level synthesis (HLS) research framework based on C/C++ specifications and compiler intermediate representation (IRs) coming from the well-known Clang/LLVM and GCC compilers.
Abstract: This paper presents the open-source high-level synthesis (HLS) research framework Bambu. Bambu provides a research environment to experiment with new ideas across HLS, high-level verification and debugging, FPGA/ASIC design, design flow space exploration, and parallel hardware accelerator design. The tool accepts as input standard C/C++ specifications and compiler intermediate representations (IRs) coming from the well-known Clang/LLVM and GCC compilers. The broad spectrum and flexibility of input formats allow the electronic design automation (EDA) research community to explore and integrate new transformations and optimizations. The easily extendable modular framework already includes many optimizations and HLS benchmarks used to evaluate the QoR of the tool against existing approaches [1]. The integration with synthesis and verification backends (commercial and open-source) allows researchers to quickly test any new finding and easily obtain performance and resource usage metrics for a given application. Different FPGA devices are supported from several different vendors: AMD/Xilinx, Intel/Altera, Lattice Semiconductor, and NanoXplore. Finally, integration with the OpenROAD open-source end-to-end silicon compiler perfectly fits with the recent push towards open-source EDA.