
Showing papers on "Field-programmable gate array published in 2022"


Journal ArticleDOI
TL;DR: A power-efficient universal asynchronous receiver transmitter (UART) is implemented on a 28 nm Artix-7 field-programmable gate array (FPGA) to reduce the power consumption of UART-based FPGA designs in industrial use.
Abstract: The world currently faces a power shortage, caused by the expansion of industry and rapid population growth, which can become a vulnerability for communication security. To save power for coming generations, the globe is moving toward the concepts of green communication and power-/energy-efficient devices. The proposed research is a step toward realizing these green communication concepts: to improve energy efficiency in communication networks, we designed a UART on FPGAs fabricated in different nanometer technologies and identified the option that consumes the least energy. In this work, a power-efficient universal asynchronous receiver transmitter (UART) is implemented on a 28 nm Artix-7 field-programmable gate array (FPGA). The objective is to reduce the power utilization of UART-based FPGA designs in industry. To achieve this, the authors applied voltage scaling techniques and compared the results with existing FPGA works.

77 citations


Journal ArticleDOI
TL;DR: This research work compares timing and performance when two FPGAs are used to implement the AES architecture, and finds that the Spartan-6 FPGA provides better throughput and lower delay for FPGA-based IoT devices.

50 citations


Journal ArticleDOI
TL;DR: In this article, a memristive Hopfield neural network (MHNN) with a special activation gradient is proposed by adding a suitable memristor to the Hopfield neural network (HNN).
Abstract: A memristive Hopfield neural network (MHNN) with a special activation gradient is proposed by adding a suitable memristor to a Hopfield neural network (HNN) with a special activation gradient. The MHNN is simulated, dynamically analyzed, and implemented on an FPGA. Then, a new pseudo-random number generator (PRNG) based on the MHNN is proposed. The post-processing unit of the PRNG is composed of a nonlinear post-processor and an XOR calculator, which effectively ensures the randomness of the PRNG. The experiments in this paper comply with the IEEE 754-1985 32-bit single-precision floating-point standard and are done in the Vivado design tool using a Xilinx XC7Z020CLG400-2 FPGA chip and the Verilog HDL hardware description language. The random sequence generated by the proposed PRNG has passed the NIST SP800-22 test suite and security analysis, proving its randomness and high performance. Finally, an image encryption system based on the PRNG is proposed and implemented on an FPGA, which demonstrates the value of the image encryption system in the field of data encryption for the Internet of Things (IoT).
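
The abstract does not reproduce the MHNN equations, so as a rough illustration of the PRNG pipeline it describes (chaotic state update, quantization to words, nonlinear/XOR post-processing), the sketch below substitutes a generic logistic map for the memristive Hopfield network. The map, bit width, and XOR post-processing step are illustrative assumptions, not the paper's design.

```python
import numpy as np

def chaotic_prng(n_words, seed=0.123456, r=3.99):
    """Toy PRNG: iterate a logistic map (a stand-in for the MHNN state),
    quantize each state to a 32-bit word, and XOR consecutive words as a
    simple post-processing step to whiten the output."""
    x = seed
    raw = np.empty(n_words + 1, dtype=np.uint32)
    for i in range(n_words + 1):
        x = r * x * (1.0 - x)              # chaotic iteration
        raw[i] = np.uint32(x * 2**32)      # quantize the state to 32 bits
    return raw[1:] ^ raw[:-1]              # XOR post-processor

if __name__ == "__main__":
    print([hex(int(w)) for w in chaotic_prng(8)])
```

A real design would of course be judged by the NIST SP800-22 suite mentioned in the abstract, not by inspection.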

39 citations


Journal ArticleDOI
TL;DR: In this article, a power converter with a virtual MPC controller is first designed and operated under a circuit simulation or power hardware-in-the-loop simulation environment, and an artificial neural network (ANN) is then trained offline with the input and output data of the virtual MPC controller.
Abstract: There has been an increasing interest in using model predictive control (MPC) for power electronic applications. However, the exponential increase in computational complexity and demand for computing resources hinders the practical adoption of this highly promising control technique. In this article, a new MPC approach using an artificial neural network (termed ANN-MPC) is proposed to overcome these barriers. A power converter with a virtual MPC controller is first designed and operated under a circuit simulation or power hardware-in-the-loop simulation environment. An artificial neural network (ANN) is then trained offline with the input and output data of the virtual MPC controller. Next, an actual FPGA-based MPC controller is designed using the trained ANN instead of relying on heavy-duty mathematical computation to control the actual operation of the power converter in real time. The ANN-MPC approach can significantly reduce the computing requirements and allow the use of more accurate high-order system models thanks to the simple mathematical expression of the ANN. Furthermore, the ANN-MPC approach can retain robustness against system parameter uncertainties by flexibly setting the input elements. The basic concept, ANN structure, offline training method, and online operation of ANN-MPC are described in detail. The computing resource requirements of ANN-MPC and conventional MPC are analyzed and compared. The ANN-MPC concept is validated by both simulation and experimental results on two kW-class flying capacitor multilevel converters. It is demonstrated that the FPGA-based ANN-MPC controller can significantly reduce the FPGA resource requirement (e.g., 2.11 times fewer slice LUTs and 2.06 times fewer DSPs) while offering the same control performance as the conventional MPC.
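
The article trains an ANN offline on input/output data logged from a virtual MPC controller and then deploys the network in place of the online optimization. As a hedged sketch of that idea (imitation learning of a controller), the snippet below fits a tiny one-hidden-layer network to synthetic (state → duty cycle) pairs; the fake dataset, network size, and training loop are illustrative assumptions, not the paper's converter model or ANN structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these pairs were logged from a virtual MPC controller in simulation:
# inputs = [voltage error, inductor current], output = commanded duty cycle.
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = np.clip(0.5 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.05 * X[:, 0] * X[:, 1], 0, 1)

# Tiny 1-hidden-layer MLP trained by plain gradient descent to imitate the policy.
W1 = rng.normal(0, 0.5, (2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)
lr = 0.05

for epoch in range(500):
    h = np.tanh(X @ W1 + b1)                     # hidden layer
    pred = (h @ W2 + b2).ravel()                 # predicted duty cycle
    err = pred - y
    # backpropagation of the mean-squared imitation error
    gW2 = h.T @ err[:, None] / len(X); gb2 = err.mean(keepdims=True)
    gh = err[:, None] * W2.T * (1 - h ** 2)
    gW1 = X.T @ gh / len(X);           gb1 = gh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

print("final imitation MSE:", float(np.mean(err ** 2)))
```

Once trained, the forward pass (two small matrix-vector products and a tanh) is what would be mapped onto the FPGA instead of the MPC optimization.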

38 citations


Journal ArticleDOI
TL;DR: This survey explores current architectures, discusses scalability and abstractions supported by operating systems, middleware, and virtualization, and reviews the viability of these architectures for popular applications, with a particular focus on deep learning and scientific computing.
Abstract: In this article, we survey existing academic and commercial efforts to provide Field-Programmable Gate Array (FPGA) acceleration in datacenters and the cloud. The goal is a critical review of existing systems and a discussion of their evolution from single workstations with PCI-attached FPGAs in the early days of reconfigurable computing to the integration of FPGA farms in large-scale computing infrastructures. From the lessons learned, we discuss the future of FPGAs in datacenters and the cloud and assess the challenges likely to be encountered along the way. The article explores current architectures and discusses scalability and abstractions supported by operating systems, middleware, and virtualization. Hardware and software security becomes critical when infrastructure is shared among tenants with disparate backgrounds. We review the vulnerabilities of current systems and possible attack scenarios and discuss mitigation strategies, some of which impact FPGA architecture and technology. The viability of these architectures for popular applications is reviewed, with a particular focus on deep learning and scientific computing. This work draws from workshop discussions, panel sessions including the participation of experts in the reconfigurable computing field, and private discussions among these experts. These interactions have harmonized the terminology, taxonomy, and the important topics covered in this manuscript.

35 citations


Journal ArticleDOI
TL;DR: The ring NoC design concept and its simulation in Xilinx ISE 14.7, as well as the communication of functional nodes, are discussed, and the NoC performance is evaluated in terms of hardware and timing parameters for FPGA synthesis.
Abstract: The network-on-chip (NoC) technology is frequently referred to as a front-end solution to a back-end problem. The physical substructure that transfers data on the chip and ensures the quality of service begins to collapse as semiconductor transistor dimensions shrink and growing numbers of intellectual property (IP) blocks working together are integrated into a chip. Today's system-on-chip (SoC) architectures are so complex that the crossbar and the traditional hierarchical bus architecture are no longer adequate. NoC connectivity reduces the amount of hardware required for routing and functions, allowing SoCs with NoC interconnect fabrics to operate at higher frequencies. The ring (octagon) is a direct NoC that is specifically used to solve the scalability problem by expanding each node in the shape of an octagon. This paper discusses the ring NoC design concept and its simulation in Xilinx ISE 14.7, as well as the communication of functional nodes. For field-programmable gate array (FPGA) synthesis, the performance of the NoC is evaluated in terms of hardware and timing parameters. The design allows 64- to 256-node communication in a single chip with 'N'-bit data transfer in the ring NoC. The performance of the NoC is evaluated with variable node counts from 2 to 256 on Digilent-manufactured Virtex-5 FPGA hardware.

31 citations


Proceedings ArticleDOI
10 Jul 2022
TL;DR: This paper targets the field-programmable gate array and proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration and develops a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Abstract: Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite the remarkable triumphs, the prolonged turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead, since inputs need to be zero-padded to the maximum sentence length in the batch to accommodate parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration. Particularly, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator brings the complexity of attention-based models down to linear complexity and alleviates off-chip memory traffic. The proposed length-aware hardware resource scheduling algorithm dynamically allocates hardware resources to fill the pipeline slots and eliminates bubbles for NLP tasks. Experiments show that our design has very small accuracy loss and achieves 80.2× and 2.6× speedup compared to CPU and GPU implementations, and 4× higher energy efficiency than a state-of-the-art GPU accelerator optimized via cuBLAS GEMM.
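
The paper's sparse attention operator and resource scheduler are hardware-specific, but the padding overhead they target is easy to quantify: when a batch is padded to its longest sequence, most multiply-accumulate work can be wasted on zeros. The sketch below compares token utilization for one padded batch versus length-sorted groups; the bucketing strategy is a generic illustration, not the paper's scheduling algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
lengths = rng.integers(8, 512, size=64)           # sequence lengths in a batch

def padded_tokens(lens):
    """Tokens processed when every sequence is padded to the longest in its group."""
    return len(lens) * max(lens)

naive = padded_tokens(lengths)                    # one big zero-padded batch

# Length-aware grouping: sort by length, then split into small groups so that
# sequences of similar length share a padding target.
groups = np.array_split(np.sort(lengths), 8)
grouped = sum(padded_tokens(g) for g in groups)

useful = lengths.sum()
print(f"useful tokens       : {useful}")
print(f"single padded batch : {naive}  ({useful / naive:.0%} utilization)")
print(f"length-aware groups : {grouped}  ({useful / grouped:.0%} utilization)")
```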

30 citations



Journal ArticleDOI
TL;DR: A formula for the total measurement uncertainty of a single-stage TDL-TDC is derived to obtain its root-mean-square (RMS) resolution, and a much more detailed precision analysis for single-TDL TDCs is presented.
Abstract: The wave union (WU) method is a well-known method in time-to-digital converters (TDCs) and can improve TDC performances without consuming extra logic resources. However, an earlier study concluded that the WU method is not suitable for UltraScale field-programmable gate array (FPGA) devices, due to more severe bubble errors. This article proves otherwise and presents new strategies to pursue high-resolution TDCs in Xilinx UltraScale 20 nm FPGAs. Combining our new subtapped delay line (sub-TDL) architecture (effective in removing bubbles and zero-width bins) and the WU method, we found that the wave union method is still powerful in UltraScale devices. We also compared the proposed TDC with the TDCs combining the dual sampling structure and the sub-TDL technique. A binning method is introduced to improve the linearity. Moreover, we derived a formula of the total measurement uncertainties for a single-stage TDL-TDC to obtain its root-mean-square resolution. Compared with the previously published FPGA-TDCs, we presented (for the first time) much more detailed precision analysis for single-TDL TDCs.
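
The exact uncertainty formula is in the article; one standard ingredient, the quantization error of a non-uniform tapped delay line computed from code-density-measured bin widths, can be sketched as below. The bin widths here are synthetic, and the estimate covers only the quantization term (not clock jitter or other contributions the paper's total-uncertainty formula would include).

```python
import numpy as np

# Synthetic, non-uniform bin widths of a tapped delay line, in picoseconds
# (in practice these come from a code-density test).
rng = np.random.default_rng(2)
widths = rng.uniform(2.0, 30.0, size=128)
T = widths.sum()                              # total range covered by the line

# Quantization-only RMS error of a single measured edge: bin i is hit with
# probability w_i / T and contributes variance w_i^2 / 12.
sigma_q = np.sqrt(np.sum(widths ** 3) / (12.0 * T))

print(f"average bin width : {T / len(widths):.2f} ps")
print(f"quantization RMS  : {sigma_q:.2f} ps (single edge, quantization only)")
```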

23 citations


Journal ArticleDOI
TL;DR: In this paper, generic area-optimized, low-latency accurate and approximate soft-core multiplier architectures, which exploit the underlying architectural features of FPGAs, i.e., lookup table (LUT) structures and fast-carry chains, are presented.
Abstract: Multiplication is one of the widely used arithmetic operations in a variety of applications, such as image/video processing and machine learning. FPGA vendors provide high-performance multipliers in the form of DSP blocks. These multipliers are not only limited in number and have fixed locations on FPGAs but can also create additional routing delays and may prove inefficient for smaller bit-width multiplications. Therefore, FPGA vendors additionally provide optimized soft IP cores for multiplication. However, in this work, we advocate that these soft multiplier IP cores for FPGAs still need better designs to provide high performance and resource efficiency. Toward this, we present generic area-optimized, low-latency accurate and approximate soft-core multiplier architectures, which exploit the underlying architectural features of FPGAs, i.e., lookup table (LUT) structures and fast-carry chains, to reduce the overall critical path delay (CPD) and resource utilization of multipliers. Compared to the Xilinx multiplier LogiCORE IP, our proposed unsigned and signed accurate architectures provide up to 25% and 53% reduction in LUT utilization, respectively, for different sizes of multipliers. Moreover, with our unsigned approximate multiplier architectures, a reduction of up to 51% in the CPD can be achieved with an insignificant loss in output accuracy when compared with the LogiCORE IP. For illustration, we have deployed the proposed multiplier architectures in accelerators used in image and video applications and evaluated them for area and performance gains. Our library of accurate and approximate multipliers is open-source and available online at https://cfaed.tu-dresden.de/pd-downloads to fuel further research and development in this area, facilitate reproducible research, and thereby enable a new research direction for the FPGA community.
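
The proposed LUT- and carry-chain-level mappings are in the paper and its open-source library; purely as a behavioral illustration of what an "approximate multiplier" trades away, the sketch below drops the low-order bits of each operand before multiplying and measures the resulting error. This truncation scheme is a generic example, not the proposed architecture.

```python
import numpy as np

def approx_mul(a, b, drop_bits=4):
    """Approximate unsigned multiply: discard the low `drop_bits` bits of each
    operand, multiply the truncated values, and shift the product back.
    Roughly mimics skipping low-significance partial products in hardware."""
    return ((a >> drop_bits) * (b >> drop_bits)) << (2 * drop_bits)

rng = np.random.default_rng(3)
a = rng.integers(0, 2 ** 8, size=100_000, dtype=np.int64)
b = rng.integers(0, 2 ** 8, size=100_000, dtype=np.int64)

exact = a * b
approx = approx_mul(a, b)
rel_err = np.abs(exact - approx) / np.maximum(exact, 1)

print(f"mean relative error : {rel_err.mean():.4f}")
print(f"max  relative error : {rel_err.max():.4f}")
```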

22 citations


Journal ArticleDOI
01 Mar 2022
TL;DR: A novel pooling method, absolute average deviation (AAD), for CNN accelerators achieves 98% accuracy with lower computational and hardware costs than mixed pooling, making it an ideal pooling mechanism for IoT CNN accelerators.
Abstract: Convolutional neural network (CNN) hardware accelerators for specialized Internet of Things (IoT) applications requiring high accuracy are an emerging research topic. The pooling module in a CNN pipeline impacts both the speed and accuracy of a classification task. This work proposes the design and hardware implementation of a novel pooling method, absolute average deviation (AAD), for CNN accelerators. AAD utilizes the spatial locality of pixels through vertical and horizontal deviations to achieve higher accuracy, lower area, and lower power consumption than mixed pooling without increasing the computational complexity. AAD is tested on four different datasets, EEG, ImageNet, Common Objects in Context (COCO), and United States Postal Service (USPS), and on multiple CNN structures: CNN, VGG16, VGG19, ResNet, and DenseNet. In hardware, AAD is implemented using the Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL) on an Altera Arria 10 GX field-programmable gate array (FPGA) and in 45-nm technology using Synopsys Design Compiler. The area and power consumption are found to be 244.46 nm² and 0.31 mW, respectively. AAD achieves 98% accuracy with lower computational and hardware costs compared to mixed pooling, making it an ideal pooling mechanism for an IoT CNN accelerator.
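
The abstract says only that AAD uses vertical and horizontal deviations within a pooling window; the precise definition is in the paper. The sketch below therefore implements one plausible reading, the mean absolute deviation of each 2x2 window from its mean, just to show where such an operator sits relative to max/average pooling; it should not be read as the paper's exact formulation.

```python
import numpy as np

def aad_pool2x2(x):
    """Illustrative 'absolute average deviation' pooling: for each
    non-overlapping 2x2 window, output the mean absolute deviation of the
    four pixels from the window mean. (One possible reading of AAD, not
    necessarily the definition used in the paper.)"""
    h, w = x.shape
    x = x[: h - h % 2, : w - w % 2]                       # crop to even size
    blocks = x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(-1, 4)
    mean = blocks.mean(axis=1, keepdims=True)
    aad = np.abs(blocks - mean).mean(axis=1)
    return aad.reshape(x.shape[0] // 2, x.shape[1] // 2)

img = np.arange(36, dtype=float).reshape(6, 6)
print(aad_pool2x2(img))          # 3x3 map of per-window deviations
```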

Journal ArticleDOI
13 Jan 2022
TL;DR: In this article, an autonomously operating circuit that performs hardware-aware machine learning utilizing probabilistic neurons built with stochastic magnetic tunnel junctions is presented, and in situ learning of weights and biases is shown to counter device-to-device variations.
Abstract: One of the big challenges of current electronics is the design and implementation of hardware neural networks that perform fast and energy-efficient machine learning. Spintronics is a promising catalyst for this field with the capabilities of nanosecond operation and compatibility with existing microelectronics. Considering large-scale, viable neuromorphic systems however, variability of device properties is a serious concern. In this paper, we show an autonomously operating circuit that performs hardware-aware machine learning utilizing probabilistic neurons built with stochastic magnetic tunnel junctions. We show that in situ learning of weights and biases in a Boltzmann machine can counter device-to-device variations and learn the probability distribution of meaningful operations such as a full adder. This scalable autonomously operating learning circuit using spintronics-based neurons could be especially of interest for standalone artificial-intelligence devices capable of fast and efficient learning at the edge.
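
The circuit itself is analog/spintronic, but a widely used behavioral model of a probabilistic (p-bit) neuron built from a stochastic magnetic tunnel junction is m_i = sgn(tanh(I_i) - r) with r drawn uniformly from [-1, 1]. The sketch below samples a two-node Boltzmann machine with that model; the coupling and bias values are arbitrary illustrative numbers, not the paper's learned weights.

```python
import numpy as np

rng = np.random.default_rng(4)

def p_bit(I):
    """Behavioral p-bit: a binary output whose probability of +1 follows
    a sigmoid of the synaptic input I."""
    return 1 if rng.uniform(-1, 1) < np.tanh(I) else -1

# Tiny Boltzmann machine: two p-bits coupled by a positive weight J, plus biases h.
J = np.array([[0.0, 1.0],
              [1.0, 0.0]])
h = np.array([0.2, -0.1])
m = np.array([1, -1])

counts = {}
for step in range(20_000):
    i = step % 2                       # update the two neurons alternately
    I = J[i] @ m + h[i]                # synaptic input to neuron i
    m[i] = p_bit(I)
    state = tuple(int(v) for v in m)
    counts[state] = counts.get(state, 0) + 1

# Aligned states should dominate because J couples the two p-bits positively.
for state, c in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(state, c / 20_000)
```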

Journal ArticleDOI
TL;DR: In this paper, a three-dimensional chaotic system with line equilibrium is presented, and an image encryption algorithm based on pixel-level scrambling, bit-level scrambling, and pixel-value diffusion is proposed.
Abstract: This paper announces a novel three-dimensional chaotic system with line equilibrium and discusses its dynamic properties such as Lyapunov exponents, phase portraits, equilibrium points, bifurcation diagram, multistability and coexisting attractors. New synchronization results based on integral sliding mode control (ISMC) are also derived for the new chaotic system with line equilibrium. In addition, an electronic circuit implementation of the new chaotic system with line equilibrium is reported and a good qualitative agreement is exhibited between the MATLAB simulations of the theoretical model and the MultiSim results. We also display the implementation of the Field-Programmable Gate Array (FPGA) based Pseudo-Random Number Generator (PRNG) by using the new chaotic system. The throughput of the proposed FPGA based new chaotic PRNG is 462.731 Mbps. Randomness analysis of the generated numbers has been performed with respect to the NIST-800-22 tests and they have successfully passed all of the tests. Finally, an image encryption algorithm based on the pixel-level scrambling, bit-level scrambling, and pixel value diffusion is proposed. The experimental results show that the encryption algorithm not only shuffles the pixel positions of the image, but also replaces the pixel values with different values, which can effectively resist various attacks such as brute force attack and differential attack.
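
The paper's algorithm uses its 3-D chaotic system and a bit-level stage that are not reproduced here; as a loose illustration of the scramble-then-diffuse structure the abstract describes, the sketch below permutes pixel positions with a keyed permutation and XOR-diffuses the values with a chaotic keystream (a logistic map standing in for the new chaotic system).

```python
import numpy as np

def logistic_stream(n, x0=0.3141, r=3.99):
    """Keystream bytes from a logistic map (stand-in for the 3-D chaotic system)."""
    x, out = x0, np.empty(n, dtype=np.uint8)
    for i in range(n):
        x = r * x * (1.0 - x)
        out[i] = int(x * 256) % 256
    return out

def encrypt(img, key_seed=42):
    flat = img.flatten()
    perm = np.random.default_rng(key_seed).permutation(flat.size)  # pixel scrambling
    scrambled = flat[perm]
    cipher = scrambled ^ logistic_stream(flat.size)                # value diffusion
    return cipher.reshape(img.shape), perm

def decrypt(cipher, perm):
    scrambled = cipher.flatten() ^ logistic_stream(cipher.size)
    flat = np.empty_like(scrambled)
    flat[perm] = scrambled                                         # undo permutation
    return flat.reshape(cipher.shape)

img = np.arange(64, dtype=np.uint8).reshape(8, 8)
cipher, perm = encrypt(img)
assert np.array_equal(decrypt(cipher, perm), img)
print(cipher)
```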

Proceedings ArticleDOI
17 Feb 2022
TL;DR: MATCHA, as discussed by the authors, accelerates TFHE gates using approximate multiplication-less integer FFTs and IFFTs and a pipelined datapath to improve energy efficiency.
Abstract: Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to happen directly on ciphertexts using homomorphic logic gates. However, each TFHE gate on state-of-the-art hardware platforms such as GPUs and FPGAs is extremely slow (> 0.2ms). Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors by approximate multiplication-less integer FFTs and IFFTs, and a pipelined datapath. Compared to prior accelerators, MATCHA improves the TFHE gate processing throughput by 2.3x, and the throughput per Watt by 6.3x.
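
MATCHA's approximate, multiplication-less integer FFT is hardware-specific, but the reason TFHE gates live in FFT kernels is that bootstrapping repeatedly multiplies polynomials modulo X^N + 1. The sketch below shows that negacyclic product computed with an ordinary floating-point FFT plus the usual 2N-th-root "twist"; this is the standard building block such accelerators replace with cheaper integer arithmetic, not MATCHA's own datapath.

```python
import numpy as np

def negacyclic_mul_naive(a, b):
    """Schoolbook multiplication of polynomials a, b modulo X^N + 1."""
    N = len(a)
    c = np.zeros(N, dtype=np.int64)
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - N] -= a[i] * b[j]   # wrap-around picks up a minus sign
    return c

def negacyclic_mul_fft(a, b):
    """Same product via FFT: twisting by a 2N-th root of unity turns the
    negacyclic convolution into an ordinary cyclic convolution."""
    N = len(a)
    twist = np.exp(1j * np.pi * np.arange(N) / N)
    c = np.fft.ifft(np.fft.fft(a * twist) * np.fft.fft(b * twist)) / twist
    return np.rint(c.real).astype(np.int64)

rng = np.random.default_rng(5)
a = rng.integers(-50, 50, size=64)
b = rng.integers(-50, 50, size=64)
assert np.array_equal(negacyclic_mul_naive(a, b), negacyclic_mul_fft(a, b))
print("FFT-based negacyclic product matches the schoolbook result")
```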

Journal ArticleDOI
15 Sep 2022-Small
TL;DR: A touch-programmable metasurface based on touch sensing modules is proposed to realize various electromagnetic manipulations and encryptions and will have wide application prospects in imaging displays, wireless communications, and EM information encryptions.
Abstract: Previous programmable metasurfaces integrated with diodes or varactors require external instructions for field-programmable gate arrays (FPGAs), which usually rely on computer inputs or pre-loaded algorithms. The complicated external devices make the coding regulation process of programmable metasurfaces cumbersome and difficult to use. To simplify the process and provide a new interaction manner, a touch-programmable metasurface (TPM) based on touch sensing modules is proposed to realize various electromagnetic (EM) manipulations and encryptions. By simply touching the meta-units of the TPM, the state of the diodes can be changed. Through the touch controls, the TPM can achieve independent and direct manipulation of meta-units and efficient input of coding patterns without using an FPGA or other control modules. Various coding patterns are demonstrated to achieve diverse scattering-field control and flexible near-field EM information encryption, which verifies the feasibility of the TPM design. The presented TPM will have wide application prospects in imaging displays, wireless communications, and EM information encryption.

Proceedings ArticleDOI
28 Feb 2022
TL;DR: It is shown that a research group can design and build a more general, open, and affordable hardware platform for hybrid systems research, and Enzian is capable of duplicating the functionality of existing CPU/FPGA systems with comparable performance but in an open, flexible system.
Abstract: Hybrid computing platforms, comprising CPU cores and FPGA logic, are increasingly used for accelerating data-intensive workloads in cloud deployments, and are a growing topic of interest in systems research. However, from a research perspective, existing hardware platforms are limited: they are often optimized for concrete, narrow use-cases and, therefore lack the flexibility needed to explore other applications and configurations. We show that a research group can design and build a more general, open, and affordable hardware platform for hybrid systems research. The platform, Enzian, is capable of duplicating the functionality of existing CPU/FPGA systems with comparable performance but in an open, flexible system. It couples a large FPGA with a server-class CPU in an asymmetric cache-coherent NUMA system. Enzian also enables research not possible with existing hybrid platforms, through explicit access to coherence messages, extensive thermal and power instrumentation, and an open, programmable baseboard management processor. Enzian is already being used in multiple projects, is open source (both hardware and software), and available for remote use. We present the design principles of Enzian, the challenges in building it, and evaluate it with a range of existing research use-cases alongside other, more specialized platforms, as well as demonstrating research not possible on existing platforms.

Proceedings ArticleDOI
01 Mar 2022
TL;DR: Fast readouts enable real-time tracking of the DSP state, showing that coupled-core fibers are compatible with real-time DSP implementations.
Abstract: We perform parallel continuous measurements of deployed SDM fibers using real-time coherent receivers implemented on FPGAs. Fast readouts enable real-time tracking of the DSP state, showing that coupled-core fibers are compatible with real-time DSP implementations.

Journal ArticleDOI
TL;DR: The progress of the deployment of HLS technology is assessed and the successes in several application domains are highlighted, including deep learning, video transcoding, graph processing, and genome sequencing.
Abstract: The year 2011 marked an important transition for FPGA high-level synthesis (HLS), as it went from prototyping to deployment. A decade later, in this article, we assess the progress of the deployment of HLS technology and highlight the successes in several application domains, including deep learning, video transcoding, graph processing, and genome sequencing. We also discuss the challenges faced by today’s HLS technology and the opportunities for further research and development, especially in the areas of achieving high clock frequency, coping with complex pragmas and system integration, legacy code transformation, building on open source HLS infrastructures, supporting domain-specific languages, and standardization. It is our hope that this article will inspire more research on FPGA HLS and bring it to a new height.

Journal ArticleDOI
TL;DR: An ensemble of weak CNNs is used to build a robust classifier at low cost, effectively improving system reliability under soft errors with an overhead much lower than TMR.
Abstract: Convolutional neural networks (CNNs) are widely used in computer vision and natural language processing. Field-programmable gate arrays (FPGAs) are popular accelerators for CNNs. However, if used in critical applications, the reliability of FPGA-based CNNs becomes a priority because FPGAs are prone to suffer soft errors. Traditional protection schemes, such as triple modular redundancy (TMR), introduce a large overhead, which is not acceptable in resource-limited platforms. This article proposes to use an ensemble of weak CNNs to build a robust classifier with low cost. To have a group of base CNNs with low complexity and balanced similarity and diversity, residual neural networks (ResNets) with different layers (20/32/44/56) are combined in the ensemble system to replace a single strong ResNet 110. In addition, a robust combiner is designed based on the reliability evaluation of a single ResNet. Single ResNets with different layers and different ensemble schemes are implemented on the FPGA accelerator based on Xilinx Zynq 7000 SoC. The reliability of the ensemble systems is evaluated based on a large-scale fault injection platform and compared with that of the TMR-protected ResNet 110 and ResNet 20. Experiment results show that the proposed ensembles could effectively improve the system reliability when suffering soft errors with an overhead much lower than TMR.
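
The reliability figures in the paper come from large-scale fault injection on the FPGA accelerator; the underlying intuition, that a soft error corrupting one weak ensemble member is usually outvoted by the others, can be shown with a toy majority-vote experiment. The accuracy and corruption probabilities below are made-up illustrative values, not measurements from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
n_samples = 100_000
p_correct_single = 0.90      # accuracy of one weak classifier (illustrative)
p_corrupted = 0.05           # chance a soft error corrupts one member's output

def member_predictions(n_members):
    """Simulate binary correctness of an ensemble; a corrupted member answers
    randomly instead of with its usual accuracy."""
    correct = rng.random((n_samples, n_members)) < p_correct_single
    corrupted = rng.random((n_samples, n_members)) < p_corrupted
    random_answer = rng.random((n_samples, n_members)) < 0.5
    return np.where(corrupted, random_answer, correct)

for members in (1, 3, 5):
    votes = member_predictions(members)
    majority = votes.sum(axis=1) > members / 2
    print(f"{members} member(s): accuracy under faults = {majority.mean():.4f}")
```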

Proceedings ArticleDOI
11 Feb 2022
TL;DR: HiSparse, a high-performance SpMV accelerator on a multi-die HBM-equipped FPGA device, achieves a high frequency and delivers promising speedup with increased bandwidth efficiency when compared to prior arts on CPUs, GPUs, and FPGAs.
Abstract: Sparse linear algebra operators are memory bound due to low compute to memory access ratio and irregular data access patterns. The exceptional bandwidth improvement provided by the emerging high-bandwidth memory (HBM) technologies, coupled with the ability of FPGAs to customize the memory hierarchy and compute engines, brings the potential to significantly boost the performance of sparse linear algebra operators. In this paper we identify four challenges when developing high-performance sparse linear algebra accelerators on HBM-equipped FPGAs --- low HBM bandwidth utilization with conventional sparse storage, limited on-chip memory capacity being the bottleneck when scaling to multiple HBM channels, low compute occupancy due to bank conflicts and inter-iteration carried dependencies, and timing closure on multi-die heterogeneous fabrics. We conduct an in-depth case study on sparse matrix-vector multiplication (SpMV) to explore techniques that tackle the four challenges. These techniques include (1) a customized sparse matrix format tailored for HBMs, (2) a scalable on-chip buffer design that combines replication and banking, (3) best practices of using HLS to implement hardware modules that dynamically resolve bank conflicts and carried dependencies for achieving high compute occupancy, and (4) a split-kernel design methodology for frequency optimization. Using the techniques, we demonstrate HiSparse, a high-performance SpMV accelerator on a multi-die HBM-equipped FPGA device. We evaluated HiSparse on a variety of matrix datasets. The results show that HiSparse achieves a high frequency and delivers promising speedup with increased bandwidth efficiency when compared to prior arts on CPUs, GPUs, and FPGAs. HiSparse is available at https://github.com/cornell-zhang/HiSparse.

Journal ArticleDOI
TL;DR: Aiming at improving the LiDAR operation in challenging weather conditions, which contributes to achieving higher driving automation levels defined by the Society of Automotive Engineers (SAE), this article proposes a weather denoising method called Dynamic light-Intensity Outlier Removal (DIOR).
Abstract: The interest in developing and deploying fully autonomous vehicles on our public roads has come to a full swing. Driverless capabilities, widely spread in modern vehicles through advanced driver-assistance systems (ADAS), require highly reliable perception features to navigate the environment, being light detection and ranging (LiDAR) sensors a key instrument in detecting the distance and speed of nearby obstacles and in providing high-resolution 3D representations of the surroundings in real-time. However, and despite being assumed as a game-changer in the autonomous driving paradigm, LiDAR sensors can be very sensitive to adverse weather conditions, which can severely affect the vehicle’s perception system behavior. Aiming at improving the LiDAR operation in challenging weather conditions, which contributes to achieving higher driving automation levels defined by the Society of Automotive Engineers (SAE), this article proposes a weather denoising method called Dynamic light-Intensity Outlier Removal (DIOR). DIOR combines two approaches of the state-of-the-art, the dynamic radius outlier removal (DROR) and the low-intensity outlier removal (LIOR) algorithms, supported by an embedded reconfigurable hardware platform. By resorting to field-programmable gate array (FPGA) technology, DIOR can outperform state-of-the-art outlier removal solutions, achieving better accuracy and performance while guaranteeing the real-time requirements.
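
DIOR's FPGA datapath is not reproduced here, but the DROR half of the idea, growing the neighbor-search radius with range so that sparse far-away returns are not discarded as weather noise, fits in a few lines. The angular resolution, multiplier, and neighbor threshold below are illustrative placeholders, not the tuned values from the article.

```python
import numpy as np

def dror_filter(points, alpha_deg=0.2, beta=3.0, r_min=0.1, min_neighbors=3):
    """Dynamic Radius Outlier Removal (behavioral sketch). The search radius
    grows with range (beta * range * angular spacing), so distant but sparse
    returns are not misclassified as snow/fog noise."""
    ranges = np.linalg.norm(points, axis=1)
    keep = np.zeros(len(points), dtype=bool)
    for i, (p, r) in enumerate(zip(points, ranges)):
        radius = max(r_min, beta * r * np.radians(alpha_deg))
        d = np.linalg.norm(points - p, axis=1)
        neighbors = np.count_nonzero(d < radius) - 1     # exclude the point itself
        keep[i] = neighbors >= min_neighbors
    return points[keep]

rng = np.random.default_rng(7)
wall = np.column_stack([np.full(200, 20.0),              # dense structure at ~20 m
                        rng.uniform(-1, 1, 200),
                        rng.uniform(0, 1, 200)])
noise = rng.uniform(-5, 5, size=(30, 3))                 # scattered near-range noise
cloud = np.vstack([wall, noise])
print("kept", len(dror_filter(cloud)), "of", len(cloud), "points")
```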

Journal ArticleDOI
TL;DR: In this paper, the challenges and solutions in co-packaging photonics modules are described through two case studies: one of a network-switch die co-packaged with socketable photonics modules and another of a Field Programmable Gate Array (FPGA) co-packaged with optical dies (tiles).
Abstract: Photonics die or integrated photonics modules co-packaged with compute engines have the potential to deliver significant improvements in power, bandwidth and reach needed to meet the computing and communication demands of data centers and other high-performance computing (HPC) systems. The challenges and solutions in co-packaging photonics modules are described through two case studies; one of a network-switch die co-packaged with socketable photonics modules and another of a Field Programmable Gate Array (FPGA) co-packaged with optical dies (tiles). The technical requirements to deliver the promise of co-packaged photonics in high volume are outlined.

Journal ArticleDOI
TL;DR: A highly efficient layer-wise refined pruning method for deep neural networks is proposed at the software level, and the inference process is accelerated at the hardware level on a field-programmable gate array (FPGA).
Abstract: To accelerate the practical application of artificial intelligence, this paper proposes a highly efficient layer-wise refined pruning method for deep neural networks at the software level and accelerates the inference process at the hardware level on a field-programmable gate array (FPGA). The refined pruning operation is based on the channel-wise importance indexes of each layer and the layer-wise input sparsity of convolutional layers. The method utilizes the characteristics of the native networks without introducing any extra workload to the training phase. In addition, the operation is easy to extend to various state-of-the-art deep neural networks. The effectiveness of the method is verified on ResNet architectures and VGG networks on the CIFAR10, CIFAR100, and ImageNet100 datasets. Experimental results show that for ResNet50 on CIFAR10 and ResNet101 on CIFAR100, more than 85% of parameters and floating-point operations are pruned with only 0.35% and 0.40% accuracy loss, respectively. As for the VGG network, 87.05% of parameters and 75.78% of floating-point operations are pruned with only 0.74% accuracy loss for VGG13BN on CIFAR10. Furthermore, we accelerate the networks at the hardware level on the FPGA platform using the Vitis AI tool. In the two-thread mode on the FPGA, the throughput of the pruned VGG13BN and ResNet101 reaches 151.99 fps and 124.31 fps, respectively, and the pruned networks achieve about 4.3× and 1.8× speedup for VGG13BN and ResNet101, respectively, compared with the original networks on the FPGA.
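
The layer-wise importance indexes and sparsity criteria are defined in the paper; purely as a generic illustration of channel-wise pruning, the sketch below ranks the filters of one convolution layer by their L1 norm and zeroes out the weakest ones. The L1 criterion and the pruning ratio are stand-ins, not the paper's refined indexes.

```python
import numpy as np

def prune_channels(weights, ratio=0.5):
    """weights: (out_channels, in_channels, kH, kW) tensor of one conv layer.
    Rank output channels by L1 norm, zero the weakest `ratio` fraction, and
    return the pruned tensor plus the surviving channel indices."""
    importance = np.abs(weights).sum(axis=(1, 2, 3))      # per-filter L1 norm
    n_keep = int(round(len(importance) * (1 - ratio)))
    keep = np.argsort(importance)[::-1][:n_keep]
    pruned = np.zeros_like(weights)
    pruned[keep] = weights[keep]
    return pruned, np.sort(keep)

rng = np.random.default_rng(8)
w = rng.normal(size=(64, 32, 3, 3))
pruned, kept = prune_channels(w, ratio=0.85)
print(f"kept {len(kept)}/{w.shape[0]} filters; "
      f"{100 * (pruned == 0).mean():.1f}% of weights are now zero")
```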

Proceedings ArticleDOI
13 Feb 2022
TL;DR: This paper proposes a split compilation approach based on the pipelining flexibility at the HLS level, which allows designs to be partitioned for parallel placement and routing and then stitched back together to achieve a fast end-to-end compilation.
Abstract: FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing then stitch the separate partitions together. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized and physical-integrated compilation framework that takes in an HLS dataflow program in C/C++ and generates a fully placed and routed implementation. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7X reduction in compile time and up to 1.3X increase in frequency when compared to a commercial-off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time up to an order of magnitude in the cases with lower performance requirements. The tool is open-sourced at github.com/Licheng-Guo/RapidStream.

Journal ArticleDOI
TL;DR: This work presents the first complete implementation of the Classic McEliece cryptosystem from the third round of NIST's Post-Quantum Cryptography standardization process, including encapsulation and decapsulation modules as well as key generation with seed expansion, and presents three new algorithms that can be used for systemization while complying with the specification.
Abstract: We present the first specification-compliant constant-time FPGA implementation of the Classic McEliece cryptosystem from the third round of NIST's Post-Quantum Cryptography standardization process. In particular, we present the first complete implementation including encapsulation and decapsulation modules as well as key generation with seed expansion. All the hardware modules are parametrizable, at compile time, with security level and performance parameters. We show that our complete Classic McEliece design can, for example, perform key generation in 5.2 ms to 20 ms, encapsulation in 0.1 ms to 0.5 ms, and decapsulation in 0.7 ms to 1.5 ms for all security levels on a Xilinx Artix-7 FPGA. The performance can be increased even further at the cost of resources by increasing the level of parallelization using the performance parameters of our design.

Journal ArticleDOI
24 Jan 2022-Sensors
TL;DR: A closed-loop motion control system based on a BP neural network (BPNN) PID controller is proposed using a Xilinx field-programmable gate array (FPGA) solution; it realizes self-tuning of the PID control parameters and offers reliable performance, high real-time capability, and strong anti-interference.
Abstract: In the actual industrial production process, adaptively tuning proportional-integral-derivative (PID) parameters online with a neural network can adapt to the different characteristics of different controlled objects better than a conventional PID controller. However, the commonly used microcontroller unit (MCU) cannot meet the real-time and high-reliability requirements of such application scenarios. Therefore, in this paper, a closed-loop motion control system based on a BP neural network (BPNN) PID controller using a Xilinx field-programmable gate array (FPGA) solution is proposed. The controller is divided into several sub-modules according to a modular design approach. The forward propagation module completes the forward propagation operation from the input layer to the output layer. The PID module implements the mapping of the PID arithmetic to register-transfer level (RTL) and is responsible for producing the control output. The main state machine module generates enable signals that control the sequential execution of each sub-module. The error backpropagation and weight update module completes the update of the weights of each layer of the network. The peripheral modules of the control system are divided into two main parts: the speed measurement module acquires the output pulse signal of the encoder and measures the motor speed, and the pulse width modulation (PWM) signal generation module generates PWM waves with different duty cycles to control the rotation speed of the motor. A co-simulation of ModelSim and Simulink is used to simulate and verify the system, and a test analysis is also performed on the development platform. The results show that the proposed system can realize self-tuning of the PID control parameters and has reliable performance, high real-time capability, and strong anti-interference. Compared with an MCU, the convergence speed is improved by more than three orders of magnitude, which demonstrates its superiority.
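
On the FPGA the controller is split into RTL sub-modules; functionally, a BPNN-PID loop outputs (Kp, Ki, Kd) from the error signals and adjusts them online by propagating the tracking error back through the network. The sketch below is a much-simplified behavioral stand-in (delta-rule gain adaptation on a first-order toy plant) meant only to show the self-tuning loop; the plant, update rule, and learning rate are assumptions, not the paper's motor model or network.

```python
import numpy as np

# Toy first-order plant: y[k+1] = a*y[k] + b*u[k]  (illustrative, not the paper's motor)
a_plant, b_plant = 0.9, 0.1
setpoint = 1.0

# PID gains tuned online with a delta-rule update, a simplified stand-in for
# the BP neural network that produces Kp, Ki, Kd in the paper.
Kp, Ki, Kd = 0.5, 0.1, 0.0
lr = 0.02

y, e_prev, e_sum = 0.0, 0.0, 0.0
for k in range(200):
    e = setpoint - y
    e_sum += e
    de = e - e_prev

    u = Kp * e + Ki * e_sum + Kd * de          # PID control law

    # Gradient-style adaptation (plant gain assumed positive): nudge each gain
    # in the direction that reduces the squared tracking error.
    Kp = max(0.0, Kp + lr * e * e)
    Ki = max(0.0, Ki + lr * e * e_sum * 0.01)
    Kd = max(0.0, Kd + lr * e * de)

    y = a_plant * y + b_plant * u              # plant update
    e_prev = e

print(f"output {y:.3f} (setpoint {setpoint}); gains Kp={Kp:.2f} Ki={Ki:.3f} Kd={Kd:.3f}")
```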

Journal ArticleDOI
TL;DR: A highly flexible and reconfigurable hardware accelerator is proposed to efficiently support various CNN-based vision tasks and shows state-of-the-art performance on both large-scale and lightweight CNNs for image segmentation or classification.
Abstract: To enable efficient deployment of convolutional neural networks (CNNs) on embedded platforms for different computer vision applications, several convolution variants have been introduced, such as depthwise convolution (DWCV), transposed convolution (TPCV), and dilated convolution (DLCV). To address the utilization degradation issue that occurs in a general convolution engine for these emerging operators, a highly flexible and reconfigurable hardware accelerator is proposed to efficiently support various CNN-based vision tasks. Firstly, to avoid workload imbalance of TPCV, a zero transfer and skipping (ZTS) method is proposed to reorganize the computation process. To eliminate the redundant zero calculations of TPCV and DLCV, a sparsity-alike processing (SAP) method is proposed based on weight-oriented dataflow. Secondly, the DWCV or pooling layers are configured to be directly executed after standard convolutions without external memory accesses. Furthermore, a programmable execution schedule is introduced to gain better flexibility. Finally, the proposed accelerator is evaluated on an Intel Arria 10 SoC FPGA. Experimental results show state-of-the-art performance on both large-scale and lightweight CNNs for image segmentation or classification. Specifically, the accelerator can achieve a processing speed of up to 339.9 FPS and computational efficiency up to 0.58 GOPS/DSP, which is 3.3× better than the prior art evaluated on the same network.

Journal ArticleDOI
TL;DR: In this paper, real-time object detection based on CNN, QNN, and BNN deep neural network classifiers is developed in Python, with the dataset taken from PASCAL VOC.

Journal ArticleDOI
TL;DR: A solution is proposed consisting of a sample prototype of an AI-based, Flask-driven web application framework that predicts six different diseases (ARDS, bacteria, COVID-19, SARS, Streptococcus, and virus); implemented in an FPGA environment, it is observed to attain a reduced gate count and low power.
Abstract: Coronavirus is a large family of viruses that affects humans and damages respiratory function, causing illnesses ranging from the common cold to more serious diseases such as ARDS and SARS; the most recently discovered virus causes COVID-19. Isolation at home or in hospital depends on one's health history and condition, and a disease instigated by the virus can lead to a deterioration in health, so there is a need for early detection of the virus. Recently, many works have deployed detection techniques based on chest X-rays. In this work, a solution has been proposed that consists of a sample prototype of an AI-based, Flask-driven web application framework that predicts six different diseases: ARDS, bacteria, COVID-19, SARS, Streptococcus, and virus. Each category of X-ray images was scrutinized, and training and testing were conducted using deep learning algorithms such as CNN, ResNet (with and without dropout), VGG16, and AlexNet to detect the status of the X-rays. Recent FPGA design tools are compatible with software models of deep learning methods, and FPGAs suit deep learning algorithms from the perspectives of design flexibility, innovation, and hardware acceleration. High-performance FPGA hardware is advantageous over GPUs, can efficiently integrate with the deep learning modules, and acts as a strong substitute platform that bridges the gap between architectures and power-related designs, making the FPGA a good option for implementing such algorithms. The design attains 121 µW power and 89 ms delay; implemented in the FPGA environment, it was observed to attain a reduced gate count and low power.

Journal ArticleDOI
TL;DR: A review of hardware architectures for the acceleration of reinforcement learning algorithms is presented in this article; FPGA-based implementations are the focus of this work, but GPU-based approaches are considered as well.
Abstract: Reinforcement learning algorithms have been very successful at solving sequential decision-making problems in many different problem domains. However, their training is often time-consuming, with training times ranging from multiple hours to weeks. The development of domain-specific architectures for reinforcement learning promises faster computation times, decreased experiment turn-around time, and improved energy efficiency. This paper presents a review of hardware architectures for the acceleration of reinforcement learning algorithms. FPGA-based implementations are the focus of this work, but GPU-based approaches are considered as well. Both tabular and deep reinforcement learning algorithms are included in this survey. The techniques employed in different implementations are highlighted and compared. Finally, possible areas for future work are suggested, based on the preceding discussion of existing architectures.