Showing papers on "Field-programmable gate array" published in 2019

Open Access

Journal Article•DOI•

A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection

Duy Thanh Nguyen¹, Tuan Nghia Nguyen¹, Hyun Kim², Hyuk-Jae Lee¹•Institutions (2)

Seoul National University¹, Seoul National University of Science and Technology²

01 Apr 2019-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: This paper presents a Tera-OPS streaming hardware accelerator implementing a you-only-look-once (YOLO) CNN, which outperforms the “one-size-fits-all” designs in both performance and power efficiency.

Abstract: Convolutional neural networks (CNNs) require numerous computations and external memory accesses. Frequent accesses to off-chip memory cause slow processing and large power dissipation. For real-time object detection with high throughput and power efficiency, this paper presents a Tera-OPS streaming hardware accelerator implementing a you-only-look-once (YOLO) CNN. The parameters of the YOLO CNN are retrained and quantized with the PASCAL VOC data set using binary weight and flexible low-bit activation. The binary weight enables storing the entire network model in block RAMs of a field-programmable gate array (FPGA) to reduce off-chip accesses aggressively and, thereby, achieve significant performance enhancement. In the proposed design, all convolutional layers are fully pipelined for enhanced hardware utilization. The input image is delivered to the accelerator line-by-line. Similarly, the output from the previous layer is transmitted to the next layer line-by-line. The intermediate data are fully reused across layers, thereby eliminating external memory accesses. The decreased dynamic random access memory (DRAM) accesses reduce DRAM power consumption. Furthermore, as the convolutional layers are fully parameterized, it is easy to scale up the network. In this streaming design, each convolution layer is mapped to a dedicated hardware block. Therefore, it outperforms the “one-size-fits-all” designs in both performance and power efficiency. This CNN implemented using VC707 FPGA achieves a throughput of 1.877 tera operations per second (TOPS) at 200 MHz with batch processing while consuming 18.29 W of on-chip power, which shows the best power efficiency compared with the previous research. As for object detection accuracy, it achieves a mean average precision (mAP) of 64.16% for the PASCAL VOC 2007 data set that is only 2.63% lower than the mAP of the same YOLO network with full precision.
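
The headline figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below derives the power efficiency in GOPS/W and the implied full-precision mAP. It assumes the "2.63% lower" gap is meant in percentage points, which the abstract does not state explicitly; the variable names are illustrative.

```python
# Figures quoted in the abstract above.
throughput_tops = 1.877   # tera operations per second at 200 MHz
on_chip_power_w = 18.29   # watts of on-chip power
map_quantized = 64.16     # mAP (%) on PASCAL VOC 2007 with binary weights
map_gap = 2.63            # stated gap to full precision (assumed percentage points)

# Power efficiency in GOPS/W: convert TOPS to GOPS, then divide by watts.
gops_per_watt = throughput_tops * 1000 / on_chip_power_w

# The full-precision baseline is implied by the abstract, not stated in it.
map_full_precision = map_quantized + map_gap

print(f"{gops_per_watt:.1f} GOPS/W")
print(f"{map_full_precision:.2f}% mAP (implied full-precision baseline)")
```

Under these assumptions the design lands at roughly 102.6 GOPS/W, with an implied full-precision baseline near 66.8% mAP.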

259 citations

Journal Article•DOI•

High-Performance FPGA-Based CNN Accelerator With Block-Floating-Point Arithmetic

Xiaocong Lian¹, Zhenyu Liu¹, Zhourui Song², Jiwu Dai³, Wei Zhou³, Xiangyang Ji¹•Institutions (3)

Tsinghua University¹, Beijing University of Posts and Telecommunications², Northwestern Polytechnical University³

16 May 2019-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: This paper adopts an optimized block-floating-point (BFP) arithmetic in its accelerator for efficient inference of deep neural networks, improving energy and hardware efficiency by three times.

Abstract: Convolutional neural networks (CNNs) are widely used and have achieved great success in computer vision and speech processing applications. However, deploying the large-scale CNN model in the embedded system is subject to the constraints of computation and memory. An optimized block-floating-point (BFP) arithmetic is adopted in our accelerator for efficient inference of deep neural networks in this paper. The feature maps and model parameters are represented in 16-bit and 8-bit formats, respectively, in the off-chip memory, which can reduce memory and off-chip bandwidth requirements by 50% and 75% compared to the 32-bit FP counterpart. The proposed 8-bit BFP arithmetic with optimized rounding and shifting-operation-based quantization schemes improves the energy and hardware efficiency by three times. One CNN model can be deployed in our accelerator without retraining at the cost of an accuracy loss of not more than 0.12%. The proposed reconfigurable accelerator with three parallelism dimensions, ping-pong off-chip DDR3 memory access, and an optimized on-chip buffer group is implemented on the Xilinx VC709 evaluation board. Our accelerator achieves a performance of 760.83 GOP/s and 82.88 GOP/s/W under a 200-MHz working frequency, significantly outperforming previous accelerators.
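
Block-floating-point stores one shared exponent per block of values plus a short integer mantissa per value. The following is a minimal sketch of that idea, not the paper's scheme: the abstract does not spell out its optimized rounding or shift-based quantization, so plain round-to-nearest with clamping is assumed here, and `bfp_quantize` and `bfp_dequantize` are hypothetical helper names.

```python
import math

def bfp_quantize(block, mantissa_bits=8):
    """Quantize floats to one shared exponent plus signed integer mantissas."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so every |x| < 2**e.
    _, shared_exp = math.frexp(max_abs)
    # One bit holds the sign; the remaining bits hold the magnitude.
    scale = 2.0 ** (mantissa_bits - 1 - shared_exp)
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to nearest (assumed), then clamp to the representable range.
    mantissas = [max(-limit, min(limit, round(x * scale))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=8):
    """Rebuild approximate floats from the shared exponent and mantissas."""
    step = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * step for m in mantissas]

shared_exp, mantissas = bfp_quantize([0.5, -0.25, 0.1])
print(shared_exp, mantissas, bfp_dequantize(shared_exp, mantissas))

# Storage saving relative to 32-bit floating point, as quoted in the abstract.
for bits in (16, 8):
    print(f"{bits}-bit storage saves {1 - bits / 32:.0%} versus 32-bit FP")
```

The last loop reproduces the 50% and 75% reductions quoted for 16-bit feature maps and 8-bit parameters.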

116 citations

Journal Article•DOI•

A true random bit generator based on a memristive chaotic circuit: Analysis, design and FPGA implementation

Barış Karakaya¹, Arif Gülten¹, Mattia Frasca²•Institutions (2)

Fırat University¹, University of Catania²

01 Feb 2019-Chaos Solitons & Fractals

TL;DR: This paper presents a true random bit generator (TRBG) based on a memristive chaotic circuit, implemented on a Field Programmable Gate Array (FPGA) board and validated by statistical analysis using the NIST 800.22 statistical test suite.

Abstract: The aim of this paper is to present a true random bit generator (TRBG) based on a memristive chaotic circuit and its implementation on Field Programmable Gate Array (FPGA) board. The proposed TRBG architecture makes use of a memristive canonical Chua's oscillator and a logistic map as entropy sources, while the XOR function is used for post-processing. The optimal parameter set for the chaotic systems has been chosen by carrying out numerical simulations of the system and adopting the scale index parameter to determine the degree of non-periodicity of the obtained bit streams. The proposed TRBG system has been then modeled and co-simulated on the Xilinx System Generator (XSG) platform and implemented on the Xilinx Kintex-7 KC705 FPGA Evaluation Board, obtaining experimental results in agreement with the expectations. Finally, the system has been validated with statistical analysis by using the NIST 800.22 statistical test suite.
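
The structure described above (two chaotic entropy sources combined by XOR) can be illustrated in software. This is a simplified sketch with several assumptions: a second logistic map stands in for the memristive Chua's oscillator, the map parameter `r = 3.99`, the seeds, and the 0.5 threshold are illustrative choices, and a deterministic simulation like this is only pseudorandom, unlike the hardware generator.

```python
def logistic_bits(x0, n, r=3.99, threshold=0.5):
    """Return n bits obtained by thresholding iterates of x <- r * x * (1 - x)."""
    x = x0
    bits = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        bits.append(1 if x > threshold else 0)
    return bits

def xor_streams(a, b):
    """XOR post-processing: combine two bit streams element by element."""
    return [p ^ q for p, q in zip(a, b)]

stream_a = logistic_bits(0.123456, 1000)
stream_b = logistic_bits(0.654321, 1000)  # stand-in for the second entropy source
out = xor_streams(stream_a, stream_b)
print(f"fraction of ones: {sum(out) / len(out):.3f}")
```

A bit stream produced this way would still need a statistical test suite to assess its quality, which is the role the NIST tests play in the paper.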

101 citations

Journal Article•DOI•

Reliable and Modeling Attack Resistant Authentication of Arbiter PUF in FPGA Implementation With Trinary Quadruple Response

Siarhei S. Zalivaka¹, Alexander A. Ivaniuk², Chip-Hong Chang¹•Institutions (2)

Nanyang Technological University¹, Belarusian State University of Informatics and Radioelectronics²

01 Apr 2019-IEEE Transactions on Information Forensics and Security

TL;DR: This paper presents a robust device authentication method based on the FPGA implementation of a reliability enhanced A-PUF with trinary digit (trit) quadruple responses; the proposed authentication protocol has been experimentally evaluated to be practically secure against various machine learning attacks.

Abstract: Field programmable gate array (FPGA) is a potential hotbed for malicious and counterfeit hardware infiltration. Arbiter-based physical unclonable function (A-PUF) has been widely regarded as a suitable lightweight security primitive for FPGA bitstream encryption and device authentication. Unfortunately, the metastability of flip-flop gives rise to poor A-PUF reliability in FPGA implementation. Its linear additive path delays are also vulnerable to modeling attacks. Most reliability enhancement techniques tend to increase the response predictability and ease machine learning attacks. This paper presents a robust device authentication method based on the FPGA implementation of a reliability enhanced A-PUF with trinary digit (trit) quadruple responses. A two flip-flop arbiter is used to produce a trit for metastability detection. By considering the ordered responses to all four combinations of first and last challenge bits, each quadruple response can be compressed into a quadbit that represents one of the five classes of trit quadruple response with greater reproducibility. This challenge-response quadruple classification not only greatly reduces the burden of error correction at the device but also enables a precise A-PUF model to be built at the server without having to store the complete challenge-response pair (CRP) set for authentication. Besides, the real challenge to the A-PUF is generated internally by a lossy, nonlinear, and irreversible maximum length signature generator at both the server and device sides to prevent the naked CRP from being machine learned by the attacker. The A-PUF with short repetition code of length five has been tested to achieve a reliability of 1.0 over the full operating temperature range of the target FPGA board with lower hardware resource utilization than other modeling attack resilient strong PUFs. 
The proposed authentication protocol has also been experimentally evaluated to be practically secure against various machine learning attacks including evolutionary strategy covariance matrix adaptation.

79 citations

Journal Article•DOI•

An Efficient Hardware Implementation of Reinforcement Learning: The Q-Learning Algorithm

Sergio Spanò¹, Gian Carlo Cardarilli¹, Luca Di Nunzio¹, Rocco Fazzolari¹, Daniele Giardino¹, Marco Matta¹, Alberto Nannarelli², Marco Re¹•Institutions (2)

University of Rome Tor Vergata¹, Technical University of Denmark²

20 Dec 2019-IEEE Access

TL;DR: This paper proposes an efficient hardware architecture that implements the Q-Learning algorithm, suitable for real-time applications and offering low power, high throughput and limited hardware resources, together with a technique based on approximated multipliers to reduce the hardware complexity of the algorithm.

Abstract: In this paper we propose an efficient hardware architecture that implements the Q-Learning algorithm, suitable for real-time applications. Its main features are low power, high throughput and limited hardware resources. We also propose a technique based on approximated multipliers to reduce the hardware complexity of the algorithm. We implemented the design on a Xilinx Zynq Ultrascale+ MPSoC ZCU106 Evaluation Kit. The implementation results are evaluated in terms of hardware resources, throughput and power consumption. The architecture is compared to the state of the art of Q-Learning hardware accelerators presented in the literature, obtaining better results in speed, power and hardware resources. Experiments using different sizes for the Q-Matrix and different wordlengths for the fixed-point arithmetic are presented. With a Q-Matrix of size $8\times4$ (8-bit data) we achieved a throughput of 222 MSPS (Mega Samples Per Second) and a dynamic power consumption of 37 mW, while with a Q-Matrix of size $256\times16$ (32-bit data) we achieved a throughput of 93 MSPS and a power consumption of 611 mW. Due to the small amount of hardware resources required by the accelerator, our system is suitable for multi-agent IoT applications. Moreover, the architecture can be used to implement the SARSA (State-Action-Reward-State-Action) Reinforcement Learning algorithm with minor modifications.
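
For readers unfamiliar with the algorithm being accelerated, the standard tabular Q-Learning update and its SARSA variant can be sketched as follows. This is the textbook software form in floating point, not the paper's fixed-point hardware datapath; the learning rate `alpha`, the discount `gamma`, and the reading of the $8\times4$ Q-Matrix as 8 states by 4 actions are assumptions made for illustration.

```python
def q_update(q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """Q-Learning: bootstrap from the best action value in the next state."""
    best_next = max(q[s_next])
    q[s][a] += alpha * (reward + gamma * best_next - q[s][a])
    return q[s][a]

def sarsa_update(q, s, a, reward, s_next, a_next, alpha=0.5, gamma=0.9):
    """SARSA: bootstrap from the action actually taken in the next state."""
    q[s][a] += alpha * (reward + gamma * q[s_next][a_next] - q[s][a])
    return q[s][a]

# Q-Matrix with 8 rows and 4 columns (assumed to mean 8 states, 4 actions).
q = [[0.0] * 4 for _ in range(8)]
q[1][2] = 1.0
print(q_update(q, s=0, a=3, reward=1.0, s_next=1))
```

The two updates differ only in how the next-state value is chosen, which is consistent with the abstract's remark that SARSA needs only minor modifications.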

71 citations

Proceedings Article•DOI•

GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms

Hanqing Zeng¹, Viktor K. Prasanna¹•Institutions (1)

University of Southern California¹

31 Dec 2019-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: A novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture co-optimizations and proposing a light-weight pre-processing step based on a graph theoretic approach to optimize the feature propagation within subgraphs.

Abstract: Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs, due to (1) substantial and irregular data communication to propagate information within the graph, and (2) intensive computation to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture co-optimizations. We first analyze the computation and communication characteristics of various GCN training algorithms, and select a subgraph-based algorithm that is well suited for hardware execution. To optimize the feature propagation within subgraphs, we propose a lightweight pre-processing step based on a graph theoretic approach. Such pre-processing performed on the CPU significantly reduces the memory access requirements and the computation to be performed on the FPGA. To accelerate the weight update in GCN layers, we propose a systolic array based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline, and analyze its load-balance and resource utilization by accurate performance modeling. We evaluate our design on a Xilinx Alveo U200 board hosted by a 40-core Xeon server. On three large graphs, we achieve an order of magnitude training speedup with negligible accuracy loss, compared with state-of-the-art implementation on a multi-core platform.

68 citations

Journal Article•DOI•

Field Programmable Gate Array Applications—A Scientometric Review

Juan Ruiz-Rosero, Gustavo Ramirez-Gonzalez, Rahul Khanna

11 Nov 2019

TL;DR: This paper reviews the top FPGAs’ applications by a scientometric analysis in ScientoPy, covering publications related to FPGA from 1992 to 2018, finding the top 150 applications that are divided into the following categories: digital control, communication interfaces, networking, computer security, cryptography techniques, machine learning, digital signal processing, image and video processing, big data, computer algorithms and other applications.

Abstract: A Field Programmable Gate Array (FPGA) is a general purpose programmable logic device that can be configured by a customer after manufacturing to perform anything from simple logic gate operations to complex systems on chip or even artificial intelligence systems. Scientific publications related to FPGAs started in 1992 and, up to now, we found more than 70,000 documents in the two leading scientific databases (Scopus and Clarivate Web of Science). These publications show the vast range of applications based on FPGAs, from the new mechanism that enables the magnetic suspension system for the kilogram redefinition, to the Mars rovers’ navigation systems. This paper reviews the top FPGAs’ applications by a scientometric analysis in ScientoPy, covering publications related to FPGAs from 1992 to 2018. Here we found the top 150 applications that we divided into the following categories: digital control, communication interfaces, networking, computer security, cryptography techniques, machine learning, digital signal processing, image and video processing, big data, computer algorithms and other applications. Also, we present an evolution and trend analysis of the related applications.

63 citations

Proceedings Article•DOI•

Xilinx Adaptive Compute Acceleration Platform: Versal™ Architecture

Brian C. Gaide¹, Dinesh D. Gaitonde¹, Chirag Ravishankar¹, Trevor J. Bauer¹•Institutions (1)

Xilinx¹

20 Feb 2019

TL;DR: Xilinx's Versal-Adaptive Compute Acceleration Platform (ACAP) is a hybrid compute platform that tightly integrates traditional FPGA programmable fabric, software programmable processors and software programmable accelerator engines.

Abstract: In this paper we describe Xilinx's Versal-Adaptive Compute Acceleration Platform (ACAP). ACAP is a hybrid compute platform that tightly integrates traditional FPGA programmable fabric, software programmable processors and software programmable accelerator engines. ACAP improves over the programmability of traditional reconfigurable platforms by introducing newer compute models in the form of software programmable accelerators and by separating out the data movement architecture from the compute architecture. The Versal architecture includes a host of new capabilities, including a chip-pervasive programmable Network-on-Chip (NoC), Imux Registers, compute shell, more advanced SSIT, adaptive deskew of global clocks, faster configuration, and other new programmable elements as well as enhancements to the CLB and interconnect. We discuss these architectural developments and highlight their key motivations and differences in relation to traditional FPGA architectures.

63 citations

Journal Article•DOI•

An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution

Bing Liu, Danyin Zou, Lei Feng, Shou Feng, Ping Fu, Junbao Li

03 Mar 2019-Electronics

TL;DR: The CNN accelerator designed in this paper achieves 17.11 GOPS for 32-bit floating point while also accelerating depthwise separable convolution, which gives it clear advantages over other designs.

Abstract: The Convolutional Neural Network (CNN) has been used in many fields and has achieved remarkable results, such as image classification, face detection, and speech recognition. Compared to GPU (graphics processing unit) and ASIC, an FPGA (field programmable gate array)-based CNN accelerator has great advantages due to its low power consumption and reconfigurable property. However, FPGA’s extremely limited resources and CNN’s huge amount of parameters and computational complexity pose great challenges to the design. Based on the ZYNQ heterogeneous platform and the coordination of resource and bandwidth issues with the roofline model, the CNN accelerator we designed can accelerate both standard convolution and depthwise separable convolution with a high hardware resource rate. The accelerator can handle network layers of different scales through parameter configuration, and it maximizes bandwidth and achieves full pipelining by using a data stream interface and ping-pong on-chip cache. The experimental results show that the accelerator designed in this paper achieves 17.11 GOPS for 32-bit floating point while also accelerating depthwise separable convolution, which gives it clear advantages over other designs.

63 citations

Journal Article•DOI•

Recent Developments and Challenges in FPGA-Based Time-to-Digital Converters

Rui Machado¹, Jorge Cabral¹, F. S. Alves•Institutions (1)

University of Minho¹

29 Aug 2019-IEEE Transactions on Instrumentation and Measurement

TL;DR: This article presents and discusses the improvements on the FPGA-based TDC research, aiming to be a starting point for new studies on this field, with some guidelines for future research.

Abstract: Over the past few years, the gap between field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) performance levels has been narrowed due to the constant development of FPGA technology. The high performance, together with the lower development costs and a shorter time to market, makes FPGA-based platforms attractive for a huge range of applications, among them time-to-digital converters (TDCs). It is, therefore, important to analyze the evolution of FPGA-based TDCs to better understand where the research efforts should be focused in the near future. This article presents and discusses the improvements in FPGA-based TDC research, aiming to be a starting point for new studies in this field, with some guidelines for future research. A state-of-the-art literature review on FPGA-based TDCs is presented, aiming to categorize and discuss the existing architectures. This discussion addresses architectures’ characteristics, limitations, and areas of application.

60 citations

Journal Article•DOI•

FPGA-based implementation of different families of fractional-order chaotic oscillators applying Grünwald–Letnikov method

Ana Dalia Pano-Azucena, Brisbane Ovilla-Martinez¹, Esteban Tlelo-Cuautle, Jesús M. Muñoz-Pacheco², Luis Gerardo de la Fraga³•Institutions (3)

Universidad Autónoma Metropolitana¹, Benemérita Universidad Autónoma de Puebla², CINVESTAV³

30 Jun 2019-Communications in Nonlinear Science and Numerical Simulation

TL;DR: This paper highlights the implementation of different families of fractional-order chaotic oscillators using field-programmable gate arrays (FPGAs), details the hardware implementation when solving the mathematical models by applying the Grünwald–Letnikov method, and highlights the short-memory principle.

Journal Article•DOI•

Robust secure communication protocol for smart healthcare system with FPGA implementation

Venkatasamy Sureshkumar¹, Ruhul Amin², V. R. Vijaykumar³, S. Raja Sekar⁴•Institutions (4)

PSG College of Technology¹, International Institute of Information Technology², Anna University Chennai - Regional Office, Coimbatore³, Bannari Amman Institute of Technology, Sathy⁴

01 Nov 2019-Future Generation Computer Systems

TL;DR: A novel architecture for MWSNs is proposed, and a suitable authenticated key establishment protocol using lightweight Elliptic Curve Cryptography (ECC) is designed for the architecture, solving the security issues found in existing protocols.

Journal Article•DOI•

Parallel Implementation of Reinforcement Learning Q-Learning Technique for FPGA

Lucileide M. D. Da Silva, Matheus F. Torquato¹, Marcelo A. C. Fernandes²•Institutions (2)

Swansea University¹, Federal University of Rio Grande do Norte²

01 Jan 2019-IEEE Access

TL;DR: A parallel fixed-point Q-learning algorithm architecture implemented on field programmable gate arrays (FPGA) focusing on optimizing the system processing time is proposed.

Abstract: Q-learning is an off-policy reinforcement learning technique, which has the main advantage of obtaining an optimal policy interacting with an unknown model environment. This paper proposes a parallel fixed-point Q-learning algorithm architecture implemented on field programmable gate arrays (FPGA) focusing on optimizing the system processing time. The convergence results are presented, and the processing time and occupied area were analyzed for different states and actions sizes scenarios and various fixed-point formats. The studies concerning the accuracy of the Q-learning technique response and resolution error associated with a decrease in the number of bits were also carried out for hardware implementation. The architecture implementation details were featured. The entire project was developed using the system generator platform (Xilinx), with a Virtex-6 xc6vcx240t-1ff1156 as the target FPGA.

Proceedings Article•DOI•

Yosys+nextpnr: An Open Source Framework from Verilog to Bitstream for Commercial FPGAs

David Shah¹, Eddie Hung², Clifford Wolf, Serge Bazanski, Dan Gisselquist, Miodrag Milanovic•Institutions (2)

Imperial College London¹, University of British Columbia²

01 Apr 2019

TL;DR: A fully free and open source software (FOSS) architecture-neutral FPGA framework comprising Yosys for Verilog synthesis, and nextpnr for placement, routing, and bitstream generation is introduced.

Abstract: This paper introduces a fully free and open source software (FOSS) architecture-neutral FPGA framework comprising Yosys for Verilog synthesis, and nextpnr for placement, routing, and bitstream generation. Currently, this flow supports two commercially available FPGA families, Lattice iCE40 (up to 8K logic elements) and Lattice ECP5 (up to 85K elements), and has been hardware-proven for custom-computing machines including a low-power neural-network accelerator and an OpenRISC system-on-chip capable of booting Linux. Both Yosys and nextpnr have been engineered in a highly flexible manner to support many of the features present in modern FPGAs by separating architecture-specific details from the common mapping algorithms. This framework is demonstrated on a longest-path case study to find an atypical single source-sink path occupying up to 45% of all on-chip wiring.

Journal Article•DOI•

Hardware Optimizations of Dense Binary Hyperdimensional Computing: Rematerialization of Hypervectors, Binarized Bundling, and Combinational Associative Memory

Manuel Schmuck¹, Luca Benini², Abbas Rahimi¹•Institutions (2)

ETH Zurich¹, University of Bologna²

10 Oct 2019-ACM Journal on Emerging Technologies in Computing Systems

TL;DR: In this article, the authors propose hardware techniques for optimizations of hyperdimensional computing, in a synthesizable open-source VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx UltraScale FPGAs.

Abstract: Brain-inspired hyperdimensional (HD) computing models neural activity patterns of the very size of the brain’s circuits with points of a hyperdimensional space, that is, with hypervectors. Hypervectors are D-dimensional (pseudo)random vectors with independent and identically distributed (i.i.d.) components constituting ultra-wide holographic words: D=10,000 bits, for instance. At its very core, HD computing manipulates a set of seed hypervectors to build composite hypervectors representing objects of interest. It demands memory optimizations with simple operations for an efficient hardware realization. In this article, we propose hardware techniques for optimizations of HD computing, in a synthesizable open-source VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx UltraScale FPGAs: (1) We propose simple logical operations to rematerialize the hypervectors on the fly rather than loading them from memory. These operations massively reduce the memory footprint by directly computing the composite hypervectors whose individual seed hypervectors do not need to be stored in memory. (2) Bundling a series of hypervectors over time requires a multibit counter per every hypervector component. We instead propose a binarized back-to-back bundling without requiring any counters. This truly enables on-chip learning with minimal resources as every hypervector component remains binary over the course of training to avoid otherwise multibit components. (3) For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector by using a distance metric. This operator is proportional to hypervector dimension (D), and hence may take O(D) cycles per classification event. 
Accordingly, we significantly improve the throughput of classification by proposing associative memories that steadily reduce the latency of classification to the extreme of a single cycle. (4) We perform a design space exploration incorporating the proposed techniques on FPGAs for a wearable biosignal processing application as a case study. Our techniques achieve up to 2.39× area saving, or 2,337× throughput improvement. The Pareto optimal HD architecture is mapped on only 18,340 configurable logic blocks (CLBs) to learn and classify five hand gestures using four electromyography sensors.
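
The three basic operations named in this abstract (binding seed hypervectors, bundling them, and matching a query in an associative memory) can be sketched in software. This is the conventional formulation, assuming XOR for binding, a counter-based majority vote for bundling, and Hamming distance for the associative search; it does not reproduce the paper's counter-free binarized bundling, rematerialization, or single-cycle memory. All function names are hypothetical.

```python
import random

D = 10_000  # hypervector dimension, as in the abstract's example

def random_hv(rng):
    """Draw a dense binary hypervector with i.i.d. components."""
    return [rng.getrandbits(1) for _ in range(D)]

def bind(a, b):
    """Bind two hypervectors with component-wise XOR (its own inverse)."""
    return [x ^ y for x, y in zip(a, b)]

def bundle(hvs):
    """Component-wise majority vote; use an odd number of inputs to avoid ties."""
    half = len(hvs) / 2
    return [1 if sum(col) > half else 0 for col in zip(*hvs)]

def hamming(a, b):
    """Count the components in which two hypervectors differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(query, prototypes):
    """Associative memory: return the label of the closest prototype."""
    return min(prototypes, key=lambda label: hamming(query, prototypes[label]))

rng = random.Random(0)
members = {label: [random_hv(rng) for _ in range(3)] for label in ("A", "B")}
prototypes = {label: bundle(hvs) for label, hvs in members.items()}
print(classify(members["A"][0], prototypes))
```

The `hamming` loop visits all D components, which corresponds to the O(D) cost per classification event that the paper's associative memories are designed to reduce.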

Journal Article•DOI•

Optimized Compression for Implementing Convolutional Neural Networks on FPGA

Min Zhang, Li Linpeng, Hai Wang, Yan Liu, Qin Hongbo, Wei Zhao

01 Mar 2019-Electronics

TL;DR: A reversed-pruning strategy is proposed which reduces the number of parameters of AlexNet by a factor of 13× without accuracy loss on the ImageNet dataset and an efficient storage technique, which aims for the reduction of the whole overhead cache of the convolutional layer and the fully connected layer, is presented.

Abstract: Field programmable gate array (FPGA) is widely considered as a promising platform for convolutional neural network (CNN) acceleration. However, the large numbers of parameters of CNNs cause heavy computing and memory burdens for FPGA-based CNN implementation. To solve this problem, this paper proposes an optimized compression strategy, and realizes an accelerator based on FPGA for CNNs. Firstly, a reversed-pruning strategy is proposed which reduces the number of parameters of AlexNet by a factor of 13× without accuracy loss on the ImageNet dataset. Peak-pruning is further introduced to achieve better compressibility. Moreover, quantization gives another 4× with negligible loss of accuracy. Secondly, an efficient storage technique, which aims for the reduction of the whole overhead cache of the convolutional layer and the fully connected layer, is presented respectively. Finally, the effectiveness of the proposed strategy is verified by an accelerator implemented on a Xilinx ZCU104 evaluation board. By improving existing pruning techniques and the storage format of sparse data, we significantly reduce the size of AlexNet by 28×, from 243 MB to 8.7 MB. In addition, the overall performance of our accelerator achieves 9.73 fps for the compressed AlexNet. Compared with the central processing unit (CPU) and graphics processing unit (GPU) platforms, our implementation achieves 182.3× and 1.1× improvements in latency and throughput, respectively, on the convolutional (CONV) layers of AlexNet, with an 822.0× and 15.8× improvement for energy efficiency, separately. This novel compression strategy provides a reference for other neural network applications, including CNNs, long short-term memory (LSTM), and recurrent neural networks (RNNs).

Journal Article•DOI•

A Review of New Time-to-Digital Conversion Techniques

Scott Tancock¹, Ekin Arabul¹, Naim Dahnoun¹•Institutions (1)

University of Bristol¹

22 Aug 2019-IEEE Transactions on Instrumentation and Measurement

TL;DR: This article completes the review literature of TDCs by describing new architectures along with their benefits and tradeoffs, as well as the terminology and performance metrics that must be considered when choosing a TDC.

Abstract: Time-to-digital converters (TDCs) are vital components in time and distance measurement and frequency-locking applications. There are many architectures for implementing TDCs, from simple counter TDCs to hybrid multi-level TDCs, which use many techniques in tandem. This article completes the review literature of TDCs by describing new architectures along with their benefits and tradeoffs, as well as the terminology and performance metrics that must be considered when choosing a TDC. It describes their implementation from the gate level upward and how it is affected by the fabric of the device [field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC)] and suggests suitable use cases for the various techniques. Based on the results achieved in the current literature, we make recommendations on the appropriate architecture for a given task based on the number of channels and precision required, as well as the target fabric.

Journal Article•DOI•

FPGA Implementation of High-Speed Area-Efficient Processor for Elliptic Curve Point Multiplication Over Prime Field

Md. Mainul Islam¹, Md. Selim Hossain², Moh. Khalid Hasan¹, Md. Shahjalal¹, Yeong Min Jang¹•Institutions (2)

Kookmin University¹, Khulna University of Engineering & Technology²

09 Dec 2019-IEEE Access

TL;DR: This article presents a high-speed elliptic curve cryptographic (ECC) processor that performs fast point multiplication with low hardware utilization, a crucial demand in the fields of cryptography and network security.

Abstract: Developing a high-speed elliptic curve cryptographic (ECC) processor that performs fast point multiplication with low hardware utilization is a crucial demand in the fields of cryptography and network security. This paper presents a field-programmable gate array (FPGA) implementation of a high-speed, low-area, side-channel attack (SCA)-resistant ECC processor over a prime field. The processor supports 256-bit point multiplication on the recently recommended twisted Edwards curve Edwards25519, which is used for a high-security digital signature scheme called the Edwards curve digital signature algorithm (EdDSA). The paper proposes novel hardware architectures for point addition and point doubling operations on the twisted Edwards curve, where the processor takes only 516 and 1029 clock cycles to perform each point addition and point doubling, respectively. For a 256-bit key, the proposed ECC processor performs a single point multiplication in 1.48 ms (262 650 clock cycles at a maximum clock frequency of 177.7 MHz), giving a throughput of 173.2 kbps while utilizing only 8873 slices on the Xilinx Virtex-7 FPGA platform, with points represented in projective coordinates. The implemented design is time-area-efficient as it offers fast scalar multiplication with low hardware utilization without compromising the security level.
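
The latency and throughput figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that throughput is defined as key bits divided by the time for one point multiplication; that definition is an assumption, not something the abstract states, although it reproduces the quoted numbers.

```python
# Figures quoted in the abstract above.
CYCLES = 262_650       # clock cycles per point multiplication
F_MAX_HZ = 177.7e6     # maximum clock frequency in Hz
KEY_BITS = 256         # scalar (key) length in bits


def latency_seconds(cycles: int, freq_hz: float) -> float:
    """Time for one point multiplication at the given clock frequency."""
    return cycles / freq_hz


def throughput_bps(bits: int, latency_s: float) -> float:
    """Assumed definition: key bits processed per second."""
    return bits / latency_s


latency = latency_seconds(CYCLES, F_MAX_HZ)
throughput = throughput_bps(KEY_BITS, latency)
print(f"latency    = {latency * 1e3:.2f} ms")
print(f"throughput = {throughput / 1e3:.1f} kbps")
```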

Journal Article•DOI•

Pseudorandom number generator based on enhanced Hénon map and its implementation

M. O. Meranza-Castillón¹, M. A. Murillo-Escobar¹, R.M. López-Gutiérrez², César Cruz-Hernández¹•Institutions (2)

Ensenada Center for Scientific Research and Higher Education¹, Autonomous University of Baja California²

01 Jul 2019-AEU - International Journal of Electronics and Communications

TL;DR: A comprehensive security analysis from a cryptographic point of view is presented for hardware implementation such as key space analysis, key sensitivity, floating frequency, histograms, autocorrelation, correlation, entropy, and performance.

Abstract: This paper presents a pseudorandom number generator (PRNG) based on an enhanced Hénon map (EHM) and its implementation in software and hardware for chaos-based cryptosystems with high processing demands, such as image or video encryption. The proposed EHM presents better statistical properties and higher key sensitivity in comparison with the classic Hénon map (CHM), as shown by numerical tests such as bifurcation diagrams, largest Lyapunov exponent, Gottwald-Melbourne test, and histograms. The proposed 8-bit PRNG-EHM algorithm is implemented in MATLAB (software) and in FPGA technology (hardware) for experimental results. In the hardware implementation, we use VHDL and the Altera DE2-115 FPGA board with RS-232 serial port communication for data extraction; the extracted data are analyzed with MATLAB. At both the software and hardware levels, the proposed PRNG-EHM passes the NIST 800-22 statistical tests for randomness. For the first time in the literature, a comprehensive security analysis from a cryptographic point of view is presented for a hardware implementation, covering key space analysis, key sensitivity, floating frequency, histograms, autocorrelation, correlation, entropy, and performance. Comparisons of the proposed PRNG-EHM with recent similar schemes show its main advantages in security capabilities for cryptographic applications. According to the results, the proposed scheme can be used in chaos-based cryptographic applications in software or hardware implementations.
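
For orientation, the classic Hénon map that the paper uses as its baseline can be turned into a toy 8-bit generator in a few lines. This is a sketch of the classic map only, not the enhanced map proposed in the paper (whose equations are not given in the abstract). The parameters a = 1.4 and b = 0.3 are the textbook values; the byte-extraction rule, burn-in length, and the name `henon_bytes` are illustrative assumptions, and the output is not suitable for cryptographic use.

```python
def henon_bytes(n, x=0.0, y=0.0, a=1.4, b=0.3, burn_in=100):
    """Return n bytes derived from the classic Henon map.

    Map:  x' = 1 - a*x^2 + y,   y' = b*x
    The byte rule (scaled magnitude of x, modulo 256) is an assumption
    chosen for illustration only.
    """
    out = []
    for i in range(burn_in + n):
        # Both right-hand sides use the previous (x, y) pair.
        x, y = 1.0 - a * x * x + y, b * x
        if i >= burn_in:
            out.append(int(abs(x) * 1e6) % 256)
    return out


stream = henon_bytes(16)
print(stream)
```

The initial state (x, y) plays the role of the key in such schemes: the same state always reproduces the same byte stream.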

Journal Article•DOI•

A Comprehensive FPGA Reverse Engineering Tool-Chain: From Bitstream to RTL Code

Tao Zhang¹, Jian Wang¹, Guo Shize¹, Zhe Chen¹•Institutions (1)

University of Electronic Science and Technology of China¹

27 Feb 2019-IEEE Access

TL;DR: A new FPGA reverse engineering tool-chain, which is the first tool that can perform integrated, precise reverse engineering for FPGAs, paving the way for the netlist-/code-based HT detection.

Abstract: As recently studied, field-programmable gate arrays (FPGAs) suffer from growing Hardware Trojan (HT) attacks, and many techniques, e.g., register-transfer level (RTL) code-based analysis, have been presented to detect HTs on FPGAs. However, most FPGA end users can obtain only the bitstream, not the RTL code. Therefore, we present a new FPGA reverse engineering tool-chain. It can precisely transform the FPGA bitstream into RTL code and therefore assists in HT detection. In detail, we first construct an integrated database containing the FPGA architecture information and the bitstream mapping information. Then, we build two tools, namely, the bitstream reversal tool (BRT) and the netlist reversal tool (NRT). They can be combined to retrieve the RTL code from the FPGA bitstream in moderate time. To demonstrate the effectiveness of our tool-chain, we evaluate it qualitatively and quantitatively by using two benchmarks (ISCAS'85 and ISCAS'89) and three real applications (8051 core, 68HC08, and AES). Our tool-chain is comprehensive since it covers all the reverse engineering stages, from bitstream to netlist and from netlist to code, without any support from other tools. Moreover, it rebuilds the netlist with a 100% correct rate and retrieves RTL code that is exactly functionally equivalent to the original for all our benchmarks. To the best of our knowledge, it is the first tool that can perform integrated, precise reverse engineering for FPGAs, paving the way for netlist-/code-based HT detection.

Journal Article•DOI•

Design and FPGA Implementation of a Pseudorandom Number Generator Based on a Four-Wing Memristive Hyperchaotic System and Bernoulli Map

Fei Yu¹, Qiuzhen Wan², Jie Jin³, Lixiang Li¹, Binyong He¹, Li Liu¹, Shuai Qian¹, Yuanyuan Huang¹, Shuo Cai¹, Yun Song¹, Qiang Tang¹•Institutions (3)

Changsha University of Science and Technology¹, Hunan Normal University², Hunan University of Science and Technology³

28 Nov 2019-IEEE Access

TL;DR: This article presents a pseudorandom number generator (PRNG) based on a no-equilibrium four-wing memristive hyperchaotic system (FWMHS) and its implementation on a field-programmable gate array (FPGA) board.

Abstract: Random numbers are widely used in computing, digital signatures, secure communication, and information security. In recent years especially, with the large-scale application of smart cards and growing information security requirements, the demand for high-quality random number generators has become increasingly urgent. With the development of the theory of non-linear systems, the design of pseudorandom number generators (PRNGs) based on the chaotic behavior of non-linear systems provides a new theoretical basis and implementation method. This paper presents a PRNG based on a no-equilibrium four-wing memristive hyperchaotic system (FWMHS) and its implementation on a field-programmable gate array (FPGA) board. To increase the output throughput and the statistical quality of the generated bit sequences, we propose a PRNG design that uses a dual-entropy-source architecture with the FWMHS and a Bernoulli map. Simulation and experimental results verifying the feasibility of the FWMHS are also given. The proposed PRNG system is then modeled and simulated on the Vivado 2018.3 platform and implemented on the Xilinx ZYNQ-XC7Z020 FPGA evaluation board. The maximum operating frequency achieved is 135.04 MHz, with a speed of 62.5 Mbit/s. Finally, we have experimentally verified that the binary data obtained by this dual-entropy-source architecture pass the NIST 800-22, ENT, and AIS.31 statistical test suites, with XOR post-processing used for high throughput. The security analysis is carried out by means of dynamical degradation, key space, key sensitivity, correlation, and information entropy. Statistical tests and security analysis show that it has good pseudorandom characteristics and can be used in chaos-based cryptographic applications in hardware or software implementations.

Journal Article•DOI•

A novel high speed Artificial Neural Network–based chaotic True Random Number Generator on Field Programmable Gate Array

Murat Alcin¹, Ismail Koyuncu¹, Murat Tuna², Metin Varan³, Ihsan Pehlivan³•Institutions (3)

Afyon Kocatepe University¹, Kırklareli University², Sakarya University³

01 Mar 2019-International Journal of Circuit Theory and Applications

TL;DR: A novel type of high‐speed TRNG based on chaos and ANN implemented in a Xilinx field‐programmable gate array (FPGA) chip that can provide not only high throughput but also high quality random bit sequences for a wide variety of embedded cryptographic applications.

Journal Article•DOI•

Economic LSTM Approach for Recurrent Neural Networks

Kasem Khalil¹, Omar Eldash¹, Ashok Kumar¹, Magdy Bayoumi¹•Institutions (1)

University of Louisiana at Lafayette¹

25 Jun 2019-IEEE Transactions on Circuits and Systems II: Express Briefs

TL;DR: A new approach to Long Short-Term Memory (LSTM) that aims to reduce the cost of the computation unit and has fewer units compared to the existing LSTM versions which makes it very attractive in processing speed and hardware design cost.

Abstract: Recurrent Neural Networks (RNNs) have become a popular method for learning sequences of data. It is sometimes difficult to parallelize all RNN computations on conventional hardware because of their recurrent nature. One challenge with RNNs is finding an optimal structure, because of the complex hidden units that must be computed. This brief presents a new approach to Long Short-Term Memory (LSTM) that aims to reduce the cost of the computation unit. The proposed Economic LSTM (ELSTM) is designed using a few hardware units to perform its functionality. ELSTM has fewer units than existing LSTM versions, which makes it very attractive in terms of processing speed and hardware design cost. The proposed approach is tested using three datasets and compared with other methods. The simulation results show that the proposed method has accuracy comparable to that of other methods. At the hardware level, the proposed method is implemented on an Altera FPGA.

Journal Article•DOI•

Mitigating Electrical-level Attacks towards Secure Multi-Tenant FPGAs in the Cloud

Jonas Krautter¹, Dennis R. E. Gnad¹, Mehdi B. Tahoori¹•Institutions (1)

Karlsruhe Institute of Technology¹

13 Aug 2019-ACM Transactions on Reconfigurable Technology and Systems

TL;DR: This article shows the first attempt of a countermeasure against attacks on the electrical level, which is based on a bitstream checking methodology, and can provide a metric of potential risk of the FPGA bitstream being used in active fault or passive side-channel attacks against other users of the FPGA fabric or the entire SoC platform.

Abstract: A rising trend is the use of multi-tenant FPGAs, particularly in cloud environments, where partial access to the hardware is given to multiple third parties. This leads to new types of attacks in FPGAs, which operate not only on the logic level, but also on the electrical level through the common power delivery network. Since FPGAs are configured from the software-side, attackers are enabled to launch hardware attacks from software, impacting the security of an entire system. In this article, we show the first attempt of a countermeasure against attacks on the electrical level, which is based on a bitstream checking methodology. Bitstreams are translated back into flat technology mapped netlists, which are then checked for properties that indicate potential malicious runtime behavior of FPGA logic. Our approach can provide a metric of potential risk of the FPGA bitstream being used in active fault or passive side-channel attacks against other users of the FPGA fabric or the entire SoC platform.

Journal Article•DOI•

Efficient PUF-Based Key Generation in FPGAs Using Per-Device Configuration

Mohammad A. Usmani¹, Shahrzad Keshavarz¹, Eric Matthews², Lesley Shannon², Russel Tessier¹, Daniel Holcomb¹•Institutions (2)

University of Massachusetts Amherst¹, Simon Fraser University²

01 Feb 2019-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: This paper proposes a novel methodology of per-device PUF configuration and a new PUF variant derived from the popular FPGA-specific Anderson PUF, which has several advantages over existing work including the Anderson PUF on which it is based.

Abstract: Reconfigurable systems often require secret keys to encrypt and decrypt data. Applications requiring high security commonly generate keys based on physical unclonable functions (PUFs), circuits that use random manufacturing variations to produce secret keys that are unique to each device. Implementing PUFs on field-programmable gate arrays (FPGAs) is usually difficult, because the designer has limited control over layout, and each PUF system requires a large area overhead to correct errors in the PUF response bits. In this paper, we extend the state of the art for FPGA-based weak PUFs using a novel methodology of per-device configuration and a new PUF variant derived from the popular FPGA-specific Anderson PUF. The PUF is evaluated using Xilinx XC7Z020 programmable system-on-chips from the Virtex-7 family on Zynq ZedBoard platforms. The design we propose has several advantages over existing work including the Anderson PUF on which it is based. Our design is tunable to minimize the response bias and can be implemented using the common SLICEL components on Xilinx FPGAs. Moreover, the proposed PUF design enables an efficient per-device configuration that reduces bit error rate by over 10× at room temperature and improves response stability by over 2× across all temperatures. We demonstrate that the proposed per-device PUF configuration step leads to roughly 2× savings in area resources for PUFs and error correction as used in key generation.

Proceedings Article•DOI•

XPPE: cross-platform performance estimation of hardware accelerators using machine learning

Hosein Mohammadi Makrani¹, Hossein Sayadi¹, Tinoosh Mohsenin¹, Setareh Rafatirad¹, Avesta Sasan¹, Houman Homayoun¹•Institutions (1)

George Mason University¹

21 Jan 2019

TL;DR: XPPE, a neural-network-based cross-platform performance estimator that uses the resource utilization of an application on a specific FPGA to estimate its performance on other FPGAs, enables developers to explore the design space without having to fully implement and map the application.

Abstract: The increasing heterogeneity of the applications to be processed has ended the position of ASICs as the most efficient processing platform. Hybrid processing platforms such as CPU+FPGA are emerging as powerful platforms that efficiently support a diverse range of applications. Hardware/software co-design enables designers to take advantage of new hybrid platforms such as Zynq. However, dividing an application into two parts, one that runs on the CPU and one that is converted to a hardware accelerator implemented on the FPGA, makes platform selection difficult for developers, because the application's performance varies significantly across platforms. Developers must fully implement the design on each platform to estimate its performance. This process is tedious when the number of available platforms is large. To address this challenge, we propose XPPE, a neural-network-based cross-platform performance estimator. XPPE uses the resource utilization of an application on a specific FPGA to estimate its performance on other FPGAs. The proposed estimation is performed for a wide range of applications and evaluated against a vast set of platforms. Moreover, XPPE enables developers to explore the design space without fully implementing and mapping the application. Our evaluation results show that the correlation between the speedup estimated by XPPE and the actual speedup of applications on a hybrid platform over an ARM processor is more than 0.98.

Journal Article•DOI•

Shouji: a fast and efficient pre-alignment filter for sequence alignment

Mohammed Alser¹,²,³, Hasan Hassan², Akash Kumar¹, Onur Mutlu²,³, Can Alkan³•Institutions (3)

Dresden University of Technology¹, ETH Zurich², Bilkent University³

01 Nov 2019-Bioinformatics

TL;DR: Shouji is a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally costly dynamic programming algorithms and can be adapted for any bioinformatics pipeline that performs sequence alignment for verification.

Abstract: Motivation: The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm. Results: Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step. Availability and implementation: https://github.com/CMU-SAFARI/Shouji. Supplementary information: Supplementary data are available at Bioinformatics online.
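
The general idea of a pre-alignment filter, a cheap test that rejects clearly dissimilar pairs before the expensive dynamic-programming step, can be sketched as follows. This is not the Shouji algorithm: it swaps in a much simpler character-count bound, and the names `passes_filter` and `align_if_plausible` are hypothetical. The bound relies on the fact that one edit changes the character-count difference by at most 2, so the filter never rejects a pair that is within the edit threshold.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming (the costly step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def passes_filter(a: str, b: str, max_edits: int) -> bool:
    """Cheap lower bound: edit distance >= ceil(L1 count difference / 2)."""
    ca, cb = Counter(a), Counter(b)
    l1 = sum(abs(ca[ch] - cb[ch]) for ch in set(ca) | set(cb))
    return (l1 + 1) // 2 <= max_edits


def align_if_plausible(a: str, b: str, max_edits: int):
    """Run the DP only for pairs that survive the filter."""
    if not passes_filter(a, b, max_edits):
        return None  # rejected without running the DP
    d = edit_distance(a, b)
    return d if d <= max_edits else None


print(align_if_plausible("ACGTACGT", "ACGTTCGT", 2))
print(align_if_plausible("AAAAAAAA", "CCCCCCCC", 2))
```

A count-based bound like this lets many dissimilar pairs through; the filtering accuracy reported in the abstract comes from Shouji's own, more precise method.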

Journal Article•DOI•

High speed FPGA-based chaotic oscillator design

Murat Tuna¹, Murat Alcin², Ismail Koyuncu², Can Bülent Fidan³, Ihsan Pehlivan⁴•Institutions (4)

Kırklareli University¹, Afyon Kocatepe University², Karabük University³, Sakarya University⁴

01 Apr 2019-Microprocessors and Microsystems

TL;DR: Hardware-based design of Lu-Chen chaotic system can be used in various chaos-based embedded system applications including cryptography, secure communication and random number generation and provides better results than the alternatives with respect to FPGA resource utilization.

Journal Article•DOI•

An FPGA-Based Hardware Accelerator for CNNs Using On-Chip Memories Only: Design and Benchmarking with Intel Movidius Neural Compute Stick

Gianmarco Dinelli¹, Gabriele Meoni¹, Emilio Rapuano¹, Gionata Benelli, Luca Fanucci¹•Institutions (1)

University of Pisa¹

22 Oct 2019-International Journal of Reconfigurable Computing

TL;DR: A full on-chip field-programmable gate array hardware accelerator for a separable convolutional neural network designed for a keyword spotting application; the analysis shows that the FPGA solution achieves better inference time and energy per inference with comparable accuracy, at the expense of higher design effort and development time.

Abstract: In recent years, convolutional neural networks have been used for different applications, thanks to their ability to carry out tasks with fewer parameters than other deep learning approaches. However, the power consumption and memory footprint constraints typical of edge and portable applications usually conflict with accuracy and latency requirements. For these reasons, commercial hardware accelerators have become popular, thanks to architectures designed for the inference of general convolutional neural network models. Nevertheless, field-programmable gate arrays are an interesting alternative, since they make it possible to implement a hardware architecture tailored to a specific convolutional neural network model, with promising results in terms of latency and power consumption. In this article, we propose a full on-chip field-programmable gate array hardware accelerator for a separable convolutional neural network, which was designed for a keyword spotting application. We started from the model implemented in a previous work for the Intel Movidius Neural Compute Stick. For our goals, we appropriately quantized this model through a bit-true simulation, and we realized a dedicated architecture using on-chip memories exclusively. We carried out a benchmark comparing the results on different field-programmable gate array families by Xilinx and Intel with the implementation on the Neural Compute Stick. The analysis shows that the FPGA solution achieves better inference time and energy per inference with comparable accuracy, at the expense of higher design effort and development time.

Journal Article•DOI•

Recent Attacks and Defenses on FPGA-based Systems

Jiliang Zhang¹, Gang Qu²•Institutions (2)

Hunan University¹, University of Maryland, College Park²

21 Aug 2019-ACM Transactions on Reconfigurable Technology and Systems

TL;DR: Field-programmable gate array (FPGA) is a kind of programmable chip that is widely used in many areas, including automotive electronics, medical devices, military and consumer electronics, and is gaining more popularity.

Abstract: Field-programmable gate array (FPGA) is a kind of programmable chip that is widely used in many areas, including automotive electronics, medical devices, military and consumer electronics, and is gaining more popularity. Unlike the application specific integrated circuits (ASIC) design, an FPGA-based system has its own supply-chain model and design flow, which brings interesting security and trust challenges. In this survey, we review the security and trust issues related to FPGA-based systems from the market perspective, where we model the market with the following parties: FPGA vendors, foundries, IP vendors, EDA tool vendors, FPGA-based system developers, and end-users. For each party, we show the security and trust problems they need to be aware of and the associated solutions that are available. We also discuss some challenges and opportunities in the security and trust of FPGA-based systems used in large-scale cloud and datacenters.
