
Showing papers on "Field-programmable gate array published in 2017"


Proceedings ArticleDOI
18 Jun 2017
TL;DR: This paper implements a CNN on an FPGA using a systolic array architecture that can achieve high clock frequency under high resource utilization, provides an analytical model for performance and resource utilization, and develops an automatic design space exploration framework.
Abstract: Convolutional neural networks (CNNs) have been widely applied in many deep learning applications. In recent years, FPGA implementations of CNNs have attracted much attention because of their high performance and energy efficiency. However, existing implementations have difficulty fully leveraging the computation power of the latest FPGAs. In this paper we implement a CNN on an FPGA using a systolic array architecture, which can achieve high clock frequency under high resource utilization. We provide an analytical model for performance and resource utilization and develop an automatic design space exploration framework, as well as a source-to-source code transformation from a C program to a CNN implementation using a systolic array. The experimental results show that our framework is able to generate accelerators for real-life CNN models, achieving up to 461 GFlops for the floating-point data type and 1.2 Tops for 8-16 bit fixed point.
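As a rough behavioural illustration of the dataflow such an architecture relies on (a sketch of the general systolic-array idea, not the authors' design; the array size and data below are hypothetical), the following Python model emulates an output-stationary systolic array multiplying an im2col-style input matrix by a weight matrix:

```python
import numpy as np

def systolic_matmul(A, B):
    """Emulate an output-stationary systolic array computing C = A @ B.

    Each (i, j) PE accumulates one output element; operands are skewed so
    that a[i, k] and b[k, j] meet at PE (i, j) in cycle i + j + k.
    This is a behavioural model only, not a hardware description."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    total_cycles = M + N + K - 3          # cycle at which the last operand pair meets
    for t in range(total_cycles + 1):     # one MAC per PE per cycle
        for i in range(M):
            for j in range(N):
                k = t - i - j             # skewed arrival time of the operands
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(-8, 8, size=(4, 6))
    B = rng.integers(-8, 8, size=(6, 5))
    assert np.array_equal(systolic_matmul(A, B), A @ B)
```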

363 citations


Journal ArticleDOI
TL;DR: A reconfigurable yet simple silicon waveguide mesh is demonstrated, realizing over 20 different functionalities with a seven-hexagonal-cell structure that can be applied to different fields including communications, chemical and biomedical sensing, signal processing, multiprocessor networks, and quantum information systems.
Abstract: Integrated photonics changes the scaling laws of information and communication systems, offering architectural choices that combine photonics with electronics to optimize performance, power, footprint, and cost. Application-specific photonic integrated circuits, where particular circuits/chips are designed to optimally perform particular functionalities, require a considerable number of design and fabrication iterations, leading to long development times. A different approach, inspired by electronic Field Programmable Gate Arrays, is the programmable photonic processor, where a common hardware implemented by a two-dimensional photonic waveguide mesh realizes different functionalities through programming. Here, we report the demonstration of such a reconfigurable waveguide mesh in silicon. We demonstrate over 20 different functionalities with a simple seven-hexagonal-cell structure, which can be applied to different fields including communications, chemical and biomedical sensing, signal processing, multiprocessor networks, and quantum information systems. Our work is an important step toward this paradigm. Integrated optical circuits today are typically designed for a few special functionalities and require complex design and development procedures. Here, the authors demonstrate a reconfigurable but simple silicon waveguide mesh with different functionalities.

358 citations


Proceedings ArticleDOI
22 Feb 2017
TL;DR: This work systematically explore the trade-offs of hardware cost by searching the design variable configurations, and proposes a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance.
Abstract: As convolution layers contribute most of the operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize memory access and data movement while maximizing resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing the end-to-end VGG-16 CNN model, achieving 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2× enhancement compared to state-of-the-art FPGA implementations of the VGG model.
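To make the loop structure concrete, here is a plain Python rendering of the four convolution loop levels the abstract refers to, followed by a tiled variant; the tensor shapes and tile sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def conv_direct(inp, weights):
    """Direct convolution with the four loop levels discussed above
    (stride 1, no padding). Shapes: inp (Nif, H, W), weights (Nof, Nif, K, K)."""
    Nof, Nif, K, _ = weights.shape
    _, H, W = inp.shape
    out = np.zeros((Nof, H - K + 1, W - K + 1))
    for of in range(Nof):                        # Loop-4: output feature maps
        for oy in range(H - K + 1):              # Loop-3: output rows
            for ox in range(W - K + 1):          #         and columns
                for inf in range(Nif):           # Loop-2: input feature maps
                    for ky in range(K):          # Loop-1: kernel window
                        for kx in range(K):
                            out[of, oy, ox] += (inp[inf, oy + ky, ox + kx]
                                                * weights[of, inf, ky, kx])
    return out

def conv_tiled(inp, weights, Tof=4, Tox=8):
    """Same arithmetic with tiling over output maps and output columns,
    mimicking how an accelerator buffers one tile on chip and unrolls the
    innermost MACs in hardware; the tile sizes are arbitrary examples."""
    Nof, Nif, K, _ = weights.shape
    _, H, W = inp.shape
    OH, OW = H - K + 1, W - K + 1
    out = np.zeros((Nof, OH, OW))
    for of0 in range(0, Nof, Tof):               # tile loop: output feature maps
        for ox0 in range(0, OW, Tox):            # tile loop: output columns
            for of in range(of0, min(of0 + Tof, Nof)):
                for oy in range(OH):
                    for ox in range(ox0, min(ox0 + Tox, OW)):
                        acc = 0.0
                        for inf in range(Nif):
                            for ky in range(K):
                                for kx in range(K):
                                    acc += (inp[inf, oy + ky, ox + kx]
                                            * weights[of, inf, ky, kx])
                        out[of, oy, ox] = acc
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 16, 16))
    w = rng.standard_normal((8, 3, 3, 3))
    assert np.allclose(conv_direct(x, w), conv_tiled(x, w))
```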

348 citations


Journal ArticleDOI
TL;DR: This paper designs the deep learning accelerator unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks that uses a field-programmable gate array (FPGA) as the hardware prototype and employs three pipelined processing units to improve throughput.
Abstract: As an emerging field of machine learning, deep learning shows an excellent ability to solve complex learning problems. However, the size of the networks becomes increasingly large due to the demands of practical applications, which poses a significant challenge to constructing high-performance implementations of deep learning neural networks. In order to improve performance while maintaining a low power cost, in this paper we design the deep learning accelerator unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks that uses a field-programmable gate array (FPGA) as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve throughput and utilizes tiling techniques to exploit locality for deep learning applications. Experimental results on a state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator is able to achieve up to 36.1× speedup compared to Intel Core2 processors, with a power consumption of 234 mW.

268 citations


Proceedings ArticleDOI
22 Feb 2017
TL;DR: An analytical performance model is proposed and an in-depth analysis on the resource requirement of CNN classifier kernels and available resources on modern FPGAs are performed and a new kernel design is proposed to effectively address such bandwidth limitation and to provide an optimal balance between computation, on-chip, and off-chip memory access.
Abstract: OpenCL for FPGAs has recently gained great popularity with emerging needs for workload acceleration such as the Convolutional Neural Network (CNN), the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGAs, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources in the FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out to be insufficient to achieve desirable performance for both compute-intensive and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis of the resource requirements of CNN classifier kernels and the available resources on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design to effectively address this bandwidth limitation and to provide an optimal balance between computation, on-chip memory access, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating-point performance at a 370 MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385 MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.
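The following toy model sketches the kind of analytical reasoning described, under our own simplifying assumptions (the parameters and the cost formula are invented for illustration, not the paper's model): a kernel is bound by whichever is larger, its compute time or its on-chip memory transfer time.

```python
def kernel_bound(num_ops, ops_per_cycle, bytes_moved, onchip_bytes_per_cycle):
    """Crude analytical bound: a kernel is limited by whichever is larger,
    its compute time or its on-chip memory transfer time (both in cycles)."""
    compute_cycles = num_ops / ops_per_cycle
    memory_cycles = bytes_moved / onchip_bytes_per_cycle
    bottleneck = "memory" if memory_cycles > compute_cycles else "compute"
    return max(compute_cycles, memory_cycles), bottleneck

# Example: a convolution tile with 1e6 MACs on 512 parallel MAC units,
# moving 4 MB through on-chip buffers that sustain 64 bytes/cycle.
cycles, limit = kernel_bound(1e6, 512, 4 * 2**20, 64)
print(f"{cycles:.0f} cycles, {limit}-bound")
```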

199 citations


Posted Content
TL;DR: An investigation from software to hardware and from the circuit level to the system level is carried out to complete the analysis of FPGA-based neural network inference accelerator design and serve as a guide for future work.
Abstract: Recent research on neural networks has shown significant advantages in machine learning over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in areas like image, speech, and video recognition. However, the high computation and storage complexity of neural network inference poses great difficulty for its application. CPU platforms are hard-pressed to offer enough computation capacity. GPU platforms are the first choice for neural network processing because of their high computation capacity and easy-to-use development frameworks. Meanwhile, FPGA-based neural network inference accelerators are becoming a research topic. With specifically designed hardware, FPGAs are the next possible solution to surpass GPUs in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on FPGA-based neural network inference accelerators and summarize the main techniques used. An investigation from software to hardware and from the circuit level to the system level is carried out to complete the analysis of FPGA-based neural network inference accelerator design and to serve as a guide for future work.

137 citations


Proceedings ArticleDOI
22 Feb 2017
TL;DR: ForeGraph, a large-scale graph processing framework based on the multi-FPGA architecture, is proposed, which outperforms state-of-the-art FPGA-based large- scale graph processing systems by 4.54x when executing PageRank on the Twitter graph.
Abstract: The performance of large-scale graph processing suffers from challenges including poor locality, lack of scalability, random access pattern, and heavy data conflicts. Some characteristics of FPGA make it a promising solution to accelerate various applications. For example, on-chip block RAMs can provide high throughput for random data access. However, large-scale processing on a single FPGA chip is constrained by limited on-chip memory resources and off-chip bandwidth. Using a multi-FPGA architecture may alleviate these problems to some extent, while the data partitioning and communication schemes should be considered to ensure the locality and reduce data conflicts. In this paper, we propose ForeGraph, a large-scale graph processing framework based on the multi-FPGA architecture. In ForeGraph, each FPGA board only stores a partition of the entire graph in off-chip memory. Communication over partitions is reduced. Vertices and edges are sequentially loaded onto the FPGA chip and processed. Under our scheduling scheme, each FPGA chip performs graph processing in parallel without conflicts. We also analyze the impact of system parameters on the performance of ForeGraph. Our experimental results on Xilinx Virtex UltraScale XCVU190 chip show ForeGraph outperforms state-of-the-art FPGA-based large-scale graph processing systems by 4.54x when executing PageRank on the Twitter graph (1.4 billion edges). The average throughput is over 900 MTEPS in our design and 2.03x larger than previous work.

133 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: SC-DCNN, the first comprehensive design and optimization framework for SC-based DCNNs using a bottom-up approach, is presented; it is holistically optimized to minimize area and power (energy) consumption while maintaining high network accuracy.
Abstract: With the recent advances in wearable devices and the Internet of Things (IoT), it becomes attractive to implement Deep Convolutional Neural Networks (DCNNs) in embedded and portable systems. Currently, executing software-based DCNNs requires high-performance servers, restricting widespread deployment on embedded and mobile IoT devices. To overcome this obstacle, considerable research efforts have been made to develop highly-parallel and specialized DCNN accelerators using GPGPUs, FPGAs, or ASICs. Stochastic Computing (SC), which uses a bit-stream to represent a number within [-1, 1] by counting the number of ones in the bit-stream, has high potential for implementing DCNNs with high scalability and ultra-low hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power (energy) and hardware footprint can be achieved compared to conventional binary arithmetic implementations. The tremendous savings in power (energy) and hardware resources allow an immense design space for enhancing scalability and robustness for hardware DCNNs. This paper presents SC-DCNN, the first comprehensive design and optimization framework for SC-based DCNNs, using a bottom-up approach. We first present the designs of function blocks that perform the basic operations in a DCNN, including inner product, pooling, and activation function. Then we propose four designs of feature extraction blocks, which are in charge of extracting features from input feature maps, by connecting different basic function blocks with joint optimization. Moreover, efficient weight storage methods are proposed to reduce the area and power (energy) consumption. Putting it all together, with feature extraction blocks carefully selected, SC-DCNN is holistically optimized to minimize area and power (energy) consumption while maintaining high network accuracy. Experimental results demonstrate that the LeNet5 implemented in SC-DCNN consumes only 17 mm2 of area and 1.53 W of power, and achieves a throughput of 781,250 images/s, an area efficiency of 45,946 images/s/mm2, and an energy efficiency of 510,734 images/J.
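A minimal software simulation of the stochastic-computing arithmetic described above is sketched below. The paper mentions AND gates and multiplexers; the bipolar XNOR multiplier shown here is the standard signed-value variant for numbers in [-1, 1], and the stream length and test values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1 << 16                     # bit-stream length (longer -> lower variance)

def encode_bipolar(x):
    """Encode x in [-1, 1] as a random bit-stream with P(1) = (x + 1) / 2."""
    return rng.random(N) < (x + 1) / 2

def decode_bipolar(bits):
    return 2 * bits.mean() - 1

def sc_multiply(a_bits, b_bits):
    """Bipolar SC multiplication is a bitwise XNOR of the two streams."""
    return ~(a_bits ^ b_bits)

def sc_scaled_add(a_bits, b_bits):
    """Scaled addition (a + b) / 2 with a multiplexer driven by a fair coin."""
    sel = rng.random(N) < 0.5
    return np.where(sel, a_bits, b_bits)

a, b = 0.6, -0.3
prod = decode_bipolar(sc_multiply(encode_bipolar(a), encode_bipolar(b)))
summ = decode_bipolar(sc_scaled_add(encode_bipolar(a), encode_bipolar(b)))
print(f"a*b ~ {prod:+.3f} (exact {a*b:+.3f}), (a+b)/2 ~ {summ:+.3f}")
```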

130 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGAs and still keep the benefits of low-level hardware optimization.
Abstract: Convolutional neural networks (CNNs) are rapidly evolving and being applied to a broad range of applications. Given a specific application, an increasing challenge is to search for the appropriate CNN algorithm and efficiently map it to the target hardware. The FPGA-based accelerator has the advantages of reconfigurability and flexibility, and has achieved high performance and low power. Without a general compiler to automate the implementation, however, significant efforts and expertise are still required to customize the design for each CNN model. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA while still keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The implementation of each module is optimized at the RTL level. Given a CNN algorithm, its structure is abstracted to a directed acyclic graph (DAG) and then compiled with RTL modules in the library. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topologies, e.g., ResNet, can be compiled. The proposed methodology is demonstrated with end-to-end FPGA implementations of various CNN algorithms (e.g., NiN, VGG-16, ResNet-50, and ResNet-152) on two standalone Intel FPGAs, Stratix V and Arria 10. The performance and overhead of the automated compilation are evaluated. The compiled FPGA accelerators exhibit superior performance compared to state-of-the-art automation-based works by >2× for various CNNs.
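Purely as an illustration of this compile flow (the module names, parameters, and miniature network below are invented for illustration, not the authors' library), a CNN can be abstracted to a DAG of layer nodes and each node matched against a library of parameterized RTL templates:

```python
# Hypothetical miniature of the compile flow: layer DAG -> RTL module instances.
RTL_LIBRARY = {
    "conv":   lambda p: f"conv_unit  #(K={p['k']}, CI={p['ci']}, CO={p['co']})",
    "pool":   lambda p: f"pool_unit  #(K={p['k']})",
    "eltadd": lambda p: "eltwise_add_unit",        # needed for ResNet shortcuts
}

# A tiny residual block expressed as a DAG: node -> (op, params, predecessors)
dag = {
    "conv1": ("conv", {"k": 3, "ci": 64, "co": 64}, ["input"]),
    "conv2": ("conv", {"k": 3, "ci": 64, "co": 64}, ["conv1"]),
    "add":   ("eltadd", {}, ["conv2", "input"]),   # irregular (skip) edge
    "pool":  ("pool", {"k": 2}, ["add"]),
}

def topo_order(dag):
    """Layer-by-layer execution schedule: emit each node once its inputs are done."""
    done, order = {"input"}, []
    while len(order) < len(dag):
        for name, (_, _, preds) in dag.items():
            if name not in done and all(p in done for p in preds):
                order.append(name)
                done.add(name)
    return order

for name in topo_order(dag):
    op, params, preds = dag[name]
    print(f"{name:6s} <- {RTL_LIBRARY[op](params)}  inputs={preds}")
```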

116 citations


Journal ArticleDOI
TL;DR: Experimental results confirm the efficiency of the proposed MPPT method in global peak tracking and its high accuracy in handling partial shading; the comparison with the P&O and PSO methods shows that the proposed method outperforms them in terms of global search ability and dynamic performance.

107 citations


Journal ArticleDOI
TL;DR: The streaming method acts as a compiler, transforming a high-level representation of DCNNs into operation codes to execute applications in a hardware accelerator, utilizing the maximum computational resources available based on a novel scheduled routing topology that combines data reuse and data concatenation.
Abstract: Deep convolutional neural networks (DCNNs) have become a very powerful tool in visual perception. DCNNs have applications in autonomous robots, security systems, mobile phones, and automobiles, where high throughput of the feedforward evaluation phase and power efficiency are important. Because of this increased usage, many field-programmable gate array (FPGA)-based accelerators have been proposed. In this paper, we present an optimized streaming method for DCNNs’ hardware accelerator on an embedded platform. The streaming method acts as a compiler, transforming a high-level representation of DCNNs into operation codes to execute applications in a hardware accelerator. The proposed method utilizes the maximum computational resources available based on a novel scheduled routing topology that combines data reuse and data concatenation. It is tested with a hardware accelerator implemented on the Xilinx Kintex-7 XC7K325T FPGA. The system fully explores weight-level and node-level parallelizations of DCNNs and achieves a peak performance of 247 G-ops while consuming less than 4 W of power. We test our system with applications on object classification and object detection in real-world scenarios. Our results indicate high performance efficiency, outperforming all other presented platforms while running these applications.

Proceedings ArticleDOI
28 May 2017
TL;DR: Three hardware accelerators for RNNs on Xilinx's Zynq SoC FPGA are presented to show how to overcome the challenges involved in developing RNN accelerators; each design uses different strategies to achieve high performance and scalability.
Abstract: Recurrent Neural Networks (RNNs) have the ability to retain memory and learn from data sequences, which are fundamental for real-time applications. RNN computations offer limited data reuse, which leads to high data traffic. This translates into high off-chip memory bandwidth or large internal storage requirements to achieve high performance. Exploiting parallelism in RNN computations is bounded by these two limiting factors, among other constraints present in embedded systems. Therefore, a balance between internally stored data and off-chip memory data transfer is necessary to overlap computation time with data transfer latency. In this paper, we present three hardware accelerators for RNNs on Xilinx's Zynq SoC FPGA to show how to overcome the challenges involved in developing RNN accelerators. Each design uses different strategies to achieve high performance and scalability. Each co-processor was tested with a character-level language model. The latest design, called DeepRnn, achieves up to 23X better performance per power than the Tegra X1 development board for this application.

Proceedings ArticleDOI
13 Nov 2017
TL;DR: This work hypothesizes that existing hardware construction languages and novel hardware compiler frameworks can put hardware development on a similar evolutionary path by enabling new hardware libraries to be independent of underlying process technologies including FPGA mappings.
Abstract: Enabled by modern languages and retargetable compilers, software development is in a virtual "Cambrian explosion" driven by a critical mass of powerfully parameterized libraries; but hardware development practices lag far behind. We hypothesize that existing hardware construction languages (HCLs) and novel hardware compiler frameworks (HCFs) can put hardware development on a similar evolutionary path by enabling new hardware libraries to be independent of underlying process technologies, including FPGA mappings. We support this claim by (1) evaluating the degree to which Chisel, an existing HCL, can support powerfully parameterized libraries, and (2) introducing the concept and implementation of an HCF that uses an open-source hardware intermediate representation, FIRRTL (Flexible Intermediate Representation for RTL), to transform target-independent RTL into technology-specific RTL. Finally, we evaluate many hardware compiler transformations, including simplifying transformations, analyses, optimizations, instrumentations, and specializations, which demonstrate the power of a combined HCL and HCF approach.

Journal ArticleDOI
TL;DR: The image processing language Halide is extended so users can specify which portions of their applications should become hardware accelerators, and a compiler is provided that uses this code to automatically create the accelerator along with the “glue” code needed for the user’s application to access this hardware.
Abstract: Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented reality. But creating, “programming,” and integrating this hardware into a hardware/software system is difficult. We address this problem by extending the image processing language Halide so users can specify which portions of their applications should become hardware accelerators, and then we provide a compiler that uses this code to automatically create the accelerator along with the “glue” code needed for the user’s application to access this hardware. Starting with Halide not only provides a very high-level functional description of the hardware but also allows our compiler to generate a complete software application, which accesses the hardware for acceleration when appropriate. Our system also provides high-level semantics to explore different mappings of applications to a heterogeneous system, including the flexibility of being able to change the throughput rate of the generated hardware. We demonstrate our approach by mapping applications to a commercial Xilinx Zynq system. Using its FPGA with two low-power ARM cores, our design achieves up to 6× higher performance and 38× lower energy compared to the quad-core ARM CPU on an NVIDIA Tegra K1, and 3.5× higher performance with 12× lower energy compared to the K1’s 192-core GPU.

Proceedings ArticleDOI
09 May 2017
TL;DR: This work integrates the hardware accelerator into MonetDB, a main-memory column store, and demonstrates a significant improvement in response time and throughput, and provides a novel and efficient implementation of two commonly used SQL operators for strings.
Abstract: Taking advantage of recently released hybrid multicore architectures, such as Intel's Xeon+FPGA machine, where the FPGA has coherent access to the main memory through the QPI bus, we explore the benefits of specializing operators to hardware. We focus on two commonly used SQL operators for strings, LIKE and REGEXP_LIKE, and provide a novel and efficient implementation of these operators in reconfigurable hardware. We integrate the hardware accelerator into MonetDB, a main-memory column store, and demonstrate a significant improvement in response time and throughput. Our Hardware User Defined Function (HUDF) can speed up complex pattern matching by an order of magnitude in comparison to the database running on a 10-core CPU. The insights gained from integrating hardware-based string operators into MonetDB should also be useful for future designs combining hardware specialization and databases.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: A design framework for DNNs is presented that uses highly configurable IPs for neural network layers together with a new design space exploration engine for Resource Allocation Management (REALM) to further improve the FPGA solution.
Abstract: FPGA is a promising candidate for the acceleration of Deep Neural Networks (DNN) with improved latency and energy consumption compared to CPU and GPU-based implementations. DNNs use sequences of layers of regular computation that are well suited for HLS-based design for FPGA. However, optimizing large neural networks under resource constraints is still a key challenge. HLS must manage on-chip computation, buffering resources, and off-chip memory accesses to minimize the total latency. In this paper, we present a design framework for DNNs that uses highly configurable IPs for neural network layers together with a new design space exploration engine for Resource Allocation Management (REALM). We also carry out efficient memory subsystem design and fixed-point weight re-training to further improve our FPGA solution. We demonstrate our design framework on the Long-term Recurrent Convolution Network for video inputs. Our implementation on a Xilinx VC709 board achieves 3.1X speedup compared to an NVIDIA K80 and 4.75X speedup compared to an Intel Xeon with 17.5X lower energy per image.

Journal ArticleDOI
TL;DR: A scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration, and a systematic design space exploration methodology is put forward to search for the optimal solution that maximizes accelerator throughput under the FPGA constraints.
Abstract: Deep convolutional neural networks (CNNs) have gained great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computation intensive and memory expensive and, hence, are mainly processed on high-performance processors like server CPUs and GPUs. However, there is an increasing demand for high-accuracy or real-time object detection tasks in large-scale clusters or embedded systems, which requires energy-efficient accelerators because of green computing requirements or limited battery capacity. Due to their advantages of energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of the computation complexity and the memory footprint of each CNN layer type. Then a scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration. We further put forward a systematic design space exploration methodology to search for the optimal solution that maximizes accelerator throughput under FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The average performance of the three accelerators is 424.7, 445.6, and 473.4 GOP/s under a 100 MHz working frequency, which outperforms the CPU and previous work significantly.
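A toy version of such a design space exploration, with made-up resource budgets and cost formulas rather than the paper's model, might enumerate unroll factors and keep the feasible configuration with the highest throughput:

```python
from itertools import product

# Hypothetical FPGA budgets, buffer cost formula, and clock; illustrative only.
DSP_BUDGET, BRAM_BUDGET = 2800, 1500
CLOCK_HZ = 100e6

best = None
for p_of, p_if, p_x, p_y in product([1, 2, 4, 8, 16], repeat=4):
    macs_per_cycle = p_of * p_if * p_x * p_y         # unrolled MAC units
    dsps = macs_per_cycle                            # assume one DSP per MAC
    brams = 4 * p_of * p_if + 8 * p_x * p_y          # crude on-chip buffer cost
    if dsps > DSP_BUDGET or brams > BRAM_BUDGET:
        continue                                     # violates FPGA constraints
    gops = 2 * macs_per_cycle * CLOCK_HZ / 1e9       # 2 ops (mul + add) per MAC
    if best is None or gops > best[0]:
        best = (gops, (p_of, p_if, p_x, p_y), dsps, brams)

gops, unroll, dsps, brams = best
print(f"best unroll {unroll}: {gops:.1f} GOP/s using {dsps} DSPs, {brams} BRAM-equiv")
```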

Journal ArticleDOI
TL;DR: In this paper, the authors describe the hardware, gateware, and software developed at Raytheon BBN Technologies for dynamic quantum information processing experiments on superconducting qubits, where real-time qubit state information is fed back or fed forward within a fraction of the qubits' coherence time to dynamically change the implemented sequence.
Abstract: We describe the hardware, gateware, and software developed at Raytheon BBN Technologies for dynamic quantum information processing experiments on superconducting qubits. In dynamic experiments, real-time qubit state information is fed back or fed forward within a fraction of the qubits' coherence time to dynamically change the implemented sequence. The hardware presented here covers both control and readout of superconducting qubits. For readout, we created a custom signal processing gateware and software stack on commercial hardware to convert pulses in a heterodyne receiver into qubit state assignments with minimal latency, alongside data taking capability. For control, we developed custom hardware with gateware and software for pulse sequencing and steering information distribution that is capable of arbitrary control flow in a fraction of superconducting qubit coherence times. Both readout and control platforms make extensive use of field programmable gate arrays to enable tailored qubit control systems in a reconfigurable fabric suitable for iterative development.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: A binarized CNN, which uses only binary values for the inputs and the weights, is employed, which can realize a more compact and faster CNN than conventional ones.
Abstract: A pre-trained convolutional deep neural network (CNN) is widely used for embedded systems, which require high power and area efficiency. In that case, the CPU is too slow, the embedded GPU dissipates much power, and the ASIC cannot keep up with the rapid progress of CNN variations. This paper uses a binarized CNN which uses only binary values for the inputs and the weights. Since the multiplier is replaced by an XNOR circuit, we can realize a high-performance MAC circuit by using many XNOR circuits. In the paper, we eliminate the internal FC layers except the last one and then insert a binarized average pooling layer, which can be realized by a majority circuit for binarized (1/0) values. In that case, since the weight memory is replaced by a 1's counter, we can realize a more compact and faster CNN than the conventional ones. We implemented the VGG-11 benchmark CNN for the CIFAR10 image classification task on the Xilinx Inc. Zedboard. Compared with conventional binarized implementations on an FPGA, the classification accuracy was almost the same, the performance per power efficiency is 5.1 times better, the performance per area efficiency is 8.0 times better, and the performance per memory is 8.2 times better.
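The XNOR-based MAC and the majority-vote pooling described above can be sketched in a few lines of Python (the vector length and data are arbitrary; this is a behavioural model, not the paper's circuit):

```python
import numpy as np

rng = np.random.default_rng(2)

def bin_mac(x_bits, w_bits):
    """Binarized dot product: XNOR then popcount, mapped back to the +/-1 domain.
    x_bits and w_bits are 0/1 arrays standing for -1/+1 values."""
    xnor = ~(x_bits ^ w_bits)            # 1 wherever the +/-1 product is +1
    ones = int(np.count_nonzero(xnor))
    return 2 * ones - x_bits.size        # sum of the +/-1 products

def bin_avg_pool(window_bits):
    """Binarized average pooling as a majority vote over the window."""
    return int(np.count_nonzero(window_bits) * 2 > window_bits.size)

x = rng.integers(0, 2, 64, dtype=np.uint8).astype(bool)
w = rng.integers(0, 2, 64, dtype=np.uint8).astype(bool)
ref = int(np.sum((2 * x.astype(int) - 1) * (2 * w.astype(int) - 1)))
assert bin_mac(x, w) == ref
print("XNOR-popcount MAC:", bin_mac(x, w), "majority pool:", bin_avg_pool(x[:4]))
```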

Journal ArticleDOI
TL;DR: The designed chaotic oscillator has been tested for True Random Number Generation (TRNG), and it has been shown that the proposed design can be used in embedded cryptologic applications.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents a highly versatile FPGA-friendly architecture for TNNs in which both the number of bits of the input data and the level of parallelism can be varied at synthesis time, allowing throughput to be traded for hardware resources and power consumption.
Abstract: Thanks to their excellent performance on typical artificial intelligence problems, deep neural networks have drawn a lot of interest lately. However, this comes at the cost of large computational needs and high power consumption. Benefiting from high precision at acceptable hardware cost on these difficult problems is a challenge. To address it, we advocate the use of ternary neural networks (TNNs) that, when properly trained, can reach results close to the state of the art using floating-point arithmetic. We present a highly versatile FPGA-friendly architecture for TNNs in which we can vary both the number of bits of the input data and the level of parallelism at synthesis time, allowing throughput to be traded for hardware resources and power consumption. To demonstrate the efficiency of our proposal, we implement high-complexity convolutional neural networks on the Xilinx Virtex-7 VC709 FPGA board. While reaching a better accuracy than comparable designs, we can target either high throughput or low power. We measure a throughput of up to 27,000 fps at ≈7 W or up to 8.36 TMAC/s at ≈13 W.

Proceedings ArticleDOI
27 Mar 2017
TL;DR: The capabilities, API, and performance evaluation of BitMan are described; it includes high-level commands such as cutting out regions of a bitstream and placing or relocating modules on an FPGA, as well as low-level commands for modifying primitives and for routing clock networks or rerouting signal connections at run-time.
Abstract: To fully support the partial reconfiguration capabilities of FPGAs, this paper introduces the tool and API BitMan for generating and manipulating configuration bitstreams. BitMan supports recent Xilinx FPGAs that can be used with the ISE and Vivado tool suites of the FPGA vendor Xilinx, including the latest Virtex-6, 7 Series, UltraScale, and UltraScale+ series FPGAs. The functionality includes high-level commands such as cutting out regions of a bitstream and placing or relocating modules on an FPGA, as well as low-level commands for modifying primitives and for routing clock networks or rerouting signal connections at run-time. All this is possible without the vendor CAD tools, allowing BitMan to be used even on embedded CPUs. The paper describes the capabilities, API, and performance evaluation of BitMan.

Proceedings ArticleDOI
18 Jun 2017
TL;DR: This paper is the first attempt to thoroughly analyze various robust machine learning methods for classifying benign and malware applications, and shows OneR to be the most cost-effective classifier, with more than 80% accuracy and a fast execution time of less than 10 ns, achieving the highest accuracy per logic area.
Abstract: Detection of malicious software at the hardware level is emerging as an effective solution to increasing security threats. Hardware-based detectors rely on Machine Learning (ML) classifiers to detect malware-like execution patterns based on Hardware Performance Counter (HPC) information at run-time. The effectiveness of these learning methods mainly relies on the information provided by a limited number of expensive-to-implement HPCs. This paper is the first attempt to thoroughly analyze various robust machine learning methods for classifying benign and malware applications. Given the limited availability of HPCs, the analysis results help guide architectural decisions on which hardware performance counters are needed most to effectively improve ML classification accuracy. For the software implementation, we fully implemented these classifiers in the OS kernel to understand the various software overheads. The software implementations of these classifiers are found to be relatively slow, with execution times in the range of milliseconds, an order of magnitude higher than the latency needed to capture malware at run-time. This calls for hardware-accelerated implementations of these algorithms. For the hardware implementation, we have synthesized the studied classifier models on an FPGA to compare various design parameters including logic area, power, and latency. The results show that while complex ML classifiers such as MultiLayerPerceptron and logistic regression achieve close to 90% accuracy, after taking into consideration their implementation overheads they perform worse in terms of PDP, accuracy/area, and latency than simpler but slightly less accurate rule-based and tree-based classifiers. Our results further show OneR to be the most cost-effective classifier, with more than 80% accuracy and a fast execution time of less than 10 ns, achieving the highest accuracy per logic area while relying mainly on the information from a single branch-instruction HPC.
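As a hedged illustration of why a rule-based classifier such as OneR maps to tiny hardware, the sketch below trains a one-rule threshold on a single synthetic branch-instruction HPC feature; the data and the threshold search are invented for illustration and are not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic HPC samples: one branch-instruction counter reading per window,
# labels 1 = malware-like, 0 = benign. Purely illustrative data.
benign  = rng.normal(1200, 150, 500)
malware = rng.normal(1800, 200, 500)
counts  = np.concatenate([benign, malware])
labels  = np.concatenate([np.zeros(500, int), np.ones(500, int)])

def train_one_r(feature, labels):
    """OneR on a single numeric feature: pick the threshold whose
    'predict 1 if feature > t' rule minimizes training error."""
    best_t, best_err = None, 1.0
    for t in np.unique(feature):
        err = np.mean((feature > t).astype(int) != labels)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

threshold, err = train_one_r(counts, labels)
print(f"rule: malware if HPC > {threshold:.0f}  (training accuracy {1 - err:.2%})")
```

In hardware, such a rule reduces to a single comparator on one counter value, which is consistent with the very small area and latency figures reported above.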

Journal ArticleDOI
TL;DR: The TDC design is relatively simple even when using FPGAs made with current advanced process technology, and can simultaneously achieve high time precision and high measurement throughput.
Abstract: A time-to-digital converter (TDC) with 3.9-ps time-interval rms precision and 277-M events/second measurement throughput is implemented in a Xilinx Kintex-7 field programmable gate array (FPGA). Unlike previous work, the TDC is achieved with a multichain tapped-delay line (TDL) followed by a ones-counter encoder. The four normal TDLs merged together make the TDC bins very small, so that the time precision can be significantly improved. The ones-counter encoder naturally applies global bubble error correction to the output of the TDL, thus the TDC design is relatively simple even when using FPGAs made with current advanced process technology. The TDC implementation is a generally applicable method that can simultaneously achieve high time precision and high measurement throughput.
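The ones-counter encoding and its tolerance to bubble errors can be illustrated with a short sketch (the tap values below are made up):

```python
import numpy as np

def ones_counter_encode(taps):
    """Encode a tapped-delay-line snapshot by counting its ones.

    A clean thermometer code 1...10...0 encodes to its transition position;
    counting ones instead of locating the 1->0 edge gives the same answer and
    is insensitive to isolated 'bubble' errors near the transition."""
    return int(np.count_nonzero(taps))

clean  = np.array([1, 1, 1, 1, 1, 0, 0, 0])   # ideal thermometer code, value 5
bubbly = np.array([1, 1, 1, 1, 0, 1, 0, 0])   # bubble error: same number of ones
print(ones_counter_encode(clean), ones_counter_encode(bubbly))   # both print 5
```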

Journal ArticleDOI
TL;DR: The main advantages of the proposed TRNG are its on-the-fly tunability through dynamic partial reconfiguration to improve randomness qualities and its low hardware footprint and built-in bias elimination capabilities.
Abstract: True random number generators (TRNGs) play a very important role in modern cryptographic systems. Field-programmable gate arrays (FPGAs) form an ideal platform for hardware implementations of many of these security algorithms. In this brief, we present a highly efficient and tunable TRNG based on the principle of beat frequency detection, specifically for Xilinx-FPGA-based applications. The main advantage of the proposed TRNG is its on-the-fly tunability through dynamic partial reconfiguration to improve randomness qualities. We describe the mathematical model of the TRNG operations and experimental results for the circuit implemented on a Xilinx Virtex-V FPGA. The proposed TRNG has a low hardware footprint and built-in bias elimination capabilities. The random bitstreams generated from it pass all tests in the NIST statistical test suite.
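A behavioural model of the beat-frequency-detection principle (our own simplified simulation with arbitrary periods and jitter, not the paper's circuit or parameters) is sketched below:

```python
import numpy as np

rng = np.random.default_rng(4)

def bfd_trng_bits(n_bits, t_a=1.000, t_b=1.003, jitter=0.002):
    """Behavioural model of a beat-frequency-detection TRNG (illustrative only).

    A flip-flop clocked by oscillator B samples the square wave of oscillator A.
    Because the periods differ slightly, the sampled signal toggles at the slow
    beat frequency; accumulated jitter makes the number of B cycles between
    beat events random, and the counter's LSB is emitted as a random bit."""
    bits, count, prev, phase = [], 0, 0, 0.0
    while len(bits) < n_bits:
        # advance A's phase (measured in A periods) over one jittery B period
        phase += (t_b + rng.normal(0, jitter)) / (t_a + rng.normal(0, jitter))
        sample = 1 if (phase % 1.0) < 0.5 else 0   # sampled value of A's square wave
        count += 1
        if sample and not prev:                    # rising edge of the beat signal
            bits.append(count & 1)                 # keep the counter's LSB
            count = 0
        prev = sample
    return np.array(bits)

bits = bfd_trng_bits(2000)
print(f"fraction of ones: {bits.mean():.3f} over {bits.size} bits")
```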

Journal ArticleDOI
TL;DR: A real-time hardware implementation of a fast physical random number generator with a photonic integrated circuit and a field programmable gate array (FPGA) electronic board is demonstrated.
Abstract: Random number generators are essential for applications in information security and numerical simulations. Most optical-chaos-based random number generators produce random bit sequences by offline post-processing with large optical components. We demonstrate a real-time hardware implementation of a fast physical random number generator with a photonic integrated circuit and a field programmable gate array (FPGA) electronic board. We generate 1-Tbit random bit sequences and evaluate their statistical randomness using NIST Special Publication 800-22 and TestU01. All of the BigCrush tests in TestU01 are passed using 410-Gbit random bit sequences. A maximum real-time generation rate of 21.1 Gb/s is achieved for random bit sequences in binary format stored in a computer, which can be directly used for applications involving secret keys in cryptography and random seeds in large-scale numerical simulations.

Journal ArticleDOI
TL;DR: An architecture for accelerating convolution stages in convolutional neural networks (CNNs) implemented in embedded vision systems to reduce the required bandwidth, resource usage, and power consumption of highly computationally complex convolution operations as required by real-time embedded applications.
Abstract: In this brief, we introduce an architecture for accelerating convolution stages in convolutional neural networks (CNNs) implemented in embedded vision systems. The purpose of the architecture is to exploit the inherent parallelism in CNNs to reduce the required bandwidth, resource usage, and power consumption of highly computationally complex convolution operations, as required by real-time embedded applications. We also implement the proposed architecture using fixed-point arithmetic on a ZC706 evaluation board that features a Xilinx Zynq-7000 system-on-chip, where the embedded ARM processor with a high clock speed is used as the main controller to increase flexibility and speed. The proposed architecture runs at a frequency of 150 MHz, which leads to 19.2 giga multiply-accumulate operations per second while consuming less than 10 W of power. This is done using only 391 DSP48 modules, which shows a significant utilization improvement compared to state-of-the-art architectures.
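A quick back-of-envelope consistency check of the quoted figures (our arithmetic, not the paper's):

```python
# 19.2 GMAC/s sustained at a 150 MHz clock implies the number of parallel
# multiply-accumulates completed per cycle.
clock_hz = 150e6
macs_per_second = 19.2e9
print(macs_per_second / clock_hz)   # 128.0 MACs sustained per cycle
```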

Journal ArticleDOI
TL;DR: A new design flow for distributed logic controllers is introduced using interpreted Petri nets as the modeling formalism; the usage of formal methods and double model checking ensures the correct functionality of the designed distributed logic controller.
Abstract: This paper focuses on the design and verification methods of distributed logic controllers supervising real-life processes. Such systems have to be designed very carefully and precisely in order to operate flawlessly and to meet user needs. We propose to use interpreted Petri nets as the modeling formalism. A new design flow for distributed logic controllers is introduced. The methodology covers the development process from the specification stage to the final implementation of the controller in the distributed devices. In the proposed solution, the system is decomposed into separate modules that form a distributed system. Furthermore, the specification (before and after the decomposition process) is formally verified with the application of the model checking technique against predefined behavioral requirements. Finally, the system is implemented in real devices. The usage of formal methods and double model checking ensures the correct functionality of the designed distributed logic controller. The theoretical approach is supplemented by practical experiments. Furthermore, the proposed idea is illustrated by an example of a smart home system.

Journal ArticleDOI
TL;DR: A modular and efficient FPGA design of an in silico spiking neural network exploiting the Izhikevich model is presented, able to simulate a fully connected network counting up to 1,440 neurons, in real-time, at a sampling rate of 10 kHz, which is reasonable for small to medium scale extra-cellular closed-loop experiments.
Abstract: In recent years, the idea of dynamically interfacing biological neurons with artificial ones has become more and more pressing. The reason is essentially the design of innovative neuroprostheses in which biological cell assemblies of the brain can be substituted by artificial ones. For closed-loop experiments with biological neuronal networks interfaced with in silico modeled networks, several technological challenges need to be faced, from the low-level interfacing between the living tissue and the computational model to the implementation of the latter in a suitable form for real-time processing. Field programmable gate arrays (FPGAs) can improve flexibility when simple neuronal models are required, obtaining good accuracy, real-time performance, and the possibility of creating a hybrid system without any custom hardware, simply by programming the hardware to achieve the required functionality. In this paper, this possibility is explored by presenting a modular and efficient FPGA design of an in silico spiking neural network exploiting the Izhikevich model. The proposed system, prototypically implemented on a Xilinx Virtex 6 device, is able to simulate a fully connected network of up to 1,440 neurons, in real-time, at a sampling rate of 10 kHz, which is reasonable for small- to medium-scale extracellular closed-loop experiments.
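For reference, a fixed-step software version of the Izhikevich update that such a design implements can be written in a few lines; the 0.1 ms step matches the 10 kHz rate above, while the neuron parameters and stimulus below are standard textbook values chosen for illustration rather than the paper's configuration.

```python
import numpy as np

def izhikevich(I, dt=0.1, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Fixed-step Euler integration of the Izhikevich neuron model.

        dv/dt = 0.04 v^2 + 5 v + 140 - u + I
        du/dt = a (b v - u)
        if v >= 30 mV:  v <- c,  u <- u + d

    dt = 0.1 ms corresponds to the 10 kHz update rate mentioned above;
    a, b, c, d are the standard regular-spiking parameters, and the input
    current I is an arbitrary test stimulus."""
    v, u = c, b * c
    spike_times = []
    for k, i_k in enumerate(I):
        v += dt * (0.04 * v * v + 5 * v + 140 - u + i_k)
        u += dt * a * (b * v - u)
        if v >= 30.0:                 # spike: reset membrane and recovery variables
            spike_times.append(k * dt)
            v, u = c, u + d
    return spike_times

stim = np.full(20000, 10.0)           # 2 s of constant input at 10 kHz
print(f"{len(izhikevich(stim))} spikes in 2 s of simulated time")
```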

Proceedings ArticleDOI
David Sidler, Zsolt István, Muhsen Owaida, Kaan Kara, Gustavo Alonso
09 May 2017
TL;DR: This work presents doppioDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs), and evaluates it on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables.
Abstract: Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., matching regular expressions) or the data is less structured (e.g., text or long strings). In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. We present doppioDB which consists of MonetDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs). In our demonstration the HUDFs are used to provide seamless acceleration of two string operators, LIKE and REGEXP_LIKE, and two analytics operators, SKYLINE and SGD (stochastic gradient descent). We evaluate doppioDB on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. For integration we rely on HUDFs as a unit of scheduling and management on the FPGA. In the demonstration we show the acceleration benefits of hardware operators, as well as their flexibility in accommodating changing workloads.