
Showing papers on "Field-programmable gate array published in 2017"


Proceedings ArticleDOI
18 Jun 2017
TL;DR: This paper implements a CNN on an FPGA using a systolic array architecture that can achieve high clock frequency under high resource utilization, provides an analytical model for performance and resource utilization, and develops an automatic design space exploration framework.
Abstract: Convolutional neural networks (CNNs) have been widely applied in many deep learning applications. In recent years, FPGA implementations of CNNs have attracted much attention because of their high performance and energy efficiency. However, existing implementations have difficulty fully leveraging the computation power of the latest FPGAs. In this paper we implement a CNN on an FPGA using a systolic array architecture, which can achieve high clock frequency under high resource utilization. We provide an analytical model for performance and resource utilization and develop an automatic design space exploration framework, as well as a source-to-source code transformation from a C program to a CNN implementation using a systolic array. The experimental results show that our framework is able to generate accelerators for real-life CNN models, achieving up to 461 GFlops for the floating-point data type and 1.2 Tops for 8-16 bit fixed point.
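As a rough behavioural illustration of the dataflow such an architecture relies on (a sketch of the general systolic-array idea, not the authors' design; the array size and data below are hypothetical), the following Python model emulates an output-stationary systolic array multiplying an im2col-style input matrix by a weight matrix:

```python
import numpy as np

def systolic_matmul(A, B):
    """Emulate an output-stationary systolic array computing C = A @ B.

    Each (i, j) PE accumulates one output element; operands are skewed so
    that a[i, k] and b[k, j] meet at PE (i, j) in cycle i + j + k.
    This is a behavioural model only, not a hardware description."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    total_cycles = M + N + K - 3          # cycle at which the last operand pair meets
    for t in range(total_cycles + 1):     # one MAC per PE per cycle
        for i in range(M):
            for j in range(N):
                k = t - i - j             # skewed arrival time of the operands
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(-8, 8, size=(4, 6))
    B = rng.integers(-8, 8, size=(6, 5))
    assert np.array_equal(systolic_matmul(A, B), A @ B)
```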

363 citations


Journal ArticleDOI
TL;DR: A reconfigurable yet simple silicon waveguide mesh is demonstrated, realizing over 20 different functionalities with a seven-hexagonal-cell structure that can be applied to different fields including communications, chemical and biomedical sensing, signal processing, multiprocessor networks, and quantum information systems.
Abstract: Integrated photonics changes the scaling laws of information and communication systems, offering architectural choices that combine photonics with electronics to optimize performance, power, footprint, and cost. Application-specific photonic integrated circuits, where particular circuits/chips are designed to optimally perform particular functionalities, require a considerable number of design and fabrication iterations, leading to long development times. A different approach, inspired by electronic Field Programmable Gate Arrays, is the programmable photonic processor, where a common hardware implemented by a two-dimensional photonic waveguide mesh realizes different functionalities through programming. Here, we report the demonstration of such a reconfigurable waveguide mesh in silicon. We demonstrate over 20 different functionalities with a simple seven-hexagonal-cell structure, which can be applied to different fields including communications, chemical and biomedical sensing, signal processing, multiprocessor networks, and quantum information systems. Our work is an important step toward this paradigm. Integrated optical circuits today are typically designed for a few special functionalities and require complex design and development procedures. Here, the authors demonstrate a reconfigurable but simple silicon waveguide mesh with different functionalities.

358 citations


Proceedings ArticleDOI
22 Feb 2017
TL;DR: This work systematically explore the trade-offs of hardware cost by searching the design variable configurations, and proposes a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance.
Abstract: As convolution layers contribute most of the operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize memory access and data movement while maximizing resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing the end-to-end VGG-16 CNN model, achieving 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2× enhancement compared to state-of-the-art FPGA implementations of the VGG model.
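To make the loop structure concrete, here is a plain Python rendering of the four convolution loop levels the abstract refers to, followed by a tiled variant; the tensor shapes and tile sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def conv_direct(inp, weights):
    """Direct convolution with the four loop levels discussed above
    (stride 1, no padding). Shapes: inp (Nif, H, W), weights (Nof, Nif, K, K)."""
    Nof, Nif, K, _ = weights.shape
    _, H, W = inp.shape
    out = np.zeros((Nof, H - K + 1, W - K + 1))
    for of in range(Nof):                        # Loop-4: output feature maps
        for oy in range(H - K + 1):              # Loop-3: output rows
            for ox in range(W - K + 1):          #         and columns
                for inf in range(Nif):           # Loop-2: input feature maps
                    for ky in range(K):          # Loop-1: kernel window
                        for kx in range(K):
                            out[of, oy, ox] += (inp[inf, oy + ky, ox + kx]
                                                * weights[of, inf, ky, kx])
    return out

def conv_tiled(inp, weights, Tof=4, Tox=8):
    """Same arithmetic with tiling over output maps and output columns,
    mimicking how an accelerator buffers one tile on chip and unrolls the
    innermost MACs in hardware; the tile sizes are arbitrary examples."""
    Nof, Nif, K, _ = weights.shape
    _, H, W = inp.shape
    OH, OW = H - K + 1, W - K + 1
    out = np.zeros((Nof, OH, OW))
    for of0 in range(0, Nof, Tof):               # tile loop: output feature maps
        for ox0 in range(0, OW, Tox):            # tile loop: output columns
            for of in range(of0, min(of0 + Tof, Nof)):
                for oy in range(OH):
                    for ox in range(ox0, min(ox0 + Tox, OW)):
                        acc = 0.0
                        for inf in range(Nif):
                            for ky in range(K):
                                for kx in range(K):
                                    acc += (inp[inf, oy + ky, ox + kx]
                                            * weights[of, inf, ky, kx])
                        out[of, oy, ox] = acc
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 16, 16))
    w = rng.standard_normal((8, 3, 3, 3))
    assert np.allclose(conv_direct(x, w), conv_tiled(x, w))
```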

348 citations


Journal ArticleDOI
TL;DR: This paper designs the deep learning accelerator unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks that uses a field-programmable gate array (FPGA) as the hardware prototype and employs three pipelined processing units to improve throughput.
Abstract: As an emerging field of machine learning, deep learning shows an excellent ability to solve complex learning problems. However, the size of the networks becomes increasingly large due to the demands of practical applications, which poses a significant challenge to constructing high-performance implementations of deep learning neural networks. In order to improve performance while maintaining a low power cost, in this paper we design the deep learning accelerator unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks that uses a field-programmable gate array (FPGA) as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve throughput and utilizes tiling techniques to exploit locality for deep learning applications. Experimental results on a state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator is able to achieve up to 36.1× speedup compared to Intel Core2 processors, with a power consumption of 234 mW.

268 citations


Proceedings ArticleDOI
22 Feb 2017
TL;DR: An analytical performance model is proposed and an in-depth analysis on the resource requirement of CNN classifier kernels and available resources on modern FPGAs are performed and a new kernel design is proposed to effectively address such bandwidth limitation and to provide an optimal balance between computation, on-chip, and off-chip memory access.
Abstract: OpenCL for FPGAs has recently gained great popularity with emerging needs for workload acceleration such as the Convolutional Neural Network (CNN), the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGAs, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources in the FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out to be insufficient to achieve desirable performance for both compute-intensive and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis of the resource requirements of CNN classifier kernels and the available resources on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design to effectively address this bandwidth limitation and to provide an optimal balance between computation, on-chip memory access, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating-point performance at a 370 MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385 MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.
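The following toy model sketches the kind of analytical reasoning described, under our own simplifying assumptions (the parameters and the cost formula are invented for illustration, not the paper's model): a kernel is bound by whichever is larger, its compute time or its on-chip memory transfer time.

```python
def kernel_bound(num_ops, ops_per_cycle, bytes_moved, onchip_bytes_per_cycle):
    """Crude analytical bound: a kernel is limited by whichever is larger,
    its compute time or its on-chip memory transfer time (both in cycles)."""
    compute_cycles = num_ops / ops_per_cycle
    memory_cycles = bytes_moved / onchip_bytes_per_cycle
    bottleneck = "memory" if memory_cycles > compute_cycles else "compute"
    return max(compute_cycles, memory_cycles), bottleneck

# Example: a convolution tile with 1e6 MACs on 512 parallel MAC units,
# moving 4 MB through on-chip buffers that sustain 64 bytes/cycle.
cycles, limit = kernel_bound(1e6, 512, 4 * 2**20, 64)
print(f"{cycles:.0f} cycles, {limit}-bound")
```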

199 citations


Posted Content
TL;DR: An investigation from software to hardware and from the circuit level to the system level is carried out to complete the analysis of FPGA-based neural network inference accelerator design and serve as a guide for future work.
Abstract: Recent research on neural networks has shown significant advantages in machine learning over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in areas like image, speech, and video recognition. However, the high computation and storage complexity of neural network inference poses great difficulty for its application. CPU platforms are hard-pressed to offer enough computation capacity. GPU platforms are the first choice for neural network processing because of their high computation capacity and easy-to-use development frameworks. Meanwhile, FPGA-based neural network inference accelerators are becoming a research topic. With specifically designed hardware, FPGAs are the next possible solution to surpass GPUs in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on FPGA-based neural network inference accelerators and summarize the main techniques used. An investigation from software to hardware and from the circuit level to the system level is carried out to complete the analysis of FPGA-based neural network inference accelerator design and to serve as a guide for future work.

137 citations


Proceedings ArticleDOI
22 Feb 2017
TL;DR: ForeGraph, a large-scale graph processing framework based on the multi-FPGA architecture, is proposed, which outperforms state-of-the-art FPGA-based large- scale graph processing systems by 4.54x when executing PageRank on the Twitter graph.
Abstract: The performance of large-scale graph processing suffers from challenges including poor locality, lack of scalability, random access pattern, and heavy data conflicts. Some characteristics of FPGA make it a promising solution to accelerate various applications. For example, on-chip block RAMs can provide high throughput for random data access. However, large-scale processing on a single FPGA chip is constrained by limited on-chip memory resources and off-chip bandwidth. Using a multi-FPGA architecture may alleviate these problems to some extent, while the data partitioning and communication schemes should be considered to ensure the locality and reduce data conflicts. In this paper, we propose ForeGraph, a large-scale graph processing framework based on the multi-FPGA architecture. In ForeGraph, each FPGA board only stores a partition of the entire graph in off-chip memory. Communication over partitions is reduced. Vertices and edges are sequentially loaded onto the FPGA chip and processed. Under our scheduling scheme, each FPGA chip performs graph processing in parallel without conflicts. We also analyze the impact of system parameters on the performance of ForeGraph. Our experimental results on Xilinx Virtex UltraScale XCVU190 chip show ForeGraph outperforms state-of-the-art FPGA-based large-scale graph processing systems by 4.54x when executing PageRank on the Twitter graph (1.4 billion edges). The average throughput is over 900 MTEPS in our design and 2.03x larger than previous work.

133 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: SC-DCNN, the first comprehensive design and optimization framework for SC-based DCNNs using a bottom-up approach, is presented; it is holistically optimized to minimize area and power (energy) consumption while maintaining high network accuracy.
Abstract: With the recent advances in wearable devices and the Internet of Things (IoT), it becomes attractive to implement Deep Convolutional Neural Networks (DCNNs) in embedded and portable systems. Currently, executing software-based DCNNs requires high-performance servers, restricting widespread deployment on embedded and mobile IoT devices. To overcome this obstacle, considerable research efforts have been made to develop highly-parallel and specialized DCNN accelerators using GPGPUs, FPGAs, or ASICs. Stochastic Computing (SC), which uses a bit-stream to represent a number within [-1, 1] by counting the number of ones in the bit-stream, has high potential for implementing DCNNs with high scalability and ultra-low hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power (energy) and hardware footprint can be achieved compared to conventional binary arithmetic implementations. The tremendous savings in power (energy) and hardware resources allow an immense design space for enhancing scalability and robustness for hardware DCNNs. This paper presents SC-DCNN, the first comprehensive design and optimization framework for SC-based DCNNs, using a bottom-up approach. We first present the designs of function blocks that perform the basic operations in a DCNN, including inner product, pooling, and activation function. Then we propose four designs of feature extraction blocks, which are in charge of extracting features from input feature maps, by connecting different basic function blocks with joint optimization. Moreover, efficient weight storage methods are proposed to reduce the area and power (energy) consumption. Putting it all together, with feature extraction blocks carefully selected, SC-DCNN is holistically optimized to minimize area and power (energy) consumption while maintaining high network accuracy. Experimental results demonstrate that the LeNet5 implemented in SC-DCNN consumes only 17 mm2 of area and 1.53 W of power, and achieves a throughput of 781,250 images/s, an area efficiency of 45,946 images/s/mm2, and an energy efficiency of 510,734 images/J.
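A minimal software simulation of the stochastic-computing arithmetic described above is sketched below. The paper mentions AND gates and multiplexers; the bipolar XNOR multiplier shown here is the standard signed-value variant for numbers in [-1, 1], and the stream length and test values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1 << 16                     # bit-stream length (longer -> lower variance)

def encode_bipolar(x):
    """Encode x in [-1, 1] as a random bit-stream with P(1) = (x + 1) / 2."""
    return rng.random(N) < (x + 1) / 2

def decode_bipolar(bits):
    return 2 * bits.mean() - 1

def sc_multiply(a_bits, b_bits):
    """Bipolar SC multiplication is a bitwise XNOR of the two streams."""
    return ~(a_bits ^ b_bits)

def sc_scaled_add(a_bits, b_bits):
    """Scaled addition (a + b) / 2 with a multiplexer driven by a fair coin."""
    sel = rng.random(N) < 0.5
    return np.where(sel, a_bits, b_bits)

a, b = 0.6, -0.3
prod = decode_bipolar(sc_multiply(encode_bipolar(a), encode_bipolar(b)))
summ = decode_bipolar(sc_scaled_add(encode_bipolar(a), encode_bipolar(b)))
print(f"a*b ~ {prod:+.3f} (exact {a*b:+.3f}), (a+b)/2 ~ {summ:+.3f}")
```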

130 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGAs and still keep the benefits of low-level hardware optimization.
Abstract: Convolutional neural networks (CNNs) are rapidly evolving and being applied to a broad range of applications. Given a specific application, an increasing challenge is to search for the appropriate CNN algorithm and efficiently map it to the target hardware. The FPGA-based accelerator has the advantages of reconfigurability and flexibility, and has achieved high performance and low power. Without a general compiler to automate the implementation, however, significant efforts and expertise are still required to customize the design for each CNN model. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable high-level fast prototyping of CNNs from software to FPGA while still keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The implementation of each module is optimized at the RTL level. Given a CNN algorithm, its structure is abstracted to a directed acyclic graph (DAG) and then compiled with RTL modules in the library. The integration and dataflow of physical modules are predefined in the top-level system template and reconfigured during compilation. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topologies, e.g., ResNet, can be compiled. The proposed methodology is demonstrated with end-to-end FPGA implementations of various CNN algorithms (e.g., NiN, VGG-16, ResNet-50, and ResNet-152) on two standalone Intel FPGAs, Stratix V and Arria 10. The performance and overhead of the automated compilation are evaluated. The compiled FPGA accelerators exhibit superior performance compared to state-of-the-art automation-based works by >2× for various CNNs.
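Purely as an illustration of this compile flow (the module names, parameters, and miniature network below are invented for illustration, not the authors' library), a CNN can be abstracted to a DAG of layer nodes and each node matched against a library of parameterized RTL templates:

```python
# Hypothetical miniature of the compile flow: layer DAG -> RTL module instances.
RTL_LIBRARY = {
    "conv":   lambda p: f"conv_unit  #(K={p['k']}, CI={p['ci']}, CO={p['co']})",
    "pool":   lambda p: f"pool_unit  #(K={p['k']})",
    "eltadd": lambda p: "eltwise_add_unit",        # needed for ResNet shortcuts
}

# A tiny residual block expressed as a DAG: node -> (op, params, predecessors)
dag = {
    "conv1": ("conv", {"k": 3, "ci": 64, "co": 64}, ["input"]),
    "conv2": ("conv", {"k": 3, "ci": 64, "co": 64}, ["conv1"]),
    "add":   ("eltadd", {}, ["conv2", "input"]),   # irregular (skip) edge
    "pool":  ("pool", {"k": 2}, ["add"]),
}

def topo_order(dag):
    """Layer-by-layer execution schedule: emit each node once its inputs are done."""
    done, order = {"input"}, []
    while len(order) < len(dag):
        for name, (_, _, preds) in dag.items():
            if name not in done and all(p in done for p in preds):
                order.append(name)
                done.add(name)
    return order

for name in topo_order(dag):
    op, params, preds = dag[name]
    print(f"{name:6s} <- {RTL_LIBRARY[op](params)}  inputs={preds}")
```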

116 citations


Journal ArticleDOI
TL;DR: Experimental results confirm the efficiency of the proposed MPPT method in global peak tracking and its high accuracy in handling partial shading; the comparison with the P&O and PSO methods shows that the proposed method outperforms them in terms of global search ability and dynamic performance.

107 citations


Journal ArticleDOI
TL;DR: The streaming method acts as a compiler, transforming a high-level representation of DCNNs into operation codes to execute applications in a hardware accelerator, utilizing the maximum computational resources available based on a novel scheduled routing topology that combines data reuse and data concatenation.
Abstract: Deep convolutional neural networks (DCNNs) have become a very powerful tool in visual perception. DCNNs have applications in autonomous robots, security systems, mobile phones, and automobiles, where high throughput of the feedforward evaluation phase and power efficiency are important. Because of this increased usage, many field-programmable gate array (FPGA)-based accelerators have been proposed. In this paper, we present an optimized streaming method for DCNNs’ hardware accelerator on an embedded platform. The streaming method acts as a compiler, transforming a high-level representation of DCNNs into operation codes to execute applications in a hardware accelerator. The proposed method utilizes the maximum computational resources available based on a novel scheduled routing topology that combines data reuse and data concatenation. It is tested with a hardware accelerator implemented on the Xilinx Kintex-7 XC7K325T FPGA. The system fully explores weight-level and node-level parallelizations of DCNNs and achieves a peak performance of 247 G-ops while consuming less than 4 W of power. We test our system with applications on object classification and object detection in real-world scenarios. Our results indicate high performance efficiency, outperforming all other presented platforms while running these applications.

Proceedings ArticleDOI
28 May 2017
TL;DR: Three hardware accelerators for RNNs on Xilinx's Zynq SoC FPGA are presented to show how to overcome the challenges involved in developing RNN accelerators; each design uses different strategies to achieve high performance and scalability.
Abstract: Recurrent Neural Networks (RNNs) have the ability to retain memory and learn from data sequences, which are fundamental for real-time applications. RNN computations offer limited data reuse, which leads to high data traffic. This translates into high off-chip memory bandwidth or large internal storage requirements to achieve high performance. Exploiting parallelism in RNN computations is bounded by these two limiting factors, among other constraints present in embedded systems. Therefore, a balance between internally stored data and off-chip memory data transfer is necessary to overlap computation time with data transfer latency. In this paper, we present three hardware accelerators for RNNs on Xilinx's Zynq SoC FPGA to show how to overcome the challenges involved in developing RNN accelerators. Each design uses different strategies to achieve high performance and scalability. Each co-processor was tested with a character-level language model. The latest design, called DeepRnn, achieves up to 23X better performance per power than the Tegra X1 development board for this application.

Proceedings ArticleDOI
13 Nov 2017
TL;DR: This work hypothesizes that existing hardware construction languages and novel hardware compiler frameworks can put hardware development on a similar evolutionary path by enabling new hardware libraries to be independent of underlying process technologies including FPGA mappings.
Abstract: Enabled by modern languages and retargetable compilers, software development is in a virtual "Cambrian explosion" driven by a critical mass of powerfully parameterized libraries; but hardware development practices lag far behind. We hypothesize that existing hardware construction languages (HCLs) and novel hardware compiler frameworks (HCFs) can put hardware development on a similar evolutionary path by enabling new hardware libraries to be independent of underlying process technologies, including FPGA mappings. We support this claim by (1) evaluating the degree to which Chisel, an existing HCL, can support powerfully parameterized libraries, and (2) introducing the concept and implementation of an HCF that uses an open-source hardware intermediate representation, FIRRTL (Flexible Intermediate Representation for RTL), to transform target-independent RTL into technology-specific RTL. Finally, we evaluate many hardware compiler transformations, including simplifying transformations, analyses, optimizations, instrumentations, and specializations, which demonstrate the power of a combined HCL and HCF approach.

Journal ArticleDOI
TL;DR: The image processing language Halide is extended so users can specify which portions of their applications should become hardware accelerators, and a compiler is provided that uses this code to automatically create the accelerator along with the “glue” code needed for the user’s application to access this hardware.
Abstract: Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented reality. But creating, “programming,” and integrating this hardware into a hardware/software system is difficult. We address this problem by extending the image processing language Halide so users can specify which portions of their applications should become hardware accelerators, and then we provide a compiler that uses this code to automatically create the accelerator along with the “glue” code needed for the user’s application to access this hardware. Starting with Halide not only provides a very high-level functional description of the hardware but also allows our compiler to generate a complete software application, which accesses the hardware for acceleration when appropriate. Our system also provides high-level semantics to explore different mappings of applications to a heterogeneous system, including the flexibility of being able to change the throughput rate of the generated hardware. We demonstrate our approach by mapping applications to a commercial Xilinx Zynq system. Using its FPGA with two low-power ARM cores, our design achieves up to 6× higher performance and 38× lower energy compared to the quad-core ARM CPU on an NVIDIA Tegra K1, and 3.5× higher performance with 12× lower energy compared to the K1’s 192-core GPU.

Proceedings ArticleDOI
09 May 2017
TL;DR: This work integrates the hardware accelerator into MonetDB, a main-memory column store, and demonstrates a significant improvement in response time and throughput, and provides a novel and efficient implementation of two commonly used SQL operators for strings.
Abstract: Taking advantage of recently released hybrid multicore architectures, such as Intel's Xeon+FPGA machine, where the FPGA has coherent access to the main memory through the QPI bus, we explore the benefits of specializing operators to hardware. We focus on two commonly used SQL operators for strings, LIKE and REGEXP_LIKE, and provide a novel and efficient implementation of these operators in reconfigurable hardware. We integrate the hardware accelerator into MonetDB, a main-memory column store, and demonstrate a significant improvement in response time and throughput. Our Hardware User Defined Function (HUDF) can speed up complex pattern matching by an order of magnitude in comparison to the database running on a 10-core CPU. The insights gained from integrating hardware-based string operators into MonetDB should also be useful for future designs combining hardware specialization and databases.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: A design framework for DNNs is presented that uses highly configurable IPs for neural network layers together with a new design space exploration engine for Resource Allocation Management (REALM) to further improve the FPGA solution.
Abstract: FPGA is a promising candidate for the acceleration of Deep Neural Networks (DNN) with improved latency and energy consumption compared to CPU and GPU-based implementations. DNNs use sequences of layers of regular computation that are well suited for HLS-based design for FPGA. However, optimizing large neural networks under resource constraints is still a key challenge. HLS must manage on-chip computation, buffering resources, and off-chip memory accesses to minimize the total latency. In this paper, we present a design framework for DNNs that uses highly configurable IPs for neural network layers together with a new design space exploration engine for Resource Allocation Management (REALM). We also carry out efficient memory subsystem design and fixed-point weight re-training to further improve our FPGA solution. We demonstrate our design framework on the Long-term Recurrent Convolution Network for video inputs. Our implementation on a Xilinx VC709 board achieves 3.1X speedup compared to an NVIDIA K80 and 4.75X speedup compared to an Intel Xeon with 17.5X lower energy per image.

Journal ArticleDOI
TL;DR: A scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration, and a systematic design space exploration methodology is put forward to search for the optimal solution that maximizes accelerator throughput under the FPGA constraints.
Abstract: Deep convolutional neural networks (CNNs) have gained great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computation intensive and memory expensive and, hence, are mainly processed on high-performance processors like server CPUs and GPUs. However, there is an increasing demand for high-accuracy or real-time object detection tasks in large-scale clusters or embedded systems, which requires energy-efficient accelerators because of green computing requirements or limited battery capacity. Due to their advantages of energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of the computation complexity and the memory footprint of each CNN layer type. Then a scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration. We further put forward a systematic design space exploration methodology to search for the optimal solution that maximizes accelerator throughput under FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The average performance of the three accelerators is 424.7, 445.6, and 473.4 GOP/s under a 100 MHz working frequency, which outperforms the CPU and previous work significantly.
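A toy version of such a design space exploration, with made-up resource budgets and cost formulas rather than the paper's model, might enumerate unroll factors and keep the feasible configuration with the highest throughput:

```python
from itertools import product

# Hypothetical FPGA budgets, buffer cost formula, and clock; illustrative only.
DSP_BUDGET, BRAM_BUDGET = 2800, 1500
CLOCK_HZ = 100e6

best = None
for p_of, p_if, p_x, p_y in product([1, 2, 4, 8, 16], repeat=4):
    macs_per_cycle = p_of * p_if * p_x * p_y         # unrolled MAC units
    dsps = macs_per_cycle                            # assume one DSP per MAC
    brams = 4 * p_of * p_if + 8 * p_x * p_y          # crude on-chip buffer cost
    if dsps > DSP_BUDGET or brams > BRAM_BUDGET:
        continue                                     # violates FPGA constraints
    gops = 2 * macs_per_cycle * CLOCK_HZ / 1e9       # 2 ops (mul + add) per MAC
    if best is None or gops > best[0]:
        best = (gops, (p_of, p_if, p_x, p_y), dsps, brams)

gops, unroll, dsps, brams = best
print(f"best unroll {unroll}: {gops:.1f} GOP/s using {dsps} DSPs, {brams} BRAM-equiv")
```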

Journal ArticleDOI
TL;DR: In this paper, the authors describe the hardware, gateware, and software developed at Raytheon BBN Technologies for dynamic quantum information processing experiments on superconducting qubits, where real-time qubit state information is fed back or fed forward within a fraction of the qubits' coherence time to dynamically change the implemented sequence.
Abstract: We describe the hardware, gateware, and software developed at Raytheon BBN Technologies for dynamic quantum information processing experiments on superconducting qubits. In dynamic experiments, real-time qubit state information is fed back or fed forward within a fraction of the qubits' coherence time to dynamically change the implemented sequence. The hardware presented here covers both control and readout of superconducting qubits. For readout, we created a custom signal processing gateware and software stack on commercial hardware to convert pulses in a heterodyne receiver into qubit state assignments with minimal latency, alongside data taking capability. For control, we developed custom hardware with gateware and software for pulse sequencing and steering information distribution that is capable of arbitrary control flow in a fraction of superconducting qubit coherence times. Both readout and control platforms make extensive use of field programmable gate arrays to enable tailored qubit control systems in a reconfigurable fabric suitable for iterative development.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: A binarized CNN, which uses only binary values for the inputs and the weights, is employed, which can realize a more compact and faster CNN than conventional ones.
Abstract: A pre-trained convolutional deep neural network (CNN) is widely used for embedded systems, which require high power and area efficiency. In that case, the CPU is too slow, the embedded GPU dissipates much power, and the ASIC cannot keep up with the rapid progress of CNN variations. This paper uses a binarized CNN which uses only binary values for the inputs and the weights. Since the multiplier is replaced by an XNOR circuit, we can realize a high-performance MAC circuit by using many XNOR circuits. In the paper, we eliminate the internal FC layers except the last one and then insert a binarized average pooling layer, which can be realized by a majority circuit for binarized (1/0) values. In that case, since the weight memory is replaced by a 1's counter, we can realize a more compact and faster CNN than the conventional ones. We implemented the VGG-11 benchmark CNN for the CIFAR10 image classification task on the Xilinx Inc. Zedboard. Compared with conventional binarized implementations on an FPGA, the classification accuracy was almost the same, the performance per power efficiency is 5.1 times better, the performance per area efficiency is 8.0 times better, and the performance per memory is 8.2 times better.
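The XNOR-based MAC and the majority-vote pooling described above can be sketched in a few lines of Python (the vector length and data are arbitrary; this is a behavioural model, not the paper's circuit):

```python
import numpy as np

rng = np.random.default_rng(2)

def bin_mac(x_bits, w_bits):
    """Binarized dot product: XNOR then popcount, mapped back to the +/-1 domain.
    x_bits and w_bits are 0/1 arrays standing for -1/+1 values."""
    xnor = ~(x_bits ^ w_bits)            # 1 wherever the +/-1 product is +1
    ones = int(np.count_nonzero(xnor))
    return 2 * ones - x_bits.size        # sum of the +/-1 products

def bin_avg_pool(window_bits):
    """Binarized average pooling as a majority vote over the window."""
    return int(np.count_nonzero(window_bits) * 2 > window_bits.size)

x = rng.integers(0, 2, 64, dtype=np.uint8).astype(bool)
w = rng.integers(0, 2, 64, dtype=np.uint8).astype(bool)
ref = int(np.sum((2 * x.astype(int) - 1) * (2 * w.astype(int) - 1)))
assert bin_mac(x, w) == ref
print("XNOR-popcount MAC:", bin_mac(x, w), "majority pool:", bin_avg_pool(x[:4]))
```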

Journal ArticleDOI
TL;DR: The designed chaotic oscillator has been tested for True Random Number Generation (TRNG), and it has been shown that the proposed design can be used in embedded cryptologic applications.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents a highly versatile FPGA-friendly architecture for TNNs in which both the number of bits of the input data and the level of parallelism can be varied at synthesis time, allowing throughput to be traded for hardware resources and power consumption.
Abstract: Thanks to their excellent performance on typical artificial intelligence problems, deep neural networks have drawn a lot of interest lately. However, this comes at the cost of large computational needs and high power consumption. Benefiting from high precision at acceptable hardware cost on these difficult problems is a challenge. To address it, we advocate the use of ternary neural networks (TNNs) that, when properly trained, can reach results close to the state of the art using floating-point arithmetic. We present a highly versatile FPGA-friendly architecture for TNNs in which we can vary both the number of bits of the input data and the level of parallelism at synthesis time, allowing throughput to be traded for hardware resources and power consumption. To demonstrate the efficiency of our proposal, we implement high-complexity convolutional neural networks on the Xilinx Virtex-7 VC709 FPGA board. While reaching a better accuracy than comparable designs, we can target either high throughput or low power. We measure a throughput of up to 27,000 fps at ≈7 W or up to 8.36 TMAC/s at ≈13 W.

Proceedings ArticleDOI
27 Mar 2017
TL;DR: The capabilities, API, and performance evaluation of BitMan are described; it includes high-level commands such as cutting out regions of a bitstream and placing or relocating modules on an FPGA, as well as low-level commands for modifying primitives and for routing clock networks or rerouting signal connections at run-time.
Abstract: To fully support the partial reconfiguration capabilities of FPGAs, this paper introduces the tool and API BitMan for generating and manipulating configuration bitstreams. BitMan supports recent Xilinx FPGAs that can be used with the ISE and Vivado tool suites of the FPGA vendor Xilinx, including the latest Virtex-6, 7 Series, UltraScale, and UltraScale+ series FPGAs. The functionality includes high-level commands such as cutting out regions of a bitstream and placing or relocating modules on an FPGA, as well as low-level commands for modifying primitives and for routing clock networks or rerouting signal connections at run-time. All this is possible without the vendor CAD tools, allowing BitMan to be used even on embedded CPUs. The paper describes the capabilities, API, and performance evaluation of BitMan.

Proceedings ArticleDOI
18 Jun 2017
TL;DR: This paper is the first attempt to thoroughly analyze various robust machine learning methods for classifying benign and malware applications, and shows OneR to be the most cost-effective classifier, with more than 80% accuracy and a fast execution time of less than 10 ns, achieving the highest accuracy per logic area.
Abstract: Detection of malicious software at the hardware level is emerging as an effective solution to increasing security threats. Hardware-based detectors rely on Machine Learning (ML) classifiers to detect malware-like execution patterns based on Hardware Performance Counter (HPC) information at run-time. The effectiveness of these learning methods mainly relies on the information provided by a limited number of expensive-to-implement HPCs. This paper is the first attempt to thoroughly analyze various robust machine learning methods for classifying benign and malware applications. Given the limited availability of HPCs, the analysis results help guide architectural decisions on which hardware performance counters are needed most to effectively improve ML classification accuracy. For the software implementation, we fully implemented these classifiers in the OS kernel to understand the various software overheads. The software implementations of these classifiers are found to be relatively slow, with execution times in the range of milliseconds, an order of magnitude higher than the latency needed to capture malware at run-time. This calls for hardware-accelerated implementations of these algorithms. For the hardware implementation, we have synthesized the studied classifier models on an FPGA to compare various design parameters including logic area, power, and latency. The results show that while complex ML classifiers such as MultiLayerPerceptron and logistic regression achieve close to 90% accuracy, after taking into consideration their implementation overheads they perform worse in terms of PDP, accuracy/area, and latency than simpler but slightly less accurate rule-based and tree-based classifiers. Our results further show OneR to be the most cost-effective classifier, with more than 80% accuracy and a fast execution time of less than 10 ns, achieving the highest accuracy per logic area while relying mainly on the information from a single branch-instruction HPC.
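As a hedged illustration of why a rule-based classifier such as OneR maps to tiny hardware, the sketch below trains a one-rule threshold on a single synthetic branch-instruction HPC feature; the data and the threshold search are invented for illustration and are not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic HPC samples: one branch-instruction counter reading per window,
# labels 1 = malware-like, 0 = benign. Purely illustrative data.
benign  = rng.normal(1200, 150, 500)
malware = rng.normal(1800, 200, 500)
counts  = np.concatenate([benign, malware])
labels  = np.concatenate([np.zeros(500, int), np.ones(500, int)])

def train_one_r(feature, labels):
    """OneR on a single numeric feature: pick the threshold whose
    'predict 1 if feature > t' rule minimizes training error."""
    best_t, best_err = None, 1.0
    for t in np.unique(feature):
        err = np.mean((feature > t).astype(int) != labels)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

threshold, err = train_one_r(counts, labels)
print(f"rule: malware if HPC > {threshold:.0f}  (training accuracy {1 - err:.2%})")
```

In hardware, such a rule reduces to a single comparator on one counter value, which is consistent with the very small area and latency figures reported above.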

Journal ArticleDOI
TL;DR: The TDC design is relatively simple even when using FPGAs made with current advanced process technology, and can simultaneously achieve high time precision and high measurement throughput.
Abstract: A time-to-digital converter (TDC) with 3.9-ps time-interval rms precision and 277-M events/second measurement throughput is implemented in a Xilinx Kintex-7 field programmable gate array (FPGA). Unlike previous work, the TDC is achieved with a multichain tapped-delay line (TDL) followed by a ones-counter encoder. The four normal TDLs merged together make the TDC bins very small, so that the time precision can be significantly improved. The ones-counter encoder naturally applies global bubble error correction to the output of the TDL, thus the TDC design is relatively simple even when using FPGAs made with current advanced process technology. The TDC implementation is a generally applicable method that can simultaneously achieve high time precision and high measurement throughput.
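The ones-counter encoding and its tolerance to bubble errors can be illustrated with a short sketch (the tap values below are made up):

```python
import numpy as np

def ones_counter_encode(taps):
    """Encode a tapped-delay-line snapshot by counting its ones.

    A clean thermometer code 1...10...0 encodes to its transition position;
    counting ones instead of locating the 1->0 edge gives the same answer and
    is insensitive to isolated 'bubble' errors near the transition."""
    return int(np.count_nonzero(taps))

clean  = np.array([1, 1, 1, 1, 1, 0, 0, 0])   # ideal thermometer code, value 5
bubbly = np.array([1, 1, 1, 1, 0, 1, 0, 0])   # bubble error: same number of ones
print(ones_counter_encode(clean), ones_counter_encode(bubbly))   # both print 5
```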

Journal ArticleDOI
TL;DR: The main advantages of the proposed TRNG are its on-the-fly tunability through dynamic partial reconfiguration to improve randomness qualities and its low hardware footprint and built-in bias elimination capabilities.
Abstract: True random number generators (TRNGs) play a very important role in modern cryptographic systems. Field-programmable gate arrays (FPGAs) form an ideal platform for hardware implementations of many of these security algorithms. In this brief, we present a highly efficient and tunable TRNG based on the principle of beat frequency detection, specifically for Xilinx-FPGA-based applications. The main advantage of the proposed TRNG is its on-the-fly tunability through dynamic partial reconfiguration to improve randomness qualities. We describe the mathematical model of the TRNG operations and experimental results for the circuit implemented on a Xilinx Virtex-V FPGA. The proposed TRNG has a low hardware footprint and built-in bias elimination capabilities. The random bitstreams generated from it pass all tests in the NIST statistical test suite.
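A behavioural model of the beat-frequency-detection principle (our own simplified simulation with arbitrary periods and jitter, not the paper's circuit or parameters) is sketched below:

```python
import numpy as np

rng = np.random.default_rng(4)

def bfd_trng_bits(n_bits, t_a=1.000, t_b=1.003, jitter=0.002):
    """Behavioural model of a beat-frequency-detection TRNG (illustrative only).

    A flip-flop clocked by oscillator B samples the square wave of oscillator A.
    Because the periods differ slightly, the sampled signal toggles at the slow
    beat frequency; accumulated jitter makes the number of B cycles between
    beat events random, and the counter's LSB is emitted as a random bit."""
    bits, count, prev, phase = [], 0, 0, 0.0
    while len(bits) < n_bits:
        # advance A's phase (measured in A periods) over one jittery B period
        phase += (t_b + rng.normal(0, jitter)) / (t_a + rng.normal(0, jitter))
        sample = 1 if (phase % 1.0) < 0.5 else 0   # sampled value of A's square wave
        count += 1
        if sample and not prev:                    # rising edge of the beat signal
            bits.append(count & 1)                 # keep the counter's LSB
            count = 0
        prev = sample
    return np.array(bits)

bits = bfd_trng_bits(2000)
print(f"fraction of ones: {bits.mean():.3f} over {bits.size} bits")
```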

Journal ArticleDOI
TL;DR: A real-time hardware implementation of a fast physical random number generator with a photonic integrated circuit and a field programmable gate array (FPGA) electronic board is demonstrated.
Abstract: Random number generators are essential for applications in information security and numerical simulations. Most optical-chaos-based random number generators produce random bit sequences by offline post-processing with large optical components. We demonstrate a real-time hardware implementation of a fast physical random number generator with a photonic integrated circuit and a field programmable gate array (FPGA) electronic board. We generate 1-Tbit random bit sequences and evaluate their statistical randomness using NIST Special Publication 800-22 and TestU01. All of the BigCrush tests in TestU01 are passed using 410-Gbit random bit sequences. A maximum real-time generation rate of 21.1 Gb/s is achieved for random bit sequences in binary format stored in a computer, which can be directly used for applications involving secret keys in cryptography and random seeds in large-scale numerical simulations.

Journal ArticleDOI
TL;DR: An architecture for accelerating convolution stages in convolutional neural networks (CNNs) implemented in embedded vision systems to reduce the required bandwidth, resource usage, and power consumption of highly computationally complex convolution operations as required by real-time embedded applications.
Abstract: In this brief, we introduce an architecture for accelerating convolution stages in convolutional neural networks (CNNs) implemented in embedded vision systems. The purpose of the architecture is to exploit the inherent parallelism in CNNs to reduce the required bandwidth, resource usage, and power consumption of highly computationally complex convolution operations, as required by real-time embedded applications. We also implement the proposed architecture using fixed-point arithmetic on a ZC706 evaluation board that features a Xilinx Zynq-7000 system-on-chip, where the embedded ARM processor with a high clock speed is used as the main controller to increase flexibility and speed. The proposed architecture runs at a frequency of 150 MHz, which leads to 19.2 giga multiply-accumulate operations per second while consuming less than 10 W of power. This is done using only 391 DSP48 modules, which shows a significant utilization improvement compared to state-of-the-art architectures.
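A quick back-of-envelope consistency check of the quoted figures (our arithmetic, not the paper's):

```python
# 19.2 GMAC/s sustained at a 150 MHz clock implies the number of parallel
# multiply-accumulates completed per cycle.
clock_hz = 150e6
macs_per_second = 19.2e9
print(macs_per_second / clock_hz)   # 128.0 MACs sustained per cycle
```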

Journal ArticleDOI
TL;DR: A new design flow for distributed logic controllers is introduced using interpreted Petri nets as the modeling formalism; the usage of formal methods and double model checking ensures the correct functionality of the designed distributed logic controller.
Abstract: This paper focuses on the design and verification methods of distributed logic controllers supervising real-life processes. Such systems have to be designed very carefully and precisely in order to operate flawlessly and to meet user needs. We propose to use interpreted Petri nets as the modeling formalism. A new design flow for distributed logic controllers is introduced. The methodology covers the development process from the specification stage to the final implementation of the controller in the distributed devices. In the proposed solution, the system is decomposed into separate modules that form a distributed system. Furthermore, the specification (before and after the decomposition process) is formally verified with the application of the model checking technique against predefined behavioral requirements. Finally, the system is implemented in real devices. The usage of formal methods and double model checking ensures the correct functionality of the designed distributed logic controller. The theoretical approach is supplemented by practical experiments. Furthermore, the proposed idea is illustrated by an example of a smart home system.

Journal ArticleDOI
TL;DR: A modular and efficient FPGA design of an in silico spiking neural network exploiting the Izhikevich model is presented, able to simulate a fully connected network counting up to 1,440 neurons, in real-time, at a sampling rate of 10 kHz, which is reasonable for small to medium scale extra-cellular closed-loop experiments.
Abstract: In recent years, the idea of dynamically interfacing biological neurons with artificial ones has become more and more pressing. The reason is essentially the design of innovative neuroprostheses in which biological cell assemblies of the brain can be substituted by artificial ones. For closed-loop experiments with biological neuronal networks interfaced with in silico modeled networks, several technological challenges need to be faced, from the low-level interfacing between the living tissue and the computational model to the implementation of the latter in a suitable form for real-time processing. Field programmable gate arrays (FPGAs) can improve flexibility when simple neuronal models are required, obtaining good accuracy, real-time performance, and the possibility of creating a hybrid system without any custom hardware, simply by programming the hardware to achieve the required functionality. In this paper, this possibility is explored by presenting a modular and efficient FPGA design of an in silico spiking neural network exploiting the Izhikevich model. The proposed system, prototypically implemented on a Xilinx Virtex 6 device, is able to simulate a fully connected network of up to 1,440 neurons, in real-time, at a sampling rate of 10 kHz, which is reasonable for small- to medium-scale extracellular closed-loop experiments.
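For reference, a fixed-step software version of the Izhikevich update that such a design implements can be written in a few lines; the 0.1 ms step matches the 10 kHz rate above, while the neuron parameters and stimulus below are standard textbook values chosen for illustration rather than the paper's configuration.

```python
import numpy as np

def izhikevich(I, dt=0.1, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Fixed-step Euler integration of the Izhikevich neuron model.

        dv/dt = 0.04 v^2 + 5 v + 140 - u + I
        du/dt = a (b v - u)
        if v >= 30 mV:  v <- c,  u <- u + d

    dt = 0.1 ms corresponds to the 10 kHz update rate mentioned above;
    a, b, c, d are the standard regular-spiking parameters, and the input
    current I is an arbitrary test stimulus."""
    v, u = c, b * c
    spike_times = []
    for k, i_k in enumerate(I):
        v += dt * (0.04 * v * v + 5 * v + 140 - u + i_k)
        u += dt * a * (b * v - u)
        if v >= 30.0:                 # spike: reset membrane and recovery variables
            spike_times.append(k * dt)
            v, u = c, u + d
    return spike_times

stim = np.full(20000, 10.0)           # 2 s of constant input at 10 kHz
print(f"{len(izhikevich(stim))} spikes in 2 s of simulated time")
```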

Proceedings ArticleDOI
David Sidler, Zsolt István, Muhsen Owaida, Kaan Kara, Gustavo Alonso
09 May 2017
TL;DR: This work presents doppioDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs), and evaluates it on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables.
Abstract: Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., matching regular expressions) or the data is less structured (e.g., text or long strings). In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. We present doppioDB which consists of MonetDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs). In our demonstration the HUDFs are used to provide seamless acceleration of two string operators, LIKE and REGEXP_LIKE, and two analytics operators, SKYLINE and SGD (stochastic gradient descent). We evaluate doppioDB on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. For integration we rely on HUDFs as a unit of scheduling and management on the FPGA. In the demonstration we show the acceleration benefits of hardware operators, as well as their flexibility in accommodating changing workloads.