
Showing papers on "Field-programmable gate array" published in 2016


Proceedings ArticleDOI
21 Feb 2016
TL;DR: This paper presents an in-depth analysis of state-of-the-art CNN models, showing that convolutional layers are computation-centric while fully-connected layers are memory-centric, and proposes a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification.
Abstract: In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computation-intensive and resource-consuming, and thus are hard to integrate into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNN, but the limited bandwidth and on-chip memory size limit the performance of FPGA accelerators for CNN. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that convolutional layers are computation-centric and fully-connected layers are memory-centric. Then a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNN are proposed to improve bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented end-to-end on FPGA so far. The system on the Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s at a 150 MHz working frequency, significantly outperforming previous approaches.
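The dynamic-precision quantization described above assigns each layer its own fixed-point format. A minimal sketch of the idea, assuming a simple per-tensor error-minimizing search (the helper names and search range are illustrative, not the paper's exact flow):

```python
import numpy as np

def quantize(x, bits, frac_bits):
    """Round x onto a signed fixed-point grid: `bits` total, `frac_bits` fractional."""
    scale = 2.0 ** frac_bits
    lo, hi = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

def best_frac_bits(x, bits):
    """Pick the fractional length minimizing total error for this tensor --
    chosen per layer, which is what makes the precision 'dynamic'."""
    candidates = list(range(-2, bits + 2))
    errs = [np.abs(x - quantize(x, bits, f)).sum() for f in candidates]
    return candidates[int(np.argmin(errs))]

weights = np.random.randn(256) * 0.1        # stand-in for one layer's weights
f = best_frac_bits(weights, bits=8)         # e.g. 8-bit weights, per-layer format
w_q = quantize(weights, 8, f)
print(f, np.abs(weights - w_q).max())
```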

1,172 citations


Proceedings ArticleDOI
21 Feb 2016
TL;DR: This work presents a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering FPGA resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth.
Abstract: Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute-/memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turn-around time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for convolution operation, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on P395-D8 board.
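The design space exploration the abstract describes can be pictured as an exhaustive sweep over unrolling factors, keeping only configurations that fit the device and ranking survivors by modeled throughput. The budgets, resource model, and roofline below are deliberately crude stand-ins, not the paper's OpenCL cost model:

```python
# Toy sweep over compute-unroll factors (pm: output maps, pn: input maps).
DSP_BUDGET, BRAM_KB, BW_GBS = 1963, 6000, 25.6   # hypothetical Stratix-V-like limits

def fits(pm, pn):
    dsps = pm * pn                 # one MAC per DSP (simplified)
    bram = (pm + pn) * 32          # KB of on-chip buffering (simplified)
    return dsps <= DSP_BUDGET and bram <= BRAM_KB

def throughput(pm, pn, freq_mhz=200):
    gops = 2 * pm * pn * freq_mhz / 1e3            # 2 ops per MAC
    return min(gops, BW_GBS * 8)                   # crude roofline: 8 ops/byte

best = max(((pm, pn) for pm in range(1, 65) for pn in range(1, 65) if fits(pm, pn)),
           key=lambda c: throughput(*c))
print(best, round(throughput(*best), 1), "GOPS (model)")
```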

515 citations


Proceedings ArticleDOI
15 Oct 2016
TL;DR: This work devises DnnWeaver, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe, producing an accelerator that best matches the needs of the DNN while providing high performance and efficiency gains for the target FPGA.
Abstract: Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains. FPGAs are an attractive choice for DNNs since they offer a programmable substrate for acceleration and are becoming available across different market segments. However, obtaining both performance and energy efficiency with FPGAs is a laborious task even for expert hardware designers. Furthermore, the large memory footprint of DNNs, coupled with the FPGAs' limited on-chip storage, makes DNN acceleration using FPGAs more challenging. This work tackles these challenges by devising DnnWeaver, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe [1]. To achieve large benefits while preserving automation, DnnWeaver generates accelerators using hand-optimized design templates. First, DnnWeaver translates a given high-level DNN specification to its novel ISA that represents a macro dataflow graph of the DNN. The DnnWeaver compiler is equipped with our optimization algorithm that tiles, schedules, and batches DNN operations to maximize data reuse and best utilize the target FPGA's memory and other resources. The final result is a custom synthesizable accelerator that best matches the needs of the DNN while providing high performance and efficiency gains for the target FPGA. We use DnnWeaver to generate accelerators for a set of eight different DNN models and three different FPGAs: Xilinx Zynq, Altera Stratix V, and Altera Arria 10. We use hardware measurements to compare the generated accelerators to both multicore CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650Ti, and Tesla K40). In comparison, the generated accelerators deliver superior performance and efficiency without requiring the programmers to participate in the arduous task of hardware design.

435 citations


Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper proposes a BNN hardware accelerator design, implements it on an Arria 10 FPGA as well as a 14-nm ASIC, and compares them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU.
Abstract: Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-the-art accuracies. Binarized neural networks (BNNs) are a recently proposed optimized variant of DNNs. BNNs constrain network weights and/or neuron values to either +1 or −1, which is representable in 1 bit. This leads to dramatic improvements in algorithm efficiency, due to reductions in memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration. We first proposed a BNN hardware accelerator design. Then, we implemented the proposed accelerator on an Arria 10 FPGA as well as a 14-nm ASIC, and compared them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU. Our evaluation shows that the FPGA provides superior efficiency over CPU and GPU. Even though CPU and GPU offer high peak theoretical performance, they are not as efficiently utilized, since BNNs rely on binarized bit-level operations that are better suited for custom hardware. Finally, even though an ASIC is still more efficient, the FPGA can provide orders of magnitude of efficiency improvement over software, without having to lock into a fixed ASIC solution.
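The binarized bit-level operations the abstract refers to reduce a dot product to XNOR plus popcount, which is why custom hardware suits BNNs so well. A small sketch of that identity (standard BNN arithmetic, not the paper's accelerator design):

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1}, stored as {0, 1} bits."""
    return (x >= 0).astype(np.uint8)

def bnn_dot(a_bits, w_bits):
    """Binary dot product: XNOR then popcount.
    For values in {-1, +1}: dot = matches - mismatches = 2 * popcount(XNOR) - n."""
    n = a_bits.size
    matches = int(np.count_nonzero(~(a_bits ^ w_bits) & 1))
    return 2 * matches - n

a, w = np.random.randn(1024), np.random.randn(1024)
print(bnn_dot(binarize(a), binarize(w)))   # bit-level result
print(int(np.sign(a) @ np.sign(w)))        # floating-point reference: same value
```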

286 citations


Proceedings ArticleDOI
Huimin Li1, Xitian Fan1, Li Jiao1, Wei Cao1, Xuegong Zhou1, Lingli Wang1 
01 Aug 2016
TL;DR: This work proposes an end-to-end FPGA-based CNN accelerator with all the layers mapped on one chip so that different layers can work concurrently in a pipelined structure to increase the throughput.
Abstract: In recent years, convolutional neural network (CNN) based machine learning algorithms have been widely applied in computer vision applications. However, for large-scale CNNs, the computation-intensive, memory-intensive, and resource-consuming characteristics bring many challenges to CNN implementation. This work proposes an end-to-end FPGA-based CNN accelerator with all the layers mapped onto one chip, so that different layers can work concurrently in a pipelined structure to increase throughput. A methodology that finds the optimized parallelism strategy for each layer is proposed to achieve high throughput and high resource utilization. In addition, a batch-based computing method is implemented and applied to the fully connected layers (FC layers) to increase memory bandwidth utilization, given their memory-intensive nature. Further, by applying two different computing patterns to the FC layers, the required on-chip buffers can be reduced significantly. As a case study, a state-of-the-art large-scale CNN, AlexNet, is implemented on a Xilinx VC709. It achieves a peak performance of 565.94 GOP/s and 391 FPS at a 156 MHz clock frequency, which outperforms previous approaches.

269 citations


Proceedings ArticleDOI
22 Aug 2016
TL;DR: To the best of our knowledge, ClickNP is the first FPGA-accelerated platform for NFs written entirely in a high-level language, achieving 40 Gbps line rate at any packet size and reducing latency by 10x.
Abstract: Highly flexible software network functions (NFs) are crucial components to enable multi-tenancy in the clouds. However, software packet processing on a commodity server has limited capacity and induces high latency. While software NFs could scale out using more servers, doing so adds significant cost. This paper focuses on accelerating NFs with programmable hardware, i.e., FPGA, which is now a mature technology and inexpensive for datacenters. However, FPGA is predominately programmed using low-level hardware description languages (HDLs), which are hard to code and difficult to debug. More importantly, HDLs are almost inaccessible for most software programmers. This paper presents ClickNP, an FPGA-accelerated platform for highly flexible and high-performance NFs with commodity servers. ClickNP is highly flexible, as it is completely programmable using high-level C-like languages and exposes a modular programming abstraction that resembles the Click Modular Router. ClickNP also achieves high performance: our prototype NFs can process traffic at up to 200 million packets per second with ultra-low latency.

228 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: Heterogeneous Reconfigurable Logic (HRL) is a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays, achieving 92% of the peak performance of an NDP system based on custom accelerators for each application.
Abstract: The energy constraints due to the end of Dennard scaling, the popularity of in-memory analytics, and the advances in 3D integration technology have led to renewed interest in near-data processing (NDP) architectures that move processing closer to main memory. Due to the limited power and area budgets of the logic layer, the NDP compute units should be area and energy efficient while providing sufficient compute capability to match the high bandwidth of vertical memory channels. They should also be flexible to accommodate a wide range of applications. Towards this goal, NDP units based on fine-grained (FPGA) and coarse-grained (CGRA) reconfigurable logic have been proposed as a compromise between the efficiency of custom engines and the flexibility of programmable cores. Unfortunately, FPGAs incur significant area overheads for bit-level reconfiguration, while CGRAs consume significant power in the interconnect and are inefficient for irregular data layouts and control flows. This paper presents Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays. HRL combines both coarse-grained and fine-grained logic blocks, separates routing networks for data and control signals, and uses specialized units to effectively support branch operations and irregular data layouts in analytics workloads. HRL has the power efficiency of FPGA and the area efficiency of CGRA. It improves performance per Watt by 2.2x over FPGA and 1.7x over CGRA. For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.

184 citations


Proceedings ArticleDOI
01 Aug 2016
TL;DR: This paper proposes a memoization optimization to avoid 3 of the 6 dense matrix-vector multiplications (SGEMVs) that constitute the majority of the computation in GRU, and studies the opportunities to accelerate the remaining SGEMVs using FPGAs, in comparison to a 14-nm ASIC, GPU, and multi-core CPU.
Abstract: Recurrent neural networks (RNNs) provide state-of-the-art accuracy for performing analytics on sequential data (e.g., language models). This paper studies a state-of-the-art RNN variant, the Gated Recurrent Unit (GRU). We first propose a memoization optimization to avoid 3 of the 6 dense matrix-vector multiplications (SGEMVs) that constitute the majority of the computation in GRU. Then, we study the opportunities to accelerate the remaining SGEMVs using FPGAs, in comparison to a 14-nm ASIC, GPU, and multi-core CPU. Results show that the FPGA provides superior performance/Watt over CPU and GPU because the FPGA's on-chip BRAMs, hard DSPs, and reconfigurable fabric allow for efficiently extracting fine-grained parallelism from the small/medium-size matrices used by GRU. Moreover, newer FPGAs with more DSPs, on-chip BRAMs, and higher frequency have the potential to narrow the FPGA-ASIC efficiency gap.
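One common reading of this memoization, assuming one-hot inputs as in language models, is that the three W·x products collapse into column lookups, leaving only the three recurrent U·h SGEMVs. A sketch using the standard GRU gate equations (the sizes and the exact correspondence to the paper's optimization are assumptions):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

H, V = 128, 10000                    # hidden size, vocabulary size
Wz, Wr, Wh = (np.random.randn(H, V) * 0.01 for _ in range(3))   # input weights
Uz, Ur, Uh = (np.random.randn(H, H) * 0.01 for _ in range(3))   # recurrent weights

def gru_step(token, h):
    """One GRU step for a one-hot input: the three W*x SGEMVs collapse to
    column lookups (the 'memoization'), leaving only the three U*h SGEMVs."""
    wzx, wrx, whx = Wz[:, token], Wr[:, token], Wh[:, token]   # lookups, no matmul
    z = sigmoid(wzx + Uz @ h)
    r = sigmoid(wrx + Ur @ h)
    h_cand = np.tanh(whx + Uh @ (r * h))
    return (1.0 - z) * h + z * h_cand

h = np.zeros(H)
for tok in [17, 42, 9]:              # a toy token sequence
    h = gru_step(tok, h)
```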

161 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: TABLA provides a template-based framework that generates accelerators for a class of machine learning algorithms; the benefits of FPGA acceleration are rigorously compared to multi-core CPUs and many-core GPUs using real hardware measurements.
Abstract: A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.
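TABLA's core insight, that the SGD solver stays fixed while only the objective's gradient changes per task, can be illustrated in a few lines. The two gradient functions below are textbook forms for linear and logistic regression, not TABLA's generated accelerator code:

```python
import numpy as np

# The solver (SGD loop) is fixed; only the gradient function is swapped per task.
def sgd(grad_fn, w, data, lr=0.05, epochs=20):
    for _ in range(epochs):
        for x, y in data:
            w = w - lr * grad_fn(w, x, y)
    return w

def linreg_grad(w, x, y):              # d/dw of 0.5 * (w.x - y)^2
    return (w @ x - y) * x

def logreg_grad(w, x, y):              # d/dw of logistic loss, y in {0, 1}
    return (1.0 / (1.0 + np.exp(-(w @ x))) - y) * x

xs = np.random.randn(200, 3)
ys = xs @ np.array([1.0, -2.0, 0.5]) + 0.01 * np.random.randn(200)
w = sgd(linreg_grad, np.zeros(3), list(zip(xs, ys)))
print(np.round(w, 2))                  # ≈ [ 1.  -2.   0.5]
```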

158 citations


Posted Content
TL;DR: In this article, the authors present the optimal designs of function blocks that perform the basic operations, i.e., inner product, pooling, and activation function, and propose the optimal design of four types of combinations of basic function blocks, named feature extraction blocks.
Abstract: With recent advances in the Internet of Things (IoT), it has become very attractive to implement deep convolutional neural networks (DCNNs) on embedded/portable systems. Presently, executing software-based DCNNs requires high-performance server clusters in practice, restricting their widespread deployment on mobile devices. To overcome this issue, considerable research efforts have been conducted on developing highly-parallel, specialized DCNN hardware using GPGPUs, FPGAs, and ASICs. Stochastic Computing (SC), which uses a bit-stream to represent a number within [-1, 1] by counting the number of ones in the stream, has high potential for implementing DCNNs with high scalability and an ultra-low hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power/energy and hardware footprint can be achieved compared to conventional binary arithmetic implementations. The tremendous savings in power (energy) and hardware resources open an immense design space for enhancing the scalability and robustness of hardware DCNNs. This paper presents the first comprehensive design and optimization framework for SC-based DCNNs (SC-DCNNs). We first present optimal designs of the function blocks that perform the basic operations, i.e., inner product, pooling, and activation function. Then we propose optimal designs of four types of combinations of basic function blocks, named feature extraction blocks, which are in charge of extracting features from input feature maps. In addition, weight storage methods are investigated to reduce the area and power/energy consumption of storing weights. Finally, the whole SC-DCNN implementation is optimized, with feature extraction blocks carefully selected, to minimize area and power/energy consumption while maintaining a high network accuracy level.
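The AND-gate/multiplexer arithmetic mentioned above is easiest to see in the unipolar SC encoding, where a value x in [0, 1] is the probability of a 1 in the stream (the bipolar [-1, 1] encoding the paper uses replaces the AND with an XNOR for multiplication). A toy simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000                          # stream length; precision grows with N

def to_stream(x):
    """Unipolar SC encoding: x in [0, 1] -> P(bit = 1) = x."""
    return (rng.random(N) < x).astype(np.uint8)

a, b = to_stream(0.6), to_stream(0.5)
product = a & b                            # one AND gate per bit: P = 0.6 * 0.5
select = to_stream(0.5)                    # MUX select stream
scaled_sum = np.where(select, a, b)        # MUX computes (a + b) / 2

print(product.mean())                      # ~ 0.30
print(scaled_sum.mean())                   # ~ 0.55
```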

138 citations


Proceedings ArticleDOI
Yufei Ma1, Naveen Suda1, Yu Cao1, Jae-sun Seo1, Sarma Vrudhula1 
01 Aug 2016
TL;DR: This work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints, and demonstrates the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
Abstract: Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed methodology is demonstrated on Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.

Journal ArticleDOI
TL;DR: A new chaos-based RNG design was achieved and internationally accepted FIPS-140-1 and NIST-800-22 randomness tests were run and video encryption application and security analyses were carried out with the RNG designed here.
Abstract: There has recently been an increase in the number of new chaotic system designs and chaos-based engineering applications. In this study, since homoclinic and heteroclinic orbits do not exist and analyses like the Shilnikov method cannot be used, a 3D chaotic system without equilibrium points was adopted, enabling different engineering applications, especially encryption studies. The 3D chaotic system without equilibrium points represents a new phenomenon and an almost unexplored field of research. First, the chaotic system without equilibrium points was examined as the basis; an electronic circuit implementation of the chaotic system was realized and oscilloscope outputs of the phase portraits were obtained. Later, the chaotic system without equilibrium points was modelled on a Field Programmable Gate Array (FPGA) using LabVIEW, and FPGA chip statistics, phase portraits, and oscilloscope outputs were derived. In another study, VHDL and the RK-4 algorithm were used to achieve a new FPGA-based chaotic oscillator design. Results of the LabVIEW-based and VHDL-based FPGA designs were compared. Results of the chaotic oscillator units designed here were obtained via the Xilinx ISE Simulator. Finally, a new chaos-based RNG design was achieved, and the internationally accepted FIPS-140-1 and NIST-800-22 randomness tests were run. Furthermore, a video encryption application and security analyses were carried out with the RNG designed here.
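As an illustration of the RK-4 modelling step, here is a classic fourth-order Runge-Kutta integrator applied to the Sprott A system, a textbook example of a chaotic system with no equilibrium points; the paper's actual equations and parameters are not given in the abstract, so this system is only a stand-in:

```python
import numpy as np

def sprott_a(s):
    """Sprott A system: no equilibrium points (a stand-in for the paper's system)."""
    x, y, z = s
    return np.array([y, -x + y * z, 1.0 - y * y])

def rk4_step(f, s, h):
    """Classic fourth-order Runge-Kutta step, the RK-4 scheme named above."""
    k1 = f(s)
    k2 = f(s + 0.5 * h * k1)
    k3 = f(s + 0.5 * h * k2)
    k4 = f(s + h * k3)
    return s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

s = np.array([1.0, 0.0, 0.0])
trajectory = []
for _ in range(50_000):
    s = rk4_step(sprott_a, s, h=0.005)
    trajectory.append(s)
```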

Proceedings ArticleDOI
01 Aug 2016
TL;DR: This work presents a working implementation of a 4×4-port universal linear circuit implemented in silicon photonics, and demonstrates the circuit performing two distinct linear operations by changing the software algorithms that controls the circuit.
Abstract: We present our work on reconfigurable, general-purpose linear photonic circuits. Photonic integrated circuits today resemble very specialized application-specific integrated circuits (ASICs) rather than general-purpose CPUs or highly flexible field-programmable gate arrays (FPGAs). Reprogrammable optical circuits could dramatically reduce the time to application or to prototype a new optical chip. We will discuss the possibilities of such circuits, and present a working implementation of a 4×4-port universal linear circuit implemented in silicon photonics. We demonstrate the circuit performing two distinct linear operations by changing the software algorithms that control the circuit.

Journal ArticleDOI
18 Jun 2016
TL;DR: A hybrid area estimation technique is described that uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing.
Abstract: Acceleration in the form of customized datapaths offers large performance and energy improvements over general-purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration. We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6-core processor and show speedups of up to 16.7×.

Journal ArticleDOI
TL;DR: In this paper, a device-level electrothermal model for an MMC on a field-programmable gate array (FPGA) is presented for real-time hardware emulation.
Abstract: Real-time simulation of modular multilevel converters (MMCs) is challenging due to their complex structure consisting of a large number of submodules (SMs). In the literature, the computational speed is emphasized for MMC modeling in real-time simulation, while accurate and detailed information of insulated-gate bipolar transistor (IGBT) modules in SMs is sacrificed. A novel datasheet-based device-level electrothermal model for an MMC on the field programmable gate array (FPGA) is presented in this paper for real-time hardware emulation. Conduction and switching power losses, junction temperatures, temperature-dependent electrical parameters, and linearized switching transient waveforms of IGBT modules are adequately captured in the proposed model. Simultaneously, the system-level behavior of the MMC is accurately modeled. Five-level and nine-level MMC systems are emulated in the hardware with time steps of 10 μs and 10 ns for system-level and device-level computations, respectively. The paralleled and pipelined hardware design using IEEE 32-bit floating point number precision runs on a Xilinx Virtex-7 XC7VX485T device. The emulated real-time results by an oscilloscope have been validated by offline simulation on SaberRD software.

Journal ArticleDOI
TL;DR: With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation and supports the spike-timing-dependent plasticity (STDP) rule for learning.
Abstract: NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
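For flavor, here is a PyNN-style network description of the kind NeuroFlow compiles, using Izhikevich neurons and an STDP projection. This sketch targets the standard NEST backend for illustration, and all parameter values are arbitrary rather than taken from the paper:

```python
import pyNN.nest as sim   # NeuroFlow supplies its own PyNN backend; NEST shown here

sim.setup(timestep=0.1)                                  # ms
pre  = sim.Population(100, sim.Izhikevich(a=0.02, b=0.2, c=-65.0, d=8.0))
post = sim.Population(100, sim.Izhikevich(a=0.02, b=0.2, c=-65.0, d=8.0))

stdp = sim.STDPMechanism(                                # STDP learning rule
    timing_dependence=sim.SpikePairRule(tau_plus=20.0, tau_minus=20.0,
                                        A_plus=0.01, A_minus=0.012),
    weight_dependence=sim.AdditiveWeightDependence(w_min=0.0, w_max=0.5),
    weight=0.05, delay=1.0)

sim.Projection(pre, post, sim.FixedProbabilityConnector(p_connect=0.1),
               synapse_type=stdp)
sim.run(1000.0)                                          # ms
sim.end()
```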

Journal ArticleDOI
TL;DR: A high-throughput energy-efficient Successive Cancellation decoder architecture for polar codes based on combinational logic that combines the advantageous aspects of the combinational decoder with the low-complexity nature of sequential-logic decoders is proposed.
Abstract: This paper proposes a high-throughput energy-efficient Successive Cancellation (SC) decoder architecture for polar codes based on combinational logic. The proposed combinational architecture operates at relatively low clock frequencies compared to sequential circuits, but takes advantage of the high degree of parallelism inherent in such architectures to provide a favorable tradeoff between throughput and energy efficiency at short to medium block lengths. At longer block lengths, the paper proposes a hybrid-logic SC decoder that combines the advantageous aspects of the combinational decoder with the low-complexity nature of sequential-logic decoders. Performance characteristics on ASIC and FPGA are presented with a detailed power consumption analysis for combinational decoders. Finally, the paper presents an analysis of the complexity and delay of combinational decoders, and of the throughput gains obtained by hybrid-logic decoders with respect to purely synchronous architectures.
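The successive cancellation recursion behind both the combinational and hybrid-logic decoders can be written compactly with the standard min-sum f/g node updates. This is textbook SC decoding, not the paper's combinational-logic mapping:

```python
import numpy as np

def f(a, b):                     # check-node update (min-sum approximation)
    return np.sign(a) * np.sign(b) * np.minimum(np.abs(a), np.abs(b))

def g(a, b, u):                  # variable-node update given left partial sums u
    return b + (1 - 2 * u) * a

def sc_decode(llr, frozen, out):
    """Recursive SC decoding. Appends decided bits u to `out`; returns the
    re-encoded sub-codeword used as partial sums by the parent call."""
    n = len(llr)
    if n == 1:
        u = 0 if frozen[0] else int(llr[0] < 0)
        out.append(u)
        return np.array([u])
    half = n // 2
    a, b = llr[:half], llr[half:]
    x_left = sc_decode(f(a, b), frozen[:half], out)
    x_right = sc_decode(g(a, b, x_left), frozen[half:], out)
    return np.concatenate([(x_left + x_right) % 2, x_right])

# Toy run: N = 8, an arbitrary frozen set, all-zeros codeword on a clean channel.
frozen = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=bool)
llr = np.full(8, 2.0)            # positive LLRs favor bit 0
bits = []
sc_decode(llr, frozen, bits)
print(bits)                      # -> all zeros
```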

Journal ArticleDOI
TL;DR: A fully field-programmable gate array (FPGA)-based real-time implementation of MPC is proposed for direct matrix converter (DMC) and the need for another digital control platform, such as digital signal processors (DSP) or dSPACE, is eliminated.
Abstract: Recently, there has been an increase in the use of finite control set model predictive control (FCS-MPC) for power converters. Model predictive control (MPC) uses the discrete-time model of the system to predict future values of control variables for all possible control actions and computes a cost function related to control objectives. This control technique can provide fast dynamic response. However, implementation of the MPC method imposes a very high computational burden and causes significant hardware requirements for real-time implementation. In this paper, a fully field-programmable gate array (FPGA)-based real-time implementation of MPC is proposed for a direct matrix converter (DMC). In the proposed method, all control calculations and the safe commutation scheme for the DMC are fully implemented in the FPGA, and the need for another digital control platform, such as a digital signal processor (DSP) or dSPACE, is eliminated. The proposed scheme takes full advantage of the parallel computation capability of FPGAs.
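FCS-MPC as described here enumerates a finite set of switch states each cycle, predicts one step ahead, and applies the lowest-cost state; on an FPGA the candidate evaluations can run in parallel. The plant model and candidate set below are illustrative, not the matrix-converter model from the paper:

```python
import numpy as np

A = np.array([[0.95, 0.10], [0.0, 0.90]])       # discrete-time state matrix (toy)
b = np.array([0.05, 0.10])                       # input vector (toy)
candidates = [-1.0, 0.0, 1.0]                    # finite set of control actions

def mpc_step(x, ref):
    """Evaluate every admissible action; on hardware this loop unrolls in parallel."""
    best_u, best_cost = None, np.inf
    for u in candidates:
        x_next = A @ x + b * u                   # one-step prediction
        cost = np.sum((x_next - ref) ** 2)       # tracking cost function
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

x, ref = np.zeros(2), np.array([1.0, 0.0])
for _ in range(30):
    u = mpc_step(x, ref)
    x = A @ x + b * u
print(np.round(x, 3))
```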

Journal ArticleDOI
TL;DR: This paper proposes a low-latency real-time gas classification service system that uses a multi-layer perceptron (MLP) artificial neural network to detect and classify gas sensor data.
Abstract: Systems based on wireless gas sensor networks offer a powerful tool to observe and analyze data in complex environments over long monitoring periods. Since the reliability of sensors is very important in those systems, gas classification is a critical process within gas safety precautions. A gas classification system has to react fast in order to take essential actions in the case of fault detection. This paper proposes a low-latency real-time gas classification service system that uses a multi-layer perceptron (MLP) artificial neural network to detect and classify gas sensor data. An accurate MLP is developed to work with the data set obtained from an array of tin oxide (SnO2) gas sensors based on convex micro-hotplates. The overall system acquires the gas sensor data through radio-frequency identification (RFID) and processes the sensor data with the proposed MLP classifier implemented on a system-on-chip (SoC) platform from Xilinx. The hardware implementation of the classifier is optimized to achieve very low latency for real-time application. The proposed architecture has been implemented on a Zynq SoC using fixed-point format, and the results show that an accuracy of 97.4% is achieved.
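The low-latency fixed-point datapath suggested by the abstract can be sketched as integer multiply-accumulate with a Q-format rescale; the topology, activation, and bit widths below are hypothetical, since the abstract does not give them:

```python
import numpy as np

FRAC = 8                                  # Q-format: 8 fractional bits

def to_fixed(x):
    return np.round(x * (1 << FRAC)).astype(np.int32)

def fixed_matvec(Wq, xq):
    """Integer MAC then rescale -- the hardware datapath works on integers only."""
    return ((Wq @ xq) >> FRAC).astype(np.int32)

def relu(v):
    return np.maximum(v, 0)

# Hypothetical 8-input, 16-hidden, 4-class MLP (not the paper's topology).
W1, W2 = np.random.randn(16, 8), np.random.randn(4, 16)
x = np.random.rand(8)                     # one sensor reading, normalized
h = relu(fixed_matvec(to_fixed(W1), to_fixed(x)))   # hidden activations in Q8
logits = fixed_matvec(to_fixed(W2), h)
print("class:", int(np.argmax(logits)))
```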

Journal ArticleDOI
TL;DR: In this article, a quasi-centralized direct model predictive control (QC-DMPC) scheme for back-to-back converter control without a dc-link outer-loop controller is presented.
Abstract: Voltage source back-to-back power converters are widely used in grid-tied applications. This paper presents a quasi-centralized direct model predictive control (QC-DMPC) scheme for back-to-back converter control without a dc-link outer-loop controller. Furthermore, the QC-DMPC is experimentally compared with a DMPC scheme based on a conventional proportional-integral (PI) dc-link controller (PI-DMPC). For the QC-DMPC scheme, the dc-link voltage is directly controlled by a grid-side predictive controller using a dynamic reference generation concept and load-side power estimation. For the PI-DMPC scheme, the dc-link voltage is controlled by an external PI controller. Both schemes are implemented on a field programmable gate array (FPGA)-based platform. The effectiveness of the proposed QC-DMPC is verified by both simulation and experimental data. Moreover, FPGA implementation issues (resource usage and timing information), dc-link control performance, and robustness to parameter variation of the two DMPC schemes are compared in detail. The results emphasize that the QC-DMPC may outperform the PI-DMPC scheme in normal operation, but with slightly higher usage of FPGA resources. However, the PI-DMPC scheme is more robust when parameter variations are considered.

Proceedings ArticleDOI
21 Feb 2016
TL;DR: Results show that the GraphOps-based accelerators are able to operate close to the bandwidth limit of the hardware platform, the limiting constraint in graph analytics computation.
Abstract: Analytics and knowledge extraction on graph data structures have become areas of great interest. For frequently executed algorithms, dedicated hardware accelerators are an energy-efficient avenue to high performance. Unfortunately, they are notoriously labor-intensive to design and verify while meeting stringent time-to-market goals. In this paper, we present GraphOps, a modular hardware library for quickly and easily constructing energy-efficient accelerators for graph analytics algorithms. GraphOps provides a hardware designer with a set of composable graph-specific building blocks, broad enough to target a wide array of graph analytics algorithms. The system is built upon a dataflow execution platform and targets FPGAs, allowing a vendor to use the same hardware to accelerate different types of analytics computation. Low-level hardware implementation details such as flow control, input buffering, rate throttling, and host/interrupt interaction are automatically handled and built into the design of the GraphOps, greatly reducing design time. As an enabling contribution, we also present a novel locality-optimized graph data structure that improves spatial locality and memory efficiency when accessing the graph in main memory. Using the GraphOps system, we construct six different hardware accelerators. Results show that the GraphOps-based accelerators are able to operate close to the bandwidth limit of the hardware platform, the limiting constraint in graph analytics computation.

Journal ArticleDOI
01 Jul 2016 - Optik
TL;DR: The proposed work has shown that chaotic systems can be successfully modeled using ANNs on FPGAs, and that, in the future, chaos-based engineering applications can be performed using ANN-based chaotic oscillators on FPGAs.

Proceedings ArticleDOI
21 Feb 2016
TL;DR: This paper describes architectural enhancements in the Altera Stratix 10 HyperFlex FPGA architecture, fabricated in the Intel 14nm FinFET process, including ubiquitous flip-flops in the routing to enable a high degree of pipelining.
Abstract: This paper describes architectural enhancements in the Altera Stratix 10 HyperFlex FPGA architecture, fabricated in the Intel 14nm FinFET process. Stratix 10 includes ubiquitous flip-flops in the routing to enable a high degree of pipelining. In contrast to the earlier architectural exploration of pipelining in pass-transistor based architectures, the direct drive routing fabric in Stratix-style FPGAs enables an extremely low-cost pipeline register. The presence of ubiquitous flip-flops simplifies circuit retiming and improves performance. The availability of predictable retiming affects all stages of the cluster, place and route flow. Ubiquitous flip-flops require a low-cost clock network with sufficient flexibility to enable pipelining of dozens of clock domains. Different cost/performance tradeoffs in a pipelined fabric and use of a 14nm process lead to other modifications to the routing fabric and the logic element. User modification of the design enables even higher performance, averaging 2.3X faster in a small set of designs.

Proceedings ArticleDOI
11 Jul 2016
TL;DR: This paper proposes Angel-Eye, a programmable and flexible CNN processor architecture, together with compilation tool and runtime environment, which is 8× faster and 7× better in power efficiency than peer FPGA implementation on the same platform.
Abstract: Convolutional Neural Network (CNN) has become a successful algorithm in the field of artificial intelligence and a strong candidate for many applications. However, for embedded platforms, CNN-based solutions are still too complex to apply if only the CPU is used for computation. Various dedicated hardware designs on FPGA and ASIC have been carried out to accelerate CNN, while few of them explore the whole design flow for both fast deployment and high power efficiency. In this paper, we propose Angel-Eye, a programmable and flexible CNN processor architecture, together with a compilation tool and runtime environment. Evaluated on the Zynq XC7Z045 platform, Angel-Eye is 8x faster and 7x more power efficient than a peer FPGA implementation on the same platform. A face detection demo on the XC7Z020 is also 20x and 15x more energy efficient than counterparts on a mobile CPU and mobile GPU, respectively.

Journal ArticleDOI
TL;DR: The proposed SVM cascade architecture is implemented on a Spartan-6 field-programmable gate array (FPGA) platform and evaluated for object detection on 800 × 600 (Super Video Graphics Array) resolution images, and achieves a real-time processing rate of 40 frames/s for the benchmark face detection application.
Abstract: Cascade support vector machines (SVMs) are optimized to efficiently handle problems, where the majority of the data belong to one of the two classes, such as image object classification, and hence can provide speedups over monolithic (single) SVM classifiers. However, SVM classification is a computationally demanding task and existing hardware architectures for SVMs only consider monolithic classifiers. This paper proposes the acceleration of cascade SVMs through a hybrid processing hardware architecture optimized for the cascade SVM classification flow, accompanied by a method to reduce the required hardware resources for its implementation, and a method to improve the classification speed utilizing cascade information to further discard data samples. The proposed SVM cascade architecture is implemented on a Spartan-6 field-programmable gate array (FPGA) platform and evaluated for object detection on 800 × 600 (Super Video Graphics Array) resolution images. The proposed architecture, boosted by a neural network that processes cascade information, achieves a real-time processing rate of 40 frames/s for the benchmark face detection application. Furthermore, the hardware-reduction method results in the utilization of 25% less FPGA custom-logic resources and 20% peak power reduction compared with a baseline implementation.
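The cascade flow described above gains its speed from early rejection: cheap stages discard the abundant negative class so only survivors reach the expensive classifier. A schematic sketch with stand-in linear scorers (not the paper's trained SVM stages):

```python
import numpy as np

class Stage:
    """Stand-in cascade stage: a cheap linear scorer with a reject threshold,
    tuned in practice for a near-zero false-reject rate."""
    def __init__(self, dim, threshold):
        self.w = np.random.randn(dim)
        self.threshold = threshold

    def score(self, x):
        return self.w @ x

def cascade_classify(x, stages):
    for i, stage in enumerate(stages):
        if stage.score(x) < stage.threshold:
            return False, i               # rejected early: most samples exit here
    return True, len(stages)              # accepted by every stage

stages = [Stage(64, -2.0), Stage(64, -1.0), Stage(64, 0.0)]
accepted, exit_stage = cascade_classify(np.random.randn(64), stages)
print(accepted, "exited at stage", exit_stage)
```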

Journal ArticleDOI
TL;DR: The results show that the proposed modifications produced a clear increase in computation speed in comparison with the standard PC-based implementation, demonstrating the usefulness of the intrinsic parallelism of FPGAs in neurocomputational tasks and the suitability of both implementations of the algorithm for application to real-world problems.
Abstract: The well-known backpropagation learning algorithm is implemented on a field-programmable gate array (FPGA) board and a microcontroller, focusing on obtaining efficient implementations in terms of resource usage and computational speed. The algorithm was implemented in both cases using a training/validation/testing scheme in order to avoid overfitting problems. For the FPGA implementation, a new neuron representation that drastically reduces resource usage was introduced by combining the input and first hidden layer units in a single module. Further, a time-division multiplexing scheme was implemented for carrying out product computations, taking advantage of the built-in digital signal processor cores. In both implementations, the floating-point data type representation normally used in a personal computer (PC) was changed to a more efficient one based on a fixed-point scheme, reducing system memory variable usage and leading to an increase in computation speed. The results show that the proposed modifications produced a clear increase in computation speed in comparison with the standard PC-based implementation, demonstrating the usefulness of the intrinsic parallelism of FPGAs in neurocomputational tasks and the suitability of both implementations of the algorithm for application to real-world problems.

Journal ArticleDOI
11 Jul 2016
TL;DR: This paper presents Rigel, which takes pipelines specified in the new multi-rate architecture and lowers them to FPGA implementations, and demonstrates depth from stereo, Lucas-Kanade, the SIFT descriptor, and a Gaussian pyramid running on two FPGAs.
Abstract: Image processing algorithms implemented using custom hardware or FPGAs can be orders of magnitude more energy-efficient and performant than software. Unfortunately, converting an algorithm by hand to a hardware description language suitable for compilation on these platforms is frequently too time-consuming to be practical. Recent work on hardware synthesis of high-level image processing languages demonstrated that a single-rate pipeline of stencil kernels can be synthesized into hardware with provably minimal buffering. Unfortunately, few advanced image processing or vision algorithms fit into this highly restricted programming model. In this paper, we present Rigel, which takes pipelines specified in our new multi-rate architecture and lowers them to FPGA implementations. Our flexible multi-rate architecture supports pyramid image processing, sparse computations, and space-time implementation tradeoffs. We demonstrate depth from stereo, Lucas-Kanade, the SIFT descriptor, and a Gaussian pyramid running on two FPGA boards. Our system can synthesize hardware for FPGAs with up to 436 Megapixels/second throughput, and up to 297x faster runtime than a tablet-class ARM CPU.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC) and are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency.
Abstract: FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC). They are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency. However, this approach limits the number of FPGAs per node and hinders the acceleration of large-scale distributed applications.

Journal ArticleDOI
TL;DR: A Mixed Integer Linear Programming (MILP) formulation, a divide-and-conquer heuristic algorithm, and a Simulated Annealing algorithm are presented to meet security and deadline requirements while minimizing the number of hardware coprocessors needed in the system.
Abstract: Automotive in-vehicle systems are distributed systems consisting of multiple ECUs (Electronic Control Units) interconnected with a broadcast network such as FlexRay. Message authentication is an effective mechanism to prevent attackers from injecting malicious messages into the network. In order to reduce timing interference of message authentication operations on application tasks, hardware coprocessors in the form of either FPGA or ASIC are adopted to offload computation-intensive cryptographic algorithms from the ECU. However, it may not be feasible or desirable to equip every ECU with a hardware coprocessor, as modern vehicles can contain more than one hundred ECUs, and the automotive industry is cost-sensitive. In this paper, we consider the problem of mapping an application task graph onto a FlexRay-based distributed hardware platform, to meet security and deadline requirements while minimizing the number of hardware coprocessors needed in the system. We present a Mixed Integer Linear Programming (MILP) formulation, a divide-and-conquer heuristic algorithm, and a Simulated Annealing algorithm. We evaluate the algorithms with industrial case studies.
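Of the three approaches, simulated annealing is the simplest to sketch. The skeleton below anneals a task-to-ECU mapping under a toy cost that stands in for the number of coprocessors needed; the real formulation's schedulability and security constraints are omitted:

```python
import math
import random

def anneal(initial, neighbor, cost, t0=10.0, cooling=0.995, steps=20_000):
    """Generic simulated-annealing loop: accept worse moves with probability
    exp(-delta / T) while the temperature T cools geometrically."""
    state = best = initial
    t = t0
    for _ in range(steps):
        cand = neighbor(state)
        delta = cost(cand) - cost(state)
        if delta < 0 or random.random() < math.exp(-delta / t):
            state = cand
            if cost(state) < cost(best):
                best = state
        t *= cooling
    return best

N_TASKS, N_ECUS = 20, 6

def neighbor(m):                          # move one random task to a random ECU
    m = list(m)
    m[random.randrange(N_TASKS)] = random.randrange(N_ECUS)
    return tuple(m)

def cost(m):
    # Hypothetical: an ECU needs a coprocessor if it hosts more than 3 tasks.
    return sum(1 for e in range(N_ECUS) if m.count(e) > 3)

mapping = anneal(tuple(random.randrange(N_ECUS) for _ in range(N_TASKS)),
                 neighbor, cost)
print(mapping, "->", cost(mapping), "coprocessors (toy model)")
```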

Journal ArticleDOI
TL;DR: A Field Programmable Gate Array (FPGA) platform is presented for a spike-time-dependent encoder and dynamic reservoir in neuromorphic computing processors; it utilizes the Echo State Network (ESN) architecture, which includes a reservoir and its consequent training process.
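A minimal echo state network, the architecture named in this TL;DR: the reservoir weights stay fixed (scaled for the echo-state property) and only the linear readout is trained. Sizes, scaling, and the ridge-regression readout here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N_RES, N_IN = 200, 3
W_in = rng.uniform(-0.5, 0.5, (N_RES, N_IN))       # fixed input weights
W = rng.standard_normal((N_RES, N_RES))             # fixed reservoir weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))     # spectral radius < 1

def reservoir_step(x, u):
    return np.tanh(W @ x + W_in @ u)

# Drive the reservoir, then fit only the readout by ridge regression.
T = 500
U = rng.standard_normal((T, N_IN))
X = np.zeros((T, N_RES))
x = np.zeros(N_RES)
for t in range(T):
    x = reservoir_step(x, U[t])
    X[t] = x
y = U[:, 0]                                          # toy target signal
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(N_RES), X.T @ y)
```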