
Showing papers on "Field-programmable gate array" published in 2013


Journal ArticleDOI
TL;DR: Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool, and results demonstrate the ability of the tool to explore the hardware/software codesign space by varying the amount of a program that runs in software versus hardware.
Abstract: It is generally accepted that a custom hardware implementation of a set of computations will provide superior speed and energy efficiency relative to a software implementation. However, the cost and difficulty of hardware design is often prohibitive, and consequently, a software approach is used for most applications. In this article, we introduce a new high-level synthesis tool called LegUp that allows software techniques to be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware accelerators that communicate through a standard bus interface. In the hybrid processor/accelerator architecture, program segments that are unsuitable for hardware implementation can execute in software on the processor. LegUp can synthesize most of the C language to hardware, including fixed-sized multidimensional arrays, structs, global variables, and pointer arithmetic. Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool. We also give results demonstrating the ability of the tool to explore the hardware/software codesign space by varying the amount of a program that runs in software versus hardware. LegUp, along with a set of benchmark C programs, is open source and freely downloadable, providing a powerful platform that can be leveraged for new research on a wide range of high-level synthesis topics.

302 citations


Journal ArticleDOI
TL;DR: This study presents a fast yet robust method for fault diagnosis in nonisolated dc-dc converters based on time and current criteria which observe the slope of the inductor current over the time.
Abstract: Fault detection (FD) in power electronic converters is necessary in embedded and safety critical applications to prevent further damage. Fast FD is a mandatory step in order to make a suitable response to a fault in one of the semiconductor devices. The aim of this study is to present a fast yet robust method for fault diagnosis in nonisolated dc–dc converters. FD is based on time and current criteria which observe the slope of the inductor current over the time. It is realized by using a hybrid structure via coordinated operation of two FD subsystems that work in parallel. No additional sensors, which increase system cost and reduce reliability, are required for this detection method. For validation, computer simulations are first carried out. The proposed detection scheme is validated on a boost converter. Effects of input disturbances and the closed-loop control are also considered. In the experimental setup, a field programmable gate array digital target is used for the implementation of the proposed method, to perform a very fast switch FD. Results show that, with the presented method, FD is robust and can be done in a few microseconds.

163 citations
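
The slope criterion described in the abstract above can be illustrated with a minimal Python sketch. It is a simplified, single-criterion illustration and not the authors' hybrid two-subsystem scheme: it assumes that the inductor current should rise while the switch is commanded on and fall while it is commanded off, and it flags a fault after a few consecutive samples contradict that. The function name, the `min_consecutive` threshold, and the fault labels are hypothetical.

```python
def detect_switch_fault(samples, gate, dt, min_consecutive=3):
    """Return (sample_index, kind) for the first suspected fault, else None.

    samples: inductor current samples in amperes
    gate:    commanded switch state per sample (True = on)
    dt:      sampling period in seconds
    Assumption (not from the paper): a healthy converter shows a rising
    current while the switch is on and a falling current while it is off.
    """
    run, kind = 0, None
    for k in range(1, len(samples)):
        # Ignore intervals that straddle a gate transition.
        if gate[k] != gate[k - 1]:
            run, kind = 0, None
            continue
        slope = (samples[k] - samples[k - 1]) / dt
        if gate[k] and slope < 0:
            observed = "open-circuit"      # commanded on, current falls
        elif not gate[k] and slope > 0:
            observed = "short-circuit"     # commanded off, current rises
        else:
            observed = None
        if observed is None:
            run, kind = 0, None
        elif observed == kind:
            run += 1
        else:
            run, kind = 1, observed
        if run >= min_consecutive:
            return k, kind
    return None
```

Requiring several consecutive contradicting samples is one simple way to trade detection time against robustness to measurement noise; the value 3 is arbitrary.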


Proceedings ArticleDOI
11 Feb 2013
TL;DR: In this article, a C-to-FPGA framework is presented to implement data reuse through aggressive loop transformation-based program restructuring, which can satisfy hardware resource constraints (scratchpad size) while still aggressively exploiting data reuse.
Abstract: Many applications, such as medical imaging, generate intensive data traffic between the FPGA and off-chip memory. Significant improvements in the execution time can be achieved with effective utilization of on-chip (scratchpad) memories, associated with careful software-based data reuse and communication scheduling techniques. We present a fully automated C-to-FPGA framework to address this problem. Our framework effectively implements data reuse through aggressive loop transformation-based program restructuring. In addition, our proposed framework automatically implements critical optimizations for performance such as task-level parallelization, loop pipelining, and data prefetching. We leverage the power and expressiveness of the polyhedral compilation model to develop a multi-objective optimization system for off-chip communications management. Our technique can satisfy hardware resource constraints (scratchpad size) while still aggressively exploiting data reuse. Our approach can also be used to reduce the on-chip buffer size subject to bandwidth constraint. We also implement a fast design space exploration technique for effective optimization of program performance using the Xilinx high-level synthesis tool.

152 citations


Journal ArticleDOI
TL;DR: The design of a new high-speed point multiplier for elliptic curve cryptography using either field-programmable gate array or application-specified integrated circuit technology is detailed.
Abstract: This paper details the design of a new high-speed point multiplier for elliptic curve cryptography using either field-programmable gate array or application-specified integrated circuit technology. Different levels of digit-serial computation were applied to the data path of Galois field (GF) multiplication and division to explore the resulting performances and find out an optimal digit size. We provide results for the five National Institute of Standards and Technology recommended curves, outperforming the previously published results. In GF(2^163), we achieve a point multiplication in 19.38 μs in Xilinx Virtex-E. Using the modern Xilinx Virtex-5, the point multiplication times in GF(2^m) for m = 163, 233, 283, 409, and 571 are 5.5, 17.8, 33.6, 102.6, and 384 μs, respectively, which are the fastest figures reported to date.

151 citations


Journal ArticleDOI
TL;DR: The programmability of FPGAs must improve if they are to be part of mainstream computing, and this paper presents a meta-modelling architecture suitable for this purpose.
Abstract: When looking at how hardware influences computing performance, we have GPPs (general-purpose processors) on one end of the spectrum and ASICs (application-specific integrated circuits) on the other...

142 citations


Journal ArticleDOI
TL;DR: Implementation of the fault detection and of the fully digital control schemes on a single FPGA is realized, based on a suited methodology for rapid prototyping, and the results confirm the capability of the proposed reconfigurable control and fault-tolerant structure.
Abstract: In this paper, an FPGA-based fault-tolerant back-to-back converter without redundancy is studied. Before fault occurrence, the fault-tolerant converter operates like a conventional back-to-back six-leg converter, and after the fault, it becomes a five-leg converter. Design, implementation, and experimental verification of an FPGA-based reconfigurable control strategy for this converter are discussed. This reconfigurable control strategy allows the continuous operation of the converter with minimum affection from a fault in one of the semiconductor switches. A very fast detection scheme is used to detect and locate the fault. Implementation of the fault detection and of the fully digital control schemes on a single FPGA is realized, based on a suited methodology for rapid prototyping. FPGA in loop and also experimental tests are carried out, and the results are presented. These results confirm the capability of the proposed reconfigurable control and fault-tolerant structure.

125 citations


Journal ArticleDOI
Hongyan Guo, Hong Chen, Fang Xu, Fei Wang, Geyu Lu
TL;DR: Simulation results of standard double lane change, slalom test, and hard accelerating and braking test show that the proposed EKF implementation scheme has acceptable precision and computational efficiency.
Abstract: In order to improve the computational performance of the extended Kalman filter (EKF) for longitudinal and lateral vehicle velocities estimation, a novel scheme for the EKF implementation is proposed based on field programmable gate array (FPGA) and System on Programmable Chip (SoPC). A Nios II processor clocked at 100 MHz is embedded into the FPGA chip. The EKF is created by C/C++ program and runs in the Nios II processor. The main procedure for the EKF implementation using FPGA/SoPC technique is decomposed into three parts: system requirements analysis, hardware design, and software design. The proposed architecture offers favorable flexibility since it supports the reconfigurable hardware and reprogramming software. For the sake of increasing the computational efficiency, the single precision floating-point customized instructions and algorithm optimization are adopted. A testing platform is introduced to evaluate the functionality and the computational performance of the EKF, which includes an FPGA prototyping board and an xPC-Target system. Simulation results of standard double lane change, slalom test, and hard accelerating and braking test show that the proposed EKF implementation scheme has acceptable precision and computational efficiency.

107 citations


Proceedings ArticleDOI
11 Feb 2013
TL;DR: This paper reverse-engineered the details of the proprietary and unpublished Stratix II bitstream encryption scheme from the Quartus II software and demonstrates that the full 128-bit AES key of a Stratix II can be recovered by means of side-channel analysis with 30,000 measurements, which can be acquired in less than three hours.
Abstract: In order to protect FPGA designs against IP theft and related issues such as product cloning, all major FPGA manufacturers offer a mechanism to encrypt the bitstream used to configure the FPGA. From a mathematical point of view, the employed encryption algorithms, e.g., AES or 3DES, are highly secure. However, recently it has been shown that the bitstream encryption feature of several FPGA product lines is susceptible to side-channel attacks that monitor the power consumption of the cryptographic module. In this paper, we present the first successful attack on the bitstream encryption of the Altera Stratix II FPGA. To this end, we reverse-engineered the details of the proprietary and unpublished Stratix II bitstream encryption scheme from the Quartus II software. Using this knowledge, we demonstrate that the full 128-bit AES key of a Stratix II can be recovered by means of side-channel analysis with 30,000 measurements, which can be acquired in less than three hours. The complete bitstream of a Stratix II that is (seemingly) protected by the bitstream encryption feature can hence fall into the hands of a competitor or criminal - possibly implying system-wide damage if confidential information such as proprietary encryption schemes or keys programmed into the FPGA are extracted. In addition to lost IP, reprogramming the attacked FPGA with modified code, for instance, to secretly plant a hardware trojan, is a particularly dangerous scenario for many security-critical applications.

105 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: LINQits is a flexible hardware template that can be mapped onto programmable logic or ASICs in a heterogeneous system-on-chip for a mobile device or server and accelerates a domain-specific query language called LINQ, which benefits extensively from hardware acceleration.
Abstract: We present LINQits, a flexible hardware template that can be mapped onto programmable logic or ASICs in a heterogeneous system-on-chip for a mobile device or server. Unlike fixed-function accelerators, LINQits accelerates a domain-specific query language called LINQ. LINQits does not provide coverage for all possible applications; however, existing applications (re-)written with LINQ in mind benefit extensively from hardware acceleration. Furthermore, the LINQits framework offers a graceful and transparent migration path from software to hardware. LINQits is prototyped on a 2W heterogeneous SoC called the ZYNQ processor, which combines dual ARM A9 processors with an FPGA on a single die in 28nm silicon technology. Our physical measurements show that LINQits improves energy efficiency by 8.9 to 30.6 times and performance by 10.7 to 38.1 times compared to optimized, multithreaded C programs running on conventional ARM A9 processors.

103 citations


Journal ArticleDOI
TL;DR: An improved cost function for the voltage control of the flying capacitors is proposed in this paper, which offers a capacitor voltage control that corresponds more closely with the desired behavior and adds a limitation on the capacitor voltage deviation.
Abstract: Recently, there has been an increase in the use of finite-set model-based predictive control (FS-MBPC) for power-electronic converters. However, the computational burden for this control scheme is very high and often restrictive for a good implementation. This means that a suitable technology and design approach should be used. In this paper, the implementation of FS-MBPC for flying-capacitor converters in field-programmable gate arrays (FPGAs) is discussed. The control is fully implemented in programmable digital logic by using a high-level design tool. This allows us to obtain very good performances (both in control quality, speed, and hardware utilization) and have a flexible, modular control configuration. The good performance is obtained by exploiting the FPGA's strong points: parallelism and pipelining. Furthermore, an improved cost function for the voltage control of the flying capacitors is proposed in this paper. Typical cost functions result in tracking control for the flying-capacitor voltages, although this does not correspond with the desired system behavior. The improved cost function offers a capacitor voltage control that corresponds more closely with the desired behavior and adds a limitation on the capacitor voltage deviation. Furthermore, the selection of the weight factor in the cost function becomes less critical.

97 citations


Journal ArticleDOI
TL;DR: A functional decomposition method is proposed to map FPGA hardware resources to system modelling, which lends itself to fully pipelined and parallel hardware emulation of individual component models and numerical solvers, while preserving original system characteristics without the need for extraneous components to partition the system.
Abstract: Large-scale electromagnetic transient simulation of power systems in real-time using detailed modelling is computationally very demanding. This study introduces a multi-field programmable gate array (FPGA) hardware design for this purpose. A functional decomposition method is proposed to map FPGA hardware resources to system modelling. This systematic method lends itself to fully pipelined and parallel hardware emulation of individual component models and numerical solvers, while preserving original system characteristics without the need for extraneous components to partition the system. Proof-of-concept is provided in terms of a 3-FPGA and 10-FPGA real-time hardware emulation of a three-phase 42-bus and 420-bus power systems using detailed modelling of various system components and iterative non-linear solution on a 100 MHz FPGA clock. Real-time results are compared with offline simulation results, and conclusions are derived on the performance and scalability of this multi-FPGA hardware design.

Journal ArticleDOI
TL;DR: It is demonstrated that a hardware solution for systems such as automated optical inspection systems or systems dealing with projective geometry estimation and motion compensation systems in robotic vision systems is possible in real time.
Abstract: The Levenberg-Marquardt (LM) algorithm is a nonlinear parameter learning algorithm that converges accurately and quickly. This paper demonstrates, for the first time to our knowledge, a real-time implementation of the LM algorithm on field programmable gate arrays (FPGAs). It was used to train neural networks to solve the eXclusive Or function (XOR), and for 3D-to-2D camera calibration parameter estimation. A Xilinx Virtex-5 ML506 was used to implement the LMA as a hardware-in-the-loop system. The XOR function was approximated in only 13 iterations from zero initial conditions; the same function is usually approximated in thousands of iterations using the error backpropagation algorithm. Also, this type of training not only reduced the number of iterations but also achieved a speed up in excess of 3 × 10^6 when compared to the software implementation. A real-time camera calibration and parameter estimation was performed successfully on FPGAs. Compared to the software implementation, the FPGA implementation led to an increase in the mean squared error and standard deviation by only 17.94% and 8.04% respectively. The FPGA increased the calibration speed by a factor of 1.41 × 10^6. A wide range of systems problems are solved via nonlinear parameter optimization; this study demonstrated that a hardware solution for systems such as automated optical inspection systems or systems dealing with projective geometry estimation and motion compensation systems in robotic vision systems is possible in real time.

Journal ArticleDOI
Wenqiang Wang, Jing Yan, Ningyi Xu, Yu Wang, Feng-Hsiung Hsu
01 Dec 2013
TL;DR: This is the first complete real-time hardware system that supports both cost aggregation on cross-based regions and semi-global optimization on FPGA, and can adjust image resolution, parallelism degree, and support region size to achieve maximum efficiency flexibly during the implementation.
Abstract: Stereo vision is a well-known technique for acquiring depth information. In this paper, we propose a real-time high-quality stereo vision system in field-programmable gate array (FPGA). Using absolute difference-census cost initialization, cross-based cost aggregation, and semiglobal optimization, the system provides high-quality depth results for high-definition images. This is the first complete real-time hardware system that supports both cost aggregation on variable support regions and semiglobal optimization in FPGAs. Furthermore, the system is designed to be scaled with image resolution, disparity range, and parallelism degree for maximum parallel efficiency. We present the depth map quality on the Middlebury benchmark and some real-world scenarios with different image resolutions. The results show that our system performs the best among FPGA-based stereo vision systems and its accuracy is comparable with those of current top-performing software implementations. The first version of the system was demonstrated on an Altera Stratix-IV FPGA board, processing 1024 × 768 pixel images with 96 disparity levels at 67 frames/s. The system is then scaled up on a new Altera Stratix-V FPGA and the processing ability is enhanced to 1600 × 1200 pixel images with 128 disparity levels at 42 frames/s.

Journal ArticleDOI
TL;DR: This paper classifies and presents current and novel design methodologies and architectures for SRAM-based FPGAs, and in particular for Xilinx Virtex-4QV/5QV, configuration memory scrubbers.
Abstract: SRAM-based FPGAs are in-field reconfigurable an unlimited number of times. This characteristic, together with their high performance and high logic density, proves to be very convenient for a number of ground and space level applications. One drawback of this technology is that it is susceptible to ionizing radiation, and this sensitivity increases with technology scaling. This is a first order concern for applications in harsh radiation environments, and starts to be a concern for high reliability ground applications. Several techniques exist for coping with radiation effects at user application. In order to be effective they need to be complemented with configuration memory scrubbing, which allows error mitigation and prevents failures due to error accumulation. Depending on the radiation environment and on the system dependability requirements, the configuration scrubber design can become more or less complex. This paper classifies and presents current and novel design methodologies and architectures for SRAM-based FPGAs, and in particular for Xilinx Virtex-4QV/5QV, configuration memory scrubbers.

Proceedings ArticleDOI
11 Feb 2013
TL;DR: An integrated framework to model and enable both intra- and inter-block HLS optimizations that implement parallelism, pipelining, and fine-grained communication is presented.
Abstract: High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise, and shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can implement some of the most important features generally performed in manual designs including parallel hardware units, pipelining of execution both within a hardware unit and between units, and fine-grained data communication. We may generally classify the optimizations as those that optimize hardware implementation within a code block (intra-block) and those that optimize communication and pipelining between code blocks (inter-block). However, both optimizations are in practice difficult to apply. Real-world applications contain data-dependent blocks of code and communicate through complex data access patterns. Existing high level synthesis tools cannot apply these powerful optimizations unless the code is inherently compatible, severely limiting the optimization opportunity. In this paper we present an integrated framework to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use the powerful HLS optimizations that implement parallelism, pipelining, and fine-grained communication. Our polyhedral model-based technique systematically defines a set of data access patterns, identifies effective data access patterns, and performs the loop transformations to enable the intra- and inter-block optimizations. Our framework automatically explores transformation options, performs code transformations, and inserts the appropriate HLS directives to implement the HLS optimizations. 
Furthermore, our framework can automatically generate the optimized communication blocks for fine-grained communication between hardware blocks. Experimental evaluation demonstrates that we can achieve an average of 6.04X speedup over the high level synthesis solution without our transformations to enable intra- and inter-block optimizations.

Proceedings ArticleDOI
02 Jun 2013
TL;DR: This work presents area optimizations for the most critical and computationally-intensive operation in lattice-based cryptography: polynomial multiplication with the Number Theoretic Transform (NTT).
Abstract: The interest in lattice-based cryptography is increasing due to its quantum resistance and its provable security under some worst-case hardness assumptions. As this is a relatively new topic, the search for efficient hardware architectures for lattice-based cryptographic building blocks is still an active area of research. We present area optimizations for the most critical and computationally-intensive operation in lattice-based cryptography: polynomial multiplication with the Number Theoretic Transform (NTT). The proposed methods are implemented on an FPGA for polynomial multiplication over the ideal ℤ_p[x]/⟨x^n + 1⟩. The proposed hardware architectures reduce slice usage, number of utilized memory blocks and total memory accesses by using a simplified address generation, improved memory organization and on-the-fly operand generations. Compared to prior work, with similar performance the proposed hardware architectures can save up to 67% of occupied slices, 80% of used memory blocks and 60% of memory accesses, and can fit into the smallest Xilinx Spartan-6 FPGA.
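
To make the operation in the abstract above concrete, here is a small Python sketch of polynomial multiplication modulo x^n + 1 using the NTT. It uses a direct O(n²) transform for readability rather than the fast butterfly structure a hardware design would use, and it does not model the paper's address generation or memory organization. The toy parameters n = 4, p = 17 and ψ = 2 (a primitive 2n-th root of unity modulo p) are chosen only for illustration. The three-argument `pow` with a negative exponent needs Python 3.8 or later.

```python
def ntt(vec, omega, p):
    """Direct (quadratic-time) number theoretic transform of vec modulo p."""
    n = len(vec)
    return [sum(vec[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]

def negacyclic_mul(a, b, p, psi):
    """Multiply a and b in Z_p[x]/(x^n + 1); psi is a primitive 2n-th root of unity."""
    n = len(a)
    omega = psi * psi % p
    psi_inv, n_inv = pow(psi, -1, p), pow(n, -1, p)
    # Pre-scale by powers of psi so a cyclic convolution becomes negacyclic.
    a_s = [a[i] * pow(psi, i, p) % p for i in range(n)]
    b_s = [b[i] * pow(psi, i, p) % p for i in range(n)]
    prod = [x * y % p for x, y in zip(ntt(a_s, omega, p), ntt(b_s, omega, p))]
    c_s = ntt(prod, pow(omega, -1, p), p)  # inverse transform, before scaling
    return [c_s[i] * n_inv * pow(psi_inv, i, p) % p for i in range(n)]

def schoolbook_negacyclic(a, b, p):
    """Reference implementation: wrap-around terms pick up a minus sign."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % p
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % p
    return c
```

As a quick check, multiplying x by x^3 gives x^4, which reduces to -1, so `negacyclic_mul([0, 1, 0, 0], [0, 0, 0, 1], 17, 2)` returns `[16, 0, 0, 0]`.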

Journal ArticleDOI
Denis Navarro, Oscar Lucia, Luis A. Barragan, I. Urriza, O. Jimenez
TL;DR: The Xilinx Vivado HLS tool is evaluated for the design of a computationally demanding application, the real-time load estimation for resonant power converters using parametric identification methods, and shows a significant design complexity reduction.
Abstract: Recent advances in power electronic converters highly rely on the development of new control algorithms. These implementations often require complex control architectures featuring microprocessors, digital signal processors, and field-programmable gate arrays (FPGAs). Whereas software implementations are feasible for most power electronics practitioners, FPGA implementations with ad-hoc digital hardware are often a challenging design task. This paper deals with the design and development of control systems for power converters using high-level synthesis tools. In particular, the Xilinx Vivado HLS tool is evaluated for the design of a computationally demanding application, the real-time load estimation for resonant power converters using parametric identification methods. The proposed methodology allows the designer to use a high-level description language, e.g., C, to describe the identification algorithm functionality, and the tool automatically generates the hardware floating-point data-path and the control unit. Besides, it allows a fast design-space exploration through synthesis directives, and pipelining and parallelization are automatically performed to meet timing constraints. The evaluation performed in the study-case control architecture shows a significant design complexity reduction. As a consequence, high-level synthesis tools should be considered as a new paradigm in accelerating digital design for power conversion systems.

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper describes the implementation of FlexGrip, a soft GPGPU architecture optimized for FPGA implementation that supports direct CUDA compilation to a binary executable on the FPGA-based GPGPU without hardware recompilation.
Abstract: Over the past decade, soft microprocessors and vector processors have been extensively used in FPGAs for a wide variety of applications. However, it is difficult to straightforwardly extend their functionality to support conditional and thread-based execution characteristic of general-purpose graphics processing units (GPGPUs) without recompiling FPGA hardware for each application. In this paper, we describe the implementation of FlexGrip, a soft GPGPU architecture which has been optimized for FPGA implementation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU without hardware recompilation. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs. The benefits of our architecture are evaluated for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of up to 30× versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip.

Proceedings ArticleDOI
Weirong Jiang
21 Oct 2013
TL;DR: This paper presents a scalable random access memory (RAM)-based TCAM architecture aiming for efficient implementation on state-of-the-art FPGAs, and is the first FPGA design that implements a TCAM larger than 1 Mbits.
Abstract: Ternary Content Addressable Memory (TCAM) is widely used in network infrastructure for various search functions. There has been a growing interest in implementing TCAM using reconfigurable hardware such as Field Programmable Gate Array (FPGA). Most of existing FPGA-based TCAM designs are based on brute-force implementations, which result in inefficient on-chip resource usage. As a result, existing designs support only a small TCAM size even with large FPGA devices. They also suffer from significant throughput degradation in implementing a large TCAM, mainly caused by deep priority encoding. This paper presents a scalable random access memory (RAM)-based TCAM architecture aiming for efficient implementation on state-of-the-art FPGAs. We give a formal study on RAM-based TCAM to unveil the ideas and the algorithms behind it. To conquer the timing challenge, we propose a modular architecture consisting of arrays of small-size RAM-based TCAM units. After decoupling the update logic from each unit, the modular architecture allows us to share each update engine among multiple units. This leads to resource saving. The capability of explicit range matching is also offered to avoid range-to-ternary conversion for search functions that require range matching. Implementation on a Xilinx Virtex 7 FPGA shows that our design can support a large TCAM of up to 2.4 Mbits while sustaining high throughput of 150 million packets per second. The resource usage scales linearly with the TCAM size. The architecture is configurable, allowing various performance trade-offs to be exploited. To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbits.
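
The general idea behind a RAM-based TCAM, as referred to in the abstract above, can be sketched in Python. This shows one commonly described scheme and is not necessarily the paper's architecture: the key is split into fixed-width chunks, each chunk addresses a RAM whose words are bit vectors marking which rules match that chunk value, the vectors are ANDed, and a priority encoder picks the lowest-numbered matching rule. The shared update engines and range matching from the paper are not modelled, and all names are hypothetical.

```python
def build_rams(rules, chunk_bits):
    """rules: ternary strings such as '10*1' (MSB first, '*' = don't care).

    Returns one RAM per chunk; bit r of rams[c][addr] is set when rule r
    matches chunk value addr. Rule length must be a multiple of chunk_bits.
    """
    n_chunks = len(rules[0]) // chunk_bits
    rams = [[0] * (1 << chunk_bits) for _ in range(n_chunks)]
    for r, pattern in enumerate(rules):
        for c in range(n_chunks):
            sub = pattern[c * chunk_bits:(c + 1) * chunk_bits]
            for addr in range(1 << chunk_bits):
                bits = format(addr, f"0{chunk_bits}b")
                if all(s in ("*", b) for s, b in zip(sub, bits)):
                    rams[c][addr] |= 1 << r
    return rams

def lookup(rams, key, chunk_bits):
    """Return the index of the highest-priority (lowest-numbered) matching rule."""
    n_chunks = len(rams)
    mask = (1 << chunk_bits) - 1
    match = None
    for c in range(n_chunks):
        shift = (n_chunks - 1 - c) * chunk_bits
        vec = rams[c][(key >> shift) & mask]
        match = vec if match is None else match & vec
    if not match:
        return None
    return (match & -match).bit_length() - 1  # position of the lowest set bit
```

Each RAM has 2^chunk_bits words regardless of the rule contents, so narrower chunks mean more but much smaller RAMs; choosing the chunk width is a typical memory versus logic trade-off in this kind of design.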

Journal ArticleDOI
TL;DR: Using the virtual decomposition control (VDC) approach with embedded field programmable gate array (FPGA) logic devices, the proposed solution solves a long-standing problem of lacking control precision fundamentally associated with the modular robot manipulators.
Abstract: A systematic solution to precision control of modular robot manipulators without using joint torque sensing is presented in this paper for the first time. Using the virtual decomposition control (VDC) approach with embedded field programmable gate array (FPGA) logic devices, the proposed solution solves a long-standing problem of lacking control precision fundamentally associated with the modular robot manipulators. As a result, this solution allows modular robot manipulators to possess not only their traditional advantages (such as reconfigurability, flexibility, versatility, and ease of use) but precision control capability as well. A hierarchical master-slave control structure is used, which is supported by a high-speed communication system modified from SpaceWire (IEEE 1355), transferring a limited amount of data between the master and slave nodes at a rate of 1000 Hz. In each module, the FPGA logic implementation uses multiple sampling periods of 163.8 μs, 1.28 μs, and 20 ns. A gravity counterbalance spring provides a design option for the purpose of energy saving. Experimental results demonstrate unprecedented control precision, which is attributed to the use of both the VDC approach and embedded FPGA implementation. The ratio of the maximum position tracking error to the maximum velocity reaches 0.00012 s, more than an order of magnitude better than available technologies in control of robots with harmonic drives. The solution presented in this paper is also applicable to integrated robot manipulators using embedded FPGA controllers.

Journal ArticleDOI
TL;DR: This paper proposes a distributed digital control architecture of a modular-based solid-state transformer (SST) using a digital signal processor (DSP) and a field-programmable gate array (FPGA) that operate cooperatively.
Abstract: This paper proposes a distributed digital control architecture of a modular-based solid-state transformer (SST) using a digital signal processor (DSP) and a field-programmable gate array (FPGA). In particular, the three-stage SST based on a modular structure is the topology of most interest because of its superior controllability. In order to make the modular-based SST, the digital implementation is inevitable to achieve higher performances, improved reliability, and an easy development. In addition, the modular-based SST requires enough capacity for implementing complex control algorithms, multiple interfaces, and a large number of internal variables. In this paper, a digital control platform for the modular-based SST is built using a floating-point DSP and an FPGA that operate cooperatively. As a result, the main control algorithms are performed by the DSP, and the simple logical processes are implemented in the FPGA to synthesize the suitable gating signals and control external devices, respectively. The proposed implementation method enables high-switching-frequency operation, multitasking, and flexible design for the modular-based SST. Experimental results are presented to verify the practical feasibility of the proposed technique for the modular-based SST.

Journal ArticleDOI
01 Apr 2013
TL;DR: Results obtained from case studies for a small UAV helicopter with environment derived from light-detection and ranging data verify the effectiveness of the proposed FPGA-based path planner, and demonstrate convergence at rates above the typical 10 Hz update frequency of an autopilot system.
Abstract: In this paper, a hardware-based path planning architecture for unmanned aerial vehicle (UAV) adaptation is proposed. The architecture aims to provide UAVs with higher autonomy using an application-specific evolutionary algorithm (EA) implemented entirely on a field-programmable gate array (FPGA) chip. The physical attributes of an FPGA chip, being compact in size and low in power consumption, make it an ideal platform for UAV applications. The design, which is implemented entirely in hardware, consists of EA modules, population storage resources, and 3-D terrain information necessary to the path planning process, subject to constraints accounted for separately via UAV, environment, and mission profiles. The architecture has been successfully synthesized for a target Xilinx Virtex-4 FPGA platform with 32% logic slice utilization. Results obtained from case studies for a small UAV helicopter with environment derived from light-detection and ranging data verify the effectiveness of the proposed FPGA-based path planner, and demonstrate convergence at rates above the typical 10 Hz update frequency of an autopilot system.

Journal ArticleDOI
TL;DR: Simulation and experimental results are given to verify the implemented SVPWM control for a PV system in terms of THD.

Journal ArticleDOI
TL;DR: When looking at how hardware influences computing performance, the authors have GPPs (general-purpose processors) on one end of the spectrum and ASICs (application-specific integrated circuits) on the other.
Abstract: When looking at how hardware influences computing performance, we have GPPs (general-purpose processors) on one end of the spectrum and ASICs (application-specific integrated circuits) on the other. Processors are highly programmable but often inefficient in terms of power and performance. ASICs implement a dedicated and fixed function and provide the best power and performance characteristics, but any functional change requires a complete (and extremely expensive) re-spinning of the circuits.

Proceedings ArticleDOI
28 Mar 2013
TL;DR: This paper presents an MTJ/MOS-hybrid video coding hardware that uses a cycle-based power-gating technique for a practical-scale MTJ-based NV-LIM LSI, which is fully designed using the established semi-automated MTJ-oriented design flow.
Abstract: Nonvolatile logic-in-memory (NV-LIM) architecture [1], where magnetic tunnel junction (MTJ) devices [2] are distributed over a CMOS logic-circuit plane, has the potential of overcoming the serious power-consumption problem that has rapidly become a dominant constraint on the performance improvement of today's VLSI processors. Normally-off and instant-on capabilities with a small area penalty due to non-volatility and three-dimensional-stackability of MTJ devices in the above structure allow us to apply a power-gating technique in a fine temporal granularity, which can perfectly eliminate wasted power dissipation due to leakage current. The impact of embedding nonvolatile memory devices into a logic circuit was, however, demonstrated by using only small fabricated primitive logic-circuit elements [3], memory-like structures such as FPGA [4], or circuit simulation because of the lack of an established MTJ-oriented design flow reflecting the chip-fabrication environment, while larger-capacity and/or high-speed-access MRAM has been increasingly developed. In this paper, we present an MTJ/MOS-hybrid video coding hardware that uses a cycle-based power-gating technique for a practical-scale MTJ-based NV-LIM LSI, which is fully designed using the established semi-automated MTJ-oriented design flow.

Journal ArticleDOI
TL;DR: The results of synthesis show that, in the first implementation, 17 929 slices or 20% of the chip area is occupied, which makes it suitable for speed-critical cryptographic applications, while in the second implementation, 14 203 slices or 16% of the chip area is utilized, which makes it suitable for applications that may require a speed-area tradeoff.
Abstract: A new and highly efficient architecture for elliptic curve scalar point multiplication is presented. To achieve the maximum architectural and timing improvements, we reorganize and reorder the critical path of the Lopez-Dahab scalar point multiplication architecture such that logic structures are implemented in parallel and operations in the critical path are diverted to noncritical paths. The results we obtained show that with G=55 our proposed design is able to compute scalar multiplication over GF(2^163) in 9.6 μs with the maximum achievable frequency of 250 MHz on Xilinx Virtex-4 (XC4VLX200), where G is the digit size of the underlying digit-serial finite-field multiplier. Another implementation variant for less resource consumption is also proposed; with G=33, the design performs the same operation in 11.6 μs at 263 MHz on the same platform. The results of synthesis show that, in the first implementation, 17 929 slices or 20% of the chip area is occupied, which makes it suitable for speed-critical cryptographic applications, while in the second implementation 14 203 slices or 16% of the chip area is utilized, which makes it suitable for applications that may require a speed-area tradeoff.
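As a rough consistency check on the timings quoted in this abstract, latency multiplied by clock frequency gives an approximate cycle count per scalar multiplication. The sketch below assumes the reported latencies were measured at the reported maximum frequencies; the function name `cycles` is illustrative and does not come from the paper.

```python
def cycles(latency_us: float, freq_mhz: float) -> float:
    """Approximate clock cycles: latency in microseconds times frequency in MHz."""
    return latency_us * freq_mhz


# Figures quoted in the abstract (Xilinx Virtex-4 XC4VLX200).
fast = cycles(9.6, 250.0)    # G = 55 variant
small = cycles(11.6, 263.0)  # G = 33 variant

print(f"G=55: about {fast:.0f} cycles")
print(f"G=33: about {small:.0f} cycles")
```

Under that assumption the G=55 design needs roughly 2400 cycles and the G=33 design roughly 3051 cycles, which is consistent with a smaller digit size requiring more multiplier iterations per field multiplication.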

Journal ArticleDOI
TL;DR: The presented system is composed of a custom CMOS image sensor, a dedicated image compressor, a forward error correction encoder protecting radio transmitted data against random and burst errors, a radio data transmitter, and a controller supervising all operations of the system.
Abstract: This paper presents the design of a hardware-efficient, low-power image processing system for next-generation wireless endoscopy. The presented system is composed of a custom CMOS image sensor, a dedicated image compressor, a forward error correction (FEC) encoder protecting radio transmitted data against random and burst errors, a radio data transmitter, and a controller supervising all operations of the system. The most significant part of the system is the image compressor. It is based on an integer version of a discrete cosine transform and a novel, low complexity yet efficient, entropy encoder making use of an adaptive Golomb-Rice algorithm instead of Huffman tables. The novel hardware-efficient architecture designed for the presented system enables on-the-fly compression of the acquired image. Instant compression, together with elimination of the necessity of retransmitting erroneously received data by their prior FEC encoding, significantly reduces the size of the required memory in comparison to previous systems. The presented system was prototyped in a single, low-power, 65-nm field-programmable gate array (FPGA) chip. Its power consumption is low and comparable to other application-specific-integrated-circuits-based systems, despite FPGA-based implementation.
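To illustrate why a Golomb-Rice code avoids Huffman tables, the following is a minimal sketch of plain Rice coding with a fixed parameter `k`. It is not the paper's encoder: the adaptive rule for choosing `k` is not described in the abstract and is not reproduced here, and the function names are hypothetical.

```python
def rice_encode(value: int, k: int) -> str:
    """Encode a non-negative integer as a Golomb-Rice code with parameter k.

    The quotient (value >> k) is written in unary as that many '1' bits
    followed by a terminating '0'; the k low-order bits follow in binary.
    """
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    unary = "1" * quotient + "0"
    binary = format(remainder, f"0{k}b") if k > 0 else ""
    return unary + binary


def rice_decode(bits: str, k: int) -> int:
    """Decode a single Golomb-Rice codeword produced by rice_encode."""
    quotient = bits.index("0")  # number of leading '1' bits
    remainder_bits = bits[quotient + 1 : quotient + 1 + k]
    remainder = int(remainder_bits, 2) if k > 0 else 0
    return (quotient << k) | remainder


codeword = rice_encode(9, 2)
print(codeword, rice_decode(codeword, 2))
```

The codeword is built from shifts and masks alone, with no lookup table, which is the property that makes this family of codes attractive for small hardware encoders.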

Journal ArticleDOI
TL;DR: The Merlin system is based on a National Instruments PXI/FlexRIO system running a Xilinx Virtex5 FPGA and is capable of recording Medipix3 256 by 256 by 12 bit data frames at over 1 kHz in bursts of 1200 frames and running at over 100 Hz continuously to disk or over a TCP/IP link.
Abstract: This contribution reports on the development of a new high rate readout system for the Medipix3 hybrid pixel ASIC developed by the Detector Group at Diamond Light Source. It details the current functionality of the system and initial results from tests on Diamond's B16 beamline. The Merlin system is based on a National Instruments PXI/FlexRIO system running a Xilinx Virtex5 FPGA. It is capable of recording Medipix3 256 by 256 by 12 bit data frames at over 1 kHz in bursts of 1200 frames and running at over 100 Hz continuously to disk or over a TCP/IP link. It is compatible with the standard Medipix3 single chip boards developed at CERN and is capable of driving them over cable lengths of up to 10 m depending on the data rate required. In addition to a standalone graphical interface, a system of remote TCP/IP control and data transfer has been developed to allow easy integration with third party control systems and scripting languages. Two Merlin systems are being deployed on the B16 and I16 beamlines at Diamond and the system has been integrated with the EPICS/GDA control systems used. Results from trigger synchronisation, fast burst and high rate tests made on B16 in March are reported and demonstrate an encouraging reliability and timing accuracy. In addition to normal high resolution imaging applications of Medipix3, the results indicate the system could profitably be used in 'pump and probe' style experiments, where a very accurate, high frame rate is especially beneficial. In addition to these two systems, Merlin is being used by the Detector Group to test the Excalibur 16 chip hybrid modules, and by the LHCb VELO Pixel Upgrade group in their forthcoming testbeams. Additionally the contribution looks forward to further developments and improvements in the system, including full rate quad chip readout capability, multi-FPGA support, long distance optical communication and further functionality enhancements built on the capabilities of the Medipix3 chips.
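The frame format and rates quoted in this abstract imply a raw data rate that can be estimated directly. The sketch below assumes pixels are packed at exactly 12 bits with no headers or padding, and takes "over 1 kHz" and "over 100 Hz" as 1000 and 100 frames per second; the constant and function names are illustrative only.

```python
FRAME_WIDTH = 256     # pixels
FRAME_HEIGHT = 256    # pixels
BITS_PER_PIXEL = 12   # counter depth quoted in the abstract


def frame_bits() -> int:
    """Raw bits in one frame, assuming tight packing and no header."""
    return FRAME_WIDTH * FRAME_HEIGHT * BITS_PER_PIXEL


def raw_rate_mbit_per_s(frames_per_second: float) -> float:
    """Raw payload rate in megabits per second (1 Mbit = 1e6 bits)."""
    return frame_bits() * frames_per_second / 1e6


def burst_megabytes(frames: int) -> float:
    """Raw storage needed for a burst, in megabytes (1 MB = 1e6 bytes)."""
    return frame_bits() * frames / 8 / 1e6


print(f"burst mode, 1 kHz:  {raw_rate_mbit_per_s(1000):.1f} Mbit/s")
print(f"continuous, 100 Hz: {raw_rate_mbit_per_s(100):.1f} Mbit/s")
print(f"1200-frame burst:   {burst_megabytes(1200):.1f} MB")
```

Under these assumptions a 1 kHz burst carries roughly 786 Mbit/s of payload and a 1200-frame burst occupies about 118 MB, while continuous 100 Hz operation needs roughly 79 Mbit/s.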

Journal ArticleDOI
TL;DR: Field programmable gate array's (FPGA's) capacity of exploring the parallelism of operations present in the GDSC-PLL is demonstrated through the mapping of this technique directly in hardware, allowing for a much shorter execution time than in DSP.
Abstract: Fundamental-frequency and harmonic positive- and negative-sequence components detection is an important task for implementing power converters for renewable energy systems, uninterruptible power supplies, active power filters, dynamic voltage restorers, and also for power systems protection relays. Detection techniques of this kind are generally implemented in a digital signal processor (DSP) with the execution time limited by the sampling period. The computational effort of the control algorithm can considerably increase the execution time, due to the sequential nature of processing in a DSP. A promising technique for sequence components separation of three-phase signals is the so-called generalized delayed signal cancelation phase-locked loop (GDSC-PLL). Field programmable gate array's (FPGA's) capacity of exploring the parallelism of operations present in the GDSC-PLL is demonstrated in this paper through the mapping of this technique directly in hardware, allowing for a much shorter execution time than in a DSP. The proposed architecture is presented, and the efficient detection of the fundamental-frequency positive-sequence with FPGA is demonstrated, with the obtained results compared with a traditional DSP implementation. In particular, the advantages and possibilities of the use of FPGA are demonstrated in comparison with the DSP. For this comparison, a metric for evaluating the capacity of complexity increase in application algorithms is proposed.

Proceedings ArticleDOI
29 Sep 2013
TL;DR: The VectorBlox MXP Matrix Processor is an FPGA-based soft processor capable of executing data-parallel software algorithms at hardware-like speeds and seamlessly ties into existing Altera and Xilinx development flows, simplifying system creation and deployment.
Abstract: Embedded systems frequently use FPGAs to perform highly parallel data processing tasks. However, building such a system usually requires specialized hardware design skills with VHDL or Verilog. Instead, this paper presents the VectorBlox MXP Matrix Processor, an FPGA-based soft processor capable of highly parallel execution. Programmed entirely in C, the MXP is capable of executing data-parallel software algorithms at hardware-like speeds. For example, the MXP running at 200 MHz or higher can implement a multi-tap FIR filter and output 1 element per clock cycle. MXP's parameterized design lets the user specify the amount of parallelism required, ranging from 1 to 128 or more parallel ALUs. Key features of the MXP include a parallel-access scratchpad memory to hold vector data and high-throughput DMA and scatter/gather engines. To provide extreme performance, the processor is expandable with custom vector instructions and custom DMA filters. Finally, the MXP seamlessly ties into existing Altera and Xilinx development flows, simplifying system creation and deployment.