
Showing papers on "Field-programmable gate array" published in 2012


Proceedings ArticleDOI
03 Jun 2012
TL;DR: Chisel, a new hardware construction language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages, is introduced by embedding Chisel in the Scala programming language, raising the level of hardware design abstraction.
Abstract: In this paper we introduce Chisel, a new hardware construction language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages. By embedding Chisel in the Scala programming language, we raise the level of hardware design abstraction by providing concepts including object orientation, functional programming, parameterized types, and type inference. Chisel can generate a high-speed C++-based cycle-accurate software simulator, or low-level Verilog designed to map to either FPGAs or to a standard ASIC flow for synthesis. This paper presents Chisel, its embedding in Scala, hardware examples, and results for C++ simulation, Verilog emulation and ASIC synthesis.

697 citations


Proceedings ArticleDOI
22 Feb 2012
TL;DR: The current status and new release of an ongoing effort to create a downstream full-implementation flow of Verilog to Routing is described, and the use of the new flow is illustrated by using it to help architect a floating-point unit in an FPGA, and compared with a prior, much longer effort.
Abstract: To facilitate the development of future FPGA architectures and CAD tools -- both embedded programmable fabrics and pure-play FPGAs -- there is a need for a large scale, publicly available software suite that can synthesize circuits into easily-described hypothetical FPGA architectures. These circuits should be captured at the HDL level, or higher, and pass through logical and physical synthesis. Such a tool must provide detailed modelling of area, performance and energy to enable architecture exploration. As software flows themselves evolve to permit design capture at ever higher levels of abstraction, this downstream full-implementation flow will always be required. This paper describes the current status and new release of an ongoing effort to create such a flow - the 'Verilog to Routing' (VTR) project, which is a broad collaboration of researchers. There are three core tools: ODIN II for Verilog Elaboration and front-end hard-block synthesis, ABC for logic synthesis, and VPR for physical synthesis and analysis. ODIN II now has a simulation capability to help verify that its output is correct, as well as specialized synthesis at the elaboration step for multipliers and memories. ABC is used to optimize the 'soft' logic of the FPGA. The VPR-based packing, placement and routing is now fully timing-driven (the previous release was not) and includes new capability to target complex logic blocks. In addition we have added a set of four large benchmark circuits to a suite of previously-released Verilog HDL circuits. Finally, we illustrate the use of the new flow by using it to help architect a floating-point unit in an FPGA, and contrast it with a prior, much longer effort that was required to do the same thing.

271 citations


Proceedings ArticleDOI
25 Oct 2012
TL;DR: It is shown that the OpenCL computing paradigm is a viable design entry method for high-performance computing applications on FPGAs and that it can achieve a clock frequency in excess of 160MHz on benchmarks.
Abstract: We present an OpenCL compilation framework to generate high-performance hardware for FPGAs. For an OpenCL application comprising a host program and a set of kernels, it compiles the host program, generates Verilog HDL for each kernel, compiles the circuit using Altera Complete Design Suite 12.0, and downloads the compiled design onto an FPGA. We can then run the application by executing the host program on a Windows(tm)-based machine, which communicates with kernels on an FPGA using a PCIe interface. We implement four applications on an Altera Stratix IV and present the throughput and area results for each application. We show that we can achieve a clock frequency in excess of 160 MHz on our benchmarks, and that the OpenCL computing paradigm is a viable design entry method for high-performance computing applications on FPGAs.

252 citations


Proceedings ArticleDOI
22 Feb 2012
TL;DR: This paper developed CONNECT, an NoC generator that can produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology, and whose architecture embodies FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control.
Abstract: An FPGA is a peculiar hardware realization substrate in terms of the relative speed and cost of logic vs. wires vs. memory. In this paper, we present a Network-on-Chip (NoC) design study from the mindset of NoC as a synthesizable infrastructural element to support emerging System-on-Chip (SoC) applications on FPGAs. To support our study, we developed CONNECT, an NoC generator that can produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology. The CONNECT NoC architecture embodies a set of FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control. We evaluate CONNECT against a high-quality publicly available synthesizable RTL-level NoC design intended for ASICs. Our evaluation shows a significant gain in specializing NoC design decisions to FPGAs' unique mapping and operating characteristics. For example, in the case of a 4x4 mesh configuration evaluated using a set of synthetic traffic patterns, we obtain comparable or better performance than the state-of-the-art NoC while reducing logic resource cost by 58%, or alternatively, achieve 3-4x better performance for approximately the same logic resource usage. Finally, to demonstrate CONNECT's flexibility and extensive design space coverage, we also report synthesis and network performance results for several router configurations and for entire CONNECT networks.

201 citations


Patent
12 Mar 2012
TL;DR: Techniques for improving the tamper resistance of hardware are suggested, including trusted bootstrapping by means of secure software entity modules, a new use of hardware providing a Physical Unclonable Function (PUF), and the use of a configuration fingerprint of an FPGA used within the tamper-resistant hardware; such hardware may be used in a transaction system that provides the off-line transaction protocol.
Abstract: One of the various aspects of the invention is related to suggesting various techniques for improving the tamper-resistibility of hardware. The tamper-resistant hardware may be advantageously used in a transaction system that provides the off-line transaction protocol. Amongst these techniques for improving the tamper-resistibility are trusted bootstrapping by means of secure software entity modules, a new use of hardware providing a Physical Unclonable Function, and the use of a configuration fingerprint of an FPGA used within the tamper-resistant hardware.

196 citations


Proceedings ArticleDOI
29 Apr 2012
TL;DR: The tool Go Ahead is introduced that is able to implement run-time reconfigurable systems for all recent Xilinx FPGAs and provides a scripting interface and all features can be accessed remotely.
Abstract: Exploiting the benefits of partial run-time reconfiguration requires efficient tools. In this paper, we introduce the tool Go Ahead that is able to implement run-time reconfigurable systems for all recent Xilinx FPGAs. This includes in particular support for low cost and low power Spartan-6 FPGAs. Go Ahead assists during floor planning and automates the constraint generation. It interacts with the Xilinx vendor tools and triggers the physical implementation phases all the way down to the final configuration bit streams. Go Ahead enables the building of flexible systems for integrating many reconfigurable modules very efficiently into a system. The tool targets (re)usability, portability to future devices, and migration paths among reconfigurable systems featuring different FPGAs or even FPGA families. Moreover, it provides a scripting interface and all features can be accessed remotely.

138 citations


Proceedings ArticleDOI
29 Apr 2012
TL;DR: It is demonstrated that a spiking neuron algorithm can be efficiently mapped to Bluehive using Bluespec System Verilog by taking a communication-centric approach, which contrasts with many FPGA-based neural systems which are very focused on parallel computation, resulting in inefficient use of FPGAs.
Abstract: Bluehive is a custom 64-FPGA machine targeted at scientific simulations with demanding communication requirements. Bluehive is designed to be extensible with a reconfigurable communication topology suited to algorithms with demanding high-bandwidth and low-latency communication, something which is unattainable with commodity GPGPUs and CPUs. We demonstrate that a spiking neuron algorithm can be efficiently mapped to Bluehive using Bluespec System Verilog by taking a communication-centric approach. This contrasts with many FPGA-based neural systems which are very focused on parallel computation, resulting in inefficient use of FPGA resources. Our design allows 64k neurons with 64M synapses per FPGA and is scalable to a large number of FPGAs.

105 citations


Book ChapterDOI
27 Feb 2012
TL;DR: A side-channel analysis of the bitstream encryption mechanism provided by Xilinx Virtex FPGAs shows that the encryption mechanism can be completely broken with moderate effort, and demonstrates sophisticated attacks on off-the-shelf FPGAs that go far beyond schoolbook attacks on 8-bit AES S-boxes.
Abstract: This paper presents a side-channel analysis of the bitstream encryption mechanism provided by Xilinx Virtex FPGAs. This work covers our results analyzing the Virtex-4 and Virtex-5 family showing that the encryption mechanism can be completely broken with moderate effort. The presented results provide an overview of a practical real-world analysis and should help practitioners to judge the necessity to implement side-channel countermeasures. We demonstrate sophisticated attacks on off-the-shelf FPGAs that go far beyond schoolbook attacks on 8-bit AES S-boxes. We were able to perform the key extraction by using only the measurements of a single power-up. Access to the key enables cloning and manipulating a design, which has been encrypted to protect the intellectual property and to prevent fraud. As a consequence, the target product faces serious threats like IP theft and more advanced attacks such as reverse engineering or the introduction of hardware Trojans. To the best of our knowledge, this is the first successful attack against the bitstream encryption of Xilinx Virtex-4 and Virtex-5 reported in open literature.

104 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that AMUSE can emulate soft error effects for complex circuits including microprocessors and memories, considering the real delays of an ASIC technology, and support massive fault injection campaigns, in the order of tens of millions of faults within acceptable time.
Abstract: Estimation of soft error sensitivity is crucial in order to devise optimal mitigation solutions that can satisfy reliability requirements with reduced impact on area, performance, and power consumption. In particular, the estimation of Single Event Transient (SET) effects for complex systems that include a microprocessor is challenging, due to the huge potential number of different faults and effects that must be considered, and the delay-dependent nature of SET effects. In this paper, we propose a multilevel FPGA emulation-based fault injection approach for evaluation of SET effects called AMUSE (Autonomous MUltilevel emulation system for Soft Error evaluation). This approach integrates Gate level and Register-Transfer level models of the circuit under test in an FPGA and is able to switch to the appropriate model as needed during emulation. Fault injection is performed at the Gate level, which provides delay accuracy, while fault propagation across clock cycles is performed at the Register-Transfer level for higher performance. Experimental results demonstrate that AMUSE can emulate soft error effects for complex circuits including microprocessors and memories, considering the real delays of an ASIC technology, and support massive fault injection campaigns, in the order of tens of millions of faults within acceptable time.

102 citations


Journal ArticleDOI
TL;DR: This work utilizes memristors as weights in the realization of low-power Field Programmable Gate Arrays (FPGAs) using threshold logic, which is necessary not only for low-power embedded systems but also for realizing biological applications using threshold logic.
Abstract: Researchers have claimed that the memristor, the fourth fundamental circuit element, can be used for computing. In this work, we utilize memristors as weights in the realization of low-power Field Programmable Gate Arrays (FPGAs) using threshold logic which is necessary not only for low power embedded systems, but also realizing biological applications using threshold logic. Boolean functions, which are subsets of threshold functions, can be implemented using the proposed Memristive Threshold Logic (MTL) gate, whose functionality can be configured by changing the weights (memristance). A CAD framework is also developed to map the weights of a threshold gate to corresponding memristance values and synthesize logic circuits using MTL gates. Performance of the MTL gates at the circuit and logic levels is also evaluated using this CAD framework using ISCAS-85 combinational benchmarking circuits. This work also provides solutions based on device options and refreshing memristance, against drift in memristance, which can be a potential problem during operation. Comparisons with the existing CMOS look-up-table (LUT) and capacitor threshold logic (CTL) gates show that MTL gates exhibit less energy-delay product by at least 90 percent.
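As a rough illustration of the configurable threshold gate described in this abstract, the following Python sketch models a gate that outputs 1 when the weighted sum of its binary inputs reaches a threshold. The function names, the integer weights, and the thresholds are illustrative assumptions; they stand in for memristance values and are not taken from the paper.

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Return 1 if the weighted sum of the binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


# Hypothetical configurations: one gate structure realizes different
# Boolean functions purely by changing the weights and the threshold.
CONFIGS = {
    "AND3": ([1, 1, 1], 3),
    "OR3": ([1, 1, 1], 1),
    "MAJ3": ([1, 1, 1], 2),
}


def truth_table(name):
    """Evaluate the named configuration on inputs 000, 001, ..., 111."""
    weights, threshold = CONFIGS[name]
    return [threshold_gate(bits, weights, threshold)
            for bits in product((0, 1), repeat=3)]
```

Calling truth_table("MAJ3") evaluates a 3-input majority function; switching to "AND3" or "OR3" changes only the stored parameters, which is the reconfiguration idea the abstract attributes to changing memristance.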

95 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: The implemented ASIC relies on a semi-parallel architecture where processing resources are reused to achieve good hardware efficiency and a speculative decoding technique is employed to increase the throughput by 25% at the cost of very limited added complexity.
Abstract: This paper presents the first ASIC implementation of a successive cancellation (SC) decoder for polar codes. The implemented ASIC relies on a semi-parallel architecture where processing resources are reused to achieve good hardware efficiency. A speculative decoding technique is employed to increase the throughput by 25% at the cost of very limited added complexity. The resulting architecture is implemented in a 180nm technology. The fabricated chip can be clocked at 150 MHz and uses 183k gates. It was verified using an FPGA testing setup and provides reference for the true silicon complexity of SC decoders for polar codes.

Journal ArticleDOI
TL;DR: The RS latch in this TRNG is implemented as a hard-macro to guarantee the quality of randomness by minimizing the signal skew and load imbalance of internal nodes.
Abstract: True random number generators (TRNGs) are important as a basis for computer security. Though there are some TRNGs composed of analog circuits, the use of digital circuits is desired for the application of TRNGs to logic LSIs. Some of the digital TRNGs utilize jitter in free-running ring oscillators as a source of entropy, which consume large power. Another type of TRNG exploits the metastability of a latch to generate entropy. Although this kind of TRNG has been mostly implemented with full-custom LSI technology, this study presents an implementation based on common FPGA technology. Our TRNG is comprised of logic gates only, and can be integrated in any kind of logic LSI. The RS latch in our TRNG is implemented as a hard-macro to guarantee the quality of randomness by minimizing the signal skew and load imbalance of internal nodes. To improve the quality and throughput, the outputs of 64–256 latches are XOR’ed. The derived design was verified on a Xilinx Virtex-4 FPGA (XC4VFX20), and passed NIST statistical test suite without post-processing. Our TRNG with 256 latches occupies 580 slices, while achieving 12.5Mbps through

Journal ArticleDOI
TL;DR: This brief addresses the implementation of the powerful extreme learning machine (ELM) model on reconfigurable digital hardware (HW) and describes and analyzes two implementation approaches: one involving field-programmable gate array devices and one embedding low-cost low-performance devices such as complex programmable logic devices.
Abstract: The availability of compact fast circuitry for the support of artificial neural systems is a long-standing and critical requirement for many important applications. This brief addresses the implementation of the powerful extreme learning machine (ELM) model on reconfigurable digital hardware (HW). The design strategy first provides a training procedure for ELMs, which effectively trades off prediction accuracy and network complexity. This, in turn, facilitates the optimization of HW resources. Finally, this brief describes and analyzes two implementation approaches: one involving field-programmable gate array devices and one embedding low-cost low-performance devices such as complex programmable logic devices. Experimental results show that, in both cases, the design approach yields efficient digital architectures with satisfactory performances and limited costs.

Journal ArticleDOI
TL;DR: This paper addresses the implementation of linear model predictive control at millisecond range, or faster, sampling rates by designing a custom integrated circuit architecture that is specifically targeted to the MPC problem.
Abstract: This paper addresses the implementation of linear model predictive control (MPC) at millisecond range, or faster, sampling rates. This is achieved by designing a custom integrated circuit architecture that is specifically targeted to the MPC problem. As opposed to the more usual approach using a generic serial architecture processor, the design here is implemented using a field-programmable gate array and employs parallelism, pipelining, and specialized numerical formats. The performance of this approach is profiled via the control of a 14th-order resonant structure with a 12-sample prediction horizon at 200-μs sampling rate. The results indicate that no more than 30 μs are required to compute the control action. A feasibility study indicates that the design can also be implemented in 130 nm CMOS technology, with a core area of 2.5 mm². These results illustrate the feasibility of MPC for reasonably complex systems, using relatively cheap, small, and low-power computing hardware.

Proceedings ArticleDOI
25 Mar 2012
TL;DR: A novel AOP language, LARA, is described, which allows the specification of compilation strategies to enable efficient generation of software code and hardware cores for alternative target architectures and for guiding the application of compiler and hardware synthesis optimizations.
Abstract: The development of applications for high-performance embedded systems is typically a long and error-prone process. In addition to the required functions, developers must consider various and often conflicting non-functional application requirements such as performance and energy efficiency. The complexity of this process is exacerbated by the multitude of target architectures and the associated retargetable mapping tools. This paper introduces an Aspect-Oriented Programming (AOP) approach that conveys domain knowledge and non-functional requirements to optimizers and mapping tools. We describe a novel AOP language, LARA, which allows the specification of compilation strategies to enable efficient generation of software code and hardware cores for alternative target architectures. We illustrate the use of LARA for code instrumentation and analysis, and for guiding the application of compiler and hardware synthesis optimizations. An important LARA feature is its capability to deal with different join points, action models, and attributes, and to generate an aspect intermediate representation. We present examples of our aspect-oriented hardware/software design flow for mapping real-life application codes to embedded platforms based on Field Programmable Gate Array (FPGA) technology.

Journal ArticleDOI
01 Mar 2012
TL;DR: The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high speed processing and shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software.
Abstract: The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high speed processing. This is an important issue since the use of IT2 FIS is still controversial for several reasons; one of the most important is related to the resulting shocking increase in computational complexity that type reducers, like the Karnik-Mendel (KM) iterative method, can cause even for small systems. Hence, comparing our results against a typical implementation of an IT2 FIS using a high level language implemented into a computer, we show that using a hardware implementation the whole IT2 FIS (fuzzification, inference engine, type reducer and defuzzification) lasts only four clock cycles; a speed up of nearly 225,000 and 450,000 can be obtained for the Spartan 3 and Virtex 5 Field Programmable Gate Arrays (FPGAs), respectively. This proposal is suitable to be implemented in pipeline, so the complete IT2 process can be obtained in just one clock cycle with the consequent gain in speed of 900,000 and 2,400,000 for the aforementioned FPGAs. This paper also shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software. Comparative experiments of control surfaces, and time response in the control of a real plant using the IT2 FIS implemented into a computer against the IT2 FIS into an FPGA are shown.

Proceedings ArticleDOI
29 Apr 2012
TL;DR: This paper introduces a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration and shows that it is able to achieve a substantially higher throughput compared to a software-only solution.
Abstract: In this paper, we introduce a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration. Query acceleration is of utmost importance in large database systems to achieve a very high throughput. Although common FPGA-based accelerators are suitable to achieve such a high throughput, their design is hard to extend for new operations. Using partial dynamic reconfiguration, we are able to build more flexible architectures which can be extended to new operations or SQL constructs with a very low area overhead on the FPGA. Furthermore, the reconfiguration of a few FPGA frames can be used to switch very fast from one query to the next. In our approach, an SQL query is transformed into a hardware pipeline consisting of partially reconfigurable modules. The assembly of the (FPGA) data path is done at run-time using a static system providing the stream-based communication interfaces to the partial modules and the database management system. More specifically, each incoming SQL query is analyzed and divided into single operations which are subsequently mapped onto library modules and the composed data path loaded on the FPGA. We show that our approach is able to achieve a substantially higher throughput compared to a software-only solution.
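The idea of splitting a query into single operations and composing them into a streaming data path can be sketched in software. The Python below is only an analogy, assuming hypothetical module names (`select`, `project`, `build_pipeline`) and a made-up table; it does not model partial reconfiguration or the paper's hardware interfaces.

```python
def select(predicate):
    """Library module: pass through only the rows matching the predicate."""
    def module(stream):
        return (row for row in stream if predicate(row))
    return module


def project(columns):
    """Library module: keep only the named columns of each row."""
    def module(stream):
        return ({c: row[c] for c in columns} for row in stream)
    return module


def build_pipeline(modules):
    """Static part: chains the chosen modules over a stream of rows."""
    def run(rows):
        stream = iter(rows)
        for module in modules:
            stream = module(stream)
        return list(stream)
    return run


# Hypothetical query: SELECT name FROM items WHERE price > 10
query = build_pipeline([
    select(lambda row: row["price"] > 10),
    project(["name"]),
])

rows = [
    {"name": "a", "price": 5},
    {"name": "b", "price": 20},
    {"name": "c", "price": 11},
]
result = query(rows)
```

Here `result` holds the rows for "b" and "c". Serving a different query means calling `build_pipeline` with a different module list while the surrounding streaming code stays unchanged, loosely mirroring the split between static system and exchangeable modules.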

Proceedings ArticleDOI
22 Feb 2012
TL;DR: A cycle-accurate and cycle-reproducible large-scale FPGA platform designed from the ground up to accelerate logic verification of the Bluegene/Q compute node ASIC, a multi-processor SOC implemented in IBM's 45 nm SOI CMOS technology.
Abstract: Software based tools for simulation are not keeping up with the demands for increased chip and system design complexity. In this paper, we describe a cycle-accurate and cycle-reproducible large-scale FPGA platform that is designed from the ground up to accelerate logic verification of the Bluegene/Q compute node ASIC, a multi-processor SOC implemented in IBM's 45 nm SOI CMOS technology. This paper discusses the challenges for constructing such large-scale FPGA platforms, including design partitioning, clocking & synchronization, and debugging support, as well as our approach for addressing these challenges without sacrificing cycle accuracy and cycle reproducibility. The resulting full-chip simulation of the Bluegene/Q compute node ASIC runs at a simulated processor clock speed of 4 MHz, over 100,000 times faster than the logic level software simulation of the same design. The vast increase in simulation speed provides a new capability in the design cycle that proved to be instrumental in logic verification as well as early software development and performance validation for Bluegene/Q.

Journal ArticleDOI
TL;DR: This paper presents a parallel array architecture for SVM-based object detection, in an attempt to show the advantages, and performance benefits that stem from a dedicated hardware solution.
Abstract: Object detection applications are often associated with real-time performance constraints that stem from the embedded environment that they are often deployed in. Consequently, researchers have proposed dedicated hardware architectures, utilizing a variety of classification algorithms targeting object detection. Support Vector Machines (SVMs) are among the most popular classification algorithms used in object detection, yielding high accuracy rates. However, existing SVM hardware implementations attempting to speed up SVM classification have either targeted only simple applications, or SVM training. As such, there are limited proposed hardware architectures that are generic enough to be used in a variety of object detection applications. Hence, this paper presents a parallel array architecture for SVM-based object detection, in an attempt to show the advantages and performance benefits that stem from a dedicated hardware solution. The proposed hardware architecture provides parallel processing, resource sharing among the processing units, and efficient memory management. Furthermore, the size of the array is scalable to the hardware demands, and can also handle a variety of applications such as multiclass classification problems. A prototype of the proposed architecture was implemented on an FPGA platform and evaluated using three popular detection applications, demonstrating real-time performance (40-122 fps for a variety of applications).

Proceedings ArticleDOI
15 Mar 2012
TL;DR: An FPGA-based implementation of the Advanced Encryption Standard (AES) algorithm is presented that uses an iterative looping approach with a block and key size of 128 bits and a lookup-table implementation of the S-box.
Abstract: A proposed FPGA-based implementation of the Advanced Encryption Standard (AES) algorithm is presented in this paper. This implementation is compared with other works to show the efficiency. The design uses an iterative looping approach with a block and key size of 128 bits and a lookup-table implementation of the S-box. This gives a low-complexity architecture and easily achieves low latency as well as high throughput. Simulation and performance results are presented and compared with previously reported designs.
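For readers unfamiliar with what an S-box lookup table contains, the Python sketch below derives the 256-entry AES S-box from its standard definition (multiplicative inverse in GF(2^8) followed by an affine transform). The function names are illustrative assumptions, and this is a software derivation of the table contents only, not the paper's hardware design.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8), computed as a^254; 0 maps to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, k):
    """Rotate an 8-bit value left by k positions."""
    return ((x << k) | (x >> (8 - k))) & 0xFF


def build_sbox():
    """Build the 256-entry AES S-box as a plain lookup table."""
    sbox = []
    for x in range(256):
        b = gf_inv(x)
        # Affine transform: XOR of four left rotations plus the constant 0x63.
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
```

Once the table is built, byte substitution is a single indexed read such as SBOX[0x53], which is why a lookup-table S-box trades memory for a short and predictable latency.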

Journal ArticleDOI
TL;DR: A comprehensive approach to the real-time simulation of power converters using a state-space representation is covered in this paper, together with a new switch model that exhibits a natural switching behavior.
Abstract: A comprehensive approach to the real-time simulation of power converters using a state-space representation is covered in this paper. Systematic formulations of state-space equations as well as a new switch model are presented. The proposed switch model exhibits a natural switching behavior, which is a valuable characteristic for the real-time simulation of power converters, thereby allowing individual treatment of switching devices irrespective of the converter topology. Successful implementations of the proposed switch model on a field programmable gate array (FPGA) device are reported, with two alternative approaches: 1) precomputing network equations for all switch state combinations and 2) solving network equations on-chip using the Gauss-Seidel iterative method. A two-level three-phase voltage source converter is implemented using the first approach, with a time step of 80 ns and a switching frequency of 200 kHz. Ideal and nonideal boost converters are also implemented on FPGA using the second approach, with a time step of 75 ns and a switching frequency of 20 kHz. Comparison with SPICE models shows that the proposed switch model offers very satisfactory accuracy and precision.
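The Gauss-Seidel iteration named in this abstract can be sketched in a few lines of Python. This is a generic textbook version for a small linear system A x = b; the function name, the fixed iteration count, and the example matrix are assumptions for illustration and are not the paper's network equations or its on-chip implementation.

```python
def gauss_seidel(a, b, iterations=50):
    """Approximate the solution of a x = b with Gauss-Seidel sweeps.

    Each unknown is updated in place, so later rows in the same sweep
    already use the newest values. The fixed sweep count is an assumption
    that mimics a solver with a bounded time budget per simulation step.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            off_diagonal = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diagonal) / a[i][i]
    return x


# Hypothetical diagonally dominant system: 4x + y = 1, 2x + 3y = 2.
solution = gauss_seidel([[4.0, 1.0], [2.0, 3.0]], [1.0, 2.0])
```

For this example the iterates converge toward x = 0.1 and y = 0.6. Convergence is guaranteed for strictly diagonally dominant matrices, which is an assumption about the example and not a statement about the converter equations in the paper.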

Proceedings ArticleDOI
11 May 2012
TL;DR: This paper implemented 8-, 16- and 32-bit LFSRs on an FPGA using VHDL to study the performance and analyze the behavior of randomness, and the simulation problem for long-bit LFSRs on FPGAs is presented.
Abstract: The LFSR-based PN sequence generator technique is used for various cryptography applications and for designing encoders and decoders in different communication channels. It is important to test and verify a design by implementing it on hardware to get more efficient results. As FPGAs are used to implement any logical function for faster prototype development, it is necessary to implement the existing design of LFSR on an FPGA to test and verify the simulated and synthesized results between different lengths. The total number of random states generated by an LFSR depends on the feedback polynomial. As it is a simple counter, it can count a maximum of 2^n - 1 states by using a maximal feedback polynomial. In this paper we implemented 8-, 16- and 32-bit LFSRs on an FPGA using VHDL to study the performance and analyze the behavior of randomness. The analysis is carried out to find the number of gates, memory and speed requirement in the FPGA as the number of bits is increased. The comparative study of 8-, 16- and 32-bit LFSRs on an FPGA is shown here to understand the on-chip verification. Also, the simulation problem for long-bit LFSRs on FPGAs is presented.
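The dependence of the sequence length on the feedback polynomial can be checked with a small software model. The Python sketch below is a Fibonacci-style LFSR written for illustration; the function names and the chosen tap sets (commonly tabulated maximal-length taps for 8 and 16 bits) are assumptions, since the abstract does not state which polynomials the authors used.

```python
def lfsr_step(state, width, taps):
    """Advance a Fibonacci LFSR by one shift.

    taps are 1-based polynomial exponents; tap t reads bit (width - t).
    """
    bit = 0
    for t in taps:
        bit ^= (state >> (width - t)) & 1
    return (state >> 1) | (bit << (width - 1))


def lfsr_period(width, taps, seed=1):
    """Count the shifts needed to return to the (non-zero) seed state."""
    state = lfsr_step(seed, width, taps)
    count = 1
    while state != seed:
        state = lfsr_step(state, width, taps)
        count += 1
    return count


# Assumed maximal-length tap sets, not taken from the paper.
period_8 = lfsr_period(8, (8, 6, 5, 4))
period_16 = lfsr_period(16, (16, 14, 13, 11))
```

With these taps the model visits every non-zero state, giving periods of 2^8 - 1 = 255 and 2^16 - 1 = 65535. A 32-bit register would need about 4.3 billion shifts to complete one period, which suggests why exhaustively simulating long LFSRs becomes impractical.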

Journal ArticleDOI
TL;DR: The first FPGA design for N-FINDR, a widely used endmember extraction algorithm in the literature, is presented, which includes a direct memory access module and implements a prefetching technique to hide the latency of the input/output communications.
Abstract: Hyperspectral remote sensing attempts to identify features in the surface of the Earth using sensors that generally provide large amounts of data. The data are usually collected by a satellite or an airborne instrument and sent to a ground station that processes it. The main bottleneck of this approach is the (often reduced) bandwidth connection between the satellite and the station, which drastically limits the information that can be sent and processed in real time. A possible way to overcome this problem is to include onboard computing resources able to preprocess the data, reducing its size by orders of magnitude. Reconfigurable field-programmable gate arrays (FPGAs) are a promising platform that allows hardware/software codesign and the potential to provide powerful onboard computing capability and flexibility at the same time. Since FPGAs can implement custom hardware solutions, they can reach very high performance levels. Moreover, using run-time reconfiguration, the functionality of the FPGA can be updated at run time as many times as needed to perform different computations. Hence, the FPGA can be reused for several applications reducing the number of computing resources needed. One of the most popular and widely used techniques for analyzing hyperspectral data is linear spectral unmixing, which relies on the identification of pure spectral signatures via a so-called endmember extraction algorithm. In this paper, we present the first FPGA design for N-FINDR, a widely used endmember extraction algorithm in the literature. Our system includes a direct memory access module and implements a prefetching technique to hide the latency of the input/output communications. 
The proposed method has been implemented on a Virtex-4 XC4VFX60 FPGA (a model that is similar to radiation-hardened FPGAs certified for space operation) and tested using real hyperspectral data collected by NASA's Earth Observing-1 Hyperion (a satellite instrument) and the Airborne Visible Infra-Red Imaging Spectrometer over the Cuprite mining district in Nevada and the Jasper Ridge Biological Preserve in California. Experimental results demonstrate that our hardware version of the N-FINDR algorithm can significantly outperform an equivalent software version and is able to provide accurate results in near real time, which makes our reconfigurable system appealing for onboard hyperspectral data processing.

Proceedings ArticleDOI
25 Oct 2012
TL;DR: HeAP, an analytical placement algorithm for heterogeneous FPGAs comprised of LUT-based logic blocks, multiplier/DSP blocks and block RAMs, is presented; it adapts a state-of-the-art ASIC-based analytical placer to target FPGAs with heterogeneous blocks located at discrete locations throughout the fabric.
Abstract: We present HeAP, an analytical placement algorithm for heterogeneous FPGAs comprised of LUT-based logic blocks, multiplier/DSP blocks and block RAMs. Specifically, we adapt a state-of-the-art ASIC-based analytical placer to target FPGAs with heterogeneous blocks located at discrete locations throughout the fabric. Our placer also handles macros of LUT-based blocks with specific layout requirements, such as carry chains. Results show that our placer delivers a 4× speedup, on average, compared to Altera's non-timing-driven flow, at the cost of a 5% increase in post-routed wirelength, and an 11× speedup compared to Altera's timing-driven flow, at the cost of a 4% increase in post-routed wirelength and a 9% reduction in maximum operating frequency. We also compare with an academic simulated annealing-based placer and demonstrate a 7.4× runtime advantage with 6% better placement quality.

Journal ArticleDOI
TL;DR: A digital hardware emulation of the universal machine (UM) model and the universal line model (ULM) for real-time electromagnetic transient simulation is proposed; it features accurate floating-point data representation, paralleled implementation, and fully pipelined arithmetic processing.
Abstract: Real-time electromagnetic transient simulation plays an important role in the planning, design, and operation of power systems. Inclusion of accurate and complicated models, such as the universal machine (UM) model and the universal line model (ULM), requires significant computational resources. This paper proposes a digital hardware emulation of the UM and the ULM for real-time electromagnetic transient simulation. It features accurate floating-point data representation, paralleled implementation, and fully pipelined arithmetic processing. The hardware is based on advanced field-programmable gate array (FPGA) using VHDL. A power system transient case study is simulated in real time to validate the design. On a 130-MHz input clock frequency to the FPGA, the achieved execution times for UM and ULM models are 2.5 μs and 1.42 μs, respectively. The captured real-time oscilloscope results demonstrate high accuracy of the emulator in comparison to the offline simulation of the original system in the EMTP-RV software.

Journal ArticleDOI
TL;DR: The proposed REM architecture achieved up to 11 Gbps concurrent throughput for various regex sets and up to 2.67× the throughput efficiency of other state-of-the-art designs.
Abstract: We present the design, implementation and evaluation of a high-performance architecture for regular expression matching (REM) on field-programmable gate array (FPGA). Each regular expression (regex) is first parsed into a concise token list representation, then compiled to a modular nondeterministic finite automaton (RE-NFA) using a modified version of the McNaughton-Yamada algorithm. The RE-NFA can be mapped directly onto a compact register-transfer level (RTL) circuit. A number of optimizations are applied to improve the circuit performance: 1) spatial stacking is used to construct an REM circuit processing m ≥ 1 input characters per clock cycle; 2) single-character constrained repetitions are matched efficiently by parallel shift-register lookup tables; 3) complex character classes are matched by a BRAM-based classifier shared across regexes; 4) a multipipeline architecture is used to organize a large number of RE-NFAs into priority groups to limit the I/O size of the circuit. We implemented 2,630 unique PCRE regexes from Snort rules (February 2010) in the proposed REM architecture. Based on the place-and-route results from Xilinx ISE 11.1 targeting Virtex5 LX-220 FPGAs, the proposed REM architecture achieved up to 11 Gbps concurrent throughput for various regex sets and up to 2.67× the throughput efficiency of other state-of-the-art designs.
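To make the idea of mapping an NFA directly onto a circuit more concrete, here is a minimal Python sketch of a position automaton (in the McNaughton-Yamada style) for the hypothetical regex `a(b|c)*d`. It is not the paper's RE-NFA and omits all four optimizations listed above. The tables `POSITIONS`, `FIRST`, `FOLLOW` and `LAST` and the function `match_ends` are hand-written assumptions for this one pattern. Each position plays the role of one state flip-flop, and one loop iteration stands in for one clock cycle that consumes one input character.

```python
# Position automaton for the assumed example regex a(b|c)*d.
# Each numbered position corresponds to one character in the pattern.
POSITIONS = {1: "a", 2: "b", 3: "c", 4: "d"}
FIRST = {1}                      # positions that can start a match
FOLLOW = {                       # positions that may come after each position
    1: {2, 3, 4},
    2: {2, 3, 4},
    3: {2, 3, 4},
    4: set(),
}
LAST = {4}                       # positions that can end a match


def match_ends(text):
    """Return the indices in text at which a match of a(b|c)*d ends."""
    active = set()               # the set of state bits that are currently 1
    ends = []
    for i, ch in enumerate(text):
        # A position is armed if it can start a match (unanchored search)
        # or if it follows a position that was active in the last cycle.
        armed = set(FIRST)
        for q in active:
            armed |= FOLLOW[q]
        # An armed position becomes active only if its character matches.
        active = {p for p in armed if POSITIONS[p] == ch}
        if active & LAST:
            ends.append(i)
    return ends


print(match_ends("xabcbd"))      # a match ends at the final character
print(match_ends("abd ad"))      # two separate matches
```

In hardware terms, the `armed` computation corresponds to an OR over predecessor state bits and the character comparison to an AND with a decoded input character, so all positions update in parallel within one cycle.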

Proceedings ArticleDOI
01 Dec 2012
TL;DR: It is shown that the Split-Merge switch architecture is more amenable to pipelining on FPGAs, achieving 300 MHz operation (up to three times the frequency and throughput of the CONNECT switches) with only 13–37% more area.
Abstract: Due to their different cost structures, the architecture of switches for an FPGA packet-switched Network-on-a-Chip (NoC) should differ from their ASIC counterparts. The CONNECT network recently demonstrated several ways in which packet-switched FPGA NoCs should differ from ASIC NoCs. However, they also concluded that pipelining was not appropriate for the FPGA switches. We show that the Split-Merge switch architecture is more amenable to pipelining on FPGAs, achieving 300 MHz operation (up to three times the frequency and throughput of the CONNECT switches) with only 13–37% more area. Furthermore, we show that the Split-Merge switches are at least as efficient at routing traffic as the CONNECT switches, meaning the 2–3× frequency translates directly into two to three times the application performance.

Journal ArticleDOI
TL;DR: A holistic approach to modeling and field programmable gate array (FPGA) implementation of a permanent magnet synchronous motor (PMSM) speed controller that fits into a low-cost FPGA, without significantly increasing the execution time.
Abstract: The aim of this paper is to present a holistic approach to modeling and field programmable gate array (FPGA) implementation of a permanent magnet synchronous motor (PMSM) speed controller. The whole system is modeled in the Matlab Simulink environment. The controller is then translated to discrete time and remodeled using System Generator blocks, directly synthesizable into FPGA hardware. The algorithm is further refined and factorized to take into account hardware constraints, so as to fit into a low-cost FPGA, without significantly increasing the execution time. The resulting controller is then integrated together with sensor interfaces and analysis tools and implemented into an FPGA device. Experimental results validate the controller and verify the design.

Journal ArticleDOI
TL;DR: Two synchronous designs are presented to increase the resolution of digital pulse-width modulators (DPWM) implemented on field programmable gate arrays (FPGA), based on the on-chip digital clock manager block present in the low-cost Spartan-3 FPGA series and on the I/O delay element available in the high-end Virtex-6 FPGA series.
Abstract: Advantages of digital control in power electronics have led to an increasing use of digital pulse-width modulators (DPWM). However, the clock frequency requirements may exceed the operational limits when the power converter switching frequency is increased, while using classical DPWM architectures. In this paper, we present two synchronous designs to increase the resolution of the DPWM implemented on field programmable gate arrays (FPGA). The proposed circuits are based on the on-chip digital clock manager block present in the low-cost Spartan-3 FPGA series and on the I/O delay element (IODELAYE1) available in the high-end Virtex-6 FPGA series. These solutions have been implemented, tested, and compared to verify the performance of these architectures.
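The clock-frequency problem can be seen with a short calculation. The Python sketch below assumes a classical counter-based DPWM, in which one switching period is divided into clock ticks, so n bits of resolution at switching frequency f_sw need a counter clock of 2^n × f_sw. The function names, the example frequencies and the `phases` parameter (a hypothetical number of equally spaced sub-clock edge positions, standing in for any finer time step) are assumptions for illustration and do not describe the two proposed circuits.

```python
import math


def required_clock_hz(f_sw_hz, bits):
    """Counter clock needed for `bits` of resolution in a counter-based DPWM."""
    return f_sw_hz * 2**bits


def effective_bits(f_clk_hz, f_sw_hz, phases=1):
    """Resolution in bits when each clock period is split into `phases` steps.

    `phases` is a hypothetical parameter: 1 means a plain counter.
    """
    return math.log2(phases * f_clk_hz / f_sw_hz)


# Assumed example: 1 MHz switching frequency, 10-bit resolution.
print(required_clock_hz(1_000_000, 10))              # counter clock in Hz
# Assumed example: 100 MHz clock, 1 MHz switching, plain counter vs 4 sub-steps.
print(effective_bits(100_000_000, 1_000_000))
print(effective_bits(100_000_000, 1_000_000, phases=4))
```

Under these assumptions a 10-bit DPWM at 1 MHz would need a clock above 1 GHz, whereas each doubling of the number of sub-clock steps adds one bit without raising the clock frequency.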

Journal ArticleDOI
TL;DR: The architecture and implementation of a field-programmable gate array (FPGA) accelerator for double-precision floating-point matrix multiplication are presented; the accelerator employs the block matrix multiplication algorithm and returns the result blocks to the host processor as soon as they are computed.
Abstract: This study treats architecture and implementation of a field-programmable gate array (FPGA) accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses eight Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 Giga FLOPS (GFLOPS)), by comparing it to the double-precision matrix multiplication functions from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core2Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
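The division of work between accelerator and host can be sketched in a few lines of Python. This is a simplified model, not the authors' design: it assumes square b × b blocks, a matrix size divisible by b, and one multiply plus one add per PE per clock cycle. The names `block_matmul`, `accelerator_share` and `peak_gflops` are hypothetical. The block product stands in for the accelerator, which hands each partial result back immediately, and the accumulation loop stands in for the host.

```python
def block_matmul(A, B, b):
    """Multiply square matrices A and B using b x b blocks.

    Returns (C, accel_ops, host_ops). Assumes len(A) is divisible by b.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    accel_ops = host_ops = 0
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # "Accelerator": one partial result block, returned at once.
                P = [[sum(A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
                          for k in range(b))
                      for j in range(b)]
                     for i in range(b)]
                accel_ops += b * b * (2 * b - 1)   # b mults, b-1 adds per entry
                # "Host": accumulate the partial block into the result.
                for i in range(b):
                    for j in range(b):
                        C[i0 + i][j0 + j] += P[i][j]
                host_ops += b * b                  # one add per entry
    return C, accel_ops, host_ops


def accelerator_share(b):
    """Fraction of arithmetic done by the accelerator in this simple model."""
    return 1 - 1 / (2 * b)


def peak_gflops(num_pes, f_clk_hz, flops_per_pe_per_cycle=2):
    """Peak rate, assuming one multiply and one add per PE per cycle."""
    return num_pes * flops_per_pe_per_cycle * f_clk_hz / 1e9


C, accel_ops, host_ops = block_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1)
print(C, accel_ops, host_ops)
print(accelerator_share(64))
print(peak_gflops(252, 403e6))
```

In this model the host performs a fraction 1/(2b) of the operations, so the accelerator's share passes 99% once b exceeds 50, and 252 PEs at 403 MHz give a peak of about 203.1 GFLOPS, which agrees with the figure quoted in the abstract.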