Showing papers on "Field-programmable gate array" published in 2012

Proceedings Article•DOI•

Chisel: constructing hardware in a Scala embedded language

Jonathan Bachrach¹, Huy Vo¹, Brian Richards¹, Yunsup Lee¹, Andrew Waterman¹, Rimas Avizienis¹, John Wawrzynek¹, Krste Asanovic¹•Institutions (1)

University of California, Berkeley¹

03 Jun 2012

TL;DR: Introduces Chisel, a new hardware construction language embedded in the Scala programming language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages, raising the level of hardware design abstraction.

Abstract: In this paper we introduce Chisel, a new hardware construction language that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages. By embedding Chisel in the Scala programming language, we raise the level of hardware design abstraction by providing concepts including object orientation, functional programming, parameterized types, and type inference. Chisel can generate a high-speed C++-based cycle-accurate software simulator, or low-level Verilog designed to map to either FPGAs or to a standard ASIC flow for synthesis. This paper presents Chisel, its embedding in Scala, hardware examples, and results for C++ simulation, Verilog emulation and ASIC synthesis.
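The idea of a parameterized generator that emits low-level Verilog can be illustrated with a minimal sketch. This is not Chisel and does not use its API: it is plain Python, and the function name `make_adder`, the module name, and the port names are hypothetical choices made only for illustration.

```python
def make_adder(name: str, width: int) -> str:
    """Return Verilog source for a `width`-bit adder with a carry bit.

    Hypothetical generator: one Python function stands in for a whole
    family of hardware modules, selected by the `width` parameter.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    msb = width - 1
    return "\n".join([
        f"module {name} (",
        f"  input  [{msb}:0] a,",
        f"  input  [{msb}:0] b,",
        f"  output [{width}:0] sum",  # one extra bit holds the carry
        ");",
        "  assign sum = a + b;",
        "endmodule",
    ])


if __name__ == "__main__":
    # The same generator produces differently sized modules.
    print(make_adder("adder8", 8))
    print(make_adder("adder32", 32))
```

A host language with real types and functions, as the abstract describes for Scala, lets such generators be composed and checked rather than assembled from strings as in this sketch.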

697 citations

Proceedings Article•DOI•

The VTR project: architecture and CAD for FPGAs from Verilog to routing

Jonathan Rose¹, Jason Luu¹, Chi Wai Yu², Opal Densmore¹, Jeffrey Goeders³, Andrew Somerville⁴, Kenneth B. Kent⁴, Peter Jamieson⁵, Jason H. Anderson¹•Institutions (5)

University of Toronto¹, City University of Hong Kong², University of British Columbia³, University of New Brunswick⁴, University of Miami⁵

22 Feb 2012

TL;DR: Describes the current status and new release of the 'Verilog to Routing' (VTR) project, an ongoing effort to create a downstream full-implementation flow, and illustrates the new flow by using it to help architect a floating-point unit in an FPGA, contrasting it with a prior, much longer effort.

Abstract: To facilitate the development of future FPGA architectures and CAD tools -- both embedded programmable fabrics and pure-play FPGAs -- there is a need for a large scale, publicly available software suite that can synthesize circuits into easily-described hypothetical FPGA architectures. These circuits should be captured at the HDL level, or higher, and pass through logical and physical synthesis. Such a tool must provide detailed modelling of area, performance and energy to enable architecture exploration. As software flows themselves evolve to permit design capture at ever higher levels of abstraction, this downstream full-implementation flow will always be required. This paper describes the current status and new release of an ongoing effort to create such a flow - the 'Verilog to Routing' (VTR) project, which is a broad collaboration of researchers. There are three core tools: ODIN II for Verilog Elaboration and front-end hard-block synthesis, ABC for logic synthesis, and VPR for physical synthesis and analysis. ODIN II now has a simulation capability to help verify that its output is correct, as well as specialized synthesis at the elaboration step for multipliers and memories. ABC is used to optimize the 'soft' logic of the FPGA. The VPR-based packing, placement and routing is now fully timing-driven (the previous release was not) and includes new capability to target complex logic blocks. In addition we have added a set of four large benchmark circuits to a suite of previously-released Verilog HDL circuits. Finally, we illustrate the use of the new flow by using it to help architect a floating-point unit in an FPGA, and contrast it with a prior, much longer effort that was required to do the same thing.

271 citations

Proceedings Article•DOI•

From OpenCL to high-performance hardware on FPGAs

Tomasz Czajkowski¹, Utku Aydonat¹, Dmitry N. Denisenko¹, John Freeman¹, Michael Kinsner¹, David Neto¹, Jason Wong¹, Peter Yiannacouras¹, Deshanand Singh¹•Institutions (1)

Altera¹

25 Oct 2012

TL;DR: It is shown that the OpenCL computing paradigm is a viable design entry method for high-performance computing applications on FPGAs and that it can achieve a clock frequency in excess of 160MHz on benchmarks.

Abstract: We present an OpenCL compilation framework to generate high-performance hardware for FPGAs. For an OpenCL application comprising a host program and a set of kernels, it compiles the host program, generates Verilog HDL for each kernel, compiles the circuit using Altera Complete Design Suite 12.0, and downloads the compiled design onto an FPGA. We can then run the application by executing the host program on a Windows(tm)-based machine, which communicates with kernels on an FPGA using a PCIe interface. We implement four applications on an Altera Stratix IV and present the throughput and area results for each application. We show that we can achieve a clock frequency in excess of 160MHz on our benchmarks, and that the OpenCL computing paradigm is a viable design entry method for high-performance computing applications on FPGAs.

252 citations

Proceedings Article•DOI•

CONNECT: re-examining conventional wisdom for designing NoCs in the context of FPGAs

Michael K. Papamichael¹, James C. Hoe¹•Institutions (1)

Carnegie Mellon University¹

22 Feb 2012

TL;DR: This paper presents CONNECT, an NoC generator that produces synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology; its architecture embodies FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control.

Abstract: An FPGA is a peculiar hardware realization substrate in terms of the relative speed and cost of logic vs. wires vs. memory. In this paper, we present a Network-on-Chip (NoC) design study from the mindset of NoC as a synthesizable infrastructural element to support emerging System-on-Chip (SoC) applications on FPGAs. To support our study, we developed CONNECT, an NoC generator that can produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology. The CONNECT NoC architecture embodies a set of FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control. We evaluate CONNECT against a high-quality publicly available synthesizable RTL-level NoC design intended for ASICs. Our evaluation shows a significant gain in specializing NoC design decisions to FPGAs' unique mapping and operating characteristics. For example, in the case of a 4x4 mesh configuration evaluated using a set of synthetic traffic patterns, we obtain comparable or better performance than the state-of-the-art NoC while reducing logic resource cost by 58%, or alternatively, achieve 3-4x better performance for approximately the same logic resource usage. Finally, to demonstrate CONNECT's flexibility and extensive design space coverage, we also report synthesis and network performance results for several router configurations and for entire CONNECT networks.

201 citations

Patent•

Tamper-protected hardware and method for using same

Kreft Heinz

12 Mar 2012

TL;DR: Techniques for improving the tamper resistance of hardware are proposed, including trusted bootstrapping by means of secure software entity modules, a new use of hardware providing a Physical Unclonable Function (PUF), and the use of a configuration fingerprint of an FPGA within the tamper-resistant hardware; such hardware may be used in a transaction system that provides the off-line transaction protocol.

Abstract: One of the various aspects of the invention is related to suggesting various techniques for improving the tamper-resistibility of hardware. The tamper-resistant hardware may be advantageously used in a transaction system that provides the off-line transaction protocol. Amongst these techniques for improving the tamper-resistibility are trusted bootstrapping by means of secure software entity modules, a new use of hardware providing a Physical Unclonable Function, and the use of a configuration fingerprint of an FPGA used within the tamper-resistant hardware.

196 citations

Proceedings Article•DOI•

Go Ahead: A Partial Reconfiguration Framework

Christian Beckhoff¹, Dirk Koch¹, Jim Torresen¹•Institutions (1)

University of Oslo¹

29 Apr 2012

TL;DR: Introduces Go Ahead, a tool that is able to implement run-time reconfigurable systems for all recent Xilinx FPGAs and that provides a scripting interface, with all features accessible remotely.

Abstract: Exploiting the benefits of partial run-time reconfiguration requires efficient tools. In this paper, we introduce the tool Go Ahead that is able to implement run-time reconfigurable systems for all recent Xilinx FPGAs. This includes in particular support for low cost and low power Spartan-6 FPGAs. Go Ahead assists during floor planning and automates the constraint generation. It interacts with the Xilinx vendor tools and triggers the physical implementation phases all the way down to the final configuration bit streams. Go Ahead enables the building of flexible systems for integrating many reconfigurable modules very efficiently into a system. The tool targets (re)usability, portability to future devices, and migration paths among reconfigurable systems featuring different FPGAs or even FPGA families. Moreover, it provides a scripting interface and all features can be accessed remotely.

138 citations

Proceedings Article•DOI•

Bluehive - A field-programable custom computing machine for extreme-scale real-time neural network simulation

Simon W. Moore¹, Paul J. Fox¹, S. J. T. Marsh¹, A. T. Markettos¹, Alan Mujumdar¹•Institutions (1)

University of Cambridge¹

29 Apr 2012

TL;DR: It is demonstrated that a spiking neuron algorithm can be efficiently mapped to Bluehive using Bluespec System Verilog by taking a communication-centric approach, which contrasts with many FPGA-based neural systems which are very focused on parallel computation, resulting in inefficient use of FPGAs.

Abstract: Bluehive is a custom 64-FPGA machine targeted at scientific simulations with demanding communication requirements. Bluehive is designed to be extensible with a reconfigurable communication topology suited to algorithms with demanding high-bandwidth and low-latency communication, something which is unattainable with commodity GPGPUs and CPUs. We demonstrate that a spiking neuron algorithm can be efficiently mapped to Bluehive using Bluespec System Verilog by taking a communication-centric approach. This contrasts with many FPGA-based neural systems which are very focused on parallel computation, resulting in inefficient use of FPGA resources. Our design allows 64k neurons with 64M synapses per FPGA and is scalable to a large number of FPGAs.

105 citations

Book Chapter•DOI•

Black-Box side-channel attacks highlight the importance of countermeasures: an analysis of the Xilinx Virtex-4 and Virtex-5 bitstream encryption mechanism

Amir Moradi¹, Markus Kasper¹, Christof Paar¹•Institutions (1)

Ruhr University Bochum¹

27 Feb 2012

TL;DR: A side-channel analysis of the bitstream encryption mechanism provided by Xilinx Virtex FPGAs shows that the encryption mechanism can be completely broken with moderate effort, and demonstrates sophisticated attacks on off-the-shelf FPGAs that go far beyond schoolbook attacks on 8-bit AES S-boxes.

Abstract: This paper presents a side-channel analysis of the bitstream encryption mechanism provided by Xilinx Virtex FPGAs. This work covers our results analyzing the Virtex-4 and Virtex-5 family showing that the encryption mechanism can be completely broken with moderate effort. The presented results provide an overview of a practical real-world analysis and should help practitioners to judge the necessity to implement side-channel countermeasures. We demonstrate sophisticated attacks on off-the-shelf FPGAs that go far beyond schoolbook attacks on 8-bit AES S-boxes. We were able to perform the key extraction by using only the measurements of a single power-up. Access to the key enables cloning and manipulating a design, which has been encrypted to protect the intellectual property and to prevent fraud. As a consequence, the target product faces serious threats like IP theft and more advanced attacks such as reverse engineering or the introduction of hardware Trojans. To the best of our knowledge, this is the first successful attack against the bitstream encryption of Xilinx Virtex-4 and Virtex-5 reported in open literature.

104 citations

Journal Article•DOI•

Soft Error Sensitivity Evaluation of Microprocessors by Multilevel Emulation-Based Fault Injection

Luis Entrena¹, Mario Garcia-Valderas¹, R. Fernandez-Cardenal¹, Almudena Lindoso¹, M. Portela¹, Celia Lopez-Ongil¹•Institutions (1)

Carlos III Health Institute¹

01 Mar 2012-IEEE Transactions on Computers

TL;DR: Experimental results demonstrate that AMUSE can emulate soft error effects for complex circuits including microprocessors and memories, considering the real delays of an ASIC technology, and support massive fault injection campaigns, in the order of tens of millions of faults within acceptable time.

Abstract: Estimation of soft error sensitivity is crucial in order to devise optimal mitigation solutions that can satisfy reliability requirements with reduced impact on area, performance, and power consumption. In particular, the estimation of Single Event Transient (SET) effects for complex systems that include a microprocessor is challenging, due to the huge potential number of different faults and effects that must be considered, and the delay-dependent nature of SET effects. In this paper, we propose a multilevel FPGA emulation-based fault injection approach for evaluation of SET effects called AMUSE (Autonomous MUltilevel emulation system for Soft Error evaluation). This approach integrates Gate level and Register-Transfer level models of the circuit under test in a FPGA and is able to switch to the appropriate model as needed during emulation. Fault injection is performed at the Gate level, which provides delay accuracy, while fault propagation across clock cycles is performed at the Register-Transfer level for higher performance. Experimental results demonstrate that AMUSE can emulate soft error effects for complex circuits including microprocessors and memories, considering the real delays of an ASIC technology, and support massive fault injection campaigns, in the order of tens of millions of faults within acceptable time.

102 citations

Journal Article•DOI•

An Energy-Efficient Memristive Threshold Logic Circuit

Jeyavijayan Rajendran¹, Harika Manem¹, Ramesh Karri¹, Garrett S. Rose¹•Institutions (1)

New York University¹

01 Apr 2012-IEEE Transactions on Computers

TL;DR: This work utilizes memristors as weights in the realization of low-power Field Programmable Gate Arrays (FPGAs) using threshold logic, which is necessary not only for low-power embedded systems but also for realizing biological applications.

Abstract: Researchers have claimed that the memristor, the fourth fundamental circuit element, can be used for computing. In this work, we utilize memristors as weights in the realization of low-power Field Programmable Gate Arrays (FPGAs) using threshold logic which is necessary not only for low power embedded systems, but also realizing biological applications using threshold logic. Boolean functions, which are subsets of threshold functions, can be implemented using the proposed Memristive Threshold Logic (MTL) gate, whose functionality can be configured by changing the weights (memristance). A CAD framework is also developed to map the weights of a threshold gate to corresponding memristance values and synthesize logic circuits using MTL gates. Performance of the MTL gates at the circuit and logic levels is also evaluated using this CAD framework using ISCAS-85 combinational benchmarking circuits. This work also provides solutions based on device options and refreshing memristance, against drift in memristance, which can be a potential problem during operation. Comparisons with the existing CMOS look-up-table (LUT) and capacitor threshold logic (CTL) gates show that MTL gates exhibit less energy-delay product by at least 90 percent.
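The configurable behavior of a threshold gate can be shown with a small software model. This is a sketch of threshold logic in general, not of the MTL circuit: the integer weights, the thresholds, and the names `threshold_gate` and `CONFIGS` are illustrative assumptions, and the mapping from weights to memristance values is not modeled.

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Output 1 when the weighted sum of the inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


# Hypothetical configurations: the gate structure is identical,
# only the weights and the threshold change.
CONFIGS = {
    "AND2": ([1, 1], 2),
    "OR2": ([1, 1], 1),
    "MAJ3": ([1, 1, 1], 2),
}


def truth_table(name):
    """Enumerate every input combination for the named configuration."""
    weights, threshold = CONFIGS[name]
    return {
        bits: threshold_gate(bits, weights, threshold)
        for bits in product((0, 1), repeat=len(weights))
    }


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, truth_table(name))
```

Reconfiguring the function by changing only the weights is the property the abstract attributes to adjusting memristance.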

95 citations

Proceedings Article•DOI•

A successive cancellation decoder ASIC for a 1024-bit polar code in 180nm CMOS

A. Mishra¹, Alexandre J. Raymond², Luca Amaru¹, Sarkis Gabi², Camille Leroux, Pascal Meinerzhagen¹, Andreas Burg¹, Warren J. Gross²•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, McGill University²

01 Dec 2012

TL;DR: The implemented ASIC relies on a semi-parallel architecture where processing resources are reused to achieve good hardware efficiency and a speculative decoding technique is employed to increase the throughput by 25% at the cost of very limited added complexity.

Abstract: This paper presents the first ASIC implementation of a successive cancellation (SC) decoder for polar codes. The implemented ASIC relies on a semi-parallel architecture where processing resources are reused to achieve good hardware efficiency. A speculative decoding technique is employed to increase the throughput by 25% at the cost of very limited added complexity. The resulting architecture is implemented in a 180nm technology. The fabricated chip can be clocked at 150 MHz and uses 183k gates. It was verified using an FPGA testing setup and provides reference for the true silicon complexity of SC decoders for polar codes.
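For readers unfamiliar with successive cancellation, the following is a minimal software sketch of SC decoding for a polar code. It is not the paper's architecture: it uses plain recursion rather than a semi-parallel datapath, omits the speculative decoding technique, and assumes floating-point log-likelihood ratios (positive meaning bit 0) with the min-sum approximation. The function names and the frozen set in the usage example are illustrative assumptions.

```python
def polar_encode(u):
    """Encode bits u (length a power of two) with the recursive polar transform."""
    n = len(u)
    if n == 1:
        return list(u)
    half = n // 2
    left = polar_encode(u[:half])
    right = polar_encode(u[half:])
    return [a ^ b for a, b in zip(left, right)] + right


def f(a, b):
    """Upper-branch update (min-sum approximation)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g(a, b, bit):
    """Lower-branch update, given the already decided partial sum `bit`."""
    return b + (1 - 2 * bit) * a


def sc_decode(llr, frozen):
    """Return (decoded bits, re-encoded bits) for channel LLRs `llr`.

    frozen[i] is True where u[i] is fixed to 0.
    """
    n = len(llr)
    if n == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1
        return [bit], [bit]
    half = n // 2
    a, b = llr[:half], llr[half:]
    u_left, x_left = sc_decode([f(p, q) for p, q in zip(a, b)], frozen[:half])
    u_right, x_right = sc_decode(
        [g(p, q, c) for p, q, c in zip(a, b, x_left)], frozen[half:]
    )
    x = [p ^ q for p, q in zip(x_left, x_right)] + x_right
    return u_left + u_right, x


if __name__ == "__main__":
    # Illustrative 8-bit example with positions 0, 1, 2 and 4 frozen to 0.
    frozen = [True, True, True, False, True, False, False, False]
    u = [0, 0, 0, 1, 0, 1, 1, 0]
    codeword = polar_encode(u)
    llr = [1.0 if bit == 0 else -1.0 for bit in codeword]  # noiseless channel
    decoded, _ = sc_decode(llr, frozen)
    print(decoded == u)
```

Each bit decision depends on the bits decided before it, which is the serial dependency that makes hardware reuse, as in the semi-parallel architecture mentioned in the abstract, a natural fit.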

Journal Article•DOI•

FPGA Implementation of Metastability-Based True Random Number Generator

Hisashi Hata¹, Shuichi Ichikawa¹•Institutions (1)

Toyohashi University of Technology¹

01 Feb 2012-IEICE Transactions on Information and Systems

TL;DR: The RS latch in this TRNG is implemented as a hard-macro to guarantee the quality of randomness by minimizing the signal skew and load imbalance of internal nodes.

Abstract: True random number generators (TRNGs) are important as a basis for computer security. Though there are some TRNGs composed of analog circuits, the use of digital circuits is desired for the application of TRNGs to logic LSIs. Some of the digital TRNGs utilize jitter in free-running ring oscillators as a source of entropy, which consume large power. Another type of TRNG exploits the metastability of a latch to generate entropy. Although this kind of TRNG has been mostly implemented with full-custom LSI technology, this study presents an implementation based on common FPGA technology. Our TRNG is comprised of logic gates only, and can be integrated in any kind of logic LSI. The RS latch in our TRNG is implemented as a hard-macro to guarantee the quality of randomness by minimizing the signal skew and load imbalance of internal nodes. To improve the quality and throughput, the outputs of 64–256 latches are XOR’ed. The derived design was verified on a Xilinx Virtex-4 FPGA (XC4VFX20), and passed NIST statistical test suite without post-processing. Our TRNG with 256 latches occupies 580 slices, while achieving 12.5Mbps through
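Why XOR-ing many latch outputs improves quality can be seen from a short calculation. The sketch below assumes the latch outputs are statistically independent, which the abstract does not state, and the per-latch probability of 0.6 is a made-up figure for illustration. Under that assumption, the XOR of bits with one-probabilities p_i equals 1 with probability (1 - Π(1 - 2 p_i)) / 2.

```python
def xor_one_probability(probabilities):
    """Probability that the XOR of independent bits equals 1.

    probabilities[i] is the chance that bit i is 1 (independence assumed).
    """
    product = 1.0
    for p in probabilities:
        product *= 1.0 - 2.0 * p
    return (1.0 - product) / 2.0


if __name__ == "__main__":
    # Hypothetical latches that each output 1 with probability 0.6.
    for count in (1, 2, 4, 8, 64):
        p_one = xor_one_probability([0.6] * count)
        print(f"{count:3d} latches: P(1) = {p_one:.12f}")
```

The deviation from 0.5 shrinks geometrically with the number of latches, so even noticeably biased sources combine into a nearly unbiased bit when they are independent.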

Journal Article•DOI•

Efficient Digital Implementation of Extreme Learning Machines for Classification

Sergio Decherchi¹, Paolo Gastaldo², Alessio Leoncini², Rodolfo Zunino²•Institutions (2)

Istituto Italiano di Tecnologia¹, University of Genoa²

10 Jul 2012-IEEE Transactions on Circuits and Systems II: Express Briefs

TL;DR: This brief addresses the implementation of the powerful extreme learning machine (ELM) model on reconfigurable digital hardware (HW) and describes and analyzes two implementation approaches: one involving field-programmable gate array devices and one embedding low-cost low-performance devices such as complex programmable logic devices.

Abstract: The availability of compact fast circuitry for the support of artificial neural systems is a long-standing and critical requirement for many important applications. This brief addresses the implementation of the powerful extreme learning machine (ELM) model on reconfigurable digital hardware (HW). The design strategy first provides a training procedure for ELMs, which effectively trades off prediction accuracy and network complexity. This, in turn, facilitates the optimization of HW resources. Finally, this brief describes and analyzes two implementation approaches: one involving field-programmable gate array devices and one embedding low-cost low-performance devices such as complex programmable logic devices. Experimental results show that, in both cases, the design approach yields efficient digital architectures with satisfactory performances and limited costs.

Journal Article•DOI•

Fast Linear Model Predictive Control Via Custom Integrated Circuit Architecture

Adrian Wills¹, Geoff Knagge¹, Brett Ninness¹•Institutions (1)

University of Newcastle¹

01 Jan 2012-IEEE Transactions on Control Systems Technology

TL;DR: This paper addresses the implementation of linear model predictive control at millisecond range, or faster, sampling rates by designing a custom integrated circuit architecture that is specifically targeted to the MPC problem.

Abstract: This paper addresses the implementation of linear model predictive control (MPC) at millisecond range, or faster, sampling rates. This is achieved by designing a custom integrated circuit architecture that is specifically targeted to the MPC problem. As opposed to the more usual approach using a generic serial architecture processor, the design here is implemented using a field-programmable gate array and employs parallelism, pipelining, and specialized numerical formats. The performance of this approach is profiled via the control of a 14th-order resonant structure with 12 sample prediction horizon at 200-μs sampling rate. The results indicate that no more than 30 μs are required to compute the control action. A feasibility study indicates that the design can also be implemented in 130 nm CMOS technology, with a core area of 2.5 mm². These results illustrate the feasibility of MPC for reasonably complex systems, using relatively cheap, small, and low-power computing hardware.

Proceedings Article•DOI•

LARA: an aspect-oriented programming language for embedded systems

João M. P. Cardoso¹, Tiago Carvalho¹, Jose G. F. Coutinho², Wayne Luk², Ricardo Nobre³, Pedro C. Diniz³, Zlatko Petrov⁴•Institutions (4)

University of Porto¹, Imperial College London², INESC-ID³, Honeywell⁴

25 Mar 2012

TL;DR: A novel AOP language, LARA, is described, which allows the specification of compilation strategies to enable efficient generation of software code and hardware cores for alternative target architectures and for guiding the application of compiler and hardware synthesis optimizations.

Abstract: The development of applications for high-performance embedded systems is typically a long and error-prone process. In addition to the required functions, developers must consider various and often conflicting non-functional application requirements such as performance and energy efficiency. The complexity of this process is exacerbated by the multitude of target architectures and the associated retargetable mapping tools. This paper introduces an Aspect-Oriented Programming (AOP) approach that conveys domain knowledge and non-functional requirements to optimizers and mapping tools. We describe a novel AOP language, LARA, which allows the specification of compilation strategies to enable efficient generation of software code and hardware cores for alternative target architectures. We illustrate the use of LARA for code instrumentation and analysis, and for guiding the application of compiler and hardware synthesis optimizations. An important LARA feature is its capability to deal with different join points, action models, and attributes, and to generate an aspect intermediate representation. We present examples of our aspect-oriented hardware/software design flow for mapping real-life application codes to embedded platforms based on Field Programmable Gate Array (FPGA) technology.

Journal Article•DOI•

Embedding a high speed interval type-2 fuzzy controller for a real plant into an FPGA

Roberto Sepúlveda¹, Oscar Montiel¹, Oscar Castillo², Patricia Melin²•Institutions (2)

Instituto Politécnico Nacional¹, AmeriCorps VISTA²

01 Mar 2012

TL;DR: The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high speed processing and shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software.

Abstract: The main goal of this paper is to show that interval type-2 fuzzy inference systems (IT2 FIS) can be used in applications that require high speed processing. This is an important issue since the use of IT2 FIS is still controversial for several reasons, one of the most important of which is the resulting sharp increase in computational complexity that type reducers, like the Karnik-Mendel (KM) iterative method, can cause even for small systems. Hence, comparing our results against a typical implementation of an IT2 FIS in a high-level language running on a computer, we show that with a hardware implementation the whole IT2 FIS (fuzzification, inference engine, type reducer and defuzzification) takes only four clock cycles; a speed up of nearly 225,000 and 450,000 can be obtained for the Spartan 3 and Virtex 5 Field Programmable Gate Arrays (FPGAs), respectively. This proposal is suitable to be implemented in pipeline, so the complete IT2 process can be obtained in just one clock cycle, with a consequent gain in speed of 900,000 and 2,400,000 for the aforementioned FPGAs. This paper also shows that the iterative KM method can be efficient if it is adequately implemented using the appropriate combination of hardware and software. Comparative experiments of control surfaces, and time response in the control of a real plant using the IT2 FIS implemented on a computer against the IT2 FIS on an FPGA are shown.
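The iterative nature of Karnik-Mendel type reduction, which is the cost the abstract refers to, can be sketched in software as follows. This is a generic formulation of the KM iteration, not the paper's hardware design. The function name `km_endpoint`, the convergence tolerance, the iteration cap, and the example values are assumptions, and the lower firing strengths are assumed strictly positive so that no division by zero occurs.

```python
def km_endpoint(xs, lower, upper, left=True, max_iter=100, tol=1e-12):
    """Karnik-Mendel iteration for one endpoint of the type-reduced interval.

    xs: consequent points; lower/upper: bounds on each point's firing strength.
    left=True returns the left endpoint, left=False the right endpoint.
    """
    # Start from the midpoint weights.
    weights = [(lo + hi) / 2.0 for lo, hi in zip(lower, upper)]
    y = sum(x * w for x, w in zip(xs, weights)) / sum(weights)
    for _ in range(max_iter):
        if left:
            # Pull the centroid left: heavy weights on points at or below y.
            weights = [hi if x <= y else lo for x, lo, hi in zip(xs, lower, upper)]
        else:
            # Pull the centroid right: heavy weights on points above y.
            weights = [lo if x <= y else hi for x, lo, hi in zip(xs, lower, upper)]
        y_new = sum(x * w for x, w in zip(xs, weights)) / sum(weights)
        if abs(y_new - y) < tol:
            return y_new
        y = y_new
    return y


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    lower = [0.2, 0.4, 0.6, 0.4, 0.2]
    upper = [0.5, 0.8, 1.0, 0.8, 0.5]
    y_left = km_endpoint(xs, lower, upper, left=True)
    y_right = km_endpoint(xs, lower, upper, left=False)
    # A common defuzzified output is the midpoint of the interval.
    print(y_left, y_right, (y_left + y_right) / 2.0)
```

Each pass recomputes two weighted sums over all points, and the number of passes depends on the data, so the cost per inference is variable in software.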

Proceedings Article•DOI•

On-the-fly Composition of FPGA-Based SQL Query Accelerators Using a Partially Reconfigurable Module Library

Christopher Dennl¹, Daniel Ziener¹, Jürgen Teich¹•Institutions (1)

University of Erlangen-Nuremberg¹

29 Apr 2012

TL;DR: This paper introduces a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration and shows that it is able to achieve a substantially higher throughput compared to a software-only solution.

Abstract: In this paper, we introduce a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration. Query acceleration is of utmost importance in large database systems to achieve a very high throughput. Although common FPGA-based accelerators are suitable to achieve such a high throughput, their design is hard to extend for new operations. Using partial dynamic reconfiguration, we are able to build more flexible architectures which can be extended to new operations or SQL constructs with a very low area overhead on the FPGA. Furthermore, the reconfiguration of a few FPGA frames can be used to switch very fast from one query to the next. In our approach, an SQL query is transformed into a hardware pipeline consisting of partially reconfigurable modules. The assembly of the (FPGA) data path is done at run-time using a static system providing the stream-based communication interfaces to the partial modules and the database management system. More specifically, each incoming SQL query is analyzed and divided into single operations which are subsequently mapped onto library modules and the composed data path loaded on the FPGA. We show that our approach is able to achieve a substantially higher throughput compared to a software-only solution.

Proceedings Article•DOI•

A cycle-accurate, cycle-reproducible multi-FPGA system for accelerating multi-core processor simulation

Sameh W. Asaad¹, Ralph Bellofatto¹, Bernard Brezzo¹, C. Haymes¹, Mohit Kapur¹, Benjamin D. Parker¹, Thomas Roewer¹, Proshanta Saha¹, Todd E. Takken¹, Jose A. Tierno¹•Institutions (1)

IBM¹

22 Feb 2012

TL;DR: A cycle-accurate and cycle-reproducible large-scale FPGA platform designed from the ground up to accelerate logic verification of the Bluegene/Q compute node ASIC, a multi-processor SOC implemented in IBM's 45 nm SOI CMOS technology.

Abstract: Software based tools for simulation are not keeping up with the demands for increased chip and system design complexity. In this paper, we describe a cycle-accurate and cycle-reproducible large-scale FPGA platform that is designed from the ground up to accelerate logic verification of the Bluegene/Q compute node ASIC, a multi-processor SOC implemented in IBM's 45 nm SOI CMOS technology. This paper discusses the challenges for constructing such large-scale FPGA platforms, including design partitioning, clocking & synchronization, and debugging support, as well as our approach for addressing these challenges without sacrificing cycle accuracy and cycle reproducibility. The resulting full-chip simulation of the Bluegene/Q compute node ASIC runs at a simulated processor clock speed of 4 MHz, over 100,000 times faster than the logic level software simulation of the same design. The vast increase in simulation speed provides a new capability in the design cycle that proved to be instrumental in logic verification as well as early software development and performance validation for Bluegene/Q.

Journal Article•DOI•

A Parallel Hardware Architecture for Real-Time Object Detection with Support Vector Machines

Christos Kyrkou¹, Theocharis Theocharides¹•Institutions (1)

University of Cyprus¹

01 Jun 2012-IEEE Transactions on Computers

TL;DR: This paper presents a parallel array architecture for SVM-based object detection, in an attempt to show the advantages, and performance benefits that stem from a dedicated hardware solution.

Abstract: Object detection applications are often associated with real-time performance constraints that stem from the embedded environment that they are often deployed in. Consequently, researchers have proposed dedicated hardware architectures, utilizing a variety of classification algorithms targeting object detection. Support Vector Machines (SVMs) is among the most popular classification algorithms used in object detection yielding high accuracy rates. However, existing SVM hardware implementations attempting to speed up SVM classification, have either targeted only simple applications, or SVM training. As such, there are limited proposed hardware architectures that are generic enough to be used in a variety of object detection applications. Hence, this paper presents a parallel array architecture for SVM-based object detection, in an attempt to show the advantages, and performance benefits that stem from a dedicated hardware solution. The proposed hardware architecture provides parallel processing, resource sharing among the processing units, and efficient memory management. Furthermore, the size of the array is scalable to the hardware demands, and can also handle a variety of applications such as multiclass classification problems. A prototype of the proposed architecture was implemented on an FPGA platform and evaluated using three popular detection applications, demonstrating real-time performance (40-122 fps for a variety of applications).

Proceedings Article•DOI•

An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm

Trang Hoang¹, Van Loi Nguyen²•Institutions (2)

Ho Chi Minh City University of Technology¹, Vietnam National University, Ho Chi Minh City²

15 Mar 2012

TL;DR: A proposed FPGA-based implementation of the Advanced Encryption Standard (AES) algorithm is presented that uses an iterative looping approach with a block and key size of 128 bits and a lookup-table implementation of the S-box.

Abstract: A proposed FPGA-based implementation of the Advanced Encryption Standard (AES) algorithm is presented in this paper. This implementation is compared with other works to show its efficiency. The design uses an iterative looping approach with a block and key size of 128 bits and a lookup-table implementation of the S-box. This gives a low-complexity architecture and easily achieves low latency as well as high throughput. Simulation and performance results are presented and compared with previously reported designs.

Journal Article•DOI•

A State-Space Modeling Approach for the FPGA-Based Real-Time Simulation of High Switching Frequency Power Converters

Handy Fortin Blanchette¹, Tarek Ould-Bachir¹, Jean-Pierre David¹•Institutions (1)

École Normale Supérieure¹

01 Dec 2012-IEEE Transactions on Industrial Electronics

TL;DR: A comprehensive approach to the real-time simulation of power converters, using a state-space representation and a new switch model that exhibits a natural switching behavior, is covered in this paper.

Abstract: A comprehensive approach to the real-time simulation of power converters using a state-space representation is covered in this paper. Systematic formulations of state-space equations as well as a new switch model are presented. The proposed switch model exhibits a natural switching behavior, which is a valuable characteristic for the real-time simulation of power converters, thereby allowing individual treatment of switching devices irrespective of the converter topology. Successful implementations of the proposed switch model on a field programmable gate array (FPGA) device are reported, with two alternative approaches: 1) precomputing network equations for all switch state combinations and 2) solving network equations on-chip using the Gauss-Seidel iterative method. A two-level three-phase voltage source converter is implemented using the first approach, with a time step of 80 ns and a switching frequency of 200 kHz. Ideal and nonideal boost converters are also implemented on FPGA using the second approach, with a time step of 75 ns and a switching frequency of 20 kHz. Comparison with SPICE models shows that the proposed switch model offers very satisfactory accuracy and precision.
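
The second approach in the abstract above solves the network equations with the Gauss-Seidel iterative method. The following is a minimal software sketch of that method only; it is not the authors' on-chip implementation. The function name `gauss_seidel`, the fixed sweep count, and the 2x2 example system are assumptions chosen for illustration.

```python
def gauss_seidel(a, b, sweeps=50):
    """Iteratively solve a @ x = b.

    Convergence is guaranteed when `a` is strictly diagonally dominant.
    A fixed number of sweeps is used here (an assumption), which mirrors
    the bounded work available inside a fixed simulation time step.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Gauss-Seidel uses the newest values of the other unknowns
            # as soon as they are available within the same sweep.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / a[i][i]
    return x


# Hypothetical 2x2 system standing in for a small set of network equations.
a = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
x = gauss_seidel(a, b)  # exact solution is x = [0.1, 0.6]
```

Because each unknown is updated in place, the method needs no second copy of the solution vector, which is one reason it is a plausible fit for constrained hardware.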

Proceedings Article•DOI•

FPGA Implementation of 8, 16 and 32 Bit LFSR with Maximum Length Feedback Polynomial Using VHDL

Amit Kumar Panda¹, Praveena Rajput¹, Bhawna Shukla¹•Institutions (1)

Guru Ghasidas University¹

11 May 2012

TL;DR: This paper implemented 8, 16 and 32-bit LFSR on FPGA by using VHDL to study the performance and analyze the behavior of randomness, and the simulation problem for long-bit LFSR on FPGA is presented.

Abstract: The LFSR-based PN sequence generator technique is used in various cryptography applications and for designing encoders and decoders in different communication channels. It is important to test and verify a design by implementing it on hardware to obtain better, more efficient results. As FPGAs are used to implement any logical function for faster prototype development, it is necessary to implement the existing LFSR design on an FPGA to test and verify the simulated and synthesized results for different lengths. The total number of random states generated by an LFSR depends on the feedback polynomial. As it is a simple counter, it can count to a maximum of 2^n − 1 by using a maximum-length feedback polynomial. In this paper we implemented 8-, 16- and 32-bit LFSRs on an FPGA using VHDL to study the performance and analyze the behavior of randomness. The analysis is carried out to find the number of gates, memory and speed requirements in the FPGA as the number of bits is increased. A comparative study of 8-, 16- and 32-bit LFSRs on FPGA is shown to aid understanding of on-chip verification. The simulation problem for long-bit LFSRs on FPGA is also presented.
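
The maximum-length property mentioned in the abstract above can be checked in software. The sketch below models a Fibonacci LFSR in Python rather than VHDL and counts the states visited before the seed recurs. The tap sets are commonly tabulated maximal-length choices and are assumptions here, since the abstract does not say which polynomials the paper used. The names `lfsr_step` and `lfsr_period` are illustrative.

```python
def lfsr_step(state, width, taps):
    """Advance a right-shifting Fibonacci LFSR by one clock.

    `taps` lists the polynomial exponents, e.g. (16, 14, 13, 11) for
    x^16 + x^14 + x^13 + x^11 + 1. Tap t reads bit (width - t).
    """
    bit = 0
    for t in taps:
        bit ^= (state >> (width - t)) & 1
    return (state >> 1) | (bit << (width - 1))


def lfsr_period(width, taps, seed=1):
    """Count the clocks needed to return to a nonzero seed state."""
    state = seed
    count = 0
    while True:
        state = lfsr_step(state, width, taps)
        count += 1
        if state == seed:
            return count


# Assumed maximal-length tap sets (not taken from the paper).
period_8 = lfsr_period(8, (8, 6, 5, 4))          # 2**8 - 1 = 255
period_16 = lfsr_period(16, (16, 14, 13, 11))    # 2**16 - 1 = 65535
```

Enumerating a 32-bit LFSR this way would take about 4.3 billion steps, which gives a feel for why exhaustive simulation of long registers becomes impractical.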

Journal Article•DOI•

FPGA Implementation of the N-FINDR Algorithm for Remotely Sensed Hyperspectral Image Analysis

Carlos Villaseca González¹, Daniel Mozos¹, Javier Resano², Antonio Plaza³•Institutions (3)

Complutense University of Madrid¹, University of Zaragoza², University of Extremadura³

01 Feb 2012-IEEE Transactions on Geoscience and Remote Sensing

TL;DR: The first FPGA design for N-FINDR, a widely used endmember extraction algorithm in the literature, is presented, which includes a direct memory access module and implements a prefetching technique to hide the latency of the input/output communications.

Abstract: Hyperspectral remote sensing attempts to identify features in the surface of the Earth using sensors that generally provide large amounts of data. The data are usually collected by a satellite or an airborne instrument and sent to a ground station that processes it. The main bottleneck of this approach is the (often reduced) bandwidth connection between the satellite and the station, which drastically limits the information that can be sent and processed in real time. A possible way to overcome this problem is to include onboard computing resources able to preprocess the data, reducing its size by orders of magnitude. Reconfigurable field-programmable gate arrays (FPGAs) are a promising platform that allows hardware/software codesign and the potential to provide powerful onboard computing capability and flexibility at the same time. Since FPGAs can implement custom hardware solutions, they can reach very high performance levels. Moreover, using run-time reconfiguration, the functionality of the FPGA can be updated at run time as many times as needed to perform different computations. Hence, the FPGA can be reused for several applications reducing the number of computing resources needed. One of the most popular and widely used techniques for analyzing hyperspectral data is linear spectral unmixing, which relies on the identification of pure spectral signatures via a so-called endmember extraction algorithm. In this paper, we present the first FPGA design for N-FINDR, a widely used endmember extraction algorithm in the literature. Our system includes a direct memory access module and implements a prefetching technique to hide the latency of the input/output communications. 
The proposed method has been implemented on a Virtex-4 XC4VFX60 FPGA (a model that is similar to radiation-hardened FPGAs certified for space operation) and tested using real hyperspectral data collected by NASA's Earth Observing-1 Hyperion (a satellite instrument) and the Airborne Visible Infra-Red Imaging Spectrometer over the Cuprite mining district in Nevada and the Jasper Ridge Biological Preserve in California. Experimental results demonstrate that our hardware version of the N-FINDR algorithm can significantly outperform an equivalent software version and is able to provide accurate results in near real time, which makes our reconfigurable system appealing for onboard hyperspectral data processing.

Proceedings Article•DOI•

Analytical placement for heterogeneous FPGAs

Marcel Gort¹, Jason H. Anderson¹•Institutions (1)

University of Toronto¹

25 Oct 2012

TL;DR: HeAP is presented, an analytical placement algorithm for heterogeneous FPGAs comprised of LUT-based logic blocks, multiplier/DSP blocks and block RAMs, which adapts a state-of-the-art ASIC-based analytical placer to target FPGAs with heterogeneous blocks located at discrete locations throughout the fabric.

Abstract: We present HeAP, an analytical placement algorithm for heterogeneous FPGAs comprised of LUT-based logic blocks, multiplier/DSP blocks and block RAMs. Specifically, we adapt a state-of-the-art ASIC-based analytical placer to target FPGAs with heterogeneous blocks located at discrete locations throughout the fabric. Our placer also handles macros of LUT-based blocks with specific layout requirements, such as carry chains. Results show that our placer delivers a 4× speedup, on average, compared to Altera's non-timing driven flow, at the cost of a 5% increase in postrouted wirelength, and an 11× speedup compared to Altera's timing-driven flow, at the cost of a 4% increase in post-routed wirelength and a 9% reduction in maximum operating frequency. We also compare with an academic simulated annealing-based placer and demonstrate a 7.4× runtime advantage with 6% better placement quality.

Journal Article•DOI•

Digital Hardware Emulation of Universal Machine and Universal Line Models for Real-Time Electromagnetic Transient Simulation

Yuan Chen¹, Venkata Dinavahi¹•Institutions (1)

University of Alberta¹

01 Feb 2012-IEEE Transactions on Industrial Electronics

TL;DR: A digital hardware emulation of the universal machine (UM) and the ULM for real-time electromagnetic transient simulation that features accurate floating-point data representation, paralleled implementation, and fully pipelined arithmetic processing is proposed.

Abstract: Real-time electromagnetic transient simulation plays an important role in the planning, design, and operation of power systems. Inclusion of accurate and complicated models, such as the universal machine (UM) model and the universal line model (ULM), requires significant computational resources. This paper proposes a digital hardware emulation of the UM and the ULM for real-time electromagnetic transient simulation. It features accurate floating-point data representation, paralleled implementation, and fully pipelined arithmetic processing. The hardware is based on advanced field-programmable gate array (FPGA) using VHDL. A power system transient case study is simulated in real time to validate the design. On a 130-MHz input clock frequency to the FPGA, the achieved execution times for UM and ULM models are 2.5 μs and 1.42 μs, respectively. The captured real-time oscilloscope results demonstrate high accuracy of the emulator in comparison to the offline simulation of the original system in the EMTP-RV software.

Journal Article•DOI•

High-Performance and Compact Architecture for Regular Expression Matching on FPGA

Yi-Hua E. Yang¹, Viktor K. Prasanna¹•Institutions (1)

University of Southern California¹

01 Jul 2012-IEEE Transactions on Computers

TL;DR: The proposed REM architecture achieved up to 11 Gbps concurrent throughput for various regex sets and up to 2.67× the throughput efficiency of other state-of-the-art designs.

Abstract: We present the design, implementation and evaluation of a high-performance architecture for regular expression matching (REM) on field-programmable gate array (FPGA). Each regular expression (regex) is first parsed into a concise token list representation, then compiled to a modular nondeterministic finite automaton (RE-NFA) using a modified version of the McNaughton-Yamada algorithm. The RE-NFA can be mapped directly onto a compact register-transistor level (RTL) circuit. A number of optimizations are applied to improve the circuit performance: 1) spatial stacking is used to construct an REM circuit processing m ≥ 1 input characters per clock cycle; 2) single-character constrained repetitions are matched efficiently by parallel shift-register lookup tables; 3) complex character classes are matched by a BRAM-based classifier shared across regexes; 4) a multipipeline architecture is used to organize a large number of RE-NFAs into priority groups to limit the I/O size of the circuit. We implemented 2,630 unique PCRE regexes from Snort rules (February 2010) in the proposed REM architecture. Based on the place-and-route results from Xilinx ISE 11.1 targeting Virtex5 LX-220 FPGAs, the proposed REM architecture achieved up to 11 Gbps concurrent throughput for various regex sets and up to 2.67× the throughput efficiency of other state-of-the-art designs.

Proceedings Article•DOI•

FPGA optimized packet-switched NoC using split and merge primitives

Yutian Huan¹, André DeHon¹•Institutions (1)

University of Pennsylvania¹

01 Dec 2012

TL;DR: It is shown that the Split-Merge switch architecture is more amenable to pipelining on FPGAs, achieving 300 MHz operation, up to three times the frequency and throughput of the CONNECT switches, with only 13-37% more area.

Abstract: Due to their different cost structures, the architecture of switches for an FPGA packet-switched Network-on-a-Chip (NoC) should differ from their ASIC counterparts. The CONNECT network recently demonstrated several ways in which packet-switched FPGA NoCs should differ from ASIC NoCs. However, they also concluded that pipelining was not appropriate for the FPGA switches. We show that the Split-Merge switch architecture is more amenable to pipelining on FPGAs, achieving 300 MHz operation, up to three times the frequency and throughput of the CONNECT switches, with only 13–37% more area. Furthermore, we show that the Split-Merge switches are at least as efficient at routing traffic as the CONNECT switches, meaning the 2–3× frequency translates directly into two to three times the application performance.

Journal Article•DOI•

Simulink Modeling and Design of an Efficient Hardware-Constrained FPGA-Based PMSM Speed Controller

Bogdan Alecsa, Marcian Cirstea¹, Alexandru Onea•Institutions (1)

Anglia Ruskin University¹

06 Apr 2012-IEEE Transactions on Industrial Informatics

TL;DR: A holistic approach to modeling and field programmable gate array (FPGA) implementation of a permanent magnet synchronous motor (PMSM) speed controller that fits into a low-cost FPGA, without significantly increasing the execution time.

Abstract: The aim of this paper is to present a holistic approach to modeling and field programmable gate array (FPGA) implementation of a permanent magnet synchronous motor (PMSM) speed controller. The whole system is modeled in the Matlab Simulink environment. The controller is then translated to discrete time and remodeled using System Generator blocks, directly synthesizable into FPGA hardware. The algorithm is further refined and factorized to take into account hardware constraints, so as to fit into a low-cost FPGA, without significantly increasing the execution time. The resulting controller is then integrated together with sensor interfaces and analysis tools and implemented into an FPGA device. Experimental results validate the controller and verify the design.

Journal Article•DOI•

Synchronous FPGA-Based High-Resolution Implementations of Digital Pulse-Width Modulators

Denis Navarro¹, Oscar Lucia¹, Luis A. Barragan¹, Jose I. Artigas¹, I. Urriza¹, O. Jimenez¹•Institutions (1)

University of Zaragoza¹

01 May 2012-IEEE Transactions on Power Electronics

TL;DR: Two synchronous designs to increase the resolution of the DPWM implemented on field programmable gate arrays (FPGA), based on the on-chip digital clock manager block present in the low-cost Spartan-3 FPGA series and on the I/O delay element available in the high-end Virtex-6 FPGA series, are presented.

Abstract: Advantages of digital control in power electronics have led to an increasing use of digital pulse-width modulators (DPWM). However, the clock frequency requirements may exceed the operational limits when the power converter switching frequency is increased, while using classical DPWM architectures. In this paper, we present two synchronous designs to increase the resolution of the DPWM implemented on field programmable gate arrays (FPGA). The proposed circuits are based on the on-chip digital clock manager block present in the low-cost Spartan-3 FPGA series and on the I/O delay element (IODELAYE1) available in the high-end Virtex-6 FPGA series. These solutions have been implemented, tested, and compared to verify the performance of these architectures.
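
The clock-frequency problem described in the abstract above follows from simple arithmetic for a classical counter-based DPWM, where one switching period is divided into 2^n clock ticks. The sketch below illustrates that relationship only; it does not model the paper's clock-manager or I/O-delay designs. The function names and example frequencies are assumptions.

```python
import math


def required_clock_hz(f_sw_hz, bits):
    """Counter clock needed for `bits` of resolution at switching
    frequency `f_sw_hz` in a plain counter-based DPWM."""
    return f_sw_hz * (2 ** bits)


def effective_bits(f_clk_hz, f_sw_hz):
    """Resolution (in bits) a counter clocked at `f_clk_hz` can offer."""
    return math.log2(f_clk_hz / f_sw_hz)


# Hypothetical operating points, chosen only for illustration.
clk_needed = required_clock_hz(1e6, 10)   # 1 MHz switching, 10 bits
bits_at_100mhz = effective_bits(100e6, 1e6)
```

With these example numbers, 10-bit resolution at 1 MHz switching would need a 1.024 GHz counter clock, while a 100 MHz clock yields under 7 bits, which is the gap that higher-resolution architectures aim to close.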

Journal Article•DOI•

FPGA accelerator for floating-point matrix multiplication

Zeljko Jovanovic¹, Veljko Milutinovic¹•Institutions (1)

University of Belgrade¹

25 Oct 2012-IET Computers and Digital Techniques

TL;DR: The architecture and implementation of a field-programmable gate array (FPGA) accelerator for double-precision floating-point matrix multiplication employs the block matrix multiplication algorithm which returns the result blocks to the host processor as soon as they are computed.

Abstract: This study treats the architecture and implementation of a field-programmable gate array (FPGA) accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering and simplifies placement and routing on the chip. The authors show that such architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses eight Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area size and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 Giga FLOPS (GFLOPS)), by comparing it to the double-precision matrix multiplication function from MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core2Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
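
The division of work described in the abstract above, where block products are computed remotely and accumulated by the host, can be sketched in plain Python. This is an illustrative model under stated assumptions, not the authors' blocking scheme: `matmul_block` stands in for the accelerator, matrices are square with a size divisible by the block size, and the operation counts are this sketch's own accounting.

```python
def matmul_block(a, b):
    """Stand-in for the accelerator: dense product of two small blocks."""
    inner = len(b)
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for row in a]


def get_block(mat, r, c, bs):
    """Extract the bs x bs block whose top-left corner is (r, c)."""
    return [row[c:c + bs] for row in mat[r:r + bs]]


def blocked_matmul(a, b, bs):
    """Multiply n x n matrices; the host only accumulates result blocks."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    accel_flops = host_flops = 0
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                partial = matmul_block(get_block(a, i, k, bs),
                                       get_block(b, k, j, bs))
                accel_flops += 2 * bs ** 3 - bs ** 2  # multiplies + adds
                for r in range(bs):                   # host accumulation
                    for s in range(bs):
                        c[i + r][j + s] += partial[r][s]
                host_flops += bs * bs
    return c, accel_flops, host_flops


def accelerator_share(bs):
    """Fraction of operations done off-host in this model: 1 - 1/(2*bs)."""
    return 1.0 - 1.0 / (2 * bs)


# Peak rate, assuming one multiply and one add per PE per cycle.
peak_gflops = 252 * 403e6 * 2 / 1e9
```

In this model the accelerator's share exceeds 99% once the block size passes 50, and the assumed two operations per PE per cycle reproduce the quoted figure (252 × 403 MHz × 2 ≈ 203.1 GFLOPS).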
