
Showing papers on "Field-programmable gate array" published in 2005


Journal ArticleDOI
TL;DR: In this article, the authors describe a digital logic architecture for CMOL hybrid circuits, which combine a semiconductor-transistor (CMOS) stack and two levels of parallel nanowires, with molecular-scale nanodevices formed between the nanowires at every crosspoint.
Abstract: This paper describes a digital logic architecture for ‘CMOL’ hybrid circuits which combine a semiconductor–transistor (CMOS) stack and two levels of parallel nanowires, with molecular-scale nanodevices formed between the nanowires at every crosspoint. This cell-based, field-programmable gate array (FPGA)-like architecture is based on a uniform, reconfigurable CMOL fabric, with four-transistor CMOS cells and two-terminal nanodevices (‘latching switches’). The switches play two roles: they provide diode-like I–V curves for logic circuit operation, and allow circuit mapping on CMOL fabric and its reconfiguration around defective nanodevices. Monte Carlo simulations of two simple circuits (a 32-bit integer adder and a 64-bit full crossbar switch) have shown that the reconfiguration allows one to increase the circuit yield above 99% even when the fraction of bad nanodevices is above 20%. Estimates have shown that at the same time the circuits may have extremely high density (approximately 500 times higher than that of the usual CMOS FPGAs with the same design rules), while operating at higher speed at acceptable power consumption.

539 citations


Journal ArticleDOI
25 Jul 2005
TL;DR: It is shown that reconfigurable computing designs are capable of achieving up to 500 times speedup and 70% energy savings over microprocessor implementations for specific applications.
Abstract: Reconfigurable computing is becoming increasingly attractive for many applications. This survey covers two aspects of reconfigurable computing: architectures and design methods. The paper includes recent advances in reconfigurable architectures, such as the Altera Stratix II and Xilinx Virtex 4 FPGA devices. The authors identify major trends in general-purpose and special-purpose design methods. It is shown that reconfigurable computing designs are capable of achieving up to 500 times speedup and 70% energy savings over microprocessor implementations for specific applications.

414 citations


Book
01 Jan 2005
TL;DR: A Stochastic Model for Differential Side Channel Cryptanalysis and some Applications to Cryptanalysis, and a New Baby-Step Giant-Step Algorithm and Some Applications to cryptanalysis are presented.
Abstract: Side Channels I.- Resistance of Randomized Projective Coordinates Against Power Analysis.- Templates as Master Keys.- A Stochastic Model for Differential Side Channel Cryptanalysis.- Arithmetic for Cryptanalysis.- A New Baby-Step Giant-Step Algorithm and Some Applications to Cryptanalysis.- Further Hidden Markov Model Cryptanalysis.- Low Resources.- Energy-Efficient Software Implementation of Long Integer Modular Arithmetic.- Short Memory Scalar Multiplication on Koblitz Curves.- Hardware/Software Co-design for Hyperelliptic Curve Cryptography (HECC) on the 8051 µP.- Special Purpose Hardware.- SHARK: A Realizable Special Hardware Sieving Device for Factoring 1024-Bit Integers.- Scalable Hardware for Sparse Systems of Linear Equations, with Applications to Integer Factorization.- Design of Testable Random Bit Generators.- Hardware Attacks and Countermeasures I.- Successfully Attacking Masked AES Hardware Implementations.- Masked Dual-Rail Pre-charge Logic: DPA-Resistance Without Routing Constraints.- Masking at Gate Level in the Presence of Glitches.- Arithmetic for Cryptography.- Bipartite Modular Multiplication.- Fast Truncated Multiplication for Cryptographic Applications.- Using an RSA Accelerator for Modular Inversion.- Comparison of Bit and Word Level Algorithms for Evaluating Unstructured Functions over Finite Rings.- Side Channel II (EM).- EM Analysis of Rijndael and ECC on a Wireless Java-Based PDA.- Security Limits for Compromising Emanations.- Security Evaluation Against Electromagnetic Analysis at Design Time.- Side Channel III.- On Second-Order Differential Power Analysis.- Improved Higher-Order Side-Channel Attacks with FPGA Experiments.- Trusted Computing.- Secure Data Management in Trusted Computing.- Hardware Attacks and Countermeasures II.- Data Remanence in Flash Memory Devices.- Prototype IC with WDDL and Differential Routing - DPA Resistance Assessment.- Hardware Attacks and Countermeasures III.- DPA Leakage Models for CMOS Logic Circuits.- The "Backend Duplication" Method.- Efficient Hardware I.- Hardware Acceleration of the Tate Pairing in Characteristic Three.- Efficient Hardware for the Tate Pairing Calculation in Characteristic Three.- Efficient Hardware II.- AES on FPGA from the Fastest to the Smallest.- A Very Compact S-Box for AES.

297 citations


BookDOI
01 Jan 2005
TL;DR: In this paper, the authors describe a new Baby-Step giant-step algorithm and some applications to Cryptanalysis, such as low resources, energy-efficient software implementation of Long Integer Modular Arithmetic, Short Memory Scalar Multiplication on Koblitz Curves, and Scalable Hardware for Sparse Systems of Linear Equations with Applications to Integer Factorization.
Abstract: Side Channels I -- Resistance of Randomized Projective Coordinates Against Power Analysis -- Templates as Master Keys -- A Stochastic Model for Differential Side Channel Cryptanalysis -- Arithmetic for Cryptanalysis -- A New Baby-Step Giant-Step Algorithm and Some Applications to Cryptanalysis -- Further Hidden Markov Model Cryptanalysis -- Low Resources -- Energy-Efficient Software Implementation of Long Integer Modular Arithmetic -- Short Memory Scalar Multiplication on Koblitz Curves -- Hardware/Software Co-design for Hyperelliptic Curve Cryptography (HECC) on the 8051 µP -- Special Purpose Hardware -- SHARK: A Realizable Special Hardware Sieving Device for Factoring 1024-Bit Integers -- Scalable Hardware for Sparse Systems of Linear Equations, with Applications to Integer Factorization -- Design of Testable Random Bit Generators -- Hardware Attacks and Countermeasures I -- Successfully Attacking Masked AES Hardware Implementations -- Masked Dual-Rail Pre-charge Logic: DPA-Resistance Without Routing Constraints -- Masking at Gate Level in the Presence of Glitches -- Arithmetic for Cryptography -- Bipartite Modular Multiplication -- Fast Truncated Multiplication for Cryptographic Applications -- Using an RSA Accelerator for Modular Inversion -- Comparison of Bit and Word Level Algorithms for Evaluating Unstructured Functions over Finite Rings -- Side Channel II (EM) -- EM Analysis of Rijndael and ECC on a Wireless Java-Based PDA -- Security Limits for Compromising Emanations -- Security Evaluation Against Electromagnetic Analysis at Design Time -- Side Channel III -- On Second-Order Differential Power Analysis -- Improved Higher-Order Side-Channel Attacks with FPGA Experiments -- Trusted Computing -- Secure Data Management in Trusted Computing -- Hardware Attacks and Countermeasures II -- Data Remanence in Flash Memory Devices -- Prototype IC with WDDL and Differential Routing – DPA Resistance Assessment -- Hardware Attacks and Countermeasures III -- DPA Leakage Models for CMOS Logic Circuits -- The “Backend Duplication” Method -- Efficient Hardware I -- Hardware Acceleration of the Tate Pairing in Characteristic Three -- Efficient Hardware for the Tate Pairing Calculation in Characteristic Three -- Efficient Hardware II -- AES on FPGA from the Fastest to the Smallest -- A Very Compact S-Box for AES.

264 citations


Proceedings ArticleDOI
07 Mar 2005
TL;DR: The experimental results presented in this paper demonstrate that the number and placement of voters in the TMR design directly affect fault tolerance: the fraction of routing upsets able to cause an error in the TMR circuit ranges from 4.03% to 0.98%.
Abstract: Triple modular redundancy (TMR) is a suitable fault-tolerant technique for SRAM-based FPGAs. However, one of the main challenges in achieving 100% robustness in designs protected by TMR running on programmable platforms is to prevent upsets in the routing from provoking undesirable connections between signals from distinct redundant logic parts, which can generate an error in the output. This paper investigates the optimal design of the TMR logic (e.g., by cleverly inserting voters) to ensure robustness. Four different versions of a TMR digital filter were analyzed by fault injection. Faults were randomly inserted straight into the bitstream of the FPGA. The experimental results presented in this paper demonstrate that the number and placement of voters in the TMR design can directly affect the fault tolerance: the fraction of routing upsets able to cause an error in the TMR circuit ranges from 4.03% to 0.98%.

243 citations
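
To make the voting idea in the TMR entry above concrete, here is a minimal sketch of a bitwise 2-of-3 majority voter. It is only a software model under stated assumptions: the names `majority_vote` and `flip_bit` and the 8-bit test word are hypothetical, and the sketch does not reproduce the paper's filter designs or its bitstream fault injection.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three redundant module outputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single upset by inverting one bit of a word (illustrative only)."""
    return word ^ (1 << bit)


golden = 0b10110010  # hypothetical fault-free output word

# An upset confined to one redundant copy is outvoted by the other two.
single_fault = majority_vote(golden, flip_bit(golden, 3), golden)

# The same bit corrupted in two copies, as an unwanted routing connection
# between redundant domains could cause, defeats the voter.
double_fault = majority_vote(flip_bit(golden, 3), flip_bit(golden, 3), golden)
```

The second case is the failure mode the abstract is concerned with: a voter only masks faults that stay inside one redundant domain.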


Proceedings ArticleDOI
20 Feb 2005
TL;DR: A novel packet classification architecture called BV-TCAM is presented, which is implemented for an FPGA-based Network Intrusion Detection System (NIDS), which can report multiple matches at gigabit per second network link rates.
Abstract: Using FPGA technology for real-time network intrusion detection has attracted considerable research effort recently. In this paper, a novel packet classification architecture called BV-TCAM is presented, which is implemented for an FPGA-based Network Intrusion Detection System (NIDS). The classifier can report multiple matches at gigabit per second network link rates. The BV-TCAM architecture combines the Ternary Content Addressable Memory (TCAM) and the Bit Vector (BV) algorithm to effectively compress the data representations and boost throughput. A tree-bitmap implementation of the BV algorithm is used for source and destination port lookup while a TCAM performs the lookup of the other header fields, which can be represented as a prefix or exact value. The architecture eliminates the requirement for prefix expansion of port ranges. With the aid of a small embedded TCAM, packet classification can be implemented in a relatively small part of the available logic of an FPGA. The design is prototyped and evaluated in a Xilinx FPGA XCV2000E on the FPX platform. Even with the most difficult set of rules and packet inputs, the circuit is fast enough to sustain OC48 traffic throughput. Using larger and faster FPGAs, the system can work at speeds greater than OC192.

234 citations
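
The combination of ternary matching and bit vectors described in the BV-TCAM entry above can be sketched in software as follows. This is a rough illustration, not the paper's architecture: the rule tuple layout, the function names, and the sample rules are all assumptions, and a linear scan over port ranges stands in for the tree-bitmap lookup the abstract mentions.

```python
from typing import List, Tuple

# Hypothetical rule layout: (tcam_value, tcam_mask, port_lo, port_hi).
Rule = Tuple[int, int, int, int]


def tcam_vector(key: int, rules: List[Rule]) -> int:
    """Bit i is set when rule i's ternary entry matches the key."""
    vec = 0
    for i, (value, mask, _, _) in enumerate(rules):
        if (key ^ value) & mask == 0:  # a mask bit of 0 means "don't care"
            vec |= 1 << i
    return vec


def port_vector(port: int, rules: List[Rule]) -> int:
    """Bit i is set when the port falls inside rule i's range (linear scan)."""
    vec = 0
    for i, (_, _, lo, hi) in enumerate(rules):
        if lo <= port <= hi:
            vec |= 1 << i
    return vec


def classify(key: int, port: int, rules: List[Rule]) -> List[int]:
    """Return every rule index matched by both lookups (multi-match)."""
    both = tcam_vector(key, rules) & port_vector(port, rules)
    return [i for i in range(len(rules)) if (both >> i) & 1]


rules = [
    (0x0A000000, 0xFF000000, 0, 65535),     # 10.0.0.0/8, any port
    (0x0A010000, 0xFFFF0000, 80, 80),       # 10.1.0.0/16, port 80 only
    (0xC0A80000, 0xFFFF0000, 1024, 65535),  # 192.168.0.0/16, high ports
]
matches = classify(0x0A010203, 80, rules)   # address 10.1.2.3, port 80
```

Because port ranges are tested directly rather than expanded into prefixes, each rule occupies one entry, which is the property the abstract highlights.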


Proceedings ArticleDOI
20 Feb 2005
TL;DR: This architecture features a novel adaptive logic module (ALM) that is based on a 6-LUT, but can be partitioned into two smaller LUTs to efficiently implement circuits containing a range of LUT sizes that arises in conventional synthesis flows.
Abstract: This paper describes the Altera Stratix II™ logic and routing architecture. This architecture features a novel adaptive logic module (ALM) that is based on a 6-LUT, but can be partitioned into two smaller LUTs to efficiently implement circuits containing a range of LUT sizes that arises in conventional synthesis flows. This provides a performance increase of 15% in the Stratix II architecture while reducing area by 2%. The ALM also includes a more powerful arithmetic structure that can perform two bits of arithmetic per ALM, and perform a sum of up to three inputs. The routing fabric adds a new set of fast inputs to the routing multiplexers for another 3% improvement in performance, while other improvements in routing efficiency cause another 6% reduction in area. These changes in combination with other circuit and architecture changes in Stratix II contribute 27% of an overall 51% performance improvement (including architecture and process improvement). The architecture changes reduce area by 10% in the same process, and by 50% after including process migration.

226 citations


Proceedings ArticleDOI
20 Feb 2005
TL;DR: A 64-bit ANSI/IEEE Std 754-1985 floating-point hardware matrix multiplier optimized for FPGA implementation is introduced, and a scalable linear array of processing elements (PEs) supporting the proposed block algorithm is implemented in Xilinx Virtex II Pro technology.
Abstract: We introduce a 64-bit ANSI/IEEE Std 754-1985 floating point design of a hardware matrix multiplier optimized for FPGA implementations. A general block matrix multiplication algorithm, applicable for an arbitrary matrix size is proposed. The algorithm potentially enables optimum performance by exploiting the data locality and reusability incurred by the general matrix multiplication scheme and considering the limitations of the I/O bandwidth and the local storage volume. We implement a scalable linear array of processing elements (PE) supporting the proposed algorithm in the Xilinx Virtex II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7X and up to 18X consuming the least reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching performance of, e.g., 15.6 GFLOPS with 1600 KB local memory and 400 MB/s external memory bandwidth.

224 citations
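
As a software analogue of the block matrix multiplication mentioned in the entry above, the sketch below tiles the three loops so that each tile of the operands is reused while it is "local". It is a generic blocked multiply, assuming a single hypothetical `block` size; it does not model the paper's PE array, its scheduling, or its memory hierarchy.

```python
from typing import List

Matrix = List[List[float]]


def blocked_matmul(a: Matrix, b: Matrix, block: int) -> Matrix:
    """Multiply a (n x k) by b (k x m) one block-sized tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # Work on one tile; min() handles sizes that are not
                # multiples of the block size (arbitrary matrix sizes).
                for i in range(i0, min(i0 + block, n)):
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c


product = blocked_matmul([[1, 2, 3], [4, 5, 6]],
                         [[7, 8], [9, 10], [11, 12]],
                         block=2)
```

The block size is the knob that trades local storage against external bandwidth, which is the trade-off the abstract refers to.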


Book ChapterDOI
29 Aug 2005
TL;DR: Two new FPGA designs for the Advanced Encryption Standard (AES) are presented, believed to be the fastest and the smallest, and includes support for continued throughput during key changes for both encryption and decryption which previous pipelined designs have omitted.
Abstract: Two new FPGA designs for the Advanced Encryption Standard (AES) are presented. The first is believed to be the fastest, achieving 25 Gbps throughput using a Xilinx Spartan-III (XC3S2000) device. The second is believed to be the smallest and fits into a Xilinx Spartan-II (XC2S15) device, only requiring two block memories and 124 slices to achieve a throughput of 2.2 Mbps. These designs show the extremes of what is possible and have radically different applications from high performance e-commerce IPsec servers to low power mobile and home applications. The high speed design presented here includes support for continued throughput during key changes for both encryption and decryption which previous pipelined designs have omitted.

211 citations


Journal Article
TL;DR: In this paper, two new FPGA designs for the Advanced Encryption Standard (AES) are presented, the first achieving 25 Gbps throughput using a Xilinx Spartan-III (XC3S2000) device and the second achieving 2.2 Mbps.
Abstract: Two new FPGA designs for the Advanced Encryption Standard (AES) are presented. The first is believed to be the fastest, achieving 25 Gbps throughput using a Xilinx Spartan-III (XC3S2000) device. The second is believed to be the smallest and fits into a Xilinx Spartan-II (XC2S15) device, only requiring two block memories and 124 slices to achieve a throughput of 2.2 Mbps. These designs show the extremes of what is possible and have radically different applications from high performance e-commerce IPsec servers to low power mobile and home applications. The high speed design presented here includes support for continued throughput during key changes for both encryption and decryption which previous pipelined designs have omitted.

206 citations


Journal ArticleDOI
TL;DR: A detailed and flexible power model which has been integrated in the widely used Versatile Place and Route (VPR) CAD tool is described, which estimates the dynamic, short-circuit, and leakage power consumed by FPGAs.
Abstract: Power has become a critical issue for field-programmable gate array (FPGA) vendors. Understanding the power dissipation within FPGAs is the first step in developing power-efficient architectures and computer-aided design (CAD) tools for FPGAs. This article describes a detailed and flexible power model which has been integrated in the widely used Versatile Place and Route (VPR) CAD tool. This power model estimates the dynamic, short-circuit, and leakage power consumed by FPGAs. It is the first flexible power model developed to evaluate architectural tradeoffs and the efficiency of power-aware CAD tools for a variety of FPGA architectures, and is freely available for noncommercial use. The model is flexible, in that it can estimate the power for a wide variety of FPGA architectures, and it is fast, in that it does not require extensive simulation, meaning it can be used to explore a large architectural space. We show how the model can be used to investigate the impact of various architectural parameters on the energy consumed by the FPGA, focusing on the segment length, switch block topology, lookup-table size, and cluster size.
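
The three power components named in the entry above can be illustrated with the usual textbook expressions. This is a sketch under assumptions, not the model integrated in VPR: dynamic power is taken as 0.5 * alpha * C * Vdd^2 * f per net (alpha being average transitions per clock cycle), short-circuit power is treated as a caller-supplied fraction of dynamic power, and leakage as Vdd times a lumped leakage current. All names and numbers are hypothetical.

```python
from typing import Iterable, Tuple


def dynamic_power(nets: Iterable[Tuple[float, float]],
                  vdd: float, f_clk: float) -> float:
    """Sum 0.5 * alpha * C * Vdd^2 * f over nets given as (alpha, C) pairs."""
    return sum(0.5 * alpha * cap * vdd ** 2 * f_clk for alpha, cap in nets)


def total_power(nets: Iterable[Tuple[float, float]], vdd: float, f_clk: float,
                i_leak: float, sc_fraction: float) -> float:
    """Dynamic + short-circuit (assumed fraction of dynamic) + leakage."""
    p_dyn = dynamic_power(nets, vdd, f_clk)
    return p_dyn + sc_fraction * p_dyn + vdd * i_leak


# One net: 0.5 transitions per cycle, 2 pF, 1.0 V supply, 100 MHz clock.
example_nets = [(0.5, 2e-12)]
p_dyn = dynamic_power(example_nets, vdd=1.0, f_clk=100e6)
p_tot = total_power(example_nets, vdd=1.0, f_clk=100e6,
                    i_leak=1e-5, sc_fraction=0.1)
```

Since no simulation is involved, an estimate of this form can be re-evaluated cheaply for each candidate architecture, which is what makes sweeps over parameters such as LUT size practical.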

Book
14 Dec 2005
TL;DR: A one-of-a-kind survey of the field of Reconfigurable Computing gives a comprehensive introduction to a discipline that offers a 10X-100X acceleration of algorithms over microprocessors.
Abstract: A one-of-a-kind survey of the field of Reconfigurable Computing. Gives a comprehensive introduction to a discipline that offers a 10X-100X acceleration of algorithms over microprocessors. Discusses the impact of reconfigurable hardware on a wide range of applications: signal and image processing, network security, bioinformatics, and supercomputing. Includes the history of the field as well as recent advances. Includes an extensive bibliography of primary sources.

Proceedings ArticleDOI
01 Nov 2005
TL;DR: This paper presents a novel architecture for matrix inversion by generalizing the QR decomposition-based recursive least square (RLS) algorithm, and using Squared Givens rotations and a folded systolic array for FPGA implementation.
Abstract: This paper presents a novel architecture for matrix inversion by generalizing the QR decomposition-based recursive least square (RLS) algorithm. The use of Squared Givens rotations and a folded systolic array makes this architecture very suitable for FPGA implementation. Input is a 4 × 4 matrix of complex, floating point values. The matrix inversion design can achieve throughput of 0.13M updates per second on a state of the art Xilinx Virtex4 FPGA running at 115 MHz. Due to the modular partitioning and interfacing between multiple Boundary and Internal processing units, this architecture is easily extendable for other matrix sizes.
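
For readers unfamiliar with rotation-based QR, the sketch below inverts a small real matrix using conventional Givens rotations followed by back-substitution. It is a stand-in only: the paper uses Squared Givens rotations on complex floating-point data in a folded systolic array, none of which is reproduced here, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def givens_qr(a: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (Qt, R) with Qt * A = R and R upper triangular."""
    n = len(a)
    r = [[float(v) for v in row] for row in a]
    qt = [[float(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        for row in range(col + 1, n):
            x, y = r[col][col], r[row][col]
            h = math.hypot(x, y)
            if h == 0.0:
                continue  # entry is already zero
            c, s = x / h, y / h
            # Apply the same plane rotation to R and to the accumulated Qt.
            for m in (r, qt):
                for j in range(n):
                    top, bot = m[col][j], m[row][j]
                    m[col][j] = c * top + s * bot
                    m[row][j] = -s * top + c * bot
    return qt, r


def invert(a: Matrix) -> Matrix:
    """Invert a nonsingular matrix by solving R * X = Qt column by column."""
    n = len(a)
    qt, r = givens_qr(a)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in reversed(range(n)):
            acc = qt[i][j] - sum(r[i][k] * inv[k][j] for k in range(i + 1, n))
            inv[i][j] = acc / r[i][i]
    return inv


inverse = invert([[4.0, 3.0], [6.0, 3.0]])
```

Each rotation touches only two rows, which is why the computation maps naturally onto an array of small boundary and internal cells.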

Proceedings ArticleDOI
10 Oct 2005
TL;DR: This paper discusses ways to save and restore the state information of a hardware task, and significantly reduces the amount of readback data by reading only those configuration frames that contain state information.
Abstract: Today's Field Programmable Gate Arrays (FPGAs) can be reconfigured partially, which makes it possible to share resources between various functional modules (hardware tasks) over time. This concept is well known in the area of conventional operating systems. However, in order to transfer resource sharing concepts to operating systems on FPGAs, several underlying mechanisms have to be developed. One of these mechanisms is to suspend hardware tasks and restart them at another time and/or another area of the FPGA. Addressing this problem, this paper discusses ways to save and restore the state information of a hardware task. Afterwards, an implementation of a state relocation mechanism is presented that uses the standard configuration port. In contrast to similar approaches, we significantly reduce the amount of readback data by reading only those configuration frames that contain state information. We finally determine the time overhead for task relocation, which is essential for most multitasking concepts, like defragmentation.

Journal ArticleDOI
Fei Li, Yan Lin, Lei He, Deming Chen, Jason Cong
TL;DR: It is shown that interconnect power is dominant and leakage power is significant in nanometer technologies, and FPGA area and power are reduced at the same time by tuning the cluster and LUT sizes.
Abstract: This paper studies power modeling for field programmable gate arrays (FPGAs) and investigates FPGA power characteristics in nanometer technologies. Considering both dynamic and leakage power, a mixed-level power model that combines switch-level models for interconnects and macromodels for look-up tables (LUTs) is developed. Gate-level netlists back-annotated with postlayout capacitances and delays are generated and cycle-accurate power simulation is performed using the mixed-level power model. The resulting power analysis framework is named fpgaEVA-LP2. Experiments show that fpgaEVA-LP2 achieves high fidelity compared to SPICE simulation, and the absolute error is merely 8% on average. fpgaEVA-LP2 can be used to examine the power impact of FPGA circuits, architectures, and CAD algorithms, and it is used to study the power characteristics of existing FPGA architectures in this paper. It is shown that interconnect power is dominant and leakage power is significant in nanometer technologies. In addition, tuning cluster and LUT sizes leads to a 1.7× energy difference and a 0.8× delay difference between the resulting min-energy and min-delay FPGA architectures, and FPGA area and power are reduced at the same time by tuning the cluster and LUT sizes. The existing commercial architectures are similar to the min-energy (and min-area at the same time) architecture according to this study. Therefore, innovative FPGA circuits, architectures, and CAD algorithms, for example, considering programmable power supply voltage, are needed to further reduce FPGA power.

Patent
18 Jan 2005
TL;DR: In this article, the authors present a system and method for online configuration of a measurement system, where the user can access a server over a network and specify a desired task, and receive programs and/or configuration information which are usable to configure the user's measurement system hardware (and/or software) to perform the desired task.
Abstract: A system and method for online configuration of a measurement system. The user may access a server over a network and specify a desired task, e.g., a measurement task, and receive programs and/or configuration information which are usable to configure the user's measurement system hardware (and/or software) to perform the desired task. Additionally, if the user does not have the hardware required to perform the task, the required hardware may be sent to the user, along with programs and/or configuration information. The hardware may be reconfigurable hardware, such as an FPGA or a processor/memory based device. In one embodiment, the required hardware may be pre-configured to perform the task before being sent to the user. In another embodiment, the system and method may provide a graphical program in response to receiving the user's task specification, where the graphical program may be usable by the measurement system to perform the task.

Proceedings ArticleDOI
20 Jun 2005
TL;DR: This paper looks at the advantages and disadvantages of FPGA technology, its suitability for image processing and computer vision tasks, and attempts to suggest some directions for the future.
Abstract: Reconfigurable hardware, in the form of Field Programmable Gate Arrays (FPGAs), is becoming increasingly attractive for digital signal processing problems, including image processing and computer vision tasks. The ability to exploit the parallelism often found in these problems, as well as the ability to support different modes of operation on a single hardware substrate, gives these devices a particular advantage over fixed architecture devices such as serial CPUs and DSPs. Further, development times are substantially shorter than dedicated hardware in the form of Application Specific ICs (ASICs), and small changes to a design can be prototyped in a matter of hours. On the other hand, designing with FPGAs still requires expertise beyond that found in many vision labs today. This paper looks at the advantages and disadvantages of FPGA technology, its suitability for image processing and computer vision tasks, and attempts to suggest some directions for the future.

Proceedings ArticleDOI
04 Apr 2005
TL;DR: The REPLICA (relocation per online configuration alteration) filter is developed, which is capable of performing the necessary bitstream manipulations during the regular download process and enables the integration of dynamic systems that can be adapted to changing demands during runtime.
Abstract: The feature of partial reconfiguration provided by currently available field programmable gate arrays (FPGAs) makes it possible to change hardware modules while others keep working. The combination of this feature and the high gate capacity enables the integration of dynamic systems that can be adapted to changing demands during runtime. Placing the dynamically changing modules along a horizontal communication infrastructure not only provides communication facilities but also enables the relocation of pre-synthesized modules by bitstream manipulations. The exact placement of an incoming module is determined according to the current resource allocation, which results in an online placement problem. In order to prevent any extra configuration overhead for the relocation process, we developed the REPLICA (relocation per online configuration alteration) filter, which is capable of performing the necessary bitstream manipulations during the regular download process. The filter architecture, a configuration manager and an evaluation example are presented in this paper.

Journal ArticleDOI
TL;DR: This work presents a new approach to compute multiple sequence alignments in far shorter time using reconfigurable hardware, which results in an implementation of ClustalW with significant runtime savings on a standard off-the-shelf FPGA.
Abstract: Summary: Aligning hundreds of sequences using progressive alignment tools such as ClustalW requires several hours on state-of-the-art workstations. We present a new approach to compute multiple sequence alignments in far shorter time using reconfigurable hardware. This results in an implementation of ClustalW with significant runtime savings on a standard off-the-shelf FPGA. Availability: An online server for ClustalW running on a Pentium IV 3 GHz with a Xilinx XC2V6000 FPGA PCI-board is available at http://beta.projectproteus.org. The PE hardware design in Verilog HDL is available on request from the first author. Contact: tim.oliver@pmail.ntu.edu.sg

Proceedings ArticleDOI
07 Mar 2005
TL;DR: Warp processing is proposed, a technique capable of optimizing a software application by dynamically and transparently re-implementing critical software kernels as custom circuits in on-chip configurable logic, and it is demonstrated that the soft-core based warp processor achieves average speedups of 5.8 and energy reductions of 57% compared to the soft core alone.
Abstract: Field programmable gate arrays (FPGAs) provide designers with the ability to quickly create hardware circuits. Increases in FPGA configurable logic capacity and decreasing FPGA costs have enabled designers to more readily incorporate FPGAs in their designs. FPGA vendors have begun providing configurable soft processor cores that can be synthesized onto their FPGA products. While FPGAs with soft processor cores provide designers with increased flexibility, such processors typically have degraded performance and energy consumption compared to hard-core processors. Previously, we proposed warp processing, a technique capable of optimizing a software application by dynamically and transparently re-implementing critical software kernels as custom circuits in on-chip configurable logic. In this paper, we study the potential of a MicroBlaze soft-core based warp processing system to eliminate the performance and energy overhead of a soft-core processor compared to a hard-core processor. We demonstrate that the soft-core based warp processor achieves average speedups of 5.8 and energy reductions of 57% compared to the soft core alone. Our data shows that a soft-core based warp processor yields performance and energy consumption competitive with existing hard-core processors, thus expanding the usefulness of soft processor cores on FPGAs to a broader range of applications.

Proceedings ArticleDOI
20 Feb 2005
TL;DR: This paper presents an architecture that combines VLIW (Very Large Instruction Word) processing with the capability to introduce application specific customized instructions and complex hardware functions that allows for an overall speedup of 30X and 12X on average for signal processing benchmarks from the MediaBench.
Abstract: The capability and heterogeneity of new FPGA (Field Programmable Gate Array) devices continues to increase with each new line of devices. Efficiently programming these devices is increasing in difficulty. However, FPGAs continue to be utilized for algorithms traditionally targeted to embedded DSP microprocessors such as signal and image processing applications. This paper presents an architecture that combines VLIW (Very Large Instruction Word) processing with the capability to introduce application specific customized instructions and complex hardware functions. To support this architecture, a compilation and design automation flow are described for programs written in C. Several design tradeoffs for the architecture were examined including number of VLIW functional units and register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply accumulate operations. We show that our combined VLIW with hardware functions exhibit as much as 230X speedup and 63X on average for computational kernels for a set of benchmarks. This allows for an overall speedup of 30X and 12X on average for signal processing benchmarks from the MediaBench.

Proceedings ArticleDOI
11 Dec 2005
TL;DR: A novel hardware accelerator for Monte Carlo (MC) simulation, based on a generic architecture, which combines speed and flexibility by integrating a pipelined MC core with an on-chip instruction processor is described.
Abstract: This paper describes a novel hardware accelerator for Monte Carlo (MC) simulation, and illustrates its implementation in field programmable gate array (FPGA) technology for speeding up financial applications. Our accelerator is based on a generic architecture, which combines speed and flexibility by integrating a pipelined MC core with an on-chip instruction processor. We develop a generic number system representation for determining the choice of number representation that meets numerical precision requirements. Our approach is then used in a complex financial engineering application, involving the Brace, Gatarek and Musiela (BGM) interest rate model for pricing derivatives. We address, in our BGM model, several challenges including the generation of Gaussian distributed random numbers and pipelining of the MC simulation. Our BGM application, based on an off-the-shelf system with a Xilinx XC2VP30 device at 50 MHz, is over 25 times faster than software running on a 1.5 GHz Intel Pentium machine.
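
One of the challenges the entry above names is producing Gaussian random numbers for the Monte Carlo core. The sketch below uses the Box-Muller transform, chosen here purely for illustration; the abstract does not say which generator the authors used. The payoff function, path count, and seed are hypothetical, and nothing here implements the BGM model.

```python
import math
import random
from typing import Callable, Iterator


def box_muller(rng: random.Random) -> Iterator[float]:
    """Yield standard normal samples, two per pair of uniform draws."""
    while True:
        u1 = 1.0 - rng.random()  # shift into (0, 1] so log() stays finite
        u2 = rng.random()
        radius = math.sqrt(-2.0 * math.log(u1))
        yield radius * math.cos(2.0 * math.pi * u2)
        yield radius * math.sin(2.0 * math.pi * u2)


def mc_mean(payoff: Callable[[float], float], n_paths: int, seed: int = 0) -> float:
    """Average a payoff over n_paths independent Gaussian draws."""
    gauss = box_muller(random.Random(seed))
    return sum(payoff(next(gauss)) for _ in range(n_paths)) / n_paths


# Sanity checks on the generator: E[Z] is 0 and E[Z^2] is 1 for Z ~ N(0, 1).
first_moment = mc_mean(lambda z: z, 20000)
second_moment = mc_mean(lambda z: z * z, 20000)
```

Every path is independent of the others, which is the property that lets a hardware core keep a deep pipeline full.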

Proceedings ArticleDOI
24 Sep 2005
TL;DR: An infrastructure for rapidly generating RTL models of soft processors, as well as a methodology for measuring their area, performance, and power, are presented.
Abstract: As more embedded systems are built using FPGA platforms, there is an increasing need to support processors in FPGAs. One option is the soft processor, a programmable instruction processor implemented in the reconfigurable logic of the FPGA. Commercial soft processors have been widely deployed, and hence we are motivated to understand their microarchitecture. We must re-evaluate microarchitecture in the soft processor context because an FPGA platform is significantly different than an ASIC platform; for example, the relative speed of memory and logic is quite different in the two platforms, as is the area cost. In this paper we present an infrastructure for rapidly generating RTL models of soft processors, as well as a methodology for measuring their area, performance, and power. Using our automatically-generated soft processors we explore the microarchitecture trade-off space including: (i) hardware vs software multiplication support; (ii) shifter implementations; and (iii) pipeline depth, organization, and forwarding. For example, we find that a 3-stage pipeline has better wall-clock-time performance than deeper pipelines, despite lower clock frequency. We also compare our designs to Altera's NiosII commercial soft processor variations and find that our automatically generated designs span the design space while remaining very competitive.

Proceedings ArticleDOI
18 Apr 2005
TL;DR: This work introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling in the Apriori algorithm.
Abstract: The Apriori algorithm is a popular correlation-based data mining kernel. However, it is a computationally expensive algorithm and the running times can stretch up to days for large databases, as database sizes can extend to gigabytes. Through the use of a new extension to the systolic array architecture, the time required for processing can be significantly reduced. Our array architecture implementation on a Xilinx Virtex-II Pro 100 provides performance that can be orders of magnitude faster than the state-of-the-art software implementations. The system is easily scalable and introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling.
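
Illustrative sketch (not from the paper): for readers unfamiliar with the kernel being accelerated, the following is a minimal software version of Apriori frequent-itemset mining. It shows the level-wise count, join and prune steps only; it does not model the systolic array or the "systolic injection" mechanism. The function name `apriori` and the absolute-count `min_support` parameter are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset seen in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count step: scan the database once per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent
```

The repeated database scan in the count step is the part that dominates run time on large databases, which is why it is a natural target for a hardware array.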

Proceedings ArticleDOI
18 Mar 2005
TL;DR: FPGA implementation results confirm that the proposed DA architecture can implement a 1024-tap FIR filter with significantly smaller area usage than the original LUT-based DA and the LUT-less DA-OBC.
Abstract: The paper presents a new memory-efficient distributed arithmetic (DA) architecture for high-order FIR filters. The proposed architecture is based on a memory reduction technique for DA look-up-tables (LUTs); it requires fewer transistors for high-order filters than original LUT-based DA, DA-offset binary coding (DA-OBC), and the LUT-less DA-OBC. Recursive iteration of the memory reduction technique significantly increases the maximum filter order implementable on an FPGA platform by not only saving transistor counts, but also balancing hardware usage between logic element (LE) and memory. FPGA implementation results confirm that the proposed DA architecture can implement a 1024-tap FIR filter with significantly smaller area usage (<50%) than the original LUT-based DA and the LUT-less DA-OBC.
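
Illustrative sketch (not from the paper): the baseline that the abstract compares against, the original LUT-based DA, can be modelled in a few lines. The sketch assumes integer coefficients and two's-complement input samples that fit in `bits` bits; the names `build_da_lut`, `da_dot` and `fir_da` are hypothetical. It does not implement the paper's memory reduction technique. Note that the table has 2^N entries for N taps, which is the growth that makes plain LUT-based DA impractical for high-order filters.

```python
def build_da_lut(coeffs):
    """LUT entry at address a holds the sum of coeffs[k] for every set bit k of a."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_dot(lut, samples, bits):
    """Bit-serial inner product: one LUT lookup and one shifted add per input bit."""
    acc = 0
    for j in range(bits):
        # Gather bit j of every sample into a LUT address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> j) & 1) << k
        term = lut[addr] << j
        # The most significant bit carries negative weight in two's complement.
        if j == bits - 1:
            acc -= term
        else:
            acc += term
    return acc


def fir_da(coeffs, signal, bits):
    """FIR filter output computed with distributed arithmetic (zero initial state)."""
    lut = build_da_lut(coeffs)
    taps = len(coeffs)
    out = []
    for n in range(len(signal)):
        window = [signal[n - k] if n - k >= 0 else 0 for k in range(taps)]
        out.append(da_dot(lut, window, bits))
    return out
```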

Journal ArticleDOI
TL;DR: A new approach to bio-sequence database scanning using re-configurable field-programmable gate array (FPGA)-based hardware platforms to gain high performance at low cost and shows how run-time reconfiguration can be used to further improve performance.
Abstract: Protein sequences with unknown functionality are often compared to a set of known sequences to detect functional similarities. Efficient dynamic-programming algorithms exist for solving this problem; however, current solutions still require significant scan times. These scan time requirements are likely to become even more severe due to the rapid growth in the size of these databases. In this paper, we present a new approach to bio-sequence database scanning using re-configurable field-programmable gate array (FPGA)-based hardware platforms to gain high performance at low cost. Efficient mappings of the Smith-Waterman algorithm using fine-grained parallel processing elements (PEs) that are tailored toward the parameters of a query have been designed. We use customization opportunities available at run time to dynamically reconfigure the PEs to make better use of available resources. Our FPGA implementation achieves a speedup of approximately 170 for linear gap penalties and 125 for affine gap penalties compared to a standard desktop computing platform. We show how run-time reconfiguration can be used to further improve performance.
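
Illustrative sketch (not from the paper): the Smith-Waterman recurrence with a linear gap penalty, written as plain software. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the function name `smith_waterman_score` are assumptions chosen for the example; real protein scans would use a substitution matrix instead of a single match/mismatch pair.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b under a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best
```

Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time. That independence is what a linear array of processing elements exploits.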

Patent
05 May 2005
TL;DR: In this article, a reconfigurable hardware architecture (RHA) is configured to include a communications infrastructure that uses a high-bandwidth packet router to establish standard communications protocols between multiple interfaces and/or multiple devices that may be present on a single circuit card.
Abstract: Application Specific Integrated Circuit ('ASIC') devices, such as Field Programmable Gate Arrays ('FPGAs'), may be interconnected using serial I/O connections, such as high speed multi-gigabit serial transceiver ('MGT') connections. For example, serial I/O connections may be employed to interconnect a pair of ASICs to create a high bandwidth, low signal count connection, and in a manner so that any given pair of multiple ASIC devices on a single circuit card may communicate with each other through no more than one serial data communication link connection step. A reconfigurable hardware architecture ('RHA') may be configured to include a communications infrastructure that uses a high-bandwidth packet router to establish standard communications protocols between multiple interfaces and/or multiple devices that may be present on a single circuit card. Additionally, a communications infrastructure may be established across multiple circuit cards.

Proceedings ArticleDOI
07 Mar 2005
TL;DR: ROCCC as discussed by the authors is a compiler designed to generate circuits from C source code to execute on FPGAs, more specifically on CSoCs, and it generates RTL level HDLs from frequently executing kernels in an application.
Abstract: FPGAs, as computing devices, offer significant speedup over microprocessors. Furthermore, their configurability offers an advantage over traditional ASICs. However, they do not yet enjoy high-level language programmability, as microprocessors do. This has become the main obstacle for their wider acceptance by application designers. ROCCC is a compiler designed to generate circuits from C source code to execute on FPGAs, more specifically on CSoCs. It generates RTL level HDLs from frequently executing kernels in an application. In this paper, we describe ROCCC's system overview and focus on its data path generation. We compare the performance of ROCCC-generated VHDL code with that of Xilinx IPs. The synthesis result shows that the ROCCC-generated circuit takes around 2× to 3× the area and runs at a comparable clock rate.

Journal ArticleDOI
TL;DR: The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.
Abstract: This paper presents a method for producing hardware designs for elliptic curve cryptography (ECC) systems over the finite field GF(2^m), using the optimal normal basis for the representation of numbers. Our field multiplier design is based on a parallel architecture containing multiple m-bit serial multipliers; by changing the number of such serial multipliers, designers can obtain implementations with different tradeoffs in speed, size and level of security. A design generator has been developed which can automatically produce a customised ECC hardware design that meets user-defined requirements. To facilitate performance characterization, we have developed a parametric model for estimating the number of cycles for our generic ECC architecture. The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.
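
Illustrative sketch (not from the paper): a bit-serial multiplier in GF(2^m), processing one bit of an operand per iteration. The paper uses an optimal normal basis; this sketch deliberately swaps in the more familiar polynomial basis, so it shows the general shift-and-add idea and not the authors' multiplier. The function name `gf2m_mul` is hypothetical, and the usage example relies on the small field GF(2^8) with the AES reduction polynomial 0x11B purely because its products are easy to check.

```python
def gf2m_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m), polynomial basis.

    poly is the reduction polynomial with its x^m term included
    (for example 0x11B for x^8 + x^4 + x^3 + x + 1 with m = 8).
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a  # addition in GF(2^m) is bitwise XOR
        b >>= 1
        a <<= 1  # multiply a by x
        if (a >> m) & 1:
            a ^= poly  # reduce modulo the field polynomial
    return result
```

For example, `gf2m_mul(0x57, 0x13, 0x11B, 8)` evaluates to `0xFE`. The loop runs m times, which corresponds to the m clock cycles an m-bit serial multiplier would need; running several such multipliers side by side trades area for speed.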

Book ChapterDOI
B. Glackin, TM McGinnity, Liam Maguire, Qingxiang Wu, Ammar Belatreche
08 Jun 2005
TL;DR: An approach is presented in which a speed/area trade-off is made and time multiplexing of the neuron model implemented on the FPGA is used to generate large network topologies; FPGA implementation results demonstrate a performance increase over a PC-based simulation.
Abstract: This paper presents a strategy for the implementation of large scale spiking neural network topologies on FPGA devices based on the I&F conductance model. Analysis of the logic requirements demonstrates that large scale implementations are not viable if a fully parallel implementation strategy is utilised. Thus the paper presents an alternative approach where a trade-off in terms of speed/area is made and time multiplexing of the neuron model implemented on the FPGA is used to generate large network topologies. FPGA implementation results demonstrate a performance increase over a PC-based simulation.
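
Illustrative sketch (not from the paper): time multiplexing means that one physical neuron-update unit is reused for many neurons, with each neuron's state held in memory between visits. The sketch below models that idea in software. It assumes a simple leaky integrate-and-fire neuron driven by a constant input current, which is a simplification of the I&F conductance model named in the abstract; the function names and all parameter values (`dt`, `tau`, `v_thresh`, `v_reset`) are hypothetical.

```python
def lif_step(v, i_in, dt=1.0, tau=10.0, v_thresh=1.0, v_reset=0.0):
    """Advance one leaky integrate-and-fire neuron by one time step.

    Returns the new membrane potential and whether the neuron fired.
    """
    v = v + (dt / tau) * (i_in - v)
    if v >= v_thresh:
        return v_reset, True
    return v, False


def simulate_time_multiplexed(inputs, steps):
    """Simulate len(inputs) neurons with a single shared update function.

    The inner loop visits the neurons one after another, the way a single
    hardware unit would be shared across stored neuron states.
    """
    states = [0.0] * len(inputs)
    spikes = [0] * len(inputs)
    for _ in range(steps):
        for n in range(len(inputs)):
            states[n], fired = lif_step(states[n], inputs[n])
            if fired:
                spikes[n] += 1
    return spikes
```

The trade-off is visible in the loop structure: the network size is limited by state memory rather than by logic, but each simulated time step costs one pass over all neurons instead of a single parallel update.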