Showing papers presented at "Field-Programmable Custom Computing Machines in 2003"

Proceedings Article•DOI•

Implementation of a content-scanning module for an Internet firewall

James Moscola¹, John W. Lockwood¹, Ronald P. Loui¹, Michael Pachos¹•Institutions (1)

09 Apr 2003

TL;DR: A module has been implemented in Field Programmable Gate Array (FPGA) hardware that scans the content of Internet packets at Gigabits/second rates and automatically generates the Finite State Machines (FSMs) to search for regular expressions.

Abstract: A module has been implemented in Field Programmable Gate Array (FPGA) hardware that scans the content of Internet packets at Gigabits/second rates. All of the packet processing operations are performed using reconfigurable hardware within a single Xilinx Virtex XCV2000E FPGA. A set of layered protocol wrappers is used to parse the headers and payloads of packets for Internet protocol data. A content matching server automatically generates the Finite State Machines (FSMs) to search for regular expressions. The complete system is operated on the Field-programmable Port Extender (FPX) platform.

286 citations
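
As a rough illustration of the FSM-based scanning described in the abstract above, the sketch below runs a hand-written deterministic state machine over a payload one byte at a time, much as a hardware FSM would consume one character per clock. The regular expression `ab*c`, the transition table, and the names `TRANSITIONS` and `payload_matches` are assumptions made for this example; the module in the paper generates its FSMs automatically and runs them in FPGA logic.

```python
# Hypothetical DFA for an unanchored search for the regular expression ab*c.
# State 0: no partial match; state 1: seen 'a' followed by zero or more 'b';
# state 2: a full match has been seen (accepting).
ACCEPT = 2
TRANSITIONS = {
    0: {ord("a"): 1},
    1: {ord("a"): 1, ord("b"): 1, ord("c"): ACCEPT},
}


def payload_matches(payload: bytes) -> bool:
    """Return True if any substring of payload matches ab*c."""
    state = 0
    for byte in payload:
        # Bytes without an explicit transition send the machine back to state 0.
        state = TRANSITIONS[state].get(byte, 0)
        if state == ACCEPT:
            return True
    return False


print(payload_matches(b"GET /xxabbbc HTTP/1.1"))  # True
print(payload_matches(b"abxc"))                   # False
```

A table-driven loop keeps the per-byte work constant, which is the property that makes this style of matcher a natural fit for hardware.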

Proceedings Article•DOI•

Compact FPGA-based true and pseudo random number generators

K.H. Tsoi¹, K.H. Leung¹, Philip H. W. Leong¹•Institutions (1)

The Chinese University of Hong Kong¹

09 Apr 2003

TL;DR: Two FPGA-based implementations of random number generators intended for embedded cryptographic applications are presented, one a true random number generator which employs oscillator phase noise, and the second a bit serial implementation of a Blum Blum Shub pseudorandom number generator.

Abstract: Two FPGA-based (field programmable gate array) implementations of random number generators intended for embedded cryptographic applications are presented. The first is a true random number generator (TRNG) which employs oscillator phase noise, and the second is a bit serial implementation of a Blum Blum Shub (BBS) pseudorandom number generator (PRNG). Both designs are extremely compact and can be implemented on any FPGA or PLD device. They were designed specifically for use as FPGA-based cryptographic hardware cores. The TRNG and PRNG were tested using the NIST and Diehard random number test suites.

131 citations
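
The Blum Blum Shub generator named in the abstract above repeatedly squares a state modulo the product of two primes that are each congruent to 3 mod 4, and emits the low bit of each new state. The sketch below is a software illustration only, not the paper's bit-serial hardware design; the function name `bbs_bits` and the toy parameters are assumptions chosen so the output can be checked by hand.

```python
from math import gcd


def bbs_bits(p: int, q: int, seed: int, count: int) -> list[int]:
    """Generate `count` bits with x -> x^2 mod (p*q), emitting the low bit."""
    if p % 4 != 3 or q % 4 != 3:
        raise ValueError("p and q must both be congruent to 3 mod 4")
    modulus = p * q
    if gcd(seed, modulus) != 1:
        raise ValueError("seed must be coprime to p*q")
    state = seed
    bits = []
    for _ in range(count):
        state = (state * state) % modulus  # one modular squaring per output bit
        bits.append(state & 1)             # least significant bit of the state
    return bits


# Toy parameters only; cryptographic use needs large secret primes.
print(bbs_bits(11, 23, 3, 6))  # [1, 1, 0, 0, 1, 0]
```

With these toy values the successive states are 9, 81, 236, 36, 31, 202, which gives the bits shown.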

Proceedings Article•DOI•

Floating point unit generation and evaluation for FPGAs

Jian Liang¹, Russell Tessier¹, Oskar Mencer²•Institutions (2)

University of Massachusetts Amherst¹, Imperial College London²

09 Apr 2003

TL;DR: The floating point unit generation approach outlined in this paper allows for the creation of a vast collection of floating point units with differing throughput, latency, and area characteristics.

Abstract: Most commercial and academic floating point libraries for FPGAs (field programmable gate arrays) provide only a small fraction of all possible floating point units. In contrast, the floating point unit generation approach outlined in this paper allows for the creation of a vast collection of floating point units with differing throughput, latency, and area characteristics. Given performance requirements, our generation tool automatically chooses the proper implementation algorithm and architecture to create a compliant floating point unit. Our approach is fully integrated into standard C++ using ASC, a stream compiler for FPGAs, and the PAM-Blox II module generation environment. The floating point units created by our approach exhibit a factor of two latency improvement versus commercial FPGA floating point units, while consuming only half of the FPGA logic area.

112 citations

Proceedings Article•DOI•

The reliability of FPGA circuit designs in the presence of radiation induced configuration upsets

Michael Wirthlin¹, E. Johnson¹, Nathaniel Rollins¹, M. Caffrey², Paul Graham²•Institutions (2)

Brigham Young University¹, Los Alamos National Laboratory²

09 Apr 2003

TL;DR: An SEU simulator based on the SLAAC-1V computing board has been developed and is being used to characterize the reliability of SEU mitigation techniques for FPGAs.

Abstract: FPGAs are an appealing solution for space-based remote sensing applications. However, in a low-Earth orbit, FPGAs (field programmable gate arrays) are susceptible to Single-Event Upsets (SEUs). In an effort to understand the effects of SEUs, an SEU simulator based on the SLAAC-1V computing board has been developed. This simulator artificially upsets the configuration memory of an FPGA and measures its impact on FPGA designs. The accuracy of this simulation environment has been verified using ground-based radiation testing. This simulation tool is being used to characterize the reliability of SEU mitigation techniques for FPGAs.

108 citations

Proceedings Article•DOI•

Perturbation analysis for word-length optimization

George A. Constantinides¹•Institutions (1)

Imperial College London¹

09 Apr 2003

TL;DR: Application of the proposed procedure to adaptive filters realized in a Xilinx Virtex FPGA (field programmable gate array) has resulted in area reductions of up to 80%, power reductions of up to 98%, and speed-up of up to 36% over common alternative design strategies.

Abstract: This paper introduces a design tool and its associated procedures for determining the sensitivity of outputs in a digital signal processing design to small errors introduced by rounding or truncation of internal variables. The proposed approach can be applied to both linear and nonlinear designs. By analyzing the resulting sensitivity values, the proposed procedure is able to determine an appropriate distinct word-length for each internal variable. Also in this paper, the power optimizing capabilities of word-length optimization are studied for the first time. Application of the proposed procedure to adaptive filters realized in a Xilinx Virtex FPGA (field programmable gate array) has resulted in area reductions of up to 80% combined with power reductions of up to 98% and speed-up of up to 36% over common alternative design strategies.

90 citations

Proceedings Article•DOI•

Reconfigurable computing application frameworks

A.L. Slade¹, Brent Nelson¹, Brad Hutchings¹•Institutions (1)

Brigham Young University¹

09 Apr 2003

TL;DR: An application framework is discussed for developing CCM-based applications beyond just the hardware configuration. The framework is deployed with the final application, which permits dynamic circuit configurations that include data folding optimizations based on user input, and the resulting system aids in creating applications that are potentially more intuitive, easier to develop, and better performing.

Abstract: FPGA-based (field programmable gate array) configurable computing machines (CCMs) offer powerful and flexible general-purpose computing platforms. However, development for FPGA-based designs using modern CAD (computer aided design) tools is geared mainly toward an ASIC-like process. This is inadequate for the needs of CCM application development. This paper discusses an application framework for developing CCM-based applications beyond just the hardware configuration. This framework leverages the advantages of CCMs (availability, programmability, visibility, and controllability) to help create CCM-based applications throughout the entire development process (i.e. design, debug, and deploy). The framework itself is deployed with the final application, thus permitting dynamic circuit configurations that include data folding optimizations based on user input. The resulting system aids in creating applications that are potentially more intuitive, easier to develop, and better performing. An example application demonstrates the use of the application framework and the potential benefits.

76 citations

Proceedings Article•DOI•

Tradeoffs of designing floating-point division and square root on Virtex FPGAs

Xiaojun Wang, Brent Nelson

09 Apr 2003

TL;DR: This paper presents a high radix SRT division algorithm and a binary restoring square root algorithm and describes three implementations of floating-point division operations with a variable width and precision based on Virtex-2 FPGAs.

Abstract: Low latency, high throughput and small area are three major design considerations of an FPGA (field programmable gate array) design. In this paper, we present a high radix SRT division algorithm and a binary restoring square root algorithm. We describe three implementations of floating-point division operations with a variable width and precision based on Virtex-2 FPGAs. One is a low cost iterative implementation; another is a low latency array implementation; and the third is a high throughput pipelined implementation. The implementations of floating-point square root operations are presented as well. In addition to the design of modules, we also analyze the tradeoffs among the cost, latency and throughput with strategies on how to reduce the cost or improve the performance.

62 citations

Proceedings Article•DOI•

Source level debugger for the Sea Cucumber synthesizing compiler

K.S. Hemmert¹, Justin L. Tripp¹, Brad Hutchings¹, Preston Jackson¹•Institutions (1)

Brigham Young University¹

09 Apr 2003

TL;DR: The creation of a debugger for the Sea Cucumber synthesizing compiler is discussed, used to explore the issues associated with providing information about a circuit in the context of the original source code, thus making the debugging process more intuitive.

Abstract: With the growing popularity of using high-level synthesis tools to map programs written in general-purpose programming languages to FPGA (field programmable gate array) hardware, it has become necessary to provide comprehensive, intuitive debugging tools in order to verify the correctness of the synthesized hardware. The difficulty in creating these tools lies in the fact that typical synthesizing compilers provide no information about how the source code is mapped to hardware. This paper discusses the creation of a debugger for the Sea Cucumber synthesizing compiler used to explore the issues associated with providing information about a circuit in the context of the original source code, thus making the debugging process more intuitive.

56 citations

Proceedings Article•DOI•

Automatic conversion of floating point MATLAB programs into fixed point FPGA based hardware design

Prithviraj Banerjee¹, Debabrata Bagchi, Malay Haldar, Anshuman Nayak, V. Kim, R. Uribe•Institutions (1)

Northwestern University¹

09 Apr 2003

TL;DR: This paper describes how the floating point computations in MATLAB can be automatically converted to a fixed point MATLAB version of specific precision for hardware design.

Abstract: This paper describes how the floating point computations in MATLAB can be automatically converted to a fixed point MATLAB version of specific precision for hardware design. The techniques have been incorporated in the AccelFPGA behavioral synthesis tool (Banerjee et al., 2003) that reads in high-level descriptions of DSP applications written in MATLAB, and automatically generates synthesizable RTL models in VHDL or Verilog. Experimental results are reported with the AccelFPGA version 1.5 compiler on a set of five MATLAB benchmarks that are mapped onto the Xilinx Virtex II FPGAs (field programmable gate arrays).

54 citations

Proceedings Article•DOI•

A high I/O reconfigurable crossbar switch

Steven P. Young¹, P. Alfke, C. Fewer¹, S. McMillan, B. Blodget, D. Levi•Institutions (1)

Xilinx¹

09 Apr 2003

TL;DR: A crossbar switch with 928 inputs and 928 outputs is presented, which yields a 16× improvement in logic density compared with using conventional logic and uses partial configuration to modify routing resources during operation.

Abstract: A crossbar switch with 928 inputs and 928 outputs is presented. Switching elements are constructed using logic in the routing fabric. This approach yields a 16× improvement in logic density compared with using conventional logic. Normally, the routing is fixed. However, in FPGAs (field programmable gate arrays), the interconnection is defined by the state of SRAM configuration cells, which are dynamically modifiable. Therefore, the switch is implemented on an FPGA using partial configuration to modify routing resources during operation. All paths are synchronously clocked at 155.5 MHz, creating a total throughput of 144.3 Gbits/s. To maintain constant clock latency across all paths, partially configurable delay registers are used. Finally, the partial reconfiguration controller is implemented in hardware to enable fast switch updates.

48 citations
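
The total throughput quoted in the abstract above is consistent with multiplying the number of paths by the clock rate. The short check below reproduces that arithmetic; the assumption that each path carries one bit per clock cycle is ours and is not stated in the abstract.

```python
# Figures quoted in the abstract.
PATHS = 928          # crossbar outputs, each carrying one data path
CLOCK_HZ = 155.5e6   # synchronous clock applied to every path

# Assumption: each path moves one bit per clock cycle.
BITS_PER_CLOCK = 1

throughput_gbits = PATHS * CLOCK_HZ * BITS_PER_CLOCK / 1e9
print(f"{throughput_gbits:.1f} Gbits/s")  # 144.3 Gbits/s
```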

Proceedings Article•DOI•

Runtime assignment of reconfigurable hardware components for image processing pipelines

Heather Quinn¹, Laurie Smith King¹, Miriam Leeser, Waleed Meleis•Institutions (1)

Northeastern University¹

09 Apr 2003

TL;DR: Optimal and heuristic methods for fast (fixed time limit) runtime pipeline assignment are investigated, and experimental findings for pipelines of twenty or fewer components are presented, which show that in this environment, optimal runtime solutions are possible for smaller pipelines and nearly optimal heuristic solutions are possible for larger pipelines.

Abstract: The combination of hardware acceleration and flexibility makes FPGAs (field programmable gate arrays) important to image processing applications. There is also a need for efficient, flexible hardware/software codesign environments that can balance the benefits and costs of using FPGAs. Image processing applications often consist of a pipeline of components where each component applies a different processing algorithm. Components can be implemented for FPGAs or software. Such systems enable an image analyst to work with either FPGA or software implementations of image processing algorithms for a given problem. The pipeline assignment problem chooses from alternative implementations of pipeline components to yield the fastest pipeline. Our codesign system solves the pipeline assignment problem to provide the most effective implementation automatically, so the image analyst can focus solely on choosing the components that make up the pipeline. However, the pipeline assignment problem is NP-complete. An efficient, dynamic solution to the pipeline assignment problem is a desirable enabler of codesign systems which use both FPGA and software implementations. This paper is concerned with solving pipeline assignment in this context. Consequently, optimal and heuristic methods for fast (fixed time limit) runtime pipeline assignment are investigated. We present experimental findings for pipelines of twenty or fewer components, which show that in our environment, optimal runtime solutions are possible for smaller pipelines and nearly optimal heuristic solutions are possible for larger pipelines.

Proceedings Article•DOI•

Efficient application representation for HASTE: Hybrid Architectures with a Single, Transformable Executable

B. Levine¹, Herman Schmit¹•Institutions (1)

Carnegie Mellon University¹

09 Apr 2003

TL;DR: An ISA using a modified form of register addressing has been shown to have the best overall characteristics and should allow for the practical implementation of HASTE, an architecture that allows a single executable to represent an entire application.

Abstract: Hybrid architectures, which are composed of a conventional processor closely coupled with reconfigurable logic, seem to combine the advantages of both types of hardware. They present some practical difficulties, however. The interface between the processor and the reconfigurable logic is crucial to performance and is often difficult to implement well. Partitioning the application between the processor and logic is a difficult task, typically complicated by entirely different programming models, heterogeneous interfaces to external resources, and incompatible representations of applications. A separate executable must be produced and maintained for each type of hardware. An architecture called HASTE (Hybrid Architecture with a Single Transformable Executable) solves many of these difficulties. HASTE allows a single executable to represent an entire application, including portions that run on a reconfigurable fabric and portions that run on a sequential processor. This executable can execute in its entirety on the processor, but for best performance portions of the application are mapped onto the fabric at run-time. The application representation is the key to making this concept viable, and several different ones were examined. Some used a relatively conventional register instruction set architecture (ISA) while others used a new queue-based ISA. An ISA using a modified form of register addressing has been shown to have the best overall characteristics and should allow for the practical implementation of HASTE.

Proceedings Article•DOI•

A hardware Gaussian noise generator for channel code evaluation

Dong-U Lee¹, Wayne Luk¹, John D. Villasenor, Peter Y. K. Cheung•Institutions (1)

Imperial College London¹

09 Apr 2003

TL;DR: A hardware-based Gaussian noise generator is described that is used as a key component in a hardware simulation system for exploring channel code behavior at very low bit error rates (BERs) in the range of 10⁻⁹ to 10⁻¹⁰.

Abstract: Hardware simulation of channel codes offers the potential of improving code evaluation speed by orders of magnitude over workstation- or PC-based simulation. We describe a hardware-based Gaussian noise generator used as a key component in a hardware simulation system, for exploring channel code behavior at very low bit error rates (BERs) in the range of 10⁻⁹ to 10⁻¹⁰. The main novelty is the design and use of nonuniform piecewise linear approximations in computing trigonometric and logarithmic functions. The parameters of the approximation are chosen carefully to enable rapid computation of coefficients from the inputs, while still retaining extremely high fidelity to the modeled functions. The output of the noise generator accurately models a true Gaussian PDF even at very high σ values. Its properties are explored using: (a) several different statistical tests, including the chi-square test and the Kolmogorov-Smirnov test, and (b) an application for decoding of low density parity check (LDPC) codes. An implementation at 133 MHz on a Xilinx Virtex-II XC2V4000-6 FPGA produces 133 million samples per second, which is 40 times faster than a 2.13 GHz PC; another implementation on a Xilinx Spartan-IIE XC2S300E-7 FPGA at 62 MHz is capable of a 20 times speedup. The performance can be improved by exploiting parallelism: an XC2V4000-6 FPGA with three parallel instances of the noise generator at 126 MHz can run 100 times faster than a 2.13 GHz PC. We illustrate the deterioration of clock speed with the increase in the number of instances.
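
The abstract above does not name the transform behind its trigonometric and logarithmic functions. One well-known method that needs exactly those two kinds of function is the Box-Muller transform, so the sketch below assumes it purely for illustration. It uses ordinary floating-point library calls rather than the piecewise linear approximations the paper describes, and the names `box_muller` and `gaussian_samples` are hypothetical.

```python
import math
import random


def box_muller(u1: float, u2: float) -> tuple[float, float]:
    """Map uniform samples u1 in (0, 1] and u2 in [0, 1) to two normal samples."""
    radius = math.sqrt(-2.0 * math.log(u1))  # logarithmic part
    angle = 2.0 * math.pi * u2               # trigonometric part
    return radius * math.cos(angle), radius * math.sin(angle)


def gaussian_samples(count: int, seed: int = 0) -> list[float]:
    """Draw `count` standard normal samples from a seeded uniform source."""
    rng = random.Random(seed)
    samples: list[float] = []
    while len(samples) < count:
        # 1 - random() lies in (0, 1], which keeps log() away from zero.
        samples.extend(box_muller(1.0 - rng.random(), rng.random()))
    return samples[:count]


noise = gaussian_samples(10_000)
print(sum(noise) / len(noise))  # sample mean, expected to be close to 0
```

In a hardware version the `log`, `sqrt`, `cos`, and `sin` evaluations are the expensive steps, which is where function approximation would come in.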

Proceedings Article•DOI•

Improved small multiplier based multiplication, squaring and division

B.R. Lee¹, N. Burgess¹•Institutions (1)

Cardiff University¹

09 Apr 2003

TL;DR: This paper presents the design of parameterized fixed-point integer multiplication, squaring and fractional division units that are targeted at the Virtex-II family of FPGAs from Xilinx and are based on the small 18×18-bit multiplier blocks.

Abstract: This paper presents the design of parameterized fixed-point integer multiplication, squaring and fractional division units. The units are targeted at the Virtex-II family of FPGAs (field programmable gate arrays) from Xilinx and are based on the small 18×18-bit multiplier blocks. New partial product creation and summation techniques that exploit the low-level primitives achieve a 20% area and a 30% delay reduction for multiplication. A dedicated squaring component is presented that offers substantial area savings of up to 50%. The division component uses the multipliers for pre-scaling to reduce the delay and complexity of each minimally redundant radix-8 stage.

Proceedings Article•DOI•

Exploiting reconfigurable hardware for network security

Shaomeng Li¹, Jim Torresen¹, O. Soraasen¹•Institutions (1)

University of Oslo¹

09 Apr 2003

TL;DR: This study provides novel hardware architectures for an IDS that should be able to monitor networks at speeds of up to 2.68 Gbps, with the aim of higher speed and more efficient performance of network security.

Abstract: One type of network security strategy is using an intrusion detection system (IDS). We are implementing an IDS in FPGA-based (Field Programmable Gate Array) reconfigurable hardware. This is to achieve higher speed and more efficient performance of network security, as networks develop very fast with consequently more demanding constraints. This study provides novel hardware architectures for an IDS which should be able to monitor networks at speeds of up to 2.68 Gbps.

Proceedings Article•DOI•

Compiling policy descriptions into reconfigurable firewall processors

T.K. Lee¹, S. Yusuf¹, Wayne Luk¹, Morris Sloman¹, Emil Lupu¹, Naranker Dulay¹•Institutions (1)

Imperial College London¹

09 Apr 2003

TL;DR: A framework for capturing firewall requirements as high-level descriptions based on the policy specification language Ponder is described, which provides abstraction from hardware implementation while allowing performance control through constraints.

Abstract: We describe a framework for capturing firewall requirements as high-level descriptions based on the policy specification language Ponder. The framework provides abstraction from hardware implementation while allowing performance control through constraints. Our hardware compilation strategy for such descriptions involves a rule reduction step to produce a hardware firewall rule representation. Three main methods have also been developed for resource optimization: partitioning, elimination, and sharing. A case study involving five sets of filter rules indicates that it is possible to reduce hardware resources by 67-80% compared with techniques based on regular content-addressable memory, and by 24-63% compared with methods based on irregular content-addressable memory.

Proceedings Article•DOI•

Issues and approaches to coarse-grain reconfigurable architecture development

Ken Eguro¹, Scott Hauck¹•Institutions (1)

University of Washington¹

09 Apr 2003

TL;DR: Three algorithms are presented that solve the problem of balancing the hardware needs of the domain while considering performance and area requirements during the development of an encryption-specialized FPGA architecture.

Abstract: Although domain-specialized FPGAs (field programmable gate arrays) can offer significant area, speed and power improvements over conventional reconfigurable devices, there are several unique and unexplored design problems that complicate their development. One source of these problems is that the designers often opt to replace more universal, fine-grain logic elements with a specialized set of coarse-grain functional units to improve computation speed and reduce routing complexity. One issue this introduces is that it is not obvious how to simultaneously consider all applications in a domain and determine the most appropriate overall number and ratio of the different functional units. In this paper, we illustrate how this problem manifests itself during the development of an encryption-specialized FPGA architecture. We present three algorithms that solve this problem by balancing the hardware needs of the domain while considering performance and area requirements. We believe these concerns need to be addressed by future CAD tools in order to develop more sophisticated application-specialized reconfigurable devices.

Proceedings Article•DOI•

Design and implementation of a generic 2D orthogonal discrete wavelet transform on FPGA

A. Benkrid¹, Khaled Benkrid¹, Danny Crookes¹•Institutions (1)

Queen's University Belfast¹

09 Apr 2003

TL;DR: An FPGA architecture for the separable 2-D Biorthogonal Discrete Wavelet Transform (DWT) decomposition based on the Pyramid Algorithm Analysis, which handles computation along the border efficiently by using the method of symmetric extension.

Abstract: This paper gives a design framework for the implementation of the 2D (two-dimensional) orthogonal discrete wavelet transform (DWT) on FPGA (field programmable gate array). The architecture is based on the pyramid algorithm analysis. It spatially maps the multistage filter banks of the DWT onto the Xilinx Virtex-E FPGA family using on-chip buffering. The architecture takes advantage of the low rate of the high transform stages to reuse the logic. In this paper, we propose an FIR structure to handle the computation along the borders using symmetric extension, a new BlockRAM configuration for multi-port shift registers, and a mathematical approach to predict and reduce the error dynamic range due to wordlength rounding. For an M×M input image, our architecture has a period of M² clock cycles, and requires the minimum storage size. The architecture is highly scalable for different filter lengths and numbers of octaves. The implementation results for a specific 2D Daubechies-4 wavelet transform are included.

Proceedings Article•DOI•

Synthesis and estimation of memory interfaces for FPGA-based reconfigurable computing engines

Joonseok Park¹, Pedro C. Diniz¹•Institutions (1)

University of Southern California¹

09 Apr 2003

TL;DR: This paper describes a set of synthesizable and programmable memory interfaces a compiler can use to automatically generate the appropriate designs for mapping computations to FPGA-based architectures and reveals that it is possible to accurately model the area and timing requirements using a linear estimation function.

Abstract: As the densities of current FPGAs continue to grow, it is now possible to generate System-On-a-Chip (SoC) designs where multiple computing cores are connected to various memory modules with customized topology and application-specific memory access patterns. For example, Xilinx has recently introduced devices to which a pared-down version of a PowerPC core can be mapped and connected to a set of internal memories. In this paper we address the problem of synthesizing and estimating the area and speed of memory interfacing for Static RAM (SRAM) and Synchronous Dynamic RAM (SDRAM) with various latency parameters and access modes. We describe a set of synthesizable and programmable memory interfaces a compiler can use to automatically generate the appropriate designs for mapping computations to FPGA-based architectures. Our preliminary results reveal that it is possible to accurately model the area and timing requirements using a linear estimation function. We have successfully integrated the proposed memory interface designs with simple image processing kernels generated using commercially available behavioral synthesis tools.

Proceedings Article•DOI•

Adaptive fault recovery for networked reconfigurable systems

Weifeng Xu¹, R. Ramanarayanan¹, Russell Tessier¹•Institutions (1)

University of Massachusetts Amherst¹

09 Apr 2003

TL;DR: This paper introduces a fully automated fault recovery system for networked systems that contain FPGAs (field programmable gate arrays); the system requires no manual intervention.

Abstract: The device-level size and complexity of reconfigurable architectures make fault tolerance an important concern in system design. In this paper, we introduce a fully automated fault recovery system for networked systems that contain FPGAs (field programmable gate arrays). If a fault is detected that cannot be addressed locally, fault information is transferred to a reconfiguration server. Following design recompilation to avoid the fault, a new FPGA configuration is returned to the remote system and computation is reinitiated. To illustrate the benefit of this approach, we have implemented a complete fault recovery system, which requires no manual intervention. An important part of the system is a timing-driven incremental router for Xilinx Virtex devices. This router is directly interfaced to Xilinx JBits and uses no CAD tools from the standard Xilinx Alliance tool flow. Our completed system has been applied to three benchmark designs and exhibits complete fault recovery in up to 12x less time than the standard incremental Xilinx PAR flow.

Proceedings Article•DOI•

Data search and reorganization using FPGAs: application to spatial pointer-based data structures

Pedro C. Diniz¹, Joonseok Park¹•Institutions (1)

University of Southern California¹

09 Apr 2003

TL;DR: Reconfigurable logic, when combined with the data reorganization, can lead to dramatic performance improvements of up to 20x over traditional computer architectures for pointer-based computations, traditionally not viewed as a good match for reconfigurable technologies.

Abstract: FPGAs (field programmable gate arrays) have appealing features such as customizable internal and external bandwidth and the ability to exploit vast amounts of fine-grain parallelism. In this paper, we explore the applicability of these features in using FPGAs as smart memory engines for search and reorganization computations over spatial pointer-based data structures. The experimental results in this paper suggest that reconfigurable logic, when combined with data reorganization, can lead to dramatic performance improvements of up to 20x over traditional computer architectures for pointer-based computations, traditionally not viewed as a good match for reconfigurable technologies.

Proceedings Article•DOI•

A novel FIR filter architecture for efficient signal boundary handling on Xilinx VIRTEX FPGAs

A. Benkrid¹, Khaled Benkrid¹, Danny Crookes¹•Institutions (1)

Queen's University Belfast¹

09 Apr 2003

TL;DR: An architecture for FIR filters on Xilinx Virtex FPGAs (field programmable gate arrays) is presented that is particularly useful for handling the problem of signal boundary filtering, which occurs in finite-length signal processing.

Abstract: FIR (Finite Impulse Response) filters are often used in digital signal processing. This paper presents an architecture for FIR filters on Xilinx Virtex FPGAs (field programmable gate arrays). The architecture is particularly useful for handling the problem of signal boundary filtering, which occurs in finite-length signal processing (e.g. image processing). Based on bit-parallel arithmetic, our architecture is fully scalable and parameterized. It cleverly exploits the Shift Register Logic (SRL16) component of the Virtex family. The implementation leads to considerable area savings compared to the conventional implementation (based on a hard router) with no speed penalty. A case study based on the implementation of the standard low filter of the Daubechies-8 wavelet on Xilinx Virtex-E FPGAs is presented.
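
To make the signal boundary problem concrete, the sketch below filters a finite signal in software and mirrors the samples at each edge so that the output has the same length as the input. It assumes whole-sample symmetric extension (the edge sample itself is not repeated) and an odd number of taps; the abstract above does not specify the extension scheme, and none of the SRL16-based hardware structure is modeled here. The names `reflect` and `fir_symmetric` are hypothetical.

```python
def reflect(index: int, length: int) -> int:
    """Mirror an out-of-range index back into [0, length) about the edge samples."""
    if index < 0:
        return -index
    if index >= length:
        return 2 * (length - 1) - index
    return index


def fir_symmetric(signal: list[float], taps: list[float]) -> list[float]:
    """Centered FIR filter with mirrored borders; output length equals input length."""
    half = len(taps) // 2
    if len(taps) % 2 == 0 or half >= len(signal):
        raise ValueError("need an odd number of taps and len(taps) // 2 < len(signal)")
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, tap in enumerate(taps):
            # Indices that fall outside the signal are mirrored back inside it.
            acc += tap * signal[reflect(i + half - k, n)]
        out.append(acc)
    return out


print(fir_symmetric([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]))  # [5.0, 6.0, 9.0, 10.0]
```

Without the mirroring, the first and last outputs would need either zero padding, which distorts the edges, or a shorter output.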

Proceedings Article•DOI•

Implementation of three-dimensional FPGA-based FDTD solvers: an architectural overview

James P. Durbano, Fernando E. Ortiz¹, John R. Humphrey¹, Dennis W. Prather¹, Mark S. Mirotznik²•Institutions (2)

University of Delaware¹, The Catholic University of America²

09 Apr 2003

TL;DR: This extended abstract presents an architecture that overcomes the previous limitations of the Finite-Difference Time-Domain method, and begins with a high-level description of the computational flow of this architecture.

Abstract: Maxwell's equations, which govern electromagnetic propagation, are a system of coupled, differential equations. As such, they can be represented in difference form, thus allowing their numerical solution. By implementing both the temporal and spatial derivatives of Maxwell's equations in difference form, we arrive at one of the most common computational electromagnetic algorithms, the Finite-Difference Time-Domain (FDTD) method (Yee, 1966). In this technique, the region of interest is sampled to generate a grid of points, hereafter referred to as a mesh. The discretized form of Maxwell's equations is then solved at each point in the mesh to determine the associated electromagnetic fields. In this extended abstract, we present an architecture that overcomes the previous limitations. We begin with a high-level description of the computational flow of this architecture.
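
The mesh-based update described in this abstract can be sketched in a few lines. The example below is a one-dimensional simplification in normalized units, not the three-dimensional hardware architecture of the paper. The function name `fdtd_1d`, the Gaussian source, the perfectly conducting end walls, and the default parameters are all assumptions made for illustration.

```python
import math


def fdtd_1d(num_cells=200, num_steps=50, source_cell=100, courant=1.0):
    """Leapfrog update of E and H on a staggered 1-D mesh (normalized units).

    `courant` is c*dt/dx; values at or below 1.0 keep the scheme stable.
    The two end cells of `ez` are never updated, so they act as perfectly
    conducting walls (an assumed boundary condition).
    """
    ez = [0.0] * num_cells        # electric field at integer mesh points
    hy = [0.0] * (num_cells - 1)  # magnetic field at half-integer points
    for step in range(num_steps):
        # Spatial difference of E advances H by one time step.
        for i in range(num_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Spatial difference of H advances E by one time step.
        for i in range(1, num_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # Additive Gaussian pulse injected at one mesh point.
        ez[source_cell] += math.exp(-((step - 30.0) / 10.0) ** 2)
    return ez
```

Each field value depends only on its nearest neighbours from the previous step, which is the locality that makes the method attractive for a pipelined or parallel implementation.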

Proceedings Article•DOI•

Accelerating bit error rate testing using a system level design tool

Vinay Singh¹, A. Root¹, E. Hemphill¹, Nabeel Shirazi¹, James Hwang¹•Institutions (1)

Xilinx¹

09 Apr 2003

TL;DR: This work demonstrates how system level design tools can be used to build a bit error rate (BER) tester, and how hardware co-simulation of the entire system provided a 10,000x speed-up over a pure software simulation.

Abstract: System level design tools for creating DSP designs reduce the amount of time needed to create a DSP design, in part by eliminating the need for verification between system model and hardware implementation. The design is developed within a high-level modeling environment. This description is compiled into a hardware description language, and synthesized by traditional FPGA (field programmable gate array) tools. The use of system level tools can eliminate the need for extensive hardware knowledge. We demonstrate how such tools can be used to build a bit error rate (BER) tester, and how hardware co-simulation of the entire system provided a 10,000x speed-up over a pure software simulation.
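
For orientation, a pure software BER measurement of the kind such a tester accelerates might look like the sketch below. The abstract does not specify the modulation or channel, so BPSK over an additive white Gaussian noise channel is assumed here, and the function names `simulate_ber` and `theoretical_ber` are hypothetical.

```python
import math
import random


def simulate_ber(num_bits, ebn0_db, seed=1):
    """Send random bits as +/-1 symbols through Gaussian noise and count errors."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    sigma = math.sqrt(1.0 / (2.0 * ebn0))  # noise std for unit-energy symbols
    errors = 0
    for _ in range(num_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit else -1.0
        received = symbol + rng.gauss(0.0, sigma)
        decided = 1 if received > 0.0 else 0
        errors += decided != bit
    return errors / num_bits


def theoretical_ber(ebn0_db):
    """Closed-form BPSK error rate over AWGN, used as a cross-check."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return 0.5 * math.erfc(math.sqrt(ebn0))
```

Low error rates require very many bits before enough errors accumulate for a reliable estimate, which is why a software loop like this becomes slow and hardware co-simulation is attractive.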

Proceedings Article•DOI•

FPGA-based SIMD processor

S.Y.C. Li¹, G.C.K. Cheuk¹, Kin-Hong Lee¹, Philip H. W. Leong¹•Institutions (1)

The Chinese University of Hong Kong¹

09 Apr 2003

TL;DR: A massively parallel single instruction multiple data stream (SIMD) processor designed specifically for cryptographic key search applications is presented and performance is compared with a previously reported hardwired design on an RC4 key search application.

Abstract: A massively parallel single instruction multiple data stream (SIMD) processor designed specifically for cryptographic key search applications is presented. This design exploits fine-grain parallelism and the high memory bandwidth available in an FPGA (field programmable gate array) by integrating 95 simple processors and memory on a single FPGA chip. Performance is compared with a previously reported hardwired design on an RC4 key search application.
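
To make the workload concrete, the sketch below shows a sequential RC4 key search in Python. It is not the paper's SIMD design. The known-keystream attack model, the 12-bit search space, the 2-byte key encoding, and the names `rc4_keystream` and `key_search` are assumptions chosen to keep the example small.

```python
def rc4_keystream(key, length):
    """Generate `length` RC4 keystream bytes for `key` (a bytes object)."""
    # Key scheduling: permute the 256-entry state under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    # Output generation: keep swapping and emit one byte per step.
    out = bytearray()
    i = j = 0
    for _ in range(length):
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def key_search(known_keystream, key_bits=12, key_bytes=2):
    """Try every candidate key and return the first that reproduces the keystream."""
    for candidate in range(1 << key_bits):
        key = candidate.to_bytes(key_bytes, "big")
        if rc4_keystream(key, len(known_keystream)) == known_keystream:
            return key
    return None
```

Every candidate runs the same instruction sequence on its own 256-byte state and no candidate depends on another, so the loop in `key_search` is the part that a SIMD array of simple processors with local memory can split across processing elements.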

Proceedings Article•DOI•

Performance analysis of fixed, reconfigurable, and custom architectures for the SCAN image and video encryption algorithm

Apostolos Dollas¹, C. Kachris¹, Nikolaos G. Bourbakis²•Institutions (2)

University of Crete¹, Wright State University²

09 Apr 2003

TL;DR: This paper briefly presents a block cipher encryption architecture and a reconfigurable logic-based hardware design for the SCAN encryption algorithm and detailed performance results are presented for still images as well as video.

Abstract: This paper briefly presents a block cipher encryption architecture and a reconfigurable logic-based hardware design for the SCAN encryption algorithm. Detailed performance results are presented for still images as well as video, and the reconfigurable architecture is compared to software-only implementations of the same algorithm as well as a preliminary ASIC design.

Proceedings Article•DOI•

Asynchronous PipeRench: architecture and performance evaluations

H. Kagotani¹, Herman Schmit²•Institutions (2)

Okayama University¹, Carnegie Mellon University²

09 Apr 2003

TL;DR: This paper investigates the potential benefits and costs of implementing this architecture using an asynchronous methodology, and focuses on the benefit due to decreased timing pessimism in an asynchronous implementation.

Abstract: PipeRench is a configurable architecture that has the unique ability to virtualize an application using dynamic reconfiguration. This paper investigates the potential benefits and costs of implementing this architecture using an asynchronous methodology. Since clock distribution and gating are relatively easy in the synchronous PipeRench, we focus on the benefit due to decreased timing pessimism in an asynchronous implementation. Two architectures for fully asynchronous implementation are considered. PE-based asynchronous implementation yields approximately 80% improvement in performance per stripe. This implementation, however, requires significant increases in configuration storage and wire count. A few particular features of the architecture, such as the crossbar interconnect structure within the stripe, are primarily responsible for this growth in configuration bits and wires. These features, however, are the primary aspects of the PipeRench architecture that make it a good compilation target.

Book Chapter•DOI•

Performance and area modeling of complete FPGA designs in the presence of loop transformations

K.R.S. Shayee¹, Joonseok Park¹, Pedro C. Diniz¹•Institutions (1)

University of Southern California¹

09 Apr 2003

TL;DR: In this article, the authors focus on the development of fast, yet accurate performance and area modeling of complete FPGA designs that combine analytical, empirical and behavioral estimation techniques, and model the application of a set of important program transformations for image processing algorithms, namely loop unrolling, tiling, loop interchanging, loop fission and array privatization.

Abstract: Digital image processing algorithms are a good match for direct implementation on FPGAs as current FPGA architectures can naturally match the fine-grain parallelism in these applications. Typically, these algorithms are structured as a sequence of operations, expressed in high-level programming languages as tight loop nests. The loops usually define a shifting-window region over which the algorithm applies a simple localized operator (e.g., a differential gradient, or a min/max). In this research we focus on the development of fast, yet accurate performance and area modeling of complete FPGA designs that combine analytical, empirical and behavioral estimation techniques. We model the application of a set of important program transformations for image processing algorithms, namely loop unrolling, tiling, loop interchanging, loop fission and array privatization, and explore pipelined and non-pipelined execution modes. We take into consideration the impact of various transformations, in the presence of limited I/O resources like address generators and external memory data channels, on the performance of a complete design implemented in an FPGA-based architecture.
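
The kind of loop nest and transformation discussed in this abstract can be shown with a small software example. The sketch below applies a max operator over a shifting 3x3 window, first as a plain loop nest and then with the two outer loops tiled. It illustrates the transformation only, not the paper's performance or area models, and the function names and tile size are hypothetical.

```python
def window_max(image, k=3):
    """Baseline loop nest: maximum over each k-by-k window of a 2-D list."""
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - k + 1) for _ in range(rows - k + 1)]
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            best = image[r][c]
            for dr in range(k):
                for dc in range(k):
                    if image[r + dr][c + dc] > best:
                        best = image[r + dr][c + dc]
            out[r][c] = best
    return out


def window_max_tiled(image, k=3, tile=4):
    """Same computation with the two outer loops tiled into tile-by-tile blocks."""
    rows, cols = len(image), len(image[0])
    out_rows, out_cols = rows - k + 1, cols - k + 1
    out = [[0] * out_cols for _ in range(out_rows)]
    for r0 in range(0, out_rows, tile):        # loop over tile origins
        for c0 in range(0, out_cols, tile):
            for r in range(r0, min(r0 + tile, out_rows)):  # loop within a tile
                for c in range(c0, min(c0 + tile, out_cols)):
                    out[r][c] = max(
                        image[r + dr][c + dc]
                        for dr in range(k)
                        for dc in range(k)
                    )
    return out
```

Tiling changes only the order in which outputs are produced, so both functions return the same result. In a hardware setting the tile size would determine how much of the image must be buffered on chip at once, which is one reason such transformations affect both area and performance.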

Proceedings Article•DOI•

Gamma-ray pulsar detection using reconfigurable computing hardware

Janette R. Frigo¹, D. Palmer¹, Maya Gokhale¹, M. Popkin-Paine¹•Institutions (1)

Los Alamos National Laboratory¹

09 Apr 2003

TL;DR: This astrophysics application poses a "good example" of the use of a high-level reconfigurable computing tool such as sc2 to accelerate an algorithm because it uses real satellite data, the algorithm can be parallelized, and was originally validated using a high-level scientific language, IDL.

Abstract: This paper presents a method to detect gamma-ray pulsars using a fast folding algorithm (Staelin, 1969) mapped onto reconfigurable hardware. In contrast, existing techniques require gigapoint complex FFTs. The algorithm has been written in Streams-C and compiled with the sc2 compiler to the target Annapolis Micro Systems (AMS) Firebird board (Xilinx Virtex E processor). To accelerate detection of new gamma-ray pulsars, the sc2 compiler generates a hardware implementation of the algorithm for finding periodicities in data sets. The data to be analyzed comes from a high-energy gamma-ray telescope onboard a spacecraft. This astrophysics application poses a "good example" of the use of a high-level reconfigurable computing tool such as sc2 to accelerate an algorithm because it uses real satellite data, the algorithm can be parallelized, and was originally validated using a high-level scientific language, IDL. By recasting the algorithm into Streams-C, the scientific software developer can create a hardware implementation on a reconfigurable computing platform. We describe the fast folding algorithm, the Streams-C implementation, and discuss techniques to optimize performance within the Streams-C framework. The compiler-generated hardware delivers approximately 3X to 6X speedup over a comparable 800MHz general-purpose processor doing the software-only algorithm.
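
The idea of finding periodicities by folding can be sketched in Python. The example below uses direct (brute-force) folding of a binned time series at each trial period. It is not the staged fast folding algorithm of the paper, which shares partial sums across trial periods, and it is not the Streams-C implementation. The reduced chi-square score and the function names are assumptions for illustration.

```python
def fold(series, period):
    """Sum samples that fall at the same phase when the series wraps every `period` bins."""
    profile = [0.0] * period
    hits = [0] * period  # how many samples landed in each phase bin
    for t, value in enumerate(series):
        profile[t % period] += value
        hits[t % period] += 1
    return profile, hits


def reduced_chi_square(series, period):
    """Score how far the folded profile departs from a flat one."""
    profile, hits = fold(series, period)
    mean = sum(series) / len(series)
    if mean == 0:
        return 0.0  # an all-zero series has no structure to detect
    chi2 = sum(
        (p - mean * n) ** 2 / (mean * n)
        for p, n in zip(profile, hits)
        if n
    )
    return chi2 / (period - 1)


def best_period(series, trial_periods):
    """Return the trial period (2 or more bins) with the least flat profile."""
    return max(trial_periods, key=lambda p: reduced_chi_square(series, p))
```

Folding at the true period stacks every pulse into the same phase bin, while a wrong period smears the pulses across bins and leaves the profile nearly flat. Each trial period is independent of the others, which is the parallelism the abstract refers to.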

Proceedings Article•

Design and Implementation of a Generic 2-D Orthogonal Discrete Wavelet Transform on FPGA

A. Benkrid, Khaled Benkrid, Danny Crookes

09 Apr 2003

TL;DR: An FIR structure to handle the computation along the borders using symmetry extension, a new BlockRam configuration for a multi-port shift register, and a mathematical approach to predict and reduce the error dynamic range due to wordlength rounding are proposed.

Abstract: This paper gives a design framework for the implementation of the 2-D Orthogonal Discrete Wavelet Transform (DWT) on FPGA. The architecture is based on the Pyramid Algorithm Analysis. It spatially maps the multistage filter banks of the DWT on the Xilinx Virtex-E FPGA family using on-chip buffering. The architecture takes advantage of the low rate of the high transform stages to reuse the logic. In this paper, we propose a novel FIR structure to handle the computation along the borders using symmetry extension, a new BlockRam configuration for a multi-port shift register, and a new mathematical approach to predict and reduce the error dynamic range due to wordlength rounding. For an MxM image size input, our architecture has a period of M² clock cycles, and requires the minimum storage size. The architecture is highly scalable for different filter lengths and number of octaves. The implementation results for a specific 2-D Daubechies-4 Wavelet transform are included.
