
Showing papers presented at "Field-Programmable Custom Computing Machines in 2003"


Proceedings Article•DOI•
09 Apr 2003
TL;DR: A module has been implemented in Field Programmable Gate Array (FPGA) hardware that scans the content of Internet packets at Gigabits/second rates and automatically generates the Finite State Machines (FSMs) to search for regular expressions.
Abstract: A module has been implemented in Field Programmable Gate Array (FPGA) hardware that scans the content of Internet packets at Gigabits/second rates. All of the packet processing operations are performed using reconfigurable hardware within a single Xilinx Virtex XCV2000E FPGA. A set of layered protocol wrappers is used to parse the headers and payloads of packets for Internet protocol data. A content matching server automatically generates the Finite State Machines (FSMs) to search for regular expressions. The complete system is operated on the Field-programmable Port Extender (FPX) platform.

286 citations


Proceedings Article•DOI•
09 Apr 2003
TL;DR: Two FPGA-based implementations of random number generators intended for embedded cryptographic applications are presented, one a true random number generator which employs oscillator phase noise, and the second a bit serial implementation of a Blum Blum Shub pseudorandom number generator.
Abstract: Two FPGA-based (field programmable gate array) implementations of random number generators intended for embedded cryptographic applications are presented. The first is a true random number generator (TRNG) which employs oscillator phase noise, and the second is a bit serial implementation of a Blum Blum Shub (BBS) pseudorandom number generator (PRNG). Both designs are extremely compact and can be implemented on any FPGA or PLD device. They were designed specifically for use as FPGA-based cryptographic hardware cores. The TRNG and PRNG were tested using the NIST and Diehard random number test suites.

131 citations
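
For readers unfamiliar with Blum Blum Shub, the recurrence can be sketched in software as below. This is a minimal illustration of the textbook generator (square modulo a product of two Blum primes, emit the least significant bit), not the bit serial hardware design from the paper. The function name `bbs_bits` and the toy primes are assumptions chosen for readability; real use needs large secret primes.

```python
def bbs_bits(p, q, seed, n):
    """Return n bits from a Blum Blum Shub generator with modulus p * q.

    p and q are assumed to be primes congruent to 3 mod 4, and seed is
    assumed to be coprime to p * q. Toy sizes only; not secure as written.
    """
    if p % 4 != 3 or q % 4 != 3:
        raise ValueError("p and q must both be congruent to 3 mod 4")
    m = p * q
    x = (seed * seed) % m          # starting state x0 = seed^2 mod m
    bits = []
    for _ in range(n):
        x = (x * x) % m            # x_{i+1} = x_i^2 mod m
        bits.append(x & 1)         # output the least significant bit
    return bits


if __name__ == "__main__":
    print(bbs_bits(11, 23, 3, 6))
```

Each output bit costs one modular squaring, which suggests why a bit serial datapath can be a natural fit for a compact core.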


Proceedings Article•DOI•
09 Apr 2003
TL;DR: The floating point unit generation approach outlined in this paper allows for the creation of a vast collection of floating point units with differing throughput, latency, and area characteristics.
Abstract: Most commercial and academic floating point libraries for FPGAs (field programmable gate arrays) provide only a small fraction of all possible floating point units. In contrast, the floating point unit generation approach outlined in this paper allows for the creation of a vast collection of floating point units with differing throughput, latency, and area characteristics. Given performance requirements, our generation tool automatically chooses the proper implementation algorithm and architecture to create a compliant floating point unit. Our approach is fully integrated into standard C++ using ASC, a stream compiler for FPGAs, and the PAM-Blox II module generation environment. The floating point units created by our approach exhibit a factor of two latency improvement versus commercial FPGA floating point units, while consuming only half of the FPGA logic area.

112 citations


Proceedings Article•DOI•
09 Apr 2003
TL;DR: An SEU simulator based on the SLAAC-1V computing board has been developed and is being used to characterize the reliability of SEU mitigation techniques for FPGAs.
Abstract: FPGAs are an appealing solution for space-based remote sensing applications. However, in a low-Earth orbit, FPGAs (field programmable gate arrays) are susceptible to Single-Event Upsets (SEUs). In an effort to understand the effects of SEUs, an SEU simulator based on the SLAAC-1V computing board has been developed. This simulator artificially upsets the configuration memory of an FPGA and measures its impact on FPGA designs. The accuracy of this simulation environment has been verified using ground-based radiation testing. This simulation tool is being used to characterize the reliability of SEU mitigation techniques for FPGAs.

108 citations


Proceedings Article•DOI•
09 Apr 2003
TL;DR: Application of the proposed procedure to adaptive filters realized in a Xilinx Virtex FPGA (field programmable gate array) has resulted in area reductions of up to 80%, power reductions of up to 98%, and speed-up of up to 36% over common alternative design strategies.
Abstract: This paper introduces a design tool and its associated procedures for determining the sensitivity of outputs in a digital signal processing design to small errors introduced by rounding or truncation of internal variables. The proposed approach can be applied to both linear and nonlinear designs. By analyzing the resulting sensitivity values, the proposed procedure is able to determine an appropriate distinct word-length for each internal variable. Also in this paper, the power optimizing capabilities of word-length optimization are studied for the first time. Application of the proposed procedure to adaptive filters realized in a Xilinx Virtex FPGA (field programmable gate array) has resulted in area reductions of up to 80% combined with power reductions of up to 98% and speed-up of up to 36% over common alternative design strategies.

90 citations


Proceedings Article•DOI•
09 Apr 2003
TL;DR: An application framework is discussed for developing CCM-based applications beyond just the hardware configuration that allows dynamic circuit configurations that include data folding optimizations based on user input and the resulting system aids in creating applications that are potentially more intuitive, easier to develop, and better performing.
Abstract: FPGA-based (field programmable gate array) configurable computing machines (CCMs) offer powerful and flexible general-purpose computing platforms. However, development for FPGA-based designs using modern CAD (computer aided design) tools is geared mainly toward an ASIC-like process. This is inadequate for the needs of CCM application development. This paper discusses an application framework for developing CCM-based applications beyond just the hardware configuration. This framework leverages the advantages of CCMs (availability, programmability, visibility, and controllability) to help create CCM-based applications throughout the entire development process (i.e. design, debug, and deploy). The framework itself is deployed with the final application, thus permitting dynamic circuit configurations that include data folding optimizations based on user input. The resulting system aids in creating applications that are potentially more intuitive, easier to develop, and better performing. An example application demonstrates the use of the application framework and the potential benefits.

76 citations


Proceedings Article•DOI•
09 Apr 2003
TL;DR: This paper presents a high radix SRT division algorithm and a binary restoring square root algorithm and describes three implementations of floating-point division operations with a variable width and precision based on Virtex-2 FPGAs.
Abstract: Low latency, high throughput and small area are three major design considerations of an FPGA (field programmable gate array) design. In this paper, we present a high radix SRT division algorithm and a binary restoring square root algorithm. We describe three implementations of floating-point division operations with a variable width and precision based on Virtex-2 FPGAs. One is a low cost iterative implementation; another is a low latency array implementation; and the third is a high throughput pipelined implementation. The implementations of floating-point square root operations are presented as well. In addition to the design of modules, we also analyze the tradeoffs among the cost, latency and throughput with strategies on how to reduce the cost or improve the performance.

62 citations
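
The binary restoring square root named in the abstract can be illustrated on plain integers. The sketch below is the textbook digit-by-digit method, producing one root bit per iteration. It is not the paper's floating-point datapath, and the name `restoring_isqrt` and the `bits` parameter are assumptions made for this example.

```python
def restoring_isqrt(n, bits=32):
    """Integer square root by the binary restoring method.

    Returns (root, remainder) with root * root + remainder == n.
    bits is the (even) width of the radicand.
    """
    if bits % 2 != 0 or not (0 <= n < (1 << bits)):
        raise ValueError("n must fit in an even number of bits")
    root = 0
    rem = 0
    for i in range(bits // 2 - 1, -1, -1):
        # Bring down the next two radicand bits into the partial remainder.
        rem = (rem << 2) | ((n >> (2 * i)) & 0b11)
        trial = (root << 2) | 1    # candidate subtrahend: 4 * root + 1
        root <<= 1
        if rem >= trial:
            rem -= trial           # subtraction succeeds: new root bit is 1
            root |= 1
        # Otherwise the remainder is left as it was ("restored"): bit is 0.
    return root, rem


if __name__ == "__main__":
    print(restoring_isqrt(10, bits=4))
```

In hardware the comparison is usually done by performing the subtraction and discarding a negative result; the `if` above is the software equivalent. One iteration per result bit maps directly onto the iterative, array, and pipelined organizations the abstract lists.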


Proceedings Article•DOI•
09 Apr 2003
TL;DR: The creation of a debugger for the Sea Cucumber synthesizing compiler is discussed, used to explore the issues associated with providing information about a circuit in the context of the original source code, thus making the debugging process more intuitive.
Abstract: With the growing popularity of using high-level synthesis tools to map programs written in general-purpose programming languages to FPGA (field programmable gate array) hardware, it has become necessary to provide comprehensive, intuitive debugging tools in order to verify the correctness of the synthesized hardware. The difficulty in creating these tools lies in the fact that typical synthesizing compilers provide no information about how the source code is mapped to hardware. This paper discusses the creation of a debugger for the Sea Cucumber synthesizing compiler used to explore the issues associated with providing information about a circuit in the context of the original source code, thus making the debugging process more intuitive.

56 citations


Proceedings Article•DOI•
09 Apr 2003
TL;DR: This paper describes how the floating point computations in MATLAB can be automatically converted to a fixed point MATLAB version of specific precision for hardware design.
Abstract: This paper describes how the floating point computations in MATLAB can be automatically converted to a fixed point MATLAB version of specific precision for hardware design. The techniques have been incorporated in the AccelFPGA behavioral synthesis tool (Banerjee et al., 2003) that reads in high-level descriptions of DSP applications written in MATLAB, and automatically generates synthesizable RTL models in VHDL or Verilog. Experimental results are reported with the AccelFPGA version 1.5 compiler on a set of five MATLAB benchmarks that are mapped onto the Xilinx Virtex II FPGAs (field programmable gate arrays).

54 citations


Proceedings Article•DOI•
Steven P. Young, P. Alfke, C. Fewer, S. McMillan, B. Blodget, D. Levi •
09 Apr 2003
TL;DR: A crossbar switch with 928 inputs and 928 outputs is presented, which yields a 16× improvement in logic density compared with using conventional logic and uses partial configuration to modify routing resources during operation.
Abstract: A crossbar switch with 928 inputs and 928 outputs is presented. Switching elements are constructed using logic in the routing fabric. This approach yields a 16× improvement in logic density compared with using conventional logic. Normally, the routing is fixed. However, in FPGAs (field programmable gate arrays), the interconnection is defined by the state of SRAM configuration cells, which are dynamically modifiable. Therefore, the switch is implemented on an FPGA using partial configuration to modify routing resources during operation. All paths are synchronously clocked at 155.5 MHz, creating a total throughput of 144.3 Gbits/s. To maintain constant clock latency across all paths, partially configurable delay registers are used. Finally, the partial reconfiguration controller is implemented in hardware to enable fast switch updates.

48 citations
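
The quoted total throughput can be checked with simple arithmetic. The sketch below assumes each of the 928 paths carries one bit per clock cycle, which the abstract does not state explicitly but which is consistent with its figures. The function name is hypothetical.

```python
def aggregate_throughput_gbps(ports, clock_mhz, bits_per_clock=1):
    """Total throughput in Gbit/s when every port moves bits_per_clock bits per cycle."""
    # ports * MHz gives Mbit/s; divide by 1000 to convert to Gbit/s.
    return ports * clock_mhz * bits_per_clock / 1000.0


if __name__ == "__main__":
    # 928 paths at 155.5 MHz, one bit per path per cycle (assumption).
    print(f"{aggregate_throughput_gbps(928, 155.5):.1f} Gbit/s")
```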


Proceedings Article•DOI•
09 Apr 2003
TL;DR: Optimal and heuristic methods for fast (fixed time limit) runtime pipeline assignment are investigated, and experimental findings for pipelines of twenty or fewer components are presented, which show that in this environment, optimal runtime solutions are possible for smaller pipelines and nearly optimal heuristic solutions are possible for larger pipelines.
Abstract: The combination of hardware acceleration and flexibility makes FPGAs (field programmable gate arrays) important to image processing applications. There is also a need for efficient, flexible hardware/software codesign environments that can balance the benefits and costs of using FPGAs. Image processing applications often consist of a pipeline of components where each component applies a different processing algorithm. Components can be implemented for FPGAs or software. Such systems enable an image analyst to work with either FPGA or software implementations of image processing algorithms for a given problem. The pipeline assignment problem chooses from alternative implementations of pipeline components to yield the fastest pipeline. Our codesign system solves the pipeline assignment problem to provide the most effective implementation automatically, so the image analyst can focus solely on choosing the components that make up the pipeline. However, the pipeline assignment problem is NP-complete. An efficient, dynamic solution to the pipeline assignment problem is a desirable enabler of codesign systems which use both FPGA and software implementations. This paper is concerned with solving pipeline assignment in this context. Consequently, we investigate optimal and heuristic methods for fast (fixed time limit) runtime pipeline assignment. We present experimental findings for pipelines of twenty or fewer components, which show that in our environment, optimal runtime solutions are possible for smaller pipelines and nearly optimal heuristic solutions are possible for larger pipelines.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: An ISA using a modified form of register addressing has been shown to have the best overall characteristics and should allow for the practical implementation of HASTE, an architecture that allows a single executable to represent an entire application.
Abstract: Hybrid architectures, which are composed of a conventional processor closely coupled with reconfigurable logic, seem to combine the advantages of both types of hardware. They present some practical difficulties, however. The interface between the processor and the reconfigurable logic is crucial to performance and is often difficult to implement well. Partitioning the application between the processor and logic is a difficult task, typically complicated by entirely different programming models, heterogeneous interfaces to external resources, and incompatible representations of applications. A separate executable must be produced and maintained for each type of hardware. An architecture called HASTE (Hybrid Architecture with a Single Transformable Executable) solves many of these difficulties. HASTE allows a single executable to represent an entire application, including portions that run on a reconfigurable fabric and portions that run on a sequential processor. This executable can execute in its entirety on the processor, but for best performance portions of the application are mapped onto the fabric at run-time. The application representation is the key to making this concept viable, and several different ones were examined. Some used a relatively conventional register instruction set architecture (ISA) while others used a new queue-based ISA. An ISA using a modified form of register addressing has been shown to have the best overall characteristics and should allow for the practical implementation of HASTE.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: A hardware-based Gaussian noise generator used as a key component in a hardware simulation system, for exploring channel code behavior at very low bit error rates (BERs) in the range of 10^-9 to 10^-10.
Abstract: Hardware simulation of channel codes offers the potential of improving code evaluation speed by orders of magnitude over workstation or PC-based simulation. We describe a hardware-based Gaussian noise generator used as a key component in a hardware simulation system, for exploring channel code behavior at very low bit error rates (BERs) in the range of 10^-9 to 10^-10. The main novelty is the design and use of nonuniform piecewise linear approximations in computing trigonometric and logarithmic functions. The parameters of the approximation are chosen carefully to enable rapid computation of coefficients from the inputs, while still retaining extremely high fidelity to the modeled functions. The output of the noise generator accurately models a true Gaussian PDF even at very high σ values. Its properties are explored using: (a) several different statistical tests, including the chi-square test and the Kolmogorov-Smirnov test, and (b) an application for decoding of low density parity check (LDPC) codes. An implementation at 133 MHz on a Xilinx Virtex-II XC2V4000-6 FPGA produces 133 million samples per second, which is 40 times faster than a 2.13 GHz PC; another implementation on a Xilinx Spartan-IIE XC2S300E-7 FPGA at 62 MHz is capable of a 20 times speedup. The performance can be improved by exploiting parallelism: an XC2V4000-6 FPGA with three parallel instances of the noise generator at 126 MHz can run 100 times faster than a 2.13 GHz PC. We illustrate the deterioration of clock speed with the increase in the number of instances.
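
The abstract mentions trigonometric and logarithmic functions but does not name the transform that uses them. One well-known method that needs exactly those functions is the Box-Muller transform, sketched below as an assumption about the kind of computation involved. The sketch uses exact library functions in floating point, not the piecewise linear approximations described in the paper, and all names are hypothetical.

```python
import math
import random


def box_muller(u1, u2):
    """Map two uniform samples (u1 in (0, 1], u2 in [0, 1)) to two N(0, 1) samples."""
    radius = math.sqrt(-2.0 * math.log(u1))   # logarithm and square root stage
    angle = 2.0 * math.pi * u2                # trigonometric stage
    return radius * math.cos(angle), radius * math.sin(angle)


def gaussian_samples(n, seed=0):
    """Generate n standard normal samples with a seeded uniform source."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()   # shift [0, 1) to (0, 1] so log(u1) is defined
        u2 = rng.random()
        out.extend(box_muller(u1, u2))
    return out[:n]


if __name__ == "__main__":
    xs = gaussian_samples(20000)
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    print(f"mean={mean:.3f} variance={var:.3f}")
```

The tails of the output come from values of `u1` very close to zero, which is where the accuracy of a hardware logarithm approximation would matter most for low bit error rate experiments.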

Proceedings Article•DOI•
B.R. Lee, N. Burgess •
09 Apr 2003
TL;DR: This paper presents the design of parameterized fixed-point integer multiplication, squaring and fractional division units targeted at the Virtex-II family of FPGAs from Xilinx and based on the small 18×18-bit multiplier blocks.
Abstract: This paper presents the design of parameterized fixed-point integer multiplication, squaring and fractional division units. The units are targeted at the Virtex-II family of FPGAs (field programmable gate arrays) from Xilinx and are based on the small 18×18-bit multiplier blocks. New partial product creation and summation techniques that exploit the low level primitives are used to achieve a 20% area and a 30% delay reduction for multiplication. A dedicated squaring component is presented that offers substantial area savings of up to 50%. The division component uses the multipliers for pre-scaling to reduce the delay and complexity of each minimally redundant radix-8 stage.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: This study provides novel hardware architectures for an IDS system which should be able to monitor networks with a speed up to 2.68 Gbps to achieve higher speed and more efficient performance of network security.
Abstract: One type of network security strategy is using an intrusion detection system (IDS). We are implementing an IDS in FPGA-based (Field Programmable Gate Array) reconfigurable hardware. This is to achieve higher speed and more efficient performance of network security, as networks develop very fast with consequently more demanding constraints. This study provides novel hardware architectures for an IDS system which should be able to monitor networks with a speed up to 2.68 Gbps.

Proceedings Article•DOI•
T.K. Lee, S. Yusuf, Wayne Luk, Morris Sloman, Emil Lupu, Naranker Dulay •
09 Apr 2003
TL;DR: A framework for capturing firewall requirements as high-level descriptions based on the policy specification language Ponder is described, which provides abstraction from hardware implementation while allowing performance control through constraints.
Abstract: We describe a framework for capturing firewall requirements as high-level descriptions based on the policy specification language Ponder. The framework provides abstraction from hardware implementation while allowing performance control through constraints. Our hardware compilation strategy for such descriptions involves a rule reduction step to produce a hardware firewall rule representation. Three main methods have also been developed for resource optimization: partitioning, elimination, and sharing. A case study involving five sets of filter rules indicates that it is possible to reduce hardware resources by 67-80% over techniques based on regular content-addressable memory, and by 24-63% over methods based on irregular content-addressable memory.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: Three algorithms are presented that solve the problem of balancing the hardware needs of the domain while considering performance and area requirements during the development of an encryption-specialized FPGA architecture.
Abstract: Although domain-specialized FPGAs (field programmable gate arrays) can offer significant area, speed and power improvements over conventional reconfigurable devices, there are several unique and unexplored design problems that complicate their development. One source of these problems is that the designers often opt to replace more universal, fine-grain logic elements with a specialized set of coarse-grain functional units to improve computation speed and reduce routing complexity. One issue this introduces is that it is not obvious how to simultaneously consider all applications in a domain and determine the most appropriate overall number and ratio of the different functional units. In this paper, we illustrate how this problem manifests itself during the development of an encryption-specialized FPGA architecture. We present three algorithms that solve this problem by balancing the hardware needs of the domain while considering performance and area requirements. We believe these concerns need to be addressed by future CAD tools in order to develop more sophisticated application-specialized reconfigurable devices.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: An FPGA architecture for the separable 2-D Biorthogonal Discrete Wavelet Transform (DWT) decomposition based on the Pyramid Algorithm Analysis, which handles computation along the border efficiently by using the method of symmetric extension.
Abstract: This paper gives a design framework for the implementation of the 2D (two-dimensional) orthogonal discrete wavelet transform (DWT) on an FPGA (field programmable gate array). The architecture is based on the pyramid algorithm analysis. It spatially maps the multistage filter banks of the DWT onto the Xilinx Virtex-E FPGA family using on-chip buffering. The architecture takes advantage of the low rate of the high transform stages to reuse the logic. In this paper, we propose an FIR structure to handle the computation along the borders using symmetric extension, a new BlockRAM configuration for multi-port shift registers, and a mathematical approach to predict and reduce the error dynamic range due to word-length rounding. For an M×M input image, our architecture has a period of M^2 clock cycles, and requires the minimum storage size. The architecture is highly scalable for different filter lengths and numbers of octaves. The implementation results for a specific 2D Daubechies-4 wavelet transform are included.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: This paper describes a set of synthesizable and programmable memory interfaces a compiler can use to automatically generate the appropriate designs for mapping computations to FPGA-based architectures and reveals that it is possible to accurately model the area and timing requirements using a linear estimation function.
Abstract: As the densities of current FPGAs continue to grow, it is now possible to generate System-On-a-Chip (SoC) designs where multiple computing cores are connected to various memory modules with customized topology and application-specific memory access patterns. For example, Xilinx has recently introduced devices to which a pared-down version of a PowerPC core can be mapped and connected to a set of internal memories. In this paper we address the problem of synthesizing and estimating the area and speed of memory interfacing for Static RAM (SRAM) and Synchronous Dynamic RAM (SDRAM) with various latency parameters and access modes. We describe a set of synthesizable and programmable memory interfaces a compiler can use to automatically generate the appropriate designs for mapping computations to FPGA-based architectures. Our preliminary results reveal that it is possible to accurately model the area and timing requirements using a linear estimation function. We have successfully integrated the proposed memory interface designs with simple image processing kernels generated using commercially available behavioral synthesis tools.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: This paper introduces a fully automated fault recovery system, requiring no manual intervention, for networked systems that contain FPGAs (field programmable gate arrays).
Abstract: The device-level size and complexity of reconfigurable architectures make fault tolerance an important concern in system design. In this paper, we introduce a fully automated fault recovery system for networked systems that contain FPGAs (field programmable gate arrays). If a fault is detected that cannot be addressed locally, fault information is transferred to a reconfiguration server. Following design recompilation to avoid the fault, a new FPGA configuration is returned to the remote system and computation is reinitiated. To illustrate the benefit of this approach, we have implemented a complete fault recovery system, which requires no manual intervention. An important part of the system is a timing-driven incremental router for Xilinx Virtex devices. This router is directly interfaced to Xilinx JBits and uses no CAD tools from the standard Xilinx Alliance tool flow. Our completed system has been applied to three benchmark designs and exhibits complete fault recovery in up to 12x less time than the standard incremental Xilinx PAR flow.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: Reconfigurable logic, when combined with the data reorganization, can lead to dramatic performance improvements of up to 20x over traditional computer architectures for pointer-based computations, traditionally not viewed as a good match for reconfigurable technologies.
Abstract: FPGAs (field programmable gate arrays) have appealing features such as customizable internal and external bandwidth and the ability to exploit vast amounts of fine-grain parallelism. In this paper, we explore the applicability of these features in using FPGAs as smart memory engines for search and reorganization computations over spatial pointer-based data structures. The experimental results in this paper suggest that reconfigurable logic, when combined with data reorganization, can lead to dramatic performance improvements of up to 20x over traditional computer architectures for pointer-based computations, traditionally not viewed as a good match for reconfigurable technologies.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: The architecture for FIR filters on Xilinx Virtex FPGAs (field programmable gate arrays) is presented, which is particularly useful for handling the problem of signal boundaries filtering, which occurs in finite length signal processing.
Abstract: FIR (Finite Impulse Response) filters are often used in digital signal processing. This paper presents an architecture for FIR filters on Xilinx Virtex FPGAs (field programmable gate arrays). The architecture is particularly useful for handling the problem of signal boundary filtering, which occurs in finite length signal processing (e.g. image processing). Based on bit parallel arithmetic, our architecture is fully scalable and parameterized. It cleverly exploits the Shift Register Logic (SRL16) component of the Virtex family. The implementation leads to considerable area savings compared to the conventional implementation (based on a hard router) with no speed penalty. A case study based on the implementation of the standard low filter of the Daubechies-8 wavelet on Xilinx Virtex-E FPGAs is presented.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: This extended abstract presents an architecture that overcomes the previous limitations of the Finite-Difference Time-Domain method, and begins with a high-level description of the computational flow of this architecture.
Abstract: Maxwell's equations, which govern electromagnetic propagation, are a system of coupled, differential equations. As such, they can be represented in difference form, thus allowing their numerical solution. By implementing both the temporal and spatial derivatives of Maxwell's equations in difference form, we arrive at one of the most common computational electromagnetic algorithms, the Finite-Difference Time-Domain (FDTD) method (Yee, 1966). In this technique, the region of interest is sampled to generate a grid of points, hereafter referred to as a mesh. The discretized form of Maxwell's equations is then solved at each point in the mesh to determine the associated electromagnetic fields. In this extended abstract, we present an architecture that overcomes the previous limitations. We begin with a high-level description of the computational flow of this architecture.
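
The mesh update described in the abstract can be illustrated in one dimension. The sketch below is a textbook 1-D FDTD loop in normalized units with a Courant number of 0.5 and a soft Gaussian source. It is a software illustration of the method only, not the hardware architecture the paper proposes, and the function name, grid size, and source parameters are assumptions.

```python
import math


def fdtd_1d(n_cells=200, n_steps=100, source_cell=100):
    """Run a 1-D FDTD simulation in free space and return the (ez, hy) field lists."""
    ez = [0.0] * n_cells   # electric field samples on the mesh
    hy = [0.0] * n_cells   # magnetic field samples, staggered half a cell
    for step in range(n_steps):
        # Update E from the spatial difference of H (ez[0] is held at zero).
        for k in range(1, n_cells):
            ez[k] += 0.5 * (hy[k - 1] - hy[k])
        # Soft source: add a Gaussian pulse in time at one mesh point.
        ez[source_cell] += math.exp(-0.5 * ((step - 30) / 10.0) ** 2)
        # Update H from the spatial difference of E.
        for k in range(n_cells - 1):
            hy[k] += 0.5 * (ez[k] - ez[k + 1])
    return ez, hy


if __name__ == "__main__":
    ez, _ = fdtd_1d()
    print(f"peak |Ez| after 100 steps: {max(abs(v) for v in ez):.3f}")
```

Every mesh point performs the same small update on each time step using only its neighbours, which is the kind of regular, local computation that lends itself to a hardware pipeline.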

Proceedings Article•DOI•
Vinay Singh, A. Root, E. Hemphill, Nabeel Shirazi, James Hwang •
09 Apr 2003
TL;DR: This work demonstrates how system level design tools can be used to build a bit error rate (BER) tester, and how hardware co-simulation of the entire system provided a 10,000x speed-up over a pure software simulation.
Abstract: System level design tools for creating DSP designs reduce the amount of time needed to create a DSP design, in part by eliminating the need for verification between system model and hardware implementation. The design is developed within a high-level modeling environment. This description is compiled into a hardware description language, and synthesized by traditional FPGA (field programmable gate array) tools. The use of system level tools can eliminate the need for extensive hardware knowledge. We demonstrate how such tools can be used to build a bit error rate (BER) tester, and how hardware co-simulation of the entire system provided a 10,000x speed-up over a pure software simulation.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: A massively parallel single instruction multiple data stream (SIMD) processor designed specifically for cryptographic key search applications is presented and performance is compared with a previously reported hardwired design on a RC4 key search application.
Abstract: A massively parallel single instruction multiple data stream (SIMD) processor designed specifically for cryptographic key search applications is presented. This design exploits fine grain parallelism and the high memory bandwidth available in an FPGA (field programmable gate array) by integrating 95 simple processors and memory on a single FPGA chip. Performance is compared with a previously reported hardwired design on an RC4 key search application.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: This paper briefly presents a block cipher encryption architecture and a reconfigurable logic-based hardware design for the SCAN encryption algorithm and detailed performance results are presented for still images as well as video.
Abstract: This paper briefly presents a block cipher encryption architecture and a reconfigurable logic-based hardware design for the SCAN encryption algorithm. Detailed performance results are presented for still images as well as video, and the reconfigurable architecture is compared to software-only implementations of the same algorithm as well as a preliminary ASIC design.

Proceedings Article•DOI•
09 Apr 2003
TL;DR: This paper investigates the potential benefits and costs of implementing this architecture using an asynchronous methodology, and focuses on the benefit due to decreased timing pessimism in an asynchronous implementation.
Abstract: PipeRench is a configurable architecture that has the unique ability to virtualize an application using dynamic reconfiguration. This paper investigates the potential benefits and costs of implementing this architecture using an asynchronous methodology. Since clock distribution and gating are relatively easy in the synchronous PipeRench, we focus on the benefit due to decreased timing pessimism in an asynchronous implementation. Two architectures for fully asynchronous implementation are considered. PE-based asynchronous implementation yields approximately 80% improvement in performance per stripe. This implementation, however, requires significant increases in configuration storage and wire count. A few particular features of the architecture, such as the crossbar interconnect structure within the stripe, are primarily responsible for this growth in configuration bits and wires. These features, however, are the primary aspects of the PipeRench architecture that make it a good compilation target.

Book Chapter•DOI•
09 Apr 2003
TL;DR: In this article, the authors focus on the development of fast, yet accurate performance and area modeling of complete FPGA designs that combine analytical, empirical and behavioral estimation techniques, and model the application of a set of important program transformations for image processing algorithms, namely loop unrolling, tiling, loop interchanging, loop fission and array privatization.
Abstract: Digital image processing algorithms are a good match for direct implementation on FPGAs as current FPGA architectures can naturally match the fine-grain parallelism in these applications. Typically, these algorithms are structured as a sequence of operations, expressed in high-level programming languages as tight loop nests. The loops usually define a shifting-window region over which the algorithm applies a simple localized operator (e.g., a differential gradient, or a min/max). In this research we focus on the development of fast, yet accurate performance and area modeling of complete FPGA designs that combine analytical, empirical and behavioral estimation techniques. We model the application of a set of important program transformations for image processing algorithms, namely loop unrolling, tiling, loop interchanging, loop fission and array privatization, and explore pipelined and non-pipelined execution modes. We take into consideration the impact of various transformations, in the presence of limited I/O resources like address generators and external memory data channels, on the performance of a complete design implemented in an FPGA-based architecture.
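To make the loop-nest structure concrete, the following minimal sketch applies a 3x3 max operator over a shifting window, first as a plain loop nest and then with the loops tiled. The function names, the 3x3 window, and the tile size are assumptions chosen for illustration; they are not taken from the paper, and the sketch says nothing about the paper's performance or area models.

```python
def window_max(img, r, c):
    """Apply the localized operator: max over the 3x3 window whose top-left is (r, c)."""
    return max(img[r + dr][c + dc] for dr in range(3) for dc in range(3))


def filter_plain(img):
    """Original loop nest: visit output pixels in row-major order."""
    rows, cols = len(img) - 2, len(img[0]) - 2
    return [[window_max(img, r, c) for c in range(cols)] for r in range(rows)]


def filter_tiled(img, tile=4):
    """Tiled loop nest: the same iterations, grouped into tile x tile blocks."""
    rows, cols = len(img) - 2, len(img[0]) - 2
    out = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, tile):            # loop over tile rows
        for c0 in range(0, cols, tile):        # loop over tile columns
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[r][c] = window_max(img, r, c)
    return out
```

Tiling changes only the order in which output pixels are visited, so both versions produce the same image. What changes is the working set touched at any one time, which is the kind of effect that matters when on-chip buffers and external memory channels are limited.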

Proceedings Article•DOI•
09 Apr 2003
TL;DR: This astrophysics application poses a "good example" of the use of a high-level reconfigurable computing tool such as sc2 to accelerate an algorithm because it uses real satellite data, the algorithm can be parallelized, and was originally validated using a high-level scientific language, IDL.
Abstract: This paper presents a method to detect gamma-ray pulsars using a fast folding algorithm (Staelin, 1969) mapped onto reconfigurable hardware. In contrast, existing techniques require gigapoint complex FFTs. The algorithm has been written in Streams-C and compiled with the sc2 compiler to the target Annapolis Micro Systems (AMS) Firebird board (Xilinx Virtex E processor). To accelerate detection of new gamma-ray pulsars, the sc2 compiler generates a hardware implementation of the algorithm for finding periodicities in data sets. The data to be analyzed comes from a high-energy gamma-ray telescope onboard a spacecraft. This astrophysics application poses a "good example" of the use of a high-level reconfigurable computing tool such as sc2 to accelerate an algorithm because it uses real satellite data, the algorithm can be parallelized, and was originally validated using a high-level scientific language, IDL. By recasting the algorithm into Streams-C, the scientific software developer can create a hardware implementation on a reconfigurable computing platform. We describe the fast folding algorithm, the Streams-C implementation, and discuss techniques to optimize performance within the Streams-C framework. The compiler-generated hardware delivers approximately 3X to 6X speedup over a comparable 800 MHz general-purpose processor doing the software-only algorithm.
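The idea of finding periodicities by folding can be sketched in a few lines. The example below is plain brute-force epoch folding, not Staelin's fast folding algorithm, which reuses partial sums across neighboring trial periods. The binned input series, the integer trial periods, and the variance-based score are assumptions made for this illustration, as are all function names.

```python
def fold(series, period):
    """Sum the samples that fall at the same phase modulo an integer period."""
    profile = [0.0] * period
    for i, x in enumerate(series):
        profile[i % period] += x
    return profile


def fold_score(series, period):
    """Score a trial period by how far the folded profile is from flat.

    Assumes period <= len(series), so that every phase bin receives at
    least one sample.
    """
    n = len(series)
    sums = fold(series, period)
    counts = [len(range(p, n, period)) for p in range(period)]
    overall = sum(series) / n
    # Count-weighted squared deviation of each phase-bin mean from the overall mean.
    return sum(c * (s / c - overall) ** 2 for s, c in zip(sums, counts))


def best_period(series, periods):
    """Return the trial period whose folded profile deviates most from flat."""
    return max(periods, key=lambda p: fold_score(series, p))
```

A signal that repeats at the trial period piles up in a few phase bins, while at a wrong period it is smeared across all of them. Integer multiples of the true period score comparably, so a practical search has to handle harmonics separately.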

Proceedings Article•
09 Apr 2003
TL;DR: An FIR structure to handle the computation along the borders using symmetry extension, a new BlockRAM configuration for a multi-port shift register, and a mathematical approach to predict and reduce the error dynamic range due to wordlength rounding are proposed.
Abstract: This paper gives a design framework for the implementation of the 2-D Orthogonal Discrete Wavelet Transform (DWT) on FPGA. The architecture is based on the Pyramid Algorithm Analysis. It spatially maps the multistage filter banks of the DWT onto the Xilinx Virtex-E FPGA family using on-chip buffering. The architecture takes advantage of the low rate of the higher transform stages to reuse the logic. In this paper, we propose a novel FIR structure to handle the computation along the borders using symmetry extension, a new BlockRAM configuration for a multi-port shift register, and a new mathematical approach to predict and reduce the error dynamic range due to wordlength rounding. For an MxM image size input, our architecture has a period of M^2 clock cycles, and requires the minimum storage size. The architecture is highly scalable for different filter lengths and number of octaves. The implementation results for a specific 2-D Daubechies-4 Wavelet transform are included.
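A minimal software sketch of one pyramid stage may help picture the filter bank. It assumes the 4-tap Daubechies filter pair and a half-sample symmetric border extension; the abstract does not specify either detail, so both are assumptions here, as are the function names. It models only the arithmetic, not the paper's hardware structure, buffering, or fixed-point wordlengths.

```python
import math

_S3 = math.sqrt(3.0)
# 4-tap Daubechies low-pass filter (assumed) and its quadrature-mirror high-pass.
H = [c / (4 * math.sqrt(2)) for c in (1 + _S3, 3 + _S3, 3 - _S3, 1 - _S3)]
G = [H[3], -H[2], H[1], -H[0]]


def reflect(i, n):
    """Map an out-of-range index back into [0, n) by symmetric extension."""
    while i < 0 or i >= n:
        i = -i - 1 if i < 0 else 2 * n - 1 - i
    return i


def analyze_1d(x):
    """One filter-bank stage: filter and downsample by 2 (len(x) even, >= 2)."""
    n = len(x)
    low, high = [], []
    for k in range(0, n, 2):
        taps = [x[reflect(k + t, n)] for t in range(4)]
        low.append(sum(h * v for h, v in zip(H, taps)))
        high.append(sum(g * v for g, v in zip(G, taps)))
    return low, high


def _columns(mat):
    """Apply analyze_1d down every column and return (low, high) as row-major matrices."""
    per_col = [analyze_1d(list(col)) for col in zip(*mat)]
    low = [list(r) for r in zip(*(c[0] for c in per_col))]
    high = [list(r) for r in zip(*(c[1] for c in per_col))]
    return low, high


def analyze_2d(img):
    """One 2-D octave: rows first, then columns. Returns (LL, LH, HL, HH)."""
    row_out = [analyze_1d(row) for row in img]
    ll, lh = _columns([r[0] for r in row_out])
    hl, hh = _columns([r[1] for r in row_out])
    return ll, lh, hl, hh


def pyramid(img, octaves):
    """Pyramid algorithm: re-apply the 2-D stage to the LL band of each octave."""
    details = []
    for _ in range(octaves):
        img, lh, hl, hh = analyze_2d(img)
        details.append((lh, hl, hh))
    return img, details
```

Each octave processes a quarter as many samples as the one before it, which is the low rate at higher stages that allows logic to be shared between them.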