
Showing papers by "Wayne Luk published in 2004"


Book Chapter•DOI•
30 Aug 2004
TL;DR: In this article, the authors investigated the impact of pipelining on energy consumption of two representative FPGA devices: a 0.13μm CMOS high density/high speed FPGA (Altera Stratix EP1S40) and a 0.18μm CMOS low-cost FPGA (Xilinx XC2S200).
Abstract: This paper investigates experimentally the quantitative impact of pipelining on energy per operation for two representative FPGA devices: a 0.13μm CMOS high density/high speed FPGA (Altera Stratix EP1S40), and a 0.18μm CMOS low-cost FPGA (Xilinx XC2S200). The results are obtained by both measurements and execution of vendor-supplied tools for power estimation. It is found that pipelining can reduce the amount of energy per operation by between 40% and 90%. Further reduction in energy consumption can be achieved by power-aware clustering, although the effect becomes less pronounced for circuits with a large number of pipeline stages.

118 citations


Proceedings Article•DOI•
20 Apr 2004
TL;DR: A method that offers a uniform treatment for bit-width optimisation of both fixed-point and floating-point designs and is implemented in the BitSize tool targeting reconfigurable architectures, which takes user-defined constraints to direct the optimisation procedure.
Abstract: This paper presents a method that offers a uniform treatment for bit-width optimisation of both fixed-point and floating-point designs. Our work utilises automatic differentiation to compute the sensitivities of outputs to the bit-width of the various operands in the design. This sensitivity analysis enables us to explore and compare fixed-point and floating-point implementation for a particular design. As a result, we can automate the selection of the optimal number representation for each variable in a design to optimize area and performance. We implement our method in the BitSize tool targeting reconfigurable architectures, which takes user-defined constraints to direct the optimisation procedure. We illustrate our approach using applications such as ray-tracing and function approximation.
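The sensitivity analysis described above admits a simple first-order reading: the output error contributed by giving an operand b fractional bits is roughly the magnitude of the output's derivative with respect to that operand times the quantisation step 2^-b. The sketch below illustrates only that idea, using finite differences in place of a genuine automatic-differentiation tool; the example function, input point and candidate bit-widths are hypothetical and not taken from the paper.

```python
# Minimal sketch: rank operands of f by output sensitivity to their bit-width.
# Finite differences stand in for automatic differentiation; the function and
# candidate bit-widths are hypothetical examples, not the BitSize tool itself.
import math

def f(x, y):                     # example design: a small arithmetic kernel
    return math.sqrt(x * x + y * y) * math.cos(y)

def sensitivity(f, args, i, h=1e-6):
    """|df/d args[i]| estimated by central differences."""
    lo = list(args); hi = list(args)
    lo[i] -= h; hi[i] += h
    return abs(f(*hi) - f(*lo)) / (2 * h)

args = (1.5, 0.25)               # a representative input point
for i, name in enumerate(("x", "y")):
    s = sensitivity(f, args, i)
    for bits in (8, 12, 16):     # candidate fractional bit-widths
        err = s * 2.0 ** (-bits) # first-order estimate of output error
        print(f"operand {name}, {bits} fractional bits: "
              f"estimated output error ~ {err:.2e}")
```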

115 citations


Journal Article•DOI•
TL;DR: A hardware-based Gaussian noise generator used as a key component in a hardware simulation system, for exploring channel code behavior at very low bit error rates (BERs) in the range of 10^-9 to 10^-10.
Abstract: Hardware simulation offers the potential of improving code evaluation speed by orders of magnitude over workstation or PC-based simulation. We describe a hardware-based Gaussian noise generator used as a key component in a hardware simulation system, for exploring channel code behavior at very low bit error rates (BERs) in the range of 10^-9 to 10^-10. The main novelty is the design and use of nonuniform piecewise linear approximations in computing trigonometric and logarithmic functions. The parameters of the approximation are chosen carefully to enable rapid computation of coefficients from the inputs while still retaining high fidelity to the modeled functions. The output of the noise generator accurately models a true Gaussian Probability Density Function (PDF) even at very high σ values. Its properties are explored using: 1) several different statistical tests, including the chi-square test and the Anderson-Darling test, and 2) an application for decoding of low-density parity-check (LDPC) codes. An implementation at 133 MHz on a Xilinx Virtex-II XC2V4000-6 FPGA produces 133 million samples per second, which is seven times faster than a 2.6 GHz Pentium-IV PC; another implementation on a Xilinx Spartan-IIE XC2S300E-7 FPGA at 62 MHz is capable of a three times speedup. The performance can be improved by exploiting parallelism: an XC2V4000-6 FPGA with nine parallel instances of the noise generator at 105 MHz can run 50 times faster than a 2.6 GHz Pentium-IV PC. We illustrate the deterioration of clock speed with the increase in the number of instances.
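The trigonometric and logarithmic functions mentioned above suggest a Box-Muller style transform at the heart of the generator; assuming that, the sketch below is a software model that calls the exact functions where the hardware would use nonuniform piecewise linear approximations, and then sanity-checks the sample statistics. It is an illustration of the transform, not a description of the circuit.

```python
# Software model of a Box-Muller style Gaussian noise generator.
# The paper's hardware replaces log/cos/sin with nonuniform piecewise linear
# approximations; here the exact functions stand in, as an assumption about
# the underlying transform rather than a description of the actual circuit.
import math, random

def gaussian_pair(rng=random):
    u1 = rng.random() or 1e-12          # avoid log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))  # log evaluated in hardware by PWL segments
    return (r * math.cos(2.0 * math.pi * u2),
            r * math.sin(2.0 * math.pi * u2))

# Quick sanity check on mean and variance of the modelled PDF.
samples = [x for _ in range(100_000) for x in gaussian_pair()]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean ~ {mean:.3f}, variance ~ {var:.3f}")   # expect ~0 and ~1
```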

101 citations


Book•
01 Jan 2004
TL;DR: Synthesis and Optimization of DSP Algorithms describes approaches taken to synthesising structural hardware descriptions of digital circuits from high-level descriptions of Digital Signal Processing (DSP) algorithms.
Abstract: Synthesis and Optimization of DSP Algorithms describes approaches taken to synthesising structural hardware descriptions of digital circuits from high-level descriptions of Digital Signal Processing (DSP) algorithms. The book contains: -A tutorial on the subjects of digital design and architectural synthesis, intended for DSP engineers, -A tutorial on the subject of DSP, intended for digital designers, -A discussion of techniques for estimating the peak values likely to occur in a DSP system, thus enabling an appropriate signal scaling. Analytic techniques, simulation techniques, and hybrids are discussed. The applicability of different analytic approaches to different types of DSP design is covered, -The development of techniques to optimise the precision requirements of a DSP algorithm, aiming for efficient implementation in a custom parallel processor. The idea is to trade-off numerical accuracy for area or power-consumption advantages. Again, both analytic and simulation techniques for estimating numerical accuracy are described and contrasted. Optimum and heuristic approaches to precision optimisation are discussed, -A discussion of the importance of the scheduling, allocation, and binding problems, and development of techniques to automate these processes with reference to a precision-optimized algorithm, -Future perspectives for synthesis and optimization of DSP algorithms.
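As one concrete instance of the analytic peak-estimation techniques mentioned above: for a linear time-invariant stage such as an FIR filter, the output peak is bounded by the input peak times the l1-norm of the coefficients, which in turn fixes the integer bit-width needed to avoid overflow. The sketch below works through that bound with made-up coefficients; it is an illustration of the idea, not material from the book.

```python
# l1-norm (worst-case) peak bound for an FIR filter, as one example of the
# analytic scaling techniques discussed; coefficients here are hypothetical.
import math

coeffs = [0.12, -0.35, 0.87, -0.35, 0.12]   # example FIR taps
input_peak = 1.0                            # assume |x[n]| <= 1.0

output_peak_bound = input_peak * sum(abs(c) for c in coeffs)
int_bits = max(1, math.ceil(math.log2(output_peak_bound)) + 1)  # sign + magnitude

print(f"worst-case output peak <= {output_peak_bound:.3f}")
print(f"integer bits needed to avoid overflow: {int_bits}")
```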

76 citations


Proceedings Article•DOI•
20 Apr 2004
TL;DR: A flexible hardware encoder for regular and irregular low-density parity-check (LDPC) codes that is flexible, supporting arbitrary H matrices, rates and block lengths and can be improved by exploiting parallelism.
Abstract: We describe a flexible hardware encoder for regular and irregular low-density parity-check (LDPC) codes. Although LDPC codes achieve better performance and lower decoding complexity than turbo codes, a major drawback of LDPC codes is their apparently high encoding complexity. Using an efficient encoding method proposed by Richardson and Urbanke, we present a hardware LDPC encoder with linear encoding complexity. The encoder is flexible, supporting arbitrary H matrices, rates and block lengths. An implementation for a rate 1/2 irregular length 2000 LDPC code encoder on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 4% of the device. It runs at 143 MHz and has a throughput of 45 million codeword bits per second (or 22 million information bits per second) with a latency of 0.18 ms. The performance can be improved by exploiting parallelism: several instances of the encoder can be mapped onto the same chip to encode multiple message blocks concurrently. An implementation of 16 instances of the encoder on the same device at 82 MHz is capable of 410 million codeword bits per second, 80 times faster than an Intel Pentium-IV 2.4 GHz PC.
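For intuition about what the encoder computes: LDPC encoding is a linear-algebra operation over GF(2), and in the simplest systematic case with H = [P | I] the parity bits are just a sparse matrix-vector product. The sketch below shows that simplified case only; the Richardson-Urbanke method used by the hardware achieves near-linear complexity via an approximate lower-triangular decomposition of H, which is not reproduced here, and the tiny H matrix is a made-up example.

```python
# Systematic LDPC encoding over GF(2) for H = [P | I]: parity = P @ s (mod 2).
# A simplified illustration of LDPC encoding as GF(2) linear algebra, not the
# Richardson-Urbanke method implemented by the paper's hardware. The tiny H
# below is a made-up example.
import numpy as np

P = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 1]], dtype=np.uint8)      # parity part of H
H = np.hstack([P, np.eye(3, dtype=np.uint8)])     # H = [P | I]

s = np.array([1, 0, 1, 1], dtype=np.uint8)        # information bits
parity = (P @ s) % 2                              # parity bits
codeword = np.concatenate([s, parity])

assert not ((H @ codeword) % 2).any()             # H * c = 0 over GF(2)
print("codeword:", codeword)
```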

53 citations


Book Chapter•DOI•
21 Jul 2004
TL;DR: A design generator has been developed which can automatically produce a customised ECC hardware design that meets user-defined requirements and enables designers to rapidly explore and implement a design with the best trade-offs in speed, size and level of security.
Abstract: This paper presents a method for producing hardware designs for Elliptic Curve Cryptography (ECC) systems over the finite field GF(2^m), using the optimal normal basis for the representation of numbers. A design generator has been developed which can automatically produce a customised ECC hardware design that meets user-defined requirements. This method enables designers to rapidly explore and implement a design with the best trade-offs in speed, size and level of security. To facilitate performance characterisation, we have developed formulae for estimating the number of cycles for our generic ECC architecture. The resulting hardware implementations are among the fastest reported, and can often run several orders of magnitude faster than software implementations.
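The core primitive behind such ECC hardware is multiplication in GF(2^m). The paper's designs use an optimal normal basis, where squaring is a cyclic shift; the sketch below instead shows the more familiar polynomial-basis shift-and-add multiplication, purely to illustrate GF(2^m) arithmetic, with the GF(2^163) reduction polynomial as an example field.

```python
# GF(2^m) multiplication in polynomial basis (shift-and-add with reduction).
# The paper uses an optimal normal basis representation in hardware; this
# polynomial-basis sketch is only meant to illustrate GF(2^m) arithmetic.
def gf2m_mul(a: int, b: int, m: int = 163,
             poly: int = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1) -> int:
    """Multiply field elements a, b (bit vectors as ints) modulo 'poly'."""
    result = 0
    while b:
        if b & 1:
            result ^= a            # addition in GF(2) is XOR
        b >>= 1
        a <<= 1
        if a >> m:                 # degree reached m: reduce by the field polynomial
            a ^= poly
    return result

x = (1 << 160) | 0x1234            # arbitrary example field elements
y = (1 << 150) | 0x9
print(hex(gf2m_mul(x, y)))         # their product in GF(2^163)
```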

25 citations


Journal Article•DOI•
01 Jan 2004
TL;DR: This paper presents parameterized module-generators for pipelined function evaluation using lookup tables, adders, shifters, multipliers, and dividers, and discusses trade-offs involved between full-lookup tables, bipartite units, lookup-multiply units, shift-and-add based CORDIC units, and rational approximation.
Abstract: This paper presents parameterized module-generators for pipelined function evaluation using lookup tables, adders, shifters, multipliers, and dividers. We discuss trade-offs involved between (1) full-lookup tables, (2) bipartite (lookup-add) units, (3) lookup-multiply units, (4) shift-and-add based CORDIC units, and (5) rational approximation. Our treatment mainly focuses on explaining method (3), and briefly covers the background of the other methods. For lookup-multiply units, we provide equations for estimating approximation errors and rounding errors which are used to parameterize the hardware units. The resources and performance of the resulting design can be estimated given the input parameters. A selection of the compared methods are implemented as part of the current PAM-Blox module generation environment. An example shows that the lookup-multiply unit produces competitive designs with data widths up to 20 bits when compared with shift-and-add based CORDIC units. Additionally, the lookup-multiply method or rational approximation can produce efficient designs for larger data widths when evaluating functions not supported by CORDIC.
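A lookup-multiply unit can be viewed as a per-segment first-order approximation: each table entry stores f(x0) and a slope, and evaluation costs one multiply and one add. The sketch below models that structure in software and measures its worst-case error for sin(x); the segment count and function are illustrative assumptions, not parameters from the paper.

```python
# Software model of a lookup-multiply (table + one multiply) evaluation unit:
# each segment stores f(x0) and f'(x0); evaluation is f(x0) + f'(x0)*(x - x0).
# Segment count and test function are assumptions for illustration only.
import math

SEGMENTS = 64
LO, HI = 0.0, math.pi / 2
STEP = (HI - LO) / SEGMENTS
f, df = math.sin, math.cos

table = [(f(LO + i * STEP), df(LO + i * STEP)) for i in range(SEGMENTS)]

def lookup_multiply(x: float) -> float:
    i = min(int((x - LO) / STEP), SEGMENTS - 1)
    x0 = LO + i * STEP
    f0, slope = table[i]
    return f0 + slope * (x - x0)           # one multiply, one add

max_err = max(abs(lookup_multiply(LO + k * (HI - LO) / 9999) -
                  f(LO + k * (HI - LO) / 9999)) for k in range(10000))
print(f"max approximation error with {SEGMENTS} segments: {max_err:.2e}")
```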

23 citations


Proceedings Article•DOI•
06 Dec 2004
TL;DR: The Rabin-Miller Strong Pseudoprime Test has been mapped into hardware, which makes use of a circuit for computing Montgomery modular exponentiation to speed up the validation and to reduce the hardware cost.
Abstract: This work presents a scalable architecture for prime number validation which targets reconfigurable hardware. The primality test is crucial for security systems, especially for most public-key schemes. The Rabin-Miller Strong Pseudoprime Test has been mapped into hardware, which makes use of a circuit for computing Montgomery modular exponentiation to further speed up the validation and to reduce the hardware cost. A design generator has been developed to generate a variety of scalable and non-scalable Montgomery multipliers based on user-defined parameters. The performance and resource usage of our designs, implemented in Xilinx reconfigurable devices, have been explored using very large prime numbers. Our work demonstrates the flexibility and trade-offs in using a reconfigurable platform for prototyping cryptographic hardware in embedded systems. It is shown that, for instance, a 1024-bit primality test can be completed in less than a second, and a low cost XC3S2000 FPGA chip can accommodate a 32k-bit scalable primality test with 64 parallel processing elements.
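The Rabin-Miller test itself is compact; the expensive part, and the part the Montgomery multiplier circuit accelerates, is the repeated modular exponentiation. A software sketch of the test is shown below, with Python's built-in pow standing in for the hardware exponentiation unit.

```python
# Rabin-Miller strong pseudoprime test. The paper's hardware accelerates the
# modular exponentiations below with a scalable Montgomery multiplier; here
# Python's built-in pow() stands in for that circuit.
import random

def is_probable_prime(n: int, rounds: int = 20) -> bool:
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:             # write n - 1 = d * 2^s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)          # Montgomery modular exponentiation in hardware
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False          # composite witness found
    return True

print(is_probable_prime((1 << 127) - 1))   # Mersenne prime 2^127 - 1 -> True
```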

21 citations


Proceedings Article•DOI•
W. W. S. Chu, R. Dimond, S. Perrott, S. P. Seng, Wayne Luk
16 Feb 2004
TL;DR: A customisable architecture and the associated tools for a prototype EPIC (explicitly parallel instruction computing) processor, which include a compiler and an assembler based on the Trimaran framework, are described.
Abstract: This paper describes a customisable architecture and the associated tools for a prototype EPIC (explicitly parallel instruction computing) processor. Possible customisations include varying the number of registers and functional units, which are specified at compile-time. This facilitates the exploration of performance/area trade-offs for different EPIC implementations. We describe the tools for this EPIC processor, which include a compiler and an assembler based on the Trimaran framework. Various pipelined EPIC designs have been implemented using field programmable gate arrays (FPGAs); the one with 4 ALUs at 41.8 MHz can run a DCT application 5 times faster than the StrongARM SA-110 processor at 100 MHz.

18 citations


Journal Article•DOI•
15 Nov 2004
TL;DR: It is argued that shape-adaptive video processing algorithms with a relatively small number of different configuration contexts can often be more efficiently implemented as a static or multiconfiguration design, while a design employing dynamic or partial reconfiguration will be more suitable or even necessary if the number ofDifferent computation possibilities is relatively large.
Abstract: Various reconfigurable computing strategies are examined regarding their suitability for implementing shape-adaptive video processing algorithms of typical object-oriented multimedia applications. The utilisation of reconfigurability at different levels is investigated and the implications of designing reconfigurable shape-adaptive video processing circuits are addressed. Simple models for representing arbitrarily shaped objects and for mapping them into object-specific hardware designs are developed. Based on these models, several design and reconfiguration strategies, targeting an efficient mapping of shape-adaptive video processing tasks to a given reconfigurable computing architecture, are investigated. A number of real applications are analysed to study the trade-offs between these strategies. These include a shape-adaptive discrete cosine transform characterised by a limited number of different data-dependent computations and a shape-adaptive template matching method consisting of a virtually unlimited number of different computation possibilities. It is argued that shape-adaptive video processing algorithms with a relatively small number of different configuration contexts can often be more efficiently implemented as a static or multiconfiguration design, while a design employing dynamic or partial reconfiguration will be more suitable or even necessary if the number of different computation possibilities is relatively large.

16 citations


Journal Article•DOI•
H. Styles, Wayne Luk
TL;DR: An analytical queuing network performance model is proposed to determine the optimal settings for basic block computation rates given a set of observed branch probabilities and is shown to be highly accurate with relative error between 0.12 and 1.1 × 10^-4.
Abstract: This paper explores using information about program branch probabilities to optimize the results of hardware compilation. The basic premise is to promote utilization by dedicating more resources to branches which execute more frequently. A new hardware compilation and flow control scheme are presented which enable the computation rate of different branches to be matched to the observed branch probabilities. We propose an analytical queuing network performance model to determine the optimal settings for basic block computation rates given a set of observed branch probabilities. An experimental hardware compilation system has been developed to evaluate this approach. The branch optimization design space is characterized in an experimental study on Xilinx Virtex FPGAs of two complex applications: video feature extraction and progressive refinement radiosity. For designs of equal performance, branch-optimized designs require 24 percent and 27.5 percent less area. For designs of equal area, branch-optimized designs run up to three times faster. Our analytical performance model is shown to be highly accurate, with relative error between 0.12 and 1.1 × 10^-4.
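The optimisation premise can be illustrated with elementary queueing arithmetic: a block serving branch i sees work at rate λ·p_i and, if built to complete μ_i results per cycle, runs at utilisation ρ_i = λ·p_i/μ_i, so extra resources are best spent raising μ_i on frequently taken branches. The sketch below performs only this back-of-envelope calculation with hypothetical numbers; the paper's analytical queueing-network model is considerably more detailed.

```python
# Back-of-envelope utilisation check for branch-specific computation rates:
# a block on branch i sees tokens at rate lam * p[i] and serves mu[i] per cycle,
# so its utilisation is lam * p[i] / mu[i]. Numbers are hypothetical; the
# paper's analytical queueing-network model is more detailed than this.
lam = 1.0                                        # tokens entering the branch per cycle
p   = {"hot_branch": 0.9, "cold_branch": 0.1}    # observed branch probabilities
mu  = {"hot_branch": 1.0, "cold_branch": 0.125}  # results per cycle per block

for branch in p:
    rho = lam * p[branch] / mu[branch]
    print(f"{branch}: utilisation = {rho:.2f}"
          + ("  (bottleneck!)" if rho > 1.0 else ""))
# Dedicating more resources (higher mu) to the hot branch keeps utilisation
# balanced instead of over-provisioning the rarely taken branch.
```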

Journal Article•
TL;DR: It is found that pipelining can reduce the amount of energy per operation by between 40% and 90%.
Abstract: This paper investigates experimentally the quantitative impact of pipelining on energy per operation for two representative FPGA devices: a 0.13μm CMOS high density/high speed FPGA (Altera Stratix EP1S40), and a 0.18μm CMOS low-cost FPGA (Xilinx XC2S200). The results are obtained by both measurements and execution of vendor-supplied tools for power estimation. It is found that pipelining can reduce the amount of energy per operation by between 40% and 90%. Further reduction in energy consumption can be achieved by power-aware clustering, although the effect becomes less pronounced for circuits with a large number of pipeline stages.

Book Chapter•DOI•
30 Aug 2004
TL;DR: Preliminary experimental investigations reveal that while the proposed methodology is able to achieve the desired aims, its success would be enhanced if changes were made to existing FPGA fabrics in order to make them better suited to modular design.
Abstract: Increasing logic resources coupled with a proliferation of integrated performance enhancing primitives in high-end FPGAs results in an increased design complexity which requires new methodologies to overcome. This paper presents a structured system based design methodology, centred around the concept of architecture reuse, which aims to increase productivity and exploit the reconfigurability of high-end FPGAs. The methodology is exemplified by the Sonic-on-a-Chip architecture. Preliminary experimental investigations reveal that while the proposed methodology is able to achieve the desired aims, its success would be enhanced if changes were made to existing FPGA fabrics in order to make them better suited to modular design.

Proceedings Article•DOI•
06 Dec 2004
TL;DR: This work explores the reconfigurable dataflow approach in producing efficient hardware pipelines for programs with loop-carry dependencies in nested loops, and employs tagged tokens to enable reassembling of results which can retire out of order.
Abstract: This work explores the reconfigurable dataflow approach in producing efficient hardware pipelines for programs with loop-carry dependencies in nested loops. Reconfigurable dataflow combines static and dynamic scheduling, and employs tagged tokens to enable reassembling of results which can retire out of order. The effectiveness of this approach is illustrated using a fractal set generator and a Newton-Raphson root polisher: implementations targeting Xilinx Virtex and Virtex-II FPGAs can run up to 55 times faster than hardware pipelines developed using other methods, at the expense of a 50% increase in area.
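The tagged-token idea, letting results that finish out of order be reassembled in program order, has a simple software analogue: each work item carries a tag, completion order is data dependent, and a reorder buffer retires results strictly by tag. The sketch below shows that analogue using a small Mandelbrot-style kernel with a loop-carried dependency; it does not model the static/dynamic scheduling of the hardware pipeline.

```python
# Software analogue of tagged-token result reassembly: work items carry a tag,
# finish out of order (data-dependent iteration counts), and a reorder buffer
# retires them in tag order. Illustration only, not the hardware scheduler.
def mandelbrot_iters(c: complex, max_iter: int = 64) -> int:
    z = 0j
    for i in range(max_iter):              # loop-carried dependency on z
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

tagged_inputs = list(enumerate([0.3 + 0.5j, -1.0 + 0j, 0.0 + 1.0j, 2.0 + 2.0j]))
# Pretend completion order follows work per item, so it differs from issue order.
completed = sorted(((tag, mandelbrot_iters(c)) for tag, c in tagged_inputs),
                   key=lambda tr: tr[1])

reorder_buffer = {}
next_tag = 0
for tag, result in completed:
    reorder_buffer[tag] = result
    while next_tag in reorder_buffer:      # retire strictly in tag order
        print(f"tag {next_tag}: {reorder_buffer.pop(next_tag)} iterations")
        next_tag += 1
```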

Journal Article•
TL;DR: In this paper, a structured system based design methodology, centred around the concept of architecture reuse, which aims to increase productivity and exploit the reconfigurability of high-end FPGAs is presented.
Abstract: Increasing logic resources coupled with a proliferation of integrated performance enhancing primitives in high-end FPGAs results in an increased design complexity which requires new methodologies to overcome. This paper presents a structured system based design methodology, centred around the concept of architecture reuse, which aims to increase productivity and exploit the reconfigurability of high-end FPGAs. The methodology is exemplified by the Sonic-on-a-Chip architecture. Preliminary experimental investigations reveal that while the proposed methodology is able to achieve the desired aims, its success would be enhanced if changes were made to existing FPGA fabrics in order to make them better suited to modular design.

01 Jan 2004
TL;DR: This paper describes a framework, based on a parallel imperative language, which supports multiple levels of design abstraction, transformational development, optimisation by compiler passes, and metalanguage facilities, and has been used in producing designs for applications such as signal and image processing.
Abstract: Hardware compilers for high-level languages are increasingly recognised to be the key to reducing the productivity gap for advanced circuit development in general, and for reconfigurable designs in particular. This paper explains how customisable frameworks for hardware compilation can enable rapid design exploration, and reusable and extensible hardware optimisation. It describes such a framework, based on a parallel imperative language, which supports multiple levels of design abstraction, transformational development, optimisation by compiler passes, and metalanguage facilities. Our approach has been used in producing designs for applications such as signal and image processing, with different trade-offs in performance and resource usage.

Proceedings Article•DOI•
06 Dec 2004
TL;DR: This work explores the vast design space of adaptive range reduction for fixed-point sin(x), log(x) and √(x) accurate to one unit in the last place using MATLAB and ASC, A Stream Compiler.
Abstract: Function evaluation f(x) typically consists of range reduction and the actual function evaluation on a small interval. We investigate optimization of range reduction given the range and precision of x and f(x). For every function evaluation there exists a convenient interval such as [0, π/2) for sin(x). The adaptive range reduction method, which we propose in this work, involves deciding whether range reduction can be used effectively for a particular design. The decision depends on the function being evaluated, precision, and optimization metrics such as area, latency and throughput. In addition, the input and output range has an impact on the preferable function evaluation method such as polynomial, table-based, or combinations of the two. We explore this vast design space of adaptive range reduction for fixed-point sin(x), log(x) and √(x) accurate to one unit in the last place using MATLAB and ASC, A Stream Compiler. These tools enable us to study over 1000 designs resulting in over 40 million Xilinx equivalent circuit gates, in a few hours' time. The final objective is to progress towards a fully automated library that provides optimal function evaluation hardware units given input/output range and precision.
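For reference, the standard range reductions for the three functions named above are: reduce sin(x) to a quadrant of [0, π/2); split log(x) and √(x) via x = m·2^e so the core approximation only covers m. The sketch below shows these reductions in floating point for clarity; the paper's designs are fixed point, and the "adaptive" aspect is deciding per design whether a reduction earns its hardware cost.

```python
# Standard range reductions for sin(x), log(x) and sqrt(x), in floating point
# for clarity; the paper's designs are fixed point, and 'adaptive' refers to
# deciding per design whether a reduction is worth its hardware cost.
import math

def sin_reduced(x: float) -> float:
    k = int(math.floor(x / (math.pi / 2)))          # quadrant index
    r = x - k * (math.pi / 2)                       # remainder in [0, pi/2)
    core = (math.sin, math.cos, math.sin, math.cos)[k % 4]
    s = core(r)
    return s if k % 4 in (0, 1) else -s             # fix sign per quadrant

def log_reduced(x: float) -> float:
    m, e = math.frexp(x)                            # x = m * 2^e, m in [0.5, 1)
    return math.log(m) + e * math.log(2.0)          # core only needs [0.5, 1)

def sqrt_reduced(x: float) -> float:
    m, e = math.frexp(x)
    if e % 2:                                       # make the exponent even
        m *= 2.0
        e -= 1
    return math.sqrt(m) * 2.0 ** (e // 2)           # core only needs [0.5, 2)

for x in (0.7, 12.34, 2000.0):
    assert abs(sin_reduced(x) - math.sin(x)) < 1e-9
    assert abs(log_reduced(x) - math.log(x)) < 1e-9
    assert abs(sqrt_reduced(x) - math.sqrt(x)) < 1e-9
print("range-reduced evaluations match the library functions")
```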

Proceedings Article•DOI•
23 May 2004
TL;DR: Quantitative comparison between using AMB and conventional FPGA block memory architectures demonstrates that this approach is promising.
Abstract: Current FPGAs include large blocks of memory that require separate address generation circuits. This not only uses logic resources surrounding the memory blocks, but also results in unnecessary routing congestion. This paper proposes the integration of the address generation circuit into the block memory to form an Autonomous Memory Block (AMB). Quantitative comparison between using AMB and conventional FPGA block memory architectures demonstrates that this approach is promising.
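The essence of an AMB is that address sequences for common access patterns are produced next to the memory rather than in the surrounding fabric. The sketch below is a software analogue of such a sequencer for two typical patterns, strided (column) access and rectangular-block raster access; the patterns and sizes are hypothetical illustrations, not the paper's circuit.

```python
# Software analogue of an autonomous address generator: the block memory is
# paired with a small sequencer producing its own address stream for common
# access patterns, instead of consuming surrounding FPGA logic and routing.
# The patterns and sizes below are hypothetical illustrations.
def strided_addresses(base: int, stride: int, count: int):
    """Strided scan, e.g. reading one column of a row-major image."""
    addr = base
    for _ in range(count):
        yield addr
        addr += stride

def block_addresses(width: int, block_w: int, block_h: int, x0: int, y0: int):
    """Raster scan of a rectangular block inside a row-major frame."""
    for y in range(y0, y0 + block_h):
        for x in range(x0, x0 + block_w):
            yield y * width + x

memory = list(range(64))                         # stand-in for a block RAM
column = [memory[a] for a in strided_addresses(base=3, stride=8, count=8)]
tile   = [memory[a] for a in block_addresses(width=8, block_w=2, block_h=2, x0=2, y0=1)]
print("column:", column)   # every 8th word starting at address 3
print("tile:  ", tile)     # 2x2 block at (2, 1)
```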

Book Chapter•DOI•
30 Aug 2004
TL;DR: The Customisable Modular Platform approach promotes modularity and design reuse by having multiple interoperable layers of design abstraction, while supporting advanced development and verification methods such as mixed-abstraction execution and efficient system-level simulation based on Transaction Level Modelling.
Abstract: This paper presents the Customisable Modular Platform (CMP) approach. The aim is to accelerate FPGA application development by raising the level of abstraction and facilitating design reuse. The solution is based on a network of Nodes communicating using a packet-based protocol. The approach is illustrated using SoftSONIC, a CMP for video applications. Our approach promotes modularity and design reuse by having multiple interoperable layers of design abstraction, while supporting advanced development and verification methods such as mixed-abstraction execution and efficient system-level simulation based on Transaction Level Modelling. The platform provides domain-specific abstractions and customisations of various elements such as communication protocols and topology, enabling exploitation of data locality and fine- and coarse-grain parallelism. The benefits of our approach are demonstrated using SoftSONIC for development of several real-time HDTV video processing applications.

Journal Article•
TL;DR: In this paper, the authors investigate the consequences of working with high-resolution images on FPGAs and derive a performance model to establish bounds on performance and to predict which optimisations may be fruitful.
Abstract: Film and video sequences are increasingly being digitised, allowing image processing operations to be applied to them in the digital domain. For film in particular, images are digitised at the limit of available scanners: each frame may contain 3000 by 2000 pixels, with 16 bits per colour channel. We investigate the consequences of working with these high-resolution images on FPGAs. We consider template matching and related algorithms, and derive a performance model to establish bounds on performance and to predict which optimisations may be fruitful. An architecture generator has been developed which can generate optimised implementations given image resolution, the FPGA platform architecture, and a description of the image processing algorithm.

Journal Article•
TL;DR: This paper presents a methodology and a partially automated implementation to select the best function evaluation hardware for a given function, accuracy requirement, technology mapping and optimization metrics, such as area, throughput and latency.
Abstract: Function evaluation is at the core of many compute-intensive applications which perform well on reconfigurable platforms. Yet, in order to implement function evaluation efficiently, the FPGA programmer has to choose between a multitude of function evaluation methods such as table lookup, polynomial approximation, or table lookup combined with polynomial approximation. In this paper, we present a methodology and a partially automated implementation to select the best function evaluation hardware for a given function, accuracy requirement, technology mapping and optimization metrics, such as area, throughput and latency. The automation of function evaluation unit design is combined with ASC, A Stream Compiler, for FPGAs. On the algorithmic side, MATLAB designs approximation algorithms with polynomial coefficients and minimizes bitwidths. On the hardware implementation side, ASC provides partially automated design space exploration. We illustrate our approach for sin(x), log(1+x) and 2^x with a selection of graphs that characterize the design space with various dimensions, including accuracy, precision and function evaluation method. We also demonstrate design space exploration by implementing more than 400 distinct designs.

Book Chapter•DOI•
Tim Todman, Wayne Luk
30 Aug 2004
TL;DR: An architecture generator has been developed which can generate optimised implementations given image resolution, the FPGA platform architecture, and a description of the image processing algorithm.
Abstract: Film and video sequences are increasingly being digitised, allowing image processing operations to be applied to them in the digital domain. For film in particular, images are digitised at the limit of available scanners: each frame may contain 3000 by 2000 pixels, with 16 bits per colour channel. We investigate the consequences of working with these high-resolution images on FPGAs. We consider template matching and related algorithms, and derive a performance model to establish bounds on performance and to predict which optimisations may be fruitful. An architecture generator has been developed which can generate optimised implementations given image resolution, the FPGA platform architecture, and a description of the image processing algorithm.

Book Chapter•DOI•
Wayne Luk
21 Jul 2004
TL;DR: Techniques and tools for customising processors at design time and at run time are reviewed, and the use of declarative and imperative languages for describing and customising data processors is explored.
Abstract: This paper reviews techniques and tools for customising processors at design time and at run time. We use several examples to illustrate customisation for particular application domains, and explore the use of declarative and imperative languages for describing and customising data processors. We then consider run-time customisation, which necessitates additional work at compile time such as production of multiple configurations for downloading at run time. The customisation of instruction processors and design tools is also discussed.

Book Chapter•DOI•
30 Aug 2004
TL;DR: The goal is to see how a fine-grained FPGA can be used to implement the pixel and vertex shader stages of a graphics pipeline, and how well the two methods compare in terms of speed, power, flexibility, and usability.
Abstract: There are many similarities between modern 3D graphics chips and FPGAs: a shader-based graphics chip can be viewed as a highly domain-specific and very coarse-grained reconfigurable logic device. Our goal is to see how a fine-grained FPGA can be used to implement the pixel and vertex shader stages of a graphics pipeline, and how well the two methods compare in terms of speed, power, flexibility, and usability, initially by implementing DirectX 9 pixel shader programs using Xilinx Virtex-II FPGAs.

Book Chapter•DOI•
30 Aug 2004
TL;DR: In this paper, the authors present a methodology and a partially automated implementation to select the best function evaluation hardware for a given function, accuracy requirement, technology mapping and optimization metrics, such as area, throughput and latency.
Abstract: Function evaluation is at the core of many compute-intensive applications which perform well on reconfigurable platforms. Yet, in order to implement function evaluation efficiently, the FPGA programmer has to choose between a multitude of function evaluation methods such as table lookup, polynomial approximation, or table lookup combined with polynomial approximation. In this paper, we present a methodology and a partially automated implementation to select the best function evaluation hardware for a given function, accuracy requirement, technology mapping and optimization metrics, such as area, throughput and latency. The automation of function evaluation unit design is combined with ASC, A Stream Compiler, for FPGAs. On the algorithmic side, MATLAB designs approximation algorithms with polynomial coefficients and minimizes bitwidths. On the hardware implementation side, ASC provides partially automated design space exploration. We illustrate our approach for sin(x), log(1+x) and 2^x with a selection of graphs that characterize the design space with various dimensions, including accuracy, precision and function evaluation method. We also demonstrate design space exploration by implementing more than 400 distinct designs.
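A software analogue of this design-space exploration is to ask, for one target accuracy, how large a direct lookup table must be versus how few segments a low-degree polynomial needs. The sketch below probes that trade-off for sin(x), using numpy.polyfit as a stand-in for the MATLAB coefficient-design stage; the accuracy target and the use of least-squares fits are assumptions for illustration.

```python
# Tiny design-space probe: direct table lookup vs degree-2 piecewise polynomial
# for sin(x) on [0, pi/2) at an example accuracy target. numpy.polyfit stands
# in for the MATLAB coefficient-design stage described in the paper.
import numpy as np

TARGET = 2 ** -12                      # example accuracy target (arbitrary)
LO, HI = 0.0, np.pi / 2
xs = np.linspace(LO, HI, 20001)

def table_error(entries: int) -> float:
    idx = np.minimum((xs - LO) / (HI - LO) * entries, entries - 1).astype(int)
    centres = LO + (idx + 0.5) * (HI - LO) / entries   # table stores sin at segment centres
    return float(np.max(np.abs(np.sin(centres) - np.sin(xs))))

def poly_error(segments: int, degree: int = 2) -> float:
    worst = 0.0
    edges = np.linspace(LO, HI, segments + 1)
    for a, b in zip(edges[:-1], edges[1:]):
        t = np.linspace(a, b, 256)
        coeffs = np.polyfit(t, np.sin(t), degree)      # least-squares fit per segment
        worst = max(worst, float(np.max(np.abs(np.polyval(coeffs, t) - np.sin(t)))))
    return worst

entries = 2
while table_error(entries) > TARGET:
    entries *= 2
segments = 1
while poly_error(segments) > TARGET:
    segments *= 2
print(f"direct lookup needs ~{entries} entries; "
      f"degree-2 polynomial needs ~{segments} segment(s) for error <= 2^-12")
```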

Proceedings Article•DOI•
Tim Todman1, Wayne Luk1•
06 Dec 2004
TL;DR: A partitioning scheme is presented which allows the FPGA to process images larger than its local, external memory banks; alternatively, the scheme facilitates exploitation of concurrency offered by multiple memory banks in FPGA systems.
Abstract: The rapid advance in imaging technology has led to the increasing availability of high-resolution digital images. For instance, the latest film scanners can produce more than 6 million pixels with 12-bit colour. We explore the opportunities of working with these high-resolution images on reconfigurable hardware, focusing on memory optimisations to support their effective processing. We consider template matching and related algorithms, and propose a runtime reconfiguration scheme which allows large template-matching algorithms to be implemented on small FPGAs. We present a partitioning scheme which allows the FPGA to process images larger than its local, external memory banks; alternatively, the scheme facilitates exploitation of concurrency offered by multiple memory banks in FPGA systems. We have developed parametric models to analyse these designs and to explore their potential. It is shown that, for instance, our column caching scheme can support linear speedup with respect to the number of columns, even when the reconfiguration time is large.
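The partitioning idea can be illustrated directly: process the image in vertical stripes narrow enough for one memory bank, overlapping adjacent stripes by the template width minus one column so no match positions are lost at stripe boundaries. The sketch below checks that striped matching reproduces full-frame matching; the image and template sizes are hypothetical, and sum-of-absolute-differences is used as a representative matching score.

```python
# Column/stripe partitioning for template matching on images larger than a
# single memory bank: process vertical stripes with (template_w - 1) columns
# of overlap so boundary matches are not lost. Sizes here are hypothetical.
import numpy as np

def match_sad(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Sum-of-absolute-differences scores for every valid template position."""
    th, tw = template.shape
    h, w = image.shape
    scores = np.empty((h - th + 1, w - tw + 1))
    for y in range(h - th + 1):
        for x in range(w - tw + 1):
            scores[y, x] = np.abs(image[y:y+th, x:x+tw] - template).sum()
    return scores

def match_in_stripes(image: np.ndarray, template: np.ndarray, stripe_w: int) -> np.ndarray:
    th, tw = template.shape
    pieces = []
    for x0 in range(0, image.shape[1] - tw + 1, stripe_w):
        stripe = image[:, x0:x0 + stripe_w + tw - 1]   # overlap of tw - 1 columns
        pieces.append(match_sad(stripe, template))
    return np.hstack(pieces)[:, :image.shape[1] - tw + 1]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(48, 96)).astype(np.int32)
tpl = img[10:18, 40:48]                                 # template cut from the image
full = match_sad(img, tpl)
striped = match_in_stripes(img, tpl, stripe_w=32)
assert np.array_equal(full, striped)                    # stripes lose no matches
print("best match at", np.unravel_index(np.argmin(full), full.shape))  # expect (10, 40)
```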

Proceedings Article•DOI•
20 Apr 2004
TL;DR: A structured system based design methodology which aims to increase productivity and exploit reconfigurability in large scale FPGAs is presented, exemplified by sonic-on-a-chip, a video image processing system.
Abstract: The ever-increasing quantity of logic resources combined with heterogeneous integrated performance-enhancing primitives in high-end FPGAs creates a design complexity challenge that requires new methodologies to address. We present a structured system based design methodology which aims to increase productivity and exploit reconfigurability in large scale FPGAs. The methodology is exemplified by Sonic-on-a-Chip, a video image processing system.

Proceedings Article•DOI•
06 Dec 2004
TL;DR: This work explains how parametric descriptions as abstractions for structured data access can be supported either as FPGA libraries targeting existing reconfigurable hardware devices, or as dedicated logic implementations forming autonomous memory blocks (AMBs).
Abstract: Many hardware designs, especially those for signal and image processing, involve structured data access such as queues, stacks and stripes. This work presents parametric descriptions as abstractions for such structured data access, and explains how these abstractions can be supported either as FPGA libraries targeting existing reconfigurable hardware devices, or as dedicated logic implementations forming autonomous memory blocks (AMBs). Scalable architectures that combine the address generation logic of multiple AMBs to provide larger storage with parallel data access are also examined. The effectiveness of this approach is illustrated with size and performance estimates for our FPGA libraries and dedicated logic implementations of AMBs. It is shown that for two-dimensional filtering, the dedicated AMBs can be 7 times smaller and 5 times faster than the FPGA libraries performing the same function.
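For the two-dimensional filtering case cited above, the "stripe" access pattern is essentially a set of line buffers: keeping the most recent K rows of the frame lets a K×K window be formed for every streamed pixel without re-reading the image. The sketch below is a software model of that abstraction with illustrative sizes; it shows the access pattern the AMB would generate, not the AMB hardware itself.

```python
# Software model of a 'stripe' (line-buffer) access pattern for 2D filtering:
# keeping the last K rows lets a KxK window be formed from a raster pixel
# stream without re-reading the frame. Window and frame sizes are illustrative.
from collections import deque

def windows(pixel_stream, width: int, k: int = 3):
    """Yield (row, col, KxK window) once enough rows and columns have streamed in."""
    rows = deque(maxlen=k)          # the K most recent image rows ("stripes")
    current = []
    row_idx = 0
    for px in pixel_stream:
        current.append(px)
        if len(current) == width:   # a full row has streamed in
            rows.append(current)
            current = []
            if len(rows) == k:
                for col in range(width - k + 1):
                    window = [r[col:col + k] for r in rows]
                    yield row_idx - k + 1, col, window
            row_idx += 1

# Example: 3x3 mean filter over an 8x8 ramp image delivered as a raster stream.
WIDTH = 8
stream = (x + y for y in range(WIDTH) for x in range(WIDTH))
for row, col, win in windows(stream, WIDTH):
    mean = sum(sum(r) for r in win) / 9.0
    if (row, col) == (0, 0):
        print(f"window at ({row}, {col}) has mean {mean}")   # expect 2.0
```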