
Showing papers by Wayne Luk published in 2002


Journal Article•DOI•
TL;DR: This paper compares three heuristic search algorithms: genetic algorithm (GA), simulated annealing (SA) and tabu search (TS), for hardware–software partitioning and shows that TS is superior to SA and GA in terms of both search time and quality of solutions.
Abstract: This paper compares three heuristic search algorithms: genetic algorithm (GA), simulated annealing (SA) and tabu search (TS), for hardware–software partitioning. The algorithms operate on functional blocks for designs represented as directed acyclic graphs, with the objective of minimising processing time under various hardware area constraints. The comparison involves a model for calculating processing time based on a non-increasing first-fit algorithm to schedule tasks, given that shared resource conflicts do not occur. The results show that TS is superior to SA and GA in terms of both search time and quality of solutions. In addition, we have implemented an intensification strategy in TS called penalty reward, which can further improve the quality of results.
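
To make the scheduling model concrete, here is a minimal Python sketch of a non-increasing first-fit style scheduler: tasks are sorted by non-increasing duration and each is placed on the hardware unit that becomes free first. The task durations and the two-unit setting are invented for illustration; this is not the authors' implementation.

    # Minimal sketch of a non-increasing first-fit style scheduler (illustrative only;
    # task durations and the number of hardware units are invented for the example).

    def schedule(durations, num_units):
        """Sort tasks by non-increasing duration, then place each one on the
        unit that becomes free first (greedy list scheduling)."""
        finish = [0.0] * num_units            # current finish time of each unit
        order = sorted(range(len(durations)), key=lambda t: -durations[t])
        assignment = {}
        for task in order:
            unit = min(range(num_units), key=lambda u: finish[u])  # earliest-free unit
            assignment[task] = unit
            finish[unit] += durations[task]
        return assignment, max(finish)        # mapping and overall processing time

    if __name__ == "__main__":
        durations = [4.0, 2.0, 7.0, 3.0, 5.0]  # hypothetical task times
        mapping, makespan = schedule(durations, num_units=2)
        print(mapping, makespan)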

142 citations


Proceedings Article•DOI•
16 Dec 2002
TL;DR: This work presents a novel approach to bitwidth- or precision-analysis for floating-point designs, which involves analysing the dataflow graph representation of a design to see how sensitive the output of a node is to changes in the outputs of other nodes: higher sensitivity requires higher precision and hence more output bits.
Abstract: Automatic bitwidth analysis is a key ingredient for high-level programming of FPGAs and high-level synthesis of VLSI circuits. The objective is to find the minimal number of bits to represent a value in order to minimise the circuit area and to improve efficiency of the respective arithmetic operations, while satisfying user-defined numerical constraints. We present a novel approach to bitwidth- or precision-analysis for floating-point designs. The approach involves analysing the dataflow graph representation of a design to see how sensitive the output of a node is to changes in the outputs of other nodes: higher sensitivity requires higher precision and hence more output bits. We automate such sensitivity analysis by a mathematical method called automatic differentiation, which involves differentiating variables in a design with respect to other variables. We illustrate our approach by optimising the bitwidth for two examples, a discrete Fourier transform (DFT) implementation and a Finite Impulse Response (FIR) filter implementation.
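
The sensitivity idea can be sketched with forward-mode automatic differentiation using dual numbers; the rule that converts a sensitivity into a bit count below is a simplified assumption for illustration, not the analysis used by the authors' tool.

    import math

    # Forward-mode automatic differentiation with dual numbers: each value carries
    # its derivative with respect to one chosen input. This is a generic sketch of
    # sensitivity analysis, not the tool described in the paper; the bit-allocation
    # rule at the end is a simplified assumption for illustration only.

    class Dual:
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, o):
            return Dual(self.val + o.val, self.dot + o.dot)
        def __mul__(self, o):
            return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

    def sensitivity(f, inputs, i):
        """Derivative of f's output with respect to input i at the given point."""
        duals = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(inputs)]
        return f(duals).dot

    def bits_needed(sens, target_error=1e-4):
        """Toy rule: more sensitive nodes need more fractional bits to keep the
        propagated error below the target (assumed rule, not the paper's)."""
        return max(1, math.ceil(math.log2(abs(sens) / target_error))) if sens else 1

    def example(x):               # a small dataflow graph: y = x0*x1 + x2
        return x[0] * x[1] + x[2]

    point = [3.0, 0.5, 2.0]
    for i in range(3):
        s = sensitivity(example, point, i)
        print(f"input {i}: sensitivity {s:+.2f}, ~{bits_needed(s)} fractional bits")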

101 citations


Proceedings Article•DOI•
22 Sep 2002
TL;DR: This paper presents an approach to the wordlength allocation and optimization problem for linear digital signal processing systems implemented in Field-Programmable Gate Arrays, and guarantees an optimum set of wordlengths for each internal variable.
Abstract: This paper presents an approach to the wordlength allocation and optimization problem for linear digital signal processing systems implemented in Field-Programmable Gate Arrays. The proposed technique guarantees an optimum set of wordlengths for each internal variable, allowing the user to trade off implementation area for error at system outputs. Optimality is guaranteed through modelling as a mixed integer linear program, constructed through novel techniques for the linearization of error and area constraints. Optimum results in this field are valuable since they can be used to assess the effectiveness of heuristic wordlength optimization techniques. It is demonstrated that one such previously published heuristic reaches within 0.7% of the optimum area over a range of benchmark problems.
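
A toy version of the trade-off looks like this: enumerate candidate wordlengths for two internal signals and keep the cheapest assignment that meets an output error budget. The area and error models are invented placeholders, and the paper solves the real problem as a mixed integer linear program rather than by exhaustive search.

    from itertools import product

    # Illustrative enumeration of wordlengths for two internal signals, picking the
    # smallest "area" that meets an output error bound. The area and error models
    # below are invented placeholders; the paper formulates the real problem as a
    # mixed integer linear program rather than an exhaustive search.

    def area(w1, w2):
        return 3 * w1 + 5 * w2                   # assumed linear area cost per bit

    def error(w1, w2):
        return 2.0 ** -w1 + 0.5 * 2.0 ** -w2     # assumed quantisation error at output

    best = None
    for w1, w2 in product(range(4, 17), repeat=2):
        if error(w1, w2) <= 2.0 ** -8:           # user-specified error budget
            if best is None or area(w1, w2) < area(*best):
                best = (w1, w2)

    print("chosen wordlengths:", best, "area:", area(*best))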

57 citations


Book Chapter•DOI•
A.A. Gaffar, Wayne Luk, Peter Y. K. Cheung, Nabeel Shirazi, James Hwang
02 Sep 2002
TL;DR: A method for customising the representation of floating-point numbers that exploits the flexibility of re-configurable hardware and can produce hardware that is smaller and faster when compared with a design adopting the reference representation.
Abstract: This paper describes a method for customising the representation of floating-point numbers that exploits the flexibility of reconfigurable hardware. The method determines the appropriate size of mantissa and exponent for each operation in a design, so that a cost function with a given error specification for the output relative to a reference representation can be satisfied. We adopt an iterative implementation of this method, which supports IEEE single-precision or double-precision floating-point representation as the reference representation. This implementation produces customised floating-point formats with arbitrary-sized mantissa and exponent. The tool follows a generic framework designed to cover a variety of arithmetic representations and their hardware implementations; both combinational and pipelined designs can be developed. Results show that, particularly for calculations involving large dynamic ranges, our tool can produce hardware that is smaller and faster when compared with a design adopting the reference representation.
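
The core step, rounding a value to a custom mantissa width and comparing it with the double-precision reference, can be sketched as follows; exponent-range handling (denormals, overflow) is deliberately omitted, so this is an illustration of the idea rather than the authors' tool.

    import math

    # Sketch of quantising a value to a custom floating-point format with a given
    # mantissa width, and measuring the error against the double-precision
    # reference. Exponent-range handling is simplified (no denormals or overflow
    # checks); this illustrates the idea, not the authors' tool.

    def quantise(x, mantissa_bits):
        if x == 0.0:
            return 0.0
        m, e = math.frexp(x)                     # x = m * 2**e with 0.5 <= |m| < 1
        scale = 2.0 ** mantissa_bits
        return math.ldexp(round(m * scale) / scale, e)

    reference = math.pi * 1e6                    # value with a large dynamic range
    for bits in (8, 12, 16, 23):
        approx = quantise(reference, bits)
        rel_err = abs(approx - reference) / abs(reference)
        print(f"{bits:2d}-bit mantissa: value {approx:.6f}, relative error {rel_err:.2e}")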

38 citations


Proceedings Article•DOI•
22 Sep 2002
TL;DR: This paper finds that the dynamic SA-TM design in a 50 MHz Virtex 1000E device, including reconfiguration time, can perform almost 7,000 times faster than a 1.4 GHz Pentium 4 PC when processing a 100×100 template on 300 consecutive video frames in HDTV format.
Abstract: This paper presents reconfigurable computing strategies for a Shape-Adaptive Template Matching (SA-TM) method to retrieve arbitrarily shaped objects within images or video frames. A generic systolic array architecture is proposed as the basis for comparing three designs: a static design where the configuration does not change after compilation, a partially-dynamic design where a static circuit can be reconfigured to use different on-chip data, and a dynamic design which completely adapts to a particular computation. While the logic resources required to implement the static and partially-dynamic designs are constant and depend only on the size of the search frame, the dynamic design is adapted to the size and shape of the template object, and hence requires much less area. The execution time of the matching process greatly depends on the number of frames against which the same object is matched. For a small number of frames, the dynamic and partially-dynamic designs suffer from high reconfiguration overheads. This overhead is significantly reduced if the matching process is repeated on a large number of consecutive frames. We find that the dynamic SA-TM design in a 50 MHz Virtex 1000E device, including reconfiguration time, can perform almost 7,000 times faster than a 1.4 GHz Pentium 4 PC when processing a 100×100 template on 300 consecutive video frames in HDTV format.
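
A plain software reference for the matching computation, with the sum of absolute differences accumulated only over pixels inside an arbitrary mask, is sketched below; the tiny frame, template and mask are made up, and the sketch says nothing about the systolic array itself.

    # Software reference for shape-adaptive template matching: sum of absolute
    # differences is accumulated only over pixels inside an arbitrary mask, for
    # every position in the search frame. This sketches the computation that the
    # systolic array performs in parallel; the tiny frame and template are made up.

    def sa_tm(frame, template, mask):
        fh, fw = len(frame), len(frame[0])
        th, tw = len(template), len(template[0])
        best = None
        for y in range(fh - th + 1):
            for x in range(fw - tw + 1):
                sad = sum(abs(frame[y + i][x + j] - template[i][j])
                          for i in range(th) for j in range(tw) if mask[i][j])
                if best is None or sad < best[0]:
                    best = (sad, (x, y))
        return best

    frame    = [[(x * y) % 7 for x in range(8)] for y in range(8)]
    template = [[0, 3], [2, 6]]
    mask     = [[1, 1], [0, 1]]      # L-shaped object: bottom-left pixel excluded
    print(sa_tm(frame, template, mask))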

35 citations


Proceedings Article•DOI•
T.K. Lee, S. Yusuf, Wayne Luk, A. Sloman, Emil Lupu, Naranker Dulay
16 Dec 2002
TL;DR: This work describes a framework, based on the high-level policy specification language Ponder, for capturing firewall rules as authorization policies with user-definable constraints, and supports optimisations to achieve efficient utilisation of hardware resources.
Abstract: High-performance firewalls can benefit from the increasing size, speed and flexibility of advanced reconfigurable hardware. However, direct translation of conventional firewall rules in a router-based rule set often leads to inefficient hardware implementation. Moreover, such low-level description of firewall rules tends to be difficult to manage and to extend. We describe a framework, based on the high-level policy specification language Ponder, for capturing firewall rules as authorization policies with user-definable constraints. Our framework supports optimisations to achieve efficient utilisation of hardware resources. A pipelined firewall implementation developed using this approach running at 10 MHz is capable of processing 2.5 million packets per second, which provides similar performance to a version without optimisation and is about 50 times faster than a software implementation running on a 700 MHz PIII processor.
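
The packet-matching step that such policies compile down to can be sketched as an ordered list of predicates where the first match wins; the rule fields and constraints below are hypothetical stand-ins rather than Ponder syntax.

    # Sketch of matching packets against an ordered rule set. The rule fields and
    # constraint predicates here are hypothetical stand-ins; they only illustrate
    # how high-level policies with user-definable constraints can reduce to simple
    # per-packet predicates that map well onto a hardware pipeline.

    RULES = [
        # (action, predicate over packet fields)
        ("deny",  lambda p: p["dst_port"] == 23),                       # block telnet
        ("allow", lambda p: p["proto"] == "tcp" and p["dst_port"] == 80),
        ("allow", lambda p: p["src_ip"].startswith("10.0.")),           # internal hosts
    ]

    def decide(packet, default="deny"):
        for action, pred in RULES:
            if pred(packet):        # first matching rule wins, as in a router rule set
                return action
        return default

    print(decide({"src_ip": "10.0.3.7", "dst_port": 22, "proto": "tcp"}))
    print(decide({"src_ip": "192.168.1.5", "dst_port": 80, "proto": "tcp"}))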

28 citations


Book Chapter•DOI•
02 Sep 2002
TL;DR: This paper explores run-time adaptation of Flexible Instruction Processors, a method for parametrising descriptions and development of instruction processors, and develops techniques to automatically customise a FIP to an application.
Abstract: This paper explores run-time adaptation of Flexible Instruction Processors (FIPs), a method for parametrising descriptions and development of instruction processors. The run-time adaptability of a FIP system allows it to evolve to suit the requirements of the user, by requesting automatic refinement based on instruction usage patterns. The techniques and tools that we have developed include: (a) a run-time environment that manages the reconfiguration of the FIP so that it can execute a given application as efficiently as possible; (b) mechanisms to accumulate run-time metrics, and analysis of the metrics to allow the run-time environment to request automatic refinements; (c) techniques to automatically customise a FIP to an application.
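
Item (b) can be sketched as simple opcode counting: flag the instructions that dominate the trace as candidates for refinement. The trace, opcode names and 30% threshold are invented for illustration.

    from collections import Counter

    # Sketch of the metric-gathering idea: count executed opcodes and flag the
    # hottest ones as candidates for refinement (e.g. a custom instruction or a
    # specialised datapath). The trace, opcode names and 30% threshold are invented
    # for illustration; the FIP run-time environment is not reproduced here.

    def refinement_candidates(trace, threshold=0.30):
        counts = Counter(trace)
        total = sum(counts.values())
        return [op for op, n in counts.most_common() if n / total >= threshold]

    trace = ["load", "mul", "add", "mul", "store", "mul", "add", "mul", "load", "mul"]
    print(refinement_candidates(trace))   # e.g. ['mul'] -> request a refined FIP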

24 citations


Book Chapter•DOI•
16 Dec 2002
TL;DR: This paper describes techniques for producing FPGA-based designs that support free-form deformation in medical image processing by using a B-spline algorithm for modelling three-dimensional deformable objects and adopting a customised number representation format in the implementation.
Abstract: This paper describes techniques for producing FPGA-based designs that support free-form deformation in medical image processing. The free-form deformation method is based on a B-spline algorithm for modelling three-dimensional deformable objects. Our design includes four optimisations. First, we store the values of a third-order B-spline model in lookup tables. Second, we adopt a customised number representation format in our implementation. Third, we transform a nested loop so that conditionals are moved outside the loop. Fourth, we pipeline the design to increase its throughput, and we also deploy multiple pipelines such that each covers a different image. Our design description, captured in the Handel-C language, is parameterisable at compile time to support a range of image resolutions and computational precisions. An implementation on a Xilinx XC2V6000 device would be capable of processing images of resolution up to 256 by 256 pixels in real time.
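
The first optimisation can be sketched in one dimension: tabulate the four cubic B-spline blending functions at a fixed number of fractional positions, then index the table when displacing a coordinate. The table size and the 1-D setting are simplifications chosen for illustration.

    # Sketch of the lookup-table idea for free-form deformation: the four cubic
    # B-spline basis functions are tabulated at a fixed number of fractional
    # positions, so the deformation kernel only indexes tables instead of
    # re-evaluating polynomials. One-dimensional case and table size are
    # simplifications chosen for illustration.

    STEPS = 16  # fractional resolution of the table (assumed)

    def basis(u):
        """The four uniform cubic B-spline blending functions at fraction u."""
        return ((1 - u) ** 3 / 6.0,
                (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
                (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
                u ** 3 / 6.0)

    TABLE = [basis(i / STEPS) for i in range(STEPS)]

    def deform(x, control, spacing=1.0):
        """Displace coordinate x using control-point offsets and the tabulated basis."""
        cell, frac = divmod(x / spacing, 1.0)
        b = TABLE[int(frac * STEPS)]
        i = int(cell)
        return x + sum(b[k] * control[(i + k) % len(control)] for k in range(4))

    control = [0.0, 0.5, -0.3, 0.8, 0.0, 0.2]   # made-up control-point displacements
    print(deform(2.4, control))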

15 citations


Proceedings Article•DOI•
22 Sep 2002
TL;DR: In this paper, a tabu search (TS) method with intensification strategy for hardware-software partitioning is presented, which operates on functional blocks for designs represented as directed acyclic graphs with the objective of minimizing processing time under various hardware area constraints.
Abstract: This paper presents a tabu search (TS) method with an intensification strategy for hardware-software partitioning. The algorithm operates on functional blocks for designs represented as directed acyclic graphs (DAG), with the objective of minimising processing time under various hardware area constraints. Results are compared to two other heuristic search algorithms: genetic algorithm (GA) and simulated annealing (SA). The comparison involves a scheduling model based on list scheduling for calculating processing time used as a system cost, assuming that shared resource conflicts do not occur. The results show that TS, which rarely appears for solving this kind of problem, is superior to SA and GA in terms of both search time and the quality of solutions. In addition, we have implemented an intensification strategy in TS called penalty reward, which can further improve the quality of results.
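
A skeleton of the tabu search itself, with moves that flip a block between software and hardware and a short tabu tenure on recently flipped blocks, is sketched below; the cost model, area limit and parameters are made up, and the penalty-reward intensification is omitted.

    # Skeleton of a tabu search for hardware/software partitioning: a move flips one
    # block between software (0) and hardware (1); recently flipped blocks are tabu
    # for a few iterations. The cost model, area limit and tenure below are made up
    # for illustration, and the paper's penalty-reward intensification is omitted.

    SW_TIME = [9, 4, 7, 3, 8, 5]     # processing time of each block in software
    HW_TIME = [2, 1, 3, 1, 2, 2]     # processing time of each block in hardware
    HW_AREA = [5, 3, 6, 2, 7, 4]     # area cost when mapped to hardware
    AREA_LIMIT = 15

    def cost(part):
        time = sum(HW_TIME[i] if p else SW_TIME[i] for i, p in enumerate(part))
        area = sum(HW_AREA[i] for i, p in enumerate(part) if p)
        return time + 1000 * max(0, area - AREA_LIMIT)   # penalise area violations

    def tabu_search(iters=200, tenure=3):
        part = [0] * len(SW_TIME)
        best, best_cost = part[:], cost(part)
        tabu = {}
        for it in range(iters):
            moves = [i for i in range(len(part)) if tabu.get(i, 0) <= it]
            i = min(moves, key=lambda i: cost(part[:i] + [1 - part[i]] + part[i+1:]))
            part[i] = 1 - part[i]
            tabu[i] = it + tenure
            if cost(part) < best_cost:
                best, best_cost = part[:], cost(part)
        return best, best_cost

    print(tabu_search())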

15 citations


Proceedings Article•DOI•
22 Sep 2002
TL;DR: Results show that, for calculations involving large dynamic ranges, the method can achieve significant hardware reduction and speed improvement with respect to a design adopting the reference representation.
Abstract: This paper describes a method for customising the representation of floating-point numbers that exploits the flexibility of reconfigurable hardware. The method determines the appropriate size of mantissa and exponent for each operation in a design, so that a cost function with a given error specification for the output relative to a reference representation can be satisfied. Currently our tool, which adopts an iterative implementation of this method, supports single- or double-precision floating-point representation as the reference representation. It produces customised floating-point formats with arbitrary-sized mantissa and exponent. Results show that, for calculations involving large dynamic ranges, our method can achieve significant hardware reduction and speed improvement with respect to a design adopting the reference representation.

14 citations


Proceedings Article•DOI•
16 Dec 2002
TL;DR: This work introduces a scalable FPGA-based architecture for executing inductive logic programs, such that the execution speed largely increases linearly with respect to the number of processors.
Abstract: Inductive logic programming systems are an emerging and powerful paradigm for machine learning which can make use of background knowledge to produce theories expressed in logic. They have been applied successfully to a wide range of problem domains, from protein structure prediction to satellite fault diagnosis. However, their execution can be computationally demanding. We introduce a scalable FPGA-based architecture for executing inductive logic programs, such that the execution speed largely increases linearly with respect to the number of processors. The architecture contains multiple processors derived from Warren's Abstract Machine, which has been optimised for hardware implementation using techniques such as instruction grouping and speculative assignment. The effectiveness of the architecture is demonstrated using the mutagenesis data set, which contains 12,000 facts about chemical compounds.

Book Chapter•DOI•
06 Nov 2002
TL;DR: In this paper, a functional specification of a procedure for compiling programs with relative placement information in Pebble, a simple language based on Structural VHDL, into programs with explicit placement coordinate information is presented.
Abstract: Placement information is useful in producing efficient circuit layout, especially for hardware libraries or for run-time reconfigurable designs. Relative placement information enables control of circuit layout at a higher level of abstraction than placement information in the form of explicit coordinates. We present a functional specification of a procedure for compiling programs with relative placement information in Pebble, a simple language based on Structural VHDL, into programs with explicit placement coordinate information. This procedure includes source-level transformation for compiling into descriptions that support conditional compilation based on symbolic placement constraints, a feature essential for parametrised library elements. Partial evaluation is used to optimise a description using relative placement to improve its size and speed. We illustrate our approach using a DES encryption design, which results in a 60% reduction in area and a 6% improvement in speed.
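
The compilation idea can be sketched with a toy representation: a design is either a leaf cell of known size or a beside/below composition, and a recursive pass assigns explicit coordinates. The Python encoding below is invented for illustration and does not reproduce Pebble syntax.

    # Toy illustration of compiling relative placement into explicit coordinates:
    # a design is either a leaf cell of known size or a 'beside'/'below' composition
    # of two sub-designs. This mirrors the idea of the compilation procedure but
    # uses an invented Python representation rather than Pebble/VHDL syntax.

    def layout(design, x=0, y=0):
        """Return (placements, width, height) with explicit (x, y) for every leaf."""
        kind = design[0]
        if kind == "leaf":
            _, name, w, h = design
            return [(name, x, y)], w, h
        _, left, right = design
        p1, w1, h1 = layout(left, x, y)
        if kind == "beside":                      # second block starts after the first
            p2, w2, h2 = layout(right, x + w1, y)
            return p1 + p2, w1 + w2, max(h1, h2)
        else:                                     # "below": second block on the next row
            p2, w2, h2 = layout(right, x, y + h1)
            return p1 + p2, max(w1, w2), h1 + h2

    adder  = ("leaf", "add", 2, 1)
    reg    = ("leaf", "reg", 1, 1)
    design = ("below", ("beside", adder, reg), ("leaf", "xor", 3, 1))
    print(layout(design))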

Proceedings Article•DOI•
02 Jul 2002
TL;DR: IGOL as mentioned in this paper is a framework for developing reconfigurable data processing applications, which adopts a four-layer architecture: application layer, operation layer, appliance layer and configuration layer.
Abstract: This paper describes IGOL, a framework for developing reconfigurable data processing applications. While IGOL was originally designed to target imaging and graphics systems, its structure is sufficiently general to support a broad range of applications. IGOL adopts a four-layer architecture: application layer, operation layer, appliance layer and configuration layer. This architecture is intended to separate and co-ordinate both the development and execution of hardware and software components. Hardware developers can use IGOL as an instance testbed for verification and benchmarking, as well as for distribution. Software application developers can use IGOL to discover hardware accelerated data processors, and to access them in a transparent, non-hardware specific manner. IGOL provides extensive support for the RC1000-PP board via the Handel-C language, and a wide selection of image processing filters have been developed. IGOL also supplies plug-ins to enable such filters to be incorporated in popular applications such as Premiere, Winamp, VirtualDub and DirectShow. Moreover, IGOL allows the automatic use of multiple cards to accelerate an application, demonstrated using DirectShow. To enable transparent acceleration without sacrificing performance, a three-tiered COM (Component Object Model) API has been designed and implemented. This API provides a well-defined and extensible interface which facilitates the development of hardware data processors that can accelerate multiple applications.

Proceedings Article•DOI•
16 Dec 2002
TL;DR: Methods to produce designs with many run-time parameters, which would otherwise require an impractical number of bitstreams to be generated at compile time, are developed.
Abstract: This paper explores representations and compilation of run-time parametrisable FPGA designs. We develop methods to produce designs with many run-time parameters, which would otherwise require an impractical number of bitstreams to be generated at compile time. Run-time parametrisation facilitates specialisation, which can be used to remove logic to produce a smaller and faster design. Our approach involves a source description based on Structural VHDL that allows designers to specify what parameters are available at compile time and at run time. Using this approach, converting a compile-time parameter into a run-time parameter or vice versa is straightforward. The source description does not contain explicit information on how to modify the design at run time. We describe a compilation scheme that can be used to extract this information, generate a run-time representation of the design and rapidly instantiate this representation at run time. We present techniques that allow a parametrised design to be incrementally modified in order to minimise the reconfiguration overhead. Our compiler implementation generates a Java program that uses the JBits API to implement the run-time representation and functions to incrementally modify the design. DES and AES encryption designs are used to illustrate our approach.

Proceedings Article•DOI•
H. Styles, Wayne Luk
22 Sep 2002
TL;DR: A parameterised hardware design pattern, captured in the Handel-C language, enables rapid exploration of the area/throughput design space for simple pipelines; it is used to determine speedup and resource usage on a range of Xilinx Virtex FPGA devices, and to examine future trends in performance.
Abstract: We describe a feasibility study into accelerating computer graphics radiosity calculations using reconfigurable hardware. A modular hardware/software codesign framework has been developed for experimenting with hardware acceleration of a time-consuming step: form-factor determination. We describe a parameterised hardware design pattern, captured in the Handel-C language, which enables rapid exploration of the area/throughput design space for simple pipelines. Using this pattern we determine speedup and resource usage on a range of Xilinx Virtex FPGA devices, and examine future trends in performance. As a sample of these results we demonstrate a 76 times speed-up over a 1.4 GHz Athlon PC using a Xilinx XCV2000E and, based on place and route reports, estimate a 31 times speed-up using a Xilinx XC2V8000.
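
The kernel being accelerated is the differential form-factor term, dF = cos(theta_i) cos(theta_j) / (pi r^2) dA_j; a point-to-point sketch of it is given below with invented patch data, purely to make the pipelined computation concrete.

    import math

    # Point-to-point approximation of the radiosity form factor between two small
    # patches, the kernel that dominates form-factor determination:
    #     dF = cos(theta_i) * cos(theta_j) / (pi * r^2) * dA_j
    # This is the standard textbook term, shown only to make the computation
    # concrete; the patch positions, normals and area below are invented.

    def form_factor(pi, ni, pj, nj, area_j):
        d = [pj[k] - pi[k] for k in range(3)]          # vector from patch i to patch j
        r2 = sum(c * c for c in d)
        r = math.sqrt(r2)
        cos_i = max(0.0, sum(ni[k] * d[k] for k in range(3)) / r)
        cos_j = max(0.0, -sum(nj[k] * d[k] for k in range(3)) / r)
        return cos_i * cos_j * area_j / (math.pi * r2)

    print(form_factor(pi=(0, 0, 0), ni=(0, 0, 1),
                      pj=(0.5, 0.0, 2.0), nj=(0, 0, -1), area_j=0.01))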

Proceedings Article•DOI•
16 Dec 2002
TL;DR: This paper shows how PD-XML specifications can be translated into appropriate machine descriptions for the parametric HPL-PD VLIW processor, and for the Flexible Instruction Processor (FIP) approach targeting reconfigurable implementations.
Abstract: This paper introduces PD-XML, a meta-language for describing instruction processors in general and with an emphasis on embedded processors, with the specific aim of enabling their rapid prototyping, evaluation and eventual design and implementation. PD-XML is not specific to any one architecture, compiler or simulation environment and hence provides greater flexibility than related machine description methodologies. We demonstrate how PD-XML can be interfaced to existing description methodologies and tool-flows. In particular we show how PD-XML specifications can be translated into appropriate machine descriptions for the parametric HPL-PD VLIW processor, and for the Flexible Instruction Processor (FIP) approach targeting reconfigurable implementations.

Proceedings Article•DOI•
22 Sep 2002
TL;DR: An approach based on motion vectors is proposed and is found to be successful in restoring the video sequence for any affine transform based distortion.
Abstract: This paper is concerned with the image registration problem as applied to video sequences that have been subjected to geometric distortions. This work involves the development of a computationally efficient algorithm to restore the video sequence using image registration techniques. An approach based on motion vectors is proposed and is found to be successful in restoring the video sequence for any affine transform based distortion. The algorithm is implemented in FPGA hardware targeted for a reconfigurable computing platform called SONIC. It is shown that the algorithm can efficiently restore the video data in real time.
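
The registration step can be sketched as a least-squares fit of the six affine parameters to block motion vectors; the motion vectors below are synthesised from a known transform so the fit can be checked, and the sketch ignores the FPGA implementation entirely.

    import numpy as np

    # Sketch of recovering an affine distortion from block motion vectors by least
    # squares: each block centre (x, y) maps to (x', y') = A @ [x, y, 1]. The motion
    # vectors below are synthesised from a known transform so the fit can be checked;
    # this only illustrates the registration step, not the FPGA implementation.

    true_A = np.array([[1.02, 0.01, 3.0],
                       [-0.01, 0.98, -2.0]])          # assumed distortion

    points = np.array([[x, y] for x in (0, 80, 160, 240) for y in (0, 60, 120, 180)],
                      dtype=float)
    homog = np.hstack([points, np.ones((len(points), 1))])
    moved = homog @ true_A.T                          # where each block ends up

    est_A, *_ = np.linalg.lstsq(homog, moved, rcond=None)
    print("estimated affine parameters:\n", est_A.T)  # invert this to restore frames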

Proceedings Article•DOI•
01 Jan 2002
TL;DR: This work explores various ways of mapping Strassen's algorithm into reconfigurable hardware that contains one or more customisable instruction processors, taking advantage of the additional logic and memory blocks available on a reconfigured platform.
Abstract: Strassen's algorithm is an efficient method for multiplying large matrices. We explore various ways of mapping Strassen's algorithm into reconfigurable hardware that contains one or more customisable instruction processors. Our approach has been implemented using Nios processors with custom instructions and with custom-designed coprocessors, taking advantage of the additional logic and memory blocks available on a reconfigurable platform.
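
For reference, one level of the Strassen recursion (seven sub-products instead of eight) looks like this in plain software; it is the textbook formulation and says nothing about the Nios custom-instruction mapping.

    # Reference implementation of Strassen's algorithm for square matrices whose
    # size is a power of two; the base case falls back to scalar multiplication.
    # This is the textbook formulation, included only to make the mapped algorithm
    # concrete; it says nothing about the Nios custom-instruction implementation.

    def madd(A, B):  return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    def msub(A, B):  return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

    def strassen(A, B):
        n = len(A)
        if n == 1:
            return [[A[0][0] * B[0][0]]]
        h = n // 2
        q = lambda M, r, c: [row[c*h:(c+1)*h] for row in M[r*h:(r+1)*h]]
        A11, A12, A21, A22 = q(A,0,0), q(A,0,1), q(A,1,0), q(A,1,1)
        B11, B12, B21, B22 = q(B,0,0), q(B,0,1), q(B,1,0), q(B,1,1)
        M1 = strassen(madd(A11, A22), madd(B11, B22))
        M2 = strassen(madd(A21, A22), B11)
        M3 = strassen(A11, msub(B12, B22))
        M4 = strassen(A22, msub(B21, B11))
        M5 = strassen(madd(A11, A12), B22)
        M6 = strassen(msub(A21, A11), madd(B11, B12))
        M7 = strassen(msub(A12, A22), madd(B21, B22))
        C11 = madd(msub(madd(M1, M4), M5), M7)
        C12 = madd(M3, M5)
        C21 = madd(M2, M4)
        C22 = madd(msub(madd(M1, M3), M2), M6)
        top = [r1 + r2 for r1, r2 in zip(C11, C12)]
        bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
        return top + bot

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(strassen(A, B))   # [[19, 22], [43, 50]]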

Proceedings Article•DOI•
02 Jul 2002
TL;DR: An approach for optimizing hardware designs produced from software languages extended with constructs for parallel execution and hardware processing, such as the Handel-C language, by applying transformations that include the appropriate amount of parallelism by developing an algorithm for sequentialising parallel programs.
Abstract: This paper describes an approach for optimizing hardware designs produced from software languages extended with constructs for parallel execution and hardware processing, such as the Handel-C language. Our aim is to optimize these programs by applying transformations that include the appropriate amount of parallelism, in order to obtain the best trade-offs in space and in time. These transformations can be applied automatically at compile time, enabling the programmer to adapt parallel programs rapidly to a specific hardware platform. Our transformational approach, which involves design sequentialisation and parallelisation, contains two novel features. First, we develop an algorithm for sequentialising parallel programs. This algorithm relaxes the scheduling of the original design, giving a scheduler the freedom to arrange it to achieve better results in speed, in size, or in both. Second, we combine this sequentialisation algorithm with pipeline vectorization, a technique known to reduce the execution delay of loops by pipelining the loop body. We adapt several transformation techniques used in vectorizing and parallelizing software compilers, such as loop unrolling and loop tiling, to widen the applicability of our method. Results show that our approach often works well: for instance a manually pipelined convolution design, for implementation in a Xilinx XC4000 device produced from a Handel-C description, is speeded up by over 2 times by our prototype compiler.
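
The flavour of the unrolling trade-off can be sketched with a crude cost model: unrolling by a factor U replicates the loop body (more hardware) while dividing the trip count (fewer cycles). Both the convolution and the cycle/area bookkeeping below are invented for illustration; the actual transformations operate on Handel-C programs.

    def convolve(signal, taps):
        """Plain sequential convolution: the starting point for the transformations."""
        n, k = len(signal), len(taps)
        return [sum(taps[j] * signal[i + j] for j in range(k)) for i in range(n - k + 1)]

    def cost_model(n_outputs, n_taps, unroll):
        """Crude bookkeeping: unrolled copies run in parallel, so cycles drop while
        the number of multiply-accumulate units grows with the unroll factor."""
        cycles = (n_outputs * n_taps + unroll - 1) // unroll
        area = n_taps * unroll
        return cycles, area

    signal = [1, 2, 3, 4, 5, 6, 7, 8]
    taps = [1, 0, -1]
    print(convolve(signal, taps))
    for u in (1, 2, 4):
        c, a = cost_model(len(signal) - len(taps) + 1, len(taps), u)
        print(f"unroll factor {u}: ~{c} cycles, ~{a} MAC units")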

Book Chapter•DOI•
02 Sep 2002
TL;DR: A motion vector based approach is used and found to be successful in restoring the video sequence for any global perspective transform based distortion.
Abstract: This paper is concerned with the image registration problem as applied to real-time video. It describes the development of a computationally efficient algorithm to restore broadcast quality video sequences using image registration techniques. A motion vector based approach is used and found to be successful in restoring the video sequence for any global perspective transform based distortion. The algorithm is implemented on a reconfigurable computing platform called UltraSONIC in a hardware/software codesign environment. It is shown that the algorithm can accurately restore video data in real-time.

Proceedings Article•DOI•
02 Jul 2002
TL;DR: A fully-pipelined design has been developed in the Handel-C language, which can perform image warping in real time for resolutions up to 256 by 256 pixels on a Xilinx XC2V6000 device and is parameterisable at compile time for different image resolutions.
Abstract: This paper describes reconfigurable computing techniques for optimising image warping designs. Our image warping algorithm is based on radial basis functions, which enable the warping effect to be specified in terms of feature points. The coefficients of the warping function are obtained from the Symmetric Bipartite Table Method (SBTM), and the lookup tables can be generated dynamically at run time. We have deployed an optimised number representation involving both custom integer and custom floating-point formats in computing the radial function approximation. Furthermore, a fully-pipelined design has been developed in the Handel-C language, which can perform image warping in real time for resolutions up to 256 by 256 pixels on a Xilinx XC2V6000 device. This design is parameterisable at compile time for different image resolutions. Currently our implementation on a Xilinx Virtex XCV1000 device for the RC1000-PP platform runs 50% faster than a software version on an AMD Athlon 1.4 GHz PC. A faster data bus and a larger FPGA for the RC1000-PP platform can result in a further speed improvement of over ten times.
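
The feature-point mechanism can be sketched directly: solve a small linear system so that each source feature point is displaced onto its target, then evaluate the same radial-basis sum at any pixel. The Gaussian kernel, its width and the feature points are assumptions, and the SBTM table approximation is not reproduced.

    import numpy as np

    # Sketch of feature-point-driven warping with radial basis functions: solve a
    # small linear system so that each source feature point is displaced onto its
    # target, then evaluate the same RBF sum at any pixel coordinate. The Gaussian
    # kernel, its width and the feature points are assumptions for illustration;
    # the paper additionally approximates the radial function with SBTM tables.

    def rbf(r, width=50.0):
        return np.exp(-(r / width) ** 2)

    src = np.array([[30.0, 30.0], [200.0, 40.0], [120.0, 220.0]])   # feature points
    dst = np.array([[38.0, 26.0], [190.0, 55.0], [125.0, 210.0]])   # where they move

    dist = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=2)
    coeff = np.linalg.solve(rbf(dist), dst - src)    # one coefficient pair per point

    def warp(p):
        w = rbf(np.linalg.norm(src - p, axis=1))
        return p + w @ coeff                          # displaced pixel coordinate

    print(warp(np.array([100.0, 100.0])))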

Proceedings Article•DOI•
16 Dec 2002
TL;DR: The key elements of this approach include abstractions and tools based on high-level descriptions, and facilities for optimizations such as domain-specific data partitioning and run-time reconfiguration.
Abstract: We present an incremental approach to developing programs for reconfigurable engines, systems which contain both instruction processors and reconfigurable hardware. The purpose is to support rapid production of prototypes, as well as their further systematic refinement and adaptation when required. The key elements of our approach include abstractions and tools based on high-level descriptions, and facilities for optimizations such as domain-specific data partitioning and run-time reconfiguration. The application of our approach is illustrated using the SONIC reconfigurable engine, which contains a multi-FPGA card in a PC system designed for video image processing.

Proceedings Article•DOI•
16 Dec 2002
TL;DR: This paper presents a novel approach that focuses on rapid development and maintenance of optimised hardware designs using a high-level parallel language using an existing timing model that states, for instance, that every assignment executes in one clock cycle.
Abstract: This paper presents a novel approach that focuses on rapid development and maintenance of optimised hardware designs using a high-level parallel language. We use an existing timing model that states, for instance, that every assignment executes in one clock cycle. This strict timing model gives users control over design scheduling, such as managing the number of cycles and cycle time. Our main contribution is the introduction of a flexible timing model that abstracts optimisation details by supporting high-level transformations and automatic scheduling. Furthermore, we provide techniques that unschedule parallel designs, so that they can be rescheduled to meet new performance and hardware constraints, making designs as implementation independent as possible. With both models, manual development and computerised optimisation can be interleaved to achieve the best effect. Our approach is illustrated by a case study where we port a pipelined convolver to another platform, and achieve either a 300% speedup or a 50% reduction in resource usage.