Showing papers by "Wayne Luk published in 2005"


Journal Article•DOI•
25 Jul 2005
TL;DR: It is shown that reconfigurable computing designs are capable of achieving up to 500 times speedup and 70% energy savings over microprocessor implementations for specific applications.
Abstract: Reconfigurable computing is becoming increasingly attractive for many applications. This survey covers two aspects of reconfigurable computing: architectures and design methods. The paper includes recent advances in reconfigurable architectures, such as the Altera Stratix II and Xilinx Virtex-4 FPGA devices. The authors identify major trends in general-purpose and special-purpose design methods. It is shown that reconfigurable computing designs are capable of achieving up to 500 times speedup and 70% energy savings over microprocessor implementations for specific applications.

414 citations


Proceedings Article•DOI•
11 Dec 2005
TL;DR: A methodology for supporting dynamic voltage scaling (DVS) on commercial FPGAs is described and experiments using this technique on various circuits at different clock frequencies and temperatures are described to demonstrate its utility and robustness.
Abstract: A methodology for supporting dynamic voltage scaling (DVS) on commercial FPGAs is described. A logic delay measurement circuit (LDMC) is used to determine the speed of an inverter chain for various operating conditions at run time. A desired LDMC value, intended to match the critical path of the operating circuit plus a safety margin, is then chosen; a closed loop control scheme is used to maintain the desired LDMC value as chip temperature changes, by automatically adjusting the voltage applied to the FPGA. We describe experiments using this technique on various circuits at different clock frequencies and temperatures to demonstrate its utility and robustness. Power savings between 4% and 54% for the VINT supply are observed.

126 citations
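
The closed-loop voltage control scheme above can be modelled in a few lines of software. The sketch below is purely conceptual: the delay model, controller gain and supply limits are invented for illustration and are not taken from the paper.

```python
def logic_delay_ns(v_dd, temp_c):
    """Toy delay model: delay grows with temperature and shrinks as the
    supply voltage rises (coefficients are invented for illustration)."""
    return 10.0 * (1.0 + 0.001 * (temp_c - 25.0)) / (v_dd - 0.4)

def regulate(target_delay_ns, temp_c, v_init=1.5, gain=0.05, steps=200):
    """Proportional controller: nudge v_dd until the measured delay
    settles at the target (critical path plus safety margin)."""
    v = v_init
    for _ in range(steps):
        error = logic_delay_ns(v, temp_c) - target_delay_ns
        v += gain * error            # logic too slow -> raise the voltage
        v = min(max(v, 0.9), 1.6)    # clamp to a legal supply range
    return v

# As the chip heats up, the loop settles at a higher supply voltage
# to keep the measured delay at the target.
v_cool = regulate(9.0, temp_c=25.0)
v_hot = regulate(9.0, temp_c=85.0)
```

The same feedback structure, with the LDMC supplying the delay measurement and an external regulator supplying the voltage step, is what the paper implements around a real FPGA.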


Proceedings Article•DOI•
11 Dec 2005
TL;DR: A novel hardware accelerator for Monte Carlo (MC) simulation, based on a generic architecture, which combines speed and flexibility by integrating a pipelined MC core with an on-chip instruction processor is described.
Abstract: This paper describes a novel hardware accelerator for Monte Carlo (MC) simulation, and illustrates its implementation in field programmable gate array (FPGA) technology for speeding up financial applications. Our accelerator is based on a generic architecture, which combines speed and flexibility by integrating a pipelined MC core with an on-chip instruction processor. We develop a generic number system representation for determining the choice of number representation that meets numerical precision requirements. Our approach is then used in a complex financial engineering application, involving the Brace, Gatarek and Musiela (BGM) interest rate model for pricing derivatives. We address, in our BGM model, several challenges including the generation of Gaussian distributed random numbers and pipelining of the MC simulation. Our BGM application, based on an off-the-shelf system with a Xilinx XC2VP30 device at 50 MHz, is over 25 times faster than software running on a 1.5 GHz Intel Pentium machine.

108 citations
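
The Monte Carlo structure the accelerator pipelines (Gaussian variates driving simulated paths whose discounted average payoff is the price) can be sketched in software. The example below uses a simple geometric Brownian motion model and a European call for illustration only; the paper's BGM interest-rate model is considerably more involved.

```python
import math
import random

def mc_call_price(s0, strike, rate, vol, maturity, n_paths, seed=1):
    """Monte Carlo price of a European call under geometric Brownian motion."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)   # the Gaussian variate generated in hardware
        s_t = s0 * math.exp((rate - 0.5 * vol * vol) * maturity
                            + vol * math.sqrt(maturity) * z)
        total += max(s_t - strike, 0.0)
    return math.exp(-rate * maturity) * total / n_paths

# Each path is independent, which is what makes the per-path loop body
# a natural candidate for a deep hardware pipeline.
price = mc_call_price(s0=100.0, strike=100.0, rate=0.05, vol=0.2,
                      maturity=1.0, n_paths=100_000)
```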


Journal Article•DOI•
TL;DR: The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.
Abstract: This paper presents a method for producing hardware designs for elliptic curve cryptography (ECC) systems over the finite field GF(2^m), using the optimal normal basis for the representation of numbers. Our field multiplier design is based on a parallel architecture containing multiple m-bit serial multipliers; by changing the number of such serial multipliers, designers can obtain implementations with different tradeoffs in speed, size and level of security. A design generator has been developed which can automatically produce a customised ECC hardware design that meets user-defined requirements. To facilitate performance characterization, we have developed a parametric model for estimating the number of cycles for our generic ECC architecture. The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.

92 citations
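
The paper's field multipliers work bit-serially in GF(2^m) with an optimal normal basis; as a simpler illustration of the same shift-and-add idea, the sketch below multiplies in GF(2^8) with a polynomial basis (the AES field), processing one multiplier bit per iteration much as an m-bit serial multiplier processes one bit per clock cycle.

```python
def gf256_mul(a, b, poly=0x11B):
    """Bit-serial multiplication in GF(2^8): one multiplier bit per step,
    with shift-and-XOR accumulation and reduction by the field polynomial
    x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # "add" (XOR) the shifted multiplicand
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly            # reduce modulo the field polynomial
    return result
```

For example, 0x53 and 0xCA are multiplicative inverses in this field, so `gf256_mul(0x53, 0xCA)` yields 1. A normal-basis multiplier, as used in the paper, replaces the reduction step with cyclic shifts, which is what makes it attractive in hardware.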


Journal Article•DOI•
TL;DR: A hardware Gaussian noise generator based on the Wallace method used for a hardware simulation system that accurately models a true Gaussian probability density function even at high σ values is described.
Abstract: We describe a hardware Gaussian noise generator based on the Wallace method used for a hardware simulation system. Our noise generator accurately models a true Gaussian probability density function even at high σ values. We evaluate its properties using: 1) several different statistical tests, including the chi-square test and the Anderson-Darling test and 2) an application for decoding of low-density parity-check (LDPC) codes. Our design is implemented on a Xilinx Virtex-II XC2V4000-6 field-programmable gate array (FPGA) at 155 MHz; it takes up 3% of the device and produces 155 million samples per second, which is three times faster than a 2.6-GHz Pentium-IV PC. Another implementation on a Xilinx Spartan-III XC3S200E-5 FPGA at 106 MHz is two times faster than the software version. Further improvement in performance can be obtained by concurrent execution: 20 parallel instances of the noise generator on an XC2V4000-6 FPGA at 115 MHz can run 51 times faster than software on a 2.6-GHz Pentium-IV PC.

91 citations


Proceedings Article•DOI•
10 Oct 2005
TL;DR: An efficient hardware implementation of a median filter is presented, which offers a realisable way of efficiently implementing large-windowed median filtering, as required by transforms such as the Trace Transform.
Abstract: An efficient hardware implementation of a median filter is presented. Input samples are used to construct a cumulative histogram, which is then used to find the median. The resource usage of the design is independent of window size, but rather, dependent on the number of bits in each input sample. This offers a realisable way of efficiently implementing large-windowed median filtering, as required by transforms such as the Trace Transform. The method is then extended to weighted median filtering. The designs are synthesised for a Xilinx Virtex II FPGA and the performance and area compared to another implementation for different sized windows. Intentional use of the heterogeneous resources on the FPGA in the design allows for a reduction in slice usage and high throughput.

80 citations
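
The cumulative-histogram median above is easy to model in software: the histogram has one bin per sample value, so its cost depends on the number of sample bits rather than on the window size, which is what makes large windows cheap in hardware. A minimal sketch for 8-bit samples:

```python
def histogram_median(window, bits=8):
    """Median of unsigned `bits`-bit samples via a cumulative histogram."""
    hist = [0] * (1 << bits)
    for s in window:
        hist[s] += 1
    threshold = (len(window) + 1) // 2   # rank of the median sample
    running = 0
    for value, count in enumerate(hist):
        running += count                 # cumulative histogram
        if running >= threshold:
            return value

window = [12, 200, 7, 7, 90, 13, 90, 7, 55]   # a 3x3 neighbourhood
```

Weighted median filtering, as in the paper's extension, would simply add each sample's weight to its bin instead of 1.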


Proceedings Article•DOI•
10 Oct 2005
TL;DR: An architecture and implementation of a high performance Gaussian random number generator (GRNG) is described and the resulting system can generate 169 million normally distributed random numbers per second on a Xilinx XC2VP30-6 device.
Abstract: An architecture and implementation of a high performance Gaussian random number generator (GRNG) is described. The GRNG uses the Ziggurat algorithm which divides the area under the probability density function into three regions (rectangular, wedge and tail). The rejection method is then used and this amounts to determining whether a random point falls into one of the three regions. The vast majority of points lie in the rectangular region and are accepted to directly produce a random variate. For the nonrectangular regions, which occur 1.5% of the time, the exponential or logarithm functions must be computed and an iterative fixed point operation unit is used. Computation of the rectangular region is heavily pipelined and a buffering scheme is used to allow the processing of rectangular regions to continue to operate in parallel with evaluation of the wedge and tail computation. The resulting system can generate 169 million normally distributed random numbers per second on a Xilinx XC2VP30-6 device.

77 citations
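
The accept/reject split described above (a cheap common region plus a rare region needing exp/log) can be illustrated with a simplified two-region sampler. This is a hedged stand-in, not the Ziggurat itself: a real ziggurat uses on the order of 256 precomputed rectangular layers so that most samples need only a table lookup and a compare.

```python
import math
import random

def gauss_two_region(rng, r=3.0):
    """Two-region rejection sampler for the standard normal: a common
    region on [-r, r] and a rare tail region needing log() evaluations."""
    phi0 = 1.0 / math.sqrt(2.0 * math.pi)
    if rng.random() < math.erfc(r / math.sqrt(2.0)):   # P(|X| > r)
        # Rare region: Marsaglia's tail algorithm (log evaluations, the
        # software analogue of the paper's iterative fixed-point unit).
        while True:
            x = -math.log(1.0 - rng.random()) / r
            y = -math.log(1.0 - rng.random())
            if 2.0 * y > x * x:
                break
        return r + x if rng.random() < 0.5 else -(r + x)
    # Common region: rejection inside the box [-r, r] x [0, phi(0)].
    # A ziggurat replaces this density test with a table lookup that
    # succeeds without any exp() for the vast majority of samples.
    while True:
        x = rng.uniform(-r, r)
        if rng.uniform(0.0, phi0) < phi0 * math.exp(-0.5 * x * x):
            return x

rng = random.Random(7)
samples = [gauss_two_region(rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum(x * x for x in samples) / len(samples)
```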


Journal Article•DOI•
TL;DR: Over 2,000 placed-and-routed FPGA designs are implemented, resulting in over 100 million application-specific integrated circuit (ASIC) equivalent gates, and optimal function evaluation results for range and precision combinations between 8 and 48 bits are provided.
Abstract: We present a methodology and an automated system for function evaluation unit generation. Our system selects the best function evaluation hardware for a given function, accuracy requirements, technology mapping, and optimization metrics, such as area, throughput, and latency. Function evaluation f(x) typically consists of range reduction and the actual evaluation on a small convenient interval such as [0, π/2) for sin(x). We investigate the impact of hardware function evaluation with range reduction for a given range and precision of x and f(x) on area and speed. An automated bit-width optimization technique for minimizing the sizes of the operators in the data paths is also proposed. We explore a vast design space for fixed-point sin(x), log(x), and √x accurate to one unit in the last place using MATLAB and ASC, a stream compiler for field-programmable gate arrays (FPGAs). In this study, we implement over 2,000 placed-and-routed FPGA designs, resulting in over 100 million application-specific integrated circuit (ASIC) equivalent gates. We provide optimal function evaluation results for range and precision combinations between 8 and 48 bits.

68 citations
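
The range-reduction-plus-core-evaluation structure described above can be sketched for sin(x): reduce the argument to a quadrant of [0, 2π), evaluate an approximation on [0, π/2), then reconstruct the sign. The degree-7 Taylor polynomial here is only a stand-in for the tuned approximations the paper's generator would select.

```python
import math

def poly_sin_core(y):
    """Degree-7 odd polynomial for sin on the reduced interval [0, pi/2)."""
    y2 = y * y
    return y * (1.0 - y2 / 6.0 + y2 * y2 / 120.0 - y2 * y2 * y2 / 5040.0)

def reduced_sin(x):
    """Range-reduce x into a quadrant, evaluate the core, fix the sign."""
    quadrant, y = divmod(x % (2.0 * math.pi), math.pi / 2.0)
    quadrant = int(quadrant)
    if quadrant == 0:
        return poly_sin_core(y)                      # sin(y)
    if quadrant == 1:
        return poly_sin_core(math.pi / 2.0 - y)      # sin(pi - x) symmetry
    if quadrant == 2:
        return -poly_sin_core(y)                     # odd symmetry
    return -poly_sin_core(math.pi / 2.0 - y)
```

The hardware trade-off the paper studies is exactly this split: a wider reduced interval shrinks the reduction logic but demands a larger (higher-degree or table-based) core evaluator.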


Proceedings Article•DOI•
10 Oct 2005
TL;DR: A flexible processor and compiler generation system, FPGA implementations of CUSTARD and performance/area results for media and cryptography benchmarks are presented.
Abstract: We propose CUSTARD - customisable threaded architecture - a soft processor design space that combines support for multiple hardware threads and automatically generated custom instructions. Multiple threads incur low additional hardware cost and allow fine-grained concurrency without multiple processor cores or software overhead. Custom instructions, generated for a specific application, accelerate frequently performed computations by implementing them as dedicated hardware. In this paper we present a flexible processor and compiler generation system, FPGA implementations of CUSTARD and performance/area results for media and cryptography benchmarks.

63 citations


Proceedings Article•DOI•
13 Jun 2005
TL;DR: This work describes methods to minimize both the integer and fraction parts of fixed-point signals with the aim of minimizing circuit area and employs a semi-analytical approach with analytical error models in conjunction with adaptive simulated annealing to find the optimum number of fraction bits.
Abstract: MiniBit, our automated approach for optimizing bit-widths of fixed-point designs, is based on static analysis via affine arithmetic. We describe methods to minimize both the integer and fraction parts of fixed-point signals with the aim of minimizing circuit area. Our range analysis technique identifies the number of integer bits required. For precision analysis, we employ a semi-analytical approach with analytical error models in conjunction with adaptive simulated annealing to find the optimum number of fraction bits. Improvements for a given design reduce area and latency by up to 20% and 12% respectively, over optimum uniform fraction bit-widths on a Xilinx Virtex-4 FPGA.

58 citations
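
The affine-arithmetic range analysis behind MiniBit can be modelled minimally: each signal is a central value plus noise terms, and the guaranteed range bounds the number of integer bits. The class and helper below are illustrative names covering only addition and constant scaling, a small fragment of what a real analysis needs.

```python
import math

class Affine:
    """Affine form x0 + sum(xi * ei) with each noise symbol ei in [-1, 1]."""
    def __init__(self, centre, terms=None):
        self.centre = centre
        self.terms = dict(terms or {})

    @classmethod
    def from_interval(cls, lo, hi, sym):
        return cls((lo + hi) / 2.0, {sym: (hi - lo) / 2.0})

    def __add__(self, other):
        terms = dict(self.terms)
        for sym, coeff in other.terms.items():
            terms[sym] = terms.get(sym, 0.0) + coeff
        return Affine(self.centre + other.centre, terms)

    def scale(self, k):
        return Affine(self.centre * k,
                      {sym: c * k for sym, c in self.terms.items()})

    def bounds(self):
        radius = sum(abs(c) for c in self.terms.values())
        return self.centre - radius, self.centre + radius

def integer_bits(form):
    """Signed integer bits needed to hold the affine form's range."""
    lo, hi = form.bounds()
    return 1 + math.ceil(math.log2(max(abs(lo), abs(hi)) + 1))

# x in [-1, 1], y in [-2, 2]: the sum 3x + y stays within [-5, 5].
x = Affine.from_interval(-1.0, 1.0, "e1")
y = Affine.from_interval(-2.0, 2.0, "e2")
z = x.scale(3.0) + y
```

Unlike plain interval arithmetic, the shared noise symbols track correlation: `x + x.scale(-1.0)` collapses exactly to zero range, which is what makes affine analysis less pessimistic about integer bit-widths.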


Proceedings Article•DOI•
S. Yusuf1, Wayne Luk1•
10 Oct 2005
TL;DR: A novel technique, based on a tree-based content addressable memory structure, for a pattern matching engine for use in a hardware-based network intrusion detection system that involves hardware sharing at bit level in order to exploit powerful logic optimisations for multiple strings represented as a boolean expression.
Abstract: String pattern matching is a computationally expensive task, and when implemented in hardware, it can consume a large amount of resources for processing and storage. This paper presents a novel technique, based on a tree-based content addressable memory structure, for a pattern matching engine for use in a hardware-based network intrusion detection system. This technique involves hardware sharing at bit level in order to exploit powerful logic optimisations for multiple strings represented as a Boolean expression. Our approach has been used to implement the entire SNORT rule set with around 12% of the area on a Xilinx XC2V8000 FPGA. The design can run at a rate of approximately 2.5 Gigabits per second, and is approximately 30% smaller in area when compared with published results. The performance of our design can be improved further by having multiple designs operating in parallel.
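
A software analogue of the tree-based sharing above is a prefix trie: patterns sharing prefixes share nodes, just as the paper's bit-level hardware sharing lets multiple strings share logic. This sketch is illustrative only; a production matcher over SNORT-scale rule sets would add failure links, as in Aho-Corasick.

```python
def build_trie(patterns):
    """Prefix tree: patterns with common prefixes share nodes."""
    trie = {}
    for p in patterns:
        node = trie
        for ch in p:
            node = node.setdefault(ch, {})
        node["$"] = p           # mark an accepting node with its pattern
    return trie

def scan(text, trie):
    """Report (offset, pattern) for every pattern occurrence in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                hits.append((start, node["$"]))
    return hits

trie = build_trie(["attack", "att", "shell"])
hits = scan("xattackshell", trie)
```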

Journal Article•DOI•
TL;DR: This article presents a systematic approach to hardware/software codesign targeting data-intensive applications, focusing on application processes that can be represented as directed acyclic graphs (DAGs) and use a synchronous dataflow (SDF) model, a popular form of dataflow employed in DSP systems.
Abstract: This article presents a systematic approach to hardware/software codesign targeting data-intensive applications. It focuses on application processes that can be represented as directed acyclic graphs (DAGs) and use a synchronous dataflow (SDF) model, a popular form of dataflow employed in DSP systems. The codesign system is based on the UltraSONIC reconfigurable platform, a system designed jointly at Imperial College and the Sony Broadcast Laboratory. This system is modeled as a loosely coupled structure consisting of a single instruction processor and multiple reconfigurable hardware elements. The paper also introduces and demonstrates a task-based hardware/software codesign environment specialized for real-time video applications. Both the automated partitioning and scheduling environment and the task manager program help to provide fast, robust support for demanding applications in the codesign system.

Proceedings Article•DOI•
07 Mar 2005
TL;DR: It is shown that simulation speed of pin and cycle accurate models can go up to 150 kHz, compared to the 100 Hz range of HDL simulation, and that by utilising techniques that temporarily compromise cycle accuracy, effective simulation speed of up to 500 kHz can be obtained.
Abstract: This paper evaluates the use of pin and cycle accurate SystemC models for embedded system design exploration and early software development. The target system is the MicroBlaze VanillaNet Platform running MicroBlaze uClinux operating system. The paper compares register transfer level (RTL) hardware description language (HDL) simulation speed to the simulation speed of several different SystemC models. It is shown that simulation speed of pin and cycle accurate models can go up to 150 kHz, compared to the 100 Hz range of HDL simulation. Furthermore, utilising techniques that temporarily compromise cycle accuracy, effective simulation speed of up to 500 kHz can be obtained.

Journal Article•DOI•
TL;DR: This paper explores the problem of architectural synthesis (scheduling, allocation, and binding) for multiple word-length systems and demonstrates that significant resource savings of up to 46% are possible by considering these problems within the proposed unified framework.
Abstract: This paper explores the problem of architectural synthesis (scheduling, allocation, and binding) for multiple word-length systems. It is demonstrated that the resource allocation and binding problem, and the interaction between scheduling, allocation, and binding, are complicated by the existence of multiple word-length operators. Both optimum and heuristic approaches to the combined problem are formulated. The optimum solution involves modeling as an integer linear program, while the heuristic solution considers intertwined scheduling, binding, and resource word-length selection. Techniques are introduced to perform scheduling with incomplete word-length information, to combine binding and word-length selection, and to refine word-length information based on critical path analysis. Results are presented for several benchmark and artificial examples, demonstrating that significant resource savings of up to 46% are possible by considering these problems within the proposed unified framework.

Proceedings Article•DOI•
07 Mar 2005
TL;DR: The paper presents a system-on-a-chip (SoC) architecture, which targets reconfigurable hardware, for elliptic curve cryptosystems (ECC), and a four-level partitioning scheme is described for exploring the area and speed tradeoffs.
Abstract: This paper presents a System-on-a-Chip (SoC) architecture for Elliptic Curve Cryptosystems (ECC) which targets reconfigurable hardware. A four-level partitioning scheme is described for exploring the area and speed trade-offs. A design generator is used to generate parameterisable building blocks for the configurable SoC architecture. A secure web server, which runs on a reconfigurable soft-processor and an embedded hard-processor, shows over 2000 times speedup when the computationally-intensive operations run on the customised building blocks. The embedded on-chip timer block gives accurate performance information. The design factors of configurable SoC architectures are also discussed and evaluated.

Journal Article•DOI•
TL;DR: It is shown that the short period of the uniform random number generator in the published implementation of Marsaglia and Tsang's Ziggurat method for generating random deviates can lead to poor distributions.
Abstract: We show that the short period of the uniform random number generator in the published implementation of Marsaglia and Tsang's Ziggurat method for generating random deviates can lead to poor distributions. Changing the uniform random number generator used in its implementation fixes this issue.

Proceedings Article•DOI•
18 Apr 2005
TL;DR: Haydn, a hardware compilation approach which aims to combine the benefits of cycle accurate descriptions such as ease of control and performance, and the rapid development and design exploration facilities in behavioral synthesis tools, is described.
Abstract: This paper describes Haydn, a hardware compilation approach which aims to combine the benefits of cycle accurate descriptions such as ease of control and performance, and the rapid development and design exploration facilities in behavioral synthesis tools. Our approach supports two main features: deriving architectures that meet performance goals involving metrics such as resource usage and execution time, and inferring design behavior by generating behavioral code that is easy to verify and modify from scheduled designs such as pipeline architectures. We report four recent developments that significantly enhance the Haydn approach: (a) a design methodology that supports both cycle-accurate and behavioral levels, in which developers can move from one level to the other; (b) an extended scheduling algorithm which supports operation chaining, pipelined resources (with different latencies and initiation intervals), a forwarding technique for loop-carried dependencies, and resource sharing and control; (c) a hardware design flow that can be customized with a script language and extended simulation capabilities for the RC2000 board; and (d) an evaluation of our approach using various case studies, including 3D free-form deformation (FFD), Gouraud shading, Fibonacci series, Montgomery multiplication, and one-dimensional DCT. For instance, our approach has been used to produce various FFD designs in hardware automatically; the smallest at 137 MHz is 294 times faster than software on a dual AMD MP2600+ processor machine at 2.1 GHz, and is 2.7 times smaller and 10% slower than the fastest design at 153 MHz.

Proceedings Article•DOI•
11 Dec 2005
TL;DR: A class of FPGA-specific uniform random number generators with a period of length 2^k - 1, which can provide k random bits per cycle for the cost of k lookup tables (LUTs) and k flip-flops, and produces the highest sample rate for a given area.
Abstract: This paper describes a class of FPGA-specific uniform random number generators with a period of length 2^k - 1, which can provide k random bits per cycle for the cost of k lookup tables (LUTs) and k flip-flops. The generator is based on a binary linear recurrence, but with a recurrence matrix optimised for LUT-based architectures. It avoids many of the problems and inefficiencies associated with LFSRs and Tausworthe generators, while retaining the ability to efficiently skip ahead in the sequence. In particular, we show that this class of generators produces the highest sample rate for a given area compared to LFSR and Tausworthe generators. The statistical quality of this type of generator is very good, and can be used to create small and fast generators with long periods which pass all common empirical tests, such as Diehard, Crush, Big-Crush and the NIST cryptographic tests.
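
The binary linear recurrence underlying such generators is easy to illustrate with the classic LFSR the paper improves upon. The sketch below steps the well-known maximal-length 16-bit Fibonacci LFSR (polynomial x^16 + x^14 + x^13 + x^11 + 1), producing one new bit per step; the paper's contribution is a recurrence matrix reshaped so that k bits per cycle cost only k LUTs and k flip-flops, which a plain LFSR does not achieve.

```python
def lfsr16_step(state):
    """One step of the maximal-length 16-bit Fibonacci LFSR with taps at
    bits 16, 14, 13 and 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

# A maximal recurrence visits every nonzero 16-bit state before repeating,
# so the period is 2**16 - 1.
state, period = 0xACE1, 0
while True:
    state = lfsr16_step(state)
    period += 1
    if state == 0xACE1:
        break
```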

Proceedings Article•DOI•
H. Styles1, Wayne Luk1•
10 Oct 2005
TL;DR: A framework for a posteriori performance analysis and architectural exploration is presented with which to establish a performance upper bound under perfect phase optimization, investigate sensitivity to reconfiguration time, and examine the quality of the proposed algorithm for phase-detection.
Abstract: A program phase is an interval over which the working set of the program remains more or less constant. This paper presents a dynamic optimization scheme which uses program phase information to optimize designs for reconfigurable computing. We present a mathematical formulation of the optimization problem and propose a solution which comprises: (1) a hardware compilation scheme for generating configurations that are specialized for different phases of execution; (2) a runtime system which manages interchange of these configurations to maintain specialization between phase transitions. We report experimental results for Xilinx Virtex FPGAs involving OpenGL SPECviewperf benchmarks and demonstrate 95.39% speedup over an optimized uniform-rate static design and 11.13% speedup over an optimized multi-initiation-interval static design. We present a framework for a posteriori performance analysis and architectural exploration with which we (a) establish a performance upper bound under perfect phase optimization, (b) investigate sensitivity to reconfiguration time, and (c) examine the quality of the proposed algorithm for phase-detection. The optimization is shown to be surprisingly insensitive to increased reconfiguration time: faster reconfiguration yields limited benefits, and performance improvements remain possible with reconfiguration times of up to 1 second.

Proceedings Article•DOI•
07 Mar 2005
TL;DR: This paper explores methods for hardware acceleration of hidden Markov model (HMM) decoding for the detection of persons in still images by exploiting the inherent structure of the HMM trellis to optimise a Viterbi decoder for extracting the state sequence from observation features.
Abstract: This paper explores methods for hardware acceleration of hidden Markov model (HMM) decoding for the detection of persons in still images. Our architecture exploits the inherent structure of the HMM trellis to optimise a Viterbi decoder for extracting the state sequence from observation features. Further performance enhancement is obtained by computing the HMM trellis states in parallel. The resulting hardware decoder architecture is mapped onto a field programmable gate array (FPGA). The performance and resource usage of our design is investigated for different levels of parallelism. Performance advantages over software are evaluated. We show how this work contributes to a real-time system for person-tracking in video-sequences.
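
The decoder above implements the Viterbi recurrence over the HMM trellis; a reference software version makes the parallelism visible, since all states in one trellis column are independent and can be computed concurrently in hardware. The toy weather/health model is a classic textbook example, not the paper's person-detection HMM.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence."""
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        column = {}
        for s in states:   # states of one column are independent: in the
            prob, prev = max(  # FPGA these are computed concurrently
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][o], p)
                for p in states)
            column[s] = (prob, prev)
        best.append(column)
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):    # trace the survivor path back
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Classic toy HMM with a known most-likely path.
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```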

Proceedings Article•DOI•
O. Pell1, Wayne Luk1•
28 Sep 2005
TL;DR: Quartz is presented, the first language supporting advanced features such as polymorphism, overloading, formal reasoning and generic VHDL library compilation for correct and efficient reconfigurable design.
Abstract: We present Quartz, the first language supporting advanced features such as polymorphism, overloading, formal reasoning and generic VHDL library compilation, for correct and efficient reconfigurable design. Quartz is designed to support formal reasoning for design verification and generic optimisation strategies can be captured as algebraic transformations; the correctness of such transformations has been established using the Isabelle theorem prover. The parameterisation supported by Quartz higher-order combinators makes the expression of regular designs with a parameterised level of pipelining much easier than the equivalent in VHDL. The language also supports reconfiguration through the use of virtual multiplexer blocks. We have used Quartz to describe a range of designs with parameterised pipelining, and investigated the different tradeoffs in speed, size and power consumption. For designs where pipeline registers can reduce glitch propagation, increasing the level of pipelining can reduce power consumption by as much as 90%.


Proceedings Article•DOI•
24 Sep 2005
TL;DR: A customizable mathematical library using fixed-point arithmetic for elementary function evaluation is presented, approximating functions via polynomial or rational approximations depending on the user-defined accuracy requirements.
Abstract: Due to resource and power constraints, embedded processors often cannot afford dedicated floating-point units. For instance, the IBM PowerPC processor embedded in Xilinx Virtex-II Pro FPGAs only supports emulated floating-point arithmetic, which leads to slow operation when floating-point arithmetic is desired. This paper presents a customizable mathematical library using fixed-point arithmetic for elementary function evaluation. We approximate functions via polynomial or rational approximations depending on the user-defined accuracy requirements. The data representation for the inputs and outputs is compatible with IEEE single-precision and double-precision floating-point formats. Results show that our 32-bit polynomial method achieves over 80 times speedup over the single-precision mathematical library from Xilinx, while our 64-bit polynomial method achieves over 30 times speedup.
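
The fixed-point polynomial evaluation at the heart of such a library can be sketched as follows: arguments and coefficients are held as scaled integers, and Horner's rule needs only integer multiplies, adds and shifts, so no floating-point emulation is required. The Q4.28 format and the Taylor coefficients for exp(x) below are illustrative choices, not the library's tuned approximations.

```python
import math

FRAC = 28                      # Q4.28: 4 integer bits, 28 fraction bits
ONE = 1 << FRAC

def to_fix(x):
    return int(round(x * ONE))

def fix_mul(a, b):
    return (a * b) >> FRAC     # integer multiply plus shift, no FPU

def fix_poly(coeffs, x):
    """Horner's rule over fixed-point coefficients (highest degree first)."""
    acc = 0
    for c in coeffs:
        acc = fix_mul(acc, x) + c
    return acc

# Degree-6 Taylor coefficients of exp(x), converted to Q4.28.
EXP_COEFFS = [to_fix(1.0 / math.factorial(6 - i)) for i in range(7)]

def fix_exp(x_float):
    return fix_poly(EXP_COEFFS, to_fix(x_float)) / ONE
```

On a small interval around zero this integer-only evaluation tracks exp(x) to within about 1e-3; extending the valid range is exactly the job of the range reduction and accuracy-driven approximation selection the paper describes.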

Journal Article•DOI•
TL;DR: In this article, the authors describe a framework for hardware compilation based on a parallel imperative language, which supports multiple levels of design abstraction, transformational development, optimisation by compiler passes, and metalanguage facilities.
Abstract: Hardware compilers for high-level languages are increasingly recognised to be the key to reducing the productivity gap for advanced circuit development in general, and for reconfigurable designs in particular. This paper explains how customisable frameworks for hardware compilation can enable rapid design exploration, and reusable and extensible hardware optimisation. It describes such a framework, based on a parallel imperative language, which supports multiple levels of design abstraction, transformational development, optimisation by compiler passes, and metalanguage facilities. Our approach has been used in producing designs for applications such as signal and image processing, with different trade-offs in performance and resource usage.

Proceedings Article•DOI•
28 Sep 2005
TL;DR: The design and implementation of three uniform random number generators for use in massively parallel simulations in FPGAs are detailed, which all pass the most stringent empirical statistical tests for randomness.
Abstract: This paper details the design and implementation of three uniform random number generators for use in massively parallel simulations in FPGAs. The three different generators are tailored to make use of three different types of hardware resource: logic, RAM, and DSP blocks. This allows the random number generator to be fitted into resources left over after the main application has been written. The three generators all pass the most stringent empirical statistical tests for randomness, and all have periods appropriate for long-running simulations.

Proceedings Article•DOI•
18 Apr 2005
TL;DR: This work develops reconfigurable designs to support radiosity, a computer graphics algorithm for producing highly realistic images of artificial scenes that is computationally expensive, implementing it using stochastic raytracing, which affords both instruction-level and data parallelism.
Abstract: We develop reconfigurable designs to support radiosity, a computer graphics algorithm for producing highly realistic images of artificial scenes, but which is computationally expensive. We implement radiosity using stochastic raytracing, which affords both instruction-level and data parallelism. Our designs are parameterisable by bitwidth, allowing trade-offs between image quality and computation speed. We measure the speed of our designs for a Xilinx XC2V6000 device in the Celoxica RC2000 platform: at 53 MHz it can run up to five times faster than a software implementation on an Athlon MP 2600+ processor at 2.1 GHz. We estimate that retargeting our design for a Virtex-4 XC4VSX55 device can result in over 160 times software speed, while a Spartan-3 XC3S5000 device can run more than 40 times faster than the software implementation.

Proceedings Article•DOI•
11 Dec 2005
TL;DR: One of the designs, which targets a Xilinx XC2V6000 FPGA at 90.2 MHz, represents a 145-fold speedup over a software version running on a 3 GHz Pentium-4 computer.
Abstract: This paper describes the design and implementation of hardware architectures for posture analysis. Posture analysis is an active research area in computer vision for home care environments and security. We report four contributions in this paper: (a) requirements for a posture analysis system with hardware support; (b) a workflow for posture analysis that fulfills these requirements; (c) new architectures and their implementation based on a high level hardware design approach; and (d) performance evaluation for our derived designs. One of our designs, which targets a Xilinx XC2V6000 FPGA at 90.2 MHz, is able to perform posture analysis at a rate of 1164 frames per second with a frame size of 320×240 pixels, or 220 frames per second for DVD quality of 720×576 pixels per frame. It represents a 145-fold speedup over a software version running on a 3 GHz Pentium-4 computer. The frame rate is well above that of real time video, which enables us to share the FPGA design among multiple video sources.

Book Chapter•DOI•
TL;DR: This paper describes how Quartz overloading is resolved using satisfiability matrix predicates, a new approach to overloading designed specifically for the requirements of describing hardware in Quartz.
Abstract: Quartz is a new declarative hardware description language with polymorphism, overloading, higher-order combinators and a relational approach to data flow, supporting formal reasoning for design verification in the same style as the Ruby language. The combination of parametric polymorphism and overloading within the language involves the implementation of a system of constrained types. This paper describes how Quartz overloading is resolved using satisfiability matrix predicates. Our algorithm is a new approach to overloading designed specifically for the requirements of describing hardware in Quartz.

Proceedings Article•DOI•
23 Jul 2005
TL;DR: This paper reviews the development of application-specific multiprocessor systems for machine learning applications, and indicates how variants of such systems can be produced by design customisation, and presents a method for automating the compilation of such designs.
Abstract: This paper reviews the development of application-specific multiprocessor systems for machine learning applications, and indicates how variants of such systems can be produced by design customisation. We first provide an overview of Progol, a machine learning framework based on inductive logic programming. We then describe, for such frameworks, various uniprocessor architectures and their adoption in multiprocessor systems. We also present the experimental facilities and results for evaluating our approach, and a method for automating the compilation of such designs.

01 Jan 2005
TL;DR: This paper maps the performance-critical tasks of packet classification and flow monitoring from software into hardware using a field programmable gate array (FPGA), such that operations can run in parallel where desirable.
Abstract: It is increasingly difficult for network devices to keep pace with rapid developments in network data rate speeds. Many such devices are unable to match the OC192 link speed. This paper describes the use of a combined hardware-software system as an application-specific solution to this problem. Our approach maps the performance-critical tasks of packet classification and flow monitoring from software into hardware using a field programmable gate array (FPGA), such that operations can run in parallel where desirable. A feature of our architecture is its capability to process multiple flows in parallel. We explore the scalability of our system showing that it can support flows at multi-gigabit rate, which is faster than most software-based systems where acceptable data rates are typically no more than 100 Mbps.