
Showing papers presented at "Field-Programmable Custom Computing Machines" in 1997


Proceedings Article•DOI•
16 Apr 1997
TL;DR: Novel aspects of the Garp Architecture are presented, as well as a prototype software environment and preliminary performance results, which suggest that a Garp of similar technology could achieve speedups ranging from a factor of 2 to as high as a factor of 24 for some useful applications.
Abstract: Typical reconfigurable machines exhibit shortcomings that make them less than ideal for general-purpose computing. The Garp Architecture combines reconfigurable hardware with a standard MIPS processor on the same die to retain the better features of both. Novel aspects of the architecture are presented, as well as a prototype software environment and preliminary performance results. Compared to an UltraSPARC, a Garp of similar technology could achieve speedups ranging from a factor of 2 to as high as a factor of 24 for some useful applications.

1,030 citations


Proceedings Article•DOI•
16 Apr 1997
TL;DR: The architecture of a time-multiplexed FPGA is described, which includes extensions for dealing with state saving and forwarding and for increased routing demand due to time-multiplexing the hardware.
Abstract: This paper describes the architecture of a time-multiplexed FPGA. Eight configurations of the FPGA are stored in on-chip memory. This inactive on-chip memory is distributed around the chip, and accessible so that the entire configuration of the FPGA can be changed in a single cycle of the memory. The entire configuration of the FPGA can be loaded from this on-chip memory in 30 ns. Inactive memory is accessible as block RAM for applications. The FPGA is based on the Xilinx XC4000E FPGA, and includes extensions for dealing with state saving and forwarding and for increased routing demand due to time-multiplexing the hardware.

533 citations


Proceedings Article•DOI•
16 Apr 1997
TL;DR: Chimaera is described, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host processor itself and enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfigurable computing.
Abstract: By strictly separating reconfigurable logic from their host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper we describe Chimaera, a system that overcomes this bottleneck by integrating reconfigurable logic into the host processor itself. With direct access to the host processor's register file, the system enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfigurable computing. It also supports multi-output functions, and utilizes partial run-time reconfiguration to reduce reconfiguration time. Combined, this system can provide speedups of a factor of two or more for general-purpose computing, and speedups of 160 or more are possible for hand-mapped applications.

450 citations


Proceedings Article•DOI•
16 Apr 1997
TL;DR: The paper describes a powerful, scalable, reconfigurable computer called the PARTS engine, which computes 24 stereo disparities on 320 by 240 pixel images at 42 frames per second and achieves throughput of over 70 million point×disparity measurements per second.
Abstract: The paper describes a powerful, scalable, reconfigurable computer called the PARTS engine. The PARTS engine consists of 16 Xilinx 4025 FPGAs and 16 one-megabyte SRAMs. The FPGAs are connected in a partial torus, with each FPGA associated with two adjacent SRAMs. The SRAMs are tightly coupled to the FPGAs so that all the SRAMs can be accessed concurrently. The PARTS engine fits on a standard PCI card in a personal computer or workstation. The first application implemented on the PARTS engine is a depth-from-stereo vision algorithm that computes 24 stereo disparities on 320 by 240 pixel images at 42 frames per second. Running at this speed, the engine is performing approximately 2.3 billion RISC-equivalent operations per second, accessing memory at a rate of 500 million bytes per second and attaining a throughput of over 70 million point×disparity measurements per second.

229 citations
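
The throughput figure in this entry follows directly from the frame parameters quoted in the abstract. As a quick arithmetic check, a minimal C sketch using only those numbers (320x240 pixels, 24 disparities, 42 frames per second):

```c
/* Quick arithmetic check of the PARTS throughput figure, using only the
 * numbers quoted in the abstract above. */
#include <stdio.h>

int main(void)
{
    long long per_second = 320LL * 240 * 24 * 42;   /* pixels x disparities x fps */
    printf("%lld point-disparity measurements per second\n", per_second);
    /* prints 77414400, roughly 77 million, consistent with "over 70 million" */
    return 0;
}
```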


Proceedings Article•DOI•
16 Apr 1997
TL;DR: A new FPGA configuration mechanism, called striping, is proposed that supports pipeline stage reconfiguration and simultaneous configuration and execution, and introduces a design abstraction that enables the development of families of upwardly compatible FPGAs and virtual hardware design.
Abstract: This paper examines the implementation of pipelined applications using run-time reconfiguration. Throughput and latency of pipelined applications can be significantly improved when reconfiguration is performed at the level of individual pipeline stages, as opposed to configuration of the entire FPGA. If reconfiguration and execution can be performed simultaneously, the performance of a pipelined application approaches its theoretical maximum. This paper proposes a new FPGA configuration mechanism, called striping, that supports pipeline stage reconfiguration and simultaneous configuration and execution. Additionally, the use of the pipeline stage as the atomic unit of reconfiguration introduces a design abstraction that enables the development of families of upwardly compatible FPGAs and virtual hardware design.

194 citations


Proceedings Article•DOI•
J. Burns, A. Donlin, J. Hogg, Satnam Singh, M. De Wit
16 Apr 1997
TL;DR: This work presents the design of an extensible run-time system, called RAGE, for managing the dynamic reconfiguration of FPGAs, which incorporates operating-system-style services that permit sophisticated, high-level operations on circuits.
Abstract: The feasibility of run-time reconfiguration of FPGAs has been established by a large number of case studies. However, these systems have typically involved an ad hoc combination of hardware and software. The software that manages the dynamic reconfiguration is typically specialised to one application and one hardware configuration. We present three different applications of dynamic reconfiguration, based on research activities at Glasgow University, and extract a set of common requirements. We present the design of an extensible run-time system for managing the dynamic reconfiguration of FPGAs, motivated by these requirements. The system is called RAGE, and incorporates operating-system-style services that permit sophisticated and high-level operations on circuits.

157 citations


Proceedings Article•DOI•
W.B. Culbertson, Rick Amerson, Richard J. Carter, Phillip J. Kuekes, Greg Snider
16 Apr 1997
TL;DR: This work has developed methods to precisely locate defects in Teramac, a large custom computer which works correctly despite the fact that three quarters of its FPGAs contain defects.
Abstract: Teramac is a large custom computer which works correctly despite the fact that three quarters of its FPGAs contain defects. This is accomplished through unprecedented use of defect tolerance, which substantially reduces Teramac's cost and permits it to have an unusually complex interconnection network. Teramac tolerates defective resources, like gates and wires, that are introduced during the manufacture of its FPGAs and other components, and during assembly of the system. We have developed methods to precisely locate defects. User designs are mapped onto the system by a completely automated process that avoids the defects and hides the defect tolerance from the user. Defective components are not physically removed from the system.

151 citations


Proceedings Article•DOI•
Carl Ebeling, D.C. Cronquist, P. Franklin, J. Secosky, Stefan G. Berg
16 Apr 1997
TL;DR: This paper illustrates this mapping and configuration for several important applications including a FIR filter, 2-D DCT, motion estimation, and parametric curve generation; it also shows how static and dynamic control are used to perform complex computations.
Abstract: The goal of the RaPiD (Reconfigurable Pipelined Datapath) architecture is to provide high performance configurable computing for a range of computationally-intensive applications that demand special-purpose hardware. This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. A key feature of RaPiD is the combination of static and dynamic control. While the underlying computational pipelines are configured statically, a limited amount of dynamic control is provided which greatly increases the range and capability of applications that can be mapped to RaPiD. This paper illustrates this mapping and configuration for several important applications including a FIR filter, 2-D DCT, motion estimation, and parametric curve generation; it also shows how static and dynamic control are used to perform complex computations.

140 citations


Proceedings Article•DOI•
G. Brebner
16 Apr 1997
TL;DR: The overall impact of the work presented in the paper is to show that it is feasible to incorporate configurable hardware within traditional computer systems that use high-level language programs and computer operating systems.
Abstract: Swappable Logic Units (SLUs) were introduced by the author previously (1996) to play a role in virtual hardware subsystems that is analogous to the role of pages or segments in virtual memory subsystems. The intention is that a conventional operating system can be extended to manage SLU circuitry implemented using FPGA real estate. In order to minimise operating system overheads, two particular SLU-based virtual hardware models were deemed practical: a "sea of accelerators" model and a "parallel harness" model. This paper looks in some detail at how SLUs will fit within the overall environment of a fairly conventional hardware/software system. First, there is a discussion of the FPGA-based hardware environment for SLUs, followed by a discussion of the software environment from which SLUs might be used. After this, there is a description of the operational properties that SLUs can have, and how these fit in with the two virtual hardware models. Finally, proposals for standard interfaces between SLUs and their environment are discussed. These interfaces can be regarded as constraints on the designers of SLU circuitry or, more positively, as suppliers of an enriched context within which such circuitry operates. The overall impact of the work presented in the paper is to show that it is feasible to incorporate configurable hardware within traditional computer systems that use high-level language programs and computer operating systems. That is, it should not always be necessary to devise special-purpose hardware/software systems to realise custom computing.

129 citations


Proceedings Article•DOI•
16 Apr 1997
TL;DR: The RAW benchmark suite consists of twelve programs designed to facilitate comparing, validating, and improving reconfigurable computing systems, and includes an architecture-independent compilation framework, Raw Computation Structures (RawCS), to express each algorithm's dependencies and to support automatic synthesis, partitioning, and mapping to a reconfigurable computer.
Abstract: The RAW benchmark suite consists of twelve programs designed to facilitate comparing, validating, and improving reconfigurable computing systems. These benchmarks run the gamut of algorithms found in general purpose computing, including sorting, matrix operations, and graph algorithms. The suite includes an architecture-independent compilation framework, Raw Computation Structures (RawCS), to express each algorithm's dependencies and to support automatic synthesis, partitioning, and mapping to a reconfigurable computer. Within this framework, each benchmark is portably designed in both C and Behavioral Verilog and scalably parameterized to consume a range of hardware resource capacities. To establish initial benchmark ratings, we have targeted a commercial logic emulation system based on virtual wires technology to automatically generate designs up to millions of gates (14 to 379 FPGAs). Because the virtual wires techniques abstract away machine-level details like FPGA capacity and interconnect, our hardware target for this system is an abstract reconfigurable logic fabric with memory-mapped host I/O. We report initial speeds in the range of 2X to 1800X faster than a 2.82 SPECint95 SparcStation 20 and encourage others in the field to run these benchmarks on other systems to provide a standard comparison.

122 citations


Proceedings Article•DOI•
Yamin Li, Wanming Chu
16 Apr 1997
TL;DR: A non-restoring square root algorithm and two very simple single-precision floating-point square root implementations on FPGAs based on that algorithm: a low-cost iterative implementation that uses a traditional adder/subtracter and a high-throughput pipelined implementation.
Abstract: The square root operation is hard to implement on FPGAs because of the complexity of the algorithms. In this paper, we present a non-restoring square root algorithm and two very simple single precision floating point square root implementations based on the algorithm on FPGAs. One is a low-cost iterative implementation that uses a traditional adder/subtracter. The operation latency is 25 clock cycles and the issue rate is 24 clock cycles. The other is a high-throughput pipelined implementation that uses multiple adder/subtracters. The operation latency is 15 clock cycles and the issue rate is one clock cycle. This means that the pipelined implementation is capable of accepting a square root instruction on every clock cycle.
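
For readers unfamiliar with the technique, the recurrence below is a minimal software sketch of the classic non-restoring integer square root, which produces one result bit per iteration using only add/subtract operations. It is an illustrative sketch of the underlying recurrence, not the authors' exact single-precision floating-point FPGA datapath.

```c
/* Minimal sketch of a non-restoring integer square root (one root bit per
 * iteration, add/subtract only). Assumption: this is the textbook integer
 * recurrence, not the paper's floating-point implementation. */
#include <stdint.h>
#include <stdio.h>

static uint16_t nr_sqrt32(uint32_t d)
{
    int32_t  r = 0;   /* partial remainder, may go negative (non-restoring) */
    uint32_t q = 0;   /* partial root, one bit appended per iteration       */

    for (int i = 15; i >= 0; i--) {
        int32_t bits = (int32_t)((d >> (2 * i)) & 3);    /* next two radicand bits */
        if (r >= 0)
            r = 4 * r + bits - (int32_t)((q << 2) | 1);  /* try subtracting 4q+1    */
        else
            r = 4 * r + bits + (int32_t)((q << 2) | 3);  /* negative: add back 4q+3 */
        q = (q << 1) | (r >= 0 ? 1u : 0u);               /* bit sticks if remainder >= 0 */
    }
    return (uint16_t)q;   /* floor(sqrt(d)) */
}

int main(void)
{
    printf("%u %u\n", nr_sqrt32(1000000u), nr_sqrt32(15u));   /* prints 1000 3 */
    return 0;
}
```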

Proceedings Article•DOI•
16 Apr 1997
TL;DR: A framework and tools for automating the production of designs which can be partially reconfigured at run time, and a tool which further optimises designs for FPGAs supporting simultaneous configuration of multiple cells.
Abstract: This paper describes a framework and tools for automating the production of designs which can be partially reconfigured at run time. The tools include: a partial evaluator, which produces configuration files for a given design, where the number of configurations can be minimised by a process known as compile-time sequencing; an incremental configuration calculator, which takes the output of the partial evaluator and generates an initial configuration file and incremental configuration files that partially update preceding configurations; and a tool which further optimises designs for FPGAs supporting simultaneous configuration of multiple cells. While many of our techniques are independent of the design language and device used, our tools currently target Xilinx 6200 devices. Simultaneous configuration, for example, can be used to reduce the time for reconfiguring an adder to a subtractor from time linear with respect to its size to constant time at best and logarithmic time at worst.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: The approach described in this paper discusses the use of an FPGA-based front end processor that filters relevant signaling information to the firewall host while at the same time allowing friendly connections to proceed at line speed with no performance degradation.
Abstract: This implementation of the firewall enables a high degree of traffic selectability yet avoids the usual performance penalty associated with IP level firewalls. This approach is applicable to high-speed broadband networks, and asynchronous transfer mode (ATM) networks are addressed in particular. Security management is achieved through a new technique of active connection management with authentication. Past approaches to network security involve firewalls providing selection based on packet filtering and application level proxy gateways. IP level firewalling was sufficient for traditional networks but causes a severe performance degradation in high speed broadband environments. The approach described in this paper discusses the use of an FPGA-based front end processor that filters relevant signaling information to the firewall host while at the same time allowing friendly connections to proceed at line speed with no performance degradation.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: The paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform, and the resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.
Abstract: Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the implementation can be adapted to the limited FPGA and routing resources of these architectures. The paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform. Simple column oriented processors were used throughout the design to achieve high performance with limited nearest neighbor communication. The distributed SPLASH 2 memories are also exploited to achieve a high degree of parallelism. The resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.

Proceedings Article•DOI•
Norman Margolus
16 Apr 1997
TL;DR: The FPGA chips proposed would make a wide range of large-scale CA simulations of 3D physical systems practical and economical: simulations that are currently well beyond the reach of any existing computer.
Abstract: We propose an FPGA chip architecture based on a conventional FPGA logic array core, in which I/O pins are clocked at a much higher rate than that of the logic array that they serve. Wide data paths within the chip are time multiplexed at the edge of the chip into much faster and narrower data paths that run off-chip. This kind of arrangement makes it possible to interface a relatively slow FPGA core with high speed memories and data streams, and is useful for many pin-limited FPGA applications. For efficient use of the highest bandwidth DRAMs, our proposed chip includes a RAMBUS DRAM interface, a burst-transfer controller, and burst buffers. This proposal is motivated by our work with virtual processor cellular automata (CA) machines, a kind of SIMD computer. Our next generation of CA machines requires reconfigurable FPGA-like processors coupled to the highest speed DRAMs and SRAMs available. Unfortunately, no current FPGA chips have appropriate DRAM I/O support or the speed needed to easily interface with pipelined SRAMs. The chips proposed would make a wide range of large-scale CA simulations of 3D physical systems practical and economical: simulations that are currently well beyond the reach of any existing computer. These chips would also be well suited to a broad range of other simulation, graphics and DSP-like applications.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: A systematic comparison of two promising arithmetic architecture classes, one based on a standard base representation and one on composite fields, found that composite field multipliers can be more than twice as fast as polynomial base multipliers on FPGAs.
Abstract: Reed-Solomon (RS) error correction codes are being widely used in modern communication systems such as compact disk players or satellite communication links. RS codes rely on arithmetic in finite, or Galois fields. The specific field GF(2^8) is of central importance for many practical systems. The most costly, and thus most critical, elementary operations in RS decoders are multiplication and inversion in Galois fields. Although there have been considerable efforts in the area of Galois field arithmetic architectures, there appears to be very little reported work for Galois field arithmetic for reconfigurable hardware. This contribution provides a systematic comparison of two promising arithmetic architecture classes. The first one is based on a standard base representation, and the second one is based on composite fields. For both classes a multiplier and an inverter for GF(2^8) are described and theoretical gate counts are provided. Using a design entry based on a VHDL description, each architecture is mapped to a popular FPGA and EPLD device. For each mapping an area and a speed optimization was performed. Absolute values with respect to logic cell counts and critical path simulations are provided. The results show that the composite field architectures can have great advantages on both types of reconfigurable platforms. In particular it is found that composite field multipliers can be more than twice as fast as polynomial base multipliers on FPGAs.
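
As a point of reference for the standard (polynomial) base case, the following is a minimal software sketch of shift-and-add multiplication in GF(2^8). The reduction polynomial x^8+x^4+x^3+x^2+1 (0x11D), common in Reed-Solomon codecs, is an assumption here; the sketch illustrates the arithmetic only, not the paper's FPGA/EPLD multiplier structures or gate counts.

```c
/* Minimal sketch of polynomial-base multiplication in GF(2^8).
 * Assumption: irreducible polynomial x^8+x^4+x^3+x^2+1 (0x11D), widely used
 * in Reed-Solomon codecs; not the paper's hardware architecture. */
#include <stdint.h>
#include <stdio.h>

static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;                /* GF(2) addition is XOR           */
        b >>= 1;
        uint8_t overflow = a & 0x80;
        a <<= 1;                   /* multiply a by x                 */
        if (overflow)
            a ^= 0x1D;             /* reduce modulo x^8+x^4+x^3+x^2+1 */
    }
    return p;
}

int main(void)
{
    /* x * x^7 = x^8, which reduces to x^4+x^3+x^2+1 = 0x1D for this polynomial */
    printf("0x%02x\n", gf256_mul(0x02, 0x80));   /* prints 0x1d */
    return 0;
}
```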

Proceedings Article•DOI•
16 Apr 1997
TL;DR: The goal is to use the reconfigurability of the board's interface to test a system and discover not only the maximum bandwidth and best latency attainable, but also the way to reliably achieve these figures.
Abstract: We describe the use of a reconfigurable board to obtain information on the performance that can be expected on particular systems. Our goal is to use the reconfigurability of the board's interface to test a system and discover not only the maximum bandwidth and best latency attainable, but also the way to reliably achieve these figures. The board we present uses the now widespread PCI bus. PCI is sufficiently complex, and its implementations sufficiently varied, that it is impossible to guess the performance that can be obtained by a specific board on a specific computer with only the technical characteristics of the two in hand. We observe astonishing performance differences between almost identical systems and comparable figures between small PCs and big servers. Our performance tests can be an end in themselves; however, they also serve to demonstrate the value of a reconfigurable bus interface. With the same board, we can test and choose a system, make informed architectural decisions on the hardware/software interface, and then finely tune the bus interface to get maximum and predictable figures in the running application.

Proceedings Article•DOI•
Qiang Wang, David Lewis
16 Apr 1997
TL;DR: A compiler is described that generates both hardware and controlling software for field-programmable compute accelerators: by analyzing a source program together with part of its input, it generates VHDL descriptions of functional units that are mapped onto a set of FPGA chips, along with an optimized sequence of control constructions that run on the customized machine.
Abstract: This paper describes a compiler that generates both hardware and controlling software for field-programmable compute accelerators. By analyzing a source program together with part of its input, the compiler generates VHDL descriptions of functional units that are mapped on a set of FPGA chips and an optimized sequence of control constructions that run on the customized machine. The primary technique employed in the compiler is partial evaluation, which is used to transform an application program together with part of its input into an optimized program. Further phases in the compiler identify pieces of the program that can be realized in hardware and schedule computations to execute on the resulting hardware. Finally, a set of specialized functional units generated by the compiler for a timing simulation program is used to demonstrate the approach.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: The authors present an integrated tool set to generate highly optimized hardware computation blocks from a C language subset, specifically targeted to fine grained FPGAs such as the National Semiconductor CLAy™ FPGA family.
Abstract: The authors present an integrated tool set to generate highly optimized hardware computation blocks from a C language subset. By starting with a C language description of the algorithm, they address the problem of making FPGA processors accessible to programmers as opposed to hardware designers. Their work is specifically targeted to fine grained FPGAs such as the National Semiconductor CLAy™ FPGA family. Such FPGAs exhibit extremely high performance on regular data path circuits, which are more prevalent in computationally oriented hardware applications. Dense packing of data path functional elements makes it possible to fit the computation on one or a small number of chips, and the use of local routing resources makes it possible to clock the chip at a high rate. By developing a lower level tool suite that exploits the regular, geometric nature of fine grained FPGAs, and mapping the compiler output to this tool suite, they greatly improve performance over traditional high level synthesis to fine grained FPGAs.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: Methods of implementation and performance for several common operations using the wormhole RTR paradigm are outlined, serving as indicators of the diversity of algorithms that can be instantiated through the high-speed run-time reconfiguration that these devices make possible.
Abstract: The wormhole run-time reconfiguration (RTR) computing paradigm is a method for creating high performance computational pipelines. The scalability, distributed control and data flow features of the paradigm allow it to fit neatly into the configurable computing machine (CCM) domain. To date, the field has been dominated by large bit-oriented devices whose flexibility can lead to lowered silicon utilization efficiencies. In an effort to raise this efficiency, the Colt CCM has been created based on the wormhole RTR paradigm. This paper outlines methods of implementation and performance for several common operations using these concepts. They serve as indicators of the diversity of algorithms that can be instantiated through the high-speed run-time reconfiguration that these devices make possible. Particular attention is paid to floating point multiplication. Also discussed is the topic of data-dependent computation, which would seem to be counterintuitive to the wormhole RTR paradigm. The paper concludes with a summary of the performance of the three computations.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: A hardware accelerator is presented that exploits the fine-grain parallelism in routing individual nets, with the goal of accelerating FPGA routing 10-fold through a combination of processor clusters and hardware acceleration.
Abstract: The authors describe their experience and progress in accelerating an FPGA router. Placement and routing is undoubtedly the most time-consuming process in automatic chip design or in configuring programmable logic devices as reconfigurable computing elements. Their goal is to accelerate routing of FPGAs 10-fold with a combination of processor clusters and hardware acceleration. Coarse-grain parallelism is exploited by having several processors route separate groups of nets in parallel. A hardware accelerator is presented which exploits the fine-grain parallelism in routing individual nets.

Proceedings Article•DOI•
Miron Abramovici, P. Menon
16 Apr 1997
TL;DR: A new approach to fault simulation, using reconfigurable hardware to implement a critical path tracing algorithm, shows that the approach is at least an order of magnitude faster than the serial fault emulation used in prior work.
Abstract: The authors introduce a new approach to fault simulation, using reconfigurable hardware to implement a critical path tracing algorithm. The performance estimate shows that the approach is at least an order of magnitude faster than the serial fault emulation used in prior work.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: A substantive example application is described that performs HMM training for speech recognition with the reconfigurable platform that demonstrates the utility of a small number of FPGAs coupled to a RISC processor with a simple interconnect.
Abstract: Armstrong III is a 20 node multi-computer that is currently operational. In addition to a RISC processor, each node contains reconfigurable resources implemented with FPGAs. The in-circuit reprogrammability of static RAM based FPGAs allows the computational capabilities of a node to be dynamically matched to the computational requirements of an application. Most reconfigurable computers in existence today rely solely on a large number of FPGAs to perform computations. In contrast, the paper demonstrates the utility of a small number of FPGAs coupled to a RISC processor with a simple interconnect. The article describes a substantive example application that performs HMM training for speech recognition with the reconfigurable platform.

Proceedings Article•DOI•
T. McDermott, P. Ryan, M. Shand, D. Skellern, T. Percival, Neil Weste
16 Apr 1997
TL;DR: The digital section of a wireless local area network (WLAN) demodulator is implemented in a reconfigurable interface card called the PCI Pamette, which took far less time to complete than the card-based design and is much more versatile.
Abstract: We have implemented the digital section of a wireless local area network (WLAN) demodulator in a reconfigurable interface card called the PCI Pamette. The entire baseband section of the demodulator has been implemented using the Pamette and a simple analog to digital mezzanine board. This is the second version of the demodulator, the first being a card-based design using a mixture of discrete and reconfigurable logic. The Pamette design took far less time to complete than the card-based one. Moreover, the reconfigurable substrate is much more versatile. We describe the Pamette implementation and discuss our experiences with the two different design styles and technologies.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: The computer represents the next logical step beyond the massively parallel computer NGEN towards evolvable hardware interacting with biology, and is designed for high-throughput dataflow applications with large problem sizes.
Abstract: Previous work (J.S. McCaskill et al., 1996; 1997) has shown the power of massively parallel configurable hardware (NGEN) in conjunction with dataflow architectures for the simulation of evolving populations. NGEN is a flexible computer hardware for rapid custom circuit simulation of fine grained physical processes via a massively parallel architecture, e.g. 144 hardware configurable field programmable gate arrays (FPGAs, XC4008, Xilinx). NGEN is optimized to implement dataflow architectures and systolic algorithms for large problems and is equipped with high-speed distributed SRAM (144*8*256 kBit, 15 ns access time) on the chip-to-chip interconnect. Microconfigurable FPGAs allow a further step towards closing the gap between microelectronics and biology in the area of information processing. A design for a massively parallel microconfigurable computer (POLYP) is presented. It is designed to allow online evolution in hardware with significant locally controllable memory resources. It is also designed for high-throughput dataflow applications with large problem sizes. Additionally, an evolvable interface to high-rate measurement devices is provided to allow adaptive processing coupled with real-time experimental environments. The computer represents the next logical step beyond the massively parallel computer NGEN towards evolvable hardware interacting with biology.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: This work analyzes the use of the distributed arithmetic algorithm for the efficient implementation of the discrete cosine transform in reconfigurable logic.
Abstract: The discrete cosine transform (DCT) is a key step in many image and video coding applications, and its efficient implementation has been extensively studied for software implementations and for custom VLSI. We analyse the use of the distributed arithmetic algorithm for the efficient implementation of the DCT in reconfigurable logic.
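
To make the idea concrete, the following is a minimal software sketch of distributed arithmetic for a fixed-coefficient inner product, the building block of a DA-based DCT: the multiply-accumulate is replaced by one table lookup and one shift-add per input bit plane. The tap count, word length and coefficient values here are illustrative assumptions, not the paper's DCT parameters.

```c
/* Minimal sketch of distributed arithmetic (DA) for a fixed-coefficient
 * inner product. Assumptions: 4 taps, 8-bit unsigned inputs, integer
 * coefficients; the paper's actual DCT word lengths are not reproduced. */
#include <stdint.h>
#include <stdio.h>

#define N     4        /* number of taps in the inner product */
#define WIDTH 8        /* bits per (unsigned) input sample    */

int main(void)
{
    const int32_t coeff[N] = { 3, -1, 4, 2 };       /* fixed coefficients */
    const uint8_t x[N]     = { 10, 200, 33, 7 };    /* input samples      */

    /* Precompute the DA table: entry m holds the sum of the coefficients
     * whose input contributes a 1 bit, for every N-bit pattern m. */
    int32_t table[1 << N];
    for (int m = 0; m < (1 << N); m++) {
        int32_t s = 0;
        for (int k = 0; k < N; k++)
            if (m & (1 << k))
                s += coeff[k];
        table[m] = s;
    }

    /* Bit-serial accumulation: one lookup and one shift-add per bit plane,
     * instead of N multiplications. */
    int32_t acc = 0;
    for (int b = WIDTH - 1; b >= 0; b--) {
        int m = 0;
        for (int k = 0; k < N; k++)
            m |= ((x[k] >> b) & 1) << k;            /* gather bit b of each input */
        acc = (acc << 1) + table[m];
    }

    /* Reference result via ordinary multiply-accumulate. */
    int32_t ref = 0;
    for (int k = 0; k < N; k++)
        ref += coeff[k] * (int32_t)x[k];

    printf("DA result = %d, reference = %d\n", acc, ref);   /* both -24 */
    return 0;
}
```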

Proceedings Article•DOI•
Roger Woods, S. Ludwig, J.-P. Heron, David Trainor, S. Gehring
16 Apr 1997
TL;DR: The implementation of a number of FIR filter structures in the Xilinx XC6200 technology is presented, using a combination of IRIS, an architectural synthesis tool, and Trianus/Hades, a set of integrated tools for implementing algorithms on Custom Computing Machines.
Abstract: The implementation of a number of FIR filter structures in the Xilinx XC6200 technology is presented. The designs have been implemented using a combination of IRIS, an architectural synthesis tool, and Trianus/Hades, a set of integrated tools for implementing algorithms on Custom Computing Machines. The main attraction of this approach is that it allows algorithms to be compiled quickly, allowing performance changes to be made at the architectural level in IRIS rather than at the FPGA layout level.

Proceedings Article•DOI•
J. Greenbaum, M. Baxter
16 Apr 1997
TL;DR: The authors illustrate this new style of CCM with examples from image processing, in particular a novel FPGA implementation of block motion estimation (as for MPEG encoding), and generalize and speculate on implications for new CCM architectures.
Abstract: The need to partition computation across multiple programmable devices in array-architecture CCMs leads to performance bottlenecks in data flow through the computer and wiring delays between adjacent devices. However, significant improvements in FPGA capacities have brought us to a threshold where direct inter-chip connections are not required, because for important problems in areas such as image processing an entire algorithm can be implemented on a single device. One can now implement architectures that are similar to today's parallel computers, in which interprocessor communication is done through shared memory or dedicated communication hardware. The benefits of this approach are system-wide scalability and flexibility. The authors illustrate this new style of CCM with examples from image processing, in particular a novel FPGA implementation of block motion estimation (as for MPEG encoding). Based on the lessons learned from these specific examples, they generalize and speculate on implications for new CCM architectures.
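
For readers who have not seen it, block motion estimation reduces to searching for the displacement that minimises a block-matching cost such as the sum of absolute differences (SAD). The sketch below is a minimal full-search software version under assumed parameters (16x16 blocks, a +/-8 search range, 8-bit row-major frames); it illustrates the kernel only and is not the paper's FPGA implementation.

```c
/* Minimal sketch of full-search block motion estimation with a SAD cost.
 * Assumptions: 16x16 blocks, +/-8 search range, 8-bit grayscale frames in
 * row-major order; this is not the paper's FPGA datapath. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 16
#define RANGE  8

/* SAD between the block at (bx,by) in cur and the candidate at (bx+dx,by+dy) in ref */
static uint32_t sad(const uint8_t *cur, const uint8_t *ref, int w,
                    int bx, int by, int dx, int dy)
{
    uint32_t s = 0;
    for (int y = 0; y < BLOCK; y++)
        for (int x = 0; x < BLOCK; x++)
            s += (uint32_t)abs((int)cur[(by + y) * w + bx + x] -
                               (int)ref[(by + dy + y) * w + bx + dx + x]);
    return s;
}

/* Exhaustively search the window and return the lowest-cost motion vector. */
static void motion_search(const uint8_t *cur, const uint8_t *ref,
                          int w, int h, int bx, int by, int *best_dx, int *best_dy)
{
    uint32_t best = UINT32_MAX;
    for (int dy = -RANGE; dy <= RANGE; dy++)
        for (int dx = -RANGE; dx <= RANGE; dx++) {
            if (bx + dx < 0 || by + dy < 0 ||
                bx + dx + BLOCK > w || by + dy + BLOCK > h)
                continue;                      /* candidate falls outside the frame */
            uint32_t cost = sad(cur, ref, w, bx, by, dx, dy);
            if (cost < best) { best = cost; *best_dx = dx; *best_dy = dy; }
        }
}

int main(void)
{
    enum { W = 64, H = 64 };
    static uint8_t ref[W * H], cur[W * H];

    /* synthetic test: cur is ref shifted right by 3 and down by 2 (with wrap) */
    for (int i = 0; i < W * H; i++)
        ref[i] = (uint8_t)(rand() & 0xFF);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            cur[y * W + x] = ref[((y - 2 + H) % H) * W + ((x - 3 + W) % W)];

    int dx = 0, dy = 0;
    motion_search(cur, ref, W, H, 24, 24, &dx, &dy);
    printf("best vector: (%d, %d)\n", dx, dy);   /* expected: (-3, -2), the exact match */
    return 0;
}
```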

Proceedings Article•DOI•
S. Kelem
16 Apr 1997
TL;DR: The interplay between the CLB architecture, communication between configuration planes, context-switching overhead, and the end-user application is examined, as the algorithm makes use of special features of this architecture to achieve high utilization of the silicon at run time.
Abstract: This paper describes the implementation of a real-time video algorithm on a context-switched FPGA. The FPGA is based on the Xilinx XC4000E FPGA, and includes extensions for dealing with state saving and forwarding and for increased routing demand due to time-multiplexing the hardware. The algorithm makes use of special features of this architecture to achieve high utilization of the silicon at run time. Two configuration planes are programmed as distributed RAM and two planes perform replications of the calculation in parallel. The interplay between the CLB architecture, communication between configuration planes, context-switching overhead, and the end-user application is examined as we map the algorithm onto this architecture.

Proceedings Article•DOI•
16 Apr 1997
TL;DR: The experimental results show that the hardware accelerator for the tautology check algorithm is capable of achieving a maximum speedup factor of 2.94 and averaging 1.36 on 110 modified industry benchmarks included with the Espresso II package.
Abstract: We summarize our study on implementing tautology checking, a fundamental logic synthesis algorithm, using an FPGA-based reconfigurable application-specific coprocessor. The use of the tautology checking algorithm is first discussed, followed by the specifics of the hardware accelerator implementation and its interface to application software. We compare our hardware accelerator for the tautology check algorithm with the software implementation of the tautology check algorithm in Espresso II (R. Rudell and A. Sangiovanni-Vincentelli, 1987). Our experimental results show that our accelerator is capable of achieving a maximum speedup factor of 2.94 and averaging 1.36 on 110 modified industry benchmarks included with the Espresso II package.
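
For context, Espresso-style tautology checking is a recursive Shannon-cofactoring procedure over a cube cover. The sketch below is a minimal software version of that recursion under simplifying assumptions (tiny variable count, string-encoded cubes, none of Espresso's unate-cover shortcuts); it illustrates the algorithm being accelerated, not the paper's coprocessor mapping.

```c
/* Minimal sketch of recursive (Shannon-cofactor) tautology checking on a
 * cube cover, in the spirit of the Espresso-style unate recursive paradigm.
 * Assumptions: illustrative software only; cubes are strings over {'0','1','-'}. */
#include <stdio.h>
#include <string.h>

#define NVARS    3
#define MAXCUBES 64

typedef char Cube[NVARS + 1];   /* e.g. "1-0" means x1 AND NOT x3 */

static int tautology(Cube *cover, int ncubes)
{
    if (ncubes == 0)
        return 0;                               /* empty cover covers nothing */

    for (int i = 0; i < ncubes; i++)
        if (strspn(cover[i], "-") == NVARS)
            return 1;                           /* an all-don't-care cube covers everything */

    /* pick a splitting variable that actually appears in some cube */
    int v = -1;
    for (int j = 0; j < NVARS && v < 0; j++)
        for (int i = 0; i < ncubes; i++)
            if (cover[i][j] != '-') { v = j; break; }

    /* cofactor against v = 1 and v = 0, and require both branches to be tautologies */
    for (int phase = 0; phase < 2; phase++) {
        char keep = phase ? '1' : '0';          /* literal consistent with this branch */
        Cube sub[MAXCUBES];
        int  nsub = 0;
        for (int i = 0; i < ncubes; i++) {
            if (cover[i][v] != '-' && cover[i][v] != keep)
                continue;                       /* cube vanishes in this cofactor */
            memcpy(sub[nsub], cover[i], sizeof(Cube));
            sub[nsub][v] = '-';
            nsub++;
        }
        if (!tautology(sub, nsub))
            return 0;
    }
    return 1;
}

int main(void)
{
    Cube taut[]     = { "1--", "0--" };   /* x1 + x1' is a tautology          */
    Cube not_taut[] = { "11-", "0-1" };   /* x1x2 + x1'x3 misses x1=0, x3=0   */

    printf("%d %d\n", tautology(taut, 2), tautology(not_taut, 2));   /* prints 1 0 */
    return 0;
}
```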