TL;DR: A comparative analysis of FPGAs and traditional processors is presented, focusing on floating-point performance and procurement costs, revealing economic hurdles in the adoption of FPGAs for general high-performance computing (HPC).
Abstract: For certain applications, custom computational hardware created using field programmable gate arrays (FPGAs) can produce significant performance improvements over processors, leading some in academia and industry to call for the inclusion of FPGAs in supercomputing clusters. This paper presents a comparative analysis of FPGAs and traditional processors, focusing on floating-point performance and procurement costs, revealing economic hurdles in the adoption of FPGAs for general high-performance computing (HPC).
TL;DR: FPGA-based Implementation of Signal Processing Systems is an important reference for practising engineers and researchers working on the design and development of DSP systems for radio, telecommunication, information, audio-visual and security applications.
Abstract: Field programmable gate arrays (FPGAs) are an increasingly popular technology for implementing digital signal processing (DSP) systems. By allowing designers to create circuit architectures developed for the specific applications, high levels of performance can be achieved for many DSP applications providing considerable improvements over conventional microprocessor and dedicated DSP processor solutions. The book addresses the key issue in this process specifically, the methods and tools needed for the design, optimization and implementation of DSP systems in programmable FPGA hardware. It presents a review of the leading-edge techniques in this field, analyzing advanced DSP-based design flows for both signal flow graph- (SFG-) based and dataflow-based implementation, system on chip (SoC) aspects, and future trends and challenges for FPGAs. The automation of the techniques for component architectural synthesis, computational models, and the reduction of energy consumption to help improve FPGA performance, are given in detail. Written from a system level design perspective and with a DSP focus, the authors present many practical application examples of complex DSP implementation, involving: high-performance computing e.g. matrix operations such as matrix multiplication; high-speed filtering including finite impulse response (FIR) filters and wave digital filters (WDFs); adaptive filtering e.g. recursive least squares (RLS) filtering; transforms such as the fast Fourier transform (FFT). FPGA-based Implementation of Signal Processing Systems is an important reference for practising engineers and researchers working on the design and development of DSP systems for radio, telecommunication, information, audio-visual and security applications. Senior level electrical and computer engineering graduates taking courses in signal processing or digital signal processing shall also find this volume of interest.
TL;DR: The book provides a strong theoretical and practical background to the field of reconfigurable computing, from the early Estrin machine to modern architectures such as coarse-grained reconfigurable devices and embedded logic devices.
Abstract: Introduction to Reconfigurable Computing provides a comprehensive study of the field of reconfigurable computing. It provides an entry point for the novice wishing to move into research on reconfigurable computing, FPGAs, and system-on-programmable-chip design. The book can also be used as a teaching reference for a graduate course in computer engineering, or as a reference for advanced electrical and computer engineers. It provides a very strong theoretical and practical background to the field of reconfigurable computing, from the early Estrin machine to very modern architectures such as coarse-grained reconfigurable devices and embedded logic devices. Apart from the introduction and the conclusion, the main chapters of the book are Architecture of reconfigurable systems, Design and implementation, High-Level Synthesis for Reconfigurable Devices, Temporal placement, On-line and Dynamic Interconnection, Designing a reconfigurable application on Xilinx Virtex FPGA, System on programmable chip, Applications.
Cites background from "Examining the viability of FPGA sup..."
...In , Craven and Athanas provided a performance/price comparative study between FPGA-based high-performance computing machines and traditional supercomputers....
...Craven and Athanas recently provided in  a study on the viability of the FPGA in supercomputers....
TL;DR: The machine itself (Maxwell) and its hardware and software environment are described, and very early benchmark results from runs of the demonstrators are presented.
Abstract: We present the initial results from the FHPCA Supercomputer project at the University of Edinburgh. The project has successfully built a general-purpose 64 FPGA computer and ported to it three demonstration applications from the oil, medical and finance sectors. This paper describes in brief the machine itself - Maxwell - its hardware and software environment and presents very early benchmark results from runs of the demonstrators.
Abstract: Although hardware/software partitioning of embedded applications onto FPGAs is widely known to have performance and power advantages, FPGA usage has been typically limited to hardware experts, due largely to several problems: 1) difficulty of integrating hardware design tools into well-established software tool flows, 2) increasingly lengthy FPGA design iterations due to placement and routing, and 3) a lack of portability and interoperability resulting from device/platform-specific tools and bitfiles. In this paper, we directly address the last two problems by introducing intermediate fabrics, which are virtual reconfigurable architectures specialized for different application domains, implemented on top of commercial-off-the-shelf devices. Such specialization enables near-instantaneous placement and routing by hiding the complexity of fine-grained physical devices, while also enabling circuit portability across all devices that implement the intermediate fabric. When combined with existing work on runtime synthesis from software binaries, intermediate fabrics reduce the effects of all three problems by enabling transparent usage of COTS FPGAs by software designers. In this paper, we explore intermediate fabric architectures using specialization techniques to minimize area and performance overhead of the virtual fabric while maximizing routability and speedup of placement and routing. We present results showing an average placement and routing speedup of 554x, with an average area overhead of 10% and clock overhead of 18%, which corresponds to an average frequency of 195 MHz.
TL;DR: It is shown that the GPU is more productive than the FPGA architecture for most of the benchmarks, and it is concluded that FPGA-based HPCS is being marginalised by GPUs.
Abstract: Heterogeneous or co-processor architectures are becoming an important component of high productivity computing systems (HPCS). In this work the performance of a GPU based HPCS is compared with the performance of a commercially available FPGA based HPC. Contrary to previous approaches that focussed on specific examples, a broader analysis is performed by considering processes at an architectural level. A set of benchmarks is employed that use different process architectures in order to exploit the benefits of each technology. These include the asynchronous pipelines common to "map" tasks, a partially synchronous tree common to "reduce" tasks and a fully synchronous, fully connected mesh. We show that the GPU is more productive than the FPGA architecture for most of the benchmarks and conclude that FPGA-based HPCS is being marginalised by GPUs.
Cites methods from "Examining the viability of FPGA sup..."
...FPGAs have been shown to effectively accelerate certain types of computation useful for research and modelling , , ....
TL;DR: This survey explores the hardware aspects of reconfigurable computing machines, from single-chip architectures to multi-chip systems, including internal structures and external coupling, and examines the software that targets these machines.
Abstract: Due to its potential to greatly accelerate a wide variety of applications, reconfigurable computing has become a subject of a great deal of research. Its key feature is the ability to perform computations in hardware to increase performance, while retaining much of the flexibility of a software solution. In this survey, we explore the hardware aspects of reconfigurable computing machines, from single chip architectures to multi-chip systems, including internal structures and external coupling. We also focus on the software that targets these machines, such as compilation tools that map high-level algorithms directly to the reconfigurable substrate. Finally, we consider the issues involved in run-time reconfigurable systems, which reuse the configurable hardware during program execution.
"Examining the viability of FPGA sup..." refers background in this paper
...A wide body of research over two decades has repeatedly demonstrated significant performance improvements for certain classes of applications through hardware acceleration in an FPGA ....
TL;DR: It is shown that the Cell Broadband Engine (Cell/B.E.) processor can outperform other modern processors by approximately an order of magnitude and by even more in some cases.
Abstract: The Cell Broadband Engine™ (Cell/B.E.) processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), developed jointly by Sony, Toshiba, and IBM. In addition to use of the Cell/B.E. processor in the Sony Computer Entertainment PLAYSTATION® 3 system, there is much interest in using it for workstations, media-rich electronics devices, and video and image processing systems. The Cell/B.E. processor includes one PowerPC® processor element (PPE) and eight synergistic processor elements (SPEs). The CBEA is designed to be well suited for a wide variety of programming models, and it allows for partitioning of work between the PPE and the eight SPEs. In this paper we show that the Cell/B.E. processor can outperform other modern processors by approximately an order of magnitude and by even more in some cases.
...Cell processor 3200 × 9 10  $230  $23 System X 2300 × 2200 12 250  $5....
Abstract: Steady advances in VLSI technology and design tools have extensively expanded the application domain of digital signal processing over the past decade. While application-specific integrated circuits (ASICs) and programmable digital signal processors (PDSPs) remain the implementation mechanisms of choice for many DSP applications, increasingly new system implementations based on reconfigurable computing are being considered. These flexible platforms, which offer the functional efficiency of hardware and the programmability of software, are quickly maturing as the logic capacity of programmable devices follows Moore's Law and advanced automated design techniques become available. As initial reconfigurable technologies have emerged, new academic and commercial efforts have been initiated to support power optimization, cost reduction, and enhanced run-time performance.
This paper presents a survey of academic research and commercial development in reconfigurable computing for DSP systems over the past fifteen years. This work is placed in the context of other available DSP implementation media including ASICs and PDSPs to fully document the range of design choices available to system engineers. It is shown that while contemporary reconfigurable computing can be applied to a variety of DSP applications including video, audio, speech, and control, much work remains to realize its full potential. While individual implementations of PDSP, ASIC, and reconfigurable resources each offer distinct advantages, it is likely that integrated combinations of these technologies will provide more complete solutions.
TL;DR: This paper examines the impact of Moore's Law on the peak floating-point performance of FPGAs, and the results show that peak FPGA floating-point performance is growing significantly faster than peak CPU floating-point performance.
Abstract: Moore's Law states that the number of transistors on a device doubles every two years; however, it is often (mis)quoted based on its impact on CPU performance. This important corollary of Moore's Law states that improved clock frequency plus improved architecture yields a doubling of CPU performance every 18 months. This paper examines the impact of Moore's Law on the peak floating-point performance of FPGAs. Performance trends for individual operations are analyzed as well as the performance trend of a common instruction mix (multiply accumulate). The important result is that peak FPGA floating-point performance is growing significantly faster than peak floating-point performance for a CPU.
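The two doubling periods quoted in this abstract can be turned into growth factors with a short calculation. The sketch below is illustrative only: the function and constant names are hypothetical, it assumes idealized exponential growth, and it uses only the 24-month and 18-month figures stated in the abstract. It does not model the FPGA trend, whose rate the abstract does not give.

```python
# Doubling periods as stated in the abstract (months).
TRANSISTOR_DOUBLING_MONTHS = 24.0  # Moore's Law: transistor count
CPU_PERF_DOUBLING_MONTHS = 18.0    # the often-quoted CPU performance corollary


def growth_factor(years: float, doubling_months: float) -> float:
    """Growth multiple after `years` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


def annual_growth_rate(doubling_months: float) -> float:
    """Equivalent compound annual growth rate (0.41 means 41% per year)."""
    return 2.0 ** (12.0 / doubling_months) - 1.0


if __name__ == "__main__":
    for years in (3, 6):
        transistors = growth_factor(years, TRANSISTOR_DOUBLING_MONTHS)
        cpu_perf = growth_factor(years, CPU_PERF_DOUBLING_MONTHS)
        print(f"{years} years: transistors x{transistors:.2f}, CPU performance x{cpu_perf:.2f}")
```

Under these idealized assumptions, six years gives an 8x increase in transistor count but a 16x increase in CPU performance, which is why the two statements should not be used interchangeably.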
"Examining the viability of FPGA sup..." refers background or methods in this paper
...Additional data was obtained by extrapolating the results of Underwood’s historical analysis  to include the Virtex 4 family....
...Extrapolated Cost-Performance Comparison While the larger FPGA devices that are prevalent in computational accelerators do not provide a cost benefit for the double precision floating-point calculations required by the HPC community, historical trends  suggest that FPGA performance is improving at a rate faster than that of processors....
TL;DR: This paper introduces a 64-bit ANSI/IEEE Std 754-1985 floating-point design of a hardware matrix multiplier optimized for FPGA implementations and implements a scalable linear array of processing elements (PEs) supporting the proposed algorithm in Xilinx Virtex II Pro technology.
Abstract: We introduce a 64-bit ANSI/IEEE Std 754-1985 floating point design of a hardware matrix multiplier optimized for FPGA implementations. A general block matrix multiplication algorithm, applicable for an arbitrary matrix size is proposed. The algorithm potentially enables optimum performance by exploiting the data locality and reusability incurred by the general matrix multiplication scheme and considering the limitations of the I/O bandwidth and the local storage volume. We implement a scalable linear array of processing elements (PE) supporting the proposed algorithm in the Xilinx Virtex II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7X and up to 18X consuming the least reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching performance of, e.g., 15.6 GFLOPS with 1600 KB local memory and 400 MB/s external memory bandwidth.
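The peak figure quoted in this abstract can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not the authors' method: it assumes each PE completes one multiply and one add per clock cycle (2 FLOPs per cycle), and the function names are hypothetical. The clock rate it produces is derived from that assumption rather than stated in the abstract.

```python
def peak_gflops(num_pes: int, clock_mhz: float, flops_per_pe_per_cycle: float = 2.0) -> float:
    """Peak throughput of a linear PE array, in GFLOPS.

    Assumes every PE retires `flops_per_pe_per_cycle` operations each cycle
    (default 2: one multiply plus one add, an assumption made for this sketch).
    """
    return num_pes * flops_per_pe_per_cycle * clock_mhz / 1000.0


def implied_clock_mhz(gflops: float, num_pes: int, flops_per_pe_per_cycle: float = 2.0) -> float:
    """Clock rate (MHz) needed to reach `gflops` under the same assumption."""
    return gflops * 1000.0 / (num_pes * flops_per_pe_per_cycle)


if __name__ == "__main__":
    # Figures taken from the abstract: 39 PEs, 15.6 GFLOPS.
    clock = implied_clock_mhz(15.6, 39)
    print(f"Implied clock: {clock:.1f} MHz")
    print(f"Peak at that clock: {peak_gflops(39, clock):.1f} GFLOPS")
```

With 39 PEs and 2 FLOPs per PE per cycle, 15.6 GFLOPS corresponds to a clock of about 200 MHz. A different per-cycle operation count would change the implied clock proportionally.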
"Examining the viability of FPGA sup..." refers background or methods in this paper
...Due to the prevalence of floating-point arithmetic in HPC applications, research in academia and industry has focused on floating-point hardware designs [14, 15], libraries [16, 17], and development tools  to effectively perform floating-point math on FPGAs....
..., representing the fastest double-precision floating-point MAC design, was extrapolated to the largest parts in several Xilinx device families....
...Dou et al. published one of the highest performance benchmarks of 15.6 GFLOPS by placing 39 floating-point processing elements on a theoretical Xilinx XC2VP125 FPGA ....