Examining the viability of FPGA supercomputing
Summary
1. INTRODUCTION
- Supercomputers have experienced a resurgence, fueled by government research dollars and the development of low-cost supercomputing clusters constructed from commodity PC processors.
- Floating-point arithmetic is so prevalent that the benchmarking application ranking supercomputers, LINPACK, heavily utilizes double-precision floating-point math.
- Section 3 describes alternatives to floating-point implementations in FPGAs, presenting a balanced benchmark for comparing FPGAs to processors.
2.1. HPC implementations
- The availability of high-performance clusters incorporating FPGAs has prompted efforts to explore acceleration of HPC applications.
- While not an exhaustive list, Table 1 provides a survey of recent representative applications.
- The SRC-6 and 6E combine two Xeon or Pentium processors with two large Virtex-II or Virtex-II Pro FPGAs.
- The abbreviations SP and DP refer to single-precision and double-precision floating point, respectively.
- While the speedups provided in the table are not normalized to a common processor, a trend is clearly visible.
2.2. Theoretical floating-point performance
- FPGA designs may suffer significant performance penalties due to memory and I/O bottlenecks.
- As most clusters incorporating FPGAs also include a host processor to handle serial tasks and communication, it is reasonable to assume that the cost analysis in Table 2 favors FPGAs.
- For Xilinx's double-precision floating-point core, 16 of these 18-bit multipliers are required per floating-point multiplier [35], while the Dou et al. design needs only nine.
- While the larger FPGA devices that are prevalent in computational accelerators do not provide a cost benefit for the double-precision floating-point calculations required by the HPC community, historical trends [42] suggest that FPGA performance is improving at a rate faster than that of processors.
- In both graphs, the latest data point, representing the largest Virtex-4 device, displays worse cost-performance than the previous generation of devices.
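The multiplier counts quoted above can be reproduced with a small tiling calculation. The sketch below assumes that each Virtex-II 18×18 signed multiplier contributes 17 unsigned bits per operand slice, and it reads the Dou et al. figure as a 3×3 tiling with the low-order partial products truncated; both readings are assumptions for illustration, not details stated in the summary.

```python
import math

def dsp_multipliers(mantissa_bits: int, slice_bits: int) -> int:
    """Dedicated multipliers needed for a full-width mantissa product
    when each operand is tiled into slice_bits-wide pieces."""
    slices = math.ceil(mantissa_bits / slice_bits)
    return slices * slices

# IEEE double precision: 52 stored mantissa bits + 1 hidden bit = 53.
# Assumption: an 18x18 signed multiplier supplies 17 unsigned bits per
# operand slice, so the full product needs a 4x4 tiling of slices.
full = dsp_multipliers(53, 17)  # 16, matching the Xilinx core count [35]
# Assumption: the Dou et al. count corresponds to a 3x3 tiling with
# the lowest-order partial products dropped.
truncated = 3 * 3               # 9
print(full, truncated)
```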
2.3. Tools
- The typical HPC user is a scientist, researcher, or engineer desiring to accelerate some scientific application.
- Many have noted the requirement of high-level development environments to speed acceptance of FPGA-augmented clusters.
- These development tools accept a description of the application written in a high level language (HLL) and automate the translation of appropriate sections of code into hardware.
- Hardware debugging and interfacing still must occur.
- Even with automatic translation, development costs remain higher than for software-only implementations.
3.1. Nonstandard data formats
- The use of IEEE standard floating-point data formats in hardware implementations prevents the user from leveraging an FPGA's fine-grained configurability, effectively reducing an FPGA to a collection of floating-point units with configurable interconnect.
- Seeing the advantages of customizing the data format to fit the problem, several authors have constructed nonstandard floating-point units.
- One of the earlier projects demonstrated a 23x speedup on a 2D fast Fourier transform (FFT) through the use of a custom 18-bit floating-point format [44].
- For the cost of their PROGRAPE-3 board, estimated at US$15,000, a 15-node processor cluster could likely be constructed with a peak of 196 single-precision GFLOPS.
- Many comparisons spend significantly more time optimizing hardware implementations than is spent optimizing software.
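As an illustration of the nonstandard formats these bullets describe, here is a minimal encoder/decoder for a hypothetical 18-bit float. The 1/6/11 sign/exponent/mantissa split is an assumption chosen for illustration; the actual field allocation in [44] may differ.

```python
# Hypothetical 18-bit float: 1 sign bit, 6-bit biased exponent,
# 11-bit stored mantissa with an implicit leading 1. Subnormals,
# exponent overflow, and rounding overflow are not handled.
SIGN_BITS, EXP_BITS, MAN_BITS = 1, 6, 11
BIAS = (1 << (EXP_BITS - 1)) - 1  # 31

def encode(x: float) -> int:
    if x == 0.0:
        return 0
    sign = 1 if x < 0 else 0
    m, e = abs(x), 0
    while m >= 2.0:        # normalize mantissa into [1, 2)
        m /= 2.0; e += 1
    while m < 1.0:
        m *= 2.0; e -= 1
    frac = round((m - 1.0) * (1 << MAN_BITS)) & ((1 << MAN_BITS) - 1)
    return (sign << (EXP_BITS + MAN_BITS)) | ((e + BIAS) << MAN_BITS) | frac

def decode(w: int) -> float:
    if w == 0:
        return 0.0
    sign = -1.0 if w >> (EXP_BITS + MAN_BITS) else 1.0
    e = ((w >> MAN_BITS) & ((1 << EXP_BITS) - 1)) - BIAS
    m = 1.0 + (w & ((1 << MAN_BITS) - 1)) / (1 << MAN_BITS)
    return sign * m * 2.0 ** e

print(decode(encode(3.14159)))  # within 2**-11 relative error
```

Narrowing the mantissa like this is what lets such designs pack far more arithmetic units into the same fabric than an IEEE-compliant core would allow.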
3.2. GIMPS benchmark
- The strength of configurable logic stems from the ability to customize a hardware solution to a specific problem at the bit level.
- One such application can be found in the Great Internet Mersenne Prime Search (GIMPS) [50].
- The software used by GIMPS relies heavily on double-precision floating-point FFTs.
- The design's three buffer memories operated concurrently: two fed the butterfly units while the third exchanged data with the external SDRAM.
- In spite of the unique all-integer algorithmic approach, the stand-alone FPGA implementation only achieved a speedup of 1.76 compared to a 3.4 GHz Pentium 4 processor.
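The FFT-based large-integer squaring at the heart of the GIMPS arithmetic can be sketched as follows. This is a simplified model only: the real GIMPS code uses an irrational-base discrete weighted transform, whereas this sketch does a plain base-256 digit convolution with NumPy's real FFT.

```python
import numpy as np

def fft_square(digits, base=256):
    """Square a large integer given as little-endian base-`base` digits,
    using an FFT to compute the digit convolution (the step GIMPS
    performs with double-precision floating-point FFTs)."""
    n = 2 * len(digits)
    # Zero-pad so the cyclic convolution equals the linear one.
    f = np.fft.rfft(np.array(digits, dtype=float), n)
    conv = np.rint(np.fft.irfft(f * f, n)).astype(np.int64)
    # Propagate carries back into the base-`base` digit range.
    out, carry = [], 0
    for c in conv:
        carry += int(c)
        out.append(carry % base)
        carry //= base
    while carry:
        out.append(carry % base)
        carry //= base
    return out

def to_int(digits, base=256):
    return sum(d * base ** i for i, d in enumerate(digits))

n = 123456789
digits = [(n >> (8 * i)) & 0xFF for i in range(4)]
assert to_int(fft_square(digits)) == n * n
```

The rounding step (`np.rint`) is why the software relies on double precision: the convolution sums must stay exactly representable so the nearest integer can be recovered.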
4. CONCLUSION
- When comparing HPC architectures many factors must be weighed, including memory and I/O bandwidth, communication latencies, and peak and sustained performance.
- As the recent focus on commodity processor clusters demonstrates, cost-performance is of paramount importance.
- In order for FPGAs to gain acceptance within the general HPC community, they must be cost-competitive with traditional processors for the floating-point arithmetic typical in supercomputing applications.
- The analysis of the cost-performance of various current generation FPGAs revealed that only the lower-end devices were cost-competitive with processors for double-precision floating-point matrix multiplications.
- For lower precision data formats current generation FPGAs fare much better, being cost-competitive with processors.
Frequently Asked Questions (16)
Q2. What is the strongest suit of FPGAs?
The strong suit of FPGAs, however, is low-precision fixed-point or integer arithmetic; no current device families contain dedicated floating-point operators, though dedicated integer multipliers are prevalent.
Q3. What is the efficient multiplication algorithm for large integers?
One of the most efficient multiplication algorithms for large integers utilizes the FFT, treating the number being squared as a long sequence of smaller numbers.
Q4. What factors must be weighed when comparing HPC architectures?
When comparing HPC architectures many factors must be weighed, including memory and I/O bandwidth, communication latencies, and peak and sustained performance.
Q5. What is the common requirement for a floating-point arithmetic?
Many HPC applications and benchmarks require double-precision floating-point arithmetic to support a large dynamic range and ensure numerical stability.
Q6. How many FPGAs could be used to optimize software?
To make their design more cost-competitive, even against efficient software implementations, smaller, more cost-effective FPGAs could be used.
Q7. How much money has been awarded to the first person to identify a large Mersenne prime?
The distributed computing project GIMPS was created to identify large Mersenne primes, and a reward of US$100,000 has been offered for the first person to identify a prime number with greater than 10 million digits.
Q8. What is the common use of floating-point math?
Floating-point arithmetic is so prevalent that the benchmarking application ranking supercomputers, LINPACK, heavily utilizes double-precision floating-point math.
Q9. Why is floating-point arithmetic so prevalent in HPC applications?
Due to the prevalence of floating-point arithmetic in HPC applications, research in academia and industry has focused on floating-point hardware designs [14, 15], libraries [16, 17], and development tools [18] to effectively perform floating-point math on FPGAs.
Q10. How many multipliers are needed for the Xilinx design?
For Xilinx's double-precision floating-point core, 16 of these 18-bit multipliers are required per floating-point multiplier [35], while the Dou et al. design needs only nine.
Q11. How much faster could a reworked implementation achieve?
A slightly reworked implementation, designed as an FFT accelerator with all serial functions implemented on an attached processor, could achieve a speedup of 2.6 compared to a processor alone.
Q12. What has the availability of high-performance clusters incorporating FPGAs prompted?
The availability of high-performance clusters incorporating FPGAs has prompted efforts to explore acceleration of HPC applications.
Q13. What is the key contribution of this paper?
The key contributions of this paper are the addition of an economic analysis to a discussion of FPGA supercomputing projects and the presentation of an effective benchmark for comparing FPGAs and processors on an equal footing.
Q14. What does a traditional port of the algorithm from software to hardware involve?
Performing a traditional port of the algorithm from software to hardware involves the creation of a floating-point FFT on the FPGA.
Q15. What is the speedup of the stand-alone FPGA?
In spite of the unique all-integer algorithmic approach, the stand-alone FPGA implementation only achieved a speedup of 1.76 compared to a 3.4 GHz Pentium 4 processor.
Q16. What do the Dou et al. and Underwood design results suggest?
While there is always a danger in drawing conclusions from a small data set, both the Dou et al. and Underwood design results point to a crossover point sometime around 2009 to 2012, when the largest FPGA devices, like those typically found in commercial FPGA-augmented HPC clusters, will be cost-effective compared to processors for double-precision floating-point calculations.
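The crossover projection described above can be illustrated by extrapolating two exponential cost-performance curves. The growth rates and starting ratio below are invented for illustration only; they are not the paper's fitted values.

```python
import math

def crossover_year(base_year: float, lead: float,
                   fpga_rate: float, cpu_rate: float) -> float:
    """Year at which FPGA cost-performance overtakes processors,
    assuming processors lead by a factor `lead` at `base_year` and
    both improve exponentially at the given annual rates."""
    # lead * (1 + cpu_rate)**t == (1 + fpga_rate)**t, solved for t
    t = math.log(lead) / (math.log(1 + fpga_rate) - math.log(1 + cpu_rate))
    return base_year + t

# Invented illustration: processors 2x ahead in 2005, with FPGA
# cost-performance growing 60%/year versus 40%/year for processors.
print(round(crossover_year(2005, 2.0, 0.60, 0.40)))  # → 2010
```

Small changes in the assumed growth rates move the crossover year substantially, which is why the paper's 2009 to 2012 window should be read as a trend rather than a prediction.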