Examining the viability of FPGA supercomputing
Summary
1. INTRODUCTION
- Supercomputers have experienced a resurgence, fueled by government research dollars and the development of low-cost supercomputing clusters constructed from commodity PC processors.
- Floating-point arithmetic is so prevalent that LINPACK, the benchmarking application used to rank supercomputers, relies heavily on double-precision floating-point math.
- Section 3 describes alternatives to floating-point implementations in FPGAs, presenting a balanced benchmark for comparing FPGAs to processors.
2.1. HPC implementations
- The availability of high-performance clusters incorporating FPGAs has prompted efforts to explore acceleration of HPC applications.
- While not an exhaustive list, Table 1 provides a survey of recent representative applications.
- The SRC-6 and 6E combine two Xeon or Pentium processors with two large Virtex-II or Virtex-II Pro FPGAs.
- The abbreviations SP and DP refer to single-precision and double-precision floating point, respectively.
- While the speedups provided in the table are not normalized to a common processor, a trend is clearly visible.
2.2. Theoretical floating-point performance
- FPGA designs may suffer significant performance penalties due to memory and I/O bottlenecks.
- As most clusters incorporating FPGAs also include a host processor to handle serial tasks and communication, it is reasonable to assume that the cost analysis in Table 2 favors FPGAs.
- For Xilinx's double-precision floating-point core, 16 of these 18-bit multipliers are required for each floating-point multiplier [35], while the Dou et al. design needs only nine.
- While the larger FPGA devices that are prevalent in computational accelerators do not provide a cost benefit for the double-precision floating-point calculations required by the HPC community, historical trends [42] suggest that FPGA performance is improving at a rate faster than that of processors.
- In both graphs, the latest data point, representing the largest Virtex-4 device, displays worse cost-performance than the previous generation of devices.
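The 16-versus-9 multiplier counts in Section 2.2 follow directly from how a 53-bit IEEE double significand is tiled onto 18x18 embedded multipliers (which cover only 17 unsigned bits when used in the obvious signed configuration). A minimal sketch of schoolbook limb decomposition; the exact limb widths used by the two designs are an assumption here, chosen because 17-bit limbs yield 16 partial products and 18-bit limbs yield nine:

```python
def limb_multiply(a, b, width, limb_bits):
    """Multiply two width-bit unsigned ints from limb_bits-wide partial
    products; returns (product, number of partial products used)."""
    mask = (1 << limb_bits) - 1
    n = -(-width // limb_bits)  # ceil(width / limb_bits) limbs per operand
    limbs = lambda x: [(x >> (limb_bits * i)) & mask for i in range(n)]
    acc, count = 0, 0
    for i, al in enumerate(limbs(a)):
        for j, bl in enumerate(limbs(b)):
            acc += (al * bl) << (limb_bits * (i + j))  # shifted partial product
            count += 1
    return acc, count

# Two 53-bit mantissa-sized operands (illustrative values).
a, b = (1 << 53) - 1, (1 << 52) + 12345
p17, n17 = limb_multiply(a, b, 53, 17)  # 4x4 = 16 partial products
p18, n18 = limb_multiply(a, b, 53, 18)  # 3x3 = 9 partial products
assert p17 == a * b and p18 == a * b
assert (n17, n18) == (16, 9)
```

Either decomposition is exact; the difference is purely in how many hardware multipliers each 53x53 product consumes.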
2.3. Tools
- The typical HPC user is a scientist, researcher, or engineer desiring to accelerate some scientific application.
- Many have noted the requirement of high-level development environments to speed acceptance of FPGA-augmented clusters.
- These development tools accept a description of the application written in a high level language (HLL) and automate the translation of appropriate sections of code into hardware.
- Hardware debugging and interfacing still must occur.
- Even with automatic translation, development costs remain higher than for pure software implementations.
3.1. Nonstandard data formats
- The use of IEEE standard floating-point data formats in hardware implementations prevents the user from leveraging an FPGA's fine-grained configurability, effectively reducing an FPGA to a collection of floating-point units with configurable interconnect.
- Seeing the advantages of customizing the data format to fit the problem, several authors have constructed nonstandard floating-point units.
- One of the earlier projects demonstrated a 23x speedup on a 2D fast Fourier transform (FFT) through the use of a custom 18-bit floating-point format [44].
- For the cost of their PROGRAPE-3 board, estimated at US$15,000, it is likely that a 15-node processor cluster could be constructed that produces 196 peak single-precision GFLOPS.
- Many comparisons spend significantly more time optimizing hardware implementations than is spent optimizing software.
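Comparisons like the one above reduce to a dollars-per-peak-GFLOPS figure of merit. A trivial sketch using only the two numbers quoted in the text:

```python
board_cost_usd = 15_000      # estimated PROGRAPE-3 board cost (from the text)
cluster_peak_gflops = 196    # peak SP GFLOPS of the equivalently priced cluster

usd_per_gflops = board_cost_usd / cluster_peak_gflops
print(f"{usd_per_gflops:.1f} $/peak GFLOPS")  # ~76.5: the bar the board must beat
```

Any accelerator costing the same must exceed the cluster's peak throughput to win on this metric, before memory bandwidth or sustained-performance effects are even considered.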
3.2. GIMPS benchmark
- The strength of configurable logic stems from the ability to customize a hardware solution to a specific problem at the bit level.
- One such application can be found in the Great Internet Mersenne Prime Search (GIMPS) [50].
- The software used by GIMPS relies heavily on double-precision floating-point FFTs.
- Three on-chip memory buffers operated concurrently, two feeding the butterfly units while the third exchanged data with the external SDRAM.
- In spite of the unique all-integer algorithmic approach, the stand-alone FPGA implementation only achieved a speedup of 1.76 compared to a 3.4 GHz Pentium 4 processor.
4. CONCLUSION
- When comparing HPC architectures many factors must be weighed, including memory and I/O bandwidth, communication latencies, and peak and sustained performance.
- As the recent focus on commodity processor clusters demonstrates, cost-performance is of paramount importance.
- In order for FPGAs to gain acceptance within the general HPC community, they must be cost-competitive with traditional processors for the floating-point arithmetic typical in supercomputing applications.
- The analysis of the cost-performance of various current-generation FPGAs revealed that only the lower-end devices were cost-competitive with processors for double-precision floating-point matrix multiplications.
- For lower-precision data formats, current-generation FPGAs fare much better, being cost-competitive with processors.
Frequently Asked Questions
Q2. What is the strongest suit of FPGAs?
The strong suit of FPGAs, however, is low-precision fixed-point and integer arithmetic; no current device family contains dedicated floating-point operators, though dedicated integer multipliers are prevalent.
Q3. What is the efficient multiplication algorithm for large integers?
One of the most efficient multiplication algorithms for large integers utilizes the FFT, treating the number being squared as a long sequence of smaller numbers.
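That digit-sequence approach can be sketched end to end: split the number into small digits, FFT the sequence, square pointwise, inverse-FFT, then propagate carries. This is an illustrative, unoptimized sketch; GIMPS's production code actually uses an irrational-base discrete weighted transform, and the base-256 digits and recursive FFT here are my own choices:

```python
import cmath

def fft(a, invert=False):
    """Recursive Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1j if invert else -1j
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2 * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def fft_square(x, digit_bits=8):
    """Square a non-negative int by convolving its base-2**digit_bits
    digit sequence with itself in the frequency domain."""
    if x == 0:
        return 0
    mask, digits = (1 << digit_bits) - 1, []
    while x:
        digits.append(x & mask)
        x >>= digit_bits
    n = 1
    while n < 2 * len(digits):  # room for the full convolution
        n <<= 1
    spec = fft([complex(d) for d in digits] + [0j] * (n - len(digits)))
    conv = fft([v * v for v in spec], invert=True)
    result = 0
    for i, c in enumerate(conv):  # round off FFT noise, propagate carries
        result += round((c / n).real) << (digit_bits * i)
    return result

assert fft_square(2**89 - 1) == (2**89 - 1) ** 2  # M89, a Mersenne prime
```

For small inputs the floating-point rounding error is far below 0.5 per coefficient, so the rounding step recovers the exact integer convolution; production-scale transforms must budget this error carefully, which is why GIMPS depends on double precision.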
Q5. What is the common requirement for a floating-point arithmetic?
Many HPC applications and benchmarks require double-precision floating-point arithmetic to support a large dynamic range and ensure numerical stability.
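The dynamic-range point can be seen with a single addition: a perturbation smaller than half an ulp of 1.0 in single precision simply vanishes, while double precision retains it. A small illustration; the `f32` helper, which rounds a Python double to the nearest IEEE 754 single via `struct`, is my own:

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest IEEE 754 binary32."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 1e-8 is below half an ulp of 1.0 in single precision (2**-24 ~ 6e-8),
# so the update is lost entirely; in double precision it survives.
assert f32(1.0 + 1e-8) == 1.0   # single precision: increment vanishes
assert 1.0 + 1e-8 > 1.0         # double precision: increment preserved
```

Iterative solvers accumulate exactly this kind of small update, which is why losing them wholesale can destroy convergence.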
Q6. How could the design be made more cost-competitive?
To permit the design to be more cost-competitive, even against efficient software implementations, smaller, more cost-effective FPGAs could be used.
Q7. How much money has been awarded to the first person to identify a large Mersenne prime?
The distributed computing project GIMPS was created to identify large Mersenne primes, and a reward of US$100,000 has been offered to the first person to identify a prime number with more than 10 million digits.
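GIMPS verifies candidates with the Lucas-Lehmer test, whose inner loop is exactly the repeated modular squaring that the FFT-based multiplication accelerates. A minimal sketch, valid for odd prime exponents:

```python
def lucas_lehmer(p):
    """Lucas-Lehmer primality test for M_p = 2**p - 1 (odd prime p)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # the large squaring GIMPS accelerates with FFTs
    return s == 0

assert lucas_lehmer(13)        # M13 = 8191 is prime
assert not lucas_lehmer(11)    # M11 = 2047 = 23 * 89
```

For exponents in the millions, `s` is itself a multi-million-digit number, so nearly all of the runtime is spent in that one squaring, making it the natural target for hardware acceleration.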
Q9. Why is floating-point arithmetic so prevalent in HPC applications?
Due to the prevalence of floating-point arithmetic in HPC applications, research in academia and industry has focused on floating-point hardware designs [14, 15], libraries [16, 17], and development tools [18] to effectively perform floating-point math on FPGAs.
Q11. How much faster could a reworked implementation achieve?
A slightly reworked implementation, designed as an FFT accelerator with all serial functions implemented on an attached processor, could achieve a speedup of 2.6 compared to a processor alone.
Q13. What is the key contribution of this paper?
The key contributions of this paper are the addition of an economic analysis to a discussion of FPGA supercomputing projects and the presentation of an effective benchmark for comparing FPGAs and processors on an equal footing.
Q14. What does a traditional port of the algorithm from software to hardware involve?
Performing a traditional port of the algorithm from software to hardware involves the creation of a floating-point FFT on the FPGA.
Q16. What is the difference between the Dou et al. and Underwood design?
While there is always a danger in drawing conclusions from a small data set, both the Dou et al. and Underwood design results point to a crossover sometime around 2009 to 2012, when the largest FPGA devices, like those typically found in commercial FPGA-augmented HPC clusters, will be cost-effective compared to processors for double-precision floating-point calculations.