A reconfigurable fabric for accelerating large-scale datacenter services
Citations
In-Datacenter Performance Analysis of a Tensor Processing Unit
ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars
A cloud-scale acceleration architecture
A configurable cloud-scale DNN processor for real-time AI
References
Design of ion-implanted MOSFET's with very small physical dimensions
A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH
CoRAM: an in-fabric memory architecture for FPGA-based computing
Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware
Maxwell - a 64 FPGA Supercomputer
Frequently Asked Questions (21)
Q2. How can a server be serviced without unplugging cables?
Since the server sleds are plugged into a passive backplane, and the torus cabling also attaches to the backplane, a server can be serviced by pulling it out of the backplane without unplugging any cables.
Q3. What is the importance of a small rate of faults and failures?
While reliability is important, the scale of the datacenter permits sufficient redundancy that a small rate of faults and failures is tolerable.
Q4. Why does the Spare FPGA perceive a slightly higher latency increase over FE?
Because the Spare FPGA must forward its requests along a channel shared with responses, it perceives a slightly higher but negligible latency increase over FE at maximum throughput.
Q5. What are some examples of FPGAs used to implement and accelerate important datacenter applications?
FPGAs have been used to implement and accelerate important datacenter applications such as Memcached [17, 6], compression/decompression [14, 19], K-means clustering [11, 13], and web search.
Q6. What are examples of commercial FPGA acceleration appliances?
The Convey HC-2 [8], Maxeler MPC series [21], BeeCube BEE4 [5] and SRC MAPstation [25] are all examples of commercial FPGA acceleration appliances.
Q7. How do the authors organize the input stream into a tree-like hierarchy?
To support a large collection of state machines working in parallel on the same input data at a high clock rate, the authors organize the blocks into a tree-like hierarchy and replicate the input stream several times.
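The fan-out idea above can be sketched in software. This is a minimal analogy, not the authors' RTL: each leaf state machine receives an identical replica of the input stream, and in hardware each tree level re-registers the data, so replicas arrive a few cycles after the root. The names `tree_depth`, `StateMachine`, and `broadcast` are illustrative assumptions.

```python
import math

def tree_depth(num_leaves, fanout=4):
    """Register stages needed for a fan-out tree to reach all leaves.
    The fan-out of 4 is an assumed value, not taken from the paper."""
    return math.ceil(math.log(num_leaves, fanout)) if num_leaves > 1 else 0

class StateMachine:
    """Toy state machine: counts occurrences of the byte 'a'."""
    def __init__(self):
        self.count = 0
    def step(self, byte):
        if byte == ord('a'):
            self.count += 1

def broadcast(stream, machines):
    # Every machine sees the same bytes in the same order; in hardware the
    # replicas would lag the root by tree_depth(len(machines)) cycles.
    for byte in stream:
        for m in machines:
            m.step(byte)
```

The point of the tree is that no single driver has to fan out to all leaves at once, which is what lets the design sustain a high clock rate.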
Q8. How does the user configure the fabric?
To configure the fabric with a desired function, user level services may initiate FPGA reconfigurations through calls to a low-level software library.
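A user-level reconfiguration call might look like the sketch below. The class and method names (`FPGAHandle`, `reconfigure`) are hypothetical and do not reflect the actual low-level library; the real implementation would also quiesce traffic and re-enable the inter-FPGA links around programming.

```python
class FPGAHandle:
    """Hypothetical wrapper around one FPGA slot; names are illustrative."""
    def __init__(self, slot):
        self.slot = slot
        self.bitstream = None

    def reconfigure(self, bitstream_path):
        """Load a new role bitstream into this FPGA.
        A real library would quiesce in-flight work, program the device,
        and bring the torus links back up before returning."""
        self.bitstream = bitstream_path
        return True
```

Usage would be a single call from the service, e.g. `FPGAHandle(0).reconfigure("ranker.bit")`.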
Q9. What are the requirements for a large-scale reconfigurable fabric?
The acceleration of datacenter services imposes several stringent requirements on the design of a large-scale reconfigurable fabric.
Q10. Why did the authors choose not to incorporate GPUs?
The authors decided not to incorporate GPUs because the current power requirements of high-end GPUs are too high for conventional datacenter servers, but also because it was unclear that some latency-sensitive ranking stages (such as feature extraction) would map well to GPUs.
Q11. How can the required capacity for a large-scale reconfigurable fabric be achieved?
To achieve the required capacity for a large-scale reconfigurable fabric, one option is to incorporate multiple FPGAs onto a daughtercard and house such a card along with a subset of the servers.
Q12. Why do multiple cores share a single complex block?
Because complex floating point instructions consume a large amount of FPGA area, multiple cores (typically 6) are clustered together to share a single complex block.
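The sharing scheme can be sketched as round-robin arbitration: several simple cores time-share one area-expensive complex unit, mirroring the paper's roughly six cores per block. The function name and the per-core queue representation are illustrative assumptions, not the actual hardware arbiter.

```python
CORES_PER_BLOCK = 6  # typical cluster size per the paper

def arbitrate(requests, cores=CORES_PER_BLOCK):
    """requests: one operation queue per core (lists of op names).
    Returns the order in which the shared complex block services ops,
    granting at most one op per core per round-robin pass."""
    schedule = []
    queues = [list(q) for q in requests]  # copy so caller's queues survive
    while any(queues):
        for core_id in range(min(cores, len(queues))):
            if queues[core_id]:
                schedule.append((core_id, queues[core_id].pop(0)))
    return schedule
```

The design trade is area for latency: a core occasionally waits its turn, but the fabric avoids replicating a large floating-point unit six times.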
Q13. What are the limitations of the use of traditional interactive FPGA debugging tools at scale?
The use of traditional interactive FPGA debugging tools at scale (e.g., Altera SignalTap, Xilinx ChipScope) is limited by (1) finite buffering capacity, (2) the need to automatically recover the failed service, and (3) the impracticality of putting USB JTAG units into each machine.
Q14. Is the appliance model a good way to integrate FPGAs into the datacenter?
While the appliance model appears to be an easy way to integrate FPGAs into the datacenter, it breaks homogeneity and reduces overall datacenter flexibility.
Q15. Why was EMI shielding added to the board?
EMI shielding was added to the board to protect other server components from interference generated by the large number of high-speed signals on the board.
Q16. What is the solution to a datacenter application that hangs?
When a datacenter application hangs for any reason, a machine at a higher level in the service hierarchy (such as a machine that aggregates results) will notice that a set of servers are unresponsive.
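The detection step can be sketched as a heartbeat check at the aggregating machine: servers whose last heartbeat is older than a threshold are flagged for recovery. The timeout value and function name here are assumptions for illustration only.

```python
TIMEOUT = 3  # assumed missed-heartbeat threshold, in arbitrary time units

def find_unresponsive(last_seen, now, timeout=TIMEOUT):
    """last_seen: mapping of server name -> time of last heartbeat.
    Returns the set of servers the aggregator should flag as unresponsive."""
    return {srv for srv, t in last_seen.items() if now - t > timeout}
```

Once flagged, the higher-level machine can trigger whatever recovery the service defines, such as restarting the hung application or rerouting its work.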
Q17. Why has the rate at which server performance improves slowed considerably?
This slowdown, due largely to power limitations, has severe implications for datacenter operators, who have traditionally relied on consistent performance and efficiency improvements in servers to make improved services economically viable.
Q18. How does the Catapult fabric achieve a 95% improvement in throughput?
Compared to a pure software implementation, the Catapult fabric achieves a 95% improvement in throughput at each ranking server with an equivalent latency distribution—or at the same throughput, reduces tail latency by 29%.
Q19. What is the average latency of the FPGA-accelerated ranker?
For a range of representative injection rates per server used in production, Figure 14 illustrates how the FPGA-accelerated ranker substantially reduces the end-to-end scoring latency relative to software.
Q20. What is the role of the queue manager in minimizing model reloads among queries?
This is an order of magnitude slower than processing a single document, so the queue manager’s role in minimizing model reloads among queries is crucial to achieving high performance.
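The queue manager's reordering idea can be sketched as batching queued documents by the model they require, so the expensive model reload happens once per distinct model rather than once per document. The function name and the `(model_id, doc)` representation are illustrative, not the paper's implementation.

```python
from collections import OrderedDict

def schedule_by_model(queue):
    """queue: list of (model_id, doc) pairs in arrival order.
    Returns a servicing order that groups same-model work together,
    plus the number of model reloads that order incurs."""
    groups = OrderedDict()  # preserves first-arrival order of models
    for model_id, doc in queue:
        groups.setdefault(model_id, []).append(doc)
    order = [(m, d) for m, docs in groups.items() for d in docs]
    reloads = len(groups)  # one load per distinct model
    return order, reloads
```

With a naive FIFO, the example queue `[("A", 1), ("B", 2), ("A", 3)]` would reload a model three times; grouping cuts that to two, which matters when a reload costs an order of magnitude more than scoring a document.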
Q21. How can a medium-scale deployment of FPGAs increase ranking throughput?
With this protocol and the appropriate fault handling mechanisms, the authors showed that a medium-scale deployment of FPGAs can increase ranking throughput in a production search infrastructure by 95% at comparable latency to a software-only solution.