A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou¹, Kypros Constantinides², John Demme³, Hadi Esmaeilzadeh⁴, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck⁵, Stephen Heil, Amir Hormati⁶, Joo-Young Kim, Sitaram Lanka, James Larus⁷, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, Doug Burger

Microsoft

¹ Microsoft and University of Texas at Austin
² Amazon Web Services
³ Columbia University
⁴ Georgia Institute of Technology
⁵ Microsoft and University of Washington
⁶ Google, Inc.
⁷ École Polytechnique Fédérale de Lausanne (EPFL)
All authors contributed to this work while employed by Microsoft.
Abstract
Datacenter workloads demand high computational capabili-
ties, flexibility, power efficiency, and low cost. It is challenging
to improve all of these factors simultaneously. To advance dat-
acenter capabilities beyond what commodity server designs
can provide, we have designed and built a composable, recon-
figurable fabric to accelerate portions of large-scale software
services. Each instantiation of the fabric consists of a 6x8 2-D
torus of high-end Stratix V FPGAs embedded into a half-rack
of 48 machines. One FPGA is placed into each server, acces-
sible through PCIe, and wired directly to other FPGAs with
pairs of 10 Gb SAS cables.
In this paper, we describe a medium-scale deployment of
this fabric on a bed of 1,632 servers, and measure its efficacy
in accelerating the Bing web search engine. We describe
the requirements and architecture of the system, detail the
critical engineering challenges and solutions needed to make
the system robust in the presence of failures, and measure
the performance, power, and resilience of the system when
ranking candidate documents. Under high load, the large-
scale reconfigurable fabric improves the ranking throughput of
each server by 95% for a fixed latency distribution—
or, while maintaining equivalent throughput, reduces the tail
latency by 29%.
1. Introduction
The rate at which server performance improves has slowed
considerably. This slowdown, due largely to power limitations,
has severe implications for datacenter operators, who have
traditionally relied on consistent performance and efficiency
improvements in servers to make improved services economi-
cally viable. While specialization of servers for specific scale
workloads can provide efficiency gains, it is problematic for
two reasons. First, homogeneity in the datacenter is highly
desirable to reduce management issues and to provide a consis-
tent platform that applications can rely on. Second, datacenter
services evolve extremely rapidly, making non-programmable
hardware features impractical. Thus, datacenter providers
are faced with a conundrum: they need continued improve-
ments in performance and efficiency, but cannot obtain those
improvements from general-purpose systems.
Reconfigurable chips, such as Field Programmable Gate
Arrays (FPGAs), offer the potential for flexible acceleration
of many workloads. However, as of this writing, FPGAs have
not been widely deployed as compute accelerators in either
datacenter infrastructure or in client devices. One challenge
traditionally associated with FPGAs is the need to fit the ac-
celerated function into the available reconfigurable area. One
could virtualize the FPGA by reconfiguring it at run-time to
support more functions than could fit into a single device.
However, current reconfiguration times for standard FPGAs
are too slow to make this approach practical. Multiple FPGAs
provide scalable area, but cost more, consume more power,
and are wasteful when unneeded. On the other hand, using a
single small FPGA per server restricts the workloads that may
be accelerated, and may make the associated gains too small
to justify the cost.
This paper describes a reconfigurable fabric (that we call
Catapult for brevity) designed to balance these competing
concerns. The Catapult fabric is embedded into each half-rack
of 48 servers in the form of a small board with a medium-sized
FPGA and local DRAM attached to each server. FPGAs are
directly wired to each other in a 6x8 two-dimensional torus,
allowing services to allocate groups of FPGAs to provide the
necessary area to implement the desired functionality.
We evaluate the Catapult fabric by offloading a significant
fraction of Microsoft Bing’s ranking stack onto groups of eight
FPGAs to support each instance of this service. When a server
wishes to score (rank) a document, it performs the software
portion of the scoring, converts the document into a format
suitable for FPGA evaluation, and then injects the document
to its local FPGA. The document is routed on the inter-FPGA
network to the FPGA at the head of the ranking pipeline.
After running the document through the eight-FPGA pipeline,
the computed score is routed back to the requesting server.
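As a concrete, if simplified, sketch of this flow from the host's side, the C++ below walks through the steps just described. Every name in it (FpgaChannel, encode_for_fpga, and so on) is a hypothetical stand-in rather than the actual Catapult interface, and the PCIe/DMA interaction is stubbed out.

```cpp
// Hypothetical sketch of the host-side scoring path; names and types are
// illustrative only, and the PCIe/DMA interaction is stubbed with comments.
#include <cstdint>
#include <vector>

struct Document {
    std::vector<uint32_t> features;   // output of the software ranking stages
};

struct EncodedDoc {
    std::vector<uint8_t> bytes;       // FPGA-friendly wire format
};

// Stand-in for the real driver/DMA interface (Section 3.1).
class FpgaChannel {
public:
    void send(const EncodedDoc&) { /* DMA into the local FPGA's input slot */ }
    float receive_score()        { /* block until the score returns */ return 0.0f; }
};

// Assumed helper: convert a document into the format the pipeline expects.
EncodedDoc encode_for_fpga(const Document& doc) {
    EncodedDoc e;
    const uint8_t* raw = reinterpret_cast<const uint8_t*>(doc.features.data());
    e.bytes.assign(raw, raw + doc.features.size() * sizeof(uint32_t));
    return e;
}

float score_document(FpgaChannel& fpga, const Document& doc) {
    EncodedDoc encoded = encode_for_fpga(doc);  // software portion + conversion
    fpga.send(encoded);           // injected locally; the inter-FPGA network
                                  // routes it to the head of the 8-FPGA pipeline
    return fpga.receive_score();  // score routed back to the requesting server
}
```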
Although we designed the fabric for general-purpose service

acceleration, we used web search to drive its requirements,
due to both the economic importance of search and its size
and complexity. We set a performance target that would be a
significant boost over software—2x throughput in the number
of documents ranked per second per server, including portions
of ranking which are not offloaded to the FPGA.
One of the challenges of maintaining such a fabric in the
datacenter is resilience. The fabric must stay substantially
available in the presence of errors, failing hardware, reboots,
and updates to the ranking algorithm. FPGAs can potentially
corrupt their neighbors or crash the hosting servers during
bitstream reconfiguration. We incorporated a failure handling
protocol that can reconfigure groups of FPGAs or remap ser-
vices robustly, recover from failures by remapping FPGAs,
and report a vector of errors to the management software to
diagnose problems.
We tested the reconfigurable fabric, search workload, and
failure handling service on a bed of 1,632 servers equipped
with FPGAs. The experiments show that large gains in search
throughput and latency are achievable using the large-scale
reconfigurable fabric. Compared to a pure software imple-
mentation, the Catapult fabric achieves a 95% improvement in
throughput at each ranking server with an equivalent latency
distribution—or at the same throughput, reduces tail latency by
29%. The system is able to run stably for long periods, with a
failure handling service quickly reconfiguring the fabric upon
errors or machine failures. The rest of this paper describes the
Catapult architecture and our measurements in more detail.
2. Catapult Hardware
The acceleration of datacenter services imposes several strin-
gent requirements on the design of a large-scale reconfigurable
fabric. First, since datacenter services are typically large and
complex, a large amount of reconfigurable logic is necessary.
Second, the FPGAs must fit within the datacenter architecture
and cost constraints. While reliability is important, the scale
of the datacenter permits sufficient redundancy that a small
rate of faults and failures is tolerable.
To achieve the required capacity for a large-scale reconfig-
urable fabric, one option is to incorporate multiple FPGAs
onto a daughtercard and house such a card along with a subset
of the servers. We initially built a prototype in this fashion,
with six Xilinx Virtex 6 SX315T FPGAs connected in a mesh
network through the FPGAs' general-purpose I/Os. Although
straightforward to implement, this solution has four problems.
First, it is inelastic: if more FPGAs are needed than there are
on the daughtercard, the desired service cannot be mapped.
Second, if fewer FPGAs are needed, there is stranded capac-
ity. Third, the power and physical space for the board cannot
be accommodated in conventional ultra-dense servers, requir-
ing either heterogeneous servers in each rack, or a complete
redesign of the servers, racks, network, and power distribu-
tion. Finally, the large board is a single point of failure, whose
failure would result in taking down the entire subset of servers.
Figure 1: (a) A block diagram of the FPGA board. (b) A picture
of the manufactured board. (c) A diagram of the 1U, half-width
server that hosts the FPGA board. The air flows from the left
to the right, leaving the FPGA in the exhaust of both CPUs.
Figure 2: The logical mapping of the torus network, and the
physical wiring on a pod of 2 x 24 servers.
The alternative approach we took places a small daughter-
card in each server with a single high-end FPGA, and connects
the cards directly together with a secondary network. Provided
that the latency on the inter-FPGA network is sufficiently low,
and that the bandwidth is sufficiently high, services requiring
more than one FPGA can be mapped across FPGAs residing
in multiple servers. This elasticity permits efficient utilization
of the reconfigurable logic, and keeps the added acceleration
hardware within the power, thermal, and space limits of dense
datacenter servers. To balance the expected per-server per-
formance gains versus the necessary increase in total cost of
ownership (TCO), including both increased capital costs and
operating expenses, we set aggressive power and cost goals.
Given the sensitivity of cost numbers on elements such as pro-
duction servers, we cannot give exact dollar figures; however,
adding the Catapult card and network to the servers did not
exceed our limit of an increase in TCO of 30%, including a
limit of 10% for total server power.
2.1. Board Design
To minimize disruption to the motherboard, we chose to in-
terface the board to the host CPU over PCIe. While a tighter
coupling of the FPGA to the CPU would provide benefits in

terms of latency, direct access to system memory, and poten-
tially coherence, the selection of PCIe minimized disruption to
this generation of the server design. Since the FPGA resides in
I/O space, the board needed working memory to accommodate
certain services. We chose to add local DRAM, as SRAM
QDR arrays were too expensive to achieve sufficient capacity.
8 GB of DRAM was sufficient to map the services we had
planned, and fit within our power and cost envelopes.
Figure 1 shows a logical diagram of the FPGA board along
with a picture of the manufactured board and the server it
installs into [20]. We chose a high-end Altera Stratix V D5
FPGA [3], which has considerable reconfigurable logic, on-
chip memory blocks, and DSP units. The 8 GB of DRAM
consists of two dual-rank DDR3-1600 SO-DIMMs, which can
operate at DDR3-1333 speeds with the full 8 GB capacity, or
trade capacity for additional bandwidth by running as 4 GB
single-rank DIMMs at DDR3-1600 speeds. The PCIe and
inter-FPGA network traces are routed to a mezzanine connec-
tor on the bottom of the daughtercard, which plugs directly
into a socket on the motherboard. Other components on the
board include a programmable oscillator and 32 MB of Quad
SPI flash to hold FPGA configurations. Because of the limited
physical size of the board and the number of signals that must
be routed, we used a 16-layer board design. Our target appli-
cations would benefit from increased memory bandwidth, but
there was insufficient physical space to add additional DRAM
channels. We chose to use DIMMs with ECC to add resilience
as DRAM failures are commonplace at datacenter scales.
Figure 1(c) shows the position of the board in one of the
datacenter servers. We used the mezzanine connector at the
back of the server so that heat from the FPGA did not disrupt
the existing system components. Since the FPGA is subject
to the air being heated by the host CPUs, which can reach
68°C, we used an industrial-grade FPGA part rated for higher-
temperature operation up to 100°C. It was also necessary
to add EMI shielding to the board to protect other server
components from interference from the large number of high-
speed signals on the board. One requirement for serviceability
was that no jumper cables should be attached to the board
(e.g., power or signaling). By limiting the power draw of the
daughtercard to under 25 W, the PCIe bus alone provided all
necessary power. By keeping the power draw to under 20 W
during normal operation, we met our thermal requirements
and our 10% limit for added power.
2.2. Network Design
The requirements for the inter-FPGA network were low la-
tency and high bandwidth to meet the performance targets,
low component costs, plus only marginal operational expense
when servicing machines. The rack configuration we target
is organized into two half-racks called pods. Each pod has
its own power distribution unit and top-of-rack switch. The
pods are organized in a 24U arrangement of 48 half-width
1U servers (two servers fit into each 1U tray).
Based on our rack configuration, we selected a two-
dimensional, 6x8 torus for the network topology. This arrange-
ment balanced routability and cabling complexity. Figure 2
shows how the torus is mapped onto a pod of machines. The
server motherboard routes eight high-speed traces from the
mezzanine connector to the back of the server chassis, where
the connections plug into a passive backplane. The traces are
exposed on the backplane as two SFF-8088 SAS ports. We
built custom cable assemblies (shells of eight and six cables)
that plugged into each SAS port and routed two high-speed
signals between each pair of connected FPGAs. At 10 Gb/s sig-
naling rates, each inter-FPGA network link supports 20 Gb/s
of peak bidirectional bandwidth at sub-microsecond latency,
with no additional networking costs such as NICs or switches.
Since the server sleds are plugged into a passive backplane,
and the torus cabling also attaches to the backplane, a server
can be serviced by pulling it out of the backplane without
unplugging any cables. Thus, the cable assemblies can be
installed at rack integration time, tested for topological cor-
rectness, and delivered to the datacenter with correct wiring
and low probability of errors when servers are repaired.
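To make the torus wiring concrete, the following sketch computes a position's four neighbors with wrap-around on the 6x8 grid. The coordinate convention and the row-major numbering of servers within a pod are assumptions made for illustration; the paper does not specify how torus positions map to server indices.

```cpp
// Neighbor computation on a 6x8 torus (48 FPGAs per pod). The row-major
// numbering of servers is an assumed convention, not taken from the paper.
#include <cstdio>

constexpr int COLS = 6;   // "x" dimension of the torus
constexpr int ROWS = 8;   // "y" dimension of the torus

struct Coord { int x, y; };

// Step by (dx, dy) with wrap-around in both dimensions.
Coord neighbor(Coord c, int dx, int dy) {
    return { (c.x + dx + COLS) % COLS, (c.y + dy + ROWS) % ROWS };
}

int server_index(Coord c) { return c.y * COLS + c.x; }   // assumed row-major

int main() {
    Coord corner{0, 0};
    printf("east:  %d\n", server_index(neighbor(corner, +1, 0)));   // 1
    printf("west:  %d\n", server_index(neighbor(corner, -1, 0)));   // 5  (wraps)
    printf("south: %d\n", server_index(neighbor(corner, 0, +1)));   // 6
    printf("north: %d\n", server_index(neighbor(corner, 0, -1)));   // 42 (wraps)
    return 0;
}
```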
2.3. Datacenter Deployment
To test this architecture on a number of datacenter services at
scale, we manufactured and deployed the fabric in a production
datacenter. The deployment consisted of 34 populated pods
of machines in 17 racks, for a total of 1,632 machines. Each
server uses an Intel Xeon 2-socket EP motherboard, 12-core
Sandy Bridge CPUs, 64 GB of DRAM, and two SSDs in
addition to four HDDs. The machines have a 10 Gb network
card connected to a 48-port top-of-rack switch, which in turn
connects to a set of level-two switches.
The daughtercards and cable assemblies were both tested at
manufacture and again at system integration. At deployment,
we discovered that 7 cards (0.4%) had a hardware failure, and
that one of the 3,264 links (0.03%) in the cable assemblies
was defective. Since then, over several months of operation,
we have seen no additional hardware failures.
3. Infrastructure and Platform Architecture
Supporting an at-scale deployment of reconfigurable hardware
requires a robust software stack capable of detecting failures
while providing a simple and accessible interface to software
applications. If developers have to worry about low-level
FPGA details, including drivers and system functions (e.g.,
PCIe), the platform will be difficult to use and rendered in-
compatible with future hardware generations. There are three
categories of infrastructure that must be carefully designed
to enable productive use of the FPGA: (1) APIs for interfac-
ing software with the FPGA, (2) interfaces between FPGA
application logic and board-level functions, and (3) support
for resilience and debugging.

3.1. Software Interface
Applications targeting the Catapult fabric share a common
driver and user-level interface. The communication interface
between the CPU and FPGA must satisfy two key design
goals: (1) the interface must incur low latency, taking fewer
than 10 μs for transfers of 16 KB or less, and (2) the interface
must be safe for multithreading. To achieve these goals, we
developed a custom PCIe interface with DMA support.
In our PCIe implementation, low latency is achieved by
avoiding system calls. We allocate one input and one output
buffer in non-paged, user-level memory and supply the FPGA
with a base pointer to the buffers’ physical memory addresses.
Thread safety is achieved by dividing the buffer into 64 slots,
where each slot is 1/64th of the buffer, and by statically assign-
ing each thread exclusive access to one or more slots. In the
case study in Section 4, we use 64 slots of 64 KB each.
Each slot has a set of status bits indicating whether the
slot is full. To send data to the FPGA, a thread fills its slot
with data, then sets the appropriate full bit for that slot. The
FPGA monitors the full bits and fairly selects a candidate slot
for DMA’ing into one of two staging buffers on the FPGA,
clearing the full bit once the data has been transferred. Fairness
is achieved by taking periodic snapshots of the full bits, and
DMA’ing all full slots before taking another snapshot of the
full bits. When the FPGA produces results for readback, it
checks to make sure that the output slot is empty and then
DMAs the results into the output buffer. Once the DMA is
complete, the FPGA sets the full bit for the output buffer and
generates an interrupt to wake and notify the consumer thread.
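A minimal host-side model of this slot discipline is sketched below, with the FPGA's half of the protocol simulated in software. The names, the use of ordinary process memory, and the atomics-based synchronization are illustrative assumptions; the real implementation works over non-paged buffers and PCIe DMA as described above.

```cpp
// Simplified model of the 64-slot, thread-safe DMA buffer described above.
// Real Catapult hands the FPGA the physical address of non-paged user memory;
// here the "FPGA" is just another function and the memory is ordinary process
// memory. Allocate DmaBuffer statically or on the heap (it is ~4 MB).
#include <array>
#include <atomic>
#include <cstdint>
#include <cstring>

constexpr int    kSlots     = 64;
constexpr size_t kSlotBytes = 64 * 1024;   // 64 KB per slot, as in Section 4

struct DmaBuffer {
    std::array<std::array<uint8_t, kSlotBytes>, kSlots> slots;
    std::array<std::atomic<bool>, kSlots> full{};   // per-slot "full" status bit
};

// Called by the thread that exclusively owns `slot`.
void send_to_fpga(DmaBuffer& buf, int slot, const uint8_t* data, size_t len) {
    // Wait until the FPGA has drained the previous request in this slot.
    while (buf.full[slot].load(std::memory_order_acquire)) { /* spin */ }
    std::memcpy(buf.slots[slot].data(), data, len);
    buf.full[slot].store(true, std::memory_order_release);  // publish the data
}

// Simulated FPGA pass: snapshot the full bits, service every slot that was
// full in the snapshot, then clear its bit -- the fairness scheme above.
void fpga_service_pass(DmaBuffer& buf) {
    std::array<bool, kSlots> snapshot;
    for (int i = 0; i < kSlots; ++i)
        snapshot[i] = buf.full[i].load(std::memory_order_acquire);
    for (int i = 0; i < kSlots; ++i) {
        if (!snapshot[i]) continue;
        // ... DMA buf.slots[i] into one of the FPGA's staging buffers ...
        buf.full[i].store(false, std::memory_order_release);
    }
}
```

Because each thread owns its slots exclusively, a sender only ever waits on its own slot, never on other threads.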
To configure the fabric with a desired function, user level
services may initiate FPGA reconfigurations through calls to
a low-level software library. When a service is deployed, each
server is designated to run a specific application on its local
FPGA. The server then invokes the reconfiguration function,
passing in the desired bitstream as a parameter.
3.2. Shell Architecture
In typical FPGA programming environments, the user is of-
ten responsible for developing not only the application itself
but also building and integrating system functions required
for data marshaling, host-to-FPGA communication, and inter-
chip FPGA communication (if available). System integration
places a significant burden on the user and can often exceed
the effort needed to develop the application itself. This devel-
opment effort is often not portable to other boards, making it
difficult for applications to work on future platforms.
Motivated by the need for user productivity and design
re-usability when targeting the Catapult fabric, we logically
divide all programmable logic into two partitions: the shell and
the role. The shell is a reusable portion of programmable logic
common across applications—while the role is the application
logic itself, restricted to a large fixed region of the chip.
Figure 3: Components of the Shell Architecture.
Role designers access convenient and well-defined inter-
faces and capabilities in the shell (e.g., PCIe, DRAM, routing,
etc.) without concern for managing system correctness. The
shell consumes 23% of each FPGA, although extra capacity
can be obtained by discarding unused functions. In the future,
partial reconfiguration would allow for dynamic switching
between roles while the shell remains active—even routing
inter-FPGA traffic while a reconfiguration is taking place.
Figure 3 shows a block-level diagram of the shell architec-
ture, consisting of the following components:
• Two DRAM controllers, which can be operated independently or as a unified interface. On the Stratix V, our dual-rank DIMMs operate at 667 MHz. Single-rank DIMMs (or only using one of the two ranks of a dual-rank DIMM) can operate at 800 MHz.
• Four high-speed serial links running SerialLite III (SL3), a lightweight protocol for communicating with neighboring FPGAs. It supports FIFO semantics, Xon/Xoff flow control, and ECC.
• Router logic to manage traffic arriving from PCIe, the role, or the SL3 cores.
• Reconfiguration logic, based on a modified Remote Status Update (RSU) unit, to read/write the configuration Flash.
• The PCIe core, with the extensions to support DMA.
• Single-event upset (SEU) logic, which periodically scrubs the FPGA configuration state to reduce system or application errors caused by soft errors.
The router is a standard crossbar that connects the four
inter-FPGA network ports, the PCIe controller, and the ap-
plication role. The routing decisions are made by a static
software-configured routing table that supports different rout-
ing policies. The transport protocol is virtual cut-through with
no retransmission or source buffering.
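One way to picture the software-configured routing table is as a per-destination lookup of the output port, filled in by some routing policy. The sketch below assumes dimension-order routing and a 6x8 coordinate scheme; neither is specified in the paper, which only says the table supports different policies.

```cpp
// Illustrative static routing table for the shell's crossbar router.
// Dimension-order (x-then-y) routing is an assumed policy for this sketch.
#include <array>

enum class Port { Local, North, South, East, West };

constexpr int COLS = 6, ROWS = 8;

struct Coord { int x, y; };

// Signed distance on a ring of size n, mapped into (-n/2, n/2].
int ring_delta(int from, int to, int n) {
    int d = (to - from) % n;
    if (d < 0) d += n;              // now in [0, n)
    return (d > n / 2) ? d - n : d; // prefer the shorter direction
}

Port next_hop(Coord self, Coord dest) {
    int dx = ring_delta(self.x, dest.x, COLS);
    int dy = ring_delta(self.y, dest.y, ROWS);
    if (dx > 0) return Port::East;
    if (dx < 0) return Port::West;
    if (dy > 0) return Port::South;
    if (dy < 0) return Port::North;
    return Port::Local;             // destination reached: deliver to the role
}

// The "table" itself: one precomputed output port per destination FPGA.
std::array<Port, COLS * ROWS> build_routing_table(Coord self) {
    std::array<Port, COLS * ROWS> table{};
    for (int y = 0; y < ROWS; ++y)
        for (int x = 0; x < COLS; ++x)
            table[y * COLS + x] = next_hop(self, {x, y});
    return table;
}
```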

Since uncorrected bit errors can cause high-level disruptions
(requiring intervention from global management software), we
employ double-bit error detection and single-bit error correc-
tion on our DRAM controllers and SL3 links. The use of ECC
on our SL3 links incurs a 20% reduction in peak bandwidth.
ECC on the SL3 links is performed on individual flits, with cor-
rection for single-bit errors and detection of double-bit errors.
Flits with three or more bit errors may proceed undetected
through the pipeline, but are likely to be detected at the end
of packet transmission with a CRC check. Double-bit errors
and CRC failures result in the packet being dropped and not
returned to the host. In the event of a dropped packet, the host
will time out and divert the request to a higher-level failure
handling protocol.
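The host's side of this recovery path can be modeled as a bounded wait on the response that falls back to the failure-handling protocol on timeout. The sketch below uses standard C++ synchronization as a stand-in; the timeout value, names, and escalation hook are assumptions, and the real driver waits on the output slot's full bit and interrupt described in Section 3.1.

```cpp
// Minimal model of "time out and divert to a higher-level failure handler".
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <optional>

struct PendingRequest {
    std::mutex m;
    std::condition_variable cv;
    std::optional<float> score;       // set by the completion/interrupt path

    void complete(float s) {
        { std::lock_guard<std::mutex> lk(m); score = s; }
        cv.notify_one();
    }
};

// Hypothetical escalation hook: re-issue the request in software, mark the
// FPGA suspect, and let the Health Monitor / Mapping Manager (Section 3.3)
// take over. Here it only returns a sentinel value.
float divert_to_failure_handler() { return -1.0f; }

float wait_for_score(PendingRequest& req, std::chrono::milliseconds deadline) {
    std::unique_lock<std::mutex> lk(req.m);
    if (req.cv.wait_for(lk, deadline, [&] { return req.score.has_value(); }))
        return *req.score;               // normal completion
    return divert_to_failure_handler();  // packet dropped (ECC/CRC) or hang
}
```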
The SEU scrubber runs continuously to scrub configura-
tion errors. If the error rates can be brought sufficiently low,
with conservative signaling speeds and correction, the rare
errors can be handled by the higher levels of software, without
resorting to expensive approaches such as source-based re-
transmission or store-and-forward protocols. The speed of the
FPGAs and the ingestion rate of requests are high enough that
store-and-forward would be too expensive for the applications
that we have implemented.
3.3. Software Infrastructure
The system software, both at the datacenter level and in each
individual server, required several changes to accommodate
the unique aspects of the reconfigurable fabric. These changes
fall into three categories: ensuring correct operation, failure
detection and recovery, and debugging.
Two new services are introduced to implement this sup-
port. The first, called the Mapping Manager, is responsible for
configuring FPGAs with the correct application images when
starting up a given datacenter service. The second, called the
Health Monitor, is invoked when there is a suspected failure
in one or more systems. These services run on servers within
the pod and communicate through the Ethernet network.
3.4. Correct Operation
The primary challenge we found to ensuring correct operation
was the potential for instability in the system introduced by
FPGAs reconfiguring while the system was otherwise up and
stable. These problems manifested along three dimensions.
First, a reconfiguring FPGA can appear as a failed PCIe device
to the host, raising a non-maskable interrupt that may desta-
bilize the system. Second, a failing or reconfiguring FPGA
may corrupt the state of its neighbors across the SL3 links
by randomly sending traffic that may appear valid. Third, re-
configuration cannot be counted on to occur synchronously
across servers, so FPGAs must remain robust to traffic from
neighbors with incorrect or incompatible configurations (e.g.,
"old" data from FPGAs that have not yet been reconfigured).
The solution to a reconfiguring PCIe device is that the driver
that sits behind the FPGA reconfiguration call must first dis-
able non-maskable interrupts for the specific PCIe device (the
FPGA) during reconfiguration.
The solution to the corruption of a neighboring FPGA dur-
ing reconfiguration is more complex. When remote FPGAs
are reconfigured, they may send garbage data. To prevent this
data from corrupting neighboring FPGAs, the FPGA being
reconfigured sends a “TX Halt” message, indicating that the
neighbors should ignore all further traffic until the link is re-
established. In addition, messages are delayed a few clock
cycles so that, in case of an unexpected link failure, it can be
detected and the message can be suppressed.
Similarly, when an FPGA comes out of reconfiguration, it
cannot trust that its neighbors are not sending garbage data.
To handle this, each FPGA comes up with “RX Halt” enabled,
automatically throwing away any message coming in on the
SL3 links. The Mapping Manager tells each server to release
RX Halt once all FPGAs in a pipeline have been configured.
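Putting Sections 3.1 and 3.4 together, the ordering of the safety steps around a reconfiguration can be sketched as follows. Every function here is a hypothetical stand-in for driver, shell, or Mapping Manager functionality; only the ordering reflects the protocol described above.

```cpp
// Hedged sketch of the safe-reconfiguration ordering described above.
// All functions are stand-ins with empty bodies; only the ordering matters.
#include <cstdint>
#include <vector>

void disable_pcie_nmi() { /* driver: mask NMIs for this PCIe device */ }
void enable_pcie_nmi()  { /* driver: restore NMI delivery */ }
void send_tx_halt()     { /* shell: tell neighbors to ignore this link */ }
void write_bitstream(const std::vector<uint8_t>&) { /* reprogram the FPGA */ }
void release_rx_halt()  { /* shell: resume accepting SL3 traffic */ }

// Run on the server whose local FPGA is being reconfigured.
void reconfigure_local_fpga(const std::vector<uint8_t>& bitstream) {
    disable_pcie_nmi();   // a reconfiguring FPGA can look like a failed PCIe device
    send_tx_halt();       // neighbors ignore this link until it is re-established
    write_bitstream(bitstream);
    // The FPGA comes back up with RX Halt enabled, discarding SL3 traffic
    // from neighbors that may not yet have been reconfigured.
    enable_pcie_nmi();
}

// Invoked when the Mapping Manager reports that every FPGA in the pipeline
// has been configured.
void on_pipeline_ready() { release_rx_halt(); }
```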
3.5. Failure Detection and Recovery
When a datacenter application hangs for any reason, a machine
at a higher level in the service hierarchy (such as a machine
that aggregates results) will notice that a set of servers are
unresponsive. At that point, the Health Monitor is invoked.
The Health Monitor queries each machine to find its status.
If a server is unresponsive, it is put through a sequence of
soft reboot, hard reboot, and then flagged for manual service
and possible replacement, until the machine starts working
correctly. If the server is operating correctly, it responds to
the Health Monitor with information about the health of its
local FPGA and associated links. The Health Monitor returns
a vector with error flags for inter-FPGA connections, DRAM
status (bit errors and calibration failures), errors in the FPGA
application, PLL lock issues, PCIe errors, and the occurrence
of a temperature shutdown. This call also returns the machine
IDs of the north, south, east, and west neighbors of an FPGA,
to test whether the neighboring FPGAs in the torus are acces-
sible and that they are the machines that the system expects
(in case the cables are miswired or unplugged).
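The per-machine status returned to the Health Monitor can be pictured as a small error vector plus the neighbor machine IDs. The field layout below is an assumed illustration of the categories listed above, not the actual report format.

```cpp
// Assumed illustration of the health report described above; the real
// format is not given in the paper.
#include <cstdint>

// One flag per error category the Health Monitor checks.
enum HealthFlags : uint32_t {
    kLinkNorthDown = 1u << 0,   // inter-FPGA connection errors
    kLinkSouthDown = 1u << 1,
    kLinkEastDown  = 1u << 2,
    kLinkWestDown  = 1u << 3,
    kDramBitErrors = 1u << 4,   // DRAM status: bit errors
    kDramCalibFail = 1u << 5,   // DRAM status: calibration failure
    kAppError      = 1u << 6,   // errors in the FPGA application (role)
    kPllUnlocked   = 1u << 7,
    kPcieError     = 1u << 8,
    kTempShutdown  = 1u << 9,   // temperature shutdown occurred
};

struct HealthReport {
    uint32_t error_flags;       // bitwise OR of HealthFlags
    // Machine IDs of the torus neighbors, used to verify that the cables
    // reach the machines the system expects.
    uint64_t north_id, south_id, east_id, west_id;
};

inline bool needs_remap(const HealthReport& r) {
    // Illustrative policy only: any persistent flag triggers the Mapping
    // Manager's relocation/reconfiguration path (Section 3.5).
    return r.error_flags != 0;
}
```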
Based on this information, the Health Monitor may update
a failed machine list (including the failure type). Updating
the machine list will invoke the Mapping Manager, which will
determine, based on the failure location and type, where to re-
locate various application roles on the fabric. It is possible that
relocation is unnecessary, such as when the failure occurred
on a spare node, or when simply reconfiguring the FPGA in-
place is sufficient to resolve the hang. The Mapping Manager
then goes through its reconfiguration process for every FPGA
involved in that service—clearing out any corrupted state and
mapping out any hardware failure or a recurring failure with
an unknown cause. In the current fabric running accelerated
search, failures have been exceedingly rare; we observed no
hangs due to data corruption; the failures that we have seen
have been due to transient phenomena, primarily machine
reboots due to maintenance or other unresponsive services.

References

• Design of ion-implanted MOSFET's with very small physical dimensions.
• A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH.
• CoRAM: an in-fabric memory architecture for FPGA-based computing.
• Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware.
• Maxwell - a 64 FPGA Supercomputer.