A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou¹, Kypros Constantinides², John Demme³, Hadi Esmaeilzadeh⁴, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck⁵, Stephen Heil, Amir Hormati⁶, Joo-Young Kim, Sitaram Lanka, James Larus⁷, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, Doug Burger

Microsoft

¹ Microsoft and University of Texas at Austin
² Amazon Web Services
³ Columbia University
⁴ Georgia Institute of Technology
⁵ Microsoft and University of Washington
⁶ Google, Inc.
⁷ École Polytechnique Fédérale de Lausanne (EPFL)
All authors contributed to this work while employed by Microsoft.
Abstract
Datacenter workloads demand high computational capabili-
ties, flexibility, power efficiency, and low cost. It is challenging
to improve all of these factors simultaneously. To advance dat-
acenter capabilities beyond what commodity server designs
can provide, we have designed and built a composable, recon-
figurable fabric to accelerate portions of large-scale software
services. Each instantiation of the fabric consists of a 6x8 2-D
torus of high-end Stratix V FPGAs embedded into a half-rack
of 48 machines. One FPGA is placed into each server, acces-
sible through PCIe, and wired directly to other FPGAs with
pairs of 10 Gb SAS cables.
In this paper, we describe a medium-scale deployment of
this fabric on a bed of 1,632 servers, and measure its efficacy
in accelerating the Bing web search engine. We describe
the requirements and architecture of the system, detail the
critical engineering challenges and solutions needed to make
the system robust in the presence of failures, and measure
the performance, power, and resilience of the system when
ranking candidate documents. Under high load, the large-
scale reconfigurable fabric improves the ranking throughput of
each server by 95% for a fixed latency distribution—
or, while maintaining equivalent throughput, reduces the tail
latency by 29%.
1. Introduction
The rate at which server performance improves has slowed
considerably. This slowdown, due largely to power limitations,
has severe implications for datacenter operators, who have
traditionally relied on consistent performance and efficiency
improvements in servers to make improved services economi-
cally viable. While specialization of servers for specific scale
workloads can provide efficiency gains, it is problematic for
two reasons. First, homogeneity in the datacenter is highly
desirable to reduce management issues and to provide a consis-
tent platform that applications can rely on. Second, datacenter
services evolve extremely rapidly, making non-programmable
hardware features impractical. Thus, datacenter providers
are faced with a conundrum: they need continued improve-
ments in performance and efficiency, but cannot obtain those
improvements from general-purpose systems.
Reconfigurable chips, such as Field Programmable Gate
Arrays (FPGAs), offer the potential for flexible acceleration
of many workloads. However, as of this writing, FPGAs have
not been widely deployed as compute accelerators in either
datacenter infrastructure or in client devices. One challenge
traditionally associated with FPGAs is the need to fit the ac-
celerated function into the available reconfigurable area. One
could virtualize the FPGA by reconfiguring it at run-time to
support more functions than could fit into a single device.
However, current reconfiguration times for standard FPGAs
are too slow to make this approach practical. Multiple FPGAs
provide scalable area, but cost more, consume more power,
and are wasteful when unneeded. On the other hand, using a
single small FPGA per server restricts the workloads that may
be accelerated, and may make the associated gains too small
to justify the cost.
This paper describes a reconfigurable fabric (that we call
Catapult for brevity) designed to balance these competing
concerns. The Catapult fabric is embedded into each half-rack
of 48 servers in the form of a small board with a medium-sized
FPGA and local DRAM attached to each server. FPGAs are
directly wired to each other in a 6x8 two-dimensional torus,
allowing services to allocate groups of FPGAs to provide the
necessary area to implement the desired functionality.
We evaluate the Catapult fabric by offloading a significant
fraction of Microsoft Bing’s ranking stack onto groups of eight
FPGAs to support each instance of this service. When a server
wishes to score (rank) a document, it performs the software
portion of the scoring, converts the document into a format
suitable for FPGA evaluation, and then injects the document
to its local FPGA. The document is routed on the inter-FPGA
network to the FPGA at the head of the ranking pipeline.
After running the document through the eight-FPGA pipeline,
the computed score is routed back to the requesting server.
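As a concrete, if simplified, sketch of this flow from the host's side, the C++ below walks through the steps just described. Every name in it (FpgaChannel, encode_for_fpga, and so on) is a hypothetical stand-in rather than the actual Catapult interface, and the PCIe/DMA interaction is stubbed out.

```cpp
// Hypothetical sketch of the host-side scoring path; names and types are
// illustrative only, and the PCIe/DMA interaction is stubbed with comments.
#include <cstdint>
#include <vector>

struct Document {
    std::vector<uint32_t> features;   // output of the software ranking stages
};

struct EncodedDoc {
    std::vector<uint8_t> bytes;       // FPGA-friendly wire format
};

// Stand-in for the real driver/DMA interface (Section 3.1).
class FpgaChannel {
public:
    void send(const EncodedDoc&) { /* DMA into the local FPGA's input slot */ }
    float receive_score()        { /* block until the score returns */ return 0.0f; }
};

// Assumed helper: convert a document into the format the pipeline expects.
EncodedDoc encode_for_fpga(const Document& doc) {
    EncodedDoc e;
    const uint8_t* raw = reinterpret_cast<const uint8_t*>(doc.features.data());
    e.bytes.assign(raw, raw + doc.features.size() * sizeof(uint32_t));
    return e;
}

float score_document(FpgaChannel& fpga, const Document& doc) {
    EncodedDoc encoded = encode_for_fpga(doc);  // software portion + conversion
    fpga.send(encoded);           // injected locally; the inter-FPGA network
                                  // routes it to the head of the 8-FPGA pipeline
    return fpga.receive_score();  // score routed back to the requesting server
}
```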
Although we designed the fabric for general-purpose service

acceleration, we used web search to drive its requirements,
due to both the economic importance of search and its size
and complexity. We set a performance target that would be a
significant boost over software—2x throughput in the number
of documents ranked per second per server, including portions
of ranking which are not offloaded to the FPGA.
One of the challenges of maintaining such a fabric in the
datacenter is resilience. The fabric must stay substantially
available in the presence of errors, failing hardware, reboots,
and updates to the ranking algorithm. FPGAs can potentially
corrupt their neighbors or crash the hosting servers during
bitstream reconfiguration. We incorporated a failure handling
protocol that can reconfigure groups of FPGAs or remap ser-
vices robustly, recover from failures by remapping FPGAs,
and report a vector of errors to the management software to
diagnose problems.
We tested the reconfigurable fabric, search workload, and
failure handling service on a bed of 1,632 servers equipped
with FPGAs. The experiments show that large gains in search
throughput and latency are achievable using the large-scale
reconfigurable fabric. Compared to a pure software imple-
mentation, the Catapult fabric achieves a 95% improvement in
throughput at each ranking server with an equivalent latency
distribution—or at the same throughput, reduces tail latency by
29%. The system is able to run stably for long periods, with a
failure handling service quickly reconfiguring the fabric upon
errors or machine failures. The rest of this paper describes the
Catapult architecture and our measurements in more detail.
2. Catapult Hardware
The acceleration of datacenter services imposes several strin-
gent requirements on the design of a large-scale reconfigurable
fabric. First, since datacenter services are typically large and
complex, a large amount of reconfigurable logic is necessary.
Second, the FPGAs must fit within the datacenter architecture
and cost constraints. While reliability is important, the scale
of the datacenter permits sufficient redundancy that a small
rate of faults and failures is tolerable.
To achieve the required capacity for a large-scale reconfig-
urable fabric, one option is to incorporate multiple FPGAs
onto a daughtercard and house such a card along with a subset
of the servers. We initially built a prototype in this fashion,
with six Xilinx Virtex 6 SX315T FPGAs connected in a mesh
network through the FPGAs' general-purpose I/Os. Although
straightforward to implement, this solution has four problems.
First, it is inelastic: if more FPGAs are needed than there are
on the daughtercard, the desired service cannot be mapped.
Second, if fewer FPGAs are needed, there is stranded capac-
ity. Third, the power and physical space for the board cannot
be accommodated in conventional ultra-dense servers, requir-
ing either heterogeneous servers in each rack, or a complete
redesign of the servers, racks, network, and power distribu-
tion. Finally, the large board is a single point of failure, whose
failure would result in taking down the entire subset of servers.
Figure 1: (a) A block diagram of the FPGA board. (b) A picture
of the manufactured board. (c) A diagram of the 1U, half-width
server that hosts the FPGA board. The air flows from the left
to the right, leaving the FPGA in the exhaust of both CPUs.
Figure 2: The logical mapping of the torus network, and the
physical wiring on a pod of 2 x 24 servers.
The alternative approach we took places a small daughter-
card in each server with a single high-end FPGA, and connects
the cards directly together with a secondary network. Provided
that the latency on the inter-FPGA network is sufficiently low,
and that the bandwidth is sufficiently high, services requiring
more than one FPGA can be mapped across FPGAs residing
in multiple servers. This elasticity permits efficient utilization
of the reconfigurable logic, and keeps the added acceleration
hardware within the power, thermal, and space limits of dense
datacenter servers. To balance the expected per-server per-
formance gains versus the necessary increase in total cost of
ownership (TCO), including both increased capital costs and
operating expenses, we set aggressive power and cost goals.
Given the sensitivity of cost numbers on elements such as pro-
duction servers, we cannot give exact dollar figures; however,
adding the Catapult card and network to the servers did not
exceed our limit of an increase in TCO of 30%, including a
limit of 10% for total server power.
2.1. Board Design
To minimize disruption to the motherboard, we chose to in-
terface the board to the host CPU over PCIe. While a tighter
coupling of the FPGA to the CPU would provide benefits in

terms of latency, direct access to system memory, and poten-
tially coherence, the selection of PCIe minimized disruption to
this generation of the server design. Since the FPGA resides in
I/O space, the board needed working memory to accommodate
certain services. We chose to add local DRAM, as SRAM
QDR arrays were too expensive to achieve sufficient capacity.
8 GB of DRAM was sufficient to map the services we had
planned, and fit within our power and cost envelopes.
Figure 1 shows a logical diagram of the FPGA board along
with a picture of the manufactured board and the server it
installs into [20]. We chose a high-end Altera Stratix V D5
FPGA [3], which has considerable reconfigurable logic, on-
chip memory blocks, and DSP units. The 8 GB of DRAM
consists of two dual-rank DDR3-1600 SO-DIMMs, which can
operate at DDR3-1333 speeds with the full 8 GB capacity, or
trade capacity for additional bandwidth by running as 4 GB
single-rank DIMMs at DDR3-1600 speeds. The PCIe and
inter-FPGA network traces are routed to a mezzanine connec-
tor on the bottom of the daughtercard, which plugs directly
into a socket on the motherboard. Other components on the
board include a programmable oscillator and 32 MB of Quad
SPI flash to hold FPGA configurations. Because of the limited
physical size of the board and the number of signals that must
be routed, we used a 16-layer board design. Our target appli-
cations would benefit from increased memory bandwidth, but
there was insufficient physical space to add additional DRAM
channels. We chose to use DIMMs with ECC to add resilience
as DRAM failures are commonplace at datacenter scales.
Figure 1(c) shows the position of the board in one of the
datacenter servers. We used the mezzanine connector at the
back of the server so that heat from the FPGA did not disrupt
the existing system components. Since the FPGA is subject
to the air being heated by the host CPUs, which can reach
68°C, we used an industrial-grade FPGA part rated for higher-
temperature operation up to 100°C. It was also necessary
to add EMI shielding to the board to protect other server
components from interference from the large number of high-
speed signals on the board. One requirement for serviceability
was that no jumper cables should be attached to the board
(e.g., power or signaling). By limiting the power draw of the
daughtercard to under 25 W, the PCIe bus alone provided all
necessary power. By keeping the power draw to under 20 W
during normal operation, we met our thermal requirements
and our 10% limit for added power.
2.2. Network Design
The requirements for the inter-FPGA network were low la-
tency and high bandwidth to meet the performance targets,
low component costs, plus only marginal operational expense
when servicing machines. The rack configuration we target
is organized into two half-racks called pods. Each pod has
its own power distribution unit and top-of-rack switch. The
pods are organized in a 24U arrangement of 48 half-width
1U servers (two servers fit into each 1U tray).
Based on our rack configuration, we selected a two-
dimensional, 6x8 torus for the network topology. This arrange-
ment balanced routability and cabling complexity. Figure 2
shows how the torus is mapped onto a pod of machines. The
server motherboard routes eight high-speed traces from the
mezzanine connector to the back of the server chassis, where
the connections plug into a passive backplane. The traces are
exposed on the backplane as two SFF-8088 SAS ports. We
built custom cable assemblies (shells of eight and six cables)
that plugged into each SAS port and routed two high-speed
signals between each pair of connected FPGAs. At 10 Gb/s sig-
naling rates, each inter-FPGA network link supports 20 Gb/s
of peak bidirectional bandwidth at sub-microsecond latency,
with no additional networking costs such as NICs or switches.
Since the server sleds are plugged into a passive backplane,
and the torus cabling also attaches to the backplane, a server
can be serviced by pulling it out of the backplane without
unplugging any cables. Thus, the cable assemblies can be
installed at rack integration time, tested for topological cor-
rectness, and delivered to the datacenter with correct wiring
and low probability of errors when servers are repaired.
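To make the torus wiring concrete, the following sketch computes a position's four neighbors with wrap-around on the 6x8 grid. The coordinate convention and the row-major numbering of servers within a pod are assumptions made for illustration; the paper does not specify how torus positions map to server indices.

```cpp
// Neighbor computation on a 6x8 torus (48 FPGAs per pod). The row-major
// numbering of servers is an assumed convention, not taken from the paper.
#include <cstdio>

constexpr int COLS = 6;   // "x" dimension of the torus
constexpr int ROWS = 8;   // "y" dimension of the torus

struct Coord { int x, y; };

// Step by (dx, dy) with wrap-around in both dimensions.
Coord neighbor(Coord c, int dx, int dy) {
    return { (c.x + dx + COLS) % COLS, (c.y + dy + ROWS) % ROWS };
}

int server_index(Coord c) { return c.y * COLS + c.x; }   // assumed row-major

int main() {
    Coord corner{0, 0};
    printf("east:  %d\n", server_index(neighbor(corner, +1, 0)));   // 1
    printf("west:  %d\n", server_index(neighbor(corner, -1, 0)));   // 5  (wraps)
    printf("south: %d\n", server_index(neighbor(corner, 0, +1)));   // 6
    printf("north: %d\n", server_index(neighbor(corner, 0, -1)));   // 42 (wraps)
    return 0;
}
```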
2.3. Datacenter Deployment
To test this architecture on a number of datacenter services at
scale, we manufactured and deployed the fabric in a production
datacenter. The deployment consisted of 34 populated pods
of machines in 17 racks, for a total of 1,632 machines. Each
server uses an Intel Xeon 2-socket EP motherboard, 12-core
Sandy Bridge CPUs, 64 GB of DRAM, and two SSDs in
addition to four HDDs. The machines have a 10 Gb network
card connected to a 48-port top-of-rack switch, which in turn
connects to a set of level-two switches.
The daughtercards and cable assemblies were both tested at
manufacture and again at system integration. At deployment,
we discovered that 7 cards (0.4%) had a hardware failure, and
that one of the 3,264 links (0.03%) in the cable assemblies
was defective. Since then, over several months of operation,
we have seen no additional hardware failures.
3. Infrastructure and Platform Architecture
Supporting an at-scale deployment of reconfigurable hardware
requires a robust software stack capable of detecting failures
while providing a simple and accessible interface to software
applications. If developers have to worry about low-level
FPGA details, including drivers and system functions (e.g.,
PCIe), the platform will be difficult to use and rendered in-
compatible with future hardware generations. There are three
categories of infrastructure that must be carefully designed
to enable productive use of the FPGA: (1) APIs for interfac-
ing software with the FPGA, (2) interfaces between FPGA
application logic and board-level functions, and (3) support
for resilience and debugging.

3.1. Software Interface
Applications targeting the Catapult fabric share a common
driver and user-level interface. The communication interface
between the CPU and FPGA must satisfy two key design
goals: (1) the interface must incur low latency, taking fewer
than 10 μs for transfers of 16 KB or less, and (2) the interface
must be safe for multithreading. To achieve these goals, we
developed a custom PCIe interface with DMA support.
In our PCIe implementation, low latency is achieved by
avoiding system calls. We allocate one input and one output
buffer in non-paged, user-level memory and supply the FPGA
with a base pointer to the buffers’ physical memory addresses.
Thread safety is achieved by dividing the buffer into 64 slots,
where each slot is 1/64th of the buffer, and by statically assign-
ing each thread exclusive access to one or more slots. In the
case study in Section 4, we use 64 slots of 64 KB each.
Each slot has a set of status bits indicating whether the
slot is full. To send data to the FPGA, a thread fills its slot
with data, then sets the appropriate full bit for that slot. The
FPGA monitors the full bits and fairly selects a candidate slot
for DMA’ing into one of two staging buffers on the FPGA,
clearing the full bit once the data has been transferred. Fairness
is achieved by taking periodic snapshots of the full bits, and
DMA’ing all full slots before taking another snapshot of the
full bits. When the FPGA produces results for readback, it
checks to make sure that the output slot is empty and then
DMAs the results into the output buffer. Once the DMA is
complete, the FPGA sets the full bit for the output buffer and
generates an interrupt to wake and notify the consumer thread.
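A minimal host-side model of this slot discipline is sketched below, with the FPGA's half of the protocol simulated in software. The names, the use of ordinary process memory, and the atomics-based synchronization are illustrative assumptions; the real implementation works over non-paged buffers and PCIe DMA as described above.

```cpp
// Simplified model of the 64-slot, thread-safe DMA buffer described above.
// Real Catapult hands the FPGA the physical address of non-paged user memory;
// here the "FPGA" is just another function and the memory is ordinary process
// memory. Allocate DmaBuffer statically or on the heap (it is ~4 MB).
#include <array>
#include <atomic>
#include <cstdint>
#include <cstring>

constexpr int    kSlots     = 64;
constexpr size_t kSlotBytes = 64 * 1024;   // 64 KB per slot, as in Section 4

struct DmaBuffer {
    std::array<std::array<uint8_t, kSlotBytes>, kSlots> slots;
    std::array<std::atomic<bool>, kSlots> full{};   // per-slot "full" status bit
};

// Called by the thread that exclusively owns `slot`.
void send_to_fpga(DmaBuffer& buf, int slot, const uint8_t* data, size_t len) {
    // Wait until the FPGA has drained the previous request in this slot.
    while (buf.full[slot].load(std::memory_order_acquire)) { /* spin */ }
    std::memcpy(buf.slots[slot].data(), data, len);
    buf.full[slot].store(true, std::memory_order_release);  // publish the data
}

// Simulated FPGA pass: snapshot the full bits, service every slot that was
// full in the snapshot, then clear its bit -- the fairness scheme above.
void fpga_service_pass(DmaBuffer& buf) {
    std::array<bool, kSlots> snapshot;
    for (int i = 0; i < kSlots; ++i)
        snapshot[i] = buf.full[i].load(std::memory_order_acquire);
    for (int i = 0; i < kSlots; ++i) {
        if (!snapshot[i]) continue;
        // ... DMA buf.slots[i] into one of the FPGA's staging buffers ...
        buf.full[i].store(false, std::memory_order_release);
    }
}
```

Because each thread owns its slots exclusively, a sender only ever waits on its own slot, never on other threads.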
To configure the fabric with a desired function, user level
services may initiate FPGA reconfigurations through calls to
a low-level software library. When a service is deployed, each
server is designated to run a specific application on its local
FPGA. The server then invokes the reconfiguration function,
passing in the desired bitstream as a parameter.
3.2. Shell Architecture
In typical FPGA programming environments, the user is of-
ten responsible for developing not only the application itself
but also building and integrating system functions required
for data marshaling, host-to-FPGA communication, and inter-
chip FPGA communication (if available). System integration
places a significant burden on the user and can often exceed
the effort needed to develop the application itself. This devel-
opment effort is often not portable to other boards, making it
difficult for applications to work on future platforms.
Motivated by the need for user productivity and design
re-usability when targeting the Catapult fabric, we logically
divide all programmable logic into two partitions: the shell and
the role. The shell is a reusable portion of programmable logic
common across applications—while the role is the application
logic itself, restricted to a large fixed region of the chip.
Figure 3: Components of the Shell Architecture.
Role designers access convenient and well-defined inter-
faces and capabilities in the shell (e.g., PCIe, DRAM, routing,
etc.) without concern for managing system correctness. The
shell consumes 23% of each FPGA, although extra capacity
can be obtained by discarding unused functions. In the future,
partial reconfiguration would allow for dynamic switching
between roles while the shell remains active—even routing
inter-FPGA traffic while a reconfiguration is taking place.
Figure 3 shows a block-level diagram of the shell architec-
ture, consisting of the following components:
• Two DRAM controllers, which can be operated independently or as a unified interface. On the Stratix V, our dual-rank DIMMs operate at 667 MHz. Single-rank DIMMs (or only using one of the two ranks of a dual-rank DIMM) can operate at 800 MHz.
• Four high-speed serial links running SerialLite III (SL3), a lightweight protocol for communicating with neighboring FPGAs. It supports FIFO semantics, Xon/Xoff flow control, and ECC.
• Router logic to manage traffic arriving from PCIe, the role, or the SL3 cores.
• Reconfiguration logic, based on a modified Remote Status Update (RSU) unit, to read/write the configuration Flash.
• The PCIe core, with the extensions to support DMA.
• Single-event upset (SEU) logic, which periodically scrubs the FPGA configuration state to reduce system or application errors caused by soft errors.
The router is a standard crossbar that connects the four
inter-FPGA network ports, the PCIe controller, and the ap-
plication role. The routing decisions are made by a static
software-configured routing table that supports different rout-
ing policies. The transport protocol is virtual cut-through with
no retransmission or source buffering.
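One way to picture the software-configured routing table is as a per-destination lookup of the output port, filled in by some routing policy. The sketch below assumes dimension-order routing and a 6x8 coordinate scheme; neither is specified in the paper, which only says the table supports different policies.

```cpp
// Illustrative static routing table for the shell's crossbar router.
// Dimension-order (x-then-y) routing is an assumed policy for this sketch.
#include <array>

enum class Port { Local, North, South, East, West };

constexpr int COLS = 6, ROWS = 8;

struct Coord { int x, y; };

// Signed distance on a ring of size n, mapped into (-n/2, n/2].
int ring_delta(int from, int to, int n) {
    int d = (to - from) % n;
    if (d < 0) d += n;              // now in [0, n)
    return (d > n / 2) ? d - n : d; // prefer the shorter direction
}

Port next_hop(Coord self, Coord dest) {
    int dx = ring_delta(self.x, dest.x, COLS);
    int dy = ring_delta(self.y, dest.y, ROWS);
    if (dx > 0) return Port::East;
    if (dx < 0) return Port::West;
    if (dy > 0) return Port::South;
    if (dy < 0) return Port::North;
    return Port::Local;             // destination reached: deliver to the role
}

// The "table" itself: one precomputed output port per destination FPGA.
std::array<Port, COLS * ROWS> build_routing_table(Coord self) {
    std::array<Port, COLS * ROWS> table{};
    for (int y = 0; y < ROWS; ++y)
        for (int x = 0; x < COLS; ++x)
            table[y * COLS + x] = next_hop(self, {x, y});
    return table;
}
```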

Since uncorrected bit errors can cause high-level disruptions
(requiring intervention from global management software), we
employ double-bit error detection and single-bit error correc-
tion on our DRAM controllers and SL3 links. The use of ECC
on our SL3 links incurs a 20% reduction in peak bandwidth.
ECC on the SL3 links is performed on individual flits, with cor-
rection for single-bit errors and detection of double-bit errors.
Flits with three or more bit errors may proceed undetected
through the pipeline, but are likely to be detected at the end
of packet transmission with a CRC check. Double-bit errors
and CRC failures result in the packet being dropped and not
returned to the host. In the event of a dropped packet, the host
will time out and divert the request to a higher-level failure
handling protocol.
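The host's side of this recovery path can be modeled as a bounded wait on the response that falls back to the failure-handling protocol on timeout. The sketch below uses standard C++ synchronization as a stand-in; the timeout value, names, and escalation hook are assumptions, and the real driver waits on the output slot's full bit and interrupt described in Section 3.1.

```cpp
// Minimal model of "time out and divert to a higher-level failure handler".
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <optional>

struct PendingRequest {
    std::mutex m;
    std::condition_variable cv;
    std::optional<float> score;       // set by the completion/interrupt path

    void complete(float s) {
        { std::lock_guard<std::mutex> lk(m); score = s; }
        cv.notify_one();
    }
};

// Hypothetical escalation hook: re-issue the request in software, mark the
// FPGA suspect, and let the Health Monitor / Mapping Manager (Section 3.3)
// take over. Here it only returns a sentinel value.
float divert_to_failure_handler() { return -1.0f; }

float wait_for_score(PendingRequest& req, std::chrono::milliseconds deadline) {
    std::unique_lock<std::mutex> lk(req.m);
    if (req.cv.wait_for(lk, deadline, [&] { return req.score.has_value(); }))
        return *req.score;               // normal completion
    return divert_to_failure_handler();  // packet dropped (ECC/CRC) or hang
}
```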
The SEU scrubber runs continuously to scrub configura-
tion errors. If the error rates can be brought sufficiently low,
with conservative signaling speeds and correction, the rare
errors can be handled by the higher levels of software, without
resorting to expensive approaches such as source-based re-
transmission or store-and-forward protocols. The speed of the
FPGAs and the ingestion rate of requests are high enough that
store-and-forward would be too expensive for the applications
that we have implemented.
3.3. Software Infrastructure
The system software, both at the datacenter level and in each
individual server, required several changes to accommodate
the unique aspects of the reconfigurable fabric. These changes
fall into three categories: ensuring correct operation, failure
detection and recovery, and debugging.
Two new services are introduced to implement this sup-
port. The first, called the Mapping Manager, is responsible for
configuring FPGAs with the correct application images when
starting up a given datacenter service. The second, called the
Health Monitor, is invoked when there is a suspected failure
in one or more systems. These services run on servers within
the pod and communicate through the Ethernet network.
3.4. Correct Operation
The primary challenge we found to ensuring correct operation
was the potential for instability in the system introduced by
FPGAs reconfiguring while the system was otherwise up and
stable. These problems manifested along three dimensions.
First, a reconfiguring FPGA can appear as a failed PCIe device
to the host, raising a non-maskable interrupt that may desta-
bilize the system. Second, a failing or reconfiguring FPGA
may corrupt the state of its neighbors across the SL3 links
by randomly sending traffic that may appear valid. Third, re-
configuration cannot be counted on to occur synchronously
across servers, so FPGAs must remain robust to traffic from
neighbors with incorrect or incompatible configurations (e.g.,
"old" data from FPGAs that have not yet been reconfigured).
The solution to a reconfiguring PCIe device is that the driver
that sits behind the FPGA reconfiguration call must first dis-
able non-maskable interrupts for the specific PCIe device (the
FPGA) during reconfiguration.
The solution to the corruption of a neighboring FPGA dur-
ing reconfiguration is more complex. When remote FPGAs
are reconfigured, they may send garbage data. To prevent this
data from corrupting neighboring FPGAs, the FPGA being
reconfigured sends a “TX Halt” message, indicating that the
neighbors should ignore all further traffic until the link is re-
established. In addition, messages are delayed a few clock
cycles so that, in case of an unexpected link failure, it can be
detected and the message can be suppressed.
Similarly, when an FPGA comes out of reconfiguration, it
cannot trust that its neighbors are not sending garbage data.
To handle this, each FPGA comes up with “RX Halt” enabled,
automatically throwing away any message coming in on the
SL3 links. The Mapping Manager tells each server to release
RX Halt once all FPGAs in a pipeline have been configured.
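Putting Sections 3.1 and 3.4 together, the ordering of the safety steps around a reconfiguration can be sketched as follows. Every function here is a hypothetical stand-in for driver, shell, or Mapping Manager functionality; only the ordering reflects the protocol described above.

```cpp
// Hedged sketch of the safe-reconfiguration ordering described above.
// All functions are stand-ins with empty bodies; only the ordering matters.
#include <cstdint>
#include <vector>

void disable_pcie_nmi() { /* driver: mask NMIs for this PCIe device */ }
void enable_pcie_nmi()  { /* driver: restore NMI delivery */ }
void send_tx_halt()     { /* shell: tell neighbors to ignore this link */ }
void write_bitstream(const std::vector<uint8_t>&) { /* reprogram the FPGA */ }
void release_rx_halt()  { /* shell: resume accepting SL3 traffic */ }

// Run on the server whose local FPGA is being reconfigured.
void reconfigure_local_fpga(const std::vector<uint8_t>& bitstream) {
    disable_pcie_nmi();   // a reconfiguring FPGA can look like a failed PCIe device
    send_tx_halt();       // neighbors ignore this link until it is re-established
    write_bitstream(bitstream);
    // The FPGA comes back up with RX Halt enabled, discarding SL3 traffic
    // from neighbors that may not yet have been reconfigured.
    enable_pcie_nmi();
}

// Invoked when the Mapping Manager reports that every FPGA in the pipeline
// has been configured.
void on_pipeline_ready() { release_rx_halt(); }
```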
3.5. Failure Detection and Recovery
When a datacenter application hangs for any reason, a machine
at a higher level in the service hierarchy (such as a machine
that aggregates results) will notice that a set of servers are
unresponsive. At that point, the Health Monitor is invoked.
The Health Monitor queries each machine to find its status.
If a server is unresponsive, it is put through a sequence of
soft reboot, hard reboot, and then flagged for manual service
and possible replacement, until the machine starts working
correctly. If the server is operating correctly, it responds to
the Health Monitor with information about the health of its
local FPGA and associated links. The Health Monitor returns
a vector with error flags for inter-FPGA connections, DRAM
status (bit errors and calibration failures), errors in the FPGA
application, PLL lock issues, PCIe errors, and the occurrence
of a temperature shutdown. This call also returns the machine
IDs of the north, south, east, and west neighbors of an FPGA,
to test whether the neighboring FPGAs in the torus are acces-
sible and that they are the machines that the system expects
(in case the cables are miswired or unplugged).
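The per-machine status returned to the Health Monitor can be pictured as a small error vector plus the neighbor machine IDs. The field layout below is an assumed illustration of the categories listed above, not the actual report format.

```cpp
// Assumed illustration of the health report described above; the real
// format is not given in the paper.
#include <cstdint>

// One flag per error category the Health Monitor checks.
enum HealthFlags : uint32_t {
    kLinkNorthDown = 1u << 0,   // inter-FPGA connection errors
    kLinkSouthDown = 1u << 1,
    kLinkEastDown  = 1u << 2,
    kLinkWestDown  = 1u << 3,
    kDramBitErrors = 1u << 4,   // DRAM status: bit errors
    kDramCalibFail = 1u << 5,   // DRAM status: calibration failure
    kAppError      = 1u << 6,   // errors in the FPGA application (role)
    kPllUnlocked   = 1u << 7,
    kPcieError     = 1u << 8,
    kTempShutdown  = 1u << 9,   // temperature shutdown occurred
};

struct HealthReport {
    uint32_t error_flags;       // bitwise OR of HealthFlags
    // Machine IDs of the torus neighbors, used to verify that the cables
    // reach the machines the system expects.
    uint64_t north_id, south_id, east_id, west_id;
};

inline bool needs_remap(const HealthReport& r) {
    // Illustrative policy only: any persistent flag triggers the Mapping
    // Manager's relocation/reconfiguration path (Section 3.5).
    return r.error_flags != 0;
}
```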
Based on this information, the Health Monitor may update
a failed machine list (including the failure type). Updating
the machine list will invoke the Mapping Manager, which will
determine, based on the failure location and type, where to re-
locate various application roles on the fabric. It is possible that
relocation is unnecessary, such as when the failure occurred
on a spare node, or when simply reconfiguring the FPGA in-
place is sufficient to resolve the hang. The Mapping Manager
then goes through its reconfiguration process for every FPGA
involved in that service—clearing out any corrupted state and
mapping out any hardware failure or a recurring failure with
an unknown cause. In the current fabric running accelerated
search, failures have been exceedingly rare; we observed no
hangs due to data corruption; the failures that we have seen
have been due to transient phenomena, primarily machine
reboots due to maintenance or other unresponsive services.

References

• Design of ion-implanted MOSFET's with very small physical dimensions.
• A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH.
• CoRAM: an in-fabric memory architecture for FPGA-based computing.
• Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware.
• Maxwell - a 64 FPGA Supercomputer.