Re-architecting datacenter networks and stacks for low
latency and high performance
Mark Handley
University College London
London, UK
m.handley@cs.ucl.ac.uk
Costin Raiciu
Alexandru Agache
Andrei Voinescu
University Politehnica of Bucharest
Bucharest, Romania
firstname.lastname@cs.pub.ro
Andrew W. Moore
Gianni Antichi
Marcin Wójcik
University of Cambridge
Cambridge, UK
firstname.lastname@cl.cam.ac.uk
ABSTRACT
Modern datacenter networks provide very high capacity via redun-
dant Clos topologies and low switch latency, but transport protocols
rarely deliver matching performance. We present NDP, a novel data-
center transport architecture that achieves near-optimal completion
times for short transfers and high flow throughput in a wide range
of scenarios, including incast. NDP switch buffers are very shal-
low and when they fill the switches trim packets to headers and
priority forward the headers. This gives receivers a full view of
instantaneous demand from all senders, and is the basis for our
novel, high-performance, multipath-aware transport protocol that
can deal gracefully with massive incast events and prioritize traffic
from different senders on RTT timescales. We implemented NDP in
Linux hosts with DPDK, in a software switch, in a NetFPGA-based
hardware switch, and in P4. We evaluate NDP’s performance in
our implementations and in large-scale simulations, simultaneously
demonstrating support for very low-latency and high throughput.
CCS CONCEPTS
• Networks → Network protocols; Data center networks;
KEYWORDS
Datacenters; Network Stacks; Transport Protocols
ACM Reference format:
Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, An-
drew W. Moore, Gianni Antichi, and Marcin Wójcik. 2017. Re-architecting
datacenter networks and stacks for low latency and high performance. In
Proceedings of SIGCOMM '17, Los Angeles, CA, USA, August 21-25, 2017, 14 pages. https://doi.org/10.1145/3098822.3098825
1 INTRODUCTION
Datacenters have evolved rapidly over the last few years, with Clos [1, 17] topologies becoming commonplace, and a new emphasis on low latency, first with improved transport protocols such as DCTCP [4] and more recently with solutions such as RDMA over Converged
Ethernet v2 [25] that use Ethernet flow control in switches [23] to avoid packet loss caused by congestion.
In a lightly loaded network, Ethernet flow control can give very good low-delay performance [20] for request/response flows that dominate datacenter workloads. Packets are queued rather than lost, if necessary producing back-pressure, pausing forwarding across several switches, and so no time is wasted on being overly conservative at start-up or waiting for retransmission timeouts. However, based on experience deploying RoCEv2 at Microsoft [20], Guo et al. note that a lossless network does not guarantee low latency. When congestion occurs, queues build up and PFC pause frames are generated. Both queues and PFC pause frames increase network latency. They conclude that "how to achieve low network latency and high network throughput at the same time for RDMA is still an open problem."
In this paper we present a new datacenter protocol architecture,
NDP, that takes a different approach to simultaneously achieving
both low delay and high throughput. NDP has no connection setup
handshake, and allows flows to start sending instantly at full rate.
We use per-packet multipath load balancing, which avoids core
network congestion at the expense of reordering, and in switches
use an approach similar to Cut Payload (CP) [9], which trims the
payloads of packets when a switch queue fills. This gives a network
that is lossless for metadata, but not for traffic payloads. In spite of
reordering, lossless metadata gives the receiver a complete picture
regarding inbound traffic and we take advantage of it to build a
radical new transport protocol that achieves very low latency for
short flows, with minimal interference between flows to different
destinations even in pathological traffic patterns.
We have implemented NDP in Linux hosts, in a software switch,
in a hardware switch based on NetFPGA SUME [41], in P4 [29], and
in simulation. We will demonstrate that NDP achieves:
• Better short-flow performance than DCTCP or DCQCN.
• Greater than 95% of the maximum network capacity in a heavily loaded network with switch queues of only eight packets.
• Near-perfect delay and fairness in incast [18] scenarios.
• Minimal interference between flows to different hosts.
• Effective prioritization of straggler traffic during incasts.
2 DESIGN SPACE
Intra-datacenter network traffic primarily consists of request/response
RPC-like protocols. Mean network utilization is rarely very high,
but applications can be very bursty. The big problem today is latency,
especially for short RPC-like workloads. At the cost of head-of-line

blocking, today’s applications often reuse TCP connections across
multiple requests to amortize the latency cost of the TCP handshake.
Is it possible to improve the protocol stack so much that every
request could use a new connection and at the same time expect
to get close to the raw latency and bandwidth of the underlying
network, even under heavy load?
We will show that these goals
are achievable, but to do so involves changes to how traffic is routed,
how switches cope with overload, and most importantly, requires
a completely different transport protocol from those used today.
Before we describe our solution in §3, we first highlight the key
architectural points that must be considered.
2.1 End-to-end Service Demands
What do applications want from a datacenter network?
Location Independence.
It shouldn’t matter which machine in a
datacenter the elements of a distributed application are run on. This is
commonly achieved using high-capacity parallel Clos topologies [1, 17]. Such topologies have sufficient cross-sectional bandwidth that
the core network should not be a bottleneck.
Low Latency.
Clos networks can supply bandwidth, modulo issues
with load balancing between paths, but often fall short in providing
low latency service. Predictable very low latency request/response
behavior is the key application demand, and it is the hardest to satisfy.
This is more important than large file transfer performance, though
high throughput is still a requirement, especially for storage servers.
The strategy must be to optimize for low latency first.
Incast.
Datacenter workloads often require sending requests to large
numbers of workers and then handling their near-simultaneous re-
sponses, causing a problem called incast. A good networking stack
should shield applications from the side-effects of incast traffic pat-
terns gracefully while providing low latency.
Priority.
It is also common for a receiver to handle many incom-
ing flows corresponding to different requests simultaneously. For
example, it may have fanned out two different requests to workers,
and the responses to those requests are now arriving with the last
responses to the first request overlapping the first responses to the
second request. Many applications need all the responses to a request
before they can proceed. A very desirable property is for the receiver
to be able to prioritize arriving traffic from stragglers. The receiver
is the only entity that can dynamically prioritize its inbound traffic,
and this impacts protocol design.
2.2 Transport Protocol
Current datacenter transport protocols satisfy some of these appli-
cation requirements, but satisfying all of them places some unusual
demands on datacenter transport protocols.
Zero-RTT connection setup.
To minimize latency, many applica-
tions would like zero-RTT delivery for small outgoing transfers (or
one RTT for request/response). We need a protocol that doesn’t
require a handshake to complete before sending data, but this poses
security and correctness issues.
Fast start.
Another implication of zero-RTT delivery is that a trans-
port protocol can’t probe for available bandwidth—to minimize
latency, it must assume bandwidth is available, optimistically send
a full initial window, and then react appropriately when it isn’t. In
contrast to the Internet, simpler solutions are possible in datacenter
environments, because link speeds and network delays (except for
queuing delays) can mostly be known in advance.
Per-packet ECMP.
One problem with Clos topologies is that per-
flow ECMP hashing of flows to paths can cause unintended flow
collisions; one deployment [20] found this reduced throughput by 40%. For large transfers, multipath protocols such as MPTCP can establish enough subflows to find unused paths [31], but they can
do little to help with the latency of very short transfers. The only
solution here is to stripe across multiple paths on a per-packet basis.
This complicates transport protocol design.
Reorder-tolerant handshake.
If we perform a zero-RTT transfer
with per-packet multipath forwarding in a Clos network, even the
very first window of packets may arrive in a random order. This effect
has implications for connection setup: the first packet to arrive will
not be the first packet of the connection. Such a transport protocol
must be capable of establishing connection state no matter which
packet from the initial window is first to arrive.
Optimized for Incast.
Although Clos networks are well-provisioned
for core-capacity, incast traffic can make life difficult for any trans-
port protocol when applications fan out requests to many workers
simultaneously. Such traffic patterns can cause high packet loss rates,
especially if the transport protocol is aggressive in the first RTT. To
handle this gracefully requires some assistance from the switches.
2.3 Switch Service Model
Application requirements, the transport protocol and the service
model at network switches are tightly coupled, and need to be op-
timized holistically. Of particular relevance is what happens when
a switch port is congested. The switch service model heavily in-
fluences the design space of both protocol and congestion control
algorithms, and couples tightly with forwarding behavior: per-packet
multipath load balancing is ideal as it minimizes hotspots, but it com-
plicates the ability of end-systems to infer network congestion and
increases the importance of graceful overload behavior.
Loss as a congestion feedback mechanism has the advantage that
dropped packets don’t use bottleneck bandwidth, and loss only im-
pacts flows traversing the congested link—not all schemes have
these properties. The downside is that it leads to uncertainty as to
a packet’s outcome. Duplicate or selective ACKs to trigger retrans-
missions only work well for long-lived flows. With short flows, tail
loss is common, and then you have to fall back on retransmission
timeouts (RTO). Short RTOs are only safe if you can constrain the
delay in the network, so you need to maintain short queues [4] which
in turn constrain the congestion control schemes you can use. Loss
also couples badly with per-packet multipath forwarding; because
the packets of a flow arrive out of order, loss detection is greatly
complicated: fast retransmit is often not possible because it is not rare for a packet to arrive out of sequence by a whole window.
ECN helps significantly. DCTCP uses ECN [32] with a sharp
threshold for packet marking, and a congestion control scheme that
aims to push in and out of the marking regime. This greatly reduces
loss for long-lived flows, and allows the use of small buffers, re-
ducing queuing delay. For short flows though, ECN has less benefit
because the flow doesn’t have time to react to the ECN feedback. In
practice switches use large shared buffers in conjunction with ECN
and this reduces incast losses, but retransmit timers must be less aggressive. ECN does have the advantage though that it interacts quite well with per-packet multipath forwarding, given a transport protocol design that can tolerate reordering.

[Figure 1: Key components of NDP. A new receiver-driven protocol built on per-packet multipath over a low-latency, uncongested Clos core, with fast flow start, packet trimming, small queues, control-packet priority, receiver pacing, fast retransmission, and zero-RTT connect; incast and reordering occur, but the receiver has full information.]
Lossless Ethernet using 802.3X Pause frames [23] or 802.1Qbb priority-based flow control (PFC) [24] can prevent loss, avoiding
the need for aggressive RTO in protocols. At low utilizations, this
can be effective at achieving low delay—a burst will arrive at the
maximum rate that the link can forward, with no need to wait for
retransmissions. The problem comes at higher utilizations in tiered
topologies, where flows that happen to hash to the same outgoing
port, and use the same priority in the case of 802.1Qbb, can cause
incoming ports to be paused. This causes collateral damage to other
flows traversing the same incoming port destined for different output
ports. With large incasts, pausing can cascade back up towards core
switches. Lossless Ethernet also interacts badly with per-packet mul-
tipath forwarding, as different switches may pause traffic at different
times, exacerbating reordering and complicating end-system design.
Cut Payload (CP) [9] tries to get the benefits of lossless with-
out quite being lossless. It drops packet payloads, but not packet
headers, relieving overload while avoiding uncertainty as to packet
outcomes. It shows great promise, but there are two problems. First,
in severe overload, it is susceptible to congestion collapse, where
only headers get forwarded. Second, because the headers are queued
in a FIFO manner, tail "loss" costs at least one RTT. In addition, CP, as originally proposed, uses single-path forwarding for each flow.
3 DESIGN
Our primary goals are low completion latency for short flows, and
predictable high throughput for longer flows. To fully satisfy these
goals, NDP impacts the whole stack, including switch behavior,
routing, and a completely new transport protocol. We lead with
a brief but simplified design rationale to show how the pieces in
Figure 1 fit together, then fill in the details in the rest of this section.
A Clos topology has sufficient bandwidth in the core to satisfy
all demand, so long as it is perfectly load-balanced. To avoid flow
collisions on core links, which impact both latency and throughput,
load-balancing each flow across many paths is essential. Balanc-
ing short flows requires per-packet multipath load-balancing, but
inevitably packets will get reordered.
To achieve minimal short-flow latency, senders cannot probe
before sending: they must send the first RTT at line rate. This works
well most of the time. When senders perform per-packet multipath load balancing, if sending at line rate causes congestion, it is because several senders are sending to the same receiver. Even then, the receiver's link is fully occupied, so this is not, by itself, a problem.

[Figure 2: Collapse and Phase Problems with CP. Percent of fair goodput achieved (mean and worst 10%) against the number of flows, for NDP and CP switches.]
To guarantee low latency, switch queues must be small. This
means colliding flows will overflow the queue. Packet loss, combined with multipath reordering, makes it impossible to infer what
happened and retransmit quickly enough to avoid impacting latency;
this violates the low latency goal. Completely preventing packet loss
adds queuing delay; if this is done by pausing inbound traffic, as
with lossless Ethernet, this impacts other unrelated traffic, violating
its low latency and predictable high throughput goals. We seek a
middle ground between packet loss and lossless.
Packet trimming, similar to that performed by CP, is such a middle
ground. Switch queues can be small, and the receiver still discov-
ers which packets were sent by examining the trimmed headers it
receives. However, to minimize retransmission latency, trimmed
headers and control packets need to be prioritized. Arriving trimmed
headers tell the receiver exactly what the demand is, so by using
a receiver-pulled protocol, the receiver can then precisely control
incoming traffic. This avoids persistent overload and allows more
important packets to be pulled first, at the receiver’s discretion.
3.1 NDP Switch Service Model
With CP, when the queue at a switch fills beyond a fixed thresh-
old, rather than dropping a packet, the switch trims off the packet
payload, queuing just the header. The rationale is that packets are
not lost silently, allowing rapid retransmission without waiting for
a timeout. With the short distances in a datacenter network, such
retransmissions can arrive very quickly.
Alongside switch changes, CP proposes minor changes to TCP
to improve incast performance. We wish to go well beyond CP, and
use packet trimming as the basis of an extremely aggressive network
architecture, focused on very low delay service. However, there are
several problems that can arise if vanilla CP is used.
First, CP can suffer from a form of congestion collapse. Figure 2
shows what happens when packets arrive at a switch at a signifi-
cantly higher rate than can be supported by the outgoing link. Many
unresponsive flows converge on a 10Gb/s link that can only support
one of them, as in extreme server incast scenarios. The figure shows
the percent of the ideal fair-share goodput that is achieved. The mean
goodput of the CP flows decreases, as an increasing fraction of the
link is occupied by trimmed packet headers. This figure shows the
best case for CP, with 9KB jumbograms. With 1500 byte packets the
collapse is much faster.

Second, datacenter networks are very regular, so phase effects [14]
can occur, leading to unfair throughput. The dashed curves in Fig-
ure 2 show the mean goodput of the worst performing 10% of the
flows. Phase effects can render CP very unfair, though we note
that this figure shows simulation results; real-world phase effects
can sometimes be reduced by variability in the timing of packet
transmissions due to OS scheduling.
Finally, CP aims to provide low delay feedback that packets have
been lost. However, because CP uses a FIFO queue, feedback can
only be sent after all the preceding packets have been received,
resulting in a delay before a retransmission is elicited. We would like
to run very small buffers in the switches, have one of those queues
overflow, and for the retransmission to arrive before the queue has
had a chance to drain. This isn’t possible with FIFO queuing.
NDP switches make three main changes to CP. First, an NDP
switch maintains two queues: a lower priority queue for data packets
and a higher priority queue for trimmed headers, ACKs and NACKs (and PULL packets, which we will introduce shortly).
This may seem counter-intuitive, but it provides the earliest possible
feedback that a packet didn’t make it, usually allowing a retransmis-
sion to arrive before the offending queue had even had time to drain.
This can provide at least as good low delay behavior as lossless
Ethernet, without the collateral damage caused by pausing.
Second, the switch performs weighted round robin between the
high priority “header queue” and the lower priority “data packet
queue”. With a 10:1 ratio of headers to packets, this allows early
feedback without being susceptible to congestion collapse.
Finally, when a data packet arrives and the low priority queue is
full, the switch decides with 50% probability whether to trim the
newly arrived packet, or the data packet at the tail of the low priority
queue. This breaks up phase effects. Figure 2 shows how an NDP
switch avoids CP’s collapse, and also avoids strong phase effects.
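As a concrete illustration of this service model, the following minimal Python sketch models one output port with a shallow data queue, a priority queue for trimmed headers and control packets, the 50/50 tail-or-arrival trimming decision, and the 10:1 weighted round robin described above. It is our own simplified model, not the authors' switch code; the names and the packet representation are invented for the example.

import random
from collections import deque

DATA_QUEUE_CAPACITY = 8    # very shallow data queue, in packets
HEADERS_PER_DATA = 10      # weighted round robin ratio between the two queues

class NdpPort:
    # One output port of an NDP-like switch (illustrative model only).
    def __init__(self):
        self.data_q = deque()     # low priority: full data packets
        self.header_q = deque()   # high priority: trimmed headers, ACKs, NACKs, PULLs
        self.headers_since_data = 0

    def enqueue(self, pkt):
        # pkt is a dict such as {"src": 3, "seq": 17, "control": False, "trimmed": False}
        if pkt["control"] or pkt["trimmed"]:
            self.header_q.append(pkt)
            return
        if len(self.data_q) < DATA_QUEUE_CAPACITY:
            self.data_q.append(pkt)
            return
        # Data queue full: trim either the new arrival or the packet currently at
        # the tail of the queue, each with 50% probability, to break up phase effects.
        if random.random() < 0.5:
            victim = pkt
        else:
            victim = self.data_q.pop()   # remove the tail packet
            self.data_q.append(pkt)      # the new arrival takes its place
        victim["trimmed"] = True         # payload dropped, header kept
        self.header_q.append(victim)

    def dequeue(self):
        # Serve headers with priority, but at most HEADERS_PER_DATA of them for
        # every data packet, so header traffic cannot starve data (avoiding collapse).
        serve_header = self.header_q and (
            not self.data_q or self.headers_since_data < HEADERS_PER_DATA)
        if serve_header:
            self.headers_since_data += 1
            return self.header_q.popleft()
        if self.data_q:
            self.headers_since_data = 0
            return self.data_q.popleft()
        return None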
3.1.1 Routing
We want NDP switches to perform per-packet multipath forwarding,
so as to evenly distribute traffic bursts across all the parallel paths
that are available between source and destination. This could be
done in at least four ways:
• Perform per-packet ECMP: switches randomly choose the next hop for each packet.
• Explicitly source-route the traffic.
• Use label-switched paths; the sender chooses the label.
• The destination address indicates the path to be taken; the sender chooses between destination addresses.
For load-balancing purposes the latter three are equivalent—the
sender chooses a path—they differ in how the sender expresses that
path. Our experiments show that if the senders choose the paths, they
can do a better job of load balancing than if the switches randomly
choose paths. This allows the use of slightly smaller switch buffers.
Unlike in the Internet, in a datacenter, senders can know the
topology, so know how many paths are available to a destination.
Each NDP sender takes the list of paths to a destination, randomly
permutes it, then sends packets on paths in this order. After it has
sent one packet on each path, it randomly permutes the list of paths
again, and the process repeats. This spreads packets equally across
all paths while avoiding inadvertent synchronization between two senders. Such load-balancing is important to achieving very low
delay. If we use very small data packet queues (only eight packets),
our experiments show that this simple scheme can increase the
maximum capacity of the network by as much as 10% over a per-
packet random path choice.
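The permutation-based spreading described above can be sketched in a few lines of Python; the class and method names are ours, and this is an illustration rather than the paper's implementation.

import random

class PathSpreader:
    # Cycle through a fresh random permutation of all paths to a destination,
    # re-permuting after every full pass (illustrative sketch, our own naming).
    def __init__(self, paths):
        self.paths = list(paths)
        self._order = []

    def next_path(self):
        if not self._order:
            self._order = random.sample(self.paths, len(self.paths))
        return self._order.pop()

# Example: a sender with 8 distinct core-switch paths to one destination.
spreader = PathSpreader(range(8))
choices = [spreader.next_path() for _ in range(16)]
assert sorted(choices[:8]) == list(range(8))   # each path used exactly once per pass
assert sorted(choices[8:]) == list(range(8))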
Depending on whether the network is L2 or L3-switched, either
label-switched paths or destination addresses can be used to choose
a path. In an L2 FatTree network for example, a label-switched path
only needs to be set up as far as each core switch, with destination
L2 addresses taking over from there, as a FatTree only has one path
from a core switch to each host. In an L3 FatTree, each host gets
multiple IP addresses, one for each core switch. By choosing the
destination address, the sender chooses the core switch a packet
traverses.
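For the L3 case, the following small sketch shows one hypothetical way a sender could hold one destination address per core switch and thereby pick the core switch a packet traverses; the addressing layout is invented for illustration and is not specified by the paper.

import ipaddress

def path_addresses(host_id, num_core_switches, base="10.0.0.0"):
    # One destination IPv4 address per core switch for a given host. The encoding
    # (third octet = core switch, fourth octet = host) is purely illustrative;
    # a real deployment would use its own addressing plan.
    base_int = int(ipaddress.IPv4Address(base))
    return [ipaddress.IPv4Address(base_int + core * 256 + host_id)
            for core in range(num_core_switches)]

# Sending to addrs[k] steers the packet through core switch k.
addrs = path_addresses(host_id=7, num_core_switches=4)   # 10.0.0.7 ... 10.0.3.7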
3.2 Transport Protocol
NDP uses a receiver-driven transport protocol designed specifically
to take advantage of multipath forwarding, packet trimming, and
short switch queues. The goal at each step is first to minimize delay
for short transfers, then to maximize throughput for larger transfers.
When starting up a connection, a transport protocol could be pes-
simistic, like TCP, and assume that there is minimal spare network
capacity. TCP starts sending data after the three-way handshake com-
pletes, initially with a small congestion window [11], and doubles it
each RTT until it has filled the pipe. Starting slowly is appropriate
in the Internet, where RTTs and link bandwidths differ by orders of
magnitude, and where the consequences of being more aggressive
are severe. In a datacenter, though, link speeds and baseline RTTs
are much more homogeneous, and can be known in advance. Also,
network utilization is often relatively low [7]. In such a network,
to minimize delay we must be optimistic and assume there will
be enough capacity to send a full window of data in the first RTT
of a connection without probing. If switch buffers are small, in a
low-delay datacenter environment a full window is likely to be only
about 12 packets given the speed-of-light latencies and hop-counts.
However, if it turns out that there is insufficient capacity, packets
will be lost. With a normal transport protocol, the combination
of per-packet multipath forwarding and being aggressive in the
first RTT is a recipe for confusion. Some packets arrive, but in a
random order, and some don’t. It is impossible to tell quickly what
actually happened, and so the sender must fall back on conservative
retransmission timeouts to remedy the situation.
Increasing switch buffering could mitigate this situation some-
what, at the expense of increasing delay, but can’t prevent loss with
large incasts. ECN also cannot prevent loss with aggressive short
flows. Pause frames can prevent loss, and could help significantly
here, but we will show in § 6.1 that this brings its own significant
problems in terms of delay to unrelated flows.
This is where packet trimming in the NDP switches really comes
into its own. Headers of trimmed packets arriving at the receiver con-
sume little bottleneck bandwidth, but inform the receiver precisely
which packets were sent. The order of packet arrivals is unimportant
when it comes to inferring what happened. Priority queuing ensures
that these headers arrive quickly, and that control packets such as
NACKs returned to the sender arrive quickly; indeed quickly enough
to elicit a retransmission that arrives before the overflowing queue has had time to drain, so the link does not go idle. This is illustrated in Figure 3. At time t_trim, packets from nine different sources arrive nearly simultaneously at the ToR switch. The eight-packet queue to the destination link fills, and the packet from source 9 is trimmed. After packet 1 finishes being forwarded, packet 9's header gets priority treatment. At t_header it arrives at the receiver, which generates a NACK packet (this NACK has the PULL bit set, requesting retransmission). Packet 9 is retransmitted at t_rtx and arrives at the ToR switch queue while packet 7 is still being forwarded. The link to the destination never goes idle, and packet 9 arrives at t_arrive, the same time it would have arrived if PFC had prevented its loss by pausing the upstream switch.

[Figure 3: Packet trimming enables low-delay retransmission. Nine packets converge on one ToR output queue; packet 9 is trimmed at t_trim, its header reaches the receiver at t_header, the retransmission is sent at t_rtx, and packet 9 arrives at t_arrive without the destination link going idle.]
In a Clos topology employing per-packet multipath, the only hot
spots that can build are when traffic from many sources converges on
a receiver. With NDP, trimmed headers indicate the precise demand
to the receiver; it knows exactly which senders want to send which
data to it, so it is best placed to decide what to do after the first RTT
of a connection. After sending a full window of data at line rate,
NDP senders stop sending. From then on, the protocol is receiver-
driven. An NDP receiver requests packets from the senders, pacing
the sending of those requests so that the data packets they elicit arrive
at a rate that matches the receiver’s link speed. The data requested
can be retransmissions of trimmed packets, or can be new data from
the rest of the transfer. The protocol thus works as follows:
The sender sends a full window of data without waiting for a
response. Data packets carry packet sequence numbers.
For each header
3
that arrives, the receiver immediately sends a
NACK to inform the sender to prepare the packet for retransmis-
sion (but not yet send it).
For each data packet that arrives, the receiver immediately sends
an ACK to inform the sender that the packet arrived, and so the
buffer can be freed.
For every header or packet that arrives, the receiver adds a
P
ULL
packet to its pull queue that will, in due course, be sent to the
corresponding sender. A receiver only has one pull queue, shared
by all connections for which it is the receiver.
A PULL packet contains the connection ID and a per-sender pull
counter that increments on each PULL packet sent to that sender.
The receiver sends out PULL packets from the per-interface pull
queue, paced so that the data packets they elicit from the sender
then arrive at the receiver’s link rate. Pull packets from different
connections are serviced fairly by default, or with strict prioriti-
zation when a flow has higher priority.
2
This NACK has the PULL bit set, requesting retransmission.
3
When we refer to headers in this context, we are referring to the headers of packets
whose payload was trimmed off by a switch
When a PULL packet arrives at the sender, the sender will send as
many data packets as the pull counter increments by. Any packets
queued for retransmission are sent first, followed by new data.
When the sender runs out of data to send, it marks the last packet.
When the last packet arrives, the receiver removes any pull pack-
ets for that sender from its pull queue to avoid sending unneces-
sary pull packets. Any subsequent data the sender later wants to
send will be pushed rather than pulled.
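The sketch below restates these steps as simplified Python, as promised above. It is an illustrative reconstruction rather than NDP's implementation: the 12-packet initial window and the 9KB/10Gb/s pacing constants follow assumptions stated elsewhere in the paper, and every class and method name is invented for the example.

from collections import deque

PACKET_BITS = 9000 * 8                       # 9KB jumbograms (assumed)
LINK_RATE_BPS = 10e9                         # 10Gb/s receiver link (assumed)
PACKET_TIME = PACKET_BITS / LINK_RATE_BPS    # pacing interval, about 7.2 microseconds

class NdpSender:
    # Illustrative per-connection sender state.
    def __init__(self, data_seqnos, initial_window=12):
        self.to_send = deque(data_seqnos)
        self.rtx_queue = deque()             # NACKed packets awaiting a PULL
        self.initial_window = initial_window

    def first_window(self):
        # Push a full window immediately: no handshake, no probing.
        n = min(self.initial_window, len(self.to_send))
        return [self.to_send.popleft() for _ in range(n)]

    def on_nack(self, seqno):
        self.rtx_queue.append(seqno)         # prepare for retransmission, do not send yet

    def on_pull(self, n_packets):
        out = []
        for _ in range(n_packets):
            if self.rtx_queue:
                out.append(self.rtx_queue.popleft())   # retransmissions first
            elif self.to_send:
                out.append(self.to_send.popleft())     # then new data
        return out

class NdpReceiver:
    # Illustrative receiver with a single pull queue shared by all its connections.
    def __init__(self):
        self.pull_queue = deque()            # (sender, pull_counter) entries
        self.pull_counters = {}
        self.next_pull_time = 0.0

    def _queue_pull(self, sender):
        self.pull_counters[sender] = self.pull_counters.get(sender, 0) + 1
        self.pull_queue.append((sender, self.pull_counters[sender]))

    def on_data(self, sender, seqno):
        self._queue_pull(sender)
        return ("ACK", seqno)                # sender may free its copy of the packet

    def on_trimmed_header(self, sender, seqno):
        self._queue_pull(sender)
        return ("NACK", seqno)               # sender queues the packet for retransmission

    def send_pulls(self, now):
        # Release queued PULLs no faster than one per packet serialization time,
        # so the data packets they elicit arrive at (or below) the link rate.
        released = []
        while self.pull_queue and now >= self.next_pull_time:
            released.append(self.pull_queue.popleft())
            self.next_pull_time = max(self.next_pull_time, now) + PACKET_TIME
        return released

# The first RTT is pushed optimistically; later data is clocked out by paced PULLs.
sender = NdpSender(range(40))
receiver = NdpReceiver()
first_rtt_packets = sender.first_window()    # 12 packets sent at line rate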
Due to packet trimming, it is very rare for a packet to be actually
lost; usually this is due to corruption. As ACKs and NACKs are
sent immediately, are priority-forwarded, and all switch queues are
small, the sender can know very quickly if a packet was actually lost.
With eight packet switch queues, 9KB jumbograms, and store-and-
forward switches in a 10Gb/s FatTree topology, each packet takes 7.2µs to serialize. Taking into account NDP's priority queuing, the worst-case network RTT is approximately 400µs, with typical RTTs
being much shorter. This allows a very short retransmission timeout
to be used to provide reliability for such corrupted packets.
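One back-of-envelope way to arrive at numbers of this magnitude, assuming 9KB means 9000 bytes and a six-link host-to-host FatTree path that is worst-case queued at every store-and-forward hop (the paper's exact accounting for the 400µs figure may differ), is:

# Assumptions: 9KB means 9000 bytes, 10Gb/s links, eight-packet data queues,
# and a six-link host-to-host FatTree path; this is a rough estimate only.
PACKET_BYTES = 9000
LINK_BPS = 10e9
QUEUE_PACKETS = 8
HOPS = 6   # host-ToR, ToR-Agg, Agg-Core, Core-Agg, Agg-ToR, ToR-host

serialize = PACKET_BYTES * 8 / LINK_BPS            # 7.2e-06 s, i.e. 7.2us per packet
per_hop_worst = (QUEUE_PACKETS + 1) * serialize    # wait behind a full queue, then send
one_way_worst = HOPS * per_hop_worst               # about 389us; priority-forwarded
                                                   # control packets add little on the
                                                   # return path, giving roughly 400us
print(f"serialize: {serialize * 1e6:.1f} us, worst-case one-way: {one_way_worst * 1e6:.0f} us")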
PULL packets perform a role similar to TCP’s ACK-clock, but are
usually separated from ACKs to allow them to be paced without impacting the retransmission timeout mechanism. (If there is only one sender, PULL packets don't need extra pacing because data packets arrive paced appropriately; in such cases we can send a combined PULLACK.) For example, in a
large incast scenario, PULLs may spend a comparatively long time
in the receiver’s pull queue before the pacer allows them to be sent,
but we don’t want to also delay ACKs because doing so requires
being much more conservative with retransmission timeouts.
The emergent behavior is that the first RTT of data in a connection
is pushed, and subsequent RTTs of data are pulled so as to arrive
at the receiver’s line rate. In an incast scenario, if many senders
send simultaneously, many of their first window of packets will be
trimmed, but subsequently receiver pulling ensures that the aggregate
arrival rate from all senders matches the receiver’s link speed, with
few or no packets being trimmed.
3.2.1 Coping with Reordering
Due to per-packet multipath forwarding, it is normal for both data
packets and reverse-path ACKs, NACKs and PULLs to be reordered.
The basic protocol design is robust to reordering, as it does not need to make inferences about loss from other packets' sequence numbers.
However, reordering still needs to be taken into account.
Although PULL packets are priority-queued, they don’t preempt
data packets, so PULL packets sent on different paths often arrive out
of order, increasing the burstiness of retransmissions. To reduce this,
PULLs carry a pull sequence number. The receiver has a separate pull
sequence space for each connection, incrementing it by one for each
pull sent. On receipt of a PULL, the sender transmits as many packets
as the pull sequence number increases by. For example, if a PULL is
delayed, the next PULL sent may arrive first via a different path, and
will pull two packets rather than one. This reduces burstiness a little.
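A minimal sketch of this cumulative pull-sequence handling at the sender (names are ours, for illustration only):

class PullState:
    # Sender-side handling of reordered PULLs: send as many packets as the pull
    # sequence number advances by, so a delayed PULL is covered by a later one.
    def __init__(self):
        self.highest_pull_seen = 0

    def packets_to_send(self, pull_seqno):
        if pull_seqno <= self.highest_pull_seen:
            return 0                          # late or duplicate PULL: nothing more to send
        n = pull_seqno - self.highest_pull_seen
        self.highest_pull_seen = pull_seqno
        return n

state = PullState()
assert state.packets_to_send(2) == 2   # PULL 1 was delayed, so PULL 2 pulls two packets
assert state.packets_to_send(1) == 0   # the late PULL 1 finally arrives and is ignored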
3.2.2 The First RTT
Unlike in TCP, where the SYN/SYN-ACK handshake happens ahead
of data exchange, we wish NDP data to be sent in the first RTT. This
adds three new requirements:
• Be robust to requests that spoof source IP addresses.

References (partial)
• V. Jacobson. Congestion avoidance and control. SIGCOMM 1988.
• M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM 2008.
• A. Greenberg et al. VL2: a scalable and flexible data center network. SIGCOMM 2009.
• T. Benson, A. Akella, and D. Maltz. Network traffic characteristics of data centers in the wild. IMC 2010.
• M. Alizadeh et al. Data center TCP (DCTCP). SIGCOMM 2010.