Re-architecting datacenter networks and stacks for low
latency and high performance
Mark Handley
University College London
London, UK
m.handley@cs.ucl.ac.uk
Costin Raiciu
Alexandru Agache
Andrei Voinescu
University Politehnica of Bucharest
Bucharest, Romania
firstname.lastname@cs.pub.ro
Andrew W. Moore
Gianni Antichi
Marcin Wójcik
University of Cambridge
Cambridge, UK
firstname.lastname@cl.cam.ac.uk
ABSTRACT
Modern datacenter networks provide very high capacity via redun-
dant Clos topologies and low switch latency, but transport protocols
rarely deliver matching performance. We present NDP, a novel data-
center transport architecture that achieves near-optimal completion
times for short transfers and high flow throughput in a wide range
of scenarios, including incast. NDP switch buffers are very shal-
low and when they fill the switches trim packets to headers and
priority forward the headers. This gives receivers a full view of
instantaneous demand from all senders, and is the basis for our
novel, high-performance, multipath-aware transport protocol that
can deal gracefully with massive incast events and prioritize traffic
from different senders on RTT timescales. We implemented NDP in
Linux hosts with DPDK, in a software switch, in a NetFPGA-based
hardware switch, and in P4. We evaluate NDP’s performance in
our implementations and in large-scale simulations, simultaneously
demonstrating support for very low-latency and high throughput.
CCS CONCEPTS
• Networks → Network protocols; Data center networks;
KEYWORDS
Datacenters; Network Stacks; Transport Protocols
ACM Reference format:
Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, An-
drew W. Moore, Gianni Antichi, and Marcin Wójcik. 2017. Re-architecting
datacenter networks and stacks for low latency and high performance. In
Proceedings of SIGCOMM '17, Los Angeles, CA, USA, August 21-25, 2017, 14 pages. https://doi.org/10.1145/3098822.3098825
1 INTRODUCTION
Datacenters have evolved rapidly over the last few years, with Clos [1, 17] topologies becoming commonplace, and a new emphasis on low latency, first with improved transport protocols such as DCTCP [4] and more recently with solutions such as RDMA over Converged
Ethernet v2 [25] that use Ethernet flow control in switches [23] to avoid packet loss caused by congestion.
In a lightly loaded network, Ethernet flow control can give very good low-delay performance [20] for request/response flows that dominate datacenter workloads. Packets are queued rather than lost, if necessary producing back-pressure, pausing forwarding across several switches, and so no time is wasted on being overly conservative at start-up or waiting for retransmission timeouts. However, based on experience deploying RoCEv2 at Microsoft [20], Guo et al. note that a lossless network does not guarantee low latency. When congestion occurs, queues build up and PFC pause frames are generated. Both queues and PFC pause frames increase network latency. They conclude that "how to achieve low network latency and high network throughput at the same time for RDMA is still an open problem."
In this paper we present a new datacenter protocol architecture,
NDP, that takes a different approach to simultaneously achieving
both low delay and high throughput. NDP has no connection setup
handshake, and allows flows to start sending instantly at full rate.
We use per-packet multipath load balancing, which avoids core
network congestion at the expense of reordering, and in switches
use an approach similar to Cut Payload (CP) [9], which trims the
payloads of packets when a switch queue fills. This gives a network
that is lossless for metadata, but not for traffic payloads. In spite of
reordering, lossless metadata gives the receiver a complete picture
regarding inbound traffic and we take advantage of it to build a
radical new transport protocol that achieves very low latency for
short flows, with minimal interference between flows to different
destinations even in pathological traffic patterns.
We have implemented NDP in Linux hosts, in a software switch,
in a hardware switch based on NetFPGA SUME [41], in P4 [29], and
in simulation. We will demonstrate that NDP achieves:
• Better short-flow performance than DCTCP or DCQCN.
• Greater than 95% of the maximum network capacity in a heavily loaded network with switch queues of only eight packets.
• Near-perfect delay and fairness in incast [18] scenarios.
• Minimal interference between flows to different hosts.
• Effective prioritization of straggler traffic during incasts.
2 DESIGN SPACE
Intra-datacenter network traffic primarily consists of request/response
RPC-like protocols. Mean network utilization is rarely very high,
but applications can be very bursty. The big problem today is latency,
especially for short RPC-like workloads. At the cost of head-of-line

blocking, today’s applications often reuse TCP connections across
multiple requests to amortize the latency cost of the TCP handshake.
Is it possible to improve the protocol stack so much that every
request could use a new connection and at the same time expect
to get close to the raw latency and bandwidth of the underlying
network, even under heavy load?
We will show that these goals
are achievable, but to do so involves changes to how traffic is routed,
how switches cope with overload, and most importantly, requires
a completely different transport protocol from those used today.
Before we describe our solution in §3, we first highlight the key
architectural points that must be considered.
2.1 End-to-end Service Demands
What do applications want from a datacenter network?
Location Independence.
It shouldn’t matter which machine in a
datacenter the elements of a distributed application are run on. This is
commonly achieved using high-capacity parallel Clos topologies [1, 17]. Such topologies have sufficient cross-sectional bandwidth that
the core network should not be a bottleneck.
Low Latency.
Clos networks can supply bandwidth, modulo issues
with load balancing between paths, but often fall short in providing
low latency service. Predictable very low latency request/response
behavior is the key application demand, and it is the hardest to satisfy.
This is more important than large file transfer performance, though
high throughput is still a requirement, especially for storage servers.
The strategy must be to optimize for low latency first.
Incast.
Datacenter workloads often require sending requests to large
numbers of workers and then handling their near-simultaneous re-
sponses, causing a problem called incast. A good networking stack
should shield applications from the side-effects of incast traffic pat-
terns gracefully while providing low latency.
Priority.
It is also common for a receiver to handle many incom-
ing flows corresponding to different requests simultaneously. For
example, it may have fanned out two different requests to workers,
and the responses to those requests are now arriving with the last
responses to the first request overlapping the first responses to the
second request. Many applications need all the responses to a request
before they can proceed. A very desirable property is for the receiver
to be able to prioritize arriving traffic from stragglers. The receiver
is the only entity that can dynamically prioritize its inbound traffic,
and this impacts protocol design.
2.2 Transport Protocol
Current datacenter transport protocols satisfy some of these appli-
cation requirements, but satisfying all of them places some unusual
demands on datacenter transport protocols.
Zero-RTT connection setup.
To minimize latency, many applica-
tions would like zero-RTT delivery for small outgoing transfers (or
one RTT for request/response). We need a protocol that doesn’t
require a handshake to complete before sending data, but this poses
security and correctness issues.
Fast start.
Another implication of zero-RTT delivery is that a trans-
port protocol can’t probe for available bandwidth—to minimize
latency, it must assume bandwidth is available, optimistically send
a full initial window, and then react appropriately when it isn’t. In
contrast to the Internet, simpler solutions are possible in datacenter
environments, because link speeds and network delays (except for
queuing delays) can mostly be known in advance.
Per-packet ECMP.
One problem with Clos topologies is that per-
flow ECMP hashing of flows to paths can cause unintended flow
collisions; one deployment [20] found this reduced throughput by 40%. For large transfers, multipath protocols such as MPTCP can establish enough subflows to find unused paths [31], but they can
do little to help with the latency of very short transfers. The only
solution here is to stripe across multiple paths on a per-packet basis.
This complicates transport protocol design.
Reorder-tolerant handshake.
If we perform a zero-RTT transfer
with per-packet multipath forwarding in a Clos network, even the
very first window of packets may arrive in a random order. This effect
has implications for connection setup: the first packet to arrive will
not be the first packet of the connection. Such a transport protocol
must be capable of establishing connection state no matter which
packet from the initial window is first to arrive.
Optimized for Incast.
Although Clos networks are well-provisioned
for core-capacity, incast traffic can make life difficult for any trans-
port protocol when applications fan out requests to many workers
simultaneously. Such traffic patterns can cause high packet loss rates,
especially if the transport protocol is aggressive in the first RTT. To
handle this gracefully requires some assistance from the switches.
2.3 Switch Service Model
Application requirements, the transport protocol and the service
model at network switches are tightly coupled, and need to be op-
timized holistically. Of particular relevance is what happens when
a switch port is congested. The switch service model heavily in-
fluences the design space of both protocol and congestion control
algorithms, and couples tightly with forwarding behavior: per-packet
multipath load balancing is ideal as it minimizes hotspots, but it com-
plicates the ability of end-systems to infer network congestion and
increases the importance of graceful overload behavior.
Loss as a congestion feedback mechanism has the advantage that
dropped packets don’t use bottleneck bandwidth, and loss only im-
pacts flows traversing the congested link—not all schemes have
these properties. The downside is that it leads to uncertainty as to
a packet’s outcome. Duplicate or selective ACKs to trigger retrans-
missions only work well for long-lived flows. With short flows, tail
loss is common, and then you have to fall back on retransmission
timeouts (RTO). Short RTOs are only safe if you can constrain the
delay in the network, so you need to maintain short queues [4] which
in turn constrain the congestion control schemes you can use. Loss
also couples badly with per-packet multipath forwarding; because
the packets of a flow arrive out of order, loss detection is greatly
complicated: fast retransmit is often not possible because it is not rare for a packet to arrive out of sequence by a whole window.
ECN helps significantly. DCTCP uses ECN [32] with a sharp
threshold for packet marking, and a congestion control scheme that
aims to push in and out of the marking regime. This greatly reduces
loss for long-lived flows, and allows the use of small buffers, re-
ducing queuing delay. For short flows though, ECN has less benefit
because the flow doesn’t have time to react to the ECN feedback. In
practice switches use large shared buffers in conjunction with ECN
and this reduces incast losses, but retransmit timers must be less aggressive. ECN does have the advantage though that it interacts quite well with per-packet multipath forwarding, given a transport protocol design that can tolerate reordering.

[Figure 1: Key components of NDP. A new receiver-driven protocol built on per-packet multipath over a low-latency, uncongested Clos core, with fast flow start, packet trimming, small queues, control-packet priority, receiver pacing, fast retransmission, and zero-RTT connect; incast and reordering occur, but the receiver has full information.]
Lossless Ethernet using 802.3X Pause frames [23] or 802.1Qbb priority-based flow control (PFC) [24] can prevent loss, avoiding
the need for aggressive RTO in protocols. At low utilizations, this
can be effective at achieving low delay—a burst will arrive at the
maximum rate that the link can forward, with no need to wait for
retransmissions. The problem comes at higher utilizations in tiered
topologies, where flows that happen to hash to the same outgoing
port, and use the same priority in the case of 802.1Qbb, can cause
incoming ports to be paused. This causes collateral damage to other
flows traversing the same incoming port destined for different output
ports. With large incasts, pausing can cascade back up towards core
switches. Lossless Ethernet also interacts badly with per-packet mul-
tipath forwarding, as different switches may pause traffic at different
times, exacerbating reordering and complicating end-system design.
Cut Payload (CP) [9] tries to get the benefits of lossless with-
out quite being lossless. It drops packet payloads, but not packet
headers, relieving overload while avoiding uncertainty as to packet
outcomes. It shows great promise, but there are two problems. First,
in severe overload, it is susceptible to congestion collapse, where
only headers get forwarded. Second, because the headers are queued
in a FIFO manner, tail "loss" costs at least one RTT. In addition, CP, as originally proposed, uses single-path forwarding for each flow.
3 DESIGN
Our primary goals are low completion latency for short flows, and
predictable high throughput for longer flows. To fully satisfy these
goals, NDP impacts the whole stack, including switch behavior,
routing, and a completely new transport protocol. We lead with
a brief but simplified design rationale to show how the pieces in
Figure 1 fit together, then fill in the details in the rest of this section.
A Clos topology has sufficient bandwidth in the core to satisfy
all demand, so long as it is perfectly load-balanced. To avoid flow
collisions on core links, which impact both latency and throughput,
load-balancing each flow across many paths is essential. Balanc-
ing short flows requires per-packet multipath load-balancing, but
inevitably packets will get reordered.
To achieve minimal short-flow latency, senders cannot probe
before sending: they must send the first RTT at line rate. This works
well most of the time. When senders perform per-packet multipath load balancing, if sending at line rate causes congestion, it is because several senders are sending to the same receiver. Even then, the receiver's link is fully occupied, so this is not, by itself, a problem.

[Figure 2: Collapse and Phase Problems with CP. Percent of fair goodput achieved (mean and worst 10%) against the number of flows, for NDP and CP switches.]
To guarantee low latency, switch queues must be small. This
means colliding flows will overflow the queue. Packet loss, combined with multipath reordering, makes it impossible to infer what
happened and retransmit quickly enough to avoid impacting latency;
this violates the low latency goal. Completely preventing packet loss
adds queuing delay; if this is done by pausing inbound traffic, as
with lossless Ethernet, this impacts other unrelated traffic, violating
its low latency and predictable high throughput goals. We seek a
middle ground between packet loss and lossless.
Packet trimming, similar to that performed by CP, is such a middle
ground. Switch queues can be small, and the receiver still discov-
ers which packets were sent by examining the trimmed headers it
receives. However, to minimize retransmission latency, trimmed
headers and control packets need to be prioritized. Arriving trimmed
headers tell the receiver exactly what the demand is, so by using
a receiver-pulled protocol, the receiver can then precisely control
incoming traffic. This avoids persistent overload and allows more
important packets to be pulled first, at the receiver’s discretion.
3.1 NDP Switch Service Model
With CP, when the queue at a switch fills beyond a fixed thresh-
old, rather than dropping a packet, the switch trims off the packet
payload, queuing just the header. The rationale is that packets are
not lost silently, allowing rapid retransmission without waiting for
a timeout. With the short distances in a datacenter network, such
retransmissions can arrive very quickly.
Alongside switch changes, CP proposes minor changes to TCP
to improve incast performance. We wish to go well beyond CP, and
use packet trimming as the basis of an extremely aggressive network
architecture, focused on very low delay service. However, there are
several problems that can arise if vanilla CP is used.
First, CP can suffer from a form of congestion collapse. Figure 2
shows what happens when packets arrive at a switch at a signifi-
cantly higher rate than can be supported by the outgoing link. Many
unresponsive flows converge on a 10Gb/s link that can only support
one of them, as in extreme server incast scenarios. The figure shows
the percent of the ideal fair-share goodput that is achieved. The mean
goodput of the CP flows decreases, as an increasing fraction of the
link is occupied by trimmed packet headers. This figure shows the
best case for CP, with 9KB jumbograms. With 1500 byte packets the
collapse is much faster.

Second, datacenter networks are very regular, so phase effects [14]
can occur, leading to unfair throughput. The dashed curves in Fig-
ure 2 show the mean goodput of the worst performing 10% of the
flows. Phase effects can render CP very unfair, though we note
that this figure shows simulation results; real-world phase effects
can sometimes be reduced by variability in the timing of packet
transmissions due to OS scheduling.
Finally, CP aims to provide low delay feedback that packets have
been lost. However, because CP uses a FIFO queue, feedback can
only be sent after all the preceding packets have been received,
resulting in a delay before a retransmission is elicited. We would like
to run very small buffers in the switches, have one of those queues
overflow, and for the retransmission to arrive before the queue has
had a chance to drain. This isn’t possible with FIFO queuing.
NDP switches make three main changes to CP. First, an NDP
switch maintains two queues: a lower priority queue for data packets
and a higher priority queue for trimmed headers, ACKs and NACKs (and PULL packets, which we will introduce shortly).
This may seem counter-intuitive, but it provides the earliest possible
feedback that a packet didn’t make it, usually allowing a retransmis-
sion to arrive before the offending queue had even had time to drain.
This can provide at least as good low delay behavior as lossless
Ethernet, without the collateral damage caused by pausing.
Second, the switch performs weighted round robin between the
high priority “header queue” and the lower priority “data packet
queue”. With a 10:1 ratio of headers to packets, this allows early
feedback without being susceptible to congestion collapse.
Finally, when a data packet arrives and the low priority queue is
full, the switch decides with 50% probability whether to trim the
newly arrived packet, or the data packet at the tail of the low priority
queue. This breaks up phase effects. Figure 2 shows how an NDP
switch avoids CP’s collapse, and also avoids strong phase effects.
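As a concrete illustration of this service model, the following minimal Python sketch models one output port with a shallow data queue, a priority queue for trimmed headers and control packets, the 50/50 tail-or-arrival trimming decision, and the 10:1 weighted round robin described above. It is our own simplified model, not the authors' switch code; the names and the packet representation are invented for the example.

import random
from collections import deque

DATA_QUEUE_CAPACITY = 8    # very shallow data queue, in packets
HEADERS_PER_DATA = 10      # weighted round robin ratio between the two queues

class NdpPort:
    # One output port of an NDP-like switch (illustrative model only).
    def __init__(self):
        self.data_q = deque()     # low priority: full data packets
        self.header_q = deque()   # high priority: trimmed headers, ACKs, NACKs, PULLs
        self.headers_since_data = 0

    def enqueue(self, pkt):
        # pkt is a dict such as {"src": 3, "seq": 17, "control": False, "trimmed": False}
        if pkt["control"] or pkt["trimmed"]:
            self.header_q.append(pkt)
            return
        if len(self.data_q) < DATA_QUEUE_CAPACITY:
            self.data_q.append(pkt)
            return
        # Data queue full: trim either the new arrival or the packet currently at
        # the tail of the queue, each with 50% probability, to break up phase effects.
        if random.random() < 0.5:
            victim = pkt
        else:
            victim = self.data_q.pop()   # remove the tail packet
            self.data_q.append(pkt)      # the new arrival takes its place
        victim["trimmed"] = True         # payload dropped, header kept
        self.header_q.append(victim)

    def dequeue(self):
        # Serve headers with priority, but at most HEADERS_PER_DATA of them for
        # every data packet, so header traffic cannot starve data (avoiding collapse).
        serve_header = self.header_q and (
            not self.data_q or self.headers_since_data < HEADERS_PER_DATA)
        if serve_header:
            self.headers_since_data += 1
            return self.header_q.popleft()
        if self.data_q:
            self.headers_since_data = 0
            return self.data_q.popleft()
        return None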
3.1.1 Routing
We want NDP switches to perform per-packet multipath forwarding,
so as to evenly distribute traffic bursts across all the parallel paths
that are available between source and destination. This could be
done in at least four ways:
• Perform per-packet ECMP: switches randomly choose the next hop for each packet.
• Explicitly source-route the traffic.
• Use label-switched paths; the sender chooses the label.
• The destination address indicates the path to be taken; the sender chooses between destination addresses.
For load-balancing purposes the latter three are equivalent—the
sender chooses a path—they differ in how the sender expresses that
path. Our experiments show that if the senders choose the paths, they
can do a better job of load balancing than if the switches randomly
choose paths. This allows the use of slightly smaller switch buffers.
Unlike in the Internet, in a datacenter, senders can know the
topology, so know how many paths are available to a destination.
Each NDP sender takes the list of paths to a destination, randomly
permutes it, then sends packets on paths in this order. After it has
sent one packet on each path, it randomly permutes the list of paths
again, and the process repeats. This spreads packets equally across
all paths while avoiding inadvertent synchronization between two senders. Such load-balancing is important to achieving very low
delay. If we use very small data packet queues (only eight packets),
our experiments show that this simple scheme can increase the
maximum capacity of the network by as much as 10% over a per-
packet random path choice.
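The permutation-based spreading described above can be sketched in a few lines of Python; the class and method names are ours, and this is an illustration rather than the paper's implementation.

import random

class PathSpreader:
    # Cycle through a fresh random permutation of all paths to a destination,
    # re-permuting after every full pass (illustrative sketch, our own naming).
    def __init__(self, paths):
        self.paths = list(paths)
        self._order = []

    def next_path(self):
        if not self._order:
            self._order = random.sample(self.paths, len(self.paths))
        return self._order.pop()

# Example: a sender with 8 distinct core-switch paths to one destination.
spreader = PathSpreader(range(8))
choices = [spreader.next_path() for _ in range(16)]
assert sorted(choices[:8]) == list(range(8))   # each path used exactly once per pass
assert sorted(choices[8:]) == list(range(8))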
Depending on whether the network is L2 or L3-switched, either
label-switched paths or destination addresses can be used to choose
a path. In an L2 FatTree network for example, a label-switched path
only needs to be set up as far as each core switch, with destination
L2 addresses taking over from there, as a FatTree only has one path
from a core switch to each host. In an L3 FatTree, each host gets
multiple IP addresses, one for each core switch. By choosing the
destination address, the sender chooses the core switch a packet
traverses.
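For the L3 case, the following small sketch shows one hypothetical way a sender could hold one destination address per core switch and thereby pick the core switch a packet traverses; the addressing layout is invented for illustration and is not specified by the paper.

import ipaddress

def path_addresses(host_id, num_core_switches, base="10.0.0.0"):
    # One destination IPv4 address per core switch for a given host. The encoding
    # (third octet = core switch, fourth octet = host) is purely illustrative;
    # a real deployment would use its own addressing plan.
    base_int = int(ipaddress.IPv4Address(base))
    return [ipaddress.IPv4Address(base_int + core * 256 + host_id)
            for core in range(num_core_switches)]

# Sending to addrs[k] steers the packet through core switch k.
addrs = path_addresses(host_id=7, num_core_switches=4)   # 10.0.0.7 ... 10.0.3.7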
3.2 Transport Protocol
NDP uses a receiver-driven transport protocol designed specifically
to take advantage of multipath forwarding, packet trimming, and
short switch queues. The goal at each step is first to minimize delay
for short transfers, then to maximize throughput for larger transfers.
When starting up a connection, a transport protocol could be pes-
simistic, like TCP, and assume that there is minimal spare network
capacity. TCP starts sending data after the three-way handshake com-
pletes, initially with a small congestion window [11], and doubles it
each RTT until it has filled the pipe. Starting slowly is appropriate
in the Internet, where RTTs and link bandwidths differ by orders of
magnitude, and where the consequences of being more aggressive
are severe. In a datacenter, though, link speeds and baseline RTTs
are much more homogeneous, and can be known in advance. Also,
network utilization is often relatively low [7]. In such a network,
to minimize delay we must be optimistic and assume there will
be enough capacity to send a full window of data in the first RTT
of a connection without probing. If switch buffers are small, in a
low-delay datacenter environment a full window is likely to be only
about 12 packets given the speed-of-light latencies and hop-counts.
However, if it turns out that there is insufficient capacity, packets
will be lost. With a normal transport protocol, the combination
of per-packet multipath forwarding and being aggressive in the
first RTT is a recipe for confusion. Some packets arrive, but in a
random order, and some don’t. It is impossible to tell quickly what
actually happened, and so the sender must fall back on conservative
retransmission timeouts to remedy the situation.
Increasing switch buffering could mitigate this situation some-
what, at the expense of increasing delay, but can’t prevent loss with
large incasts. ECN also cannot prevent loss with aggressive short
flows. Pause frames can prevent loss, and could help significantly
here, but we will show in § 6.1 that this brings its own significant
problems in terms of delay to unrelated flows.
This is where packet trimming in the NDP switches really comes
into its own. Headers of trimmed packets arriving at the receiver con-
sume little bottleneck bandwidth, but inform the receiver precisely
which packets were sent. The order of packet arrivals is unimportant
when it comes to inferring what happened. Priority queuing ensures
that these headers arrive quickly, and that control packets such as
NACKs returned to the sender arrive quickly; indeed quickly enough
to elicit a retransmission that arrives before the overflowing queue has had time to drain, so the link does not go idle. This is illustrated in Figure 3. At time t_trim, packets from nine different sources arrive nearly simultaneously at the ToR switch. The eight-packet queue to the destination link fills, and the packet from source 9 is trimmed. After packet 1 finishes being forwarded, packet 9's header gets priority treatment. At t_header it arrives at the receiver, which generates a NACK packet (this NACK has the PULL bit set, requesting retransmission). Packet 9 is retransmitted at t_rtx and arrives at the ToR switch queue while packet 7 is still being forwarded. The link to the destination never goes idle, and packet 9 arrives at t_arrive, the same time it would have arrived if PFC had prevented its loss by pausing the upstream switch.

[Figure 3: Packet trimming enables low-delay retransmission. Nine packets converge on one ToR output queue; packet 9 is trimmed at t_trim, its header reaches the receiver at t_header, the retransmission is sent at t_rtx, and packet 9 arrives at t_arrive without the destination link going idle.]
In a Clos topology employing per-packet multipath, the only hot
spots that can build are when traffic from many sources converges on
a receiver. With NDP, trimmed headers indicate the precise demand
to the receiver; it knows exactly which senders want to send which
data to it, so it is best placed to decide what to do after the first RTT
of a connection. After sending a full window of data at line rate,
NDP senders stop sending. From then on, the protocol is receiver-
driven. An NDP receiver requests packets from the senders, pacing
the sending of those requests so that the data packets they elicit arrive
at a rate that matches the receiver’s link speed. The data requested
can be retransmissions of trimmed packets, or can be new data from
the rest of the transfer. The protocol thus works as follows:
The sender sends a full window of data without waiting for a
response. Data packets carry packet sequence numbers.
For each header
3
that arrives, the receiver immediately sends a
NACK to inform the sender to prepare the packet for retransmis-
sion (but not yet send it).
For each data packet that arrives, the receiver immediately sends
an ACK to inform the sender that the packet arrived, and so the
buffer can be freed.
For every header or packet that arrives, the receiver adds a
P
ULL
packet to its pull queue that will, in due course, be sent to the
corresponding sender. A receiver only has one pull queue, shared
by all connections for which it is the receiver.
A PULL packet contains the connection ID and a per-sender pull
counter that increments on each PULL packet sent to that sender.
The receiver sends out PULL packets from the per-interface pull
queue, paced so that the data packets they elicit from the sender
then arrive at the receiver’s link rate. Pull packets from different
connections are serviced fairly by default, or with strict prioriti-
zation when a flow has higher priority.
2
This NACK has the PULL bit set, requesting retransmission.
3
When we refer to headers in this context, we are referring to the headers of packets
whose payload was trimmed off by a switch
When a PULL packet arrives at the sender, the sender will send as
many data packets as the pull counter increments by. Any packets
queued for retransmission are sent first, followed by new data.
When the sender runs out of data to send, it marks the last packet.
When the last packet arrives, the receiver removes any pull pack-
ets for that sender from its pull queue to avoid sending unneces-
sary pull packets. Any subsequent data the sender later wants to
send will be pushed rather than pulled.
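The sketch below restates these steps as simplified Python, as promised above. It is an illustrative reconstruction rather than NDP's implementation: the 12-packet initial window and the 9KB/10Gb/s pacing constants follow assumptions stated elsewhere in the paper, and every class and method name is invented for the example.

from collections import deque

PACKET_BITS = 9000 * 8                       # 9KB jumbograms (assumed)
LINK_RATE_BPS = 10e9                         # 10Gb/s receiver link (assumed)
PACKET_TIME = PACKET_BITS / LINK_RATE_BPS    # pacing interval, about 7.2 microseconds

class NdpSender:
    # Illustrative per-connection sender state.
    def __init__(self, data_seqnos, initial_window=12):
        self.to_send = deque(data_seqnos)
        self.rtx_queue = deque()             # NACKed packets awaiting a PULL
        self.initial_window = initial_window

    def first_window(self):
        # Push a full window immediately: no handshake, no probing.
        n = min(self.initial_window, len(self.to_send))
        return [self.to_send.popleft() for _ in range(n)]

    def on_nack(self, seqno):
        self.rtx_queue.append(seqno)         # prepare for retransmission, do not send yet

    def on_pull(self, n_packets):
        out = []
        for _ in range(n_packets):
            if self.rtx_queue:
                out.append(self.rtx_queue.popleft())   # retransmissions first
            elif self.to_send:
                out.append(self.to_send.popleft())     # then new data
        return out

class NdpReceiver:
    # Illustrative receiver with a single pull queue shared by all its connections.
    def __init__(self):
        self.pull_queue = deque()            # (sender, pull_counter) entries
        self.pull_counters = {}
        self.next_pull_time = 0.0

    def _queue_pull(self, sender):
        self.pull_counters[sender] = self.pull_counters.get(sender, 0) + 1
        self.pull_queue.append((sender, self.pull_counters[sender]))

    def on_data(self, sender, seqno):
        self._queue_pull(sender)
        return ("ACK", seqno)                # sender may free its copy of the packet

    def on_trimmed_header(self, sender, seqno):
        self._queue_pull(sender)
        return ("NACK", seqno)               # sender queues the packet for retransmission

    def send_pulls(self, now):
        # Release queued PULLs no faster than one per packet serialization time,
        # so the data packets they elicit arrive at (or below) the link rate.
        released = []
        while self.pull_queue and now >= self.next_pull_time:
            released.append(self.pull_queue.popleft())
            self.next_pull_time = max(self.next_pull_time, now) + PACKET_TIME
        return released

# The first RTT is pushed optimistically; later data is clocked out by paced PULLs.
sender = NdpSender(range(40))
receiver = NdpReceiver()
first_rtt_packets = sender.first_window()    # 12 packets sent at line rate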
Due to packet trimming, it is very rare for a packet to be actually
lost; usually this is due to corruption. As ACKs and NACKs are
sent immediately, are priority-forwarded, and all switch queues are
small, the sender can know very quickly if a packet was actually lost.
With eight packet switch queues, 9KB jumbograms, and store-and-
forward switches in a 10Gb/s FatTree topology, each packet takes 7.2µs to serialize. Taking into account NDP's priority queuing, the worst-case network RTT is approximately 400µs, with typical RTTs
being much shorter. This allows a very short retransmission timeout
to be used to provide reliability for such corrupted packets.
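One back-of-envelope way to arrive at numbers of this magnitude, assuming 9KB means 9000 bytes and a six-link host-to-host FatTree path that is worst-case queued at every store-and-forward hop (the paper's exact accounting for the 400µs figure may differ), is:

# Assumptions: 9KB means 9000 bytes, 10Gb/s links, eight-packet data queues,
# and a six-link host-to-host FatTree path; this is a rough estimate only.
PACKET_BYTES = 9000
LINK_BPS = 10e9
QUEUE_PACKETS = 8
HOPS = 6   # host-ToR, ToR-Agg, Agg-Core, Core-Agg, Agg-ToR, ToR-host

serialize = PACKET_BYTES * 8 / LINK_BPS            # 7.2e-06 s, i.e. 7.2us per packet
per_hop_worst = (QUEUE_PACKETS + 1) * serialize    # wait behind a full queue, then send
one_way_worst = HOPS * per_hop_worst               # about 389us; priority-forwarded
                                                   # control packets add little on the
                                                   # return path, giving roughly 400us
print(f"serialize: {serialize * 1e6:.1f} us, worst-case one-way: {one_way_worst * 1e6:.0f} us")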
PULL packets perform a role similar to TCP’s ACK-clock, but are
usually separated from ACKs to allow them to be paced without impacting the retransmission timeout mechanism. (If there is only one sender, PULL packets don't need extra pacing because data packets arrive paced appropriately; in such cases we can send a combined PULLACK.) For example, in a
large incast scenario, PULLs may spend a comparatively long time
in the receiver’s pull queue before the pacer allows them to be sent,
but we don’t want to also delay ACKs because doing so requires
being much more conservative with retransmission timeouts.
The emergent behavior is that the first RTT of data in a connection
is pushed, and subsequent RTTs of data are pulled so as to arrive
at the receiver’s line rate. In an incast scenario, if many senders
send simultaneously, many of their first window of packets will be
trimmed, but subsequently receiver pulling ensures that the aggregate
arrival rate from all senders matches the receiver’s link speed, with
few or no packets being trimmed.
3.2.1 Coping with Reordering
Due to per-packet multipath forwarding, it is normal for both data
packets and reverse-path ACKs, NACKs and PULLs to be reordered.
The basic protocol design is robust to reordering, as it does not need to make inferences about loss from other packets' sequence numbers.
However, reordering still needs to be taken into account.
Although PULL packets are priority-queued, they don’t preempt
data packets, so PULL packets sent on different paths often arrive out
of order, increasing the burstiness of retransmissions. To reduce this,
PULLs carry a pull sequence number. The receiver has a separate pull
sequence space for each connection, incrementing it by one for each
pull sent. On receipt of a PULL, the sender transmits as many packets
as the pull sequence number increases by. For example, if a PULL is
delayed, the next PULL sent may arrive first via a different path, and
will pull two packets rather than one. This reduces burstiness a little.
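A minimal sketch of this cumulative pull-sequence handling at the sender (names are ours, for illustration only):

class PullState:
    # Sender-side handling of reordered PULLs: send as many packets as the pull
    # sequence number advances by, so a delayed PULL is covered by a later one.
    def __init__(self):
        self.highest_pull_seen = 0

    def packets_to_send(self, pull_seqno):
        if pull_seqno <= self.highest_pull_seen:
            return 0                          # late or duplicate PULL: nothing more to send
        n = pull_seqno - self.highest_pull_seen
        self.highest_pull_seen = pull_seqno
        return n

state = PullState()
assert state.packets_to_send(2) == 2   # PULL 1 was delayed, so PULL 2 pulls two packets
assert state.packets_to_send(1) == 0   # the late PULL 1 finally arrives and is ignored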
3.2.2 The First RTT
Unlike in TCP, where the SYN/SYN-ACK handshake happens ahead
of data exchange, we wish NDP data to be sent in the first RTT. This
adds three new requirements:
• Be robust to requests that spoof source IP addresses.

References (partial)
• V. Jacobson. Congestion avoidance and control. SIGCOMM 1988.
• M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM 2008.
• A. Greenberg et al. VL2: a scalable and flexible data center network. SIGCOMM 2009.
• T. Benson, A. Akella, and D. Maltz. Network traffic characteristics of data centers in the wild. IMC 2010.
• M. Alizadeh et al. Data center TCP (DCTCP). SIGCOMM 2010.