
SCDP: systematic rateless coding for efficient data transport in
data centres
Article (Accepted Version)
http://sro.sussex.ac.uk
Alasmar, Mohammed, Parisis, George and Crowcroft, Jon (2021) SCDP: systematic rateless
coding for efficient data transport in data centres. IEEE/ACM Transactions on Networking, 29 (6).
pp. 2723-2736. ISSN 1063-6692
This version is available from Sussex Research Online: http://sro.sussex.ac.uk/id/eprint/100696/
This document is made available in accordance with publisher policies and may differ from the
published version or from the version of record. If you wish to cite this item you are advised to
consult the publisher’s version. Please see the URL above for details on accessing the published
version.
Copyright and reuse:
Sussex Research Online is a digital repository of the research output of the University.
Copyright and all moral rights to the version of the paper presented here belong to the individual
author(s) and/or other copyright owners. To the extent reasonable and practicable, the material
made available in SRO has been checked for eligibility before being made available.
Copies of full text items generally can be reproduced, displayed or performed and given to third
parties in any format or medium for personal research or study, educational, or not-for-profit
purposes without prior permission or charge, provided that the authors, title and full bibliographic
details are credited, a hyperlink and/or URL is given for the original metadata page and the
content is not changed in any way.

SCDP: Systematic Rateless Coding for Efficient
Data Transport in Data Centres
Mohammed Alasmar, George Parisis, Jon Crowcroft
School of Engineering and Informatics, University of Sussex, UK, Email: {m.alasmar, g.parisis}@sussex.ac.uk
Computer Laboratory, University of Cambridge, UK, Email: Jon.Crowcroft@cl.cam.ac.uk
Abstract—In this paper we propose SCDP, a general-purpose
data transport protocol for data centres that, in contrast to all
other protocols proposed to date, supports efficient one-to-many
and many-to-one communication, which is extremely common in
modern data centres. SCDP does so without compromising on
efficiency for short and long unicast flows. SCDP achieves this by
integrating RaptorQ codes with receiver-driven data transport,
packet trimming and Multi-Level Feedback Queuing (MLFQ);
(1) RaptorQ codes enable efficient one-to-many and many-to-
one data transport; (2) on top of RaptorQ codes, receiver-driven
flow control, in combination with in-network packet trimming,
enable efficient usage of network resources as well as multi-path
transport and packet spraying for all transport modes. Incast and
Outcast are eliminated; (3) the systematic nature of RaptorQ
codes, in combination with MLFQ, enable fast, decoding-free
completion of short flows. We extensively evaluate SCDP in
a wide range of simulated scenarios with realistic data centre
workloads. For one-to-many and many-to-one transport sessions,
SCDP performs significantly better compared to NDP and PIAS.
For short and long unicast flows, SCDP performs equally well
or better compared to NDP and PIAS.
Index Terms—Data centre networking, data transport protocol,
fountain coding, modern workloads.
I. INTRODUCTION
Data centres support the provision of core Internet services
and it is therefore crucial to have in place data transport
mechanisms that ensure high performance for the diverse
set of supported services. Data centres consist of a large number of commodity servers and switches; they support multiple paths among servers (which can be multi-homed), very large aggregate bandwidth, and very low-latency communication, with shallow buffers at the switches.
One-to-many and many-to-one communication. Modern
data centres support a plethora of services that produce one-to-
many and many-to-one traffic workloads. Distributed storage
systems, such as GFS/HDFS [1], [2] and Ceph [3], replicate
data blocks across the data centre (with or without daisy chaining¹). Partition-aggregate [4], [5], streaming telemetry [6], [7],
distributed messaging [8], [9], publish-subscribe systems [10],
[11], high frequency trading [12], [13] and replicated state
machines [14], [15] also produce similar workloads. Multicast
has already been deployed in data centres (e.g. to support
virtualised workloads [16] and financial services [17]). With
the advent of P4, multicasting in data centres is becoming
practical [18]. As a result, much research on scalable network-
layer multicasting in data centres has recently emerged [19]–
[23], including approaches for optimising multicast flows in
¹ https://patents.google.com/patent/US20140215257
reconfigurable data centre networks [24] and programming
interfaces for applications requesting data multicast [25].
Existing data centre transport protocols are suboptimal in
terms of network and server utilisation for these workloads.
One-to-many data transport is implemented through multi-
unicasting or daisy chaining for distributed storage. As a result,
copies of the same data are transmitted multiple times, wasting
network bandwidth and creating hotspots that severely impair
the performance of short, latency-sensitive flows. In many
application scenarios, multiple copies of the same data can
be found in the network at the same time (e.g. in replicated
distributed storage) but only one replica server is used to
fetch it. Fetching data, in parallel, from all available replica servers (many-to-one data transport) would
provide significant benefits in terms of eliminating hotspots
and naturally balancing load among servers.
These performance limitations are illustrated in Figure 1,
where we plot the application goodput for TCP and NDP
(Novel Datacenter transport Protocol) [26] in a distributed
storage scenario with 1 and 3 replicas. When a single replica
is stored in the data centre, NDP performs very well, as also
demonstrated in [26]. TCP performs poorly². On the other
hand, when three replicas are stored in the network, both NDP
and TCP perform poorly in both write and read workloads.
Writing data involves either multi-unicasting replicas to all
three servers (blue and green lines in Figure 1a) or daisy
chaining replica servers (black line); although daisy chaining
performs better by avoiding the bottleneck at the client’s uplink, both approaches consume excessive bandwidth by moving multiple
copies of the same block in the data centre. Fetching a data
block from a single server, when it is also stored on two other servers, creates hotspots at servers’ uplinks due to collisions arising from randomly selecting a replica server for each read request
(see black and purple lines in Figure 1b).
Long and short flows. Modern cloud applications commonly
have strict latency requirements [28]–[33]. At the same time,
background services require high network utilisation [34]–
[37]. A plethora of mechanisms and protocols have been pro-
posed to date to provide efficient access to network resources
to data centre applications, by exploiting support for multiple
equal-cost paths between any two servers [26], [35], [36],
[38] and hardware capable of low latency communication [32],
[39], [40] and eliminating Incast [41]–[43] and Outcast [44].
Recent proposals commonly focus on a single dimension of
² It is well-established that TCP is ill-suited for meeting the throughput and latency requirements of applications in data centre networks; we therefore use NDP and PIAS [27] as the baseline protocols in this paper.

[Figure 1: goodput (Gbps) versus rank of transport session for 10,000 sessions. Panel (a), one-to-many (write), compares 1 Replica NDP, 3 Replicas NDP (daisy chain), 3 Replicas NDP (multi-unicast), 1 Replica TCP and 3 Replicas TCP; panel (b), many-to-one (read), compares 1 Sender NDP, 3 Senders NDP, 1 Sender TCP and 3 Senders TCP.]
Fig. 1: Goodput in a 250-server FatTree topology with 1Gbps link speed and 10µs link delay. Background traffic is present to simulate congestion. Results are for 10,000 (a) write and (b) read block requests (2MB each). Each I/O request is ‘assigned’ to a host in the network, which is selected uniformly at random and acts as the client. Requests’ arrival times follow a Poisson process with an inter-arrival rate λ = 1000. Replica selection and placement are based on HDFS’ default policy.
the otherwise complex problem space; e.g. TIMELY [45],
DCQCN [46], QJUMP [47] and RDMA over Converged
Ethernet v2 [48] focus on low latency communication but do
not support multi-path routing. Other approaches [36], [37]
do provide excellent performance for long flows but perform
poorly for short flows [34], [35]. None of these protocols sup-
port efficient one-to-many and many-to-one communication.
Contribution. In this paper we propose SCDP³, a general-
purpose data transport protocol for data centres that, unlike
any other protocol proposed to date, supports efficient one-
to-many and many-to-one communication. This, in turn, re-
sults in significantly better overall network utilisation, min-
imising hotspots and providing more resources to long and
short unicast flows. At the same time, SCDP supports fast
completion of latency-sensitive flows and consistently high-
bandwidth communication for long flows. SCDP eliminates
Incast and Outcast. All these are made possible by integrating
RaptorQ codes [52], [53] with receiver-driven data transport
[26], [32], in-network packet trimming [26], [54] and Multi-
Level Feedback Queuing (MLFQ) [27].
SCDP performance overview. We found that SCDP improves
goodput performance by up to 50% compared to NDP and
60% compared to PIAS with different application work-
loads involving one-to-many and many-to-one communication
(§V-A). Equally importantly, it reduces the average FCT for
short flows by up to 45% compared to NDP and 70% com-
pared to PIAS under two realistic data centre traffic workloads
(§V-B). For short flows, decoding latency is minimised by the
combination of the systematic nature of RaptorQ codes and
MLFQ; even in a 70% loaded network, decoding was needed
for only 9.6% of short flows. This percentage was less than 1%
in a 50% congested network (§V-G). The network overhead in-
duced by RaptorQ codes is negligible compared to the benefits
of supporting one-to-many and many-to-one communication.
Only 1% network overhead was introduced when the network was very heavily congested (§V-H). RaptorQ codes have been shown to perform exceptionally well, in terms of encoding/decoding rates, even on a single core. We therefore expect that, with hardware offloading in combination with SCDP’s block pipelining mechanism (§IV-F), the required computational overhead will not be significant.
³ SCDP builds on our early work on integrating fountain coding into data transport protocols [49]–[51]. In [50] we motivated the need for a novel data transport mechanism to efficiently support one-to-many and many-to-one communication and argued that rateless coding is the way forward. In [49], we introduced an early version of SCDP to the research community.
II. RAPTORQ ENCODING AND DECODING
Encoding. RaptorQ codes are rateless and systematic. The
input to the encoder is one or more source blocks; for each one
of these source blocks, the encoder creates a potentially very
large number of encoding symbols (rateless coding). All K
source symbols (i.e. the original fragments of a source block)
are amongst the set of encoding symbols (systematic coding).
All other symbols are called repair symbols. Senders initially
send source symbols, followed by repair symbols, if needed.
Decoding. A source block can be decoded after receiving
a number of symbols that must be equal to or larger than
the number of source symbols; all symbols contribute to
the decoding process equally. In a lossless communication
scenario, decoding is not required, because all source symbols
are available (systematic coding).
Performance. In the absence of loss, RaptorQ codes do
not incur any network or computational overhead. When loss occurs, the trade-off associated with RaptorQ codes involves (1) some minimal network overhead to enable successful decoding of the original fragments and (2) computational overhead for decoding the received symbols into the original fragments. RaptorQ codes behave exceptionally well in both respects. With two extra encoding symbols (compared to the number of original fragments), the decoding failure probability is on the order of 10⁻⁶. It is important to note that
decoding failure is not fatal; instead one or more encoding
symbols can be requested in order to ensure that decoding is
successful [53]. The time complexity of RaptorQ encoding
and decoding is linear in the number of source symbols.
RaptorQ codes support excellent performance for all block
sizes, including very small ones, which is very important for
building a general-purpose data transport protocol that is able
to handle efficiently a diverse set of workloads. In [55], [56],
the authors report encoding and decoding speeds of over 10
Gbps using a RaptorQ software prototype running on a single
core. With hardware offloading, RaptorQ codes would be able
to support data transport at line speeds in modern data centre
deployments. On top of that, multiple blocks can be decoded
in parallel, independently of each other (e.g. on different
cores). Decoding small source blocks is even faster, as reported
in [55]. Decoding performance depends neither on the order in which symbols arrive nor on which particular symbols are received.
Example. Before explaining how RaptorQ codes are integrated
in SCDP, we present a simple example of point-to-point
communication between two hosts, which is illustrated in
Figure 2⁴. On the sender side, a single source block is passed to the encoder, which fragments it into 8 equal-sized source symbols S1, S2, ..., S8. The encoder uses the source symbols to generate repair symbols Sa, Sb and Sc (here, the decision to encode 3 repair symbols is arbitrary). Encoding symbols are transmitted to the network, along with the respective encoding symbol identifiers (ESI) and source block numbers (SBN) [52]. As shown in Figure 2, symbols S4 and Sb are lost. Symbols take different paths in the network, but this is transparent to the receiver, which only needs to collect a specific amount of encoding symbols (source and/or repair). The receiver can receive symbols from multiple senders and over different network interfaces. In this example, the receiver attempts to decode the original source block upon receiving 9 symbols, i.e. with one extra symbol, which is network overhead (as shown in Figure 2). Decoding is successful and the source block is passed to the receiver application. As mentioned above, if no loss had occurred, there would be no need for decoding and the data would have been passed directly to the application.
⁴ Note that Figure 2 does not illustrate SCDP’s underlying mechanisms. The design of SCDP is discussed extensively in Section IV.
[Figure 2: the sender’s encoder fragments a source block into source symbols 1–8 and generates repair symbols a–c; the encoding symbols traverse different network paths, symbols 4 and b are lost, and the receiver’s decoder reconstructs the source block from the received symbols, one of which counts as overhead.]
Fig. 2: RaptorQ-based communication
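To make the bookkeeping in this example concrete, the short sketch below (illustrative Python with hypothetical names; it is not part of SCDP or of any RaptorQ implementation) replays the Figure 2 scenario: it checks whether all source symbols arrived, in which case decoding is skipped thanks to systematic coding, and how many of the received symbols count as overhead.

# Bookkeeping sketch of the Figure 2 example (illustrative, not SCDP code).
K = 8                                # source symbols S1..S8 (ESIs 0..7)
sent_esis = list(range(K + 3))       # 8 source symbols plus 3 repair symbols (Sa, Sb, Sc)
lost_esis = {3, 9}                   # S4 (ESI 3) and Sb (ESI 9) are lost in the network

received = [esi for esi in sent_esis if esi not in lost_esis]

all_source_present = all(esi in received for esi in range(K))   # systematic shortcut possible?
enough_to_decode = len(received) >= K                           # RaptorQ needs at least K symbols
overhead = len(received) - K                                    # symbols beyond the minimum

print(all_source_present)   # False: S4 was lost, so decoding is required
print(enough_to_decode)     # True: 9 >= 8, so decoding succeeds with high probability
print(overhead)             # 1: one extra symbol of network overhead, as in Figure 2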
Erasure coding in data transport. There is a long and
interesting trail of research that integrates erasure coding into
data transport protocols. SCDP is unique compared to all these
works, efficiently supporting one-to-many and many-to-one
data transport sessions for distributed storage and numerous
other workloads prevalent in modern data centres, without
sacrificing performance for traditional short and long flows.
In [57], the authors explore the advantages and challenges
of integrating end-to-end coding into TCP. Corrective [58]
employs coding for faster loss recovery but it can only deal
with one packet loss in one window as its coding redun-
dancy is fixed. FMTCP [38] employs fountain coding to
improve the performance of MPTCP [35] by recovering data
over multiple subflows. LTTP [42] is a UDP-based transport
protocol that uses fountain codes to mitigate Incast in data
centres. CAPS [59] deals with out-of-order data by applying forward error correction to short flows, in order to reduce their
flow completion time, and employs ECMP for achieving high
throughput for long flows. RC-UDP [60] is a rateless coding
data transport protocol that enables reliable data transfer
over high bandwidth networks. It uses block-by-block flow
control where the sender keeps sending encoded symbols until
the receiver sends an acknowledgement indicating successful
decoding. PPUSH [61] is a multi-source data delivery protocol
that employs RaptorQ codes for sending multiple flows in
parallel using all available replicas.
III. THE CASE FOR RAPTORQ CODING IN DATA
TRANSPORT FOR DATA CENTRE NETWORKS
The starting point in designing SCDP, which is also the
key differentiator to the rest of the literature, is its efficient
handling of one-to-many and many-to-one communication,
without sacrificing performance for traditional unicast flows.
One-to-many communication. None of the existing data
transport protocols for data centres can support communication
beyond traditional unicast flows, even if network-level multicasting were deployed in the network. Congestion control in
reliable multicasting is a challenging problem and traditional
sender-driven, reliable multicasting approaches (e.g. as in [62],
[63]) would suffer from Incast [41], and lack of support
for multipath routing and multi-homed servers, as well as
their inability to spray packets in the network. A receiver-
driven approach would be more suitable. However, extending approaches such as NDP [26] or Homa [32] is far from trivial, as this would entail complications with flow control when losses occur, because lost packets must be retransmitted.
Senders would have to maintain state, enqueuing incoming
pull requests by multiple receivers, while waiting to multicast
a new packet or retransmit a lost packet. Equally importantly,
the slowest receiver would slow down all other receivers.⁵
With RaptorQ codes and receiver-driven flow control, one-
to-many communication is simple and efficient: a sender
multicasts a new symbol after receiving a pull request from
all receivers (see Section IV-D for a detailed description). A
sender does not need to remember which symbols it has sent
as there is no notion of retransmission. Instead, it only needs to
count the number of pending pull requests from each receiver
so it can ‘clock’ symbol sending. A receiver can decode the
original data and complete the session after it receives the
necessary amount of symbols (see Section II), independently
of other receivers that may be behind in terms of receiving
symbols because of network congestion (e.g. when they are
connected to a congested ToR switch).
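A minimal sketch of this pull-clocking logic is given below; the class and method names are illustrative, and the initial window push and the actual multicast transmission are omitted (Section IV-D describes the full mechanism).

# Illustrative sketch of a pull-clocked one-to-many sender (not SCDP's actual code).
class OneToManySender:
    def __init__(self, receivers):
        self.pending_pulls = {r: 0 for r in receivers}   # pending pull requests per receiver
        self.next_esi = 0                                # next encoding symbol to multicast

    def on_pull(self, receiver):
        if receiver in self.pending_pulls:
            self.pending_pulls[receiver] += 1
        self._try_send()

    def on_receiver_done(self, receiver):
        # A receiver that has decoded the source block no longer clocks the sender.
        self.pending_pulls.pop(receiver, None)
        self._try_send()

    def _try_send(self):
        # Multicast a new symbol only once every active receiver has an outstanding
        # pull; no retransmission state is needed because any symbol is useful.
        while self.pending_pulls and all(c > 0 for c in self.pending_pulls.values()):
            self.multicast_symbol(self.next_esi)
            self.next_esi += 1
            for r in self.pending_pulls:
                self.pending_pulls[r] -= 1

    def multicast_symbol(self, esi):
        print(f"multicast symbol ESI={esi}")             # placeholder for the real send path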
Many-to-one communication. Existing protocols do not and
could not support many-to-one communication in a way
that benefits the overall performance. Even if senders were
instructed to only send a subset of the original data fragments
(emulating many-to-one communication), a congested or slow
server would always be the bottleneck for the whole session.
With RaptorQ codes, each sender contributes as much as
it can, given the current conditions, in terms of network
congestion and local load. The rateless nature of RaptorQ codes enables receivers to successfully decode a source block regardless of which server sent the symbols. The only requirement is to receive the required number of symbols (see Section
II). This is a unique characteristic of SCDP (see Section IV-E),
which ‘bypasses’ network hotspots by having non-congested
servers contributing more symbols to the receiver. Crucially,
this is done without any central coordination.
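The sketch below (a hypothetical helper, not SCDP’s implementation) illustrates this property: symbols are collected from any combination of senders, and the session can complete as soon as enough distinct symbols have arrived, however unevenly the individual replica servers contributed.

# Illustrative many-to-one symbol collection: the origin of a symbol is irrelevant;
# only the number of distinct symbols matters.
def collect_until_decodable(symbol_stream, K):
    """symbol_stream yields (sender_id, esi, payload) tuples from any replica server."""
    received = {}      # esi -> payload
    per_sender = {}    # how many symbols each server ended up contributing
    for sender_id, esi, payload in symbol_stream:
        received.setdefault(esi, payload)                     # duplicates add nothing
        per_sender[sender_id] = per_sender.get(sender_id, 0) + 1
        if len(received) >= K:                                # enough symbols to attempt decoding
            return received, per_sender
    raise RuntimeError("stream ended before enough symbols were received")

A congested or slow replica simply ends up with a smaller count in per_sender; the session still completes once K distinct symbols have arrived from any combination of servers.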
Flow completion time and goodput. SCDP’s benefits dis-
cussed above do not come at a cost for traditional unicast
flows. This is due to the combination of the systematic nature
of RaptorQ codes, MLFQ, and packet trimming. More specif-
ically, FCT for short flows is very small, unaffected by the
introduction of coding because senders first send the original
data fragments (systematic coding) with the highest priority,
minimising loss for them. As a result, decoding is very rarely
required for short flows. SCDP performs exceptionally well
⁵ How existing protocols for data centres could be extended to support one-to-many and many-to-one communication is beyond the scope of this paper.

also for long flows despite the fact that (the otherwise efficient)
decoding is needed more often. This is done by employing
pipelining of source blocks, which alleviates the decoding
overhead for large data blocks and maximises application
goodput (see Section IV-F). In combination with receiver-
driven flow control and packet trimming, SCDP eliminates In-
cast and Outcast, playing well with switches’ shallow buffers.
Network utilisation. SCDP ensures high network utilisation
for all communication modes; with RaptorQ coding there is no
notion of ordering, as all symbols contribute to the decoding
(if needed) of source data. As a result, symbols can be sprayed
in the network through all available paths maximising utilisa-
tion and minimising the formation of hotspots. At the same
time, receivers can receive symbols from different interfaces
naturally enabling multi-homed topologies (e.g. [64], [65]).
IV. SCDP DESIGN
In this section, we present SCDP’s design; we define SCDP’s packet types and the adopted switch model. We then describe all of SCDP’s supported communication modes, and
how we maximise goodput and minimise flow completion time
(FCT) for long and short flows, respectively.
A. Packet Types
SCDP’s packet format is shown in Figure 3. Port numbers
are used to identify a transport session. The type field (TYP
in Figure 3) is used to denote one of the three SCDP packet
types; symbol, header and pull (denoted as SMBL, HDR and
PULL, respectively, in Algorithms 1 and 2). The priority field
(PRI in Figure 3) is set by the sender and is used by MLFQ
(see Section IV-B).
A symbol packet carries in its payload one MTU-sized
source or repair symbol. The source block number (SBN)
identifies the source block the carried symbol belongs to. The
encoding symbol identifier (ESI) identifies the symbol within
the stream of source and repair symbols for the specific source
block [52]. A sender initiates a transport session by pushing
an initial window of symbols with the syn flag set, for the
first source block. These symbol packets also carry a number
of options: the transfer mode (M in Figure 3) can be unicast,
many-to-one or one-to-many. The rest of the options are used
to define the total length of the session (F in Figure 3), number
of source blocks (Z in Figure 3) and the symbol size (T in
Figure 3). The source block size K is derived from these
options as described in RaptorQ RFC [52]. We adopt the
notation used in this RFC [52].
Header packets are trimmed versions of symbol packets.
Upon receiving a symbol packet that cannot be buffered, a
network switch trims its payload and forwards the header, with
the highest priority. Header packets are used to ensure that a
window (w) of symbol packets is always in-flight.
A pull packet is sent by a receiver to request a symbol. The
sequence number is only used to indicate how many symbols of the specified source block to send, in case pull
requests get reordered. Multiple symbol packets may be sent
in response to a single pull request, as described in Section
IV-C. The fin flag is used to identify the last pull request; upon
receiving such a pull request, a sender sends the last symbol
packet for this SCDP session.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Block Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Encoded Symbol Identifier / Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| P | T |S|F| |
| R | Y |Y|I| |
| I | P |N|N| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options{M,F,T,Z} |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| payload |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Fig. 3: SCDP packet format
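For illustration, the following dataclass mirrors the header fields of Figure 3 in memory; it is a sketch for exposition under assumed field types, not a wire-format codec.

# Illustrative in-memory view of the SCDP header of Figure 3 (not a wire codec).
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class PacketType(Enum):
    SMBL = 0   # symbol packet: carries one MTU-sized source or repair symbol
    HDR = 1    # header packet: a symbol packet trimmed by a switch
    PULL = 2   # pull packet: receiver's request for further symbols

@dataclass
class ScdpPacket:
    src_port: int
    dst_port: int
    sbn: int                    # source block number
    esi_or_seq: int             # ESI for symbol/header packets, sequence number for pulls
    priority: int               # PRI: set by the sender, used by MLFQ switches
    typ: PacketType             # TYP
    syn: bool = False           # set on the initial window of symbol packets
    fin: bool = False           # set on the last pull request of a session
    options: dict = field(default_factory=dict)   # {M: mode, F: length, Z: #blocks, T: symbol size}
    payload: Optional[bytes] = None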
B. Switch Service Model
SCDP relies on network switching functionality that is
either readily available in today’s data centre networks [32] or
is expected to be [26] when P4 switches are widely deployed.
SCDP does not require any more switch functionality than
NDP [26]⁶, Homa [32], QJUMP [47], or PIAS [27] do.
Priority scheduling and packet trimming. In order to support
latency-sensitive flows, we employ MLFQ [27], and packet
trimming [54]. We assume that network switches support a
small number of queues with respective priority levels. The top
priority queue is only used for header and pull packets. This is
crucial for swiftly providing feedback to receivers about loss.
Given that both types of packets are very small, it is extremely
unlikely that the respective queue gets full and that they are
dropped⁷. The rest of the queues are small and buffer symbol
packets. Switches perform weighted round-robin scheduling
between the top-priority (header/pull) queue and the symbol
packet queues. This guards against a congestion collapse
situation, where a switch only forwards trimmed headers and
all symbol packets are trimmed to headers. When a data packet
is to be transmitted, the switch selects the head packet from
the highest priority, non-empty queue.
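The per-port sketch below illustrates this switch model, reusing the PacketType and ScdpPacket definitions from the packet sketch above; the number of symbol queues, the queue depth and the control-queue burst size are illustrative values, and the weighted round-robin is reduced to a crude burst limit.

# Illustrative per-port queueing with priority scheduling and packet trimming.
from collections import deque

CONTROL = 0                        # top-priority queue: header and pull packets only
SYMBOL_QUEUES = [1, 2, 3]          # lower-priority queues for symbol packets
SYMBOL_QUEUE_LEN = 8               # shallow per-queue buffer (in packets)
CONTROL_BURST = 4                  # crude stand-in for the weighted round-robin weight

class ScdpSwitchPort:
    def __init__(self):
        self.queues = {q: deque() for q in [CONTROL] + SYMBOL_QUEUES}
        self.control_streak = 0

    def enqueue(self, pkt):
        if pkt.typ in (PacketType.HDR, PacketType.PULL):
            self.queues[CONTROL].append(pkt)              # tiny packets, rarely dropped
            return
        q = 1 + min(pkt.priority, len(SYMBOL_QUEUES) - 1) # sender-set priority picks the queue
        if len(self.queues[q]) < SYMBOL_QUEUE_LEN:
            self.queues[q].append(pkt)
        else:
            pkt.payload = None                            # trim: drop the payload...
            pkt.typ = PacketType.HDR                      # ...and promote the header
            self.queues[CONTROL].append(pkt)

    def dequeue(self):
        # After CONTROL_BURST control packets in a row, serve a symbol packet if one
        # is queued, so trimmed headers cannot starve symbols (congestion collapse).
        if self.queues[CONTROL] and self.control_streak < CONTROL_BURST:
            self.control_streak += 1
            return self.queues[CONTROL].popleft()
        self.control_streak = 0
        for q in SYMBOL_QUEUES:                           # highest-priority non-empty queue
            if self.queues[q]:
                return self.queues[q].popleft()
        if self.queues[CONTROL]:
            self.control_streak += 1
            return self.queues[CONTROL].popleft()
        return None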
Multipath routing. SCDP packets are sprayed to all available
equal-cost paths to the destination⁸ in the network. SCDP
relies on ECMP and spraying could be done either by using
randomised source ports [34], or the ESI of symbol and header
packets and the sequence number of pull packets.
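Conceptually, the per-packet path choice could look like the sketch below (illustrative only; in practice the spraying would be realised through the switches’ ECMP hashing, e.g. by randomising source ports), with the ESI or pull sequence number feeding the hash so that consecutive packets of a session take different paths.

# Illustrative per-packet spraying across equal-cost paths.
import zlib

def pick_path(pkt, num_paths):
    # Hash the ESI (symbol/header packets) or sequence number (pull packets)
    # together with the session identifiers into a path index.
    key = (pkt.src_port, pkt.dst_port, pkt.sbn, pkt.esi_or_seq)
    return zlib.crc32(repr(key).encode()) % num_paths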
C. Unicast Transport Sessions
A sender implicitly opens a unicast SCDP transport session
by pushing an initial window of w (syn-enabled) symbol
packets tagged with the highest priority (Lines 2–12 in Algorithm 1⁹). Senders tag outgoing symbol packets with a
priority value, which is used by the switches when scheduling
their transmission (§IV-B). The priority of outgoing symbol
packets is gradually degraded when specific thresholds are
reached. Calculating these thresholds can be done as in PIAS
[27] or AuTO [39] (Line 30 in Algorithm 1). The receiver
establishes a new session upon receiving the first symbol that
⁶ As reported in [66], there is ongoing work by switch vendors to implement the NDP switch. Moreover, a smartNIC implementation of the NDP end-host stack is also ongoing. This is very promising for the deployability of next-generation protocols, including SCDP, in the real world.
⁷ SCDP receivers employ a simple timeout mechanism, as in [26], to recover from the unlikely losses of pull and header packets.
⁸ In SCDP’s one-to-many transfer mode there are multiple destinations.
⁹ For clarity, Algorithms 1 and 2 illustrate a slightly simplified version of SCDP for unicast data transport, for a single source block, without pipelining.
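As an illustration of the sender-side tagging described above, the sketch below demotes a session’s priority as more bytes are sent; the threshold values are placeholders and not the PIAS/AuTO-derived thresholds referred to in the text.

# Illustrative priority demotion for outgoing symbol packets (placeholder thresholds).
DEMOTION_THRESHOLDS = [100 * 1024, 1024 * 1024, 10 * 1024 * 1024]   # bytes sent so far

def priority_for(bytes_sent):
    """Return the MLFQ priority for the next symbol packet (0 = highest)."""
    for level, threshold in enumerate(DEMOTION_THRESHOLDS):
        if bytes_sent < threshold:
            return level
    return len(DEMOTION_THRESHOLDS)   # lowest priority for the longest sessions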
