
SCDP: systematic rateless coding for efficient data transport in
data centres
Article (Accepted Version)
http://sro.sussex.ac.uk
Alasmar, Mohammed, Parisis, George and Crowcroft, Jon (2021) SCDP: systematic rateless
coding for efficient data transport in data centres. IEEE/ACM Transactions on Networking, 29 (6).
pp. 2723-2736. ISSN 1063-6692
This version is available from Sussex Research Online: http://sro.sussex.ac.uk/id/eprint/100696/
This document is made available in accordance with publisher policies and may differ from the
published version or from the version of record. If you wish to cite this item you are advised to
consult the publisher’s version. Please see the URL above for details on accessing the published
version.
Copyright and reuse:
Sussex Research Online is a digital repository of the research output of the University.
Copyright and all moral rights to the version of the paper presented here belong to the individual
author(s) and/or other copyright owners. To the extent reasonable and practicable, the material
made available in SRO has been checked for eligibility before being made available.
Copies of full text items generally can be reproduced, displayed or performed and given to third
parties in any format or medium for personal research or study, educational, or not-for-profit
purposes without prior permission or charge, provided that the authors, title and full bibliographic
details are credited, a hyperlink and/or URL is given for the original metadata page and the
content is not changed in any way.

SCDP: Systematic Rateless Coding for Efficient
Data Transport in Data Centres
Mohammed Alasmar, George Parisis, Jon Crowcroft
School of Engineering and Informatics, University of Sussex, UK, Email: {m.alasmar, g.parisis}@sussex.ac.uk
Computer Laboratory, University of Cambridge, UK, Email: Jon.Crowcroft@cl.cam.ac.uk
Abstract—In this paper we propose SCDP, a general-purpose
data transport protocol for data centres that, in contrast to all
other protocols proposed to date, supports efficient one-to-many
and many-to-one communication, which is extremely common in
modern data centres. SCDP does so without compromising on
efficiency for short and long unicast flows. SCDP achieves this by
integrating RaptorQ codes with receiver-driven data transport,
packet trimming and Multi-Level Feedback Queuing (MLFQ);
(1) RaptorQ codes enable efficient one-to-many and many-to-
one data transport; (2) on top of RaptorQ codes, receiver-driven
flow control, in combination with in-network packet trimming,
enable efficient usage of network resources as well as multi-path
transport and packet spraying for all transport modes. Incast and
Outcast are eliminated; (3) the systematic nature of RaptorQ
codes, in combination with MLFQ, enable fast, decoding-free
completion of short flows. We extensively evaluate SCDP in
a wide range of simulated scenarios with realistic data centre
workloads. For one-to-many and many-to-one transport sessions,
SCDP performs significantly better compared to NDP and PIAS.
For short and long unicast flows, SCDP performs equally well
or better compared to NDP and PIAS.
Index Terms—Data centre networking, data transport protocol,
fountain coding, modern workloads.
I. INTRODUCTION
Data centres support the provision of core Internet services
and it is therefore crucial to have in place data transport
mechanisms that ensure high performance for the diverse
set of supported services. Data centres consist of a large number of commodity servers and switches; they support multiple paths among servers (which can be multi-homed), very large aggregate bandwidth, and very low-latency communication, with shallow buffers at the switches.
One-to-many and many-to-one communication. Modern
data centres support a plethora of services that produce one-to-
many and many-to-one traffic workloads. Distributed storage
systems, such as GFS/HDFS [1], [2] and Ceph [3], replicate
data blocks across the data centre (with or without daisy chaining¹). Partition-aggregate [4], [5], streaming telemetry [6], [7],
distributed messaging [8], [9], publish-subscribe systems [10],
[11], high frequency trading [12], [13] and replicated state
machines [14], [15] also produce similar workloads. Multicast
has already been deployed in data centres (e.g. to support
virtualised workloads [16] and financial services [17]). With
the advent of P4, multicasting in data centres is becoming
practical [18]. As a result, much research on scalable network-
layer multicasting in data centres has recently emerged [19]–
[23], including approaches for optimising multicast flows in
¹ https://patents.google.com/patent/US20140215257
reconfigurable data centre networks [24] and programming
interfaces for applications requesting data multicast [25].
Existing data centre transport protocols are suboptimal in
terms of network and server utilisation for these workloads.
One-to-many data transport is implemented through multi-
unicasting or daisy chaining for distributed storage. As a result,
copies of the same data are transmitted multiple times, wasting
network bandwidth and creating hotspots that severely impair
the performance of short, latency-sensitive flows. In many
application scenarios, multiple copies of the same data can
be found in the network at the same time (e.g. in replicated
distributed storage) but only one replica server is used to
fetch it. Fetching data, in parallel, from all available replica servers (many-to-one data transport) would
provide significant benefits in terms of eliminating hotspots
and naturally balancing load among servers.
These performance limitations are illustrated in Figure 1,
where we plot the application goodput for TCP and NDP
(Novel Datacenter transport Protocol) [26] in a distributed
storage scenario with 1 and 3 replicas. When a single replica
is stored in the data centre, NDP performs very well, as also
demonstrated in [26]. TCP performs poorly². On the other
hand, when three replicas are stored in the network, both NDP
and TCP perform poorly in both write and read workloads.
Writing data involves either multi-unicasting replicas to all
three servers (blue and green lines in Figure 1a) or daisy
chaining replica servers (black line); although daisy chaining
performs better by avoiding the bottleneck at the client’s uplink, both approaches consume excessive bandwidth by moving multiple
copies of the same block in the data centre. Fetching a data
block from a single server, when it is also stored on two other servers, creates hotspots at servers’ uplinks due to collisions arising from randomly selecting a replica server for each read request
(see black and purple lines in Figure 1b).
Long and short flows. Modern cloud applications commonly
have strict latency requirements [28]–[33]. At the same time,
background services require high network utilisation [34]–
[37]. A plethora of mechanisms and protocols have been pro-
posed to date to provide efficient access to network resources
to data centre applications, by exploiting support for multiple
equal-cost paths between any two servers [26], [35], [36],
[38] and hardware capable of low latency communication [32],
[39], [40] and eliminating Incast [41]–[43] and Outcast [44].
Recent proposals commonly focus on a single dimension of
² It is well-established that TCP is ill-suited for meeting the throughput and latency requirements of applications in data centre networks; we therefore use NDP and PIAS [27] as the baseline protocols in this paper.

[Figure 1: goodput (Gbps) versus rank of transport session for 10,000 sessions. Panel (a), one-to-many (write), compares 1 Replica NDP, 3 Replicas NDP (daisy chain), 3 Replicas NDP (multi-unicast), 1 Replica TCP and 3 Replicas TCP; panel (b), many-to-one (read), compares 1 Sender NDP, 3 Senders NDP, 1 Sender TCP and 3 Senders TCP.]
Fig. 1: Goodput in a 250-server FatTree topology with 1Gbps link speed and 10µs link delay. Background traffic is present to simulate congestion. Results are for 10,000 (a) write and (b) read block requests (2MB each). Each I/O request is ‘assigned’ to a host in the network, which is selected uniformly at random and acts as the client. Requests’ arrival times follow a Poisson process with an inter-arrival rate λ = 1000. Replica selection and placement are based on HDFS’ default policy.
the otherwise complex problem space; e.g. TIMELY [45],
DCQCN [46], QJUMP [47] and RDMA over Converged
Ethernet v2 [48] focus on low latency communication but do
not support multi-path routing. Other approaches [36], [37]
do provide excellent performance for long flows but perform
poorly for short flows [34], [35]. None of these protocols sup-
port efficient one-to-many and many-to-one communication.
Contribution. In this paper we propose SCDP³, a general-
purpose data transport protocol for data centres that, unlike
any other protocol proposed to date, supports efficient one-
to-many and many-to-one communication. This, in turn, re-
sults in significantly better overall network utilisation, min-
imising hotspots and providing more resources to long and
short unicast flows. At the same time, SCDP supports fast
completion of latency-sensitive flows and consistently high-
bandwidth communication for long flows. SCDP eliminates
Incast and Outcast. All these are made possible by integrating
RaptorQ codes [52], [53] with receiver-driven data transport
[26], [32], in-network packet trimming [26], [54] and Multi-
Level Feedback Queuing (MLFQ) [27].
SCDP performance overview. We found that SCDP improves
goodput performance by up to 50% compared to NDP and
60% compared to PIAS with different application work-
loads involving one-to-many and many-to-one communication
(§V-A). Equally importantly, it reduces the average FCT for
short flows by up to 45% compared to NDP and 70% com-
pared to PIAS under two realistic data centre traffic workloads
(§V-B). For short flows, decoding latency is minimised by the
combination of the systematic nature of RaptorQ codes and
MLFQ; even in a 70% loaded network, decoding was needed
for only 9.6% of short flows. This percentage was less than 1%
in a 50% congested network (§V-G). The network overhead in-
duced by RaptorQ codes is negligible compared to the benefits
of supporting one-to-many and many-to-one communication.
Only 1% network overhead was introduced when the network was very heavily congested (§V-H). RaptorQ codes have been shown to perform exceptionally well, in terms of encoding/decoding rates, even on a single core. We therefore expect that, with hardware offloading in combination with SCDP’s block pipelining mechanism (§IV-F), the required computational overhead will not be significant.
³ SCDP builds on our early work on integrating fountain coding into data transport protocols [49]–[51]. In [50] we motivated the need for a novel data transport mechanism to efficiently support one-to-many and many-to-one communication and argued that rateless coding is the way forward. In [49], we introduced an early version of SCDP to the research community.
II. RAPTORQ ENCODING AND DECODING
Encoding. RaptorQ codes are rateless and systematic. The
input to the encoder is one or more source blocks; for each one
of these source blocks, the encoder creates a potentially very
large number of encoding symbols (rateless coding). All K
source symbols (i.e. the original fragments of a source block)
are amongst the set of encoding symbols (systematic coding).
All other symbols are called repair symbols. Senders initially
send source symbols, followed by repair symbols, if needed.
Decoding. A source block can be decoded after receiving
a number of symbols that must be equal to or larger than
the number of source symbols; all symbols contribute to
the decoding process equally. In a lossless communication
scenario, decoding is not required, because all source symbols
are available (systematic coding).
Performance. In the absence of loss, RaptorQ codes do
not incur any network or computational overhead. When loss occurs, the trade-off associated with RaptorQ codes involves (1) some minimal network overhead to enable successful decoding of the original fragments and (2) computational overhead for decoding the received symbols into the original fragments. RaptorQ codes behave exceptionally well in both respects. With two extra encoding symbols (compared to the number of original fragments), the decoding failure probability is on the order of 10⁻⁶. It is important to note that
decoding failure is not fatal; instead one or more encoding
symbols can be requested in order to ensure that decoding is
successful [53]. The time complexity of RaptorQ encoding
and decoding is linear in the number of source symbols.
RaptorQ codes support excellent performance for all block
sizes, including very small ones, which is very important for
building a general-purpose data transport protocol that is able
to handle efficiently a diverse set of workloads. In [55], [56],
the authors report encoding and decoding speeds of over 10
Gbps using a RaptorQ software prototype running on a single
core. With hardware offloading, RaptorQ codes would be able
to support data transport at line speeds in modern data centre
deployments. On top of that, multiple blocks can be decoded
in parallel, independently of each other (e.g. on different
cores). Decoding small source blocks is even faster, as reported
in [55]. Decoding performance depends neither on the order in which symbols arrive nor on which particular symbols are received.
Example. Before explaining how RaptorQ codes are integrated
in SCDP, we present a simple example of point-to-point
communication between two hosts, which is illustrated in
Figure 2⁴. On the sender side, a single source block is passed to the encoder, which fragments it into 8 equal-sized source symbols S1, S2, ..., S8. The encoder uses the source symbols to generate repair symbols Sa, Sb and Sc (here, the decision to encode 3 repair symbols is arbitrary). Encoding symbols are transmitted to the network, along with the respective encoding symbol identifiers (ESI) and source block numbers (SBN) [52]. As shown in Figure 2, symbols S4 and Sb are lost. Symbols take different paths in the network, but this is transparent to the receiver, which only needs to collect a specific amount of encoding symbols (source and/or repair). The receiver can receive symbols from multiple senders and over different network interfaces. In this example, the receiver attempts to decode the original source block upon receiving 9 symbols, i.e. with one extra symbol, which is network overhead (as shown in Figure 2). Decoding is successful and the source block is passed to the receiver application. As mentioned above, if no loss had occurred, there would be no need for decoding and the data would have been passed directly to the application.
⁴ Note that Figure 2 does not illustrate SCDP’s underlying mechanisms. The design of SCDP is discussed extensively in Section IV.
[Figure 2: the sender’s encoder fragments a source block into source symbols 1–8 and generates repair symbols a–c; the encoding symbols traverse different network paths, symbols 4 and b are lost, and the receiver’s decoder reconstructs the source block from the received symbols, one of which counts as overhead.]
Fig. 2: RaptorQ-based communication
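To make the bookkeeping in this example concrete, the short sketch below (illustrative Python with hypothetical names; it is not part of SCDP or of any RaptorQ implementation) replays the Figure 2 scenario: it checks whether all source symbols arrived, in which case decoding is skipped thanks to systematic coding, and how many of the received symbols count as overhead.

# Bookkeeping sketch of the Figure 2 example (illustrative, not SCDP code).
K = 8                                # source symbols S1..S8 (ESIs 0..7)
sent_esis = list(range(K + 3))       # 8 source symbols plus 3 repair symbols (Sa, Sb, Sc)
lost_esis = {3, 9}                   # S4 (ESI 3) and Sb (ESI 9) are lost in the network

received = [esi for esi in sent_esis if esi not in lost_esis]

all_source_present = all(esi in received for esi in range(K))   # systematic shortcut possible?
enough_to_decode = len(received) >= K                           # RaptorQ needs at least K symbols
overhead = len(received) - K                                    # symbols beyond the minimum

print(all_source_present)   # False: S4 was lost, so decoding is required
print(enough_to_decode)     # True: 9 >= 8, so decoding succeeds with high probability
print(overhead)             # 1: one extra symbol of network overhead, as in Figure 2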
Erasure coding in data transport. There is a long and
interesting trail of research that integrates erasure coding into
data transport protocols. SCDP is unique compared to all these
works, efficiently supporting one-to-many and many-to-one
data transport sessions for distributed storage and numerous
other workloads prevalent in modern data centres, without
sacrificing performance for traditional short and long flows.
In [57], the authors explore the advantages and challenges
of integrating end-to-end coding into TCP. Corrective [58]
employs coding for faster loss recovery but it can only deal
with one packet loss in one window as its coding redun-
dancy is fixed. FMTCP [38] employs fountain coding to
improve the performance of MPTCP [35] by recovering data
over multiple subflows. LTTP [42] is a UDP-based transport
protocol that uses fountain codes to mitigate Incast in data
centres. CAPS [59] deals with out-of-order data by applying forward error correction to short flows, in order to reduce their
flow completion time, and employs ECMP for achieving high
throughput for long flows. RC-UDP [60] is a rateless coding
data transport protocol that enables reliable data transfer
over high bandwidth networks. It uses block-by-block flow
control where the sender keeps sending encoded symbols until
the receiver sends an acknowledgement indicating successful
decoding. PPUSH [61] is a multi-source data delivery protocol
that employs RaptorQ codes for sending multiple flows in
parallel using all available replicas.
III. THE CASE FOR RAPTORQ CODING IN DATA
TRANSPORT FOR DATA CENTRE NETWORKS
The starting point in designing SCDP, which is also the
key differentiator to the rest of the literature, is its efficient
handling of one-to-many and many-to-one communication,
without sacrificing performance for traditional unicast flows.
One-to-many communication. None of the existing data
transport protocols for data centres can support communication
beyond traditional unicast flows, even if network-level multicasting were deployed in the network. Congestion control in
reliable multicasting is a challenging problem and traditional
sender-driven, reliable multicasting approaches (e.g. as in [62],
[63]) would suffer from Incast [41], and lack of support
for multipath routing and multi-homed servers, as well as
their inability to spray packets in the network. A receiver-
driven approach would be more suitable. However, extending approaches such as NDP [26] or Homa [32] is far from trivial, as this would entail complications with flow control when losses occur, because lost packets must be retransmitted.
Senders would have to maintain state, enqueuing incoming
pull requests by multiple receivers, while waiting to multicast
a new packet or retransmit a lost packet. Equally importantly,
the slowest receiver would slow down all other receivers.⁵
With RaptorQ codes and receiver-driven flow control, one-
to-many communication is simple and efficient: a sender
multicasts a new symbol after receiving a pull request from
all receivers (see Section IV-D for a detailed description). A
sender does not need to remember which symbols it has sent
as there is no notion of retransmission. Instead, it only needs to
count the number of pending pull requests from each receiver
so it can ‘clock’ symbol sending. A receiver can decode the
original data and complete the session after it receives the
necessary amount of symbols (see Section II), independently
of other receivers that may be behind in terms of receiving
symbols because of network congestion (e.g. when they are
connected to a congested ToR switch).
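A minimal sketch of this pull-clocking logic is given below; the class and method names are illustrative, and the initial window push and the actual multicast transmission are omitted (Section IV-D describes the full mechanism).

# Illustrative sketch of a pull-clocked one-to-many sender (not SCDP's actual code).
class OneToManySender:
    def __init__(self, receivers):
        self.pending_pulls = {r: 0 for r in receivers}   # pending pull requests per receiver
        self.next_esi = 0                                # next encoding symbol to multicast

    def on_pull(self, receiver):
        if receiver in self.pending_pulls:
            self.pending_pulls[receiver] += 1
        self._try_send()

    def on_receiver_done(self, receiver):
        # A receiver that has decoded the source block no longer clocks the sender.
        self.pending_pulls.pop(receiver, None)
        self._try_send()

    def _try_send(self):
        # Multicast a new symbol only once every active receiver has an outstanding
        # pull; no retransmission state is needed because any symbol is useful.
        while self.pending_pulls and all(c > 0 for c in self.pending_pulls.values()):
            self.multicast_symbol(self.next_esi)
            self.next_esi += 1
            for r in self.pending_pulls:
                self.pending_pulls[r] -= 1

    def multicast_symbol(self, esi):
        print(f"multicast symbol ESI={esi}")             # placeholder for the real send path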
Many-to-one communication. Existing protocols do not and
could not support many-to-one communication in a way
that benefits the overall performance. Even if senders were
instructed to only send a subset of the original data fragments
(emulating many-to-one communication), a congested or slow
server would always be the bottleneck for the whole session.
With RaptorQ codes, each sender contributes as much as
it can, given the current conditions, in terms of network
congestion and local load. The rateless nature of RaptorQ codes enables receivers to successfully decode a source block regardless of which server sent the symbols. The only requirement is to receive the required number of symbols (see Section
II). This is a unique characteristic of SCDP (see Section IV-E),
which ‘bypasses’ network hotspots by having non-congested
servers contributing more symbols to the receiver. Crucially,
this is done without any central coordination.
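The sketch below (a hypothetical helper, not SCDP’s implementation) illustrates this property: symbols are collected from any combination of senders, and the session can complete as soon as enough distinct symbols have arrived, however unevenly the individual replica servers contributed.

# Illustrative many-to-one symbol collection: the origin of a symbol is irrelevant;
# only the number of distinct symbols matters.
def collect_until_decodable(symbol_stream, K):
    """symbol_stream yields (sender_id, esi, payload) tuples from any replica server."""
    received = {}      # esi -> payload
    per_sender = {}    # how many symbols each server ended up contributing
    for sender_id, esi, payload in symbol_stream:
        received.setdefault(esi, payload)                     # duplicates add nothing
        per_sender[sender_id] = per_sender.get(sender_id, 0) + 1
        if len(received) >= K:                                # enough symbols to attempt decoding
            return received, per_sender
    raise RuntimeError("stream ended before enough symbols were received")

A congested or slow replica simply ends up with a smaller count in per_sender; the session still completes once K distinct symbols have arrived from any combination of servers.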
Flow completion time and goodput. SCDP’s benefits dis-
cussed above do not come at a cost for traditional unicast
flows. This is due to the combination of the systematic nature
of RaptorQ codes, MLFQ, and packet trimming. More specif-
ically, FCT for short flows is very small, unaffected by the
introduction of coding because senders first send the original
data fragments (systematic coding) with the highest priority,
minimising loss for them. As a result, decoding is very rarely
required for short flows. SCDP performs exceptionally well
⁵ How existing protocols for data centres could be extended to support one-to-many and many-to-one communication is beyond the scope of this paper.

also for long flows despite the fact that (the otherwise efficient)
decoding is needed more often. This is done by employing
pipelining of source blocks, which alleviates the decoding
overhead for large data blocks and maximises application
goodput (see Section IV-F). In combination with receiver-
driven flow control and packet trimming, SCDP eliminates In-
cast and Outcast, playing well with switches’ shallow buffers.
Network utilisation. SCDP ensures high network utilisation
for all communication modes; with RaptorQ coding there is no
notion of ordering, as all symbols contribute to the decoding
(if needed) of source data. As a result, symbols can be sprayed
in the network through all available paths maximising utilisa-
tion and minimising the formation of hotspots. At the same
time, receivers can receive symbols from different interfaces
naturally enabling multi-homed topologies (e.g. [64], [65]).
IV. SCDP DESIGN
In this section, we present SCDP’s design; we define SCDP’s packet types and the adopted switch model. We then describe all of SCDP’s supported communication modes, and
how we maximise goodput and minimise flow completion time
(FCT) for long and short flows, respectively.
A. Packet Types
SCDP’s packet format is shown in Figure 3. Port numbers
are used to identify a transport session. The type field (TYP
in Figure 3) is used to denote one of the three SCDP packet
types; symbol, header and pull (denoted as SMBL, HDR and
PULL, respectively, in Algorithms 1 and 2). The priority field
(PRI in Figure 3) is set by the sender and is used by MLFQ
(see Section IV-B).
A symbol packet carries in its payload one MTU-sized
source or repair symbol. The source block number (SBN)
identifies the source block the carried symbol belongs to. The
encoding symbol identifier (ESI) identifies the symbol within
the stream of source and repair symbols for the specific source
block [52]. A sender initiates a transport session by pushing
an initial window of symbols with the syn flag set, for the
first source block. These symbol packets also carry a number
of options: the transfer mode (M in Figure 3) can be unicast,
many-to-one or one-to-many. The rest of the options are used
to define the total length of the session (F in Figure 3), number
of source blocks (Z in Figure 3) and the symbol size (T in
Figure 3). The source block size K is derived from these
options as described in RaptorQ RFC [52]. We adopt the
notation used in this RFC [52].
Header packets are trimmed versions of symbol packets.
Upon receiving a symbol packet that cannot be buffered, a
network switch trims its payload and forwards the header, with
the highest priority. Header packets are used to ensure that a
window (w) of symbol packets is always in-flight.
A pull packet is sent by a receiver to request a symbol. The
sequence number is only used to indicate how many symbols of the specified source block to send, in case pull
requests get reordered. Multiple symbol packets may be sent
in response to a single pull request, as described in Section
IV-C. The fin flag is used to identify the last pull request; upon
receiving such a pull request, a sender sends the last symbol
packet for this SCDP session.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Block Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Encoded Symbol Identifier / Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| P | T |S|F| |
| R | Y |Y|I| |
| I | P |N|N| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options{M,F,T,Z} |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| payload |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Fig. 3: SCDP packet format
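For illustration, the following dataclass mirrors the header fields of Figure 3 in memory; it is a sketch for exposition under assumed field types, not a wire-format codec.

# Illustrative in-memory view of the SCDP header of Figure 3 (not a wire codec).
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class PacketType(Enum):
    SMBL = 0   # symbol packet: carries one MTU-sized source or repair symbol
    HDR = 1    # header packet: a symbol packet trimmed by a switch
    PULL = 2   # pull packet: receiver's request for further symbols

@dataclass
class ScdpPacket:
    src_port: int
    dst_port: int
    sbn: int                    # source block number
    esi_or_seq: int             # ESI for symbol/header packets, sequence number for pulls
    priority: int               # PRI: set by the sender, used by MLFQ switches
    typ: PacketType             # TYP
    syn: bool = False           # set on the initial window of symbol packets
    fin: bool = False           # set on the last pull request of a session
    options: dict = field(default_factory=dict)   # {M: mode, F: length, Z: #blocks, T: symbol size}
    payload: Optional[bytes] = None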
B. Switch Service Model
SCDP relies on network switching functionality that is
either readily available in today’s data centre networks [32] or
is expected to be [26] when P4 switches are widely deployed.
SCDP does not require any more switch functionality than
NDP [26]⁶, Homa [32], QJUMP [47], or PIAS [27] do.
Priority scheduling and packet trimming. In order to support
latency-sensitive flows, we employ MLFQ [27], and packet
trimming [54]. We assume that network switches support a
small number of queues with respective priority levels. The top
priority queue is only used for header and pull packets. This is
crucial for swiftly providing feedback to receivers about loss.
Given that both types of packets are very small, it is extremely
unlikely that the respective queue gets full and that they are
dropped⁷. The rest of the queues are small and buffer symbol
packets. Switches perform weighted round-robin scheduling
between the top-priority (header/pull) queue and the symbol
packet queues. This guards against a congestion collapse
situation, where a switch only forwards trimmed headers and
all symbol packets are trimmed to headers. When a data packet
is to be transmitted, the switch selects the head packet from
the highest priority, non-empty queue.
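The per-port sketch below illustrates this switch model, reusing the PacketType and ScdpPacket definitions from the packet sketch above; the number of symbol queues, the queue depth and the control-queue burst size are illustrative values, and the weighted round-robin is reduced to a crude burst limit.

# Illustrative per-port queueing with priority scheduling and packet trimming.
from collections import deque

CONTROL = 0                        # top-priority queue: header and pull packets only
SYMBOL_QUEUES = [1, 2, 3]          # lower-priority queues for symbol packets
SYMBOL_QUEUE_LEN = 8               # shallow per-queue buffer (in packets)
CONTROL_BURST = 4                  # crude stand-in for the weighted round-robin weight

class ScdpSwitchPort:
    def __init__(self):
        self.queues = {q: deque() for q in [CONTROL] + SYMBOL_QUEUES}
        self.control_streak = 0

    def enqueue(self, pkt):
        if pkt.typ in (PacketType.HDR, PacketType.PULL):
            self.queues[CONTROL].append(pkt)              # tiny packets, rarely dropped
            return
        q = 1 + min(pkt.priority, len(SYMBOL_QUEUES) - 1) # sender-set priority picks the queue
        if len(self.queues[q]) < SYMBOL_QUEUE_LEN:
            self.queues[q].append(pkt)
        else:
            pkt.payload = None                            # trim: drop the payload...
            pkt.typ = PacketType.HDR                      # ...and promote the header
            self.queues[CONTROL].append(pkt)

    def dequeue(self):
        # After CONTROL_BURST control packets in a row, serve a symbol packet if one
        # is queued, so trimmed headers cannot starve symbols (congestion collapse).
        if self.queues[CONTROL] and self.control_streak < CONTROL_BURST:
            self.control_streak += 1
            return self.queues[CONTROL].popleft()
        self.control_streak = 0
        for q in SYMBOL_QUEUES:                           # highest-priority non-empty queue
            if self.queues[q]:
                return self.queues[q].popleft()
        if self.queues[CONTROL]:
            self.control_streak += 1
            return self.queues[CONTROL].popleft()
        return None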
Multipath routing. SCDP packets are sprayed to all available
equal-cost paths to the destination⁸ in the network. SCDP
relies on ECMP and spraying could be done either by using
randomised source ports [34], or the ESI of symbol and header
packets and the sequence number of pull packets.
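Conceptually, the per-packet path choice could look like the sketch below (illustrative only; in practice the spraying would be realised through the switches’ ECMP hashing, e.g. by randomising source ports), with the ESI or pull sequence number feeding the hash so that consecutive packets of a session take different paths.

# Illustrative per-packet spraying across equal-cost paths.
import zlib

def pick_path(pkt, num_paths):
    # Hash the ESI (symbol/header packets) or sequence number (pull packets)
    # together with the session identifiers into a path index.
    key = (pkt.src_port, pkt.dst_port, pkt.sbn, pkt.esi_or_seq)
    return zlib.crc32(repr(key).encode()) % num_paths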
C. Unicast Transport Sessions
A sender implicitly opens a unicast SCDP transport session
by pushing an initial window of w (syn-enabled) symbol
packets tagged with the highest priority (Lines 2–12 in Algorithm 1⁹). Senders tag outgoing symbol packets with a
priority value, which is used by the switches when scheduling
their transmission (§IV-B). The priority of outgoing symbol
packets is gradually degraded when specific thresholds are
reached. Calculating these thresholds can be done as in PIAS
[27] or AuTO [39] (Line 30 in Algorithm 1). The receiver
establishes a new session upon receiving the first symbol that
⁶ As reported in [66], there is ongoing work by switch vendors to implement the NDP switch. Moreover, a smartNIC implementation of the NDP end-host stack is also ongoing. This is very promising for the deployability of next-generation protocols, including SCDP, in the real world.
⁷ SCDP receivers employ a simple timeout mechanism, as in [26], to recover from the unlikely losses of pull and header packets.
⁸ In SCDP’s one-to-many transfer mode there are multiple destinations.
⁹ For clarity, Algorithms 1 and 2 illustrate a slightly simplified version of SCDP for unicast data transport, for a single source block, without pipelining.
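As an illustration of the sender-side tagging described above, the sketch below demotes a session’s priority as more bytes are sent; the threshold values are placeholders and not the PIAS/AuTO-derived thresholds referred to in the text.

# Illustrative priority demotion for outgoing symbol packets (placeholder thresholds).
DEMOTION_THRESHOLDS = [100 * 1024, 1024 * 1024, 10 * 1024 * 1024]   # bytes sent so far

def priority_for(bytes_sent):
    """Return the MLFQ priority for the next symbol packet (0 = highest)."""
    for level, threshold in enumerate(DEMOTION_THRESHOLDS):
        if bytes_sent < threshold:
            return level
    return len(DEMOTION_THRESHOLDS)   # lowest priority for the longest sessions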
