
Data Center Networking with Multipath TCP
Costin Raiciu, Christopher Pluntke, Sebastien Barre, Adam Greenhalgh,
Damon Wischik, Mark Handley
University College London, Université Catholique de Louvain
ABSTRACT
Recently new data center topologies have been proposed that
offer higher aggregate bandwidth and location independence
by creating multiple paths in the core of the network. To ef-
fectively use this bandwidth requires ensuring different flows
take different paths, which poses a challenge.
Plainly put, there is a mismatch between single-path trans-
port and the multitude of available network paths. We pro-
pose a natural evolution of data center transport from TCP
to multipath TCP. We show that multipath TCP can effec-
tively and seamlessly use available bandwidth, providing im-
proved throughput and better fairness in these new topolo-
gies when compared to single path TCP and randomized
flow-level load balancing. We also show that multipath TCP
outperforms laggy centralized flow scheduling without need-
ing centralized control or additional infrastructure.
Categories and Subject Descriptors
C.2.2 [Computer-Communication Networks]: Network Protocols
Keywords: Multipath TCP, Data Center Networks
1. INTRODUCTION
Recent growth in cloud applications from companies such
as Google, Microsoft, and Amazon has resulted in the con-
struction of data centers of unprecedented size. These appli-
cations are written to be distributed across machines num-
bering in the tens of thousands, but in so doing, they stress
the networking fabric within the data center: distributed file
systems such as GFS transfer huge quantities of data be-
tween end-systems (a point-to-point traffic pattern) while data
processing applications such as MapReduce, BigTable or
Dryad shuffle a significant amount of data between many
machines. To allow maximum flexibility when rolling out
new applications, it is important that any machine can play
any role without creating hot-spots in the network fabric.
Data center networking has become a focus of attention recently; in part this is because data centers are now important
enough to be considered as special cases in their own right,
but perhaps equally importantly, they are one of the few
cases where researchers can dictate both the physical topol-
ogy and the routing of traffic simultaneously. New topolo-
gies such as FatTree[1] and VL2[5] propose much denser
interconnects than have traditionally been implemented, so
as to allow operators to deploy application functionality in
a location independent manner. However, while such dense
interconnects can in principle support the full cross-sectional
bandwidth of every host communicating flat out simultane-
ously, the denseness of interconnection poses a difficult chal-
lenge for routing. How can we ensure that no matter the traf-
fic pattern, the load is distributed between the many possible
parallel paths as evenly as possible?
The current wisdom seems to be to use randomised load
balancing (RLB) to randomly choose a path for each flow
from among the possible parallel paths. However, RLB can-
not achieve full bisectional bandwidth because some flows
will randomly choose the same path while other links ran-
domly fail to be selected. Thus, RLB tends to be supple-
mented by centralized flow-scheduling for large flows.
In this paper we propose an alternative and simpler ap-
proach; the end systems in the data center should simply
use multipath TCP (MPTCP), as currently under consider-
ation in the IETF[4], to utilize multiple parallel paths for
each TCP connection. The great advantage of this approach
is that the linked congestion controller in each MPTCP end
system can act on very short timescales to move its own traf-
fic from paths it observes to be more congested, onto paths
it observes to be less congested. Theory suggests that such
behavior can be stable, and can also serve to load balance
the entire network.
We evaluate how effective MPTCP is in comparison to
alternative scheduling mechanisms across a range of differ-
ent proposed data center topologies. We use a combination
of large scale simulation and smaller scale data center ex-
perimentation for evaluation. Our conclusion is that for all
the workloads and topologies we considered, MPTCP either
matches or in many cases exceeds the performance a central-
ized scheduler can achieve, and is more robust.
Further, we show that single-path TCP cannot fully utilize
capacity for certain topologies and traffic matrices, while
multipath can. There is a close connection between topol-
ogy, path selection, and transport in data centers; this hints
at possible benefits from designing topologies for MPTCP.

2. DATA CENTER NETWORKING
From a high-level perspective, there are four main com-
ponents to a data center networking architecture:
Physical topology
Routing over the topology
Selection between the paths supplied by routing
Congestion control of traffic on the selected paths
These are not independent; the performance of one will
depend on the choices made by those preceding it in the list,
and in some cases by those after it in the list. We will discuss
each in turn, but it is worth noting now that MPTCP spans
both path selection and congestion control, which is why it
is able to offer benefits that cannot otherwise be obtained.
2.1 Topology
Traditionally data centers have been built using hierar-
chical topologies: racks of hosts connect to a top-of-rack
switch; these switches connect to aggregation switches; in
turn these are connected to a core switch. Such topologies
make sense if most of the traffic flows into or out of the data
center. However, if most of the traffic is intra-datacenter, as
is increasingly the trend, then there is a very uneven distri-
bution of bandwidth. Unless traffic is localized to racks, the
higher levels of the topology become a serious bottleneck.
Recent proposals address these limitations. VL2 and Fat-
Tree are Clos[3] topologies that use multiple core switches
to provide full bandwidth between any pair of hosts in the
network. They differ in that FatTree uses larger quantities of
lower speed (1Gb/s) links between switches, whereas VL2
uses fewer faster (10Gb/s) links. In contrast, in BCube[6],
the hierarchy is abandoned in favor of a hypercube-like topol-
ogy, using hosts themselves to relay traffic.
All three proposals solve the traffic concentration prob-
lem at the physical level: there is enough capacity for ev-
ery host to be able to transmit flat-out to another randomly
chosen host. However the denseness of interconnection they
provide poses its own problems when it comes to determin-
ing how traffic should be routed.
2.2 Routing
Dense interconnection topologies provide many possible
parallel paths between each pair of hosts. We cannot ex-
pect the host itself to know which of these paths is the least
loaded, so the routing system itself must spread traffic across
these paths. The simplest solution is to use randomized load
balancing, where each flow is assigned a random path from
the set of possible paths.
In practice there are multiple ways to implement random-
ized load balancing in today’s switches. For example, if each
switch uses a link-state routing protocol to provide ECMP
forwarding then, based on a hash of the five-tuple in each
packet, flows will be split roughly equally across equal length
paths. VL2 provides just such a mechanism over a virtual
layer 2 infrastructure.
However, in topologies such as BCube, paths vary in length,
and simple ECMP cannot access many of these paths be-
cause it only hashes between the shortest paths. A simple
alternative is to use multiple static VLANs to provide mul-
tiple paths that expose all the underlying network paths[8].
Either the host or the first hop switch can then hash the five-
tuple to determine which path is used.
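As a rough illustration of flow-level hashing (a hypothetical sketch, not any particular switch's implementation), the fragment below maps a flow's five-tuple onto one of the available parallel paths; the function name and hash are illustrative only:

    import hashlib

    def select_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
        # map the flow's five-tuple onto one of num_paths parallel paths;
        # the hash itself is illustrative, not what any real switch uses
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        digest = hashlib.sha1(key).digest()
        return int.from_bytes(digest[:4], "big") % num_paths

    # every packet of a flow hashes to the same path (so no reordering), but two
    # unrelated flows can still collide on one path, which is the root of the
    # hot-spot problem discussed in Section 2.3
    select_path("10.0.1.2", "10.0.3.4", 45678, 80, 6, num_paths=8)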
In our simulations, we do not model dynamic routing; in-
stead we assume that all the paths between a pair of end-
points are available for selection, whatever mechanism actu-
ally does the selection. For our experiments in Section 4, we
use the VLAN-based routing solution.
2.3 Path Selection
Solutions such as ECMP or multiple VLANs provide the
basis for randomised load balancing as the default path se-
lection mechanism. However, as others have shown, ran-
domised load balancing cannot achieve the full cross-sectional
bandwidth in most topologies, nor is it especially fair. The
problem, quite simply, is that often a random selection causes
hot-spots to develop, where an unlucky combination of ran-
dom path selection causes a few links to be overloaded and
links elsewhere to have little or no load.
To address these issues, the use of a centralized flow sched-
uler has been proposed. Large flows are assigned to lightly
loaded paths and existing flows may be reassigned to maxi-
mize overall throughput[2]. The scheduler does a good job
if flows are network-limited, with exponentially distributed
sizes and Poisson arrivals, as shown in Hedera [2]. The in-
tuition is that if we only schedule the big flows we can fully
utilize all the bandwidth, and yet have a small scheduling
cost, as dictated by the small number of flows.
However, data center traffic analysis shows that flow dis-
tributions are not Pareto distributed [5]. In such cases, the
scheduler has to run frequently (100ms or faster) to keep up
with the flow arrivals. Yet, the scheduler is fundamentally
limited in its reaction time as it has to retrieve statistics, com-
pute placements and instantiate them, all in this scheduling
period. We show through simulation that a scheduler run-
ning every 500ms has similar performance to randomised
load balancing when these assumptions do not hold.
2.4 Congestion Control
Most applications use singlepath TCP, and inherit TCP’s
congestion control mechanism which does a fair job of match-
ing offered load to available capacity on whichever path was
selected. Recent research has shown there are benefits from
tuning TCP for data center use, such as by reducing the min-
imum retransmit timeout[10], but the problem TCP solves
remains unchanged.
In proposing the use of MPTCP, we change the partition-
ing of the problem. MPTCP can establish multiple subflows
across different paths between the same pair of endpoints for
a single TCP connection. The key point is that by linking
the congestion control dynamics on these multiple subflows,

Figure 1: Throughput (% of optimal) for long running connections using a permutation traffic matrix, for RLB and varying numbers of MPTCP subflows: (a) FatTree (8192 hosts); (b) VL2 (11520 hosts); (c) BCube (1024 hosts).
Figure 2: Subflows needed to reach 90% network utilization vs. number of servers, for FatTree, VL2 and BCube.
MPTCP can explicitly move traffic away from the more con-
gested paths and place it on the less congested paths.
The algorithm currently under discussion in the IETF is
called “linked increases” because the slope of the additive in-
crease part of the TCP sawtooth is determined by that flow’s
fraction of the total window of traffic in flight. The faster
a flow goes, the larger its fraction of the total, and so the
faster it can increase. This algorithm makes MPTCP incre-
mentally deployable, as it is designed to be fair to competing
singlepath TCP traffic, unlike simply running multiple reg-
ular TCP flows between the same endpoints. In addition it
moves more traffic off congested paths than multiple regular
TCP flows would.
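A minimal sketch of this coupling, assuming per-ACK window updates and ignoring the refinements of the actual IETF algorithm: each subflow's additive increase is divided by the total window across all subflows, so a subflow's per-RTT increase equals its share of the traffic in flight, while the multiplicative decrease stays per-subflow. All names here are illustrative.

    class MPTCPConnection:
        def __init__(self, num_subflows, init_cwnd=2.0):
            self.cwnd = [init_cwnd] * num_subflows   # one window (in packets) per subflow

        def on_ack(self, i):
            # per ACK on subflow i, increase by 1/total instead of TCP's 1/cwnd[i]:
            # the per-RTT increase is then cwnd[i]/total, i.e. the slope of the
            # additive increase is this subflow's share of the traffic in flight
            self.cwnd[i] += 1.0 / sum(self.cwnd)

        def on_loss(self, i):
            # the multiplicative decrease stays per-subflow, as in regular TCP,
            # so a subflow on a congested path shrinks and carries less traffic there
            self.cwnd[i] = max(1.0, self.cwnd[i] / 2.0)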
Our hypothesis is that given sufficiently many randomly
chosen paths, MPTCP will find at least one good unloaded
path, and move most of its traffic that way. In so doing it will
relieve congestion on links that got more than their fair share
of RLB-balanced flows. This in turn will allow those com-
peting flows to achieve their full potential, maximizing the
cross-sectional bandwidth of the network and also improv-
ing fairness. Fairness is not an abstract concept for many
distributed applications; for example, when a search appli-
cation is distributed across many machines, the overall com-
pletion time is determined by the slowest machine. Hence
worst-case performance matters significantly.
3. ANALYSIS
To validate our hypothesis, we must examine how MPTCP
performs in a range of topologies and with a varying num-
ber of subflows. We must also show how well it performs
against alternative systems. To perform such an analysis is
itself challenging - we really want to know how well such de-
ployments will perform at large scale with real-world trans-
port protocol implementations and with reasonable traffic
patterns. Lacking a huge data center to play with, we have
to address these issues independently, using different tools:
Flow-level simulation to examine idealized large-scale behavior.
Packet-level simulation to examine more detailed medium-scale behavior.
Real-world implementation to examine practical limitations at small-scale.
3.1 Large scale analysis
First we wish to understand the potential benefits of MPTCP
with respect to the three major topologies in the literature:
FatTree, VL2 and BCube. The baseline for comparison is
randomized load balancing (RLB) using singlepath TCP.
MPTCP adds additional randomly chosen paths, but then the
linked congestion control moves the traffic within each con-
nection to the least congested subflows.
We use an iterative flow-level simulator to analyze topolo-
gies of up to 10,000 servers (the exact number is determined
by the need for a regular topology). In each iteration the simulator
computes the loss rates for each link based on the offered
load, and adjusts the load accordingly. When the offered
load and loss rate stabilize, the simulator finishes. This sim-
ulator does not model flow startup behavior and other packet
level effects, but is scalable to very large topologies.
Fig. 1 shows the total throughput of all flows when we
use a random permutation matrix where each host sends flat
out (as determined by the TCP response function) to a sin-
gle other host. In all three topologies, with the right path
selection this traffic pattern should just be able to load the
network to full capacity but no more.
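The sketch below is our own illustration of the general shape of such a flow-level model, not the simulator used for these results: build a random permutation traffic matrix, then iterate between per-link loss rates and per-flow rates given by a simple TCP response function until they settle. The link capacity, RTT, and loss model are placeholder assumptions.

    import math, random

    def permutation_traffic(n):
        # each host sends flat out to exactly one other host (no self-flows)
        dst = list(range(n))
        while any(i == d for i, d in enumerate(dst)):
            random.shuffle(dst)
        return dst            # dst[i] is the receiver for host i

    def tcp_rate(p, mss_bits=12000, rtt=1e-4, cap=1e9):
        # simplified TCP response function, capped at the 1 Gb/s interface speed
        return cap if p <= 0 else min(cap, (mss_bits / rtt) * math.sqrt(1.5 / p))

    def flow_level_fixed_point(flow_paths, link_cap=1e9, iters=50):
        # flow_paths: one routed path (a list of link ids) per flow; how the
        # permutation above maps to paths is topology-specific and omitted here
        rate = [link_cap] * len(flow_paths)
        for _ in range(iters):
            load = {}
            for f, path in enumerate(flow_paths):
                for l in path:
                    load[l] = load.get(l, 0.0) + rate[f]
            # loss grows with overload on each link; a flow sees the combined
            # loss of the links along its path, which sets its new rate
            loss = {l: max(0.0, (v - link_cap) / v) for l, v in load.items()}
            rate = [tcp_rate(1.0 - math.prod(1.0 - loss[l] for l in path))
                    for path in flow_paths]
        return rate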
What we observe is that RLB is unable to fill any of these
networks. It performs best in the VL2 topology, where it
achieves 77% throughput, but performs much worse in Fat-
Tree and BCube. The intuition is simple: to achieve 100%
capacity with RLB, no two flows should ever traverse the
same link. Obviously RLB cannot do this, but how badly
it suffers depends on how overloaded links become. With
FatTree, when two TCP flows that could potentially send at
1Gb/s end up on the same 1Gb/s link, each backs off by 50%,
leaving other links underutilized. With VL2, when eleven
1Gb/s TCP flows end up on the same 10Gb/s link, the effect
is much less drastic, hence the reduced performance penalty.
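A quick back-of-the-envelope check of this intuition, assuming the collided link is shared equally:

    # FatTree: two flows that could each send at 1 Gb/s collide on one 1 Gb/s link
    fattree_share = 1.0 / 2           # 0.5 Gb/s each, i.e. 50% of their potential
    # VL2: eleven 1 Gb/s-limited flows collide on one 10 Gb/s core link
    vl2_share = min(1.0, 10.0 / 11)   # ~0.91 Gb/s each, i.e. only a ~9% penalty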
The benefits of MPTCP are clear; as additional subflows
are added, the overall throughput increases. How many sub-
flows are needed depends on the topology; the intuition is
as before - we need enough subflows to overcome the traf-
fic concentration effects of the random path allocation. One
might think that the power of two choices[7] might apply
here, providing good load balancing with very few subflows.
However it does not because the paths are not disjoint. Each

Figure 3: Minimum flow throughput ("Min Rate") and Jain fairness index for the flows in Fig. 1: (a) FatTree and (b) VL2 as % of optimal; (c) BCube as % of interface rate.
Figure 4: Throughput (% of max) for the First Fit centralized scheduler at scheduling intervals of 1s, 500ms, 100ms and 10ms, compared to RLB and MPTCP.
subflow can encounter a congested bottleneck on a single
link along its path, causing the other links along the path to
be underutilized. Although such bottleneck links are load-
balanced, with FatTree in particular, other links cannot be
fully utilized, and it takes more than two subflows to spread
load across sufficient paths to fully utilize the network.
This raises the question of how the number of subflows
needed scales with the size of the network. We chose an ar-
bitrary utilization target of 90% of the cross sectional band-
width. For different network sizes we then progressively in-
creased the number of subflows used. Fig. 2 shows the min-
imum number of subflows that can achieve 90% utilization
for each size of network. The result is encouraging: be-
yond a certain size, the number of subflows needed does not
increase significantly with network size. For VL2, two sub-
flows are needed. For FatTree, eight are needed. This might
seem like quite a high number, but for an 8192-node FatTree
network there are 256 distinct paths between each host pair,
so only a small fraction of the paths are needed to achieve
full utilization. From the host point of view, eight subflows
is not a great overhead.
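These numbers are consistent with the standard k-ary FatTree construction (k^3/4 hosts and (k/2)^2 core switches, hence (k/2)^2 distinct shortest paths between hosts in different pods); a small check:

    k = 32                        # an 8192-host FatTree is built from 32-port switches
    hosts = k ** 3 // 4           # = 8192 hosts
    core_paths = (k // 2) ** 2    # = 256 distinct shortest paths between two pods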
We also care that the capacity is allocated fairly between
connections, especially for applications where the final re-
sult can only be returned when the last node running a part
of a calculation returns its results. Fig. 3 shows the through-
put of the lowest speed flow (as a percentage of what should
be achievable) and Jain’s fairness index for the three topolo-
gies. Multipath always improves fairness, even for the VL2
topology which performed relatively well if we only exam-
ine throughput.
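Jain's fairness index used here is the standard definition: for throughputs x_1, ..., x_n it is (sum x_i)^2 / (n * sum x_i^2), equal to 1 when all flows receive the same rate and 1/n when a single flow gets everything. A minimal helper:

    def jain_index(rates):
        # 1.0 = perfectly fair; 1/n = a single flow receives all the capacity
        n, s = len(rates), sum(rates)
        return (s * s) / (n * sum(r * r for r in rates)) if s > 0 else 1.0

    jain_index([1.0, 1.0, 1.0, 0.1])   # ~0.80: one starved flow drags the index down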
We have also run experiments in our packet-level simu-
lator with a wide range of load levels. At very light load,
there are few collisions, so MPTCP gives little benefit over
RLB on FatTree or VL2 topologies. However on BCube,
MPTCP excels because a single flow can use all the host
interfaces simultaneously.
At the other extreme, under overload conditions even RLB
manages to fill the network, but MPTCP still gives better
fairness. Fig. 5 shows the throughput of each individual flow
in just such an overload scenario.
The results above use a permutation traffic matrix, which
is useful as a benchmark because it enables a network de-
signed for full bisection bandwidth to be loaded to 100%
utilization with the right traffic distribution scheme. In prac-
tice less regular traffic and both lighter and heavier loads are
of interest. Fig. 6 shows results when the source and desti-
nation are chosen randomly for varying numbers of flows.
FatTree shows substantial improvements over single-path
RLB, even for very light or very heavy loads. This shows
the performance benefits of MPTCP are robust across a wide
range of conditions.
The improvements for BCube are even greater at lower
traffic loads. This is because BCube hosts have multiple in-
terfaces, and MPTCP can use them all for a single flow - at
light loads the bottlenecks are the hosts themselves.
The results for VL2 were a surprise, given that Fig. 1
shows improvements for this topology with the permutation
matrix. MPTCP gives improvements over RLB of less than
1% for all loads we studied. On closer examination, it turns
out that the host interface is almost always the bottleneck for
VL2. Many flows collide on either the sending or receiving
host, and MPTCP has no path diversity here. The 10Gb/s
links are then not the bottleneck for the remaining flows un-
der these load levels.
3.2 Scheduling and Dynamic Flow Arrivals
With single-path TCP it is clear that RLB does not per-
form sufficiently well unless the topology has been specif-
ically tailored for it, as with VL2. Even with VL2, fluid
simulations show that MPTCP can increase fairness and per-
formance significantly.
RLB however is not the only singlepath path selection
algorithm; Hedera proposes using a centralized scheduler
to supplement RLB, with the goal of explicitly allocating
large flows to paths. Specifically, Hedera flows start off
using RLB, but are measured by the centralized scheduler.
If, during a scheduling period, a flow’s average throughput
is greater than 10% of the interface speed, it is explicitly
scheduled. How well does MPTCP compare with central-
ized scheduling?
This evaluation is more difficult; the performance of a
scheduler can depend on lag in flow measurement, path con-
figuration, and TCP’s response to path reconfiguration. Sim-
ilarly the performance of MPTCP can depend on how quickly

Figure 5: Flow rates for an overloaded FatTree (128 hosts).
Figure 6: Random connections: improvement vs. load.
Figure 7: MPTCP vs. multiple independent TCP flows: (a) network loss rates (mean and max, linked vs. independent, for 1, 2, 4 and 8 subflows); (b) retransmit timeouts (linked vs. independent).
new subflows can slowstart. None of these effects can be
captured in a fluid flow model, so we have to resort to full
packet-level simulation.
For our experiments we modified htsim[9], which was built
from the ground up to support high speeds and large numbers of
flows. It models TCP very similarly to ns2, but performance
is much better and simulation time scales approximately lin-
early with total bandwidth simulated.
For space reasons, we only examine the FatTree topology
with 128 servers and a total maximum bandwidth of 128Gb/s.
We use a permutation traffic matrix with closed loop flow
arrivals (one flow finishes, another different one starts), and
flow sizes distributed according to the VL2 dataset. We mea-
sure throughputs over 20 seconds of simulated time for RLB,
MPTCP (8 subflows), and a centralized scheduler using the
First Fit heuristic, as in Hedera [2] (we chose First Fit be-
cause it runs much faster than the Simulated Annealing heuris-
tic; execution speed is really important to get benefits with
centralized scheduling).
The average bisectional bandwidth achieved is shown in
Fig. 4. Again, MPTCP significantly outperforms RLB. Cen-
tralized scheduler performance depends on how frequently
it is run. In the Hedera paper it is run every 5 seconds. Our
results show it needs to run every 100ms to approach the per-
formance of MPTCP; if it runs only every 500ms
there is little benefit because in the high bandwidth data cen-
ter environment even large flows only take around a second
to complete.
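For reference, a hedged sketch of the First Fit placement step; the data structures (the set of measured large flows, their demand estimates, the candidate path lists) are invented here for illustration, and Hedera's demand estimation and route installation are omitted:

    def first_fit(big_flows, candidate_paths, spare):
        # big_flows: (flow_id, estimated_demand) pairs whose measured rate
        #            exceeded the scheduling threshold in the last period
        # candidate_paths[flow_id]: the parallel paths, each a list of link ids
        # spare[link_id]: spare capacity currently left on each link
        placement = {}
        for flow, demand in big_flows:
            for path in candidate_paths[flow]:
                if all(spare[l] >= demand for l in path):
                    for l in path:
                        spare[l] -= demand
                    placement[flow] = path   # re-route the flow onto this path
                    break                    # first fit: take the first path that fits
        return placement
    # The scheduler re-runs this every scheduling period with fresh measurements,
    # which is where the lag discussed above comes from.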
Host-limited Flows
Hedera’s flow scheduling algorithm is based on the assump-
tion that long-lived flows contribute most of the bytes and
therefore it only needs to schedule those flows. Other flows
are treated as background noise. It also assumes that flows
which it schedules onto unused links are capable of increas-
ing their transmit rate to fill that link.
Both assumptions can be violated by flows which are end-
host limited and so cannot increase their transmission rate.
For example, network bandwidth can easily exceed disk per-
formance for certain workloads. Host-limited flows can be
long lived and transfer a great deal of data, but never exceed
the scheduling threshold. These flows are essentially invis-
ible to the scheduler and can collide with scheduled flows.
Perhaps worse, a host-limited flow might just exceed the
threshold for scheduling and be assigned to an empty path
which it cannot utilize, wasting capacity.
We ran simulations using a permutation matrix where each
host sends two flows; one is host-limited and the other is
not. When the host-limited flows have throughput just below
the 10% scheduling threshold, Hedera’s throughput drops
20%. When the same flows are just above the threshold for
scheduling it costs Hedera 17%.
Scheduling App-Limited Flows
Threshold    Over-Threshold    Under-Threshold
5%           -21%              -22%
10%          -17%              -21%
20%          -22%              -23%
50%          -51%              -45%
The table shows the 10% threshold is a sweet spot; chang-
ing it either causes too few flows to be scheduled, or causes
even more problems when a scheduled flow cannot expand
to fill capacity.
In contrast, MPTCP makes no such assumptions. It re-
sponds correctly to competing host-limited flows, consis-
tently obtaining high throughput.
MPTCP vs. Multiple TCP Connections
Using multiple subflows clearly has significant benefits. How-
ever, MPTCP is not the only possible solution. Could we not
simply use multiple TCP connections in parallel, and stripe
at the application level?
From a network performance point of view, this is equiv-
alent to asking what the effect is of the congestion control
linkage within MPTCP. If, instead of using MPTCP’s “linked
increases” algorithm, we use regular TCP congestion control
independently for each subflow, this will have the same ef-
fect on the network.
To test this, we use again the permutation traffic matrix
and create 20 long running flows from each host. We mea-
sure network loss rates for MPTCP with Linked Increases
and compare against running independent TCP congestion
control on each subflow.
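In terms of the earlier sketch, the only difference between the two schemes is the per-ACK increase rule (illustrative):

    def on_ack_linked(cwnd, i):         # MPTCP "linked increases"
        cwnd[i] += 1.0 / sum(cwnd)      # increase is coupled across subflows

    def on_ack_independent(cwnd, i):    # k parallel regular TCP connections
        cwnd[i] += 1.0 / cwnd[i]        # each subflow increases on its own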
The results in Fig. 7(a) show that MPTCP does not in-
crease network load, as measured by either mean or max loss
rate. In contrast, independent congestion control for each
subflow increases both the mean and maximum loss rates and
causes far more retransmit timeouts (Fig. 7(b)).

References
[1] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM 2008.
[2] M. Al-Fares et al. Hedera: dynamic flow scheduling for data center networks. NSDI 2010.
[3] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 1953.
[5] A. Greenberg et al. VL2: a scalable and flexible data center network. SIGCOMM 2009.
[6] C. Guo et al. BCube: a high performance, server-centric network architecture for modular data centers. SIGCOMM 2009.