Journal ArticleDOI

A scalable, commodity data center network architecture

17 Aug 2008 - Vol. 38, Iss. 4, pp. 63-74
TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.
Abstract: Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance.In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.

Summary (3 min read)

Introduction

  • The role of diet and nutritional factors in brain aging is attracting much research attention.
  • A growing body of evidence links dietary patterns with cognitive abilities in old age.
  • The a priori method makes assumptions about which specific food items constitute a healthy diet based on current nutritional research and does not take into account the complexity of the full diet (Allès et al., 2012).
  • Studies using more comprehensive neuropsychometric tests assessing specific cognitive domains are limited (Akbaraly et al., 2011; Kesse-Guyot et al., 2012).

Study population

  • The study sample was drawn from the Lothian Birth Cohort 1936 Study (LBC1936), an ongoing longitudinal study of cognitive aging, which comprises 1,091 men and women living independently in the community.
  • Almost all participants were residing in Edinburgh and the surrounding Lothian region at recruitment.
  • Food frequency questionnaire (FFQ) data were available for 882 participants.
  • Ethics permission for the LBC1936 study protocol was obtained from the Multi-Centre Research Ethics Committee for Scotland (MREC/01/0/56) and from the Lothian Research Ethics Committee for Scotland (LREC/2003/2/29).

Dietary assessment

  • Dietary patterns were assessed using the Scottish Collaborative Group FFQ version 7.0.
  • The FFQ version 7.0 lists 168 foods or drinks and a common unit or portion size for each item is specified.
  • Response to all items was on a 9-point scale, ranging from “rarely or never” to “7+ per day” in the previous two to three months.
  • All participants (n = 1,091) were asked to complete the FFQ at home and return it by post.
  • Of these questionnaires, 98 were not returned, 26 were returned blank, and 39 had >10 missing items and were excluded from the analyses.

Identification of dietary patterns

  • Dietary factors were previously identified for this sample using principal components analysis (PCA) with varimax orthogonal rotation on all the FFQ items.
  • Four main components were extracted, based on the examination of scree plots, which accounted for 11.67% of the total variance.
  • The “health aware” diet pattern (14 items) was defined by eating more fruits (such as apples, bananas, tinned fruit, oranges, and others) and carrots, and had negative loadings from high consumption of meat products (bacon or gammon, pork or lamb, and sausages), eggs, and spirits or liqueurs.
  • Factor scores were calculated by summing the frequency of consumption multiplied by factor loadings across all food items.

MORAY HOUSE TEST (AGE 11 AND 70)

  • The MHT is a group-administered test of general intelligence.
  • This test was concurrently validated against the Terman–Merrill revision of the Binet scales (SCRE, 1949).
  • SCRE recorded and archived these scores and made them available to the LBC1936 study.
  • Participants re-sat the MHT at a mean age of 70 years.

OTHER COGNITIVE TESTING

  • Three cognitive domains are represented in the LBC1936 cognitive battery: general (g) cognitive ability, processing speed, and memory.
  • A general (g) cognitive ability factor was derived from scores on six Wechsler Adult Intelligence Scale-III UK (WAIS-III) subtests (Wechsler, 1998a), namely Letter-Number Sequencing; Matrix Reasoning; Block Design; Digit Symbol; Digit Span Backwards; and Symbol Search.
  • Verbal ability was assessed using the National Adult Reading Test (NART; Nelson and Willison, 1991) and the Wechsler Test of Adult Reading (WTAR; Holdnack, 2001).
  • The MMSE is a standardized brief screening measure for cognitive pathology (Folstein et al., 1975).

Covariates

  • Covariates included age (in days at the time of testing) and sex.
  • Adult SES was derived from participants’ (or their spouses’) highest reported occupation and classified into one of the following six categories: professional; managerial; skilled nonmanual; skilled manual; semi-skilled; and unskilled (Office of Population Censuses and Surveys, 1980).
  • Raw scores were corrected for age in days at the time of testing and converted into IQ-type scores for the sample (M = 100, SD = 15).
  • Health measures included history of diabetes, stroke, or cardiovascular disease (CVD) (all coded as dichotomous variables, yes/no).

Statistical analysis

  • Analyses were performed using SPSS version 19 (IBM, NY, USA).
  • The authors used one-way analysis of variance for continuous variables, and Chi-square tests (χ2) for categorical variables, to examine the relations between dietary patterns and characteristics of the participants.
  • For these analyses, the authors used a General Linear Model (GLM) approach in a series of models; each subsequent model was adjusted for a different set of covariates.
  • The authors report p-values (p < 0.05 as level of significance was used for all data analyses).
  • Effect size was estimated using partial eta squared (ηp2), defined as the ratio of the variance in the outcome accounted for by an effect to that effect plus its associated error variance, within an ANOVA/GLM design.

Results

  • Table 1 shows the characteristics of study participants in relation to dietary patterns.
  • Generally, controlling for age 11 IQ and occupational social class (models 2, 3, and 4) strongly attenuated most of the associations between the dietary patterns and cognitive scores, and often reduced them to non-significance.
  • Similarly, the negative associations between the “traditional” dietary pattern and cognitive abilities were diminished after adjustment for covariates, particularly age 11 IQ.
  • In model 1, persons with higher scores on the “health aware” foods dietary pattern scored more poorly on a test of age 70 IQ, although the effect size was small (ηp2 = 0.005).

Discussion

  • In this study the authors examined associations between four empirically derived dietary patterns and important domains of cognitive function in a UK sample of men and women aged about 70 years.
  • Before adjustment for childhood IQ and SES, their results suggested that following a “Mediterranean-style” diet was associated with better cognitive function, and following a “traditional” diet was associated with poorer cognitive function on all cognitive domains tested: IQ, general (g) cognitive ability factor, processing speed, memory, and verbal ability.
  • Higher lifetime trait IQ was also found to predict vitamin supplement use rather than supplement use predicting IQ in old age.
  • Prospective studies are important for research on the etiology of cognitive aging; dietary assessment at midlife and/or measures of long-term dietary intake likely reduce the possibility of confounding or reverse causation by factors caused by the disease process in later life.

Strengths and limitations

  • In contrast with the conventional approach, which focuses on a single nutrient or a few nutrients or foods in isolation, identifying dietary patterns takes into account overall eating patterns.
  • The authors used a single measure of diet (FFQ) designed to capture dietary habits in the short term, but not necessarily representative of dietary habits over a longer period of time.
  • That said, the authors excluded persons based on the MMSE score, and therefore they were confident that the current samples were free from cognitive impairment.
  • Of course, given their interests in the lifelong association between dietary patterns and cognitive abilities, it would have been useful to have information on dietary patterns from childhood, and more information on both diet and cognition from points in the life course between the age of 11 and 70 years.

Conclusions

  • Dietary patterns are a promising strategy for analyzing the associations between food and cognitive performance in epidemiological investigations.
  • Their findings urge caution in interpreting diet–cognition associations as causal effects in that direction.
  • The authors’ results suggest a pattern of reverse causation (or confounding); a higher childhood cognitive ability (and adult SES) might predict choice of and/or adherence to a “healthy” dietary pattern and better cognitive performance in old age.
  • The authors’ models show no direct link between diet and cognitive performance in older age; instead, they raise the possibility that the two are related via the lifelong stable trait of intelligence.


A Scalable, Commodity Data Center Network Architecture
Mohammad Al-Fares
malfares@cs.ucsd.edu
Alexander Loukissas
aloukiss@cs.ucsd.edu
Amin Vahdat
vahdat@cs.ucsd.edu
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093-0404
ABSTRACT
Today’s data centers may contain tens of thousands of computers
with significant aggregate bandwidth requirements. The network
architecture typically consists of a tree of routing and switching
elements with progressively more specialized and expensive equip-
ment moving up the network hierarchy. Unfortunately, even when
deploying the highest-end IP switches/routers, resulting topologies
may only support 50% of the aggregate bandwidth available at the
edge of the network, while still incurring tremendous cost. Non-
uniform bandwidth among data center nodes complicates applica-
tion design and limits overall system performance.
In this paper, we show how to leverage largely commodity Eth-
ernet switches to support the full aggregate bandwidth of clusters
consisting of tens of thousands of elements. Similar to how clusters
of commodity computers have largely replaced more specialized
SMPs and MPPs, we argue that appropriately architected and inter-
connected commodity switches may deliver more performance at
less cost than available from today’s higher-end solutions. Our ap-
proach requires no modifications to the end host network interface,
operating system, or applications; critically, it is fully backward
compatible with Ethernet, IP, and TCP.
Categories and Subject Descriptors
C.2.1 [Network Architecture and Design]: Network topology;
C.2.2 [Network Protocols]: Routing protocols
General Terms
Design, Performance, Management, Reliability
Keywords
Data center topology, equal-cost routing
1. INTRODUCTION
Growing expertise with clusters of commodity PCs have enabled
a number of institutions to harness petaflops of computation power
and petabytes of storage in a cost-efficient manner. Clusters con-
sisting of tens of thousands of PCs are not unheard of in the largest
institutions and thousand-node clusters are increasingly common
in universities, research labs, and companies. Important applica-
tion classes include scientific computing, financial analysis, data
analysis and warehousing, and large-scale network services.
Today, the principal bottleneck in large-scale clusters is often
inter-node communication bandwidth. Many applications must ex-
change information with remote nodes to proceed with their local
computation. For example, MapReduce [12] must perform signif-
icant data shuffling to transport the output of its map phase before
proceeding with its reduce phase. Applications running on cluster-
based file systems [18, 28, 13, 26] often require remote-node ac-
cess before proceeding with their I/O operations. A query to a
web search engine often requires parallel communication with ev-
ery node in the cluster hosting the inverted index to return the most
relevant results [7]. Even between logically distinct clusters, there
are often significant communication requirements, e.g., when up-
dating the inverted index for individual clusters performing search
from the site responsible for building the index. Internet services
increasingly employ service oriented architectures [13], where the
retrieval of a single web page can require coordination and commu-
nication with literally hundreds of individual sub-services running
on remote nodes. Finally, the significant communication require-
ments of parallel scientific applications are well known [27, 8].
There are two high-level choices for building the communication
fabric for large-scale clusters. One option leverages specialized
hardware and communication protocols, such as InfiniBand [2] or
Myrinet [6]. While these solutions can scale to clusters of thou-
sands of nodes with high bandwidth, they do not leverage com-
modity parts (and are hence more expensive) and are not natively
compatible with TCP/IP applications. The second choice lever-
ages commodity Ethernet switches and routers to interconnect clus-
ter machines. This approach supports a familiar management in-
frastructure along with unmodified applications, operating systems,
and hardware. Unfortunately, aggregate cluster bandwidth scales
poorly with cluster size, and achieving the highest levels of band-
width incurs non-linear cost increases with cluster size.
For compatibility and cost reasons, most cluster communication
systems follow the second approach. However, communication
bandwidth in large clusters may become oversubscribed by a sig-
nificant factor depending on the communication patterns. That is,
two nodes connected to the same physical switch may be able to
communicate at full bandwidth (e.g., 1Gbps) but moving between
switches, potentially across multiple levels in a hierarchy, may
limit available bandwidth severely. Addressing these bottlenecks
requires non-commodity solutions, e.g., large 10Gbps switches and
routers. Further, typical single path routing along trees of intercon-
nected switches means that overall cluster bandwidth is limited by
the bandwidth available at the root of the communication hierarchy.

Even as we are at a transition point where 10Gbps technology is
becoming cost-competitive, the largest 10Gbps switches still incur
significant cost and still limit overall available bandwidth for the
largest clusters.
In this context, the goal of this paper is to design a data center
communication architecture that meets the following goals:
Scalable interconnection bandwidth: it should be possible for
an arbitrary host in the data center to communicate with any
other host in the network at the full bandwidth of its local
network interface.
Economies of scale: just as commodity personal computers
became the basis for large-scale computing environments,
we hope to leverage the same economies of scale to make
cheap off-the-shelf Ethernet switches the basis for large-
scale data center networks.
Backward compatibility: the entire system should be back-
ward compatible with hosts running Ethernet and IP. That is,
existing data centers, which almost universally leverage com-
modity Ethernet and run IP, should be able to take advantage
of the new interconnect architecture with no modifications.
We show that by interconnecting commodity switches in a fat-
tree architecture, we can achieve the full bisection bandwidth of
clusters consisting of tens of thousands of nodes. Specifically, one
instance of our architecture employs 48-port Ethernet switches ca-
pable of providing full bandwidth to up to 27,648 hosts. By leveraging
strictly commodity switches, we achieve lower cost than existing
solutions while simultaneously delivering more bandwidth. Our so-
lution requires no changes to end hosts, is fully TCP/IP compatible,
and imposes only moderate modifications to the forwarding func-
tions of the switches themselves. We also expect that our approach
will be the only way to deliver full bandwidth for large clusters
once 10 GigE switches become commodity at the edge, given the
current lack of any higher-speed Ethernet alternatives (at any cost).
Even when higher-speed Ethernet solutions become available, they
will initially have small port densities at significant cost.
2. BACKGROUND
2.1 Current Data Center Network Topologies
We conducted a study to determine the current best practices for
data center communication networks. We focus here on commodity
designs leveraging Ethernet and IP; we discuss the relationship of
our work to alternative technologies in Section 7.
2.1.1 Topology
Typical architectures today consist of either two- or three-level
trees of switches or routers. A three-tiered design (see Figure 1) has
a core tier in the root of the tree, an aggregation tier in the middle
and an edge tier at the leaves of the tree. A two-tiered design has
only the core and the edge tiers. Typically, a two-tiered design can
support between 5K and 8K hosts. Since we target approximately
25,000 hosts, we restrict our attention to the three-tier design.
Switches^1 at the leaves of the tree have some number of GigE
ports (48–288) as well as some number of 10 GigE uplinks to one or
more layers of network elements that aggregate and transfer packets
between the leaf switches. In the higher levels of the hierarchy there
are switches with 10 GigE ports (typically 32–128) and significant
switching capacity to aggregate traffic between the edges.
^1 We use the term switch throughout the rest of the paper to refer to
devices that perform both layer 2 switching and layer 3 routing.
We assume the use of two types of switches, which represent
the current high-end in both port density and bandwidth. The first,
used at the edge of the tree, is a 48-port GigE switch, with four 10
GigE uplinks. For higher levels of a communication hierarchy, we
consider 128-port 10 GigE switches. Both types of switches allow
all directly connected hosts to communicate with one another at the
full speed of their network interface.
2.1.2 Oversubscription
Many data center designs introduce oversubscription as a means
to lower the total cost of the design. We define the term over-
subscription to be the ratio of the worst-case achievable aggregate
bandwidth among the end hosts to the total bisection bandwidth of
a particular communication topology. An oversubscription of 1:1
indicates that all hosts may potentially communicate with arbitrary
other hosts at the full bandwidth of their network interface (e.g., 1
Gb/s for commodity Ethernet designs). An oversubscription value
of 5:1 means that only 20% of available host bandwidth is avail-
able for some communication patterns. Typical designs are over-
subscribed by a factor of 2.5:1 (400 Mbps) to 8:1 (125 Mbps) [1].
Although data centers with oversubscription of 1:1 are possible for
1 Gb/s Ethernet, as we discuss in Section 2.1.4, the cost for such
designs is typically prohibitive, even for modest-size data centers.
Achieving full bisection bandwidth for 10 Gb/s Ethernet is not cur-
rently possible when moving beyond a single switch.
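To make the arithmetic explicit, the relationship between an oversubscription ratio and worst-case per-host bandwidth can be sketched in a few lines. This is an illustration of the definition above, not code from the paper; it reproduces the 400 Mbps and 125 Mbps figures quoted for the 2.5:1 and 8:1 ratios.

```python
def worst_case_host_bandwidth(link_gbps: float, oversubscription: float) -> float:
    """Worst-case achievable bandwidth per host (in Mb/s) under a given
    oversubscription ratio, following the definition above."""
    return link_gbps * 1000 / oversubscription

# For 1 Gb/s hosts:
#   1:1   -> 1000 Mb/s (full bandwidth)
#   2.5:1 ->  400 Mb/s
#   5:1   ->  200 Mb/s (20% of host bandwidth)
#   8:1   ->  125 Mb/s
for ratio in (1, 2.5, 5, 8):
    print(f"{ratio}:1 -> {worst_case_host_bandwidth(1, ratio):.0f} Mb/s")
```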
2.1.3 Multi-path Routing
Delivering full bandwidth between arbitrary hosts in larger clus-
ters requires a “multi-rooted” tree with multiple core switches (see
Figure 1). This in turn requires a multi-path routing technique,
such as ECMP [19]. Currently, most enterprise core switches sup-
port ECMP. Without the use of ECMP, the largest cluster that can
be supported with a singly rooted core with 1:1 oversubscription
would be limited to 1,280 nodes (corresponding to the bandwidth
available from a single 128-port 10 GigE switch).
To take advantage of multiple paths, ECMP performs static load
splitting among flows. This does not account for flow bandwidth
in making allocation decisions, which can lead to oversubscription
even for simple communication patterns. Further, current ECMP
implementations limit the multiplicity of paths to 8–16, which is
often less diversity than required to deliver high bisection band-
width for larger data centers. In addition, the number of routing
table entries grows multiplicatively with the number of paths con-
sidered, which increases cost and can also increase lookup latency.
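As an illustration of the static, flow-level splitting described above, the sketch below hashes a flow identifier onto one of several equal-cost uplinks. It is a simplified stand-in for ECMP rather than any vendor implementation; the flow-tuple fields, uplink names, and hash choice are assumptions. It shows why two large flows can land on the same path regardless of how much bandwidth they carry.

```python
import hashlib

def ecmp_next_hop(flow_tuple, next_hops):
    """Static ECMP-style load splitting: every packet of a flow hashes to the
    same next hop, with no knowledge of the flow's bandwidth demand."""
    key = "|".join(map(str, flow_tuple)).encode()
    index = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(next_hops)
    return next_hops[index]

# Two large flows may collide on the same uplink while others stay idle,
# which is the oversubscription risk described above. (Hypothetical values.)
uplinks = ["core-1", "core-2", "core-3", "core-4"]
flow_a = ("10.0.1.2", "10.2.0.3", 6, 34567, 80)  # (src, dst, proto, sport, dport)
flow_b = ("10.0.1.3", "10.3.0.2", 6, 45678, 80)
print(ecmp_next_hop(flow_a, uplinks), ecmp_next_hop(flow_b, uplinks))
```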
2.1.4 Cost
The cost for building a network interconnect for a large cluster
greatly affects design decisions. As we discussed above, oversub-
scription is typically introduced to lower the total cost. Here we
give the rough cost of various configurations for different number
of hosts and oversubscription using current best practices. We as-
sume a cost of $7,000 for each 48-port GigE switch at the edge
and $700,000 for 128-port 10 GigE switches in the aggregation and
core layers. We do not consider cabling costs in these calculations.
Figure 2 plots the cost in millions of US dollars as a function
of the total number of end hosts on the x axis. Each curve rep-
resents a target oversubscription ratio. For instance, the switching
hardware to interconnect 20,000 hosts with full bandwidth among
all hosts comes to approximately $37M. The curve corresponding
to an oversubscription of 3:1 plots the cost to interconnect end
hosts where the maximum available bandwidth for arbitrary end
host communication would be limited to approximately 330 Mbps.

Figure 1: Common data center interconnect topology (core, aggregation, and edge tiers). Host to switch links are GigE and links between switches are 10 GigE.
Figure 2: Current cost estimate vs. maximum possible number of hosts for different oversubscription ratios (curves for 1:1, 3:1, and 7:1 and for the fat-tree; x-axis: number of hosts, y-axis: estimated cost in USD millions).
We also include the cost to deliver an oversubscription of 1:1 using
our proposed fat-tree architecture for comparison.
Overall, we find that existing techniques for delivering high lev-
els of bandwidth in large clusters incur significant cost and that
fat-tree based cluster interconnects hold significant promise for de-
livering scalable bandwidth at moderate cost. However, in some
sense, Figure 2 understates the difficulty and expense of employing
the highest-end components in building data center architectures.
In 2008, 10 GigE switches are on the verge of becoming commod-
ity parts; there is roughly a factor of 5 differential in price per port
per bit/sec when comparing GigE to 10 GigE switches, and this
differential continues to shrink. To explore the historical trend,
we show in Table 1 the cost of the largest cluster configuration
that could be supported using the highest-end switches available
in a particular year. We based these values on a historical study of
product announcements from various vendors of high-end 10 GigE
switches in 2002, 2004, 2006, and 2008.
We use our findings to build the largest cluster configuration that
technology in that year could support while maintaining an over-
subscription of 1:1. Table 1 shows the largest 10 GigE switch avail-
able in a particular year; we employ these switches in the core and
aggregation layers for the hierarchical design. Table 1 also shows
the largest commodity GigE switch available in that year; we em-
ploy these switches at all layers of the fat-tree and at the edge layer
for the hierarchical design.

Table 1: The maximum possible cluster size with an oversubscription ratio of 1:1 for different years.
Year | Hierarchical design: 10 GigE switch | Hosts  | Cost/GigE | Fat-tree: GigE switch | Hosts  | Cost/GigE
2002 | 28-port                             | 4,480  | $25.3K    | 28-port               | 5,488  | $4.5K
2004 | 32-port                             | 7,680  | $4.4K     | 48-port               | 27,648 | $1.6K
2006 | 64-port                             | 10,240 | $2.1K     | 48-port               | 27,648 | $1.2K
2008 | 128-port                            | 20,480 | $1.8K     | 48-port               | 27,648 | $0.3K
The maximum cluster size supported by traditional techniques
employing high-end switches has been limited by available port
density until recently. Further, the high-end switches incurred pro-
hibitive costs when 10 GigE switches were initially available. Note
that we are being somewhat generous with our calculations for tra-
ditional hierarchies since commodity GigE switches at the aggre-
gation layer did not have the necessary 10 GigE uplinks until quite
recently. Clusters based on fat-tree topologies on the other hand
scale well, with the total cost dropping more rapidly and earlier (as
a result of following commodity pricing trends earlier). Also, there
is no requirement for higher-speed uplinks in the fat-tree topology.
Finally, it is interesting to note that, today, it is technically in-
feasible to build a 27,648-node cluster with 10 Gbps bandwidth
potentially available among all nodes. On the other hand, a fat-
tree switch architecture would leverage near-commodity 48-port 10
GigE switches and incur a cost of over $690 million. While likely
cost-prohibitive in most settings, the bottom line is that it is not
even possible to build such a configuration using traditional aggre-
gation with high-end switches because today there is no product or
even Ethernet standard for switches faster than 10 GigE.
2.2 Clos Networks/Fat-Trees
Today, the price differential between commodity and non-
commodity switches provides a strong incentive to build large-scale
communication networks from many small commodity switches
rather than fewer larger and more expensive ones. More than fifty
years ago, similar trends in telephone switches led Charles Clos to
design a network topology that delivers high levels of bandwidth
for many end devices by appropriately interconnecting smaller
commodity switches [11].

We adopt a special instance of a Clos topology called a fat-
tree [23] to interconnect commodity Ethernet switches. We orga-
nize a k-ary fat-tree as shown in Figure 3. There are k pods, each
containing two layers of k/2 switches. Each k-port switch in the
lower layer is directly connected to k/2 hosts. Each of the remain-
ing k/2 ports is connected to k/2 of the k ports in the aggregation
layer of the hierarchy.
There are (k/2)^2 k-port core switches. Each core switch has one
port connected to each of the k pods. The i-th port of any core switch
is connected to pod i such that consecutive ports in the aggregation
layer of each pod switch are connected to core switches on (k/2)
strides. In general, a fat-tree built with k-port switches supports
k^3/4 hosts. In this paper, we focus on designs up to k = 48. Our
approach generalizes to arbitrary values for k.
An advantage of the fat-tree topology is that all switching ele-
ments are identical, enabling us to leverage cheap commodity parts
for all of the switches in the communication architecture.^2 Further,
fat-trees are rearrangeably non-blocking, meaning that for arbitrary
communication patterns, there is some set of paths that will satu-
rate all the bandwidth available to the end hosts in the topology.
Achieving an oversubscription ratio of 1:1 in practice may be diffi-
cult because of the need to prevent packet reordering for TCP flows.
Figure 3 shows the simplest non-trivial instance of the fat-tree
with k = 4. All hosts connected to the same edge switch form their
own subnet. Therefore, all traffic to a host connected to the same
lower-layer switch is switched, whereas all other traffic is routed.
As an example instance of this topology, a fat-tree built from 48-
port GigE switches would consist of 48 pods, each containing an
edge layer and an aggregation layer with 24 switches each. The
edge switches in every pod are assigned 24 hosts each. The net-
work supports 27,648 hosts, made up of 1,152 subnets with 24
hosts each. There are 576 equal-cost paths between any given pair
of hosts in different pods. The cost of deploying such a network
architecture would be $8.64M, compared to $37M for the tradi-
tional techniques described earlier.
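The topology parameters quoted in this section follow directly from k; the sketch below recomputes them for k = 48 as a sanity check. The formulas come from the construction described above; the implied per-switch price at the end is simply the stated $8.64M divided by the derived switch count, an inference rather than a figure given in the paper.

```python
def fat_tree_parameters(k: int) -> dict:
    """Sizes of a k-ary fat-tree built from identical k-port switches,
    following the construction described above."""
    edge_per_pod = k // 2
    agg_per_pod = k // 2
    core = (k // 2) ** 2                     # (k/2)^2 core switches
    pod_switches = k * (edge_per_pod + agg_per_pod)
    return {
        "pods": k,
        "edge switches per pod": edge_per_pod,
        "aggregation switches per pod": agg_per_pod,
        "core switches": core,
        "total switches": pod_switches + core,
        "hosts": k ** 3 // 4,                # k pods * (k/2 edge switches) * (k/2 hosts)
        "subnets": k * edge_per_pod,         # one /24 subnet per edge switch
        "equal-cost core paths": core,       # (k/2)^2 paths between pods
    }

params = fat_tree_parameters(48)
# Matches the text: 27,648 hosts, 1,152 subnets, 576 equal-cost paths,
# 48 pods with 24 edge + 24 aggregation switches each (2,880 switches total).
print(params)
# $8.64M / 2,880 switches ~= $3,000 per 48-port switch (derived, not stated).
print(8.64e6 / params["total switches"])
```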
2.3 Summary
Given our target network architecture, in the rest of this paper we
address two principal issues with adopting this topology in Ethernet
deployments. First, IP/Ethernet networks typically build a single
routing path between each source and destination. For even sim-
ple communication patterns, such single-path routing will quickly
lead to bottlenecks up and down the fat-tree, significantly limiting
overall performance. We describe simple extensions to IP forward-
ing to effectively utilize the high fan-out available from fat-trees.
Second, fat-tree topologies can impose significant wiring complex-
ity in large networks. To some extent, this overhead is inherent
in fat-tree topologies, but in Section 6 we present packaging and
placement techniques to ameliorate this overhead. Finally, we have
built a prototype of our architecture in Click [21] as described in
Section 3. An initial performance evaluation presented in Section 5
confirms the potential performance benefits of our approach in a
small-scale deployment.
3. ARCHITECTURE
In this section, we describe an architecture to interconnect com-
modity switches in a fat-tree topology. We first motivate the need
for a slight modification in the routing table structure. We then de-
scribe how we assign IP addresses to hosts in the cluster. Next,
^2 Note that switch homogeneity is not required, as bigger switches
could be used at the core (e.g., for multiplexing). While these likely
have a longer mean time to failure (MTTF), this defeats the cost
benefits, and maintains the same cabling overhead.
we introduce the concept of two-level route lookups to assist with
multi-path routing across the fat-tree. We then present the algo-
rithms we employ to populate the forwarding table in each switch.
We also describe flow classification and flow scheduling techniques
as alternate multi-path routing methods. And finally, we present
a simple fault-tolerance scheme, as well as describe the heat and
power characteristics of our approach.
3.1 Motivation
Achieving maximum bisection bandwidth in this network re-
quires spreading outgoing traffic from any given pod as evenly
as possible among the core switches. Routing protocols such as
OSPF2 [25] usually take the hop-count as their metric of “shortest-
path,” and in the k-ary fat-tree topology (see Section 2.2), there
are (k/2)^2 such shortest-paths between any two hosts on differ-
ent pods, but only one is chosen. Switches, therefore, concentrate
traffic going to a given subnet to a single port even though other
choices exist that give the same cost. Furthermore, depending on
the interleaving of the arrival times of OSPF messages, it is pos-
sible for a small subset of core switches, perhaps only one, to be
chosen as the intermediate links between pods. This will cause se-
vere congestion at those points and does not take advantage of path
redundancy in the fat-tree.
Extensions such as OSPF-ECMP [30], in addition to being un-
available in the class of switches under consideration, cause an
explosion in the number of required prefixes. A lower-level pod
switch would need (k/2) prefixes for every other subnet; a total of
k · (k/2)^2 prefixes.
We therefore need a simple, fine-grained method of traffic dif-
fusion between pods that takes advantage of the structure of the
topology. The switches must be able to recognize, and give special
treatment to, the class of traffic that needs to be evenly spread. To
achieve this, we propose using two-level routing tables that spread
outgoing traffic based on the low-order bits of the destination IP
address (see Section 3.3).
3.2 Addressing
We allocate all the IP addresses in the network within the private
10.0.0.0/8 block. We follow the familiar quad-dotted form with
the following conditions: The pod switches are given addresses of
the form 10.pod.switch.1, where pod denotes the pod number (in
[0, k−1]), and switch denotes the position of that switch in the
pod (in [0, k−1], starting from left to right, bottom to top). We give
core switches addresses of the form 10.k.j.i, where j and i denote
that switch’s coordinates in the (k/2)^2 core switch grid (each in
[1, (k/2)], starting from top-left).
The address of a host follows from the pod switch it is connected
to; hosts have addresses of the form: 10.pod.switch.ID, where
ID is the host’s position in that subnet (in [2, k/2+1], starting from
left to right). Therefore, each lower-level switch is responsible for a
/24 subnet of k/2 hosts (for k < 256). Figure 3 shows examples of
this addressing scheme for a fat-tree corresponding to k = 4. Even
though this is relatively wasteful use of the available address space,
it simplifies building the routing tables, as seen below. Nonetheless,
this scheme scales up to 4.2M hosts.
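Because the addressing rules are purely positional, they can be enumerated mechanically. The sketch below does so for k = 4 and reproduces addresses that appear in Figure 3 (pod switches 10.0.1.1 and 10.2.2.1, core switches 10.4.1.1 through 10.4.2.2, and hosts such as 10.0.1.2 and 10.2.0.3); it is an illustration of the scheme, not the paper's code.

```python
def fat_tree_addresses(k: int):
    """Enumerate pod-switch, core-switch, and host addresses for a k-ary
    fat-tree using the 10.pod.switch.ID scheme described above."""
    pod_switches = [f"10.{pod}.{switch}.1"
                    for pod in range(k) for switch in range(k)]
    core_switches = [f"10.{k}.{j}.{i}"
                     for j in range(1, k // 2 + 1) for i in range(1, k // 2 + 1)]
    hosts = [f"10.{pod}.{switch}.{host_id}"
             for pod in range(k)
             for switch in range(k // 2)            # hosts hang off edge switches only
             for host_id in range(2, k // 2 + 2)]   # IDs in [2, k/2 + 1]
    return pod_switches, core_switches, hosts

pods, cores, hosts = fat_tree_addresses(4)
print("10.0.1.1" in pods, "10.2.2.1" in pods)    # True True
print("10.4.1.1" in cores, "10.4.2.2" in cores)  # True True
print("10.0.1.2" in hosts, "10.2.0.3" in hosts)  # True True
print(len(hosts))                                # 16 hosts = k^3/4 for k = 4
```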
3.3 Two-Level Routing Table
To provide the even-distribution mechanism motivated in Sec-
tion 3.1, we modify routing tables to allow two-level prefix lookup.
Each entry in the main routing table will potentially have an addi-
tional pointer to a small secondary table of (suffix, port) entries. A
first-level prefix is terminating if it does not contain any second-
level suffixes, and a secondary table may be pointed to by more

Figure 3: Simple fat-tree topology (k = 4), with core, aggregation, and edge layers and example addresses such as 10.0.1.1, 10.2.2.1, and 10.4.1.1–10.4.2.2. Using the two-level routing tables described in Section 3.3, packets from source 10.0.1.2 to destination 10.2.0.3 would take the dashed path.
Figure 4: Two-level table example. This is the table at switch 10.2.2.1.
Primary table: prefix 10.2.0.0/24 -> port 0; prefix 10.2.1.0/24 -> port 1; prefix 0.0.0.0/0 -> secondary table.
Secondary table: suffix 0.0.0.2/8 -> port 2; suffix 0.0.0.3/8 -> port 3.
An incoming packet with destination IP address 10.2.1.2 is forwarded on port 1, whereas a packet with destination IP address 10.3.0.3 is forwarded on port 3.
than one first-level prefix. Whereas entries in the primary table are
left-handed (i.e., /m prefix masks of the form 1^m 0^(32−m)), entries
in the secondary tables are right-handed (i.e., /m suffix masks of
the form 0^(32−m) 1^m). If the longest-matching prefix search yields
a non-terminating prefix, then the longest-matching suffix in the
secondary table is found and used.
This two-level structure will slightly increase the routing table
lookup latency, but the parallel nature of prefix search in hardware
should ensure only a marginal penalty (see below). This is helped
by the fact that these tables are meant to be very small. As shown
below, the routing table of any pod switch will contain no more
than k/2 prefixes and k/2 suffixes.
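A minimal software model of the two-level lookup, under the semantics just described: the longest-matching left-handed prefix is used if it is terminating, otherwise the packet falls through to the longest-matching right-handed suffix. The table is the one given for switch 10.2.2.1 in Figure 4; the helper names and data layout are assumptions for illustration.

```python
import ipaddress

# Two-level table at switch 10.2.2.1 (Figure 4). A None port marks a
# non-terminating prefix that falls through to the suffix table.
PREFIXES = [("10.2.0.0/24", 0), ("10.2.1.0/24", 1), ("0.0.0.0/0", None)]
SUFFIXES = [(8, 2, 2), (8, 3, 3)]  # (suffix bits, suffix value, output port)

def two_level_lookup(dst: str) -> int:
    """Longest-matching left-handed prefix first; if it is non-terminating,
    use the longest-matching right-handed suffix on the low-order bits."""
    addr = ipaddress.ip_address(dst)
    match = max((ipaddress.ip_network(net) for net, _ in PREFIXES
                 if addr in ipaddress.ip_network(net)),
                key=lambda n: n.prefixlen)
    port = dict(PREFIXES)[str(match)]
    if port is not None:
        return port
    value = int(addr)
    best = max((s for s in SUFFIXES if value & ((1 << s[0]) - 1) == s[1]),
               key=lambda s: s[0])
    return best[2]

# Reproduces the example in Figure 4:
print(two_level_lookup("10.2.1.2"))  # 1 (terminating prefix 10.2.1.0/24)
print(two_level_lookup("10.3.0.3"))  # 3 (default prefix, then suffix 0.0.0.3/8)
```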
3.4 Two-Level Lookup Implementation
We now describe how the two-level lookup can be implemented
in hardware using Content-Addressable Memory (CAM) [9].
CAMs are used in search-intensive applications and are faster
than algorithmic approaches [15, 29] for finding a match against
a bit pattern. A CAM can perform parallel searches among all
its entries in a single clock cycle. Lookup engines use a special
kind of CAM, called Ternary CAM (TCAM). A TCAM can store
don’t care bits in addition to matching 0’s and 1’s in particular
positions, making it suitable for storing variable length prefixes,
such as the ones found in routing tables. On the downside, CAMs
have rather low storage density, they are very power hungry, and
expensive per bit. However, in our architecture, routing tables can
be implemented in a TCAM of a relatively modest size (k entries,
each 32 bits wide).
Figure 5: TCAM two-level routing table implementation. The TCAM stores left-handed prefixes (10.2.0.X, 10.2.1.X) and right-handed suffixes (X.X.X.2, X.X.X.3); the matching entry indexes a RAM holding the next-hop IP (10.2.0.1, 10.2.1.1, 10.4.1.1, 10.4.1.2) and the output port (0–3).
Figure 5 shows our proposed implementation of the two-level
lookup engine. A TCAM stores address prefixes and suffixes,
which in turn indexes a RAM that stores the IP address of the next
hop and the output port. We store left-handed (prefix) entries in
numerically smaller addresses and right-handed (suffix) entries in
larger addresses. We encode the output of the CAM so that the
entry with the numerically smallest matching address is output.
This satisfies the semantics of our specific application of two-level
lookup: when the destination IP address of a packet matches both a
left-handed and a right-handed entry, then the left-handed entry is
chosen. For example, using the routing table in Figure 5, a packet
with destination IP address 10.2.0.3 matches the left-handed entry
10.2.0.X and the right-handed entry X.X.X.3. The packet is
correctly forwarded on port 0. However, a packet with destination
IP address 10.3.1.2 matches only the right-handed entry X.X.X.2
and is forwarded on port 2.
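The priority behavior of Figure 5 can be modeled with an ordered list of ternary (value, mask) entries in which left-handed prefixes occupy the numerically smaller addresses and right-handed suffixes the larger ones, so the first match wins. This is a behavioral sketch of the encoder semantics, not a hardware description, and the entry encoding is an assumption.

```python
import ipaddress

def ternary(pattern: str):
    """Build a (value, mask) pair from a dotted pattern with X wildcards,
    e.g. '10.2.0.X' or 'X.X.X.3'."""
    value = mask = 0
    for octet in pattern.split("."):
        value <<= 8
        mask <<= 8
        if octet != "X":
            value |= int(octet)
            mask |= 0xFF
    return value, mask

# TCAM entries in address order: left-handed prefixes first (numerically
# smaller addresses), right-handed suffixes after them (Figure 5).
TCAM = [
    (ternary("10.2.0.X"), "10.2.0.1", 0),
    (ternary("10.2.1.X"), "10.2.1.1", 1),
    (ternary("X.X.X.2"), "10.4.1.1", 2),
    (ternary("X.X.X.3"), "10.4.1.2", 3),
]

def tcam_lookup(dst: str):
    """Return (next hop, output port) of the lowest-addressed matching entry,
    mimicking the encoder that prefers numerically smaller TCAM addresses."""
    addr = int(ipaddress.ip_address(dst))
    for (value, mask), next_hop, port in TCAM:
        if addr & mask == value:
            return next_hop, port
    raise LookupError(dst)

print(tcam_lookup("10.2.0.3"))  # ('10.2.0.1', 0): prefix beats suffix X.X.X.3
print(tcam_lookup("10.3.1.2"))  # ('10.4.1.1', 2): only the suffix matches
```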
3.5 Routing Algorithm
The first two levels of switches in a fat-tree act as filtering traf-
fic diffusers; the lower- and upper-layer switches in any given pod
have terminating prefixes to the subnets in that pod. Hence, if a
host sends a packet to another host in the same pod but on a dif-
ferent subnet, then all upper-level switches in that pod will have a
terminating prefix pointing to the destination subnet’s switch.
For all other outgoing inter-pod traffic, the pod switches have
a default /0 prefix with a secondary table matching host IDs (the

Citations
Journal ArticleDOI
TL;DR: A survey of cloud computing is presented, highlighting its key concepts, architectural principles, state-of-the-art implementation as well as research challenges to provide a better understanding of the design challenges of cloud Computing and identify important research directions in this increasingly important area.
Abstract: Cloud computing has recently emerged as a new paradigm for hosting and delivering services over the Internet. Cloud computing is attractive to business owners as it eliminates the requirement for users to plan ahead for provisioning, and allows enterprises to start from the small and increase resources only when there is a rise in service demand. However, despite the fact that cloud computing offers huge opportunities to the IT industry, the development of cloud computing technology is currently at its infancy, with many issues still to be addressed. In this paper, we present a survey of cloud computing, highlighting its key concepts, architectural principles, state-of-the-art implementation as well as research challenges. The aim of this paper is to provide a better understanding of the design challenges of cloud computing and identify important research directions in this increasingly important area.

3,465 citations


Cites background from "A scalable, commodity data center n..."

  • ...should meet the following objectives [1, 21–23, 35]:...


Proceedings ArticleDOI
16 Aug 2009
TL;DR: VL2 is a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics, and is built on a working prototype.
Abstract: To be agile and cost effective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL2, a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics. VL2 uses (1) flat addressing to allow service instances to be placed anywhere in the network, (2) Valiant Load Balancing to spread traffic uniformly across network paths, and (3) end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane. VL2's design is driven by detailed measurements of traffic and fault data from a large operational cloud service provider. VL2's implementation leverages proven network technologies, already available at low cost in high-speed hardware implementations, to build a scalable and reliable network architecture. As a result, VL2 networks can be deployed today, and we have built a working prototype. We evaluate the merits of the VL2 design using measurement, analysis, and experiments. Our VL2 prototype shuffles 2.7 TB of data among 75 servers in 395 seconds - sustaining a rate that is 94% of the maximum possible.

2,350 citations

Proceedings ArticleDOI
27 Aug 2013
TL;DR: This work presents the design, implementation, and evaluation of B4, a private WAN connecting Google's data centers across the planet, using OpenFlow to control relatively simple switches built from merchant silicon.
Abstract: We present the design, implementation, and evaluation of B4, a private WAN connecting Google's data centers across the planet. B4 has a number of unique characteristics: i) massive bandwidth requirements deployed to a modest number of sites, ii) elastic traffic demand that seeks to maximize average bandwidth, and iii) full control over the edge servers and network, which enables rate limiting and demand measurement at the edge.These characteristics led to a Software Defined Networking architecture using OpenFlow to control relatively simple switches built from merchant silicon. B4's centralized traffic engineering service drives links to near 100% utilization, while splitting application flows among multiple paths to balance capacity against application priority/demands. We describe experience with three years of B4 production deployment, lessons learned, and areas for future work.

2,226 citations


Cites background from "A scalable, commodity data center n..."

  • ...While there has been substantial focus on OpenFlow in the data center [1, 35, 40], there has been relatively little focus on the WAN....


  • ...Load balancing and multipath solutions have largely focused on data center architectures [1, 18, 20], though at least one effort recently targets the WAN [22]....


Proceedings ArticleDOI
01 Nov 2010
TL;DR: An empirical study of the network traffic in 10 data centers belonging to three different categories, including university, enterprise campus, and cloud data centers, which includes not only data centers employed by large online service providers offering Internet-facing applications but also data centers used to host data-intensive (MapReduce style) applications.
Abstract: Although there is tremendous interest in designing improved networks for data centers, very little is known about the network-level traffic characteristics of data centers today. In this paper, we conduct an empirical study of the network traffic in 10 data centers belonging to three different categories, including university, enterprise campus, and cloud data centers. Our definition of cloud data centers includes not only data centers employed by large online service providers offering Internet-facing applications but also data centers used to host data-intensive (MapReduce style) applications). We collect and analyze SNMP statistics, topology and packet-level traces. We examine the range of applications deployed in these data centers and their placement, the flow-level and packet-level transmission properties of these applications, and their impact on network and link utilizations, congestion and packet drops. We describe the implications of the observed traffic patterns for data center internal traffic engineering as well as for recently proposed architectures for data center networks.

2,119 citations


Cites background from "A scalable, commodity data center n..."

  • ...The architectures for several proposals [1, 22, 12, 2, 14, 21, 4, 18, 29] rely in some form or another on a centralized controller for configuring routes or for disseminating routing information to endhosts....


  • ...There is tremendous interest in designing improved networks for data centers [1, 2, 22, 13, 16, 11, 4, 18, 29, 14, 21]; however, such work and its evaluation is driven by only a few studies of data center traffic, and those studies are solely of huge (> 10K server) data centers, primarily running data mining, MapReduce jobs, or Web services....


  • ...Several proposals [1, 22, 11, 2] for new data center network architectures attempt to maximize the network bisection bandwidth....


Book
Luiz Andre Barroso1, Urs Hoelzle1
01 Jan 2008
TL;DR: The architecture of WSCs is described, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base are described.
Abstract: As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today's WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today's WSCs on a single board. Table of Contents: Introduction / Workloads and Software Infrastructure / Hardware Building Blocks / Datacenter Basics / Energy and Power Efficiency / Modeling Costs / Dealing with Failures and Repairs / Closing Remarks

1,938 citations

References
Book
01 Jan 1979
TL;DR: The second edition of a quarterly column as discussed by the authors provides a continuing update to the list of problems (NP-complete and harder) presented by M. R. Garey and myself in our book "Computers and Intractability: A Guide to the Theory of NP-Completeness,” W. H. Freeman & Co., San Francisco, 1979.
Abstract: This is the second edition of a quarterly column the purpose of which is to provide a continuing update to the list of problems (NP-complete and harder) presented by M. R. Garey and myself in our book ‘‘Computers and Intractability: A Guide to the Theory of NP-Completeness,’’ W. H. Freeman & Co., San Francisco, 1979 (hereinafter referred to as ‘‘[G&J]’’; previous columns will be referred to by their dates). A background equivalent to that provided by [G&J] is assumed. Readers having results they would like mentioned (NP-hardness, PSPACE-hardness, polynomial-time-solvability, etc.), or open problems they would like publicized, should send them to David S. Johnson, Room 2C355, Bell Laboratories, Murray Hill, NJ 07974, including details, or at least sketches, of any new proofs (full papers are preferred). In the case of unpublished results, please state explicitly that you would like the results mentioned in the column. Comments and corrections are also welcome. For more details on the nature of the column and the form of desired submissions, see the December 1981 issue of this journal.

40,020 citations

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations


"A scalable, commodity data center n..." refers methods in this paper

  • ...MapReduce: Simplified Data Processing on Large Clusters....


  • ...For example, MapReduce [12] must perform significant data shuffling to transport the output of its map phase before proceeding with its reduce phase....


Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations

Frequently Asked Questions (14)
Q1. What contributions have the authors mentioned in the paper "A scalable, commodity data center network architecture" ?

In this paper, the authors show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, the authors argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today ’ s higher-end solutions. Their approach requires no modifications to the end host network interface, operating system, or applications ; critically, it is fully backward compatible with Ethernet, IP, and TCP. 

By leveraging strictly commodity switches, the authors achieve lower cost than existing solutions while simultaneously delivering more bandwidth. 

The ith port of any core switch is connected to pod i such that consecutive ports in the aggregation layer of each pod switch are connected to core switches on (k/2) strides. 

Without the use of ECMP, the largest cluster that can be supported with a singly rooted core with 1:1 oversubscription would be limited to 1,280 nodes (corresponding to the bandwidth available from a single 128-port 10 GigE switch). 

An advantage of the fat-tree topology is that all switching elements are identical, enabling us to leverage cheap commodity parts for all of the switches in the communication architecture. 

Clusters consisting of tens of thousands of PCs are not unheard of in the largest institutions and thousand-node clusters are increasingly common in universities, research labs, and companies. 
For even simple communication patterns, such single-path routing will quickly lead to bottlenecks up and down the fat-tree, significantly limiting overall performance. 

The authors also expect that their approach will be the only way to deliver full bandwidth for large clusters once 10 GigE switches become commodity at the edge, given the current lack of any higher-speed Ethernet alternatives (at any cost). 

communication bandwidth in large clusters may become oversubscribed by a significant factor depending on the communication patterns. 

As an example instance of this topology, a fat-tree built from 48- port GigE switches would consist of 48 pods, each containing an edge layer and an aggregation layer with 24 switches each. 

The cost of deploying such a network architecture would be $8.64M, compared to $37M for the traditional techniques described earlier. 

Clusters based on fat-tree topologies on the other hand scale well, with the total cost dropping more rapidly and earlier (as a result of following commodity pricing trends earlier). 

For instance, the switching hardware to interconnect 20,000 hosts with full bandwidth among all hosts comes to approximately $37M. 

The authors assume a cost of $7,000 for each 48-port GigE switch at the edge and $700,000 for 128-port 10 GigE switches in the aggregation and core layers.