
Optimizing Data-Center TCO with Scale-Out Processors

Abstract
Performance and total cost of ownership (TCO) are key optimization metrics in large-scale data centers. According to these metrics, data centers designed with conventional server processors are inefficient. Recently introduced processors based on low-power cores can improve both throughput and energy efficiency compared to conventional server chips. However, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.


Pejman Lotfi-Kamran, EPFL
Babak Falsafi, EPFL
Chrysostomos Nicopoulos, University of Cyprus
Yiannakis Sazeides, University of Cyprus
Boris Grot, EPFL
Damien Hardy, University of Cyprus
We are in the midst of an information revolution, driven by ubiquitous access to vast data stores via a variety of richly networked platforms. Data centers are the workhorses powering this revolution. Companies leading the transformation to the digital universe, such as Google, Microsoft, and Facebook, rely on networks of megascale data centers to provide search, social connectivity, media streaming, and a growing number of other offerings to large, distributed audiences. A scale-out data center powering cloud services can house tens of thousands of servers that are necessary for high scalability, availability, and resilience.[1]

The massive scale of such data centers requires an enormous capital outlay for infrastructure and hardware, often exceeding $100 million per data center.[2] Similarly expansive are the power requirements, typically in the range of 5 to 15 MW per data center, totaling millions of dollars in annual operating costs. With demand for information services skyrocketing around the globe, efficiency has become a paramount concern in the design and operation of large-scale data centers.
To reduce infrastructure, hardware, and energy costs, data-center operators target high computing density and power efficiency. Total cost of ownership (TCO) is an optimization metric that considers the costs of real estate, power delivery and cooling infrastructure, hardware acquisition costs, and operating expenses. Because server acquisition and power costs constitute the two largest TCO components,[3] servers present a prime optimization target in the quest for more efficient data centers. In addition to cost, performance is also critical in scale-out data centers designed to service thousands of concurrent requests with real-time constraints. The ratio of performance to TCO (performance per dollar of ownership expense) is thus an appropriate metric for evaluating different data-center designs.
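As a concrete illustration of the metric (not from the original article), the Python sketch below compares two hypothetical data-center designs; all inputs are made-up placeholders, not data from this paper:

```python
# Hypothetical sketch of the performance-per-TCO metric.
# All numbers are illustrative placeholders, not the paper's data.

def performance_per_tco(throughput, capex, annual_opex, lifetime_years):
    """Aggregate throughput divided by total cost of ownership."""
    tco = capex + annual_opex * lifetime_years
    return throughput / tco

# Two hypothetical designs with similar budgets but different throughput:
design_a = performance_per_tco(throughput=1.0e6, capex=100e6,
                               annual_opex=10e6, lifetime_years=15)
design_b = performance_per_tco(throughput=1.3e6, capex=100e6,
                               annual_opex=12e6, lifetime_years=15)
print(design_a, design_b)  # higher is better
```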
Scale-out workloads prevalent in large-scale data centers rely on in-memory processing and massive parallelism to guarantee low response latency and high throughput. Although processors ultimately determine a server's performance characteristics, they contribute just a fraction of the overall purchase price and power burden in a server node. Memory, disk, networking equipment, power provisioning, and cooling all contribute substantially to acquisition and operating costs. Moreover, these components are less energy proportional than modern processors, meaning their power requirements don't scale down well as the server load drops. Thus, maximizing the benefit from the TCO investment requires getting high utilization from the entire server, not just the processor.
To achieve high server utilization, data centers must employ processors that can fully leverage the available bandwidth to memory and I/O. Conventional server processors use powerful cores designed for a broad range of workloads, including scientific, gaming, and media processing. As a result, they deliver good performance across the workload range, but they fail to maximize either performance or efficiency on memory-intensive scale-out applications. Emerging server processors, on the other hand, employ simpler core microarchitectures that improve efficiency but fall short of maximizing performance. What the industry needs are server processors that jointly optimize for performance, energy, and TCO. With this in mind, we developed a methodology for designing performance-density-optimal server chips called Scale-Out Processors (SOPs). Our SOP methodology improves data-center efficiency through a many-core organization tuned to the demands of scale-out workloads.
Today’s server processors
Multicore processors common today are well-suited for massively parallel scale-out workloads running in data centers. First, they improve throughput per chip over single-core designs. Second, they amortize on-chip and board-level resources among multiple hardware threads, thereby lowering both cost and power consumption per unit of work (that is, per thread).

Table 1 summarizes the principal characteristics of today's server processors. Existing data centers are built with server-class designs from Intel and AMD. A representative processor is Intel's Xeon 5670,[4] a mid-range design that integrates six powerful dual-threaded cores and a spacious 12-Mbyte last-level cache (LLC). The Xeon 5670 consumes 95 W at a maximum frequency of 3 GHz. The combination of powerful cores and relatively large chip size leads us to classify conventional server processors as big-core, big-chip designs.
Table 1. Server chip characteristics. The first three processors are existing designs, and the last two are proposed designs.

| Type                    | Processor           | Cores, threads | LLC size (Mbytes) | DDR3 interfaces | Frequency (GHz) | Power (W) | Area (mm²) | Cost per processor ($) |
|-------------------------|---------------------|----------------|-------------------|-----------------|-----------------|-----------|------------|------------------------|
| Big core, big chip      | Conventional        | 6, 12          | 12                | 3               | 3               | 95        | 233        | 800                    |
| Small core, small chip  | Small chip          | 4, 4           | 4                 | 1               | 1.5             | 6         | 62         | 95                     |
| Small core, big chip    | Tiled               | 36, 36         | 9                 | 2               | 1.5             | 28        | 132        | 300                    |
| Scale-out, in order     | Scale-Out Processor | 48, 48         | 4                 | 3               | 1.5             | 34        | 132        | 320                    |
| Scale-out, out of order | Scale-Out Processor | 16, 16         | 4                 | 2               | 2               | 33        | 132        | 320                    |

Recently, several companies have introduced processors featuring simpler core microarchitectures that specifically target scale-out data centers. Research has shown simple-core designs to be well-matched to the demands of many scale-out workloads, which spend a high fraction of their time accessing memory and have moderate computational intensity.[5] Two design paradigms have emerged in this space: one type features a few small cores on a small chip (small core, small chip); the other integrates many small cores on a bigger chip (small core, big chip).

Companies including Calxeda, Marvell, and SeaMicro market small-core, small-chip processors targeted at data centers. Despite the differences in the core organization and even the instruction set architecture (ISA) (Calxeda's and Marvell's designs are powered by ARM, whereas SeaMicro uses an x86-based Atom processor), the chips are surprisingly similar in their feature set: all have four hardware contexts, dual-issue cores, a clock speed in the range of 1.1 to 1.6 GHz, and power consumption of 5 to 10 W. We use the Calxeda design as a representative configuration, featuring four Cortex-A9 cores, a 4-Mbyte LLC, and an on-die memory controller.[6] At 1.5 GHz, our model estimates a peak power consumption of 6 W.
A processor representative of the small-core, big-chip design philosophy is Tilera's Tile-Gx3036. This server-class processor features 36 simple cores and a 9-Mbyte LLC in a tiled organization.[7] Each tile integrates a core, a slice of the shared LLC, and a router. Accesses to the distributed LLC's remote banks require a traversal of the on-chip interconnect, implemented as a 2D mesh network with a single-cycle per-hop delay. Operating at 1.5 GHz, the Tilera-like tiled design draws approximately 28 W of power at peak load.

To understand the efficiency implications of these diverse processor architectures, we use a combination of analytic models and simulation-based studies, employing a full-system server simulation infrastructure, to estimate their performance, area, and power characteristics. Our workloads are taken from CloudSuite (http://parsa.epfl.ch/cloudsuite), a collection of representative scale-out applications that includes web search, data serving, and MapReduce.
Figure 1a compares the designs along two dimensions: performance density and energy efficiency. Performance density, expressed as performance per mm², measures the processor's ability to effectively utilize the chip real estate. Energy efficiency, in units of performance per watt, indicates the processor's ability to convert energy into useful work.

The small-core, small-chip processor offers a 2.2× improvement in energy efficiency over a conventional big-core design, thanks to the former's simpler core microarchitecture. However, the small-chip design has 45 percent lower performance density than the conventional one.
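The two metrics can be recomputed from the Table 1 area and power figures. In the sketch below, the normalized throughput values are placeholders back-solved to be roughly consistent with the ratios quoted in the text (the article reports performance only in normalized form):

```python
# Sketch of the two efficiency metrics in Figure 1, using Table 1's
# area and power columns. The throughput values are assumptions
# back-solved from the ratios quoted in the text, not reported data.

chips = {
    #               (normalized throughput, area in mm^2, power in W)
    "Conventional": (1.00, 233, 95),
    "Small chip":   (0.14, 62, 6),
    "Tiled":        (1.07, 132, 28),
}

for name, (perf, area, power) in chips.items():
    print(f"{name:12} perf/mm2 = {perf / area:.4f}  perf/W = {perf / power:.4f}")
```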
Figure 1. Efficiency, area, and power of today's server processors: (a) performance density and energy efficiency; (b) processor area and power breakdown. We use a combination of analytic models and simulation-based studies to estimate the performance, area, and power characteristics.
To better understand the trends, Figure 1b shows a breakdown of the respective processors' area and power budgets. The data reveals that while the cores in the conventional server processor take up 44 percent of the chip area, the small-chip design commits just 20 percent of the chip to compute, with the remainder of the area going to the LLC, I/O, and auxiliary circuitry. In terms of power, the six conventional cores consume 71 W of the 95-W power budget (75 percent), whereas the four simpler cores in the small-chip organization dissipate just 2.4 W (38 percent of total chip power) under full load. As with the area, the relative energy cost of the cache and peripheral circuitry in the small-chip design is greater than in the conventional design (62 percent and 25 percent of the respective chips' power budgets).

The most efficient design point is the small-core, big-chip tiled processor, which surpasses both conventional and small-chip alternatives by more than 88 percent in performance density, and 65 percent in energy efficiency. The cores in the tiled processor take up 36 percent of the chip real estate, nearly doubling the fraction of the area dedicated to execution resources as compared to the small-chip design. The fraction of the power devoted to execution resources increases to 48 percent compared to 38 percent in the small-chip design.

Our results corroborate earlier studies that identify efficiency benefits stemming from the use of lower-complexity cores as compared to those used in conventional server processors.[8,9] However, our findings also identify an important, yet unsurprising, trend: the use of simpler cores by themselves is insufficient for maximizing processor efficiency, and the chip-level organization must be considered. More specifically, a larger chip that integrates many cores is necessary to amortize the area and power expense of uncore resources, such as cache and off-chip interfaces, by multiplexing them among the cores.
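The amortization argument can be made concrete with a toy calculation. In the sketch below, the core and uncore areas are rough figures chosen to mirror the 20 percent (small chip) and 36 percent (tiled) compute fractions quoted above, not exact measurements:

```python
# Why a big chip amortizes uncore cost: the per-core share of a fixed
# uncore area shrinks as core count grows. Area figures are rough
# assumptions chosen to mirror the fractions quoted in the text.

def core_area_fraction(num_cores, core_mm2, uncore_mm2):
    total = num_cores * core_mm2 + uncore_mm2
    return num_cores * core_mm2 / total

print(core_area_fraction(4, 3.0, 48.0))    # ~0.20: small-chip design
print(core_area_fraction(36, 1.3, 84.0))   # ~0.36: tiled design
```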
Scale-Out Processors
To maximize silicon efficiency on scale-out workloads, we examined the characteristics of a suite of representative scale-out applications and the demands they place on processor resources. Our findings, consistent with prior work,[10,11] indicate that

- large LLCs are not beneficial for capturing data-center applications' enormous data footprint;
- the active instruction footprint greatly exceeds the Level-1 (L1) caches' capacity, but can be accommodated with a 2- to 4-Mbyte secondary cache; and
- scale-out workloads have virtually no thread-to-thread communication, requiring minimal on-chip coherence and communication infrastructure.

Driven by these observations, we developed the SOP design methodology that extends the small-core, big-chip design space by optimizing the on-chip cache capacity, core count, interconnect delay, and number of interfaces to the off-chip memory in a way that maximizes computing density and throughput.[12]
At the heart of an SOP is a coarse-grained building block called a pod: a stand-alone multicore server. Each pod features a modestly sized 2- to 4-Mbyte LLC for capturing the active instruction footprint and commonly accessed data structures. The small LLC size reduces the cache access time and leaves more chip area for the cores. To further reduce the latency of performance-critical LLC accesses, SOPs use a high-bandwidth crossbar interconnect instead of a multihop point-to-point network. The number of cores in a pod is empirically chosen in a way that maximizes cache utilization without causing thrashing or penalizing interconnect area and delay.

The SOP architecture achieves scalability through tiling at the pod granularity up to the available area, power, or memory bandwidth limit. The multiple pods share the off-chip interfaces to reduce cost and maximize bandwidth utilization. The pod-based tiling strategy reduces chip-level complexity and provides a technology-scalable architecture that preserves each pod's optimality across technology generations. Figure 2 compares the SOP chip architecture to conventional, small-chip, and tiled designs.
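A minimal sketch of this tiling rule, assuming illustrative per-pod area, power, and bandwidth costs (the article does not publish per-pod figures), might look like:

```python
# Hypothetical sketch of pod-granularity tiling: replicate pods until
# the first chip-level resource limit is reached. All per-pod costs
# below are illustrative assumptions, not figures from the paper.

def max_pods(area_budget_mm2, power_budget_w, mem_bw_gbps,
             pod_area_mm2, pod_power_w, pod_bw_gbps):
    """An SOP integrates as many pods as the tightest constraint allows."""
    by_area = area_budget_mm2 // pod_area_mm2
    by_power = power_budget_w // pod_power_w
    by_bw = mem_bw_gbps // pod_bw_gbps
    return int(min(by_area, by_power, by_bw))

# Example: a 132 mm^2 chip (as in Table 1) with assumed per-pod costs.
print(max_pods(area_budget_mm2=132, power_budget_w=35, mem_bw_gbps=38,
               pod_area_mm2=40, pod_power_w=11, pod_bw_gbps=12))  # -> 3
```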
Compared to a tiled design, an SOP increases the number of cores integrated on chip by 33 percent, from 36 to 48, within the same area budget. SOPs achieve this computing-density improvement by reducing the LLC capacity from 9 to 4 Mbytes, freeing up chip area for the cores. The resulting SOP devotes 47 percent of the chip area to the cores, up from 36 percent in the tiled processor. Our evaluation shows that the per-core performance in the SOP design is comparable to that in a tiled one; although the SOP's smaller LLC capacity has a dampening effect on single-threaded performance, the lower access delay in the crossbar-based SOP compensates for the higher miss ratio by accelerating fetches of performance-critical instructions. The bottom line is that the SOP design improves chip-level performance (that is, throughput) by 33 percent over the tiled processor. Finally, the SOP design's peak power consumption is higher than that of the tiled processor owing to the former's greater on-chip computing capacity. However, as our results demonstrate, the SOP's greater chip-level processing capability is beneficial from a TCO perspective despite the increased power draw at the chip level.
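A quick arithmetic check of the comparison above (the normalized per-core performance is an assumption the text itself makes):

```python
# With comparable per-core performance, as the text argues, chip
# throughput scales with core count.
tiled_cores, sop_cores = 36, 48
print(f"core and throughput gain: {sop_cores / tiled_cores - 1:.0%}")  # 33%
```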
Methodology
We now describe the cost models, server hardware features, workloads, and simulation infrastructure used in evaluating the various chip organizations at the data-center level.
TCO model
Large-scale data centers employ high-density server racks to reduce the space footprint and improve cost efficiency. A standard rack can accommodate up to 42 1U (one-rack-unit) servers, with each server integrating one or more processors, multiple DRAM DIMMs, disk- or flash-based storage nodes, and a network interface. Servers in a rack share the power distribution infrastructure and network interfaces with the rest of the data center. The number of racks in a large-scale data center is commonly constrained by the available power budget.

Our TCO analysis, derived using EETCO,[13] considers four major expense categories. Table 2 further details the key parameters.
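Before walking through the categories below, here is a minimal sketch of how such a model composes amortized capital and energy costs. The parameter values are hypothetical, and the actual EETCO tool is considerably more detailed:

```python
# A minimal sketch in the spirit of a TCO model such as EETCO; the
# real tool models depreciation, cooling, and power provisioning in
# far more detail. Amortization periods follow the text; the energy
# price and all cost inputs are illustrative assumptions.

def annual_tco(infra_cost, server_cost, network_cost,
               power_w, energy_price_kwh=0.10,
               infra_years=15, server_years=3, network_years=4):
    """Sum of amortized capital costs plus annual energy cost."""
    amortized = (infra_cost / infra_years +
                 server_cost / server_years +
                 network_cost / network_years)
    energy = power_w / 1000 * 24 * 365 * energy_price_kwh  # kWh per year
    return amortized + energy

# Hypothetical 10 MW facility:
print(annual_tco(infra_cost=100e6, server_cost=60e6,
                 network_cost=10e6, power_w=10e6))
```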
Data-center infrastructure. This includes the land, building, and power provisioning and cooling equipment with a 15-year depreciation schedule. The data-center area is primarily determined by the IT (rack) area, with cooling and power provisioning equipment factored in. We estimate the cost of this equipment per watt of critical power.
Server and networking hardware. Server hardware includes processors, memory, disks, and motherboards. We also account for the networking gear at the data center's edge, aggregation, and core layers and assume that the cost scales with the number of racks. The amortization schedule is three years for server hardware, and four years for networking equipment.
Power. This is predominantly determined by the servers, including fans and power
Figure 2. Comparison of server processor organizations: conventional (a), small chip (b), tiled (c), and Scale-Out (d). The Scale-Out design achieves the highest performance density through modestly sized caches and many cores in a multi-pod organization.

References

- Clearing the clouds: a study of emerging scale-out workloads on modern hardware (proceedings paper). Identifies the key microarchitectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
- SimFlex: Statistical Sampling of Computer System Simulation (journal article). Statistical sampling makes simulation-based studies feasible by providing ten-thousand-fold reductions in simulation runtime and enabling thousand-way simulation parallelism.
- Toward Dark Silicon in Servers (journal article). Server chips will not scale beyond a few tens to low hundreds of cores, and an increasing fraction of the chip in future technologies will be dark silicon that designers cannot afford to power.
- Web search using mobile cores: quantifying and mitigating the price of efficiency (proceedings paper). Quantifies efficiency for an industry-strength online web search engine in production at both the microarchitecture and system level, evaluating search on server- and mobile-class architectures using Xeon and Atom processors.