
Optimizing Data-Center TCO with Scale-Out Processors

Abstract
Performance and total cost of ownership (TCO) are key optimization metrics in large-scale data centers. According to these metrics, data centers designed with conventional server processors are inefficient. Recently introduced processors based on low-power cores can improve both throughput and energy efficiency compared to conventional server chips. However, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.


Pejman Lotfi-Kamran, EPFL
Babak Falsafi, EPFL
Chrysostomos Nicopoulos, University of Cyprus
Yiannakis Sazeides, University of Cyprus
Boris Grot, EPFL
Damien Hardy, University of Cyprus
We are in the midst of an information revolution, driven by ubiquitous access to vast data stores via a variety of richly networked platforms. Data centers are the workhorses powering this revolution. Companies leading the transformation to the digital universe, such as Google, Microsoft, and Facebook, rely on networks of megascale data centers to provide search, social connectivity, media streaming, and a growing number of other offerings to large, distributed audiences. A scale-out data center powering cloud services can house tens of thousands of servers that are necessary for high scalability, availability, and resilience.[1]

The massive scale of such data centers requires an enormous capital outlay for infrastructure and hardware, often exceeding $100 million per data center.[2] Similarly expansive are the power requirements, typically in the range of 5 to 15 MW per data center, totaling millions of dollars in annual operating costs. With demand for information services skyrocketing around the globe, efficiency has become a paramount concern in the design and operation of large-scale data centers.
To reduce infrastructure, hardware, and energy costs, data-center operators target high computing density and power efficiency. Total cost of ownership (TCO) is an optimization metric that considers the costs of real estate, power delivery and cooling infrastructure, hardware acquisition costs, and operating expenses. Because server acquisition and power costs constitute the two largest TCO components,[3] servers present a prime optimization target in the quest for more efficient data centers. In addition to cost, performance is also critical in scale-out data centers designed to service thousands of concurrent requests with real-time constraints. The ratio of performance to TCO (performance per dollar of ownership expense) is thus an appropriate metric for evaluating different data-center designs.
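As a concrete illustration of the metric (not from the original article), the Python sketch below compares two hypothetical data-center designs; all inputs are made-up placeholders, not data from this paper:

```python
# Hypothetical sketch of the performance-per-TCO metric.
# All numbers are illustrative placeholders, not the paper's data.

def performance_per_tco(throughput, capex, annual_opex, lifetime_years):
    """Aggregate throughput divided by total cost of ownership."""
    tco = capex + annual_opex * lifetime_years
    return throughput / tco

# Two hypothetical designs with similar budgets but different throughput:
design_a = performance_per_tco(throughput=1.0e6, capex=100e6,
                               annual_opex=10e6, lifetime_years=15)
design_b = performance_per_tco(throughput=1.3e6, capex=100e6,
                               annual_opex=12e6, lifetime_years=15)
print(design_a, design_b)  # higher is better
```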
Scale-out workloads prevalent in large-scale data centers rely on in-memory processing and massive parallelism to guarantee low response latency and high throughput. Although processors ultimately determine a server's performance characteristics, they contribute just a fraction of the overall purchase price and power burden in a server node. Memory, disk, networking equipment, power provisioning, and cooling all contribute substantially to acquisition and operating costs. Moreover, these components are less energy proportional than modern processors, meaning their power requirements don't scale down well as the server load drops. Thus, maximizing the benefit from the TCO investment requires getting high utilization from the entire server, not just the processor.
To achieve high server utilization, data centers must employ processors that can fully leverage the available bandwidth to memory and I/O. Conventional server processors use powerful cores designed for a broad range of workloads, including scientific, gaming, and media processing. As a result, they deliver good performance across the workload range, but they fail to maximize either performance or efficiency on memory-intensive scale-out applications. Emerging server processors, on the other hand, employ simpler core microarchitectures that improve efficiency but fall short of maximizing performance. What the industry needs are server processors that jointly optimize for performance, energy, and TCO. With this in mind, we developed a methodology for designing performance-density-optimal server chips called Scale-Out Processors (SOPs). Our SOP methodology improves data-center efficiency through a many-core organization tuned to the demands of scale-out workloads.
Today’s server processors
Multicore processors common today are well-suited for massively parallel scale-out workloads running in data centers. First, they improve throughput per chip over single-core designs. Second, they amortize on-chip and board-level resources among multiple hardware threads, thereby lowering both cost and power consumption per unit of work (that is, per thread).

Table 1 summarizes the principal characteristics of today's server processors. Existing data centers are built with server-class designs from Intel and AMD. A representative processor is Intel's Xeon 5670,[4] a mid-range design that integrates six powerful dual-threaded cores and a spacious 12-Mbyte last-level cache (LLC). The Xeon 5670 consumes 95 W at a maximum frequency of 3 GHz. The combination of powerful cores and relatively large chip size leads us to classify conventional server processors as big-core, big-chip designs.
Table 1. Server chip characteristics. The first three processors are existing designs, and the last two are proposed designs.

| Type                    | Processor           | Cores, threads | LLC size (Mbytes) | DDR3 interfaces | Frequency (GHz) | Power (W) | Area (mm²) | Cost per processor ($) |
|-------------------------|---------------------|----------------|-------------------|-----------------|-----------------|-----------|------------|------------------------|
| Big core, big chip      | Conventional        | 6, 12          | 12                | 3               | 3               | 95        | 233        | 800                    |
| Small core, small chip  | Small chip          | 4, 4           | 4                 | 1               | 1.5             | 6         | 62         | 95                     |
| Small core, big chip    | Tiled               | 36, 36         | 9                 | 2               | 1.5             | 28        | 132        | 300                    |
| Scale-out, in order     | Scale-Out Processor | 48, 48         | 4                 | 3               | 1.5             | 34        | 132        | 320                    |
| Scale-out, out of order | Scale-Out Processor | 16, 16         | 4                 | 2               | 2               | 33        | 132        | 320                    |

Recently, several companies have introduced processors featuring simpler core microarchitectures that specifically target scale-out data centers. Research has shown simple-core designs to be well-matched to the demands of many scale-out workloads, which spend a high fraction of their time accessing memory and have moderate computational intensity.[5] Two design paradigms have emerged in this space: one type features a few small cores on a small chip (small core, small chip); the other integrates many small cores on a bigger chip (small core, big chip).

Companies including Calxeda, Marvell, and SeaMicro market small-core, small-chip processors targeted at data centers. Despite the differences in the core organization and even the instruction set architecture (ISA) (Calxeda's and Marvell's designs are powered by ARM, whereas SeaMicro uses an x86-based Atom processor), the chips are surprisingly similar in their feature set: all have four hardware contexts, dual-issue cores, a clock speed in the range of 1.1 to 1.6 GHz, and power consumption of 5 to 10 W. We use the Calxeda design as a representative configuration, featuring four Cortex-A9 cores, a 4-Mbyte LLC, and an on-die memory controller.[6] At 1.5 GHz, our model estimates a peak power consumption of 6 W.
A processor representative of the small-core, big-chip design philosophy is Tilera's Tile-Gx3036. This server-class processor features 36 simple cores and a 9-Mbyte LLC in a tiled organization.[7] Each tile integrates a core, a slice of the shared LLC, and a router. Accesses to the distributed LLC's remote banks require a traversal of the on-chip interconnect, implemented as a 2D mesh network with a single-cycle per-hop delay. Operating at 1.5 GHz, the Tilera-like tiled design draws approximately 28 W of power at peak load.

To understand the efficiency implications of these diverse processor architectures, we use a combination of analytic models and simulation-based studies, employing a full-system server simulation infrastructure, to estimate their performance, area, and power characteristics. Our workloads are taken from CloudSuite (http://parsa.epfl.ch/cloudsuite), a collection of representative scale-out applications that includes web search, data serving, and MapReduce.
Figure 1a compares the designs along two dimensions: performance density and energy efficiency. Performance density, expressed as performance per mm², measures the processor's ability to effectively utilize the chip real estate. Energy efficiency, in units of performance per watt, indicates the processor's ability to convert energy into useful work.

The small-core, small-chip processor offers a 2.2× improvement in energy efficiency over a conventional big-core design, thanks to the former's simpler core microarchitecture. However, the small-chip design has 45 percent lower performance density than the conventional one.
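The two metrics can be recomputed from the Table 1 area and power figures. In the sketch below, the normalized throughput values are placeholders back-solved to be roughly consistent with the ratios quoted in the text (the article reports performance only in normalized form):

```python
# Sketch of the two efficiency metrics in Figure 1, using Table 1's
# area and power columns. The throughput values are assumptions
# back-solved from the ratios quoted in the text, not reported data.

chips = {
    #               (normalized throughput, area in mm^2, power in W)
    "Conventional": (1.00, 233, 95),
    "Small chip":   (0.14, 62, 6),
    "Tiled":        (1.07, 132, 28),
}

for name, (perf, area, power) in chips.items():
    print(f"{name:12} perf/mm2 = {perf / area:.4f}  perf/W = {perf / power:.4f}")
```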
Figure 1. Efficiency, area, and power of today's server processors: (a) performance density and energy efficiency; (b) processor area and power breakdown. We use a combination of analytic models and simulation-based studies to estimate the performance, area, and power characteristics.
To better understand the trends, Figure 1b shows a breakdown of the respective processors' area and power budgets. The data reveals that while the cores in the conventional server processor take up 44 percent of the chip area, the small-chip design commits just 20 percent of the chip to compute, with the remainder of the area going to the LLC, I/O, and auxiliary circuitry. In terms of power, the six conventional cores consume 71 W of the 95-W power budget (75 percent), whereas the four simpler cores in the small-chip organization dissipate just 2.4 W (38 percent of total chip power) under full load. As with the area, the relative energy cost of the cache and peripheral circuitry in the small-chip design is greater than in the conventional design (62 percent and 25 percent of the respective chips' power budgets).

The most efficient design point is the small-core, big-chip tiled processor, which surpasses both conventional and small-chip alternatives by more than 88 percent in performance density, and 65 percent in energy efficiency. The cores in the tiled processor take up 36 percent of the chip real estate, nearly doubling the fraction of the area dedicated to execution resources as compared to the small-chip design. The fraction of the power devoted to execution resources increases to 48 percent compared to 38 percent in the small-chip design.

Our results corroborate earlier studies that identify efficiency benefits stemming from the use of lower-complexity cores as compared to those used in conventional server processors.[8,9] However, our findings also identify an important, yet unsurprising, trend: the use of simpler cores by themselves is insufficient for maximizing processor efficiency, and the chip-level organization must be considered. More specifically, a larger chip that integrates many cores is necessary to amortize the area and power expense of uncore resources, such as cache and off-chip interfaces, by multiplexing them among the cores.
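The amortization argument can be made concrete with a toy calculation. In the sketch below, the core and uncore areas are rough figures chosen to mirror the 20 percent (small chip) and 36 percent (tiled) compute fractions quoted above, not exact measurements:

```python
# Why a big chip amortizes uncore cost: the per-core share of a fixed
# uncore area shrinks as core count grows. Area figures are rough
# assumptions chosen to mirror the fractions quoted in the text.

def core_area_fraction(num_cores, core_mm2, uncore_mm2):
    total = num_cores * core_mm2 + uncore_mm2
    return num_cores * core_mm2 / total

print(core_area_fraction(4, 3.0, 48.0))    # ~0.20: small-chip design
print(core_area_fraction(36, 1.3, 84.0))   # ~0.36: tiled design
```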
Scale-Out Processors
To maximize silicon efficiency on scale-out workloads, we examined the characteristics of a suite of representative scale-out applications and the demands they place on processor resources. Our findings, consistent with prior work,[10,11] indicate that

- large LLCs are not beneficial for capturing data-center applications' enormous data footprint;
- the active instruction footprint greatly exceeds the Level-1 (L1) caches' capacity, but can be accommodated with a 2- to 4-Mbyte secondary cache; and
- scale-out workloads have virtually no thread-to-thread communication, requiring minimal on-chip coherence and communication infrastructure.

Driven by these observations, we developed the SOP design methodology that extends the small-core, big-chip design space by optimizing the on-chip cache capacity, core count, interconnect delay, and number of interfaces to the off-chip memory in a way that maximizes computing density and throughput.[12]
At the heart of an SOP is a coarse-grained building block called a pod: a stand-alone multicore server. Each pod features a modestly sized 2- to 4-Mbyte LLC for capturing the active instruction footprint and commonly accessed data structures. The small LLC size reduces the cache access time and leaves more chip area for the cores. To further reduce the latency of performance-critical LLC accesses, SOPs use a high-bandwidth crossbar interconnect instead of a multihop point-to-point network. The number of cores in a pod is empirically chosen in a way that maximizes cache utilization without causing thrashing or penalizing interconnect area and delay.

The SOP architecture achieves scalability through tiling at the pod granularity up to the available area, power, or memory bandwidth limit. The multiple pods share the off-chip interfaces to reduce cost and maximize bandwidth utilization. The pod-based tiling strategy reduces chip-level complexity and provides a technology-scalable architecture that preserves each pod's optimality across technology generations. Figure 2 compares the SOP chip architecture to conventional, small-chip, and tiled designs.
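A minimal sketch of this tiling rule, assuming illustrative per-pod area, power, and bandwidth costs (the article does not publish per-pod figures), might look like:

```python
# Hypothetical sketch of pod-granularity tiling: replicate pods until
# the first chip-level resource limit is reached. All per-pod costs
# below are illustrative assumptions, not figures from the paper.

def max_pods(area_budget_mm2, power_budget_w, mem_bw_gbps,
             pod_area_mm2, pod_power_w, pod_bw_gbps):
    """An SOP integrates as many pods as the tightest constraint allows."""
    by_area = area_budget_mm2 // pod_area_mm2
    by_power = power_budget_w // pod_power_w
    by_bw = mem_bw_gbps // pod_bw_gbps
    return int(min(by_area, by_power, by_bw))

# Example: a 132 mm^2 chip (as in Table 1) with assumed per-pod costs.
print(max_pods(area_budget_mm2=132, power_budget_w=35, mem_bw_gbps=38,
               pod_area_mm2=40, pod_power_w=11, pod_bw_gbps=12))  # -> 3
```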
Compared to a tiled design, an SOP increases the number of cores integrated on chip by 33 percent, from 36 to 48, within the same area budget. SOPs achieve this computing-density improvement by reducing the LLC capacity from 9 to 4 Mbytes, freeing up chip area for the cores. The resulting SOP devotes 47 percent of the chip area to the cores, up from 36 percent in the tiled processor. Our evaluation shows that the per-core performance in the SOP design is comparable to that in a tiled one; although the SOP's smaller LLC capacity has a dampening effect on single-threaded performance, the lower access delay in the crossbar-based SOP compensates for the higher miss ratio by accelerating fetches of performance-critical instructions. The bottom line is that the SOP design improves chip-level performance (that is, throughput) by 33 percent over the tiled processor. Finally, the SOP design's peak power consumption is higher than that of the tiled processor owing to the former's greater on-chip computing capacity. However, as our results demonstrate, the SOP's greater chip-level processing capability is beneficial from a TCO perspective despite the increased power draw at the chip level.
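A quick arithmetic check of the comparison above (the normalized per-core performance is an assumption the text itself makes):

```python
# With comparable per-core performance, as the text argues, chip
# throughput scales with core count.
tiled_cores, sop_cores = 36, 48
print(f"core and throughput gain: {sop_cores / tiled_cores - 1:.0%}")  # 33%
```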
Methodology
We now describe the cost models, server hardware features, workloads, and simulation infrastructure used in evaluating the various chip organizations at the data-center level.
TCO model
Large-scale data centers employ high-density server racks to reduce the space footprint and improve cost efficiency. A standard rack can accommodate up to 42 1U (one-rack-unit) servers, with each server integrating one or more processors, multiple DRAM DIMMs, disk- or flash-based storage nodes, and a network interface. Servers in a rack share the power distribution infrastructure and network interfaces with the rest of the data center. The number of racks in a large-scale data center is commonly constrained by the available power budget.

Our TCO analysis, derived using EETCO,[13] considers four major expense categories. Table 2 further details the key parameters.
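Before walking through the categories below, here is a minimal sketch of how such a model composes amortized capital and energy costs. The parameter values are hypothetical, and the actual EETCO tool is considerably more detailed:

```python
# A minimal sketch in the spirit of a TCO model such as EETCO; the
# real tool models depreciation, cooling, and power provisioning in
# far more detail. Amortization periods follow the text; the energy
# price and all cost inputs are illustrative assumptions.

def annual_tco(infra_cost, server_cost, network_cost,
               power_w, energy_price_kwh=0.10,
               infra_years=15, server_years=3, network_years=4):
    """Sum of amortized capital costs plus annual energy cost."""
    amortized = (infra_cost / infra_years +
                 server_cost / server_years +
                 network_cost / network_years)
    energy = power_w / 1000 * 24 * 365 * energy_price_kwh  # kWh per year
    return amortized + energy

# Hypothetical 10 MW facility:
print(annual_tco(infra_cost=100e6, server_cost=60e6,
                 network_cost=10e6, power_w=10e6))
```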
Data-center infrastructure. This includes the land, building, and power provisioning and cooling equipment with a 15-year depreciation schedule. The data-center area is primarily determined by the IT (rack) area, with cooling and power provisioning equipment factored in. We estimate the cost of this equipment per watt of critical power.
Server and networking hardware. Server hardware includes processors, memory, disks, and motherboards. We also account for the networking gear at the data center's edge, aggregation, and core layers and assume that the cost scales with the number of racks. The amortization schedule is three years for server hardware, and four years for networking equipment.
Power. This is predominantly determined by the servers, including fans and power
Figure 2. Comparison of server processor organizations: conventional (a), small chip (b), tiled (c), and Scale-Out (d). The Scale-Out design achieves the highest performance density through modestly sized caches and many cores in a multi-pod organization.

References

- Clearing the clouds: a study of emerging scale-out workloads on modern hardware (proceedings paper). Identifies the key microarchitectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
- SimFlex: Statistical Sampling of Computer System Simulation (journal article). Statistical sampling makes simulation-based studies feasible by providing ten-thousand-fold reductions in simulation runtime and enabling thousand-way simulation parallelism.
- Toward Dark Silicon in Servers (journal article). Server chips will not scale beyond a few tens to low hundreds of cores, and an increasing fraction of the chip in future technologies will be dark silicon that designers cannot afford to power.
- Web search using mobile cores: quantifying and mitigating the price of efficiency (proceedings paper). Quantifies efficiency for an industry-strength online web search engine in production at both the microarchitecture and system level, evaluating search on server- and mobile-class architectures using Xeon and Atom processors.