Dark silicon and the end of multicore scaling
Summary (4 min read)
2. OVERVIEW
- Figure 1 shows how this paper combines models and empirical measurements to project multicore performance and chip utilization.
- The authors consider ITRS Roadmap projections [18] and conservative scaling parameters from Borkar’s recent study [7].
- The core-level model provides the maximum performance that a single-core can sustain for any given area.
- The CPU multicore organization represents Intel Nehalem-like, heavy-weight multicore designs with fast caches and high single-thread performance.
- The design leverages the high-performing large core for the serial portion of code and leverages the numerous small cores as well as the large core to exploit the parallel portion of code.
3. DEVICE MODEL
- The authors consider two different technology scaling schemes to build a device scaling model.
- The first scheme uses projections from the ITRS 2010 technology roadmap [18].
- The second scheme, which the authors call conservative scaling, is based on predictions presented by Borkar and represents a less optimistic view [7].
- The power scaling factor is computed using the predicted frequency, voltage, and gate capacitance scaling factors in accordance with the dynamic power equation P = αCV_dd²f.
- The ITRS roadmap predicts that multi-gate MOSFETs, such as FinFETs, will supersede planar bulk at 22 nm [18].
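The power scaling factor described above follows directly from the dynamic-power equation; a minimal sketch, where the node-to-node scaling factors are illustrative placeholders rather than actual ITRS or Borkar values:

```python
def power_scaling_factor(freq_scale, vdd_scale, cap_scale):
    """Scaling of dynamic power P = alpha * C * Vdd^2 * f when gate
    capacitance, supply voltage, and frequency each scale by the given
    factor (activity factor alpha assumed unchanged across nodes)."""
    return cap_scale * vdd_scale ** 2 * freq_scale

# Illustrative node-to-node factors (placeholders, not roadmap data):
print(power_scaling_factor(freq_scale=1.1, vdd_scale=0.93, cap_scale=0.7))
```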
4. CORE MODEL
- This paper uses Pareto frontiers to provide single-core power/performance and area/performance tradeoffs at each technology node while abstracting away specific details of the cores.
- These functions are derived from the data collected for a large set of processors.
- The power/performance Pareto frontier represents the optimal design points in terms of power and performance [16].
- Similarly, the area/performance Pareto frontier represents the optimal design points in the area/performance design space.
- Below, the authors first describe why separate area and power functions are required.
4.1 Decoupling Area and Power Constraints
- Furthermore, these studies consider the power consumption of a core to be directly proportional to its transistor count.
- This assumption makes power an area-dependent constraint.
- Power is a function of not only area, but also supply voltage and frequency.
- Since these no longer scale at historical rates, Pollack’s rule is insufficient for modeling core power.
4.2 Pareto Frontier Derivation
- Figure 2(a) shows the power/performance single-core design space.
- To derive the quadratic area/performance Pareto frontier, die photos of four microarchitectures (Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem) are used to estimate the core areas (excluding level 2 and level 3 caches).
- The authors allocate 20% of the chip power budget to leakage power.
- To derive the Pareto frontiers at 45 nm, the authors fit a cubic polynomial, P(q), to the points along the edge of the power/performance design space.
- Figure 2(d) shows the result of voltage/frequency scaling on the design points along the power/performance frontier.
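The frontier fit in Section 4.2 can be sketched with NumPy; the (performance, power) points below are made up for illustration, whereas the paper fits measured data for real 45 nm processors:

```python
import numpy as np

# Hypothetical (SPECmark, watts) points along the edge of the
# power/performance design space; the paper derives these from
# a large set of real processors.
q = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
power = np.array([2.0, 5.0, 11.0, 22.0, 40.0])

# Fit a cubic polynomial P(q) to the edge points, as in Section 4.2.
P = np.poly1d(np.polyfit(q, power, deg=3))
print(P(18.0))  # estimated frontier power at performance q = 18
```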
4.3 Device Scaling × Core Scaling
- Performance, measured in SPECmark, is assumed to scale linearly with frequency.
- This is an optimistic assumption, which ignores the effects of memory latency and bandwidth on the performance.
- Figures 2(e) and 2(f) show the scaled Pareto frontiers for the ITRS and conservative scaling schemes.
- Conservative scaling, however, suggests that performance will increase only by 34%, and power will decrease by 74%.
5. MULTICORE MODEL
- The authors first present a simple upper-bound model (CmpMU) for multicore scaling that builds upon Amdahl's Law to estimate the speedup of area- and power-constrained multicores.
- Their models describe symmetric, asymmetric, dynamic, and composed multicore topologies, considering area as the constraint and using Pollack's rule (the performance of a core is proportional to the square root of its area) to estimate the performance of multicores.
- The authors extend their approach to build the multicore model that incorporates application behavior, microarchitectural features, and physical constraints.
- Figure 3(a), which includes both CPU and GPU data, shows that the model is optimistic.
- While their results are impressively close to Intel's empirical measurements on similar benchmarks [21], the match in the model's maximum speedup prediction (12× vs. 11× in the Intel study) is an anomaly.
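The upper-bound reasoning above can be sketched by combining Amdahl's Law with Pollack's rule for an area-constrained symmetric multicore; this is a simplified sketch (areas in arbitrary units), not the authors' actual implementation:

```python
import math

def pollack_perf(core_area):
    """Pollack's rule: single-core performance scales as sqrt(area)."""
    return math.sqrt(core_area)

def symmetric_speedup(f, die_area, core_area):
    """Amdahl's-law speedup of an area-constrained symmetric multicore
    over a single core of unit area. f is the parallel fraction."""
    n = die_area // core_area            # identical cores that fit on the die
    perf = pollack_perf(core_area)       # per-core performance
    serial_time = (1 - f) / perf         # serial portion runs on one core
    parallel_time = f / (n * perf)       # parallel portion uses all n cores
    return 1.0 / (serial_time + parallel_time)

# Example: 90% parallel code on 16 unit-area cores
print(symmetric_speedup(0.9, die_area=16, core_area=1))  # → 6.4
```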
6. DEVICE × CORE × CMP SCALING
- The authors now describe how the three models are combined to produce projections for optimal performance, number of cores, and amount of dark silicon.
- To determine the best core configuration at each technology node, the authors consider only the processor design points along the area/performance and power/performance Pareto frontiers as they represent the most efficient design points.
- The fraction of dark silicon can then be computed by subtracting the area occupied by these cores from the total die area allocated to processor cores.
- This exhaustive search is performed separately for the Amdahl's Law (CmpMU), CPU-like (CmpMR), and GPU-like (CmpMR) models.
- The authors optimistically add cores until either the power or area budget is reached.
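The dark-silicon computation described above reduces to a small function; the numbers in the example are hypothetical, not taken from the paper:

```python
def dark_silicon_fraction(n_cores, core_area, die_core_area):
    """Fraction of the die's core budget left dark once the
    power-constrained core count is fixed (Section 6)."""
    lit_area = n_cores * core_area
    return max(0.0, (die_core_area - lit_area) / die_core_area)

# Hypothetical: the power budget admits only 12 of the 32 cores that fit.
print(dark_silicon_fraction(n_cores=12, core_area=5.0, die_core_area=160.0))  # → 0.625
```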
7. SCALING AND FUTURE MULTICORES
- Then, to achieve an understanding of speedups for real workloads, the authors consider the PARSEC benchmarks and examine both CPU-like and GPU-like multicore organizations under the four topologies using their CmpMR model.
- The authors also describe sources of dark silicon and perform sensitivity studies for cache organization and memory bandwidth.
7.2 Analysis using Real Workloads
- The authors now consider PARSEC applications executing on CPU- and GPU-like chips.
- The study considers all four multicore topologies (symmetric, asymmetric, dynamic, and composed; see Table 1) using the realistic CmpMR model.
- There are two reasons for this discrepancy.
- Second, their study optimizes core count and multicore configuration for general purpose workloads similar to the PARSEC suite.
- The authors assume Fermi is optimized for graphics rendering.
7.3 Sources of Dark Silicon
- To understand whether parallelism or power is the primary source of dark silicon, the authors examine their model results while varying the power and parallelism levels alone in separate experiments, as shown in Figure 6 for the 8 nm node (2018).
- First, the authors set power to be the “only” constraint, and vary the level of parallelism in the PARSEC applications from 0.75 to 0.99, assuming programmer effort can somehow realize this.
- The markers show the level of parallelism in their current implementation.
- With conservative scaling, this best-case speedup is 6.3×.
- Eight of twelve benchmarks show no more than 10× speedup even with practically unlimited power, i.e., parallelism is the primary contributor to dark silicon.
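The parallelism-only experiment can be mirrored with a plain Amdahl's-law sweep: even with power (and thus core count) effectively unlimited, speedup is capped at 1/(1 - f). A minimal sketch:

```python
def amdahl_speedup(f, n):
    """Amdahl's-law speedup for parallel fraction f on n cores."""
    return 1.0 / ((1 - f) + f / n)

# Sweep f over the paper's 0.75-0.99 range with a huge core count;
# the speedup approaches the 1/(1 - f) ceiling.
for f in (0.75, 0.90, 0.99):
    print(f, round(amdahl_speedup(f, n=10**6), 1))
```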
7.4 Sensitivity Studies
- The authors' analysis thus far examined “typical” configurations and showed poor scalability for the multicore approach.
- The authors' model allows such studies, and shows that only small benefits are possible from such simple changes.
- The authors elaborate on two representative studies below.
- Figure 7(b) illustrates the sensitivity of PARSEC performance to the available memory bandwidth for symmetric GPU multicores at 45 nm.
7.5 Summary
- Figure 9 summarizes all the speedup projections in a single scatter plot.
- For every benchmark at each technology node, the authors plot the eight possible configurations, (CPU, GPU) × (symmetric, asymmetric, dynamic, composed).
- The solid curve indicates performance Moore's Law, i.e., doubling performance with every technology node.
- As depicted, due to the power and parallelism limitations, a significant gap exists between what is achievable and what is expected by Moore’s Law.
- Results for ITRS scaling are slightly better but not by much.
7.6 Limitations
- The authors' modeling includes certain limitations, which they argue do not significantly change the results.
- SMT support can improve the power efficiency of the cores for parallel workloads to some extent.
- There is consensus that the number of these components will increase and hence they will further eat into the power budget, reducing speedups.
- Questions may still linger on the model’s accuracy and whether its assumptions contribute to the performance projections that fall well below the ideal 32×.
- First, in all instances, the authors selected parameter values that would be favorable towards performance.
9. CONCLUSIONS
- For decades, Dennard scaling permitted more transistors, faster transistors, and more energy efficient transistors with each new process node, justifying the enormous costs required to develop each new process node.
- Dennard scaling’s failure led the industry to race down the multicore path, which for some time permitted performance scaling for parallel and multitasked workloads, permitting the economics of process scaling to hold.
- The authors believe that the ITRS projections are much too optimistic, especially in the challenging sub-22 nanometer environment.
- The conservative model the authors use in this paper more closely tracks recent history.
- There is a silver lining for architects, however.
Frequently Asked Questions (12)
Q2. How does speedup scale with memory bandwidth in the benchmarks?
As the memory bandwidth increases, the speedup improves as the bandwidth can keep more threads fed with data; however, the increases are limited by power and/or parallelism and in 10 out of 12 benchmarks speedups do not increase by more than 2× compared to the baseline, 200 GB/s.
Q3. What is the reason why the community has no clear scaling path to exploit?
With the failure of Dennard scaling–and thus slowed supply voltage scaling–core count scaling may be in jeopardy, which would leave the community with no clear scaling path to exploit continued transistor count increases.
Q4. What is the best-case speedup for a symmetric topology?
The symmetric topology achieves the lower bound on speedups; the dynamic and composed topologies achieve the upper bound, with speedups no more than 10% higher.
Q5. How many thread cores can fit in the same area as one Atom processor?
From Atom and Tesla die photo inspections, the authors estimate that 8 small MT cores, their shared L1 cache, and their thread register file can fit in the same area as one Atom processor.
Q6. What is the simplest way to calculate the performance of a multicore?
Multithreaded performance (Perf) of either a CPU-like or GPU-like multicore running a fully parallel (f = 1) multithreaded application is calculated in terms of instructions per second in Equation (1) by multiplying the number of cores (N) by the core utilization (η) and scaling by the ratio of the processor frequency to CPIexe: Perf = min(N × (freq/CPIexe) × η, BWmax/(rm × mL1 × b)).
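Equation (1) can be sketched directly as code; all parameter values in the example are hypothetical, with rates in instructions/s and bandwidth in bytes/s:

```python
def multicore_perf(n, freq, cpi_exe, eta, bw_max, r_m, m_l1, b):
    """Equation (1): performance is the minimum of the compute-limited
    rate and the memory-bandwidth-limited rate (instructions/s).
    r_m: fraction of instructions that access memory; m_l1: L1 miss
    rate; b: bytes per access; bw_max: peak memory bandwidth (B/s)."""
    compute_rate = n * (freq / cpi_exe) * eta
    memory_rate = bw_max / (r_m * m_l1 * b)
    return min(compute_rate, memory_rate)

# Hypothetical 16-core chip at 1 GHz with 200 GB/s of bandwidth:
print(multicore_perf(n=16, freq=1e9, cpi_exe=1.0, eta=0.8,
                     bw_max=2e11, r_m=0.3, m_l1=0.05, b=64))
```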
Q7. What is the main reason why the benchmarks are not optimized for parallelism?
Only four benchmarks have sufficient parallelism to even hypothetically sustain Moore’s Law level speedup, but dark silicon due to power limitations constrains what can be realized.
Q8. How is the fraction of dark silicon calculated?
The fraction of dark silicon can then be computed by subtracting the area occupied by these cores from the total die area allocated to processor cores.
Q9. What is the result of scaling down the frequency from Freqmax?
As depicted, at a fixed Vdd setting, scaling down the frequency from Freqmax results in a power/performance point inside the optimal Pareto curve, i.e., a suboptimal design point.
Q10. How much of the chip area is devoted to cache?
Across the PARSEC benchmarks, the optimal percentage of chip devoted to cache varies from 20% to 50% depending on benchmark memory access characteristics.
Q11. What is the model’s assumption about the performance of a thread?
The model assumes that each thread effectively only sees its own slice of the cache and the cache hit rate function may over or underestimate.
Q12. How does the CmpMR model compute the performance of a multicore?
To compute the overall speedup of different multicore topologies using the CmpMR model, the authors calculate the baseline multithreaded performance for all benchmarks by providing the Per f equation with the inputs corresponding to a Quad-core Nehalem at 45 nm.