
Dark silicon and the end of multicore scaling

TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.
Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.

Summary (4 min read)

2. OVERVIEW

  • Figure 1 shows how this paper combines models and empirical measurements to project multicore performance and chip utilization.
  • The authors consider ITRS Roadmap projections [18] and conservative scaling parameters from Borkar’s recent study [7].
  • The core-level model provides the maximum performance that a single-core can sustain for any given area.
  • The CPU multicore organization represents Intel Nehalem-like, heavy-weight multicore designs with fast caches and high single-thread performance.
  • The design leverages the high-performing large core for the serial portion of code and leverages the numerous small cores as well as the large core to exploit the parallel portion of code.

3. DEVICE MODEL

  • The authors consider two different technology scaling schemes to build a device scaling model.
  • The first scheme uses projections from the ITRS 2010 technology roadmap [18].
  • The second scheme, which the authors call conservative scaling, is based on predictions presented by Borkar and represents a less optimistic view [7].
  • The power scaling factor is computed using the predicted frequency, voltage, and gate capacitance scaling factors in accordance with the equation P = αCV_dd^2 f.
  • The ITRS roadmap predicts that multi-gate MOSFETs, such as FinFETs, will supersede planar bulk at 22 nm [18].

4. CORE MODEL

  • This paper uses Pareto frontiers to provide single-core power/performance and area/performance tradeoffs at each technology node while abstracting away specific details of the cores.
  • These functions are derived from the data collected for a large set of processors.
  • The power/performance Pareto frontier represents the optimal design points in terms of power and performance [16].
  • Similarly, the area/performance Pareto frontier represents the optimal design points in the area/performance design space.
  • Below, the authors first describe why separate area and power functions are required.

4.1 Decoupling Area and Power Constraints

  • Furthermore, these studies consider the power consumption of a core to be directly proportional to its transistor count.
  • This assumption makes power an area-dependent constraint.
  • Power is a function of not only area, but also supply voltage and frequency.
  • Since these no longer scale at historical rates, Pollack’s rule is insufficient for modeling core power.

4.2 Pareto Frontier Derivation

  • Figure 2(a) shows the power/performance single-core design space.
  • To derive the quadratic area/performance Pareto frontier, die photos of four microarchitectures (Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem) are used to estimate the core areas (excluding level 2 and level 3 caches).
  • The authors allocate 20% of the chip power budget to leakage power.
  • To derive the Pareto frontiers at 45 nm, the authors fit a cubic polynomial, P(q), to the points along the edge of the power/performance design space.
  • Figure 2(d) shows the result of voltage/frequency scaling on the design points along the power/performance frontier.

4.3 Device Scaling × Core Scaling

  • Performance, measured in SPECmark, is assumed to scale linearly with frequency.
  • This is an optimistic assumption, which ignores the effects of memory latency and bandwidth on the performance.
  • Figures 2(e) and 2(f) show the scaled Pareto frontiers for the ITRS and conservative scaling schemes.
  • Conservative scaling, however, suggests that performance will increase only by 34%, and power will decrease by 74%.

5. MULTICORE MODEL

  • The authors first present a simple upper-bound (CmpMU) model for multicore scaling that builds upon Amdahl’s Law to estimate the speedup of area- and power-constrained multicores.
  • Their models describe symmetric, asymmetric, dynamic, and composed multicore topologies, considering area as the constraint and using Pollack’s rule (the performance of a core is proportional to the square root of its area) to estimate the performance of multicores.
  • The authors extend their approach to build the multicore model that incorporates application behavior, microarchitectural features, and physical constraints.
  • Figure 3(a), which includes both CPU and GPU data, shows that the model is optimistic.
  • While their results are impressively close to Intel’s empirical measurements using similar benchmarks [21], the match in the model’s maximum speedup prediction (12× vs 11× in the Intel study) is an anomaly.

6. DEVICE × CORE × CMP SCALING

  • The authors now describe how the three models are combined to produce projections for optimal performance, number of cores, and amount of dark silicon.
  • To determine the best core configuration at each technology node, the authors consider only the processor design points along the area/performance and power/performance Pareto frontiers as they represent the most efficient design points.
  • The fraction of dark silicon can then be computed by subtracting the area occupied by these cores from the total die area allocated to processor cores.
  • This exhaustive search is performed separately for the Amdahl’s Law (CmpMU), CPU-like (CmpMR), and GPU-like (CmpMR) models; a sketch of the search appears after this list.
  • The authors optimistically add cores until either the power or area budget is reached.
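To make the search concrete, the Python sketch below sweeps candidate core designs along the 45 nm Pareto frontiers for the symmetric topology and keeps the configuration that maximizes speedup under fixed area and power budgets. The A(q) and P(q) fits are the frontier polynomials reported in Section 4.2 of the paper; the die area, TDP, parallel fraction, and baseline performance used here are illustrative assumptions, not values taken from the paper.

# Sketch of the exhaustive design-space search described above (symmetric topology).
import numpy as np

DIE_AREA = 111.0   # mm^2, assumed area budget for cores
TDP      = 125.0   # W, assumed chip power budget
f        = 0.95    # assumed parallel fraction of the workload
q_base   = 10.0    # assumed baseline core performance (SPECmark)

def A(q):  # area/performance Pareto frontier at 45 nm (mm^2), from Section 4.2
    return 0.0152 * q**2 + 0.0265 * q + 7.4393

def P(q):  # power/performance Pareto frontier at 45 nm (W), from Section 4.2
    return 0.0002 * q**3 + 0.0009 * q**2 + 0.3859 * q - 0.0301

best = None
for q in np.linspace(5, 40, 100):                  # 100 design points along the frontier
    n = int(min(DIE_AREA / A(q), TDP / P(q)))      # cores allowed by area and power
    if n < 1:
        continue
    s_u = q / q_base                               # single-core speedup over the baseline
    speedup = 1.0 / ((1 - f) / s_u + f / (n * s_u))   # Amdahl-style symmetric upper bound
    dark = 1.0 - (n * A(q)) / DIE_AREA             # fraction of the core budget left unused
    if best is None or speedup > best[0]:
        best = (speedup, q, n, dark)

speedup, q, n, dark = best
print(f"best core: q={q:.1f} SPECmark, cores={n}, speedup={speedup:.1f}x, dark fraction={dark:.0%}")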

7. SCALING AND FUTURE MULTICORES

  • To achieve an understanding of speedups for real workloads, the authors consider the PARSEC benchmarks and examine both CPU-like and GPU-like multicore organizations under the four topologies using their CmpMR model.
  • The authors also describe sources of dark silicon and perform sensitivity studies for cache organization and memory bandwidth.

7.2 Analysis using Real Workloads

  • The authors now consider PARSEC applications executing on CPU- and GPU-like chips.
  • The study considers all four symmetric, asymmetric, dynamic, and composed multicore topologies (see Table 1) using the CmpMR realistic model.
  • There are two reasons for this discrepancy.
  • Second, their study optimizes core count and multicore configuration for general purpose workloads similar to the PARSEC suite.
  • The authors assume Fermi is optimized for graphics rendering.

7.3 Sources of Dark Silicon

  • To understand whether parallelism or power is the primary source of dark silicon, the authors examine their model results with power and parallelism levels alone varying in separate experiments as shown in Figure 6 for the 8 nm node (2018).
  • First, the authors set power to be the “only” constraint, and vary the level of parallelism in the PARSEC applications from 0.75 to 0.99, assuming programmer effort can somehow realize this.
  • The markers show the level of parallelism in their current implementation.
  • With conservative scaling, this best-case speedup is 6.3×.
  • Eight of twelve benchmarks show no more than 10× speedup even with practically unlimited power, i.e., parallelism is the primary contributor to dark silicon.

7.4 Sensitivity Studies

  • The authors’ analysis thus far examined “typical” configurations and showed poor scalability for the multicore approach.
  • The authors’ model allows such studies and shows that only small benefits are possible from such simple changes.
  • The authors elaborate on two representative studies below.
  • Figure 7(b) illustrates the sensitivity of PARSEC performance to the available memory bandwidth for symmetric GPU multicores at 45 nm.

7.5 Summary

  • Figure 9 summarizes all the speedup projections in a single scatter plot.
  • For every benchmark at each technology node, the authors plot the eight possible configurations, (CPU, GPU) × (symmetric, asymmetric, dynamic, composed).
  • The solid curve indicates the Moore’s Law performance target of doubling performance with every technology node.
  • As depicted, due to the power and parallelism limitations, a significant gap exists between what is achievable and what is expected by Moore’s Law.
  • Results for ITRS scaling are slightly better but not by much.

7.6 Limitations

  • The authors’ modeling includes certain limitations, which they argue do not significantly change the results.
  • SMT support can improve the power efficiency of the cores for parallel workloads to some extent.
  • There is consensus that the number of these components will increase and hence they will further eat into the power budget, reducing speedups.
  • Questions may still linger on the model’s accuracy and whether its assumptions contribute to the performance projections that fall well below the ideal 32×.
  • First, in all instances, the authors selected parameter values that would be favorable towards performance.

9. CONCLUSIONS

  • For decades, Dennard scaling permitted more transistors, faster transistors, and more energy efficient transistors with each new process node, justifying the enormous costs required to develop each new node.
  • Dennard scaling’s failure led the industry to race down the multicore path, which for some time permitted performance scaling for parallel and multitasked workloads and allowed the economics of process scaling to hold.
  • The authors believe that the ITRS projections are much too optimistic, especially in the challenging sub-22 nanometer environment.
  • The conservative model the authors use in this paper more closely tracks recent history.
  • There is a silver lining for architects, however.


Appears in the Proceedings of the 38th International Symposium on Computer Architecture (ISCA ’11)

Dark Silicon and the End of Multicore Scaling

Hadi Esmaeilzadeh (University of Washington), Emily Blem (University of Wisconsin-Madison), Renée St. Amant (The University of Texas at Austin), Karthikeyan Sankaralingam (University of Wisconsin-Madison), Doug Burger (Microsoft Research)

hadianeh@cs.washington.edu  blem@cs.wisc.edu  stamant@cs.utexas.edu  karu@cs.wisc.edu  dburger@microsoft.com
ABSTRACT
Since 2005, processor designers have increased core counts to exploit Moore’s Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9× average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
Categories and Subject Descriptors: C.0 [Computer Systems Organization]: General - Modeling of computer architecture; C.0 [Computer Systems Organization]: General - System architectures
General Terms: Design, Measurement, Performance
Keywords: Dark Silicon, Modeling, Power, Technology Scaling, Multicore
1. INTRODUCTION
Moore’s Law [23] (the doubling of transistors on chip every 18 months) has been a fundamental driver of computing. For the past three decades, through device, circuit, microarchitecture,
architecture, and compiler advances, Moore’s Law, coupled with Dennard scaling [11], has resulted in commensurate exponential performance increases. The recent shift to multicore designs has aimed to increase the number of cores along with transistor count increases, and continue the proportional scaling of performance. As a result, architecture researchers have started focusing on 100-core and 1000-core chips and related research topics and called for changes to the undergraduate curriculum to solve the parallel programming challenge for multicore designs at these scales.
With the failure of Dennard scaling–and thus slowed supply voltage scaling–core count scaling may be in jeopardy, which would leave the community with no clear scaling path to exploit continued transistor count increases. Since future designs will be power limited, higher core counts must provide performance gains despite the worsening energy and speed scaling of transistors, and given the available parallelism in applications. By studying these characteristics together, it is possible to predict for how many additional technology generations multicore scaling will provide a clear benefit. Since the energy efficiency of devices is not scaling along with integration capacity, and since few applications (even from emerging domains such as recognition, mining, and synthesis [5]) have parallelism levels that can efficiently use a 100-core or 1000-core chip, it is critical to understand how good multicore performance will be in the long term. In 2024, will processors have 32 times the performance of processors from 2008, exploiting five generations of core doubling?
Such a study must consider devices, core microarchitectures, chip organizations, and benchmark characteristics, applying area and power limits at each technology node. This paper considers all those factors together, projecting upper-bound performance achievable through multicore scaling, and measuring the effects of non-ideal device scaling, including the percentage of “dark silicon” (transistor under-utilization) on future multicore chips. Additional projections include best core organization, best chip-level topology, and optimal number of cores.
We consider technology scaling projections, single-core design scaling, multicore design choices, actual application behavior, and microarchitectural features together. Previous studies have also analyzed these features in various combinations, but not all together [8, 9, 10, 14, 15, 20, 22, 27, 28]. This study builds and combines three models to project performance and fraction of “dark silicon” on fixed-size and fixed-power chips as listed below:
  • Device scaling model (DevM): area, frequency, and power requirements at future technology nodes through 2024.
  • Core scaling model (CorM): power/performance and area/performance single-core Pareto frontiers derived from a large set of diverse microprocessor designs.
  • Multicore scaling model (CmpM): area, power and performance of any application for “any” chip topology for CPU-like and GPU-like multicore performance.
  • DevM × CorM: Pareto frontiers at future technology nodes; any performance improvements for future cores will come only at the cost of area or power as defined by these curves.
  • CmpM × DevM × CorM and an exhaustive state-space search: maximum multicore speedups for future technology nodes while enforcing area, power, and benchmark constraints.

Figure 1: Overview of the models and the methodology
The results from this study provide detailed best-case multicore performance speedups for future technologies considering real applications from the PARSEC benchmark suite [5]. Our results evaluating the PARSEC benchmarks and our upper-bound analysis confirm the following intuitive arguments:
i) Contrary to conventional wisdom on performance improvements from using multicores, over five technology generations, only 7.9× average speedup is possible using ITRS scaling.
ii) While transistor dimensions continue scaling, power limitations curtail the usable chip fraction. At 22 nm (i.e. in 2012), 21% of the chip will be dark and at 8 nm, over 50% of the chip will not be utilized using ITRS scaling.
iii) Neither CPU-like nor GPU-like multicore designs are sufficient to achieve the expected performance speedup levels. Radical microarchitectural innovations are necessary to alter the power/performance Pareto frontier to deliver speed-ups commensurate with Moore’s Law.
2. OVERVIEW
Figure 1 shows how this paper combines models and empirical measurements to project multicore performance and chip utilization. There are three components used in our approach:
Device scaling model (DevM): We build a device-scaling model that provides the area, power, and frequency scaling factors at technology nodes from 45 nm to 8 nm. We consider ITRS Roadmap projections [18] and conservative scaling parameters from Borkar’s recent study [7].
Core scaling model (CorM): The core-level model provides the maximum performance that a single-core can sustain for any given area. Further, it provides the minimum power (or energy) that must be consumed to sustain this level of performance. To quantify, we measure the core performance in terms of SPECmark. We consider empirical data from a large set of processors and use curve fitting to obtain the Pareto-optimal frontiers for single-core area/performance and power/performance tradeoffs.
Multicore scaling model (CmpM): We model two mainstream classes of multicore organizations, multi-core CPUs and many-thread GPUs, which represent two extreme points in the threads-per-core spectrum. The CPU multicore organization represents Intel Nehalem-like, heavy-weight multicore designs with fast caches and high single-thread performance. The GPU multicore organization represents NVIDIA Tesla-like lightweight cores with heavy multithreading support and poor single-thread performance. For each multicore organization, we consider four topologies: symmetric, asymmetric, dynamic, and composed (also called “fused” in the literature).
Symmetric Multicore: The symmetric, or homogeneous, multicore topology consists of multiple copies of the same core operating at the same voltage and frequency setting. In a symmetric multicore, the resources, including the power and the area budget, are shared equally across all cores.
Asymmetric Multicore: The asymmetric multicore topology consists of one large monolithic core and many identical small cores. The design leverages the high-performing large core for the serial portion of code and leverages the numerous small cores as well as the large core to exploit the parallel portion of code.
Dynamic Multicore: The dynamic multicore topology is a variation of the asymmetric multicore topology. During parallel code portions, the large core is shut down and, conversely, during the serial portion, the small cores are turned off and the code runs only on the large core [8, 26].
Composed Multicore: The composed multicore topology consists of a collection of small cores that can logically fuse together to compose a high-performance large core for the execution of the serial portion of code [17, 19]. In either serial or parallel cases, the large core or the small cores are used exclusively.
Table 1 outlines the design space we explore and explains the roles of the cores during serial and parallel portions of applications.

Table 1: CPU and GPU topologies (ST Core: Single-Thread Core and MT Core: Many-Thread Core)

CPU Multicores
  Serial:   Symmetric: 1 ST Core | Asymmetric: 1 Large ST Core | Dynamic: 1 Large ST Core | Composed: 1 Large ST Core
  Parallel: Symmetric: N ST Cores | Asymmetric: 1 Large ST Core + N Small ST Cores | Dynamic: N Small ST Cores | Composed: N Small ST Cores

GPU Multicores
  Serial:   Symmetric: 1 MT Core (1 thread) | Asymmetric: 1 Large ST Core (1 thread) | Dynamic: 1 Large ST Core (1 thread) | Composed: 1 Large ST Core (1 thread)
  Parallel: Symmetric: N MT Cores (multiple threads) | Asymmetric: 1 Large ST Core (1 thread) + N Small MT Cores (multiple threads) | Dynamic: N Small MT Cores (multiple threads) | Composed: N Small MT Cores (multiple threads)
Table 2: Scaling factors for ITRS and Conservative projections.

ITRS projections (relative to 45 nm):
  Year  Node   Frequency  Vdd   Capacitance  Power
  2010  45 nm  1.00       1.00  1.00         1.00
  2012  32 nm  1.09       0.93  0.70         0.66
  2015  22 nm  2.38       0.84  0.33         0.54
  2018  16 nm  3.21       0.75  0.21         0.38
  2021  11 nm  4.17       0.68  0.13         0.25
  2024   8 nm  3.85       0.62  0.08         0.12
  (31% frequency increase and 35% power reduction per node)

Conservative projections (relative to 45 nm):
  Year  Node   Frequency  Vdd   Capacitance  Power
  2008  45 nm  1.00       1.00  1.00         1.00
  2010  32 nm  1.10       0.93  0.75         0.71
  2012  22 nm  1.19       0.88  0.56         0.52
  2014  16 nm  1.25       0.86  0.42         0.39
  2016  11 nm  1.30       0.84  0.32         0.29
  2018   8 nm  1.34       0.84  0.24         0.22
  (6% frequency increase and 23% power reduction per node)

Note: the ITRS roadmap uses extended planar bulk transistors at 45 nm and 32 nm and multi-gate transistors from 22 nm onward (see Section 3).
Single-thread (ST) cores are uni-processor style cores with large caches and many-thread (MT) cores are GPU-style cores with smaller caches; both are described in more detail in Section 5.
This paper describes an analytic model that provides system-level performance using as input the core’s performance (obtained from CorM) and the multicore’s organization (CPU-like or GPU-like). Unlike previous studies, the model considers application behavior, its memory access pattern, the amount of thread-level parallelism in the workload, and microarchitectural features such as cache size, memory bandwidth, etc. We choose the PARSEC benchmarks because they represent a set of highly parallel applications that are widely studied in the research community.
Heterogeneous configurations such as AMD Fusion and Intel Sandy Bridge combine CPU and GPU designs on a single chip. The asymmetric and dynamic GPU topologies resemble those two designs, and the composed topology models configurations similar to AMD Bulldozer. For GPU-like multicores, this study assumes that the single ST core does not participate in parallel work. Finally, our methodology implicitly models heterogeneous cores of different types (mix of issue widths, frequencies, etc.) integrated on one chip. Since we perform a per-benchmark optimal search for each organization and topology, we implicitly cover the upper-bound of this heterogeneous case.
3. DEVICE MODEL
We consider two different technology scaling schemes to build a device scaling model. The first scheme uses projections from the ITRS 2010 technology roadmap [18]. The second scheme, which we call conservative scaling, is based on predictions presented by Borkar and represents a less optimistic view [7]. The parameters used for calculating the power and performance scaling factors are summarized in Table 2. For ITRS scaling, frequency is assumed to scale linearly with respect to FO4 inverter delay. The power scaling factor is computed using the predicted frequency, voltage, and gate capacitance scaling factors in accordance with the equation P = αCV_dd^2 f. The ITRS roadmap predicts that multi-gate MOSFETs, such as FinFETs, will supersede planar bulk at 22 nm [18]. Table 2 also highlights the key difference between the two projections. Details on how we handle the partitioning between leakage power and dynamic power are explained in Section 4.2.
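As a quick sanity check, the Table 2 power factors can be reproduced from the frequency, Vdd, and capacitance factors via the dynamic power relation above. The short Python sketch below uses the ITRS rows of Table 2; the activity factor α cancels because everything is expressed relative to 45 nm.

# Reproduce the ITRS power scaling factors in Table 2 from P = alpha * C * Vdd^2 * f,
# relative to the 45 nm node.
itrs = {              # node (nm): (frequency, Vdd, capacitance) scaling factors
    32: (1.09, 0.93, 0.70),
    22: (2.38, 0.84, 0.33),
    16: (3.21, 0.75, 0.21),
    11: (4.17, 0.68, 0.13),
     8: (3.85, 0.62, 0.08),
}
for node, (freq, vdd, cap) in itrs.items():
    power = cap * vdd**2 * freq      # alpha cancels when taking the ratio to 45 nm
    print(f"{node} nm: power scaling factor ~ {power:.2f}")
# Prints roughly 0.66 at 32 nm down to 0.12 at 8 nm, matching the Power column of Table 2.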
4. CORE MODEL
This paper uses Pareto frontiers to provide single-core power/performance and area/performance tradeoffs at each technology node while abstracting away specific details of the cores. The Pareto-optimal core model provides two functions, A(q) and P(q), representing the area/performance and power/performance tradeoff Pareto frontiers, where q is the single-threaded performance of a core measured in SPECmarks. These functions are derived from the data collected for a large set of processors. The power/performance Pareto frontier represents the optimal design points in terms of power and performance [16]. Similarly, the area/performance Pareto frontier represents the optimal design points in the area/performance design space. Below, we first describe why separate area and power functions are required. Then, we describe the basic model and empirical data used to derive the actual Pareto frontier curves at 45 nm. Finally, we project these power and area Pareto frontiers to future technology nodes using the device scaling model.
4.1 Decoupling Area and Power Constraints
Previous studies on multicore performance modeling [8, 9, 10, 15, 20, 22, 28] use Pollack’s rule [6] to denote the tradeoff between transistor count and performance. Furthermore, these studies consider the power consumption of a core to be directly proportional to its transistor count. This assumption makes power an area-dependent constraint. However, power is a function of not only area, but also supply voltage and frequency. Since these no longer scale at historical rates, Pollack’s rule is insufficient for modeling core power. Thus, it is necessary to decouple area and power into two independent constraints.
4.2 Pareto Frontier Derivation
Figure 2(a) shows the power/performance single-core design space. We populated the depicted design space by collecting data for 152 real processors (from P54C Pentium to Nehalem-based i7) fabricated at various technology nodes from 600 nm through 45 nm. As shown, the boundary of the design space that comprises the power/performance optimal points constructs the Pareto frontier. Each processor’s performance is collected from the SPEC website [25] and the processor’s power is the TDP reported in its datasheet. Thermal design power, TDP, is the chip power budget: the amount of power the chip can dissipate without exceeding the transistor junction temperature. In Figure 2(a), the x-axis is the SPEC CPU2006 [25] score (SPECmark) of the processor, and the y-axis is the core power budget. All SPEC scores are converted to SPEC CPU2006 scores.

Figure 2: Deriving the area/performance and power/performance Pareto frontiers. (a) Power/performance design space across technology nodes from 600 nm to 45 nm, with the 45 nm Pareto frontier. (b) Power/performance frontier at 45 nm: P(q) = 0.0002q^3 + 0.0009q^2 + 0.3859q - 0.0301, fit over Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem points. (c) Area/performance frontier at 45 nm: A(q) = 0.0152q^2 + 0.0265q + 7.4393. (d) Effect of voltage and frequency scaling on points along the 45 nm frontier. (e) ITRS frontier scaling from 45 nm to 8 nm. (f) Conservative frontier scaling from 45 nm to 8 nm. (Axes: performance in SPECmark versus core power in W or core area in mm^2; plot data omitted.)
Empirical data for the core model: To build a technology-scalable
model, we consider a family of processors at one technology node
(45 nm) and construct the frontier for that technology node. We
used 20 representative Intel and AMD processors at 45 nm (Atom
Z520, Atom 230, Atom D510, C2Duo T9500, C2Extreme QX9650,
C2Q-Q8400, Opteron 2393SE, Opteron 2381HE, C2Duo E7600,
C2Duo E8600, C2Quad Q9650, C2Quad QX9770, C2Duo T9900,
Pentium SU2700, Xeon E5405, Xeon E5205, Xeon X3440, Xeon
E7450, i7-965 ExEd). The power/performance design space and
the cubic Pareto frontier at 45 nm, P(q), are depicted in Figure 2(b).
To derive the quadratic area/performance Pareto frontier (Figure 2(c)), die photos of four microarchitectures, including Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem, are used to estimate the core areas (excluding level 2 and level 3 caches). The Intel Atom Z520 with a 2.2 W total TDP represents the lowest power design (lower-left frontier point), and the Nehalem-based Intel Core i7-965 Extreme Edition with a 130 W total TDP represents the highest performing (upper-right frontier point). Other low-power architectures, such as those from ARM and Tilera, were not included because their SPECmark scores were not available for a meaningful performance comparison.
Since the focus of this work is to study the impact of power con-
straints on logic scaling rather than cache scaling, we derive the
Pareto frontiers using only the portion of chip power budget (TDP)
allocated to each core. To compute the power budget of a single
core, the power budget allocated to the level 2 and level 3 caches
is estimated and deducted from the chip TDP. In the case of a mul-
ticore CPU, the remainder of the chip power budget is divided by
the number of cores, resulting in the power budget allocated to a
single core (1.89 W for the Atom core in Z520 and 31.25 W for
each Nehalem core in i7-965 Extreme Edition). We allocate 20%
of the chip power budget to leakage power. As shown in [24], the
transistor threshold voltage can be selected so that the maximum
leakage power is always an acceptable ratio of the chip power bud-
get while still meeting the power and performance constraints. We
also observe that with 10% or 30% leakage power, we do not see
significant changes in optimal configurations.
Deriving the core model: To derive the Pareto frontiers at 45 nm, we fit a cubic polynomial, P(q), to the points along the edge of the power/performance design space. We fit a quadratic polynomial (Pollack’s rule), A(q), to the points along the edge of the area/performance design space. We used the least square regression method for curve fitting such that the frontiers enclose all design points. Figures 2(b) and 2(c) show the 45 nm processor points and identify the power/performance and area/performance Pareto frontiers. The power/performance cubic polynomial P(q) function (Figure 2(b)) and the area/performance quadratic polynomial A(q) (Figure 2(c)) are the outputs of the core model. The points along the Pareto frontier are used as the search space for determining the best core configuration by the multicore-scaling model. We discretized the frontier into 100 points to consider 100 different core designs.
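The fitting step above can be sketched as follows; the data points here are placeholders rather than the paper’s 45 nm measurements, and shifting the fitted curve until it lies on or below every design point is one possible way to realize the “frontiers enclose all design points” constraint.

# Minimal sketch of the frontier-fitting step (assumed, illustrative data).
import numpy as np

perf  = np.array([5.0, 9.0, 14.0, 22.0, 28.0, 35.0])   # SPECmark (placeholder values)
power = np.array([2.0, 4.5,  8.0, 14.0, 19.0, 28.0])   # core power in W (placeholder values)

coeffs = np.polyfit(perf, power, 3)      # least-squares cubic fit, P(q)
fit    = np.poly1d(coeffs)

# Shift the curve down so that no measured point falls below it, i.e. the
# frontier bounds the design space from the power-optimal side.
offset = max(0.0, float(np.max(fit(perf) - power)))
P = lambda q: fit(q) - offset

q_grid = np.linspace(perf.min(), perf.max(), 100)   # 100 candidate core designs
print([round(float(P(q)), 2) for q in q_grid[:3]])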
Voltage and frequency scaling: When deriving the Pareto frontiers, each processor data point was assumed to operate at its optimal voltage (Vdd_min) and frequency setting (Freq_max). Figure 2(d) shows the result of voltage/frequency scaling on the design points along the power/performance frontier. As depicted, at a fixed Vdd setting, scaling down the frequency from Freq_max results in a power/performance point inside of the optimal Pareto curve, or a suboptimal design point. Scaling voltage up, on the other hand, and operating at a new Vdd_min and Freq_max setting, results in a different power/performance point along the frontier. Since we investigate all the points along the Pareto frontier to find the optimal multicore configuration, voltage and frequency scaling does not require special consideration in our study. If an application dissipates less than the power budget, we assume that voltage and frequency scaling will be utilized to achieve the highest possible performance with the minimum power increase. This is possible since voltage and frequency scaling only changes the operating condition in a Pareto-optimal fashion. Hence, we do not need to measure per-benchmark power explicitly as reported in a recent study [12].

Table 3: CmpM_U equations: corollaries of Amdahl’s Law for power-constrained multicores.

Symmetric:
  N_Sym(q) = min( DIE_AREA / A(q), TDP / P(q) )
  Speedup_Sym(f, q) = 1 / [ (1 - f)/S_U(q) + f/(N_Sym(q) · S_U(q)) ]

Asymmetric:
  N_Asym(q_L, q_S) = min( (DIE_AREA - A(q_L)) / A(q_S), (TDP - P(q_L)) / P(q_S) )
  Speedup_Asym(f, q_L, q_S) = 1 / [ (1 - f)/S_U(q_L) + f/(N_Asym(q_L, q_S) · S_U(q_S) + S_U(q_L)) ]

Dynamic:
  N_Dyn(q_L, q_S) = min( (DIE_AREA - A(q_L)) / A(q_S), TDP / P(q_S) )
  Speedup_Dyn(f, q_L, q_S) = 1 / [ (1 - f)/S_U(q_L) + f/(N_Dyn(q_L, q_S) · S_U(q_S)) ]

Composed:
  N_Composed(q_L, q_S) = min( DIE_AREA / ((1 + τ) · A(q_S)), (TDP - P(q_L)) / P(q_S) )
  Speedup_Composed(f, q_L, q_S) = 1 / [ (1 - f)/S_U(q_L) + f/(N_Composed(q_L, q_S) · S_U(q_S)) ]
4.3 Device Scaling × Core Scaling
To study core scaling in future technology nodes, we scaled the 45 nm Pareto frontiers to 8 nm by scaling the power and performance of each processor data point using the projected DevM scaling factors and then re-fitting the Pareto optimal curves at each technology node. Performance, measured in SPECmark, is assumed to scale linearly with frequency. This is an optimistic assumption, which ignores the effects of memory latency and bandwidth on the performance. Thus actual performance through scaling is likely to be lower. Figures 2(e) and 2(f) show the scaled Pareto frontiers for the ITRS and conservative scaling schemes. Based on the optimistic ITRS roadmap predictions, scaling a microarchitecture (core) from 45 nm to 8 nm will result in a 3.9× performance improvement and an 88% reduction in power consumption. Conservative scaling, however, suggests that performance will increase only by 34%, and power will decrease by 74%.
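A single frontier point makes the projection step concrete; the sketch below applies the ITRS 8 nm factors from Table 2 to an assumed 45 nm frontier point (the point itself is illustrative, not taken from the paper).

# Project an assumed 45 nm frontier point to 8 nm using the ITRS factors from
# Table 2: performance scales with the frequency factor, power with the power factor.
freq_scale, power_scale = 3.85, 0.12        # ITRS, 8 nm relative to 45 nm

q_45, p_45 = 20.0, 10.0                     # assumed (SPECmark, W) frontier point
q_8, p_8 = q_45 * freq_scale, p_45 * power_scale
print(f"45 nm ({q_45}, {p_45} W) -> 8 nm ({q_8:.0f}, {p_8:.1f} W)")
# Roughly 3.9x performance and an 88% power reduction, as stated above; the
# scaled points would then be re-fit to produce the 8 nm frontier.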
5. MULTICORE MODEL
We first present a simple upper-bound model (CmpM_U) for multicore scaling that builds upon Amdahl’s Law to estimate the speedup of area- and power-constrained multicores. To account for microarchitectural features and application behavior, we then develop a detailed chip-level model (CmpM_R) for CPU-like and GPU-like multicore organizations with different topologies. Both models use the A(q) and P(q) frontiers from the core-scaling model.
5.1 Amdahl’s Law Upper-bounds: CmpM_U
Hill and Marty extended Amdahl’s Law [1] to study a range of multicore topologies by considering the fraction of parallel code in a workload [15]. Their models describe symmetric, asymmetric, dynamic, and composed multicore topologies, considering area as the constraint and using Pollack’s rule (the performance of a core is proportional to the square root of its area) to estimate the performance of multicores. We extend their work and incorporate power as a primary design constraint, independent of area. Then, we determine the optimal number of cores and speedup for each topology. The CmpM_U model does not differentiate between CPU-like and GPU-like architectures, since it abstracts away the chip organization.
Per Amdahl’s Law [1], system speedup is 1 / ((1 - f) + f/S), where f represents the portion that can be optimized, or enhanced, and S represents the speedup achievable on the enhanced portion. In the case of parallel processing with perfect parallelization, f can be thought of as the parallel portion of code and S as the number of processor cores. Table 3 lists the derived corollaries for each multicore topology, where TDP is the chip power budget and DIE_AREA is the area budget. The q parameter denotes the performance of a single core. Speedup is measured against a baseline core with performance q_Baseline. The upper-bound speedup of a single core over the baseline is computed as S_U(q) = q / q_Baseline.
Symmetric Multicore: The parallel fraction (f) is distributed across the N_Sym(q) cores, each of which has S_U(q) speedup over the baseline. The serial code fraction, 1 - f, runs only on one core.
Asymmetric Multicore: All cores (including the large core) contribute to execution of the parallel code. Terms q_L and q_S denote the performance of the large core and a single small core, respectively. The number of small cores is bounded by the power consumption or area of the large core.
Dynamic Multicore: Unlike the asymmetric case, if power is the dominant constraint, the number of small cores is not bounded by the power consumption of the large core. However, if area is the dominant constraint, the number of small cores is bounded by the area of the large core.
Composed Multicore: The area overhead of supporting the composed topology is τ. Thus, the area of the small cores increases by a factor of (1 + τ). No power overhead is assumed for the composability support in the small cores. We assume that τ increases from 10% up to 400% depending on the total area of the composed core. We assume the performance of the composed core cannot exceed the performance of a scaled single-core Nehalem at 45 nm. The composed core consumes the power of a same-size uniprocessor core.
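For reference, two of the Table 3 corollaries translate directly into code; the sketch below assumes the A(q) and P(q) frontier functions, the area and power budgets, and the baseline performance q_base are supplied by the caller.

# Symmetric and dynamic corollaries from Table 3 (CmpM_U), written as
# functions of the frontier fits A(q), P(q) and the chip budgets.
def s_u(q, q_base):
    return q / q_base                       # upper-bound single-core speedup over the baseline

def speedup_sym(f, q, A, P, die_area, tdp, q_base):
    n = min(die_area / A(q), tdp / P(q))    # N_Sym(q): cores allowed by area and power
    return 1.0 / ((1 - f) / s_u(q, q_base) + f / (n * s_u(q, q_base)))

def speedup_dyn(f, q_l, q_s, A, P, die_area, tdp, q_base):
    # The large core is off during parallel sections, so power does not reserve
    # a share for it; area still does.
    n = min((die_area - A(q_l)) / A(q_s), tdp / P(q_s))     # N_Dyn(q_L, q_S)
    return 1.0 / ((1 - f) / s_u(q_l, q_base) + f / (n * s_u(q_s, q_base)))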
5.2 Realistic Performance Model: CmpM_R
The above corollaries provide a strict upper-bound on parallel performance, but do not have the level of detail required to explore microarchitectural features (cache organization, memory bandwidth, number of threads per core, etc.) and workload behavior (memory access pattern and level of multithreaded parallelism in the application). Guz et al. proposed a model to consider first-order impacts of these additional microarchitectural features [13]. We extend their approach to build the multicore model that incorporates application behavior, microarchitectural features, and physical constraints. Using this model, we consider single-threaded cores with large caches to cover the CPU multicore design space and massively threaded cores with minimal caches to cover the GPU multicore design space. For each of these multicore organizations, we consider the four possible topologies.
The CmpM_R model formulates the performance of a multicore in terms of chip organization (CPU-like or GPU-like), frequency, CPI, cache hierarchy, and memory bandwidth. The model also includes application behaviors such as the degree of thread-level parallelism, the frequency of load and store instructions, and the cache miss rate. To first order, the model considers stalls due to memory dependences and resource constraints (bandwidth or functional units). The input parameters to the model, and how, if at all, they are impacted by the multicore design choices, are listed in Table 4.
Microarchitectural Features
Multithreaded performance (Perf) of either a CPU-like or GPU-like multicore running a fully parallel (f = 1) and multithreaded application is calculated in terms of instructions per second in Equation (1) by multiplying the number of cores (N) by the core utilization (η) and scaling by the ratio of the processor frequency to CPI_exe:

  Perf = min( N · (freq / CPI_exe) · η, BW_max / (r_m · m_L1 · b) )    (1)
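Written out as code, Equation (1) takes the compute-bound and bandwidth-bound rates and returns their minimum; the parameter values in the example are assumptions for illustration, not inputs used in the paper.

# Equation (1): multicore throughput is the minimum of the compute-bound and
# memory-bandwidth-bound rates (instructions per second).
def perf(N, freq, cpi_exe, eta, bw_max, r_m, m_L1, b):
    compute_bound = N * (freq / cpi_exe) * eta
    bandwidth_bound = bw_max / (r_m * m_L1 * b)
    return min(compute_bound, bandwidth_bound)

# Assumed example: 16 cores at 3 GHz, CPI_exe = 1, 80% utilization, 200 GB/s peak
# bandwidth, 30% memory instructions, 5% L1 miss rate, 64-byte off-chip accesses.
print(perf(16, 3e9, 1.0, 0.8, 200e9, 0.30, 0.05, 64))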


References

[1] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, 1967.
[5] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: Characterization and architectural implications. In International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
[11] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), 1974.
[23] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), 1965; reprinted in Proceedings of the IEEE, 86(1), 1998.


"Dark silicon and the end of multico..." refers background or methods in this paper

  • ...For the past three decades, through device, circuit, microarchitecture, architecture, and compiler advances, Moore’s law, coupled with Dennard scaling, has resulted in commensurate exponential performance increases.(2) The recent shift to multicore designs aims to increase the number of cores using the increasing transistor count to continue the proportional scaling of performance....

    [...]

  • ...Their model uses area as the primary constraint and models single-core area/performance tradeoff using Pollack’s rule (Performance / ffiffiffiffiffiffiffi p Area) without considering technology trends.(2) Azizi et al....

    [...]

Frequently Asked Questions (12)
Q1. What are the contributions in "Dark silicon and the end of multicore scaling" ?

This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, the authors use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, the authors combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, the authors build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs the authors study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.

As the memory bandwidth increases, the speedup improves as the bandwidth can keep more threads fed with data; however, the increases are limited by power and/or parallelism and in 10 out of 12 benchmarks speedups do not increase by more than 2× compared to the baseline, 200 GB/s. 

With the failure of Dennard scaling–and thus slowed supply voltage scaling–core count scaling may be in jeopardy, which would leave the community with no clear scaling path to exploit continued transistor count increases. 

The symmetric topology achieves the lower bound on speedups; with speedups that are no more than 10% higher, the dynamic and composed topologies achieve the upper-bound. 

From Atom and Tesla die photo inspections, the authors estimate that 8 small MT cores, their shared L1 cache, and their thread register file can fit in the same area as one Atom processor. 

Microarchitectural Features: Multithreaded performance (Perf) of either a CPU-like or GPU-like multicore running a fully parallel (f = 1) and multithreaded application is calculated in terms of instructions per second in Equation (1) by multiplying the number of cores (N) by the core utilization (η) and scaling by the ratio of the processor frequency to CPI_exe: Perf = min( N · (freq / CPI_exe) · η, BW_max / (r_m · m_L1 · b) ).

Only four benchmarks have sufficient parallelism to even hypothetically sustain Moore’s Law level speedup, but dark silicon due to power limitations constrains what can be realized. 

The fraction of dark silicon can then be computed by subtracting the area occupied by these cores from the total die area allocated to processor cores. 

As depicted, at a fixed Vdd setting, scaling down the frequency from Freq_max results in a power/performance point inside of the optimal Pareto curve, or a suboptimal design point.

Across the PARSEC benchmarks, the optimal percentage of chip devoted to cache varies from 20% to 50% depending on benchmark memory access characteristics. 

The model assumes that each thread effectively only sees its own slice of the cache and the cache hit rate function may over or underestimate. 

To compute the overall speedup of different multicore topologies using the CmpMR model, the authors calculate the baseline multithreaded performance for all benchmarks by providing the Perf equation with the inputs corresponding to a quad-core Nehalem at 45 nm.