Dark silicon and the end of multicore scaling
Summary (4 min read)
2. OVERVIEW
- Figure 1 shows how this paper combines models and empirical measurements to project multicore performance and chip utilization.
- The authors consider ITRS Roadmap projections [18] and conservative scaling parameters from Borkar’s recent study [7].
- The core-level model provides the maximum performance that a single-core can sustain for any given area.
- The CPU multicore organization represents Intel Nehalem-like, heavy-weight multicore designs with fast caches and high single-thread performance.
- The design leverages the high-performing large core for the serial portion of code and leverages the numerous small cores as well as the large core to exploit the parallel portion of code.
3. DEVICE MODEL
- The authors consider two different technology scaling schemes to build a device scaling model.
- The first scheme uses projections from the ITRS 2010 technology roadmap [18].
- The second scheme, which the authors call conservative scaling, is based on predictions presented by Borkar and represents a less optimistic view [7].
- The power scaling factor is computed using the predicted frequency, voltage, and gate capacitance scaling factors in accordance with the dynamic power equation P = αCV_dd²f.
- The ITRS roadmap predicts that multi-gate MOSFETs, such as FinFETs, will supersede planar bulk at 22 nm [18].
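The power scaling factor described above follows directly from the dynamic-power equation; a minimal sketch, where the node-to-node scaling factors are illustrative placeholders rather than actual ITRS or Borkar values:

```python
def power_scaling_factor(freq_scale, vdd_scale, cap_scale):
    """Scaling of dynamic power P = alpha * C * Vdd^2 * f when gate
    capacitance, supply voltage, and frequency each scale by the given
    factor (activity factor alpha assumed unchanged across nodes)."""
    return cap_scale * vdd_scale ** 2 * freq_scale

# Illustrative node-to-node factors (placeholders, not roadmap data):
print(power_scaling_factor(freq_scale=1.1, vdd_scale=0.93, cap_scale=0.7))
```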
4. CORE MODEL
- This paper uses Pareto frontiers to provide single-core power/performance and area/performance tradeoffs at each technology node while abstracting away specific details of the cores.
- These functions are derived from the data collected for a large set of processors.
- The power/performance Pareto frontier represents the optimal design points in terms of power and performance [16].
- Similarly, the area/performance Pareto frontier represents the optimal design points in the area/performance design space.
- Below, the authors first describe why separate area and power functions are required.
4.1 Decoupling Area and Power Constraints
- Furthermore, these studies consider the power consumption of a core to be directly proportional to its transistor count.
- This assumption makes power an area-dependent constraint.
- Power is a function of not only area, but also supply voltage and frequency.
- Since these no longer scale at historical rates, Pollack’s rule is insufficient for modeling core power.
4.2 Pareto Frontier Derivation
- Figure 2(a) shows the power/performance single-core design space.
- To derive the quadratic area/performance Pareto frontier, die photos of four microarchitectures (Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem) are used to estimate the core areas (excluding level 2 and level 3 caches).
- The authors allocate 20% of the chip power budget to leakage power.
- To derive the Pareto frontiers at 45 nm, the authors fit a cubic polynomial, P(q), to the points along the edge of the power/performance design space.
- Figure 2(d) shows the result of voltage/frequency scaling on the design points along the power/performance frontier.
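The frontier fit in Section 4.2 can be sketched with NumPy; the (performance, power) points below are made up for illustration, whereas the paper fits measured data for real 45 nm processors:

```python
import numpy as np

# Hypothetical (SPECmark, watts) points along the edge of the
# power/performance design space; the paper derives these from
# a large set of real processors.
q = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
power = np.array([2.0, 5.0, 11.0, 22.0, 40.0])

# Fit a cubic polynomial P(q) to the edge points, as in Section 4.2.
P = np.poly1d(np.polyfit(q, power, deg=3))
print(P(18.0))  # estimated frontier power at performance q = 18
```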
4.3 Device Scaling × Core Scaling
- Performance, measured in SPECmark, is assumed to scale linearly with frequency.
- This is an optimistic assumption, which ignores the effects of memory latency and bandwidth on the performance.
- Figures 2(e) and 2(f) show the scaled Pareto frontiers for the ITRS and conservative scaling schemes.
- Conservative scaling, however, suggests that performance will increase only by 34%, and power will decrease by 74%.
5. MULTICORE MODEL
- The authors first present a simple upper-bound model (CmpMU) for multicore scaling that builds upon Amdahl's Law to estimate the speedup of area- and power-constrained multicores.
- Their models describe symmetric, asymmetric, dynamic, and composed multicore topologies, considering area as the constraint and using Pollack's rule (the performance of a core is proportional to the square root of its area) to estimate the performance of multicores.
- The authors extend their approach to build the multicore model that incorporates application behavior, microarchitectural features, and physical constraints.
- Figure 3(a), which includes both CPU and GPU data, shows that the model is optimistic.
- While their results are impressively close to Intel's empirical measurements on similar benchmarks [21], the match in the model's maximum speedup prediction (12× vs. 11× in the Intel study) is an anomaly.
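The upper-bound reasoning above can be sketched by combining Amdahl's Law with Pollack's rule for an area-constrained symmetric multicore; this is a simplified sketch (areas in arbitrary units), not the authors' actual implementation:

```python
import math

def pollack_perf(core_area):
    """Pollack's rule: single-core performance scales as sqrt(area)."""
    return math.sqrt(core_area)

def symmetric_speedup(f, die_area, core_area):
    """Amdahl's-law speedup of an area-constrained symmetric multicore
    over a single core of unit area. f is the parallel fraction."""
    n = die_area // core_area            # identical cores that fit on the die
    perf = pollack_perf(core_area)       # per-core performance
    serial_time = (1 - f) / perf         # serial portion runs on one core
    parallel_time = f / (n * perf)       # parallel portion uses all n cores
    return 1.0 / (serial_time + parallel_time)

# Example: 90% parallel code on 16 unit-area cores
print(symmetric_speedup(0.9, die_area=16, core_area=1))  # → 6.4
```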
6. DEVICE × CORE × CMP SCALING
- The authors now describe how the three models are combined to produce projections for optimal performance, number of cores, and amount of dark silicon.
- To determine the best core configuration at each technology node, the authors consider only the processor design points along the area/performance and power/performance Pareto frontiers as they represent the most efficient design points.
- The fraction of dark silicon can then be computed by subtracting the area occupied by these cores from the total die area allocated to processor cores.
- This exhaustive search is performed separately for the Amdahl's Law (CmpMU), CPU-like (CmpMR), and GPU-like (CmpMR) models.
- The authors optimistically add cores until either the power or area budget is reached.
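The dark-silicon computation described above reduces to a small function; the numbers in the example are hypothetical, not taken from the paper:

```python
def dark_silicon_fraction(n_cores, core_area, die_core_area):
    """Fraction of the die's core budget left dark once the
    power-constrained core count is fixed (Section 6)."""
    lit_area = n_cores * core_area
    return max(0.0, (die_core_area - lit_area) / die_core_area)

# Hypothetical: the power budget admits only 12 of the 32 cores that fit.
print(dark_silicon_fraction(n_cores=12, core_area=5.0, die_core_area=160.0))  # → 0.625
```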
7. SCALING AND FUTURE MULTICORES
- Then, to achieve an understanding of speedups for real workloads, the authors consider the PARSEC benchmarks and examine both CPU-like and GPU-like multicore organizations under the four topologies using their CmpMR model.
- The authors also describe sources of dark silicon and perform sensitivity studies for cache organization and memory bandwidth.
7.2 Analysis using Real Workloads
- The authors now consider PARSEC applications executing on CPU- and GPU-like chips.
- The study considers all four multicore topologies (symmetric, asymmetric, dynamic, and composed; see Table 1) using the realistic CmpMR model.
- There are two reasons for this discrepancy.
- Second, their study optimizes core count and multicore configuration for general purpose workloads similar to the PARSEC suite.
- The authors assume Fermi is optimized for graphics rendering.
7.3 Sources of Dark Silicon
- To understand whether parallelism or power is the primary source of dark silicon, the authors examine their model results while varying the power and parallelism levels alone in separate experiments, as shown in Figure 6 for the 8 nm node (2018).
- First, the authors set power to be the “only” constraint, and vary the level of parallelism in the PARSEC applications from 0.75 to 0.99, assuming programmer effort can somehow realize this.
- The markers show the level of parallelism in their current implementation.
- With conservative scaling, this best-case speedup is 6.3×.
- Eight of twelve benchmarks show no more than 10× speedup even with practically unlimited power, i.e., parallelism is the primary contributor to dark silicon.
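The parallelism-only experiment can be mirrored with a plain Amdahl's-law sweep: even with power (and thus core count) effectively unlimited, speedup is capped at 1/(1 - f). A minimal sketch:

```python
def amdahl_speedup(f, n):
    """Amdahl's-law speedup for parallel fraction f on n cores."""
    return 1.0 / ((1 - f) + f / n)

# Sweep f over the paper's 0.75-0.99 range with a huge core count;
# the speedup approaches the 1/(1 - f) ceiling.
for f in (0.75, 0.90, 0.99):
    print(f, round(amdahl_speedup(f, n=10**6), 1))
```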
7.4 Sensitivity Studies
- The authors' analysis thus far examined “typical” configurations and showed poor scalability for the multicore approach.
- The authors' model allows such studies, and shows that only small benefits are possible from such simple changes.
- The authors elaborate on two representative studies below.
- Figure 7(b) illustrates the sensitivity of PARSEC performance to the available memory bandwidth for symmetric GPU multicores at 45 nm.
7.5 Summary
- Figure 9 summarizes all the speedup projections in a single scatter plot.
- For every benchmark at each technology node, the authors plot the eight possible configurations, (CPU, GPU) × (symmetric, asymmetric, dynamic, composed).
- The solid curve indicates performance Moore's Law, i.e., doubling performance with every technology node.
- As depicted, due to the power and parallelism limitations, a significant gap exists between what is achievable and what is expected by Moore’s Law.
- Results for ITRS scaling are slightly better but not by much.
7.6 Limitations
- The authors' modeling includes certain limitations, which they argue do not significantly change the results.
- SMT support can improve the power efficiency of the cores for parallel workloads to some extent.
- There is consensus that the number of these components will increase and hence they will further eat into the power budget, reducing speedups.
- Questions may still linger on the model’s accuracy and whether its assumptions contribute to the performance projections that fall well below the ideal 32×.
- First, in all instances, the authors selected parameter values that would be favorable towards performance.
9. CONCLUSIONS
- For decades, Dennard scaling permitted more transistors, faster transistors, and more energy efficient transistors with each new process node, justifying the enormous costs required to develop each new process node.
- Dennard scaling’s failure led the industry to race down the multicore path, which for some time permitted performance scaling for parallel and multitasked workloads, permitting the economics of process scaling to hold.
- The authors believe that the ITRS projections are much too optimistic, especially in the challenging sub-22 nanometer environment.
- The conservative model the authors use in this paper more closely tracks recent history.
- There is a silver lining for architects, however.
Frequently Asked Questions (12)
Q2. How does speedup scale with memory bandwidth in the benchmarks?
As the memory bandwidth increases, the speedup improves as the bandwidth can keep more threads fed with data; however, the increases are limited by power and/or parallelism and in 10 out of 12 benchmarks speedups do not increase by more than 2× compared to the baseline, 200 GB/s.
Q3. What is the reason why the community has no clear scaling path to exploit?
With the failure of Dennard scaling–and thus slowed supply voltage scaling–core count scaling may be in jeopardy, which would leave the community with no clear scaling path to exploit continued transistor count increases.
Q4. What is the best-case speedup for a symmetric topology?
The symmetric topology achieves the lower bound on speedups; the dynamic and composed topologies achieve the upper bound, with speedups no more than 10% higher.
Q5. How many thread cores can fit in the same area as one Atom processor?
From Atom and Tesla die photo inspections, the authors estimate that 8 small MT cores, their shared L1 cache, and their thread register file can fit in the same area as one Atom processor.
Q6. What is the simplest way to calculate the performance of a multicore?
Multithreaded performance (Perf) of either a CPU-like or GPU-like multicore running a fully parallel (f = 1) multithreaded application is calculated in terms of instructions per second in Equation (1) by multiplying the number of cores (N) by the core utilization (η) and scaling by the ratio of the processor frequency to CPIexe: Perf = min(N × (freq/CPIexe) × η, BWmax/(rm × mL1 × b)).
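Equation (1) can be sketched directly as code; all parameter values in the example are hypothetical, with rates in instructions/s and bandwidth in bytes/s:

```python
def multicore_perf(n, freq, cpi_exe, eta, bw_max, r_m, m_l1, b):
    """Equation (1): performance is the minimum of the compute-limited
    rate and the memory-bandwidth-limited rate (instructions/s).
    r_m: fraction of instructions that access memory; m_l1: L1 miss
    rate; b: bytes per access; bw_max: peak memory bandwidth (B/s)."""
    compute_rate = n * (freq / cpi_exe) * eta
    memory_rate = bw_max / (r_m * m_l1 * b)
    return min(compute_rate, memory_rate)

# Hypothetical 16-core chip at 1 GHz with 200 GB/s of bandwidth:
print(multicore_perf(n=16, freq=1e9, cpi_exe=1.0, eta=0.8,
                     bw_max=2e11, r_m=0.3, m_l1=0.05, b=64))
```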
Q7. What is the main reason why the benchmarks are not optimized for parallelism?
Only four benchmarks have sufficient parallelism to even hypothetically sustain Moore’s Law level speedup, but dark silicon due to power limitations constrains what can be realized.
Q8. How is the fraction of dark silicon calculated?
The fraction of dark silicon can then be computed by subtracting the area occupied by these cores from the total die area allocated to processor cores.
Q9. What is the result of scaling down the frequency from Freqmax?
As depicted, at a fixed Vdd setting, scaling down the frequency from Freqmax results in a power/performance point inside the optimal Pareto curve, i.e., a suboptimal design point.
Q10. How much of the chip area is devoted to cache?
Across the PARSEC benchmarks, the optimal percentage of chip devoted to cache varies from 20% to 50% depending on benchmark memory access characteristics.
Q11. What is the model’s assumption about the performance of a thread?
The model assumes that each thread effectively only sees its own slice of the cache and the cache hit rate function may over or underestimate.
Q12. How does the CmpMR model compute the performance of a multicore?
To compute the overall speedup of different multicore topologies using the CmpMR model, the authors calculate the baseline multithreaded performance for all benchmarks by providing the Per f equation with the inputs corresponding to a Quad-core Nehalem at 45 nm.