The Future of Microprocessors

BY SHEKHAR BORKAR AND ANDREW A. CHIEN
Communications of the ACM, May 2011, Vol. 54, No. 5
DOI: 10.1145/1941487.1941507

Abstract
Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

Key insights
- Moore's Law continues but demands radical changes in architecture and software.
- Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware.
- Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.
Microprocessors, single-chip computers, are the building blocks of the information world. Their performance has grown 1,000-fold over the past 20 years, driven by transistor speed and energy scaling, as well as by microarchitecture advances that exploited the transistor density gains from Moore's Law. In the next two decades, diminishing transistor-speed scaling and practical energy limits create new challenges for continued performance scaling. As a result, the frequency of operations will increase slowly, with energy the key limiter of performance, forcing designs to use large-scale parallelism, heterogeneous cores, and accelerators to achieve performance and energy efficiency. Software-hardware partnership to achieve efficient data orchestration is increasingly critical in the drive toward energy-proportional computing.

Our aim here is to reflect on and project the macro trends shaping the future of microprocessors and to sketch in broad strokes where processor design is going. We enumerate key research challenges and suggest promising research directions. Since dramatic changes are coming, we also seek to inspire the research community to invent new ideas and solutions that address how to sustain computing's exponential improvement.
Microprocessors (see Figure 1) were invented in 1971,[28] but it is difficult today to believe any of the early inventors could have conceived their extraordinary evolution in structure and use over the past 40 years. Microprocessors today not only involve complex microarchitectures and multiple execution engines (cores) but have grown to include all sorts of additional functions, including floating-point units, caches, memory controllers, and media-processing engines. However, the defining characteristics of a microprocessor remain: a single semiconductor chip embodying the primary computation (data transformation) engine in a computing system.
Because our own greatest access and insight involves Intel designs and data, our graphs and estimates draw heavily on them. In some cases, they may not be representative of the entire industry but certainly represent a large fraction. Such a forthright view, solidly grounded, best supports our goals for this article.
20 Years of Exponential Performance Gains

For the past 20 years, rapid growth in microprocessor performance has been enabled by three key technology drivers: transistor-speed scaling, core microarchitecture techniques, and cache memories. Each is discussed in turn in the following sections.
Transistor-speed scaling. The MOS transistor has been the workhorse for decades, scaling in performance by nearly five orders of magnitude and providing the foundation for today's unprecedented compute performance. The basic recipe for technology scaling was laid down by Robert H. Dennard of IBM[17] in the early 1970s and followed for the past three decades. The scaling recipe calls for reducing transistor dimensions by 30% every generation (two years) and keeping electric fields constant everywhere in the transistor to maintain reliability. This might sound simple but is increasingly difficult to continue, for reasons discussed later. Classical transistor scaling provided three major benefits that made possible rapid growth in compute performance.
First, when transistor dimensions are scaled by 30% (0.7x), their area shrinks by 50%, doubling the transistor density every technology generation, which is the fundamental reason behind Moore's Law. Second, as the transistor is scaled, its performance increases by about 40% (0.7x delay reduction, or 1.4x frequency increase), providing higher system performance. Third, to keep the electric field constant, the supply voltage is reduced by 30%, reducing energy by 65%, or power (at 1.4x frequency) by 50% (active power = CV²f). Putting it all together, in every technology generation transistor integration doubles, circuits are 40% faster, and system power consumption (with twice as many transistors) stays the same. This serendipitous scaling (almost too good to be true) enabled a three-orders-of-magnitude increase in microprocessor performance over the past 20 years. Chip architects exploited transistor density to create complex architectures and transistor speed to increase frequency, achieving it all within a reasonable power and energy envelope.
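To make the compounding concrete, here is a minimal Python sketch (ours, for illustration; it simply applies the idealized Dennard factors quoted above: 0.7x dimensions, 0.7x voltage, 1.4x frequency per generation) showing that chip power stays roughly flat while density doubles:

```python
# Idealized Dennard scaling per generation (illustrative sketch):
# dimensions x0.7, area x0.49 (~2x density), delay x0.7 (~1.4x frequency),
# voltage x0.7, so switching energy CV^2 drops to ~0.7 * 0.7^2 = 0.34x.
def dennard_generations(n):
    density, freq, cap, volt = 1.0, 1.0, 1.0, 1.0
    for gen in range(1, n + 1):
        density *= 1 / 0.49        # ~2x transistors in the same area
        freq *= 1 / 0.7            # ~1.4x frequency
        cap *= 0.7                 # capacitance per transistor shrinks
        volt *= 0.7                # constant-field voltage scaling
        # Chip power ~ (number of transistors) * C * V^2 * f
        power = density * cap * volt**2 * freq
        print(f"gen {gen}: density {density:5.1f}x, freq {freq:4.2f}x, "
              f"chip power {power:4.2f}x")

dennard_generations(4)   # power stays ~1.0x while density grows ~17x
```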
Core microarchitecture techniques. Advanced microarchitectures have deployed the abundance of transistor-integration capacity, employing a dizzying array of techniques, including pipelining, branch prediction, out-of-order execution, and speculation, to deliver ever-increasing performance. Figure 2 outlines advances in microarchitecture, showing increases in die area, performance, and energy efficiency (performance/watt), all normalized to the same process technology. It uses characteristics of Intel microprocessors (such as the 386, 486, Pentium, Pentium Pro, and Pentium 4), with performance measured by the SpecInt benchmark (92, 95, or 2000, whichever was current for the era) at each data point. It compares each microarchitecture advance with a design without the advance (such as introducing an on-die cache by comparing the 486 with the 386 in 1μ technology, and the superscalar microarchitecture of the Pentium in 0.7μ technology with the 486).

[Figure 1. Evolution of Intel microprocessors 1971–2009: Intel 4004 (1971), 1 core, no cache, 23K transistors; Intel 8088 (1978), 1 core, no cache, 29K transistors; Intel Nehalem-EX (2009), 8 cores, 24MB cache, 2.3B transistors.]

[Figure 2. Architecture advances and energy efficiency: relative increase (x) in die area, integer performance, FP performance, and integer performance/watt for each transition in the same process technology: 386 to 486 (on-die cache, pipelined), 486 to Pentium (superscalar), Pentium to P6 (out-of-order, speculative), P6 to Pentium 4 (deep pipeline), and Pentium 4 to Core (back to non-deep pipeline).]
This data shows that on-die caches and pipelined architectures used transistors well, providing a significant performance boost without compromising energy efficiency. In this era, superscalar and out-of-order architectures provided sizable performance benefits at a cost in energy efficiency. Of these architectures, deep-pipelined design seems to have delivered the lowest performance increase for the same area and power increase as out-of-order and speculative design, incurring the greatest cost in energy efficiency. The term "deep-pipelined architecture" describes a deeper pipeline, as well as other circuit and microarchitectural techniques (such as the trace cache and self-resetting domino logic) employed to achieve even higher frequency. Evident from the data is that reverting to a non-deep pipeline reclaimed energy efficiency by dropping these expensive and inefficient techniques.
When transistor performance increases the frequency of operation, the performance of a well-tuned system generally increases, with frequency subject to the performance limits of other parts of the system. Historically, microarchitecture techniques exploiting the growth in available transistors have delivered performance increases empirically described by Pollack's Rule,[32] whereby performance increases (when not limited by other parts of the system) as the square root of the number of transistors or area of a processor (see Figure 3). According to Pollack's Rule, each new technology generation doubles the number of transistors on a chip, enabling a new microarchitecture that delivers a 40% performance increase. The faster transistors provide an additional 40% performance (increased frequency), almost doubling overall performance within the same power envelope (per scaling theory). In practice, however, implementing a new microarchitecture every generation is difficult, so microarchitecture gains are typically less. In recent microprocessors, the increasing drive for energy efficiency has caused designers to forgo many of these microarchitecture techniques.

As Pollack's Rule broadly captures the area, power, and performance tradeoffs from several generations of microarchitecture, we use it as a rule of thumb to estimate single-thread performance in various scenarios throughout this article.
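As a rough illustration of how we use Pollack's Rule as a rule of thumb, the following sketch (ours; the transistor ratios are illustrative) estimates single-thread speedup from a transistor-budget ratio and an optional frequency gain:

```python
import math

def pollack_speedup(transistor_ratio, freq_ratio=1.0):
    """Rule-of-thumb single-thread speedup: square root of the transistor
    (area) ratio, times any frequency gain (Pollack's Rule)."""
    return math.sqrt(transistor_ratio) * freq_ratio

# One idealized generation: 2x transistors and 1.4x frequency
# -> ~1.41 * 1.4, roughly 2x single-thread performance.
print(pollack_speedup(2.0, freq_ratio=1.4))   # ~1.98

# Spending 6x the transistors on one core (150M vs. 25M) buys only
# ~sqrt(6) ~ 2.4x, illustrating the diminishing returns discussed later.
print(pollack_speedup(150 / 25))              # ~2.45
```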
Cache memory architecture. Dynamic memory technology (DRAM) has also advanced dramatically with Moore's Law over the past 40 years, but with different characteristics. For example, memory density has doubled nearly every two years, while performance has improved more slowly (see Figure 4a). This slower improvement in cycle time has produced a memory bottleneck that could reduce a system's overall performance. Figure 4b outlines the increasing speed disparity, growing from tens to hundreds of processor clock cycles per memory access. It has lately flattened out due to the flattening of processor clock frequency. Unaddressed, the memory-latency gap would have eliminated, and could still eliminate, most of the benefits of processor improvement.

The reason for the slow improvement of DRAM speed is practical, not technological. It is a misconception that DRAM technology based on capacitor storage is inherently slower; rather, the memory organization is optimized for density and lower cost, making it slower. The DRAM market has demanded large capacity at minimum cost over speed, depending on small and fast caches on the microprocessor die to emulate high-performance memory by providing the necessary bandwidth and low latency based on data locality. The emergence of sophisticated, yet effective, memory hierarchies allowed DRAM to emphasize density and cost over speed. At first, processors used a single level of cache, but, as processor speed increased, two to three levels of cache hierarchy were introduced to span the growing speed gap between processor and memory.[33,37]
[Figure 3. Increased performance vs. area in the same process technology follows Pollack's Rule: integer performance (x) vs. area (x) on log-log axes for the 386-to-486, 486-to-Pentium, Pentium-to-P6, P6-to-Pentium 4, and Pentium 4-to-Core transitions; performance ~ sqrt(area), slope = 0.5.]
[Figure 4. DRAM density and performance, 1980–2010: (a) relative DRAM density vs. DRAM speed (log scale); (b) the widening gap between CPU speed and memory, measured in CPU clocks per DRAM access.]

In these hierarchies, the lowest-level caches were small but fast enough to match the processor's needs in terms of high bandwidth and low latency; higher levels of the cache hierarchy were then optimized for size and speed.
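A small sketch of the standard average-memory-access-time calculation (our illustration; the hit rates and latencies are assumed values, not measurements from the article) shows why even a modest cache hierarchy hides most of the DRAM latency gap:

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.
    `levels` is a list of (hit_rate, latency_cycles) from L1 downward."""
    time, p_miss = 0.0, 1.0
    for hit_rate, latency in levels:
        time += p_miss * latency     # accesses reaching this level pay its latency
        p_miss *= (1.0 - hit_rate)   # fraction that misses and goes deeper
    return time + p_miss * memory_latency

# Illustrative numbers only (not from the article):
single_level = [(0.90, 2)]
three_level  = [(0.90, 2), (0.95, 10), (0.80, 30)]
print(amat(single_level, memory_latency=200))  # ~22 cycles per access
print(amat(three_level,  memory_latency=200))  # ~3.4 cycles per access
```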
Figure 5 outlines the evolution of on-die caches over the past two decades, plotting cache capacity (a) and percentage of die area (b) for Intel microprocessors. At first, cache sizes increased slowly, with decreasing die area devoted to cache, and most of the available transistor budget was devoted to core-microarchitecture advances. During this period, processors were probably cache-starved. As energy became a concern, increasing cache size for performance proved more energy efficient than additional core-microarchitecture techniques requiring energy-intensive logic. For this reason, more and more of the transistor budget and die area are allocated to caches.
The transistor-scaling-and-microarchitecture-improvement cycle has been sustained for more than two decades, delivering 1,000-fold performance improvement. How long will it continue? To better understand and predict future performance, we decouple the performance gain due to transistor speed from that due to microarchitecture by comparing the same microarchitecture on different process technologies, and new microarchitectures with their predecessors, then compounding the performance gains.

Figure 6 divides the cumulative 1,000-fold Intel microprocessor performance increase over the past two decades into performance delivered by transistor speed (frequency) and performance due to microarchitecture. Almost two orders of magnitude of this increase is due to transistor speed alone, now leveling off due to the numerous challenges described in the following sections.
The Next 20 Years

Microprocessor technology has delivered three orders of magnitude of performance improvement over the past two decades, so continuing this trajectory would require at least a 30x performance increase by 2020.
[Figure 5. Evolution of on-die caches: (a) on-die cache capacity (KB, log scale) and (b) on-die cache as a percentage of total die area, across process generations from 1μ to 65nm.]
[Figure 7. Unconstrained evolution of a microprocessor results in excessive power consumption: projected power (watts) for a 100mm² die, rising toward several hundred watts over 2002–2014.]
[Figure 6. Performance increase separated into transistor speed and microarchitecture: (a) integer and (b) floating-point performance relative to transistor performance, across process generations from 1.5μ to 65nm (log scale).]
Table 1. New technology scaling challenges.
Decreased transistor-scaling benefits: despite continuing miniaturization, little performance improvement and little reduction in switching energy (decreasing performance benefits of scaling) [ITRS].
Flat total energy budget: package power and mobile/embedded computing drive energy-efficiency requirements.
Table 2. Ongoing technology scaling.
Increasing transistor density (in area and volume) and count: through continued feature scaling, process innovations, and packaging innovations.
Need for increasing locality and reduced bandwidth per operation: as performance of the microprocessor increases and the data sets for applications continue to grow.

Microprocessor-performance scaling faces new challenges (see Table 1) precluding use of the energy-inefficient microarchitecture innovations developed over the past two decades. Further, chip architects must face these challenges with an ongoing industry expectation of a 30x performance increase in the next decade and a 1,000x increase by 2030 (see Table 2).
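A quick back-of-the-envelope check (ours) shows where the 30x-per-decade expectation comes from:

```python
# Back-of-the-envelope check of the historical trajectory (illustrative):
# 1,000x over 20 years corresponds to ~1.41x per year, which is roughly
# 30x per decade and ~1,000x over two decades if the pace were sustained.
per_year = 1000.0 ** (1.0 / 20.0)
print(per_year)          # ~1.41x per year
print(per_year ** 10)    # ~31.6x in a decade
print(per_year ** 20)    # ~1,000x in two decades
```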
As the transistor scales, supply voltage scales down, and the threshold voltage of the transistor (the voltage at which the transistor starts conducting) also scales down. But the transistor is not a perfect switch: it leaks some small amount of current when turned off, and this leakage increases exponentially as the threshold voltage is reduced. In addition, the exponentially increasing transistor-integration capacity exacerbates the effect; as a result, a substantial portion of power consumption is due to leakage. To keep leakage under control, the threshold voltage cannot be lowered further and, indeed, must increase, reducing transistor performance.[10]
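A minimal sketch of the textbook subthreshold-leakage relation (not from the article; it assumes a subthreshold swing of roughly 90mV per decade of current, a typical room-temperature value) shows why lowering the threshold voltage is so costly:

```python
def leakage_ratio(delta_vth_mV, swing_mV_per_decade=90.0):
    """Relative change in off-state (subthreshold) leakage when the
    threshold voltage changes by delta_vth_mV.  Textbook model:
    I_off ~ 10^(-Vth / S), where S is the subthreshold swing."""
    return 10.0 ** (-delta_vth_mV / swing_mV_per_decade)

# Lowering Vth by 90mV raises leakage ~10x; by 180mV, ~100x.
print(leakage_ratio(-90))    # ~10x more leakage
print(leakage_ratio(-180))   # ~100x more leakage
# Raising Vth cuts leakage but slows the transistor (lower gate overdrive).
print(leakage_ratio(+90))    # ~0.1x (10x less leakage)
```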
As transistors have reached atomic dimensions, lithography and variability pose further scaling challenges, affecting supply-voltage scaling.[11] With limited supply-voltage scaling, energy and power reduction is limited, adversely affecting further integration of transistors. Therefore, transistor-integration capacity will continue to grow with scaling, though with limited performance and power benefit. The challenge for chip architects is to use this integration capacity to continue to improve performance.
Package power/total energy consumption limits the number of logic transistors. If chip architects simply add more cores as transistor-integration capacity becomes available and operate the chips at the highest frequency the transistors and designs can achieve, then the power consumption of the chips would be prohibitive (see Figure 7). Chip architects must limit frequency and the number of cores to keep power within reasonable bounds, but doing so severely limits improvement in microprocessor performance.
Consider the transistor-integration capacity affordable in a given power envelope for a reasonable die size. For regular desktop applications the power envelope is around 65 watts, and the die size is around 100mm². Figure 8 outlines a simple analysis for the 45nm process technology node; the x-axis is the number of logic transistors integrated on the die, and the two y-axes are the amount of cache that would fit and the power the die would consume. As the number of logic transistors on the die increases (x-axis), the size of the cache decreases, and power dissipation increases. This analysis assumes the average activity factors for logic and cache observed in today's microprocessors. If the die integrates no logic at all, then the entire die could be populated with about 16MB of cache and consume less than 10 watts of power, since caches consume less power than logic (Case A). On the other hand, if it integrates no cache at all, then it could integrate 75 million transistors for logic, consuming almost 90 watts of power (Case B). For 65 watts, the die could integrate 50 million transistors for logic and about 6MB of cache (Case C).
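The following rough linear area/power model (our assumption, fit only to Cases A and B above; the authors' analysis also folds in measured activity factors) approximately reproduces Case C:

```python
# Rough linear area/power model fit to Cases A and B above (our assumption,
# not the authors' model): a 100mm^2 die at 45nm holds either ~16MB of cache
# (~8W, Case A) or ~75M logic transistors (~90W, Case B).
DIE_FRACTION_PER_MB = 1.0 / 16.0        # die area per MB of cache
DIE_FRACTION_PER_MT = 1.0 / 75.0        # die area per million logic transistors
WATTS_PER_MB        = 8.0 / 16.0        # ~0.5 W per MB of cache
WATTS_PER_MT        = 90.0 / 75.0       # ~1.2 W per million logic transistors

def budget(logic_mt):
    """Cache that fits alongside `logic_mt` million logic transistors,
    and the resulting total power, under the linear model."""
    cache_mb = max(0.0, (1.0 - logic_mt * DIE_FRACTION_PER_MT) / DIE_FRACTION_PER_MB)
    watts = logic_mt * WATTS_PER_MT + cache_mb * WATTS_PER_MB
    return cache_mb, watts

print(budget(50))   # ~(5.3MB, ~63W): close to Case C's 6MB at 65W
```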
Death of 90/10 Optimization, Rise of 10×10 Optimization

Traditional wisdom suggests investing maximum transistors in the 90% case, with the goal of using precious transistors to increase single-thread performance that can be applied broadly. In the new scaling regime typified by slow transistor performance and energy improvement, it often makes no sense to add transistors to a single core, as energy efficiency suffers. Using additional transistors to build more cores produces a limited benefit: increased performance for applications with thread parallelism. In this world, 90/10 optimization no longer applies. Instead, optimizing with an accelerator for a 10% case, then another for a different 10% case, then another 10% case can often produce a system with better overall energy efficiency and performance. We call this "10×10 optimization,"[14] as the goal is to attack performance as a set of 10% optimization opportunities, a different way of thinking about transistor cost: operating the chip with 10% of the transistors active and 90% inactive, but with a different 10% active at each point in time.

Historically, transistors on a chip were expensive due to the associated design effort, validation and testing, and ultimately manufacturing cost. But 20 generations of Moore's Law and advances in design and validation have shifted the balance. Building systems where the 10% of the transistors that can operate within the energy budget are configured optimally (an accelerator well-suited to the application) may well be the right solution. The choice of 10 cases is illustrative, and a 5×5, 7×7, 10×10, or 12×12 architecture might be appropriate for a particular design.
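As a toy illustration of the argument (our model and numbers, not the authors'), compare the energy of running a workload entirely on a general-purpose core against a 10×10-style design in which each 10% slice of the work runs on an accelerator with an assumed efficiency gain:

```python
# Toy model (our assumptions): the workload is split into equal slices; each
# slice runs on hardware that is `gain`x more energy-efficient than the
# general-purpose core for that slice (gain = 1 means falling back to the core).
def workload_energy(slice_gains, baseline_energy=100.0):
    per_slice = baseline_energy / len(slice_gains)
    return sum(per_slice / gain for gain in slice_gains)

general_purpose = [1.0] * 10                            # everything on the general core
ten_by_ten      = [20, 20, 10, 10, 10, 5, 5, 5, 2, 1]   # assumed per-slice accelerator gains

print(workload_energy(general_purpose))  # 100.0 (baseline)
print(workload_energy(ten_by_ten))       # 25.0: ~4x lower energy when most
                                         # slices map onto a suitable accelerator
```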
[Figure 8. Transistor-integration capacity at a fixed power envelope (2008, 45nm, 100mm² die): total power (watts) and cache size (MB) vs. logic transistors (millions). Case A: no logic, 16MB of cache, ~8W. Case B: ~75M logic transistors, no cache, ~90W. Case C: 50M logic transistors and 6MB of cache at ~65W.]

References (partial list, as indexed)

Amdahl, G.M. Validity of the single processor approach to achieving large scale computing capabilities.
Barroso, L.A. and Hölzle, U. The case for energy-proportional computing.
Bienia, C. et al. The PARSEC benchmark suite: Characterization and architectural implications.
Cooper, B.F. et al. Benchmarking cloud serving systems with YCSB.
Dennard, R.H. et al. Design of ion-implanted MOSFET's with very small physical dimensions.