The future of microprocessors
Citations
Achieving High Performance and High Productivity in Next Generational Parallel Programming Languages
Exploiting concurrency and heterogeneity for energy-efficient computing
Optimizing Program Performance via Similarity, Using Feature-aware and Feature-agnostic Characterization Approaches
Pushing the Limits of Online Auto-Tuning: Machine Code Optimization in Short-Running Kernels
References
Validity of the single processor approach to achieving large scale computing capabilities
The PARSEC benchmark suite: characterization and architectural implications
Benchmarking cloud serving systems with YCSB
Design of ion-implanted MOSFET's with very small physical dimensions
The Case for Energy-Proportional Computing
The gem5 simulator
Frequently Asked Questions (19)
Q2. What are the future works in this paper?
Because the future winners are far from clear today, it is too early to predict whether some form of scaling (perhaps energy scaling) will continue or whether there will be no scaling at all. Moreover, the challenges processor design will face in the next decade will be dwarfed by the challenges posed by these alternative technologies, rendering today's challenges a warm-up exercise for what lies ahead.
Q3. What is the way to use the unused transistor-integration capacity for logic?
Aggressive voltage scaling provides an avenue for utilizing the unused transistor-integration capacity for logic to deliver higher performance.
Q4. What is the effect of the transistor on the supply voltage?
As the transistor scales, supply voltage scales down, and the threshold voltage of the transistor (when the transistor starts conducting) also scales down.
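This scaling arithmetic can be made concrete with a back-of-envelope sketch of classic (Dennard) scaling, where per-transistor dynamic power is C·V²·f. The scaling factor and normalized baseline below are illustrative assumptions, not values from the paper:

```python
# Illustrative Dennard-scaling arithmetic: per-transistor dynamic
# power is P = C * V**2 * f.
def dynamic_power(C, V, f):
    return C * V**2 * f

k = 1.4                            # assumed per-generation scaling factor
C, V, f = 1.0, 1.0, 1.0            # normalized baseline values
C2, V2, f2 = C / k, V / k, f * k   # capacitance and voltage scale down, frequency up

p0 = dynamic_power(C, V, f)
p1 = dynamic_power(C2, V2, f2)
ratio = p1 / p0                    # = 1/k**2: per-transistor power ~halves
```

Since transistor density doubles each generation while per-transistor power falls by about 2x, power density stays roughly constant; the answer to Q4 notes that this breaks down once the threshold voltage can no longer scale.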
Q5. How many bits can be connected to a cluster?
The clusters could be connected through wide (high-bandwidth), low-swing (low-energy) buses or through packet- or circuit-switched networks, depending on distance.
Q6. What is the effect of frequency of a well-tuned system?
When transistor performance increases the frequency of operation, the performance of a well-tuned system generally increases with frequency, subject to the performance limits of other parts of the system.
Q7. What is the advantage of using the unused cache?
The transistor budget from the unused cache could be used to integrate even more cores at the power density of the cache.
Q8. What is the way to achieve the highest performance and energy efficiency?
Aggressive use of customized accelerators will yield the highest performance and greatest energy efficiency on many applications.
Q9. What is the challenge for chip architects?
Chip architects must limit frequency and number of cores to keep power within reasonable bounds, but doing so severely limits improvement in microprocessor performance.
Q10. How many watts of power can be saved by limiting the data movement over the network?
In the future, data movement over these networks must be limited to conserve energy; more important, because of the large local storage, bandwidth demand on the network will be reduced.
Q11. How many transistors can be integrated into a die?
For 65 watts, the die could integrate 50 million transistors for logic and about 6MB of cache (Case C). Traditional wisdom suggests investing the maximum transistors in the 90% case, with the goal of using precious transistors to increase single-thread performance that can be applied broadly.
Q12. How many cores can be hardwired to a particular data representation or computational algorithm?
In some cases, units hardwired to a particular data representation or computational algorithm can achieve 50x–500x greater energy efficiency than a general-purpose register organization.
Q13. What is the effect of variation on the speed of the core?
Variation in the threshold voltage manifests itself as variation in the speed of the core; the slowest circuit in the core determines the core's frequency of operation, and a large core is more susceptible to a lower frequency of operation due to variations.
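Why larger cores are more susceptible can be illustrated with a small Monte Carlo sketch. The model below is an assumption for illustration (Gaussian circuit-speed variation around a nominal 1.0), not taken from the paper:

```python
import random

# Assumed model: a core's frequency is set by its slowest circuit, so a
# core with more circuits is more likely to contain an unusually slow one.
def core_frequency(n_circuits, rng):
    # circuit speeds vary around a nominal 1.0 due to threshold-voltage variation
    return min(rng.gauss(1.0, 0.05) for _ in range(n_circuits))

rng = random.Random(0)
trials = 200
small = sum(core_frequency(100, rng) for _ in range(trials)) / trials
large = sum(core_frequency(10_000, rng) for _ in range(trials)) / trials
# On average the larger core ends up with the lower operating frequency.
```

The same minimum-of-many-samples effect is why the answer suggests smaller cores tolerate variation better.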
Q14. What is the effect of the faster transistors on the performance of a system?
The faster transistors provide an additional 40% performance (increased frequency), almost doubling overall performance within the same power envelope (per scaling theory).
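The "almost doubling" claim can be checked with simple arithmetic, assuming (as an illustration) that the doubled transistor budget is converted to performance via Pollack's Rule (performance grows roughly as the square root of transistor count) and that faster transistors add the stated 40% frequency:

```python
import math

# Assumed back-of-envelope reproduction of the scaling claim:
# 2x transistors -> ~sqrt(2)x microarchitecture gain (Pollack's Rule),
# combined with ~1.4x frequency from faster transistors.
transistor_gain = 2.0
uarch_gain = math.sqrt(transistor_gain)  # ~1.41x from more transistors
freq_gain = 1.4                          # ~40% from faster transistors
overall = uarch_gain * freq_gain         # ~1.98x: "almost doubling"
```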
Q15. What are the common examples of extreme energy-efficient systems?
Some studies27,38 suggest that aggressive high-performance and extreme-energy-efficient systems may go further, eschewing the overhead of programmability features that software engineers have come to take for granted; for example, these future systems may drop hardware support for a single flat address space (which wastes energy on address manipulation), a single memory hierarchy (which incurs coherence and monitoring energy overhead), and a steady rate of execution (instead adapting to the available energy budget).
Q16. What is the main reason for the rapid growth in microprocessor performance?
For the past 20 years, rapid growth in microprocessor performance has been enabled by three key technology drivers: transistor-speed scaling, core microarchitecture techniques, and cache memories, each discussed in turn in the following sections.
Q17. How many transistors can be integrated into a single processor core?
Applying Pollack’s Rule, a single processor core with 150 million transistors will provide only about 2.5x microarchitecture performance improvement over today’s 25-million-transistor core, well shy of their 30x goal, while 80MB of cache is probably more than enough for the cores (see Table 3).
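The "about 2.5x" figure follows directly from Pollack's Rule, which states that single-core performance scales roughly with the square root of the transistor count. Using the numbers from the answer above:

```python
import math

# Pollack's Rule applied to the transistor budgets quoted in the answer:
# performance ~ sqrt(transistor count).
baseline_transistors = 25e6    # today's core
scaled_transistors = 150e6     # hypothetical larger core
speedup = math.sqrt(scaled_transistors / baseline_transistors)
# speedup ~2.45, i.e. "about 2.5x", far short of the 30x goal.
```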
Q18. How many parallel machines used irregular and circuit-switched networks?
Many older parallel machines used irregular and circuit-switched networks31,41; Figure 12 describes a return to hybrid switched networks for on-chip interconnects.
Q19. What is the difference between a customized CPU and a GPU?
Another customization approach constrains the types of parallelism that can be executed efficiently, enabling a simpler core, coordination, and memory structures; for example, many CPUs increase energy efficiency by restricting memory access structure and control flexibility in single-instruction, multiple-data or vector (SIMD) structures,1,2 while GPUs encourage programs to express structured sets of threads that can be aligned and executed efficiently.