
The Future of Microprocessors

01 May 2011-Communications of The ACM (ACM)-Vol. 54, Iss: 5, pp 67-77
TL;DR: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.
Abstract: Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.

Summary (1 min read)

20 Years of Exponential Performance Gains

  • For the past 20 years, rapid growth in microprocessor performance has been enabled by three key technology drivers: transistor-speed scaling, core microarchitecture techniques, and cache memories, discussed in turn in the following sections.
  • This might sound simple but is increasingly difficult to continue for reasons discussed later.
  • Classical transistor scaling provided three major benefits that made possible rapid growth in compute performance.
  • Figure 4b outlines the increasing speed disparity, growing from 10s to 100s of processor clock cycles per memory access.

The Next 20 Years

  • Microprocessor technology has delivered three-orders-of-magnitude performance improvement over the past two decades, so continuing this trajectory would require at least 30x performance increase by 2020.
  • With limited supply-voltage scaling, energy and power reduction is limited, adversely affecting further integration of transistors.
  • Multiple cores and customization will be the major drivers for future microprocessor performance (total chip performance).
  • In extreme cases, high-performance computing and embedded applications may even manage these complexities explicitly.

Conclusion

  • The past 20 years were truly the great old days for Moore's Law scaling and microprocessor performance; dramatic improvements in transistor density, speed, and energy, combined with microarchitecture and memory-hierarchy techniques, delivered 1,000-fold microprocessor performance improvement.
  • The next 20 years (the pretty good new days, as progress continues) will be more difficult, with Moore's Law scaling producing continuing improvement in transistor density but comparatively little improvement in transistor speed and energy.
  • The pretty good old days of scaling that processor design faces today are helping prepare us for these new challenges.
  • The authors thank him and the members of and presenters to the working groups for valuable, insightful discussions over the past few years.


DOI: 10.1145/1941487.1941507
BY SHEKHAR BORKAR AND ANDREW A. CHIEN

key insights

  • Moore's Law continues but demands radical changes in architecture and software.
  • Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware.
  • Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.

Microprocessors—single-chip computers—are the building blocks of the information world. Their performance has grown 1,000-fold over the past 20 years, driven by transistor speed and energy scaling, as well as by microarchitecture advances that exploited the transistor density gains from Moore's Law. In the next two decades, diminishing transistor-speed scaling and practical energy limits create new challenges for continued performance scaling. As a result, the frequency of operations will increase slowly, with energy the key limiter of performance, forcing designs to use large-scale parallelism, heterogeneous cores, and accelerators to achieve performance and energy efficiency. Software-hardware partnership to achieve efficient data orchestration is increasingly critical in the drive toward energy-proportional computing.

Our aim here is to reflect and project the macro trends shaping the future of microprocessors and sketch in broad strokes where processor design is going. We enumerate key research challenges and suggest promising research directions. Since dramatic changes are coming, we also seek to inspire the research community to invent new ideas and solutions that address how to sustain computing's exponential improvement.

Microprocessors (see Figure 1) were invented in 1971,28 but it's difficult today to believe any of the early inventors could have conceived their extraordinary evolution in structure and use over the past 40 years. Microprocessors today not only involve complex microarchitectures and multiple execution engines (cores) but have grown to include all sorts of additional functions, including floating-point units, caches, memory controllers, and media-processing engines. However, the defining characteristics of a microprocessor remain—a single semiconductor chip embodying the primary computation (data transformation) engine in a computing system.

Because our own greatest access and insight involves Intel designs and data, our graphs and estimates draw heavily on them. In some cases, they may not be representative of the entire industry but certainly represent a large fraction. Such a forthright view, solidly grounded, best supports our goals for this article.
20 Years of Exponential Performance Gains

For the past 20 years, rapid growth in microprocessor performance has been enabled by three key technology drivers—transistor-speed scaling, core microarchitecture techniques, and cache memories—discussed in turn in the following sections:

Transistor-speed scaling. The MOS transistor has been the workhorse for decades, scaling in performance by nearly five orders of magnitude and providing the foundation for today's unprecedented compute performance. The basic recipe for technology scaling was laid down by Robert N. Dennard of IBM17 in the early 1970s and followed for the past three decades. The scaling recipe calls for reducing transistor dimensions by 30% every generation (two years) and keeping electric fields constant everywhere in the transistor to maintain reliability. This might sound simple but is increasingly difficult to continue for reasons discussed later. Classical transistor scaling provided three major benefits that made possible rapid growth in compute performance.

First, the transistor dimensions are scaled by 30% (0.7x), their area shrinks 50%, doubling the transistor density every technology generation—the fundamental reason behind Moore's Law. Second, as the transistor is scaled, its performance increases by about 40% (0.7x delay reduction, or 1.4x frequency increase), providing higher system performance. Third, to keep the electric field constant, supply voltage is reduced by 30%, reducing energy by 65%, or power (at 1.4x frequency) by 50% (active power = CV²f). Putting it all together, in every technology generation transistor integration doubles, circuits are 40% faster, and system power consumption (with twice as many transistors) stays the same. This serendipitous scaling (almost too good to be true) enabled three-orders-of-magnitude increase in microprocessor performance over the past 20 years. Chip architects exploited transistor density to create complex architectures and transistor speed to increase frequency, achieving it all within a reasonable power and energy envelope.
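To see how these per-generation factors compound, the short sketch below (illustrative only, not from the article) applies the classical scaling rules quoted above, 0.7x dimensions, 2x density, 1.4x frequency, and 0.7x supply voltage per generation, and tracks relative chip power as CV²f summed over all transistors.

```python
# A minimal model of classical Dennard scaling, using the per-generation factors
# quoted above: 0.7x dimensions, 2x density, 1.4x frequency, 0.7x supply voltage.
# Relative chip power is modeled as (C per device) * V^2 * f * (device count),
# with C per device shrinking ~0.7x along with the dimensions.

def dennard_scale(generations: int):
    density, freq, vdd, cap = 1.0, 1.0, 1.0, 1.0
    for _ in range(generations):
        density *= 2.0   # 0.7x linear shrink -> ~0.5x area -> 2x transistors per die
        freq *= 1.4      # ~40% faster transistors
        vdd *= 0.7       # keep the electric field constant
        cap *= 0.7       # device capacitance scales with dimensions
    chip_power = cap * vdd**2 * freq * density
    return density, freq, vdd, chip_power

for g in (1, 5, 10):
    d, f, v, p = dennard_scale(g)
    print(f"{g:2d} generation(s): {d:6.0f}x transistors, {f:5.1f}x frequency, "
          f"{v:.2f}x Vdd, {p:.2f}x chip power")
```

With these idealized factors, chip power stays essentially flat (about 0.96x per generation) even as transistor count and frequency climb, which is the serendipitous scaling the text describes; the later sections explain why the 0.7x voltage step is the part that no longer holds.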
Core microarchitecture techniques. Advanced microarchitectures have deployed the abundance of transistor-integration capacity, employing a dizzying array of techniques, including pipelining, branch prediction, out-of-order execution, and speculation, to deliver ever-increasing performance. Figure 2 outlines advances in microarchitecture, showing increases in die area and performance and energy efficiency (performance/watt), all normalized in the same process technology. It uses characteristics of Intel microprocessors (such as 386, 486, Pentium, Pentium Pro, and Pentium 4), with performance measured by benchmark SpecInt (92, 95, and 2000 representing the current benchmark for the era) at each data point.
It compares each microarchitecture advance with a design without the advance (such as introducing an on-die cache by comparing 486 to 386 in 1μ technology and superscalar microarchitecture of Pentium in 0.7μ technology with 486).

Figure 1. Evolution of Intel microprocessors 1971–2009: Intel 4004 (1971), 1 core, no cache, 23K transistors; Intel 8088 (1978), 1 core, no cache, 29K transistors; Intel Nehalem-EX (2009), 8 cores, 24MB cache, 2.3B transistors.

Figure 2. Architecture advances and energy efficiency: relative increase (x) in die area, integer performance, FP performance, and integer performance/watt for each transition, 386 to 486 (on-die cache, pipelined), 486 to Pentium (superscalar), Pentium to P6 (OOO, speculative), P6 to Pentium 4 (deep pipeline), and Pentium 4 to Core (back to non-deep pipeline).
This data shows that on-die caches and pipeline architectures used transistors well, providing a significant performance boost without compromising energy efficiency. In this era, superscalar and out-of-order architectures provided sizable performance benefits at a cost in energy efficiency. Of these architectures, deep-pipelined design seems to have delivered the lowest performance increase for the same area and power increase as out-of-order and speculative design, incurring the greatest cost in energy efficiency. The term "deep pipelined architecture" describes deeper pipeline, as well as other circuit and microarchitectural techniques (such as trace cache and self-resetting domino logic) employed to achieve even higher frequency. Evident from the data is that reverting to a non-deep pipeline reclaimed energy efficiency by dropping these expensive and inefficient techniques.

When transistor performance increases frequency of operation, the performance of a well-tuned system generally increases, with frequency subject to the performance limits of other parts of the system. Historically, microarchitecture techniques exploiting the growth in available transistors have delivered performance increases empirically described by Pollack's Rule,32 whereby performance increases (when not limited by other parts of the system) as the square root of the number of transistors or area of a processor (see Figure 3). According to Pollack's Rule, each new technology generation doubles the number of transistors on a chip, enabling a new microarchitecture that delivers a 40% performance increase. The faster transistors provide an additional 40% performance (increased frequency), almost doubling overall performance within the same power envelope (per scaling theory). In practice, however, implementing a new microarchitecture every generation is difficult, so microarchitecture gains are typically less. In recent microprocessors, the increasing drive for energy efficiency has caused designers to forego many of these microarchitecture techniques.

As Pollack's Rule broadly captures area, power, and performance tradeoffs from several generations of microarchitecture, we use it as a rule of thumb to estimate single-thread performance in various scenarios throughout this article.
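As a concrete reading of this rule of thumb, the sketch below (illustrative, not part of the original article) combines the square-root microarchitecture gain from a doubled transistor budget with whatever frequency gain a generation still provides.

```python
import math

def single_thread_gain(transistor_ratio: float, freq_gain: float = 1.0) -> float:
    """Estimated single-thread speedup: sqrt(transistor ratio) per Pollack's Rule,
    multiplied by the frequency gain the process generation still delivers."""
    return math.sqrt(transistor_ratio) * freq_gain

# Classic generation: 2x transistors plus ~1.4x faster transistors.
print(f"classic generation:      {single_thread_gain(2.0, 1.4):.2f}x")  # ~1.98x
# Density-only generation: 2x transistors, little or no frequency gain.
print(f"density-only generation: {single_thread_gain(2.0):.2f}x")       # ~1.41x
```

The same arithmetic is applied later in the article: a single core grown to 150 million transistors offers only about sqrt(150/25) ≈ 2.5x the microarchitecture performance of today's 25-million-transistor core.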
Cache memory architecture. Dynamic memory technology (DRAM) has also advanced dramatically with Moore's Law over the past 40 years but with different characteristics. For example, memory density has doubled nearly every two years, while performance has improved more slowly (see Figure 4a). This slower improvement in cycle time has produced a memory bottleneck that could reduce a system's overall performance. Figure 4b outlines the increasing speed disparity, growing from 10s to 100s of processor clock cycles per memory access. It has lately flattened out due to the flattening of processor clock frequency. Unaddressed, the memory-latency gap would have eliminated and could still eliminate most of the benefits of processor improvement.

The reason for slow improvement of DRAM speed is practical, not technological. It's a misconception that DRAM technology based on capacitor storage is inherently slower; rather, the memory organization is optimized for density and lower cost, making it slower. The DRAM market has demanded large capacity at minimum cost over speed, depending on small and fast caches on the microprocessor die to emulate high-performance memory by providing the necessary bandwidth and low latency based on data locality. The emergence of sophisticated, yet effective, memory hierarchies allowed DRAM to emphasize density and cost over speed. At first, processors used a single level of cache, but, as processor speed increased, two to three levels of cache hierarchies were introduced to span the growing speed gap between processor and memory.33,37 In these hierarchies, the lowest-level caches were small but fast enough to match the processor's needs in terms of high bandwidth and low latency; higher levels of the cache hierarchy were then optimized for size and speed.

Figure 3. Increased performance vs. area in the same process technology follows Pollack's Rule: integer performance scales roughly as Sqrt(area), a slope of 0.5 on log-log axes, across the 386-to-486, 486-to-Pentium, Pentium-to-P6, P6-to-Pentium 4, and Pentium 4-to-Core transitions.

Figure 4. DRAM density and performance, 1980–2010: (a) relative DRAM density vs. DRAM speed; (b) the growing gap, measured in CPU clocks per DRAM latency.
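The article does not quantify how much latency a hierarchy hides, but the standard average-memory-access-time model makes the point; the hit rates and cycle counts below are assumptions chosen only to be plausible, with the 100-plus-cycle memory access taken from the disparity in Figure 4b.

```python
def amat(levels, memory_latency_cycles):
    """Average memory access time for a simple serial cache hierarchy.

    `levels` is a list of (hit_rate, access_latency_cycles) tuples ordered from
    the level closest to the core outward; misses fall through to main memory.
    """
    time, reach = 0.0, 1.0            # reach = fraction of accesses that get this far
    for hit_rate, latency in levels:
        time += reach * latency        # every access reaching this level pays its latency
        reach *= (1.0 - hit_rate)      # only the misses continue outward
    return time + reach * memory_latency_cycles

# Illustrative (assumed) numbers: small fast L1, larger L2 and L3, 200-cycle DRAM.
print(f"no cache     : {amat([], 200):5.1f} cycles")
print(f"L1 only      : {amat([(0.90, 2)], 200):5.1f} cycles")
print(f"L1 + L2      : {amat([(0.90, 2), (0.95, 12)], 200):5.1f} cycles")
print(f"L1 + L2 + L3 : {amat([(0.90, 2), (0.95, 12), (0.80, 40)], 200):5.1f} cycles")
```

With these assumed numbers a single small cache already cuts the average access from 200 to about 22 cycles, and two or three levels bring it down to a handful, which is why DRAM can keep optimizing for density and cost while the on-die hierarchy supplies the bandwidth and latency.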
Figure 5 outlines the evolution of on-die caches over the past two decades, plotting cache capacity (a) and percentage of die area (b) for Intel microprocessors. At first, cache sizes increased slowly, with decreasing die area devoted to cache, and most of the available transistor budget was devoted to core microarchitecture advances. During this period, processors were probably cache-starved. As energy became a concern, increasing cache size for performance has proven more energy efficient than additional core-microarchitecture techniques requiring energy-intensive logic. For this reason, more and more transistor budget and die area are allocated in caches.

The transistor-scaling-and-microarchitecture-improvement cycle has been sustained for more than two decades, delivering 1,000-fold performance improvement. How long will it continue? To better understand and predict future performance, we decouple performance gain due to transistor speed and microarchitecture by comparing the same microarchitecture on different process technologies and new microarchitectures with the previous ones, then compound the performance gain.

Figure 6 divides the cumulative 1,000-fold Intel microprocessor performance increase over the past two decades into performance delivered by transistor speed (frequency) and due to microarchitecture. Almost two orders of magnitude of this performance increase is due to transistor speed alone, now leveling off due to the numerous challenges described in the following sections.
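Because the two contributions multiply, the split shown in Figure 6 can be sanity-checked with one line of arithmetic (the rounded values are taken from the text above, not new data):

```python
total_gain = 1_000      # cumulative performance growth over roughly 20 years
frequency_gain = 100    # "almost two orders of magnitude" attributed to transistor speed
print(f"implied microarchitecture contribution: ~{total_gain / frequency_gain:.0f}x")
```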
The Next 20 Years

Microprocessor technology has delivered three-orders-of-magnitude performance improvement over the past two decades, so continuing this trajectory would require at least 30x performance increase by 2020. Microprocessor-performance scaling faces new challenges (see Table 1) precluding use of energy-inefficient microarchitecture innovations developed over the past two decades. Further, chip architects must face these challenges with an ongoing industry expectation of a 30x performance increase in the next decade and 1,000x increase by 2030 (see Table 2).

Table 1. New technology scaling challenges.
Decreased transistor scaling benefits: despite continuing miniaturization, little performance improvement and little reduction in switching energy (decreasing performance benefits of scaling) [ITRS].
Flat total energy budget: package power and mobile/embedded computing drive energy-efficiency requirements.

Table 2. Ongoing technology scaling.
Increasing transistor density (in area and volume) and count: through continued feature scaling, process innovations, and packaging innovations.
Need for increasing locality and reduced bandwidth per operation: as performance of the microprocessor increases, and the data sets for applications continue to grow.

Figure 5. Evolution of on-die caches: (a) on-die cache capacity (KB) and (b) on-die cache as a percentage of total die area, across process generations from 1μ to 65nm.

Figure 6. Performance increase separated into transistor speed and microarchitecture performance: (a) integer and (b) floating-point performance relative to transistor performance, across process generations from 1.5μ to 65nm.

Figure 7. Unconstrained evolution of a microprocessor results in excessive power consumption: projected power (watts) for an unconstrained 100mm² die over time.
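A quick back-of-the-envelope check of that extrapolation (simple arithmetic, not spelled out in the article): 1,000x over 20 years is roughly 1.41x per year, so ten more years on the same trajectory means

$$
1000^{10/20} = \sqrt{1000} \approx 31.6,
$$

consistent with the stated 30x-by-2020 expectation, and holding the trajectory for two decades compounds to the 1,000x-by-2030 figure.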
As the transistor scales, supply voltage scales down, and the threshold voltage of the transistor (when the transistor starts conducting) also scales down. But the transistor is not a perfect switch, leaking some small amount of current when turned off, increasing exponentially with reduction in the threshold voltage. In addition, the exponentially increasing transistor-integration capacity exacerbates the effect; as a result, a substantial portion of power consumption is due to leakage. To keep leakage under control, the threshold voltage cannot be lowered further and, indeed, must increase, reducing transistor performance.10
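For reference, the exponential dependence mentioned here is the standard subthreshold-leakage relation, which the article does not spell out; in its usual textbook form the off-state current grows as the threshold voltage $V_{th}$ is reduced:

$$
I_{\text{leak}} \;\propto\; e^{-V_{th} / (n\,kT/q)},
$$

where $kT/q \approx 26\,\text{mV}$ at room temperature and $n$ is a process-dependent factor somewhat above 1, so each reduction of roughly 60mV to 100mV in $V_{th}$ raises leakage by about an order of magnitude. That is why the threshold voltage, and with it the supply voltage, cannot keep scaling down.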
As transistors have reached atomic dimensions, lithography and variability pose further scaling challenges, affecting supply-voltage scaling.11 With limited supply-voltage scaling, energy and power reduction is limited, adversely affecting further integration of transistors. Therefore, transistor-integration capacity will continue with scaling, though with limited performance and power benefit. The challenge for chip architects is to use this integration capacity to continue to improve performance.

Package power/total energy consumption limits number of logic transistors. If chip architects simply add more cores as transistor-integration capacity becomes available and operate the chips at the highest frequency the transistors and designs can achieve, then the power consumption of the chips would be prohibitive (see Figure 7). Chip architects must limit frequency and number of cores to keep power within reasonable bounds, but doing so severely limits improvement in microprocessor performance.

Consider the transistor-integration capacity affordable in a given power envelope for reasonable die size. For regular desktop applications the power envelope is around 65 watts, and the die size is around 100mm². Figure 8 outlines a simple analysis for 45nm process technology node; the x-axis is the number of logic transistors integrated on the die, and the two y-axes are the amount of cache that would fit and the power the die would consume. As the number of logic transistors on the die increases (x-axis), the size of the cache decreases, and power dissipation increases. This analysis assumes average activity factor for logic and cache observed in today's microprocessors. If the die integrates no logic at all, then the entire die could be populated with about 16MB of cache and consume less than 10 watts of power, since caches consume less power than logic (Case A). On the other hand, if it integrates no cache at all, then it could integrate 75 million transistors for logic, consuming almost 90 watts of power (Case B). For 65 watts, the die could integrate 50 million transistors for logic and about 6MB of cache (Case C).
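The shape of this tradeoff is easy to reproduce with a two-coefficient model; the watts-per-transistor and watts-per-megabyte values below are simply back-fitted to Cases A and B above, not measured data.

```python
# Back-fitted coefficients for the 45nm, 100mm^2, 65W desktop example above:
#   Case B: ~90 W for 75M logic transistors  -> ~1.2 W per million logic transistors
#   Case A: ~8 W for 16MB of cache, no logic -> ~0.5 W per MB of cache
W_PER_MT_LOGIC = 90.0 / 75.0
W_PER_MB_CACHE = 8.0 / 16.0
DIE_CAPACITY_MT = 75.0    # millions of logic transistors that fill the die
DIE_CAPACITY_MB = 16.0    # MB of cache that fill the die

def die_budget(logic_mt: float):
    """Cache that fits alongside `logic_mt` million logic transistors, and total power."""
    cache_mb = DIE_CAPACITY_MB * (1.0 - logic_mt / DIE_CAPACITY_MT)  # leftover area -> cache
    power_w = logic_mt * W_PER_MT_LOGIC + cache_mb * W_PER_MB_CACHE
    return cache_mb, power_w

for mt in (0.0, 50.0, 75.0):
    cache, power = die_budget(mt)
    print(f"{mt:4.0f}M logic transistors -> {cache:4.1f} MB cache, {power:5.1f} W")
```

With just those two fitted constants, 50 million logic transistors leave roughly 5MB to 6MB of cache and land near the 65-watt budget, matching Case C.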
Death of 90/10 Optimization, Rise of 10×10 Optimization

Traditional wisdom suggests investing maximum transistors in the 90% case, with the goal of using precious transistors to increase single-thread performance that can be applied broadly. In the new scaling regime typified by slow transistor performance and energy improvement, it often makes no sense to add transistors to a single core as energy efficiency suffers. Using additional transistors to build more cores produces a limited benefit—increased performance for applications with thread parallelism. In this world, 90/10 optimization no longer applies. Instead, optimizing with an accelerator for a 10% case, then another for a different 10% case, then another 10% case can often produce a system with better overall energy efficiency and performance. We call this "10×10 optimization,"14 as the goal is to attack performance as a set of 10% optimization opportunities—a different way of thinking about transistor cost, operating the chip with 10% of the transistors active—90% inactive, but a different 10% at each point in time.

Historically, transistors on a chip were expensive due to the associated design effort, validation and testing, and ultimately manufacturing cost. But 20 generations of Moore's Law and advances in design and validation have shifted the balance. Building systems where the 10% of the transistors that can operate within the energy budget are configured optimally (an accelerator well-suited to the application) may well be the right solution. The choice of 10 cases is illustrative, and a 5×5, 7×7, 10×10, or 12×12 architecture might be appropriate for a particular design.
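To make the sidebar's contrast concrete, here is a toy Amdahl's Law comparison (entirely illustrative; the coverage fractions and speedups are invented): one general-purpose improvement applied to 90% of the work versus several accelerators that each target a different 10% slice.

```python
def overall_speedup(parts):
    """Amdahl's Law for a workload split into (fraction, speedup) pieces."""
    assert abs(sum(f for f, _ in parts) - 1.0) < 1e-9, "fractions must sum to 1"
    return 1.0 / sum(f / s for f, s in parts)

# 90/10 style: one assumed 1.5x general improvement covering 90% of the work.
ninety_ten = overall_speedup([(0.9, 1.5), (0.1, 1.0)])

# 10x10 style: five assumed accelerators, each delivering 10x on a different
# 10% slice, with the remaining half of the work left unaccelerated.
ten_by_ten = overall_speedup([(0.1, 10.0)] * 5 + [(0.5, 1.0)])

print(f"90/10-style speedup: {ninety_ten:.2f}x")   # ~1.43x
print(f"10x10-style speedup: {ten_by_ten:.2f}x")   # ~1.82x
```

Even in this time-only toy, several narrow optimizations can beat one broad one with these made-up numbers; the sidebar's stronger argument is about energy, since each accelerator runs its slice far more efficiently while the other 90% of the transistors stay inactive.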
Figure 8. Transistor integration capacity at a fixed power envelope (2008, 45nm, 100mm² die): total power (watts) and cache size (MB) vs. logic transistors (millions). Case A: no logic, 16MB of cache, 8W; Case B: 75M logic transistors, no cache, almost 90W; Case C: 50M logic transistors, 6MB cache, 65W.

Citations
Book
30 Apr 2010
TL;DR: This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model using the open-source Hadoop implementation, with a focus on scalability and the tradeoffs associated with distributed processing of large datasets.
Abstract: This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model [1], using the open-source Hadoop implementation. The focus will be on scalability and the tradeoffs associated with distributed processing of large datasets. Content will include general discussions about algorithm design, presentation of illustrative algorithms, case studies in HLT applications, as well as practical advice in writing Hadoop programs and running Hadoop clusters.

538 citations

Journal ArticleDOI
01 Mar 2017
TL;DR: The gradual "end of Moore's law" discussed by the authors reflects the field-effect transistor approaching physical limits to further miniaturization, with the associated rising costs and reduced return on investment appearing to slow the pace of development.
Abstract: The insights contained in Gordon Moore's now famous 1965 and 1975 papers have broadly guided the development of semiconductor electronics for over 50 years. However, the field-effect transistor is approaching some physical limits to further miniaturization, and the associated rising costs and reduced return on investment appear to be slowing the pace of development. Far from signaling an end to progress, this gradual "end of Moore's law" will open a new era in information technology as the focus of research and development shifts from miniaturization of long-established technologies to the coordinated introduction of new devices, new integration technologies, and new architectures for computing.

461 citations

Journal ArticleDOI
TL;DR: This work uses a first-published methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.
Abstract: High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today’s system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.

433 citations

Proceedings ArticleDOI
03 Mar 2012
TL;DR: An ISA extension that provides approximate operations and storage is described that gives the hardware freedom to save energy at the cost of accuracy and Truffle, a microarchitecture design that efficiently supports the ISA extensions is proposed.
Abstract: Disciplined approximate programming lets programmers declare which parts of a program can be computed approximately and consequently at a lower energy cost. The compiler proves statically that all approximate computation is properly isolated from precise computation. The hardware is then free to selectively apply approximate storage and approximate computation with no need to perform dynamic correctness checks.In this paper, we propose an efficient mapping of disciplined approximate programming onto hardware. We describe an ISA extension that provides approximate operations and storage, which give the hardware freedom to save energy at the cost of accuracy. We then propose Truffle, a microarchitecture design that efficiently supports the ISA extensions. The basis of our design is dual-voltage operation, with a high voltage for precise operations and a low voltage for approximate operations. The key aspect of the microarchitecture is its dependence on the instruction stream to determine when to use the low voltage. We evaluate the power savings potential of in-order and out-of-order Truffle configurations and explore the resulting quality of service degradation. We evaluate several applications and demonstrate energy savings up to 43%.

423 citations


Cites background from "The future of microprocessors"

  • ...Potential benefits go beyond reduced power demands in servers and longer battery life in mobile devices; reducing power consumption is becoming a requirement due to limits of device scaling in what is termed the dark silicon problem [4, 11]....

    [...]

Journal ArticleDOI
TL;DR: In this article, the authors demonstrate a silicon modulator operating with less than one femtojoule energy and are able to compensate for thermal drift over a 7.5°C temperature range.
Abstract: Optical modulators on silicon promise to deliver ultralow power communication networks between or within computer chips. Here, the authors demonstrate a silicon modulator operating with less than one femtojoule energy and are able to compensate for thermal drift over a 7.5 °C temperature range.

379 citations

References
Proceedings ArticleDOI
Gene Myron Amdahl
18 Apr 1967
TL;DR: In this paper, the authors argue that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution.
Abstract: For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.

3,653 citations

Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations

Proceedings ArticleDOI
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears
10 Jun 2010
TL;DR: This work presents the "Yahoo! Cloud Serving Benchmark" (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems, and defines a core set of benchmarks and reports results for four widely used systems.
Abstract: While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the "Yahoo! Cloud Serving Benchmark" (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible--it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.

3,276 citations


"The future of microprocessors" refers background in this paper

  • ...Shifting responsibility increases potential achievable energy efficiency, but realizing it depends on significant advances in applications, compilers and runtimes, and operating systems to understand and even predict the application and workload behavior.(7,16,19) However, these advances require radical research breakthroughs and major changes in software practice (see Table 7)....

    [...]

Journal ArticleDOI
TL;DR: This paper considers the design, fabrication, and characterization of very small MOSFET switching devices suitable for digital integrated circuits, using dimensions of the order of 1 μ.
Abstract: This paper considers the design, fabrication, and characterization of very small MOSFET switching devices suitable for digital integrated circuits, using dimensions of the order of 1 μ. Scaling relationships are presented which show how a conventional MOSFET can be reduced in size. An improved small device structure is presented that uses ion implantation, to provide shallow source and drain regions and a nonuniform substrate doping profile. One-dimensional models are used to predict the substrate doping profile and the corresponding threshold voltage versus source voltage characteristic. A two-dimensional current transport model is used to predict the relative degree of short-channel effects for different device parameter combinations. Polysilicon-gate MOSFET's with channel lengths as short as 0.5 μ were fabricated, and the device characteristics measured and compared with predicted values. The performance improvement expected from using these very small devices in highly miniaturized integrated circuits is projected.

3,008 citations

Journal ArticleDOI
Luiz Andre Barroso, Urs Hölzle
TL;DR: Energy-proportional designs would enable large energy savings in servers, potentially doubling their efficiency in real-life use, particularly the memory and disk subsystems.
Abstract: Energy-proportional designs would enable large energy savings in servers, potentially doubling their efficiency in real-life use. Achieving energy proportionality will require significant improvements in the energy usage profile of every system component, particularly the memory and disk subsystems.

2,499 citations

Frequently Asked Questions (19)
Q1. What are the contributions in this paper?

Since dramatic changes are coming, the authors also seek to inspire the research community to invent new ideas and solutions that address how to sustain computing's exponential improvement.

Because the future winners are far from clear today, it is way too early to predict whether some form of scaling (perhaps energy) will continue or there will be no scaling at all. Moreover, the challenges processor design will face in the next decade will be dwarfed by the challenges posed by these alternative technologies, rendering today's challenges a warm-up exercise for what lies ahead.

Aggressive voltage scaling provides an avenue for utilizing the unused transistor-integration capacity for logic to deliver higher performance. 

As the transistor scales, supply voltage scales down, and the threshold voltage of the transistor (when the transistor starts conducting) also scales down. 

The clusters could be connected through wide (high-bandwidth) low-swing (lowenergy) busses or through packet- or circuit-switched networks, depending on distance. 

When transistor performance increases frequency of operation, the performance of a well-tuned system generally increases, with frequency subject to the performance limits of other parts of the system. 

the transistor budget from the unused cache could be used to integrate even more cores with the power density of the cache. 

Aggressive use of customized accelerators will yield the highest performance and greatest energy efficiency on many applications. 

Chip architects must limit frequency and number of cores to keep power within reasonable bounds, but doing so severely limits improvement in microprocessor performance. 

In the future, data movement over these networks must be limited to conserve energy, and, more important, due to the large size of local storage data bandwidth, demand on the network will be reduced. 

For 65 watts, the die could integrate 50 million transistors for logic and about 6MB of cache (Case C). Traditional wisdom suggests investing maximum transistors in the 90% case, with the goal of using precious transistors to increase single-thread performance that can be applied broadly.

In some cases, units hardwired to a particular data representation or computational algorithm can achieve 50x–500x greater energy efficiency than a general-purpose register organization. 

variation in the threshold voltage manifests itself as variation in the speed of the core, the slowest circuit in the core determines the frequency of operation of the core, and a large core is more susceptible to lower frequency of operation due to variations. 

The faster transistors provide an additional 40% performance (increased frequency), almost doubling overall performance within the same power envelope (per scaling theory). 

Extreme studies27,38 suggest that aggressive high-performance and extreme-energy-efficient systems may go further, eschewing the overhead of programmability features that software engineers have come to take for granted; for example, these future systems may drop hardware support for a single flat address space (which normally wastes energy on address manipulation/computing), single-memory hierarchy (coherence and monitoring energy overhead), and steady rate of execution (adapting to the available energy budget). 

For the past 20 years, rapid growth in microprocessor performance has been enabled by three key technology drivers—transistor-speed scaling, core microarchitecture techniques, and cache memories—discussed in turn in the following sections: Transistor-speed scaling.

Applying Pollack’s Rule, a single processor core with 150 million transistors will provide only about 2.5x microarchitecture performance improvement over today’s 25-million-transistor core, well shy of their 30x goal, while 80MB of cache is probably more than enough for the cores (see Table 3). 

Many older parallel machines used irregular and circuit-switched networks31,41; Figure 12 describes a return to hybrid switched networks for on-chip interconnects. 

Another customization approach constrains the types of parallelism that can be executed efficiently, enabling a simpler core, coordination, and memory structures; for example, many CPUs increase energy efficiency by restricting memory access structure and control flexibility in single-instruction, multiple-data or vector (SIMD) structures,1,2 while GPUs encourage programs to express structured sets of threads that can be aligned and executed efficiently.