The future of microprocessors
Citations
Achieving High Performance and High Productivity in Next Generational Parallel Programming Languages
Exploiting concurrency and heterogeneity for energy-efficient computing
Optimizing Program Performance via Similarity, Using Feature-aware and Feature-agnostic Characterization Approaches
Pushing the Limits of Online Auto-Tuning: Machine Code Optimization in Short-Running Kernels
References
Validity of the single processor approach to achieving large scale computing capabilities
The PARSEC benchmark suite: characterization and architectural implications
Benchmarking cloud serving systems with YCSB
Design of ion-implanted MOSFET's with very small physical dimensions
The Case for Energy-Proportional Computing
The gem5 simulator
Frequently Asked Questions (19)
Q2. What are the future works in this paper?
Because the future winners are far from clear today, it is too early to predict whether some form of scaling (perhaps energy scaling) will continue or whether there will be no scaling at all. Moreover, the challenges processor design will face in the next decade will be dwarfed by the challenges posed by these alternative technologies, rendering today's challenges a warm-up exercise for what lies ahead.
Q3. What is the way to use the unused transistor-integration capacity for logic?
Aggressive voltage scaling provides an avenue for utilizing the unused transistor-integration capacity for logic to deliver higher performance.
Q4. What is the effect of the transistor on the supply voltage?
As the transistor scales, supply voltage scales down, and the threshold voltage of the transistor (when the transistor starts conducting) also scales down.
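This scaling arithmetic can be made concrete with a back-of-envelope sketch of classic (Dennard) scaling, where per-transistor dynamic power is C·V²·f. The scaling factor and normalized baseline below are illustrative assumptions, not values from the paper:

```python
# Illustrative Dennard-scaling arithmetic: per-transistor dynamic
# power is P = C * V**2 * f.
def dynamic_power(C, V, f):
    return C * V**2 * f

k = 1.4                            # assumed per-generation scaling factor
C, V, f = 1.0, 1.0, 1.0            # normalized baseline values
C2, V2, f2 = C / k, V / k, f * k   # capacitance and voltage scale down, frequency up

p0 = dynamic_power(C, V, f)
p1 = dynamic_power(C2, V2, f2)
ratio = p1 / p0                    # = 1/k**2: per-transistor power ~halves
```

Since transistor density doubles each generation while per-transistor power falls by about 2x, power density stays roughly constant; the answer to Q4 notes that this breaks down once the threshold voltage can no longer scale.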
Q5. How many bits can be connected to a cluster?
The clusters could be connected through wide (high-bandwidth), low-swing (low-energy) buses or through packet- or circuit-switched networks, depending on distance.
Q6. What is the effect of frequency of a well-tuned system?
When transistor performance increases the frequency of operation, the performance of a well-tuned system generally increases with frequency, subject to the performance limits of other parts of the system.
Q7. What is the advantage of using the unused cache?
The transistor budget from the unused cache could be used to integrate even more cores at the power density of the cache.
Q8. What is the way to achieve the highest performance and energy efficiency?
Aggressive use of customized accelerators will yield the highest performance and greatest energy efficiency on many applications.
Q9. What is the challenge for chip architects?
Chip architects must limit frequency and number of cores to keep power within reasonable bounds, but doing so severely limits improvement in microprocessor performance.
Q10. How many watts of power can be saved by limiting the data movement over the network?
In the future, data movement over these networks must be limited to conserve energy; more important, because of the large local storage, bandwidth demand on the network will be reduced.
Q11. How many transistors can be integrated into a die?
For 65 watts, the die could integrate 50 million transistors for logic and about 6MB of cache (Case C). Traditional wisdom suggests investing the maximum transistors in the 90% case, with the goal of using precious transistors to increase single-thread performance that can be applied broadly.
Q12. How many cores can be hardwired to a particular data representation or computational algorithm?
In some cases, units hardwired to a particular data representation or computational algorithm can achieve 50x–500x greater energy efficiency than a general-purpose register organization.
Q13. What is the effect of variation on the speed of the core?
Variation in the threshold voltage manifests itself as variation in the speed of the core; the slowest circuit in the core determines the core's frequency of operation, and a large core is more susceptible to a lower frequency of operation due to variations.
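Why larger cores are more susceptible can be illustrated with a small Monte Carlo sketch. The model below is an assumption for illustration (Gaussian circuit-speed variation around a nominal 1.0), not taken from the paper:

```python
import random

# Assumed model: a core's frequency is set by its slowest circuit, so a
# core with more circuits is more likely to contain an unusually slow one.
def core_frequency(n_circuits, rng):
    # circuit speeds vary around a nominal 1.0 due to threshold-voltage variation
    return min(rng.gauss(1.0, 0.05) for _ in range(n_circuits))

rng = random.Random(0)
trials = 200
small = sum(core_frequency(100, rng) for _ in range(trials)) / trials
large = sum(core_frequency(10_000, rng) for _ in range(trials)) / trials
# On average the larger core ends up with the lower operating frequency.
```

The same minimum-of-many-samples effect is why the answer suggests smaller cores tolerate variation better.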
Q14. What is the effect of the faster transistors on the performance of a system?
The faster transistors provide an additional 40% performance (increased frequency), almost doubling overall performance within the same power envelope (per scaling theory).
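The "almost doubling" claim can be checked with simple arithmetic, assuming (as an illustration) that the doubled transistor budget is converted to performance via Pollack's Rule (performance grows roughly as the square root of transistor count) and that faster transistors add the stated 40% frequency:

```python
import math

# Assumed back-of-envelope reproduction of the scaling claim:
# 2x transistors -> ~sqrt(2)x microarchitecture gain (Pollack's Rule),
# combined with ~1.4x frequency from faster transistors.
transistor_gain = 2.0
uarch_gain = math.sqrt(transistor_gain)  # ~1.41x from more transistors
freq_gain = 1.4                          # ~40% from faster transistors
overall = uarch_gain * freq_gain         # ~1.98x: "almost doubling"
```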
Q15. What are the common examples of extreme energy-efficient systems?
Some studies27,38 suggest that aggressive high-performance and extreme-energy-efficient systems may go further, eschewing the overhead of programmability features that software engineers have come to take for granted; for example, these future systems may drop hardware support for a single flat address space (which wastes energy on address manipulation), a single memory hierarchy (which incurs coherence and monitoring energy overhead), and a steady rate of execution (instead adapting to the available energy budget).
Q16. What is the main reason for the rapid growth in microprocessor performance?
For the past 20 years, rapid growth in microprocessor performance has been enabled by three key technology drivers: transistor-speed scaling, core microarchitecture techniques, and cache memories, each discussed in turn in the following sections.
Q17. How many transistors can be integrated into a single processor core?
Applying Pollack’s Rule, a single processor core with 150 million transistors will provide only about 2.5x microarchitecture performance improvement over today’s 25-million-transistor core, well shy of their 30x goal, while 80MB of cache is probably more than enough for the cores (see Table 3).
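The "about 2.5x" figure follows directly from Pollack's Rule, which states that single-core performance scales roughly with the square root of the transistor count. Using the numbers from the answer above:

```python
import math

# Pollack's Rule applied to the transistor budgets quoted in the answer:
# performance ~ sqrt(transistor count).
baseline_transistors = 25e6    # today's core
scaled_transistors = 150e6     # hypothetical larger core
speedup = math.sqrt(scaled_transistors / baseline_transistors)
# speedup ~2.45, i.e. "about 2.5x", far short of the 30x goal.
```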
Q18. How many parallel machines used irregular and circuit-switched networks?
Many older parallel machines used irregular and circuit-switched networks31,41; Figure 12 describes a return to hybrid switched networks for on-chip interconnects.
Q19. What is the difference between a customized CPU and a GPU?
Another customization approach constrains the types of parallelism that can be executed efficiently, enabling a simpler core, coordination, and memory structures; for example, many CPUs increase energy efficiency by restricting memory access structure and control flexibility in single-instruction, multiple-data or vector (SIMD) structures,1,2 while GPUs encourage programs to express structured sets of threads that can be aligned and executed efficiently.