
Showing papers in "IEEE Micro in 2003"


Journal ArticleDOI
TL;DR: Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software that achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
Abstract: Amenable to extensive parallelization, Google's web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.

1,129 citations


Journal ArticleDOI
TL;DR: The main trends and challenges in circuit reliability are discussed, and evolving techniques for dealing with them are explained.
Abstract: Deep-submicron technology is having a significant impact on permanent, intermittent, and transient classes of faults. This article discusses the main trends and challenges in circuit reliability, and explains evolving techniques for dealing with them.

622 citations


Journal ArticleDOI
TL;DR: It is argued that instead of assuming average behavior, it is now time to model and optimize phase-based program behavior.
Abstract: Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the largest of scales (that is, over the program's complete execution). During one part of the execution, a program can be completely memory bound; in another, it can repeatedly stall on branch mispredicts. Average statistics gathered about a program might not accurately picture where the real problems lie. This realization has ramifications for many architecture and compiler techniques, from how to best schedule threads on a multithreaded machine, to feedback-directed optimizations, power management, and the simulation and test of architectures. Taking advantage of time-varying behavior requires a set of automated analytic tools and hardware techniques that can discover similarities and changes in program behavior on the largest of time scales. The challenge in building such tools is that during a program's lifetime it can execute billions or trillions of instructions. How can high-level behavior be extracted from this sea of instructions? Some programs change behavior drastically, switching between periods of high and low performance, yet system design and optimization typically focus on average system behavior. It is argued that instead of assuming average behavior, it is now time to model and optimize phase-based program behavior.

279 citations


Journal ArticleDOI
David A. Koufaty1, Scott Rodgers1
TL;DR: Describes how hyperthreading technology works, that is, how a single physical processor appears as multiple logical processors to operating systems and software, and shows how this technology significantly improves performance on several relevant workloads.
Abstract: Hyperthreading technology, which brings the concept of simultaneous multithreading to the Intel architecture, was first introduced on the Intel Xeon processor in early 2002 for the server market. In November 2002, Intel launched the technology on the Intel Pentium 4 at clock frequencies of 3.06 GHz and higher, making the technology widely available to the consumer market. This technology signals a new direction in microarchitecture development and fundamentally changes the cost-benefit tradeoffs of microarchitecture design choices. This article describes how the technology works, that is, how we make a single physical processor appear as multiple logical processors to operating systems and software. We highlight the additional structures and die area needed to implement the technology and discuss the fundamental ideas behind the technology and why we can get a 25-percent boost in performance from a technology that costs less than 5 percent in added die area. We illustrate the importance of choosing the right sharing policy for each shared resource by describing, examining, and comparing three different sharing policies: partitioned resources, threshold sharing, and full sharing. The choice of policy depends on the traffic pattern, complexity and size of the resource, potential deadlock/livelock scenarios, and other considerations. Finally, we show how this technology significantly improves performance on several relevant workloads.

253 citations
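The three sharing policies the article compares can be illustrated with a toy allocation check for a queue shared by two logical processors. This is a minimal sketch under assumed sizes, not Intel's actual structures or numbers:

```python
# Toy model of three resource-sharing policies for a queue of `capacity`
# entries shared by two logical processors. All names and sizes here are
# illustrative assumptions, not taken from the article.

def can_allocate(policy, occupancy, thread, capacity=32, threshold=24):
    """Return True if `thread` (0 or 1) may take one more queue entry.

    occupancy: [entries held by thread 0, entries held by thread 1]
    """
    if policy == "partitioned":
        # Each thread owns a fixed half; one thread can never starve the other.
        return occupancy[thread] < capacity // 2
    if policy == "threshold":
        # Entries are shared, but no thread may exceed `threshold` of them.
        return occupancy[thread] < threshold and sum(occupancy) < capacity
    if policy == "full":
        # Any free entry goes to whichever thread asks first.
        return sum(occupancy) < capacity
    raise ValueError(policy)
```

The tradeoff the article describes falls out directly: partitioning is simplest and deadlock-free, full sharing uses capacity best when one thread is idle, and threshold sharing sits in between.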


Journal ArticleDOI
TL;DR: Representing AMD's entry into 64-bit computing, Opteron combines the backward compatibility of the x86-64 architecture with a DDR memory controller and HyperTransport links to deliver server-class performance.
Abstract: Representing AMD's entry into 64-bit computing, Opteron combines the backward compatibility of the x86-64 architecture with a DDR memory controller and HyperTransport links to deliver server-class performance. These features also make Opteron a flexible, modular, and easily connectable component for various multiprocessor configurations.

247 citations


Journal ArticleDOI
TL;DR: The Tera-op reliable intelligently adaptive processing system (TRIPS) architecture seeks to deliver system-level configurability to applications and runtime systems by employing the concept of polymorphism.
Abstract: The Tera-op reliable intelligently adaptive processing system (TRIPS) architecture seeks to deliver system-level configurability to applications and runtime systems. It does so by employing the concept of polymorphism, which permits the runtime system to configure the hardware execution resources to match the mode of execution and demands of the compiler and application.

206 citations


Journal ArticleDOI
C. McNairy1, D. Soltis2
TL;DR: The Itanium 2 processor extends the processing power of the Itanium processor family with a capable and balanced microarchitecture. Executing up to six instructions at a time, it provides both performance and binary compatibility for Itanium-based applications and operating systems.
Abstract: The Itanium 2 processor extends the processing power of the Itanium processor family with a capable and balanced microarchitecture. Executing up to six instructions at a time, it provides both performance and binary compatibility for Itanium-based applications and operating systems.

160 citations


Journal ArticleDOI
TL;DR: During the past decade, interconnects have replaced transistors as the dominant determiner of chip performance, but new and radically different interconnect technologies will become increasingly important to future gigascale microsystems.
Abstract: During the past decade, interconnects have replaced transistors as the dominant determiner of chip performance. To sustain the historical rate of advance in performance, monolithic interconnect technology has rapidly evolved to keep pace with advances in transistor density and performance. New and radically different interconnect technologies will become increasingly important to future gigascale microsystems.

160 citations


Journal ArticleDOI
TL;DR: Chip-level redundant threading with recovery for chip multiprocessors extends previous transient-fault detection schemes to provide fault recovery and uses the trailing thread state for recovery to hide interprocessor latency.
Abstract: Chip-level redundant threading with recovery (CRTR) for chip multiprocessors extends previous transient-fault detection schemes to provide fault recovery. To hide interprocessor latency, CRTR uses a long slack enabled by asymmetric commit and uses the trailing thread state for recovery. CRTR increases bandwidth supply by pipelining communication paths and reduces bandwidth demand by extending the dependence-based checking elision.

157 citations


Journal ArticleDOI
TL;DR: The authors define the role of architecture techniques and describe HotSpot, an accurate yet fast thermal model suitable for computer architecture research.
Abstract: Temperature-aware design techniques have an important role to play in addition to traditional techniques like power-aware design and package- and board-level thermal engineering. The authors define the role of architecture techniques and describe HotSpot, an accurate yet fast thermal model suitable for computer architecture research.

143 citations


Journal ArticleDOI
TL;DR: An architectural-level power model for interconnection network routers will let researchers and designers easily factor in power when exploring architectural tradeoffs.
Abstract: As interconnection networks proliferate to many new applications, a low-latency high-throughput fabric no longer suffices. An architectural-level power model for interconnection network routers will let researchers and designers easily factor in power when exploring architectural tradeoffs.

Journal ArticleDOI
TL;DR: In a comprehensive study using the Itsy pocket computer, the authors measure both total system power and power dissipated by individual subcircuits for representative workloads and suggest possible low-power design optimizations and power management strategies.
Abstract: In a comprehensive study using the Itsy pocket computer, the authors measure both total system power and power dissipated by individual subcircuits for representative workloads. The results suggest possible low-power design optimizations and power management strategies.

Journal ArticleDOI
TL;DR: Statistical simulation enables quick and accurate design decisions in the early stages of computer design, at the processor and system levels, reducing total design time and cost.
Abstract: Statistical simulation enables quick and accurate design decisions in the early stages of computer design, at the processor and system levels. It complements detailed but slower architectural simulations, reducing total design time and cost.

Journal ArticleDOI
TL;DR: An indispensable ingredient for future success is improvement in the design-manufacture interface; the semiconductor industry also needs continuous reduction of the k1 factor.
Abstract: With lithography parameters approaching their limits, continuous improvement requires increasing dialogue and compromise between the technology and design communities. Only with such communication can semiconductor manufacturers reach the 30 nm physical-gate-length era with optical lithography. Optical lithography is an enabling technology for transistor miniaturization. With the wavelength and numerical aperture of exposure systems approaching their limits, the semiconductor industry needs continuous reduction of the k1 factor. Challenges include image quality improvement, proximity effect correction, and cost control. An indispensable ingredient for future success is improvement in the design-manufacture interface.
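The k1 factor in the abstract comes from the Rayleigh resolution criterion, CD = k1 * wavelength / NA: once wavelength and numerical aperture (NA) are fixed near their limits, printing a smaller critical dimension (CD) forces k1 down. A small numeric check, with illustrative numbers that are not from the article:

```python
# Rayleigh criterion: CD = k1 * wavelength / NA, so k1 = CD * NA / wavelength.
# Shrinking the printed feature while wavelength and NA are near their
# limits forces k1 downward.

def k1_factor(cd_nm, wavelength_nm, na):
    return cd_nm * na / wavelength_nm

# Assumed example values: 193 nm ArF exposure, NA = 0.85, targeting the
# 30 nm physical gate length the abstract mentions.
print(round(k1_factor(30, 193, 0.85), 3))  # -> 0.132
```

A k1 this far below the classical single-exposure limit of 0.25 is why the abstract stresses proximity effect correction and design-manufacture cooperation.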

Journal ArticleDOI
TL;DR: The authors propose several designs that treat the cache as a network of banks and facilitate nonuniform accesses to different physical regions, offering lower latency, greater scalability, and more stable performance than conventional uniform-access cache architectures.
Abstract: Nonuniform cache access designs solve the on-chip wire delay problem for future large integrated caches. By embedding a network in the cache, NUCA designs let data migrate within the cache, clustering the working set nearest the processor. The authors propose several designs that treat the cache as a network of banks and facilitate nonuniform accesses to different physical regions. NUCA architectures offer lower latency, greater scalability, and more stable performance than conventional uniform-access cache architectures.
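The migration idea can be sketched as a chain of banks ordered by distance from the processor, where a hit swaps the block one bank closer. This is a simplification of my own, not the authors' exact promotion policy:

```python
# Minimal sketch of NUCA-style data migration (an assumed simplification,
# not the paper's exact design): banks are ordered by wire distance from
# the processor, and each hit swaps the block one bank closer, so the
# working set gradually clusters in the nearest, lowest-latency banks.

class NUCACache:
    def __init__(self, num_banks, bank_size):
        # banks[0] is closest to the processor (lowest access latency)
        self.banks = [dict() for _ in range(num_banks)]
        self.bank_size = bank_size

    def access(self, addr):
        """Return the index of the bank that served `addr`, or None on a miss."""
        for i, bank in enumerate(self.banks):
            if addr in bank:
                if i > 0:  # promote toward the processor on a hit
                    closer = self.banks[i - 1]
                    del bank[addr]
                    if len(closer) >= self.bank_size:
                        # swap: demote an arbitrary block from the closer bank
                        victim = next(iter(closer))
                        del closer[victim]
                        bank[victim] = True
                    closer[addr] = True
                return i
        return None

    def fill(self, addr):
        """Misses fill into the farthest bank, evicting arbitrarily if full."""
        far = self.banks[-1]
        if len(far) >= self.bank_size:
            del far[next(iter(far))]
        far[addr] = True
```

Repeated hits to the same block return ever-smaller bank indices, which is the latency-clustering effect the abstract describes.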

Journal ArticleDOI
TL;DR: A multiple clock domain (MCD) microarchitecture, which uses a globally asynchronous, locally synchronous (GALS) clocking style, permits future aggressive frequency increases, maintains a synchronous design methodology, and exploits the trend of making functional blocks more autonomous.
Abstract: Multiple clock domains is one solution to the increasing problem of propagating the clock signal across increasingly larger and faster chips. The ability to independently scale frequency and voltage in each domain creates a powerful means of reducing power dissipation. A multiple clock domain (MCD) microarchitecture, which uses a globally asynchronous, locally synchronous (GALS) clocking style, permits future aggressive frequency increases, maintains a synchronous design methodology, and exploits the trend of making functional blocks more autonomous. In MCD, each processor domain is internally synchronous, but domains operate asynchronously with respect to one another. Designers still apply existing synchronous design techniques to each domain, but global clock skew is no longer a constraint. Moreover, domains can have independent voltage and frequency control, enabling dynamic voltage scaling at the domain level.

Journal ArticleDOI
TL;DR: Runahead execution uses otherwise-idle clock cycles to achieve an average 22 percent performance improvement for processors with instruction windows of contemporary sizes.
Abstract: An instruction window that can tolerate latencies to DRAM memory is prohibitively complex and power hungry. To avoid having to build such large windows, runahead execution uses otherwise-idle clock cycles to achieve an average 22 percent performance improvement for processors with instruction windows of contemporary sizes. This technique incurs only a small hardware cost and does not significantly increase the processor's complexity.
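The cycle-recycling idea can be shown with a toy in-order model. This is a sketch under strong simplifying assumptions (unit-cost hits, a fixed miss latency, perfect prefetching during runahead), not the authors' out-of-order implementation:

```python
# Toy model of runahead execution: a load miss would normally stall the
# processor for the full miss latency. In runahead mode, those otherwise-
# idle cycles are instead spent running ahead down the instruction stream
# and prefetching later misses, overlapping their latencies with the
# first miss instead of serializing them. Numbers are illustrative.

MISS_LATENCY = 100  # cycles; an assumed DRAM latency

def run(loads, runahead=False):
    cache = set()
    cycles = 0
    for i, addr in enumerate(loads):
        cycles += 1  # one cycle per load that hits
        if addr not in cache:
            cycles += MISS_LATENCY
            cache.add(addr)
            if runahead:
                # while the miss is outstanding, run ahead and prefetch
                # the addresses of the next MISS_LATENCY loads
                for future in loads[i + 1 : i + 1 + MISS_LATENCY]:
                    cache.add(future)
    return cycles

stream = list(range(8))  # eight loads, all initially missing
print(run(stream), run(stream, runahead=True))  # -> 808 108
```

In this toy run, eight serialized misses cost 808 cycles, while runahead overlaps seven of them behind the first for 108 cycles, illustrating how idle stall cycles turn into useful prefetches.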

Journal ArticleDOI
TL;DR: This work evaluates the Vector IRAM architecture and shows that a compiler can vectorize embedded tasks automatically without compromising code density, and describes a prototype vector processor that outperforms high-end superscalar and VLIW designs by 1.5x to 100x for media tasks, without compromising power consumption.
Abstract: For embedded applications with data-level parallelism, a vector processor offers high performance at low power consumption and low design complexity. Unlike superscalar and VLIW designs, a vector processor is scalable and can optimally match specific application requirements. To demonstrate that vector architectures meet the requirements of embedded media processing, we evaluate the Vector IRAM, or VIRAM (pronounced "V-IRAM"), architecture developed at UC Berkeley, using benchmarks from the Embedded Microprocessor Benchmark Consortium (EEMBC). Our evaluation covers all three components of the VIRAM architecture: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that a compiler can vectorize embedded tasks automatically without compromising code density. We also describe a prototype vector processor that outperforms high-end superscalar and VLIW designs by 1.5x to 100x for media tasks, without compromising power consumption. Finally, we demonstrate that clustering and modular design techniques let a vector processor scale to tens of arithmetic data paths before wide instruction-issue capabilities become necessary.

Journal ArticleDOI
TL;DR: This article presents a method for generating accurate soft-error estimates early in the design cycle, letting designers weigh the benefits of error protection techniques against their costs.
Abstract: The continuous exponential growth in transistors per chip as described by Moore's law has spurred tremendous progress in the functionality and performance of semiconductor devices, particularly microprocessors. At the same time, each succeeding technology generation has introduced new obstacles to maintaining this growth rate. Transient faults caused by single-event upsets have emerged as a key challenge likely to gain significantly more importance in the next few design generations. Techniques for dealing with these faults exist, but they come at a cost. Designers need accurate soft-error estimates early in the design cycle to weigh the benefits of error protection techniques against their costs. This article presents a method for generating these estimates.

Journal ArticleDOI
Francois Abel1, Cyriel Minkenberg1, Ronald P. Luijten1, Mitch Gusat1, Ilias Iliadis1 
TL;DR: This 4-Tbps packet switch uses a combined input- and crosspoint-queued (CICQ) structure with virtual output queuing at the ingress to achieve the scalability of input-buffered switches, the performance of output-buffered switches, and low latency.
Abstract: This 4-Tbps packet switch uses a combined input- and crosspoint-queued (CICQ) structure with virtual output queuing at the ingress to achieve the scalability of input-buffered switches, the performance of output-buffered switches, and low latency.

Journal ArticleDOI
TL;DR: This new approach characterizes power dissipation on complex DSPs, relying on an initial functional-level power analysis of the target processor together with a characterization that qualifies the most significant architectural and algorithmic parameters for power dissipation, obtained from a simple profiling of the assembly code.
Abstract: This new approach characterizes power dissipation on complex DSPs. Its processor model relies on an initial functional-level power analysis of the target processor together with a characterization that qualifies the more significant architectural and algorithmic parameters for power dissipation. These parameters come from a simple profiling of the assembly code. This functional model accounts for deeply pipelined, superscalar, and hierarchical memory architectures.

Journal ArticleDOI
TL;DR: This paper presents algorithms that let a ternary content-addressable memory (TCAM) sort in O(N) memory cycles.
Abstract: Sorting and searching are classic problems in computing. Although several RAM-based solutions exist, algorithms using ternary content-addressable memories offer performance benefits. Using these algorithms, a TCAM can sort in O(N) memory cycles.
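One way such a scheme can work (a sketch of the general technique, not necessarily the paper's exact algorithm) is to treat each ternary lookup as a single memory cycle in which every stored entry is compared against a 0/1/don't-care pattern in parallel, and to extract the minimum by fixing bits from the most significant down:

```python
# Sketch of TCAM-based sorting: each ternary lookup models one memory
# cycle in which all entries match a 0/1/don't-care ('x') pattern in
# parallel. Extracting the minimum takes W lookups for W-bit words, so
# sorting N values costs N*W cycles -- O(N) for a fixed word width.
# Details here are illustrative, not necessarily the paper's algorithm.

WIDTH = 8  # word width in bits; an assumed TCAM entry size

def tcam_match(entries, pattern):
    """One 'memory cycle': does any valid entry match the ternary pattern?"""
    for value in entries:
        bits = format(value, f"0{WIDTH}b")
        if all(p in ("x", b) for p, b in zip(pattern, bits)):
            return True
    return False

def tcam_sort(values):
    entries = set(values)  # assume distinct W-bit keys for simplicity
    result = []
    while entries:
        prefix = ""
        for _ in range(WIDTH):  # fix bits MSB->LSB, preferring 0
            trial = prefix + "0" + "x" * (WIDTH - len(prefix) - 1)
            prefix += "0" if tcam_match(entries, trial) else "1"
        minimum = int(prefix, 2)
        entries.remove(minimum)  # invalidate the entry just read out
        result.append(minimum)
    return result

print(tcam_sort([42, 7, 255, 0, 19]))  # -> [0, 7, 19, 42, 255]
```

The software loop over entries stands in for the hardware's parallel compare; in a real TCAM that whole loop is one cycle, which is where the performance benefit over RAM-based sorting comes from.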

Journal ArticleDOI
TL;DR: Transactional lock removal can dynamically eliminate synchronization operations and achieve transparent transactional execution by treating lock-based critical sections as lock-free optimistic transactions.
Abstract: Although lock-based critical sections are the synchronization method of choice, they have significant performance limitations and lack certain properties, such as failure atomicity and stability. Addressing both these limitations requires considerable software overhead. Transactional lock removal can dynamically eliminate synchronization operations and achieve transparent transactional execution by treating lock-based critical sections as lock-free optimistic transactions.

Journal ArticleDOI
TL;DR: This flow-monitoring circuit delivers an ordered byte stream to a client application for every TCP/IP connection it processes, using an active flow-processing algorithm.
Abstract: This flow-monitoring circuit delivers an ordered byte stream to a client application for every TCP/IP connection it processes. Using an active flow-processing algorithm, TCP Splitter is a lightweight, efficient design that supports the monitoring of an almost unlimited number of flows at multigigabit line rates.

Journal ArticleDOI
TL;DR: Electronic components now control a car's movements, provide entertainment and communication, and help ensure safety; a new platform-based methodology can revolutionize the way a car is designed.
Abstract: Electronic components are now essential to control a car's movements and chemical, mechanical, and electrical processes; to provide entertainment and communication; and to ensure safety. A new platform-based methodology can revolutionize the way a car is designed.

Journal ArticleDOI
TL;DR: The token coherence framework directly enforces the coherence invariant by counting tokens, which enables more obviously correct protocols that do not rely on request ordering and can operate with alternative policies that seek to improve the performance of future multiprocessors.
Abstract: Commercial workload and technology trends are pushing existing shared-memory multiprocessor coherence protocols in divergent directions. Token coherence provides a framework for new coherence protocols that can reconcile these opposing trends. The token coherence framework directly enforces the coherence invariant by counting tokens (requiring all of a block's tokens to write and at least one token to read). This token-counting approach enables more obviously correct protocols that do not rely on request ordering and can operate with alternative policies that seek to improve the performance of future multiprocessors.
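The token-counting invariant in the abstract (all of a block's tokens to write, at least one to read) is simple enough to state directly in code. A minimal sketch, not the authors' full protocol with its performance policies:

```python
# The token-coherence invariant in miniature: each block has a fixed
# number of tokens; a processor needs all of them to write and at least
# one to read, so a writer and any concurrent reader are impossible.
# The toy system size is an assumption for illustration.

TOKENS = 4  # one token per processor in this toy system

class TokenBlock:
    def __init__(self, num_procs=TOKENS):
        self.held = [0] * num_procs
        self.held[0] = TOKENS  # processor 0 starts with all tokens

    def can_read(self, p):
        return self.held[p] >= 1

    def can_write(self, p):
        return self.held[p] == TOKENS

    def send_tokens(self, src, dst, n):
        assert self.held[src] >= n
        self.held[src] -= n
        self.held[dst] += n
        # invariant: tokens are conserved, never created or destroyed
        assert sum(self.held) == TOKENS
```

Because safety follows from conservation and counting alone, correctness never depends on the order in which requests arrive, which is what frees the framework to experiment with unordered interconnects and alternative performance policies.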

Journal ArticleDOI
TL;DR: The improved performance of current microprocessors brings with it increasingly complex and power-dissipating issue logic and a range of mechanisms for tackling this problem.
Abstract: The improved performance of current microprocessors brings with it increasingly complex and power-dissipating issue logic. Recent proposals introduce a range of mechanisms for tackling this problem.

Journal ArticleDOI
TL;DR: Market-related trends continue to drive innovation in the semiconductor industry; in particular, they are driving the design of systems on a chip, the new breed of complex, highly integrated systems.
Abstract: Market-related trends continue to drive innovation in the semiconductor industry today. In particular, they are driving the design of systems on a chip, the new breed of complex, highly integrated systems.

Journal ArticleDOI
TL;DR: A new technique, checkpoint processing and recovery, offers an efficient means of increasing the instruction window size without requiring large, cycle-critical structures, and provides a promising microarchitecture for future high-performance processors.
Abstract: Processors require a combination of large instruction windows and high clock frequency to achieve high performance. Traditional processors use reorder buffers, but these structures do not scale efficiently as window size increases. A new technique, checkpoint processing and recovery, offers an efficient means of increasing the instruction window size without requiring large, cycle-critical structures, and provides a promising microarchitecture for future high-performance processors.

Journal ArticleDOI
TL;DR: By adapting the branch target buffer's size and dynamically disabling a hybrid predictor's components, the authors create a customized branch predictor that saves a significant amount of energy with little performance degradation.
Abstract: To exploit instruction-level parallelism, high-end processors use branch predictors consisting of many large, often underutilized structures that cause unnecessary energy waste and high power consumption. By adapting the branch target buffer's size and dynamically disabling a hybrid predictor's components, the authors create a customized branch predictor that saves a significant amount of energy with little performance degradation.