Journal ArticleDOI

POWER4 system microarchitecture

J. M. Tendler, John Steven Dodson, J. S. Fields, Hung Qui Le, Balaram Sinharoy
01 Jan 2002 - IBM Journal of Research and Development - Vol. 46, Iss. 1, pp. 5-25
TL;DR: The processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor are described.
Abstract: The IBM POWER4 is a new microprocessor organized in a system structure that includes new technology to form systems. The name POWER4 as used in this context refers not only to a chip, but also to the structure used to interconnect chips to form systems. In this paper we describe the processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor.
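(For scale: the 32-way figure follows from the packaging described in IBM's POWER4 system literature, a detail not spelled out in this abstract: 2 cores per chip × 4 chips per multichip module × 4 modules = 32 processors.)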
Citations
Proceedings ArticleDOI
09 Dec 2006
TL;DR: The results show that the best architected policies can come within 1% of the performance of an ideal oracle, while meeting a given chip-level power budget, and are significantly better than static management, even if static scheduling is given oracular knowledge.
Abstract: Chip-level power and thermal implications will continue to rule as one of the primary design constraints and performance limiters. The gap between average and peak power actually widens with increased levels of core integration. As such, if per-core control of power levels (modes) is possible, a global power manager should be able to dynamically set the modes suitably. This would be done in tune with the workload characteristics, in order to always maintain a chip-level power that is below the specified budget. Furthermore, this should be possible without significant degradation of chip-level throughput performance. We analyze and validate this concept in detail in this paper. We assume a per-core DVFS (dynamic voltage and frequency scaling) knob to be available to such a conceptual global power manager. We evaluate several different policies for global multi-core power management. In this analysis, we consider various different objectives such as prioritization and optimized throughput. Overall, our results show that in the context of a workload comprised of SPEC benchmark threads, our best architected policies can come within 1% of the performance of an ideal oracle, while meeting a given chip-level power budget. Furthermore, we show that these global dynamic management policies perform significantly better than static management, even if static scheduling is given oracular knowledge.
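To make the policy space concrete, here is a minimal sketch of a global power manager that assigns per-core DVFS levels to maximize predicted throughput under a chip-level budget. It is illustrative only: the level set, the cubic power model, and the linear performance model are assumptions, not the paper's policies or measured data.

```python
from itertools import product

# Illustrative per-core DVFS levels (fraction of nominal voltage/frequency).
DVFS_LEVELS = [1.0, 0.9, 0.8, 0.7]

def predicted_power(level, base_power):
    # Power scales roughly as V^2 * f; scaling V and f together gives ~cubic.
    return base_power * level ** 3

def predicted_perf(level, base_ipc):
    # Crude assumption: throughput scales linearly with frequency.
    return base_ipc * level

def choose_modes(base_power, base_ipc, budget):
    """Pick per-core DVFS levels maximizing total throughput under the budget.

    Exhaustive search is fine for a handful of cores; a real manager would
    re-evaluate a cheaper policy periodically as workload behavior shifts.
    """
    best, best_perf = None, -1.0
    for modes in product(DVFS_LEVELS, repeat=len(base_ipc)):
        power = sum(predicted_power(m, p) for m, p in zip(modes, base_power))
        if power > budget:
            continue
        perf = sum(predicted_perf(m, i) for m, i in zip(modes, base_ipc))
        if perf > best_perf:
            best, best_perf = modes, perf
    return best, best_perf

# Four cores with unequal IPC demands, budget at 80% of peak chip power.
modes, perf = choose_modes([10.0] * 4, [1.8, 0.6, 1.2, 0.9], budget=32.0)
print(modes, round(perf, 2))
```

The paper's result, restated against this sketch: per-core mode selection of this kind, re-evaluated dynamically against workload behavior, recovers nearly all of the throughput an oracle would under the same budget.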

667 citations


Cites background from "POWER4 system microarchitecture"

  • ...Initially, this era of chip multiprocessing (CMP) [35, 40] began with a step back in core-level frequency (shallower processor pipelines) to allow multi-core (multi-thread) throughput increase at affordable power [15, 16, 18, 29, 41]....


Journal Article
TL;DR: The Hyper-Threading Technology architecture is described, along with the microarchitecture details of Intel's first implementation on the Intel Xeon processor family; the technology is an important addition to Intel's enterprise product line and will be integrated into a wide variety of products.
Abstract: Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources. This paper describes the Hyper-Threading Technology architecture, and discusses the microarchitecture details of Intel's first implementation on the Intel Xeon processor family. Hyper-Threading Technology is an important addition to Intel's enterprise product line and will be integrated into a wide variety of products.

INTRODUCTION: The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly more complex, have more transistors, and consume more power. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.

Processor Microarchitecture: Traditional approaches to processor design have focused on higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second. Because there will be far more instructions in flight in a super-pipelined microarchitecture, handling of events that disrupt the pipeline, e.g., cache misses, interrupts and branch mispredictions, can be costly.

ILP refers to techniques to increase the number of instructions executed each clock cycle. For example, a super-scalar processor has multiple parallel execution units that can process instructions simultaneously. With super-scalar execution, several instructions can be executed each clock cycle. However, with simple in-order execution, it is not enough to simply have multiple execution units. The challenge is to find enough instructions to execute. One technique is out-of-order execution, where a large window of instructions is simultaneously evaluated and sent to execution units based on instruction dependencies rather than program order.

Accesses to DRAM memory are slow compared to execution speeds of the processor. One technique to reduce this latency is to add fast caches close to the processor. Caches can provide fast memory access to frequently accessed data or instructions. However, caches can only be fast when they are small. For this reason, processors often are designed with a cache hierarchy in which fast, small caches are located and operated at access latencies very close to that of the processor core, and progressively larger caches, which handle less frequently accessed data or instructions, are implemented with longer access latencies. However, there will always be times when the data needed will not be in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss.

The vast majority of techniques used to improve processor performance from one generation to the next are complex and often add significant die-size and power costs. These techniques increase performance, but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double the performance of the processor, due to limited parallelism in instruction flows. Similarly, simply doubling the clock rate does not double the performance, due to the number of processor cycles lost to branch mispredictions.
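As a rough illustration of the idea in the abstract (architectural state duplicated, execution resources shared), the toy model below interleaves instructions from two logical processors into one shared pool of issue slots each cycle. It is a schematic sketch, not Intel's fetch policy; all names are invented.

```python
from collections import deque

class LogicalProcessor:
    """Duplicated architectural state: each logical CPU has its own instruction stream."""
    def __init__(self, name, instructions):
        self.name = name
        self.queue = deque(instructions)

def run_smt(threads, issue_width=3):
    """Interleave two logical processors onto one shared pool of issue slots."""
    cycle = 0
    while any(t.queue for t in threads):
        slots = []
        # First pass: one instruction per ready thread, for fairness.
        for t in threads:
            if t.queue and len(slots) < issue_width:
                slots.append(f"{t.name}:{t.queue.popleft()}")
        # Second pass: fill leftover shared slots opportunistically.
        for t in threads:
            while t.queue and len(slots) < issue_width:
                slots.append(f"{t.name}:{t.queue.popleft()}")
        print(f"cycle {cycle}: {slots}")
        cycle += 1

run_smt([LogicalProcessor("T0", ["ld", "add", "mul", "st"]),
         LogicalProcessor("T1", ["add", "br"])])
```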

580 citations

Proceedings ArticleDOI
28 Jun 2004
TL;DR: The results imply that leveraging a single microarchitecture design for multiple remaps across a few technology generations will become increasingly difficult, and motivate a need for workload-specific, microarchitectural lifetime reliability awareness at an early design stage.
Abstract: The relentless scaling of CMOS technology has provided a steady increase in processor performance for the past three decades. However, increased power densities (hence temperatures) and other scaling effects have an adverse impact on long-term processor lifetime reliability. This paper represents a first attempt at quantifying the impact of scaling on lifetime reliability due to intrinsic hard errors, taking workload characteristics into consideration. For our quantitative evaluation, we use RAMP (Srinivasan et al., 2004), a previously proposed industrial-strength model that provides reliability estimates for a workload, but for a given technology. We extend RAMP by adding scaling specific parameters to enable workload-dependent lifetime reliability evaluation at different technologies. We show that (1) scaling has a significant impact on processor hard failure rates - on average, with SPEC benchmarks, we find the failure rate of a scaled 65nm processor to be 316% higher than a similarly pipelined 180nm processor; (2) time-dependent dielectric breakdown and electromigration have the largest increases; and (3) with scaling, the difference in reliability from running at worst-case vs. typical workload operating conditions increases significantly, as does the difference from running different workloads. Our results imply that leveraging a single microarchitecture design for multiple remaps across a few technology generations will become increasingly difficult, and motivate a need for workload specific, microarchitectural lifetime reliability awareness at an early design stage.
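To make the modeling style concrete: RAMP-like evaluations combine per-mechanism failure rates into a chip-level rate. The sketch below assumes a simple sum-of-failure-rates combination and an Arrhenius temperature dependence; every constant is invented for illustration and none are RAMP's calibrated parameters.

```python
import math

K_BOLTZMANN = 8.617e-5  # eV/K

def arrhenius_accel(temp_k, ref_temp_k, ea_ev):
    """Arrhenius acceleration: intrinsic failure mechanisms speed up with temperature."""
    return math.exp((ea_ev / K_BOLTZMANN) * (1.0 / ref_temp_k - 1.0 / temp_k))

def chip_fit(temp_k, mechanisms, ref_temp_k=345.0):
    """Sum-of-failure-rates: chip FIT is the sum of per-mechanism FIT rates.

    mechanisms: {name: (FIT at ref_temp_k, activation energy in eV)}.
    """
    return sum(fit * arrhenius_accel(temp_k, ref_temp_k, ea)
               for fit, ea in mechanisms.values())

# Invented numbers, for shape only.
mechs = {
    "electromigration": (200.0, 0.9),
    "TDDB":             (150.0, 0.75),   # time-dependent dielectric breakdown
    "thermal_cycling":  (100.0, 0.3),
}

for t in (345.0, 355.0, 365.0):   # hotter, denser scaled chips fail sooner
    print(f"T={t:.0f}K  FIT={chip_fit(t, mechs):.0f}")
```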

577 citations

Proceedings ArticleDOI
01 May 2003
TL;DR: Results show that high performance can be obtained in each of the three modes (ILP, TLP, and DLP), demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
Abstract: This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes--ILP, TLP, and DLP-demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
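A hedged sketch of what "configured and combined in different modes" might look like as a resource-partitioning decision; the mode names follow the abstract, but the partition sizes and API are invented, not the TRIPS mechanisms.

```python
from enum import Enum

class Mode(Enum):
    ILP = "single thread, whole grid forms one wide out-of-order window"
    TLP = "grid partitioned among independent threads"
    DLP = "grid run as lockstep lanes over data elements"

def configure(total_tiles, mode, threads=1):
    """Return an illustrative partition of execution tiles for each mode."""
    if mode is Mode.ILP:
        return [total_tiles]                 # all tiles serve one thread
    if mode is Mode.TLP:
        return [total_tiles // threads] * threads
    if mode is Mode.DLP:
        return [1] * total_tiles             # one tile per data lane

for mode, n in [(Mode.ILP, 1), (Mode.TLP, 4), (Mode.DLP, 1)]:
    print(mode.name, configure(16, mode, threads=n))
```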

512 citations


Cites background from "POWER4 system microarchitecture"

  • ...Granularity of parallel processing elements on a chip. a) Ultra-fine-grained FPGAs. b) Hundreds of primitive processors connected to memory banks such as a processor-in-memory (PIM) architecture or reconfigurable ALU arrays such as RaPiD [7], Piperench [9], or PACT [3]. c) Tens of simple in-order processors, such as in RAW [25] or Piranha [2] architectures. d) Coarse grained architectures consisting of 10-20 4-issue cores, such as the Power4 [22], Cyclops [4], MultiScalar processors [19], other proposed speculatively-threaded CMPs [6, 20], and the polymorphous Smart Memories [15] architecture. e) Wide-issue processors with many ALUs each, such as Grid Processors [16]....


  • ...d) Coarse grained architectures consisting of 10-20 4-issue cores, such as the Power4 [22], Cyclops [4], MultiScalar processors [19], other proposed speculatively-threaded CMPs [6, 20], and the polymorphous Smart Memories [15] architecture....


Journal ArticleDOI
01 May 2002
TL;DR: It is found that RMT can be a more significant burden for single-processor devices than prior studies indicate, and a novel application of RMT techniques in a dual-processor device, which is termed chip-level redundant threading (CRT), shows higher performance than lockstepping the two cores, especially on multithreaded workloads.
Abstract: Exponential growth in the number of on-chip transistors, coupled with reductions in voltage levels, makes each generation of microprocessors increasingly vulnerable to transient faults. In a multithreaded environment, we can detect these faults by running two copies of the same program as separate threads, feeding them identical inputs, and comparing their outputs, a technique we call Redundant Multithreading (RMT). This paper studies RMT techniques in the context of both single- and dual-processor simultaneous multithreaded (SMT) single-chip devices. Using a detailed, commercial-grade, SMT processor design we uncover subtle RMT implementation complexities, and find that RMT can be a more significant burden for single-processor devices than prior studies indicate. However, a novel application of RMT techniques in a dual-processor device, which we term chip-level redundant threading (CRT), shows higher performance than lockstepping the two cores, especially on multithreaded workloads.
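Stripped of microarchitectural detail, the RMT idea is "run two copies on identical inputs and compare outputs." Below is a software analogue in Python; real RMT replicates at the instruction level in hardware and compares outputs automatically, so this is only a sketch of the detection principle.

```python
import threading

def redundant_run(fn, inputs):
    """Run fn twice as separate threads on identical inputs and compare outputs.

    A transient fault that corrupts one copy's result surfaces as a mismatch.
    (Software analogue only; hardware RMT compares at instruction granularity.)
    """
    results = [None, None]

    def worker(idx):
        results[idx] = fn(*inputs)

    workers = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    if results[0] != results[1]:
        raise RuntimeError(f"output mismatch, possible transient fault: {results}")
    return results[0]

print(redundant_run(lambda a, b: a * b + 1, (6, 7)))   # prints 43
```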

500 citations


Cites background from "POWER4 system microarchitecture"

  • ..., the IBM Power4 [7] and the HP Mako [8])....


  • ...Initial examples of these two-way chip multiprocessors (CMPs) are shipping (e.g., the IBM Power4 [7] and the HP Mako [8])....


  • ..., [3,4,6,7,9,11]) have proposed a host of other non-multithreaded techniques for fault detection for uniprocessors....


References
Journal ArticleDOI
TL;DR: A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264.
Abstract: Alpha microprocessors have been performance leaders since their introduction in 1992. The first generation 21064 and the later 21164 raised expectations for the newest generation-performance leadership was again a goal of the 21264 design team. Benchmark scores of 30+ SPECint95 and 58+ SPECfp95 offer convincing evidence thus far that the 21264 achieves this goal and will continue to set a high performance standard. A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provide exceptional core computational performance in the 21264. The processor also features a high-bandwidth memory system that can quickly deliver data values to the execution core, providing robust performance for a wide range of applications, including those without cache locality. The advanced performance levels are attained while maintaining an installed application base. All Alpha generations are upward-compatible. Database, real-time visual computing, data mining, medical imaging, scientific/technical, and many other applications can utilize the outstanding performance available with the 21264.

828 citations

Proceedings ArticleDOI
01 May 1993
TL;DR: This paper shows that there are really nine variations of the same basic model, Two-Level Adaptive Branch Prediction (which Pan, So, and Rahmeh call Correlation Branch Prediction), and studies the effects of different branch history lengths and pattern history table configurations.
Abstract: Recent attention to speculative execution as a mechanism for increasing performance of single instruction streams has demanded substantially better branch prediction than what has been previously available. We [1,2] and Pan, So, and Rahmeh [4] have both proposed variations of the same aggressive dynamic branch predictor for handling those needs. We call the basic model Two-Level Adaptive Branch Prediction; Pan, So, and Rahmeh call it Correlation Branch Prediction. In this paper, we adopt the terminology of [2] and show that there are really nine variations of the same basic model. We compare the nine variations with respect to the amount of history information kept. We study the effects of different branch history lengths and pattern history table configurations. Finally, we evaluate the cost effectiveness of the nine variations.
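For context, a two-level adaptive predictor keeps a branch history register (first level) whose value indexes a pattern history table of saturating counters (second level). A minimal global-history (GAg-style) sketch, with arbitrary table sizes:

```python
class TwoLevelPredictor:
    """GAg-style two-level adaptive predictor: global history indexes a PHT."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0                          # branch history register (level 1)
        self.pht = [1] * (1 << history_bits)      # 2-bit saturating counters (level 2)

    def predict(self):
        return self.pht[self.history] >= 2        # counters 2,3 mean "predict taken"

    def update(self, taken):
        ctr = self.pht[self.history]
        self.pht[self.history] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# A repeating taken-taken-not-taken pattern is mispredicted during warm-up,
# then captured once each distinct history maps to a fixed outcome.
p = TwoLevelPredictor()
correct = 0
for taken in [True, True, False] * 8:
    correct += (p.predict() == taken)
    p.update(taken)
print(f"{correct}/24 correct")
```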

379 citations

Book
01 Jun 1994
TL;DR: The PowerPC Architecture is a must for anyone who needs to understand the levels of compatibility between different processors in the PowerPC family: the 601 microprocessor, the 603 (for low-end, battery-powered requirements), the 604 (optimized price/performance for scalable symmetric multiprocessors), and the 620 (for high-end technical and commercial performance requirements).
Abstract: This is the official technical description of the PowerPC architecture and its hardware conventions, developed jointly by IBM, Motorola, and Apple. The book is an essential reference for hardware and system-software designers and applications programmers developing a range of products using implementations of the PowerPC family of microprocessors, from palmtops to teraFLOPS. The PowerPC architecture provides a stable base for software, allowing applications that run on one PowerPC processor to run consistently on any other PowerPC processor. In addition, well-designed operating systems can be moved from one processor implementation to another by making only a few minor changes. To achieve this, the specification of the architecture has been structured into three Books, each corresponding to a distinct level of the architecture: Book I, User Instruction Set Architecture, describes the registers, instructions, storage model, and execution model that are available to all application programs. Book II, Virtual Environment Architecture, describes features of the architecture that permit application programs to create or modify code, to share data among programs in a multiprocessing system, and to optimize the performance of storage accesses. Book III, Operating Environment Architecture, describes features of the architecture that permit operating systems to allocate and manage storage, to handle errors encountered by application programs, to support I/O devices, and to provide the other services expected of secure, modern multiprocessor operating systems. An important feature of these specifications is that they constrain implementations only on matters that affect software compatibility. Even more significant, they specify the architecture in a manner that is independent of implementation. The PowerPC Architecture is a must for anyone who needs to understand the levels of compatibility between different processors in the PowerPC family: the 601 microprocessor, the 603 (for low-end, battery-powered requirements), the 604 (optimized price/performance for scalable symmetric multiprocessors), and the 620 (for high-end technical and commercial performance requirements).

226 citations

Proceedings ArticleDOI
01 May 1997
TL;DR: A new prediction mechanism, the target cache, is proposed for predicting indirect jump targets; it reduces the indirect jump misprediction rate and the overall execution time for the perl and gcc benchmarks.
Abstract: As the issue rate and pipeline depth of high performance superscalar processors increase, the amount of speculative work issued also increases. Because speculative work must be thrown away in the event of a branch misprediction, wide-issue, deeply pipelined processors must employ accurate branch predictors to effectively exploit their performance potential. Many existing branch prediction schemes are capable of accurately predicting the direction of conditional branches. However, these schemes are ineffective in predicting the targets of indirect jumps achieving, on average, a prediction accuracy rate of 51.8% for the SPECint95 benchmarks. In this paper, we propose a new prediction mechanism, the target cache, for predicting indirect jump targets. For the perl and gcc benchmarks, this mechanism reduces the indirect jump misprediction rate by 93.4% and 63.35% and the overall execution time by 14% and 5%.
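Unlike a taken/not-taken predictor, an indirect jump needs a full target address. A target cache stores recently seen targets indexed by branch address and history; the sketch below simplifies the index to a hash of PC and recent-target history, which is an assumption, not the paper's exact indexing scheme.

```python
class TargetCache:
    """Predicts indirect-jump targets: (PC, path history) -> last seen target."""

    def __init__(self, size=1024, history_len=2):
        self.size = size
        self.history_len = history_len
        self.table = [None] * size          # cached target addresses
        self.history = ()                   # recent targets (path history)

    def _index(self, pc):
        # Illustrative index function: hash of PC and history, not the paper's.
        return hash((pc, self.history)) % self.size

    def predict(self, pc):
        return self.table[self._index(pc)]  # None on a cold miss

    def update(self, pc, actual_target):
        self.table[self._index(pc)] = actual_target
        self.history = (self.history + (actual_target,))[-self.history_len:]

# An indirect jump at pc=0x40 alternating between two targets becomes
# predictable once the path history distinguishes the two cases.
tc = TargetCache()
hits = 0
for target in [0x100, 0x200] * 10:
    hits += (tc.predict(0x40) == target)
    tc.update(0x40, target)
print(f"{hits}/20 predicted")   # typically 16/20 after the cold-start misses
```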

158 citations

Journal ArticleDOI
TL;DR: The circuit and physical design of POWER4 are described, with emphasis placed on aspects of the design methodology, clock distribution, circuits, power, integration, and timing that enabled the design team to meet the project goals and to complete the design on schedule.
Abstract: The IBM POWER4 processor is a 174-million-transistor chip that runs at a clock frequency of greater than 1.3 GHz. It contains two microprocessor cores, high-speed buses, and an on-chip memory subsystem. The complexity and size of POWER4, together with its high operating frequency, presented a number of significant challenges for its multisite design team. This paper describes the circuit and physical design of POWER4 and gives results that were achieved. Emphasis is placed on aspects of the design methodology, clock distribution, circuits, power, integration, and timing that enabled the design team to meet the project goals and to complete the design on schedule.

122 citations