Showing papers on "PowerPC published in 1997"


Proceedings ArticleDOI
Kemal Ebcioglu, Erik R. Altman
01 May 1997
TL;DR: The architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O are discussed.
Abstract: Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.

406 citations
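
The translate-on-first-execution scheme in the DAISY abstract above can be illustrated with a toy cache. This is a minimal sketch, not the DAISY implementation: the class name `TranslationCache`, the `translate` callback, the least-recently-used cast-out policy, and the `invalidate` hook for self-modifying code are all assumptions made for illustration. The abstract states only that fragments are translated once, saved, and possibly cast out.

```python
from collections import OrderedDict


class TranslationCache:
    """Toy model: translate a code fragment the first time it runs, reuse it later."""

    def __init__(self, capacity, translate):
        self.capacity = capacity      # max number of saved translations (assumed limit)
        self.translate = translate    # hypothetical callback: fragment address -> translated code
        self.cache = OrderedDict()    # address -> translated fragment, oldest first
        self.translations = 0         # how many times the translator actually ran

    def fetch(self, address):
        """Return the translated fragment for `address`, translating only on a miss."""
        if address in self.cache:
            self.cache.move_to_end(address)   # mark as recently used
            return self.cache[address]
        translated = self.translate(address)
        self.translations += 1
        self.cache[address] = translated
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # cast out the least recently used fragment (assumed policy)
        return translated

    def invalidate(self, address):
        """Drop a saved translation, e.g. after the original code was modified."""
        self.cache.pop(address, None)


tc = TranslationCache(capacity=2, translate=lambda addr: f"vliw@{addr:#x}")
tc.fetch(0x100)
tc.fetch(0x100)                       # second execution reuses the saved translation
translations_after_reuse = tc.translations
tc.fetch(0x200)
tc.fetch(0x300)                       # capacity exceeded: fragment 0x100 is cast out
tc.fetch(0x100)                       # must be translated again
```

Only the first execution of a fragment and executions after a cast-out pay the translation cost, which is the property the abstract's "reasonable translation overhead" claim relies on.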


Proceedings ArticleDOI
01 Dec 1997
TL;DR: This work proposes a method for compressing programs in embedded processors where instruction memory size dominates cost and achieves an average size reduction of 39%, 34%, and 26%, respectively, for SPEC CINT95 programs.
Abstract: Proposes a method for compressing programs in embedded processors where the instruction memory size dominates the cost. A post-compilation analyzer examines a program and replaces common sequences of instructions with a single instruction codeword. A microprocessor executes the compressed instruction sequences by fetching codewords from the instruction memory, expanding them back to the original sequence of instructions in the decode stage, and issuing them to the execution stages. We apply our technique to the PowerPC, ARM and i386 instruction sets and achieve an average size reduction of 39%, 34% and 26%, respectively, for SPEC CINT95 programs.

245 citations
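
The codeword substitution described in the abstract above can be sketched as a small dictionary compressor. This is an illustrative sketch under stated assumptions, not the paper's method: fixed-length sequences, a greedy left-to-right scan, the `("CW", index)` codeword representation, and all function names are hypothetical. Instructions are modeled as strings rather than encoded machine words.

```python
from collections import Counter

CODEWORD = "CW"  # hypothetical marker distinguishing codewords from plain instructions


def build_dictionary(program, seq_len=3, max_entries=4):
    """Collect the most common fixed-length instruction sequences that repeat."""
    windows = (tuple(program[i:i + seq_len]) for i in range(len(program) - seq_len + 1))
    counts = Counter(windows)
    return [seq for seq, n in counts.most_common(max_entries) if n > 1]


def compress(program, dictionary, seq_len=3):
    """Replace each dictionary sequence with a single codeword (greedy scan)."""
    index = {seq: i for i, seq in enumerate(dictionary)}
    out, i = [], 0
    while i < len(program):
        window = tuple(program[i:i + seq_len])
        if window in index:
            out.append((CODEWORD, index[window]))
            i += seq_len
        else:
            out.append(program[i])
            i += 1
    return out


def decompress(compressed, dictionary):
    """Expand codewords back to the original sequences, as a decode stage would."""
    out = []
    for item in compressed:
        if isinstance(item, tuple) and item[0] == CODEWORD:
            out.extend(dictionary[item[1]])
        else:
            out.append(item)
    return out


program = [
    "lwz r3,0(r4)", "addi r3,r3,1", "stw r3,0(r4)",
    "lwz r3,0(r4)", "addi r3,r3,1", "stw r3,0(r4)",
    "blr",
]
dictionary = build_dictionary(program)
compressed = compress(program, dictionary)
restored = decompress(compressed, dictionary)
```

The size figures reported in the abstract also depend on the cost of storing the dictionary itself and on the codeword encoding, neither of which this sketch models.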


Proceedings ArticleDOI
Hector Sanchez, B. Kuttanna, T. Olson, M. Alexander, Gianfranco Gerosa, R. Philip, Jose Alvarez
23 Feb 1997
TL;DR: The next-generation PowerPC™ microprocessor includes a thermal assist unit (TAU) comprising an on-chip thermal sensor and associated logic and dynamically adjusts processor operation to provide maximum performance under changing environmental conditions.
Abstract: Thermal management is an important design issue in high-performance, low-power portable computers. If the computer system is designed for worst-case processor power dissipation and environmental operating conditions, it carries an area and cost penalty for the system designer. The next-generation PowerPC™ microprocessor includes a thermal assist unit (TAU) comprising an on-chip thermal sensor and associated logic. The TAU monitors the junction temperature of the processor and dynamically adjusts processor operation to provide maximum performance under changing environmental conditions. The TAU is used in conjunction with other low-power features such as dynamic power management, instruction cache throttling and static low-power modes to provide comprehensive power and thermal management. This paper describes the implementation of the TAU and presents its characterization and operating data from first silicon.

141 citations


Proceedings ArticleDOI
01 Feb 1997
TL;DR: It is shown that software-managed address translation is just as efficient as hardware-managed address translation and it is much more flexible.
Abstract: In this paper we explore software-managed address translation. The purpose of the study is to specify the memory management design for a high clock-rate PowerPC implementation in which a simple design is a prerequisite for a fast clock and a short design cycle. We show that software-managed address translation is just as efficient as hardware-managed address translation and it is much more flexible. Operating systems such as OSF/1 and Mach charge between 0.10 and 0.28 cycles per instruction (CPI) for address translation using dedicated memory-management hardware. Software-managed translation requires 0.05 CPI. Mechanisms to support such features as shared memory, superpages, sub-page protection, and sparse address spaces can be defined completely in software, allowing much more flexibility than in hardware-defined mechanisms.

77 citations


Patent
22 May 1997
TL;DR: A page table entry management method and apparatus for the Microkernel System with the capability to program the memory management unit on the PowerPC family of processors is presented in this article, which solves the problem of a limited number of PTEs by segment aliasing when two or more processes share a segment of memory.
Abstract: A page table entry management method and apparatus provide the Microkernel System with the capability to program the memory management unit on the PowerPC family of processors. The PowerPC processors define a limited set of page table entries (PTEs) to maintain virtual to physical mappings. The page table entry management method and apparatus solves the problem of a limited number of PTEs by segment aliasing when two or more user processes share a segment of memory. The segments are aliased rather than duplicating the PTEs. This significantly reduces the number of PTEs. In addition, the method provides for caching existing PTEs when the system actually runs out of PTEs. A cache of recently discarded PTEs provides a fast fault resolution when a recently used page is accessed again.

72 citations
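
The "cache of recently discarded PTEs" idea in the abstract above can be sketched as a small victim cache in front of a slower mapping lookup. This is a minimal sketch, assuming FIFO eviction, fixed slot counts, and a hypothetical `slow_lookup` callback standing in for the full fault-handling path; the class and attribute names are illustrative and segment aliasing is not modeled.

```python
from collections import OrderedDict


class PteTable:
    """Toy model: a limited set of active PTEs backed by a cache of discarded ones."""

    def __init__(self, slots, victim_slots):
        self.slots = slots                # number of active PTEs available (assumed)
        self.victim_slots = victim_slots  # size of the discarded-PTE cache (assumed)
        self.active = OrderedDict()       # virtual page -> physical frame
        self.victims = OrderedDict()      # recently discarded PTEs
        self.fast_faults = 0              # faults resolved from the victim cache
        self.slow_faults = 0              # faults that needed the slow lookup

    def _install(self, vpage, frame):
        """Install a PTE, discarding the oldest active one (FIFO) if the table is full."""
        if len(self.active) >= self.slots:
            old_vpage, old_frame = self.active.popitem(last=False)
            self.victims[old_vpage] = old_frame
            if len(self.victims) > self.victim_slots:
                self.victims.popitem(last=False)
        self.active[vpage] = frame

    def translate(self, vpage, slow_lookup):
        """Return the frame for `vpage`, resolving a fault from the victim cache if possible."""
        if vpage in self.active:
            return self.active[vpage]
        if vpage in self.victims:
            frame = self.victims.pop(vpage)
            self.fast_faults += 1
        else:
            frame = slow_lookup(vpage)
            self.slow_faults += 1
        self._install(vpage, frame)
        return frame


table = PteTable(slots=2, victim_slots=2)
lookup = lambda vpage: vpage + 100       # hypothetical stand-in for the full mapping lookup
for page in (1, 2, 3):                   # the third mapping pushes page 1 out of the active set
    table.translate(page, lookup)
frame = table.translate(1, lookup)       # page 1 is recovered from the discarded-PTE cache
```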


Proceedings ArticleDOI
13 Jun 1997
TL;DR: This paper describes the verification of two CAMs from a recent PowerPC microprocessor design, a Block Address Translation unit (BAT), and a Branch Target Address Cache unit (BTAC), and uses new Boolean encodings to verify CAMs.
Abstract: In this paper we report on new techniques for verifying content-addressable memories (CAMs), and demonstrate that these techniques work well for large industrial designs. It was shown in [Formal verification of PowerPC™ arrays using symbolic trajectory evaluation] that the formal verification technique of symbolic trajectory evaluation (STE) could be used successfully on memory arrays. We have extended that work to verify what are perhaps the most combinatorially difficult class of memory arrays, CAMs. We use new Boolean encodings to verify CAMs, and show that these techniques scale well, in that space requirements increase linearly, or sub-linearly, with the various CAM size parameters. In this paper, we describe the verification of two CAMs from a recent PowerPC microprocessor design, a Block Address Translation unit (BAT), and a Branch Target Address Cache unit (BTAC). The BAT is a complex CAM, with variable length bit masks. The BTAC is a 64-entry, 64-bits per entry, fully associative CAM and is part of the speculative instruction fetch mechanism of the microprocessor. We believe that ours is the first work on formally verifying CAMs, and we believe our techniques make it feasible to efficiently verify the variety of CAMs found on modern processors.

63 citations


Journal ArticleDOI
A.M. Rincon, G. Cherichetti, J.A. Monzel, David R. Stauffer, M.T. Trick
TL;DR: The authors describe a prototype cosimulation system developed for the PowerPC core and present SOC designs to illustrate their methods.
Abstract: IBM's experience with core-based designs provides insight into methodology, SOC design styles, core design trade-offs, and ASIC design processes. The authors describe a prototype cosimulation system developed for the PowerPC core and present SOC designs to illustrate their methods.

53 citations


Proceedings ArticleDOI
S. Hojat, P. Villarrubia
12 Oct 1997
TL;DR: An approach for tight integration between a synthesis and a placement tool is described to improve timing convergence of advanced microprocessors and results in "legal" placements with, in general, lower delay, and design size.
Abstract: This paper describes an approach for tight integration between a synthesis and a placement tool. The purpose of this integration is to improve timing convergence of advanced microprocessors. It is shown that this approach results in "legal" placements with, in general, lower delay and design size. More significantly, the number of iterations to reach timing closure is reduced drastically. The wire length estimates traditionally used to drive timing optimization in synthesis are inadequate. Instead, the integrated approach leads to enhanced results as well as faster timing convergence. The impact of various parameters in synthesis and placement on the final results is shown.

44 citations


Patent
10 Apr 1997
TL;DR: In this paper, a byte-lane swapping logic is added to the inbound and outbound I/O data paths for transferring data between system components in the appropriate Endian format.
Abstract: To present a consistent image of storage facilities to components in Bi-modal Endian PowerPC system environments, provision is made for transferring data between system components in the appropriate Endian format. The Endian conversion function can be incorporated into the memory controller subsystem by adding byte-lane swapping logic on the inbound and outbound I/O data paths. With this structure, inbound data from the processor and memory bus will be converted to true Little Endian order before being sent to I/O devices. Likewise, true Little Endian data from I/O devices targeted for the processor or memory is modified to reflect the PowerPC Little Endian byte ordering convention.

35 citations
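
The byte-lane swapping mentioned in the abstract above can be illustrated in software. This is a sketch under stated assumptions: it treats the swap as a full reversal of the byte lanes within each 8-byte (64-bit) bus beat, a width the abstract does not specify, and the function name `swap_byte_lanes` is hypothetical. The patent describes hardware on the I/O data paths, not a software routine.

```python
def swap_byte_lanes(data: bytes, lane_width: int = 8) -> bytes:
    """Reverse the byte order within each `lane_width`-byte group of `data`."""
    if len(data) % lane_width != 0:
        raise ValueError("data length must be a multiple of the lane width")
    return b"".join(
        data[i:i + lane_width][::-1]          # mirror the byte lanes of one bus beat
        for i in range(0, len(data), lane_width)
    )


beat = bytes(range(8))                        # bytes 00 01 02 03 04 05 06 07
swapped = swap_byte_lanes(beat)               # bytes 07 06 05 04 03 02 01 00
round_trip = swap_byte_lanes(swapped)         # applying the swap twice restores the data
```

Because the swap is its own inverse, the same logic can serve both the outbound and inbound directions, which matches the abstract's description of one structure placed on both data paths.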


Journal ArticleDOI
TL;DR: This RISC microprocessor is a new, high-performance, PowerPC microprocessor designed specifically for the mobile and high volume desktop personal computer markets, an advanced superscalar design with six execution units, aggressive upstream branch processing, out-of-order instruction execution, and a tightly integrated "backside" L2 cache.
Abstract: This RISC microprocessor is a new, high-performance, PowerPC microprocessor designed specifically for the mobile and high volume desktop personal computer markets. It is an advanced superscalar design with six execution units, aggressive upstream branch processing, out-of-order instruction execution, and a tightly integrated "backside" L2 cache. This dual-issue engine has a four-stage pipeline with dual 32-kB eight-way set-associative L1 caches and an integrated L2 controller with on-chip L2 tag supporting up to 1 MB of external SRAM. A thermal assist unit and an instruction cache throttling mechanism are included for thermal management in mobile applications. A 60X system bus and L2 interface speeds of 100 and 250 MHz are achieved, respectively. This microprocessor achieves workstation class performance (estimated 10 SPECint95 and 9 SPECfp95) while only dissipating 5 W at 250 MHz. The 6.35-million transistor 66.5-mm² die is fabricated in a 2.5-V, 0.3-µm, five-layer metal CMOS process.

29 citations


Proceedings ArticleDOI
13 Jun 1997
TL;DR: A detailed description of an industrial application of the verification methodology based on Symbolic Trajectory Evaluation to the fixed point execution unit of the PowerPC processor is presented.
Abstract: Many modern systems are designed as a set of interconnected reactive subsystems. The subsystem verification task is to verify an implementation of the subsystem against the simple deterministic high-level specification of the entire system. Our verification methodology, based on Symbolic Trajectory Evaluation, is able to bridge the wide gap between the abstract specification and the implementation specific details of the subsystem. This paper presents a detailed description of an industrial application of this methodology to the fixed point execution unit of the PowerPC processor. We were able to verify a representative instruction under all possible stall, bypass, pipeline conditions and under all possible timings for interface to other functional units in the processor.

Journal ArticleDOI
F. E. Levine, C. P. Roth
TL;DR: An application programming interface (API) to the on-chip PM support, its design methodology, and its usage considerations, intended to meet the challenges related to the externalization of the PM support are described.
Abstract: Performance monitor (PM) support in on-chip PowerPC® microprocessors is used to analyze processor, software, and system attributes for a variety of workloads. The interface to the PowerPC 604® microprocessor, which we abbreviate “604,” has been externalized to end users. We discuss the enhanced PM support available in an upgrade of the 604, the PowerPC 604e™ microprocessor, which we abbreviate “604e.” We discuss the challenges related to the externalization of the PM support as it relates to other PowerPC processors not derived from the 604 and briefly contrast these PMs with other PMs. We also describe an application programming interface (API) to the on-chip PM support, its design methodology, and its usage considerations, intended to meet these challenges.

Proceedings ArticleDOI
12 Oct 1997
TL;DR: These techniques successfully merge code modification and compression into a single software preprocessing step and enable decompression and execution of compressed code to occur without the need of a lookaside table (LAT) or cache lookaside buffer (CLB).
Abstract: Compressing instruction sequences can reduce the cost of embedded systems by reducing program ROM-size requirements. Compression also facilitates the use of RISC core architectures, like the PowerPC™ architecture, in embedded systems. Compression techniques are presented which enable decompression and execution of compressed code to occur without the need of a lookaside table (LAT) or cache lookaside buffer (CLB). These techniques successfully merge code modification and compression into a single software preprocessing step. Decompression and execution of compressed code are made very simple. An application of these techniques to about 120000 instructions of PowerPC firmware code is described.

Journal ArticleDOI
TL;DR: This 533-MHz BiCMOS very large scale integration (VLSI) implementation of the PowerPC architecture contains three pipelines and a large on-chip secondary cache to achieve a peak performance of 1600 MIPS.
Abstract: This 533-MHz BiCMOS very large scale integration (VLSI) implementation of the PowerPC architecture contains three pipelines and a large on-chip secondary cache to achieve a peak performance of 1600 MIPS. The 15 mm × 10 mm die contains 2.7 M transistors (2 M CMOS and 0.7 M bipolar) and dissipates less than 85 W. The die is fabricated in a six-level metal, 0.5-µm BiCMOS process and requires 3.6 and 2.1 V power supplies.

Proceedings ArticleDOI
C.J. Georgiou, C.-S. Li
08 Jun 1997
TL;DR: The throughput of this scalable architecture for implementing multi-gigabit protocol engines is shown to be adequate for the operations and bit rates currently specified by Fibre Channel.
Abstract: We have proposed and evaluated a scalable architecture for implementing multi-gigabit protocol engines. The architecture utilizes a combination of custom-made VLSI circuitry and a general-purpose processor, such as the Intel 960 or the IBM PowerPC 403. Time critical operations such as line coding/decoding, CRC generation/checking, context-independent header processing, and buffer management are implemented in the customized VLSI part. These designs are nevertheless scalable and can be cascaded to further increase throughput. Some of the packet level processing, such as context-dependent header processing, is performed by the general-purpose processor. As processing power increases, more and more functions can be included in the general-purpose processor. The throughput of this architecture is shown to be adequate for the operations and bit rates currently specified by Fibre Channel. Future CMOS technology advances will have the potential to further improve the raw throughput.

Proceedings ArticleDOI
12 Jan 1997
TL;DR: Virtual Memory MTU Reassembly (VMMR) allows hardware/software interfaces to efficiently DMA large MTUs in hardware pages and remap them to a contiguous address space and can outperform memcopy by one to two orders of magnitude.
Abstract: Message transfer unit (MTU) reassembly schemes in modern operating systems cause I/O performance degradation when MTU sizes are larger than the architecture's page size. This can happen with emerging network technologies, such as Asynchronous Transfer Mode (ATM), where MTUs can be 64 KB or greater. Traditional solutions either reassemble using memory copy or preallocate contiguous memory; these, however, lack speed or consume excess resources, respectively. This paper presents an alternative scheme called Virtual Memory MTU Reassembly (VMMR) which reassembles non-contiguous pages through virtual memory remapping. VMMR allows hardware/software interfaces to efficiently DMA large MTUs in hardware pages and remap them to a contiguous address space. Studies done on a PowerPC 601 show that this method can outperform memcopy by one to two orders of magnitude (the maximum VMMR bandwidth is 14.7 Gbits/sec). High-performance multimedia applications, such as video on demand and video conferencing, can greatly benefit from such a performance boost.

Proceedings ArticleDOI
C. Pyron, J. Prado, J. Golab
01 Nov 1997
TL;DR: The first PowerPC microprocessor in the new G3 generation of designs, the MPC750, incorporates new test strategy approaches to improve the product test quality, reliability, and debug, and to reduce the total time to market.
Abstract: The first PowerPC microprocessor in the new G3 generation of designs, the MPC750, incorporates new test strategy approaches to improve the product test quality, reliability, and debug, and to reduce the total time to market.

Proceedings ArticleDOI
Hector Sanchez, Ross Philip, Jose Alvarez, Gianfranco Gerosa (Motorola, Austin, Texas)
12 Jun 1997
TL;DR: A 5-bit 2.5V temperature sensor implemented in a 0.35µm CMOS technology is described, which results in a lower cost solution that minimizes board area penalty and provides more timely information to enable active thermal management.
Abstract: A 5-bit 2.5V temperature sensor implemented in a 0.35µm CMOS technology is described. The sensor is fully differential and based on the PTAT voltage difference between 2 diodes, yet it does not require a bandgap reference. The resolution is 4°C for a temperature range of 0°C to 128°C. The offset error is 12°C over the process corners. The integral nonlinearity is below 1 LSB and the differential nonlinearity is less than 1/2 LSB. The total area of the sensor is 0.192 mm² and the maximum power dissipation is 10mW at 2.5V. Introduction: The advent of high performance portable electronics puts increased pressure on system integrated solutions. Cost constraints, space limitations, and limited power budgets dictate the need for reducing the number of elements at the board level. External temperature sensors suffer a time-delay in the temperature reading due to the thermal constant from the integrated circuit junction to the external sensor. Furthermore, knowledge of the power consumed and the thermal resistivities is necessary to accurately determine the internal junction temperature. Integrating the temperature sensor results in a lower cost solution that minimizes board area penalty and provides more timely information to enable active thermal management. As a result, operating systems can throttle the processor or invoke a static power savings mode. [1]
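
The sensor figures in the abstract above are mutually consistent: a 5-bit output gives 2^5 = 32 codes, and 32 codes at 4°C per code span the stated 0°C to 128°C range. A minimal sketch of that quantization follows, assuming an ideal linear mapping with code 0 starting at 0°C and ignoring the offset and nonlinearity errors the abstract reports; the function names are hypothetical.

```python
BITS = 5            # sensor output width, from the abstract
RESOLUTION_C = 4    # degrees Celsius per code, from the abstract
NUM_CODES = 2 ** BITS


def temperature_to_code(temp_c):
    """Map a junction temperature to an ideal sensor code, clamped to the 5-bit range."""
    code = int(temp_c // RESOLUTION_C)
    return max(0, min(code, NUM_CODES - 1))


def code_to_range(code):
    """Return the (low, high) temperature interval in Celsius that a code represents."""
    low = code * RESOLUTION_C
    return (low, low + RESOLUTION_C)


full_scale_c = NUM_CODES * RESOLUTION_C   # total span covered by all codes
code_at_85 = temperature_to_code(85)      # an example junction temperature
```

With this mapping, software reading the sensor knows the junction temperature only to within one 4°C bin, which is the granularity available to any throttling decision built on it.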

Journal ArticleDOI
TL;DR: NStrace is a bus-driven hardware trace facility developed for the PowerPC® family of superscalar RISC microprocessors that uses a recording of activity on a target processor's bus to infer the sequence of instructions executed during that recording period.
Abstract: NStrace is a bus-driven hardware trace facility developed for the PowerPC® family of superscalar RISC microprocessors. It uses a recording of activity on a target processor's bus to infer the sequence of instructions executed during that recording period. NStrace is distinguished from related approaches by its use of an architecture-level simulator to generate the instruction sequence from the bus recording. The generated trace represents the behavior of the processor as it executes at normal speed while interacting normally with its run-time environment. Furthermore, details of the processor state that are not generally available to other trace mechanisms can be provided by the architectural simulation. There are two main components to the process of generating bus-driven instruction traces: bus capture and trace generation. Bus capture is triggered by a call to a system program that puts a particular address on the bus, then establishes the initial state of the processor by a combination of writing out register values and invalidating caches. A logic analyzer records the bus activity, and from this a file of bus transactions is produced. Trace generation proceeds by driving a processor simulator with these bus transactions and recording the sequence of instructions that results. The processor simulator is an elaboration of that developed for the PowerPC Visual Simulator. We have successfully generated instruction traces for a mix of utility programs and real applications on several microprocessor platforms running several operating systems. The capacity of the bus recording hardware is two million transactions, yielding instruction traces with lengths of the order of one hundred million instructions. This trace facility has been used for a number of studies covering a range of performance issues involving software, hardware, and their interactions.

01 Jan 1997
TL;DR: From the experimental results, and the case studies of PowerPC CAMs, it is believed that the problem of verifying all the different types of CAMs that are found on a modern microprocessor is solved.
Abstract: Verification of memory arrays is an important part of processor verification. Memory arrays include circuits such as on-chip caches, cache tags, register files, and branch prediction buffers having memory cores embedded within complex logic. Such arrays cover large areas of the chip and are critical to the functionality and performance of the system. Hence, these circuits are custom designed at the transistor level to optimize area and performance. Conventional simulation based verification approaches do not work for arrays, as it is infeasible to simulate the astronomical number of simulation patterns that are required to verify these designs. Therefore, we need to look at formal methods to ensure the correctness of these circuits. We have adopted the formal technique of Symbolic Trajectory Evaluation (STE) to solve the array verification problem. STE uses a form of symbolic simulation to check whether a finite state system satisfies a formula expressed in a carefully restricted temporal logic. It can handle switch-level circuits and detailed system timing. However, STE does not resolve many fundamental issues important for verifying arrays. These include the state explosion problem, causing prohibitively large ordered binary decision diagrams (OBDDs) for certain classes of circuits, and the switch-level analysis bottleneck, limiting the size of switch-level circuits that can be analyzed prior to running STE. Our thesis builds upon earlier work on STE to overcome these problems. We have developed techniques to exploit symmetry while verifying transistor-level circuits by STE. We show that exploiting symmetry allows one to verify systems several orders of magnitude larger than otherwise possible. We have verified memory arrays with multimillion transistors. The techniques we have developed also successfully overcome the switch-level analysis bottleneck. We believe that with our work, the problem of static random access memory (SRAM) verification is solved.
We have developed techniques based on new Boolean encodings to efficiently verify content addressable memories (CAMs). Our encodings scale up well in terms of verification memory requirements, as compared to naive approaches. From our experimental results, and our case studies of PowerPC CAMs, we believe that we have solved the problem of verifying all the different types of CAMs that are found on a modern microprocessor. To facilitate the use of STE, we have developed an automated technique to identify the internal state nodes in transistor netlists. We have used the techniques developed in this thesis to successfully verify several memory arrays from state of the art PowerPC microprocessor designs.

Proceedings ArticleDOI
23 Feb 1997
TL;DR: This microprocessor is a third-generation PowerPC microprocessor and is a member of the G3 family of PowerPC processor products, which makes it suited for high-end desktop systems, but its low typical power dissipation of 5W and size make it very attractive for portable systems as well.
Abstract: This microprocessor is a third-generation PowerPC microprocessor and is a member of the G3 family of PowerPC processor products. Although its high performance (estimated at 10.0 SPECint95) makes it suited for high-end desktop systems, its low typical power dissipation of 5W and size of 66.5 mm² make it very attractive for portable systems as well. This microprocessor is a dual-issue superscalar machine with a four-stage pipeline, separate instruction- and data-side L1 caches (32 kBytes each), and full tags and support for up to 1 MByte of back-side L2. A thermal assist unit and I-cache throttling feature are included in the microprocessor as additional tools for thermal management. This microprocessor was designed in a 0.25-µm CMOS process (0.18-µm Leff) to operate at a frequency of 250 MHz.

Proceedings ArticleDOI
C. Hunter
12 Oct 1997
TL;DR: Integration of diagnostics with a memory built-in self-test (BIST) allowing both an effective and efficient manufacturing test as well as an effective diagnostic capability is detailed.
Abstract: Integration of diagnostics with a memory built-in self-test (BIST) allowing both an effective and efficient manufacturing test as well as an effective diagnostic capability is detailed. Detection of failures within a memory is the primary objective of an embedded memory BIST. Inclusion of comprehensive diagnostics, to isolate faults incurred during the manufacturing or design process, is an extension to the memory BIST, further utilizing the existing circuitry and functionality. Detection and diagnosis of failures within an embedded memory can be realized throughout the entire manufacturing process.

Proceedings ArticleDOI
06 Feb 1997
TL;DR: This superscalar microprocessor is a 32b implementation of the PowerPC Architecture(TM) specification based on a micro-architecture designed for high performance and low power.
Abstract: This superscalar microprocessor is a 32b implementation of the PowerPC Architecture™ specification based on a micro-architecture designed for high performance and low power. Two instructions per cycle can be dispatched in this superscalar design. The processor includes dual 32kB 8-way instruction and data caches, a floating-point unit, two integer units, a branch unit, a load/store unit, and a system unit. An L2 tag and cache controller with a dedicated L2 bus interface are added to provide a low-cost L2 cache solution using commodity SRAMs for the data.

Proceedings ArticleDOI
C. Roth, F. Levine
05 Feb 1997
TL;DR: The evolution of performance monitoring from its roots in the Power™ architecture to its current state is explored, and the authors conclude with observations about issues related to the availability of the PM to end users.
Abstract: The evolution of performance monitoring (PM) from its roots in the Power™ architecture to its current state is explored. Further discussed are many of the PM features in the PowerPC 604e, and the differences between the PMs in some PowerPC processors. To hide some of these differences, an Application Programming Interface (API) was developed. The authors conclude with some of their observations about issues related to the availability of the PM to end users.

Proceedings ArticleDOI
T.H. Einstein
01 Apr 1997
TL;DR: The rationale for heterogeneity in a multicomputer is described and a typical example of a heterogeneous system in the form of a RACE multicomputer composed of a mixture of Analog Devices' SHARC 21060 and the IBM/Motorola/Apple PowerPC 603p processors is given.
Abstract: A heterogeneous multicomputer is a multicomputer composed of two or more different types of processors. This paper describes the rationale for heterogeneity in a multicomputer and gives a typical example of a heterogeneous system in the form of a RACE multicomputer composed of a mixture of Analog Devices' SHARC 21060 and the IBM/Motorola/Apple PowerPC 603p processors. These two processors have complementary attributes, and the advantages and limitations of each are described. Multicomputers generally implement a sequence of different processing algorithms. The "optimal" processor that maximizes throughput at each step in the processing flow is generally a function of the algorithm to be executed at that step. Other factors that also influence the optimal mix of processors in a heterogeneous multicomputer include physical processing density, hardware cost, and ease of programmability.

01 Jan 1997
TL;DR: This microprocessor is a dual-issue superscalar machine with a four-stage pipeline, separate instruction- and data-side L1 caches (32KB each), and full tags and support for up to 1MB of back-side L2.
Abstract: This microprocessor is a third-generation PowerPC microprocessor and is a member of the G3 family of PowerPC processor products. Although its high performance (estimated 10.0 SPECint95) makes it suited for high-end desktop systems, its low typical power dissipation of 5W and size of 66.5 mm² make it very attractive for portable systems as well. This microprocessor is a dual-issue superscalar machine with a four-stage pipeline, separate instruction- and data-side L1 caches (32KB each), and full tags and support for up to 1MB of back-side L2. A Thermal Assist Unit and I-cache Throttling feature are included in the microprocessor as additional tools for thermal management. This microprocessor was designed in a 0.25µm CMOS process (0.18µm Leff) to operate at a frequency of 250MHz.

Journal ArticleDOI
01 Jul 1997
TL;DR: The clock design methodology and techniques used in the design of clock distribution networks for PowerPC™ microprocessors are described; they aim at alleviating some of the issues that arise in clock network design.
Abstract: Clock distribution design for high performance microprocessors has become increasingly challenging in recent years. Design goals of state-of-the-art integrated circuits dictate the need for clock networks with smaller skew tolerances, large sizes, and lower capacitances. In this paper we discuss some of the issues in clock network design that arise in this context. We describe the clock design methodology and techniques used in the design of clock distribution networks for PowerPC™ microprocessors that aim at alleviating some of these problems.

Book ChapterDOI
26 Aug 1997
TL;DR: This paper presents a realistic study on the case for simultaneous multithreading by using extensive simulations to determine balanced configurations of a multithreaded version of the PowerPC 620, measuring their performance on multithreaded benchmarks written using the commercial Pthreads API, and estimating their hardware complexity in terms of increases in die area.
Abstract: Simultaneous multithreading is a recently proposed technique in which instructions from multiple threads are dispatched and/or issued concurrently in every clock cycle. This technique has been claimed to improve the latency of multithreaded programs and the throughput of multiprogrammed workloads with a minimal increase in hardware complexity. This paper presents a realistic study on the case for simultaneous multithreading by using extensive simulations to determine balanced configurations of a multithreaded version of the PowerPC 620, measuring their performance on multithreaded benchmarks written using the commercial Pthreads API, and estimating their hardware complexity in terms of increases in die area. Our results show that a balanced 2-threaded 620 achieves a 41.6% to 71.3% speedup over the original 620 on five multithreaded benchmarks with an estimated 36.4% increase in die area and no impact on single thread performance. The balanced 4-threaded 620 achieves a 46.9% to 111.6% speedup over the original 620 with an estimated 70.4% increase in die area and a detrimental impact on single thread performance.

Proceedings ArticleDOI
R. Raimi, J. Lear
03 Nov 1997
TL;DR: It is claimed that model checking can efficiently characterize failures when certain pre-conditions are met, and the implications for verification methodologies over the full design cycle are discussed.
Abstract: When silicon is available, newly designed microprocessors are tested in specially equipped hardware laboratories, where real applications can be run at hardware speeds. However, the large volumes of code being run, plus the limited access to the internal nodes of the chip, make it extraordinarily difficult to characterize the nature of any failures that occur. In this paper, we describe how the formal verification technique of temporal logic model checking was used to quickly characterize a design error exhibited during hardware testing of the PowerPC 620 microprocessor. We claim that model checking can efficiently characterize such failures when certain pre-conditions are met. We also show how the same error could have been revealed early in the design cycle, by model checking a short and simple correctness specification. We discuss the implications of this for verification methodologies over the full design cycle.

Proceedings ArticleDOI
05 Feb 1997
TL;DR: The microprocessor discussed in this paper is an advanced superscalar design with six execution units, aggressive upstream branch processing, out-of-order instruction execution, and a tightly integrated "backside" L2 cache that achieves workstation/server class performance while only dissipating 5 watts.
Abstract: The microprocessor discussed in this paper is a member of the G3 family of PowerPC processors, the third generation of PowerPC microprocessor products. It provides the performance levels required for high end desktop systems while offering the low typical power dissipation and small die size that make it very attractive for portable systems. It is an advanced superscalar design with six execution units, aggressive upstream branch processing, out-of-order instruction execution, and a tightly integrated "backside" L2 cache. Most notably it achieves workstation/server class performance while only dissipating 5 watts. A major portion of the design effort involved architectural performance modeling, making cost/power/performance trade-offs, and verifying performance of the implementation.