Showing papers on "PowerPC" published in 1998


Proceedings ArticleDOI
01 May 1998
TL;DR: A methodology for the design and analysis of power grids in the PowerPC™ microprocessors covering the need for power grid analysis across all stages of the design process is presented.
Abstract: We present a methodology for the design and analysis of power grids in the PowerPC™ microprocessors. The methodology covers the need for power grid analysis across all stages of the design process. A case study showing the application of this methodology to the PowerPC™ 750 microprocessor is presented.

190 citations


Journal ArticleDOI
TL;DR: The memory management designs of a sampling of six recent processors are considered, focusing primarily on their architectural differences, and hint at optimizations that someone designing or porting system software might want to consider.
Abstract: Here, we consider the memory management designs of a sampling of six recent processors, focusing primarily on their architectural differences, and hint at optimizations that someone designing or porting system software might want to consider. We selected examples from the most popular commercial microarchitectures: the MIPS R10000, Alpha 21164, PowerPC 604, PA-8000, UltraSPARC-I, and Pentium II. This survey describes how each processor architecture supports the common features of virtual memory: address space protection, shared memory, and large address spaces.

115 citations


Journal ArticleDOI
TL;DR: Tui is a migration system that is able to translate the memory image of a program between four common architectures (m68000, SPARC, i486 and PowerPC) and requires detailed knowledge of all data types and variables used with the program.
Abstract: Heterogeneous process migration is a technique whereby an active process is moved from one machine to another. It must then continue normal execution and communication. The source and destination processors can have a different architecture, that is, different instruction sets and data formats. Because of this heterogeneity, the entire process memory image must be translated during the migration. Tui is a migration system that is able to translate the memory image of a program (written in ANSI-C) between four common architectures (m68000, SPARC, i486 and PowerPC). This requires detailed knowledge of all data types and variables used with the program. This is not always possible in non-type-safe (but popular) languages such as ANSI-C, Pascal and Fortran. The important features of the Tui algorithm are discussed in great detail. This includes the method by which a program's entire set of data values can be located, and eventually reconstructed on the target processor. Performance figures demonstrating the viability of using Tui to migrate real applications are given. © 1998 John Wiley & Sons, Ltd.
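The abstract's point that translation needs full type knowledge can be illustrated with a minimal sketch. It is not Tui's algorithm: the record layout (`RECORD_FIELDS`) and the function name `translate_record` are hypothetical, and the sketch ignores alignment, pointers, and differing type sizes, all of which a real migration system has to handle. It assumes only that the source machine is big-endian (as m68000 and SPARC are, and PowerPC typically is) and the destination is little-endian (as i486 is).

```python
import struct

# Hypothetical record layout (an assumption for illustration):
# int32 id, int16 flags, float64 value.
RECORD_FIELDS = "ihd"


def translate_record(image: bytes, fields: str = RECORD_FIELDS,
                     src: str = ">", dst: str = "<") -> bytes:
    """Re-encode one record from the source byte order to the destination.

    Each field is decoded and re-encoded separately, which is only possible
    because the field types are known.
    """
    values = struct.unpack(src + fields, image)
    return struct.pack(dst + fields, *values)


# A record as it would sit in a big-endian memory image.
big_endian_image = struct.pack(">" + RECORD_FIELDS, 1998, 7, 2.5)
little_endian_image = translate_record(big_endian_image)

# Reversing the whole byte string is not a substitute: it would also
# reverse the order of the fields, not just the bytes within each field.
naive_reversal = big_endian_image[::-1]
```

Without the type string, the translator could not tell where one field ends and the next begins, which is the difficulty the abstract attributes to non-type-safe languages.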

98 citations


Journal ArticleDOI
TL;DR: This paper describes a method for improving code size efficiency involving the use of compression techniques to reduce the size of the stored code, and on-the-fly hardware decompression at full processor speed for execution.
Abstract: Code size efficiency is a critical parameter in the design of computer systems for embedded applications. This paper describes a method for improving code size efficiency involving the use of compression techniques to reduce the size of the stored code, and on-the-fly hardware decompression at full processor speed for execution. A simple frequency-based encoding scheme for PowerPC® code achieves a typical code size reduction to 60% of the original size. A corresponding decompression core has been implemented for an embedded microprocessor, such as the PowerPC 401™. The compression/decompression scheme operates in a manner transparent to the processor and requires no changes to such tools as compilers, linkers, and loaders.
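As a rough illustration of what a frequency-based encoding can look like, the sketch below replaces the most frequent 32-bit instruction words with short dictionary indexes and stores every other word verbatim behind a one-bit flag. This is an assumed toy scheme, not the encoding described in the paper; the function names, dictionary size, and flag-bit format are all hypothetical.

```python
from collections import Counter


def build_dictionary(words, size=16):
    """Return the `size` most frequent instruction words."""
    return [word for word, _ in Counter(words).most_common(size)]


def compressed_bits(words, dictionary, index_bits=4, word_bits=32):
    """Size in bits under the assumed format.

    Dictionary hit:  1 flag bit + index_bits.
    Dictionary miss: 1 flag bit + the raw word.
    """
    table = set(dictionary)
    return sum(1 + (index_bits if w in table else word_bits) for w in words)


# Toy instruction stream of 32-bit words with a skewed distribution.
words = [0x60000000] * 6 + [0x4E800020] * 2 + [0x38600001, 0x7C0802A6]
dictionary = build_dictionary(words, size=2)
ratio = compressed_bits(words, dictionary, index_bits=1) / (32 * len(words))
```

The ratio depends entirely on how skewed the word frequencies are, so the toy stream here says nothing about the 60% figure reported in the abstract.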

96 citations


Journal ArticleDOI
TL;DR: Although fundamentally related, DSP processors are significantly different from general purpose processors (GPPs) like the Intel Pentium or PowerPC, and the authors explain what DSP processors are and what they do.
Abstract: These days, the once obscure engineering term "DSP" (digital signal processing) is working its way into common use. It has begun to crop up on the labels of an ever wider range of products, from home audio components to answering machines. This is not merely a reflection of a new marketing strategy, however; there truly is more digital signal processing inside today's products than ever before. But why is the market for DSP processors booming? The answer is somewhat circular: as microprocessor fabrication processes have become more sophisticated, the cost of a microprocessor capable of performing DSP tasks has dropped significantly to the point where such a processor can be used in consumer products and other cost sensitive systems. As a result, more and more products have begun using DSP processors, fueling demand for faster, smaller, cheaper, more energy-efficient chips. Although fundamentally related, DSP processors are significantly different from general purpose processors (GPPs) like the Intel Pentium or PowerPC. The authors explain what DSP processors are and what they do. They also offer a guide to evaluating DSP processors for use in a product or application.

93 citations


Journal ArticleDOI
05 Feb 1998
TL;DR: This 64 b single-issue integer processor, comprised of about one million transistors, is fabricated in a 0.15 μm effective channel length, six-metal-layer CMOS technology and intended as a vehicle to explore circuit, clocking, microarchitecture, and methodology options for high-frequency processors.
Abstract: This 64 b single-issue integer processor, comprised of about one million transistors, is fabricated in a 0.15 μm effective channel length, six-metal-layer CMOS technology. Intended as a vehicle to explore circuit, clocking, microarchitecture, and methodology options for high-frequency processors, the processor prototype implements 60 fixed-point compare, logical, arithmetic, and rotate-merge-mask instructions of the PowerPC instruction-set architecture with single-cycle latency. The processor executes programs written in this instruction subset from cache with a 1 ns cycle. In addition, the prototype implements 36 PowerPC load/store instructions that execute as single-cycle operations (zero wait cycles) with 1.15 ns latency. Full data forwarding and full at-speed scan testing are supported.

75 citations


Journal ArticleDOI
05 Feb 1998
TL;DR: In this article, a 32 b 480 MHz PowerPC reduced-instruction-set-computer (RISC) microprocessor is migrated into an advanced 0.2 μm CMOS technology with copper interconnects and multi-threshold transistors.
Abstract: A 32 b 480 MHz PowerPC reduced-instruction-set-computer (RISC) microprocessor is migrated into an advanced 0.2 μm CMOS technology with copper interconnects and multi-threshold transistors. These technology features have helped to increase the microprocessor internal clock frequency to 480 MHz at 2.0 V and 85 °C, and at the fast end of the process distribution. When operating at room temperature, the clock frequency increases to over 500 MHz. The microprocessor architecture includes two 32 KB L1 caches, one for data and one for instructions, integrated L2 cache controller working with L2 caches of 256 KB, 512 KB, or 1 MB, and I/Os interfacing with the external bus using industry-standard 3.3 V. The microprocessor is implemented in 2.5 V CMOS technology and has migrated to 1.8 V CMOS technology.

68 citations


Book ChapterDOI
14 Jun 1998
TL;DR: A new recursive QR factorization algorithm whose automatic variable blocking replaces a level 2 part of a standard block algorithm with level 3 operations; a hybrid recursive variant outperforms the LAPACK algorithm DGEQRF by 78% to 21% as m = n increases from 100 to 1000.
Abstract: We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allows us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the updates which prohibit the efficient use of the recursion for large n. This obstacle is overcome by using a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by 78% to 21% as m = n increases from 100 to 1000. A successful parallel implementation on a PowerPC 604 based IBM SMP node based on dynamic load balancing is presented. For 2, 3, 4 processors and m = n = 2000 it shows speedups of 1.96, 2.99, and 3.92 compared to our uniprocessor algorithm.
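The idea of recursing on column halves can be sketched in a few lines. The version below is an assumption-laden illustration, not the paper's algorithm: it uses Gram-Schmidt-style projections instead of Householder transformations, works on plain Python lists, assumes A has full column rank, and makes no attempt at performance. It does show how the column split turns the update of the right half into matrix-matrix products, which is where level 3 operations would be used in an optimized implementation.

```python
import math


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def recursive_qr(A):
    """Return (Q, R) with A = Q R.

    Q is m-by-n with orthonormal columns and R is n-by-n upper triangular.
    Assumes A has full column rank.
    """
    n = len(A[0])
    if n == 1:
        # Base case: normalize a single column.
        norm = math.sqrt(sum(row[0] ** 2 for row in A))
        return [[row[0] / norm] for row in A], [[norm]]
    k = n // 2
    A1 = [row[:k] for row in A]
    A2 = [row[k:] for row in A]
    Q1, R11 = recursive_qr(A1)               # factor the left half
    R12 = matmul(transpose(Q1), A2)          # matrix-matrix product
    Q1R12 = matmul(Q1, R12)                  # matrix-matrix product
    A2_updated = [[a - b for a, b in zip(ra, rb)]
                  for ra, rb in zip(A2, Q1R12)]
    Q2, R22 = recursive_qr(A2_updated)       # factor the updated right half
    Q = [q1 + q2 for q1, q2 in zip(Q1, Q2)]
    R = ([r11 + r12 for r11, r12 in zip(R11, R12)]
         + [[0.0] * k + r22 for r22 in R22])
    return Q, R


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0], [2.0, 1.0, 0.0]]
Q, R = recursive_qr(A)
```

The block sizes are never chosen explicitly; they fall out of the recursion depth, which is what "automatic variable blocking" refers to.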

59 citations


Proceedings ArticleDOI
05 Oct 1998
TL;DR: Performance simulations show that the simplicity of a VLIW architecture allows a wide-issue processor to operate at high frequencies.
Abstract: Presented is an 8-issue tree-VLIW processor designed for efficient support of dynamic binary translation. This processor confronts two primary problems faced by VLIW architectures: binary compatibility and branch performance. Binary compatibility with existing architectures is achieved through dynamic binary translation which translates and schedules PowerPC instructions to take advantage of the available instruction level parallelism. Efficient branch performance is achieved through tree instructions that support multi-way path and branch selection within a single VLIW instruction. The processor architecture is described, along with design details of the branch unit, pipeline, register file and memory hierarchy for a 0.25 micron standard-cell design. Performance simulations show that the simplicity of a VLIW architecture allows a wide-issue processor to operate at high frequencies.

58 citations


Journal ArticleDOI
TL;DR: The goal of the guTS project was to demonstrate that circuit techniques, and circuit-centric design, could significantly increase the performance of microprocessors, thus providing headroom for future performance growth beyond contributions from microarchitecture and CMOS technology.
Abstract: At the IEEE International Solid State Circuits Conference this February, the IBM Austin Research Laboratory presented an experimental 64-bit integer processor called guTS (gigahertz unit Test Site). The goal of the guTS project was to demonstrate that circuit techniques, and circuit-centric design, could significantly increase the performance of microprocessors, thus providing headroom for future performance growth beyond contributions from microarchitecture and CMOS technology. To clearly distinguish the design contributions of this project from innovations in CMOS technology we chose a fabrication technology that was in production in 1997. The guTS processor is a full-custom, nearly 100% dynamic design. Its single-issue core implements 96 instructions from the integer subset of the PowerPC instruction set architecture, and covers in excess of 90% of instructions executed in typical code. Address translation, floating-point, and I/O-related instructions are omitted. All instructions, including loads and stores, execute in one cycle. We measured core speeds in excess of a gigahertz. We focus here on the circuit-centric design approach that enabled the gigahertz result. This approach requires designers to operate across the boundaries of microarchitecture, logic, circuit, and physical design. We explain why developments in CMOS technology increasingly favor this approach.

56 citations


Journal ArticleDOI
R.M. Jessani, M. Putrino
TL;DR: The paper discusses the design complexities around the dual pass multiply array and its effect on area and performance in a given technology (PowerPC 604e™ and PowerPC 603e™ microprocessors).
Abstract: Low power, low cost, and high performance factors dictate the design of many microprocessors targeted to the low power computing market. The floating point unit occupies a significant percentage of the silicon area in a microprocessor due to its wide data bandwidth (for double precision computations) and the area occupied by the multiply array. For microprocessors designed for portable products, the design size of the floating point unit plays an important role in the low cost factor driven by reduced chip area. Some microprocessors have multiply-add fused floating point units with a reduced multiply array, requiring two passes through the array for operations involving double precision multiplies. The paper discusses the design complexities around the dual pass multiply array and its effect on area and performance. Floating point unit areas and their associated multiply array areas are compared for a single and dual pass implementation in a given technology (PowerPC 604e™ and PowerPC 603e™ microprocessors, respectively).

Proceedings ArticleDOI
05 Oct 1998
TL;DR: The design methodology used to build an experimental 1.0 GigaHertz PowerPC integer microprocessor at IBM's Austin Research Laboratory is described, covering design and verification tools as well as circuit constraints and microarchitecture philosophy.
Abstract: This paper describes the design methodology used to build an experimental 1.0 GigaHertz PowerPC integer microprocessor at IBM's Austin Research Laboratory. The high frequency requirements dictated the chip composition to be almost entirely custom macros using dynamic circuit techniques. The methodology presented will cover design and verification tools as well as circuit constraints and microarchitecture philosophy. The microarchitecture, circuits and tools were defined by the high frequency requirements of the processor as well as the aggressive design schedule and size of the design team.

Journal Article
TL;DR: A technique for modeling environmental constraints that avoids the need for explicit construction of environments is presented and supports an assume/guarantee style of reasoning that also supports simulation monitors.
Abstract: A time-consuming and error-prone activity in symbolic model-checking is the construction of environments. We present a technique for modeling environmental constraints that avoids the need for explicit construction of environments. Moreover, our approach supports an assume/guarantee style of reasoning that also supports simulation monitors. We give examples of the use of constraints in PowerPC™ verification.

Proceedings ArticleDOI
07 Nov 1998
TL;DR: The initial configuration of StarT-Voyager implements four forms of message passing along with S-COMA and NUMA shared memory support, and can be reconfigured to introduce new mechanisms improving usability and performance.
Abstract: This paper describes StarT-Voyager, a machine designed as an experimental platform for research in cluster system communication. The heart of StarT-Voyager is a network interface unit (NIU) that connects the memory bus of a PowerPC-based SMP to the MIT Arctic network. The NIU is highly flexible, with its set of functions easily modified by firmware or by programmable hardware, making it possible to compare different communication interfaces and implementation strategies on a common platform. Its flexibility comes from a fast embedded processor and large, fast FPGAs that surround a high-speed protected communication core. Its efficiency comes from a set of primitive operations that are implemented in hardware and are designed to reduce the firmware overhead. Our initial configuration of StarT-Voyager implements four forms of message passing along with S-COMA and NUMA shared memory support. With experimentation on the machine, it can be reconfigured to introduce new mechanisms improving usability and performance.

Journal ArticleDOI
01 Sep 1998
TL;DR: The improved time performance resulting from parallelisation of the Monte Carlo calculations makes the Eidolon Monte Carlo program an attractive tool for modelling photon transport in 3-D positron tomography.
Abstract: This paper describes the implementation of the Eidolon Monte Carlo program designed to simulate fully three-dimensional (3-D) cylindrical positron tomographs on a MIMD parallel architecture. The original code was written in Objective-C and developed under the NeXTSTEP development environment. Different steps involved in porting the software to a parallel architecture based on PowerPC 604 processors running under AIX 4.1 are presented. Basic aspects and strategies of running Monte Carlo calculations on parallel computers are described. A linear decrease of the computing time was achieved with the number of computing nodes. The improved time performance resulting from parallelisation of the Monte Carlo calculations makes it an attractive tool for modelling photon transport in 3-D positron tomography. The parallelisation paradigm used in this work is independent from the chosen parallel architecture.

Proceedings ArticleDOI
01 May 1998
TL;DR: A novel method to automate the assertion creation process which improves the efficiency and the quality of array verification and encouraging results on recent PowerPC arrays are presented.
Abstract: For verifying complex sequential blocks such as microprocessor embedded arrays, the formal method of symbolic trajectory evaluation (STE) has achieved great success in the past [3], [5], [6]. Past STE methodology for arrays requires manual creation of “assertions” to which both the RTL view and the actual design should be equivalent. In this paper, we describe a novel method to automate the assertion creation process which improves the efficiency and the quality of array verification. Encouraging results on recent PowerPC arrays will be presented.

Journal ArticleDOI
TL;DR: This paper discusses several time-saving modifications to published Fitch-parsimony tree search algorithms, including shortcuts that allow rapid evaluation of tree lengths and fast reoptimization of trees after clipping or joining of subtrees, as well as search strategies that allows one to successively increase the exhaustiveness of branch swapping.
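For context, the baseline that such shortcuts speed up is the standard Fitch bottom-up pass, which computes the length of a tree for one character. The sketch below is that textbook pass only, not any of the modifications the paper discusses; the nested-tuple tree encoding and the name `fitch_length` are assumptions for illustration.

```python
def fitch_length(tree, states):
    """Fitch parsimony length of a rooted binary tree for one character.

    tree:   a leaf name (str) or a 2-tuple of subtrees
    states: mapping from leaf name to its observed character state
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = (
            visit(node[0]), visit(node[1]))
        common = left_set & right_set
        if common:
            # The children can agree on a state: no extra change needed.
            return common, left_cost + right_cost
        # The children disagree: take the union and count one change.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


tree = (("a", "b"), ("c", "d"))
one_change = fitch_length(tree, {"a": "A", "b": "C", "c": "A", "d": "A"})
three_changes = fitch_length(tree, {"a": "A", "b": "C", "c": "G", "d": "T"})
```

A full tree length is the sum of this value over all characters, and recomputing it from scratch after every subtree rearrangement is the cost that fast reoptimization aims to avoid.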

Proceedings ArticleDOI
17 Dec 1998
TL;DR: MIT's StarT-Voyager, a hybrid message passing/shared memory parallel machine, provides four message passing mechanisms to achieve high performance over a wide spectrum of communication types and sizes.
Abstract: No single message passing mechanism can efficiently support all types of communication that commonly occur in most parallel or distributed programs. MIT's StarT-Voyager, a hybrid message passing/shared memory parallel machine, provides four message passing mechanisms to achieve high performance over a wide spectrum of communication types and sizes. Hardware and address translation enforced protection allows direct user-level access to message passing facilities in a multiuser environment. StarT-Voyager's protection scheme improves upon past designs by not requiring strictly synchronized gang-scheduling, and by supporting non-monolithic protection domains. To minimize the development effort and cost, the machine is designed to use unmodified commercial PowerPC 604-based SMP systems as the building block. A Network End-point Subsystem (NES) card which plugs into one of each SMP's processor card slots provides the interface to Arctic, a low-latency, high-bandwidth network developed at MIT. This paper describes StarT-Voyager's message passing mechanisms and their predicted performance.

Journal ArticleDOI
TL;DR: A modular computing architecture used for intelligent control of autonomous robots, which takes the form of multiple sensing and control layers, based on Locally Intelligent Control Agents in which IBM PowerPC, SIEMENS 80C166, and INMOS Transputers are adopted.

Proceedings ArticleDOI
05 Feb 1998
TL;DR: A coarse-grained hardware-multithreaded processor for use in the IBM AS/400 uses a PowerPC architecture that supports two threads, which requires replicating the processor architecture registers for each thread.
Abstract: Implementation of a coarse-grained hardware-multithreaded processor for use in the IBM AS/400 uses a PowerPC architecture that supports two threads. Hardware multithreading is a technique for tolerating memory latency by utilizing otherwise idle cycles in the CPU. This requires the replication of the processor architecture registers for each thread. Replication is not required for the majority of processor logic such as instruction cache, data cache, TLB, instruction fetch and dispatch mechanisms, branch units, fixed-point units, floating-point units, and storage-control units.

Proceedings ArticleDOI
Yossi Malka, Avi Ziv
01 May 1998
TL;DR: A study on two implementations of state-of-the-art PowerPC processors that shows that statistical analysis of bug discovery data can provide quality information on the progress of verification and good predictions of the number of bugs left in the design and the future MTTF.
Abstract: Statistical analysis of bug discovery data is used in the software industry to check the quality of the testing process and estimate the reliability of the tested program. In this paper, we show that the same techniques are applicable to hardware design verification. We performed a study on two implementations of state-of-the-art PowerPC processors that shows that these techniques can provide quality information on the progress of verification and good predictions of the number of bugs left in the design and the future MTTF.

Proceedings ArticleDOI
E.K. Vida-Torku, G. Joos
18 Oct 1998
TL;DR: The addressing and clocking schemes in PowerPC™ microprocessor embedded memories present modeling challenges and aggressive Design for Test implementations are needed to help the test generation tools.
Abstract: The addressing and clocking schemes in PowerPC™ microprocessor embedded memories present modeling challenges. The ability of most scan based test tools to accurately generate test patterns for these embedded memories is limited. What is needed is aggressive Design for Test implementations that can help the test generation tools. In this paper we present our experiences in the design, modeling, and test of high performance embedded memories on the PowerPC microprocessors.

Proceedings ArticleDOI
R. Raina, R. Molyneaux
19 Feb 1998
TL;DR: A novel method is described that can be used to generate test stimuli that are random as well as self-testing for digital systems by taking advantage of certain properties of the Design Under Validation.
Abstract: This paper describes a novel method for generating test stimuli for digital systems. By taking advantage of certain properties of the Design Under Validation, the method can be used to generate test stimuli that are random as well as self-testing. We discuss the requirements and limitations of this method on practical designs. The use of this method for High-Level Design Validation of caches in PowerPC™ microprocessors is also described. The paper concludes by identifying areas where further work is needed.

Proceedings ArticleDOI
21 Dec 1998
TL;DR: The Chidi system is a PCI-bus media processor card which performs its processing tasks on a large field-programmable gate array in conjunction with a general purpose CPU (PowerPC 604e).
Abstract: The Chidi system is a PCI-bus media processor card which performs its processing tasks on a large field-programmable gate array (Altera 10K100) in conjunction with a general purpose CPU (PowerPC 604e). Special address-generation and buffering logic (also implemented on FPGAs) allows the reconfigurable processor to share a local bus with the CPU, turning burst accesses to memory into continuous streams and converting between the memory's 64-bit words and the media data types. In this paper we present the design requirements for the Chidi system, describe the hardware architecture, and discuss the software model for its use in media processing.

Journal ArticleDOI
TL;DR: The Tore Supra Data Acquisition System was completely redesigned in 1996, for plasma control and long pulse operation, and is now based on the Rtworks package which provides all the software basic modules to build a distributed on-line system.
Abstract: The Tore Supra Data Acquisition System was completely redesigned in 1996, for plasma control and long pulse operation. It is now based on the Rtworks package which provides all the software basic modules to build a distributed on-line system, from data acquisition level up to run control and data display. At the same time, the real-time plasma control system has been improved with several control units upgraded with VME hardware and PowerPC processors interconnected through a fast shared memory network.

17 Mar 1998
TL;DR: The Rensselaer Interconnect Performance Estimator (RIPE) is a design and evaluation tool for analyzing the impact on size, wireability, performance, power dissipation and reliability of single chip microprocessors as a function of interconnect, device, circuit, design and architectural parameters.
Abstract: The purpose of this work is the development of a design and evaluation tool, named "Rensselaer Interconnect Performance Estimator" (RIPE), to analyze the impact on size, wireability, performance, power dissipation and reliability of single chip microprocessors as a function of interconnect, device, circuit, design and architectural parameters. A study of existing microprocessors and their design practices has been done to identify the parameters required to model such a system to the first order. As a result, a system model encompassing memory, core logic and I/O circuitry has been presented. Compared to earlier performance estimators, such as SUSPENS and Sai-Halasz' cycle time estimator, RIPE can accurately predict the overall performance of current microprocessor systems. For the three major microprocessor architectures: DEC, PowerPC and Intel, RIPE results indicated agreement within 10% on key parameters such as transistor count, area, wiring levels, clock frequency and power dissipation. The RIPE model has also been used to study the SIA (Semiconductor Industry Association) Roadmap predictions and technology characteristics for future microprocessor systems. The results indicate that for the 0.10 $\mu$m generation, the performance of interconnect limits overall performance and a combination of performance improving design techniques, such as interconnect length limiting floorplans, new interconnect materials and architectures, are needed to be able to meet future performance goals.

Journal ArticleDOI
TL;DR: An interactive dotmatrix program for the MacOS was designed that allows comparison of DNA to protein sequences using nested 3-frame translations.
Abstract: Summary: An interactive dotmatrix program for the MacOS was designed that allows comparison of DNA to protein sequences using nested 3-frame translations. Availability: Shareware, available at http://copan.bioz.unibas.ch/software/ Contact: burglin@ubaclu.unibas.ch
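To make the comparison concrete, the sketch below translates a DNA sequence in its three forward reading frames with the standard genetic code and builds one identity dot matrix per frame against a protein sequence. It is a minimal illustration under stated assumptions: the function names are hypothetical, only exact residue matches are marked, and how the program nests the three frames in a single display is not described in the abstract, so that part is not modelled.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, frame):
    """Translate `dna` starting at offset `frame` (0, 1, or 2).

    Stop codons become '*'; codons with unknown bases become 'X'.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


def dot_matrices(dna, protein):
    """One boolean matrix per frame: True where residues are identical."""
    return [[[a == b for b in protein] for a in translated]
            for translated in three_frame(dna)]


frames = three_frame("ATGGCCTAA")
matrices = dot_matrices("ATGGCCTAA", "MA")
```

Diagonal runs of `True` in any of the three matrices indicate a stretch of the DNA that encodes part of the protein in that frame.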

Proceedings ArticleDOI
J.L. Burns, K.J. Nowka
11 Jun 1998
TL;DR: A unique, high-frequency dataflow macro is described that accelerates conditional-branch resolution by computing condition codes in parallel with the corresponding arithmetic results, reducing conditional-branch latency while achieving high speed through a pulse-node, delayed-reset dynamic circuit implementation.
Abstract: Improving the speed and performance of microprocessors requires aggressive leveraging of the interplay of microarchitecture and circuit design. We describe a unique, high-frequency dataflow macro for accelerating conditional-branch resolution by computing condition codes in parallel with computing the corresponding arithmetic results. This macro improves the microarchitecture by reducing conditional-branch latency while achieving high speed through a pulse-node, delayed-reset dynamic circuit implementation. The design has been realized in a 64-bit PowerPC integer processor that operates at 1.0 GHz (0.15 micron CMOS process).

Patent
27 Apr 1998
TL;DR: Storage access blocking instructions, such as the EIEIO instruction implemented within the PowerPC architecture, block other storage access instructions at the bus interface stage as opposed to the execute stage, so that cacheable instructions, and other similar instructions not ordered by the EIEIO instruction, are allowed to complete without being blocked by it.
Abstract: Storage access blocking instructions, such as the EIEIO instruction implemented within the PowerPC architecture, block other storage access instructions at the bus interface stage as opposed to the execute stage. Therefore, cacheable instructions, and other similar instructions not ordered by the EIEIO instruction, are allowed to complete without being blocked by such an EIEIO instruction.

Journal ArticleDOI
C. Pyron, J. Prado, J. Golab
TL;DR: Time-to-market goals are intricately entwined with the product testing strategy for a high-performance microprocessor, resulting in an on-time product introduction coupled with improved, more effective and thorough testing.
Abstract: Time-to-market goals are intricately entwined with the product testing strategy for a high-performance microprocessor. The result is an on-time product introduction coupled with improved, more effective and thorough testing.