
Showing papers on "PowerPC published in 1999"


Proceedings ArticleDOI
01 Oct 1999
TL;DR: A new program abstraction for escape analysis, the connection graph, that is used to establish reachability relationships between objects and object references is introduced, and it is shown that the connection graph can be summarized for each method such that the same summary information may be used effectively in different calling contexts.
Abstract: This paper presents a simple and efficient data flow algorithm for escape analysis of objects in Java programs to determine (i) if an object can be allocated on the stack; (ii) if an object is accessed only by a single thread during its lifetime, so that synchronization operations on that object can be removed. We introduce a new program abstraction for escape analysis, the connection graph, that is used to establish reachability relationships between objects and object references. We show that the connection graph can be summarized for each method such that the same summary information may be used effectively in different calling contexts. We present an interprocedural algorithm that uses the above property to efficiently compute the connection graph and identify the non-escaping objects for methods and threads. The experimental results, from a prototype implementation of our framework in the IBM High Performance Compiler for Java, are very promising. The percentage of objects that may be allocated on the stack exceeds 70% of all dynamically created objects in three out of the ten benchmarks (with a median of 19%), 11% to 92% of all lock operations are eliminated in those ten programs (with a median of 51%), and the overall execution time reduction ranges from 2% to 23% (with a median of 7%) on a 333 MHz PowerPC workstation with 128 MB memory.
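As a rough illustration of the connection-graph idea, the following Python sketch (not the paper's IBM compiler implementation; the class and helper names are invented) marks an object as escaping when it is reachable in the graph from an escaping root such as a static field, so that only non-escaping objects remain candidates for stack allocation:

```python
# Sketch: objects escape if they are reachable in the connection graph from an
# escaping root (e.g. a static field, a returned reference, or an object
# handed to another thread). Non-escaping objects are stack-allocation
# candidates and need no synchronization.
from collections import defaultdict

class ConnectionGraph:
    def __init__(self):
        self.points_to = defaultdict(set)   # reference/field -> objects it may point to
        self.escape_roots = set()           # nodes marked as escaping

    def add_edge(self, src, dst):
        self.points_to[src].add(dst)

    def mark_escaping(self, node):
        self.escape_roots.add(node)

    def escaping_objects(self):
        # Propagate escape state by reachability from the escape roots.
        seen, stack = set(), list(self.escape_roots)
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.points_to[node])
        return seen

# obj_a is stored into a static field (escapes); obj_b stays method-local.
g = ConnectionGraph()
g.add_edge("static_field", "obj_a")
g.add_edge("local_ref", "obj_b")
g.mark_escaping("static_field")
print("obj_a escapes:", "obj_a" in g.escaping_objects())   # True  -> heap allocate
print("obj_b escapes:", "obj_b" in g.escaping_objects())   # False -> stack candidate
```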

540 citations


Book ChapterDOI
06 Jul 1999
TL;DR: This paper shows how bounded model checking can take advantage of specialized optimizations and presents a bounded version of the cone of influence reduction, which can bring model checking into the mainstream of industrial chip design.
Abstract: In [1], bounded model checking with the aid of satisfiability solving (SAT) was introduced as an alternative to symbolic model checking with BDDs. In this paper we show how bounded model checking can take advantage of specialized optimizations. We present a bounded version of the cone of influence reduction. We have successfully applied this idea in checking safety properties of a PowerPC microprocessor at Motorola's Somerset PowerPC design center. Based on that experience, we propose a verification methodology that we feel can bring model checking into the mainstream of industrial chip design.
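To make the bounded model checking idea concrete, here is a hedged toy sketch: a two-bit counter is unrolled k steps and searched for a violation of a safety property. A real BMC flow would emit a SAT formula instead of enumerating states, and the cone-of-influence reduction would drop state variables that cannot affect the property within k steps; everything in this sketch is illustrative, not the tool described in the paper.

```python
# Toy bounded model check: unroll a 2-bit counter k steps and search for a
# violation of the safety property "the counter never reaches 3". A real BMC
# flow emits a SAT formula instead of enumerating paths.
from itertools import product

def init(s):     return s == (0, 0)
def trans(s, t): return t == (s[0] ^ s[1], s[1] ^ 1)   # increment by one
def bad(s):      return s == (1, 1)

def bmc(k):
    states = list(product((0, 1), repeat=2))
    def paths(prefix):
        if len(prefix) == k + 1:
            yield prefix
            return
        for t in states:
            if trans(prefix[-1], t):
                yield from paths(prefix + [t])
    for s0 in states:
        if not init(s0):
            continue
        for path in paths([s0]):
            if any(bad(s) for s in path):
                return path                  # counterexample within bound k
    return None

print(bmc(2))   # None: no violation within 2 steps
print(bmc(3))   # [(0, 0), (0, 1), (1, 0), (1, 1)]: violation found at depth 3
```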

151 citations


Journal ArticleDOI
TL;DR: A workload-driven simulation environment for PowerPC processor microarchitecture performance exploration is described; the environment's properties are summarized and examples of its usage are given.
Abstract: Designers face many choices when planning a new high-performance, general purpose microprocessor. Options include superscalar organization (the ability to dispatch and execute more than one instruction at a time), out-of-order issue of instructions, speculative execution, branch prediction, and cache hierarchy. However, the interaction of multiple microarchitecture features is often counterintuitive, raising questions concerning potential performance benefits and other effects on various workloads. Complex design trade-offs require accurate and timely performance modeling, which in turn requires flexible, efficient environments for exploring microarchitecture processor performance. Workload-driven simulation models are essential for microprocessor design space exploration. A processor model must ideally: capture in sufficient detail those features that are already well defined; make evolving assumptions and approximations in interpreting the desired execution semantics for those features that are not yet well defined; and be validated against the existing specification. These requirements suggest the need for an evolving but reasonably precise specification, so that validating against such a specification provides confidence in the results. Processor model validation normally relies on behavioral timing specifications based on test cases that exercise the microarchitecture. This approach, commonly used in simulation-based functional validation methods, is also useful for performance validation. In this article, we describe a workload driven simulation environment for PowerPC processor microarchitecture performance exploration. We summarize the environment's properties and give examples of its usage.

114 citations


Proceedings Article
01 May 1999
TL;DR: "Tui" is a prototype migration system that is able to translate the memory image of a program between four common architectures (m68000, SPARC, i486 and PowerPC).
Abstract: Heterogeneous Process Migration is a technique whereby an active process is moved from one machine to another. It must then continue normal execution and communication. The source and destination processors can have a different architecture, that is, different instruction sets and data formats. Because of this heterogeneity, the entire process memory image must be translated during the migration. "Tui" is a prototype migration system that is able to translate the memory image of a program (written in ANSI-C) between four common architectures (m68000, SPARC, i486 and PowerPC). This requires detailed knowledge of all data types and variables used within the program. This is not always possible in non type-safe (but popular) languages such as C, Pascal and Fortran. The important features of the Tui algorithm are discussed in great detail. This includes the method by which a program's entire set of data values can be located, and eventually reconstructed on the target processor. Initial performance figures demonstrating the viability of using Tui for real migration applications are given.
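The kind of per-value translation Tui must perform can be illustrated with byte-order conversion, the simplest case. The sketch below is a hedged illustration using Python's struct module, not Tui's actual translator, which also handles pointers, structure layout and word-size differences from compiler-generated type information:

```python
# Per-value translation sketch using Python's struct module (not Tui itself):
# every value is decoded with the source machine's layout and re-encoded for
# the target machine.
import struct

def translate_int32(raw, src_big_endian, dst_big_endian):
    """Re-encode a 32-bit integer between byte orders."""
    value = struct.unpack(">i" if src_big_endian else "<i", raw)[0]
    return struct.pack(">i" if dst_big_endian else "<i", value)

# Example: a SPARC/PowerPC-style big-endian word moved to an i486-style
# little-endian target.
big_endian_word = struct.pack(">i", 0x12345678)
little_endian_word = translate_int32(big_endian_word, True, False)
print(little_endian_word.hex())   # 78563412
```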

92 citations


Proceedings ArticleDOI
Shmuel Ur, Yaov Yadin
01 Jun 1999
TL;DR: This paper demonstrates a method for generating assembler test programs that systematically probe the architecture of a PowerPC superscalar processor, and describes how to translate this theory into practice, a task that was far from trivial.
Abstract: In this paper we demonstrate a method for generation of assembler test programs that systematically probe the architecture of a PowerPC superscalar processor. We show innovations such as ways to make small models for large designs, to predict with cycle accuracy the movement of instructions through the pipes (taking into account stalls and dependencies), and to generate test programs such that each reaches a new architectural state. We compare our method to the established practice of massive random generation and show that the quality of our tests, as measured by transition coverage, is much higher. The main contribution of this paper is not in theory, as the theory has been discussed in previous papers, but in describing how to translate this theory into practice, a task that was far from trivial.
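A hedged toy sketch of the underlying idea, coverage-driven generation, follows; the state model and instruction names are invented, whereas the paper uses cycle-accurate models of the PowerPC pipelines and measures transition coverage on those:

```python
# Hedged toy sketch of coverage-driven test generation: keep only generated
# programs that exercise a state transition not yet covered, measured as
# transition coverage over a small invented pipeline model.
import random

def next_state(state, instr):
    # Toy transition table standing in for a cycle-accurate pipeline model.
    table = {
        ("empty", "add"): "dispatch", ("empty", "load"): "dispatch",
        ("dispatch", "add"): "execute", ("dispatch", "load"): "stall",
        ("stall", "add"): "stall", ("stall", "load"): "stall",
        ("execute", "add"): "execute", ("execute", "load"): "dispatch",
    }
    return table[(state, instr)]

random.seed(0)
covered, tests = set(), []
for _ in range(1000):                         # bounded random search
    if len(covered) == 8:                     # all 8 transitions in the toy table
        break
    prog = [random.choice(["add", "load"]) for _ in range(4)]
    state, new = "empty", set()
    for instr in prog:
        nxt = next_state(state, instr)
        new.add((state, instr, nxt))
        state = nxt
    if new - covered:                         # keep only tests that add coverage
        covered |= new
        tests.append(prog)

print(len(tests), "kept tests cover", len(covered), "of 8 transitions")
```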

89 citations


Book
08 Jun 1999
TL;DR: This book surveys processor architecture from the RISC movement and CISC processors through multiple-issue superscalar designs, including the Intel Pentium, AMD, and IBM/Motorola/Apple PowerPC families, and on to future fine-grain and coarse-grain parallel processors.
Abstract: 1 Basic Pipelining and Simple RISC Processors - 1.1 The RISC Movement in Processor Architecture - 1.2 Instruction Set Architecture - 1.3 Examples of RISC ISAs - 1.4 Basic Structure of a RISC Processor and Basic Cache MMU Organization - 1.5 Basic Pipeline Stages - 1.6 Pipeline Hazards and Solutions - 1.6.1 Data Hazards and Forwarding - 1.6.2 Structural Hazards - 1.6.3 Control Hazards, Delayed Branch Technique, and Static Branch Prediction - 1.6.4 Multicycle Execution - 1.7 RISC Processors - 1.7.1 Early Scalar RISC Processors - 1.7.2 Sun microSPARC-II - 1.7.3 MIPS R3000 - 1.7.4 MIPS R4400 - 1.7.5 Other Scalar RISC Processors - 1.7.6 Sun picoJava-I - 1.8 Lessons learned from RISC - 2 Dataflow Processors - 2.1 Dataflow Versus Control-Flow - 2.2 Pure Dataflow - 2.2.1 Static Dataflow - 2.2.2 Dynamic Dataflow - 2.2.3 Explicit Token Store Approach - 2.3 Augmenting Dataflow with Control-Flow - 2.3.1 Threaded Dataflow - 2.3.2 Large-Grain Dataflow - 2.3.3 Dataflow with Complex Machine Operations - 2.3.4 RISC Dataflow - 2.3.5 Hybrid Dataflow - 2.4 Lessons learned from Dataflow - 3 CISC Processors - 3.1 A Brief Look at CISC Processors - 3.2 Out-of-Order Execution - 3.3 Dynamic Scheduling - 3.3.1 Scoreboarding - 3.3.2 Tomasulo's Scheme - 3.3.3 Scoreboarding versus Tomasulo's Scheme - 3.4 Some CISC Microprocessors - 3.5 Conclusions - 4 Multiple-Issue Processors - 4.1 Overview of Multiple-Issue Processors - 4.2 I-Cache Access and Instruction Fetch - 4.3 Dynamic Branch Prediction and Control Speculation - 4.3.1 Branch-Target Buffer or Branch-Target Address Cache - 4.3.2 Static Branch Prediction Techniques - 4.3.3 Dynamic Branch Prediction Techniques - 4.3.4 Predicated Instructions and Multipath Execution - 4.3.5 Prediction of Indirect Branches - 4.3.6 High-Bandwidth Branch Prediction - 4.4 Decode - 4.5 Rename - 4.6 Issue and Dispatch - 4.7 Execution Stages - 4.8 Finalizing Pipelined Execution - 4.8.1 Completion, Commitment, Retirement and Write-Back - 4.8.2 Precise Interrupts - 4.8.3 Reorder Buffers - 4.8.4 Checkpoint Repair Mechanism and History Buffer - 4.8.5 Relaxing In-order Retirement - 4.9 State-of-the-Art Superscalar Processors - 4.9.1 Intel Pentium family - 4.9.2 AMD-K5, K6 and K7 families - 4.9.3 Cyrix M II and M 3 Processors - 4.9.4 DEC Alpha 21x64 family - 4.9.5 Sun UltraSPARC family - 4.9.6 HAL SPARC64 family - 4.9.7 HP PA-7000 family and PA-8000 family - 4.9.8 MIPS R10000 and descendants - 4.9.9 IBM POWER family - 4.9.10 IBM/Motorola/Apple PowerPC family - 4.9.11 Summary - 4.10 VLIW and EPIC Processors - 4.10.1 TI TMS320C6x VLIW Processors - 4.10.2 EPIC Processors, Intel's IA-64 ISA and Merced Processor - 4.11 Conclusions on Multiple-Issue Processors - 5 Future Processors to use Fine-Grain Parallelism - 5.1 Trends and Principles in the Giga Chip Era - 5.1.1 Technology Trends - 5.1.2 Application- and Economy-Related Trends - 5.1.3 Architectural Challenges and Implications - 5.2 Advanced Superscalar Processors - 5.3 Superspeculative Processors - 5.4 Multiscalar Processors - 5.5 Trace Processors - 5.6 DataScalar Processors - 5.7 Conclusions - 6 Future Processors to use Coarse-Grain Parallelism - 6.1 Utilization of more Coarse-Grain Parallelism - 6.2 Chip Multiprocessors - 6.2.1 Principal Chip Multiprocessor Alternatives - 6.2.2 TI TMS320C8x Multimedia Video Processors - 6.2.3 Hydra Chip Multiprocessor - 6.3 Multithreaded Processors - 6.3.1 Multithreading Approach for Tolerating Latencies - 6.3.2 Comparison of Multithreading and Non-Multithreading Approaches - 6.3.3 Cycle-by-Cycle Interleaving - 6.3.4 Block Interleaving - 6.3.5 Nanothreading and Microthreading - 6.4 Simultaneous Multithreading - 6.4.1 SMT at the University of Washington - 6.4.2 Karlsruhe Multithreaded Superscalar - 6.4.3 Other Simultaneous Multithreading Processors - 6.5 Simultaneous Multithreading versus Chip Multiprocessor - 6.6 Conclusions - 7 Processor-in-Memory, Reconfigurable, and Asynchronous Processors - 7.1 Processor-in-Memory - 7.1.1 The Processor-in-Memory Principle - 7.1.2 Processor-in-Memory approaches - 7.1.3 The Vector IRAM approach - 7.1.4 The Active Page model - 7.2 Reconfigurable Computing - 7.2.1 Concepts of Reconfigurable Computing - 7.2.2 The MorphoSys system - 7.2.3 Raw Machine - 7.2.4 Xputers and KressArrays - 7.2.5 Other Projects - 7.3 Asynchronous Processors - 7.3.1 Asynchronous Logic - 7.3.2 Projects - 7.4 Conclusions - Acronyms - References

85 citations


Proceedings ArticleDOI
21 Mar 1999
TL;DR: The design, implementation and performance of a Web server accelerator which runs on an embedded operating system and improves Web server performance by caching data are described, and an API is provided which allows application programs to explicitly add, delete, and update cached data.
Abstract: We describe the design, implementation and performance of a Web server accelerator which runs on an embedded operating system and improves Web server performance by caching data. The accelerator resides in front of one or more Web servers. Our accelerator can serve up to 5000 pages/second from its cache on a 200 MHz PowerPC 604. This throughput is an order of magnitude higher than that which would be achieved by a high-performance Web server running on similar hardware under a conventional operating system such as Unix or NT. The superior performance of our system results in part from its highly optimized communications stack. In order to maximize hit rates and maintain updated caches, our accelerator provides an API which allows application programs to explicitly add, delete, and update cached data. The API allows our accelerator to cache dynamic as well as static data. We analyze the SPECweb96 benchmark and show that the accelerator can provide high hit ratios and excellent performance for workloads similar to this benchmark.
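The cache-management API described above can be pictured with a small sketch. The names and methods below are hypothetical (the paper does not give the API's signatures); the point is that applications explicitly add, update and invalidate entries so that dynamic content can be cached safely:

```python
# Hypothetical cache-management API sketch (names invented, not the
# accelerator's real interface): applications push, update, and invalidate
# entries explicitly so dynamic pages stay consistent.
class AcceleratorCache:
    def __init__(self):
        self._store = {}

    def add(self, url, body):
        self._store[url] = body

    def update(self, url, body):
        self._store[url] = body          # same effect as add; kept for API clarity

    def delete(self, url):
        self._store.pop(url, None)

    def serve(self, url):
        # A hit is served from the cache; a miss would be forwarded to the
        # back-end Web server in the real system.
        return self._store.get(url)

cache = AcceleratorCache()
cache.add("/index.html", b"<html>v1</html>")
cache.update("/index.html", b"<html>v2</html>")   # page regenerated dynamically
print(cache.serve("/index.html"))                 # b'<html>v2</html>'
cache.delete("/index.html")
print(cache.serve("/index.html"))                 # None -> would go to origin server
```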

72 citations


Proceedings ArticleDOI
16 Nov 1999
TL;DR: Surprisingly, it is found that a performance increase over native code is achievable in many situations, and this holds for architectures with a wide range of memory configurations and issue widths.
Abstract: Compressing the instructions of an embedded program is important for cost-sensitive low-power control-oriented embedded computing. A number of compression schemes have been proposed to reduce program size. However, the increased instruction density has an accompanying performance cost because the instructions must be decompressed before execution. In this paper, we investigate the performance penalty of a hardware-managed code compression algorithm recently introduced in IBM's PowerPC 405. This scheme is the first to combine many previously proposed code compression techniques, making it an ideal candidate for study. We find that code compression with appropriate hardware optimizations does not have to incur much performance loss. Furthermore, our studies show this holds for architectures with a wide range of memory configurations and issue widths. Surprisingly, we find that a performance increase over native code is achievable in many situations.

62 citations


Journal ArticleDOI
15 Feb 1999
TL;DR: A 64 b PowerPC RISC microprocessor is implemented first in a 0.2 µm CMOS technology with copper interconnects and multi-threshold transistors, and next in a silicon-on-insulator (SOI) version of the same technology.
Abstract: A 64 b PowerPC RISC microprocessor is implemented first in a 0.2 µm CMOS technology with copper interconnects and multi-threshold transistors, and next in a silicon-on-insulator (SOI) version of the same technology. Some architectural changes improve CPI, including doubling the L1 instruction and data caches to 128 kB and adding a 256 kB L2 directory. The total transistor count increased from 12 M to 34 M.

51 citations


Proceedings ArticleDOI
14 Apr 1999
TL;DR: The microarchitecture and design of the vector arithmetic unit in the first implementation of the AltiVec™ technology, which provides new computational and storage operations for handling vectors of various data lengths and data types, are described.
Abstract: The AltiVec™ technology is an extension to the PowerPC™ architecture which provides new computational and storage operations for handling vectors of various data lengths and data types. The first implementation using this technology is a low-cost, low-power processor based on the acclaimed PowerPC 750™ microprocessor. This paper describes the microarchitecture and design of the vector arithmetic unit of this implementation.

49 citations


Proceedings ArticleDOI
10 Feb 1999
TL;DR: Routines written with AltiVec instructions can execute significantly faster, sometimes by a factor of 10 or more, than traditional scalar PowerPC code, and the technology is flexible enough to be useful in a wide variety of applications.
Abstract: Motorola's AltiVec™ Technology provides a new, SIMD vector extension to the PowerPC™ architecture. AltiVec adds 162 new instructions and a powerful new 128-bit datapath, capable of simultaneously executing up to 16 operations per clock. AltiVec instructions allow parallel operation on either 8, 16 or 32-bit integers, as well as 4 IEEE single-precision floating-point numbers. AltiVec technology includes highly flexible "Permute" instructions, which give the data re-organization power needed to maintain a high level of data parallelism. Fine-grained data prefetch instructions are also included, which help hide the memory latency of data-hungry multimedia applications. All of these features add up to a dramatic performance improvement with the first implementation of AltiVec technology: routines written with AltiVec instructions can execute significantly faster, sometimes by a factor of 10 or more, than traditional scalar PowerPC code. Yet AltiVec technology is flexible enough to be useful in a wide variety of applications.
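The data-parallel style AltiVec enables can be illustrated without intrinsics. The following sketch uses NumPy as a stand-in: one array operation applies to 16 packed 8-bit elements, the way a single 128-bit AltiVec instruction would, and a fancy-index reorder stands in for a permute:

```python
# SIMD illustration with NumPy standing in for AltiVec intrinsics: one array
# operation applies to 16 packed 8-bit elements at once, the way a single
# 128-bit AltiVec instruction would.
import numpy as np

a = np.arange(16, dtype=np.uint8)        # one 128-bit register's worth of bytes
b = np.full(16, 10, dtype=np.uint8)

vector_sum = a + b                       # 16 byte-wide adds in one "operation"
permuted = vector_sum[[15 - i for i in range(16)]]   # a permute-style reorder

print(vector_sum)
print(permuted)
```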

01 Jan 1999
TL;DR: BOA (Binary-translation Optimized Architecture), a processor designed to achieve high frequency by using software dynamic binary translation, is presented, along with how a very high frequency PowerPC implementation is supported via dynamic binary translation.
Abstract: This paper presents BOA (Binary-translation Optimized Architecture), a processor designed to achieve high frequency by using software dynamic binary translation. Processors for software binary translation are very conducive to high frequency because they can assume a simple hardware design. Binary translation eliminates the binary compatibility problem faced by other processors, while dynamic recompilation enables re-optimization of critical program code sections and eliminates the need for dynamic scheduling hardware. In this work we examine the implications of binary translation on processor architecture and software translation, and how we support a very high frequency PowerPC implementation via dynamic binary translation. 1.0 Introduction The design of processors with clock speeds of 1 GHz or more has been a topic of considerable research in both industry and academia. Binary translation presents an interesting alternative for processor design as it enables good performance on simple processor designs. Processors for binary translation achieve maximum performance by enabling high frequency processors while still exploiting available parallelism in the code. The effect of both of these optimizations is to minimize overall execution time [1]: Execution Time = (Number of Instructions) × (Cycles / Instruction) × (Seconds / Cycle). One method of minimizing execution time is instruction set simplicity, which can enable high frequency design. Many approaches can be taken to achieve high frequency while maintaining compatibility with existing code. Three popular possible methods include: Hardware Cracking: Hardware cracking of complex instructions into simpler "micro" operations is frequently used (e.g., Intel processors such as Pentium Pro, Pentium II, and Pentium III). Nair and Hopkins' DIF processor [2] also uses a hardware approach, to crack and schedule simple PowerPC operations into groups for high-speed execution on a VLIW (Very-Long Instruction Word) processor. Cracked code can be stored in the cache, or cracking can occur on the fly (in the pipeline), but either way, it consumes time and transistors. This limits the achievable frequency. Microcode Emulation: Microcode can also be used to maintain compatibility, with each instruction of the original architecture going to a fixed microcode routine for emulation on the new (high frequency) architecture. However, if the new instruction set architecture does not closely resemble the original architecture, the total number of instructions executed is much higher than on a direct implementation of the original machine. In addition, there is no opportunity to overlap the execution of instructions from the original machine, thus preventing the exploitation of any intrinsic parallelism. Binary Translation: A third alternative is dynamic binary translation via software, as exemplified by FX!32 [3] or DAISY [4]. Since the translation is done by software, no extra transistors are needed for implementation: the same resources used to execute the application code are used by the translation software. Assuming that the high frequency processor for a binary translation machine is designed with the original in mind, the high frequency primitive operations can efficiently execute instructions for the original machine. Unlike microcode, the high frequency primitives from several original instructions can be overlapped so as to exploit parallelism.
Another advantage of dynamic binary translation over hardware cracking and microcode is adaptability: the code generated can be adapted to efficiently handle special cases discernible only at runtime. For example, if fixed parameters are always passed to a subroutine, the dynamic translation software can generate code which checks for these fixed parameters, and then jumps to a highly optimized translation of the subroutine if these parameters are encountered (our current implementation does not support this optimization). Similarly, translations can be optimized to varying degrees depending upon the execution frequency of the code section. In looking forward to future high performance microprocessors, we have adopted the dynamic binary translation approach as it promises a desirable combination of (1) high frequency design, (2) greater degrees of parallelism, and (3) low hardware cost. In this paper, we describe the application of dynamic binary translation to the design of BOA (Binary-translation Optimized Architecture), a high-frequency EPIC-style implementation of the PowerPC architecture. The remainder of the paper describes our approach in designing a high frequency PowerPC compatible microprocessor through dynamic binary translation. Section 2 describes how the processor builds groups of PowerPC instructions and the translation process into the BOA architecture. Section 3 describes the BOA instruction set architecture and details the high frequency implementation. Section 4 gives experimental microarchitectural performance results, and Sections 5 and 6 conclude and describe future work. 2.0 Binary Translation Strategy Our overall strategy of binary translation is to translate a PowerPC instruction flow into a set of BOA instruction groups which are then stored in memory and executed natively. The system must start the process by interpreting the PowerPC stream of instructions and make intelligent decisions on when and how to translate these instructions into BOA groups. This process is described in Section 2.1. Once groups have been formed, the translation process continues by scheduling the BOA groups for performance. These optimizations are covered in Section 2.2 where, again, intelligent tradeoffs must be made between code optimization, code reuse, and software optimization costs. Code execution is extremely fast when the processor is able to use BOA groups in memory. These optimized groups are generally much larger than a basic block and can contain many branches. The last instruction in each BOA group is a direct branch to the next group stored in memory. As long as a translation exists for the address pointed to by the final branch in a group, the new group is fetched and executed. If the code does not exist, the machine re-enters interpretation mode to slowly execute PowerPC instructions and possibly build more BOA groups. Likewise, if a branch is encountered within the BOA group that is mispredicted, native execution must be halted, and a branch is taken back to the interpreter. It is therefore critical that BOA groups not only be as long as possible (to give ample opportunity for code optimization), but also that they be as high quality as possible (to avoid calls back to the interpreter). 2.1 Group Formation BOA instruction groups are formed along a single path after interpreting the entry point of a PowerPC instruction sequence 15 times.
In the PowerPC interpretation process, we decided to only follow single paths of execution so as to support very high-frequency hardware (described in Section 3). This single path approach is somewhat similar to that used by DIF [2], except that DIF requires special hardware to form groups and does not interpret PowerPC groups. DAISY [4, 5] by contrast forms groups from operations along multiple paths, and thus requires somewhat more time-consuming group formation and scheduling heuristics. During BOA's interpretation phase, statistics are kept on the number of times each conditional branch is executed as well as on the total number of times it is taken, thus allowing a dynamic assessment of the probability the branch is taken. Similar information is also kept about the targets of register branches. As noted, once the group entry has been seen 15 times, the code starting at the entry point is assembled into a PowerPC group, and translated into BOA instruction groups for efficient execution on the underlying hardware. At each conditional branch point, the most likely path is followed. For efficient execution, it is best if the translated path always falls through at conditional branch instructions. In some cases, this requires that the sense of the branch be changed. As each conditional branch is reached during the translation, the probability of reaching this point from the start of the group decreases. When the probability goes below a threshold value, the group is terminated. For example, if a particular branch goes less than 12 times in its more likely direction, we might decide to terminate the group. Later in the paper, we refer to this as "bias-12", i.e. the branch must go in one direction at least 80% (12 of 15) of the time. Since we are applying this strategy in a runtime environment, quick group formation is essential, even if it can produce sub-optimal results. For example, consider the following PowerPC code:
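(The PowerPC code excerpt referred to at the end of the abstract is not reproduced in this listing.) As a separate, hedged illustration of the group-termination heuristic described above, the following Python sketch follows the likely direction at each conditional branch and stops extending the group when a branch's bias falls below the "bias-12" criterion; the addresses, statistics and branch-target rule are invented:

```python
# Hedged sketch (not BOA's translator): extend a group along the likely path
# and terminate it at a branch whose bias falls below the "bias-12" rule,
# i.e. the branch must go one way at least 12 of 15 times (80%).
def form_group(entry, branch_stats, threshold=12, samples=15, max_len=8):
    """branch_stats maps a branch address to (times_seen, times_taken)."""
    group, addr = [], entry
    while len(group) < max_len:
        group.append(addr)
        if addr not in branch_stats:              # not a conditional branch
            addr += 4                             # fall through to the next instruction
            continue
        seen, taken = branch_stats[addr]
        bias = max(taken, seen - taken)           # count of the more likely direction
        if bias * samples < threshold * seen:     # below the bias-12 criterion
            break                                 # branch too unpredictable: end the group
        # Follow the more likely direction (hypothetical +32 branch target).
        addr = (addr + 32) if taken >= seen - taken else addr + 4
    return group

# Branch at 0x108 is strongly biased toward falling through; 0x10c is 50/50.
stats = {0x108: (15, 1), 0x10c: (15, 8)}
print([hex(a) for a in form_group(0x100, stats)])   # ['0x100', '0x104', '0x108', '0x10c']
```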

Proceedings ArticleDOI
15 Feb 1999
TL;DR: New to this processor are two vector execution units that are part of the AltiVec™ instruction set implementation, memory subsystem bandwidth enhancements, symmetric multiprocessing support, and improved floating-point performance.
Abstract: This superscalar microprocessor implements the PowerPC™ Architecture specification incorporating AltiVec™ technology. Two instructions per cycle can be dispatched to two of seven execution units in this microarchitecture designed for high execution performance, high memory bandwidth, and low power. The processor includes 8-way set-associative 32 KB instruction and data caches, a floating-point unit, two integer units, a branch unit, a load/store unit, a vector arithmetic/logic unit, a vector permute unit, and a system unit. An L2 tag and cache controller with a dedicated L2 bus interface supports L2 cache sizes of 512 KB, 1 MB, or 2 MB with 2-way set associativity. At 450 MHz, and with a 2 MB L2 cache, this processor is estimated to have a SPECint95 and SPECfp95 performance of 20. The processor shares many microarchitectural features with the PowerPC 750™ microprocessor. New to this processor are two vector execution units which are part of the AltiVec™ instruction set implementation, memory subsystem bandwidth enhancements, symmetric multiprocessing support and improved floating-point performance. Supporting up to 8 simultaneous data cache misses, the memory subsystem sustains bandwidths of 3.2 GB/s on the L2 data SRAM interface running at 200 MHz or 1.6 GB/s on the system interface running at 100 MHz.

Proceedings ArticleDOI
14 Apr 1999
TL;DR: New algorithms based on power series approximations were developed which provide significantly better performance than the Newton-Raphson algorithm for this processor and reduce the divide latency and square root latency.
Abstract: The Power3 processor is a 64-bit implementation of the PowerPC™ architecture and is the successor to the Power2™ processor for workstations and servers which require high performance floating point capability. The previous processors used Newton-Raphson algorithms for their implementations of divide and square root. The Power3 processor has a longer pipeline latency, which would substantially increase the latency for these instructions. Instead, new algorithms based on power series approximations were developed which provide significantly better performance than the Newton-Raphson algorithm for this processor. This paper describes the algorithms, and then shows how both the series based algorithms and the Newton-Raphson algorithms are affected by pipeline length. For the Power3, the power series algorithms reduce the divide latency by over 20% and the square root latency by 35%.
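For contrast, the conventional Newton-Raphson refinement that the Power3 work moves away from can be sketched in a few lines. This is an illustrative software sketch, not the hardware algorithm; the seed and iteration count are arbitrary:

```python
# Illustrative Newton-Raphson reciprocal refinement (software sketch, not the
# Power3 hardware): each dependent iteration roughly doubles the number of
# correct bits, so a longer pipeline stretches the total divide latency, which
# is what motivated the power-series alternative.
def divide(a, b, iterations=2):
    x = round(1.0 / b, 4)          # stand-in for a low-precision lookup-table seed
    for _ in range(iterations):    # dependent refinement steps
        x = x * (2.0 - b * x)      # x_{n+1} = x_n * (2 - b * x_n)
    return a * x

print(divide(355.0, 113.0))        # approximately 355/113 = 3.1415929...
```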

Proceedings ArticleDOI
10 Jan 1999
TL;DR: A rigorous ATPG-like methodology for validating the branch prediction mechanism of the PowerPC 604, which can be easily generalized and made applicable to other processors, is described.
Abstract: We describe a rigorous ATPG-like methodology for validating the branch prediction mechanism of the PowerPC 604, which can be easily generalized and made applicable to other processors. Test sequences based on finite state machine (FSM) testing are derived from small FSM-like models of the branch prediction mechanism. These sequences are translated into PowerPC instruction sequences. Simulation results show that 100% coverage of the targeted functionality is achieved using a very small number of simulation cycles. Simulation of some real programs against the same targeted functionality produces coverages that range between 34% and 75% with four orders of magnitude more cycles. We also use mutation analysis to modify some functionality of the behavioral model to further illustrate the effectiveness of our generated sequence. Simulation results show that all 54 mutants in the branch prediction functionality can be detected by measuring transition coverage.
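A hedged sketch of FSM transition coverage follows, using a two-bit saturating branch-predictor counter as a hypothetical stand-in for the small FSM-like models mentioned above; the derived outcome sequence is the kind of sequence that would then be translated into PowerPC instruction sequences:

```python
# Hedged sketch: derive one input sequence that covers every transition of a
# 2-bit saturating branch-predictor FSM (a hypothetical stand-in model).
STATES = ["SN", "WN", "WT", "ST"]     # strongly/weakly not-taken, weakly/strongly taken

def step(state, taken):
    i = STATES.index(state)
    return STATES[min(i + 1, 3)] if taken else STATES[max(i - 1, 0)]

def transition_covering_sequence(start="SN"):
    remaining = {(s, t) for s in STATES for t in (True, False)}
    seq, state = [], start
    while remaining:
        # Greedy: prefer an input whose transition is still uncovered.
        for taken in (True, False):
            if (state, taken) in remaining:
                break
        remaining.discard((state, taken))
        seq.append(taken)
        state = step(state, taken)
    return seq

seq = transition_covering_sequence()
print(len(seq), "branch outcomes cover all 8 transitions")   # 8 outcomes suffice here
```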


Journal ArticleDOI
TL;DR: This work focuses on tool design for the development of microarchitectures, which implement the instruction set; the FMW PowerPC-based simulation tool will help designers accurately evaluate the effectiveness and validate the correctness of new microprocessor mechanisms.
Abstract: Microprocessor designers use multiple simulation tools with varying degrees of modeling details ranging from the instruction set of the microprocessor to the circuit implementation. We focus on tool design for the development of microarchitectures, which implement the instruction set. Microarchitecture design involves both functional and performance simulators. A functional simulator models a machine's architecture, or instruction set, with functional correctness. A performance simulator models the machine organization, or microarchitecture, and is concerned with machine performance. Sometimes these performance simulators are also referred to as cycle-accurate simulators to reflect their concern with timing issues. The FMW PowerPC-based simulation tool will help designers accurately evaluate the effectiveness and validate the correctness of new microprocessor mechanisms.

Proceedings ArticleDOI
Carol Pyron, M. Alexander, J. Golab, G. Joos, B. Long, R. Molyneaux, R. Raina, N. Tendolkar
28 Sep 1999
TL;DR: Design-for-manufacturability enhancements provide better tracking of initial silicon and fuse-based memory repair capabilities for improved yield and time-to-market, while methodology and modeling improvements increased LSSD stuck-at fault test coverage.
Abstract: Several advances have been made in the design for testability of the MPC7400, the first fourth generation PowerPC microprocessor. The memory array built-in self-test algorithms now support detecting write-recovery defects and more comprehensive diagnostics. Delay defects can be tested with scan patterns, with the phase-locked loop providing the at-speed launch-capture events. Several methodology and modeling improvements increased LSSD stuck-at fault test coverage. Design for manufacturability enhancements provide better tracking of initial silicon and fuse-based memory repair capabilities for improved yield and time-to-market.

Proceedings ArticleDOI
22 Feb 1999
TL;DR: This work has implemented a variety of changes in the memory management of a native port of the Linux operating system to the PowerPC architecture to improve performance and shows that careful design to minimize the OS caching footprint, to shorten critical code paths in page fault handling, and to otherwise take full advantage of the memorymanagement hardware can have dramatic effects on performance.
Abstract: In highly cached and pipelined machines, operating system performance, and aggregate user/system performance, is enormously sensitive to small changes in cache and TLB hit rates. We have implemented a variety of changes in the memory management of a native port of the Linux operating system to the PowerPC architecture in an effort to improve performance. Our results show that careful design to minimize the OS caching footprint, to shorten critical code paths in page fault handling, and to otherwise take full advantage of the memory management hardware can have dramatic effects on performance. Our results also show that the operating system can intelligently manage MMU resources as well or better than hardware can and suggest that complex hardware MMU assistance may not be the most appropriate use of scarce chip area. Comparative benchmarks show that our optimizations result in kernel performance that is significantly better than other monolithic kernels for the same architecture and highlight the distance that micro-kernel designs will have to travel to approach the performance of a reasonably efficient monolithic kernel.

Proceedings ArticleDOI
L. Fournier, Anatoly Koyfman, L. Levinger
01 Jun 1999
TL;DR: By defining a set of generic coverage models that combine program-based, specification-based and sequential bug-driven models, this work establishes the groundwork for the development of architecture validation suites for any architecture.
Abstract: This paper describes the efforts made and the results of creating an Architecture Validation Suite for the PowerPC architecture. Although many functional test suites are available for multiple architectures, little has been published on how these suites are developed and how their quality should be measured. This work provides some insights for approaching the difficult problem of building a high quality functional test suite for a given architecture. By defining a set of generic coverage models that combine program-based, specification-based and sequential bug-driven models, it establishes the groundwork for the development of architecture validation suites for any architecture.

Proceedings ArticleDOI
27 Mar 1999
TL;DR: The performance of this digital system has been shown to be superior to the analog technology typically used for quench detection; i.e., it detects resistive voltages reliably at low thresholds with minimum delay and is a more flexible system.
Abstract: A system has been developed for digitally detecting superconducting magnet quenches in real time. This system has been fully tested and is completely integrated into a Vertical Magnet Test Facility (VMTF) at FNAL. The digital technique used for this system relies on the application of digital signal processing (DSP) algorithms running on a native Motorola PowerPC VME processor. The performance of this digital system has been shown to be superior to the analog technology typically used for quench detection; i.e., it detects resistive voltages reliably at low thresholds with minimum delay and is a more flexible system.

Patent
29 Jan 1999
TL;DR: In this paper, a scheme for program executables that run in a reduced instruction set computer (RISC) architecture such as the PowerPC is disclosed. The method and system utilize scope-based compression for increasing the effectiveness of conventional compression with respect to register and literal encoding.
Abstract: A compression scheme for program executables that run in a reduced instruction set computer (RISC) architecture such as the PowerPC is disclosed. The method and system utilize scope-based compression for increasing the effectiveness of conventional compression with respect to register and literal encoding. First, discernible patterns are determined by exploiting instruction semantics and conventions that compilers adopt in register and literal usage. Additional conventions may also be set for register usage to facilitate compression. Using this information, separate scopes are created such that in each scope there is a more prevalent usage of a limited set of registers or literal value ranges, or there is an easily discernible pattern of register or literal usage. Each scope then is compressed separately by a conventional compressor. The resulting code is more compact because the small number of registers and literals in each scope makes the encoding sparser than when the compressor operates on the global scope that includes all instructions in a program. Additionally, scope-based compression reveals more frequent patterns within each scope than when considering the entire instruction stream as an opaque stream of bits.
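A minimal sketch of the scope-based idea, with zlib standing in for the "conventional compressor" and a deliberately toy scoping rule (split on whether the destination register is high- or low-numbered), is shown below; on a four-instruction example the per-stream overhead dominates, so the sizes only illustrate the structure, not the benefit:

```python
# Sketch of scope-based compression: zlib stands in for the conventional
# compressor, and the scoping rule is a toy one (destination register number).
import zlib

instructions = [
    ("add", "r3", "r3", "r4"), ("add", "r3", "r3", "r5"),   # scope "low": low registers
    ("lwz", "r30", "r31", 8), ("stw", "r30", "r31", 12),    # scope "high": high registers
]

def scope_of(instr):
    return "high" if int(instr[1][1:]) >= 16 else "low"

scopes = {}
for instr in instructions:
    scopes.setdefault(scope_of(instr), []).append(" ".join(map(str, instr)))

per_scope = {name: zlib.compress("\n".join(body).encode()) for name, body in scopes.items()}
global_blob = zlib.compress(
    "\n".join(" ".join(map(str, i)) for i in instructions).encode()
)
# On realistic program sizes the per-scope streams encode more sparsely; on a
# toy example the numbers only show the mechanics.
print({name: len(blob) for name, blob in per_scope.items()}, "global:", len(global_blob))
```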

Patent
29 Jan 1999
TL;DR: In this paper, a method and system for a compression scheme used with program executables that run in a reduced instruction set computer (RISC) architecture such as the PowerPC is disclosed.
Abstract: A method and system for a compression scheme used with program executables that run in a reduced instruction set computer (RISC) architecture such as the PowerPC is disclosed. Initially, a RISC instruction set is expanded to produce code that facilitates the removal of redundant fields. The program is then rewritten using this new expanded instruction set. Next, a filter is applied to remove redundant fields from the expanded instructions. The expanded instructions are then clustered into groups, such that instructions belonging to the same cluster show similar bit patterns. Within each cluster, the scopes are created such that register usage patterns within each scope are similar. Within each cluster, more scopes are created such that literals within each instruction scope are drawn from the same range of integers. A conventional compression technique such as Huffman encoding is then applied on each instruction scope within each cluster. Dynamic programming techniques are then used to produce the best combination of encoding among all scopes within all the different clusters. Where applicable, instruction scopes are combined that use the same encoding scheme to reduce the size of the resulting dictionary. Similarly instruction clusters are combined that use the same encoding scheme to reduce the size of the resulting dictionary.

Journal ArticleDOI
R.D. Gerke, G.B. Kromann
TL;DR: The failure data are plotted using Weibull distributions, and failure mechanisms and solder joint failure criteria for both CQFP and CBGA packaging technologies are presented.
Abstract: Recent trends in wafer fabrication techniques have produced devices with smaller feature dimensions, increasing gate count and chip inputs/outputs (I/Os). This trend has placed increased emphasis on microelectronics packaging. Surface-mountable packages such as the ceramic quad-flat-pack (CQFP) have provided solutions for many high I/O package issues. As the I/O count gets higher, the pitch has been driven smaller to the point where other solutions also become attractive. Surface-mountable ceramic-ball-grid array (CBGA) packages have proven to be good solutions in a variety of applications as designers seek to maximize electrical performance, reduce printed-circuit board real estate, and improve manufacturing process yields. In support of the PowerPC 603 and PowerPC 604 microprocessors, 21 mm CBGA (255 I/Os) and 32 mm (240 I/Os) and 40 mm (304 I/Os) CQFPs are being utilized. Both package types successfully meet computer environment applications. This paper describes test board assembly processes, accelerated thermal stress test setup, and solder joint failure criteria. Failure mechanisms for both packaging technologies will also be presented. The packages discussed in this paper were subjected to two accelerated thermal cycling conditions: 0 to 100°C and -40 to 125°C. The failure data are plotted using Weibull distributions. The accelerated failure distributions were used to predict failure distributions in application space for typical PowerPC 603 and PowerPC 604 microprocessor computer environments. To predict solder joint reliability of surface-mount technology, a key parameter is the temperature rise above ambient at the solder joint, ΔT. In-situ field temperature measurements were taken for a range of computer platforms in an office environment, at the central-processing units. Printed-circuit boards (PCBs) were not uniform, therefore only maximum temperature regions of the board were measured. These maximum temperatures revealed the mean to be less than 20°C above ambient (i.e., ΔT < 20°C) regardless of the power of the device. The largest ΔT measured in any system was less than 30°C above ambient. These temperature measurements of actual computer systems are in close agreement with IPC-SM-785. By utilizing the measured PCB temperature rise, solder joint fatigue life was calculated for the 21 mm ceramic-ball-grid array (CBGA), the package for the PowerPC 603™ and PowerPC 604™ RISC microprocessors. The average on-off ΔT for most computer applications is approximately 20°C. For an average on-off ΔT of 30°C, the 21 mm CBGA has an estimated fatigue life of over 25 years while the 32 mm and 40 mm CQFPs have an estimated fatigue life of over 50 years.

Proceedings ArticleDOI
13 Jan 1999
TL;DR: This end-to-end simulator for Laser Doppler wind measurement can run on either a Windows PC, a Macintosh PowerPC or a SUN station.
Abstract: LabVIEW (National Instruments) provides a powerful instrumentation system for simulations, including an excellent graphical presentation environment. Our Doppler Lidar simulation tool contains signal propagation and scattering in the atmosphere, a model of the heterodyne front end in the low SNR-regime and a processing unit for signal digitizing and frequency estimation. As a consequence of LabVIEW's programming language, G, this end-to-end simulator for Laser Doppler wind measurement can run on either a Windows PC, a Macintosh PowerPC or a SUN station.

Proceedings ArticleDOI
Magdy S. Abadir, R. Raina
28 Sep 1999
TL;DR: The DFT methodology at Motorola's Somerset Design Center is reviewed, along with how the DFT group is dealing with the new challenges facing PowerPC™ microprocessor designs.
Abstract: Testing of modern microprocessor designs remains a challenging problem. At Motorola's Somerset Design Center, we rely heavily on Design-For-Test (DFT) to address these challenges. To date our efforts have been very successful. This paper reviews our DFT methodology and how the DFT group is dealing with the new challenges that are facing PowerPC™ microprocessor designs.

Journal ArticleDOI
TL;DR: This new architecture is flexible and open-ended and will enable interconnection to other Tore Supra systems such as those required for the long-term programme of long pulse (1000s) discharges.

Book ChapterDOI
31 Aug 1999
TL;DR: Beowulf-class systems along with other forms of PC clustered systems have matured to the point that they are becoming the strategy of choice for some areas of high performance applications.
Abstract: Beowulf-class systems along with other forms of PC clustered systems have matured to the point that they are becoming the strategy of choice for some areas of high performance applications. A Beowulf system is a cluster of mass market COTS personal computers interconnected by means of widely available local area network (LAN) technology. Beowulf software is based on open source, Unix-like operating systems, which in the majority of cases means Linux. The API for Beowulf is based on message passing semantics and mechanisms, including explicit models such as PVM and MPI or implicit models such as BSP or HPF. Since its introduction in 1994, Beowulf-class computing has gone through five generations of PCs from multiple microprocessor vendors including the Intel x86 family, DEC's Alpha, and the PowerPC from IBM and Motorola. Originally, Beowulfs were implemented as small clusters in the range of 4 to 32 nodes. Larger clusters of 48 to 96 processors were deployed two and a half years ago. Today there are many systems of 100 to 300 processors with systems of over a thousand processors in the planning stage for implementation over the next year.
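A minimal message-passing example in the style of the Beowulf API described above (MPI via mpi4py) is sketched below; it is an illustration only, not taken from the chapter, and assumes an MPI launcher such as mpirun is available:

```python
# Minimal MPI example (illustration only, not from the chapter).
# Run with an MPI launcher, e.g.:  mpirun -np 4 python beowulf_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each node computes a partial sum; rank 0 gathers the global result.
local = sum(range(rank * 1000, (rank + 1) * 1000))
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(f"{size} processes computed total {total}")
```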

Proceedings ArticleDOI
01 Nov 1999
TL;DR: This paper proposes a solution to the problem of executing compressed code on embedded DSPs and reveals an average compression ratio of 75% for typical DSP programs running on the TMS320C25 processor.
Abstract: Decreasing the program size has become an important goal in the design of embedded systems targeted to mass production. This problem has led to a number of efforts aimed at designing processors with shorter instruction formats (e.g. ARM Thumb and MIPS16), or that can execute compressed code (e.g. IBM CodePack PowerPC). Much of this work has been directed towards RISC architectures though. This paper proposes a solution to the problem of executing compressed code on embedded DSPs. The experimental results reveal an average compression ratio of 75% for typical DSP programs running on the TMS320C25 processor. This number includes the size of the decompression engine. Decompression is performed by a state machine that translates codewords into instruction sequences during program execution. The decompression engine is synthesized using the AMS standard cell library and a 0.6 µm, 5 V technology. Gate level simulation of the decompression engine reveals minimum operation frequencies of 150 MHz.
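The codeword-to-sequence expansion performed by the decompression engine can be sketched in software. The dictionary, codewords and mnemonics below are invented for illustration; the real engine is synthesized logic operating in the instruction fetch path, not Python:

```python
# Software sketch of a codeword-to-instruction-sequence decompressor; the
# dictionary, codewords and mnemonics are invented for illustration.
DICTIONARY = {
    0x0: ["MAC  *AR2+, *AR3+"],                  # one codeword -> one instruction
    0x1: ["LT   *AR2+", "MPY  *AR3+", "APAC"],   # one codeword -> a whole sequence
    0x2: ["ZAC", "RPTK 15"],
}

def decompress(codewords):
    # The hardware engine expands codewords on the fly as the DSP fetches
    # them; here the expansions are simply concatenated.
    program = []
    for cw in codewords:
        program.extend(DICTIONARY[cw])
    return program

for line in decompress([0x2, 0x1, 0x0]):
    print(line)
```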

Proceedings ArticleDOI
01 Jun 1999
TL;DR: The results of using a new algorithm for detecting both combinationally and sequentially false timing paths, one in which the constraints on a timing path are captured by justifying symbolic functions across latch boundaries, are presented.
Abstract: We present a new algorithm for detecting both combinationally and sequentially false timing paths, one in which the constraints on a timing path are captured by justifying symbolic functions across latch boundaries. We have implemented the algorithm and we present, here, the results of using it to detect false timing paths on a recent PowerPC microprocessor design. We believe these are the first published results showing the extent of the false path problem in industry. Our results suggest that the reporting of false paths may be compromising the effectiveness of static timing analysis.