
Showing papers on "PowerPC published in 2004"


Journal ArticleDOI
01 Mar 2004
TL;DR: The experience in implementing the simulator and its uses within IBM to model future systems, support early software development, and design new system software are described.
Abstract: Mambo is a full-system simulator for modeling PowerPC-based systems. It provides building blocks for creating simulators that range from purely functional to timing-accurate. Functional versions support fast emulation of individual PowerPC instructions and the devices necessary for executing operating systems. Timing-accurate versions add the ability to account for device timing delays, and support the modeling of the PowerPC processor microarchitecture. We describe our experience in implementing the simulator and its uses within IBM to model future systems, support early software development, and design new system software.

152 citations
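
The functional level of simulation described in the Mambo abstract above, emulating individual PowerPC instructions, can be illustrated with a minimal sketch. This is not Mambo's code: the helper names `sign_extend_16` and `execute_addi` are hypothetical, only the `addi` instruction (primary opcode 14, D-form) is modeled, and a 32-bit register width is assumed.

```python
# Minimal functional emulation of one PowerPC instruction (addi, D-form).
# Hypothetical helper names; not taken from Mambo. Assumes 32-bit registers.

def sign_extend_16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def execute_addi(instruction, gpr):
    """Decode and execute a 32-bit addi instruction, updating gpr in place."""
    opcode = (instruction >> 26) & 0x3F   # primary opcode: top 6 bits
    if opcode != 14:
        raise ValueError("only addi (primary opcode 14) is modeled here")
    rt = (instruction >> 21) & 0x1F       # target register number
    ra = (instruction >> 16) & 0x1F       # source register number
    si = sign_extend_16(instruction & 0xFFFF)
    base = 0 if ra == 0 else gpr[ra]      # RA = 0 means the literal value 0
    gpr[rt] = (base + si) & 0xFFFFFFFF    # wrap to the assumed 32-bit width
    return gpr


registers = [0] * 32
execute_addi(0x38600001, registers)  # li r3, 1   (addi r3, 0, 1)
execute_addi(0x3883FFFF, registers)  # addi r4, r3, -1
```

A timing-accurate model would additionally have to account for device and pipeline delays, which this sketch ignores entirely.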


Proceedings ArticleDOI
27 Oct 2004
TL;DR: This paper presents an architecture description language (ADL) called ArchC, which is an open-source SystemC-based language that is specialized for processor architecture description that has a storage-based co-verification mechanism that automatically checks the consistency of a refined ArchC model against a reference (functional) description.
Abstract: This paper presents an architecture description language (ADL) called ArchC, which is an open-source SystemC-based language that is specialized for processor architecture description. Its main goal is to provide enough information, at the right level of abstraction, in order to allow users to explore and verify new architectures, by automatically generating software tools like simulators and co-verification interfaces. ArchC's key features are a storage-based co-verification mechanism that automatically checks the consistency of a refined ArchC model against a reference (functional) description, memory hierarchy modeling capability, the possibility of integration with other SystemC IPs and the automatic generation of high-level SystemC simulators. We have used ArchC to synthesize both functional and cycle-based simulators for the MIPS, Intel 8051 and SPARC V8 processors, as well as functional models of modern architectures like TMS320C62x, XScale and PowerPC.

93 citations


Proceedings ArticleDOI
09 Sep 2004
TL;DR: This work investigates the correlation between functional test frequency and that of various types of structural patterns on MPC7455, a Motorola processor executing to the PowerPC™ instruction set architecture.
Abstract: The use of functional vectors has been an industry standard for speed binning purposes of high performance ICs. This practice can be prohibitively expensive as the ICs become faster and more complex. In comparison, structural patterns can target performance related faults in a more systematic manner. To make structural testing an effective alternative to functional testing for speed binning, structural patterns need to correlate with functional test frequencies closely. We investigate the correlation between functional test frequency and that of various types of structural patterns on MPC7455, a Motorola processor executing to the PowerPC™ instruction set architecture.

82 citations


Proceedings ArticleDOI
28 Jun 2004
TL;DR: Analysis of the obtained data indicates significant differences between the two platforms in how errors manifest and how they are detected in the hardware and the operating system.
Abstract: The goals of this study are: (i) to compare Linux kernel (2.4.22) behavior under a broad range of errors on two target processors - the Intel Pentium 4 (P4) running RedHat Linux 9.0 and the Motorola PowerPC (G4) running YellowDog Linux 3.0 - and (ii) to understand how architectural characteristics of the target processors impact the error sensitivity of the operating system. Extensive error injection experiments involving over 115,000 faults/errors are conducted targeting the kernel code, data, stack, and CPU system registers. Analysis of the obtained data indicates significant differences between the two platforms in how errors manifest and how they are detected in the hardware and the operating system. In addition to quantifying the observed differences and similarities, the paper provides several examples to support the insights gained from this research.

62 citations


Journal ArticleDOI
TL;DR: This contribution appears to be the first thorough comparison of two public-key families, namely elliptic curve (ECC) and hyperelliptic curve cryptosystems on a wide range of embedded processor types (ARM, ColdFire, PowerPC).
Abstract: It is widely recognized that data security will play a central role in future IT systems. Providing public-key cryptographic primitives, which are the core tools for security, is often difficult on embedded processor due to computational, memory, and power constraints. This contribution appears to be the first thorough comparison of two public-key families, namely elliptic curve (ECC) and hyperelliptic curve cryptosystems on a wide range of embedded processor types (ARM, ColdFire, PowerPC). We investigated the influence of the processor type, resources, and architecture regarding throughput. Further, we improved previously known HECC algorithms resulting in a more efficient arithmetic.

43 citations


24 Sep 2004
TL;DR: The Purdue Software Receiver (PSR) is a real-time software defined GPS receiver developed at Purdue University for research and teaching purposes, with a software architecture designed to maximize reusability of the code.
Abstract: The Purdue Software Receiver (PSR) is a real-time software defined GPS receiver developed at Purdue University for research and teaching purposes. The receiver’s software architecture was designed to maximize reusability of the code. This includes employing the receiver in a non-real-time mode as a postprocessing tool for sampled GPS data as well as a real-time mode operating from an antenna and digital receiver card. Real-time operation is enabled by single instruction multiple data (SIMD) instructions found on modern x86 and PowerPC processors. The PSR is coded in C++, making use of threaded objects to encapsulate functions and related data together and to reduce unnecessary copying of data. A software construct termed the “pipewall” is used to separate the low level (correlation and tracking) functions from the higher level navigation processing. A short description of a laboratory GPS signal recording system will also be presented.

35 citations


Proceedings ArticleDOI
29 Sep 2004
TL;DR: Preliminary performance data shows that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and a significant fraction of peak memory bandwidth for memory-bound kernels such as daxpy, while being largely insensitive to data alignment.
Abstract: We describe the design, implementation, and evaluation of a dual-issue SIMD-like extension of the PowerPC 440 floating-point unit (FPU) core. This extended FPU is targeted at both IBM's massively parallel BlueGene/L machine as well as more pervasive embedded platforms. It has several novel features, such as a computational crossbar and cross-load/store instructions, which enhance the performance of numerical codes. We further discuss the hardware-software co-design that was essential to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a BlueGene/L node. We describe several novel compiler and algorithmic techniques to take advantage of this architecture. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Preliminary performance data shows that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memory-bound kernels such as daxpy, while being largely insensitive to data alignment.

34 citations
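
The daxpy kernel named in the abstract above computes y = a*x + y over double-precision vectors. The following is a plain scalar sketch of that operation, not the paper's hand-optimized or compiled code; unlike the BLAS routine, which updates y in place, this hypothetical `daxpy` function returns a new list.

```python
def daxpy(a, x, y):
    """Return a*x + y element-wise; x and y must have equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Two floating-point operations per element (one multiply, one add).
    return [a * xi + yi for xi, yi in zip(x, y)]


result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Each element costs two floating-point operations but two loads and one store, which is why the kernel is limited by memory bandwidth rather than by the FPU.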


Patent
Roger Maitland1, Mark Turnbull1
12 Oct 2004
TL;DR: A system and method for a parallel CRC calculation are presented, in which a set of parallel inputs is used with a parallel table look-up operation to look up CRC entries for each input using a single instruction; the look-up may be executed using the PowerPC AltiVec vperm instruction.
Abstract: A system and method for a parallel CRC calculation is provided. A set of parallel inputs are loaded into a control register, and this control register is then used with a parallel table look-up operation to look up CRC entries for each of the inputs using a single instruction. This is repeated until each input has been processed completely to produce a complete CRC. The parallel table look-up operation may be executed using the PowerPC Altivec vperm instruction.

28 citations
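
The table look-up idea behind the CRC abstract above can be sketched in scalar form. This is not the patented method: it processes one input at a time, assumes the common reflected CRC-32 polynomial 0xEDB88320 (the abstract does not name a polynomial), and uses a 16-entry table indexed by 4-bit nibbles. The names `make_nibble_table` and `crc32_nibble` are hypothetical.

```python
def make_nibble_table(poly=0xEDB88320):
    """Build a 16-entry table: the CRC contribution of each 4-bit value."""
    table = []
    for nibble in range(16):
        crc = nibble
        for _ in range(4):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


def crc32_nibble(data, table):
    """Compute CRC-32 of a bytes object using two table look-ups per byte."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        crc = (crc >> 4) ^ table[crc & 0xF]  # low nibble
        crc = (crc >> 4) ^ table[crc & 0xF]  # high nibble
    return crc ^ 0xFFFFFFFF


TABLE = make_nibble_table()
checksum = crc32_nibble(b"123456789", TABLE)
```

A small table is attractive for a vector implementation because a byte-permute instruction can then perform the look-up for several inputs at once, which is the step the abstract attributes to vperm.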


Proceedings ArticleDOI
13 Sep 2004
TL;DR: The challenges and implementation of a dynamically controlled clock frequency with noise suppression as well as a synchronization circuit for a multi-processor system are discussed.
Abstract: PowerTune is a power-management technique for a multi-gigahertz superscalar 64b PowerPC® processor in a 90nm technology. This paper discusses the challenges and implementation of a dynamically controlled clock frequency with noise suppression as well as a synchronization circuit for a multi-processor system.

24 citations


Journal ArticleDOI
TL;DR: Wind River VxWorks has been chosen as real-time operating system and PowerPC and Pentium processors were considered as candidates and tested and the first one has been selected due to the better performance in floating point computation.

24 citations


Patent
21 Apr 2004
TL;DR: In this article, the user task context is divided into basic, expansion, and selectable parts according to the characteristics of the PowerPC processor structure, and only three stack-push modes (basic; basic and expansion; and the full context) are applied, according to the conditions of system handling and task dispatching.
Abstract: The method has the following characteristics: the user task context is divided into three parts (basic, expansion, and selectable) according to the characteristics of the PowerPC processor structure. During interrupt processing, only three stack-push modes (basic; basic and expansion; and the full context) are applied, according to the conditions of system handling and task dispatching. The basic part is pushed onto the stack first. After interrupt processing is finished, the nature of the task dispatching is judged in order to select whether to execute the next stage of the stack-push operation, to call the dispatcher, or to return to the user task, so as to reduce unnecessary stacking operations.

Book ChapterDOI
30 Aug 2004
TL;DR: This paper reports on the backend C compiler developed to target the Virtex II Pro PowerPC processor and to incorporate the Molen architecture programming paradigm; a performance efficiency of 84% is achieved using an automatically generated but non-optimized DCT* hardware implementation.
Abstract: In this paper, we report on the backend C compiler developed to target the Virtex II Pro PowerPC processor and to incorporate the Molen architecture programming paradigm. To verify the compiler, we used the multimedia video frame M-JPEG encoder of which the Discrete Cosine Transform (DCT*) function was mapped on the FPGA. We obtained an overall speedup of 2.5 against a maximal theoretical speedup of 2.96. The performance efficiency of 84 % is achieved using automatically generated but non-optimized DCT* hardware implementation.
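
The 84% figure in the abstract above follows directly from the two reported speedups. The sketch below reproduces that ratio and, under the assumption (not stated in the abstract) that the theoretical maximum is the Amdahl limit 1 / (1 - f) for an infinitely fast DCT* kernel, estimates the accelerated fraction f and the kernel speedup implied by the measured result. All variable and function names are hypothetical.

```python
measured_speedup = 2.5
theoretical_max_speedup = 2.96

# Efficiency as reported: measured speedup relative to the theoretical maximum.
efficiency = measured_speedup / theoretical_max_speedup

# Assumption: the maximum is the Amdahl limit 1 / (1 - f), so f is the
# fraction of run time spent in the accelerated function.
accelerated_fraction = 1.0 - 1.0 / theoretical_max_speedup


def amdahl_speedup(fraction, kernel_speedup):
    """Overall speedup when `fraction` of the run time is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Kernel speedup that would yield the measured overall speedup.
implied_kernel_speedup = accelerated_fraction / (
    1.0 / measured_speedup - (1.0 - accelerated_fraction)
)
```

Under this assumption roughly two thirds of the run time sits in the accelerated function, and the hardware version of that function would be about an order of magnitude faster than the software one.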

Book ChapterDOI
21 Jul 2004
TL;DR: The paper focuses on hardware synthesis results and experimental performance evaluation, proving the viability of the MOLEN concept, where the MPEG-2 application is accelerated very closely to its theoretical limits by implementing SAD, DCT and IDCT as reconfigurable co-processors.
Abstract: We use the Xilinx Virtex II Pro™ technology as prototyping platform to design a MOLEN polymorphic processor, a custom computing machine based on the co-processor architectural paradigm. The PowerPC embedded in the FPGA is operating as a general purpose (core) processor and the reconfigurable fabric is used as a reconfigurable co-processor. The paper focuses on hardware synthesis results and experimental performance evaluation, proving the viability of the MOLEN concept. More precisely, the MPEG-2 application is accelerated very closely to its theoretical limits by implementing SAD, DCT and IDCT as reconfigurable co-processors. For a set of popular test video sequences the MPEG-2 encoder overall speedup is in the range between 2.64 and 3.18. The speedup of the MPEG-2 decoder varies between 1.65 and 1.94.


Proceedings ArticleDOI
08 Mar 2004
TL;DR: This work aims at reducing the overhead for cooperative multithreading context switches at compile time by using standard compiler techniques such as context-insensitive analysis and register usage is rearranged to reduce the amount of context-switch code.
Abstract: Multithreading is an efficient way to improve efficiency of processor cores in embedded products for networking infrastructures. To make such improvements also accessible to processor cores without hardware support for multithreading, we present a concept for efficient software multithreading through compiler post-pass optimization of the application code. Our approach aims at reducing the overhead for cooperative multithreading context switches at compile time by using standard compiler techniques such as context-insensitive analysis. Additionally, register usage is rearranged to reduce the amount of context-switch code by exploiting multiple-load/store instructions. Performance model analysis encourages the use of software multithreading to improve processor utilization by showing the benefit of our approach. We present results obtained by an implementation for the PowerPC ISA (Instruction Set Architecture) using the code of a real network application (iSCSI). We were able to reduce the expected run-time of a context switch to as little as 38% of the original.

Journal ArticleDOI
TL;DR: The parallel implementation of a multigrid method for unstructured finite element discretizations of solid mechanics problems is described and an algebraic framework for the parallel computations is presented, and an object‐based programming methodology using Fortran90 is described.
Abstract: We describe the parallel implementation of a multigrid method for unstructured finite element discretizations of solid mechanics problems. We focus on a distributed memory programming model and use the MPI library to perform the required interprocessor communications. We present an algebraic framework for our parallel computations, and describe an object-based programming methodology using Fortran90. The performance of the implementation is measured by solving both fixed- and scaled-size problems on three different parallel computers (an SGI Origin2000, an IBM SP2 and a Cray T3E). The code performs well in terms of speedup, parallel efficiency and scalability. However, the floating point performance is considerably below the peak values attributed to these machines. Lazy processors are documented on the Origin that produce reduced performance statistics. The solution of two problems on an SGI Origin2000, an IBM PowerPC SMP and a Linux cluster demonstrate that the algorithm performs well when applied to the unstructured meshes required for practical engineering analysis. Copyright © 2004 John Wiley & Sons, Ltd.

Proceedings ArticleDOI
13 Sep 2004
TL;DR: A 64 b PowerPC microprocessor is introduced in 130 nm and redesigned in 90 nm SOI technology, which features PowerTune for rapid frequency and power scaling and electronic fuses.
Abstract: A 64 b PowerPC microprocessor is introduced in 130 nm and redesigned in 90 nm SOI technology. PowerPC 970 implements a SIMD instruction set with 512 kB L2 cache. It runs at 2.0 GHz with a 1.0 GHz bus in 130 nm. The 90 nm design features PowerTune for rapid frequency and power scaling and electronic fuses.

Dissertation
01 Jan 2004
TL;DR: This dissertation presents a streaming method, which is implemented and simulated on an MBX860 board and on a hardware/software co-simulation platform in which the PowerPC architecture was used, that enables small memory footprint devices to run applications larger than the physical memory by using the memory management technique.
Abstract: Downloading software from a server usually takes a noticeable amount of time, that is, noticeable to the user who wants to run the program. However, this issue can be mitigated by the use of streaming software. Software streaming is a means by which software can begin execution even while transmission of the full software program may still be in progress. Therefore, the application load time (i.e., the amount of time from when an application is selected for download to when the application can be executed) observed by the user can be significantly reduced. Moreover, unneeded software components might not be downloaded to the device, lowering memory and bandwidth usages. As a result, resource utilization such as memory and bandwidth usage may also be more efficient. Using our streaming method, an embedded device can support a wide range of applications which can be run on demand. Software streaming also enables small memory footprint devices to run applications larger than the physical memory by using our memory management technique. In this dissertation, we present a streaming method we call block streaming to transmit stream-enabled applications, including stream-enabled file I/O. We implemented a tool to partition software into blocks which can be transmitted (streamed) to the embedded device. Our streaming method was implemented and simulated on an MBX860 board and on a hardware/software co-simulation platform in which we used the PowerPC architecture. We show a robotics application that, with our software streaming method, is able to meet its deadline. The application load time for this application also improves by a factor of more than 10X when compared to downloading the entire application before running it. The experimental results also show that our implementation improves file I/O operation latency; in our examples, the performance improves up to 55.83X when compared with direct download. Finally, we show a stream-enabled game application combined with stream-enabled file I/O for which the user can start playing the game 3.18X more quickly than using only the stream-enabled game program file alone.

Proceedings Article
20 Jul 2004
TL;DR: The efficient coding and optimisation techniques used for the single instruction multiple data implementation of the algorithm have been shown to improve overall performance and as a result utilises minimum combustion event timing.
Abstract: This paper discusses a novel high performance knock processing strategy using a next generation Motorola automotive PowerPC system-on-a-chip. The proposed methodology is based on an auxiliary signal processing extension to the main PowerPC system-on-a-chip core along with various intelligent autonomous on-chip modules. Real-time software development techniques with an advanced software circular buffer implementation for processing the streaming knock sensor data have been developed. Various single instruction multiple data software optimisation techniques are employed to reduce the real-time knock algorithmic execution time. Real-time and simulation results are presented for the detection of knock on a four cylinder internal combustion engine, however, the approach is widely applicable. The efficient coding and optimisation techniques used for the single instruction multiple data implementation of the algorithm have been shown to improve overall performance and as a result utilises minimum combustion event timing.

Journal ArticleDOI
TL;DR: This article uses RTL, gate, and switch models of a design in two different flows, one for test and one for functional verification, to show that rectifying constraints and merging tests between the two flows saves significant presilicon debug effort.
Abstract: This article, from the Motorola (now Freescale) PowerPC design group, presents an interesting synergy among test, equivalence verification, and constraints. The authors use RTL, gate, and switch models of a design in two different flows, one for test and one for functional verification, to show that rectifying constraints and merging tests between the two flows saves significant presilicon debug effort.

Proceedings ArticleDOI
15 Aug 2004
TL;DR: This work considers the implementation of 16-bit floating point instructions on a Pentium 4 and a PowerPC G5 for image and media processing and shows that significant speed-up is obtained compared to 32-bit FP versions.
Abstract: We consider the implementation of 16-bit floating point instructions on a Pentium 4 and a PowerPC G5 for image and media processing. By measuring the execution time of benchmarks with these new simulated instructions, we show that significant speed-up is obtained compared to 32-bit FP versions. For image processing, the speed-up both comes from doubling the number of operations per SIMD instruction and the better cache behavior with byte storage. For data stream processing with arrays of structures, the speed-up mainly comes from the wider SIMD instructions.
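
The storage and lane-count argument in the abstract above can be checked with a small sketch. It uses the half-precision format code `e` of Python's standard `struct` module to compare 16-bit and 32-bit storage, and assumes 128-bit SIMD registers (the width of both SSE2 and AltiVec); it does not simulate the proposed instructions. The helper names are hypothetical.

```python
import struct


def to_half_bytes(values):
    """Pack floats as little-endian IEEE 754 half-precision (2 bytes each)."""
    return struct.pack("<%de" % len(values), *values)


def from_half_bytes(raw):
    """Unpack little-endian half-precision bytes back into Python floats."""
    return list(struct.unpack("<%de" % (len(raw) // 2), raw))


pixels = [0.0, 0.5, 1.0, 0.1]
half = to_half_bytes(pixels)                           # 16-bit storage
single = struct.pack("<%df" % len(pixels), *pixels)    # 32-bit storage
restored = from_half_bytes(half)

# Elements per 128-bit SIMD register for each format.
lanes_fp32 = 128 // 32
lanes_fp16 = 128 // 16
```

Halving the element size doubles the elements per register and per cache line, at the cost of precision: values such as 0.1 are not recovered exactly after the round trip.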

Book ChapterDOI
06 Jun 2004
TL;DR: Eve (Expressive Velocity Engine), an object oriented C++ library designed to ease the process of writing efficient numerical applications using AltiVec, the SIMD extension designed by Apple, Motorola and IBM for PowerPC processors, offers a significant improvement in terms of expressivity.
Abstract: This paper describes eve (Expressive Velocity Engine), an object oriented C++ library designed to ease the process of writing efficient numerical applications using AltiVec, the SIMD extension designed by Apple, Motorola and IBM for PowerPC processors. Compared to the original AltiVec C API, eve offers a significant improvement in terms of expressivity. By relying on template metaprogramming techniques, this is not obtained at the expense of efficiency.

Proceedings ArticleDOI
22 Oct 2004
TL;DR: A low-level compiling technique based on a minimal code generator with parametric embedded sections to generate binary code at run-time for intensively reused functions in graphic applications where the advantages of dynamic compilation have not been fully taken into account yet.
Abstract: Knowledge of data values at run-time allows us to generate better code in terms of efficiency, size and power consumption. This paper introduces a low-level compiling technique based on a minimal code generator with parametric embedded sections to generate binary code at run-time. This generator, called a "compilet", creates code and allocates registers using the data input. Then, it generates the needed instructions. Our measurements, performed on Itanium 2 and PowerPC platforms, have shown a speed improvement of 43% on the Itanium 2 platform and 41% on the PowerPC one. The proposed technique proves to be particularly useful in the case of intensively reused functions in graphic applications, where the advantages of dynamic compilation have not been fully taken into account yet.

Proceedings ArticleDOI
Jing Zeng1, M. Abadir1
05 Apr 2004
TL;DR: This paper demonstrates the correlations between the functional test frequency and that of various types of structural patterns on MPC7455, a Motorola processor executing to the PowerPC™ instruction set architecture.
Abstract: The utilization of functional vectors has been an industry standard for speed binning purpose. This practice can be prohibitively expensive as the ICs become faster and more complex. In comparison, structural patterns can target performance related faults in a more systematic manner. To make structural test an effective alternative to functional test for speed binning, structural patterns need to correlate with functional test frequency closely. In this paper, we demonstrate the correlations between the functional test frequency and that of various types of structural patterns on MPC7455, a Motorola processor executing to the PowerPC™ instruction set architecture.

08 Sep 2004
TL;DR: At Jet Propulsion Laboratory (JPL), the feasibility of running multiple processors in lock-step fashion to accomplish SEU mitigation and fault tolerance is demonstrated.
Abstract: Xilinx has recently developed a new field programmable gate array (FPGA) device family, Virtex-II Pro. A single device not only has logic cell densities from 3K to 125K, gigabit connectivity, on-chip memory, and digital clock management, but can also have up to four IBM PowerPC 405 processor hard cores, running up to 400MHz and 633 Mbps. To utilize this cutting edge device in space applications, a few Single Event Upset (SEU) mitigation techniques need to be implemented in a design for the device. At Jet Propulsion Laboratory (JPL), we have successfully demonstrated the feasibility of running multiple processors in lock-step fashion to accomplish SEU mitigation and fault tolerance.
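
One common way to exploit processors running in lock step is to compare their outputs and take a majority vote. The abstract above does not say how many processors are used or how disagreements are resolved, so the following is only a sketch under the assumption of three replicas and a bitwise voter; the function names are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three integer outputs: each bit takes the value held by at least two."""
    return (a & b) | (a & c) | (b & c)


def vote_and_flag(outputs):
    """Return the voted value and the indices of replicas that disagree with it."""
    voted = majority_vote(*outputs)
    faulty = [index for index, value in enumerate(outputs) if value != voted]
    return voted, faulty


good = 0xDEADBEEF
upset = good ^ (1 << 7)  # simulate a single-bit upset in one replica
voted_value, faulty_replicas = vote_and_flag([good, upset, good])
```

With three replicas a single upset is both masked and located, whereas two replicas in lock step can only detect a mismatch.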

Proceedings ArticleDOI
11 Oct 2004
TL;DR: Transparent fault tolerance for massively parallel supercomputers, scalable network emulation, compiler directed strategies for flexible data sharing models, and routing algorithms for backbone IP networks are focused on.
Abstract: System X was conceived in March 2003, designed in July 2003, and by October it had achieved a sustained performance of 10.28 Teraflops, making it the third fastest supercomputer in the world today. System X has several novel features. First, it is based on an Apple G5 platform with the new IBM PowerPC 970 64-bit CPUs. Secondly, it uses a high performance switched communications fabric called Infiniband. Finally, System X is cooled by a hybrid liquid-air cooling system. In this paper, the author presents the motivation for System X, its architecture, and the challenges faced in building, deploying, and maintaining a large-scale supercomputer. The paper is focused on transparent fault tolerance for massively parallel supercomputers, scalable network emulation, compiler directed strategies for flexible data sharing models, and routing algorithms for backbone IP networks.

Proceedings ArticleDOI
David L. Edwards1, H. Chambers, Mukta G. Farooq, L. Goldmann, A. Salehi 
01 Jun 2004
TL;DR: This paper describes how through a cooperative effort between Apple and IBM, a BGA reliability enhancement was evaluated and successfully implemented, which strengthens the BGA connections between the processor module and the processor card and increases long term reliability performance affected by creep and cyclic fatigue.
Abstract: Apple's Power Mac G5 systems use either one or two IBM PowerPC 970 chips. Initial systems built with the PowerPC 970 64-bit processor run at speeds up to 2.0 GHz. These chips are packaged on IBM ceramic BGA (ball grid array) modules. The high performance modules dissipate high power, which presents new packaging challenges. One of these challenges has been addressed successfully by improving the thermo-mechanical integrity of the solder interconnections between the chip carrier module and the organic processor board. The PowerPC 970 chip dissipates high power in a small area and is aggressively cooled using a state-of-the art heatsink design. This paper describes how through a cooperative effort between Apple and IBM, a BGA reliability enhancement was evaluated and successfully implemented. Use of BGA underfill strengthens the BGA connections between the processor module and the processor card and increases long term reliability performance affected by creep and cyclic fatigue.

01 Jan 2004
TL;DR: In this paper, the thermal aspects of this concurrent process that required the use of a board-level (AutoTherm from Mentor Graphics) and system-level thermal analysis tool (FLOTHERM from Flomerics) were discussed.
Abstract: This paper discusses an attempt to bring thermal analysis early in the printed-circuit board design process, when designing Motorola’s PowerPC 603™ and PowerPC 604™ microprocessor-based desktop system. The goal was to assess a methodology that should help to define a real concurrent design process for future projects. We emphasize here the thermal aspects of this concurrent process that required the use of a board-level (AutoTherm from Mentor Graphics) and system-level thermal analysis tool (FLOTHERM from Flomerics). After describing the project, and the dataflow currently available between AutoTherm and FLOTHERM, we describe the practical steps that were carried out in this project, and how thermal design has finally been included as one of the constraints during the component placement phase of the printed-circuit board design. Overall, the experience gained through this project on multi-level thermal analysis, as well as working in a cross-functional team environment, is presented. Also presented are the steps for implementing such a concurrent design flow.

Patent
10 Nov 2004
TL;DR: In this article, the authors propose a method to make Bootrom and VxWorks images able to be normally compiled and run all the time, which redefines and redistributes the address space for RAM_LOW_ADRS, RAM_HIGH_ADRS, etc in the two images, and accordingly processes the change of the system's own Memory pool.
Abstract: The invention provides a method to make Bootrom and VxWorks images able to be normally compiled and run all the time, which redefines and redistributes the address space for RAM_LOW_ADRS, RAM_HIGH_ADRS, etc in the two images, and accordingly processes the change of the system's own Memory pool, so that when the length of the codes of the two images in the products with PowerPC series as CPU is more than 32M, they can still be normally compiled and run.

Proceedings ArticleDOI
02 Apr 2004
TL;DR: Inspired by features from both the DAISY and Crusoe™ microprocessors, a conceptual design of a dynamically reconfigurable microprocessor is given.
Abstract: A microprocessor taxonomy is introduced based on whether: (1) the hardware is static or reconfigurable and (2) the code translation process is static or dynamic. The IBM DAISY and Transmeta Crusoe™ microprocessors are reviewed. These static hardware microprocessors support a dynamic translation process to execute programs originally compiled for the PowerPC and Intel® X86 microprocessors, respectively. Inspired by features from both the DAISY and Crusoe™ microprocessors, a conceptual design of a dynamically reconfigurable microprocessor is given. Driven by the results of a preliminary study, a specific approach to designing a reconfigurable microprocessor is presented. As a part of this approach, the concept of partitioning the instruction set of a microprocessor in order to support an application, instead of partitioning the functionality of the application, is developed.