
Showing papers on "PowerPC published in 2008"


Journal ArticleDOI
TL;DR: Performance measurements show that the tested open-source software is suitable for hard real-time applications.
Abstract: We report on a set of performance measurements executed on VMEbus MVME5500 boards equipped with the MPC7455 PowerPC processor, running four different operating systems: Wind River VxWorks, Linux, RTAI, and Xenomai. Some components of RTAI and Xenomai have been ported to the target architecture. Interrupt latency, rescheduling and inter-process communication times are compared in the framework of a sample real-time application. Performance measurements on Gigabit Ethernet network communication have also been carried out on the target boards. For this purpose, we have considered the Linux IP stack and RTnet, an open-source hard real-time network protocol stack for Xenomai and RTAI, which was ported to the considered architecture. Performance measurements show that the tested open-source software is suitable for hard real-time applications.

127 citations


Proceedings ArticleDOI
15 Nov 2008
TL;DR: This investigation confirms that BG/P has good scalability with an expected lower performance per processor when compared to the Cray XT4's Opteron, yet it has less of a power advantage when considering science driven metrics for mission applications.
Abstract: BlueGene/P (BG/P) is the second generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz with a double precision, dual pipe floating point unit per core. These chips are connected with multiple interconnection networks including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications, and as compared to other major large scale supercomputers in use today. Our investigation confirms that BG/P has good scalability with an expected lower performance per processor when compared to the Cray XT4's Opteron. We also find that BG/P uses very low power per floating point operation for certain kernels, yet it has less of a power advantage when considering science driven metrics for mission applications.

82 citations


Book ChapterDOI
10 Aug 2008
TL;DR: New chosen-message power-analysis attacks against public-key cryptosystems based on modular exponentiation, which use specific input pairs to generate collisions between squaring operations at different locations in the two power traces, are proposed.
Abstract: This paper proposes new chosen-message power-analysis attacks against public-key cryptosystems based on modular exponentiation, which use specific input pairs to generate collisions between squaring operations at different locations in the two power traces. Unlike previous attacks of this kind, the new attacks can be applied to all the standard implementations of the exponentiation process: binary (left-to-right and right-to-left), m-ary, and sliding window methods. The SPA countermeasure of inserting dummy multiplications can also be defeated (in some cases) by using the proposed attacks. The effectiveness of the attacks is demonstrated by actual experiments with hardware and software implementations of RSA on an FPGA and the PowerPC processor, respectively. In addition to the new collision generation methods, a high-accuracy waveform matching technique is introduced to detect the collisions even when the recorded signals are noisy and the clock has some jitter.

81 citations


Proceedings ArticleDOI
07 Jun 2008
TL;DR: It is shown that register allocation can be viewed as solving a collection of puzzles: the register file is modeled as a puzzle board and the program variables as puzzle pieces; pre-coloring and register aliasing fit in naturally.
Abstract: We show that register allocation can be viewed as solving a collection of puzzles. We model the register file as a puzzle board and the program variables as puzzle pieces; pre-coloring and register aliasing fit in naturally. For architectures such as PowerPC, x86, and StrongARM, we can solve the puzzles in polynomial time, and we have augmented the puzzle solver with a simple heuristic for spilling. For SPEC CPU2000, the compilation time of our implementation is as fast as that of the extended version of linear scan used by LLVM, which is the JIT compiler in the OpenGL stack of Mac OS 10.5. Our implementation produces x86 code that is of similar quality to the code produced by the slower, state-of-the-art iterated register coalescing of George and Appel with the extensions proposed by Smith, Ramsey, and Holloway in 2004.
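As a toy illustration of the puzzle metaphor (a simplified sketch, not the paper's polynomial-time allocator), one can model the register file as a board of two-slot banks, like the aliased AH/AL halves of an x86 register, and place the larger pieces before the smaller ones:

```python
# Toy puzzle-board register model (illustrative only): each bank has
# two aliased half-slots; variables are pieces of size 1 or 2.
def solve_puzzle(num_banks, pieces):
    """pieces: list of (name, size) with size 1 or 2.
    Returns {name: (bank, offset)}, or None if spilling would be needed."""
    board = [[None, None] for _ in range(num_banks)]
    assignment = {}
    # Place size-2 pieces first: they need a whole empty bank.
    for name, size in sorted(pieces, key=lambda p: -p[1]):
        placed = False
        for b, bank in enumerate(board):
            if size == 2 and bank == [None, None]:
                bank[0] = bank[1] = name       # occupies both halves
                assignment[name] = (b, 0)
                placed = True
                break
            if size == 1 and None in bank:
                off = bank.index(None)         # first free half-slot
                bank[off] = name
                assignment[name] = (b, off)
                placed = True
                break
        if not placed:
            return None  # this piece would have to be spilled
    return assignment
```

Placing wide pieces first is what makes a greedy pass sufficient on this toy board; the paper handles the general case with pre-colored pieces and spilling heuristics.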

77 citations


01 Jan 2008
TL;DR: 2004 was the year of multicore; in particular, multicore is available on current PowerPC and Sparc IV processors, and is coming in 2005 from Intel and AMD.
Abstract: Your free lunch will soon be over. What can you do about it? What are you doing about it? The major processor manufacturers and architectures, from Intel and AMD to Sparc and PowerPC, have run out of room with most of their traditional approaches to boosting CPU performance. Instead of driving clock speeds and straight-line instruction throughput ever higher, they are instead turning en masse to hyperthreading and multicore architectures. Both of these features are available on chips today; in particular, multicore is available on current PowerPC and Sparc IV processors, and is coming in 2005 from Intel and AMD. Indeed, the big theme of the 2004 In-Stat/MDR Fall Processor Forum (http://www.mdronline.com/fpf04/index.html) was multicore devices, with many companies showing new or updated multicore processors. Looking back, it's not much of a stretch to call 2004 the year of multicore.

65 citations


Proceedings ArticleDOI
17 Nov 2008
TL;DR: This work describes an approach, based on proof-producing decompilation, which both makes machine-code verification tractable and supports proof reuse between different languages, and presents examples based on detailed models of machine code for ARM, PowerPC and x86.
Abstract: Realistic formal specifications of machine languages for commercial processors consist of thousands of lines of definitions. Current methods support trustworthy proofs of the correctness of programs for one such specification. However, these methods provide little or no support for reusing proofs of the same algorithm implemented in different machine languages. We describe an approach, based on proof-producing decompilation, which both makes machine-code verification tractable and supports proof reuse between different languages. We briefly present examples based on detailed models of machine code for ARM, PowerPC and x86. The theories and tools have been implemented in the HOL4 system.

64 citations


Journal ArticleDOI
Kun Wang1, Yu Zhang1, Huayong Wang1, Xiaowei Shen1
TL;DR: A global-lock-based method is proposed to guarantee compatibility of P-Mambo with future Mambo modules, and a core-based module partition is introduced to achieve both high inter-scheduler parallelism and low inter-scheduler dependency.
Abstract: Mambo [4] is IBM's full-system simulator which models PowerPC systems, and provides a complete set of simulation tools to help IBM and its partners in pre-hardware development and performance evaluation for future systems. Currently Mambo simulates target systems on a single host thread. When the number of cores increases in a target system, Mambo's simulation performance for each core goes down. As the so-called "multi-core era" approaches, both target and host systems will have more and more cores. It is very important for Mambo to efficiently simulate a multi-core target system on a multi-core host system. Parallelization is a natural method to speed up Mambo in this situation. Parallel Mambo (P-Mambo) is a multi-threaded implementation of Mambo. Mambo's simulation engine is implemented as a user-level thread-scheduler. We propose a multi-scheduler method to adapt Mambo's simulation engine to multi-threaded execution. Based on this method, a core-based module partition is proposed to achieve both high inter-scheduler parallelism and low inter-scheduler dependency. Protection of shared resources is crucial to both the correctness and the performance of P-Mambo. Since there are two tiers of threads in P-Mambo, protecting shared resources with OS-level locks alone can introduce deadlocks due to user-level context switches. We propose a new lock mechanism to handle this problem. Since Mambo is an ongoing project with many modules currently under development, co-existence with new modules is also important to P-Mambo. We propose a global-lock-based method to guarantee compatibility of P-Mambo with future Mambo modules. We have implemented the first version of P-Mambo in functional modes. The performance of P-Mambo has been evaluated on the OpenMP implementation of the NAS Parallel Benchmark (NPB) 3.2 [12]. Preliminary experimental results show that P-Mambo achieves an average speedup of 3.4 on a 4-core host machine.

53 citations


Proceedings ArticleDOI
14 Apr 2008
TL;DR: Preliminary work is presented on a domain-specific compiler that generates implementations for arbitrary sequences of basic linear algebra operations and tunes them for memory efficiency.
Abstract: The performance bottleneck for many scientific applications is the cost of memory access inside linear algebra kernels. Tuning such kernels for memory efficiency is a complex task that reduces the productivity of computational scientists. Software libraries such as the Basic Linear Algebra Subprograms (BLAS) ameliorate this problem by providing a standard interface for which computer scientists and hardware vendors have created highly-tuned implementations. Scientific applications often require a sequence of BLAS operations, which presents further opportunities for memory optimization. However, because BLAS are tuned in isolation they do not take advantage of these opportunities. This phenomenon motivated the recent addition to the BLAS of several routines that perform sequences of operations. Unfortunately, the exact sequence of operations needed in a given situation is highly application dependent, so many more routines are needed. In this paper we present preliminary work on a domain-specific compiler that generates implementations for arbitrary sequences of basic linear algebra operations and tunes them for memory efficiency. We report experimental results for dense kernels and show speedups of 25% to 120% relative to sequences of calls to GotoBLAS and vendor-tuned BLAS on Intel Xeon and IBM PowerPC platforms.
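The memory-traffic argument behind fusing a sequence of operations can be sketched in plain scalar code (a hypothetical illustration, not the compiler's generated kernels): two separate BLAS-like passes materialize an intermediate vector, while the fused version never writes it to memory:

```python
# Unfused: an axpy pass writes intermediate w, then a dot pass
# re-reads it -- 2n extra memory operations for the vector w.
def axpy_then_dot_unfused(alpha, x, y, z):
    w = [alpha * xi + yi for xi, yi in zip(x, y)]   # pass 1: writes w
    return sum(wi * zi for wi, zi in zip(w, z))     # pass 2: re-reads w

# Fused: each element of w lives only in a register-sized temporary.
def axpy_then_dot_fused(alpha, x, y, z):
    s = 0.0
    for xi, yi, zi in zip(x, y, z):                 # single pass, no w
        s += (alpha * xi + yi) * zi
    return s
```

Both return the same value; the difference a tuned implementation exploits is purely in how many times the data crosses the memory hierarchy.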

48 citations


Proceedings ArticleDOI
25 Oct 2008
TL;DR: A memory consistency model and a programming model for COMIC are proposed, in which the management of synchronization and coherence is centralized in the PPE, providing the program with an illusion of a globally shared memory.
Abstract: The Cell BE processor is a heterogeneous multicore that contains one PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Each SPE has a small software-managed local store. Applications must explicitly control all DMA transfers of code and data between the SPE local stores and the main memory, and they must perform any coherence actions required for data transferred. The need for explicit memory management, together with the limited size of the SPE local stores, makes it challenging to program the Cell BE and achieve high performance. In this paper, we present the design and implementation of our COMIC runtime system and its programming model. It provides the program with an illusion of a globally shared memory, in which the PPE and each of the SPEs can access any shared data item, without the programmer having to worry about where the data is, or how to obtain it. COMIC is implemented entirely in software with the aid of user-level libraries provided by the Cell SDK. For each read or write operation in SPE code, a COMIC runtime function is inserted to check whether the data is available in its local store, and to automatically fetch it if it is not. We propose a memory consistency model and a programming model for COMIC, in which the management of synchronization and coherence is centralized in the PPE. To characterize the effectiveness of the COMIC runtime system, we evaluate it with twelve OpenMP benchmark applications on a Cell BE system and an SMP-like homogeneous multicore (Xeon).
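The check-and-fetch idea behind such a software shared memory can be sketched as follows (all names here are illustrative, not the actual COMIC API): each read in "SPE" code goes through a runtime call that consults a small software-managed local store and fetches the missing line from "main memory" on a miss:

```python
# Sketch of a software read check (hypothetical names, not COMIC's):
# reads are routed through LocalStore.read, which emulates the DMA
# fetch an SPE would issue on a local-store miss.
LINE = 4  # words per transfer unit

class LocalStore:
    def __init__(self, main_memory):
        self.main = main_memory      # stands in for PPE-side memory
        self.lines = {}              # line number -> cached words
        self.misses = 0

    def read(self, addr):
        ln = addr // LINE
        if ln not in self.lines:     # miss: emulate a DMA fetch
            self.misses += 1
            base = ln * LINE
            self.lines[ln] = self.main[base:base + LINE]
        return self.lines[ln][addr % LINE]
```

The cost model is visible even in this toy: consecutive reads within a fetched line are cheap, which is why inserting the check at every read is tolerable only if most reads hit the local store.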

47 citations


Journal ArticleDOI
TL;DR: Evaluations of these prototypes on a number of benchmark and hashing algorithm case studies indicate the enhanced resource utilization and run time performance of the developed approaches.
Abstract: A multilayer run-time reconfiguration architecture (MRRA) is developed for autonomous run-time partial reconfiguration of field-programmable gate-array (FPGA) devices. MRRA operations are partitioned into logic, translation, and reconfiguration layers along with a standardized set of application programming interfaces (APIs). At each level, resource details are encapsulated and managed for efficiency and portability during operation. In particular, FPGA configurations can be manipulated at runtime using on-chip resources. A corresponding logic control flow is developed for a prototype MRRA system on a Xilinx Virtex II Pro platform. The Virtex II Pro on-chip PowerPC core and block RAM are employed to manage control operations while multiple physical interfaces establish and supplement autonomous reconfiguration capabilities. Evaluations of these prototypes on a number of benchmark and hashing algorithm case studies indicate the enhanced resource utilization and run time performance of the developed approaches.

29 citations


Book ChapterDOI
16 Dec 2008
TL;DR: This work uses 2166 processors of the MareNostrum (IBM PowerPC 970) supercomputer to model seismic wave propagation in the inner core of the Earth following an earthquake using the spectral-element method, a high-degree finite-element technique with an exactly diagonal mass matrix.
Abstract: We use 2166 processors of the MareNostrum (IBM PowerPC 970) supercomputer to model seismic wave propagation in the inner core of the Earth following an earthquake. Simulations are performed based upon the spectral-element method, a high-degree finite-element technique with an exactly diagonal mass matrix. We use a mesh with 21 billion grid points (and therefore approximately 21 billion degrees of freedom because a scalar unknown is used in most of the mesh). A total of 2.5 terabytes of memory is needed. Our implementation is purely based upon MPI. We optimize it using the ParaVer analysis tool in order to significantly improve load balancing and therefore overall performance. Cache misses are reduced based upon renumbering of the mesh points.

Proceedings ArticleDOI
23 Sep 2008
TL;DR: This paper presents the implementation of ReconOS, the hardware/software multithreaded programming model, on both eCos and Linux-based host systems as well as on PowerPC and MicroBlaze CPUs, demonstrating that ReconOS provides a truly portable abstraction layer for programming reconfigurable computers.
Abstract: The multithreaded programming model has been shown to provide a suitable abstraction for reconfigurable computers. Previous implementations of corresponding runtime systems have been limited to a single host operating system, hardware platform, or application domain. This paper presents the implementation of ReconOS, our hardware/software multithreaded programming model, on both eCos and Linux-based host systems as well as on PowerPC and MicroBlaze CPUs. This demonstrates that ReconOS provides a truly portable abstraction layer for programming reconfigurable computers. Further, we quantify the performance of operating system calls and measure the resulting application level performance.

Journal ArticleDOI
TL;DR: A hybrid design-time/runtime reconfiguration scheduling heuristic that generates its final schedule at runtime but carries out most computations at design time is developed.
Abstract: Due to the emergence of portable devices that must run complex dynamic applications, there is a need for flexible platforms for embedded systems. Runtime reconfigurable hardware can provide this flexibility, but the reconfiguration latency can significantly decrease the performance. When dealing with task graphs, runtime support that schedules the reconfigurations in advance can drastically reduce this overhead. However, executing complex scheduling heuristics at runtime may generate an excessive penalty. Hence, we have developed a hybrid design-time/runtime reconfiguration scheduling heuristic that generates its final schedule at runtime but carries out most computations at design time. We have tested our approach on a PowerPC 405 processor embedded in an FPGA, demonstrating that it generates a very small runtime penalty while providing almost as good schedules as a full runtime approach.

Journal ArticleDOI
TL;DR: This paper presents a high-level performance estimator based on a neural network, which easily adapts to the non-linear behaviour of the execution time in advanced architectures, and achieves a speed-up of up to 190 times compared with cycle-accurate simulators, using the PowerPC 750 as the target architecture.

Proceedings ArticleDOI
05 May 2008
TL;DR: This paper takes the first steps in supporting I/O intensive workloads on the Cell/BE and deriving guidelines for optimizing the execution of I/Os on heterogeneous architectures, and explores various performance enhancing techniques for such workloads with promising results.
Abstract: The recent advent of asymmetric multi-core processors such as the Cell Broadband Engine (Cell/BE) has popularized the use of heterogeneous architectures. A growing body of research is exploring the use of such architectures, especially in High-End Computing, for supporting scientific applications. However, prior research has focused on use of the available Cell/BE operating systems and runtime environments for supporting compute-intensive jobs. Data and I/O intensive workloads have largely been ignored in this domain. In this paper, we take the first steps in supporting I/O intensive workloads on the Cell/BE and deriving guidelines for optimizing the execution of I/O workloads on heterogeneous architectures. We explore various performance enhancing techniques for such workloads on an actual Cell/BE system. Among the techniques we explore, an asynchronous prefetching-based approach, which uses the PowerPC core of the Cell/BE for file prefetching and decentralized DMAs from the synergistic processing cores (SPEs), improves the performance for I/O workloads that include an encryption/decryption component by 22.2%, compared to I/O performed naively from the SPEs. Our evaluation shows promising results and lays the foundation for developing more efficient I/O support libraries for multi-core asymmetric architectures.

Proceedings ArticleDOI
10 Oct 2008
TL;DR: A method for quickly estimating power consumption in the first steps of a system's design is presented, along with its use at different levels in the component-based AADL design flow.
Abstract: This paper presents a method for quickly estimating the power consumption in the first steps of a system's design. We present multi-level power models and show how to use them at different levels of the specification refinement in the component-based AADL design flow. PET, a power estimation tool, is being developed as part of the European SPICES project. Its first prototype gives, in the case of a processor binding, power consumption estimations for software components in the AADL component assembly model, with a maximal error ranging roughly from 5% to 30% depending on the refinement level. We illustrate our approach with the power model of the PowerPC 405 and its use at different levels in the AADL flow.

Journal ArticleDOI
TL;DR: This design project provides a practical introduction to System-on- Chip (SOC) design, embedded processor design, hardware-software co-design, and general FPGA development.
Abstract: This paper presents a reference design and tutorial for an embedded PowerPC subsystem core with user logic in a Xilinx field-programmable gate array (FPGA). The design and tutorial were created to help graduate students who are doing research in complex electronic applications and want to prototype their designs in an FPGA. Specifically, the design provides a starting point for any application that requires an embedded processor plus user logic that is external to the processor block, but must interface to it. In addition, this material is useful as a supplementary laboratory module in advanced FPGA design (for senior- and graduate-level courses). The design project provides a practical introduction to system-on-chip (SOC) design, embedded processor design, hardware-software codesign, and general FPGA development. The authors' assessment shows that even third-year electrical engineering students can complete the tutorial successfully (within approximately three hours). The design database and tutorial document are publicly available and can be downloaded from a website at The University of British Columbia (UBC), Vancouver, BC, Canada.

Proceedings ArticleDOI
23 Sep 2008
TL;DR: A method for generating partial bitstreams on-line for use in systems with run-time reconfigurable FPGAs; by restricting the number of possible module arrangements, it allows bitstream creation to be performed with relatively few computational resources.
Abstract: The paper presents a method for generating partial bitstreams on-line for use in systems with run-time reconfigurable FPGAs. Bitstream creation is performed at run-time by merging partial bitstreams from individual component modules. The process includes the capability to create connections between the modules by selection from a set of routes found during an off-line pre-processing step. Placement and interconnection of modules must follow a precise set of rules. While restricting the number of possible module arrangements, this approach allows bitstream creation to be performed with relatively few computational resources. Using a demonstration system with a Virtex-II Pro FPGA with a PowerPC 405 CPU, the process of creating at run-time a partial bitstream for 22% of the device area takes 24 ms.

Proceedings ArticleDOI
22 Jun 2008
TL;DR: This paper proposes a generic architecture and characterize its complexity, maximum frequency of operation, and global throughput for NoCs supporting 2 to 8 nodes and shows that FPGA-based designs would benefit from such architecture when high throughput must be reached.
Abstract: Networks-on-chip (NoCs) have emerged as a new design paradigm to implement MPSoCs that competes with the standard bus approach. They offer more scalability, flexibility, and bandwidth. Nevertheless, FPGA manufacturers still use the bus paradigm in their development frameworks. In this paper, we study the complexity and performance of an FPGA implementation of a crossbar NoC. We propose a generic architecture and characterize its complexity, maximum frequency of operation, and global throughput for NoCs supporting 2 to 8 nodes. Results show that FPGA-based designs would benefit from such an architecture when high throughput must be reached. Finally, we present a fully functional 3×3 NoC interconnecting a PowerPC and 2 Xtensa processors implemented in a Virtex-II Pro FPGA.

Proceedings ArticleDOI
01 Mar 2008
TL;DR: This paper will explore design strategies and mappings of a hyperspectral imaging (HSI) classification algorithm for a mix of processing paradigms on an advanced space computing system, featuring MPI-based parallel processing with multiple PowerPC microprocessors each coupled with kernel acceleration via FPGA and/or AltiVec resources.
Abstract: Projected demands for future space missions, where on-board sensor processing and autonomous control rapidly expand computational requirements, are outpacing technologies and trends in conventional embedded microprocessors. To achieve higher levels of performance as well as relative performance versus power consumption, new processing technologies are of increasing interest for space systems. Technologies such as reconfigurable computing based upon FPGAs and vector processing based upon SIMD processor extensions, often in tandem with conventional software processors in the form of multiparadigm computing, offer a compelling solution. This paper will explore design strategies and mappings of a hyperspectral imaging (HSI) classification algorithm for a mix of processing paradigms on an advanced space computing system, featuring MPI-based parallel processing with multiple PowerPC microprocessors each coupled with kernel acceleration via FPGA and/or AltiVec resources. Design of key components of HSI including autocorrelation matrix calculation, weight computation, and target detection will be discussed, and hardware/software performance tradeoffs evaluated. Additionally, several parallel-partitioning strategies will be considered for extending single-node performance to a clustered architecture. Performance factors in terms of execution time and parallel efficiency will be examined on an experimental testbed. Power consumption will be investigated, and tradeoffs between performance and power consumption analyzed. This work is part of the Dependable Multiprocessor (DM) project at Honeywell and the University of Florida, one of the four experiments in the Space Technology 8 (ST-8) mission of NASA's New Millennium Program.

Journal ArticleDOI
TL;DR: A novel, hardware-supported approach that optimizes the number and size of generated LUTs to improve the compression ratio and is orthogonal to approaches that take particularities of a certain instruction set architecture into account.
Abstract: Code density is of increasing concern in embedded system design since it reduces the need for the scarce resource memory and also implicitly improves further important design parameters like power consumption and performance. In this paper we introduce a novel, hardware-supported approach. Besides the code, also the lookup tables (LUTs) are compressed, which can become significant in size if the application is large and/or high compression is desired. Our scheme optimizes the number and size of generated LUTs to improve the compression ratio. To show the efficiency of our approach, we apply it to two compression schemes: "dictionary-based" and "statistical". We achieve an average compression ratio of 48% (already including the overhead of the LUTs). Thereby, our scheme is orthogonal to approaches that take particularities of a certain instruction set architecture into account. We have conducted evaluations using a representative set of applications and have applied it to three major embedded processor architectures, namely ARM, MIPS, and PowerPC.
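A minimal sketch of dictionary-based code compression (illustrative only, not the paper's scheme) makes the LUT-overhead trade-off concrete: the most frequent instruction words go into a LUT and are replaced by short indices, and the compression ratio must account for the LUT itself:

```python
# Toy dictionary-based code compression (hypothetical encoding sizes):
# frequent 4-byte instruction words become 1-byte LUT indices; rare
# words cost a 1-byte escape plus the raw 4-byte word.
from collections import Counter

def compress(instructions, lut_size):
    lut = [w for w, _ in Counter(instructions).most_common(lut_size)]
    index = {w: i for i, w in enumerate(lut)}
    coded_bytes = sum(1 if w in index else 1 + 4 for w in instructions)
    lut_bytes = 4 * len(lut)                  # the LUT overhead counts too
    ratio = (coded_bytes + lut_bytes) / (4 * len(instructions))
    return lut, ratio
```

Growing the LUT shortens the coded stream but inflates `lut_bytes`; optimizing that balance is exactly the knob the paper's scheme turns.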

Journal ArticleDOI
TL;DR: This paper presents a behavior-based error detection technique called Control Flow Checking using Branch Trace Exceptions for the PowerPC processor family (CFCBTE), based on the branch trace exception feature available in the PowerPC processor family for debugging purposes.
Abstract: This paper presents a behavior-based error detection technique called Control Flow Checking using Branch Trace Exceptions for the PowerPC processor family (CFCBTE). This technique is based on the branch trace exception feature available in the PowerPC processor family for debugging purposes. It traces the target addresses of program branches at run-time and compares them with reference target addresses to detect possible violations caused by transient faults. The reference target addresses are derived by a preprocessor from the source program. To enhance the error detection coverage, three other mechanisms, i.e., Machine Check Exception, System Trap Instructions and Work Load Timer, are combined with the Branch Trace Exception mechanism. The proposed technique is experimentally evaluated on a 32-bit PowerPC microcontroller using software implemented fault injection (SWIFI) and power supply disturbance fault injection (PSD). A total of 6,000 faults were injected into the microcontroller to measure the error detection coverage of the proposed control flow checking technique. The experimental results show that this technique detects about 95.2% of transient errors with the software implemented fault injection method and 96.4% of transient errors with the power supply disturbance fault injection method.
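The comparison step can be sketched as follows (a simplified model; the names and data shapes are assumptions, not the CFCBTE implementation): a preprocessor derives the set of legal targets for each branch site, and every traced branch is checked against that reference:

```python
# Sketch of branch-trace checking (illustrative data shapes): the
# exception handler receives (branch_pc, target_pc) pairs and flags
# any target not in the statically derived reference set.
def check_trace(trace, reference):
    """trace: list of (branch_pc, target_pc) observed at run time.
    reference: dict branch_pc -> set of legal target_pcs."""
    violations = []
    for pc, target in trace:
        if target not in reference.get(pc, set()):
            violations.append((pc, target))   # possible transient fault
    return violations
```

A branch site absent from the reference map is itself suspicious here, since a transient fault may have turned a non-branch instruction into a branch.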

Proceedings ArticleDOI
17 Nov 2008
TL;DR: A pencil-and-paper procedure to design transmission-gate latches for high-speed performance, based on the Logical Effort approach, independently optimizes the master and slave section to get minimum delay, sizing all transistors in the critical path.
Abstract: In this paper we present a pencil-and-paper procedure to design transmission-gate latches for high-speed performance. The procedure, based on the Logical Effort approach, independently optimizes the master and slave section to get minimum delay, sizing all transistors in the critical path. The other devices, like keeper transistors or switches in the positive feedback networks, are sized with minimum width thus providing only a negligible capacitive load to the internal nodes. Simulations are performed on a PowerPC 603 master-slave latch designed with a 90-nm technology provided by STMicroelectronics, and the overall good performance of the proposed procedure compared to other design strategies is verified.

Proceedings ArticleDOI
20 Jul 2008
TL;DR: A 2D graphics algorithm for image resizing which is parallelized and developed on the Cell BE; results indicate that the Cell processor can outperform modern RISC processors by 20x on SIMD compute-intensive applications such as image resizing.
Abstract: The IBM Cell Broadband Engine (BE) is a multi-core processor with a PowerPC host processor (PPE) and 8 synergistic processor engines (SPEs). The Cell BE architecture is designed to improve upon conventional processors in terms of memory latency, bandwidth and compute power. In this paper, we describe a 2D graphics algorithm for image resizing which we parallelized and developed on the Cell BE. We report the performance measured on one Cell blade with varying numbers of synergistic processor engines enabled. These results were compared to those obtained on the Cell's single PPE with all 8 SPEs disabled. The results indicate that the Cell processor can outperform modern RISC processors by 20x on SIMD compute-intensive applications such as image resizing.

Book ChapterDOI
05 Jan 2008
TL;DR: This paper presents the porting, performance optimization and evaluation of CG on the Cell Broadband Engine (CBE), a heterogeneous multi-core processor with SIMD accelerators, and takes advantage of CBE's particular architecture to optimize the performance of CG.
Abstract: The NAS Conjugate Gradient (CG) benchmark is an important scientific kernel used to evaluate machine performance and compare characteristics of different programming models. CG represents a computation and communication paradigm for sparse linear algebra, which is common in scientific fields. In this paper, we present the porting, performance optimization and evaluation of CG on the Cell Broadband Engine (CBE). CBE, a heterogeneous multi-core processor with SIMD accelerators, is gaining attention and being deployed on supercomputers and high-end server architectures. We take advantage of CBE's particular architecture to optimize the performance of CG. We also quantify these optimizations and assess their impact. In addition, by exploiting the distributed nature of CBE, we present the trade-off between parallelization and serialization, and Cell-specific data scheduling in its memory hierarchy. Our final result shows that CG-Cell can achieve more than a 4x speedup over the performance of a single comparable PowerPC processor.
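For reference, the serial CG iteration that such a port starts from can be sketched in a few lines (an unoptimized scalar baseline, not the Cell implementation, which vectorizes and distributes these kernels across the SPEs):

```python
# Plain conjugate gradient for a symmetric positive-definite system
# A x = b; the matrix-vector product and dot products are the kernels
# a Cell port offloads and SIMD-izes.
def cg(A, b, iters=50, tol=1e-10):
    n = len(b)
    mv = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = b[:]                       # residual r = b - A x, with x = 0
    p = r[:]                       # initial search direction
    rr = dot(r, r)
    for _ in range(iters):
        Ap = mv(A, p)
        alpha = rr / dot(p, Ap)    # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if rr_new < tol:
            break
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x
```

The dominant cost per iteration is the sparse matrix-vector product, which is exactly where the local-store data scheduling described in the abstract pays off.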

01 Jan 2008
TL;DR: This application note describes mitigation techniques and corresponding design flow when using a Xilinx FPGA with an embedded processor (specifically the PowerPC ® 405 found in the Virtex™-4 FX family) in high-radiation environments.
Abstract: Orbital, space-based, and extra-terrestrial applications are susceptible to the effects of high-energy charged particles. Single-event upsets (SEUs) can alter the logic state of any static memory element (latch, flip-flop, or RAM cell), including the components of an embedded hard processor. These upsets are unavoidable but correctable for the logic around the processor in FPGA configuration memory. This application note describes mitigation techniques and the corresponding design flow when using a Xilinx FPGA with an embedded processor (specifically the PowerPC ® 405 found in the Virtex™-4 FX family) in high-radiation environments. It includes a scrubber example for the block RAMs attached to the processor local bus (PLB) and used for code execution. Since this technique cannot triplicate the PowerPC 405 (PPC405), the surrounding logic is mitigated as much as practically possible. Therefore, the user must determine if the system mitigation is sufficient for the target environment. Note: It is essential for the reader to have a basic understanding of the Xilinx tool flow using the Xilinx Platform Studio (XPS), triple-module-redundancy (TMR) techniques, the Xilinx TMRTool, and ISE™ software. An in-depth understanding of [Ref 1] is also essential. In addition, an understanding of VHDL design and practice is recommended.

Journal ArticleDOI
TL;DR: A solution is presented where a Linux kernel running on a PowerPC processor included in the Virtex-II Pro FPGA family is upgraded to support hardware acceleration on the ciphering tasks.
Abstract: With the growth of the portable electronic devices market, not only the protection of users' data but also the security of the designs themselves has grown significantly in importance. A solution is presented where a Linux kernel running on a PowerPC processor included in the Virtex-II Pro FPGA family is upgraded to support hardware acceleration of ciphering tasks. In this way all the programs running on the PPC that make use of the Linux CryptoAPI can be accelerated by hardware in a transparent way, without requiring the programmer to rewrite the applications. To provide more flexibility, the FPGA's self-reconfiguration capability can be used to reprogram any cryptographic algorithm demanded by the Linux CryptoAPI by just including a new software driver for the operating system, thus allowing the internal configuration access port (ICAP) of the FPGA to manage any cryptographic coprocessor at any time. The approach is validated on a real application using the Linux CryptoAPI: a ciphered file system that stores the system data in a secured way.

Journal ArticleDOI
TL;DR: This article presents a framework for automatic generation of binary utilities which relies on two innovative ideas: platform-aware modeling and more inclusive relocation handling.
Abstract: Electronic system level (ESL) modeling allows early hardware-dependent software (HDS) development. Due to broad CPU diversity and shrinking time-to-market, HDS development can neither rely on hand-retargeted binary tools, nor on pre-existing tools within standard packages. As a consequence, binary utilities which can be easily adapted to new CPU targets are of increasing interest. We present in this article a framework for automatic generation of binary utilities. It relies on two innovative ideas: platform-aware modeling and more inclusive relocation handling. Generated assemblers, linkers, disassemblers and debuggers were validated for MIPS, SPARC, PowerPC, i8051 and PIC16F84. An open-source prototype generator is available for download.

Proceedings ArticleDOI
08 Dec 2008
TL;DR: A highly efficient parallelization of the Smith-Waterman algorithm on the Cell Broadband Engine platform, a novel hybrid multicore architecture that drives the low-cost PlayStation 3 game consoles as well as the IBM BladeCenter QS22, which currently powers the fastest supercomputer in the world, Roadrunner at Los Alamos National Laboratory, is presented.
Abstract: The Smith-Waterman algorithm is a dynamic programming method for determining optimal local alignments between nucleotide or protein sequences. However, it suffers from quadratic time and space complexity. As a result, many algorithmic and architectural enhancements have been proposed to solve this problem, but at the cost of reduced sensitivity in the algorithms or significant expense in hardware, respectively. This paper presents a highly efficient parallelization of the Smith-Waterman algorithm on the Cell Broadband Engine platform, a novel hybrid multicore architecture that drives the low-cost PlayStation 3 (PS3) game consoles as well as the IBM BladeCenter QS22, which currently powers the fastest supercomputer in the world, Roadrunner at Los Alamos National Laboratory. Through an innovative mapping of the optimal Smith-Waterman algorithm onto a cluster of PlayStation 3 nodes, our implementation delivers 21 to 55-fold speed-up over a high-end multicore architecture and up to 449-fold speed-up over the PowerPC processor in the PS3. Next, we evaluate the trade-offs between our Smith-Waterman implementation on the Cell and existing software and hardware implementations, and show that our solution achieves the best performance-to-price ratio when aligning realistic sequence sizes and generating the actual alignment. Finally, we show that our low-cost solution on a PS3 cluster approaches the speed of BLAST while achieving ideal sensitivity. To quantify the relationship between the two algorithms in terms of speed and sensitivity, we formally define and quantify the sensitivity of homology search methods so that trade-offs between sequence-search solutions can be evaluated in a quantitative manner.

29 Sep 2008
TL;DR: This paper presents a method for estimating the power consumption of components in the AADL component assembly model, once deployed onto components of the AADL target platform model.
Abstract: This paper presents a method for estimating the power consumption of components in the AADL component assembly model, once deployed onto components of the AADL target platform model. This estimation is performed at different levels of the AADL refinement process. Multi-level power models have been specifically developed for the different types of possible hardware targets: General Purpose Processors (GPP), Digital Signal Processors (DSP) and Field Programmable Gate Arrays (FPGA). Three models are presented for a complex DSP (the Texas Instruments C62), a RISC GPP (the PowerPC 405), and an FPGA from Altera (Stratix EP1S80). The accuracy of these models depends on the refinement level. The maximum error introduced ranges from 70% for the FPGA at the first refinement level (where only the operating frequency is considered) to 5% for the GPP at the third refinement level (where the component's actual source code is considered).