Showing papers on "PowerPC published in 2010"


Journal ArticleDOI
TL;DR: Power Systems™ continue strong with the 7th-generation Power chip: a balanced multicore design, eDRAM technology, and SMT4, with greater than 4X performance in the same power envelope as the previous generation.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.

259 citations


Journal ArticleDOI
24 May 2010
TL;DR: The Multithreaded Application Real-Time executor (MARTe) is a data driven framework environment for the development and deployment of real-time control algorithms, providing a set of strictly bounded standard interfaces to the outside world and accommodating a collection of facilities which promote the speed and ease of development, commissioning and deployment of such systems.
Abstract: The Multithreaded Application Real-Time executor (MARTe) is a data driven framework environment for the development and deployment of real-time control algorithms. The main ideas which led to the present version of the framework were to standardise the development of real-time control systems, while providing a set of strictly bounded standard interfaces to the outside world and also accommodating a collection of facilities which promote the speed and ease of development, commissioning and deployment of such systems. At the core of every MARTe-based application is a set of independent inter-communicating software blocks, named Generic Application Modules (GAM), orchestrated by a real-time scheduler. The platform independence of its core library gives MARTe the necessary robustness and flexibility for conveniently testing applications in different environments, including non-real-time operating systems. MARTe is already being used in several machines, each with its own peculiarities regarding hardware interfacing, supervisory control configuration, operating system and target control application. This paper presents and compares the most recent results of systems using MARTe: the JET Vertical Stabilisation system, which uses the Real Time Application Interface (RTAI) operating system on Intel multi-core processors; the COMPASS plasma control system, driven by Linux RT also on Intel multi-core processors; ISTTOK real-time tomography equilibrium reconstruction, which shares the same support configuration as COMPASS; JET error field correction coils based on VME, PowerPC and VxWorks; and the FTU LH reflected power system running on VME and Intel with RTAI.

40 citations


Proceedings ArticleDOI
17 Oct 2010
TL;DR: Hera-JVM, an implementation of the Java Virtual Machine that operates over the heterogeneous multi-core IBM Cell processor, is presented, making this platform more readily accessible to mainstream developers.
Abstract: Heterogeneous multi-core processors, such as the IBM Cell processor, can deliver high performance. However, these processors are notoriously difficult to program: different cores support different instruction set architectures, and the processor as a whole does not provide coherence between the different cores' local memories. We present Hera-JVM, an implementation of the Java Virtual Machine which operates over the Cell processor, thereby making this platform more readily accessible to mainstream developers. Hera-JVM supports the full Java language; threads from an unmodified Java application can be simultaneously executed on both the main PowerPC-based core and on the additional SPE accelerator cores. Migration of threads between these cores is transparent from the point of view of the application, requiring no modification to Java source code or bytecode. Hera-JVM supports the existing Java Memory Model, even though the underlying hardware does not provide cache coherence between the different core types. We examine Hera-JVM's performance under a series of real-world Java benchmarks from the SpecJVM, Java Grande and Dacapo benchmark suites. These benchmarks show a wide variation in relative performance on the different core types of the Cell processor, depending upon the nature of their workload. Execution of these benchmarks on Hera-JVM can achieve speedups of up to 2.25x by using one of the Cell processor's SPE accelerator cores, compared to execution on the main PowerPC-based core. When all six SPE cores are exploited, parallel workloads can achieve speedups of up to 13x compared to execution on the single PowerPC core.

24 citations


Proceedings ArticleDOI
13 Dec 2010
TL;DR: An FPGA-based Linux test-bed was constructed for the purpose of measuring its sensitivity to single-event upsets, and a density metric for comparing the reliability of modules within the system is presented.
Abstract: An FPGA-based Linux test-bed was constructed for the purpose of measuring its sensitivity to single-event upsets. The test-bed consists of two ML410 Xilinx development boards connected using a 124-pin custom connector board. The Design Under Test (DUT) consists of the “hard core” PowerPC, running the Linux OS and several peripherals implemented in “soft” (programmable) logic. Faults were injected via the Internal Configuration Access Port (ICAP). The experiments performed here demonstrate that the Linux-based system was sensitive to 92,542 upsets, less than 0.7 percent of all tested bits. Each sensitive bit in the bit-stream is mapped to the resource and user-module that it configures. A density metric for comparing the reliability of modules within the system is presented.

15 citations


Book ChapterDOI
01 Jan 2010
TL;DR: Key features and requirements of future NASA missions proposed within the National Research Council’s Decadal Survey will be described with ideas on how these reconfigurable FPGA technologies and development tools can combine to achieve breakthrough on-board processing performance to meet their science objectives.
Abstract: Future NASA missions will require measurements from high data rate active and passive instruments. Recent internal studies at NASA’s Jet Propulsion Laboratory (JPL) estimate that approximately 1–5 Terabytes per day of raw (uncompressed) data are expected, for example, from spectroscopy instruments. Implementations of on-board processing algorithms to perform lossless data reduction are required to drastically reduce data volumes to within the downlink capabilities of the spacecraft and existing ground stations. Reconfigurable Field Programmable Gate Arrays (FPGAs) such as the Xilinx™ Virtex-4 and Virtex-5 series devices can include dual core PowerPC processors, thereby providing a flexible hardware and software co-design architecture to meet the on-board processing challenges of these missions while reducing the essential resources of mass and volume of earlier generation flight-qualified computing platforms such as the BAE Rad750 single board computer (SBC). Reconfigurable FPGAs also offer unique advantages over one-time programmable (OTP) FPGAs with flexible prototype development platforms that provide an important “path-to-flight” for spaceborne instruments. Reconfigurable FPGA technologies also provide in-flight flexibility with the ability to update processing algorithms as needed post-launch. This chapter will discuss these comparative technologies and present the benefits of commercially available FPGA development platforms from Xilinx for the development of NASA’s future on-board processing capabilities. Additionally, commercially available tools such as Impulse C™ have been used to adapt legacy C code into Verilog or VHDL for implementation in FPGA fabric to achieve hardware acceleration.
Key features and requirements of future NASA missions proposed within the National Research Council’s Decadal Survey will be described with ideas on how these reconfigurable FPGA technologies and development tools can combine to achieve breakthrough on-board processing performance to meet their science objectives. To provide specific demonstrations of these ideas, three unique and recent design implementations on the Xilinx V4FX60 and Virtex-5 FPGAs targeted to enable future NASA missions will be presented. They include on-board processing algorithms for a) Support Vector Machine (SVM) Classifiers similar to those in operation on the Earth Observing 1 (EO-1) Hyperion instrument, b) a Fourier transform infrared (FTIR) spectrometer, and c) a new Multiangle Spectropolarimetric
Source: Aerospace Technologies Advancements, edited by Dr. Thawar T. Arif, ISBN 978-953-7619-96-1, pp. 492, January 2010, INTECH, Croatia.

14 citations


Book ChapterDOI
19 Jun 2010
TL;DR: ISAMAP, a flexible instruction mapping driven by dynamic binary translation, provides a fast translation between ISAs, under an easy-to-use description, and is capable of translating 32-bit PowerPC code to 32-bit x86 and performing local optimizations on the resulting x86 code.
Abstract: Dynamic Binary Translation (DBT) techniques have been largely used in the migration of legacy code and in the transparent execution of programs across different architectures. They have also been used in dynamic optimizing compilers, to collect runtime information so as to improve code quality. In many cases, the DBT translation mechanism misses important low-level mapping opportunities available at the source/target ISAs. Hot code performance has been shown to be central to the overall program performance, as different instruction mappings can account for high performance gains. Hence, DBT techniques that provide efficient instruction mapping at the ISA level have the potential to considerably improve performance. This paper proposes ISAMAP, a flexible instruction mapping driven by dynamic binary translation. Its mapping mechanism provides a fast translation between ISAs, under an easy-to-use description. In its current state, ISAMAP is capable of translating 32-bit PowerPC code to 32-bit x86 and performing local optimizations on the resulting x86 code. Our experimental results show that ISAMAP is capable of executing PowerPC code on an x86 host faster than the processor emulator QEMU, achieving speedups of up to 3.16x for SPEC CPU2000 programs.

13 citations


01 Jan 2010
TL;DR: In the benchmark problem (image filter evolution) the proposed platform provides a significant speedup in comparison with a highly optimized software implementation, and is 8 times faster than previous FPGA accelerators of image filter evolution.
Abstract: A new accelerator of Cartesian genetic programming is presented in this paper. The accelerator is completely implemented in a single FPGA. The proposed architecture contains multiple instances of a virtual reconfigurable circuit to evaluate several candidate solutions in parallel. An advanced memory organization was developed to achieve the maximum throughput of processing. The search algorithm is implemented using the on-chip PowerPC processor. In the benchmark problem (image filter evolution) the proposed platform provides a significant speedup (170) in comparison with a highly optimized software implementation. Moreover, the accelerator is 8 times faster than previous FPGA accelerators of image filter evolution.

12 citations


Proceedings ArticleDOI
01 Dec 2010
TL;DR: The communication and computation semantics of the MPI_Reduce call from the de facto Message-Passing Interface have been implemented and speedups of ≈2x to ≈800x are reported over that of a commodity cluster for small datasets, which provides significant motivation to continue the investigation into supporting additional collective communication operations directly in hardware.
Abstract: This paper demonstrates the benefits and pitfalls of implementing the collective communication operation reduce in the reconfigurable resources of an FPGA device across a cluster of all-FPGA compute nodes. Specifically, the communication and computation semantics of the MPI_Reduce call from the de facto Message-Passing Interface have been implemented. Using a synthetic benchmark, a cluster of 32 FPGA nodes with a 300 MHz PowerPC processor, custom high speed network, and reduce core is compared against a conventional commodity cluster with 3.2 GHz Xeon processors and Gigabit Ethernet. The design is customized to support performing many reduce operations on small datasets while minimizing the amount of on-chip resources used, which is an increasingly common demand from domain scientists. Speedups of ≈2x to ≈800x are reported over that of a commodity cluster for small datasets, which provides significant motivation to continue the investigation into supporting additional collective communication operations directly in hardware.
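For readers unfamiliar with the reduce collective, the following is a minimal software sketch of its semantics: every rank contributes an equal-length vector, the vectors are combined element-wise with an associative operator, and the result ends up at the root (rank 0 here). The binary-tree combining order and the name `tree_reduce` are assumptions made for illustration; the abstract does not describe how the FPGA reduce core schedules its combining steps.

```python
from operator import add


def tree_reduce(contributions, op=add):
    """Combine one equal-length vector per rank element-wise with `op`.

    Partial results are merged pairwise, level by level, as in a binary
    reduction tree. The final vector is the one held by rank 0 (the root).
    `op` is assumed to be associative (e.g. add, max, min).
    """
    if not contributions:
        raise ValueError("need at least one rank")
    # partial[r] is the running result currently held by rank r.
    partial = [list(vector) for vector in contributions]
    stride = 1
    while stride < len(partial):
        # At each level, rank r absorbs the partial result of rank r + stride.
        for rank in range(0, len(partial) - stride, 2 * stride):
            partner = rank + stride
            partial[rank] = [
                op(a, b) for a, b in zip(partial[rank], partial[partner])
            ]
        stride *= 2
    return partial[0]


if __name__ == "__main__":
    # 32 ranks, matching the cluster size mentioned in the abstract;
    # the data values themselves are made up for the example.
    data = [[rank, 2 * rank] for rank in range(32)]
    print(tree_reduce(data))       # element-wise sum across ranks
    print(tree_reduce(data, max))  # element-wise maximum across ranks
```

With 32 ranks the tree has five levels, which is why small messages are dominated by per-hop latency rather than bandwidth in a software implementation.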

11 citations


Journal ArticleDOI
TL;DR: A hardware architecture for computing direct kinematics of robot manipulators with 5 degrees of freedom using floating-point arithmetic is presented and it is implemented in Field Programmable Gate Arrays (FPGAs), demonstrating the effectiveness and high performance of the implemented cores on commercial FPGAs.
Abstract: Hardware acceleration in high performance computer systems is of particular interest for many engineering and scientific applications in which a large number of arithmetic operations and transcendental functions must be computed. In this paper a hardware architecture for computing direct kinematics of robot manipulators with 5 degrees of freedom (5 DOF) using floating-point arithmetic is presented for 32, 43, and 64 bit-width representations and it is implemented in Field Programmable Gate Arrays (FPGAs). The proposed architecture has been developed using several floating-point libraries for arithmetic and transcendental function operators, allowing the designer to select (pre-synthesis) a suitable bit-width representation according to the accuracy and dynamic range, as well as the area, elapsed time and power consumption requirements of the application. Synthesis results demonstrate the effectiveness and high performance of the implemented cores on commercial FPGAs. Simulation results were used to compute the Mean Square Error (MSE), using Matlab as the statistical estimator, validating the correct behavior of the implemented cores. Additionally, the processing time of the hardware architecture was compared with the same formulation implemented in software, using the PowerPC (FPGA embedded processor), demonstrating that the hardware architecture speeds up the software implementation by a factor of 1298.

11 citations


Proceedings ArticleDOI
31 Aug 2010
TL;DR: The Global Tracking Unit of the ALICE Transition Radiation Detector is a high-speed, low-latency trigger processor installed at the ALICE experiment at the Large Hadron Collider, designed to significantly improve the overall detector performance by providing a complex and robust multi-event buffering scheme.
Abstract: The Global Tracking Unit of the ALICE Transition Radiation Detector is a high-speed, low-latency trigger processor installed at the ALICE experiment at the Large Hadron Collider. Based on the analysis of up to 20,000 parametrized particle track segments per event, a trigger decision is formed within approx. 2 μs. Furthermore, the system is designed to significantly improve the overall detector performance by providing a complex and robust multi-event buffering scheme. Data from the detector arrives at an aggregate net bandwidth of 2.16Tbit/s via 1080 optical links and is processed massively in parallel by 109 FPGA-based units organized in a 3-stage hierarchical structure. The embedded PowerPC cores are employed not only to build a monitoring and control system that can be interfaced by the experiment control. They are also used to realize a real-time hardware/software co-design, able to characterize the trigger performance, supervise the operation and intervene in cases of system errors.

9 citations


Journal ArticleDOI
TL;DR: A hierarchical, hybrid software-cache architecture is proposed that enables automatic prefetch and modulo scheduling transformations and can achieve similar performance on the Cell BE processor as on a modern server-class multicore such as the IBM PowerPC 970MP processor for a set of parallel NAS applications.
Abstract: Ease of programming is one of the main requirements for the broad acceptance of multicore systems without hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that targets enabling prefetch techniques. Memory accesses are classified at compile time into two classes: high locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access pattern. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software-cache overhead in the innermost loop. The cache design enables automatic prefetch and modulo scheduling transformations. Performance evaluation indicates that optimized software-cache structures combined with the proposed prefetch techniques translate into speedup between 10 and 20 percent. As a result of the proposed technique, we can achieve similar performance on the Cell BE processor as on a modern server-class multicore such as the IBM PowerPC 970MP processor for a set of parallel NAS applications.

Proceedings ArticleDOI
16 Aug 2010
TL;DR: In this article, the use of static code simulation is proposed as an alternative to analyze and predict the program's behavior; in combination with a microprocessor's power model, it allows power and energy to be estimated with only a small amount of run-time data.
Abstract: Current methodologies for software-level power and energy estimation use a microprocessor's power model combined with specialized tools that profile the program under study. These tools commonly rely on real-time program execution or simulations to gather the information needed, a process that usually requires a full set of real run-time data. This work proposes the use of static code simulation as an alternative to analyze and predict the program's behavior. This, in combination with a microprocessor's power model, allows power and energy to be estimated with only a small amount of run-time data. Furthermore, the low execution time of the proposed method allows for its use in iterative power optimizers. We present results obtained for a set of representative benchmark programs run on a PowerPC 603e microprocessor. Power and energy estimates with mean absolute errors below 7% and 15%, respectively, are reported for the analyzed test cases.

Patent
13 Jan 2010
TL;DR: In this article, the utility model relates to a PC104-plus controller circuit board card based on a PowerPC processor, which consists of an MPC 8270 or MPC 8280 processor, a graphic controller, an SDRAM, an FLASH, a DOC disc, an RTC real-time clock, an EEPROM, a CPLD circuit, three paths of RS-232 communication interfaces, two paths of 10/100 M full-duplex self-adapting Ethernet interfaces, an 8/16-bit ISA bus interface, a
Abstract: The utility model relates to a PC104-plus controller circuit board card based on a PowerPC processor, which consists of an MPC 8270 or MPC 8280 processor, a graphic controller, an SDRAM, a FLASH, a DOC disc, an RTC real-time clock, an EEPROM, a CPLD circuit, three RS-232 communication interfaces, two 10/100 M full-duplex self-adapting Ethernet interfaces, an 8/16-bit ISA bus interface, a 32-bit PCI interface and an RGB LCD video output interface, wherein the three RS-232 communication interfaces are realized through an SCC interface of the MPC 8270; the two 10/100 M full-duplex self-adapting Ethernet interfaces are realized through an FCC interface; and the RGB LCD video output interface is expanded and output through the graphic controller. In practical use, the ISA bus and/or the PCI bus can be used to communicate with other PC 104 or PC 104-PLUS equipment for data exchange; the RS-232 bus can be used for data communication with other serial devices, and the Ethernet interfaces can also be used for communication with other network equipment. The LCD video interface can be used for displaying graphics and characters.

Proceedings ArticleDOI
24 May 2010
TL;DR: It is shown how the intrinsic parallelism and a mixed firmware and software implementation of the data reduction and acquisition tasks lead to a flexible system capable of extracting in real time meaningful information from the 2.5 GByte/s of raw event data produced by the front-end electronics at a nominal rate of 20 Hz.
Abstract: Among other detectors, the T2K neutrino experiment comprises three large time projection chambers segmented into over 124,000 electronics channels. The back-end electronics system is designed to distribute a reference clock to the front-end electronics, aggregate event data over seventy-two 2 Gbit/s optical links and format events that are sent via a standard PC to the global data acquisition system of the experiment. The core of this system is a set of 18 Data Concentrator Cards based on an inexpensive commercial Field Programmable Gate Array evaluation kit with specific add-ons. We describe the adaptations that were made to the original platform, and detail the design of the firmware and software running on the embedded PowerPC processor of the FPGA of a Data Concentrator Card. We show how the intrinsic parallelism and a mixed firmware and software implementation of the data reduction and acquisition tasks lead to a flexible system capable of extracting in real time meaningful information from the 2.5 GByte/s of raw event data produced by the front-end electronics at a nominal rate of 20 Hz.

Journal Article
TL;DR: System tests show that a platform based on PowerPC, DSP and BLVDS can satisfy the requirements of digital integrated protection.
Abstract: With the development of the IEC 61850 digital substation standard, a digital integrated protection scheme based on PowerPC, DSP and BLVDS is proposed. A platform built on PowerPC with VxWorks is used to perform digital sampling and send GOOSE packets. A DSP protection module is used to accomplish the measurement and control functions as well as the protection algorithm. Data streams between the two CPUs are exchanged through a high speed BLVDS bus. This paper mainly focuses on the system scheme, the hardware and software platform at the process level, and the implementation of digital sampling. System tests show that a platform based on PowerPC, DSP and BLVDS can satisfy the requirements of digital integrated protection.

Paul Mackerras
01 Jan 2010
TL;DR: Three low-level optimizations in the Linux® kernel for 32-bit and 64-bit PowerPC®, relating to cache flushing, memory copying, and PTE (page table entry) management are examined.
Abstract: We examine three low-level optimizations in the Linux® kernel for 32-bit and 64-bit PowerPC®, relating to cache flushing, memory copying, and PTE (page table entry) management. Benchmarking and profiling were used to identify areas where optimizations could be performed and to identify whether the optimizations actually improved performance. The cache flushing and memory copying optimizations improved performance significantly, whilst the PTE management optimization did not.

Proceedings ArticleDOI
09 Nov 2010
TL;DR: Experimental results indicate that with a relatively small degree of parallelism, corresponding to modest hardware cost, the overall frame rate can be increased between 18 and 105 % depending on processing and application parameters.
Abstract: Hardware acceleration is a popular method to boost performance in video processing applications. This paper shows how to accelerate such applications on a general-purpose CPU by means of a coprocessor that is tightly-coupled to the instruction pipeline. A method for efficient data transfer between CPU and coprocessor is developed, and the resulting data path architecture with optimum scheduling of operations is demonstrated. Based on this method, a coprocessor has been implemented in a Virtex-5 FPGA with embedded PowerPC to accelerate candidate operations of a video content analysis algorithm. Experimental results indicate that with a relatively small degree of parallelism, corresponding to modest hardware cost, the overall frame rate can be increased between 18 and 105 % depending on processing and application parameters.

Proceedings ArticleDOI
12 Aug 2010
TL;DR: An FPGA-based framework is presented that is designed and implemented on a Virtex 4, can be used to compute the Stillinger-Weber potential, and extends the PowerPC instruction set to include vector operations and a custom datapath.
Abstract: The computer simulation of three-body potentials using the Stillinger-Weber method has been extensively used in the study of three-body molecular forces between partially rigid molecules such as silicon. The Stillinger-Weber method of computing three-body interactions is generally computationally intense. This paper presents an FPGA-based framework that is designed and implemented on a Virtex 4 and can be used to compute the Stillinger-Weber potential. This framework extends the PowerPC instruction set to include vector operations and a custom datapath. Design details of the framework along with initial performance results with two well-known data sets are also presented. The results show that the FPGA design is competitive with current microprocessors on small problem sizes and with only half of the algorithm implemented. As the problem size increases, the results suggest the FPGA-based design will gain a significant performance advantage. Coding the second half of the algorithm will increase the on-chip parallelism as well.

Proceedings ArticleDOI
17 May 2010
TL;DR: In this paper, a processor-attached in-line accelerator provides high-performance SIMD computing and power efficiency by means of a very large register file and a set of vector multimedia extensions based on IBM's PowerPC VMX.
Abstract: In this paper we evaluate the performance and power of a processor-attached in-line accelerator. The accelerator provides high-performance SIMD computing and power efficiency by means of a very large register file and a set of vector multimedia extensions based on IBM's PowerPC VMX. Our experiments show significant performance improvements and power reduction, compared to a baseline vector execution unit, mainly due to the drastic decrease of memory accesses caused by the software-managed locality of the very large register file. Total execution time is, on average, reduced by 61%, while consuming 55% less energy.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: This paper proposes an efficient, low power algorithm and its co-designed VLSI architecture for fractional-pel motion estimation (FME) in H.264/AVC, and shows that its performance in terms of transistor count, throughput and power consumption is comparable to that of state-of-the-art ASIC implementations.
Abstract: In this paper we propose an efficient, low power algorithm and its co-designed VLSI architecture for fractional-pel motion estimation (FME) in H.264/AVC. Our fractional-pel motion estimator uses a simplified FIR filter for half-pel interpolation. Usage of this filter reduces the required number of computations and the memory size and bandwidth for half-pel interpolation. Our simulations compare our algorithm with the state-of-the-art, in terms of rate-distortion performance and computational complexity. Our VLSI architecture is prototyped on a Field Programmable System on Chip (FPSoC), comprising a Virtex-II Pro FPGA and an embedded PowerPC processor. Our results show that our algorithm on average has better rate-distortion performance than previous state-of-the-art FME algorithms, while its losses compared to FME in H.264/AVC are insignificant. Our prototyped architecture is more hardware-efficient than previous FPGA-based architectures, in terms of power consumption, throughput, area and memory utilization. We also show that its performance in terms of transistor count, throughput and power consumption is comparable to that of state-of-the-art ASIC implementations.

Proceedings ArticleDOI
07 Jul 2010
TL;DR: An embedded system based on PowerPC and the CAN bus has been designed and developed in this paper; it effectively addresses demanding data acquisition, monitoring, networking and communication needs in electrical power system applications.
Abstract: Embedded systems are application-centric and can be tailored to an application's strict requirements for function, reliability, cost, size and power consumption. Freescale's PowerPC series chips offer strong communication capability, system stability and disturbance rejection. The CAN bus, one of the most widespread fieldbuses, is used throughout the industrial control field. With Freescale's PowerQUICC II high-end dual-core MPC8248 processor at its core, an embedded data acquisition and monitoring platform based on PowerPC and the CAN bus has been designed and developed in this paper; it effectively addresses demanding data acquisition, monitoring, networking and communication needs in electrical power system applications. After being analyzed and processed by the platform, the data can also be transferred over an Ethernet communication interface to remote monitoring systems such as DAS or DCS for further analysis, processing and storage. The platform is of considerable practical significance.

Journal Article
TL;DR: The intelligent substation fault recorder is designed for the development of the smart grid and communicates with substation control layer devices using IEC61850 MMS (Manufacturing Message Specification) to achieve interoperability.
Abstract: The intelligent substation fault recorder is designed for the development of the smart grid. The system supports the IEC61850-9-1 or IEC61850-9-2 SMV (Sampled Analogue Value) packets of substation process networks and the IEC61850-8-1 GOOSE (Generic Object Oriented Substation Event) packets, which are parsed into switch status. It also supports SMV+GOOSE coexistence mode, acquiring both directly from the network switch. It communicates with substation control layer devices using IEC61850 MMS (Manufacturing Message Specification) to achieve interoperability. An embedded PowerPC hardware system and the VxWorks real-time operating system are adopted, meeting the new requirements of intelligent substations.

01 Sep 2010
TL;DR: This report implements NASA's LU benchmark on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA), and compares the performance of the GPU solution at scale to that of traditional high performance computing (HPC) clusters based on a range of multi-core CPUs from a number of major vendors.
Abstract: We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications and this report focuses on graphics processing units (GPUs) in particular. To this end, we have implemented NASA’s LU benchmark, a real world production-grade application, on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA). This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC “Tesla” and “Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high performance computing (HPC) clusters based on a range of multi-core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC). In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI). By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale.

Book ChapterDOI
01 Jun 2010
TL;DR: An Autonomous Fault Tolerant System that implements a communication gateway between a CAN bus and an asynchronous communication interface is presented; it can respond to an error detected by a voter in a Triple Modular Redundancy architecture by reconfiguring the failed module using pre-defined bitstreams.
Abstract: In this paper, an Autonomous Fault Tolerant System that implements a communication gateway between a CAN bus and an asynchronous communication interface is presented. This gateway has been implemented using a Triple Modular Redundancy architecture at IP core level. The system can respond to an error detected by a voter in a Triple Modular Redundancy architecture by reconfiguring the failed module using pre-defined bitstreams. The whole system has been implemented in a Virtex-4 FPGA and the reconfiguration system is based on a hard PowerPC microprocessor.
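
As a rough illustration of the voting step described in this abstract, the following Python sketch models a bitwise majority voter over three module outputs and reports which module disagrees. It is a software model under stated assumptions, not the paper's hardware design: the function names are hypothetical, outputs are modelled as integers, and the actual reconfiguration with a bitstream is not modelled at all.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three integer outputs."""
    return (a & b) | (a & c) | (b & c)


def find_faulty(a, b, c):
    """Return (voted_value, indices of modules that disagree with the vote).

    With a single faulty module, the list holds exactly that module's index
    and the voted value is still correct. If two or more modules fail, the
    vote itself can no longer be trusted.
    """
    voted = majority_vote(a, b, c)
    faulty = [i for i, out in enumerate((a, b, c)) if out != voted]
    return voted, faulty


if __name__ == "__main__":
    # Module 2 produces a corrupted word; the voter masks the error.
    voted, faulty = find_faulty(0b1010, 0b1010, 0b0110)
    print(bin(voted), faulty)
```

In a system like the one described, the index of the disagreeing module would be the trigger for reloading that module; here it is only returned to the caller.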

Proceedings ArticleDOI
24 Oct 2010
TL;DR: In this paper, the performance and side-channel resistance of bit-sliced AES implementations on two different FPGA platforms, one based on a PowerPC processor and the second based on a LEON-3 soft-core processor, were compared.
Abstract: The Advanced Encryption Standard is used in almost every new embedded application that needs a symmetric-key cipher. In such embedded applications, high performance as well as resistance against implementation attacks are mandatory. In this paper, we compare and contrast three different software implementations of AES. The first two are based on cryptographic lookup tables, while the third uses bit-slicing. We analyze the performance and side-channel resistance of each implementation on two different FPGA platforms, one based on a PowerPC processor, and the second based on a LEON-3 soft-core processor. Our measurements show that, on embedded platforms, a bit-sliced AES implementation does not always outperform a lookup-table based AES implementation. We also present a detailed analysis of the side-channel resistance and the source of side-channel leakage, and show that our bit-sliced implementation has eight times more side-channel leakage than the lookup-table implementations. Hence, we conclude that a variation on the implementation style for embedded software implementation of AES will not only affect performance, but also embedded system security.
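
To make the two implementation styles named in this abstract concrete, the Python sketch below computes one small AES building block, multiplication by 2 in GF(2^8) (often called `xtime`, used in MixColumns), once through a lookup table and once in bit-sliced form. It is an illustration only, not the authors' code and not a full AES: the helper names are hypothetical, and the choice of `xtime` as the example operation is an assumption made for brevity.

```python
def xtime(a):
    """Multiply a byte by 2 in GF(2^8) with the AES polynomial 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


# Lookup-table style: one data-dependent table access per byte.
XTIME_TABLE = [xtime(a) for a in range(256)]


def to_bitplanes(data):
    """Transpose bytes into 8 planes; bit n of planes[k] is bit k of data[n]."""
    planes = [0] * 8
    for n, byte in enumerate(data):
        for k in range(8):
            planes[k] |= ((byte >> k) & 1) << n
    return planes


def from_bitplanes(planes, count):
    """Inverse of to_bitplanes for `count` bytes."""
    return [sum(((planes[k] >> n) & 1) << k for k in range(8))
            for n in range(count)]


def xtime_bitsliced(p):
    """Bit-sliced xtime: only XORs and plane moves, applied to all bytes at once.

    Shifting left moves plane k to k+1; the reduction XORs the old top
    plane p[7] into bit positions 0, 1, 3 and 4 (the set bits of 0x1B).
    """
    return [p[7], p[0] ^ p[7], p[1], p[2] ^ p[7],
            p[3] ^ p[7], p[4], p[5], p[6]]


if __name__ == "__main__":
    data = [0x57, 0xAE, 0x80, 0x01]
    table_result = [XTIME_TABLE[b] for b in data]
    sliced_result = from_bitplanes(xtime_bitsliced(to_bitplanes(data)), len(data))
    print(table_result == sliced_result)
```

The table version performs a memory access indexed by the data, while the bit-sliced version uses a fixed sequence of logic operations on transposed data; which one is faster or leaks less on a given platform is an empirical question of the kind the abstract reports on.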

Proceedings ArticleDOI
12 Mar 2010
TL;DR: This paper ports an open source Embedded Linux operating system to an FPGA to manage the hardware resources, configures Linux for a Web Server application, and controls an LED device from any remote machine in the same network.
Abstract: Remote control of devices or peripherals interfaced with an MCU is essential in a number of embedded systems applications. A Web Server is one of the best solutions for this task if Internet access is available at the remote location. The important aspect of this paper is the porting of an open source Embedded Linux operating system to the FPGA to manage the hardware resources. It is pre-emptible and can be used in place of an RTOS. The foremost challenge in implementing this work is to ensure that the embedded IP cores are visible to the operating system as well as to the user. A complete toolchain is developed on a Linux desktop machine and the kernel image is cross-compiled for the PowerPC processor. Another important aspect addressed by this paper is configuring Linux for the Web Server application and controlling an LED device from any remote machine in the same network.

Patent
10 Mar 2010
TL;DR: In this article, a hardware device and a method for assisting in processing a dynamic bandwidth allocation (DBA) algorithm is presented, which can flexibly process the DBA core algorithm and save the cost.
Abstract: The invention discloses a hardware device and a method for assisting in processing a dynamic bandwidth allocation (DBA) algorithm. The hardware device comprises a hardware logic module, a register interface control module, a synchronous dynamic RAM controller module, a FLASH controller module, an interrupt processor module, a universal asynchronous receiver/transmitter controller module, a master-slave communication module, a master-slave communication interface module, a PowerPC CPU module, a processor bus module, a processor bus-to-on-chip-peripheral-bus bridge module and an on-chip peripheral bus module, which are connected in sequence, wherein the PowerPC CPU module is used for processing and controlling data acquired by the hardware logic module, is connected with a master CPU interface in the hardware logic module through the master-slave communication module to carry out communications between an embedded CPU and a master CPU, and controls and configures a register in the hardware logic module and the report and the allocation of the dynamic bandwidth allocation algorithm through a register interface module. The hardware device and the method for assisting in processing the dynamic bandwidth allocation algorithm can flexibly process the DBA core algorithm and reduce cost.

Proceedings ArticleDOI
01 Feb 2010
TL;DR: In this paper, the authors describe Indiana University's implementation, performance testing, and use of a large high performance computing system, Big Red, which appeared in the 27th Top500 list as the 23rd fastest supercomputer in the world in June 2006.
Abstract: This paper describes Indiana University's implementation, performance testing, and use of a large high performance computing system. IU's Big Red, a 20.48 TFLOPS IBM e1350 BladeCenter cluster, appeared in the 27th Top500 list as the 23rd fastest supercomputer in the world in June 2006. In spring 2007, this computer was upgraded to 30.72 TFLOPS. The e1350 BladeCenter architecture, including two internal networks accessible to users and user applications and two networks used exclusively for system management, has enabled the system to provide good scalability on many important applications while remaining easy to manage. Implementing a system based on the JS21 Blade and PowerPC 970MP processor within the US TeraGrid presented certain challenges, given that Intel-compatible processors dominate the TeraGrid. However, the particular characteristics of the PowerPC have enabled it to be highly popular among certain application communities, particularly users of molecular dynamics and weather forecasting codes. A critical aspect of Big Red's implementation has been a focus on Science Gateways, which provide graphical interfaces to systems supporting end-to-end scientific workflows. Several Science Gateways have been implemented that access Big Red as a computational resource, some via the TeraGrid and some not affiliated with the TeraGrid. In summary, Big Red has been successfully integrated with the TeraGrid, and is used by many researchers locally at IU via grids and Science Gateways. It has been a success in terms of enabling scientific discoveries at IU and, via the TeraGrid, across the US. Copyright © 2009 John Wiley & Sons, Ltd.

Proceedings ArticleDOI
27 Oct 2010
TL;DR: A retargetable compiled simulator is presented that uses three optimization techniques and takes advantage of new GCC optimizations to improve performance.
Abstract: The design of new architectures can be simplified with the use of retargetable instruction set simulation tools, which can validate the decisions in the design exploration cycle with high flexibility and reduced cost. The increasing system complexity makes the traditional approach to simulation inefficient for today's architectures. The compiled simulation technique makes use of a priori knowledge about the application to accelerate the simulation with high efficiency. This paper presents a retargetable compiled simulator that uses three optimization techniques and takes advantage of new GCC optimizations to improve performance. Three architectures were modeled and tested: MIPS, SPARC and PowerPC. Our MIPS model achieved the best results, with an average of 651 million instructions per second, only 2.8 times slower than native execution.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: This research implemented a multiprocessor architecture to support real-time image processing on FPGA; the quad-Microblaze design achieved a 75–80% performance improvement compared to its single Microblaze counterpart and is faster than the single-PowerPC implementation on FPGA.
Abstract: Real-time image processing demands much more processing power than a conventional processor can deliver. As a result, hardware acceleration became necessary to augment processors with application-specific coprocessors. Due to the limited resources on FPGA and the nature of some sequential algorithms, it is difficult to depend entirely on slice resources. In this research, we implemented a multiprocessor architecture to support real-time image processing on FPGA. Furthermore, we benchmarked and compared our implemented architectures with their counterparts. The operational structure of the multiprocessor architecture consists of on-chip processors implemented in a parallel manner with efficient memory and bus architectures. Performance properties such as accuracy, throughput and efficiency are measured and presented. Multiprocessor systems are effective in software level parallelism on FPGA. Our quad-Microblaze architecture achieved a 75–80% performance improvement compared to its single Microblaze counterpart. Moreover, the quad-Microblaze design is faster than the single-PowerPC implementation on FPGA. Therefore, multiprocessor architectures with customised coprocessors are effective for implementing custom parallel architectures to achieve real-time image processing.