
Showing papers on "PowerPC published in 2009"


Proceedings ArticleDOI
19 Apr 2009
TL;DR: This work shows how the GPU, with a design involving thousands of lightweight threads, can boost network coding performance significantly, and can be deployed as an attractive alternative and complement to multi-core servers, offering better price/performance.
Abstract: While it is a well-known result that network coding achieves optimal flow rates in multicast sessions, its potential for practical use has remained in question due to its high computational complexity. Our previous work attempted to design a hardware-accelerated and multi-threaded implementation of network coding to fully utilize multi-core CPUs, as well as SSE2 and AltiVec SIMD vector instructions on x86 and PowerPC processors. This paper represents another step forward, and presents the first attempt in the literature to maximize the performance of network coding by taking advantage of not only multi-core CPUs, but also the potentially hundreds of computing cores in commodity off-the-shelf graphics processing units (GPUs). With GPU computing gaining momentum as a result of increased hardware capabilities and improved programmability, our work shows how the GPU, with a design involving thousands of lightweight threads, can boost network coding performance significantly. Many-core GPUs can be deployed as an attractive alternative and complement to multi-core servers, offering better price/performance. In fact, multi-core CPUs and many-core GPUs can be deployed and used to perform network coding simultaneously, which is potentially useful in media streaming servers where hundreds of peers are served concurrently by these dedicated servers. In this paper, we present Nuclei, the design and implementation of GPU-based network coding. With Nuclei, a single mainstream NVIDIA 8800 GT GPU outperforms an 8-core Intel Xeon server in most test cases. A combined CPU-GPU encoding scenario achieves coding rates of up to 116 MB/second for a variety of coding settings, which is sufficient to saturate a Gigabit Ethernet interface.
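The coding operation at the heart of this work is a random linear combination of packet blocks over GF(2^8). The sketch below (Python, illustrative only — Nuclei's kernels are SIMD/GPU code) assumes the AES reduction polynomial 0x11B; a real network coding implementation may use a different irreducible polynomial, and the blocks and coefficients here are made up.

```python
def gf_mul(a, b, poly=0x11B):
    """Carry-less multiply in GF(2^8), reduced by `poly` (AES polynomial assumed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:       # reduce whenever bit 8 appears
            a ^= poly
        b >>= 1
    return result

def encode_block(coeffs, blocks):
    """Coded block = sum_i coeffs[i] * blocks[i] over GF(2^8), byte-wise."""
    out = bytearray(len(blocks[0]))
    for c, blk in zip(coeffs, blocks):
        for j, byte in enumerate(blk):
            out[j] ^= gf_mul(c, byte)
    return bytes(out)

blocks = [b"\x01\x02", b"\x03\x04"]   # two hypothetical source blocks
coeffs = [5, 7]                       # hypothetical random coefficients
coded = encode_block(coeffs, blocks)
```

A receiver collects enough coded blocks with linearly independent coefficient vectors and inverts the system over the same field to recover the sources.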

72 citations


Proceedings ArticleDOI
21 Sep 2009
TL;DR: By installing Hadoop on a cluster of IBM PowerPC blades, it is shown that multiyear remote sensing data can be processed efficiently, with speed improvements over conventional multi-processor methodologies and a more memory-efficient implementation allowing for finer grid resolutions.
Abstract: Hadoop is a distributed file system and MapReduce framework, developed as an open source system by the Apache foundation after the approach was pioneered by Google for its search applications. We propose that this parallel computing framework is well suited for a variety of service-oriented science applications and, in particular, for satellite data processing of remote sensing systems. We show that, by installing Hadoop on a cluster of IBM PowerPC blades, we can efficiently process multiyear remote sensing data, expect to see speed improvements over conventional multi-processor methodologies, and obtain a more memory-efficient implementation allowing for finer grid resolutions. Moreover, these improvements can be achieved without significant changes in coding structure.
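The processing pattern the paper relies on can be illustrated with a toy MapReduce job: a map phase bins hypothetical (lat, lon, value) records into grid cells, and a reduce phase averages each cell. This is a pure-Python sketch of the paradigm, not the authors' Hadoop job; the record format and cell size are assumptions.

```python
from collections import defaultdict

def map_phase(records, cell_deg=1.0):
    """Emit (cell, value) pairs, binning lat/lon into cells of `cell_deg` degrees."""
    for lat, lon, value in records:
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        yield cell, value

def reduce_phase(pairs):
    """Average all values that fell into the same grid cell."""
    acc = defaultdict(list)
    for cell, value in pairs:
        acc[cell].append(value)
    return {cell: sum(vs) / len(vs) for cell, vs in acc.items()}

records = [(10.2, 20.7, 1.0), (10.8, 20.1, 3.0), (45.0, 60.0, 5.0)]
grid = reduce_phase(map_phase(records))
```

In Hadoop the map and reduce functions run in parallel across the cluster, with the framework handling the shuffle of (cell, value) pairs between them.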

67 citations


Proceedings ArticleDOI
TL;DR: The time-to-market-driven need to maintain concurrent process-design co-development, even in the face of discontinuous patterning, process, and device innovation, is reiterated, and shortcomings in traditional Design for Manufacturability solutions are identified.
Abstract: The time-to-market-driven need to maintain concurrent process-design co-development, even in the face of discontinuous patterning, process, and device innovation, is reiterated. The escalating design rule complexity that results from increasing layout sensitivities in physical and electrical yield, and the consequent risk to profitable technology scaling, is reviewed. Shortcomings in traditional Design for Manufacturability (DfM) solutions are identified and contrasted with the highly successful integrated design-technology co-optimization used for SRAM and other memory arrays. The feasibility of extending memory-style design-technology co-optimization, based on a highly simplified layout environment, to logic chips is demonstrated. Layout density benefits, modeled patterning and electrical yield improvements, as well as substantially improved layout simplicity are quantified in a conventional versus template-based design comparison on a 65nm IBM PowerPC 405 microprocessor core. The adaptability of this highly regularized template-based design solution to different yield concerns and design styles is shown in the extension of this work to 32nm with an increased focus on interconnect redundancy. In closing, the work not covered in this paper, focused on the process side of the integrated process-design co-optimization, is introduced.

57 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: A tool flow is described that can automatically map a large set of applications to a self-reconfiguring platform, without an excessive need for resources at run-time, and is successfully used to implement an adaptive 32-tap FIR filter on a Xilinx XUP board.
Abstract: The inherent reconfigurability of FPGAs enables us to optimize an FPGA implementation in different time intervals by generating new optimized FPGA configurations and reconfiguring the FPGA at the interval boundaries. With conventional methods, generating a configuration at run-time requires an unacceptable amount of resources. In this paper, we describe a tool flow that can automatically map a large set of applications to a self-reconfiguring platform, without an excessive need for resources at run-time. The self-reconfiguring platform is implemented on a Xilinx Virtex-II Pro FPGA and uses the FPGA's PowerPC as configuration manager. This configuration manager generates optimized configurations on-the-fly and writes them to the configuration memory using the ICAP. We successfully used our approach to implement an adaptive 32-tap FIR filter on a Xilinx XUP board. This resulted in a 40% reduction in FPGA resources compared to a conventional implementation and a manageable reconfiguration overhead.
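The demonstration application is an adaptive 32-tap FIR filter; in the tool flow, the PowerPC configuration manager regenerates a specialized bitstream whenever the coefficients change. As a reference for what the reconfigured hardware computes, here is a direct-form FIR filter sketched in Python (the tap values below are hypothetical, and the number of taps is reduced for illustration).

```python
def fir(samples, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n-k], with zero history
    before the start of the signal. In the paper's adaptive setup, a new
    specialized FPGA configuration is generated when the taps change; this
    just shows the arithmetic the hardware implements."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * samples[n - k]
        out.append(acc)
    return out

impulse_response = fir([1, 0, 0, 0], [0.5, 0.25])  # recovers the taps
```

Specializing the hardware to fixed taps (constant folding into the logic) is what yields the reported 40% resource reduction over a generic implementation.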

45 citations


Journal ArticleDOI
01 Jan 2009
TL;DR: This paper presents a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance, and the kernel proves to be memory-bound and it achieves a 98% utilization of the peak memory bandwidth.
Abstract: Reverse-Time Migration (RTM) is a state-of-the-art technique in seismic acoustic imaging, because of the quality and integrity of the images it provides. Oil and gas companies trust RTM with crucial decisions on multi-million-dollar drilling investments. But RTM requires vastly more computational power than its predecessor techniques, and this has somewhat hindered its practical success. On the other hand, although multi-core architectures promise to deliver unprecedented computational power, little attention has been devoted to efficiently mapping RTM to multi-cores. In this paper, we present a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance. The kernel proves to be memory-bound and achieves a 98% utilization of the peak memory bandwidth. Our Cell/B.E. implementation outperforms a traditional processor (PowerPC 970MP) in both performance (a 15.0× speedup) and energy efficiency (a 10.0× increase in the GFlops/W delivered). To the best of our knowledge, it is also the fastest RTM implementation available. These results increase the practical usability of RTM, and the RTM-Cell/B.E. combination proves to be a strong competitor in the seismic arena.
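RTM propagates a wavefield with a finite-difference stencil, which is why the kernel is memory-bound: each update touches several neighboring points but performs little arithmetic per byte. A minimal 1D, second-order sketch of one explicit time step (not the authors' Cell/B.E. kernel, which is 3D and heavily tuned):

```python
def wave_step(p_prev, p, c2dt2dx2):
    """One explicit time step of the 1D acoustic wave equation:
    p_next = 2*p - p_prev + (c*dt/dx)^2 * laplacian(p),
    with fixed (zero) boundaries. `c2dt2dx2` is the combined coefficient."""
    n = len(p)
    p_next = [0.0] * n
    for i in range(1, n - 1):
        lap = p[i - 1] - 2.0 * p[i] + p[i + 1]
        p_next[i] = 2.0 * p[i] - p_prev[i] + c2dt2dx2 * lap
    return p_next
```

In RTM this stencil is run forward from the source and backward from the receivers, and the two wavefields are cross-correlated to form the image.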

41 citations


Proceedings ArticleDOI
22 Sep 2009
TL;DR: The design and implementation of “Big Memory”—an alternative, transparent memory space for computational processes—is presented; although originally intended exclusively for compute node tasks, it turns out to dramatically improve the performance of certain I/O node applications as well.
Abstract: Efficient use of Linux for high-performance applications on Blue Gene/P (BG/P) compute nodes is challenging because of severe performance hits resulting from translation lookaside buffer (TLB) misses and a hard-to-program torus network DMA controller. To address these difficulties, we present the design and implementation of “Big Memory”—an alternative, transparent memory space for computational processes. Big Memory uses extremely large memory pages available on PowerPC CPUs to create a TLB-miss-free, flat memory area that can be used for application code and data and is easier to use for DMA operations. One of our single-node memory benchmarks shows that the performance gap between regular PowerPC Linux with 4KB pages and the IBM BG/P compute node kernel (CNK) is about 68% in the worst case. Big Memory narrows the worst-case performance gap to just 0.04%. We verify this result on 1024 nodes of Blue Gene/P using the NAS Parallel Benchmarks and find the performance under Linux with Big Memory to fluctuate within 0.7% of CNK. Originally intended exclusively for compute node tasks, our new memory subsystem turns out to dramatically improve the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray (LOFAR) radio telescope as an example.
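The "Big Memory" benefit comes down to TLB reach arithmetic: with huge pages, the same number of TLB entries maps orders of magnitude more memory, so a working set that constantly missed under 4KB pages can fit entirely. A sketch with hypothetical entry counts and page sizes (actual BG/P TLB geometry differs):

```python
def tlb_reach_bytes(entries, page_size):
    """Total memory the TLB can map simultaneously without a miss."""
    return entries * page_size

KB, MB = 1024, 1024 * 1024
small = tlb_reach_bytes(64, 4 * KB)    # 64 entries of 4 KB pages
large = tlb_reach_bytes(64, 16 * MB)   # the same TLB with 16 MB pages
```

With the assumed numbers, reach grows from 256 KB to 1 GB, which is why a flat huge-page region can be effectively TLB-miss-free for an HPC working set.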

39 citations


Proceedings ArticleDOI
14 Nov 2009
TL;DR: A scalable simulator that couples a cycle-accurate node simulator with a supercomputer network model and shows its suitability for architecture research by evaluating the impact of cache injection on parallel application performance.
Abstract: Instruction-level simulation is necessary to evaluate new architectures. However, single-node simulation cannot predict the behavior of a parallel application on a supercomputer. We present a scalable simulator that couples a cycle-accurate node simulator with a supercomputer network model. Our simulator executes individual instances of IBM's Mambo PowerPC simulator on hundreds of cores. We integrated a NIC emulator into Mambo and model the network instead of fully simulating it. This decouples the individual node simulators and makes our design scalable. Our simulator runs unmodified parallel message-passing applications on hundreds of nodes. We can change network and detailed node parameters, inject network traffic directly into caches, and use different policies to decide when that is an advantage. This paper describes our simulator in detail, evaluates it, and demonstrates its scalability. We show its suitability for architecture research by evaluating the impact of cache injection on parallel application performance.

34 citations


Journal ArticleDOI
TL;DR: This article presents implementations of robust motion detection algorithms on three architectures: a general-purpose RISC processor—the PowerPC G4—a parallel artificial retina dedicated to low-level image processing—Pvlsar34—and the Associative Mesh, a specialized architecture based on associative nets.
Abstract: The goal of this article is to compare optimised implementations on current high-performance platforms in order to highlight architectural trends in the field of embedded architectures and to estimate what the components of a next-generation vision system should be. We present implementations of robust motion detection algorithms on three architectures: a general-purpose RISC processor—the PowerPC G4—a parallel artificial retina dedicated to low-level image processing—Pvlsar34—and the Associative Mesh, a specialized architecture based on associative nets. To address the different aspects and constraints of embedded systems, the execution time and power consumption of these architectures are compared.
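As a concrete example of the algorithm class being benchmarked, here is a per-pixel Sigma-Delta background estimator, a common robust motion detection kernel in this literature; whether it matches the article's exact algorithms is an assumption, and the flat 1D "frame" below is purely illustrative.

```python
def sigma_delta_step(frame, bg, var, amp=4):
    """One Sigma-Delta update: the background estimate `bg` tracks the frame
    by +/-1 per step, the variance estimate `var` tracks amp*|I - M| the same
    way, and a pixel is flagged as moving when |I - M| > V.
    `bg` and `var` are updated in place."""
    motion = []
    for i, pix in enumerate(frame):
        if bg[i] < pix:
            bg[i] += 1
        elif bg[i] > pix:
            bg[i] -= 1
        diff = abs(pix - bg[i])
        if diff != 0:
            target = amp * diff
            if var[i] < target:
                var[i] += 1
            elif var[i] > target:
                var[i] -= 1
        motion.append(diff > var[i])
    return motion
```

The kernel is attractive for retinas and associative meshes because every pixel performs the same few increment/compare operations with no multiplications.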

29 citations


Book ChapterDOI
20 Aug 2009
TL;DR: This paper reports on a case study, which is believed to be the first to produce a formally verified end-to-end implementation of a functional programming language running on commercial processors.
Abstract: This paper reports on a case study, which we believe is the first to produce a formally verified end-to-end implementation of a functional programming language running on commercial processors. Interpreters for the core of McCarthy's LISP 1.5 were implemented in ARM, x86 and PowerPC machine code, and proved to correctly parse, evaluate and print LISP s-expressions. The proof of evaluation required working on top of verified implementations of memory allocation and garbage collection. All proofs are mechanised in the HOL4 theorem prover.
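The flavor of the verified artifact can be conveyed by an (unverified!) Python sketch of parsing and evaluating a LISP 1.5-style core (quote, atom, eq, car, cdr, cons, cond); the paper's interpreters are machine code for ARM, x86, and PowerPC proved correct in HOL4, not anything like this.

```python
def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Read one s-expression from the token list (consumed in place)."""
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(parse(tokens))
        tokens.pop(0)
        return lst
    return tok

def ev(x):
    """Evaluate the seven-operator core; atoms are strings, lists are forms."""
    op, args = x[0], x[1:]
    if op == "quote":
        return args[0]
    if op == "atom":
        return "t" if isinstance(ev(args[0]), str) else "nil"
    if op == "eq":
        return "t" if ev(args[0]) == ev(args[1]) else "nil"
    if op == "car":
        return ev(args[0])[0]
    if op == "cdr":
        return ev(args[0])[1:]
    if op == "cons":
        return [ev(args[0])] + ev(args[1])
    if op == "cond":
        for test, branch in args:
            if ev(test) == "t":
                return ev(branch)
    raise ValueError("unknown operator: " + str(op))

expr = parse(tokenize("(car (cons (quote a) (quote (b c))))"))
```

What the paper proves, and this sketch does not, is that the machine-code versions of exactly these operations (plus allocation and garbage collection underneath cons) behave according to the LISP semantics.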

21 citations


Journal ArticleDOI
TL;DR: The results show that the modified hybrid FME algorithm, on average, outperforms previous state-of-the-art FME algorithms, while its losses compared with full-search motion estimation (FSME), in terms of PSNR performance and computation time, are insignificant.
Abstract: There is an increasing need for high-quality video on low-power, portable devices. Possible target applications range from entertainment and personal communications to security and health care. While H.264/AVC answers the need for high-quality video at lower bit rates, it is significantly more complex than previous coding standards and thus results in greater power consumption in practical implementations. In particular, motion estimation (ME) consumes the most power in an H.264/AVC encoder. It is therefore critical to speed up integer ME in H.264/AVC via fast motion estimation (FME) algorithms and hardware acceleration. In this paper, we present our hardware-oriented modifications to a hybrid FME algorithm, our architecture based on the modified algorithm, and our implementation and prototype on a PowerPC-based Field Programmable System on Chip (FPSoC). Our results show that the modified hybrid FME algorithm, on average, outperforms previous state-of-the-art FME algorithms, while its losses compared with full-search motion estimation (FSME), in terms of PSNR performance and computation time, are insignificant. We show that although our implementation platform is FPGA-based, our implementation results compare favourably with previous architectures implemented on ASICs. Finally, we also show an improvement over some existing architectures implemented on FPGAs.
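FME algorithms approximate the exhaustive full search, which for each block minimizes a sum-of-absolute-differences (SAD) cost over a search window. A baseline full-search sketch (Python, purely illustrative; the paper's architecture performs this in hardware, and real encoders search subsampled patterns instead):

```python
def sad(block, frame, bx, by, n):
    """Sum of absolute differences between the n*n `block` and the n*n
    region of `frame` whose top-left corner is (bx, by)."""
    return sum(abs(block[j][i] - frame[by + j][bx + i])
               for j in range(n) for i in range(n))

def full_search(block, frame, n, sr):
    """Exhaustive integer-pel search over a +/-sr window (frame must be at
    least (n + 2*sr) on each side); returns the best (dx, dy) displacement."""
    best = None
    for dy in range(0, 2 * sr + 1):
        for dx in range(0, 2 * sr + 1):
            cost = sad(block, frame, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx - sr, dy - sr)
    return best[1], best[2]
```

Fast algorithms keep PSNR close to this exhaustive baseline while evaluating only a small fraction of the candidate positions, which is where the power savings come from.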

17 citations


Journal ArticleDOI
TL;DR: This paper developed self-test programs for single and double precision FP units on 32-bit and 64-bit microprocessor architectures and evaluated them with respect to the requirements of low-cost online periodic self-testing: fault coverage, memory footprint, execution time, and power consumption, assuming different memory hierarchy configurations.
Abstract: Online periodic testing of microprocessors is a valuable means to increase the reliability of a low-cost system when neither hardware nor time redundant protection schemes can be applied. This is particularly valid for floating-point (FP) units, which are becoming more common in embedded systems and are usually protected from operational faults through costly hardware-redundant approaches. In this paper, we present scalable instruction-based self-test program development for both single and double precision FP units considering different instruction sets (MIPS, PowerPC, and Alpha), different microprocessor architectures (32/64-bit architectures) and different memory configurations. Moreover, we introduce bit-level manipulation instruction sequences that are essential for the development of FP units' self-test programs. We developed self-test programs for single and double precision FP units on 32-bit and 64-bit microprocessor architectures and evaluated them with respect to the requirements of low-cost online periodic self-testing: fault coverage, memory footprint, execution time, and power consumption, assuming different memory hierarchy configurations. Our comprehensive experimental evaluations reveal that the instruction set architecture plays a significant role in the development of self-test programs. Additionally, we suggest the most suitable self-test program development approach when memory footprint or low power consumption is of paramount importance.

Journal ArticleDOI
TL;DR: The goals and status of the ParaM project are described, along with the development of signal and image processing applications that use ParaM; its MPI binding, bcMPI, has achieved 60% of the bandwidth of an equivalent C/MPI benchmark.
Abstract: Software engineering studies have shown that programmer productivity is improved through the use of computational science integrated development environments (or CSIDEs, pronounced "sea side") such as MATLAB. Scientists often desire to use high-performance computing (HPC) systems to run their existing CSIDE scripts with large data sets. ParaM is a CSIDE distribution that provides parallel execution of MATLAB scripts on HPC systems at large shared computer centers. ParaM runs on a range of processor architectures (e.g., x86, x64, Itanium, PowerPC) and its MPI binding, known as bcMPI, supports a number of interconnect architectures (e.g., Myrinet and InfiniBand). On a cluster at the Ohio Supercomputer Center, bcMPI with blocking communication has achieved 60% of the bandwidth of an equivalent C/MPI benchmark. In this paper, we describe the goals and status of the ParaM project and the development of applications in signal and image processing that use ParaM.

Proceedings ArticleDOI
10 May 2009
TL;DR: An ATCA-based computation platform for data acquisition and trigger (TDAQ) applications has been developed for multiple future projects such as PANDA, HADES, and BESIII, and a hardware/software co-design approach is proposed to ease and accelerate development for different experiments.
Abstract: An ATCA-based computation platform for data acquisition and trigger (TDAQ) applications has been developed for multiple future projects such as PANDA, HADES, and BESIII. Each Compute Node (CN) appears as one of the fourteen Field Replaceable Units (FRUs) in an ATCA shelf, which in total features 1890 Gbps of inter-FPGA onboard channels, 1456 Gbps of inter-board backplane connections, 728 Gbps of full-duplex optical links, 70 Gbps of Ethernet, 140 GBytes of DDR2 SDRAM, and the computing resources of 70 Xilinx Virtex-4 FX60 FPGAs. Corresponding to the system architecture, a hardware/software co-design approach is proposed to ease and accelerate development for different experiments. In the uniform system design, application-specific computation is implemented as customized hardware co-processors, while the embedded PowerPC processor takes charge of flexible slow controls and transmission protocol processing.

Proceedings ArticleDOI
30 Aug 2009
TL;DR: This paper presents a direct implementation of delimited control operators shift and reset in the MinCaml compiler and shows all the details of how composable continuations can be implemented in the PowerPC microprocessor using a stack discipline.
Abstract: Although delimited control operators are becoming one of the useful tools to manipulate flow of programs, their direct and compiled implementation in a low-level language has not been proposed so far. The only direct and low-level implementations available are Gasbichler and Sperber's implementation in the Scheme 48 virtual machine and Kiselyov's implementation in the OCaml bytecode. Even though these implementations do provide insight into how stack frames are composed, they are not directly portable to compiled implementation in assembly language. This paper presents a direct implementation of delimited control operators shift and reset in the MinCaml compiler. It shows all the details of how composable continuations can be implemented in the PowerPC microprocessor using a stack discipline. We also show an implementation that copies stack frames lazily. To our knowledge, this is the first implementation of shift/reset in assembly language. It makes clear at the assembly language level what we have informally described so far, such as "copying and composing stack frames" and "inserting a reset mark when captured continuations are called". We demonstrate various benchmarks to show the performance of our implementation and discuss its pros and cons.

Proceedings ArticleDOI
18 May 2009
TL;DR: This work investigates how bioinformatics applications benefit from parallel architectures that combine different alternatives to exploit coarse- and fine-grain parallelism, and shows that a shared-memory architecture like the PowerPC 970MP of MareNostrum can surpass a heterogeneous machine like the current Cell BE.
Abstract: The exponential growth of databases containing biological information (such as protein and DNA data) demands great efforts to improve the performance of computational platforms. In this work we investigate how bioinformatics applications benefit from parallel architectures that combine different alternatives to exploit coarse- and fine-grain parallelism. As a case study, we analyze the performance behavior of the Ssearch application, which implements the Smith-Waterman algorithm, a dynamic programming approach that explores the similarity between a pair of sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP, and ILP). We study how this algorithm can take advantage of different parallel machines, such as the SGI Altix, IBM Power6, Cell BE machines, and MareNostrum. Our results show that a shared-memory architecture like the PowerPC 970MP of MareNostrum can surpass a heterogeneous machine like the current Cell BE. Our quantitative analysis includes not only a scalability study in terms of speedup, but also an analysis of bottlenecks in the execution of the application. This analysis is carried out through the study of the execution phases that the application presents.
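For reference, the Smith-Waterman kernel inside Ssearch is a local-alignment dynamic program over a scoring matrix; a minimal scoring-only sketch (illustrative parameters, linear gap penalty, no traceback):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences `a` and `b`.
    H[i][j] is the best score of an alignment ending at a[i-1], b[j-1];
    the max(0, ...) resets negative prefixes, which makes it local."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The anti-diagonals of H are independent, which is the parallelism (TLP across sequence pairs, DLP/ILP within the matrix) that the studied architectures exploit in different ways.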

Journal ArticleDOI
01 Jan 2009
TL;DR: The design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors, is presented, and it is explained how the standard open source implementation of Linpack was modified to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors.
Abstract: In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double precision floating point capability of 108.8 GFLOPS. We explain how we modified the standard open source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.
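The key computational kernel in Linpack is a blocked matrix-matrix update; tiling is what lets each sub-block fit in an SPE's small local store and be streamed in by DMA. A pure-Python sketch of the blocking idea (not the tuned SPE kernel; tile size `nb` is arbitrary here):

```python
def blocked_matmul(A, B, n, nb):
    """C = A * B for n*n row-major matrices, processed in nb*nb tiles so
    each tile's working set stays in fast local memory (the idea behind
    SPE blocking; a real kernel would also double-buffer the DMA)."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

Because each nb×nb tile of C is reused across the whole k loop, the arithmetic-to-memory-traffic ratio grows with nb, which is what turns a bandwidth-bound triple loop into a compute-bound kernel.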

Book ChapterDOI
26 Jun 2009
TL;DR: A novel technique is presented to create a system in which, for the cost of writing just one specification, an interpreter for the programming language of interest obtains automatically-generated, mutually-consistent implementations of all three symbolic-analysis primitives.
Abstract: The paper presents a novel technique to create implementations of the basic primitives used in symbolic program analysis: forward symbolic evaluation, weakest liberal precondition, and symbolic composition. We used the technique to create a system in which, for the cost of writing just one specification—an interpreter for the programming language of interest—one obtains automatically-generated, mutually-consistent implementations of all three symbolic-analysis primitives. This can be carried out even for languages with pointers and address arithmetic. Our implementation has been used to generate symbolic-analysis primitives for x86 and PowerPC.
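The first primitive, forward symbolic evaluation, can be sketched as an interpreter that maps variables to expression trees instead of concrete values; the mini-language below is hypothetical and far simpler than x86 or PowerPC, where the same idea must also handle memory and flags.

```python
def sym_eval(stmts, store):
    """Forward symbolic evaluation of straight-line assignments.
    Expressions are tuples: ('var', name) | ('const', n) | ('add', e1, e2).
    Executing `name := rhs` rewrites variables in rhs to their current
    symbolic values, then binds the result to `name`."""
    def subst(expr):
        kind = expr[0]
        if kind == 'var':
            return store.get(expr[1], expr)  # unbound vars stay symbolic
        if kind == 'const':
            return expr
        return (expr[0], subst(expr[1]), subst(expr[2]))
    for name, rhs in stmts:
        store[name] = subst(rhs)
    return store

# x := x + 1; y := x   (x's initial value remains symbolic)
prog = [('x', ('add', ('var', 'x'), ('const', 1))),
        ('y', ('var', 'x'))]
out = sym_eval(prog, {})
```

The paper's point is that this evaluator, weakest-precondition computation, and symbolic composition can all be generated from one interpreter specification rather than written (and kept consistent) by hand.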

Proceedings ArticleDOI
26 Apr 2009
TL;DR: QUICK is introduced, an implementation of a full-system Complete-and-Rollback functional model that supports the x86 and PowerPC ISAs, boots unmodified Windows XP and Linux, and runs unmodified applications such as YouTube on Internet Explorer while fully supporting rollbacks, including across I/O operations.
Abstract: In this paper, we introduce the concept of full-system Complete-and-Rollback functional simulators that make efficient functional models in functional/timing partitioned simulators. Complete-and-Rollback functional simulators can efficiently drive simulators of resolutions ranging from functional-only to cycle-accurate for a wide range of simulated machines. Complete-and-Rollback functional models achieve their capabilities by executing instructions to completion, enabling their execution to be highly optimized, but providing rollback capabilities to enable on-the-fly modifications to the functional execution. We also introduce QUICK, an implementation of a full-system Complete-and-Rollback functional model that supports the x86 and PowerPC ISAs, boots unmodified Windows XP and Linux, and runs unmodified applications such as YouTube on Internet Explorer while fully supporting rollbacks, including across I/O operations. We present various case studies using QUICK and conduct performance analyses to demonstrate its simulation performance.
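A Complete-and-Rollback functional model can be caricatured as an interpreter that snapshots architectural state, runs ahead to completion at full speed, and restores the snapshot when the timing model demands a different path. A toy sketch (register file only; QUICK's actual mechanism, covering full-system state and I/O, is far more elaborate):

```python
import copy

class RollbackModel:
    """Toy functional model: execute to completion, but keep checkpoints so
    the timing model can roll execution back and re-run it differently."""
    def __init__(self):
        self.regs = {'r0': 0}
        self.checkpoints = []

    def checkpoint(self):
        self.checkpoints.append(copy.deepcopy(self.regs))

    def step(self, dst, value):
        # stand-in for executing one instruction to completion
        self.regs[dst] = value

    def rollback(self):
        self.regs = self.checkpoints.pop()

m = RollbackModel()
m.checkpoint()
m.step('r0', 42)   # speculative run-ahead
m.rollback()       # timing model chose a different path
```

The appeal of the design is that the common case (no rollback) pays almost nothing, so the functional model can stay highly optimized while still serving cycle-accurate timing models.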

Proceedings ArticleDOI
05 Jul 2009
TL;DR: A data acquisition and monitoring platform using the Modbus/RTU master protocol, based on an embedded PowerPC and the embedded Linux operating system, is designed in this paper; it realizes industrial-field functions such as data acquisition, remote monitoring, and network communication.
Abstract: The Modbus protocol is widely used in the industrial control field because of its reliability, flexibility, and real-time performance. It has become a de facto industrial standard, and many industrial devices—such as PLCs, DCSs, and intelligent instruments—use Modbus as their communication protocol. Embedded systems focus on applications and can adapt to strict requirements on functionality, reliability, cost, size, and power consumption. The PowerPC processor series from Freescale Semiconductor is an ideal platform for RISC embedded applications, with strong communication capability, system stability, and interference rejection. Based on Freescale's MPC8248 embedded processor, this paper designs a data acquisition and monitoring platform that uses the Modbus/RTU master protocol on embedded PowerPC hardware and the embedded Linux operating system; it realizes industrial-field functions such as data acquisition, remote monitoring, and network communication.
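Each Modbus/RTU frame carries a CRC-16 checksum that a master implementation must compute and verify; the standard algorithm (init 0xFFFF, reflected polynomial 0xA001, CRC appended low byte first) can be sketched as:

```python
def crc16_modbus(frame: bytes) -> int:
    """CRC-16/MODBUS over the frame's address, function, and data bytes."""
    crc = 0xFFFF
    for byte in frame:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc

def append_crc(frame: bytes) -> bytes:
    """Build the wire format: frame followed by CRC, low byte first."""
    crc = crc16_modbus(frame)
    return frame + bytes([crc & 0xFF, crc >> 8])
```

On receive, the same routine run over the whole frame including the two CRC bytes yields zero for an undamaged frame, which is a convenient validity check.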

Book ChapterDOI
01 Jan 2009
TL;DR: A methodology to model the power and energy consumption of Inter-Process Communications (IPC) is introduced and illustrated by building and using a model for Ethernet-based inter-process communications.
Abstract: The aim of our work is to provide methods and tools to quickly estimate power consumption in the first steps of a system design. We introduce multi-level power models and show how to use them at different levels of specification refinement in the model-based AADL (Architecture & Analysis Design Language) design flow. These power models, with the underlying methodology for power estimation, are currently being integrated into the Open Source AADL Tool Environment (OSATE) under the name CAT: Consumption Analysis Toolbox. In the case of a processor binding, its first prototype gives power consumption estimations for software components in the AADL component assembly model, with a maximal error ranging roughly from 5% at the lowest refinement level (the source code of the software component is known) to 30% at the highest level (only the operating frequency and basic target configuration parameters are considered). We illustrate our approach with the power models of a simple RISC processor (PowerPC 405), a complex DSP (TI C62), and an FPGA (from Altera), and show how these models can be used at different levels in the AADL flow. Obviously, the power consumption of Operating System (OS) services must also be considered; we show that the OS's principal impact on overall consumption is mainly due to services implying data transfers. We introduce a methodology to model the power and energy consumption of Inter-Process Communications (IPC), and illustrate this methodology by building and using a model for Ethernet-based inter-process communications.
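At the coarsest refinement level, such power models reduce to summing power times time over execution phases, including the IPC phases the OS contributes. A deliberately simplified sketch (the phase figures below are hypothetical, not taken from CAT's models):

```python
def estimate_energy(phases):
    """Energy in joules as the sum of power * time over execution phases;
    `phases` is a list of (power_watts, time_seconds) pairs. This is a
    gross simplification of multi-level power modeling."""
    return sum(p_watts * t_secs for p_watts, t_secs in phases)

# hypothetical: compute at 2.5 W for 1.2 s, Ethernet IPC at 3.1 W for 0.4 s
energy = estimate_energy([(2.5, 1.2), (3.1, 0.4)])
```

Refinement then means replacing the assumed per-phase power figures with values derived from progressively more detailed component knowledge, which is what shrinks the error from ~30% to ~5%.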

Proceedings ArticleDOI
09 Dec 2009
TL;DR: A runtime memory allocation algorithm, that aims to substantially reduce the overhead caused by shared-memory accesses by allocating memory directly in the local scratch pad memories of the heterogeneous platform, is presented.
Abstract: In this paper, we present a runtime memory allocation algorithm that aims to substantially reduce the overhead caused by shared-memory accesses by allocating memory directly in the local scratch-pad memories. We target a heterogeneous platform with a complex memory hierarchy. Using special instrumentation, we determine what memory areas are used in functions that could run on different processing elements, such as a reconfigurable logic array. Based on profile information, the programmer annotates some functions as candidates for accelerated execution. An algorithm then decides the best allocation, taking into account the various processing elements and special scratch-pad memories of the heterogeneous platform. Tests are performed on our prototype platform, a Virtex ML410 board running the Linux operating system, containing a PowerPC processor and a Xilinx FPGA, implementing the MOLEN programming paradigm. We test the algorithm using both the state-of-the-art H.264 video encoder and other synthetic applications. The performance improvement for the H.264 application is 14% compared to the software-only version, while the overhead is less than 1% of the application execution time. This improvement is the optimum that can be obtained by optimizing the memory allocation. For the synthetic applications, the results are within 5% of the optimum.
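The allocation decision can be caricatured as a knapsack-style choice over the scratch-pad capacity; the greedy sketch below (hypothetical buffer names and figures) picks candidates by cycles saved per byte, a simplification of the paper's algorithm, which also weighs the different processing elements and memory levels.

```python
def allocate_scratchpad(candidates, capacity):
    """Greedy pick by benefit density (cycles saved per byte).
    `candidates` are (name, size_bytes, cycles_saved) tuples; returns the
    names placed in the scratch-pad, in the order they were chosen."""
    chosen, used = [], 0
    for name, size, saved in sorted(candidates,
                                    key=lambda c: c[2] / c[1], reverse=True):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

# hypothetical profile data for three buffers of an encoder
cands = [("lumabuf", 4096, 8000), ("mvcache", 1024, 6000), ("huge", 8192, 9000)]
picked = allocate_scratchpad(cands, 6144)
```

Greedy density is not optimal in general (it is a knapsack heuristic), which is consistent with the paper reporting results within a few percent of the optimum rather than exactly at it.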

Proceedings ArticleDOI
28 Dec 2009
TL;DR: This paper presents an approach for dynamic partial self-reconfiguration that enables FPGAs to reconfigure themselves dynamically and partially under the control of an external processor.
Abstract: Field Programmable Gate Arrays (FPGAs) are increasingly being used for many systems and efficient System-on-a-Chip (SOC) designs. Hence, dynamic partial self-reconfiguration (DPSR) of the FPGA can be regarded as one of the essentials of making hardware flexible, achieving power efficiency, and optimizing area. This paper presents an approach for dynamic partial self-reconfiguration that enables FPGAs to reconfigure themselves dynamically and partially under the control of an external processor. The reconfiguration process is accomplished without an internal configuration access port (ICAP), which would otherwise be used with either a MicroBlaze soft core or a PowerPC hard core through the HWICAP core on the On-Chip Peripheral Bus (OPB). The approach can also be applied to other FPGA architectures, such as the Virtex-II (Pro), Virtex-4, and Virtex-5.

Dissertation
01 Jan 2009
TL;DR: This thesis aims to establish a framework for partial self-reconfiguration on an FPGA, showing that it is possible and how it can be done, though more research is needed to further simplify and enhance the framework.
Abstract: Partially self-reconfigurable hardware has not yet become mainstream, even though the technology is available. FPGA manufacturers such as Xilinx currently offer devices capable of partial self-reconfiguration. These and earlier FPGA devices were used mostly for prototyping and testing of designs before producing ASICs, since FPGAs were too expensive to be used in final production designs. Now that prices for these devices are coming down, it is more and more common to see them in consumer devices such as routers and switches, where protocols can change quickly. Using an FPGA in these devices, the manufacturer has the possibility to update the device when there are protocol updates or bugs in the design. Currently, however, such reconfiguration replaces the complete design, not just the modules that are needed. The main reason partial self-reconfiguration is not used today is the lack of tools to simplify the design and usage of such a system. In this thesis, different aspects of partial self-reconfiguration are evaluated: the current state of research is surveyed, and a proof of concept incorporating most of this research is created, attempting to establish a framework for partial self-reconfiguration on an FPGA. The work uses the Suzaku-V platform, which contains a Virtex-II or Virtex-4 FPGA from Xilinx. To enable partial reconfiguration of these FPGAs, the configuration logic and configuration bitstream were studied; based on this understanding of the bitstream, a program was developed that can read out or insert modules in a bitstream. In the proof of concept, partial reconfiguration is controlled by a CPU on the FPGA running Linux. Running Linux on the CPU simplifies many aspects of development, since many programs and communication methods are readily available.
Partial self-reconfiguration on an FPGA with a hard-core PowerPC running Linux is a complicated task to solve. Many problems were encountered in this work; hopefully many of these issues have been addressed and answered, simplifying further work. This is only the beginning, showing that it is possible and how it can be done, but more research must be done to further simplify and enhance the framework.
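The bitstream tool the thesis describes, which can read out or insert modules in a bitstream, can be modeled at its simplest as splicing a contiguous byte range. Real Virtex-II/Virtex-4 bitstreams carry packet headers, frame addresses, and CRCs, all of which this toy model deliberately omits; the offsets and sizes are invented for illustration.

```python
# Illustrative sketch of a bitstream module read-out/insert tool: a
# "module" is treated as a contiguous byte range of configuration data.
# Real bitstreams require parsing configuration packets and fixing CRCs,
# which is exactly the hard part the thesis had to research.

def read_module(bitstream, offset, length):
    # Extract the bytes belonging to one reconfigurable module.
    return bitstream[offset:offset + length]

def insert_module(bitstream, offset, module):
    # Splice a module of equal length back into the bitstream.
    if offset + len(module) > len(bitstream):
        raise ValueError("module does not fit in bitstream")
    return bitstream[:offset] + module + bitstream[offset + len(module):]

full = bytes(range(16))           # stand-in for a configuration bitstream
mod = read_module(full, 4, 4)     # extract a 4-byte "module"
patched = insert_module(full, 4, bytes([0xAA]) * 4)
```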

Proceedings ArticleDOI
09 Dec 2009
TL;DR: A hardware architecture for computing direct kinematics of robot manipulators using floating-point arithmetic is presented for 32, 43 and 64 bit-width representations and Synthesis and simulation results demonstrate the accuracy and high performance of the implemented hardware architecture.
Abstract: The sequential behavior of general-purpose processors limits applications that require high processing speeds. One advantage of FPGA implementations is their capability for parallel processing, which allows complex algorithms to be accelerated, and it is now common to find FPGA implementations in applications requiring high-speed processing. In this paper, a hardware architecture for computing the direct kinematics of robot manipulators using floating-point arithmetic is presented for 32-, 43- and 64-bit-width representations. In addition, the processing time of the hardware architecture is compared with the same formulation implemented in software on the PowerPC (the FPGA's embedded processor). The proposed architecture was validated against Matlab results, used as a statistical reference to compute the Mean Square Error (MSE). Synthesis and simulation results demonstrate the accuracy and high performance of the implemented hardware architecture.
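The validation step described above can be illustrated by computing forward kinematics in full double precision (the Matlab-style reference) and in a reduced 32-bit datapath, then reporting the MSE. The two-link planar arm, link lengths, and angle sweep below are illustrative assumptions; the paper's manipulator and its custom 43-bit format are not reproduced here.

```python
# Sketch of MSE validation for a reduced-precision kinematics datapath.
# to_f32 emulates a 32-bit floating-point unit by rounding every
# intermediate result through float32.
import math
import struct

def to_f32(x):
    # Round a Python double through IEEE-754 single precision.
    return struct.unpack('f', struct.pack('f', x))[0]

def fk(theta1, theta2, l1=0.5, l2=0.3, narrow=lambda x: x):
    # Forward kinematics of a 2-link planar arm; `narrow` models the
    # precision of each arithmetic operation in the datapath.
    c1 = narrow(math.cos(theta1))
    c12 = narrow(math.cos(theta1 + theta2))
    x = narrow(narrow(l1 * c1) + narrow(l2 * c12))
    s1 = narrow(math.sin(theta1))
    s12 = narrow(math.sin(theta1 + theta2))
    y = narrow(narrow(l1 * s1) + narrow(l2 * s12))
    return x, y

angles = [(i * 0.1, i * 0.05) for i in range(100)]
ref = [fk(a, b) for a, b in angles]                  # double-precision reference
hw = [fk(a, b, narrow=to_f32) for a, b in angles]    # 32-bit emulation
mse = sum((rx - hx) ** 2 + (ry - hy) ** 2
          for (rx, ry), (hx, hy) in zip(ref, hw)) / len(ref)
```

The same harness, with a different `narrow` function, would quantify the accuracy of a 43-bit intermediate format against the 64-bit reference.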

01 Jan 2009
TL;DR: This work shows how, for machines of practical interest, to generate the back end of a compiler, using the compiler architecture developed by Davidson and Fraser (1984), which can generate a naive instruction selector and rely upon a machine-independent optimizer to improve the machine instructions.
Abstract: Although I have proven that the general problem is undecidable, I show how, for machines of practical interest, to generate the back end of a compiler. Unlike previous work on generating back ends, I generate the machine-dependent components of the back end using only information that is independent of the compiler's internal data structures and intermediate form. My techniques substantially reduce the burden of retargeting the compiler: although it is still necessary to master the target machine's instruction set, it is not necessary to master the data structures and algorithms in the compiler's back end. Instead, the machine-dependent knowledge is isolated in the declarative machine descriptions. The largest machine-dependent component in a back end is the instruction selector. Previous work has shown that it is difficult to generate a high-quality instruction selector. But by adopting the compiler architecture developed by Davidson and Fraser (1984), I can generate a naive instruction selector and rely upon a machine-independent optimizer to improve the machine instructions. Unlike previous work, my generated back ends produce code that is as good as the code produced by hand-written back ends. My code generator translates a source program into tiles, where each tile implements a simple computation like addition. To implement the tiles, I compose machine instructions in sequence and use equational reasoning to identify sequences that implement tiles. Because it is undecidable whether a tile can be implemented, I use a heuristic to limit the set of sequences considered. Unlike standard heuristics, which may limit the length of a sequence, the number of sequences considered, or the complexity of the result computed by a sequence, my heuristic uses a new idea: to limit the amount of reasoning required to show that a sequence of instructions implements a tile. 
The limit, which is chosen empirically, enables my search to find instruction selectors for the x86, PowerPC, and ARM in a few minutes each.
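The search described above can be caricatured in a few lines: enumerate instruction sequences up to a bound and keep the first one that implements the tile. The three-instruction machine below is invented for illustration, and equivalence is checked on sample inputs rather than by the thesis's equational reasoning; the bound on sequence length stands in for its limit on reasoning effort.

```python
# Toy instruction selector: find a sequence of "machine instructions"
# (functions on a single accumulator) that implements a tile, using a
# bounded breadth-first search over sequences.
from itertools import product

INSTRS = {
    'neg': lambda a: -a,      # negate accumulator
    'inc': lambda a: a + 1,   # add one
    'dbl': lambda a: a * 2,   # double
}

def implements(seq, tile, samples):
    # Check the sequence against the tile on sample inputs.
    for x in samples:
        acc = x
        for name in seq:
            acc = INSTRS[name](acc)
        if acc != tile(x):
            return False
    return True

def select(tile, max_len=3, samples=(0, 1, 2, 7, -3)):
    # Shortest sequences first; give up beyond the length limit.
    for length in range(1, max_len + 1):
        for seq in product(INSTRS, repeat=length):
            if implements(seq, tile, samples):
                return list(seq)
    return None

# Tile: f(x) = -(2x + 1), which no single instruction implements.
seq = select(lambda x: -(2 * x + 1))
```

Because every instruction here is affine, agreement on two sample inputs already implies true equivalence; for a real ISA that shortcut fails, which is why the thesis needs symbolic reasoning instead of testing.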

Proceedings ArticleDOI
16 Dec 2009
TL;DR: This paper presents design and implementation of a Data Acquisition System (DAS) in Field Programmable Gate Arrays (FPGA) using System on Chip (SoC) methodology and a suitable framing interface along with error detection is proposed and interfaced with Xilinx Aurora IP core.
Abstract: This paper presents the design and implementation of a Data Acquisition System (DAS) in Field Programmable Gate Arrays (FPGAs) using a System-on-Chip (SoC) methodology. To ensure the reliability of data transmitted over the channel, a suitable framing interface with error detection is proposed and interfaced with the Xilinx Aurora IP core. The proposed DAS is capable of transmitting data at 1.25 Gbps over the channel. Its main advantage is that configuration and monitoring are done using the FPGA's built-in PowerPC processor through a Universal Synchronous Asynchronous Receiver Transmitter (UART). This type of DAS is suitable for multi-channel acoustic data acquisition and for Electronic Warfare systems.
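A framing layer with error detection of the kind proposed above typically wraps each payload in a sync word, a length field, and a checksum. The field layout and the choice of CRC-16-CCITT below are illustrative assumptions, not the paper's actual frame format.

```python
# Hedged sketch of a frame format with error detection: sync word,
# 16-bit length, payload, then CRC-16-CCITT over the payload. The
# receiver rejects frames with a bad sync word or CRC mismatch.
import struct

SYNC = 0xA55A

def crc16_ccitt(data, crc=0xFFFF):
    # Bitwise MSB-first CRC-16 with polynomial 0x1021.
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def frame(payload):
    header = struct.pack('>HH', SYNC, len(payload))
    return header + payload + struct.pack('>H', crc16_ccitt(payload))

def deframe(data):
    sync, length = struct.unpack('>HH', data[:4])
    if sync != SYNC:
        raise ValueError("bad sync word")
    payload = data[4:4 + length]
    (crc,) = struct.unpack('>H', data[4 + length:6 + length])
    if crc != crc16_ccitt(payload):
        raise ValueError("CRC mismatch")
    return payload

msg = frame(b"acoustic sample block")
```

In the paper's system the framing sits between the acquisition logic and the Aurora link-layer core; here the serial link is simply a byte string.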

Proceedings ArticleDOI
06 Apr 2009
TL;DR: A watershed-based segmentation algorithm is implemented on a Virtex-II Pro platform, exploiting the embedded PowerPC processor, with low execution time and minimal internal FPGA resource consumption.
Abstract: Watershed transformation is a powerful technique that can be used efficiently for image segmentation. In this paper, we implement a watershed-based segmentation algorithm on a Virtex-II Pro platform. The main contribution of this work is the low execution time and minimal internal FPGA resource consumption. The proposed architecture includes two main blocks. First, a gradient of the image is generated using morphological operators (dilation and erosion). Then, the watershed is applied to the resulting image based on the immersion principle. The implementation was optimized with respect to hardware resource occupation and speed, based on a codesign methodology. Our approach exploits the software potential of the Virtex-II Pro platform enabled by the embedded PowerPC processor: first, at a high design level, the whole design ran on the PowerPC; then, the optimal design was obtained by analyzing the timing of the different portions of the algorithm and implementing the time-intensive parts in hardware. This strategy leads to acceptable hardware resource occupation and a maximum frequency of approximately 100 MHz. As an illustration, we apply our design to the segmentation of the Cameraman image.
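The first block of the architecture, the morphological gradient, is dilation minus erosion: for each pixel, the local maximum minus the local minimum over a structuring element. A minimal sketch, assuming a 3x3 structuring element with clamped borders and a tiny invented image:

```python
# Morphological gradient = dilation (local max) - erosion (local min)
# over a 3x3 neighborhood. Region boundaries get high gradient values,
# which is what the immersion-based watershed then floods.

def neighborhood(img, r, c):
    # 3x3 window around (r, c), clamped at the image borders.
    h, w = len(img), len(img[0])
    return [img[i][j]
            for i in range(max(0, r - 1), min(h, r + 2))
            for j in range(max(0, c - 1), min(w, c + 2))]

def gradient(img):
    return [[max(neighborhood(img, r, c)) - min(neighborhood(img, r, c))
             for c in range(len(img[0]))] for r in range(len(img))]

img = [[10, 10, 10, 80],
       [10, 10, 80, 80],
       [10, 80, 80, 80]]   # two flat regions with a diagonal boundary
grad = gradient(img)
```

Flat regions yield zero gradient and boundary pixels a large one, so the watershed's catchment basins form inside the regions rather than across them.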

Book ChapterDOI
19 Feb 2009
TL;DR: The Virtual Vector Architecture (ViVA), which combines the memory semantics of vector computers with a software-controlled scratchpad memory in order to provide a more effective and practical approach to latency hiding, is presented.
Abstract: The disparity between microprocessor clock frequencies and memory latency is a primary reason why many demanding applications run well below peak achievable performance. Software-controlled scratchpad memories, such as the Cell local store, attempt to ameliorate this discrepancy by enabling precise control over memory movement; however, scratchpad technology confronts the programmer and compiler with an unfamiliar and difficult programming model. In this work, we present the Virtual Vector Architecture (ViVA), which combines the memory semantics of vector computers with a software-controlled scratchpad memory in order to provide a more effective and practical approach to latency hiding. ViVA requires minimal changes to the core design and could thus be easily integrated with conventional processor cores. To validate our approach, we implemented ViVA on the Mambo cycle-accurate full-system simulator, which was carefully calibrated to match the performance of our underlying PowerPC Apple G5 architecture. Results show that ViVA is able to deliver significant performance benefits over scalar techniques for a variety of memory access patterns as well as two important memory-bound compact kernels, corner turn and sparse matrix-vector multiplication, achieving a 2x-13x improvement compared to the scalar version. Overall, our preliminary ViVA exploration points to a promising approach for improving application performance on leading microprocessors with minimal design and complexity costs, in a power-efficient manner.
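The ViVA idea can be pictured as vector-style loads that fill a software-controlled buffer (standing in for the scratchpad), after which the core computes on buffered operands. The sketch below shows this on sparse matrix-vector multiplication in CSR form, one of the paper's kernels; the `vgather` API is an invented illustration, not ViVA's actual instruction set.

```python
# Conceptual model: a vector gather "instruction" moves indexed data
# from memory into a scratchpad-like buffer; compute then runs over
# the buffered operands, hiding individual element latencies.

def vgather(memory, indices):
    # One vector load fills a buffer from arbitrary indices.
    return [memory[i] for i in indices]

def spmv_csr(values, col_idx, row_ptr, x):
    # y = A @ x for a CSR matrix (values, col_idx, row_ptr).
    y = []
    for r in range(len(row_ptr) - 1):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        vals = vgather(values, range(lo, hi))   # buffered matrix entries
        xs = vgather(x, col_idx[lo:hi])         # buffered source-vector entries
        y.append(sum(v * xv for v, xv in zip(vals, xs)))
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form, times x = [1, 1, 1]
y = spmv_csr([1, 2, 3], [0, 2, 1], [0, 2, 3], [1, 1, 1])
```

The irregular `col_idx` accesses are exactly the pattern where a hardware gather into a scratchpad beats issuing scalar loads one cache miss at a time.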

01 Jan 2009
TL;DR: The main objective of this thesis work is to write a translator to ALF from CRL2, a version of CRL (Control flow Representation Language), a format that represents various types of object code, including PowerPC assembler code, in terms of control flow graphs.
Abstract: Real-time systems are systems that must give accurate results within a precise time period. These systems have now become an indispensable aspect of our day-to-day lives. As the importance of real ...

Journal ArticleDOI
Hartmut Penner1, Utz Bacher1, Jan Kunigk1, C. Rund1, Heiko Schick1 
TL;DR: The challenges in programming, execution control, and operation on the accelerators that were faced during the design and implementation of a prototype are explained, along with solutions to overcome them, and an outlook is provided on where the directCell approach promises to better solve customer problems.
Abstract: The Cell Broadband Engine® (Cell/B.E.) processor is a hybrid IBM PowerPC® processor. In blade servers and PCI Express® card systems, it has been used primarily in a server context, with Linux® as the operating system. Because neither Linux as an operating system nor a PowerPC processor-based architecture is the preferred choice for all applications, some installations use the Cell/B.E. processor in a coupled hybrid environment, which has implications for the complexity of systems management, the programming model, and performance. In the directCell approach, we use the Cell/B.E. processor as a processing device connected to a host via a PCI Express link using direct memory access and memory-mapped I/O (input/output). The Cell/B.E. processor functions as a processor and is perceived by the host like a device while maintaining the native Cell/B.E. processor programming approach. We describe the problems with the current practice that led us to the directCell approach. We explain the challenge in programming, execution control, and operation on the accelerators that were faced during the design and implementation of a prototype and present solutions to overcome them. We also provide an outlook on where the directCell approach promises to better solve customer problems.