
Showing papers on "PowerPC published in 2009"


Proceedings ArticleDOI
19 Apr 2009
TL;DR: This work shows how the GPU, with a design involving thousands of lightweight threads, can boost network coding performance significantly, and can be deployed as an attractive alternative and complement to multi-core servers, offering better price/performance.
Abstract: While it is a well-known result that network coding achieves optimal flow rates in multicast sessions, its potential for practical use has remained in question due to its high computational complexity. Our previous work attempted to design a hardware-accelerated and multi-threaded implementation of network coding to fully utilize multi-core CPUs, as well as SSE2 and AltiVec SIMD vector instructions on x86 and PowerPC processors. This paper represents another step forward, and presents the first attempt in the literature to maximize the performance of network coding by taking advantage of not only multi-core CPUs, but also the potentially hundreds of computing cores in commodity off-the-shelf graphics processing units (GPUs). With GPU computing gaining momentum as a result of increased hardware capabilities and improved programmability, our work shows how the GPU, with a design involving thousands of lightweight threads, can boost network coding performance significantly. Many-core GPUs can be deployed as an attractive alternative and complement to multi-core servers, offering better price/performance. In fact, multi-core CPUs and many-core GPUs can be deployed and used to perform network coding simultaneously, which is potentially useful in media streaming servers where hundreds of peers are served concurrently by these dedicated servers. In this paper, we present Nuclei, the design and implementation of GPU-based network coding. With Nuclei, a single mainstream NVIDIA 8800 GT GPU outperforms an 8-core Intel Xeon server in most test cases. A combined CPU-GPU encoding scenario achieves coding rates of up to 116 MB/second for a variety of coding settings, which is sufficient to saturate a Gigabit Ethernet interface.
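The coding operation at the heart of this work is a random linear combination of packet blocks over GF(2^8). The sketch below (Python, illustrative only — Nuclei's kernels are SIMD/GPU code) assumes the AES reduction polynomial 0x11B; a real network coding implementation may use a different irreducible polynomial, and the blocks and coefficients here are made up.

```python
def gf_mul(a, b, poly=0x11B):
    """Carry-less multiply in GF(2^8), reduced by `poly` (AES polynomial assumed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:       # reduce whenever bit 8 appears
            a ^= poly
        b >>= 1
    return result

def encode_block(coeffs, blocks):
    """Coded block = sum_i coeffs[i] * blocks[i] over GF(2^8), byte-wise."""
    out = bytearray(len(blocks[0]))
    for c, blk in zip(coeffs, blocks):
        for j, byte in enumerate(blk):
            out[j] ^= gf_mul(c, byte)
    return bytes(out)

blocks = [b"\x01\x02", b"\x03\x04"]   # two hypothetical source blocks
coeffs = [5, 7]                       # hypothetical random coefficients
coded = encode_block(coeffs, blocks)
```

A receiver collects enough coded blocks with linearly independent coefficient vectors and inverts the system over the same field to recover the sources.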

72 citations


Proceedings ArticleDOI
21 Sep 2009
TL;DR: By installing Hadoop on a cluster of IBM PowerPC blades, it is shown that multiyear remote sensing data can be processed efficiently, with speed improvements over conventional multi-processor methodologies and a more memory-efficient implementation allowing for finer grid resolutions.
Abstract: Hadoop is a distributed file system and MapReduce framework, developed as an open source system by the Apache foundation after the approach was pioneered by Google for its search applications. We propose that this parallel computing framework is well suited for a variety of service-oriented science applications and, in particular, for satellite data processing of remote sensing systems. We show that, by installing Hadoop on a cluster of IBM PowerPC blades, we can efficiently process multiyear remote sensing data, expect to see speed improvements over conventional multi-processor methodologies, and obtain a more memory-efficient implementation allowing for finer grid resolutions. Moreover, these improvements can be achieved without significant changes in coding structure.
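The processing pattern the paper relies on can be illustrated with a toy MapReduce job: a map phase bins hypothetical (lat, lon, value) records into grid cells, and a reduce phase averages each cell. This is a pure-Python sketch of the paradigm, not the authors' Hadoop job; the record format and cell size are assumptions.

```python
from collections import defaultdict

def map_phase(records, cell_deg=1.0):
    """Emit (cell, value) pairs, binning lat/lon into cells of `cell_deg` degrees."""
    for lat, lon, value in records:
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        yield cell, value

def reduce_phase(pairs):
    """Average all values that fell into the same grid cell."""
    acc = defaultdict(list)
    for cell, value in pairs:
        acc[cell].append(value)
    return {cell: sum(vs) / len(vs) for cell, vs in acc.items()}

records = [(10.2, 20.7, 1.0), (10.8, 20.1, 3.0), (45.0, 60.0, 5.0)]
grid = reduce_phase(map_phase(records))
```

In Hadoop the map and reduce functions run in parallel across the cluster, with the framework handling the shuffle of (cell, value) pairs between them.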

67 citations


Proceedings ArticleDOI
TL;DR: The time-to-market-driven need to maintain concurrent process-design co-development, even in the face of discontinuous patterning, process, and device innovation, is reiterated, and shortcomings in traditional Design for Manufacturability solutions are identified.
Abstract: The time-to-market-driven need to maintain concurrent process-design co-development, even in the face of discontinuous patterning, process, and device innovation, is reiterated. The escalating design rule complexity that results from increasing layout sensitivities in physical and electrical yield, and the consequent risk to profitable technology scaling, is reviewed. Shortcomings in traditional Design for Manufacturability (DfM) solutions are identified and contrasted with the highly successful integrated design-technology co-optimization used for SRAM and other memory arrays. The feasibility of extending memory-style design-technology co-optimization, based on a highly simplified layout environment, to logic chips is demonstrated. Layout density benefits, modeled patterning and electrical yield improvements, as well as substantially improved layout simplicity are quantified in a conventional versus template-based design comparison on a 65nm IBM PowerPC 405 microprocessor core. The adaptability of this highly regularized template-based design solution to different yield concerns and design styles is shown in the extension of this work to 32nm with an increased focus on interconnect redundancy. In closing, the work not covered in this paper, focused on the process side of the integrated process-design co-optimization, is introduced.

57 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: A tool flow is described that can automatically map a large set of applications to a self-reconfiguring platform, without an excessive need for resources at run-time, and is successfully used to implement an adaptive 32-tap FIR filter on a Xilinx XUP board.
Abstract: The inherent reconfigurability of FPGAs enables us to optimize an FPGA implementation in different time intervals by generating new optimized FPGA configurations and reconfiguring the FPGA at the interval boundaries. With conventional methods, generating a configuration at run-time requires an unacceptable amount of resources. In this paper, we describe a tool flow that can automatically map a large set of applications to a self-reconfiguring platform, without an excessive need for resources at run-time. The self-reconfiguring platform is implemented on a Xilinx Virtex-II Pro FPGA and uses the FPGA's PowerPC as configuration manager. This configuration manager generates optimized configurations on-the-fly and writes them to the configuration memory using the ICAP. We successfully used our approach to implement an adaptive 32-tap FIR filter on a Xilinx XUP board. This resulted in a 40% reduction in FPGA resources compared to a conventional implementation and a manageable reconfiguration overhead.
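The demonstration application is an adaptive 32-tap FIR filter; in the tool flow, the PowerPC configuration manager regenerates a specialized bitstream whenever the coefficients change. As a reference for what the reconfigured hardware computes, here is a direct-form FIR filter sketched in Python (the tap values below are hypothetical, and the number of taps is reduced for illustration).

```python
def fir(samples, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n-k], with zero history
    before the start of the signal. In the paper's adaptive setup, a new
    specialized FPGA configuration is generated when the taps change; this
    just shows the arithmetic the hardware implements."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * samples[n - k]
        out.append(acc)
    return out

impulse_response = fir([1, 0, 0, 0], [0.5, 0.25])  # recovers the taps
```

Specializing the hardware to fixed taps (constant folding into the logic) is what yields the reported 40% resource reduction over a generic implementation.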

45 citations


Journal ArticleDOI
01 Jan 2009
TL;DR: This paper presents a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance, and the kernel proves to be memory-bound and it achieves a 98% utilization of the peak memory bandwidth.
Abstract: Reverse-Time Migration (RTM) is a state-of-the-art technique in seismic acoustic imaging, because of the quality and integrity of the images it provides. Oil and gas companies trust RTM with crucial decisions on multi-million-dollar drilling investments. But RTM requires vastly more computational power than its predecessor techniques, and this has somewhat hindered its practical success. On the other hand, although multi-core architectures promise to deliver unprecedented computational power, little attention has been devoted to efficiently mapping RTM to multi-cores. In this paper, we present a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance. The kernel proves to be memory-bound and achieves a 98% utilization of the peak memory bandwidth. Our Cell/B.E. implementation outperforms a traditional processor (PowerPC 970MP) in both performance (a 15.0× speedup) and energy efficiency (a 10.0× increase in the GFlops/W delivered). To the best of our knowledge, it is also the fastest RTM implementation available. These results increase the practical usability of RTM, and the RTM-Cell/B.E. combination proves to be a strong competitor in the seismic arena.
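RTM propagates a wavefield with a finite-difference stencil, which is why the kernel is memory-bound: each update touches several neighboring points but performs little arithmetic per byte. A minimal 1D, second-order sketch of one explicit time step (not the authors' Cell/B.E. kernel, which is 3D and heavily tuned):

```python
def wave_step(p_prev, p, c2dt2dx2):
    """One explicit time step of the 1D acoustic wave equation:
    p_next = 2*p - p_prev + (c*dt/dx)^2 * laplacian(p),
    with fixed (zero) boundaries. `c2dt2dx2` is the combined coefficient."""
    n = len(p)
    p_next = [0.0] * n
    for i in range(1, n - 1):
        lap = p[i - 1] - 2.0 * p[i] + p[i + 1]
        p_next[i] = 2.0 * p[i] - p_prev[i] + c2dt2dx2 * lap
    return p_next
```

In RTM this stencil is run forward from the source and backward from the receivers, and the two wavefields are cross-correlated to form the image.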

41 citations


Proceedings ArticleDOI
22 Sep 2009
TL;DR: The design and implementation of “Big Memory”—an alternative, transparent memory space for computational processes—is presented; although originally intended exclusively for compute node tasks, it turns out to dramatically improve the performance of certain I/O node applications as well.
Abstract: Efficient use of Linux for high-performance applications on Blue Gene/P (BG/P) compute nodes is challenging because of severe performance hits resulting from translation lookaside buffer (TLB) misses and a hard-to-program torus network DMA controller. To address these difficulties, we present the design and implementation of “Big Memory”—an alternative, transparent memory space for computational processes. Big Memory uses extremely large memory pages available on PowerPC CPUs to create a TLB-miss-free, flat memory area that can be used for application code and data and is easier to use for DMA operations. One of our single-node memory benchmarks shows that the performance gap between regular PowerPC Linux with 4KB pages and the IBM BG/P compute node kernel (CNK) is about 68% in the worst case. Big Memory narrows the worst-case performance gap to just 0.04%. We verify this result on 1024 nodes of Blue Gene/P using the NAS Parallel Benchmarks and find the performance under Linux with Big Memory to fluctuate within 0.7% of CNK. Originally intended exclusively for compute node tasks, our new memory subsystem turns out to dramatically improve the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray (LOFAR) radio telescope as an example.
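The "Big Memory" benefit comes down to TLB reach arithmetic: with huge pages, the same number of TLB entries maps orders of magnitude more memory, so a working set that constantly missed under 4KB pages can fit entirely. A sketch with hypothetical entry counts and page sizes (actual BG/P TLB geometry differs):

```python
def tlb_reach_bytes(entries, page_size):
    """Total memory the TLB can map simultaneously without a miss."""
    return entries * page_size

KB, MB = 1024, 1024 * 1024
small = tlb_reach_bytes(64, 4 * KB)    # 64 entries of 4 KB pages
large = tlb_reach_bytes(64, 16 * MB)   # the same TLB with 16 MB pages
```

With the assumed numbers, reach grows from 256 KB to 1 GB, which is why a flat huge-page region can be effectively TLB-miss-free for an HPC working set.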

39 citations


Proceedings ArticleDOI
14 Nov 2009
TL;DR: A scalable simulator that couples a cycle-accurate node simulator with a supercomputer network model and shows its suitability for architecture research by evaluating the impact of cache injection on parallel application performance.
Abstract: Instruction-level simulation is necessary to evaluate new architectures. However, single-node simulation cannot predict the behavior of a parallel application on a supercomputer. We present a scalable simulator that couples a cycle-accurate node simulator with a supercomputer network model. Our simulator executes individual instances of IBM's Mambo PowerPC simulator on hundreds of cores. We integrated a NIC emulator into Mambo and model the network instead of fully simulating it. This decouples the individual node simulators and makes our design scalable. Our simulator runs unmodified parallel message-passing applications on hundreds of nodes. We can change network and detailed node parameters, inject network traffic directly into caches, and use different policies to decide when that is an advantage. This paper describes our simulator in detail, evaluates it, and demonstrates its scalability. We show its suitability for architecture research by evaluating the impact of cache injection on parallel application performance.

34 citations


Journal ArticleDOI
TL;DR: This article presents implementations of robust motion detection algorithms on three architectures: a general-purpose RISC processor—the PowerPC G4—a parallel artificial retina dedicated to low-level image processing—Pvlsar34—and the Associative Mesh, a specialized architecture based on associative nets.
Abstract: The goal of this article is to compare optimised implementations on current high-performance platforms in order to highlight architectural trends in the field of embedded architectures and to estimate what the components of a next-generation vision system should be. We present implementations of robust motion detection algorithms on three architectures: a general-purpose RISC processor—the PowerPC G4—a parallel artificial retina dedicated to low-level image processing—Pvlsar34—and the Associative Mesh, a specialized architecture based on associative nets. To address the different aspects and constraints of embedded systems, the execution time and power consumption of these architectures are compared.
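As a concrete example of the algorithm class being benchmarked, here is a per-pixel Sigma-Delta background estimator, a common robust motion detection kernel in this literature; whether it matches the article's exact algorithms is an assumption, and the flat 1D "frame" below is purely illustrative.

```python
def sigma_delta_step(frame, bg, var, amp=4):
    """One Sigma-Delta update: the background estimate `bg` tracks the frame
    by +/-1 per step, the variance estimate `var` tracks amp*|I - M| the same
    way, and a pixel is flagged as moving when |I - M| > V.
    `bg` and `var` are updated in place."""
    motion = []
    for i, pix in enumerate(frame):
        if bg[i] < pix:
            bg[i] += 1
        elif bg[i] > pix:
            bg[i] -= 1
        diff = abs(pix - bg[i])
        if diff != 0:
            target = amp * diff
            if var[i] < target:
                var[i] += 1
            elif var[i] > target:
                var[i] -= 1
        motion.append(diff > var[i])
    return motion
```

The kernel is attractive for retinas and associative meshes because every pixel performs the same few increment/compare operations with no multiplications.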

29 citations


Book ChapterDOI
20 Aug 2009
TL;DR: This paper reports on a case study, which is believed to be the first to produce a formally verified end-to-end implementation of a functional programming language running on commercial processors.
Abstract: This paper reports on a case study, which we believe is the first to produce a formally verified end-to-end implementation of a functional programming language running on commercial processors. Interpreters for the core of McCarthy's LISP 1.5 were implemented in ARM, x86 and PowerPC machine code, and proved to correctly parse, evaluate and print LISP s-expressions. The proof of evaluation required working on top of verified implementations of memory allocation and garbage collection. All proofs are mechanised in the HOL4 theorem prover.
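The flavor of the verified artifact can be conveyed by an (unverified!) Python sketch of parsing and evaluating a LISP 1.5-style core (quote, atom, eq, car, cdr, cons, cond); the paper's interpreters are machine code for ARM, x86, and PowerPC proved correct in HOL4, not anything like this.

```python
def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Read one s-expression from the token list (consumed in place)."""
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(parse(tokens))
        tokens.pop(0)
        return lst
    return tok

def ev(x):
    """Evaluate the seven-operator core; atoms are strings, lists are forms."""
    op, args = x[0], x[1:]
    if op == "quote":
        return args[0]
    if op == "atom":
        return "t" if isinstance(ev(args[0]), str) else "nil"
    if op == "eq":
        return "t" if ev(args[0]) == ev(args[1]) else "nil"
    if op == "car":
        return ev(args[0])[0]
    if op == "cdr":
        return ev(args[0])[1:]
    if op == "cons":
        return [ev(args[0])] + ev(args[1])
    if op == "cond":
        for test, branch in args:
            if ev(test) == "t":
                return ev(branch)
    raise ValueError("unknown operator: " + str(op))

expr = parse(tokenize("(car (cons (quote a) (quote (b c))))"))
```

What the paper proves, and this sketch does not, is that the machine-code versions of exactly these operations (plus allocation and garbage collection underneath cons) behave according to the LISP semantics.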

21 citations


Journal ArticleDOI
TL;DR: The results show that the modified hybrid FME algorithm, on average, outperforms previous state-of-the-art FME algorithms, while its losses compared with full-search motion estimation (FSME), in terms of PSNR performance and computation time, are insignificant.
Abstract: There is an increasing need for high-quality video on low-power, portable devices. Possible target applications range from entertainment and personal communications to security and health care. While H.264/AVC answers the need for high-quality video at lower bit rates, it is significantly more complex than previous coding standards and thus results in greater power consumption in practical implementations. In particular, motion estimation (ME) consumes the most power in an H.264/AVC encoder. It is therefore critical to speed up integer ME in H.264/AVC via fast motion estimation (FME) algorithms and hardware acceleration. In this paper, we present our hardware-oriented modifications to a hybrid FME algorithm, our architecture based on the modified algorithm, and our implementation and prototype on a PowerPC-based Field Programmable System on Chip (FPSoC). Our results show that the modified hybrid FME algorithm, on average, outperforms previous state-of-the-art FME algorithms, while its losses compared with full-search motion estimation (FSME), in terms of PSNR performance and computation time, are insignificant. We show that although our implementation platform is FPGA-based, our implementation results compare favourably with previous architectures implemented on ASICs. Finally, we also show an improvement over some existing architectures implemented on FPGAs.
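FME algorithms approximate the exhaustive full search, which for each block minimizes a sum-of-absolute-differences (SAD) cost over a search window. A baseline full-search sketch (Python, purely illustrative; the paper's architecture performs this in hardware, and real encoders search subsampled patterns instead):

```python
def sad(block, frame, bx, by, n):
    """Sum of absolute differences between the n*n `block` and the n*n
    region of `frame` whose top-left corner is (bx, by)."""
    return sum(abs(block[j][i] - frame[by + j][bx + i])
               for j in range(n) for i in range(n))

def full_search(block, frame, n, sr):
    """Exhaustive integer-pel search over a +/-sr window (frame must be at
    least (n + 2*sr) on each side); returns the best (dx, dy) displacement."""
    best = None
    for dy in range(0, 2 * sr + 1):
        for dx in range(0, 2 * sr + 1):
            cost = sad(block, frame, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx - sr, dy - sr)
    return best[1], best[2]
```

Fast algorithms keep PSNR close to this exhaustive baseline while evaluating only a small fraction of the candidate positions, which is where the power savings come from.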

17 citations


Journal ArticleDOI
TL;DR: This paper developed self-test programs for single and double precision FP units on 32-bit and 64-bit microprocessor architectures and evaluated them with respect to the requirements of low-cost online periodic self-testing: fault coverage, memory footprint, execution time, and power consumption, assuming different memory hierarchy configurations.
Abstract: Online periodic testing of microprocessors is a valuable means to increase the reliability of a low-cost system when neither hardware nor time redundant protection schemes can be applied. This is particularly valid for floating-point (FP) units, which are becoming more common in embedded systems and are usually protected from operational faults through costly hardware-redundant approaches. In this paper, we present scalable instruction-based self-test program development for both single and double precision FP units considering different instruction sets (MIPS, PowerPC, and Alpha), different microprocessor architectures (32/64-bit architectures) and different memory configurations. Moreover, we introduce bit-level manipulation instruction sequences that are essential for the development of FP units' self-test programs. We developed self-test programs for single and double precision FP units on 32-bit and 64-bit microprocessor architectures and evaluated them with respect to the requirements of low-cost online periodic self-testing: fault coverage, memory footprint, execution time, and power consumption, assuming different memory hierarchy configurations. Our comprehensive experimental evaluations reveal that the instruction set architecture plays a significant role in the development of self-test programs. Additionally, we suggest the most suitable self-test program development approach when memory footprint or low power consumption is of paramount importance.

Journal ArticleDOI
TL;DR: The goals and status of the ParaM project are described, along with the development of signal and image processing applications that use ParaM; its MPI binding, bcMPI, has achieved 60% of the bandwidth of an equivalent C/MPI benchmark.
Abstract: Software engineering studies have shown that programmer productivity is improved through the use of computational science integrated development environments (or CSIDEs, pronounced "sea side") such as MATLAB. Scientists often desire to use high-performance computing (HPC) systems to run their existing CSIDE scripts with large data sets. ParaM is a CSIDE distribution that provides parallel execution of MATLAB scripts on HPC systems at large shared computer centers. ParaM runs on a range of processor architectures (e.g., x86, x64, Itanium, PowerPC) and its MPI binding, known as bcMPI, supports a number of interconnect architectures (e.g., Myrinet and InfiniBand). On a cluster at the Ohio Supercomputer Center, bcMPI with blocking communication has achieved 60% of the bandwidth of an equivalent C/MPI benchmark. In this paper, we describe the goals and status of the ParaM project and the development of applications in signal and image processing that use ParaM.

Proceedings ArticleDOI
10 May 2009
TL;DR: An ATCA-based computation platform for data acquisition and trigger (TDAQ) applications has been developed for multiple future projects such as PANDA, HADES, and BESIII, and a hardware/software co-design approach is proposed to ease and accelerate development for different experiments.
Abstract: An ATCA-based computation platform for data acquisition and trigger (TDAQ) applications has been developed for multiple future projects such as PANDA, HADES, and BESIII. Each Compute Node (CN) appears as one of the fourteen Field Replaceable Units (FRUs) in an ATCA shelf, which in total features 1890 Gbps of inter-FPGA onboard channels, 1456 Gbps of inter-board backplane connections, 728 Gbps of full-duplex optical links, 70 Gbps of Ethernet, 140 GBytes of DDR2 SDRAM, and the computing resources of 70 Xilinx Virtex-4 FX60 FPGAs. Corresponding to the system architecture, a hardware/software co-design approach is proposed to ease and accelerate development for different experiments. In the uniform system design, application-specific computation is implemented as customized hardware co-processors, while the embedded PowerPC processor takes charge of flexible slow controls and transmission protocol processing.

Proceedings ArticleDOI
30 Aug 2009
TL;DR: This paper presents a direct implementation of delimited control operators shift and reset in the MinCaml compiler and shows all the details of how composable continuations can be implemented in the PowerPC microprocessor using a stack discipline.
Abstract: Although delimited control operators are becoming one of the useful tools to manipulate flow of programs, their direct and compiled implementation in a low-level language has not been proposed so far. The only direct and low-level implementations available are Gasbichler and Sperber's implementation in the Scheme 48 virtual machine and Kiselyov's implementation in the OCaml bytecode. Even though these implementations do provide insight into how stack frames are composed, they are not directly portable to compiled implementation in assembly language. This paper presents a direct implementation of delimited control operators shift and reset in the MinCaml compiler. It shows all the details of how composable continuations can be implemented in the PowerPC microprocessor using a stack discipline. We also show an implementation that copies stack frames lazily. To our knowledge, this is the first implementation of shift/reset in assembly language. It makes clear at the assembly language level what we have informally described so far, such as "copying and composing stack frames" and "inserting a reset mark when captured continuations are called". We demonstrate various benchmarks to show the performance of our implementation and discuss its pros and cons.

Proceedings ArticleDOI
18 May 2009
TL;DR: This work investigates how bioinformatics applications benefit from parallel architectures that combine different alternatives to exploit coarse- and fine-grain parallelism, and shows that a shared-memory architecture like the PowerPC 970MP of MareNostrum can surpass a heterogeneous machine like the current Cell BE.
Abstract: The exponential growth of databases containing biological information (such as protein and DNA data) demands great efforts to improve the performance of computational platforms. In this work we investigate how bioinformatics applications benefit from parallel architectures that combine different alternatives to exploit coarse- and fine-grain parallelism. As a case study, we analyze the performance behavior of the Ssearch application, which implements the Smith-Waterman algorithm, a dynamic programming approach that explores the similarity between a pair of sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP, and ILP). We study how this algorithm can take advantage of different parallel machines, such as the SGI Altix, IBM Power6, Cell BE machines, and MareNostrum. Our results show that a shared-memory architecture like the PowerPC 970MP of MareNostrum can surpass a heterogeneous machine like the current Cell BE. Our quantitative analysis includes not only a scalability study in terms of speedup, but also an analysis of bottlenecks in the execution of the application. This analysis is carried out through the study of the execution phases that the application presents.
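For reference, the Smith-Waterman kernel inside Ssearch is a local-alignment dynamic program over a scoring matrix; a minimal scoring-only sketch (illustrative parameters, linear gap penalty, no traceback):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences `a` and `b`.
    H[i][j] is the best score of an alignment ending at a[i-1], b[j-1];
    the max(0, ...) resets negative prefixes, which makes it local."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The anti-diagonals of H are independent, which is the parallelism (TLP across sequence pairs, DLP/ILP within the matrix) that the studied architectures exploit in different ways.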

Journal ArticleDOI
01 Jan 2009
TL;DR: The design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors, is presented, and it is explained how the standard open source implementation of Linpack was modified to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors.
Abstract: In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double precision floating point capability of 108.8 GFLOPS. We explain how we modified the standard open source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.
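The key computational kernel in Linpack is a blocked matrix-matrix update; tiling is what lets each sub-block fit in an SPE's small local store and be streamed in by DMA. A pure-Python sketch of the blocking idea (not the tuned SPE kernel; tile size `nb` is arbitrary here):

```python
def blocked_matmul(A, B, n, nb):
    """C = A * B for n*n row-major matrices, processed in nb*nb tiles so
    each tile's working set stays in fast local memory (the idea behind
    SPE blocking; a real kernel would also double-buffer the DMA)."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

Because each nb×nb tile of C is reused across the whole k loop, the arithmetic-to-memory-traffic ratio grows with nb, which is what turns a bandwidth-bound triple loop into a compute-bound kernel.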

Book ChapterDOI
26 Jun 2009
TL;DR: A novel technique is presented to create a system in which, for the cost of writing just one specification, an interpreter for the programming language of interest obtains automatically-generated, mutually-consistent implementations of all three symbolic-analysis primitives.
Abstract: The paper presents a novel technique to create implementations of the basic primitives used in symbolic program analysis: forward symbolic evaluation, weakest liberal precondition, and symbolic composition. We used the technique to create a system in which, for the cost of writing just one specification—an interpreter for the programming language of interest—one obtains automatically-generated, mutually-consistent implementations of all three symbolic-analysis primitives. This can be carried out even for languages with pointers and address arithmetic. Our implementation has been used to generate symbolic-analysis primitives for x86 and PowerPC.
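The first primitive, forward symbolic evaluation, can be sketched as an interpreter that maps variables to expression trees instead of concrete values; the mini-language below is hypothetical and far simpler than x86 or PowerPC, where the same idea must also handle memory and flags.

```python
def sym_eval(stmts, store):
    """Forward symbolic evaluation of straight-line assignments.
    Expressions are tuples: ('var', name) | ('const', n) | ('add', e1, e2).
    Executing `name := rhs` rewrites variables in rhs to their current
    symbolic values, then binds the result to `name`."""
    def subst(expr):
        kind = expr[0]
        if kind == 'var':
            return store.get(expr[1], expr)  # unbound vars stay symbolic
        if kind == 'const':
            return expr
        return (expr[0], subst(expr[1]), subst(expr[2]))
    for name, rhs in stmts:
        store[name] = subst(rhs)
    return store

# x := x + 1; y := x   (x's initial value remains symbolic)
prog = [('x', ('add', ('var', 'x'), ('const', 1))),
        ('y', ('var', 'x'))]
out = sym_eval(prog, {})
```

The paper's point is that this evaluator, weakest-precondition computation, and symbolic composition can all be generated from one interpreter specification rather than written (and kept consistent) by hand.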

Proceedings ArticleDOI
26 Apr 2009
TL;DR: QUICK is introduced, an implementation of a full-system Complete-and-Rollback functional model that supports the x86 and PowerPC ISAs, boots unmodified Windows XP and Linux, and runs unmodified applications such as YouTube on Internet Explorer while fully supporting rollbacks, including across I/O operations.
Abstract: In this paper, we introduce the concept of full-system Complete-and-Rollback functional simulators that make efficient functional models in functional/timing partitioned simulators. Complete-and-Rollback functional simulators can efficiently drive simulators of resolutions ranging from functional-only to cycle-accurate for a wide range of simulated machines. Complete-and-Rollback functional models achieve their capabilities by executing instructions to completion, enabling their execution to be highly optimized, but providing rollback capabilities to enable on-the-fly modifications to the functional execution. We also introduce QUICK, an implementation of a full-system Complete-and-Rollback functional model that supports the x86 and PowerPC ISAs, boots unmodified Windows XP and Linux, and runs unmodified applications such as YouTube on Internet Explorer while fully supporting rollbacks, including across I/O operations. We present various case studies using QUICK and conduct performance analyses to demonstrate its simulation performance.
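A Complete-and-Rollback functional model can be caricatured as an interpreter that snapshots architectural state, runs ahead to completion at full speed, and restores the snapshot when the timing model demands a different path. A toy sketch (register file only; QUICK's actual mechanism, covering full-system state and I/O, is far more elaborate):

```python
import copy

class RollbackModel:
    """Toy functional model: execute to completion, but keep checkpoints so
    the timing model can roll execution back and re-run it differently."""
    def __init__(self):
        self.regs = {'r0': 0}
        self.checkpoints = []

    def checkpoint(self):
        self.checkpoints.append(copy.deepcopy(self.regs))

    def step(self, dst, value):
        # stand-in for executing one instruction to completion
        self.regs[dst] = value

    def rollback(self):
        self.regs = self.checkpoints.pop()

m = RollbackModel()
m.checkpoint()
m.step('r0', 42)   # speculative run-ahead
m.rollback()       # timing model chose a different path
```

The appeal of the design is that the common case (no rollback) pays almost nothing, so the functional model can stay highly optimized while still serving cycle-accurate timing models.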

Proceedings ArticleDOI
05 Jul 2009
TL;DR: A data acquisition and monitoring platform using the Modbus/RTU master protocol, based on an embedded PowerPC and the embedded Linux operating system, is designed in this paper; it realizes industrial-field functions such as data acquisition, remote monitoring, and network communication.
Abstract: The Modbus protocol is widely used in the industrial control field because of its reliability, flexibility, and real-time performance. It has become a de facto industrial standard, and many industrial devices—such as PLCs, DCSs, and intelligent instruments—use Modbus as their communication protocol. Embedded systems focus on applications and can adapt to strict requirements on functionality, reliability, cost, size, and power consumption. The PowerPC processor series from Freescale Semiconductor is an ideal platform for RISC embedded applications, with strong communication capability, system stability, and interference rejection. Based on Freescale's MPC8248 embedded processor, this paper designs a data acquisition and monitoring platform that uses the Modbus/RTU master protocol on embedded PowerPC hardware and the embedded Linux operating system; it realizes industrial-field functions such as data acquisition, remote monitoring, and network communication.
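Each Modbus/RTU frame carries a CRC-16 checksum that a master implementation must compute and verify; the standard algorithm (init 0xFFFF, reflected polynomial 0xA001, CRC appended low byte first) can be sketched as:

```python
def crc16_modbus(frame: bytes) -> int:
    """CRC-16/MODBUS over the frame's address, function, and data bytes."""
    crc = 0xFFFF
    for byte in frame:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc

def append_crc(frame: bytes) -> bytes:
    """Build the wire format: frame followed by CRC, low byte first."""
    crc = crc16_modbus(frame)
    return frame + bytes([crc & 0xFF, crc >> 8])
```

On receive, the same routine run over the whole frame including the two CRC bytes yields zero for an undamaged frame, which is a convenient validity check.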

Book ChapterDOI
01 Jan 2009
TL;DR: A methodology to model the power and energy consumption of Inter-Process Communications (IPC) is introduced and illustrated by building and using a model for Ethernet-based inter-process communications.
Abstract: The aim of our work is to provide methods and tools to quickly estimate power consumption in the first steps of a system design. We introduce multi-level power models and show how to use them at different levels of specification refinement in the model-based AADL (Architecture & Analysis Design Language) design flow. These power models, with the underlying methodology for power estimation, are currently being integrated into the Open Source AADL Tool Environment (OSATE) under the name CAT: Consumption Analysis Toolbox. In the case of a processor binding, its first prototype gives power consumption estimations for software components in the AADL component assembly model, with a maximal error ranging roughly from 5% at the lowest refinement level (the source code of the software component is known) to 30% at the highest level (only the operating frequency and basic target configuration parameters are considered). We illustrate our approach with the power models of a simple RISC processor (PowerPC 405), a complex DSP (TI C62), and an FPGA (from Altera), and show how these models can be used at different levels in the AADL flow. Obviously, the power consumption of Operating System (OS) services must also be considered; we show that the OS's principal impact on overall consumption is mainly due to services implying data transfers. We introduce a methodology to model the power and energy consumption of Inter-Process Communications (IPC), and illustrate this methodology by building and using a model for Ethernet-based inter-process communications.
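At the coarsest refinement level, such power models reduce to summing power times time over execution phases, including the IPC phases the OS contributes. A deliberately simplified sketch (the phase figures below are hypothetical, not taken from CAT's models):

```python
def estimate_energy(phases):
    """Energy in joules as the sum of power * time over execution phases;
    `phases` is a list of (power_watts, time_seconds) pairs. This is a
    gross simplification of multi-level power modeling."""
    return sum(p_watts * t_secs for p_watts, t_secs in phases)

# hypothetical: compute at 2.5 W for 1.2 s, Ethernet IPC at 3.1 W for 0.4 s
energy = estimate_energy([(2.5, 1.2), (3.1, 0.4)])
```

Refinement then means replacing the assumed per-phase power figures with values derived from progressively more detailed component knowledge, which is what shrinks the error from ~30% to ~5%.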

Proceedings ArticleDOI
09 Dec 2009
TL;DR: A runtime memory allocation algorithm, that aims to substantially reduce the overhead caused by shared-memory accesses by allocating memory directly in the local scratch pad memories of the heterogeneous platform, is presented.
Abstract: In this paper, we present a runtime memory allocation algorithm that aims to substantially reduce the overhead caused by shared-memory accesses by allocating memory directly in the local scratch-pad memories. We target a heterogeneous platform with a complex memory hierarchy. Using special instrumentation, we determine what memory areas are used in functions that could run on different processing elements, such as a reconfigurable logic array. Based on profile information, the programmer annotates some functions as candidates for accelerated execution. An algorithm then decides the best allocation, taking into account the various processing elements and special scratch-pad memories of the heterogeneous platform. Tests are performed on our prototype platform, a Virtex ML410 board running the Linux operating system, containing a PowerPC processor and a Xilinx FPGA, implementing the MOLEN programming paradigm. We test the algorithm using both the state-of-the-art H.264 video encoder and other synthetic applications. The performance improvement for the H.264 application is 14% compared to the software-only version, while the overhead is less than 1% of the application execution time. This improvement is the optimum that can be obtained by optimizing the memory allocation. For the synthetic applications, the results are within 5% of the optimum.
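The allocation decision can be caricatured as a knapsack-style choice over the scratch-pad capacity; the greedy sketch below (hypothetical buffer names and figures) picks candidates by cycles saved per byte, a simplification of the paper's algorithm, which also weighs the different processing elements and memory levels.

```python
def allocate_scratchpad(candidates, capacity):
    """Greedy pick by benefit density (cycles saved per byte).
    `candidates` are (name, size_bytes, cycles_saved) tuples; returns the
    names placed in the scratch-pad, in the order they were chosen."""
    chosen, used = [], 0
    for name, size, saved in sorted(candidates,
                                    key=lambda c: c[2] / c[1], reverse=True):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

# hypothetical profile data for three buffers of an encoder
cands = [("lumabuf", 4096, 8000), ("mvcache", 1024, 6000), ("huge", 8192, 9000)]
picked = allocate_scratchpad(cands, 6144)
```

Greedy density is not optimal in general (it is a knapsack heuristic), which is consistent with the paper reporting results within a few percent of the optimum rather than exactly at it.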

Proceedings ArticleDOI
28 Dec 2009
TL;DR: This paper presents an approach for dynamic partial self-reconfiguration that enables FPGAs to reconfigure themselves dynamically and partially under the control of an external processor.
Abstract: Field Programmable Gate Arrays (FPGAs) are increasingly being used for many systems and efficient System-on-a-Chip (SOC) designs. Hence, dynamic partial self-reconfiguration (DPSR) of the FPGA can be regarded as one of the essentials of making hardware flexible, achieving power efficiency, and optimizing area. This paper presents an approach for dynamic partial self-reconfiguration that enables FPGAs to reconfigure themselves dynamically and partially under the control of an external processor. The reconfiguration process is accomplished without an internal configuration access port (ICAP), which would otherwise be used with either a MicroBlaze soft core or a PowerPC hard core through the HWICAP core on the On-Chip Peripheral Bus (OPB). The approach can also be applied to other FPGA architectures, such as the Virtex-II (Pro), Virtex-4, and Virtex-5.

Dissertation
01 Jan 2009
TL;DR: This thesis aims to establish a framework for partial self-reconfiguration on an FPGA, showing that it is possible and how it can be done, though more research is needed to further simplify and enhance the framework.
Abstract: Partially self-reconfigurable hardware has not yet become mainstream, even though the technology is available. FPGA manufacturers such as Xilinx currently offer devices capable of partial self-reconfiguration. These and earlier FPGA devices were used mostly for prototyping and testing of designs before producing ASICs, since FPGAs were too expensive to be used in final production designs. Now that prices for these devices are coming down, it is more and more common to see them in consumer devices such as routers and switches, where protocols can change quickly. Using an FPGA in these devices, the manufacturer has the possibility to update the device when there are protocol updates or bugs in the design. Currently, however, such reconfiguration replaces the complete design, not just the modules that are needed. The main reason partial self-reconfiguration is not used today is the lack of tools to simplify the design and usage of such a system. In this thesis, different aspects of partial self-reconfiguration are evaluated: the current state of research is surveyed, and a proof of concept incorporating most of this research is created, attempting to establish a framework for partial self-reconfiguration on an FPGA. The work uses the Suzaku-V platform, which contains a Virtex-II or Virtex-4 FPGA from Xilinx. To enable partial reconfiguration of these FPGAs, the configuration logic and configuration bitstream were studied; based on this understanding of the bitstream, a program was developed that can read out or insert modules in a bitstream. In the proof of concept, partial reconfiguration is controlled by a CPU on the FPGA running Linux. Running Linux on the CPU simplifies many aspects of development, since many programs and communication methods are readily available.
Partial self-reconfiguration on an FPGA with a hard-core PowerPC running Linux is a complicated task to solve. Many problems were encountered in this work; hopefully many of these issues have been addressed and answered, simplifying further work. This is only the beginning, showing that it is possible and how it can be done, but more research must be done to further simplify and enhance the framework.
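The bitstream tool the thesis describes, which can read out or insert modules in a bitstream, can be modeled at its simplest as splicing a contiguous byte range. Real Virtex-II/Virtex-4 bitstreams carry packet headers, frame addresses, and CRCs, all of which this toy model deliberately omits; the offsets and sizes are invented for illustration.

```python
# Illustrative sketch of a bitstream module read-out/insert tool: a
# "module" is treated as a contiguous byte range of configuration data.
# Real bitstreams require parsing configuration packets and fixing CRCs,
# which is exactly the hard part the thesis had to research.

def read_module(bitstream, offset, length):
    # Extract the bytes belonging to one reconfigurable module.
    return bitstream[offset:offset + length]

def insert_module(bitstream, offset, module):
    # Splice a module of equal length back into the bitstream.
    if offset + len(module) > len(bitstream):
        raise ValueError("module does not fit in bitstream")
    return bitstream[:offset] + module + bitstream[offset + len(module):]

full = bytes(range(16))           # stand-in for a configuration bitstream
mod = read_module(full, 4, 4)     # extract a 4-byte "module"
patched = insert_module(full, 4, bytes([0xAA]) * 4)
```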

Proceedings ArticleDOI
09 Dec 2009
TL;DR: A hardware architecture for computing direct kinematics of robot manipulators using floating-point arithmetic is presented for 32, 43 and 64 bit-width representations and Synthesis and simulation results demonstrate the accuracy and high performance of the implemented hardware architecture.
Abstract: The sequential behavior of general-purpose processors limits applications that require high processing speeds. One advantage of FPGA implementations is their capability for parallel processing, which allows complex algorithms to be accelerated, and it is now common to find FPGA implementations in applications requiring high-speed processing. In this paper, a hardware architecture for computing the direct kinematics of robot manipulators using floating-point arithmetic is presented for 32-, 43- and 64-bit-width representations. In addition, the processing time of the hardware architecture is compared with the same formulation implemented in software on the PowerPC (the FPGA's embedded processor). The proposed architecture was validated against Matlab results, used as a statistical reference to compute the Mean Square Error (MSE). Synthesis and simulation results demonstrate the accuracy and high performance of the implemented hardware architecture.
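The validation step described above can be illustrated by computing forward kinematics in full double precision (the Matlab-style reference) and in a reduced 32-bit datapath, then reporting the MSE. The two-link planar arm, link lengths, and angle sweep below are illustrative assumptions; the paper's manipulator and its custom 43-bit format are not reproduced here.

```python
# Sketch of MSE validation for a reduced-precision kinematics datapath.
# to_f32 emulates a 32-bit floating-point unit by rounding every
# intermediate result through float32.
import math
import struct

def to_f32(x):
    # Round a Python double through IEEE-754 single precision.
    return struct.unpack('f', struct.pack('f', x))[0]

def fk(theta1, theta2, l1=0.5, l2=0.3, narrow=lambda x: x):
    # Forward kinematics of a 2-link planar arm; `narrow` models the
    # precision of each arithmetic operation in the datapath.
    c1 = narrow(math.cos(theta1))
    c12 = narrow(math.cos(theta1 + theta2))
    x = narrow(narrow(l1 * c1) + narrow(l2 * c12))
    s1 = narrow(math.sin(theta1))
    s12 = narrow(math.sin(theta1 + theta2))
    y = narrow(narrow(l1 * s1) + narrow(l2 * s12))
    return x, y

angles = [(i * 0.1, i * 0.05) for i in range(100)]
ref = [fk(a, b) for a, b in angles]                  # double-precision reference
hw = [fk(a, b, narrow=to_f32) for a, b in angles]    # 32-bit emulation
mse = sum((rx - hx) ** 2 + (ry - hy) ** 2
          for (rx, ry), (hx, hy) in zip(ref, hw)) / len(ref)
```

The same harness, with a different `narrow` function, would quantify the accuracy of a 43-bit intermediate format against the 64-bit reference.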

01 Jan 2009
TL;DR: This work shows how, for machines of practical interest, to generate the back end of a compiler, using the compiler architecture developed by Davidson and Fraser (1984), which can generate a naive instruction selector and rely upon a machine-independent optimizer to improve the machine instructions.
Abstract: Although I have proven that the general problem is undecidable, I show how, for machines of practical interest, to generate the back end of a compiler. Unlike previous work on generating back ends, I generate the machine-dependent components of the back end using only information that is independent of the compiler's internal data structures and intermediate form. My techniques substantially reduce the burden of retargeting the compiler: although it is still necessary to master the target machine's instruction set, it is not necessary to master the data structures and algorithms in the compiler's back end. Instead, the machine-dependent knowledge is isolated in the declarative machine descriptions. The largest machine-dependent component in a back end is the instruction selector. Previous work has shown that it is difficult to generate a high-quality instruction selector. But by adopting the compiler architecture developed by Davidson and Fraser (1984), I can generate a naive instruction selector and rely upon a machine-independent optimizer to improve the machine instructions. Unlike previous work, my generated back ends produce code that is as good as the code produced by hand-written back ends. My code generator translates a source program into tiles, where each tile implements a simple computation like addition. To implement the tiles, I compose machine instructions in sequence and use equational reasoning to identify sequences that implement tiles. Because it is undecidable whether a tile can be implemented, I use a heuristic to limit the set of sequences considered. Unlike standard heuristics, which may limit the length of a sequence, the number of sequences considered, or the complexity of the result computed by a sequence, my heuristic uses a new idea: to limit the amount of reasoning required to show that a sequence of instructions implements a tile. 
The limit, which is chosen empirically, enables my search to find instruction selectors for the x86, PowerPC, and ARM in a few minutes each.
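The search described above can be caricatured in a few lines: enumerate instruction sequences up to a bound and keep the first one that implements the tile. The three-instruction machine below is invented for illustration, and equivalence is checked on sample inputs rather than by the thesis's equational reasoning; the bound on sequence length stands in for its limit on reasoning effort.

```python
# Toy instruction selector: find a sequence of "machine instructions"
# (functions on a single accumulator) that implements a tile, using a
# bounded breadth-first search over sequences.
from itertools import product

INSTRS = {
    'neg': lambda a: -a,      # negate accumulator
    'inc': lambda a: a + 1,   # add one
    'dbl': lambda a: a * 2,   # double
}

def implements(seq, tile, samples):
    # Check the sequence against the tile on sample inputs.
    for x in samples:
        acc = x
        for name in seq:
            acc = INSTRS[name](acc)
        if acc != tile(x):
            return False
    return True

def select(tile, max_len=3, samples=(0, 1, 2, 7, -3)):
    # Shortest sequences first; give up beyond the length limit.
    for length in range(1, max_len + 1):
        for seq in product(INSTRS, repeat=length):
            if implements(seq, tile, samples):
                return list(seq)
    return None

# Tile: f(x) = -(2x + 1), which no single instruction implements.
seq = select(lambda x: -(2 * x + 1))
```

Because every instruction here is affine, agreement on two sample inputs already implies true equivalence; for a real ISA that shortcut fails, which is why the thesis needs symbolic reasoning instead of testing.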

Proceedings ArticleDOI
16 Dec 2009
TL;DR: This paper presents design and implementation of a Data Acquisition System (DAS) in Field Programmable Gate Arrays (FPGA) using System on Chip (SoC) methodology and a suitable framing interface along with error detection is proposed and interfaced with Xilinx Aurora IP core.
Abstract: This paper presents the design and implementation of a Data Acquisition System (DAS) in Field Programmable Gate Arrays (FPGAs) using a System-on-Chip (SoC) methodology. To ensure the reliability of data transmitted over the channel, a suitable framing interface with error detection is proposed and interfaced with the Xilinx Aurora IP core. The proposed DAS is capable of transmitting data at 1.25 Gbps over the channel. Its main advantage is that configuration and monitoring are done using the FPGA's built-in PowerPC processor through a Universal Synchronous Asynchronous Receiver Transmitter (UART). This type of DAS is suitable for multi-channel acoustic data acquisition and for Electronic Warfare systems.
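A framing layer with error detection of the kind proposed above typically wraps each payload in a sync word, a length field, and a checksum. The field layout and the choice of CRC-16-CCITT below are illustrative assumptions, not the paper's actual frame format.

```python
# Hedged sketch of a frame format with error detection: sync word,
# 16-bit length, payload, then CRC-16-CCITT over the payload. The
# receiver rejects frames with a bad sync word or CRC mismatch.
import struct

SYNC = 0xA55A

def crc16_ccitt(data, crc=0xFFFF):
    # Bitwise MSB-first CRC-16 with polynomial 0x1021.
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def frame(payload):
    header = struct.pack('>HH', SYNC, len(payload))
    return header + payload + struct.pack('>H', crc16_ccitt(payload))

def deframe(data):
    sync, length = struct.unpack('>HH', data[:4])
    if sync != SYNC:
        raise ValueError("bad sync word")
    payload = data[4:4 + length]
    (crc,) = struct.unpack('>H', data[4 + length:6 + length])
    if crc != crc16_ccitt(payload):
        raise ValueError("CRC mismatch")
    return payload

msg = frame(b"acoustic sample block")
```

In the paper's system the framing sits between the acquisition logic and the Aurora link-layer core; here the serial link is simply a byte string.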

Proceedings ArticleDOI
06 Apr 2009
TL;DR: A watershed-based segmentation algorithm is implemented on a Virtex-II Pro platform, exploiting the embedded PowerPC processor, with low execution time and minimal internal FPGA resource consumption.
Abstract: Watershed transformation is a powerful technique that can be used efficiently for image segmentation. In this paper, we implement a watershed-based segmentation algorithm on a Virtex-II Pro platform. The main contribution of this work is the low execution time and minimal internal FPGA resource consumption. The proposed architecture includes two main blocks. First, a gradient of the image is generated using morphological operators (dilation and erosion). Then, the watershed is applied to the resulting image based on the immersion principle. The implementation was optimized with respect to hardware resource occupation and speed, based on a codesign methodology. Our approach exploits the software potential of the Virtex-II Pro platform enabled by the embedded PowerPC processor: first, at a high design level, the whole design ran on the PowerPC; then, the optimal design was obtained by analyzing the timing of the different portions of the algorithm and implementing the time-intensive parts in hardware. This strategy leads to acceptable hardware resource occupation and a maximum frequency of approximately 100 MHz. As an illustration, we apply our design to the segmentation of the Cameraman image.
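The first block of the architecture, the morphological gradient, is dilation minus erosion: for each pixel, the local maximum minus the local minimum over a structuring element. A minimal sketch, assuming a 3x3 structuring element with clamped borders and a tiny invented image:

```python
# Morphological gradient = dilation (local max) - erosion (local min)
# over a 3x3 neighborhood. Region boundaries get high gradient values,
# which is what the immersion-based watershed then floods.

def neighborhood(img, r, c):
    # 3x3 window around (r, c), clamped at the image borders.
    h, w = len(img), len(img[0])
    return [img[i][j]
            for i in range(max(0, r - 1), min(h, r + 2))
            for j in range(max(0, c - 1), min(w, c + 2))]

def gradient(img):
    return [[max(neighborhood(img, r, c)) - min(neighborhood(img, r, c))
             for c in range(len(img[0]))] for r in range(len(img))]

img = [[10, 10, 10, 80],
       [10, 10, 80, 80],
       [10, 80, 80, 80]]   # two flat regions with a diagonal boundary
grad = gradient(img)
```

Flat regions yield zero gradient and boundary pixels a large one, so the watershed's catchment basins form inside the regions rather than across them.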

Book ChapterDOI
19 Feb 2009
TL;DR: The Virtual Vector Architecture (ViVA), which combines the memory semantics of vector computers with a software-controlled scratchpad memory in order to provide a more effective and practical approach to latency hiding, is presented.
Abstract: The disparity between microprocessor clock frequencies and memory latency is a primary reason why many demanding applications run well below peak achievable performance. Software-controlled scratchpad memories, such as the Cell local store, attempt to ameliorate this discrepancy by enabling precise control over memory movement; however, scratchpad technology confronts the programmer and compiler with an unfamiliar and difficult programming model. In this work, we present the Virtual Vector Architecture (ViVA), which combines the memory semantics of vector computers with a software-controlled scratchpad memory in order to provide a more effective and practical approach to latency hiding. ViVA requires minimal changes to the core design and could thus be easily integrated with conventional processor cores. To validate our approach, we implemented ViVA on the Mambo cycle-accurate full-system simulator, which was carefully calibrated to match the performance of our underlying PowerPC Apple G5 architecture. Results show that ViVA is able to deliver significant performance benefits over scalar techniques for a variety of memory access patterns as well as two important memory-bound compact kernels, corner turn and sparse matrix-vector multiplication, achieving a 2x-13x improvement compared to the scalar version. Overall, our preliminary ViVA exploration points to a promising approach for improving application performance on leading microprocessors with minimal design and complexity costs, in a power-efficient manner.
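The ViVA idea can be pictured as vector-style loads that fill a software-controlled buffer (standing in for the scratchpad), after which the core computes on buffered operands. The sketch below shows this on sparse matrix-vector multiplication in CSR form, one of the paper's kernels; the `vgather` API is an invented illustration, not ViVA's actual instruction set.

```python
# Conceptual model: a vector gather "instruction" moves indexed data
# from memory into a scratchpad-like buffer; compute then runs over
# the buffered operands, hiding individual element latencies.

def vgather(memory, indices):
    # One vector load fills a buffer from arbitrary indices.
    return [memory[i] for i in indices]

def spmv_csr(values, col_idx, row_ptr, x):
    # y = A @ x for a CSR matrix (values, col_idx, row_ptr).
    y = []
    for r in range(len(row_ptr) - 1):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        vals = vgather(values, range(lo, hi))   # buffered matrix entries
        xs = vgather(x, col_idx[lo:hi])         # buffered source-vector entries
        y.append(sum(v * xv for v, xv in zip(vals, xs)))
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form, times x = [1, 1, 1]
y = spmv_csr([1, 2, 3], [0, 2, 1], [0, 2, 3], [1, 1, 1])
```

The irregular `col_idx` accesses are exactly the pattern where a hardware gather into a scratchpad beats issuing scalar loads one cache miss at a time.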

01 Jan 2009
TL;DR: The main objective of this thesis work is to write a translator to ALF from CRL2, a version of CRL (Control flow Representation Language), a format that represents various types of object code, including PowerPC assembler code, in terms of control flow graphs.
Abstract: Real-time systems are systems that must give accurate results within a precise time period. These systems have now become an indispensable aspect of our day-to-day lives. As the importance of real ...

Journal ArticleDOI
Hartmut Penner1, Utz Bacher1, Jan Kunigk1, C. Rund1, Heiko Schick1 
TL;DR: The challenges in programming, execution control, and operation on the accelerators that were faced during the design and implementation of a prototype are explained, along with solutions to overcome them, and an outlook is provided on where the directCell approach promises to better solve customer problems.
Abstract: The Cell Broadband Engine® (Cell/B.E.) processor is a hybrid IBM PowerPC® processor. In blade servers and PCI Express® card systems, it has been used primarily in a server context, with Linux® as the operating system. Because neither Linux as an operating system nor a PowerPC processor-based architecture is the preferred choice for all applications, some installations use the Cell/B.E. processor in a coupled hybrid environment, which has implications for the complexity of systems management, the programming model, and performance. In the directCell approach, we use the Cell/B.E. processor as a processing device connected to a host via a PCI Express link using direct memory access and memory-mapped I/O (input/output). The Cell/B.E. processor functions as a processor and is perceived by the host like a device while maintaining the native Cell/B.E. processor programming approach. We describe the problems with the current practice that led us to the directCell approach. We explain the challenge in programming, execution control, and operation on the accelerators that were faced during the design and implementation of a prototype and present solutions to overcome them. We also provide an outlook on where the directCell approach promises to better solve customer problems.