
Showing papers on "PowerPC published in 2013"


01 Jan 2013
TL;DR: Looking back, it’s not much of a stretch to call 2004 the year of multicore, as many companies showed new or updated multicore processors.
Abstract: The major processor manufacturers and architectures, from Intel and AMD to Sparc and PowerPC, have run out of room with most of their traditional approaches to boosting CPU performance. Instead of driving clock speeds and straight-line instruction throughput ever higher, they are instead turning en masse to hyperthreading and multicore architectures. Both of these features are already available on chips today; in particular, multicore is available on current PowerPC and Sparc IV processors, and is coming in 2005 from Intel and AMD. Indeed, the big theme of the 2004 InStat/MDR Fall Processor Forum was multicore devices, as many companies showed new or updated multicore processors. Looking back, it’s not much of a stretch to call 2004 the year of multicore.

683 citations


Proceedings ArticleDOI
29 Sep 2013
TL;DR: A novel host-compiled simulation approach that provides close to cycle-accurate estimation of energy and timing metrics in a retargetable manner, using flexible, architecture description language (ADL) based reference models is proposed.
Abstract: With traditional cycle-accurate or instruction-set simulations of processors often being too slow, host-compiled or source-level software execution approaches have recently become popular. Such high-level simulations can achieve order of magnitude speedups, but approaches that can achieve highly accurate characterization of both power and performance metrics are lacking. In this paper, we propose a novel host-compiled simulation approach that provides close to cycle-accurate estimation of energy and timing metrics in a retargetable manner, using flexible, architecture description language (ADL) based reference models. Our automated flow considers typical front- and back-end optimizations by working at the compiler-generated intermediate representation (IR). Path-dependent execution effects are accurately captured through pairwise characterization and back-annotation of basic code blocks with all possible predecessors. Results from applying our approach to PowerPC targets running various benchmark suites show that close to native average speeds of 2000 MIPS at more than 98% timing and energy accuracy can be achieved.

42 citations


Book ChapterDOI
17 Jun 2013
TL;DR: Proteus is a multi-core hypervisor for PowerPC-based embedded systems, which supports both full virtualization and paravirtualization without relying on special hardware support.
Abstract: System virtualization’s integration of multiple software stacks with maintained isolation on multi-core architectures has the potential to meet high functionality and reliability requirements in a resource efficient manner. Paravirtualization is the prevailing approach in the embedded domain. Its applicability is however limited, since not all operating systems can be ported to the paravirtualization application programming interface. Proteus is a multi-core hypervisor for PowerPC-based embedded systems, which supports both full virtualization and paravirtualization without relying on special hardware support. The hypervisor ensures spatial and temporal separation of the guest systems. The evaluation indicates a low memory footprint of 15 kilobytes and the configurability allows for an application-specific inclusion of components. The interrupt latencies and the execution times for hypercall handlers, emulation routines, and virtual machine context switches are analyzed.

26 citations


Book ChapterDOI
18 Nov 2013
TL;DR: This paper examines the performance of a suite of applications on three different architectures: Edison, a Cray XC30 with Intel Ivy Bridge processors; Hopper and Cielo, both Cray XE6’s with AMD Magny–Cours processors; and Mira, an IBM BlueGene/Q with PowerPC A2 processors.
Abstract: In this paper, we examine the performance of a suite of applications on three different architectures: Edison, a Cray XC30 with Intel Ivy Bridge processors; Hopper and Cielo, both Cray XE6’s with AMD Magny–Cours processors; and Mira, an IBM BlueGene/Q with PowerPC A2 processors. The applications chosen are a subset of the applications used in a joint procurement effort between Lawrence Berkeley National Laboratory, Los Alamos National Laboratory and Sandia National Laboratories. Strong scaling results are presented, using both MPI-only and MPI+OpenMP execution models.

17 citations


Journal ArticleDOI
TL;DR: A scalable coprocessor for accelerating the Differential Evolution (DE) algorithm is presented and results show an acceleration of 76.79-105× and 5.19-6.91× with respect to floating and fixed point DE in embedded processor.
Abstract: In this study, a scalable coprocessor for accelerating the Differential Evolution (DE) algorithm is presented. The coprocessor is interfaced with the PowerPC embedded processor of a Xilinx Virtex-5 FX70T Field Programmable Gate Array. In the proposed design, the DE algorithm module is tightly coupled with the fitness function module to reduce communication and control overhead. The fixed point DE algorithm is implemented in the coprocessor, whereas both fixed and floating point DE are implemented in the embedded processor. Performance of the coprocessor is evaluated by optimising benchmark functions of different complexities. The implementation results show that the coprocessor is 73.14-160.2× and 2.19-27.63× faster than the software execution of the floating and fixed point algorithms, respectively. As a case study, the spectrum allocation problem of a cognitive radio network is evaluated with the coprocessor. Results show an acceleration of 76.79-105× and 5.19-6.91× with respect to floating and fixed point DE in the embedded processor. It is also observed that the application occupies 56% of BRAM, 54% of DSP48E and 16% of slice LUTs, and reaches a maximum operating frequency of 63.55 MHz in a Virtex-5 FPGA. This type of coprocessor is suitable for embedded applications where the fitness function remains unchanged.
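
For readers unfamiliar with Differential Evolution, the following is a minimal floating-point software sketch of the classic DE/rand/1/bin scheme. It is not the coprocessor's fixed-point design, and the abstract does not say which DE variant the authors use; the function names, the population size, the `f` and `cr` settings and the `sphere` benchmark are all assumptions chosen for illustration.

```python
import random


def differential_evolution(fitness, bounds, pop_size=20, f=0.5, cr=0.9,
                           generations=200, seed=0):
    """Minimise `fitness` inside the box `bounds` with DE/rand/1/bin (illustrative)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial population inside the search box.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [fitness(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if j == j_rand or rng.random() < cr:
                    # Mutation: base vector plus scaled difference vector.
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clamp to the box
                else:
                    v = pop[i][j]  # crossover keeps the target's gene
                trial.append(v)
            s = fitness(trial)
            # Greedy selection: the trial replaces the target if it is no worse.
            if s <= scores[i]:
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Simple benchmark function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    best, score = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
    print(best, score)
```

The fitness evaluation sits inside the innermost loop, which is one way to see why the abstract couples the DE module tightly with the fitness function module.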

13 citations


Journal ArticleDOI
Fatma Abouelella1, Tom Davidson1, Wim Meeus1, Karel Bruneel1, Dirk Stroobandt1 
TL;DR: This article explores how to efficiently build DCS systems by presenting a variety of possible solutions for the specialization process and the overhead associated with each of them and shows that the use of the CP along with the SRL configuration achieves minimum overhead in terms of resources and time.
Abstract: Dynamic circuit specialization (DCS) is a technique used to implement FPGA applications where some of the input data, called parameters, change slowly compared to other inputs. Each time the parameter values change, the FPGA is reconfigured by a configuration that is specialized for those new parameter values. This specialized configuration is much smaller and faster than a regular configuration. However, the overhead associated with the specialization process should be minimized to achieve the desired benefits of using the DCS technique. This overhead is represented by both the FPGA resources needed to specialize the FPGA at runtime and by the specialization time. The introduction of parameterized configurations [Bruneel and Stroobandt 2008] has improved the efficiency of DCS implementations. However, the specialization overhead still takes a considerable amount of resources and time. In this article, we explore how to efficiently build DCS systems by presenting a variety of possible solutions for the specialization process and the overhead associated with each of them. We split the specialization process into two main phases: the evaluation and the configuration phase. The PowerPC embedded processor, the MicroBlaze, and a customized processor (CP) are used as alternatives in the evaluation phase. In the configuration phase, the ICAP and a custom configuration interface (SRL configuration) are used as alternatives. Each solution is used to implement a DCS system for three applications: an adaptive finite impulse response (FIR) filter, a ternary content-addressable memory (TCAM), and a regular expression matcher (RegEx). The experiments show that the use of our CP along with the SRL configuration achieves minimum overhead in terms of resources and time. Our CP is 1.8 and 3.5 times smaller than the PowerPC and the area-optimized implementation of the MicroBlaze, respectively. Moreover, the use of the CP enables a more compact representation for the parameterized configuration in comparison to both the PowerPC and the MicroBlaze processors. For instance, in the FIR, the parameterized configuration compiled for our CP is 6–7 times smaller than that for the embedded processors.

12 citations


01 Jan 2013
TL;DR: This work shows how increasing the size of a processor's instruction set, in turn, increases the amount of hardware needed to run that processor, and proposes a new measure of processor resource utilization, core density.
Abstract: The manner in which the resources of a microprocessor are used affects its performance, power consumption and size. In this work we show how increasing the size of a processor's instruction set, in turn, increases the amount of hardware needed to implement that processor. We also study how efficiently the hardware resources of four processor architectures are used by measuring the static instruction set utilization of a group of benchmark applications. The architectures examined are the Intel x86, Intel x86-64, MIPS64, and PowerPC. We introduce the notions of instruction sexact cores and general-purpose cores, and then we use these concepts to propose a new measure of processor resource utilization, core density. Based on the core density measure we show that on average 9 exact cores are equivalent to a single general-purpose core in the existing architectures and that in particular instances this multiplier can go up to 48 exact cores.

10 citations


Journal ArticleDOI
01 May 2013
TL;DR: This paper proposes a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework and describes an implementation of this method in Python targeting the IBM® Blue Gene/P supercomputer’s PowerPC® 450 core.
Abstract: Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer's PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7× speedup over the best previously published results.

9 citations


Journal Article
TL;DR: The aim of the project was to develop a simulation for detecting and classifying currency coins by machine vision using the NI 1742 Smart Camera, programmed with LabVIEW RT (Real-Time) software in conjunction with Vision Assistant 2010.
Abstract: The aim of the project was to develop a simulation for detecting and classifying currency coins by machine vision using the NI 1742 Smart Camera. The camera is powered by a 533 MHz PowerPC processor; because this dedicated processor analyzes images directly, it simplifies machine vision and greatly improves processing speed. The camera was programmed using LabVIEW RT (Real-Time) software in conjunction with Vision Assistant 2010.

8 citations


Proceedings ArticleDOI
06 May 2013
TL;DR: The proposed FC node design adopts an FPGA module with a PCI interface and a PowerPC module, maps the FC protocols onto the engineering application, and shows good performance at high-speed data transmission as well as flexible scalability.
Abstract: Fibre Channel (FC) has been widely applied in storage networks and avionics environments. It is being implemented as one kind of avionics communication architecture for a variety of next-generation aircraft. However, engineers face the challenge of realizing the complicated FC protocols in a feasible way. We therefore propose a design for an FC node. It adopts an FPGA module with a PCI interface and a PowerPC module. In this way, it can achieve the mapping from the FC protocols to the engineering application. We carried out experiments to validate data transfer rates of 1.0625 Gbps and 2.125 Gbps. The experimental results show that the proposed design has good performance at high-speed data transmission with the FC protocols, as well as flexible scalability.

7 citations


Proceedings ArticleDOI
20 May 2013
TL;DR: The Argonne Leadership Computing Facility (ALCF) is home to Mira, a 10 PF Blue Gene/Q (BG/Q) system that offers several new opportunities for tuning and scaling scientific applications.
Abstract: The Argonne Leadership Computing Facility (ALCF) is home to Mira, a 10 PF Blue Gene/Q (BG/Q) system. The BG/Q system is the third generation in Blue Gene architecture from IBM and, like its predecessors, combines system-on-chip technology with a proprietary interconnect (5-D torus). Each compute node has 16 augmented PowerPC A2 processor cores with support for simultaneous multithreading, 4-wide double precision SIMD, and different data prefetching mechanisms. Mira offers several new opportunities for tuning and scaling scientific applications. This paper discusses our early experience with a subset of micro-benchmarks, MPI benchmarks, and a variety of science and engineering applications running at ALCF. Both performance and power are studied, and results on BG/Q are compared with its predecessor BG/P. Several lessons gleaned from tuning applications on the BG/Q architecture for better performance and scalability are shared.

Proceedings ArticleDOI
24 Oct 2013
TL;DR: This paper proposes a new method of code compression for embedded systems called CC-MLD (Compressed Code using Huffman-Based Multi-Level Dictionary), which applies two compression techniques and it uses the Huffman code compression algorithm.
Abstract: This paper proposes a new method of code compression for embedded systems, which we call CC-MLD (Compressed Code using Huffman-Based Multi-Level Dictionary). The method applies two compression techniques and uses the Huffman code compression algorithm. A single dictionary is divided into two levels and is shared by both techniques. We performed simulations using applications from MiBench on four embedded processors (ARM, MIPS, PowerPC and SPARC). Our method reduces code size by up to 30.6% (including all extra costs for these four platforms). We implemented the decompressor in VHDL on an FPGA, and the decompression process takes only one clock cycle.
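
As background for the Huffman step mentioned above, here is a minimal sketch of plain Huffman coding over a stream of symbols. It does not reproduce CC-MLD's two-level dictionary, which the abstract does not specify; the function names and the toy stream of instruction mnemonics are assumptions made for illustration only.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix-free code {symbol: bit string} from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the code words of all symbols."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode a bit string; works because no code word is a prefix of another."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


if __name__ == "__main__":
    # Hypothetical instruction stream; frequent mnemonics get shorter codes.
    stream = ["add", "add", "add", "add", "lwz", "lwz", "stw", "b"]
    table = huffman_code(stream)
    packed = encode(stream, table)
    print(table, len(packed))
    assert decode(packed, table) == stream
```

Frequent symbols receive shorter code words, which is the property a code compressor exploits when some instructions or instruction fragments dominate a program.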

Journal ArticleDOI
TL;DR: XtratuM, a real-time hypervisor designed and implemented based on the concept of a partitioned system, is introduced by enabling partitions to execute simultaneously in spatial and temporal isolation without interfering with each other, but sharing the same hardware.
Abstract: High-performance processors bring both opportunities and challenges for the development of real-time and embedded applications. Advances in hardware raise new questions: how to let multiple applications share a single processor and memory so that high-performance hardware containing millions of transistors is fully utilized, and how to keep the system dependable and stable by keeping applications in spatial and temporal isolation within the same system. This paper introduces XtratuM, a real-time hypervisor designed and implemented around the concept of a partitioned system, which lets partitions execute simultaneously in spatial and temporal isolation, without interfering with each other, while sharing the same hardware. We also give a brief introduction to partitioned systems and their significance, and present the prototype implementation of XtratuM on the PowerPC architecture, including its essential parts: the hypercall, timer, interrupt, and memory management implementations. Benchmark applications were run to show that the model implemented by XtratuM can provide spatial and temporal isolation under real-time requirements.

Journal Article
TL;DR: A high-precision time-keeping model is realized by recording and gathering statistics on the intervals between continuous, valid pulse-per-second (PPS) signals, and a dynamic adjustment method for time resampling is presented, based on flexibly setting the timer interrupt period.
Abstract: The merging unit in a smart substation places rather high requirements on timing performance. Based on a concrete analysis of how a merging unit is implemented using the interpolation algorithm, and by fully exploiting the computing performance of the PowerPC together with the parallel processing capability of the FPGA, the whole system is divided into several cooperating models, and a design solution for the key links of the merging unit is presented. In this solution, a high-precision time-keeping model is realized by recording and gathering statistics on the intervals between continuous, valid pulse-per-second (PPS) signals. To synchronize with the external time signal, a dynamic adjustment method for time resampling is presented, based on flexibly setting the timer interrupt period. At the same time, equal-interval output of SV9-2 datagrams is accurately realized through the division of the output delay and the caching feature of the FPGA.

Journal ArticleDOI
TL;DR: Development environment, software structure, verification and management method of the operational flight program, which has the functions of I/O processing with avionics, flight control logic calculation, fault diagnosis and redundancy mode is described.
Abstract: The operational flight program (OFP), which provides I/O processing with avionics, flight control logic calculation, fault diagnosis and redundancy mode functions, is embedded in the flight control computer of the Smart UAV. The OFP was developed for the PowerPC 755 processor and the VxWorks 5.5 real-time operating system. The OFP consists of a memory access module, a device I/O signal processing module and a flight control logic module, and each module was designed with a hierarchical structure. The memory access and signal processing modules were verified by bench tests, and the flight control logic module was verified by hardware-in-the-loop simulation (HILS) tests, ground integration tests, tethered tests and flight tests. This paper describes the development environment, software structure, verification and management method of the OFP.

Proceedings ArticleDOI
11 Nov 2013
TL;DR: FireBird is presented, the first PowerPC based SoC for reliable operation beyond 200°C, using a dynamically reconfigurable clock frequency, exhaustive clock gating, and electromigration-resistant power supply rings.
Abstract: PowerPC Architecture microcontrollers are commonly used in embedded applications. In this work we present FireBird, the first PowerPC based SoC for reliable operation beyond 200°C. Designing SoCs for reliable operation at high temperatures is a significant challenge, due to increased static leakage current, reduced carrier mobility, and increased electro-migration. To alleviate the consequences of high temperatures, this paper proposes to customize a PowerPC e200 based SoC by using a dynamically reconfigurable clock frequency, exhaustive clock gating, and electromigration-resistant power supply rings. A 20×9 mm² chip implementing this design has been fabricated in 0.35 μm CMOS technology. The custom testing procedure showed the expected maximum operating frequency reduction from 38 MHz at room temperature to 30 MHz at 200°C, which illustrates the importance of an adaptable clock frequency under temperature variation. At 200°C, the maximum power dissipation at 3.3 V supply voltage was 1.2 W and the idle state static leakage current was 3.4 mA. Silicon measurements proved that this design outperforms PowerPC based SoCs available in the high-temperature microcontrollers market which are not operational at temperatures above 125°C.

Journal ArticleDOI
TL;DR: In this document major firmware and software achievements concerning the PowerPC implementation, tested on ROD prototypes, will be reported.
Abstract: The Insertable B-layer project is planned for the upgrade of the ATLAS experiment at LHC. A silicon layer will be inserted into the existing Pixel Detector together with new electronics. The readout off-detector system is implemented with a Back-Of-Crate module implementing I/O functionality and a Readout-Driver card (ROD) for data processing. The ROD hosts the electronics devoted to control operations implemented both with a back-compatible solution (using a Digital Signal Processor) and with a PowerPC embedded into an FPGA. In this document major firmware and software achievements concerning the PowerPC implementation, tested on ROD prototypes, will be reported.

Patent
08 May 2013
TL;DR: In this paper, the authors proposed a multifunctional data bus communication module in the condition of low power consumption, which adopts a mode of combining a low-end CPU (central processing unit) of a PowerPC series with an FPGA to integrate three kinds of data buses of MIL-STD-1553B, ARINC429 and RS422 into a card.
Abstract: The invention belongs to the field of data communication, and particularly relates to a multifunctional data bus communication module in the condition of low power consumption. The module adopts a mode of combining a low-end CPU (central processing unit) of the PowerPC series with an FPGA (field programmable gate array) to integrate three kinds of data buses, MIL-STD-1553B, ARINC429 and RS422, into one card. A specific implementation method of the module mainly includes: (1) connecting all three kinds of data buses to the FPGA and then combining the data buses with an MPC8315 minimum working system through a local bus so as to form a uniform data communication platform; (2) configuring a bottom layer BSP (board support package) of the MPC8315 so that the three kinds of data buses adopt a uniform data transmission protocol, guaranteeing consistency with an upper hardware platform; and (3) designing an internal logic circuit of the FPGA and unifying read-write operation of the three kinds of data buses into one mode so as to achieve read-write operation between the MPC8315 and the data buses. The module has the advantages that, by utilizing the programmability of the FPGA, the three kinds of data buses, MIL-STD-1553B, ARINC429 and RS422, are unified into one mode and then combined with the low-power-consumption CPU of the PowerPC architecture, so that multiple data bus communication functions are achieved, and the power consumption of the card and of the system is greatly reduced.

Book ChapterDOI
28 Feb 2013
TL;DR: The proof of the concept of the SMILE HPRC has been exhaustively tested with two complex and demanding applications: the Monte Carlo financial simulation and the Boolean Synthesis using Genetic Algorithms.
Abstract: High Performance Reconfigurable Computing (HPRC) has emerged as an alternative way to accelerate applications using FPGAs. Although these HPRC systems have a performance comparable to standard supercomputers and at a much lower cost, HPRC systems are still not affordable for many institutions. We present a low-cost HPRC system built on standard FPGA boards with an architecture that can execute many scientific applications faster than in Graphical Processor Units and traditional supercomputers. The system is made up of 32 low-cost FPGA boards and a custom-made high-speed network interface using RocketIO interfaces. We have designed a SystemC methodology and CAD framework that allow the designer to simulate any MPI scientific application before generating the final implementation files. The software runs on the PowerPC processor embedded in the FPGA on a light ad-hoc implementation of MPI, and the hardware is automatically translated from SystemC to Verilog, and connected to the PowerPC. This makes the SMILE HPRC system fully compatible with any existing MPI application. The proof of the concept of the SMILE HPRC has been exhaustively tested with two complex and demanding applications: the Monte Carlo financial simulation and the Boolean Synthesis using Genetic Algorithms. The results show a remarkable performance, reasonable costs, small power consumption, no need of cooling systems, small physical space requirements, system scalability and software portability.

Proceedings ArticleDOI
30 Aug 2013
TL;DR: The CORDIC algorithm is an iterative convergence algorithm that performs a rotation using a series of specific incremental rotation angles, selected so that each iteration requires only shift and add operations; this suits FPGA implementation, and the algorithm can be parallelized on a chip to meet different latency and throughput requirements.
Abstract: Historically, computationally intensive data processing for space-borne instruments has relied heavily on ground-based processing systems. Recent FPGAs such as the Xilinx Virtex-4 and Virtex-5 series include PowerPC processors and DSP blocks, providing a flexible hardware and software co-design architecture that can meet computationally intensive data processing needs, so more processing can be shifted on board. For high-data-rate active and passive instruments, such as interferometers, on-board processing algorithms that perform lossless data reduction can dramatically reduce data rates and thereby relax the downlink bandwidth requirements. The inverse Fourier transform of the interferograms is performed on board in order to decrease the transmission rate. [Revercomb et al.] show that using only the modulus of the complex spectrum leads to large calibration errors, so both the amplitude and the angle of the complex spectrum are needed for radiometric calibration; obtaining them on board, however, is a big challenge. In this paper, we introduce the CORDIC algorithm to solve this problem. The CORDIC algorithm is an iterative convergence algorithm that performs a rotation using a series of specific incremental rotation angles, selected so that each iteration requires only shift and add operations. This makes it well suited to FPGA implementation, and it can be parallelized on a chip to meet different latency and throughput requirements. Implementation results on a Xilinx FPGA are summarized.
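
The shift-and-add iteration described above can be sketched in software as follows. This is a minimal fixed-point model of CORDIC in vectoring mode, which drives the imaginary part to zero and so yields the amplitude and angle of a complex sample. The abstract does not give the authors' word width, iteration count or architecture, so the function name, `iterations=16`, `frac_bits=16` and the floating-point gain correction at the end are assumptions for illustration.

```python
import math


def cordic_magnitude_phase(re, im, iterations=16, frac_bits=16):
    """Return (magnitude, angle in radians) of re + j*im via vectoring-mode CORDIC."""
    scale = 1 << frac_bits
    x = int(round(re * scale))
    y = int(round(im * scale))
    pi_fx = int(round(math.pi * scale))
    # Table of atan(2^-i) in fixed-point radians (a small ROM in hardware).
    atan_table = [int(round(math.atan(2.0 ** -i) * scale))
                  for i in range(iterations)]
    # Constant CORDIC gain accumulated over all micro-rotations.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = 0
    # Pre-rotate by +/- pi so the vector starts in the right half-plane.
    if x < 0:
        z = pi_fx if y >= 0 else -pi_fx
        x, y = -x, -y

    for i in range(iterations):
        if y >= 0:
            # Rotate clockwise by atan(2^-i): only shifts and adds.
            x, y, z = x + (y >> i), y - (x >> i), z + atan_table[i]
        else:
            # Rotate counter-clockwise by atan(2^-i).
            x, y, z = x - (y >> i), y + (x >> i), z - atan_table[i]

    # In hardware the gain correction would be one constant multiplication.
    return x / gain / scale, z / scale


if __name__ == "__main__":
    mag, ang = cordic_magnitude_phase(3.0, 4.0)
    print(mag, ang, math.hypot(3.0, 4.0), math.atan2(4.0, 3.0))
```

Each loop iteration depends only on the previous one, so in an FPGA the iterations can either be unrolled into a pipeline for throughput or reused in a single stage to save area, which is one reading of the latency and throughput trade-off mentioned in the abstract.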

Patent
24 Apr 2013
TL;DR: In this paper, a data transmission long-range control system consisting of a PowerPC processor and a second microprocessor is described, where a dual-port RAM is connected between the first microprocessor and the second microprocessor.
Abstract: The invention discloses a data transmission long-range control system. The system comprises a PowerPC processor and a second microprocessor. A first microprocessor and a second microprocessor are connected with the PowerPC processor through peripheral component interconnect (PCI) bus interfaces. The second microprocessor is connected with a long-range control module of an upper computer through a controller area network (CAN) bus. A dual-port RAM is connected between the first microprocessor and the second microprocessor. In the data transmission long-range control system, the division of work between the processors is clear, processing is fast, and processing capability is strong, so that the problem of insufficient capacity of a single processor is solved, data exchange between multiple processors through a PCI bus is realized, and processing capacity and transmission speed between the processors are greatly improved.

Proceedings ArticleDOI
29 May 2013
TL;DR: This work on resource estimation for the various task scheduling policies using XILKERNEL is first of its kind on resource utilization for a given embedded RTOS environment.
Abstract: Present-day FPGA (Field Programmable Gate Array) technology makes it possible to design high-performance embedded systems based on soft-core (MicroBlaze) and hard-core (PowerPC) processors, embedded memories and other IP cores. Embedded system design demands the use of limited hardware resources and as little power as possible while providing high throughput. One way to decrease the complexity of an application is to use a thread-oriented design where a process is divided into a number of manageable pieces known as threads. Each thread is responsible for some part of the application, thus providing multitasking. Further, real-time task execution requires an efficient RTOS (Real Time Operating System) infrastructure on the FPGA. Choosing a scheduling algorithm for thread execution requires knowledge of the resource utilization of each scheduling policy. Hence, a proper exploration of the various thread scheduling algorithms in terms of resource utilization, for a given embedded platform, is of much importance. The incorporation of the XILKERNEL RTOS in FPGAs is a recent facility. Although a few research works analyze the resource requirements of multitasking scenarios for a given embedded RTOS environment, our work on resource estimation for the various task scheduling policies using XILKERNEL is the first of its kind. An implementation of a real-time scheduling algorithm, RMS, on XILKERNEL has also been attempted using OS virtualization, since RMS is not directly supported by the XILKERNEL kernel.
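
As background on RMS (rate-monotonic scheduling), the sketch below shows the two textbook ingredients: fixed priorities assigned by period, and the Liu and Layland utilization bound as a sufficient schedulability check. It makes no use of the XILKERNEL API and says nothing about how the authors implemented RMS; the task tuples and function names are hypothetical.

```python
def rms_priority_order(tasks):
    """Order tasks by period: under RMS, the shortest period gets top priority.

    Each task is a (name, worst_case_execution_time, period) tuple.
    """
    return sorted(tasks, key=lambda task: task[2])


def utilization(tasks):
    """Total processor utilization: the sum of C_i / T_i over all tasks."""
    return sum(wcet / period for _, wcet, period in tasks)


def liu_layland_bound(n):
    """Utilization bound n * (2^(1/n) - 1) for n periodic tasks."""
    return n * (2.0 ** (1.0 / n) - 1.0)


def passes_utilization_test(tasks):
    """Sufficient (not necessary) RMS test: True means surely schedulable.

    False only means this test is inconclusive; an exact response-time
    analysis would be needed to decide.
    """
    return utilization(tasks) <= liu_layland_bound(len(tasks))


if __name__ == "__main__":
    # Hypothetical task set: (name, execution time, period) in milliseconds.
    tasks = [("logger", 1, 20), ("control", 1, 4), ("comms", 2, 8)]
    print([name for name, _, _ in rms_priority_order(tasks)])
    print(utilization(tasks), liu_layland_bound(len(tasks)))
    print(passes_utilization_test(tasks))
```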

Proceedings ArticleDOI
Zhihui Hu1, Yu Zhou
21 Jun 2013
TL;DR: This work tests and analyzes how software and hardware timestamping affect the performance of a GPS clock timing system; with hardware timestamp support, the ported PTP implementation reaches timing accuracy at the 100 ns level.
Abstract: The IEEE 1588 standard Precision Time Protocol (PTP) was proposed to solve high-precision time synchronization problems in application fields. In this work the technique is ported to a Linux + PowerPC platform running a PTP program which, with hardware timestamp support, achieves timing accuracy at the 100 ns level. We tested and analyzed how software and hardware timestamping affect the performance of a GPS clock timing system.
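
To show why timestamp quality matters here, the sketch below works through the standard PTP delay request-response arithmetic: four timestamps give the slave's clock offset and the one-way path delay, assuming a symmetric path. The function name and the nanosecond values are hypothetical and do not come from the paper.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Compute (slave clock offset, one-way path delay) from PTP timestamps.

    t1: master sends Sync            (master clock)
    t2: slave receives Sync          (slave clock)
    t3: slave sends Delay_Req        (slave clock)
    t4: master receives Delay_Req    (master clock)
    Assumes the path delay is the same in both directions.
    """
    master_to_slave = t2 - t1  # path delay plus offset
    slave_to_master = t4 - t3  # path delay minus offset
    offset = (master_to_slave - slave_to_master) / 2
    delay = (master_to_slave + slave_to_master) / 2
    return offset, delay


if __name__ == "__main__":
    # Hypothetical timestamps in nanoseconds: the slave clock runs 150 ns
    # ahead of the master and the one-way delay is 500 ns.
    print(ptp_offset_and_delay(t1=1000, t2=1650, t3=2000, t4=2350))
```

Any error in capturing `t1` to `t4` feeds directly into the computed offset, which is why timestamps taken in hardware, close to the wire, can support tighter accuracy than timestamps taken in software.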

Dissertation
01 Jan 2013
TL;DR: In this paper, the authors implemented a sub-threshold D flip-flop block library in layout and compared the performance of different D flipflop blocks in both schematic and layout, and the results were compared to each other and earlier results found in papers.
Abstract: The need for Ultra Low Power systems has increased with increasing number of portable devices. The maintenance costs of battery powered systems can be greatly reduced by improving the battery time, especially in places where battery replacement is hard or impossible. Implementation of subthreshold D flip-flops in layout is one step closer to having a subthreshold building block library. The task for this thesis is to implement D flip-flop blocks, which are highly suitable for subthreshold operation in layout. These are the PowerPC 603, C$^2$MOS, a Classic NAND-based D flip-flop, and two Minority3-based D flip-flops. The D flip-flops are first custom designed for $250mV$ in schematic at transistor level, and then implemented in layout. The implementation in layout focuses on high robustness against process variations, by using high regularity for the cost of area. The D flip-flops are simulated in both schematic and layout, and the results are compared to each other and earlier results found in papers. The results show that the PowerPC 603 has the lowest PDP, the lowest power consumption, very low propagation delay, and an average relative standard deviation for delay. The C$^2$MOS has the lowest propagation delay, low power consumption and low PDP results. However, it has the highest relative standard deviation on delay. The Minority3-based D flip-flops have a very low relative standard deviation for delay, which makes them the most robust against process variations in this sense. However, they have the highest propagation delay, highest power consumption and PDP, and consumes the highest chip area. The Classic NAND-based D flip-flop has good PDP and power consumption results, but a high delay and average standard deviation for delay. Earlier papers show similar results for the C$^2$MOS and the PowerPC 603, but no results are found for the rest. 
Future work consists of implementing and testing forced-stacked blocks, body biasing, high threshold voltage transistors, and tape-out measurements.
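The comparison in this abstract rests on two derived metrics: the power-delay product (PDP) and the relative standard deviation of delay. A minimal sketch of how these are commonly computed is given here; the function names are illustrative, and every numeric value is a hypothetical placeholder, not a figure from the thesis.

```python
from statistics import mean, stdev
from typing import Sequence


def power_delay_product(power_w: float, delay_s: float) -> float:
    """PDP in joules: average power (W) times propagation delay (s)."""
    return power_w * delay_s


def relative_std_dev(samples: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean (dimensionless)."""
    return stdev(samples) / mean(samples)


# Hypothetical values for illustration only (not taken from the thesis).
power_w = 2.0e-9                   # assumed average power: 2 nW
delay_s = 50.0e-9                  # assumed propagation delay: 50 ns
delays = [48e-9, 50e-9, 52e-9]     # assumed delay samples across process variation

pdp_j = power_delay_product(power_w, delay_s)
rsd = relative_std_dev(delays)

print(f"PDP = {pdp_j:.3e} J, relative std. dev. of delay = {rsd:.1%}")
```

A lower PDP indicates less energy per switching event, and a lower relative standard deviation indicates a delay that varies less across process variation, which is the sense of robustness used in the abstract.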

Journal ArticleDOI
TL;DR: The clustering based change detection algorithm for Ubiquitous Multimedia Environment is selected for evaluating the effect of different memory components (DDR/BRAM) on performance of the system in terms of frame rate (frames per second).
Abstract: Advances in FPGA technology have dramatically increased the use of FPGAs for computer vision applications. The availability of on-chip processors (such as PowerPC) has made it possible to design FPGA-based embedded systems for video processing applications. The objective of this research is to evaluate the performance of different memory components available on FPGA boards for embedded/platform-based implementations of image/video processing applications. The clustering-based change detection algorithm for Ubiquitous Multimedia Environment is selected for evaluating the effect of different memory components (DDR/BRAM) on the performance of the system in terms of frame rate (frames per second).

Journal ArticleDOI
29 Oct 2013
TL;DR: This paper proposes an integrated architecture that combines the PowerPC processor on NetFPGA with an embedded Linux operating system, giving developers a software execution environment that adds flexibility and broadens the range of applications that can be developed on the platform.
Abstract: Among numerous embedded platforms, NetFPGA provides developers with a freely programmable FPGA component for designing custom networking functionality. However, most hardware projects are developed from reference designs without an embedded operating system, which makes hybrid development across multiple layers difficult. In addition, because resources on an embedded platform are limited, an implementation must account for both performance and flexibility, and for network processing it is quite difficult to adjust control parameters without a software environment. Therefore, this paper proposes an integrated architecture that uses the PowerPC processor on NetFPGA together with an embedded Linux operating system on the NetFPGA platform. This not only gives developers a software execution environment, which adds flexibility, but also broadens the range of applications that can be developed on the system.

Dissertation
31 Oct 2013
TL;DR: A middleware architecture for scripting languages is discussed that provides for seamless dynamic scripting access to the C API of native libraries without the need for compilation of wrapper modules.
Abstract: Scripting languages are becoming increasingly prevalent as a tool for rapid application development. However, numerous efficient "best-practice" software solutions are initially available as C libraries. Scripting "bindings" to C libraries are typically implemented as C wrapper modules that need to be developed and compiled for every language-library-platform combination. We discuss a middleware architecture for scripting languages that provides seamless dynamic scripting access to the C API of native libraries without the need to compile wrapper modules. We give a proof of concept in the form of an implementation for R in which C libraries, such as OpenGL and SDL, are loaded as if they were extensions to R. The model is based on automation for making arbitrary C APIs available, and on dynamic operations for interoperability with native code and data that are carried out at the machine level using a Dynamic Foreign Function Interface. The latter needs to conform to the ABI (Application Binary Interface) and calling conventions of the processor hardware platform. We give an overview of ABIs across five processor-architecture families, and we then discuss a portable abstraction layer for making foreign function calls and handling callbacks. Detailed descriptions explain the interface design as well as the port implementations for the X86, ARM, PowerPC, MIPS and SPARC processor-architecture families.
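The idea of calling a C API from a scripting language without compiling a wrapper module can be illustrated with a small sketch. It does not use the dissertation's middleware or its R implementation; it uses Python's standard `ctypes` module as an analogous dynamic foreign function interface, and it assumes a POSIX system where the C runtime can be loaded at run time. The helper name `c_strlen` is hypothetical.

```python
import ctypes
import ctypes.util

# Load the C runtime dynamically; no wrapper module is written or compiled.
# Assumption: POSIX. If find_library returns None, CDLL(None) exposes the
# symbols already loaded into the current process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Describe the C signatures so arguments and results are marshalled according
# to the platform calling convention:
#   size_t strlen(const char *s);
#   int abs(int j);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int


def c_strlen(text: bytes) -> int:
    """Call the native strlen on a NUL-terminated byte string."""
    return libc.strlen(text)


print(c_strlen(b"PowerPC"), libc.abs(-42))
```

The signature declarations play the role that a generated wrapper would otherwise play: they tell the foreign function layer how to place each argument for the call, which is where ABI and calling-convention conformance matters.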

01 Aug 2013
TL;DR: This thesis presents a generic proof methodology that automatically proves the correctness of design transformations introduced at the Register-Transfer Level (RTL) to lower power dissipation in hardware systems; correctness of a low-power transformation is guaranteed by a functional equivalence proof of the hardware design before and after the transformation.
Abstract: We present a generic proof methodology to automatically prove correctness of design transformations introduced at the Register-Transfer Level (RTL) to achieve lower power dissipation in hardware systems. We also introduce a new algorithm to reduce switching activity power dissipation in microprocessors. We further apply our technique in a completely different domain of dynamic power management of Systems-on-Chip (SoCs). We demonstrate our methodology on real-life circuits. In this thesis, we address the dual problem of transforming hardware systems at higher levels of abstraction to achieve lower power dissipation, and a reliable way to verify the correctness of the afore-mentioned transformations. The thesis is in three parts. The first part introduces Instruction-driven Slicing, a new algorithm to automatically introduce RTL/System level annotations in microprocessors to achieve lower switching power dissipation. The second part introduces Dedicated Rewriting, a rewriting based generic proof methodology to automatically prove correctness of such high-level transformations for lowering power dissipation. The third part implements dedicated rewriting in the context of dynamically managing power dissipation of mobile and hand-held devices. We first present instruction-driven slicing, a new technique for annotating microprocessor descriptions at the Register Transfer Level in order to achieve lower power dissipation. Our technique automatically annotates existing RTL code to optimize the circuit for lowering power dissipated by switching activity. Our technique can be applied at the architectural level as well, achieving similar power gains. We first demonstrate our technique on architectural and RTL models of a 32-bit OpenRISC pipelined processor (OR1200), showing power gains for the SPEC2000 benchmarks. These annotations achieve reduction in power dissipation by changing the logic of the design. 
We further extend our technique to an out-of-order superscalar core and demonstrate power gains for the same SPEC2000 benchmarks on architectural and RTL models of PUMA, a fixed point out-of-order PowerPC microprocessor. We next present dedicated rewriting, a novel technique to automatically prove the correctness of low power transformations in hardware systems described at the Register Transfer Level. We guarantee the correctness of any low power transformation by providing a functional equivalence proof of the hardware design before and after the transformation. Dedicated rewriting is a highly automated deductive verification technique specially honed for proving correctness of low power transformations. We provide a notion of equivalence and establish the equivalence proof within our dedicated rewriting system. We demonstrate our technique on a non-trivial case study. We show equivalence of a Verilog RTL implementation of a Viterbi decoder, a component of the DRM System-On-Chip (SoC), before and after the application of multiple low power transformations. We next apply dedicated rewriting to a broader context of…

01 Aug 2013
TL;DR: The results show that when processors are categorized by microarchitectural families and certain restrictions to input size are employed, linear correlation shows promise for being an effective performance predictor for the IP kernels.
Abstract: This report presents an in-depth performance characterization of a variety of processors released over the last decade. The processors considered include Intel and PowerPC and vary widely with respect to architectural design parameters. The benchmark experiments utilize applications from two different classes of codes. The first class, consisting of synthetic benchmarks, includes the popular Dhrystone and Whetstone suites. The second class includes a set of widely used Image Processing (IP) kernels. Following the presentation of the results from these experiments, a set of techniques for performance prediction is given based on linear correlation. This report provides an evaluation of the effectiveness of these techniques. The results show that when processors are categorized by microarchitectural families and certain restrictions to input size are employed, linear correlation shows promise for being an effective performance predictor for the IP kernels.
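Prediction by linear correlation can be sketched as follows: measure how strongly a synthetic benchmark score tracks kernel performance within one processor family, then fit a line and use it to predict performance for another processor. This is a generic illustration under stated assumptions, not the report's actual procedure; all data points are hypothetical, and the function names are illustrative.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical measurements for processors in one microarchitectural family.
synthetic_score = [1000.0, 2000.0, 3000.0, 4000.0]   # e.g. a Dhrystone-style score
kernel_throughput = [11.0, 19.0, 31.0, 39.0]         # e.g. images per second

r = pearson_r(synthetic_score, kernel_throughput)
slope, intercept = fit_line(synthetic_score, kernel_throughput)

# Predict kernel throughput for a hypothetical processor scoring 5000.
predicted = slope * 5000.0 + intercept
print(f"r = {r:.4f}, predicted throughput = {predicted:.1f}")
```

A correlation coefficient close to 1 within a family is what would justify using the fitted line as a predictor; grouping by microarchitectural family, as the abstract describes, keeps processors with very different design parameters from being fitted by a single line.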

Patent
09 Oct 2013
TL;DR: This patent describes a communication control method and device for a train display screen that comprises a PowerPC processor, a field-programmable gate array (FPGA), and external equipment.
Abstract: The invention discloses a communication control method and device for a train display screen. The train display screen comprises a PowerPC processor, a field-programmable gate array (FPGA), and external equipment. In the communication control method, the FPGA receives configuration parameters sent by the PowerPC processor, where the configuration parameters correspond to the external equipment; the FPGA determines the external equipment to be controlled according to the configuration parameters; and the FPGA monitors the operation of the external equipment to be controlled. According to this scheme, the stability and reliability of the train display screen are improved.