
Showing papers on "PowerPC published in 2019"


Proceedings ArticleDOI
14 May 2019
TL;DR: Considers the workflow of TensorFlow for image recognition, highlighting the strong dependency of training-phase performance on the availability of arithmetic libraries optimized for the underlying architecture, and identifies which hardware/software configurations efficiently support machine learning workloads on HPC clusters.
Abstract: The recent rapid growth of the data-flow programming paradigm has enabled the development of architectures specific to, e.g., machine learning. The best-known example is the Tensor Processing Unit (TPU) by Google. Standard data centers, however, still cannot dedicate large partitions to machine-learning-specific architectures. Within data centers, High-Performance Computing (HPC) clusters are highly parallel machines targeting a broad class of compute-intensive workflows, so they can be used for tackling machine learning challenges. On top of this, HPC architectures are changing rapidly, incorporating accelerators and instruction sets beyond the classical x86 CPUs. In this blurry scenario, identifying the best hardware/software configurations to efficiently support machine learning workloads on HPC clusters is not trivial. In this paper, we consider the workflow of TensorFlow for image recognition. We highlight the strong dependency of training-phase performance on the availability of arithmetic libraries optimized for the underlying architecture. Following the example of Intel leveraging the MKL libraries to improve TensorFlow performance, we plugged the Arm Performance Libraries into TensorFlow and tested on an HPC cluster based on Marvell ThunderX2 CPUs. We also performed a scalability study on three state-of-the-art HPC clusters based on different CPU architectures: x86 Intel Skylake, Arm-v8 Marvell ThunderX2, and PowerPC IBM Power9.

17 citations


Proceedings ArticleDOI
02 Mar 2019
TL;DR: Several OpenMP-parallelized applications (a color search, Sobel filter, Mandelbrot set generator, hyperspectral imaging target classifier, and image thumbnailer) were benchmarked on these processing platforms, establishing the capabilities of both the RAD5545 and HPSC processors for on-board parallel processing of computationally demanding applications in future space missions.
Abstract: Researchers, corporations, and government entities are seeking to deploy increasingly compute-intensive workloads on space platforms. This need is driving the development of two new radiation-hardened, multi-core space processors, the BAE Systems RAD5545™ processor and the Boeing High-Performance Spaceflight Computing (HPSC) processor. As these processors were still in development, the Freescale P5020DS and P5040DS systems, based on the same PowerPC e5500 architecture as the RAD5545 processor, and the Hardkernel ODROID-C2, sharing the same ARM Cortex-A53 core as the HPSC processor, were selected as facsimiles for evaluation. Several OpenMP-parallelized applications, including a color search, Sobel filter, Mandelbrot set generator, hyperspectral imaging target classifier, and image thumbnailer, were benchmarked on these processing platforms. Performance and energy consumption results on these facsimiles were scaled to the forecast frequencies of the radiation-hardened devices in development. In these studies, the RAD5545 achieved the highest and most consistent parallel efficiency, up to 99%. The HPSC processor achieved faster execution times, averaging about half that of the RAD5545 processor, with lower energy consumption. The evaluated applications reached a speedup of up to 3.9x across four cores. The frequency-scaling methods were validated by comparing the set of scaled measures with data points from an underclocked facsimile, which yielded an average accuracy of 97% between estimated and measured results. These performance outcomes help to establish the capabilities of both the RAD5545 and HPSC processors for on-board parallel processing of computationally demanding applications in future space missions.
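The frequency-scaling estimate and the parallel-efficiency figure described above can be sketched in a few lines. This is a minimal illustration only: the clock values in the example are hypothetical, and only the 3.9x speedup on four cores comes from the study.

```python
def scale_execution_time(measured_s, facsimile_mhz, target_mhz):
    """Scale a runtime measured on a facsimile board to a different core
    clock, assuming a compute-bound workload (time inversely
    proportional to frequency)."""
    return measured_s * (facsimile_mhz / target_mhz)

def scaling_accuracy(estimated_s, measured_s):
    """Agreement between an estimate and a real measurement
    (1.0 = perfect match)."""
    return 1.0 - abs(estimated_s - measured_s) / measured_s

def parallel_efficiency(speedup, cores):
    """Speedup divided by core count."""
    return speedup / cores

# Hypothetical example: scale a 1.2 s run at 2.0 GHz down to a 466 MHz
# radiation-hardened target (roughly 5.15 s estimated).
est = scale_execution_time(1.2, 2000, 466)

# The abstract's 3.9x speedup on four cores corresponds to 97.5%
# parallel efficiency, consistent with the "up to 99%" figure.
eff = parallel_efficiency(3.9, 4)
```

Validating the estimates against an underclocked facsimile, as the paper does, amounts to feeding a real low-frequency measurement into `scaling_accuracy`.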

15 citations


Journal ArticleDOI
01 Sep 2019
TL;DR: This multi-node characterization of the Emu Chick extends an earlier single-node investigation of the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiplication, and demonstrates that for many basic operations the Emu Chick can use available memory bandwidth more efficiently than a more traditional, cache-based architecture.
Abstract: The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before each memory read. The current prototype hardware uses FPGAs to implement cache-less “Gossamer” cores for computational work and relies on a typical stationary core (PowerPC) to run basic operating system functions and migrate threads between nodes. In this multi-node characterization of the Emu Chick, we extend an earlier single-node investigation [1] of the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiplication. We compare the Emu Chick hardware to architectural simulation and an Intel Xeon-based platform. Our results demonstrate that for many basic operations the Emu Chick can use available memory bandwidth more efficiently than a more traditional, cache-based architecture, although bandwidth usage suffers for computationally intensive workloads like SpMV. Moreover, the Emu Chick provides stable, predictable performance with up to 65% of the peak bandwidth utilization on a random-access pointer chasing benchmark with weak locality.
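The pointer-chasing access pattern mentioned above is commonly modeled as a walk through a random cyclic permutation, so that every load depends on the previous one and prefetching cannot help. The sketch below is a simplified Python model of that pattern, not the Emu benchmark code itself:

```python
import random

def make_chain(n, seed=0):
    """Build one random cycle over n slots: chain[i] gives the next index
    to visit, so every access depends on the previous load. This produces
    the weak-locality, latency-bound pattern a pointer-chasing
    benchmark needs."""
    rng = random.Random(seed)
    order = list(range(1, n))
    rng.shuffle(order)
    chain = [0] * n
    cur = 0
    for nxt in order:
        chain[cur] = nxt
        cur = nxt
    chain[cur] = 0  # close the cycle back to the start
    return chain

def chase(chain, hops):
    """Follow `hops` dependent loads through the chain. The loop-carried
    dependence (i depends on the previous chain[i]) is what makes the
    kernel measure memory latency rather than bandwidth."""
    i = 0
    for _ in range(hops):
        i = chain[i]
    return i

# After exactly len(chain) hops the walk returns to slot 0.
```

A production pointer-chase kernel would run this loop over a large array in C with wall-clock timing; the shape of the dependence chain is the point here.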

8 citations


Book ChapterDOI
TL;DR: Presents the idea of porting eChronos to a chip that is open-source and effective, thereby reducing the cost of embedded systems and strengthening secure system development as a whole.
Abstract: eChronos is a formally verified real-time operating system (RTOS) designed for embedded microcontrollers. eChronos targets tightly constrained devices without memory management units. Currently, eChronos is available on proprietary designs such as the ARM, PowerPC, and Intel architectures. It is adopted in safety-critical systems such as aircraft control systems and medical implant devices. eChronos is one of the very few pieces of system software that has not been ported to RISC-V. RISC-V is an open-source instruction set architecture (ISA) that enables a new era of processor development. Many standard operating systems and software toolchains have migrated to the RISC-V architecture. According to the latest trends [1], RISC-V is replacing many proprietary chips. As a secure RTOS, eChronos is an attractive candidate for porting to an open-source ISA. SHAKTI and PicoRV32 are among the proven open-source RISC-V designs available. Having a secure RTOS on an open-source hardware design, itself built on an open-source ISA, makes the port all the more interesting. In addition, the architectures currently supported by eChronos are all proprietary designs [2], so porting eChronos to the RISC-V architecture strengthens secure system development as a whole. This paper presents the idea of porting eChronos to a chip that is open-source and effective, thus reducing the cost of embedded systems. A system that is completely open-source reduces overall cost, increases security, and can be critically reviewed. This paper explores the design and architecture aspects involved in porting eChronos to RISC-V. The authors have successfully ported eChronos to the RISC-V architecture and verified it on Spike [3]. The port of eChronos to RISC-V has been made available open-source by the authors [4]. Along with that, the safe removal of architectural dependencies and the subsequent changes in eChronos are also analyzed.

3 citations


Journal ArticleDOI
TL;DR: This contribution discusses porting the LHCb stack from the x86_64 architecture to both aarch64 and ppc64le, with the goal of evaluating the performance and cost of the computing infrastructure for the High Level Trigger (HLT).
Abstract: LHCb is undergoing major changes in its data selection and processing chain for the upcoming LHC Run 3 starting in 2021. With this in sight, several initiatives have been launched to optimise the software stack. This contribution discusses porting the LHCb stack from the x86_64 architecture to both aarch64 and ppc64le, with the goal of evaluating the performance and cost of the computing infrastructure for the High Level Trigger (HLT). This requires porting a stack with more than five million lines of code and finding working versions of external libraries provided by LCG. Across all software packages the biggest challenge is the growing use of vectorisation, as many vectorisation libraries are specialised for the x86 architecture and do not support other architectures. In spite of these challenges we have successfully ported the LHCb High Level Trigger code to aarch64 and ppc64le. This contribution discusses the status and plans for the porting of the software, as well as the LHCb approach to tackling code vectorisation in a platform-independent way.

2 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: CrossDiff is a cross-architecture binary function search system that finds similar functions in binaries built for different architectures such as x86, ARM, MIPS, and PowerPC; results show that CrossDiff is efficient and scalable with high accuracy.
Abstract: Code reuse is widespread in software development across CPU architectures. It can cause many problems, such as software plagiarism and the propagation of known vulnerabilities. To tackle these problems, we propose CrossDiff, a cross-architecture binary function search system. We aim to find similar functions in binaries built for different architectures such as x86, ARM, MIPS, and PowerPC. The system has four phases. The pre-processing phase lifts the binaries into LLVM IR code and optimizes it to reduce the impact of differing architectures and other factors. Secondly, it extracts Bytehash, Simhash, and other function semantic features. It then uses Bytehash to find exact function matches, Simhash to find similar-function candidates, and the semantic features for precise comparison. Finally, it refines the result by leveraging the structural information in the disassembly code. Results show that CrossDiff is efficient and scalable with high accuracy.
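The two-stage Bytehash/Simhash matching described above can be illustrated with a toy version. This is an assumption-laden sketch: the paper does not specify its hash functions, so SHA-256 and an MD5-based 64-bit Simhash stand in here.

```python
import hashlib

def bytehash(code: bytes) -> str:
    """Exact-match digest: byte-identical functions hash identically.
    A stand-in for the paper's Bytehash stage (equal-function matches)."""
    return hashlib.sha256(code).hexdigest()

def simhash(tokens, bits=64):
    """Classic Simhash: each token's hash votes per bit position, so
    near-identical token streams end up at a small Hamming distance."""
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if votes[b] > 0:
            fingerprint |= 1 << b
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Two lifted IR token streams that differ by one instruction stay close
# in Hamming distance, so the candidate stage keeps them for the final
# precise comparison on semantic features.
f1 = "load add store ret".split()
f2 = "load add store nop ret".split()
```

A real pipeline would tokenize the optimized LLVM IR of each function and threshold the Hamming distance to select candidates before the expensive semantic comparison.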

2 citations


Patent
02 Jul 2019
TL;DR: In this paper, an asymmetric data processing device based on a multi-core PowerPC processor is described; the device offers high real-time performance, strong stability, and good expansibility, providing higher performance and lower power consumption than a conventional data processing device.
Abstract: The utility model discloses an asymmetric data processing device based on a multi-core PowerPC processor. The system comprises a multi-core PowerPC processor, an FPGA module, an Ethernet switching chip, a debugging connector connected with the Ethernet switching chip, a RapidIO switching chip, and an interface connector connected with the RapidIO switching chip. One end of the FPGA module, one end of the Ethernet switching chip, one end of the debugging connector, and one end of the RapidIO switching chip are connected with the multi-core PowerPC processor, and the other ends of the FPGA module and the Ethernet switching chip are respectively connected with the interface connector. The beneficial effects of the utility model are that the data processing device offers high real-time performance, strong stability, and good expansibility, provides higher performance and lower power consumption compared with a conventional data processing device, and avoids the defects of prior-art data processing devices.

1 citations


Patent
21 Jun 2019
TL;DR: In this paper, a high-precision synchronous information processing system based on a PowerPC processor is presented, which includes an Ethernet chip, a NOR FLASH memory, and a DDR3 memory.
Abstract: The utility model discloses a high-precision synchronous information processing system based on a PowerPC processor. The system comprises a PowerPC processor, an Ethernet chip, a NOR FLASH memory, and a DDR3 memory, wherein the Ethernet chip, the NOR FLASH memory, and the DDR3 memory are respectively connected with the PowerPC processor. The Ethernet chips comprise a gigabit Ethernet chip A, a gigabit Ethernet chip B, and a 100M Ethernet chip C, each respectively connected with the PowerPC processor. The embedded computer information processing system uses five Ethernet interfaces supporting a high-precision synchronization technology; its strong synchronization function, matched with the strong computational capability of the PowerPC processor, realizes synchronized information processing. The system can synchronize external equipment, can adapt to a network, and is greatly ahead of a common embedded computer information processing system.

1 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: In this article, the authors present test results for the CPU, cache, DRAM interface unit, and associated processor local bus circuitry of the radiation-hardened BRE440, a PowerPC 440-based SOC processor fabricated on the Honeywell HX5000 150 nm technology node.
Abstract: SEE test results are presented for the CPU, cache, DRAM interface unit, and associated processor local bus circuitry of the radiation-hardened BRE440, a PowerPC 440-based SoC processor fabricated on the Honeywell HX5000 150 nm technology node.

1 citations


Patent
04 Jun 2019
TL;DR: In this paper, an FPGA-based 60X bus bridging system is presented, built around a main bridge control module that decodes the 60X bus of the PowerPC processor and outputs the resulting address decodes together with control information.
Abstract: The invention provides an FPGA (Field Programmable Gate Array)-based 60X bus bridging system, an FPGA-based 60X bus bridging method, and a medium. The FPGA-based 60X bus bridging system comprises a main bridge control module, which decodes the 60X bus of the PowerPC processor to acquire address decodes and outputs control information together with the address decodes, and a DDR2 control module, which caches the DDR communication data from the 60X bus according to the received control information and controls the external DDR2 memory logic. The system has an independent 60X bus response timing technology, is not influenced by external modules, and can ensure the stability of the processor; each bus interface has a plurality of mutually independent caches, so the response time of the processor is shortened and the bus access rate is improved. According to the invention, the PowerPC processor is connected with the FPGA chip, and the FPGA converts the 60X bus to each peripheral chip interface, replacing the original dedicated 60X bridge switching chip. Performance is higher, connection is flexible and convenient, and expansion is easy.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: For a proprietary VoIP soft-switch product roadmap, it is planned to provide a service model that is hardware-independent and NFV-compatible, and to port from the Motorola PowerPC to the Intel x86 processor architecture.
Abstract: For a proprietary VoIP soft-switch product roadmap, it is planned to provide a service model that is hardware-independent and NFV-compatible. Currently, the Call Agent Core component of the VoIP soft-switch and its associated applications are compiled for the Motorola PowerPC processor. The aim here is to port from the Motorola PowerPC to the Intel x86 processor architecture. In this study, requirement analysis and design items concerning the Call Agent Core component are described.

Book ChapterDOI
25 May 2019
TL;DR: The design scheme of the vehicle simulation subsystem of this testing platform is completed, and multiple actual line data and different EMU data are used for testing, which verifies the feasibility and versatility of the system.
Abstract: According to the operational principle and functional requirements of a testing platform for CBTC systems, this paper focuses on the research and design of the vehicle simulation subsystem of the testing platform. First, through analysis of the functions of the CBTC testing platform, the design scheme of the vehicle simulation subsystem is completed. Then, on a PowerPC hardware platform with a software platform based on the embedded Linux operating system, the hardware design of the unit adapter and the development process of the BSP driver between the vehicle subsystem and the CBTC testing platform are introduced in detail. Finally, the vehicle simulation subsystem is realized by establishing the vehicle dynamic model and developing the host computer software. The subsystem is connected to the CBTC testing platform, and multiple actual line datasets and different EMU datasets are used for testing, verifying the feasibility and versatility of the system.

Patent
01 Jan 2019
TL;DR: In this paper, a 1553B and Zigbee protocol conversion device is proposed, comprising a 1553B connector, an isolation transformer, a 1553B controller, an FPGA gate array, a PowerPC processor, and a Zigbee module connected in succession.
Abstract: The invention, which relates to the technical field of communication protocol conversion, provides a 1553B and Zigbee protocol conversion device comprising a 1553B connector, an isolation transformer, a 1553B controller, an FPGA gate array, a PowerPC processor, and a Zigbee module that are connected in succession. A bus signal voltage conversion chip is connected between the 1553B controller and the FPGA gate array, and the PowerPC processor is also connected with a DDR chip and a NOR FLASH chip. According to the invention, wireless communication between 1553B devices is realized by applying the wireless network form of the Zigbee ad hoc network to the 1553B communication bus, thereby providing a short-distance, low-complexity, self-organizing, low-power-consumption wireless network mode for the conversion of 1553B protocol data.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: The paper develops a hardware accelerator for an H.264 video decoder, interfaced as an Auxiliary Processing Unit (APU) with the embedded PowerPC (PPC440) processor on a System on Chip (SoC) platform on a Xilinx Virtex-5 development board.
Abstract: The objective of the paper is to develop a hardware accelerator for the H.264 video decoder. This is achieved by interfacing the H.264 decoder as an Auxiliary Processing Unit (APU) with the embedded PowerPC (PPC440) processor on a System on Chip (SoC) platform on a Xilinx Virtex-5 development board. The H.264 APU accelerator is tested with various video sequences and achieves 7x acceleration compared with equivalent software execution and results reported in the literature.

Proceedings ArticleDOI
10 May 2019
TL;DR: This paper presents the design and implementation of a direct memory access (DMA) architecture over PCI-Express (PCIe) between a Xilinx field programmable gate array (FPGA) and a Freescale PowerPC, providing a high-performance and low-occupancy alternative to commercial products.
Abstract: This paper presents the design and implementation of a direct memory access (DMA) architecture over PCI-Express (PCIe) between a Xilinx field programmable gate array (FPGA) and a Freescale PowerPC. The DMA architecture on the FPGA side is compatible with the Xilinx PCIe core, while the DMA architecture on the PowerPC side is compatible with the VxBus of VxWorks. The solution provides a high-performance and low-occupancy alternative to commercial products. In order to maximize the PCIe throughput while minimizing FPGA resource utilization, a novel strategy for the DMA engine is adopted, where the DMA register list is stored not only inside the FPGA during the initialization phase but also in the central memory of the host CPU. The FPGA design package is complemented with simple register access to control the DMA engine via a VxWorks driver. The design is compatible with the Xilinx Kintex UltraScale FPGA family and operates with the Xilinx PCIe Gen1 endpoint in the x8 lane configuration. A data throughput of more than 666 MBytes/s (memory writes with data from FPGA to PowerPC) has been achieved with the single PCIe Gen1 x8 endpoint of this design.
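As a back-of-envelope check, the reported 666 MBytes/s can be compared with the raw PCIe Gen1 x8 link capacity (standard Gen1 rates: 2.5 GT/s per lane with 8b/10b encoding; protocol overheads such as TLP and DLLP framing are ignored in this rough sketch):

```python
# PCIe Gen1 raw link capacity for an x8 endpoint.
GT_PER_LANE = 2.5e9   # 2.5 GT/s per lane (Gen1 signalling rate)
ENCODING = 8 / 10     # 8b/10b line encoding: 10 line bits carry 8 data bits
LANES = 8

# Raw data capacity: 2.5 GT/s * 0.8 / 8 bits-per-byte * 8 lanes = 2.0 GB/s.
raw_bytes_per_s = GT_PER_LANE * ENCODING / 8 * LANES

measured = 666e6      # throughput reported in the paper
utilisation = measured / raw_bytes_per_s   # fraction of raw capacity used
```

So the quoted figure uses roughly a third of the raw Gen1 x8 bandwidth; the achievable payload rate in practice sits below the 2 GB/s raw figure once TLP header and flow-control overheads are accounted for.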

Proceedings ArticleDOI
01 Aug 2019
TL;DR: In the presence of frequent pre-emptions, throughput reduces by only 3% on the AIOP, compared to 25% on optimized present-day NPU architectures, while the absolute throughput and latency numbers are 2x better.
Abstract: This paper presents the recent advancements made on the Advanced-IO-Processor (AIOP), a Network Processor (NPU) architecture designed by NXP Semiconductors. The base architecture consists of multi-tasking PowerPC processor cores combined with hardware accelerators for common packet processing functions. Each core is equipped with dedicated hardware for rapid task scheduling and switching on every hardware accelerator call, thus providing very high throughput. A hardware pre-emption controller snoops on the accelerator completions and sends task pre-emption requests to the cores. This reduces the latency of real-time tasks by quickly switching to the high-priority task on the core without any performance penalty. A novel concept of priority thresholding is further used to avoid latency uncertainty on lower-priority tasks. The paper shows that these features make the AIOP architecture very effective in handling the conflicting requirements of high throughput and low latency for next-generation wireless applications like WiFi (802.11ax) and 5G. In the presence of frequent pre-emptions, the throughput reduces by only 3% on the AIOP, compared to 25% on optimized present-day NPU architectures. Further, the absolute throughput and latency numbers are 2x better.