
Showing papers on "Multi-core processor published in 2016"


Proceedings ArticleDOI
02 Nov 2016
TL;DR: TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments, using dataflow graphs to represent computation, shared state, and the operations that mutate that state.
Abstract: TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production; we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
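As a rough illustration of the dataflow abstraction described in the abstract, the following minimal Python sketch models a graph whose nodes are operations and whose shared state is a variable mutated by an assignment op. It is a toy, not the TensorFlow API; all class and function names are invented for illustration.

```python
# Toy dataflow graph with mutable shared state (illustrative only, not TensorFlow).
class Node:
    def __init__(self, name, fn, inputs=()):
        self.name, self.fn, self.inputs = name, fn, list(inputs)

    def evaluate(self, cache):
        # Evaluate dependencies first, memoizing results within one "run" of the graph.
        if self.name not in cache:
            args = [dep.evaluate(cache) for dep in self.inputs]
            cache[self.name] = self.fn(*args)
        return cache[self.name]

class Variable(Node):
    """Shared state: a node holding a value that assignment ops may mutate."""
    def __init__(self, name, value):
        super().__init__(name, lambda: self.value)
        self.value = value

def assign_add(var, delta):
    """An op that mutates the variable's state when executed."""
    def fn(d):
        var.value += d
        return var.value
    return Node("assign_add(" + var.name + ")", fn, [delta])

# Build a tiny graph for one SGD-style update: w <- w - lr * grad.
w = Variable("w", 1.0)
grad = Node("grad", lambda: 0.5)                 # stands in for a computed gradient
lr = Node("lr", lambda: 0.1)
step = Node("step", lambda g, l: -l * g, [grad, lr])
update = assign_add(w, step)

for i in range(3):                               # "running" the graph repeatedly mutates w
    print("iteration", i, "w =", round(update.evaluate({}), 2))
```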

10,913 citations


Posted Content
TL;DR: The TensorFlow dataflow model is described and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.
Abstract: TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production; we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.

5,542 citations


MonographDOI
01 Jan 2016
TL;DR: In this article, a comprehensive introduction to parallel computing is provided, discussing theoretical issues such as the fundamentals of concurrent processes, models of parallel and distributed computing, and metrics for evaluating and comparing parallel algorithms, as well as practical issues, including methods of designing and implementing shared- and distributed-memory programs, and standards for parallel program implementation, in particular MPI and OpenMP interfaces.
Abstract: The constantly increasing demand for more computing power can seem impossible to keep up with. However, multicore processors capable of performing computations in parallel allow computers to tackle ever larger problems in a wide variety of applications. This book provides a comprehensive introduction to parallel computing, discussing theoretical issues such as the fundamentals of concurrent processes, models of parallel and distributed computing, and metrics for evaluating and comparing parallel algorithms, as well as practical issues, including methods of designing and implementing shared- and distributed-memory programs, and standards for parallel program implementation, in particular MPI and OpenMP interfaces. Each chapter presents the basics in one place followed by advanced topics, allowing novices and experienced practitioners to quickly find what they need. A glossary and more than 80 exercises with selected solutions aid comprehension. The book is recommended as a text for advanced undergraduate or graduate students and as a reference for practitioners.
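A representative example of the evaluation metrics such a text covers (stated here from standard theory, not quoted from the book) is Amdahl's law, which bounds the speedup of a program whose parallelizable fraction is f when run on p cores:

\[
S(p) \;=\; \frac{T_1}{T_p} \;=\; \frac{1}{(1 - f) + f/p},
\qquad
\lim_{p \to \infty} S(p) \;=\; \frac{1}{1 - f}.
\]

For instance, a code that is 90% parallelizable (f = 0.9) can never exceed a 10x speedup, no matter how many cores a multicore processor provides.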

572 citations


Proceedings ArticleDOI
18 Apr 2016
TL;DR: GeePS enables a state-of-the-art single-node GPU implementation to scale well, such as to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
Abstract: Large-scale deep learning requires huge computational resources to train a multi-layer neural network. Recent systems propose using 100s to 1000s of machines to train networks with tens of layers and billions of connections. While the computation involved can be done more efficiently on GPUs than on more traditional CPU cores, training such networks on a single GPU is too slow and training on distributed GPUs can be inefficient, due to data movement overheads, GPU stalls, and limited GPU memory. This paper describes a new parameter server, called GeePS, that supports scalable deep learning across GPUs distributed among multiple machines, overcoming these obstacles. We show that GeePS enables a state-of-the-art single-node GPU implementation to scale well, such as to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code). Moreover, GeePS achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
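To make the parameter-server pattern that GeePS builds on concrete, here is a hedged, minimal Python sketch of the basic pull/compute/push cycle; the class and training loop are invented for illustration, and the actual GeePS adds GPU memory management, parameter caching, and background data movement.

```python
import numpy as np

class ParameterServer:
    """Holds the shared model parameters; workers pull them and push gradients."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.params.copy()

    def push(self, gradient):
        self.params -= self.lr * gradient        # apply a worker's update

def worker_step(server, data, targets):
    """One data-parallel step: pull parameters, compute a local gradient, push it."""
    w = server.pull()
    residual = data @ w - targets
    gradient = data.T @ residual / len(targets)  # least-squares gradient on this shard
    server.push(gradient)

# Toy usage: two "workers" train a linear model on disjoint data shards.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(100, 3)), np.array([1.0, -2.0, 0.5])
y = X @ true_w
server = ParameterServer(dim=3)
for _ in range(200):
    worker_step(server, X[:50], y[:50])          # worker 0's shard
    worker_step(server, X[50:], y[50:])          # worker 1's shard
print("learned parameters:", np.round(server.params, 2))
```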

315 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: A new class of hardware accelerators for large-scale combinatorial optimization and deep learning based on memristive Boltzmann machines is examined; built on recently developed resistive RAM (RRAM) technology, the proposed accelerator achieves 57x higher performance and 25x lower energy than a multicore system with virtually no loss in the quality of the solution to the optimization problems.
Abstract: The Boltzmann machine is a massively parallel computational model capable of solving a broad class of combinatorial optimization problems. In recent years, it has been successfully applied to training deep machine learning models on massive datasets. High performance implementations of the Boltzmann machine using GPUs, MPI-based HPC clusters, and FPGAs have been proposed in the literature. Regrettably, the required all-to-all communication among the processing units limits the performance of these efforts. This paper examines a new class of hardware accelerators for large-scale combinatorial optimization and deep learning based on memristive Boltzmann machines. A massively parallel, memory-centric hardware accelerator is proposed based on recently developed resistive RAM (RRAM) technology. The proposed accelerator exploits the electrical properties of RRAM to realize in situ, fine-grained parallel computation within memory arrays, thereby eliminating the need for exchanging data between the memory cells and the computational units. Two classical optimization problems, graph partitioning and Boolean satisfiability, and a deep belief network application are mapped onto the proposed hardware. As compared to a multicore system, the proposed accelerator achieves 57x higher performance and 25x lower energy with virtually no loss in the quality of the solution to the optimization problems. The memristive accelerator is also compared against an RRAM based processing-in-memory (PIM) system, with respective performance and energy improvements of 6.89x and 5.2x.
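As background on the computational model being accelerated, the following numpy sketch shows the core Boltzmann machine operation, a stochastic unit update under an energy function, driven by an annealing schedule. The weights and schedule are placeholders chosen for illustration; in the proposed accelerator, the dot product against each row of the weight matrix is the step computed in situ inside the RRAM arrays.

```python
import numpy as np

rng = np.random.default_rng(1)

# Symmetric weights encode the optimization problem (e.g. a graph to partition);
# random placeholders are used here purely for illustration.
n = 16
W = rng.normal(size=(n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
b = np.zeros(n)
s = rng.integers(0, 2, size=n).astype(float)     # binary unit states

def energy(state):
    return -0.5 * state @ W @ state - b @ state

def gibbs_sweep(state, T):
    """Update each unit given all the others; W[i] @ state is the all-to-all
    dot product that a memristive crossbar would evaluate in memory."""
    for i in range(n):
        local_field = W[i] @ state + b[i]
        p_on = 1.0 / (1.0 + np.exp(-local_field / T))   # sigmoid at temperature T
        state[i] = 1.0 if rng.random() < p_on else 0.0
    return state

# Simulated annealing: sweep while lowering the temperature.
for T in np.geomspace(2.0, 0.05, num=50):
    s = gibbs_sweep(s, T)
print("final energy:", round(float(energy(s)), 3))
```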

173 citations


Proceedings ArticleDOI
25 Mar 2016
TL;DR: OpenPiton is the world's first open source, general-purpose, multithreaded manycore processor and framework that leverages the industry hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore creating a flexible, modern manycore design.
Abstract: Industry is building larger, more complex, manycore processors on the back of strong institutional knowledge, but academic projects face difficulties in replicating that scale. To alleviate these difficulties and to develop and share knowledge, the community needs open architecture frameworks for simulation, synthesis, and software exploration which support extensibility, scalability, and configurability, alongside an established base of verification tools and supported software. In this paper we present OpenPiton, an open source framework for building scalable architecture research prototypes from 1 core to 500 million cores. OpenPiton is the world's first open source, general-purpose, multithreaded manycore processor and framework. OpenPiton leverages the industry hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore creating a flexible, modern manycore design. In addition, OpenPiton provides synthesis and backend scripts for ASIC and FPGA to enable other researchers to bring their designs to implementation. OpenPiton provides a complete verification infrastructure of over 8000 tests, is supported by mature software tools, runs full-stack multiuser Debian Linux, and is written in industry standard Verilog. Multiple implementations of OpenPiton have been created including a taped-out 25-core implementation in IBM's 32nm process and multiple Xilinx FPGA prototypes.

165 citations


Proceedings ArticleDOI
25 Mar 2016
TL;DR: PUPiL provides a promising way to enforce power caps with greater performance than current state-of-the-art hardware-only approaches, and combines hardware's fast reaction time with software's flexibility.
Abstract: Power and thermal dissipation constrain multicore performance scaling. Modern processors are built such that they could sustain damaging levels of power dissipation, creating a need for systems that can implement processor power caps. A particular challenge is developing systems that can maximize performance within a power cap, and approaches have been proposed in both software and hardware. Software approaches are flexible, allowing multiple hardware resources to be coordinated for maximum performance, but software is slow, requiring a long time to converge to the power target. In contrast, hardware power capping quickly converges to the power cap, but only manages voltage and frequency, limiting its potential performance. In this work we propose PUPiL, a hybrid software/hardware power capping system. Unlike previous approaches, PUPiL combines hardware's fast reaction time with software's flexibility. We implement PUPiL on a real Linux/x86 platform and compare it to Intel's commercial hardware power capping system for both single and multi-application workloads. We find PUPiL provides the same reaction time as Intel's hardware with significantly higher performance. On average, PUPiL outperforms hardware by 1.18-2.4x depending on workload and power target. Thus, PUPiL provides a promising way to enforce power caps with greater performance than current state-of-the-art hardware-only approaches.
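The division of labor described above, hardware clamping power immediately while software searches resource configurations for performance, can be sketched as a simple feedback loop. The Python pseudocode below is a hedged illustration, not the actual PUPiL controller: every callback (read_power, set_hw_power_limit, apply_config, measure_throughput) is a hypothetical platform hook that on Linux/x86 would map to mechanisms such as RAPL counters and sysfs knobs.

```python
import time

POWER_CAP_W = 60.0

def hybrid_power_cap_loop(read_power, set_hw_power_limit, candidate_configs,
                          apply_config, measure_throughput, period_s=1.0):
    """Hardware enforces the cap at once; software then explores resource
    configurations (core count, thread placement, ...) for the best
    performance that stays under the cap."""
    set_hw_power_limit(POWER_CAP_W)              # fast, always-on safety net
    best = (None, 0.0)
    for config in candidate_configs:             # slow software exploration
        apply_config(config)
        time.sleep(period_s)                     # let the system settle
        power, perf = read_power(), measure_throughput()
        if power <= POWER_CAP_W and perf > best[1]:
            best = (config, perf)
    if best[0] is not None:
        apply_config(best[0])                    # settle on the best feasible config
    return best

# Toy usage with dummy hooks (all hypothetical):
state = {"cores": 2}
print(hybrid_power_cap_loop(
    read_power=lambda: 10.0 * state["cores"],            # pretend power grows with cores
    set_hw_power_limit=lambda watts: None,
    candidate_configs=[{"cores": c} for c in (2, 4, 8)],
    apply_config=lambda cfg: state.update(cfg),
    measure_throughput=lambda: float(state["cores"]),     # pretend throughput grows with cores
    period_s=0.0))
```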

150 citations


Journal ArticleDOI
TL;DR: FARGO3D is a magnetohydrodynamics code developed with special emphasis on protoplanetary disk physics and planet-disk interactions, and parallelized with MPI.
Abstract: We present the FARGO3D code, recently publicly released. It is a magnetohydrodynamics code developed with special emphasis on protoplanetary disk physics and planet-disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite difference upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of the magnetic field to machine accuracy, coupled to a method of characteristics for the evaluation of electromotive forces and Lorentz forces. Orbital advection is implemented, and an N-body solver is included to simulate planets or stars interacting with the gas. We present our implementation in detail and present a number of widely known tests for comparison purposes. One strength of FARGO3D is that it can run on both Graphics Processing Units (GPUs) and Central Processing Units (CPUs), achieving a large speedup with respect to CPU cores. We describe our implementation choices, which allow a user with no prior knowledge of GPU programming to develop new routines for the CPU, and have them translated automatically for the GPU.
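To give a flavor of the "finite difference upwind, dimensionally split" hydrodynamics mentioned in the abstract, here is a hedged numpy sketch of a single first-order upwind advection step in one dimension. It is a toy building block for exposition, not one of FARGO3D's actual kernels.

```python
import numpy as np

def upwind_advect_1d(q, v, dx, dt):
    """One explicit first-order upwind step of dq/dt + v*dq/dx = 0 on a periodic grid.
    The donor cell is chosen according to the sign of the local velocity."""
    backward = (q - np.roll(q, 1)) / dx      # used where v > 0
    forward = (np.roll(q, -1) - q) / dx      # used where v < 0
    return q - dt * np.where(v > 0, v * backward, v * forward)

# Toy usage: advect a Gaussian pulse to the right under a CFL-limited time step.
x = np.linspace(0.0, 1.0, 200, endpoint=False)
dx = x[1] - x[0]
q = np.exp(-((x - 0.3) / 0.05) ** 2)
v = np.full_like(x, 1.0)
dt = 0.8 * dx / np.max(np.abs(v))            # CFL number 0.8
for _ in range(100):
    q = upwind_advect_1d(q, v, dx, dt)
print("peak after advection:", round(float(q.max()), 3))
```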

123 citations


Proceedings ArticleDOI
11 Apr 2016
TL;DR: It is shown that cache partitioning does not necessarily ensure predictable cache performance in modern COTS multicore platforms that use non-blocking caches to exploit memory-level parallelism (MLP), and a hardware and system software collaborative approach is proposed to efficiently eliminate MSHR contention for multicore real-time systems.
Abstract: In this paper, we show that cache partitioning does not necessarily ensure predictable cache performance in modern COTS multicore platforms that use non-blocking caches to exploit memory-level parallelism (MLP). Through carefully designed experiments using three real COTS multicore platforms (four distinct CPU architectures) and a cycle-accurate full system simulator, we show that special hardware registers in non-blocking caches, known as Miss Status Holding Registers (MSHRs), which track the status of outstanding cache-misses, can be a significant source of contention; we observe up to 21X WCET increase in a real COTS multicore platform due to MSHR contention. We propose a hardware and system software (OS) collaborative approach to efficiently eliminate MSHR contention for multicore real-time systems. Our approach includes a low-cost hardware extension that enables dynamic control of per-core MLP by the OS. Using the hardware extension, the OS scheduler then globally controls each core's MLP in such a way that eliminates MSHR contention and maximizes overall throughput of the system. We implement the hardware extension in a cycle-accurate full-system simulator and the scheduler modification in the Linux 3.14 kernel. We evaluate the effectiveness of our approach using a set of synthetic and macro benchmarks. In a case study, we achieve up to 19% WCET reduction (average: 13%) for a set of EEMBC benchmarks compared to a baseline cache partitioning setup.

104 citations


Journal ArticleDOI
18 Jun 2016
TL;DR: A hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing is described.
Abstract: Acceleration in the form of customized datapaths offers large performance and energy improvements over general-purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration. We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6-core processor and show speedups of up to 16.7×.

93 citations


Journal ArticleDOI
TL;DR: An area-efficient, globally asynchronous, locally synchronous network-on-chip (NoC) architecture for a hard real-time multiprocessor platform that uses statically scheduled time-division multiplexing (TDM) to control the communication over a structure of routers, links, and network interfaces (NIs).
Abstract: In this paper, we present an area-efficient, globally asynchronous, locally synchronous network-on-chip (NoC) architecture for a hard real-time multiprocessor platform. The NoC implements message-passing communication between processor cores. It uses statically scheduled time-division multiplexing (TDM) to control the communication over a structure of routers, links, and network interfaces (NIs) to offer real-time guarantees. The area-efficient design is a result of two contributions: 1) asynchronous routers combined with TDM scheduling and 2) a novel NI microarchitecture. Together they result in a design in which data are transferred in a pipelined fashion, from the local memory of the sending core to the local memory of the receiving core, without any dynamic arbitration, buffering, and clock synchronization. The routers use two-phase bundled-data handshake latches based on the Mousetrap latch controller and are extended with a clock gating mechanism to reduce the energy consumption. The NIs integrate the direct memory access functionality and the TDM schedule, and use dual-ported local memories to avoid buffering, flow-control, and synchronization. To verify the design, we have implemented a 4 × 4 bi-torus NoC in 65-nm CMOS technology, and we present post-layout results on area, speed, and energy consumption for the router, NI, and NoC.
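The statically scheduled TDM idea can be illustrated with a slot table: each network interface follows a fixed, repeating schedule that dictates which destination (if any) it may inject a packet for in each time slot, so no dynamic arbitration is needed. The Python sketch below uses an invented 4-node all-to-all schedule, not the actual schedule of the NoC described above.

```python
# Hypothetical TDM slot tables for a 4-core NoC: slot_tables[core][slot] is the
# destination that 'core' may send to in that slot (None = stay idle).
# A valid schedule never lets two sources target the same destination in one slot.
slot_tables = {
    0: [1, 2, 3, None],
    1: [2, 3, 0, None],
    2: [3, 0, 1, None],
    3: [0, 1, 2, None],
}
SLOTS = 4

def sends_in_slot(t):
    """Which source sends to which destination in TDM slot t (t counts slots cyclically)."""
    slot = t % SLOTS
    return {src: table[slot] for src, table in slot_tables.items()
            if table[slot] is not None}

for t in range(SLOTS):
    sends = sends_in_slot(t)
    dests = list(sends.values())
    assert len(dests) == len(set(dests)), "conflict: two sources share a destination"
    print("slot", t, ":", sends)
```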

Proceedings ArticleDOI
05 Oct 2016
TL;DR: Experimental results demonstrate that Blaze greatly reduces the programming efforts to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7× to 3× compared to a conventional CPU-only cluster.
Abstract: With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems---like Apache Spark and Hadoop---to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming efforts to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.

Journal ArticleDOI
TL;DR: This work presents the linear algebra library ViennaCL, which is built on top of all three programming models, thus enabling computational scientists to interface to a single library, yet obtain high performance for all three hardware types.
Abstract: CUDA, OpenCL, and OpenMP are popular programming models for the multicore architectures of CPUs and many-core architectures of GPUs or Xeon Phis. At the same time, computational scientists face the question of which programming model to use to obtain their scientific results. We present the linear algebra library ViennaCL, which is built on top of all three programming models, thus enabling computational scientists to interface to a single library, yet obtain high performance for all three hardware types. Since the respective compute back end can be selected at runtime, one can seamlessly switch between different hardware types without the need for error-prone and time-consuming recompilation steps. We present new benchmark results for sparse linear algebra operations in ViennaCL, complementing results for the dense linear algebra operations in ViennaCL reported in earlier work. Comparisons with vendor libraries show that ViennaCL provides better overall performance for sparse matrix-vector and sparse matrix-matrix products.

Journal ArticleDOI
TL;DR: This paper is primarily dedicated to the distributed-memory parallelization of particle methods, targeting several thousands of CPU cores, focusing on one particular particle method, Smoothed Particle Hydrodynamics (SPH), one of the most widespread today in the literature as well as in engineering.

Journal ArticleDOI
TL;DR: In this article, the authors employ the recently published Execution-Cache-Memory (ECM) model to describe the single-core and multi-core performance of streaming kernels, and derive a simple, phenomenological power model for the Sandy Bridge processor.
Abstract: Modern multi-core chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single-core and multi-core performance of streaming kernels. The model refines the well-known roofline model, because it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multi-core chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different requirements to the hardware, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multi-core processors and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice Boltzmann flow solver code.
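The saturation behavior referred to above can be summarized in a simplified, roofline-style form in the spirit of the ECM model (a generic formulation, not the paper's exact equations): a bandwidth-limited loop scales linearly with the number of active cores n until the memory bandwidth b_S is exhausted,

\[
P(n) \;=\; \min\!\left(n \, P_{\text{core}},\; \frac{b_S}{B_C}\right),
\qquad
n_S \;=\; \left\lceil \frac{b_S}{B_C \, P_{\text{core}}} \right\rceil,
\]

where P_core is the single-core performance, B_C is the code balance (bytes transferred per unit of work), and n_S is the saturation point beyond which additional cores add power but no performance; this is why running at or near n_S, possibly at reduced clock frequency, tends to minimize energy to solution.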

Journal ArticleDOI
01 Sep 2016
TL;DR: PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology is proposed.
Abstract: Novel pervasive devices such as smart surveillance cameras and autonomous micro-UAVs could greatly benefit from the availability of a computing device supporting embedded computer vision at a very low power budget. To this end, we propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology. We show that PULP performance can be scaled over a 1x-354x range, with a peak theoretical energy efficiency of 211 GOPS/W. We present performance results for several demanding kernels from the image processing and vision domain, with post-layout power modeling: a motion detection application that can run at an efficiency up to 192 GOPS/W (90% of the theoretical peak); a ConvNet-based detector for smart surveillance that can be switched between 0.7 and 27 fps operating modes, scaling energy consumption per frame between 1.2 and 12 mJ on a 320×240 image; and FAST + Lucas-Kanade optical flow on a 128×128 image at the ultra-low energy budget of 14 μJ per frame at 60 fps.

Journal ArticleDOI
Ramzi Mahmoudi, Mohamed Akil
TL;DR: In this paper, the authors present an enhanced computation method for smoothing 2D objects in the binary case, which provides parallel computation and better memory management, while preserving the topology of the original image by using homotopic transformations defined in the framework of digital topology.
Abstract: To prepare images for better segmentation, we need preprocessing applications, such as smoothing, to reduce noise. In this paper, we present an enhanced computation method for smoothing 2D objects in the binary case. Unlike existing approaches, the proposed method provides parallel computation and better memory management, while preserving the topology (number of connected components) of the original image by using homotopic transformations defined in the framework of digital topology. We introduce an adapted parallelization strategy called split, distribute and merge (SDM) strategy which allows efficient parallelization of a large class of topological operators. To achieve a good speedup and better memory allocation, we cared about task scheduling and managing. Distributed work during the smoothing process is done by a variable number of threads. Tests on a 2D grayscale image (512*512), using a shared memory parallel machine (SMPM) with 8 CPU cores (two Xeon E5405 running at 2 GHz), showed a speedup of 5.2 with a cache success rate of 70%.

Journal ArticleDOI
01 Jan 2016
TL;DR: The effects of the three most important transfer parameters used to enhance data transfer throughput are analyzed: pipelining, parallelism, and concurrency; two different transfer optimization algorithms are presented that outperform the most popular data transfer tools in the majority of cases.
Abstract: In end-to-end data transfers, there are several factors affecting the data transfer throughput, such as the network characteristics (e.g., network bandwidth, round-trip-time, background traffic); end-system characteristics (e.g., NIC capacity, number of CPU cores and their clock rate, number of disk drives and their I/O rate); and the dataset characteristics (e.g., average file size, dataset size, file size distribution). Optimization of big data transfers over inter-cloud and intra-cloud networks is a challenging task that requires joint consideration of all of these parameters. This optimization task becomes even more challenging when transferring datasets comprised of heterogeneous file sizes (i.e., large files and small files mixed). Previous work in this area focuses only on the end-system and network characteristics; however, it does not provide models regarding the dataset characteristics. In this study, we analyze the effects of the three most important transfer parameters that are used to enhance data transfer throughput: pipelining, parallelism and concurrency. We provide models and guidelines to set the best values for these parameters and present two different transfer optimization algorithms that use the models developed. The tests conducted over high-speed networking and cloud testbeds show that our algorithms outperform the most popular data transfer tools like Globus Online and UDT in the majority of cases.
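To make the three parameters concrete, the hedged sketch below shows how concurrency (several files in flight at once) and parallelism (several streams per file) are typically mapped onto thread pools; pipelining, which keeps a control channel busy with back-to-back requests for many small files, is omitted. transfer_stream is a hypothetical placeholder for the actual protocol operation, and the models from the paper would choose the two pool sizes.

```python
from concurrent.futures import ThreadPoolExecutor

def transfer_stream(path, offset, length):
    """Hypothetical placeholder: move bytes [offset, offset + length) of one file."""
    return length

def transfer_file(path, size, parallelism):
    """Parallelism: split a single file across several simultaneous streams."""
    chunk = -(-size // parallelism)                  # ceiling division
    with ThreadPoolExecutor(max_workers=parallelism) as streams:
        futures = [streams.submit(transfer_stream, path, off, min(chunk, size - off))
                   for off in range(0, size, chunk)]
        return sum(f.result() for f in futures)

def transfer_dataset(files, concurrency, parallelism):
    """Concurrency: keep several files in flight at the same time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(transfer_file, path, size, parallelism)
                   for path, size in files]
        return sum(f.result() for f in futures)

# Toy usage: a mixed dataset of one large and two small files.
dataset = [("big.bin", 10_000_000), ("small1.txt", 2_000), ("small2.txt", 3_000)]
print("bytes moved:", transfer_dataset(dataset, concurrency=2, parallelism=4))
```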

Proceedings ArticleDOI
27 Feb 2016
TL;DR: The results demonstrate the high-degree of flexibility and configurability of the approach, and show the effectiveness of the elastic scaling strategies compared with existing state-of-the-art techniques used in similar scenarios.
Abstract: This paper addresses the problem of designing scaling strategies for elastic data stream processing. Elasticity allows applications to rapidly change their configuration on-the-fly (e.g., the amount of used resources) in response to dynamic workload fluctuations. In this work we face this problem by adopting the Model Predictive Control technique, a control-theoretic method aimed at finding the optimal application configuration along a limited prediction horizon in the future by solving an online optimization problem. Our control strategies are designed to address latency constraints, using Queueing Theory models, and energy consumption by changing the number of used cores and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support available in the modern multicore CPUs. The proactive capabilities, in addition to the latency- and energy-awareness, represent the novel features of our approach. To validate our methodology, we develop a thorough set of experiments on a high-frequency trading application. The results demonstrate the high-degree of flexibility and configurability of our approach, and show the effectiveness of our elastic scaling strategies compared with existing state-of-the-art techniques used in similar scenarios.
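The per-step decision such a controller makes can be sketched as a small optimization over candidate configurations. The Python below is a hedged, one-step illustration: it uses a textbook M/M/1-style latency estimate and an invented quadratic power model as stand-ins for the paper's Queueing Theory and DVFS models, and it omits the prediction horizon and reconfiguration costs of the full MPC formulation.

```python
from itertools import product

FREQS_GHZ = [1.2, 1.8, 2.4]
CORES = [1, 2, 4, 8]
LATENCY_BOUND_S = 0.01

def predicted_latency(arrival_rate, cores, freq, service_rate_per_ghz=200.0):
    """M/M/1-style estimate with pooled service capacity (illustrative only)."""
    capacity = cores * freq * service_rate_per_ghz   # tuples/s the configuration can absorb
    if arrival_rate >= capacity:
        return float("inf")                          # unstable: the queue grows without bound
    return 1.0 / (capacity - arrival_rate)

def predicted_power(cores, freq, p_static=2.0, k=1.5):
    return p_static + k * cores * freq ** 2          # invented DVFS-style power model

def choose_configuration(predicted_arrival_rate):
    """Pick the lowest-power (cores, frequency) pair that meets the latency bound."""
    feasible = [(predicted_power(c, f), c, f)
                for c, f in product(CORES, FREQS_GHZ)
                if predicted_latency(predicted_arrival_rate, c, f) <= LATENCY_BOUND_S]
    if not feasible:
        return max(CORES), max(FREQS_GHZ)            # fall back to maximum capacity
    _, cores, freq = min(feasible)
    return cores, freq

# One control step: the forecast workload for the next interval drives the choice.
print(choose_configuration(predicted_arrival_rate=1500.0))   # e.g. (8, 1.2)
```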

Proceedings ArticleDOI
05 Jun 2016
TL;DR: A first-of-its-kind framework overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieving desired resilience targets at minimal cost by combining resilience techniques across various layers of the system stack, such as circuit-level hardening, logic-level parity checking, and micro-architectural recovery.
Abstract: We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (798 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal costs, and provides guidelines for the design of new resilience techniques. We demonstrate the practicality and effectiveness of our framework using two diverse designs: a simple, in-order processor core and a complex, out-of-order processor core. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50× improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a cost-effective soft error resilience solution as well (with ~1% additional energy cost for a 50× improvement in silent data corruption rate).

Proceedings ArticleDOI
11 Apr 2016
TL;DR: This paper shows how contention-free execution of real-time tasks can be achieved on scratchpad-based architectures, and how a separation of application logic and I/O operations in the time domain can be enforced, and proposes a novel scratchpad-centric OS design for multi-core platforms.
Abstract: Multi-core processors have replaced single-core systems in almost every segment of the industry. Unfortunately, their increased complexity often causes a loss of temporal predictability which represents a key requirement for hard real-time systems. Major sources of unpredictability are the shared low level resources, such as the memory hierarchy and the I/O subsystem. In this paper, we approach the problem of shared resource arbitration at an OS-level and propose a novel scratchpad-centric OS design for multi-core platforms. In the proposed OS, the predictable usage of shared resources across multiple cores represents a central design-time goal. Hence, we show (i) how contention-free execution of real-time tasks can be achieved on scratchpad-based architectures, and (ii) how a separation of application logic and I/O operations in the time domain can be enforced. To validate the proposed design, we implemented the proposed OS using a commercial-off-the-shelf (COTS) platform. Experiments show that this novel design delivers predictable temporal behavior to hard real-time tasks, and it improves performance up to 2.1x compared to traditional approaches.

Journal ArticleDOI
TL;DR: In Quest-V, a hypervisor is only needed to bootstrap the system, recover from certain faults, and establish communication channels between sandboxes, which reduces the memory footprint of the most privileged protection domain and removes it from the control path during normal system operation, thereby heightening security.
Abstract: Multi- and many-core processors are becoming increasingly popular in embedded systems. Many of these processors now feature hardware virtualization capabilities, as found on the ARM Cortex A15 and x86 architectures with Intel VT-x or AMD-V support. Hardware virtualization provides a way to partition physical resources, including processor cores, memory, and I/O devices, among guest virtual machines (VMs). Each VM is then able to host tasks of a specific criticality level, as part of a mixed-criticality system with different timing and safety requirements. However, traditional virtual machine systems are inappropriate for mixed-criticality computing. They use hypervisors to schedule separate VMs on physical processor cores. The costs of trapping into hypervisors to multiplex and manage machine physical resources on behalf of separate guests are too expensive for many time-critical tasks. Additionally, traditional hypervisors have memory footprints that are often too large for many embedded computing systems. In this article, we discuss the design of the Quest-V separation kernel, which partitions services of different criticality levels across separate VMs, or sandboxes. Each sandbox encapsulates a subset of machine physical resources that it manages without requiring intervention from a hypervisor. In Quest-V, a hypervisor is only needed to bootstrap the system, recover from certain faults, and establish communication channels between sandboxes. This not only reduces the memory footprint of the most privileged protection domain but also removes it from the control path during normal system operation, thereby heightening security.

Journal ArticleDOI
TL;DR: The level set equation is solved with a semi-Lagrangian method which, similar to its serial implementation, is free of any time-step restrictions, and a scalable global interpolation scheme on adaptive tree-based grids is introduced.

Journal ArticleDOI
TL;DR: This paper proposes an alternative solution based on a layered OMS/NMS LDPC decoding algorithm that can be efficiently implemented on a multi-core device using Single Instruction Multiple Data and Single Program Multiple Data programming models.
Abstract: Low-Density Parity-Check (LDPC) codes are an efficient way to correct transmission errors in digital communication systems. Although initially restricted to ASICs due to their computational complexity, LDPC decoders have recently been ported to multicore and many-core systems. Most works focused on taking advantage of GPU devices. In this paper, we propose an alternative solution based on a layered OMS/NMS LDPC decoding algorithm that can be efficiently implemented on a multi-core device using Single Instruction Multiple Data (SIMD) and Single Program Multiple Data (SPMD) programming models. Several experimentations were performed on an x86 processor target. Throughputs up to 170 Mbps were achieved on a single core of an INTEL Core i7 processor when executing 20 layered-based decoding iterations. Throughput reaches up to 560 Mbps on four INTEL Core i7 cores. Experimental results show that the proposed implementations achieve similar BER correction performance to previous works. Moreover, much higher throughputs have been achieved in comparison with all previous GPU and CPU works, with gains ranging from 1.4x to 8x relative to recent GPU works.
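To give a feel for the computation being vectorized, the following hedged numpy sketch shows one layer of a normalized min-sum (NMS) update: the check-node processing for a single parity check followed by the posterior LLR refresh. The scaling factor, code structure, and data layout are illustrative only; the decoders described above map these element-wise operations onto SIMD lanes and process many checks or frames per instruction.

```python
import numpy as np

ALPHA = 0.8                                         # NMS normalization factor (illustrative)

def layer_update(posteriors, check_msgs, var_idx):
    """Process one layer (one check node) of a layered normalized min-sum decoder.

    posteriors : current LLRs of all variable nodes (updated in place)
    check_msgs : previous check-to-variable messages for this layer
    var_idx    : indices of the variables participating in this check
    """
    extrinsic = posteriors[var_idx] - check_msgs    # remove this check's old contribution
    signs = np.sign(extrinsic)
    sign_prod = np.prod(signs)
    mags = np.abs(extrinsic)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]     # two-minimum trick
    new_msgs = np.where(np.arange(len(var_idx)) == order[0], min2, min1)
    new_msgs = ALPHA * sign_prod * signs * new_msgs # each edge excludes its own sign
    posteriors[var_idx] = extrinsic + new_msgs      # refresh the posteriors
    return new_msgs

# Toy usage: one parity check over variables 0, 2, 3 with some channel LLRs.
llrs = np.array([1.2, -0.4, 0.3, -2.0])
msgs = layer_update(llrs, np.zeros(3), np.array([0, 2, 3]))
print("updated LLRs:", np.round(llrs, 2))
```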

Proceedings ArticleDOI
19 Oct 2016
TL;DR: A response time analysis technique for Synchronous Data Flow programs mapped to multiple parallel dependent tasks running on a compute cluster of the Kalray MPPA-256 many-core processor is introduced and leads to response times that are a factor of 4.15 smaller for this application.
Abstract: In this paper we introduce a response time analysis technique for Synchronous Data Flow programs mapped to multiple parallel dependent tasks running on a compute cluster of the Kalray MPPA-256 many-core processor. The analysis we derive computes a set of response times and release dates that respect the constraints in the task dependency graph. We extend the Multicore Response Time Analysis (MRTA) framework by deriving a mathematical model of the multi-level bus arbitration policy used by the MPPA. Further, we refine the analysis to account for the release dates and response times of co-runners, and the use of memory banks. Further improvements to the precision of the analysis were achieved by splitting each task into two sequential phases, with the majority of the memory accesses in the first phase, and a small number of writes in the second phase. Our experimental evaluation focused on an avionics case study. Using measurements from the Kalray MPPA-256 as a basis, we show that the new analysis leads to response times that are a factor of 4.15 smaller for this application than the default approach of assuming worst-case interference on each memory access.
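The style of analysis being extended can be summarized by the classical response-time recurrence with an added interference term for the shared memory path; in a simplified, generic form (not the exact MRTA formulation for the MPPA):

\[
R_i^{(k+1)} \;=\; C_i \;+\; \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j \;+\; I_{\text{mem}}\!\left(R_i^{(k)}\right),
\]

iterated from R_i^{(0)} = C_i until it converges or exceeds the deadline, where C_i is the task's worst-case execution time, hp(i) the set of higher-priority tasks sharing its core, T_j their periods, and I_mem a bound on the delay caused by contention on the multi-level bus and the memory banks. Modeling that contention per phase, rather than assuming worst-case interference on every memory access, is what yields the factor-of-4.15 tighter response times reported above.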

Journal ArticleDOI
TL;DR: This article evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications, namely, sparse matrix multifrontal factorizations that feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption.
Abstract: To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This article evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications, namely, sparse matrix multifrontal factorizations that feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse, direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system.

Journal ArticleDOI
TL;DR: Nornir is proposed, an algorithm that automatically derives, without relying on historical data about previous executions, performance and power consumption models of an application in different configurations, and uses these models to select a close-to-optimal configuration for a given user requirement on either performance or power consumption.
Abstract: In current computing systems, many applications require guarantees on their maximum power consumption to not exceed the available power budget. On the other hand, for some applications, it could be possible to decrease their performance, yet maintain an acceptable level, in order to reduce their power consumption. To provide such guarantees, a possible solution consists in changing the number of cores assigned to the application, their clock frequency, and the placement of application threads over the cores. However, power consumption and performance have different trends depending on the application considered and on its input. Finding a configuration of resources satisfying user requirements is, in the general case, a challenging task. In this article, we propose Nornir, an algorithm to automatically derive, without relying on historical data about previous executions, performance and power consumption models of an application in different configurations. By using these models, we are able to select a close-to-optimal configuration for the given user requirement, either performance or power consumption. The configuration of the application will be changed on-the-fly throughout the execution to adapt to workload fluctuations, external interferences, and/or application's phase changes. We validate the algorithm by simulating it over the applications of the PARSEC benchmark suite. Then, we implement our algorithm and we analyse its accuracy and overhead over some of these applications on a real execution environment. Eventually, we compare the quality of our proposal with that of the optimal algorithm and of some state-of-the-art solutions.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: The first shared SMP algorithm for fully parallel contour tree computation is reported, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speedup in OpenMP and up to 50x speedup in NVIDIA Thrust.
Abstract: As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. We report the first shared SMP algorithm for fully parallel contour tree computation, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speedup in OpenMP and up to 50x speedup in NVIDIA Thrust.

Journal ArticleDOI
TL;DR: This work presents the first global memory-centric scheduling policy for memory-intensive task sets whose jobs can be modeled as a sequence of memory- intensive (memory phase) and execution-intensive (execution phase) phases and introduces the notion of virtual memory cores as a fundamental technique for performing phase-based response time analysis forMemory- intensive task sets.
Abstract: As the number of cores increases, more master components can simultaneously access main memory. In real-time systems, this ongoing trend is leading to crippling pessimism when computing the worst-case cache miss time, since a memory request could potentially contend with other requests coming from every other core in the system. CPU-centric scheduling policies, therefore, are no longer sufficient to guarantee schedulability without introducing unacceptable pessimism for memory-intensive task sets. For this reason, we believe a shift is needed towards real-time scheduling approaches that can prevent timing interference from memory contention, while still making efficient use of the multicore platform. Previously, we have demonstrated the practicality of the PREM task model, where each job consists of a sequence of phases, some of which access memory and some of which perform only computation on cached data. In this work, we present the first global memory-centric scheduling policy for memory-intensive task sets whose jobs can be modeled as a sequence of memory-intensive (memory phase) and execution-intensive (execution phase) phases. The proposed policy is parameterizable based on the number of cores which are allowed to concurrently access main memory without saturating it. Building upon results from multicore response-time analysis, we introduce the notion of virtual memory cores as a fundamental technique for performing phase-based response time analysis for memory-intensive task sets. Finally, we use synthetic task set generation to demonstrate that the proposed scheduling policy and related schedulability bound do indeed better schedule memory-intensive task sets when compared to state-of-the-art multicore scheduling.

Proceedings ArticleDOI
23 May 2016
TL;DR: This work proposes a novel parallel algorithm for ConvNet training based on decomposition into a set of tasks, most of which are convolutions or FFTs; within the PRAM model, linear speedup with the number of processors is attainable, and the ZNN implementation attains speedup roughly equal to the number of physical cores on multi-core CPUs.
Abstract: Convolutional networks (ConvNets) have become a popular approach to computer vision. It is important to accelerate ConvNet training, which is computationally costly. We propose a novel parallel algorithm based on decomposition into a set of tasks, most of which are convolutions or FFTs. Applying Brent's theorem to the task dependency graph implies that linear speedup with the number of processors is attainable within the PRAM model of parallel computation, for wide network architectures. To attain such performance on real shared-memory machines, our algorithm computes convolutions converging on the same node of the network with temporal locality to reduce cache misses, and sums the convergent convolution outputs via an almost wait-free concurrent method to reduce time spent in critical sections. We implement the algorithm with a publicly available software package called ZNN. Benchmarking with multi-core CPUs shows that ZNN can attain speedup roughly equal to the number of physical cores. We also show that ZNN can attain over 90x speedup on a many-core CPU (Xeon Phi Knights Corner). These speedups are achieved for network architectures with widths that are in common use. The task parallelism of the ZNN algorithm is suited to CPUs, while the SIMD parallelism of previous algorithms is compatible with GPUs. Through examples, we show that ZNN can be either faster or slower than certain GPU implementations depending on specifics of the network architecture, filter sizes, and density and size of the output patch. ZNN may be less costly to develop and maintain, due to the relative ease of general-purpose CPU programming.
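The two main ingredients named in the abstract, FFT-based convolution and task-level parallelism across a layer's many filters, can be sketched in a few lines of hedged Python; the thread pool stands in for ZNN's task scheduler, and all shapes and names are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fft_convolve2d(image, kernel):
    """'Full' 2D convolution computed via the FFT (one task in the task graph)."""
    out_shape = (image.shape[0] + kernel.shape[0] - 1,
                 image.shape[1] + kernel.shape[1] - 1)
    spectrum = np.fft.rfft2(image, out_shape) * np.fft.rfft2(kernel, out_shape)
    return np.fft.irfft2(spectrum, out_shape)

def conv_layer(image, kernels, workers=4):
    """One ConvNet layer: each filter is an independent task run in parallel;
    the per-filter feature maps are stacked afterwards."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return np.stack(list(pool.map(lambda k: fft_convolve2d(image, k), kernels)))

# Toy usage: 8 random 5x5 filters applied to a 64x64 image.
rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))
kernels = rng.normal(size=(8, 5, 5))
features = conv_layer(image, kernels)
print(features.shape)                                # (8, 68, 68)
```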