
Showing papers on "Multi-core processor published in 2015"


Proceedings ArticleDOI
28 Oct 2015
TL;DR: MoonGen as discussed by the authors is a high-speed packet generator that can saturate 10 GbE links with minimum-sized packets while using only a single CPU core by running on top of the packet processing framework DPDK.
Abstract: We present MoonGen, a flexible high-speed packet generator. It can saturate 10 GbE links with minimum-sized packets while using only a single CPU core by running on top of the packet processing framework DPDK. Linear multi-core scaling allows for even higher rates: We have tested MoonGen with up to 178.5 Mpps at 120 Gbit/s. Moving the whole packet generation logic into user-controlled Lua scripts allows us to achieve the highest possible flexibility. In addition, we utilize hardware features of commodity NICs that have not been used for packet generators previously. A key feature is the measurement of latency with sub-microsecond precision and accuracy by using hardware timestamping capabilities of modern commodity NICs. We address timing issues with software-based packet generators and apply methods to mitigate them with both hardware support and with a novel method to control the inter-packet gap in software. Features that were previously only possible with hardware-based solutions are now provided by MoonGen on commodity hardware. MoonGen is available as free software under the MIT license in our git repository at https://github.com/emmericp/MoonGen
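The line-rate figures quoted above follow from simple Ethernet arithmetic. The sketch below is not MoonGen's Lua API, just the underlying calculation: maximum frame rate on a link (accounting for the 20 bytes of per-frame wire overhead) and the inter-packet gap a software rate controller must sustain.

```python
def frames_per_second(link_gbps, frame_bytes):
    """Maximum frame rate on an Ethernet link, including the 20 bytes of
    per-frame wire overhead (8 B preamble/SFD + 12 B inter-frame gap)."""
    wire_bytes = frame_bytes + 20
    return link_gbps * 1e9 / (wire_bytes * 8)

def inter_packet_gap_ns(target_mpps):
    """Time budget between consecutive transmissions for a target rate."""
    return 1e9 / (target_mpps * 1e6)

# 64-byte frames on 10 GbE: the classic 14.88 Mpps line rate,
# i.e. one packet roughly every 67 ns.
rate = frames_per_second(10, 64)
gap = inter_packet_gap_ns(14.88)
```

Hitting a ~67 ns per-packet budget in software is exactly why the paper needs hardware rate-control support and a dedicated method for controlling the inter-packet gap.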

263 citations


Proceedings ArticleDOI
14 Mar 2015
TL;DR: This is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization automatically; its performance is up to 1.81x better than that achieved through manual tuning in Halide, a state-of-the-art language and compiler for image processing pipelines.
Abstract: This paper presents the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each stage typically performs one of point-wise, stencil, reduction or data-dependent operations on image pixels. Individual stages in a pipeline typically exhibit abundant data parallelism that can be exploited with relative ease. However, the stages also require high memory bandwidth preventing effective utilization of parallelism available on modern architectures. For applications that demand high performance, the traditional options are to use optimized libraries like OpenCV or to optimize manually. While using libraries precludes optimization across library routines, manual optimization accounting for both parallelism and locality is very tedious. The focus of our system, PolyMage, is on automatically generating high-performance implementations of image processing pipelines expressed in a high-level declarative language. Our optimization approach primarily relies on the transformation and code generation capabilities of the polyhedral compiler framework. To the best of our knowledge, this is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization automatically. Experimental results on a modern multicore system show that the performance achieved by our automatic approach is up to 1.81x better than that achieved through manual tuning in Halide, a state-of-the-art language and compiler for image processing pipelines. For a camera raw image processing pipeline, our performance is comparable to that of a hand-tuned implementation.

185 citations


Journal ArticleDOI
TL;DR: The core microarchitecture innovations made in the POWER8 processor, designed to significantly improve both single-thread performance and single-core throughput over its predecessor, the POWER7® processor, are described.
Abstract: The POWER8™ processor is the latest RISC (Reduced Instruction Set Computer) microprocessor from IBM. It is fabricated using the company's 22-nm Silicon on Insulator (SOI) technology with 15 layers of metal, and it has been designed to significantly improve both single-thread performance and single-core throughput over its predecessor, the POWER7® processor. The rate of increase in processor frequency enabled by new silicon technology advancements has decreased dramatically in recent generations, as compared to the historic trend. This has caused many processor designs in the industry to show very little improvement in either single-thread or single-core performance, and, instead, larger numbers of cores are primarily pursued in each generation. Going against this industry trend, the POWER8 processor relies on a much improved core and nest microarchitecture to achieve approximately one-and-a-half times the single-thread performance and twice the single-core throughput of the POWER7 processor in several commercial applications. Combined with a 50% increase in the number of cores (from 8 in the POWER7 processor to 12 in the POWER8 processor), the result is a processor that leads the industry in performance for enterprise workloads. This paper describes the core microarchitecture innovations made in the POWER8 processor that resulted in these significant performance benefits.

154 citations


Journal ArticleDOI
TL;DR: The weighted ensemble (WE) path sampling approach orchestrates an ensemble of parallel calculations with intermittent communication to enhance the sampling of rare events, such as molecular associations or conformational changes in proteins or peptides, at any scale.
Abstract: The weighted ensemble (WE) path sampling approach orchestrates an ensemble of parallel calculations with intermittent communication to enhance the sampling of rare events, such as molecular associations or conformational changes in proteins or peptides. Trajectories are replicated and pruned in a way that focuses computational effort on underexplored regions of configuration space while maintaining rigorous kinetics. To enable the simulation of rare events at any scale (e.g., atomistic, cellular), we have developed an open-source, interoperable, and highly scalable software package for the execution and analysis of WE simulations: WESTPA (The Weighted Ensemble Simulation Toolkit with Parallelization and Analysis). WESTPA scales to thousands of CPU cores and includes a suite of analysis tools that have been implemented in a massively parallel fashion. The software has been designed to interface conveniently with any dynamics engine and has already been used with a variety of molecular dynamics (e.g., GROMA...
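WESTPA's actual resampling machinery is considerably richer, but the core weighted-ensemble idea — replicate heavy trajectories, prune light ones, never change the total probability — can be sketched in a few lines (all names and thresholds here are illustrative, not WESTPA's API):

```python
import random

def resample(walkers, target_weight):
    """One simplified weighted-ensemble resampling step: split heavy
    trajectories, merge light ones, keeping the total weight fixed."""
    out = []
    for state, weight in walkers:
        # Split: a heavy walker becomes two copies with half the weight each.
        while weight > 2 * target_weight:
            out.append((state, weight / 2))
            weight /= 2
        out.append((state, weight))
    # Merge: combine the two lightest walkers; the survivor is chosen with
    # probability proportional to weight, which preserves kinetics on average.
    out.sort(key=lambda w: w[1])
    while len(out) > 1 and out[0][1] + out[1][1] < target_weight:
        (s1, w1), (s2, w2) = out.pop(0), out.pop(0)
        survivor = s1 if random.random() < w1 / (w1 + w2) else s2
        out.append((survivor, w1 + w2))
        out.sort(key=lambda w: w[1])
    return out
```

The invariant to check after every step is that the weights still sum to one — that is what makes the kinetics rigorous.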

125 citations


Journal ArticleDOI
TL;DR: The vision processing unit incorporates parallelism, instruction set architecture, and microarchitectural features to provide highly sustainable performance efficiency across a range of computational-imaging and computer vision applications, including those with low latency requirements on the order of milliseconds.
Abstract: Myriad 2 is a multicore, always-on system on chip that supports computational imaging and visual awareness for mobile, wearable, and embedded applications. The vision processing unit incorporates parallelism, instruction set architecture, and microarchitectural features to provide highly sustainable performance efficiency across a range of computational imaging and computer vision applications, including those with low latency requirements on the order of milliseconds.

121 citations


Proceedings Article
12 Aug 2015
TL;DR: This work demonstrates that even seemingly strong isolation techniques based on dedicated cores can be circumvented through the use of thermal channels, and shows a limitation in the isolation that can be achieved on existing multi-core systems.
Abstract: Side channels remain a challenge to information flow control and security in modern computing platforms. Resource partitioning techniques that minimise the number of shared resources among processes are often used to address this challenge. In this work, we focus on multicore platforms and we demonstrate that even seemingly strong isolation techniques based on dedicated cores can be circumvented through the use of thermal channels. Specifically, we show that the processor core temperature can be used both as a side channel as well as a covert communication channel even when the system implements strong spatial and temporal partitioning. Our experiments on an Intel Xeon server platform demonstrate covert thermal channels that achieve up to 12.5 bps and weak thermal side channels that can detect processes executed on neighbouring cores. This work therefore shows a limitation in the isolation that can be achieved on existing multi-core systems.
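The mechanism can be illustrated with a toy simulation — a first-order thermal model with made-up constants, nothing like the paper's measured Xeon behavior: the sender runs a hot loop to transmit a 1 and idles for a 0, while the receiver thresholds the observed core temperature.

```python
def simulate_channel(bits, alpha=0.3, ambient=40.0, heat=10.0, steps=20):
    """Toy thermal covert channel: sender modulates power per bit, the core
    temperature follows a first-order (exponential) thermal model, and the
    receiver decodes by comparing against a midpoint threshold."""
    temp, trace = ambient, []
    for b in bits:
        power = heat if b else 0.0
        for _ in range(steps):              # let the core heat up / cool down
            temp += alpha * (ambient + power - temp)
        trace.append(temp)
    threshold = ambient + heat / 2
    return [1 if t > threshold else 0 for t in trace]
```

In the real attack the bit period is limited by the thermal time constant, which is why the measured covert-channel capacity tops out at 12.5 bps.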

111 citations


Proceedings ArticleDOI
09 Mar 2015
TL;DR: This paper presents a DTPM algorithm based on a practical temperature prediction methodology using system identification that dynamically computes a power budget using the predicted temperature, and controls the types and number of active processors as well as their frequencies.
Abstract: Heterogeneous multiprocessor systems-on-chip (MPSoCs) powering mobile platforms integrate multiple asymmetric CPU cores, a GPU, and many specialized processors. When the MPSoC operates close to its peak performance, power dissipation easily increases the temperature, hence adversely impacts reliability. Since using a fan is not a viable solution for hand-held devices, there is a strong need for dynamic thermal and power management (DTPM) algorithms that can regulate temperature with minimal performance impact. This paper presents a DTPM algorithm based on a practical temperature prediction methodology using system identification. The DTPM algorithm dynamically computes a power budget using the predicted temperature, and controls the types and number of active processors as well as their frequencies. Experiments on an octa-core big.LITTLE processor and common Android apps demonstrate that the proposed technique predicts temperature within 3% accuracy, while the DTPM algorithm provides around 6× reduction in temperature variance, and as large as 16% reduction in total platform power compared to using a fan.
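The paper's system-identification and control details are richer than this, but the core loop — fit a temperature model from measurements, then invert it into a power budget — can be sketched with a one-step linear model (the coefficients and trace below are made up for illustration):

```python
def fit_thermal_model(temps, powers):
    """Least-squares identification of T[k+1] ~= a*T[k] + b*P[k] from a
    measured trace (len(powers) == len(temps) - 1), solving the 2x2
    normal equations directly."""
    x1, x2, y = temps[:-1], powers, temps[1:]
    s11 = sum(v * v for v in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * v for u, v in zip(x1, y))
    s2y = sum(u * v for u, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    a = (s1y * s22 - s2y * s12) / det
    b = (s11 * s2y - s12 * s1y) / det
    return a, b

def power_budget(temp_now, temp_limit, a, b):
    """Invert the model: largest power allowed this interval so that the
    predicted next temperature stays below temp_limit."""
    return max(0.0, (temp_limit - a * temp_now) / b)

# Synthetic trace generated from a known model (a=0.9, b=0.5);
# the fit recovers the coefficients, which then yield a power budget.
powers = [4.0, 6.0, 3.0, 8.0, 5.0, 7.0, 2.0, 6.0]
temps = [50.0]
for p in powers:
    temps.append(0.9 * temps[-1] + 0.5 * p)
a, b = fit_thermal_model(temps, powers)
```

The DTPM policy then throttles core types, counts, and frequencies so that actual power stays under `power_budget` each interval.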

102 citations


Book
02 Jan 2015
TL;DR: New, abstract models of real-time tasks that capture accurately the salient features of real application systems that are to be implemented on multiprocessor platforms are derived, and rules for mapping application systems onto the most appropriate models are identified.
Abstract: This book provides a comprehensive overview of both theoretical and pragmatic aspects of resource-allocation and scheduling in multiprocessor and multicore hard-real-time systems. The authors derive new, abstract models of real-time tasks that capture accurately the salient features of real application systems that are to be implemented on multiprocessor platforms, and identify rules for mapping application systems onto the most appropriate models. New run-time multiprocessor scheduling algorithms are presented, which are demonstrably better than those currently used, both in terms of run-time efficiency and tractability of off-line analysis. Readers will benefit from a new design and analysis framework for multiprocessor real-time systems, which will translate into a significantly enhanced ability to provide formally verified, safety-critical real-time systems at a significantly lower cost.

98 citations


Proceedings ArticleDOI
04 Oct 2015
TL;DR: CRONO as discussed by the authors is a benchmark suite composed of multi-threaded graph algorithms for shared memory multicore processors, which can be used to identify, analyze and develop novel architectural methods to mitigate their efficiency bottlenecks.
Abstract: Algorithms operating on a graph setting are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenge when these algorithms are parallelized and executed on the evolving multicore processors. Previous parallel benchmark suites for shared memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial and media processing. However, these suites lack graph applications that must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multi-threaded graph algorithms for shared memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator, as well as a real multicore machine setup. CRONO uses both synthetic and real world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns, and incur fine-grain communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.

93 citations


Journal ArticleDOI
TL;DR: This paper describes the implementation of a parallel fast multipole method for evaluating potentials for discrete and continuous source distributions and discusses several algorithmic improvements and performance optimizations including cache locality, vectorization, shared memory parallelism and use of coprocessors.
Abstract: We describe our implementation of a parallel fast multipole method for evaluating potentials for discrete and continuous source distributions. The first requires summation over the source points; the second requires integration over a continuous source density. Both problems require O(N²) complexity when computed directly; however, they can be accelerated to O(N) time using FMM. In our PVFMM software library, we use kernel-independent FMM, which allows us to compute potentials for a wide range of elliptic kernels. Our method is high order, adaptive and scalable. In this paper, we discuss several algorithmic improvements and performance optimizations including cache locality, vectorization, shared-memory parallelism and use of coprocessors. Our distributed-memory implementation uses a space-filling curve for partitioning data and a hypercube communication scheme. We present convergence results for Laplace, Stokes and Helmholtz (low-wavenumber) kernels for both particle and volume FMM. We measure the efficiency of our method in terms of CPU cycles per unknown for different accuracies and different kernels. We also demonstrate scalability of our implementation up to several thousand processor cores on the Stampede platform at the Texas Advanced Computing Center.
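The space-filling-curve partitioning mentioned above usually means sorting points by a Morton (Z-order) key, so that contiguous key ranges — and hence spatially compact blocks — can be assigned to processes. A minimal 2D sketch (PVFMM itself works in 3D with octree nodes):

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of (x, y) into a Z-order (Morton) key.
    Sorting points by this key yields the space-filling-curve order
    used to partition spatial data across processes."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

points = [(3, 1), (0, 0), (1, 2), (2, 2)]
ordered = sorted(points, key=lambda p: morton2d(*p))
```

Splitting `ordered` into equal-length chunks gives each rank a spatially coherent subset, which keeps most FMM interactions process-local.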

86 citations


Journal ArticleDOI
TL;DR: This article presents a survey of cache management techniques for real-time embedded systems, from the first studies of the field in 1990 up to the latest research published in 2014, and provides a detailed comparison in terms of similarities and differences.
Abstract: Multicore processors are being extensively used by real-time systems, mainly because of their demand for increased computing power. However, multicore processors have shared resources that affect the predictability of real-time systems, which is the key to correctly estimate the worst-case execution time of tasks. One of the main factors for unpredictability in a multicore processor is the cache memory hierarchy. Recently, many research works have proposed different techniques to deal with caches in multicore processors in the context of real-time systems. Nevertheless, a review and categorization of these techniques is still an open topic and would be very useful for the real-time community. In this article, we present a survey of cache management techniques for real-time embedded systems, from the first studies of the field in 1990 up to the latest research published in 2014. We categorize the main research works and provide a detailed comparison in terms of similarities and differences. We also identify key challenges and discuss future research directions.

Proceedings ArticleDOI
09 Mar 2015
TL;DR: Octopus-Man is presented, a novel QoS-aware task management solution that dynamically maps latency-sensitive tasks to the least power-hungry processing resources that are sufficient to meet the QoS requirements.
Abstract: Heterogeneous multicore architectures have the potential to improve energy efficiency by integrating power-efficient wimpy cores with high-performing brawny cores. However, it is an open question as how to deliver energy reduction while ensuring the quality of service (QoS) of latency-sensitive web-services running on such heterogeneous multicores in warehouse-scale computers (WSCs). In this work, we first investigate the implications of heterogeneous multicores in WSCs and show that directly adopting heterogeneous multicores without re-designing the software stack to provide QoS management leads to significant QoS violations. We then present Octopus-Man, a novel QoS-aware task management solution that dynamically maps latency-sensitive tasks to the least power-hungry processing resources that are sufficient to meet the QoS requirements. Using carefully-designed feedback-control mechanisms, Octopus-Man addresses critical challenges that emerge due to uncertainties in workload fluctuations and adaptation dynamics in a real system. Our evaluation using web-search and memcached running on a real-system Intel heterogeneous prototype demonstrates that Octopus-Man improves energy efficiency by up to 41% (CPU power) and up to 15% (system power) over an all-brawny WSC design while adhering to specified QoS targets.

Proceedings ArticleDOI
18 Oct 2015
TL;DR: In this paper, the authors investigate the performance tradeoffs between compare-and-swap (CAS) operations and various characteristics of such systems, such as the structure of caches, and present a set of detailed benchmarks for latency and bandwidth of different atomics.
Abstract: Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis can be used for making better design and algorithmic decisions in parallel programming on various architectures deployed in both off-the-shelf machines and large compute systems.
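The semantic difference between the two operations is worth seeing concretely. Python has no hardware CAS, so the class below merely emulates atomicity with a lock; the point is the usage pattern: an FAA always succeeds in one shot, while an increment built from CAS must retry whenever another thread wins the race — one reason contended CAS loops can cost more than FAA even when the raw instruction latencies match.

```python
import threading

class AtomicCell:
    """Software stand-in for a hardware word supporting CAS and FAA.
    (On real x86 these are single LOCK-prefixed instructions; the lock
    here only emulates their atomicity for illustration.)"""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

    def load(self):
        with self._lock:
            return self._value

def cas_increment(cell):
    """Increment built from CAS: retry until no other thread interfered."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old
```

Despite CAS having an infinite consensus number and FAA only 2, the paper finds their measured latency and bandwidth comparable on the tested x86 parts.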

Proceedings ArticleDOI
08 Jul 2015
TL;DR: In this article, the authors present a parallelism-aware worst-case memory interference delay analysis for COTS multicore systems, focusing on LLC and DRAM bank partitioned systems.
Abstract: In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can generate many parallel memory requests at a time. The processing of these parallel requests in the DRAM controller greatly affects the memory interference delay experienced by running tasks on the platform. In this paper, we present a new parallelism-aware worst-case memory interference delay analysis for COTS multicore systems. The analysis considers a COTS processor that can generate multiple outstanding requests and a COTS DRAM controller that has a separate read and write request buffer, prioritizes reads over writes, and supports out-of-order request processing. Focusing on LLC and DRAM bank partitioned systems, our analysis computes worst-case upper bounds on memory-interference delays, caused by competing memory requests. We validate our analysis on a Gem5 full-system simulator modeling a realistic COTS multicore platform, with a set of carefully designed synthetic benchmarks as well as SPEC2006 benchmarks. The evaluation results show that our analysis produces safe upper bounds in all tested benchmarks, while the current state-of-the-art analysis significantly under-estimates the delays.

Journal ArticleDOI
TL;DR: The design and implementation of dense matrix multiplication on the Movidius Myriad architecture is described and its performance and energy efficiency are evaluated, showing significant potential for scientific-computing tasks.
Abstract: In recent years, a new generation of ultralow-power processors have emerged that are aimed primarily at signal processing in mobile computing. However, their architecture could make some of these useful for other applications. Algorithms originally developed for scientific computing are used increasingly in signal conditioning and emerging fields such as computer vision, increasing the demand for computing power in mobile systems. In this article, the authors describe the design and implementation of dense matrix multiplication on the Movidius Myriad architecture and evaluate its performance and energy efficiency. The authors demonstrate a performance of 8.11 Gflops on the Myriad I processor and a performance/watt ratio of 23.17 Gflops/W for a key computational kernel. These results show significant potential for scientific-computing tasks and invite further research.

Journal ArticleDOI
01 Jul 2015
TL;DR: A scalable coarse-grained PGA, PGAP, is developed for the NP-hard Generalized Assignment Problem (GAP); a novel asynchronous migration strategy enables efficient deme interactions and significantly improves the overlapping of computation and communication.
Abstract: A scalable parallel genetic algorithm (PGA) is developed for the NP-hard GAP problem. High scalability is achieved through a novel asynchronous migration strategy. Our algorithmic analysis resolves the buffer overflow issue in asynchronous PGAs. Strong and weak scaling tests demonstrate the superiority of our PGA approach. Our approach scales to 16,384 processors with super-linear speedups observed.

Known as an effective heuristic for finding optimal or near-optimal solutions to difficult optimization problems, a genetic algorithm (GA) is inherently parallel, lending itself to high-performance and parallel computing resources for randomized iterative evolutionary computation. It remains a significant challenge, however, to devise parallel genetic algorithms (PGAs) that can scale to massively parallel computer architectures (the mainstream supercomputer architecture), primarily because: (1) a common PGA design adopts synchronized migration, which becomes increasingly costly as more processor cores are involved in global synchronization; and (2) asynchronous PGA design and associated performance evaluation are intricate, since a PGA is a stochastic algorithm and the amount of computation needed to solve a problem does not depend simply on the problem size. To address this challenge, this paper describes a scalable coarse-grained PGA, PGAP, for a well-known NP-hard optimization problem: the Generalized Assignment Problem (GAP). Specifically, an asynchronous migration strategy is developed to enable efficient deme interactions and significantly improve the overlapping of computation and communication. Buffer overflow and its relationship with migration parameters were investigated to resolve the observed message buffer overflow and the loss of good solutions obtained from migration. Two algorithmic conditions were then established to detect these issues, caused by communication delays and improper configuration of migration parameters, and thus guide the dynamic tuning of PGA parameters to detect and avoid them. A set of computational experiments was designed to evaluate the scalability and numerical performance of PGAP. These experiments were conducted for large GAP instances on multiple supercomputers as part of the National Science Foundation Extreme Science and Engineering Discovery Environment (XSEDE). Results showed that PGAP exhibited desirable scalability by achieving low communication cost when using up to 16,384 processor cores. Near-linear and super-linear speedups on large GAP instances were obtained in strong scaling tests. Desirable scalability to both population size and the number of processor cores was observed in weak scaling tests. The design strategies applied in PGAP are applicable to general asynchronous PGA development.

Proceedings ArticleDOI
08 Jul 2015
TL;DR: This work proposes the Single Core Equivalence (SCE) technology: a framework of OS-level techniques designed for commercial (COTS) architectures that exports a set of equivalent single-core virtual machines from a multi-core platform, ultimately enabling industry to reuse existing software, schedulability analysis methodologies and engineering processes.
Abstract: Multi-core platforms represent the answer of the industry to the increasing demand for computational capabilities. From a real-time perspective, however, the inherent sharing of resources, such as memory subsystem and I/O channels, creates inter-core timing interference among critical tasks and applications deployed on different cores. As a result, modular per-core certification cannot be performed, meaning that: (1) current industrial engineering processes cannot be reused, (2) software developed and certified for single-core chips cannot be deployed on multi-core platforms as is. In this work, we propose the Single Core Equivalence (SCE) technology: a framework of OS-level techniques designed for commercial (COTS) architectures that exports a set of equivalent single-core virtual machines from a multi-core platform. This allows per-core schedulability results to be calculated in isolation and to hold when multiple cores of the system run in parallel. Thus, SCE allows each core of a multi-core chip to be considered as a conventional single-core chip, ultimately enabling industry to reuse existing software, schedulability analysis methodologies and engineering processes.

Patent
13 Feb 2015
TL;DR: In this article, a processor includes a plurality of first cores to independently execute instructions, each of the first cores including counters to store performance information; at least one second core to perform memory operations; and a power controller to determine a workload type executed on the processor based at least in part on the performance information.
Abstract: In an embodiment, a processor includes: a plurality of first cores to independently execute instructions, each of the plurality of first cores including a plurality of counters to store performance information; at least one second core to perform memory operations; and a power controller to receive performance information from at least some of the plurality of counters, determine a workload type executed on the processor based at least in part on the performance information, and based on the workload type dynamically migrate one or more threads from one or more of the plurality of first cores to the at least one second core for execution during a next operation interval. Other embodiments are described and claimed.

Proceedings ArticleDOI
29 Mar 2015
TL;DR: Analysis of TLP behavior and big-little core energy efficiency suggests that current mobile workloads can benefit from an architecture that has the flexibility to accommodate both high performance and good energy-efficiency for different application phases.
Abstract: Mobile devices are becoming more powerful and versatile than ever, calling for better embedded processors. Following the trend in desktop CPUs, microprocessor vendors are trying to meet such needs by increasing the number of cores in mobile device SoCs. However, increasing the core count does not translate proportionally into performance gain and power reduction. In the past, studies have shown that there exists little parallelism to be exploited by a multi-core processor in desktop platform applications, and many cores sit idle during runtime. In this paper, we investigate whether the same is true for current mobile applications. We analyze the behavior of a broad range of commonly used mobile applications on real devices. We measure their Thread Level Parallelism (TLP), which is the machine utilization over the non-idle runtime. Our results demonstrate that mobile applications utilize fewer than 2 cores on average, even with background applications running concurrently. We observe a diminishing return on TLP as the number of cores increases, and low TLP even in heavy-load scenarios. These studies suggest that having many powerful cores is over-provisioning. Further analysis of TLP behavior and big-little core energy efficiency suggests that current mobile workloads can benefit from an architecture that has the flexibility to accommodate both high performance and good energy-efficiency for different application phases.
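The TLP metric itself — machine utilization over non-idle runtime — is a one-liner given a histogram of how much time was spent with exactly i cores active. A small sketch (the input fractions below are invented, not the paper's measurements):

```python
def thread_level_parallelism(core_counts):
    """TLP = average number of active cores over the intervals in which
    the machine is not fully idle. core_counts[i] is the fraction of
    time during which exactly i cores are active."""
    non_idle = sum(core_counts[1:])
    if non_idle == 0:
        return 0.0
    return sum(i * c for i, c in enumerate(core_counts)) / non_idle

# 40% idle, 30% one core, 20% two cores, 10% three cores active:
tlp = thread_level_parallelism([0.4, 0.3, 0.2, 0.1])
```

Note that idle time is excluded by construction, so a mostly idle phone can still report a TLP near 1 — which is roughly what the paper observes.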

Journal ArticleDOI
TL;DR: This paper model the workload and the power consumption of a multicore processor as random variables and exploit the monotonicity property of their distribution functions to establish a quantitative relationship between the random variables.
Abstract: Quantitatively estimating the relationship between the workload and the corresponding power consumption of a multicore processor is an essential step towards achieving energy-proportional computing. Most existing and proposed approaches use Performance Monitoring Counters (hardware monitoring counters) for this task. In this paper we propose a complementary approach that employs the statistics of CPU utilization (workload) only. Hence, we model the workload and the power consumption of a multicore processor as random variables and exploit the monotonicity property of their distribution functions to establish a quantitative relationship between the random variables. We show that for a single-core processor the relationship is best approximated by a quadratic function, whereas for a dual-core processor it is best approximated by a linear function. We demonstrate the plausibility of our approach by estimating the power consumption of both custom-made and standard benchmarks (namely, the SPEC power benchmark and the Apache benchmarking tool) on Intel and AMD processors.
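The monotonicity argument boils down to quantile matching: because both distribution functions are monotone, the i-th smallest utilization sample corresponds to the i-th smallest power sample, giving an empirical workload-to-power curve without aligning timestamps. A sketch of the linear (dual-core) case — the data below is synthetic, and the paper's quadratic single-core fit would proceed analogously:

```python
def quantile_pairs(workload, power):
    """Pair same-rank quantiles of the two samples: monotone CDFs mean
    the i-th smallest utilization maps to the i-th smallest power."""
    return list(zip(sorted(workload), sorted(power)))

def linear_fit(pairs):
    """Ordinary least-squares line y = m*x + c through the paired quantiles."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return m, (sy - m * sx) / n

# Synthetic data obeying power = 30 + 0.8 * utilization, sampled out of order:
workload = [30.0, 10.0, 50.0, 20.0, 40.0]
power = [70.0, 46.0, 38.0, 62.0, 54.0]
m, c = linear_fit(quantile_pairs(workload, power))
```

The pairing step is what lets the method work from utilization and power statistics alone, with no per-sample correspondence between the two traces.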

Book ChapterDOI
01 Jan 2015
TL;DR: This chapter focuses on island models, or coarse-grained EAs, which evolve different populations on different processors; many applications have shown that island models can speed up computation significantly and that parallel populations can further increase solution diversity.
Abstract: Evolutionary algorithms (EAs) have given rise to many parallel variants, fuelled by the rapidly increasing number of CPU cores and the ready availability of computation power through GPUs and cloud computing. A very popular approach is to parallelize evolution in island models, or coarse-grained EAs, by evolving different populations on different processors. These populations run independently most of the time, but they periodically communicate genetic information to coordinate search. Many applications have shown that island models can speed up computation significantly, and that parallel populations can further increase solution diversity.
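A minimal island model fits in one function. The sketch below (sequential, with invented parameters; real island models run the populations on separate processors) evolves several populations independently and migrates each island's best individual around a ring every few generations:

```python
import random

def evolve_islands(fitness, n_islands=4, pop_size=20, gens=60,
                   migrate_every=10, seed=1):
    """Minimal island-model EA: each island evolves its own population via
    Gaussian mutation and truncation selection; every `migrate_every`
    generations the best individual migrates to the next island on a ring."""
    rng = random.Random(seed)
    islands = [[rng.uniform(-10, 10) for _ in range(pop_size)]
               for _ in range(n_islands)]
    for g in range(1, gens + 1):
        for pop in islands:
            children = [x + rng.gauss(0, 0.5) for x in pop]
            pop[:] = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
        if g % migrate_every == 0:
            best = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                pop[-1] = best[(i - 1) % n_islands]  # replace worst with migrant
    return max((x for pop in islands for x in pop), key=fitness)

best = evolve_islands(lambda x: -(x - 3) ** 2)   # optimum at x = 3
```

The infrequent, small migrations are the whole point: islands stay decoupled (cheap to parallelize, diverse) while still sharing good genetic material.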

Journal ArticleDOI
TL;DR: In this paper, the authors combine the ideas of wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches.
Abstract: The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemp...
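The temporal-blocking idea can be illustrated with a simple 1D example: overlapped blocking fuses several sweeps inside cache-sized tiles at the cost of redundant halo work (the diamond/wavefront schemes in the paper avoid that redundancy; this sketch only shows that fused tiles reproduce the naive multi-sweep result):

```python
import numpy as np

def sweep(u):
    """One Jacobi sweep of a 3-point average stencil; endpoints stay fixed."""
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v

def naive(u0, steps):
    """Reference: `steps` full-array sweeps, each streaming through memory."""
    u = u0.copy()
    for _ in range(steps):
        u = sweep(u)
    return u

def temporally_blocked(u0, steps, tile=8):
    """Overlapped temporal blocking: each tile is widened by a halo of
    `steps` cells, swept `steps` times while (ideally) cache-resident,
    and only the tile interior is written back. Cells at distance >=
    `steps` from a frozen halo edge are unaffected by it, so the
    interior matches the naive result exactly."""
    n = len(u0)
    out = u0.copy()
    for s in range(1, n - 1, tile):
        e = min(s + tile, n - 1)
        lo, hi = max(s - steps, 0), min(e + steps, n)
        block = u0[lo:hi].copy()
        for _ in range(steps):
            block = sweep(block)
        out[s:e] = block[s - lo:e - lo]
    return out
```

The naive version reads and writes the whole array once per sweep; the blocked version touches main memory roughly once per tile for all `steps` sweeps, which is the traffic reduction the paper's schemes push much further.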

Journal ArticleDOI
01 May 2015
TL;DR: The development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices is described; with the help of profiling and tracing tools, up to two-fold speedup and three-fold better energy efficiency are achieved against highly optimized batched CPU implementations based on the MKL library.
Abstract: Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest in the application use cases. The paradigm in which a single GPU or CPU factorizes a single problem at a time is not at all efficient in our applications' context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library.
The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as a 2.5× speedup on the NVIDIA K40 GPU.
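The "sequence of batched BLAS routines" formulation can be sketched in NumPy, with vectorized operations over the batch dimension standing in for batched GPU kernels (an illustration of the approach, not the MAGMA implementation):

```python
import numpy as np

def batched_cholesky(batch):
    """Right-looking, unblocked Cholesky applied across a whole batch:
    every step is a single vectorized (batched) operation over all
    matrices, mirroring a sequence of batched BLAS-like routines."""
    A = batch.copy()
    n = A.shape[-1]
    for k in range(n):
        A[:, k, k] = np.sqrt(A[:, k, k])           # batched diagonal sqrt
        A[:, k+1:, k] /= A[:, k, k, None]          # batched column scale (TRSM-like)
        v = A[:, k+1:, k]
        A[:, k+1:, k+1:] -= v[:, :, None] * v[:, None, :]  # batched rank-1 update (SYRK-like)
    return np.tril(A)                              # discard the stale upper triangle

# A batch of 64 small (6x6) SPD matrices, as in the small-problem use cases.
rng = np.random.default_rng(0)
m = rng.standard_normal((64, 6, 6))
spd = m @ m.transpose(0, 2, 1) + 6.0 * np.eye(6)
L = batched_cholesky(spd)
```

On a GPU each of the three commented lines would map to one kernel launch over the entire batch, which is what makes the approach efficient for many small independent problems.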

Journal ArticleDOI
TL;DR: An analytic performance model of the GP-SIMD architecture is presented, comparing it to an associative processor and to conventional SIMD architectures; assuming a moderate die area, the GP-SIMD architecture outperforms both the associative processor and conventional SIMD coprocessor architectures by almost an order of magnitude while consuming less power.
Abstract: GP-SIMD, a novel hybrid general-purpose SIMD computer architecture, resolves the issue of data synchronization through in-memory computing, combining data storage and massively parallel processing. GP-SIMD employs a two-dimensional access memory with modified SRAM storage cells and a bit-serial processing unit per memory row. An analytic performance model of the GP-SIMD architecture is presented, comparing it to an associative processor and to conventional SIMD architectures. Cycle-accurate simulation of four workloads supports the analytical comparison. Assuming a moderate die area, the GP-SIMD architecture outperforms both the associative processor and conventional SIMD coprocessor architectures by almost an order of magnitude while consuming less power.
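The bit-serial, row-parallel processing style can be imitated in NumPy: each loop iteration touches a single bit-slice of every memory row at once, as a one-bit processing unit per row would (a functional sketch only; the helper names are hypothetical):

```python
import numpy as np

def to_bits(x, nbits):
    """Hypothetical helper: int vector -> (rows, nbits) bit array, LSB first."""
    return ((x[:, None] >> np.arange(nbits)) & 1).astype(bool)

def from_bits(bits):
    """Hypothetical helper: (rows, nbits) bit array, LSB first -> int vector."""
    weights = np.uint64(1) << np.arange(bits.shape[1], dtype=np.uint64)
    return (bits.astype(np.uint64) * weights).sum(axis=1)

def bitserial_add(a_bits, b_bits):
    """Ripple-carry addition performed one bit-slice at a time, but across
    ALL memory rows simultaneously: nbits iterations regardless of the
    number of rows, as in a per-row one-bit ALU."""
    rows, nbits = a_bits.shape
    out = np.zeros((rows, nbits + 1), dtype=bool)
    carry = np.zeros(rows, dtype=bool)
    for k in range(nbits):
        a, b = a_bits[:, k], b_bits[:, k]
        out[:, k] = a ^ b ^ carry                  # sum bit for every row
        carry = (a & b) | (carry & (a ^ b))        # carry bit for every row
    out[:, nbits] = carry
    return out
```

The work per operation is serial in the word width but fully parallel in the number of rows, which is exactly the trade the two-dimensional access memory exploits.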

Journal ArticleDOI
TL;DR: This work explores the area and energy implications of scaling a WNoC in terms of: 1) the number of cores within the chip, and 2) the capacity of each link in the network.
Abstract: Networks-on-chip (NoCs) are emerging as the way to interconnect the processing cores and the memory within a chip multiprocessor. As recent years have seen a significant increase in the number of cores per chip, it is crucial to guarantee the scalability of NoCs in order to prevent communication from becoming the next performance bottleneck in multicore processors. Among other alternatives, the concept of wireless network-on-chip (WNoC) has been proposed, wherein on-chip antennas would provide native broadcast capabilities leading to enhanced network performance. Since energy consumption and chip area are the two primary constraints, this work explores the area and energy implications of scaling a WNoC in terms of: 1) the number of cores within the chip, and 2) the capacity of each link in the network. To this end, an integral design space exploration is performed, covering implementation aspects (area and energy), communication aspects (link capacity), and network-level considerations (number of cores and network architecture). The study is entirely based upon analytical models, which allow the WNoC's scalability to be benchmarked against a baseline NoC. Ultimately, this investigation provides qualitative and quantitative guidelines for the design of future transceivers for wireless on-chip communication.

Journal ArticleDOI
TL;DR: This paper presents a solution that uses functional performance models (FPMs) of processing elements and FPM-based data partitioning algorithms for optimal distribution of the workload of data-parallel scientific applications between processing elements of such heterogeneous computing systems.
Abstract: Heterogeneous multiprocessor systems, which are composed of a mix of processing elements, such as commodity multicore processors, graphics processing units (GPUs), and others, have been widely used in the scientific computing community. Software applications incorporate code designed and optimized for different types of processing elements in order to exploit the computing power of such heterogeneous computing systems. In this paper, we consider the problem of optimally distributing the workload of data-parallel scientific applications between the processing elements of such heterogeneous computing systems. We present a solution that uses functional performance models (FPMs) of processing elements and FPM-based data partitioning algorithms. The efficiency of this approach is demonstrated by experiments with parallel matrix multiplication and numerical simulation of lid-driven cavity flow on hybrid servers and clusters.
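The core FPM idea, allocations proportional to each device's speed evaluated at its *current* allocation rather than a constant speed, can be sketched as a fixed-point iteration (a hypothetical heuristic in the spirit of the approach, not the authors' exact algorithm):

```python
import math

def fpm_partition(total, speed_fns, iters=25):
    """Rebalance integer allocations in proportion to each device's speed
    at its current allocation, so the speed functions (FPMs) rather than
    constant speeds drive the partition."""
    k = len(speed_fns)
    shares = [total // k] * k
    shares[0] += total - sum(shares)
    for _ in range(iters):
        s = [max(f(n), 1e-12) for f, n in zip(speed_fns, shares)]
        raw = [total * si / sum(s) for si in s]
        # Largest-remainder rounding back to integers summing to `total`.
        base = [math.floor(r) for r in raw]
        order = sorted(range(k), key=lambda i: raw[i] - base[i], reverse=True)
        for i in order[:total - sum(base)]:
            base[i] += 1
        shares = base
    return shares

# Toy speed functions (work units/s): a flat CPU vs. a GPU that needs
# enough work to reach full speed.
cpu = lambda n: 100.0
gpu = lambda n: 50.0 + min(n, 400)
shares = fpm_partition(1000, [cpu, gpu])
```

With constant speed functions this degenerates to simple proportional partitioning; with size-dependent speeds the iteration lets devices whose performance degrades under load shed work.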

Journal ArticleDOI
TL;DR: A profiling method is introduced to establish the suitability of parallel applications for improved mappings that take the memory hierarchy into account, based on a mathematical description of their memory access behaviors.

Journal ArticleDOI
08 Sep 2015
TL;DR: A provably efficient scheduling algorithm, the Piper algorithm, is described, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested; it automatically throttles the parallelism, precluding “runaway” pipelines.
Abstract: Pipeline parallelism organizes a parallel program as a linear sequence of stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a “construct-and-run” approach, this article investigates “on-the-fly” pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding “runaway” pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP ≤ T1/P+O(T∞+lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as “lazy enabling” and “dependency folding.” We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. 
One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
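The stated guarantee can be turned into a quick back-of-the-envelope calculator (the constant hidden in the O-term is an assumption, exposed here as `c`; the paper does not fix a value):

```python
import math

def piper_bound(t1, t_span, p, c=1.0):
    """Piper's expected-time guarantee T_P <= T1/P + O(T_span + lg P),
    with `c` standing in for the hidden constant."""
    return t1 / p + c * (t_span + math.log2(p))

# With ample parallelism (T1/T_span much larger than P) the bound is
# work-dominated, predicting near-linear speedup: here T1 = 1600,
# T_span = 10, P = 16 gives a bound of 100 + 10 + 4 = 114.
bound16 = piper_bound(t1=1600.0, t_span=10.0, p=16)
speedup16 = 1600.0 / bound16
```

This illustrates why the reported 13.87× speedup on 16 processors for x264 is close to what the bound permits when the span is small relative to the work.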

Journal ArticleDOI
TL;DR: A dynamic voltage and frequency scaling scheme with SC converters is proposed that achieves high converter efficiency by allowing the output voltage to ripple and having the processor core frequency track the ripple.
Abstract: Integrating multiple power converters on-chip improves the energy efficiency of manycore architectures. Switched-capacitor (SC) dc-dc converters are compatible with conventional CMOS processes, but traditional implementations suffer from limited conversion efficiency. We propose a dynamic voltage and frequency scaling scheme with SC converters that achieves high converter efficiency by allowing the output voltage to ripple and having the processor core frequency track the ripple. Minimum core energy is achieved by hopping between different converter modes and tuning body-bias voltages. A multicore processor model based on a 28-nm technology shows conversion efficiencies of 90% along with over 25% improvement in the overall chip energy efficiency.

Proceedings ArticleDOI
Zhibo Wang1, Yongpan Liu1, Yinan Sun1, Yang Li1, Daming Zhang1, Huazhong Yang1 
24 May 2015
TL;DR: A novel energy-efficient heterogeneous dual-core processor, which includes both an ultra-low-power near-threshold CoreL and a fast CoreH to meet the emerging requirements of IoT applications, and an optimal framework proposed to realize energy-efficient task mapping and scheduling.
Abstract: With the fast development of the Internet of Things (IoT) in recent years, many IoT applications, such as structural health monitoring and surveillance cameras, require both extensive computation for burst-mode signal processing and ultra-low-power continuous operation. However, most conventional IoT processors focus on ultra-low power consumption and cannot satisfy those demands. This paper proposes a novel energy-efficient heterogeneous dual-core processor, which includes both an ultra-low-power near-threshold CoreL and a fast CoreH to meet these emerging requirements. Furthermore, an optimal framework is proposed to realize energy-efficient task mapping and scheduling. The processor has been fabricated, and its energy consumption in low-power mode is as low as 7.7 pJ/cycle, outperforming related work. Detailed analysis under several real applications shows that up to 2.62× energy efficiency improvements can be achieved without deadline misses compared with a high-performance-only single-core architecture.
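The task-mapping decision can be sketched as choosing the least-energy core among those that meet the deadline (the frequencies and the CoreH pJ/cycle figure below are hypothetical placeholders, loosely anchored to the reported 7.7 pJ/cycle low-power mode):

```python
def choose_core(cycles, deadline_s, cores):
    """Map a task to the core that meets its deadline with the least energy."""
    feasible = [(c["pj_per_cycle"] * cycles, name)
                for name, c in cores.items()
                if cycles / c["freq_hz"] <= deadline_s]
    if not feasible:
        raise ValueError("no core can meet the deadline")
    return min(feasible)[1]  # lowest energy among deadline-feasible cores

# Hypothetical operating points (CoreH numbers are invented for illustration).
CORES = {
    "CoreL": {"freq_hz": 10e6,  "pj_per_cycle": 7.7},   # near-threshold, slow
    "CoreH": {"freq_hz": 200e6, "pj_per_cycle": 60.0},  # fast, costlier per cycle
}
```

Under this model a relaxed deadline routes work to CoreL for energy, while a tight burst-mode deadline forces it onto CoreH, which is the scheduling trade-off the proposed framework optimizes.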