
Showing papers on "Performance per watt" published in 2015


Proceedings ArticleDOI
07 Jun 2015
TL;DR: Based on the power-performance models developed, an efficient power management strategy is proposed and implemented on an Odroid-XU+E mobile platform; measurements show that it provides on average a 20% increase in performance per watt compared to the state of the art.
Abstract: Games have emerged as one of the most popular applications on mobile platforms. Recent platforms are now equipped with Heterogeneous Multiprocessor System-on-Chips (HMPSoCs) tightly integrating CPUs and GPUs on the same chip. This configuration enables high-end gaming on the platform, but at the cost of high power consumption that rapidly drains the underlying limited-capacity battery. HMPSoCs are capable of independent Dynamic Voltage and Frequency Scaling (DVFS) for CPUs and GPUs to reduce the platform's power consumption. State-of-the-art power managers for mobile games on HMPSoCs oversimplify the complex CPU-GPU interplay. In this paper, we develop power-performance models predicting the impact of DVFS on mobile gaming workloads. Based on our models, we propose an efficient power management strategy and implement it on an Odroid-XU+E mobile platform. Measurements on the platform show that our power manager provides on average a 20% increase in performance per watt compared to the state of the art.
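The abstract does not reproduce the paper's models or governor, so the following is only a minimal illustrative sketch of the general idea: a frame-rate-aware governor that picks the CPU/GPU frequency pair with the best predicted performance per watt. The frequency tables and the predict_fps / predict_power_w model functions below are hypothetical placeholders, not the authors' models.

```python
# Illustrative sketch only: a frame-rate-aware CPU/GPU DVFS governor that
# picks the frequency pair with the best predicted performance per watt.
# predict_fps() and predict_power_w() stand in for power-performance models
# and are purely hypothetical.

from itertools import product

CPU_FREQS_MHZ = [800, 1200, 1600, 2000]
GPU_FREQS_MHZ = [177, 266, 350, 480]
TARGET_FPS = 30.0

def predict_fps(cpu_mhz, gpu_mhz):
    # Hypothetical model: frame rate limited by the slower of the two domains.
    return min(cpu_mhz / 50.0, gpu_mhz / 12.0)

def predict_power_w(cpu_mhz, gpu_mhz):
    # Hypothetical model: power grows roughly quadratically with frequency
    # (voltage scales with frequency).
    return 0.4 + 1.5 * (cpu_mhz / 2000.0) ** 2 + 2.0 * (gpu_mhz / 480.0) ** 2

def choose_operating_point():
    """Return the (cpu, gpu) frequency pair maximizing FPS per watt
    among the points that still meet the target frame rate."""
    best, best_ppw = None, -1.0
    for cpu, gpu in product(CPU_FREQS_MHZ, GPU_FREQS_MHZ):
        fps = predict_fps(cpu, gpu)
        if fps < TARGET_FPS:
            continue  # would hurt playability
        ppw = fps / predict_power_w(cpu, gpu)
        if ppw > best_ppw:
            best, best_ppw = (cpu, gpu), ppw
    return best or (max(CPU_FREQS_MHZ), max(GPU_FREQS_MHZ))

if __name__ == "__main__":
    print(choose_operating_point())
```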

58 citations


Proceedings ArticleDOI
07 Feb 2015
TL;DR: This paper designs and experimentally validates power-performance models used to carefully select the appropriate kernel combinations to be executed concurrently, the relative contributions of the kernels to the thread mix, and the frequency choices for the cores and the memory, to achieve high performance per watt.
Abstract: Current generation GPUs can accelerate high-performance, compute-intensive applications by exploiting massive thread-level parallelism. The high performance, however, comes at the cost of increased power consumption. Recently, commercial GPGPU architectures have introduced support for concurrent kernel execution to better utilize the computational/memory resources and thereby improve overall throughput. In this paper, we argue and experimentally validate the benefits of concurrent kernels for energy-efficient execution. We design power-performance models to carefully select the appropriate kernel combinations to be executed concurrently, the relative contributions of the kernels to the thread mix, along with the frequency choices for the cores and the memory, to achieve high performance per watt. Our experimental evaluation shows that concurrent kernel execution in combination with DVFS can improve energy efficiency by up to 34.5% compared to the most energy-efficient sequential execution.
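The selection procedure itself is not given in the abstract; the sketch below only illustrates the general idea of ranking candidate kernel pairings and core/memory frequency settings by a modeled throughput-per-watt figure. The kernel list, frequency tables, and model functions are hypothetical, not the authors'.

```python
# Illustrative sketch: rank candidate kernel pairings and core/memory
# frequency settings by a modeled throughput-per-watt figure.
# All kernels, frequencies, and models below are hypothetical placeholders.

from itertools import combinations, product

KERNELS = {"matmul": "compute", "spmv": "memory", "stencil": "mixed"}
CORE_FREQS = [614, 705, 797]    # MHz (placeholder values)
MEM_FREQS = [1620, 2600, 3004]  # MHz (placeholder values)

def modeled_throughput(pair, f_core, f_mem):
    # Hypothetical: compute-bound kernels scale with core clock, memory-bound
    # kernels with memory clock; mixing both utilizes the GPU better.
    score = 0.0
    for k in pair:
        kind = KERNELS[k]
        score += (f_core if kind == "compute"
                  else f_mem if kind == "memory"
                  else 0.5 * (f_core + f_mem))
    complementary = len({KERNELS[k] for k in pair}) > 1
    return score * (1.3 if complementary else 1.0)

def modeled_power(f_core, f_mem):
    # Hypothetical dependence on each clock domain.
    return 30 + 0.05 * f_core + 0.01 * f_mem

def best_concurrent_config():
    """Pick the kernel pair and (core, memory) clocks with the highest
    modeled throughput per watt."""
    candidates = []
    for pair in combinations(KERNELS, 2):
        for f_core, f_mem in product(CORE_FREQS, MEM_FREQS):
            ppw = modeled_throughput(pair, f_core, f_mem) / modeled_power(f_core, f_mem)
            candidates.append((ppw, pair, f_core, f_mem))
    return max(candidates)

if __name__ == "__main__":
    print(best_concurrent_config())
```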

52 citations


Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work proposes ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput, and demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes, while providing better system throughput.
Abstract: Single-ISA (instruction set architecture) asymmetric multicore processors (AMPs) have been shown to deliver higher performance per watt and per unit area than symmetric CMPs (Chip Multi-Processors) for applications with diverse architectural requirements. A large body of work has demonstrated that this potential of AMP systems can be realized via OS scheduling. Yet, existing schedulers that seek to deliver fairness on AMPs do not ensure that equal-priority applications experience the same slowdown when sharing the system. Moreover, most of these schemes also suffer high throughput degradation and fail to deal effectively with user priorities. In this work we propose ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput. Our evaluation on real AMP hardware, using scheduler implementations on a general-purpose OS, demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes, while providing better system throughput.

18 citations


Journal ArticleDOI
TL;DR: By extending Amdahl's Law and the Karp-Flatt Metric to take resilience into consideration, this article quantitatively models the integrated energy efficiency in terms of performance per Watt and showcases the trade-offs among typical HPC parameters.
Abstract: The ever-growing performance of supercomputers brings demanding requirements for energy efficiency and resilience, due to the rapidly expanding size and duration of use of large-scale computing systems. Many application/architecture-dependent parameters that determine energy efficiency and resilience individually have causal effects on each other, which directly affect the trade-offs among performance, energy efficiency and resilience at scale. To enable high-efficiency management of large-scale High-Performance Computing (HPC) systems, a quantitative understanding of the entangled effects among performance, energy efficiency, and resilience is thus required. While previous work focuses on exploring energy-saving and resilience-enhancing opportunities separately, little has been done to theoretically and empirically investigate the interplay between energy efficiency and resilience at scale. In this article, by extending Amdahl's Law and the Karp-Flatt Metric to take resilience into consideration, we quantitatively model the integrated energy efficiency in terms of performance per Watt and showcase the trade-offs among typical HPC parameters, such as number of cores, frequency/voltage, and failure rates. Experimental results for a wide spectrum of HPC benchmarks on two HPC systems show that the proposed models are accurate in extrapolating resilience-aware performance and energy efficiency, and capable of capturing the interplay among various energy-saving and resilience factors. Moreover, the models can help find the optimal HPC configuration for the highest integrated energy efficiency, in the presence of failures and applied resilience techniques.
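The article's exact formulation is not reproduced in the abstract; the LaTeX fragment below only sketches the kind of model being extended, combining the classic Amdahl speedup with a simple power term and an availability-style resilience factor. The symbols and the specific form of the resilience factor are illustrative assumptions, not the authors' equations.

```latex
% Illustrative only -- not the article's actual model.
% Classic Amdahl speedup for parallel fraction p on n cores, and a
% performance-per-Watt figure discounted by an availability-style
% resilience factor A(lambda, n) that shrinks as the core count n and
% the per-core failure rate lambda grow:
\[
  S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad
  \mathrm{PPW}(n, f) =
  \frac{S(n)\, A(\lambda, n)}{P_{\mathrm{static}} + n\, P_{\mathrm{core}}(f, V)}
\]
```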

12 citations


Proceedings ArticleDOI
12 Mar 2015
TL;DR: 3D die stacking is demonstrated, whereby disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.
Abstract: Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, increasing power density is the obstacle to continued performance gains. Recent studies show that heterogeneous multi-core is a promising solution to optimize performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses are performed to illustrate the scalability of heterogeneous systems and their potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.

10 citations


Proceedings ArticleDOI
13 Sep 2015
TL;DR: This work shows optimization strategies evaluated and applied to an elastic propagator based on a fully staggered grid, running on the Intel® Xeon Phi™ coprocessor; notably, the propagator is able to reproduce elastic wave propagation even for an arbitrary anisotropy.
Abstract: The current trend in seismic imaging aims at using an improved physical model, considering that the Earth is not rigid but an elastic body. This new model takes simulations closer to the real physics of the problem, at the cost of raising the required computational resources. On the hardware front, recently developed high-performing devices, called accelerators or co-processors, have shown that they can outperform their general-purpose counterparts by orders of magnitude in terms of performance per watt. These new alternatives may provide the resources necessary to represent complex wave physics in a reasonable time. There might be, however, a penalty associated with the use of such devices, as some portion of the simulation code may need re-writing, or new optimization strategies may need to be explored and applied. In this work we show some optimization strategies evaluated and applied to an elastic propagator based on a Fully Staggered Grid, running on the Intel® Xeon Phi™ coprocessor. It is important to remark that the propagator is able to reproduce elastic wave propagation, even for an arbitrary anisotropy.

10 citations


Journal ArticleDOI
11 May 2015
TL;DR: This paper expands support to Calorimeter, Inner Detector, and Tracking code, and presents the findings on implementing a hybrid multi-threaded / multi-process framework, to take advantage of the strengths of each type of concurrency, while avoiding some of their corresponding limitations.
Abstract: The ATLAS experiment has successfully used its Gaudi/Athena software framework for data taking and analysis during the first LHC run, with billions of events successfully processed. However, the design of Gaudi/Athena dates from the early 2000s, and the software and physics code have been written using a single-threaded, serial design. This programming model has increasing difficulty in exploiting the potential of current CPUs, which offer their best performance only when taking full advantage of multiple cores and wide vector registers. Future CPU evolution will intensify this trend, with core counts increasing and memory per core falling. With current memory consumption for 64-bit ATLAS reconstruction in a high-luminosity environment approaching 4 GB, it will become impossible to fully occupy all cores in a machine without exhausting available memory. However, since maximizing performance per watt will be a key metric, a mechanism must be found to use all cores as efficiently as possible. In this paper we report on our progress with a practical demonstration of the use of multithreading in the ATLAS reconstruction software, using the GaudiHive framework. We have expanded support to Calorimeter, Inner Detector, and Tracking code, discussing what changes were necessary, both to the framework and to the tools and algorithms used, in order to allow the serially designed ATLAS code to run. We report on the performance gains, and on the general lessons learned about the code patterns that had been employed in the software and which patterns were identified as particularly problematic for multi-threading. We also present our findings on implementing a hybrid multi-threaded / multi-process framework, to take advantage of the strengths of each type of concurrency, while avoiding some of their corresponding limitations.

9 citations


Proceedings ArticleDOI
21 Jul 2015
TL;DR: The ePUMA platform is a flexible and configurable DSP platform that tries to address many of the problems with traditional DSP designs, increasing performance while using less power.
Abstract: Since the breakdown of Dennard scaling, the primary goal for processor designs has shifted from increasing performance to increasing performance per Watt. The ePUMA platform is a flexible and configurable DSP platform that tries to address many of the problems with traditional DSP designs, to increase performance while using less power. We trade the flexibility of traditional VLIW DSP designs for a simpler single-instruction-issue scheme and instead make sure that each instruction can perform more work. Multi-cycle instructions can operate directly on vectors and matrices in memory, and the datapaths implement common DSP subgraphs directly in hardware, for high compute throughput. Memory bottlenecks, which are common in other architectures, are handled with flexible LUT-based multi-bank memory addressing and memory parallelism. A major contributor to energy consumption, data movement, is reduced by using a heterogeneous interconnect and clustering compute resources around local memories for simple data sharing. To evaluate ePUMA we have implemented the majority of the kernel library from a commercial VLIW DSP manufacturer for comparison. Our results not only show good performance, but also an order-of-magnitude increase in energy and area efficiency. In addition, the kernel code size is reduced by 91% on average compared to the VLIW DSP. These benefits make ePUMA an attractive solution for future DSPs.

8 citations


Proceedings ArticleDOI
03 Sep 2015
TL;DR: A flexible parallel hardware-based architecture, used in conjunction with frequency scaling as a technique for reducing power consumption in video streaming applications, is presented, together with derived equations that ease the calculation of the level of parallelism and the maximum depth of the FIFOs used for clock domain crossing.
Abstract: Reconfigurable technology is well suited to real-time video streaming applications. It is considered a promising solution due to the performance per watt it offers compared to other technologies. Since FPGAs evolved, several techniques at different design levels, from the circuit level up to the system level, have been proposed to reduce the power consumption of FPGA devices. In this paper, we present a flexible parallel hardware-based architecture in conjunction with frequency scaling as a technique for reducing power consumption in video streaming applications. In this work, we derive equations to ease the calculation of the level of parallelism and the maximum depth of the FIFOs used for clock domain crossing. Accordingly, a design space is formed including all the design alternatives for the application. The preferred design alternative is selected with awareness of how much hardware it costs and what power-reduction goal it can satisfy. We used a Xilinx Zynq ZC706 evaluation board to implement two video streaming applications, a video downscaler (1:16) and the AES encryption algorithm, to verify our approach. The experimental results showed up to 19.6% power reduction for the video downscaler and up to 5.4% for AES encryption.
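The paper's derived equations are not reproduced in the abstract; the sketch below only illustrates the style of calculation involved, estimating the required level of parallelism from the pixel and processing clock rates and sizing a clock-domain-crossing FIFO from the frequency ratio and burst length. The formulas, margins, and parameter names are generic assumptions, not the authors' equations.

```python
# Illustrative sketch only: back-of-the-envelope sizing of the level of
# parallelism and of a clock-domain-crossing FIFO for a streaming video
# pipeline. Generic estimates, not the paper's derived equations.

import math

def level_of_parallelism(pixel_rate_mhz, core_clock_mhz, cycles_per_pixel=1):
    """Number of parallel processing lanes needed so the (down-clocked)
    processing domain keeps up with the incoming pixel stream."""
    required = pixel_rate_mhz * cycles_per_pixel / core_clock_mhz
    return max(1, math.ceil(required))

def cdc_fifo_depth(write_clock_mhz, read_clock_mhz, burst_len):
    """Worst-case words accumulated in the FIFO during one burst when the
    writer is faster than the reader (plus a small synchronizer margin)."""
    if write_clock_mhz <= read_clock_mhz:
        return 4  # minimal depth for synchronization only
    backlog = burst_len * (1 - read_clock_mhz / write_clock_mhz)
    return math.ceil(backlog) + 4  # +4 words margin for CDC synchronizers

if __name__ == "__main__":
    # e.g. a ~148.5 Mpixel/s stream (1080p60) into a 75 MHz processing domain
    print(level_of_parallelism(148.5, 75.0))               # -> 2 lanes
    print(cdc_fifo_depth(148.5, 75.0, burst_len=1920))     # one-line burst
```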

8 citations


Journal ArticleDOI
TL;DR: The roadmap to using massively parallel architectures in a 3D-FD simulation is shown, revealing that NVIDIA architectures outperform the Intel Xeon Phi co-processor by a wide margin while dissipating approximately 50 W less for large-scale input problems.

8 citations


Journal ArticleDOI
TL;DR: The usefulness of the simulator for this type of studies is demonstrated and it is concluded that the superior behavior of multiobjective algorithms makes them recommended for use in modern scheduling systems.
Abstract: Today, in an energy-aware society, job scheduling is becoming an important task for computer engineers and system analysts, and one that may lead to a performance-per-Watt trade-off in computing infrastructures. Thus, new algorithms, and a simulator of computing environments, may help information and communications technology and data center managers to make decisions with a solid experimental basis. There are several simulators that try to address performance and, somehow, estimate energy consumption, but there are none in which the energy model is based on benchmark data countersigned by independent bodies such as the Standard Performance Evaluation Corporation. This is the reason why we have implemented a performance- and energy-aware scheduling (PEAS) simulator for high-performance computing. Furthermore, to evaluate the simulator, we propose an implementation of the non-dominated sorting genetic algorithm II (NSGA-II), a fast and elitist multiobjective genetic algorithm, for resource selection. With the help of the PEAS simulator, we have studied whether it is possible to provide an intelligent job allocation policy that is able to save energy and time without compromising performance. The results of our simulations show a great improvement in response time and power consumption. In most cases, NSGA-II performs better than other 'intelligent' algorithms like multiobjective heterogeneous earliest finish time, and clearly outperforms the first-fit algorithm. We demonstrate the usefulness of the simulator for this type of study and conclude that the superior behavior of multiobjective algorithms makes them recommended for use in modern scheduling systems. Copyright © 2015 John Wiley & Sons, Ltd.
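NSGA-II itself is too long to reproduce here; the fragment below only sketches its core idea, non-dominated (Pareto) selection over candidate job allocations scored by completion time and energy. The candidate allocations and their objective values are made-up illustrations, not results from the paper.

```python
# Illustrative sketch: the Pareto (non-dominated) selection at the heart of
# multiobjective schedulers such as NSGA-II, applied to candidate job
# allocations scored by (completion_time, energy). Candidate data are made up.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the allocations not dominated by any other allocation."""
    front = []
    for name, objs in candidates:
        if not any(dominates(other, objs) for _, other in candidates if other != objs):
            front.append((name, objs))
    return front

if __name__ == "__main__":
    # (allocation label, (completion time in s, energy in kJ))
    candidates = [
        ("pack-on-fast-nodes", (120.0, 95.0)),
        ("spread-on-efficient-nodes", (150.0, 70.0)),
        ("first-fit", (160.0, 98.0)),   # dominated by both alternatives above
    ]
    print(pareto_front(candidates))
```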

Proceedings ArticleDOI
15 Nov 2015
TL;DR: A course in GPU programming for senior undergraduates and first-year graduates that has been taught at Clemson University annually since 2010 is described, with focus on a large, real-world problem, in particular, a system for parallel solution of partial differential equations.
Abstract: Compared to CPUs, modern GPUs exhibit a high ratio of computing performance per watt, and so current supercomputer designs often include multiple racks of GPUs in order to achieve high teraflop counts at minimal energy cost. GPU programming is thus becoming increasingly important, and yet it remains a challenging task. This paper describes a course in GPU programming for senior undergraduates and first-year graduates that has been taught at Clemson University annually since 2010. The course uses problem-based learning, with a focus on a large, real-world problem: a system for parallel solution of partial differential equations. Although the system for solving PDEs is useful in its own right, the problem is used as a vehicle to explore design issues that face those attempting to achieve new levels of performance on these architectures.

Proceedings ArticleDOI
22 Jul 2015
TL;DR: This study examines a class of embedded system applications relevant to mobile vehicles to understand the limits of achievable energy efficiency under varying levels of system resilience constraints and considers static optimization of voltage-frequency settings on a per-application-segment basis.
Abstract: Low-power embedded processing typically relies on dynamic voltage-frequency scaling (DVFS) in order to optimize energy usage (and therefore, battery life). However, low-voltage operation exacerbates the incidence of soft errors. Similarly, higher-voltage operation (to meet real-time deadlines) is constrained by hard-failure rate limits. In this paper, we examine a class of embedded system applications relevant to mobile vehicles. We investigate the problem of assigning optimal voltage-frequency settings to individual segments within target workflows. The goal of this study is to understand the limits of achievable energy efficiency (performance per watt) under varying levels of system resilience constraints. To optimize for energy efficiency, we consider static optimization of voltage-frequency settings on a per-application-segment basis. We consider both linear and graph-structured workflows. In order to understand the loss in energy efficiency in the face of environmental uncertainties encountered by the mobile vehicle, we also study the effect of injecting random variations into the actual runtime of individual application segments. A dynamic re-optimization of the voltage-frequency settings is required to cope with such in-field uncertainties.
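The paper's optimization formulation is not given in the abstract; the sketch below only illustrates static per-segment voltage-frequency selection by exhaustive search, minimizing modeled energy subject to an end-to-end deadline and a soft-error-rate budget. All numbers, the energy model, and the error model are hypothetical placeholders.

```python
# Illustrative sketch: static assignment of a voltage-frequency setting to each
# segment of a linear workflow by exhaustive search, minimizing modeled energy
# while respecting a deadline and a soft-error-rate budget.
# All numbers and models are hypothetical, not the paper's.

from itertools import product

# (frequency GHz, voltage V, relative soft-error rate at that voltage)
VF_POINTS = [(0.6, 0.8, 4.0), (1.0, 0.9, 2.0), (1.4, 1.1, 1.0)]
SEGMENT_WORK = [1.2e9, 0.8e9, 2.0e9]    # cycles per segment
DEADLINE_S = 3.5
SER_BUDGET = 8.0                        # arbitrary units

def evaluate(assignment):
    time = energy = ser = 0.0
    for cycles, (f_ghz, v, err) in zip(SEGMENT_WORK, assignment):
        t = cycles / (f_ghz * 1e9)
        time += t
        energy += 2.0 * (v ** 2) * f_ghz * t   # ~ C * V^2 * f * t, C folded into 2.0
        ser += err * t                          # longer low-voltage runs accumulate risk
    return time, energy, ser

def best_static_assignment():
    best, best_energy = None, float("inf")
    for assignment in product(VF_POINTS, repeat=len(SEGMENT_WORK)):
        time, energy, ser = evaluate(assignment)
        if time <= DEADLINE_S and ser <= SER_BUDGET and energy < best_energy:
            best, best_energy = assignment, energy
    return best, best_energy

if __name__ == "__main__":
    print(best_static_assignment())
```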

Journal ArticleDOI
TL;DR: This paper proposes high-performance and energy-efficient multicore architectures for a variety of parallelisms and memory intensities in workloads, and uses dynamic voltage and frequency scaling in Amdahl's law to decrease the amount of dark silicon and improve performance and performance per watt/joule.
Abstract: As technology scales further, multicore and many-core processors emerge as an alternative to keep up with performance demands. However, because of power and thermal constraints, we are obliged to power off a remarkable area of the chip. Many innovative techniques have been presented to improve energy efficiency and maintain utilization at the highest level. In this paper, we discuss different models and methods of exploiting dark silicon, and by using dynamic voltage and frequency scaling in Amdahl's law and considering memory overheads, we attempt to decrease the amount of dark silicon and improve performance and performance per watt/joule. We propose high-performance and energy-efficient multicore architectures for a variety of parallelisms and memory intensities in workloads. According to the results, by voltage scaling, for a highly parallel CPU-intensive workload, we reach improvements of approximately 5.2× and 3.78× in performance per watt and performance per joule, respectively, while a reduction in performance of about 27% must be tolerated. For memory-intensive applications, a negligible change in speedup is detected under scaling, while performance per watt and performance per joule for both serial and parallel applications see around 6× enhancements.
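The authors' exact model is not reproduced in the abstract; the LaTeX fragment below only sketches the usual way DVFS is folded into Amdahl-style reasoning, with performance scaling roughly with frequency and dynamic power with CV²f per active core. Symbols and proportionalities are illustrative assumptions.

```latex
% Illustrative only -- a generic DVFS-in-Amdahl sketch, not the authors' model.
% With n active cores running at frequency f (voltage V ~ f), a parallel
% fraction p, and dynamic power ~ C V^2 f per core:
\[
  \mathrm{Perf}(n, f) \propto \frac{f}{(1 - p) + \dfrac{p}{n}}, \qquad
  \frac{\mathrm{Perf}}{\mathrm{Watt}} \propto
  \frac{\mathrm{Perf}(n, f)}{P_{\mathrm{static}} + n\, C V^{2} f}
\]
% Lowering V and f sacrifices some serial speed but lets more cores stay
% powered within the same budget -- the dark-silicon trade-off explored here.
```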

Proceedings ArticleDOI
20 Jul 2015
TL;DR: The Multi2Sim heterogeneous CPU/GPU processor simulator is extended to model the GPU memory subsystem with enough accuracy, and three main aspects that should be modeled with more accuracy are identified: i) miss status holding registers, ii) coalescing vector memory requests, and iii) non-blocking GPU stores.
Abstract: Nowadays, research on GPU processor architecture is extraordinarily active, since these architectures offer much more performance per watt than CPU architectures. This is the main reason why massive deployment of GPU multiprocessors is considered one of the most feasible solutions to attain exascale computing capabilities. In this context, ongoing GPU architecture research is required to improve GPU programmability as well as to integrate CPU and GPU cores in the same die. One of the most important research topics in current GPUs is the GPU memory hierarchy, since its design goals are very different from those of conventional CPU memory hierarchies. To explore novel designs to better support General-Purpose computing on GPUs (GPGPU computing), as well as to improve the performance of GPU and CPU/GPU systems, researchers often require advanced microarchitectural simulators with detailed models of the memory subsystem. Nevertheless, due to the fast pace at which current GPU architectures evolve, the simulation accuracy of existing state-of-the-art simulators suffers. This paper focuses on accurately modeling the GPU memory subsystem. We identified three main aspects that should be modeled with more accuracy: i) miss status holding registers, ii) coalescing of vector memory requests, and iii) non-blocking GPU stores. We extend the Multi2Sim heterogeneous CPU/GPU processor simulator to model these aspects with sufficient accuracy. Experimental results show that if these aspects are not considered in the simulation framework, performance deviations in some applications can reach up to 70%, 75%, and 60%, respectively.
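As a small illustration of one of the three aspects, memory-request coalescing, the sketch below groups the per-thread addresses of a wavefront into the unique cache-line-sized transactions a GPU memory subsystem would actually issue. The 64-byte line size and 64-thread wavefront are assumptions for illustration, not parameters from the paper or from Multi2Sim.

```python
# Illustrative sketch of memory-request coalescing: the per-thread byte
# addresses issued by one wavefront are collapsed into the set of unique
# cache-line-sized transactions the memory subsystem must service.
# A 64-byte line and a 64-thread wavefront are assumed for illustration.

LINE_SIZE = 64

def coalesce(addresses, line_size=LINE_SIZE):
    """Return the sorted list of distinct line-aligned transaction addresses."""
    return sorted({addr // line_size * line_size for addr in addresses})

if __name__ == "__main__":
    # Fully coalesced: 64 consecutive 4-byte accesses -> 4 transactions
    unit_stride = [0x1000 + 4 * t for t in range(64)]
    # Strided: 64 accesses 128 bytes apart -> 64 transactions
    strided = [0x1000 + 128 * t for t in range(64)]
    print(len(coalesce(unit_stride)), len(coalesce(strided)))
```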

Patent
10 Jun 2015
TL;DR: In this paper, a plurality of Ethernet NICs (network interface cards) and an Ethernet switch are integrated into a single network chip, and a cloud server system is built based on the chip.
Abstract: The invention discloses a network chip and a cloud server system. The network chip comprises a plurality of Ethernet NICs (network interface cards) and an Ethernet switch, wherein the plurality of NICs are connected with the Ethernet switch. The NICs and the Ethernet switch are integrated into a single network chip, and the cloud server system is built based on the chip. The structure of the cloud server system can meet the design requirements of a cloud server very well, that is, the performance per watt and the integrated service capacity are high, the cost and power consumption are low, and high performance is realized. Network virtualization is realized on the framework, and the performance of the server can be guaranteed to the largest extent.

Book ChapterDOI
26 May 2015
TL;DR: It is demonstrated that many-core chips offer new opportunities for extremely light-weight migration of independent processes running bare-metal on the many-core chip, and it is shown how this intra-chip migration can be utilized to achieve a better performance-per-watt ratio by implementing a hierarchical power-management scheme on top of dynamic voltage and frequency scaling (DVFS).
Abstract: Many-core chips are especially attractive for data center operators providing cloud computing service models. With the advance of many-core chips in such environments, energy-conscious scheduling of independent processes or operating systems (OSes) is gaining importance. An important research question is how the scheduler of such a system should assign the cores to the OSes in order to achieve better energy utilization. In this paper, we demonstrate that many-core chips offer new opportunities for extremely light-weight migration of independent processes (or OSes) running bare-metal on the many-core chip. We then show how this intra-chip migration can be utilized to achieve a better performance-per-watt ratio by implementing a hierarchical power-management scheme on top of dynamic voltage and frequency scaling (DVFS). We have implemented and tested the proposed techniques on the Intel Single Chip Cloud Computer (SCC). Combining migration with DVFS, we achieve, on average, 25–35% better performance per watt than a DVFS-only solution.
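The scheme itself is not detailed in the abstract; the sketch below only illustrates the combination the authors describe, first consolidating lightly loaded OS instances onto fewer voltage/frequency islands by migration, then dropping the frequency of islands that end up idle. Island counts, capacities, and frequencies are illustrative assumptions, not SCC parameters from the paper.

```python
# Illustrative sketch of combining migration with DVFS on a tiled many-core
# chip with voltage/frequency islands: consolidate light OS instances onto
# fewer islands, then drop emptied islands to a low frequency.
# Island count, capacities, and frequencies are illustrative assumptions.

ISLAND_CAPACITY = 4              # cores per voltage/frequency island
ISLANDS = 6
LOW_FREQ, HIGH_FREQ = 100, 800   # MHz

def consolidate(loads):
    """Greedy first-fit packing of per-OS core demands onto islands.
    Returns the per-island assigned load (in cores)."""
    islands = [0] * ISLANDS
    for demand in sorted(loads, reverse=True):
        for i, used in enumerate(islands):
            if used + demand <= ISLAND_CAPACITY:
                islands[i] = used + demand
                break
    return islands

def island_frequencies(islands):
    """Busy islands stay at the high frequency; emptied islands are dropped
    to the low frequency (or could be power-gated entirely)."""
    return [HIGH_FREQ if used > 0 else LOW_FREQ for used in islands]

if __name__ == "__main__":
    per_os_core_demand = [1, 1, 2, 1, 3]        # five light OS instances
    packing = consolidate(per_os_core_demand)   # -> [4, 4, 0, 0, 0, 0]
    print(packing, island_frequencies(packing))
```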

DissertationDOI
01 Jan 2015
TL;DR: The results show that a global address space is best for applications that require small, non-blocking, and irregular data transfers, and that by using GPU optimized communication models, between 10 and 50% better energy efficiency can be reached than by using a hybrid model with CPU-controlled communication.
Abstract: Today, GPUs and other parallel accelerators are widely used in high-performance computing, due to their high computational power and high performance per watt. Still, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This not only affects performance, but also power consumption. Often, a data transfer between two distributed GPUs even requires intermediate copies in host memory. This overhead penalizes small data movements and synchronization operations. In this work, different communication methods for distributed GPUs are implemented and evaluated. First, a new technique, called GPUDirect RDMA, is implemented for the Extoll device and evaluated. The performance results show that this technique brings performance benefits for small- and medium-sized data transfers, but for larger transfer sizes a staged protocol is preferable, since the PCIe bus does not support peer-to-peer data transfers well. In the next step, GPUs are integrated into the one-sided communication library GPI-2. Since this interface was designed for heterogeneous memory structures, it allows an easy integration of GPUs. The performance results show that using one-sided communication for GPUs brings some performance benefits compared to two-sided communication, which is the current state of the art. However, using GPI-2 for communication still requires a host thread to control GPU-related communication, although the data is transferred directly between the GPUs without any host copies. Therefore, the subsequent part of the work analyzes GPU-controlled communication. First, a put/get communication interface, based on InfiniBand verbs, is implemented for the GPU. This interface enables the GPU to independently source and synchronize communication requests without any involvement of the CPU. However, the InfiniBand verbs protocol adds a lot of sequential overhead to the communication, so the performance of GPU-controlled put/get communication is far behind the performance of CPU-controlled put/get communication. Another problem is intra-GPU synchronization, since GPU blocks are non-preemptive. The use of communication requests within a GPU can easily result in a deadlock. Dynamic parallelism solves this problem. Although the performance of applications using GPU-controlled communication is still slightly worse than the performance of hybrid applications, the performance per watt increases, since the CPU can be relieved of the communication work. As a communication model that is more in line with the massive parallelism of GPUs, the performance of a hardware-supported global address space for GPUs is evaluated. This global address space allows communication with simple load and store instructions, which can be performed by multiple threads in parallel. With this method, the latency of a GPU-to-GPU data transfer can be reduced to 3 µs, using an FPGA. The results show that a global address space is best for applications that require small, non-blocking, and irregular data transfers. However, the main bottleneck of this method is that it does not allow overlapping of communication and computation, which is possible with put/get communication. Overall, by using GPU-optimized communication models, depending on the application, between 10 and 50% better energy efficiency can be reached than with a hybrid model using CPU-controlled communication.

Proceedings ArticleDOI
19 Nov 2015
TL;DR: In this article, the performance per Watt of Xeon Phi coprocessors is characterized as a function of cooling technology using several HPC workload benchmarks run at constant frequency, such as the Intel proprietary Power Thermal Utility (PTU) and the industry-standard HPC benchmarks LINPACK, DGEMM, SGEMM, and STREAM.
Abstract: Efficient and compact cooling technologies play a pivotal role in determining the performance of high-performance computing devices when used with highly parallel workloads in supercomputers. The present work deals with the evaluation of different cooling technologies and elucidates their impact on the power, performance, and thermal management of Intel® Xeon Phi™ coprocessors. The scope of the study is to demonstrate enhanced cooling capabilities beyond today's fan-driven air-cooling for use in high-performance computing (HPC) technology, thereby improving the overall performance per Watt in datacenters. The cooling technologies evaluated in the present study include air-cooling, liquid-cooling, and two-phase immersion-cooling. Air-cooling is evaluated by providing controlled airflow to a cluster of eight 300 W Xeon Phi coprocessors (7120P). For liquid-cooling, two different cold plate technologies are evaluated, viz., formed-tube cold plates and microchannel-based cold plates. Liquid-cooling, with water as the working fluid, is evaluated on single Xeon Phi coprocessors, using inlet conditions in accordance with ASHRAE W2 and W3 class liquid-cooled datacenter baselines. For immersion-cooling, a cluster of multiple Xeon Phi coprocessors is evaluated with three different types of Integrated Heat Spreaders (IHS), viz., a bare IHS, an IHS with a Boiling Enhancement Coating (BEC), and an IHS with BEC-coated pin-fins. The entire cluster is immersed in a pool of Novec 649 (3M fluid, boiling point 49 °C at 1 atm), with polycarbonate spacers used to reduce the volume of fluid required, to achieve a target fluid/power density of ~3 L/kW. Flow visualization is performed to provide further insight into the boiling behavior during the immersion-cooling process. Performance per Watt of the Xeon Phi coprocessors is characterized as a function of the cooling technologies using several HPC workload benchmarks run at constant frequency, such as the Intel proprietary Power Thermal Utility (PTU) and the industry-standard HPC benchmarks LINPACK, DGEMM, SGEMM, and STREAM. The major parameters measured by sensors on the coprocessor include total power to the coprocessor, CPU temperature, and memory temperature, while the calculated outputs of interest also include the performance per watt and the equivalent thermal resistance. As expected, it is observed that both liquid and immersion cooling show improved performance per Watt and lower CPU temperature compared to air-cooling. In addition to elucidating the performance-per-watt improvement, this work reports on the relationship between cooling technology and the total power consumed by the Xeon Phi card as a function of coolant inlet temperature. Further, the paper discusses the form-factor advantages of liquid and immersion cooling and compares the technologies on a common platform. Finally, the paper concludes by discussing datacenter optimization for cooling in the context of leakage-power control for Xeon Phi coprocessors. Copyright © 2015 by ASME
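As a small worked illustration of two of the derived metrics mentioned above, the sketch below computes performance per watt from benchmark throughput and measured card power, and an equivalent thermal resistance from die temperature, coolant inlet temperature, and power. The input numbers are placeholders for illustration, not measurements from the paper.

```python
# Illustrative sketch of the two derived metrics mentioned above: performance
# per watt (benchmark throughput / measured power) and an equivalent thermal
# resistance ((die temperature - coolant inlet temperature) / power).
# All numbers below are placeholders, not measurements from the paper.

def performance_per_watt(gflops, power_w):
    return gflops / power_w

def equivalent_thermal_resistance(t_die_c, t_inlet_c, power_w):
    """Degrees C of temperature rise per watt dissipated."""
    return (t_die_c - t_inlet_c) / power_w

if __name__ == "__main__":
    # Hypothetical LINPACK run on a 300 W-class coprocessor under liquid cooling
    gflops, power_w = 1000.0, 280.0
    t_die_c, t_inlet_c = 66.0, 30.0
    print(round(performance_per_watt(gflops, power_w), 2), "GFLOPS/W")
    print(round(equivalent_thermal_resistance(t_die_c, t_inlet_c, power_w), 3), "degC/W")
```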

27 Apr 2015
TL;DR: The overall results show that the proposed parallel algorithms are highly performant, thus justifying the use of such technology; in addition, a software infrastructure for workflow management has been devised to provide support for CPU and GPU computation on a distributed GPU-based infrastructure.
Abstract: Recent advances in genome sequencing technologies and modern biological data analysis technologies used in bioinformatics have led to a fast and continuous increase in biological data. The difficulty of managing the huge amounts of data currently available to researchers and the need to have results within a reasonable time have led to the use of distributed and parallel computing infrastructures for their analysis. In this context Grid computing has been successfully used. Grid computing is based on a distributed system which interconnects several computers and/or clusters to access global-scale resources. This infrastructure is flexible, highly scalable and can achieve high performance with data- and compute-intensive algorithms. Recently, bioinformatics has been exploring new approaches based on the use of hardware accelerators, such as Graphics Processing Units (GPUs). Initially developed as graphics cards, GPUs have recently been introduced for scientific purposes by reason of their performance per watt and the better cost/performance ratio achieved in terms of throughput and response time compared to other high-performance computing solutions. Although developers must have an in-depth knowledge of GPU programming and hardware to be effective, GPU accelerators have produced a lot of impressive results. The use of high-performance computing infrastructures raises the question of finding a way to parallelize the algorithms while limiting data dependency issues in order to accelerate computations on massively parallel hardware. In this context, the research activity in this dissertation focused on the assessment and testing of the impact of these innovative high-performance computing technologies on computational biology. In order to achieve high levels of parallelism and, in the final analysis, obtain high performance, some of the bioinformatic algorithms applicable to genome data analysis were selected, analyzed and implemented. These algorithms have been highly parallelized and optimized, thus maximizing the use of the GPU hardware resources. The overall results show that the proposed parallel algorithms are highly performant, thus justifying the use of such technology. Furthermore, a software infrastructure for workflow management has been devised to provide support for CPU and GPU computation on a distributed GPU-based infrastructure. Moreover, this software infrastructure allows a further coarse-grained data-parallel parallelization on more GPUs. Results show that the proposed application speed-up increases with the increase in the number of GPUs.