
Showing papers on "Performance per watt" published in 2015


Proceedings ArticleDOI
07 Jun 2015
TL;DR: Based on the power-performance models developed, an efficient power management strategy is proposed and implemented on an Odroid-XU+E mobile platform; measurements show that it provides on average a 20% increase in performance per watt compared to the state of the art.
Abstract: Games have emerged as one of the most popular applications on mobile platforms. Recent platforms are now equipped with Heterogeneous Multiprocessor System-on-Chips (HMPSoCs) tightly integrating CPUs and GPUs on the same chip. This configuration enables high-end gaming on the platform, but at the cost of high power consumption that rapidly drains the underlying limited-capacity battery. HMPSoCs are capable of independent Dynamic Voltage and Frequency Scaling (DVFS) for CPUs and GPUs to reduce the platform's power consumption. State-of-the-art power managers for mobile games on HMPSoCs oversimplify the complex CPU-GPU interplay. In this paper, we develop power-performance models predicting the impact of DVFS on mobile gaming workloads. Based on our models, we propose an efficient power management strategy and implement it on an Odroid-XU+E mobile platform. Measurements on the platform show that our power manager provides on average a 20% increase in performance per watt compared to the state of the art.
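The abstract does not reproduce the paper's models or governor, so the following is only a minimal illustrative sketch of the general idea: a frame-rate-aware governor that picks the CPU/GPU frequency pair with the best predicted performance per watt. The frequency tables and the predict_fps / predict_power_w model functions below are hypothetical placeholders, not the authors' models.

```python
# Illustrative sketch only: a frame-rate-aware CPU/GPU DVFS governor that
# picks the frequency pair with the best predicted performance per watt.
# predict_fps() and predict_power_w() stand in for power-performance models
# and are purely hypothetical.

from itertools import product

CPU_FREQS_MHZ = [800, 1200, 1600, 2000]
GPU_FREQS_MHZ = [177, 266, 350, 480]
TARGET_FPS = 30.0

def predict_fps(cpu_mhz, gpu_mhz):
    # Hypothetical model: frame rate limited by the slower of the two domains.
    return min(cpu_mhz / 50.0, gpu_mhz / 12.0)

def predict_power_w(cpu_mhz, gpu_mhz):
    # Hypothetical model: power grows roughly quadratically with frequency
    # (voltage scales with frequency).
    return 0.4 + 1.5 * (cpu_mhz / 2000.0) ** 2 + 2.0 * (gpu_mhz / 480.0) ** 2

def choose_operating_point():
    """Return the (cpu, gpu) frequency pair maximizing FPS per watt
    among the points that still meet the target frame rate."""
    best, best_ppw = None, -1.0
    for cpu, gpu in product(CPU_FREQS_MHZ, GPU_FREQS_MHZ):
        fps = predict_fps(cpu, gpu)
        if fps < TARGET_FPS:
            continue  # would hurt playability
        ppw = fps / predict_power_w(cpu, gpu)
        if ppw > best_ppw:
            best, best_ppw = (cpu, gpu), ppw
    return best or (max(CPU_FREQS_MHZ), max(GPU_FREQS_MHZ))

if __name__ == "__main__":
    print(choose_operating_point())
```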

58 citations


Proceedings ArticleDOI
07 Feb 2015
TL;DR: This paper designs and experimentally validates power-performance models used to carefully select the appropriate kernel combinations to be executed concurrently, the relative contributions of the kernels to the thread mix, and the frequency choices for the cores and the memory, to achieve high performance per watt.
Abstract: Current generation GPUs can accelerate high-performance, compute-intensive applications by exploiting massive thread-level parallelism. The high performance, however, comes at the cost of increased power consumption. Recently, commercial GPGPU architectures have introduced support for concurrent kernel execution to better utilize the computational/memory resources and thereby improve overall throughput. In this paper, we argue and experimentally validate the benefits of concurrent kernels for energy-efficient execution. We design power-performance models to carefully select the appropriate kernel combinations to be executed concurrently, the relative contributions of the kernels to the thread mix, along with the frequency choices for the cores and the memory, to achieve high performance per watt. Our experimental evaluation shows that concurrent kernel execution in combination with DVFS can improve energy efficiency by up to 34.5% compared to the most energy-efficient sequential execution.
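The selection procedure itself is not given in the abstract; the sketch below only illustrates the general idea of ranking candidate kernel pairings and core/memory frequency settings by a modeled throughput-per-watt figure. The kernel list, frequency tables, and model functions are hypothetical, not the authors'.

```python
# Illustrative sketch: rank candidate kernel pairings and core/memory
# frequency settings by a modeled throughput-per-watt figure.
# All kernels, frequencies, and models below are hypothetical placeholders.

from itertools import combinations, product

KERNELS = {"matmul": "compute", "spmv": "memory", "stencil": "mixed"}
CORE_FREQS = [614, 705, 797]    # MHz (placeholder values)
MEM_FREQS = [1620, 2600, 3004]  # MHz (placeholder values)

def modeled_throughput(pair, f_core, f_mem):
    # Hypothetical: compute-bound kernels scale with core clock, memory-bound
    # kernels with memory clock; mixing both utilizes the GPU better.
    score = 0.0
    for k in pair:
        kind = KERNELS[k]
        score += (f_core if kind == "compute"
                  else f_mem if kind == "memory"
                  else 0.5 * (f_core + f_mem))
    complementary = len({KERNELS[k] for k in pair}) > 1
    return score * (1.3 if complementary else 1.0)

def modeled_power(f_core, f_mem):
    # Hypothetical dependence on each clock domain.
    return 30 + 0.05 * f_core + 0.01 * f_mem

def best_concurrent_config():
    """Pick the kernel pair and (core, memory) clocks with the highest
    modeled throughput per watt."""
    candidates = []
    for pair in combinations(KERNELS, 2):
        for f_core, f_mem in product(CORE_FREQS, MEM_FREQS):
            ppw = modeled_throughput(pair, f_core, f_mem) / modeled_power(f_core, f_mem)
            candidates.append((ppw, pair, f_core, f_mem))
    return max(candidates)

if __name__ == "__main__":
    print(best_concurrent_config())
```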

52 citations


Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work proposes ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput, and demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes, while providing better system throughput.
Abstract: Single-ISA (instruction set architecture) asymmetric multicore processors (AMPs) have been shown to deliver higher performance per watt and per unit area than symmetric CMPs (Chip Multi-Processors) for applications with diverse architectural requirements. A large body of work has demonstrated that this potential of AMP systems can be realized via OS scheduling. Yet, existing schedulers that seek to deliver fairness on AMPs do not ensure that equal-priority applications experience the same slowdown when sharing the system. Moreover, most of these schemes also suffer high throughput degradation and fail to deal effectively with user priorities. In this work we propose ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput. Our evaluation on real AMP hardware, using scheduler implementations on a general-purpose OS, demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes, while providing better system throughput.

18 citations


Journal ArticleDOI
TL;DR: By extending Amdahl's Law and the Karp-Flatt Metric to take resilience into consideration, this article quantitatively models the integrated energy efficiency in terms of performance per Watt and showcases the trade-offs among typical HPC parameters.
Abstract: The ever-growing performance of supercomputers brings demanding requirements for energy efficiency and resilience, due to the rapidly expanding size and duration of use of large-scale computing systems. Many application/architecture-dependent parameters that determine energy efficiency and resilience individually have causal effects on each other, which directly affect the trade-offs among performance, energy efficiency and resilience at scale. To enable high-efficiency management of large-scale High-Performance Computing (HPC) systems, a quantitative understanding of the entangled effects among performance, energy efficiency, and resilience is thus required. While previous work focuses on exploring energy-saving and resilience-enhancing opportunities separately, little has been done to theoretically and empirically investigate the interplay between energy efficiency and resilience at scale. In this article, by extending Amdahl's Law and the Karp-Flatt Metric to take resilience into consideration, we quantitatively model the integrated energy efficiency in terms of performance per Watt and showcase the trade-offs among typical HPC parameters, such as number of cores, frequency/voltage, and failure rates. Experimental results for a wide spectrum of HPC benchmarks on two HPC systems show that the proposed models are accurate in extrapolating resilience-aware performance and energy efficiency, and capable of capturing the interplay among various energy-saving and resilience factors. Moreover, the models can help find the optimal HPC configuration for the highest integrated energy efficiency, in the presence of failures and applied resilience techniques.
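The article's exact formulation is not reproduced in the abstract; the LaTeX fragment below only sketches the kind of model being extended, combining the classic Amdahl speedup with a simple power term and an availability-style resilience factor. The symbols and the specific form of the resilience factor are illustrative assumptions, not the authors' equations.

```latex
% Illustrative only -- not the article's actual model.
% Classic Amdahl speedup for parallel fraction p on n cores, and a
% performance-per-Watt figure discounted by an availability-style
% resilience factor A(lambda, n) that shrinks as the core count n and
% the per-core failure rate lambda grow:
\[
  S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad
  \mathrm{PPW}(n, f) =
  \frac{S(n)\, A(\lambda, n)}{P_{\mathrm{static}} + n\, P_{\mathrm{core}}(f, V)}
\]
```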

12 citations


Proceedings ArticleDOI
12 Mar 2015
TL;DR: 3D die stacking is demonstrated, whereby disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.
Abstract: Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, increasing power density is the obstacle to continued performance gains. Recent studies show that heterogeneous multi-core is a promising solution to optimize performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses are performed to illustrate the scalability of heterogeneous systems and their potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.

10 citations


Proceedings ArticleDOI
13 Sep 2015
TL;DR: This work shows optimization strategies evaluated and applied to an elastic propagator based on a fully staggered grid, running on the Intel® Xeon Phi™ coprocessor; notably, the propagator is able to reproduce elastic wave propagation even for an arbitrary anisotropy.
Abstract: The current trend in seismic imaging aims at using an improved physical model, considering that the Earth is not rigid but an elastic body. This new model takes simulations closer to the real physics of the problem, at the cost of raising the required computational resources. On the hardware front, recently developed high-performing devices, called accelerators or co-processors, have shown that they can outperform their general-purpose counterparts by orders of magnitude in terms of performance per watt. These new alternatives may provide the resources necessary to represent complex wave physics in a reasonable time. There might be, however, a penalty associated with the use of such devices, as some portion of the simulation code may need re-writing, or new optimization strategies may need to be explored and applied. In this work we show some optimization strategies evaluated and applied to an elastic propagator based on a Fully Staggered Grid, running on the Intel® Xeon Phi™ coprocessor. It is important to remark that the propagator is able to reproduce elastic wave propagation, even for an arbitrary anisotropy.

10 citations


Journal ArticleDOI
11 May 2015
TL;DR: This paper expands support to Calorimeter, Inner Detector, and Tracking code, and presents the findings on implementing a hybrid multi-threaded / multi-process framework, to take advantage of the strengths of each type of concurrency, while avoiding some of their corresponding limitations.
Abstract: The ATLAS experiment has successfully used its Gaudi/Athena software framework for data taking and analysis during the first LHC run, with billions of events successfully processed. However, the design of Gaudi/Athena dates from the early 2000s, and the software and physics code have been written using a single-threaded, serial design. This programming model has increasing difficulty in exploiting the potential of current CPUs, which offer their best performance only when taking full advantage of multiple cores and wide vector registers. Future CPU evolution will intensify this trend, with core counts increasing and memory per core falling. With current memory consumption for 64-bit ATLAS reconstruction in a high-luminosity environment approaching 4 GB, it will become impossible to fully occupy all cores in a machine without exhausting available memory. However, since maximizing performance per watt will be a key metric, a mechanism must be found to use all cores as efficiently as possible. In this paper we report on our progress with a practical demonstration of the use of multithreading in the ATLAS reconstruction software, using the GaudiHive framework. We have expanded support to Calorimeter, Inner Detector, and Tracking code, discussing what changes were necessary, both to the framework and to the tools and algorithms used, in order to allow the serially designed ATLAS code to run. We report on the performance gains, and on the general lessons learned about the code patterns that had been employed in the software and which patterns were identified as particularly problematic for multi-threading. We also present our findings on implementing a hybrid multi-threaded / multi-process framework, to take advantage of the strengths of each type of concurrency, while avoiding some of their corresponding limitations.

9 citations


Proceedings ArticleDOI
21 Jul 2015
TL;DR: The ePUMA platform is a flexible and configurable DSP platform that tries to address many of the problems with traditional DSP designs, increasing performance while using less power.
Abstract: Since the breakdown of Dennard scaling, the primary goal for processor designs has shifted from increasing performance to increasing performance per Watt. The ePUMA platform is a flexible and configurable DSP platform that tries to address many of the problems with traditional DSP designs, to increase performance while using less power. We trade the flexibility of traditional VLIW DSP designs for a simpler single-instruction-issue scheme and instead make sure that each instruction can perform more work. Multi-cycle instructions can operate directly on vectors and matrices in memory, and the datapaths implement common DSP subgraphs directly in hardware, for high compute throughput. Memory bottlenecks, which are common in other architectures, are handled with flexible LUT-based multi-bank memory addressing and memory parallelism. A major contributor to energy consumption, data movement, is reduced by using a heterogeneous interconnect and clustering compute resources around local memories for simple data sharing. To evaluate ePUMA we have implemented the majority of the kernel library from a commercial VLIW DSP manufacturer for comparison. Our results not only show good performance, but also an order-of-magnitude increase in energy and area efficiency. In addition, the kernel code size is reduced by 91% on average compared to the VLIW DSP. These benefits make ePUMA an attractive solution for future DSPs.

8 citations


Proceedings ArticleDOI
03 Sep 2015
TL;DR: A flexible parallel hardware-based architecture, used in conjunction with frequency scaling as a technique for reducing power consumption in video streaming applications, is presented, together with derived equations that ease the calculation of the level of parallelism and the maximum depth of the FIFOs used for clock domain crossing.
Abstract: Reconfigurable technology is well suited to real-time video streaming applications. It is considered a promising solution due to the performance per watt it offers compared to other technologies. Since FPGAs evolved, several techniques at different design levels, from the circuit level up to the system level, have been proposed to reduce the power consumption of FPGA devices. In this paper, we present a flexible parallel hardware-based architecture in conjunction with frequency scaling as a technique for reducing power consumption in video streaming applications. In this work, we derive equations to ease the calculation of the level of parallelism and the maximum depth of the FIFOs used for clock domain crossing. Accordingly, a design space is formed including all the design alternatives for the application. The preferred design alternative is selected with awareness of how much hardware it costs and what power-reduction goal it can satisfy. We used a Xilinx Zynq ZC706 evaluation board to implement two video streaming applications, a video downscaler (1:16) and the AES encryption algorithm, to verify our approach. The experimental results showed up to 19.6% power reduction for the video downscaler and up to 5.4% for AES encryption.
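The paper's derived equations are not reproduced in the abstract; the sketch below only illustrates the style of calculation involved, estimating the required level of parallelism from the pixel and processing clock rates and sizing a clock-domain-crossing FIFO from the frequency ratio and burst length. The formulas, margins, and parameter names are generic assumptions, not the authors' equations.

```python
# Illustrative sketch only: back-of-the-envelope sizing of the level of
# parallelism and of a clock-domain-crossing FIFO for a streaming video
# pipeline. Generic estimates, not the paper's derived equations.

import math

def level_of_parallelism(pixel_rate_mhz, core_clock_mhz, cycles_per_pixel=1):
    """Number of parallel processing lanes needed so the (down-clocked)
    processing domain keeps up with the incoming pixel stream."""
    required = pixel_rate_mhz * cycles_per_pixel / core_clock_mhz
    return max(1, math.ceil(required))

def cdc_fifo_depth(write_clock_mhz, read_clock_mhz, burst_len):
    """Worst-case words accumulated in the FIFO during one burst when the
    writer is faster than the reader (plus a small synchronizer margin)."""
    if write_clock_mhz <= read_clock_mhz:
        return 4  # minimal depth for synchronization only
    backlog = burst_len * (1 - read_clock_mhz / write_clock_mhz)
    return math.ceil(backlog) + 4  # +4 words margin for CDC synchronizers

if __name__ == "__main__":
    # e.g. a ~148.5 Mpixel/s stream (1080p60) into a 75 MHz processing domain
    print(level_of_parallelism(148.5, 75.0))               # -> 2 lanes
    print(cdc_fifo_depth(148.5, 75.0, burst_len=1920))     # one-line burst
```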

8 citations


Journal ArticleDOI
TL;DR: The roadmap to using massively parallel architectures in a 3D-FD simulation is shown, revealing that NVIDIA architectures outperform the Intel Xeon Phi co-processor by a wide margin while dissipating approximately 50 W less for large-scale input problems.

8 citations


Journal ArticleDOI
TL;DR: The usefulness of the simulator for this type of studies is demonstrated and it is concluded that the superior behavior of multiobjective algorithms makes them recommended for use in modern scheduling systems.
Abstract: Today, in an energy-aware society, job scheduling is becoming an important task for computer engineers and system analysts, and one that may lead to a performance-per-Watt trade-off in computing infrastructures. Thus, new algorithms, and a simulator of computing environments, may help information and communications technology and data center managers to make decisions with a solid experimental basis. There are several simulators that try to address performance and, somehow, estimate energy consumption, but there are none in which the energy model is based on benchmark data countersigned by independent bodies such as the Standard Performance Evaluation Corporation. This is the reason why we have implemented a performance- and energy-aware scheduling (PEAS) simulator for high-performance computing. Furthermore, to evaluate the simulator, we propose an implementation of the non-dominated sorting genetic algorithm II (NSGA-II), a fast and elitist multiobjective genetic algorithm, for resource selection. With the help of the PEAS simulator, we have studied whether it is possible to provide an intelligent job allocation policy that is able to save energy and time without compromising performance. The results of our simulations show a great improvement in response time and power consumption. In most cases, NSGA-II performs better than other 'intelligent' algorithms like multiobjective heterogeneous earliest finish time, and clearly outperforms the first-fit algorithm. We demonstrate the usefulness of the simulator for this type of study and conclude that the superior behavior of multiobjective algorithms makes them recommended for use in modern scheduling systems. Copyright © 2015 John Wiley & Sons, Ltd.
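NSGA-II itself is too long to reproduce here; the fragment below only sketches its core idea, non-dominated (Pareto) selection over candidate job allocations scored by completion time and energy. The candidate allocations and their objective values are made-up illustrations, not results from the paper.

```python
# Illustrative sketch: the Pareto (non-dominated) selection at the heart of
# multiobjective schedulers such as NSGA-II, applied to candidate job
# allocations scored by (completion_time, energy). Candidate data are made up.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the allocations not dominated by any other allocation."""
    front = []
    for name, objs in candidates:
        if not any(dominates(other, objs) for _, other in candidates if other != objs):
            front.append((name, objs))
    return front

if __name__ == "__main__":
    # (allocation label, (completion time in s, energy in kJ))
    candidates = [
        ("pack-on-fast-nodes", (120.0, 95.0)),
        ("spread-on-efficient-nodes", (150.0, 70.0)),
        ("first-fit", (160.0, 98.0)),   # dominated by both alternatives above
    ]
    print(pareto_front(candidates))
```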

Proceedings ArticleDOI
15 Nov 2015
TL;DR: A course in GPU programming for senior undergraduates and first-year graduates that has been taught at Clemson University annually since 2010 is described, with focus on a large, real-world problem, in particular, a system for parallel solution of partial differential equations.
Abstract: Compared to CPUs, modern GPUs exhibit a high ratio of computing performance per watt, and so current supercomputer designs often include multiple racks of GPUs in order to achieve high teraflop counts at minimal energy cost. GPU programming is thus becoming increasingly important, and yet it remains a challenging task. This paper describes a course in GPU programming for senior undergraduates and first-year graduates that has been taught at Clemson University annually since 2010. The course uses problem-based learning, with a focus on a large, real-world problem: a system for parallel solution of partial differential equations. Although the system for solving PDEs is useful in its own right, the problem is used as a vehicle to explore design issues that face those attempting to achieve new levels of performance on these architectures.

Proceedings ArticleDOI
22 Jul 2015
TL;DR: This study examines a class of embedded system applications relevant to mobile vehicles to understand the limits of achievable energy efficiency under varying levels of system resilience constraints and considers static optimization of voltage-frequency settings on a per-application-segment basis.
Abstract: Low-power embedded processing typically relies on dynamic voltage-frequency scaling (DVFS) in order to optimize energy usage (and therefore, battery life). However, low-voltage operation exacerbates the incidence of soft errors. Similarly, higher-voltage operation (to meet real-time deadlines) is constrained by hard-failure rate limits. In this paper, we examine a class of embedded system applications relevant to mobile vehicles. We investigate the problem of assigning optimal voltage-frequency settings to individual segments within target workflows. The goal of this study is to understand the limits of achievable energy efficiency (performance per watt) under varying levels of system resilience constraints. To optimize for energy efficiency, we consider static optimization of voltage-frequency settings on a per-application-segment basis. We consider both linear and graph-structured workflows. In order to understand the loss in energy efficiency in the face of environmental uncertainties encountered by the mobile vehicle, we also study the effect of injecting random variations into the actual runtime of individual application segments. A dynamic re-optimization of the voltage-frequency settings is required to cope with such in-field uncertainties.
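The paper's optimization formulation is not given in the abstract; the sketch below only illustrates static per-segment voltage-frequency selection by exhaustive search, minimizing modeled energy subject to an end-to-end deadline and a soft-error-rate budget. All numbers, the energy model, and the error model are hypothetical placeholders.

```python
# Illustrative sketch: static assignment of a voltage-frequency setting to each
# segment of a linear workflow by exhaustive search, minimizing modeled energy
# while respecting a deadline and a soft-error-rate budget.
# All numbers and models are hypothetical, not the paper's.

from itertools import product

# (frequency GHz, voltage V, relative soft-error rate at that voltage)
VF_POINTS = [(0.6, 0.8, 4.0), (1.0, 0.9, 2.0), (1.4, 1.1, 1.0)]
SEGMENT_WORK = [1.2e9, 0.8e9, 2.0e9]    # cycles per segment
DEADLINE_S = 3.5
SER_BUDGET = 8.0                        # arbitrary units

def evaluate(assignment):
    time = energy = ser = 0.0
    for cycles, (f_ghz, v, err) in zip(SEGMENT_WORK, assignment):
        t = cycles / (f_ghz * 1e9)
        time += t
        energy += 2.0 * (v ** 2) * f_ghz * t   # ~ C * V^2 * f * t, C folded into 2.0
        ser += err * t                          # longer low-voltage runs accumulate risk
    return time, energy, ser

def best_static_assignment():
    best, best_energy = None, float("inf")
    for assignment in product(VF_POINTS, repeat=len(SEGMENT_WORK)):
        time, energy, ser = evaluate(assignment)
        if time <= DEADLINE_S and ser <= SER_BUDGET and energy < best_energy:
            best, best_energy = assignment, energy
    return best, best_energy

if __name__ == "__main__":
    print(best_static_assignment())
```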

Journal ArticleDOI
TL;DR: This paper proposes high-performance and energy-efficient multicore architectures for a variety of parallelisms and memory intensities in workloads, and uses dynamic voltage and frequency scaling in Amdahl's law to decrease the amount of dark silicon and improve performance and performance per watt/joule.
Abstract: As technology scales further, multicore and many-core processors emerge as an alternative to keep up with performance demands. However, because of power and thermal constraints, we are obliged to power off a remarkable area of the chip. Many innovative techniques have been presented to improve energy efficiency and maintain utilization at the highest level. In this paper, we discuss different models and methods of exploiting dark silicon, and by using dynamic voltage and frequency scaling in Amdahl's law and considering memory overheads, we attempt to decrease the amount of dark silicon and improve performance and performance per watt/joule. We propose high-performance and energy-efficient multicore architectures for a variety of parallelisms and memory intensities in workloads. According to the results, by voltage scaling, for a highly parallel CPU-intensive workload, we reach improvements of approximately 5.2× and 3.78× in performance per watt and performance per joule, respectively, while a reduction in performance of about 27% must be tolerated. For memory-intensive applications, a negligible change in speedup is detected under scaling, while performance per watt and performance per joule for both serial and parallel applications see around 6× enhancements.
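The authors' exact model is not reproduced in the abstract; the LaTeX fragment below only sketches the usual way DVFS is folded into Amdahl-style reasoning, with performance scaling roughly with frequency and dynamic power with CV²f per active core. Symbols and proportionalities are illustrative assumptions.

```latex
% Illustrative only -- a generic DVFS-in-Amdahl sketch, not the authors' model.
% With n active cores running at frequency f (voltage V ~ f), a parallel
% fraction p, and dynamic power ~ C V^2 f per core:
\[
  \mathrm{Perf}(n, f) \propto \frac{f}{(1 - p) + \dfrac{p}{n}}, \qquad
  \frac{\mathrm{Perf}}{\mathrm{Watt}} \propto
  \frac{\mathrm{Perf}(n, f)}{P_{\mathrm{static}} + n\, C V^{2} f}
\]
% Lowering V and f sacrifices some serial speed but lets more cores stay
% powered within the same budget -- the dark-silicon trade-off explored here.
```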

Proceedings ArticleDOI
20 Jul 2015
TL;DR: The Multi2Sim heterogeneous CPU/GPU processor simulator is extended to model the GPU memory subsystem with enough accuracy, and three main aspects that should be modeled with more accuracy are identified: i) miss status holding registers, ii) coalescing vector memory requests, and iii) non-blocking GPU stores.
Abstract: Nowadays, research on GPU processor architecture is extraordinarily active, since these architectures offer much more performance per watt than CPU architectures. This is the main reason why massive deployment of GPU multiprocessors is considered one of the most feasible solutions to attain exascale computing capabilities. In this context, ongoing GPU architecture research is required to improve GPU programmability as well as to integrate CPU and GPU cores in the same die. One of the most important research topics in current GPUs is the GPU memory hierarchy, since its design goals are very different from those of conventional CPU memory hierarchies. To explore novel designs to better support General-Purpose computing on GPUs (GPGPU computing), as well as to improve the performance of GPU and CPU/GPU systems, researchers often require advanced microarchitectural simulators with detailed models of the memory subsystem. Nevertheless, due to the fast pace at which current GPU architectures evolve, the simulation accuracy of existing state-of-the-art simulators suffers. This paper focuses on accurately modeling the GPU memory subsystem. We identified three main aspects that should be modeled with more accuracy: i) miss status holding registers, ii) coalescing of vector memory requests, and iii) non-blocking GPU stores. We extend the Multi2Sim heterogeneous CPU/GPU processor simulator to model these aspects with sufficient accuracy. Experimental results show that if these aspects are not considered in the simulation framework, performance deviations in some applications can reach up to 70%, 75%, and 60%, respectively.
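As a small illustration of one of the three aspects, memory-request coalescing, the sketch below groups the per-thread addresses of a wavefront into the unique cache-line-sized transactions a GPU memory subsystem would actually issue. The 64-byte line size and 64-thread wavefront are assumptions for illustration, not parameters from the paper or from Multi2Sim.

```python
# Illustrative sketch of memory-request coalescing: the per-thread byte
# addresses issued by one wavefront are collapsed into the set of unique
# cache-line-sized transactions the memory subsystem must service.
# A 64-byte line and a 64-thread wavefront are assumed for illustration.

LINE_SIZE = 64

def coalesce(addresses, line_size=LINE_SIZE):
    """Return the sorted list of distinct line-aligned transaction addresses."""
    return sorted({addr // line_size * line_size for addr in addresses})

if __name__ == "__main__":
    # Fully coalesced: 64 consecutive 4-byte accesses -> 4 transactions
    unit_stride = [0x1000 + 4 * t for t in range(64)]
    # Strided: 64 accesses 128 bytes apart -> 64 transactions
    strided = [0x1000 + 128 * t for t in range(64)]
    print(len(coalesce(unit_stride)), len(coalesce(strided)))
```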

Patent
10 Jun 2015
TL;DR: In this paper, a plurality of Ethernet NICs (network interface cards) and an Ethernet switch are integrated into a single network chip, and a cloud server system is built based on the chip.
Abstract: The invention discloses a network chip and a cloud server system. The network chip comprises a plurality of Ethernet NICs (network interface cards) and an Ethernet switch, wherein the plurality of NICs are connected with the Ethernet switch. The NICs and the Ethernet switch are integrated into a single network chip, and the cloud server system is built based on the chip. The structure of the cloud server system can meet the design requirements of a cloud server very well, that is, the performance per watt and the integrated service capacity are high, the cost and power consumption are low, and high performance is realized. Network virtualization is realized on the framework, and the performance of the server can be guaranteed to the largest extent.

Book ChapterDOI
26 May 2015
TL;DR: It is demonstrated that many-core chips offer new opportunities for extremely light-weight migration of independent processes running bare-metal on the many-core chip, and it is shown how this intra-chip migration can be utilized to achieve a better performance-per-watt ratio by implementing a hierarchical power-management scheme on top of dynamic voltage and frequency scaling (DVFS).
Abstract: Many-core chips are especially attractive for data center operators providing cloud computing service models. With the advance of many-core chips in such environments, energy-conscious scheduling of independent processes or operating systems (OSes) is gaining importance. An important research question is how the scheduler of such a system should assign the cores to the OSes in order to achieve better energy utilization. In this paper, we demonstrate that many-core chips offer new opportunities for extremely light-weight migration of independent processes (or OSes) running bare-metal on the many-core chip. We then show how this intra-chip migration can be utilized to achieve a better performance-per-watt ratio by implementing a hierarchical power-management scheme on top of dynamic voltage and frequency scaling (DVFS). We have implemented and tested the proposed techniques on the Intel Single Chip Cloud Computer (SCC). Combining migration with DVFS, we achieve, on average, 25–35% better performance per watt than a DVFS-only solution.
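The scheme itself is not detailed in the abstract; the sketch below only illustrates the combination the authors describe, first consolidating lightly loaded OS instances onto fewer voltage/frequency islands by migration, then dropping the frequency of islands that end up idle. Island counts, capacities, and frequencies are illustrative assumptions, not SCC parameters from the paper.

```python
# Illustrative sketch of combining migration with DVFS on a tiled many-core
# chip with voltage/frequency islands: consolidate light OS instances onto
# fewer islands, then drop emptied islands to a low frequency.
# Island count, capacities, and frequencies are illustrative assumptions.

ISLAND_CAPACITY = 4              # cores per voltage/frequency island
ISLANDS = 6
LOW_FREQ, HIGH_FREQ = 100, 800   # MHz

def consolidate(loads):
    """Greedy first-fit packing of per-OS core demands onto islands.
    Returns the per-island assigned load (in cores)."""
    islands = [0] * ISLANDS
    for demand in sorted(loads, reverse=True):
        for i, used in enumerate(islands):
            if used + demand <= ISLAND_CAPACITY:
                islands[i] = used + demand
                break
    return islands

def island_frequencies(islands):
    """Busy islands stay at the high frequency; emptied islands are dropped
    to the low frequency (or could be power-gated entirely)."""
    return [HIGH_FREQ if used > 0 else LOW_FREQ for used in islands]

if __name__ == "__main__":
    per_os_core_demand = [1, 1, 2, 1, 3]        # five light OS instances
    packing = consolidate(per_os_core_demand)   # -> [4, 4, 0, 0, 0, 0]
    print(packing, island_frequencies(packing))
```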

DissertationDOI
01 Jan 2015
TL;DR: The results show that a global address space is best for applications that require small, non-blocking, and irregular data transfers, and that by using GPU optimized communication models, between 10 and 50% better energy efficiency can be reached than by using a hybrid model with CPU-controlled communication.
Abstract: Today, GPUs and other parallel accelerators are widely used in high-performance computing, due to their high computational power and high performance per watt. Still, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This not only affects performance, but also power consumption. Often, a data transfer between two distributed GPUs even requires intermediate copies in host memory. This overhead penalizes small data movements and synchronization operations. In this work, different communication methods for distributed GPUs are implemented and evaluated. First, a new technique, called GPUDirect RDMA, is implemented for the Extoll device and evaluated. The performance results show that this technique brings performance benefits for small- and medium-sized data transfers, but for larger transfer sizes a staged protocol is preferable, since the PCIe bus does not support peer-to-peer data transfers well. In the next step, GPUs are integrated into the one-sided communication library GPI-2. Since this interface was designed for heterogeneous memory structures, it allows an easy integration of GPUs. The performance results show that using one-sided communication for GPUs brings some performance benefits compared to two-sided communication, which is the current state of the art. However, using GPI-2 for communication still requires a host thread to control GPU-related communication, although the data is transferred directly between the GPUs without any host copies. Therefore, the subsequent part of the work analyzes GPU-controlled communication. First, a put/get communication interface, based on InfiniBand verbs, is implemented for the GPU. This interface enables the GPU to independently source and synchronize communication requests without any involvement of the CPU. However, the InfiniBand verbs protocol adds a lot of sequential overhead to the communication, so the performance of GPU-controlled put/get communication is far behind the performance of CPU-controlled put/get communication. Another problem is intra-GPU synchronization, since GPU blocks are non-preemptive. The use of communication requests within a GPU can easily result in a deadlock. Dynamic parallelism solves this problem. Although the performance of applications using GPU-controlled communication is still slightly worse than the performance of hybrid applications, the performance per watt increases, since the CPU can be relieved of the communication work. As a communication model that is more in line with the massive parallelism of GPUs, the performance of a hardware-supported global address space for GPUs is evaluated. This global address space allows communication with simple load and store instructions, which can be performed by multiple threads in parallel. With this method, the latency of a GPU-to-GPU data transfer can be reduced to 3 µs, using an FPGA. The results show that a global address space is best for applications that require small, non-blocking, and irregular data transfers. However, the main bottleneck of this method is that it does not allow overlapping of communication and computation, which is possible with put/get communication. Overall, by using GPU-optimized communication models, depending on the application, between 10 and 50% better energy efficiency can be reached than with a hybrid model using CPU-controlled communication.

Proceedings ArticleDOI
19 Nov 2015
TL;DR: In this article, the performance per Watt of Xeon Phi coprocessors is characterized as a function of cooling technology using several HPC workload benchmarks run at constant frequency, such as the Intel proprietary Power Thermal Utility (PTU) and the industry-standard HPC benchmarks LINPACK, DGEMM, SGEMM, and STREAM.
Abstract: Efficient and compact cooling technologies play a pivotal role in determining the performance of high-performance computing devices when used with highly parallel workloads in supercomputers. The present work deals with the evaluation of different cooling technologies and elucidates their impact on the power, performance, and thermal management of Intel® Xeon Phi™ coprocessors. The scope of the study is to demonstrate enhanced cooling capabilities beyond today's fan-driven air-cooling for use in high-performance computing (HPC) technology, thereby improving the overall performance per Watt in datacenters. The cooling technologies evaluated in the present study include air-cooling, liquid-cooling, and two-phase immersion-cooling. Air-cooling is evaluated by providing controlled airflow to a cluster of eight 300 W Xeon Phi coprocessors (7120P). For liquid-cooling, two different cold plate technologies are evaluated, viz., formed-tube cold plates and microchannel-based cold plates. Liquid-cooling, with water as the working fluid, is evaluated on single Xeon Phi coprocessors, using inlet conditions in accordance with ASHRAE W2 and W3 class liquid-cooled datacenter baselines. For immersion-cooling, a cluster of multiple Xeon Phi coprocessors is evaluated with three different types of Integrated Heat Spreaders (IHS), viz., a bare IHS, an IHS with a Boiling Enhancement Coating (BEC), and an IHS with BEC-coated pin-fins. The entire cluster is immersed in a pool of Novec 649 (3M fluid, boiling point 49 °C at 1 atm), with polycarbonate spacers used to reduce the volume of fluid required, to achieve a target fluid/power density of ~3 L/kW. Flow visualization is performed to provide further insight into the boiling behavior during the immersion-cooling process. Performance per Watt of the Xeon Phi coprocessors is characterized as a function of the cooling technologies using several HPC workload benchmarks run at constant frequency, such as the Intel proprietary Power Thermal Utility (PTU) and the industry-standard HPC benchmarks LINPACK, DGEMM, SGEMM, and STREAM. The major parameters measured by sensors on the coprocessor include total power to the coprocessor, CPU temperature, and memory temperature, while the calculated outputs of interest also include the performance per watt and the equivalent thermal resistance. As expected, it is observed that both liquid and immersion cooling show improved performance per Watt and lower CPU temperature compared to air-cooling. In addition to elucidating the performance-per-watt improvement, this work reports on the relationship between cooling technology and the total power consumed by the Xeon Phi card as a function of coolant inlet temperature. Further, the paper discusses the form-factor advantages of liquid and immersion cooling and compares the technologies on a common platform. Finally, the paper concludes by discussing datacenter optimization for cooling in the context of leakage-power control for Xeon Phi coprocessors. Copyright © 2015 by ASME
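As a small worked illustration of two of the derived metrics mentioned above, the sketch below computes performance per watt from benchmark throughput and measured card power, and an equivalent thermal resistance from die temperature, coolant inlet temperature, and power. The input numbers are placeholders for illustration, not measurements from the paper.

```python
# Illustrative sketch of the two derived metrics mentioned above: performance
# per watt (benchmark throughput / measured power) and an equivalent thermal
# resistance ((die temperature - coolant inlet temperature) / power).
# All numbers below are placeholders, not measurements from the paper.

def performance_per_watt(gflops, power_w):
    return gflops / power_w

def equivalent_thermal_resistance(t_die_c, t_inlet_c, power_w):
    """Degrees C of temperature rise per watt dissipated."""
    return (t_die_c - t_inlet_c) / power_w

if __name__ == "__main__":
    # Hypothetical LINPACK run on a 300 W-class coprocessor under liquid cooling
    gflops, power_w = 1000.0, 280.0
    t_die_c, t_inlet_c = 66.0, 30.0
    print(round(performance_per_watt(gflops, power_w), 2), "GFLOPS/W")
    print(round(equivalent_thermal_resistance(t_die_c, t_inlet_c, power_w), 3), "degC/W")
```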

27 Apr 2015
TL;DR: The overall results show that the proposed parallel algorithms are highly performant, thus justifying the use of such technology; in addition, a software infrastructure for workflow management has been devised to provide support for CPU and GPU computation on a distributed GPU-based infrastructure.
Abstract: Recent advances in genome sequencing technologies and modern biological data analysis technologies used in bioinformatics have led to a fast and continuous increase in biological data. The difficulty of managing the huge amounts of data currently available to researchers and the need to have results within a reasonable time have led to the use of distributed and parallel computing infrastructures for their analysis. In this context Grid computing has been successfully used. Grid computing is based on a distributed system which interconnects several computers and/or clusters to access global-scale resources. This infrastructure is flexible, highly scalable and can achieve high performance with data- and compute-intensive algorithms. Recently, bioinformatics has been exploring new approaches based on the use of hardware accelerators, such as Graphics Processing Units (GPUs). Initially developed as graphics cards, GPUs have recently been introduced for scientific purposes by reason of their performance per watt and the better cost/performance ratio achieved in terms of throughput and response time compared to other high-performance computing solutions. Although developers must have an in-depth knowledge of GPU programming and hardware to be effective, GPU accelerators have produced a lot of impressive results. The use of high-performance computing infrastructures raises the question of finding a way to parallelize the algorithms while limiting data dependency issues in order to accelerate computations on massively parallel hardware. In this context, the research activity in this dissertation focused on the assessment and testing of the impact of these innovative high-performance computing technologies on computational biology. In order to achieve high levels of parallelism and, in the final analysis, obtain high performance, some of the bioinformatic algorithms applicable to genome data analysis were selected, analyzed and implemented. These algorithms have been highly parallelized and optimized, thus maximizing the use of the GPU hardware resources. The overall results show that the proposed parallel algorithms are highly performant, thus justifying the use of such technology. Furthermore, a software infrastructure for workflow management has been devised to provide support for CPU and GPU computation on a distributed GPU-based infrastructure. Moreover, this software infrastructure allows a further coarse-grained data-parallel parallelization on more GPUs. Results show that the proposed application speed-up increases with the increase in the number of GPUs.