Topic
Performance per watt
About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.
Papers published on a yearly basis
Papers
16 Jun 2013
TL;DR: This work considers the total cost of ownership (TCO), including costs for administration and programming effort, and computes the cost per program run, which can be used as a comparison metric for a hardware purchase decision.
Abstract: HPC systems now come in a great variety, including commodity processors with attached accelerators that promise to improve the performance-per-watt ratio. These heterogeneous architectures are often far more complex to employ. Therefore, a hardware purchase decision should take into account not only capital expenses and operational costs such as power consumption, but also manpower. In this work, we take a look at the total cost of ownership (TCO), which includes costs for administration and programming effort. From that, we compute the cost per program run, which can be used as a comparison metric for a purchase decision. In a case study, we evaluate our approach on two real-world simulation applications on Intel Xeon architectures, NVIDIA GPUs and Intel Xeon Phis by using different programming models: OpenCL, OpenACC, OpenMP and Intel’s Language Extensions for Offload.
15 citations
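The TCO-based comparison metric described above can be sketched as follows. All cost figures and the breakdown into capital, operational, administration, and programming costs below are illustrative assumptions, not numbers from the study:

```python
# Hypothetical sketch of a cost-per-run metric built from TCO components.
# Every number here is made up for illustration.

def cost_per_run(capex, opex_per_year, admin_per_year,
                 programming_effort, lifetime_years, runs_per_year):
    """Total cost of ownership divided by the total number of program runs."""
    tco = (capex
           + programming_effort  # one-time porting/tuning effort
           + lifetime_years * (opex_per_year + admin_per_year))
    return tco / (lifetime_years * runs_per_year)

# Compare a CPU-only node with a GPU-accelerated one (assumed values):
cpu = cost_per_run(capex=10_000, opex_per_year=2_000, admin_per_year=1_000,
                   programming_effort=5_000, lifetime_years=5, runs_per_year=4_000)
gpu = cost_per_run(capex=18_000, opex_per_year=1_500, admin_per_year=1_000,
                   programming_effort=15_000, lifetime_years=5, runs_per_year=12_000)
print(f"CPU: {cpu:.2f} per run, GPU: {gpu:.2f} per run")
```

With these assumed figures, the accelerated node wins on cost per run despite its higher capital and programming costs, because it completes more runs over the same lifetime; different assumptions can flip the outcome, which is the paper's point about including manpower in the decision.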
18 Aug 2010
TL;DR: This paper proposes and implements mechanisms and policies for a commercial OS scheduler and load balancer that incorporate thread characteristics, and shows improvements of up to 30% in performance per watt.
Abstract: Runtime characteristics of individual threads (such as IPC, cache usage, etc.) are a critical factor in making efficient scheduling decisions in modern chip-multiprocessor systems. They provide key insights into how threads interact when they share processor resources, and affect overall system power and performance efficiency. In this paper, we propose and implement mechanisms and policies for a commercial OS scheduler and load balancer that incorporate thread characteristics, and show that they result in improvements of up to 30% in performance per watt.
14 citations
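The idea of characteristic-aware placement can be illustrated with a toy heuristic: pair a cache-heavy thread with a compute-heavy one on each two-way core so they contend less for the shared cache. The thread data and the pairing rule below are assumptions for illustration, not the paper's actual OS-level policy:

```python
# Toy characteristic-aware load balancing. Values are made up.
threads = {  # thread -> cache misses per 1k instructions (assumed measurements)
    "A": 40, "B": 2, "C": 35, "D": 5,
}

# Sort by cache intensity, then co-schedule the most and least
# intensive remaining threads so shared-cache pressure is balanced.
by_intensity = sorted(threads, key=threads.get, reverse=True)
pairs = [(by_intensity[i], by_intensity[-1 - i])
         for i in range(len(by_intensity) // 2)]
print(pairs)  # -> [('A', 'B'), ('C', 'D')]
```

A real scheduler would refresh these measurements from hardware performance counters and rebalance periodically rather than computing a one-shot assignment.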
TL;DR: A new metric, ASTPI (Average Stall Time Per Instruction), is proposed, and a new online monitoring approach called ESHMP, based on this metric, is designed, implemented, and evaluated. The results show that among HMP systems that adopt heterogeneity-aware schedulers and have more than one LLC, architectures where heterogeneous cores share LLCs achieve better performance than those where homogeneous cores share LLCs.
14 citations
18 Mar 2013
TL;DR: CReAMS is composed of multiple adaptive reconfigurable processors that simultaneously exploit instruction- and thread-level parallelism; it works transparently, so binary compatibility is maintained with no need to change the software development process or environment.
Abstract: As the number of embedded applications increases, companies are launching new platforms within short periods of time to efficiently execute software with the lowest possible energy consumption. However, with each new platform deployment, new tool chains with additional libraries, debuggers and compilers must come along, breaking binary compatibility. This strategy implies high hardware and software redesign costs. In this scenario, we propose the exploitation of Custom Reconfigurable Arrays for Multiprocessor Systems (CReAMS). CReAMS is composed of multiple adaptive reconfigurable processors that simultaneously exploit instruction- and thread-level parallelism. It works in a transparent fashion, so binary compatibility is maintained, with no need to change the software development process or environment. We also show that CReAMS delivers higher performance per watt than a 4-issue superscalar processor when the same power budget is considered for both designs.
14 citations
TL;DR: The authors investigate how GPU power consumption increases non-linearly with both temperature and supply voltage, as predicted by physical transistor models, and show that lowering GPU supply voltage and increasing clock frequency while maintaining a low die temperature increases the power efficiency of an NVIDIA K20 GPU.
Abstract: The magnitude of the real-time digital signal processing challenge attached to large radio astronomical antenna arrays motivates use of high performance computing (HPC) systems. The need for high power efficiency at remote observatory sites parallels that in HPC broadly, where efficiency is a critical metric. We investigate how the performance-per-watt of graphics processing units (GPUs) is affected by temperature, core clock frequency and voltage. Our results highlight how the underlying physical processes that govern transistor operation affect power efficiency. In particular, we show experimentally that GPU power consumption increases non-linearly (quadratically) with both temperature and supply voltage, as predicted by physical transistor models. We show that lowering GPU supply voltage and increasing clock frequency while maintaining a low die temperature increases the power efficiency of an NVIDIA K20 GPU by up to 37-48% over default settings when running xGPU, a compute-bound code used in radio astronomy. We discuss how automatic temperature-aware and application-dependent voltage and frequency scaling (T-DVFS and A-DVFS) may provide a mechanism to achieve better power efficiency for a wider range of compute codes running on GPUs.
14 citations
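The relation the paper measures follows the standard dynamic-power model, P ≈ C·V²·f, so throughput per watt can improve when supply voltage is lowered while the clock is raised (within stability limits). The constants and operating points below are illustrative assumptions, not K20 measurements:

```python
# Sketch of the quadratic voltage dependence of GPU power.
# c_eff and static_w are made-up constants for illustration.

def perf_per_watt(voltage, freq_ghz, static_w=20.0, c_eff=30.0):
    """Throughput proxy (ops scale with frequency) over total power."""
    dynamic_w = c_eff * voltage**2 * freq_ghz  # dynamic power ~ C * V^2 * f
    return freq_ghz / (dynamic_w + static_w)

default = perf_per_watt(voltage=1.00, freq_ghz=0.705)  # stock-like settings
tuned = perf_per_watt(voltage=0.90, freq_ghz=0.758)    # undervolted, overclocked
print(f"improvement: {100 * (tuned / default - 1):.1f}%")
```

Because voltage enters the dynamic term quadratically, even a modest undervolt buys a disproportionate power saving, which is why the combined undervolt-plus-overclock point comes out ahead in this sketch just as it does in the paper's measurements.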