scispace - formally typeset
Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
18 Apr 2016
TL;DR: The CUDA version proved both more power-efficient and faster than a single-threaded CPU version; however, further tests should compare an even more optimized CUDA version against a multithreaded CPU implementation to cover the whole spectrum and to achieve better performance per watt.
Abstract: This paper presents a power- and execution-time-efficient implementation of a highly adaptive lossless image compression method based on predictor classification and blending, denoted the CBPC coder. Power efficiency is becoming increasingly important in both datacenters and consumer electronics, so we optimized both the power efficiency and the throughput of the CBPC coder on a CPU and a GPU, using CUDA for the latter. Tests were conducted using mainstream components in a desktop PC and high-end components in a server. The CUDA version proved both more power-efficient and faster than a single-threaded CPU version; however, further tests should compare an even more optimized CUDA version against a multithreaded CPU implementation to cover the whole spectrum and to achieve better performance per watt for both the consumer desktop and the server system running the method. Finally, we demonstrate the benefits of the GPGPU approach for compute-intensive, fine-grained data-parallel parts of the algorithm, notably the predictor classification and blending computations.
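Performance per watt, the metric by which the CPU and CUDA versions are compared, is simply useful throughput divided by power draw. A minimal sketch in Python, with made-up throughput and power figures (illustrative assumptions, not measurements from the paper), shows how such a comparison works:

```python
def perf_per_watt(throughput_mbps, power_watts):
    """Performance per watt: useful work per unit of power consumed."""
    return throughput_mbps / power_watts

# Hypothetical figures (not from the paper): a single-threaded CPU run
# of the compressor vs. a CUDA run on a GPU that draws more power.
cpu = perf_per_watt(throughput_mbps=40.0, power_watts=65.0)
gpu = perf_per_watt(throughput_mbps=320.0, power_watts=180.0)

print(f"CPU: {cpu:.2f} MB/s per watt")
print(f"GPU: {gpu:.2f} MB/s per watt")
print(f"GPU is {gpu / cpu:.1f}x more power-efficient")
```

On these assumed numbers the GPU wins on efficiency despite drawing more power, because its throughput advantage is larger; the paper's point is that the real verdict requires measuring both fully optimized implementations.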

2 citations

DissertationDOI
01 Jan 2015
TL;DR: The results show that a global address space is best for applications that require small, non-blocking, and irregular data transfers, and that GPU-optimized communication models can reach 10-50% better energy efficiency than a hybrid model with CPU-controlled communication.
Abstract: Today, GPUs and other parallel accelerators are widely used in high performance computing, due to their high computational power and high performance per watt. Still, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This not only affects performance, but also power consumption. Often, a data transfer between two distributed GPUs even requires intermediate copies in host memory. This overhead penalizes small data movements and synchronization operations. In this work, different communication methods for distributed GPUs are implemented and evaluated. First, a new technique, called GPUDirect RDMA, is implemented for the Extoll device and evaluated. The performance results show that this technique brings benefits for small- and medium-sized data transfers, but for larger transfer sizes a staged protocol is preferable, since the PCIe bus does not support peer-to-peer data transfers well. In the next step, GPUs are integrated into the one-sided communication library GPI-2. Since this interface was designed for heterogeneous memory structures, it allows an easy integration of GPUs. The performance results show that one-sided communication for GPUs brings some performance benefits compared to two-sided communication, which is the current state of the art. However, using GPI-2 for communication still requires a host thread to control GPU-related communication, although the data is transferred directly between the GPUs without any host copies. Therefore, the subsequent part of the work analyzes GPU-controlled communication. First, a put/get communication interface for the GPU, based on InfiniBand verbs, is implemented. This interface enables the GPU to independently source and synchronize communication requests without any involvement of the CPU.
However, the InfiniBand verbs protocol adds a lot of sequential overhead to the communication, so the performance of GPU-controlled put/get communication falls far behind that of CPU-controlled put/get communication. Another problem is intra-GPU synchronization, since GPU blocks are non-preemptive: issuing communication requests within a GPU can easily result in a deadlock. Dynamic parallelism solves this problem. Although applications using GPU-controlled communication still perform slightly worse than hybrid applications, the performance per watt increases, since the CPU can be relieved of the communication work. As a communication model more in line with the massive parallelism of GPUs, the performance of a hardware-supported global address space for GPUs is evaluated. This global address space allows communication with simple load and store instructions, which can be performed by multiple threads in parallel. With this method, the latency of a GPU-to-GPU data transfer can be reduced to 3 µs, using an FPGA. The results show that a global address space is best for applications that require small, non-blocking, and irregular data transfers. Its main limitation, however, is that it does not allow overlapping of communication and computation, which put/get communication does. Overall, by using GPU-optimized communication models, depending on the application, between 10 and 50% better energy efficiency can be reached than with a hybrid model using CPU-controlled communication.
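The thesis's energy argument can be captured with a simple power-times-time model: GPU-controlled communication may run slightly slower, yet still save energy because the CPU sits idle instead of actively driving the transfers. The power and runtime figures below are illustrative assumptions, not measurements from the thesis:

```python
def energy_joules(power_watts, seconds):
    """Energy consumed at a constant power draw over a time interval."""
    return power_watts * seconds

# Assumed component powers (illustrative only).
P_CPU_ACTIVE = 90.0   # W, CPU thread busy driving communication
P_CPU_IDLE = 25.0     # W, CPU relieved of communication work
P_GPU = 200.0         # W, GPU busy in both models

runtime_hybrid = 10.0     # s, CPU-controlled (hybrid) run
runtime_gpu_ctrl = 10.5   # s, GPU-controlled run, slightly slower

hybrid = energy_joules(P_CPU_ACTIVE + P_GPU, runtime_hybrid)
gpu_ctrl = energy_joules(P_CPU_IDLE + P_GPU, runtime_gpu_ctrl)
saving = 1 - gpu_ctrl / hybrid

print(f"hybrid: {hybrid:.0f} J, GPU-controlled: {gpu_ctrl:.0f} J "
      f"({saving:.0%} less energy)")
```

With these assumed numbers the GPU-controlled run uses about 19% less energy despite its longer runtime, which falls inside the 10-50% range the thesis reports across applications.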

2 citations

Proceedings ArticleDOI
S. Pawlowski1
10 Feb 2007
TL;DR: Intel Senior Fellow and Chief Technology Officer of Intel's Digital Enterprise Group, Steve Pawlowski, will provide his technology vision, insight and research challenges to achieve the vision of Petascale computing and beyond.
Abstract: Summary form only given. Future high performance computing will undoubtedly reach Petascale and beyond. Today's HPC is tomorrow's personal computing. What are the evolving multi-core and many-core processor architectures for the best performance per watt, the memory bandwidth solutions to feed ever more powerful processors, and the intra-chip interconnect options for optimal bandwidth versus power? With Moore's Law continuing to prove its viability and transistor geometries continuing to shrink, improving reliability is even more challenging. Intel Senior Fellow and Chief Technology Officer of Intel's Digital Enterprise Group, Steve Pawlowski, will provide his technology vision, insight, and research challenges to achieve the vision of Petascale computing and beyond.

2 citations

Proceedings ArticleDOI
01 Jul 2013
TL;DR: In an HPC era attracting notice to energy efficiency, a promising performance per watt ratio motivates the usage of accelerators like GPGPUs or Intel's Xeon Phi in today's heterogeneous computer systems.
Abstract: In an HPC era attracting notice to energy efficiency, a promising performance-per-watt ratio motivates the use of accelerators such as GPGPUs or Intel's Xeon Phi in today's heterogeneous computer systems. However, these heterogeneous architectures are often far more complex to program and deploy.

2 citations

Proceedings ArticleDOI
21 May 2018
TL;DR: Inspired by the reduction operation in frequent pattern compression, this work transforms the function into an OpenCL kernel and describes the optimizations of the kernel on an Arria 10-based FPGA platform as a case study, finding that automatic kernel vectorization does not improve the kernel performance.
Abstract: Field-programmable gate arrays (FPGAs) are becoming a promising heterogeneous computing component in high-performance computing. To facilitate the use of FPGAs by developers and researchers, high-level synthesis tools are raising the abstraction of FPGA-based design from the register-transfer level to a high-level language design flow using OpenCL/C/C++. Currently, there are few studies on parallel reduction using atomic functions in the OpenCL-based design flow on an FPGA. Inspired by the reduction operation in frequent pattern compression, we transform the function into an OpenCL kernel and describe the optimizations of the kernel on an Arria 10-based FPGA platform as a case study. We find that automatic kernel vectorization does not improve the kernel performance; users can manually vectorize the kernel to achieve a speedup. Overall, our optimizations improve the kernel performance by a factor of 11.9 over the baseline kernel. The performance per watt of the kernel on an Intel Arria 10 GX1150 FPGA is 5.3X that of an Intel Xeon 16-core CPU but 0.625X that of an Nvidia K80 GPU.
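The two reported ratios imply a third one: if the FPGA delivers 5.3X the performance per watt of the Xeon but only 0.625X that of the K80, the K80's performance per watt relative to the Xeon follows by division. A quick check in Python, using only the ratios stated in the abstract:

```python
# Performance-per-watt ratios reported in the abstract.
fpga_vs_cpu = 5.3    # Arria 10 GX1150 relative to 16-core Xeon
fpga_vs_gpu = 0.625  # Arria 10 GX1150 relative to Nvidia K80

# Implied K80-vs-Xeon ratio: (fpga/cpu) / (fpga/gpu) = gpu/cpu.
gpu_vs_cpu = fpga_vs_cpu / fpga_vs_gpu
print(f"Implied K80 vs Xeon perf/watt: {gpu_vs_cpu:.2f}x")  # 8.48x
```

So by the paper's own numbers, the K80 is roughly 8.5X more power-efficient than the Xeon on this kernel, which puts the FPGA result between the two devices rather than ahead of both.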

2 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations, 81% related
Benchmark (computing): 19.6K papers, 419.1K citations, 80% related
Programming paradigm: 18.7K papers, 467.9K citations, 77% related
Compiler: 26.3K papers, 578.5K citations, 77% related
Scalability: 50.9K papers, 931.6K citations, 76% related
Performance Metrics
No. of papers in the topic in previous years:
Year   Papers
2021   14
2020   15
2019   15
2018   36
2017   25
2016   31