Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
26 May 2014
TL;DR: Global GPU Address Spaces (GGAS) enable direct GPU-to-GPU communication for heterogeneous clusters, which is completely in line with the GPU's thread-collective execution model and does not require CPU assistance or staging copies in host memory.
Abstract: GPUs have gained great popularity in high-performance computing due to their massive parallelism and high performance per watt. Despite this popularity, data transfer between multiple GPUs in a cluster remains a problem: most communication models require the CPU to control the data flow, and intermediate staging copies to host memory are often inevitable. These two facts lead to higher CPU and memory utilization; as a result, overall performance decreases and power consumption increases.

Collective operations like reduce and allreduce are very common in scientific simulations and also very performance-sensitive. Due to their massive parallelism, GPUs are well suited to such operations, but they only excel in performance if they can process the problem in-core. Global GPU Address Spaces (GGAS) enable direct GPU-to-GPU communication for heterogeneous clusters, which is completely in line with the GPU's thread-collective execution model and does not require CPU assistance or staging copies in host memory. As we will see, GGAS helps to process collective operations among distributed GPUs in-core.

In this paper, we introduce the implementation and optimization of collective reduce and allreduce operations using GGAS as a communication model. Compared to message passing, we achieve a speedup of 1.7x for small data sizes. A detailed analysis based on power measurements of the CPU, host memory, and GPU reveals that GGAS as a communication model not only saves cycles but also dramatically reduces power and energy consumption. For instance, for an allreduce operation, half of the energy can be saved through the reduced power consumption in combination with the lower run time.
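The reported halving of energy follows from the identity energy = power × time. A minimal sketch of that arithmetic in Python, using an illustrative power factor (not a measured value from the paper) chosen so that, combined with the reported 1.7x speedup, the energy comes out at roughly half:

```python
# Illustrative energy arithmetic for a GGAS allreduce vs. message passing.
# E = P * t. The power factor below is a hypothetical placeholder, not a
# measurement from the paper.

baseline_power_w = 300.0                   # hypothetical CPU + host memory + GPU power
baseline_runtime_s = 1.0                   # normalized runtime (message passing)

ggas_power_w = baseline_power_w * 0.85     # assumed ~15% lower power (no CPU control,
                                           # no staging copies in host memory)
ggas_runtime_s = baseline_runtime_s / 1.7  # reported 1.7x speedup for small data sizes

energy_ratio = (ggas_power_w * ggas_runtime_s) / (baseline_power_w * baseline_runtime_s)
print(f"energy ratio: {energy_ratio:.2f}")  # ~0.50, i.e. about half the energy
```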

21 citations

Proceedings ArticleDOI
29 Sep 2013
TL;DR: A novel approach designed for simplifying the programming of EMM many-core architectures is presented, which takes a high-level description of the computation kernel algorithm and generates an OpenCL kernel optimized for the target architecture, while managing the parallelization and data movements across the hierarchy in a transparent fashion.
Abstract: Explicitly managed memory many-cores (EMM) have been a part of the industrial landscape for the last decade. The IBM CELL processor, general-purpose graphics processing units (GP-GPU), and the STHORM embedded many-core of STMicroelectronics are representative examples. This class of architecture is expected to scale well and to deliver good performance per watt and per mm² of silicon. As such, it is appealing for application problems with regular data access patterns. However, it moves significant complexity to the programmer, who must master parallelization and data movement. High-level programming tools are therefore essential to make EMM many-cores effectively programmable by a wide class of programmers. This paper presents a novel approach designed to simplify the programming of EMM many-core architectures. It initially addresses the image processing application domain and has been targeted to the STHORM platform. It takes a high-level description of the computation kernel algorithm and generates an OpenCL kernel optimized for the target architecture, while managing the parallelization and data movements across the hierarchy in a transparent fashion. The goal is to provide both high productivity and high performance without requiring parallel computing expertise from the programmer or specialization of the application code for the target architecture.
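To make the programming burden concrete, the following Python sketch imitates the kind of code such a tool spares the programmer from writing: tiling an image-processing kernel and explicitly staging each tile through a small scratchpad memory. All names, sizes, and the toy blur kernel are illustrative assumptions, not the paper's API or the STHORM programming model.

```python
# Hedged sketch of the explicit tiling and data staging an EMM many-core
# requires, and which the paper's tool generates automatically as
# optimized OpenCL. All names, tile sizes, and the toy kernel are
# illustrative placeholders.
import numpy as np

TILE = 64  # hypothetical tile size matching a per-cluster scratchpad

def blur_tile(tile: np.ndarray) -> np.ndarray:
    """Toy compute kernel: 3x3 box blur on one tile, edges clamped."""
    h, w = tile.shape
    padded = np.pad(tile, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
    return out / 9.0

def run_on_emm(image: np.ndarray) -> np.ndarray:
    """Stage each tile into a 'scratchpad', compute, and copy back.
    On real hardware the copies would be DMA transfers between global
    memory and per-cluster local memory; a real kernel would also stage
    halo pixels from neighboring tiles, omitted here for brevity."""
    h, w = image.shape
    result = np.empty((h, w), dtype=np.float32)
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            scratchpad = image[y:y + TILE, x:x + TILE].copy()       # "DMA in"
            result[y:y + TILE, x:x + TILE] = blur_tile(scratchpad)  # compute + "DMA out"
    return result
```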

21 citations

Journal ArticleDOI
TL;DR: An analytical estimation model for performance and power using different server processor microarchitecture parameters is implemented and verified to estimate power and performance with less than 10% error deviation.
Abstract: Given the rapid expansion of cloud computing in the past few years, there is a pressing need to analyze and characterize the performance and power consumption of the cloud workloads running on backend servers. In this research, we focus on the Hadoop framework and Memcached, which are distributed frameworks for processing large-scale, data-intensive applications for different purposes. Hadoop, a popular open-source implementation of MapReduce for the analysis of large datasets, is used for short jobs requiring low response time, while Memcached is a high-performance distributed memory-object caching system that can speed up the throughput of web applications by reducing database load bottlenecks. In this paper, we characterize different workloads running on the Hadoop framework and Memcached for different processor configurations and microarchitecture parameters. We implement an analytical estimation model for performance and power using different server processor microarchitecture parameters. The proposed model analytically scales microarchitecture parameters such as CPI with respect to processor core frequency. We also propose an analytical model to estimate how power consumption scales with processor core frequency. The combination of the performance and power models enables the estimation of performance per watt for different cloud benchmarks. The proposed estimation models are verified to estimate power and performance with less than 10% error deviation.
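The abstract does not reproduce the model's equations, but the general shape of such a frequency-scaling estimate is easy to sketch. The version below assumes, hypothetically, that CPI splits into a frequency-independent core component and a memory-stall component that costs more cycles as frequency rises, and that power follows the textbook P = C·V²·f + P_static form; all coefficients are placeholders, not the paper's fitted values:

```python
# Hedged sketch of an analytical performance-per-watt estimate under core-
# frequency scaling. The structure is a common textbook model, not the
# paper's actual fitted equations; every coefficient is a placeholder.

def cpi(freq_ghz: float, cpi_core: float = 0.6, mem_stall_ns: float = 0.9) -> float:
    """Cycles per instruction: core cycles stay fixed, while a fixed
    memory-stall time in ns costs more cycles at higher frequency."""
    return cpi_core + mem_stall_ns * freq_ghz

def perf_mips(freq_ghz: float) -> float:
    """Instruction throughput ~ f / CPI (single core, in MIPS)."""
    return freq_ghz * 1e3 / cpi(freq_ghz)

def power_w(freq_ghz: float, c_eff: float = 9.0, p_static: float = 12.0) -> float:
    """P = C_eff * V^2 * f + P_static, with a toy linear V-f relation."""
    v = 0.7 + 0.15 * freq_ghz  # hypothetical voltage/frequency curve
    return c_eff * v * v * freq_ghz + p_static

for f in (1.2, 2.0, 2.8, 3.6):
    ppw = perf_mips(f) / power_w(f)
    print(f"{f:.1f} GHz: perf={perf_mips(f):7.1f} MIPS, "
          f"power={power_w(f):5.1f} W, perf/W={ppw:6.1f}")
```

Under these placeholder coefficients, performance per watt falls as frequency rises, which is the kind of trade-off such a combined model is meant to expose.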

21 citations

Journal ArticleDOI
TL;DR: Evaluation on real asymmetric multicore (AMP) hardware, using scheduler implementations in the Linux kernel, demonstrates that ACFS achieves an average 23% fairness improvement over two state-of-the-art schemes while providing higher system throughput.

21 citations

Proceedings ArticleDOI
22 Feb 2009
TL;DR: This poster presents the analyses and optimizations the CHiMPS compiler uses to construct many-cache caches, and details the cache parameters on a Xilinx Virtex-5 LX110T FPGA.
Abstract: CHiMPS is a C-based compiler for high-performance computing (HPC) on heterogeneous CPU-FPGA computing platforms. CHiMPS efficiently supports random accesses to main memory through the many-cache memory model, enabling a broader range of applications to take advantage of FPGA-based acceleration. Many-cache creates multiple caches on top of an FPGA's small, independent memories, each targeting a particular data structure or region of memory in an application and each customized for the memory operations that access it. This poster presents the analyses and optimizations the CHiMPS compiler uses to construct many-cache caches, and details the cache parameters on a Xilinx Virtex-5 LX110T FPGA. Detailed simulation results on HPC kernels demonstrate a 7.8x (geometric mean) performance boost over CPU-only execution of the same source code, FPGA power usage that is on average 4.1x lower, and consequently performance per watt that is greater by a geometric mean of 21.3x.
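The three headline numbers are tied together by a simple identity: for a single kernel, the performance-per-watt gain is the speedup multiplied by the power reduction. A minimal sketch with hypothetical per-kernel values (the paper's aggregates are means over a set of kernels, so the reported 7.8x, 4.1x, and 21.3x figures need not multiply exactly):

```python
# Per-kernel relation between speedup, power reduction, and perf/W gain.
# The example values are hypothetical; the paper reports means across a
# set of HPC kernels.

def perf_per_watt_gain(speedup: float, power_reduction: float) -> float:
    """perf/W gain = (perf_fpga / perf_cpu) * (P_cpu / P_fpga)."""
    return speedup * power_reduction

# e.g. a kernel running 6x faster while drawing 3.5x less power:
print(perf_per_watt_gain(6.0, 3.5))  # -> 21.0x better performance per watt
```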

20 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31