Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
26 May 2014
TL;DR: Global GPU Address Spaces (GGAS) enable direct GPU-to-GPU communication for heterogeneous clusters, which is completely in line with the GPU's thread-collective execution model and does not require CPU assistance or staging copies in host memory.
Abstract: GPUs have gained great popularity in high-performance computing due to their massive parallelism and high performance per watt. Despite this popularity, data transfer between multiple GPUs in a cluster remains a problem: most communication models require the CPU to control the data flow, and intermediate staging copies to host memory are often inevitable. These two facts lead to higher CPU and memory utilization; as a result, overall performance decreases and power consumption increases.

Collective operations like reduce and allreduce are very common in scientific simulations and also very performance-sensitive. Due to their massive parallelism, GPUs are well suited to such operations, but they only excel in performance if they can process the problem in-core. Global GPU Address Spaces (GGAS) enable direct GPU-to-GPU communication for heterogeneous clusters, which is completely in line with the GPU's thread-collective execution model and does not require CPU assistance or staging copies in host memory. As we will see, GGAS helps to process collective operations among distributed GPUs in-core.

In this paper, we introduce the implementation and optimization of collective reduce and allreduce operations using GGAS as a communication model. Compared to message passing, we achieve a speedup of 1.7x for small data sizes. A detailed analysis based on power measurements of the CPU, host memory, and GPU reveals that GGAS as a communication model not only saves cycles but also dramatically reduces power and energy consumption. For instance, for an allreduce operation, half of the energy can be saved through the reduced power consumption in combination with the lower run time.
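The reported halving of energy follows from the identity energy = power × time. A minimal sketch of that arithmetic in Python, using an illustrative power factor (not a measured value from the paper) chosen so that, combined with the reported 1.7x speedup, the energy comes out at roughly half:

```python
# Illustrative energy arithmetic for a GGAS allreduce vs. message passing.
# E = P * t. The power factor below is a hypothetical placeholder, not a
# measurement from the paper.

baseline_power_w = 300.0                   # hypothetical CPU + host memory + GPU power
baseline_runtime_s = 1.0                   # normalized runtime (message passing)

ggas_power_w = baseline_power_w * 0.85     # assumed ~15% lower power (no CPU control,
                                           # no staging copies in host memory)
ggas_runtime_s = baseline_runtime_s / 1.7  # reported 1.7x speedup for small data sizes

energy_ratio = (ggas_power_w * ggas_runtime_s) / (baseline_power_w * baseline_runtime_s)
print(f"energy ratio: {energy_ratio:.2f}")  # ~0.50, i.e. about half the energy
```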

21 citations

Proceedings ArticleDOI
29 Sep 2013
TL;DR: A novel approach designed for simplifying the programming of EMM many-core architectures is presented, which takes a high-level description of the computation kernel algorithm and generates an OpenCL kernel optimized for the target architecture, while managing the parallelization and data movements across the hierarchy in a transparent fashion.
Abstract: Explicitly managed memory many-cores (EMM) have been a part of the industrial landscape for the last decade. The IBM CELL processor, general-purpose graphics processing units (GP-GPU), and the STHORM embedded many-core of STMicroelectronics are representative examples. This class of architecture is expected to scale well and to deliver good performance per watt and per mm² of silicon. As such, it is appealing for application problems with regular data access patterns. However, it moves significant complexity to the programmer, who must master parallelization and data movement. High-level programming tools are therefore essential to make EMM many-cores effectively programmable by a wide class of programmers. This paper presents a novel approach designed to simplify the programming of EMM many-core architectures. It initially addresses the image processing application domain and has been targeted to the STHORM platform. It takes a high-level description of the computation kernel algorithm and generates an OpenCL kernel optimized for the target architecture, while managing the parallelization and data movements across the hierarchy in a transparent fashion. The goal is to provide both high productivity and high performance without requiring parallel computing expertise from the programmer or specialization of the application code for the target architecture.
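To make the programming burden concrete, the following Python sketch imitates the kind of code such a tool spares the programmer from writing: tiling an image-processing kernel and explicitly staging each tile through a small scratchpad memory. All names, sizes, and the toy blur kernel are illustrative assumptions, not the paper's API or the STHORM programming model.

```python
# Hedged sketch of the explicit tiling and data staging an EMM many-core
# requires, and which the paper's tool generates automatically as
# optimized OpenCL. All names, tile sizes, and the toy kernel are
# illustrative placeholders.
import numpy as np

TILE = 64  # hypothetical tile size matching a per-cluster scratchpad

def blur_tile(tile: np.ndarray) -> np.ndarray:
    """Toy compute kernel: 3x3 box blur on one tile, edges clamped."""
    h, w = tile.shape
    padded = np.pad(tile, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
    return out / 9.0

def run_on_emm(image: np.ndarray) -> np.ndarray:
    """Stage each tile into a 'scratchpad', compute, and copy back.
    On real hardware the copies would be DMA transfers between global
    memory and per-cluster local memory; a real kernel would also stage
    halo pixels from neighboring tiles, omitted here for brevity."""
    h, w = image.shape
    result = np.empty((h, w), dtype=np.float32)
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            scratchpad = image[y:y + TILE, x:x + TILE].copy()       # "DMA in"
            result[y:y + TILE, x:x + TILE] = blur_tile(scratchpad)  # compute + "DMA out"
    return result
```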

21 citations

Journal ArticleDOI
TL;DR: An analytical estimation model for performance and power using different server processor microarchitecture parameters is implemented and verified to estimate power and performance with less than 10% error deviation.
Abstract: Given the rapid expansion of cloud computing in the past few years, there is a pressing need to analyze and characterize the performance and power consumption of the cloud workloads running on backend servers. In this research, we focus on the Hadoop framework and Memcached, which are distributed frameworks for processing large-scale, data-intensive applications for different purposes. Hadoop, a popular open-source implementation of MapReduce for the analysis of large datasets, is used for short jobs requiring low response time, while Memcached is a high-performance distributed memory-object caching system that can speed up the throughput of web applications by reducing database load bottlenecks. In this paper, we characterize different workloads running on the Hadoop framework and Memcached for different processor configurations and microarchitecture parameters. We implement an analytical estimation model for performance and power using different server processor microarchitecture parameters. The proposed model analytically scales microarchitecture parameters such as CPI with respect to processor core frequency. We also propose an analytical model to estimate how power consumption scales with processor core frequency. The combination of the performance and power models enables the estimation of performance per watt for different cloud benchmarks. The proposed estimation models are verified to estimate power and performance with less than 10% error deviation.
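The abstract does not reproduce the model's equations, but the general shape of such a frequency-scaling estimate is easy to sketch. The version below assumes, hypothetically, that CPI splits into a frequency-independent core component and a memory-stall component that costs more cycles as frequency rises, and that power follows the textbook P = C·V²·f + P_static form; all coefficients are placeholders, not the paper's fitted values:

```python
# Hedged sketch of an analytical performance-per-watt estimate under core-
# frequency scaling. The structure is a common textbook model, not the
# paper's actual fitted equations; every coefficient is a placeholder.

def cpi(freq_ghz: float, cpi_core: float = 0.6, mem_stall_ns: float = 0.9) -> float:
    """Cycles per instruction: core cycles stay fixed, while a fixed
    memory-stall time in ns costs more cycles at higher frequency."""
    return cpi_core + mem_stall_ns * freq_ghz

def perf_mips(freq_ghz: float) -> float:
    """Instruction throughput ~ f / CPI (single core, in MIPS)."""
    return freq_ghz * 1e3 / cpi(freq_ghz)

def power_w(freq_ghz: float, c_eff: float = 9.0, p_static: float = 12.0) -> float:
    """P = C_eff * V^2 * f + P_static, with a toy linear V-f relation."""
    v = 0.7 + 0.15 * freq_ghz  # hypothetical voltage/frequency curve
    return c_eff * v * v * freq_ghz + p_static

for f in (1.2, 2.0, 2.8, 3.6):
    ppw = perf_mips(f) / power_w(f)
    print(f"{f:.1f} GHz: perf={perf_mips(f):7.1f} MIPS, "
          f"power={power_w(f):5.1f} W, perf/W={ppw:6.1f}")
```

Under these placeholder coefficients, performance per watt falls as frequency rises, which is the kind of trade-off such a combined model is meant to expose.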

21 citations

Journal ArticleDOI
TL;DR: Evaluation on real asymmetric multicore (AMP) hardware, using scheduler implementations in the Linux kernel, demonstrates that ACFS achieves an average 23% fairness improvement over two state-of-the-art schemes while providing higher system throughput.

21 citations

Proceedings ArticleDOI
22 Feb 2009
TL;DR: This poster presents the analyses and optimizations the CHiMPS compiler uses to construct many-cache caches, and details the cache parameters on a Xilinx Virtex-5 LX110T FPGA.
Abstract: CHiMPS is a C-based compiler for high-performance computing (HPC) on heterogeneous CPU-FPGA computing platforms. CHiMPS efficiently supports random accesses to main memory through the many-cache memory model, enabling a broader range of applications to take advantage of FPGA-based acceleration. Many-cache creates multiple caches on top of an FPGA's small, independent memories, each targeting a particular data structure or region of memory in an application and each customized for the memory operations that access it. This poster presents the analyses and optimizations the CHiMPS compiler uses to construct many-cache caches, and details the cache parameters on a Xilinx Virtex-5 LX110T FPGA. Detailed simulation results on HPC kernels demonstrate a 7.8x (geometric mean) performance boost over CPU-only execution of the same source code, FPGA power usage that is on average 4.1x lower, and consequently performance per watt that is greater by a geometric mean of 21.3x.
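The three headline numbers are tied together by a simple identity: for a single kernel, the performance-per-watt gain is the speedup multiplied by the power reduction. A minimal sketch with hypothetical per-kernel values (the paper's aggregates are means over a set of kernels, so the reported 7.8x, 4.1x, and 21.3x figures need not multiply exactly):

```python
# Per-kernel relation between speedup, power reduction, and perf/W gain.
# The example values are hypothetical; the paper reports means across a
# set of HPC kernels.

def perf_per_watt_gain(speedup: float, power_reduction: float) -> float:
    """perf/W gain = (perf_fpga / perf_cpu) * (P_cpu / P_fpga)."""
    return speedup * power_reduction

# e.g. a kernel running 6x faster while drawing 3.5x less power:
print(perf_per_watt_gain(6.0, 3.5))  # -> 21.0x better performance per watt
```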

20 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31