Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published on this topic, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
18 Jun 2018
TL;DR: This paper highlights performance issues of the Linux tap bridge with KVM that can easily be overcome by using a user-space virtual switch such as VOSYSwitch or OVS/DPDK, and showcases several shortcomings of unikernels on ARM.
Abstract: The Network Functions Virtualization (NFV) paradigm has emerged as a new concept in networking that aims at cost reduction and ease of network scalability by leveraging virtualization technologies and commercial off-the-shelf hardware to decouple the software implementation of network functions from the underlying hardware. Recently, lightweight virtualization techniques have emerged as efficient alternatives to traditional Virtual Network Functions (VNFs) deployed as VMs. At the same time, ARMv8 servers are gaining traction in the server world, mostly because of their attractive performance-per-watt characteristics. In this paper, the CPU, memory, and Input/Output (I/O) performance of such lightweight techniques is compared with that of classic virtual machines on both x86 and ARMv8 platforms. In particular, we selected KVM as the hypervisor, Docker and rkt as container engines, and Rumprun and OSv as unikernels. On x86, our results for CPU- and memory-bound workloads show slightly better performance for containers and unikernels, and both perform almost twice as fast as KVM for network I/O operations. This highlights performance issues of the Linux tap bridge with KVM, which can easily be overcome by using a user-space virtual switch such as VOSYSwitch or OVS/DPDK. On ARM, KVM and containers produce similar results for CPU and memory workloads, with the exception of network I/O operations, where KVM proves to be the fastest. We also showcase several shortcomings of unikernels on ARM, which account for their lack of stable support for this architecture.
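As a rough illustration of the kind of comparison this paper performs, the sketch below times the same CPU-bound workload on the bare host and inside a Docker container. The image name and the workload itself are placeholder assumptions, not the paper's benchmark suite; a full study would add KVM guests, unikernels, I/O benchmarks, and power measurement to derive performance per watt.

```python
# Minimal sketch (not the paper's harness): time one CPU-bound workload
# on the bare host and inside a Docker container, then compare.
# Assumes Docker is installed and the "python:3.12-slim" image is pullable.
import subprocess
import time

# Placeholder workload: sum the first 10 million squares in pure Python.
WORKLOAD = "python3 -c \"print(sum(i*i for i in range(10_000_000)))\""

def timed(cmd: list[str]) -> float:
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

host_s = timed(["sh", "-c", WORKLOAD])
container_s = timed(["docker", "run", "--rm", "python:3.12-slim",
                     "sh", "-c", WORKLOAD])

print(f"host:      {host_s:.2f} s")
print(f"container: {container_s:.2f} s  (overhead {container_s / host_s:.2f}x)")
```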

9 citations

Journal ArticleDOI
TL;DR: This work studies architectural features of the Tesla C870 GPU and the Cell BE, evaluates the effectiveness of architecture-specific optimizations and parallelization strategies for ITM on these platforms, and observes that the GPU delivers better performance than the Cell BE in both total execution time and performance per watt.
Abstract: The Irregular Terrain Model (ITM), also known as the Longley-Rice model, predicts the long-range average transmission loss of a radio signal based on atmospheric and geographic conditions. Because variable terrain effects and constantly changing atmospheric conditions can dramatically influence radio-wave propagation, there is a pressing need for computational resources capable of running hundreds of thousands of transmission-loss calculations per second. Multicore processors, like the NVIDIA Graphics Processing Unit (GPU) and the IBM Cell Broadband Engine (BE), offer improved performance over mainstream microprocessors for ITM. We study architectural features of the Tesla C870 GPU and the Cell BE and evaluate the effectiveness of architecture-specific optimizations and parallelization strategies for ITM on these platforms. We assess GPU implementations that utilize both global and shared memories along with fine-grained parallelism, and Cell BE implementations that utilize direct memory access, double buffering, and SIMDization. With these optimization strategies, we achieve under a second of computation time on each platform, which is not feasible with a general-purpose processor, and we observe that the GPU outperforms the Cell BE in total execution time and performance per watt by factors of 2.3x and 1.6x, respectively.
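Performance per watt itself is simply sustained throughput divided by power draw. As a hedged worked example (the runtimes and board powers below are invented placeholders, chosen only so the two ratios reproduce the 2.3x and 1.6x factors the abstract reports), the sketch computes both metrics:

```python
# Illustrative only: these runtimes and power draws are made-up
# placeholders, not measurements from the paper.
gpu = {"time_s": 0.30, "power_w": 170.0}   # hypothetical Tesla C870 run
cell = {"time_s": 0.69, "power_w": 118.0}  # hypothetical Cell BE run

calcs = 1_000_000  # hypothetical number of transmission-loss calculations

def perf_per_watt(run: dict) -> float:
    """Throughput (calculations/second) divided by power draw (watts)."""
    return (calcs / run["time_s"]) / run["power_w"]

speedup = cell["time_s"] / gpu["time_s"]
ppw_ratio = perf_per_watt(gpu) / perf_per_watt(cell)
print(f"GPU speedup over Cell BE:   {speedup:.1f}x")   # 2.3x
print(f"GPU perf/watt over Cell BE: {ppw_ratio:.1f}x") # 1.6x
```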

9 citations

Journal ArticleDOI
TL;DR: Analytical models based on scaled power metrics are presented to analyze the impact of various architectural design choices on scaled performance and power savings; the analysis shows that choosing the optimal chip configuration can considerably increase energy efficiency and energy savings.
Abstract: Many-core processors are accelerating the performance of contemporary high-performance systems. Managing power consumption within these systems demands low-power architectures to increase power savings. One of the promising solutions offered today by microprocessor architects is the asymmetric microprocessor, which integrates different core architectures on a single die. This paper presents analytical models based on scaled power metrics to analyze the impact of various architectural design choices on scaled performance and power savings. The power-consumption implications of different processing schemes and various chip configurations are also analyzed. The analysis shows that by choosing the optimal chip configuration, energy efficiency and energy savings can be increased considerably.
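The paper's exact models are not reproduced here, but a minimal sketch of this style of analysis, assuming the common Hill-and-Marty extension of Amdahl's law plus a toy per-core power term (all formulas and constants below are assumptions, not the paper's equations), looks like:

```python
# Hedged sketch of an Amdahl-style model for an asymmetric chip: one big
# core built from r base-core equivalents plus n - r small cores.
def asymmetric_speedup(f: float, n: int, r: int,
                       perf=lambda r: r ** 0.5) -> float:
    """Hill-Marty speedup; f is the parallel fraction of the workload,
    perf(r) is the assumed sequential performance of an r-sized big core."""
    serial = (1 - f) / perf(r)
    parallel = f / (perf(r) + (n - r))
    return 1.0 / (serial + parallel)

def chip_power(n: int, r: int, alpha: float = 1.5,
               small_w: float = 1.0) -> float:
    """Toy power model: the big core draws r**alpha units (superlinear in
    its size), each small core draws one unit; all values are assumed."""
    return r ** alpha * small_w + (n - r) * small_w

# Sweep the big-core size and report the best performance-per-watt point.
N, F = 64, 0.95
best = max(range(1, N + 1),
           key=lambda r: asymmetric_speedup(F, N, r) / chip_power(N, r))
print(f"best big-core size r = {best}, perf/watt = "
      f"{asymmetric_speedup(F, N, best) / chip_power(N, best):.2f}")
```

Under these assumed constants the sweep lands on a moderately sized big core: a larger core shrinks the serial term but its superlinear power draw eventually erodes performance per watt, which is exactly the kind of trade-off such models expose.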

9 citations

Proceedings ArticleDOI
11 Oct 2018
TL;DR: This paper builds Knots, a GPU-aware resource orchestration layer that enables the resource scheduler to take advantage of GPUs by knowing their real-time utilization, discusses the ideal scheduler properties for a GPU-rich datacenter, and lists the challenges in developing such a production-grade GPU-based datacenter scheduler.
Abstract: Modern data centers are increasingly being provisioned with compute accelerators such as GPUs, FPGAs, and ASICs to keep up with workload performance demands and reduce the total cost of ownership (TCO). By 2021, traffic within hyperscale datacenters is expected to quadruple, with 94% of workloads moving to cloud-based datacenters according to Cisco's global cloud index. A majority of these workloads, including data mining, image processing, speech recognition, and gaming, use GPUs for high-throughput computing. This trend is evident as public cloud operators like Amazon and Microsoft have recently started to offer GPU-based infrastructure services.

GPU-bound applications can in general be either batch or latency-sensitive. Typically, latency-critical applications subscribe to datacenter resources in the form of queries (e.g., inference requests against a DNN model). For example, a wearable health-monitoring device aggregates several sensor readings through a mobile application; in case of a data anomaly, the mobile device can trigger an inference service in the cloud, requesting a deep neural network (DNN) model that fits the symptom. Such GPU-bound inference requests impose strict Service Level Agreements (SLAs), typically set around 150 to 500 ms. In contrast to regular datacenter batch workloads, these user-facing applications are typically hosted as services that arrive and scale in short bursts. Batch applications, on the other hand, are HPC-style compute-bound workloads that are throughput-oriented. In a typical datacenter, both kinds of applications might co-exist on the same device, depending on the orchestration and scheduling policy. With the expected increase in such workloads, this GPU resource-management problem will only exacerbate. Hence, GPUs/accelerators are on the critical path for meeting the performance and end-to-end latency demands of such queries.

State-of-the-art resource orchestrators are agnostic of GPUs and their resource-utilization footprints, and are thus not equipped to dynamically orchestrate accelerator-bound containers. Meanwhile, datacenter job schedulers are heavily optimized and tuned for CPU-based systems. Kubernetes and Mesos by default perform uniform task scheduling, which statically assigns GPU resources to applications. Scheduled tasks access the GPUs via PCIe pass-through, which gives an application complete access to the GPU, as seen in Figure 1; the GPU's resource utilization therefore depends entirely on the parallelism of the application scheduled to run on it. For CPUs, Kubernetes supports dynamic orchestration through features such as node affinity, pod affinity, and pod preemption, but these features cannot be extended to GPUs: Kubernetes neither supports pod preemption on GPUs nor can it query real-time GPU metrics such as memory, symmetric multiprocessor (SM) utilization, and PCIe bandwidth. Moreover, containers often overstate their GPU resource requirements (such as memory), which leads to severe resource underutilization and, through queuing delays, to multiple QoS violations. We identify that employing CPU-based scheduling policies for GPU-bound workloads fails to yield high accelerator utilization and leads to poor performance per watt per query.
Motivated by this, we propose a GPU-aware resource orchestration layer that enables the resource scheduler to take advantage of GPUs by knowing their real-time utilization. We further discuss the ideal scheduler properties for a GPU-rich datacenter and list the challenges in developing such a production-grade GPU-based datacenter scheduler. We therefore modify Google's well-known Kubernetes datacenter-level resource orchestrator, making it GPU-aware by exposing GPU driver APIs. Based on our observations from Alibaba's cluster traces and experiments on a real-hardware GPU cluster, we build Knots, a GPU-aware resource orchestration layer, and integrate it with the Kubernetes container orchestrator. In addition, we evaluate three GPU-based scheduling schemes for scheduling datacenter-representative GPU workload mixes through Kube-Knots. Evaluations on a ten-node GPU cluster demonstrate that Knots, together with our proposed GPU-aware scheduling scheme, improves cluster-wide GPU utilization while significantly reducing cluster-wide power consumption across three different workload mixes when compared against Kubernetes' default uniform scheduler.
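The abstract argues that GPU-aware orchestration hinges on querying real-time GPU metrics through driver APIs. A minimal sketch of that idea using NVIDIA's NVML bindings (the pynvml package) is shown below; the least-loaded-GPU placement policy here is an illustration, not Kube-Knots' actual scheduling scheme.

```python
# Hedged sketch: query real-time GPU metrics via NVML and pick the
# least-loaded device, the kind of signal a GPU-aware orchestrator needs.
# Requires an NVIDIA driver and the pynvml bindings (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    stats = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)       # SM/mem %
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)              # bytes
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 # mW -> W
        stats.append({"gpu": i,
                      "sm_util": util.gpu,
                      "mem_used_frac": mem.used / mem.total,
                      "power_w": power_w})
    # Toy placement policy: send the next container to the GPU with the
    # lowest SM utilization, breaking ties by memory pressure.
    target = min(stats, key=lambda s: (s["sm_util"], s["mem_used_frac"]))
    print(f"schedule next pod on GPU {target['gpu']}: {target}")
finally:
    pynvml.nvmlShutdown()
```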

9 citations

Proceedings ArticleDOI
14 Jul 2014
TL;DR: This paper studies GPU power consumption at the component level, investigates the bottlenecks that cause low performance and low energy efficiency, and divides the low-performance kernels into low-occupancy and full-occupancy categories.
Abstract: GPUs are much more power-efficient than CPUs, but due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. To sustain the growth of high-performance computing, new architectural and application techniques are required to create power-efficient computing systems. Finding such techniques, however, requires studying power consumption at a detailed level and understanding the bottlenecks that cause low performance. Therefore, in this paper, we study GPU power consumption at the component level and investigate the bottlenecks that cause low performance and low energy efficiency. We divide the low-performance kernels into low-occupancy and full-occupancy categories. For the low-occupancy category, we study whether increasing occupancy helps increase performance and energy efficiency. For the full-occupancy category, we investigate whether these kernels are limited by memory bandwidth, coalescing efficiency, or SIMD utilization.
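As a rough sketch of the classification step the abstract describes (the metric names and thresholds below are invented for illustration, not the paper's methodology):

```python
# Hedged sketch: bucket kernels by occupancy and then by bottleneck,
# mirroring the categorization described in the abstract.
# All metric names and thresholds are illustrative assumptions.
def classify(kernel: dict) -> str:
    """kernel holds profiler-style metrics, each normalized to [0, 1]."""
    if kernel["occupancy"] < 0.5:                 # assumed cutoff
        return "low occupancy: try more blocks/threads or fewer registers"
    if kernel["dram_bw_util"] > 0.8:
        return "full occupancy: memory-bandwidth bound"
    if kernel["coalescing_eff"] < 0.5:
        return "full occupancy: poor coalescing (scattered accesses)"
    if kernel["simd_util"] < 0.5:
        return "full occupancy: low SIMD utilization (branch divergence)"
    return "full occupancy: no single dominant bottleneck"

# Hypothetical profiler readings for two kernels.
print(classify({"occupancy": 0.25, "dram_bw_util": 0.3,
                "coalescing_eff": 0.9, "simd_util": 0.9}))
print(classify({"occupancy": 0.95, "dram_bw_util": 0.2,
                "coalescing_eff": 0.4, "simd_util": 0.9}))
```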

9 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Performance Metrics
No. of papers in the topic in previous years

Year  Papers
2021  14
2020  15
2019  15
2018  36
2017  25
2016  31