Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published on this topic, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
18 Jun 2018
TL;DR: This paper highlights performance issues of the Linux tap bridge with KVM that can easily be overcome by using a user-space virtual switch such as VOSYSwitch or OVS/DPDK, and showcases several shortcomings of unikernels on ARM.
Abstract: The Network Functions Virtualization (NFV) paradigm has emerged as a new concept in networking that aims at cost reduction and ease of network scalability by leveraging virtualization technologies and commercial off-the-shelf hardware to decouple the software implementation of network functions from the underlying hardware. Recently, lightweight virtualization techniques have emerged as efficient alternatives to traditional Virtual Network Functions (VNFs) deployed as VMs. At the same time, ARMv8 servers are gaining traction in the server world, mostly because of their attractive performance-per-watt characteristics. In this paper, the CPU, memory, and Input/Output (I/O) performance of such lightweight techniques is compared with that of classic virtual machines on both x86 and ARMv8 platforms. In particular, we selected KVM as the hypervisor, Docker and rkt as container engines, and Rumprun and OSv as unikernels. On x86, our results for CPU- and memory-bound workloads show slightly better performance for containers and unikernels, and both perform almost twice as fast as KVM for network I/O operations. This highlights performance issues of the Linux tap bridge with KVM, which can easily be overcome by using a user-space virtual switch such as VOSYSwitch or OVS/DPDK. On ARM, KVM and containers produce similar results for CPU and memory workloads, with the exception of network I/O operations, where KVM proves to be the fastest. We also showcase several shortcomings of unikernels on ARM, which account for their lack of stable support for this architecture.
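As a rough illustration of the kind of comparison this paper performs, the sketch below times the same CPU-bound workload on the bare host and inside a Docker container. The image name and the workload itself are placeholder assumptions, not the paper's benchmark suite; a full study would add KVM guests, unikernels, I/O benchmarks, and power measurement to derive performance per watt.

```python
# Minimal sketch (not the paper's harness): time one CPU-bound workload
# on the bare host and inside a Docker container, then compare.
# Assumes Docker is installed and the "python:3.12-slim" image is pullable.
import subprocess
import time

# Placeholder workload: sum the first 10 million squares in pure Python.
WORKLOAD = "python3 -c \"print(sum(i*i for i in range(10_000_000)))\""

def timed(cmd: list[str]) -> float:
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

host_s = timed(["sh", "-c", WORKLOAD])
container_s = timed(["docker", "run", "--rm", "python:3.12-slim",
                     "sh", "-c", WORKLOAD])

print(f"host:      {host_s:.2f} s")
print(f"container: {container_s:.2f} s  (overhead {container_s / host_s:.2f}x)")
```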

9 citations

Journal ArticleDOI
TL;DR: This work studies architectural features of the Tesla C870 GPU and the Cell BE, evaluates the effectiveness of architecture-specific optimizations and parallelization strategies for ITM on these platforms, and observes that the GPU delivers better performance than the Cell BE in both total execution time and performance per watt.
Abstract: The Irregular Terrain Model (ITM), also known as the Longley-Rice model, predicts the long-range average transmission loss of a radio signal based on atmospheric and geographic conditions. Because variable terrain effects and constantly changing atmospheric conditions can dramatically influence radio-wave propagation, there is a pressing need for computational resources capable of running hundreds of thousands of transmission-loss calculations per second. Multicore processors, like the NVIDIA Graphics Processing Unit (GPU) and the IBM Cell Broadband Engine (BE), offer improved performance over mainstream microprocessors for ITM. We study architectural features of the Tesla C870 GPU and the Cell BE and evaluate the effectiveness of architecture-specific optimizations and parallelization strategies for ITM on these platforms. We assess GPU implementations that utilize both global and shared memories along with fine-grained parallelism, and Cell BE implementations that utilize direct memory access, double buffering, and SIMDization. With these optimization strategies, we achieve under a second of computation time on each platform, which is not feasible with a general-purpose processor, and we observe that the GPU outperforms the Cell BE in total execution time and performance per watt by factors of 2.3x and 1.6x, respectively.
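Performance per watt itself is simply sustained throughput divided by power draw. As a hedged worked example (the runtimes and board powers below are invented placeholders, chosen only so the two ratios reproduce the 2.3x and 1.6x factors the abstract reports), the sketch computes both metrics:

```python
# Illustrative only: these runtimes and power draws are made-up
# placeholders, not measurements from the paper.
gpu = {"time_s": 0.30, "power_w": 170.0}   # hypothetical Tesla C870 run
cell = {"time_s": 0.69, "power_w": 118.0}  # hypothetical Cell BE run

calcs = 1_000_000  # hypothetical number of transmission-loss calculations

def perf_per_watt(run: dict) -> float:
    """Throughput (calculations/second) divided by power draw (watts)."""
    return (calcs / run["time_s"]) / run["power_w"]

speedup = cell["time_s"] / gpu["time_s"]
ppw_ratio = perf_per_watt(gpu) / perf_per_watt(cell)
print(f"GPU speedup over Cell BE:   {speedup:.1f}x")   # 2.3x
print(f"GPU perf/watt over Cell BE: {ppw_ratio:.1f}x") # 1.6x
```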

9 citations

Journal ArticleDOI
TL;DR: Analytical models based on scaled power metrics are presented to analyze the impact of various architectural design choices on scaled performance and power savings; the analysis shows that choosing the optimal chip configuration can considerably increase energy efficiency and energy savings.
Abstract: Many-core processors are accelerating the performance of contemporary high-performance systems. Managing power consumption within these systems demands low-power architectures to increase power savings. One of the promising solutions offered today by microprocessor architects is the asymmetric microprocessor, which integrates different core architectures on a single die. This paper presents analytical models based on scaled power metrics to analyze the impact of various architectural design choices on scaled performance and power savings. The power-consumption implications of different processing schemes and various chip configurations are also analyzed. The analysis shows that by choosing the optimal chip configuration, energy efficiency and energy savings can be increased considerably.
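The paper's exact models are not reproduced here, but a minimal sketch of this style of analysis, assuming the common Hill-and-Marty extension of Amdahl's law plus a toy per-core power term (all formulas and constants below are assumptions, not the paper's equations), looks like:

```python
# Hedged sketch of an Amdahl-style model for an asymmetric chip: one big
# core built from r base-core equivalents plus n - r small cores.
def asymmetric_speedup(f: float, n: int, r: int,
                       perf=lambda r: r ** 0.5) -> float:
    """Hill-Marty speedup; f is the parallel fraction of the workload,
    perf(r) is the assumed sequential performance of an r-sized big core."""
    serial = (1 - f) / perf(r)
    parallel = f / (perf(r) + (n - r))
    return 1.0 / (serial + parallel)

def chip_power(n: int, r: int, alpha: float = 1.5,
               small_w: float = 1.0) -> float:
    """Toy power model: the big core draws r**alpha units (superlinear in
    its size), each small core draws one unit; all values are assumed."""
    return r ** alpha * small_w + (n - r) * small_w

# Sweep the big-core size and report the best performance-per-watt point.
N, F = 64, 0.95
best = max(range(1, N + 1),
           key=lambda r: asymmetric_speedup(F, N, r) / chip_power(N, r))
print(f"best big-core size r = {best}, perf/watt = "
      f"{asymmetric_speedup(F, N, best) / chip_power(N, best):.2f}")
```

Under these assumed constants the sweep lands on a moderately sized big core: a larger core shrinks the serial term but its superlinear power draw eventually erodes performance per watt, which is exactly the kind of trade-off such models expose.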

9 citations

Proceedings ArticleDOI
11 Oct 2018
TL;DR: This paper builds Knots, a GPU-aware resource orchestration layer that enables the resource scheduler to take advantage of GPUs by knowing their real-time utilization, discusses the ideal scheduler properties for a GPU-rich datacenter, and lists the challenges in developing such a production-grade GPU-based datacenter scheduler.
Abstract: Modern data centers are increasingly being provisioned with compute accelerators such as GPUs, FPGAs, and ASICs to keep up with workload performance demands and reduce the total cost of ownership (TCO). By 2021, traffic within hyperscale datacenters is expected to quadruple, with 94% of workloads moving to cloud-based datacenters according to Cisco's global cloud index. A majority of these workloads, including data mining, image processing, speech recognition, and gaming, use GPUs for high-throughput computing. This trend is evident as public cloud operators like Amazon and Microsoft have recently started to offer GPU-based infrastructure services.

GPU-bound applications can in general be either batch or latency-sensitive. Typically, latency-critical applications subscribe to datacenter resources in the form of queries (e.g., inference requests against a DNN model). For example, a wearable health-monitoring device aggregates several sensor readings through a mobile application; in case of a data anomaly, the mobile device can trigger an inference service in the cloud, requesting a deep neural network (DNN) model that fits the symptom. Such GPU-bound inference requests impose strict Service Level Agreements (SLAs), typically set around 150 to 500 ms. In contrast to regular datacenter batch workloads, these user-facing applications are typically hosted as services that arrive and scale in short bursts. Batch applications, on the other hand, are HPC-style compute-bound workloads that are throughput-oriented. In a typical datacenter, both kinds of applications might co-exist on the same device, depending on the orchestration and scheduling policy. With the expected increase in such workloads, this GPU resource-management problem will only exacerbate. Hence, GPUs/accelerators are on the critical path for meeting the performance and end-to-end latency demands of such queries.

State-of-the-art resource orchestrators are agnostic of GPUs and their resource-utilization footprints, and are thus not equipped to dynamically orchestrate accelerator-bound containers. Meanwhile, datacenter job schedulers are heavily optimized and tuned for CPU-based systems. Kubernetes and Mesos by default perform uniform task scheduling, which statically assigns GPU resources to applications. Scheduled tasks access the GPUs via PCIe pass-through, which gives an application complete access to the GPU, as seen in Figure 1; the GPU's resource utilization therefore depends entirely on the parallelism of the application scheduled to run on it. For CPUs, Kubernetes supports dynamic orchestration through features such as node affinity, pod affinity, and pod preemption, but these features cannot be extended to GPUs: Kubernetes neither supports pod preemption on GPUs nor can it query real-time GPU metrics such as memory, symmetric multiprocessor (SM) utilization, and PCIe bandwidth. Moreover, containers often overstate their GPU resource requirements (such as memory), which leads to severe resource underutilization and, through queuing delays, to multiple QoS violations. We identify that employing CPU-based scheduling policies for GPU-bound workloads fails to yield high accelerator utilization and leads to poor performance per watt per query.
Motivated by this, we propose a GPU-aware resource orchestration layer that enables the resource scheduler to take advantage of GPUs by knowing their real-time utilization. We further discuss the ideal scheduler properties for a GPU-rich datacenter and list the challenges in developing such a production-grade GPU-based datacenter scheduler. We therefore modify Google's well-known Kubernetes datacenter-level resource orchestrator, making it GPU-aware by exposing GPU driver APIs. Based on our observations from Alibaba's cluster traces and experiments on a real-hardware GPU cluster, we build Knots, a GPU-aware resource orchestration layer, and integrate it with the Kubernetes container orchestrator. In addition, we evaluate three GPU-based scheduling schemes for scheduling datacenter-representative GPU workload mixes through Kube-Knots. Evaluations on a ten-node GPU cluster demonstrate that Knots, together with our proposed GPU-aware scheduling scheme, improves cluster-wide GPU utilization while significantly reducing cluster-wide power consumption across three different workload mixes when compared against Kubernetes' default uniform scheduler.
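The abstract argues that GPU-aware orchestration hinges on querying real-time GPU metrics through driver APIs. A minimal sketch of that idea using NVIDIA's NVML bindings (the pynvml package) is shown below; the least-loaded-GPU placement policy here is an illustration, not Kube-Knots' actual scheduling scheme.

```python
# Hedged sketch: query real-time GPU metrics via NVML and pick the
# least-loaded device, the kind of signal a GPU-aware orchestrator needs.
# Requires an NVIDIA driver and the pynvml bindings (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    stats = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)       # SM/mem %
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)              # bytes
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 # mW -> W
        stats.append({"gpu": i,
                      "sm_util": util.gpu,
                      "mem_used_frac": mem.used / mem.total,
                      "power_w": power_w})
    # Toy placement policy: send the next container to the GPU with the
    # lowest SM utilization, breaking ties by memory pressure.
    target = min(stats, key=lambda s: (s["sm_util"], s["mem_used_frac"]))
    print(f"schedule next pod on GPU {target['gpu']}: {target}")
finally:
    pynvml.nvmlShutdown()
```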

9 citations

Proceedings ArticleDOI
14 Jul 2014
TL;DR: This paper studies GPU power consumption at the component level, investigates the bottlenecks that cause low performance and low energy efficiency, and divides the low-performance kernels into low-occupancy and full-occupancy categories.
Abstract: GPUs are much more power-efficient than CPUs, but due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. To sustain the growth of high-performance computing, new architectural and application techniques are required to create power-efficient computing systems. Finding such techniques, however, requires studying power consumption at a detailed level and understanding the bottlenecks that cause low performance. Therefore, in this paper, we study GPU power consumption at the component level and investigate the bottlenecks that cause low performance and low energy efficiency. We divide the low-performance kernels into low-occupancy and full-occupancy categories. For the low-occupancy category, we study whether increasing occupancy helps increase performance and energy efficiency. For the full-occupancy category, we investigate whether these kernels are limited by memory bandwidth, coalescing efficiency, or SIMD utilization.
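As a rough sketch of the classification step the abstract describes (the metric names and thresholds below are invented for illustration, not the paper's methodology):

```python
# Hedged sketch: bucket kernels by occupancy and then by bottleneck,
# mirroring the categorization described in the abstract.
# All metric names and thresholds are illustrative assumptions.
def classify(kernel: dict) -> str:
    """kernel holds profiler-style metrics, each normalized to [0, 1]."""
    if kernel["occupancy"] < 0.5:                 # assumed cutoff
        return "low occupancy: try more blocks/threads or fewer registers"
    if kernel["dram_bw_util"] > 0.8:
        return "full occupancy: memory-bandwidth bound"
    if kernel["coalescing_eff"] < 0.5:
        return "full occupancy: poor coalescing (scattered accesses)"
    if kernel["simd_util"] < 0.5:
        return "full occupancy: low SIMD utilization (branch divergence)"
    return "full occupancy: no single dominant bottleneck"

# Hypothetical profiler readings for two kernels.
print(classify({"occupancy": 0.25, "dram_bw_util": 0.3,
                "coalescing_eff": 0.9, "simd_util": 0.9}))
print(classify({"occupancy": 0.95, "dram_bw_util": 0.2,
                "coalescing_eff": 0.4, "simd_util": 0.9}))
```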

9 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Performance Metrics
No. of papers in the topic in previous years

Year  Papers
2021  14
2020  15
2019  15
2018  36
2017  25
2016  31