Topic
Performance per watt
About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5778 citations.
Papers published on a yearly basis
Papers
01 Dec 2013
TL;DR: The challenges faced by software programmers when using HLS to implement computing kernels within FPGAs are explored, and the specific new knowledge and skills required by programmers to succeed at the task are identified.
Abstract: For many application-specific computations, FPGA-based computing systems have been shown to deliver better performance per watt than many general-purpose architectures. However, the benefits of FPGA-based computing are difficult to exploit since FPGAs are challenging to program and require advanced hardware design skills. Recent developments in High Level Synthesis (HLS) provide the ability to create FPGA compute accelerators entirely in `C' code. Because the circuits are described in `C', it may be possible for software programmers to “program” FPGA accelerator circuits. This paper explores the challenges faced by software programmers when using HLS to implement computing kernels within FPGAs and identifies the specific new knowledge and skills required by these programmers to succeed at the task. A high-performance Sobel edge-detection acceleration core is developed and used to demonstrate the use of the Vivado HLS tool. A variety of simple directives and code restructuring steps are applied to demonstrate a variety of Sobel edge-detection accelerators that vary in performance from 10.9 frames per second (fps) to 388 fps. The concepts outlined in this paper suggest that with proper training, software programmers are able to create a wide range of FPGA acceleration circuits.
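The Sobel computation at the heart of the accelerator can be sketched in plain software form. This is a hedged reference sketch in Python, not the paper's Vivado HLS `C' code; the function and variable names are illustrative, and the paper's accelerators additionally apply HLS directives (pipelining, unrolling) that have no software equivalent here.

```python
# Software reference for a 3x3 Sobel edge-detection kernel: the
# computation an HLS tool would pipeline into an FPGA datapath.
# Names are illustrative, not taken from the paper.

def sobel(image):
    """Return gradient magnitudes for a grayscale image (list of rows)."""
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal kernel
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical kernel
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for i in range(3):
                for j in range(3):
                    p = image[y + i - 1][x + j - 1]
                    gx += gx_k[i][j] * p
                    gy += gy_k[i][j] * p
            # |gx| + |gy| approximates the true gradient magnitude
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out
```

The nested loops over the 3x3 window are exactly what HLS `PIPELINE` and `UNROLL` directives target: the fps range reported in the paper comes from restructuring this loop nest, not from changing the math.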
19 citations
01 Jun 2018
TL;DR: A dynamic cloud resource management (DCRM) policy to improve the quality of service (QoS) in multimedia mobile computing is proposed, and experimental results show that DCRM behaves better in both response time and QoS, thus proving that DCRM is good at shared resource management in mobile media cloud computing.
Abstract: Single-instruction-set architecture (Single-ISA) heterogeneous multi-core processors (HMPs) are superior to symmetric multi-core processors in performance per watt. They are popular in many aspects of the Internet of Things, including mobile multimedia cloud computing platforms. A Single-ISA HMP integrates both fast out-of-order cores and slower, simpler cores, while all cores share the same ISA. Quality of service (QoS) is most important for virtual machine (VM) resource management in multimedia mobile computing, particularly in Single-ISA heterogeneous multi-core cloud computing platforms. Therefore, in this paper, we propose a dynamic cloud resource management (DCRM) policy to improve the QoS in multimedia mobile computing. DCRM dynamically and optimally partitions shared resources according to service or application requirements. Moreover, DCRM combines resource-aware VM allocation to maximize the effectiveness of the heterogeneous multi-core cloud platform. The basic idea behind this performance improvement is to balance the shared resource allocations with the applications' resource requirements. The experimental results show that DCRM behaves better in both response time and QoS, thus proving that DCRM is good at shared resource management in mobile media cloud computing.
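The abstract does not give DCRM's partitioning algorithm; as a purely illustrative sketch, a demand-proportional split of a shared resource (e.g. cache ways or memory bandwidth units across VMs) could look like the following. All names and the rounding policy are assumptions, not the paper's method.

```python
# Illustrative demand-proportional partitioning of a shared resource
# across VMs, in the spirit of (but not taken from) the DCRM policy.

def partition_resource(total_units, demands):
    """Split total_units across VMs proportionally to their demands,
    giving every VM at least one unit."""
    total_demand = sum(demands.values())
    shares = {vm: max(1, round(total_units * d / total_demand))
              for vm, d in demands.items()}
    # Fix rounding drift so shares sum to exactly total_units,
    # charging the difference to the most demanding VM.
    drift = total_units - sum(shares.values())
    if drift:
        top = max(demands, key=demands.get)
        shares[top] += drift
    return shares
```

A dynamic policy would re-run such a partition whenever measured service requirements change, which is the "dynamically and optimally partitions" behavior the abstract describes.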
19 citations
28 Aug 2017
TL;DR: This paper characterizes the NVIDIA Jetson TK1 and TX1 platforms using Roofline models obtained through an empirical measurement-based approach and through a case study of a heterogeneous application (matrix multiplication).
Abstract: This study characterizes the NVIDIA Jetson TK1 and TX1 platforms, both built on an NVIDIA Tegra System on Chip combining a quad-core ARM CPU and an NVIDIA GPU. Their heterogeneous nature, as well as their wide operating frequency range, makes it hard for application developers to reason about performance and determine which optimizations are worth pursuing. This paper attempts to inform developers' choices by characterizing the platforms' performance using Roofline models obtained through an empirical measurement-based approach as well as through a case study of a heterogeneous application (matrix multiplication). Our results highlight a difference of more than an order of magnitude in compute performance between the CPU and GPU on both platforms. Given that the CPU and GPU share the same memory bus, their Roofline models' balance points are also more than an order of magnitude apart. We also explore the impact of frequency scaling: we build CPU and GPU Roofline profiles and characterize both platforms' balance point variation, power consumption, and performance per watt as frequency is scaled.
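The Roofline relationships the abstract relies on can be written down directly: attainable performance is capped by the lower of the compute roof and the memory roof, and the balance point is the arithmetic intensity where the two meet. The numbers below are made-up illustrations, not the paper's measured TK1/TX1 values.

```python
# Roofline model basics.  Attainable performance is the minimum of peak
# compute and (arithmetic intensity x memory bandwidth); the balance
# point is the intensity where the two roofs intersect.

def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline ceiling for a kernel with the given arithmetic
    intensity (FLOPs per byte moved)."""
    return min(peak_gflops, intensity * bandwidth_gbs)

def balance_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOPs/byte) above which a kernel stops
    being memory-bound on this platform."""
    return peak_gflops / bandwidth_gbs

# With a CPU/GPU pair sharing one memory bus, an order-of-magnitude gap
# in peak compute implies an order-of-magnitude gap in balance points
# (illustrative figures only):
cpu_balance = balance_point(peak_gflops=20.0, bandwidth_gbs=10.0)   # 2.0
gpu_balance = balance_point(peak_gflops=300.0, bandwidth_gbs=10.0)  # 30.0
```

This is why the shared memory bus matters in the abstract's observation: the same bandwidth in the denominator pushes the GPU's balance point far to the right, so many kernels that are compute-bound on the CPU remain memory-bound on the GPU.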
18 citations
13 Apr 2015
TL;DR: This work proposes ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput, and demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes, while providing better system throughput.
Abstract: Single-ISA (instruction set architecture) asymmetric multicore processors (AMPs) were shown to deliver higher performance per watt and area than symmetric CMPs (Chip Multi-Processors) for applications with diverse architectural requirements. A large body of work has demonstrated that this potential of AMP systems can be realized via OS scheduling. Yet, existing schedulers that seek to deliver fairness on AMPs do not ensure that equal-priority applications experience the same slowdown when sharing the system. Moreover, most of these schemes are also subject to high throughput degradation and fail to effectively deal with user priorities. In this work we propose ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput. Our evaluation on real AMP hardware, and using scheduler implementations on a general-purpose OS, demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes, while providing better system throughput.
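The fairness goal the abstract states, equal-priority applications experiencing the same slowdown, is commonly quantified with per-application slowdown ratios. The sketch below uses one common formulation (max/min slowdown); the paper's exact metric may differ, and the workload names and times are invented.

```python
# Fairness accounting for AMP scheduling (illustrative formulation):
# slowdown of each application relative to running alone on the system,
# and unfairness as the max/min slowdown ratio across applications.

def slowdown(shared_time, alone_time):
    """How much longer an app runs when sharing the AMP vs. alone."""
    return shared_time / alone_time

def unfairness(slowdowns):
    """1.0 means perfectly fair: every app is slowed equally."""
    return max(slowdowns) / min(slowdowns)

apps = {"mcf": slowdown(150.0, 100.0),     # 1.5x slower when sharing
        "bzip2": slowdown(120.0, 100.0)}   # 1.2x slower when sharing
print(unfairness(apps.values()))  # 1.25
```

A fairness-oriented scheduler like ACFS drives this ratio toward 1.0, e.g. by granting big-core time preferentially to the application currently suffering the larger slowdown.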
18 citations
12 Jul 2017
TL;DR: This paper reviews recent work using the Intel FPGA SDK for OpenCL and its optimization strategies, evaluating the framework for the design of a hyperspectral image spatial-spectral classifier accelerator, and shows how reasonable speedups are obtained on a device with scarce computing and embedded memory resources.
Abstract: Current computational demands require increasing designers' efficiency and system performance per watt. A broadly accepted solution for implementing efficient accelerators is reconfigurable computing. However, typical HDL methodologies require very specific skills and a considerable amount of designers' time. Despite new approaches to high-level synthesis like OpenCL, given the large heterogeneity in today's devices (manycore CPUs, GPUs, FPGAs), there is no one-size-fits-all solution, so to maximize performance, platform-driven optimization is needed. This paper reviews recent work using the Intel FPGA SDK for OpenCL and its optimization strategies, evaluating the framework for the design of a hyperspectral image spatial-spectral classifier accelerator. Results are reported for a Cyclone V SoC using the Intel FPGA OpenCL Offline Compiler 16.0 out of the box. Starting from a common baseline C implementation running on the embedded ARM® Cortex®-A9, OpenCL-based synthesis is evaluated applying different generic and vendor-specific optimizations. Results show how reasonable speedups are obtained on a device with scarce computing and embedded memory resources. A great step has been taken toward effectively raising the abstraction level, but a considerable amount of hardware design skill is still needed.
16 citations