Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published on this topic, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
01 Dec 2013
TL;DR: The challenges software programmers face when using HLS to implement computing kernels within FPGAs are explored, and the specific new knowledge and skills they need to succeed at the task are identified.
Abstract: For many application-specific computations, FPGA-based computing systems have been shown to provide better performance per watt than many general-purpose architectures. However, the benefits of FPGA-based computing are difficult to exploit, since FPGAs are challenging to program and require advanced hardware design skills. Recent developments in High Level Synthesis (HLS) make it possible to create FPGA compute accelerators entirely in 'C' code. Because the circuits are described in 'C', software programmers may be able to "program" FPGA accelerator circuits. This paper explores the challenges faced by software programmers when using HLS to implement computing kernels within FPGAs and identifies the specific new knowledge and skills these programmers need to succeed at the task. A high-performance Sobel edge-detection acceleration core is developed and used to demonstrate the use of the Vivado HLS tool. A variety of simple directives and code-restructuring steps are applied to produce a range of Sobel edge-detection accelerators whose performance varies from 10.9 frames per second (fps) to 388 fps. The concepts outlined in this paper suggest that, with proper training, software programmers can create a wide range of FPGA acceleration circuits.
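The Sobel operator at the heart of the accelerator is a pair of 3x3 gradient convolutions. A minimal scalar C sketch of one output pixel (illustrative only, not the paper's Vivado HLS code; the function name and clamping choice are assumptions) might look like:

```c
#include <stdlib.h>

/* 3x3 Sobel gradient magnitude for one interior pixel (x, y).
 * img is a row-major 8-bit grayscale image of width w.
 * Uses the common |gx| + |gy| approximation of the true magnitude,
 * clamped to the 8-bit output range. */
static inline int sobel_pixel(const unsigned char *img, int w, int x, int y)
{
    /* Horizontal gradient: [-1 0 1; -2 0 2; -1 0 1] */
    int gx = -img[(y-1)*w + (x-1)] + img[(y-1)*w + (x+1)]
             - 2*img[y*w + (x-1)]  + 2*img[y*w + (x+1)]
             - img[(y+1)*w + (x-1)] + img[(y+1)*w + (x+1)];
    /* Vertical gradient: [-1 -2 -1; 0 0 0; 1 2 1] */
    int gy = -img[(y-1)*w + (x-1)] - 2*img[(y-1)*w + x] - img[(y-1)*w + (x+1)]
             + img[(y+1)*w + (x-1)] + 2*img[(y+1)*w + x] + img[(y+1)*w + (x+1)];
    int mag = abs(gx) + abs(gy);
    return mag > 255 ? 255 : mag;
}
```

An HLS tool would pipeline the loop over pixels and keep the three needed image rows in on-chip line buffers; directives of that kind are what produce the wide frame-rate range (10.9 to 388 fps) the paper reports.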

19 citations

Journal ArticleDOI
01 Jun 2018
TL;DR: A dynamic cloud resource management (DCRM) policy is proposed to improve quality of service (QoS) in multimedia mobile computing; experimental results show that DCRM improves both response time and QoS, demonstrating that DCRM handles shared-resource management well in mobile media cloud computing.
Abstract: Single-instruction-set-architecture (Single-ISA) heterogeneous multi-core processors (HMPs) are superior to symmetric multi-core processors in performance per watt. They are popular in many areas of the Internet of Things, including mobile multimedia cloud computing platforms. A Single-ISA HMP integrates both fast out-of-order cores and slower, simpler cores, with all cores sharing the same ISA. Quality of service (QoS) is paramount for virtual machine (VM) resource management in multimedia mobile computing, particularly on Single-ISA heterogeneous multi-core cloud computing platforms. Therefore, in this paper, we propose a dynamic cloud resource management (DCRM) policy to improve QoS in multimedia mobile computing. DCRM dynamically and optimally partitions shared resources according to service or application requirements. Moreover, DCRM combines resource-aware VM allocation to maximize the effectiveness of the heterogeneous multi-core cloud platform. The basic idea behind this performance improvement is to balance shared-resource allocations against the resource requirements of the services. The experimental results show that DCRM improves both response time and QoS, demonstrating that DCRM handles shared-resource management well in mobile media cloud computing.
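The abstract does not publish DCRM's internal policy. As a hedged illustration of the general idea of partitioning a shared resource in proportion to demand (e.g., cache ways or memory bandwidth tokens among VMs), a simple demand-proportional split could be sketched as:

```c
#include <stddef.h>

/* Divide `total` indivisible units of a shared resource among n
 * consumers in proportion to demand[i]; leftover units from integer
 * rounding are handed out one each starting at index 0.
 * Illustrative sketch only: this is not DCRM's actual algorithm. */
void partition_proportional(const int *demand, int *share, size_t n, int total)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) sum += demand[i];
    if (sum == 0) {
        for (size_t i = 0; i < n; i++) share[i] = 0;
        return;
    }
    int given = 0;
    for (size_t i = 0; i < n; i++) {
        share[i] = (int)((long long)total * demand[i] / sum);  /* floor */
        given += share[i];
    }
    /* Distribute the rounding remainder round-robin. */
    for (size_t i = 0; given < total; i = (i + 1) % n) {
        share[i]++;
        given++;
    }
}
```

A dynamic policy such as DCRM would additionally re-run this kind of partitioning as service requirements change and couple it with resource-aware VM placement.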

19 citations

Book ChapterDOI
28 Aug 2017
TL;DR: This paper characterizes the NVIDIA Jetson TK1 and TX1 platforms using Roofline models obtained through an empirical, measurement-based approach and through a case study of a heterogeneous application (matrix multiplication).
Abstract: This study characterizes the NVIDIA Jetson TK1 and TX1 platforms, both built on an NVIDIA Tegra System on Chip combining a quad-core ARM CPU and an NVIDIA GPU. Their heterogeneous nature, as well as their wide operating frequency range, makes it hard for application developers to reason about performance and determine which optimizations are worth pursuing. This paper attempts to inform developers' choices by characterizing the platforms' performance using Roofline models obtained through an empirical, measurement-based approach, as well as through a case study of a heterogeneous application (matrix multiplication). Our results highlight a difference of more than an order of magnitude in compute performance between the CPU and GPU on both platforms. Given that the CPU and GPU share the same memory bus, their Roofline models' balance points are also more than an order of magnitude apart. We also explore the impact of frequency scaling: we build CPU and GPU Roofline profiles and characterize both platforms' balance-point variation, power consumption, and performance per watt as frequency is scaled.
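A Roofline model bounds attainable throughput by the minimum of peak compute and bandwidth-limited performance at a given arithmetic intensity; the balance (ridge) point is peak divided by bandwidth. A minimal sketch, using made-up numbers rather than the paper's measured TK1/TX1 values:

```c
/* Roofline: attainable GFLOP/s at arithmetic intensity ai (FLOP/byte),
 * given peak compute (GFLOP/s) and memory bandwidth (GB/s). */
double roofline(double peak_gflops, double bw_gbs, double ai)
{
    double mem_bound = bw_gbs * ai;   /* bandwidth-limited performance */
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

/* Balance (ridge) point: the intensity where the memory-bound slope
 * meets the compute roof. Kernels below it are memory-bound. */
double balance_point(double peak_gflops, double bw_gbs)
{
    return peak_gflops / bw_gbs;
}
```

With a hypothetical 512 GFLOP/s peak and 25.6 GB/s of shared bandwidth, the balance point is 20 FLOP/byte; since CPU and GPU share the memory bus on these boards, an order-of-magnitude gap in peak compute translates directly into an order-of-magnitude gap in balance points, as the paper observes.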

18 citations

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work proposes ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput, and demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes, while providing better system throughput.
Abstract: Single-ISA (instruction set architecture) asymmetric multicore processors (AMPs) have been shown to deliver higher performance per watt and per area than symmetric CMPs (chip multiprocessors) for applications with diverse architectural requirements. A large body of work has demonstrated that this potential of AMP systems can be realized via OS scheduling. Yet existing schedulers that seek to deliver fairness on AMPs do not ensure that equal-priority applications experience the same slowdown when sharing the system. Moreover, most of these schemes are also subject to high throughput degradation and fail to deal effectively with user priorities. In this work we propose ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput. Our evaluation on real AMP hardware, using scheduler implementations in a general-purpose OS, demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes, while providing better system throughput.
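Fairness on an AMP is usually framed in terms of per-application slowdown (shared runtime over solo runtime), with equal-priority applications treated fairly when their slowdowns match. A hedged sketch of one common formulation (the paper's exact metric may differ):

```c
#include <stddef.h>

/* Slowdown of one application: its completion time when sharing the
 * AMP divided by its completion time running alone. 1.0 = no slowdown. */
double slowdown(double t_shared, double t_alone)
{
    return t_shared / t_alone;
}

/* Unfairness as the ratio of the largest to the smallest slowdown
 * across n co-running applications; 1.0 means perfectly fair.
 * Illustrative metric, assumed here for exposition. */
double unfairness(const double *t_shared, const double *t_alone, size_t n)
{
    double lo = slowdown(t_shared[0], t_alone[0]);
    double hi = lo;
    for (size_t i = 1; i < n; i++) {
        double s = slowdown(t_shared[i], t_alone[i]);
        if (s < lo) lo = s;
        if (s > hi) hi = s;
    }
    return hi / lo;
}
```

A fairness-oriented AMP scheduler works by rotating big-core time toward the applications with the worst current slowdown, driving this ratio toward 1.0 without sacrificing too much aggregate throughput.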

18 citations

Proceedings ArticleDOI
12 Jul 2017
TL;DR: This paper reviews recent work using the Intel FPGA SDK for OpenCL and its optimization strategies, evaluating the framework through the design of a hyperspectral-image spatial-spectral classifier accelerator, and shows that reasonable speedups are obtained on a device with scarce computing and embedded-memory resources.
Abstract: Current computational demands require increasing designers' efficiency and system performance per watt. A broadly accepted solution for implementing efficient accelerators is reconfigurable computing. However, typical HDL methodologies require very specific skills and a considerable amount of designer time. Despite new approaches to high-level synthesis such as OpenCL, given the large heterogeneity of today's devices (manycores, CPUs, GPUs, FPGAs), there is no one-size-fits-all solution, so maximizing performance requires platform-driven optimization. This paper reviews recent work using the Intel FPGA SDK for OpenCL and the associated optimization strategies, evaluating the framework through the design of a hyperspectral-image spatial-spectral classifier accelerator. Results are reported for a Cyclone V SoC using the Intel FPGA OpenCL Offline Compiler 16.0 out of the box. Starting from a common baseline C implementation running on the embedded ARM® Cortex®-A9, OpenCL-based synthesis is evaluated with different generic and vendor-specific optimizations applied. The results show that reasonable speedups are obtained on a device with scarce computing and embedded-memory resources. A great step has been taken toward effectively raising the abstraction level, but a considerable amount of hardware design skill is still needed.
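A generic optimization in this family of OpenCL-for-FPGA work is loop unrolling, which lets the compiler instantiate parallel hardware for independent iterations. A C sketch of the manual equivalent for a dot-product reduction (illustrative, not code from the paper):

```c
/* Dot product with a 4-way unrolled body and separate accumulators,
 * the manual analogue of a 4x unroll directive in OpenCL for FPGA.
 * Independent accumulators break the loop-carried dependence on a
 * single sum, so a synthesized pipeline can issue one iteration group
 * per cycle instead of waiting on each floating-point add. */
float dot_unrolled4(const float *a, const float *b, int n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        acc0 += a[i]   * b[i];
        acc1 += a[i+1] * b[i+1];
        acc2 += a[i+2] * b[i+2];
        acc3 += a[i+3] * b[i+3];
    }
    for (; i < n; i++)          /* remainder iterations */
        acc0 += a[i] * b[i];
    return acc0 + acc1 + acc2 + acc3;
}
```

On a resource-scarce device such as the Cyclone V discussed above, the unroll factor is exactly the kind of platform-driven knob these papers tune: larger factors buy throughput at the cost of logic and embedded memory.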

16 citations

Network Information
Related Topics (5)

Cache: 59.1K papers, 976.6K citations, 81% related
Benchmark (computing): 19.6K papers, 419.1K citations, 80% related
Programming paradigm: 18.7K papers, 467.9K citations, 77% related
Compiler: 26.3K papers, 578.5K citations, 77% related
Scalability: 50.9K papers, 931.6K citations, 76% related
Performance Metrics
Number of papers in the topic in previous years:

Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31