Topic

Performance per watt

About: Performance per watt is a research topic. Over the lifetime, 315 publications have been published within this topic receiving 5778 citations.
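
For reference (a generic definition with illustrative numbers, not figures drawn from any paper below), the metric is sustained throughput divided by power draw:

\[
\text{performance per watt} = \frac{\text{throughput}}{\text{power}},
\qquad
\frac{500~\text{GFLOP/s}}{250~\text{W}} = 2~\text{GFLOP/s/W}.
\]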


Papers
Proceedings ArticleDOI
18 Oct 2021
TL;DR: In this article, the authors propose a configurable GPU power model called AccelWattch that can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two, models both PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS.
Abstract: Graphics Processing Units (GPUs) are rapidly dominating the accelerator space, as illustrated by their widespread adoption in the data analytics and machine learning markets. At the same time, performance per watt has emerged as a crucial evaluation metric together with peak performance. As such, GPU architects require robust tools that will enable them to model both the performance and the power consumption of modern GPUs. However, while GPU performance modeling has progressed in great strides, power modeling has lagged behind. To mitigate this problem, we propose AccelWattch, a configurable GPU power model that resolves two long-standing needs: the lack of a detailed and accurate cycle-level power model for modern GPU architectures, and the inability to capture their constant and static power with existing tools. AccelWattch can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two; models both PTX and SASS ISAs; accounts for power gating and control-flow divergence; and supports DVFS. We integrate AccelWattch with GPGPU-Sim and Accel-Sim to facilitate its widespread use. We validate AccelWattch on an NVIDIA Volta GPU and show that it achieves strong correlation against hardware power measurements. Finally, we demonstrate that AccelWattch can enable reliable design space exploration: by directly applying AccelWattch tuned for Volta on GPU configurations resembling NVIDIA Pascal and Turing GPUs, we obtain accurate power models for these architectures.

28 citations
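
As a rough sketch of the counter-driven modeling approach described above (a generic linear power model, not AccelWattch's actual equations; the counters and coefficients are invented for illustration):

```c
#include <stdio.h>

/* Per-interval hardware activity counts (hypothetical counter set). */
struct gpu_counters {
    double sm_active_cycles;  /* cycles where at least one warp issued */
    double dram_accesses;     /* DRAM read/write transactions          */
    double fp_instructions;   /* floating-point instructions retired   */
};

/* Energy-per-event coefficients, as would be fitted against hardware
 * power measurements (placeholder values, not measured data). */
static const double E_SM_CYCLE = 0.9e-9;  /* J per active SM cycle        */
static const double E_DRAM     = 15e-9;   /* J per DRAM access            */
static const double E_FP       = 0.5e-9;  /* J per FP instruction         */
static const double P_STATIC   = 35.0;    /* W of leakage/constant power  */

/* Total power = dynamic (event counts weighted by energy cost, divided
 * by the sampling interval) + static. A model such as AccelWattch also
 * scales these terms with voltage and frequency to support DVFS, and
 * gates the static term per component; both are omitted here. */
double model_power_watts(const struct gpu_counters *c, double interval_s)
{
    double dynamic_joules = c->sm_active_cycles * E_SM_CYCLE
                          + c->dram_accesses    * E_DRAM
                          + c->fp_instructions  * E_FP;
    return dynamic_joules / interval_s + P_STATIC;
}

int main(void)
{
    struct gpu_counters c = { 1.2e9, 4.0e7, 9.0e8 };
    printf("modeled power: %.1f W\n", model_power_watts(&c, 1.0));
    return 0;
}
```

The same structure works whether the counts come from a trace-driven simulation or from hardware performance counters, which is what allows a model to accept either input source.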

Journal ArticleDOI
TL;DR: This work adapts one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric.
Abstract: The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

28 citations
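
To give a flavor of that SIMT-to-C transformation (a hand-written sketch of the general idea, not FCUDA's actual output; the kernel is a standard SAXPY example):

```c
/* Original CUDA kernel (SIMT: one logical thread per element):
 *
 *   __global__ void saxpy(float a, const float *x, float *y, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) y[i] = a * x[i] + y[i];
 *   }
 *
 * A source-to-source flow in the spirit of FCUDA makes the implicit
 * thread grid explicit: each thread block becomes a coarse-grained C
 * task (a candidate for its own FPGA core), and the threads within a
 * block become an inner loop that an HLS tool can pipeline or unroll. */

void saxpy_block_task(float a, const float *x, float *y, int n,
                      int block_id, int block_dim)
{
    for (int t = 0; t < block_dim; ++t) {   /* was threadIdx.x */
        int i = block_id * block_dim + t;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
}

void saxpy_grid(float a, const float *x, float *y, int n, int block_dim)
{
    int grid_dim = (n + block_dim - 1) / block_dim;
    for (int b = 0; b < grid_dim; ++b)      /* was blockIdx.x */
        saxpy_block_task(a, x, y, n, b, block_dim);
}
```

Because the block tasks are independent, the outer loop is where task-level parallelism is recovered on the reconfigurable fabric.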

Journal ArticleDOI
TL;DR: Enhanced memory and input/output subsystems enable the POWER7 processor-based servers to achieve significant increases in performance density and performance per watt, as compared with the predecessor POWER6® processor-based servers.
Abstract: This paper describes the system architectures and designs of the IBM POWER7® servers. From the smallest single-processor socket blade to the largest 32-processor-socket 256-core enterprise rack server, each system is designed to fully exploit the performance and the scalability of the POWER7 processor. This paper describes the enhancements made to the memory and input/output subsystems to achieve balanced and scalable designs, the changes made to the power and cooling circuitry to manage energy consumption and power dissipation, and the enhancements made to reliability, availability, and serviceability. These enhancements enable the POWER7 processor-based servers to achieve significant increases in the performance density and the performance per watt, as compared with the predecessor POWER6® processor-based servers.

28 citations

Journal ArticleDOI
TL;DR: Experiments have shown that heterogeneous architectures employing GPUs or FPGAs can result in significant application speedups over homogeneous CPU-based systems, while increasing performance per watt.
Abstract: Conventional systems based on general-purpose processors cannot keep pace with the exponential increase in the generation and collection of data. It is therefore important to explore alternative architectures that can provide the computational capabilities required to analyze ever-growing datasets. Programmable graphics processing units (GPUs) offer computational capabilities that surpass even high-end multi-core central processing units (CPUs), making them well-suited for floating-point- or integer-intensive and data-parallel operations. Field-programmable gate arrays (FPGAs), which can be reconfigured to implement an arbitrary circuit, provide the capability to specify a customized datapath for any task. The multiple granularities of parallelism offered by FPGA architectures, as well as their high internal bandwidth, make them suitable for low-complexity parallel computations. GPUs and FPGAs can serve as coprocessors for data mining applications, allowing the CPU to offload computationally intensive tasks for faster processing. Experiments have shown that heterogeneous architectures employing GPUs or FPGAs can result in significant application speedups over homogeneous CPU-based systems, while increasing performance per watt.

27 citations

Journal ArticleDOI
TL;DR: Mia Wallace, a 65-nm system-on-chip integrating a near-threshold parallel processor cluster tightly coupled with a CNN accelerator, is presented; it achieves peak energy efficiency of 108 GMAC/s/W at 0.72 V and peak performance of 14 GMAC/s at 1.2 V.
Abstract: Convolutional neural networks (CNNs) have revolutionized computer vision, speech recognition, and other fields requiring strong classification capabilities. These strengths make CNNs appealing in edge node Internet-of-Things (IoT) applications requiring near-sensors processing. Specialized CNN accelerators deliver significant performance per watt and satisfy the tight constraints of deeply embedded devices, but they cannot be used to implement arbitrary CNN topologies or nonconventional sensory algorithms where CNNs are only a part of the processing stack. A higher level of flexibility is desirable for next generation IoT nodes. Here, we present Mia Wallace, a 65-nm system-on-chip integrating a near-threshold parallel processor cluster tightly coupled with a CNN accelerator: it achieves peak energy efficiency of 108 GMAC/s/W at 0.72 V and peak performance of 14 GMAC/s at 1.2 V, leaving 1.2 GMAC/s available for general-purpose parallel processing.

27 citations
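
To put the headline efficiency in perspective (back-of-the-envelope arithmetic on the figures above, not a result reported in the paper): at the 0.72-V peak-efficiency point, a 100 mW budget of the kind deeply embedded IoT nodes typically operate under would sustain roughly

\[
108~\tfrac{\text{GMAC/s}}{\text{W}} \times 0.1~\text{W} \approx 10.8~\text{GMAC/s}.
\]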

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31