Topic

Performance per watt

About: Performance per watt is a research topic. Over the lifetime, 315 publications have been published within this topic receiving 5778 citations.

...read moreread less

Papers published on a yearly basis

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Larrabee: a many-core x86 architecture for visual computing

[...]

Larry D. Seiler¹, Doug Carmean¹, Eric Sprangle¹, Tom Forsyth¹, Michael Abrash, Pradeep Dubey¹, Stephen Junkins¹, Adam T. Lake¹, Jeremy Sugerman², Robert Dale Cavin¹, Roger Espasa¹, Ed Grochowski¹, Toni Juan¹, Pat Hanrahan² - Show less +10 more•Institutions (2)

Intel¹, Stanford University²

01 Aug 2008

TL;DR: This article consists of a collection of slides from the author's conference presentation, some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.

...read moreread less

Abstract: This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

...read moreread less

784 citations

Proceedings Article•DOI•

GPUWattch: enabling energy optimizations in GPGPUs

[...]

Jingwen Leng¹, Tayler Hetherington², Ahmed ElTantawy², Syed Zohaib Gilani³, Nam Sung Kim³, Tor M. Aamodt², Vijay Janapa Reddi¹ - Show less +3 more•Institutions (3)

University of Texas at Austin¹, University of British Columbia², University of Wisconsin-Madison³

23 Jun 2013

TL;DR: This work proposes a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements, and accurately tracks the power consumption trend over time.

...read moreread less

Abstract: General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. More finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.

...read moreread less

558 citations

Proceedings Article•DOI•

Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications

[...]

Adrian M. Caulfield¹, Laura M. Grupp¹, Steven Swanson¹•Institutions (1)

University of California, San Diego¹

07 Mar 2009

TL;DR: The paper presents an exhaustive analysis of the design space of Gordon systems, focusing on the trade-offs between power, energy, and performance that Gordon must make, and describes a novel flash translation layer tailored to data intensive workloads and large flash storage arrays.

...read moreread less

Abstract: As our society becomes more information-driven, we have begun to amass data at an astounding and accelerating rate. At the same time, power concerns have made it difficult to bring the necessary processing power to bear on querying, processing, and understanding this data. We describe Gordon, a system architecture for data-centric applications that combines low-power processors, flash memory, and data-centric programming systems to improve performance for data-centric applications while reducing power consumption. The paper presents an exhaustive analysis of the design space of Gordon systems, focusing on the trade-offs between power, energy, and performance that Gordon must make. It analyzes the impact of flash-storage and the Gordon architecture on the performance and power efficiency of data-centric applications. It also describes a novel flash translation layer tailored to data intensive workloads and large flash storage arrays. Our data show that, using technologies available in the near future, Gordon systems can out-perform disk-based clusters by 1.5× and deliver up to 2.5× more performance per Watt.

...read moreread less

277 citations

Proceedings Article•DOI•

The 48-core SCC Processor: the Programmer's View

[...]

Timothy G. Mattson¹, Michael Riepen¹, Thomas Lehnig, Paul Brett¹, Werner Haas¹, Patrick Kennedy, Jason Howard¹, Sriram R. Vangal¹, Nitin Borkar¹, Greg Ruhl¹, Saurabh Dighe¹ - Show less +7 more•Institutions (1)

Intel¹

13 Nov 2010

TL;DR: The programmer's view of this chip is described and RCCE is described: the native message passing model created for the SCC processor, an intermediate case, sharing traits of message passing and shared memory architectures.

...read moreread less

Abstract: The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. This move to many-core chips is driven by a need to optimize performance per watt. How best to connect these cores and how to program the resulting many-core processor, however, is an open research question. Designs vary from GPUs to cache-coherent shared memory multiprocessors to pure distributed memory chips. The 48-core SCC processor reported in this paper is an intermediate case, sharing traits of message passing and shared memory architectures. The hardware has been described elsewhere. In this paper, we describe the programmer's view of this chip. In particular we describe RCCE: the native message passing model created for the SCC processor.

...read moreread less

267 citations

Journal Article•DOI•

HASS: a scheduler for heterogeneous multicore systems

[...]

Daniel Shelepov¹, Juan Carlos Saez Alcaide², Stacey Jeffery³, Alexandra Fedorova¹, Nestor Perez¹, Zhi Feng Huang¹, Sergey Blagodurov¹, Viren Kumar¹ - Show less +4 more•Institutions (3)

Simon Fraser University¹, Complutense University of Madrid², University of Waterloo³

21 Apr 2009-Operating Systems Review

TL;DR: This work proposes a Heterogeneity-Aware Signature-Supported scheduling algorithm that does the matching using per-thread architectural signatures, which are compact summaries of threads' architectural properties collected offline, and is comparatively simple and scalable.

...read moreread less

Abstract: Future heterogeneous single-ISA multicore processors will have an edge in potential performance per watt over comparable homogeneous processors. To fully tap into that potential, the OS scheduler needs to be heterogeneity-aware, so it can match jobs to cores according to characteristics of both. We propose a Heterogeneity-Aware Signature-Supported scheduling algorithm that does the matching using per-thread architectural signatures, which are compact summaries of threads' architectural properties collected offline. The resulting algorithm does not rely on dynamic profiling, and is comparatively simple and scalable. We implemented HASS in OpenSolaris, and achieved average workload speedups of up to 13%, matching best static assignment, achievable only by an oracle. We have also implemented a dynamic IPC-driven algorithm proposed earlier that relies on online profiling. We found that the complexity, load imbalance and associated performance degradation resulting from dynamic profiling are significant challenges to using this algorithm successfully. As a result it failed to deliver expected performance gains and to outperform HASS.

...read moreread less

256 citations

Collapse

Network Information

Performance

Metrics

315

Papers

6,353

Citations

No. of papers in the topic in previous years
Year	Papers
2021	14
2020	15
2019	15
2018	36
2017	25
2016	31

Performance per watt

Papers published on a yearly basis

Papers

Network Information

Related Topics (5)

Performance

Metrics