
Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations in total.


Papers
Proceedings ArticleDOI
15 Jul 2013
TL;DR: This paper proposes Progress Time as the counterpart of CPU-time in space-multiplexed systems, shows how it can be used to track application progress, and introduces TimeCube, a manycore embedded processor that uses dynamic execution isolation and shadow performance modeling to provide an accurate online measurement of each application's Progress Time.
Abstract: Recently introduced processors such as Tilera's Tile Gx100 and Intel's 48-core SCC have delivered on the promise of high performance per watt in manycore processors, making these architectures ostensibly as attractive for low-power embedded processors as for cloud services. However, these architectures space-multiplex the microarchitectural resources between many threads to increase utilization, which leads to potentially large and varying levels of interference. This decorrelates CPU-time from actual application progress and decreases the ability of traditional software to accurately track and finely control application progress, hindering the adoption of manycore processors in embedded computing. In this paper we propose Progress Time as the counterpart of CPU-time in space-multiplexed systems and show how it can be used to track application progress. We also introduce TimeCube, a manycore embedded processor that uses dynamic execution isolation and shadow performance modeling to provide an accurate online measurement of each application's Progress Time. Our evaluation shows that a 32-core TimeCube processor can track application progress with less than 1% error even in the presence of a 6× average worst-case slowdown. TimeCube also uses Progress Times to perform online architectural resource management that leads to a 36% improvement in throughput compared to existing microarchitectural resource allocation schemes. Overall, the results argue for adding the requisite microarchitectural structures to support Progress Time in manycore chips for embedded systems.
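The core idea of Progress Time can be illustrated with a toy model. The sketch below is hypothetical (the paper's TimeCube derives the isolated-execution estimate in hardware via shadow performance models); all names and numbers here are illustrative only:

```python
# Toy illustration of Progress Time vs. CPU-time on a shared manycore.
# Progress Time estimates how long the application would have needed
# had it run in isolation, decoupling "progress" from interference.

def progress_time(instructions_retired, isolated_ipc, freq_hz):
    """Estimated isolated-execution time for the work actually completed."""
    return instructions_retired / (isolated_ipc * freq_hz)

# Hypothetical app: retired 2e9 instructions over 4.0 s of wall-clock
# CPU time, but a shadow model predicts 1.0 IPC at 1 GHz in isolation.
pt = progress_time(2e9, isolated_ipc=1.0, freq_hz=1e9)  # 2.0 s of progress
slowdown = 4.0 / pt                                      # 2x interference slowdown
print(pt, slowdown)
```

Note how CPU-time alone (4.0 s) says nothing about interference; the ratio of wall-clock time to Progress Time exposes the slowdown directly.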

8 citations

Journal Article
TL;DR: This work proposes a 16-core AMC architecture mixing simple and complex cores, and single- and multiple-thread cores of various power envelopes, together with a priority-based thread scheduling algorithm; in simulation, the nearest-neighbor scheduling algorithm outperforms or is competitive with the other algorithms in all considered scenarios.
Abstract: Asymmetric or heterogeneous multi-core (AMC) architectures have definite performance, performance-per-watt, and fault-tolerance advantages for a wide range of workloads. We propose a 16-core AMC architecture mixing simple and complex cores, and single- and multiple-thread cores of various power envelopes. A priority-based thread scheduling algorithm is also proposed for this AMC architecture. We address the fairness of this scheduling algorithm with respect to lower-priority thread starvation, as well as the hardware and software requirements needed to implement it, and illustrate how the algorithm operates with a thread scheduling example. The produced schedule maximizes throughput (while remaining priority-based) and core utilization given the available resources, the states and contents of the starting queues, and the threads' core-requirement constraints. A simulation model covers 6 scheduling algorithms that vary in their support of core affinity and thread migration. The simulation results show that both core affinity and thread migration positively affect completion time, and that the nearest-neighbor scheduling algorithm outperforms or is competitive with the other algorithms in all considered scenarios.
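The starvation concern raised above is commonly handled by priority aging, where a thread's effective priority improves the longer it waits. A minimal generic sketch (not the paper's algorithm; all names and numbers are illustrative):

```python
# Priority scheduling with aging: lower number = higher priority.
# Without aging, a stream of high-priority work starves "lo";
# with aging, "lo" eventually preempts.

def schedule(base_priority, quanta_needed, aging=1):
    """base_priority: {name: priority}; quanta_needed: {name: time slices}.
    Returns the dispatch order of time slices."""
    wait = {n: 0 for n in base_priority}
    remaining = dict(quanta_needed)
    order = []
    while any(remaining.values()):
        runnable = [n for n in remaining if remaining[n] > 0]
        # Effective priority drops (improves) by `aging` per slice waited.
        pick = min(runnable, key=lambda n: base_priority[n] - wait[n] * aging)
        order.append(pick)
        remaining[pick] -= 1
        for n in runnable:
            wait[n] = 0 if n == pick else wait[n] + 1
    return order

# "hi" needs 6 slices at priority 0; "lo" needs 1 slice at priority 3.
print(schedule({"hi": 0, "lo": 3}, {"hi": 6, "lo": 1}, aging=1))
print(schedule({"hi": 0, "lo": 3}, {"hi": 6, "lo": 1}, aging=0))
```

With aging enabled, "lo" runs after four slices of waiting; with aging disabled it runs only once "hi" has fully drained.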

8 citations

Proceedings ArticleDOI
01 Dec 2014
TL;DR: This paper proposes extensions to OpenSHMEM that relax its symmetric memory-allocation restriction, along with high-performance runtime-level designs for efficient communication involving Xeon Phi processors; evaluations indicate a 4X to 7X reduction in OpenSHMEM data-movement operation latencies and a 6X to 11X improvement in performance for collective operations.
Abstract: Intel Many Integrated Core (MIC) architectures are becoming an integral part of modern supercomputer architectures due to their high compute density and performance per watt. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics, by abstracting a shared memory address space along with one-sided communication semantics. However, the current OpenSHMEM standard does not efficiently support heterogeneous memory architectures such as Xeon Phi: host and Xeon Phi cores have different memory capacities and compute characteristics, yet the global symmetric memory allocation in the current standard mandates that the same amount of memory be allocated on every process. In this paper, we propose extensions to overcome this restriction and propose high-performance runtime-level designs for efficient communication involving Xeon Phi processors. Further, we re-design applications to demonstrate the effectiveness of the proposed designs and extensions. Experimental evaluations indicate a 4X to 7X reduction in OpenSHMEM data movement operation latencies, and a 6X to 11X improvement in performance for collective operations. Application evaluations in symmetric mode indicate performance improvements of 28% at 1,024 processes. Further, application redesigns using the proposed extensions provide several orders of magnitude of performance improvement compared to the symmetric mode. To the best of our knowledge, this is the first research work that proposes high-performance runtime designs for OpenSHMEM on Intel Xeon Phi clusters.

7 citations

01 Jan 2011
TL;DR: The Energy Reuse Effectiveness (ERE) metric is discussed; both the development and the application of the metric are examined in detail.
Abstract: Data centers are an ever-increasing consumer of energy in our economy. While the performance per watt of IT equipment continues to increase exponentially, this energy-performance improvement is still outstripped by increasing demand. Because of this, the efficiency of data centers must continue to improve. Beyond efficiency alone, many data centers are now working towards reusing their waste energy elsewhere in the data center, or on the site or campus. How to account for this, through metrics and measurements, is the topic of this paper. The Energy Reuse Effectiveness (ERE) metric is discussed; both the development and the application of the metric are examined in detail. The use of ERE in conjunction with PUE (Power Usage Effectiveness) is also considered.
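The two metrics discussed here have simple published definitions (PUE is total facility energy over IT energy; ERE additionally credits energy reused outside the facility). A sketch with hypothetical numbers:

```python
# Standard data-center energy metrics (Green Grid definitions).

def pue(total_facility_kwh, it_kwh):
    """Power Usage Effectiveness: total facility energy / IT energy (>= 1)."""
    return total_facility_kwh / it_kwh

def ere(total_facility_kwh, reused_kwh, it_kwh):
    """Energy Reuse Effectiveness: facility energy net of reuse, / IT energy."""
    return (total_facility_kwh - reused_kwh) / it_kwh

# Hypothetical facility: 1500 kWh total, 1000 kWh delivered to IT,
# 300 kWh of waste heat reused elsewhere on the campus.
print(pue(1500, 1000))       # 1.5
print(ere(1500, 300, 1000))  # 1.2 -- reuse pushes the metric below PUE
```

Unlike PUE, which is bounded below by 1.0, ERE can fall below 1.0 when enough waste energy is reused, which is exactly why the paper argues for tracking both.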

7 citations

Proceedings ArticleDOI
30 May 2018
TL;DR: This paper demonstrates the first design instance of a power-optimized ELM neural network using ReLU activation and presents a detailed analysis of the impact of various design choices (technology node, computation block size, etc.) on the overall performance of the neural inference engine.
Abstract: Neural networks have been successfully deployed in a variety of fields such as computer vision, natural language processing, and pattern recognition. However, most of their current deployments are suited to cloud-based high-performance computing systems. As the computation of neural networks is not suited to traditional Von Neumann CPU architectures, many novel hardware accelerator designs have been proposed in the literature. In this paper we present the design of a novel, simplified, and extensible neural inference engine for IoT systems. We present a detailed analysis of the impact of various design choices, such as technology node and computation block size, on the overall performance of the neural inference engine. The paper demonstrates the first design instance of a power-optimized ELM neural network using ReLU activation. Comparing the learning performance of the simulated hardware against the software model of the neural network shows a variation of ~1% in testing accuracy due to quantization. The accelerator compute blocks achieve a performance per watt of ~290 MSPS/W (million samples per second per watt) with a network of size 8 x 32 x 2. A minimum energy of 40 pJ per processed sample is achieved for a block size of 16. Further, we show through simulations that an added power saving of ~30% can be achieved if the SRAM-based main memory is replaced with emerging STT-MRAM technology.
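Performance per watt and energy per operation are reciprocal views of the same measurement: samples per second per watt is samples per joule. A hedged arithmetic sketch with hypothetical numbers (not the paper's measured figures):

```python
# Converting between performance-per-watt and energy-per-sample.

def perf_per_watt(samples_per_sec, power_watts):
    """Throughput per watt; numerically equals samples processed per joule."""
    return samples_per_sec / power_watts

def energy_per_sample(samples_per_sec, power_watts):
    """Joules per sample; the reciprocal of performance per watt."""
    return power_watts / samples_per_sec

# Hypothetical accelerator: 100e6 samples/s drawn at 2 W.
ppw = perf_per_watt(100e6, 2.0)      # 5e7 samples/J = 50 MSPS/W
eps = energy_per_sample(100e6, 2.0)  # 2e-8 J = 20 nJ per sample
print(ppw, eps)
```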

7 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2021  14
2020  15
2019  15
2018  36
2017  25
2016  31