
Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations in total.


Papers
Proceedings ArticleDOI
15 Jul 2013
TL;DR: This paper proposes Progress Time as the counterpart of CPU-time in space-multiplexed systems, shows how it can be used to track application progress, and introduces TimeCube, a manycore embedded processor that uses dynamic execution isolation and shadow performance modeling to provide an accurate online measurement of each application's Progress Time.
Abstract: Recently introduced processors such as Tilera's Tile Gx100 and Intel's 48-core SCC have delivered on the promise of high performance per watt in manycore processors, making these architectures ostensibly as attractive for low-power embedded processors as for cloud services. However, these architectures space-multiplex the microarchitectural resources between many threads to increase utilization, which leads to potentially large and varying levels of interference. This decorrelates CPU-time from actual application progress and decreases the ability of traditional software to accurately track and finely control application progress, hindering the adoption of manycore processors in embedded computing. In this paper we propose Progress Time as the counterpart of CPU-time in space-multiplexed systems and show how it can be used to track application progress. We also introduce TimeCube, a manycore embedded processor that uses dynamic execution isolation and shadow performance modeling to provide an accurate online measurement of each application's Progress Time. Our evaluation shows that a 32-core TimeCube processor can track application progress with less than 1% error even in the presence of a 6× average worst-case slowdown. TimeCube also uses Progress Times to perform online architectural resource management that leads to a 36% improvement in throughput compared to existing microarchitectural resource allocation schemes. Overall, the results argue for adding the requisite microarchitectural structures to support Progress Time in manycore chips for embedded systems.
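The core idea of Progress Time can be illustrated with a toy model. The sketch below is hypothetical (the paper's TimeCube derives the isolated-execution estimate in hardware via shadow performance models); all names and numbers here are illustrative only:

```python
# Toy illustration of Progress Time vs. CPU-time on a shared manycore.
# Progress Time estimates how long the application would have needed
# had it run in isolation, decoupling "progress" from interference.

def progress_time(instructions_retired, isolated_ipc, freq_hz):
    """Estimated isolated-execution time for the work actually completed."""
    return instructions_retired / (isolated_ipc * freq_hz)

# Hypothetical app: retired 2e9 instructions over 4.0 s of wall-clock
# CPU time, but a shadow model predicts 1.0 IPC at 1 GHz in isolation.
pt = progress_time(2e9, isolated_ipc=1.0, freq_hz=1e9)  # 2.0 s of progress
slowdown = 4.0 / pt                                      # 2x interference slowdown
print(pt, slowdown)
```

Note how CPU-time alone (4.0 s) says nothing about interference; the ratio of wall-clock time to Progress Time exposes the slowdown directly.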

8 citations

Journal Article
TL;DR: This work proposes a 16-core AMC architecture mixing simple and complex cores, and single- and multiple-thread cores of various power envelopes, together with a priority-based thread scheduling algorithm; in simulation, the nearest-neighbor scheduling algorithm outperforms or is competitive with the other algorithms in all considered scenarios.
Abstract: Asymmetric or heterogeneous multi-core (AMC) architectures have definite performance, performance-per-watt, and fault-tolerance advantages for a wide range of workloads. We propose a 16-core AMC architecture mixing simple and complex cores, and single- and multiple-thread cores of various power envelopes. A priority-based thread scheduling algorithm is also proposed for this AMC architecture. We address the fairness of this scheduling algorithm with respect to lower-priority thread starvation, as well as the hardware and software requirements needed to implement it, and illustrate how the algorithm operates with a thread scheduling example. The produced schedule maximizes throughput (while remaining priority-based) and core utilization given the available resources, the states and contents of the starting queues, and the threads' core-requirement constraints. A simulation model covers 6 scheduling algorithms that vary in their support of core affinity and thread migration. The simulation results show that both core affinity and thread migration positively affect completion time, and that the nearest-neighbor scheduling algorithm outperforms or is competitive with the other algorithms in all considered scenarios.
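The starvation concern raised above is commonly handled by priority aging, where a thread's effective priority improves the longer it waits. A minimal generic sketch (not the paper's algorithm; all names and numbers are illustrative):

```python
# Priority scheduling with aging: lower number = higher priority.
# Without aging, a stream of high-priority work starves "lo";
# with aging, "lo" eventually preempts.

def schedule(base_priority, quanta_needed, aging=1):
    """base_priority: {name: priority}; quanta_needed: {name: time slices}.
    Returns the dispatch order of time slices."""
    wait = {n: 0 for n in base_priority}
    remaining = dict(quanta_needed)
    order = []
    while any(remaining.values()):
        runnable = [n for n in remaining if remaining[n] > 0]
        # Effective priority drops (improves) by `aging` per slice waited.
        pick = min(runnable, key=lambda n: base_priority[n] - wait[n] * aging)
        order.append(pick)
        remaining[pick] -= 1
        for n in runnable:
            wait[n] = 0 if n == pick else wait[n] + 1
    return order

# "hi" needs 6 slices at priority 0; "lo" needs 1 slice at priority 3.
print(schedule({"hi": 0, "lo": 3}, {"hi": 6, "lo": 1}, aging=1))
print(schedule({"hi": 0, "lo": 3}, {"hi": 6, "lo": 1}, aging=0))
```

With aging enabled, "lo" runs after four slices of waiting; with aging disabled it runs only once "hi" has fully drained.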

8 citations

Proceedings ArticleDOI
01 Dec 2014
TL;DR: This paper proposes extensions to OpenSHMEM that relax its symmetric memory-allocation restriction, along with high-performance runtime-level designs for efficient communication involving Xeon Phi processors; evaluations indicate a 4X to 7X reduction in OpenSHMEM data-movement operation latencies and a 6X to 11X improvement in performance for collective operations.
Abstract: Intel Many Integrated Core (MIC) architectures are becoming an integral part of modern supercomputer architectures due to their high compute density and performance per watt. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics, by abstracting a shared memory address space along with one-sided communication semantics. However, the current OpenSHMEM standard does not efficiently support heterogeneous memory architectures such as Xeon Phi: host and Xeon Phi cores have different memory capacities and compute characteristics, yet the global symmetric memory allocation in the current standard mandates that the same amount of memory be allocated on every process. In this paper, we propose extensions to overcome this restriction and propose high-performance runtime-level designs for efficient communication involving Xeon Phi processors. Further, we re-design applications to demonstrate the effectiveness of the proposed designs and extensions. Experimental evaluations indicate a 4X to 7X reduction in OpenSHMEM data movement operation latencies, and a 6X to 11X improvement in performance for collective operations. Application evaluations in symmetric mode indicate performance improvements of 28% at 1,024 processes. Further, application redesigns using the proposed extensions provide several orders of magnitude of performance improvement compared to the symmetric mode. To the best of our knowledge, this is the first research work that proposes high-performance runtime designs for OpenSHMEM on Intel Xeon Phi clusters.

7 citations

01 Jan 2011
TL;DR: The Energy Reuse Effectiveness (ERE) metric is discussed; both the development and the application of the metric are examined in detail.
Abstract: Data centers are an ever-increasing consumer of energy in our economy. While the performance per watt of IT equipment continues to increase exponentially, this energy-performance improvement is still outstripped by increasing demand. Because of this, the efficiency of data centers must continue to improve. Beyond efficiency alone, many data centers are now working towards reusing their waste energy elsewhere in the data center, or on the site or campus. How to account for this, through metrics and measurements, is the topic of this paper. The Energy Reuse Effectiveness (ERE) metric is discussed; both the development and the application of the metric are examined in detail. The use of ERE in conjunction with PUE (Power Usage Effectiveness) is also considered.
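The two metrics discussed here have simple published definitions (PUE is total facility energy over IT energy; ERE additionally credits energy reused outside the facility). A sketch with hypothetical numbers:

```python
# Standard data-center energy metrics (Green Grid definitions).

def pue(total_facility_kwh, it_kwh):
    """Power Usage Effectiveness: total facility energy / IT energy (>= 1)."""
    return total_facility_kwh / it_kwh

def ere(total_facility_kwh, reused_kwh, it_kwh):
    """Energy Reuse Effectiveness: facility energy net of reuse, / IT energy."""
    return (total_facility_kwh - reused_kwh) / it_kwh

# Hypothetical facility: 1500 kWh total, 1000 kWh delivered to IT,
# 300 kWh of waste heat reused elsewhere on the campus.
print(pue(1500, 1000))       # 1.5
print(ere(1500, 300, 1000))  # 1.2 -- reuse pushes the metric below PUE
```

Unlike PUE, which is bounded below by 1.0, ERE can fall below 1.0 when enough waste energy is reused, which is exactly why the paper argues for tracking both.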

7 citations

Proceedings ArticleDOI
30 May 2018
TL;DR: This paper demonstrates the first design instance of a power-optimized ELM neural network using ReLU activation and presents a detailed analysis of the impact of various design choices (technology node, computation block size, etc.) on the overall performance of the neural inference engine.
Abstract: Neural networks have been successfully deployed in a variety of fields such as computer vision, natural language processing, and pattern recognition. However, most of their current deployments are suited to cloud-based high-performance computing systems. As the computation of neural networks is not suited to traditional Von Neumann CPU architectures, many novel hardware accelerator designs have been proposed in the literature. In this paper we present the design of a novel, simplified, and extensible neural inference engine for IoT systems. We present a detailed analysis of the impact of various design choices, such as technology node and computation block size, on the overall performance of the neural inference engine. The paper demonstrates the first design instance of a power-optimized ELM neural network using ReLU activation. Comparing the learning performance of the simulated hardware against the software model of the neural network shows a variation of ~1% in testing accuracy due to quantization. The accelerator compute blocks achieve a performance per watt of ~290 MSPS/W (million samples per second per watt) with a network of size 8 x 32 x 2. A minimum energy of 40 pJ per processed sample is achieved for a block size of 16. Further, we show through simulations that an added power saving of ~30% can be achieved if the SRAM-based main memory is replaced with emerging STT-MRAM technology.
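Performance per watt and energy per operation are reciprocal views of the same measurement: samples per second per watt is samples per joule. A hedged arithmetic sketch with hypothetical numbers (not the paper's measured figures):

```python
# Converting between performance-per-watt and energy-per-sample.

def perf_per_watt(samples_per_sec, power_watts):
    """Throughput per watt; numerically equals samples processed per joule."""
    return samples_per_sec / power_watts

def energy_per_sample(samples_per_sec, power_watts):
    """Joules per sample; the reciprocal of performance per watt."""
    return power_watts / samples_per_sec

# Hypothetical accelerator: 100e6 samples/s drawn at 2 W.
ppw = perf_per_watt(100e6, 2.0)      # 5e7 samples/J = 50 MSPS/W
eps = energy_per_sample(100e6, 2.0)  # 2e-8 J = 20 nJ per sample
print(ppw, eps)
```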

7 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2021  14
2020  15
2019  15
2018  36
2017  25
2016  31