Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.
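As a quick illustration of the metric itself, the sketch below computes performance per watt as throughput divided by average power draw. The throughput and power figures are placeholders for illustration, not data from any of the papers listed here.

```python
# Minimal sketch: performance per watt as useful work per unit of power.
# The numbers below are illustrative placeholders, not benchmark data.

def performance_per_watt(ops_per_second: float, avg_power_watts: float) -> float:
    """Return the efficiency metric: operations per second per watt."""
    return ops_per_second / avg_power_watts

# Example: a device sustaining 2.5 TFLOP/s while drawing 180 W on average.
tflops = 2.5e12   # floating-point operations per second (assumed)
power = 180.0     # average power draw in watts (assumed)
print(f"{performance_per_watt(tflops, power) / 1e9:.1f} GFLOPS/W")  # -> 13.9 GFLOPS/W
```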


Papers
Journal Article
TL;DR: Performance and power results indicate that multi-core embedded system architectures leveraging shared last-level caches (LLCs) provide the best LLC performance per watt but may introduce main memory response-time and throughput bottlenecks at high cache miss rates, whereas architectures leveraging a hybrid of private and shared LLCs alleviate main memory bottlenecks at the expense of reduced performance per watt.

5 citations

Patent
Thomas J. Heller Jr.
16 Nov 2009
TL;DR: This patent describes a stack of microprocessor chips designed to work together in a multiprocessor system, with the hypervisor or operating system controlling the utilization of the individual chips in the stack.
Abstract: A computing system has a stack of microprocessor chips that are designed to work together in a multiprocessor system. The chips are interconnected with 3D through vias, or alternatively by compatible package carriers providing the interconnections, while logically the chips in the stack are interconnected via specialized cache-coherent interconnections. All of the chips in the stack use the same logical chip design, even though they can be easily personalized by setting specialized latches on the chips. One or more of the individual microprocessor chips in the stack are implemented in a silicon process optimized for high performance, while others are implemented in a silicon process optimized for power consumption, i.e., for the best performance per watt of electrical power consumed. The hypervisor or operating system controls the utilization of the individual chips of a stack.
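As a hedged illustration of the scheduling idea in this abstract, the sketch below shows a hypervisor-like policy choosing between a performance-optimized die and a power-optimized die in the stack. The chip names, power figures, and the 50% utilization threshold are all hypothetical, not taken from the patent.

```python
# Hypothetical sketch of the patent's scheduling idea: steer work to the
# power-optimized chip when demand is low and to the performance-optimized
# chip when demand is high. All names, numbers, and the policy threshold
# are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class Chip:
    name: str
    relative_speed: float   # throughput relative to the fastest die
    power_watts: float      # typical power draw under load

    @property
    def perf_per_watt(self) -> float:
        return self.relative_speed / self.power_watts

def pick_chip(chips: list[Chip], utilization: float) -> Chip:
    """Under light load, maximize performance per watt; under heavy load,
    maximize raw performance (an assumed policy, for illustration)."""
    if utilization < 0.5:
        return max(chips, key=lambda c: c.perf_per_watt)
    return max(chips, key=lambda c: c.relative_speed)

stack = [
    Chip("high-performance die", relative_speed=1.0, power_watts=90.0),
    Chip("low-power die", relative_speed=0.6, power_watts=30.0),
]
print(pick_chip(stack, utilization=0.2).name)  # -> low-power die
print(pick_chip(stack, utilization=0.9).name)  # -> high-performance die
```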

5 citations

Journal Article
TL;DR: This work proposes a Lightweight Chip Multi-Threaded (LCMT) architecture that further exploits thread-level parallelism (TLP) by incorporating direct architectural support for an “unlimited” number of dynamically created lightweight threads with very low thread management and synchronization overhead.
Abstract: Irregular and dynamic applications, such as graph problems and agent-based simulations, often require fine-grained parallelism to achieve good performance. However, current multicore processors only provide architectural support for coarse-grained parallelism, making it necessary to use software-based multithreading environments to effectively implement fine-grained parallelism. Although these software-based environments have demonstrated superior performance over heavyweight, OS-level threads, they are still limited by the significant overhead involved in thread management and synchronization. To address this, we propose a Lightweight Chip Multi-Threaded (LCMT) architecture that further exploits thread-level parallelism (TLP) by incorporating direct architectural support for an “unlimited” number of dynamically created lightweight threads with very low thread management and synchronization overhead. The LCMT architecture can be implemented atop a mainstream architecture with minimal extra hardware to leverage existing legacy software environments. We compare the LCMT architecture with a Niagara-like baseline architecture. Our results show up to 1.8X better scalability, 1.91X better performance, and, more importantly, 1.74X better performance per watt with the LCMT architecture on irregular and dynamic benchmarks, compared to the baseline architecture. The LCMT architecture delivers similar performance to the baseline architecture on regular benchmarks.
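The LCMT proposal is architectural, but the overhead it targets has a familiar software analogue. The sketch below contrasts spawning one OS thread per fine-grained task with amortizing thread management over a reused pool; nothing here reproduces the paper's hardware, and the workload and task count are assumptions for illustration only.

```python
# Software analogue of why fine-grained tasks need lightweight threads:
# per-thread creation/teardown cost dominates when each task is tiny.
# Absolute timings vary by machine; only the contrast matters here.

import threading
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_task(x: int) -> int:
    return x * x  # a deliberately fine-grained unit of work

N = 2000  # assumed task count

# One OS thread per task: pays management cost N times.
start = time.perf_counter()
threads = [threading.Thread(target=tiny_task, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
per_task_threads = time.perf_counter() - start

# A reused pool: thread management cost is amortized across all tasks.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(tiny_task, range(N)))
pooled = time.perf_counter() - start

print(f"one thread per task: {per_task_threads:.3f}s, pooled: {pooled:.3f}s")
```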

5 citations

Proceedings Article
14 Jul 2014
TL;DR: The presentation gives the audience a high-level understanding of the goals of HSA, the properties of the HSA system architecture, and its use models by system software, tools, and applications.
Abstract: Summary form only given. The use of GPUs in computation-intensive tasks has an ever-increasing impact across all platforms, including embedded, and they are sometimes even used to create new forms of currency (Bitcoin, Litecoin, ...). The exponential improvements in performance per watt continue unabated. At the same time, owing to their “design heritage” as primarily 3D accelerators, GPUs have several properties that make it a software challenge to unlock their full benefit in many real-world application scenarios, whether due to limiting APIs (proprietary or of limited functionality) or properties that require an advanced understanding of the platform architecture and of managing memory and other system resources, beyond the reach of the “average programmer”. The Heterogeneous System Architecture (HSA) was established by the HSA Foundation to address many of these shortcomings at the system-architecture and programming-model level while providing a solid foundation for already-established software models; beyond the GPU, the architecture can be extended to other specialty processors such as DSPs and FPGAs so that they interoperate within the software framework, a main task for the next level of work in the HSA Foundation. The HSA Foundation is a not-for-profit consortium of SoC and SoC-IP vendors, OEMs, academia, OSVs, and ISVs defining a consistent heterogeneous platform architecture to make it dramatically easier to program heterogeneous parallel devices such as GPUs and other accelerators. The presentation gives the audience a high-level understanding of the goals of HSA, the properties of the HSA system architecture, and its use models by system software, tools, and applications.

5 citations

Book Chapter
26 Apr 2017
TL;DR: A performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method demonstrates a good correlation between the performance attained and the extra energy required.
Abstract: We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences on multi-GPU platforms using the exact Smith-Waterman method. Speed-up factors and energy consumption are monitored across different stages of the algorithm with the goal of identifying advantageous scenarios to maximize acceleration and minimize power consumption. Experimental results using CUDA on a set of GeForce GTX 980 GPUs illustrate their capabilities as high-performance and low-power devices, with an energy cost that becomes more attractive as the number of GPUs increases. Overall, our results demonstrate a good correlation between the performance attained and the extra energy required, even in scenarios where multiple GPUs do not show great scalability.
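A minimal sketch of the measurement methodology this abstract describes: sample GPU board power while a workload runs, then divide sustained throughput by average power. This is not CUDAlign's own instrumentation; the workload stand-in and the GCUPS figure are assumptions, and the power query relies on nvidia-smi being available on the system.

```python
# Hedged sketch: poll GPU power via nvidia-smi during a workload, then
# report efficiency as throughput per average watt. Requires an NVIDIA
# GPU and driver; the workload and throughput figure are placeholders.

import subprocess
import threading
import time

samples: list[float] = []

def sample_power(stop: threading.Event, interval_s: float = 0.5) -> None:
    """Poll instantaneous board power draw (watts) for the first GPU."""
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples.append(float(out.splitlines()[0]))
        time.sleep(interval_s)

stop = threading.Event()
sampler = threading.Thread(target=sample_power, args=(stop,))
sampler.start()

# The GPU workload would run here; time.sleep stands in for a hypothetical
# run_alignment() call launching the real kernels.
time.sleep(5)

stop.set()
sampler.join()

avg_watts = sum(samples) / len(samples) if samples else float("nan")
gcups = 50.0  # assumed sustained billions of cell updates per second (GCUPS)
print(f"avg power: {avg_watts:.1f} W, efficiency: {gcups / avg_watts:.3f} GCUPS/W")
```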

5 citations

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations, 81% related
Benchmark (computing): 19.6K papers, 419.1K citations, 80% related
Programming paradigm: 18.7K papers, 467.9K citations, 77% related
Compiler: 26.3K papers, 578.5K citations, 77% related
Scalability: 50.9K papers, 931.6K citations, 76% related
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2021  14
2020  15
2019  15
2018  36
2017  25
2016  31