Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
01 Aug 2012
TL;DR: This work proposes dedicated squaring and cubing units in place of a general-purpose multiplier, reducing power consumption per computation by more than 50% for squares and more than 40% for cubes.
Abstract: With power becoming a precious resource in current VLSI systems, performance per Watt has become a more important metric than chip area. With a large number of applications benefiting from support for complex functional units like squaring and cubing, it becomes imperative that such functions be implemented in hardware. Implementing these functions using existing general-purpose multipliers in a design may yield area savings in some cases but incurs power and latency penalties. We propose dedicated hardware accelerators, squaring and cubing units, to perform squares and cubes, respectively. We study the trade-off of computing squares and cubes using a general-purpose multiplier versus dedicated units from a software perspective, and we compare area and power requirements for various bit widths. We are able to reduce power consumption per computation by more than 50% in squaring units and more than 40% in cubing units. Depending on the requirements of the applications, dedicated squaring and cubing units can also aid multipliers in improving the performance and latency of various applications.

10 citations
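One way to see why a dedicated squarer beats a general multiplier on power: for x·x the partial-product matrix is symmetric (pp[i][j] = pp[j][i]), so off-diagonal terms can be folded, roughly halving the terms that must be generated and summed, and with them the switching activity. The C sketch below is a minimal bit-level model of that folding identity, which is standard squarer arithmetic, not the paper's actual hardware design.

```c
/* Sketch: why a dedicated squarer needs fewer partial products than a
 * general multiplier. For x*x, pp[i][j] == pp[j][i], so each off-diagonal
 * pair folds into one term at weight i+j+1 (the factor of 2 is a shift).
 * Fewer partial products -> less switching activity -> lower power.
 * Illustrative model only, not the paper's RTL. */
#include <stdint.h>
#include <stdio.h>

static uint64_t square_folded(uint32_t x, int bits, int *pp_count) {
    uint64_t sum = 0;
    *pp_count = 0;
    for (int i = 0; i < bits; i++) {
        if (!((x >> i) & 1)) continue;
        sum += (uint64_t)1 << (2 * i);          /* diagonal term x_i*x_i */
        (*pp_count)++;
        for (int j = i + 1; j < bits; j++) {    /* folded off-diagonal terms */
            if ((x >> j) & 1) {
                sum += (uint64_t)1 << (i + j + 1);
                (*pp_count)++;
            }
        }
    }
    return sum;
}

int main(void) {
    uint32_t x = 0xBEEF;
    int pp;
    uint64_t sq = square_folded(x, 16, &pp);
    /* A general 16x16 multiplier generates up to 256 partial products;
       the folded squarer generates at most 16*17/2 = 136. */
    printf("%u^2 = %llu (folded partial products: %d, general: up to 256)\n",
           x, (unsigned long long)sq, pp);
    return 0;
}
```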

Proceedings ArticleDOI
12 Mar 2015
TL;DR: Demonstrates 3D die stacking, whereby disparate technologies such as CMOS logic and emerging non-volatile memory can be integrated on the same chip, enabling a new paradigm of architecture design.
Abstract: Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, rising power density is the obstacle to continued performance gains. Recent studies show that heterogeneous multi-core is a promising solution for optimizing performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses illustrate the scalability of heterogeneous systems and their potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.

10 citations
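The survey's central metric suggests a simple scheduling rule for heterogeneous multi-cores: among the cores that can meet a task's deadline, pick the one with the best performance per watt. The sketch below illustrates that rule in C; the core parameters ("big"/"little", their throughputs and power draws) are invented placeholders, not figures from the paper.

```c
/* Minimal sketch of performance-per-watt-driven core selection on a
 * heterogeneous multi-core. Core parameters are illustrative only. */
#include <stdio.h>

struct core { const char *name; double ops_per_sec; double watts; };

/* Pick the core with the best ops/joule that still meets the deadline. */
static const struct core *pick_core(const struct core *c, int n,
                                    double work_ops, double deadline_s) {
    const struct core *best = NULL;
    for (int i = 0; i < n; i++) {
        double runtime = work_ops / c[i].ops_per_sec;
        if (runtime > deadline_s) continue;           /* too slow */
        double eff = c[i].ops_per_sec / c[i].watts;   /* perf per watt */
        if (!best || eff > best->ops_per_sec / best->watts)
            best = &c[i];
    }
    return best;
}

int main(void) {
    struct core cores[] = {
        { "big",    4e9, 4.0 },   /* fast, power-hungry */
        { "little", 1e9, 0.5 },   /* slow, efficient */
    };
    const struct core *c = pick_core(cores, 2, 2e9, 1.0);
    printf("chosen core: %s\n", c ? c->name : "none");
    return 0;
}
```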

Proceedings ArticleDOI
01 Sep 2017
TL;DR: Develops a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs and shows that the performance per Watt of the KeyStone II is 4.5 times better than that of the ARM Cortex-A53.
Abstract: Future space missions require reliable architectures with higher performance and lower power consumption. Exploring new architectures worthy of undergoing the expensive and time-consuming process of radiation hardening is critical for this endeavor. Two such architectures are the Texas Instruments KeyStone II octal-core processor and the ARM® Cortex®-A53 (ARMv8) quad-core CPU. DSPs have been proven in prior space applications, and the KeyStone II has eight high-performance DSP cores and is under consideration for potential hardening for space. Meanwhile, a radiation-hardened quad-core ARM Cortex-A53 CPU is under development at Boeing under the NASA/AFRL High-Performance Spaceflight Computing initiative. In this paper, we optimize and evaluate the performance of batched 1D-FFTs, 2D-FFTs, and the Complex Ambiguity Function (CAF). We developed a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs. Our results for batched 1D-FFTs show that the performance per Watt of KeyStone II is 4.5 times better than the ARM Cortex-A53. For CAF, our results show that the KeyStone II is 1.7 times better.

10 citations
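For readers who want to reproduce the headline metric, the sketch below shows the usual arithmetic for FFT performance per watt: the standard 5·N·log2(N) flop estimate for a complex FFT, multiplied by batch size and divided by measured time and power. The timing and power values are placeholders, not the paper's measurements.

```c
/* Back-of-envelope performance-per-watt calculation for batched 1D-FFTs.
 * 5*N*log2(N) is the conventional flop count for a complex radix-2 FFT;
 * the elapsed time and power below are placeholders. Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 4096;           /* FFT length */
    double batch = 1024;       /* FFTs per run */
    double elapsed_s = 0.020;  /* measured wall time (placeholder) */
    double watts = 10.0;       /* measured board power (placeholder) */

    double flops = batch * 5.0 * n * log2(n) / elapsed_s;
    printf("throughput: %.2f GFLOPS\n", flops / 1e9);
    printf("efficiency: %.2f GFLOPS/W\n", flops / watts / 1e9);
    return 0;
}
```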

01 Jan 2004
TL;DR: This work proposes to evolve Green Destiny with a hybrid software-hardware solution, one that uses commodity AMD processors (i.e., Athlon XP-M, Athlon 64, and Opteron) to achieve better performance, coupled with AMD's "Cool'n'Quiet" technology and a novel dynamic voltage-scaling (DVS) technique to reduce power consumption by as much as 40% while impacting performance by less than 7%.
Abstract: Although the performance of supercomputers on our n-body cosmology code has improved by a factor of nearly 2000 since 1991, the performance per watt has improved only 300-fold and the performance per square foot only 65-fold. Clearly, we are building less and less efficient supercomputers, resulting in the construction of new machine rooms and even entirely new buildings. Furthermore, as these supercomputers continue to follow "Moore's Law for Power Consumption," their reliability continues to plummet, consistent with the Arrhenius equation for microelectronics. To address these problems, we built a super-efficient supercomputer dubbed Green Destiny, a 240-processor supercomputer that fits in a telephone booth (i.e., a footprint of five square feet) and sips less than 5.2 kW of power at full load [FWW02, WWF02, Feng03]. This "Supercomputer for the Rest of Us," a 2003 R&D 100 award-winning machine, provided affordable, general-purpose supercomputing to our application scientists while sitting in an 85-90°F (29-32°C) dusty warehouse at 7,400 feet (2,256 meters) above sea level. Furthermore, it delivered reliable computing cycles without any special facilities (no air conditioning, no humidification control, no air filtration, and no ventilation) and without any unscheduled downtime. However, although Green Destiny demonstrated a total price-performance ratio (ToPPeR) that was 50% better than that of a traditional Beowulf cluster or supercomputer, power efficiency (i.e., performance-power ratio) that was up to eight times better, and space efficiency (i.e., performance-space ratio) that was up to thirty times better, both its raw performance and its price/performance lagged a traditional Beowulf cluster or supercomputer by a factor of two. Thus, many would argue that Green Destiny sacrificed too much performance in achieving power and space efficiency (and thus better reliability and total cost of ownership). Therefore, we propose to evolve Green Destiny with a hybrid software-hardware solution: one that uses commodity processors from AMD (i.e., Athlon XP-M, Athlon 64, and Opteron) to achieve better performance, coupled with AMD's "Cool'n'Quiet" technology (formerly PowerNow!) and our novel dynamic voltage-scaling (DVS) technique to reduce power consumption by as much as 40% while impacting performance by less than 7%.

10 citations
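The 40% power / 7% performance trade-off follows from the standard dynamic-power model P = C·V²·f: voltage enters quadratically, so lowering the voltage-frequency operating point cuts power much faster than it cuts speed, and memory-bound codes slow down even less than the frequency ratio suggests. A minimal sketch of that arithmetic follows, with illustrative operating points rather than Green Destiny's actual P-states.

```c
/* Sketch of the arithmetic behind DVS savings. Dynamic CMOS power is
 * roughly P = C * V^2 * f; runtime grows at most with 1/f if the code
 * is fully CPU-bound. Operating points below are illustrative only. */
#include <stdio.h>

static double dyn_power(double c_eff, double volts, double hz) {
    return c_eff * volts * volts * hz;   /* P = C * V^2 * f */
}

int main(void) {
    double c_eff = 1e-9;                          /* effective capacitance (arbitrary) */
    double p_hi = dyn_power(c_eff, 1.50, 2.0e9);  /* full-speed operating point */
    double p_lo = dyn_power(c_eff, 1.20, 1.8e9);  /* scaled-down operating point */

    double power_saving = 1.0 - p_lo / p_hi;      /* ~42% for these points */
    double slowdown = 2.0e9 / 1.8e9 - 1.0;        /* ~11% worst case, CPU-bound */
    printf("power saved: %.0f%%, worst-case slowdown: %.0f%%\n",
           power_saving * 100, slowdown * 100);
    return 0;
}
```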

01 Jan 2011
TL;DR: This dissertation provides a utilization bound of 65% for independent sequential tasks, demonstrates up to 50% reduction in the required number of cores using synchronization-aware allocation, and proves a 3.42 resource augmentation bound for parallel real-time task scheduling.
Abstract: Multi-core processors are already prevalent in general-purpose computing systems with manufacturers currently offering up to a dozen cores per processor. Real-time and embedded systems adopting such processors gain increased computational capacity, improved parallelism, and higher performance per watt. However, using multi-core processors in real-time applications also introduces new challenges and opportunities for efficient scheduling and task synchronization. In this dissertation, we study this problem, characterize the design space, and develop an analytical and systems framework for multi-core real-time scheduling. Exploiting the co-located nature of processor cores, the general principle adopted in this thesis is to statically partition tasks among processor cores, co-allocate synchronizing tasks when possible, and introduce limited inter-core task migration and synchronization for improving system utilization as necessary. We model the multi-core real-time scheduling problem as a bin-packing problem and develop an object splitting algorithm for scheduling tasks on multi-core processors. We develop Highest-Priority Task Splitting (HPTS) to schedule independent sequential tasks on multi-core processors. We then analyze the overheads of inter-core task synchronization and provide mechanisms to efficiently allocate synchronizing sequential tasks on multi-cores by co-locating such tasks. We then generalize this approach to provide early solutions for scheduling parallel real-time tasks using the fork-join model. Next, we develop mechanisms to use such techniques in mixed-criticality systems. Finally, we describe the distributed resource kernel framework, where we demonstrate the practical feasibility of our approach. The results of this dissertation contribute to a system that can efficiently utilize multi-core processors to predictably execute periodic tasks with well-defined deadlines and analytically guarantee such deadlines. We provide a utilization bound of 65% for independent sequential tasks, demonstrate up to 50% reduction in the required number of cores using synchronization-aware allocation, and prove a 3.42 resource augmentation bound for parallel real-time task scheduling.

10 citations
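The dissertation's starting model, partitioned scheduling as bin packing, is easy to make concrete: tasks are items sized by their CPU utilization, cores are unit-capacity bins, and first-fit decreasing places each task on the first core with room. The C sketch below shows that baseline; the task-splitting step that HPTS adds when a task fits nowhere is only flagged, not implemented, and the task utilizations are made up for illustration.

```c
/* Minimal sketch of partitioned multi-core scheduling as bin packing:
 * first-fit decreasing over unit-capacity cores. The splitting step of
 * HPTS (for tasks that fit on no core) is deliberately omitted. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b) {
    double d = *(const double *)b - *(const double *)a;
    return (d > 0) - (d < 0);
}

int main(void) {
    double util[] = { 0.6, 0.5, 0.4, 0.3, 0.3, 0.2 };  /* task utilizations */
    int n = sizeof util / sizeof util[0];
    enum { CORES = 4 };
    double load[CORES] = { 0 };

    qsort(util, n, sizeof util[0], cmp_desc);  /* sort by decreasing utilization */
    for (int i = 0; i < n; i++) {
        int placed = 0;
        for (int c = 0; c < CORES && !placed; c++) {
            if (load[c] + util[i] <= 1.0) {    /* fits on this core? */
                load[c] += util[i];
                placed = 1;
            }
        }
        if (!placed)
            printf("task %d (u=%.1f) fits nowhere: split it (HPTS) or add a core\n",
                   i, util[i]);
    }
    for (int c = 0; c < CORES; c++)
        printf("core %d load: %.1f\n", c, load[c]);
    return 0;
}
```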

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations, 81% related
Benchmark (computing): 19.6K papers, 419.1K citations, 80% related
Programming paradigm: 18.7K papers, 467.9K citations, 77% related
Compiler: 26.3K papers, 578.5K citations, 77% related
Scalability: 50.9K papers, 931.6K citations, 76% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31