Topic

Performance per watt

About: Performance per watt is a research topic. Over its lifetime, 315 publications have been published within this topic, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
01 Aug 2012
TL;DR: This work proposes dedicated squaring and cubing units in place of a general-purpose multiplier, reducing power consumption per computation by more than 50% for squares and more than 40% for cubes.
Abstract: With power becoming a precious resource in current VLSI systems, performance per Watt has become a more important metric than chip area. With a large number of applications benefiting from support for complex functional units like squaring and cubing, it becomes imperative that such functions be implemented in hardware. Implementing these functions using existing general-purpose multipliers in a design may yield area savings in some cases but incurs power and latency penalties. We propose dedicated hardware accelerators, squaring and cubing units, to perform squares and cubes, respectively. We study the trade-off of computing squares and cubes using a general-purpose multiplier versus dedicated units from a software perspective, and we compare area and power requirements for various bit widths. We are able to reduce power consumption per computation by more than 50% in squaring units and more than 40% in cubing units. Depending on the requirements of the applications, dedicated squaring and cubing units can also aid multipliers in improving the performance and latency of various applications.

10 citations
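One way to see why a dedicated squarer beats a general multiplier on power: for x·x the partial-product matrix is symmetric (pp[i][j] = pp[j][i]), so off-diagonal terms can be folded, roughly halving the terms that must be generated and summed, and with them the switching activity. The C sketch below is a minimal bit-level model of that folding identity, which is standard squarer arithmetic, not the paper's actual hardware design.

```c
/* Sketch: why a dedicated squarer needs fewer partial products than a
 * general multiplier. For x*x, pp[i][j] == pp[j][i], so each off-diagonal
 * pair folds into one term at weight i+j+1 (the factor of 2 is a shift).
 * Fewer partial products -> less switching activity -> lower power.
 * Illustrative model only, not the paper's RTL. */
#include <stdint.h>
#include <stdio.h>

static uint64_t square_folded(uint32_t x, int bits, int *pp_count) {
    uint64_t sum = 0;
    *pp_count = 0;
    for (int i = 0; i < bits; i++) {
        if (!((x >> i) & 1)) continue;
        sum += (uint64_t)1 << (2 * i);          /* diagonal term x_i*x_i */
        (*pp_count)++;
        for (int j = i + 1; j < bits; j++) {    /* folded off-diagonal terms */
            if ((x >> j) & 1) {
                sum += (uint64_t)1 << (i + j + 1);
                (*pp_count)++;
            }
        }
    }
    return sum;
}

int main(void) {
    uint32_t x = 0xBEEF;
    int pp;
    uint64_t sq = square_folded(x, 16, &pp);
    /* A general 16x16 multiplier generates up to 256 partial products;
       the folded squarer generates at most 16*17/2 = 136. */
    printf("%u^2 = %llu (folded partial products: %d, general: up to 256)\n",
           x, (unsigned long long)sq, pp);
    return 0;
}
```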

Proceedings ArticleDOI
12 Mar 2015
TL;DR: Demonstrates 3D die stacking, whereby disparate technologies such as CMOS logic and emerging non-volatile memory can be integrated on the same chip, enabling a new paradigm of architecture design.
Abstract: Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, rising power density is the obstacle to continued performance gains. Recent studies show that heterogeneous multi-core is a promising solution for optimizing performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses illustrate the scalability of heterogeneous systems and their potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.

10 citations
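The survey's central metric suggests a simple scheduling rule for heterogeneous multi-cores: among the cores that can meet a task's deadline, pick the one with the best performance per watt. The sketch below illustrates that rule in C; the core parameters ("big"/"little", their throughputs and power draws) are invented placeholders, not figures from the paper.

```c
/* Minimal sketch of performance-per-watt-driven core selection on a
 * heterogeneous multi-core. Core parameters are illustrative only. */
#include <stdio.h>

struct core { const char *name; double ops_per_sec; double watts; };

/* Pick the core with the best ops/joule that still meets the deadline. */
static const struct core *pick_core(const struct core *c, int n,
                                    double work_ops, double deadline_s) {
    const struct core *best = NULL;
    for (int i = 0; i < n; i++) {
        double runtime = work_ops / c[i].ops_per_sec;
        if (runtime > deadline_s) continue;           /* too slow */
        double eff = c[i].ops_per_sec / c[i].watts;   /* perf per watt */
        if (!best || eff > best->ops_per_sec / best->watts)
            best = &c[i];
    }
    return best;
}

int main(void) {
    struct core cores[] = {
        { "big",    4e9, 4.0 },   /* fast, power-hungry */
        { "little", 1e9, 0.5 },   /* slow, efficient */
    };
    const struct core *c = pick_core(cores, 2, 2e9, 1.0);
    printf("chosen core: %s\n", c ? c->name : "none");
    return 0;
}
```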

Proceedings ArticleDOI
01 Sep 2017
TL;DR: Develops a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs and shows that the performance per Watt of the KeyStone II is 4.5 times better than that of the ARM Cortex-A53.
Abstract: Future space missions require reliable architectures with higher performance and lower power consumption. Exploring new architectures worthy of undergoing the expensive and time-consuming process of radiation hardening is critical for this endeavor. Two such architectures are the Texas Instruments KeyStone II octal-core processor and the ARM® Cortex®-A53 (ARMv8) quad-core CPU. DSPs have been proven in prior space applications, and the KeyStone II has eight high-performance DSP cores and is under consideration for potential hardening for space. Meanwhile, a radiation-hardened quad-core ARM Cortex-A53 CPU is under development at Boeing under the NASA/AFRL High-Performance Spaceflight Computing initiative. In this paper, we optimize and evaluate the performance of batched 1D-FFTs, 2D-FFTs, and the Complex Ambiguity Function (CAF). We developed a direct memory-access scheme to take advantage of the complex KeyStone architecture for FFTs. Our results for batched 1D-FFTs show that the performance per Watt of KeyStone II is 4.5 times better than the ARM Cortex-A53. For CAF, our results show that the KeyStone II is 1.7 times better.

10 citations
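For readers who want to reproduce the headline metric, the sketch below shows the usual arithmetic for FFT performance per watt: the standard 5·N·log2(N) flop estimate for a complex FFT, multiplied by batch size and divided by measured time and power. The timing and power values are placeholders, not the paper's measurements.

```c
/* Back-of-envelope performance-per-watt calculation for batched 1D-FFTs.
 * 5*N*log2(N) is the conventional flop count for a complex radix-2 FFT;
 * the elapsed time and power below are placeholders. Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 4096;           /* FFT length */
    double batch = 1024;       /* FFTs per run */
    double elapsed_s = 0.020;  /* measured wall time (placeholder) */
    double watts = 10.0;       /* measured board power (placeholder) */

    double flops = batch * 5.0 * n * log2(n) / elapsed_s;
    printf("throughput: %.2f GFLOPS\n", flops / 1e9);
    printf("efficiency: %.2f GFLOPS/W\n", flops / watts / 1e9);
    return 0;
}
```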

01 Jan 2004
TL;DR: This work proposes to evolve Green Destiny with a hybrid software-hardware solution, one that uses commodity AMD processors (i.e., Athlon XP-M, Athlon 64, and Opteron) to achieve better performance, coupled with AMD's "Cool'n'Quiet" technology and a novel dynamic voltage-scaling (DVS) technique to reduce power consumption by as much as 40% while impacting performance by less than 7%.
Abstract: Although the performance of supercomputers on our n-body cosmology code has improved by a factor of nearly 2000 since 1991, the performance per watt has improved only 300-fold and the performance per square foot only 65-fold. Clearly, we are building less and less efficient supercomputers, resulting in the construction of new machine rooms and even entirely new buildings. Furthermore, as these supercomputers continue to follow "Moore's Law for Power Consumption," their reliability continues to plummet, consistent with the Arrhenius equation for microelectronics. To address these problems, we built a super-efficient supercomputer dubbed Green Destiny, a 240-processor supercomputer that fits in a telephone booth (i.e., a footprint of five square feet) and sips less than 5.2 kW of power at full load [FWW02, WWF02, Feng03]. This "Supercomputer for the Rest of Us," a 2003 R&D 100 award-winning machine, provided affordable, general-purpose supercomputing to our application scientists while sitting in an 85-90°F (29-32°C) dusty warehouse at 7,400 feet (2,256 meters) above sea level. Furthermore, it delivered reliable computing cycles without any special facilities (no air conditioning, no humidification control, no air filtration, and no ventilation) and without any unscheduled downtime. However, although Green Destiny demonstrated a total price-performance ratio (ToPPeR) that was 50% better than that of a traditional Beowulf cluster or supercomputer, power efficiency (i.e., performance-power ratio) that was up to eight times better, and space efficiency (i.e., performance-space ratio) that was up to thirty times better, both its raw performance and its price/performance lagged a traditional Beowulf cluster or supercomputer by a factor of two. Thus, many would argue that Green Destiny sacrificed too much performance in achieving power and space efficiency (and thus better reliability and total cost of ownership). Therefore, we propose to evolve Green Destiny with a hybrid software-hardware solution: one that uses commodity processors from AMD (i.e., Athlon XP-M, Athlon 64, and Opteron) to achieve better performance, coupled with AMD's "Cool'n'Quiet" technology (formerly PowerNow!) and our novel dynamic voltage-scaling (DVS) technique to reduce power consumption by as much as 40% while impacting performance by less than 7%.

10 citations
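The 40% power / 7% performance trade-off follows from the standard dynamic-power model P = C·V²·f: voltage enters quadratically, so lowering the voltage-frequency operating point cuts power much faster than it cuts speed, and memory-bound codes slow down even less than the frequency ratio suggests. A minimal sketch of that arithmetic follows, with illustrative operating points rather than Green Destiny's actual P-states.

```c
/* Sketch of the arithmetic behind DVS savings. Dynamic CMOS power is
 * roughly P = C * V^2 * f; runtime grows at most with 1/f if the code
 * is fully CPU-bound. Operating points below are illustrative only. */
#include <stdio.h>

static double dyn_power(double c_eff, double volts, double hz) {
    return c_eff * volts * volts * hz;   /* P = C * V^2 * f */
}

int main(void) {
    double c_eff = 1e-9;                          /* effective capacitance (arbitrary) */
    double p_hi = dyn_power(c_eff, 1.50, 2.0e9);  /* full-speed operating point */
    double p_lo = dyn_power(c_eff, 1.20, 1.8e9);  /* scaled-down operating point */

    double power_saving = 1.0 - p_lo / p_hi;      /* ~42% for these points */
    double slowdown = 2.0e9 / 1.8e9 - 1.0;        /* ~11% worst case, CPU-bound */
    printf("power saved: %.0f%%, worst-case slowdown: %.0f%%\n",
           power_saving * 100, slowdown * 100);
    return 0;
}
```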

01 Jan 2011
TL;DR: This dissertation provides a utilization bound of 65% for independent sequential tasks, demonstrates up to 50% reduction in the required number of cores using synchronization-aware allocation, and proves a 3.42 resource augmentation bound for parallel real-time task scheduling.
Abstract: Multi-core processors are already prevalent in general-purpose computing systems with manufacturers currently offering up to a dozen cores per processor. Real-time and embedded systems adopting such processors gain increased computational capacity, improved parallelism, and higher performance per watt. However, using multi-core processors in real-time applications also introduces new challenges and opportunities for efficient scheduling and task synchronization. In this dissertation, we study this problem, characterize the design space, and develop an analytical and systems framework for multi-core real-time scheduling. Exploiting the co-located nature of processor cores, the general principle adopted in this thesis is to statically partition tasks among processor cores, co-allocate synchronizing tasks when possible, and introduce limited inter-core task migration and synchronization for improving system utilization as necessary. We model the multi-core real-time scheduling problem as a bin-packing problem and develop an object splitting algorithm for scheduling tasks on multi-core processors. We develop Highest-Priority Task Splitting (HPTS) to schedule independent sequential tasks on multi-core processors. We then analyze the overheads of inter-core task synchronization and provide mechanisms to efficiently allocate synchronizing sequential tasks on multi-cores by co-locating such tasks. We then generalize this approach to provide early solutions for scheduling parallel real-time tasks using the fork-join model. Next, we develop mechanisms to use such techniques in mixed-criticality systems. Finally, we describe the distributed resource kernel framework, where we demonstrate the practical feasibility of our approach. The results of this dissertation contribute to a system that can efficiently utilize multi-core processors to predictably execute periodic tasks with well-defined deadlines and analytically guarantee such deadlines. We provide a utilization bound of 65% for independent sequential tasks, demonstrate up to 50% reduction in the required number of cores using synchronization-aware allocation, and prove a 3.42 resource augmentation bound for parallel real-time task scheduling.

10 citations
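The dissertation's starting model, partitioned scheduling as bin packing, is easy to make concrete: tasks are items sized by their CPU utilization, cores are unit-capacity bins, and first-fit decreasing places each task on the first core with room. The C sketch below shows that baseline; the task-splitting step that HPTS adds when a task fits nowhere is only flagged, not implemented, and the task utilizations are made up for illustration.

```c
/* Minimal sketch of partitioned multi-core scheduling as bin packing:
 * first-fit decreasing over unit-capacity cores. The splitting step of
 * HPTS (for tasks that fit on no core) is deliberately omitted. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b) {
    double d = *(const double *)b - *(const double *)a;
    return (d > 0) - (d < 0);
}

int main(void) {
    double util[] = { 0.6, 0.5, 0.4, 0.3, 0.3, 0.2 };  /* task utilizations */
    int n = sizeof util / sizeof util[0];
    enum { CORES = 4 };
    double load[CORES] = { 0 };

    qsort(util, n, sizeof util[0], cmp_desc);  /* sort by decreasing utilization */
    for (int i = 0; i < n; i++) {
        int placed = 0;
        for (int c = 0; c < CORES && !placed; c++) {
            if (load[c] + util[i] <= 1.0) {    /* fits on this core? */
                load[c] += util[i];
                placed = 1;
            }
        }
        if (!placed)
            printf("task %d (u=%.1f) fits nowhere: split it (HPTS) or add a core\n",
                   i, util[i]);
    }
    for (int c = 0; c < CORES; c++)
        printf("core %d load: %.1f\n", c, load[c]);
    return 0;
}
```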

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations, 81% related
Benchmark (computing): 19.6K papers, 419.1K citations, 80% related
Programming paradigm: 18.7K papers, 467.9K citations, 77% related
Compiler: 26.3K papers, 578.5K citations, 77% related
Scalability: 50.9K papers, 931.6K citations, 76% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31