
Performance per watt

About: Performance per watt is a research topic. Over the lifetime of the topic, 315 publications have been published, receiving 5,778 citations.


Papers
Proceedings ArticleDOI
Farah Fargo, Cihan Tunc, Youssif Al-Nashif, Ali Akoglu, Salim Hariri
08 Sep 2014
TL;DR: This paper presents an autonomic power and performance management method for cloud systems that dynamically matches application requirements with "just-enough" system resources at runtime, leading to significant power reduction while meeting the quality-of-service requirements of cloud applications.
Abstract: The power consumption of data centers and cloud systems has increased almost threefold between 2007 and 2012. Over-provisioning techniques are typically used to meet peak workloads. In this paper we present an autonomic power and performance management method for cloud systems that dynamically matches application requirements with "just-enough" system resources at runtime, leading to significant power reduction while meeting the quality-of-service (QoS) requirements of the cloud applications. Our solution offers the following capabilities: 1) real-time monitoring of the cloud resources and the workload behavior running on virtual machines (VMs), 2) determination of the current operating point of both the workloads and the VMs running them, 3) characterization of workload behavior and prediction of the next operating point for the VMs, 4) dynamic management of VM resources (scaling up and down the number of cores, CPU frequency, and memory amount) at runtime, and 5) assignment of available cloud resources that guarantee optimal power consumption without sacrificing the QoS requirements of cloud workloads. We validate the performance of our approach using the RUBiS benchmark, an auction model emulating eBay transactions that generates a wide range of workloads (such as browsing and bidding with different numbers of clients). Our experimental results show that our approach can reduce power consumption by up to 87% compared to a static resource allocation strategy, 72% compared to an adaptive frequency scaling strategy, and 66% compared to a similar multi-resource management strategy.

25 citations
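The "just-enough" provisioning loop described above monitors VM behavior, predicts the next operating point, and resizes resources accordingly. The following is a minimal Python sketch of that control loop under assumed utilization thresholds; all names (VmConfig, next_operating_point, size_resources) are hypothetical, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class VmConfig:
    cores: int      # number of vCPUs
    freq_mhz: int   # CPU frequency
    mem_mb: int     # memory allocation

def next_operating_point(history):
    """Predict the next utilization level from recent samples; a simple
    moving average stands in for the paper's workload characterization."""
    window = history[-4:]
    return sum(window) / len(window)

def size_resources(predicted_util, current):
    """Scale resources so predicted utilization lands in an assumed
    60-80% target band, mirroring dynamic core/frequency/memory
    management at runtime."""
    if predicted_util > 0.8:
        return VmConfig(current.cores + 1, current.freq_mhz, current.mem_mb)
    if predicted_util < 0.6 and current.cores > 1:
        return VmConfig(current.cores - 1, current.freq_mhz, current.mem_mb)
    return current

cfg = VmConfig(cores=2, freq_mhz=2400, mem_mb=4096)
cfg = size_resources(next_operating_point([0.70, 0.90, 0.85, 0.90]), cfg)
print(cfg)  # predicted load ~0.84 > 0.8, so the sketch scales up to 3 cores
```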

Proceedings ArticleDOI
09 Mar 2020
TL;DR: Fleet is presented, a framework that offers a massively parallel streaming model for FPGAs and is effective in a number of domains well-suited to FPGA acceleration, including parsing, compression, and machine learning.
Abstract: We present Fleet, a framework that offers a massively parallel streaming model for FPGAs and is effective in a number of domains well-suited for FPGA acceleration, including parsing, compression, and machine learning. Fleet requires the user to specify RTL for a processing unit that serially processes every input token in a stream, a far simpler task than writing a parallel processing unit. It then takes the user's processing unit and generates a hardware design with many copies of the unit as well as memory controllers to feed the units with separate streams and drain their outputs. Fleet includes a Chisel-based processing unit language. The language maintains Chisel's low-level performance control while adding a few productivity features, including automatic handling of ready-valid signaling and a native and automatically pipelined BRAM type. We evaluate Fleet on six different applications, including JSON parsing and integer compression, fitting hundreds of Fleet processing units on the Amazon F1 FPGA and outperforming CPU implementations by over 400x and GPU implementations by over 9x in performance per watt while requiring a similar number of lines of code.

25 citations
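Fleet's key idea, per the abstract above, is that the user writes only a serial per-token processing unit and the framework replicates it across many independent streams. A rough Python analogy of that programming model follows; Fleet itself uses a Chisel-based RTL language, and these function names are illustrative, not Fleet's API:

```python
def make_unit(process_token):
    """Wrap a serial per-token function into a stream processor with
    its own private state, like one Fleet processing unit."""
    def run(stream):
        state = {}
        for token in stream:
            out = process_token(token, state)
            if out is not None:
                yield out
    return run

def replicate(process_token, streams):
    """Instantiate one unit per stream; on the FPGA the copies run in
    parallel, here we simply iterate over them."""
    unit = make_unit(process_token)
    return [list(unit(s)) for s in streams]

# Example unit: running maximum over each stream.
def running_max(token, state):
    state["max"] = max(token, state.get("max", token))
    return state["max"]

print(replicate(running_max, [[3, 1, 4], [1, 5, 9]]))  # [[3, 3, 4], [1, 5, 9]]
```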

Journal ArticleDOI
TL;DR: This work proposes a novel decoupled access-execute CGRA design called CASCADE with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory.
Abstract: A Coarse-Grained Reconfigurable Array (CGRA) is a promising high-performance, low-power accelerator for compute-intensive loop kernels. While the mapping of computations onto the CGRA is a well-studied problem, bringing data into the array at high throughput remains a challenge. A conventional CGRA design involves on-array computations to generate memory addresses for data access, undermining the attainable throughput. A decoupled access-execute architecture, on the other hand, isolates memory access from the actual computations, resulting in significantly higher throughput. We propose a novel decoupled access-execute CGRA design called CASCADE with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory. CASCADE offloads the address computations for multi-bank data memory access to custom-designed programmable hardware. An end-to-end, fully-automated compiler synchronizes the conflict-free movement of data between the memory banks and the CGRA. Experimental evaluations show on average a 3× performance benefit and a 2.2× performance-per-watt improvement for CASCADE compared to an iso-area conventional CGRA that uses a bigger processing array in lieu of dedicated memory address generation hardware.

25 citations
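The decoupling CASCADE exploits can be pictured as two independent engines: one that generates addresses and streams operands, and one that only computes. A minimal Python sketch of that access-execute split, with invented parameters standing in for the programmable address generation hardware:

```python
def address_generator(base, stride, count):
    """Access engine: emits a strided address pattern without involving
    the compute array in any address arithmetic."""
    for i in range(count):
        yield base + i * stride

def execute_unit(operands):
    """Execute engine: consumes a prefetched operand stream and only
    computes (here, a reduction)."""
    acc = 0
    for value in operands:
        acc += value
    return acc

memory = list(range(100))  # stand-in for the on-chip multi-bank memory
addresses = address_generator(base=0, stride=4, count=10)
print(execute_unit(memory[a] for a in addresses))  # 180: sums memory[0,4,...,36]
```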

Proceedings ArticleDOI
01 Feb 2018
TL;DR: This paper proposes photonic interconnects for heterogeneous multicores using a checkerboard pattern that clusters CPU-GPU cores together and implements bandwidth reconfiguration using local router information without global coordination, and proposes a dynamic laser scaling technique that predicts the power level for the next epoch using the buffer occupancy of the previous epoch.
Abstract: As communication energy exceeds computation energy in future technologies, traditional on-chip electrical interconnects face fundamental challenges in the many-core era. Photonic interconnects have been proposed as a disruptive technology solution due to their superior performance per watt, distance-independent energy consumption, and CMOS compatibility for on-chip interconnects. Static power due to the laser being always switched on, varying link utilization due to spatial and temporal traffic fluctuations, and thermal sensitivity are some of the critical challenges facing photonic interconnects. In this paper, we propose photonic interconnects for heterogeneous multicores using a checkerboard pattern that clusters CPU-GPU cores together and implements bandwidth reconfiguration using local router information without global coordination. To reduce the static power, we also propose a dynamic laser scaling technique that predicts the power level for the next epoch using the buffer occupancy of the previous epoch. To further improve power-performance trade-offs, we also propose a regression-based machine learning technique for scaling the power of the photonic link. Our simulation results demonstrate a 34% performance improvement over a baseline electrical CMESH while consuming 25% less energy per bit when dynamically reallocating bandwidth. When dynamically scaling laser power, our buffer-based reactive and ML-based proactive prediction techniques show 40-65% power savings with 0-14% throughput loss depending on the reservation window size.

24 citations
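The buffer-based reactive scheme in the abstract above picks the laser power level for the next epoch from how full the router buffers were in the previous one. A hedged Python sketch with invented occupancy thresholds and power levels:

```python
def next_laser_level(mean_occupancy, levels=(0.25, 0.50, 0.75, 1.00)):
    """Map mean buffer occupancy in [0, 1] from the last epoch to a
    discrete laser power level for the next epoch. Thresholds are
    assumptions, not the paper's calibrated values."""
    if mean_occupancy > 0.75:
        return levels[3]   # near-saturated buffers: full laser power
    if mean_occupancy > 0.50:
        return levels[2]
    if mean_occupancy > 0.25:
        return levels[1]
    return levels[0]       # mostly idle link: deepest laser scaling

print(next_laser_level(0.62))  # 0.75 of full power for a moderately busy epoch
```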

Proceedings Article
01 Jan 2011
TL;DR: A programmable, pattern-based memory controller (PMC) that aims at improving the performance of heterogeneous or reconfigurable SoC devices by supporting scatter-gather and strided 1D, 2D, and 3D access patterns.
Abstract: Heterogeneous architectures are increasingly popular due to their flexibility and high performance-per-watt capability. One kind of heterogeneous architecture, the reconfigurable system-on-chip, offers high performance per watt through its reconfigurable logic and flexibility through its multiprocessor cores. But in order to achieve these performance goals, it is necessary to supply the accelerators with enough data. In this paper we describe a programmable, pattern-based memory controller (PMC) that aims at improving the performance of heterogeneous or reconfigurable SoC devices. The supported access patterns include scatter-gather and strided 1D, 2D, and 3D patterns. PMC can prefetch complete patterns into scratchpads that can then be accessed either by a microprocessor or by an accelerator. As a result, the microprocessors and accelerators can focus on computation and are relieved of having to perform address calculations. PMC has been implemented and tested on an ML505 evaluation board using the MicroBlaze soft core as the platform's microprocessor. While PMC adds some latency, it improves performance by offloading the processor and by making better use of the available bandwidth. PMC provides a 1.5x speedup when paired with the processor and a 27x speedup when using a hardware accelerator in the PMC-based SoC environment while executing a thresholding application.

23 citations
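A PMC-style pattern can be thought of as a descriptor that the controller expands into addresses, gathering data into a scratchpad so the processor or accelerator reads it linearly. A small Python sketch of a strided 2D pattern prefetch; the descriptor fields are illustrative, not the PMC's actual interface:

```python
from dataclasses import dataclass

@dataclass
class Pattern2D:
    base: int        # start address (element index)
    width: int       # elements per row of the region
    height: int      # number of rows
    row_stride: int  # distance between consecutive row starts

def prefetch(memory, p):
    """Gather a strided 2D region into a contiguous scratchpad buffer,
    relieving the compute unit of all address calculations."""
    scratchpad = []
    for r in range(p.height):
        start = p.base + r * p.row_stride
        scratchpad.extend(memory[start:start + p.width])
    return scratchpad

mem = list(range(64))  # an 8x8 row-major array
tile = prefetch(mem, Pattern2D(base=9, width=3, height=3, row_stride=8))
print(tile)  # [9, 10, 11, 17, 18, 19, 25, 26, 27]: a 3x3 tile
```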

Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (81% related)
Benchmark (computing): 19.6K papers, 419.1K citations (80% related)
Programming paradigm: 18.7K papers, 467.9K citations (77% related)
Compiler: 26.3K papers, 578.5K citations (77% related)
Scalability: 50.9K papers, 931.6K citations (76% related)
Metrics
No. of papers in the topic in previous years:

Year    Papers
2021    14
2020    15
2019    15
2018    36
2017    25
2016    31