
Showing papers on "Performance per watt" published in 2019


Journal ArticleDOI
TL;DR: An efficient deep Q-learning methodology to optimize the performance per watt (PPW) is proposed and experiments show that the PPW achieved by the proposed approach is within 1 percent of the optimal value obtained by an oracle.
Abstract: Heterogeneous multiprocessor system-on-chips (SoCs) provide a wide range of parameters that can be managed dynamically. For example, one can control the type (big/little), number and frequency of active cores in state-of-the-art mobile processors at runtime. These runtime choices lead to more than 10× range in execution time, 5× range in power consumption, and 50× range in performance per watt. Therefore, it is crucial to make optimum power management decisions as a function of dynamically varying workloads at runtime. This paper presents a reinforcement learning approach for dynamically controlling the number and frequency of active big and little cores in mobile processors. We propose an efficient deep Q-learning methodology to optimize the performance per watt (PPW). Experiments using the Odroid XU3 mobile platform show that the PPW achieved by the proposed approach is within 1 percent of the optimal value obtained by an oracle.
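The control loop the abstract describes can be sketched as a minimal tabular Q-learning toy. The paper uses deep Q-learning on a real Odroid XU3; the action space, the synthetic performance/power model, and every constant below are illustrative assumptions, not the paper's setup.

```python
import random

# Hypothetical action space: (number of big cores, frequency in GHz).
ACTIONS = [(b, f) for b in (1, 2, 4) for f in (0.8, 1.4, 2.0)]

def ppw_reward(action):
    """Toy stand-in for a measured performance-per-watt reward."""
    big, freq = action
    perf = big * freq                # pretend performance scales linearly
    power = 0.5 + big * freq ** 2    # pretend power grows quadratically in f
    return perf / power

def q_learn(episodes=2000, alpha=0.1, eps=0.2, seed=0):
    """Bandit-style Q-learning over power-management configurations."""
    rng = random.Random(seed)
    q = {a: 0.0 for a in ACTIONS}
    for _ in range(episodes):
        # epsilon-greedy: mostly exploit the best known config, sometimes explore
        a = rng.choice(ACTIONS) if rng.random() < eps else max(q, key=q.get)
        q[a] += alpha * (ppw_reward(a) - q[a])   # move estimate toward reward
    return max(q, key=q.get)

best = q_learn()
```

With this toy model the learner settles on the widest low-frequency configuration, mirroring the common intuition that PPW often peaks well below the maximum frequency.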

48 citations


Journal ArticleDOI
TL;DR: This work proposes a novel decoupled access-execute CGRA design called CASCADE with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory.
Abstract: A Coarse-Grained Reconfigurable Array (CGRA) is a promising high-performance low-power accelerator for compute-intensive loop kernels. While the mapping of the computations on the CGRA is a well-studied problem, bringing the data into the array at a high throughput remains a challenge. A conventional CGRA design involves on-array computations to generate memory addresses for data access, undermining the attainable throughput. A decoupled access-execute architecture, on the other hand, isolates the memory access from the actual computations, resulting in a significantly higher throughput. We propose a novel decoupled access-execute CGRA design called CASCADE with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory. CASCADE offloads the address computations for the multi-bank data memory access to custom-designed programmable hardware. An end-to-end fully-automated compiler synchronizes the conflict-free movement of data between the memory banks and the CGRA. Experimental evaluations show on average a 3× performance benefit and a 2.2× performance-per-watt improvement for CASCADE compared to an iso-area conventional CGRA with a bigger processing array in lieu of dedicated hardware memory address generation logic.
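The access-execute decoupling can be illustrated in a few lines: an access stream does all the address arithmetic against banked memory, while the execute side consumes pure data. This is a schematic analogy, not CASCADE's hardware; the bank-interleaving scheme is an assumption.

```python
def to_banks(flat, num_banks):
    """Interleave a flat array across num_banks memory banks."""
    return [flat[b::num_banks] for b in range(num_banks)]

def access_stream(banks, length):
    """Access side: all address arithmetic lives here, off the compute array."""
    num_banks = len(banks)
    for addr in range(length):
        # consecutive addresses hit different banks, so reads do not conflict
        yield banks[addr % num_banks][addr // num_banks]

def execute(stream_a, stream_b):
    """Execute side: a multiply-accumulate kernel with no address math."""
    return sum(a * b for a, b in zip(stream_a, stream_b))

a, b = list(range(12)), [2] * 12
dot = execute(access_stream(to_banks(a, 4), 12),
              access_stream(to_banks(b, 4), 12))
```

Keeping `execute` free of address computation is exactly what lets a real decoupled design dedicate the full processing array to the kernel.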

25 citations


Proceedings ArticleDOI
22 Jun 2019
TL;DR: This paper presents an adaptive CPU based on Intel SkyLake that closes the loop to deployment, and provides a novel mechanism for post-silicon customization, and shows how to optimize PPW using models trained to different SLAs or to specific applications, e.g. to improve datacenter hardware in situ.
Abstract: Processors that adapt architecture to workloads at runtime promise compelling performance per watt (PPW) gains, offering one way to mitigate diminishing returns from pipeline scaling. State-of-the-art adaptive CPUs deploy machine learning (ML) models on-chip to optimize hardware by recognizing workload patterns in event counter data. However, despite breakthrough PPW gains, such designs are not yet widely adopted due to the potential for systematic adaptation errors in the field. This paper presents an adaptive CPU based on Intel SkyLake that (1) closes the loop to deployment, and (2) provides a novel mechanism for post-silicon customization. Our CPU performs predictive cluster gating, dynamically setting the issue width of a clustered architecture while clock-gating unused resources. Gating decisions are driven by ML adaptation models that execute on an existing microcontroller, minimizing design complexity and allowing performance characteristics to be adjusted with the ease of a firmware update. Crucially, we show that although adaptation models can suffer from statistical blindspots that risk degrading performance on new workloads, these can be reduced to minimal impact with careful design and training. Our adaptive CPU improves PPW by 31.4% over a comparable non-adaptive CPU on SPEC2017, and exhibits two orders of magnitude fewer Service Level Agreement (SLA) violations than the state-of-the-art. We show how to optimize PPW using models trained to different SLAs or to specific applications, e.g. to improve datacenter hardware in situ. The resulting CPU meets real world deployment criteria for the first time and provides a new means to tailor hardware to individual customers, even as their needs change.
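Predictive cluster gating driven by event counters can be sketched as below. The paper's ML models, the counters it reads, and its thresholds are not public; the decision rule, feature names, and cluster geometry here are all assumptions used only to make the mechanism concrete.

```python
def predict_issue_width(counters):
    """Stand-in for the on-chip adaptation model: a hand-written rule over
    an ILP estimate and a memory-boundedness estimate (thresholds assumed)."""
    ilp = counters["retired_uops"] / max(counters["cycles"], 1)
    mem_bound = counters["l2_misses"] / max(counters["retired_uops"], 1)
    if mem_bound > 0.05:   # memory-bound phase: a narrow machine suffices
        return 2
    if ilp > 2.5:          # high-ILP phase: keep the machine wide
        return 8
    return 4               # moderate ILP: middle ground

def cluster_gates(width, max_width=8, cluster_width=2):
    """Map an issue width to per-cluster clock-gate signals (True = gated)."""
    active = width // cluster_width
    return [i >= active for i in range(max_width // cluster_width)]
```

Because the rule runs in firmware rather than fixed logic, retraining against a new SLA amounts to shipping new thresholds, which is the post-silicon customization angle the paper emphasizes.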

16 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: The proposed STT-RAM based neurosynaptic core designed in 28 nm technology node has approximately 6× higher throughput per unit Watt and unit area than an equivalent SRAM based design and achieves ∼ 2× higher performance per Watt compared to other memristive neural network accelerator designs in the literature.
Abstract: In this paper, we propose a Spin Transfer Torque RAM (STT-RAM) based neurosynaptic core to implement a hardware accelerator for Spiking Neural Networks (SNNs), which mimic the time-based signal encoding and processing mechanisms of the human brain. The computational core consists of a crossbar array of non-volatile STT-RAMs, read/write peripheral circuits, and digital logic for the spiking neurons. Inter-core communication is realized through on-chip routing network by sending/receiving spike packets. Unlike prior works that use multi-level states of non-volatile memory (NVM) devices for the synaptic weights, we use the technologically-mature STT-RAM devices for binary data storage. The design studies are conducted using a compact model for STT-RAM devices, tuned to capture the state-of-the-art experimental results. Our design avoids the need for expensive ADCs and DACs, enabling instantiation of large NVM arrays for our core. We show that the STT-RAM based neurosynaptic core designed in 28 nm technology node has approximately 6× higher throughput per unit Watt and unit area than an equivalent SRAM based design. Our design also achieves ∼ 2× higher performance per Watt compared to other memristive neural network accelerator designs in the literature.
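The payoff of binary (one STT-RAM cell per weight) synapses is that the crossbar column sum degenerates to a popcount, with no DACs or ADCs. A minimal software sketch of one timestep of such a layer, with leaky integrate-and-fire neurons, is below; the threshold and leak constants are illustrative, not the paper's.

```python
def lif_step(potentials, in_spikes, weights, threshold=2.0, leak=0.5):
    """One timestep of a layer of leaky integrate-and-fire neurons whose
    synapses are single bits, as with one STT-RAM cell per weight."""
    out = []
    for n, row in enumerate(weights):
        # with 0/1 weights, the crossbar column sum is just a popcount
        charge = sum(w & s for w, s in zip(row, in_spikes))
        potentials[n] = potentials[n] * leak + charge
        if potentials[n] >= threshold:
            out.append(1)
            potentials[n] = 0.0   # reset membrane potential after a spike
        else:
            out.append(0)
    return out

weights = [[1, 1, 0], [0, 1, 1]]   # two neurons, three binary synapses each
potentials = [0.0, 0.0]
spikes = lif_step(potentials, [1, 1, 0], weights)
```

Neuron 0 receives two coincident spikes through set synapses and fires; neuron 1 receives one and merely accumulates charge for later timesteps.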

15 citations


Journal ArticleDOI
TL;DR: A survey about GPUs from two perspectives is provided: architectural advances to improve performance and programmability and advances to enhance CPU–GPU integration in heterogeneous systems.

10 citations


Book ChapterDOI
09 Apr 2019
TL;DR: This paper evaluates the resource utilizations, performance, and performance per watt of the implementations of the LULESH kernels in OpenCL on an Arria10-based FPGA platform and finds that the FPGA, constrained by the memory bandwidth, can perform 1.05X to 3.4X better than the CPU and GPU for small problem sizes.
Abstract: FPGAs are becoming promising heterogeneous computing components for high-performance computing. In this paper, we evaluate the resource utilizations, performance, and performance per watt of our implementations of the LULESH kernels in OpenCL on an Arria10-based FPGA platform. LULESH is a complex proxy application in the CORAL benchmark suite. We choose two representative kernels “CalcFBHourglassForceForElems” and “EvalEOSForElems” from the application in our study. Compared with the baseline implementations, our optimizations improve the performance by a factor of 1.65X and 2.96X for the two kernels on the FPGA, respectively. Using directives for accelerator programming, we also evaluate the performance of the kernels on an Intel Xeon 16-core CPU and an Nvidia K80 GPU. We find that the FPGA, constrained by the memory bandwidth, can perform 1.05X to 3.4X better than the CPU and GPU for small problem sizes. For the first kernel, the performance per watt on the FPGA is 1.59X and 7.1X higher than that on an Intel Xeon 16-core CPU and an Nvidia K80 GPU, respectively. For the second kernel, the performance per watt on the GPU is 1.82X higher than that on the FPGA. However, the performance per watt on the FPGA is 1.77X higher than that on the CPU.

3 citations


Patent
27 Jun 2019
TL;DR: In this paper, a technique for accessing memory in an accelerated processing device coupled to stacked memory dies is presented, which includes receiving a memory access request from an execution unit and identifying whether the access request corresponds to memory cells of the stacked dies that are considered local to the execution unit or not local.
Abstract: A technique for accessing memory in an accelerated processing device coupled to stacked memory dies is provided herein. The technique includes receiving a memory access request from an execution unit and identifying whether the memory access request corresponds to memory cells of the stacked dies that are considered local to the execution unit or non-local. For local accesses, the access is made “directly”, that is, without using a bus. A control die coordinates operations for such local accesses, activating particular through-silicon-vias associated with the memory cells that include the data for the access. Non-local accesses are made via a distributed cache fabric and an interconnect bus in the control die. Various other features and details are provided below.
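The local/non-local routing decision can be made concrete with a small sketch. The address-to-die mapping below is a made-up assumption (the patent does not specify one); it exists only to show the two paths: direct TSV access coordinated by the control die for local cells, and the interconnect bus plus cache fabric otherwise.

```python
DIE_SPAN = 1 << 28   # hypothetical: 256 MiB of memory cells per stacked die

def route_access(addr, local_die):
    """Decide how an execution unit's request reaches the stacked memory:
    directly over TSVs when the target die is local to the unit, or via
    the control die's interconnect bus (and cache fabric) otherwise."""
    target_die = addr // DIE_SPAN
    if target_die == local_die:
        return ("direct-tsv", target_die)
    return ("interconnect-bus", target_die)
```

The point of the split is that the common, local case skips the bus entirely, which is where the latency and power savings would come from.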

2 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: An approach for on-line energy-efficiency analysis when executing OpenMP workloads on multicore systems is proposed, using a specific neural network model derived from the popular auto-encoder that is capable of understanding the application profile and tracking phase changes at run-time.
Abstract: Energy efficiency has been a major challenge in compute systems over the last decade, in both the embedded and high-performance computing domains. Much effort is currently being spent on devising solutions capable of providing systems with the best compromises in terms of performance and power consumption. In this paper, we propose an approach for on-line energy-efficiency analysis when executing OpenMP workloads on multicore systems. The novelty of our approach lies in the ability to monitor energy efficiency at runtime without prior knowledge of the application profile or code annotation. The solution relies on two new metrics: Chunks per Second (CpS) and Chunks per Joule (CpJ). The former captures the quantity of work achieved by threads per unit time (i.e., a performance indicator). The latter indicates the quantity of work achieved by threads per unit energy, which also corresponds to the performance per watt (i.e., an energy-efficiency indicator). As most programs are made of several phases performing different computations for which CpS and CpJ cannot be related, it is crucial to detect phase changes so as to perform intra-phase energy-efficiency optimizations. For that purpose, we devise a specific neural network model, derived from the auto-encoder widely explored in the machine-learning community, that is capable of understanding the application profile and tracking phase changes at run-time. We show that these new metrics enable energy-efficiency optimization, and we illustrate our approach on the analysis of the SRAD application from the Rodinia benchmark suite. The energy-efficiency profile analysis of the application is conducted on both Intel and ARM platforms, showing the flexibility of the approach.
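The two metrics follow directly from their definitions, and are linked through average power: CpJ = CpS / watts. The readings in the snippet below are invented for illustration.

```python
def chunks_per_second(chunks, elapsed_s):
    """CpS: OpenMP loop chunks completed per unit time (performance)."""
    return chunks / elapsed_s

def chunks_per_joule(chunks, energy_j):
    """CpJ: chunks completed per unit energy (performance per watt)."""
    return chunks / energy_j

# Invented sample readings: 1200 chunks over 2 s while drawing 60 J total.
chunks, elapsed_s, energy_j = 1200, 2.0, 60.0
cps = chunks_per_second(chunks, elapsed_s)   # throughput indicator
cpj = chunks_per_joule(chunks, energy_j)     # energy-efficiency indicator
avg_watts = energy_j / elapsed_s             # the quantity linking the two
```

Because both metrics count the same work units, comparing CpJ across two phases is only meaningful within a phase, which is why the paper pairs the metrics with phase detection.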

2 citations


Proceedings ArticleDOI
20 Feb 2019
TL;DR: This paper evaluates and optimize the OpenCL implementations of three nuclear reactor simulation applications (XSBench, RSBench, and SimpleMOC kernel) on a heterogeneous computing platform that consists of a general-purpose CPU and an FPGA.
Abstract: Field-programmable gate arrays (FPGAs) are becoming a promising choice as a heterogeneous computing component for scientific computing when floating-point optimized architectures are added to the current FPGAs. The maturing high-level synthesis (HLS) tools, such as Intel FPGA SDK for OpenCL, provide a streamlined design flow to facilitate parallel application on FPGAs. In this paper, we evaluate and optimize the OpenCL implementations of three nuclear reactor simulation applications (XSBench, RSBench, and SimpleMOC kernel) on a heterogeneous computing platform that consists of a general-purpose CPU and an FPGA. We introduce the applications, and describe their OpenCL implementations and optimization methods on an Arria10-based FPGA platform. Compared with the baseline kernel implementations, our optimizations increase the performance of the three kernels by a factor of 35, 295, and 102, respectively. We compare the performance, power, and performance per watt of the three applications on an Intel Xeon 16-core CPU, an Nvidia Tesla K80 GPU, and an Intel Arria10 GX1150 FPGA. The performance per watt on the FPGA is competitive. For XSBench, the performance per watt on the FPGA is 1.43X higher than that on the CPU, and 2.58X lower than that on the GPU. For RSBench, the performance per watt on the FPGA is 3.6X higher than that on the CPU, and 5.8X lower than that on the GPU. For SimpleMOC kernel, the performance per watt on the FPGA is 1.74X higher than that on the CPU, and 1.65X lower than that on the GPU.
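The cross-device comparisons above all reduce to one ratio. A tiny helper makes that explicit; the throughput and power numbers below are invented for illustration and are not the paper's measurements.

```python
def perf_per_watt(ops_per_second, watts):
    """Performance per watt: useful work rate divided by power draw."""
    return ops_per_second / watts

def ppw_ratio(device_a, device_b):
    """How many times higher device_a's perf/watt is than device_b's."""
    return perf_per_watt(*device_a) / perf_per_watt(*device_b)

fpga = (1.0e9, 40.0)     # hypothetical (lookups/s, watts) pair
cpu = (2.8e9, 160.0)     # hypothetical (lookups/s, watts) pair
ratio = ppw_ratio(fpga, cpu)
```

Note that a device can lose on raw performance yet win on this metric, which is the pattern the paper reports for the FPGA against the CPU.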

1 citation


Journal ArticleDOI
TL;DR: This paper investigates the performance improvement of hot spares to see if they can be used to improve performance per watt (PPW) in multi-core single-instruction, multiple-thread (SIMT) processors over different applications and observes that hot sparing is effective for specific types of SIMT processor configurations (small and medium sized).

1 citation


Journal ArticleDOI
TL;DR: Experimental results show that the ring-based data fabric can reduce read latencies and power consumption, and compare the approach with Wide I/O, which is designed for power-constrained systems.
Abstract: As computer memory increases in size and processors continue to get faster, the memory subsystem becomes a bottleneck to system performance. To mitigate the relatively slow dynamic random access memory (DRAM) chip speeds, a new generation of 3-D stacked DRAM is being developed, with lower power consumption and higher bandwidth. This paper proposes the use of 3-D ring-based data fabrics for fast data transfer between the chips in the 3-D stacked DRAM. The ring-based data fabric uses a fast standing wave oscillator to clock its transactions. With a fast clocking scheme and multiple channels sharing the same bus, more channels are utilized while significantly reducing the number of through-silicon vias. Our memory architecture using a ring-based scheme (MARS) can effectively trade off power, throughput, and latency to improve the system performance for different application spaces. Experimental results show that our ring-based data fabric can reduce read latencies and power consumption. MARS variants can deliver better latency (up to ~4×), power (up to ~8×), and performance per watt (up to ~8×) over high bandwidth memory. We also compare our approach with Wide I/O, which is designed for power-constrained systems. MARS variants provide better latency (up to ~8×) with similar performance per watt.

Book ChapterDOI
16 Oct 2019
TL;DR: A division of the inner loop of the stencil-based code is proposed such that total latency is reduced using memory-partition and pipeline directives; the approach is demonstrated on the two-dimensional Laplace equation implemented on a ZedBoard and an Ultra96 board using Vivado HLS.
Abstract: Iterative stencil computations are present in many scientific and engineering applications. The acceleration of stencil codes using parallel architectures has been widely studied. The parallelization of the stencil computation on FPGA-based heterogeneous architectures has been reported with the use of traditional RTL logic design or the use of directives in C/C++ codes on high-level synthesis tools. In both cases, it has been shown that FPGAs provide better performance per watt compared to CPU- or GPU-based systems. High-level synthesis tools are limited to the use of parallelization directives without evaluating other possibilities of applying them based on adapting the algorithm. This document proposes a division of the inner loop of the stencil-based code in such a way that total latency is reduced using memory-partition and pipeline directives. As a case study, the two-dimensional Laplace equation is implemented on a ZedBoard and an Ultra96 board using Vivado HLS. The performance is evaluated according to the number of inner-loop divisions and the on-chip memory partitions, in terms of latency, power consumption, use of FPGA resources, and speed-up.
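The loop division can be shown in pure Python on the same 2-D Laplace (Jacobi) stencil. The split factor and grid are illustrative; on an FPGA each chunk would be paired with its own on-chip memory partition, but the arithmetic is unchanged, so the split must produce bit-identical results.

```python
def jacobi_step_split(grid, splits=2):
    """One Jacobi sweep for the 2-D Laplace equation with the inner loop
    divided into `splits` chunks, mimicking the loop division that lets
    HLS tools pair each chunk with its own on-chip memory partition."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]               # boundary values carried over
    chunk = (cols - 2 + splits - 1) // splits    # ceil-divide the inner range
    for i in range(1, rows - 1):
        for s in range(splits):                  # the divided inner loop
            lo = 1 + s * chunk
            hi = min(lo + chunk, cols - 1)
            for j in range(lo, hi):
                new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                    + grid[i][j - 1] + grid[i][j + 1])
    return new

# Toy 4x4 grid: three cold rows above one hot boundary row.
grid = [[0.0] * 4 for _ in range(3)] + [[100.0] * 4]
stepped = jacobi_step_split(grid, splits=2)
```

In hardware, the win comes from the split chunks being pipelined against disjoint memory partitions; functionally the division is a no-op, as the equality check below confirms.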

Patent
07 Feb 2019
TL;DR: In this article, a real-time correlation between a power consumption of a processor and an operating frequency of the processor is determined, and the operating frequency is set to a value based on the first and second realtime correlations.
Abstract: Systems, apparatuses and methods may provide for technology that determines a first real-time correlation between a power consumption of a processor and an operating frequency of the processor, determines a second real-time correlation between a performance level of the processor and the operating frequency of the processor, and sets the operating frequency of the processor to a value based on the first and second real-time correlations. In one example, the performance level or performance per watt of the processor decreases at one or more operating frequencies greater than the value.
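The selection rule in the patent amounts to picking the frequency that maximizes performance per watt, so that perf/watt decreases at frequencies above the chosen value. A sketch with invented stand-ins for the two runtime correlations (a quadratic power model with an idle floor, and a saturating performance model):

```python
def best_frequency(freqs, power_of, perf_of):
    """Pick the operating frequency that maximizes performance per watt;
    above this value, perf/watt declines, matching the patent's claim."""
    return max(freqs, key=lambda f: perf_of(f) / power_of(f))

# Invented models standing in for the two real-time correlations:
power_model = lambda f: 5.0 + 1.0 * f ** 2    # watts: idle floor + f^2 term
perf_model = lambda f: f / (1.0 + 0.15 * f)   # performance saturates at high f

freqs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
chosen = best_frequency(freqs, power_model, perf_model)
```

With these particular models the sweet spot sits in the middle of the range: below it, the idle power floor dominates; above it, the quadratic power term outruns the saturating performance.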

Proceedings ArticleDOI
04 Sep 2019
TL;DR: This paper chooses a random network of Hodgkin-Huxley neurons with exponential synaptic conductance to evaluate the performance of the simulation of networks of spiking neurons on an FPGA and identifies a computationally intensive kernel in the generated C++ code, converts the kernel to a portable OpenCL kernel, and describes the optimizations which can reduce the resource utilizations and improve the kernel performance.
Abstract: Field-programmable gate arrays (FPGAs) are becoming a promising choice as a heterogeneous computing component when floating-point optimized architectures are added to the current FPGAs. The maturing high-level synthesis tools offer a streamlined design flow for researchers to develop a parallel application using a high-level language on FPGAs. In this paper, we choose a random network of Hodgkin-Huxley (HH) neurons with exponential synaptic conductance to evaluate the performance of the simulation of networks of spiking neurons on an FPGA. Focused on the conductance-based HH benchmark, we execute the benchmark on a general-purpose simulator for spiking neural networks, identify a computationally intensive kernel in the generated C++ code, convert the kernel to a portable OpenCL kernel, and describe the optimizations which can reduce the resource utilizations and improve the kernel performance. We evaluate the kernel on an Intel Arria 10 based FPGA platform, an Intel Xeon 16-core CPU, an Intel Xeon 4-core low-power processor with a CPU and a GPU integrated on the same chip, and an NVIDIA Tesla P100 discrete GPU. For the kernel execution time, the Arria 10 GX1150 FPGA is 2X and 3X faster than the two CPUs, but it is 2.5X and 4.8X slower than the two GPUs, respectively. The FPGA consumes the least power, but its performance per watt is 1.56X and 1.96X lower than the two GPUs, respectively.
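The per-neuron state update that dominates such simulations looks roughly like the forward-Euler Hodgkin-Huxley step below, using the classic squid-axon constants. This is a generic HH sketch, not the paper's generated kernel; the exponential synaptic conductances of the benchmark are omitted, with an external current standing in for synaptic drive.

```python
import math

# Classic HH rate functions (voltages in mV, rates in 1/ms).
def alpha_n(v): return 0.01 * (v + 55) / (1 - math.exp(-(v + 55) / 10))
def beta_n(v):  return 0.125 * math.exp(-(v + 65) / 80)
def alpha_m(v): return 0.1 * (v + 40) / (1 - math.exp(-(v + 40) / 10))
def beta_m(v):  return 4.0 * math.exp(-(v + 65) / 18)
def alpha_h(v): return 0.07 * math.exp(-(v + 65) / 20)
def beta_h(v):  return 1.0 / (1 + math.exp(-(v + 35) / 10))

def hh_step(state, i_ext, dt=0.01):
    """Advance (v, n, m, h) by dt ms under external current i_ext (uA/cm^2)."""
    v, n, m, h = state
    g_na, g_k, g_l = 120.0, 36.0, 0.3      # channel conductances, mS/cm^2
    e_na, e_k, e_l = 50.0, -77.0, -54.4    # reversal potentials, mV
    i_na = g_na * m ** 3 * h * (v - e_na)
    i_k = g_k * n ** 4 * (v - e_k)
    i_l = g_l * (v - e_l)
    dv = i_ext - i_na - i_k - i_l          # membrane capacitance = 1 uF/cm^2
    dn = alpha_n(v) * (1 - n) - beta_n(v) * n
    dm = alpha_m(v) * (1 - m) - beta_m(v) * m
    dh = alpha_h(v) * (1 - h) - beta_h(v) * h
    return (v + dt * dv, n + dt * dn, m + dt * dm, h + dt * dh)
```

Each neuron needs several `exp` evaluations per timestep, which is why floating-point-optimized FPGA architectures and GPUs are the natural targets for this kernel.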

Proceedings ArticleDOI
01 Nov 2019
TL;DR: An on-chip power management technique is developed that makes use of particle swarm optimization (PSO) to improve the performance per watt of the circuit while maintaining the power integrity.
Abstract: An on-chip power management technique is developed that makes use of particle swarm optimization (PSO) to improve the performance per watt of the circuit while maintaining the power integrity. On-line learning is applied to determine the optimum reference voltages of the on-chip voltage regulators set through the PSO to reduce the energy consumption of the system while preventing any timing failure due to process variation, voltage variation, temperature, and aging. The runtime adaptive voltage delivery technique is applicable to any processor architecture. Simulation results on a streaming multiprocessor similar to the NVIDIA GV100 GPU in a 7 nm FinFET technology indicate an average reduction of 35%, 40%, and 5% in, respectively, the power consumption, the threshold voltage drift, and the operating temperature as compared to existing techniques that implement static voltage guardbands.
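The PSO piece can be sketched in a few lines: a swarm searches for the lowest regulator reference voltage that still meets timing. The cost model (quadratic dynamic power with a hard timing floor at 0.6 V), the bounds, and the PSO coefficients below are all assumptions, not the paper's.

```python
import random

def power_cost(vref):
    """Toy cost: dynamic power grows with V^2; timing fails below 0.6 V."""
    if vref < 0.6:
        return float("inf")   # timing violation: never accept this setting
    return vref ** 2

def pso(cost, lo, hi, particles=20, iters=60, seed=1):
    """Bare-bones particle swarm optimization over a single scalar."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(particles)]
    vs = [0.0] * particles
    pbest = xs[:]                      # each particle's best position so far
    gbest = min(xs, key=cost)          # the swarm's best position so far
    for _ in range(iters):
        for i in range(particles):
            r1, r2 = rng.random(), rng.random()
            vs[i] = (0.5 * vs[i]
                     + 1.5 * r1 * (pbest[i] - xs[i])
                     + 1.5 * r2 * (gbest - xs[i]))
            xs[i] = min(max(xs[i] + vs[i], lo), hi)
            if cost(xs[i]) < cost(pbest[i]):
                pbest[i] = xs[i]
        gbest = min(pbest, key=cost)
    return gbest

vref = pso(power_cost, 0.5, 1.0)
```

The swarm converges toward the 0.6 V timing floor: the lowest reference voltage that avoids a timing failure, which is where the energy savings over a static guardband come from.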