
Showing papers on "Performance per watt" published in 2021


Proceedings ArticleDOI
18 Oct 2021
TL;DR: In this article, the authors propose a configurable GPU power model called AccelWattch that can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two, models both PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS.
Abstract: Graphics Processing Units (GPUs) are rapidly dominating the accelerator space, as illustrated by their widespread adoption in the data analytics and machine learning markets. At the same time, performance per watt has emerged as a crucial evaluation metric together with peak performance. As such, GPU architects require robust tools that will enable them to model both the performance and the power consumption of modern GPUs. However, while GPU performance modeling has progressed in great strides, power modeling has lagged behind. To mitigate this problem, we propose AccelWattch, a configurable GPU power model that resolves two long-standing needs: the lack of a detailed and accurate cycle-level power model for modern GPU architectures, and the inability to capture their constant and static power with existing tools. AccelWattch can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two; models both the PTX and SASS ISAs; accounts for power gating and control-flow divergence; and supports DVFS. We integrate AccelWattch with GPGPU-Sim and Accel-Sim to facilitate its widespread use. We validate AccelWattch on an NVIDIA Volta GPU and show that it achieves strong correlation against hardware power measurements. Finally, we demonstrate that AccelWattch can enable reliable design space exploration: by directly applying AccelWattch tuned for Volta to GPU configurations resembling NVIDIA Pascal and Turing GPUs, we obtain accurate power models for these architectures.
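For intuition, a counter-driven power model of this general shape combines a static term, a constant term, and a per-event dynamic term with DVFS scaling. The sketch below is a simplification under invented coefficients and unit names, not AccelWattch's actual formulation:

    # Illustrative counter-driven GPU power model; all coefficients are made up.
    def gpu_power(rates, energy_per_event, p_static, p_const, v, f, v_ref, f_ref):
        """rates: events/second per unit; energy_per_event: joules/event at reference V/f."""
        dynamic = sum(rates[u] * energy_per_event[u] for u in rates)
        scale_dyn = (v / v_ref) ** 2 * (f / f_ref)   # dynamic power scales ~ V^2 * f
        return p_static * (v / v_ref) + p_const + dynamic * scale_dyn

    watts = gpu_power(
        rates={"fp32": 1.0e13, "l2_access": 5.0e11},            # events per second
        energy_per_event={"fp32": 4.0e-12, "l2_access": 2.0e-11},
        p_static=25.0, p_const=40.0, v=0.9, f=1.2e9, v_ref=1.0, f_ref=1.4e9,
    )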

28 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose an automated framework for the implementation of hardware-accelerated DNN architectures on FPGAs by combining custom hardware scalability with optimization strategies.
Abstract: Deep Learning techniques have been successfully applied to solve many Artificial Intelligence (AI) application problems. However, owing to topologies with many hidden layers, Deep Neural Networks (DNNs) have high computational complexity, which makes their deployment difficult in contexts highly constrained by requirements such as performance, real-time processing, or energy efficiency. Numerous hardware/software optimization techniques using GPUs, ASICs, and reconfigurable computing (i.e., FPGAs) have been proposed in the literature. With FPGAs, very specialized architectures have been developed to provide an optimal balance between high speed and low power. However, when targeting edge computing, user requirements and hardware constraints must be efficiently met. Therefore, in this work, we focus on reconfigurable embedded systems based on the Xilinx ZYNQ SoC and popular DNNs that can be implemented at the embedded edge, improving performance per watt while maintaining accuracy. In this context, we propose an automated framework for the implementation of hardware-accelerated DNN architectures. This framework provides an end-to-end solution that facilitates the efficient deployment of topologies on FPGAs by combining custom hardware scalability with optimization strategies. Comparisons with cutting-edge solutions and experimental results demonstrate that the architectures developed by our framework offer the best compromise between performance, energy consumption, and system costs. For instance, the low-power (0.266 W) DNN topologies generated for the MNIST database achieved a high throughput of 3,626 FPS.
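For scale, the reported figures work out to roughly 3,626 FPS / 0.266 W ≈ 13,600 frames per second per watt for that topology.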

13 citations


Proceedings ArticleDOI
01 Aug 2021
TL;DR: DeepFire as discussed by the authors is a high-performance RTL IP for accelerating convolutional SNN inference on modern FPGAs, which achieves up to 40.1kFPS and 28.3kFPS on the MNIST and CIFAR-10/SVHN datasets with 99.14% and 81.8%/93.1% accuracies respectively.
Abstract: Spiking neural networks (SNN), whose ‘integrate and fire’ (I&F) neurons replace the hardware-intensive multiply-accumulate (MAC) operations of convolutional neural networks (CNN) with accumulate operations, are not only easy to implement on FPGAs but also open up opportunities for energy-efficient hardware acceleration. In this paper, we propose DeepFire, a high-performance RTL IP for accelerating convolutional SNN inference. The IP exploits various resources available on modern FPGAs, and it outperforms existing SNN implementations by more than 10× in terms of both frames per second (FPS) and performance per watt (FPS/Watt). Our design achieves up to 40.1kFPS and 28.3kFPS on the MNIST and CIFAR-10/SVHN datasets with 99.14% and 81.8%/93.1% accuracies, respectively. The IP was evaluated on 7-series and Ultrascale+ FPGAs from Xilinx, achieving an Fmax of 375 MHz and 500 MHz, respectively.
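The MAC-to-accumulate substitution described above is easy to see in a minimal integrate-and-fire layer. This sketch is purely illustrative (the threshold and weights are invented; DeepFire's RTL is not reproduced here):

    # One timestep of an integrate-and-fire layer: binary spikes turn the CNN's
    # multiply-accumulate (x * w) into a plain accumulate (add w when the input spiked).
    def if_layer(spikes, weights, v_mem, threshold=1.0):
        out = []
        for j, w_row in enumerate(weights):                 # one neuron per weight row
            v_mem[j] += sum(w for w, s in zip(w_row, spikes) if s)  # accumulate only
            if v_mem[j] >= threshold:                       # fire and reset
                v_mem[j] = 0.0
                out.append(1)
            else:
                out.append(0)
        return out

    v = [0.0, 0.0]
    print(if_layer([1, 0, 1], [[0.6, 0.2, 0.5], [0.1, 0.9, 0.3]], v))  # -> [1, 0]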

10 citations


Journal ArticleDOI
TL;DR: In this paper, on-chip power usage is estimated for 3D-TTN and compared with various other networks, along with an analysis of static network performance.
Abstract: Green computing is an important factor in ensuring the eco-friendly use of computers and their resources. Electric power used in a computer is converted into heat, so a system that draws fewer watts needs less cooling. This lower energy consumption makes the system less costly to run and reduces the environmental impact of powering it. One of the most challenging problems for modern green supercomputers is reducing current power consumption, and regular conventional interconnection networks show poor cost performance. Hierarchical interconnection networks (like 3D-TTN) are a possible solution to these issues. The main focus of this paper is the estimation of on-chip power usage for 3D-TTN compared with various other networks, along with an analysis of static network performance. In our analysis, 3D-TTN requires about 32.48% less router power at the on-chip level and achieves about 21% better diameter and about 12% better average distance than the 5D-Torus network. Similarly, it requires only about 14.43% more router power than the recent hierarchical interconnection network 3D-TESH, while achieving 23.21% better diameter and 26.3% better average distance. A distinctive feature of this paper is its static hop distance and per-watt (power-performance) analysis. According to our power-performance results, 3D-TTN also outperforms the 3D-Mesh, 2D-Mesh, 2D-Torus, and 3D-TESH networks even at the lowest network level. Moreover, the paper includes a static effectiveness analysis, which establishes the cost and time efficiency of 3D-TTN.
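The diameter and average-distance figures compared above are standard static metrics: shortest-path hop counts over the topology graph. Since the 3D-TTN construction itself is not given here, the generic sketch below computes both by breadth-first search for a small 2D mesh instead:

    from collections import deque

    def hop_stats(adj):
        """Diameter and average hop distance of an unweighted graph (dict: node -> neighbors)."""
        nodes, dists = list(adj), []
        for src in nodes:
            d, q = {src: 0}, deque([src])
            while q:                          # breadth-first search from src
                u = q.popleft()
                for v in adj[u]:
                    if v not in d:
                        d[v] = d[u] + 1
                        q.append(v)
            dists += [d[v] for v in nodes if v != src]
        return max(dists), sum(dists) / len(dists)

    # Example: a 4x4 2D mesh (diameter 6).
    N = 4
    idx = lambda x, y: x * N + y
    mesh = {idx(x, y): [idx(a, b) for a, b in ((x-1,y),(x+1,y),(x,y-1),(x,y+1))
                        if 0 <= a < N and 0 <= b < N]
            for x in range(N) for y in range(N)}
    print(hop_stats(mesh))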

9 citations


Journal ArticleDOI
TL;DR: This article discusses the rigid power, performance, and flexibility tradeoffs that BCI designers must balance, and how HALO’s palette of domain-specific hardware accelerators, general-purpose microcontroller, and configurable interconnect overcomes them.
Abstract: We are building HALO, a flexible ultralow-power processing architecture for implantable brain–computer interfaces (BCIs) that directly communicate with biological neurons in real time. This article discusses the rigid power, performance, and flexibility tradeoffs that BCI designers must balance, and how we overcome them via HALO’s palette of domain-specific hardware accelerators, general-purpose microcontroller, and configurable interconnect. Our evaluations using neuronal data collected in vivo from a nonhuman primate, along with full-stack algorithm-to-chip codesign, show that HALO achieves flexibility and superior performance per watt versus existing implantable BCIs.

6 citations


Book ChapterDOI
06 Jul 2021
TL;DR: In this paper, the authors explore the use of control theory for the design of a dynamic power regulation method for current and future HPC architectures, which is based on periodically monitoring application progress and choosing at runtime a suitable power cap for processors.
Abstract: Production high-performance computing systems continue to grow in complexity and size. As applications struggle to make use of increasingly heterogeneous compute nodes, maintaining high efficiency (performance per watt) for the whole platform becomes a challenge. Alongside the growing complexity of scientific workloads, this extreme heterogeneity is also an opportunity: as applications dynamically undergo variations in workload, due to phases or data/compute movement between devices, one can dynamically adjust power across compute elements to save energy without impacting performance. With an aim toward an autonomous and dynamic power management strategy for current and future HPC architectures, this paper explores the use of control theory for the design of a dynamic power regulation method. Structured as a feedback loop, our approach—which is novel in computing resource management—consists of periodically monitoring application progress and choosing at runtime a suitable power cap for processors. Thanks to a preliminary offline identification process, we derive a model of the dynamics of the system and a proportional-integral (PI) controller. We evaluate our approach on top of an existing resource management framework, the Argo Node Resource Manager, deployed on several clusters of Grid’5000, using a standard memory-bound HPC benchmark.
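A feedback loop of the shape described can be sketched as a generic PI power-cap controller. Here `read_progress` and `set_power_cap` are hypothetical placeholders for platform-specific interfaces, and real gains would come from the offline identification step the abstract mentions:

    # Sketch of a PI feedback loop that picks a processor power cap each period.
    # read_progress() and set_power_cap() are hypothetical platform hooks.
    def pi_power_loop(read_progress, set_power_cap, setpoint, kp, ki,
                      cap_min, cap_max, periods):
        integral = 0.0
        for _ in range(periods):
            error = setpoint - read_progress()   # > 0 means the app runs too slow
            integral += error                    # accumulated shortfall
            cap = kp * error + ki * integral     # PI control law
            set_power_cap(max(cap_min, min(cap_max, cap)))  # clamp to valid range

At steady state the integral term settles at the lowest cap that keeps measured progress at the setpoint, which is where the energy saving comes from.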

6 citations


Proceedings ArticleDOI
01 Sep 2021
TL;DR: In this paper, a parameterizable vector processing unit (VPU) is presented based on a subset of V-extension from the RISC-V instruction set architecture (ISA) for embedded processing.
Abstract: The computational intensity of embedded processing applications is increasing. This requires domain-specific embedded platforms to achieve maximum system performance per watt. With the arrival of open-source instruction set architectures such as RISC-V, and of domain-specific architecture development toolchains, the trend toward application-specific architectures is growing. In this paper, a parameterizable Vector Processing Unit (VPU) is presented based on a subset of the V extension of the RISC-V instruction set architecture (ISA) for embedded processing. The two key configurable parameters of the proposed VPU are the vector length (VLEN) and the number of execution lanes. These parameters allow design space exploration of the VPU across different configurations and help identify which application scenarios fit certain configurations. The proposed VPU was integrated into a 32-bit RISC-V processor. For the maximum parallelization configuration, 2.3× fewer cycles per instruction were achieved compared to the base RISC-V processor. Moreover, a relative cycle gain of 33-73% was achieved for the different configurations compared with the base RISC-V processor.
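To see why VLEN and lane count trade off against each other, a first-order cycle model helps. This is illustrative only: it treats VLEN as elements per vector register and ignores pipelining and memory stalls:

    import math

    # First-order cycle estimate for a strip-mined vector loop:
    # each vector instruction covers `vlen` elements and issues `lanes` per cycle.
    def vector_cycles(n_elements, vlen, lanes):
        beats_per_instr = math.ceil(vlen / lanes)   # cycles to drain one instruction
        n_instr = math.ceil(n_elements / vlen)      # instructions to cover the data
        return n_instr * beats_per_instr

    for lanes in (1, 2, 4, 8):
        print(lanes, vector_cycles(n_elements=1024, vlen=128, lanes=lanes))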

5 citations


Journal ArticleDOI
TL;DR: This paper proposes a methodology to predict the power consumption and performance of groups of concurrently executing applications at all available frequencies of a CMP; it outperforms three state-of-the-art resource managers, yielding the highest performance per watt in all evaluated use cases.
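Given such per-frequency predictions, the final selection step reduces to an argmax of the predicted ratio; a minimal sketch with made-up numbers:

    # Pick the frequency whose predicted performance per watt is highest.
    predictions = {  # freq (GHz) -> (predicted throughput, predicted watts); invented values
        1.2: (80.0, 35.0),
        1.8: (110.0, 55.0),
        2.4: (130.0, 90.0),
    }
    best = max(predictions, key=lambda f: predictions[f][0] / predictions[f][1])
    print(best)  # 1.2 here: 80/35 ~ 2.29 beats 2.00 and 1.44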

3 citations


Journal ArticleDOI
Vivek Bhardwaj
TL;DR: This paper looks at some of the techniques and efficient algorithms from ASIC software that have resulted in better performance per watt, a key metric in the FPGA world.
Abstract: With the ever-rising demand for FPGA-based applications and increasing semiconductor complexity of late, the techniques and efficient algorithms of ASIC software have trickled down to FPGAs as well. In this paper, we look at some of these techniques that have resulted in better performance per watt, a key metric in the FPGA world. We also briefly compare the ASIC and FPGA design flows and FPGA architecture, connect the dots, and make users better aware of the challenges FPGA designers face in implementing a certain design technique and how the software tries to overcome those challenges.

1 citation


Proceedings ArticleDOI
01 Feb 2021
TL;DR: In this article, a low-power processing-in-memory inference accelerator is proposed for subject-independent EEG signal classification, built on a compact neural network model that can be extended to multiple BCI tasks with minimal overhead.
Abstract: State-of-the-art deep neural networks (DNNs) for electroencephalography (EEG) signal classification focus on subject-related tasks, in which the test data and the training data need to be collected from the same subject. In addition, due to limited computing resources and strict power budgets at the edge, it is very challenging to deploy the inference of such DNN models on biological devices. In this work, we present an algorithm/hardware co-designed low-power accelerator for subject-independent EEG signal classification. We propose a compact neural network that is capable of identifying the common and stable structure among subjects. Based on it, we realize a robust subject-independent EEG signal classification model that can be extended to multiple BCI tasks with minimal overhead. On top of this model, we present RAISE, a low-power processing-in-memory inference accelerator that leverages emerging resistive memory. We compare the proposed model and hardware accelerator to prior art across various BCI paradigms. We show that our model achieves the best subject-independent classification accuracy, while RAISE achieves a 2.8× power reduction and a 2.5× improvement in performance per watt compared to the state-of-the-art resistive inference accelerator.
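For scale, if the 2.8× power reduction and the 2.5× performance-per-watt improvement are both measured against the same baseline, then, since performance per watt is performance divided by power, the implied raw throughput ratio is about 2.5/2.8 ≈ 0.89, i.e. roughly comparable performance at far lower power.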

Proceedings ArticleDOI
05 Dec 2021
TL;DR: In this paper, the authors propose an information-theoretic framework referred to as PaRMIS to create Pareto-optimal resource management policies for given target applications and design objectives.
Abstract: Mobile system-on-chips (SoCs) are growing in their complexity and heterogeneity (e.g., Arm’s Big-Little architecture) to meet the needs of emerging applications, including games and artificial intelligence. This makes it very challenging to optimally manage the resources (e.g., controlling the number and frequency of different types of cores) at runtime to meet the desired trade-offs among multiple objectives such as performance and energy. This paper proposes a novel information-theoretic framework referred to as PaRMIS to create Pareto-optimal resource management policies for given target applications and design objectives. PaRMIS specifies parametric policies to manage resources and learns statistical models from candidate policy evaluation data in the form of target design objective values. The key idea is to select a candidate policy for evaluation in each iteration guided by statistical models that maximize the information gain about the true Pareto front. Experiments on a commercial heterogeneous SoC show that PaRMIS achieves better Pareto fronts and is easily usable to optimize complex objectives (e.g., performance per Watt) when compared to prior methods.
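The Pareto front being approximated is simply the set of non-dominated objective points; a minimal non-dominated filter (illustrative only, not PaRMIS's machinery):

    # Minimal non-dominated filter over (performance, energy) points.
    def pareto_front(points):
        """Keep points with no other point that is >= on performance and <= on energy."""
        return [p for p in points
                if not any(q[0] >= p[0] and q[1] <= p[1] and q != p for q in points)]

    print(pareto_front([(10, 5.0), (12, 7.0), (9, 6.0), (12, 4.5)]))  # -> [(12, 4.5)]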

Proceedings ArticleDOI
28 May 2021
TL;DR: In this paper, a multi-core reconfigurable architecture for particle transport is proposed, which consists of heterogeneous lightweight cores, a reconfigurable cache structure, and high-bandwidth memory.
Abstract: Random simulation for particle transport theory is the main method for solving particle transport problems and is widely used in medicine and computational physics. In this work, we present a multi-core reconfigurable architecture that aims to meet the performance per watt requirements of future Domain Specific Architectures (DSAs). The architecture proposed in this paper consists of heterogeneous lightweight cores, a reconfigurable cache structure, and High Bandwidth Memory. By targeting the different feature requirements of the Monte Carlo transport code at its different stages, we equip the lightweight compute cores with only the necessary, efficient features, and provide an ongoing trade-off between performance and energy consumption through reconfiguration. We designed and validated the accelerator architecture using gem5. Experiments show that, compared with a traditional architecture composed of multiple out-of-order cores, this architecture achieves a more than 3× improvement in performance per watt. Some of the conclusions explored are not limited to the architecture proposed in this paper, and lay the foundation for further studies of large-scale transport accelerators.