
Showing papers on "Performance per watt" published in 2021


Proceedings ArticleDOI
18 Oct 2021
TL;DR: In this article, the authors propose a configurable GPU power model called AccelWattch that can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two, models both PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS.
Abstract: Graphics Processing Units (GPUs) are rapidly dominating the accelerator space, as illustrated by their widespread adoption in the data analytics and machine learning markets. At the same time, performance per watt has emerged as a crucial evaluation metric together with peak performance. As such, GPU architects require robust tools that will enable them to model both the performance and the power consumption of modern GPUs. However, while GPU performance modeling has progressed in great strides, power modeling has lagged behind. To mitigate this problem, we propose AccelWattch, a configurable GPU power model that resolves two long-standing needs: the lack of a detailed and accurate cycle-level power model for modern GPU architectures, and the inability to capture their constant and static power with existing tools. AccelWattch can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two; models both the PTX and SASS ISAs; accounts for power gating and control-flow divergence; and supports DVFS. We integrate AccelWattch with GPGPU-Sim and Accel-Sim to facilitate its widespread use. We validate AccelWattch on an NVIDIA Volta GPU and show that it achieves strong correlation against hardware power measurements. Finally, we demonstrate that AccelWattch can enable reliable design space exploration: by directly applying AccelWattch tuned for Volta to GPU configurations resembling NVIDIA Pascal and Turing GPUs, we obtain accurate power models for these architectures.
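For intuition, a counter-driven power model of this general shape combines a static term, a constant term, and a per-event dynamic term with DVFS scaling. The sketch below is a simplification under invented coefficients and unit names, not AccelWattch's actual formulation:

    # Illustrative counter-driven GPU power model; all coefficients are made up.
    def gpu_power(rates, energy_per_event, p_static, p_const, v, f, v_ref, f_ref):
        """rates: events/second per unit; energy_per_event: joules/event at reference V/f."""
        dynamic = sum(rates[u] * energy_per_event[u] for u in rates)
        scale_dyn = (v / v_ref) ** 2 * (f / f_ref)   # dynamic power scales ~ V^2 * f
        return p_static * (v / v_ref) + p_const + dynamic * scale_dyn

    watts = gpu_power(
        rates={"fp32": 1.0e13, "l2_access": 5.0e11},            # events per second
        energy_per_event={"fp32": 4.0e-12, "l2_access": 2.0e-11},
        p_static=25.0, p_const=40.0, v=0.9, f=1.2e9, v_ref=1.0, f_ref=1.4e9,
    )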

28 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose an automated framework for the implementation of hardware-accelerated DNN architectures on FPGAs by combining custom hardware scalability with optimization strategies.
Abstract: Deep Learning techniques have been successfully applied to solve many Artificial Intelligence (AI) application problems. However, owing to topologies with many hidden layers, Deep Neural Networks (DNNs) have high computational complexity, which makes their deployment difficult in contexts highly constrained by requirements such as performance, real-time processing, or energy efficiency. Numerous hardware/software optimization techniques using GPUs, ASICs, and reconfigurable computing (i.e., FPGAs) have been proposed in the literature. With FPGAs, very specialized architectures have been developed to provide an optimal balance between high speed and low power. However, when targeting edge computing, user requirements and hardware constraints must be efficiently met. Therefore, in this work, we focus on reconfigurable embedded systems based on the Xilinx ZYNQ SoC and popular DNNs that can be implemented at the embedded edge, improving performance per watt while maintaining accuracy. In this context, we propose an automated framework for the implementation of hardware-accelerated DNN architectures. This framework provides an end-to-end solution that facilitates the efficient deployment of topologies on FPGAs by combining custom hardware scalability with optimization strategies. Comparisons with cutting-edge solutions and experimental results demonstrate that the architectures developed by our framework offer the best compromise between performance, energy consumption, and system costs. For instance, the low-power (0.266 W) DNN topologies generated for the MNIST database achieved a high throughput of 3,626 FPS.
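For scale, the reported figures work out to roughly 3,626 FPS / 0.266 W ≈ 13,600 frames per second per watt for that topology.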

13 citations


Proceedings ArticleDOI
01 Aug 2021
TL;DR: DeepFire as discussed by the authors is a high-performance RTL IP for accelerating convolutional SNN inference on modern FPGAs, which achieves up to 40.1kFPS and 28.3kFPS on the MNIST and CIFAR-10/SVHN datasets with 99.14% and 81.8%/93.1% accuracies respectively.
Abstract: Spiking neural networks (SNN), whose ‘integrate and fire’ (I&F) neurons replace the hardware-intensive multiply-accumulate (MAC) operations of convolutional neural networks (CNN) with accumulate operations, are not only easy to implement on FPGAs but also open up opportunities for energy-efficient hardware acceleration. In this paper, we propose DeepFire, a high-performance RTL IP for accelerating convolutional SNN inference. The IP exploits various resources available on modern FPGAs, and it outperforms existing SNN implementations by more than 10× in terms of both frames per second (FPS) and performance per watt (FPS/Watt). Our design achieves up to 40.1kFPS and 28.3kFPS on the MNIST and CIFAR-10/SVHN datasets with 99.14% and 81.8%/93.1% accuracies, respectively. The IP was evaluated on 7-series and Ultrascale+ FPGAs from Xilinx, achieving an Fmax of 375 MHz and 500 MHz, respectively.
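The MAC-to-accumulate substitution described above is easy to see in a minimal integrate-and-fire layer. This sketch is purely illustrative (the threshold and weights are invented; DeepFire's RTL is not reproduced here):

    # One timestep of an integrate-and-fire layer: binary spikes turn the CNN's
    # multiply-accumulate (x * w) into a plain accumulate (add w when the input spiked).
    def if_layer(spikes, weights, v_mem, threshold=1.0):
        out = []
        for j, w_row in enumerate(weights):                 # one neuron per weight row
            v_mem[j] += sum(w for w, s in zip(w_row, spikes) if s)  # accumulate only
            if v_mem[j] >= threshold:                       # fire and reset
                v_mem[j] = 0.0
                out.append(1)
            else:
                out.append(0)
        return out

    v = [0.0, 0.0]
    print(if_layer([1, 0, 1], [[0.6, 0.2, 0.5], [0.1, 0.9, 0.3]], v))  # -> [1, 0]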

10 citations


Journal ArticleDOI
TL;DR: In this paper, on-chip power usage is estimated for 3D-TTN and compared with various other networks, along with an analysis of static network performance.
Abstract: Green computing is an important factor in ensuring the eco-friendly use of computers and their resources. Electric power used in a computer is converted into heat, so a system that draws fewer watts needs less cooling. This lower energy consumption makes the system less costly to run and reduces the environmental impact of powering it. One of the most challenging problems for modern green supercomputers is reducing current power consumption, and regular conventional interconnection networks show poor cost performance. Hierarchical interconnection networks (like 3D-TTN) are a possible solution to these issues. The main focus of this paper is the estimation of on-chip power usage for 3D-TTN compared with various other networks, along with an analysis of static network performance. In our analysis, 3D-TTN requires about 32.48% less router power at the on-chip level and achieves about 21% better diameter and about 12% better average distance than the 5D-Torus network. Similarly, it requires only about 14.43% more router power than the recent hierarchical interconnection network 3D-TESH, while achieving 23.21% better diameter and 26.3% better average distance. A distinctive feature of this paper is its static hop distance and per-watt (power-performance) analysis. According to our power-performance results, 3D-TTN also outperforms the 3D-Mesh, 2D-Mesh, 2D-Torus, and 3D-TESH networks even at the lowest network level. Moreover, the paper includes a static effectiveness analysis, which establishes the cost and time efficiency of 3D-TTN.
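The diameter and average-distance figures compared above are standard static metrics: shortest-path hop counts over the topology graph. Since the 3D-TTN construction itself is not given here, the generic sketch below computes both by breadth-first search for a small 2D mesh instead:

    from collections import deque

    def hop_stats(adj):
        """Diameter and average hop distance of an unweighted graph (dict: node -> neighbors)."""
        nodes, dists = list(adj), []
        for src in nodes:
            d, q = {src: 0}, deque([src])
            while q:                          # breadth-first search from src
                u = q.popleft()
                for v in adj[u]:
                    if v not in d:
                        d[v] = d[u] + 1
                        q.append(v)
            dists += [d[v] for v in nodes if v != src]
        return max(dists), sum(dists) / len(dists)

    # Example: a 4x4 2D mesh (diameter 6).
    N = 4
    idx = lambda x, y: x * N + y
    mesh = {idx(x, y): [idx(a, b) for a, b in ((x-1,y),(x+1,y),(x,y-1),(x,y+1))
                        if 0 <= a < N and 0 <= b < N]
            for x in range(N) for y in range(N)}
    print(hop_stats(mesh))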

9 citations


Journal ArticleDOI
TL;DR: This article discusses the rigid power, performance, and flexibility tradeoffs that BCI designers must balance, and how HALO’s palette of domain-specific hardware accelerators, general-purpose microcontroller, and configurable interconnect overcomes them.
Abstract: We are building HALO, a flexible ultralow-power processing architecture for implantable brain–computer interfaces (BCIs) that directly communicate with biological neurons in real time. This article discusses the rigid power, performance, and flexibility tradeoffs that BCI designers must balance, and how we overcome them via HALO’s palette of domain-specific hardware accelerators, general-purpose microcontroller, and configurable interconnect. Our evaluations using neuronal data collected in vivo from a nonhuman primate, along with full-stack algorithm-to-chip codesign, show that HALO achieves flexibility and superior performance per watt versus existing implantable BCIs.

6 citations


Book ChapterDOI
06 Jul 2021
TL;DR: In this paper, the authors explore the use of control theory for the design of a dynamic power regulation method for current and future HPC architectures, which is based on periodically monitoring application progress and choosing at runtime a suitable power cap for processors.
Abstract: Production high-performance computing systems continue to grow in complexity and size. As applications struggle to make use of increasingly heterogeneous compute nodes, maintaining high efficiency (performance per watt) for the whole platform becomes a challenge. Alongside the growing complexity of scientific workloads, this extreme heterogeneity is also an opportunity: as applications dynamically undergo variations in workload, due to phases or data/compute movement between devices, one can dynamically adjust power across compute elements to save energy without impacting performance. With an aim toward an autonomous and dynamic power management strategy for current and future HPC architectures, this paper explores the use of control theory for the design of a dynamic power regulation method. Structured as a feedback loop, our approach—which is novel in computing resource management—consists of periodically monitoring application progress and choosing at runtime a suitable power cap for processors. Thanks to a preliminary offline identification process, we derive a model of the dynamics of the system and a proportional-integral (PI) controller. We evaluate our approach on top of an existing resource management framework, the Argo Node Resource Manager, deployed on several clusters of Grid’5000, using a standard memory-bound HPC benchmark.
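A feedback loop of the shape described can be sketched as a generic PI power-cap controller. Here `read_progress` and `set_power_cap` are hypothetical placeholders for platform-specific interfaces, and real gains would come from the offline identification step the abstract mentions:

    # Sketch of a PI feedback loop that picks a processor power cap each period.
    # read_progress() and set_power_cap() are hypothetical platform hooks.
    def pi_power_loop(read_progress, set_power_cap, setpoint, kp, ki,
                      cap_min, cap_max, periods):
        integral = 0.0
        for _ in range(periods):
            error = setpoint - read_progress()   # > 0 means the app runs too slow
            integral += error                    # accumulated shortfall
            cap = kp * error + ki * integral     # PI control law
            set_power_cap(max(cap_min, min(cap_max, cap)))  # clamp to valid range

At steady state the integral term settles at the lowest cap that keeps measured progress at the setpoint, which is where the energy saving comes from.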

6 citations


Proceedings ArticleDOI
01 Sep 2021
TL;DR: In this paper, a parameterizable vector processing unit (VPU) is presented based on a subset of V-extension from the RISC-V instruction set architecture (ISA) for embedded processing.
Abstract: The computational intensity of embedded processing applications is increasing. This requires domain-specific embedded platforms to achieve maximum system performance per watt. With the arrival of open-source instruction set architectures such as RISC-V, and of domain-specific architecture development toolchains, the trend toward application-specific architectures is growing. In this paper, a parameterizable Vector Processing Unit (VPU) is presented based on a subset of the V extension of the RISC-V instruction set architecture (ISA) for embedded processing. The two key configurable parameters of the proposed VPU are the vector length (VLEN) and the number of execution lanes. These parameters allow design space exploration of the VPU across different configurations and help identify which application scenarios fit certain configurations. The proposed VPU was integrated into a 32-bit RISC-V processor. For the maximum parallelization configuration, 2.3× fewer cycles per instruction were achieved compared to the base RISC-V processor. Moreover, a relative cycle gain of 33-73% was achieved for the different configurations compared with the base RISC-V processor.
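To see why VLEN and lane count trade off against each other, a first-order cycle model helps. This is illustrative only: it treats VLEN as elements per vector register and ignores pipelining and memory stalls:

    import math

    # First-order cycle estimate for a strip-mined vector loop:
    # each vector instruction covers `vlen` elements and issues `lanes` per cycle.
    def vector_cycles(n_elements, vlen, lanes):
        beats_per_instr = math.ceil(vlen / lanes)   # cycles to drain one instruction
        n_instr = math.ceil(n_elements / vlen)      # instructions to cover the data
        return n_instr * beats_per_instr

    for lanes in (1, 2, 4, 8):
        print(lanes, vector_cycles(n_elements=1024, vlen=128, lanes=lanes))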

5 citations


Journal ArticleDOI
TL;DR: This paper proposes a methodology to predict the power consumption and performance of groups of concurrently executing applications at all available frequencies of a CMP; it outperforms three state-of-the-art resource managers, yielding the highest performance per watt in all evaluated use cases.
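Given such per-frequency predictions, the final selection step reduces to an argmax of the predicted ratio; a minimal sketch with made-up numbers:

    # Pick the frequency whose predicted performance per watt is highest.
    predictions = {  # freq (GHz) -> (predicted throughput, predicted watts); invented values
        1.2: (80.0, 35.0),
        1.8: (110.0, 55.0),
        2.4: (130.0, 90.0),
    }
    best = max(predictions, key=lambda f: predictions[f][0] / predictions[f][1])
    print(best)  # 1.2 here: 80/35 ~ 2.29 beats 2.00 and 1.44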

3 citations


Journal ArticleDOI
Vivek Bhardwaj
TL;DR: This paper looks at some of the techniques and efficient algorithms from ASIC software that have resulted in better performance per watt, a key metric in the FPGA world.
Abstract: With the ever-rising demand for FPGA-based applications and increasing semiconductor complexity of late, the techniques and efficient algorithms of ASIC software have trickled down to FPGAs as well. In this paper, we look at some of these techniques that have resulted in better performance per watt, a key metric in the FPGA world. We also briefly compare the ASIC and FPGA design flows and FPGA architecture, connect the dots, and make users better aware of the challenges FPGA designers face in implementing a certain design technique and how the software tries to overcome those challenges.

1 citation


Proceedings ArticleDOI
01 Feb 2021
TL;DR: In this article, a low-power processing-in-memory inference accelerator is proposed for subject-independent EEG signal classification, built on a compact neural network model that can be extended to multiple BCI tasks with minimal overhead.
Abstract: State-of-the-art deep neural networks (DNNs) for electroencephalography (EEG) signal classification focus on subject-related tasks, in which the test data and the training data need to be collected from the same subject. In addition, due to limited computing resources and strict power budgets at the edge, it is very challenging to deploy the inference of such DNN models on biological devices. In this work, we present an algorithm/hardware co-designed low-power accelerator for subject-independent EEG signal classification. We propose a compact neural network that is capable of identifying the common and stable structure among subjects. Based on it, we realize a robust subject-independent EEG signal classification model that can be extended to multiple BCI tasks with minimal overhead. On top of this model, we present RAISE, a low-power processing-in-memory inference accelerator that leverages emerging resistive memory. We compare the proposed model and hardware accelerator to prior art across various BCI paradigms. We show that our model achieves the best subject-independent classification accuracy, while RAISE achieves a 2.8× power reduction and a 2.5× improvement in performance per watt compared to the state-of-the-art resistive inference accelerator.
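For scale, if the 2.8× power reduction and the 2.5× performance-per-watt improvement are both measured against the same baseline, then, since performance per watt is performance divided by power, the implied raw throughput ratio is about 2.5/2.8 ≈ 0.89, i.e. roughly comparable performance at far lower power.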

Proceedings ArticleDOI
05 Dec 2021
TL;DR: In this paper, the authors propose an information-theoretic framework referred to as PaRMIS to create Pareto-optimal resource management policies for given target applications and design objectives.
Abstract: Mobile system-on-chips (SoCs) are growing in their complexity and heterogeneity (e.g., Arm’s Big-Little architecture) to meet the needs of emerging applications, including games and artificial intelligence. This makes it very challenging to optimally manage the resources (e.g., controlling the number and frequency of different types of cores) at runtime to meet the desired trade-offs among multiple objectives such as performance and energy. This paper proposes a novel information-theoretic framework referred to as PaRMIS to create Pareto-optimal resource management policies for given target applications and design objectives. PaRMIS specifies parametric policies to manage resources and learns statistical models from candidate policy evaluation data in the form of target design objective values. The key idea is to select a candidate policy for evaluation in each iteration guided by statistical models that maximize the information gain about the true Pareto front. Experiments on a commercial heterogeneous SoC show that PaRMIS achieves better Pareto fronts and is easily usable to optimize complex objectives (e.g., performance per Watt) when compared to prior methods.
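The Pareto front being approximated is simply the set of non-dominated objective points; a minimal non-dominated filter (illustrative only, not PaRMIS's machinery):

    # Minimal non-dominated filter over (performance, energy) points.
    def pareto_front(points):
        """Keep points with no other point that is >= on performance and <= on energy."""
        return [p for p in points
                if not any(q[0] >= p[0] and q[1] <= p[1] and q != p for q in points)]

    print(pareto_front([(10, 5.0), (12, 7.0), (9, 6.0), (12, 4.5)]))  # -> [(12, 4.5)]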

Proceedings ArticleDOI
28 May 2021
TL;DR: In this paper, a multi-core reconfigurable architecture for particle transport is proposed, which consists of heterogeneous lightweight cores, a reconfigurable cache structure, and high-bandwidth memory.
Abstract: Random simulation for particle transport theory is the main method for solving particle transport problems and is widely used in medicine and computational physics. In this work, we present a multi-core reconfigurable architecture that aims to meet the performance per watt requirements of future Domain Specific Architectures (DSAs). The architecture proposed in this paper consists of heterogeneous lightweight cores, a reconfigurable cache structure, and High Bandwidth Memory. By targeting the different feature requirements of the Monte Carlo transport code at its different stages, we equip the lightweight compute cores with only the necessary, efficient features, and provide an ongoing trade-off between performance and energy consumption through reconfiguration. We designed and validated the accelerator architecture using gem5. Experiments show that, compared with a traditional architecture composed of multiple out-of-order cores, this architecture achieves a more than 3× improvement in performance per watt. Some of the conclusions explored are not limited to the architecture proposed in this paper, and lay the foundation for further studies of large-scale transport accelerators.