TL;DR: A multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines is presented.
Abstract: Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. The results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.
Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992].
Turning resources on and off requires careful consideration of cases where multiple threads are present.
The BSP model, proposed by Valiant [1990], is designed to bridge the theory and practice of parallel computation.
A conventional power-gating optimization framework [You et al. 2005, 2007] can be employed for candidates used by a single thread, with the compiler inserting instructions into the program to shut down and wake up components as appropriate.
2. MOTIVATION
A system might be equipped with a power-gating mechanism to activate and deactivate components in order to reduce the leakage current [Goodacre 2011].
In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating instructions accordingly.
These code segments work smoothly when executed individually in single-thread environments as shown in Figure 1(b).
For thread T2, after five instructions are executed, the power-off instruction is executed at t6, which turns off component C1 to stop the leakage current.
This article presents their solution for addressing this issue.
3.1. PPG Operations
Predicated execution support provides an effective means to eliminate branches from an instruction stream.
Instructions whose predicate is true are executed normally, while those whose predicate is false are nullified and thus prevented from modifying the processor state.
The authors combine predicated execution with power gating to form three special power-gating operations: predicated power-on, predicated power-off, and initialization.
The main ideas are: (1) to turn on a component only when it is actually in the off state; (2) to keep track of the number of threads using the component; and (3) to turn off the component only at the last exit of all threads using it.
The operation consists of the following steps: (1) power on Ci if pgpi (i.e., the predicated bit of Ci) is set; (2) increase rci (i.e., the reference counter of Ci) by 1.
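A minimal C sketch of these semantics is shown below; it only illustrates the predicate-bit and reference-counter behavior described above, with hypothetical names (pgp, rc, power_on_component), and is not the authors' ISA-level implementation. In hardware these operations would be atomic per component.

```c
/* Minimal sketch of the PPG semantics described above (not the authors'
 * ISA encoding): pgp stands in for the predicate bit, rc for the reference
 * counter, and the power_*_component() hooks for the hardware control. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool pgp;  /* predicate bit: true while the component is powered off   */
    int  rc;   /* reference counter: threads currently using the component */
} ppg_state;

static void power_on_component(void)  { puts("component powered on");  }
static void power_off_component(void) { puts("component powered off"); }

/* predicated power-on: wake the component only if it is actually off,
 * then record one more user of it */
static void ppg_power_on(ppg_state *c) {
    if (c->pgp) { power_on_component(); c->pgp = false; }
    c->rc += 1;
}

/* predicated power-off: drop one user; gate the component only when the
 * last thread using it exits */
static void ppg_power_off(ppg_state *c) {
    if (--c->rc == 0) { power_off_component(); c->pgp = true; }
}

int main(void) {
    ppg_state c1 = { .pgp = true, .rc = 0 };  /* initialization operation    */
    ppg_power_on(&c1);   /* thread T1 enters: component turns on             */
    ppg_power_on(&c1);   /* thread T2 enters: already on, only rc grows      */
    ppg_power_off(&c1);  /* T1 exits: component still in use, stays on       */
    ppg_power_off(&c1);  /* T2 exits: last user, component turns off         */
    return 0;
}
```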
3.2. Multithread Power-Gating Framework
Algorithm 1 summarizes their proposed compiler flow of the MTPG framework for BSP models.
To generate code with power-gating control in a multithread BSP program, the compiler should compute concurrency information and analyze component usage with respect to concurrent threads.
In step 2, detailed component usages can be calculated via dataflow equations by referencing component-activity dataflow analysis [You et al. 2002, 2006].
Steps 3 and 4 insert PPG instructions according to the information gathered in the previous steps while considering the cost model (Section 5 presents their MTPGA compiler framework for power optimizations).
Step 6 attempts to merge the power-gating instructions with the sink-n-hoist framework.
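The following sketch condenses this flow into code; every type and function is a hypothetical placeholder for the corresponding analysis or transformation pass, not the authors' actual SUIF/Machine-SUIF interface.

```c
/* Condensed sketch of the MTPG compiler flow described above; all types and
 * functions are hypothetical placeholders for the corresponding passes. */
#include <stddef.h>

typedef struct Program     Program;      /* multithread BSP program in CFG form */
typedef struct Concurrency Concurrency;  /* TFCA result (Section 4)             */
typedef struct Usage       Usage;        /* component-activity dataflow result  */
typedef struct Plan        Plan;         /* selected PPG insertion points       */

/* placeholder pass bodies */
static Concurrency *compute_thread_fragment_concurrency(Program *p) { (void)p; return NULL; }
static Usage *component_activity_dataflow(Program *p) { (void)p; return NULL; }       /* step 2    */
static Plan *select_ppg_sites(Program *p, Concurrency *c, Usage *u) {                 /* steps 3-4 */
    (void)p; (void)c; (void)u; return NULL;
}
static void insert_ppg_instructions(Program *p, Plan *plan) { (void)p; (void)plan; }
static void sink_n_hoist_merge(Program *p) { (void)p; }                               /* step 6    */

void mtpg_compile(Program *p) {
    Concurrency *conc = compute_thread_fragment_concurrency(p); /* concurrency information     */
    Usage       *use  = component_activity_dataflow(p);         /* component usage             */
    Plan        *plan = select_ppg_sites(p, conc, use);         /* apply the cost model        */
    insert_ppg_instructions(p, plan);                           /* PPG power-on/off controls   */
    sink_n_hoist_merge(p);                                      /* merge power-gating controls */
}
```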
4. TFCA FOR BSP PROGRAMS
This section presents the concurrency analysis method for BSP programs.
Figure 3 presents an example for a superstep of the hierarchical BSP model, in which vertical black lines indicate threads and horizontal gray bars indicate barriers.
Eight individual threads and two barriers form the superstep, where the eight threads join and are divided into six groups.
In a hierarchical BSP program, programmers are allowed to divide threads into groups, and synchronization is then limited to the threads within a group, forming sub-supersteps inside the groups.
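The sketch below makes this group/barrier structure concrete using a hypothetical bsp_* API (bsp_split, bsp_barrier); the API is only illustrative and is not an interface defined by the paper.

```c
/* Illustrative sketch of one hierarchical BSP superstep with groups.
 * The bsp_* API is hypothetical and serves only to make the model concrete. */
typedef struct bsp_group bsp_group;             /* a set of threads sharing barriers */

bsp_group *bsp_split(bsp_group *g, int color);  /* divide g into subgroups by color  */
void       bsp_barrier(bsp_group *g);           /* synchronize only the threads of g */

void superstep(bsp_group *all, int my_color) {
    /* ... local computation of the enclosing superstep ... */

    bsp_group *sub = bsp_split(all, my_color);  /* e.g., groups g1 and g2 */

    /* sub-supersteps: synchronization is limited to the threads of `sub` */
    /* ... local computation ... */
    bsp_barrier(sub);
    /* ... local computation ... */
    bsp_barrier(sub);

    bsp_barrier(all);                           /* barrier ending the enclosing superstep */
}
```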
Computing the concurrency between threads involves considering the relations between threads that are present during a specific period; such a period of a thread is represented by a set of neighboring nodes in the control-flow graph (CFG), called a thread fragment.
4.1. Thread Fragment Graph
The relationships between thread fragments in a superstep are abstracted into a directed graph called the thread fragment graph (TFG), in which a node represents a thread fragment and an edge represents the control flow.
For a multiple-program multiple-data programming model, a multithread program is composed of multiple individual executable files that are executed on different processors; in such a case, a TFG is constructed from several CFGs of the individual programs.
In the first superstep, four threads are further grouped into two groups (g1 and g2).
The nodes and exit nodes of a TFG are denoted by V′ and V′(exit), respectively, and a TFG may have multiple exit nodes.
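One possible in-memory representation of a TFG is sketched below; the type and field names are hypothetical and simply mirror the definitions above.

```c
/* Sketch of a TFG representation (hypothetical type and field names).
 * Nodes are thread fragments, i.e., sets of neighboring CFG nodes of one
 * thread; directed edges follow the control flow between fragments. */
#include <stddef.h>

typedef struct CFGNode CFGNode;            /* basic block of the original CFG */

typedef struct ThreadFragment {
    int        thread_id;                  /* thread this fragment belongs to */
    CFGNode  **cfg_nodes;                  /* neighboring CFG nodes it covers */
    size_t     num_cfg_nodes;
} ThreadFragment;

typedef struct TFG {
    ThreadFragment **nodes;                /* V'                              */
    size_t           num_nodes;
    ThreadFragment **exit_nodes;           /* V'(exit)                        */
    size_t           num_exit_nodes;
    size_t         (*edges)[2];            /* directed edges as index pairs into nodes */
    size_t           num_edges;
} TFG;
```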
4.2. Constructing TFGs
The authors designed a TFG construction algorithm that builds the TFG for each BSP superstep from a CFG and performs the lineal thread fragments analysis for each TFG.
Algorithm 2 is the kernel algorithm that collects the thread fragments of a designated group, constructs the TFG of the group, and computes the concurrency information.
The output of the algorithm is a set of nodes between the entry barrier and the exit barrier of the processed group.
After performing TraverseGroup for each subgroup, the output blocked set V′blk needs to cross barriers; the processed V′blk set is then added to Vitr so that thread fragments are collected in the subsequent iterations.
4.3. Lineal Thread Fragments Analysis and MTF
Once the TFG has been constructed, the authors can compute the concurrent thread fragments of a hierarchical BSP program.
In their dataflow analysis, the authors collect all nodes along the TFG and maintain the entire set of lineal thread fragments, adding nodes symmetrically so that the set remains symmetric.
The MHP regions are determined by first constructing an MHP graph G′′ = (V′, E′′), an undirected graph whose nodes are thread fragments and whose edges connect thread fragments that may happen in parallel; that is, the edges are derived from the MTF set.
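A minimal sketch of this construction is given below, assuming the MTF relation has been materialized as a boolean matrix mtf[v][u] by the preceding analysis; the array-based representation is an illustration, not the paper's data structure.

```c
/* Build the undirected MHP graph G'' = (V', E'') from the MTF relation.
 * mtf[v][u] is assumed to be true when thread fragments v and u may happen
 * in parallel, as determined by the LTF/MTF analysis. */
#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAGMENTS 64

extern bool mtf[MAX_FRAGMENTS][MAX_FRAGMENTS];   /* result of the MTF analysis */
bool mhp_edge[MAX_FRAGMENTS][MAX_FRAGMENTS];     /* E'': adjacency matrix      */

void build_mhp_graph(size_t num_fragments) {
    for (size_t v = 0; v < num_fragments; v++)
        for (size_t u = v + 1; u < num_fragments; u++)
            if (mtf[v][u] || mtf[u][v])                     /* may happen in parallel */
                mhp_edge[v][u] = mhp_edge[u][v] = true;     /* undirected edge        */
}
```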
Table II lists the resulting GEN, OUT, and LTF sets for the example in Figure 5.
5. MULTITHREAD POWER-GATING ANALYSIS
Based on the TFCA results, the component usage of a power-gating candidate Ci across all concurrent thread fragments can be categorized into the following three cases.
Figure 10 demonstrates two possible placements of PPG operations based on the MTPGA results.
Therefore, within the MHP region only one pair of power-gating operations actually takes effect, namely the first power-on and the last power-off among the PPG pairs, whereas the power gating of the other PPG operations is disabled.
Figure 10 portrays the implications of the aforesaid functions with TF1 and C1 as parameters.
6.1. Platform
The authors used a DEC-Alpha-compatible architecture with PPG controls and two-way to eight-way simultaneous multithreading as the target architecture for their experiments.
By default, the simulator performs out-of-order execution.
Benchmark programs were first translated into the intermediate format with SUIF, processed by the concurrent thread fragment analysis, and then translated into the machine- or instruction-level CFG form with Machine-SUIF.
Four components of the low-power optimization phase for multithread programs (implemented as a Machine-SUIF pass) were then performed, and finally, the compiler generated DEC Alpha assembly code with extended power-gating controls.
Also, the baseline data was provided by the power estimation of Wattch cc3 with a clock-gating mechanism that gates the clocks of unused resources in multiport hardware to reduce the dynamic power; however, leakage power still exists.
6.2. Simulation Results
To verify their proposed MTPGA algorithm and PPG mechanism, the authors focused on investigating component utilization in the supersteps.
As indicated in Table VI, while CADFA results in less leakage energy in power-gateable units (about 30% of the energy consumption relative to MTPG), it suffers the overhead of traditional power-gating instructions (about 11× the energy consumption relative to MTPG).
Figures 14 through 19 show their experimental results for BSP programs from OpenCL-based kernels.
7. DISCUSSION
The authors discuss the impact of latency and the applicability of MTPGA to real hardware.
Latencies in processors include pipelining latency and memory access latency.
Nevertheless, MTPGA also conservatively estimates the inactive period in an MHP region using the worst case, namely the minimal thread execution time among the threads.
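One way to write this conservative estimate for an MHP region R (the notation is introduced here for illustration and is not taken from the paper):

```latex
% worst-case (shortest) inactive period assumed for an MHP region R
\widehat{T}_{\mathrm{inactive}}(R) \;=\; \min_{t \,\in\, \mathrm{threads}(R)} T_{\mathrm{exec}}(t)
```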
When the instruction fetch policy of the SMT processor changes, their method still applies because it estimates energy consumption using the worst case of concurrent threads, which guarantees that leakage energy is reduced in any case.
Finally, the power management controller removes the power-gating instructions from the power-gating direction buffer.
8. CONCLUSION
This article has presented a foundational framework for compiler optimization that reduces power consumption on SMT architectures.
It has also presented PPG operations for improving the energy management of multithread programs in hierarchical BSP models.
Based on a multithread component analysis with dataflow equations, their MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management.
TL;DR: In this paper, the authors focus on deeply embedded devices, typically used for Internet of Things (IoT) applications, and demonstrate how to enable energy transparency through existing static resource analysis (SRA) techniques and a new target-agnostic profiling technique, without hardware energy measurements.
Abstract: Energy transparency is a concept that makes a program’s energy consumption visible, from hardware up to software, through the different system layers. Such transparency can enable energy optimizations at each layer and between layers, as well as help both programmers and operating systems make energy-aware decisions. In this article, we focus on deeply embedded devices, typically used for Internet of Things (IoT) applications, and demonstrate how to enable energy transparency through existing static resource analysis (SRA) techniques and a new target-agnostic profiling technique, without hardware energy measurements. Our novel mapping technique enables software energy consumption estimations at a higher level than the Instruction Set Architecture (ISA), namely the LLVM intermediate representation (IR) level, and therefore introduces energy transparency directly to the LLVM optimizer. We apply our energy estimation techniques to a comprehensive set of benchmarks, including single- and multithreaded embedded programs from two commonly used concurrency patterns: task farms and pipelines. Using SRA, our LLVM IR results demonstrate a high accuracy with a deviation in the range of 1% from the ISA SRA. Our profiling technique captures the actual energy consumption at the LLVM IR level with an average error of 3%.
TL;DR: A new framework is presented that eliminates redundant race checks and boosts dynamic race detection by performing static optimizations on top of a series of thread interference analysis phases.
Abstract: Precise dynamic race detectors report an error if and only if more than one thread concurrently exhibits conflict on a memory access. They insert instrumentation at compile time to perform runtime checks on all memory accesses to ensure that all races are captured and no spurious warnings are generated. However, a dynamic race check for a particular memory access statement is guaranteed to be redundant if the statement can be statically identified as thread interference-free. Despite significant recent advances in dynamic detection techniques, redundant checks remain a critical factor that leads to prohibitive overhead of dynamic race detection for multithreaded programs. In this paper, we present a new framework that eliminates redundant race checks and boosts dynamic race detection by performing static optimizations on top of a series of thread interference analysis phases. Our framework is implemented on top of LLVM 3.5.0 and evaluated with an industrial dynamic race detector, TSAN, which is available as part of the LLVM tool chain. 11 benchmarks from SPLASH2 are used to evaluate the effectiveness of our approach in accelerating TSAN by eliminating redundant interference-free checks. The experimental results demonstrate that our new approach achieves from 1.4x to 4.0x (2.4x on average) speedup over the original TSAN under a 4-thread setting, and from 1.3x to 4.6x (2.6x on average) speedup under a 16-thread setting.
5 citations
Cites background from "Compiler Optimization for Reducing ..."
...Shin et al.[40] presents a power-gating analysis framework (MTPG) for multithreaded programs....
TL;DR: This work attempts to devise power optimization schemes in compilers by exploiting the opportunities of the recurring patterns of embedded multicore programs, including Pipe and Filter pattern, MapReduce with Iterator pattern, and Bulk Synchronous Parallel Model.
Abstract: Minimization of power dissipation can be considered at the algorithmic, compiler, architectural, logic, and circuit levels. Recent research trends for multicore programming models have come to the direction that parallel design patterns can be a solution to develop multicore applications. As parallel design patterns exhibit regularity, we view this as a great opportunity to exploit power optimizations in the software layer. In this paper, we investigate compilers for low power with parallel design patterns on embedded multicore systems. We evaluate four major parallel design patterns: Pipe and Filter, MapReduce with Iterator, Puppeteer, and the Bulk Synchronous Parallel (BSP) model. Our work attempts to devise power optimization schemes in compilers by exploiting the opportunities of the recurring patterns of embedded multicore programs. The proposed optimization schemes are rate-based optimization for the Pipe and Filter pattern, early-exit power optimization for the MapReduce with Iterator pattern, a power-aware mapping algorithm for the Puppeteer pattern, and a multi-phase power-gating scheme for the BSP pattern. In our experiments, real-world multicore applications are evaluated on a multicore power simulator. Significant power reductions are observed from the experimental results. Therefore, we present a direction for power optimizations in which one can further identify additional key design patterns for embedded multicore systems to explore power optimization opportunities via compilers.
TL;DR: This work attempts to devise power optimization schemes in compilers by exploiting the opportunities of the recurring patterns of embedded multicore programs, and presents a direction for power optimizations in which one can further identify additional key design patterns for embedded multicore systems to explore power optimization opportunities via compilers.
Abstract: Minimization of power dissipation can be considered at the algorithmic, compiler, architectural, logic, and circuit levels. Recent research trends for multicore programming models have come to the direction that parallel design patterns can be a solution to develop multicore applications. As parallel design patterns exhibit regularity, we view this as a great opportunity to exploit power optimizations in the software layer. In this paper, we present case studies to investigate compilers for low power with parallel design patterns on embedded multicore systems. We evaluate two major parallel design patterns: Pipe and Filter, and MapReduce with Iterator. Our work attempts to devise power optimization schemes in compilers by exploiting the opportunities of the recurring patterns of embedded multicore programs. In both cases of the patterns investigated, the common recurring patterns of programs are exploited to seek opportunities for compiler optimizations for low power. The proposed optimization schemes are rate-based optimization for the Pipe and Filter pattern and early-exit power optimization for the MapReduce with Iterator pattern. Our experiment is based on a power simulator simulating a heterogeneous multicore system under the SID simulation framework. In our experiments, a finite impulse response (FIR) program with the Pipe and Filter pattern and an image recognition application applying MapReduce with Iterator are evaluated by incorporating our proposed power optimization schemes for each pattern. Significant power reductions are observed in both cases. With these case studies, we present a direction for power optimizations in which one can further identify additional key design patterns for embedded multicore systems to explore power optimization opportunities via compilers.
TL;DR: This paper implements a hardware-aware Compiler Enhanced Scheduling (CES) in a compiler for OpenMP, where common compiler transformations are coupled with compiler-added scheduling commands to take advantage of the hardware asymmetry and improve runtime efficiency.
Abstract: Scheduling in Asymmetric Multicore Processors (AMP), a special case of heterogeneous multiprocessors, is a widely studied topic. The scheduling techniques, which are mostly runtime based, do not usually consider the parallel programming patterns used in parallel programming frameworks like OpenMP. On the other hand, current compilers for these parallel programming platforms are hardware oblivious, which prevents any compile-time optimization for platforms like big.LITTLE and forces complete reliance on runtime optimization. In this paper, we propose a hardware-aware Compiler Enhanced Scheduling (CES), where common compiler transformations are coupled with compiler-added scheduling commands to take advantage of the hardware asymmetry and improve runtime efficiency. We implement a compiler for OpenMP and demonstrate its efficiency on a Samsung Exynos with the big.LITTLE architecture. On average, we see an 18% reduction in runtime and a 14% reduction in energy consumption in standard NPB and FSU benchmarks with CES across multiple frequencies and core configurations in big.LITTLE.
2 citations
Cites methods from "Compiler Optimization for Reducing ..."
...Compiler analysis and transformations typically accomplish optimizations by efficiently estimating the run-time behaviour [11, 17, 14]....
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, and results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model, yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
TL;DR: In this paper, techniques for low power operation are presented which use the lowest possible supply voltage coupled with architectural, logic style, circuit, and technology optimizations to reduce power consumption in CMOS digital circuits while maintaining computational throughput.
Abstract: Motivated by emerging battery-operated applications that demand intensive computation in portable environments, techniques are investigated which reduce power consumption in CMOS digital circuits while maintaining computational throughput. Techniques for low-power operation are shown which use the lowest possible supply voltage coupled with architectural, logic style, circuit, and technology optimizations. An architecturally based scaling strategy is presented which indicates that the optimum voltage is much lower than that determined by other scaling considerations. This optimum is achieved by trading increased silicon area for reduced power consumption.
2,337 citations
"Compiler Optimization for Reducing ..." refers methods in this paper
...Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992]....
[...]
...INTRODUCTION Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992]....
TL;DR: In this article, the authors present several dual-threshold voltage techniques for reducing standby power dissipation while still maintaining high performance in static and dynamic combinational logic blocks. MTCMOS sleep transistor sizing issues are addressed, and a hierarchical sizing methodology based on mutually exclusive discharge patterns is presented.
Abstract: Scaling and power reduction trends in future technologies will cause subthreshold leakage currents to become an increasingly large component of total power dissipation. This paper presents several dual-threshold voltage techniques for reducing standby power dissipation while still maintaining high performance in static and dynamic combinational logic blocks. MTCMOS sleep transistor sizing issues are addressed, and a hierarchical sizing methodology based on mutually exclusive discharge patterns is presented. A dual-Vt domino logic style that provides the performance equivalent of a purely low-Vt design with the standby leakage characteristics of a purely high-Vt implementation is also proposed.
TL;DR: In this article, the authors used an energy-delay metric to compare many of the proposed techniques and provided insight into some of the basic trade-offs in low-power design, including trade speed for power, do not waste power, and find a lower power problem.
Abstract: Recently there has been a surge of interest in low-power devices and design techniques. While many papers have been published describing power-saving techniques for use in digital systems, trade-offs between the methods are rarely discussed. We address this issue by using an energy-delay metric to compare many of the proposed techniques. Using this metric also provides insight into some of the basic trade-offs in low-power design. The next section describes the energy-loss mechanisms that are present in CMOS circuits, which provides the parameters that must be changed to lower the power dissipation. With these factors in mind, the rest of the paper reviews the energy saving techniques that have been proposed. These proposals fall into one of three main strategies: trade speed for power, do not waste power, and find a lower power problem.
471 citations
"Compiler Optimization for Reducing ..." refers background or methods in this paper
...…long instruction word) instructions to reduce the power consumption on the instruction bus [Lee et al. 2003], reducing instruction encoding to reduce code size and power consumption [Lee et al. 2013], and gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]....
[...]
...Aspects relative to combining architecture design and software arrangement at the instruction level have been addressed with the aim of reducing power consumption [Bellas et al. 2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 1997, 2003, 2013; Su and Despain 1995; Tiwari et al. 1997, 1998]....
[...]
...2013], and gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]....
Q1. What contributions have the authors mentioned in the paper "Compiler optimization for reducing leakage power in multithread bsp programs" ?
This article addresses compiler optimization for reducing the power consumption of multithread programs. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. The authors performed experiments by incorporating their power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits.
Q2. What future works have the authors mentioned in the paper "Compiler optimization for reducing leakage power in multithread bsp programs" ?
It would be a possible direction for future research to apply their method on GPU architectures.