Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs
Summary
1. INTRODUCTION
- Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992].
- Turning resources on and off requires careful consideration of cases where multiple threads are present.
- The BSP model, proposed by Valiant [1990], is designed to bridge between theory and practice of parallel computations.
- A conventional power-gating optimization framework [You et al. 2005, 2007] can be employed for candidates used by a single thread, with the compiler inserting instructions into the program to shut down and wake up components as appropriate.
2. MOTIVATION
- A system might be equipped with a power-gating mechanism to activate and deactivate components in order to reduce the leakage current [Goodacre 2011].
- In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating ACM Transactions on Design Automation of Electronic Systems, Vol. 20, No. 1, Article 9, Pub. date: November 2014.
- These code segments work smoothly when executed individually in single-thread environments as shown in Figure 1(b).
- For thread T2, after five instructions are executed, the power-off instruction is executed at t6, which turns off component C1 to stop the leakage current.
- This article presents their solution for addressing this issue.
3.1. PPG Operations
- Predicated execution support provides an effective means to eliminate branches from an instruction stream.
- Instructions whose predicate is true are executed normally, while those whose predicate is false are nullified and thus prevented from modifying the processor state.
- The authors combine the predicated executions into three special power-gating operations: predicated power-on, predicated power-off, and initialization operations.
- The main ideas are (1) to turn on a component only when it is actually in the off state; (2) to keep track of the number of threads using the component; and (3) to turn off the component only at the last exit of all threads using this component.
- The operation consists of the following steps: (1) power on Ci if pgpi (i.e., the predicated bit of Ci) is set; (2) increase rci (i.e., the reference counter of Ci) by 1.
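The three ideas behind the PPG operations can be sketched as a small reference-counting model. This is only an illustration under stated assumptions: the class `GatedComponent`, its field names, and the `toggles` counter are hypothetical, the predicate bit pgpi is modeled simply as "the component is currently off", and the power-off step is assumed to mirror the power-on step described above.

```python
from dataclasses import dataclass


@dataclass
class GatedComponent:
    """Hypothetical model of one power-gating candidate Ci."""
    powered: bool = False  # physical power state of Ci
    ref_count: int = 0     # rci: number of threads currently using Ci
    toggles: int = 0       # how often the power was physically switched

    def init(self) -> None:
        # Initialization operation (assumed): powered off, no users.
        self.powered = False
        self.ref_count = 0

    def predicated_power_on(self) -> None:
        # Step 1: power on only if the component is actually off.
        if not self.powered:
            self.powered = True
            self.toggles += 1
        # Step 2: record one more thread using the component.
        self.ref_count += 1

    def predicated_power_off(self) -> None:
        # Assumed mirror image: drop this thread's reference first ...
        self.ref_count -= 1
        # ... and gate the power only on the last exit.
        if self.ref_count == 0 and self.powered:
            self.powered = False
            self.toggles += 1


if __name__ == "__main__":
    c1 = GatedComponent()
    c1.init()
    c1.predicated_power_on()   # thread T1 starts using C1
    c1.predicated_power_on()   # thread T2 starts using C1
    c1.predicated_power_off()  # T2 leaves; T1 still needs C1
    c1.predicated_power_off()  # T1 leaves; last exit gates C1
    print(c1.powered, c1.toggles)
```

In this model, two overlapping threads cause only one physical power-on and one physical power-off, which is the behavior the reference counter is meant to guarantee.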
3.2. Multithread Power-Gating Framework
- Algorithm 1 summarizes their proposed compiler flow of the MTPG framework for BSP models.
- To generate code with power-gating control in a multithread BSP program, the compiler should compute concurrency information and analyze component usage with respect to concurrent threads.
- In step 2, detailed component usages can be calculated via dataflow equations by referencing component-activity dataflow analysis [You et al. 2002, 2006].
- Steps 3 and 4 insert PPG instructions according to the information gathered in the previous steps while considering the cost model (Section 5 presents their MTPGA compiler framework for power optimizations).
- Step 6 attempts to merge the power-gating instructions with the sink-n-hoist framework.
4. TFCA FOR BSP PROGRAMS
- This section presents the concurrency analysis method for BSP programs.
- Figure 3 presents an example for a superstep of the hierarchical BSP model, in which vertical black lines indicate threads and horizontal gray bars indicate barriers.
- Eight individual threads and two barriers form the superstep, where the eight threads join and are divided into six groups.
- In a hierarchical BSP program, programmers are allowed to divide threads into groups and the synchronization of threads would be limited in the groups, which form subsupersteps inside groups.
- Computing the concurrency between threads involves considering the relation between threads that are present during a specific period; each such period is indicated by a set of neighboring nodes in the control-flow graph (CFG), called a thread fragment.
4.1. Thread Fragment Graph
- The relationships between thread fragments in a superstep are abstracted into a directed graph named the TFG, in which a node represents a thread fragment and an edge represents the control flow.
- For a multiple-program multiple-data programming model, a multithread program is composed of multiple individual executable files that are executed on different processors; in such a case, a TFG is constructed from several CFGs of the individual programs.
- In the first superstep, four threads are further grouped into two groups (g1 to g2).
- For a given group g, let the numbers of BSP barrier [...]
- There are multiple nodes and exit nodes in a TFG, denoted by V ′ and V ′(exit), respectively.
4.2. Constructing TFGs
- The authors designed a TFG construction algorithm that builds the TFG for each BSP superstep from a CFG and performs the lineal thread fragments analysis for each TFG.
- Algorithm 2 is the kernel algorithm that collects the thread fragment of a designated group as well as constructs the TFG of the group and computes the concurrency information.
- The output of the algorithm would be a set of nodes between the entry barrier of an [...]
- After performing TraverseGroup for each subgroup, the output blocked set V′blk needs to cross barriers; the processed V′blk set is then added to Vitr so that thread fragments are collected in the subsequent iterations.
4.3. Lineal Thread Fragments Analysis and MTF
- Once the TFG has been constructed, the authors can compute the concurrent thread fragments of a hierarchical BSP program.
- The authors collect all nodes along the TFG in their dataflow analysis and maintain the set of entire lineal thread fragments by adding nodes symmetrically so as to keep this set symmetric.
- The MHP regions are determined by first constructing an MHP graph G′′ = (V′, E′′), which is an undirected graph whose nodes are thread fragments and whose edges connect thread fragments that may happen in parallel; that is, the edges are derived from the MTF set.
- Table II lists the results of GEN, OUT, and LTF sets for the example in Figure 5.
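The relation between lineal thread fragments and the MHP graph can be illustrated with a short sketch. It makes two assumptions that the summary does not spell out: lineal fragments are taken to be ancestors and descendants in the TFG, and two fragments may happen in parallel exactly when neither is lineal to the other. Plain reachability stands in for the paper's GEN/OUT dataflow equations, and the function names and the four-node example TFG are hypothetical.

```python
from itertools import combinations


def reachable(tfg, start):
    """Return every thread fragment reachable from `start` along TFG edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for succ in tfg.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def lineal_sets(tfg):
    """LTF[n] = n plus its ancestors and descendants, kept symmetric."""
    ltf = {n: {n} for n in tfg}
    for n in tfg:
        for d in reachable(tfg, n):
            ltf[n].add(d)  # d is a descendant of n ...
            ltf[d].add(n)  # ... so n is an ancestor of d
    return ltf


def mhp_edges(tfg):
    """Undirected MHP edges: pairs of fragments that are not lineal."""
    ltf = lineal_sets(tfg)
    return {frozenset((a, b))
            for a, b in combinations(sorted(tfg), 2)
            if b not in ltf[a]}


# Hypothetical fork-join TFG; every node must appear as a key.
example_tfg = {
    "TF1": ["TF2", "TF3"],
    "TF2": ["TF4"],
    "TF3": ["TF4"],
    "TF4": [],
}

if __name__ == "__main__":
    print(mhp_edges(example_tfg))
```

For this example the only MHP edge joins TF2 and TF3, since every other pair lies on a common control-flow path.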
5. MULTITHREAD POWER-GATING ANALYSIS
- The TFCA results and the component usage for a power-gating candidate Ci of all concurrent thread fragments can be categorized in the following three cases.
- Figure 10 demonstrates two possible placements of PPG operations based on the MTPGA results.
- Therefore, within the MHP region there will be only a pair of power-gating operations, namely the first power-on and the last power-off operations, belonging to a pair of PPG operations being executed, whereas the power gating of the other PPG operations will be disabled.
- Figure 10 portrays the implications of the aforesaid functions with TF1 and C1 as parameters.
6.1. Platform
- The authors used a DEC-Alpha-compatible architecture with the PPG controls and two-way to eight-way simultaneous multithreading as the target architecture for their experiments.
- By default, the simulator performs out-of-order execution.
- ...format with SUIF, processed by concurrent thread fragment analysis, and then translated to the machine- or instruction-level CFG form with Machine-SUIF.
- Four components of the low-power optimization phase for multithread programs (implemented as a Machine-SUIF pass) were then performed, and finally, the compiler generated DEC Alpha assembly code with extended power-gating controls.
- Also, the baseline data was provided by the power estimation of Wattch cc3 with a clock-gating mechanism that gates the clocks of unused resources in multiport hardware to reduce the dynamic power; however, leakage power still exists.
6.2. Simulation Results
- To verify their proposed MTPGA algorithm and PPG mechanism, the authors focused on investigating component utilization in the supersteps.
- As indicated in Table VI, while CADFA results in less leakage energy in power-gateable units (about 30% energy consumption relative to MTPG), it suffers the overhead of traditional power-gating instructions (about 11× the energy consumption relative to MTPG).
- Figures 14 through 19 show their experimental results for BSP programs from OpenCL-based kernels.
7. DISCUSSION
- The authors discuss the impact of latency and the capability to apply MTPGA on real hardware.
- Latencies in processors include pipelining latency and memory access latency.
- Nevertheless, MTPGA also conservatively estimates the inactive period in an MHP region with the worst case, namely the minimal thread execution time among threads.
- When the instruction fetching policy changes in SMT, their method still applies because it estimates energy consumption with the worst case of concurrent threads, which guarantees that leakage energy would be reduced in any case.
- Finally, the power management controller removes the power-gating instructions from the power-gating direction buffer.
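The conservative estimate mentioned above (using the minimal thread execution time in an MHP region as the worst-case inactive period) can be sketched as a simple decision rule. The break-even threshold and the function name `should_power_gate` are assumptions introduced for illustration; the summary does not give the paper's actual cost model.

```python
def should_power_gate(thread_exec_cycles, break_even_cycles):
    """Decide whether gating a component over an MHP region looks worthwhile.

    thread_exec_cycles: estimated execution time of each concurrent thread
        in the region, during which the component is assumed to be idle.
    break_even_cycles: hypothetical minimum idle time needed to amortize
        the energy overhead of switching the component off and on again.
    """
    if not thread_exec_cycles:
        return False  # no concurrent threads, nothing to estimate
    # Worst case: the shortest thread bounds the guaranteed idle period.
    worst_case_idle = min(thread_exec_cycles)
    return worst_case_idle > break_even_cycles


if __name__ == "__main__":
    print(should_power_gate([120, 300, 250], break_even_cycles=100))
    print(should_power_gate([120, 40], break_even_cycles=100))
```

Taking the minimum rather than the average means the rule never gates a component on the strength of an idle period that a faster thread could cut short.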
8. CONCLUSION
- This article has presented a foundation framework for compilation optimization that reduces the power consumption on SMT architectures.
- It has also presented PPG operations for improving the energy management of multithread programs in hierarchical BSP models.
- Based on a multithread component analysis with dataflow equations, their MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management.
Citations
Cites background from "Compiler Optimization for Reducing ..."
...Shin et al.[40] presents a power-gating analysis framework (MTPG) for multithreaded programs....
Cites methods from "Compiler Optimization for Reducing ..."
...Compiler analysis and transformations typically accomplish optimizations by efficiently estimating the run-time behaviour [11, 17, 14]....
[...]
References
"Compiler Optimization for Reducing ..." refers background or methods in this paper
...…long instruction word) instructions to reduce the power consumption on the instruction bus [Lee et al. 2003], reducing instruction encoding to reduce code size and power consumption [Lee et al. 2013], and gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]....
[...]
...…to combining architecture design and software arrangement at the instruction level have been addressed with the aim of reducing power consumption [Bellas et al. 2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 1997, 2003, 2013; Su and Despain 1995; Tiwari et al. 1997, 1998]....
"Compiler Optimization for Reducing ..." refers background or methods in this paper
...Various studies have attempted to reduce the leakage power using integrated architectures and compiler-based power gating mechanisms [Dropsho et al. 2002; Yang et al. 2002; You et al. 2002, 2007; Rele et al. 2002; Zhang et al. 2003; Li and Xue 2004]....
[...]
...Memory access latency is caused by the memory hierarchy, such as cache miss. Pipelining latency and memory access latency are both discussed in traditional power-gating analyses for single-thread environments such as CADFA [You et al. 2006] and sink-n-hoist [You et al. 2005, 2007]....
[...]
...Steps 5 and 6 further merge the generated power-gating controls into a single compound instruction based on the sink-n-hoist framework [You et al. 2005, 2007]....
[...]
...The Sink-N-Hoist framework [You et al. 2005, 2007] has been used to reduce the number of power-gating instructions generated by compilers....
[...]
Frequently Asked Questions (2)
Q2. What future work have the authors mentioned in the paper "Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs"?
It would be a possible direction for future research to apply their method on GPU architectures.