
Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs

Wen-Li Shih¹, Yi-Ping You², Chung-Wen Huang¹, Jenq Kuen Lee¹

National Tsing Hua University¹, National Chiao Tung University²

18 Nov 2014, ACM Transactions on Design Automation of Electronic Systems (ACM), Vol. 20, Iss. 1, Article 9

TL;DR: A multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms is presented for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines.


Abstract: Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism on leakage contribution set to 30%; and the total energy consumption is reduced by an average of 4.27% on leakage contribution set to 10%. The results demonstrate our mechanisms are effective in reducing the leakage energy of BSP multithread programs.


Summary


1. INTRODUCTION

Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992].
Turning resources on and off requires careful consideration of cases where multiple threads are present.
The BSP model, proposed by Valiant [1990], is designed to bridge between theory and practice of parallel computations.
A conventional power-gating optimization framework [You et al. 2005, 2007] can be employed for candidates used by a single thread, with the compiler inserting instructions into the program to shut down and wake up components as appropriate.

2. MOTIVATION

A system might be equipped with a power-gating mechanism to activate and deactivate components in order to reduce the leakage current [Goodacre 2011].
In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating instructions into programs [You et al. 2002, 2006] to ensure that the leakage current is gated appropriately.
These code segments work smoothly when executed individually in single-thread environments as shown in Figure 1(b).
For thread T2, after five instructions are executed, the power-off instruction is executed at t6, which turns off component C1 to stop the leakage current.
This article presents the authors' solution to this issue.

3.1. PPG Operations

Predicated execution support provides an effective means to eliminate branches from an instruction stream.
Instructions whose predicate is true are executed normally, while those whose predicate is false are nullified and thus prevented from modifying the processor state.
The authors combine the predicated executions into three special power-gating operations: predicated power-on, predicated power-off, and initialization operations.
The main ideas are (1) to turn on a component only when it is actually in the off state; (2) to keep track of the number of threads using the component; and (3) to turn off the component when this is the last exit of all threads using this component.
The predicated power-on operation consists of the following steps: (1) power on Ci if pgpi (i.e., the predicated bit of Ci) is set; (2) increase rci (i.e., the reference counter of Ci) by 1.
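
The reference-counting idea behind the PPG operations can be illustrated with a minimal Python sketch. It is not the authors' hardware specification: the class name `GatedComponent`, the event counters, the lock that stands in for atomic execution, the initial values, and the rule for resetting the predicate bit are assumptions made for illustration. Only the power-on steps (test pgpi, then increment rci) and the last-exit rule for power-off come from the description above.

```python
import threading


class GatedComponent:
    """Hypothetical software model of one power-gating candidate C_i."""

    def __init__(self):
        # Initialization operation (assumed values): component off,
        # no users, predicate bit set so the first power-on takes effect.
        self._lock = threading.Lock()  # stands in for atomic PPG execution
        self.powered = False
        self.rc = 0        # rc_i: number of threads currently using C_i
        self.pgp = True    # pgp_i: set while C_i is in the off state
        self.on_events = 0
        self.off_events = 0

    def predicated_power_on(self):
        with self._lock:
            if self.pgp:               # (1) power on only if pgp_i is set
                self.powered = True
                self.pgp = False       # assumption: cleared once C_i is on
                self.on_events += 1
            self.rc += 1               # (2) one more thread uses C_i

    def predicated_power_off(self):
        with self._lock:
            self.rc -= 1               # this thread leaves C_i
            if self.rc == 0:           # last exit of all threads using C_i
                self.powered = False
                self.pgp = True        # assumption: re-arm the next power-on
                self.off_events += 1


def interleave_two_threads():
    """Replay one interleaving: T1 on, T2 on, T1 off, T2 off."""
    c1 = GatedComponent()
    c1.predicated_power_on()           # T1 enters: C1 is switched on
    c1.predicated_power_on()           # T2 enters: no second switch
    c1.predicated_power_off()          # T1 leaves: C1 must stay on for T2
    still_on_for_t2 = c1.powered
    c1.predicated_power_off()          # T2 leaves: last exit, C1 goes off
    return still_on_for_t2, c1.powered, c1.on_events, c1.off_events
```

In this interleaving the component is switched on once and off once, and it stays powered while the second thread still needs it, which is the behavior a plain per-thread power-off instruction cannot guarantee in an SMT environment.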

3.2. Multithread Power-Gating Framework

Algorithm 1 summarizes their proposed compiler flow of the MTPG framework for BSP models.
To generate code with power-gating control in a multithread BSP program, the compiler should compute concurrency information and analyze component usage with respect to concurrent threads.
In step 2, detailed component usages can be calculated via dataflow equations by referencing component-activity dataflow analysis [You et al. 2002, 2006].
Steps 3 and 4 insert PPG instructions according to the information gathered in the previous steps while considering the cost model (Section 5 presents their MTPGA compiler framework for power optimizations).
Step 6 attempts to merge the power-gating instructions with the sink-n-hoist framework.

4. TFCA FOR BSP PROGRAMS

This section presents the concurrency analysis method for BSP programs.
Figure 3 presents an example for a superstep of the hierarchical BSP model, in which vertical black lines indicate threads and horizontal gray bars indicate barriers.
Eight individual threads and two barriers form the superstep, where the eight threads join and are divided into six groups.
In a hierarchical BSP program, programmers are allowed to divide threads into groups and the synchronization of threads would be limited in the groups, which form subsupersteps inside groups.
Computing the concurrency between threads involves considering the relation between threads that are present during a specific period; each such period is indicated by a set of neighboring nodes in the control-flow graph (CFG), called a thread fragment.

4.1. Thread Fragment Graph

The relationships between thread fragments in a superstep are abstracted into a directed graph named the TFG, in which a node represents a thread fragment and an edge represents the control flow.
For a multiple-program multiple-data programming model, a multithread program is composed of multiple individual executable files that are executed on different processors; in such a case, a TFG is constructed from several CFGs of the individual programs.
In the first superstep, four threads are further grouped into two groups (g1 to g2).
For a given group g, let the numbers of BSP barrier
There are multiple entry nodes and exit nodes in a TFG, denoted by V′(entry) and V′(exit), respectively.

4.2. Constructing TFGs

The authors designed a TFG construction algorithm that builds the TFG for each BSP superstep from a CFG and performs the lineal thread fragments analysis for each TFG.
Algorithm 2 is the kernel algorithm that collects the thread fragment of a designated group as well as constructs the TFG of the group and computes the concurrency information.
The output of the algorithm would be a set of nodes between the entry barrier of an
After performing TraverseGroup for each subgroup, the output blocked set V′blk needs to cross barriers; then the processed V′blk set is added to Vitr so that thread fragments are collected in the subsequent iterations.

4.3. Lineal Thread Fragments Analysis and MTF

Once the TFG has been constructed, the authors can compute the concurrent thread fragments of a hierarchical BSP program.
The authors collect all nodes along the TFG in their dataflow analysis and maintain the set of entire lineal thread fragments by adding nodes symmetrically so as to keep this set symmetric.
The MHP regions are determined by first constructing an MHP graph G′′ = (V′, E′′), which is an undirected graph whose nodes are thread fragments and whose edges connect thread fragments that may happen in parallel; that is, the edges are related to the MTF set.
Table II lists the results of GEN, OUT, and LTF sets for the example in Figure 5.
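
The MHP graph construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' algorithm: the input `mtf` is a hypothetical mapping from each thread fragment to the fragments it may run in parallel with, and treating each connected component with more than one fragment as an MHP region is an assumption, since the summary does not spell out how regions are extracted from the graph.

```python
from collections import defaultdict


def build_mhp_graph(mtf):
    """Build an undirected adjacency map from a hypothetical MTF mapping.

    mtf: dict mapping a thread fragment name to the set of fragments
    that may happen in parallel with it.
    """
    adj = defaultdict(set)
    for frag, others in mtf.items():
        adj[frag]                      # make sure isolated fragments appear
        for other in others:
            if other != frag:
                adj[frag].add(other)   # undirected: store both directions
                adj[other].add(frag)
    return dict(adj)


def mhp_regions(adj):
    """Return connected components that contain at least two fragments.

    Assumption: one MHP region corresponds to one such component.
    """
    seen = set()
    regions = []
    for start in sorted(adj):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:                   # iterative depth-first traversal
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adj[node] - seen)
        if len(component) > 1:
            regions.append(component)
    return regions
```

For example, with `mtf = {"TF1": {"TF2"}, "TF2": {"TF1"}, "TF3": set()}`, the fragments TF1 and TF2 end up in one region and TF3, which runs alone, is left out.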

5. MULTITHREAD POWER-GATING ANALYSIS

The TFCA results and the component usage for a power-gating candidate Ci of all concurrent thread fragments can be categorized in the following three cases.
Figure 10 demonstrates two possible placements of PPG operations based on the MTPGA results.
Therefore, within the MHP region there will be only a pair of power-gating operations, namely the first power-on and the last power-off operations, belonging to a pair of PPG operations being executed, whereas the power gating of the other PPG operations will be disabled.
Figure 10 portrays the implications of the aforesaid functions with TF1 and C1 as parameters.

6.1. Platform

The authors used a DEC-Alpha-compatible architecture with the PPG controls and two-way to eight-way simultaneous multithreading as the target architecture for their experiments.
By default, the simulator performs out-of-order execution.
Format with SUIF, processed by concurrent thread fragment analysis, and then translated to the machine- or instruction-level CFG form with Machine-SUIF.
Four components of the low-power optimization phase for multithread programs (implemented as a Machine-SUIF pass) were then performed, and finally, the compiler generated DEC Alpha assembly code with extended power-gating controls.
Also, the baseline data was provided by the power estimation of Wattch cc3 with a clock-gating mechanism that gates the clocks of unused resources in multiport hardware to reduce the dynamic power; however, leakage power still exists.

6.2. Simulation Results

To verify their proposed MTPGA algorithm and PPG mechanism, the authors focused on investigating component utilization in the supersteps.
As indicated in Table VI, while CADFA results in less leakage energy in power-gateable units (about 30% energy consumption relative to MTPG), it suffers the overhead of traditional power-gating instructions (about 11× the energy consumption relative to MTPG).
Figures 14 through 19 show their experimental results for BSP programs from OpenCL-based kernels.

7. DISCUSSION

The authors discuss the impact of latency and the capability to apply MTPGA on real hardware.
Latencies in processors include pipelining latency and memory access latency.
Nevertheless, MTPGA also conservatively estimates the inactive period in an MHP region with the worst case, namely the minimal thread execution time among threads.
When the instruction fetching policy changes in SMT, their method still applies because it estimates energy consumption with the worst case of concurrent threads, which guarantees that leakage energy would be reduced in any case.
Finally, the power management controller removes the power-gating instructions from the power-gating direction buffer.

8. CONCLUSION

This article has presented a foundation framework for compilation optimization that reduces the power consumption on SMT architectures.
It has also presented PPG operations for improving the energy management of multithread programs in hierarchical BSP models.
Based on a multithread component analysis with dataflow equations, their MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management.


Figures (29)

Fig. 2. Pseudocode segments to illustrate the specification of PPG operations for a power-gating candidate C1: (a) initialization operation for the power-gating mechanism; (b) and (c) operations to atomically perform predicated power-on and predicated power-off of C1, respectively.

Fig. 6. Superstep VSS(g0,1) of Figure 5. Nodes inside a superstep are divided into thread fragments that are used to build a TFG. Nodes v11 and v19 are overlapped because there is a barrier node v15 inside the loop structure. With a barrier node inside a loop, nodes inside the loop may be executed multiple times in different thread fragments; therefore nodes v11 and v19 appear in different thread fragments, resulting in the thread fragments overlapping.

Table VIII. Normalized Total Energy Consumptions of Randomly Generated TFGs for Setting B

Table IX. Normalized Total Energy Consumptions of Randomly Generated TFGs for Setting B

Fig. 8. Dataflow equations for lineal thread fragments information.

Table IV. Baseline Processor Configuration

Fig. 11. Power management in the compilation phases of multithread programs.

Fig. 17. Normalized total energy consumptions of BSP programs from OpenCL kernels on eight-way SMT system with leakage contribution set to 30%.

Fig. 3. Illustration of one superstep of a BSP program, where eight threads (T1 to T8) are divided into six subgroups (G1 to G6). Each subgroup contains a subsuperstep. In a hierarchical BSP program, programmers are allowed to divide threads into groups and the synchronization of threads would be limited in the groups, which form subsupersteps inside the groups. A barrier is a synchronous point of a group in the hierarchical BSP model; therefore all the barriers in program must belong to a specific group as shown in the figure.

Fig. 16. Normalized total energy consumptions of BSP programs from OpenCL kernels on eight-way SMT system with leakage contribution set to 10%.

Table V. Parameter Settings Used for Generating TFGs

Fig. 15. Normalized total energy consumptions of BSP programs from OpenCL kernels on four-way SMT system with leakage contribution set to 30%.

Fig. 5. Supersteps for Figure 4. Nodes inside the dashed area are a superstep, named VSS. Four supersteps are shown in the figure: VSS(g0, 0), VSS(g0,1), VSS(g0,2), and VSS(g0, 3).

Fig. 10. Two kinds of instruction placement for power gating: (a) two concurrent thread fragments TF1 and TF2 using a power-gating candidate C1 and their component usage; (b) and (c) two strategies for placing power-gating instructions among TF1 and TF2. In (b), the leakage energy of thread fragments is worth gating (i.e., the calculation result of Eq. 10), so PPG instructions are inserted into all thread fragments; in (c), the leakage energy of thread fragments is not worth gating, in which case we insert conventional power-gating instructions before and after the MHP region.

Fig. 9. Two thread fragments TF1 and TF2 in an MHP region and their utilization status for the power-gating candidates C1, C2, and C3; in-use units are depicted with light gray boxes.

Table VI. Normalized Total Energy Consumptions of Randomly Generated TFGs for Setting A on Leakage Contribution Set to 10% and 30% (see Table V), Categorized by the Number of MHP Regions for Cases with Two Hardware Threads

Fig. 14. Normalized total energy consumptions of BSP programs from OpenCL kernels on four-way SMT system with leakage contribution set to 10%.

Fig. 13. Normalized total energy consumptions of BSP programs from BSPedupack.

Fig. 4. Hierarchical BSP program presented in a CFG, where four threads (T1 to T4) are divided into four supersteps by barriers. In the second superstep, four threads are further grouped into two groups (g1 to g2). Each subgroup has its own supersteps.

Fig. 1. The traditional power-gating mechanism adopted in a single-thread or SMT environment. Both environments are equipped with two categories of components C0 and C1, where C0 is capable of controlling the power-gating status of C1: (a) Two code segments of threads T1 and T2, where power-gating instructions are inserted by power-gating analysis results for threads T1 and T2 individually. Note that op1 of T1 and op2 and op5 of T2 demonstrate those cases where instructions might need more than one component (in this case, C0 and C1) to complete operation; (b) and (c) how the code segments in (a) are executed in a single-thread and an SMT environment, respectively. All component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross.


Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs

WEN-LI SHIH, National Tsing Hua University

YI-PING YOU, National Chiao Tung University

CHUNG-WEN HUANG and JENQ KUEN LEE, National Tsing Hua University

Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues.

This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism on leakage contribution set to 30%; and the total energy consumption is reduced by an average of 4.27% on leakage contribution set to 10%. The results demonstrate our mechanisms are effective in reducing the leakage energy of BSP multithread programs.

Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classiﬁcations—

Concurrent; distributed; parallel languages; D.3.4 [Programming Languages]: Processors—Compiler;

optimization

General Terms: Design, Language

Additional Key Words and Phrases: Compilers for low power, leakage power reduction, power-gating mechanisms, multithreading

ACM Reference Format:

Wen-Li Shih, Yi-Ping You, Chung-Wen Huang, and Jenq Kuen Lee. 2014. Compiler optimization for reducing leakage power in multithread BSP programs. ACM Trans. Des. Autom. Electron. Syst. 20, 1, Article 9 (November 2014), 34 pages.

DOI: http://dx.doi.org/10.1145/2668119

This work is supported in part by Ministry of Science and Technology (under grant no. 103-2220-E-007-019) and Ministry of Economic Affairs (under grant no. 103-EC-17-A-02-S1-202) in Taiwan.

Authors' addresses: W.-L. Shih, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan; Y.-P. You, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan; C.-W. Huang and J. K. Lee (corresponding author), Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan; email: jklee@cs.nthu.edu.tw.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2014 ACM 1084-4309/2014/11-ART9 $15.00

DOI: http://dx.doi.org/10.1145/2668119

ACM Transactions on Design Automation of Electronic Systems, Vol. 20, No. 1, Article 9, Pub. date: November 2014.


1. INTRODUCTION

Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992]. Aspects relative to combining architecture design and software arrangement at the instruction level have been addressed with the aim of reducing power consumption [Bellas et al. 2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 1997, 2003, 2013; Su and Despain 1995; Tiwari et al. 1997, 1998]. Major efforts in power optimization include dynamic and leakage power optimization. Works in dynamic power optimization include utilizing the value locality of registers [Chang and Pedram 1995], scheduling VLIW (very long instruction word) instructions to reduce the power consumption on the instruction bus [Lee et al. 2003], reducing instruction encoding to reduce code size and power consumption [Lee et al. 2013], and gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]. Compiler code for reducing leakage power can employ power gating [Kao and Chandrakasan 2000; Butts and Sohi 2000; Hu et al. 2004]. Various studies have attempted to reduce the leakage power using integrated architectures and compiler-based power gating mechanisms [Dropsho et al. 2002; Yang et al. 2002; You et al. 2002, 2007; Rele et al. 2002; Zhang et al. 2003; Li and Xue 2004]. These approaches involve compilers inserting instructions into programs to shut down and wake up components as appropriate, based on a dataflow analysis or a profiling analysis. The power analysis and instruction insertion are further integrated into trace-based binary translation [Li and Xue 2004]. The Sink-N-Hoist framework [You et al. 2005, 2007] has been used to reduce the number of power-gating instructions generated by compilers. However, these power-gating control frameworks are only applicable to single-thread programs, and care is needed in multithread programs since some of the threads might share the same hardware resources. Turning resources on and off requires careful consideration of cases where multiple threads are present. Herein, we extend previous work to deal with the case of multithread systems in a bulk-synchronous parallel (BSP) model.

The BSP model, proposed by Valiant [1990], is designed to bridge between theory and practice of parallel computations. The BSP model structures multiple processors with local memory and a global barrier synchronous mechanism. Threads processed by processors are separated by synchronization points into supersteps, which form the basic unit of the BSP model. A superstep consists of a computation phase and a communication phase, allowing processors to compute data in local memory until encountering a global synchronous point in the computation phase and synchronizing local data with each other in the communication phase. The algorithm complexity of parallel programs can then be analyzed in the BSP model by considering both locality and parallelism issues. The BSP model works well for a family of parallel applications in which the tasks are balanced. However, global barrier synchronization was found to be inflexible in practice [McColl 1996], which prompted proposals for several enhanced BSP models presenting hierarchical groupings. NestStep [Keßler 2000] is a programming language for the BSP model that adopts nested parallelism with support for virtual shared memory. The H-BSP model [Cha and Lee 2001] splits processors into groups and dynamically runs BSP programs within each group in a bulk-synchronous fashion, while the multicore BSP [Valiant 2008, 2011] provides hierarchical multicore environments with independent communication costs. In the present study we adopted the concept of hierarchical BSP models [Keßler 2000; Cha and Lee 2001; Torre and Kruskal 1996] as the basis for a power reduction framework for use in parallel programming.
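
The superstep structure described above can be made concrete with a small Python sketch built on the standard `threading.Barrier`. It is only an illustration of one flat (non-hierarchical) superstep: the function name `run_superstep`, the squaring workload, and the ring-style exchange with the next thread are assumptions chosen for brevity and do not come from the article.

```python
import threading


def run_superstep(local_values):
    """Run one BSP-style superstep with one thread per local value."""
    n = len(local_values)
    barrier = threading.Barrier(n)     # global synchronization point
    inbox = [None] * n                 # one message slot per thread
    results = [None] * n

    def worker(pid):
        # Computation phase: work only on local data.
        partial = local_values[pid] ** 2
        # Communication phase: send the partial result to the next thread.
        inbox[(pid + 1) % n] = partial
        # Barrier: nobody proceeds until every message has been written.
        barrier.wait()
        # After the barrier, received data can be read safely.
        results[pid] = partial + inbox[pid]

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every thread waits at the barrier before reading its inbox, the result does not depend on how the threads are scheduled. A hierarchical BSP model would additionally allow barriers that apply only to a subgroup of the threads.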

Several methods have been proposed for analyzing the concurrency of multithread programs. May-happen-in-parallel (MHP) analysis computes which statements may be executed concurrently in a multithread program [Callahan and Sublok 1989; Duesterwald and Soffa 1991; Masticola and Ryder 1993; Naumovich and Avrunin 1998; Naumovich et al. 1999; Li and Verbrugge 2004; Barik 2005]. The problem of precisely computing all pairs of statements that may execute in parallel is undecidable [Ramalingam 2000]; however, it was proved that the problem is NP-complete if we assume that all control paths are executable [Taylor 1983]. The general approach involves using a dataflow framework to compute a conservative estimate of MHP information.

This article presents a multithread power-gating (MTPG) framework, composed of MTPG analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms, for reducing leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. SMT is a widely adopted processor technique that allows multithread programs to utilize functional units more efficiently by fetching and executing instructions from multiple threads at the same time. Our multithread programming model is based on hierarchical BSP models. We propose using thread fragment concurrency analysis (TFCA) to analyze MHP information among threads and MTPGA to report the component usages shared by multiple threads in hierarchical BSP models. TFCA reports the concurrency of threads, which allows power-gating candidates to be classified into those used by multiple threads and those used by a single thread. A conventional power-gating optimization framework [You et al. 2005, 2007] can be employed for candidates used by a single thread, with the compiler inserting instructions into the program to shut down and wake up components as appropriate. For candidates used concurrently by different threads, PPG instructions are adopted to turn components on and off as appropriate. Based on the TFCA, our MTPGA framework estimates the energy usage of multithread programs with our proposed cost model and inserts a pair of predicated power-on and predicated power-off operations at those positions where a power-gating candidate is first activated and last deactivated within a thread.
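
The classification of power-gating candidates mentioned above can be sketched as follows. This is a simplified illustration, not the TFCA or MTPGA algorithm itself: the function `classify_candidates`, the `usage` mapping from thread fragments to the components they use, and the `mhp_pairs` list of concurrent fragment pairs are all hypothetical inputs, and the rule "a candidate is shared if two concurrent fragments both use it" is an assumption made for this sketch.

```python
def classify_candidates(usage, mhp_pairs):
    """Split components into concurrently shared and single-thread sets.

    usage: dict mapping a thread fragment to the set of components it uses.
    mhp_pairs: iterable of (fragment_a, fragment_b) pairs that may run
    in parallel (hypothetical output of a concurrency analysis).
    """
    shared = set()
    for frag_a, frag_b in mhp_pairs:
        # A component used by both concurrent fragments needs PPG handling.
        shared |= usage.get(frag_a, set()) & usage.get(frag_b, set())
    all_components = set().union(*usage.values())
    # Everything else can use conventional power-gating instructions.
    return shared, all_components - shared
```

With `usage = {"TF1": {"C1", "C2"}, "TF2": {"C1"}, "TF3": {"C3"}}` and `mhp_pairs = [("TF1", "TF2")]`, only C1 is classified as shared, while C2 and C3 fall into the single-thread group.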

To the best of our knowledge, this is the first work to attempt to devise an analysis scheme for reducing leakage power in multithread programs. We performed experiments by incorporating TFCA and MTPGA into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. Our preliminary experimental results on a system with leakage contribution set to 30% show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs converted from the OpenCL kernel and by up to 10.49% for D-BSP programs relative to the system without a power-gating mechanism, and is reduced by an average of 4.27% for BSP programs and by up to 6.68% for D-BSP programs on a system with leakage contribution set to 10%, demonstrating that our mechanisms are effective in reducing the leakage power in hierarchical BSP multithread environments.

The remainder of the article is organized as follows. Section 2 gives a motivating example for the problem addressed by our study. Section 3 presents the technical rationale of our work, first presenting the PPG instruction and architectures, and then summarizing our compilation flow. Section 4 presents the method of TFCA for hierarchical BSP programs while Section 5 presents our MTPGA compiler framework for power optimizations. Section 6 presents the experimental results, discussion is given in Section 7, and conclusions are drawn in Section 8.

2. MOTIVATION

A system might be equipped with a power-gating mechanism to activate and deactivate components in order to reduce the leakage current [Goodacre 2011]. In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating instructions into programs [You et al. 2002, 2006] to ensure that the leakage current is gated appropriately. Traditional compiler analysis algorithms for low power focus on single-thread programs, and the methods cannot be directly applied to multithread programs. We use the example in Figure 1 to illustrate the scenario for motivating the need of new compiler schemes for reducing the power consumption in multithread environments. Assume we have hardware equipped with two categories of functional units, named C0 and C1, where C0 is capable of controlling the power-gating status of C1, and the hardware is configurable as a single-thread or SMT environment. We first present two pseudocode segments for threads T1 and T2 in Figure 1(a), which are analyzed and processed by traditional low-power optimization analysis. Note that op1 of T1 and op2 and op5 of T2 demonstrate the cases where instructions might need more than one component (in this case, C0 and C1) to complete operation. Traditional sequential analysis of the compiler will yield the component utilization for every instruction. As shown in Figure 1(a), the compiler inserts two power-gating instructions “pg-off C1” at the end of both code segments because C1 is no longer used for those segments in subsequent codes. These code segments work smoothly when executed individually in single-thread environments as shown in Figure 1(b). In the figure, all component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross. For thread T1, after instructions op1 and op2 are executed, the power-off instruction is executed at t; hence the system could save leakage energy from idle component C1. For thread T2, after five instructions are executed, the power-off instruction is executed at t6, which turns off component C1 to stop the leakage current.

Fig. 1. The traditional power-gating mechanism adopted in a single-thread or SMT environment. Both environments are equipped with two categories of components C0 and C1, where C0 is capable of controlling the power-gating status of C1: (a) Two code segments of threads T1 and T2, where power-gating instructions are inserted by power-gating analysis results for threads T1 and T2 individually. Note that op1 of T1 and op2 and op5 of T2 demonstrate those cases where instructions might need more than one component (in this case, C0 and C1) to complete operation; (b) and (c) how the code segments in (a) are executed in a single-thread and an SMT environment, respectively. All component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross.

However, when the multithread program is executed in an SMT system, the system could concurrently execute threads $T_1$ and $T_2$ with shared components $C_0$ and $C_1$, as illustrated in Figure 1(c). Thread $T_1$ powers off $C_1$ as soon as it reaches its power-off instruction, because the traditional compiler analysis reports that $C_1$ will no longer be used in $T_1$ and a power-off instruction is inserted. However, $T_2$ actually still uses $C_1$ at two later time steps, which means that powering off $C_1$ at that point will make the system fail if the powered-off components fully rely on power-gating instructions; alternatively, the system would pay the penalty associated with executing $T_2$ after the power-off if the system could internally turn on the components according to the status of instruction queues.
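The failure mode described above can be reproduced with a small simulation. This is only an illustrative sketch: the names `T1`, `T2`, `C0`, `C1`, the function `run_smt`, and the per-instruction component sets are assumptions made for the example (only the fact that op1 of the first thread and op2 and op5 of the second thread need both components comes from the text), and the one-instruction-per-thread-per-step schedule is a simplification of a real SMT pipeline.

```python
# Each instruction is the set of components it needs, or a "pg-off X" string
# for the power-off instruction inserted by the single-thread analysis.
# The concrete component sets are hypothetical, chosen only for illustration.
T1 = [{"C0", "C1"}, {"C1"}, "pg-off C1"]
T2 = [{"C0"}, {"C0", "C1"}, {"C0"}, {"C1"}, {"C0", "C1"}, "pg-off C1"]


def run_smt(threads):
    """Issue one instruction per thread per time step.

    Returns a list of (step, thread, component) tuples for every use of a
    component that was already powered off by some thread.
    """
    powered = {"C0": True, "C1": True}  # assumption: everything starts on
    violations = []
    steps = max(len(code) for code in threads.values())
    for step in range(steps):
        for name, code in threads.items():
            if step >= len(code):
                continue  # this thread has already finished
            instr = code[step]
            if isinstance(instr, str):
                # "pg-off X": unconditional power-off, as in the naive scheme
                powered[instr.split()[1]] = False
            else:
                for comp in sorted(instr):
                    if not powered[comp]:
                        violations.append((step, name, comp))
    return violations


# Executed individually (single-thread mode), neither segment misbehaves.
print(run_smt({"T1": T1}), run_smt({"T2": T2}))
# Executed concurrently, T1's power-off hits T2's remaining uses of C1.
print(run_smt({"T1": T1, "T2": T2}))
```

In this made-up schedule the second thread touches the gated component twice after the first thread's power-off, which is the situation the paragraph above warns about.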

The preceding example indicates that the traditional single-thread analyzer cannot be naively applied to the MTPG case, as it will likely break the logic that a unit must be in the active state (i.e., powered on) before being used for processing, since the unit might be powered off by a thread while other concurrent threads are still using or about to use it. Moreover, a unit might be powered on multiple times by a set of concurrent threads. These problems must be appropriately addressed when constructing power-gating controls for multithread programs. This article presents our solution for addressing this issue.

3. TECHNICAL RATIONALE

3.1. PPG Operations

Predicated execution support provides an effective means to eliminate branches from an instruction stream. Predicated or guarded execution refers to the conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate [Hsu and Davidson 1986]. Predicated instructions are fetched regardless of their predicate value. Instructions whose predicate is true are executed normally, while those whose predicate is false are nullified and thus prevented from modifying the processor state.
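As a rough illustration of predicated execution, the sketch below interprets a branch-free instruction stream in which every instruction carries a predicate. The function name `run_predicated`, the tuple encoding of instructions, and the predicate names are assumptions made for this example and do not correspond to any particular instruction set.

```python
def run_predicated(program, predicates):
    """Interpret a list of (predicate, destination, value) instructions.

    Every instruction is fetched; only those whose predicate is true
    modify the register state. Returns (registers, fetched_count).
    """
    regs = {}
    fetched = 0
    for pred, dest, value in program:
        fetched += 1  # fetched regardless of the predicate value
        if predicates[pred]:
            regs[dest] = value  # predicate true: executed normally
        # predicate false: nullified, register state is left untouched
    return regs, fetched


# The branch "if p: x = 1 else: x = 2" becomes two predicated instructions
# guarded by p and by its complement, with no branch in the stream.
program = [("p", "x", 1), ("not_p", "x", 2)]
regs, fetched = run_predicated(program, {"p": False, "not_p": True})
print(regs, fetched)
```

Both instructions are fetched, but only the one guarded by the true predicate writes `x`.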

We include the concept of predicated execution in power-gating devices for controlling the power gating of a set of concurrent threads. We combine the predicated executions into three special power-gating operations: predicated power-on, predicated power-off, and initialization operations. The main ideas are: (1) to turn on a component only when it is actually in the off state; (2) to keep track of the number of threads using the component; and (3) to turn off the component only when the last of the threads using it exits. Note that these operations must be atomic with respect to each other in order to prevent multiple threads from accessing the power-gating control at the same time.

- Initialization operation. An initialization operation is designed to clear all predicated bits (i.e., $pgp_1, pgp_2, \ldots, pgp_N$) and reset all reference counters (i.e., $rc_1, rc_2, \ldots, rc_N$) when the processor is starting up.

- Predicated power-on operation. The predicated power-on operation takes an explicit operand and two implicit operands to record component usage and conditionally turn on a power-gating candidate. The explicit operand is power-gating candidate $C_i$, and the implicit operands are the predicated bit $pgp_i$ of $C_i$ and a reference counter $rc_i$. The operation consists of the following steps:

(1) power on $C_i$ if $pgp_i$ (i.e., the predicated bit of $C_i$) is set;

(2) increase $rc_i$ (i.e., the reference counter of $C_i$) by 1. The reference counter keeps track of the number of threads that reference the power-gating candidate at this time; and

(3) unset predicated bit $pgp_i$.

- Predicated power-off operation. The predicated power-off operation also takes an explicit operand $C_i$ and two implicit operands $pgp_i$ and $rc_i$. Predicated power-off instructions update component usage $rc_i$ and conditionally turn off a power-gating candidate $C_i$ according to predicated bit $pgp_i$. The operation consists of the following steps:

(1) decrease the reference counter $rc_i$ by 1;

(2) set predicated bit $pgp_i$ if reference counter $rc_i$ is 0; and

(3) power off $C_i$ if predicated bit $pgp_i$ is set.
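The three operations can be modeled in software as follows. This is a minimal sketch, not the hardware mechanism itself: the class name `PPGController`, the use of a `threading.Lock` to stand in for the required atomicity, and the assumption that every component is powered on when the processor starts are all choices made for this example.

```python
import threading


class PPGController:
    """Software model of the predicated power-gating (PPG) operations."""

    def __init__(self, components):
        self._lock = threading.Lock()  # stands in for hardware atomicity
        # Assumption: every component is powered on at processor start-up.
        self.powered = {c: True for c in components}
        self.pgp = {}  # predicated bit per component
        self.rc = {}   # reference counter per component
        self.initialize()

    def initialize(self):
        """Clear all predicated bits and reset all reference counters."""
        with self._lock:
            for c in self.powered:
                self.pgp[c] = False
                self.rc[c] = 0

    def power_on(self, c):
        """Predicated power-on of component c."""
        with self._lock:
            if self.pgp[c]:          # step 1: power on only if the bit is set
                self.powered[c] = True
            self.rc[c] += 1          # step 2: one more thread references c
            self.pgp[c] = False      # step 3: unset the predicated bit

    def power_off(self, c):
        """Predicated power-off of component c."""
        with self._lock:
            self.rc[c] -= 1          # step 1: this thread no longer uses c
            if self.rc[c] == 0:      # step 2: last user has left
                self.pgp[c] = True
            if self.pgp[c]:          # step 3: gate c only if the bit is set
                self.powered[c] = False


ctrl = PPGController(["C1"])
ctrl.power_on("C1")   # first thread starts using C1
ctrl.power_on("C1")   # second thread starts using C1
ctrl.power_off("C1")  # first thread is done: C1 must stay on
print(ctrl.powered["C1"], ctrl.rc["C1"])
ctrl.power_off("C1")  # second thread is done: C1 is gated
print(ctrl.powered["C1"], ctrl.pgp["C1"])
```

In this model the predicated bit effectively records that the component is currently gated, so a later `power_on` call wakes it up exactly once even if several threads request it.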

