What future work do the authors mention in the paper "Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs"?

A possible direction for future research is to apply the method to GPU architectures.

Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs

WEN-LI SHIH, National Tsing Hua University

YI-PING YOU, National Chiao Tung University

CHUNG-WEN HUANG and JENQ KUEN LEE, National Tsing Hua University

Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues.

This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that, relative to a system without a power-gating mechanism, the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. The results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.

Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classiﬁcations—

Concurrent; distributed; parallel languages; D.3.4 [Programming Languages]: Processors—Compiler;

optimization

General Terms: Design, Language

Additional Key Words and Phrases: Compilers for low power, leakage power reduction, power-gating mech-

anisms, multithreading

ACM Reference Format:

Wen-Li Shih, Yi-Ping You, Chung-Wen Huang, and Jenq Kuen Lee. 2014. Compiler optimization for reducing

leakage power in multithread BSP programs. ACM Trans. Des. Autom. Electron. Syst. 20, 1, Article 9

(November 2014), 34 pages.

DOI: http://dx.doi.org/10.1145/2668119

This work is supported in part by Ministry of Science and Technology (under grant no. 103-2220-E-007-019)

and Ministry of Economic Affairs (under grant no. 103-EC-17-A-02-S1-202) in Taiwan.

Authors' addresses: W.-L. Shih, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan; Y.-P. You, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan; C.-W. Huang and J. K. Lee (corresponding author), Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan; email: jklee@cs.nthu.edu.tw.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted

without fee provided that copies are not made or distributed for proﬁt or commercial advantage and that

copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by

others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to

post on servers or to redistribute to lists, requires prior speciﬁc permission and/or a fee. Request permissions

from permissions@acm.org.

© 2014 ACM 1084-4309/2014/11-ART9 $15.00

ACM Transactions on Design Automation of Electronic Systems, Vol. 20, No. 1, Article 9, Pub. date: November 2014.

1. INTRODUCTION

Approaches for minimizing power dissipation can be applied at the algorithmic, com-

piler, architectural, logic, and circuit levels [Chandrakasan et al. 1992]. Aspects rel-

ative to combining architecture design and software arrangement at the instruction

level have been addressed with the aim of reducing power consumption [Bellas et al.

2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 1997, 2003, 2013; Su

and Despain 1995; Tiwari et al. 1997, 1998]. Major efforts in power optimization in-

clude dynamic and leakage power optimization. Works in dynamic power optimization

include utilizing the value locality of registers [Chang and Pedram 1995], scheduling

VLIW (very long instruction word) instructions to reduce the power consumption on

the instruction bus [Lee et al. 2003], reducing instruction encoding to reduce code

size and power consumption [Lee et al. 2013], and gating the clock to reduce work-

loads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]. Compiler code for reducing

leakage power can employ power gating [Kao and Chandrakasan 2000; Butts and Sohi

2000; Hu et al. 2004]. Various studies have attempted to reduce the leakage power using

integrated architectures and compiler-based power gating mechanisms [Dropsho et al.

2002; Yang et al. 2002; You et al. 2002, 2007; Rele et al. 2002; Zhang et al. 2003; Li and

Xue 2004]. These approaches involve compilers inserting instructions into programs to

shut down and wake up components as appropriate, based on a dataﬂow analysis or a

proﬁling analysis. The power analysis and instruction insertion are further integrated

into trace-based binary translation [Li and Xue 2004]. The Sink-N-Hoist framework

[You et al. 2005, 2007] has been used to reduce the number of power-gating instructions

generated by compilers. However, these power-gating control frameworks are only ap-

plicable to single-thread programs, and care is needed in multithread programs since

some of the threads might share the same hardware resources. Turning resources on

and off requires careful consideration of cases where multiple threads are present.

Herein, we extend previous work to deal with the case of multithread systems in a

bulk-synchronous parallel (BSP) model.

The BSP model, proposed by Valiant [1990], is designed to bridge between theory

and practice of parallel computations. The BSP model structures multiple processors

with local memory and a global barrier synchronous mechanism. Threads processed by

processors are separated by synchronous points, called supersteps, that form the basic

unit of the BSP model. A superstep consists of a computation phase and a communica-

tion phase, allowing processors to compute data in local memory until encountering a

global synchronous point in the computation phase and synchronizing local data with

each other in the communication phase. The algorithm complexity of parallel programs

can then be analyzed in the BSP model by considering both locality and parallelism

issues. The BSP model works well for a family of parallel applications in which the

tasks are balanced. However, global barrier synchronization was found to be inflexible in
practice [McColl 1996], which prompted proposals for several enhanced BSP models

presenting hierarchical groupings. NestStep [Keßler 2000] is a programming language

for the BSP model that adopts nested parallelism with support for virtual shared

memory. The H-BSP model [Cha and Lee 2001] splits processors into groups and dy-

namically runs BSP programs within each group in a bulk-synchronous fashion, while

the multicore BSP [Valiant 2008, 2011] provides hierarchical multicore environments

with independent communication costs. In the present study we adopted the concept

of hierarchical BSP models [Keßler 2000; Cha and Lee 2001; Torre and Kruskal 1996]

as the basis for a power reduction framework for use in parallel programming.
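The superstep structure described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the hierarchical model used in the article: the function name `run_bsp`, the doubling computation, and the ring-shaped communication pattern are all hypothetical choices, and Python threads stand in for BSP processors.

```python
import threading


def run_bsp(num_procs, supersteps):
    """Run a toy flat BSP program and return each processor's final local value."""
    barrier = threading.Barrier(num_procs)
    local = [pid + 1 for pid in range(num_procs)]  # per-processor local memory
    inbox = [0] * num_procs                        # one message slot per processor

    def worker(pid):
        for _ in range(supersteps):
            # Computation phase: operate on local data only.
            value = local[pid] * 2
            # Global synchronization point: the computation phase ends for everyone.
            barrier.wait()
            # Communication phase: send the result to the right-hand neighbor.
            inbox[(pid + 1) % num_procs] = value
            # Second barrier so every message is delivered before it is read.
            barrier.wait()
            local[pid] = inbox[pid]

    threads = [threading.Thread(target=worker, args=(pid,))
               for pid in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local
```

The two barriers per superstep keep the sketch race-free: no processor overwrites a neighbor's inbox until all processors have finished reading the previous message.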

Several methods have been proposed for analyzing the concurrency of multithread

programs. May-happen-in-parallel (MHP) analysis computes which statements may

be executed concurrently in a multithread program [Callahan and Sublok 1989;

Duesterwald and Soffa 1991; Masticola and Ryder 1993; Naumovich and Avrunin

1998; Naumovich et al. 1999; Li and Verbrugge 2004; Barik 2005]. The problem of

precisely computing all pairs of statements that may execute in parallel is undecidable

[Ramalingam 2000]; however, it was proved that the problem is NP-complete if we

assume that all control paths are executable [Taylor 1983]. The general approach

involves using a dataﬂow framework to compute a conservative estimate of MHP

information.
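To make the idea of a conservative MHP estimate concrete, the following sketch assumes a flat BSP program in which every statement is tagged with the superstep it belongs to, and reports two statements of different threads as possibly parallel whenever they share a superstep. This is a deliberately coarse approximation for illustration, not the TFCA algorithm of the article; the function name `mhp_pairs` and the input encoding are hypothetical.

```python
from itertools import combinations


def mhp_pairs(threads):
    """Return a conservative set of statement pairs that may happen in parallel.

    threads maps a thread name to a list of (statement, superstep) tuples.
    Statements of different threads are assumed to be possibly concurrent
    exactly when they lie in the same superstep (barriers order the rest).
    """
    pairs = set()
    for ta, tb in combinations(sorted(threads), 2):
        for stmt_a, step_a in threads[ta]:
            for stmt_b, step_b in threads[tb]:
                if step_a == step_b:
                    pairs.add(((ta, stmt_a), (tb, stmt_b)))
    return pairs


example = {
    "T1": [("a", 0), ("b", 1)],
    "T2": [("c", 0), ("d", 1)],
}
result = mhp_pairs(example)
```

The estimate is conservative in the sense that it may report pairs that never actually overlap at run time, but under the stated assumption it never misses a pair that can.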

This article presents a multithread power-gating (MTPG) framework, composed of

MTPG Analysis (MTPGA) and predicated power-gating (PPG) energy management

mechanisms for reducing leakage power when executing multithread programs on si-

multaneous multithreading (SMT) machines. SMT is a widely adopted processor tech-

nique that allows multithread programs to utilize functional units more efﬁciently

by fetching and executing instructions from multiple threads at the same time. Our

multithread programming model is based on hierarchical BSP models. We propose us-

ing thread fragment concurrency analysis (TFCA) to analyze MHP information among

threads and MTPGA to report the component usages shared by multiple threads in hi-

erarchical BSP models. TFCA reports the concurrency of threads, which allows power-

gating candidates to be classiﬁed into those used by multiple threads and those used

by a single thread. A conventional power-gating optimization framework [You et al.

2005, 2007] can be employed for candidates used by a single thread, with the compiler

inserting instructions into the program to shut down and wake up components as ap-

propriate. For candidates used concurrently by different threads, PPG instructions are

adopted to turn components on and off as appropriate. Based on the TFCA, our MTPGA

framework estimates the energy usage of multithread programs with our proposed cost

model and inserts a pair of predicated power-on and predicated power-off operations at

those positions where a power-gating candidate is ﬁrst activated and last deactivated

within a thread.
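The classification step described above can be illustrated with a small sketch. It assumes that a concurrency analysis has already produced the set of thread-fragment pairs that may run concurrently, and that the components used by each fragment are known. The function name `classify_candidates`, the fragment labels, and the component names are hypothetical and are not taken from the article.

```python
def classify_candidates(usage, concurrent):
    """Split power-gating candidates by how they are shared.

    usage maps a thread fragment to the set of components it uses.
    concurrent is a set of two-element frozensets naming fragments that
    may execute concurrently.
    Returns the components needing predicated power gating ("ppg") and
    those that conventional single-thread power gating can handle.
    """
    shared = set()
    for pair in concurrent:
        frag_a, frag_b = tuple(pair)
        # A component is shared if both concurrent fragments use it.
        shared |= usage[frag_a] & usage[frag_b]
    all_components = set()
    for components in usage.values():
        all_components |= components
    return {"ppg": shared, "conventional": all_components - shared}


usage = {"f1": {"alu", "mul"}, "f2": {"alu"}, "f3": {"fpu"}}
concurrent = {frozenset({"f1", "f2"})}
plan = classify_candidates(usage, concurrent)
```

In this example only `alu` is used by two concurrent fragments, so it is the only candidate routed to predicated power gating.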

To the best of our knowledge, this is the first work to attempt to devise an analysis scheme

for reducing leakage power in multithread programs. We performed experiments by

incorporating TFCA and MTPGA into SUIF compiler tools and by simulating the

energy consumption with a post-estimated SMT simulator based on Wattch toolkits.

Our preliminary experimental results on a system with leakage contribution set to

30% show that the total energy consumption of a system with PPG support and our

power optimization method is reduced by an average of 10.09% for BSP programs

converted from the OpenCL kernel and by up to 10.49% for D-BSP programs relative

to the system without a power-gating mechanism, and is reduced by an average of

4.27% for BSP programs and by up to 6.68% for D-BSP programs on a system with

leakage contribution set to 10%, demonstrating that our mechanisms are effective in reducing

the leakage power in hierarchical BSP multithread environments.

The remainder of the article is organized as follows. Section 2 gives a motivating

example for the problem addressed by our study. Section 3 presents the technical

rationale of our work, ﬁrst presenting the PPG instruction and architectures, and

then summarizing our compilation ﬂow. Section 4 presents the method of TFCA for

hierarchical BSP programs while Section 5 presents our MTPGA compiler framework

for power optimizations. Section 6 presents the experimental results, discussion is

given in Section 7, and conclusions are drawn in Section 8.

2. MOTIVATION

A system might be equipped with a power-gating mechanism to activate and deactivate components in order to reduce the leakage current [Goodacre 2011]. In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating instructions into programs [You et al. 2002, 2006] to ensure that the leakage current is gated appropriately. Traditional compiler analysis algorithms for low power focus on single-thread programs, and the methods cannot be directly applied to multithread programs. We use the example in Figure 1 to illustrate the scenario that motivates the need for new compiler schemes for reducing the power consumption in multithread environments.

Fig. 1. The traditional power-gating mechanism adopted in a single-thread or SMT environment. Both environments are equipped with two categories of components, one of which is capable of controlling the power-gating status of the other: (a) two code segments, one for each of two threads, where power-gating instructions are inserted according to the power-gating analysis results for each thread individually. Note that op1 of the first thread and op2 and op5 of the second thread demonstrate those cases where instructions might need more than one component (in this case, both components) to complete operation; (b) and (c) how the code segments in (a) are executed in a single-thread and an SMT environment, respectively. All component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross.

Assume we have hardware equipped with two categories of functional units, where one category is capable of controlling the power-gating status of the other, and the hardware is configurable as a single-thread or SMT environment. We first present two pseudocode segments, one for each of two threads, in Figure 1(a), which are analyzed and processed by traditional low-power optimization analysis. Note that op1 of the first thread and op2 and op5 of the second thread demonstrate the cases where instructions might need more than one component (in this case, both components) to complete operation. Traditional sequential analysis of the compiler will yield the component utilization for every instruction. As shown in Figure 1(a), the compiler inserts a power-gating instruction (pg-off) for the gated component at the end of each code segment because that component is no longer used by the segment in subsequent code. These code segments work smoothly when executed individually in single-thread environments, as shown in Figure 1(b). In the figure, all component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross. For the first thread, after instructions op1 and op2 are executed, the power-off instruction is executed; hence the system could save leakage energy from the idle gated component. For the second thread, after five instructions are executed, the power-off instruction is executed, which turns off the gated component to stop the leakage current.

However, when the multithread program is executed in an SMT system, the system could concurrently execute the two threads with both components shared, as illustrated in Figure 1(c). At some point, one thread powers off the gated component because the traditional compiler analysis reports that the component will no longer be used in that thread and a power-off instruction is inserted. However, the other thread actually still uses the component at two other points in time, which means that powering off the component at that moment will make the system fail if the powered-off components fully rely on power-gating instructions; or the system would pay the penalty associated with executing the other thread at that time if the system could internally turn on the components according to the status of instruction queues.

The prior example indicates that the traditional single-thread analyzer cannot be

naively applied to the MTPG case, as it will likely break the logic that a unit must be in

the active state (i.e., powered on) before being used for processing, since the unit might

be powered off by a thread while other concurrent threads are still using or about to

use it. Moreover, a unit might be powered on multiple times by a set of concurrent

threads. The preceding problems must be appropriately addressed when constructing

power-gating controls for multithread programs. This article presents our solution for

addressing this issue.
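The hazard can be reproduced with a small trace checker. This is a sketch under simplifying assumptions: an execution is a flat list of events in time order, each event either uses a component or powers it off, and nothing ever powers a component back on. The thread names "A" and "B", the component name "C", and the function name `find_power_hazards` are hypothetical.

```python
def find_power_hazards(trace):
    """Return (step, thread, component) for every use of a powered-off component.

    trace is a list of (thread, action, component) tuples in execution order,
    where action is "use" or "off".
    """
    powered_off = set()
    hazards = []
    for step, (thread, action, component) in enumerate(trace):
        if action == "off":
            powered_off.add(component)
        elif action == "use" and component in powered_off:
            hazards.append((step, thread, component))
    return hazards


# Each thread on its own is safe: the power-off follows its last use.
single_a = [("A", "use", "C"), ("A", "use", "C"), ("A", "off", "C")]

# Interleaved as on an SMT machine, thread A powers C off while B still needs it.
smt = [
    ("A", "use", "C"),
    ("B", "use", "C"),
    ("A", "use", "C"),
    ("A", "off", "C"),
    ("B", "use", "C"),
    ("B", "use", "C"),
]
hazards = find_power_hazards(smt)
```

The single-thread trace yields no hazards, while the interleaved trace reports both of thread B's later uses, which matches the failure mode described above.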

3. TECHNICAL RATIONALE

3.1. PPG Operations

Predicated execution support provides an effective means to eliminate branches from

an instruction stream. Predicated or guarded execution refers to the conditional exe-

cution of an instruction based on the value of a Boolean source operand, referred to as

the predicate [Hsu and Davidson 1986]. Predicated instructions are fetched regardless

of their predicate value. Instructions whose predicate is true are executed normally,

while those whose predicate is false are nulliﬁed and thus prevented from modifying

the processor state.

We include the concept of predicated execution in power-gating devices for controlling

the power gating of a set of concurrent threads. We combine the predicated executions

into three special power-gating operations: predicated power-on, predicated power-off,

and initialization operations. The main ideas are: (1) to turn on a component only

when it is actually in the off state; (2) to keep track of the number of threads using the

component; and (3) to turn off the component when this is the last exit of all threads

using this component. Note that these operations must be atomic with respect to each

other in order to prevent multiple threads from accessing control at the same time.

- Initialization operation. An initialization operation is designed to clear the predicated bits (pgp) and reset the reference counters (rc) of all power-gating candidates when the processor is starting up.

- Predicated power-on operation. The predicated power-on operation takes an explicit operand and two implicit operands to record component usage and conditionally turn on a power-gating candidate. The explicit operand is a power-gating candidate C, and the implicit operands are the predicated bit pgp of C and the reference counter rc of C. The operation consists of the following steps:

(1) power on C if pgp (i.e., the predicated bit of C) is set;

(2) increase rc (i.e., the reference counter of C) by 1. The reference counter keeps track of the number of threads that reference the power-gating candidate at this time; and

(3) unset predicated bit pgp.

- Predicated power-off operation. The predicated power-off operation also takes an explicit operand C and two implicit operands, pgp and rc. Predicated power-off instructions update the component usage rc and conditionally turn off the power-gating candidate C according to predicated bit pgp. The operation consists of the following steps:

(1) decrease the reference counter rc by 1;

(2) set predicated bit pgp if reference counter rc is 0; and

(3) power off C if predicated bit pgp is set.
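The three operations can be modeled in software as follows. This is a behavioral sketch only, not the hardware design: the class name `PredicatedPowerGate` and its method names are hypothetical, a Python lock stands in for the atomicity requirement, and the model assumes that every component is powered on when the processor starts, which the text does not state explicitly.

```python
import threading


class PredicatedPowerGate:
    """Software model of the predicated power-gating (PPG) operations."""

    def __init__(self, candidates):
        self._lock = threading.Lock()  # models atomicity of the operations
        self.initialize(candidates)

    def initialize(self, candidates):
        """Clear all predicated bits and reset all reference counters."""
        with self._lock:
            self.pgp = {c: False for c in candidates}
            self.rc = {c: 0 for c in candidates}
            # Assumption: components are powered on at processor start-up.
            self.powered = {c: True for c in candidates}

    def power_on(self, c):
        with self._lock:
            if self.pgp[c]:        # (1) power on only if actually off
                self.powered[c] = True
            self.rc[c] += 1        # (2) one more thread references c
            self.pgp[c] = False    # (3) unset the predicated bit

    def power_off(self, c):
        with self._lock:
            self.rc[c] -= 1        # (1) this thread is done with c
            if self.rc[c] == 0:    # (2) last user: set the predicated bit
                self.pgp[c] = True
            if self.pgp[c]:        # (3) power off only when no users remain
                self.powered[c] = False


gate = PredicatedPowerGate(["C"])
gate.power_on("C")    # first thread starts using C
gate.power_on("C")    # second thread starts using C
gate.power_off("C")   # first thread leaves; C must stay on
still_on = gate.powered["C"]
gate.power_off("C")   # last thread leaves; C is gated
finally_off = not gate.powered["C"]
```

With the reference counter in place, the early power-off issued by one thread no longer gates a component that another concurrent thread still uses.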
