Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs

WEN-LI SHIH, National Tsing Hua University
YI-PING YOU, National Chiao Tung University
CHUNG-WEN HUANG and JENQ KUEN LEE, National Tsing Hua University
Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues.
This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism on leakage contribution set to 30%; and the total energy consumption is reduced by an average of 4.27% on leakage contribution set to 10%. The results demonstrate our mechanisms are effective in reducing the leakage energy of BSP multithread programs.
Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classifications—
Concurrent; distributed; parallel languages; D.3.4 [Programming Languages]: Processors—Compiler;
optimization
General Terms: Design, Language
Additional Key Words and Phrases: Compilers for low power, leakage power reduction, power-gating mech-
anisms, multithreading
ACM Reference Format:
Wen-Li Shih, Yi-Ping You, Chung-Wen Huang, and Jenq Kuen Lee. 2014. Compiler optimization for reducing
leakage power in multithread BSP programs. ACM Trans. Des. Autom. Electron. Syst. 20, 1, Article 9
(November 2014), 34 pages.
DOI: http://dx.doi.org/10.1145/2668119
This work is supported in part by Ministry of Science and Technology (under grant no. 103-2220-E-007-019)
and Ministry of Economic Affairs (under grant no. 103-EC-17-A-02-S1-202) in Taiwan.
Authors' addresses: W.-L. Shih, Department of Computer Science, National Tsing Hua University, Hsinchu,
Taiwan; Y.-P. You, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan;
C.-W. Huang and J. K. Lee (corresponding author), Department of Computer Science, National Tsing Hua
University, Hsinchu, Taiwan; email: jklee@cs.nthu.edu.tw.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by
others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to
post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.
© 2014 ACM 1084-4309/2014/11-ART9 $15.00
DOI: http://dx.doi.org/10.1145/2668119
ACM Transactions on Design Automation of Electronic Systems, Vol. 20, No. 1, Article 9, Pub. date: November 2014.

1. INTRODUCTION
Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992]. Aspects related to combining architecture design and software arrangement at the instruction level have been addressed with the aim of reducing power consumption [Bellas et al. 2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 1997, 2003, 2013; Su and Despain 1995; Tiwari et al. 1997, 1998]. Major efforts in power optimization include dynamic and leakage power optimization. Works in dynamic power optimization include utilizing the value locality of registers [Chang and Pedram 1995], scheduling VLIW (very long instruction word) instructions to reduce the power consumption on the instruction bus [Lee et al. 2003], reducing instruction encoding to reduce code size and power consumption [Lee et al. 2013], and gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]. Compiler code for reducing leakage power can employ power gating [Kao and Chandrakasan 2000; Butts and Sohi 2000; Hu et al. 2004]. Various studies have attempted to reduce the leakage power using integrated architectures and compiler-based power gating mechanisms [Dropsho et al. 2002; Yang et al. 2002; You et al. 2002, 2007; Rele et al. 2002; Zhang et al. 2003; Li and Xue 2004]. These approaches involve compilers inserting instructions into programs to shut down and wake up components as appropriate, based on a dataflow analysis or a profiling analysis. The power analysis and instruction insertion are further integrated into trace-based binary translation [Li and Xue 2004]. The Sink-N-Hoist framework [You et al. 2005, 2007] has been used to reduce the number of power-gating instructions generated by compilers. However, these power-gating control frameworks are only applicable to single-thread programs, and care is needed in multithread programs since some of the threads might share the same hardware resources. Turning resources on and off requires careful consideration of cases where multiple threads are present. Herein, we extend previous work to deal with the case of multithread systems in a bulk-synchronous parallel (BSP) model.
The BSP model, proposed by Valiant [1990], is designed to bridge the theory and practice of parallel computation. The BSP model structures multiple processors with local memory and a global barrier synchronization mechanism. The execution of threads on the processors is divided by synchronization points into supersteps, which form the basic unit of the BSP model. A superstep consists of a computation phase and a communication phase, allowing processors to compute data in local memory until encountering a global synchronization point in the computation phase and synchronizing local data with each other in the communication phase. The algorithm complexity of parallel programs can then be analyzed in the BSP model by considering both locality and parallelism issues. The BSP model works well for a family of parallel applications in which the tasks are balanced. However, global barrier synchronization was found to be inflexible in practice [McColl 1996], which prompted proposals for several enhanced BSP models presenting hierarchical groupings. NestStep [Keßler 2000] is a programming language for the BSP model that adopts nested parallelism with support for virtual shared memory. The H-BSP model [Cha and Lee 2001] splits processors into groups and dynamically runs BSP programs within each group in a bulk-synchronous fashion, while the multicore BSP [Valiant 2008, 2011] provides hierarchical multicore environments with independent communication costs. In the present study we adopted the concept of hierarchical BSP models [Keßler 2000; Cha and Lee 2001; Torre and Kruskal 1996] as the basis for a power reduction framework for use in parallel programming.
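The superstep structure described above can be illustrated with a minimal Python sketch. It is not taken from the article: it assumes a flat (non-hierarchical) BSP program, and the function name `run_bsp`, the doubling computation, and the ring-style neighbor exchange are all hypothetical choices made for illustration.

```python
import threading

def run_bsp(initial, supersteps):
    """Run a toy flat BSP program and return the final local values.

    Hypothetical example: each worker doubles its local value
    (computation phase), then adds the value published by its left
    neighbor (communication phase). Barriers separate the phases.
    """
    n = len(initial)
    local = list(initial)            # local[i] is touched only by worker i
    published = [0] * n              # values exchanged between workers
    barrier = threading.Barrier(n)   # global barrier synchronization

    def worker(i):
        for _ in range(supersteps):
            # Computation phase: operate on local data only.
            local[i] *= 2
            published[i] = local[i]
            barrier.wait()           # global synchronization point
            # Communication phase: read what the neighbor published.
            local[i] += published[(i - 1) % n]
            barrier.wait()           # end of the superstep

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local
```

Because no worker can start the next computation phase before every worker has finished reading, the result does not depend on how the threads are interleaved; for example, `run_bsp([1, 2, 3], 1)` yields `[8, 6, 10]`.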
Several methods have been proposed for analyzing the concurrency of multithread programs. May-happen-in-parallel (MHP) analysis computes which statements may be executed concurrently in a multithread program [Callahan and Sublok 1989; Duesterwald and Soffa 1991; Masticola and Ryder 1993; Naumovich and Avrunin 1998; Naumovich et al. 1999; Li and Verbrugge 2004; Barik 2005]. The problem of precisely computing all pairs of statements that may execute in parallel is undecidable [Ramalingam 2000]; however, it was proved that the problem is NP-complete if we assume that all control paths are executable [Taylor 1983]. The general approach involves using a dataflow framework to compute a conservative estimate of MHP information.
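To make the idea of a conservative MHP estimate concrete, here is a simplified Python sketch. It is not the TFCA described in this article, nor any of the cited algorithms: it assumes a flat BSP program in which every thread passes the same global barriers, so statements of different threads can overlap only within the same superstep. The input format and the name `mhp_pairs` are hypothetical.

```python
from itertools import combinations

def mhp_pairs(threads):
    """Conservatively estimate may-happen-in-parallel statement pairs.

    `threads` maps a thread name to a list of supersteps, each superstep
    being a list of statement labels (a hypothetical input format).
    Assumes all threads pass the same global barriers, so two statements
    of different threads may overlap only in the same superstep.
    """
    pairs = set()
    for (ta, steps_a), (tb, steps_b) in combinations(sorted(threads.items()), 2):
        for stmts_a, stmts_b in zip(steps_a, steps_b):  # same superstep index
            for a in stmts_a:
                for b in stmts_b:
                    # Over-approximation: every cross-thread pair in the
                    # superstep is reported, feasible or not.
                    pairs.add(((ta, a), (tb, b)))
    return pairs
```

The estimate is conservative in the sense that it may report pairs that never actually overlap at run time, but under the stated barrier assumption it does not miss a pair that can.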
This article presents a multithread power-gating (MTPG) framework, composed of MTPG Analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. SMT is a widely adopted processor technique that allows multithread programs to utilize functional units more efficiently by fetching and executing instructions from multiple threads at the same time. Our multithread programming model is based on hierarchical BSP models. We propose using thread fragment concurrency analysis (TFCA) to analyze MHP information among threads and MTPGA to report the component usages shared by multiple threads in hierarchical BSP models. TFCA reports the concurrency of threads, which allows power-gating candidates to be classified into those used by multiple threads and those used by a single thread. A conventional power-gating optimization framework [You et al. 2005, 2007] can be employed for candidates used by a single thread, with the compiler inserting instructions into the program to shut down and wake up components as appropriate. For candidates used concurrently by different threads, PPG instructions are adopted to turn components on and off as appropriate. Based on the TFCA, our MTPGA framework estimates the energy usage of multithread programs with our proposed cost model and inserts a pair of predicated power-on and predicated power-off operations at those positions where a power-gating candidate is first activated and last deactivated within a thread.
To the best of our knowledge, this is the first work to attempt to devise an analysis scheme for reducing leakage power in multithread programs. We performed experiments by incorporating TFCA and MTPGA into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. Our preliminary experimental results on a system with leakage contribution set to 30% show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs converted from the OpenCL kernel and by up to 10.49% for D-BSP programs relative to the system without a power-gating mechanism, and is reduced by an average of 4.27% for BSP programs and by up to 6.68% for D-BSP programs on a system with leakage contribution set to 10%, demonstrating that our mechanisms are effective in reducing the leakage power in hierarchical BSP multithread environments.
The remainder of the article is organized as follows. Section 2 gives a motivating
example for the problem addressed by our study. Section 3 presents the technical
rationale of our work, first presenting the PPG instruction and architectures, and
then summarizing our compilation flow. Section 4 presents the method of TFCA for
hierarchical BSP programs while Section 5 presents our MTPGA compiler framework
for power optimizations. Section 6 presents the experimental results, discussion is
given in Section 7, and conclusions are drawn in Section 8.
2. MOTIVATION
A system might be equipped with a power-gating mechanism to activate and deactivate components in order to reduce the leakage current [Goodacre 2011]. In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating instructions into programs [You et al. 2002, 2006] to ensure that the leakage current is gated appropriately. Traditional compiler analysis algorithms for low power focus on single-thread programs, and the methods cannot be directly applied to multithread programs. We use the example in Figure 1 to illustrate the scenario and to motivate the need for new compiler schemes for reducing the power consumption in multithread environments. Assume we have hardware equipped with two categories of functional units, named $C_0$ and $C_1$, where $C_0$ is capable of controlling the power-gating status of $C_1$, and the hardware is configurable as a single-thread or SMT environment. We first present two pseudocode segments for threads $T_1$ and $T_2$ in Figure 1(a), which are analyzed and processed by traditional low-power optimization analysis. Note that op1 of $T_1$ and op2 and op5 of $T_2$ demonstrate the cases where instructions might need more than one component (in this case, $C_0$ and $C_1$) to complete operation. Traditional sequential analysis of the compiler will yield the component utilization for every instruction. As shown in Figure 1(a), the compiler inserts two power-gating instructions “pg-off $C_1$” at the end of both code segments because $C_1$ is no longer used for those segments in subsequent codes. These code segments work smoothly when executed individually in single-thread environments as shown in Figure 1(b). In the figure, all component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross. For thread $T_1$, after instructions op1 and op2 are executed, the power-off instruction is executed at $t_4$; hence the system could save leakage energy from idle component $C_1$. For thread $T_2$, after five instructions are executed, the power-off instruction is executed at $t_6$, which turns off component $C_1$ to stop the leakage current.

Fig. 1. The traditional power-gating mechanism adopted in a single-thread or SMT environment. Both environments are equipped with two categories of components $C_0$ and $C_1$, where $C_0$ is capable of controlling the power-gating status of $C_1$: (a) two code segments of threads $T_1$ and $T_2$, where power-gating instructions are inserted by power-gating analysis results for threads $T_1$ and $T_2$ individually. Note that op1 of $T_1$ and op2 and op5 of $T_2$ demonstrate those cases where instructions might need more than one component (in this case, $C_0$ and $C_1$) to complete operation; (b), (c) how the code segments in (a) are executed in a single-thread and SMT environment, respectively. All component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross.

However, when the multithread program is executed in an SMT system, the system could concurrently execute threads $T_1$ and $T_2$ with shared components $C_0$ and $C_1$ as illustrated in Figure 1(c). At time $t_4$, thread $T_1$ powers off $C_1$ because the traditional compiler analysis reports that $C_1$ will no longer be used in $T_1$ and a power-off instruction is inserted. However, $T_2$ actually still uses $C_1$ at time $t_4$ and $t_5$, which means that powering off $C_1$ at $t_4$ will make the system fail if the powered-off components fully rely on power-gating instructions; or the system would pay the penalty associated with executing $T_2$ at $t_4$ if the system could internally turn on the components according to the status of instruction queues.
The prior example indicates that the traditional single-thread analyzer cannot be
naively applied to the MTPG case, as it will likely break the logic that a unit must be in
the active state (i.e., powered on) before being used for processing, since the unit might
be powered off by a thread while other concurrent threads are still using or about to
use it. Moreover, a unit might be powered on multiple times by a set of concurrent
threads. The preceding problems must be appropriately addressed when constructing
power-gating controls for multithread programs. This article presents our solution for
addressing this issue.
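The failure mode in this example can be reproduced with a small Python sketch. The schedules below are only loosely modeled on Figure 1: the exact placement of each operation is an assumption, as are the tuple format and the name `first_conflict`. The model also assumes that components start powered on and that nothing turns them back on.

```python
def first_conflict(schedule):
    """Return the first (time, thread, component) use of a powered-off unit.

    `schedule` is a list of (time, thread, op, component) tuples in issue
    order, where op is "use" or "pg-off" (a hypothetical format).
    Components are assumed to start powered on.
    """
    powered_on = {}
    for time, thread, op, comp in schedule:
        if op == "pg-off":
            powered_on[comp] = False
        elif not powered_on.get(comp, True):
            return (time, thread, comp)
    return None

# T1 executed alone: its own pg-off comes after its last use of C1.
T1_ALONE = [
    (1, "T1", "use", "C1"),
    (2, "T1", "use", "C0"),
    (4, "T1", "pg-off", "C1"),
]

# T1 and T2 interleaved on an SMT machine sharing C0 and C1
# (assumed timing): T1 powers off C1 while T2 still needs it.
SMT = [
    (1, "T1", "use", "C1"), (1, "T2", "use", "C0"),
    (2, "T1", "use", "C0"), (2, "T2", "use", "C1"),
    (3, "T2", "use", "C0"),
    (4, "T1", "pg-off", "C1"), (4, "T2", "use", "C1"),
    (5, "T2", "use", "C1"),
    (6, "T2", "pg-off", "C1"),
]
```

Under these assumptions, `first_conflict(T1_ALONE)` finds no conflict, whereas `first_conflict(SMT)` reports that $T_2$ touches the powered-off $C_1$ at time 4.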
3. TECHNICAL RATIONALE
3.1. PPG Operations
Predicated execution support provides an effective means to eliminate branches from an instruction stream. Predicated or guarded execution refers to the conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate [Hsu and Davidson 1986]. Predicated instructions are fetched regardless of their predicate value. Instructions whose predicate is true are executed normally, while those whose predicate is false are nullified and thus prevented from modifying the processor state.
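As a rough illustration of predicated execution, the following Python sketch interprets a branch-free instruction list in which every instruction is visited but only those with a true predicate update the state. The instruction encoding `(predicate, destination, compute)` and the names `execute`, `PROGRAM`, and `select` are hypothetical and do not correspond to any particular instruction set.

```python
def execute(program, state):
    """Run (predicate, dest, compute) instructions over a state dict.

    Every instruction is fetched; one whose predicate is false is
    nullified and leaves the state unchanged.
    """
    for pred, dest, compute in program:
        if state[pred]:
            state[dest] = compute(state)
    return state

# Branch-free (if-converted) form of: y = 1 if x > 0 else 2
PROGRAM = [
    ("true", "p", lambda s: s["x"] > 0),   # p = (x > 0)
    ("true", "q", lambda s: not s["p"]),   # q = not p
    ("p", "y", lambda s: 1),               # (p) y = 1
    ("q", "y", lambda s: 2),               # (q) y = 2
]

def select(x):
    """Evaluate PROGRAM for a given x and return the resulting y."""
    return execute(PROGRAM, {"true": True, "x": x})["y"]
```

Both assignments to `y` are always fetched, but exactly one of them commits, which is how the branch disappears from the instruction stream.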
We include the concept of predicated execution in power-gating devices for controlling
the power gating of a set of concurrent threads. We combine the predicated executions
into three special power-gating operations: predicated power-on, predicated power-off,
and initialization operations. The main ideas are: (1) to turn on a component only
when it is actually in the off state; (2) to keep track of the number of threads using the
component; and (3) to turn off the component when this is the last exit of all threads
using this component. Note that these operations must be atomic with respect to each
other in order to prevent multiple threads from accessing control at the same time.
Initialization operation. An initialization operation is designed to clear all predicated bits (i.e., $pgp_1$, $pgp_2$, ..., $pgp_N$) and reset all reference counters (i.e., $rc_1$, $rc_2$, ..., $rc_N$) when the processor is starting up.
Predicated power-on operation. The predicated power-on operation takes an explicit operand and two implicit operands to record component usage and conditionally turn on a power-gating candidate. The explicit operand is power-gating candidate $C_i$, and the implicit operands include predicated bit $pgp_i$ of $C_i$ and a reference counter $rc_i$ of $C_i$. The operation consists of the following steps:
(1) power on $C_i$ if $pgp_i$ (i.e., the predicated bit of $C_i$) is set;
(2) increase $rc_i$ (i.e., the reference counter of $C_i$) by 1. The reference counter keeps track of the number of threads that reference the power-gating candidate at this time; and
(3) unset predicated bit $pgp_i$.
Predicated power-off operation. The predicated power-off operation also takes an explicit operand $C_i$ and two implicit operands $pgp_i$ and $rc_i$. Predicated power-off instructions update component usage $rc_i$ and conditionally turn off a power-gating candidate $C_i$ by predicated bit $pgp_i$. The operation consists of the following steps:
(1) decrease the reference counter $rc_i$ by 1;
(2) set predicated bit $pgp_i$ if reference counter $rc_i$ is 0; and
(3) power off $C_i$ if predicated bit $pgp_i$ is set.