Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs

TL;DR: This article presents a multithread power-gating framework, composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms, for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines.
Abstract: Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. The results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.

Summary

1. INTRODUCTION

  • Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992].
  • Turning resources on and off requires careful consideration of cases where multiple threads are present.
  • The BSP model, proposed by Valiant [1990], is designed to bridge between theory and practice of parallel computations.
  • A conventional power-gating optimization framework [You et al. 2005, 2007] can be employed for candidates used by a single thread, with the compiler inserting instructions into the program to shut down and wake up components as appropriate.

2. MOTIVATION

  • A system might be equipped with a power-gating mechanism to activate and deactivate components in order to reduce the leakage current [Goodacre 2011].
  • In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating instructions into programs [You et al. 2002, 2006] to ensure that the leakage current is gated appropriately.
  • These code segments work smoothly when executed individually in single-thread environments as shown in Figure 1(b).
  • For thread T2, after five instructions are executed, the power-off instruction is executed at t6, which turns off component C1 to stop the leakage current.
  • This article presents their solution for addressing this issue.

3.1. PPG Operations

  • Predicated execution support provides an effective means to eliminate branches from an instruction stream.
  • Instructions whose predicate is true are executed normally, while those whose predicate is false are nullified and thus prevented from modifying the processor state.
  • The authors combine the predicated executions into three special power-gating operations: predicated power-on, predicated power-off, and initialization operations.
  • The main ideas are: (1) to turn on a component only when it is actually in the off state; (2) to keep track of the number of threads using the component; and (3) to turn off the component when this is the last exit of all threads using this component.
  • The predicated power-on operation consists of the following steps: (1) power on C_i if pgp_i (i.e., the predicated bit of C_i) is set; (2) increase rc_i (i.e., the reference counter of C_i) by 1; and (3) unset predicated bit pgp_i.

3.2. Multithread Power-Gating Framework

  • Algorithm 1 summarizes their proposed compiler flow of the MTPG framework for BSP models.
  • To generate code with power-gating control in a multithread BSP program, the compiler should compute concurrency information and analyze component usage with respect to concurrent threads.
  • In step 2, detailed component usages can be calculated via dataflow equations by referencing component-activity dataflow analysis [You et al. 2002, 2006].
  • Steps 3 and 4 insert PPG instructions according to the information gathered in the previous steps while considering the cost model (Section 5 presents their MTPGA compiler framework for power optimizations).
  • Step 6 attempts to merge the power-gating instructions with the sink-n-hoist framework.

4. TFCA FOR BSP PROGRAMS

  • This section presents the concurrency analysis method for BSP programs.
  • Figure 3 presents an example for a superstep of the hierarchical BSP model, in which vertical black lines indicate threads and horizontal gray bars indicate barriers.
  • Eight individual threads and two barriers form the superstep, where the eight threads join and are divided into six groups.
  • In a hierarchical BSP program, programmers are allowed to divide threads into groups and the synchronization of threads would be limited in the groups, which form subsupersteps inside groups.
  • Computing the concurrency between threads actually involves considering the relation between threads that are present during a specific period; each such period is indicated by a set of neighboring nodes in the control-flow graph (CFG), denoted by a thread fragment.

4.1. Thread Fragment Graph

  • The relationships between thread fragments in a superstep are abstracted into a directed graph named the TFG, in which a node represents a thread fragment and an edge represents the control flow.
  • For a multiple-program multiple-data programming model, a multithread program is composed of multiple individual executable files that are executed on different processors; in such a case, a TFG is constructed from several CFGs of the individual programs.
  • In the first superstep, four threads are further grouped into two groups (g1 to g2).
  • For a given group g, let the numbers of BSP barrier ...
  • There are multiple nodes and exit nodes in a TFG, denoted by V ′ and V ′(exit), respectively.

4.2. Constructing TFGs

  • The authors designed a TFG construction algorithm that builds the TFG for each BSP superstep from a CFG and performs the lineal thread fragments analysis for each TFG.
  • Algorithm 2 is the kernel algorithm that collects the thread fragment of a designated group as well as constructs the TFG of the group and computes the concurrency information.
  • The output of the algorithm would be a set of nodes between the entry barrier of an ...
  • After performing TraverseGroup for each subgroup, the output blocked set V′blk needs to cross barriers; then the processed V′blk set is added to Vitr so that thread fragments are collected in the subsequent iterations.

4.3. Lineal Thread Fragments Analysis and MTF

  • Once the TFG has been constructed, the authors can compute the concurrent thread fragments of a hierarchical BSP program.
  • The authors collect all nodes along the TFG in their dataflow analysis and maintain the set of entire lineal thread fragments by adding nodes symmetrically so as to keep this set symmetric.
  • The MHP regions are determined by first constructing an MHP graph G′′ = (V′, E′′), which is an undirected graph whose nodes are thread fragments and whose edges connect nodes that may happen in parallel; that is, the edges are derived from the MTF set.
  • Table II lists the results of GEN, OUT, and LTF sets for the example in Figure 5.
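
The MHP graph described above can be illustrated with a minimal sketch. It assumes the MTF information is already available as a mapping from each thread fragment to the set of fragments that may run in parallel with it; the function name `build_mhp_graph`, the dictionary layout, and the fragment names are hypothetical and not taken from the article, and the computation of the MTF set itself is not shown.

```python
def build_mhp_graph(mtf):
    """Build an undirected MHP graph from a (hypothetical) MTF mapping.

    mtf maps a thread fragment name to the set of fragments that may
    happen in parallel with it. Returns (nodes, edges), where each edge
    is a frozenset of two fragment names so that direction is ignored.
    """
    nodes = set(mtf)
    edges = set()
    for fragment, partners in mtf.items():
        for other in partners:
            nodes.add(other)
            if other != fragment:
                # An undirected edge: {a, b} is the same edge as {b, a}.
                edges.add(frozenset((fragment, other)))
    return nodes, edges


# Illustrative input only: TF1 and TF2 may overlap, TF3 runs alone.
nodes, edges = build_mhp_graph({"TF1": {"TF2"}, "TF2": {"TF1"}, "TF3": set()})
```

Storing each edge as a `frozenset` keeps the graph symmetric by construction, which matches the symmetric treatment of lineal thread fragments mentioned in this section.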

5. MULTITHREAD POWER-GATING ANALYSIS

  • The TFCA results and the component usage for a power-gating candidate Ci of all concurrent thread fragments can be categorized in the following three cases.
  • Figure 10 demonstrates two possible placements of PPG operations based on the MTPGA results.
  • Therefore, within the MHP region there will be only a pair of power-gating operations, namely the first power-on and the last power-off operations, belonging to a pair of PPG operations being executed, whereas the power gating of the other PPG operations will be disabled.
  • Figure 10 portrays the implications of the aforesaid functions with TF1 and C1 as parameters.

6.1. Platform

  • The authors used a DEC-Alpha-compatible architecture with the PPG controls and two-way to eight-way simultaneous multithreading as the target architecture for their experiments.
  • By default, the simulator performs out-of-order execution.
  • Format with SUIF, processed by concurrent thread fragment analysis, and then translated to the machine- or instruction-level CFG form with Machine-SUIF.
  • Four components of the low-power optimization phase for multithread programs (implemented as a Machine-SUIF pass) were then performed, and finally, the compiler generated DEC Alpha assembly code with extended power-gating controls.
  • Also, the baseline data was provided by the power estimation of Wattch cc3 with a clock-gating mechanism that gates the clocks of unused resources in multiport hardware to reduce the dynamic power; however, leakage power still exists.

6.2. Simulation Results

  • To verify their proposed MTPGA algorithm and PPG mechanism, the authors focused on investigating component utilization in the supersteps.
  • As indicated in Table VI, while CADFA results in less leakage energy in power-gateable units (about 30% energy consumption relative to MTPG), it suffers the overhead of traditional power-gating instructions (about 11× the energy consumption relative to MTPG).
  • Figures 14 through 19 show their experimental results for BSP programs from OpenCL-based kernels.

7. DISCUSSION

  • The authors discuss the impact of latency and the capability to apply MTPGA on real hardware.
  • Latencies in processors include pipelining latency and memory access latency.
  • Nevertheless, MTPGA also conservatively estimates the inactive period in an MHP region with the worst case, namely the minimal thread execution time among threads.
  • When the instruction fetching policy changes in SMT, their method is also applied because it estimates energy consumption with the worst case of concurrent threads, which guarantees that leakage energy would be reduced in any case.
  • Finally, the power management controller removes the power-gating instructions from the power-gating direction buffer.

8. CONCLUSION

  • This article has presented a foundation framework for compilation optimization that reduces the power consumption on SMT architectures.
  • It has also presented PPG operations for improving the energy management of multithread programs in hierarchical BSP models.
  • Based on a multithread component analysis with dataflow equations, their MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management.

Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs
WEN-LI SHIH, National Tsing Hua University
YI-PING YOU, National Chiao Tung University
CHUNG-WEN HUANG and JENQ KUEN LEE, National Tsing Hua University
Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classifications—
Concurrent; distributed; parallel languages; D.3.4 [Programming Languages]: Processors—Compiler;
optimization
General Terms: Design, Language
Additional Key Words and Phrases: Compilers for low power, leakage power reduction, power-gating mech-
anisms, multithreading
ACM Reference Format:
Wen-Li Shih, Yi-Ping You, Chung-Wen Huang, and Jenq Kuen Lee. 2014. Compiler optimization for reducing
leakage power in multithread BSP programs. ACM Trans. Des. Autom. Electron. Syst. 20, 1, Article 9
(November 2014), 34 pages.
DOI: http://dx.doi.org/10.1145/2668119
This work is supported in part by Ministry of Science and Technology (under grant no. 103-2220-E-007-019)
and Ministry of Economic Affairs (under grant no. 103-EC-17-A-02-S1-202) in Taiwan.
Authors’ addresses: W.-L. Shih, Department of Computer Science, National Tsing Hua University, Hsinchu,
Taiwan; Y.-P. You, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan;
C.-W. Huang and J. K. Lee (corresponding author), Department of Computer Science, National Tsing Hua
University, Hsinchu, Taiwan; email: jklee@cs.nthu.edu.tw.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by
others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to
post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.
© 2014 ACM 1084-4309/2014/11-ART9 $15.00
DOI: http://dx.doi.org/10.1145/2668119
ACM Transactions on Design Automation of Electronic Systems, Vol. 20, No. 1, Article 9, Pub. date: November 2014.

1. INTRODUCTION
Approaches for minimizing power dissipation can be applied at the algorithmic, com-
piler, architectural, logic, and circuit levels [Chandrakasan et al. 1992]. Aspects rel-
ative to combining architecture design and software arrangement at the instruction
level have been addressed with the aim of reducing power consumption [Bellas et al.
2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 1997, 2003, 2013; Su
and Despain 1995; Tiwari et al. 1997, 1998]. Major efforts in power optimization in-
clude dynamic and leakage power optimization. Works in dynamic power optimization
include utilizing the value locality of registers [Chang and Pedram 1995], scheduling
VLIW (very long instruction word) instructions to reduce the power consumption on
the instruction bus [Lee et al. 2003], reducing instruction encoding to reduce code
size and power consumption [Lee et al. 2013], and gating the clock to reduce work-
loads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]. Compiler code for reducing
leakage power can employ power gating [Kao and Chandrakasan 2000; Butts and Sohi
2000; Hu et al. 2004]. Various studies have attempted to reduce the leakage power using
integrated architectures and compiler-based power gating mechanisms [Dropsho et al.
2002; Yang et al. 2002; You et al. 2002, 2007; Rele et al. 2002; Zhang et al. 2003; Li and
Xue 2004]. These approaches involve compilers inserting instructions into programs to
shut down and wake up components as appropriate, based on a dataflow analysis or a
profiling analysis. The power analysis and instruction insertion are further integrated
into trace-based binary translation [Li and Xue 2004]. The Sink-N-Hoist framework
[You et al. 2005, 2007] has been used to reduce the number of power-gating instructions
generated by compilers. However, these power-gating control frameworks are only ap-
plicable to single-thread programs, and care is needed in multithread programs since
some of the threads might share the same hardware resources. Turning resources on
and off requires careful consideration of cases where multiple threads are present.
Herein, we extend previous work to deal with the case of multithread systems in a
bulk-synchronous parallel (BSP) model.
The BSP model, proposed by Valiant [1990], is designed to bridge between theory
and practice of parallel computations. The BSP model structures multiple processors
with local memory and a global barrier synchronous mechanism. Threads processed by
processors are separated by synchronous points, called supersteps, that form the basic
unit of the BSP model. A superstep consists of a computation phase and a communica-
tion phase, allowing processors to compute data in local memory until encountering a
global synchronous point in the computation phase and synchronizing local data with
each other in the communication phase. The algorithm complexity of parallel programs
can then be analyzed in the BSP model by considering both locality and parallelism
issues. The BSP model works well for a family of parallel applications in which the
tasks are balanced. However, global barrier synchronization was found to be inflexible in
practice [McColl 1996], which prompted proposals for several enhanced BSP models
presenting hierarchical groupings. NestStep [Keßler 2000] is a programming language
for the BSP model that adopts nested parallelism with support for virtual shared
memory. The H-BSP model [Cha and Lee 2001] splits processors into groups and dy-
namically runs BSP programs within each group in a bulk-synchronous fashion, while
the multicore BSP [Valiant 2008, 2011] provides hierarchical multicore environments
with independent communication costs. In the present study we adopted the concept
of hierarchical BSP models [Keßler 2000; Cha and Lee 2001; Torre and Kruskal 1996]
as the basis for a power reduction framework for use in parallel programming.
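
The superstep structure described above (a computation phase on local data, a global synchronization point, and a communication phase) can be sketched as follows. This is only an illustrative model using Python threads and `threading.Barrier`; the function name `run_superstep` and the choice of summing partial results as the "communication" are assumptions made for the example and are not part of the article's framework.

```python
import threading


def run_superstep(local_data, compute):
    """Run one BSP-style superstep over len(local_data) threads."""
    n = len(local_data)
    barrier = threading.Barrier(n)   # global synchronization point
    partial = [None] * n             # result of each thread's local computation
    gathered = [None] * n            # what each thread sees after communication

    def worker(tid):
        # Computation phase: each thread touches only its own local data.
        partial[tid] = compute(local_data[tid])
        # No thread proceeds until every thread has finished computing.
        barrier.wait()
        # Communication phase: every thread reads the other threads' results.
        gathered[tid] = sum(partial)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gathered


# Three threads, each summing its own local list, then sharing the totals.
result = run_superstep([[1, 2], [3, 4], [5, 6]], sum)
```

Because all writes to `partial` happen before the barrier and all reads happen after it, every thread observes the same global total, which is the property that makes supersteps easy to reason about.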
Several methods have been proposed for analyzing the concurrency of multithread
programs. May-happen-in-parallel (MHP) analysis computes which statements may
be executed concurrently in a multithread program [Callahan and Subhlok 1989;
Duesterwald and Soffa 1991; Masticola and Ryder 1993; Naumovich and Avrunin
1998; Naumovich et al. 1999; Li and Verbrugge 2004; Barik 2005]. The problem of
precisely computing all pairs of statements that may execute in parallel is undecidable
[Ramalingam 2000]; however, it was proved that the problem is NP-complete if we
assume that all control paths are executable [Taylor 1983]. The general approach
involves using a dataflow framework to compute a conservative estimate of MHP
information.
This article presents a multithread power-gating (MTPG) framework, composed of
MTPG Analysis (MTPGA) and predicated power-gating (PPG) energy management
mechanisms for reducing leakage power when executing multithread programs on si-
multaneous multithreading (SMT) machines. SMT is a widely adopted processor tech-
nique that allows multithread programs to utilize functional units more efficiently
by fetching and executing instructions from multiple threads at the same time. Our
multithread programming model is based on hierarchical BSP models. We propose us-
ing thread fragment concurrency analysis (TFCA) to analyze MHP information among
threads and MTPGA to report the component usages shared by multiple threads in hi-
erarchical BSP models. TFCA reports the concurrency of threads, which allows power-
gating candidates to be classified into those used by multiple threads and those used
by a single thread. A conventional power-gating optimization framework [You et al.
2005, 2007] can be employed for candidates used by a single thread, with the compiler
inserting instructions into the program to shut down and wake up components as ap-
propriate. For candidates used concurrently by different threads, PPG instructions are
adopted to turn components on and off as appropriate. Based on the TFCA, our MTPGA
framework estimates the energy usage of multithread programs with our proposed cost
model and inserts a pair of predicated power-on and predicated power-off operations at
those positions where a power-gating candidate is first activated and last deactivated
within a thread.
To the best of our knowledge, this is the first work to attempt to devise an analysis scheme
for reducing leakage power in multithread programs. We performed experiments by
incorporating TFCA and MTPGA into SUIF compiler tools and by simulating the
energy consumption with a post-estimated SMT simulator based on Wattch toolkits.
Our preliminary experimental results on a system with leakage contribution set to
30% show that the total energy consumption of a system with PPG support and our
power optimization method is reduced by an average of 10.09% for BSP programs
converted from the OpenCL kernel and by up to 10.49% for D-BSP programs relative
to the system without a power-gating mechanism, and is reduced by an average of
4.27% for BSP programs and by up to 6.68% for D-BSP programs on a system with
leakage contribution set to 10%, demonstrating that our mechanisms are effective in reducing
the leakage power in hierarchical BSP multithread environments.
The remainder of the article is organized as follows. Section 2 gives a motivating
example for the problem addressed by our study. Section 3 presents the technical
rationale of our work, first presenting the PPG instruction and architectures, and
then summarizing our compilation flow. Section 4 presents the method of TFCA for
hierarchical BSP programs while Section 5 presents our MTPGA compiler framework
for power optimizations. Section 6 presents the experimental results, discussion is
given in Section 7, and conclusions are drawn in Section 8.
2. MOTIVATION
A system might be equipped with a power-gating mechanism to activate and deactivate components in order to reduce the leakage current [Goodacre 2011]. In such systems, programmers or compilers should analyze the behavior of programs, investigate component utilization based on execution sequences, and insert power-gating instructions into programs [You et al. 2002, 2006] to ensure that the leakage current is gated appropriately. Traditional compiler analysis algorithms for low power focus on single-thread programs, and the methods cannot be directly applied to multithread programs. We use the example in Figure 1 to illustrate the scenario for motivating the need of new compiler schemes for reducing the power consumption in multithread environments. Assume we have hardware equipped with two categories of functional units, named C0 and C1, where C0 is capable of controlling the power-gating status of C1, and the hardware is configurable as a single-thread or SMT environment. We first present two pseudocode segments for threads T1 and T2 in Figure 1(a), that are analyzed and processed by traditional low-power optimization analysis. Note that op1 of T1 and op2 and op5 of T2 demonstrate the cases where instructions might need more than one component (in this case, C0 and C1) to complete operation. Traditional sequential analysis of the compiler will yield the component utilization for every instruction. As shown in Figure 1(a), the compiler inserts two power-gating instructions “pg-off C1” at the end of both code segments because C1 is no longer used for those segments in subsequent codes. These code segments work smoothly when executed individually in single-thread environments as shown in Figure 1(b). In the figure, all component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross. For thread T1, after instructions op1 and op2 are executed, the power-off instruction is executed at t4; hence the system could save leakage energy from idle component C1. For thread T2, after five instructions are executed, the power-off instruction is executed at t6, which turns off component C1 to stop the leakage current.
Fig. 1. The traditional power-gating mechanism adopted in a single-thread or SMT environment. Both environments are equipped with two categories of components, C0 and C1, where C0 is capable of controlling the power-gating status of C1: (a) two code segments of threads T1 and T2, where power-gating instructions are inserted by power-gating analysis results for threads T1 and T2 individually. Note that op1 of T1 and op2 and op5 of T2 demonstrate those cases where instructions might need more than one component (in this case, C0 and C1) to complete operation; (b) and (c) how the code segments in (a) are executed in a single-thread and an SMT environment, respectively. All component usages of instructions for the two threads are labeled as square boxes with corresponding labels, and power-off instructions are labeled as boxes with a cross.
However, when the multithread program is executed in an SMT system, the system could concurrently execute threads T1 and T2 with shared components C0 and C1 as illustrated in Figure 1(c). At time t4, thread T1 powers off C1 because the traditional compiler analysis reports that C1 will no longer be used in T1 and a power-off instruction is inserted. However, T2 actually still uses C1 at time t4 and t5, which means that powering off C1 at t4 will make the system fail if the powered-off components fully rely on power-gating instructions; or the system would pay the penalty associated with executing T2 at t4 if the system could internally turn on the components according to the status of instruction queues.
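
The hazard in this example can be reproduced with a small sketch that replays a timeline of component uses and per-thread power-off instructions and reports every use of a component that another thread has already powered off. The event format, the function name `find_gating_hazards`, and the exact timeline below are hypothetical; only the facts that T1 powers off C1 at t4, that T2 still uses C1 at t4 and t5, and that T2 powers off C1 at t6 are taken from the text.

```python
def find_gating_hazards(events):
    """Return (time, thread, component) for each use of a powered-off component.

    events is a list of (time, thread, action, component) tuples, where
    action is "use" or "off". Components are assumed to start powered on.
    """
    powered_on = {}
    hazards = []
    # Replay in time order; within one time step apply power-offs first,
    # which models the worst case for a concurrently running thread.
    ordered = sorted(events, key=lambda e: (e[0], e[2] != "off"))
    for time, thread, action, comp in ordered:
        if action == "off":
            powered_on[comp] = False
        elif action == "use" and not powered_on.get(comp, True):
            hazards.append((time, thread, comp))
    return hazards


# T1 alone: its own power-off at t4 is harmless.
single_thread = [(1, "T1", "use", "C1"), (4, "T1", "off", "C1")]

# SMT: T1's power-off at t4 collides with T2's uses at t4 and t5.
smt = single_thread + [
    (4, "T2", "use", "C1"),
    (5, "T2", "use", "C1"),
    (6, "T2", "off", "C1"),
]

single_hazards = find_gating_hazards(single_thread)
smt_hazards = find_gating_hazards(smt)
```

In the single-thread replay no hazard is reported, whereas the SMT replay flags both of T2's late uses of C1, which is the failure mode that per-thread analysis cannot see.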
The prior example indicates that the traditional single-thread analyzer cannot be
naively applied to the MTPG case, as it will likely break the logic that a unit must be in
the active state (i.e., powered on) before being used for processing, since the unit might
be powered off by a thread while other concurrent threads are still using or about to
use it. Moreover, a unit might be powered on multiple times by a set of concurrent
threads. The preceding problems must be appropriately addressed when constructing
power-gating controls for multithread programs. This article presents our solution for
addressing this issue.
3. TECHNICAL RATIONALE
3.1. PPG Operations
Predicated execution support provides an effective means to eliminate branches from
an instruction stream. Predicated or guarded execution refers to the conditional exe-
cution of an instruction based on the value of a Boolean source operand, referred to as
the predicate [Hsu and Davidson 1986]. Predicated instructions are fetched regardless
of their predicate value. Instructions whose predicate is true are executed normally,
while those whose predicate is false are nullified and thus prevented from modifying
the processor state.
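
As a rough illustration of this behavior, the following sketch models a predicated instruction stream in which every instruction is fetched but only those with a true predicate update the register state. The tuple layout, the register and predicate names, and the function `execute` are assumptions made for the example and do not correspond to any particular instruction set.

```python
def execute(program, registers, predicates):
    """Run (predicate, dest, fn) instructions, nullifying false-predicate ones."""
    for pred, dest, fn in program:
        # Every instruction is fetched regardless of its predicate value.
        if predicates[pred]:
            # Predicate is true: the instruction executes normally.
            registers[dest] = fn(registers)
        # Predicate is false: the instruction is nullified and the
        # processor state (here, the register file) is left unchanged.
    return registers


regs = execute(
    [
        ("p_true", "r1", lambda r: r["r0"] + 1),
        ("p_false", "r2", lambda r: r["r0"] * 10),
    ],
    {"r0": 5, "r1": 0, "r2": 0},
    {"p_true": True, "p_false": False},
)
```

Note that no branch is needed to skip the second instruction; the predicate alone decides whether its result is committed.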
We include the concept of predicated execution in power-gating devices for controlling
the power gating of a set of concurrent threads. We combine the predicated executions
into three special power-gating operations: predicated power-on, predicated power-off,
and initialization operations. The main ideas are: (1) to turn on a component only
when it is actually in the off state; (2) to keep track of the number of threads using the
component; and (3) to turn off the component when this is the last exit of all threads
using this component. Note that these operations must be atomic with respect to each
other in order to prevent multiple threads from accessing control at the same time.
Initialization operation. An initialization operation is designed to clean all predicated
bits (i.e., pgp
1
, pgp
2
, ..., pgp
N
) and empty all reference counters (i.e., rc
1
, rc
2
, ...,
rc
N
) when the processor is starting up.
Predicated power-on operation. The predicated power-on operation takes an explicit
operand and two implicit operands to record component usage and conditionally turn
on a power-gating candidate. The explicit operand is power-gating candidate C
i
,and
the implicit operands include predicated bit pgp
i
of C
i
and a reference counter rc
i
of
C
i
. The operation consists of the following steps:
(1) power on C
i
if pgp
i
(i.e., the predicated bit of C
i
)isset;
(2) increase rc
i
(i.e., the reference counter of C
i
) by 1. The reference counter keeps
track of the number of threads that reference the power-gating candidate at this
time; and
(3) unset predicated bit pgp
i
.
Predicated power-off operation. The predicated power-off operation also takes an
explicit operand C_i and two implicit operands pgp_i and rc_i. Predicated power-off
instructions update component usage rc_i and conditionally turn off a power-gating
candidate C_i by predicated bit pgp_i. The operation consists of the following steps:
(1) decrease the reference counter rc_i by 1;
(2) set predicated bit pgp_i if reference counter rc_i is 0; and
(3) power off C_i if predicated bit pgp_i is set.
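The interplay of the two operations under concurrency can be sketched in software, using a lock to stand in for the atomicity requirement. Everything named here (`PredicatedPowerGate`, `run_superstep`, the event counters) is hypothetical. The sketch also assumes that the component is powered on at start-up, and it uses a barrier so that all threads hold the component at the same time, which makes the outcome deterministic.

```python
import threading


class PredicatedPowerGate:
    """Hypothetical model of one power-gating candidate C_i."""

    def __init__(self):
        self._lock = threading.Lock()  # models atomicity of the PPG operations
        self.pgp = False               # predicated bit, cleared by initialization
        self.rc = 0                    # reference counter, zero after initialization
        self.powered = True            # assumption: component is on at start-up
        self.on_events = 0
        self.off_events = 0

    def power_on(self):
        with self._lock:
            if self.pgp:               # (1) power on only if the bit is set
                self.powered = True
                self.on_events += 1
            self.rc += 1               # (2) one more thread uses the component
            self.pgp = False           # (3) unset the predicated bit

    def power_off(self):
        with self._lock:
            self.rc -= 1               # (1) one fewer thread uses the component
            if self.rc == 0:           # (2) last thread out sets the bit
                self.pgp = True
            if self.pgp:               # (3) power off only if the bit is set
                self.powered = False
                self.off_events += 1


def run_superstep(gate, num_threads):
    """Run threads that all acquire the component before any of them releases it."""
    barrier = threading.Barrier(num_threads)

    def worker():
        gate.power_on()
        barrier.wait()                 # every thread now references the component
        gate.power_off()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


gate = PredicatedPowerGate()
run_superstep(gate, 4)
run_superstep(gate, 4)
print(gate.on_events, gate.off_events, gate.powered, gate.rc)
```

Under these assumptions, each superstep powers the component off exactly once, when the last of the four threads leaves. Only the second superstep needs a power-on, since the component starts in the on state.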
ACM Transactions on Design Automation of Electronic Systems, Vol. 20, No. 1, Article 9, Pub. date: November 2014.

Citations
Proceedings ArticleDOI
26 Sep 2016
TL;DR: A power optimization framework composed of a multithread power-gating power model and a probabilistic time-slice based multithread power-gating algorithm for reducing the leakage energy when executing multithread programs on simultaneous multithreading (SMT) machines is presented.
Abstract: Various studies have attempted to reduce the leakage power using integrated architectures and compiler-based power-gating mechanisms. These approaches involve compilers inserting instructions into programs to shut down and wake up components as appropriate based on a compiler data-flow analysis. When applying power-gating control in compilers for multithread programs, conventional methods cannot be applied directly because the threads might share the same hardware resources. This paper presents a multithread power-gating framework composed of a multithread power-gating power model and a probabilistic time-slice based multithread power-gating algorithm for reducing the leakage energy when executing multithread programs on simultaneous multithreading (SMT) machines. Our framework estimates the energy usage of multithread programs with the aid of profitability computation and inserts predicated power-gating instructions as power controls for energy management. Compared to previous work, our method applies power control in multithread programs to the whole program of a concurrent thread region, while the previous work only considers the head and tail of a concurrent region. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with an SMT simulator based on Wattch toolkits. The experimental results show about 10% energy reduction in randomly generated graphs of multithread programs relative to one without power gating, where the leakage contribution was set to 30%. In OpenCL-based BSP kernels, our method reduces total energy by an average of 35%, which demonstrates that our method is efficient at reducing the leakage energy of multithread programs in an SMT environment.

1 citations


Cites background or methods from "Compiler Optimization for Reducing ..."

  • ...In OpenCL-based BSP programs, our method reduces total energy by average of 35% that demonstrates our method is efficient on reducing leakage energy in SMT environment....

    [...]

  • ...…usage that is not associated with the user program. Also, the baseline data were provided by the power estimation of Wattch cc3 with a clock-gating mechanism, which gates the clocks of those unused resources in multiport hardware to reduce the dynamic power; however, leakage power still exists....

    [...]

  • ...Our method is efficient on reducing leakage energy for multithread programs in SMT environment....

    [...]

  • ...In addition, the H-BSP model [9] splits processors into groups and dynamically runs BSP programs within each group in a bulk-synchronous fashion, while the multicore BSP provides hierarchical multi-core environment with independent communication costs....

    [...]

  • ...In this paper, we address the power optimization issues with multi-threading environments....

    [...]

Journal ArticleDOI
TL;DR: In this article, a fixed-point type that is supported by a 16-bit integer type and saturation instructions is added to replace the original 32-bit float type, and an auto-tuning method is proposed to use a uniform selector mechanism (USM) to find the binary point position for fixed-point type use.
Abstract: Today, as deep learning (DL) is applied more often in daily life, dedicated processors such as CPUs and GPUs have become very important for accelerating model executions. With the growth of technology, people are becoming accustomed to using edge devices, such as mobile phones, smart watches, and VR devices, in their daily lives. A variety of technologies using DL are gradually being applied to these edge devices. However, DL involves a large number of computations, so providing solutions on edge devices is a challenging problem. In this article, the proposed method enables a flow with the RISC-V Packed extension (P extension) in TVM. TVM, an open deep learning compiler for neural network models, is growing as a key infrastructure for DL computing. RISC-V is an open instruction set architecture (ISA) with customized and flexible features. The Packed-SIMD extension is a RISC-V extension that enables subword single-instruction multiple-data (SIMD) computations in RISC-V architectures to support fallback engines in AI computing. In the proposed flow, a fixed-point type that is supported by a 16-bit integer type and saturation instructions is added to replace the original 32-bit float type. In addition, an auto-tuning method is proposed to use a uniform selector mechanism (USM) to find the binary point position for fixed-point type use. The tensorization feature of TVM can be used to optimize for specific hardware, such as subword SIMD instructions with the RISC-V P extension. In our experiment on the Spike simulator, the proposed method with the USM can improve performance by approximately 2.54 to 6.15× in terms of instruction counts with little accuracy loss.
References
Journal ArticleDOI
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, and results are given quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model; yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.

3,885 citations

Journal ArticleDOI
TL;DR: In this paper, techniques for low power operation are presented which use the lowest possible supply voltage coupled with architectural, logic style, circuit, and technology optimizations to reduce power consumption in CMOS digital circuits while maintaining computational throughput.
Abstract: Motivated by emerging battery-operated applications that demand intensive computation in portable environments, techniques are investigated which reduce power consumption in CMOS digital circuits while maintaining computational throughput. Techniques for low-power operation are shown which use the lowest possible supply voltage coupled with architectural, logic style, circuit, and technology optimizations. An architecturally based scaling strategy is presented which indicates that the optimum voltage is much lower than that determined by other scaling considerations. This optimum is achieved by trading increased silicon area for reduced power consumption.

2,690 citations

Journal Article
TL;DR: An architecturally based scaling strategy is presented which indicates that the optimum voltage is much lower than that determined by other scaling considerations, and is achieved by trading increased silicon area for reduced power consumption.
Abstract: Motivated by emerging battery-operated applications that demand intensive computation in portable environments, techniques are investigated which reduce power consumption in CMOS digital circuits while maintaining computational throughput. Techniques for low-power operation are shown which use the lowest possible supply voltage coupled with architectural, logic style, circuit, and technology optimizations. An architecturally based scaling strategy is presented which indicates that the optimum voltage is much lower than that determined by other scaling considerations. This optimum is achieved by trading increased silicon area for reduced power consumption.

2,337 citations


"Compiler Optimization for Reducing ..." refers methods in this paper

  • ...Approaches for minimizing power dissipation can be applied at the algorithmic, compiler, architectural, logic, and circuit levels [Chandrakasan et al. 1992]....

    [...]

Journal ArticleDOI
TL;DR: In this article, the authors present several dual-threshold voltage techniques for reducing standby power dissipation while still maintaining high performance in static and dynamic combinational logic blocks. MTCMOS sleep transistor sizing issues are addressed, and a hierarchical sizing methodology based on mutual exclusive discharge patterns is presented.
Abstract: Scaling and power reduction trends in future technologies will cause subthreshold leakage currents to become an increasingly large component of total power dissipation. This paper presents several dual-threshold voltage techniques for reducing standby power dissipation while still maintaining high performance in static and dynamic combinational logic blocks. MTCMOS sleep transistor sizing issues are addressed, and a hierarchical sizing methodology based on mutual exclusive discharge patterns is presented. A dual-V_t domino logic style that provides the performance equivalent of a purely low-V_t design with the standby leakage characteristic of a purely high-V_t implementation is also proposed.

473 citations

Proceedings ArticleDOI
10 Oct 1994
TL;DR: In this article, the authors used an energy-delay metric to compare many of the proposed techniques and provided insight into some of the basic trade-offs in low-power design, including trade speed for power, do not waste power, and find a lower power problem.
Abstract: Recently there has been a surge of interest in low-power devices and design techniques. While many papers have been published describing power-saving techniques for use in digital systems, trade-offs between the methods are rarely discussed. We address this issue by using an energy-delay metric to compare many of the proposed techniques. Using this metric also provides insight into some of the basic trade-offs in low-power design. The next section describes the energy-loss mechanisms that are present in CMOS circuits, which provides the parameters that must be changed to lower the power dissipation. With these factors in mind, the rest of the paper reviews the energy saving techniques that have been proposed. These proposals fall into one of three main strategies: trade speed for power, do not waste power, and find a lower power problem.

471 citations


"Compiler Optimization for Reducing ..." refers background or methods in this paper

  • ...…long instruction word) instructions to reduce the power consumption on the instruction bus [Lee et al. 2003], reducing instruction encoding to reduce code size and power consumption [Lee et al. 2013], and gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]....

    [...]

  • ...…to combining architecture design and software arrangement at the instruction level have been addressed with the aim of reducing power consumption [Bellas et al. 2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 1997, 2003, 2013; Su and Despain 1995; Tiwari et al. 1997, 1998]....

    [...]

  • ...Aspects relative to combining architecture design and software arrangement at the instruction level have been addressed with the aim of reducing power consumption [Bellas et al. 2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 1997, 2003, 2013; Su and Despain 1995; Tiwari et al. 1997, 1998]....

    [...]

  • ...2013], and gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998]....

    [...]

Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "Compiler optimization for reducing leakage power in multithread BSP programs"?

This article addresses compiler optimization for reducing the power consumption of multithread programs. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. The authors performed experiments by incorporating their power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits.

It would be a possible direction for future research to apply their method on GPU architectures.