Compiler-Directed Energy Reduction
Using Dynamic Voltage Scaling and
Voltage Islands for Embedded Systems
Ozcan Ozturk, Member, IEEE, Mahmut Kandemir, Member, IEEE, and
Guangyu Chen, Member, IEEE
Abstract—Addressing power- and energy-related issues early in the system design flow ensures a good design and
minimizes iterations for faster turnaround time. In particular, optimizations at the software level, e.g., those supported by compilers,
are very important for minimizing the energy consumption of embedded applications. Recent research demonstrates that voltage
islands provide the flexibility to reduce power by selectively shutting down different regions of the chip and/or running selected parts
of the chip at different voltage/frequency levels. In contrast to most of the prior work on voltage islands, which mainly focused on
architecture design and IP placement issues, this paper studies the necessary software compiler support for voltage islands.
Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within
these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture.
Such automated support is critical since it is unrealistic to expect an application programmer to reach a good mapping that balances
multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that
our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are
consistent across a wide range of values of our major simulation parameters.
Index Terms—Voltage islands, compiler optimizations, energy consumption, voltage scaling, compiler-based parallelization
1 INTRODUCTION
POWER and energy related issues in deep submicron
embedded designs may limit functionality, reliability,
and performance, and severely affect yield and
manufacturability. It is well known that higher power dissipation
increases junction temperatures, which in turn slows down
transistors and increases interconnect resistance. Therefore,
power consumption needs to be considered as one of the
primary metrics in embedded system design, and any
optimization approach targeted at improving performance
may fall short if power is not also taken into account.
Recent years have witnessed several efforts aimed at
reducing power consumption from both hardware and
software perspectives. One such hardware approach is
voltage islands, which are areas (logic and/or memory)
supplied through separate, dedicated power feed. The prior
work on voltage islands so far generally focused on the
design and placement issues, and will be discussed in detail
in Section 2. Our goal in this paper is to study the necessary
software compiler support for voltage islands. Specifically,
we focus on an embedded multiprocessor architecture that
supports both voltage islands and control domains within
these islands and determine how an optimizing compiler
can map an embedded application onto this architecture.
The specific types of applications this paper considers are
embedded multimedia codes that are built from multi-
dimensional arrays of signals and multiloop nests that
operate on these arrays. One of the nice characteristics of
these applications is that an optimizing compiler can analyze
their data access patterns at compile time and restructure
them based on the target optimization in mind (e.g.,
enhancing iteration-level parallelism or improving data
locality).
We first give, in Section 3, a characterization of a set of
embedded applications that illustrates the potential benefits
that could be obtained from a voltage island based embedded
multiprocessor architecture. Based on this characterization,
we then present, in Section 4, a compiler-directed code
parallelization scheme, which is the main contribution of this
paper. A unique characteristic of this scheme is that it
minimizes power consumption (both dynamic and leakage)
by exploiting both task and data parallelism. We tested the
impact of this approach using a suite of eight embedded
multimedia applications and a simulation environment. Our
experiments, discussed in Section 5, reveal that the proposed
parallelization strategy is very effective in reducing power
(40.7 percent energy savings on average) as well as
execution cycles (14.6 percent performance improvement on
average). The experiments also show that the power
savings we achieve are consistent across a wide range of
values of our simulation parameters. For example, we found
that our approach scales very well as we increase the number
of processor cores in the architecture and the number of
voltage islands. Our results also indicate that, for the best
energy savings, both data and task parallelism need to be
employed together and application mapping should be
performed very carefully. Overall, our results show that
automated compiler support can be very effective in
exploiting unique features of a voltage island based
multiprocessor architecture.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. 2, FEBRUARY 2013

. O. Ozturk is with the Department of Computer Engineering, Bilkent
University, Ankara, Turkey. E-mail: ozturk@cs.bilkent.edu.tr.
. M. Kandemir is with the Computer Science and Engineering Department,
The Pennsylvania State University, 111 IST Building, University Park,
PA 16802. E-mail: kandemir@cse.psu.edu.
. G. Chen is with Facebook, 156 University Ave., Palo Alto, CA.

Manuscript received 28 Aug. 2010; revised 20 Oct. 2011; accepted 26 Oct.
2011; published online 29 Nov. 2011. Recommended for acceptance by
A. Zomaya. For information on obtaining reprints of this article, please send
e-mail to: tc@computer.org, and reference IEEECS Log Number
TC-2010-08-0479. Digital Object Identifier no. 10.1109/TC.2011.229.
0018-9340/13/$31.00 © 2013 IEEE. Published by the IEEE Computer Society.
2 RELATED WORK
As power consumption and heat dissipation are becoming
increasingly important issues in chip design, major chip
manufacturers, such as IBM [3] and Intel [4], are adopting
voltage islands in their current and future products [2]. For
example, voltage islands will be used in IBM’s new CU-08
manufacturing process for application-specific integrated
processors (ASIPs) [3]. The chip design tools that support
voltage islands are also starting to appear in the market
(e.g., [1]).
Different approaches for adapting and using voltage
islands have been explored [43], [23], [34]. Specifically,
Lackey et al. [19] discuss the methods and design tools that
are being used today to design voltage island based
architectures. Hu et al. [14] present an algorithm for
simultaneous voltage island partitioning, voltage level
assignment, and physical-level floorplanning. In [27], the
authors discuss the problem of energy-optimal local speed
and voltage selection in frequency/voltage island based
systems under given performance constraints. Liu et al. [22]
propose a method to reduce the total power under timing
constraints and to implement voltage islands with minimal
overheads. Wu et al. [42] implement a methodology to
exploit nontrivial voltage island boundaries. They evaluate
the optimal power versus design cost tradeoff under
performance requirements. In [7], the authors explore a
semicustom voltage-island approach based on internal
regulation and selective custom design. Giefers and Rettberg [12]
propose a technique that partitions the design into different
frequency/voltage islands during the scheduling phase of
the High-Level Synthesis (HLS). Our approach is different
from all these prior efforts on voltage islands as we focus on
automated compiler support for such architectures, with the
goal of reducing energy consumption.
An important advantage of chip multiprocessors is that
they can reduce cost from both performance and power
perspectives. The prior work [5], [6], [11], [17], [18], [26],
[28], [30], [39] discusses several advantages of these
architectures over complex single-processor-based designs.
Besides voltage island based systems, there are many prior
efforts that target reducing the energy consumption of
MPSoC-based architectures and chip multiprocessors [15].
For example, Ozturk et al. [29] propose an energy-efficient
on-chip memory design for embedded chip multiprocessor
systems. Manolache et al. [25] present a fault and energy-
aware communication mapping strategy for applications
implemented on NoCs. Soteriou and Peh [38] explore the
design space for a communication-channel turn-on/off based
dynamic power management technique for both on-chip
and off-chip interconnections. Yang et al. [44] present an
approximate algorithm for energy efficient scheduling. Rae
and Parameswaran [31] study voltage reduction for power
minimization. Shang et al. [35] propose applying dynamic
voltage scaling to communication channels.
There have been various compiler-based approaches to
voltage scaling. Chen et al. [10] propose a compiler-directed
approach where the compiler decides the appropriate
voltage/frequency levels to be used for each communica-
tion channel in the NoC. Their approach builds and operates
on a graph-based representation of a parallel program. In
[20], the authors propose a compiler-based communication link
voltage management technique. They specifically extract the
data communication pattern among parallel processors
along with network topology to set the voltages accordingly.
In [36], the authors propose a real-time loop scheduling
technique using dynamic voltage scaling. They implement
two different scheduling algorithms based on a directed
acyclic graph (DAG) using voltage scaling. Shi et al. [37]
present a framework for embedded processors using a
dynamic compiler. Their compiler-based approach specifi-
cally utilizes the OS-level information and hardware status.
Rangasamy et al. [32] propose a Petri-net-based performance
model where the compiler is used to set the frequencies.
Jejurikar and Gupta [16] present a DVS technique that
focuses on minimizing the entire system energy consumption.
In addition to leakage energy, the authors also consider
the energy consumption of components such as memory
and network interfaces. Hsu et al. [13] present a
compiler-based system to identify memory-bound loops. The
authors reduce the voltage level for such loops since the memory
subsystem is much slower than the processor. Our work is
different from these compiler-based studies since our
scheme minimizes dynamic and leakage energy consump-
tion for voltage islands by exploiting both task and data
parallelism at the loop level.
3 LOAD IMBALANCE IN MULTIMEDIA APPLICATIONS
In this section, we focus on a set of multimedia applications
and study the opportunities for saving energy through
voltage scaling. Fig. 1 shows load imbalances across eight
processors for the first five loop nests of some of our
multimedia applications (we will discuss the important
characteristics of these applications later). These results were
obtained by parallelizing the loop nests of the applications
on a uniform multiprocessor architecture (simulated using
the SIMICS tool-set [24]), i.e., all the processors are the same
and operate under the highest clock frequency and voltage
levels available. Each bar in this figure corresponds to the
normalized completion time of the workload of that
processor in the corresponding loop nest; the time of the
latest processor, i.e., the one that finishes last, is set
to 100 for ease of observing the load imbalance trends across
the applications. We see that, for all the applications and all
their loop nests, there is a significant load imbalance among
the parallel processors. There are several factors that
contribute to this load imbalance, which we explain next.
First, sometimes, the upper bound or lower bound of an
inner loop in a nest depends on the index of the outer loop.
This situation is illustrated by the following loop nest
written in a pseudolanguage:
for I := 1, N
  for J := I, N
    { ... }

In this loop nest, assuming that the outer loop is
parallelized over multiple processors and the inner loop
is run sequentially, each processor executes a different
number of loop iterations: because the lower bound of the
inner loop (J) depends on I, the inner trip count differs
from one outer iteration to the next. The second possible reason
for the load imbalance among the processors is the different
data cache behavior of the different processors. Since each
processor can experience a different number of data cache
hits and misses, this can cause load imbalance across them.
A third possible reason for load imbalance is the condi-
tional constructs such as IF statements within the bodies of
the parallelized loop nests. If two different processors
happen to take the different branches of an IF statement in
the loop iterations they execute, their execution times (i.e.,
the time it takes for a processor to complete its workload in
that loop nest) can be different from each other. Fig. 2 gives
the contribution of these three factors to the overall load
imbalance for each of our applications. The segment
marked “Other” in each bar shown in this figure represents
the remaining load imbalances in the corresponding
application whose source we could not identify. We see
from these results that the loop bound based imbalance, the
first reason discussed above, dominates the remaining
factors for all eight applications. This, in a sense, is good
news from the compiler’s perspective, as this is the only
type of load imbalance, among those mentioned above, that
we can identify at compile time and try to eliminate as
much as possible; the remaining causes of load imbalance
are very difficult to capture and characterize at compile
time. The next section discusses such a compiler approach
to exploit the load imbalances across processors in the
context of a voltage island based embedded multiprocessor
architecture.
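The dominant, loop-bound-induced imbalance can be made concrete with a short sketch (our own illustration, not code from the paper): block-partitioning the outer loop of a triangular nest such as the one above gives each processor a very different inner-iteration total.

```python
def triangular_workloads(n, p):
    """Iterations executed by each of p processors when the outer loop of
    'for I := 1, N / for J := I, N' is block-partitioned.
    Processor k gets outer iterations k*n//p + 1 .. (k+1)*n//p, and each
    outer iteration I contributes n - I + 1 inner iterations."""
    loads = []
    for k in range(p):
        lo, hi = k * n // p + 1, (k + 1) * n // p
        loads.append(sum(n - i + 1 for i in range(lo, hi + 1)))
    return loads

# With N = 300 and three processors, the first block of outer iterations
# is by far the heaviest: the loads are 25050, 15050, and 5050 iterations.
print(triangular_workloads(300, 3))
```

These are the same proportions that reappear in the worked example of Section 4.2.1, where each iteration is additionally weighted by a per-iteration cost of L instructions.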
4 COMPILER-DIRECTED APPLICATION CODE MAPPING
4.1 Architecture Abstraction
A high-level view of the embedded architecture considered
in this paper is shown in Fig. 3. In this architecture, the chip
area is divided into multiple voltage islands, each of which is
Fig. 1. Load imbalance across eight processors for the first five loop nests of some of our multimedia applications.
Fig. 2. Breakdown of the load imbalances in our applications.

controlled by a separate power feed and operates under a
different voltage level/frequency. We further assume that
each voltage island is divided into multiple power domains.
All the domains within an island are fed by the same Vdd
source but are independently controlled through intra-island switches.
To implement power domains, a power isolation logic
ensures that all inputs to the active power domain are
clamped to a stable value. The important point to note is that
this island-based architecture can help save both dynamic
and leakage power. Specifically, it can save dynamic energy
by employing different voltage levels for the different
islands, and leakage energy by shutting down the power
domains that are not needed by the current computation. The
difficult task, however, is to decide how a given embedded
multimedia application should be mapped onto this
multiprocessor architecture, i.e., how the code is parallelized
in this island-based architecture; this is discussed in the rest
of the paper.
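To first order, the dynamic-energy side of these savings can be quantified with the standard CMOS relation E ≈ C_eff · V² · f · t (a textbook approximation we assume here for illustration; the paper's evaluation uses a detailed simulator):

```python
def dynamic_energy(c_eff, v, f, t):
    """First-order CMOS dynamic energy: E = C_eff * V^2 * f * t
    (effective capacitance, supply voltage, clock frequency, time)."""
    return c_eff * v * v * f * t

# Running a fixed number of cycles at half the voltage (and, assuming
# frequency scales roughly with voltage, half the frequency) takes twice
# as long but costs about 4x less dynamic energy overall.
cycles = 1e9
full = dynamic_energy(1e-9, 1.0, 1e9, cycles / 1e9)
half = dynamic_energy(1e-9, 0.5, 0.5e9, cycles / 0.5e9)
print(full / half)
```

Leakage savings work differently: a power-gated domain burns (nearly) no static power regardless of voltage level, which is why shutting down unused domains is handled as a separate mechanism.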
4.2 Mapping Algorithm
In order to map a given multimedia code to the architecture
shown in Fig. 3, our approach uses both data parallelism and
task parallelism. Data parallelism involves performing a
similar computation on many data objects simultaneously.
In our approach, this corresponds to a group of processors
executing a given loop nest in parallel. All the processors
execute similar code (i.e., the same loop body) but work on
different parts of the array data, i.e., they execute different
iterations of the loop. Task parallelism, in comparison,
involves performing different tasks in parallel, where a task
is an arbitrary sequence of computations. In our approach,
this type of parallelism represents executing different loop
nests in different processors at the same time. Our compiler
uses a structure called the Loop Dependence Graph (LDG) to
represent the application code being optimized. Each node,
N_i, of this graph corresponds to a loop nest in the application,
and there is a directed edge from node N_i to N_j if the loop
nest represented by the latter is dependent on the loop nest
represented by the former. The proposed compiler support
maps this application onto our voltage island based
architecture. In the rest of this section, we describe the
details of three different voltage island and power domain
aware code mapping (parallelization) schemes that map a
given LDG onto our architecture.
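As a concrete illustration, the LDG can be represented with a plain adjacency structure. The class and method names below are our own, hypothetical choices; the paper does not specify an implementation:

```python
class LoopDependenceGraph:
    """Nodes are loop nests; an edge u -> v means nest v depends on nest u."""

    def __init__(self):
        self.succs = {}   # nest id -> set of dependent nest ids
        self.preds = {}   # nest id -> set of nests it depends on

    def add_nest(self, n):
        self.succs.setdefault(n, set())
        self.preds.setdefault(n, set())

    def add_dependence(self, src, dst):
        self.add_nest(src); self.add_nest(dst)
        self.succs[src].add(dst)
        self.preds[dst].add(src)

    def ready(self, done):
        """Nests whose predecessors have all completed; such nests can run
        concurrently, which is the basis of task parallelism here."""
        return [n for n in self.succs
                if n not in done and self.preds[n] <= done]
```

For a diamond-shaped LDG (nest 1 feeds nests 2 and 3, which both feed nest 4), `ready` first yields nest 1, then nests 2 and 3 together, then nest 4.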
4.2.1 The EA_DP Scheme
The first scheme that we describe, referred to as EA_DP,
exploits only data parallelism. It proceeds in three steps, as
suggested by its pseudocode given in Algorithm 1. The first
step is the parallelization step. In this step, the compiler
parallelizes the application on a loop nest basis. That is, each
loop nest of the given LDG is parallelized independently
considering the intrinsic data dependencies it has. Since we
are targeting a chip multiprocessor, our parallelization
strategy tries to achieve for each nest the outer loop
parallelism to the extent allowed by the data dependencies
exhibited by the loop nests [21]. The second step is the
processor workload estimation step. In this step, the
compiler estimates the load of each processor in each nest.
To do this, it performs two calculations: 1) iteration count
estimation and 2) per-iteration cost estimation. Since in
most array-based applications bounds of loops are known
before execution starts or they can be estimated through
profiling, estimating the iteration count for each loop nest is
not very difficult. The challenge is in determining the cost,
in terms of execution cycles, of a single iteration of a given
loop nest. Note that various Worst Case Execution Time
(WCET) calculation methods have been explored in the
literature [40]. Since the processors employed in our chip
multiprocessor are simple single-issue embedded cores, the
cost computation is closely dependent on the number and
types of the assembly instructions that will be generated for
the loop body. Specifically, we associate a base execution
cost with each type of assembly instruction. In addition, we
also estimate the number of cache misses. Since loop-based
embedded applications exhibit very good instruction
locality, we focus on data cache only and estimate data
cache misses using the method proposed by Carr et al. [8].
An important issue is to estimate, at the source level, what
assembly instructions will be generated for the loop body in
question. We address this problem as follows. The
constructs that are vital to the studied group of codes, that
is, array-based multimedia applications, include a typical
loop, a nested loop, assignment statements, array refer-
ences, and scalar variable references within and outside
loops. Our objective is to estimate the number of assembly
instructions of each type associated with the actual
execution of these constructs. To achieve this, the assembly
equivalents of several codes were obtained using our back-
end compiler (a variant of gcc) with the O3-level optimiza-
tion flag. Next, the portions of the assembly code were
correlated with corresponding high-level constructs to
extract the number and type of each instruction associated
with the construct.
Algorithm 1. EA_DP
N_L: Number of loop nests
N_P: Number of processors
L_i: Loop nest i
P_i: Processor i
SP_i: IDs of processors sorted by workload
W_i: Workload of processor i
T_i: Execution time of processor i
V_max: Highest voltage level
T_max: Execution time of SP_1 (maximum workload)
Fig. 3. An example voltage island based architecture, with three islands.
We assume that there exists a large off-chip memory, shared by all
processors.

1: i = 1
2: while i ≤ N_L do
3:   Parallelize loop nest L_i among the N_P processors
4:   i++
5: end while
6: i = 1
7: while i ≤ N_P do
8:   IterCount_i = Estimate the number of iterations executed by P_i
9:   IterCost_i = Estimate the average cost per iteration executed by P_i
10:  Estimated workload for P_i: W_i = IterCount_i × IterCost_i
11:  i++
12: end while
13: Sort the processors in non-increasing order of their workloads (SP_1 ... SP_{N_P})
14: Assign the highest voltage level V_max to SP_1
15: T_max = Time to execute the workload of SP_1
16: i = 2
17: while i ≤ N_P do
18:   Select the lowest voltage level V_L for SP_i such that T_i ≤ T_max
19:   i++
20: end while
To illustrate our parameter extraction process in more
detail, let us focus on some specifics of the following sample
constructs. First, let us focus on a loop construct. Each loop
construct is modeled to have a one-time overhead to load
the loop index variable into a register and initialize it. Each
loop also has an index comparison and an index increment
(or decrement) overhead, whose costs are proportional to
the number of loop iterations (called trip count or trip).
From correlating the high-level loop construct to the
corresponding assembly code, each loop initialization code
is estimated to execute one load (lw) and one add (add)
instruction (in general). Similarly, an estimate of trip+1 load
(lw), store-if-less-than (stl), and branch (bra) instructions is
associated with the index variable comparison. For the index
variable increment (resp. decrement), 2 × trip addition
(resp. subtraction) instructions and trip load, store, and jump
instructions are estimated to be performed. We next consider
extracting the number of instructions associated with array
accesses. First, the number and types of instructions
required to compute the address of the element are
identified. This requires the evaluation of the base address
of the array and the offset provided by the subscript(s). Our
current implementation considers the dimensionality of the
array in question, and computes the necessary instructions
for obtaining each subscript value. Computation of the
subscript operations is modeled using multiple shift and
addition/subtraction instructions, instead of multiplica-
tions, as this is the way our back-end compiler generates
code when invoked with the O3 optimization flag. Finally,
an additional load/store instruction is associated with
reading/writing the corresponding array element.
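The per-construct counts described above can be folded into a small estimator. The sketch below encodes only the loop-overhead model just given; treat it as illustrative, since the paper's actual estimates are derived by correlating real gcc -O3 assembly output:

```python
def loop_overhead_instructions(trip):
    """Estimated instruction counts for one loop's control overhead,
    per the model in the text: init = 1 lw + 1 add; comparison =
    (trip + 1) of each of lw, stl, bra; increment = 2*trip add plus
    trip of each of lw, sw, jump."""
    return {
        "lw":  1 + (trip + 1) + trip,  # init load + comparison loads + increment loads
        "add": 1 + 2 * trip,           # init add + index-increment additions
        "stl": trip + 1,               # store-if-less-than in each comparison
        "bra": trip + 1,               # branch in each comparison
        "sw":  trip,                   # store in each increment
        "jmp": trip,                   # jump back to the loop head
    }

# Total control-overhead instructions for a loop with a trip count of 100:
print(sum(loop_overhead_instructions(100).values()))
```

Weighting each instruction type by its base execution cost, and adding the estimated cache-miss penalty, yields the per-iteration cost used in the workload estimate.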
Based on the process outlined above, the compiler
estimates the iteration count for each processor and per-
iteration cost. Then, by multiplying these two, it calculates
the estimated workload for each processor. While this
workload estimation may not be extremely accurate, it
allows the compiler to rank processors according to their
workloads and assign suitable voltage levels and frequencies
to them, as will be described in the next step. As an
example, consider the code fragment shown in Fig. 4,
parallelized using three processors. Assuming that our
estimator estimates the cost of the loop body as L instructions,
the loads of processors P_0, P_1, and P_2 are 25050L, 15050L,
and 5050L, respectively.
The last step that implements EA_DP is voltage and
frequency assignment. In this step, the compiler first orders
the processors according to their nonincreasing workloads.
After that, the highest voltage is assigned to the processor
with the largest workload (the objective being not to affect
the execution time to the greatest extent possible). Then, the
processor with the second highest workload is assigned
the minimum voltage level V_k supported by the architecture
that does not cause its execution time to exceed that of the
processor with the largest workload. In this way, each
processor gets the minimum voltage level, saving the maximum
amount of power without increasing the overall parallel
execution time of the nest, which is determined by the
processor with the largest workload. The unused processors
(and their caches) are turned off to save leakage. Note that,
in EA_DP, the loop nests of the application are handled one
by one. That is, observing the dependencies between the
nodes of the given LDG, we process a single nest at a time.
Also, this scheme uses at most one processor from each
island (assuming that no two islands have the same
voltage).
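The voltage and frequency assignment step can be sketched as follows. This is a simplification we constructed: it assumes execution time scales as workload/V (frequency roughly proportional to voltage), whereas the real architecture supplies discrete voltage/frequency pairs:

```python
def assign_voltages(workloads, levels, v_max):
    """EA_DP-style assignment: highest voltage to the heaviest processor,
    then the lowest supported level for each remaining processor that
    keeps its execution time within T_max. 'levels' lists the supported
    voltages in ascending order; zero-workload processors get None
    (power-gated to save leakage)."""
    order = sorted(range(len(workloads)), key=lambda i: -workloads[i])
    t_max = workloads[order[0]] / v_max        # completion time of SP_1
    assign = [None] * len(workloads)
    assign[order[0]] = v_max
    for i in order[1:]:
        if workloads[i] == 0:
            continue                           # unused: turn off
        # lowest level whose execution time does not exceed T_max
        assign[i] = next(v for v in levels if workloads[i] / v <= t_max)
    return assign

# Workloads from the Fig. 4 example (in units of L instructions):
print(assign_voltages([25050, 15050, 5050], [0.6, 0.8, 1.0], 1.0))
```

Under this simple model the three processors get 1.0, 0.8, and 0.6, respectively: 15050/0.6 would exceed the heaviest processor's completion time, so the middle processor must settle for 0.8, while the lightest one fits comfortably at the lowest level.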
4.2.2 The EA_TP Scheme
We now describe how our second voltage island aware
parallelization scheme, called EA_TP, operates. This scheme
implements only task parallelism. In fact, it applies an
algorithm similar to list scheduling on the LDG. Specifically,
it can execute multiple nests in parallel if the dependencies
captured in LDG allow such an execution. Suppose that we
have Q loop nests that can execute in parallel. We first
estimate the workload of each loop nest, using a similar
procedure to the one explained in detail above. Then, each
loop nest is assigned to an island and executed on a single
processor there. This assignment is done considering the
workloads of the processors as well as the voltage/frequency
levels of the islands. It needs to be emphasized that, since in
this scheme the iterations of the same loop are not executed in
parallel, we exploit only task parallelism. Algorithm 2 gives
the pseudocode for this scheme.
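The EA_TP idea can be sketched as follows. This is our own minimal rendering, not the paper's Algorithm 2: nests become ready as their LDG predecessors finish, and at each step the heaviest ready nests are paired with the fastest available islands, a deliberate simplification of the paper's assignment, which also weighs the islands' voltage/frequency levels:

```python
def ea_tp_schedule(nest_loads, deps, island_speeds):
    """List-scheduling sketch of EA_TP. nest_loads: nest -> estimated
    workload; deps[n]: set of nests n depends on (the LDG edges);
    island_speeds: island -> relative speed. At each step, ready nests
    (all predecessors done) are assigned heaviest-first to islands in
    decreasing speed order, one nest per island per step.
    Assumes the dependence graph is acyclic."""
    done, schedule = set(), []
    while len(done) < len(nest_loads):
        ready = [n for n in nest_loads
                 if n not in done and deps.get(n, set()) <= done]
        ready.sort(key=lambda n: -nest_loads[n])              # heaviest first
        islands = sorted(island_speeds, key=island_speeds.get, reverse=True)
        step = list(zip(ready, islands))    # extra ready nests wait a step
        schedule.append(step)
        done.update(n for n, _ in step)
    return schedule

# Four nests: C depends on A; D depends on A and B. Two islands.
loads = {"A": 10, "B": 6, "C": 8, "D": 4}
deps = {"C": {"A"}, "D": {"A", "B"}}
print(ea_tp_schedule(loads, deps, {"fast": 2.0, "slow": 1.0}))
```

In this example, A and B run concurrently in the first step (A on the faster island), then C and D in the second; since iterations of any single nest are never split across processors, only task parallelism is exploited, exactly as the text describes.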
Fig. 4. Data parallelization of a loop nest.

References (partial)
. Simics: A full system simulation platform.
. The case for a single-chip multiprocessor.
. SUIF: An infrastructure for research on parallelizing and optimizing compilers.
. The design and use of SimplePower: A cycle-accurate energy estimation tool.