Compiler-Directed Energy Reduction
Using Dynamic Voltage Scaling and
Voltage Islands for Embedded Systems
Ozcan Ozturk, Member, IEEE, Mahmut Kandemir, Member, IEEE, and
Guangyu Chen, Member, IEEE
Abstract—Addressing power- and energy-related issues early in the system design flow ensures a good design and
minimizes iterations for faster turnaround time. In particular, optimizations at the software level, e.g., those supported by compilers,
are very important for minimizing the energy consumption of embedded applications. Recent research demonstrates that voltage
islands provide the flexibility to reduce power by selectively shutting down different regions of the chip and/or running selected parts
of the chip at different voltage/frequency levels. In contrast to most of the prior work on voltage islands, which mainly focused on
architecture design and IP placement issues, this paper studies the necessary software compiler support for voltage islands.
Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within
these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture.
Such automated support is critical since it is unrealistic to expect an application programmer to reach a good mapping that balances
multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that
our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are
consistent across a wide range of values of our major simulation parameters.
Index Terms—Voltage islands, compiler optimizations, energy consumption, voltage scaling, compiler-based parallelization
1 INTRODUCTION
POWER and energy related issues in deep submicron
embedded designs may limit functionality, reliability,
and performance, and severely affect yield and
manufacturability. It is well known that higher power dissipation
increases junction temperatures, which in turn slows down
transistors and increases interconnect resistance. Therefore,
power consumption needs to be considered as one of the
primary metrics in embedded system design, and any
optimization approach targeted at improving performance
may fall short if power is not also taken into account.
Recent years have witnessed several efforts aimed at
reducing power consumption from both hardware and
software perspectives. One such hardware approach is
voltage islands, which are areas (logic and/or memory)
supplied through separate, dedicated power feed. The prior
work on voltage islands so far generally focused on the
design and placement issues, and will be discussed in detail
in Section 2. Our goal in this paper is to study the necessary
software compiler support for voltage islands. Specifically,
we focus on an embedded multiprocessor architecture that
supports both voltage islands and control domains within
these islands and determine how an optimizing compiler
can map an embedded application onto this architecture.
The specific types of applications this paper considers are
embedded multimedia codes that are built from multi-
dimensional arrays of signals and multiloop nests that
operate on these arrays. One of the nice characteristics of
these applications is that an optimizing compiler can analyze
their data access patterns at compile time and restructure
them based on the target optimization in mind (e.g.,
enhancing iteration-level parallelism or improving data
locality).
We first give, in Section 3, a characterization of a set of
embedded applications that illustrates the potential benefits
that could be obtained from a voltage island based embedded
multiprocessor architecture. Based on this characterization,
we then present, in Section 4, a compiler-directed code
parallelization scheme, which is the main contribution of this
paper. A unique characteristic of this scheme is that it
minimizes power consumption (both dynamic and leakage)
by exploiting both task and data parallelism. We tested the
impact of this approach using a suite of eight embedded
multimedia applications and a simulation environment. Our
experiments, discussed in Section 5, reveal that the proposed
parallelization strategy is very effective in reducing power
(40.7 percent energy savings on average) as well as
execution cycles (14.6 percent performance improvement on
average). The experiments also show that the power
savings we achieve are consistent across a wide range of
values of our simulation parameters. For example, we found
that our approach scales very well as we increase the number
of processor cores in the architecture and the number of
voltage islands. Our results also indicate that, for the best
energy savings, both data and task parallelism need to be
employed together and application mapping should be
performed very carefully. Overall, our results show that
automated compiler support can be very effective in
exploiting unique features of a voltage island based
multiprocessor architecture.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. 2, FEBRUARY 2013

. O. Ozturk is with the Department of Computer Engineering, Bilkent
University, Ankara, Turkey. E-mail: ozturk@cs.bilkent.edu.tr.
. M. Kandemir is with the Computer Science and Engineering Department,
The Pennsylvania State University, 111 IST Building, University Park,
PA 16802. E-mail: kandemir@cse.psu.edu.
. G. Chen is with Facebook, 156 University Ave., Palo Alto, CA.

Manuscript received 28 Aug. 2010; revised 20 Oct. 2011; accepted 26 Oct.
2011; published online 29 Nov. 2011. Recommended for acceptance by
A. Zomaya. For information on obtaining reprints of this article, please send
e-mail to: tc@computer.org, and reference IEEECS Log Number
TC-2010-08-0479. Digital Object Identifier no. 10.1109/TC.2011.229.
0018-9340/13/$31.00 © 2013 IEEE. Published by the IEEE Computer Society.
2 RELATED WORK
As power consumption and heat dissipation are becoming
increasingly important issues in chip design, major chip
manufacturers, such as IBM [3] and Intel [4], are adopting
voltage islands in their current and future products [2]. For
example, voltage islands will be used in IBM’s new CU-08
manufacturing process for application-specific integrated
processors (ASIPs) [3]. The chip design tools that support
voltage islands are also starting to appear in the market
(e.g., [1]).
Different approaches for adapting and using voltage
islands have been explored [43], [23], [34]. Specifically,
Lackey et al. [19] discuss the methods and design tools that
are being used today to design voltage island based
architectures. Hu et al. [14] present an algorithm for
simultaneous voltage island partitioning, voltage level
assignment, and physical-level floorplanning. In [27], the
authors discuss the problem of energy-optimal local speed
and voltage selection in frequency/voltage island based
systems under given performance constraints. Liu et al. [22]
propose a method to reduce the total power under timing
constraints and to implement voltage islands with minimal
overheads. Wu et al. [42] implement a methodology to
exploit nontrivial voltage island boundaries. They evaluate
the optimal power versus design cost tradeoff under
performance requirements. In [7], the authors explore a
semicustom voltage-island approach based on internal
regulation and selective custom design. Giefers and Rettberg [12]
propose a technique that partitions the design into different
frequency/voltage islands during the scheduling phase of
the High-Level Synthesis (HLS). Our approach is different
from all these prior efforts on voltage islands as we focus on
automated compiler support for such architectures, with the
goal of reducing energy consumption.
An important advantage of chip multiprocessors is that
they can reduce cost from both performance and power
perspectives. The prior work [5], [6], [11], [17], [18], [26],
[28], [30], [39] discusses several advantages of these
architectures over complex single-processor-based designs.
Besides voltage island based systems, there are many prior
efforts that target reducing the energy consumption of
MPSoC-based architectures and chip multiprocessors [15].
For example, Ozturk et al. [29] propose an energy-efficient
on-chip memory design for embedded chip multiprocessor
systems. Manolache et al. [25] present a fault and energy-
aware communication mapping strategy for applications
implemented on NoCs. Soteriou and Peh [38] explore the
design space for a communication-channel turn-on/off based
dynamic power management technique for both on-chip
and off-chip interconnections. Yang et al. [44] present an
approximate algorithm for energy efficient scheduling. Rae
and Parameswaran [31] study voltage reduction for power
minimization. Shang et al. [35] propose applying dynamic
voltage scaling to communication channels.
There have been various compiler-based approaches to
voltage scaling. Chen et al. [10] propose a compiler-directed
approach where the compiler decides the appropriate
voltage/frequency levels to be used for each communica-
tion channel in the NoC. Their approach builds and operates
on a graph-based representation of a parallel program. In
[20], the authors propose a compiler-based communication link
voltage management technique. They specifically extract the
data communication pattern among parallel processors
along with network topology to set the voltages accordingly.
In [36], the authors propose a real-time loop scheduling
technique using dynamic voltage scaling. They implement
two different scheduling algorithms based on a directed
acyclic graph (DAG) using voltage scaling. Shi et al. [37]
present a framework for embedded processors using a
dynamic compiler. Their compiler-based approach specifi-
cally utilizes the OS-level information and hardware status.
Rangasamy et al. [32] propose a Petri-net-based performance
model where the compiler is used to set the frequencies.
Jejurikar and Gupta [16] present a DVS technique that
focuses on minimizing the entire system energy consumption.
In addition to leakage energy, the authors also consider
the energy consumption of components such as memory
and network interfaces. Hsu et al. [13] present a
compiler-based system to identify memory-bound loops. The
authors reduce the voltage level for such loops since the memory
subsystem is much slower than the processor. Our work is
different from these compiler-based studies since our
scheme minimizes dynamic and leakage energy consump-
tion for voltage islands by exploiting both task and data
parallelism at the loop level.
3 LOAD IMBALANCE IN MULTIMEDIA APPLICATIONS
In this section, we focus on a set of multimedia applications
and study the opportunities for saving energy through
voltage scaling. Fig. 1 shows load imbalances across eight
processors for the first five loop nests of some of our
multimedia applications (we will discuss the important
characteristics of these applications later). These results were
obtained by parallelizing the loop nests of the applications
on a uniform multiprocessor architecture (simulated using
the SIMICS tool-set [24]), i.e., all the processors are the same
and operate under the highest clock frequency and voltage
levels available. Each bar in this figure corresponds to the
normalized completion time of the workload of that
processor in the corresponding loop nest; the time of the
latest processor, i.e., the one that finishes last, is set
to 100 for ease of observing the load imbalance trends across
the applications. We see that, for all the applications and all
their loop nests, there is a significant load imbalance among
the parallel processors. There are several factors that
contribute to this load imbalance, which we explain next.
First, sometimes, the upper bound or lower bound of an
inner loop in a nest depends on the index of the outer loop.
This situation is illustrated by the following loop nest
written in a pseudolanguage:
for I := 1, N
  for J := I, N
    { ... }

In this loop nest, assuming that the outer loop is
parallelized over multiple processors and the inner loop
is run sequentially, each processor executes a different
number of loop iterations: because the lower bound of the
inner loop (J) depends on I, the inner trip count differs
from one outer iteration to the next. The second possible reason
for the load imbalance among the processors is the different
data cache behavior of the different processors. Since each
processor can experience a different number of data cache
hits and misses, this can cause load imbalance across them.
A third possible reason for load imbalance is the condi-
tional constructs such as IF statements within the bodies of
the parallelized loop nests. If two different processors
happen to take the different branches of an IF statement in
the loop iterations they execute, their execution times (i.e.,
the time it takes for a processor to complete its workload in
that loop nest) can be different from each other. Fig. 2 gives
the contribution of these three factors to the overall load
imbalance for each of our applications. The segment
marked “Other” in each bar shown in this figure represents
the remaining load imbalances in the corresponding
application whose source we could not identify. We see
from these results that the loop bound based imbalance, the
first reason discussed above, dominates the remaining
factors for all eight applications. This, in a sense, is good
news from the compiler’s perspective, as this is the only
type of load imbalance, among those mentioned above, that
we can identify at compile time and try to eliminate as
much as possible; the remaining causes of load imbalance
are very difficult to capture and characterize at compile
time. The next section discusses such a compiler approach
to exploit the load imbalances across processors in the
context of a voltage island based embedded multiprocessor
architecture.
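The dominant, loop-bound-induced imbalance can be made concrete with a short sketch (our own illustration, not code from the paper): block-partitioning the outer loop of a triangular nest such as the one above gives each processor a very different inner-iteration total.

```python
def triangular_workloads(n, p):
    """Iterations executed by each of p processors when the outer loop of
    'for I := 1, N / for J := I, N' is block-partitioned.
    Processor k gets outer iterations k*n//p + 1 .. (k+1)*n//p, and each
    outer iteration I contributes n - I + 1 inner iterations."""
    loads = []
    for k in range(p):
        lo, hi = k * n // p + 1, (k + 1) * n // p
        loads.append(sum(n - i + 1 for i in range(lo, hi + 1)))
    return loads

# With N = 300 and three processors, the first block of outer iterations
# is by far the heaviest: the loads are 25050, 15050, and 5050 iterations.
print(triangular_workloads(300, 3))
```

These are the same proportions that reappear in the worked example of Section 4.2.1, where each iteration is additionally weighted by a per-iteration cost of L instructions.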
4 COMPILER-DIRECTED APPLICATION CODE MAPPING
4.1 Architecture Abstraction
A high-level view of the embedded architecture considered
in this paper is shown in Fig. 3. In this architecture, the chip
area is divided into multiple voltage islands, each of which is
Fig. 1. Load imbalance across eight processors for the first five loop nests of some of our multimedia applications.
Fig. 2. Breakdown of the load imbalances in our applications.

controlled by a separate power feed and operates under a
different voltage level/frequency. We further assume that
each voltage island is divided into multiple power domains.
All the domains within an island are fed by the same Vdd
source but are independently controlled through intra-island switches.
To implement power domains, a power isolation logic
ensures that all inputs to the active power domain are
clamped to a stable value. The important point to note is that
this island-based architecture can help save both dynamic
and leakage power. Specifically, it can save dynamic energy
by employing different voltage levels for the different
islands, and leakage energy by shutting down the power
domains that are not needed by the current computation. The
difficult task, however, is to decide how a given embedded
multimedia application should be mapped onto this
multiprocessor architecture, i.e., how the code is parallelized
in this island-based architecture; this is discussed in the rest
of the paper.
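To first order, the dynamic-energy side of these savings can be quantified with the standard CMOS relation E ≈ C_eff · V² · f · t (a textbook approximation we assume here for illustration; the paper's evaluation uses a detailed simulator):

```python
def dynamic_energy(c_eff, v, f, t):
    """First-order CMOS dynamic energy: E = C_eff * V^2 * f * t
    (effective capacitance, supply voltage, clock frequency, time)."""
    return c_eff * v * v * f * t

# Running a fixed number of cycles at half the voltage (and, assuming
# frequency scales roughly with voltage, half the frequency) takes twice
# as long but costs about 4x less dynamic energy overall.
cycles = 1e9
full = dynamic_energy(1e-9, 1.0, 1e9, cycles / 1e9)
half = dynamic_energy(1e-9, 0.5, 0.5e9, cycles / 0.5e9)
print(full / half)
```

Leakage savings work differently: a power-gated domain burns (nearly) no static power regardless of voltage level, which is why shutting down unused domains is handled as a separate mechanism.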
4.2 Mapping Algorithm
In order to map a given multimedia code to the architecture
shown in Fig. 3, our approach uses both data parallelism and
task parallelism. Data parallelism involves performing a
similar computation on many data objects simultaneously.
In our approach, this corresponds to a group of processors
executing a given loop nest in parallel. All the processors
execute similar code (i.e., the same loop body) but work on
different parts of the array data, i.e., they execute different
iterations of the loop. Task parallelism, in comparison,
involves performing different tasks in parallel, where a task
is an arbitrary sequence of computations. In our approach,
this type of parallelism represents executing different loop
nests in different processors at the same time. Our compiler
uses a structure called the Loop Dependence Graph (LDG) to
represent the application code being optimized. Each node,
N_i, of this graph corresponds to a loop nest in the application,
and there is a directed edge from node N_i to N_j if the loop
nest represented by the latter is dependent on the loop nest
represented by the former. The proposed compiler support
maps this application onto our voltage island based
architecture. In the rest of this section, we describe the
details of three different voltage island and power domain
aware code mapping (parallelization) schemes that map a
given LDG onto our architecture.
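As a concrete illustration, the LDG can be represented with a plain adjacency structure. The class and method names below are our own, hypothetical choices; the paper does not specify an implementation:

```python
class LoopDependenceGraph:
    """Nodes are loop nests; an edge u -> v means nest v depends on nest u."""

    def __init__(self):
        self.succs = {}   # nest id -> set of dependent nest ids
        self.preds = {}   # nest id -> set of nests it depends on

    def add_nest(self, n):
        self.succs.setdefault(n, set())
        self.preds.setdefault(n, set())

    def add_dependence(self, src, dst):
        self.add_nest(src); self.add_nest(dst)
        self.succs[src].add(dst)
        self.preds[dst].add(src)

    def ready(self, done):
        """Nests whose predecessors have all completed; such nests can run
        concurrently, which is the basis of task parallelism here."""
        return [n for n in self.succs
                if n not in done and self.preds[n] <= done]
```

For a diamond-shaped LDG (nest 1 feeds nests 2 and 3, which both feed nest 4), `ready` first yields nest 1, then nests 2 and 3 together, then nest 4.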
4.2.1 The EA_DP Scheme
The first scheme that we describe, referred to as EA_DP,
exploits only data parallelism. It proceeds in three steps, as
suggested by its pseudocode given in Algorithm 1. The first
step is the parallelization step. In this step, the compiler
parallelizes the application on a loop nest basis. That is, each
loop nest of the given LDG is parallelized independently
considering the intrinsic data dependencies it has. Since we
are targeting a chip multiprocessor, our parallelization
strategy tries to achieve for each nest the outer loop
parallelism to the extent allowed by the data dependencies
exhibited by the loop nests [21]. The second step is the
processor workload estimation step. In this step, the
compiler estimates the load of each processor in each nest.
To do this, it performs two calculations: 1) iteration count
estimation and 2) per-iteration cost estimation. Since in
most array-based applications bounds of loops are known
before execution starts or they can be estimated through
profiling, estimating the iteration count for each loop nest is
not very difficult. The challenge is in determining the cost,
in terms of execution cycles, of a single iteration of a given
loop nest. Note that various Worst Case Execution Time
(WCET) calculation methods have been explored in the
literature [40]. Since the processors employed in our chip
multiprocessor are simple single-issue embedded cores, the
cost computation is closely dependent on the number and
types of the assembly instructions that will be generated for
the loop body. Specifically, we associate a base execution
cost with each type of assembly instruction. In addition, we
also estimate the number of cache misses. Since loop-based
embedded applications exhibit very good instruction
locality, we focus on data cache only and estimate data
cache misses using the method proposed by Carr et al. [8].
An important issue is to estimate, at the source level, what
assembly instructions will be generated for the loop body in
question. We address this problem as follows. The
constructs that are vital to the studied group of codes, that
is, array-based multimedia applications, include a typical
loop, a nested loop, assignment statements, array refer-
ences, and scalar variable references within and outside
loops. Our objective is to estimate the number of assembly
instructions of each type associated with the actual
execution of these constructs. To achieve this, the assembly
equivalents of several codes were obtained using our back-
end compiler (a variant of gcc) with the O3-level optimiza-
tion flag. Next, the portions of the assembly code were
correlated with corresponding high-level constructs to
extract the number and type of each instruction associated
with the construct.
Algorithm 1. EA_DP
N_L: Number of loop nests
N_P: Number of processors
L_i: Loop nest i
P_i: Processor i
SP_i: IDs of processors sorted by workload
W_i: Workload of processor i
T_i: Execution time of processor i
V_max: Highest voltage level
T_max: Execution time of SP_1 (maximum workload)
Fig. 3. An example voltage island based architecture, with three islands.
We assume that there exists a large off-chip memory, shared by all
processors.

1: i = 1
2: while i ≤ N_L do
3:   Parallelize loop nest L_i among the N_P processors
4:   i++
5: end while
6: i = 1
7: while i ≤ N_P do
8:   IterCount_i = Estimate the number of iterations executed by P_i
9:   IterCost_i = Estimate the average cost per iteration executed by P_i
10:  Estimated workload for P_i: W_i = IterCount_i × IterCost_i
11:  i++
12: end while
13: Sort the processors in non-increasing order of their workloads (SP_1 ... SP_{N_P})
14: Assign the highest voltage level V_max to SP_1
15: T_max = Time to execute the workload of SP_1
16: i = 2
17: while i ≤ N_P do
18:   Select the lowest voltage level V_L for SP_i such that T_i ≤ T_max
19:   i++
20: end while
To illustrate our parameter extraction process in more
detail, let us focus on some specifics of the following sample
constructs. First, let us focus on a loop construct. Each loop
construct is modeled to have a one-time overhead to load
the loop index variable into a register and initialize it. Each
loop also has an index comparison and an index increment
(or decrement) overhead, whose costs are proportional to
the number of loop iterations (called trip count or trip).
From correlating the high-level loop construct to the
corresponding assembly code, each loop initialization code
is estimated to execute one load (lw) and one add (add)
instruction (in general). Similarly, an estimate of trip+1 load
(lw), store-if-less-than (stl), and branch (bra) instructions is
associated with the index variable comparison. For the index
variable increment (resp. decrement), 2 × trip addition
(resp. subtraction) instructions and trip load, store, and jump
instructions are estimated to be performed. We next consider
extracting the number of instructions associated with array
accesses. First, the number and types of instructions
required to compute the address of the element are
identified. This requires the evaluation of the base address
of the array and the offset provided by the subscript(s). Our
current implementation considers the dimensionality of the
array in question, and computes the necessary instructions
for obtaining each subscript value. Computation of the
subscript operations is modeled using multiple shift and
addition/subtraction instructions, instead of multiplica-
tions, as this is the way our back-end compiler generates
code when invoked with the O3 optimization flag. Finally,
an additional load/store instruction is associated with
reading/writing the corresponding array element.
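The per-construct counts described above can be folded into a small estimator. The sketch below encodes only the loop-overhead model just given; treat it as illustrative, since the paper's actual estimates are derived by correlating real gcc -O3 assembly output:

```python
def loop_overhead_instructions(trip):
    """Estimated instruction counts for one loop's control overhead,
    per the model in the text: init = 1 lw + 1 add; comparison =
    (trip + 1) of each of lw, stl, bra; increment = 2*trip add plus
    trip of each of lw, sw, jump."""
    return {
        "lw":  1 + (trip + 1) + trip,  # init load + comparison loads + increment loads
        "add": 1 + 2 * trip,           # init add + index-increment additions
        "stl": trip + 1,               # store-if-less-than in each comparison
        "bra": trip + 1,               # branch in each comparison
        "sw":  trip,                   # store in each increment
        "jmp": trip,                   # jump back to the loop head
    }

# Total control-overhead instructions for a loop with a trip count of 100:
print(sum(loop_overhead_instructions(100).values()))
```

Weighting each instruction type by its base execution cost, and adding the estimated cache-miss penalty, yields the per-iteration cost used in the workload estimate.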
Based on the process outlined above, the compiler
estimates the iteration count for each processor and per-
iteration cost. Then, by multiplying these two, it calculates
the estimated workload for each processor. While this
workload estimation may not be extremely accurate, it
allows the compiler to rank processors according to their
workloads and assign suitable voltage levels and frequencies
to them, as will be described in the next step. As an
example, consider the code fragment shown in Fig. 4,
parallelized using three processors. Assuming that our
estimator estimates the cost of the loop body as L instructions,
the loads of processors P_0, P_1, and P_2 are 25050L, 15050L,
and 5050L, respectively.
The last step that implements EA_DP is voltage and
frequency assignment. In this step, the compiler first orders
the processors according to their nonincreasing workloads.
After that, the highest voltage is assigned to the processor
with the largest workload (the objective being not to affect
the execution time to the greatest extent possible). Then, the
processor with the second highest workload is assigned
the minimum voltage level V_k supported by the architecture
that does not cause its execution time to exceed that of the
processor with the largest workload. In this way, each
processor gets the minimum voltage level, saving the maximum
amount of power without increasing the overall parallel
execution time of the nest, which is determined by the
processor with the largest workload. The unused processors
(and their caches) are turned off to save leakage. Note that,
in EA_DP, the loop nests of the application are handled one
by one. That is, observing the dependencies between the
nodes of the given LDG, we process a single nest at a time.
Also, this scheme uses at most one processor from each
island (assuming that no two islands have the same
voltage).
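The voltage and frequency assignment step can be sketched as follows. This is a simplification we constructed: it assumes execution time scales as workload/V (frequency roughly proportional to voltage), whereas the real architecture supplies discrete voltage/frequency pairs:

```python
def assign_voltages(workloads, levels, v_max):
    """EA_DP-style assignment: highest voltage to the heaviest processor,
    then the lowest supported level for each remaining processor that
    keeps its execution time within T_max. 'levels' lists the supported
    voltages in ascending order; zero-workload processors get None
    (power-gated to save leakage)."""
    order = sorted(range(len(workloads)), key=lambda i: -workloads[i])
    t_max = workloads[order[0]] / v_max        # completion time of SP_1
    assign = [None] * len(workloads)
    assign[order[0]] = v_max
    for i in order[1:]:
        if workloads[i] == 0:
            continue                           # unused: turn off
        # lowest level whose execution time does not exceed T_max
        assign[i] = next(v for v in levels if workloads[i] / v <= t_max)
    return assign

# Workloads from the Fig. 4 example (in units of L instructions):
print(assign_voltages([25050, 15050, 5050], [0.6, 0.8, 1.0], 1.0))
```

Under this simple model the three processors get 1.0, 0.8, and 0.6, respectively: 15050/0.6 would exceed the heaviest processor's completion time, so the middle processor must settle for 0.8, while the lightest one fits comfortably at the lowest level.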
4.2.2 The EA_TP Scheme
We now describe how our second voltage island aware
parallelization scheme, called EA_TP, operates. This scheme
implements only task parallelism. In fact, it applies an
algorithm similar to list scheduling on the LDG. Specifically,
it can execute multiple nests in parallel if the dependencies
captured in LDG allow such an execution. Suppose that we
have Q loop nests that can execute in parallel. We first
estimate the workload of each loop nest, using a similar
procedure to the one explained in detail above. Then, each
loop nest is assigned to an island and executed on a single
processor there. This assignment is done considering the
workloads of the processors as well as the voltage/frequency
levels of the islands. It needs to be emphasized that, since in
this scheme the iterations of the same loop are not executed in
parallel, we exploit only task parallelism. Algorithm 2 gives
the pseudocode for this scheme.
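The EA_TP idea can be sketched as follows. This is our own minimal rendering, not the paper's Algorithm 2: nests become ready as their LDG predecessors finish, and at each step the heaviest ready nests are paired with the fastest available islands, a deliberate simplification of the paper's assignment, which also weighs the islands' voltage/frequency levels:

```python
def ea_tp_schedule(nest_loads, deps, island_speeds):
    """List-scheduling sketch of EA_TP. nest_loads: nest -> estimated
    workload; deps[n]: set of nests n depends on (the LDG edges);
    island_speeds: island -> relative speed. At each step, ready nests
    (all predecessors done) are assigned heaviest-first to islands in
    decreasing speed order, one nest per island per step.
    Assumes the dependence graph is acyclic."""
    done, schedule = set(), []
    while len(done) < len(nest_loads):
        ready = [n for n in nest_loads
                 if n not in done and deps.get(n, set()) <= done]
        ready.sort(key=lambda n: -nest_loads[n])              # heaviest first
        islands = sorted(island_speeds, key=island_speeds.get, reverse=True)
        step = list(zip(ready, islands))    # extra ready nests wait a step
        schedule.append(step)
        done.update(n for n, _ in step)
    return schedule

# Four nests: C depends on A; D depends on A and B. Two islands.
loads = {"A": 10, "B": 6, "C": 8, "D": 4}
deps = {"C": {"A"}, "D": {"A", "B"}}
print(ea_tp_schedule(loads, deps, {"fast": 2.0, "slow": 1.0}))
```

In this example, A and B run concurrently in the first step (A on the faster island), then C and D in the second; since iterations of any single nest are never split across processors, only task parallelism is exploited, exactly as the text describes.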
Fig. 4. Data parallelization of a loop nest.

References (partial)
. Simics: A full system simulation platform.
. The case for a single-chip multiprocessor.
. SUIF: An infrastructure for research on parallelizing and optimizing compilers.
. The design and use of SimplePower: A cycle-accurate energy estimation tool.