
Journal ArticleDOI

On the Interplay of Parallelization, Program Performance, and Energy Consumption

01 Mar 2010-IEEE Transactions on Parallel and Distributed Systems (IEEE)-Vol. 21, Iss: 3, pp 342-353

TL;DR: This paper derives simple, yet fundamental formulas to describe the interplay between parallelism of an application, program performance, and energy consumption and derives optimal frequencies allocated to the serial and parallel regions in an application to either minimize the total energy consumption or minimize the energy-delay product.

Abstract: This paper derives simple, yet fundamental formulas to describe the interplay between parallelism of an application, program performance, and energy consumption. Given the ratio of serial and parallel portions in an application and the number of processors, we derive optimal frequencies allocated to the serial and parallel regions in an application to either minimize the total energy consumption or minimize the energy-delay product. The impact of static power is revealed by considering the ratio between static and dynamic power and quantifying the advantages of adding to the architecture capability to turn off individual processors and save static energy. We further determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. While the formulas we obtain use simplifying assumptions, they provide valuable theoretical insights into energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of energy-aware multicore processor architectures.

Summary (4 min read)

1 INTRODUCTION

  • A surge of attention is being paid to parallel processing with the recent emergence of commodity multicore processors.
  • While the increased amount of on-chip computing resources promises higher performance through parallel execution of applications, suppressing the power and energy consumption remains an even more stringent constraint to the design and management of such processors [5], [26].
  • Under this condition, the ratio between the processor speeds of the serial and parallel sections that minimizes energy is N^(1/α).

2.1 Problem Formulation and Assumptions

  • Therefore, the maximum clock frequency, Fmax, has a relative speed of 1, and the program has a serial portion whose amount of work is represented by s and a parallel portion represented by p (or 1 − s).
  • To be general, the authors also assume that the power consumption of each processor consists of two components: a frequency-dependent component that can be controlled by changing the frequency of the processor (DVFS), and a frequency-independent component that is not controlled by DVFS.
  • The authors call these two components “dynamic” and “static,” respectively.
  • Specifically, for systems with relatively large β, the frequency-independent component of power dominates the total energy consumption, and thus, applying DVFS techniques will not produce any appreciable energy savings.
  • For clear presentation and intuitive discussion, the authors will not consider the impact of the constant-speed operations on program execution time (and hence, energy consumption) in the following three sections.

2.2 Two Machine Models

  • In the problem formulation in (5), the authors assumed that processors consume static energy in both the serial and the parallel regions.
  • Naturally, one would regard an idle processor as an opportunity to save energy consumption if processors can be turned off when not busy, given a mechanism to turn off individual processors.
  • Hence, the authors will study in this work two machine models: one without and one with the capability to turn off individual processors.
  • Throughout this paper, the authors refer to these two machine models as MA and MB.
  • Given the same processor speed setting, the dynamic energy consumption of MB is the same as that of MA: the sum of the first two terms in (5).

3.1 The Case of x = 1

  • While one can further reduce the dynamic energy by slowing down processors, considering the case of x = 1 provides interesting insights as well as a basis for later discussions.
  • The curves are also higher than those given by Amdahl’s law (not shown).
  • Note that the optimal solution obtained for f_s and f_p is feasible, since both frequencies are smaller than the maximum frequency Fmax = 1.

3.2 The Case of Unrestricted x

  • Amdahl’s law explores the effect of parallelization on speedup, and the authors have described the effect of parallelization on energy consumption when the program execution time is unchanged, i.e., x = 1.
  • The authors relax the program execution time constraint and revisit the same problem of minimizing the total energy consumption.
  • In other words, the total energy consumption is minimized when the dynamic energy consumption is 1/(α − 1) times the static energy.
  • In this figure, the values of β are divided into three regions.

3.3 Optimal Energy Consumption Given a Speedup

  • The authors have thus far considered the problem of calculating the optimal speeds of processors (hence, program speedup) to minimize the total energy consumption given p, β, and N.
  • Moreover, the largest energy improvement ratio occurs at a smaller program speedup.
  • For machine model MA considered in the previous section, there was no reason for using fewer than the available N processors during parallel execution.
  • While n assumes discrete values, the authors will treat it as a continuous function when they derive the conditions for optimal energy consumption.

4.1 Conditions for Optimal Energy Consumption

  • That is, the energy-optimal time allocated to the serial section (and thus the speed of the single processor used) does not depend on the number of processors, n.
  • The above equations illustrate that the optimal energy is obtained when f_s = f_p, i.e., all the processors are given the same speed during program execution, regardless of the serial or the parallel section.
  • This condition is greatly relaxed compared with that of MA, where the condition was β < (α − 1)/N.
  • Fig. 8 depicts the minimum energy consumption of machine model MB at different β values.
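The bullets above can be made concrete with a small sketch. Assuming the paper's normalizations (t = s/f_s, y − t = p/(n·f_p), and static power β consumed only while a processor is powered on), the total energy of MB reduces to s·f_s^(α−1) + p·f_p^(α−1) + β·s/f_s + β·p/f_p, which is independent of n and minimized at f_s = f_p. The helper names below are ours, not from the paper:

```python
# Sketch of the M_B energy under the assumed form above: static power beta
# is paid only while a processor is on, with t = s/f_s and y - t = p/(n*f_p).

def mb_energy(f_s, f_p, s, alpha, beta):
    p = 1 - s
    return (s * f_s ** (alpha - 1) + p * f_p ** (alpha - 1)
            + beta * s / f_s + beta * p / f_p)   # note: independent of n

def mb_optimal_speed(alpha, beta):
    # Setting dE/df_s = dE/df_p = 0 gives the same speed for both sections,
    # feasible (<= F_max = 1) whenever beta <= alpha - 1.
    return (beta / (alpha - 1)) ** (1 / alpha)
```

A quick numerical scan around the optimum confirms that equal speeds in both sections minimize this expression, consistent with the bullets.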

4.2 Optimal Energy Consumption Given a Speedup

  • The authors consider the problem of obtaining the minimum total energy consumption when a desired speedup x (equivalently, program execution time y) is specified.
  • Fig. 9 shows how the minimum energy consumption changes as the authors target a different program speedup, x, along with the contributions of the dynamic and static energy consumption.
  • At the maximum speedup dictated by Amdahl’s law, when f_s = f_p = Fmax = 1, the dynamic energy consumption reaches 1 (i.e., the same as the dynamic energy consumption of sequential execution) and the total energy reaches 1 + β (i.e., the same as the total energy consumption of sequential execution).
  • Fig. 10 further shows how the improvement ratio of the minimum energy changes with different values of β.
  • (Figure caption) Running a program having a serial portion s with (a) two processors and (b) one processor.

4.3 Energy Advantage of Turning Off Idle Processors

  • Fig. 11 compares the two machine models, MA and MB, with respect to the minimum energy consumption at different program speedups.
  • It is shown that the energy consumption of MB is strictly lower than that of MA at any desired speedup.
  • At the maximum program speedup, MB has the same energy consumption as sequential execution, regardless of the number of processors used.
  • Finally, the authors consider the energy consumption of the two machine models at the maximum program speedup given by Amdahl’s law, 1/(s + p/N).
  • That is, as expected, the advantage of turning off processors increases when the static power is larger.
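At the Amdahl-maximum speedup (f_s = f_p = 1, y = s + p/N), the comparison between the two models reduces to their static terms. A sketch (our names; it assumes MA keeps all N processors powered for the whole run, while MB powers one processor for time s and N processors for time p/N):

```python
# Energy of the two machine models at the Amdahl-maximum speedup, where
# f_s = f_p = 1 and y = s + p/N. The dynamic energy is then 1, the same
# amount of work as sequential execution.

def energy_at_amdahl_max(s, N, beta, model="MA"):
    p = 1 - s
    y = s + p / N
    # MA: all N processors draw static power beta for the whole time y.
    # MB: static power only while working -> beta*(s + N*(p/N)) = beta.
    static = beta * N * y if model == "MA" else beta * (s + p)
    return 1 + static
```

This reproduces the bullets: MB matches the sequential total energy 1 + β, and its advantage over MA grows with β.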

5 MINIMIZING ENERGY-DELAY PRODUCT

  • In the previous sections, the authors have considered the problem of obtaining the minimum energy given the two machine models MA and MB.
  • In many systems, however, it is desirable to strike a trade-off between energy consumption and performance by minimizing the energy-delay product rather than the total energy.
  • The authors first consider MA, where processors cannot be turned off individually.
  • This similarity also appears in the calculation of energy.
  • Numeric techniques are needed to obtain a solution.

6 EFFECT OF CONSTANT-SPEED OPERATIONS

  • In their problem formulation in Section 2 and derivations in the previous three sections, the authors assumed that processor speed (processor’s clock frequency) solely determines the runtime of a program region.
  • The authors will discuss how constant-speed operations, such as memory access and I/O processing, affect program execution time and, hence, their derivations.
  • Let us assume that the constant-speed operations (e.g., memory access) are distributed uniformly across the work and their aggregated amount is m (m < 1).
  • If the authors assume that memory accesses from multiple processor cores can be overlapped, then the total amount of time spent on the constant-speed operations (tm) is m · (s + p/N).
  • Given that translating a problem formulated using Fig. 13 into a problem based on their original problem formulation (Fig. 2) is straightforward, the authors do not pursue in this paper the task of deriving new formulas that incorporate the impact of constant-speed operations.
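The overlap assumption above gives a one-line expression for tm; a minimal sketch (the function name is ours):

```python
# Aggregated constant-speed (e.g., memory access) time under the assumption
# that such operations are uniformly distributed over the work and overlap
# across cores: tm = m * (s + p/N), with p = 1 - s.

def constant_speed_time(s, N, m):
    return m * (s + (1 - s) / N)
```

For example, with s = 0.25, N = 4, and m = 0.1, the constant-speed time is 0.1 · 0.4375 = 0.04375 of the normalized sequential run.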

8 CONCLUSIONS

  • The authors developed an analytical framework to study the trade-offs between parallelization, program performance, and energy consumption.
  • The main simplification inherited from Amdahl’s law is that the parallel section of an application is fully parallelizable.
  • Both the minimum energy and the minimum energy-delay product are obtained when the speed of the serial section, f_s, is N^(1/α) times the speed of the parallel section, f_p.
  • It also provides a simple way to determine the effect of the static/dynamic power ratio on the aforementioned trade-offs.
  • When processors can be individually turned off, the analysis indicates that the minimum total energy is independent of the number of processors used for executing the parallel section, while the energy-delay product is minimized when the maximum number of available processors are used during the parallel execution section.


On the Interplay of Parallelization, Program
Performance, and Energy Consumption
Sangyeun Cho, Member, IEEE, and Rami G. Melhem, Fellow, IEEE
Abstract—This paper derives simple, yet fundamental formulas to describe the interplay between parallelism of an application,
program performance, and energy consumption. Given the ratio of serial and parallel portions in an application and the number of
processors, we derive optimal frequencies allocated to the serial and parallel regions in an application to either minimize the total
energy consumption or minimize the energy-delay product. The impact of static power is revealed by considering the ratio between
static and dynamic power and quantifying the advantages of adding to the architecture capability to turn off individual processors and
save static energy. We further determine the conditions under which one can obtain both energy and speed improvement, as well as
the amount of improvement. While the formulas we obtain use simplifying assumptions, they provide valuable theoretical insights into
energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of
energy-aware multicore processor architectures.
Index Terms—Multicore processor, Amdahl’s law, dynamic voltage and frequency scaling (DVFS), energy-delay product (EDP).
1 INTRODUCTION
A surge of attention is being paid to parallel processing
with the recent emergence of commodity multicore
processors. Microprocessors carrying two to eight general
processing cores are commercially available [2], [15], [21],
[31], [32] and projections suggest that future technologies
will allow integrating many more cores, potentially in the
order of hundreds to thousands, in a single chip [4], [22].
Abundant on-chip processing elements and much reduced
processor-to-processor communication overhead will offer
an unprecedented environment for efficient and affordable
parallel processing on every desktop.
While the increased amount of on-chip computing
resources promises higher performance through parallel
execution of applications, suppressing the power and
energy consumption remains an even more stringent
constraint to the design and management of such proces-
sors [5], [26]. The allowed power level has been and will be
limited to the same constant value (150W) and the chip
operating voltage will not be significantly lowered in the
future, according to the ITRS projection [19]. Battery
capacity and efficiency are not improving as quickly as
the number of transistors in a chip increases. Henceforth,
many previously developed low-power and low-energy
ideas will play an even more significant role in future
multicore processors, including highly beneficial dynamic
voltage and frequency scaling (DVFS or simply DVS)
techniques [11], [35], [36] and the capability of turning off
individual processor cores [20].
This paper presents a theoretical study on the interplay
of parallelization, program performance, and energy con-
sumption. It has been previously pointed out that parallel
processing can be used to lower energy instead of
improving performance. For instance, Borkar [5] suggested
that a perfect two-way parallelization would lead to half the
clock frequency (and voltage), one-quarter of the energy
consumption, and one-eighth of the power density com-
pared with the sequential execution given the same
execution time constraint. Little work has been done,
however, to understand how parallelizing an application
would enable us to achieve both speedup and energy
improvement or how to obtain the best energy improve-
ment given an application and a processor architecture. We
will address the following specific questions:
1. What is the maximum energy improvement due to
parallelization, and how can we determine the
processor speeds to achieve that improvement?
2. How does static power affect the energy-optimal
program speedup and energy consumption?
3. Given a target speedup, how do we set the processor
speeds to minimize energy?
4. What is the condition for obtaining the minimum
energy-delay product?
Our goal in this work is to follow a simple analytical
approach, similar to the one used in Amdahl’s law [3], to
derive formulas that describe the behavior of the system.
For example, according to Amdahl’s law, the maximum
program speedup due to parallelization is

    Speedup = 1 / (s + p/N),    (1)

where s + p = 1, s (p) is the ratio of the serial (parallel)
portion in the program, and N is the number of processors.
Using the same input parameters, we can calculate how
parallelization improves energy consumption:
. S. Cho and R.G. Melhem are with the Department of Computer Science,
University of Pittsburgh, 5407 Sennott Square, 210 S. Bouquet Street,
Pittsburgh, PA 15260. E-mail: {cho, melhem}@cs.pitt.edu.
Manuscript received 21 Oct. 2008; revised 23 Jan. 2009; accepted 19 Feb.
2009; published online 27 Feb. 2009.
Recommended for acceptance by D. Bader.
For information on obtaining reprints of this article, please send e-mail to:
tpds@computer.org, and reference IEEECS Log Number TPDS-2008-10-0427.
Digital Object Identifier no. 10.1109/TPDS.2009.41.

    Improvement in Dynamic Energy = 1 / (s + p/N^((α−1)/α))^α,    (2)

when the parallel program execution time is identical to the
sequential execution time and the dynamic power consumption
of a processor running at frequency f is proportional to f^α.
The above equation, whose detailed derivation will be shown
in Section 3, illustrates that more parallelism (larger p and
smaller s) and more processors (larger N) help reduce energy.
Fig. 1 presents a plot of (2). For a typical value of α, the
energy improvement function in (2) grows faster with p or N
than the speedup function in (1).¹
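The claim that (2) grows faster than (1) can be checked numerically. A minimal sketch in Python (the function names are ours, not from the paper), assuming α = 3 as in the paper's figures:

```python
# Numerical check of Amdahl's law (1) and the dynamic-energy improvement (2),
# using the paper's normalized quantities: s + p = 1, N processors, alpha = 3.

def speedup(s, N):
    """Amdahl's law (1): 1 / (s + p/N), with p = 1 - s."""
    p = 1 - s
    return 1 / (s + p / N)

def energy_improvement(s, N, alpha=3):
    """Equation (2): 1 / (s + p / N**((alpha-1)/alpha))**alpha."""
    p = 1 - s
    return 1 / (s + p / N ** ((alpha - 1) / alpha)) ** alpha
```

For s = 0.2 and N = 4, the speedup is 2.5 while the dynamic-energy improvement is roughly 7.2, illustrating that parallelization buys more energy than speed at equal execution time.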
The result obtained in this paper reveals interesting
relationships between processor speeds in the sequential
and parallel regions of a program and between dynamic
and static energy consumption. For instance, when individual
processors cannot be turned off, the total energy is
minimized when the dynamic energy is equal to 1/(α − 1)
times the static energy, regardless of the value of N or s.
Moreover, we find that minimum energy is achieved at
processor speeds smaller than the maximum speeds only
when β < (α − 1)/N, where β is the ratio of the static power
consumption to the dynamic power consumption at the
maximum processor speed. Under this condition, the ratio
between the processor speeds of the serial and parallel
sections that minimizes energy is N^(1/α). When the condition
does not hold, the program’s serial section must be
executed at full speed. If β > (α − 1), processor speeds in
both the serial and the parallel sections must be set to the
maximum speed to achieve the minimum total energy. The
above conditions are greatly relaxed if we can turn off
individual processors to save static energy consumption.
The rest of this paper is organized as follows: Section 2
presents our problem formulation and the two machine
models that we will consider in the paper. Sections 3 and 4
study the problem of minimizing energy consumption for
the two machine models, followed by a discussion on
energy-delay product in Section 5. Section 6 further
discusses how our derivations can be used in situations
when there are constant-speed operations (such as memory
access) whose latencies do not depend on the processor
speed. Sections 3, 4, and 5 assume that the processor speeds
simply determine the execution time of a program. Related
work will be discussed and contrasted with our work in
Section 7 and conclusions will be summarized in Section 8.
2 PROBLEM FORMULATION AND MACHINE MODELS
2.1 Problem Formulation and Assumptions
We have an application model identical to that of Amdahl’s
law. An application has a serial section that can be executed
by a single processor and a parallel section that can be
executed by any number of processors in the system, i.e., fully
parallelizable. When the number of processors employed is
N, the speedup of the parallel section is N. We do not consider
the overhead of processor-to-processor communications.
We normalize the sequential execution time of the
program to be 1, in order to present our derivation in an
intuitive way. Similarly, we normalize the amount of work
(i.e., number of cycles) in the program to be 1. Therefore, the
maximum clock frequency, F_max, has a relative speed of 1,
and the program has a serial portion whose amount of work is
represented by s, and a parallel portion with p (or 1 − s).
Fig. 2 shows this arrangement. The program speedup is
denoted by x and the resulting program execution time by
y = 1/x. The clock frequencies for the two regions in the
work, namely, s and p, are calculated as follows:

    f_s = s / t,    (3)

    f_p = (1 − s) / ((y − t) · N).    (4)
We assume a DVFS scheme where voltage and frequency
are changed linearly. To be general, we also assume that the
power consumption of each processor consists of two
components: a frequency-dependent component that can be
controlled by changing the frequency of the processor
(DVFS), and a frequency-independent component that is
not controlled by DVFS. We call these two components
“dynamic” and “static,” respectively. There have been
many studies to either theoretically or empirically model
the dependence of the power consumption on the operating
frequency, and most of these studies conclude that this
dependence can be approximated by C · f^α, for some value
of α ≥ 2 [30], [35]. The frequency-independent component
is a constant that depends on the technology and system
1. In the literature, α is between 2 and 3, typically 3 [30], [35].
Fig. 1. Achievable dynamic energy improvement assuming α = 3 and
using 1, 2, 3, and 4 processors given the parallel portion’s ratio of a
program.
Fig. 2. Normalized “work” and “time.” “Parallel time” is partitioned into
serial and parallel regions.

architecture. However, assuming that E_max = C · F_max^α is
the energy of the frequency-dependent component at the
maximum processor speed, the energy of the frequency-
independent component can be normalized with regard to
E_max and expressed as β · E_max. In this paper, we further
normalize E_max to 1 so that the static power consumption is
simply β. Although the power model described above is
approximate, its simplicity allows us to compare the effect
of parallelism on the choice of processor speed policies
using closed-form formulas. It is common to use simplified
power models to derive DVFS policies [33].
Clearly, the benefit of any DVFS technique largely
depends on the value of β. Specifically, for systems with
relatively large β, the frequency-independent component of
power dominates the total energy consumption, and thus,
applying DVFS techniques will not produce any appreciable
energy savings. However, in systems with relatively
small β, applying DVFS techniques can reduce the total
system energy consumption. Many DVFS algorithms have
been proposed recently and demonstrated to reduce power
consumption [11], [18], [25], [29], [35], [36], and many
commercial microprocessors are now equipped with DVFS
capabilities [8], [9], [16], [27].
For a given problem, s is fixed, and for a given
architecture, N, α, and β are fixed. Hence, the energy
consumption, E, is a function of t and y. Specifically,

    E(t, y) = t · f_s^α + N · (y − t) · f_p^α + β · N · y.    (5)

In (5), the three terms represent the energy for the serial
portion, the energy for the parallel portion, and the energy
for the static power consumption during the whole execution
time, respectively. We do not consider the processor
temperature, and hence, the term for the static energy
is the product of the per-processor power consumption
rate, β, the number of processors, N, and the total execution time,
y. We do not assume a specific interprocessor network
topology and do not consider energy consumption of the
interprocessor network.
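Equation (5), together with the frequency definitions (3) and (4), can be evaluated directly. A minimal sketch using the paper's normalized quantities (the function name is illustrative, not from the paper):

```python
# Direct evaluation of the M_A energy (5), with f_s and f_p computed from
# the serial time t and the total execution time y via (3) and (4).

def energy(t, y, s, N, alpha, beta):
    f_s = s / t                      # (3): serial-section frequency
    f_p = (1 - s) / ((y - t) * N)    # (4): parallel-section frequency
    return t * f_s ** alpha + N * (y - t) * f_p ** alpha + beta * N * y
```

As a sanity check, at full speed (f_s = f_p = 1, i.e., t = s and y = s + p/N) the dynamic part is s + p = 1, recovering the sequential amount of dynamic energy plus the static term β · N · y.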
In our problem formulation, we assumed that the
processor speed (or processor’s clock frequency) solely
determines the runtime of a program. However, certain
“constant-speed” operations, such as memory access
(caused by a cache miss in cache-based systems) and I/O
processing, may take a fixed amount of time that is
independent of the processor speeds. Consequently, in-
creasing or decreasing the processor speed will have
“sublinear” effect (rather than “linear”) on performance
[14]. For clear presentation and intuitive discussion, we will
not consider the impact of the constant-speed operations on
program execution time (and hence, energy consumption)
in the following three sections. However, Section 6 will
specifically discuss how constant-speed operations affect
our derivations and intuitions learned.
Finally, it is important to note that this work pays little
attention to practical issues of parallelizing an application
or mapping serial and parallel regions of an application to
multiple cores. We assume (as Amdahl’s law suggests) that
an application is perfectly parallelized given its parallelism
and work described by a parallel code region can be
perfectly distributed to an arbitrary number of processors.
2.2 Two Machine Models
In the problem formulation in (5), we assumed that
processors consume static energy in both the serial and
the parallel regions. That is, even when a processor is not
assigned a task to execute (and thus, sits idle), it consumes
static energy. Naturally, one would regard an idle processor
as an opportunity to save energy consumption if processors
can be turned off when not busy, given a mechanism to turn
off individual processors. Because low-energy consumption
via suppressing static power will be increasingly important,
viable mechanisms such as separate power supply designs
are actively explored in the context of multicore processors
in industry [20]. Hence, we will study in this work two
machine models: one without and one with the capability to
turn off individual processors. Throughout this paper, we
refer to these two machine models M
A
and M
B
.
For the simplicity of our derivation, we assume that M
B
can turn off or on processors without any overhead. Given
the same processor speed setting, the dynamic energy
consumption of M
B
is the same as that of M
A
: sum of the
first two terms in (5). Only the static energy part need be
replaced by t þ N ðy tÞ.
We further assume that processors can run at an
arbitrary clock frequency, subject to the maximum frequency
F_max, in both machine models. While processor clock
frequencies in real chip implementations are typically
discrete, previous work showed that energy savings using
discrete frequencies closely match those of continuous
frequencies [18]. The speedup x one would achieve with
parallelization and processor speed scaling is subject to

    x ≤ 1 / (s + p/N)    (6)

according to Amdahl’s law in (1). It is clear that this
condition is equivalent to f_s, f_p ≤ F_max = 1 and
y ≥ s + p/N.
We will discuss the two machine models in detail in the
following two sections.
3 MACHINE MODEL M_A: PROCESSORS CANNOT BE TURNED OFF INDIVIDUALLY

3.1 The Case of x = 1
We first obtain the minimum energy consumption when
x = 1 (or y = 1), i.e., program execution time is unchanged.
Imposing the condition x = 1 is similar to setting a
deadline, which is the original sequential execution time,
to finish the computation. While one can further reduce the
dynamic energy by slowing down processors, considering
the case of x = 1 provides us with interesting insights as
well as a basis for later discussions. To minimize the total
energy, we rewrite (5) as

    E(t) = t · (s/t)^α + N · (1 − t) · ((1 − s)/((1 − t) · N))^α + β · N.    (7)
Next, we obtain the derivative of E(t) with respect to t,

    dE(t)/dt = −(α − 1) · s^α / t^α + (α − 1) · (1 − s)^α / ((1 − t)^α · N^(α−1)),    (8)

and then, we compute the value of t which minimizes E(t)
by setting dE(t)/dt to 0 and obtain

    t / (1 − t) = (s / (1 − s)) · N^((α−1)/α).    (9)

Hence, the value of t which minimizes E(t) is

    t* = s / (s + p/N^((α−1)/α)).    (10)
We are ready to obtain the values of f_s* and f_p* which
minimize E(t), using (3), (4), and (10). Specifically,

    f_s* = s / t* = s + p/N^((α−1)/α),    (11)

    f_p* = (s + p/N^((α−1)/α)) / N^(1/α)    (12)

         = f_s* / N^(1/α).    (13)

Both f_s* and f_p* are functions of s and N in (11) and (12),
and (13) shows the relationship between f_s* and f_p* when
E(t) is minimized. Interestingly, the ratio between the two
frequencies, f_s*/f_p*, is a function of N, but not of s.
Finally, from (7) and (10), we obtain the minimum
energy consumption

    E_min = E(t*) = (s + p/N^((α−1)/α))^α + β · N.    (14)
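The closed forms (10)-(14) for the x = 1 case are easy to sanity-check against a direct evaluation of (7). A sketch (the function name is illustrative, not from the paper):

```python
# Closed-form optimum for the x = 1 case on machine model M_A, per (10)-(14).

def optimal_x1(s, N, alpha, beta):
    p = 1 - s
    t_star = s / (s + p / N ** ((alpha - 1) / alpha))              # (10)
    f_s = s / t_star                                               # (11)
    f_p = f_s / N ** (1 / alpha)                                   # (13)
    e_min = (s + p / N ** ((alpha - 1) / alpha)) ** alpha + beta * N   # (14)
    return t_star, f_s, f_p, e_min
```

Sampling E(t) from (7) around t* confirms that no other split of the time budget does better, and that both optimal frequencies stay below F_max = 1.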
Fig. 1 depicts the maximum energy improvement due to
parallelization (1/E_min) when the number of processors is
varied between one and four, assuming α = 3 and β = 0. It
is clear that the energy improvement is a function
monotonically increasing with p and N. The curves are also
higher than those given by Amdahl’s law (not shown).
Fig. 3 shows how the overall energy (E(t)) changes as we
adjust t. It also presents t*, the value of t that minimizes
E(t). Note that the optimal solution obtained for f_s* and
f_p* is feasible since both frequencies are smaller than the
maximum frequency F_max = 1.
3.2 The Case of Unrestricted x
Amdahl’s law explores the effect of parallelization on
speedup, and we have described the effect of parallelization
on energy consumption when the program execution time
is unchanged, i.e., x = 1. We note that the fixed value of x,
in turn, fixes the static energy consumption. We have thus
focused only on the dynamic energy consumption. Hence,
the optimal processor speeds in (11) and (12) are independent
of β. In this section, we relax the program execution
time constraint and revisit the same problem of minimizing
the total energy consumption. Unrestricting x will expose
the impact of static power consumption on the optimal
speed of a program’s serial and parallel sections.
We begin by setting the derivatives of (5) with respect to
both t and y to zero as follows:

    ∂E/∂t = −(α − 1) · s^α / t^α + (α − 1) · (1 − s)^α / ((y − t)^α · N^(α−1)) = 0
        ⟹  t / (y − t) = (s / (1 − s)) · N^((α−1)/α),    (15)
    ∂E/∂y = −(α − 1) · (1 − s)^α / ((y − t)^α · N^(α−1)) + β · N = 0
        ⟹  (y − t) = ((α − 1)/β)^(1/α) · (1 − s)/N.    (16)
Solving (15) and (16) for t and y gives

    t* = ((α − 1)/(β · N))^(1/α) · s,    (17)

    y* = ((α − 1)/(β · N))^(1/α) · (s + p/N^((α−1)/α)).    (18)
With t* and y*, we can use (3) and (4) to calculate the
optimum frequencies

    f_s* = (β · N/(α − 1))^(1/α),    (19)

    f_p* = (β/(α − 1))^(1/α) = f_s* / N^(1/α),    (20)
from which we can compute the minimum energy. Again,
note that f_s* and f_p* are independent of s and that the
ratio between them is N^(1/α). An interesting observation is
that at f_s* and f_p*, the dynamic energy is given by

    E_dynamic = t* · (f_s*)^α + N · (y* − t*) · (f_p*)^α = (1/(α − 1)) · β · N · y*,    (21)

which is equal to 1/(α − 1) of the static energy,
E_static = β · N · y*. In other words, the total energy
consumption is minimized when the dynamic energy consumption
is 1/(α − 1) times the static energy. This relation holds
during the execution of both the serial and the parallel
sections of the program.
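The invariant in (21), dynamic energy equal to 1/(α − 1) times the static energy at the optimum, can be verified numerically. A sketch under the feasibility condition β ≤ (α − 1)/N (function name is ours):

```python
# Unrestricted-x optimum for M_A, per (17)-(20), with a numeric check of the
# dynamic/static energy ratio stated in (21).

def optimal_unrestricted(s, N, alpha, beta):
    p = 1 - s
    f_s = (beta * N / (alpha - 1)) ** (1 / alpha)   # (19)
    f_p = f_s / N ** (1 / alpha)                    # (20)
    t = s / f_s                                     # (17), via (3)
    y = t + p / (N * f_p)                           # (18), via (4)
    e_dyn = t * f_s ** alpha + N * (y - t) * f_p ** alpha
    e_static = beta * N * y
    return f_s, f_p, y, e_dyn, e_static
```

For α = 3, N = 4, and β = 0.2 (so β < (α − 1)/N = 0.5), both frequencies come out below F_max = 1 and the dynamic energy equals exactly half the static energy, as (21) predicts for α = 3.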
The above solution is only applicable if both f_s* and f_p*
are smaller than F_max, however, necessitating that
β ≤ (α − 1)/N. If β, the ratio between the static and
dynamic power, is large, so that it is not possible to
maintain the aforementioned relation between the static and
Fig. 3. Dynamic energy consumption versus serial time for two cases,
s = 0.25 and s = 0.5, when N = 4. The bound of t is marked with “X”
(when f_s = F_max = 1) and “O” (when f_p = F_max = 1). The minimum
energy point in each curve (at t = t*) is marked with a filled rectangle.

dynamic energy, we should set f_s = 1 and find the values of
y and f_p that minimize the total energy consumption.
Denoting these values by y** and f_p**, we obtain

    y** = s + (p/N) · ((α − 1)/β)^(1/α),    (22)

    f_p** = (β/(α − 1))^(1/α).    (23)
Again, these values result in the dynamic power consumption
being 1/(α − 1) times the static power consumption during
the execution of the parallel portion of the program.
In order to summarize the relationship between β and the
speedup that results in minimum energy consumption, we
show that relationship in Fig. 4. In this figure, the values
of β are divided into three regions. When β ≤ (α − 1)/N, the
solution for the optimum energy consumption problem is
given by (18), (19), and (20). When (α − 1)/N < β ≤ (α − 1),
the solution is given by f_s = 1, (22), and (23). Finally,
when β > (α − 1), the solution is given by f_s = f_p = 1, and
the speedup is that given by Amdahl’s law (1).
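The three β regions can be collected into a single dispatch. A sketch (the function name is ours):

```python
# Energy-optimal (f_s, f_p) for machine model M_A, by beta region.

def optimal_speeds(s, N, alpha, beta):
    if beta <= (alpha - 1) / N:
        f_s = (beta * N / (alpha - 1)) ** (1 / alpha)   # (19)
        f_p = f_s / N ** (1 / alpha)                    # (20)
    elif beta <= alpha - 1:
        f_s = 1.0                                       # serial section at full speed
        f_p = (beta / (alpha - 1)) ** (1 / alpha)       # (23)
    else:
        f_s = f_p = 1.0                                 # speedup given by Amdahl's law
    return f_s, f_p
```

A quick check confirms that the solution is continuous across the region boundaries: at β = (α − 1)/N, the first branch already yields f_s = 1, matching the second branch.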
3.3 Optimal Energy Consumption Given a Speedup
We have thus far considered the problem of calculating the
optimal speeds of processors (hence, program speedup) to
minimize the total energy consumption given p, β, and N.
In this section, we consider the problem of how to set the
speeds of the processors (f_s and f_p) to minimize the total
energy consumption when the target program speedup x
(or, equivalently, y) is specified. Because the static energy
β · N · y is immediately determined given x, we only need to
minimize the dynamic energy while meeting the program
speedup requirement, and our solution derived from (3), (4),
(18), (19), and (20) is as follows:

    If x ≤ 1/(s + p/N^((α−1)/α)):   f_s = x · f_(s,x=1),   f_p = x · f_(p,x=1);    (24)

    if 1/(s + p/N^((α−1)/α)) < x ≤ 1/(s + p/N):   f_s = 1,   f_p = p · x / (N · (1 − s · x)),    (25)

where f_(s,x=1) and f_(p,x=1) are the optimal frequencies
when x = 1, as given in (11) and (12), respectively. We call
the interval in (24) the linear frequency scaling interval
because the energy-optimal f_s and f_p can be obtained by
simply scaling f_(s,x=1) and f_(p,x=1) by a factor of x. We
also note that the upper bound of the condition in (24) is,
in fact, the optimal speedup obtained in Section 3.2 when
β = (α − 1)/N.
Fig. 5 shows how the minimum energy consumption changes as we target a different program speedup, along with the contribution of the dynamic and static energy consumption. It is noticeable from the plot that the dynamic energy of the sequential region saturates at around x = 2.3. This is due to the inability to scale f_s beyond F_max. Finally, when f_s = f_p = 1 (i.e., at the maximum speedup), the dynamic energy is 1; it is the same as that of sequential execution.
We point out that static power consumption plays an important role in determining the minimum energy consumption of an application. Fig. 6 depicts the improvement ratio of the minimum energy at different program speedups, relative to the baseline sequential execution of a given application. The plot clearly demonstrates that a smaller ρ leads to a larger energy improvement ratio at any selected program speedup. Moreover, the largest energy improvement ratio occurs at a smaller program speedup. In other words, if ρ is small, one can slow down processor speeds further to benefit from reducing dynamic energy to a greater degree before static energy starts to offset the benefit.
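The effect of ρ on the achievable improvement can be reproduced numerically. The sketch below reuses our reconstructed frequency schedule from (24)-(25) and compares the minimum total energy at each target speedup against the sequential machine's energy 1 + ρ; the parameter values (s = 0.25, N = 4, α = 3) are illustrative and need not match those behind Fig. 6.

```python
def total_energy(s, N, alpha, rho, x):
    """Minimum total (dynamic + static) energy at target speedup x."""
    p = 1.0 - s
    f_s1 = s + p / N ** ((alpha - 1.0) / alpha)
    if x <= 1.0 / f_s1:                       # linear scaling interval, eq. (24)
        f_s, f_p = x * f_s1, x * f_s1 / N ** (1.0 / alpha)
    else:                                     # eq. (25), f_s pinned at F_max
        f_s, f_p = 1.0, p * x / (N * (1.0 - s * x))
    dyn = s * f_s ** (alpha - 1) + p * f_p ** (alpha - 1)
    return dyn + N * rho / x                  # static energy N*rho*y, with y = 1/x

def best_improvement(rho, s=0.25, N=4, alpha=3):
    """Largest energy ratio (1 + rho)/E(x) over a grid of speedups, and its x."""
    xs = [i / 1000.0 for i in range(500, 2280)]   # stays below Amdahl's bound
    ratios = [(1 + rho) / total_energy(s, N, alpha, rho, x) for x in xs]
    i_best = max(range(len(xs)), key=ratios.__getitem__)
    return ratios[i_best], xs[i_best]

r_small, x_small = best_improvement(rho=0.1)
r_large, x_large = best_improvement(rho=0.5)
assert r_small > r_large      # smaller rho -> larger energy improvement ratio
assert x_small < x_large      # ...achieved at a smaller program speedup
```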
346 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 3, MARCH 2010
Fig. 4. ρ changes the speedup of a program when its energy consumption is minimized. We assume that α = 3.
Fig. 5. Optimal energy at the program speedup of x when α = 3. The thick dotted line shows the sequential machine's energy consumption (1 + ρ).
Fig. 6. Energy improvement at different speedups over sequential
execution.
Citations

Proceedings ArticleDOI
04 Dec 2010
Abstract: To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also diminish given the scarcity of projected bandwidth in the future. To understand the relative merits between different approaches in the face of technology constraints, this work builds on prior modeling of heterogeneous multicores to support U-cores. Unlike prior models that trade performance, power, and area using well-known relationships between simple and complex processors, our model must consider the less-obvious relationships between conventional processors and a diverse set of U-cores. Further, our model supports speculation of future designs from scaling trends predicted by the ITRS road map. The predictive power of our model depends upon U-core-specific parameters derived by measuring performance and power of tuned applications on today's state-of-the-art multicores, GPUs, FPGAs, and ASICs. Our results reinforce some current-day understandings of the potential and limitations of U-cores and also provides new insights on their relative merits.

242 citations


Proceedings ArticleDOI
27 Feb 2011
TL;DR: A new FPGA memory architecture called Connected RAM (CoRAM) is proposed to serve as a portable bridge between the distributed computation kernels and the external memory interfaces to improve performance and efficiency and to improve an application's portability and scalability.
Abstract: FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their promise in both processing performance and efficiency, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing today is the FPGA's lack of a common, scalable memory architecture. When developing applications for FPGAs, designers are often directly responsible for crafting the application-specific infrastructure logic that manages and transports data to and from the processing kernels. This infrastructure not only increases design time and effort but will frequently lock a design to a particular FPGA product line, hindering scalability and portability. We propose a new FPGA memory architecture called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the external memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment as seen by the hardware kernels to simplify development and to improve an application's portability and scalability.

151 citations


Journal ArticleDOI
TL;DR: How the energy-cognizant scheduler's role has been extended beyond simple energy minimization to also include related issues like the avoidance of negative thermal effects as well as addressing asymmetric multicore architectures is explored.
Abstract: Execution time is no longer the only metric by which computational systems are judged. In fact, explicitly sacrificing raw performance in exchange for energy savings is becoming a common trend in environments ranging from large server farms attempting to minimize cooling costs to mobile devices trying to prolong battery life. Hardware designers, well aware of these trends, include capabilities like DVFS (to throttle core frequency) into almost all modern systems. However, hardware capabilities on their own are insufficient and must be paired with other logic to decide if, when, and by how much to apply energy-minimizing techniques while still meeting performance goals. One obvious choice is to place this logic into the OS scheduler. This choice is particularly attractive due to the relative simplicity, low cost, and low risk associated with modifying only the scheduler part of the OS. Herein we survey the vast field of research on energy-cognizant schedulers. We discuss scheduling techniques to perform energy-efficient computation. We further explore how the energy-cognizant scheduler's role has been extended beyond simple energy minimization to also include related issues like the avoidance of negative thermal effects as well as addressing asymmetric multicore architectures.

150 citations


Cites background from "On the Interplay of Parallelization..."

  • ...Cho and Melhem [26], [27] derive analytical models to study the potential of DPM and DVFS to reduce energy consumption for parallelizable applications on multicore systems....



Journal ArticleDOI
TL;DR: Investigation of energy-efficient scheduling of sequential tasks with precedence constraints on multiprocessor computers with dynamically variable voltage and speed makes initial contribution to analytical performance study of heuristic power allocation and scheduling algorithms for precedence constrained sequential tasks.
Abstract: Energy-efficient scheduling of sequential tasks with precedence constraints on multiprocessor computers with dynamically variable voltage and speed is investigated as combinatorial optimization problems. In particular, the problem of minimizing schedule length with energy consumption constraint and the problem of minimizing energy consumption with schedule length constraint are considered. Our scheduling problems contain three nontrivial subproblems, namely, precedence constraining, task scheduling, and power supplying. Each subproblem should be solved efficiently so that heuristic algorithms with overall good performance can be developed. Such decomposition of our optimization problems into three subproblems makes design and analysis of heuristic algorithms tractable. Three types of heuristic power allocation and scheduling algorithms are proposed for precedence constrained sequential tasks with energy and time constraints, namely, prepower-determination algorithms, postpower-determination algorithms, and hybrid algorithms. The performance of our algorithms are analyzed and compared with optimal schedules analytically. Such analysis has not been conducted in the literature for any algorithm. Therefore, our investigation in this paper makes initial contribution to analytical performance study of heuristic power allocation and scheduling algorithms for precedence constrained sequential tasks. Our extensive simulation data demonstrate that for wide task graphs, the performance ratios of all our heuristic algorithms approach one as the number of tasks increases.

74 citations


Journal ArticleDOI
TL;DR: Which application/platform characteristics are necessary for a successful energy-performance trade-off of large scale parallel applications and how cluster power consumption characteristics together with application sensitivity to frequency scaling determine the energy effectiveness of the DVFS technique is analyzed.
Abstract: DVFS is a ubiquitous technique for CPU power management in modern computing systems. Reducing processor frequency/voltage leads to a decrease of CPU power consumption and an increase in the execution time. In this paper, we analyze which application/platform characteristics are necessary for a successful energy-performance trade-off of large scale parallel applications. We present a model that gives an upper bound on performance loss due to frequency scaling using the application parallel efficiency. The model was validated with performance measurements of large scale parallel applications. Then we track how application sensitivity to frequency scaling evolved over the last decade for different cluster generations. Finally, we study how cluster power consumption characteristics together with application sensitivity to frequency scaling determine the energy effectiveness of the DVFS technique.

74 citations


References

Book
01 Dec 1989
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
Abstract: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today. In this edition, the authors bring their trademark method of quantitative analysis not only to high-performance desktop machine design, but also to the design of embedded and server systems. They have illustrated their principles with designs from all three of these domains, including examples from consumer electronics, multimedia and Web technologies, and high-performance computing.

11,485 citations


"On the Interplay of Parallelization..." refers background in this paper

  • ...Amdahl’s law [3], being of a very simple form, has inspired much work in the domain of computer architecture and parallel processing [1], [12]....



Proceedings ArticleDOI
Gene Myron Amdahl1
18 Apr 1967
Abstract: For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.

3,559 citations


18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. 
• To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

2,228 citations


"On the Interplay of Parallelization..." refers background in this paper

  • ...The impact of static power is revealed by considering the ratio between static and dynamic power and quantifying the advantages of adding to the architecture capability to turn off individual processors and save static energy....



Proceedings ArticleDOI
23 Oct 1995
TL;DR: This paper proposes a simple model of job scheduling aimed at capturing some key aspects of energy minimization, and gives an off-line algorithm that computes, for any set of jobs, a minimum-energy schedule.
Abstract: The energy usage of computer systems is becoming an important consideration, especially for battery-operated systems. Various methods for reducing energy consumption have been investigated, both at the circuit level and at the operating systems level. In this paper, we propose a simple model of job scheduling aimed at capturing some key aspects of energy minimization. In this model, each job is to be executed between its arrival time and deadline by a single processor with variable speed, under the assumption that energy usage per unit time, P, is a convex function, of the processor speed s. We give an off-line algorithm that computes, for any set of jobs, a minimum-energy schedule. We then consider some on-line algorithms and their competitive performance for the power function P(s)=s/sup p/ where p/spl ges/2. It is shown that one natural heuristic, called the Average Rate heuristic, uses at most a constant times the minimum energy required. The analysis involves bounding the largest eigenvalue in matrices of a special type.

1,497 citations


"On the Interplay of Parallelization..." refers background in this paper

  • ...Energy saving techniques that utilize available timing slack with DVFS have been extensively studied, especially in the domain of real-time task scheduling [18], [25], [29], [35], [36]....



Journal ArticleDOI
TL;DR: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores.
Abstract: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster.

1,182 citations


"On the Interplay of Parallelization..." refers background or methods in this paper

  • ...Using Amdahl’s law, recently, Hill and Marty [ 13 ] looked into the trade-off between processor core types and sizes in a multicore processor, parallelism in applications, and processor performance (i.e., large, high-performance, power-hungry cores versus small, low-performance, low-power cores)....


  • ...Woo and Lee [34] extended [ 13 ] to consider energy efficiency of processor architectures using different core types and confirmed that a heterogeneous architecture with a full-blown core along with many small, power-efficient cores is a viable alternative to homogeneous many-core architectures....



Frequently Asked Questions (2)
Q1. What are the contributions mentioned in the paper "On the interplay of parallelization, program performance, and energy consumption"?

This paper derives simple, yet fundamental formulas to describe the interplay between parallelism of an application, program performance, and energy consumption. The authors further determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. While the formulas the authors obtain use simplifying assumptions, they provide valuable theoretical insights into energy-aware processor resource management.

In this paper, the authors developed an analytical framework to study the trade-offs between parallelization, program performance, and energy consumption. The authors considered two machine models: one assumes that individual processors cannot be turned off independently, and the other assumes that they can. When processors can be individually turned off, the analysis indicates that the minimum total energy is independent of the number of processors used for executing the parallel section, while the energy-delay product is minimized when the maximum number of available processors is used during the parallel execution section. The demonstrated substantial power advantage that can be gained from turning off individual processors is a great incentive to design multicore processors with the capability of turning off individual processors.